You may see that your Collector or Collector Cluster is offline and not be sure of the cause.
The offline state is derived from the cluster health of your collector cluster, and it can have several different causes. The basic checks below help narrow down the cause before you contact Support.
Process: basic first checks for a collector cluster that is "OFFLINE"
1. Check whether your collectors are actually down or have a connectivity issue
- Are you able to SSH to them?
- Are all of the Opsview components reported as running?
/opt/opsview/watchdog/bin/opsview-monit summary -B
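Both parts of step 1 can be combined into one pass over several collectors. The following is a minimal sketch, not an Opsview tool: the function name check_collector and the hostnames are hypothetical, and it assumes key-based SSH access as a user who is allowed to run opsview-monit.

```shell
# check_collector HOST: try to run the watchdog summary on HOST over SSH.
# BatchMode=yes makes ssh fail instead of waiting at a password prompt;
# ConnectTimeout=5 gives up after five seconds if the host does not answer.
check_collector() {
    host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" \
        /opt/opsview/watchdog/bin/opsview-monit summary -B; then
        echo "$host: reachable"
    else
        echo "$host: SSH failed or opsview-monit returned an error"
    fi
}

# Usage (hostnames are placeholders, replace them with your own collectors):
# for h in collector1.example.com collector2.example.com; do
#     check_collector "$h"
# done
```

A host reported as failed still needs a manual look, since the sketch does not distinguish a network problem from a problem with the watchdog itself.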
2. Check the opsview-messagequeue "cluster-health-queue"
- The main queue to review is the cluster-health-queue. Ensure it is processing messages, so that its message count reaches zero or near zero
/opt/opsview/messagequeue/sbin/rabbitmqctl list_queues | awk '$2>0'
- If you receive an error when running the above command, check the cluster status on the collector(s) with the command below, which should return output similar to the lines that follow it
/opt/opsview/messagequeue/sbin/rabbitmqctl cluster_status
Cluster status of node email@example.com ...
Cluster name: firstname.lastname@example.org
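The queue listing in step 2 can also be filtered with an explicit threshold. The following is a minimal sketch that assumes each queue line has the queue name in the first column and the message count in the second. The function name busy_queues is illustrative. It only matches rows whose second column is a plain number, so any header lines are skipped.

```shell
# busy_queues [LIMIT]: read `rabbitmqctl list_queues` output on stdin and
# print every queue whose message count is above LIMIT (default 0).
busy_queues() {
    limit=${1:-0}
    awk -v limit="$limit" '
        # Only consider rows whose second column is a plain integer.
        $2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $1 ": " $2 }
    '
}

# Usage:
# /opt/opsview/messagequeue/sbin/rabbitmqctl list_queues | busy_queues 10
```

Running it a few times in a row shows whether the cluster-health-queue count is falling toward zero or growing.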
3. Opsview version: are the Opsview versions of your orchestrator and collector(s) in sync?
- This can be checked in the UI on the Monitoring Collectors page, under the “Version” column, by comparing your "Master Monitoring Server" with the collector(s) in question
- The Component Overview page also lists all components for an Opsview server
- You may also use the command line to check the installed packages on each server (rpm on RPM-based systems, dpkg on Debian-based systems):
rpm -qa | grep opsview | sort
dpkg -l | grep opsview
If the collector version shown in the UI does not match the version installed on the collector itself, please review "Why does my collector show the wrong Opsview version in the UI?"
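Comparing long package lists by eye is error-prone, so the command-line check in step 3 can be scripted. The following is a minimal sketch, assuming you have saved a "name version" listing from each server into a file. The function name compare_versions and the file names are hypothetical.

```shell
# compare_versions FILE_A FILE_B: each file holds one "name version" pair
# per line. Prints every package present in both files whose versions differ.
compare_versions() {
    awk '
        NR == FNR { version[$1] = $2; next }      # first file: remember versions
        ($1 in version) && version[$1] != $2 {    # second file: report mismatches
            print $1 ": " version[$1] " vs " $2
        }
    ' "$1" "$2"
}

# Producing the listings (run the matching line on each server):
# rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n' | grep opsview | sort > packages.txt
# dpkg-query -W -f '${Package} ${Version}\n' | grep opsview | sort > packages.txt
#
# Usage: compare_versions orchestrator-packages.txt collector-packages.txt
```

Packages that exist on only one server are ignored, because a collector normally does not carry every package that the orchestrator does.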
4. Check that the disks on your collector(s) are not full
df -h /
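The disk check can be widened to every mounted filesystem with a threshold. The following is a minimal sketch, assuming the POSIX output layout of df -P (capacity in the fifth column, mount point in the sixth). The function name full_filesystems and the 90 percent default are illustrative choices, not Opsview recommendations.

```shell
# full_filesystems [LIMIT]: read `df -P` output on stdin and print every
# mount point whose usage is at or above LIMIT percent (default 90).
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%/, "", use)          # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6 " is " $5 " full"
        }
    '
}

# Usage: df -P | full_filesystems 90
```

Mount points that contain spaces are not handled by this sketch, since it splits columns on whitespace.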
5. Use the top command to review the following:
- Load average: ensure the collector(s) involved are not overloaded, as a high load average may delay processing on the collector(s)
- CPU usage: ensure your server is not overloaded. You can also run lscpu to review the number of CPUs you have
- Available RAM: ensure your server has enough memory to function as expected, so that processes are not killed when memory runs out
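A common rule of thumb is to compare the load average with the number of CPUs. The following is a minimal sketch of that comparison. The function name load_status is hypothetical, and treating a load above the CPU count as overloaded is a heuristic rather than an Opsview threshold.

```shell
# load_status LOAD CPUS: compare a load average with the number of CPUs.
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (load + 0 > cpus + 0)
            print "overloaded: load " load " exceeds " cpus " CPUs"
        else
            print "ok: load " load " within " cpus " CPUs"
    }'
}

# Usage on Linux (1-minute load average and online CPU count):
# load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
# Memory can be reviewed alongside it with: free -m
```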
6. Ensure the clocks on all of your servers are in sync
- This covers the orchestrator and all of the collector(s) in question.
- If the time is not in sync, cluster-health-queue messages will be disregarded.
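Clock drift can be estimated by comparing epoch timestamps from two servers. The following is a minimal sketch. The function name clock_skew and the hostname are hypothetical, and the SSH round trip itself can add a second or so to the result.

```shell
# clock_skew EPOCH_A EPOCH_B: print the absolute difference, in seconds,
# between two Unix timestamps.
clock_skew() {
    diff_seconds=$(( $1 - $2 ))
    if [ "$diff_seconds" -lt 0 ]; then
        diff_seconds=$(( 0 - diff_seconds ))
    fi
    echo "$diff_seconds"
}

# Usage from the orchestrator (the hostname is a placeholder):
# clock_skew "$(date +%s)" "$(ssh collector1.example.com date +%s)"
```

A result of more than a few seconds suggests checking the time synchronisation service on the servers involved.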
7. Check the system logs on the orchestrator and on the collector server(s) for error messages that indicate refused communication, such as refused connections
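The log search can be done with a simple case-insensitive filter. The following is a minimal sketch: the function name refused_errors is hypothetical, and the log paths in the usage note are the usual system log locations on RPM-based and Debian-based systems, which may differ on your servers.

```shell
# refused_errors: read log lines on stdin and print those that mention
# "refused", ignoring case.
refused_errors() {
    grep -i 'refused'
}

# Usage (reads whichever of the two common system logs exists):
# for f in /var/log/messages /var/log/syslog; do
#     [ -r "$f" ] && refused_errors < "$f"
# done
```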