The Configuration > Monitoring Collectors page shows details on the health of both individual collector nodes and each cluster.
- The ONLINE/OFFLINE status depends directly on the processing of the cluster-health-queue, which is listed in the output of /opt/opsview/messagequeue/sbin/rabbitmqctl list_queues. If messages build up in this queue, the latest statuses will not be shown until the queue has been cleared.
- Check this on both your orchestrator and the collector cluster in question.
- This may be completed with a rabbitmqctl
- If the queue is not purging, then stop the
opsview-orchestrator components, purge the queue, and then start up those two components afterward.
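As a rough sketch of the queue check described above, the helper below reads the output of `rabbitmqctl list_queues name messages` on standard input and prints the message count for one queue. The function name `queue_depth` is a hypothetical helper and not part of Opsview; it also assumes the queue appears under the literal name `cluster-health-queue`, as the text above suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Opsview): print the number of messages
# waiting in the named queue, given `rabbitmqctl list_queues name messages`
# output on stdin. Prints nothing if the queue is not listed.
queue_depth() {
    # Match on the first column (queue name) and print the second (messages).
    awk -v q="$1" '$1 == q { print $2 }'
}

# Example usage on the orchestrator or on a collector, run as a user that is
# permitted to call rabbitmqctl (assumed queue name: cluster-health-queue):
#   /opt/opsview/messagequeue/sbin/rabbitmqctl list_queues name messages \
#       | queue_depth cluster-health-queue
```

A count that keeps growing between runs would suggest the build-up described above; a count that stays at or near zero suggests the queue is being processed.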
Please see the Collector Offline document for initial troubleshooting steps when investigating why a collector cluster may be offline.
The Status column shows the current state of the cluster. Possible values are:
- ONLINE - Cluster is running normally
- DEGRADED - Cluster has some issues. Hover over the status to get a list of alarms
- OFFLINE - Cluster has not responded within a set period, so is assumed to be offline
Cluster Health Alarms
The table below describes the possible alarms that will be shown when users hover over the status of a DEGRADED cluster. These alarms refer to conditions of the following Opsview components:
|Alarms||Description||Suggestions / Actions|
|All [Components Name] components are unavailable
e.g. All opsview-executor components are unavailable
|The master/orchestrator server cannot communicate with any [Components Name] components on the collector cluster. This may be because of a network/communications issue, or because no [Components Name] components are running on the cluster.
Note: this alarm only triggers when all [Components Name] components on the collector cluster are unavailable, since a cluster may be configured to only have these components running on a subset of the collectors. Furthermore, the cluster may be able to continue monitoring with some (though not all) of the [Components Name] components stopped.
|To resolve this, ensure that the master/orchestrator server can communicate with the collector cluster (i.e. resolve any network issues) and that at least one scheduler is running
e.g. SSH to collector and run
|Not enough messages received ([Components Name 1] → [Components Name 2]): [Time Period] [Percentage Messages Received]%.
e.g. Not enough messages received (opsview-scheduler → opsview-executor): [15m] 0%.
|Less than 70% of the messages sent by [Components Name 1] have been received by [Components Name 2] within the time period. This could indicate a communication problem between the components on the collector cluster, or that [Components Name 2] is overloaded and is struggling to process the messages it is receiving in a timely fashion.
e.g. 0% of messages sent by the scheduler have been received by the executor within a 15-minute period.
|If 0% of the messages sent have been received by [Components Name 2] and no other alarms are present, then this may imply a communications failure on the cluster. To resolve this, ensure that the collectors in the cluster can all communicate on all ports (see https://knowledge.opsview.com/docs/ports#collector-clusters) and that opsview-messagequeue is running on all the collectors without errors.
Alternatively, this may indicate that not all the required components are running on the collectors in the cluster. Please run /opt/opsview/watchdog/bin/opsview-monit summary on each collector to check that all the components are in a running state. If any are stopped, then run /opt/opsview/watchdog/bin/opsview-monit start [component name] to start them.
If more than 0% of the messages sent have been received by [Components Name 2], then this likely implies a performance issue in the cluster. To address this you can:
Reduce the load on the cluster e.g.
- Reduce the number of objects monitored by that cluster
- Reduce the number of checks being performed on each object in the cluster (i.e. remove host templates/service checks).
- Increase the check interval for monitored hosts
Increase the resources in the cluster
- Add additional collectors to the cluster
- Improve the hardware/resources of each collector in the cluster (i.e. investigate bottleneck by inspecting self-monitoring statistics and allocate additional CPU/memory resources as needed).
Note: For a fresh collector/cluster which has just been set up or which has minimal activity, the “Not enough messages received” alarm will be suppressed to avoid unnecessary admin/user concerns. This does not impact the "All [Components Name] components are unavailable" alarm, which will still be raised for an offline collector.
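The triage logic for the "Not enough messages received" alarm can be summarised in a small shell sketch. The function name `triage_messages_received` and the labels it prints are illustrative only; the 70% threshold and the distinction between 0% and more than 0% come from the table above.

```shell
#!/bin/sh
# Illustrative only: classify the whole-number percentage reported by a
# "Not enough messages received" alarm, following the guidance in the table.
triage_messages_received() {
    pct="$1"
    if [ "$pct" -ge 70 ]; then
        # The alarm is described as firing below 70%, so nothing to do.
        echo "ok"
    elif [ "$pct" -eq 0 ]; then
        # With no other alarms present, this may imply a communications
        # failure: check ports, opsview-messagequeue and stopped components.
        echo "communications"
    else
        # Some messages arrive: likely a performance issue, so reduce the
        # load on the cluster or increase its resources.
        echo "performance"
    fi
}

triage_messages_received 0    # prints "communications"
triage_messages_received 35   # prints "performance"
```

This is only a reading aid for the table; the alarm text shown when hovering over the DEGRADED status remains the source of the percentage.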