The Configuration > Monitoring Collectors page shows details on the health of both individual Collector nodes and each Collector Cluster.
- The ONLINE/OFFLINE status directly relates to the processing of the cluster-health-queue shown in the orchestrator's output of `/opt/opsview/messagequeue/sbin/rabbitmqctl list_queues`. If you see a build-up here, the latest statuses will not be shown, and this queue will need to be cleared before they are displayed.
- To remove a backlog of messages on this queue, replace `list_queues` in the above command with `purge_queue cluster-health-queue`
- If the queue is not purging, stop the `opsview-scheduler` and `opsview-orchestrator` components, purge the queue, and then start those two components again afterward
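The check-then-purge decision above can be scripted. The sketch below defines a hypothetical helper, `queue_depth`, which is not part of Opsview; it simply parses `rabbitmqctl list_queues` output (queue name and message count per line) so you can test for a backlog before purging. The `rabbitmqctl` path and the `cluster-health-queue` name are taken from this article; verify them against your installation.

```shell
#!/bin/sh
# Hypothetical helper: read `rabbitmqctl list_queues` output on stdin
# (one "<queue-name> <message-count>" pair per line) and print the
# message count for the named queue.
queue_depth() {
  awk -v q="$1" '$1 == q { print $2 }'
}

# Real usage on the orchestrator might look like (untested sketch):
#   depth=$(/opt/opsview/messagequeue/sbin/rabbitmqctl list_queues \
#             | queue_depth cluster-health-queue)
#   if [ "${depth:-0}" -gt 0 ]; then
#     /opt/opsview/messagequeue/sbin/rabbitmqctl purge_queue cluster-health-queue
#   fi
```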
Please see the Collector Offline document for initial troubleshooting steps when investigating why a collector cluster may be offline.
Clusters Tab
The Status column shows the current state of the cluster. Possible values are:
- ONLINE - Cluster is running normally
- DEGRADED - Cluster has some issues. Hover over the status to get a list of alarms
- OFFLINE - Cluster has not responded within a set period, so is assumed to be offline
Cluster Health Alarms
The table below describes the possible alarms that will be shown when users hover over the status of a DEGRADED cluster. These alarms refer to conditions of the following Opsview components:
- opsview-schedulers
- opsview-executors
- opsview-results-sender
Alarms | Description | Suggestions / Actions
---|---|---
All [Component Name] components are unavailable, e.g. All opsview-executor components are unavailable | The master/orchestrator server cannot communicate with any [Component Name] components on the collector cluster. This may be because of a network/communications issue, or because no [Component Name] components are running on the cluster. Note: this alarm only triggers when all [Component Name] components on the collector cluster are unavailable, since a cluster may be configured to run these components on only a subset of the collectors. Furthermore, the cluster may be able to continue monitoring with some (though not all) of the [Component Name] components stopped. | To resolve this, ensure that the master/orchestrator server can communicate with the collector cluster (i.e. resolve any network issues) and that at least one instance of the component is running, e.g. SSH to a collector and run /opt/opsview/watchdog/bin/opsview-monit start [Component Name]
Not enough messages received ([Component Name 1] → [Component Name 2]): [Time Period] [Percentage Messages Received]%, e.g. Not enough messages received (opsview-scheduler → opsview-executor): [15m] 0% | Less than 70% of the messages sent by [Component Name 1] have been received by [Component Name 2] within the time period. This could indicate a communication problem between the components on the collector cluster, or that [Component Name 2] is overloaded and is struggling to process the messages it receives in a timely fashion, e.g. 0% of messages sent by the scheduler have been received by the executor within a 15-minute period. | If 0% of the messages sent have been received by [Component Name 2] and no other alarms are present, this may imply a communications failure on the cluster. To resolve this, ensure that the collectors in the cluster can all communicate on all ports (see https://knowledge.opsview.com/docs/ports#collector-clusters) and that opsview-messagequeue is running on all the collectors without errors. Alternatively, this may indicate that not all the required components are running on the collectors in the cluster. Run /opt/opsview/watchdog/bin/opsview-monit summary on each collector to check that all the components are in a running state; if any are stopped, run /opt/opsview/watchdog/bin/opsview-monit start [component name] to start them. If more than 0% of the messages sent have been received by [Component Name 2], this likely implies a performance issue in the cluster. To address this, either reduce the load on the cluster (reduce the number of objects monitored by that cluster, reduce the number of checks performed on each object by removing host templates/service checks, or increase the check interval for monitored hosts) or increase the resources in the cluster (add additional collectors to the cluster, or improve the hardware/resources of each collector, i.e. investigate the bottleneck by inspecting self-monitoring statistics and allocate additional CPU/memory as needed).
Note: For a fresh collector/cluster which has just been set up or which has minimal activity, the "Not enough messages received" alarm will be suppressed to avoid unnecessary admin/user concern. This does not affect the "All [Component Name] components are unavailable" alarm, which will still be raised for an offline collector.
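The "check each collector, then start anything stopped" step from the table can be sketched as a small parser. The helper below, `stopped_components`, is hypothetical and assumes `opsview-monit summary` prints one `Process 'name'  status` line per component (verify the exact format against your version); it prints a start command for each component that is not in the Running state.

```shell
#!/bin/sh
# Assumed input format (one line per component, as commonly printed by
# monit-style summaries -- verify against your opsview-monit version):
#   Process 'opsview-executor'              Running
# Prints a start command for every component whose status is not "Running".
stopped_components() {
  awk -F"'" '/^Process/ {
    name = $2
    status = $3
    sub(/^[[:space:]]+/, "", status)   # trim padding before the status
    if (status != "Running")
      print "/opt/opsview/watchdog/bin/opsview-monit start " name
  }'
}

# Real usage on a collector might look like (untested sketch):
#   /opt/opsview/watchdog/bin/opsview-monit summary | stopped_components
```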