The Configuration > Monitoring Collectors page shows details on the health of both individual Collector nodes and each Collector Cluster.
- The ONLINE/OFFLINE status directly relates to the processing of the cluster-health-queue shown in the orchestrator's output of `/opt/opsview/messagequeue/sbin/rabbitmqctl list_queues`. If you see a build-up here, the latest statuses will not be shown, and this queue will need to be cleared before they are displayed.
- To remove a backlog of messages on this queue, replace `list_queues` in the above command with `purge_queue cluster-health-queue`
- If the queue is not purging, stop the `opsview-scheduler` and `opsview-orchestrator` components, purge the queue, and then start those two components again afterward
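The check-then-purge decision above can be scripted. The sketch below defines a hypothetical helper, `queue_depth`, which is not part of Opsview; it simply parses `rabbitmqctl list_queues` output (queue name and message count per line) so you can test for a backlog before purging. The `rabbitmqctl` path and the `cluster-health-queue` name are taken from this article; verify them against your installation.

```shell
#!/bin/sh
# Hypothetical helper: read `rabbitmqctl list_queues` output on stdin
# (one "<queue-name> <message-count>" pair per line) and print the
# message count for the named queue.
queue_depth() {
  awk -v q="$1" '$1 == q { print $2 }'
}

# Real usage on the orchestrator might look like (untested sketch):
#   depth=$(/opt/opsview/messagequeue/sbin/rabbitmqctl list_queues \
#             | queue_depth cluster-health-queue)
#   if [ "${depth:-0}" -gt 0 ]; then
#     /opt/opsview/messagequeue/sbin/rabbitmqctl purge_queue cluster-health-queue
#   fi
```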
Please see the Collector Offline document for initial troubleshooting steps when investigating why a collector cluster may be offline.
Clusters Tab
The Status column shows the current state of the cluster. Possible values are:
- ONLINE - Cluster is running normally
- DEGRADED - Cluster has some issues. Hover over the status to get a list of alarms
- OFFLINE - Cluster has not responded within a set period, so is assumed to be offline
Cluster Health Alarms
The table below describes the possible alarms that will be shown when users hover over the status of a DEGRADED cluster. These alarms refer to conditions of the following Opsview components:
- opsview-schedulers
- opsview-executors
- opsview-results-sender
Alarms | Description | Suggestions / Actions
---|---|---
All [Component Name] components are unavailable, e.g. All opsview-executor components are unavailable | The master/orchestrator server cannot communicate with any [Component Name] components on the collector cluster. This may be because of a network/communications issue, or because no [Component Name] components are running on the cluster. Note: this alarm only triggers when all [Component Name] components on the collector cluster are unavailable, since a cluster may be configured to run these components on only a subset of the collectors. Furthermore, the cluster may be able to continue monitoring with some (though not all) of the [Component Name] components stopped. | To resolve this, ensure that the master/orchestrator server can communicate with the collector cluster (i.e. resolve any network issues) and that at least one instance of the component is running, e.g. SSH to a collector and run /opt/opsview/watchdog/bin/opsview-monit start [Component Name]
Not enough messages received ([Component Name 1] → [Component Name 2]): [Time Period] [Percentage Messages Received]%, e.g. Not enough messages received (opsview-scheduler → opsview-executor): [15m] 0% | Less than 70% of the messages sent by [Component Name 1] have been received by [Component Name 2] within the time period. This could indicate a communication problem between the components on the collector cluster, or that [Component Name 2] is overloaded and is struggling to process the messages it receives in a timely fashion, e.g. 0% of messages sent by the scheduler have been received by the executor within a 15-minute period. | If 0% of the messages sent have been received by [Component Name 2] and no other alarms are present, this may imply a communications failure on the cluster. To resolve this, ensure that the collectors in the cluster can all communicate on all ports (see https://knowledge.opsview.com/docs/ports#collector-clusters) and that opsview-messagequeue is running on all the collectors without errors. Alternatively, this may indicate that not all the required components are running on the collectors in the cluster. Run /opt/opsview/watchdog/bin/opsview-monit summary on each collector to check that all the components are in a running state; if any are stopped, run /opt/opsview/watchdog/bin/opsview-monit start [component name] to start them. If more than 0% of the messages sent have been received by [Component Name 2], this likely implies a performance issue in the cluster. To address this, either reduce the load on the cluster (reduce the number of objects monitored by that cluster, reduce the number of checks performed on each object by removing host templates/service checks, or increase the check interval for monitored hosts) or increase the resources in the cluster (add additional collectors to the cluster, or improve the hardware/resources of each collector, i.e. investigate the bottleneck by inspecting self-monitoring statistics and allocate additional CPU/memory as needed).
Note: For a fresh collector/cluster which has just been set up or which has minimal activity, the "Not enough messages received" alarm will be suppressed to avoid unnecessary admin/user concern. This does not affect the "All [Component Name] components are unavailable" alarm, which will still be raised for an offline collector.
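The "check each collector, then start anything stopped" step from the table can be sketched as a small parser. The helper below, `stopped_components`, is hypothetical and assumes `opsview-monit summary` prints one `Process 'name'  status` line per component (verify the exact format against your version); it prints a start command for each component that is not in the Running state.

```shell
#!/bin/sh
# Assumed input format (one line per component, as commonly printed by
# monit-style summaries -- verify against your opsview-monit version):
#   Process 'opsview-executor'              Running
# Prints a start command for every component whose status is not "Running".
stopped_components() {
  awk -F"'" '/^Process/ {
    name = $2
    status = $3
    sub(/^[[:space:]]+/, "", status)   # trim padding before the status
    if (status != "Running")
      print "/opt/opsview/watchdog/bin/opsview-monit start " name
  }'
}

# Real usage on a collector might look like (untested sketch):
#   /opt/opsview/watchdog/bin/opsview-monit summary | stopped_components
```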