If the check_distribution service reports an error, or if "mon node status" indicates expired checks, there are several possible causes.
Example of service error:
ERROR: There are 11 expired checks
Possible cause: The responsible node is experiencing high load
Before doing anything else, check the system load with a process monitoring tool such as top, htop, or iotop. This rules out the simplest explanation: that the node cannot keep up with the number of checks assigned to it. If the node is under high load, consider adding capacity or redistributing checks across the cluster; the OP5 documentation describes how to do this. If you need assistance, please contact support.
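As a quick first check, the load average can be compared with the number of CPU cores. The following is a minimal sketch, assuming a Linux node where /proc/loadavg and nproc are available. The function names and the rule of thumb (load above core count means "high") are illustrative assumptions, not an OP5 recommendation.

```shell
# Hypothetical helper: succeeds (exit status 0) when the load average
# is greater than the number of CPU cores.
# Usage: load_is_high <load_average> <cpu_cores>
load_is_high() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Hypothetical wrapper for a Linux node: reads the 1-minute load average
# from /proc/loadavg and compares it with the core count from nproc.
check_node_load() {
    read -r load1 _ < /proc/loadavg
    cores=$(nproc)
    if load_is_high "$load1" "$cores"; then
        echo "1-minute load $load1 exceeds $cores cores"
    else
        echo "1-minute load $load1 is within $cores cores"
    fi
}
```

Running check_node_load on each node gives a rough indication only; a tool such as top or iotop is still needed to see what is causing the load.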
Possible cause: active checks may be disabled on some nodes
In a distributed setup, expired checks can result from diverging values of the "active_checks_enabled" attribute. The configuration should be identical across the cluster, but runtime settings also decide whether checks run, and these can diverge between nodes. To investigate this as a possible cause, download and run the script attached at the bottom of this article: mon_node_output_parse_diff.pl
(Author: Jonatan Sundeen)
Usage:
perl mon_node_output_parse_diff.pl
Usage: my-program <input-file-name>

Use case to check services active_checks_enabled
Get data for service checks
# mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt
Parse data with script
perl mon_node_output_parse_diff.pl mon_node_services.txt

Use case to check hosts active_checks_enabled
Get data for host checks
# mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt
Parse data with script
perl mon_node_output_parse_diff.pl mon_node_hosts.txt
The script takes one input file per run (named "mon_node_services.txt" and "mon_node_hosts.txt" in the following example), each containing the output of the corresponding "mon node ctrl" command. It outputs the checks whose settings differ and the hosts on which they differ.
If there is no output, the settings do not differ between nodes.
An example usage (one-liner) collects data on services and hosts across all nodes in your cluster, then runs the command for both of the txt files containing the collected data. In this example the perl script is expected to be in /tmp.
# cd /tmp && mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt && mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_services.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_hosts.txt
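To illustrate the kind of comparison involved, here is a minimal sketch that compares two per-node dumps directly. It is not the attached Perl script, whose internals are not shown in this article. It assumes you have run the local "mon query ls ..." command on two nodes and copied both output files to one place; the function name and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper, not the attached Perl script: compare two per-node
# dumps of "mon query ls services -c host_name,description,active_checks_enabled"
# (or the hosts equivalent). Each line is semicolon-separated and ends with
# the active_checks_enabled value.
# Usage: compare_dumps <dump_from_first_node> <dump_from_second_node>
compare_dumps() {
    awk -F';' '
        {
            value = $NF                  # last field: active_checks_enabled
            key = $0
            sub(/;[^;]*$/, "", key)      # everything before the last field
        }
        FILENAME == ARGV[1] { first[key] = value; next }
        (key in first) && first[key] != value {
            print key ": " first[key] " on first node, " value " on second node"
        }
    ' "$1" "$2"
}
```

For example, "compare_dumps node_a_services.txt node_b_services.txt" prints one line per check whose setting differs, and prints nothing when the two dumps agree. Checks present in only one of the files are ignored by this sketch.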
Review the services that have active checks disabled. Passive checks, including business services, should have active checks disabled.
The following command displays all service checks that have active checks disabled on the local node:
# mon query ls services -c host_name,description,active_checks_enabled | grep "0\$"
And this command for the hosts (local node only):
# mon query ls hosts -c name,active_checks_enabled | grep "0\$"
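The grep pattern above matches any line that ends in 0. A field-aware filter is slightly stricter, since it only looks at the last semicolon-separated column. The following is a minimal sketch; the function name list_disabled is hypothetical, and it assumes the semicolon-separated output format used by the commands in this article.

```shell
# Hypothetical helper: read "mon query ls ..." output on stdin and print only
# the rows whose last semicolon-separated field (active_checks_enabled) is
# exactly 0. Rows with a single field are skipped.
list_disabled() {
    awk -F';' 'NF > 1 && $NF == "0"'
}
```

It can be used in place of grep, for example: mon query ls services -c host_name,description,active_checks_enabled | list_disabled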
If you prefer to correct this automatically rather than inspecting and fixing the discrepancies manually (manual inspection may be a good idea in order to understand what happened), the next step is to choose a node whose settings will serve as the reference for updating the others.
Run the one-liner below on the chosen master to propagate its settings to the other masters and pollers in the cluster via external commands. The output is logged to the file named at the end of the command, and the loop runs in the background. To follow its progress, "tail" that txt file.
Propagate settings for active service checks:
(IFS=$'\n'; for line in $(mon query ls services -c host_name,description,active_checks_enabled); do varHost=$(echo "$line" | cut -d ";" -f1); varService=$(echo "$line" | cut -d ";" -f2); varActive=$(echo "$line" | cut -d ";" -f3); [ -n "$varService" ] || continue; case "$varActive" in 1) varSubmitCommand="ENABLE_SVC_CHECK";; 0) varSubmitCommand="DISABLE_SVC_CHECK";; *) continue;; esac; mon ecmd submit "$varSubmitCommand" "$varHost;$varService"; done) > propagate_active_service_checks.txt &
Propagate settings for active host checks:
(IFS=$'\n'; for line in $(mon query ls hosts -c name,active_checks_enabled); do varHost=$(echo "$line" | cut -d ";" -f1); varActive=$(echo "$line" | cut -d ";" -f2); case "$varActive" in 1) varSubmitCommand="ENABLE_HOST_CHECK";; 0) varSubmitCommand="DISABLE_HOST_CHECK";; *) continue;; esac; mon ecmd submit "$varSubmitCommand" "$varHost"; done) > propagate_active_host_checks.txt &
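Before propagating anything, it can help to preview which external commands would be submitted. The following is a minimal dry-run sketch for the service case. It only prints the commands and runs nothing; the function name preview_service_commands is hypothetical, and it assumes the same semicolon-separated input as the one-liners above.

```shell
# Hypothetical dry-run helper: read "host_name;description;active_checks_enabled"
# rows on stdin and print the command that would be submitted for each row,
# without running anything.
preview_service_commands() {
    while IFS=';' read -r host service active; do
        [ -n "$service" ] || continue        # skip rows without a service description
        case "$active" in
            1) cmd="ENABLE_SVC_CHECK" ;;
            0) cmd="DISABLE_SVC_CHECK" ;;
            *) continue ;;                   # skip anything that is not 0 or 1
        esac
        printf 'mon ecmd submit %s "%s;%s"\n' "$cmd" "$host" "$service"
    done
}
```

For example: mon query ls services -c host_name,description,active_checks_enabled | preview_service_commands | less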