If the check_distribution service reports an error, or if "mon node status" shows expired checks, there are several possible causes.
Example of service error:
ERROR: There are 11 expired checks
Possible cause: The responsible node is experiencing high load
Before doing anything else, use tools such as top, htop or iotop to make sure the checks aren't expiring simply because the assigned node cannot keep up with the number of checks it has been assigned. If the node is experiencing high load, consider adding more capacity or redistributing your checks across your cluster. If you need assistance with this, please contact support.
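As a quick first look before opening top or htop, the load average can be compared with the number of CPU cores. The following is a minimal sketch, not part of the product: the function name load_status is hypothetical, and treating "load above core count" as high is only a common rule of thumb, so adjust it to your environment.

```shell
# Hypothetical helper: classify a load average relative to the CPU count.
# Arguments: $1 = load average (for example the 1-minute value), $2 = number of cores.
# Prints "high" when the load exceeds the core count, otherwise "ok".
# The threshold is an assumed rule of thumb, not an OP5 recommendation.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "high"; else print "ok"
    }'
}

# Example invocation on a Linux node, reading the live values:
#   load_status "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

A result of "high" only suggests that the node may be overloaded; iotop and the per-process view in top are still needed to see what is consuming the resources.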
Possible cause: active checks may be disabled on some nodes
When running a distributed solution, expired checks can result from diverging values of the "active_checks_enabled" attribute. The configuration should be unified across the cluster, but there are also runtime settings that decide whether checks should run, and these can diverge between nodes. To investigate this and find out which services have diverging settings, download the script attached at the bottom of this article: mon_node_output_parse_diff.pl
(Author: Jonatan Sundeen)
Usage:
perl mon_node_output_parse_diff.pl
Usage: my-program <input-file-name>

Use case to check services active_checks_enabled
Get data for service checks
# mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt
Parse data with script
perl mon_node_output_parse_diff.pl mon_node_services.txt

Use case to check hosts active_checks_enabled
Get data for host checks
# mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt
Parse data with script
perl mon_node_output_parse_diff.pl mon_node_hosts.txt
The script prints the checks that differ and the hosts on which they differ. It requires an input file ("mon_node_services.txt" and "mon_node_hosts.txt" are example names) containing the output of the corresponding "mon node ctrl" command.
If there is no output, the settings do not differ between nodes.
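To illustrate the kind of comparison involved, here is a minimal sketch that is not the attached Perl script. It assumes you have run the "mon query ls" command separately on two nodes and saved each node's output to its own file, with semicolon-separated fields and active_checks_enabled as the last field. The function name diff_active_checks and the file names in the usage comment are hypothetical.

```shell
# diff_active_checks FILE_A FILE_B
# Compares two files of semicolon-separated "mon query ls" output, one per node,
# and prints every object whose last field (active_checks_enabled) differs.
# Objects that appear in only one of the files are not reported.
diff_active_checks() {
    awk -F';' '
        {
            value = $NF                  # last field: active_checks_enabled
            key = $0
            sub(/;[^;]*$/, "", key)      # everything before the last field
        }
        FILENAME == ARGV[1] { first[key] = value; next }
        (key in first) && first[key] != value {
            print key ": " first[key] " vs " value
        }
    ' "$1" "$2"
}

# Example invocation (hypothetical file names):
#   diff_active_checks node_a_services.txt node_b_services.txt
```

As with the attached script, no output means that no differences were found between the two files.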
Example one-liner that collects service and host data from all nodes in your cluster and then runs the script on both resulting files. It expects the Perl script to be in /tmp and writes the collected data to /tmp/mon_node_services.txt and /tmp/mon_node_hosts.txt:
# cd /tmp && mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt && mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_services.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_hosts.txt
Review the services that have active checks disabled. Passive checks, including business services, should have active checks disabled.
The following command lists all services on the local node that have active checks disabled:
# mon query ls services -c host_name,description,active_checks_enabled | grep "0\$"
And this for the hosts (local node only):
# mon query ls hosts -c name,active_checks_enabled | grep "0\$"
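When the list of disabled service checks is long, a per-host summary can make it easier to spot where the discrepancies are concentrated. The following is a minimal sketch under the assumption that the input has the same semicolon-separated layout as the service query above; the function name count_disabled_per_host is hypothetical.

```shell
# count_disabled_per_host
# Reads "host_name;description;active_checks_enabled" lines on stdin and prints
# "host_name;count" for every host that has services with active checks disabled.
count_disabled_per_host() {
    awk -F';' '
        $NF == "0" { count[$1]++ }       # last field 0 means active checks disabled
        END { for (h in count) print h ";" count[h] }
    ' | sort
}

# Example invocation on the local node:
#   mon query ls services -c host_name,description,active_checks_enabled | count_disabled_per_host
```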
Inspecting and fixing the discrepancies manually may be a good idea, since it helps you understand what happened. If you would rather correct them automatically, the next step is to choose the host whose settings will act as master data for updating the others.
You can run the one-liners below on the chosen master server to propagate its settings to the other master and the pollers in the cluster via external commands. Each one-liner runs in the background and writes its log to the file indicated at the end of the command. To see the current status, "tail" that txt file.
Propagate settings for active service checks:
(mon query ls services -c host_name,description,active_checks_enabled | while IFS=';' read -r varHost varService varActive; do [ -n "$varService" ] || continue; case "$varActive" in 1) varSubmitCommand="ENABLE_SVC_CHECK";; 0) varSubmitCommand="DISABLE_SVC_CHECK";; *) continue;; esac; mon ecmd submit "$varSubmitCommand" "$varHost;$varService" < /dev/null; done) > propagate_active_service_checks.txt &
Propagate settings for active host checks:
(mon query ls hosts -c name,active_checks_enabled | while IFS=';' read -r varHost varActive; do case "$varActive" in 1) varSubmitCommand="ENABLE_HOST_CHECK";; 0) varSubmitCommand="DISABLE_HOST_CHECK";; *) continue;; esac; mon ecmd submit "$varSubmitCommand" "$varHost" < /dev/null; done) > propagate_active_host_checks.txt &
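Before propagating anything, it may be worth previewing which external commands would be submitted. The following is a minimal dry-run sketch for the service case. It only prints the commands and submits nothing; the function name preview_service_commands is hypothetical, and the input is assumed to have the same semicolon-separated layout as the service query used above.

```shell
# preview_service_commands
# Reads "host_name;description;active_checks_enabled" lines on stdin and prints
# the "mon ecmd submit" command that would be run for each service.
# Nothing is submitted; this is a dry run only.
preview_service_commands() {
    while IFS=';' read -r host service active; do
        [ -n "$service" ] || continue            # skip lines without a service name
        case "$active" in
            1) cmd="ENABLE_SVC_CHECK" ;;
            0) cmd="DISABLE_SVC_CHECK" ;;
            *) continue ;;                       # skip unexpected values
        esac
        printf 'mon ecmd submit %s "%s;%s"\n' "$cmd" "$host" "$service"
    done
}

# Example invocation on the chosen master:
#   mon query ls services -c host_name,description,active_checks_enabled | preview_service_commands | less
```

Reviewing this output first shows how many checks would be enabled or disabled and lets you spot unexpected entries before any node is changed.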
Tags:
- KEDB
- check_distribution
- self-monitoring
- ERROR: There are
- expired checks
- active_checks_enabled
- OP5