Related to:
mon node status shows the status of one or more pollers and/or peers as inactive or disconnected.
Problem
- Pollers or peers are not connected or are in an inactive state.
Possible Cause(s)
- Different OP5 versions within the cluster.
- SSH keys not propagated within the cluster.
- SSH keys have changed or are no longer valid.
- Invalid Naemon configuration.
ist diagnose can quickly diagnose most aspects of the above-mentioned errors.
Possible Solution(s)
Basic troubleshooting
Try a mon restart first on all nodes. If a restart does not fix the issue, proceed with the other steps.
Ensure that the nodes are able to communicate with each other. Tools such as ssh, ping, or nc can verify whether a connection can be established.
An example using nc is shown below. Merlin runs on port 15551 by default.
[root@mon9-mas01 ~]# nc -zv mon9-mas02peer 15551
Connection to mc-rocky-mon9-mas02peer (xx.xx.xx.xx) 15551 port [tcp/*] succeeded!
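To check several nodes in one pass, the nc test can be wrapped in a small loop. This is a minimal sketch, not an official OP5 tool: the function name check_merlin_port and the MERLIN_PORT variable are made up for illustration, and it assumes an nc build that supports the -z and -w options.

```shell
#!/bin/sh
# Sketch: report whether the Merlin port answers on each node given as an argument.
# check_merlin_port and MERLIN_PORT are illustrative names, not part of OP5.
check_merlin_port() {
    port="${MERLIN_PORT:-15551}"   # 15551 is the Merlin default port
    for node in "$@"; do
        # -z: only test the connection, -w 3: give up after 3 seconds
        if nc -z -w 3 "$node" "$port" >/dev/null 2>&1; then
            echo "$node: port $port reachable"
        else
            echo "$node: port $port NOT reachable"
        fi
    done
}

# Example call (replace with your own node names):
# check_merlin_port mon9-mas02peer
```

Run it from each node in turn, since a firewall may allow the connection in one direction only.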
Verify OS and OP5 versions
Clustering in OP5 requires the same OS and OP5 versions. Run this command on all devices, and make sure that all devices are running the same version:
cat /etc/op5-monitor-release
Compare the output from each device. If there are differences, bring all devices to the same version.
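With more than a couple of nodes, comparing the files by eye gets tedious. The sketch below compares a checksum of the release file from each node instead. The helper name all_same is made up for illustration, and the commented example assumes that password-less SSH from the current node to the others already works.

```shell
#!/bin/sh
# Sketch: succeed only if every line on stdin is identical.
# all_same is an illustrative helper, not an OP5 command.
all_same() {
    # Count the distinct lines; zero or one distinct line means no mismatch.
    [ "$(sort -u | wc -l | tr -d '[:space:]')" -le 1 ]
}

# Example (replace the placeholders with your node names):
# for node in <hostname1> <hostname2>; do
#     ssh "$node" 'cksum < /etc/op5-monitor-release'
# done | all_same && echo "versions match" || echo "versions differ"
```

A node that cannot be reached contributes no line to the comparison, so watch for SSH errors in the output as well.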
Troubleshooting SSH issues
Check the /var/log/secure file for errors pertaining to SSH. If there are any, run these commands on the server having the issue. They need to be run once for each additional server in the cluster. An example:
# mon sshkey push <hostname1>
# asmonitor mon sshkey push <hostname1>
# mon sshkey push <hostname2>
# asmonitor mon sshkey push <hostname2>
This pushes all SSH keys over to the other servers in the cluster. OP5 uses password-less SSH connections for some communications, so the SSH keys must be present on every node.
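For clusters with many nodes, the two push commands can be looped over a list of hostnames. The following is a sketch only: push_keys and the DRY_RUN variable are illustrative names, not part of OP5, and the loop simply repeats the two commands shown in the example for each argument.

```shell
#!/bin/sh
# Sketch: run both push commands for every node given as an argument.
# push_keys and DRY_RUN are illustrative names, not part of OP5.
push_keys() {
    for node in "$@"; do
        # With DRY_RUN set, the commands are printed instead of executed.
        ${DRY_RUN:+echo} mon sshkey push "$node"
        ${DRY_RUN:+echo} asmonitor mon sshkey push "$node"
    done
}

# Preview first, then run for real:
# DRY_RUN=1 push_keys <hostname1> <hostname2>
# push_keys <hostname1> <hostname2>
```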
Check the Merlin log file /var/log/op5/merlin/neb.log as well. In some instances, you may see errors like the ones below:
[1676376045] 4: stdout: Offending RSA key in /opt/monitor/.ssh/known_hosts:1
[1676376045] 4: stdout: RSA host key for monitor1_peer has changed and you have requested strict checking.
[1676376045] 4: stdout: Host key verification failed.
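To find out which nodes are affected without reading the whole log, the host names can be pulled out of the "host key for ... has changed" lines. This is a rough sketch that assumes the message wording matches the excerpt above; changed_hosts is an illustrative helper name, not an OP5 command.

```shell
#!/bin/sh
# Sketch: print each host name that appears in a
# "host key for <name> has changed" log line, once per name.
# changed_hosts is an illustrative helper, not an OP5 command.
changed_hosts() {
    sed -n 's/.*host key for \([^ ]*\) has changed.*/\1/p' "$1" | sort -u
}

# Example:
# changed_hosts /var/log/op5/merlin/neb.log
```

Each name it prints is a candidate for the known_hosts cleanup with ssh-keygen -R.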
For scenarios where an IP address or hostname has changed, you will need to first remove the known_hosts entry of the affected node before running the mon sshkey push commands. On all affected nodes, run the command below to remove the SSH key for the monitor user, replacing hostname with the name of the node whose key has changed:
runuser -l monitor -c 'ssh-keygen -R hostname'
After removing the known_hosts entry and re-running mon sshkey push, restart Merlin on all nodes:
systemctl restart merlind
Then observe the status via mon node status.
Verify Naemon configuration
If you have a corrupted or damaged Naemon config (located at /opt/monitor/etc/naemon.cfg), you can copy the naemon.cfg file from one of the other servers in the cluster, then run mon restart afterward.
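Before overwriting the file, it may be worth keeping a copy of the damaged version in case it is needed for later analysis. The sketch below is one way to do that: backup_file is an illustrative helper name, the use of scp is an assumption about how the file is transferred, and hostname is a placeholder for a healthy node.

```shell
#!/bin/sh
# Sketch: copy a file to a timestamped backup next to it and print the backup path.
# backup_file is an illustrative helper, not an OP5 command.
backup_file() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    # -p keeps the original permissions and timestamps on the copy.
    cp -p "$1" "$backup" && echo "$backup"
}

# Example sequence (hostname is a placeholder for a healthy node):
# backup_file /opt/monitor/etc/naemon.cfg
# scp hostname:/opt/monitor/etc/naemon.cfg /opt/monitor/etc/naemon.cfg
# mon restart
```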
If the issue persists
- Please contact our Client Services team via the chat service box available on any of our websites or via email to support@itrsgroup.com
- Make sure you provide us with:
  - Any troubleshooting steps from this article that you have already tried.