Problem:
In some situations a customer removes one of their peer nodes (for whatever reason), and as a result other nodes, such as Pollers, become INACTIVE.
Sample:
From the Master, only two nodes are now present, and the Poller node is INACTIVE:
#00 0/0:0 local ipc: ACTIVE - 0.000s latency
Uptime: 15s. Connected: 16s. Last alive: 1s ago
Host checks (handled, expired, total) : 2, 0, 710 (0.28% : 0.29%)
Service checks (handled, expired, total): 146, 0, 4514 (3.23% : 3.36%)
#01 0/0:0 poller op5poll (INACTIVE)
Uptime: unknown. Connected: unknown. Last alive: 1s ago
Host checks (handled, expired, total) : 0, 0, 19 (0.00% : 0.00%)
Service checks (handled, expired, total): 0, 0, 171 (0.00% : 0.00%)
From the Poller, you will observe that the peer master op52 is still visible:
[root@op5poll etc]# mon node status
Total checks (host / service): 19 / 171
#00 0/0:0 local ipc: ACTIVE - 0.000s latency
Uptime: 7m 55s. Connected: 7m 55s. Last alive: 0s ago
Host checks (handled, expired, total) : 19, 0, 19 (100.00% : 100.00%)
Service checks (handled, expired, total): 151, 0, 171 (88.30% : 88.30%)
#01 0/0:0 master op51 (INACTIVE)
Uptime: unknown. Connected: unknown. Last alive: 3s ago
Host checks (handled, expired, total) : 0, 0, 0 (0.00% : 0.00%)
Service checks (handled, expired, total): 0, 0, 0 (0.00% : 0.00%)
#02 0/0:0 master op52 (INACTIVE)
Uptime: unknown. Connected: unknown. Last alive: unknown
Host checks (handled, expired, total) : 0, 0, 0 (0.00% : 0.00%)
Service checks (handled, expired, total): 0, 0, 0 (0.00% : 0.00%)
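A quick way to spot the problem nodes is to filter the status output. The snippet below is a minimal sketch, assuming the output format matches the listings above; the awk filter is an illustration, not an OP5 command. A sample of the output above stands in for the live command here.

```shell
# Filter "mon node status" output down to nodes reported as INACTIVE.
# Against a live system you would pipe the real output instead:
#   mon node status | awk '/^#[0-9]+/ && /INACTIVE/ { print $3, $4 }'
mon_status_sample='#00 0/0:0 local ipc: ACTIVE - 0.000s latency
#01 0/0:0 master op51 (INACTIVE)
#02 0/0:0 master op52 (INACTIVE)'

printf '%s\n' "$mon_status_sample" |
    awk '/^#[0-9]+/ && /INACTIVE/ { print $3, $4 }'
# prints:
# master op51
# master op52
```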
Solution:
The removed peer must also be removed from the Poller's configuration. In this example the removed peer is op52.
Log in to op52 and run:
mon node ctrl --self mon node remove op5poll
mon node list
mon restart
mon node status
Log in to op51 and run:
mon node ctrl --type=poller mon node remove op52
mon node ctrl --type=poller mon restart
mon node status
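The steps above can be collected into a small dry-run helper. This is a hedged sketch, not part of OP5: the `print_cleanup_commands` function is hypothetical and only prints the commands so you can review which command belongs on which host before running anything; the node names are taken from this example.

```shell
#!/bin/sh
# Dry-run helper: print the cleanup commands for removing a peer
# (op52 in this example) from both the removed peer's own view and
# the poller's configuration. Nothing is executed; review the output
# and run each command manually on the indicated host.
print_cleanup_commands() {
    removed_peer="$1"
    poller="$2"

    echo "# On ${removed_peer}:"
    echo "mon node ctrl --self mon node remove ${poller}"
    echo "mon node list"
    echo "mon restart"
    echo "mon node status"

    echo "# On the remaining master:"
    echo "mon node ctrl --type=poller mon node remove ${removed_peer}"
    echo "mon node ctrl --type=poller mon restart"
    echo "mon node status"
}

print_cleanup_commands op52 op5poll
```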
Expected result:
[root@op51 ~]# mon node status
Total checks (host / service): 710 / 4514
#00 0/0:0 local ipc: ACTIVE - 0.000s latency
Uptime: 37m 29s. Connected: 37m 30s. Last alive: 0s ago
Host checks (handled, expired, total) : 690, 0, 691 (99.86% : 97.18%)
Service checks (handled, expired, total): 3821, 0, 4343 (87.98% : 84.65%)
#01 0/0:0 poller op5poll: ACTIVE - 0.000s latency - (UNENCRYPTED)
Uptime: 7s. Connected: 5s. Last alive: 5s ago
Host checks (handled, expired, total) : 0, 19, 19 (0.00% : 0.00%)
Service checks (handled, expired, total): 0, 151, 171 (0.00% : 0.00%)
--------------------------------------------------------------------------
[root@op5poll ~]# mon node status
Total checks (host / service): 19 / 171
#00 0/0:0 local ipc: ACTIVE - 0.000s latency
Uptime: 31s. Connected: 31s. Last alive: 1s ago
Host checks (handled, expired, total) : 1, 0, 19 (5.26% : 5.26%)
Service checks (handled, expired, total): 18, 0, 171 (10.53% : 10.53%)
#01 0/0:0 master op51: ACTIVE - 0.000s latency - (UNENCRYPTED)
Uptime: 37m 53s. Connected: 23s. Last alive: 1s ago
Host checks (handled, expired, total) : 0, 0, 0 (0.00% : 0.00%)
Service checks (handled, expired, total): 0, 0, 0 (0.00% : 0.00%)
Peer master op52 is now removed from both the Master's and the Poller's perspective, and both remaining nodes are ACTIVE.
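A note on reading the percentages in the output above. This is our interpretation of the format, not official documentation: the two figures appear to be handled checks as a share of that node's total and as a share of the grand total. The host-check line "690, 0, 691 (99.86% : 97.18%)" with a grand total of 710 is consistent with that reading:

```shell
# Assumed interpretation: first % = handled/node-total,
# second % = handled/grand-total (710 host checks overall).
handled=690; node_total=691; grand_total=710
awk -v h="$handled" -v n="$node_total" -v g="$grand_total" \
    'BEGIN { printf "%.2f%% : %.2f%%\n", 100*h/n, 100*h/g }'
# prints:
# 99.86% : 97.18%
```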