Related to:
Duplicate email alerts, out-of-time email alerts
Problem
Problem 1 - A user receives duplicate email alerts at exactly the same time. For example, a user receives two instances of EmailAlert01 at 10:00 AM.
Problem 2 - A user receives duplicate email alerts at different times. For example:
- User receives one instance of EmailAlert01 at 10:00 AM.
- User receives another instance of EmailAlert01 at 2:00 PM.
Problem 3 - A user receives an unexpected email alert. The alert is unexpected in the following scenarios:
- Scenario 1 - A Cell, Sampler, Managed Entity, or Rule is snoozed at the time of the email alert.
- Scenario 2 - A Sampler or Rule is inactive at the time of the email alert.
Possible Cause(s)
Possible Cause 1 of Problem 1 - A Rule-Action pair and an Alerting-Effect pair are configured to monitor the same data items.
Possible Cause 2 of Problem 1 - The email script (or external email application such as sendmail) that is configured in the Action/Effect ran more than once.
Possible Cause 1 of Problem 2 - The FKM sampler (or a State Tracker sampler) detected an error keyword from an old file again after the old file was renamed. The FKM sampler treats the renamed file as a new file and reads it from the beginning. This happens because of the following combination of events:
- The FKM sampler's files > file > source > filename setting has a wildcard. For example: /opt/application/log/application*.log
- The FKM sampler's wildcardMonitorAllMatches setting is disabled.
- The old file's filename is /opt/application/log/application.log.
- The FKM sampler detected an error in /opt/application/log/application.log.
- The FKM error trigger key was cleared using the FKM's Clear This Trigger command (or any similar FKM command).
- The old file's filename was changed from /opt/application/log/application.log to /opt/application/log/application_old.log. The latter still matches the value configured in the files > file > source > filename setting (see the first item in this list).
- The FKM sampler treats /opt/application/log/application_old.log as a new file and monitors it.
- The FKM sampler detected the errors again in /opt/application/log/application_old.log.
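The renaming sequence above can be reproduced outside Geneos with a few shell commands. This is a minimal sketch; /tmp/fkm_demo is a throwaway directory standing in for /opt/application/log:

```shell
# Demonstrate that the rolled-over file still matches the FKM wildcard
# pattern application*.log, so the sampler sees it as a brand-new file.
mkdir -p /tmp/fkm_demo
cd /tmp/fkm_demo
rm -f application*.log                     # clean slate for the demo
touch application.log                      # the file FKM was monitoring
mv application.log application_old.log     # the rollover rename
matches=$(ls application*.log)             # same glob as the filename setting
echo "$matches"
```

Because application_old.log still satisfies the glob, the sampler re-reads it from the start and re-detects the old error keywords.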
Possible Cause 2 of Problem 2 - The FKM sampler (or a State Tracker sampler) detected an error keyword from an old file again after the old file's contents were updated. The FKM sampler treats the old file as a new file and reads it from the beginning. This happens because of the following combination of events:
- The FKM sampler's files > file > source > filename setting has a wildcard. For example: /opt/application/log/application*.log
- The FKM sampler's wildcardMonitorAllMatches setting is disabled.
- The old file's filename is /opt/application/log/application_01.log.
- The FKM sampler detected an error in /opt/application/log/application_01.log.
- The FKM error trigger key was cleared using the FKM's Clear This Trigger command (or any similar FKM command).
- A new file was created. The filename is /opt/application/log/application_02.log.
- The FKM sampler treats /opt/application/log/application_02.log as a new file and monitors it.
- The old file, /opt/application/log/application_01.log, was updated.
- The FKM sampler treats /opt/application/log/application_01.log as a new file and monitors it.
- The FKM sampler detected the errors again in /opt/application/log/application_01.log.
Possible Cause 1 of Problem 3 - A Cell value changes before the Sampler becomes inactive, and the Rule does not have any Active Time defined on it. As the log entries below show, the Rule still applied the severity and fired the email alert even though the Sampler was already inactive:
2021-08-06 22:30:39.241+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 22:30:39.241+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 22:30:40.150+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 22:30:40.150+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0
Possible Cause 2 of Problem 3 - The Cell was snoozed after the email alert was fired. This can be easily identified by searching the Gateway log file for the strings "ActionManager" (or "AlertManager" if the email alert was produced by an Effect) and "CommandManager". The resulting log entries are as follows:
2021-08-06 22:54:41.100+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 22:54:41.100+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 22:54:42.109+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 22:54:42.109+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0
2021-08-06 23:06:47.886+0800 INFO: GatewayControl _commandExec: /SNOOZE:manualAllMe [ImXSSIo] requestId=1
2021-08-06 23:06:47.887+0800 INFO: CommandManager Executing command [/SNOOZE:manualAllMe] with id [54] for request id [1]
2021-08-06 23:06:47.887+0800 INFO: CommandManager Executing command '/SNOOZE:manualAllMe' on DataItem '/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")]', issued by user 'MNL\rgonzales' on '192.168.200.6'
From the log entries, the Snooze Command was executed on the cell after the email alert was fired.
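The timestamp comparison can be scripted. The sketch below filters the relevant manager entries and sorts them chronologically; the two sample lines are abbreviated from the excerpt above, and /tmp/gateway_sample.log is a hypothetical stand-in for your real Gateway log path:

```shell
# Build a small sample log (abbreviated from the entries above), then filter
# and sort so the order of the action firing vs. the SNOOZE command is clear.
cat > /tmp/gateway_sample.log <<'EOF'
2021-08-06 23:06:47.887+0800 INFO: CommandManager Executing command [/SNOOZE:manualAllMe] with id [54] for request id [1]
2021-08-06 22:54:41.100+0800 INFO: ActionManager Firing action 'send email alert'
EOF
grep -E 'ActionManager|AlertManager|CommandManager' /tmp/gateway_sample.log \
  | sort > /tmp/alert_audit.txt            # ISO-style timestamps sort correctly
head -n 1 /tmp/alert_audit.txt             # earliest entry: the action fired first
```

If the earliest matching entry is the "Firing action" line, the alert predates the Snooze command, which explains the "unexpected" email.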
Possible Cause 3 of Problem 3 - Specific to the FKM sampler: the Active Time is defined at the Sampler level and not at the File level. The two levels behave differently:
- Sampler level - The FKM's file pointer stops when the sampler becomes inactive. When the sampler becomes active again, the file pointer resumes from the line where it stopped, so error keywords written during the inactive period are still detected and can trigger an alert.
- File level - The FKM's file pointer continues to scan the log entries while the sampler is inactive but does not report the configured error keywords. The sampler starts detecting error keywords again only when it becomes active.
Possible Solution(s)
Solution to Cause 1 of Problem 1 - Access the Gateway log file and review the "AlertManager" and "ActionManager" log entries. Below are the steps:
- Locate the Gateway log file.
- Open the Gateway log file using any text viewer (e.g., Notepad on Windows or vim on Linux).
- Search for the string "ActionManager".
- Copy and paste the output to a text file.
- Search for the string "AlertManager".
- Copy and paste the output to a text file.
- Review the text file.
- Look for entries that have the same timestamp and target xpaths. Example log entries:
2021-08-06 15:17:59.263+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 15:17:59.264+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 15:17:59.433+0800 INFO: AlertManager Alert DataItem 'alerting for rowName / pugo / CRITICAL / 0' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 15:17:59.434+0800 INFO: AlertManager Alert: 'alerting for rowName / pugo / CRITICAL / 0'; Effect: 'fire email alert'; TO: ; CC: ; BCC: ; DataItem: /geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")]
2021-08-06 15:17:59.434+0800 INFO: EffectManager Firing effect 'fire email alert'
2021-08-06 15:18:00.143+0800 INFO: EffectManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 15:18:00.143+0800 INFO: ActionManager Completed effect 'fire email alert' for alert 'alerting for rowName / pugo / CRITICAL / 0', Exit code: 0
2021-08-06 15:18:00.143+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 15:18:00.143+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0

From the above:
- The 'send email alert' Action fired at 2021-08-06 15:17:59.264+0800 and completed at 15:18:00.143+0800. It was triggered by the following data item:
2021-08-06 15:17:59.263+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
- The 'fire email alert' Effect fired at 2021-08-06 15:17:59.434+0800 and finished at 15:18:00.143+0800. It was triggered by the following data item:
2021-08-06 15:17:59.433+0800 INFO: AlertManager Alert DataItem 'alerting for rowName / pugo / CRITICAL / 0' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
- Identify the Rule that uses the Action.
- Identify the Alerting hierarchy that uses the Effect.
- Review the configuration.
- Implement the necessary changes.
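The duplicate check in the steps above can be automated. This sketch counts how many fire events reference the same target xpath; the two sample lines come from the log excerpt above, and /tmp/gw.log stands in for the real Gateway log:

```shell
# A count of 2 (or more) for one xpath means a Rule-Action pair and an
# Alerting-Effect pair fired on the same data item at the same time.
cat > /tmp/gw.log <<'EOF'
2021-08-06 15:17:59.263+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 15:17:59.433+0800 INFO: AlertManager Alert DataItem 'alerting for rowName / pugo / CRITICAL / 0' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
EOF
grep -E 'ActionManager|AlertManager' /tmp/gw.log \
  | sed -n 's/.*variable=//p' \
  | sort | uniq -c | sort -rn > /tmp/dup_xpaths.txt
cat /tmp/dup_xpaths.txt                    # a leading count above 1 flags a duplicate
```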
Solution to Cause 2 of Problem 1 - Access the Gateway log file and check whether the same "AlertManager" or "ActionManager" entry appears more than once. If each entry appears only once, the Gateway fired the alert a single time, so:
- Check the log file of the external email script/email application.
- Implement the necessary changes on the external email script/email application.
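A quick way to tell whether the Gateway or the downstream mailer duplicated the alert is to count the firing entries. Minimal sketch; /tmp/gw.log stands in for the real Gateway log, and the sample line is taken from the excerpt earlier in this article:

```shell
# If the Gateway log shows exactly one "Firing action" line but the user got
# two emails, the duplication happened in the email script or mail relay.
cat > /tmp/gw.log <<'EOF'
2021-08-06 22:54:41.100+0800 INFO: ActionManager Firing action 'send email alert'
EOF
fired=$(grep -c "Firing action 'send email alert'" /tmp/gw.log)
echo "$fired"
```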
Solutions to Cause 1 of Problem 2
- Solution 1 - The new filename of the rolled-over file (or old file) should not match the value of the FKM sampler's files > file > source > filename setting.
- Solution 2 - Move the old file to a different directory.
- Solution 3 - If the problem happens to a State Tracker sampler, move the old file to a different directory.
Solutions to Cause 2 of Problem 2
- Solution 1 - Enable the FKM sampler's wildcardMonitorAllMatches setting so that every file matching the wildcard is monitored separately.
- Solution 2 - Move the old file to a different directory.
- Solution 3 - If the problem happens to a State Tracker sampler, move the old file to a different directory.
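Solutions 2 and 3 above amount to moving the old file out of the directory the wildcard watches. A minimal sketch, with /tmp/app/log and /tmp/app/archive as hypothetical stand-ins for the live log and archive directories:

```shell
# After the move, the FKM wildcard /tmp/app/log/application*.log no longer
# matches the old file, so it can never be re-read as a "new" file.
mkdir -p /tmp/app/log /tmp/app/archive
touch /tmp/app/log/application_01.log
mv /tmp/app/log/application_01.log /tmp/app/archive/
remaining=$(find /tmp/app/log -name 'application*.log' | wc -l | tr -d ' ')
echo "$remaining"
```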
Solutions to Cause 1 of Problem 3
- Solution 1 - Apply the Sampler's Active Time to the Rule so that the Rule also becomes inactive when the Sampler is inactive.
- Solution 2 - If the Rule uses the delay function, either remove it or use "delay X samples" instead of "delay X seconds". The "delay X samples" function counts samples, and because an inactive Sampler does not sample, its counter stops while the Sampler is inactive.
Solution to Cause 2 of Problem 3 - Ensure that the Snooze command (whether manual or via a Scheduled Command) is executed before the email alert is fired.
Solution to Cause 3 of Problem 3 - Define the Active Time on the appropriate setting:
Sampler level (see screenshot)
File level (see screenshot)
Related Articles
If Issue Persists
- Please contact our Client Services team via the chat box available on any of our websites, or by email at support@itrsgroup.com.
- Make sure you provide us with:
- The troubleshooting steps done (refer to the Possible Cause and Solution sections of this article)
- Date and time of the issue
- Screenshot of the duplicate email alerts or out of time alerts
- Complete Gateway log file
- Name of the Rule-Action or Alerting-Effect configuration
- Gateway diagnostics file