One of the most common problems when configuring Geneos (or any monitoring tool, for that matter) is ensuring that users are told when things go wrong and action is required, that alerts are not missed, and, perhaps more importantly, that false alerts, or alerts which are not actionable, are minimized or removed completely.
"One of the most common mistakes when monitoring is to alert on too many things, once the number of alerts exceeds what is manageable, you are essentially not monitoring at all"
While on the face of it this seems obvious, it is surprisingly hard to achieve, and requires constant tweaking to keep it tuned to a changing environment. This article aims to talk through some of these challenges.
In truth you could write a book on this subject, so this page cannot really be considered comprehensive; it is a start, and will be supplemented over time. As it stands the following content exists:
Common Mistakes in the monitoring space
|Frequency of Alerts
Managing the frequency of alerts
Core to the philosophy of good monitoring is that under normal BAU conditions:
1) Alerts are reported at an appropriate level of severity
2) That teams act within an appropriate time scale to those alerts
3) That no significant alerts are missed
4) Alerts occur at a manageable level
A manageable level means the team responsible for the systems can keep up with the alerts (the number of un-actioned or suppressed alerts does not grow over time). An appropriate level essentially means that:
|Identifying False Alerts
Identifying false alerts
When trying to get alerts to a manageable level you need to get a handle on why the alert levels are high, which includes both their frequency and their severity. For example, it is generally more acceptable to have many more warnings than criticals. Alerts can therefore be false positives if they do not require action, or if they suggest action is required more quickly than it actually is (the alert is critical rather than warning, for example).
The following are some common false positives:
Even if the frequency of alerts seems workable, if the team is getting dozens of critical alerts a week then this may be indicative of very unstable systems that are having a significant impact on the business, or of poorly configured monitoring. The simplest way to think about what severity level a given alert should be at is how long you will tolerate that situation before acting. In rough terms:
In all cases an action is defined as something which reduces the severity level (and therefore buys you more time), or resolves the situation completely.
|Embedding Expert Knowledge
Embedding Expert Knowledge (removing generalizations)
Really effective monitoring of any given system requires expert knowledge of that system, and a solid understanding of how it behaves 'normally'. Compare it to a car: knowledge of the make and model provides a good general understanding, but each specific instance of that car has its own nuances. The person who drives that car regularly is best placed to know what normal looks like, and when things are going wrong or behaving unusually.
The deployment of general monitoring is therefore a good start, but really effective monitoring needs to be tweaked for the specific instance. Not performing these tweaks is a common source of false alerts. The experts, as in the car analogy, are the people who attend to the system on a regular basis: this may be the development teams that designed it, the support teams that support it, the end users, or more likely a combination of these.
Examples of specific tweaks for a specific application might include embedding logic into the monitoring to cover its observable issues when under load (high CPU, high memory, slow throughput, dropped trades, etc.), the effect on the app of downstream and upstream applications misbehaving, the time it takes to start up and its observable states during that start up, what normal looks like, and so on.
The process also requires an expert in monitoring, someone that knows what effective monitoring looks like, and what the selected tools can and cannot do. Both the system expert and the monitoring expert also need a solid understanding of who the monitoring is aimed at, since the type of information gathered and compiled will vary. For example the information provided to a support tech will differ from that presented to an Exec.
|Ensuring responsibility for effective monitoring
The need for constant maintenance
Effective monitoring requires constant maintenance, for example:
If this maintenance is not performed then false alerts will creep in. Within your organisation, responsibility for the health and improvement of the monitoring must be clearly defined and rigorously enforced. High quality monitoring will in turn drive high quality processes in the teams and systems it monitors.
The bigger goal is zero tolerance of on-going alerts; this requires not just good tools but cultural change, which is far more challenging than tweaking configuration.
|Less vs More
Target the monitoring
Geneos is capable of monitoring an enormous diversity of systems, and if there is nothing out of the box, monitoring can normally be built using the more generic plugins such as the Toolkit, SQL Toolkit or API plugin. When deploying monitoring, designers therefore have to decide how much of their estate they will monitor, and exactly what, of all the things they could monitor, they will. There are two ends to this scale:
and anything in between.
Both approaches have their merits, but in the context of manageable alert levels the latter approach has the best chance of success. Starting with just the critical systems also helps embed a culture of timely reaction to alerts, and allows a zero tolerance approach to criticals, and in the most mature teams, to warnings. Having achieved this culture, adding new monitoring while maintaining quality monitoring and process is fairly straightforward.
Conversely, if you start from the outset with a large estate generating unmanageable or inappropriately severe alert levels, this can be a difficult situation to recover from. The teams who adopt the monitoring quickly become acclimatized to constant alerts and simply use the monitoring as a reactive analysis tool. In some cases there is a fear within the organisation that turning off or downsizing the monitoring might result in a missed alert, when in reality they are close to that situation already.
|Actually getting alerted
Actually getting alerted
Another important factor in alert governance is the actual mechanism that you nominate to be alerted. Examples in Geneos include but are not limited to:
The choice of notification method can be significant when considering the concept of a 'false alert', or to put it another way, when the alert volume exceeds what is manageable (as well as correct). For example, if a system is generating 100 critical alerts a day, and the chosen method is to show those alerts in the console for as long as they are on-going, clearing when the situation ends, that might be deemed (while not ideal) workable. If on the other hand an e-mail was generated each time an alert occurred, the same number of alerts might overwhelm and desensitize the affected team (100 e-mails is bordering on spam). Consider also that the alerts will not be removed when the situation is fixed, since an e-mail is by nature immutable; in the worst case there may be a second e-mail for each event to say the situation has been resolved, which the user will have to correlate with the first.
So the choice of alerting mechanism is significant. The examples listed above all have different pros, cons, and capabilities.
In a monitoring system of any scale, it is probably also true that there will be no 'one size fits all' alerting mechanism. In much the same way as effective management of alerts requires customization to the specific behaviors of the monitored app, so the alerting requires tailoring to the intended audience of the alerts. Designers of a monitoring system should pick and choose what is appropriate, working with the teams involved to ensure they will both see and act on the alerts when they are generated. For example, in a team that is already saturated with e-mails from other systems and their working environment, even if the monitoring system generates just a few critical alerts a week, they may be missed due to these external factors.
Escalation is desirable where alerts are not being actioned within the agreed time scales, and consideration needs to be given to what alerting mechanism is used for the escalation. While on one hand a simple change of severity is a form of escalation (assuming there is room for maneuver and you are not already red), so is a change of alerting type and audience. A common process, for example (though not automatically the best), would be to show a red on the console for a time, then send a mail to an individual or group of individuals if that situation persists. A change of alerting type is likely in the case of escalation, since by definition the previous method has not worked. The same care and attention needs to be taken when considering escalation alerting types, if not more so, given the escalation is likely to go to more senior people who may work in different locations and have a different focus.
Specific things you can do in Geneos to solve the common monitoring mistakes
|The available severity levels
Available Severity levels
Before going into detail on the options for managing alerts it is important to highlight the available severity levels. There are 4 in Geneos, each with an associated numeric value and colour.
Snoozing - manually disabling alerts
Any data item can be snoozed in Geneos. By a data item we mean any of a Gateway, Probe, Managed Entity, Sampler, Dataview, table cell or headline. Snoozing has just one effect:
"Snoozing stops the propagation of severity from that data item to its parent"
It does not affect the severity of the data item you have snoozed. For example, in the screenshot below you can see that a cell with critical severity has been snoozed. This means it will no longer propagate its severity to its parent (the dataview). Because this was the only critical cell in the dataview, the dataview's severity becomes OK. However, the cell itself remains at critical severity.
The act of snoozing an item is manual, i.e. an operator makes a conscious decision to suppress an alert. At the point of snoozing they can select an exit condition for the snooze to end. If you review the screenshot above you can see examples of exit conditions in the snooze menu. By default the menu includes the 'Manual' option, which means the snooze can only be removed by an operator. There are plenty of legitimate reasons to snooze alerts, for example:
The danger of any system which allows manual suppression of alerts (for arbitrary time scales, or without planned or reasonable exit conditions) is that operators use snoozing as a mechanism to handle being overwhelmed by alerts - or in short - they snooze everything. As should be obvious, this is not a good strategy for dealing with excessive alerts and should be actively discouraged. Snoozed items should therefore be actively managed, and there are a number of tools in Geneos that will help with this.
You can add a view to the console that displays a list of all the snoozed items. The view below will show all the snoozed cells and managed entities in the connected gateways.
The paths that drive this view can, however, be quite expensive, since they look at all cells all the time.
Gateway Snooze data view
Within the gateway itself you can add a Gateway plugin that lists all the snoozed items; the XML for the sampler is below.
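As a rough sketch (the plugin element name is an assumption based on the Gateway Snooze Data plugin, and should be verified against your Gateway schema), such a sampler might look like:

```xml
<sampler name="GW Snooze Data">
	<plugin>
		<!-- gateway plugin that reports all currently snoozed data items -->
		<Gateway-snoozeData></Gateway-snoozeData>
	</plugin>
</sampler>
```

This sampler is normally attached to a managed entity on the gateway's own probe, so that the snooze data appears alongside the rest of the gateway's self-monitoring.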
This produces a dataview which looks much like the screen below.
Rules and alerts can be set on this dataview as normal, allowing users and managers to track snoozes in their system. It includes a 'Snooze Type' column, which helps identify what exit criteria users are selecting when suppressing alerts via snoozing.
Stopping manual snoozes
You as an organisation may decide that it is never appropriate to use the snooze command without a valid exit criterion, in which case you can use the security settings to remove this option for selected users.
Taking account of snoozes in actions
By default, if a data item or any of its ancestors is snoozed, then Actions run within rules will not fire. This setting is defined under the Advanced section of the Action definition in the setup. You can also be more explicit by adding the check to the rule block itself. For example:
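As an illustrative sketch only (the target path is a placeholder, and the `state/@snoozed` property check is an assumption that should be verified against your Gateway's rule schema), an explicit check might take the form:

```xml
<rule name="Alert When Down And Not Snoozed">
	<targets>
		<!-- placeholder target path; substitute the real cell -->
		<target>/geneos/...</target>
	</targets>
	<block>
		<if>
			<and>
				<equal>
					<dataItem><property>@value</property></dataItem>
					<integer>0</integer>
				</equal>
				<equal>
					<!-- explicit check that the cell itself is not snoozed -->
					<dataItem><property>state/@snoozed</property></dataItem>
					<integer>0</integer>
				</equal>
			</and>
			<transaction>
				<!-- severity update (and any action invocation) goes here -->
				<update>
					<property>state/@severity</property>
					<severity>critical</severity>
				</update>
			</transaction>
		</if>
	</block>
</rule>
```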
Note however that the rule above would not take account of the cell's ancestors (for example the Managed Entity it is on).
|Programmatically disabling monitoring
Active vs inactive monitoring - automatically disabling monitoring
Every data item in Geneos also has an 'Active' status. By default all data items are Active; if an item is made Inactive then it does not propagate its severity to its parent (in the same way as snoozing blocks severity propagation), thus:
"An Inactive status stops the propagation of severity from that data item to its parent"
Unlike snoozing a data item, which is a manual action, changing the Active status is performed programmatically. The severity of the data item is not changed: if a cell is critical it remains critical, it just does not influence the severity of its parent.
There are two main methods to set an item inactive:
Within a rule block
You can set the Active Status of a data item explicitly within a rule block as a literal, for example:
The XML for the above rule would be of the form
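As a sketch of the shape such a rule might take (the target path is a placeholder, and the property name and value type used for the active status are assumptions to check against your Gateway schema):

```xml
<rule name="Deactivate During Maintenance">
	<targets>
		<!-- placeholder target path; substitute the real data item -->
		<target>/geneos/...</target>
	</targets>
	<block>
		<transaction>
			<update>
				<!-- assumed property name; marks the item inactive so its
				     severity no longer propagates to its parent -->
				<property>state/@active</property>
				<integer>0</integer>
			</update>
		</transaction>
	</block>
</rule>
```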
The other method is via Active Times.
When considering alerting, it is often relevant to consider whether the systems are expected to be operational at any given point of the day, month or year. Alerting outside these times can generate unnecessary noise, both for the responsible team and for others who may be in another time zone. In Geneos, designers can use Active Times to suppress alerts during downtime. An Active Time can be set within the gateway setup, and used in a number of places; one of the most common is in rules:
In the above example an active time is used explicitly in the rule block, and in the second example (on the right of the figure) in the rule's active time settings. In the case of the rule's active time setting, the whole rule is only active within that active time. When referencing active times within the body of a rule we can be more granular. Since severity is ONLY defined by rules, when a rule is outside its active time it will not be setting severity, which, in effect, suppresses the alert. An example of such a rule is provided below.
<rule name="Suppress Known Disk Problem">
	<target>/geneos/gateway[(@name="systemAlerts")]/directory/probe[(@name="iconfluencesrv")]/managedEntity[(@name="iconfluencesrv")]/sampler[(@name="Linux Disk")][(@type="Linux Defaults")]/dataview[(@name="Linux Disk")]/rows/row[(@name="/mnt/resource")]/cell[(@column="percentageUsed")]</target>
	<activeTime ref="Working Day"></activeTime>
	<!-- severity-setting body omitted; the rule only applies within "Working Day" -->
</rule>
And here is the active time XML for a sample working day
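As a sketch of what such an active time might look like (the schedule element names are assumptions and should be checked against your Gateway schema):

```xml
<activeTime name="Working Day">
	<scheduledPeriods>
		<period>
			<!-- active 08:00 to 18:00 on working days; day-selection
			     elements vary by schema version and are omitted here -->
			<startTime>08:00:00</startTime>
			<endTime>18:00:00</endTime>
		</period>
	</scheduledPeriods>
</activeTime>
```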
Active times (which directly influence the designer's ability to suppress alerts) are also commonly used in the following places:
|Alerts with more complex Signatures
Working with more complex alert conditions
Often an alert situation is more complex than the designer builds into their monitoring; the simplistic point-value case can be triggered more frequently than the actual alert condition which is affecting the business. Examples include:
A delay can be built into a rule transaction. It stops the remainder of the transaction from occurring until that delay has passed and the original condition has remained true throughout. Delays are useful for conditions which may self-correct, for example a CPU that has gone over 90% and stayed above 90% for 60 seconds.
An example can be seen below.
The delay can be specified in terms of seconds
or samples; thus if the sample time was 20 seconds, 2 samples would be 40 seconds.
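Putting the pieces together, a sustained-high-CPU rule might be sketched as follows (the target path is a placeholder, and the exact placement of the `<delay>` element should be checked against your Gateway's rule schema):

```xml
<rule name="Sustained High CPU">
	<targets>
		<!-- placeholder: the CPU utilisation cell -->
		<target>/geneos/...</target>
	</targets>
	<block>
		<if>
			<gt>
				<dataItem><property>@value</property></dataItem>
				<integer>90</integer>
			</gt>
			<transaction>
				<!-- only goes critical if the value has stayed above 90 for 60s -->
				<delay>60</delay>
				<update>
					<property>state/@severity</property>
					<severity>critical</severity>
				</update>
			</transaction>
			<transaction>
				<update>
					<property>state/@severity</property>
					<severity>ok</severity>
				</update>
			</transaction>
		</if>
	</block>
</rule>
```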
Another example is a process that has an automated restart script, such that if it goes down it is restarted. In this case the fact that it has failed is of interest, but it is not an alert until the restart has also failed. In the rule below we are looking at a process plugin and expecting there to be a single instance. We have added an Action to the gateway which, should Geneos detect that the process count is 0, will try to restart the process automatically.
The rule will trigger the restart, and turn the cell Warning while the restart is being performed. If the restart is successful and the instance count goes back to 1, the cell will become OK. If the process count stays at 0 for 60 seconds, the cell will turn Critical - it is a genuine alert that requires action.
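A simplified sketch of the severity side of such a rule follows (the action invocation and the interim Warning state are omitted for brevity, and the target path is a placeholder):

```xml
<rule name="Process Down">
	<targets>
		<!-- placeholder: the instanceCount cell of the process sampler -->
		<target>/geneos/...</target>
	</targets>
	<block>
		<if>
			<equal>
				<dataItem><property>@value</property></dataItem>
				<integer>0</integer>
			</equal>
			<transaction>
				<!-- critical only if the automated restart has not
				     brought the count back to 1 within 60 seconds -->
				<delay>60</delay>
				<update>
					<property>state/@severity</property>
					<severity>critical</severity>
				</update>
			</transaction>
			<transaction>
				<update>
					<property>state/@severity</property>
					<severity>ok</severity>
				</update>
			</transaction>
		</if>
	</block>
</rule>
```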
In the example above we are still generating a warning alert for the duration of time that the process is down, even though we expect it to recover without human intervention. We could choose to be aggressive in limiting alerts by not generating the warnings at all, simply assuming that recovery of this process is part of BAU. We could also consider utilizing the OK and Undefined severities a little more. For example, have the cell be Undefined when the instance count = 1 (i.e. everything is fine and there is nothing of interest), then have it turn OK during the restart process, i.e. everything is OK, but just an FYI that a restart is underway.
This use of the Undefined severity as a valid state for 'everything is OK' can help extend the use of the severity levels in Geneos, allowing OK to be used as an 'of interest, but not yet warning' level.
Use of History Periods for more temporal alerting vs point values
While the delay function is useful for detecting extended periods of a selected state, it suffers in that if the condition becomes false, even for a short time, the delay is reset. For example, a server may exhibit high CPU for a number of minutes, but have brief periods where it drops below the selected threshold. Or an auto-restarting process may restart many times in a short period. Both of these may be valid alert states, but will not be detected by the 'delay' method.
If we consider the auto-restarting use case, let's say that as well as detecting when the process fails to come up, we are also interested if it restarts 5 times (or more) in one hour. We can achieve this by monitoring the average instance count over the hour. If the process never goes down this should be 1; anything below 1 means at least one restart occurred. Assuming the restarts are working and the sample time of the process sampler is 20 seconds (so 180 samples an hour), then any average below 0.972 (i.e. 175/180) means either at least 5 restarts occurred, or the process was down over multiple samples - both worthy of attention.
There are a few steps we need to take to set this up in Geneos:
1) Create a history period for the selected time. This goes in the rules section of the setup; example XML is below.
<historyPeriod name="last Hour">
	<!-- period definition omitted: a rolling window covering the previous hour -->
</historyPeriod>
2) We also need to add an additional column to the selected process sampler which will retain the average instance count. This is added via the Advanced tab of the sampler; an example is shown below.
3) Finally, you need to define a rule that will both calculate the average instance count and set severity under your chosen conditions.
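As a sketch of how the graded severities might be expressed (the calculation of the average itself via the history period is omitted, the thresholds follow the 20-second sample example above, and the target path and value-type elements are assumptions):

```xml
<rule name="Average Instance Count">
	<targets>
		<!-- placeholder: the computed average instance count column -->
		<target>/geneos/...</target>
	</targets>
	<block>
		<if>
			<lt>
				<dataItem><property>@value</property></dataItem>
				<double>0.972</double>
			</lt>
			<transaction>
				<!-- five or more restarts (or extended downtime) in the hour -->
				<update>
					<property>state/@severity</property>
					<severity>critical</severity>
				</update>
			</transaction>
			<if>
				<lt>
					<dataItem><property>@value</property></dataItem>
					<double>1</double>
				</lt>
				<transaction>
					<!-- at least one restart in the hour -->
					<update>
						<property>state/@severity</property>
						<severity>warning</severity>
					</update>
				</transaction>
				<transaction>
					<update>
						<property>state/@severity</property>
						<severity>ok</severity>
					</update>
				</transaction>
			</if>
		</if>
	</block>
</rule>
```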
In this particular rule the severities have been graded, such that:
In the event that an alert has occurred which requires action within some time scale (generally, therefore, warning and critical), it is likely that a team member will pick it up for review. Once this has occurred there may be a case for downgrading the alert, in much the same way as snoozing a cell. The designer of the monitoring can use the act of user assignment within their rules to change the state of the system.
Any data item can be assigned (so Gateways, Probes, Managed Entities, Samplers, dataviews, table cells and headlines). The act of assigning a user has no default impact on the severity of a data item unless the rules are designed to take it into account. For example, the rule shown below will turn the cell Warning; if the cell is assigned it will turn OK, with the assignment icon to mark that it is being dealt with.
<rule name="Known Issue">
	<!-- body omitted: sets the cell to Warning, or to OK when it is user assigned -->
</rule>
This would have the following effect:
If you do choose to use user assignment as a way of dealing with alerts, it may also be of interest to track what is and is not user-assigned within your environment. As with the monitoring of snoozes, there is a gateway plugin that tracks user assignments in a system. The view includes the number of minutes that the data item has been assigned, so you can include rules to look for items that have been assigned for extended periods.
The XML for the sampler can be found below
<sampler name="GW User Assignment Data">
	<plugin>
		<!-- plugin element name assumed; verify against your Gateway schema -->
		<Gateway-userAssignmentData></Gateway-userAssignmentData>
	</plugin>
</sampler>
Note that unlike snoozing, user assignment does not have other indicators in the likes of the state tree and Entities view. You can, however, create a list view within the console that shows the assigned items.
When an item is user-assigned the operator can select an exit condition for the assignment: for example until the severity changes, until a date and time or duration has passed, or until the value changes. There is also a simple assignment with no automatic exit condition. You as an organisation may decide that it is never appropriate to use the user assignment command without a valid exit criterion, in which case you can use the security settings to remove this option.