One of the most common problems when configuring Geneos (or any monitoring tool, for that matter) is ensuring that users are told when things are going wrong and action is required, that alerts are not missed, and, perhaps more importantly, that false alerts, or alerts which are not actionable, are minimized or removed completely.
"One of the most common mistakes when monitoring is to alert on too many things, once the number of alerts exceeds what is manageable, you are essentially not monitoring at all"
While on the face of it this seems obvious, it is hard to achieve in practice, and requires constant tweaking to keep the monitoring tuned to a changing environment. This article aims to talk through:
- In general terms, the kinds of mistakes made in the monitoring space that lead to a loss of control around alerting
- Some of the specific things you can do in Geneos to manage alerts.
In truth you could write a book on this subject, so this page cannot really be considered comprehensive, but it is a start, and will be supplemented over time. As it stands the following content exists:
- Managing the frequency of alerts
- Identifying false alerts
- Embedding Expert Knowledge (removing generalizations)
- The need for constant maintenance
- Target the monitoring
- Actually getting alerted
- Available Severity levels
- Snoozing - manually disabling alerts
- Active vs inactive monitoring - automatically disabling monitoring
- Working with more complex alert conditions
- User Assignment
Common Mistakes in the monitoring space
Managing the frequency of alerts
Core to the philosophy of good monitoring is that under normal BAU conditions:
1) Alerts are reported at an appropriate level of severity
2) That teams act within an appropriate time scale to those alerts
3) That no significant alerts are missed
4) Alerts occur at a manageable level
A manageable level means the team responsible for the systems can keep up with the alerts (the number of un-actioned or suppressed alerts does not grow over time). An appropriate level essentially means that:
Alert Severity | Response from monitoring team | Period for which it should be tolerated |
---|---|---|
Critical | The systems have been impacted in a business critical way such that immediate action is required in minutes, end users will either soon notice or will have already noticed an outage and are seriously impacted. | Minutes |
Warning | The systems are about to be, or have been compromised in such a way as their performance or function is degraded, but the business will continue to function. | Hours or days (not weeks) |
OK | The systems are performing as expected, but the item remains of interest and needs to be monitored | Unlimited |
Undefined | The data is of interest, but requires no action, or supports analysis and investigation when other alerts occur. Note that even items with undefined severity should be of interest, we talk later about targeted monitoring and the need to avoid monitoring for the sake of monitoring, or just 'because we can'. | Unlimited |
Identifying false alerts
When trying to get alerts to a manageable level you need to get a handle on why the alert levels are high, including both their frequency and their severity. For example, it is generally more acceptable to have many more warnings than criticals. Alerts can therefore be false positives if they do not require action, or if they suggest action is required more quickly than it actually is (flagged critical rather than warning, for example).
The following are some common false positives:
# | Possible false alert | Description |
---|---|---|
1 | The Alert is already being looked at | An event has occurred but a member of the team is working on it, in theory this may reduce the severity for the remainder of the team, or even negate the situation completely. |
2 | The specific alert occurs very frequently | In this case the team may become desensitized to the alert and simply ignore it |
3 | It has occurred outside hours | The alert has occurred outside a given time window, and requires no action. It may self correct before the relevant time window occurs. Predictable maintenance windows would be an example of this. |
4 | It is a consequence of another fault and is not the root problem | Another system or function has failed and this is an inevitable consequence of that failure. For example, a disk has filled up and the application relying on that disk has failed. The disk being full may be critical (it is the root problem), while the app failing may be just a warning; the action is on the server, not the app. |
5 | The rule is too generic | A blanket rule has been applied, and triggers alerts on systems where that behavior is acceptable. For example, a rule which dictates CPU should be critical if > 95% may not be applicable on a mainframe which is expected to run close to 100% most of the time, or an FKM is configured to look for the word 'Error' in a log file, but that word occurs far too often to be useful. |
6 | The situation is temporary or transient | The alerting situation occurs for a period but then normal operation resumes without human intervention. Applications may be busy for periods, for example, and monitoring may be set to detect a busy application without any leeway for that busy period to end under normal operation. |
7 | The alerts are on secondary systems, such as a UAT or Development environment | If teams are simultaneously monitoring UAT and Production environments and their chosen alerting essentially merges these alerts (for example they are using the notifier in the Active Console), then they will receive noise on secondary systems which may obscure actual alerts. |
8 | Alerts on situations which can be automatically recovered | For example, if a process goes down, the situation is detected, and a script is automatically run to restart the process; if all this works as planned there may be no need for an alert. Alerts may be set if the script fails to restart the process, or if the process terminates frequently, both of which may require operator investigation and intervention. |
9 | The monitored items have been modified or removed and the monitoring has not been updated | Or to put it another way, the team responsible for changing the monitoring is not keeping up with change in the systems that are monitored. This may be because they have insufficient access, are under constant time pressure, or are not aware that the system has been changed. |
10 | Misconfigured sampler / rule | The sampler or rule has been configured incorrectly, so the data gathered or the thresholds applied do not reflect the real situation. |
11 | Inappropriate severity | The alert fires at a higher (or lower) severity than the situation warrants (see 'Inappropriate Severity' below). |
12 | Alert is informative rather than indicating criticality | The alert falls outside monitoring by exception; it is configured to report that a task has completed successfully, rather than firing only when the task fails. |
Inappropriate Severity:
Even if the frequency of alerts seems workable, if the team is getting dozens of critical alerts a week then this may be indicative of some very unstable systems that are having a significant impact on the business, or of poorly configured monitoring. The simplest way to think about what severity level a given alert should be at is how long you will tolerate that situation before acting. In rough terms:
Severity level | Time to act |
---|---|
Critical | Minutes |
Warning | Hours or a day or two |
Ok and Undefined | No Limit |
In all cases an action is defined as something which either reduces the severity level (and therefore buys you more time) or resolves the situation completely.
Embedding Expert Knowledge (removing generalizations)
Really effective monitoring for any given system requires expert knowledge of that system, and a solid understanding of how it behaves 'normally'. By comparison with a car: knowledge of the make and model provides a good general understanding, but each specific instance of that car will have its own nuances, and the person that drives it regularly is best placed to know what normal looks like, and when things are going wrong or are unusual.
The deployment of general monitoring is therefore a good start, but really effective monitoring needs to be tweaked for that specific instance. Not performing these tweaks is a common source of false alerts. The experts, as in the car analogy, are those people that attend to the system on a regular basis; this may be the development teams that designed it, the support teams that support it, the end users, or more likely a combination of these teams.
Examples of specific tweaks for a specific application might include embedding logic into the monitoring to cover its observable issues when under load (high CPU, high memory, slow throughput, dropped trades and so on), the effect on the app of downstream and upstream applications misbehaving, the time it takes to start up and its observable states during that start up, what normal looks like, and so on.
The process also requires an expert in monitoring, someone that knows what effective monitoring looks like, and what the selected tools can and cannot do. Both the system expert and the monitoring expert also need a solid understanding of who the monitoring is aimed at, since the type of information gathered and compiled will vary. For example the information provided to a support tech will differ from that presented to an Exec.
The need for constant maintenance
Effective monitoring requires constant maintenance, for example:
- Like any software, the monitoring tool will itself experience issues
- The underlying system it monitors will be subject to functional change
- The BAU signature of the underlying system may also change (see the previous section on embedding expert knowledge)
- The upstream and downstream applications may change the inputs into the system, or the OS and hardware on which it relies may be updated
- New failure cases may be identified, and existing ones may become redundant or become the source of false alerts.
If this maintenance is not performed then false alerts will creep in. Within your organisation responsibility for the health and improvements of the monitoring must be clearly defined and aggressively implemented. High quality monitoring will in turn enforce high quality processes and systems in the teams and the systems it monitors.
The bigger goal is zero tolerance of on-going alerts. This requires not just good tools but cultural change, which is far more challenging than tweaking configuration.
Less vs More
Target the monitoring
Geneos is capable of monitoring an enormous diversity of systems, and if there is nothing out of the box then the monitoring can normally be written using the more generic plugins such as the Toolkit, SQL Toolkit or API plugins. When deploying monitoring, the designers therefore have to decide how much of the systems they will monitor, and exactly what, of all the things they could, they will monitor. There are two ends to this scale:
- Monitor everything we can
- Build up the monitoring slowly, starting with the critical components, and then only the most important metrics.
and anything in between.
Both approaches have their merits, but in the context of manageable alert levels the latter approach has the best chance of success. Starting with just the critical systems also helps embed a culture of timely reaction to alerts into the teams, and allows a zero tolerance approach to criticals, and in the most mature teams to warnings. Having achieved this culture, adding new monitoring while maintaining quality monitoring and process is fairly straightforward.
Conversely, if you start from the outset with a large real estate generating unmanageable or inappropriately severe alert levels, then this can be a difficult situation to recover from. The teams who adopt the monitoring quickly become acclimatized to constant alerts and simply use the monitoring as a reactive analysis tool. In some cases there is a fear within the organisation that turning off or downsizing the monitoring might result in a missed alert, when in reality they are close to this situation already.
Actually getting alerted
Another important factor in alert governance is the actual mechanism that you nominate to be alerted. Examples in Geneos include but are not limited to:
- Directly via the interface on a local screen (Active Console, Montage and so on), specifics include list views, the state tree, the event ticker, or the notifier
- Playing a sound, which is possible in Geneos via the notification manager
- Dashboards, which are essentially a subset of the local screen and large screen categories, but are significant enough to pull out as their own line item.
- On large screens mounted in a work area, generally a shared resource, and not directly interacted with, dashboards are particularly prevalent in this space
- E-mail, Social Media and SMS sending out specific events to a personal or group E-mail address or social media channels
- An update to an external ticketing system, automatically generating a new ticket in a secondary ticketing system
- and so on ...
The choice of notification method can be significant when considering the concept of a 'false alert', or to put it another way, when the number of alerts exceeds what is manageable (as well as correct). For example, if a system is generating 100 critical alerts a day, and the chosen method of displaying those alerts is to show them in the console for as long as they are on-going, and have them clear when that situation ends, that might be deemed (while not ideal) workable. If, on the other hand, an E-mail is generated each time an alert occurs, the same number of alerts might overwhelm and desensitize the affected team (100 E-mails is bordering on spam). Consider also that the alerts will not 'be removed' when the situation is fixed, since the nature of a mail is that it is non-mutable; in the worst case there may be a second mail for each event to say the situation has been resolved, which the user will have to correlate.
So the choice of alerting mechanism is significant. Of the examples listed above they all have different pros and cons, and capabilities.
- Directly via the interface, i.e. on the Active Console or in a web browser (Web Montage). There are a number of mechanisms within the console, and it is worth considering each as a means of being alerted:
- As a severity colour, principally showing an entity, cell, part of a tree structure and so on as one colour or another. The colour remains for as long as the state persists, and clears (changes colour) once fixed. It relies on the end user observing the screen, so needs to be present all the time on the operator's monitors and not occluded by other applications. More significantly, the artifacts that are alerting need to have some representation on the screen, rather than sitting in a hidden or off-screen window, or requiring the use of scrolling (a common mistake made by UIs and users of UIs). In Geneos the severity propagates up to higher level artifacts (entities go red once they have at least one critical cell, for example). However, if a team has not implemented effective alert governance then artifacts in the middle and top of the hierarchy can be persistently red, which essentially obfuscates all new alerts beneath them unless operators happen to be looking at that particular low level view at the time. Alerts can therefore come and go without any operator noticing. An exclusive reliance on this alert mechanism within a system with lots of warnings and criticals leaves the monitoring team in a reactive rather than proactive state. It is also ineffective if used exclusively and not watched all the time, which is rather impractical.
- List Views, or in more general terms a dynamic list of alerts, whereby when a situation triggers there is an entry in the list, and when that situation corrects the item is removed. In Geneos at least, the user can then click on one of these alerts to go to the source alert. The criteria for what goes into the list are defined within its configuration. These have similar advantages and disadvantages to the use of straight severity colour, i.e. that alerts will clear when the situation is resolved, but also that alerts can come and go without the user noticing if they are not looking at the screen, and that new alerts can appear but not be noticed if they are off screen (outside the current scrollable area). New alerts may also be inserted into the list rather than added to the bottom, which means that if the list perpetually has items the addition of a new entry may be missed, a situation which gets worse the longer the average length of the list. They are therefore most effective when kept largely empty, with alerts that do get added quickly actioned.
- Notifiers, or to put it another way, the small pop-ups that appear on screen when an event occurs (normally in the bottom right). These can be configured for the duration of time they persist, their aesthetic, and the specific events that trigger them. Their appearance is normally animated, which can be eye catching for the end user and an advantage over many other alerting mechanisms, but if their frequency is too high users quickly become desensitized to their appearance, which renders them ineffective as an alerting system. In practice, more than a few a day will quickly become too frequent to reliably act on. With respect to the time they spend on screen we can broadly consider two states: shown for some fixed time, or shown until the user dismisses them. These modes exhibit slightly different alerting capabilities. If they are removed after some time then they can be missed, so are similar to the other 'directly via the interface' mechanisms. If they persist then they cannot be missed; however, the alert may have been corrected in the time between the alert occurring and the user dismissing it. In addition, where they must be dismissed by the user, we need to consider how many alerts it is practical to display at the same time; if the number is finite then a replacement strategy will have to be selected, and therefore alerts can still be missed, while if you allow unlimited pop-ups you risk, under certain conditions, filling the screen. Given all this, notifier pop-ups can be very effective for a small number of alerts a day, notably if set to dismiss only on user interaction, and are particularly good at supplementing other alerting types. They are ineffective where the specific alert types are too frequent.
- Event Tickers, or in more general terms a list of historical events. These are events that have occurred within the monitored system in the past, and they tend to be sortable (by time or severity). They are non-mutable, i.e. their content describes an alert that has happened and their state will not change. If the event that caused the alert is cleared, the system may create a new historical event for that clear down, but it will not modify the original event. Ideally these events would be correlated (although this does not currently occur in Geneos). The fact that a historical event is non-mutable is often a source of confusion for users. For example, they may expect a critical event to be removed from the list once the critical situation has been resolved, but it is not, because it is an historical artifact. Some clients have also tried to treat the event list as a 'to do' list, but since historical events cannot be changed or removed this is not effective. As an alerting mechanism it has the advantage of being ordered by time with the latest alert at the top, and you will also see events that occurred even if you were not observing the system at the time they happened (assuming you scan the list periodically). Disadvantages include the fact that red events will be present on screen for their lifetime, even if they have been fixed, so used in isolation the list may give the impression of on-going alerts which have in fact been fixed (you may be working in a world which is permanently red as a result, even if your alert governance is good). If you have not got a good handle on alert governance and get a lot of alerts, then items will enter the list too quickly to acknowledge (dozens at a time in some cases).
- Playing a sound, playing a sound at the point an alert occurs can be an effective alerting mechanism since it utilizes a sense that is not commonly associated with a desk or work environment, so it can stand out. However, great care needs to be taken over the frequency of the alerts that generate sounds, and over the exact sound used. Like the notifier, more than a few a day and staff will become desensitized or even annoyed. Even if the frequency is appropriate, the sound needs to be fairly benign and not intrusive. Choosing the 'red alert' siren from Star Trek or your favorite theme tune will last at best a few days before getting switched off. A good option is to record a spoken phrase which actually describes the alert, spoken with minimal emotion, for example 'Critical alert in application X', and then to set up specific alerts for each major app. You can use the built-in Windows sound recorder for this if you need to, and save the result as a WAV. In Geneos, sounds can be associated with Notifier events. As an alert mechanism they are therefore effective as a supplement to another system, but only for very specific events.
- Large screen displays, this alert mechanism consists of one or more large screens mounted on a wall or stand in proximity to the team responsible for dealing with the alerts. They display information but are not directly interacted with, i.e. if more investigation, detailed analysis, or remedial action is required then it is completed on some other terminal. They are therefore effective at displaying alerts which have communal ownership (everyone can see there is something to do), and they avoid the need to dedicate real estate on team members' desktops to a monitoring tool, instead allowing them to glance up from time to time to see if there is anything to action. They can be particularly effective when combined with sound, which can highlight the need to remember to look (getting over the issue of everyone being absorbed in their own work). It is, however, important that all the alerts which are expected to be shown via the large screen fit within the available screen real estate. That is to say, the designer should avoid the need to scroll the screen to show alerts, or to rotate between different screens. Observers should be able to assess whether action is required from a quick glance, rather than an extended viewing period (extended being 10 seconds or more). For this reason dashboards which focus on summary information translate well to a large screen. They can also be effective at highlighting to the wider organisation the state of play in a given group and the systems they manage. Where a team has a good handle on alert governance, this level of transparency can be a good motivator for the team to maintain that culture, and to avoid a decay in the quality of the alert handling.
- Dashboards, technically dashboards are a subset of the 'directly via the interface' alert mechanism, but their significance in the process of selecting an alerting mechanism is important enough to discuss separately. In Geneos at least, because they are a free form drawing and design tool, dashboards are unique in the variety of ways any given alert can be visualized. The size, colour, shape, widgets, labels and values for any given event can be tailored, allowing the designer to direct the team's focus within their monitoring environment at any given point in time. Like the other 'directly via the interface' alert methods, the alerts will be present for as long as they are on-going, and then be downgraded or removed once the alert state has ended. This does mean that events can be missed, the exception being charts, which have a temporal element. Dashboards are most effective when displaying summary information, highlighting the need to act while avoiding giving all the detail needed to perform the work. They become less effective, even cumbersome, when users try to replace detail views (such as metric views) with dashboards, or use them as historical reporting systems.
- Pushing alerts into a ticketing system, in this alerting mechanism tickets are automatically raised in an external ticketing system (such as JIRA or ServiceNow). We ignore manual movement of alerts, since in the context of this discussion we are talking about alerts, not processes within the team to move work into queues (a manual move assumes they have already been alerted, whereas here the raising of the ticket is itself the mechanism of alert). The main benefit is that tracking these alerts to resolution can be done in the formal context of a ticket and workflow. Most tools will thus provide a full audit trail and tightly defined ownership. There will also be a solid history of alerts, and of the actions taken as a result of those alerts, which can provide an excellent basis for reporting. However, where tickets are automatically raised, a high (or even low) number of false alerts can create a lot of noise and unnecessary work within the ticketing system while the tickets get shut down, causing pressure on the monitoring team and a reduction in advocacy around both the monitoring tool and the ticketing system. For example, in a well maintained ticket flow even 10 or so false alert tickets a day would quickly annoy a team who rely on such a system to manage their day to day work. There is also the risk of ticket storms, where a particularly poorly managed event in the monitoring results in dozens of alerts from a single root cause, or a serious event causes a significant outage (the middle of a crisis being the worst possible moment for a team to have 300 new tickets raised). If this is a chosen method of alerting it is therefore important to build in throttling, putting hard limits on the number of tickets that can be raised within a selected time span (accepting the risk of missing an alert if this is the only system of alert management). Consideration also needs to be given to the restart of monitoring components, i.e. if there are on-going alerts and a monitoring component is restarted, the alerts may re-fire on start up, causing duplicate tickets in the ticket system. The designer will need to find a way to identify an alert as a duplicate of an existing ticket, or to know that the alert has already been sent, via some persistent storage method. Such systems are not currently built into Geneos; some development will have to be done for the specific ticket system integration to allow this (a sketch of this kind of throttling and de-duplication is shown after this list). Teams will also need to keep the open tickets consistent with the state of the monitoring: if they do not keep on top of tickets in the ticket system, with the number and age of tickets increasing over time and falling out of line with the monitoring, then the relevance of the ticketing system will fall, and it will become less effective as an alerting system.
- E-mail, this involves sending E-mails on alerts to an individual or a group, with the E-mail containing the specifics of the alert. If the mails go to an individual and they are solely responsible for actioning the alerts then this may be a practical alerting system. As an individual they should be able to track what they have and have not fixed, and they benefit from a history of alerts in their inbox. The alerts can be moved via E-mail rules into specific folders and flagged as required to support a workflow. Outside of their inbox, however, there will be no audit trail, and it will be a single point of failure. This does not scale well into group working, whether the mails go to an E-mail group or into a set of individual E-mail boxes. What has and has not been actioned, what is in progress and what is being left behind is hard to track, and in the worst case tracking it generates even more E-mail to the group. It may also compound what is a common issue in most organisations around the sheer quantity of mail which staff members receive, adding to an already busy channel and allowing for the possibility of missed alerts. It is suggested that if it is necessary to automatically route alerts outside of the monitoring framework, this is done via a ticketing system, not E-mail.
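Returning to the ticketing mechanism above, here is a minimal sketch in Python of the throttling and de-duplication it calls for. This is not a Geneos feature; `create_ticket` is a stand-in for whatever integration raises tickets in your ticketing system, and the limits are example values only.

```python
import time
from collections import deque


def create_ticket(summary):
    # Stand-in for the real ticket-system integration (JIRA, ServiceNow, ...)
    print(f"TICKET RAISED: {summary}")
    return summary


class TicketThrottle:
    """Limit and de-duplicate automatically raised tickets."""

    def __init__(self, max_tickets=10, window_seconds=3600):
        self.max_tickets = max_tickets      # hard limit per window (example value)
        self.window_seconds = window_seconds
        self.recent = deque()               # timestamps of recently raised tickets
        self.open_keys = set()              # de-duplication keys of open tickets

    def raise_ticket(self, dedup_key, summary):
        now = time.time()
        # Drop timestamps that have fallen outside the time window
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        # A re-fired alert (e.g. after a component restart) should not create
        # a second ticket for the same on-going situation
        if dedup_key in self.open_keys:
            return None
        # Hard limit: accept the risk of a missed alert rather than flooding
        # the ticket system during an alert storm
        if len(self.recent) >= self.max_tickets:
            return None
        self.recent.append(now)
        self.open_keys.add(dedup_key)
        return create_ticket(summary)

    def resolve(self, dedup_key):
        # Call when the underlying situation clears and the ticket is closed
        self.open_keys.discard(dedup_key)
```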
Context sensitive
In a monitoring system of any scale, it is probably also true that there will not be a 'one size fits all' alerting mechanism. In much the same way as effective management of alerts requires customization to the specific behaviors of the monitored app, so the alerting requires specific tailoring to the intended audience of the alerts. Designers of a monitoring system should pick and choose what is appropriate, working with the teams involved to ensure they will both see and act on the alerts when they are generated. For example, in a team that is already saturated with E-mails from other systems and their working environment, even if the monitoring system generates just a few critical alerts a week, they may be missed due to these external factors.
Escalation
Escalation is desirable where alerts are not being actioned within the agreed time scales, and consideration needs to be given to what alerting mechanism is used for the escalation. While on one hand a simple change of severity is a form of escalation (assuming there is room for maneuver and you are not already red), so is a change of alerting type and audience. A common process, for example (though not automatically the best), would be to show a red on the console for a time, then send a mail to an individual or group of individuals if that situation persists. A change of alerting type is likely in the case of escalation, since by definition the previous method has not worked. The same care and attention needs to be taken when considering escalation alerting types, if not more so, given the escalation is likely to go to more senior resources who work in different locations and have a different focus.
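By way of illustration only, the sketch below (Python) shows the shape of a simple time-based escalation policy. The channels and timings are hypothetical; in practice each step would map onto one of the alerting mechanisms discussed above and the agreed response times for each severity.

```python
from datetime import datetime, timedelta

# Hypothetical escalation steps: (time since the alert started, action to take)
ESCALATION_STEPS = [
    (timedelta(minutes=0),  "show red on the console"),
    (timedelta(minutes=15), "e-mail the on-call engineer"),
    (timedelta(minutes=60), "notify the team lead and raise a ticket"),
]


def escalation_action(alert_started, now=None):
    """Return the most severe escalation step this alert has reached."""
    now = now or datetime.now()
    age = now - alert_started
    action = None
    for threshold, step in ESCALATION_STEPS:
        if age >= threshold:
            action = step
    return action


# Example: an alert that has been red for 20 minutes escalates to e-mail
print(escalation_action(datetime.now() - timedelta(minutes=20)))
```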
Specific things you can do in Geneos to solve the common monitoring mistakes
Available Severity levels
Before going into detail on the options for managing alerts it is important to highlight the available severity levels. There are four in Geneos, and each has an associated numeric value and colour.
Severity Level | Numeric value | Colour |
---|---|---|
Critical | 3 | Red |
Warning | 2 | Yellow |
Ok | 1 | Green
Undefined | 0 | Grey |
Snoozing
Snoozing - manually disabling alerts
Any data item can be snoozed in Geneos. By a data item we mean any of a Gateway, Probe, Entity, Sampler, Dataview, Table cell or Headline. Snoozing has just one effect:
"Snoozing stops the propagation of severity from that data item to its parent"
The picture below shows the effect of snoozing a critical cell on its parent.
It does not affect the severity of the data item you have snoozed. For example, in the screen shot below you can see a cell with a critical severity has been snoozed. This means it will no longer propagate its severity to its parent (the dataview). Because this was the only critical cell in the dataview, the dataview's severity becomes OK. However, the cell itself remains at critical severity.
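This effect on severity propagation can be modelled with a short sketch (Python, illustrative only; the severity values match the table earlier in this article): the snoozed child keeps its own severity, but is ignored when the parent's severity is computed.

```python
# Severity levels and their numeric values, as listed earlier in this article
SEVERITY_VALUE = {"undefined": 0, "ok": 1, "warning": 2, "critical": 3}


def parent_severity(children):
    """children: list of (severity, snoozed) tuples for the child data items.

    Snoozed children keep their own severity but do not propagate it upwards.
    Treating a parent with no visible children as undefined is an assumption
    of this sketch, not documented Geneos behaviour.
    """
    visible = [severity for severity, snoozed in children if not snoozed]
    if not visible:
        return "undefined"
    return max(visible, key=SEVERITY_VALUE.get)


# A critical cell that has been snoozed no longer drives its dataview red:
cells = [("critical", True), ("ok", False), ("ok", False)]
print(parent_severity(cells))   # -> "ok", while the snoozed cell itself stays critical
```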
The act of snoozing an item is manual, i.e. an operator makes a conscious decision to suppress an alert. At the point of snoozing they can select an exit condition for the snooze to end. If you review the screen shot above you can see examples of exit conditions in the snooze menu. By default the menu includes the 'Manual' option, which means the snooze can only be removed by an operator. There are plenty of legitimate reasons to snooze alerts, for example:
- There may be planned or unplanned maintenance
- The operator knows the situation is temporary, and snoozes the data for a short period assuming it will clear
- The Alert is caused by a problem upstream which is being looked into
Managing snoozes
The danger of any system which allows manual suppression of alerts (for arbitrary time scales, or without planned or reasonable exit conditions) is that operators use snoozing as a mechanism to handle being overwhelmed by alerts - or in short - they snooze everything. As should be obvious, this is not a good strategy for dealing with excessive alerts and should be actively discouraged. Snoozed items should therefore be actively managed. There are a number of tools in Geneos that will help with this.
Snooze Dockable
You can add a view to the console that displays a list of all the snoozed items. The view below shows all the snoozed cells and managed entities in the connected gateways.
The paths that drive this view however can be quite expensive since they look at all cells all the time.
Gateway Snooze data view
Within the gateway itself you can add a Gateway plugin that lists all the snoozed items; the XML for the sampler is below.
Rules and alerts can be set on this dataview as normal, allowing users and managers to track snoozes in their system. It includes a 'Snooze Type' column, which helps identify what exit criteria users are selecting when suppressing alerts via snoozing.
Stopping manual snoozes
You as an organisation may decide that it is never appropriate to use the snooze command without a valid exit condition, in which case you can use the security settings to remove this option for selected users.
Taking account of snoozes in actions
By default, if a data item or any of its ancestors is snoozed then Actions run within rules will not fire. This setting is defined under the Advanced section of the Action definition in the setup. You can also be more explicit by adding a check in the rule block itself, for example:
Note, however, that the rule above would not take account of the cell's ancestors (for example the Managed Entity it is on).
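As a rough sketch (Python, illustrative only, not the Geneos rule syntax), the difference between checking only the target item and also checking its ancestors looks like this:

```python
def should_fire_action(item_snoozed, ancestors_snoozed, check_ancestors=True):
    """Decide whether an Action should fire for a given data item.

    ancestors_snoozed: snooze flags for the item's ancestors (e.g. the
    Managed Entity a cell sits on). Checking them mirrors the default
    behaviour; skipping them mirrors an explicit check on the cell alone.
    """
    if item_snoozed:
        return False
    if check_ancestors and any(ancestors_snoozed):
        return False
    return True


# Cell not snoozed, but its Managed Entity is:
print(should_fire_action(False, [True], check_ancestors=False))  # True  - fires anyway
print(should_fire_action(False, [True], check_ancestors=True))   # False - default behaviour
```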
Programmatically disabling monitoring
Active vs inactive monitoring - automatically disabling monitoring
Every data item in Geneos also has an 'Active' status. By default all data items are active; if they are made inactive then they do not propagate their severity to their parent (in the same way as snooze blocks severity propagation), thus:
"An Inactive status stops the propagation of severity from that data item to its parent"
Unlike snoozing a data item, which is a manual action, changing the active status is performed programmatically. The severity of the data item is not changed, so if a cell is critical it remains critical; it just does not influence the severity of its parent.
There are two main methods to set an item inactive:
Within a rule block
You can set the Active Status of a data item explicitly within a rule block as a literal, for example:
The XML for the above rule would be of the form
Active Times
When considering alerting, it is often relevant to also consider whether the systems are expected to be operational at any given point of the day, month or year. Alerting outside these times can generate unnecessary noise for the responsible team, or for others who may be in another time zone. In Geneos, designers can use Active Times to suppress alerts during downtime. An Active Time can be set within the gateway setup, and used in a number of places; one of the most common is in rules:
In the above example an active time is used explicitly in the rule block, and in the second example (on the right of the figure) in the rule's active time settings. In the case of the rule's active time setting, the whole rule is only active when within that active time. When referencing active times within the body of a rule we can be more granular. Since severity is ONLY defined by rules, when a rule is outside its active time it will not be setting severity, which, in effect, suppresses an alert. An example of the rule is provided below.
And here is the active time XML for a sample working day
Active times can also be referenced in other parts of the setup; the effect on alert suppression depends on the location:
Location | Effect on suppressing alerts |
---|---|
Sampler - Advanced | If a sampler is outside of active time then it will not sample and the connected data views will be empty |
Alerting --> Advanced | Determines whether the alerts will fire, which they will not when they are outside of an active time |
Database Logging --> Items --> Item --> Advanced | Determines whether changes to the data item values will be logged to the database. The suppression of alerts may extend into reporting as well, so ensuring that only events of interest go into the database may be as important as the real time alerts. |
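To make the behaviour concrete, here is a minimal sketch (Python, example times only) of the kind of check an active time performs; in Geneos this is configured declaratively in the gateway setup rather than in code.

```python
from datetime import datetime, time as dtime

# Example 'working day' window: Monday-Friday, 08:00-18:00
ACTIVE_DAYS = {0, 1, 2, 3, 4}        # Monday = 0 ... Friday = 4
ACTIVE_START = dtime(8, 0)
ACTIVE_END = dtime(18, 0)


def within_active_time(now=None):
    now = now or datetime.now()
    return now.weekday() in ACTIVE_DAYS and ACTIVE_START <= now.time() < ACTIVE_END


def evaluate_severity(value, threshold):
    """Only set severity inside the active window; outside it the rule
    effectively does nothing, which suppresses the alert."""
    if not within_active_time():
        return "undefined"
    return "critical" if value > threshold else "ok"


print(evaluate_severity(97, 95))   # critical during the working day, undefined otherwise
```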
Alerts with more complex Signatures
Working with more complex alert conditions
Often an alert situation is more complex than the designer builds into their monitoring; the simplistic point value case can trigger more frequently than the actual alert condition which is affecting the business. Examples include:
- A CPU that spikes, but then returns to normal values; in this case the alert condition may only be valid if there is an extended period of high CPU.
- A process that has an automated restart
- There is redundancy in the systems, and this would need to break as well for an alert condition to occur.
Using Delays
A delay can be built into a rule transaction; it stops the remainder of the transaction from occurring until that delay has passed and the original condition has remained true. Delays are useful for conditions which may self correct, for example a CPU that has gone over 90% and stayed above 90% for 60 seconds.
An example can be seen below.
The delay can be specified in terms of seconds
delay 60
or samples; thus if the sample time was 20 seconds, 2 samples would be 40 seconds.
delay 2 samples
Another example is a process that has an automated restart script, such that if it goes down it is restarted. In this case the fact that it has failed is of interest, but it is not an alert until the restart has also failed. In the rule below we are looking at a process plugin and expecting there to be a single instance. We have added an Action to the gateway which, should Geneos detect that the process count is 0, will try to restart the process automatically.
The rule will trigger the restart, and turn the cell Warning while the restart is being performed. If the restart is successful and the instance count goes back to 1, the cell will become OK. If the process count stays 0 for 60 seconds, then the cell will turn Critical - it is a genuine alert that requires action.
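The logic of the delay, and of the restart example above, can be sketched as follows (Python, illustrative only; the thresholds and timings are example values, and this is not the Geneos rule syntax):

```python
import time


class DelayedCondition:
    """Escalate only if a condition stays true for the whole delay period."""

    def __init__(self, delay_seconds=60):
        self.delay_seconds = delay_seconds
        self.first_true = None               # when the condition last became true

    def evaluate(self, condition_true, now=None):
        now = time.time() if now is None else now
        if not condition_true:
            self.first_true = None           # condition cleared: the delay resets
            return "ok"
        if self.first_true is None:
            self.first_true = now
        if now - self.first_true >= self.delay_seconds:
            return "critical"                # sustained breach: genuine alert
        return "warning"                     # breached, but still within the delay


# Example: a process count of 0 only becomes critical after 60 seconds,
# giving the automated restart a chance to bring the instance back.
process_rule = DelayedCondition(delay_seconds=60)
instance_count = 0
print(process_rule.evaluate(condition_true=(instance_count == 0)))   # 'warning' at first
```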
In the example above we are still generating a warning alert for the duration of time that the process is down, even though we expect it to recover without human intervention. We could choose to be aggressive in limiting alerts by not generating the warning at all, simply assuming that recovery of this process is part of BAU. We could also consider utilizing the OK and Undefined severities a little more: for example, have the cell be undefined when the instance count is 1 (i.e. everything is fine and there is nothing of interest), then have it turn OK during the restart process, i.e. everything is OK, but just an FYI that a restart is underway.
This use of the Undefined severity as a valid state for 'everything is OK' can help extend the use of the severity levels in Geneos, allowing OK to be used as 'of interest, but not yet at warning level'.
Use of History Periods for more temporal alerting vs point values
While the delay function is useful for detecting extended periods of a selected state, it suffers in that if the condition becomes untrue, even for a short time, the delay is reset. For example, a server may exhibit high CPU for a number of minutes but have brief periods where it drops below the selected threshold, or an auto-restarting process may restart many times in a short period. Both of these may be valid alert states, but will not be detected by the 'delay' method.
If we consider the auto-restarting use case, let's say that as well as detecting when the process fails to come back up, we are also interested if it restarts 5 times (or more) in one hour. We can achieve this by monitoring the average instance count over the hour. If the process never goes down this should be 1; anything below 1 means at least one restart occurred. Assuming the restarts are working and the sample time of the process sampler is 20 seconds, then any average below 0.972 means at least 5 restarts occurred, or the process was down over multiple samples - both worthy of attention.
(3600 seconds / 20 second sample time) = 180 samples per hour
If 5 of those samples are 0, the average is 175 / 180 ≈ 0.972
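The same arithmetic as a short worked example (sample values only):

```python
sample_time_seconds = 20
samples_per_hour = 3600 // sample_time_seconds          # 180 samples in the hour
failed_samples = 5                                      # samples where the count was 0
average_instance_count = (samples_per_hour - failed_samples) / samples_per_hour
print(round(average_instance_count, 3))                 # 0.972 -> alert below ~0.972
```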
There are a few steps we need to take to set this up in Geneos:
1) Create a history period for the selected time; this goes in the rules section of the setup. The example XML is below.
2) We also need to add an additional column to the selected process sampler which will retain the average instance count; this is added via the Advanced tab of the sampler. An example is shown below.
In this particular rule the severities have been graded, such that:
- If the process has not failed in the last hour it is undefined
- If the process has failed between 1-4 times, it is OK, using OK essentially as an informational state
- If it has failed 5 or more times, it is a warning alert
- and if it has been down for at least 50% of the time, it is critical
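A sketch of that grading in Python (illustrative only; the thresholds follow the list above and assume a 20 second sample time, i.e. 180 samples per hour):

```python
def restart_severity(average_instance_count, samples_per_hour=180):
    """Grade severity from the hourly average instance count."""
    failed_samples = round((1 - average_instance_count) * samples_per_hour)
    if failed_samples == 0:
        return "undefined"                # no failures in the last hour
    if failed_samples < 5:
        return "ok"                       # informational: a handful of auto-restarts
    if average_instance_count > 0.5:
        return "warning"                  # 5 or more restarts in the hour
    return "critical"                     # down for at least half of the hour


print(restart_severity(1.0))      # undefined
print(restart_severity(0.98))     # ok (about 3-4 failed samples)
print(restart_severity(0.95))     # warning (about 9 failed samples)
print(restart_severity(0.4))      # critical
```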
User Assignment
In the event that an alert has occurred which requires action within some time scale (generally, therefore, warning and critical), it is likely that a team member will pick it up for review. Once this has occurred there may be a case for downgrading the alert, in much the same way as snoozing a cell. The designer of the monitoring can use the act of user assignment within their rules to change the state of the system.
Any data item can be assigned (so Gateways, Probes, Managed Entities, Samplers, Dataviews, table cells and headlines). The act of assigning a user has no default impact on the severity of a data item unless the rules are designed to take it into account. For example, the rule shown below will turn the cell warning; if it is assigned it will turn OK, with the assignment icon to mark that it is being dealt with.
This would have the following effect:
If you do choose to use user assignment as a way of dealing with alerts, it may also be of interest to track what is and what is not user assigned within your environment. As with the monitoring of snoozes, there is a gateway plugin that tracks assignments in a system. The view includes the number of minutes that the data item has been assigned, so you can include rules to look for items that have been assigned for extended periods.
Note that unlike snooze, user assignment does not have other indicators in the likes of the state tree and Entities view. You can also create a list view within the console that shows the list of assigned items.
When an item is user assigned, the operator can select an exit condition for the assignment: for example, until the severity changes, until a date and time or for a duration, or until the value changes. There is also a simple assignment with no automatic exit condition. You as an organisation may decide that it is never appropriate to use the user assignment command without a valid exit condition, in which case you can use the security settings to remove this option.
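A minimal sketch of the assignment-aware severity logic described above (Python, illustrative only; in Geneos this check sits inside the rule, and the assignment icon is shown regardless of severity):

```python
def severity_with_assignment(condition_breached, assigned):
    """Downgrade a warning to OK while a user is assigned to the item."""
    if not condition_breached:
        return "ok"
    return "ok" if assigned else "warning"


print(severity_with_assignment(condition_breached=True, assigned=False))  # warning
print(severity_with_assignment(condition_breached=True, assigned=True))   # ok, being dealt with
```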