NOTE: This page is a work in progress. We will add to it over time as we identify more configuration that can help detect high alert levels.
|ALERT GOVERNANCE - Detection Template
You will need the following to deploy the Alert Governance Identification Template
The following are the steps you will need to deploy the Alert Governance Identification Template
|Types of false alert
Within the Alert Governance part we identified a number of situations that can generate high or false alert levels. They are listed below.
Even if the frequency of alerts seems workable, a team that receives dozens of critical alerts a week may be dealing with very unstable systems that are having a significant impact on the business, or with poorly configured monitoring. The simplest way to decide what severity level a given alert should have is to ask how long you will tolerate that situation before acting. In rough terms:
In all cases, an action is defined as something that either reduces the severity level, and therefore buys you more time, or resolves the situation completely.
Below we try to define systems that can be used to detect these situations.
|DEALING WITH HIGH FREQUENCY ALERTS (TYPE 2 ALERTS)
In this section we deal with the specific case where a given alert occurs very frequently. Specifically we are interested in:
The theory is that the same alert firing too frequently within a short window of time essentially creates noise in the monitoring system, i.e. we get 100 alerts in 5 minutes where 1 would have done. Note that this is not an attempt to solve the more complex case of root cause analysis (i.e. one thing occurring causes multiple other alerts and events both upstream and downstream).
The most significant question to consider in this scenario is one of equivalence. On what condition is one alert considered equivalent to another? The most precise answer is that the same cell switches severity frequently (OK --> Critical --> OK --> Critical), but when we investigated these kinds of scenarios this was almost never the case. Instead we found cases where equivalence could be justified at the row, column or data view level. For example, a column of data generated X alerts, where X was the number of rows, because every row turned critical at the same time; or a data view had aggressive rules, which meant that a large percentage of the alerts on a given gateway could be attributed to that view. In both cases the events tables for these gateways were being populated with thousands of alerts. We therefore found it practical to consider two alerts equivalent if they came from the same data view within a given time window.
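The equivalence rule above (same data view, same time window) can be illustrated with a small sliding-window count. This is only a sketch written outside the product, not the template's own implementation: the function name, the `(timestamp, data_view, severity)` tuple layout and the severity ranking are hypothetical, and treating the severity as "this level or higher" is an assumption. The default thresholds mirror the shipped defaults quoted later on this page.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical severity ordering, used only for this illustration.
SEVERITY_RANK = {"ok": 0, "warning": 1, "critical": 2}


def noisy_data_views(alerts, min_alerts=5, window_seconds=120,
                     lookback_hours=24, severity="critical", now=None):
    """Return {data_view: peak_count} for every data view that produced at
    least `min_alerts` alerts inside some `window_seconds` window during
    the last `lookback_hours`.

    `alerts` is an iterable of (timestamp, data_view, severity) tuples.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(hours=lookback_hours)
    window = timedelta(seconds=window_seconds)
    floor = SEVERITY_RANK[severity]

    # Group qualifying alert timestamps by data view: alerts from the same
    # view are treated as equivalent (assumption: severity at or above floor).
    by_view = defaultdict(list)
    for ts, view, sev in alerts:
        if cutoff <= ts <= now and SEVERITY_RANK[sev] >= floor:
            by_view[view].append(ts)

    flagged = {}
    for view, stamps in by_view.items():
        stamps.sort()
        recent = deque()  # timestamps that fall inside the current window
        peak = 0
        for ts in stamps:
            recent.append(ts)
            # Drop timestamps that are older than the window allows.
            while ts - recent[0] > window:
                recent.popleft()
            peak = max(peak, len(recent))
        if peak >= min_alerts:
            flagged[view] = peak
    return flagged
```

Grouping by data view rather than by cell is the design choice that matters here: a view whose rows all turn critical together is flagged once, with the size of its largest burst, instead of appearing as many unrelated alerts.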
We can therefore modify our high frequency alert description to be:
Assuming you have deployed the template you will see the following view
Each row shows a data view that has generated X or more alerts in a Y second window within a time period Z. The defaults as shipped are 5 or more critical alerts, all within a 120 second window, during the last 24 hours. You can modify the parameters in the Environments --> Alerts parameters section of the include file:
In addition, there will be a historical view of the number of alerts at the selected severity level that have occurred on the gateway.
Each row is a day, and displays the number of alerts that have occurred at the selected severity level or higher (defined via the ALERTS_FREQUENCY_SERVERITY_THRESHOLD attribute). The table will go back X months where X is defined by the ALERTS_HISTORY_FREQUENCY_MONTHS environment variable. The data can be cut and pasted into Excel to produce a chart over time of alert levels.
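For readers who want to reproduce the per-day figures from their own export of alert events, the counting described above can be sketched as follows. This is an illustration under stated assumptions, not the template's logic: the function names and the `(timestamp, severity)` tuple layout are hypothetical, the two keyword parameters merely stand in for ALERTS_FREQUENCY_SERVERITY_THRESHOLD and ALERTS_HISTORY_FREQUENCY_MONTHS, and starting the history on the first day of the month X months back is a guess at how "X months" is measured.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical severity ordering, used only for this illustration.
SEVERITY_RANK = {"ok": 0, "warning": 1, "critical": 2}


def months_back(day, months):
    """Return the first day of the month `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def daily_alert_counts(alerts, severity_threshold="critical",
                       history_months=3, today=None):
    """Return a sorted list of (day, count) rows, one per day that had at
    least one alert at `severity_threshold` or higher.

    `alerts` is an iterable of (timestamp, severity) tuples.
    """
    today = today or date.today()
    start = months_back(today, history_months)  # assumed history cutoff
    floor = SEVERITY_RANK[severity_threshold]
    counts = Counter(
        ts.date()
        for ts, sev in alerts
        if start <= ts.date() <= today and SEVERITY_RANK[sev] >= floor
    )
    return sorted(counts.items())
```

Each returned row is a `(date, count)` pair, so the result can be written out as tab-separated text and pasted into a spreadsheet to chart alert levels over time.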
Correcting Frequency based alerts
While the template will highlight data views that are generating excessive alerts within the selected time brackets, it cannot advise on whether a view needs to be fixed or, if so, what to do. That takes expert judgement and knowledge of why the view is set up the way it is. In principle the objective is to reduce the number of alerts that are generated within a short time scale. Methods include: