Why tuning is needed
Any monitoring system, including op5 Monitor, needs to be adjusted to suit your organization's specific needs. This how-to describes the basics of identifying the sources of your notifications and adjusting thresholds and configuration to get rid of false, uninteresting, or exaggerated alerts ("alarms"/notifications). A new installation definitely needs tuning, but it is also a good idea to generate these reports on a regular basis so that you do not get flooded with notifications over time.
Basic workflow
When tuning an op5 Monitor configuration, you follow three basic steps:
- Create a report to identify your top alert producers
- Make a decision on what to do (use the table below)
If? | then? |
---|---|
the problem needs to be, or will be, fixed | acknowledge the problem and/or disable notifications and/or active checks for the host/service (see the sketch after this list). |
it is a false, uninteresting, or exaggerated source of alerts | identify what needs to be adjusted and set new thresholds, time periods, timeouts and check_attempts. |
- Perform a configuration change or acknowledge the problem (use comments).
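If you decide that the problem is being worked on and you simply want the notifications to stop in the meantime, you can acknowledge it or disable notifications and/or active checks. In op5 Monitor this is normally done from the host/service information page in the GUI, but underneath it maps to ordinary Nagios/Naemon-style object settings. A minimal sketch, where the host name, service description, check command and the "default-service" template are made-up/assumed for illustration:

```
# Sketch only: names, command and template are assumptions, not real objects.
define service {
    use                     default-service   ; assumed template name
    host_name               srv-db01          ; made-up host
    service_description     Nightly backup job
    check_command           check_backup      ; placeholder command
    notifications_enabled   0                 ; silence notifications while the fix is in progress
    active_checks_enabled   0                 ; optionally stop active checks as well
}
```

Remember to re-enable notifications and active checks once the underlying problem has been fixed.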
Alert Summary
First, create an Alert Summary report using the suggested click-path/options below.
Report -> 'Summary' -> 'Create Summary Report'
Report Type: Top Hard Alert Producers
Report Period: Last 7 days
State Types: Hard States
(because notifications are only generated on hard states, i.e. after the last check_attempt)
Host States: Host Problem States
Service States: Service Problem States
-> 'Create Summary Report!'
The report you get lists your top sources of alerts resulting in problem-state notifications, i.e., the alerts you get SMS and/or e-mail about.
Finding the problem
If one of your listed top alert producers turns out not to be an actual problem, you need to find out why, when, and how often the false, uninteresting, or exaggerated "problem" occurs, so that you can adjust your configuration accordingly.
This is where some views in op5 Monitor come in handy. Using the Alert Summary report as a starting point, you can click on the host/service names in the report to reach the host/service information pages. On the information pages you can:
- take a look at simple graphs (for the most common services)
- see how long you've had the problem (Current State Duration)
- read any comments about the host/service
- reach even more detailed views about the problem:
View Availability Report For This Host/Service: Produces an availability report where you can see the extent of the problem (hours/minutes of total unscheduled downtime).
View Trends For This Host/Service: Gives you an overview of when, how often, and for how long the problem occurs.
View Alert History For This Host/Service: Gives you a very detailed list of all alert events. This information is often needed to determine which threshold triggered the alert.
Making the adjustments
Depending on what conclusions you have drawn about the alert producer, you now need to make some adjustments to your configuration. Two common problems and their solutions are listed below. There are, of course, many other adjustments that can be made; some other commonly adjusted parameters are listed below the table.
If? | then? |
---|---|
you get a lot of notifications during backup hours | create a new time period excluding the backup hours and use it as check_period and/or notification_period for the host/service (see the example below). |
you get a lot of notifications from PING services at the far end of your WAN connections | adjust the warning/critical thresholds of the PING check (round-trip time and packet loss) to values that suit the WAN link, and/or increase max_check_attempts (see the example after the parameter list below). |
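For the backup-hours case in the table above, the adjustment amounts to defining a time period that excludes the backup window and referring to it from the host/service. In op5 Monitor you would normally create the time period in the GUI, but the underlying Nagios/Naemon-style objects look roughly like this; the 01:00-04:00 window, the names and the template are examples/assumptions only:

```
# Assumed example: a 24x7 period with a nightly 01:00-04:00 backup window cut out.
define timeperiod {
    timeperiod_name  no-backup-hours
    alias            24x7 except nightly backup window
    monday           00:00-01:00,04:00-24:00
    tuesday          00:00-01:00,04:00-24:00
    wednesday        00:00-01:00,04:00-24:00
    thursday         00:00-01:00,04:00-24:00
    friday           00:00-01:00,04:00-24:00
    saturday         00:00-01:00,04:00-24:00
    sunday           00:00-01:00,04:00-24:00
}

define service {
    use                  default-service    ; assumed template name
    host_name            srv-file01         ; made-up host
    service_description  Disk I/O
    check_period         no-backup-hours    ; do not run active checks during the backup window
    notification_period  no-backup-hours    ; ...or keep checking and only suppress notifications
}
```

Whether you change check_period, notification_period or both depends on whether you still want the state history to be recorded during the backup window.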
Other commonly adjusted parameters (please see the inline help or the manual for details):
max_check_attempts
normal_check_interval
retry_check_interval
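The parameters above all live on the host/service definition. Below is a sketch of a WAN PING service tuned to be more tolerant; the check_ping argument format, the threshold values, the host name and the template are assumptions that you should verify against your own command definitions:

```
# Assumed example of a more tolerant PING service for a slow WAN link.
define service {
    use                    default-service                  ; assumed template name
    host_name              remote-office-rtr                ; made-up host
    service_description    PING
    check_command          check_ping!500,20%!1000,60%      ; assumed argument format: warning/critical RTA (ms) and packet loss
    max_check_attempts     5    ; five consecutive failed checks before a hard state
    normal_check_interval  5    ; minutes between checks in OK state (with the default interval_length)
    retry_check_interval   1    ; minutes between re-checks while in a soft problem state
}
```

Raising max_check_attempts is often the cheapest way to get rid of notifications caused by single dropped packets, since a hard state (and therefore a notification) then requires several failed checks in a row.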