Why tuning is needed
Any monitoring system, including op5 Monitor, needs to be adjusted to suit your organization's specific needs. This how-to describes the basics of identifying the sources of your notifications and adjusting thresholds and configuration to get rid of false, uninteresting, or exaggerated alerts ("alarms"/notifications). A new installation definitely needs tuning, but it is also a good idea to generate these reports on a regular basis so that you do not get flooded with notifications over time.
Basic workflow
When tuning an op5 Monitor configuration, you follow three basic steps:
- Create a report to identify your top alert producers
- Make a decision on what to do (use the table below)
If? | then? |
---|---|
the problem needs to be, or will be, fixed | acknowledge the problem and/or disable notifications and/or active checks for the host/service (see the sketch after this list). |
it is a false, uninteresting, or exaggerated source of alerts | identify what needs to be adjusted and set new thresholds, time periods, timeouts and check_attempts. |
- Perform a configuration change or acknowledge the problem (use comments).
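If you decide that the problem is being worked on and you simply want the notifications to stop in the meantime, you can acknowledge it or disable notifications and/or active checks. In op5 Monitor this is normally done from the host/service information page in the GUI, but underneath it maps to ordinary Nagios/Naemon-style object settings. A minimal sketch, where the host name, service description, check command and the "default-service" template are made-up/assumed for illustration:

```
# Sketch only: names, command and template are assumptions, not real objects.
define service {
    use                     default-service   ; assumed template name
    host_name               srv-db01          ; made-up host
    service_description     Nightly backup job
    check_command           check_backup      ; placeholder command
    notifications_enabled   0                 ; silence notifications while the fix is in progress
    active_checks_enabled   0                 ; optionally stop active checks as well
}
```

Remember to re-enable notifications and active checks once the underlying problem has been fixed.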
Alert Summary
First, create an Alert Summary report using the suggested click-path/options below.
Report -> 'Summary' -> 'Create Summary Report'
Report Type: Top Hard Alert Producers
Report Period: Last 7 days
State Types: Hard States
(because notifications are only generated on hard states, i.e. after the last check_attempt)
Host States: Host Problem States
Service States: Service Problem States
-> 'Create Summary Report!'
The report you get lists your top sources of alerts resulting in problem-state notifications, i.e., the alerts you get SMS and/or e-mail about.
Finding the problem
If one of your listed top alert producers turns out not to be an actual problem, you need to find out why, when, and how often the false, uninteresting, or exaggerated "problem" occurs, so that you can adjust your configuration accordingly.
This is where some views in op5 Monitor come in handy. Using the Alert Summary report as a starting point, you can click on the host/service names in the report to reach the host/service information pages. On the information pages you can:
- take a look at simple graphs (for the most common services)
- see how long you've had the problem (Current State Duration)
- read any comments about the host/service
- reach even more detailed views about the problem:
View Availability Report For This Host/Service: Produces an availability report where you can see the extent of the problem (hours/minutes of total unscheduled downtime).
View Trends For This Host/Service: Gives you an overview of when, how often, and for how long the problem occurs.
View Alert History For This Host/Service: Gives you a very detailed list of all alert events. This information is often needed to determine which threshold triggered the alert.
Making the adjustments
Depending on what conclusions you have drawn about the alert producer, you now need to make some adjustments to your configuration. Two common problems and their solutions are listed below. There are, of course, many other adjustments that can be made; some other commonly adjusted parameters are listed below the table.
If? | then? |
---|---|
you get a lot of notifications during backup hours | create a new time period excluding the backup hours and use it as check_period and/or notification_period for the host/service (see the example below). |
you get a lot of notifications from PING services at the far end of your WAN connections | adjust the warning/critical thresholds of the PING check (round-trip time and packet loss) to values that suit the WAN link, and/or increase max_check_attempts (see the example after the parameter list below). |
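For the backup-hours case in the table above, the adjustment amounts to defining a time period that excludes the backup window and referring to it from the host/service. In op5 Monitor you would normally create the time period in the GUI, but the underlying Nagios/Naemon-style objects look roughly like this; the 01:00-04:00 window, the names and the template are examples/assumptions only:

```
# Assumed example: a 24x7 period with a nightly 01:00-04:00 backup window cut out.
define timeperiod {
    timeperiod_name  no-backup-hours
    alias            24x7 except nightly backup window
    monday           00:00-01:00,04:00-24:00
    tuesday          00:00-01:00,04:00-24:00
    wednesday        00:00-01:00,04:00-24:00
    thursday         00:00-01:00,04:00-24:00
    friday           00:00-01:00,04:00-24:00
    saturday         00:00-01:00,04:00-24:00
    sunday           00:00-01:00,04:00-24:00
}

define service {
    use                  default-service    ; assumed template name
    host_name            srv-file01         ; made-up host
    service_description  Disk I/O
    check_period         no-backup-hours    ; do not run active checks during the backup window
    notification_period  no-backup-hours    ; ...or keep checking and only suppress notifications
}
```

Whether you change check_period, notification_period or both depends on whether you still want the state history to be recorded during the backup window.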
Other commonly adjusted parameters (please see the inline help or the manual for details):
max_check_attempts
normal_check_interval
retry_check_interval
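The parameters above all live on the host/service definition. Below is a sketch of a WAN PING service tuned to be more tolerant; the check_ping argument format, the threshold values, the host name and the template are assumptions that you should verify against your own command definitions:

```
# Assumed example of a more tolerant PING service for a slow WAN link.
define service {
    use                    default-service                  ; assumed template name
    host_name              remote-office-rtr                ; made-up host
    service_description    PING
    check_command          check_ping!500,20%!1000,60%      ; assumed argument format: warning/critical RTA (ms) and packet loss
    max_check_attempts     5    ; five consecutive failed checks before a hard state
    normal_check_interval  5    ; minutes between checks in OK state (with the default interval_length)
    retry_check_interval   1    ; minutes between re-checks while in a soft problem state
}
```

Raising max_check_attempts is often the cheapest way to get rid of notifications caused by single dropped packets, since a hard state (and therefore a notification) then requires several failed checks in a row.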