This article describes two ways of resolving incorrect report data. Which method to use is at your own discretion; contact support if you are unsure. The methods are:
- Refreshing and rebuilding the back-end SQL data that is used in creating reports of different kinds in Monitor ("synchronizing and reinserting").
- or, inserting a global OK state for all objects at a certain point in time.
The first process collects the naemon log data of all configured Monitor nodes in your cluster (masters, peers and pollers) and uses this data as a source of truth to rebuild the
report_data database table.
This process deletes and re-creates entries in the report_data table, and can therefore possibly destroy data. It should only be performed if recommended by a technical contact from support or professional services. Remember to back up your data (example below). The process requires:
- A basic understanding of Linux, SSH and the command line interface.
- A planned service window where all monitoring performed by Monitor will be temporarily disabled (including all peers/pollers).
- No scheduled downtime of host or service objects may be active at the time of the service window.
- The currently installed version of Monitor is at least 7.0.3.
Option 1: Process for rebuilding report data from logs (synchronize and reinsert)
Verify that the currently installed Monitor version is at least 7.0.3:
# cat /etc/op5-monitor-release
Verify that no hosts or services are currently within a scheduled downtime, and that none will enter one anytime soon. This information can be found on the Scheduled Downtime page in the web interface. Recurring downtime entries, if any, are inserted around midnight, so make sure the process is not started any time near midnight.
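Since recurring downtimes are inserted around midnight, it can help to check how far the current time is from midnight before starting. A minimal sketch; the helper name and the 30-minute margin are illustrative choices, not part of Monitor:

```shell
# Hypothetical helper: succeeds if a HH:MM time is within 30 minutes of midnight
near_midnight() {
    local h=${1%:*} m=${1#*:}
    local mins=$((10#$h * 60 + 10#$m))   # minutes since 00:00 (10# avoids octal parsing)
    [ "$mins" -le 30 ] || [ "$mins" -ge $((24 * 60 - 30)) ]
}

if near_midnight "$(date +%H:%M)"; then
    echo "Too close to midnight - wait before starting the reinsert"
else
    echo "Safe to start"
fi
```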
Log on to your report generating master node (your "primary master") via SSH, and preferably using a terminal multiplexer like screen or tmux to ensure the process continues if the connection drops, and perform the steps below:
Make sure that the "op5 community" package repository and the "support tools" are installed.
Shut down HTTPD and Naemon + Merlin on all nodes (this command will propagate to all nodes):
# mon node ctrl --self --all -- "service httpd stop && mon stop"
Back up the current report data table in full, using the command below. Make sure that there is sufficient free space in the target directory (~) before executing it; the dump may take some time to complete.
# mysqldump merlin report_data | gzip > ~/merlin.report_data.sql.gz
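Once the dump has finished, it is worth sanity-checking that the file is intact before continuing. A minimal sketch, using a stand-in file under /tmp so the example can run anywhere; in practice, point DUMP at ~/merlin.report_data.sql.gz:

```shell
# Stand-in for the real dump, so the check can be demonstrated anywhere
DUMP=/tmp/merlin.report_data.sql.gz
echo "-- MySQL dump (stand-in)" | gzip > "$DUMP"

# The actual check: the file must be non-empty and pass gzip's integrity test
if [ -s "$DUMP" ] && gzip -t "$DUMP"; then
    echo "backup looks intact"
else
    echo "backup is missing or corrupt" >&2
fi
```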
Back up naemon's state information using the command below.
# cp -pv /opt/monitor/var/status.sav ~/status.sav.bak
Launch the tool that rebuilds the report data on your primary master, using the command below. In this example, only log data from January 1st, 2021 and onward is processed for insertion. Keep this interval as restrictive as possible to lessen the load.
# mon mt report-data-reinsert -s 2021-01-01
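Rather than typing the start date by hand, a suitably tight interval can be computed. A sketch using GNU date, which is available on the Linux hosts Monitor runs on; here it derives the first day of the previous calendar month and only prints the command it would run:

```shell
# First day of the previous month, in the YYYY-MM-DD format that -s expects
START=$(date -d "$(date +%Y-%m-01) -1 month" +%Y-%m-%d)

# Print the command for review rather than executing it
echo "mon mt report-data-reinsert -s $START"
```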
Follow the instructions displayed on-screen upon executing the command. An output example of what the process looks like can be found at the end of this article.
- First, start up the services only on the node where the mon mt report-data-reinsert tool was just run (service httpd start && mon start), create a report, and verify that it looks correct.
- When verified, start up HTTPD and Naemon + Merlin similarly to how they were previously shut down on all nodes:
# mon node ctrl --self --all -- "service httpd start && mon start"
Rollback procedure: Restoring the backup
In case of trouble, you can restore the backup files that were created in the instructions above.
If you do need to restore the report data table, do the following:
Shut down the web server, Naemon and Merlin using the command below:
# service httpd stop && mon stop
Restore the old report data into the database, using a command like the below:
# zcat ~/merlin.report_data.sql.gz | mysql merlin
Restore naemon's previous state data, using the command below:
# cp -pv ~/status.sav.bak /opt/monitor/var/status.sav
Start the system services again, using the command below:
# service httpd start && mon start
Example output of report-data-reinsert execution
This example output is provided as reference:
root@master01:~# op5 mt report-data-reinsert -s 2014-09-01
Verifying node command execution capabilities...
Testing (master01)... ok
Testing (master02)... ok
Testing (poller01)... ok
Next up is:
1) Collect alert log data (since 2014-09-01 00:00:00) from the listed nodes.
2) Sort and deduplicate the data.
3) Write the results to file: /tmp/alerts.1409522400.1423572510.p7lQYZ.log
Please be advised that the sorting will create additional temporary files in
/tmp (or $TMPDIR if set via environ), which might require large amounts of
disk space. This amount cannot be pre-determined.
The process might also be very time consuming, depending on the amount of data
to read at each node. Only the alert events found in the log data at each node
is downloaded (compressed) via the network (in case of remote nodes).
Shutting down Monitor at all listed nodes before continuing is recommended!
Continue (y/N)? y
Tue Feb 10 13:48:56 CET 2015 (master01) starting...
Tue Feb 10 13:51:51 CET 2015 (master01) done (entries: 104168) (errors: 0)
Tue Feb 10 13:51:51 CET 2015 (master02) starting...
Tue Feb 10 13:53:08 CET 2015 (master02) done (entries: 114406) (errors: 0)
Tue Feb 10 13:53:08 CET 2015 (poller01) starting...
Tue Feb 10 13:53:43 CET 2015 (poller01) done (entries: 23153) (errors: 0)
Final number of collected log entries: 126370
Next up is:
1) Delete all current report data entries that are timestamped 2014-09-01 00:00:00 or more recently.
2) Insert the report data entries found in '/tmp/alerts.1409522400.1423572510.p7lQYZ.log'
Continue (y/N)? y
Verifying MySQL connectivity... ok
Deleting old report data entries... ok
Importing 17.15 MiB of data from 1 files
Importing data: 100.00% (17.15 MiB) done
17.15 MiB, 126370 lines imported in 9.689s.
Creating sql table indexes. This will likely take ~16 seconds
788923 database entries indexed in 12 seconds
Option 2: Process for inserting a global OK data point
This process inserts data points stating that all hosts and services are OK at a certain point in time. The purpose is to insert such a point in the past, after a "bad" point that was never corrected, so that current report data becomes accurate. Since most objects have multiple state changes after this fake data point, it should only affect objects stuck in long-running incorrect bad states; all other objects will be updated as their later state changes take effect in chronological order.
Example of issue
If a hypothetical service changed to critical last month after being OK since 2018-10-26, and is still (incorrectly) critical, assuming no other data points exist, it would look like this in the database:
- 2018-10-26 -> Last month = OK
- Last month = Change to critical, which incorrectly still applies
- Today = Still critical, no other data points exist to change the state
Inserting an OK data point after the "last month" data point would make this object OK from that point onwards into the future, forever, as there are no newer changes. Since the points in the database reflect state changes, one way to think about this "global OK" would be that:
"the state for everything will change to OK at this point in time, and will remain so until the next upcoming point in the database says otherwise".
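The "last point wins until the next one says otherwise" idea can be illustrated with a toy model (simplified timestamps and states, not the real report_data schema): the state at any time T is simply the most recent event at or before T.

```shell
# Toy event log: "epoch state", one event per line (not the real schema)
events='1540512000 OK
1696118400 CRITICAL
1700000000 OK'

# State at time T = the last event with a timestamp <= T
state_at() {
    printf '%s\n' "$events" | awk -v t="$1" '$1 <= t { s = $2 } END { print s }'
}

state_at 1697000000   # prints CRITICAL: after the bad point, before the inserted OK
state_at 1800000000   # prints OK: the inserted OK point applies from here onward
```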
Step 1: Back up the database
# mysqldump merlin report_data --complete-insert --extended-insert=FALSE | gzip > /tmp/report_data_dump.gz
This will create a file containing SQL queries with data to restore your entire database. It can be manually read with zless if you wish to inspect the contents.
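For a quick look without opening a pager, zcat piped into head shows the start of the dump; the example below uses a stand-in file so it is self-contained, but in practice you would point it at /tmp/report_data_dump.gz:

```shell
# Stand-in dump file; in practice this is /tmp/report_data_dump.gz
DUMP=/tmp/report_data_dump.example.gz
printf -- '-- MySQL dump (stand-in)\nINSERT INTO report_data VALUES (...);\n' | gzip > "$DUMP"

# Show only the first few lines of the (possibly huge) dump
zcat "$DUMP" | head -n 2
```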
If you are running Monitor in a VM, we also highly recommend taking a snapshot for redundancy.
After backup, run the "global OK" command
# mon mt report-data-insert-all-ok "2000-01-01"
For a short help text, run "mon mt". As the help text describes, this command inserts an OK event for *all* hosts and services at the given timestamp into the report_data database table. The timestamp can be any PHP strtotime()-compatible text string.
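Because the timestamp is parsed with PHP's strtotime(), both absolute dates like "2000-01-01" and relative strings are accepted. GNU date has similar (though not identical) parsing, so it can be used to preview what a string resolves to as an epoch timestamp before handing it to mon mt:

```shell
# Preview what a date string resolves to, as a UTC epoch timestamp
TS=$(date -u -d "2000-01-01 00:00:00" +%s)
echo "$TS"   # prints 946684800
```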
Please note that the op5 community repo mentioned above must be installed and enabled for this command to be available. If you do not already have it, the following command enables the repo and installs the support modules:
# yum --enablerepo=\* clean all; yum --enablerepo=op5-community install op5-support-modules\*
Change the state of a single host
It needs to be pointed out that manually editing your database as described below, especially in production, is done entirely at your own risk. We accept no responsibility for downtime, data loss, or any other complications that might result from this procedure. If you still want to go ahead, adapt the commands below to each host and that host's timestamp.
First, make a backup of the merlin database before performing any inserts:
# mysqldump merlin | gzip > /root/merlin.sql.gz
The command below should fix the issue. Replace HOSTNAMEHERE and TIMESTAMPHERE:
# mysql merlin -e "INSERT INTO report_data (timestamp, event_type, host_name, state, hard, retry, output) values(TIMESTAMPHERE, 801, 'HOSTNAMEHERE', 0, 1, 1, 'OK - FIXED MANUALLY')"
Choose a suitable epoch timestamp, for example 5 minutes (300 seconds) after the latest DOWN entry: if the latest entry has timestamp 1575987403, the new insert should use timestamp 1575987703. Repeat the command above for each host with this problem. After the first insert, verify that it has the intended effect before editing the remaining hosts.
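The steps above can be sketched as a small helper that derives the new timestamp and prints one INSERT statement per affected host. The host names and the DOWN timestamp are example values, and the statements are only printed for review, not executed against the database:

```shell
# Example values: epoch timestamp of the latest DOWN entry, and affected hosts
LAST_DOWN=1575987403
HOSTS="web01 web02"       # hypothetical host names

# Place the new OK event 5 minutes (300 seconds) after the latest DOWN entry
NEW_TS=$((LAST_DOWN + 300))

# Generate (but do not execute) one INSERT per host; review before running
SQL=$(for HOST in $HOSTS; do
    echo "INSERT INTO report_data (timestamp, event_type, host_name, state, hard, retry, output) VALUES ($NEW_TS, 801, '$HOST', 0, 1, 1, 'OK - FIXED MANUALLY');"
done)
echo "$SQL"
```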
Rollback procedure, in case of issues
If you wish to roll back to the state your database was in when the above backup file was created, do the following:
- Shut down Naemon and Merlin with mon stop. Shut down the web server as well (httpd/apache).
- Run the following to execute all SQL commands in the backup file, restoring the database:
# zcat /tmp/report_data_dump.gz | mysql -D merlin
This will "steamroll" your database back to the state it was in when the backup ran, and will not care about newer data existing. All changes made later will be lost.
When the commands have finished, start "mon" and your web server back up and verify functionality.