The Processes sampler lets you configure stop and start commands for a monitored process, and can call the start command automatically when that process is not running.
Problem:
When a Processes restart script has failed, the reason for the failure may not be clear from the netprobe log.
Possible cause(s):
Here are reasons that a restart script may fail to run:
- file ownership
- file permissions
- operating environment
Any restart script must be executable by the user that the netprobe runs as. The file attributes (owner, group, and mode) are the first thing to check.
An example - file ownership & file permissions:
A netprobe (netprobe-7053) is configured to monitor other netprobes using the processes plugin. In the following example, netprobe-7052 is monitored and a restart script is specified.
The monitoring netprobe-7053 is running with geneos:geneos user/group ownership.
[root@huw-el7-gateway ~]# geneos ps
Type Name Host PID Ports User Group Starttime Version Home
netprobe netprobe-7053 localhost 2064 [7053] geneos geneos
It may not be obvious from the restart script permissions (see below) whether it is executable by the geneos user. (The geneos user could have sudo permissions that would allow this.)
[root@huw-el7-gateway ~]# ls -lrt /usr/local/bin/start-netprobe.sh
-rwxr-xr--. 1 root root 66 Apr 1 09:47 /usr/local/bin/start-netprobe.sh
The easiest way to test this is to run the script as the netprobe user. Log in as (or sudo to) the netprobe user and test the command.
[root@huw-el7-gateway ~]# sudo -l -U geneos
[geneos@huw-el7-gateway ~]$ /usr/local/bin/start-netprobe.sh 7052
-bash: /usr/local/bin/start-netprobe.sh: Permission denied
The solution here is to ensure that the netprobe user can run this script, by altering either the script's or the user's permissions.
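The manual test above can also be wrapped in a small helper. The following is a minimal sketch, not part of Geneos: the function name check_start_script is hypothetical, and it only reports whether the user running it may execute the given file. It needs to be run as the netprobe user (for example after switching to the geneos account) for the result to reflect the netprobe's permissions.

```shell
#!/bin/bash
# Hypothetical helper: report whether the CURRENT user can execute a script.
# Run it as the netprobe user for the answer to be meaningful.
check_start_script() {
    script=$1
    if [ ! -e "$script" ]; then
        echo "missing: $script"
        return 2
    fi
    if [ -x "$script" ]; then
        echo "executable by $(id -un): $script"
        return 0
    fi
    echo "NOT executable by $(id -un): $script"
    return 1
}

# Example (path taken from the listing above):
#   check_start_script /usr/local/bin/start-netprobe.sh
```

A positive result only covers the permissions on the file itself. Problems inside the script, such as a missing interpreter, still only show up when it is actually run.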
An example - environment - an overlooked case
The process was not restarted as expected. The netprobe log (below) showed that an attempt had been made to run the restart command but gave no indication of an error. Running the script on the CLI, using the same uid as the netprobe, does however work as expected.
2025-03-26 11:19:54.061-0400 INFO: AutoRestart Alert Manager ( /opt/monitoring/bin/alertmanager.py start) - try 1 of 1
Possible solution(s):
- Solution for root cause "File ownership & file permissions":
Ensure that the netprobe user can run this script, by altering either the script's ownership and permissions or the user's permissions, using the Unix tools chown and chmod.
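As an illustration for the listing shown earlier (owner root:root, mode -rwxr-xr--), one possible fix is sketched below. The helper name allow_group_execute is hypothetical, the group name geneos is taken from the example above, and whether group access is the right choice depends on local security policy. Changing the group of a root-owned file requires root.

```shell
#!/bin/bash
# Hypothetical helper: hand the script to a group and let that group run it.
allow_group_execute() {
    group=$1
    script=$2
    chgrp "$group" "$script" &&   # e.g. root:root becomes root:geneos
    chmod g+rx "$script"          # group members may read and execute
}

# Example, run as root:
#   allow_group_execute geneos /usr/local/bin/start-netprobe.sh
#   ls -l /usr/local/bin/start-netprobe.sh
```

Afterwards, repeat the test as the netprobe user to confirm that the script now starts.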
- Solution for root cause "Environment":
Capture the output of the command (both standard output and standard error) in a file, for example /tmp/debug.log, using shell redirection.
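A minimal sketch of such a capture follows. The command path is the one from the log line above. The wrapper function run_logged is hypothetical: it appends the command's standard output and standard error, the PATH it ran with, and its exit status to the given file. The one-line form in the first comment assumes the start command is run through a shell.

```shell
#!/bin/bash
# Simplest form: append the redirection to the start command itself.
#   /opt/monitoring/bin/alertmanager.py start > /tmp/debug.log 2>&1

# Hypothetical wrapper that also records PATH and the exit status.
run_logged() {
    log_file=$1
    shift
    status=0
    {
        echo "--- $(date) ---"
        echo "PATH=$PATH"
        "$@" || status=$?          # run the real command, keep its status
        echo "exit status: $status"
    } >>"$log_file" 2>&1
    return "$status"
}

# Example:
#   run_logged /tmp/debug.log /opt/monitoring/bin/alertmanager.py start
```

Recording PATH alongside the output makes it easier to compare the netprobe's environment with that of an interactive login, where the same command may succeed.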
Check the log for error messages when the start attempt fails.
/usr/bin/env: python3: No such file or directory
The above message indicates that the python3 interpreter was not found, i.e. its directory is not on the PATH in the netprobe's environment, even though it may be on the PATH of an interactive login shell.
Edit the netprobe environment to include the path where (in this case) the python binary is located.
Setting the PATH would normally be done in a netprobe startup script.
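For example, a startup script could extend PATH before launching the netprobe. The sketch below is an illustration under assumptions: the function name prepend_path is hypothetical, and /usr/local/python3/bin is a placeholder to be replaced with the directory that actually contains python3 on the host.

```shell
#!/bin/bash
# Hypothetical helper: put a directory at the front of PATH, once.
prepend_path() {
    dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;               # already present, nothing to do
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}

# Example in a netprobe startup script (placeholder directory):
#   prepend_path /usr/local/python3/bin
#   command -v python3    # prints the interpreter's full path if it is now found
```

The netprobe has to be restarted from the modified startup script for the new PATH to reach the restart command.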
If you need further help:
- Please contact our support team via the chat service box on any of our websites or raise a support request.
- Make sure you provide us with:
- Background of the issue or request.
- Use cases, requirements, business impact, etc.
- Encountered error messages.
- Log files or diagnostic files.
- Screenshots.
- Any other important information relevant to your inquiry.