What are zombie processes?
According to Wikipedia:
It is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the "Terminated state". This occurs for the child processes, where the entry is still needed to allow the parent process to read its child's exit status: once the exit status is read via the wait system call, the zombie's entry is removed from the process table and it is said to be "reaped". A child process always first becomes a zombie before being removed from the resource table.
In most cases, under normal system operation zombies are immediately waited on by their parent and then reaped by the system – processes that stay zombies for a long time are generally an error and cause a resource leak, but the only resource they occupy is the process table entry – process ID.
Why is the Netprobe spawning zombie processes?
The Netprobe does not create zombie processes directly. Rather, it (or its plug-ins) creates child processes to execute user scripts. When these child processes exit, they become zombie (also known as defunct) processes until their exit status is read, as explained in the Wikipedia article.
One plug-in that creates child processes is the Toolkit plug-in. It uses the popen function to open a process by creating a pipe, forking, and invoking the shell.
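The zombie lifecycle described above can be observed with a small sketch. This is not the Netprobe's code: Python's subprocess.Popen stands in for popen, and the helper names child_state and wait_for_state are made up for illustration. It assumes a Linux host, because it reads the process state from /proc.

```python
import subprocess
import sys
import time


def child_state(pid):
    """Return the one-letter state of `pid` from /proc (Linux only), or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may contain spaces or
    # parentheses, so split after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until `pid` reports state `wanted`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if child_state(pid) == wanted:
            return True
        time.sleep(0.05)
    return False


if __name__ == "__main__":
    # Spawn a child that exits immediately. The parent has not waited yet,
    # so the exited child stays in the process table as a zombie (state Z).
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    wait_for_state(child.pid, "Z")
    print("before wait:", child_state(child.pid))

    # wait() reads the exit status, which reaps the zombie.
    child.wait()
    print("after wait:", child_state(child.pid))
```

A zombie that lingers therefore means the parent has not yet read the child's exit status, not that the child is still doing work.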
Should I worry?
No, you should not. Zombie processes are eventually removed from the system.
What should I do?
Geneos can detect zombie processes via the Hardware plug-in. You can configure monitoring that identifies long-lived zombie processes as follows:
1. Create a user script that:
   - Lists the processes running on the host
   - Writes the result to a text file
2. Configure a Hardware sampler.
3. Configure an action that executes the user script from step 1.
4. Configure a rule:
   - Set the target XPath to the sampler's zombieProcesses metric.
   - Add a condition that triggers the action from step 3.
   - Add a delay of however many samples add up to 10 minutes.
     - For example, if your Hardware sampler's samplingInterval is 20 seconds, your rule's delay should be 30 samples (30 × 20 seconds = 600 seconds).
   - This delay should prevent false alerts.
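As a rough illustration of the user script in the first step, the sketch below captures the process list with ps and appends the zombie rows to a text file. The script name, the output path argument, and the column selection are assumptions made for this example. The stat column is available in procps and BSD ps but is not guaranteed on every platform.

```python
import subprocess
import sys
from datetime import datetime

# Assumed column set: PID, parent PID, state, elapsed time, command name.
PS_COMMAND = ["ps", "-e", "-o", "pid,ppid,stat,etime,comm"]


def select_zombies(ps_output):
    """Keep the header line plus every row whose STAT column starts with 'Z'."""
    lines = ps_output.splitlines()
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    zombies = []
    for row in rows:
        fields = row.split()
        if len(fields) >= 3 and fields[2].startswith("Z"):
            zombies.append(row)
    return [header] + zombies


def main(output_path):
    ps_output = subprocess.run(
        PS_COMMAND, capture_output=True, text=True, check=True
    ).stdout
    # Append so that repeated action runs build up a history in one file.
    with open(output_path, "a") as out:
        out.write(f"# captured {datetime.now().isoformat(timespec='seconds')}\n")
        out.write("\n".join(select_zombies(ps_output)) + "\n")


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        # "list_zombies.py" is a hypothetical script name.
        print("usage: list_zombies.py OUTPUT_FILE", file=sys.stderr)
```

Writing only the zombie rows keeps the file short. If the full process list is more useful for troubleshooting, the filter can be dropped and the raw ps output written instead.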
This produces an output file listing the zombie processes that have remained on the system for 10 minutes. Check whether any zombie processes have existed for 10 minutes or more and are linked to the Netprobe. If there are, raise a ticket with ITRS Support and provide the following:
- The action's output file
- Name of the Netprobe as configured in the Gateway/GSE
- Netprobe version
- Netprobe OS version
- Netprobe log file
- Gateway diagnostics file
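Before raising a ticket, the check for long-lived zombies linked to the Netprobe could be sketched as follows. It assumes the action's output file contains rows in the order PID, PPID, STAT, ELAPSED, COMMAND (as printed by ps -o pid,ppid,stat,etime,comm), and it treats "linked to the Netprobe" as "direct child of the Netprobe's PID". Both are assumptions, and the function names are illustrative.

```python
def etime_to_seconds(etime):
    """Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_lived_netprobe_zombies(rows, netprobe_pid, min_age_seconds=600):
    """Return (pid, age_in_seconds) for old zombie rows whose parent is the Netprobe."""
    matches = []
    for row in rows:
        fields = row.split()
        # Skip headers, comment lines, and blank lines.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, ppid, stat, etime = fields[:4]
        if int(ppid) == netprobe_pid and stat.startswith("Z"):
            age = etime_to_seconds(etime)
            if age >= min_age_seconds:
                matches.append((int(pid), age))
    return matches
```

For example, calling long_lived_netprobe_zombies(open(path).read().splitlines(), netprobe_pid) with the Netprobe's PID returns the zombies worth reporting. An empty list suggests that the zombies were short-lived or belong to another parent.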