Related to:
Gateway hung, gateway not updating, Gateway not responding, Gateway frozen
Problem
- The Active Console or Dashboard is showing outdated values
- The Gateway has stopped responding to commands
Possible Cause(s)
- Gateway Performance
- The Geneos Gateway is a complex, multi-threaded application and benefits from careful configuration to give the best performance in each different deployment. Out-of-the-box the configuration is intended for relatively small scale configurations and tries to have a minimal resource impact on the server it is running on. Once load on the Gateway increases, and there are a variety of reasons for this, the Gateway will appear to slow down.
- The best single indicator of Gateway Performance is the maxDataAge value, which represents, as the name suggests, the maximum age of any data item awaiting processing. This is normally between zero and a few milliseconds. If it climbs higher and stays high, your Gateway is becoming overloaded. There are, however, legitimate reasons for this value to be high for short periods, especially during Gateway start-up and during configuration saves.
This indicator can be seen in two places. The first is as a headline in the Gateway-probeData virtual sampler. It is updated with the actual value as often as the sampler is configured to run; however, because this sampler's data is subject to the same processing, the view may itself be out of date. The other place is the Gateway log file, where a set of Data Quality metrics is written every ten minutes. These are output from a separate thread that is not subject to delay when other functions are overloading the Gateway. They look like this:
2021-08-19 13:19:12.058+0000 INFO: DataQuality Statistics 10 minute periodic summary
2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum data age : period - 0 ms, lifetime - 0 ms
2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum queue data size: period - 0 bytes, lifetime - 6538 bytes
2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum total data size: period - 0 bytes, lifetime - 7283 bytes
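If you want these metrics in a structured form, for example to chart them, the lines can be parsed with a short script. The following is a minimal sketch in Python; the regular expression is derived only from the sample lines above and is an assumption about the format, so it may need adjusting for other Gateway versions. The function name parse_data_quality is hypothetical.

```python
import re

# Assumed line format, taken from the sample log lines in this article.
# Whitespace is matched loosely in case the real log pads the columns.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) INFO: DataQuality (?P<metric>Maximum [a-z ]+?)\s*:"
    r"\s*period\s*-\s*(?P<period>\d+)\s*(?P<unit>ms|bytes),"
    r"\s*lifetime\s*-\s*(?P<lifetime>\d+)\s*(?:ms|bytes)\s*$"
)

def parse_data_quality(lines):
    """Return one dict per DataQuality metric line; other lines are skipped."""
    records = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            records.append({
                "timestamp": m["ts"],
                "metric": m["metric"],
                "period": int(m["period"]),
                "lifetime": int(m["lifetime"]),
                "unit": m["unit"],
            })
    return records

# The four sample lines shown above.
sample = [
    "2021-08-19 13:19:12.058+0000 INFO: DataQuality Statistics 10 minute periodic summary",
    "2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum data age : period - 0 ms, lifetime - 0 ms",
    "2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum queue data size: period - 0 bytes, lifetime - 6538 bytes",
    "2021-08-19 13:19:12.058+0000 INFO: DataQuality Maximum total data size: period - 0 bytes, lifetime - 7283 bytes",
]

for rec in parse_data_quality(sample):
    print(rec["metric"], rec["period"], rec["lifetime"], rec["unit"])
```

The "period" figure covers the last ten-minute window, while "lifetime" is the maximum since the Gateway started, so the period value is the one to watch for current load.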
- Running something like:
grep "DataQuality Maximum data age" gateway.log
may give you an immediate visual idea of how the maxDataAge changes over time. Small "blips" that last one or two 10-minute values can normally be ignored for reasons given above. An increasing value over time is a good indicator of a problem.
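To go one step beyond eyeballing the grep output, the same lines can be checked for a sustained rise. The sketch below is in Python and is only an illustration: the 1000 ms threshold is a hypothetical value chosen for the example, and the default of three consecutive periods simply reflects the guidance above that one or two high values can normally be ignored. The function names are also hypothetical, and the pattern assumes the log format shown in this article.

```python
import re

# Matches the per-period value in "DataQuality Maximum data age" log lines
# (format assumed from the sample in this article).
AGE_RE = re.compile(r"DataQuality Maximum data age\s*:\s*period\s*-\s*(\d+)\s*ms")

def period_ages(lines):
    """Extract the per-period maximum data age (ms) from log lines, in log order."""
    return [int(m.group(1)) for line in lines if (m := AGE_RE.search(line))]

def longest_high_run(ages, threshold_ms):
    """Length of the longest run of consecutive periods above threshold_ms."""
    longest = current = 0
    for age in ages:
        current = current + 1 if age > threshold_ms else 0
        longest = max(longest, current)
    return longest

def looks_overloaded(ages, threshold_ms=1000, min_periods=3):
    """True if the data age stayed above the threshold for min_periods in a row.

    Both defaults are assumptions for illustration, not ITRS recommendations.
    """
    return longest_high_run(ages, threshold_ms) >= min_periods

def check_log(path):
    """Convenience wrapper: read a Gateway log file and apply the check."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return looks_overloaded(period_ages(f))
```

Calling check_log("gateway.log") returns True only when the value stays high across several summaries, so short spikes around start-up or configuration saves do not trigger it.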
- Network Issues
- Geneos components communicate over long-lived TCP connections where the data flow is optimised to send only changes to monitored metrics. If any of these TCP connections is interrupted by a network issue, it has to be re-established and, in the case of a Netprobe, all monitoring is restarted in the same way as if the Netprobe process itself had been restarted. This in turn can result in a sudden increase in the volume of monitored data as samplers start up and send their data to the Gateway. If the network issue affects multiple components, the load on the Gateway is greater still, and the maxDataAge will jump above normal levels until the Gateway has processed this new mass of data.
Possible Solution(s)
- Rule Threads
- The Geneos Gateway can process Rules in a manually-sized pool of threads. This feature is normally turned off as it is not necessary for smaller Gateways but should almost always be the first change made if you see the kinds of performance related issues described in this article.
The number to choose will depend on a variety of circumstances which you will have to judge based on your deployment: the number of cores on the server or VM, the profile of the other applications running (including other Geneos Gateways), and so on. A good starting value is 3 or 4; from there, use the Load Monitoring tools to see whether further adjustment is needed.
As described in the documentation, the Gateway will limit the number of threads used for Rules to the number of cores detected on the server, so you cannot set it too high.
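When planning a value for a particular server, it can help to work out what the effective thread count would be once the cap is applied. The following Python sketch only illustrates that arithmetic; it is not the Gateway's own logic, the function name is hypothetical, and it assumes that "cores detected" corresponds to what os.cpu_count() reports on the same machine.

```python
import os

def suggested_rule_threads(requested=4, cores=None):
    """Return the requested Rule thread count capped at the number of cores.

    Illustration only: 'requested' defaults to 4, the upper end of the
    starting range suggested above, and 'cores' defaults to the logical
    CPU count reported by the operating system (an assumption).
    """
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, cores))

# Example: a request for 4 threads on a 2-core VM is effectively 2.
print(suggested_rule_threads(4, cores=2))
```

On a small VM the cap may leave you below the suggested starting value, which is a hint that adding cores is worth considering alongside this setting.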
- Splitting the Gateway
- As your Geneos monitored estate grows you will be adding more probes and more samplers to the Gateway. Eventually, even with performance tuning, like Rule Threads above, you will need to consider splitting the Gateway, both for performance - the subject of this article - and also for configuration manageability.
The complexity of splitting a Gateway into two or more new Gateway instances varies and is usually related to how logically separated your existing monitoring objectives are. If, for example, your existing Gateway is collecting monitoring data for different applications, which in turn may be managed by different teams, then it makes sense to consider creating a new Gateway for each application. This approach can also apply vertically, separating low-level infrastructure, market data, databases etc.
Using Gateway Sharing you can then selectively recombine data between Gateways to ensure that dependencies can be managed.
- More system resources
- The other straightforward change to try is, where possible, giving the server that the Gateway is running on more resources: specifically CPU, memory and/or IOPS (disk throughput). The last of these is not normally anywhere near as important as the first two. Many Geneos deployments are now on VMs, so these changes can sometimes be made either live or with very little downtime. No configuration changes are needed for the Gateway process, except possibly the number of Rule Threads (above).
Related Articles
- Performance issues and disconnections of Netprobes or Active Console users (Max Data Age and Data Quality stats)
- Data Quality Guide
- Performance Tuning Guide
- Tech Ref: Rule Threads
- Managing Load in Geneos
If Issue Persists
- Please contact our Client Services team via the chat service box available on any of our websites, or by email at support@itrsgroup.com
- Make sure you provide us with:
- At minimum:
- Full Gateway logs
- Active Console Diagnostic file (Help -> Create Diagnostic File)
- Gateway version details (if not in log)
- Information about the server, loading and other resource information
- Desirable:
- Gateway Diagnostics
- A sample stats.xml file - remember this is a snapshot and not a log - after the gateway has been running with load monitoring enabled for a suitable period of time