Geneos - Managing Load in Geneos

Assumptions
Introduction
The fundamental unit of Data Load - Data view cells

Recommended hardware requirements

Managing data load on Probes
Monitoring by exception
Heartbeats and disconnections and EMF2

Recovering from a constant state of connect / disconnect

Managing load in Gateways

Gateway startup vs normal operation
Data Quality
Multi-core support

Load Monitoring
XPath assessment
Contention
Managing Load in the Active Console

Constantly connecting and disconnecting gateways
Running multiple Active Consoles
What if even an empty work space connects and disconnects, or never connects at all?
Looking at a subset of your gateways

Impact of adding capacity
Gateway Sharing

DISCLAIMER: Embedded in this document are detailed overviews of how some of the internal workings of the Geneos components operate. The document highlights these workings because it helps explain some important principles of how to work with load within Geneos. It is not an invitation to suggest changes to these systems (we will not be raising enhancement requests, for example). That is, you may or may not agree with how it has been designed, but many of these design decisions are deeply embedded within the architecture, and rework would create significant work and risk (in terms of stability).

Assumptions

This article is written assuming that the reader has a good understanding of Geneos at an operator level, of the console, the gateway and the probe (in general terms). It also touches on some of the other components such as open access and AC Lite etc, but these are a secondary requirement.

Introduction

Like any software system, Geneos has limitations in how fast it can process and manage data. It's not uncommon, or even unreasonable, to ask where the limits lie, how much data can be put through, or what hardware is needed to run it. However, the answer is not simple, and this article attempts to describe why.

The fundamental unit of Data Load - Data view cells

Regardless of the samplers configured on a net probe, they all produce tables of data called data views. Any given sampler can produce 0..many data views, and the data views vary in size, generally ranging from 2 to 20 columns and 0 to thousands of rows. Data views also have 0..many headlines, but tend to have fewer than 10, so the headlines rarely, if ever, have an impact on load in Geneos.

When a gateway first connects to a probe and the samplers run, the whole contents of the data view is sent up to the gateway. From that point on, each sample sends up only the cells that have changed, been added or been removed. This means that:

The load is higher on the first connect, while the information is a) gathered by the probe and b) sent up in its entirety to the gateway for the first sample.
In theory, the load is lower on each subsequent sample. However, if all the cells change all the time then the load from that probe will remain high (this includes the case where all row names change).
The sample rate of the data view can influence the update rate, in that a faster sample rate gives cells the potential to change more frequently. Sample rate is capped at 1 second for all samplers except the API plugin, where the rate is determined by the designer of the code producing the XML-RPC calls.

The load is also influenced by the contents of the cells: the passing of long strings, like those often encountered in FKM samplers, also takes more processing than simple integers for example.

The load impact of a sampler on the Geneos estate is therefore given by formulas of the following nature:

At start up: (Number of columns * number of rows) + number of headlines

During normal running: Number of cells and headlines updated (including those added and removed) * the sample rate

Note that a row counts as removed and re-added if its row name (the value in the first column) changes, since the row name is part of the identifier for the remaining cells on the row.
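As a rough illustration, the two formulas above can be written as a short Python sketch. The function names and the example figures are ours, chosen purely for illustration, and the sample rate is expressed as a sample interval in seconds (so multiplying by the rate becomes dividing by the interval). Treat it as a way of comparing configurations, not as a sizing tool.

```python
def startup_load(columns, rows, headlines):
    """Items sent on the first sample after a connect: every cell plus every headline."""
    return columns * rows + headlines


def running_load(items_updated_per_sample, sample_interval_seconds):
    """Average updates per second during normal running.

    Multiplying by the sample rate is the same as dividing by the interval.
    """
    return items_updated_per_sample / sample_interval_seconds


# Hypothetical data view: 10 columns, 500 rows, 3 headlines.
first_sample = startup_load(columns=10, rows=500, headlines=3)

# Quiet view: 50 items change per sample, sampled every 20 seconds.
quiet = running_load(items_updated_per_sample=50, sample_interval_seconds=20)

# Noisy view: every cell and headline changes on every 1 second sample.
noisy = running_load(items_updated_per_sample=first_sample, sample_interval_seconds=1)

print(first_sample, quiet, noisy)
```

With these made-up numbers the same data view costs 2.5 updates per second when quiet and over 5,000 per second when every cell changes each second, which is why the sampler configuration matters more than the count of probes.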

The holistic load on Geneos in terms of data is therefore predicated on the specific configuration of samplers, not - as is often quoted - on the number of gateways or probes (although they are not completely irrelevant, as discussed later). This adds complexity when trying to answer the question 'What spec box do I need?', since the counter question will be 'What, and how often, do you want to monitor?'

Recommended hardware requirements

Because the fundamental units of load are data views and data view updates, it is difficult to provide generic performance statistics for Geneos (given all the possible sampler configurations), and therefore to recommend the size of the required hardware. Experiments run in lab conditions with canned data rarely offer a realistic facsimile of real-world use, so they tend to produce artificially high Geneos throughput figures. The recommendation is to try what you want to do, with realistic load, in a UAT environment, then use the load management guidance below to tailor the configuration if load is a problem.

Managing data load on Probes

The load on a probe is defined by the samplers that are configured to run on it. As discussed in the previous section, the size of the data views and the configured sample rate will impact load, as will the specific processing that the sampler has to do to build the data views (for example, scanning a log file, running a database query, or looking at OS level data or commands). The different processing requirements of the various samplers are currently outside the scope of this article.

When looking to manage load in the Geneos real estate it is suggested that you:

Minimize the data that needs to be populated in the data view over time. A common method is to monitor by exception rather than getting all the data all the time. However, samplers differ tremendously in how configurable the returned rows and columns are. The CPU plugin is fairly fixed, for example, whereas the Toolkit and SQL Toolkit are entirely user defined.
Throttle the sample rate to the operational minimum that achieves your requirement. We are not proposing that you compromise your monitoring, just that you consider whether sampling every second, for example, is really required.

If a probe is overloaded in terms of CPU or memory then review the samplers which are configured, see if you can minimize the returned rows or sampling rate, and if not consider deploying a second probe and splitting the samplers. See below for notes on the possible effects downstream of splitting samplers.

Monitoring by exception

In the above section we mentioned that one method of reducing load on a probe is to monitor by exception. This topic deserves special attention, which we will give it now. The principle is that only data which matters flows into the monitoring tool. By 'matters' we mean data a user will act on. Common examples we regularly see where users may NOT be managing by exception include:

Bringing in hundreds of rows via an SQL Toolkit or Toolkit sampler, then looking for a particular state in selected rows (which become alerts). Often it's possible to add the logic that removes non-alert rows to the SQL or the script. In the extreme, the view will be empty until a suitable alert state occurs. This is a very efficient method of monitoring.
Getting hundreds of triggers in an FKM. This may mean that your configuration is too general, and the bulk of the alerts are false alerts. I have sat in with some clients and been asked whether there is a way to quickly accept a hundred or so FKM triggers at the same time, as it is something they regularly do. The better question might be: 'How do I reduce the number of false alerts?'
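The first case can be sketched as follows: a minimal Python example of a script that filters its own rows before handing them to the sampler. It assumes the script's output is consumed as comma-separated text with a header row; the column names, the sample rows and the 'STUCK' alert condition are all hypothetical.

```python
import csv
import io


def exception_rows(rows, is_alert):
    """Keep only the rows a user would act on."""
    return [row for row in rows if is_alert(row)]


def to_csv(header, rows):
    """Render a header line followed by the data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical source data: most rows are healthy and not worth sending.
header = ["orderId", "status", "ageSeconds"]
all_rows = [
    ["A1", "FILLED", "2"],
    ["A2", "STUCK", "340"],
    ["A3", "FILLED", "1"],
]

alerts = exception_rows(all_rows, lambda row: row[1] == "STUCK")
output = to_csv(header, alerts)
print(output, end="")
```

When nothing is stuck the output is just the header line, so the data view stays empty until there is something to act on.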

There are, of course, reasons to collect data other than that it is actionable. For example:

1) For compliance

2) For capacity planning

3) For Analytic engines

All these reasons are valid, and you may use Geneos to collect data for them. But if not, and you are collecting data for the sake of data collection, or because it might be useful some time in the future, then you may consider disabling the configuration in the short term if you have a load problem.

Heartbeats and disconnections and EMF2

An important feature of Geneos is the underlying protocol by which the components communicate. It's called EMF2; it works over TCP/IP and is a bespoke protocol that passes data between Geneos components. As well as carrying the data, it also serves to ensure that components that should be connected are connected, which it does via heartbeats. If a heartbeat (or message) is not received for 77 seconds or more, the receiving component assumes that the target component is no longer available and takes steps to alert the user and clear up its internal state. For example, the gateway may report a probe as disconnected, or the console may drop the connection to the gateway.

There may be many reasons for such a disconnection, including network problems, a (Geneos) process going down, a TCP/IP connection being blocked by a firewall, and so on. However, it may also be due to high load. In situations where the CPU is high on a Geneos component, it may not have time to service (respond to or listen for) the heartbeat sent to it by another component. The result is that the other component assumes a disconnection event and starts to clean up the connection. In essence, it cannot distinguish between the absence of the heartbeat due to load and its absence for some other reason.

Geneos will periodically attempt to reconnect to a disconnected component (every few seconds); having done this, the component will start sending its data again. This (desirable) feature has a side effect, however, when considered in the context of busy components and the heartbeats. Consider the situation where a gateway connects to a probe: the probe starts running up its samplers and sends a significant quantity of data to the gateway, which makes the gateway busy and unable to service the heartbeat or take the probe messages off the queue. The probe connection is dropped when the heartbeat fails and the data is cleaned up, then some time later the connection is re-established and the whole process starts again. To the user it looks as if the system is in a spin cycle from which it cannot recover (constant connections and disconnections).

It's common to see the same issue when connecting busy gateways to an Active Console. The console is so busy trying to process all the gateway data that it drops the connection; when it finally gets CPU cycles to re-establish the connection, the process starts all over again. Where these regular disconnects are observed, it is a sure sign that the designer of the Geneos system needs to start managing the load better in the Geneos real estate.
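The spin cycle can be illustrated with a deliberately crude Python model. The 77 second timeout comes from the text above; everything else (the function name, the cell counts, the processing rate, and the simplification that a busy receiver services no heartbeats at all until the initial upload is processed) is an assumption made for illustration and does not describe the real EMF2 implementation.

```python
HEARTBEAT_TIMEOUT_SECONDS = 77  # figure quoted in the text above


def simulate_connects(initial_cells, cells_per_second, attempts=3):
    """Return the outcome of each successive connection attempt.

    Simplifying assumption: the receiver is too busy to service heartbeats
    until it has processed the whole initial upload.
    """
    outcomes = []
    for _ in range(attempts):
        busy_seconds = initial_cells / cells_per_second
        if busy_seconds >= HEARTBEAT_TIMEOUT_SECONDS:
            # Connection dropped and state cleaned up, so the next attempt
            # has to resend exactly the same initial upload.
            outcomes.append("dropped")
        else:
            outcomes.append("connected")
            break
    return outcomes


# Hypothetical figures: the receiver processes 10,000 cells per second.
overloaded = simulate_connects(initial_cells=1_000_000, cells_per_second=10_000)
trimmed = simulate_connects(initial_cells=500_000, cells_per_second=10_000)
print(overloaded, trimmed)
```

Nothing changes between attempts in the overloaded case, so retrying alone never helps; only reducing the data requested (the second call) breaks the cycle.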

Recovering from a constant state of connect / disconnect

Assuming the cycle is being caused by high load in one of the components, you need to take the loaded component down and modify its configuration (to request less data) before reconnecting. This may also involve splitting a component into two or more instances with the monitoring function shared across the new components, disabling rules, or connecting to fewer gateways.

Managing load in Gateways

A gateway's load is influenced by the following points:

The quantity of data it receives from the connected net probes (as defined in the section above, and not by the number of probes)
The processing it does on that data
Whether it is starting up, or in standard operating mode (we'll discuss this below)

We have already established the important parameters of point 1, that is, it is not so much the number of probes as what data those probes are generating and how quickly it updates. So now we need to look at point 2, that is, what the gateway does once it has received that data.

For any given update a gateway needs to decide a) whether there are one or more things it needs to do, and b) what to do. That assessment must be made every time an update comes in. The update may then be routed to systems such as:

The rule engine (which tends to account for the bulk of the load within most gateway configurations)
Actions and alerts
Data base logging
Import and export (gateway sharing)
Persistence
Active time assessments

The method the gateway uses to route these messages is important, since it explains the fundamental difference between startup of a gateway and normal operation, which is a common thread on help desk tickets.

Gateway startup vs normal operation

When a gateway is first started and the data starts coming in from the probe it uses an XPath evaluation to establish what parts of the gateway need the message. XPath evaluation, while heavily optimized in the gateway, is still slow, and so when an update is routed the gateway starts building up an internal index so as to avoid that evaluation the next time an update comes from the same cell or data item. The subsequent use of the index instead of the XPath evaluation is fast. The connections to the probes themselves are treated asynchronously due to the nature of TCP/IP over which they communicate: having sent a request to connect you cannot guarantee when the probes will respond, so data will start coming into the gateway at unpredictable times.

These conditions mean that users may perceive gateways to start slowly and use high resource (CPU), but then settle down sometime later into a stable and low resource state. While the increase in incoming data can explain some of this load, the XPath evaluations account for much more of it. In situations where this load is too high the gateway may suffer from the heartbeat issue described above, where connections to probes end up in a constant cycle of connect and disconnect due to load. This situation requires attention to resolve, with load monitoring (see below) being the first port of call.

There is another important implication of the different code paths at start up compared to normal operation, which concerns incremental changes to the gateway during normal running. Modifying the gateway at run time means that it need only index the paths of the newly changed items; a series of such updates may have no apparent effect on the gateway performance. However, if the gateway is restarted, the modified paths and data requests from the probes will all hit at the same time, and the gateway may therefore suffer a big increase in its start up time, or indeed fail to start at all - a situation that is undetectable during a series of changes. This risk is best mitigated with regular test restarts during modification.
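The difference between the two code paths can be sketched in Python as a cache in front of a slow matcher. This is only an analogy: the class and consumer names are ours, glob patterns from the standard fnmatch module stand in for real XPaths, and the gateway's actual index is certainly more sophisticated.

```python
import fnmatch


class UpdateRouter:
    """Routes updates to interested consumers, indexing each item on first sight."""

    def __init__(self, subscriptions):
        # Hypothetical stand-in for XPaths: {consumer name: glob pattern}.
        self.subscriptions = subscriptions
        self.index = {}
        self.slow_evaluations = 0

    def _evaluate(self, item_path):
        """Slow path: test the item against every registered pattern."""
        self.slow_evaluations += 1
        return tuple(
            name
            for name, pattern in self.subscriptions.items()
            if fnmatch.fnmatchcase(item_path, pattern)
        )

    def route(self, item_path):
        """Fast path: reuse the indexed answer when the item has been seen before."""
        if item_path not in self.index:
            self.index[item_path] = self._evaluate(item_path)
        return self.index[item_path]


router = UpdateRouter({"rules": "*/cpu/*", "dbLogging": "*/percentUtilisation"})
paths = [f"probeA/cpu/row{i}/percentUtilisation" for i in range(1000)]

for path in paths:  # start up: every item is new, so every update is evaluated
    router.route(path)
startup_evaluations = router.slow_evaluations

for path in paths:  # normal running: every update is an index lookup
    router.route(path)
steady_evaluations = router.slow_evaluations - startup_evaluations

print(startup_evaluations, steady_evaluations, router.route(paths[0]))
```

A restart is equivalent to creating a new router with an empty index: all the evaluations that were spread over weeks of incremental changes are paid again in one burst.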

Data Quality

It was noted in versions of the gateway prior to 3.0.0 that it was possible for a single probe to send excessive data to the gateway, such that the gateway became too busy to service the heartbeat requests from that probe and others. The heartbeats affect the whole protocol, so both affected and unaffected probes may start disconnecting. To defend against this, versions 3.0.0 and above of the gateway monitor two attributes of the read buffers on the gateway:

1) the oldest message

2) the size of the queue

If the oldest message goes above a certain age (in milliseconds) then the gateway determines the 'busiest' sampler and disconnects the probe it's mounted on. The aim of this is to remove the offending item and get back to normal running without affecting the remaining monitoring. The 'busiest sampler' is determined not just from the amount of data it is pushing at the gateway, but also from the processing the gateway must do on those messages, which may include the rules they trigger, for example. Although disconnecting a probe may seem undesirable, the alternative is that all the monitoring via the gateway is disrupted, which - it is suggested - is worse. It is therefore not recommended that data quality be turned off.
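The decision described above can be sketched in Python as follows. The text does not say how data volume and processing cost are combined, nor what the age threshold is, so the scoring formula, the cost_per_byte_ms weighting, the thresholds and all names here are hypothetical; the sketch only shows the shape of the check.

```python
from dataclasses import dataclass


@dataclass
class SamplerStats:
    probe: str
    sampler: str
    bytes_received: int
    processing_ms: float


def probe_to_disconnect(oldest_message_age_ms, max_age_ms, stats, cost_per_byte_ms=0.001):
    """Return the probe hosting the busiest sampler, or None if the queue is healthy.

    The score mixes data volume with processing time; the weighting is an
    assumption made for this sketch, not the gateway's real formula.
    """
    if oldest_message_age_ms <= max_age_ms:
        return None
    busiest = max(
        stats,
        key=lambda s: s.processing_ms + s.bytes_received * cost_per_byte_ms,
    )
    return busiest.probe


# Hypothetical figures for two samplers on two probes.
stats = [
    SamplerStats("probeA", "fkm", bytes_received=2_000_000, processing_ms=400.0),
    SamplerStats("probeB", "cpu", bytes_received=5_000, processing_ms=10.0),
]

healthy = probe_to_disconnect(oldest_message_age_ms=100, max_age_ms=2000, stats=stats)
backlog = probe_to_disconnect(oldest_message_age_ms=5000, max_age_ms=2000, stats=stats)
print(healthy, backlog)
```

Only the probe hosting the offending sampler is dropped; probeB keeps reporting.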

Multi-core support

From version 3.0.0 of the gateway, multi-core support was added. This allowed the rule evaluation (previously identified as the most common primary reason for high load on gateways) to be placed on one or more additional threads. This essentially allows the gateway to process more information without reaching the limits of the machine it's running on. That said, there are still limits. For example, there is no point setting the number of threads (available via the Operating environment configuration of the gateway) higher than the number of cores on the machine.

Load Monitoring

Gateways have built-in functions to diagnose load, called load monitoring. How to set up and start load monitoring is covered in the Gateway Reference guide, but in summary, load monitoring views can be added as samplers to the gateway, which return metrics on the time the gateway spends in:

Gateway components
Running rules
Logging to database
Work attributed to individual Managed Entities
Work attributed to individual probes
Work attributed to individual samplers
XPath evaluations

Via these statistics a user can track down where the gateway spends most of its time, and then look to reduce load in those areas. As a rule this is accomplished by making the gateway do less in that area: for example, being more specific in database logging, or reducing sampling rates, the number of rules, the number of rule targets and so on.

These statistics can also be written out to a file for analysis in other systems such as Excel.

XPath assessment

Paths are a core part of the gateway and front end components of Geneos; they are used throughout to identify sets of data items that some action will be performed on. That said, not all paths are equal in how expensive they are to process, and the use of 'bad' XPaths can have a significant impact on the performance of the component. This guide provides an overview of what constitutes an expensive vs non-expensive path.

Contention

This article has focused on managing load within the Geneos environment, but it's important to remember that software rarely runs in isolation on a PC or server, and Geneos is no exception. If Geneos is running low on CPU or memory resource then it's also important to ensure there are no other processes running which are limiting the Geneos components' access to system resources. This may include other Geneos components. For example, running a dozen gateways on a server with fewer than 12 cores will create contention for CPU, which may manifest as high load on the gateways.
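A quick way to reason about this kind of contention is to compare the threads that want CPU with the cores available. The Python sketch below does that arithmetic; it assumes each busy gateway thread needs roughly one core to itself, and the function name and figures are ours.

```python
def cpu_demand_ratio(threads_per_gateway, cores):
    """Ratio of busy threads to available cores; above 1.0 means contention."""
    return sum(threads_per_gateway) / cores


# Hypothetical server: a dozen single-threaded gateways on 8 cores.
ratio = cpu_demand_ratio(threads_per_gateway=[1] * 12, cores=8)
contended = ratio > 1.0
print(ratio, contended)
```

Raising the thread count of a gateway increases the numerator in the same way, so multi-core settings need to be budgeted across all the gateways sharing a machine.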

The most obvious solution is to split the Geneos components across multiple machines. In many cases, as long as there is a network route, the location of the gateways is not important. This is not always true in the case of the samplers, and therefore the probes. For example, the CPU plugin must run on the machine whose CPU it is monitoring, whereas the SQL Toolkit can run anywhere as long as it has a network path to the selected database.

Managing Load in the Active Console

The Active Console can connect to 0..many gateways. As with the other components, the connection runs over EMF2 on TCP/IP. Upon connection the console (by default) receives a subset of the data on the gateway. Specifically:

All Probe, Entity, Sampler and Data view summary information.
All cells in all the dataviews that are any of the following:
- A column header
- A row name
- Of warning or critical severity
- Snoozed or inactive
- Logging to database
- User assigned

The remainder of the table cells are requested on demand, for example when the metric view is displayed on the screen or the cell is displayed on a dashboard. The setting that limits the data (vs getting all data all the time) is configured within the ActiveConsole.gci file with the following lines.

#Enable subscription-based data transfer from GW2 to AC2

-bdosync

DataView,BDOSyncType_Level,DV1_SyncLevel_RedAmberCells

It will also get data relating to the paths that it has registered in the work space. For example, a fully qualified path to a cell:

/geneos/gateway[(@name="Support Services")]/directory/probe[(@name="Virtual Probe")]/managedEntity[(@name="supportServices Gateway")]/sampler[(@name="GW Client Info")][(@type="Gateway Info")]/dataview[(@name="GW Client Info")]/rows/row[(@name="9")]/cell[(@column="duration")]

would also update the state and properties of this cell.

A more general path, for example

//dataview//*

would request all the cell and headline data from all connected data views. (This is not recommended!)
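To make the difference in breadth concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The tree is a made-up, heavily simplified stand-in for a gateway's data model (the element layout, row count and column names are assumptions), and the paths use ElementTree's limited XPath syntax rather than the exact Geneos form shown above. It is not how the console evaluates paths; it only counts how many items each style of path matches.

```python
import xml.etree.ElementTree as ET


def build_tree(n_rows, columns):
    """Build a toy model: one data view with n_rows rows and one cell per column."""
    root = ET.Element("geneos")
    dataview = ET.SubElement(root, "dataview", name="GW Client Info")
    rows = ET.SubElement(dataview, "rows")
    for i in range(n_rows):
        row = ET.SubElement(rows, "row", name=str(i))
        for column in columns:
            ET.SubElement(row, "cell", column=column)
    return root


root = build_tree(n_rows=100, columns=["host", "duration", "user"])

# Fully qualified path: pins the data view, the row and the column.
specific = root.findall(
    ".//dataview[@name='GW Client Info']/rows/row[@name='9']/cell[@column='duration']"
)

# General path: everything underneath every data view.
broad = root.findall(".//dataview//*")

print(len(specific), len(broad))
```

Even in this tiny model the general path matches 401 items against 1 for the fully qualified path, and every one of those items is something the console must then receive updates for.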

Paths are mainly registered within the following dockables:

List Views
Dashboards
Notification Filters
Metric views (notably overviews)

The wider the set of data items that match these paths, the greater the data load on the Active Console - this means that the use of paths is the most common cause of load within the console.

After the initial upload of data the console will receive all updates relating to the subset of data it has registered an interest in. This will be the data which is sent by default, plus the data which is required as a result of the registered paths. Each time an update comes from a gateway, the console needs to decide whether one or more components within the console are interested in that update. That resolution occurs in a component of the console called the Path Model. As in the gateway, this is indexed, so later resolutions tend to be faster than those made at initial connect or when dealing with new, unseen data items. The path model exists because previously this resolution was distributed across multiple components, and in the process of centralizing it we were able to optimize for common paths and therefore speed up this function.

Nevertheless, the path model is the central cause of load in the Console; the load is determined wholly from the paths registered and the number of updates coming from the gateway that require action. To make this a little more complex, however, it's not automatically the number of paths that are registered but also the structure of the paths. As we mentioned before, optimization and indexing has been performed in the path model but only for the most common paths. This means that path design remains somewhat of a black art. For example, 100 fully qualified paths to specific cells are likely to run quickly, whereas one use of an 'ancestor::' element in a path (which has not been optimized) could run many times slower than its 100 peers.

The most common approach to solving load issues in the console is therefore to review the paths registered in the work space. The best place to see these is via the 'Tools --> Refactor paths' function in the console, with the 'Disable filtering' tick box ticked. You can copy the list into Excel to make it easier to view. This guide can then help to spot expensive paths and reduce load in the console.

Constantly connecting and disconnecting gateways

A common failure case reported against the console is gateways connecting and disconnecting. This is often related to the inability of the Active Console or the gateway to service the heartbeat which maintains their connection. It can result in a cycle of connection, busy processing, disconnection, and then some time later another attempt which ends the same way. Where there are multiple gateways at play it is often unpredictable which one will be dropped, since it is more or less random which heartbeats are missed. Looking at the CPU usage of the console can help show whether this is the issue: if it is high (accounting for multiple cores; for example, 25% is high if the machine has 4 cores), then heartbeats may well be missed.
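The 25% figure follows from simple arithmetic, sketched below in Python. It assumes the CPU percentage you are reading is averaged across all cores and that the busy work sits on a single thread; both are assumptions for illustration, and the function name is ours.

```python
def single_core_utilisation(overall_cpu_percent, cores):
    """Fraction of one core in use if all of the load sits on a single thread."""
    return overall_cpu_percent * cores / 100


# 25% overall on a 4 core machine means one core is completely busy.
saturated = single_core_utilisation(overall_cpu_percent=25, cores=4)

# The same 25% on a 2 core machine leaves half a core spare.
half_busy = single_core_utilisation(overall_cpu_percent=25, cores=2)

print(saturated, half_busy)
```

A result at or near 1.0 means the thread has no headroom left, which is when heartbeats start to be missed.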

In terms of fixing the issue, the first and most useful experiment is to connect an empty work space to the same set of gateways and check that it can connect. If it still gets into a cycle of connect and disconnect, then no amount of tweaking the console is likely to solve the problem; instead we should look at the data flow through Geneos, as detailed below. If it works (that is, the empty work space connects to all gateways and maintains the connections), then reviewing the paths as described above can be fruitful.

If the connection and disconnection problems occur during start up then you can use the following flag in the activeconsole.gci file to possibly solve the start up issue:

-queueConnections

Ensure there are no leading or trailing spaces when you add the flag. By default the console will try to connect to all the gateways at the same time on startup; this setting has the effect of connecting to the gateways one at a time. The disadvantage of this approach is that, while it may work, it will take longer to connect to all the gateways.

Running multiple Active Consoles

It is possible to run multiple Active Consoles on a PC, but this is a poor solution for managing load and is not recommended, since it negates one of the core benefits of Geneos, which is that all your data ends up on one viewable screen. If you must do this, then it is essential that you run each Active Console in its own working directory. To do this you need a separate install per console, and the following lines should be added to each activeconsole.gci file:

-wsp

./workingdir

with no leading or trailing spaces on either line.

What if even an empty work space connects and disconnects, or never connects at all?

If you connect an Active Console with an empty work space to your gateway set and it still connects and disconnects then, assuming the gateways themselves are not running hot, the issue is the sheer quantity of data and updates that the console is required to absorb. In these cases we need to start looking at the architecture and audiences of the Geneos real estate. No amount of configuration on the console is likely to solve this situation.

1) Do all users need to see all the data all the time? For example, is it possible for users to look (as standard) at a subset of the gateways?

2) Can you use gateway sharing to create hierarchies of gateways to summarize the Geneos world?

3) Can you reduce the amount of data flowing into the system in the first place, including update rates?

4) Can you use the -queueConnections flag as described above? This will increase the connection time but may solve the problem.

5) Are you truly monitoring by exception?

Looking at a subset of your gateways

One possible solution to an overloaded Active Console (after all other attempts to resolve load have failed), is to look at a subset of your gateways at any given time. Via the Tools --> Settings section a user can define a connection to one to many gateways, either via the direct connections or via the connection files. Regardless of how the gateway connection is added at any given time it can be enabled or disabled. A gateway can be disabled by right clicking on it in the state tree (in physical mode) or in the gateways view. A disabled gateway sends no data to the console, but remains in the gateway list so that you can enable it on demand.

Via this method a user can have quick access to hundreds of gateways, but only look at the contents of a few at a time.

If you are using connection files (which is often the case where you have hundreds of gateways) we would advise unticking 'enable new connection files', which means that when files are changed or added the console does not try to connect to all the gateways in the file by default.

This type of solution works well for users who have overall responsibility for a large real estate, but where the individual gateways are controlled by local teams. These super users connect to the gateways when they need to (when an issue is escalated to them, for example), then disconnect when finished. Gateway sharing (described below) can be used to help create summary gateways, which means that the user can be told when to connect to a gateway because an interesting event has occurred.

Impact of adding capacity

A common approach in Geneos, once a component has been identified as being overloaded, is to split it into two or more instances. For example, deploy a second probe or gateway and move a subset of the monitoring to the new instance. Assuming there are adequate resources (CPU cores and memory) on the machine to fully service both processes (the new and the old), the capacity of that component will essentially have been increased and the load problems solved. Another way of thinking about it is that a throttle on the flow of data has been opened. It's important to remember, however, that this will have an impact on downstream components. In the worst case these will now suffer load issues.

This does not mean that we should not split out overloaded components; indeed it's an important method of dealing with load. But nor should we be surprised by the change in load profile in other components. In essence, dealing with load in Geneos must be done holistically across the whole estate, not in isolation.

Gateway Sharing

This function allows gateways to 'Export' selected data to other gateways, including severities if the designer of the gateway chooses. This sharing is set up within the GSE via the 'Exported data' and 'Imported data' sections. In the context of load, this can be used to create a summary gateway which contains the alerts from the child gateways. The user then connects their console to only the summary gateway.

While this can be very effective at reducing load on the Active console, it does mean that when an alert does occur there is a level of indirection, in that the user must identify the gateway on which the alert has occurred and then enable it in the console to find the source problem. In theory there can be a whole hierarchy of gateways, which means the user could receive a summary from hundreds or thousands of gateways.
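The summary gateway idea can be sketched in Python as follows. This models only the data flow, not the actual 'Exported data' and 'Imported data' configuration; the gateway names, cell paths and the choice to export only warning and critical cells are all hypothetical.

```python
ALERT_SEVERITIES = {"warning", "critical"}


def exported_alerts(gateway, cells):
    """Cells a child gateway would export: alerts only, tagged with their source."""
    return [
        dict(cell, gateway=gateway)
        for cell in cells
        if cell["severity"] in ALERT_SEVERITIES
    ]


# Hypothetical child gateways and the cells they currently hold.
children = {
    "gw-london": [
        {"path": "cpu/percentUtilisation", "severity": "ok"},
        {"path": "disk/percentUsed", "severity": "critical"},
    ],
    "gw-tokyo": [
        {"path": "cpu/percentUtilisation", "severity": "ok"},
    ],
}

# The summary gateway holds only what the children export.
summary = [
    alert
    for name, cells in children.items()
    for alert in exported_alerts(name, cells)
]

# The level of indirection: which child gateways should the user now enable?
gateways_to_enable = sorted({alert["gateway"] for alert in summary})
print(summary, gateways_to_enable)
```

The console connected to the summary sees one item instead of three, and the source tag tells the user which child gateway to enable to investigate.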
