Version 1.2
31st July 2014
| Version | Date | Author | Notes |
|---|---|---|---|
| 1.0 | 05/12/2013 | JH | First Draft |
| 1.1 | 08/01/2014 | JH | Revision after initial feedback |
| 1.2 | 31/07/2014 | JH | Copy to new company template |
The purpose of this document is to provide information concerning the implementation of best practices for Level 2 Monitoring, encompassing process and log file monitoring.
The scope of this document is restricted to details of configuration related solely to the implementation of best practices for processes and log file monitoring and does not extend to any explanation related to the general configuration of the Process or FKM Plugins.
The intended audience for this document is any ITRS personnel or any ITRS client interested in the implementation of best practices for Level 2 monitoring.
The details in this document are the result of the accumulated experience of the Professional Services department implementing Geneos environments for many different clients and are thought to provide the most reliable solution in terms of maintenance and scalability.
Application processes can be configured as Single Instance, Active – Passive or Active – Active processes, depending on the needs and/or requirements of the application or organisation.
The following procedures detail the way in which these differences can be accommodated within the configuration in the most straightforward manner, both for monitoring and configuration purposes.
The information contained in this section of the document relates to the best practices for monitoring processes. Although there is specific configuration related to this, this configuration is mostly related to Active – Passive and Active – Active processes. There is no attempt here to provide best practice for general process monitoring or configuration (i.e. use of Samplers, Sampler Includes, Process Descriptors, Variables etc.), as this will be dependent on individual requirements.
Once the process monitoring has been configured as detailed in the Configuring Processes section, the Active Console will display information differently for each process type.
For Single Instance processes, the Active Console Metrics view shows the Instance Count for each process.

The default Rule for the instanceCount cell will set the severity of the cell to:
OK if the value is equal to 1
Warning if the value is greater than 1
Critical if the value is less than 1
For Active – Passive processes, the Active Console Metrics view shows the Instance Count, but will additionally show the Cluster Count for each process.
For Active – Passive processes it can be expected that the Cluster Count will be either 1 or 0.

The processName is identical for all instances of a clustered process and this allows instances of the process to be matched across multiple servers.
Each process row contains values for the instanceCount, i.e. the specific process instance for the server, and the clusterCount, i.e. the number of instances of the particular processes that are active across all servers.
In the example above:
clusterCount is 1 and the instanceCount is 1 for the Primary FX Pricer processes and the clusterCount is 1 and the instanceCount is 0 for the Secondary FX Pricer processes. This is how Active – Passive processes are expected to run and the severity is set to OK accordingly.
clusterCount is 2 and the instanceCount is 1 for both the Primary and the Secondary FX Connect processes. This is not how Active – Passive processes are expected to run and the severity is set to Critical accordingly.
clusterCount is 1 and the instanceCount is 0 for the Primary FX Options processes and the clusterCount is 1 and the instanceCount is 1 for the Secondary FX Options processes. This is how Active – Passive processes are expected to run when failed over and the severity is set to OK accordingly.
The Rule for clustered processes is not associated with the instanceCount column, but instead is associated with the clusterCount column.
For Active – Passive processes, the default Rule will set the severity of the cell to:
OK if the clusterCount value is 1
Critical if the clusterCount value is not equal to 1
This allows the user to be informed in the event that both processes are down and also in the event that both processes are up.
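The Active – Passive severity logic can be sketched in a few lines of Python. This is only an illustration of the thresholds described above, not Geneos configuration. The function name is hypothetical, and the sample values are the instanceCount figures quoted in the FX example.

```python
def active_passive_severity(cluster_count: int) -> str:
    """Return the severity for a clusterCount cell of an Active-Passive pair."""
    # Exactly one active instance across the pair is the only healthy state.
    return "ok" if cluster_count == 1 else "critical"


# (primary instanceCount, secondary instanceCount), taken from the example text
examples = {
    "FX Pricer": (1, 0),   # normal running
    "FX Connect": (1, 1),  # both instances up
    "FX Options": (0, 1),  # failed over
}

for name, counts in examples.items():
    print(name, active_passive_severity(sum(counts)))
```

A clusterCount of 0 (both down) and a clusterCount of 2 (both up) produce the same Critical result, which is the behaviour the Rule is intended to give.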
For Active – Active processes, the Active Console Metrics view shows the Instance Count, but will additionally show the Cluster Count for each process in the same manner as for the Active – Passive processes.

Each process row contains values for the instanceCount, i.e. the specific process instance for the server, and the clusterCount, i.e. the number of instances of the particular processes that are active across all servers.
In the example above:
clusterCount is 1 and the instanceCount is 1 for the Primary FX Pricer processes and the clusterCount is 1 and the instanceCount is 0 for the Secondary FX Pricer processes. This implies that the Active – Active cluster is not running correctly. Although one of the clustered processes has failed, the application is still available, even though there might be load issues. To indicate that attention is needed the severity is set to Warning.
clusterCount is 2 and the instanceCount is 1 for both the Primary and the Secondary FX Connect processes. This is how Active – Active processes are expected to run and the severity is set to OK accordingly.
clusterCount is 0 and the instanceCount is 0 for both the Primary and the Secondary FX Options processes. This indicates that the application has failed and the severity is set to Critical accordingly.
The Rule for clustered processes is not associated with the instanceCount column, but instead is associated with the clusterCount column.
The severity for cells is based on the value assigned to 2 variables:
activeActiveCritical: the number of processes at which the severity is set to Critical (usually 0)
activeActiveOK: the number of processes expected to exist in the cluster
The default Rule will set the severity of the cell to:
OK if the clusterCount value is equal to the value set for the activeActiveOK variable
Warning if the clusterCount value is less than the value set for the activeActiveOK variable
Critical if the clusterCount value is equal to the value set for the activeActiveCritical variable
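The following Python sketch illustrates how the two variables drive the Active – Active severity. It mirrors the order of the checks in the Active Active Rule in the Appendix (Critical first, then OK, then Warning). The function and parameter names are hypothetical and are not part of any Geneos API.

```python
from typing import Optional


def active_active_severity(
    cluster_count: int,
    active_active_ok: int,
    active_active_critical: int = 0,
) -> Optional[str]:
    """Return the severity for a clusterCount cell of an Active-Active cluster."""
    if cluster_count == active_active_critical:
        return "critical"
    if cluster_count == active_active_ok:
        return "ok"
    if cluster_count < active_active_ok:
        return "warning"
    # The Rule has no branch for counts above the expected value.
    return None


# A two-node cluster, as in the FX example: 0 -> critical, 1 -> warning, 2 -> ok
for count in (0, 1, 2):
    print(count, active_active_severity(count, active_active_ok=2))
```

Because the Critical check is evaluated first, a cluster where activeActiveCritical and activeActiveOK were set to the same value would always report Critical at that value.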
The document will not describe how to configure individual processes, as it is expected that this information is either already known or available from other sources.
Where there are specific differences or additions to process configuration to accommodate the required type of process monitoring, these details will be included.
Processes should be created in accordance with normal practice.
The default Rule for Single Instance Processes (see Processes Rules in the Appendix section) has been configured to select instanceCount cells where the Managed Entity Attribute CLUSTER TYPE is equal to Single and where the Sampler is using the PROCESSES Plugin.

N.B. If there is no requirement for clustered processes, the target for this Rule can be changed to select all instanceCount cells, and this would remove the requirement to perform the following action.
To ensure that Single Instance processes use this Rule it is necessary to add a CLUSTER TYPE Attribute with a value of Single to the relevant Managed Entity Group or Managed Entity
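As a rough illustration, the thresholds applied by the Single Instance Rule can be expressed as follows. This is a Python sketch of the logic only. The function name is hypothetical and the code plays no part in the Gateway configuration.

```python
def single_instance_severity(instance_count: int) -> str:
    """Return the severity for an instanceCount cell of a Single Instance process."""
    if instance_count < 1:
        return "critical"  # the process is not running
    if instance_count > 1:
        return "warning"   # more instances than expected
    return "ok"            # exactly one instance


for count in (0, 1, 2):
    print(count, single_instance_severity(count))
```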

If processes are not single instance processes, then they will be clustered in some way to provide resilience and/or load balancing.
Active – Passive processes are used for resilience only. In these instances there will be one process that is up and running and a second process that is inactive. If the active process fails, the secondary process will become active (this can be either manual or automatic and is not relevant to the monitoring of the cluster).
There should therefore only ever be one process of an Active – Passive pair active at any time and the monitoring is configured to reflect this.
N.B. Active - Passive processes are not clustered in the technical sense of the term, but can be considered clustered in terms of the way in which the processes are monitored here.
Active – Active processes are used for resilience and load balancing. In these instances there can be any number of processes in the cluster. All processes are active and the application is available through any one of the instances. When one of the processes fails the application is still available, although the connections to the application will be reduced and the consequent load on the remaining instances will be increased.
The monitoring for Active – Active processes is configured to reflect the number of processes in any cluster and also the number at which a Critical alert should be raised. By default, a Warning alert is raised if the number of processes drops below the number of processes in the cluster, but this can be changed to be any required number.
The Rule that calculates the number of processes in a cluster is called Cluster Count and has a target of any cell in the clusterCount column where the row name is not equal to *#*, i.e. all Summary information rows.

The Rule will add up the values of all of the instanceCount cells where the Row Name is the same as the target $rowName and the Type for the Process is the same as the target $samplerType.
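The calculation performed by the Cluster Count Rule can be pictured with the Python sketch below. It is not how the Gateway implements the total function. The ProcessRow class, the entity names and the sampler type "FX" are hypothetical stand-ins used only to show the matching on row name and sampler type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRow:
    managed_entity: str  # hypothetical entity name
    sampler_type: str    # corresponds to $samplerType
    row_name: str        # corresponds to $rowName
    instance_count: int


def cluster_counts(rows):
    """Sum instanceCount per (sampler type, row name) across all entities."""
    totals = defaultdict(int)
    for row in rows:
        if "#" in row.row_name:
            continue  # per-instance rows are excluded; only Summary rows count
        totals[(row.sampler_type, row.row_name)] += row.instance_count
    return dict(totals)


rows = [
    ProcessRow("primary-host", "FX", "FX Pricer", 1),
    ProcessRow("secondary-host", "FX", "FX Pricer", 0),
    ProcessRow("primary-host", "FX", "FX Connect", 1),
    ProcessRow("secondary-host", "FX", "FX Connect", 1),
    ProcessRow("primary-host", "FX", "FX Connect #1", 1),
]
print(cluster_counts(rows))
```

Each Summary row for a given process then shows the same clusterCount on every server, which is why the process name must be identical across the cluster.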

The Path for the Path alias looks as follows:

N.B. There are two important factors relating to this configuration:
The Cluster Count Rule implies that the Process Sampler has been set to show Summary information (this should normally be the case, otherwise the instanceCount column would not be available for the underlying processes).
The total function in the Rule implies that Compute Engine is included in the licence for the client. If this is not the case, then it would not be possible to configure monitoring for clustered processes in this manner.
The above Rule has a target of the clusterCount column. This column does not exist by default in the Processes Sampler configuration and must be added for both Active – Passive and Active – Active processes.
To add the column it is necessary to:
Configure the relevant Processes Sampler to have a clusterCount Column, by setting the Additions options in the Dataviews section of the Advanced tab settings

The Additions setting should contain a single setting for the clusterCount Column, as follows:

It is also necessary to configure each process within the cluster to have the same name. The best way to achieve this is to ensure that the same Sampler configuration is used for both Managed Entities:

Processes should be created in accordance with normal practice and the additional cluster procedures outlined in Clustered Processes Rule and Clustered Processes Sampler above.
The default Rule for Active – Passive Processes (see Processes Rules in the Appendix section) has been configured to select clusterCount cells where the Managed Entity Attribute CLUSTER TYPE is equal to ActivePassive and where the Sampler is using the PROCESSES Plugin.

To ensure that Active – Passive processes use this Rule it is necessary to add a CLUSTER TYPE Attribute with a value of ActivePassive to the relevant Managed Entity Group or Managed Entity.

Processes should be created in accordance with normal practice and the additional cluster procedures outlined above in Clustered Processes Rule and Clustered Processes Sampler.
The default Rule for Active – Active Processes (see Processes Rules in the Appendix section) has been configured to select clusterCount cells where the Managed Entity Attribute CLUSTER TYPE is equal to ActiveActive and where the Sampler is using the PROCESSES Plugin.
The Rule checks the cell value against the value of one of two Variables: activeActiveCritical or activeActiveOK.

For this Rule to work for Active – Active processes it is necessary to assign values to the relevant Variables, as follows:
Assign the appropriate values to the activeActiveCritical and activeActiveOK Variables

Assign the Environment to the relevant Managed Entities

N.B. An Environment should only be added to the Add types Section of a Managed Entity Group if it is certain that all Managed Entities belonging to the group will use these Variable settings. Variables placed here cannot be overwritten and in all other instances the Environment must be added to the Environment section in the Advanced Tab of the individual Managed Entities.
If the Warning alert is required at a threshold other than the default (any clusterCount value below the number of processes in the cluster), then the warning transaction in the Rule should be changed so that it uses a different Variable (e.g. $activeActiveWarning), and the additional Variable should be added to the list of Variables in the Environment in Step 1 above.
To ensure that Active - Active processes use this Rule it is necessary to add a CLUSTER TYPE Attribute with a value of ActiveActive to the relevant Managed Entity Group or Managed Entity

Using the CLUSTER TYPE Attribute it is possible for different Managed Entities within the same Managed Entity Group to be configured to display different cluster types.
However, it is not possible to use the CLUSTER TYPE Attribute where there is a requirement to have different processes within the same Managed Entity display different cluster types.
To enable this, the following procedure should be followed:
Check the priority of the Active Active and Active Passive Rules - The Rule with the highest priority (lowest number) is the Cluster Type that must be used at Row level
Ensure that the Cluster Type with the lower priority Rule is set as the CLUSTER TYPE Attribute for the Managed Entity (if this makes administration difficult it might be easier to create a separate Managed Entity for the different cluster process)
Add the relevant Row(s) for the Cluster Type with the higher priority as separate Targets in the relevant Active Active or Active Passive Rules. For instance, to set the FX Pricer process to Active – Passive, add the target for that process to the ActivePassive Rule:

Check that the output is displaying correctly for the individual rows:

N.B. This procedure can also be followed to add a Single Cluster Type to an individual Row within a Managed Entity that already uses either the Active Active Rule or the Active Passive Rule. However, it will also be necessary to additionally create a separate Rule to set the Severity to Undefined for the clusterCount column of the process
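The priority check in the first step of the procedure above can be illustrated with the priorities used by the default Rules in the Appendix (Active Passive 10, Active Active 20). The Python sketch below only shows the "lowest number wins" comparison. The function name is hypothetical.

```python
def row_level_cluster_type(rule_priorities):
    """Return the Rule with the highest priority, i.e. the lowest number."""
    return min(rule_priorities, key=rule_priorities.get)


# Priorities as set in the Appendix Rules
priorities = {"Active Passive": 10, "Active Active": 20}
print(row_level_cluster_type(priorities))
```

With these default priorities, Active Passive would be the Cluster Type added as a Row-level target, and Active Active would be set as the CLUSTER TYPE Attribute of the Managed Entity.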
The FKM plugin is one of the key elements within Geneos to provide application monitoring.
The functionality of the plugin is extremely flexible and versatile to accommodate a wide variety of use case scenarios. Although this is useful, it often leaves clients with a basic configuration that is unsuited to the continuous improvement necessary to provide the most efficient monitoring over time.
These Best Practices have been developed to encourage the continuous improvement of log file monitoring.
Once the Log File monitoring has been configured as detailed in the Log File General Configuration section and has been running long enough for some initial changes to be made, as described in the Fine-Tuning Log File Monitoring section, the output for a specific log file might look similar to the following:

A row is displayed each time a specified keyword is encountered in any line within the monitored file.
The rows are grouped by the keyword that has been encountered in the file. The triggerCount column indicates the number of times a particular keyword has been encountered within the file since the last time the trigger was cleared or the file accepted. As well as showing how often a message has been detected, this also helps the user to decide how relevant a particular message might be.
The FKM Sampler has been configured in such a way that an indicative message is both appended to the row name and pre-pended to the triggerDetails cell.
The example shows that there have been 3 Garbage Collector.. errors, 483 License Errors, 2 Memory 70% errors and 2 Unconfirmed Errors.
Use the Show Tables option from the right-click menu for a file to see which keywords are being searched for within the file.

This will show something similar to the following:

Ignore result in the entire line being discarded from further searches if the key is encountered
Failed or Warning
(Regex) at the end of a key indicates a case-sensitive Regex search
Severity of a key within a specific table can be set independently so as to be different from the severity of the table if required
Message: indicates the text that is appended to the row name and prepended to the triggerDetails cell value
Use the View File Near this Trigger from the right-click menu for a trigger to investigate a specific keyword

This will display the line containing the keyword encountered highlighted in blue, together with a section of log file both before and after the event.

The messages for Memory 70% etc. in the FKM DataView indicate that the keyword search has been configured for specific error messages.
All messages found for these specific keyword searches should be from the same source event. The Message text should indicate this, allowing the user to understand the situation more clearly, reducing the time taken to investigate and clear the event.
The Unconfirmed Errors (or Unconfirmed Failures and Unconfirmed Aborts) row shows messages matched by non-specific keyword searches, such as the words error, fail or abort, which can match messages from potentially very different events.
Continuous improvement relies on being able to sort these Unconfirmed messages into messages that can be either ignored or messages that should be monitored.
It is therefore important to be able to see what the actual messages are for these Unconfirmed Errors and to decide whether each specific message should be ignored or configured as a separate key.
Use the Trigger Details option from the right-click menu for a trigger (in this case the Unconfirmed Errors trigger) to display all of the triggers that have been encountered for the key.

This will display all of the separate triggers for the Unconfirmed Error keyword in question (this is the keyword ERROR as shown by the Show Tables output).
Using the Find... option from the right-click menu for the Output window, it is then possible to search for all occurrences of the word Error in the output and decide how the individual messages should be handled.

In this example, the first message seems to be an information message and the various occurrences of Error... relate to a non-error state.
The most suitable non-error indicator should be added to the <APP>_IGNORE table, or to the GENERAL_IGNORE table if it is thought that the format can be used to ignore similar messages for other applications as well.
Error Type 6 should probably not be used, as it might also relate to different events.
Error Bad connection status between ... should be used to create a separate trigger for this event and placed in the <APP>_WARNING table.
In this manner, the number of matches from the Unconfirmed Error keyword should reduce and the number of meaningful, relevant messages should increase.
The purpose of the above configuration is to aid the creation of a system that monitors application log files for pertinent events only and which allows the user to become pro-active in their response to events.
This should reduce the number of application and system outages and so increase the availability and stability of applications.
It should be taken as a matter of good practice to only configure a message with a FKM Severity of Fail if the message indicates an actual system or application failure.
All other message events should be set to an FKM Severity of Warning (or Ignore).
In order to fine tune the log file monitoring the following procedures should be followed:
Unconfirmed Error messages as soon as messages appear
Unconfirmed Error messages as:
Once the initial fine-tuning stage has been completed there should be no (or very infrequent) Unconfirmed Error messages (please note that this is a gradual process that can take up to 3 months or more to complete, depending on how much time can be allocated to this work).
After this time there might be a large number of entries in the various IGNORE tables and this has the potential to degrade the performance of the FKM Plugin.
It is possible at this point to remove the Unconfirmed Error, Unconfirmed Abort and Unconfirmed Fail keys (i.e. the UNCONFIRMED_MESSAGES FKM Table can be removed from all Sampler configuration). This will allow all of the IGNORE keys (that contained those key words) to also be deleted and so improve the performance of the FKM Plugin.
The disadvantage of doing this is that any message that has previously not been encountered that contains one of the general Unconfirmed message keys will not get picked up by the system.
For these errors, the support team initially become re-active as they can only enter the new message once it has become apparent through the behaviour of the system or application.
It is up to the application support team to weigh the performance of the FKM plugin against the number of potential events to which they might become re-active.
It has already been stated that the fine-tuning stage can take a number of months to complete.
During this time it is likely that the FKM DataViews and the Managed Entities to which they belong will be constantly at a Warning severity. This will be due to the high number of:
Licence Error messages in the above example
While this state continues, any system or application failure messages with an FKM Severity of Fail will be spotted immediately as the Severity of the DataView and Managed Entity will change to Critical.
However, in cases where it is not possible to continually check the status of the FKM alerts, it will be difficult to spot occurrences of confirmed messages with an FKM Severity of Warning, as these will be masked by the continuous Warning state of the DataView and Managed Entity.
If this is likely to be the case it is advised to perform the following in order that these messages can be spotted during the fine-tuning phase:
<APP_WARNING> table where it is known that an action is required
<APP_WARNING> table to Fail in the relevant FKM sampler during the initial fine-tuning phase
<APP_INVESTIGATE> table
<APP_INVESTIGATE> table to Warning in the relevant FKM Sampler
<APP_INVESTIGATE> table and the initial fine-tuning stage is completed, change the severity of the <APP_WARNING> table to Warning in the relevant FKM sampler
Each instance of an FKM Sampler should have the same base configuration.
The configuration for this is in the advanced tab of each FKM Sampler and it is important to remember to use this configuration whenever new FKM Samplers are created:

For each new log file that is configured, ensure that the following, minimum Table definitions are set:

The GENERAL_IGNORES Table can initially be empty.
The UNCONFIRMED_MESSAGES Table should contain entries for any general keywords required, such as:

Each key should have configuration similar to the following:

warning in case the Table is accidentally set with a Failed severity within the FKM Sampler definition
It may also be worth initially setting a Clear Time on the file, as there can potentially be thousands of Unconfirmed messages.

These can be removed once some initial message management has taken place.

Whenever Unconfirmed Errors (or Aborts and Fails) appear, these should be investigated to decide whether each message can be ignored or needs to be created as an individual key.
Messages should not be left to reappear as Unconfirmed Errors (or Abort and Fails).
As detailed in the Usage section above, the Trigger Details option can be used to discover the different triggers for each occurrence of the relevant keyword.
Each separate occurrence should then be created as either an ignore key or as a separate alert key.
It also needs to be decided whether an ignore key is added to the GENERAL_IGNORE table or to a separate <APPLICATION>_IGNORE table, remembering that too many entries in the GENERAL_IGNORE table can have a detrimental effect on the performance of the FKM Plugin.
To add a new Ignore Key to an existing IGNORE Table, perform the following:
Add a new Key line and select Ignore Key from the drop-down list

Add the Ignore Key details as required, making sure to select Basic from the Rules drop-down list where the key includes Regex characters that are to be treated as plain text rather than as Regex

Save the configuration
To add a new Key to an existing WARNING or FAIL Table, perform the following:
Add a new Key line and select Key from the drop-down list

Add the Key details and the Message details as required, ensuring to select the relevant option from the Rules dropdown list as required

Save the configuration
If there are no <APP>_WARNINGS, <APP>_FAIL, or <APP>_IGNORES Tables and the configuration requires these to be created, perform the following:
Create the required Table

Create the required keys as described above
Add the Table to the relevant FKM Sampler or Sampler Include file
Select Severity Warning for WARNING Tables and Severity Fail for FAIL and IGNORE Tables

Select Key table type fkmTable

Select the required FKM Table

Ensure that the Table order is correct by using the right-click Move row up or Move row to top options to re-order any Tables

The ordering of the Tables should be as follows:
<APP>_IGNORES
GENERAL_IGNORES
<APP>_FAILS
<APP>_WARNINGS
UNCONFIRMED_MESSAGES
Ensure that the Table is added to all log files for the application
Save the configuration
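The effect of the Table ordering given above can be sketched in Python as follows. This is an illustration only and not the FKM Plugin's matching engine. It assumes that the first Table containing a matching key decides the outcome for a line, and that matching is case-insensitive. The key "Error count: 0" and the failure key are hypothetical, and APP stands in for the application name.

```python
import re

# Tables in the recommended order: (table name, severity, keys).
# Key contents are illustrative only.
TABLES = [
    ("APP_IGNORES", "ignore", [r"Error count: 0"]),
    ("GENERAL_IGNORES", "ignore", []),
    ("APP_FAILS", "fail", [r"Application terminated"]),
    ("APP_WARNINGS", "warning", [r"Bad connection status"]),
    ("UNCONFIRMED_MESSAGES", "warning", [r"error", r"fail", r"abort"]),
]


def classify(line):
    """Return (table, severity) for the first matching Table, or None."""
    for table, severity, keys in TABLES:
        for key in keys:
            if re.search(key, line, re.IGNORECASE):
                return table, severity
    return None  # no key matched; the line raises no trigger


for line in (
    "Error count: 0 after restart",
    "ERROR Bad connection status between A and B",
    "ERROR unexpected token",
    "Heartbeat received",
):
    print(line, "->", classify(line))
```

Placing the IGNORE Tables first means a known non-error line is discarded before the general keywords in UNCONFIRMED_MESSAGES can match it, and placing UNCONFIRMED_MESSAGES last means it only catches messages that no specific key has claimed.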
<ruleGroup name="Processes">
<ruleGroup name="Cluster">
<rule name="Active Active">
<targets>
<target>/geneos/gateway/directory/probe/managedEntity[(attr("CLUSTER TYPE")="ActiveActive")]/sampler[(param("PluginName")="PROCESSES")]/dataview/rows/row/cell[(@column="clusterCount")]</target>
</targets>
<priority>20</priority>
<evaluateOnDataviewSample>true</evaluateOnDataviewSample>
<block>
<if>
<equal>
<dataItem>
<property>@value</property>
</dataItem>
<var ref="activeActiveCritical"></var>
</equal>
<transaction>
<update>
<property>state/@severity</property>
<severity>critical</severity>
</update>
</transaction>
<if>
<equal>
<dataItem>
<property>@value</property>
</dataItem>
<var ref="activeActiveOK"></var>
</equal>
<transaction>
<update>
<property>state/@severity</property>
<severity>ok</severity>
</update>
</transaction>
<if>
<lt>
<dataItem>
<property>@value</property>
</dataItem>
<var ref="activeActiveOK"></var>
</lt>
<transaction>
<update>
<property>state/@severity</property>
<severity>warning</severity>
</update>
</transaction>
</if>
</if>
</if>
</block>
</rule>
<rule name="Active Passive">
<targets>
<target>/geneos/gateway/directory/probe/managedEntity[(attr("CLUSTER TYPE")="ActivePassive")]/sampler[(param("PluginName")="PROCESSES")]/dataview/rows/row/cell[(@column="clusterCount")]</target>
</targets>
<priority>10</priority>
<block>
<if>
<equal>
<dataItem>
<property>@value</property>
</dataItem>
<integer>1</integer>
</equal>
<transaction>
<update>
<property>state/@severity</property>
<severity>ok</severity>
</update>
</transaction>
<transaction>
<update>
<property>state/@severity</property>
<severity>critical</severity>
</update>
</transaction>
</if>
</block>
</rule>
<rule name="Cluster Count">
<targets>
<target>/geneos/gateway/directory/probe/managedEntity/sampler[(param("PluginName")="PROCESSES")]/dataview/rows/row[not(contains(@name,"#"))]/cell[(@column="clusterCount")]</target>
</targets>
<priority>100</priority>
<pathVariables>
<pathVariable name="Type">
<value>
<target>
<samplerType></samplerType>
</target>
</value>
</pathVariable>
<pathVariable name="Row">
<value>
<target>
<rowName></rowName>
</target>
</value>
</pathVariable>
</pathVariables>
<pathAliases>
<pathAlias name="Cluster Rows">/geneos/gateway/directory/probe/managedEntity/sampler[(@type=var("Type"))][(param("PluginName")="PROCESSES")]/dataview/rows/row[(@name=var("Row"))]/cell[(@column="instanceCount")]</pathAlias>
</pathAliases>
<evaluateOnDataviewSample>true</evaluateOnDataviewSample>
<block>
<transaction>
<update>
<property>@value</property>
<total>
<dataItems>
<pathAlias ref="Cluster Rows"></pathAlias>
<property>@value</property>
</dataItems>
</total>
</update>
</transaction>
</block>
</rule>
</ruleGroup>
<ruleGroup name="Standard">
<rule name="Instance Count 1">
<targets>
<target>/geneos/gateway/directory/probe/managedEntity[(attr("CLUSTER TYPE")="Single")]/sampler[(param("PluginName")="PROCESSES")]/dataview/rows/row[not(contains(@name,"#"))]/cell[(@column="instanceCount")]</target>
</targets>
<priority>1</priority>
<block>
<if>
<lt>
<dataItem>
<property>@value</property>
</dataItem>
<integer>1</integer>
</lt>
<transaction>
<update>
<property>state/@severity</property>
<severity>critical</severity>
</update>
</transaction>
<if>
<gt>
<dataItem>
<property>@value</property>
</dataItem>
<integer>1</integer>
</gt>
<transaction>
<update>
<property>state/@severity</property>
<severity>warning</severity>
</update>
</transaction>
<transaction>
<update>
<property>state/@severity</property>
<severity>ok</severity>
</update>
</transaction>
</if>
</if>
</block>
</rule>
</ruleGroup>
</ruleGroup>