Software Event Broker Health Monitoring

The performance and stability of the Solace PubSub+ software event broker are affected by the performance and stability of the underlying system resources. The software event broker's health monitoring feature provides ongoing and on-demand health monitoring tools to assess infrastructure issues that may impact event broker stability. System administrators can use the health monitoring feature to characterize the infrastructure and to monitor, at runtime, the system resources assigned to the event broker. This information can be used to correlate event broker performance and stability issues with other infrastructure events, and to assess the suitability of the infrastructure for event broker deployments.

There are two aspects to the software event broker's health monitoring feature, and they are described in the following sections:

  • Start-up Characterization—Detailed characterization and reporting of system resources at instance start-up using the POST (power-on self-test) utility.
  • Direct Runtime Metrics—Runtime monitoring of critical application functionality with early notification for degraded performance or stability.

Start-up Characterization

The POST service runs automatically at start-up to characterize the available system resources. The diagnostic information provided in the POST result can be used to assess the performance and suitability of the infrastructure for event broker deployment. POST results can be viewed after boot-up using the show system post Solace CLI command, or by executing the docker exec -t <container_name> post command, as shown in the examples below.

Running POST On-Demand

You can run the POST utility on-demand, before starting the container, to perform a detailed characterization of the available system resources. The POST results are displayed on completion.

To perform characterization of available system resources, execute the following docker run command in an independent container.

$ docker run -it --rm --env 'system_scaling_maxconnectioncount=100' <container_image_id> post

 

Note: To ensure appropriate system resources have been allocated, refer to Scaling Tier Deployment Considerations.

Direct Runtime Metrics

Direct runtime metrics are in-service measurements of critical application functionality. They correlate directly with system performance as it is affected by the availability, performance, or stability of the underlying system resources. Each metric has a corresponding event with a configurable threshold, and the metric values can be viewed using the show system health command. A SYSTEM_SYSTEM_HEALTH_NONCRITICAL_NOTIFICATION event log is raised whenever a measurement exceeds its threshold. These events are throttled to one per minute to avoid cluttering the event log.
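The threshold-and-throttle behavior described above can be illustrated with a minimal Python sketch. It is not the event broker's implementation: the class name ThrottledNotifier, its parameters, and the choice to measure the one-minute interval from the last raised event are all assumptions made for illustration.

```python
import time


class ThrottledNotifier:
    """Hypothetical helper: report a threshold breach at most once per interval."""

    def __init__(self, threshold_us, interval_s=60.0, clock=time.monotonic):
        self.threshold_us = threshold_us
        self.interval_s = interval_s
        self.clock = clock          # injectable so the logic can be tested
        self._last_raised = None    # time of the last raised event, if any
        self.suppressed = 0         # breaches swallowed by the throttle

    def observe(self, value_us):
        """Return True if this measurement should produce an event log entry."""
        if value_us <= self.threshold_us:
            return False            # within the acceptable range
        now = self.clock()
        if self._last_raised is not None and now - self._last_raised < self.interval_s:
            self.suppressed += 1    # breach, but an event was raised recently
            return False
        self._last_raised = now
        return True
```

Injecting the clock keeps the sketch deterministic under test. A real monitor would call observe() with each new measurement and write a log entry whenever it returns True.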

These are the direct runtime metrics:

  • Disk I/O latency
  • Compute latency
  • Network latency
  • Mate-link latency

Configuring Maximum Thresholds

You can configure the maximum threshold for each direct runtime metric. To begin, enter the following Solace CLI commands:

solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)#

Disk I/O Latency

Disk I/O latency statistics are collected wherever latency may affect performance and the monitoring does not significantly impact performance or stability. The software event broker raises an asynchronous event when measured disk I/O latency exceeds the configured threshold value. A high latency may indicate that the disk, or access to the disk, is oversubscribed.
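To get a feel for what a disk latency sample looks like, the following Python sketch times a small write that is forced to stable storage. It is an illustration only and does not reflect how the event broker collects this metric. The function name sample_disk_latency_us and the 4 KiB payload size are assumptions.

```python
import os
import tempfile
import time


def sample_disk_latency_us(directory=None, size=4096):
    """Time one write + fsync of `size` bytes, in microseconds (illustrative)."""
    payload = os.urandom(size)
    # The temporary file is removed automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter_ns()
        f.write(payload)
        f.flush()                 # push Python's buffer to the OS
        os.fsync(f.fileno())      # ask the OS to commit the data to disk
        elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns // 1000


latency_us = sample_disk_latency_us()
print(f"write+fsync latency: {latency_us} us")
```

Passing a directory on the volume of interest lets the sample exercise that particular disk rather than the default temporary location.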

To configure the maximum threshold for disk I/O latency, enter the following:

solace(configure/system/health)# disk-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the disk I/O latency metric.

The no disk-latency-high-threshold command restores the default value.

Compute Latency

This metric compares the execution time of well-known functions (for example, epoll timers) against invariant timers (for example, the CPU TSC). A high latency may indicate that the CPU is oversubscribed, that the CPU is being blocked by another application, or that access to virtual memory is being delayed. The software event broker raises an asynchronous event when the measured latency for CPU instructions exceeds the configured threshold value.
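The idea of comparing a timed operation against an independent clock can be sketched in Python as follows. This is an analogy rather than the event broker's method: time.sleep stands in for the epoll timer, time.perf_counter_ns stands in for the invariant timer, and the function name sample_timer_overshoot_us is hypothetical.

```python
import time


def sample_timer_overshoot_us(requested_ms=10):
    """Return how much longer than requested a sleep took, in microseconds."""
    requested_ns = requested_ms * 1_000_000
    start = time.perf_counter_ns()
    time.sleep(requested_ms / 1000)       # the operation with a known duration
    elapsed_ns = time.perf_counter_ns() - start
    # Clamp at zero in case clock resolution makes the sleep look short.
    return max(0, elapsed_ns - requested_ns) // 1000


overshoot_us = sample_timer_overshoot_us()
print(f"timer overshoot: {overshoot_us} us")
```

On a lightly loaded host the overshoot is typically small. Large or erratic values suggest that the process is not being scheduled promptly, which is the kind of condition this metric is meant to expose.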

To configure the maximum threshold for the compute latency metric, enter the following:

solace(configure/system/health)# compute-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the compute latency metric.

The no compute-latency-high-threshold command restores the default value.

Note: In some environments, an unreliable Time Stamp Counter (TSC) can cause erroneous or misleading events. In that case, this metric may need to be effectively disabled by setting its threshold to a very large value.

Network Latency

This metric measures the round-trip network transaction latency (for example, for a ping operation) between the instance and the members of its high-availability (HA) group. A high latency may indicate a faulty or misconfigured physical network interface, inadequate or faulty network cabling, a network storm or loop, or an overutilized network. The software event broker raises an asynchronous event when the measured network latency exceeds the configured threshold value.
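A rough external check of round-trip time between two hosts can be sketched in Python by timing a TCP connection handshake. This substitutes a TCP connect for the ping-style transaction mentioned above and is not how the event broker measures this metric. The function name tcp_connect_rtt_us is hypothetical, and the demonstration connects to a throwaway listener on the loopback interface rather than to a real HA mate.

```python
import socket
import time


def tcp_connect_rtt_us(host, port, timeout_s=2.0):
    """Time a TCP connect to host:port, in microseconds (illustrative)."""
    start = time.perf_counter_ns()
    with socket.create_connection((host, port), timeout=timeout_s):
        elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns // 1000


# Demonstration against a local listener on an OS-assigned port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
rtt_us = tcp_connect_rtt_us("127.0.0.1", listener.getsockname()[1])
listener.close()
print(f"connect round trip: {rtt_us} us")
```

To use the sketch against a real peer, replace the loopback address and port with a host and TCP port that is known to accept connections.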

To configure the maximum network latency threshold, enter the following:

solace(configure/system/health)# network-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the network latency metric.

The no network-latency-high-threshold command restores the default value.

Mate-Link Latency

This metric measures the completion time of a transaction over the mate-link. It is an aggregate metric, because a mate-link transaction combines compute, disk, and network latencies. This is a direct metric with significant impact on the overall stability of a high-availability (HA) group. The software event broker raises an asynchronous event when the measured mate-link latency exceeds the configured threshold.

To configure the maximum mate-link latency threshold, enter the following:

solace(configure/system/health)# mate-link-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the mate-link latency metric.

The no mate-link-latency-high-threshold command restores the default value.

Viewing Metrics

The direct runtime application metrics can be viewed using the show system health Solace CLI command. In the example shown below, minimum, maximum, average, and current values are displayed.

solace(configure/system/health)# show system health

Statistics since:       Sep 26 2018  09:38:45  EDT

                        Units    Min       Max       Avg      Curr     Thresh     Events
                        ------   ------    ------    ------   ------   --------   -------
Disk Latency            us       0         24054     0        0        10000000   0
Compute Latency         us       937       14502     1127     1078     500000     0
Network Latency         us       0         0         0        0        2000000    0
Mate-Link Latency       us       0         0         0        0        2000000    0

Note: The clear system health stats command resets all statistics to 0.
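Output in the format of the show system health example can be post-processed by a script, for instance to flag metrics whose maximum is approaching the threshold. The Python sketch below assumes the column layout from that example (a metric name, a units column, then six integer columns). The function names parse_health and near_threshold and the 80% default are hypothetical choices, not part of the product.

```python
SAMPLE = """\
Statistics since:       Sep 26 2018  09:38:45  EDT

                        Units    Min       Max       Avg      Curr     Thresh     Events
                        ------   ------    ------    ------   ------   --------   -------
Disk Latency            us       0         24054     0        0        10000000   0
Compute Latency         us       937       14502     1127     1078     500000     0
Network Latency         us       0         0         0        0        2000000    0
Mate-Link Latency       us       0         0         0        0        2000000    0
"""

FIELDS = ("min", "max", "avg", "curr", "thresh", "events")


def parse_health(text):
    """Map each metric name to its units and integer column values."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row is: name words, a units token, then six integers.
        if len(parts) < 8 or not all(p.isdigit() for p in parts[-6:]):
            continue
        name = " ".join(parts[:-7])
        values = dict(zip(FIELDS, map(int, parts[-6:])))
        rows[name] = {"units": parts[-7], **values}
    return rows


def near_threshold(rows, fraction=0.8):
    """Names of metrics whose Max is at least `fraction` of Thresh."""
    return [name for name, r in rows.items()
            if r["thresh"] > 0 and r["max"] >= fraction * r["thresh"]]


rows = parse_health(SAMPLE)
print(near_threshold(rows))
```

With the sample values, no metric is within 80% of its threshold, so the list is empty. Lowering the fraction makes the check more sensitive.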