Monitoring the Health of the Software Event Broker

The performance and stability of the Solace PubSub+ software event broker are affected by the performance and stability of the underlying system resources. The software event broker's health monitoring feature provides ongoing and on-demand tools to assess infrastructure issues that may impact event broker stability. System administrators can use the health monitoring feature to characterize the infrastructure and to monitor, at runtime, the system resources assigned to the event broker. This information can be used to correlate event broker performance and stability issues with other infrastructure events, and to assess the suitability of the infrastructure for event broker deployments.

You can monitor the software event broker's health in the following ways:

Assessing System Resources using POST: Use the POST (power-on self-test) utility to obtain a detailed characterization and report of system resources when the container starts up.
Monitoring System Resources with Direct Runtime Metrics: Monitor critical application functionality at runtime, with early notification of degraded performance or stability.

Assessing System Resources using POST

The POST service automatically runs at start-up to characterize the available system resources. The diagnostic information provided in the POST result can be used to assess the performance and suitability of the infrastructure for event broker deployment.

You can view POST results after boot-up by running either:

the show system post Solace CLI command
the docker exec -t <container_name> post command

You can also run the POST utility on-demand, before starting the container, by running the following command:

$ docker run -it --rm --env 'system_scaling_maxconnectioncount=100' <container_image_id> post

The output of the POST utility looks similar to the following (different items might appear depending on the type of image you are using; portions of this example have been omitted for brevity):

To ensure appropriate system resources have been allocated, refer to System Resource Requirements.
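
If you want to script the on-demand POST run shown above, the following is a minimal sketch. The function names build_post_command and run_post are hypothetical helpers introduced for illustration; only the docker run arguments come from the command above. It assumes Docker is installed, the image is available locally, and the script is run from an interactive terminal (the -it flags require one).

```python
import shlex
import subprocess


def build_post_command(image_id, max_connections=100):
    """Return the argument list for an on-demand POST run against an image.

    Mirrors the documented command:
      docker run -it --rm --env 'system_scaling_maxconnectioncount=N' <image> post
    """
    return [
        "docker", "run", "-it", "--rm",
        "--env", f"system_scaling_maxconnectioncount={max_connections}",
        image_id, "post",
    ]


def run_post(image_id, max_connections=100):
    """Run POST and return the docker exit code (hypothetical helper).

    Requires Docker and a TTY; nothing is captured, so the POST report
    is written straight to the terminal.
    """
    cmd = build_post_command(image_id, max_connections)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode
```

Keeping the command construction separate from its execution makes it possible to review or log the exact command before anything is started.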

Monitoring System Resources with Direct Runtime Metrics

Direct runtime metrics are in-service measurements of critical application functionality. These measurements can help to determine whether the availability, performance, or stability of the underlying system resources is sufficient to provide the desired system performance.

Direct runtime metrics have corresponding events with configurable thresholds; you can view these thresholds with the show system health CLI command. The broker raises SYSTEM_SYSTEM_HEALTH_NONCRITICAL_NOTIFICATION events whenever system measurements are beyond the acceptable thresholds for the direct runtime metrics. These events are throttled at one event per minute to avoid cluttering the event log.
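
One consequence of the throttling described above is that a sustained threshold breach produces at most one event per minute, so the number of logged events can understate how often the threshold was exceeded. The sketch below illustrates that behavior in general terms. It is not the broker's implementation; the EventThrottle class and its suppressed counter are assumptions made for illustration.

```python
import time


class EventThrottle:
    """Allow at most one event per interval and count suppressed ones."""

    def __init__(self, interval_s=60.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_emit = None       # time of the last emitted event
        self.suppressed = 0         # breaches hidden since the last emit

    def should_emit(self):
        """Return True if an event may be raised now, False if throttled."""
        now = self.clock()
        if self.last_emit is None or now - self.last_emit >= self.interval_s:
            self.last_emit = now
            self.suppressed = 0
            return True
        self.suppressed += 1
        return False
```

When investigating an incident, the Max and Events columns of show system health are therefore a useful complement to the event log.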

Configuring Maximum Thresholds

You can set thresholds for the following direct runtime metrics:

Disk I/O Latency
Compute Latency
Network Latency
Mate-link Latency

Disk I/O Latency

Disk I/O latency statistics are collected wherever latency may affect performance and the monitoring does not significantly impact performance or stability. The software event broker raises an asynchronous event when measured disk I/O latency exceeds the configured threshold value. A high latency may indicate that the disk, or access to the disk, is oversubscribed.

To configure the maximum threshold for disk I/O latency, enter the following:

solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# disk-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the disk I/O latency metric.

The no disk-latency-high-threshold command restores the default value.

Compute Latency

This metric compares the execution time of well-known functions (for example, epoll timers) against invariant timers (for example, CPU TSC). A high latency may indicate that the CPU is oversubscribed, the CPU is being blocked by another application, or access to virtual memory is somehow degraded. The software event broker raises an asynchronous event when measured latency for CPU instructions exceeds the configured threshold value.

To configure the maximum threshold for the compute latency metric, enter the following:

solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# compute-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the compute latency metric.

The no compute-latency-high-threshold command restores the default value.

In some environments, an unreliable Time Stamp Counter (TSC) can cause erroneous or misleading events. In that case, you may need to effectively disable this metric by setting a very large threshold value.
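
The general idea behind this metric, comparing how long a timed operation actually took against a steady clock, can be illustrated with a small sketch. This is not how the broker measures compute latency: it uses Python's time.sleep and time.perf_counter_ns rather than epoll timers and the TSC, and the function names are hypothetical.

```python
import time


def timer_overshoot_us(requested_us=1000, samples=20):
    """Sleep for requested_us repeatedly; return each overshoot in microseconds.

    A consistently large overshoot suggests that the process is not being
    scheduled promptly, for example on an oversubscribed CPU.
    """
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        time.sleep(requested_us / 1_000_000)
        elapsed_us = (time.perf_counter_ns() - start) / 1000
        overshoots.append(elapsed_us - requested_us)
    return overshoots


def summarize(values):
    """Return min, max, and average, like the columns in show system health."""
    return {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}
```

Running summarize(timer_overshoot_us()) on an idle host and again on a heavily loaded one gives a rough feel for how scheduling delay shows up in this kind of measurement.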

Network Latency

This metric measures the round-trip network transaction latency (for example, for a ping operation) between the instance and members of its high-availability (HA) group. High latency may indicate a faulty or misconfigured physical network interface, inadequate or faulty network cabling, a network storm or loop, or an overutilized network. The software event broker raises an asynchronous event when the measured network latency exceeds the configured threshold value.

To configure the maximum network latency threshold, enter the following:

solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# network-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the network latency metric.

The no network-latency-high-threshold command restores the default value.

Mate-link Latency

This metric measures the completion time of a transaction over the mate-link. It is an aggregate metric because a mate-link transaction combines compute, disk, and network latencies. This is a direct metric with significant impact on the overall stability of a high-availability (HA) group. The software event broker raises an asynchronous event when the measured mate-link latency exceeds the configured threshold.

To configure the maximum mate-link latency threshold, enter the following:

solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# mate-link-latency-high-threshold <value>

Where:

<value> is the maximum threshold for the mate-link latency metric.

The no mate-link-latency-high-threshold command restores the default value.
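
If you manage several brokers, you might generate the four threshold commands above from a single set of values instead of typing them. The following is a minimal sketch: threshold_cli_lines is a hypothetical helper, the command names are the ones documented above, and the numeric value in the usage note is an arbitrary placeholder (this section does not state the units or valid range, so check the CLI reference before applying any value).

```python
# Maps a short metric name (an assumption for this sketch) to the
# documented CLI command under configure/system/health.
THRESHOLD_COMMANDS = {
    "disk": "disk-latency-high-threshold",
    "compute": "compute-latency-high-threshold",
    "network": "network-latency-high-threshold",
    "mate-link": "mate-link-latency-high-threshold",
}


def threshold_cli_lines(settings):
    """Return the CLI lines that apply the given thresholds.

    settings maps a metric name to a value, or to None to restore the
    default with the 'no' form of the command.
    """
    lines = ["enable", "configure", "system health"]
    for metric, value in settings.items():
        command = THRESHOLD_COMMANDS[metric]  # KeyError for an unknown metric
        lines.append(f"no {command}" if value is None else f"{command} {value}")
    return lines
```

For example, threshold_cli_lines({"disk": 5000, "network": None}) returns the three navigation commands followed by disk-latency-high-threshold 5000 and no network-latency-high-threshold.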

Viewing Direct Runtime Metrics

You can view the direct runtime application metrics using the show system health Solace CLI command. In the example shown below, minimum, maximum, average, and current values are displayed.

solace(configure/system/health)# show system health

Statistics since:       Oct 07 2020  09:38:45  EDT

                        Units    Min       Max       Avg      Curr     Thresh    Events
                        ------   ------    ------    ------   ------   -------   -------
Disk Latency            us       0         24054     0        0        10000000   0
Compute Latency         us       937       14502     1127     1078     500000     0
Network Latency         us       0         0         0        0        2000000    0
Mate-Link Latency       us       0         0         0        0        2000000    0

The clear system health stats command resets all stats to 0.
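
If you capture the show system health output as text, you can parse it to see how close each metric has come to its threshold. The following is a minimal sketch that assumes the column layout of the example above (name, units, then six integer columns). The function names are hypothetical, and other broker versions may format the output differently.

```python
SAMPLE = """\
Disk Latency            us       0         24054     0        0        10000000   0
Compute Latency         us       937       14502     1127     1078     500000     0
Network Latency         us       0         0         0        0        2000000    0
Mate-Link Latency       us       0         0         0        0        2000000    0
"""

FIELDS = ("min", "max", "avg", "curr", "thresh", "events")


def parse_health(text):
    """Parse metric rows into {name: {"units": ..., "min": ..., ...}}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row ends with a units token followed by six integers;
        # header, separator, and timestamp lines fail this check.
        if len(parts) < 8 or not all(p.isdigit() for p in parts[-6:]):
            continue
        row = {"units": parts[-7]}
        row.update(zip(FIELDS, (int(p) for p in parts[-6:])))
        rows[" ".join(parts[:-7])] = row
    return rows


def near_threshold(rows, fraction=0.5):
    """Return names of metrics whose Max reached fraction of the threshold."""
    return [name for name, r in rows.items() if r["max"] >= fraction * r["thresh"]]
```

With the sample above, near_threshold(parse_health(SAMPLE)) returns an empty list, because every Max value is well below half of its threshold.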
