Monitoring the Health of the Software Event Broker
The performance and stability of the Solace PubSub+ software event broker are affected by the performance and stability of the underlying system resources. The software event broker's health monitoring feature provides ongoing and on-demand health monitoring tools to assess infrastructure issues that may impact event broker stability. System administrators can use the health monitoring feature to characterize the infrastructure and to monitor, at runtime, the system resources assigned to the event broker. This information can be used to correlate event broker performance and stability issues with other infrastructure events, and to assess the suitability of the infrastructure for event broker deployments.
You can monitor the software event broker's health in the following ways:
- Assessing System Resources using POST: Use the POST (power-on self-test) utility to obtain a detailed characterization and report of system resources when the container starts up.
- Monitoring System Resources with Direct Runtime Metrics: Monitor critical application functionality at runtime, with early notification of degraded performance or stability.
Assessing System Resources using POST
The POST service automatically runs at start-up to characterize the available system resources. The diagnostic information provided in the POST result can be used to assess the performance and suitability of the infrastructure for event broker deployment.
You can view POST results after boot-up by running either:
- the show system post Solace CLI command
- the docker exec -t <container_name> post command
You can also run the POST utility on-demand, before starting the container, by running the following command:
$ docker run -it --rm --env 'system_scaling_maxconnectioncount=100' <container_image_id> post
The output of the POST utility looks similar to the following (different items might appear depending on the type of image you are using; portions of this example have been omitted for brevity):
----------------------------------------------------------------------
POST Results
----------------------------------------------------------------------
Version    : Solace PubSub+ Enterprise (9.7.0.X)
Time       : Wed, 07 Oct 2020 14:59:08+0000
Status     : PASSED
Violations :
----------------------------------------------------------------------
Configuration
----------------------------------------------------------------------
System Max Connections    : 200000
System Max Queue Messages : 240
----------------------------------------------------------------------
CPU
----------------------------------------------------------------------
CPUs  : 0, 1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9 (Host CPUs 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
Count : 12 (Minimum 12)
Core Frequency :
    Core 0 : 3193.5 MHz
    Core 1 : 3193.4 MHz
    . . .
    Core 9 : 3193.6 MHz
----------------------------------------------------------------------
Memory
----------------------------------------------------------------------
Host RAM            : 64.0 GiB
Host Swap Memory    : 2.0 GiB
Host Virtual Memory : 64.8 GiB (Minimum 51.3 GiB)
. . .
----------------------------------------------------------------------
Storage
----------------------------------------------------------------------
Device /dev/dm-0 :
    Size                   : 31.1 GB (Minimum 5.0 GB)
    Available              : 23139.0 MB
    Write Throughput       : 729.5 MB/s (Minimum 20.0 MB/s)
    Buffered Write Latency : 0.630 ms
    Direct Write Latency   : 0.290 ms
. . .
----------------------------------------------------------------------
Timing
----------------------------------------------------------------------
Clock Source     : tsc
TSC Synchronized : Yes
CPU TSC Flags :
    constant_tsc : Yes
    nonstop_tsc  : Yes
    tsc_reliable : Yes
To ensure appropriate system resources have been allocated, refer to System Resource Requirements.
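If you capture the POST output in a script (for example, as part of a deployment check), the summary fields at the top of the report can be extracted with a few lines of Python. The following is a minimal sketch only: it assumes the report prints one "Key : Value" field per line, as in the example above, and the function names parse_post_summary and post_passed are hypothetical helpers, not part of any Solace tooling.

```python
import re

# Summary fields that appear at the top of the example POST report.
SUMMARY_FIELD = re.compile(r"^\s*(Version|Time|Status|Violations)\s*:\s*(.*)$")


def parse_post_summary(post_output: str) -> dict:
    """Return the first occurrence of each summary field found in the report text."""
    summary = {}
    for line in post_output.splitlines():
        match = SUMMARY_FIELD.match(line)
        if match:
            key, value = match.group(1), match.group(2).strip()
            summary.setdefault(key, value)  # keep the first occurrence only
    return summary


def post_passed(post_output: str) -> bool:
    """True when the report's Status field reads PASSED."""
    return parse_post_summary(post_output).get("Status") == "PASSED"


# Sample text copied from the example report above.
SAMPLE = """\
----------------------------------------------------------------------
POST Results
----------------------------------------------------------------------
Version : Solace PubSub+ Enterprise (9.7.0.X)
Time : Wed, 07 Oct 2020 14:59:08+0000
Status : PASSED
Violations :
"""

if __name__ == "__main__":
    print(parse_post_summary(SAMPLE))
    print("POST passed:", post_passed(SAMPLE))
```

The text passed to these helpers would typically be the captured output of one of the POST commands shown earlier in this section.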
Monitoring System Resources with Direct Runtime Metrics
Direct runtime metrics are in-service measurements of critical application functionality. These measurements can help to determine whether the availability, performance, or stability of the underlying system resources is sufficient to provide the desired system performance.
Direct runtime metrics have corresponding events with configurable thresholds; you can view these thresholds with the show system health CLI command. The broker raises SYSTEM_SYSTEM_HEALTH_NONCRITICAL_NOTIFICATION events whenever system measurements are beyond the acceptable thresholds for the direct runtime metrics. These events are throttled at one event per minute to avoid cluttering the event log.
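To illustrate how threshold checks and throttling interact, here is a small Python sketch. It is not the broker's implementation: the EventThrottle class, the throttled_events function, and the fixed 60-second window are assumptions chosen only to mirror the "one event per minute" behavior described above.

```python
class EventThrottle:
    """Let through at most one event per interval (illustrative only)."""

    def __init__(self, interval_s: float = 60.0):
        self.interval_s = interval_s
        self._last_emit = None  # timestamp of the last event that was let through

    def should_emit(self, now_s: float) -> bool:
        """Return True and record the time if enough time has passed since the last event."""
        if self._last_emit is None or now_s - self._last_emit >= self.interval_s:
            self._last_emit = now_s
            return True
        return False


def throttled_events(samples, threshold, throttle):
    """Return the (timestamp, value) samples that exceed the threshold and pass the throttle."""
    emitted = []
    for timestamp_s, value in samples:
        if value > threshold and throttle.should_emit(timestamp_s):
            emitted.append((timestamp_s, value))
    return emitted


if __name__ == "__main__":
    # Made-up measurements: (seconds since start, measured latency).
    samples = [(0, 5), (10, 20), (20, 30), (75, 40), (80, 1), (140, 50)]
    for event in throttled_events(samples, threshold=10, throttle=EventThrottle()):
        print("event at t=%ss, value=%s" % event)
```

In this sketch, a measurement that exceeds the threshold within 60 seconds of the previous event is measured but not reported, which is why a burst of high readings produces a single log entry.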
Configuring Maximum Thresholds
You can set thresholds for the following direct runtime metrics:
Disk I/O Latency
Disk I/O latency statistics are collected wherever latency may affect performance and the monitoring does not significantly impact performance or stability. The software event broker raises an asynchronous event when measured disk I/O latency exceeds the configured threshold value. A high latency may indicate that the disk, or access to the disk, is oversubscribed.
To configure the maximum threshold for disk I/O latency, enter the following:
solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# disk-latency-high-threshold <value>
Where:
<value> is the maximum threshold for the disk I/O latency metric.
The no disk-latency-high-threshold command restores the default value.
Compute Latency
This metric compares the execution time of well-known functions (for example, epoll timers) against invariant timers (for example, CPU TSC). A high latency may indicate that the CPU is oversubscribed, the CPU is being blocked by another application, or access to virtual memory is somehow degraded. The software event broker raises an asynchronous event when measured latency for CPU instructions exceeds the configured threshold value.
To configure the maximum threshold for the compute latency metric, enter the following:
solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# compute-latency-high-threshold <value>
Where:
<value> is the maximum threshold for the compute latency metric.
The no compute-latency-high-threshold command restores the default value.
In some environments, an unreliable Time Stamp Counter (TSC) can cause erroneous or misleading events. In that case, you may need to effectively disable this metric by setting its threshold to a very large value.
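The idea behind the compute latency metric, comparing how long a timed operation actually takes against a steady reference clock, can be sketched in Python. This is an illustration under stated assumptions, not the broker's method: time.sleep stands in for the well-known function, time.perf_counter_ns stands in for the invariant timer, and measure_timer_overshoot_us is a hypothetical name.

```python
import time


def measure_timer_overshoot_us(requested_s: float = 0.001, samples: int = 50) -> dict:
    """Request short sleeps and record how late each one returns, in microseconds."""
    overshoots = []
    for _ in range(samples):
        start_ns = time.perf_counter_ns()
        time.sleep(requested_s)
        elapsed_us = (time.perf_counter_ns() - start_ns) / 1000
        # Overshoot is the time spent beyond what was requested; clamp at zero.
        overshoots.append(max(0.0, elapsed_us - requested_s * 1_000_000))
    return {
        "min": min(overshoots),
        "max": max(overshoots),
        "avg": sum(overshoots) / len(overshoots),
    }


if __name__ == "__main__":
    stats = measure_timer_overshoot_us()
    print("timer overshoot (us): min=%.0f max=%.0f avg=%.0f"
          % (stats["min"], stats["max"], stats["avg"]))
```

On a lightly loaded host the overshoot is usually small; large or erratic values suggest the same kinds of contention described above, such as an oversubscribed CPU.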
Network Latency
This metric measures the round-trip network transaction latency (for example, for a ping operation) between the instance and members of its high-availability (HA) group. A high latency may indicate a faulty or misconfigured physical network interface, inadequate or faulty network cabling, a network storm or loop, or an overutilized network. The software event broker raises an asynchronous event when the measured network latency exceeds the configured threshold value.
To configure the maximum network latency threshold, enter the following:
solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# network-latency-high-threshold <value>
Where:
<value> is the maximum threshold for the network latency metric.
The no network-latency-high-threshold command restores the default value.
Mate-link Latency
This metric measures the completion time of a transaction over the mate-link. It is an aggregate metric, because a mate-link transaction combines compute, disk, and network latencies. This is a direct metric with significant impact on the overall stability of a high-availability (HA) group. The software event broker raises an asynchronous event when the measured mate-link latency exceeds the configured threshold.
To configure the maximum mate-link latency threshold, enter the following:
solace> enable
solace# configure
solace(configure)# system health
solace(configure/system/health)# mate-link-latency-high-threshold <value>
Where:
<value> is the maximum threshold for the mate-link latency metric.
The no mate-link-latency-high-threshold command restores the default value.
Viewing Direct Runtime Metrics
You can view the direct runtime application metrics using the show system health Solace CLI command. In the example shown below, minimum, maximum, average, and current values are displayed.
solace(configure/system/health)# show system health

Statistics since: Oct 07 2020 09:38:45 EDT

                   Units      Min     Max     Avg    Curr    Thresh   Events
                   ------  ------  ------  ------  ------   -------  -------
Disk Latency       us           0   24054       0       0  10000000        0
Compute Latency    us         937   14502    1127    1078    500000        0
Network Latency    us           0       0       0       0   2000000        0
Mate-Link Latency  us           0       0       0       0   2000000        0
The clear system health stats command resets all statistics to 0.
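If you collect the show system health output in a script, the metric rows can be parsed and checked against their thresholds. The sketch below assumes each metric row has the layout shown in the example (a name, a units column, then six whole numbers); HealthMetric, parse_health, and metrics_needing_attention are hypothetical names, and the rule used for flagging a metric is only one reasonable choice.

```python
import re
from dataclasses import dataclass


@dataclass
class HealthMetric:
    name: str
    units: str
    minimum: int
    maximum: int
    average: int
    current: int
    threshold: int
    events: int


# A metric row: name, units, then six whole numbers (Min Max Avg Curr Thresh Events).
ROW = re.compile(r"^\s*(?P<name>.+?)\s+(?P<units>[A-Za-z]+)\s+(?P<nums>\d+(?:\s+\d+){5})\s*$")


def parse_health(text: str) -> list:
    """Parse the metric rows; header and separator lines do not match and are skipped."""
    metrics = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values = [int(v) for v in match.group("nums").split()]
            metrics.append(HealthMetric(match.group("name"), match.group("units"), *values))
    return metrics


def metrics_needing_attention(metrics: list) -> list:
    """Names of metrics whose maximum exceeded the threshold or that raised events."""
    return [m.name for m in metrics if m.maximum > m.threshold or m.events > 0]


# Values copied from the example output above.
SAMPLE = """\
Statistics since: Oct 07 2020 09:38:45 EDT
Units Min Max Avg Curr Thresh Events
------ ------ ------ ------ ------ ------- -------
Disk Latency us 0 24054 0 0 10000000 0
Compute Latency us 937 14502 1127 1078 500000 0
Network Latency us 0 0 0 0 2000000 0
Mate-Link Latency us 0 0 0 0 2000000 0
"""

if __name__ == "__main__":
    for metric in parse_health(SAMPLE):
        print(metric)
    print("needs attention:", metrics_needing_attention(parse_health(SAMPLE)))
```

With the example values, no metric is flagged, because every maximum is below its threshold and the Events column is 0 for all four rows.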