Monitoring Replication

This section provides the CLI commands that you can use to view replication statistics, and describes the important factors to monitor to ensure that replication is performing well and providing the expected service.

Displaying System-Level Replication Info

To show the system-level replication configuration information and statistics for an event broker, enter the following User EXEC commands:

solace> show replication [stats]

Where:

replication specifies replication configuration information.

stats specifies replication statistics information.

Most of the statistics are self-explanatory: they show how many messages were queued for sending to Message VPNs on the replication mate that were in a standby state, and how many messages were received from Message VPNs on the replication mate that were in an active state.

However, the "Transitions to Ineligible" and "Sync Messages Queued To Standby As Async" statistics merit further discussion, because they can indicate network connectivity issues or WAN bandwidth issues that the system administrator needs to address.

If the WAN link between the replication sites fails, or if it cannot keep up with the ingress rate of messages that are to be replicated to standby Message VPNs on the replication mate, the system transitions to asynchronous replication for synchronous topics. This prevents publishing applications from being unable to publish during the WAN link impairment or outage. The system automatically transitions back to synchronous replication once the WAN link is restored and the backlog of messages that were queued for replication to the standby Message VPNs on the replication mate while the event broker was in this state has been delivered.
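One way to act on these two counters is to sample them periodically and flag any increase between samples. The following is a minimal sketch, not a supported tool: it assumes the counter values have already been extracted from the statistics output into a dictionary. The extraction step, the function name, and the dictionary layout are illustrative assumptions.

```python
from typing import Dict, List

# Counter names as they appear in the text above. How the values are
# extracted from the CLI output is left to the caller (assumption: the
# caller supplies one integer per counter name).
WATCHED_COUNTERS = (
    "Transitions to Ineligible",
    "Sync Messages Queued To Standby As Async",
)


def increased_counters(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Return the watched counters whose value grew between two samples."""
    grown = []
    for name in WATCHED_COUNTERS:
        before = previous.get(name, 0)  # a missing counter is treated as 0
        after = current.get(name, 0)
        if after > before:
            grown.append(name)
    return grown
```

A non-empty result suggests that the broker fell back to asynchronous replication at some point between the two samples, which is a cue to check the WAN link.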

Displaying Message VPN-Level Replication Info

To show the Message VPN-level replication configuration information and statistics, enter the following User EXEC commands:

solace> show message-vpn <vpn-name> replication [stats]

Where:

replication specifies replication configuration information.

stats specifies replication statistics information.

The Message VPN status and statistics are largely the same as the global status and statistics, except that the values apply specifically to the given Message VPN.

Note that when a Message VPN has an active replication status, the Local Bridge state is "n/a", because the local bridge is only used in the replication standby state. In the replication active state, the "Remote Bridge" and "Queue" sections are of interest: the remote bridge is expected to be connected and bound to the active Message VPN’s replication queue.

Monitoring Replication Transactions

  • To show in-progress replicated transactions, enter the following User EXEC command:
    solace> show transaction replicated
  • To show the details of in-progress replicated transactions, enter the following User EXEC command:
    solace> show transaction replicated detail
  • To show the details of a particular transaction, enter the following User EXEC command:
    solace> show transaction xid <xid> detail

    Where:

    xid specifies the XID of the transaction to be displayed.

  • For XA transactions, the XID will match the value used by the client when starting the XA transaction (typically by the application transaction manager). Local transactions do not have an XID associated with them, so the system automatically creates an internal XID for the transaction.
  • Not all transaction states are mirrored on the active and standby sites. XA and local transactions will only be displayable on the standby site if they are in the prepared state and are using synchronous replication.
  • During normal processing of transactions, it is common for transactions to proceed quickly from state to state, so it is difficult to display transactions in certain states. For example, a local transaction will never be in a displayable state on the active site. When replication service is downgraded and transactions are timing out (for example, when switching from synchronous to asynchronous), transactions in various states are more likely to be displayable.

Monitoring Degraded Service

The event broker generates a WARN-level event log, VPN_REPLICATION_SERVICE_DEGRADED, to indicate when replication service has been degraded. This is one of the most important events to monitor in a replication deployment, so monitor it diligently and act as soon as possible to rectify the problem.
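As an illustration of monitoring this event, the sketch below scans text log lines for the event name. It assumes that the broker's event logs are available as plain text lines (for example, forwarded to a log collector). Only the event name comes from this section; the function name and the way lines are obtained are hypothetical, and no particular line layout is assumed.

```python
from typing import Iterable, List

# Event name taken from the text above.
DEGRADED_EVENT = "VPN_REPLICATION_SERVICE_DEGRADED"


def find_degraded_events(lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions the degraded-service event."""
    return [line.rstrip("\n") for line in lines if DEGRADED_EVENT in line]
```

Because `find_degraded_events` accepts any iterable of strings, it can be passed an open file object directly, and the matching lines can then be handed to whatever alerting mechanism is already in place.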

Degraded replication service indicates that:

  1. Synchronous replication reverted to asynchronous replication service for both messages and transactions. This implies that message loss is possible due to a replication failover.
  2. If the reject-msg-when-sync-ineligible option is enabled (refer to Configuring to Reject Messages When Sync Ineligible), then messages and transactions using synchronous replication mode are rejected.
  3. Messages are not being replicated to the Message VPNs with a standby replication state in a timely fashion. This could mean that messages could be accumulating on the replication queue, thus consuming message spool resources. Therefore, the replication queue’s spool usage should be monitored when the replication service is degraded.
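To make the spool-usage monitoring in the last point concrete, the sketch below estimates how long it would take the replication queue to reach its quota if usage keeps growing at the rate observed between two samples. This is a simple linear extrapolation under stated assumptions: the function name, the units (seconds and MB), and the way the samples are collected are all illustrative.

```python
from typing import Optional


def seconds_until_quota(
    t1_s: float,
    usage1_mb: float,
    t2_s: float,
    usage2_mb: float,
    quota_mb: float,
) -> Optional[float]:
    """Linearly extrapolate two spool-usage samples to the queue quota.

    Returns None when usage is flat or shrinking, that is, when the queue
    is not on course to reach its quota.
    """
    if t2_s <= t1_s:
        raise ValueError("samples must be in increasing time order")
    growth_mb_per_s = (usage2_mb - usage1_mb) / (t2_s - t1_s)
    if growth_mb_per_s <= 0:
        return None
    remaining_mb = max(quota_mb - usage2_mb, 0.0)
    return remaining_mb / growth_mb_per_s
```

For example, if usage grows from 100 MB to 160 MB in 60 seconds and the quota is 1000 MB, the estimate is 840 seconds. Real traffic is rarely linear, so treat the result as a rough indication of urgency rather than a prediction.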

Addressing Degraded Service

The first step to take to clear a degraded service state is to verify that there is proper link connectivity between the replication sites (that is, checking that the link is up, and that the link is performing as expected). Once this has been done, if the replication queue continues to grow while service is degraded, action must be taken to restore proper function of replication. Possible actions that can be taken include the following:

Limit the Rate of Message Replication

The most obvious way to address degraded service is to limit the rate at which messages are replicated, so that the number of messages pending replication is reduced over time. If the tolerable recovery window is long enough, you can take advantage of slow periods in the application usage patterns (for example, evenings, weekends, or maintenance windows).
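The arithmetic behind this approach can be sketched as follows. The example assumes that limiting the rate means reducing the ingress of messages that must be replicated, and that both rates stay roughly constant over the recovery window. The function and parameter names are illustrative.

```python
from typing import Optional


def seconds_to_drain(
    backlog_msgs: float,
    publish_rate: float,
    replication_rate: float,
) -> Optional[float]:
    """Estimate the time needed to clear the replication backlog.

    publish_rate is the rate (msgs/s) at which new messages needing
    replication arrive; replication_rate is the rate (msgs/s) at which
    messages are delivered to the replication mate. Returns None when the
    backlog cannot shrink because arrivals keep pace with delivery.
    """
    net_drain_rate = replication_rate - publish_rate
    if net_drain_rate <= 0:
        return None
    return backlog_msgs / net_drain_rate
```

For example, a backlog of 1,000,000 messages with 4,000 msgs/s arriving and 5,000 msgs/s replicated drains in about 1,000 seconds. If the result is longer than the tolerable recovery window, the ingress rate needs to be reduced further or another action from this section considered.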

Temporarily Disconnect Slow Subscribers

If the delivery mode of the replication queue is not delivering from input, the rate of replicated message delivery is impacted by other consumers that are also receiving messages from persistent storage but are consuming those messages slowly. If these other slow consuming clients (that is, slow subscribers) are temporarily disconnected, the rate of replication may increase such that it will eventually catch up.

The simplest way to disconnect slow subscribers is to shut down the egress of these endpoints, which disallows clients from binding to the queue. To shut down endpoint egress, use the following CONFIG command:

solace(configure/message-spool/queue)# shutdown egress

Doing this could cause even more messages to build up on the queue, thus exacerbating the problem for these clients. However, once the replication queue has caught up, it should be able to continue delivering from the input stream as long as the WAN bandwidth and the link’s message spool window size are sufficient. As a result, enabling the egress for the slow consumer endpoints should not be a factor in causing the problem to recur on the replication queue. So if proper operation of replication is higher priority than message delivery to slow subscribers, this may be a reasonable action to take.

Restart Replication

If the message rate cannot be reduced enough for the messages on the replication queue to be consumed faster than new messages are enqueued, replication should be restarted. Restarting interrupts replication, resulting in a period of time during which no messages are replicated. While this might seem drastic, it is likely at this point that the replication service is already degraded, meaning the benefit of replication has already been lost for uncontrolled failovers.

If you decide to restart replication, do the following:

  1. Shut down replication (refer to Enabling Replication).
  2. Monitor the Number of delete-in-progress value from the output of the show message-spool User EXEC command. Wait for this value to go to 0.
  3. Enable replication (refer to Enabling Replication).
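The order of these steps, including the wait in step 2, can be sketched as a small routine. This is only an outline of the control flow: the three callables are hypothetical hooks that you would implement with whatever management interface you use, and the polling interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def restart_replication(
    shutdown: Callable[[], None],
    read_delete_in_progress: Callable[[], int],
    enable: Callable[[], None],
    poll_interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Shut down replication, wait for deletes to finish, then re-enable.

    All three action callables are hypothetical hooks supplied by the caller.
    """
    shutdown()  # step 1
    deadline = clock() + timeout_s
    # Step 2: poll until the delete-in-progress count reaches 0.
    while read_delete_in_progress() > 0:
        if clock() >= deadline:
            # Replication is deliberately left shut down so that an
            # operator can investigate before it is re-enabled.
            raise TimeoutError("delete-in-progress count did not reach 0")
        sleep(poll_interval_s)
    enable()  # step 3
```

The `sleep` and `clock` parameters exist only so that the routine can be exercised without real delays.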

Clear Over-Quota Endpoints

If you allow clients to consume messages from endpoints on a Message VPN with a standby replication state (that is, active-active consumers are allowed), degraded service could occur if the messages on one of those endpoints are not adequately consumed by a client and the endpoint goes over quota.

This could happen if the consuming client:

  • is not connected, and is therefore not draining the endpoint.
  • is consuming messages at a slower rate than a consumer of messages from the same endpoint on the mate Message VPN that has an active replication state. In this case, the message spool size for the endpoint on the replication standby Message VPN grows over time.