Monitoring Replication

This section lists the CLI commands that you can use to view replication statistics, and it describes the most important factors to monitor to ensure that replication is performing well and providing the expected service.

Displaying System-Level Replication Info

To show the system-level replication configuration information and statistics for an event broker, enter the following User EXEC commands:

solace> show replication [stats]

Where:

replication specifies replication configuration information.

stats specifies replication statistics information.

Example:

solace> show replication stats

Replication Interface:             1/6/lag1
Replication Mate:                  v:lab-128-81
  Connect-Via:                     192.168.x.x
  Connect-Ports:
    Uncompressed:                  55555
    Compressed:                    55003
    SSL:                           55443
SSL:
  Default Cipher Suite List:       Yes
  Cipher Suites:                   ECDHE-RSA-AES256-GCM-SHA384
                                   ECDHE-RSA-AES256-SHA384
                                   ECDHE-RSA-AES256-SHA
                                   AES256-GCM-SHA384
                                   AES256-SHA256
                                   AES256-SHA
                                   ECDHE-RSA-DES-CBC3-SHA
                                   DES-CBC3-SHA
                                   ECDHE-RSA-AES128-GCM-SHA256
                                   ECDHE-RSA-AES128-SHA256
                                   ECDHE-RSA-AES128-SHA
                                   AES128-GCM-SHA256
                                   AES128-SHA256
                                   AES128-SHA
  Trusted Common Names:

ConfigSync:
  Bridge:
    Admin State:                   Enabled
    State:                         up
    Authentication:
      Scheme:                      Basic
    Compressed:                    No
    SSL:                           No
    Message Spool:
      Window Size:                 65535
    Retry Delay:                   3
    SSL Server Certificate Validation:
      Enforce Trusted Common Name: Yes
      Maximum Chain Depth:         3
      Validate Certificate Dates:  Yes

Statistics While Active:
  Message Processing:
    Sync Messages Queued To Standby:                           0
    Sync Messages Queued To Standby As Async:                  0
    Async Messages Queued To Standby:                          0
    Promoted Messages Queued To Standby:                       0
    Pruned Locally Consumed Messages:                          0
  Sync Replication:
    Transitions To Ineligible:                                10
  Ack Propagation:
    Messages Sent To Standby:                                  0
    Reconcile Request From Standby:                            0
    Reconcile Scan in Progress:                               No
Statistics While Standby:
  Message Processing:
    Messages Received From Active:                             0
  Ack Propagation:
    Messages Received from Active:                             0
    Reconcile Request Sent to  Active:                         0
    Out of Sequence Ack Received:                              0
    Transaction Replication:
      Transactions Requests:                                   0
        Success:                                               0
          Prepare:                                             0
          Commit:                                              0
          Rollback:                                            0
        Fail:                                                  0
        Prepare:                                               0
        Commit:                                                0
        Rollback:                                              0

Most of the statistics are self-explanatory: they show how many messages were queued for sending to Message VPNs on the replication mate that were in a standby state, and how many messages were received from Message VPNs on the replication mate that were in an active state.

However, the "Transitions To Ineligible" and "Sync Messages Queued To Standby As Async" statistics merit further discussion, because they can indicate network connectivity issues or WAN bandwidth issues that the system administrator needs to address.

If the WAN link between the replication sites fails, or if the WAN link cannot keep up with the ingress rate of messages that are to be replicated to Message VPNs on the replication mate that are in a standby state, then the system transitions to asynchronous replication for synchronous topics so that publishing applications can continue to publish during the WAN link impairment or outage. The system automatically transitions back to synchronous replication once the WAN link is restored and the backlog of messages that were queued for replication to Message VPNs with standby replication states on the replication mate while the event broker was in this state has been delivered.
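Assuming these two counters accumulate over time, as the example values suggest, what matters in practice is whether they increase between samples. The following Python sketch shows one way to track that. It assumes the text of the show replication stats output has already been captured by some means, which is outside the scope of the sketch. The function names read_counters and counters_that_increased are hypothetical, and the parsing relies only on the label and value layout shown in the example above.

```python
import re

# Counter labels copied from the example output above.
WATCHED = (
    "Transitions To Ineligible",
    "Sync Messages Queued To Standby As Async",
)


def read_counters(cli_output, labels=WATCHED):
    """Return {label: int} for each watched counter found in the text."""
    counters = {}
    for line in cli_output.splitlines():
        # Match lines of the form "  Label:      123".
        match = re.match(r"^\s*(.+?):\s+(\d+)\s*$", line)
        if match and match.group(1) in labels:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counters_that_increased(previous, current):
    """Return the labels whose value grew between two samples."""
    return [k for k in current if current[k] > previous.get(k, 0)]
```

A growing value for either counter between two samples is a prompt to check WAN connectivity and bandwidth, as described above.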

Displaying Message VPN-Level Replication Info

To show the Message VPN-level replication configuration information and statistics, enter the following User EXEC commands:

solace> show message-vpn <vpn-name> replication [stats]

Where:

replication specifies replication configuration information.

stats specifies replication statistics information.

Example:

solace> show message-vpn blue_02 replication stats

Message VPN:                          blue_02
Admin Status:                         enabled
Config Status:                        active
Local Bridge: 
  State:                              n/a
  Name:                               n/a
  Queue State:                        n/a
  Authentication:
    Scheme:                           Basic
    Basic:
      Client Username:                solace
      Password Configured:            No
    Client Certificate:
       Certificate File:
  Compressed:                         No
  SSL:                                No
  Message Spool:  
    Window Size:                      255
  Unidirectional:
    Client Profile:                   #client-profile
  Retry Delay:                        3 
Remote Bridge:
  State:                              up
  Name:                               #bridge/v:solace/blue_02/49
Queue:
  State:                              bound
  Quota (MB):                         800000
  Reject Msg to Sender on Discard:    Yes
Ack Propagation:
  Interval in Messages:               20
  Sync Replication:
    Eligible:                         yes
      Duration:                       0d 22h 36m 24s
    Mate Flow Congested:              no
      Duration:                       0d 0h 0m 0s
    Reject Msg When Sync Ineligible:  no
Transaction Replication Mode:      sync


Statistics While Active:
 Message Processing:   
   Sync Messages Queued To Standby:                   239697766
   Sync Messages Queued To Standby As Async:           87102660
   Async Messages Queued To Standby:                          0
   Promoted Messages Queued To Standby:                       0
   Pruned Locally Consumed Messages:                    5817989
 Sync Replication:
   Transitions To Ineligible:                              2479
   Ineligible High Water Mark:                    0d 1h 27m 14s
   Eligible High Water Mark:                      1d 2h 55m 23s
   Mate Flow Congested High Water Mark:           0d 1h 27m 44s
   Mate Flow Not Congested High Water Mark:       1d 2h 55m 23s
       
Ack Propagation:
  Messages Sent To Standby:         56983
  Reconcile Request From Standby:                             0
  Reconcile Scan in Progress:                                No
Statistics While Standby:
 Message Processing:
  Messages Received From Active:                              0
 Ack Propagation:  
  Messages Received from Active:                              0
  Reconcile Request Sent to  Active:                          0
  Out of Sequence Ack Received:                               0
Transaction Replication:
  Transactions Requests:                                      0
    Success:                                                  0
      Prepare:                                                0
      Commit:                                                 0
      Rollback:                                               0
    Fail:                                                     0
      Prepare:                                                0
      Commit:                                                 0
      Rollback:                                               0

The Message VPN status and statistics are largely the same as the global status and statistics, except that the values apply specifically to the given Message VPN.

Note that in the example above, the Message VPN has an active replication status, so the Local Bridge state is "n/a". The local bridge is used only in the replication standby state. In the replication active state, the "Remote Bridge" and "Queue" sections are of interest: the remote bridge is expected to be connected and bound to the active Message VPN’s replication queue.
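This check can be scripted against captured output. The Python sketch below is illustrative only: it assumes the output layout matches the example above, and it treats the "up" and "bound" values shown there as the healthy states for an active Message VPN. The function names are hypothetical.

```python
def parse_vpn_status(cli_output):
    """Return (top_level_fields, section_states) parsed from the CLI text."""
    fields, states, section = {}, {}, None
    for raw in cli_output.splitlines():
        if not raw.strip():
            continue
        label, _, value = raw.strip().partition(":")
        value = value.strip()
        if raw[0].isspace():
            # Indented line: keep only the first "State" of each section.
            if label == "State" and section is not None:
                states.setdefault(section, value)
        elif value:
            fields[label] = value   # for example "Config Status: active"
            section = None
        else:
            section = label         # for example "Remote Bridge:"
    return fields, states


def active_vpn_problems(cli_output):
    """List unexpected states for a Message VPN that is replication active."""
    fields, states = parse_vpn_status(cli_output)
    if fields.get("Config Status") != "active":
        return []  # the checks below apply only to an active Message VPN
    problems = []
    if states.get("Remote Bridge") != "up":
        problems.append("Remote Bridge state is %r" % states.get("Remote Bridge"))
    if states.get("Queue") != "bound":
        problems.append("Queue state is %r" % states.get("Queue"))
    return problems
```

An empty list means the remote bridge and queue are in the states shown in the example. A standby Message VPN is skipped because the local bridge, not the remote bridge, is relevant there.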

Monitoring Replication Transactions

To show in-progress replicated transactions, enter the following User EXEC command:

solace> show transaction replicated

To show the details of in-progress replicated transactions, enter the following User EXEC command:

solace> show transaction replicated detail

To show the details of a particular transaction, enter the following User EXEC command:
```
solace> show transaction xid <xid> detail
```
Where:

xid specifies the XID of the transaction to be displayed.

For XA transactions, the XID matches the value used by the client when starting the XA transaction (typically set by the application transaction manager). Local transactions do not have an XID associated with them, so the system automatically creates an internal XID for the transaction.

Not all transaction states are mirrored on the active and standby sites. XA and local transactions can be displayed on the standby site only if they are in the prepared state and are using synchronous replication.

During normal processing, transactions commonly proceed quickly from state to state, so it is difficult to display transactions in certain states. For example, a local transaction will never be in a displayable state on the active site. When replication service is downgraded and transactions are timing out (for example, when switching from synchronous to asynchronous), transactions in various states are more likely to be displayable.

Monitoring Degraded Service

The event broker implements a WARN-level event log, VPN_REPLICATION_SERVICE_DEGRADED, to indicate when replication service has been degraded. This is one of the most important things to monitor in a replication deployment, so it is recommended that you monitor this event diligently and take action as soon as possible to rectify the problem.

Degraded replication service indicates that:

Synchronous replication has reverted to asynchronous replication service for both messages and transactions. This implies that message loss is possible due to a replication failover.
If the reject-msg-when-sync-ineligible option is enabled (refer to Configuring to Reject Messages When Sync Ineligible), then messages and transactions using synchronous replication mode are rejected.
Messages are not being replicated to the Message VPNs with a standby replication state in a timely fashion. Messages may therefore be accumulating on the replication queue and consuming message spool resources, so the replication queue’s spool usage should be monitored when the replication service is degraded.
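Both monitoring points above can be automated. The following Python sketch is a minimal illustration, not a definitive implementation: it assumes event logs are available as plain text lines (the exact log line format is not specified here, so the code matches only on the event name), and it assumes the replication queue's current spool usage and quota in MB are obtained elsewhere. The function names and the 0.8 threshold are hypothetical choices.

```python
EVENT_NAME = "VPN_REPLICATION_SERVICE_DEGRADED"


def degraded_events(log_lines):
    """Yield each log line that mentions the degraded-service event."""
    for line in log_lines:
        if EVENT_NAME in line:
            yield line.rstrip("\n")


def spool_usage_alert(used_mb, quota_mb, threshold=0.8):
    """Return True when usage reaches the given fraction of the quota.

    The default threshold is an arbitrary assumption; pick a value that
    leaves enough time to react before the queue reaches its quota.
    """
    return quota_mb > 0 and used_mb / quota_mb >= threshold
```

For instance, with the 800000 MB quota shown in the earlier example output, spool_usage_alert(700000, 800000) returns True, while a nearly empty queue does not trigger the alert.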

Addressing Degraded Service

The first step to take to clear a degraded service state is to verify that there is proper link connectivity between the replication sites (that is, checking that the link is up, and that the link is performing as expected). Once this has been done, if the replication queue continues to grow while service is degraded, action must be taken to restore proper function of replication. Possible actions that can be taken include the following:

Limit the Rate of Message Replication
Temporarily Disconnect Slow Subscribers
Restart Replication
Clear Over-Quota Endpoints

Limit the Rate of Message Replication

The most obvious way to address degraded service is to limit the rate at which messages are replicated so that the number of messages pending replication is reduced over time. If the tolerable recovery window is long enough, slow periods in the application usage patterns can be taken advantage of (for example, evenings, weekends, and/or maintenance windows).
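To judge whether a reduced publish rate gives a tolerable recovery window, a rough estimate of the drain time can help. The Python sketch below assumes steady rates, which real traffic rarely has, and all names and numbers in it are illustrative rather than taken from the event broker.

```python
def seconds_to_drain(backlog_msgs, replication_rate, publish_rate):
    """Estimate seconds until the replication backlog clears.

    replication_rate: messages per second delivered to the replication mate.
    publish_rate: messages per second newly queued for replication.
    Returns None when the backlog cannot shrink at these rates.
    """
    net_rate = replication_rate - publish_rate
    if net_rate <= 0:
        return None
    return backlog_msgs / net_rate
```

For example, a backlog of 1,000,000 messages with 5,000 msgs/s being replicated against 3,000 msgs/s of new traffic drains in roughly 500 seconds. If the function returns None, rate limiting alone will not clear the backlog.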

Temporarily Disconnect Slow Subscribers

If the delivery mode of the replication queue is not delivering from input, the rate of replicated message delivery is impacted by other consumers that are also receiving messages from persistent storage but are consuming those messages slowly. If these other slow consuming clients (that is, slow subscribers) are temporarily disconnected, the rate of replication may increase such that it will eventually catch up.

The simplest way to disconnect slow subscribers is to shut down the egress of these endpoints, which disallows clients from binding to the queue. To shut down endpoint egress, use the following CONFIG command:

solace(configure/message-spool/queue)# shutdown egress

Doing this could cause even more messages to build up on the queue, thus exacerbating the problem for these clients. However, once the replication queue has caught up, it should be able to continue delivering from the input stream as long as the WAN bandwidth and the link’s message spool window size are sufficient. As a result, enabling the egress for the slow consumer endpoints should not be a factor in causing the problem to recur on the replication queue. So if proper operation of replication is higher priority than message delivery to slow subscribers, this may be a reasonable action to take.

Restart Replication

If the message rate cannot be reduced enough for the messages on the replication queue to be consumed faster than new messages are enqueued, replication should be restarted. This implies that replication will be interrupted, resulting in a period of time where no messages will be replicated. While this might seem drastic, it is likely at this point that the replication service is already degraded, meaning the benefit of replication has already been lost for uncontrolled failovers.

If you decide to restart replication, do the following:

Shut down replication (refer to Enabling Replication).
Monitor the Number of delete-in-progress value in the output of the show message-spool User EXEC command. Wait until this value reaches 0.
Enable replication (refer to Enabling Replication).
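The wait in the second step can be automated with a simple polling loop. This is a minimal Python sketch: read_delete_in_progress is a hypothetical callable that you would supply to return the current "Number of delete-in-progress" value (for example, by parsing captured show message-spool output), and the poll interval and timeout defaults are arbitrary assumptions.

```python
import time


def wait_for_deletes(read_delete_in_progress, poll_seconds=5.0,
                     timeout_seconds=600.0, sleep=time.sleep):
    """Poll until the delete-in-progress value is 0.

    Returns True once the value reaches 0, or False if the timeout
    elapses first. The sleep parameter exists so tests can skip waiting.
    """
    waited = 0.0
    while True:
        if read_delete_in_progress() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Replication would be enabled again only after this function returns True. A False result means the value did not reach 0 within the timeout and needs investigation before proceeding.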

Clear Over-Quota Endpoints

If you allow clients to consume messages from endpoints on a Message VPN with a standby replication state (that is, active-active consumers are allowed), degraded service could occur if the messages on one of those endpoints are not adequately consumed by a client and the endpoint goes over quota.

This could happen if the consuming client:

is not connected, and is therefore not draining the endpoint.
is consuming messages at a slower rate than a consumer of messages from the same endpoint on the mate Message VPN that has an active replication state. In this case, the message spool size for the endpoint on the replication standby Message VPN grows over time.
