Monitoring Replication
This section provides the CLI commands that you can use to view replication statistics, and it provides information on some important factors to monitor to ensure that replication is performing well and providing the expected service.
Displaying System-Level Replication Info
To show the system-level replication configuration information and statistics for an event broker, enter the following User EXEC commands:
solace> show replication [stats]
Where:
replication
specifies replication configuration information.
stats
specifies replication statistics information.
solace> show replication stats

Replication Interface: 1/6/lag1
Replication Mate: v:lab-128-81
  Connect-Via: 192.168.x.x
  Connect-Ports:
    Uncompressed: 55555
    Compressed: 55003
    SSL: 55443
  SSL:
    Default Cipher Suite List: Yes
    Cipher Suites: ECDHE-RSA-AES256-GCM-SHA384
                   ECDHE-RSA-AES256-SHA384
                   ECDHE-RSA-AES256-SHA
                   AES256-GCM-SHA384
                   AES256-SHA256
                   AES256-SHA
                   ECDHE-RSA-DES-CBC3-SHA
                   DES-CBC3-SHA
                   ECDHE-RSA-AES128-GCM-SHA256
                   ECDHE-RSA-AES128-SHA256
                   ECDHE-RSA-AES128-SHA
                   AES128-GCM-SHA256
                   AES128-SHA256
                   AES128-SHA
    Trusted Common Names:
ConfigSync:
  Bridge:
    Admin State: Enabled
    State: up
    Authentication:
      Scheme: Basic
    Compressed: No
    SSL: No
    Message Spool:
      Window Size: 65535
    Retry Delay: 3
    SSL Server Certificate Validation:
      Enforce Trusted Common Name: Yes
      Maximum Chain Depth: 3
      Validate Certificate Dates: Yes
Statistics While Active:
  Message Processing:
    Sync Messages Queued To Standby: 0
    Sync Messages Queued To Standby As Async: 0
    Async Messages Queued To Standby: 0
    Promoted Messages Queued To Standby: 0
    Pruned Locally Consumed Messages: 0
  Sync Replication:
    Transitions To Ineligible: 10
  Ack Propagation:
    Messages Sent To Standby: 0
    Reconcile Request From Standby: 0
    Reconcile Scan in Progress: No
Statistics While Standby:
  Message Processing:
    Messages Received From Active: 0
  Ack Propagation:
    Messages Received from Active: 0
    Reconcile Request Sent to Active: 0
    Out of Sequence Ack Received: 0
  Transaction Replication:
    Transactions Requests: 0
      Success: 0
        Prepare: 0
        Commit: 0
        Rollback: 0
      Fail: 0
        Prepare: 0
        Commit: 0
        Rollback: 0
Most of the statistics are self-explanatory: they show how many messages were queued for sending to Message VPNs on the replication mate that were in a standby state, and how many messages were received from Message VPNs on the replication mate that were in an active state.
However, the "Transitions To Ineligible" and "Sync Messages Queued To Standby As Async" statistics merit further discussion, because they can indicate network connectivity issues or WAN bandwidth issues that the system administrator needs to address.
If the WAN link between the replication sites fails, or if the WAN link cannot keep up with the ingress rate of messages that are to be replicated to Message VPNs on the replication mate that are in a standby state, the system transitions to asynchronous replication for synchronous topics so that publishing applications can continue to publish during the WAN link impairment or outage. The system automatically transitions back to synchronous replication once the WAN link is restored and the backlog of messages that were queued for replication to Message VPNs with standby replication states on the replication mate while the event broker was in this state has been delivered.
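One way to watch these two statistics is to capture the command output periodically and compare the counters between captures. The following Python sketch is illustrative only: it assumes the output has been captured as text by some external means, and the names read_counter and counter_increases are hypothetical helpers, not part of any Solace tooling.

```python
import re

# Counter labels as they appear in the example output above.
WATCHED = (
    "Transitions To Ineligible",
    "Sync Messages Queued To Standby As Async",
)


def read_counter(output: str, label: str) -> int:
    """Return the integer that follows 'label:' in captured CLI output."""
    match = re.search(re.escape(label) + r":\s*(\d+)", output)
    if match is None:
        raise ValueError(f"counter not found: {label}")
    return int(match.group(1))


def counter_increases(previous: str, current: str) -> dict:
    """Map each watched counter to its increase between two captures.

    Counters that did not grow (or were cleared in between) are omitted.
    """
    deltas = {}
    for label in WATCHED:
        delta = read_counter(current, label) - read_counter(previous, label)
        if delta > 0:
            deltas[label] = delta
    return deltas


# Two made-up captures, reduced to the lines of interest.
before = (
    "Sync Messages Queued To Standby: 0\n"
    "Sync Messages Queued To Standby As Async: 0\n"
    "Transitions To Ineligible: 10\n"
)
after = (
    "Sync Messages Queued To Standby: 50\n"
    "Sync Messages Queued To Standby As Async: 7\n"
    "Transitions To Ineligible: 11\n"
)
print(counter_increases(before, after))
```

A non-empty result between two captures suggests that the event broker fell back to asynchronous replication at least once during that interval.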
Displaying Message VPN-Level Replication Info
To show the Message VPN-level replication configuration information and statistics, enter the following User EXEC commands:
solace> show message-vpn <vpn-name> replication [stats]
Where:
replication
specifies replication configuration information.
stats
specifies replication statistics information.
Solace> show message-vpn blue_02 replication stats

Message VPN: blue_02
Admin Status: enabled
Config Status: active
Local Bridge:
  State: n/a
  Name: n/a
  Queue State: n/a
  Authentication:
    Scheme: Basic
    Basic:
      Client Username: solace
      Password Configured: No
    Client Certificate:
      Certificate File:
  Compressed: No
  SSL: No
  Message Spool:
    Window Size: 255
  Unidirectional:
    Client Profile: #client-profile
  Retry Delay: 3
Remote Bridge:
  State: up
  Name: #bridge/v:solace/blue_02/49
Queue:
  State: bound
  Quota (MB): 800000
  Reject Msg to Sender on Discard: Yes
Ack Propagation:
  Interval in Messages: 20
Sync Replication:
  Eligible: yes
    Duration: 0d 22h 36m 24s
  Mate Flow Congested: no
    Duration: 0d 0h 0m 0s
  Reject Msg When Sync Ineligible: no
Transaction Replication Mode: sync
Statistics While Active:
  Message Processing:
    Sync Messages Queued To Standby: 239697766
    Sync Messages Queued To Standby As Async: 87102660
    Async Messages Queued To Standby: 0
    Promoted Messages Queued To Standby: 0
    Pruned Locally Consumed Messages: 5817989
  Sync Replication:
    Transitions To Ineligible: 2479
    Ineligible High Water Mark: 0d 1h 27m 14s
    Eligible High Water Mark: 1d 2h 55m 23s
    Mate Flow Congested High Water Mark: 0d 1h 27m 44s
    Mate Flow Not Congested High Water Mark: 1d 2h 55m 23s
  Ack Propagation:
    Messages Sent To Standby: 56983
    Reconcile Request From Standby: 0
    Reconcile Scan in Progress: No
Statistics While Standby:
  Message Processing:
    Messages Received From Active: 0
  Ack Propagation:
    Messages Received from Active: 0
    Reconcile Request Sent to Active: 0
    Out of Sequence Ack Received: 0
  Transaction Replication:
    Transactions Requests: 0
      Success: 0
        Prepare: 0
        Commit: 0
        Rollback: 0
      Fail: 0
        Prepare: 0
        Commit: 0
        Rollback: 0
The Message VPN status and statistics are largely the same as the global status and statistics, except that the values apply specifically to the given Message VPN.
Note that in the example above the Message VPN has an active replication status, so the Local Bridge state is "n/a". The local bridge is used only in the replication standby state. In the replication active state, the "Remote Bridge" and "Queue" sections are of interest. When the Message VPN is active, the remote bridge is expected to be connected and bound to the active Message VPN’s replication queue.
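The expectation for an active Message VPN (remote bridge up, replication queue bound) can be checked mechanically against captured output. The sketch below is a hedged illustration, not a supported tool: it assumes the field labels appear exactly as in the example above, and field and active_side_problems are hypothetical names.

```python
import re


def field(output: str, pattern: str):
    """Return the first captured group for pattern, or None if absent."""
    match = re.search(pattern, output)
    return match.group(1) if match else None


def active_side_problems(output: str) -> list:
    """List deviations from the expected state of an active Message VPN."""
    status = field(output, r"Config Status:\s*(\S+)")
    if status != "active":
        # The remote bridge and queue checks only apply in the active state.
        return [f"check skipped: Config Status is {status!r}, not 'active'"]
    problems = []
    bridge = field(output, r"Remote Bridge:\s*State:\s*(\S+)")
    if bridge != "up":
        problems.append(f"Remote Bridge State is {bridge!r}, expected 'up'")
    queue = field(output, r"Queue:\s*State:\s*(\S+)")
    if queue != "bound":
        problems.append(f"Queue State is {queue!r}, expected 'bound'")
    return problems


# Excerpt taken from the example output above.
sample = """
Config Status: active
Local Bridge:
  State: n/a
  Name: n/a
  Queue State: n/a
Remote Bridge:
  State: up
  Name: #bridge/v:solace/blue_02/49
Queue:
  State: bound
  Quota (MB): 800000
"""
print(active_side_problems(sample))
```

The patterns require the section label to be followed directly by its State field, so the "Queue State" line under Local Bridge is not mistaken for the replication queue's state.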
Monitoring Replication Transactions
- To show in-progress replicated transactions, enter the following User EXEC command:
solace> show transaction replicated
Example:
Solace # show transaction replicated

Flags Legend
T - Transaction Type (X=XA L=Local)
S - Transaction State (A=Active S=Suspended I=Idle P=Prepared C=Complete)
R - Replicated (Y=Yes N=No)

XID                                                                   Messages
Message VPN                                   T S R Last State Change  Spooled
--------------------------------------------- - - - ----------------- --------
0021ABC4-00-01
blue_02                                       X P Y                1s        0
- To show the details of in-progress replicated transactions, enter the following User EXEC command:
solace> show transaction replicated detail
Example:
Solace # show transaction replicated detail

XID: 0021B028-00-01
Message VPN: blue_02
Client: username/15848/#000c0001
Client Username: default
Session: N/A
Idle Timeout: 0
Type: XA
State: IDLE
Replicated: Yes
Last State Change: 0d 0h 0m 0s
Messages: 10
Messages Published: 0
Messages Consumed: 150

Publisher Messages:
Message Id           Topic
-------------------- -----------------------------------------------------------

Consumer Messages:
Message Id           Type  Endpoint Name
-------------------- ----- -----------------------------------------------------
3118727406           queue test
3118727407           queue test
3118727408           queue test
3118727409           queue test
3118727410           queue test
3118727411           queue test
3118727412           queue test
3118727413           queue test
3118727414           queue test
3118727415           queue test
- To show the details of a particular transaction, enter the following User EXEC command:
solace> show transaction xid <xid> detail
Where:
xid
specifies the XID of the transaction to be displayed.
- For XA transactions, the XID will match the value used by the client when starting the XA transaction (typically by the application transaction manager). Local transactions do not have an XID associated with them, so the system automatically creates an internal XID for the transaction.
- Not all transaction states are mirrored on the active and standby sites. XA and local transactions will only be displayable on the standby site if they are in the prepared state and are using synchronous replication.
- During normal processing, transactions commonly proceed quickly from state to state, so it is difficult to display transactions in certain states. For example, a local transaction will never be in a displayable state on the active site. When replication service is downgraded and transactions are timing out (for example, when switching from synchronous to asynchronous), transactions in various states are more likely to be displayable.
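When post-processing the summary output, the single-letter flags can be expanded with the Flags Legend shown in the example. The following Python sketch is illustrative: decode_flags and displayable_on_standby are hypothetical helper names, and the second function simply restates the standby-site rule from the notes above.

```python
# Mappings copied from the Flags Legend in the example output.
TRANSACTION_TYPE = {"X": "XA", "L": "Local"}
TRANSACTION_STATE = {
    "A": "Active",
    "S": "Suspended",
    "I": "Idle",
    "P": "Prepared",
    "C": "Complete",
}
REPLICATED = {"Y": "Yes", "N": "No"}


def decode_flags(t: str, s: str, r: str) -> dict:
    """Expand the T, S and R flag columns into readable values."""
    return {
        "type": TRANSACTION_TYPE[t],
        "state": TRANSACTION_STATE[s],
        "replicated": REPLICATED[r],
    }


def displayable_on_standby(state_flag: str, synchronous: bool) -> bool:
    """Only prepared transactions using synchronous replication are
    displayable on the standby site."""
    return state_flag == "P" and synchronous


# Flags from the example row: X P Y
print(decode_flags("X", "P", "Y"))
print(displayable_on_standby("P", synchronous=True))
```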
Monitoring Degraded Service
The event broker implements a WARN level event log, VPN_REPLICATION_SERVICE_DEGRADED, to indicate when replication service has been degraded. This is one of the most important things to monitor in a replication deployment, so it is recommended that this event is diligently monitored and that action is taken as soon as possible to rectify the problem.
Degraded replication service indicates that:
- Synchronous replication has reverted to asynchronous replication service for both messages and transactions. This implies that message loss is possible due to a replication failover.
- If the reject-msg-when-sync-ineligible option is enabled (refer to Configuring to Reject Messages When Sync Ineligible), then messages and transactions using synchronous replication mode are rejected.
- Messages are not being replicated to the Message VPNs with a standby replication state in a timely fashion. This can mean that messages are accumulating on the replication queue, thus consuming message spool resources. Therefore, the replication queue’s spool usage should be monitored when the replication service is degraded.
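If the event logs are forwarded to a file or log collector, a simple filter on the event name is enough to raise an alert. The sketch below only matches on the event name given above; the sample log lines are invented placeholders and do not reflect the event broker's actual log layout, and degraded_events is a hypothetical helper.

```python
EVENT_NAME = "VPN_REPLICATION_SERVICE_DEGRADED"


def degraded_events(lines) -> list:
    """Return the log lines that mention the degraded-service event."""
    return [line.rstrip("\n") for line in lines if EVENT_NAME in line]


# Placeholder lines for illustration only; the real log format may differ.
sample_log = [
    "placeholder-timestamp INFO SOME_OTHER_EVENT details\n",
    "placeholder-timestamp WARN VPN_REPLICATION_SERVICE_DEGRADED blue_02\n",
]
for hit in degraded_events(sample_log):
    print("ALERT:", hit)
```

The same function accepts an open file object, since iterating over a text file yields its lines.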
Addressing Degraded Service
The first step to take to clear a degraded service state is to verify that there is proper link connectivity between the replication sites (that is, checking that the link is up, and that the link is performing as expected). Once this has been done, if the replication queue continues to grow while service is degraded, action must be taken to restore proper function of replication. Possible actions that can be taken include the following:
- Limit the Rate of Message Replication
- Temporarily Disconnect Slow Subscribers
- Restart Replication
- Clear Over-Quota Endpoints
Limit the Rate of Message Replication
The most obvious way to address degraded service is to limit the rate at which messages are replicated so that the number of messages pending replication is reduced over time. If the tolerable recovery window is long enough, slow periods in the application usage patterns can be taken advantage of (for example, evenings, weekends, and/or maintenance windows).
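To judge whether a reduced rate fits the tolerable recovery window, the backlog can be divided by the difference between the rate at which the replication queue drains and the rate at which new messages arrive. The following sketch assumes both rates are steady, which real traffic rarely is; seconds_to_drain and its parameters are hypothetical names, and the figures are made up.

```python
def seconds_to_drain(backlog_msgs: int, ingress_rate: float, drain_rate: float):
    """Estimate how long the replication backlog takes to clear.

    backlog_msgs: messages currently pending replication
    ingress_rate: new messages to replicate per second
    drain_rate:   messages delivered to the mate per second
    Returns None when the backlog cannot shrink at these rates.
    """
    if drain_rate <= ingress_rate:
        return None
    return backlog_msgs / (drain_rate - ingress_rate)


# 1,200,000 pending messages, 3,000 msg/s arriving, 5,000 msg/s draining.
print(seconds_to_drain(1_200_000, 3_000, 5_000))  # 600.0
# At or above the drain rate, the backlog never clears.
print(seconds_to_drain(1_200_000, 5_000, 5_000))  # None
```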
Temporarily Disconnect Slow Subscribers
If the delivery mode of the replication queue is not delivering from input, the rate of replicated message delivery is impacted by other consumers that are also receiving messages from persistent storage but are consuming those messages slowly. If these other slow consuming clients (that is, slow subscribers) are temporarily disconnected, the rate of replication may increase such that it will eventually catch up.
The simplest way to disconnect slow subscribers is to shut down the egress of these endpoints, which disallows clients from binding to the queue. To shut down endpoint egress, use the following CONFIG command:
solace(configure/message-spool/queue)# shutdown egress
Doing this could cause even more messages to build up on the queue, thus exacerbating the problem for these clients. However, once the replication queue has caught up, it should be able to continue delivering from the input stream as long as the WAN bandwidth and the link’s message spool window size are sufficient. As a result, enabling the egress for the slow consumer endpoints should not be a factor in causing the problem to recur on the replication queue. So if proper operation of replication is higher priority than message delivery to slow subscribers, this may be a reasonable action to take.
Restart Replication
If the message rate cannot be reduced enough for the messages on the replication queue to be consumed faster than new messages are enqueued, replication should be restarted. This implies that replication will be interrupted, resulting in a period of time where no messages will be replicated. While this might seem drastic, it is likely at this point that the replication service is already degraded, meaning the benefit of replication has already been lost for uncontrolled failovers.
If you decide to restart replication, do the following:
- Shut down replication (refer to Enabling Replication).
- Monitor the Number of delete-in-progress value from the output of the show message-spool User EXEC command. Wait for this value to go to 0.
- Enable replication (refer to Enabling Replication).
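The wait in the second step can be automated by polling the command output until the value reaches 0. The sketch below is a minimal illustration under stated assumptions: fetch_output is a hypothetical callable that returns the text of show message-spool (how it is obtained is left open), and the value is assumed to appear as "Number of delete-in-progress: <n>", which may not match the exact layout of the real output.

```python
import re
import time


def delete_in_progress(output: str) -> int:
    """Extract the delete-in-progress count (assumed 'label: value' layout)."""
    match = re.search(r"Number of delete-in-progress:\s*(\d+)", output)
    if match is None:
        raise ValueError("delete-in-progress value not found in output")
    return int(match.group(1))


def wait_until_deleted(fetch_output, poll_seconds=5.0, max_polls=120,
                       sleep=time.sleep) -> bool:
    """Poll until the count is 0; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        if delete_in_progress(fetch_output()) == 0:
            return True
        sleep(poll_seconds)
    return False


# Demonstration with canned outputs and a no-op sleep instead of a real broker.
canned = iter([
    "Number of delete-in-progress: 12",
    "Number of delete-in-progress: 3",
    "Number of delete-in-progress: 0",
])
print(wait_until_deleted(lambda: next(canned), sleep=lambda _seconds: None))
```

Only after the function returns True would replication be enabled again, as described in the last step.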
Clear Over-Quota Endpoints
If you allow clients to consume messages from endpoints on a Message VPN with a standby replication state (that is, active-active consumers are allowed), degraded service could occur if the messages on one of those endpoints are not adequately consumed by a client and the endpoint goes over quota.
This could happen if the consuming client:
- is not connected, and is therefore not draining the endpoint.
- is consuming messages at a slower rate than a consumer of messages from the same endpoint on the mate Message VPN that has an active replication state. In this case, the message spool size for the endpoint on the replication standby Message VPN grows over time.