Performing an Uncontrolled Failover
In the event of an unplanned failure of an active data center or network isolation, there will not be an opportunity to gracefully release activity from the Message VPNs at that replication site.
There are three types of uncontrolled failovers:
- Short-Term Outage
The Active site is out-of-service or isolated for a short duration (for example, minutes or hours). The replication queue has enough capacity to store all replicated messages and transactions during the outage.
- Long-Term Outage
The Active site is out-of-service or isolated for a long duration (for example, days or weeks). The replication queue does not have enough capacity to store all replicated messages and transactions during the outage.
- Complete Failure
The Active site goes out of service and cannot be recovered. A critical component (the event broker, region connectivity, etc.) has been lost, or data on the external disk has been lost.
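The three definitions above can be read as a simple decision rule. The following Python sketch is illustrative only: the function name, parameters, and units are assumptions introduced here, and every input is an operator estimate rather than a value read from an event broker.

```python
from enum import Enum


class OutageType(Enum):
    SHORT_TERM = "short-term outage"
    LONG_TERM = "long-term outage"
    COMPLETE_FAILURE = "complete failure"


def classify_outage(recoverable: bool,
                    expected_outage_hours: float,
                    replicated_mb_per_hour: float,
                    queue_capacity_mb: float) -> OutageType:
    """Classify an outage using the three definitions above.

    All arguments are hypothetical operator estimates.
    """
    if not recoverable:
        return OutageType.COMPLETE_FAILURE
    # Data the replication queue must hold while the failed site is away.
    expected_backlog_mb = expected_outage_hours * replicated_mb_per_hour
    if expected_backlog_mb <= queue_capacity_mb:
        return OutageType.SHORT_TERM
    return OutageType.LONG_TERM
```

The distinction between a short-term and a long-term outage depends on whether the replication queue can hold the backlog, not on elapsed time alone, which is why the sketch compares an estimated backlog against the queue capacity.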
In all these types of failovers, the following general steps must be taken:
- Step 1: Make Message VPNs at Standby Site Replication Active to Restore Service
- Step 2: Ensure Clients Cannot Connect to the Failed Site
- Step 3: If Necessary, Suspend Replication
- Step 4: Bring Message VPNs at the Failed Site Back Online as Replication Standby
In the provided example, the New York replication site has experienced the failure, and its mate Boston site takes over activity until the New York site has been restored. For simplicity, only a single Message VPN (Trading_VPN) is presented in the example. When the failure occurred, Trading_VPN had a replication active state at the New York site and a replication standby state at the Boston site.
While these simple examples only show replication sites with a single Message VPN, in real-world scenarios, these steps must be performed for each Message VPN involved in a replication site failover.
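Because the same commands must be repeated for every Message VPN, it can help to generate the command list up front. The sketch below is a minimal illustration, not a Solace tool: it only builds text lines using the `message-vpn` and `replication state` commands shown in this section. The use of `exit` to return to the configure level between VPNs is an assumption, and `Settlement_VPN` is a hypothetical second VPN name.

```python
def replication_state_commands(vpn_names, state):
    """Return CONFIG-level CLI lines that set each VPN's replication state."""
    if state not in ("active", "standby"):
        raise ValueError("state must be 'active' or 'standby'")
    lines = []
    for name in vpn_names:
        lines.append(f"message-vpn {name}")
        lines.append(f"replication state {state}")
        # Assumption: 'exit' returns from the message-vpn level to the
        # configure level so that the next VPN can be selected.
        lines.append("exit")
    return lines


# Hypothetical VPN list; only Trading_VPN appears in the example.
for line in replication_state_commands(["Trading_VPN", "Settlement_VPN"], "active"):
    print(line)
```

Review the generated lines before entering them on an event broker. The same helper can produce the `standby` variant used when the failed site is brought back online.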
Consequences of an Uncontrolled Failover
Potential consequences of an uncontrolled failover include:
- The build-up of messages on the replication queue for the duration of the outage at the Active site.
- The replication queue becoming full.
- The loss of one or more event brokers at the failed site prior to restoring operation at the failed site.
- The possible loss of messages and transactions that were being replicated asynchronously.
- An increased probability and volume of duplicate message delivery.
We recommend that you contact Solace for help resolving any issues that arise in the circumstances of a specific uncontrolled failover.
Step 1: Make Message VPNs at Standby Site Replication Active to Restore Service
This procedure should be followed after it has been determined that an uncontrolled failure has occurred for a data center site.
Make a Replication Standby Message VPN Replication Active
To restore service, change the replication state of the Message VPN to active.
Boston Data Center
BOS_EventBroker(configure)# message-vpn Trading_VPN
BOS_EventBroker(configure/message-vpn)# replication state active
Clients will now be able to connect to the Message VPN.
Since a standby site is not available, asynchronous messages and transactions will be stored in the replication queue. By default, synchronous replication will switch to asynchronous, causing those messages and transactions to also be stored in the replication queue. If reject-msg-when-sync-ineligible is set on the Message VPN, synchronous replication will be blocked until the standby Message VPN is restored.
Step 2: Ensure Clients Cannot Connect to the Failed Site
It is important that the failed site does not come up with its replication Message VPNs in active state. If both sites have an active replication state at the same time, proper operation cannot be guaranteed. Since the failed event broker was configured with replication state active when it failed or became unreachable, when it recovers that will be its default state. Note that if the failed site cannot be recovered, and its configuration has to be restored from a backup, that backup configuration may have been saved with an active replication state, so this step applies in that case as well.
In this step, the goal is to allow the failed event broker to be brought back up, but also prevent clients from connecting. To do that, you should block ports that allow client connectivity, while still allowing the event broker to be managed through the management ports.
There may be a number of ways to accomplish this step; the specific actions to perform should be tested before an uncontrolled failure occurs so it is clear what to do in an actual failure scenario.
Step 3: If Necessary, Suspend Replication
If the failed site takes a long time to recover, there is a risk that the replication queue will fill up. If this happens, messages published to replicated topics (in or out of transactions) will be rejected, since no replication service can be provided. If you know that there will be a prolonged outage, or the replication queue is getting close to filling up (a high event log has been triggered on the replication queue), it may be necessary to suspend the replication service so that non-replicated service can continue to be provided to the replicated topics.
To suspend replication, disable the reject-msg-to-sender-on-discard behavior on the replication queue using the following CONFIG command:
solace(configure/message-vpn/replication/queue)# no reject-msg-to-sender-on-discard
Note that with this setting, replicated service will continue until the replication queue gets full. Once it is full, only local, non-replicated service is provided.
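The decision to suspend replication comes down to whether the replication queue will fill before the failed site returns. The sketch below shows that arithmetic under stated assumptions: the quota, current usage, and fill rate are values you have obtained separately, and the function names and megabyte units are hypothetical.

```python
def hours_until_full(quota_mb: float, used_mb: float,
                     fill_mb_per_hour: float) -> float:
    """Estimate the hours remaining until the replication queue reaches its quota."""
    if fill_mb_per_hour <= 0:
        return float("inf")  # the queue is not growing
    remaining_mb = max(quota_mb - used_mb, 0.0)
    return remaining_mb / fill_mb_per_hour


def should_suspend_replication(quota_mb, used_mb, fill_mb_per_hour,
                               expected_recovery_hours):
    """True if the queue is projected to fill before the failed site returns."""
    projected = hours_until_full(quota_mb, used_mb, fill_mb_per_hour)
    return projected < expected_recovery_hours
```

For example, a 1000 MB quota with 400 MB used and a fill rate of 100 MB per hour leaves about six hours; if recovery is expected to take twelve hours, the projection suggests suspending replication.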
Step 4: Bring Message VPNs at the Failed Site Back Online as Replication Standby
Once the failed site has been recovered with management access but no client access (see Step 2: Ensure Clients Cannot Connect to the Failed Site), it can be prepared to be the standby site. Here are the steps for preparing the recovered event broker to be the standby site:
- Step 4-1: Configure All Message VPNs as Standby
- Step 4-2: Verify the Message Spool
- Step 4-3: Heuristically Complete Transactions
- Step 4-4: Allow Clients to Connect
- Step 4-5: Wait For Synchronous Replication to be Eligible
- Step 4-6: If Necessary, Re-enable Replication
- Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site
Step 4-1: Configure All Message VPNs as Standby
Configure all Message VPNs on the restored Replication site with a standby Replication state. In this example, the Message VPN Trading_VPN at the New York site is configured with a standby Replication state:
NY Data Center
NY_EventBroker1(configure)# message-vpn Trading_VPN
NY_EventBroker1(configure/message-vpn)# replication state standby
The Config-Sync facility propagates this setting to the Trading_VPN Message VPN on Ny-Appliance2.
Step 4-2: Verify the Message Spool
You should verify that the message spool for the event brokers at the failed Replication site is now capable of providing service.
Before continuing, ensure that the message spool on the recovered site is active for the primary virtual router. In the sample output below (which may vary by event broker type and version), the Activity Status of Local Inactive and the Message Spool Status of AD-NotReady indicate that the event broker and the message spool it uses are not active.
NY_EventBroker1# show redundancy

Configuration Status     : Enabled
Auto Revert              : No
Redundancy Mode          : Active/Active
Mate Router Name         : solaceBackup
ADB Link To Mate         : Up
ADB Hello To Mate        : Down

                               Primary Virtual Router Backup Virtual Router
                               ---------------------- ----------------------
Activity Status                Local Inactive         Local Active
Routing Interface              1/1/lag1:1             1/1/lag1:3
VRRP VRID                      33                     34
Routing Interface Status       Up                     Up
VRRP Status                    Master                 Master
VRRP Priority                  75                     250
Message Spool Status           AD-NotReady            AD-Disabled
Priority Reported By Mate      Backup-Reconcile       Primary-Reconcile
In this situation, you must resolve the issue preventing the failed event broker from becoming active. If you cannot resolve the issue, contact Solace.
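If this check is scripted, the two relevant rows can be picked out of captured `show redundancy` text. The sketch below is a rough illustration that assumes one table row per line with columns separated by two or more spaces. The healthy spool value `AD-Active` is an assumption that does not appear in this section's sample output, and real output may differ by event broker type and version.

```python
import re

# Two rows copied from the sample output above, one row per line.
SAMPLE = """\
Activity Status                Local Inactive         Local Active
Message Spool Status           AD-NotReady            AD-Disabled
"""


def primary_router_status(show_redundancy_output: str) -> dict:
    """Map each three-column row label to its Primary Virtual Router value."""
    status = {}
    for line in show_redundancy_output.splitlines():
        # Assumption: columns are separated by two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 3:
            label, primary, _backup = fields
            status[label] = primary
    return status


def spool_ready(status: dict) -> bool:
    # "Local Active" appears in the sample output; "AD-Active" is an
    # assumed healthy value that this section does not show.
    return (status.get("Activity Status") == "Local Active"
            and status.get("Message Spool Status") == "AD-Active")


print(spool_ready(primary_router_status(SAMPLE)))
```

Rows with fewer than three fields, such as the `Configuration Status : Enabled` lines and the column headings, are skipped by the length check.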
Step 4-3: Heuristically Complete Transactions
If applicable, heuristically commit or heuristically rollback any prepared transactions on the failed site. Once heuristically completed, delete them to free up the resources.
To commit, rollback or delete a transaction, enter the appropriate ADMIN commands on the failed site:
NY Data Center
solace(admin/message-spool)# commit-transaction xid <xid>
and/or
solace(admin/message-spool)# rollback-transaction xid <xid>
and then
solace(admin/message-spool)# delete-transaction xid <xid>
Where:
xid specifies the XID of the transaction to be committed, rolled back, or deleted.
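When several prepared transactions need attention, the commit-or-rollback decision for each XID can be recorded first and turned into command lines afterward. This Python sketch only formats the three ADMIN commands shown above. The function name is hypothetical, and the XID strings used with it are placeholders to be replaced with real XIDs.

```python
def heuristic_completion_commands(decisions):
    """Build ADMIN command lines from a {xid: "commit" | "rollback"} mapping."""
    lines = []
    for xid, action in decisions.items():
        if action not in ("commit", "rollback"):
            raise ValueError(f"unknown action for {xid}: {action}")
        lines.append(f"{action}-transaction xid {xid}")
        # Delete after heuristic completion to free up the resources.
        lines.append(f"delete-transaction xid {xid}")
    return lines


# Placeholder XIDs for illustration only.
for line in heuristic_completion_commands({"<xid-1>": "commit",
                                           "<xid-2>": "rollback"}):
    print(line)
```

Keeping the decisions in one mapping gives a reviewable record of which transactions were committed and which were rolled back.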
Step 4-4: Allow Clients to Connect
You previously blocked traffic to prevent client connectivity (Step 2: Ensure Clients Cannot Connect to the Failed Site). You must now unblock the ports to allow client connectivity. This step allows clients to connect and also allows the replication bridge to connect, which lets data be synchronized from the active site (from the Boston site to the New York site in the example).
Step 4-5: Wait For Synchronous Replication to be Eligible
Once connectivity is restored between the recovered site and the active site, the replication bridge will connect from the standby site to the active site and drain the replication queue in order to synchronize the two sites. Depending on how much message and transaction data is in the replication queue and the available bandwidth between the sites, this process may take a long time. When this process is complete, the Replication service will no longer be degraded and the Message VPN will become eligible for synchronous replication.
In the following example, the information is shown for the Boston site, which is acting as active for the recently failed New York site.
BOS_EventBroker# show message-vpn Trading_VPN replication

Flags Legend:
A - Admin State (U=Up, D=Down)
C - Config State (A=Active, S=Standby)
B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
R - Remote Bridge State (U=Up, D=Down, -=N/A)
Q - Queue State (U=Up, D=Down, -=N/A)
S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)

Message VPN                      A C W B R Q S M T
-------------------------------- - - - - - - - - -
Trading_VPN                      U A N - U U - N A

BOS_EventBroker#
A ‘Y’ in the ‘S’ column indicates that synchronous Replication is eligible for the Message VPN Trading_VPN.
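To wait on this condition in a script, the flag row can be paired with its header letters. The sketch below assumes the header and the VPN row have been captured as separate lines and that the VPN name contains no spaces. The function names are hypothetical, and the two sample lines are copied from the output above.

```python
HEADER = "Message VPN                      A C W B R Q S M T"
ROW = "Trading_VPN                      U A N - U U - N A"


def parse_flags(header_line: str, row_line: str) -> dict:
    """Pair each single-letter header column with the value in the VPN row."""
    header_fields = header_line.split()
    row_fields = row_line.split()
    # The header begins with the two words "Message VPN"; the row begins
    # with the VPN name (assumed to contain no spaces).
    letters = header_fields[2:]
    values = row_fields[1:]
    if len(letters) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(letters, values))


def sync_eligible(flags: dict) -> bool:
    """True only when the S (Sync Replication Eligible) flag is Y."""
    return flags.get("S") == "Y"


flags = parse_flags(HEADER, ROW)
print(flags["S"], sync_eligible(flags))
```

For the sample row as captured, the S flag is `-`, so `sync_eligible` returns `False`; a monitoring loop would repeat the check until the flag becomes `Y`.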
Step 4-6: If Necessary, Re-enable Replication
If you previously had to suspend replication because the replication queue overflowed, re-enable it. Enter the following CONFIG command:
solace(configure/message-vpn/replication/queue)# reject-msg-to-sender-on-discard
Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site
Messages that were asynchronously spooled to the replication queue at the formerly active site (NY) can only be consumed after activity is failed back to that site.
To retrieve these messages, fail back to the formerly active site (NY) during the next maintenance window.