Recovering from Uncontrolled Fail‑Overs

If an active data center fails or becomes isolated from the network, there is no opportunity to gracefully release activity from the Message VPNs at that replication site.

There are three types of uncontrolled fail-overs:

  • Short-Term Outage: The active site is out of service or isolated for a short duration (for example, minutes or hours). The replication queue has enough capacity to store all replicated messages and transactions during the outage.

  • Long-Term Outage: The active site is out of service or isolated for a long duration (for example, days or weeks). The replication queue does not have enough capacity to store all replicated messages and transactions during the outage.

  • Complete Failure: The active site goes out of service and cannot be recovered. The event broker (or a critical component) has to be replaced, or data on the external disk is lost.

For all of these types of fail-overs, the general steps described in the sections that follow must be taken.

In the provided example, the New York replication site has experienced the failure, and its mate, the New Jersey site, takes over activity until the New York site is restored. For simplicity, only a single Message VPN (Trading_VPN) is presented in the example. When the failure occurred, Trading_VPN had a replication active state at the New York site and a replication standby state at the New Jersey site.

While these simple examples only show replication sites with a single Message VPN, in real-world scenarios, these steps must be performed for each Message VPN involved in a replication site fail-over.

Consequences of an Uncontrolled Fail-Over

There are a number of potential consequences of an uncontrolled fail-over, including the following:

  • The build-up of messages on the replication queue for the duration of the primary site outage
  • The replication queue becoming full
  • The loss of one or more event brokers at the failed site prior to restoring operation at the failed site
  • The possibility of lost messages and transactions being replicated asynchronously
  • An increased probability and volume of duplicate message delivery
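The first two consequences are a matter of arithmetic: the replication queue can absorb an outage only as long as its capacity exceeds the data accumulated during it. The following back-of-the-envelope sketch illustrates that calculation; the function name and the figures are illustrative, not Solace APIs or defaults:

```python
def max_outage_seconds(queue_capacity_mb: float,
                       avg_msg_size_kb: float,
                       replicated_msgs_per_sec: float) -> float:
    """Estimate how long an outage the replication queue can absorb
    before it fills: capacity divided by the rate at which replicated
    data accumulates while the standby site is unreachable."""
    fill_rate_mb_per_sec = (avg_msg_size_kb / 1024.0) * replicated_msgs_per_sec
    return queue_capacity_mb / fill_rate_mb_per_sec

# Illustrative figures: a 60,000 MB replication queue, 2 KB average
# message size, 5,000 replicated messages per second -> about 1.7 hours.
hours = max_outage_seconds(60_000, 2.0, 5_000) / 3600
```

Running this kind of estimate against your own traffic profile tells you whether an outage is likely to be "short-term" or "long-term" in the sense defined above.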

We recommend that you contact Solace for help resolving any issues that arise in the circumstances of a specific uncontrolled fail-over.

Step 1: Make Message VPNs at Standby Site Replication Active to Restore Service

This procedure should be followed after it has been determined that an uncontrolled failure has occurred for a data center site.

Make a Replication Standby Message VPN Replication Active

To restore service, change the replication state of the Message VPN to active.

New Jersey Data Center

NJ_Appliance1(configure)# message-vpn Trading_VPN
NJ_Appliance1(configure/message-vpn)# replication state active

Clients will now be able to connect to the Message VPN.

Because a standby site is not available, asynchronously replicated messages and transactions will be stored in the replication queue. By default, synchronous replication switches to asynchronous, causing those messages and transactions to also be stored in the replication queue. If reject-msg-when-sync-ineligible is set on the Message VPN, synchronous replication will be blocked until the standby Message VPN is restored.
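The behavior just described amounts to a small decision table. The sketch below is a hypothetical model of that logic for reasoning about it, not a Solace API; the mode and outcome names are illustrative:

```python
def publish_outcome(replication_mode: str,
                    standby_reachable: bool,
                    reject_when_sync_ineligible: bool) -> str:
    """Model how a message published to a replicated topic is handled
    while the standby site is down (illustrative sketch, not a Solace API)."""
    if replication_mode == "async":
        return "spooled-on-replication-queue"
    # Synchronous replication is ineligible while the standby is unreachable.
    if not standby_reachable:
        if reject_when_sync_ineligible:
            return "rejected"
        return "spooled-on-replication-queue"  # default: fall back to async
    return "replicated-synchronously"
```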

Step 2: Ensure Clients Cannot Connect to the Failed Site

It is important that the failed site does not come up with its replication Message VPNs in active state. If both sites have an active replication state at the same time, proper operation cannot be guaranteed. Since the failed event broker was configured with replication state active when it failed or became unreachable, when it recovers that will be its default state. Note that if the failed site cannot be recovered, and its configuration has to be restored from a backup, that backup configuration may have been saved with an active replication state, so this step applies in that case as well.

In this step, the goal is to allow the failed event broker to be powered on, but prevent client connectivity. To do that, client connectivity to the data ports on the Network Acceleration Blade (NAB) should be blocked, while still allowing the event broker to be managed through its management ports.

Implementing this step requires a customer-specific administrative plan and depends on the type of failure. Possible options for this step are:

  • Administratively shut down the NAB interface(s). This option may be available during network isolation of the message backbone where management access is maintained.
  • Remove data cables from NAB data ports
  • Prevent L2 switches that provide connectivity to the NAB ports from powering up
  • Power up L2 switches that provide connectivity to the NAB ports and disable the NAB ports
  • Prevent the IP routers that provide client connectivity from routing traffic to the event broker’s NAB ports

There may be other ways to accomplish this step, but the procedure should be formalized before an uncontrolled failure occurs so that it is clear what actions to take when an actual failure happens.

Step 3: If Necessary, Suspend Replication

If the failed site takes a long time to recover, there is a risk that the replication queue will fill up. If this happens, messages published to replicated topics (whether in or out of transactions) will be rejected, since no replication service can be provided. If you know that there will be a prolonged outage, or the replication queue is getting close to full (a high-usage event has been logged for the replication queue), it may be necessary to suspend the replication service so that non-replicated service can continue to be provided on the replicated topics.

To suspend replication, disable the reject-msg-to-sender-on-discard behavior on the replication queue using the following CONFIG command:

solace(configure/message-vpn/replication/queue)# no reject-msg-to-sender-on-discard

Note that with this setting, replicated service will continue until the replication queue gets full. Once it is full, only local, non-replicated service is provided.
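The decision to suspend reduces to the two triggers named above: a known prolonged outage, or queue usage approaching the high-usage threshold. A minimal sketch of that check (the 80% default is an illustrative assumption, not a Solace setting):

```python
def should_suspend_replication(queue_used_pct: float,
                               prolonged_outage_expected: bool,
                               high_event_threshold_pct: float = 80.0) -> bool:
    """Suspend replication (i.e. run 'no reject-msg-to-sender-on-discard')
    when a long outage is anticipated, or when the replication queue's
    usage has reached the high-usage threshold. Threshold is illustrative."""
    return prolonged_outage_expected or queue_used_pct >= high_event_threshold_pct
```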

Step 4: Bring Message VPNs at the Failed Site Back Online as Replication Standby

Once the failed site has been recovered with management access but no client access (see Step 2: Ensure Clients Cannot Connect to the Failed Site), it can be prepared to be the standby site. Here are the steps for preparing the recovered appliance to be the standby site:

  • Step 4-1: Configure All Message VPNs as Standby
  • Step 4-2: Verify the Message Spool
  • Step 4-3: Heuristically Complete Transactions
  • Step 4-4: Allow Clients to Connect
  • Step 4-5: Wait For Synchronous Replication to be Eligible
  • Step 4-6: If Necessary, Re-enable Replication
  • Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site

    Step 4-1: Configure All Message VPNs as Standby

    Configure all Message VPNs on the restored replication site with a standby replication state. In this example, the Trading_VPN Message VPN at the New York site is configured with a standby replication state:

    NY Data Center

    NY_Appliance1(configure)# message-vpn Trading_VPN
    NY_Appliance1(configure/message-vpn)# replication state standby

    The Config-Sync facility propagates this setting to the Trading_VPN Message VPN on NY_Appliance2.

    Step 4-2: Verify the Message Spool

    You should verify that the message spools for the event brokers at the failed replication site are now capable of providing service.

    Before continuing, ensure that the message spool on the recovered site is active for the primary virtual router. In the sample output below (which may vary by appliance version), the Activity Status of Local Inactive and the Message Spool Status of AD-NotReady indicate that the event broker and the message spool it uses are not active.

    NY_Appliance1# show redundancy
    Configuration Status     : Enabled
    Auto Revert              : No
    Redundancy Mode          : Active/Active
    Mate Router Name         : solaceBackup
    ADB Link To Mate         : Up
    ADB Hello To Mate        : Down
                                   Primary Virtual Router  Backup Virtual Router
                                   ----------------------  ----------------------
    Activity Status                Local Inactive          Local Active
    Routing Interface              1/1/lag1:1              1/1/lag1:3
    VRRP VRID                      33                      34
    Routing Interface Status       Up                      Up
    VRRP Status                    Master                  Master
    VRRP Priority                  75                      250
    Message Spool Status           AD-NotReady             AD-Disabled
    Priority Reported By Mate      Backup-Reconcile        Primary-Reconcile

    In this situation, you must resolve the issue preventing the failed event broker from becoming active. If you cannot resolve the issue, contact Solace.

    Step 4-3: Heuristically Complete Transactions

    If applicable, heuristically commit or heuristically roll back any prepared transactions on the failed site. Once they are heuristically completed, delete them to free up their resources.

    To commit, roll back, or delete a transaction, enter the appropriate ADMIN commands on the failed site:

    NY Data Center

    solace(admin/message-spool)# commit-transaction xid <xid>

    and/or

    solace(admin/message-spool)# rollback-transaction xid <xid>

    and then

    solace(admin/message-spool)# delete-transaction xid <xid>

    Where:

    xid specifies the XID of the transaction to be committed, rolled back, or deleted.
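The commit-or-rollback choice for each prepared XID must come from outside the broker, typically from the transaction coordinator's own records. The following hypothetical helper sketches that bookkeeping by generating the ADMIN command verbs shown above for a list of XIDs; it is an illustration, not a Solace tool:

```python
def heuristic_commands(prepared_xids, xids_committed_at_coordinator):
    """For each prepared transaction, heuristically commit it if the
    coordinator recorded a commit, otherwise roll it back; then delete
    it to free its resources. Returns the ADMIN commands to run."""
    commands = []
    for xid in prepared_xids:
        if xid in xids_committed_at_coordinator:
            commands.append(f"commit-transaction xid {xid}")
        else:
            commands.append(f"rollback-transaction xid {xid}")
        commands.append(f"delete-transaction xid {xid}")
    return commands
```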

    Step 4-4: Allow Clients to Connect

    Restore client connectivity to the NAB data ports. This allows clients to connect as well as the replication bridge, which allows data to be synchronized from the active site (New Jersey site to the New York site in the example). The method through which connectivity to the NAB data ports is restored depends on the action chosen in Step 2: Ensure Clients Cannot Connect to the Failed Site of the parent procedure.

    Step 4-5: Wait For Synchronous Replication to be Eligible

    Once connectivity is restored between the recovered site and the active site, the replication bridge will connect from the standby site to the active site and drain the replication queue in order to synchronize the two sites. Depending on how much message and transaction data is in the replication queue and the available bandwidth between the sites, this process may take a long time. When this process is complete, the Replication service will no longer be degraded and the Message VPN will become eligible for synchronous replication.

    In the following example, the information displayed is for the New Jersey site, which is acting as active for the recently failed New York site.

    NJ_Appliance1# show message-vpn Trading_VPN replication

    Flags Legend:
    A - Admin State (U=Up, D=Down)
    C - Config State (A=Active, S=Standby)
    B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
    R - Remote Bridge State (U=Up, D=Down, -=N/A)
    Q - Queue State (U=Up, D=Down, -=N/A)
    S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
    M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
    T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)
    Message VPN                      A C W B R Q S M T
    -------------------------------- - - - - - - - - -
    Trading_VPN                      U A N - U U - N A
    NJ_Appliance1#

    A ‘Y’ under the ‘S’ column indicates that synchronous replication is eligible for the Trading_VPN Message VPN; in the output above the column still shows ‘-’, so eligibility has not yet been established.
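If a script polls this output while waiting, the eligibility check amounts to reading the ‘S’ flag from the VPN's row. A rough sketch of that scrape follows; it assumes the flag columns shown in the header above, and parsing CLI output is fragile, so treat it as illustrative rather than a supported automation interface:

```python
# Flag column order as shown in the 'show message-vpn ... replication' header.
FLAG_COLUMNS = ["A", "C", "W", "B", "R", "Q", "S", "M", "T"]

def sync_replication_eligible(vpn_row: str) -> bool:
    """Read the Sync Replication Eligible ('S') flag from one Message VPN
    row of the replication status output. Assumes the last fields on the
    row are the single-letter flags, in header order."""
    fields = vpn_row.split()
    flags = fields[-len(FLAG_COLUMNS):]
    return flags[FLAG_COLUMNS.index("S")] == "Y"
```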

    Step 4-6: If Necessary, Re-enable Replication

    If you previously had to suspend replication because the replication queue was at risk of overflowing, re-enable it. Enter the following CONFIG command:

    solace(configure/message-vpn/replication/queue)# reject-msg-to-sender-on-discard

    Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site

    Asynchronously spooled messages on the formerly active site (New York) can only be consumed after activity is failed back to that site.

    To retrieve these messages, fail back to the New York site during the next maintenance window.