Recovering from Uncontrolled Fail‑Overs

If an active data center fails or becomes isolated from the network, there is no opportunity to gracefully release activity from the Message VPNs at that replication site.

There are three types of uncontrolled fail-overs:

  • Short-Term Outage: The active site is out of service or isolated for a short duration (for example, minutes or hours). The replication queue has enough capacity to store all replicated messages and transactions during the outage.

  • Long-Term Outage: The active site is out of service or isolated for a long duration (for example, days or weeks). The replication queue does not have enough capacity to store all replicated messages and transactions during the outage.

  • Complete Failure: The active site goes out of service and cannot be recovered. The event broker (or a critical component) has to be replaced, or data on the external disk is lost.

For all of these types of fail-overs, the general steps described in the sections that follow must be taken.

In the provided example, the New York replication site has experienced the failure, and its mate, the New Jersey site, takes over activity until the New York site is restored. For simplicity, only a single Message VPN (Trading_VPN) is presented in the example. When the failure occurred, Trading_VPN had a replication active state at the New York site and a replication standby state at the New Jersey site.

While these simple examples only show replication sites with a single Message VPN, in real-world scenarios, these steps must be performed for each Message VPN involved in a replication site fail-over.

Consequences of an Uncontrolled Fail-Over

There are a number of potential consequences of an uncontrolled fail-over, including the following:

  • The build-up of messages on the replication queue for the duration of the primary site outage
  • The replication queue becoming full
  • The loss of one or more event brokers at the failed site prior to restoring operation at the failed site
  • The possibility of lost messages and transactions being replicated asynchronously
  • An increased probability and volume of duplicate message delivery
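The first two consequences are a matter of arithmetic: the replication queue can absorb an outage only as long as its capacity exceeds the data accumulated during it. The following back-of-the-envelope sketch illustrates that calculation; the function name and the figures are illustrative, not Solace APIs or defaults:

```python
def max_outage_seconds(queue_capacity_mb: float,
                       avg_msg_size_kb: float,
                       replicated_msgs_per_sec: float) -> float:
    """Estimate how long an outage the replication queue can absorb
    before it fills: capacity divided by the rate at which replicated
    data accumulates while the standby site is unreachable."""
    fill_rate_mb_per_sec = (avg_msg_size_kb / 1024.0) * replicated_msgs_per_sec
    return queue_capacity_mb / fill_rate_mb_per_sec

# Illustrative figures: a 60,000 MB replication queue, 2 KB average
# message size, 5,000 replicated messages per second -> about 1.7 hours.
hours = max_outage_seconds(60_000, 2.0, 5_000) / 3600
```

Running this kind of estimate against your own traffic profile tells you whether an outage is likely to be "short-term" or "long-term" in the sense defined above.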

We recommend that you contact Solace for help resolving any issues that arise in the circumstances of a specific uncontrolled fail-over.

Step 1: Make Message VPNs at Standby Site Replication Active to Restore Service

This procedure should be followed after it has been determined that an uncontrolled failure has occurred for a data center site.

Make a Replication Standby Message VPN Replication Active

To restore service, change the replication state of the Message VPN to active.

New Jersey Data Center

NJ_Appliance1(configure)# message-vpn Trading_VPN
NJ_Appliance1(configure/message-vpn)# replication state active

Clients will now be able to connect to the Message VPN.

Because a standby site is not available, asynchronously replicated messages and transactions will be stored in the replication queue. By default, synchronous replication switches to asynchronous, causing those messages and transactions to also be stored in the replication queue. If reject-msg-when-sync-ineligible is set on the Message VPN, synchronous replication will be blocked until the standby Message VPN is restored.
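The behavior just described amounts to a small decision table. The sketch below is a hypothetical model of that logic for reasoning about it, not a Solace API; the mode and outcome names are illustrative:

```python
def publish_outcome(replication_mode: str,
                    standby_reachable: bool,
                    reject_when_sync_ineligible: bool) -> str:
    """Model how a message published to a replicated topic is handled
    while the standby site is down (illustrative sketch, not a Solace API)."""
    if replication_mode == "async":
        return "spooled-on-replication-queue"
    # Synchronous replication is ineligible while the standby is unreachable.
    if not standby_reachable:
        if reject_when_sync_ineligible:
            return "rejected"
        return "spooled-on-replication-queue"  # default: fall back to async
    return "replicated-synchronously"
```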

Step 2: Ensure Clients Cannot Connect to the Failed Site

It is important that the failed site does not come up with its replication Message VPNs in active state. If both sites have an active replication state at the same time, proper operation cannot be guaranteed. Since the failed event broker was configured with replication state active when it failed or became unreachable, when it recovers that will be its default state. Note that if the failed site cannot be recovered, and its configuration has to be restored from a backup, that backup configuration may have been saved with an active replication state, so this step applies in that case as well.

In this step, the goal is to allow the failed event broker to be powered on, but prevent client connectivity. To do that, client connectivity to the data ports on the Network Acceleration Blade (NAB) should be blocked, while still allowing the event broker to be managed through its management ports.

Implementing this step requires a customer-specific administrative plan and depends on the type of failure. Possible options for this step are:

  • Administratively shut down the NAB interface(s). This option may be available during network isolation of the message backbone where management access is maintained.
  • Remove data cables from NAB data ports
  • Prevent L2 switches that provide connectivity to the NAB ports from powering up
  • Power up L2 switches that provide connectivity to the NAB ports and disable the NAB ports
  • Prevent the IP routers that provide client connectivity from routing traffic to the event broker’s NAB ports

There may be other ways to accomplish this step, but the procedure should be formalized before an uncontrolled failure occurs so that it is clear what actions to take when an actual failure happens.

Step 3: If Necessary, Suspend Replication

If the failed site takes a long time to recover, there is a risk that the replication queue will fill up. If this happens, messages published to replicated topics (whether in or out of transactions) will be rejected, since no replication service can be provided. If you know that there will be a prolonged outage, or the replication queue is getting close to full (a high-usage event has been logged for the replication queue), it may be necessary to suspend the replication service so that non-replicated service can continue to be provided on the replicated topics.

To suspend replication, disable the reject-msg-to-sender-on-discard behavior on the replication queue using the following CONFIG command:

solace(configure/message-vpn/replication/queue)# no reject-msg-to-sender-on-discard

Note that with this setting, replicated service will continue until the replication queue gets full. Once it is full, only local, non-replicated service is provided.
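The decision to suspend reduces to the two triggers named above: a known prolonged outage, or queue usage approaching the high-usage threshold. A minimal sketch of that check (the 80% default is an illustrative assumption, not a Solace setting):

```python
def should_suspend_replication(queue_used_pct: float,
                               prolonged_outage_expected: bool,
                               high_event_threshold_pct: float = 80.0) -> bool:
    """Suspend replication (i.e. run 'no reject-msg-to-sender-on-discard')
    when a long outage is anticipated, or when the replication queue's
    usage has reached the high-usage threshold. Threshold is illustrative."""
    return prolonged_outage_expected or queue_used_pct >= high_event_threshold_pct
```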

Step 4: Bring Message VPNs at the Failed Site Back Online as Replication Standby

Once the failed site has been recovered with management access but no client access (see Step 2: Ensure Clients Cannot Connect to the Failed Site), it can be prepared to be the standby site. Here are the steps for preparing the recovered appliance to be the standby site:

  • Step 4-1: Configure All Message VPNs as Standby
  • Step 4-2: Verify the Message Spool
  • Step 4-3: Heuristically Complete Transactions
  • Step 4-4: Allow Clients to Connect
  • Step 4-5: Wait For Synchronous Replication to be Eligible
  • Step 4-6: If Necessary, Re-enable Replication
  • Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site

    Step 4-1: Configure All Message VPNs as Standby

    Configure all Message VPNs on the restored replication site with a standby replication state. In this example, the Trading_VPN Message VPN at the New York site is configured with a standby replication state:

    NY Data Center

    NY_Appliance1(configure)# message-vpn Trading_VPN
    NY_Appliance1(configure/message-vpn)# replication state standby

    The Config-Sync facility propagates this setting to the Trading_VPN Message VPN on NY_Appliance2.

    Step 4-2: Verify the Message Spool

    You should verify that the message spools for the event brokers at the failed replication site are now capable of providing service.

    Before continuing, ensure that the message spool on the recovered site is active for the primary virtual router. In the sample output below (which may vary by appliance version), the Activity Status of Local Inactive and the Message Spool Status of AD-NotReady indicate that the event broker and the message spool it uses are not active.

    NY_Appliance1# show redundancy
    Configuration Status     : Enabled
    Auto Revert              : No
    Redundancy Mode          : Active/Active
    Mate Router Name         : solaceBackup
    ADB Link To Mate         : Up
    ADB Hello To Mate        : Down
                                   Primary Virtual Router  Backup Virtual Router
                                   ----------------------  ----------------------
    Activity Status                Local Inactive          Local Active
    Routing Interface              1/1/lag1:1              1/1/lag1:3
    VRRP VRID                      33                      34
    Routing Interface Status       Up                      Up
    VRRP Status                    Master                  Master
    VRRP Priority                  75                      250
    Message Spool Status           AD-NotReady             AD-Disabled
    Priority Reported By Mate      Backup-Reconcile        Primary-Reconcile

    In this situation, you must resolve the issue preventing the failed event broker from becoming active. If you cannot resolve the issue, contact Solace.

    Step 4-3: Heuristically Complete Transactions

    If applicable, heuristically commit or heuristically roll back any prepared transactions on the failed site. Once they are heuristically completed, delete them to free up their resources.

    To commit, roll back, or delete a transaction, enter the appropriate ADMIN commands on the failed site:

    NY Data Center

    solace(admin/message-spool)# commit-transaction xid <xid>

    and/or

    solace(admin/message-spool)# rollback-transaction xid <xid>

    and then

    solace(admin/message-spool)# delete-transaction xid <xid>

    Where:

    xid specifies the XID of the transaction to be committed, rolled back, or deleted.
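The commit-or-rollback choice for each prepared XID must come from outside the broker, typically from the transaction coordinator's own records. The following hypothetical helper sketches that bookkeeping by generating the ADMIN command verbs shown above for a list of XIDs; it is an illustration, not a Solace tool:

```python
def heuristic_commands(prepared_xids, xids_committed_at_coordinator):
    """For each prepared transaction, heuristically commit it if the
    coordinator recorded a commit, otherwise roll it back; then delete
    it to free its resources. Returns the ADMIN commands to run."""
    commands = []
    for xid in prepared_xids:
        if xid in xids_committed_at_coordinator:
            commands.append(f"commit-transaction xid {xid}")
        else:
            commands.append(f"rollback-transaction xid {xid}")
        commands.append(f"delete-transaction xid {xid}")
    return commands
```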

    Step 4-4: Allow Clients to Connect

    Restore client connectivity to the NAB data ports. This allows clients to connect as well as the replication bridge, which allows data to be synchronized from the active site (New Jersey site to the New York site in the example). The method through which connectivity to the NAB data ports is restored depends on the action chosen in Step 2: Ensure Clients Cannot Connect to the Failed Site of the parent procedure.

    Step 4-5: Wait For Synchronous Replication to be Eligible

    Once connectivity is restored between the recovered site and the active site, the replication bridge will connect from the standby site to the active site and drain the replication queue in order to synchronize the two sites. Depending on how much message and transaction data is in the replication queue and the available bandwidth between the sites, this process may take a long time. When this process is complete, the Replication service will no longer be degraded and the Message VPN will become eligible for synchronous replication.

    In the following example, the information displayed is for the New Jersey site, which is acting as active for the recently failed New York site.

    NJ_Appliance1# show message-vpn Trading_VPN replication

    Flags Legend:
    A - Admin State (U=Up, D=Down)
    C - Config State (A=Active, S=Standby)
    B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
    R - Remote Bridge State (U=Up, D=Down, -=N/A)
    Q - Queue State (U=Up, D=Down, -=N/A)
    S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
    M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
    T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)
    Message VPN                      A C W B R Q S M T
    -------------------------------- - - - - - - - - -
    Trading_VPN                      U A N - U U - N A
    NJ_Appliance1#

    A ‘Y’ under the ‘S’ column indicates that synchronous replication is eligible for the Trading_VPN Message VPN; in the output above the column still shows ‘-’, so eligibility has not yet been established.
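If a script polls this output while waiting, the eligibility check amounts to reading the ‘S’ flag from the VPN's row. A rough sketch of that scrape follows; it assumes the flag columns shown in the header above, and parsing CLI output is fragile, so treat it as illustrative rather than a supported automation interface:

```python
# Flag column order as shown in the 'show message-vpn ... replication' header.
FLAG_COLUMNS = ["A", "C", "W", "B", "R", "Q", "S", "M", "T"]

def sync_replication_eligible(vpn_row: str) -> bool:
    """Read the Sync Replication Eligible ('S') flag from one Message VPN
    row of the replication status output. Assumes the last fields on the
    row are the single-letter flags, in header order."""
    fields = vpn_row.split()
    flags = fields[-len(FLAG_COLUMNS):]
    return flags[FLAG_COLUMNS.index("S")] == "Y"
```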

    Step 4-6: If Necessary, Re-enable Replication

    If you previously had to suspend replication because the replication queue was at risk of overflowing, re-enable it. Enter the following CONFIG command:

    solace(configure/message-vpn/replication/queue)# reject-msg-to-sender-on-discard

    Step 4-7: Retrieving Replication Queue Spooled Messages from the Failed Site

    Asynchronously spooled messages on the formerly active site (New York) can only be consumed after activity is failed back to that site.

    To retrieve these messages, fail back to the New York site during the next maintenance window.