Switching Replication Service from Site to Site

Transferring activity from one replication site to another can occur through either a controlled or uncontrolled fail-over.

  • controlled fail-over: A controlled fail-over occurs when activity is transferred from one replication site to its mate replication site in a planned manner for operational reasons. When the controlled fail-over procedure is followed successfully, no messages are lost, regardless of the replication mode of the messages.
  • uncontrolled fail-over: An uncontrolled fail-over occurs when connectivity is suddenly lost to a replication site that has Message VPNs in the replication active state. When this occurs, only messages using synchronous replication mode are guaranteed not to be lost. Messages remaining in the replication queue will not be available at the disaster recovery site.

The following sections provide the basic steps for performing a controlled fail‑over and for recovering from an uncontrolled fail‑over:

Performing Controlled Fail-Overs

To perform a controlled replication fail-over so that clients are switched over from a Message VPN with an active replication state at one site to the corresponding Message VPN at the alternate site, the following steps must be taken:

Note:  To be truly controlled, the replication bridge connection from the replication site that will take over activity must be bound to the other replication site’s replication queue. Before performing a controlled fail-over, every effort should be made to minimize the possibility of disconnecting the replication bridge. After the procedure has been started by configuring the Message VPNs in the site to give up activity with a replication standby state, if the replication bridge connection goes down before the replication queue has been drained, the fail-over becomes an uncontrolled fail-over.

In the example scenario used by these steps, the clients from a single Message VPN (Trading_VPN) with an active replication state at one site (New York) are switched to the corresponding Message VPN with a standby replication state at the alternate site (New Jersey).

While this simple example only shows replication sites with a single Message VPN, in real-world scenarios, these steps must be performed through the CLI for each Message VPN involved in a replication site fail-over.

Step 1: Verify that the Replication Bridge is Bound to the Replication Queue

Verify that the replication bridge from the router hosting the replication standby Message VPN that you want to make active is bound to the replication queue of the Message VPN that is currently replication active.

Here it is assumed that NY_Appliance1 and NJ_Appliance1 are active for the Guaranteed Messaging-enabled virtual router at their respective sites.

New York Data Center

NY_Appliance1> show message-vpn Trading_VPN replication

Flags Legend:
A - Admin State (U=Up, D=Down)
C - Config State (A=Active, S=Standby)
B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
R - Remote Bridge State (U=Up, D=Down, -=N/A)
Q - Queue State (U=Up, D=Down, -=N/A)
S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)

Message VPN                      A C B R Q S M T
-------------------------------- - - - - - - - - -
Trading_VPN                      U A - U U - N A

NY_Appliance1>

New Jersey Data Center

NJ_Appliance1> show message-vpn Trading_VPN replication

Flags Legend:
A - Admin State (U=Up, D=Down)
C - Config State (A=Active, S=Standby)
B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
R - Remote Bridge State (U=Up, D=Down, -=N/A)
Q - Queue State (U=Up, D=Down, -=N/A)
S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)

Message VPN                      A C B R Q S M T
-------------------------------- - - - - - - - - -
Trading_VPN                      U S U - - - N A

NJ_Appliance1>
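The check in this step can be scripted once the CLI output has been captured as text. The following Python sketch is only an illustration: how the output is collected is outside its scope, and the function names parse_replication_flags and bridge_is_bound are hypothetical helpers, not part of any Solace tool. It assumes that a Local Bridge State of U on the standby site means the bridge is up and bound, since the legend reports an unbound queue as Q.

```python
def parse_replication_flags(output, vpn_name):
    """Map each flag letter in the header row to its value for vpn_name."""
    header = None
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["Message", "VPN"]:
            header = fields[2:]  # flag letters, for example A C B R Q S M T
        elif header and fields and fields[0] == vpn_name:
            return dict(zip(header, fields[1:]))
    raise ValueError(f"no replication row found for {vpn_name}")


def bridge_is_bound(standby_output, vpn_name):
    """True when the VPN is standby (C=S) and its local bridge is up (B=U)."""
    flags = parse_replication_flags(standby_output, vpn_name)
    return flags["C"] == "S" and flags["B"] == "U"


# Table portion of the New Jersey output shown in this step.
NJ_OUTPUT = """\
Message VPN                      A C B R Q S M T
-------------------------------- - - - - - - - - -
Trading_VPN                      U S U - - - N A
"""

print(bridge_is_bound(NJ_OUTPUT, "Trading_VPN"))
```

Reading the flag letters from the header row, rather than hard-coding column positions, keeps the sketch usable if a software release adds or removes a column.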

Step 2: Switch the Message VPN Replication State to Standby

Switch the currently replication active Message VPN to standby.

New York Data Center

NY_Appliance1(configure)# message-vpn Trading_VPN

NY_Appliance1(configure/message-vpn)# replication state standby

Step 3: Wait for the Replication Queue in the Formerly Active Message VPN to Drain

Allow any messages or transactions that are in progress from the formerly replication active Message VPN to arrive at the corresponding Message VPN on its replication mate. Allowing all messages and transactions to propagate to the standby Message VPN can prevent the loss of asynchronously replicated messages and transactions.

To determine if the replication queue for the Message VPN that was just changed from active to standby has drained, enter the show queue command for the Message VPN’s replication queue (named #MSGVPN_REPLICATION_DATA_QUEUE). When the output displays a value of 0 for “Current Messages Spooled”, the queue has been drained.

New York Data Center

NY_Appliance1(configure)# show queue #MSGVPN_REPLICATION_DATA_QUEUE message-vpn Trading_VPN

Name                                 : #MSGVPN_REPLICATION_DATA_QUEUE
Message VPN                          : Trading_VPN
...
Current Messages Spooled             : 1
Current Spool Usage (MB)             : 0.0006
...

The system administrator should not configure the Message VPN in the other replication mate (NJ_Appliance1) as replication active until “Current Messages Spooled” is 0 for the replication queue for the Message VPN that was just switched to replication standby.
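The drain check described above can be sketched in Python as follows. This is a minimal illustration that assumes the show queue output has already been captured as text; the function names messages_spooled and queue_drained are hypothetical.

```python
import re


def messages_spooled(show_queue_output):
    """Return the number reported after 'Current Messages Spooled'."""
    match = re.search(r"Current Messages Spooled\s*:\s*(\d+)", show_queue_output)
    if match is None:
        raise ValueError("'Current Messages Spooled' not found in output")
    return int(match.group(1))


def queue_drained(show_queue_output):
    """True only when the replication queue reports zero spooled messages."""
    return messages_spooled(show_queue_output) == 0


# Excerpt of the sample output shown in this step.
SAMPLE = """\
Name                                 : #MSGVPN_REPLICATION_DATA_QUEUE
Message VPN                          : Trading_VPN
Current Messages Spooled             : 1
Current Spool Usage (MB)             : 0.0006
"""

print(queue_drained(SAMPLE))
```

With the sample output, one message is still spooled, so the mate site should not yet be made replication active.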

Step 4: Heuristically Commit or Roll Back Any In-Progress Transactions

If the Message VPN is using XA transactions, there may be some prepared transactions on the formerly active site that need to be heuristically committed or rolled back. Only prepared transactions have to be addressed. Transactions in other states can be ignored. If you do not deal with the prepared transactions:

  • transaction resources are wasted, which reduces the transaction handling capacity of both sites
  • duplicate message delivery or message loss may occur in the event of a fail-back to the originally active site

For information on how to heuristically commit or roll back transactions, refer to Performing Heuristic Actions on Transactions.

Note:  It is important that you only perform the heuristic commit or heuristic rollback operations on the formerly active Message VPN.

Deciding whether to commit or roll back the transaction will depend on various factors. When looking at XA transactions, the end goal is to make sure that the transactions are treated consistently on all branches of the distributed transaction across both replication sites. Here are some guidelines for making this decision:

  • For prepared XA transactions that are controlled by a transaction manager in an application server, you should check the logs or state of the transaction manager for the XID of the prepared transaction to examine the other branches of the distributed transaction:
    • If all the other branches have been committed, you should heuristically commit the transaction
    • If any of the other branches have rolled back, you should heuristically roll back the transaction
  • For prepared XA transactions that are not controlled by a transaction manager, manually coordinate the distributed transaction so that all the branches of the distributed transaction are either committed or rolled back.
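The guidelines above can be summarized as a small decision function. The Python sketch below is an illustration only: the outcome strings and the name heuristic_action are hypothetical, and the outcomes of the other branches are assumed to have been gathered by hand from the transaction manager's logs or state. Any case the guidelines do not cover is returned as needing manual coordination.

```python
def heuristic_action(other_branch_outcomes):
    """Suggest an action for a prepared XA transaction.

    other_branch_outcomes lists the outcome of every other branch of the
    distributed transaction, for example "committed" or "rolled back".
    """
    if not other_branch_outcomes:
        return "coordinate manually"
    if any(outcome == "rolled back" for outcome in other_branch_outcomes):
        return "heuristic rollback"
    if all(outcome == "committed" for outcome in other_branch_outcomes):
        return "heuristic commit"
    # Branches still prepared or in an unknown state: no safe automatic choice.
    return "coordinate manually"


print(heuristic_action(["committed", "committed"]))
print(heuristic_action(["committed", "rolled back"]))
print(heuristic_action(["committed", "prepared"]))
```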

Showing Replicated Transactions

To show replicated transactions, enter the following User EXEC command:

solace> show transaction replicated

Example:

solace> show transaction message-vpn blue_02 state PREPARED replicated
Flags Legend
T - Transaction Type (X=XA L=Local)
S - Transaction State (A=Active S=Suspended I=Idle P=Prepared C=Complete)
R - Replicated (Y=Yes N=No)
XID                                                                   Messages
Message VPN                                   T S R Last State Change  Spooled
--------------------------------------------- - - - ----------------- --------
0021ABC4-00-01
blue_02                                       X P Y                1s        0

To show the details of in-progress replicated transactions, enter the following User EXEC command:

solace> show transaction message-vpn blue_02 state PREPARED replicated detail

Example:

solace> show transaction replicated detail
XID: 0021B028-00-01
Message VPN: blue_02
Client: username/15848/#000c0001
Client Username: default
Session: N/A
Idle Timeout: 0
Type: XA
State: PREPARED
Replicated: Yes
Last State Change: 0d 0h 0m 0s
Messages: 10
Messages Published: 0
Messages Consumed: 150
Publisher Messages:
Message Id           Topic
-------------------- -----------------------------------------------------------
Consumer Messages:
Message Id           Type  Endpoint Name
-------------------- ----- -----------------------------------------------------
3118727406           queue test
3118727407           queue test
3118727408           queue test
3118727409           queue test
3118727410           queue test
3118727411           queue test
3118727412           queue test
3118727413           queue test
3118727414           queue test
3118727415           queue test

To show the details of a particular transaction, enter the following User EXEC command:

solace> show transaction xid <xid> detail

Where:

xid specifies the XID of the transaction to be displayed.
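To collect the XIDs that need heuristic completion, the detail output shown earlier can be parsed. The following Python sketch is a rough illustration under stated assumptions: the output has been captured as text, each transaction's block begins with its XID line (an assumption about how multiple transactions are listed), and the function names are hypothetical.

```python
def parse_transaction_details(output):
    """Split 'show transaction ... detail' output into one dict per transaction."""
    transactions = []
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # table rows and ruler lines contain no colon
        key, value = key.strip(), value.strip()
        if key == "XID":
            transactions.append({})  # assumed: each block starts at its XID line
        if transactions:
            transactions[-1][key] = value
    return transactions


def prepared_replicated_xids(output):
    """XIDs of transactions that are both PREPARED and replicated."""
    return [
        txn["XID"]
        for txn in parse_transaction_details(output)
        if txn.get("State") == "PREPARED" and txn.get("Replicated") == "Yes"
    ]


# Excerpt of the sample detail output shown above.
SAMPLE = """\
XID: 0021B028-00-01
Message VPN: blue_02
Type: XA
State: PREPARED
Replicated: Yes
Consumer Messages:
3118727406 queue test
"""

print(prepared_replicated_xids(SAMPLE))
```

A message table row that happens to contain a colon would be stored as an extra key; that does not affect the XID, State, or Replicated fields this sketch relies on.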

Step 5: Make the Formerly Replication Standby Message VPN Replication Active

To restore service, switch the formerly replication standby Message VPN to the replication active state.

NJ_Appliance1(configure)# message-vpn Trading_VPN

NJ_Appliance1(configure/message-vpn)# replication state active

At this point, clients should be able to reconnect to the Message VPN, and full replication service will resume.

Step 6: Delete the Heuristically Completed Transactions

If you previously heuristically completed transactions, you should delete them to free up the resources. You must always delete the completed transactions on the formerly active site. You may have to delete completed transactions on the newly active site, depending on the replication mode and the XA transaction manager. The XA transaction manager may automatically delete the heuristically completed transactions on the newly active Message VPN after it connects to the newly active site, as it reconciles the XA transaction states. You should allow this process to complete before deleting the completed transactions.

To delete a completed transaction, enter the following ADMIN command on the formerly active site:

solace(admin/message-spool)# delete-transaction xid <xid>

Where:

xid specifies the XID of the transaction to be deleted.

You should check both the standby and active Message VPNs for completed transactions.
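The ordering of the controlled fail-over procedure matters more than any single command, so it can help to write the sequence down as data before a maintenance window. The Python sketch below is only an illustration: controlled_failover_plan is a hypothetical helper, it does not connect to any router, and it lists only commands that appear in the steps above. The "<xid>" placeholder stays unfilled because XIDs are only known at run time.

```python
def controlled_failover_plan(vpn, active_router, standby_router):
    """Return (router, command, purpose) tuples in the order they are performed."""
    queue = "#MSGVPN_REPLICATION_DATA_QUEUE"
    return [
        (standby_router, f"show message-vpn {vpn} replication",
         "Step 1: confirm the replication bridge is bound"),
        (active_router, f"message-vpn {vpn}",
         "Step 2: enter the Message VPN configuration"),
        (active_router, "replication state standby",
         "Step 2: give up activity"),
        (active_router, f"show queue {queue} message-vpn {vpn}",
         "Step 3: repeat until Current Messages Spooled is 0"),
        (active_router, f"show transaction message-vpn {vpn} state PREPARED replicated",
         "Step 4: find prepared transactions to commit or roll back"),
        (standby_router, f"message-vpn {vpn}",
         "Step 5: enter the Message VPN configuration"),
        (standby_router, "replication state active",
         "Step 5: take over activity"),
        (active_router, "delete-transaction xid <xid>",
         "Step 6: delete heuristically completed transactions"),
    ]


for router, command, purpose in controlled_failover_plan(
        "Trading_VPN", "NY_Appliance1", "NJ_Appliance1"):
    print(f"{router:15} {command}  ({purpose})")
```

In a site with several Message VPNs, the same plan would be generated once per Message VPN.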

Recovering from Uncontrolled Fail‑Overs

In the event of a failure of an active data center or network isolation, there will not be an opportunity to gracefully release activity from the Message VPNs at that replication site.

There are three types of uncontrolled fail-overs:

  • Short-Term Outage: The active site is out of service or isolated for a short duration (for example, minutes or hours). The replication queue has enough capacity to store all replicated messages and transactions during the outage.
  • Long-Term Outage: The active site is out of service or isolated for a long duration (for example, days or weeks). The replication queue does not have enough capacity to store all replicated messages and transactions during the outage.
  • Complete Failure: The active site goes out of service and cannot be recovered. The router (or a critical component) has to be replaced, or data on the external disk is lost.

In all these types of fail-overs, the following general steps must be taken:

In the provided example, the New York replication site has experienced the failure, and its mate New Jersey site takes over activity until the New York site has been restored. For simplicity, only a single Message VPN (Trading_VPN) is presented in the example. When the failure occurred, Trading_VPN had a replication active state at the New York site and a replication standby state at the New Jersey site.

Note:  While these simple examples only show replication sites with a single Message VPN, in real-world scenarios, these steps must be performed for each Message VPN involved in a replication site fail-over.

Consequences of an Uncontrolled Fail-Over

There are a number of potential consequences of an uncontrolled fail-over, including the following:

  • the build-up of messages on the replication queue for the duration of the primary site outage
  • the replication queue becoming full
  • the loss of one or more routers at the failed site prior to restoring operation at the failed site
  • the possible loss of messages and transactions that were being replicated asynchronously
  • an increased probability and volume of duplicate message delivery

NOTICE: It is recommended that Solace Support staff be involved to address any such issues that may be present in the circumstances of a specific uncontrolled fail‑over.

Step 1: Make Message VPNs at Standby Site Replication Active to Restore Service

This procedure should be followed after it has been determined that an uncontrolled failure has occurred for a data center site.

Make a Replication Standby Message VPN Replication Active

To restore service, change the replication state of the Message VPN to active.

New Jersey Data Center

NJ_Appliance1(configure)# message-vpn Trading_VPN

NJ_Appliance1(configure/message-vpn)# replication state active

Clients will now be able to connect to the Message VPN.

Since a standby site is not available, asynchronous messages and transactions will be stored in the replication queue. By default, synchronous replication will switch to asynchronous, causing those messages and transactions to also be stored in the replication queue. If reject-msg-when-sync-ineligible is set on the Message VPN, synchronous replication will be blocked until the standby Message VPN is restored.
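The behavior described in the previous paragraph can be restated as a small lookup. This Python sketch is a simplified model for illustration only, not router logic; the function name and the returned strings are hypothetical.

```python
def handling_while_standby_unavailable(replication_mode,
                                       reject_msg_when_sync_ineligible=False):
    """Describe what happens to replicated traffic while the mate site is down."""
    if replication_mode == "async":
        return "stored in replication queue"
    if replication_mode == "sync":
        if reject_msg_when_sync_ineligible:
            return "blocked until standby is restored"
        # Default behavior: synchronous replication switches to asynchronous.
        return "stored in replication queue"
    raise ValueError(f"unknown replication mode: {replication_mode!r}")


print(handling_while_standby_unavailable("async"))
print(handling_while_standby_unavailable("sync"))
print(handling_while_standby_unavailable("sync", reject_msg_when_sync_ineligible=True))
```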

Step 2: Ensure Clients Cannot Connect to the Failed Site

It is important that the failed site does not come up with its replication Message VPNs in the active state. If both sites have an active replication state at the same time, proper operation cannot be guaranteed. Since the failed router was configured with replication state active when it failed or became unreachable, that will be its default state when it recovers. Note that if the failed site cannot be recovered and its configuration has to be restored from a backup, that backup configuration may have been saved with an active replication state, so this step applies in that case as well.

In this step, the goal is to allow the failed router to be powered on but prevent client connectivity. To do that, client connectivity to the data ports of the Network Acceleration Blade (NAB) should be blocked, while still allowing the router to be managed through the management ports.

Implementing this step requires a customer-specific administrative plan and depends on the type of failure. Possible options for this step are:

  • administratively shut down the NAB interface(s). This option may be available during network isolation of the message backbone where management access is maintained.
  • remove data cables from NAB data ports
  • prevent L2 switches that provide connectivity to the NAB ports from powering up
  • power up L2 switches that provide connectivity to the NAB ports and disable the NAB ports
  • prevent IP routers that provide client connectivity from providing connectivity to the router’s NAB ports

There may be other ways to accomplish this step, but this step should be formalized before an uncontrolled failure occurs so it is clear what actions to take should an actual failure occur.

Step 3: If Necessary, Suspend Replication

If the failed site takes a long time to recover, there is a risk that the replication queue will fill up. If this happens, messages published to replicated topics (in or out of transactions) will be rejected, since no replication service can be provided. If you know that there will be a prolonged outage, or the replication queue is getting close to filling up (a high event log has been triggered on the replication queue), it may be necessary to suspend the replication service so that non-replicated service can continue to be provided to the replicated topics.

To suspend replication, disable the reject-msg-to-sender-on-discard behavior on the replication queue using the following CONFIG command:

solace(configure/message-vpn/replication/queue)# no reject-msg-to-sender-on-discard

Note that with this setting, replicated service will continue until the replication queue gets full. Once it is full, only local, non-replicated service is provided.
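Whether suspension is needed depends on how quickly the replication queue is filling relative to the expected outage. The following Python sketch shows the arithmetic only. All inputs (queue quota, current usage, fill rate, expected outage length) are values an operator would have to supply, and the function names are hypothetical.

```python
def hours_until_queue_full(quota_mb, used_mb, fill_rate_mb_per_hour):
    """Estimate the hours left before the replication queue reaches its quota."""
    if fill_rate_mb_per_hour <= 0:
        return float("inf")  # queue is not growing
    remaining_mb = max(quota_mb - used_mb, 0)
    return remaining_mb / fill_rate_mb_per_hour


def should_consider_suspending(quota_mb, used_mb, fill_rate_mb_per_hour,
                               expected_outage_hours):
    """True when the queue is expected to fill before the failed site returns."""
    time_left = hours_until_queue_full(quota_mb, used_mb, fill_rate_mb_per_hour)
    return time_left < expected_outage_hours


# Hypothetical numbers: 1000 MB quota, 400 MB used, growing by 50 MB per hour.
print(hours_until_queue_full(1000, 400, 50))
print(should_consider_suspending(1000, 400, 50, expected_outage_hours=24))
```

With these example numbers, the queue would fill in 12 hours, so a 24-hour outage would warrant considering suspension while a 6-hour outage would not.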

Step 4: Bring Message VPNs at the Failed Site Back Online as Replication Standby

Once the failed site has been recovered with management access but no client access (see step 2), it can be prepared to be the standby site. Here are the steps for preparing the recovered appliance to be the standby site:

  • Step 1: Configure All Message VPNs as Standby
  • Step 2: Verify the Message Spool
  • Step 3: Heuristically Complete Transactions
  • Step 4: Allow Clients to Connect
  • Step 5: Wait For Synchronous Replication to be Eligible
  • Step 6: If Necessary, Re-enable Replication

    Step 1: Configure All Message VPNs as Standby

    Configure all Message VPNs on the restored replication site with a standby replication state. In this example, the Message VPN Trading_VPN at the New York site is configured with a standby replication state:

    New York Data Center

    NY_Appliance1(configure)# message-vpn Trading_VPN

    NY_Appliance1(configure/message-vpn)# replication state standby

    The Config-Sync facility propagates this setting to the Trading_VPN Message VPN on NY_Appliance2.

    Step 2: Verify the Message Spool

    You should verify that the message spool for the routers at the failed replication site is now capable of providing service.

    Before continuing, ensure that the message spool on the recovered site is active for the primary virtual router. In the sample output below, the Activity Status of Local Inactive and the Message Spool Status of AD-NotReady indicate that the router and the message spool it uses are not active.

    NY_Appliance1# show redundancy

    Configuration Status     : Enabled
    Auto Revert              : No
    Redundancy Mode          : Active/Active
    Mate Router Name         : solaceBackup
    ADB Link To Mate         : Up
    ADB Hello To Mate        : Down

                                   Primary Virtual Router  Backup Virtual Router
                                   ----------------------  ----------------------
    Activity Status                Local Inactive          Local Active
    Routing Interface              1/1/lag1:1              1/1/lag1:3
    VRRP VRID                      33                      34
    Routing Interface Status       Up                      Up
    VRRP Status                    Master                  Master
    VRRP Priority                  75                      250
    Message Spool Status           AD-NotReady             AD-Disabled
    Priority Reported By Mate      Backup-Reconcile        Primary-Reconcile

    In this situation, you must resolve the issue preventing the failed router from becoming active. If you cannot resolve the issue, contact Solace Support.

    Step 3: Heuristically Complete Transactions

    If applicable, heuristically commit or heuristically roll back any prepared transactions on the failed site. Once the transactions are heuristically completed, delete them to free up the resources.

    To commit, roll back, or delete a transaction, enter the appropriate ADMIN commands on the failed site:

    New York Data Center

    solace(admin/message-spool)# commit-transaction xid <xid>

    and/or

    solace(admin/message-spool)# rollback-transaction xid <xid>

    and then

    solace(admin/message-spool)# delete-transaction xid <xid>

    Where:

    xid specifies the XID of the transaction to be committed, rolled back, or deleted.

    Step 4: Allow Clients to Connect

    Restore client connectivity to the NAB data ports. This allows both clients and the replication bridge to connect, which in turn allows data to be synchronized from the active site (from the New Jersey site to the New York site in the example). The method through which connectivity to the NAB data ports is restored depends on the action chosen in Step 2: Ensure Clients Cannot Connect to the Failed Site of the parent procedure.

    Step 5: Wait For Synchronous Replication to be Eligible

    Once connectivity is restored between the recovered site and the active site, the replication bridge will connect from the standby site to the active site and drain the replication queue in order to synchronize the two sites. Depending on how much message and transaction data is in the replication queue and the available bandwidth between the sites, this process may take a long time. When this process is complete, the Replication service will no longer be degraded and the Message VPN will become eligible for synchronous replication.

    In the following example, the information displayed is for the New Jersey site, which is acting as active for the recently failed New York site.

    NJ_Appliance1# show message-vpn Trading_VPN replication

    Flags Legend:
    A - Admin State (U=Up, D=Down)
    C - Config State (A=Active, S=Standby)
    B - Local Bridge State (U=Up, Q=Queue Unbound, D=Down, -=N/A)
    R - Remote Bridge State (U=Up, D=Down, -=N/A)
    Q - Queue State (U=Up, D=Down, -=N/A)
    S - Sync Replication Eligible (Y=Yes, N=No, -=N/A)
    M - Reject Msg When Sync Ineligible (Y=Yes, N=No)
    T - Transaction Replication Mode (A=Async, S=Sync, -=N/A)

    Message VPN                      A C W B R Q S M T
    -------------------------------- - - - - - - - - -
    Trading_VPN                      U A N - U U - N A

    NJ_Appliance1#

    A ‘Y’ under the ‘S’ column indicates that the Message VPN Trading_VPN is eligible for synchronous replication. Wait until this column shows ‘Y’ before continuing.
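Because draining the replication queue may take a long time, the wait is a natural candidate for a polling loop. The Python sketch below is a generic illustration, not a Solace utility: wait_for is a hypothetical helper, and fetch stands in for whatever mechanism an operator uses to obtain the current flag values as a dictionary (here it is faked with a fixed sequence).

```python
import time


def wait_for(fetch, is_ready, interval_s=30.0, timeout_s=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch() repeatedly until is_ready(result) is true or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if is_ready(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Fake data source: the S flag becomes Y on the third poll.
observations = iter([{"S": "N"}, {"S": "N"}, {"S": "Y"}])

became_eligible = wait_for(
    fetch=lambda: next(observations),
    is_ready=lambda flags: flags.get("S") == "Y",
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(became_eligible)
```

Passing sleep and clock as parameters keeps the loop easy to exercise without real delays. The same helper could wait on the replication queue drain by supplying a different is_ready predicate.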

    Step 6: If Necessary, Re-enable Replication

    If you previously had to suspend replication because the replication queue overflowed, re-enable it. Enter the following CONFIG command:

    solace(configure/message-vpn/replication/queue)# reject-msg-to-sender-on-discard

Failing Back to Restored Sites after an Uncontrolled Fail-Over

Once the Message VPN replication has become synchronous eligible, you can fail back to the restored site. This may be desirable if the primary site has higher capacity or capabilities than the backup site. However, failing back immediately is not recommended. The safest way to proceed is to wait to fail back until all messages replicated before the fail-over have been consumed.

The procedure for failing back is the same as for a planned outage (see Performing Controlled Fail-Overs). However, the following considerations apply, especially if you are considering failing back shortly after having switched activity:

  • Replicated messages or transactions that were in progress when the fail-over occurred and have not been consumed on the active site are at risk of loss or duplication.
  • If the message spool of the restored site could not be recovered, messages replicated before the failure that have not been consumed on the active site are lost.

If there is no hardware failure of the ADB or loss of data on the external disk, then the pre-failure state of the message spool can be recovered. Examples of failure that allow for the recovery of the state include:

  • network isolation
  • power failure
  • temporary loss of connectivity to external disk

If the pre-failure message spool can be recovered and the replication queue on the now active site has not filled, then messages that were replicated before the fail-over are available and full replication behavior is restored. However, replicated messages or transactions that were in progress when the fail-over occurred and have not been consumed on the active site are at risk of loss or duplication when the originally active site becomes active again (fail-back). In a long-term failure where the replication queue fills, only messages and transactions that made it into the replication queue will be available on a fail-back.

If there is a hardware failure or loss of data on the external disk, then the pre-failure message spool cannot be recovered and will be empty. In addition to the risk to in-progress messages and transactions, messages replicated before the failure that have not been consumed on the active site are lost on a fail-back.

The risks of loss and duplication can be eliminated, assuming replication was never disabled, if you wait to fail back until all replicated messages that were published before the unplanned fail-over have been consumed on the newly active site. One way to tell that this has happened is that the endpoints on both sites have exactly the same number of messages. Another method is to inspect the oldest message in endpoints holding replicated messages. If the oldest message was published after the uncontrolled fail-over for all endpoints, then it is safe to fail back.
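The oldest-message check can be expressed as a simple comparison. The Python sketch below is illustrative only: it assumes an operator has already recorded, for each endpoint holding replicated messages, the publish time of its oldest message (or None if the endpoint is empty). How those times are obtained is not covered here, and the endpoint names and the function name safe_to_fail_back are hypothetical.

```python
from datetime import datetime


def safe_to_fail_back(oldest_publish_times, failover_time):
    """True when every endpoint's oldest message was published after the fail-over.

    oldest_publish_times maps endpoint name to the publish time of its oldest
    message, or None when the endpoint holds no messages.
    """
    return all(
        published is None or published > failover_time
        for published in oldest_publish_times.values()
    )


failover = datetime(2024, 1, 10, 9, 30)   # hypothetical fail-over time
endpoints = {
    "orders": datetime(2024, 1, 10, 11, 0),
    "fills": None,                        # empty endpoint
    "quotes": datetime(2024, 1, 10, 9, 0),  # still holds a pre-fail-over message
}

print(safe_to_fail_back(endpoints, failover))
```

In this example the quotes endpoint still holds a message published before the fail-over, so failing back would not yet be safe.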