Switching Service Between Sites

To switch replication service (that is, to fail over) from site to site within a replication group, two activities must take place:

  • The state (active or standby) of the replication service on the event broker must be manually changed.
  • The clients must connect to the newly active site (this can be configured to happen automatically).

Changing the Replication State

In the event of a failure at the active replication site, a network operator can change the replication state of the Message VPNs on event brokers at the standby replication site from standby to active. The messaging clients can then reconnect to the Message VPNs of the same name on the newly active replication site, and the processing of guaranteed messages and transactions can continue.
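The state change described above is an operator action. As a rough illustration only, the sketch below builds (but does not send) a management request that would set a Message VPN's replication state to active. The SEMP v2 path `/SEMP/v2/config/msgVpns/{name}` and the `replicationRole` attribute are assumptions that should be confirmed against the SEMP reference for your broker version; the host name and VPN name are hypothetical.

```python
import json
from urllib.parse import quote


def build_replication_role_request(base_url, vpn_name, role):
    """Build (method, url, body) for a request that sets a Message VPN's replication role.

    Nothing is sent here; the caller decides how to authenticate and submit it.
    """
    if role not in ("active", "standby"):
        raise ValueError(f"role must be 'active' or 'standby', got {role!r}")
    # Assumed SEMP v2 config path and attribute name; confirm before use.
    url = f"{base_url.rstrip('/')}/SEMP/v2/config/msgVpns/{quote(vpn_name, safe='')}"
    body = json.dumps({"replicationRole": role})
    return "PATCH", url, body
```

Keeping the request construction separate from the transport makes it easy to review exactly what will be changed before an operator submits it to the standby site's event broker.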

There is a small possibility that, under high traffic rates or with unfortunate timing of a failover to the standby site, some messages are duplicated following the failover. Applications that cannot tolerate duplicate message delivery under any scenario should implement application-layer mechanisms (for example, globally unique message IDs) to detect duplicate message delivery. Alternatively, if the application maintains a history of the replication group message IDs it has processed, it can compare the replication group message ID of each received message against that history.
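The history-based approach can be sketched as follows. This is a minimal illustration, not part of any Solace API: the `DuplicateFilter` class and its `max_history` bound are hypothetical, and the ID passed in may be either an application-level unique ID or the string form of a replication group message ID.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Remember recently processed message IDs and flag repeats."""

    def __init__(self, max_history=10000):
        self._seen = OrderedDict()       # insertion-ordered set of IDs
        self._max_history = max_history  # hypothetical bound on memory use

    def is_duplicate(self, message_id):
        """Return True if message_id was already processed; otherwise record it."""
        if message_id in self._seen:
            return True
        self._seen[message_id] = None
        if len(self._seen) > self._max_history:
            # Evict the oldest ID so the history does not grow without limit.
            self._seen.popitem(last=False)
        return False
```

An application would call `is_duplicate()` before processing each received message and skip the message when it returns True. The history needs to be large enough to cover the window around a failover; an ID that has already been evicted can no longer be detected as a duplicate.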

When using transactions (local and XA), not all transaction states on the active site are mirrored on the replication standby site. Only the states needed to preserve transactional behavior across a failover are mirrored. For example, XA transactions in the ACTIVE state are not mirrored. In the event of a replication failover, the application server's transaction manager is expected to detect this and handle it appropriately.

For more information, refer to Detecting Duplicate Messages.

For instructions for performing replication failovers, refer to Procedures for Switching Replication Service Between Sites.

Reconnecting Clients

To automatically connect messaging clients to a newly active replication site, you can use the host list feature of the Solace messaging APIs. The host list provides the IP addresses or hostnames of the event brokers at both replication sites. Typically, the host list contains the primary site, followed by the backup site. When the client's connection to the primary site drops, or the primary site is switched to the standby state, the client automatically attempts to connect to the backup site, which is the next host in the host list.
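The ordering behavior of a host list can be illustrated with a small sketch. It does not use the Solace APIs, which handle this internally; `connect_with_host_list` and the `try_connect` callback are hypothetical names, and the host names are placeholders.

```python
def connect_with_host_list(host_list, try_connect):
    """Try each host in order and return the first one that accepts the connection.

    try_connect is a caller-supplied function that returns True on success.
    Returns None if no host in the list accepts the connection.
    """
    for host in host_list:
        if try_connect(host):
            return host
    return None


def make_site_connector(site_state):
    """Build a stand-in try_connect that only succeeds against an active site."""
    def try_connect(host):
        return site_state.get(host) == "active"
    return try_connect
```

With the primary site listed first, a client lands on the primary while it is active and falls through to the backup once the primary is unreachable or in the standby state.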

When a Message VPN in the replication active state is switched to replication standby, all active clients are disconnected.

Replication Failover Types

There are four types of replication failovers:

  • Controlled Failover
  • Uncontrolled Failover, Short-Term Outage
  • Uncontrolled Failover, Long-Term Outage
  • Uncontrolled Failover, Complete Failure

Controlled Failover

This is a failover triggered by an administrator following the documented procedure. In this case, both sites are in service and can communicate with each other.

Uncontrolled Failover, Short-Term Outage

In this failover type, the active site is out of service or isolated from the clients and the standby site for a short duration (minutes to hours). In this case, the standby site can be made active, and there is enough capacity in the replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) to store all the replicated messages and transactions (using asynchronous replication) until service or connectivity to the failed site is restored. The recovered site then becomes the standby and drains the replication queue from the active site, and full replication behavior is restored.

Uncontrolled Failover, Long-Term Outage

In this failover type, the active site is out of service or isolated for a long duration (days to weeks). In this case, the standby site can be made active, but the replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) may become full. When the replication queue fills up, reject-msg-to-sender-on-discard can be disabled on the replication queue to provide non-replicated service on the newly active site. Once the failed site is restored, it becomes the standby and drains the replication queue from the active site. Messages and transactions that did not fit in the replication queue are not replicated.
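The trade-off in the long-term case can be pictured with a toy model. This is not broker code: `ReplicationQueueModel`, its capacity counted in messages, and the returned labels are all hypothetical, and the model only mirrors the behavior described above (a full queue either rejects the publish or lets service continue without replication).

```python
class ReplicationQueueModel:
    """Toy model of a bounded replication queue on the newly active site."""

    def __init__(self, capacity, reject_to_sender_on_discard=True):
        self.capacity = capacity
        self.reject_to_sender_on_discard = reject_to_sender_on_discard
        self.queued = []          # waiting to be drained by the recovered site
        self.not_replicated = []  # served locally only, never replicated

    def publish(self, message):
        """Return 'queued', 'rejected', or 'local-only' for one published message."""
        if len(self.queued) < self.capacity:
            self.queued.append(message)
            return "queued"
        if self.reject_to_sender_on_discard:
            # Queue is full and the setting is enabled: the publisher is refused.
            return "rejected"
        # Setting disabled: service continues, but this message is not replicated.
        self.not_replicated.append(message)
        return "local-only"
```

In this model, disabling the setting keeps publishers working during a long outage at the cost of a growing set of messages that the recovered site will never receive.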

Uncontrolled Failover, Complete Failure

In this failover type, the active site is out of service and cannot be recovered (for example, the entire event broker has to be replaced with a new one). In this case, the standby site can be made active while waiting for a replacement, and replicated messages and transactions are stored in the replication queue. Depending on how long it takes to source the replacement, the short-term or long-term outage behaviors will apply.

Failing Back to the Originally Active Site

After a failover, and once the failed site has been restored, it may be desirable to fail back to the originally active site. This is especially true if the backup site has a lower capacity or less fault protection than the primary site. There are no special considerations for failing back after a controlled failover. However, after an uncontrolled failover, there is a risk of message loss or duplication that differs slightly depending on the scenario:

  • Uncontrolled Failover, Short-Term Outage: When failing back, replicated messages or transactions that were in progress when the failover occurred and have not been consumed are at risk of loss or duplication.
  • Uncontrolled Failover, Long-Term Outage: When failing back, replicated messages or transactions that were in progress when the failover occurred and have not been consumed are at risk of loss or duplication. In addition, only those messages and transactions that made it into the replication queue will be available.
  • Uncontrolled Failover, Complete Failure: Depending on how long it takes to replace the failed replication site, either the short-term or long-term outage considerations will apply. In addition, the replacement event broker will have no historical data, so replicated messages from before the failover that have not been consumed will be lost.

In all cases of failing back to the originally active site, the risk of message loss or duplication can be eliminated if all replicated messages that were published before the failover or were in progress during the failover have been consumed on the newly active site. In other words, if you leave the backup replication site active long enough to be certain that all messages published prior to the failover have been consumed, there will be no message loss.