Data Center Replication

The Solace Replication facility provides a data center redundancy and disaster recovery solution for the Solace Message platform. It allows mission‑critical applications to continue to function during a major service outage to a data center.

Note:  To implement Replication, Config-Sync must be enabled for each router in a Replicated site. The Config‑Sync facility provides automatic synchronization of Message VPN configuration parameters that are required to match between Replicated routers. For more information, see Config-Sync. For redundant appliances that are handling Guaranteed Messaging, durable endpoint information such as queue and topic endpoints, topic-to-queue mappings and queue options are automatically propagated whether Config-Sync is enabled or not.

When the Replication facility is implemented, Guaranteed messages that are published to Message VPNs with Replication active state at one data center are automatically propagated across Replication bridge links to matching Message VPNs with Replication standby state at another data center located in a separate geographic location. In addition, if the messages are part of a local or XA transaction, the transaction is propagated to the standby site and transaction semantics are respected. For example, rolling back a transaction would roll it back on both sites. Preparing an XA transaction would prepare the transaction on both sites. In a scenario where a major service outage occurs for one Replicated data center (that is, one Replication site), a service fail-over to the operational mate Replication site can be performed.

A typical customer deployment model for replicated data center infrastructure is to have a pair of Replication sites located some distance apart (perhaps 50 or 100 miles). These sites are considered Replication mates. The main or primary site will use a high-availability (HA) pair of routers to protect against a local failure of a router or equipment in that site. The secondary or standby site may have a single router or an HA pair of routers. The primary site provides service unless there is a failure of the primary site. If the primary site fails, service is failed over to the backup site. Once the primary site is restored, service can be failed back to the primary site.

The fail-over of a Replication site is often not an action that can just be performed at the messaging layer—typically there are servers, critical applications, and other infrastructure that must be switched as part of the fail-over. Therefore the fail-over is a co-ordinated operation that must be performed by network administrators. It does not happen automatically.

Note:  Replication is not a replacement for HA router redundancy within a data center. Router redundancy provides automatic protection against a single router failure. Replication protects against more catastrophic events in the data center and requires manual intervention to effect a fail over.

How Replication Works

This section provides a simplified description of how Replication works. All diagrams and descriptions are for the synchronous mode of Replication, where messages are stored on both sites before responding to the client. For more details of synchronous Replication, see Synchronous and Asynchronous Message Replication.

The details of how Replication works depends on whether the Guaranteed messages are part of a transaction or not. There are also differences in how local and XA transactions are handled.

Non-Transacted Messages

The figure below provides an example of how Replication uses the Guaranteed Messaging capabilities of Solace routers to propagate non-transacted Guaranteed messages published to a topic that you want to replicate.

Non-Transacted Message Replication Process

Here is how it works when publishing and consuming non-transacted messages:

  1. The client publishes a message that matches a Replicated topic subscription that is configured for the Message VPN with an active Replication state. The message is persisted on the Solace router.
  2. A copy of the message is sent over a Replication bridge (named #MSGVPN_REPLICATION_BRIDGE) that is automatically created when Replication is enabled for a Message VPN. This bridge links the Message VPN with the active Replication state and its mate Message VPN of the same name that has a standby Replication state.
  3. Note:  For added security, Transport Layer Security (TLS) / Secure Sockets Layer (SSL) encryption can be used on Message VPN Replication bridges and Replication Config-Sync bridges. For information on using SSL encryption on Replication Config-Sync bridges and Message VPN replication bridges, refer to Managing TLS/SSL Service.

  4. The message is also persisted on the Replication mate.
  5. When the message is consumed from an endpoint on the Replication active Message VPN, it is removed from the persistent store.
  6. An acknowledgment (ack) is propagated to the Replication standby Message VPN that the message has been delivered to the consumer.
  7. The message is then also removed from the persistent store of the router the Replication standby Message VPN resides on.

Local Transactions

With a local transaction, the messages and acknowledgments are propagated to the Replication mate as a part of an internal XA transaction.

This is how it works for when publishing messages in a local transaction:

  1. The client opens a local transaction and publishes multiple messages that matches a Replicated topic subscription. The messages are persisted on the active site.
  2. The client commits the transaction. An internal XA transaction containing the messages is created.
  3. Copies of the messages are sent over the Replication bridge to standby site as part of the internal XA transaction and persisted on the standby site.
  4. The internal XA transaction is committed on the standby site. The messages are inserted into the destination endpoint(s) standby site.
  5. The messages are inserted into the destination endpoints on the active site. The internal XA transaction completes. The commit response is sent back to the client.

This is how it works when consuming messages in a local transaction:

  1. The client opens a local transaction and consumes multiple messages that have been replicated (that is, when they were published, they matched a replicated topic) from an endpoint.
  2. The client commits the transaction. An internal XA transaction containing an acknowledgment of the consumed messages is created.
  3. The acknowledgment is sent over the Replication bridge to standby site as part of the internal XA transaction.
  4. The internal XA transaction is committed on the standby site. The consumed messages are removed from the endpoint on the standby site.
  5. The internal XA transaction completes. The consumed messages are removed from the endpoint on the active site. The commit response is sent back to the client.

Note:  It is possible for a single local transaction to include both the publishing and consuming of messages. For simplicity, the above transaction example only showed the publishing and consuming of messages.

XA Transactions

With an XA transaction, the XA transaction is extended to include the standby site in the distributed transaction.

This is how it works when publishing messages in an XA transaction:

  1. The client starts an XA transaction and publishes multiple messages that match a Replicated topic pattern. The messages are persisted on the active site.
  2. Copies of the messages are sent immediately over the Replication bridge to the standby Replication site. The messages are persisted on the standby site.
  3. The client ends the XA transaction. The transaction state is updated on the active site.
  4. The client prepares the XA transaction. The active site checks if the transaction can succeed. If it can, the prepare request is sent to the standby site, where it is checked. If it can also succeed on the standby site, the XA transaction state is updated on both active and standby sites. The prepare response is sent back to the client.
  5. The client commits the XA transaction. The messages are inserted into the destination endpoints on the active site. The commit request is sent to the standby site and the messages are inserted into the destination endpoints on the standby site.
  6. The commit response is sent back to the client.
  7. The XA transaction is completed.

This is how it works when consuming messages in an XA transaction:

  1. The client starts an XA transaction and consumes multiple messages that has been replicated (that is, when they were published, they matched a replicated topic).
  2. The client ends the XA transaction. The transaction state is updated with the acknowledgment on the active site.
  3. The client prepares the XA transaction. The active site checks if the transaction can succeed. If it can, the prepare request is sent to the standby site, where it is checked. If it can also succeed on the standby site, the XA transaction state is updated on both active and standby sites. The prepare response is sent back to the client.
  4. The client commits the XA transaction. The messages are removed from endpoint on the active site. The commit request is sent to the standby site and the messages are removed from the endpoint on the standby site.
  5. The commit response is sent back to the client.
  6. The XA transaction is completed.

Note:  It is possible for an XA transaction to include both the publishing and consuming of messages. For simplicity, the above transaction example only showed the publishing and consuming of messages.

Replicated Topics

To indicate which messages should be replicated between the active and standby site, you must configure a replicated topic subscription for a Message VPN. This topic pattern can be a topic subscription or a queue name subscription (a subscription for a queue is #P2P/QUE/<queueName>). If a published message matches both a replicated topic and an endpoint on the active site, then the message is replicated to the standby site.

Transactions

Which messages published within a transaction (local and XA) are replicated is determined by the same replicated topic subscriptions as non-transacted messages. Only those messages that match a replicated topic subscription are part of the extended transaction to the mate Replication site. Because of this, messages published within a transaction should all match a replicated topic subscription, otherwise parts of the transaction will not be sent to the mate replication site. In the event of a fail-over, the transaction contents will not match.

For example, if an XA transaction contains 10 messages, six of which match a replicated topic and four of which do not, only the six matching messages become part of the transaction interactions with the standby site. The active site will have 10 messages and the standby site will have only six when the transaction is committed.

Synchronous and Asynchronous Message Replication

Replication can be performed in one of two modes:

  • Synchronous Replication—A message or transaction is not considered persisted until it has been confirmed to be stored on both the active and standby sites. While providing a greater guarantee that the published message or transaction will not be lost in an uncontrolled fail-over, synchronous Replication incurs a performance penalty for the publisher, especially blocking publishers. This is because the publisher has to wait for communication between the two Replication sites to complete before publishing the next message or transaction. In those use cases, the maximum message rate of a single publisher is limited by the round-trip time and available bandwidth between the active and the standby sites.
  • Asynchronous Replication—A message or transaction is considered persisted once it has been stored on the active site and put into the Replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) on the active site to be delivered to the standby site. This type of Replication gives improved performance, since it does not have to wait for communication with the standby site to complete, but during a uncontrolled failure of the active site there is a chance that a message or transaction which the client has been told has completed has not been delivered to the standby site. In this uncontrolled case, messages or transactions may be lost or duplicated.

When using non-transacted messages, replicated topics are configured to use either synchronous or asynchronous Replication mode. Messages published to a replicated topic that is configured as synchronous are not acknowledged until the message is stored on both the active and standby site. Messages published to a asynchronous topics are acknowledged once the message is stored on the active site and put into the Replication queue for delivery to the standby site. If a message matches both a synchronous and an asynchronous replicated topic, synchronous Replication is used.

When using transactions, the Replication mode is set at the message VPN level. All local and XA transactions in the message VPN use the same Replication mode. Synchronous transactions must be stored on the standby site before responding to the client. Asynchronous transactions only need to be stored in the Replication queue. The Replication mode of the replicated topics is ignored when using transactions.

Downgrading to Asynchronous When Sync Ineligible

When using synchronous Replication, if the Replication bridge connection is very slow or goes down, processing of replicated messages and transactions will stop. To allow message and transactions to continue to be processed, by default the Message VPN will switch to asynchronous Replication when the standby site is unreachable or slow. The message VPN is considered synchronous ineligible and the Replication service will be considered degraded. All messages published to synchronous replicated topics and synchronous transactions will be treated using asynchronous handling.

This behavior helps to avoid a business interruption when the standby site is temporarily unreachable.

The Replication service is considered degraded when any of the following occurs:

  • The Replication bridge is disconnected from the Replication queue.
  • A message can be put in the Replication queue but cannot be sent immediately (streamed) to the Replication mate because the link is slow or the replication queue is backed up. This state must persist for 30 seconds before replication is considered degraded.
  • A message put in the Replication queue can be sent immediately but is not acknowledged by the replication mate within 30 seconds (and is therefore retransmitted). This state must persist for 30 seconds before Replication is considered degraded.

The router generates notification events whenever it transitions between synchronous eligible and ineligible (degraded or not degraded).

Preventing Downgrade to Asynchronous Replication

If the risks associated with asynchronous replication are not acceptable, it is possible to ensure synchronous replication is always strictly enforced. To maintain a synchronous replication mode, you can enable the reject-msg-when-sync-ineligible option for a Message VPN. With this enabled, synchronous replicated messages or transactions are rejected if they cannot be successfully stored on both the active and standby site.

Replication Queue Full

If the standby site is down or unreachable for an extended amount of time, the Replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) can eventually become full. It is assumed that replication is a high-priority service, so the replication queue size should be set to a large value. A system administrator can adjust the quota of a Replication queue. Like all queues, the Replication queue has threshold event logs as it fills up, which warns you that the queue may become full and action should be taken.

By default, if the Replication queue becomes full, publishing to replicated topics or using replicated transactions is rejected. No new replicated messages or transactions can be processed. To restore non-replicated service, the administrator has the option of disabling this behavior (by disabling the reject-msg-to-sender-on-discard feature on the replication queue). With this configuration, messages and transactions are not replicated, but local service can be provided. If connectivity is restored and the replication queue is drained, replication will start again. At this point, the administrator should re-enable reject-msg-to-sender-on-discard on the Replication queue.

Pruning Replication Queues

When the standby site is down and the active site is storing messages and transactions in the replication queue, there is an optimization to help prevent the replication queue from becoming full. Replicated messages that are waiting to be sent to the standby site do not need to be sent if the message has subsequently been consumed on the active site. Since the message is no longer present on the active site, there is no need to send it to the standby site for replication purposes. In this case, those messages are removed, or pruned, from the replication queue, helping to prevent it from becoming full.

Switching Service from Site to Site

There are two aspects to switching replication service from site to site. The router replication service must be manually switched and the clients must be configured to automatically connect to the newly active site.

Router Service Switch

In the event of a failure at the active replication site, a network operator can change the replication state of the Message VPNs on routers at the standby replication site from standby to active. The messaging clients can then reconnect to the Message VPNs of the same name on the newly active replication site, and the processing of guaranteed messages and transactions can continue.

There is a small possibility that under high traffic rates or unfortunate timing of a fail-over to the standby site, some messages could be duplicated following a fail-over. It is recommended that applications that cannot tolerate duplicate message delivery under any scenario should implement application-layer mechanisms (for example, globally-unique message IDs) to detect duplicate message delivery.

When using transactions (local and XA), not all transaction states on the active site are mirrored on the replication standby site. Only those states that are necessary to preserve the transactional behaviors on a fail-over are preserved. For example, XA transactions in the ACTIVE state are not mirrored. In the event of a replication fail-over, an application server transaction manager is expected to detect this and will handle it properly.

For procedural information on how to perform replication fail-overs, refer to Switching Replication Service from Site to Site.

Client Configuration

The host list feature of the Solace messaging APIs provides messaging clients with ability to automatically switch to a newly active replication site. The host list provides the IP addresses or hostnames of the routers in both the replication sites. Typically, the host list contains the primary site, followed by the backup site. When its connection to the primary site drops or is changed to standby state, the client automatically attempts to connect to the backup site, which is the next host in the host list.

Note:  When a Message VPN that has a replication active state is switched to replication standby, all active clients are disconnected.

Replication Fail-Over Types

There are four types of replication fail-overs:

  • Controlled Fail-over
  • Uncontrolled Fail-over, Short-Term Outage
  • Uncontrolled Fail-over, Long-Term Outage
  • Uncontrolled Fail-over, Complete Failure

In a uncontrolled fail-over, there are some special considerations when failing back to the originally active site.

Controlled Fail-over

This is a fail-over triggered by an administrator following the documented procedure. In this case, both sites are in service and can communicate with each other.

Uncontrolled Fail-Over, Short-Term Outage

In this fail-over type, the active site is out-of-service or isolated from the clients and the standby site for a short duration (minutes, hours). In this case, the standby site can be made active and there is enough capacity in the replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) to store all the replicated messages and transactions (using asynchronous replication) until service or connectivity to the failed site is restored. The recovered site becomes the standby, drains the replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) from the active site, and full replication behavior is restored.

Uncontrolled Fail-over, Long-Term Outage

In this fail-over type, the active site is out-of-service or isolated for a long duration (days, weeks). In this case, the standby site can be made active, but the replication queue (#MSGVPN_REPLICATION_DATA_QUEUE) may become full. When the replication queue fills up, reject-msg-to-sender-on-discard can be disabled on the replication queue to provide non-replicated service on the newly active site. Once the failed site is restored, the recovered site becomes standby and drains the replication queue from the active site. Messages and transactions that did not fit in the replication queue were not replicated.

Uncontrolled Fail-over, Complete Failure

In this fail-over type, the active site is out of service and cannot be recovered (for example the entire router has been replaced with a new one). In this case, the standby site can be made active while waiting for a replacement and replicated messages and transactions are stored in the replication queue. Depending on how long it takes to source the replacement, the short-term or long-term outage behaviors will apply.

Failing-Back to the Originally Active Site

After an uncontrolled fail-over and once the failed site has been restored, a fail back to the originally active site may be desirable. This is especially true if the backup site has lower capacity or fault protection than the primary site. When failing back after a controlled fail-over, there are no special considerations. However, after an uncontrolled fail-over, there are some considerations, depending on the fail-over type:

  • Uncontrolled Fail-over, Short-Term Outage—When failing back replicated messages or transactions that were in progress when the fail-over occurred and have not been consumed are at risk of loss or duplication.
  • Uncontrolled Fail-over, Long-Term Outage—When failing back replicated messages or transactions that were in progress when the fail-over occurred and have not been consumed are at risk of loss or duplication. In addition, only those messages and transactions that made it into the replication queue will be available.
  • Uncontrolled Fail-over, Complete Failure—Depending on how long it takes to replace the failed replication site, either the short- or long-term failure considerations will apply. In addition, the replacement router will have no historical data, so replicated messages from before the fail-over that have not been consumed would be lost.

In all cases, the risk of loss or duplication when failing back to the originally active site can be eliminated if all replicated messages published before the fail-over and were in-progress during the fail-over have been consumed on the newly active site. In other words, if you leave the backup replication site active long enough to be certain that all messages published prior to the fail-over have been consumed.

Allowing Clients to Connect to Standby Sites

By default, clients consume replicated messages (as part of a transaction or not) from Message VPNs with replication active states. If a client attempts to connect to a standby replication site, the connect attempt is rejected. However, it is possible to allow clients to connect and consume messages from endpoints on a replication active Message VPN and its mate replication standby Message VPN.

An example of this type of deployment would be one using a replay server. A replay server must consume all messages so they can be later replayed. For the purposes of disaster recovery, an instance of the replay server must run on both the active and the standby sites. In the event of a fail-over, the replay server on the standby site will have consumed the same set of messages as the server on the active site and therefore would be able to provide the same replay service.

To enable a replication deployment like this, some extra configuration is required:

  • the allow-clients-when-standby option must be enabled for all client profiles that are used by clients allowed to connect on to the standby site.
  • the propagation of consumer acks must be disabled for each endpoint that these standby clients will consume messages from. Ack propagation is the mechanism by which the consuming of messages on the active site is signaled to the standby site. Since messages are being consumed on both sites, this signaling needs to be disabled. The configuration is the same whether or not the messages are being consumed in a transaction or not.

Although clients are allowed to connect to the standby site, their capabilities are limited. They are not allowed to publish guaranteed messages, create endpoints, or modify existing endpoints in any way. They are, however, allowed to publish direct messages and add direct subscriptions. This allows for direct message communication between clients connected to the standby site for co-ordination purposes.

Note:  Host lists should not be used for a standby client. In this scenario, there should be a client for each site and the clients should only connect to their specific site, so a list of hosts is not needed.

Providing Service at Both Replication Sites

A possible customer deployment model for replicated data center infrastructure is to have a pair of replication sites located some distance apart (perhaps 50 or 100 miles). Each replication site will use high-availability (HA) pairs of redundant routers: one replication site can actively service one set of Message VPNs, while the other replication site actively services another set of Message VPNs. That is, the HA pairs at one site have Message VPNs with replication active states and other Message VPNs with replication standby states, and the HA pairs at the other replication site have the same Message VPNs but with the opposite replication states.

This allows both replication sites to provide messaging services to separate sets of clients. However, they should maintain enough capacity to provide service to all clients of the two replication sites if there is a planned or unplanned service outage at one of the sites.

The figure below provides a simplified example, where at the New York data center the Message VPNs A, B, C have an replication active state, and the Message VPNs D, E, F have a replication standby state; whereas at the New Jersey data center, the Message VPNs D, E, F are active and the Message VPNs A, B, C are standby.

Note:  Replication can also be deployed with single routers at one or both sites, rather than redundant pairs. However this configuration reduces the fault tolerance of the solution.