PubSub+ Cache and Datacenter Redundancy

To protect against the failure of an entire datacenter, an administrator may choose to deploy additional event brokers and PubSub+ Cache Instances in an in‑region backup datacenter, which connects to the primary datacenter over the WAN. This deployment model (as shown below) builds on the previously discussed redundant event broker deployment model.

The dashed Multiple-Node Routing links shown above are optional and are not needed in most deployments. They guard against network bifurcation in the unlikely event of a double event broker failure. For example, without those links, Region1-PDC-Primary and Region1-BDC-Backup could not communicate with each other after a simultaneous loss of both Region1-PDC-Backup and Region1-BDC-Primary. Full-mesh connectivity is not strictly needed because:

Alternate routing paths are only in place to route around failures of event brokers. Network connectivity issues are expected to be “routed around” by the underlying IP network.
The cost of hopping across an additional event broker to reach another node is minimal, especially over the LAN inside a datacenter. No traffic is wasted when a message transits one event broker to reach another, because published messages must already be sent to all four event brokers in the Cache Cluster to keep all PubSub+ Cache Instances up to date.

The link costs assigned to the Multiple-Node Routing neighbor links have been carefully chosen to ensure that messages only traverse the WAN between datacenters once and are then distributed locally between event brokers in the other datacenter over the LAN in that datacenter.
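The effect of link costs can be illustrated with a small lowest-cost-path calculation. The following is a minimal sketch, not the event broker's routing implementation. The topology, the cost values (1, 100, 200), and all function names are assumptions made for illustration and are not taken from the figures. It assumes one low-cost WAN link between the two primary event brokers and higher-cost WAN links elsewhere, then counts how many WAN links appear in the lowest-cost distribution tree from each event broker.

```python
import heapq

# Hypothetical costs: these values are assumptions, not the ones in the figures.
LAN, WAN_PREFERRED, WAN_ALTERNATE = 1, 100, 200

# Assumed solid links: one LAN link per datacenter and two WAN links.
SOLID_LINKS = {
    ("Region1-PDC-Primary", "Region1-PDC-Backup"): LAN,
    ("Region1-BDC-Primary", "Region1-BDC-Backup"): LAN,
    ("Region1-PDC-Primary", "Region1-BDC-Primary"): WAN_PREFERRED,
    ("Region1-PDC-Backup", "Region1-BDC-Backup"): WAN_ALTERNATE,
}

# Assumed optional (dashed) links that complete the full mesh.
OPTIONAL_LINKS = {
    ("Region1-PDC-Primary", "Region1-BDC-Backup"): WAN_ALTERNATE,
    ("Region1-PDC-Backup", "Region1-BDC-Primary"): WAN_ALTERNATE,
}


def datacenter(node):
    """Return 'PDC' or 'BDC' from a name such as 'Region1-PDC-Primary'."""
    return node.split("-")[1]


def build_graph(links, failed=()):
    """Build an undirected adjacency list, skipping links to failed brokers."""
    graph = {}
    for (a, b), cost in links.items():
        if a in failed or b in failed:
            continue
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph


def lowest_cost_tree(graph, source):
    """Dijkstra's algorithm: map each reachable node to its parent in the tree."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbor, cost in graph.get(node, []):
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return parent


def wan_crossings(parent):
    """Count tree edges whose endpoints sit in different datacenters."""
    return sum(
        1
        for child, par in parent.items()
        if par is not None and datacenter(child) != datacenter(par)
    )


if __name__ == "__main__":
    full_mesh = build_graph({**SOLID_LINKS, **OPTIONAL_LINKS})
    for source in sorted(full_mesh):
        tree = lowest_cost_tree(full_mesh, source)
        print(source, "WAN links in tree:", wan_crossings(tree))
```

Under these assumed costs, each distribution tree uses a single WAN link, and the remaining event brokers are reached over the LAN. Calling `build_graph` with `failed={"Region1-PDC-Backup", "Region1-BDC-Primary"}` also shows the double-failure case described earlier: the two surviving event brokers stay connected only when the optional links are included.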

An administrator may also choose to only deploy a single event broker in the backup datacenter, as shown in the figure below.

Configuration Summary

The event brokers in each datacenter must be configured as non-revertive Redundant pairs.
The event brokers in the primary and backup datacenter should be configured to be Replication mates.
Config-Sync should be enabled on the event brokers, and Replication should be enabled on the Message VPN used by the Distributed Cache to ensure that PubSub+ Cache configuration and client configuration are kept in sync between all the event brokers in the two datacenters. For information about configuring Replication and Config-Sync between event brokers, see Data Center Replication for Disaster Recovery.
Replicated topics are intended only for Guaranteed messaging. Therefore, be sure not to configure any of the topics used by Direct messaging and PubSub+ Cache as Replicated topics in a deployment that uses Multiple‑Node Routing. Using a cached topic as a Replicated topic as well causes duplicate message delivery in the network.
The Replication allow-clients-when-standby property must be enabled in the client-profiles associated with the client usernames used by the PubSub+ Cache Instances. When enabled, this property allows the PubSub+ Cache Instances to send and receive data even when the Message VPNs that the PubSub+ Cache Instances connect to are configured to be Replication standby. For information, see Allowing Client Connects to Replication Standby VPNs.
The Replication state on the Message VPN used by the Distributed Cache should be set to active on the event brokers in the primary datacenter and to standby on the event brokers in the backup datacenter.
Multiple-Node Routing neighbor connections must be configured between the event brokers in the primary and backup datacenters. To ensure that messages only cross the inter-datacenter WAN link once, link costs should be configured as shown in both of the figures above.
Export-subscriptions must be enabled on the Message VPN hosting the Distributed Cache.
The eight PubSub+ Cache Instances and the Cache Cluster must be configured to be part of a single Distributed Cache.
Two PubSub+ Cache Instances per Cache Cluster should be set up to connect to each event broker.
The PubSub+ Cache Instances should be configured to use stop-on-lost-message behavior (see Configuring Stop On Lost Message Behavior).
Publishing and consuming applications can be deployed in either an active‑active or active‑standby configuration within a datacenter. However, for ease of management and debugging, Solace recommends the active‑standby deployment model for applications.
Publishing and consuming applications may optionally connect to the event brokers in the backup datacenter if the client profiles they use have the allow-clients-when-standby property enabled. However, for ease of management and debugging, client applications should connect only to the primary datacenter and fail over to the backup datacenter (using host lists in the application) when the primary datacenter is offline. For information on enabling clients to connect to event brokers in a backup datacenter, see Allowing Clients to Connect to Standby Sites.
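Several of the requirements above can be expressed as a simple pre-deployment checklist. The following is a minimal sketch that checks a hand-written description of the four-broker deployment against some of those requirements. The dictionary layout, field names, topic strings, and the assignment of Cache Instance numbers to individual event brokers are all hypothetical. It does not read configuration from an event broker and does not use any Solace API.

```python
# Hypothetical description of the deployment; every field name is illustrative.
DEPLOYMENT = {
    "brokers": {
        "Region1-PDC-Primary": {"replication_state": "active", "cache_instances": [1, 2]},
        "Region1-PDC-Backup": {"replication_state": "active", "cache_instances": [3, 4]},
        "Region1-BDC-Primary": {"replication_state": "standby", "cache_instances": [5, 6]},
        "Region1-BDC-Backup": {"replication_state": "standby", "cache_instances": [7, 8]},
    },
    "export_subscriptions": True,
    "cache_profile_allow_clients_when_standby": True,
    "stop_on_lost_message": True,
    "cached_topics": {"prices/>"},      # hypothetical topic subscriptions
    "replicated_topics": {"orders/>"},  # hypothetical topic subscriptions
}


def check_deployment(deployment):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name, broker in deployment["brokers"].items():
        # Primary datacenter brokers should be active, backup datacenter standby.
        expected = "active" if "-PDC-" in name else "standby"
        if broker["replication_state"] != expected:
            problems.append(f"{name}: Replication state should be {expected}")
        # Assumes a single Cache Cluster, so two Cache Instances per broker.
        if len(broker["cache_instances"]) != 2:
            problems.append(f"{name}: expected 2 PubSub+ Cache Instances")
    total = sum(len(b["cache_instances"]) for b in deployment["brokers"].values())
    if total != 8:
        problems.append(f"expected 8 PubSub+ Cache Instances, found {total}")
    if not deployment["export_subscriptions"]:
        problems.append("Export-subscriptions must be enabled on the Message VPN")
    if not deployment["cache_profile_allow_clients_when_standby"]:
        problems.append("allow-clients-when-standby must be enabled for Cache Instances")
    if not deployment["stop_on_lost_message"]:
        problems.append("stop-on-lost-message behavior should be configured")
    # Exact-string comparison only; wildcard overlap is not evaluated here.
    overlap = deployment["cached_topics"] & deployment["replicated_topics"]
    if overlap:
        problems.append(f"cached topics also configured as Replicated: {sorted(overlap)}")
    return problems


if __name__ == "__main__":
    for problem in check_deployment(DEPLOYMENT) or ["no problems found"]:
        print(problem)
```

The topic check compares strings exactly, so two different wildcard subscriptions that match the same topics would not be flagged. A real check would need to account for wildcard matching.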

After Redundancy Failovers Within Datacenters

If the primary event broker (Region1-PDC-Primary) fails or is taken offline, the backup event broker in the datacenter (Region1-PDC-Backup) takes activity. Publishers and subscribers reconnect to the backup event broker and then have access to PubSub+ Cache and all cached data that was published prior to the failover through the PubSub+ Cache Instances connected to the backup event broker.

Any published data that was in the process of being sent over the Multiple‑Node Routing link at the time of the failure may have been lost and would therefore be unavailable in the PubSub+ Cache Instances on the backup event broker.

The operational procedures that should be followed after a redundancy failover are described in PubSub+ Cache and Event Broker Redundancy.

After Primary Datacenter Failures

To bring the backup datacenter online if the primary datacenter fails, set the Replication state to active on the Message VPNs of the primary event broker in the backup datacenter (Region1-BDC-Primary). This allows the publishers and consumers to connect to the event brokers in the backup datacenter.

Once the failed datacenter comes back online, the administrator should perform the following steps as soon as possible on the Message VPN hosting the Distributed Cache:

Set the Replication state to standby on the primary event broker in the primary datacenter (Region1-PDC-Primary).
Restart the stopped PubSub+ Cache Instances in the primary datacenter (Cache Instance 1 through 4) following the procedures described earlier in this section, causing them to resynchronize with the “up” PubSub+ Cache Instances (Cache Instance 5 through 8).

Later, during a maintenance window, the administrator should perform the following steps on the Message VPN hosting the Distributed Cache:

Set the Replication state to standby on the primary event broker in the backup datacenter (Region1-BDC-Primary). Performing this step causes a service interruption to publishers and consumers because the applications are disconnected from the backup datacenter.
Set the Replication state to active on the primary event broker in the primary datacenter (Region1-PDC-Primary). Service is restored to publishers and consumers following this step because the applications reconnect to event brokers in the primary datacenter.
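The order of the Replication state changes in the failover and failback procedure can be traced with a short model. The following is a minimal sketch that only tracks the configured Replication state of the two primary event brokers. The function names and phase labels are hypothetical, and the Cache Instance restart is left out because it does not change a Replication state. It issues no commands to an event broker.

```python
def primary_datacenter_failure_steps():
    """Ordered (phase, event broker, new Replication state) changes from the text."""
    return [
        ("failover", "Region1-BDC-Primary", "active"),      # bring backup datacenter online
        ("recovery", "Region1-PDC-Primary", "standby"),     # as soon as the failed site returns
        ("maintenance", "Region1-BDC-Primary", "standby"),  # service interruption starts
        ("maintenance", "Region1-PDC-Primary", "active"),   # service restored
    ]


def replay(steps, initial):
    """Apply each step and record which event brokers are configured active afterwards."""
    state = dict(initial)
    history = []
    for phase, broker, new_state in steps:
        state[broker] = new_state
        active = sorted(b for b, s in state.items() if s == "active")
        history.append((phase, broker, new_state, active))
    return history


if __name__ == "__main__":
    # Configured state before the failure: primary site active, backup site standby.
    normal_operation = {
        "Region1-PDC-Primary": "active",
        "Region1-BDC-Primary": "standby",
    }
    for phase, broker, new_state, active in replay(
        primary_datacenter_failure_steps(), normal_operation
    ):
        print(f"{phase:12} {broker} -> {new_state:8} active now: {active}")
```

In this model, both event brokers are configured as active after the first step, which is presumably why the second step should be done as soon as the failed datacenter is back online. Neither is active between the third and fourth steps, which corresponds to the service interruption described above.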
