PubSub+ Cache and Data Center Redundancy

To protect against the failure of an entire data center, an administrator may choose to deploy additional event brokers and PubSub+ Cache Instances in an in‑region backup data center, which connects to the primary data center over the WAN. This deployment model (as shown below) builds on the previously discussed redundant router deployment model.

Figure: PubSub+ Cache and Data Center Redundancy

The dashed Multiple-Node Routing links shown above are optional and are not needed in most deployments. They guard against network bifurcation in the unlikely event of a double router failure. For example, without those links, Region1-PDC-Primary and Region1-BDC-Backup could not communicate with each other after a simultaneous loss of both Region1-PDC-Backup and Region1-BDC-Primary. Full-mesh connectivity is not strictly needed because:

  1. Alternate routing paths are only in place to route around failures of routers. Network connectivity issues are expected to be “routed around” by the underlying IP network.
  2. The cost of hopping across an additional event broker to reach another node is minimal, especially over the LAN inside a data center. There is no wasted traffic reaching a router simply to transit to another router because the published messages already must be sent to all four routers in the Cache Cluster to keep all PubSub+ Cache Instances up to date.

The link costs assigned to the Multiple-Node Routing neighbor links have been carefully chosen to ensure that a message will only traverse the WAN between data centers once and then be distributed locally between routers in the other data center over the LAN in that data center.
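The effect of the link costs can be checked with a small shortest-path model. The following Python sketch is illustrative only. The topology (the four solid links) and the cost values LAN_COST, WAN_PREFERRED, and WAN_ALTERNATE are assumptions made for this example and are not the values from the figure. It also assumes that messages are distributed along a lowest-cost path tree from the router where they were published, and it counts how many WAN links that tree uses.

```python
import heapq

LAN_COST = 1         # hypothetical cost for a link inside a data center
WAN_PREFERRED = 100  # hypothetical cost for the preferred WAN link
WAN_ALTERNATE = 200  # hypothetical cost for the second WAN link

# (router_a, router_b, cost, crosses_wan)
LINKS = [
    ("Region1-PDC-Primary", "Region1-PDC-Backup", LAN_COST, False),
    ("Region1-BDC-Primary", "Region1-BDC-Backup", LAN_COST, False),
    ("Region1-PDC-Primary", "Region1-BDC-Primary", WAN_PREFERRED, True),
    ("Region1-PDC-Backup", "Region1-BDC-Backup", WAN_ALTERNATE, True),
]


def build_graph(links):
    """Turn the link list into an undirected adjacency map."""
    graph = {}
    for a, b, cost, wan in links:
        graph.setdefault(a, []).append((b, cost, wan))
        graph.setdefault(b, []).append((a, cost, wan))
    return graph


def wan_crossings(links, source):
    """Count the WAN links in the lowest-cost path tree rooted at source."""
    graph = build_graph(links)
    best = {source: 0}
    reached_over_wan = {}  # router -> True if its tree edge is a WAN link
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best[node]:
            continue  # stale heap entry
        for nxt, cost, wan in graph[node]:
            candidate = dist + cost
            if nxt not in best or candidate < best[nxt]:
                best[nxt] = candidate
                reached_over_wan[nxt] = wan
                heapq.heappush(heap, (candidate, nxt))
    return sum(reached_over_wan.values())


for router in sorted(build_graph(LINKS)):
    print(router, "WAN crossings:", wan_crossings(LINKS, router))
```

With these assumed costs, the tree rooted at each of the four routers uses a single WAN link. If both WAN links are given the same cost, a tie can place both of them on the tree, which is the situation that unequal costs avoid.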

An administrator may also choose to only deploy a single router in the backup data center, as shown in the figure below.

Figure: PubSub+ Cache and Data Center Redundancy, Non-Redundant Backup Site

Configuration Summary

  • The routers in each data center must be configured as non-revertive Redundant pairs.
  • The routers in the primary and backup data center should be configured to be Replication mates.
  • Config-Sync should be enabled on the routers, and Replication should be enabled on the Message VPN used by the Distributed Cache. This ensures that the PubSub+ Cache configuration and the client configuration are kept in sync between all the routers in the two data centers. For information on configuring Replication and Config-Sync between routers, see Data Center Replication for Disaster Recovery.
  • Replicated topics are intended only for Guaranteed Messaging. Therefore, in a deployment that uses Multiple‑Node Routing, do not configure any of the topics used by Direct messaging and PubSub+ Cache as Replicated topics. Using a cached topic as a Replicated topic as well causes duplicate message delivery in the network.
  • The Replication allow-clients-when-standby property must be enabled in the client-profiles associated with the client usernames used by the PubSub+ Cache Instances. When enabled, this property allows the PubSub+ Cache Instances to send and receive data even when the Message VPNs that the PubSub+ Cache Instances connect to are configured to be Replication standby. For information, see Allowing Client Connects to Replication Standby VPNs.
  • The Replication state on the Message VPN used by the Distributed Cache should be set to active on the routers in the primary data center and to standby on the routers in the backup data center.
  • Multiple-Node Routing neighbor connections must be configured between the routers in the primary and backup data centers. To ensure that messages only cross the inter-data-center WAN link once, link costs should be configured as shown in both of the figures above.
  • Export-subscriptions must be enabled on the Message VPN hosting the Distributed Cache.
  • The eight PubSub+ Cache Instances and the Cache Cluster must be configured to be part of a single Distributed Cache.
  • Two PubSub+ Cache Instances per Cache Cluster should be set up to connect to each router.
  • The PubSub+ Cache Instances should be configured to use stop-on-lost-message behavior (see Configuring Stop On Lost Message Behavior).
  • Publishing and consuming applications can be deployed in either an active‑active or active‑standby configuration within a data center. However, for ease of management and debugging, Solace recommends the active‑standby deployment model for applications.
  • Publishing and consuming applications may optionally connect to the routers in the backup data center if the client profiles they use have the allow-clients-when-standby property enabled. However, it is suggested, for ease of management and debugging, that client applications only connect to the primary data center, and fail over to the backup data center (using host lists in the application) when the primary data center is offline. For information on enabling clients to connect to routers in a backup data center, see Allowing Clients to Connect to Standby Sites.
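Several of the settings above can be collected into a checklist and verified mechanically. The following Python sketch shows one way to do that for the four-router deployment. RouterConfig, check_deployment, and all field names are hypothetical names invented for this example. They are not part of any Solace API, and the values would have to be gathered from the routers by other means.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    """Hypothetical snapshot of the settings listed in the summary."""
    name: str
    data_center: str                  # "primary" or "backup"
    replication_state: str            # "active" or "standby"
    config_sync_enabled: bool
    export_subscriptions: bool
    allow_clients_when_standby: bool  # on the cache instances' client profile
    cache_instances: int              # instances of the Cache Cluster connected here


def check_deployment(routers):
    """Return a list of human-readable problems (empty when all checks pass)."""
    problems = []
    for r in routers:
        expected = "active" if r.data_center == "primary" else "standby"
        if r.replication_state != expected:
            problems.append(f"{r.name}: Replication state should be {expected}")
        if not r.config_sync_enabled:
            problems.append(f"{r.name}: Config-Sync is not enabled")
        if not r.export_subscriptions:
            problems.append(f"{r.name}: export-subscriptions is not enabled")
        if not r.allow_clients_when_standby:
            problems.append(f"{r.name}: allow-clients-when-standby is not enabled")
        if r.cache_instances != 2:
            problems.append(f"{r.name}: expected 2 cache instances, found {r.cache_instances}")
    return problems


DEPLOYMENT = [
    RouterConfig("Region1-PDC-Primary", "primary", "active", True, True, True, 2),
    RouterConfig("Region1-PDC-Backup", "primary", "active", True, True, True, 2),
    RouterConfig("Region1-BDC-Primary", "backup", "standby", True, True, True, 2),
    RouterConfig("Region1-BDC-Backup", "backup", "standby", True, True, True, 2),
]

for problem in check_deployment(DEPLOYMENT) or ["no problems found"]:
    print(problem)
```

The check covers only the settings that can be expressed as simple values. Link costs, redundancy pairing, and Replication mate configuration would need separate checks.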

After Redundancy Failovers Within Data Centers

If the primary router (Region1-PDC-Primary) fails or is taken offline, the backup router in the data center (Region1-PDC-Backup) will take activity. Publishers and subscribers will reconnect to that backup router and then have access to PubSub+ Cache and all cached data that was published prior to the failover through the PubSub+ Cache Instances connected to the backup router.

Any published data that was in the process of being sent over the Multiple‑Node Routing link at the time of the failure may have been lost, and is therefore unavailable in the PubSub+ Cache Instances on the backup router.

The operational procedures that should be followed after a redundancy failover are described in PubSub+ Cache and Event Broker Redundancy.

After Primary Data Center Failures

To bring the backup data center online if the primary data center fails, set the Replication state to active on the Message VPNs of the primary router in the backup data center (Region1-BDC-Primary). This allows the publishers and consumers to connect to the routers in the backup data center.
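Client applications that are configured with a host list can then move to the backup data center without a configuration change. The Python sketch below illustrates the idea only and does not use a Solace messaging API. The host names, the port number, and the connect_first_available helper are hypothetical, and the connect argument stands in for whatever connection call the application's messaging API provides.

```python
HOST_LIST = [
    # Hypothetical addresses: primary data center first, backup second.
    "tcp://region1-pdc-primary.example.com:55555",
    "tcp://region1-pdc-backup.example.com:55555",
    "tcp://region1-bdc-primary.example.com:55555",
    "tcp://region1-bdc-backup.example.com:55555",
]


def connect_first_available(host_list, connect):
    """Try each host in order and return (host, session) for the first success."""
    errors = []
    for host in host_list:
        try:
            return host, connect(host)
        except ConnectionError as exc:
            errors.append(f"{host}: {exc}")
    raise ConnectionError("no host reachable: " + "; ".join(errors))


def simulated_connect(host):
    """Stand-in for a real connection call: the primary data center is down."""
    if "-pdc-" in host:
        raise ConnectionError("data center offline")
    return f"session to {host}"


host, session = connect_first_available(HOST_LIST, simulated_connect)
print("connected to", host)
```

Listing the primary data center hosts first keeps applications on the primary site whenever it is reachable, which matches the recommendation to connect to the backup data center only when the primary data center is offline.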

Once the failed data center comes back online, the administrator should perform the following steps as soon as possible on the Message VPN hosting the Distributed Cache:

  1. Set the Replication state to standby on the primary router in the primary data center (Region1-PDC-Primary).
  2. Restart the stopped PubSub+ Cache Instances in the primary data center (Cache Instance 1 through 4) following the procedures described earlier in this section, causing them to resynchronize with the “up” PubSub+ Cache Instances (Cache Instance 5 through 8).

Later, during a maintenance window, an administrator should perform the following steps on the Message VPN hosting the Distributed Cache:

  1. Set the Replication state to standby on the primary router in the backup data center (Region1-BDC-Primary).

    There will be a service interruption to publishers and consumers when this step is performed because the applications are disconnected from the backup data center.

  2. Set the Replication state to active on the primary router in the primary data center (Region1-PDC-Primary). Service will be restored to publishers and consumers following this step because the applications reconnect to routers in the primary data center.
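The maintenance-window sequence can be traced with a small model. This Python sketch is illustrative: it treats the Replication state as a plain string per router, and it assumes that ordinary clients (those whose client profiles do not enable allow-clients-when-standby) are served only where the Message VPN is active. The names serving_routers and run_failback are hypothetical.

```python
def serving_routers(states):
    """Routers whose Message VPN is Replication active (assumed to serve clients)."""
    return sorted(name for name, state in states.items() if state == "active")


def run_failback(states):
    """Apply the two maintenance-window steps and record who serves clients."""
    states = dict(states)  # work on a copy
    timeline = [("start", serving_routers(states))]
    steps = [
        ("Region1-BDC-Primary", "standby"),  # step 1: service interruption begins
        ("Region1-PDC-Primary", "active"),   # step 2: service is restored
    ]
    for number, (router, new_state) in enumerate(steps, start=1):
        states[router] = new_state
        timeline.append((f"step {number}", serving_routers(states)))
    return timeline


# State after the earlier "as soon as possible" steps have been completed.
STATE_BEFORE_MAINTENANCE = {
    "Region1-PDC-Primary": "standby",
    "Region1-BDC-Primary": "active",
}

for label, serving in run_failback(STATE_BEFORE_MAINTENANCE):
    print(label, serving or "no router serving clients")
```

In this model no router serves clients between step 1 and step 2, which corresponds to the service interruption described above. Performing the two steps back to back keeps that window short.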