PubSub+ Cache and Event Broker Redundancy

PubSub+ Cache is designed to provide redundancy through instantiating multiple cache instances for each cluster. Any number of cache instances may be provisioned across multiple servers and locations to achieve the desired protection from a single network event affecting the entire cluster. Solace recommends that for each cluster, you connect at least one cache instance to each active broker in the mesh.

PubSub+ Cache is a Direct messaging service. Unless you need Guaranteed messaging for other applications that are using the same brokers that PubSub+ Cache connects to, there is no advantage to deploying the brokers as high-availability (HA) pairs. But if you need to support Guaranteed messaging alongside your Direct messaging applications using PubSub+ Cache, then you will want to deploy your brokers in HA pairs. In that case, you will need to choose between the active/standby and active/active redundancy models.

Active/active redundancy applies only to Direct messaging in the broker. Guaranteed messaging is always active/standby regardless of the redundancy model chosen.

HA pairing of brokers ensures that no Guaranteed messages are lost in the event of a broker failure. Direct messaging and PubSub+ Cache cannot offer the same delivery guarantees, even in an HA pair. So while HA pairing of brokers will ensure rapid restoration of service to publishers and consumers following a broker failure, any messages published to a broker at the moment of its failure may not be delivered to some or all of the other brokers and cache instances in the mesh. Furthermore, depending on the redundancy model chosen, additional messages published to the mesh while the failover is in progress may not be delivered to one or more cache instances.

Cache instances must communicate with a cache manager whenever they establish a connection to a broker. Following this initial communication, cache instances are designed to tolerate loss of communication with the cache manager and are capable of continuing service to an event mesh of brokers as long as they remain connected. So, assuming at least one cache instance per broker, a loss of cache manager connectivity does not result in an interruption or loss of caching services in the mesh. In all redundancy models except active/standby, administrative action may be required to restore cache manager services following a failover, since only one active broker in the mesh can be the designated cache manager.

The active/standby model is the preferred (and only) solution if you are managing a distributed cache on an HA pair of brokers and you are either using software event brokers or using Dynamic Message Routing (DMR) to interconnect the brokers into an event mesh. This approach has an administrative advantage: if the failing broker was providing cache manager services, the cache manager on the backup broker can accept registration from cache instances immediately following a failover. Note that following a failover, there is a brief period before the newly-active broker has connectivity to at least one cache instance in the mesh. During that period, messages published to the newly-active broker will not be cached anywhere in the mesh, and cache requests from clients connecting to that broker will not receive cache responses. Applications must therefore be able to tolerate possible loss of cached messages and requests during and after a service-affecting event on the active broker.

Active/Standby Model

The active/active redundancy model is the preferred solution when using the PubSub+ appliance, but is not supported in the software broker, and is not supported when using DMR to interconnect the brokers into an event mesh. While this approach is still subject to potential message loss at the moment of broker failure (just like any other redundancy model involving Direct messaging), it does have the advantage of having a subset of the cache instances maintaining a connection to a broker at all times. The HA mate broker and its associated cache instance(s) will already be online and able to immediately provide caching services to the clients that were previously connected to the failed broker.

This approach may require administrative intervention to restore the cache management function if the failed broker was providing the cache management function. Any cache instances needing to restore a connection and registration to the cache cluster will be unable to do so until the failed broker comes back online and takes activity, or until cache management is administratively enabled on the newly-active broker. However, caching services should still be available, since instances that maintained connection to their brokers will provide service while the cache management service is being restored.

Active/Active Redundancy Model

Configuration Summary

Active/Standby Redundancy Model (Software Brokers or Appliances)

  • Appliances should be configured as non-revertive redundant pairs.
  • Config-Sync is recommended to ensure PubSub+ Cache and client configurations are consistent on both the primary and backup brokers.
  • Within each message VPN, the cache management function should be enabled on only one broker within the mesh. This becomes a strict requirement when there is Multi-Node Routing (MNR) connectivity between two (or more) brokers that could simultaneously have the cache management function enabled.

    The only configuration where the cache management function is allowed to be enabled simultaneously on multiple brokers, for the same VPN, within the same mesh, is for an active/standby HA pair of brokers that have no MNR connectivity between them. In this situation, and this situation only, the cache management function should be enabled on both brokers in the HA pair.

  • Both cache instances shown must be configured as part of the same cache cluster and must belong to a single distributed cache.
  • When connecting to the software broker, each cache instance's SESSION_HOST configuration property must specify the IP address or hostname of both the active and the backup broker (comma delimited).

In the event of a failover, no administrative action is required if stop-on-lost-message is disabled on all cache instances. If it is enabled on any cache instance, an administrator will need to restart each of the affected cache instances.
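
The cache management placement rule from the list above can be sketched as a small check. This is only an illustration of the rule as stated in this section, not a Solace tool: the `Broker` record, its field names, and the `cache_mgmt_placement_ok` function are hypothetical, and the check considers a single Message VPN within one mesh.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Broker:
    """Hypothetical description of one broker, for a single Message VPN."""
    name: str
    cache_mgmt_enabled: bool
    ha_mate: Optional[str] = None          # name of the HA mate, if any
    mnr_neighbors: Tuple[str, ...] = ()    # names of MNR neighbors


def cache_mgmt_placement_ok(brokers: List[Broker], active_standby: bool) -> bool:
    """Return True if cache management placement follows the rule above."""
    enabled = [b for b in brokers if b.cache_mgmt_enabled]

    # Normal case: exactly one broker in the mesh has the function enabled.
    if len(enabled) == 1:
        return True

    # Only exception: both brokers of an active/standby HA pair,
    # with no MNR connectivity between them.
    if len(enabled) == 2 and active_standby:
        a, b = enabled
        are_mates = a.ha_mate == b.name and b.ha_mate == a.name
        mnr_linked = b.name in a.mnr_neighbors or a.name in b.mnr_neighbors
        return are_mates and not mnr_linked

    # Zero brokers, or any other combination, is treated as invalid here.
    return False


primary = Broker("primary", True, ha_mate="backup")
backup = Broker("backup", True, ha_mate="primary")
print(cache_mgmt_placement_ok([primary, backup], active_standby=True))
```

The sketch treats "no broker has cache management enabled" as invalid, because cache instances would then have no cache manager to register with.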

Active/Active Redundancy Model (Appliances with MNR or VPN-Bridged Event Mesh)

  • The brokers should be configured as a non-revertive redundant pair.
  • Config-Sync is recommended to ensure PubSub+ Cache and client configurations are consistent on both brokers.
  • An MNR neighbor connection must be configured between the brokers in the redundant pair. Because this is an in-data-center LAN connection, it should be assigned a very low cost (Solace recommends assigning a link cost of 1).
  • Export subscriptions must be enabled on the Message VPN hosting the Distributed Cache.
  • On only one of the brokers, distributed-cache-management must be enabled on the Message VPN hosting the distributed cache. Solace recommends enabling the distributed-cache-management service on the primary broker (Region1-PDC-Primary) and disabling distributed-cache-management on the backup broker (Region-1-PDC-Backup).
  • Both cache instances shown must be configured as part of the same cache cluster and must belong to a single distributed cache.
  • Each cache instance's SESSION_HOST configuration property must specify the IP address of both brokers in the pair (comma delimited), with each instance specifying a different broker in the pair as its first choice.
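
The SESSION_HOST ordering described in the last item can be illustrated with a short sketch. It only builds the comma-delimited values; the helper name `session_host_values` and the IP addresses are hypothetical placeholders, and the exact syntax of the cache instance configuration file is not shown here.

```python
from typing import List


def session_host_values(broker_hosts: List[str]) -> List[str]:
    """Build one comma-delimited SESSION_HOST value per cache instance.

    The host list is rotated for each instance so that every instance
    names a different broker as its first choice.
    """
    return [
        ",".join(broker_hosts[i:] + broker_hosts[:i])
        for i in range(len(broker_hosts))
    ]


# Hypothetical addresses for the two brokers in the redundant pair.
hosts = ["192.0.2.10", "192.0.2.11"]

for n, value in enumerate(session_host_values(hosts), start=1):
    print(f"cache instance {n} SESSION_HOST value: {value}")
```

With two brokers, the first instance prefers the first broker and the second instance prefers its mate, so each broker has a cache instance connected to it during normal operation.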

What to Do After Redundancy Failovers

If one broker in the HA pair fails or is taken offline, the remaining broker will take activity and provide service for the affected publishers and subscribers. Those clients will reconnect and have access to PubSub+ Cache and all cached data that was published prior to the failover through the PubSub+ Cache instances that retained connectivity with the event mesh. Any cache instances connected directly to the failed or offline broker will need to re-establish connection to the event mesh.

Any published data that was in the process of delivery at the time of the failure may be lost, and may not be available in the PubSub+ Cache instances on the backup broker.

The PubSub+ Cache instances that lost connectivity due to the failure will connect to the alternative broker and immediately resume service without administrator intervention if they are configured with stop-on-lost-message disabled. This will result in each of those cache instances reporting a Lost Message state. For more information, including instructions on how to clear this state, refer to Lost Message State.

PubSub+ Cache instances that lost connectivity and are configured with stop-on-lost-message enabled will require intervention to restore cache function. Those instances will need to be restarted by following the procedure outlined in the previous section.

Depending on the deployment model chosen, administrative action may be required to restore distributed-cache-management for the cache. Specifically, when you are using active/active redundancy and the failed broker was configured with distributed-cache-management enabled, you have two options:

  • If the duration of the outage is expected to be short, you may choose to wait for the distributed cache manager function to resume with the recovery of the failed broker. In the interim, any cache instances currently in the Up state will continue providing caching service. However, new cache instances will be unable to register with the distributed cache during this period.
  • Alternatively, the backup broker can take over the cache management function if you enable distributed-cache-management on the surviving broker. Note that distributed-cache-management for a Message VPN should only ever be enabled on one broker. As a result, you should disable it on the surviving broker once the failed broker recovers and resumes serving as the cache manager.
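
The post-failover guidance in this section can be summarized as a small decision sketch. It is a reading aid under stated assumptions, not an operational tool: the function name, its parameters, and the wording of the returned actions are hypothetical and simply restate the cases described above.

```python
from typing import List


def post_failover_actions(
    instance_lost_connection: bool,
    stop_on_lost_message: bool,
    active_active: bool,
    failed_broker_was_cache_manager: bool,
    outage_expected_short: bool,
) -> List[str]:
    """List the administrative follow-ups described in this section."""
    actions: List[str] = []

    # Cache instances that stayed connected need no attention.
    if instance_lost_connection:
        if stop_on_lost_message:
            actions.append("restart the cache instance")
        else:
            # The instance reconnects on its own but reports Lost Message.
            actions.append("review and clear the Lost Message state")

    # Cache management only needs manual recovery in active/active.
    if active_active and failed_broker_was_cache_manager:
        if outage_expected_short:
            actions.append("wait for the failed broker to recover")
        else:
            actions.append(
                "enable distributed-cache-management on the surviving broker"
            )
            actions.append(
                "disable it on the surviving broker once the failed broker recovers"
            )

    return actions


for step in post_failover_actions(True, True, True, True, False):
    print("-", step)
```

Whether an outage counts as "short" is a judgment call for the administrator; the sketch takes it as an input rather than trying to decide it.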