PubSub+ Cache and Event Broker Redundancy

PubSub+ Cache is designed to provide redundancy through instantiating multiple cache instances for each cluster. Any number of cache instances may be provisioned across multiple servers and locations to achieve the desired protection from a single network event affecting the entire cluster. Solace recommends that for each cluster, you connect at least one cache instance to each active broker in the mesh.
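The recommendation above can be checked mechanically. The following is a minimal sketch, assuming you maintain your own inventory of which cache instance connects to which broker. The function name and data shapes are hypothetical and are not part of any PubSub+ API.

```python
from collections import defaultdict


def brokers_without_cache(active_brokers, instances):
    """Return, per cluster, the active brokers that have no cache instance.

    active_brokers: iterable of broker names (hypothetical inventory).
    instances: iterable of (cluster, instance_name, broker) tuples.
    """
    # Collect the set of brokers that each cluster already covers.
    covered = defaultdict(set)
    for cluster, _instance_name, broker in instances:
        covered[cluster].add(broker)

    # Report only the clusters that miss at least one active broker.
    gaps = {}
    for cluster, brokers in covered.items():
        missing = sorted(set(active_brokers) - brokers)
        if missing:
            gaps[cluster] = missing
    return gaps
```

An empty result means every cluster in the inventory has at least one cache instance on each active broker.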

PubSub+ Cache is a Direct messaging service. Unless you need Guaranteed messaging for other applications that are using the same brokers that PubSub+ Cache connects to, there is no advantage to deploying the brokers as high-availability (HA) pairs. But if you need to support Guaranteed messaging alongside your Direct messaging applications using PubSub+ Cache, then you will want to deploy your brokers in HA pairs. In that case, you will need to choose between the active/standby and active/active redundancy models.

Active/active redundancy applies only to Direct messaging in the broker. Guaranteed messaging is always active/standby regardless of the redundancy model chosen.

HA pairing of brokers ensures that there is no loss of Guaranteed messages in the event of a broker failure. Direct messaging and PubSub+ Cache cannot offer the same delivery guarantees, even in an HA pair. So while HA pairing of brokers will ensure rapid restoration of service to publishers and consumers following a broker failure, any messages published to a broker at the moment of its failure may not be delivered to some or all of the other brokers and cache instances in an event mesh. Furthermore, depending upon the redundancy model chosen, there may be additional messages published to the mesh that may not be delivered to one or more cache instances in the mesh while the failover is happening.

Cache instances must communicate with a cache manager whenever they establish a connection to a broker. Following this initial communication, cache instances are designed to tolerate loss of communication with the cache manager and are capable of continuing service to an event mesh of brokers as long as they remain connected. So, assuming at least one cache instance per broker, a loss of cache manager connectivity does not result in an interruption or loss of caching services in the mesh.

Generally, only one broker must be designated as a cache manager. However, for redundancy, you can designate an HA pair: cache management is redundancy aware and follows activity within the pair. This applies to all redundancy models.

The active/standby model is the preferred (and only) solution if you are managing a distributed cache on an HA pair of software event brokers, or if you are using Dynamic Message Routing (DMR) to interconnect the brokers into an event mesh.

This approach provides administrative advantages in that if the failing broker was providing cache manager services, the cache manager on the backup broker will be able to accept registration from cache instances immediately following a failover.

Following a failover, there will be a brief period of time before the newly active broker has connectivity to at least one cache instance in the event mesh. During that period, messages published to the newly active broker will not be cached anywhere in the mesh, and cache requests from clients connecting to that broker will not receive cache responses. This means applications must be able to tolerate possible loss of cached messages and requests during and after a service-affecting event on the active broker.

Cache instances are able to immediately re-register with the active cache manager without administrative intervention because the cache manager's availability follows the active broker in an HA pair for appliances (presuming stop-on-lost-message is disabled; see Configuring Stop On Lost Message Behavior).

For software event brokers, cache instances can also immediately re-register with the cache manager of the active broker (without administrative intervention) if the host list (the SESSION_HOST parameter) of each cache instance contains the IP address or hostname of both the primary and backup brokers in the high-availability (HA) redundancy group. Re-registration occurs provided that the configuration described in the Active/Standby Redundancy Model section of Configuration Summary is adhered to.

Active/Standby Model

The active/active redundancy model is the preferred solution when using the PubSub+ appliance, but is not supported in the software broker, and is not supported when using Dynamic Message Routing (DMR) to interconnect the brokers into an event mesh. While this approach is still subject to potential message loss at the moment of broker failure (just like any other redundancy model involving Direct messaging), it does have the advantage of having a subset of the cache instances maintaining a connection to a broker at all times. The HA mate broker and its associated cache instance(s) will already be online and able to immediately provide caching services to the clients that were previously connected to the failed broker.

Cache instances are able to immediately re-register with the active cache manager (without administrative intervention) because the cache manager's availability follows the active broker in an HA pair (presuming stop-on-lost-message is disabled and the required configuration is adhered to). See Configuring Stop On Lost Message Behavior and the Active/Active Redundancy Model configuration in Configuration Summary.
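The constraints on the two redundancy models can be summarized as a small lookup. This is a sketch only, assuming the hypothetical labels "appliance" and "software" for the broker type and "dmr", "mnr", and "vpn-bridge" for the mesh interconnect. None of these names come from a PubSub+ interface.

```python
def supported_redundancy_models(broker_type, mesh_link):
    """List the usable redundancy models, preferred model first.

    broker_type: "appliance" or "software" (hypothetical labels).
    mesh_link: "dmr", "mnr", or "vpn-bridge" (hypothetical labels).
    """
    if broker_type not in ("appliance", "software"):
        raise ValueError(f"unknown broker type: {broker_type!r}")
    if mesh_link not in ("dmr", "mnr", "vpn-bridge"):
        raise ValueError(f"unknown mesh link: {mesh_link!r}")

    # Active/active is not supported on software brokers or with DMR,
    # so active/standby is the only option in those cases.
    if broker_type == "software" or mesh_link == "dmr":
        return ["active/standby"]

    # Appliances meshed with MNR or VPN bridges can use either model;
    # active/active is the preferred one.
    return ["active/active", "active/standby"]
```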

Active/Active Redundancy Model

Configuration Summary

Active/Standby Redundancy Model (Software Brokers or Appliances)

  • Appliances should be configured as non-revertive redundant pairs.
  • Config-Sync is recommended to ensure PubSub+ Cache and client configurations are consistent on both the primary and backup brokers.
  • Both cache instances shown must be configured as part of the same cache cluster and belong to a single distributed cache.
  • When connecting to a software broker, the SESSION_HOST configuration property of each cache instance must specify the IP address or hostname of both the active and backup brokers (comma delimited).

In the event of a failover, no administrative action is required if stop-on-lost-message is disabled on all cache instances. If that is not the case, an administrator will need to restart each of the affected cache instances.
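As an illustration of the comma-delimited SESSION_HOST value described in the list above, the following minimal sketch assembles it from a primary and a backup host. The helper name and the example addresses are hypothetical, and how the value is written into the cache instance configuration is not shown here.

```python
def session_host_value(primary, backup):
    """Build a comma-delimited host list from two broker addresses.

    primary, backup: IP addresses or hostnames (hypothetical inputs).
    """
    hosts = [primary.strip(), backup.strip()]
    if not all(hosts):
        raise ValueError("both primary and backup hosts are required")
    if hosts[0] == hosts[1]:
        raise ValueError("primary and backup must be different hosts")
    # The cache instance needs both brokers in its host list so that it
    # can reach whichever one is active after a failover.
    return ",".join(hosts)


# Example with documentation-range addresses (assumed, not real brokers).
value = session_host_value("192.0.2.10", "192.0.2.11")
```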

Active/Active Redundancy Model (Appliances with MNR or VPN-Bridged Event Mesh)

  • The brokers should be configured as a non-revertive redundant pair.
  • Config-Sync is recommended to ensure PubSub+ Cache and client configurations are consistent on both brokers.
  • An MNR neighbor connection must be configured between the brokers in the redundant pair. However, because this is an in-data-center LAN connection, it should be assigned a very low cost (Solace recommends assigning a link cost of 1).
  • Export subscriptions must be enabled on the Message VPN hosting the Distributed Cache.
  • Both cache instances shown must be configured as part of the same cache cluster and belong to a single distributed cache.
  • The SESSION_HOST configuration property of each cache instance must specify the IP address of both brokers in the pair (comma delimited), with each instance specifying a different broker in the pair as its first choice.
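The last item in the list above, where each instance names a different broker first, can be sketched as follows. This is an illustration under assumptions: the function name, instance names, and addresses are hypothetical, and the sketch simply alternates the host order between instances.

```python
def session_hosts_per_instance(instance_names, broker_pair):
    """Map each cache instance to a host list with an alternating first choice.

    instance_names: ordered list of cache instance names (hypothetical).
    broker_pair: tuple of the two broker IP addresses in the HA pair.
    """
    first, second = broker_pair
    if first == second:
        raise ValueError("the pair must contain two different brokers")
    orders = [(first, second), (second, first)]
    # Even-indexed instances prefer the first broker, odd-indexed ones
    # prefer its mate, so both brokers always have a connected instance.
    return {
        name: ",".join(orders[index % 2])
        for index, name in enumerate(instance_names)
    }


hosts = session_hosts_per_instance(
    ["cache1", "cache2"], ("192.0.2.10", "192.0.2.11")
)
```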

What to Do After Redundancy Failovers

If one broker in the HA pair fails or is taken offline, the remaining broker will take activity and provide service for the affected publishers and subscribers. Those clients will reconnect and have access to PubSub+ Cache and all cached data that was published prior to the failover through the PubSub+ Cache instances that retained connectivity with the event mesh. Any cache instances connected directly to the failed or offline broker will need to re-establish connection to the event mesh.

Any published data that was in the process of delivery at the time of the failure may be lost, and may not be available in the PubSub+ Cache instances on the backup broker.

The PubSub+ Cache instances that lost connectivity due to the failure will connect to the alternative broker and immediately resume service without administrator intervention if they are configured with stop-on-lost-message disabled. This will result in each of those cache instances reporting a Lost Message state. For more information, including instructions on how to clear this state, refer to Lost Message State.

PubSub+ Cache instances that lost connectivity and are configured with stop-on-lost-message enabled will require intervention to restore cache function. Those instances will need to be restarted by following the procedure outlined in the previous section.
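The follow-up needed for each cache instance after a failover can be sketched as a simple decision function. The function name and the returned strings are hypothetical and serve only to summarize the cases described above.

```python
def post_failover_action(lost_connectivity, stop_on_lost_message):
    """Return the administrative follow-up for one cache instance.

    lost_connectivity: True if the instance was cut off by the failure.
    stop_on_lost_message: True if stop-on-lost-message is enabled.
    """
    if not lost_connectivity:
        # Instances that stayed connected keep serving cached data.
        return "none"
    if stop_on_lost_message:
        # The instance does not resume on its own and must be restarted.
        return "restart instance"
    # The instance reconnects and resumes service by itself, but it
    # reports a Lost Message state that an administrator can clear.
    return "clear Lost Message state"
```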