General Considerations

When deploying PubSub+ Cache in a network, the following considerations and recommendations are common to all the use-cases discussed in this section.

Message Transport

PubSub+ Cache Instances only cache Direct messages, and this section assumes that the messages to be cached are originally published using Direct Messaging. Although Guaranteed messages published with a non‑persistent message delivery mode may be converted to Direct messages to fulfill a topic subscription and then be cached, such a messaging pattern is not considered in this section. Caching Guaranteed messages that have had their delivery mode changed to Direct has additional restrictions and requires additional configuration beyond what is discussed in this section. Solace Professional Services must be consulted when considering any use-cases where published Guaranteed messages are to be cached.

Distributed Cache Management

PubSub+ Cache is configured and managed through an event broker. In a network deployment of PubSub+ Cache, many PubSub+ Cache Instances are connected to several routers, and all these PubSub+ Cache Instances are part of the same Distributed Cache. In such a distributed model, it is important that only one of the routers is responsible for managing the Distributed Cache and all of its Cache Clusters and PubSub+ Cache Instances. That is, there is only one Designated Router, whose internal client is known as the "Cache Manager".

In PubSub+ software event brokers and appliances, distributed cache management is tied to the HA (Redundancy) and/or the DR (Replication) models. As a result, explicitly controlling cache management is not necessary. In an HA pair or a data center, the active event broker automatically takes the role of the Cache Manager, which makes the Cache Manager redundancy aware and guarantees continuity of PubSub+ Cache service in failover scenarios (that is, when the event broker currently acting as Cache Manager goes offline or stops providing service). This applies to all supported PubSub+ Cache redundancy models (active/standby and active/active).

Active/active Redundancy Model

In an active/active redundancy model, the event broker designates a virtual router for cache management that might be different from the one you prefer. If you wish to take full control over this choice, you can use the optional parameter [auto | primary | backup]:

solace(configure)# create distributed-cache <name> message-vpn <vpn-name> [auto | primary | backup]

Where:

auto associates the Distributed Cache automatically, at the time of creation, with either the primary or the backup virtual router. It is the default and recommended value, and it works for all use cases that don't use the active/active redundancy model.

When auto is used, the choice between primary and backup is determined by the active-standby-role if it is configured, or defaults to primary if it is not.

Alternatively, you can contact Solace support for additional assistance.

  • When Config-Sync is enabled, the expectation is that the administrator explicitly declares the virtual router association to be primary (or backup) for each configured Distributed Cache; Config-Sync then propagates the opposite configuration to the mate, ensuring that only one side of the HA pair hosts an active Cache Manager.
  • When Config-Sync is not enabled, the expectation is that the administrator designates the desired side of the active/active pair as primary and its mate as backup, as sketched below.
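
As a sketch of the explicit association described in these bullets, using the create syntax shown above (the angle-bracket values are placeholders):

On the side chosen to host the active Cache Manager:

solace(configure)# create distributed-cache <name> message-vpn <vpn-name> primary

On its mate (with Config-Sync enabled, this opposite setting is propagated automatically; without Config-Sync it must be entered explicitly):

solace(configure)# create distributed-cache <name> message-vpn <vpn-name> backup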

Stop-On-Lost-Message and Restarting Cache Instances

Solace recommends that PubSub+ Cache Instances have the stop-on-lost-message property disabled so that if an instance detects potential message loss it can continue to provide service while administrative action is considered. This approach is considered preferable to taking the cache instance offline by default, as doing so can complicate recovery following network events that interrupt connectivity to many clients. A broker configured in a high-availability pair requiring a failover to the backup would be an example of one such event, as would some other networking-layer failures. In such cases, the affected cache instance(s) will indicate a lost message state, which is documented in detail here: Lost Message State.
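
As a minimal CLI sketch of disabling this property, assuming stop-on-lost-message is exposed under the Cache Cluster configuration context (the angle-bracket values are placeholders; confirm the exact context against the CLI reference for your release):

solace(configure)# distributed-cache <name> message-vpn <vpn-name>
solace(configure/distributed-cache)# cache-cluster <cluster-name>
solace(configure/distributed-cache/cache-cluster)# no stop-on-lost-message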

Restarting Cache Instances

If a PubSub+ Cache Instance temporarily loses connectivity with the router, or detects lost messages over its connection to the router, and it has stop-on-lost-message enabled, it will transition to the stopped-lost-message state. After verifying that at least one other PubSub+ Cache Instance in the Cache Cluster is up and contains valid data, the administrator can enter the clear-event Distributed Cache Admin EXEC command to bring the stopped instance back online. This causes the stopped PubSub+ Cache Instance to resynchronize its contents from an operational PubSub+ Cache Instance, and then come back online.
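
The exact form of the clear-event Admin EXEC command depends on the broker release and command context; the following is only a rough, hypothetical sketch (the angle-bracket values are placeholders, and the command path itself is an assumption to verify against your CLI reference):

solace# admin
solace(admin)# distributed-cache <name> message-vpn <vpn-name>
solace(admin/distributed-cache)# cache-instance <instance-name> clear-event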

It is recommended that the PubSub+ Cache Instance be brought back online during a period of relatively low publishing and cache-request activity. Under heavy traffic conditions with large numbers of cached topics, it may not be possible for a restarting PubSub+ Cache Instance to successfully buffer the inbound message stream while resynchronizing with another PubSub+ Cache Instance in the Cache Cluster. Additionally, the resynchronization activity will place additional load on the PubSub+ Cache Instance from which the restarting PubSub+ Cache Instance is receiving the cached data.

When multiple PubSub+ Cache Instances need to be brought back online, it is not recommended to start those PubSub+ Cache Instances simultaneously, but rather to start one PubSub+ Cache Instance and wait for it to transition to the “up” state before starting the next PubSub+ Cache Instance. This places a lower workload on the PubSub+ Cache Instances supplying the reference data. Additionally, when bringing online PubSub+ Cache Instances that are all connected to the same router, starting them one at a time ensures that only the first PubSub+ Cache Instance brought back online might need to resynchronize over the network from a PubSub+ Cache Instance connected to another router. The remaining PubSub+ Cache Instances on the router can then be resynchronized from that first PubSub+ Cache Instance on the same router, which is preferable to resynchronizing them from a PubSub+ Cache Instance connected to another router. During resynchronization, a PubSub+ Cache Instance does not participate in cache requests; only when it is synchronized will it begin fulfilling cache requests in the Cache Cluster.
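
Between restarts, the operational state of each PubSub+ Cache Instance can be checked from the CLI before the next instance is started; a hedged sketch, assuming a show cache-instance command is available on your release (the angle-bracket values are placeholders):

solace# show cache-instance <instance-name-pattern> message-vpn <vpn-name>

Proceed to the next clear-event only after the previously started PubSub+ Cache Instance reports an operational state of “up”.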

What to Do After All PubSub+ Cache Instances In a Cluster Stop

If all PubSub+ Cache Instances in a Cache Cluster have stop-on-lost-message enabled and temporarily lose connectivity with their routers, or detect lost messages over their connections to the routers, they will all transition to the stopped-lost-message state.

In this situation, the administrator should do the following:

  1. Decide which PubSub+ Cache Instance has the “best” data to synchronize with, and enter a clear-event Distributed Cache Admin EXEC command to bring that instance back online.
  2. After the first PubSub+ Cache Instance transitions to the “up” state, enter a clear‑event Distributed Cache Admin EXEC command for the next PubSub+ Cache Instance in the Cache Cluster, to bring it back online and synchronize with the first PubSub+ Cache Instance.
  3. Continue bringing each PubSub+ Cache Instance in the Cache Cluster back online, one at a time.
  4. If applicable for your application, restart the publishers, and force them to send “Initial” messages to repopulate these PubSub+ Cache Instances with the latest data.

In some use cases, rather than following the steps outlined above, an administrator may decide it is better to erase the data from the PubSub+ Cache Instances than to synchronize to “stale” data. This can be achieved by simply restarting the PubSub+ Cache processes on the servers hosting the PubSub+ Cache Instances.

IP Networking Considerations

To support a networked deployment of PubSub+ Cache, a robust underlying IP network is required. Such a network should include the following:

  • For each event broker, separate physical connections to separate Layer 2 switches (using active-standby Ethernet port bonding) to protect against physical link outages and Layer 2 switch failures.
  • A minimum of two Layer 2 paths or IP routing paths between each pair of event brokers in the network so that the underlying IP network can reroute around failures and protect against physical link outages, Layer 2 switch failures, and IP router failures.
  • Sufficient IP network bandwidth between sites to support the data rates of live data messages sent to the PubSub+ Cache Instances and to resynchronize PubSub+ Cache Instances following an outage.

The alternate routing paths provided by Multiple-Node Routing links between event brokers are only in place to route around failures of the event brokers themselves; network connectivity issues are expected to be resolved by rerouting in the underlying IP network. To minimize the risk of message loss when a reroute happens in the IP network, configure a large “queue max-depth” on the neighbor connections between routers so that each router has sufficient capacity to buffer published messages destined for remote routers while the underlying IP network reconverges following a network fault.

In the event that an egress queue overflows on a Multiple-Node Routing neighbor link, or a reroute happens at the Multiple-Node Routing layer between event brokers (due to either a router failure or a serious fault at the IP layer that isolates one or more event brokers from the rest of the network), published messages queued on those links could be lost. Downstream PubSub+ Cache Instances in the network cannot detect this loss, so they will not declare any message loss and will remain “up”. Administrators may want to employ out-of-band router monitoring solutions that alert on routing flaps, reroutes, and queue overflows at the Multiple-Node Routing layer, so that appropriate recovery actions can be initiated on any PubSub+ Cache Instances that might contain stale or incomplete information as a result of the incident.