High Availability for Software Event Brokers

You can deploy PubSub+ software event brokers in high-availability (HA) redundancy groups for fault tolerance. An HA redundancy group is made up of three event broker instances: two acting as active-standby messaging nodes and a third acting as a monitoring node. An HA redundancy group provides 1:1 event broker sparing to increase overall service availability. If one of the event brokers fails or is taken out of service, the other event broker automatically takes over and provides service to the clients that were previously served by the now-out-of-service event broker.

Software Event Broker HA Redundancy Model

PubSub+ software event brokers support an active/standby redundancy model. With this model, a primary event broker provides messaging services to clients, while a backup event broker waits in standby mode and provides service only if the primary event broker fails. A third event broker acts as a monitoring node: it serves as a tie-breaker and prevents split-brain scenarios, in which both the primary and backup event brokers would otherwise become active simultaneously.

The software event broker HA redundancy model supports both Direct and Guaranteed Messaging clients.

Active/Standby Redundancy Model

Active-Standby HA Model

In the active/standby model:

  • All clients connect to the active event broker in the redundancy group (typically the primary event broker).
  • The other event broker (typically the backup event broker) acts only as a standby. While its mate is active, clients cannot connect to the standby event broker, and no messaging traffic can flow through the standby.
  • The active event broker uses the IP network to automatically propagate all Guaranteed messages and Guaranteed Messaging state to the standby event broker.
  • If the primary event broker fails for any reason, the backup event broker will become active and provide messaging services to the clients.
  • When the primary event broker comes back online, the backup event broker continues to provide service to the clients, while it automatically resynchronizes its Guaranteed messages and the Guaranteed Messaging state with the primary event broker. Once resynchronized, the backup event broker can either continue to provide service to the clients (the default behavior for the software event broker), or it can transfer activity back to the primary event broker (if you have configured the software event brokers to auto-revert to the primary).
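The behavior described in the list above can be pictured as a small state model. The following is only a sketch in Python, not a PubSub+ API: the class, method, and field names are hypothetical, and the model ignores the monitoring node and the details of resynchronization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RedundancyGroup:
    """Toy model of an active/standby pair (hypothetical, not a PubSub+ API)."""
    auto_revert: bool = False   # per the text, reverting to the primary is not the default
    active: str = "primary"     # which event broker currently serves clients
    primary_in_sync: bool = True

    def primary_fails(self) -> None:
        self.primary_in_sync = False
        self.active = "backup"  # the standby takes over service

    def primary_returns(self) -> None:
        # The backup keeps serving clients while the primary resynchronizes.
        self.primary_in_sync = False

    def resync_complete(self) -> None:
        self.primary_in_sync = True
        if self.auto_revert:
            self.active = "primary"  # activity is handed back to the primary


def run_failover(auto_revert: bool) -> List[str]:
    """Return the active broker at the start and after each step of a primary outage."""
    group = RedundancyGroup(auto_revert=auto_revert)
    history = [group.active]
    for step in (group.primary_fails, group.primary_returns, group.resync_complete):
        step()
        history.append(group.active)
    return history
```

With `auto_revert=False` the backup stays active after resynchronization; with `auto_revert=True` the final entry in the history is the primary again.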

Synchronizing Software Event Broker Configurations

The primary and backup event brokers in a software event broker HA redundancy group must have the same system and Message VPN level configurations, and this configuration must remain in sync while the event brokers are running. The Config-Sync facility is used to automatically synchronize their configurations.

Guaranteed messages and message state are synchronized between the primary and backup event brokers by the mate link service. The mate link service uses the Management VRF, so the messaging traffic to and from the clients and the Guaranteed Messaging data synchronization between the active and standby event brokers all travel over the same interface.

Failover Mechanism

The PubSub+ software event broker supports a host list failover mechanism, through which client connections are transferred from one message routing node to the other when a node fails. This mechanism uses lists of the IP addresses, or corresponding DNS names, of both the primary and backup event brokers. The primary and backup event brokers have different IP addresses at all times, but only one of them is active and accepts connections. Connecting clients know these IP addresses, and the clients (not the event brokers of the HA group) handle reconnecting from one IP address to the other.

The client API is responsible for connecting to whichever event broker is active in the HA redundancy group. This kind of configuration would be common in cloud environments.

Your client APIs and VPN bridge connections must be configured with the host lists for the primary and backup event brokers in the HA redundancy group. Once configured, if the primary event broker becomes unavailable for any reason, the backup event broker will take over activity, and the client APIs and Message VPN bridges will reconnect to the newly active event broker without impacting the client applications.
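To make the client-side behavior concrete, here is a minimal sketch of walking a host list until the active event broker accepts the connection. It assumes nothing about the real Solace Messaging APIs, which handle this internally: `connect_via_host_list`, the `connect` callable, and `max_passes` are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def connect_via_host_list(
    hosts: Iterable[str],
    connect: Callable[[str], T],
    max_passes: int = 3,
) -> T:
    """Try each host in order, repeating the whole list up to max_passes times."""
    host_list = list(hosts)
    last_error: Optional[Exception] = None
    for _ in range(max_passes):
        for host in host_list:
            try:
                return connect(host)        # the active event broker accepts
            except ConnectionError as err:  # the standby rejects the attempt
                last_error = err
    raise ConnectionError(f"no active event broker in {host_list}") from last_error
```

A caller would pass the two addresses of the redundancy group, for example `["192.168.1.1", "192.168.1.2"]`, together with whatever function actually opens a session. Because the list is retried, a client that loses its connection during a failover can find the newly active event broker on a later pass.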

Client Host Lists

This failover mechanism relies on client applications using configured host lists to connect and reconnect to valid hosts for the HA redundancy group. For details on how the Solace Messaging APIs use host lists, refer to Host.

When using host lists, the active software event broker will accept client connections on the Management VRF static IP address, and the standby event broker will reject such connection requests. Primary and backup event broker IP interfaces are ignored, and client connections to these interfaces are rejected too.

Client Using the Host List to Connect

Notice that the client application’s host list is configured with two IP addresses:

  • the primary event broker’s static IP address for the Management VRF (or corresponding DNS name)
  • the backup event broker’s static IP address for the Management VRF (or corresponding DNS name)

VPN Bridging & Fault Tolerance

For details on how to establish Message VPN bridge connections to remote event brokers when those remote event brokers have been deployed in high-availability (HA) redundant event broker pairs for fault tolerance, refer to Bridging to Remote Event Brokers That Use Redundancy.

Software Event Broker IP Addressing

PubSub+ software event brokers rely on a Message Backbone service for all messaging traffic to and from clients, and on a Management service for management traffic. Both Message Backbone and Management services share the Management Virtual Routing and Forwarding (VRF) object that is used to connect to the IP network.

By default, software event brokers have a single network interface that is mapped to the Management VRF. This is different from PubSub+ appliances, which have two separate network interfaces.

By default, the software event broker network interface is configured as a DHCP client. However, to use software event broker redundancy, each event broker instance in the HA redundancy group, including the monitoring node, must have a unique static IP address associated with the Management VRF. These IP addresses must be in the same subnet and must be statically configured (that is, DHCP is not supported). Using static IP addresses in the HA redundancy group is a prerequisite for the software event broker redundancy functionality.
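The addressing requirements above (unique addresses, all in one subnet) are easy to check before deployment. The sketch below uses Python's standard `ipaddress` module; the function name, the node names, and the monitoring node address used in the usage note are assumptions made for illustration, not values from the product.

```python
import ipaddress
from typing import Dict, List


def check_group_addresses(nodes: Dict[str, str], subnet: str) -> List[str]:
    """Return a list of problems; an empty list means the addresses look consistent."""
    network = ipaddress.ip_network(subnet)
    problems: List[str] = []
    seen: Dict[object, str] = {}
    for name, addr in nodes.items():
        ip = ipaddress.ip_address(addr)
        if ip not in network:
            problems.append(f"{name}: {ip} is outside {network}")
        if ip in seen:
            problems.append(f"{name}: {ip} duplicates {seen[ip]}")
        else:
            seen[ip] = name
    return problems
```

For example, calling it with `{"primary": "192.168.1.1", "backup": "192.168.1.2", "monitor": "192.168.1.3"}` and the subnet `"192.168.1.0/24"` returns an empty list. The check cannot tell whether an address was assigned statically or by DHCP; that still has to be verified on each host.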

Failure Detection

All three nodes in the HA redundancy group—primary, backup, and monitoring—continuously communicate with each other using a protocol that runs over the static IP interfaces in the Management VRF, and, by default, uses ports 8300, 8301, and 8302.

If the active event broker in the group becomes unreachable for any reason, and neither the monitoring node nor the backup event broker can see the active event broker, but they can still see each other, then the backup event broker will take activity, and provide messaging services to the clients.

Similarly, if the active event broker loses connectivity with both the standby event broker and the monitoring node, the active event broker will give up activity, to eliminate the possibility that it might be operating in a split-brain fashion. This implies that for an event broker to take (or keep) activity and provide service, it must be able to communicate with at least one other node in the group: the mate event broker, the monitoring node, or both.
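The two rules above reduce to simple boolean conditions. The following sketch states them as Python functions; the function and parameter names are hypothetical, and the real protocol involves timers and state that are not modeled here.

```python
def may_hold_activity(sees_mate: bool, sees_monitor: bool) -> bool:
    """A broker may take or keep activity only if it can reach at least one other node."""
    return sees_mate or sees_monitor


def backup_should_take_over(
    backup_sees_active: bool,
    monitor_sees_active: bool,
    backup_sees_monitor: bool,
) -> bool:
    """The backup takes activity when neither it nor the monitoring node can see
    the active broker, but the backup and the monitoring node still see each other."""
    return (not backup_sees_active) and (not monitor_sees_active) and backup_sees_monitor
```

An isolated active event broker gives `may_hold_activity(False, False) == False`, so it gives up activity, while the backup on the other side of the partition satisfies `backup_should_take_over(False, False, True)`. At most one side of any partition can reach a second node, which is what rules out split-brain operation.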

For redundancy to function properly, all three nodes in the group need to be configured with the Management VRF static IP addresses of the other nodes in the group and the assigned role of each of the nodes (message-routing-node, or monitoring-node).

All three nodes in the HA redundancy group also need to be configured with the same HA redundancy group password as a security mechanism to ensure that only the nodes in the group can communicate with each other, and that other hosts on the network cannot impersonate the event brokers or attempt to join the HA redundancy group.
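One way to picture these requirements is as a consistency check across what each node has been told about the group. The sketch below is an illustration only: `NodeView`, `find_mismatches`, and the data layout are hypothetical and do not correspond to the event broker's configuration format. Only the two role names come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NodeView:
    """What one node has been configured with (hypothetical structure)."""
    members: Dict[str, Tuple[str, str]]  # node name -> (static IP, node role)
    password: str                        # HA redundancy group password


def find_mismatches(views: Dict[str, NodeView]) -> List[str]:
    """Compare every node's view against the first one (by name) and list differences."""
    problems: List[str] = []
    names = sorted(views)
    reference = views[names[0]]
    for name in names[1:]:
        view = views[name]
        if view.members != reference.members:
            problems.append(f"{name}: member list differs from {names[0]}")
        if view.password != reference.password:
            problems.append(f"{name}: group password differs from {names[0]}")
    roles = sorted(role for _, role in reference.members.values())
    expected = ["message-routing-node", "message-routing-node", "monitoring-node"]
    if roles != expected:
        problems.append("group needs two message-routing-nodes and one monitoring-node")
    return problems
```

An empty result means every node agrees on the addresses, the roles, and the shared password. A real deployment would read these values from each event broker rather than from hand-built objects.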

The following figure shows a correct failure detection configuration.

Configuring for Failure Detection

VMR Redundancy Failure Detection Configuration

Notice above how all three nodes in the HA redundancy group are configured with static IP addresses.

Failover Sequence

If the active event broker goes offline, the failure is detected within the HA redundancy group.

Subsequently, a failover occurs in the following sequence:

  1. The backup event broker takes over messaging activity.
  2. Once the failed primary event broker comes back online, it resynchronizes to match the currently active backup event broker.
  3. The primary event broker takes on the “Standby” role, or, if auto-revert is enabled, messaging activity automatically switches back to the primary event broker.

The diagrams below show a failover sequence in detail.

Normal Operation

The diagram below shows a typical HA redundancy group under normal operation, when both the primary and backup event brokers are online and capable of providing service to the clients. This group is configured with the host lists for client connections (192.168.1.1 and 192.168.1.2).

Typical Operation with host list

HA Normal Failover Operation

Taking Over Activity

If the active event broker fails or is taken offline, the backup event broker and monitoring node will detect the failure, and the backup event broker will take over activity. When the backup event broker takes activity, it will start accepting connections on the Management VRF static IP address. The connecting clients will traverse their host lists and connect to the backup event broker using the backup event broker’s static IP address (on the Management VRF).

The diagram below shows the failover mechanism. Notice how the client uses the backup event broker IP address 192.168.1.2 after the backup event broker took over messaging activity.

VMR HA Host List Failover

Resynchronization

Once the failed event broker comes back online, it will use the mate link VRF to resynchronize its message-spool contents to match the active event broker. This process may take a few seconds if the message-spool contents of the two event brokers differ only slightly, or several hours if the failed event broker was offline for a long time and large quantities of data have been spooled on the active event broker.

Resynchronization is not a service-affecting operation, and the backup event broker continues to service connected clients while the resynchronization is taking place. However, the primary event broker is not able to provide service to clients during the resynchronization process. (Note that when disk resynchronization occurs, the redundancy status will be displayed as down.)

The following diagram shows the resynchronization process.

VMR HA Resynchronization

Taking on the Standby Role

Once resynchronization has completed, the primary event broker takes on the Standby role, and is available to provide service to clients should the backup event broker go offline for any reason.

The following diagram shows this state of the HA redundancy group.

VMR HA Taking on Standby

If auto-revert is enabled on the event brokers of this group, then activity will automatically switch back to the primary event broker. The diagram below shows this scenario. You’ll notice that this diagram is the same as that showing the normal operating state.

VMR HA Switching Back After Failover

Next Steps