Event Broker Service Maintenance Validation Checks

Before Solace upgrades your event broker service, pre-maintenance validation checks run to confirm your environment is ready. After the upgrade completes, Solace Cloud runs post-maintenance validation checks to confirm the upgrade was successful and your event broker service is healthy. You can view upgrade activity details using the Cloud Console, including the results of validation checks.

If a validation check fails, the following actions occur:

  • Pre-maintenance validation check failure: The event broker service upgrade does not start. Your existing event broker service continues to run uninterrupted.

  • Post-maintenance validation check failure: The event broker service upgrade has been applied, but one or more aspects of your event broker service have not returned to a healthy state in the expected window. You and Solace are notified.

This topic provides the following information for each pre- and post-maintenance validation check:

  • What the check is validating.

  • Why the check may fail.

  • Who is responsible for acting if the check fails: This is either Solace or you, depending on the type of ownership for your Kubernetes cluster.

  • Actions you can take.

If the maintenance validation check failure is one you can act on, you can try to resolve it, and perform the following actions based on the type of upgrade check:

If you can't resolve the validation check failure, or the check specifies that you should contact Solace, include the following information in your support request: the service ID, the failure timestamp, and the check name.

For more information about the individual validation checks, see:

Pre-Maintenance Validation Checks

Before initiating an event broker service upgrade, Solace runs pre-maintenance validation checks to confirm the event broker service is in a healthy state. If any of the pre-maintenance checks fail, the upgrade is canceled and your event broker service continues to run uninterrupted. Solace is notified immediately. You may also be notified, depending on the check.

Solace Cloud performs the following pre-maintenance validation checks before upgrading your event broker service:

5K/50K Node Group Migration

Checks whether the event broker service is being migrated to, or sized for, high-scale tiers (specifically, Enterprise 5K and Enterprise 50K event broker services), and then validates that the target node group is sized and configured correctly for the target event broker service scale. This check is a tier-specific migration check, not a generic capacity check.

Why the check might fail: Insufficient capacity in the target node group, or a configuration mismatch.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. Node-group sizing for these tiers is managed by Solace.

What actions can you take: Contact Solace.

Affinity

Checks whether the existing node type can support the memory requirements of the target event broker version. When a new event broker version increases the per-pod memory request, this check confirms the current nodes still have enough memory headroom to schedule the upgraded event broker service.

Why the check might fail:

  • The current node type does not have enough memory to meet the increased memory requirements of the target event broker version.

  • The existing nodes are at capacity and cannot accommodate the higher memory requests of the upgraded event broker service.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

  1. Check the target event broker version's memory requirements.

  2. Confirm your node pool has nodes with sufficient memory headroom to meet the new requirement.

  3. Resize or scale the node pool if needed.

If your nodes appear to meet the required memory resource requirements, but the check fails, contact Solace.
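
As a rough illustration of the headroom comparison described above, the following Python sketch checks whether any node has enough unrequested memory for a larger per-pod request. The `Node` type, the node names, and all memory figures are hypothetical; this is not how Solace implements the check, and it ignores scheduling constraints such as affinity rules and taints.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_mib: int  # memory the node can hand out to pods
    requested_mib: int    # memory already requested by scheduled pods


def nodes_with_headroom(nodes, required_mib):
    """Return names of nodes whose unrequested memory covers the new request."""
    return [
        n.name
        for n in nodes
        if n.allocatable_mib - n.requested_mib >= required_mib
    ]


# Hypothetical figures for illustration only.
pool = [
    Node("node-a", allocatable_mib=15000, requested_mib=9000),
    Node("node-b", allocatable_mib=15000, requested_mib=2000),
]
eligible = nodes_with_headroom(pool, required_mib=8000)
print(eligible or "No node has enough headroom; resize or scale the node pool.")
```

If the list comes back empty, that corresponds to the situation where resizing or scaling the node pool is needed.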

Cipher Suite

This check is advisory only. A warning does not block the event broker service upgrade.

Checks whether high-client-count event broker services use deprecated cipher suites. The check exists to give you visibility into the client-side impact of the upgrade before it proceeds.

Why it might warn: Client connectivity to the event broker service may break if the clients are using deprecated ciphers.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types.

What actions can you take: No action is required for the upgrade to proceed. Contact Solace if you'd like help assessing client-side impacts of the upgrade.

Distributed Tracing

Checks whether you have distributed tracing enabled, and if so, that the distributed tracing collector is compatible with the target event broker version.

Why the check might fail: A version skew between the distributed tracing collector and the event broker version, or a configuration or connectivity issue with your tracing destination.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. Tracing compatibility is managed by Solace.

What actions can you take: Contact Solace.

Dynamic Message Routing (DMR)

Checks that all DMR links to and from the event broker service are up before the upgrade.

Why the check might fail: A link to another event broker service in your event mesh is down. This could be caused by a network issue, a problem with the remote event broker service, or a configuration mismatch.

Who is responsible if the check fails: You, for all cluster ownership types.

What actions can you take:

  1. Identify which DMR link is down.

  2. Check connectivity between this event broker service and the remote endpoint.

  3. Confirm the remote event broker service is healthy.

If both endpoints appear healthy, or you need additional help, contact Solace.

Helm Dry Run

Solace performs a dry run of the event broker service upgrade against your Kubernetes cluster to confirm the upgrade can be applied without errors, and that the current Helm values are compatible with the Helm chart used for the upgrade.

Why the check might fail:

  • The Kubernetes API server is unreachable or it is rate-limiting requests.

  • Network connectivity between Solace and the cluster has degraded.

  • A configuration drift in the cluster prevents the upgrade plan from applying cleanly.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. Solace investigates the failure and contacts you if cluster-side input or action is required.

What actions can you take: Contact Solace.

Helm Status

Validates that the event broker Helm release has the status DEPLOYED before the event broker service upgrade starts.

Why the check might fail: This check can fail when a prior upgrade leaves the release in a partial state.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types.

What actions can you take: Contact Solace.

Image Access

Checks that the configured image registry can access the Solace Container Registry image for the target event broker version.

Why the check might fail:

  • The cluster cannot reach the image registry (network or DNS).

  • The image pull credentials are missing or expired.

  • The expected image version is not available in the registry yet.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

  1. Verify network connectivity from your worker nodes to the registry.

  2. Check that image-pull secrets in the event broker service's namespace are valid and not expired.

  3. Confirm any private-registry mirrors you maintain include the target event broker version.

If the image is not reachable from a Solace-managed registry, or if you can't determine the cause, contact Solace.

Kubernetes Compatibility

Checks that your Kubernetes cluster is running a version that the target event broker version supports.

Why the check might fail: Your cluster runs a Kubernetes version below the supported minimum for the new event broker version.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

  1. Upgrade your Kubernetes cluster to a supported version.

  2. Re-run the event broker service upgrade once the cluster upgrade completes.
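
To show what a version comparison of this kind might look like, here is a minimal Python sketch that compares a cluster's reported Kubernetes version with a minimum. The minimum of `1.28` is a placeholder, not a documented Solace requirement, and the function names are illustrative.

```python
import re


def parse_k8s_version(version):
    """Extract (major, minor) from strings such as 'v1.27.3' or '1.29.1-eks-abc'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(cluster_version, minimum_version):
    """True when the cluster's major.minor is at or above the minimum."""
    return parse_k8s_version(cluster_version) >= parse_k8s_version(minimum_version)


# "1.28" is a hypothetical minimum used only for this example.
for reported in ("v1.27.3", "v1.29.1-eks-abc"):
    print(reported, "ok" if meets_minimum(reported, "1.28") else "below minimum")
```

Comparing the parsed versions as tuples of integers avoids the string-comparison trap where "1.9" would sort above "1.28".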

Kubernetes Resource Access

Checks whether the event broker service can provision the required persistent storage once upgraded. The check provisions a test PVC against the primary StatefulSet's storage class.

Why the check might fail:

  • The storage class doesn't support the requested volume type.

  • The volume provisioner is failing (cloud-side quota, IAM, or service issue).

  • Your cluster has no remaining capacity in the relevant storage class.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

  1. Review the storage class used by your event broker's PVCs (kubectl get pvc -n <namespace>).

  2. Check cloud-provider quotas for the storage type.

  3. Verify the dynamic provisioner is healthy.

If your storage class is correctly configured but provisioning still fails, contact Solace.

Router Name

Checks that the event broker service's router name matches the name Solace has on record.

Why the check might fail:

  • DNS resolution between Solace and the event broker service has degraded.

  • A configuration change has altered the event broker service's router name.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. Router-identity mismatches require coordinated investigation by Solace.

What actions can you take: Contact Solace.

Service Package

Checks that the Mission Control Agent installed in your cluster supports the target event broker version.

Why the check might fail: The Mission Control Agent may be an older version that doesn't include the components required by the target event broker version.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. The Mission Control Agent is managed by Solace. Solace must upgrade the Mission Control Agent before the event broker service upgrade can proceed.

What actions can you take: Contact Solace.

Spool File Corruption

Checks that the persistent message-spool files are intact on the event broker service's storage volumes.

Why the check might fail: Possible file-system corruption on the persistent volume.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. This is a data-integrity signal that Solace investigates before any upgrade proceeds.

What actions can you take: Contact Solace immediately. Do not retry the upgrade.

Standby Message Sync

Checks that the next message ID on the backup messaging node of a high-availability service is synchronized with the primary messaging node before the upgrade begins.

Why the check might fail: The backup messaging node has not caught up to the primary messaging node's message state.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. This is a data-integrity signal that Solace investigates before any upgrade proceeds.

What actions can you take: Contact Solace immediately. Do not retry the upgrade.

MQTT

This validation check does not run if your event broker service does not have MQTT cache enabled.

Checks if the MQTT cache is enabled and properly configured on your event broker service.

Why the check might fail: The MQTT cache configuration is missing required settings or has invalid values.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types.

What actions can you take: Contact Solace.

ULimit Check

Checks that the worker-node ulimit values (open files, processes) meet the requirements of the new event broker version.

Why the check might fail: The target event broker version requires higher ulimit thresholds than the current node configuration.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

For a Customer-Controlled Cluster:

  1. Update your node-image or node-pool configuration to increase the ulimit values to at least the numbers provided by the resource calculator for Solace software event brokers.

  2. Verify any pod-security policies or security contexts permit the required limits.

  3. Roll the affected nodes to apply the new configuration.

If you are unsure what values to set, contact Solace to confirm the exact minimums for the target event broker version.
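
For a quick comparison of current limits against required minimums, a sketch along these lines may help. The limit names and numbers below are placeholders; take the real minimums from the resource calculator or from Solace, and collect the current values from your worker nodes by whatever means you normally use.

```python
def ulimit_shortfalls(current, required):
    """Return {limit_name: (current, required)} for each limit below its minimum."""
    return {
        name: (current.get(name, 0), minimum)
        for name, minimum in required.items()
        if current.get(name, 0) < minimum
    }


# Hypothetical values for illustration only.
required = {"nofile": 1048576, "nproc": 65536}
current = {"nofile": 65536, "nproc": 65536}

shortfalls = ulimit_shortfalls(current, required)
for name, (have, need) in shortfalls.items():
    print(f"{name}: current {have} is below required {need}")
```

A limit that is missing from the collected values is treated as 0, so it is reported rather than silently skipped.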

Post-Maintenance Validation Checks

After a successful upgrade, Solace runs post-maintenance validation checks to confirm the event broker service has returned to a healthy state. Most post-maintenance failures indicate a transient condition that may resolve on its own. Some post-maintenance failures require your attention, especially when they involve client connections or network configuration on your side.

Solace Cloud performs the following post-maintenance validation checks after upgrading your event broker service:

Bridges

Checks that the event broker to event broker Message VPN bridges (including inter-service bridges you've configured) have returned to their pre-upgrade state. The check compares bridge status after the upgrade with the pre-upgrade baseline. The check fails only when a bridge that was UP before the upgrade is still DOWN after the upgrade. If a bridge was already DOWN before the upgrade and remains DOWN after, the check passes because the upgrade did not introduce a regression.
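
The regression rule described above can be sketched in Python as follows. The bridge names and the dictionary shape are hypothetical, and treating a bridge that is missing from the post-upgrade state as DOWN is an assumption of this sketch, not documented behavior.

```python
def regressed_bridges(before, after):
    """Return bridges that were UP before the upgrade and are not UP after it."""
    return sorted(
        name
        for name, state in before.items()
        # Assumption: a bridge absent from the post-upgrade state counts as DOWN.
        if state == "UP" and after.get(name, "DOWN") != "UP"
    )


# Hypothetical bridge states for illustration only.
before = {"bridge-east": "UP", "bridge-west": "DOWN", "bridge-north": "UP"}
after = {"bridge-east": "UP", "bridge-west": "DOWN", "bridge-north": "DOWN"}

failed = regressed_bridges(before, after)
result = "FAIL" if failed else "PASS"
print(result, failed)
```

Here bridge-west does not affect the result because it was already DOWN before the upgrade; only bridge-north counts as a regression.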

Why the check might fail:

  • A remote event broker is down or unreachable.

  • Bridge credentials or TLS configuration no longer matches.

  • A network-policy change has blocked the bridge.

Who is responsible if the check fails: You, for all cluster ownership types.

What actions can you take:

  1. Check the Bridges tab in Broker Manager. Note which bridge is down.

  2. Confirm the remote endpoint is reachable and healthy.

  3. Verify credentials and TLS configuration on both sides of the bridge.

  4. If the bridge is between two Solace-managed event broker services, contact Solace.

Client connections restored (per protocol)

Checks how many client connections of each protocol have returned compared with the pre-upgrade snapshot. Protocols appear in the post-upgrade results only if they had a non-zero pre-upgrade client count. For example, if your service had no AMQP clients before the upgrade, no amqpClients entry appears in the post-upgrade check results.

There are two categories of results for this check:

  • Recoverable protocols: smfClients, mqttClients, amqpClients.

    Clients using these protocols are expected to reconnect automatically. The following results are possible:

    • PASS: The post-upgrade reconnection count is greater than or equal to the pre-upgrade count.

    • WARNING: The post-upgrade reconnection count is below the pre-upgrade count.

    • FAIL: Zero clients reconnected.

  • Informational protocols: restIncomingClients, restOutgoingClients, webTransportClients.

    Clients using these protocols are not designed to auto-reconnect after an event broker service restart. The result for these protocols always reports as INFO (never FAIL). This is intended as a visibility signal rather than an upgrade failure.
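
One way to read the result categories above is shown in this Python sketch. The `classify` function is illustrative only. It assumes that zero reconnected clients is reported as FAIL rather than WARNING, since a count of zero also satisfies the WARNING condition as written.

```python
RECOVERABLE = {"smfClients", "mqttClients", "amqpClients"}
INFORMATIONAL = {"restIncomingClients", "restOutgoingClients", "webTransportClients"}


def classify(protocol, pre_count, post_count):
    """Return the result for one protocol, or None when it is omitted."""
    if pre_count == 0:
        return None  # protocols with no pre-upgrade clients do not appear
    if protocol in INFORMATIONAL:
        return "INFO"  # visibility signal only, never FAIL
    if protocol not in RECOVERABLE:
        raise ValueError(f"unknown protocol: {protocol}")
    if post_count >= pre_count:
        return "PASS"
    if post_count == 0:
        return "FAIL"  # assumption: zero takes precedence over WARNING
    return "WARNING"


# Hypothetical counts for illustration only.
print(classify("smfClients", 120, 120))
print(classify("mqttClients", 40, 25))
print(classify("restIncomingClients", 8, 0))
```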

Why the check might fail:

  • Some clients did not reconnect automatically (missing retry logic, stale connection parameters).

  • A network policy, firewall, or DNS change has broken client connectivity.

  • The client fleet was already in a degraded state before the upgrade.

Who is responsible if the check fails: You, for all cluster ownership types.

What actions can you take:

  • Recoverable protocols:

    1. Check your client applications and confirm they are attempting to reconnect.

    2. Review client-side logs for connection failures (authentication, TLS, DNS).

    3. Verify that any network or firewall rules covering the event broker service endpoints still allow client traffic.

    4. If client-side logs are clean and clients still can't connect, contact Solace.

  • Informational protocols: If REST and WebTransport clients do not auto-reconnect, treat the INFO result from this check as a count comparison, not a fault. If you need those clients to come back, drive the reconnection from the client side.

Distributed Tracing

Checks that the distributed tracing pipeline is operating after the upgrade.

If distributed tracing is not configured on your event broker service, this check returns a PASS. A PASS result therefore means either:

  • Tracing is healthy.

  • Tracing is not in use.

Why the check might fail: The distributed tracing collector failed to come back up or is seeing errors when pushing to the tracing destination.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. The distributed tracing pipeline is managed by Solace.

What actions can you take: If you have distributed tracing configured and this check fails, contact Solace.

Dynamic Message Routing (DMR)

Checks that the DMR links to this event broker service in an event mesh reconnected after the upgrade. This check is compared against the pre-upgrade DMR state: only links that were up before the upgrade and are still down after will fail this check.

Why the check might fail:

  • A remote event broker service is in a bad state.

  • A network change has broken the link-layer connectivity between event broker services.

  • A DMR configuration drift has been introduced somewhere in the event mesh.

Who is responsible if the check fails: You, for all cluster ownership types.

What actions can you take:

  1. Check the health of the event mesh in the Cloud Console to identify which links are down.

  2. Verify network reachability between the affected event broker services.

  3. If the problem link is between two Solace-managed event broker services, or you can't determine the cause, contact Solace.

Kubernetes Resource Status

Checks that the Kubernetes resources (pods, services, endpoints) associated with the event broker service return to steady state after the event broker service upgrade.

Why the check might fail: Several conditions can cause a failure, including:

  • A pod keeps restarting or enters a crash loop.

  • A node issue is preventing one of the event broker service replicas from scheduling.

Who is responsible if the check fails:

Cluster Ownership Type                    Responsibility
Public Clusters and Dedicated Clusters    Solace
Customer-Controlled Clusters              You

What actions can you take:

  1. Check the pods in your cluster and identify any that are not in a Running or Ready state. Review the pod details and recent events to understand why they aren't healthy.

  2. Check node conditions.

  3. Resolve any cluster-side issues.

If the Kubernetes cluster appears healthy but event broker pods are still unhealthy, contact Solace.
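
For step 1 on a Customer-Controlled Cluster, a small Python sketch such as the following can pick out pods that are not Running or not Ready from the JSON that kubectl get pods -n <namespace> -o json produces. The pod names in the sample are hypothetical, and this is only a convenience for your own triage, not the logic Solace uses for the check.

```python
import json


def unhealthy_pods(pod_list):
    """Return names of pods that are not Running or whose Ready condition is not True."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if status.get("phase") != "Running" or not ready:
            names.append(pod["metadata"]["name"])
    return names


# Trimmed sample in the shape of a Kubernetes pod list; names are hypothetical.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "broker-primary-0"},
      "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "True"}]
      }
    },
    {
      "metadata": {"name": "broker-backup-0"},
      "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "False"}]
      }
    }
  ]
}
""")
print(unhealthy_pods(sample))
```

For any pod the sketch reports, review the pod details and recent events as described in step 1.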

Monitoring

Checks that the event broker service's Insights Agents are up after the event broker service upgrade. For HA event broker services, this check covers the Insights Agents for the primary, backup, and monitoring nodes. For developer and standalone class event broker services, it covers the single Insights Agent for the single node.

Why the check might fail: An Insights Agent failed to come back up, or lost connectivity to the monitoring platform.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. Solace manages the Insights Agents.

What actions can you take: Contact Solace.

Redundancy

Checks that the HA redundancy group (primary, backup, and monitor nodes) has returned to a fully redundant healthy state.

Why the check might fail: One member of the HA redundancy group is slow to come back, or there is a persistent issue with one of the event broker replicas.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types. HA redundancy is managed by Solace.

What actions can you take: Contact Solace.

Spool File Corruption

Checks that the persistent message-spool files are intact after the event broker service upgrade.

Why the check might fail: Indicates a possible data-integrity issue introduced during or before the event broker service upgrade.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types.

What actions can you take: Contact Solace immediately.

Standby Message Sync

Checks that the backup messaging node has caught up with the primary messaging node's state after the event broker service upgrade.

Why the check might fail: An unusually large backlog, a replication issue, or an ongoing sync after heavy traffic.

Who is responsible if the check fails: Solace is responsible for all cluster ownership types.

What actions can you take: Wait a few minutes; brief transient sync lag is normal after a rolling restart. If the warning persists, contact Solace.