Solace PubSub+ Troubleshooting

Operator Troubleshooting
Developer Troubleshooting
Dashboard Troubleshooting
Advanced Asset Management

This topic describes how operators and developers can troubleshoot instances of Solace PubSub+ services.

A Solace PubSub+ Service Instance is backed by a Message VPN on one or more PubSub+ Software Event Brokers, depending on the plan you’ve chosen.

You can discover the Solace PubSub+ Software Event Broker backing a service instance from the Solace PubSub+ credentials, which become available when you bind an app to the instance or create a service key for the instance.

Operator Troubleshooting

The operator will have access to the logs of all the components of Solace PubSub+ for VMware Tanzu:

Service Broker Logs
Service Broker Agent Logs
Solace PubSub+ Event Broker Logs

An installation of Solace PubSub+ for VMware Tanzu may be configured to use System Logging as a means of gathering all logs.

Accessing Logs

The following locations in Solace PubSub+ BOSH VMs will hold logs of interest that can help in troubleshooting issues:

All VM job logs
- /var/vcap/sys/log
All Solace PubSub+ Event Broker Logs
- /var/vcap/store/containers/pubsub/volumes/jail/logs

The service broker logs are collected in the management VM under /var/vcap/sys/log/broker-logs.

When the Management app is deployed in High Availability mode, the service broker logs are captured on both the primary and backup VMs, so you can retrieve them from either one.

This means that you can use the bosh logs command to download the service broker logs along with the other logs from the management VM.
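
As a rough sketch, the download can be wrapped in a small shell function. The deployment name `solace_deployment` is the one used in the errand example in this topic. The instance name `management/0` and the destination directory are assumptions, so substitute the values that `bosh vms` reports for your deployment.

```shell
# Download the logs of one VM into a local directory.
# Usage: fetch_vm_logs DEPLOYMENT INSTANCE [DEST_DIR]
fetch_vm_logs() {
  local deployment="$1" instance="$2" dest="${3:-.}"
  mkdir -p "$dest" || return 1
  # 'bosh logs' saves an archive of the instance's job logs into --dir.
  bosh -d "$deployment" logs "$instance" --dir "$dest"
}

# Example (the instance name is hypothetical):
#   fetch_vm_logs solace_deployment management/0 ./solace-logs
```

The function only wraps `bosh logs`, so the downloaded archive still has to be unpacked before you can read the files from /var/vcap/sys/log.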

You can also watch service broker logs as follows:

1. Set your API endpoint to the Cloud Controller of your deployment.

    ```
    $ cf api api.YOUR-SYSTEM-DOMAIN
    Setting api endpoint to api.YOUR-SYSTEM-DOMAIN...
    OK
    API endpoint:  https://api.YOUR-SYSTEM-DOMAIN (API version: 2.82.0)
    Not logged in. Use 'cf login' to log in.
    ```

1. Log in to your deployment.

    ```
    $ cf login
    API endpoint: https://api.YOUR-SYSTEM-DOMAIN
    Email> user@example.com
    Password>
    ```

1. Target the `solace` org and `solace-broker` space.

    ```
    $ cf target -o solace -s solace-broker
    api endpoint:   https://api.YOUR-SYSTEM-DOMAIN
    api version:    2.82.0
    user:           admin
    org:            solace
    space:          solace-broker
    ```

1. Discover the Solace PubSub+ Service Broker App name.

    ```
    $ cf apps
    Getting apps in org solace /space solace-broker as user@example.com...
    OK

    name                           requested state   instances   memory   disk   urls
    solace-broker-1.2.0            started           1/1         1G       512M   solace-broker.YOUR-SYSTEM-DOMAIN

    OK
    ```

    <p class="note"><strong>Note:</strong> Take note of the application name. In this case 'solace-broker-1.2.0'.</p>

1. Watch the logs of the Solace PubSub+ Service Broker:

    ```
    $ cf logs solace-broker-1.2.0 | tee saved_solace-broker-1.2.0.txt
    Retrieving logs for app solace-broker-1.2.0 in org solace /space solace-broker as user@example.com...
    ....
    ```
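
The app name lookup can be scripted instead of copied by hand. The following is a minimal sketch that assumes the app name is the first column of the `cf apps` output, as in the listing above, and that it starts with `solace-broker`. The prefix is an assumption and can be overridden, since the app name changes between releases.

```shell
# Print the names of apps whose name starts with PREFIX.
# Reads 'cf apps' output on stdin; the app name is the first column.
# Usage: cf apps | broker_app_names [PREFIX]
broker_app_names() {
  awk -v prefix="${1:-solace-broker}" 'index($1, prefix) == 1 { print $1 }'
}

# Example: show recent broker log lines that mention an exception.
#   app="$(cf apps | broker_app_names | head -n 1)"
#   cf logs "$app" --recent | grep -i 'exception'
```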

### <a id='operator_diag'></a> Additional Diagnostics

In addition to regular logging, Solace PubSub+ Event Broker tools can gather diagnostic logs on demand.

For example, having identified a problem with a given Solace service instance, the operator can access the backing Solace PubSub+ Event Brokers to examine logs and gather diagnostics.

The backing Solace PubSub+ Event Brokers for a given service instance can be identified from the IP addresses used in the bindings or service keys, or from the IP addresses mentioned in the service broker logs or in the error message when an exception occurs.

The operator should look for the VM with the matching IP address in the Solace PubSub+ BOSH deployment.
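
One way to do this lookup is to filter the `bosh vms` listing. This sketch assumes the default table output, where the instance name is the first whitespace-separated column and the IP address appears as its own column. It compares whole fields so that 10.244.0.3 does not also match 10.244.0.30.

```shell
# Print the instance name (first column) of every row that lists the given IP.
# Reads 'bosh vms' table output on stdin.
# Usage: bosh -d DEPLOYMENT vms | vm_for_ip IP
vm_for_ip() {
  awk -v ip="$1" '{
    for (i = 2; i <= NF; i++) {
      if ($i == ip) { print $1; next }
    }
  }'
}

# Example (deployment name taken from the errand example in this topic):
#   bosh -d solace_deployment vms | vm_for_ip 10.244.0.3
```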

Get additional diagnostics from "EnterpriseLarge/0" VM.

```
$ bosh ssh EnterpriseLarge/0
# sudo -i
# /var/vcap/jobs/broker_agent/bin/gather_diagnostics.sh
```

Then you can look at the gathered logs under /var/vcap/store/diag.

In the event that a process fails and creates a core dump, the core file can be found under /var/vcap/store/cores. Core files are compressed using gzip.

Installation Issues

If you get the error:

  Colocated job 'solace-bosh-dns-aliases' is already added to the instance group 'compute'

Then do the following:

1. Log in to your BOSH environment.
1. Run the `bosh configs` command.
1. Look for runtime configs with "solace" in their name.
1. Delete the Solace runtime configs using the `bosh delete-config` command.
1. Redeploy.
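
The lookup and deletion steps above can be sketched as a filter over the `bosh configs` listing. It assumes the table columns are ID, Type, and Name, in that order. The function only prints the `bosh delete-config` commands so that you can review them before running anything.

```shell
# Read 'bosh configs' table output on stdin and print one delete command per
# runtime config whose name contains "solace". Nothing is deleted here.
# Assumes the columns are: ID, Type, Name, ...
solace_runtime_config_cleanup() {
  awk '$2 == "runtime" && tolower($3) ~ /solace/ {
    printf "bosh delete-config --type=runtime --name=%s\n", $3
  }'
}

# Example:
#   bosh configs | solace_runtime_config_cleanup
```

Printing the commands instead of executing them is deliberate: removing a runtime config affects every deployment it applies to, so a manual review step is worth keeping.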

Upgrade Issues

The following issues may arise while upgrading a tile:

Error	Action Failed get_task: result: 1 of 1 drain scripts failed. Failed Jobs: containers.
Explanation	A pre-condition was not met. A high-availability (HA) group was not in a healthy state at the start of the upgrade. This check is done at the shutdown of each VM in an HA group.
Possible Action	Ensure that the HA Group is healthy. Log in to the failing VM and look at the log file located at /var/vcap/sys/log/containers/drain.stdout.log to determine the underlying cause. Once the HA Group is healthy, the upgrade can be retried.

Error	Action Failed get_task: result: 1 of 1 post-start scripts failed. Failed Jobs: containers.
Explanation	The upgrade was stopped due to a detected failure to ensure services remain available.
Possible Action	Log in to the failing VM and look at the log file located at /var/vcap/sys/log/containers/post-start.stdout.log to determine the underlying cause. Once the issue is fixed, the upgrade can be retried.

Error	Process 'mariadb_ctrl' Execution failed
Explanation	Monit reports that MariaDB failed to start. This happens when high-availability management is in use and two or more MariaDB processes are shut down at the same time. The cluster has then lost quorum, and the MariaDB processes can’t restart automatically.
Possible Action	The MariaDB cluster needs to be bootstrapped. To do this, run the bootstrap errand with the command: bosh -d solace_deployment run-errand bootstrap. The bootstrap errand is always safe to run because it is a no-op when it is not required. If the bootstrap errand does not fix the MariaDB process, look at the logs under /var/vcap/sys/log/mysql/.
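
The MariaDB recovery can be wrapped in a small helper that runs the errand and reports where to look if it fails. This is only a sketch: the default deployment name is the one shown in the command above and may differ in your environment.

```shell
# Run the bootstrap errand; if it fails, point at the MariaDB logs.
# Usage: bootstrap_mariadb [DEPLOYMENT]
bootstrap_mariadb() {
  local deployment="${1:-solace_deployment}"
  if bosh -d "$deployment" run-errand bootstrap; then
    echo "bootstrap errand finished for $deployment"
  else
    echo "bootstrap errand failed; check the logs under /var/vcap/sys/log/mysql/" >&2
    return 1
  fi
}
```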

Note: For Shared service plans, every service on an event broker must be marked for upgrade before the upgrade can happen. This prevents one tenant from upgrading an event broker before the other tenants are ready.

Note: If some Solace PubSub+ Event Brokers have been upgraded and others have not, it is not possible to create new bindings or service keys with the non-upgraded services. This is because features such as authentication schemes might have changed with the upgrade, and would not be compatible with the services created before the upgrade.

Developer Troubleshooting

A developer using Solace PubSub+ for VMware Tanzu may encounter errors when using the Cloud Foundry Command-Line Interface (cf CLI) to perform basic operations on a Solace PubSub+ for VMware Tanzu service instance.

In general, most errors fall into one of these categories:

Reaching limits (plan limits, inventory fully used)
Communication problems between the service broker and the VM inventory it manages
Unexpected health state, such as a degraded high-availability setup

While this list is not complete, it provides representative samples with explanations and possible resolutions. Some of the resolutions require operator intervention.

Deployment limits reached for a given plan
Operation	cf create-service solace-pubsub enterprise-large
Error	Server error, status code: 502, error code: 10001, message: Service broker error: com.solace.cloudfoundry.servicebroker.exception.SolaceServiceException: No matching Solace Message VPNs available.
Explanation	The service broker does not find any Solace Message VPNs in its inventory for the requested service plan.
Possible Action(s)	The operator needs to increase the number of allocated Solace PubSub+ Event Brokers that support the given plan, in this case enterprise-large.

Invalid Parameter
Operation	cf update-service my-large-instance -c '{"some_option": "some_value" }'
Error	Server error, status code: 502, error code: 10001, message: Service broker error: Unrecognized parameter key some_option
Explanation	The service broker does not recognize the parameter name.
Possible Action(s)	Use the correct parameters. Please see Service Specific Parameters.

Invalid Parameter: The required feature is not enabled (TCP Routes)
Operation	cf update-service my-large-instance -c '{ "mqtt_tcp_route_enabled" : "false" }'
Error	Server error, status code: 502, error code: 10001, message: Service broker error: The parameter mqtt_tcp_route_enabled is invalid given the current configuration. It requires [ TCP Routes Enabled ]
Explanation	As indicated in the error, the given parameter is only valid when TCP Routes is enabled.
Possible Action(s)	See Configuring TCP Routes.

Invalid Parameter: The required feature is not enabled (LDAP)
Operation	cf update-service my-large-instance -c '{ "ldapGroupAdminReadOnly" : "cn=username1,ou=groups,dc=solace,dc=com" }'
Error	Server error, status code: 502, error code: 10001, message: Service broker error: The parameter ldapGroupAdminReadOnly is invalid given the current configuration. It requires [ LDAP Enabled, Management Access set to LDAP ]
Explanation	As indicated in the error, the given parameter is only valid when LDAP is enabled and Management Access is set to LDAP.
Possible Action(s)	See Configuring LDAP and Management Access to use LDAP.

Communication failure: The Solace PubSub+ Event Broker is not reachable
Operation	cf delete-service -f my-large-instance
Error	cf service my-large-instance Service instance: my-large-instance Service: solace-pubsub Bound apps: Tags: Plan: enterprise-large Description: Solace PubSub+ Event Broker for real-time, multi-protocol data distribution Documentation url: http://docs.solace.com Dashboard: https://enterprise-large-0.YOUR-SYSTEM.DOMAIN/#/msg-vpns/djAwNQ==?token= YWJj.eyJhY2Nlc3NfdG9rZW4iOiJ2MDA1LWlaksjdlasdjas09dasdlkansdlakslZmRmOWFlNjM3ZGYwMDcifQ%3D%3D.eHl6 Last Operation Status: delete failed Message: com.solace.cloudfoundry.servicebroker.exception.SolaceServiceException: Unable to delete Service, the associated Message VPN v001 on 10.244.0.3 is not currently available Started: 2020-01-00T00:00:00Z Updated: 2020-01-00T00:00:00Z
Explanation	The service broker cannot delete this instance because the Message VPN is flagged as unavailable. This happens when the backing Solace PubSub+ Event Broker is not reachable.
Possible Action(s)	The operator should examine the solace deployment and see why VM 10.244.0.3 is not available. The service delete operation can be reattempted once the VM health is restored.
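
To pull just the failure details out of the `cf service` output, a filter like the following may help. It is a sketch that assumes the status and message of the last operation are printed on lines starting with `Status:` and `Message:`. The match is case-insensitive because the capitalization of these labels may differ between cf CLI versions.

```shell
# Print the status and message lines of the last operation.
# Reads 'cf service NAME' output on stdin.
last_operation_summary() {
  awk '{
    line = $0
    sub(/^[ \t]+/, "", line)          # ignore leading indentation
    lower = tolower(line)
    if (index(lower, "status:") == 1 || index(lower, "message:") == 1) print line
  }'
}

# Example:
#   cf service my-large-instance | last_operation_summary
```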

HA service degradation.
Operation	cf bind-service my-app my-ha-instance
Error	MessageRouterException: Primary VMR 10.233.0.3 HA Group Status is degraded v001 for messageVpn v001
Explanation	The service broker will always reject an operation for an HA service when the HA status is degraded. A variety of reasons may be given in the message. Other operations on the same Message Router will fail as well.
Possible Action(s)	The operator should examine the Solace deployment and see why the HA group of VM 10.233.0.3 is degraded. The CF operation can be reattempted once the VM health is restored and the HA status is no longer degraded.

Orphaned Resource Policy
Operation	cf unbind-service my-app my-service-instance
Error	Server error, status code: 502, error code: 10001, message: Service instance test-shared: Service broker error: Operation canceled. Orphaned Endpoints Policy was violated. Endpoints owned by v002.cu000001 must first be deleted: Queues: [ someQ ] Topics Endpoints: [ someTPE ]
Explanation	The service broker rejects unbinding when the client username that was used by the application owns Endpoints, such as Queues and Topic Endpoints, and the current Orphaned Resource Policy is set to Abort.
Possible Action(s)	Delete the Endpoints before unbinding. Alternatively you can adjust the Service Orphaned Resource Policy.

Standby Replication VPN Mode
Operation	cf update-service my-app -c '{"vpnOptions":{...}}'; cf bind-service my-app my-service-instance (if LDAP is not the authentication scheme); cf unbind-service my-app my-service-instance (if LDAP is not the authentication scheme); Client Profile management in the Solace Service Dashboard
Error	The operation [...] is not supported on message-vpn v001 with its current configuration: replication admin-state [enabled] and config-state [standby]. Management operations may only be run on the 'active' replication group
Explanation	If a service’s message VPN has been configured to be part of a replication group and its configuration state is set to 'standby', all unsupported configuration operations are rejected. This includes creating or deleting client usernames, updating VPN options through cf update-service, or configuring client profiles through the service dashboard.
Possible Action(s)	Configuration is still allowed on the 'active' member of the replication group and is synchronized between the two members by config sync. Managing bindings is not supported when the replication configuration is set to standby and internal authentication is in use.

Dashboard Troubleshooting

Dashboard 403 Error using VMware Tanzu SSO
Operation	Accessing the service dashboard with Single Sign On
Error	On the dashboard’s webpage, 403: Forbidden is shown.
Explanation	The user’s permissions are determined through a call to the CF API controller with the given service and user’s access token. CF then tells the dashboard whether the user has read and manage permissions.
Possible Action(s)	Verify that the account used to access the dashboard has permission to read or manage the service instance by checking its organization and space roles in Cloud Foundry. Note: Users with 'admin' privileges still require an appropriate assigned role in the org and space to view service instances, and specifically require the SpaceDeveloper permission to manage them. This is a limitation of the VMware Tanzu/Cloud Foundry UAA used in Single Sign On. If Single Sign On access for Solace PubSub+ was revoked through the Third Party Access pane on the VMware Tanzu User Management page, log out of the dashboard by clicking the user icon in the top right corner and clicking logout, then retry accessing the dashboard through the dashboard URL.
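
The space role check can be scripted against the output of `cf space-users ORG SPACE`. This sketch assumes that the command prints each role as an upper-case heading line, such as `SPACE DEVELOPER`, followed by indented user names. Verify this against your cf CLI version before relying on it.

```shell
# Exit 0 if USER is listed under the "SPACE DEVELOPER" heading, 1 otherwise.
# Reads 'cf space-users ORG SPACE' output on stdin.
# Usage: cf space-users ORG SPACE | has_space_developer USER
has_space_developer() {
  awk -v user="$1" '
    { sub(/[ \t]+$/, "") }                       # drop trailing whitespace
    /^[A-Z][A-Z ]*$/ { section = $0; next }      # role headings are upper-case lines
    section == "SPACE DEVELOPER" && $1 == user { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Example (use the org and space that contain the service instance):
#   cf space-users my-org my-space | has_space_developer user@example.com \
#     && echo "user can manage service instances in this space"
```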

Solace PubSub+ Broker Manager Redirected to Port 943
Operation	Accessing PubSub+ Broker Manager
Error	The page never loads and the URL contains the port 943.
Explanation	If the PubSub+ Broker Manager was accessed during an upgrade to v2.15.0, it is possible that the browser got redirected to port 943.
Possible Action(s)	Wait for the upgrade to complete, then empty the browser’s cache and try again. The webpage should no longer be improperly redirected.

Advanced Asset Management

In cases where a service or event broker can no longer be managed by regular management operations, the instance in question can be purged. This is done by accessing the Service Dashboard as an administrator at the solace-pubsub-broker-2.x.x app’s registered route. When a service is purged, its associated Cloud Foundry service and bindings are not deleted, and the underlying resources may or may not be freed. When an event broker is purged, the underlying deployment (either on demand or operator allocated) may not be deleted, resulting in orphaned VMs that require additional cleanup operations. Services or event brokers should only be purged as a last resort.

Service no longer exists in Cloud Foundry but still appears in administrator Service Dashboard
Operation	Viewing the list of services in the Solace PubSub+ Service Dashboard.
Error	A service is listed without a corresponding name from Cloud Foundry, and no such service is expected.
Explanation	If a service is removed from Cloud Foundry without Cloud Foundry notifying the Solace Service Broker, as is the case with `cf purge-service-instance`, the service may be orphaned. This orphaned service will continue to consume Event Broker resources such as VPNs.
Possible Action(s)	The service instance in question can be purged from Advanced > Purge Service on the home page of the Solace PubSub+ Service Dashboard when accessed by an administrator. Note: Purging service instances is a nonrecoverable operation. Only proceed if the service is confirmed missing in Cloud Foundry.

Cannot manage enterprise evaluation services after upgrade to 2.16.0+ when using Solace PubSub+ for VMware Tanzu Enterprise Evaluation Edition
Operation	Managing an enterprise evaluation service, e.g., creating bindings.
Error	All operations are rejected indicating that the underlying broker is not as expected with an error similar to Service broker error: Will not bind service instance because the VMR is not expected: VMR 10.0.4.11 Expected Edition:'Solace PubSub+ Enterprise Evaluation Edition' Actual:'Solace PubSub+ Standard'
Explanation	Starting with 2.16.0, upgrades from Enterprise Evaluation Edition brokers are no longer supported. Prior to an upgrade to 2.16, all Enterprise Evaluation Edition service instances and configured event broker resources should be removed.
Possible Action(s)	If Enterprise Evaluation Edition brokers and services remain after an upgrade, they can be removed through the Advanced > Purge Broker section on the home page of the Solace PubSub+ Service Dashboard when accessed by an administrator. The underlying deployments should be manually deleted by setting the number of instances in the Resource Config section to 0 and by deleting any relevant On Demand instances. Note: Only Solace PubSub+ Enterprise Evaluation Edition users are affected. Solace PubSub+ Enterprise Edition upgrades are still supported.
