Distributed Tracing Best Practices

To optimize your distributed tracing experience with your PubSub+ event broker and PubSub+ event broker service, use the recommendations and best practices in the following areas:

Traceability
Performance
Operational Enhancement

Traceability

The core value of distributed tracing is the ability to trace your messages of interest along the data path (for troubleshooting, debugging, data lineage, proof of delivery, and other use cases), and analyze this tracing data with the help of an observability tool of your choice.

Accessing Trace Messages in Backends

You can view or search for traced messages, or the spans for traced messages, in your backend application to:

learn about a message's delivery status, delivery time, or other delivery details
debug any errors with the message delivery

Ensure that you publish messages with user properties to make searching for them easier in the backend application. For example, if your application is an online store, order number or customer ID might be useful pieces of data for the publisher to insert into the user properties.
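
As a minimal sketch of this recommendation, the following Python helper collects the identifiers from the online store example into a small set of user properties. The function name and the property keys are hypothetical. The step that attaches the properties to a message is omitted because it depends on the PubSub+ Messaging API that you use.

```python
def build_user_properties(order_number: str, customer_id: str) -> dict:
    """Return searchable key-value pairs to attach to a published message.

    The keys below are illustrative; choose names that match your own data model.
    """
    props = {"order_number": order_number, "customer_id": customer_id}
    # Drop empty values so the backend is not filled with tags that cannot be searched.
    return {key: value for key, value in props.items() if value}


props = build_user_properties("ORD-10042", "CUST-77")
print(props)  # {'order_number': 'ORD-10042', 'customer_id': 'CUST-77'}
```

Keeping the set of properties small and stable makes searches predictable and limits the amount of user property data carried in each message.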

How you view and search for trace messages depends on the backend application that you're using. See the current documentation for your backend application for more information.

For an example of how to configure tracing using Jaeger, see Code Lab for Solace Distributed Tracing and Context Propagation. This code lab shows you how to perform context propagation using auto instrumentation and visualize it in the Jaeger UI. In the Jaeger UI, you can:

select your event broker to view the Solace trace messages, with the option to see more details as needed. For more information, see Viewing Trace Messages in the Jaeger UI.
use the Tags field to search for a message with a specific user property (key-value pair). For more information, see Searching for Trace Messages in the Jaeger UI.

To interpret the trace messages in your backend trace logs, see Distributed Tracing OpenTelemetry Span Fields.

Performance

When you enable distributed tracing for events in PubSub+, the event broker generates additional trace messages. This section describes how to optimize tracing with receive and send spans and manage your data path variables, and provides general performance guidelines.

Optimizing Tracing with Receive and Send Spans

Receive spans and send spans are generated by the event broker as a byproduct of message tracing. The event broker generates:

a receive span each time it receives and persists a message from a publisher on a topic that is being traced.
a send span each time there's an attempt to deliver a message to a consumer.

Receive and send spans are then transported from the event broker to the OpenTelemetry Collector.

Each receive span generates one trace message. This tracing can affect CPU usage, disk space, and network bandwidth on an event broker that is operating close to its limits. Send spans can also affect broker performance. One send span is generated each time the event broker attempts to deliver a message to a consumer, but send spans are grouped into larger trace messages, which reduces the overall number of messages generated by the event broker and helps limit the impact on the event broker's performance.

To optimize tracing with receive and send spans, consider the following:

Minimize the number of receive and send spans generated by the event broker. When enabling distributed tracing, enable it on a few topics at a time to measure and manage its impact on performance.
When sending messages that include a trace context to the event broker, modify the publishing application to reduce the size of the propagated context. To implement this recommendation, you can manage the baggage and trace state (which are optional and potentially large pieces of the trace context), and reduce the amount of user property data in the published messages.
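
The second recommendation can be sketched in Python as follows, assuming the publishing application propagates context as W3C `baggage` and `tracestate` header values (comma-separated members). The helper names, the allow-list, and the member limit are hypothetical, and the sketch only manipulates the header strings; it does not call any PubSub+ or OpenTelemetry API.

```python
def trim_baggage(baggage: str, keep_keys: set) -> str:
    """Keep only the allow-listed entries of a baggage header value."""
    kept = []
    for member in baggage.split(","):
        member = member.strip()
        if not member:
            continue
        # Each member looks like "key=value" (optionally followed by ";properties").
        key = member.split("=", 1)[0].strip()
        if key in keep_keys:
            kept.append(member)
    return ",".join(kept)


def trim_tracestate(tracestate: str, max_members: int) -> str:
    """Keep the first max_members entries of a tracestate header value."""
    members = [m.strip() for m in tracestate.split(",") if m.strip()]
    # Entries are dropped from the right-hand end of the list.
    return ",".join(members[:max_members])


print(trim_baggage("userId=alice, sessionId=abc123,debug=true", {"userId"}))
# userId=alice
print(trim_tracestate("vendor1=a,vendor2=b,vendor3=c", 2))
# vendor1=a,vendor2=b
```

An allow-list is used for baggage so that new, unexpected entries added upstream do not silently increase the size of every published message.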

Distributed tracing has a minimal impact on latency (10-20% for appliances, and less than 10% for software event brokers).

Managing Your Data Path

Some aspects in your architecture can have an impact on your data path performance when enabling distributed tracing. You can use the following considerations to improve performance and manage your architecture:

Message size impacts performance. Distributed tracing has a relatively low impact on the maximum number of messages per second that you can publish to an event broker if your average published message size is 10 KB or larger, but has a considerable impact if your average published message size is 1 KB or smaller.
The performance impact is lower when message fanout is low. For example, a 1:1 ratio is low fanout and means that one published message is sent to one consumer, which is less expensive to trace. The performance impact becomes substantially greater for higher message fanout. For example, a 1:50 ratio is high fanout and means that one published message is sent to 50 consumers. This high ratio negatively impacts the performance of the event broker because each additional message delivery generates an additional send span.
Significantly higher tracing (and guaranteed messaging) performance can be achieved by disabling mate-link encryption for high availability (HA). However, this is not generally recommended for production deployments.
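
To make the fanout consideration concrete, the following Python sketch estimates how many spans the event broker generates per second. It assumes one receive span per traced message and one send span per delivery, with no redelivery attempts. The function name and the example rates are hypothetical.

```python
def estimate_spans_per_second(publish_rate: int, fanout: int) -> dict:
    """Estimate span generation for traced messages at a given fanout."""
    receive = publish_rate        # one receive span per traced message received
    send = publish_rate * fanout  # one send span per delivery (redelivery not counted)
    return {"receive": receive, "send": send, "total": receive + send}


print(estimate_spans_per_second(1000, 1)["total"])   # 2000
print(estimate_spans_per_second(1000, 50)["total"])  # 51000
```

At the same publish rate, moving from 1:1 to 1:50 fanout raises the estimated span rate from 2,000 to 51,000 spans per second, almost all of it from send spans.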

If your event broker is running at under 50% of the maximum guaranteed messaging rate, and your steady-state disk usage is under 50% of max-spool-usage, it is generally safe to enable distributed tracing, particularly if you follow the recommendations in Optimizing Tracing with Receive and Send Spans. If your event broker is running much closer to its maximum messaging capacity, you must be more cautious about the topics on which you enable distributed tracing, and should contact Solace to evaluate your use case and the potential performance impact.
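
The rule of thumb above can be expressed as a simple check. This is a sketch only: the function and parameter names are hypothetical, and you must obtain the current and maximum values from your own broker monitoring.

```python
def has_tracing_headroom(msg_rate: float, max_msg_rate: float,
                         spool_usage: float, max_spool_usage: float,
                         threshold: float = 0.5) -> bool:
    """Return True when both the messaging rate and the spool usage are under the threshold."""
    if max_msg_rate <= 0 or max_spool_usage <= 0:
        raise ValueError("maximum values must be positive")
    rate_ok = msg_rate < threshold * max_msg_rate
    spool_ok = spool_usage < threshold * max_spool_usage
    return rate_ok and spool_ok


print(has_tracing_headroom(40_000, 100_000, 30, 100))  # True
print(has_tracing_headroom(60_000, 100_000, 30, 100))  # False
```

A result of False does not mean that tracing cannot be enabled. It indicates that the broker is closer to its capacity and that the impact should be evaluated before tracing is enabled.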

General Performance Guidelines

The following best practices help you maximize your benefit from using distributed tracing while having the least impact on your data path performance.

Only enable tracing on the message VPNs where it’s needed.
Use trace filters to control which events are traced.
Deploy multiple OpenTelemetry Collectors to service the non-exclusive queue used by distributed tracing. A single event broker can likely generate spans at a higher rate than a single Collector can consume them.
Choose and architect your backend to handle the volume of trace messages generated by the event broker.
If you're using tracing for debugging or troubleshooting, or using it intermittently, pre-configure trace filters with the topics of interest, where each trace filter contains all topics for a particular tracing use case. Only enable specific trace filters as required.
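
As a rough sizing aid for the recommendation to deploy multiple Collectors, the following Python sketch divides an expected span rate by the rate that one Collector can consume. Both rates are hypothetical placeholders; measure them in your own environment.

```python
import math


def collectors_needed(span_rate: float, per_collector_rate: float) -> int:
    """Return the smallest number of Collectors whose combined rate covers span_rate."""
    if per_collector_rate <= 0:
        raise ValueError("per_collector_rate must be positive")
    # Always keep at least one Collector, even when no spans are expected yet.
    return max(1, math.ceil(span_rate / per_collector_rate))


# Example: 51,000 spans per second, 20,000 spans per second per Collector (assumed).
print(collectors_needed(51_000, 20_000))  # 3
```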

Delivery of trace messages may not be guaranteed by the OpenTelemetry Collector or the tracing backends.

Operational Enhancement

To keep your distributed tracing infrastructure in good health, we recommend the following best practices:

As your use of distributed tracing increases (and the number of trace spans increases), the amount of data stored in your backend storage increases. To help you manage these new requirements:
- Manage your storage usage using strategies that align with your organization's internal storage and retention policies. For example, your retention policy may require that you periodically remove unwanted data to reduce retention charges from your observability or storage vendors.
- Monitor the metrics (see OpenTelemetry Metrics and Health Monitoring) or logs (see OpenTelemetry Logging) that are necessary for optimal performance. See Performance for more information.
Perform regular maintenance and manage your storage scaling solutions as you require. For example, you can use automation tools, logging analysis, and so on to identify and address bottlenecks and changes in performance.
New event broker tracing features may require new versions of the OpenTelemetry Collector. Before you upgrade your event broker, see Distributed Tracing Version Compatibility to ensure that your PubSub+ event broker, OpenTelemetry Collector, and PubSub+ Messaging API versions are compatible.
