Distributed Tracing

Distributed tracing is a method used to monitor applications and diagnose performance issues in microservices architectures. It provides a way to track a request's path through various services that compose a distributed system, helping developers understand how requests are processed and where delays or failures occur.

Concepts

  • Trace: A single trace represents the entire journey of a request as it travels through the services in a system.
  • Span: Each trace is made up of multiple spans, where each span represents a specific unit of work or segment of the request within a service. Spans record their parent span, which is how the tree structure of a trace is reconstructed.
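A minimal sketch of how these two concepts might be modeled. The class and function names here are illustrative, not taken from any particular tracing library:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    # A span covers one unit of work within a single service.
    trace_id: str              # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]   # links this span to its caller's span
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a new span; reuse the parent's trace_id or begin a new trace."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex, parent_id, name)

# A request entering service A, which then triggers work in service B:
root = start_span("GET /checkout")
child = start_span("charge-card", parent=root)
assert child.trace_id == root.trace_id   # both spans belong to one trace
assert child.parent_id == root.span_id   # parent/child link within the trace
```

The parent/child links between spans are what allow a tracing backend to reassemble the full request tree from individually reported spans.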

Instrumentation

The effectiveness of distributed tracing depends heavily on the ability to modify or extend the communication protocols used by services to carry trace and span metadata.

  • HTTP: Distributed tracing often utilizes HTTP headers to propagate trace and span identifiers.
  • Kafka: Producers add trace identifiers and span information to the messages’ metadata. When consumers process these messages, they extract this metadata, creating new spans linked to the original trace.
  • Relational Databases: Database wire protocols generally offer no place to carry trace metadata, so the usual approach is to instrument the database client libraries used by the application and record each query as a span.
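For HTTP, a widely used convention is the W3C Trace Context `traceparent` header, whose value has the layout `version-trace_id-span_id-flags`. A hedged sketch of injecting and extracting it; the helper functions below are illustrative, not any library's API:

```python
import re
import uuid

# W3C Trace Context: version "00", 32-hex trace id, 16-hex span id, 2-hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Add the trace context to an outgoing request's headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    """Parse the trace context from an incoming request, if present."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None  # no valid context; the receiver starts a new trace
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, flags == "01"

# Service A injects before calling out; service B extracts on receipt:
headers = {}
inject(headers, uuid.uuid4().hex, uuid.uuid4().hex[:16])
ctx = extract(headers)
assert ctx is not None and ctx[2] is True
```

The extracted trace id and parent span id become the starting point for the spans the receiving service creates, which is what stitches both services into one trace.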

If a protocol provides no means of embedding tracing information (custom headers, message metadata, or similar), as is the case with MQTT 3.1, it becomes difficult to propagate trace context across service boundaries using that protocol.
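A common workaround for such protocols is to embed the trace context in the message payload itself, for example via a JSON envelope. A minimal sketch; the envelope shape here is an assumption, not a standard, and it requires producer and consumer to agree on the format:

```python
import json

def wrap(payload: dict, trace_id: str, span_id: str) -> bytes:
    """Wrap the application payload in an envelope carrying the trace context."""
    envelope = {
        "traceparent": f"00-{trace_id}-{span_id}-01",  # reuse the W3C layout
        "body": payload,
    }
    return json.dumps(envelope).encode()

def unwrap(raw: bytes):
    """Split a received message back into trace context and payload."""
    envelope = json.loads(raw)
    return envelope.get("traceparent"), envelope["body"]

msg = wrap({"temp": 21.5}, "a" * 32, "b" * 16)
ctx, body = unwrap(msg)
assert body == {"temp": 21.5}
assert ctx.startswith("00-")
```

The trade-off is that tracing concerns leak into the application's payload format, so this is usually reserved for protocols that offer no metadata channel at all.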

Performance Impact and Overhead

  • Data Propagation: As a trace progresses through the system, metadata such as trace IDs, span IDs, and potentially other contextual data need to be propagated with each request. This can increase the size of HTTP headers or message payloads.
  • Resource Consumption: Every component that participates in tracing must allocate resources (CPU, memory) to generate, store, and transmit trace data.
  • Latency: Instrumentation can introduce latency into service operations. Capturing trace data, especially if it involves synchronous operations like logging to a file system or network calls to send data to a tracing system, can delay request processing.
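To keep that latency off the request path, tracing libraries typically buffer finished spans in memory and export them from a background thread. A minimal sketch of the idea (the names and in-memory "export" are illustrative, not any particular library's API):

```python
import queue
import threading

span_queue = queue.Queue(maxsize=1000)  # bounded: tracing must not exhaust memory
exported = []                           # stand-in for a remote tracing backend

def record_span(span: dict) -> None:
    """Called on the request path: enqueue and return immediately."""
    try:
        span_queue.put_nowait(span)
    except queue.Full:
        pass  # drop the span rather than block the request

def export_worker() -> None:
    """Background thread: drain the queue and ship spans off the hot path."""
    while True:
        span = span_queue.get()
        if span is None:       # shutdown sentinel
            break
        exported.append(span)  # in practice, a batched network call

worker = threading.Thread(target=export_worker, daemon=True)
worker.start()
record_span({"name": "db.query", "duration_ms": 12})
span_queue.put(None)  # flush and stop the worker
worker.join()
assert exported == [{"name": "db.query", "duration_ms": 12}]
```

Dropping spans when the buffer is full is a deliberate design choice: losing trace data is preferable to adding back-pressure on live requests.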

Sampling Strategies

In performance-critical scenarios, employing sampling in distributed tracing is a common practice to reduce the impact on system performance.

  • Random Sampling: This involves tracing a random percentage of requests. It reduces overhead but can miss critical issues as it lacks context on which traces might be more important.
  • Adaptive Sampling: Adjusts the sampling rate based on current system load or other metrics. For example, during low traffic, a higher sampling rate might be used, while during peak traffic, the rate decreases.
  • Priority Sampling: Traces deemed more likely to contain valuable information (e.g., requests that result in errors, have high latency, or are of particular business relevance) are selected with higher priority.
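Each strategy reduces to a small decision function run when a trace begins. A hedged sketch of all three; the thresholds and signals (target rate, 500 ms latency cutoff) are illustrative assumptions:

```python
import random

def random_sampler(rate: float) -> bool:
    """Random sampling: trace a fixed fraction of requests."""
    return random.random() < rate

def adaptive_sampler(current_rps: float, target_traces_per_sec: float = 10.0) -> bool:
    """Adaptive sampling: scale the rate down as load rises so the
    volume of traced requests stays roughly constant."""
    rate = min(1.0, target_traces_per_sec / max(current_rps, 1.0))
    return random.random() < rate

def priority_sampler(is_error: bool, latency_ms: float, base_rate: float = 0.01) -> bool:
    """Priority sampling: always keep errors and slow requests,
    sample everything else at a low base rate."""
    if is_error or latency_ms > 500:
        return True
    return random.random() < base_rate

assert priority_sampler(is_error=True, latency_ms=5) is True    # errors always kept
assert priority_sampler(is_error=False, latency_ms=900) is True  # slow requests kept
```

Note that priority sampling as written needs the outcome (error, latency) before deciding, so in practice it is applied at the end of a request, with span data buffered until the decision is made.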

The sampling decision can be made at the entry point of the invocation chain or independently at any subsequent invocation.

  • Sampling at Entry Point. Pros: Consistency, Reduced Overhead, Simpler Configuration. Cons: Lack of Flexibility, Potential for Information Loss.
  • Sampling Across Services. Pros: Adaptive and Flexible, Targeted Tracing, Better Resource Allocation. Cons: Increased Complexity, Inconsistencies in Traces.