Jaeger: Distributed Tracing

Key Insights

  • Distributed tracing solves the observability gap in microservices by connecting request flows across service boundaries, making it possible to debug issues that span multiple systems in seconds instead of hours.
  • Jaeger uses a span-based data model where each operation creates a span with timing and metadata, linked together via trace IDs that propagate through HTTP headers or message queues between services.
  • Production deployments require careful sampling strategies—trace every request in development but use adaptive sampling in production to balance observability with performance overhead.

The Observability Problem in Distributed Systems

When you have a monolithic application, debugging is straightforward. You check the logs, maybe set a breakpoint, and follow the execution path. But microservices architectures shatter this simplicity. A single user request might touch a dozen services, each with its own logs, metrics, and failure modes.

Traditional logging fails here because it’s local to each service. You might see “Database query took 2 seconds” in service A’s logs and “Request timeout” in service B’s logs, but connecting these events requires manual correlation, timestamps that might not align, and a lot of grep. This doesn’t scale when you’re handling thousands of requests per second.

Distributed tracing solves this by assigning each request a unique trace ID that follows it through your entire system. Every operation creates a span—a record of work done—and these spans link together to form a complete picture of what happened.
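
A trace ID is simply a random 128-bit identifier (span IDs are 64-bit), so any service can mint one with the standard library. A sketch of the idea—illustrative, not the OpenTelemetry generator:

```python
import secrets

def new_trace_id() -> str:
    """128-bit trace ID, rendered as 32 lowercase hex characters."""
    return secrets.token_hex(16)

def new_span_id() -> str:
    """64-bit span ID, rendered as 16 lowercase hex characters."""
    return secrets.token_hex(8)

print(new_trace_id())  # e.g. 4bf92f3577b34da6a3ce929d0e0e4736
```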

Jaeger Architecture and Core Concepts

Jaeger is an open-source distributed tracing platform originally built by Uber. It consists of several components that work together:

  • Jaeger Agent: A daemon that runs alongside your application, receiving spans over UDP and batching them for the collector
  • Jaeger Collector: Receives spans from agents, validates them, and writes to storage
  • Storage Backend: Typically Elasticsearch, Cassandra, or Kafka for persistence
  • Jaeger Query: Service that retrieves traces from storage
  • Jaeger UI: Web interface for visualizing and analyzing traces

The data model is hierarchical. A trace represents an end-to-end request journey. Each trace contains multiple spans, which represent individual operations. Spans have parent-child relationships forming a tree structure.

Here’s what a trace looks like in JSON format:

{
  "traceID": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "spanID": "1a2b3c4d5e6f7a8b",
      "operationName": "HTTP GET /api/users",
      "startTime": 1609459200000000,
      "duration": 245000,
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users",
        "http.status_code": 200
      },
      "references": []
    },
    {
      "spanID": "2b3c4d5e6f7c8d9e",
      "operationName": "SQL SELECT users",
      "startTime": 1609459200050000,
      "duration": 180000,
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE active = true"
      },
      "references": [
        {
          "refType": "CHILD_OF",
          "traceID": "4bf92f3577b34da6a3ce929d0e0e4736",
          "spanID": "1a2b3c4d5e6f7a8b"
        }
        }
      ]
    }
  ]
}

The references field links spans together. Tags provide metadata about the operation—HTTP status codes, database queries, error information, or custom business context.
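
The CHILD_OF references are enough to reconstruct the span tree. A stdlib-only sketch (build_tree is an illustrative helper, not a Jaeger API):

```python
def build_tree(spans: list[dict]) -> dict:
    """Map each parent span ID (None for roots) to a list of child span IDs."""
    tree: dict = {}
    for s in spans:
        parents = [r["spanID"] for r in s.get("references", [])
                   if r["refType"] == "CHILD_OF"]
        parent = parents[0] if parents else None  # root span has no parent
        tree.setdefault(parent, []).append(s["spanID"])
    return tree

spans = [
    {"spanID": "a1", "references": []},
    {"spanID": "b2", "references": [{"refType": "CHILD_OF", "spanID": "a1"}]},
]
print(build_tree(spans))  # {None: ['a1'], 'a1': ['b2']}
```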

Instrumenting Your Application

Modern instrumentation uses OpenTelemetry, the industry standard that replaced Jaeger’s native clients. OpenTelemetry provides automatic instrumentation for common frameworks and libraries, plus APIs for custom spans.

Here’s a Python Flask application with tracing:

from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask and requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route('/api/order/<order_id>')
def get_order(order_id):
    # Automatic span created by Flask instrumentation
    
    # Create custom span for business logic
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        
        # This request is automatically traced
        user_response = requests.get(f"http://user-service/api/users/{order_id}")
        
        # Add custom tags
        span.set_attribute("user.found", user_response.status_code == 200)
        
        if user_response.status_code != 200:
            span.set_attribute("error", True)
            span.add_event("User not found")
            return {"error": "User not found"}, 404
    
    return {"order_id": order_id, "status": "confirmed"}

The magic happens in the HTTP headers. When your service makes an outbound request, OpenTelemetry automatically injects trace context:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-1a2b3c4d5e6f7a8b-01
tracestate: vendor1=value1,vendor2=value2

The receiving service extracts this context and creates child spans, maintaining the trace continuity across service boundaries.
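
Concretely, the traceparent value packs four hyphen-separated fields defined by the W3C Trace Context spec. A stdlib-only parser sketch (parse_traceparent is an illustrative helper, not part of OpenTelemetry):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-traceid-parentid-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,            # "00" is the only version defined so far
        "trace_id": trace_id,          # 32 hex chars = 128 bits
        "parent_span_id": parent_id,   # 16 hex chars = 64 bits
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0: sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-1a2b3c4d5e6f7a8b-01")
print(ctx["sampled"])  # True
```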

For Go services, the pattern is similar:

package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Initialize the Jaeger exporter (defaults to the agent on localhost:6831)
    exp, err := jaeger.New(jaeger.WithAgentEndpoint())
    if err != nil {
        log.Fatal(err)
    }
    tp := trace.NewTracerProvider(trace.WithBatcher(exp))
    otel.SetTracerProvider(tp)
    
    // Wrap HTTP handler with instrumentation
    handler := http.HandlerFunc(orderHandler)
    http.Handle("/api/order", otelhttp.NewHandler(handler, "order-handler"))
    
    http.ListenAndServe(":8080", nil)
}

func orderHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("order-service")
    
    // Create custom span
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("order.type", "express"),
        attribute.Int("order.items", 3),
    )
    
    // Use ctx for downstream calls so the trace context propagates
    processPayment(ctx)
}

func processPayment(ctx context.Context) {
    // Starting a span from ctx makes it a child of process_order
    _, span := otel.Tracer("order-service").Start(ctx, "process_payment")
    defer span.End()
}

Deploying Jaeger

For local development, use the all-in-one Docker image:

# docker-compose.yml
version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp"  # Agent UDP
      - "16686:16686"    # UI
      - "14268:14268"    # Collector HTTP
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true

Run docker-compose up and access the UI at http://localhost:16686.

Production deployments need separate components for scalability. Here’s a Kubernetes deployment with Elasticsearch backend:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
      - name: jaeger-collector
        image: jaegertracing/jaeger-collector:latest
        env:
        - name: SPAN_STORAGE_TYPE
          value: elasticsearch
        - name: ES_SERVER_URLS
          value: http://elasticsearch:9200
        - name: COLLECTOR_OTLP_ENABLED
          value: "true"
        ports:
        - containerPort: 14268  # HTTP span ingest
        - containerPort: 14250  # gRPC from agents
        - containerPort: 4317   # OTLP gRPC
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: jaeger-agent
spec:
  selector:
    matchLabels:
      app: jaeger-agent
  template:
    metadata:
      labels:
        app: jaeger-agent
    spec:
      containers:
      - name: jaeger-agent
        image: jaegertracing/jaeger-agent:latest
        args:
        - --reporter.grpc.host-port=jaeger-collector:14250
        ports:
        - containerPort: 6831
          protocol: UDP

The DaemonSet ensures every node runs an agent, minimizing network hops for span collection.

Analyzing Traces in the Jaeger UI

The Jaeger UI is where tracing pays off. Search for traces by service, operation, tags, or time range. When you find a slow request, the trace view shows exactly where time was spent.

A typical debugging session: You notice your API response times spiked. Search for traces of the slow endpoint, filter by duration > 1 second. Open a trace and you see:

  • HTTP GET /api/dashboard: 2.3s total
    • Get user profile: 50ms
    • Get recent orders: 2.1s
      • SQL query orders: 2.0s (⚠️ This is the problem)
      • Process results: 100ms
    • Render response: 150ms

The UI highlights the slow span in red. Click it to see tags: db.statement: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 100. No index on created_at. Problem identified in 30 seconds.

The service dependency graph shows which services call each other, helping you understand system architecture and identify chatty relationships that cause latency.

Best Practices and Performance Considerations

Sampling is critical. Tracing every request in production creates significant overhead, so configure per-service strategies (or Jaeger's adaptive sampling) instead. The collector reads a JSON strategies file—probabilistic samples a fraction of traces, ratelimiting caps traces per second:

# sampling-strategies.json, passed to the collector via --sampling.strategies-file
{
  "service_strategies": [
    {"service": "checkout-service", "type": "probabilistic", "param": 0.1},
    {"service": "auth-service", "type": "ratelimiting", "param": 10},
    {"service": "critical-payment-service", "type": "probabilistic", "param": 1.0}
  ]
}
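
The ratelimiting strategy behaves like a token bucket. A sketch of the idea (illustrative; not Jaeger's actual sampler implementation):

```python
import time

class RateLimitingSampler:
    """Allow at most max_per_second sampled traces, token-bucket style."""

    def __init__(self, max_per_second: float):
        self.max_per_second = max_per_second
        self.tokens = max_per_second       # start with a full bucket
        self.last = time.monotonic()

    def should_sample(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket size
        self.tokens = min(self.max_per_second,
                          self.tokens + (now - self.last) * self.max_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

sampler = RateLimitingSampler(2)
print([sampler.should_sample() for _ in range(5)])  # [True, True, False, False, False]
```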

In OpenTelemetry, the sampling decision is made once, when the root span is created, by the Sampler configured on the TracerProvider—a span's trace flags are immutable afterwards. To guarantee traces for high-value paths, configure the sampler up front:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces, but always honor an upstream decision,
# so a sampled request stays sampled across every service it touches
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

Tag naming conventions matter for searchability. Use structured tags:

  • http.method, http.status_code, http.url (standard semantic conventions)
  • user.id, order.id (business identifiers)
  • error: true, error.message (error tracking)

Avoid high-cardinality values—timestamps, UUIDs, raw payloads—in tags you intend to search on; every distinct value grows the storage index. Record them as span events (logs, in Jaeger terms) instead, which travel with the span but aren't indexed.

Minimize overhead by keeping span creation lightweight. Don’t create spans for trivial operations like variable assignments. Focus on I/O boundaries: HTTP calls, database queries, cache operations, message queue interactions.

Integrate tracing with metrics and logs. Add trace IDs to log entries so you can jump from a trace to detailed logs:

import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

# Add trace context to logs (trace_id is an int; render it as 32-char hex)
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, "032x")
logger.info("Processing order", extra={"trace_id": trace_id})

Distributed tracing transforms microservices debugging from archaeology to precision surgery. Jaeger gives you the tools to understand your system’s behavior in production, find bottlenecks quickly, and build confidence in your architecture’s performance characteristics.
