Jaeger: Distributed Tracing
Key Insights
- Distributed tracing solves the observability gap in microservices by connecting request flows across service boundaries, making it possible to debug issues that span multiple systems in seconds instead of hours.
- Jaeger uses a span-based data model where each operation creates a span with timing and metadata, linked together via trace IDs that propagate through HTTP headers or message queues between services.
- Production deployments require careful sampling strategies—trace every request in development but use adaptive sampling in production to balance observability with performance overhead.
The Observability Problem in Distributed Systems
When you have a monolithic application, debugging is straightforward. You check the logs, maybe set a breakpoint, and follow the execution path. But microservices architectures shatter this simplicity. A single user request might touch a dozen services, each with its own logs, metrics, and failure modes.
Traditional logging fails here because it’s local to each service. You might see “Database query took 2 seconds” in service A’s logs and “Request timeout” in service B’s logs, but connecting these events requires manual correlation, timestamps that might not align, and a lot of grep. This doesn’t scale when you’re handling thousands of requests per second.
Distributed tracing solves this by assigning each request a unique trace ID that follows it through your entire system. Every operation creates a span—a record of work done—and these spans link together to form a complete picture of what happened.
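Concretely, a trace ID is just a random 128-bit value and a span ID a random 64-bit value, each rendered as lowercase hex (the W3C Trace Context convention). Generating them takes two lines; a sketch:

```python
import secrets

def new_trace_id() -> str:
    # 16 random bytes -> 32 lowercase hex characters (W3C Trace Context format)
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 8 random bytes -> 16 hex characters
    return secrets.token_hex(8)

print(len(new_trace_id()))  # 32
print(len(new_span_id()))   # 16
```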
Jaeger Architecture and Core Concepts
Jaeger is an open-source distributed tracing platform originally built at Uber and now a CNCF project. It consists of several components that work together:
- Jaeger Agent: A daemon that runs alongside your application, receiving spans over UDP and batching them for the collector
- Jaeger Collector: Receives spans from agents, validates them, and writes to storage
- Storage Backend: Typically Elasticsearch, Cassandra, or Kafka for persistence
- Jaeger Query: Service that retrieves traces from storage
- Jaeger UI: Web interface for visualizing and analyzing traces
The data model is hierarchical. A trace represents an end-to-end request journey. Each trace contains multiple spans, which represent individual operations. Spans have parent-child relationships forming a tree structure.
Here’s what a trace looks like in JSON format:
```json
{
  "traceID": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "spanID": "00f067aa0ba902b7",
      "operationName": "HTTP GET /api/users",
      "startTime": 1609459200000000,
      "duration": 245000,
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users",
        "http.status_code": 200
      },
      "references": []
    },
    {
      "spanID": "b9c7c989f97918e1",
      "operationName": "SQL SELECT users",
      "startTime": 1609459200050000,
      "duration": 180000,
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE active = true"
      },
      "references": [
        {
          "refType": "CHILD_OF",
          "traceID": "4bf92f3577b34da6a3ce929d0e0e4736",
          "spanID": "00f067aa0ba902b7"
        }
      ]
    }
  ]
}
```
The references field links spans together. Tags provide metadata about the operation—HTTP status codes, database queries, error information, or custom business context.
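Those CHILD_OF references are exactly what the UI uses to rebuild the call tree. A minimal sketch of the grouping step, over hypothetical span dicts shaped like the JSON above:

```python
def build_tree(spans):
    """Group spans by parent span ID using their CHILD_OF references."""
    children = {}
    roots = []
    for span in spans:
        refs = [r for r in span.get("references", []) if r["refType"] == "CHILD_OF"]
        if refs:
            children.setdefault(refs[0]["spanID"], []).append(span)
        else:
            roots.append(span)  # no parent reference -> root of the trace
    return roots, children

spans = [
    {"spanID": "aaaa", "operationName": "HTTP GET", "references": []},
    {"spanID": "bbbb", "operationName": "SQL SELECT",
     "references": [{"refType": "CHILD_OF", "spanID": "aaaa"}]},
]
roots, children = build_tree(spans)
print(roots[0]["operationName"])             # HTTP GET
print(children["aaaa"][0]["operationName"])  # SQL SELECT
```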
Instrumenting Your Application
Modern instrumentation uses OpenTelemetry, the industry standard that replaced Jaeger’s native clients. OpenTelemetry provides automatic instrumentation for common frameworks and libraries, plus APIs for custom spans.
Here’s a Python Flask application with tracing:
```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route('/api/order/<order_id>')
def get_order(order_id):
    # A server span is created automatically by the Flask instrumentation.
    # Create a custom child span for business logic:
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)

        # This outbound request is traced automatically
        user_response = requests.get(f"http://user-service/api/users/{order_id}")

        # Add custom tags
        span.set_attribute("user.found", user_response.status_code == 200)

        if user_response.status_code != 200:
            span.set_attribute("error", True)
            span.add_event("User not found")
            return {"error": "User not found"}, 404

    return {"order_id": order_id, "status": "confirmed"}
```
The magic happens in the HTTP headers. When your service makes an outbound request, OpenTelemetry automatically injects trace context:
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor1=value1,vendor2=value2
```
The receiving service extracts this context and creates child spans, maintaining the trace continuity across service boundaries.
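The traceparent value follows the W3C Trace Context layout: version, trace ID, parent span ID, and flags, joined by dashes. Parsing it is mechanical; a sketch:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,           # 32 hex chars (128-bit trace ID)
        "parent_span_id": parent_id,    # 16 hex chars (64-bit span ID)
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit = sampled flag
    }

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(fields["sampled"])   # True
```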
For Go services, the pattern is similar:
```go
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Initialize the Jaeger exporter
	exp, err := jaeger.New(jaeger.WithAgentEndpoint())
	if err != nil {
		log.Fatal(err)
	}
	tp := trace.NewTracerProvider(trace.WithBatcher(exp))
	otel.SetTracerProvider(tp)

	// Wrap the HTTP handler with instrumentation
	handler := http.HandlerFunc(orderHandler)
	http.Handle("/api/order", otelhttp.NewHandler(handler, "order-handler"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func orderHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("order-service")

	// Create a custom span
	ctx, span := tracer.Start(ctx, "process_order")
	defer span.End()

	span.SetAttributes(
		attribute.String("order.type", "express"),
		attribute.Int("order.items", 3),
	)

	// Use ctx for downstream calls to propagate the trace
	processPayment(ctx)
}

func processPayment(ctx context.Context) {} // stub for illustration
```
Deploying Jaeger
For local development, use the all-in-one Docker image:
```yaml
# docker-compose.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp"   # Agent UDP
      - "16686:16686"     # UI
      - "14268:14268"     # Collector HTTP
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true
```
Run docker-compose up and access the UI at http://localhost:16686.
Production deployments need separate components for scalability. Here’s a Kubernetes deployment with Elasticsearch backend:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:latest
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 14268
            - containerPort: 4317   # OTLP gRPC
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: jaeger-agent
spec:
  selector:
    matchLabels:
      app: jaeger-agent
  template:
    metadata:
      labels:
        app: jaeger-agent
    spec:
      containers:
        - name: jaeger-agent
          image: jaegertracing/jaeger-agent:latest
          args:
            - --reporter.grpc.host-port=jaeger-collector:14250
          ports:
            - containerPort: 6831
              protocol: UDP
```
The DaemonSet ensures every node runs an agent, minimizing network hops for span collection.
Analyzing Traces in the Jaeger UI
The Jaeger UI is where tracing pays off. Search for traces by service, operation, tags, or time range. When you find a slow request, the trace view shows exactly where time was spent.
A typical debugging session: you notice your API response times have spiked. Search for traces of the slow endpoint, filtered by duration > 1 second. Open a trace and you see:
- HTTP GET /api/dashboard: 2.3s total
  - Get user profile: 50ms
  - Get recent orders: 2.1s
    - SQL query orders: 2.0s (⚠️ this is the problem)
    - Process results: 100ms
  - Render response: 150ms
The UI highlights the slow span in red. Click it to see tags: db.statement: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 100. No index on created_at. Problem identified in 30 seconds.
The service dependency graph shows which services call each other, helping you understand system architecture and identify chatty relationships that cause latency.
Best Practices and Performance Considerations
Sampling is critical. Tracing every request in production creates massive overhead. Jaeger's collector accepts per-service strategies (and also supports adaptive sampling); the strategies file, passed with --sampling.strategies-file, is JSON:
```json
{
  "service_strategies": [
    { "service": "checkout-service", "type": "probabilistic", "param": 0.1 },
    { "service": "auth-service", "type": "ratelimiting", "param": 10 },
    { "service": "critical-payment-service", "type": "probabilistic", "param": 1.0 }
  ]
}
```
Here checkout-service keeps 10% of traces, auth-service is capped at 10 traces per second, and critical-payment-service traces everything.
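The practical difference between the two strategy types is easy to quantify; a back-of-the-envelope sketch:

```python
def traces_kept_per_second(rps: float, strategy: str, param: float) -> float:
    """Estimate how many traces per second a sampling strategy keeps."""
    if strategy == "probabilistic":
        return rps * param      # fixed fraction: volume grows with traffic
    if strategy == "ratelimiting":
        return min(rps, param)  # hard cap: independent of traffic
    raise ValueError(f"unknown strategy: {strategy}")

# At 1,000 requests/second:
print(traces_kept_per_second(1000, "probabilistic", 0.1))  # 100.0
print(traces_kept_per_second(1000, "ratelimiting", 10))    # 10
```

Probabilistic sampling keeps trace volume proportional to load, so a traffic spike multiplies storage cost; rate limiting bounds the cost but silently drops a larger fraction of traces as traffic grows.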
Note that sampling decisions in OpenTelemetry are made by the configured sampler when a span starts; trace flags live in the span's immutable SpanContext and can't be flipped afterward. To keep high-value traces, make the decision at the edge and let downstream services honor it with a parent-based sampler:
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Honor the upstream sampled flag; sample 10% of new root traces locally
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1)))
```
Tag naming conventions matter for searchability. Use structured tags:
- http.method, http.status_code, http.url (standard semantic conventions)
- user.id, order.id (business identifiers)
- error: true, error.message (error tracking)
Avoid high-cardinality values like timestamps or unique identifiers in the tags you search on; record them as span logs (events) instead, which travel with the trace but aren't indexed.
Minimize overhead by keeping span creation lightweight. Don’t create spans for trivial operations like variable assignments. Focus on I/O boundaries: HTTP calls, database queries, cache operations, message queue interactions.
Integrate tracing with metrics and logs. Add trace IDs to log entries so you can jump from a trace to detailed logs:
```python
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

# Add trace context to logs (format the 128-bit ID as 32 hex chars,
# matching what the Jaeger UI displays)
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, "032x")
logger.info("Processing order", extra={"trace_id": trace_id})
```
Distributed tracing transforms microservices debugging from archaeology to precision surgery. Jaeger gives you the tools to understand your system’s behavior in production, find bottlenecks quickly, and build confidence in your architecture’s performance characteristics.