Design a Logging System: Centralized Log Aggregation
Key Insights
- A well-designed centralized logging system requires four distinct layers: collection agents, a buffer/queue for backpressure handling, tiered storage for cost optimization, and a query layer—skipping any layer creates operational pain at scale.
- Structured logging with correlation IDs isn’t optional; it’s the difference between spending 5 minutes or 5 hours debugging a production incident across distributed services.
- The build-vs-buy decision hinges on log volume and team size: below 100GB/day, managed solutions usually win; above 1TB/day, self-hosted becomes cost-effective if you have the operational expertise.
Introduction: Why Centralized Logging Matters
Debugging a production issue across 50 microservices by SSH-ing into individual containers is a special kind of pain. I’ve watched engineers spend hours grepping through scattered log files, piecing together request flows like archaeologists reconstructing pottery shards.
Centralized logging solves this by aggregating logs from all services into a single, queryable system. Beyond debugging, it enables compliance auditing, performance analysis, security monitoring, and capacity planning. When your CEO asks “what happened at 3 AM?” you need an answer in minutes, not days.
This article walks through designing a production-grade centralized logging system. We’ll cover the architecture, implementation details, and the trade-offs you’ll face at each decision point.
Core Architecture Components
A centralized logging system has five distinct layers: the log producers that feed it, plus the four layers you design (collection, buffering, storage, and query). Each has specific responsibilities:
┌─────────────────────────────────────────────────────────────────────┐
│ QUERY & VISUALIZATION │
│ (Kibana, Grafana, Custom UIs) │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ ┌──────────────┬──────────────┬──────────────┐ │
│ │ HOT │ WARM │ COLD │ │
│ │ Elasticsearch│ Elasticsearch│ S3 │ │
│ │ (7 days) │ (30 days) │ (1+ year) │ │
│ └──────────────┴──────────────┴──────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────────┐
│ BUFFER/QUEUE LAYER │
│ (Kafka / Redis Streams) │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────────┐
│ COLLECTION AGENTS │
│ (Fluentd, Filebeat, Vector, OTEL Collector) │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────────┐
│ LOG PRODUCERS │
│ (Applications, Databases, Load Balancers, OS) │
└─────────────────────────────────────────────────────────────────────┘
Log Producers are your applications and infrastructure. They emit logs to stdout, files, or directly to collectors.
Collection Agents run alongside your applications, reading logs and forwarding them. Vector and Fluentd are the current leaders here—Vector for performance, Fluentd for plugin ecosystem.
Buffer Layer absorbs traffic spikes and provides backpressure. When your storage layer is overwhelmed or down for maintenance, Kafka holds your logs safely. Skip this layer at your peril.
Storage Layer persists logs for querying. Elasticsearch dominates, but ClickHouse is gaining ground for cost-sensitive deployments. S3 handles long-term archival.
Query Layer provides the interface for searching and visualization. Kibana, Grafana, or custom tooling built on your storage APIs.
Structured Logging Standards
Unstructured logs are the enemy of observability. Compare these:
# Bad: Unstructured
[2024-01-15 10:23:45] ERROR - Failed to process order 12345 for user john@example.com
# Good: Structured JSON
{
"timestamp": "2024-01-15T10:23:45.123Z",
"level": "error",
"service": "order-service",
"version": "2.3.1",
"trace_id": "abc123def456",
"span_id": "789xyz",
"message": "Failed to process order",
"order_id": "12345",
"user_id": "usr_789",
"error_code": "PAYMENT_DECLINED",
"duration_ms": 234
}
The structured version enables queries like “show me all PAYMENT_DECLINED errors for order-service v2.3.1 in the last hour” without regex gymnastics.
Here’s a Python implementation that enforces structure:
import json
import logging
import sys
from contextvars import ContextVar
from datetime import datetime, timezone
from typing import Any
import uuid
# Context variables for request-scoped data
trace_id_var: ContextVar[str] = ContextVar('trace_id', default='')
span_id_var: ContextVar[str] = ContextVar('span_id', default='')
user_id_var: ContextVar[str] = ContextVar('user_id', default='')
class StructuredFormatter(logging.Formatter):
def __init__(self, service_name: str, version: str):
super().__init__()
self.service_name = service_name
self.version = version
def format(self, record: logging.LogRecord) -> str:
log_entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
"level": record.levelname.lower(),
"service": self.service_name,
"version": self.version,
"logger": record.name,
"message": record.getMessage(),
"trace_id": trace_id_var.get(),
"span_id": span_id_var.get(),
}
# Add user context if available
if user_id := user_id_var.get():
log_entry["user_id"] = user_id
# Add extra fields from the log call
if hasattr(record, 'extra_fields'):
log_entry.update(record.extra_fields)
# Add exception info if present
if record.exc_info:
log_entry["exception"] = self.formatException(record.exc_info)
return json.dumps(log_entry, default=str)
class StructuredLogger:
    def __init__(self, name: str, service_name: str, version: str):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.DEBUG)
        self.logger.propagate = False  # avoid duplicate lines via the root logger
        if not self.logger.handlers:   # idempotent if the same logger is built twice
            handler = logging.StreamHandler(sys.stdout)
            handler.setFormatter(StructuredFormatter(service_name, version))
            self.logger.addHandler(handler)
def _log(self, level: int, message: str, **kwargs: Any):
record = self.logger.makeRecord(
self.logger.name, level, "", 0, message, (), None
)
record.extra_fields = kwargs
self.logger.handle(record)
def info(self, message: str, **kwargs):
self._log(logging.INFO, message, **kwargs)
def error(self, message: str, **kwargs):
self._log(logging.ERROR, message, **kwargs)
def warning(self, message: str, **kwargs):
self._log(logging.WARNING, message, **kwargs)
# Middleware for web frameworks (FastAPI example)
def inject_trace_context(trace_id: str = None, span_id: str = None):
trace_id_var.set(trace_id or str(uuid.uuid4()))
span_id_var.set(span_id or str(uuid.uuid4())[:16])
# Usage
logger = StructuredLogger("order-service", "order-service", "2.3.1")
inject_trace_context()
logger.info("Processing order", order_id="12345", amount=99.99)
logger.error("Payment failed", order_id="12345", error_code="DECLINED")
The critical piece is the trace_id. This correlation ID must propagate across all service calls, typically via HTTP headers (X-Trace-ID or W3C Trace Context). Without it, you cannot follow a request through your distributed system.
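The propagation itself is only a few lines. A minimal sketch, assuming the contextvars shown earlier; `X-Trace-ID` is a common convention while `traceparent` is the W3C Trace Context form:

```python
import uuid
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar('trace_id', default='')

def extract_trace_id(headers: dict) -> str:
    """Pull a trace ID from incoming headers, minting one if absent."""
    # W3C Trace Context: "traceparent: 00-<trace-id>-<span-id>-<flags>"
    if tp := headers.get("traceparent"):
        parts = tp.split("-")
        if len(parts) == 4:
            return parts[1]
    return headers.get("X-Trace-ID") or uuid.uuid4().hex

def outgoing_headers() -> dict:
    """Headers to attach to every downstream HTTP call."""
    return {"X-Trace-ID": trace_id_var.get()}

# Incoming request: adopt the caller's trace ID (or mint one), so
# every downstream call carries the same correlation ID.
trace_id_var.set(extract_trace_id({"traceparent": "00-abc123def456-789xyz-01"}))
assert outgoing_headers() == {"X-Trace-ID": "abc123def456"}
```

In practice a framework middleware does the extraction once per request; real header lookups are case-insensitive, which this plain dict sketch glosses over.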
Collection and Transport Pipeline
Three patterns exist for log collection:
Agent-based: A daemon runs on each host, tailing log files. Simple and battle-tested.
Sidecar: Each container gets a logging sidecar. More isolation but higher resource overhead.
Direct shipping: Applications send logs directly to the buffer layer. Lowest latency but couples your app to your logging infrastructure.
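If you do ship directly, at least decouple the request path from network I/O. A sketch using the standard library's QueueHandler/QueueListener; `ShippingHandler` is a hypothetical stand-in for a real Kafka or HTTP sender:

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# The app thread only enqueues; a background thread owns the
# (potentially slow, potentially failing) network send.
log_queue: queue.Queue = queue.Queue(maxsize=10_000)

shipped: list[str] = []  # stand-in for a remote sink

class ShippingHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        shipped.append(self.format(record))  # real code: HTTP/Kafka send

logger = logging.getLogger("direct-shipper-demo")
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(log_queue))

listener = QueueListener(log_queue, ShippingHandler())
listener.start()
logger.info("Processing order")
listener.stop()  # drains the queue before returning

assert shipped and "Processing order" in shipped[0]
```

The bounded queue also gives you a local backpressure policy: block, drop, or spill to disk when the shipper falls behind.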
For most deployments, agent-based collection with Vector provides the best balance:
# vector.yaml - Production configuration
sources:
kubernetes_logs:
type: kubernetes_logs
auto_partial_merge: true
exclude_paths_glob_patterns:
- "**/kube-system/**"
host_metrics:
type: host_metrics
collectors:
- cpu
- memory
- disk
transforms:
parse_json:
type: remap
inputs: ["kubernetes_logs"]
source: |
# Parse JSON logs, fall back to raw message
parsed, err = parse_json(.message)
if err == null {
. = merge(., parsed)
del(.message)
}
# Ensure required fields exist
.service = .kubernetes.pod_labels."app.kubernetes.io/name" ?? "unknown"
.namespace = .kubernetes.pod_namespace ?? "default"
.timestamp = .timestamp ?? now()
# Normalize log levels
.level = downcase(.level ?? .severity ?? "info")
filter_noise:
type: filter
inputs: ["parse_json"]
    condition: |
      # Drop health check spam; .message may have been deleted by the
      # JSON merge above, so coalesce to "" rather than using string!()
      # (which would error on those events)
      msg = string(.message) ?? ""
      !contains(msg, "/health") && !contains(msg, "/ready")
add_metadata:
type: remap
inputs: ["filter_noise"]
source: |
.environment = get_env_var("ENVIRONMENT") ?? "production"
.cluster = get_env_var("CLUSTER_NAME") ?? "default"
sinks:
kafka:
type: kafka
inputs: ["add_metadata"]
bootstrap_servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
topic: "logs-{{ namespace }}"
encoding:
codec: json
buffer:
type: disk
max_size: 5368709120 # 5GB disk buffer
when_full: block
batch:
max_bytes: 1048576 # 1MB batches
timeout_secs: 1
  # Local file fallback (note: this sink receives the full stream, not
  # only failed events; Vector sinks have no built-in per-event DLQ)
  file_fallback:
type: file
inputs: ["add_metadata"]
path: "/var/log/vector/dlq/%Y-%m-%d.log"
encoding:
codec: json
Key configuration decisions:
Disk buffering prevents log loss when downstream systems are unavailable. Size it for your expected outage duration.
Batch settings balance latency against throughput. Larger batches are more efficient but increase delivery latency.
Topic partitioning by namespace enables per-team access control and independent scaling.
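The batch trade-off is mechanical enough to sketch. This hypothetical batcher mirrors the max_bytes/timeout_secs semantics above, flushing on whichever threshold trips first; a production version also needs a timer thread so the timeout can fire without a new event arriving:

```python
import time
from typing import Optional

class Batcher:
    """Flush when the batch reaches max_bytes OR timeout_secs elapses."""

    def __init__(self, flush_fn, max_bytes: int = 1_048_576, timeout_secs: float = 1.0):
        self.flush_fn = flush_fn          # stand-in for the Kafka producer
        self.max_bytes = max_bytes
        self.timeout_secs = timeout_secs
        self.buf: list = []
        self.size = 0
        self.oldest: Optional[float] = None

    def add(self, event: bytes, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.oldest = now             # age is measured from the first event
        self.buf.append(event)
        self.size += len(event)
        if self.size >= self.max_bytes or now - self.oldest >= self.timeout_secs:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.flush_fn(self.buf)       # real code: one batched network send
            self.buf, self.size, self.oldest = [], 0, None

batches: list = []
b = Batcher(batches.append, max_bytes=10, timeout_secs=1.0)
b.add(b"x" * 4, now=0.0)   # 4 bytes: held
b.add(b"y" * 6, now=0.1)   # 10 bytes: size threshold hit, batch flushed
assert len(batches) == 1 and len(batches[0]) == 2
```

Raising max_bytes amortizes per-request overhead; raising timeout_secs caps how stale a held log line can get. Tune the timeout to your alerting latency budget.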
Storage and Retention Strategy
Log storage costs grow linearly with retention, but query frequency follows a power law—90% of queries hit the last 24 hours. Design your storage around this reality.
Hot tier: Fast SSDs, full indexing, 7-14 days retention. This handles real-time debugging.
Warm tier: Cheaper storage, reduced replicas, 30-90 days. For incident post-mortems and trend analysis.
Cold tier: Object storage (S3/GCS), query-on-demand. Compliance and forensics.
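A back-of-envelope cost model makes the tiering payoff concrete. The $/GB-month figures below are illustrative assumptions, not vendor quotes:

```python
# Illustrative $/GB-month figures -- substitute your provider's pricing.
TIER_COST = {"hot": 0.30, "warm": 0.08, "cold": 0.02}  # SSD / HDD / object storage

def monthly_storage_cost(gb_per_day: float, retention_days: dict) -> float:
    """Steady-state GB resident in each tier times that tier's $/GB-month."""
    return sum(gb_per_day * days * TIER_COST[tier]
               for tier, days in retention_days.items())

# 500 GB/day: 7 days hot, warm to day 90, cold to one year...
tiered = monthly_storage_cost(500, {"hot": 7, "warm": 83, "cold": 275})
# ...versus naively keeping the full year hot.
all_hot = monthly_storage_cost(500, {"hot": 365})
print(f"tiered: ${tiered:,.0f}/mo vs all-hot: ${all_hot:,.0f}/mo")
```

Under these assumed prices the tiered layout comes out several times cheaper than keeping everything hot, which is why the retention split, not the ingest rate, is usually the first knob to turn.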
Here’s an Elasticsearch index template with Index Lifecycle Management:
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-policy",
"index.lifecycle.rollover_alias": "logs",
"index.mapping.total_fields.limit": 2000,
"index.refresh_interval": "5s",
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"
},
"mappings": {
"dynamic": "strict",
"properties": {
"timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"level": {
"type": "keyword"
},
"service": {
"type": "keyword"
},
"version": {
"type": "keyword"
},
"trace_id": {
"type": "keyword"
},
"span_id": {
"type": "keyword"
},
"user_id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"context": {
"type": "object",
"dynamic": true
},
"environment": {
"type": "keyword"
},
"namespace": {
"type": "keyword"
}
}
}
}
}
And the matching ILM policy:
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d",
"max_docs": 100000000
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": {
"priority": 50
},
"allocate": {
"require": {
"data": "warm"
}
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "logs-s3-repo"
}
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
Critical settings explained:
dynamic: strict prevents mapping explosion from rogue log fields. Unmapped fields cause indexing failures rather than silently creating thousands of fields.
Rollover conditions use size, age, and document count. Hit any threshold and a new index is created.
Searchable snapshots in the cold phase keep data queryable while storing it in cheap object storage.
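With dynamic: strict, any field the template doesn't declare is rejected at index time. One client-side mitigation, sketched under the assumption that the known-field set mirrors the mapping above: fold unmapped fields into the dynamic context object before shipping.

```python
# Mirrors the `properties` declared in the index template above.
KNOWN_FIELDS = {
    "timestamp", "level", "service", "version", "trace_id",
    "span_id", "user_id", "message", "context", "environment", "namespace",
}

def conform(event: dict) -> dict:
    """Move fields the strict mapping doesn't know into `context`,
    so indexing succeeds instead of failing on an unmapped field."""
    out, ctx = {}, dict(event.get("context") or {})
    for key, value in event.items():
        if key == "context":
            continue                      # already seeded into ctx above
        if key in KNOWN_FIELDS:
            out[key] = value
        else:
            ctx[key] = value              # rogue field: tuck under context
    if ctx:
        out["context"] = ctx
    return out

event = {"service": "order-service", "level": "error",
         "message": "Payment failed", "order_id": "12345"}
assert conform(event)["context"] == {"order_id": "12345"}
```

Running this in the pipeline (e.g. as a Vector remap) keeps ingestion lossless while preserving the strict mapping's protection against field explosion.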
Querying, Alerting, and Observability
Effective alerting on logs requires understanding signal vs. noise. Alert on rates and patterns, not individual log lines.
Here’s a Grafana Loki alert configuration:
# loki-rules.yaml
groups:
- name: application-errors
interval: 1m
rules:
- alert: HighErrorRate
expr: |
          sum(rate({namespace="production"} | json | level="error" [5m])) by (service)
/
sum(rate({namespace="production"} [5m])) by (service)
> 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
- alert: ErrorRateSpike
expr: |
          sum(rate({namespace="production"} | json | level="error" [5m])) by (service)
            >
          sum(rate({namespace="production"} | json | level="error" [5m] offset 1h)) by (service) * 3
for: 2m
labels:
severity: critical
annotations:
summary: "Error rate spike in {{ $labels.service }}"
description: "Error rate is 3x higher than 1 hour ago"
- alert: PaymentFailures
expr: |
          sum(rate({service="payment-service"} | json | error_code=~"DECLINED|TIMEOUT|GATEWAY_ERROR" [5m])) > 10
for: 3m
labels:
severity: critical
team: payments
annotations:
summary: "Elevated payment failures"
runbook: "https://wiki.internal/runbooks/payment-failures"
For unified observability, correlate logs with traces and metrics using the trace ID:
# Query helper for correlated observability
# (elasticsearch, jaeger, and prometheus are assumed pre-configured
# client objects; the calls sketch the idea rather than exact SDK APIs)
def get_request_context(trace_id: str) -> dict:
    """Fetch logs, traces, and metrics for a single request."""
logs = elasticsearch.search(
index="logs-*",
body={"query": {"term": {"trace_id": trace_id}}}
)
traces = jaeger.get_trace(trace_id)
# Get metrics for the time window of the trace
start_time = traces.spans[0].start_time
end_time = traces.spans[-1].end_time
metrics = prometheus.query_range(
f'http_request_duration_seconds{{trace_id="{trace_id}"}}',
start=start_time,
end=end_time
)
return {
"logs": logs,
"traces": traces,
"metrics": metrics
}
Scaling Considerations and Trade-offs
At scale, you’ll face these decisions:
Sampling: At 1TB/day, you might sample debug logs (keep 10%) while retaining all errors. Implement sampling at the collection layer, not in storage.
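Sampling should be deterministic per trace, so a kept trace keeps all of its log lines across every service. A sketch of hash-based sampling at the collection layer; the 10% debug rate and level names are assumptions:

```python
import hashlib

KEEP_ALWAYS = {"warn", "warning", "error", "fatal", "critical"}

def should_keep(level: str, trace_id: str, debug_rate: float = 0.10) -> bool:
    """Keep all high-severity logs; sample the rest by trace ID hash,
    so a given trace's lines are all kept or all dropped together."""
    if level in KEEP_ALWAYS:
        return True
    # Stable bucket in [0, 1) derived from the trace ID.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < debug_rate

assert should_keep("error", "any-trace")
kept = sum(should_keep("debug", f"trace-{i}") for i in range(10_000))
assert 700 < kept < 1300   # roughly 10% of debug traffic survives
```

Because the decision is a pure function of the trace ID, every collection agent makes the same call without coordination, and a trace that surfaces an error still has its debug context intact whenever it landed in the kept bucket.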
Multi-tenancy: Separate indices per team enable access control and independent retention. Use Elasticsearch’s document-level security or separate clusters for strict isolation.
Cost optimization: The biggest lever is retention. Reducing hot tier from 14 to 7 days can cut costs 40%. Second is field cardinality—don’t index high-cardinality fields like request bodies.
Build vs. buy decision matrix:
| Factor | Build (Elasticsearch/Loki) | Buy (Datadog/CloudWatch) |
|---|---|---|
| Volume < 100GB/day | ❌ Overhead not worth it | ✅ Simpler, predictable cost |
| Volume > 1TB/day | ✅ 10x cost savings | ❌ Prohibitively expensive |
| Team < 5 engineers | ❌ Operational burden | ✅ Zero maintenance |
| Compliance requirements | ✅ Full control | ⚠️ Depends on vendor |
For most startups, managed solutions win until you hit scale. For enterprises with platform teams, self-hosted provides control and cost efficiency.
The logging system you design today will be queried thousands of times during your next major incident. Invest in structured logging, proper correlation IDs, and tiered storage. Your future self, debugging at 3 AM, will thank you.