Log Aggregation: Centralized Logging Architecture
Key Insights
- Centralized logging transforms debugging from SSH-hopping across servers to querying a single interface, dramatically speeding up incident response in distributed systems
- The standard architecture—application → shipper → buffer → storage → visualization—separates concerns and prevents log loss during traffic spikes through asynchronous processing
- Structured logging with correlation IDs is non-negotiable; unstructured logs make aggregation worthless since you can’t reliably query or correlate events across services
Why Centralized Logging Matters
When your application runs on a single server, tailing log files works fine. But the moment you scale to multiple instances, containers, or microservices, local logging becomes a nightmare. You’re SSH-ing into different machines, grepping through files, trying to piece together what happened during an incident. If a container dies, those logs disappear with it.
Centralized log aggregation solves this by collecting logs from all sources into a single queryable system. You get correlation across services, persistent storage independent of application lifecycle, and the ability to search across your entire infrastructure in seconds. This isn’t a nice-to-have—it’s essential infrastructure for any system running more than one instance.
Core Architecture Components
A centralized logging system has five key layers:
Log Shippers/Agents run alongside your applications and forward logs to the aggregation pipeline. Examples include Filebeat, Fluentd, and Vector.
Message Queue/Buffer sits between shippers and storage, handling backpressure when log volume spikes. Kafka and Redis are common choices.
Aggregation Service processes, filters, and enriches logs before storage. Logstash and Fluentd fill this role.
Storage Backend indexes and stores logs for querying. Elasticsearch dominates here, though Loki and ClickHouse are gaining traction.
Visualization Layer provides the UI for searching and dashboarding. Kibana and Grafana are the standards.
The flow looks like this:
Application (structured logs)
↓
Log Shipper (Filebeat/Fluentd)
↓
Message Queue (Kafka) [optional but recommended]
↓
Aggregation (Logstash/Fluentd)
↓
Storage (Elasticsearch/Loki)
↓
Visualization (Kibana/Grafana)
This separation of concerns is critical. When log volume spikes, the queue absorbs the burst. If storage is slow, shippers buffer locally. Each component can scale independently.
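The buffering behavior can be sketched in a few lines: a bounded queue between the application and a slow sink absorbs bursts, and sheds (rather than blocks on) load it cannot hold. A minimal illustration, assuming a drop-oldest overflow policy:

```python
from collections import deque

class LogBuffer:
    """Bounded buffer between the app and a slow sink (drop-oldest on overflow)."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)  # deque discards the oldest item when full
        self.dropped = 0

    def push(self, record):
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # track shed load instead of blocking the app
        self.queue.append(record)

    def drain(self, batch_size):
        """Called by the downstream consumer at its own pace."""
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch

# A burst of 150 records against a 100-slot buffer:
buf = LogBuffer(capacity=100)
for i in range(150):
    buf.push({'seq': i})

print(len(buf.queue), buf.dropped)  # 100 kept, 50 dropped
```

Real shippers like Filebeat make the same trade-off with on-disk spooling instead of dropping, but the principle is identical: the producer's write path never waits on storage.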
Log Collection Strategies
Unstructured logs are your enemy. If you’re still writing logs like "User logged in successfully", you’re making aggregation nearly impossible. Use structured logging with JSON:
import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone
from uuid import uuid4

# Store the correlation ID in request-scoped context
correlation_id = ContextVar('correlation_id', default=None)

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)

    def info(self, message, **kwargs):
        log_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'),
            'level': 'INFO',
            'message': message,
            'correlation_id': correlation_id.get(),
            'service': 'user-service',
            **kwargs,
        }
        self.logger.info(json.dumps(log_entry))

# Usage
logger = StructuredLogger(__name__)

def process_request(user_id):
    correlation_id.set(str(uuid4()))
    logger.info('Processing user request', user_id=user_id, action='login')
This produces queryable logs:
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"message": "Processing user request",
"correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"service": "user-service",
"user_id": 12345,
"action": "login"
}
Now you can query by user_id, correlation_id, or any field. Correlation IDs are especially critical—they let you trace a request across multiple services.
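Propagation is what makes correlation IDs work across service boundaries: the first service generates the ID, every downstream call carries it in a header, and each service logs it. A minimal sketch using plain dicts in place of real HTTP requests (the `X-Correlation-ID` header name is a common convention, not a standard):

```python
from contextvars import ContextVar
from uuid import uuid4

correlation_id = ContextVar('correlation_id', default=None)
HEADER = 'X-Correlation-ID'  # common convention; any agreed-on header name works

def handle_incoming(headers):
    """Reuse the caller's ID if present, otherwise start a new trace."""
    cid = headers.get(HEADER) or str(uuid4())
    correlation_id.set(cid)
    return cid

def outgoing_headers():
    """Attach the current ID to every downstream call."""
    return {HEADER: correlation_id.get()}

# Service A starts a trace; service B joins it.
cid_a = handle_incoming({})                  # no inbound ID: generate one
cid_b = handle_incoming(outgoing_headers())  # downstream call carries it
print(cid_a == cid_b)  # True: both services log the same ID
```

In a real service this lives in HTTP middleware, so handlers never touch the header directly.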
For shipping logs, configure Filebeat to watch your application logs:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      environment: production
      datacenter: us-east-1

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: application-logs
  partition.round_robin:
    reachable_only: false
  compression: gzip
  max_message_bytes: 1000000
This configuration parses JSON logs, adds environment metadata, and ships to Kafka with compression.
Popular Technology Stacks
The ELK Stack (Elasticsearch, Logstash, Kibana) remains the most popular. Here’s a Docker Compose setup for local development:
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

volumes:
  es_data:
Grafana Loki is gaining popularity for its lower resource requirements. It doesn’t index log content—only metadata—making it cheaper to run at scale.
Cloud-native options like AWS CloudWatch, Google Cloud Logging, and Azure Monitor handle infrastructure for you but lock you into their ecosystem and can get expensive at high volumes.
Implementation Best Practices
Log Retention Policies prevent storage costs from spiraling. Configure index lifecycle management:
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
This keeps recent logs on fast storage, moves older logs to cheaper storage, and deletes after 90 days.
Sampling is essential for high-traffic systems. You don’t need every successful health check logged. Implement smart sampling:
import random

def should_log(level, endpoint):
    if level == 'ERROR':
        return True  # Always log errors
    if endpoint == '/health':
        return random.random() < 0.01  # 1% sampling
    if endpoint.startswith('/api/'):
        return random.random() < 0.1  # 10% sampling
    return True
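One way to wire sampling like this into the standard library is a `logging.Filter` (a sketch; it assumes records carry an `endpoint` attribute supplied via the `extra` parameter):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Drop a fraction of noisy records; never drop errors."""

    def filter(self, record):
        if record.levelno >= logging.ERROR:
            return True  # always keep errors
        endpoint = getattr(record, 'endpoint', None)
        if endpoint == '/health':
            return random.random() < 0.01  # keep ~1% of health checks
        return True

logger = logging.getLogger('sampled')
logger.addFilter(SamplingFilter())

# The endpoint travels on the record via `extra`:
logger.error('upstream timeout', extra={'endpoint': '/health'})  # always emitted
```

Because the filter runs before handlers, dropped records never reach the shipper at all.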
Security matters. Encrypt logs in transit and at rest. Implement role-based access control so developers can only see logs from their services. Redact sensitive data before logging:
import re

def sanitize_log(data):
    if 'credit_card' in data:
        data['credit_card'] = re.sub(r'\d', '*', data['credit_card'])
    if 'ssn' in data:
        data['ssn'] = '***-**-****'
    return data
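The snippet above only checks top-level keys, but real log payloads nest. A recursive variant is safer (a sketch; the set of sensitive field names is illustrative):

```python
import re

SENSITIVE_KEYS = {'credit_card', 'ssn', 'password', 'authorization'}  # illustrative list

def sanitize(value, key=None):
    """Recursively mask any value stored under a sensitive key."""
    if isinstance(value, dict):
        return {k: sanitize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v, key) for v in value]
    if key in SENSITIVE_KEYS:
        return re.sub(r'.', '*', str(value))  # mask every character
    return value

event = {'user': {'ssn': '123-45-6789'}, 'items': [{'credit_card': '4111111111111111'}]}
print(sanitize(event))
```

Running the sanitizer at the logger level, rather than at each call site, ensures no code path can forget it.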
Querying and Alerting
Effective querying requires understanding your storage backend’s query language. In Kibana’s search bar (KQL, with Lucene syntax for the bracketed range below):
# Find errors for specific user
level:ERROR AND user_id:12345
# Find slow requests in the last hour
response_time > 1000 AND @timestamp > now-1h
# Trace a request across services
correlation_id:"a1b2c3d4-e5f6-7890-abcd-ef1234567890"
# Find 5xx errors excluding health checks
status:[500 TO 599] AND NOT endpoint:"/health"
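Kibana compiles these into the Elasticsearch query DSL; when hitting the search API directly, the first query above corresponds roughly to this JSON body (a sketch, assuming `level` and `user_id` are indexed as keyword and numeric fields):

```python
import json

# bool/filter combines exact-match clauses without relevance-scoring overhead
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"term": {"user_id": 12345}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```

Building queries as plain dicts like this is also how most Elasticsearch client libraries expect them.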
Set up alerts for critical issues. Here’s an Elastalert rule:
name: High Error Rate
type: frequency
index: logs-*
num_events: 50
timeframe:
  minutes: 5
filter:
  - term:
      level: "ERROR"
  - term:
      service: "payment-service"
alert:
  - "slack"
slack_webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
alert_text: "Payment service error rate exceeded threshold"
Performance and Scalability
Use Kafka for buffering at scale. Configure topics appropriately:
kafka-topics.sh --create \
--topic application-logs \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=86400000 \
--config compression.type=lz4 \
--config segment.bytes=1073741824
This creates a topic with 12 partitions (for parallel processing), 3x replication (for durability), 24-hour retention, and LZ4 compression.
Monitor your logging infrastructure itself. If your logging system goes down during an incident, you’re blind. Set up separate alerting for:
- Shipper lag/errors
- Queue depth
- Storage disk usage
- Indexing rate vs. ingestion rate
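The last check can be sketched as a simple comparison between what shippers send and what storage indexes, flagging sustained divergence (all names and thresholds here are illustrative, not from any particular tool):

```python
def pipeline_alerts(shipped_per_sec, indexed_per_sec, queue_depth,
                    max_lag_ratio=0.2, max_queue_depth=1_000_000):
    """Return a list of alert strings; empty means the pipeline is keeping up."""
    alerts = []
    if shipped_per_sec > 0:
        # Fraction of incoming logs that storage is failing to keep up with
        lag_ratio = (shipped_per_sec - indexed_per_sec) / shipped_per_sec
        if lag_ratio > max_lag_ratio:
            alerts.append(f'indexing lagging ingestion by {lag_ratio:.0%}')
    if queue_depth > max_queue_depth:
        alerts.append(f'queue depth {queue_depth} exceeds limit')
    return alerts

print(pipeline_alerts(10_000, 7_000, 2_000_000))  # both thresholds breached
```

Crucially, these alerts should route through a channel that does not depend on the logging pipeline itself.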
Conclusion
Centralized logging isn’t optional for modern systems. Start with structured JSON logs and correlation IDs. Choose a stack that fits your scale—ELK for full-featured needs, Loki for cost efficiency, or cloud-native for managed simplicity. Implement retention policies from day one, and remember to monitor the monitors. Your future self, debugging a production incident at 2 AM, will thank you.