Log Aggregation: Centralized Logging Architecture

Key Insights

  • Centralized logging transforms debugging from SSH-hopping across servers to querying a single interface, making incident response dramatically faster in distributed systems
  • The standard architecture—application → shipper → buffer → storage → visualization—separates concerns and prevents log loss during traffic spikes through asynchronous processing
  • Structured logging with correlation IDs is non-negotiable; unstructured logs make aggregation worthless since you can’t reliably query or correlate events across services

Why Centralized Logging Matters

When your application runs on a single server, tailing log files works fine. But the moment you scale to multiple instances, containers, or microservices, local logging becomes a nightmare. You’re SSH-ing into different machines, grepping through files, trying to piece together what happened during an incident. If a container dies, those logs disappear with it.

Centralized log aggregation solves this by collecting logs from all sources into a single queryable system. You get correlation across services, persistent storage independent of application lifecycle, and the ability to search across your entire infrastructure in seconds. This isn’t a nice-to-have—it’s essential infrastructure for any system running more than one instance.

Core Architecture Components

A centralized logging system has five key layers:

Log Shippers/Agents run alongside your applications and forward logs to the aggregation pipeline. Examples include Filebeat, Fluentd, and Vector.

Message Queue/Buffer sits between shippers and storage, handling backpressure when log volume spikes. Kafka and Redis are common choices.

Aggregation Service processes, filters, and enriches logs before storage. Logstash and Fluentd fill this role.

Storage Backend indexes and stores logs for querying. Elasticsearch dominates here, though Loki and ClickHouse are gaining traction.

Visualization Layer provides the UI for searching and dashboarding. Kibana and Grafana are the standards.

The flow looks like this:

Application (structured logs)
    ↓
Log Shipper (Filebeat/Fluentd)
    ↓
Message Queue (Kafka) [optional but recommended]
    ↓
Aggregation (Logstash/Fluentd)
    ↓
Storage (Elasticsearch/Loki)
    ↓
Visualization (Kibana/Grafana)

This separation of concerns is critical. When log volume spikes, the queue absorbs the burst. If storage is slow, shippers buffer locally. Each component can scale independently.

Log Collection Strategies

Unstructured logs are your enemy. If you’re still writing logs like "User logged in successfully", you’re making aggregation nearly impossible. Use structured logging with JSON:

import logging
import json
from datetime import datetime, timezone
from uuid import uuid4
from contextvars import ContextVar

# Store the correlation ID in a context variable so it follows the
# current request, even across async tasks
correlation_id = ContextVar('correlation_id', default=None)

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)

    def info(self, message, **kwargs):
        log_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': 'INFO',
            'message': message,
            'correlation_id': correlation_id.get(),
            'service': 'user-service',
            **kwargs
        }
        self.logger.info(json.dumps(log_entry))

# Usage
logger = StructuredLogger(__name__)

def process_request(user_id):
    correlation_id.set(str(uuid4()))
    logger.info('Processing user request', user_id=user_id, action='login')

This produces queryable logs:

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "message": "Processing user request",
  "correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "service": "user-service",
  "user_id": 12345,
  "action": "login"
}

Now you can query by user_id, correlation_id, or any field. Correlation IDs are especially critical—they let you trace a request across multiple services.
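
The correlation ID only helps if every service in the chain reuses the same value. Here's a minimal sketch of propagation over HTTP, assuming a header conventionally named `X-Correlation-ID` (the contextvar mirrors the one in `StructuredLogger` above; neither the header name nor these helper functions are standard):

```python
from contextvars import ContextVar
from uuid import uuid4

# Matches the ContextVar used by StructuredLogger above
correlation_id = ContextVar('correlation_id', default=None)

# Header name is a convention, not a standard -- some teams use
# X-Request-ID or the W3C traceparent header instead
CORRELATION_HEADER = 'X-Correlation-ID'

def accept_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation ID, or mint one at the edge."""
    cid = headers.get(CORRELATION_HEADER) or str(uuid4())
    correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Attach the current correlation ID to downstream requests."""
    cid = correlation_id.get()
    return {CORRELATION_HEADER: cid} if cid else {}

# A service that received a request forwards the same ID downstream:
cid = accept_correlation_id({'X-Correlation-ID': 'abc-123'})
print(outgoing_headers())  # {'X-Correlation-ID': 'abc-123'}
```

Every service applies the same pattern: accept on the way in, attach on the way out, and log the value on every entry in between.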

For shipping logs, configure Filebeat to watch your application logs:

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  json.keys_under_root: true
  json.add_error_key: true
  fields:
    environment: production
    datacenter: us-east-1

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: application-logs
  partition.round_robin:
    reachable_only: false
  compression: gzip
  max_message_bytes: 1000000

This configuration parses JSON logs, adds environment metadata, and ships to Kafka with compression.

The ELK Stack (Elasticsearch, Logstash, Kibana) remains the most popular. Here’s a Docker Compose setup for local development:

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

volumes:
  es_data:
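
The compose file mounts a logstash.conf that isn't shown. Here's a minimal pipeline sketch for this setup, assuming logs arrive on the Kafka topic from the Filebeat config above (broker addresses and topic name are illustrative):

```conf
# logstash.conf -- minimal pipeline sketch; values are illustrative
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["application-logs"]
    codec => "json"
  }
}

filter {
  # Drop noisy debug entries before they reach storage
  if [level] == "DEBUG" {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```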

Grafana Loki is gaining popularity for its lower resource requirements. It doesn’t index log content—only metadata—making it cheaper to run at scale.
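
Shipping into Loki typically goes through Promtail. A minimal scrape config sketch, assuming the same /var/log/app path as the Filebeat example (URL, port, and labels are illustrative):

```yaml
# promtail-config.yml -- minimal sketch; paths and labels are illustrative
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app-logs
          __path__: /var/log/app/*.log
```

Keep labels low-cardinality (job, environment, service) — in Loki, every unique label combination creates a new stream.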

Cloud-native options like AWS CloudWatch, Google Cloud Logging, and Azure Monitor handle infrastructure for you but lock you into their ecosystem and can get expensive at high volumes.

Implementation Best Practices

Log Retention Policies prevent storage costs from spiraling. Configure index lifecycle management:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

This keeps recent logs on fast storage, moves older logs to cheaper storage, and deletes after 90 days.
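
To take effect, the policy has to be installed via the ILM API and attached to new indices through an index template. A sketch assuming Elasticsearch on localhost:9200 and the policy body saved as logs-policy.json (the policy and template names are illustrative):

```shell
# Install the lifecycle policy
curl -X PUT "http://localhost:9200/_ilm/policy/logs-policy" \
  -H 'Content-Type: application/json' \
  -d @logs-policy.json

# Attach it to new indices matching logs-*
curl -X PUT "http://localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["logs-*"],
    "template": {
      "settings": {
        "index.lifecycle.name": "logs-policy",
        "index.lifecycle.rollover_alias": "logs"
      }
    }
  }'
```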

Sampling is essential for high-traffic systems. You don’t need every successful health check logged. Implement smart sampling:

import random

def should_log(level, endpoint):
    if level == 'ERROR':
        return True  # Always log errors
    if endpoint == '/health':
        return random.random() < 0.01  # 1% sampling
    if endpoint.startswith('/api/'):
        return random.random() < 0.1  # 10% sampling
    return True

Security matters. Encrypt logs in transit and at rest. Implement role-based access control so developers can only see logs from their services. Redact sensitive data before logging:

import re

def sanitize_log(data):
    """Mask sensitive fields before a record is logged."""
    if 'credit_card' in data:
        # Replace every digit, preserving separators such as dashes
        data['credit_card'] = re.sub(r'\d', '*', data['credit_card'])
    if 'ssn' in data:
        data['ssn'] = '***-**-****'
    return data

Querying and Alerting

Effective querying requires understanding your storage backend’s query language. For Elasticsearch/Kibana (KQL and Lucene syntax):

# Find errors for specific user
level:ERROR AND user_id:12345

# Find slow requests in the last hour
response_time > 1000 AND @timestamp > now-1h

# Trace a request across services
correlation_id:"a1b2c3d4-e5f6-7890-abcd-ef1234567890"

# Find 5xx errors excluding health checks
status:[500 TO 599] AND NOT endpoint:"/health"

Set up alerts for critical issues. Here’s an Elastalert rule:

name: High Error Rate
type: frequency
index: logs-*
num_events: 50
timeframe:
  minutes: 5
filter:
- term:
    level: "ERROR"
- term:
    service: "payment-service"
alert:
- slack:
    slack_webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
alert_text: "Payment service error rate exceeded threshold"

Performance and Scalability

Use Kafka for buffering at scale. Configure topics appropriately:

kafka-topics.sh --create \
  --topic application-logs \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=86400000 \
  --config compression.type=lz4 \
  --config segment.bytes=1073741824

This creates a topic with 12 partitions (for parallel processing), 3x replication (for durability), 24-hour retention, and LZ4 compression.
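
Keying messages (for example by service name) decides which partition each log lands on, which keeps one service's logs ordered within a single partition. Here's a simplified, illustrative stand-in for Kafka's default key partitioner (the real one uses murmur2 hashing, not MD5):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int = 12) -> int:
    # Simplified stand-in for Kafka's murmur2-based key partitioner:
    # a stable hash means the same key always maps to the same
    # partition, so one service's logs stay ordered within it
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions

print(partition_for(b'payment-service'))  # always the same partition
print(partition_for(b'user-service'))
```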

Monitor your logging infrastructure itself. If your logging system goes down during an incident, you’re blind. Set up separate alerting for:

  • Shipper lag/errors
  • Queue depth
  • Storage disk usage
  • Indexing rate vs. ingestion rate
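
For the Kafka buffer, queue depth shows up as consumer-group lag. A sketch assuming the downstream consumers (e.g. Logstash) use a group named logstash-consumers:

```shell
# Show per-partition lag for the log pipeline's consumers
# (group name is illustrative; run against one of your brokers)
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
  --describe --group logstash-consumers
```

Steadily growing lag means logs are being produced faster than they're indexed — exactly the signal you want before the buffer fills.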

Conclusion

Centralized logging isn’t optional for modern systems. Start with structured JSON logs and correlation IDs. Choose a stack that fits your scale—ELK for full-featured needs, Loki for cost efficiency, or cloud-native for managed simplicity. Implement retention policies from day one, and remember to monitor the monitors. Your future self, debugging a production incident at 2 AM, will thank you.
