System Design: Message Queues (Kafka, RabbitMQ, SQS)

Key Insights

  • Choose Kafka for high-throughput event streaming and replay capabilities, RabbitMQ for complex routing patterns, and SQS when you want managed simplicity in AWS environments.
  • Message ordering guarantees come with tradeoffs—Kafka guarantees order within partitions, RabbitMQ within single queues, and SQS FIFO within message groups, but scaling any of these while maintaining strict ordering requires careful design.
  • “Exactly-once delivery” is largely a myth at the application level—design your consumers to be idempotent regardless of which queue technology you choose.

Introduction to Message Queues

Message queues decouple services by introducing an intermediary that stores and forwards messages between producers and consumers. Instead of Service A calling Service B directly and waiting for a response, Service A drops a message in a queue and moves on. Service B processes it when ready.

This decoupling provides three critical benefits: temporal decoupling (services don’t need to be available simultaneously), load leveling (queues absorb traffic spikes), and failure isolation (one service crashing doesn’t cascade).

The core components are straightforward:

  • Producers send messages to the queue
  • Consumers read and process messages
  • Brokers store messages and manage delivery
  • Topics/Queues organize messages into logical channels

Use async messaging when operations don’t need immediate responses, when you need to handle traffic spikes gracefully, or when you want to enable event-driven architectures. Stick with synchronous calls for user-facing operations requiring immediate feedback or simple request-response patterns where latency matters.
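The decoupling contract can be sketched with Python's in-process `queue.Queue`; a real system puts a broker between the services, but the producer/consumer relationship is the same (the `orders` queue and the order payloads here are illustrative):

```python
import queue
import threading

orders = queue.Queue()  # Stands in for the broker

def producer():
    # Service A: enqueue work and move on without waiting for a response
    for order_id in range(5):
        orders.put({"order_id": order_id})

def consumer(processed):
    # Service B: pull messages whenever it is ready
    while True:
        try:
            order = orders.get(timeout=0.5)
        except queue.Empty:
            break
        processed.append(order["order_id"])
        orders.task_done()

processed = []
producer()
worker = threading.Thread(target=consumer, args=(processed,))
worker.start()
worker.join()
print(processed)  # → [0, 1, 2, 3, 4]
```

Note that the producer finishes before the consumer even starts—that is temporal decoupling in miniature.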

Apache Kafka: High-Throughput Event Streaming

Kafka isn’t just a message queue—it’s a distributed commit log. This distinction matters because Kafka retains messages after consumption, enabling replay and multiple consumer groups to read the same data independently.

Architecture fundamentals:

  • Topics are split into partitions, which are the unit of parallelism
  • Consumer groups allow multiple instances to share the load—each partition is assigned to exactly one consumer in a group
  • Replication across brokers provides fault tolerance

Kafka shines for event sourcing, log aggregation, and real-time analytics pipelines. If you need to process millions of events per second or replay historical data, Kafka is your tool.

from confluent_kafka import Producer, Consumer, KafkaError
import json

# Producer with partition key for ordering
def produce_order_event(producer, order):
    # Using customer_id as key ensures all orders for a customer
    # go to the same partition, maintaining order
    producer.produce(
        topic='orders',
        key=order['customer_id'].encode('utf-8'),
        value=json.dumps(order).encode('utf-8'),
        callback=delivery_report
    )
    producer.poll(0)  # Serve delivery callbacks; reserve flush() for shutdown

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()}[{msg.partition()}]')

# Consumer with manual offset management
def consume_orders():
    consumer = Consumer({
        'bootstrap.servers': 'localhost:9092',
        'group.id': 'order-processor',
        'auto.offset.reset': 'earliest',
        'enable.auto.commit': False  # Manual commit for reliability
    })
    consumer.subscribe(['orders'])
    
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue
            raise Exception(msg.error())
        
        order = json.loads(msg.value().decode('utf-8'))
        process_order(order)
        consumer.commit(asynchronous=False)  # Commit after processing

The partition key is crucial—messages with the same key always land in the same partition, guaranteeing order for that key. Choose keys wisely based on your ordering requirements.
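The key-to-partition mapping can be sketched as a stable hash modulo the partition count (Kafka's default partitioner actually uses murmur2; md5 here is just for illustration, and any deterministic hash shows the property):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash modulo partition count: the same key
    # always maps to the same partition
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same customer, same partition, so that customer's events stay ordered
assert partition_for("customer-42", 12) == partition_for("customer-42", 12)
```

The `% num_partitions` term also shows why adding partitions later remaps existing keys, breaking per-key ordering across the resize—another reason to choose partition counts carefully up front.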

RabbitMQ: Flexible Message Routing

RabbitMQ implements AMQP and excels at complex routing scenarios. Its exchange-binding-queue model provides flexibility that Kafka’s partition model can’t match.

Exchange types determine routing:

  • Direct: Routes to queues with matching routing keys
  • Fanout: Broadcasts to all bound queues
  • Topic: Pattern-based routing with wildcards
  • Headers: Routes based on message headers

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Direct exchange for specific routing
channel.exchange_declare(exchange='orders', exchange_type='direct')
channel.queue_declare(queue='high_priority_orders', durable=True)
channel.queue_declare(queue='standard_orders', durable=True)

channel.queue_bind(exchange='orders', queue='high_priority_orders', 
                   routing_key='priority.high')
channel.queue_bind(exchange='orders', queue='standard_orders', 
                   routing_key='priority.standard')

# Fanout for broadcasting events
channel.exchange_declare(exchange='order_events', exchange_type='fanout')
channel.queue_declare(queue='inventory_updates')
channel.queue_declare(queue='analytics')
channel.queue_declare(queue='notifications')

for queue in ['inventory_updates', 'analytics', 'notifications']:
    channel.queue_bind(exchange='order_events', queue=queue)

# Dead letter queue configuration
channel.queue_declare(
    queue='orders_dlq',
    durable=True
)
channel.queue_declare(
    queue='orders_processing',
    durable=True,
    arguments={
        'x-dead-letter-exchange': '',
        'x-dead-letter-routing-key': 'orders_dlq',
        'x-message-ttl': 300000  # 5 minutes
    }
)

RabbitMQ’s dead letter queues are essential for production. When messages fail processing or expire, they’re automatically routed to the DLQ for investigation rather than being lost.
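A consumer sketch to go with the queues declared above: a pika-style callback acks successes and nacks failures with `requeue=False`, which is what routes the message to the configured dead letter exchange (`process_order` is a placeholder for your business logic):

```python
import json

def process_order(order):
    # Placeholder business logic
    if "order_id" not in order:
        raise ValueError("malformed order")

def on_message(ch, method, properties, body):
    try:
        order = json.loads(body)
        process_order(order)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False hands the message to the dead-letter exchange
        # configured on the queue, instead of redelivering it forever
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

# Wiring it up with the channel from above:
# channel.basic_consume(queue='orders_processing', on_message_callback=on_message)
# channel.start_consuming()
```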

Amazon SQS: Managed Simplicity

SQS eliminates operational overhead entirely. No brokers to manage, no clusters to scale—AWS handles everything.

Two queue types serve different needs:

  • Standard queues: Nearly unlimited throughput, at-least-once delivery, best-effort ordering
  • FIFO queues: Exactly-once processing, strict ordering within message groups, 3,000 messages/second with batching

import boto3
import json

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/orders.fifo'

def send_order(order):
    response = sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(order),
        MessageGroupId=order['customer_id'],  # FIFO ordering key
        MessageDeduplicationId=order['order_id']  # Prevents duplicates
    )
    return response['MessageId']

# Lambda handler for SQS trigger with batch failure handling
def lambda_handler(event, context):
    batch_item_failures = []
    
    for record in event['Records']:
        try:
            order = json.loads(record['body'])
            process_order(order)
        except Exception as e:
            # Report failed item for retry
            batch_item_failures.append({
                'itemIdentifier': record['messageId']
            })
    
    return {'batchItemFailures': batch_item_failures}

The batchItemFailures response is critical: it tells SQS which specific messages failed, allowing successful messages to be deleted while failed ones return to the queue. Without this, a single failure causes the entire batch to retry. Note that it only takes effect if the function's SQS event source mapping has ReportBatchItemFailures enabled in its function response types.

Comparison Matrix & Decision Framework

| Aspect              | Kafka                        | RabbitMQ                | SQS                                 |
|---------------------|------------------------------|-------------------------|-------------------------------------|
| Throughput          | Millions/sec                 | Tens of thousands/sec   | Thousands/sec (FIFO)                |
| Ordering            | Per partition                | Per queue               | Per message group (FIFO)            |
| Delivery            | At-least-once (configurable) | At-least-once with acks | At-least-once / exactly-once (FIFO) |
| Message replay      | Yes                          | No                      | No                                  |
| Ops overhead        | High                         | Medium                  | None                                |
| Routing flexibility | Limited                      | Excellent               | Basic                               |

Decision framework:

  1. Need message replay or event sourcing? → Kafka
  2. Complex routing requirements? → RabbitMQ
  3. Running on AWS with simple needs? → SQS
  4. Processing millions of events/second? → Kafka
  5. Need RPC-style patterns? → RabbitMQ
  6. Want zero operational overhead? → SQS

Cost scales differently too. Kafka requires dedicated infrastructure regardless of usage. SQS charges per request, making it economical for variable workloads but expensive at high volumes.
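As a rough sketch of how per-request pricing plays out—assuming a list price of $0.40 per million requests for standard queues (check current AWS pricing; FIFO costs more, and the numbers below ignore the free tier):

```python
def sqs_monthly_cost(messages_per_month: int,
                     price_per_million: float = 0.40) -> float:
    # Each message is at least two billed requests: one SendMessage
    # and one ReceiveMessage (deletes and empty long-poll receives add more)
    requests = messages_per_month * 2
    return requests / 1_000_000 * price_per_million

print(round(sqs_monthly_cost(100_000_000), 2))     # 100M msgs/month → 80.0
print(round(sqs_monthly_cost(10_000_000_000), 2))  # 10B msgs/month → 8000.0
```

At tens of millions of messages a month this is pocket change next to a three-broker Kafka cluster; at tens of billions, the fixed-cost cluster starts winning.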

Common Patterns & Pitfalls

Fan-out pattern: One event triggers multiple independent processes. Use Kafka consumer groups, RabbitMQ fanout exchanges, or SNS-to-SQS fan-out in AWS.

Competing consumers: Multiple instances process from the same queue for horizontal scaling. All three technologies support this natively.

Saga/choreography: Services react to events and publish their own, creating distributed workflows without a central coordinator.

Pitfalls to avoid:

Poison messages crash consumers repeatedly. Always implement dead letter queues and set retry limits.
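A retry-limit sketch for SQS: every redelivery increments the message's `ApproximateReceiveCount` attribute, so a consumer can stop retrying past a threshold and divert the message for investigation (SQS redrive policies can also enforce this broker-side; `MAX_RECEIVES` is an illustrative constant):

```python
MAX_RECEIVES = 5  # Illustrative threshold

def should_dead_letter(record: dict) -> bool:
    # SQS attaches ApproximateReceiveCount (as a string) to each delivery;
    # past the threshold, park the message instead of retrying
    receives = int(record.get("attributes", {})
                         .get("ApproximateReceiveCount", "1"))
    return receives > MAX_RECEIVES

assert not should_dead_letter({"attributes": {"ApproximateReceiveCount": "2"}})
assert should_dead_letter({"attributes": {"ApproximateReceiveCount": "6"}})
```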

Consumer lag means you’re falling behind. Monitor it religiously—it’s your early warning system.
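Lag itself is just the gap between the log end offset and the committed offset, per partition. A sketch (in a real Kafka deployment you would read these from `Consumer.get_watermark_offsets()` and `Consumer.committed()`, or from a tool like Burrow):

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    # Lag per partition = newest offset in the log minus last committed offset;
    # a partition with no commits yet counts from offset 0
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

lag = consumer_lag({0: 1_500, 1: 900}, {0: 1_480, 1: 900})
print(lag)  # {0: 20, 1: 0}
```

Alert on lag that grows monotonically, not on any nonzero value—a busy but keeping-up consumer always shows some lag.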

The “exactly-once” illusion: Network failures can cause duplicate deliveries in any system. Build idempotent consumers:

import redis

redis_client = redis.Redis()

def idempotent_process_order(order):
    idempotency_key = f"processed:{order['order_id']}"
    
    # SET NX atomically claims the key: it returns None if another
    # consumer already processed (or is processing) this order.
    # A separate exists() check followed by a write would be racy.
    if not redis_client.set(idempotency_key, 'processed', nx=True, ex=86400):
        return  # Already processed, skip
    
    try:
        return actually_process_order(order)
    except Exception:
        # Release the claim so a redelivered message can retry
        redis_client.delete(idempotency_key)
        raise

Production Considerations

Monitoring essentials:

  • Consumer lag (Kafka), queue depth (RabbitMQ/SQS)
  • Processing latency percentiles
  • Error rates and DLQ growth
  • Producer acknowledgment failures

Schema evolution breaks consumers when done carelessly. Use schema registries (Confluent for Kafka), version your message formats, and make changes backward-compatible. Adding optional fields is safe; removing or renaming fields requires coordination.
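A minimal sketch of the backward-compatible pattern: the consumer reads new optional fields with defaults, so messages produced before and after the change both parse (the field names and the v1/v2 split are illustrative):

```python
import json

def parse_order(raw: bytes) -> dict:
    msg = json.loads(raw)
    return {
        "order_id": msg["order_id"],              # Required since v1
        "customer_id": msg["customer_id"],        # Required since v1
        "currency": msg.get("currency", "USD"),   # Optional, added in v2
    }

old = parse_order(b'{"order_id": 1, "customer_id": "c1"}')
new = parse_order(b'{"order_id": 2, "customer_id": "c2", "currency": "EUR"}')
print(old["currency"], new["currency"])  # USD EUR
```

Deploy the tolerant consumer first, then start producing the new field; the reverse order breaks every in-flight message.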

When to consider alternatives:

  • Redis Streams: When you need Kafka-like semantics but already run Redis and have modest throughput needs
  • NATS: For lightweight, high-performance pub/sub without persistence requirements
  • Cloud Pub/Sub: Google Cloud’s managed option, similar positioning to SQS

Pick the tool that matches your actual requirements, not the one with the most impressive benchmarks. A well-operated RabbitMQ cluster beats a poorly configured Kafka deployment every time.
