System Design: Pub/Sub Messaging Pattern
Key Insights
- Pub/Sub decouples producers from consumers, enabling independent scaling and deployment, but introduces complexity around message ordering, delivery guarantees, and debugging distributed flows.
- Choose your delivery semantics deliberately: at-least-once delivery with idempotent handlers is the pragmatic default for most systems; exactly-once is expensive and often unnecessary.
- The right pub/sub technology depends on your throughput requirements, ordering guarantees, and operational capacity—Redis for simplicity, Kafka for scale, managed services for reduced ops burden.
Introduction to Pub/Sub
The publish-subscribe pattern fundamentally changes how components communicate. Instead of service A directly calling service B (request-response), service A publishes an event to a topic, and any interested services subscribe to receive it. The publisher doesn’t know or care who’s listening.
This inversion matters. In request-response, the caller must know the callee’s address, handle its failures, and wait for its response. With pub/sub, the publisher fires and forgets. Subscribers can be added, removed, or scaled without touching publisher code.
The trade-off is visibility. When service A calls service B, you can trace the request. When service A publishes “OrderCreated” and services B, C, and D react, understanding system behavior requires distributed tracing and careful logging. Pub/sub trades direct control for flexibility.
Core Components & Architecture
Every pub/sub system has three actors:
Publishers produce messages without knowledge of subscribers. They send messages to topics (or channels, or exchanges—terminology varies by system).
Message Broker receives messages from publishers, stores them (durably or transiently), and routes them to subscribers. The broker is the decoupling layer. Popular brokers include Kafka, RabbitMQ, and Redis.
Subscribers register interest in specific topics and receive matching messages. They can filter messages, process them synchronously or asynchronously, and acknowledge receipt.
The architecture looks like this: Publishers connect to the broker and push messages to named topics. The broker maintains subscriber registrations and either pushes messages to subscribers or allows them to pull. Subscribers process messages and acknowledge completion.
```typescript
// Core pub/sub abstractions
interface Message<T = unknown> {
  id: string;
  topic: string;
  payload: T;
  timestamp: Date;
  metadata: Record<string, string>;
}

interface Publisher {
  publish<T>(topic: string, payload: T, metadata?: Record<string, string>): Promise<void>;
}

interface Subscriber {
  subscribe<T>(
    topic: string,
    handler: (message: Message<T>) => Promise<void>
  ): Promise<Subscription>;
}

interface Subscription {
  unsubscribe(): Promise<void>;
}

// Message acknowledgment for reliable delivery
interface AcknowledgeableMessage<T> extends Message<T> {
  ack(): Promise<void>;
  nack(requeue?: boolean): Promise<void>;
}
```
These interfaces capture the essential contract. Publishers don’t return responses—they confirm the broker received the message. Subscribers receive messages with enough context to process them and signal completion.
When to Use Pub/Sub
Pub/sub shines in specific scenarios:
Event-driven architectures: When actions in one service should trigger reactions in others. Order placed? Notify inventory, shipping, and analytics services via a single published event.
Microservices communication: Services evolve independently. Pub/sub prevents tight coupling that makes deployments risky.
Real-time notifications: Push updates to many clients simultaneously. A single “PriceUpdated” event reaches all connected trading terminals.
Log aggregation: Applications publish logs; aggregators subscribe. Adding a new log destination doesn’t require application changes.
Workload distribution: Publish tasks to a topic; worker pools subscribe and process in parallel.
However, pub/sub is wrong when:
You need synchronous responses: If the caller must wait for a result, request-response is simpler. Don’t contort pub/sub into RPC.
Message ordering is critical across topics: Pub/sub guarantees ordering within a partition or topic, not globally. If you need strict cross-topic ordering, reconsider your design.
You have two services: The overhead of a message broker for point-to-point communication is rarely justified. Start with direct calls; add pub/sub when you have multiple consumers.
Debugging simplicity matters more than flexibility: Pub/sub makes request tracing harder. For simple systems, this complexity isn’t worth it.
Implementation Patterns
Fan-Out
One message reaches multiple subscribers. When “UserRegistered” is published, the email service, analytics service, and CRM integration all receive it independently. Each subscriber processes at its own pace.
Message Filtering
Subscribers specify criteria beyond topic matching. “Only send me OrderCreated events where order.total > 1000.” This reduces network traffic and processing load. Some brokers support server-side filtering; others require client-side filtering.
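When the broker doesn't support server-side filtering, a thin wrapper on the subscriber side does the job. A minimal sketch (the `filtered` helper and handler names are illustrative, not part of any broker API):

```python
# Client-side filtering sketch: wrap a handler so it only fires when a
# predicate on the message payload holds.
import asyncio
from typing import Awaitable, Callable

def filtered(
    predicate: Callable[[dict], bool],
    handler: Callable[[dict], Awaitable[None]],
) -> Callable[[dict], Awaitable[None]]:
    async def wrapper(payload: dict) -> None:
        if predicate(payload):
            await handler(payload)
    return wrapper

# Example: only process high-value orders
processed = []

async def handle_big_order(payload: dict) -> None:
    processed.append(payload["order_id"])

handler = filtered(lambda p: p.get("total", 0) > 1000, handle_big_order)

async def demo():
    await handler({"order_id": "a", "total": 50})    # filtered out
    await handler({"order_id": "b", "total": 2500})  # processed

asyncio.run(demo())
print(processed)  # ['b']
```

The trade-off versus server-side filtering: every message still crosses the network, but the processing cost of unwanted messages drops to a single predicate call.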
Dead Letter Queues
Messages that repeatedly fail processing go to a dead letter queue (DLQ) instead of blocking the main queue. This prevents poison messages from halting your system. Monitor DLQ depth and investigate failures.
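The retry-then-divert logic can be sketched broker-agnostically. This is an illustrative shape, not any specific broker's API; a real DLQ would be another topic or queue rather than a Python list:

```python
# Dead-letter sketch: retry a handler up to max_attempts, then divert the
# message to a DLQ instead of blocking the main queue on a poison message.
import asyncio
from typing import Awaitable, Callable

async def deliver_with_dlq(
    message: dict,
    handler: Callable[[dict], Awaitable[None]],
    dlq: list,
    max_attempts: int = 3,
) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            await handler(message)
            return True
        except Exception:
            if attempt == max_attempts:
                dlq.append(message)  # Park the poison message for inspection
                return False
    return False

# A handler that always fails simulates a poison message
async def failing_handler(message: dict) -> None:
    raise ValueError("cannot parse payload")

dlq: list = []
ok = asyncio.run(deliver_with_dlq({"id": "m1"}, failing_handler, dlq))
print(ok, dlq)  # False [{'id': 'm1'}]
```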
Delivery Semantics
At-most-once: Fire and forget. Message might be lost. Fastest, but unreliable.
At-least-once: Broker retries until acknowledged. Messages might be delivered multiple times. Requires idempotent handlers.
Exactly-once: Each message processed exactly once. Requires coordination between broker and consumer, typically via transactions. Expensive and complex.
For most systems, at-least-once with idempotent handlers is the right choice. Here’s a complete in-memory implementation:
```python
import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Awaitable, Callable, Dict, List, Optional

@dataclass
class Message:
    id: str
    topic: str
    payload: dict
    timestamp: datetime = field(default_factory=datetime.utcnow)
    metadata: dict = field(default_factory=dict)

class InMemoryPubSub:
    def __init__(self):
        self._subscribers: Dict[str, List[Callable[[Message], Awaitable[None]]]] = {}
        self._lock = asyncio.Lock()

    async def publish(self, topic: str, payload: dict, metadata: Optional[dict] = None) -> str:
        message = Message(
            id=str(uuid.uuid4()),
            topic=topic,
            payload=payload,
            metadata=metadata or {},
        )
        async with self._lock:
            handlers = self._subscribers.get(topic, []).copy()
        # Fan-out: deliver to all subscribers concurrently
        if handlers:
            await asyncio.gather(
                *[handler(message) for handler in handlers],
                return_exceptions=True,  # Don't let one failure stop others
            )
        return message.id

    async def subscribe(
        self,
        topic: str,
        handler: Callable[[Message], Awaitable[None]],
    ) -> Callable[[], Awaitable[None]]:
        async with self._lock:
            self._subscribers.setdefault(topic, []).append(handler)

        async def unsubscribe():
            async with self._lock:
                self._subscribers[topic].remove(handler)

        return unsubscribe

# Usage example
async def main():
    pubsub = InMemoryPubSub()

    async def order_handler(msg: Message):
        print(f"Processing order: {msg.payload['order_id']}")

    async def analytics_handler(msg: Message):
        print(f"Recording analytics for: {msg.payload['order_id']}")

    await pubsub.subscribe("orders.created", order_handler)
    await pubsub.subscribe("orders.created", analytics_handler)
    await pubsub.publish("orders.created", {"order_id": "12345", "total": 99.99})

asyncio.run(main())
```
Popular Pub/Sub Technologies Compared
| Technology | Throughput | Durability | Ordering | Complexity |
|---|---|---|---|---|
| Apache Kafka | Very High | Persistent | Per-partition | High |
| RabbitMQ | Medium | Configurable | Per-queue | Medium |
| Redis Pub/Sub | High | None (fire-forget) | None | Low |
| AWS SNS/SQS | High | Persistent | FIFO available | Low (managed) |
| Google Cloud Pub/Sub | High | Persistent | Per-key | Low (managed) |
Kafka excels at high-throughput event streaming with replay capability. Choose it for event sourcing, log aggregation at scale, or when you need message replay. Expect operational complexity.
RabbitMQ offers flexible routing with exchanges and bindings. Good for complex routing logic and when you need protocol flexibility (AMQP, MQTT, STOMP).
Redis Pub/Sub is simple and fast but provides no persistence or delivery guarantees. Use for real-time features where occasional message loss is acceptable. Redis Streams adds durability if needed.
Managed services (SNS/SQS, Cloud Pub/Sub) reduce operational burden. Choose these unless you have specific requirements they can’t meet or cost constraints at scale.
```python
# Redis Pub/Sub example
import json
import time
from threading import Thread

import redis

def publisher():
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    event = {
        "event_type": "order.created",
        "order_id": "ord_123",
        "customer_id": "cust_456",
        "total": 150.00,
    }
    r.publish("orders", json.dumps(event))
    print(f"Published: {event}")

def subscriber():
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    pubsub = r.pubsub()
    pubsub.subscribe("orders")
    print("Subscribed to 'orders' channel")
    for message in pubsub.listen():
        if message["type"] == "message":
            event = json.loads(message["data"])
            print(f"Received: {event}")

# Run subscriber in background, then publish
sub_thread = Thread(target=subscriber, daemon=True)
sub_thread.start()
time.sleep(0.1)  # Let subscriber connect before publishing
publisher()
time.sleep(0.1)  # Give the daemon thread a moment to print before exit
```
Production Considerations
Message Ordering
Most brokers guarantee ordering within a partition or queue, not globally. Design for this: use consistent partition keys (e.g., customer_id) so related messages arrive in order. Accept that messages for different keys may arrive out of order.
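The key-to-partition mapping is just a stable hash modulo the partition count. Real brokers use their own hash functions (Kafka defaults to murmur2, for example); this sketch only illustrates the idea:

```python
# Partition-key sketch: hash a key (e.g. customer_id) to a partition index so
# all messages for the same key land on the same partition and stay ordered.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # CRC32 is stable across runs (unlike Python's randomized hash())
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("cust_456", 12)
p2 = partition_for("cust_456", 12)
print(p1 == p2)  # True: same key always maps to the same partition
```

The corollary: changing the partition count remaps keys and breaks ordering during the transition, so size partitions generously up front.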
Idempotency
At-least-once delivery means handlers must tolerate duplicates. Track processed message IDs and skip duplicates:
```python
from datetime import datetime, timedelta
from typing import Awaitable, Callable

class IdempotentHandler:
    def __init__(
        self,
        handler: Callable[[Message], Awaitable[None]],
        ttl_seconds: int = 3600,
    ):
        self._handler = handler
        self._processed: dict[str, datetime] = {}
        self._ttl = timedelta(seconds=ttl_seconds)

    async def handle(self, message: Message) -> bool:
        """Returns True if message was processed, False if duplicate."""
        self._cleanup_expired()
        if message.id in self._processed:
            print(f"Skipping duplicate message: {message.id}")
            return False
        await self._handler(message)
        self._processed[message.id] = datetime.utcnow()
        return True

    def _cleanup_expired(self):
        """Remove old entries to prevent memory growth."""
        cutoff = datetime.utcnow() - self._ttl
        self._processed = {
            msg_id: timestamp
            for msg_id, timestamp in self._processed.items()
            if timestamp > cutoff
        }

# Usage (inside an async context, with Message as defined earlier)
async def process_payment(msg: Message):
    print(f"Charging customer for order {msg.payload['order_id']}")

idempotent_processor = IdempotentHandler(process_payment)
# First call processes
await idempotent_processor.handle(message)  # Returns True
# Duplicate is skipped
await idempotent_processor.handle(message)  # Returns False
```
In production, use Redis or a database for deduplication state instead of in-memory storage.
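The shared-store version reduces to an atomic "claim" operation per message ID. With Redis that's a single `SET key value NX EX ttl` call (redis-py: `r.set(key, "1", nx=True, ex=ttl)`); the sketch below substitutes an in-memory dict with the same semantics so it runs without a server:

```python
# Shared dedup sketch: "claim" a message ID atomically before processing.
# An in-memory dict stands in for Redis SET NX EX here.
import time

class InMemoryDedupStore:
    def __init__(self):
        self._claimed: dict[str, float] = {}

    def claim(self, message_id: str, ttl_seconds: int = 3600) -> bool:
        """Returns True if this is the first claim within the TTL window."""
        now = time.monotonic()
        expiry = self._claimed.get(message_id)
        if expiry is not None and expiry > now:
            return False  # Already claimed: duplicate, skip processing
        self._claimed[message_id] = now + ttl_seconds
        return True

store = InMemoryDedupStore()
print(store.claim("msg-1"))  # True: first delivery, process it
print(store.claim("msg-1"))  # False: duplicate, skip it
```

Note the claim happens before processing, so a crash between claiming and processing loses the message; for payment-grade workloads, record the claim and the side effect in the same database transaction.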
Backpressure
When subscribers can’t keep up, messages queue up. Monitor queue depth. Options: add consumers, drop messages (if acceptable), or apply backpressure to publishers.
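The "apply backpressure to publishers" option can be sketched with a bounded queue: when the buffer fills, the publisher's `put` blocks until the consumer drains it, which naturally throttles publishing to the consumer's pace. A minimal `asyncio` sketch (the tiny bound and sleep are just for the demo):

```python
# Backpressure sketch: a bounded asyncio.Queue blocks the publisher when the
# consumer falls behind, instead of letting messages pile up without limit.
import asyncio

async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # Small bound for the demo
    consumed = []

    async def consumer():
        while True:
            item = await queue.get()
            if item is None:  # Sentinel: stop consuming
                break
            await asyncio.sleep(0.01)  # Simulate slow processing
            consumed.append(item)

    task = asyncio.create_task(consumer())
    for i in range(5):
        await queue.put(i)  # Blocks when the queue is full: backpressure
    await queue.put(None)
    await task
    return consumed

result = asyncio.run(demo())
print(result)  # [0, 1, 2, 3, 4]
```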
Monitoring
Track: messages published/consumed per second, consumer lag (how far behind subscribers are), error rates, and DLQ depth. Alert on consumer lag growth—it indicates processing can’t keep up with publishing.
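The lag computation itself is trivial, which is exactly why it makes a good alert signal. A sketch of the two checks (function names are illustrative; real deployments read offsets from the broker's admin API):

```python
# Consumer-lag sketch: lag = latest published offset minus the consumer's
# committed offset. Alert when lag grows across successive samples.
def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    return latest_offset - committed_offset

def lag_growing(samples: list[int]) -> bool:
    """True if lag strictly increased at every sample: a backlog is building."""
    return all(b > a for a, b in zip(samples, samples[1:]))

print(consumer_lag(1500, 1200))      # 300 messages behind
print(lag_growing([100, 250, 600]))  # True: consumers can't keep up
print(lag_growing([100, 90, 95]))    # False: holding steady
```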
Summary & Best Practices
Pub/sub enables loose coupling and independent scaling but requires careful design around delivery guarantees, ordering, and observability.
Implementation Checklist:
- Define clear message schemas with versioning strategy
- Choose delivery semantics (at-least-once is usually right)
- Implement idempotent handlers—assume duplicates will happen
- Set up dead letter queues for failed messages
- Use consistent partition keys for ordering requirements
- Add correlation IDs for distributed tracing
- Monitor consumer lag and queue depth
- Document topic ownership and message contracts
- Plan for schema evolution (backward compatibility)
- Test failure scenarios: broker down, slow consumers, poison messages
Start simple. A managed service with at-least-once delivery and idempotent handlers solves most problems. Add complexity only when requirements demand it.