System Design: Event-Driven Architecture

Key Insights

  • Event-driven architecture decouples services temporally and spatially, enabling independent scaling and deployment, but introduces complexity in debugging, ordering guarantees, and eventual consistency that you must explicitly design for.
  • The choice between choreography (services react to events independently) and orchestration (a central coordinator manages flow) depends on your domain complexity—start with choreography for simple flows, but don’t hesitate to introduce orchestrators when business logic becomes convoluted.
  • Events are your API contract—invest heavily in schema design, versioning strategies, and event documentation from day one, because changing event structures in production is significantly harder than modifying REST endpoints.

Introduction to Event-Driven Architecture

Event-driven architecture (EDA) flips the traditional request-response model on its head. Instead of Service A calling Service B and waiting for a response, Service A publishes an event describing what happened, and any interested services react independently.

This distinction matters. In request-response, the caller knows about the callee. In EDA, the producer has no knowledge of who consumes its events. This inversion creates loose coupling that enables teams to deploy, scale, and evolve services independently.

Let’s establish terminology. An event is an immutable fact about something that happened—OrderPlaced, PaymentProcessed, InventoryReserved. A producer publishes events without knowing who will consume them. A consumer subscribes to events and reacts accordingly. A broker (Kafka, RabbitMQ, AWS EventBridge) sits between producers and consumers, handling routing, persistence, and delivery guarantees.

Choose EDA when you need to decouple services that evolve at different rates, when multiple consumers need to react to the same business event, or when you need to build audit trails and replay capabilities. Avoid it for simple CRUD applications or when strong consistency is non-negotiable.

Core Components and Patterns

Your broker choice shapes your architecture. Apache Kafka excels at high-throughput, ordered event streams with replay capability—ideal for event sourcing. RabbitMQ provides flexible routing and is simpler to operate for traditional pub/sub. AWS EventBridge offers serverless event routing with native AWS integrations.

Three patterns dominate EDA implementations:

Pub/Sub is the foundation—producers publish to topics, consumers subscribe. Simple and effective for notification-style events.

Event Sourcing stores state as a sequence of events rather than current state. Instead of updating an order record, you append OrderCreated, ItemAdded, OrderShipped events. Current state is derived by replaying events.
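Replaying events is just a fold over the log: start from an empty state and apply each event in order. A minimal sketch of that idea, where the event names and `OrderState` fields are illustrative rather than taken from a specific library:

```python
from dataclasses import dataclass, field

@dataclass
class OrderState:
    order_id: str = ""
    items: list = field(default_factory=list)
    status: str = "NEW"

def apply_event(state: OrderState, event: dict) -> OrderState:
    # Each event type folds into the derived state in one well-defined way.
    etype = event["eventType"]
    if etype == "order.created.v1":
        state.order_id = event["payload"]["orderId"]
        state.status = "CREATED"
    elif etype == "item.added.v1":
        state.items.append(event["payload"]["sku"])
    elif etype == "order.shipped.v1":
        state.status = "SHIPPED"
    return state

def replay(events: list) -> OrderState:
    # Current state is never stored directly; it is rebuilt from the log.
    state = OrderState()
    for event in events:
        state = apply_event(state, event)
    return state
```

In production you would snapshot periodically so replay doesn't start from event zero every time.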

CQRS (Command Query Responsibility Segregation) separates read and write models. Commands produce events that update the write model; events also update denormalized read models optimized for queries.
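The read-model side of CQRS is a projection: a function that folds events into a query-optimized view. A sketch, using a plain dict where a real system would use a dedicated query store such as a document database:

```python
# Denormalized read model keyed by orderId; stands in for a real query store.
read_model = {}

def project(event):
    # Projections translate write-side events into query-optimized rows.
    payload = event["payload"]
    if event["eventType"] == "order.placed.v1":
        read_model[payload["orderId"]] = {
            "customerId": payload["customerId"],
            "totalInCents": payload["totalInCents"],
            "status": "PLACED",
        }
    elif event["eventType"] == "payment.completed.v1":
        read_model[payload["orderId"]]["status"] = "PAID"
```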

Event schema design deserves serious attention. Here’s a practical schema with versioning:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "eventId": {
      "type": "string",
      "format": "uuid",
      "description": "Unique identifier for idempotency"
    },
    "eventType": {
      "type": "string",
      "enum": ["order.placed.v1", "order.placed.v2"]
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    },
    "correlationId": {
      "type": "string",
      "format": "uuid",
      "description": "Links related events across services"
    },
    "payload": {
      "type": "object",
      "properties": {
        "orderId": { "type": "string" },
        "customerId": { "type": "string" },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "sku": { "type": "string" },
              "quantity": { "type": "integer" },
              "priceInCents": { "type": "integer" }
            },
            "required": ["sku", "quantity", "priceInCents"]
          }
        },
        "totalInCents": { "type": "integer" }
      },
      "required": ["orderId", "customerId", "items", "totalInCents"]
    }
  },
  "required": ["eventId", "eventType", "timestamp", "payload"]
}

Version your event types explicitly in the type name. When you need breaking changes, introduce order.placed.v2 while maintaining v1 consumers during migration.
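One common migration tactic is to "upcast" older versions to the newest shape at the consumer boundary, so business logic only ever handles one schema. A sketch, assuming for illustration that v2 added a `currency` field:

```python
def upcast(event):
    # Normalize older versions to the newest shape before handling.
    if event["eventType"] == "order.placed.v1":
        upgraded = dict(event, eventType="order.placed.v2")
        # Hypothetical v2 addition: a currency field, defaulted for v1 events.
        upgraded["payload"] = dict(event["payload"], currency="USD")
        return upgraded
    return event
```

Consumers call `upcast` first and then process everything as v2, which keeps the version shims in one place.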

Designing Event Flows

Model events around business facts, not technical operations. OrderPlaced is a business event; OrderTableUpdated is not. Events should be meaningful to domain experts.

Consider an e-commerce order flow. When a customer places an order, multiple services react:

import uuid
from datetime import datetime

# Order Service publishes the initial event
order_placed_event = {
    "eventId": str(uuid.uuid4()),
    "eventType": "order.placed.v1",
    "timestamp": datetime.utcnow().isoformat(),
    "correlationId": correlation_id,  # generated once at the flow's entry point
    "payload": {
        "orderId": "ord_123",
        "customerId": "cust_456",
        "items": [
            {"sku": "WIDGET-001", "quantity": 2, "priceInCents": 1999}
        ],
        "totalInCents": 3998
    }
}

# Inventory Service consumes and reacts
@event_handler("order.placed.v1")
async def reserve_inventory(event):
    order = event["payload"]
    for item in order["items"]:
        await inventory_repo.reserve(item["sku"], item["quantity"], order["orderId"])
    
    await publish_event({
        "eventType": "inventory.reserved.v1",
        "correlationId": event["correlationId"],
        "payload": {"orderId": order["orderId"]}
    })

# Payment Service consumes and reacts
@event_handler("order.placed.v1")
async def initiate_payment(event):
    order = event["payload"]
    payment_intent = await payment_gateway.create_charge(
        customer_id=order["customerId"],
        amount_cents=order["totalInCents"]
    )
    
    await publish_event({
        "eventType": "payment.initiated.v1",
        "correlationId": event["correlationId"],
        "payload": {"orderId": order["orderId"], "paymentId": payment_intent.id}
    })

# Notification Service consumes and reacts
@event_handler("order.placed.v1")
async def send_confirmation(event):
    order = event["payload"]
    await email_service.send_order_confirmation(order["customerId"], order["orderId"])

Handle idempotency explicitly. Consumers may receive the same event multiple times due to retries or rebalancing. Store processed eventId values and skip duplicates:

async def process_event(event):
    if await event_store.already_processed(event["eventId"]):
        logger.info(f"Skipping duplicate event {event['eventId']}")
        return
    
    await handle_business_logic(event)
    await event_store.mark_processed(event["eventId"])
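The `event_store` above can be backed by anything with an atomic insert-if-absent. An in-memory sketch of that interface; production variants would use Redis `SET NX` or a unique-constrained table so the check-and-mark cannot race across instances:

```python
import asyncio

class InMemoryEventStore:
    """Tracks processed event IDs. Single-process only: the lock guards
    concurrent tasks, not multiple consumer instances."""

    def __init__(self):
        self._processed = set()
        self._lock = asyncio.Lock()

    async def already_processed(self, event_id):
        async with self._lock:
            return event_id in self._processed

    async def mark_processed(self, event_id):
        async with self._lock:
            self._processed.add(event_id)
```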

Implementation Patterns

Here’s a production-ready Kafka consumer with proper error handling:

import asyncio
import json
import logging
from datetime import datetime

from confluent_kafka import Consumer, Producer, KafkaError

class TransientError(Exception):
    """Retryable failure, e.g. a downstream timeout."""

class PermanentError(Exception):
    """Non-retryable failure, e.g. a malformed payload."""

class EventConsumer:
    def __init__(self, config, topics, handler):
        self.consumer = Consumer({
            'bootstrap.servers': config['brokers'],
            'group.id': config['consumer_group'],
            'auto.offset.reset': 'earliest',
            'enable.auto.commit': False,  # Manual commits for reliability
            'max.poll.interval.ms': 300000,
        })
        self.consumer.subscribe(topics)
        self.handler = handler
        self.dlq_producer = Producer({'bootstrap.servers': config['brokers']})
        self.max_retries = 3

    async def run(self):
        while True:
            # poll() blocks; in a real service, run it via loop.run_in_executor
            # so it doesn't stall the event loop.
            msg = self.consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    continue
                logging.error(f"Consumer error: {msg.error()}")
                continue

            await self._process_with_retry(msg)

    async def _process_with_retry(self, msg):
        event = json.loads(msg.value().decode('utf-8'))
        retries = 0
        
        while retries < self.max_retries:
            try:
                await self.handler(event)
                self.consumer.commit(msg)
                return
            except TransientError as e:
                retries += 1
                await asyncio.sleep(2 ** retries)  # Exponential backoff
            except PermanentError as e:
                logging.error(f"Permanent failure for event {event['eventId']}: {e}")
                self._send_to_dlq(msg, str(e))
                self.consumer.commit(msg)
                return
        
        # Max retries exceeded
        self._send_to_dlq(msg, "Max retries exceeded")
        self.consumer.commit(msg)

    def _send_to_dlq(self, original_msg, error_reason):
        dlq_event = {
            "originalEvent": json.loads(original_msg.value()),
            "error": error_reason,
            "failedAt": datetime.utcnow().isoformat(),
            "originalTopic": original_msg.topic()
        }
        self.dlq_producer.produce(
            'dead-letter-queue',
            json.dumps(dlq_event).encode('utf-8')
        )
        self.dlq_producer.flush()

Dead letter queues are essential. When processing fails permanently, you need somewhere to park problematic events for investigation without blocking the main flow.

Data Consistency and Saga Pattern

Distributed transactions across services don’t work in EDA. Instead, use sagas—sequences of local transactions coordinated through events.

Here’s a choreography-based saga for order fulfillment with compensation:

# Payment Service - handles payment and listens for failures
@event_handler("inventory.reserved.v1")
async def charge_payment(event):
    try:
        charge = await payment_gateway.capture(event["payload"]["orderId"])
        await publish_event({
            "eventType": "payment.completed.v1",
            "correlationId": event["correlationId"],
            "payload": {"orderId": event["payload"]["orderId"], "chargeId": charge.id}
        })
    except PaymentDeclinedException:
        await publish_event({
            "eventType": "payment.failed.v1",
            "correlationId": event["correlationId"],
            "payload": {"orderId": event["payload"]["orderId"], "reason": "declined"}
        })

# Inventory Service - compensates on payment failure
@event_handler("payment.failed.v1")
async def release_inventory(event):
    order_id = event["payload"]["orderId"]
    await inventory_repo.release_reservation(order_id)
    await publish_event({
        "eventType": "inventory.released.v1",
        "correlationId": event["correlationId"],
        "payload": {"orderId": order_id}
    })

# Order Service - updates order status based on saga outcome
@event_handler("payment.completed.v1")
async def confirm_order(event):
    await order_repo.update_status(event["payload"]["orderId"], "CONFIRMED")

@event_handler("payment.failed.v1")
async def cancel_order(event):
    await order_repo.update_status(event["payload"]["orderId"], "CANCELLED")

Accept eventual consistency. The order might show as “PROCESSING” for a few seconds while events propagate. Design your UX around this reality rather than fighting it.

Observability and Debugging

Correlation IDs are non-negotiable. Every event in a business flow shares the same correlation ID, enabling you to trace the complete journey:

import json

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def publish_event_with_tracing(event, topic):
    with tracer.start_as_current_span("publish_event") as span:
        span.set_attribute("event.type", event["eventType"])
        span.set_attribute("event.id", event["eventId"])
        
        # Inject trace context into event headers
        carrier = {}
        inject(carrier)
        
        event["_traceContext"] = carrier
        
        # `producer` is a pre-configured Kafka producer
        producer.produce(topic, json.dumps(event).encode('utf-8'))

def consume_event_with_tracing(event):
    # Extract trace context from event
    ctx = extract(event.get("_traceContext", {}))
    
    with tracer.start_as_current_span("consume_event", context=ctx) as span:
        span.set_attribute("event.type", event["eventType"])
        span.set_attribute("correlation.id", event["correlationId"])
        
        return process_event(event)

Monitor consumer lag religiously. If consumers fall behind producers, you’ll experience increasing latency and potential data loss if retention expires.
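Per-partition lag is the broker's high watermark minus the group's committed offset; with confluent-kafka you would fetch those via `get_watermark_offsets` and `committed`. The arithmetic itself, including librdkafka's sentinel for "never committed":

```python
# librdkafka reports OFFSET_INVALID (-1001) when a group has never committed.
OFFSET_INVALID = -1001

def partition_lag(high_watermark, committed_offset):
    # Lag = newest offset the broker holds minus the consumer's position.
    # A group that has never committed is behind by the whole partition.
    if committed_offset < 0:
        return high_watermark
    return max(high_watermark - committed_offset, 0)
```

Summing this over partitions gives the per-topic lag you would alert on.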

Trade-offs and Anti-patterns

EDA isn’t always the answer. Avoid it when:

  • You need synchronous responses (user expects immediate confirmation)
  • Your domain is simple CRUD without complex workflows
  • Your team lacks distributed systems experience
  • Strong consistency is a hard requirement

Common anti-patterns to avoid:

Event storms: One event triggers cascading events that overwhelm the system. Design circuit breakers and rate limiting.

Anemic events: Events that only contain IDs, forcing consumers to call back to the producer for details. Include enough data for consumers to act independently.

Tight coupling through events: If changing an event schema requires coordinated deployments across multiple teams, you’ve recreated the coupling you tried to escape.

Using events for queries: Events are for state changes, not data retrieval. Don’t publish GetUserDetails events.
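Of these, event storms are the most mechanical to guard against: the rate limiting mentioned above can be as small as a token bucket checked before each publish. A minimal sketch; the class and parameters are illustrative:

```python
import time

class TokenBucket:
    """Minimal rate limiter: refuse (or defer) publishing when the bucket
    is empty, which damps cascading event storms at the source."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A producer would call `bucket.allow()` before `publish_event` and defer or drop when it returns False.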

Consider hybrid approaches. Use synchronous calls for simple queries and immediate responses; use events for state changes that multiple services care about. The goal is appropriate coupling, not zero coupling.
