Outbox Pattern: Reliable Event Publishing

Key Insights

  • The outbox pattern solves the dual-write problem by storing events in the same database transaction as your business data, then publishing them asynchronously via a separate process.
  • At-least-once delivery is the realistic guarantee—design your consumers to be idempotent using message IDs and a processed events table.
  • Choose polling for simplicity or Change Data Capture (CDC) for lower latency, but either approach requires careful attention to ordering, retries, and table maintenance.

The Dual-Write Problem

Every time you save data to a database and publish an event to a message broker, you’re performing a dual write. This seems straightforward until you consider what happens when one operation succeeds and the other fails.

# The naive approach - don't do this
def create_order(order_data):
    order = Order(**order_data)
    db.session.add(order)
    db.session.commit()  # Step 1: Database write succeeds
    
    # Step 2: What if this fails? Or the process crashes here?
    message_broker.publish("orders", {
        "event": "OrderCreated",
        "order_id": order.id,
        "customer_id": order.customer_id,
        "total": order.total
    })
    
    return order

This code has several failure modes. The message broker might be temporarily unavailable. The application might crash between the commit and the publish. Network issues might cause a timeout after the message was actually sent, leading you to retry and publish duplicates.

The result? Your database says the order exists, but downstream services never learn about it. Or worse, they receive the event multiple times with no way to deduplicate. Customers wonder why their order confirmation email never arrived. The inventory service never reserved the items. Your support team fields complaints while engineers dig through logs.

Wrapping both operations in a distributed transaction sounds appealing but introduces its own complexity and performance penalties. Most message brokers don’t support XA transactions anyway.

How the Outbox Pattern Works

The outbox pattern sidesteps distributed transactions entirely. Instead of publishing directly to a message broker, you write events to an “outbox” table in your database within the same transaction as your business data.

A separate publisher process reads from the outbox table and handles the actual message broker communication. If publishing fails, the message stays in the outbox and gets retried. If it succeeds, the message is marked as published.

┌─────────────────────────────────────────────────────────┐
│                    Single Transaction                    │
│  ┌─────────────────┐         ┌─────────────────────┐    │
│  │  orders table   │         │   outbox table      │    │
│  │                 │         │                     │    │
│  │  INSERT order   │         │  INSERT event       │    │
│  └─────────────────┘         └─────────────────────┘    │
└─────────────────────────────────────────────────────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │  Publisher Process  │
                            │  (polls or CDC)     │
                            └─────────────────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │   Message Broker    │
                            └─────────────────────┘

The transaction boundary is the key insight. Either both the order and the outbox event are persisted, or neither is. You’ve traded a distributed systems problem for a local database transaction—a much more tractable challenge.
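The atomicity guarantee can be seen in a toy, in-memory sketch (the `InMemoryDB`, `create_order`, and `publish_pending` names here are illustrative, not part of any real API):

```python
import uuid

class InMemoryDB:
    """Toy stand-in for a transactional database (illustration only)."""
    def __init__(self):
        self.orders = []
        self.outbox = []

    def transaction(self, writes):
        """Apply all writes atomically: stage them, then swap in one step."""
        staged = {"orders": list(self.orders), "outbox": list(self.outbox)}
        for table, row in writes:
            staged[table].append(row)  # any exception leaves self untouched
        self.orders, self.outbox = staged["orders"], staged["outbox"]

def create_order(db, order):
    event = {"id": str(uuid.uuid4()), "event_type": "OrderCreated",
             "order_id": order["id"], "published": False}
    # One atomic commit covers both the business row and the outbox row.
    db.transaction([("orders", order), ("outbox", event)])

def publish_pending(db, broker):
    """Drain unpublished events; a failed publish leaves the row for retry."""
    for event in db.outbox:
        if not event["published"]:
            broker.append(event)       # may raise; event then stays pending
            event["published"] = True
```

If `create_order` fails, neither list gains a row; if the broker append fails, the event simply stays unpublished and the next `publish_pending` pass retries it. The real implementation below is the same shape with a database and a message broker in place of the lists.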

Implementing the Outbox Table

The outbox table needs to capture enough information for the publisher to route and deliver messages correctly.

CREATE TABLE outbox (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    published_at TIMESTAMP WITH TIME ZONE NULL,
    retry_count INTEGER DEFAULT 0,
    last_error TEXT NULL
);

CREATE INDEX idx_outbox_unpublished ON outbox (created_at) 
    WHERE published_at IS NULL;

CREATE INDEX idx_outbox_aggregate ON outbox (aggregate_type, aggregate_id);

The aggregate_type and aggregate_id columns let you route events to appropriate topics and maintain ordering per aggregate. The payload column stores the full event data—I recommend JSONB for flexibility and queryability, though binary formats like Avro work if you need schema evolution guarantees.

Here’s how the application code changes:

from dataclasses import dataclass, asdict
from datetime import datetime
import json
import uuid

from sqlalchemy import text  # raw SQL strings must be wrapped in text()

@dataclass
class OutboxEvent:
    id: uuid.UUID
    aggregate_type: str
    aggregate_id: str
    event_type: str
    payload: dict
    created_at: datetime

def create_order(order_data):
    order = Order(**order_data)
    
    outbox_event = OutboxEvent(
        id=uuid.uuid4(),
        aggregate_type="Order",
        aggregate_id=str(order.id),
        event_type="OrderCreated",
        payload={
            "order_id": str(order.id),
            "customer_id": order.customer_id,
            "total": float(order.total),
            "items": [asdict(item) for item in order.items]
        },
        created_at=datetime.utcnow()
    )
    
    # Both writes in the same transaction
    db.session.add(order)
    db.session.execute(
        text("""INSERT INTO outbox (id, aggregate_type, aggregate_id, 
             event_type, payload, created_at)
             VALUES (:id, :aggregate_type, :aggregate_id, 
             :event_type, :payload, :created_at)"""),
        {
            "id": outbox_event.id,
            "aggregate_type": outbox_event.aggregate_type,
            "aggregate_id": outbox_event.aggregate_id,
            "event_type": outbox_event.event_type,
            "payload": json.dumps(outbox_event.payload),
            "created_at": outbox_event.created_at
        }
    )
    db.session.commit()
    
    return order

The Publisher Component

The publisher reads unpublished events and sends them to the message broker. Polling is the simplest approach and works well for most use cases.

import time
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

class OutboxPublisher:
    def __init__(self, db, message_broker, batch_size=100, poll_interval=1.0):
        self.db = db
        self.broker = message_broker
        self.batch_size = batch_size
        self.poll_interval = poll_interval
    
    def run(self):
        logger.info("Outbox publisher started")
        while True:
            try:
                published_count = self.process_batch()
                if published_count == 0:
                    time.sleep(self.poll_interval)
            except Exception as e:
                logger.error(f"Publisher error: {e}")
                time.sleep(self.poll_interval * 5)
    
    def process_batch(self):
        # SELECT FOR UPDATE SKIP LOCKED allows multiple publishers
        events = self.db.execute(
            """SELECT id, aggregate_type, aggregate_id, event_type, 
                      payload, created_at, retry_count
               FROM outbox 
               WHERE published_at IS NULL 
               ORDER BY created_at 
               LIMIT :batch_size
               FOR UPDATE SKIP LOCKED""",
            {"batch_size": self.batch_size}
        ).fetchall()
        
        published = 0
        for event in events:
            try:
                topic = self._get_topic(event.aggregate_type)
                self.broker.publish(
                    topic=topic,
                    key=event.aggregate_id,  # Ensures ordering per aggregate
                    value={
                        "event_id": str(event.id),
                        "event_type": event.event_type,
                        "aggregate_type": event.aggregate_type,
                        "aggregate_id": event.aggregate_id,
                        "payload": event.payload,
                        "timestamp": event.created_at.isoformat()
                    }
                )
                
                self.db.execute(
                    "UPDATE outbox SET published_at = :now WHERE id = :id",
                    {"now": datetime.utcnow(), "id": event.id}
                )
                published += 1
                
            except Exception as e:
                logger.warning(f"Failed to publish {event.id}: {e}")
                self.db.execute(
                    """UPDATE outbox 
                       SET retry_count = retry_count + 1, last_error = :error 
                       WHERE id = :id""",
                    {"error": str(e), "id": event.id}
                )
        
        # Commit once per batch: committing inside the loop would release
        # the SKIP LOCKED row locks early, letting another publisher pick
        # up the remaining rows and defeating the locking clause.
        self.db.commit()
        return published
    
    def _get_topic(self, aggregate_type):
        return f"domain-events.{aggregate_type.lower()}"

The FOR UPDATE SKIP LOCKED clause is crucial for scaling. It lets you run multiple publisher instances without processing the same event twice, while avoiding lock contention.
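One refinement worth making to the query above: the retry_count column exists, but nothing caps it, so a permanently failing event (a "poison pill") is retried forever. A bounded variant of the WHERE clause sets such events aside (the threshold of 10 is an arbitrary assumption, not a recommendation):

```sql
-- Skip events that have failed repeatedly; alert on them instead
SELECT id, aggregate_type, aggregate_id, event_type,
       payload, created_at, retry_count
FROM outbox
WHERE published_at IS NULL
  AND retry_count < 10
ORDER BY created_at
LIMIT :batch_size
FOR UPDATE SKIP LOCKED;
```

Note that skipping an event sacrifices ordering for its aggregate, since later events will be published ahead of it. Treat this as a last resort, and alert whenever an event crosses the threshold.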

For lower latency, consider Change Data Capture (CDC) tools like Debezium. CDC monitors the database transaction log and streams changes in near real-time. The trade-off is operational complexity—you’re now running Kafka Connect and managing connector configurations.
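As a rough sketch, a Debezium Postgres connector pointed at the outbox table might look like the following (hostnames, database names, and the route.by.field mapping are assumptions to adapt to your schema, not a drop-in config):

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type"
  }
}
```

Debezium's EventRouter transform is built specifically for this pattern: it reads rows from the outbox table off the transaction log and routes them to topics derived from the aggregate type, so no polling loop is needed.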

At-Least-Once Delivery and Idempotent Consumers

Let’s be direct: exactly-once delivery in distributed systems is a myth, or at best, an expensive approximation. The outbox pattern provides at-least-once delivery. Your consumers must be idempotent.

The simplest approach is tracking processed event IDs:

class OrderEventConsumer:
    def __init__(self, db):
        self.db = db
    
    def handle(self, event):
        event_id = event["event_id"]
        
        # The duplicate check must run inside the same transaction as the
        # business logic; otherwise a crash between the two reintroduces
        # the dual-write problem on the consumer side.
        with self.db.begin():
            existing = self.db.execute(
                "SELECT 1 FROM processed_events WHERE event_id = :id",
                {"id": event_id}
            ).fetchone()
            
            if existing:
                logger.info(f"Skipping duplicate event {event_id}")
                return
            
            # Process the event
            self._process_order_created(event["payload"])
            
            # Record that we processed this event. A primary key on
            # event_id makes a concurrent duplicate fail loudly instead
            # of being processed twice.
            self.db.execute(
                """INSERT INTO processed_events (event_id, processed_at)
                   VALUES (:id, :now)""",
                {"id": event_id, "now": datetime.utcnow()}
            )

The processed events check and the business logic must be in the same transaction. Otherwise, you’re back to the dual-write problem on the consumer side.
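The consumer code assumes a processed_events table; a minimal definition (column names matching the code above, constraints my own suggestion) is:

```sql
CREATE TABLE processed_events (
    event_id UUID PRIMARY KEY,  -- duplicate inserts fail on the PK
    processed_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
```

The primary key turns a race between two consumers into a visible constraint violation rather than silent double processing. Like the outbox itself, this table needs periodic pruning.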

Production Considerations

The outbox table will grow indefinitely unless you maintain it. Archive or delete published events after a retention period:

-- Run periodically (e.g., daily via cron)
DELETE FROM outbox 
WHERE published_at IS NOT NULL 
  AND published_at < NOW() - INTERVAL '7 days';

Monitor the lag between event creation and publication. A growing backlog indicates publisher issues or insufficient capacity:

SELECT 
    COUNT(*) as pending_events,
    MIN(created_at) as oldest_pending,
    EXTRACT(EPOCH FROM (NOW() - MIN(created_at))) as max_lag_seconds
FROM outbox 
WHERE published_at IS NULL;

Alert when lag exceeds acceptable thresholds. For most systems, anything over a few minutes warrants investigation.

If you’re using CDC, monitor connector lag separately. Debezium exposes metrics for this purpose. Connector failures can be subtle—the connector might be running but stuck on a problematic transaction.

When to Use (and Skip) the Outbox Pattern

Use the outbox pattern when:

  • You need reliable event publishing alongside database writes
  • Your message broker doesn’t support transactions with your database
  • You want to decouple your application from broker availability
  • Event ordering per aggregate matters

Skip it when:

  • You’re using a database that supports transactional outbox natively (some cloud databases offer this)
  • Your events are truly fire-and-forget with no consistency requirements
  • You’re already using a saga/choreography pattern that handles failures differently
  • The added latency (polling interval) is unacceptable and CDC is too complex for your operations team

The outbox pattern adds complexity. You’re maintaining an extra table, running a publisher process, and handling the operational overhead of both. But when you need reliable event publishing—and most event-driven systems do—that complexity pays for itself in reduced data inconsistencies and fewer 3 AM pages about missing events.

Start with polling. It’s simple, debuggable, and sufficient for most workloads. Graduate to CDC when you’ve proven you need sub-second latency and have the operational maturity to support it.
