Change Data Capture: Database Event Streaming

Key Insights

  • Log-based CDC using tools like Debezium captures database changes with minimal performance impact and guarantees you won’t miss events, unlike polling or trigger-based approaches
  • CDC transforms your database into an event stream, enabling real-time data synchronization across services without coupling them through direct API calls
  • The operational complexity of CDC—schema evolution, snapshot management, and exactly-once delivery—requires careful planning but pays off in architectural flexibility

What is Change Data Capture?

Change Data Capture (CDC) is the process of identifying and capturing row-level changes in a database—inserts, updates, and deletes—and streaming them as events to downstream systems. Instead of querying a database to ask “what changed?”, CDC tells you proactively whenever something changes.

The concept isn’t new. Databases have always maintained transaction logs for recovery purposes. What’s changed is our ability to tap into these logs reliably and stream changes to modern event platforms like Apache Kafka. This shift has made CDC a cornerstone of event-driven architectures.

Why does this matter? Because modern systems need to react to data changes in real-time. When a customer updates their address, your shipping service, analytics platform, and CRM all need to know—immediately. CDC makes this possible without tight coupling between services.

CDC Implementation Patterns

There are three primary approaches to implementing CDC, each with distinct trade-offs.

Log-based CDC reads the database’s transaction log (WAL in PostgreSQL, binlog in MySQL). This is the gold standard—it captures every change with minimal performance impact on the source database and maintains the exact order of operations.

Trigger-based CDC uses database triggers to write changes to a shadow table, which is then read by an external process. This works but adds write overhead and can become a bottleneck.

Query-based CDC polls the database periodically, comparing timestamps or version columns to detect changes. It’s simple but misses deletes, can’t capture intermediate states, and creates load on the source database.

Here’s a concrete comparison:

-- Trigger-based CDC in PostgreSQL
CREATE TABLE order_changes (
    id SERIAL PRIMARY KEY,
    operation VARCHAR(10),
    changed_at TIMESTAMP DEFAULT NOW(),
    order_id INTEGER,
    old_data JSONB,
    new_data JSONB
);

CREATE OR REPLACE FUNCTION capture_order_changes()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO order_changes (operation, order_id, old_data, new_data)
    VALUES (
        TG_OP,
        COALESCE(NEW.id, OLD.id),
        CASE WHEN TG_OP != 'INSERT' THEN to_jsonb(OLD) END,
        CASE WHEN TG_OP != 'DELETE' THEN to_jsonb(NEW) END
    );
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_changes();

-- Query-based CDC (polling approach)
-- Requires an updated_at column that's always set on changes
SELECT * FROM orders 
WHERE updated_at > :last_poll_timestamp
ORDER BY updated_at;

-- Problem: This misses deletes entirely and requires 
-- application discipline to always update the timestamp

The trigger approach works but doubles your write load. The polling approach is fragile. Log-based CDC avoids both problems.

Debezium Deep Dive

Debezium is the de facto standard for log-based CDC in the open-source world. It runs as a set of Kafka Connect connectors that read database transaction logs and produce change events to Kafka topics.

The architecture is straightforward: Debezium connectors run inside Kafka Connect, reading from source databases and writing to Kafka. Each table gets its own topic, and each change becomes a message containing the before state, after state, and metadata about the operation.

Here’s a typical Debezium connector configuration for PostgreSQL:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false"
  }
}

When a row changes, Debezium produces an event like this:

{
  "before": {
    "id": 1001,
    "customer_id": 42,
    "status": "pending",
    "total": 150.00
  },
  "after": {
    "id": 1001,
    "customer_id": 42,
    "status": "shipped",
    "total": 150.00
  },
  "source": {
    "version": "2.4.0",
    "connector": "postgresql",
    "name": "dbserver1",
    "ts_ms": 1699459200000,
    "db": "inventory",
    "schema": "public",
    "table": "orders",
    "txId": 12345,
    "lsn": 98765432
  },
  "op": "u",
  "ts_ms": 1699459200500
}

The op field indicates the operation: c for create, u for update, d for delete, and r for read (during snapshots).
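
The dispatch logic for these op codes can be sketched in a few lines of Python (the handler names here are illustrative, not part of Debezium's API):

```python
# Minimal dispatcher for Debezium op codes. Handler names are illustrative.
def handle_event(event, handlers):
    """Route a Debezium change event to a handler based on its 'op' code."""
    op = event.get("op")
    if op in ("c", "r", "u"):        # create, snapshot read, or update
        return handlers["upsert"](event["after"])
    elif op == "d":                  # delete: only the before state is present
        return handlers["delete"](event["before"])
    return None                      # unknown op codes are ignored

handlers = {
    "upsert": lambda row: ("upsert", row["id"]),
    "delete": lambda row: ("delete", row["id"]),
}
print(handle_event({"op": "u", "before": {"id": 1}, "after": {"id": 1}}, handlers))
# ('upsert', 1)
```

Treating creates, snapshot reads, and updates uniformly as upserts is a common simplification when the downstream store is keyed by primary key.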

Building a CDC Pipeline

Let’s build a complete pipeline. This Docker Compose setup gives you PostgreSQL, Kafka, and Debezium:

version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: inventory
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
    ports:
      - "5432:5432"

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  connect:
    image: debezium/connect:2.4
    depends_on:
      - kafka
      - postgres
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:29092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses

After starting the stack, register the connector via the Kafka Connect REST API:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json

Here’s a Python consumer that processes change events:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'dbserver1.public.orders',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',
    # Tombstone messages carry a null value, so guard before decoding
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m else None
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone following a delete; nothing to process
        continue
    operation = event.get('op')

    if operation == 'c':
        print(f"New order created: {event['after']['id']}")
    elif operation == 'u':
        before = event['before']
        after = event['after']
        # 'before' is null unless the table has REPLICA IDENTITY FULL
        if before and before['status'] != after['status']:
            print(f"Order {after['id']} status: {before['status']} -> {after['status']}")
    elif operation == 'd':
        print(f"Order deleted: {event['before']['id']}")

Common Use Cases

Event-driven microservices: CDC lets services react to data changes without the source service knowing about them. The orders service doesn’t need to publish events—CDC extracts them automatically.

Cache invalidation: Instead of TTL-based expiration or manual invalidation, CDC tells you exactly when cached data becomes stale.
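
As a sketch of that idea (an in-memory dict stands in for a real cache such as Redis, and the "orders:<id>" key scheme is an assumption):

```python
# Sketch: CDC-driven cache invalidation. The dict stands in for a real
# cache (e.g. Redis); the "orders:<id>" key scheme is an assumption.
cache = {"orders:1001": {"status": "pending"}}

def invalidate_on_change(event, cache):
    """Evict the cached entry for whichever row this change event touches."""
    row = event.get("after") or event.get("before")  # deletes carry only 'before'
    cache.pop(f"orders:{row['id']}", None)

invalidate_on_change(
    {"op": "u", "before": {"id": 1001}, "after": {"id": 1001, "status": "shipped"}},
    cache,
)
print(cache)  # {} -- the stale entry is gone
```

The next read repopulates the cache from the database, so entries are only ever evicted, never written stale.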

Search index synchronization: Keep Elasticsearch in sync with your database without batch jobs.

Here’s a practical example of CDC-powered Elasticsearch sync:

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch
import json

es = Elasticsearch(['http://localhost:9200'])
consumer = KafkaConsumer(
    'dbserver1.public.products',
    bootstrap_servers=['localhost:9092'],
    # Tombstone messages carry a null value, so guard before decoding
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m else None
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone following a delete; nothing to index
        continue
    op = event.get('op')

    if op in ('c', 'u', 'r'):  # create, update, or snapshot read
        product = event['after']
        es.index(
            index='products',
            id=product['id'],
            document={
                'name': product['name'],
                'description': product['description'],
                'price': product['price'],
                'category': product['category']
            }
        )
    elif op == 'd':  # delete
        es.delete(index='products', id=event['before']['id'])

Data warehouse loading: Stream changes to your warehouse instead of nightly batch loads.
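
Warehouse loaders usually micro-batch change events rather than writing row by row. A minimal sketch of the buffering logic (the batch size and flush callback are illustrative assumptions):

```python
# Sketch: micro-batching CDC events for warehouse loads. The flush
# callback and batch size of 3 are illustrative assumptions.
class MicroBatcher:
    def __init__(self, flush, batch_size=3):
        self.flush = flush
        self.batch_size = batch_size
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush(self.buffer)   # e.g. a COPY or bulk-insert call
            self.buffer = []

loads = []
batcher = MicroBatcher(flush=loads.append)
for i in range(7):
    batcher.add({"op": "c", "after": {"id": i}})
print(len(loads), len(batcher.buffer))  # 2 batches flushed, 1 event pending
```

A production version would also flush on a timer so a slow trickle of events doesn't sit in the buffer indefinitely.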

Audit logging: CDC captures every change with timestamps and transaction IDs—perfect for compliance.

Operational Considerations

Schema evolution is the trickiest part. When you add a column, Debezium handles it gracefully. Removing or renaming columns requires more care—consider using a schema registry with compatibility checks.
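
On the consumer side, tolerant parsing helps survive additive schema changes: read only the fields you need and default the rest. A sketch (field names are illustrative):

```python
# Sketch: a consumer that tolerates schema evolution by reading only the
# fields it needs and defaulting any it can live without.
def parse_order(after):
    return {
        "id": after["id"],                         # required: fail loudly if absent
        "status": after.get("status", "unknown"),  # optional: default if dropped
        "total": after.get("total", 0.0),
    }

# An event produced after a hypothetical new column 'priority' was added:
row = parse_order({"id": 7, "status": "pending", "total": 9.5, "priority": "high"})
print(row)  # unknown columns are simply ignored
```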

Snapshots happen when you first start a connector or need to re-sync. Debezium reads the entire table, which can take hours for large tables and creates significant load. Plan for this.

Monitoring lag is critical. Track the difference between the database’s current LSN and the connector’s position. If lag grows, you’re falling behind and need to investigate.
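
Alongside LSN-based replication lag, a cheap consumer-side signal is the gap between the event's source timestamp and the wall clock; a sketch:

```python
import time

# Sketch: end-to-end lag from the event's source timestamp. For precise
# replication lag you would compare LSNs, but ts_ms gives a cheap signal.
def event_lag_ms(event, now_ms=None):
    """Milliseconds between the change being committed and us seeing it."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return now_ms - event["source"]["ts_ms"]

event = {"source": {"ts_ms": 1699459200000}}
print(event_lag_ms(event, now_ms=1699459205000))  # 5000 ms behind
```

Export this as a metric per topic and alert when it trends upward rather than on any single spike.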

Tombstones are null-valued Kafka messages that tell log compaction to drop all earlier messages for a key. After a delete, Debezium emits the delete change event followed by a tombstone, so consumers must handle both.

Exactly-once semantics remain challenging. Debezium guarantees at-least-once delivery. For exactly-once, you need idempotent consumers or transactional writes on the consumer side.
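
One way to make redelivery safe is to deduplicate on a unique event identifier such as the source LSN; a sketch (a production version would persist the seen offsets, not hold them in memory):

```python
# Sketch: an idempotent consumer that deduplicates redeliveries by the
# event's source LSN. A production version would persist seen offsets.
class IdempotentConsumer:
    def __init__(self):
        self.seen = set()
        self.applied = []

    def process(self, event):
        key = (event["source"]["table"], event["source"]["lsn"])
        if key in self.seen:          # duplicate delivery: skip
            return False
        self.seen.add(key)
        self.applied.append(event["op"])  # stand-in for the real side effect
        return True

c = IdempotentConsumer()
e = {"op": "u", "source": {"table": "orders", "lsn": 98765432}}
print(c.process(e), c.process(e))  # True False -- the redelivery is a no-op
```

When the side effect is an upsert keyed by primary key, you often get idempotence for free and can skip the dedup set entirely.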

Alternatives and When to Use Them

Application-level events are simpler when you control the source application and need rich business events, not just data changes.

The outbox pattern combines the reliability of CDC with application-controlled event schemas. Write events to an outbox table, and use CDC to stream them. This gives you the best of both worlds.
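
A sketch of the outbox write side, using SQLite as a stand-in for the production database (the table shapes and event fields are illustrative):

```python
import json
import sqlite3

# Sketch of the outbox pattern's write side. SQLite stands in for the
# production database; table shapes and event fields are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_id TEXT, event_type TEXT, payload TEXT)""")

def ship_order(conn, order_id):
    # The business change and the outbox event commit atomically in one
    # transaction; CDC on the outbox table then streams the event onward.
    with conn:
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?",
                     (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (str(order_id), "OrderShipped", json.dumps({"order_id": order_id})),
        )

conn.execute("INSERT INTO orders VALUES (1001, 'pending')")
ship_order(conn, 1001)
print(conn.execute("SELECT event_type FROM outbox").fetchone())  # ('OrderShipped',)
```

Because the event row commits in the same transaction as the business change, it can never be lost or emitted for a rolled-back write.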

Managed services like AWS DMS or Google Cloud Datastream reduce operational burden but offer less flexibility and can be expensive at scale.

Choose CDC when you need to capture changes from databases you don’t control, want to avoid modifying source applications, or need guaranteed capture of all changes. Choose application events when you need rich business semantics. Choose the outbox pattern when you want both.

CDC isn’t magic—it’s infrastructure that requires care and feeding. But when you need real-time data synchronization across a distributed system, it’s often the most reliable path forward.
