Design a Webhook Delivery System: Reliable Notifications
Key Insights
- A reliable webhook system requires decoupling event production from delivery through persistent queues, enabling retries without blocking your main application flow.
- Exponential backoff with jitter prevents thundering herd problems and gives failing endpoints time to recover, while dead letter queues ensure no event is silently lost.
- Rate limiting per destination is non-negotiable—your reliability shouldn’t come at the cost of overwhelming your customers’ infrastructure.
Introduction: Why Webhook Reliability Matters
Webhooks are the backbone of event-driven integrations. When a user completes a payment, when a deployment finishes, when a document gets signed—these events need to reach external systems reliably. Unlike APIs where the consumer initiates requests, webhooks flip the model: you’re pushing data to endpoints you don’t control.
This inversion creates unique failure modes. The destination server might be down. Network partitions happen. Endpoints timeout under load. Your customer’s infrastructure might reject requests during maintenance windows. A naive “fire and forget” approach means lost events, broken integrations, and angry customers.
A production-grade webhook system needs to guarantee at-least-once delivery, handle failures gracefully, and scale without becoming a bottleneck. Let’s build one.
Core Architecture Components
The fundamental principle is decoupling. Your application shouldn’t wait for webhook delivery to complete. Instead, events flow through a persistent queue that workers consume independently.
Here’s the high-level architecture:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Application │────▶│ Event Queue │────▶│   Workers   │────▶│  Endpoints  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                                       │
       ▼                                       ▼
┌─────────────┐                         ┌─────────────┐
│  Database   │                         │   Metrics   │
└─────────────┘                         └─────────────┘
The database schema needs to track webhook state across retries:
CREATE TABLE webhook_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,
    -- Delivery state
    status VARCHAR(20) DEFAULT 'pending',  -- pending, delivered, failed, dead
    attempt_count INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 5,
    -- Scheduling
    next_attempt_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_attempt_at TIMESTAMP WITH TIME ZONE,
    last_response_code INTEGER,
    last_error TEXT,
    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE
);

-- Partial index so worker queries scan only deliverable rows
-- (PostgreSQL does not allow INDEX clauses inside CREATE TABLE)
CREATE INDEX idx_pending_events ON webhook_events (status, next_attempt_at)
    WHERE status IN ('pending', 'failed');
CREATE TABLE webhook_endpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    url VARCHAR(2048) NOT NULL,
    secret VARCHAR(255) NOT NULL, -- For HMAC signing
    is_active BOOLEAN DEFAULT true,
    failure_count INTEGER DEFAULT 0,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Workers poll for events where status IN ('pending', 'failed') AND next_attempt_at <= NOW(), process them, and update state atomically. Use SELECT FOR UPDATE SKIP LOCKED to prevent multiple workers from grabbing the same event.
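Assuming PostgreSQL, the claim step can be sketched as a single UPDATE … RETURNING wrapped around a SKIP LOCKED subquery. The transient 'processing' status and the claim_events() helper are illustrative, not part of the schema above; conn is any DB-API connection.

```python
# Claim a batch of due events so concurrent workers never grab the same
# row. The subquery locks candidate rows; SKIP LOCKED makes competing
# workers jump past rows another worker already holds.
CLAIM_SQL = """
UPDATE webhook_events
SET status = 'processing', last_attempt_at = NOW()
WHERE id IN (
    SELECT id FROM webhook_events
    WHERE status IN ('pending', 'failed')
      AND next_attempt_at <= NOW()
    ORDER BY next_attempt_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id, event_type, payload, endpoint_id;
"""

def claim_events(conn, batch_size: int = 50):
    """Atomically claim up to batch_size due events for this worker."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"batch_size": batch_size})
        rows = cur.fetchall()
    conn.commit()
    return rows
```

Committing immediately after the UPDATE keeps the transaction short; the HTTP delivery itself should never run inside a database transaction.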
Retry Strategy with Exponential Backoff
When a delivery fails, you need to retry—but not immediately. Hammering a failing endpoint wastes resources and can worsen outages. Exponential backoff gives systems time to recover.
The formula is straightforward: delay = base_delay * (multiplier ^ attempt_count) + jitter. Jitter prevents synchronized retries from creating traffic spikes.
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class RetryConfig:
    base_delay_seconds: float = 10.0
    multiplier: float = 2.0
    max_delay_seconds: float = 3600.0  # Cap at 1 hour
    max_attempts: int = 5
    jitter_factor: float = 0.2  # ±20% randomization

def calculate_next_attempt(
    attempt_count: int,
    config: RetryConfig
) -> Optional[datetime]:
    """Calculate when to retry a failed webhook delivery.

    attempt_count is the number of failed attempts so far,
    i.e. 1 after the first failure.
    """
    if attempt_count >= config.max_attempts:
        return None  # Move to dead letter queue

    # Exponential backoff: the first retry waits base_delay_seconds
    delay = config.base_delay_seconds * (config.multiplier ** (attempt_count - 1))

    # Apply cap
    delay = min(delay, config.max_delay_seconds)

    # Add jitter to prevent thundering herd
    jitter_range = delay * config.jitter_factor
    jitter = random.uniform(-jitter_range, jitter_range)
    delay = max(0, delay + jitter)

    return datetime.utcnow() + timedelta(seconds=delay)

def process_webhook_result(
    event_id: str,
    success: bool,
    response_code: Optional[int],
    error: Optional[str],
    db_connection
) -> None:
    """Update webhook state after a delivery attempt."""
    event = db_connection.get_event(event_id)

    if success:
        db_connection.update_event(event_id, {
            'status': 'delivered',
            'delivered_at': datetime.utcnow(),
            'last_response_code': response_code
        })
        return

    next_attempt = calculate_next_attempt(
        event.attempt_count + 1,  # Count the attempt that just failed
        RetryConfig()
    )

    if next_attempt is None:
        # Max attempts exceeded - move to dead letter queue
        db_connection.update_event(event_id, {
            'status': 'dead',
            'last_response_code': response_code,
            'last_error': error
        })
        enqueue_to_dlq(event)
    else:
        db_connection.update_event(event_id, {
            'status': 'failed',
            'attempt_count': event.attempt_count + 1,
            'next_attempt_at': next_attempt,
            'last_response_code': response_code,
            'last_error': error
        })
A typical retry schedule with these defaults: retries after 10s, 20s, 40s, and 80s. When the fifth attempt also fails, the event moves to a dead letter queue for manual inspection or automated alerting.
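The enqueue_to_dlq() helper is left abstract above. One possible shape, with dlq_store and alerts as illustrative stand-ins for a real queue and alerting hook, and the event treated as a plain dict:

```python
def enqueue_to_dlq(event: dict, dlq_store: list, alerts: list) -> None:
    """Park an exhausted event for inspection and raise an alert.

    Sketch only: in production dlq_store would be a durable table or
    queue, and alerts would page an on-call channel.
    """
    dlq_store.append({
        "event_id": event["id"],
        "endpoint_id": event["endpoint_id"],
        "last_error": event["last_error"],
    })
    alerts.append(f"webhook {event['id']} exhausted retries")
```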
Ensuring Delivery Guarantees
At-least-once delivery means consumers might receive duplicates. Your job is to make this safe through idempotency keys and secure through cryptographic signatures.
Every webhook should include a signature header that consumers can verify:
import hmac
import hashlib
import json
import time
from typing import Any, Dict, Optional

def sign_webhook_payload(
    payload: Dict[Any, Any],
    secret: str,
    timestamp: Optional[int] = None
) -> Dict[str, str]:
    """Generate webhook signature headers."""
    timestamp = timestamp or int(time.time())
    # The request body must be sent byte-for-byte as serialized here,
    # or the consumer's verification will fail
    payload_bytes = json.dumps(payload, separators=(',', ':')).encode('utf-8')

    # Include timestamp to prevent replay attacks
    signed_content = f"{timestamp}.{payload_bytes.decode('utf-8')}"

    signature = hmac.new(
        secret.encode('utf-8'),
        signed_content.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()

    return {
        'X-Webhook-Timestamp': str(timestamp),
        'X-Webhook-Signature': f"sha256={signature}"
    }
def verify_webhook_signature(
    payload_body: bytes,
    timestamp: str,
    signature: str,
    secret: str,
    tolerance_seconds: int = 300
) -> bool:
    """Verify incoming webhook signature (consumer side)."""
    # Check timestamp to prevent replay attacks
    try:
        ts = int(timestamp)
        if abs(time.time() - ts) > tolerance_seconds:
            return False
    except ValueError:
        return False

    # Reconstruct and compare signature
    signed_content = f"{timestamp}.{payload_body.decode('utf-8')}"
    expected = hmac.new(
        secret.encode('utf-8'),
        signed_content.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(f"sha256={expected}", signature)
Include an idempotency key in every payload. Consumers should store processed keys and skip duplicates:
{
    "id": "evt_abc123",
    "idempotency_key": "payment_completed_ord_789_1699900000",
    "type": "payment.completed",
    "created_at": "2024-01-15T10:30:00Z",
    "data": {
        "order_id": "ord_789",
        "amount": 9999
    }
}
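On the consumer side, deduplication reduces to remembering which keys have been processed. A minimal sketch, where an in-memory set stands in for a durable store (a database table or Redis with a TTL) and handle_webhook() is a hypothetical handler name:

```python
# Keys already processed by this consumer. In production this must be
# durable and shared across consumer instances; a set is for illustration.
processed_keys = set()

def handle_webhook(event: dict) -> bool:
    """Process an event at most once; returns False for duplicates."""
    key = event["idempotency_key"]
    if key in processed_keys:
        return False  # Duplicate delivery - already handled, skip it
    processed_keys.add(key)
    # ... actual business logic goes here ...
    return True
```

Recording the key before running business logic trades a rare lost event for never double-processing; recording after does the opposite. Pick based on which failure is cheaper for the event type.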
Monitoring, Observability, and Alerting
You can’t fix what you can’t see. Every delivery attempt should emit structured logs:
import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

@dataclass
class WebhookDeliveryLog:
    event_id: str
    endpoint_id: str
    customer_id: str
    event_type: str
    attempt_number: int
    # Outcome
    success: bool
    response_code: Optional[int]
    response_time_ms: float
    error_message: Optional[str]
    # Context
    timestamp: Optional[str] = None

    def __post_init__(self):
        if self.timestamp is None:
            self.timestamp = datetime.utcnow().isoformat()

def log_delivery_attempt(log_data: WebhookDeliveryLog) -> None:
    """Emit structured log for webhook delivery."""
    logger = logging.getLogger('webhook.delivery')
    log_entry = asdict(log_data)
    log_entry['log_type'] = 'webhook_delivery'

    level = logging.INFO if log_data.success else logging.WARNING
    logger.log(level, json.dumps(log_entry))

# Usage after each delivery attempt:
log_delivery_attempt(WebhookDeliveryLog(
    event_id="evt_abc123",
    endpoint_id="ep_456",
    customer_id="cust_789",
    event_type="payment.completed",
    attempt_number=2,
    success=False,
    response_code=503,
    response_time_ms=2340.5,
    error_message="Service Unavailable"
))
Key metrics to track: delivery success rate (target: >99.9%), p50/p95/p99 latency, queue depth, and per-endpoint failure rates. Set alerts when an endpoint’s failure rate exceeds 50% over 10 minutes—it might need automatic disabling.
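The 50%-over-10-minutes alerting rule can be checked with a small sliding-window counter. FailureRateMonitor below is a sketch, not a library API; the now parameter is injectable purely for testing:

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple

class FailureRateMonitor:
    """Per-endpoint failure rate over a sliding time window."""

    def __init__(self, window_seconds: float = 600.0, threshold: float = 0.5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.outcomes: Deque[Tuple[float, bool]] = deque()

    def record(self, success: bool, now: Optional[float] = None) -> None:
        """Record one delivery outcome."""
        self.outcomes.append((time.time() if now is None else now, success))

    def should_alert(self, now: Optional[float] = None) -> bool:
        """True when the windowed failure rate exceeds the threshold."""
        now = time.time() if now is None else now
        # Drop outcomes that have aged out of the window
        while self.outcomes and self.outcomes[0][0] < now - self.window_seconds:
            self.outcomes.popleft()
        if not self.outcomes:
            return False
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes) > self.threshold
```

A real deployment would compute this from metrics (e.g. a rate over counters) rather than in-process, but the threshold logic is the same.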
Scaling Considerations
As volume grows, you’ll hit bottlenecks. Fair scheduling prevents one noisy customer from starving others. Rate limiting prevents your reliability from becoming a DDoS attack on customer infrastructure.
Implement a token bucket rate limiter per endpoint:
import time
import threading
from dataclasses import dataclass
from typing import Dict

@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float  # tokens per second
    tokens: float
    last_refill: float

class EndpointRateLimiter:
    def __init__(self, default_rate: float = 10.0, default_capacity: float = 20.0):
        self.buckets: Dict[str, TokenBucket] = {}
        self.default_rate = default_rate
        self.default_capacity = default_capacity
        self.lock = threading.Lock()

    def _get_bucket(self, endpoint_id: str) -> TokenBucket:
        if endpoint_id not in self.buckets:
            self.buckets[endpoint_id] = TokenBucket(
                capacity=self.default_capacity,
                refill_rate=self.default_rate,
                tokens=self.default_capacity,
                last_refill=time.time()
            )
        return self.buckets[endpoint_id]

    def _refill(self, bucket: TokenBucket) -> None:
        now = time.time()
        elapsed = now - bucket.last_refill
        bucket.tokens = min(
            bucket.capacity,
            bucket.tokens + (elapsed * bucket.refill_rate)
        )
        bucket.last_refill = now

    def acquire(self, endpoint_id: str, tokens: float = 1.0) -> bool:
        """Try to acquire tokens. Returns True if allowed."""
        with self.lock:
            bucket = self._get_bucket(endpoint_id)
            self._refill(bucket)
            if bucket.tokens >= tokens:
                bucket.tokens -= tokens
                return True
            return False

    def time_until_available(self, endpoint_id: str, tokens: float = 1.0) -> float:
        """Calculate seconds until tokens become available."""
        with self.lock:
            bucket = self._get_bucket(endpoint_id)
            self._refill(bucket)
            if bucket.tokens >= tokens:
                return 0.0
            deficit = tokens - bucket.tokens
            return deficit / bucket.refill_rate
When acquire() returns False, reschedule the webhook for time_until_available() seconds in the future rather than dropping it.
Operational Tooling and Self-Service
Build tools for both your operations team and customers. An admin dashboard should show queue depth, per-endpoint health scores, and enable manual retries or bulk replay. Customers need visibility into their webhook logs: what was sent, when, and whether it succeeded.
Consider automatic endpoint disabling after sustained failures (e.g., 100 consecutive failures). Send an email notification and require manual re-enablement to prevent wasting resources on abandoned integrations.
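The consecutive-failure rule reduces to a small state transition; update_endpoint_health() is a hypothetical helper, and the real version would also persist the new state and send the notification email:

```python
CONSECUTIVE_FAILURE_LIMIT = 100  # Matches the threshold suggested above

def update_endpoint_health(failure_count: int, success: bool):
    """Return (new_failure_count, is_active) after a delivery attempt."""
    if success:
        return 0, True  # Any success resets the streak
    failure_count += 1
    # Disable once the streak reaches the limit; re-enablement is manual
    return failure_count, failure_count < CONSECUTIVE_FAILURE_LIMIT
```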
If building from scratch feels excessive, evaluate managed solutions: AWS EventBridge for AWS-native architectures, Svix for a dedicated webhook service, or Hookdeck for debugging and reliability features. These handle the infrastructure complexity while you focus on your core product.
The webhook system you build today becomes the foundation for every integration tomorrow. Invest in reliability upfront—your customers’ trust depends on it.