Design a Webhook Delivery System: Reliable Notifications
Key Insights
- A reliable webhook system requires decoupling event production from delivery through persistent queues, enabling retries without blocking your main application flow.
- Exponential backoff with jitter prevents thundering herd problems and gives failing endpoints time to recover, while dead letter queues ensure no event is silently lost.
- Rate limiting per destination is non-negotiable—your reliability shouldn’t come at the cost of overwhelming your customers’ infrastructure.
Introduction: Why Webhook Reliability Matters
Webhooks are the backbone of event-driven integrations. When a user completes a payment, when a deployment finishes, when a document gets signed—these events need to reach external systems reliably. Unlike APIs where the consumer initiates requests, webhooks flip the model: you’re pushing data to endpoints you don’t control.
This inversion creates unique failure modes. The destination server might be down. Network partitions happen. Endpoints timeout under load. Your customer’s infrastructure might reject requests during maintenance windows. A naive “fire and forget” approach means lost events, broken integrations, and angry customers.
A production-grade webhook system needs to guarantee at-least-once delivery, handle failures gracefully, and scale without becoming a bottleneck. Let’s build one.
Core Architecture Components
The fundamental principle is decoupling. Your application shouldn’t wait for webhook delivery to complete. Instead, events flow through a persistent queue that workers consume independently.
Here’s the high-level architecture:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Application │────▶│ Event Queue │────▶│   Workers   │────▶│  Endpoints  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                                       │
       ▼                                       ▼
┌─────────────┐                         ┌─────────────┐
│  Database   │                         │   Metrics   │
└─────────────┘                         └─────────────┘
The database schema needs to track webhook state across retries:
CREATE TABLE webhook_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,
    -- Delivery state
    status VARCHAR(20) DEFAULT 'pending',  -- pending, delivered, failed, dead
    attempt_count INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 5,
    -- Scheduling
    next_attempt_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_attempt_at TIMESTAMP WITH TIME ZONE,
    last_response_code INTEGER,
    last_error TEXT,
    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE
);

-- Partial index so worker queries scan only deliverable rows
-- (PostgreSQL does not allow INDEX clauses inside CREATE TABLE)
CREATE INDEX idx_pending_events ON webhook_events (status, next_attempt_at)
    WHERE status IN ('pending', 'failed');
CREATE TABLE webhook_endpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    url VARCHAR(2048) NOT NULL,
    secret VARCHAR(255) NOT NULL, -- For HMAC signing
    is_active BOOLEAN DEFAULT true,
    failure_count INTEGER DEFAULT 0,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Workers poll for events where status IN ('pending', 'failed') AND next_attempt_at <= NOW(), process them, and update state atomically. Use SELECT FOR UPDATE SKIP LOCKED to prevent multiple workers from grabbing the same event.
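Assuming PostgreSQL, the claim step can be sketched as a single UPDATE … RETURNING wrapped around a SKIP LOCKED subquery. The transient 'processing' status and the claim_events() helper are illustrative, not part of the schema above; conn is any DB-API connection.

```python
# Claim a batch of due events so concurrent workers never grab the same
# row. The subquery locks candidate rows; SKIP LOCKED makes competing
# workers jump past rows another worker already holds.
CLAIM_SQL = """
UPDATE webhook_events
SET status = 'processing', last_attempt_at = NOW()
WHERE id IN (
    SELECT id FROM webhook_events
    WHERE status IN ('pending', 'failed')
      AND next_attempt_at <= NOW()
    ORDER BY next_attempt_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id, event_type, payload, endpoint_id;
"""

def claim_events(conn, batch_size: int = 50):
    """Atomically claim up to batch_size due events for this worker."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"batch_size": batch_size})
        rows = cur.fetchall()
    conn.commit()
    return rows
```

Committing immediately after the UPDATE keeps the transaction short; the HTTP delivery itself should never run inside a database transaction.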
Retry Strategy with Exponential Backoff
When a delivery fails, you need to retry—but not immediately. Hammering a failing endpoint wastes resources and can worsen outages. Exponential backoff gives systems time to recover.
The formula is straightforward: delay = base_delay * (multiplier ^ attempt_count) + jitter. Jitter prevents synchronized retries from creating traffic spikes.
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class RetryConfig:
    base_delay_seconds: float = 10.0
    multiplier: float = 2.0
    max_delay_seconds: float = 3600.0  # Cap at 1 hour
    max_attempts: int = 5
    jitter_factor: float = 0.2  # ±20% randomization

def calculate_next_attempt(
    attempt_count: int,
    config: RetryConfig
) -> Optional[datetime]:
    """Calculate when to retry a failed webhook delivery.

    attempt_count is the number of failed attempts so far,
    i.e. 1 after the first failure.
    """
    if attempt_count >= config.max_attempts:
        return None  # Move to dead letter queue

    # Exponential backoff: the first retry waits base_delay_seconds
    delay = config.base_delay_seconds * (config.multiplier ** (attempt_count - 1))

    # Apply cap
    delay = min(delay, config.max_delay_seconds)

    # Add jitter to prevent thundering herd
    jitter_range = delay * config.jitter_factor
    jitter = random.uniform(-jitter_range, jitter_range)
    delay = max(0, delay + jitter)

    return datetime.utcnow() + timedelta(seconds=delay)

def process_webhook_result(
    event_id: str,
    success: bool,
    response_code: Optional[int],
    error: Optional[str],
    db_connection
) -> None:
    """Update webhook state after a delivery attempt."""
    event = db_connection.get_event(event_id)

    if success:
        db_connection.update_event(event_id, {
            'status': 'delivered',
            'delivered_at': datetime.utcnow(),
            'last_response_code': response_code
        })
        return

    next_attempt = calculate_next_attempt(
        event.attempt_count + 1,  # Count the attempt that just failed
        RetryConfig()
    )

    if next_attempt is None:
        # Max attempts exceeded - move to dead letter queue
        db_connection.update_event(event_id, {
            'status': 'dead',
            'last_response_code': response_code,
            'last_error': error
        })
        enqueue_to_dlq(event)
    else:
        db_connection.update_event(event_id, {
            'status': 'failed',
            'attempt_count': event.attempt_count + 1,
            'next_attempt_at': next_attempt,
            'last_response_code': response_code,
            'last_error': error
        })
A typical retry schedule with these defaults: retries after 10s, 20s, 40s, and 80s. When the fifth attempt also fails, the event moves to a dead letter queue for manual inspection or automated alerting.
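The enqueue_to_dlq() helper is left abstract above. One possible shape, with dlq_store and alerts as illustrative stand-ins for a real queue and alerting hook, and the event treated as a plain dict:

```python
def enqueue_to_dlq(event: dict, dlq_store: list, alerts: list) -> None:
    """Park an exhausted event for inspection and raise an alert.

    Sketch only: in production dlq_store would be a durable table or
    queue, and alerts would page an on-call channel.
    """
    dlq_store.append({
        "event_id": event["id"],
        "endpoint_id": event["endpoint_id"],
        "last_error": event["last_error"],
    })
    alerts.append(f"webhook {event['id']} exhausted retries")
```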
Ensuring Delivery Guarantees
At-least-once delivery means consumers might receive duplicates. Your job is to make this safe through idempotency keys and secure through cryptographic signatures.
Every webhook should include a signature header that consumers can verify:
import hmac
import hashlib
import json
import time
from typing import Any, Dict, Optional

def sign_webhook_payload(
    payload: Dict[Any, Any],
    secret: str,
    timestamp: Optional[int] = None
) -> Dict[str, str]:
    """Generate webhook signature headers."""
    timestamp = timestamp or int(time.time())
    # The request body must be sent byte-for-byte as serialized here,
    # or the consumer's verification will fail
    payload_bytes = json.dumps(payload, separators=(',', ':')).encode('utf-8')

    # Include timestamp to prevent replay attacks
    signed_content = f"{timestamp}.{payload_bytes.decode('utf-8')}"

    signature = hmac.new(
        secret.encode('utf-8'),
        signed_content.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()

    return {
        'X-Webhook-Timestamp': str(timestamp),
        'X-Webhook-Signature': f"sha256={signature}"
    }
def verify_webhook_signature(
    payload_body: bytes,
    timestamp: str,
    signature: str,
    secret: str,
    tolerance_seconds: int = 300
) -> bool:
    """Verify incoming webhook signature (consumer side)."""
    # Check timestamp to prevent replay attacks
    try:
        ts = int(timestamp)
        if abs(time.time() - ts) > tolerance_seconds:
            return False
    except ValueError:
        return False

    # Reconstruct and compare signature
    signed_content = f"{timestamp}.{payload_body.decode('utf-8')}"
    expected = hmac.new(
        secret.encode('utf-8'),
        signed_content.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(f"sha256={expected}", signature)
Include an idempotency key in every payload. Consumers should store processed keys and skip duplicates:
{
    "id": "evt_abc123",
    "idempotency_key": "payment_completed_ord_789_1699900000",
    "type": "payment.completed",
    "created_at": "2024-01-15T10:30:00Z",
    "data": {
        "order_id": "ord_789",
        "amount": 9999
    }
}
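On the consumer side, deduplication reduces to remembering which keys have been processed. A minimal sketch, where an in-memory set stands in for a durable store (a database table or Redis with a TTL) and handle_webhook() is a hypothetical handler name:

```python
# Keys already processed by this consumer. In production this must be
# durable and shared across consumer instances; a set is for illustration.
processed_keys = set()

def handle_webhook(event: dict) -> bool:
    """Process an event at most once; returns False for duplicates."""
    key = event["idempotency_key"]
    if key in processed_keys:
        return False  # Duplicate delivery - already handled, skip it
    processed_keys.add(key)
    # ... actual business logic goes here ...
    return True
```

Recording the key before running business logic trades a rare lost event for never double-processing; recording after does the opposite. Pick based on which failure is cheaper for the event type.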
Monitoring, Observability, and Alerting
You can’t fix what you can’t see. Every delivery attempt should emit structured logs:
import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

@dataclass
class WebhookDeliveryLog:
    event_id: str
    endpoint_id: str
    customer_id: str
    event_type: str
    attempt_number: int
    # Outcome
    success: bool
    response_code: Optional[int]
    response_time_ms: float
    error_message: Optional[str]
    # Context
    timestamp: Optional[str] = None

    def __post_init__(self):
        if self.timestamp is None:
            self.timestamp = datetime.utcnow().isoformat()

def log_delivery_attempt(log_data: WebhookDeliveryLog) -> None:
    """Emit structured log for webhook delivery."""
    logger = logging.getLogger('webhook.delivery')
    log_entry = asdict(log_data)
    log_entry['log_type'] = 'webhook_delivery'

    level = logging.INFO if log_data.success else logging.WARNING
    logger.log(level, json.dumps(log_entry))

# Usage after each delivery attempt:
log_delivery_attempt(WebhookDeliveryLog(
    event_id="evt_abc123",
    endpoint_id="ep_456",
    customer_id="cust_789",
    event_type="payment.completed",
    attempt_number=2,
    success=False,
    response_code=503,
    response_time_ms=2340.5,
    error_message="Service Unavailable"
))
Key metrics to track: delivery success rate (target: >99.9%), p50/p95/p99 latency, queue depth, and per-endpoint failure rates. Set alerts when an endpoint’s failure rate exceeds 50% over 10 minutes—it might need automatic disabling.
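The 50%-over-10-minutes alerting rule can be checked with a small sliding-window counter. FailureRateMonitor below is a sketch, not a library API; the now parameter is injectable purely for testing:

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple

class FailureRateMonitor:
    """Per-endpoint failure rate over a sliding time window."""

    def __init__(self, window_seconds: float = 600.0, threshold: float = 0.5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.outcomes: Deque[Tuple[float, bool]] = deque()

    def record(self, success: bool, now: Optional[float] = None) -> None:
        """Record one delivery outcome."""
        self.outcomes.append((time.time() if now is None else now, success))

    def should_alert(self, now: Optional[float] = None) -> bool:
        """True when the windowed failure rate exceeds the threshold."""
        now = time.time() if now is None else now
        # Drop outcomes that have aged out of the window
        while self.outcomes and self.outcomes[0][0] < now - self.window_seconds:
            self.outcomes.popleft()
        if not self.outcomes:
            return False
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes) > self.threshold
```

A real deployment would compute this from metrics (e.g. a rate over counters) rather than in-process, but the threshold logic is the same.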
Scaling Considerations
As volume grows, you’ll hit bottlenecks. Fair scheduling prevents one noisy customer from starving others. Rate limiting prevents your reliability from becoming a DDoS attack on customer infrastructure.
Implement a token bucket rate limiter per endpoint:
import time
import threading
from dataclasses import dataclass
from typing import Dict

@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float  # tokens per second
    tokens: float
    last_refill: float

class EndpointRateLimiter:
    def __init__(self, default_rate: float = 10.0, default_capacity: float = 20.0):
        self.buckets: Dict[str, TokenBucket] = {}
        self.default_rate = default_rate
        self.default_capacity = default_capacity
        self.lock = threading.Lock()

    def _get_bucket(self, endpoint_id: str) -> TokenBucket:
        if endpoint_id not in self.buckets:
            self.buckets[endpoint_id] = TokenBucket(
                capacity=self.default_capacity,
                refill_rate=self.default_rate,
                tokens=self.default_capacity,
                last_refill=time.time()
            )
        return self.buckets[endpoint_id]

    def _refill(self, bucket: TokenBucket) -> None:
        now = time.time()
        elapsed = now - bucket.last_refill
        bucket.tokens = min(
            bucket.capacity,
            bucket.tokens + (elapsed * bucket.refill_rate)
        )
        bucket.last_refill = now

    def acquire(self, endpoint_id: str, tokens: float = 1.0) -> bool:
        """Try to acquire tokens. Returns True if allowed."""
        with self.lock:
            bucket = self._get_bucket(endpoint_id)
            self._refill(bucket)
            if bucket.tokens >= tokens:
                bucket.tokens -= tokens
                return True
            return False

    def time_until_available(self, endpoint_id: str, tokens: float = 1.0) -> float:
        """Calculate seconds until tokens become available."""
        with self.lock:
            bucket = self._get_bucket(endpoint_id)
            self._refill(bucket)
            if bucket.tokens >= tokens:
                return 0.0
            deficit = tokens - bucket.tokens
            return deficit / bucket.refill_rate
When acquire() returns False, reschedule the webhook for time_until_available() seconds in the future rather than dropping it.
Operational Tooling and Self-Service
Build tools for both your operations team and customers. An admin dashboard should show queue depth, per-endpoint health scores, and enable manual retries or bulk replay. Customers need visibility into their webhook logs: what was sent, when, and whether it succeeded.
Consider automatic endpoint disabling after sustained failures (e.g., 100 consecutive failures). Send an email notification and require manual re-enablement to prevent wasting resources on abandoned integrations.
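The consecutive-failure rule reduces to a small state transition; update_endpoint_health() is a hypothetical helper, and the real version would also persist the new state and send the notification email:

```python
CONSECUTIVE_FAILURE_LIMIT = 100  # Matches the threshold suggested above

def update_endpoint_health(failure_count: int, success: bool):
    """Return (new_failure_count, is_active) after a delivery attempt."""
    if success:
        return 0, True  # Any success resets the streak
    failure_count += 1
    # Disable once the streak reaches the limit; re-enablement is manual
    return failure_count, failure_count < CONSECUTIVE_FAILURE_LIMIT
```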
If building from scratch feels excessive, evaluate managed solutions: AWS EventBridge for AWS-native architectures, Svix for a dedicated webhook service, or Hookdeck for debugging and reliability features. These handle the infrastructure complexity while you focus on your core product.
The webhook system you build today becomes the foundation for every integration tomorrow. Invest in reliability upfront—your customers’ trust depends on it.