System Design: Back Pressure Handling

Key Insights

  • Back pressure is the mechanism that prevents fast producers from overwhelming slow consumers, and without it, your system will eventually collapse under load
  • The four fundamental strategies—blocking, bounded buffering, dropping, and rate limiting—each have distinct trade-offs between latency, throughput, and data loss that must match your use case
  • Effective back pressure requires end-to-end design: from API rate limits to message queue configurations to database connection pools, every component must participate in the flow control chain

What is Back Pressure?

Back pressure is a flow control mechanism that allows consumers to signal producers to slow down when they can’t keep up with incoming data. Think of it like a water pipe system: if you pump water faster than the drain can handle, pressure builds up. Without a release valve or flow regulator, pipes burst.

In software systems, “pipes bursting” means out-of-memory errors, cascading failures, and 3 AM pages. The fundamental problem is simple: producers and consumers rarely operate at the same speed. Networks have variable latency. Databases have connection limits. CPUs have finite cycles. When a fast upstream component meets a slow downstream component, something has to give.

Real-world scenarios where back pressure becomes critical:

  • Message queues: A burst of events from user activity overwhelms your event processor
  • API gateways: A viral moment sends 100x normal traffic to your service
  • Streaming pipelines: A slow database write blocks a real-time analytics pipeline
  • Microservices: Service A calls Service B, which calls Service C—any slowdown propagates upstream

Symptoms of Missing Back Pressure

Before diving into solutions, recognize the symptoms. Systems without proper back pressure handling exhibit predictable failure modes.

Memory exhaustion happens when unbounded queues or buffers grow indefinitely. The JVM runs out of heap, the container hits its memory limit, and the OOM killer terminates your process.

Cascading failures occur when one slow service causes upstream services to accumulate pending requests, eventually exhausting their resources too. This domino effect can take down entire clusters.

Timeout storms emerge when requests queue up so long that they exceed client timeouts. Clients retry, adding more load, creating a death spiral.
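A standard client-side mitigation for the retry death spiral is capped exponential backoff with full jitter, so retrying clients spread out instead of synchronizing into waves. A minimal sketch (the function name and defaults are illustrative, not from any particular library):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: pick a random delay in
    [0, min(cap, base * 2**attempt)] so retrying clients desynchronize."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Delays for the first five retry attempts of one client
delays = [backoff_delay(n) for n in range(5)]
```

The cap matters: without it, late attempts wait absurdly long; without the jitter, every client retries at the same instant and re-creates the original spike.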

Here’s what a naive producer-consumer looks like without back pressure:

public class UnboundedProducerConsumer {
    // Unbounded: grows until OOM. LinkedList is also not thread-safe,
    // a second bug in this naive design.
    private final Queue<Event> queue = new LinkedList<>();
    
    public void produce(Event event) {
        // No limit, no waiting, just pile it on
        queue.add(event);
    }
    
    public void consume() {
        while (true) {
            Event event = queue.poll();
            if (event != null) {
                // Slow processing: 100ms per event
                processSlowly(event);
            }
            // No back pressure signal; this loop also busy-spins when empty
        }
    }
    
    // Producer: 10,000 events/sec
    // Consumer: 10 events/sec
    // Result: Queue grows by 9,990 events/sec until memory exhaustion
}

This code works fine in testing with small loads. In production, it’s a time bomb.

Back Pressure Strategies

Four fundamental strategies exist for handling back pressure. Each makes different trade-offs.

Blocking/Synchronous

The producer blocks until the consumer is ready. This is the simplest approach and guarantees no data loss, but it couples producer throughput to consumer throughput.
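A minimal sketch of the blocking strategy, using Python's standard-library `queue.Queue` (whose `put` blocks when the queue is full):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)  # producer blocks once 2 items are pending
consumed = []

def consumer():
    while True:
        item = buf.get()
        if item is None:  # sentinel: shut down
            break
        time.sleep(0.01)  # simulate a slow consumer
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    buf.put(i)  # blocks here whenever the consumer falls behind
buf.put(None)
t.join()
assert consumed == list(range(10))  # nothing lost: producer was throttled
```

The producer's effective rate becomes the consumer's rate — which is exactly the trade-off: zero data loss, but throughput is capped by the slowest stage.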

Buffering with Bounds

Use a fixed-size buffer. When full, either block the producer or reject new items. This provides some elasticity for bursts while preventing unbounded growth.

Dropping

When overwhelmed, drop data. This could be random sampling, dropping oldest items, or intelligent load shedding. Appropriate when some data loss is acceptable.

Rate Limiting

Explicitly limit the rate of incoming requests using algorithms like token bucket or leaky bucket. This smooths out bursts and provides predictable throughput.

Here’s a bounded queue implementation demonstrating different overflow policies:

public class BoundedQueue<T> {
    private final Queue<T> queue = new LinkedList<>();
    private final int maxSize;
    private final OverflowPolicy policy;
    private final Lock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    
    public enum OverflowPolicy {
        BLOCK,      // Wait until space available
        DROP_NEW,   // Reject incoming item
        DROP_OLD    // Remove oldest, add new
    }
    
    public BoundedQueue(int maxSize, OverflowPolicy policy) {
        this.maxSize = maxSize;
        this.policy = policy;
    }
    
    public boolean offer(T item) throws InterruptedException {
        lock.lock();
        try {
            if (queue.size() >= maxSize) {
                switch (policy) {
                    case BLOCK:
                        while (queue.size() >= maxSize) {
                            notFull.await();
                        }
                        break;
                    case DROP_NEW:
                        return false; // Reject the item
                    case DROP_OLD:
                        queue.poll(); // Remove oldest
                        break;
                }
            }
            queue.add(item);
            notEmpty.signal();
            return true;
        } finally {
            lock.unlock();
        }
    }
    
    public T take() throws InterruptedException {
        lock.lock();
        try {
            while (queue.isEmpty()) {
                notEmpty.await();
            }
            T item = queue.poll();
            notFull.signal();
            return item;
        } finally {
            lock.unlock();
        }
    }
}

Implementation Patterns

Reactive Streams and the request(n) Protocol

The Reactive Streams specification provides a standard for asynchronous stream processing with back pressure. The key insight is the request(n) method: consumers explicitly request how many items they can handle.

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;
import java.time.Duration;

public class ReactiveBackPressure {
    
    public void demonstrateBackPressure() {
        Flux.range(1, 1_000_000)
            .doOnNext(i -> System.out.println("Produced: " + i))
            // Back pressure strategy: buffer up to 256, then error
            .onBackpressureBuffer(256, 
                dropped -> System.out.println("Dropped: " + dropped))
            .publishOn(Schedulers.boundedElastic())
            .doOnNext(i -> {
                // Simulate slow consumer
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("Consumed: " + i);
            })
            .subscribe();
    }
    
    // Alternative strategies:
    // .onBackpressureDrop()     - Silently drop items
    // .onBackpressureLatest()   - Keep only most recent
    // .onBackpressureError()    - Throw exception
}
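The request(n) idea — consumer-driven demand — does not require a reactive library; at its core it is a pull model. A plain-generator sketch of the same principle:

```python
def producer():
    """Lazy source: produces an item only when the consumer pulls one."""
    i = 0
    while True:
        yield i
        i += 1

def consume(source, batch_size: int, batches: int):
    """Pull batch_size items at a time -- the moral equivalent of request(n):
    the consumer asks for exactly as much as it can handle."""
    out = []
    for _ in range(batches):
        out.append([next(source) for _ in range(batch_size)])
    return out

result = consume(producer(), batch_size=3, batches=2)
# result == [[0, 1, 2], [3, 4, 5]]
```

Because the source is lazy, nothing is produced ahead of demand — back pressure falls out of the control flow rather than being bolted on.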

Circuit Breakers as Back Pressure Signals

Circuit breakers serve as back pressure signals in distributed systems. When a downstream service is overwhelmed, the circuit opens, immediately rejecting requests instead of queuing them.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

public class CircuitBreakerBackPressure {
    
    private final CircuitBreaker circuitBreaker;
    
    public CircuitBreakerBackPressure() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallRateThreshold(80)
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .slidingWindowSize(100)
            .build();
            
        this.circuitBreaker = CircuitBreaker.of("downstream-service", config);
    }
    
    public Response callDownstream(Request request) {
        return circuitBreaker.executeSupplier(() -> {
            // When circuit is open, this throws CallNotPermittedException
            // This IS back pressure - fast failure instead of queuing
            return downstreamClient.call(request);
        });
    }
}
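Under the hood, a circuit breaker is a small state machine. A stripped-down sketch of the idea (states and thresholds are illustrative, not resilience4j's actual internals):

```python
import time

class SimpleCircuitBreaker:
    """CLOSED -> OPEN after max_failures consecutive failures;
    OPEN -> HALF_OPEN after reset_timeout seconds; one success re-closes it."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
            else:
                # Fast failure instead of queuing -- this is the back pressure
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.state == "HALF_OPEN":
                self.state = "OPEN"
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```

The key property is that an open circuit rejects in microseconds, so upstream callers never pile requests onto a service that is already drowning.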

Back Pressure in Distributed Systems

Distributed systems require explicit back pressure signaling across network boundaries.

HTTP 429 and Retry-After

REST APIs should return HTTP 429 (Too Many Requests) with a Retry-After header to signal back pressure:

from flask import Flask, jsonify, request
from functools import wraps
import time
from collections import defaultdict

app = Flask(__name__)

# Simple rate limiter using token bucket
class RateLimiter:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last_update = defaultdict(time.time)
    
    def allow(self, key: str) -> tuple[bool, int]:
        now = time.time()
        elapsed = now - self.last_update[key]
        self.last_update[key] = now
        
        # Replenish tokens
        self.tokens[key] = min(
            self.capacity,
            self.tokens[key] + elapsed * self.rate
        )
        
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True, 0
        else:
            # Calculate retry-after
            retry_after = int((1 - self.tokens[key]) / self.rate) + 1
            return False, retry_after

limiter = RateLimiter(rate=10, capacity=20)  # 10 req/sec, burst of 20

def rate_limited(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        client_ip = request.remote_addr
        allowed, retry_after = limiter.allow(client_ip)
        
        if not allowed:
            response = jsonify({
                "error": "Rate limit exceeded",
                "retry_after": retry_after
            })
            response.status_code = 429
            response.headers["Retry-After"] = str(retry_after)
            response.headers["X-RateLimit-Remaining"] = "0"
            return response
            
        return f(*args, **kwargs)
    return decorated

@app.route("/api/resource")
@rate_limited
def get_resource():
    return jsonify({"data": "your resource"})

Kafka Consumer Lag

In Kafka, consumer lag (the difference between the latest offset and the consumer’s current offset) serves as a back pressure indicator. Monitor it and scale consumers accordingly.
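Per-partition lag is simply the end offset minus the consumer's committed offset. A tiny helper to compute it (the offset dictionaries here are hypothetical stand-ins for what you would fetch from the broker, e.g. via kafka-python's `end_offsets` and `committed`):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: latest broker offset minus consumer's committed
    offset. A partition missing from `committed` counts as never consumed."""
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}

# Hypothetical snapshot for topic "events", partitions 0-2
end = {("events", 0): 1500, ("events", 1): 900, ("events", 2): 4200}
done = {("events", 0): 1500, ("events", 1): 850, ("events", 2): 1000}
lag = consumer_lag(end, done)
# lag == {("events", 0): 0, ("events", 1): 50, ("events", 2): 3200}
# Partition 2 is far behind -- a back pressure signal to scale consumers
```

Lag that grows monotonically means consumers can never catch up at current capacity; lag that spikes and recovers is a burst your buffers absorbed.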

Monitoring and Observability

You can’t manage what you can’t measure. Track these key metrics:

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Queue metrics
queue_depth = Gauge(
    'message_queue_depth',
    'Current number of items in queue',
    ['queue_name']
)

queue_capacity = Gauge(
    'message_queue_capacity',
    'Maximum queue capacity',
    ['queue_name']
)

# Back pressure events
items_rejected = Counter(
    'items_rejected_total',
    'Items rejected due to back pressure',
    ['queue_name', 'reason']
)

items_dropped = Counter(
    'items_dropped_total',
    'Items dropped due to overflow',
    ['queue_name']
)

# Processing metrics
processing_time = Histogram(
    'item_processing_seconds',
    'Time to process each item',
    ['queue_name'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Usage in your queue
class InstrumentedQueue:
    def __init__(self, name: str, max_size: int):
        self.name = name
        self.max_size = max_size
        self.queue = []
        queue_capacity.labels(queue_name=name).set(max_size)
    
    def offer(self, item) -> bool:
        if len(self.queue) >= self.max_size:
            items_rejected.labels(
                queue_name=self.name, 
                reason='queue_full'
            ).inc()
            return False
        self.queue.append(item)
        queue_depth.labels(queue_name=self.name).set(len(self.queue))
        return True

Set alerts when queue depth exceeds 80% capacity or rejection rate spikes above baseline.
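That saturation alert can be expressed directly against the metrics above. A sketch of a Prometheus alerting rule, assuming the metric and label names from the instrumentation shown (thresholds are illustrative):

```yaml
groups:
  - name: backpressure
    rules:
      - alert: QueueNearCapacity
        # Fires when any queue sits above 80% of capacity for 5 minutes
        expr: message_queue_depth / message_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue_name }} is above 80% capacity"
```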

Trade-offs and Best Practices

Latency vs throughput: Blocking strategies minimize data loss but increase latency. Dropping strategies maintain low latency but lose data. Choose based on your requirements—financial transactions need different handling than analytics events.

Choose the right strategy:

  • User-facing APIs: Rate limiting with clear 429 responses
  • Event processing: Bounded buffers with monitoring
  • Real-time analytics: Sampling/dropping is often acceptable
  • Financial systems: Blocking with strict ordering guarantees

Common pitfalls to avoid:

  • Unbounded queues: They always overflow eventually
  • Silent drops: If you drop data, log and metric it
  • Ignoring upstream: Back pressure must propagate all the way to the source
  • Fixed timeouts: Use adaptive timeouts based on current load

Back pressure isn’t optional—it’s a fundamental requirement for any system that handles variable load. Design it in from the start, instrument it thoroughly, and test it under realistic conditions. Your 3 AM self will thank you.
