System Design: Circuit Breaker Pattern

Key Insights

Circuit breakers prevent cascading failures by failing fast when downstream services are unhealthy, giving them time to recover instead of overwhelming them with requests
The three states (Closed, Open, Half-Open) create a self-healing system that automatically tests recovery and restores normal operation without manual intervention
Proper configuration requires understanding your specific failure modes—there’s no universal threshold, and getting it wrong causes either unnecessary outages or insufficient protection

Introduction: Why Services Fail Gracefully

In distributed systems, failure isn’t a possibility—it’s a certainty. Services go down, networks partition, and databases become unresponsive. The question isn’t whether your dependencies will fail, but how your system behaves when they do.

Consider a common scenario: your payment service calls an external fraud detection API. That API starts responding slowly, taking 30 seconds instead of 200 milliseconds. Your payment service threads pile up waiting for responses. Soon, your thread pool is exhausted. Now your payment service can’t handle any requests—even ones that don’t need fraud detection. The slow dependency has cascaded into a complete outage.

The circuit breaker pattern, borrowed from electrical engineering, solves this problem. Just as an electrical circuit breaker trips to prevent a short circuit from burning down your house, a software circuit breaker trips to prevent a failing dependency from taking down your entire system. When a downstream service fails repeatedly, the circuit breaker opens and immediately rejects requests without even attempting the call. This gives the failing service time to recover while keeping your system responsive.

The Three States Explained

A circuit breaker operates as a state machine with three distinct states, each serving a specific purpose in the failure-recovery lifecycle.

Closed is the normal operating state. Requests flow through to the downstream service while the circuit breaker monitors each call and counts failures. In the implementation below, the count covers consecutive failures, because any success resets it. As long as the count stays below the configured threshold, everything continues normally.

Open is the failure state. When the failure count reaches the threshold, the circuit trips open. All subsequent requests fail immediately without attempting to call the downstream service. This is the “fail fast” behavior that prevents resource exhaustion. The circuit stays open for a configured timeout period.

Half-Open is the recovery testing state. After the timeout expires, the circuit allows a limited number of test requests through. If these succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again and the timeout restarts.

This creates a self-healing loop: detect failure, protect the system, periodically test for recovery, and automatically restore normal operation.

from enum import Enum
from dataclasses import dataclass
from typing import Optional
import time

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class StateTransition:
    from_state: CircuitState
    to_state: CircuitState
    reason: str
    timestamp: float

class CircuitStateManager:
    def __init__(self, failure_threshold: int, recovery_timeout: float):
        self.state = CircuitState.CLOSED
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.transitions: list[StateTransition] = []
    
    def record_success(self) -> Optional[StateTransition]:
        if self.state == CircuitState.HALF_OPEN:
            return self._transition_to(CircuitState.CLOSED, "successful test request")
        self.failure_count = 0
        return None
    
    def record_failure(self) -> Optional[StateTransition]:
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.state == CircuitState.HALF_OPEN:
            return self._transition_to(CircuitState.OPEN, "test request failed")
        
        if self.failure_count >= self.failure_threshold:
            return self._transition_to(CircuitState.OPEN, "failure threshold reached")
        return None
    
    def should_allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        
        if self.state == CircuitState.OPEN:
            if self._timeout_expired():
                self._transition_to(CircuitState.HALF_OPEN, "recovery timeout expired")
                return True
            return False
        
        # HALF_OPEN: this simplified version admits every request as a test
        # request; it does not cap how many are let through.
        return True
    
    def _timeout_expired(self) -> bool:
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.recovery_timeout
    
    def _transition_to(self, new_state: CircuitState, reason: str) -> StateTransition:
        transition = StateTransition(self.state, new_state, reason, time.time())
        self.transitions.append(transition)
        self.state = new_state
        if new_state == CircuitState.CLOSED:
            self.failure_count = 0
        return transition
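
The state manager above lets every request through while the circuit is half-open, whereas the description calls for a limited number of test requests. One way to close that gap is a small probe budget. The following is a minimal sketch, not part of the original implementation; the class name HalfOpenProbeLimiter and its methods are hypothetical.

```python
import threading


class HalfOpenProbeLimiter:
    """Hypothetical helper: admits at most `max_probes` requests per half-open period."""

    def __init__(self, max_probes: int = 3):
        self.max_probes = max_probes
        self._admitted = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if this request may be sent as a recovery probe."""
        with self._lock:
            if self._admitted >= self.max_probes:
                return False
            self._admitted += 1
            return True

    def reset(self) -> None:
        """Start a fresh budget; call this whenever the circuit enters HALF_OPEN."""
        with self._lock:
            self._admitted = 0


limiter = HalfOpenProbeLimiter(max_probes=2)
decisions = [limiter.try_acquire() for _ in range(4)]
print(decisions)  # [True, True, False, False]

limiter.reset()
print(limiter.try_acquire())  # True
```

To wire this in, should_allow_request could call reset() when it moves from OPEN to HALF_OPEN and return try_acquire() in the HALF_OPEN branch, so that requests beyond the budget are rejected just as they are when the circuit is open.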

Core Implementation from Scratch

Building a circuit breaker requires wrapping service calls with failure detection and state management. The core abstraction is a call() method that either executes the wrapped function or fails fast based on circuit state.

import time
from typing import TypeVar, Callable
from functools import wraps

T = TypeVar('T')

class CircuitBreakerOpen(Exception):
    """Raised when circuit is open and requests are being rejected."""
    def __init__(self, circuit_name: str, retry_after: float):
        self.circuit_name = circuit_name
        self.retry_after = retry_after
        super().__init__(f"Circuit '{circuit_name}' is open. Retry after {retry_after:.1f}s")

class CircuitBreaker:
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exceptions: tuple = (Exception,)
    ):
        self.name = name
        self.state_manager = CircuitStateManager(failure_threshold, recovery_timeout)
        self.expected_exceptions = expected_exceptions
        self.recovery_timeout = recovery_timeout
    
    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        if not self.state_manager.should_allow_request():
            retry_after = self._calculate_retry_after()
            raise CircuitBreakerOpen(self.name, retry_after)
        
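        try:
            result = func(*args, **kwargs)
            self.state_manager.record_success()
            return result
        except self.expected_exceptions:
            # Only expected exceptions count as failures; anything else
            # propagates without being recorded as success or failure.
            self.state_manager.record_failure()
            raise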
    
    def _calculate_retry_after(self) -> float:
        if self.state_manager.last_failure_time is None:
            return 0.0
        elapsed = time.time() - self.state_manager.last_failure_time
        return max(0.0, self.recovery_timeout - elapsed)
    
    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        """Decorator usage: @circuit_breaker"""
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper
    
    @property
    def state(self) -> CircuitState:
        return self.state_manager.state
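
The breaker above only counts exceptions, so a dependency that answers slowly but successfully (the fraud detection scenario from the introduction) never trips it. A possible remedy is to classify slow calls as failures. The sketch below assumes a hypothetical SlowCallError exception and fail_if_slow decorator; neither appears in the original code. It measures elapsed time after the call returns, so it does not interrupt a hanging call, and a client-side timeout is still needed to bound the wait.

```python
import time
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")


class SlowCallError(Exception):
    """Hypothetical exception: the call returned, but took too long."""


def fail_if_slow(
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Decorator factory that turns slow-but-successful calls into failures."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            start = clock()
            result = func(*args, **kwargs)
            elapsed = clock() - start
            if elapsed > max_seconds:
                raise SlowCallError(
                    f"{func.__name__} took {elapsed:.2f}s (limit {max_seconds:.2f}s)"
                )
            return result
        return wrapper
    return decorator


# Demo with a fake clock so the example runs instantly and deterministically.
fake_times = iter([0.0, 0.2, 10.0, 40.0])  # start/end readings for two calls


@fail_if_slow(max_seconds=1.0, clock=lambda: next(fake_times))
def check_fraud(order_id: str) -> str:
    return f"ok:{order_id}"


first = check_fraud("A1")  # 0.2s elapsed: returns normally
try:
    check_fraud("A2")      # 30s elapsed: raises SlowCallError
    slow_detected = False
except SlowCallError:
    slow_detected = True
```

With the CircuitBreaker class above, adding SlowCallError to expected_exceptions would make slow responses count toward the failure threshold.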

Configuration Parameters That Matter

The effectiveness of your circuit breaker depends entirely on proper configuration. Poor settings either trip too aggressively (causing unnecessary outages) or too slowly (failing to protect your system).

Failure threshold determines how many failures trigger the circuit. Set this based on your traffic volume and acceptable error rate. For a service handling 1000 requests per second, 5 failures might be noise. For a service handling 10 requests per minute, 5 failures is catastrophic.

Recovery timeout controls how long the circuit stays open. Too short, and you hammer a recovering service. Too long, and you stay degraded unnecessarily. Start with 30-60 seconds and adjust based on typical recovery times for your dependencies.

The choice between sliding-window and count-based detection matters for high-throughput services. Count-based detection (5 failures trips the circuit) is simple but doesn’t account for request volume. Sliding-window detection (a 5% failure rate over the last 100 requests) scales with traffic and provides more nuanced detection.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CircuitBreakerConfig:
    # Failure detection
    failure_threshold: int = 5
    failure_rate_threshold: Optional[float] = None  # e.g., 0.5 for 50%
    sliding_window_size: int = 100
    
    # Recovery behavior
    recovery_timeout: float = 30.0
    half_open_max_requests: int = 3
    
    # What counts as failure
    timeout_duration: float = 10.0
    record_exceptions: tuple = field(default_factory=lambda: (Exception,))
    ignore_exceptions: tuple = field(default_factory=tuple)
    
    @classmethod
    def for_critical_service(cls) -> "CircuitBreakerConfig":
        """Conservative settings for payment/auth services."""
        return cls(
            failure_threshold=3,
            recovery_timeout=60.0,
            half_open_max_requests=1,
            timeout_duration=5.0
        )
    
    @classmethod
    def for_degradable_service(cls) -> "CircuitBreakerConfig":
        """Tolerant settings for non-critical features: trips later, retries sooner."""
        return cls(
            failure_threshold=10,
            recovery_timeout=15.0,
            half_open_max_requests=5,
            timeout_duration=30.0
        )
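
The sliding-window approach described above can be sketched with a bounded deque. This is an illustrative sketch rather than the article's implementation: the class name SlidingWindowFailureRate is hypothetical, and minimum_calls is an added assumption (some libraries offer a comparable minimum-number-of-calls setting) so that a single early failure is not read as a 100% failure rate.

```python
from collections import deque


class SlidingWindowFailureRate:
    """Hypothetical count-based sliding window over the last `window_size` calls."""

    def __init__(
        self,
        window_size: int = 100,
        failure_rate_threshold: float = 0.5,
        minimum_calls: int = 10,
    ):
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls
        # deque(maxlen=...) drops the oldest outcome automatically when full.
        self._outcomes = deque(maxlen=window_size)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_trip(self) -> bool:
        # Require a minimum sample before judging the failure rate.
        if len(self._outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


window = SlidingWindowFailureRate(
    window_size=10, failure_rate_threshold=0.5, minimum_calls=5
)
for ok in [True] * 8 + [False] * 2:
    window.record(ok)
rate_before = window.failure_rate()     # 2 failures out of 10 calls
tripped_before = window.should_trip()

for _ in range(4):
    window.record(False)
rate_after = window.failure_rate()      # window now holds 4 successes, 6 failures
tripped_after = window.should_trip()
```

Because old outcomes fall out of the window, the rate recovers on its own once the dependency starts succeeding again, which a plain failure counter cannot do without an explicit reset.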

Integration with Real Services

Circuit breakers become valuable when wrapping actual service calls. The key is combining the breaker with fallback strategies that maintain functionality during outages.

import httpx
from typing import Optional, Any

class UserServiceClient:
    def __init__(self, base_url: str, cache: Optional[dict] = None):
        self.base_url = base_url
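        # Use "is not None" so that an empty dict passed by the caller is kept and shared
        self.cache = cache if cache is not None else {}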
        self.http_client = httpx.Client(timeout=10.0)
        
        self.circuit_breaker = CircuitBreaker(
            name="user-service",
            failure_threshold=5,
            recovery_timeout=30.0,
            # httpx.TimeoutException is a subclass of httpx.HTTPError, so one entry covers both
            expected_exceptions=(httpx.HTTPError,)
        )
    
    def get_user(self, user_id: str) -> dict[str, Any]:
        cache_key = f"user:{user_id}"
        
        try:
            user = self.circuit_breaker.call(self._fetch_user, user_id)
            self.cache[cache_key] = user  # Update cache on success
            return user
        except CircuitBreakerOpen:
            # Fallback to cached data
            if cache_key in self.cache:
                return {**self.cache[cache_key], "_cached": True}
            # Fallback to minimal response
            return {"id": user_id, "name": "Unknown", "_degraded": True}
        except httpx.HTTPError:
            # Request attempted but failed
            if cache_key in self.cache:
                return {**self.cache[cache_key], "_cached": True}
            raise
    
    def _fetch_user(self, user_id: str) -> dict[str, Any]:
        response = self.http_client.get(f"{self.base_url}/users/{user_id}")
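        # Raises httpx.HTTPStatusError (a subclass of httpx.HTTPError) on 4xx and 5xx,
        # so client errors such as 404 also count toward the failure threshold
        response.raise_for_status()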
        return response.json()

# Usage
client = UserServiceClient("https://api.example.com")
user = client.get_user("123")  # Returns cached/degraded response if circuit is open

Using Established Libraries

For production systems, battle-tested libraries handle edge cases you haven’t considered. They provide thread safety, metrics integration, and configuration flexibility that would take months to build correctly yourself.

Resilience4j (Java) is a widely used choice for JVM applications:

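import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {
    
    private final PaymentGatewayClient gatewayClient;
    // PaymentQueue is an assumed application type, like PaymentGatewayClient
    private final PaymentQueue paymentQueue;
    
    // Constructor injection: required because both fields are final
    public PaymentService(PaymentGatewayClient gatewayClient, PaymentQueue paymentQueue) {
        this.gatewayClient = gatewayClient;
        this.paymentQueue = paymentQueue;
    }
    
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "processPaymentFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        return gatewayClient.charge(request);
    }
    
    // Same parameters as processPayment, plus the exception that triggered the fallback
    private PaymentResult processPaymentFallback(PaymentRequest request, Exception ex) {
        // Queue for retry, return pending status
        paymentQueue.enqueue(request);
        return PaymentResult.pending(request.getId(), "Payment queued for processing");
    }
}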

// Configuration in application.yml:
// resilience4j.circuitbreaker.instances.paymentGateway:
//   failure-rate-threshold: 50
//   wait-duration-in-open-state: 30s
//   sliding-window-size: 100

The trade-off between DIY and libraries is clear: build your own for learning or when you need minimal dependencies; use libraries for production systems where reliability matters.

Monitoring and Observability

A circuit breaker you can’t observe is a circuit breaker you can’t trust. Instrument these key metrics:

State changes: Every transition (closed→open, open→half-open, half-open→closed, half-open→open) should emit an event
Failure rate: Track failures per second and failure percentage over sliding windows
Rejection rate: How many requests are being rejected by open circuits
Recovery time: How long circuits stay open before successfully closing

import logging
from dataclasses import dataclass
from typing import Protocol

class MetricsCollector(Protocol):
    def increment(self, metric: str, tags: dict) -> None: ...
    def gauge(self, metric: str, value: float, tags: dict) -> None: ...

@dataclass
class CircuitBreakerMetrics:
    collector: MetricsCollector
    circuit_name: str
    
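    def record_call(self, success: bool, duration_ms: float):
        status = "success" if success else "failure"
        tags = {"circuit": self.circuit_name, "status": status}
        self.collector.increment("circuit_breaker.calls", tags)
        # Report the latest call duration; a histogram or timer metric
        # would suit a real metrics backend better than a gauge
        self.collector.gauge("circuit_breaker.call_duration_ms", duration_ms, tags)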
    
    def record_state_change(self, from_state: str, to_state: str):
        self.collector.increment(
            "circuit_breaker.state_changes",
            {"circuit": self.circuit_name, "from": from_state, "to": to_state}
        )
        logging.warning(
            f"Circuit '{self.circuit_name}' transitioned: {from_state} -> {to_state}"
        )
    
    def record_rejection(self):
        self.collector.increment(
            "circuit_breaker.rejections",
            {"circuit": self.circuit_name}
        )

Alert on state changes to open—this indicates a dependency problem that needs investigation. Dashboard your circuits prominently; they’re often the first indicator of system-wide issues.
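
To exercise this instrumentation without a real metrics backend, an in-memory collector with the same increment and gauge methods is often enough. The sketch below is one possible shape; InMemoryCollector and opened_circuits are hypothetical names, and the metric names simply mirror those used above.

```python
from collections import Counter


class InMemoryCollector:
    """Hypothetical test double with the same increment/gauge shape as MetricsCollector."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()
        self.gauges: dict = {}

    @staticmethod
    def _key(metric: str, tags: dict) -> tuple:
        # Sort tags so the same tag set always maps to the same key.
        return (metric, tuple(sorted(tags.items())))

    def increment(self, metric: str, tags: dict) -> None:
        self.counters[self._key(metric, tags)] += 1

    def gauge(self, metric: str, value: float, tags: dict) -> None:
        self.gauges[self._key(metric, tags)] = value

    def count(self, metric: str, tags: dict) -> int:
        return self.counters[self._key(metric, tags)]


def opened_circuits(collector: InMemoryCollector) -> list:
    """Names of circuits that recorded at least one transition to 'open'."""
    names = set()
    for (metric, tags), n in collector.counters.items():
        tag_map = dict(tags)
        if (
            metric == "circuit_breaker.state_changes"
            and tag_map.get("to") == "open"
            and n > 0
        ):
            names.add(tag_map["circuit"])
    return sorted(names)


collector = InMemoryCollector()
collector.increment(
    "circuit_breaker.state_changes",
    {"circuit": "user-service", "from": "closed", "to": "open"},
)
collector.increment("circuit_breaker.rejections", {"circuit": "user-service"})
collector.increment("circuit_breaker.rejections", {"circuit": "user-service"})

rejections = collector.count("circuit_breaker.rejections", {"circuit": "user-service"})
to_alert = opened_circuits(collector)
print(rejections, to_alert)  # 2 ['user-service']
```

In a unit test, an instance of this collector can be passed to CircuitBreakerMetrics and inspected after forcing the circuit open; in production the same calls would go to whatever metrics client the system already uses.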

Circuit breakers aren’t optional in distributed systems. They’re the difference between a single service degradation and a cascading outage that pages your entire team at 3 AM. Implement them early, configure them thoughtfully, and monitor them religiously.