System Design: Circuit Breaker Pattern
Key Insights
- Circuit breakers prevent cascading failures by failing fast when downstream services are unhealthy, giving them time to recover instead of overwhelming them with requests
- The three states (Closed, Open, Half-Open) create a self-healing system that automatically tests recovery and restores normal operation without manual intervention
- Proper configuration requires understanding your specific failure modes—there’s no universal threshold, and getting it wrong causes either unnecessary outages or insufficient protection
Introduction: Why Services Fail Gracefully
In distributed systems, failure isn’t a possibility—it’s a certainty. Services go down, networks partition, and databases become unresponsive. The question isn’t whether your dependencies will fail, but how your system behaves when they do.
Consider a common scenario: your payment service calls an external fraud detection API. That API starts responding slowly, taking 30 seconds instead of 200 milliseconds. Your payment service threads pile up waiting for responses. Soon, your thread pool is exhausted. Now your payment service can’t handle any requests—even ones that don’t need fraud detection. The slow dependency has cascaded into a complete outage.
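To put rough numbers on that scenario, Little's Law (requests in flight = arrival rate × latency) gives a quick estimate. The arrival rate of 50 requests per second below is an assumed figure for illustration; only the 200 ms and 30 s latencies come from the scenario above.

```python
def threads_in_flight(requests_per_second: int, latency_ms: int) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return requests_per_second * latency_ms / 1000


ASSUMED_RATE = 50  # requests/second, hypothetical traffic level

healthy = threads_in_flight(ASSUMED_RATE, 200)      # fraud API answering in 200 ms
degraded = threads_in_flight(ASSUMED_RATE, 30_000)  # fraud API answering in 30 s

print(f"healthy:  {healthy:.0f} threads busy")
print(f"degraded: {degraded:.0f} threads busy")
```

Under these assumptions the same traffic needs 150 times as many threads once the dependency slows down, which is far more than a typical pool is sized for.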
The circuit breaker pattern, borrowed from electrical engineering, solves this problem. Just as an electrical circuit breaker trips to prevent a short circuit from burning down your house, a software circuit breaker trips to prevent a failing dependency from taking down your entire system. When a downstream service fails repeatedly, the circuit breaker opens and immediately rejects requests without even attempting the call. This gives the failing service time to recover while keeping your system responsive.
The Three States Explained
A circuit breaker operates as a state machine with three distinct states, each serving a specific purpose in the failure-recovery lifecycle.
Closed is the normal operating state. Requests flow through to the downstream service. The circuit breaker monitors these calls, tracking failures. As long as failures stay below the configured threshold, everything continues normally.
Open is the failure state. When the failure count reaches the threshold, the circuit trips open. All subsequent requests fail immediately without attempting to call the downstream service. This is the “fail fast” behavior that prevents resource exhaustion. The circuit stays open for a configured timeout period.
Half-Open is the recovery testing state. After the timeout expires, the circuit allows a limited number of test requests through. If these succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again and the timeout restarts.
This creates a self-healing loop: detect failure, protect the system, periodically test for recovery, and automatically restore normal operation.
from enum import Enum
from dataclasses import dataclass
from typing import Optional
import time


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class StateTransition:
    from_state: CircuitState
    to_state: CircuitState
    reason: str
    timestamp: float


class CircuitStateManager:
    def __init__(self, failure_threshold: int, recovery_timeout: float):
        self.state = CircuitState.CLOSED
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.transitions: list[StateTransition] = []

    def record_success(self) -> Optional[StateTransition]:
        if self.state == CircuitState.HALF_OPEN:
            return self._transition_to(CircuitState.CLOSED, "successful test request")
        self.failure_count = 0
        return None

    def record_failure(self) -> Optional[StateTransition]:
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            return self._transition_to(CircuitState.OPEN, "test request failed")
        if self.failure_count >= self.failure_threshold:
            return self._transition_to(CircuitState.OPEN, "failure threshold exceeded")
        return None

    def should_allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if self._timeout_expired():
                self._transition_to(CircuitState.HALF_OPEN, "recovery timeout expired")
                return True
            return False
        return True  # HALF_OPEN allows test requests

    def _timeout_expired(self) -> bool:
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.recovery_timeout

    def _transition_to(self, new_state: CircuitState, reason: str) -> StateTransition:
        transition = StateTransition(self.state, new_state, reason, time.time())
        self.transitions.append(transition)
        self.state = new_state
        if new_state == CircuitState.CLOSED:
            self.failure_count = 0
        return transition
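The manager above lets every request through while half-open. To match the "limited number of test requests" described earlier, one option is a small gate that hands out a fixed number of probe permits. This is a sketch under assumptions: `HalfOpenGate`, `max_probes`, and the rule that every probe must succeed before closing are illustrative choices, not part of the code above, and the class is not thread-safe.

```python
class HalfOpenGate:
    """Hands out a bounded number of probe permits while a circuit is half-open."""

    def __init__(self, max_probes: int = 3):
        if max_probes < 1:
            raise ValueError("max_probes must be at least 1")
        self.max_probes = max_probes
        self.issued = 0      # permits handed out in this half-open period
        self.succeeded = 0   # probes that have completed successfully

    def try_acquire(self) -> bool:
        """Return True if the caller may send a probe request."""
        if self.issued >= self.max_probes:
            return False
        self.issued += 1
        return True

    def record_success(self) -> bool:
        """Record a successful probe; True means every probe has now succeeded."""
        self.succeeded += 1
        return self.succeeded >= self.max_probes

    def reset(self) -> None:
        """Call each time the circuit enters the half-open state."""
        self.issued = 0
        self.succeeded = 0


gate = HalfOpenGate(max_probes=2)
print(gate.try_acquire(), gate.try_acquire(), gate.try_acquire())
print(gate.record_success(), gate.record_success())
```

A failed probe would still reopen the circuit immediately, as `record_failure` already does. The gate only changes how many requests are admitted and how many successes are needed before closing.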
Core Implementation from Scratch
Building a circuit breaker requires wrapping service calls with failure detection and state management. The core abstraction is a call() method that either executes the wrapped function or fails fast based on circuit state.
import time
from typing import TypeVar, Callable
from functools import wraps

T = TypeVar('T')


class CircuitBreakerOpen(Exception):
    """Raised when circuit is open and requests are being rejected."""

    def __init__(self, circuit_name: str, retry_after: float):
        self.circuit_name = circuit_name
        self.retry_after = retry_after
        super().__init__(f"Circuit '{circuit_name}' is open. Retry after {retry_after:.1f}s")


class CircuitBreaker:
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exceptions: tuple = (Exception,)
    ):
        self.name = name
        self.state_manager = CircuitStateManager(failure_threshold, recovery_timeout)
        self.expected_exceptions = expected_exceptions
        self.recovery_timeout = recovery_timeout

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        # Fail fast: reject without touching the downstream service.
        if not self.state_manager.should_allow_request():
            retry_after = self._calculate_retry_after()
            raise CircuitBreakerOpen(self.name, retry_after)
        try:
            result = func(*args, **kwargs)
            self.state_manager.record_success()
            return result
        except self.expected_exceptions:
            # Only the configured exception types count as failures;
            # anything else propagates without affecting the circuit.
            self.state_manager.record_failure()
            raise

    def _calculate_retry_after(self) -> float:
        if self.state_manager.last_failure_time is None:
            return 0.0
        elapsed = time.time() - self.state_manager.last_failure_time
        return max(0.0, self.recovery_timeout - elapsed)

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        """Decorator usage: @circuit_breaker"""
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper

    @property
    def state(self) -> CircuitState:
        return self.state_manager.state
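Callers can make use of the `retry_after` value carried by the exception. One possible use, sketched below, is turning a rejection into an HTTP 503 response with a `Retry-After` header. The `CircuitBreakerOpen` class here is a stand-in copy so the snippet runs on its own; `to_http_response` and its tuple return shape are hypothetical and not tied to any web framework.

```python
import math


class CircuitBreakerOpen(Exception):
    """Stand-in mirroring the exception defined above."""

    def __init__(self, circuit_name: str, retry_after: float):
        self.circuit_name = circuit_name
        self.retry_after = retry_after
        super().__init__(f"Circuit '{circuit_name}' is open. Retry after {retry_after:.1f}s")


def to_http_response(exc: CircuitBreakerOpen) -> tuple[int, dict[str, str], str]:
    """Map a rejection to (status, headers, body). Hypothetical helper."""
    # Retry-After takes whole seconds, so round up to avoid inviting early retries.
    seconds = max(1, math.ceil(exc.retry_after))
    headers = {"Retry-After": str(seconds)}
    return 503, headers, f"{exc.circuit_name} is temporarily unavailable"


status, headers, body = to_http_response(CircuitBreakerOpen("user-service", 12.3))
print(status, headers, body)
```

Passing the hint along lets well-behaved clients back off for roughly the remaining recovery timeout instead of retrying immediately.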
Configuration Parameters That Matter
The effectiveness of your circuit breaker depends entirely on proper configuration. Poor settings either trip too aggressively (causing unnecessary outages) or too slowly (failing to protect your system).
Failure threshold determines how many failures trigger the circuit. Set this based on your traffic volume and acceptable error rate. For a service handling 1000 requests per second, 5 failures might be noise. For a service handling 10 requests per minute, 5 failures is catastrophic.
Recovery timeout controls how long the circuit stays open. Too short, and you hammer a recovering service. Too long, and you stay degraded unnecessarily. Start with 30-60 seconds and adjust based on typical recovery times for your dependencies.
The choice between count-based and sliding-window detection matters for high-throughput services. Count-based detection (5 failures trips the circuit) is simple but doesn’t account for request volume. A sliding window (for example, a 5% failure rate over the last 100 requests) scales with traffic and provides more nuanced detection.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CircuitBreakerConfig:
    # Failure detection
    failure_threshold: int = 5
    failure_rate_threshold: Optional[float] = None  # e.g., 0.5 for 50%
    sliding_window_size: int = 100

    # Recovery behavior
    recovery_timeout: float = 30.0
    half_open_max_requests: int = 3

    # What counts as failure
    timeout_duration: float = 10.0
    record_exceptions: tuple = field(default_factory=lambda: (Exception,))
    ignore_exceptions: tuple = field(default_factory=tuple)

    @classmethod
    def for_critical_service(cls) -> "CircuitBreakerConfig":
        """Conservative settings for payment/auth services: trip early, probe cautiously."""
        return cls(
            failure_threshold=3,
            recovery_timeout=60.0,
            half_open_max_requests=1,
            timeout_duration=5.0
        )

    @classmethod
    def for_degradable_service(cls) -> "CircuitBreakerConfig":
        """Lenient settings for non-critical features: tolerate more failures, retry sooner."""
        return cls(
            failure_threshold=10,
            recovery_timeout=15.0,
            half_open_max_requests=5,
            timeout_duration=30.0
        )
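The sliding-window approach described above can be sketched with a fixed-size deque of recent outcomes. `SlidingWindowTracker` and `min_calls` are illustrative names; `min_calls` is an added assumption that keeps the circuit from tripping on a handful of early requests.

```python
from collections import deque


class SlidingWindowTracker:
    """Tracks the failure rate over the most recent `window_size` calls."""

    def __init__(self, window_size: int = 100, rate_threshold: float = 0.5,
                 min_calls: int = 10):
        self.outcomes: deque[bool] = deque(maxlen=window_size)  # True = failure
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls

    def record(self, failed: bool) -> None:
        # The deque drops the oldest outcome automatically once it is full.
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self) -> bool:
        # Require a minimum sample so one early failure is not read as a 100% rate.
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.rate_threshold


tracker = SlidingWindowTracker(window_size=10, rate_threshold=0.5, min_calls=4)
for failed in [False, True, True, False, True]:
    tracker.record(failed)
print(tracker.failure_rate(), tracker.should_trip())
```

Because old outcomes fall out of the window, a burst of failures stops counting against the service once enough successful calls follow it.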
Integration with Real Services
Circuit breakers become valuable when wrapping actual service calls. The key is combining the breaker with fallback strategies that maintain functionality during outages.
import httpx
from typing import Optional, Any


class UserServiceClient:
    def __init__(self, base_url: str, cache: Optional[dict] = None):
        self.base_url = base_url
        # An explicit None check keeps a caller-supplied cache even when it starts empty.
        self.cache = cache if cache is not None else {}
        self.http_client = httpx.Client(timeout=10.0)
        self.circuit_breaker = CircuitBreaker(
            name="user-service",
            failure_threshold=5,
            recovery_timeout=30.0,
            expected_exceptions=(httpx.HTTPError, httpx.TimeoutException)
        )

    def get_user(self, user_id: str) -> dict[str, Any]:
        cache_key = f"user:{user_id}"
        try:
            user = self.circuit_breaker.call(self._fetch_user, user_id)
            self.cache[cache_key] = user  # Update cache on success
            return user
        except CircuitBreakerOpen:
            # Fallback to cached data
            if cache_key in self.cache:
                return {**self.cache[cache_key], "_cached": True}
            # Fallback to minimal response
            return {"id": user_id, "name": "Unknown", "_degraded": True}
        except httpx.HTTPError:
            # Request attempted but failed
            if cache_key in self.cache:
                return {**self.cache[cache_key], "_cached": True}
            raise

    def _fetch_user(self, user_id: str) -> dict[str, Any]:
        response = self.http_client.get(f"{self.base_url}/users/{user_id}")
        response.raise_for_status()
        return response.json()


# Usage
client = UserServiceClient("https://api.example.com")
user = client.get_user("123")  # Returns cached/degraded response if circuit is open
Using Established Libraries
For production systems, battle-tested libraries handle edge cases you haven’t considered. They provide thread safety, metrics integration, and configuration flexibility that take months to build correctly.
Resilience4j (Java) is the modern standard for JVM applications:
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {
    private final PaymentGatewayClient gatewayClient;
    // Application-specific retry queue (the PaymentQueue type name is illustrative)
    private final PaymentQueue paymentQueue;

    public PaymentService(PaymentGatewayClient gatewayClient, PaymentQueue paymentQueue) {
        this.gatewayClient = gatewayClient;
        this.paymentQueue = paymentQueue;
    }

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "processPaymentFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        return gatewayClient.charge(request);
    }

    // Same parameters as processPayment, plus the exception that triggered the fallback
    private PaymentResult processPaymentFallback(PaymentRequest request, Exception ex) {
        // Queue for retry, return pending status
        paymentQueue.enqueue(request);
        return PaymentResult.pending(request.getId(), "Payment queued for processing");
    }
}

// Configuration in application.yml:
// resilience4j.circuitbreaker.instances.paymentGateway:
//   failure-rate-threshold: 50
//   wait-duration-in-open-state: 30s
//   sliding-window-size: 100
The trade-off between DIY and libraries is clear: build your own for learning or when you need minimal dependencies; use libraries for production systems where reliability matters.
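Thread safety is one of those edge cases. The from-scratch classes above update `failure_count` and `state` without synchronization, so concurrent callers can race. A minimal sketch of one remedy, guarding the counter with a lock, follows; `LockedFailureCounter` is a hypothetical name, and this is not a description of how any particular library does it.

```python
import threading


class LockedFailureCounter:
    """Failure counter whose read-modify-write steps are serialized by a lock."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self._count = 0
        self._lock = threading.Lock()

    def record_failure(self) -> bool:
        """Increment the count; return True only for the call that reaches the threshold."""
        with self._lock:
            self._count += 1
            return self._count == self.threshold

    def reset(self) -> None:
        with self._lock:
            self._count = 0

    @property
    def count(self) -> int:
        with self._lock:
            return self._count


counter = LockedFailureCounter(threshold=500)
tripped: list[bool] = []


def worker() -> None:
    for _ in range(100):
        tripped.append(counter.record_failure())


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.count, tripped.count(True))
```

Returning the trip decision from inside the locked section means exactly one caller observes the threshold being reached, so the state transition and its metrics fire once rather than once per racing thread.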
Monitoring and Observability
A circuit breaker you can’t observe is a circuit breaker you can’t trust. Instrument these key metrics:
- State changes: Every transition (closed→open, open→half-open, half-open→closed, half-open→open) should emit an event
- Failure rate: Track failures per second and failure percentage over sliding windows
- Rejection rate: How many requests are being rejected by open circuits
- Recovery time: How long circuits stay open before successfully closing
import logging
from dataclasses import dataclass
from typing import Protocol


class MetricsCollector(Protocol):
    def increment(self, metric: str, tags: dict) -> None: ...
    def gauge(self, metric: str, value: float, tags: dict) -> None: ...


@dataclass
class CircuitBreakerMetrics:
    collector: MetricsCollector
    circuit_name: str

    def record_call(self, success: bool, duration_ms: float) -> None:
        status = "success" if success else "failure"
        tags = {"circuit": self.circuit_name, "status": status}
        self.collector.increment("circuit_breaker.calls", tags)
        # Report latency too; a histogram or timer may fit better if the backend offers one.
        self.collector.gauge("circuit_breaker.call_duration_ms", duration_ms, tags)

    def record_state_change(self, from_state: str, to_state: str) -> None:
        self.collector.increment(
            "circuit_breaker.state_changes",
            {"circuit": self.circuit_name, "from": from_state, "to": to_state}
        )
        logging.warning(
            "Circuit '%s' transitioned: %s -> %s",
            self.circuit_name, from_state, to_state
        )

    def record_rejection(self) -> None:
        self.collector.increment(
            "circuit_breaker.rejections",
            {"circuit": self.circuit_name}
        )
Alert on state changes to open—this indicates a dependency problem that needs investigation. Dashboard your circuits prominently; they’re often the first indicator of system-wide issues.
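Recovery time, the last metric in the list above, can be derived from the transition log rather than measured separately. The sketch below assumes transitions are available as `(timestamp, from_state, to_state)` tuples using the state names from earlier; `open_durations` is a hypothetical helper.

```python
from typing import Optional


def open_durations(transitions: list[tuple[float, str, str]]) -> list[float]:
    """Seconds between each trip from closed and the next return to closed.

    A failed half-open probe (half_open -> open) does not end an outage,
    so the clock keeps running until the circuit actually closes.
    """
    durations: list[float] = []
    opened_at: Optional[float] = None
    for timestamp, from_state, to_state in transitions:
        if from_state == "closed" and to_state == "open":
            opened_at = timestamp
        elif to_state == "closed" and opened_at is not None:
            durations.append(timestamp - opened_at)
            opened_at = None
    return durations


log = [
    (100.0, "closed", "open"),
    (130.0, "open", "half_open"),
    (131.0, "half_open", "open"),    # probe failed, outage continues
    (161.0, "open", "half_open"),
    (162.0, "half_open", "closed"),  # recovered
]
print(open_durations(log))
```

Comparing these durations with the configured recovery timeout shows how often the first probe succeeds, which is a useful signal when tuning that timeout.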
Circuit breakers aren’t optional in distributed systems. They’re the difference between a single service degradation and a cascading outage that pages your entire team at 3 AM. Implement them early, configure them thoughtfully, and monitor them religiously.