Circuit Breaker: Fault Tolerance Pattern
Key Insights
- Circuit breakers prevent cascading failures by failing fast when a downstream service is unhealthy, giving it time to recover instead of overwhelming it with requests
- The three-state model (Closed, Open, Half-Open) provides automatic recovery detection, eliminating the need for manual intervention when services come back online
- Proper tuning requires understanding your service’s SLAs—set failure thresholds and timeout windows based on actual traffic patterns, not arbitrary defaults
Why Services Fail (And Take Everything With Them)
Distributed systems fail in interesting ways. A single slow database query can exhaust your connection pool. A third-party API timing out can block your request threads. Before you know it, your entire system grinds to a halt because one dependency decided to have a bad day.
Here’s a typical scenario that plays out in production:
```python
import requests

def get_user_recommendations(user_id: str) -> list:
    # This call has no protection
    response = requests.get(
        f"https://recommendation-service/users/{user_id}/recommendations",
        timeout=30  # 30 seconds feels "safe"
    )
    return response.json()

def get_user_profile(user_id: str) -> dict:
    profile = fetch_user_from_db(user_id)
    # If the recommendation service is slow, this blocks for 30 seconds
    profile["recommendations"] = get_user_recommendations(user_id)
    return profile
```
When the recommendation service starts timing out, every request to get_user_profile now takes 30 seconds. Your thread pool fills up. New requests queue. Users see spinning loaders. Your monitoring lights up red.
The insidious part? The recommendation service might be struggling under load, and you’re making it worse by continuing to hammer it with requests it can’t handle.
The Circuit Breaker Concept
A circuit breaker works exactly like its electrical namesake. When too much current flows through an electrical circuit, the breaker trips to prevent damage. In software, when too many failures occur, the circuit breaker trips to prevent cascading failures.
The pattern uses three states:
Closed (normal operation): Requests flow through normally. The breaker monitors for failures. If failures exceed a threshold within a time window, the breaker trips to Open.
Open (failing fast): All requests fail immediately without attempting the downstream call. After a timeout period, the breaker transitions to Half-Open.
Half-Open (testing recovery): A limited number of requests are allowed through to test if the downstream service has recovered. If they succeed, the breaker closes. If they fail, it opens again.
```
 ┌─────────┐   failures > threshold   ┌─────────┐
 │ CLOSED  │ ───────────────────────► │  OPEN   │
 └─────────┘                          └─────────┘
      ▲                                 │    ▲
      │ success         timeout expires │    │ failure
      │                                 ▼    │
      │                           ┌───────────┐
      └────────────────────────── │ HALF-OPEN │
                                  └───────────┘
```
This state machine gives you automatic failure detection and automatic recovery detection. No manual intervention required.
Implementing a Basic Circuit Breaker
Let’s build a circuit breaker from scratch to understand the mechanics:
```python
import time
from enum import Enum
from threading import Lock
from typing import Callable, TypeVar

T = TypeVar('T')

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open."""
    pass

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: float = 0.0
        self._half_open_calls = 0
        self._lock = Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                if time.time() - self._last_failure_time >= self.recovery_timeout:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_calls = 0
            return self._state

    def call(self, func: Callable[[], T]) -> T:
        current_state = self.state
        if current_state == CircuitState.OPEN:
            raise CircuitBreakerOpen("Circuit breaker is open")
        if current_state == CircuitState.HALF_OPEN:
            with self._lock:
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitBreakerOpen("Half-open call limit reached")
                self._half_open_calls += 1
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._state = CircuitState.CLOSED
            # A success resets the consecutive-failure count in any state;
            # otherwise intermittent failures would accumulate forever
            self._failure_count = 0

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # A failed recovery probe reopens the circuit immediately
                self._state = CircuitState.OPEN
            elif self._failure_count >= self.failure_threshold:
                self._state = CircuitState.OPEN
```
Usage is straightforward:
```python
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def fetch_recommendations(user_id: str) -> list:
    try:
        return breaker.call(
            lambda: requests.get(f"https://api/users/{user_id}/recs", timeout=5).json()
        )
    except CircuitBreakerOpen:
        return []  # Fail fast: the breaker is open, skip the call entirely
    except requests.RequestException:
        return []  # Request failed; the breaker has recorded the failure
```
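When several call sites need the same empty-list fallback, it can be factored into a small helper. A minimal sketch, in which the `call_with_fallback` name and the `flaky` stub are hypothetical rather than from any library:

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_fallback(
    func: Callable[[], T],
    fallback: T,
    swallow: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run func(); on any of the listed exceptions, return the fallback."""
    try:
        return func()
    except swallow:
        return fallback

def flaky() -> list:
    raise TimeoutError("downstream is down")

print(call_with_fallback(flaky, []))  # []
```

The same helper can wrap a breaker-protected call, turning both `CircuitBreakerOpen` and the underlying request exceptions into one fallback path.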
Configuration and Tuning
The default values in tutorials are almost never right for production. Here’s how to think about each parameter:
Failure Threshold: How many failures before tripping? Consider your traffic volume. If you get 1000 requests per minute, 5 failures might be noise. If you get 10 requests per minute, 5 failures is catastrophic. A good starting point: trip when failure rate exceeds 50% over a meaningful sample size (minimum 10-20 requests).
Recovery Timeout: How long to wait before testing recovery? This depends on your downstream service. If it’s a database that needs time to clear connections, 60 seconds might be appropriate. If it’s an autoscaling service, 30 seconds might be enough for new instances to spin up.
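One common refinement, not implemented in the breaker above, is to grow the open-state wait with each consecutive trip, so a service that keeps failing gets progressively longer to recover. A minimal sketch, assuming a simple capped exponential backoff:

```python
def recovery_timeout(base: float, consecutive_trips: int, cap: float = 300.0) -> float:
    """Exponential backoff on the open duration, capped at `cap` seconds."""
    return min(base * (2 ** consecutive_trips), cap)

print(recovery_timeout(30.0, 0))  # 30.0  (first trip: the configured base)
print(recovery_timeout(30.0, 3))  # 240.0 (fourth consecutive trip)
print(recovery_timeout(30.0, 5))  # 300.0 (capped)
```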
Half-Open Max Calls: How many test requests during recovery? Too few and you might close the circuit prematurely on a lucky request. Too many and you’re essentially back to hammering a struggling service. Start with 3-5 calls.
Sliding Window vs. Count-Based: The implementation above uses a simple count. Production systems often use sliding time windows (e.g., “5 failures in the last 60 seconds”) to avoid stale failure counts from affecting current decisions.
```python
# More sophisticated: sliding-window failure rate
import time
from collections import deque

class SlidingWindowCircuitBreaker:
    def __init__(self, window_size: float = 60.0, failure_rate_threshold: float = 0.5):
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        self._calls: deque = deque()  # (timestamp, success: bool), oldest first

    def record(self, success: bool) -> None:
        self._calls.append((time.time(), success))

    def _failure_rate(self) -> float:
        now = time.time()
        # Drop entries that have aged out of the window
        while self._calls and self._calls[0][0] < now - self.window_size:
            self._calls.popleft()
        if len(self._calls) < 10:  # Minimum sample size
            return 0.0
        failures = sum(1 for _, success in self._calls if not success)
        return failures / len(self._calls)
```
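To make the window arithmetic concrete, here is the same decision as a standalone function on a deque of (timestamp, success) pairs; the names and sample data are illustrative only:

```python
import time
from collections import deque

def should_trip(calls: deque, now: float, window: float = 60.0,
                min_samples: int = 10, threshold: float = 0.5) -> bool:
    # Prune entries older than the window (oldest entries sit at the left)
    while calls and calls[0][0] < now - window:
        calls.popleft()
    if len(calls) < min_samples:
        return False  # Too few data points for a meaningful decision
    failures = sum(1 for _, ok in calls if not ok)
    return failures / len(calls) > threshold

now = time.time()
# 12 calls in the last 12 seconds: 8 failures, 4 successes (67% failure rate)
recent = deque((now - age, age % 3 == 0) for age in range(11, -1, -1))
print(should_trip(recent, now))   # True: above the 50% threshold

sparse = deque([(now, False)] * 5)
print(should_trip(sparse, now))   # False: below the minimum sample size
```

Note that pruning assumes the deque is ordered oldest-first, which holds when calls are appended as they happen.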
Production-Ready Libraries
Don’t build your own circuit breaker for production. Use battle-tested libraries that handle edge cases you haven’t thought of.
Resilience4j (Java/Kotlin):
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)
    .permittedNumberOfCallsInHalfOpenState(10)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("recommendationService", config);

Supplier<List<Recommendation>> decoratedSupplier = CircuitBreaker
    .decorateSupplier(breaker, () -> recommendationClient.getRecommendations(userId));

Try<List<Recommendation>> result = Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class, e -> Collections.emptyList());
```
Polly (.NET):
```csharp
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (ex, duration) =>
            logger.LogWarning($"Circuit opened for {duration.TotalSeconds}s: {ex.Message}"),
        onReset: () =>
            logger.LogInformation("Circuit closed"),
        onHalfOpen: () =>
            logger.LogInformation("Circuit half-open, testing...")
    );

var recommendations = await circuitBreaker.ExecuteAsync(
    () => httpClient.GetFromJsonAsync<List<Recommendation>>($"/users/{userId}/recs")
);
```
Circuit Breakers in Practice
Circuit breakers work best when combined with other resilience patterns:
```python
import logging
from prometheus_client import Counter, Gauge

# Metrics
circuit_state = Gauge('circuit_breaker_state', 'Current state', ['service'])
circuit_trips = Counter('circuit_breaker_trips_total', 'Total trips', ['service'])

class ObservableCircuitBreaker(CircuitBreaker):
    def __init__(self, service_name: str, **kwargs):
        super().__init__(**kwargs)
        self.service_name = service_name
        self.logger = logging.getLogger(f"circuit.{service_name}")

    def _export_state(self):
        circuit_state.labels(service=self.service_name).set(
            {"closed": 0, "open": 1, "half_open": 0.5}[self._state.value]
        )

    def _on_failure(self):
        previous_state = self._state
        super()._on_failure()
        if previous_state != CircuitState.OPEN and self._state == CircuitState.OPEN:
            self.logger.warning(f"Circuit OPENED after {self._failure_count} failures")
            circuit_trips.labels(service=self.service_name).inc()
        self._export_state()

    def _on_success(self):
        previous_state = self._state
        super()._on_success()
        if previous_state == CircuitState.HALF_OPEN:
            self.logger.info("Circuit CLOSED after successful recovery probe")
        # Keep the gauge in sync on the success path too, or it stays at 1
        # after recovery
        self._export_state()
```
Combine with retries carefully: put the retry logic inside the circuit breaker call, so the breaker records one failure per exhausted retry sequence instead of tripping on every transient attempt:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def fetch_with_retry(url: str):
    return requests.get(url, timeout=5)

def fetch_with_resilience(url: str):
    return breaker.call(lambda: fetch_with_retry(url))
```
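To see why the ordering matters, here is a self-contained toy in which `TinyBreaker` and `with_retry` are deliberately simplified stand-ins (not the classes above); it counts how many calls actually reach a permanently failing downstream:

```python
class TinyBreaker:
    """Minimal breaker: fail fast after `threshold` recorded failures."""
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open")
        try:
            return func()
        except Exception:
            self.failures += 1
            raise

def with_retry(func, attempts: int = 3):
    """Naive retry loop: re-raise the last error after `attempts` tries."""
    last = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as e:
            last = e
    raise last

downstream_calls = 0
def flaky():
    global downstream_calls
    downstream_calls += 1
    raise IOError("downstream down")

breaker = TinyBreaker(threshold=3)
# Breaker outermost, retries inside: the breaker records one failure per
# exhausted retry sequence, then starts failing fast.
for _ in range(10):
    try:
        breaker.call(lambda: with_retry(flaky))
    except Exception:
        pass

print(downstream_calls)  # 9: three sequences of three attempts, then fast-fail
```

With the wrapping reversed (retries outside the breaker), the retry loop would keep re-entering an open breaker, delaying the caller for no benefit.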
When Not to Use Circuit Breakers
Circuit breakers add complexity. Sometimes that complexity isn’t worth it:
Low-traffic services: If you’re making 10 requests per hour, you won’t get enough data points for meaningful failure detection. Simple timeouts with retries are sufficient.
Already-queued workloads: If requests go through a message queue, the queue itself provides backpressure. Failed messages can be retried or dead-lettered without circuit breakers.
Idempotent batch jobs: If a nightly job can simply retry from the beginning, you don’t need sophisticated failure handling mid-stream.
Single points of failure: If your only database is down, a circuit breaker just makes you fail faster. You’re still down. Focus on eliminating the single point of failure instead.
The circuit breaker pattern shines in high-traffic synchronous systems where partial degradation is acceptable and downstream services can recover if given breathing room. Use it there. Skip it elsewhere.