Circuit Breaker: Fault Tolerance Pattern
Key Insights
- Circuit breakers prevent cascading failures by failing fast when a downstream service is unhealthy, giving it time to recover instead of overwhelming it with requests
- The three-state model (Closed, Open, Half-Open) provides automatic recovery detection, eliminating the need for manual intervention when services come back online
- Proper tuning requires understanding your service’s SLAs—set failure thresholds and timeout windows based on actual traffic patterns, not arbitrary defaults
Why Services Fail (And Take Everything With Them)
Distributed systems fail in interesting ways. A single slow database query can exhaust your connection pool. A third-party API timing out can block your request threads. Before you know it, your entire system grinds to a halt because one dependency decided to have a bad day.
Here’s a typical scenario that plays out in production:
```python
import requests

def get_user_recommendations(user_id: str) -> list:
    # This call has no protection
    response = requests.get(
        f"https://recommendation-service/users/{user_id}/recommendations",
        timeout=30  # 30 seconds feels "safe"
    )
    return response.json()

def get_user_profile(user_id: str) -> dict:
    profile = fetch_user_from_db(user_id)
    # If the recommendation service is slow, this blocks for 30 seconds
    profile["recommendations"] = get_user_recommendations(user_id)
    return profile
```
When the recommendation service starts timing out, every request to get_user_profile now takes 30 seconds. Your thread pool fills up. New requests queue. Users see spinning loaders. Your monitoring lights up red.
The insidious part? The recommendation service might be struggling under load, and you’re making it worse by continuing to hammer it with requests it can’t handle.
The Circuit Breaker Concept
A circuit breaker works exactly like its electrical namesake. When too much current flows through an electrical circuit, the breaker trips to prevent damage. In software, when too many failures occur, the circuit breaker trips to prevent cascading failures.
The pattern uses three states:
Closed (normal operation): Requests flow through normally. The breaker monitors for failures. If failures exceed a threshold within a time window, the breaker trips to Open.
Open (failing fast): All requests fail immediately without attempting the downstream call. After a timeout period, the breaker transitions to Half-Open.
Half-Open (testing recovery): A limited number of requests are allowed through to test if the downstream service has recovered. If they succeed, the breaker closes. If they fail, it opens again.
```
 ┌─────────┐   failures > threshold   ┌─────────┐
 │ CLOSED  │ ───────────────────────► │  OPEN   │
 └─────────┘                          └─────────┘
      ▲                                 │    ▲
      │ success         timeout expires │    │ failure
      │                                 ▼    │
      │                           ┌───────────┐
      └────────────────────────── │ HALF-OPEN │
                                  └───────────┘
```
This state machine gives you automatic failure detection and automatic recovery detection. No manual intervention required.
Implementing a Basic Circuit Breaker
Let’s build a circuit breaker from scratch to understand the mechanics:
```python
import time
from enum import Enum
from threading import Lock
from typing import Callable, TypeVar

T = TypeVar('T')

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open."""
    pass

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: float = 0.0
        self._half_open_calls = 0
        self._lock = Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                if time.time() - self._last_failure_time >= self.recovery_timeout:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_calls = 0
            return self._state

    def call(self, func: Callable[[], T]) -> T:
        current_state = self.state
        if current_state == CircuitState.OPEN:
            raise CircuitBreakerOpen("Circuit breaker is open")
        if current_state == CircuitState.HALF_OPEN:
            with self._lock:
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitBreakerOpen("Half-open call limit reached")
                self._half_open_calls += 1
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._state = CircuitState.CLOSED
            # A success resets the consecutive-failure count in any state;
            # otherwise intermittent failures would accumulate forever
            self._failure_count = 0

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # A failed recovery probe reopens the circuit immediately
                self._state = CircuitState.OPEN
            elif self._failure_count >= self.failure_threshold:
                self._state = CircuitState.OPEN
```
Usage is straightforward:
```python
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def fetch_recommendations(user_id: str) -> list:
    try:
        return breaker.call(
            lambda: requests.get(f"https://api/users/{user_id}/recs", timeout=5).json()
        )
    except CircuitBreakerOpen:
        return []  # Fail fast: the breaker is open, skip the call entirely
    except requests.RequestException:
        return []  # Request failed; the breaker has recorded the failure
```
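When several call sites need the same empty-list fallback, it can be factored into a small helper. A minimal sketch, in which the `call_with_fallback` name and the `flaky` stub are hypothetical rather than from any library:

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_fallback(
    func: Callable[[], T],
    fallback: T,
    swallow: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run func(); on any of the listed exceptions, return the fallback."""
    try:
        return func()
    except swallow:
        return fallback

def flaky() -> list:
    raise TimeoutError("downstream is down")

print(call_with_fallback(flaky, []))  # []
```

The same helper can wrap a breaker-protected call, turning both `CircuitBreakerOpen` and the underlying request exceptions into one fallback path.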
Configuration and Tuning
The default values in tutorials are almost never right for production. Here’s how to think about each parameter:
Failure Threshold: How many failures before tripping? Consider your traffic volume. If you get 1000 requests per minute, 5 failures might be noise. If you get 10 requests per minute, 5 failures is catastrophic. A good starting point: trip when failure rate exceeds 50% over a meaningful sample size (minimum 10-20 requests).
Recovery Timeout: How long to wait before testing recovery? This depends on your downstream service. If it’s a database that needs time to clear connections, 60 seconds might be appropriate. If it’s an autoscaling service, 30 seconds might be enough for new instances to spin up.
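One common refinement, not implemented in the breaker above, is to grow the open-state wait with each consecutive trip, so a service that keeps failing gets progressively longer to recover. A minimal sketch, assuming a simple capped exponential backoff:

```python
def recovery_timeout(base: float, consecutive_trips: int, cap: float = 300.0) -> float:
    """Exponential backoff on the open duration, capped at `cap` seconds."""
    return min(base * (2 ** consecutive_trips), cap)

print(recovery_timeout(30.0, 0))  # 30.0  (first trip: the configured base)
print(recovery_timeout(30.0, 3))  # 240.0 (fourth consecutive trip)
print(recovery_timeout(30.0, 5))  # 300.0 (capped)
```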
Half-Open Max Calls: How many test requests during recovery? Too few and you might close the circuit prematurely on a lucky request. Too many and you’re essentially back to hammering a struggling service. Start with 3-5 calls.
Sliding Window vs. Count-Based: The implementation above uses a simple count. Production systems often use sliding time windows (e.g., “5 failures in the last 60 seconds”) to avoid stale failure counts from affecting current decisions.
```python
# More sophisticated: sliding-window failure rate
import time
from collections import deque

class SlidingWindowCircuitBreaker:
    def __init__(self, window_size: float = 60.0, failure_rate_threshold: float = 0.5):
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        self._calls: deque = deque()  # (timestamp, success: bool), oldest first

    def record(self, success: bool) -> None:
        self._calls.append((time.time(), success))

    def _failure_rate(self) -> float:
        now = time.time()
        # Drop entries that have aged out of the window
        while self._calls and self._calls[0][0] < now - self.window_size:
            self._calls.popleft()
        if len(self._calls) < 10:  # Minimum sample size
            return 0.0
        failures = sum(1 for _, success in self._calls if not success)
        return failures / len(self._calls)
```
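To make the window arithmetic concrete, here is the same decision as a standalone function on a deque of (timestamp, success) pairs; the names and sample data are illustrative only:

```python
import time
from collections import deque

def should_trip(calls: deque, now: float, window: float = 60.0,
                min_samples: int = 10, threshold: float = 0.5) -> bool:
    # Prune entries older than the window (oldest entries sit at the left)
    while calls and calls[0][0] < now - window:
        calls.popleft()
    if len(calls) < min_samples:
        return False  # Too few data points for a meaningful decision
    failures = sum(1 for _, ok in calls if not ok)
    return failures / len(calls) > threshold

now = time.time()
# 12 calls in the last 12 seconds: 8 failures, 4 successes (67% failure rate)
recent = deque((now - age, age % 3 == 0) for age in range(11, -1, -1))
print(should_trip(recent, now))   # True: above the 50% threshold

sparse = deque([(now, False)] * 5)
print(should_trip(sparse, now))   # False: below the minimum sample size
```

Note that pruning assumes the deque is ordered oldest-first, which holds when calls are appended as they happen.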
Production-Ready Libraries
Don’t build your own circuit breaker for production. Use battle-tested libraries that handle edge cases you haven’t thought of.
Resilience4j (Java/Kotlin):
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)
    .permittedNumberOfCallsInHalfOpenState(10)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("recommendationService", config);

Supplier<List<Recommendation>> decoratedSupplier = CircuitBreaker
    .decorateSupplier(breaker, () -> recommendationClient.getRecommendations(userId));

Try<List<Recommendation>> result = Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class, e -> Collections.emptyList());
```
Polly (.NET):
```csharp
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (ex, duration) =>
            logger.LogWarning($"Circuit opened for {duration.TotalSeconds}s: {ex.Message}"),
        onReset: () =>
            logger.LogInformation("Circuit closed"),
        onHalfOpen: () =>
            logger.LogInformation("Circuit half-open, testing...")
    );

var recommendations = await circuitBreaker.ExecuteAsync(
    () => httpClient.GetFromJsonAsync<List<Recommendation>>($"/users/{userId}/recs")
);
```
Circuit Breakers in Practice
Circuit breakers work best when combined with other resilience patterns:
```python
import logging
from prometheus_client import Counter, Gauge

# Metrics
circuit_state = Gauge('circuit_breaker_state', 'Current state', ['service'])
circuit_trips = Counter('circuit_breaker_trips_total', 'Total trips', ['service'])

class ObservableCircuitBreaker(CircuitBreaker):
    def __init__(self, service_name: str, **kwargs):
        super().__init__(**kwargs)
        self.service_name = service_name
        self.logger = logging.getLogger(f"circuit.{service_name}")

    def _export_state(self):
        circuit_state.labels(service=self.service_name).set(
            {"closed": 0, "open": 1, "half_open": 0.5}[self._state.value]
        )

    def _on_failure(self):
        previous_state = self._state
        super()._on_failure()
        if previous_state != CircuitState.OPEN and self._state == CircuitState.OPEN:
            self.logger.warning(f"Circuit OPENED after {self._failure_count} failures")
            circuit_trips.labels(service=self.service_name).inc()
        self._export_state()

    def _on_success(self):
        previous_state = self._state
        super()._on_success()
        if previous_state == CircuitState.HALF_OPEN:
            self.logger.info("Circuit CLOSED after successful recovery probe")
        # Keep the gauge in sync on the success path too, or it stays at 1
        # after recovery
        self._export_state()
```
Combine with retries carefully: put the retry logic inside the circuit breaker call, so the breaker records one failure per exhausted retry sequence instead of tripping on every transient attempt:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def fetch_with_retry(url: str):
    return requests.get(url, timeout=5)

def fetch_with_resilience(url: str):
    return breaker.call(lambda: fetch_with_retry(url))
```
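To see why the ordering matters, here is a self-contained toy in which `TinyBreaker` and `with_retry` are deliberately simplified stand-ins (not the classes above); it counts how many calls actually reach a permanently failing downstream:

```python
class TinyBreaker:
    """Minimal breaker: fail fast after `threshold` recorded failures."""
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open")
        try:
            return func()
        except Exception:
            self.failures += 1
            raise

def with_retry(func, attempts: int = 3):
    """Naive retry loop: re-raise the last error after `attempts` tries."""
    last = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as e:
            last = e
    raise last

downstream_calls = 0
def flaky():
    global downstream_calls
    downstream_calls += 1
    raise IOError("downstream down")

breaker = TinyBreaker(threshold=3)
# Breaker outermost, retries inside: the breaker records one failure per
# exhausted retry sequence, then starts failing fast.
for _ in range(10):
    try:
        breaker.call(lambda: with_retry(flaky))
    except Exception:
        pass

print(downstream_calls)  # 9: three sequences of three attempts, then fast-fail
```

With the wrapping reversed (retries outside the breaker), the retry loop would keep re-entering an open breaker, delaying the caller for no benefit.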
When Not to Use Circuit Breakers
Circuit breakers add complexity. Sometimes that complexity isn’t worth it:
Low-traffic services: If you’re making 10 requests per hour, you won’t get enough data points for meaningful failure detection. Simple timeouts with retries are sufficient.
Already-queued workloads: If requests go through a message queue, the queue itself provides backpressure. Failed messages can be retried or dead-lettered without circuit breakers.
Idempotent batch jobs: If a nightly job can simply retry from the beginning, you don’t need sophisticated failure handling mid-stream.
Single points of failure: If your only database is down, a circuit breaker just makes you fail faster. You’re still down. Focus on eliminating the single point of failure instead.
The circuit breaker pattern shines in high-traffic synchronous systems where partial degradation is acceptable and downstream services can recover if given breathing room. Use it there. Skip it elsewhere.