System Design: Heartbeat and Health Checking
Key Insights
- Heartbeats and health checks serve different purposes: heartbeats detect process liveness, while health checks verify service readiness—conflating them causes cascading failures
- Deep health checks that verify dependencies can become the source of outages when those dependencies slow down; use timeouts and circuit breakers aggressively
- The phi accrual failure detector and sliding window approaches dramatically reduce false positives compared to simple timeout-based detection
Why Systems Need a Pulse
Distributed systems fail in ways that monoliths never could. A service might be running but unable to reach its database. A container might be alive but stuck in an infinite loop. A node might be reachable from the load balancer but partitioned from the rest of the cluster.
Passive monitoring—waiting for errors to bubble up through logs or user complaints—isn’t enough. By the time you notice, users have already experienced degraded service. Heartbeats and health checks flip this model: instead of waiting for failure signals, you continuously verify that systems are functioning correctly.
These patterns aren’t just about detecting dead processes. They’re about answering increasingly nuanced questions: Is this service alive? Can it accept traffic? Is it performing well enough to remain in rotation? The answers determine routing decisions, scaling actions, and alerting thresholds across your entire infrastructure.
Heartbeat Patterns Explained
Heartbeats come in two fundamental flavors: push-based and pull-based. Each has distinct trade-offs that affect your system’s failure detection characteristics.
Push-based heartbeats have services actively report their status to a central coordinator. The coordinator marks a service as failed if it doesn’t receive a heartbeat within a configured timeout. This approach scales well because the coordinator doesn’t need to maintain connections to every service.
Pull-based health checks have the coordinator (or load balancer) actively poll services. This gives the coordinator more control over check frequency and allows it to verify specific functionality, not just network reachability.
Here’s a basic push-based heartbeat implementation:
```python
import socket
import threading
import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class ServiceStatus:
    last_seen: float
    metadata: dict

class HeartbeatReceiver:
    def __init__(self, port: int, timeout_seconds: float = 10.0):
        self.port = port
        self.timeout = timeout_seconds
        self.services: Dict[str, ServiceStatus] = {}
        self.lock = threading.Lock()

    def start(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(('0.0.0.0', self.port))
        threading.Thread(target=self._reaper, daemon=True).start()
        while True:
            data, addr = sock.recvfrom(1024)
            service_id = data.decode('utf-8')
            with self.lock:
                self.services[service_id] = ServiceStatus(
                    last_seen=time.time(),
                    metadata={'address': addr}
                )

    def _reaper(self):
        # Periodically sweep for services whose last heartbeat
        # is older than the timeout
        while True:
            time.sleep(self.timeout / 2)
            now = time.time()
            with self.lock:
                dead = [sid for sid, status in self.services.items()
                        if now - status.last_seen > self.timeout]
                for sid in dead:
                    print(f"Service {sid} marked as dead")
                    del self.services[sid]

class HeartbeatSender:
    def __init__(self, service_id: str, receiver_host: str,
                 receiver_port: int, interval: float = 3.0):
        self.service_id = service_id
        self.receiver = (receiver_host, receiver_port)
        self.interval = interval
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def start(self):
        while True:
            self.sock.sendto(self.service_id.encode(), self.receiver)
            time.sleep(self.interval)
```
The interval-to-timeout ratio matters. A common pattern is setting the timeout to 3x the heartbeat interval, allowing for two missed heartbeats before declaring failure. This accounts for network jitter and temporary GC pauses without being overly sensitive.
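For contrast, the pull-based model can be sketched as a coordinator that invokes each service's check itself and counts consecutive misses. This is a minimal sketch: the `PullChecker` name is illustrative, and in production each check callable would typically be an HTTP GET against the service's health endpoint, with its own timeout.

```python
from typing import Callable, Dict, List

class PullChecker:
    """Pull-based checking: the coordinator drives the schedule,
    polling each service and tracking consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}

    def run_round(self, checks: Dict[str, Callable[[], bool]]) -> List[str]:
        """Poll every service once; return those that have just
        reached failure_threshold consecutive missed checks."""
        newly_dead = []
        for service_id, check in checks.items():
            try:
                ok = check()  # in practice: HTTP GET /health with a timeout
            except Exception:
                ok = False
            if ok:
                self.failures[service_id] = 0
            else:
                self.failures[service_id] = self.failures.get(service_id, 0) + 1
                if self.failures[service_id] == self.failure_threshold:
                    newly_dead.append(service_id)
        return newly_dead
```

Requiring several consecutive misses before acting mirrors the 3x interval-to-timeout rule: one dropped check is noise, several in a row is a signal.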
Implementing Health Check Endpoints
Modern services expose two distinct endpoints: /health (or /healthz) for liveness and /ready for readiness. Conflating these causes real problems.
Liveness answers: “Should this process be killed and restarted?” Only return unhealthy if the process is in an unrecoverable state—deadlocked, corrupted memory, or otherwise broken beyond repair.
Readiness answers: “Should this instance receive traffic?” Return unhealthy during startup, when dependencies are unavailable, or when the instance is draining for shutdown.
Here’s a FastAPI implementation with proper separation:
```python
from fastapi import FastAPI, Response
from contextlib import asynccontextmanager
import asyncio
import asyncpg
import redis.asyncio as redis

# Global state for dependency connections
db_pool = None
redis_client = None
startup_complete = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    global db_pool, redis_client, startup_complete
    db_pool = await asyncpg.create_pool(
        'postgresql://localhost/myapp',
        min_size=2, max_size=10
    )
    redis_client = redis.Redis(host='redis', port=6379)
    startup_complete = True
    yield
    await db_pool.close()
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def liveness():
    # Shallow check: can we respond at all?
    # Only fail if the process is fundamentally broken
    return {"status": "ok"}

@app.get("/ready")
async def readiness(response: Response):
    checks = {}
    healthy = True

    # Check database connectivity
    try:
        async with db_pool.acquire() as conn:
            await asyncio.wait_for(
                conn.fetchval("SELECT 1"),
                timeout=2.0
            )
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"failed: {e}"
        healthy = False

    # Check cache connectivity. Redis speaks its own protocol,
    # not HTTP, so ping it with a Redis client rather than a GET.
    try:
        await asyncio.wait_for(redis_client.ping(), timeout=1.0)
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "unreachable"
        healthy = False

    # Check startup completion
    if not startup_complete:
        checks["startup"] = "in_progress"
        healthy = False
    else:
        checks["startup"] = "complete"

    if not healthy:
        response.status_code = 503
    return {"status": "healthy" if healthy else "unhealthy", "checks": checks}
```
Notice the aggressive timeouts on dependency checks. A health check that hangs waiting for a slow database will cause the load balancer's own probe to time out, leading to flapping and cascading failures.
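Running the dependency checks concurrently, each under its own deadline, bounds the whole endpoint at the slowest single timeout rather than the sum. A minimal sketch; `fast_db` and `hung_cache` are placeholders standing in for real dependency probes:

```python
import asyncio

async def check_with_timeout(name, coro, timeout):
    """Run one dependency check under its own deadline."""
    try:
        await asyncio.wait_for(coro, timeout)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception as e:
        return name, f"failed: {type(e).__name__}"

async def run_all_checks():
    # Placeholders standing in for real dependency probes
    async def fast_db():
        await asyncio.sleep(0.01)

    async def hung_cache():
        await asyncio.sleep(60)  # simulates a stalled dependency

    results = await asyncio.gather(
        check_with_timeout("database", fast_db(), timeout=2.0),
        check_with_timeout("cache", hung_cache(), timeout=0.1),
    )
    return dict(results)
```

`asyncio.wait_for` cancels the stalled check, so the endpoint answers in roughly 0.1 seconds instead of hanging for 60.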
Failure Detection and Consensus
Simple timeout-based failure detection produces false positives. Network latency varies, GC pauses happen, and services under load respond slowly. The phi accrual failure detector, used by Akka and Cassandra, provides a more nuanced approach.
Instead of a binary alive/dead decision, it outputs a “suspicion level” (phi) based on the statistical distribution of observed heartbeat intervals. You configure a threshold phi value; higher thresholds mean more tolerance for variation.
```python
import math
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhiAccrualDetector:
    threshold: float = 8.0
    max_sample_size: int = 1000
    min_std_deviation: float = 0.5
    acceptable_heartbeat_pause: float = 0.0
    first_heartbeat_estimate: float = 1.0

    def __post_init__(self):
        self.intervals = deque(maxlen=self.max_sample_size)
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            interval = now - self.last_heartbeat
            self.intervals.append(interval)
        self.last_heartbeat = now

    def phi(self) -> float:
        if self.last_heartbeat is None:
            return 0.0
        now = time.time()
        time_diff = now - self.last_heartbeat
        if len(self.intervals) < 2:
            # Not enough data yet; fall back to the configured estimate
            mean = self.first_heartbeat_estimate
            std_dev = mean / 4
        else:
            mean = sum(self.intervals) / len(self.intervals)
            variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
            std_dev = max(math.sqrt(variance), self.min_std_deviation)
        # Tolerate a configured pause (e.g. expected GC stalls)
        # before suspicion starts accruing
        mean += self.acceptable_heartbeat_pause
        # Calculate phi using a logistic approximation of the normal CDF
        y = (time_diff - mean) / std_dev
        e = math.exp(-y * (1.5976 + 0.070566 * y * y))
        if time_diff > mean:
            p_later = e / (1.0 + e)
        else:
            p_later = 1.0 - 1.0 / (1.0 + e)
        return -math.log10(p_later) if p_later > 0 else float('inf')

    def is_available(self) -> bool:
        return self.phi() < self.threshold
```
This detector adapts to the actual behavior of each service. A service with consistent 100ms heartbeats will be marked suspicious faster than one with high variance.
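The effect is easy to see with the formula alone. The sketch below recomputes phi for two heartbeat histories after the same one second of silence; the 0.01 s deviation floor is an illustrative choice for sub-second heartbeats, not the class default above.

```python
import math

def phi_for(intervals, time_since_last, min_std=0.01):
    """Phi for a given heartbeat history, using the same logistic
    approximation of the normal CDF as the detector above."""
    mean = sum(intervals) / len(intervals)
    variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    std_dev = max(math.sqrt(variance), min_std)
    y = (time_since_last - mean) / std_dev
    e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    p_later = e / (1.0 + e) if time_since_last > mean else 1.0 - 1.0 / (1.0 + e)
    return -math.log10(p_later) if p_later > 0 else float('inf')

steady = [0.1] * 20                        # consistent 100 ms heartbeats
jittery = [0.05, 0.3, 0.1, 0.5, 0.2] * 4   # same rate on average, high variance

# After 1 second of silence, the steady sender looks far more suspicious
phi_steady = phi_for(steady, 1.0)
phi_jittery = phi_for(jittery, 1.0)
```

The steady history blows well past a typical threshold of 8, while the jittery one stays below it: the same gap is damning for one service and routine for the other.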
Health Checking in Orchestration Platforms
Kubernetes provides three probe types, each serving a distinct purpose:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 8080
          # Startup probe: only runs until success, then stops.
          # Gives slow-starting apps time to initialize.
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
            # Total startup budget: 30 * 10 = 300 seconds
          # Liveness probe: restarts container if failing.
          # Only check process health, not dependencies.
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 0  # Startup probe handles delay
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Readiness probe: removes pod from Service endpoints.
          # Check dependencies here.
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
            successThreshold: 1
```
Common pitfalls I see repeatedly:
- Liveness probes checking dependencies: Database goes down, all pods restart, database comes back, thundering herd crushes it again
- Missing startup probes: Liveness probe kills pods before they finish initializing
- Timeouts longer than periods: Creates overlapping probes and unpredictable behavior
- Success threshold > 1 for liveness: Delays restart of genuinely broken pods
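The last two pitfalls are mechanical enough to lint for. A sketch, assuming probe specs as plain dicts whose field names follow the Kubernetes probe schema and whose defaults mirror Kubernetes' own:

```python
def probe_pitfalls(probe, kind):
    """Flag probe misconfigurations from the list above.
    `kind` is 'liveness', 'readiness', or 'startup'."""
    issues = []
    period = probe.get("periodSeconds", 10)
    timeout = probe.get("timeoutSeconds", 1)
    if timeout >= period:
        issues.append("timeout >= period: probes can overlap")
    if kind == "liveness" and probe.get("successThreshold", 1) > 1:
        issues.append("successThreshold > 1 delays restart of broken pods")
    return issues
```

Running a check like this in CI catches the misconfiguration before a pod ever flaps in production.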
Cascading Failures and Circuit Breakers
Health checks can cause the very outages they’re meant to prevent. When a dependency slows down, deep health checks queue up, consuming threads and connections. The service becomes unresponsive to actual traffic, fails its own health checks, and gets killed—even though the original problem was external.
Wrap dependency checks in circuit breakers:
```python
import time
from enum import Enum
from dataclasses import dataclass, field

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    failure_count: int = field(default=0, init=False)
    last_failure_time: float = field(default=0, init=False)
    half_open_calls: int = field(default=0, init=False)

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                # Recovery window elapsed: let a few trial calls through
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                raise CircuitOpenError("Circuit breaker is open")
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max_calls:
                raise CircuitOpenError("Half-open call limit reached")
            self.half_open_calls += 1
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

class CircuitOpenError(Exception):
    pass

# Usage in a health check; actual_db_check is the application's
# real dependency probe
db_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=15.0)

async def check_database():
    try:
        return await db_circuit.call(actual_db_check)
    except CircuitOpenError:
        return {"status": "unknown", "reason": "circuit_open"}
```
When the circuit opens, the health check returns “unknown” rather than hanging. The service remains in rotation with degraded status rather than being killed.
Observability and Alerting
Expose health check metrics for dashboarding and alerting:
```python
from prometheus_client import Counter, Histogram, Gauge

health_check_duration = Histogram(
    'health_check_duration_seconds',
    'Time spent performing health checks',
    ['check_name']
)

health_check_status = Gauge(
    'health_check_status',
    'Current health check status (1=healthy, 0=unhealthy)',
    ['check_name']
)

health_check_failures = Counter(
    'health_check_failures_total',
    'Total health check failures',
    ['check_name', 'reason']
)

async def instrumented_health_check():
    checks = ['database', 'cache', 'external_api']
    for check in checks:
        with health_check_duration.labels(check_name=check).time():
            try:
                # run_check is the application's own per-dependency probe
                result = await run_check(check)
                health_check_status.labels(check_name=check).set(1 if result else 0)
            except Exception as e:
                health_check_status.labels(check_name=check).set(0)
                health_check_failures.labels(
                    check_name=check,
                    reason=type(e).__name__
                ).inc()
```
Alert on sustained failures, not individual blips. A service that fails one health check isn’t concerning; one that’s been flapping for ten minutes is. Track check latency percentiles—a gradual increase often precedes complete failure.
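As a concrete example of alerting on sustained failure rather than blips, a Prometheus alerting rule can average the status gauge over a window. A sketch, assuming the metric names above:

```yaml
groups:
  - name: health-checks
    rules:
      - alert: HealthCheckFlapping
        # Fires only if the check has been failing more often than not
        # over ten minutes, ignoring isolated blips
        expr: avg_over_time(health_check_status[10m]) < 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Health check {{ $labels.check_name }} failing for a sustained period"
```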
Health checking appears simple but hides significant complexity. Get the fundamentals right: separate liveness from readiness, use adaptive failure detection, protect against cascading failures, and instrument everything. Your 3 AM self will thank you.