System Design: Heartbeat and Health Checking
Key Insights
- Heartbeats and health checks serve different purposes: heartbeats detect process liveness, while health checks verify service readiness—conflating them causes cascading failures
- Deep health checks that verify dependencies can become the source of outages when those dependencies slow down; use timeouts and circuit breakers aggressively
- The phi accrual failure detector and sliding window approaches dramatically reduce false positives compared to simple timeout-based detection
Why Systems Need a Pulse
Distributed systems fail in ways that monoliths never could. A service might be running but unable to reach its database. A container might be alive but stuck in an infinite loop. A node might be reachable from the load balancer but partitioned from the rest of the cluster.
Passive monitoring—waiting for errors to bubble up through logs or user complaints—isn’t enough. By the time you notice, users have already experienced degraded service. Heartbeats and health checks flip this model: instead of waiting for failure signals, you continuously verify that systems are functioning correctly.
These patterns aren’t just about detecting dead processes. They’re about answering increasingly nuanced questions: Is this service alive? Can it accept traffic? Is it performing well enough to remain in rotation? The answers determine routing decisions, scaling actions, and alerting thresholds across your entire infrastructure.
Heartbeat Patterns Explained
Heartbeats come in two fundamental flavors: push-based and pull-based. Each has distinct trade-offs that affect your system’s failure detection characteristics.
Push-based heartbeats have services actively report their status to a central coordinator. The coordinator marks a service as failed if it doesn’t receive a heartbeat within a configured timeout. This approach scales well because the coordinator doesn’t need to maintain connections to every service.
Pull-based health checks have the coordinator (or load balancer) actively poll services. This gives the coordinator more control over check frequency and allows it to verify specific functionality, not just network reachability.
Here’s a basic push-based heartbeat implementation:
```python
import socket
import threading
import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class ServiceStatus:
    last_seen: float
    metadata: dict

class HeartbeatReceiver:
    def __init__(self, port: int, timeout_seconds: float = 10.0):
        self.port = port
        self.timeout = timeout_seconds
        self.services: Dict[str, ServiceStatus] = {}
        self.lock = threading.Lock()

    def start(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(('0.0.0.0', self.port))
        threading.Thread(target=self._reaper, daemon=True).start()
        while True:
            data, addr = sock.recvfrom(1024)
            service_id = data.decode('utf-8')
            with self.lock:
                self.services[service_id] = ServiceStatus(
                    last_seen=time.time(),
                    metadata={'address': addr}
                )

    def _reaper(self):
        # Periodically sweep for services whose last heartbeat
        # is older than the timeout
        while True:
            time.sleep(self.timeout / 2)
            now = time.time()
            with self.lock:
                dead = [sid for sid, status in self.services.items()
                        if now - status.last_seen > self.timeout]
                for sid in dead:
                    print(f"Service {sid} marked as dead")
                    del self.services[sid]

class HeartbeatSender:
    def __init__(self, service_id: str, receiver_host: str,
                 receiver_port: int, interval: float = 3.0):
        self.service_id = service_id
        self.receiver = (receiver_host, receiver_port)
        self.interval = interval
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def start(self):
        while True:
            self.sock.sendto(self.service_id.encode(), self.receiver)
            time.sleep(self.interval)
```
The interval-to-timeout ratio matters. A common pattern is setting the timeout to 3x the heartbeat interval, allowing for two missed heartbeats before declaring failure. This accounts for network jitter and temporary GC pauses without being overly sensitive.
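For contrast, the pull-based model can be sketched as a coordinator that invokes each service's check itself and counts consecutive misses. This is a minimal sketch: the `PullChecker` name is illustrative, and in production each check callable would typically be an HTTP GET against the service's health endpoint, with its own timeout.

```python
from typing import Callable, Dict, List

class PullChecker:
    """Pull-based checking: the coordinator drives the schedule,
    polling each service and tracking consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}

    def run_round(self, checks: Dict[str, Callable[[], bool]]) -> List[str]:
        """Poll every service once; return those that have just
        reached failure_threshold consecutive missed checks."""
        newly_dead = []
        for service_id, check in checks.items():
            try:
                ok = check()  # in practice: HTTP GET /health with a timeout
            except Exception:
                ok = False
            if ok:
                self.failures[service_id] = 0
            else:
                self.failures[service_id] = self.failures.get(service_id, 0) + 1
                if self.failures[service_id] == self.failure_threshold:
                    newly_dead.append(service_id)
        return newly_dead
```

Requiring several consecutive misses before acting mirrors the 3x interval-to-timeout rule: one dropped check is noise, several in a row is a signal.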
Implementing Health Check Endpoints
Modern services expose two distinct endpoints: /health (or /healthz) for liveness and /ready for readiness. Conflating these causes real problems.
Liveness answers: “Should this process be killed and restarted?” Only return unhealthy if the process is in an unrecoverable state—deadlocked, corrupted memory, or otherwise broken beyond repair.
Readiness answers: “Should this instance receive traffic?” Return unhealthy during startup, when dependencies are unavailable, or when the instance is draining for shutdown.
Here’s a FastAPI implementation with proper separation:
```python
from fastapi import FastAPI, Response
from contextlib import asynccontextmanager
import asyncio
import asyncpg
import redis.asyncio as redis

# Global state for dependency connections
db_pool = None
redis_client = None
startup_complete = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    global db_pool, redis_client, startup_complete
    db_pool = await asyncpg.create_pool(
        'postgresql://localhost/myapp',
        min_size=2, max_size=10
    )
    redis_client = redis.Redis(host='redis', port=6379)
    startup_complete = True
    yield
    await db_pool.close()
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def liveness():
    # Shallow check: can we respond at all?
    # Only fail if the process is fundamentally broken
    return {"status": "ok"}

@app.get("/ready")
async def readiness(response: Response):
    checks = {}
    healthy = True

    # Check database connectivity
    try:
        async with db_pool.acquire() as conn:
            await asyncio.wait_for(
                conn.fetchval("SELECT 1"),
                timeout=2.0
            )
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"failed: {e}"
        healthy = False

    # Check cache connectivity. Redis speaks its own protocol,
    # not HTTP, so ping it with a Redis client rather than a GET.
    try:
        await asyncio.wait_for(redis_client.ping(), timeout=1.0)
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "unreachable"
        healthy = False

    # Check startup completion
    if not startup_complete:
        checks["startup"] = "in_progress"
        healthy = False
    else:
        checks["startup"] = "complete"

    if not healthy:
        response.status_code = 503
    return {"status": "healthy" if healthy else "unhealthy", "checks": checks}
```
Notice the aggressive timeouts on dependency checks. A health check that hangs waiting for a slow database will cause the load balancer's own probe to time out, leading to flapping and cascading failures.
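Running the dependency checks concurrently, each under its own deadline, bounds the whole endpoint at the slowest single timeout rather than the sum. A minimal sketch; `fast_db` and `hung_cache` are placeholders standing in for real dependency probes:

```python
import asyncio

async def check_with_timeout(name, coro, timeout):
    """Run one dependency check under its own deadline."""
    try:
        await asyncio.wait_for(coro, timeout)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception as e:
        return name, f"failed: {type(e).__name__}"

async def run_all_checks():
    # Placeholders standing in for real dependency probes
    async def fast_db():
        await asyncio.sleep(0.01)

    async def hung_cache():
        await asyncio.sleep(60)  # simulates a stalled dependency

    results = await asyncio.gather(
        check_with_timeout("database", fast_db(), timeout=2.0),
        check_with_timeout("cache", hung_cache(), timeout=0.1),
    )
    return dict(results)
```

`asyncio.wait_for` cancels the stalled check, so the endpoint answers in roughly 0.1 seconds instead of hanging for 60.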
Failure Detection and Consensus
Simple timeout-based failure detection produces false positives. Network latency varies, GC pauses happen, and services under load respond slowly. The phi accrual failure detector, used by Akka and Cassandra, provides a more nuanced approach.
Instead of a binary alive/dead decision, it outputs a “suspicion level” (phi) based on the statistical distribution of observed heartbeat intervals. You configure a threshold phi value; higher thresholds mean more tolerance for variation.
```python
import math
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhiAccrualDetector:
    threshold: float = 8.0
    max_sample_size: int = 1000
    min_std_deviation: float = 0.5
    acceptable_heartbeat_pause: float = 0.0
    first_heartbeat_estimate: float = 1.0

    def __post_init__(self):
        self.intervals = deque(maxlen=self.max_sample_size)
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            interval = now - self.last_heartbeat
            self.intervals.append(interval)
        self.last_heartbeat = now

    def phi(self) -> float:
        if self.last_heartbeat is None:
            return 0.0
        now = time.time()
        time_diff = now - self.last_heartbeat
        if len(self.intervals) < 2:
            # Not enough data yet; fall back to the configured estimate
            mean = self.first_heartbeat_estimate
            std_dev = mean / 4
        else:
            mean = sum(self.intervals) / len(self.intervals)
            variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
            std_dev = max(math.sqrt(variance), self.min_std_deviation)
        # Tolerate a configured pause (e.g. expected GC stalls)
        # before suspicion starts accruing
        mean += self.acceptable_heartbeat_pause
        # Calculate phi using a logistic approximation of the normal CDF
        y = (time_diff - mean) / std_dev
        e = math.exp(-y * (1.5976 + 0.070566 * y * y))
        if time_diff > mean:
            p_later = e / (1.0 + e)
        else:
            p_later = 1.0 - 1.0 / (1.0 + e)
        return -math.log10(p_later) if p_later > 0 else float('inf')

    def is_available(self) -> bool:
        return self.phi() < self.threshold
```
This detector adapts to the actual behavior of each service. A service with consistent 100ms heartbeats will be marked suspicious faster than one with high variance.
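The effect is easy to see with the formula alone. The sketch below recomputes phi for two heartbeat histories after the same one second of silence; the 0.01 s deviation floor is an illustrative choice for sub-second heartbeats, not the class default above.

```python
import math

def phi_for(intervals, time_since_last, min_std=0.01):
    """Phi for a given heartbeat history, using the same logistic
    approximation of the normal CDF as the detector above."""
    mean = sum(intervals) / len(intervals)
    variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    std_dev = max(math.sqrt(variance), min_std)
    y = (time_since_last - mean) / std_dev
    e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    p_later = e / (1.0 + e) if time_since_last > mean else 1.0 - 1.0 / (1.0 + e)
    return -math.log10(p_later) if p_later > 0 else float('inf')

steady = [0.1] * 20                        # consistent 100 ms heartbeats
jittery = [0.05, 0.3, 0.1, 0.5, 0.2] * 4   # same rate on average, high variance

# After 1 second of silence, the steady sender looks far more suspicious
phi_steady = phi_for(steady, 1.0)
phi_jittery = phi_for(jittery, 1.0)
```

The steady history blows well past a typical threshold of 8, while the jittery one stays below it: the same gap is damning for one service and routine for the other.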
Health Checking in Orchestration Platforms
Kubernetes provides three probe types, each serving a distinct purpose:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 8080
          # Startup probe: only runs until success, then stops.
          # Gives slow-starting apps time to initialize.
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
            # Total startup budget: 30 * 10 = 300 seconds
          # Liveness probe: restarts container if failing.
          # Only check process health, not dependencies.
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 0  # Startup probe handles delay
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Readiness probe: removes pod from Service endpoints.
          # Check dependencies here.
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
            successThreshold: 1
```
Common pitfalls I see repeatedly:
- Liveness probes checking dependencies: Database goes down, all pods restart, database comes back, thundering herd crushes it again
- Missing startup probes: Liveness probe kills pods before they finish initializing
- Timeouts longer than periods: Creates overlapping probes and unpredictable behavior
- Success threshold > 1 for liveness: Delays restart of genuinely broken pods
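The last two pitfalls are mechanical enough to lint for. A sketch, assuming probe specs as plain dicts whose field names follow the Kubernetes probe schema and whose defaults mirror Kubernetes' own:

```python
def probe_pitfalls(probe, kind):
    """Flag probe misconfigurations from the list above.
    `kind` is 'liveness', 'readiness', or 'startup'."""
    issues = []
    period = probe.get("periodSeconds", 10)
    timeout = probe.get("timeoutSeconds", 1)
    if timeout >= period:
        issues.append("timeout >= period: probes can overlap")
    if kind == "liveness" and probe.get("successThreshold", 1) > 1:
        issues.append("successThreshold > 1 delays restart of broken pods")
    return issues
```

Running a check like this in CI catches the misconfiguration before a pod ever flaps in production.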
Cascading Failures and Circuit Breakers
Health checks can cause the very outages they’re meant to prevent. When a dependency slows down, deep health checks queue up, consuming threads and connections. The service becomes unresponsive to actual traffic, fails its own health checks, and gets killed—even though the original problem was external.
Wrap dependency checks in circuit breakers:
```python
import time
from enum import Enum
from dataclasses import dataclass, field

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    failure_count: int = field(default=0, init=False)
    last_failure_time: float = field(default=0, init=False)
    half_open_calls: int = field(default=0, init=False)

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                # Recovery window elapsed: let a few trial calls through
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                raise CircuitOpenError("Circuit breaker is open")
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max_calls:
                raise CircuitOpenError("Half-open call limit reached")
            self.half_open_calls += 1
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

class CircuitOpenError(Exception):
    pass

# Usage in a health check; actual_db_check is the application's
# real dependency probe
db_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=15.0)

async def check_database():
    try:
        return await db_circuit.call(actual_db_check)
    except CircuitOpenError:
        return {"status": "unknown", "reason": "circuit_open"}
```
When the circuit opens, the health check returns “unknown” rather than hanging. The service remains in rotation with degraded status rather than being killed.
Observability and Alerting
Expose health check metrics for dashboarding and alerting:
```python
from prometheus_client import Counter, Histogram, Gauge

health_check_duration = Histogram(
    'health_check_duration_seconds',
    'Time spent performing health checks',
    ['check_name']
)

health_check_status = Gauge(
    'health_check_status',
    'Current health check status (1=healthy, 0=unhealthy)',
    ['check_name']
)

health_check_failures = Counter(
    'health_check_failures_total',
    'Total health check failures',
    ['check_name', 'reason']
)

async def instrumented_health_check():
    checks = ['database', 'cache', 'external_api']
    for check in checks:
        with health_check_duration.labels(check_name=check).time():
            try:
                # run_check is the application's own per-dependency probe
                result = await run_check(check)
                health_check_status.labels(check_name=check).set(1 if result else 0)
            except Exception as e:
                health_check_status.labels(check_name=check).set(0)
                health_check_failures.labels(
                    check_name=check,
                    reason=type(e).__name__
                ).inc()
```
Alert on sustained failures, not individual blips. A service that fails one health check isn’t concerning; one that’s been flapping for ten minutes is. Track check latency percentiles—a gradual increase often precedes complete failure.
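As a concrete example of alerting on sustained failure rather than blips, a Prometheus alerting rule can average the status gauge over a window. A sketch, assuming the metric names above:

```yaml
groups:
  - name: health-checks
    rules:
      - alert: HealthCheckFlapping
        # Fires only if the check has been failing more often than not
        # over ten minutes, ignoring isolated blips
        expr: avg_over_time(health_check_status[10m]) < 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Health check {{ $labels.check_name }} failing for a sustained period"
```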
Health checking appears simple but hides significant complexity. Get the fundamentals right: separate liveness from readiness, use adaptive failure detection, protect against cascading failures, and instrument everything. Your 3 AM self will thank you.