Health Checks: Liveness and Readiness Probes

Key Insights

  • Liveness probes answer “should this process be restarted?” while readiness probes answer “should this instance receive traffic?”—conflating them causes cascading failures
  • Keep liveness probes simple and dependency-free; checking external services in liveness probes turns network blips into unnecessary restarts
  • Startup probes are essential for applications with variable initialization times—without them, aggressive liveness probes will kill containers before they’re ready

Why Health Checks Matter

Distributed systems fail. Services crash, connections drop, memory leaks accumulate, and threads deadlock. The question isn’t whether your service will experience failures—it’s whether your infrastructure will detect and recover from them automatically.

Health checks are your first line of defense against silent failures. Without them, a service might continue receiving traffic while stuck in an unrecoverable state, returning errors to every request. Worse, a single unhealthy instance can trigger cascading failures across your entire system as retry storms overwhelm downstream services.

The cost of undetected failures compounds quickly. A deadlocked service that sits in a load balancer pool for five minutes before someone notices has already failed thousands of requests. Proper health checks reduce that detection time to seconds and enable automatic recovery without human intervention.
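
The arithmetic is easy to make concrete. As a back-of-the-envelope sketch (the request rate here is an assumed figure, not from any measurement):

```python
# Cost of slow failure detection, assuming a hypothetical instance
# serving 10 requests per second.
requests_per_second = 10

failed_in_5_minutes = requests_per_second * 5 * 60    # undetected deadlock
failed_in_10_seconds = requests_per_second * 10       # detected by probes

print(failed_in_5_minutes)   # 3000
print(failed_in_10_seconds)  # 100
```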

Liveness vs Readiness: Understanding the Difference

The most common mistake engineers make with health checks is treating liveness and readiness as interchangeable. They serve fundamentally different purposes, and conflating them creates fragile systems.

Liveness probes answer one question: “Is this process in a state where restarting it would help?” A failed liveness probe tells the orchestrator to kill the container and start a fresh one. Use liveness probes to detect deadlocks, infinite loops, or corrupted internal state that won’t resolve without a restart.

Readiness probes answer a different question: “Can this instance currently handle requests?” A failed readiness probe removes the instance from the load balancer but keeps it running. Use readiness probes during startup, when dependencies are temporarily unavailable, or during graceful shutdown.

Here’s the critical insight: a service can be live but not ready. During startup, your application process is running (live) but hasn’t established database connections yet (not ready). During a downstream outage, your service is healthy internally (live) but can’t fulfill requests (not ready). Keeping the service running while removing it from traffic lets it recover automatically when conditions improve.
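
The two probes and the actions they trigger can be summarized in a few lines. This is an illustrative sketch of the decision logic, not an orchestrator API:

```python
def orchestrator_action(live: bool, ready: bool) -> str:
    """Map probe results to the action an orchestrator takes."""
    if not live:
        return "restart container"          # restarting is expected to help
    if not ready:
        return "remove from load balancer"  # keep running, withhold traffic
    return "route traffic"

# Mid-startup: the process is up but dependencies aren't connected yet.
print(orchestrator_action(live=True, ready=False))  # remove from load balancer
```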

Implementing Basic Health Endpoints

Start with simple HTTP endpoints that return appropriate status codes. A 200 response indicates health; anything else indicates a problem.

from fastapi import FastAPI, Response
from datetime import datetime

app = FastAPI()

# Track application state
app_state = {
    "started_at": datetime.utcnow(),
    "ready": False
}

@app.get("/health/live")
async def liveness():
    """Simple liveness check - is the process responding?"""
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness(response: Response):
    """Readiness check - can we handle traffic?"""
    if not app_state["ready"]:
        response.status_code = 503
        return {"status": "not_ready", "reason": "initialization_incomplete"}
    return {"status": "ready"}

@app.on_event("startup")
async def startup():
    # Simulate initialization work
    app_state["ready"] = True

The equivalent in Express.js:

const express = require('express');
const app = express();

let isReady = false;

app.get('/health/live', (req, res) => {
  res.json({ status: 'alive', timestamp: new Date().toISOString() });
});

app.get('/health/ready', (req, res) => {
  if (!isReady) {
    return res.status(503).json({ 
      status: 'not_ready', 
      reason: 'initialization_incomplete' 
    });
  }
  res.json({ status: 'ready' });
});

// Mark ready after initialization
async function initialize() {
  // Connect to databases, warm caches, etc.
  isReady = true;
}

initialize().catch((err) => {
  console.error('Initialization failed:', err);
  process.exit(1);
});

Designing Effective Liveness Probes

Liveness probes should be lightweight and avoid external dependencies. If your liveness probe checks the database and the database goes down, every instance of your service will restart simultaneously—turning a database blip into a complete service outage.

A good liveness probe verifies that the application’s internal machinery is functioning. Check that worker threads are responsive, that the event loop isn’t blocked, and that the process hasn’t entered an unrecoverable state.

package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

var (
    lastWorkerHeartbeat atomic.Int64
    workerHealthy       atomic.Bool
)

func init() {
    workerHealthy.Store(true)
    
    // Background worker that processes jobs
    go func() {
        for {
            // Update heartbeat timestamp
            lastWorkerHeartbeat.Store(time.Now().UnixNano())
            
            // Do actual work here
            processJobs()
            
            time.Sleep(100 * time.Millisecond)
        }
    }()
    
    // Monitor worker health
    go func() {
        for {
            time.Sleep(5 * time.Second)
            lastBeat := lastWorkerHeartbeat.Load()
            elapsed := time.Since(time.Unix(0, lastBeat))
            
            // Worker is unhealthy if no heartbeat for 30 seconds
            workerHealthy.Store(elapsed < 30*time.Second)
        }
    }()
}

func livenessHandler(w http.ResponseWriter, r *http.Request) {
    if !workerHealthy.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte(`{"status": "unhealthy", "reason": "worker_deadlock"}`))
        return
    }
    w.Write([]byte(`{"status": "alive"}`))
}

// Stand-in for the real job-processing work.
func processJobs() {}

func main() {
    http.HandleFunc("/health/live", livenessHandler)
    http.ListenAndServe(":8080", nil)
}

This pattern detects deadlocked workers without checking external services. If the worker thread stops updating its heartbeat, the liveness probe fails and triggers a restart.
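
The same heartbeat pattern translates directly to other runtimes. Here is a stdlib-only Python sketch of the idea (an illustration, separate from the FastAPI example earlier):

```python
import threading
import time

last_heartbeat = time.monotonic()
heartbeat_lock = threading.Lock()

def worker():
    """Background worker that refreshes its heartbeat each iteration."""
    global last_heartbeat
    while True:
        with heartbeat_lock:
            last_heartbeat = time.monotonic()
        # real job processing would go here
        time.sleep(0.1)

def worker_is_healthy(max_silence: float = 30.0) -> bool:
    """Liveness check: has the worker beaten within the allowed window?"""
    with heartbeat_lock:
        elapsed = time.monotonic() - last_heartbeat
    return elapsed < max_silence

threading.Thread(target=worker, daemon=True).start()
time.sleep(0.3)
print(worker_is_healthy())  # True while the worker keeps beating
```

If the worker loop blocks or deadlocks, the heartbeat goes stale and `worker_is_healthy()` flips to False, which is exactly what the liveness endpoint should report.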

Building Smart Readiness Probes

Readiness probes should verify that your service can actually fulfill its purpose. This means checking database connections, cache availability, and critical downstream services.

import asyncio
from fastapi import FastAPI, Response
import asyncpg
from redis import asyncio as aioredis  # aioredis was merged into redis-py (redis>=4.2)
from typing import Dict, Any

app = FastAPI()

# Connection pools initialized at startup
db_pool = None
redis_client = None

async def check_database() -> Dict[str, Any]:
    """Verify database connectivity and pool health."""
    try:
        async with db_pool.acquire(timeout=2.0) as conn:
            await conn.fetchval("SELECT 1")
        
        pool_size = db_pool.get_size()
        pool_free = db_pool.get_idle_size()
        
        return {
            "healthy": True,
            "pool_size": pool_size,
            "pool_free": pool_free
        }
    except asyncio.TimeoutError:
        return {"healthy": False, "error": "connection_timeout"}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

async def check_redis() -> Dict[str, Any]:
    """Verify Redis connectivity."""
    try:
        await asyncio.wait_for(
            redis_client.ping(),
            timeout=1.0
        )
        return {"healthy": True}
    except asyncio.TimeoutError:
        return {"healthy": False, "error": "connection_timeout"}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

@app.get("/health/ready")
async def readiness(response: Response):
    """Comprehensive readiness check."""
    checks = await asyncio.gather(
        check_database(),
        check_redis(),
        return_exceptions=True
    )
    
    db_status = checks[0] if not isinstance(checks[0], Exception) else {
        "healthy": False, "error": str(checks[0])
    }
    redis_status = checks[1] if not isinstance(checks[1], Exception) else {
        "healthy": False, "error": str(checks[1])
    }
    
    all_healthy = db_status.get("healthy") and redis_status.get("healthy")
    
    result = {
        "status": "ready" if all_healthy else "not_ready",
        "checks": {
            "database": db_status,
            "redis": redis_status
        }
    }
    
    if not all_healthy:
        response.status_code = 503
    
    return result

Notice the timeouts on each check. Without them, a slow database response could cause the readiness probe itself to time out, making debugging difficult.
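
Beyond per-check timeouts, it can also help to cap the whole probe with an overall deadline, so the endpoint always answers before the orchestrator's own timeout fires. A minimal sketch with placeholder checks (the sleeps stand in for real database and Redis calls):

```python
import asyncio

async def check_database():
    await asyncio.sleep(0.05)   # stand-in for a real "SELECT 1"
    return {"healthy": True}

async def check_redis():
    await asyncio.sleep(0.05)   # stand-in for a real PING
    return {"healthy": True}

async def readiness_with_deadline(deadline: float = 2.0):
    """Run all checks, but never take longer than the deadline overall."""
    try:
        return await asyncio.wait_for(
            asyncio.gather(check_database(), check_redis()),
            timeout=deadline,
        )
    except asyncio.TimeoutError:
        return [{"healthy": False, "error": "probe_deadline_exceeded"}]

results = asyncio.run(readiness_with_deadline())
print(results)  # [{'healthy': True}, {'healthy': True}]
```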

Kubernetes Configuration Best Practices

Proper probe configuration is as important as the probe implementation. Aggressive timeouts cause unnecessary restarts; lenient ones delay failure detection.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: api
        image: myapp:latest
        ports:
        - containerPort: 8080
        
        # Startup probe: for slow-initializing applications
        startupProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 30  # 5 + (5 * 30) = up to 155 seconds to start
          timeoutSeconds: 3
        
        # Liveness probe: detect deadlocks and unrecoverable states
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 0  # Startup probe handles initial delay
          periodSeconds: 10
          failureThreshold: 3     # Restart after 30 seconds of failures
          timeoutSeconds: 5
        
        # Readiness probe: control traffic routing
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 2     # Remove from LB after 10 seconds
          successThreshold: 1
          timeoutSeconds: 3

Key timing considerations:

  • Startup probes run before liveness and readiness probes, giving your application time to initialize without risking premature restarts
  • failureThreshold × periodSeconds determines how long before action is taken—make this longer than your longest expected transient failure
  • timeoutSeconds should exceed your probe endpoint’s expected response time under load, with margin for garbage collection pauses
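
These budgets can be computed mechanically. A small helper, with parameter names mirroring the Kubernetes probe fields:

```python
def worst_case_detection_seconds(period, failure_threshold, initial_delay=0):
    """Upper bound on time before the orchestrator acts on a failing probe."""
    return initial_delay + period * failure_threshold

# Applying the values from the Deployment manifest:
print(worst_case_detection_seconds(5, 30, initial_delay=5))  # startup: 155
print(worst_case_detection_seconds(10, 3))                   # liveness: 30
print(worst_case_detection_seconds(5, 2))                    # readiness: 10
```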

Anti-Patterns and Troubleshooting

Anti-pattern: Checking downstream services in liveness probes. When your liveness probe verifies database connectivity and the database fails, Kubernetes restarts all your pods simultaneously. The database recovers, but now your entire service is restarting. Use readiness probes for dependency checks.

Anti-pattern: Missing startup probes. Applications with variable startup times—especially JVM-based services or those loading large ML models—need startup probes. Without them, you’re forced to set high initialDelaySeconds on liveness probes, delaying detection of actual failures.

Anti-pattern: Overly complex health checks. A health endpoint that runs database migrations, validates configuration, and checks every downstream service will be slow and fragile. Keep probes focused on their specific purpose.

Debugging probe failures:

# Check probe configuration
kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness\|Startup"

# View probe failure events
kubectl get events --field-selector reason=Unhealthy

# Test probe endpoint directly
kubectl exec -it <pod-name> -- curl -v localhost:8080/health/ready

When probes fail intermittently, check for resource contention. A container hitting CPU limits might not respond to probes in time. Monitor probe latency percentiles, not just success rates.
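
Percentiles are worth the extra bookkeeping because averages hide exactly the pauses that make probes fail. A stdlib-only sketch using nearest-rank percentiles over hypothetical latency samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical probe latencies in ms: mostly fast, with GC-pause outliers
# that a mean would smooth over.
latencies_ms = [12, 14, 11, 13, 12, 15, 11, 240, 13, 12, 14, 310]

print(percentile(latencies_ms, 50))  # 13
print(percentile(latencies_ms, 95))  # 310
```

A p50 of 13 ms looks healthy, but the p95 of 310 ms would blow through a 3-second budget under heavier contention; that tail is what the probe configuration has to accommodate.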

Health checks seem simple, but they’re foundational to reliable service operation. Get them right, and your system recovers from failures automatically. Get them wrong, and your health checks become the source of outages rather than the solution.
