Health Checks: Liveness and Readiness Probes
Key Insights
- Liveness probes answer “should this process be restarted?” while readiness probes answer “should this instance receive traffic?”—conflating them causes cascading failures
- Keep liveness probes simple and dependency-free; checking external services in liveness probes turns network blips into unnecessary restarts
- Startup probes are essential for applications with variable initialization times—without them, aggressive liveness probes will kill containers before they’re ready
Why Health Checks Matter
Distributed systems fail. Services crash, connections drop, memory leaks accumulate, and threads deadlock. The question isn’t whether your service will experience failures—it’s whether your infrastructure will detect and recover from them automatically.
Health checks are your first line of defense against silent failures. Without them, a service might continue receiving traffic while stuck in an unrecoverable state, returning errors to every request. Worse, a single unhealthy instance can trigger cascading failures across your entire system as retry storms overwhelm downstream services.
The cost of undetected failures compounds quickly. A deadlocked service that sits in a load balancer pool for five minutes before someone notices has already failed thousands of requests. Proper health checks reduce that detection time to seconds and enable automatic recovery without human intervention.
Liveness vs Readiness: Understanding the Difference
The most common mistake engineers make with health checks is treating liveness and readiness as interchangeable. They serve fundamentally different purposes, and conflating them creates fragile systems.
Liveness probes answer one question: “Is this process in a state where restarting it would help?” A failed liveness probe tells the orchestrator to kill the container and start a fresh one. Use liveness probes to detect deadlocks, infinite loops, or corrupted internal state that won’t resolve without a restart.
Readiness probes answer a different question: “Can this instance currently handle requests?” A failed readiness probe removes the instance from the load balancer but keeps it running. Use readiness probes during startup, when dependencies are temporarily unavailable, or during graceful shutdown.
Here’s the critical insight: a service can be live but not ready. During startup, your application process is running (live) but hasn’t established database connections yet (not ready). During a downstream outage, your service is healthy internally (live) but can’t fulfill requests (not ready). Keeping the service running while removing it from traffic lets it recover automatically when conditions improve.
Implementing Basic Health Endpoints
Start with simple HTTP endpoints that return appropriate status codes. A 200 response indicates health; anything else indicates a problem.
```python
from fastapi import FastAPI, Response
from datetime import datetime

app = FastAPI()

# Track application state
app_state = {
    "started_at": datetime.utcnow(),
    "ready": False
}

@app.get("/health/live")
async def liveness():
    """Simple liveness check - is the process responding?"""
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness(response: Response):
    """Readiness check - can we handle traffic?"""
    if not app_state["ready"]:
        response.status_code = 503
        return {"status": "not_ready", "reason": "initialization_incomplete"}
    return {"status": "ready"}

@app.on_event("startup")
async def startup():
    # Simulate initialization work
    app_state["ready"] = True
```
The equivalent in Express.js:
```javascript
const express = require('express');
const app = express();

let isReady = false;

app.get('/health/live', (req, res) => {
  res.json({ status: 'alive', timestamp: new Date().toISOString() });
});

app.get('/health/ready', (req, res) => {
  if (!isReady) {
    return res.status(503).json({
      status: 'not_ready',
      reason: 'initialization_incomplete'
    });
  }
  res.json({ status: 'ready' });
});

// Mark ready after initialization
async function initialize() {
  // Connect to databases, warm caches, etc.
  isReady = true;
}

initialize();
app.listen(8080); // port is illustrative
```
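Readiness also matters at the other end of the lifecycle: during graceful shutdown, flip the ready flag off first so the load balancer drains traffic before the process exits. A minimal sketch building on the FastAPI pattern above (the drain duration and the `app_state` dict are illustrative assumptions):

```python
import asyncio
import signal

# Shared readiness flag, standing in for the app_state dict used above
app_state = {"ready": True}

async def drain_and_stop(drain_seconds: float = 10.0) -> None:
    """Mark the instance not-ready, then wait long enough for the load
    balancer to observe the failing readiness probe before exiting."""
    app_state["ready"] = False  # readiness endpoint now returns 503
    await asyncio.sleep(drain_seconds)

def install_sigterm_handler(loop: asyncio.AbstractEventLoop) -> None:
    # On SIGTERM, start draining instead of exiting immediately
    loop.add_signal_handler(
        signal.SIGTERM,
        lambda: asyncio.ensure_future(drain_and_stop())
    )
```

The drain window should be at least `failureThreshold × periodSeconds` of your readiness probe, so the orchestrator is guaranteed to notice before the process goes away.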
Designing Effective Liveness Probes
Liveness probes should be lightweight and avoid external dependencies. If your liveness probe checks the database and the database goes down, every instance of your service will restart simultaneously—turning a database blip into a complete service outage.
A good liveness probe verifies that the application’s internal machinery is functioning. Check that worker threads are responsive, that the event loop isn’t blocked, and that the process hasn’t entered an unrecoverable state.
```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var (
	lastWorkerHeartbeat atomic.Int64
	workerHealthy       atomic.Bool
)

func processJobs() {
	// Placeholder for real job processing
}

func init() {
	workerHealthy.Store(true)
	lastWorkerHeartbeat.Store(time.Now().UnixNano())

	// Background worker that processes jobs
	go func() {
		for {
			// Update heartbeat timestamp
			lastWorkerHeartbeat.Store(time.Now().UnixNano())
			// Do actual work here
			processJobs()
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Monitor worker health
	go func() {
		for {
			time.Sleep(5 * time.Second)
			lastBeat := lastWorkerHeartbeat.Load()
			elapsed := time.Since(time.Unix(0, lastBeat))
			// Worker is unhealthy if no heartbeat for 30 seconds
			workerHealthy.Store(elapsed < 30*time.Second)
		}
	}()
}

func livenessHandler(w http.ResponseWriter, r *http.Request) {
	if !workerHealthy.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte(`{"status": "unhealthy", "reason": "worker_deadlock"}`))
		return
	}
	w.Write([]byte(`{"status": "alive"}`))
}

func main() {
	http.HandleFunc("/health/live", livenessHandler)
	http.ListenAndServe(":8080", nil)
}
```
This pattern detects deadlocked workers without checking external services. If the worker thread stops updating its heartbeat, the liveness probe fails and triggers a restart.
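The same heartbeat idea applies to async Python services, where a blocked event loop is the analogue of a deadlocked worker. A sketch of the pattern (the interval and staleness threshold are illustrative values):

```python
import asyncio
import time

# Updated by the heartbeat task; read by the liveness endpoint
last_heartbeat = time.monotonic()

async def heartbeat(interval: float = 0.5) -> None:
    """Record a timestamp on every iteration. If the event loop is
    blocked, the timestamp goes stale and liveness fails."""
    global last_heartbeat
    while True:
        last_heartbeat = time.monotonic()
        await asyncio.sleep(interval)

def event_loop_alive(max_staleness: float = 5.0) -> bool:
    """Liveness check: has the heartbeat fired recently?"""
    return (time.monotonic() - last_heartbeat) < max_staleness
```

Start the heartbeat with `asyncio.create_task(heartbeat())` during startup, and have `/health/live` return 503 whenever `event_loop_alive()` is false.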
Building Smart Readiness Probes
Readiness probes should verify that your service can actually fulfill its purpose. This means checking database connections, cache availability, and critical downstream services.
```python
import asyncio
from typing import Any, Dict

import aioredis
import asyncpg
from fastapi import FastAPI, Response

app = FastAPI()

# Connection pools initialized at startup
db_pool = None
redis_client = None

async def check_database() -> Dict[str, Any]:
    """Verify database connectivity and pool health."""
    try:
        async with db_pool.acquire(timeout=2.0) as conn:
            await conn.fetchval("SELECT 1")
        pool_size = db_pool.get_size()
        pool_free = db_pool.get_idle_size()
        return {
            "healthy": True,
            "pool_size": pool_size,
            "pool_free": pool_free
        }
    except asyncio.TimeoutError:
        return {"healthy": False, "error": "connection_timeout"}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

async def check_redis() -> Dict[str, Any]:
    """Verify Redis connectivity."""
    try:
        await asyncio.wait_for(redis_client.ping(), timeout=1.0)
        return {"healthy": True}
    except asyncio.TimeoutError:
        return {"healthy": False, "error": "connection_timeout"}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

@app.get("/health/ready")
async def readiness(response: Response):
    """Comprehensive readiness check."""
    checks = await asyncio.gather(
        check_database(),
        check_redis(),
        return_exceptions=True
    )
    db_status = checks[0] if not isinstance(checks[0], Exception) else {
        "healthy": False, "error": str(checks[0])
    }
    redis_status = checks[1] if not isinstance(checks[1], Exception) else {
        "healthy": False, "error": str(checks[1])
    }
    all_healthy = db_status.get("healthy") and redis_status.get("healthy")
    result = {
        "status": "ready" if all_healthy else "not_ready",
        "checks": {
            "database": db_status,
            "redis": redis_status
        }
    }
    if not all_healthy:
        response.status_code = 503
    return result
```
Notice the timeouts on each check. Without them, a slow database response could cause the readiness probe itself to time out, making debugging difficult.
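When probes fire every few seconds across many replicas, the dependency checks themselves become load on the database and cache. A common mitigation is caching the check result for a short TTL; a sketch (the TTL value is an assumption, tune it to your probe period):

```python
import time
from typing import Any, Awaitable, Callable, Dict

def cached_check(
    check: Callable[[], Awaitable[Dict[str, Any]]],
    ttl_seconds: float = 2.0,
) -> Callable[[], Awaitable[Dict[str, Any]]]:
    """Wrap an async health check so repeated probes within the TTL
    reuse the last result instead of hitting the dependency again."""
    cache: Dict[str, Any] = {"at": 0.0, "result": None}

    async def wrapper() -> Dict[str, Any]:
        now = time.monotonic()
        if cache["result"] is None or now - cache["at"] > ttl_seconds:
            cache["result"] = await check()
            cache["at"] = now
        return cache["result"]

    return wrapper
```

Usage would look like `check_database = cached_check(check_database)`. Keep the TTL shorter than the probe period so every probe still sees reasonably fresh state.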
Kubernetes Configuration Best Practices
Proper probe configuration is as important as the probe implementation. Aggressive timeouts cause unnecessary restarts; lenient ones delay failure detection.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 8080
          # Startup probe: for slow-initializing applications
          startupProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30  # 5 + (5 * 30) = up to 155 seconds to start
            timeoutSeconds: 3
          # Liveness probe: detect deadlocks and unrecoverable states
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0  # Startup probe handles initial delay
            periodSeconds: 10
            failureThreshold: 3  # Restart after 30 seconds of failures
            timeoutSeconds: 5
          # Readiness probe: control traffic routing
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 2  # Remove from LB after 10 seconds
            successThreshold: 1
            timeoutSeconds: 3
```
Key timing considerations:
- Startup probes run before liveness and readiness probes, giving your application time to initialize without risking premature restarts
- failureThreshold × periodSeconds determines how long before action is taken—make this longer than your longest expected transient failure
- timeoutSeconds should exceed your probe endpoint’s expected response time under load, with margin for garbage collection pauses
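The arithmetic behind these bullets is worth making explicit. A small helper applying the failureThreshold × periodSeconds rule of thumb to the manifest above:

```python
def detection_window_seconds(period_seconds: float, failure_threshold: int) -> float:
    """Approximate time from first probe failure until the orchestrator
    acts, using the failureThreshold x periodSeconds rule of thumb."""
    return failure_threshold * period_seconds

# Values from the manifest above
liveness_window = detection_window_seconds(10, 3)     # restart after ~30s
readiness_window = detection_window_seconds(5, 2)     # out of the LB after ~10s
startup_budget = 5 + detection_window_seconds(5, 30)  # up to 155s to start
```

If your slowest expected transient failure (say, a 20-second database failover) exceeds the liveness window, widen the threshold or period before the probe turns failovers into restarts.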
Anti-Patterns and Troubleshooting
Anti-pattern: Checking downstream services in liveness probes. When your liveness probe verifies database connectivity and the database fails, Kubernetes restarts all your pods simultaneously. The database recovers, but now your entire service is restarting. Use readiness probes for dependency checks.
Anti-pattern: Missing startup probes. Applications with variable startup times—especially JVM-based services or those loading large ML models—need startup probes. Without them, you’re forced to set high initialDelaySeconds on liveness probes, delaying detection of actual failures.
Anti-pattern: Overly complex health checks. A health endpoint that runs database migrations, validates configuration, and checks every downstream service will be slow and fragile. Keep probes focused on their specific purpose.
Debugging probe failures:
```bash
# Check probe configuration
kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness\|Startup"

# View probe failure events
kubectl get events --field-selector reason=Unhealthy

# Test probe endpoint directly
kubectl exec -it <pod-name> -- curl -v localhost:8080/health/ready
```
When probes fail intermittently, check for resource contention. A container hitting CPU limits might not respond to probes in time. Monitor probe latency percentiles, not just success rates.
Health checks seem simple, but they’re foundational to reliable service operation. Get them right, and your system recovers from failures automatically. Get them wrong, and your health checks become the source of outages rather than the solution.