System Design: Load Balancing Strategies and Algorithms

Key Insights

  • Load balancing algorithm choice depends on your workload characteristics—stateless services benefit from round robin simplicity, while stateful applications need session-aware strategies like IP hash or least connections.
  • Layer 4 load balancing offers better performance with lower latency, but Layer 7 provides content-aware routing essential for modern microservices architectures.
  • Health checks are only as good as their configuration—aggressive intervals catch failures faster but increase overhead, while lenient thresholds risk routing traffic to degraded instances.

Introduction to Load Balancing

Load balancing distributes incoming network traffic across multiple backend servers to ensure no single server bears too much demand. In distributed systems, it’s the traffic cop that keeps your application responsive under varying loads.

The benefits are straightforward: availability improves because traffic routes around failed servers, scalability becomes achievable by adding more servers behind the balancer, and fault tolerance emerges naturally when no single point of failure exists in your compute layer.

In a typical architecture, load balancers sit between clients and your application servers. You might have multiple tiers—a global load balancer directing traffic to regional clusters, each with their own local balancers distributing requests to application instances. Understanding where and how to deploy them determines your system’s resilience.

Load Balancing Algorithms

Choosing the right algorithm impacts both performance and reliability. Here are the four algorithms you’ll encounter most often, with implementations you can adapt.

Round Robin cycles through servers sequentially. It’s simple and works well when servers have identical capacity and requests have uniform processing costs.

Weighted Round Robin extends this by assigning weights based on server capacity. A server with weight 3 receives three times the traffic of a server with weight 1.

Least Connections routes requests to the server handling the fewest active connections. This adapts well to variable request processing times.

IP Hash generates a hash from the client’s IP address to consistently route that client to the same server. This provides session persistence without external session storage.

import hashlib
from typing import List, Optional

class Server:
    def __init__(self, name: str, weight: int = 1):
        self.name = name
        self.weight = weight
        self.connections = 0
        self.healthy = True

class LoadBalancer:
    def __init__(self, servers: List[Server]):
        self.servers = servers
        self.current_index = 0
        # State for interleaved weighted round robin; index starts at -1
        # so the first call begins a fresh pass at the max weight.
        self.weighted_index = -1
        self.weighted_current = 0
    
    def get_healthy_servers(self) -> List[Server]:
        return [s for s in self.servers if s.healthy]
    
    def round_robin(self) -> Optional[Server]:
        """Cycle through healthy servers sequentially."""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        server = healthy[self.current_index % len(healthy)]
        self.current_index += 1
        return server
    
    def weighted_round_robin(self) -> Optional[Server]:
        """Interleaved weighted round robin: a server with weight w is
        selected w times per full cycle of max-weight passes."""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        
        while True:
            self.weighted_index = (self.weighted_index + 1) % len(healthy)
            if self.weighted_index == 0:
                # Completed a pass over all servers; lower the threshold
                self.weighted_current -= 1
                if self.weighted_current <= 0:
                    self.weighted_current = max(s.weight for s in healthy)
            
            if healthy[self.weighted_index].weight >= self.weighted_current:
                return healthy[self.weighted_index]
    
    def least_connections(self) -> Optional[Server]:
        """Pick the healthy server with the fewest active connections."""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        return min(healthy, key=lambda s: s.connections)
    
    def ip_hash(self, client_ip: str) -> Optional[Server]:
        """Hash the client IP so the same client maps to the same server.
        MD5 is used for its distribution properties, not for security."""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        return healthy[hash_value % len(healthy)]


# Usage example
servers = [
    Server("server-1", weight=3),
    Server("server-2", weight=2),
    Server("server-3", weight=1),
]

lb = LoadBalancer(servers)

# Round robin distributes evenly
for _ in range(6):
    server = lb.round_robin()
    print(f"Round Robin -> {server.name}")

# IP hash ensures same client hits same server
print(f"IP Hash 192.168.1.1 -> {lb.ip_hash('192.168.1.1').name}")
print(f"IP Hash 192.168.1.1 -> {lb.ip_hash('192.168.1.1').name}")  # Same result

Layer 4 vs Layer 7 Load Balancing

Layer 4 (L4) load balancing operates at the transport layer, making routing decisions based on IP addresses and TCP/UDP ports. It doesn’t inspect packet contents, making it fast and efficient.

Layer 7 (L7) load balancing works at the application layer, understanding HTTP headers, cookies, and request paths. This enables content-based routing but adds processing overhead.

Use L4 when: You need raw performance, your backend servers are homogeneous, or you’re load balancing non-HTTP protocols like database connections or game servers.

Use L7 when: You need path-based routing (/api to API servers, /static to CDN), header inspection for A/B testing, SSL termination, or request manipulation.

# Layer 4 Load Balancing (TCP/UDP passthrough)
# /etc/nginx/nginx.conf

stream {
    upstream backend_tcp {
        least_conn;
        server 10.0.1.10:3306 weight=5;
        server 10.0.1.11:3306 weight=3;
        server 10.0.1.12:3306 backup;
    }

    server {
        listen 3306;
        proxy_pass backend_tcp;
        proxy_connect_timeout 1s;
        proxy_timeout 3s;
    }
}

# Layer 7 Load Balancing (HTTP-aware)
# /etc/nginx/conf.d/app.conf

upstream api_servers {
    least_conn;
    server 10.0.2.10:8080;
    server 10.0.2.11:8080;
    keepalive 32;
}

upstream static_servers {
    server 10.0.3.10:80;
    server 10.0.3.11:80;
}

server {
    listen 80;
    server_name example.com;

    # Path-based routing
    location /api/ {
        proxy_pass http://api_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /static/ {
        proxy_pass http://static_servers;
        proxy_cache_valid 200 1d;
    }

    # Header-based routing for canary deployments
    # (requires a matching "upstream canary_servers { ... }" block)
    location / {
        set $backend "api_servers";
        if ($http_x_canary = "true") {
            set $backend "canary_servers";
        }
        proxy_pass http://$backend;
    }
}

Health Checks and Failover

Health checks determine whether a server can receive traffic. Active checks probe servers at regular intervals. Passive checks monitor actual request outcomes and mark servers unhealthy after repeated failures.

Configure check intervals based on your tolerance for routing to unhealthy servers. A 5-second interval means up to 5 seconds of potential bad traffic after a failure. A 1-second interval catches failures faster but generates more health check traffic.

Set failure thresholds to avoid flapping. Requiring 3 consecutive failures before marking a server unhealthy prevents temporary network blips from triggering unnecessary failovers.
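That threshold logic can be sketched as a small tracker. This is a minimal illustration, not any particular load balancer's implementation; the class name and default thresholds are chosen for the example:

```python
class PassiveHealthTracker:
    """Marks a server unhealthy after N consecutive failures and healthy
    again after M consecutive successes, preventing flapping from blips."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.healthy = True

    def record(self, success: bool) -> bool:
        """Record one request outcome; return the current health state."""
        if success:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            if not self.healthy and self.consecutive_successes >= self.recover_threshold:
                self.healthy = True
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            if self.healthy and self.consecutive_failures >= self.fail_threshold:
                self.healthy = False
        return self.healthy

tracker = PassiveHealthTracker()
tracker.record(False)         # one blip: still healthy
tracker.record(False)         # two blips: still healthy
print(tracker.record(False))  # third consecutive failure -> False
```

Requiring consecutive successes before reinstating a server mirrors the `healthy_threshold` / `unhealthy_threshold` pair you'll see in most load balancer configurations.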

from datetime import datetime

import psycopg2
import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

class HealthStatus:
    def __init__(self):
        self.db_healthy = True
        self.cache_healthy = True
        self.last_check = None
        self.ready = False

health = HealthStatus()

async def check_database() -> bool:
    # Note: psycopg2 is blocking; in production run this in a thread
    # pool (e.g. asyncio.to_thread) to avoid stalling the event loop.
    try:
        conn = psycopg2.connect(
            "postgresql://user:pass@localhost/db",
            connect_timeout=3
        )
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        conn.close()
        return True
    except Exception:
        return False

async def check_cache() -> bool:
    # redis-py is also blocking; same caveat as check_database.
    try:
        r = redis.Redis(host='localhost', socket_timeout=2)
        return r.ping()
    except Exception:
        return False

@app.get("/health/live")
async def liveness():
    """Basic liveness - is the process running?"""
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness():
    """Readiness - can we serve traffic?"""
    health.db_healthy = await check_database()
    health.cache_healthy = await check_cache()
    health.last_check = datetime.utcnow()
    
    is_ready = health.db_healthy and health.cache_healthy
    
    response_data = {
        "ready": is_ready,
        "checks": {
            "database": "healthy" if health.db_healthy else "unhealthy",
            "cache": "healthy" if health.cache_healthy else "unhealthy",
        },
        "timestamp": health.last_check.isoformat()
    }
    
    if not is_ready:
        # Return proper JSON with a 503 so the load balancer pulls
        # this instance out of rotation.
        return JSONResponse(content=response_data, status_code=503)
    
    return response_data

@app.get("/health/detailed")
async def detailed_health():
    """Detailed health for monitoring systems - not for load balancers"""
    return {
        "version": "1.2.3",
        "uptime_seconds": 3600,
        "connections": {
            "database_pool_size": 10,
            "database_pool_available": 7,
            "cache_connections": 5
        },
        "memory_mb": 256,
        "last_health_check": health.last_check.isoformat() if health.last_check else None
    }

Scaling Patterns

Horizontal scaling adds more servers behind the load balancer. It’s the preferred approach for stateless services because it provides linear capacity growth and natural fault tolerance.

Vertical scaling increases individual server resources. Use it when your application can’t easily parallelize or when horizontal scaling introduces coordination overhead.

Global load balancing distributes traffic across geographic regions. DNS-based approaches like GeoDNS route users to the nearest datacenter, reducing latency and providing disaster recovery.

# terraform/alb-autoscaling.tf

resource "aws_lb" "app" {
  name               = "app-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids

  enable_deletion_protection = true
}

resource "aws_lb_target_group" "app" {
  name     = "app-targets"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    path                = "/health/ready"
    matcher             = "200"
  }

  deregistration_delay = 30  # Connection draining

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600
    enabled         = false  # Enable only if needed
  }
}

resource "aws_lb_listener" "app" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  desired_capacity    = 3
  max_size            = 10
  min_size            = 2
  target_group_arns   = [aws_lb_target_group.app.arn]
  vpc_zone_identifier = var.private_subnet_ids
  health_check_type   = "ELB"
  
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 75
    }
  }
}

resource "aws_autoscaling_policy" "scale_on_cpu" {
  name                   = "scale-on-cpu"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

Common Pitfalls and Best Practices

Session affinity challenges: Sticky sessions seem convenient but create uneven load distribution and complicate scaling. Store session data externally in Redis or a database instead.

Thundering herd problem: When a server recovers from failure, the load balancer may immediately route a flood of queued requests to it. Implement gradual warm-up by slowly increasing the server’s weight over time.
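That warm-up can be implemented as a linear weight ramp after recovery. A minimal sketch, assuming a 30-second window; the class and method names are illustrative, not from any particular load balancer:

```python
import time

class SlowStartWeight:
    """Linearly ramps a recovered server's effective weight from 0 to its
    full weight over a warm-up window, so it isn't flooded with queued
    requests the moment it comes back."""

    def __init__(self, full_weight: int, warmup_seconds: float = 30.0):
        self.full_weight = full_weight
        self.warmup_seconds = warmup_seconds
        self.recovered_at = None  # None means fully warmed up

    def mark_recovered(self, now: float = None) -> None:
        self.recovered_at = time.time() if now is None else now

    def effective_weight(self, now: float = None) -> float:
        if self.recovered_at is None:
            return self.full_weight
        now = time.time() if now is None else now
        elapsed = now - self.recovered_at
        if elapsed >= self.warmup_seconds:
            self.recovered_at = None  # warm-up complete
            return self.full_weight
        return self.full_weight * (elapsed / self.warmup_seconds)

w = SlowStartWeight(full_weight=10, warmup_seconds=30)
w.mark_recovered(now=0.0)
print(w.effective_weight(now=15.0))  # halfway through warm-up -> 5.0
```

Commercial balancers expose the same idea directly, for example nginx's `slow_start` parameter on upstream servers.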

Connection draining: During deployments, configure a deregistration delay so in-flight requests complete before the server terminates. 30 seconds handles most web requests; adjust based on your longest expected request duration.

Monitoring essentials: Track these metrics at minimum: requests per second per backend, error rates per backend, latency percentiles (p50, p95, p99), and active connection counts. Alert on sudden changes, not just threshold breaches.
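As a concrete example of the percentiles mentioned above, here is a minimal nearest-rank computation over a window of latency samples. This is a sketch; production systems typically use streaming estimators (histograms, t-digests) rather than sorting a full window:

```python
import math
from typing import List

def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile of latency samples (p in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Simulated per-request latencies in milliseconds
latencies = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
print(percentile(latencies, 50))  # p50 -> 13
print(percentile(latencies, 95))  # p95 -> 250 (dominated by the outlier)
```

The example shows why averages mislead: the mean here is ~37 ms, yet half of requests finish in 13 ms while the slowest takes 250 ms. Tail percentiles surface the outliers that averages hide.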

Conclusion

Match your load balancing strategy to your workload. For quick decisions:

Scenario              Algorithm          Layer  Key Config
Stateless API         Round Robin        L7     Path-based routing
Varied request times  Least Connections  L7     Connection limits
Session-dependent     IP Hash            L7     Fallback strategy
Database proxying     Least Connections  L4     TCP health checks
High throughput       Round Robin        L4     Minimal processing

Start simple with round robin and L7 load balancing. Add complexity only when metrics show you need it. The best load balancing strategy is the one you understand well enough to debug at 3 AM.
