System Design: Rate Limiting Algorithms (Token Bucket, Leaky Bucket)

Key Insights

  • Token bucket excels at handling burst traffic while maintaining average rate limits, making it ideal for APIs where occasional spikes are acceptable
  • Leaky bucket provides predictable, smooth output rates but sacrifices flexibility—choose it when downstream systems require consistent throughput
  • Distributed rate limiting requires atomic operations and careful consideration of consistency vs. availability trade-offs; Redis with Lua scripts is the pragmatic choice for most production systems

Why Rate Limiting Matters

Rate limiting is your first line of defense against both malicious actors and well-intentioned clients that accidentally hammer your API. Without it, a single misbehaving client can degrade service for everyone, exhaust your compute budget, or bring down your entire system.

Every major API provider implements rate limiting: Stripe limits requests per second, GitHub throttles API calls per hour, and AWS services have request quotas baked into their design. These aren’t arbitrary restrictions—they’re architectural necessities.

Rate limiting serves three critical functions: protecting backend services from overload, ensuring fair resource allocation among clients, and providing predictable costs for infrastructure planning. The algorithm you choose determines how strictly you enforce these limits and how your system behaves under pressure.

Token Bucket Algorithm

The token bucket algorithm works like a bucket that fills with tokens at a constant rate. Each request consumes one token. If tokens are available, the request proceeds immediately. If the bucket is empty, the request is rejected or queued.

The key insight: tokens accumulate when traffic is low, allowing bursts when traffic spikes. A bucket with capacity 100 and refill rate of 10 tokens/second can handle 100 simultaneous requests, then sustains 10 requests/second afterward.

This makes token bucket ideal for APIs where legitimate usage patterns include occasional bursts—a mobile app syncing after coming online, or a batch job that runs periodically.

import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize token bucket.
        
        Args:
            capacity: Maximum tokens the bucket can hold
            refill_rate: Tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = Lock()
    
    def _refill(self):
        """Add tokens based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
    
    def acquire(self, tokens: int = 1) -> bool:
        """
        Attempt to acquire tokens.
        
        Returns True if tokens were acquired, False otherwise.
        """
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False
    
    def wait_time(self, tokens: int = 1) -> float:
        """Calculate seconds until requested tokens are available."""
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                return 0.0
            tokens_needed = tokens - self.tokens
            return tokens_needed / self.refill_rate


# Usage
limiter = TokenBucket(capacity=100, refill_rate=10)

def handle_request(request):
    if limiter.acquire():
        return process_request(request)
    else:
        return {"error": "Rate limit exceeded"}, 429

The implementation is straightforward: track the token count and the last refill time, then compute accumulated tokens lazily on each request. Thread safety matters: the lock prevents race conditions when multiple threads check and consume tokens simultaneously.
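To see why the lock matters, here's a quick concurrency check. It re-declares a minimal version of the bucket so the snippet runs standalone, and uses a refill rate of zero so the expected count is deterministic:

```python
import time
import threading

class TokenBucket:
    """Minimal re-declaration of the TokenBucket above, for a standalone test."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, tokens=1):
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

# 10 threads race for 100 tokens (no refill). With the lock, exactly 100 succeed.
limiter = TokenBucket(capacity=100, refill_rate=0)
accepted = []

def worker():
    for _ in range(50):
        if limiter.acquire():
            accepted.append(1)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(accepted))  # 100: never more, because check-and-decrement is atomic
```

Without the lock, two threads could both observe `tokens >= 1` before either decrements, letting more than 100 requests through.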

Leaky Bucket Algorithm

The leaky bucket algorithm inverts the model: requests enter a queue (the bucket), and a separate process drains requests at a constant rate (the leak). If the queue fills up, new requests are rejected.

The result is perfectly smooth output regardless of input burstiness. This matters when your downstream systems can’t handle variable load—database connections, legacy services, or third-party APIs with their own strict limits.

import time
from threading import Lock, Thread
from collections import deque
from typing import Callable, Any

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        """
        Initialize leaky bucket.
        
        Args:
            capacity: Maximum requests that can queue
            leak_rate: Requests processed per second
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.lock = Lock()
        self.processing = False
    
    def submit(self, request: Any) -> bool:
        """
        Submit a request to the bucket.
        
        Returns True if queued, False if bucket is full.
        """
        with self.lock:
            if len(self.queue) >= self.capacity:
                return False
            self.queue.append(request)
            return True
    
    def start_processing(self, handler: Callable[[Any], None]):
        """Start the leak process that drains requests."""
        self.processing = True
        interval = 1.0 / self.leak_rate
        
        def drain():
            while self.processing:
                request = None
                with self.lock:
                    if self.queue:
                        request = self.queue.popleft()
                
                if request is not None:
                    handler(request)
                
                time.sleep(interval)
        
        Thread(target=drain, daemon=True).start()
    
    def stop_processing(self):
        """Stop the drain process."""
        self.processing = False
    
    def queue_depth(self) -> int:
        """Current number of queued requests."""
        with self.lock:
            return len(self.queue)


# Usage
bucket = LeakyBucket(capacity=50, leak_rate=10)

def process_request(request):
    print(f"Processing: {request}")

bucket.start_processing(process_request)

# Requests are queued and drained at a steady 10/second
for i in range(100):
    if not bucket.submit(f"request_{i}"):
        print(f"Dropped request_{i} - bucket full")

Notice the architectural difference: token bucket is synchronous (immediate accept/reject), while leaky bucket introduces a queue and background processor. This adds latency but guarantees output rate consistency.

Head-to-Head Comparison

Aspect            Token Bucket                   Leaky Bucket
Burst handling    Allows bursts up to capacity   No bursts; constant output
Latency           Immediate response             Queue introduces delay
Memory            O(1): just counters            O(n): stores queued requests
Implementation    Simpler                        More complex (needs drain process)
Best for          APIs, user-facing services     Background jobs, downstream protection

Here’s a benchmark comparing behavior under burst traffic:

import time
import statistics

def benchmark_algorithm(limiter, acquire_fn, requests: int, name: str):
    """Benchmark a rate limiter under burst conditions."""
    results = {"accepted": 0, "rejected": 0, "latencies": []}
    
    for _ in range(requests):
        start = time.monotonic()
        accepted = acquire_fn(limiter)
        latency = time.monotonic() - start
        
        if accepted:
            results["accepted"] += 1
        else:
            results["rejected"] += 1
        results["latencies"].append(latency)
    
    print(f"\n{name} Results:")
    print(f"  Accepted: {results['accepted']}/{requests}")
    print(f"  Rejected: {results['rejected']}/{requests}")
    print(f"  Avg latency: {statistics.mean(results['latencies'])*1000:.3f}ms")
    return results


# Compare under 200 simultaneous requests
token_bucket = TokenBucket(capacity=100, refill_rate=10)
leaky_bucket = LeakyBucket(capacity=100, leak_rate=10)

benchmark_algorithm(
    token_bucket, 
    lambda l: l.acquire(), 
    200, 
    "Token Bucket"
)

# For leaky bucket, we measure queue acceptance
benchmark_algorithm(
    leaky_bucket,
    lambda l: l.submit("request"),
    200,
    "Leaky Bucket"
)

The token bucket accepts roughly the first 100 requests immediately (a few extra tokens refill while the loop runs) and rejects the rest. The leaky bucket queues 100 requests but processes them over 10 seconds. Choose based on whether your clients can tolerate queuing delays.

Distributed Rate Limiting Considerations

Single-server rate limiting breaks down when you scale horizontally. If you have 10 servers each allowing 100 requests/second, a client could potentially make 1000 requests/second by hitting different servers.
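A tiny simulation makes the multiplication concrete. Each server naively gets the full global limit, and a client that round-robins across them blows straight past it:

```python
class LocalBucket:
    """A per-server bucket with no refill, just to count acceptances in one window."""
    def __init__(self, capacity):
        self.tokens = capacity

    def acquire(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

GLOBAL_LIMIT = 100  # the limit we intend to enforce per client, cluster-wide

# Naive setup: 10 servers, each independently allowing the full global limit
servers = [LocalBucket(GLOBAL_LIMIT) for _ in range(10)]

# A single client spreads 2000 requests round-robin across the servers
accepted = sum(1 for i in range(2000) if servers[i % 10].acquire())
print(accepted)  # 1000: 10x the intended limit
```

Splitting the limit evenly (capacity / server count per node) fixes the overshoot but under-serves clients whose traffic happens to concentrate on a few servers, which is why shared state usually wins.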

Redis solves this with atomic operations. A Lua script ensures check-and-update happens atomically, preventing race conditions between servers:

-- token_bucket.lua
-- KEYS[1]: bucket key
-- ARGV[1]: capacity
-- ARGV[2]: refill rate (tokens per second)
-- ARGV[3]: current timestamp (milliseconds)
-- ARGV[4]: tokens requested

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Calculate token refill
local elapsed = (now - last_refill) / 1000.0
local tokens_to_add = elapsed * refill_rate
tokens = math.min(capacity, tokens + tokens_to_add)

-- Try to acquire tokens
local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

-- Save state
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)  -- TTL for cleanup

return allowed
A Python wrapper registers this script with Redis and invokes it for each request:

import redis
import time

class DistributedTokenBucket:
    def __init__(self, redis_client: redis.Redis, key: str, 
                 capacity: int, refill_rate: float):
        self.redis = redis_client
        self.key = key
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._load_script()
    
    def _load_script(self):
        self.script = self.redis.register_script("""
            -- Lua script from above
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])
            
            local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
            local tokens = tonumber(bucket[1]) or capacity
            local last_refill = tonumber(bucket[2]) or now
            
            local elapsed = (now - last_refill) / 1000.0
            tokens = math.min(capacity, tokens + elapsed * refill_rate)
            
            local allowed = 0
            if tokens >= requested then
                tokens = tokens - requested
                allowed = 1
            end
            
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return allowed
        """)
    
    def acquire(self, tokens: int = 1) -> bool:
        now = int(time.time() * 1000)
        result = self.script(
            keys=[self.key],
            args=[self.capacity, self.refill_rate, now, tokens]
        )
        return bool(result)

The trade-off: Redis adds network latency to every request. For extremely high-throughput systems, consider local rate limiting with periodic synchronization, accepting some over-limit requests in exchange for lower latency.
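The hybrid approach can be sketched roughly as follows. This is a simplified illustration, not a production design: the class name, the even split of the global limit, and the `_sync` hook are all hypothetical, and a real implementation would reconcile against an aggregate counter (e.g. in Redis) inside `_sync`:

```python
import time
from threading import Lock

class LocalShardLimiter:
    """Hypothetical sketch: each node enforces its share of a global limit locally,
    reconciling with shared state only every sync_interval seconds."""
    def __init__(self, global_capacity, global_rate, nodes, sync_interval=1.0):
        self.capacity = global_capacity / nodes  # this node's share of the limit
        self.rate = global_rate / nodes
        self.tokens = self.capacity
        self.last = time.monotonic()
        self.sync_interval = sync_interval
        self.last_sync = time.monotonic()
        self.lock = Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if now - self.last_sync >= self.sync_interval:
                self._sync()  # placeholder: fold in cluster-wide usage
                self.last_sync = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def _sync(self):
        # In a real system this would read an aggregate counter and adjust
        # self.capacity based on how busy the other nodes are. Between syncs,
        # the client may exceed the global limit by up to (nodes - 1) shares.
        pass

limiter = LocalShardLimiter(global_capacity=100, global_rate=10, nodes=10)
print(limiter.acquire())  # True: no network round trip on the hot path
```

The hot path never touches the network; the price is a bounded window of over-admission between syncs.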

Production Implementation Patterns

In production, rate limiting belongs in middleware where it can protect your entire application consistently:

// rateLimit.js - Express middleware with pluggable strategies
const Redis = require('ioredis');

class RateLimitMiddleware {
  constructor(options) {
    this.redis = new Redis(options.redisUrl);
    this.capacity = options.capacity || 100;
    this.refillRate = options.refillRate || 10;
    this.keyPrefix = options.keyPrefix || 'ratelimit';
    this.keyExtractor = options.keyExtractor || (req => req.ip);
  }

  middleware() {
    return async (req, res, next) => {
      const key = `${this.keyPrefix}:${this.keyExtractor(req)}`;
      
      try {
        const allowed = await this.checkLimit(key);
        
        // Always set rate limit headers
        const state = await this.getState(key);
        res.set({
          'X-RateLimit-Limit': this.capacity,
          'X-RateLimit-Remaining': Math.floor(state.tokens),
          'X-RateLimit-Reset': Math.ceil(Date.now() / 1000) + 
            Math.ceil((this.capacity - state.tokens) / this.refillRate)
        });

        if (!allowed) {
          return res.status(429).json({
            error: 'Rate limit exceeded',
            retryAfter: Math.ceil(1 / this.refillRate)
          });
        }

        next();
      } catch (error) {
        // Fail open - don't block requests if Redis is down
        console.error('Rate limit check failed:', error);
        next();
      }
    };
  }

  async checkLimit(key) {
    // Token bucket implementation via Lua script
    const script = `
      local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or ARGV[1])
      local last = tonumber(redis.call('HGET', KEYS[1], 'last') or ARGV[3])
      local elapsed = (tonumber(ARGV[3]) - last) / 1000
      tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
      
      if tokens >= 1 then
        tokens = tokens - 1
        redis.call('HMSET', KEYS[1], 'tokens', tokens, 'last', ARGV[3])
        redis.call('EXPIRE', KEYS[1], 3600)
        return 1
      end
      return 0
    `;
    
    const result = await this.redis.eval(
      script, 1, key,
      this.capacity, this.refillRate, Date.now()
    );
    return result === 1;
  }

  async getState(key) {
    const [tokens, last] = await this.redis.hmget(key, 'tokens', 'last');
    return { 
      tokens: parseFloat(tokens) || this.capacity,
      lastRefill: parseInt(last) || Date.now()
    };
  }
}

// Usage
const limiter = new RateLimitMiddleware({
  redisUrl: process.env.REDIS_URL,
  capacity: 100,
  refillRate: 10,
  keyExtractor: (req) => req.headers['x-api-key'] || req.ip
});

app.use('/api', limiter.middleware());

Key production considerations: always fail open (don’t block legitimate traffic if your rate limiter fails), include rate limit headers so clients can self-regulate, and use appropriate key extraction (API key for authenticated requests, IP for anonymous).
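The fail-open pattern translates directly to Python. This is a framework-agnostic sketch (the decorator, handler signature, and `broken_check` function are all hypothetical) showing that a limiter outage degrades to "no limiting" rather than "no service":

```python
def rate_limited(check_limit):
    """Wrap a handler so that limiter failures never block traffic."""
    def decorator(handler):
        def wrapper(request):
            try:
                allowed = check_limit(request)
            except Exception as exc:  # Redis down, timeout, etc.
                print(f"rate limit check failed, failing open: {exc}")
                allowed = True  # fail open
            if not allowed:
                return {"error": "Rate limit exceeded"}, 429
            return handler(request)
        return wrapper
    return decorator

def broken_check(request):
    # Simulate the rate limiter backend being unreachable
    raise ConnectionError("redis unreachable")

@rate_limited(broken_check)
def handler(request):
    return {"ok": True}, 200

status = handler({"path": "/api"})[1]
print(status)  # 200: the request went through despite the limiter outage
```

The inverse choice (fail closed) is sometimes right for expensive or abuse-prone endpoints; the point is to make the choice explicitly rather than inherit whatever your exception handling happens to do.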

Monitor your rate limiters. Track rejection rates per client, alert on sudden spikes (potential attacks), and review limits periodically—what made sense at launch may not fit your current traffic patterns.
