Throttling: Request Rate Control

Key Insights

  • Throttling is about protecting your system’s stability, not punishing users—choose algorithms based on whether you need to handle bursts gracefully or enforce strict steady-state limits.
  • Distributed throttling requires atomic operations; a Redis Lua script is the most reliable way to prevent race conditions across multiple service instances.
  • Always communicate limits clearly through standard HTTP headers; clients can’t respect limits they don’t know about.

Why Throttling Matters

Every production API eventually faces the same problem: too many requests, not enough capacity. Maybe it’s a legitimate traffic spike, a misbehaving client, or a deliberate attack. Without throttling, your service degrades for everyone.

Throttling is the mechanism that controls how many requests a client can make within a given time window. It’s closely related to rate limiting, and while the terms are often used interchangeably, there’s a subtle distinction: rate limiting typically refers to the policy (“100 requests per minute”), while throttling is the enforcement mechanism that rejects or delays requests exceeding that limit.

The goal isn’t to block legitimate users. It’s to ensure fair resource allocation, protect downstream dependencies, maintain predictable performance, and give your system breathing room during unexpected load.

Common Throttling Algorithms

Choosing the right algorithm depends on your tolerance for bursts, memory constraints, and accuracy requirements.

Token Bucket

The token bucket algorithm allows controlled bursts while maintaining an average rate. Tokens accumulate at a fixed rate up to a maximum capacity. Each request consumes one token. If no tokens are available, the request is rejected.

This is ideal when you want to allow occasional bursts but enforce a long-term average rate.

Leaky Bucket

Requests enter a queue and are processed at a constant rate, like water leaking from a bucket. This smooths out traffic but adds latency since requests wait in the queue rather than being immediately rejected.
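
A minimal sketch of the idea (illustrative, not a production implementation): admit requests into a bounded queue and drain them at a fixed rate. For simplicity this version rejects when the bucket is full rather than blocking the caller.

```python
import time
from collections import deque

class LeakyBucket:
    """Admit requests into a bounded queue; drain at a fixed rate."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max requests that can wait in the queue
        self.leak_rate = leak_rate  # requests processed per second
        self.queue: deque = deque()
        self.last_leak = time.monotonic()

    def _leak(self) -> None:
        # Drain as many requests as the elapsed time allows.
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def try_enqueue(self, request) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket full: reject (or have the caller wait)
```

A real implementation would also need a worker that actually executes the queued requests as they leak out; here the queue only models admission.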

Fixed Window

Count requests in fixed time intervals (e.g., per minute). Simple to implement but has a boundary problem: a client could make 100 requests at 11:59:59 and another 100 at 12:00:01, effectively doubling their rate.

Sliding Window Log

Track timestamps of all requests and count those within the sliding window. Accurate but memory-intensive—you’re storing every request timestamp.

Sliding Window Counter

A hybrid approach that interpolates between the previous and current fixed windows based on the current position in time. Good balance of accuracy and memory efficiency.

Here’s a token bucket implementation in Python:

import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: Maximum tokens the bucket can hold
        refill_rate: Tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = Lock()
    
    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
    
    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False
    
    def wait_time(self, tokens: int = 1) -> float:
        """Returns seconds to wait before tokens become available."""
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                return 0.0
            tokens_needed = tokens - self.tokens
            return tokens_needed / self.refill_rate

# Usage
bucket = TokenBucket(capacity=10, refill_rate=2)  # 10 max, 2 tokens/sec

for i in range(15):
    if bucket.consume():
        print(f"Request {i}: allowed")
    else:
        wait = bucket.wait_time()
        print(f"Request {i}: throttled, retry in {wait:.2f}s")

Implementing Client-Side Throttling

Good API clients don’t wait to be throttled—they proactively respect rate limits. This reduces wasted requests and improves overall system efficiency.

Build a request queue that enforces limits locally and handles rejections gracefully:

interface ThrottledRequest<T> {
  execute: () => Promise<T>;
  resolve: (value: T) => void;
  reject: (error: Error) => void;
  retries: number;
}

class ThrottledClient {
  private queue: ThrottledRequest<any>[] = [];
  private processing = false;
  private tokens: number;
  private lastRefill: number;
  
  constructor(
    private capacity: number = 10,
    private refillRate: number = 5, // tokens per second
    private maxRetries: number = 3,
    private baseDelay: number = 1000
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }

  async request<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ execute: fn, resolve, reject, retries: 0 });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;

    while (this.queue.length > 0) {
      this.refill();
      
      if (this.tokens < 1) {
        const waitTime = (1 - this.tokens) / this.refillRate * 1000;
        await this.sleep(waitTime);
        continue;
      }

      const request = this.queue.shift()!;
      this.tokens -= 1;

      try {
        const result = await request.execute();
        request.resolve(result);
      } catch (error: any) {
        if (error.status === 429 && request.retries < this.maxRetries) {
          const retryAfter = this.parseRetryAfter(error) || 
            this.baseDelay * Math.pow(2, request.retries);
          
          request.retries += 1;
          await this.sleep(retryAfter);
          this.queue.unshift(request); // Re-queue at front
        } else {
          request.reject(error);
        }
      }
    }

    this.processing = false;
  }

  private parseRetryAfter(error: any): number | null {
    const header = error.headers?.['retry-after'];
    if (!header) return null;
    const seconds = parseInt(header, 10);
    return isNaN(seconds) ? null : seconds * 1000;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const client = new ThrottledClient(10, 5, 3, 1000);

const results = await Promise.all(
  urls.map(url => client.request(() => fetch(url).then(r => r.json())))
);

Server-Side Throttling Strategies

On the server, throttling typically lives in middleware. You’ll need to decide on several dimensions:

  • Scope: Per-user, per-API-key, per-IP, or global
  • Granularity: Per-endpoint or service-wide
  • Tiers: Different limits for different subscription levels

Here’s a FastAPI middleware using Redis for distributed state. It uses a fixed window counter, which keeps the Redis operations cheap at the cost of the boundary problem described earlier:

import time
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
redis_client = redis.from_url("redis://localhost:6379")

RATE_LIMITS = {
    "free": {"requests": 100, "window": 60},
    "pro": {"requests": 1000, "window": 60},
    "enterprise": {"requests": 10000, "window": 60},
}

async def get_user_tier(api_key: str) -> str:
    # In production, look this up from your database
    tier = await redis_client.hget("api_keys", api_key)
    return tier.decode() if tier else "free"

@app.middleware("http")
async def throttle_middleware(request: Request, call_next):
    api_key = request.headers.get("X-API-Key", request.client.host)
    tier = await get_user_tier(api_key)
    limits = RATE_LIMITS[tier]
    
    window = limits["window"]
    max_requests = limits["requests"]
    
    now = time.time()
    window_start = int(now // window) * window
    key = f"ratelimit:{api_key}:{window_start}"
    
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.expire(key, window * 2)  # Keep for current + next window
    results = await pipe.execute()
    
    current_count = results[0]
    remaining = max(0, max_requests - current_count)
    reset_time = window_start + window
    
    if current_count > max_requests:
        return JSONResponse(
            status_code=429,
            content={
                "error": "rate_limit_exceeded",
                "message": f"Rate limit of {max_requests} requests per {window}s exceeded",
                "retry_after": int(reset_time - now)
            },
            headers={
                "X-RateLimit-Limit": str(max_requests),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(reset_time)),
                "Retry-After": str(int(reset_time - now))
            }
        )
    
    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = str(max_requests)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(int(reset_time))
    
    return response

Distributed Throttling Challenges

When your service runs across multiple instances, throttling gets complicated. Each instance needs a consistent view of request counts, but network latency and race conditions conspire against you.

The “thundering herd” problem is particularly nasty: when a rate limit window resets, all queued clients retry simultaneously, potentially overwhelming your service.
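
A common mitigation is to add random jitter to retry delays so that queued clients spread out instead of retrying in lockstep. A minimal sketch of the "full jitter" variant (function name and defaults are illustrative):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a random delay in [0, min(cap, base * 2^attempt)].

    Spreading retries uniformly over the interval prevents every waiting
    client from firing at the same instant when a window resets.
    """
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)
```

Compare this with the `ThrottledClient` above, which uses plain exponential backoff: adding jitter there would be a one-line change to the `retryAfter` calculation.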

Redis Lua scripts solve the atomicity problem by executing multiple operations as a single unit:

-- sliding_window_rate_limit.lua
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests in window
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add current request with timestamp as score
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('EXPIRE', key, window)
    return {1, limit - count - 1, window}  -- allowed, remaining, reset
else
    -- Get oldest entry to calculate the reset time
    local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
    -- WITHSCORES returns scores as strings, and Redis truncates returned
    -- floats, so convert explicitly and round up to a whole second
    local reset = math.ceil(window - (now - tonumber(oldest[2])))
    return {0, 0, reset}  -- denied, remaining, reset
end

Call it from your application:

SCRIPT = """..."""  # Lua script above
script_sha = await redis_client.script_load(SCRIPT)

result = await redis_client.evalsha(
    script_sha,
    1,  # number of keys
    f"ratelimit:{user_id}",  # key
    60,  # window in seconds
    100,  # limit
    time.time()  # current timestamp
)

allowed, remaining, reset = result

Communicating Limits to Clients

Standard HTTP headers make your rate limits discoverable:

  • X-RateLimit-Limit: Maximum requests allowed in the window
  • X-RateLimit-Remaining: Requests remaining in current window
  • X-RateLimit-Reset: Unix timestamp when the window resets
  • Retry-After: Seconds to wait before retrying (on 429 responses)

Always return these headers on every response, not just rejections. Clients need visibility into their consumption to throttle proactively.
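
On the client side, these headers can drive proactive pacing. A sketch, assuming the HTTP library exposes response headers as a dict keyed by the names above:

```python
import time

def pace_from_headers(headers: dict) -> float:
    """Return seconds to wait before the next request, based on the
    standard rate-limit headers. Returns 0 when there is headroom."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", time.time()))
    return max(0.0, reset - time.time())
```

A client that sleeps for this duration before its next call never has to see a 429 in the first place.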

Your 429 response body should include actionable information:

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute",
  "limit": 100,
  "window_seconds": 60,
  "retry_after": 23
}

Monitoring and Tuning

Throttling isn’t set-and-forget. Track these metrics:

  • Rejection rate: Percentage of requests returning 429. High rates might indicate limits are too aggressive or a client is misbehaving.
  • Headroom: How close legitimate users get to their limits. Consistently hitting 80%+ suggests you should raise limits.
  • Latency percentiles: Throttling should improve p99 latency by shedding excess load.
  • Retry storms: Spikes in traffic after limit windows reset.

Set alerts for sudden spikes in rejection rates—they often indicate either an attack or a legitimate client with a bug. Review your limits quarterly against actual traffic patterns. The goal is limits that protect your infrastructure without impacting normal usage.

Throttling done right is invisible to well-behaved clients and a firm boundary against abuse. Start conservative, monitor aggressively, and adjust based on real data.
