Design a Rate Limiter: API Throttling System

Key Insights

  • Token bucket handles bursty traffic gracefully while sliding window counters provide accurate rate limiting—choose based on whether you prioritize user experience or strict enforcement
  • Redis with Lua scripts solves distributed rate limiting atomically, but you must design for Redis failures with local fallbacks to avoid cascading outages
  • Place rate limiters at the API gateway for broad protection, but implement application-layer limits for fine-grained per-user controls that require business context

Why Rate Limiting Matters

Every production API needs rate limiting. Without it, a single misbehaving client can exhaust your database connections, a bot can scrape your entire catalog in minutes, or a DDoS attack can bankrupt you with cloud bills.

Rate limiting serves three critical purposes: protecting infrastructure from overload, ensuring fair resource allocation among users, and controlling costs in pay-per-use architectures. Whether you’re building an API gateway, securing login endpoints against brute force attacks, or managing webhook delivery, the underlying problem is the same—controlling request flow over time.

The requirements seem simple: limit requests to N per time window. But production systems demand more. You need sub-millisecond latency impact, accurate counting across distributed servers, graceful handling of edge cases, and clear communication to clients about their limits.

Rate Limiting Algorithms Overview

Four algorithms dominate production systems, each with distinct tradeoffs.

Token Bucket maintains a bucket that fills with tokens at a fixed rate. Each request consumes a token. If the bucket is empty, requests are rejected. This algorithm excels at handling burst traffic—a bucket with 100 tokens allows 100 rapid requests before throttling kicks in, then settles to the refill rate.

Leaky Bucket processes requests at a constant rate regardless of arrival pattern. Incoming requests queue up and “leak” out at a fixed rate. This smooths traffic but introduces latency and requires queue management.

Fixed Window counts requests in discrete time intervals (e.g., per minute). Simple to implement but suffers from boundary spikes—a user could make 100 requests at 11:59:59 and another 100 at 12:00:01, effectively doubling their limit.

Sliding Window eliminates boundary issues by considering a rolling time period. The counter variant approximates by weighting the previous window’s count, while the log variant tracks individual request timestamps for perfect accuracy at higher memory cost. Minimal thread-safe, single-process implementations of the token bucket and the sliding window counter:

import time
from threading import Lock
from collections import deque

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()
        self.lock = Lock()
    
    def allow_request(self, tokens: int = 1) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.refill_rate
            )
            self.last_refill = now
            
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False


class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window_start = self._get_window_start()
        self.lock = Lock()
    
    def _get_window_start(self) -> int:
        return int(time.time()) // self.window_seconds * self.window_seconds
    
    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            window_start = self._get_window_start()
            
            # Roll over to new window if needed
            if window_start != self.curr_window_start:
                if window_start - self.curr_window_start >= 2 * self.window_seconds:
                    # More than one full window elapsed: the previous window saw no requests
                    self.prev_count = 0
                else:
                    self.prev_count = self.curr_count
                self.curr_count = 0
                self.curr_window_start = window_start
            
            # Weight previous window by remaining time fraction
            window_progress = (now - window_start) / self.window_seconds
            weighted_count = (
                self.prev_count * (1 - window_progress) + self.curr_count
            )
            
            if weighted_count < self.limit:
                self.curr_count += 1
                return True
            return False
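The log variant mentioned above can be sketched as a deque of raw timestamps. This class is illustrative (not part of the code above) and trades O(limit) memory per client for exact counts:

```python
import time
from collections import deque
from threading import Lock

class SlidingWindowLog:
    """Tracks individual request timestamps for exact rolling-window counts."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps: deque = deque()
        self.lock = Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Evict timestamps that fell out of the rolling window
            while self.timestamps and now - self.timestamps[0] > self.window_seconds:
                self.timestamps.popleft()
            if len(self.timestamps) < self.limit:
                self.timestamps.append(now)
                return True
            return False
```

The eviction loop keeps memory bounded by the limit itself, which is the cost of exact accuracy the counter variant avoids.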

Single-Server Implementation

For single-server deployments, in-memory rate limiting is straightforward and fast. The key challenges are thread safety and memory management.

Store rate limit state in hash maps keyed by client identifier (user ID, API key, or IP address). Use locks or atomic operations to prevent race conditions when multiple threads check and update counts simultaneously.

from functools import wraps
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import time
from threading import Lock
from typing import Dict, Tuple

app = FastAPI()

class RateLimiter:
    def __init__(self):
        self.clients: Dict[str, Tuple[float, int]] = {}  # key -> (window_start, count)
        self.lock = Lock()
        self.window_seconds = 60
        self.limit = 100
    
    def is_allowed(self, client_id: str) -> Tuple[bool, int, int]:
        """Returns (allowed, remaining, reset_time)"""
        with self.lock:
            now = time.time()
            window_start = int(now) // self.window_seconds * self.window_seconds
            reset_time = window_start + self.window_seconds
            
            if client_id not in self.clients:
                self.clients[client_id] = (window_start, 1)
                return True, self.limit - 1, reset_time
            
            stored_window, count = self.clients[client_id]
            
            if stored_window != window_start:
                self.clients[client_id] = (window_start, 1)
                return True, self.limit - 1, reset_time
            
            if count >= self.limit:
                return False, 0, reset_time
            
            self.clients[client_id] = (window_start, count + 1)
            return True, self.limit - count - 1, reset_time

limiter = RateLimiter()

def rate_limit(func):
    @wraps(func)
    async def wrapper(request: Request, *args, **kwargs):
        client_id = request.headers.get("X-API-Key", request.client.host)
        allowed, remaining, reset_time = limiter.is_allowed(client_id)
        
        if not allowed:
            return JSONResponse(
                status_code=429,
                content={"error": "Rate limit exceeded"},
                headers={
                    "X-RateLimit-Limit": str(limiter.limit),
                    "X-RateLimit-Remaining": "0",
                    "X-RateLimit-Reset": str(reset_time),
                    "Retry-After": str(int(reset_time - time.time()))
                }
            )
        
        # Attach limit headers to successful responses too;
        # assumes the route returns JSON-serializable content
        result = await func(request, *args, **kwargs)
        return JSONResponse(
            content=result,
            headers={
                "X-RateLimit-Limit": str(limiter.limit),
                "X-RateLimit-Remaining": str(remaining),
                "X-RateLimit-Reset": str(reset_time)
            }
        )
    return wrapper

@app.get("/api/resource")
@rate_limit
async def get_resource(request: Request):
    return {"data": "your resource"}

Distributed Rate Limiting

Single-server rate limiting breaks down when you scale horizontally. A user hitting different servers could exceed their limit by a factor of your server count.

The fundamental challenge is consistency versus availability. Strongly consistent distributed counting adds latency and creates a single point of failure. Eventually consistent approaches may temporarily allow limit violations.

Redis provides an excellent middle ground. It’s fast, supports atomic operations, and handles the distributed coordination for you. The critical insight is using Lua scripts—Redis executes them atomically, eliminating race conditions.

-- sliding_window.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = current timestamp (milliseconds)
-- ARGV[2] = window size (milliseconds)
-- ARGV[3] = max requests

local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Remove expired entries
local window_start = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count current requests
local current = redis.call('ZCARD', key)

if current < limit then
    -- Add new request with timestamp as score
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('PEXPIRE', key, window)
    return {1, limit - current - 1}  -- allowed, remaining
else
    return {0, 0}  -- denied, remaining
end

Registering the same script from Python with redis-py:

import redis
import time
from typing import Tuple

class DistributedRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local now = tonumber(ARGV[1])
            local window = tonumber(ARGV[2])
            local limit = tonumber(ARGV[3])
            
            redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)
            local current = redis.call('ZCARD', key)
            
            if current < limit then
                redis.call('ZADD', key, now, now .. '-' .. math.random())
                redis.call('PEXPIRE', key, window)
                return {1, limit - current - 1}
            end
            return {0, 0}
        """)
    
    def is_allowed(
        self, 
        key: str, 
        limit: int, 
        window_ms: int
    ) -> Tuple[bool, int]:
        try:
            now_ms = int(time.time() * 1000)
            result = self.lua_script(
                keys=[f"ratelimit:{key}"],
                args=[now_ms, window_ms, limit]
            )
            return bool(result[0]), int(result[1])
        except redis.RedisError:
            # Fail open or closed based on your requirements
            return True, limit  # Fail open

Always plan for Redis failures. Implement a local fallback rate limiter or decide whether to fail open (allow requests) or fail closed (deny requests) based on your security requirements.
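One way to sketch that fallback (our own illustrative wrapper; it assumes a primary limiter whose `is_allowed` raises on backend failure, so you would remove or re-raise the `except redis.RedisError` clause in the class above if routing failures through a wrapper like this):

```python
import time
from threading import Lock

class LocalFallbackLimiter:
    """Wraps a distributed limiter; on backend errors, falls back to a coarse
    in-process fixed-window count so a Redis outage cannot cascade into an
    API outage."""

    def __init__(self, primary, limit: int, window_seconds: int):
        self.primary = primary          # e.g. the DistributedRateLimiter above
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts: dict = {}          # key -> (window_start, count)
        self.lock = Lock()

    def is_allowed(self, key: str) -> bool:
        try:
            allowed, _ = self.primary.is_allowed(
                key, self.limit, self.window_seconds * 1000
            )
            return allowed
        except Exception:
            # Backend unreachable: enforce a local approximation instead
            return self._local_allow(key)

    def _local_allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            window_start = int(now) // self.window_seconds * self.window_seconds
            stored, count = self.counts.get(key, (window_start, 0))
            if stored != window_start:
                stored, count = window_start, 0
            if count >= self.limit:
                return False
            self.counts[key] = (stored, count + 1)
            return True
```

The local path is deliberately a fixed window: each server enforces the limit independently during the outage, so totals can exceed the global limit by the server count, which is usually an acceptable degradation compared to rejecting all traffic.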

System Design and Architecture

Rate limiters can live at multiple layers. API gateways (Kong, AWS API Gateway) handle broad protection against abuse. Application middleware enables business-logic-aware limits. Database-layer throttling prevents connection exhaustion.

For most systems, implement rate limiting at two levels: coarse limits at the gateway and fine-grained limits in application middleware where you have user context.

Store rate limit rules in a database for dynamic configuration:

CREATE TABLE rate_limit_rules (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    key_pattern VARCHAR(255) NOT NULL,  -- e.g., "user:{user_id}:endpoint:{path}"
    requests_limit INT NOT NULL,
    window_seconds INT NOT NULL,
    tier VARCHAR(50),  -- 'free', 'pro', 'enterprise'
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE rate_limit_overrides (
    id SERIAL PRIMARY KEY,
    entity_type VARCHAR(50) NOT NULL,  -- 'user', 'organization', 'ip'
    entity_id VARCHAR(255) NOT NULL,
    requests_limit INT NOT NULL,
    window_seconds INT NOT NULL,
    expires_at TIMESTAMP,
    UNIQUE(entity_type, entity_id)
);
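Resolution order matters when both tables match: the per-entity override should win over the tier default. A sketch over plain dicts (our own helper names; the tables above would back these lookups in practice), ignoring `expires_at` for brevity:

```python
from typing import Dict, Optional, Tuple

def effective_limit(
    tier_rules: Dict[str, Tuple[int, int]],                 # tier -> (limit, window_seconds)
    overrides: Dict[Tuple[str, str], Tuple[int, int]],      # (entity_type, entity_id) -> (limit, window_seconds)
    entity_type: str,
    entity_id: str,
    tier: str,
) -> Optional[Tuple[int, int]]:
    """Per-entity override beats the tier default; None means no rule applies."""
    override = overrides.get((entity_type, entity_id))
    if override is not None:
        return override
    return tier_rules.get(tier)
```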

Always return standard rate limit headers so clients can self-regulate:

  • X-RateLimit-Limit: Maximum requests allowed
  • X-RateLimit-Remaining: Requests remaining in current window
  • X-RateLimit-Reset: Unix timestamp when the window resets
  • Retry-After: Seconds until the client should retry (on 429 responses)
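A small helper (ours, illustrative) that codifies those semantics in one place:

```python
from typing import Dict, Optional

def rate_limit_headers(
    limit: int,
    remaining: int,
    reset_time: int,
    retry_after: Optional[int] = None,
) -> Dict[str, str]:
    """Build the standard rate limit headers listed above."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_time),
    }
    if retry_after is not None:
        # Only attach Retry-After on 429 responses
        headers["Retry-After"] = str(retry_after)
    return headers
```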

Advanced Considerations

Production systems need multi-tier limits. A user might have 1000 requests per hour globally, but only 10 login attempts per minute and 100 requests per minute to expensive endpoints.

from dataclasses import dataclass
from typing import Callable, List, Optional
from fastapi import Request
from fastapi.responses import JSONResponse

@dataclass
class RateLimitRule:
    name: str
    limit: int
    window_seconds: int
    key_func: Callable[[Request], str]

class MultiTierRateLimiter:
    def __init__(self, redis_client, rules: List[RateLimitRule]):
        self.limiter = DistributedRateLimiter(redis_client)
        self.rules = rules
    
    async def check_limits(self, request: Request) -> Optional[JSONResponse]:
        for rule in self.rules:
            key = rule.key_func(request)
            allowed, remaining = self.limiter.is_allowed(
                key, rule.limit, rule.window_seconds * 1000
            )
            if not allowed:
                return JSONResponse(
                    status_code=429,
                    content={
                        "error": "Rate limit exceeded",
                        "limit_name": rule.name
                    },
                    headers={"Retry-After": str(rule.window_seconds)}
                )
        return None

# Usage
rules = [
    RateLimitRule(
        "global", 1000, 3600,
        lambda r: f"global:{r.state.user_id}"
    ),
    RateLimitRule(
        "endpoint", 100, 60,
        lambda r: f"endpoint:{r.state.user_id}:{r.url.path}"
    ),
]

Implement exponential backoff in your clients. When receiving 429 responses, wait for the Retry-After duration, then add jitter to prevent thundering herds.
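That client-side loop can be sketched as follows (illustrative and library-agnostic; `call` stands in for any HTTP request function returning status, Retry-After, and body):

```python
import random
import time

def fetch_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on 429 responses, per the guidance above.

    `call` is any zero-argument function returning
    (status_code, retry_after_seconds_or_None, body).
    """
    for attempt in range(max_retries):
        status, retry_after, body = call()
        if status != 429:
            return body
        if retry_after is not None:
            # Honor the server's Retry-After, plus jitter against thundering herds
            delay = retry_after + random.uniform(0, base_delay)
        else:
            # Exponential backoff with full jitter
            delay = random.uniform(0, base_delay * (2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```

Full jitter (sleeping a uniform random fraction of the backoff) spreads retries out, so a fleet of clients rejected at the same instant does not retry at the same instant.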

Monitor rate limit metrics: rejection rates by endpoint, users hitting limits frequently (potential abuse or undersized limits), and latency impact of your rate limiting layer.

Summary and Production Checklist

Algorithm Selection:

  • Token bucket for APIs where burst tolerance improves user experience
  • Sliding window counter for strict, accurate enforcement
  • Fixed window only for simple, non-critical use cases

Implementation Checklist:

  • Use Redis Lua scripts for atomic distributed operations
  • Implement local fallback for Redis failures
  • Return standard rate limit headers on all responses
  • Log rate limit violations for security analysis
  • Configure alerts for unusual rejection patterns
  • Test behavior under Redis latency and failures

Production Libraries:

  • Node.js: rate-limiter-flexible (excellent Redis support)
  • Java: Guava RateLimiter, Resilience4j
  • Python: limits, slowapi for FastAPI
  • Go: golang.org/x/time/rate, go-redis/redis_rate

Start with conservative limits and adjust based on actual usage patterns. It’s easier to increase limits than to deal with the fallout from an overloaded system.
