Design a Rate Limiter: API Throttling System
Key Insights
- Token bucket handles bursty traffic gracefully while sliding window counters provide accurate rate limiting—choose based on whether you prioritize user experience or strict enforcement
- Redis with Lua scripts solves distributed rate limiting atomically, but you must design for Redis failures with local fallbacks to avoid cascading outages
- Place rate limiters at the API gateway for broad protection, but implement application-layer limits for fine-grained per-user controls that require business context
Why Rate Limiting Matters
Every production API needs rate limiting. Without it, a single misbehaving client can exhaust your database connections, a bot can scrape your entire catalog in minutes, or a DDoS attack can bankrupt you with cloud bills.
Rate limiting serves three critical purposes: protecting infrastructure from overload, ensuring fair resource allocation among users, and controlling costs in pay-per-use architectures. Whether you’re building an API gateway, securing login endpoints against brute force attacks, or managing webhook delivery, the underlying problem is the same—controlling request flow over time.
The requirements seem simple: limit requests to N per time window. But production systems demand more. You need sub-millisecond latency impact, accurate counting across distributed servers, graceful handling of edge cases, and clear communication to clients about their limits.
Rate Limiting Algorithms Overview
Four algorithms dominate production systems, each with distinct tradeoffs.
Token Bucket maintains a bucket that fills with tokens at a fixed rate. Each request consumes a token. If the bucket is empty, requests are rejected. This algorithm excels at handling burst traffic—a bucket with 100 tokens allows 100 rapid requests before throttling kicks in, then settles to the refill rate.
Leaky Bucket processes requests at a constant rate regardless of arrival pattern. Incoming requests queue up and “leak” out at a fixed rate. This smooths traffic but introduces latency and requires queue management.
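No leaky bucket implementation appears later in this article, so here is a minimal single-process sketch for comparison. The LeakyBucket class and its method names are illustrative, not from any library, and a real deployment would pair the queue with a worker that actually serves the released requests.

import time
from collections import deque
from threading import Lock

class LeakyBucket:
    def __init__(self, queue_size: int, leak_rate: float):
        self.queue = deque()           # pending requests awaiting service
        self.queue_size = queue_size   # overflow threshold
        self.leak_rate = leak_rate     # requests released per second
        self.last_leak = time.monotonic()
        self.lock = Lock()

    def try_enqueue(self, request_id: str) -> bool:
        with self.lock:
            self._leak()
            if len(self.queue) < self.queue_size:
                self.queue.append(request_id)
                return True
            return False  # bucket is full: reject the request

    def _leak(self) -> None:
        # Release queued requests at the fixed leak rate
        now = time.monotonic()
        to_release = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(to_release, len(self.queue))):
            self.queue.popleft()  # in practice, hand off to a worker here
        if to_release > 0:
            # Advance by whole releases only, keeping fractional progress
            self.last_leak += to_release / self.leak_rate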
Fixed Window counts requests in discrete time intervals (e.g., per minute). Simple to implement but suffers from boundary spikes—a user could make 100 requests at 11:59:59 and another 100 at 12:00:01, effectively doubling their limit.
Sliding Window eliminates boundary issues by considering a rolling time period. The counter variant approximates this by weighting the previous window's count by the fraction of it still inside the rolling window (for example, 30 seconds into a 60-second current window, the previous window's count is weighted by 0.5), while the log variant tracks individual request timestamps for perfect accuracy at higher memory cost.
import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def allow_request(self, tokens: int = 1) -> bool:
        with self.lock:
            # Refill based on elapsed time, capped at capacity
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.refill_rate
            )
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window_start = self._get_window_start()
        self.lock = Lock()

    def _get_window_start(self) -> int:
        return int(time.time()) // self.window_seconds * self.window_seconds

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            window_start = self._get_window_start()
            # Roll over to a new window if needed
            if window_start != self.curr_window_start:
                if window_start - self.curr_window_start == self.window_seconds:
                    # The previous window just ended; carry its count forward
                    self.prev_count = self.curr_count
                else:
                    # More than one full window elapsed; no relevant history
                    self.prev_count = 0
                self.curr_count = 0
                self.curr_window_start = window_start
            # Weight the previous window by the fraction of it still
            # inside the rolling window
            window_progress = (now - window_start) / self.window_seconds
            weighted_count = (
                self.prev_count * (1 - window_progress) + self.curr_count
            )
            if weighted_count < self.limit:
                self.curr_count += 1
                return True
            return False
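A quick usage sketch shows the burst behavior described earlier: a full bucket absorbs an initial burst, then throughput settles to the refill rate. The counts in the comments are approximate, since refill depends on wall-clock timing.

# Capacity 100 absorbs a burst; afterwards, ~10 requests/second pass
bucket = TokenBucket(capacity=100, refill_rate=10)

print(sum(bucket.allow_request() for _ in range(150)))  # ~100: burst drains the bucket
time.sleep(1)  # roughly 10 tokens refill in one second
print(sum(bucket.allow_request() for _ in range(20)))   # ~10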
Single-Server Implementation
For single-server deployments, in-memory rate limiting is straightforward and fast. The key challenges are thread safety and memory management.
Store rate limit state in hash maps keyed by client identifier (user ID, API key, or IP address). Use locks or atomic operations to prevent race conditions when multiple threads check and update counts simultaneously.
from functools import wraps
from typing import Dict, Tuple
import time
from threading import Lock

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class RateLimiter:
    def __init__(self):
        self.clients: Dict[str, Tuple[int, int]] = {}  # key -> (window_start, count)
        self.lock = Lock()
        self.window_seconds = 60
        self.limit = 100

    def is_allowed(self, client_id: str) -> Tuple[bool, int, int]:
        """Returns (allowed, remaining, reset_time)."""
        with self.lock:
            now = time.time()
            window_start = int(now) // self.window_seconds * self.window_seconds
            reset_time = window_start + self.window_seconds
            if client_id not in self.clients:
                self.clients[client_id] = (window_start, 1)
                return True, self.limit - 1, reset_time
            stored_window, count = self.clients[client_id]
            if stored_window != window_start:
                # New window: reset the count
                self.clients[client_id] = (window_start, 1)
                return True, self.limit - 1, reset_time
            if count >= self.limit:
                return False, 0, reset_time
            self.clients[client_id] = (window_start, count + 1)
            return True, self.limit - count - 1, reset_time

limiter = RateLimiter()

def rate_limit(func):
    @wraps(func)
    async def wrapper(request: Request, *args, **kwargs):
        client_id = request.headers.get("X-API-Key", request.client.host)
        allowed, remaining, reset_time = limiter.is_allowed(client_id)
        if not allowed:
            return JSONResponse(
                status_code=429,
                content={"error": "Rate limit exceeded"},
                headers={
                    "X-RateLimit-Limit": str(limiter.limit),
                    "X-RateLimit-Remaining": "0",
                    "X-RateLimit-Reset": str(reset_time),
                    "Retry-After": str(int(reset_time - time.time()))
                }
            )
        return await func(request, *args, **kwargs)
    return wrapper

@app.get("/api/resource")
@rate_limit
async def get_resource(request: Request):
    return {"data": "your resource"}
Distributed Rate Limiting
Single-server rate limiting breaks down when you scale horizontally. A user hitting different servers could exceed their limit by a factor of your server count.
The fundamental challenge is consistency versus availability. Strongly consistent distributed counting adds latency and creates a single point of failure. Eventually consistent approaches may temporarily allow limit violations.
Redis provides an excellent middle ground. It’s fast, supports atomic operations, and handles the distributed coordination for you. The critical insight is using Lua scripts—Redis executes them atomically, eliminating race conditions.
-- sliding_window.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = current timestamp (milliseconds)
-- ARGV[2] = window size (milliseconds)
-- ARGV[3] = max requests
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Remove entries that have fallen out of the rolling window
local window_start = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count requests still inside the window
local current = redis.call('ZCARD', key)
if current < limit then
    -- Record this request with its timestamp as the score
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('PEXPIRE', key, window)
    return {1, limit - current - 1}  -- allowed, remaining
else
    return {0, 0}  -- denied, remaining
end
import time
from typing import Tuple

import redis

class DistributedRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local now = tonumber(ARGV[1])
            local window = tonumber(ARGV[2])
            local limit = tonumber(ARGV[3])
            redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)
            local current = redis.call('ZCARD', key)
            if current < limit then
                redis.call('ZADD', key, now, now .. '-' .. math.random())
                redis.call('PEXPIRE', key, window)
                return {1, limit - current - 1}
            end
            return {0, 0}
        """)

    def is_allowed(
        self,
        key: str,
        limit: int,
        window_ms: int
    ) -> Tuple[bool, int]:
        """Returns (allowed, remaining)."""
        try:
            now_ms = int(time.time() * 1000)
            result = self.lua_script(
                keys=[f"ratelimit:{key}"],
                args=[now_ms, window_ms, limit]
            )
            return bool(result[0]), int(result[1])
        except redis.RedisError:
            # Fail open or closed based on your requirements
            return True, limit  # Fail open
Always plan for Redis failures. Implement a local fallback rate limiter or decide whether to fail open (allow requests) or fail closed (deny requests) based on your security requirements.
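One way to structure that decision is a small wrapper around both limiters. This is a sketch, not the only design: it assumes the except clause inside DistributedRateLimiter.is_allowed above is removed so Redis errors propagate, and it degrades to a coarse per-process TokenBucket rather than failing fully open.

import redis

class FallbackRateLimiter:
    def __init__(self, redis_client, limit: int, window_ms: int):
        # Assumes DistributedRateLimiter.is_allowed lets redis.RedisError
        # propagate instead of catching it internally
        self.distributed = DistributedRateLimiter(redis_client)
        self.limit = limit
        self.window_ms = window_ms
        # Coarse local fallback approximating the same average rate
        self.local = TokenBucket(
            capacity=limit,
            refill_rate=limit / (window_ms / 1000.0)
        )

    def is_allowed(self, key: str) -> bool:
        try:
            allowed, _ = self.distributed.is_allowed(
                key, self.limit, self.window_ms
            )
            return allowed
        except redis.RedisError:
            # Degraded mode: one shared bucket per process is coarser
            # than per-key limits, but still shields the backend
            return self.local.allow_request()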
System Design and Architecture
Rate limiters can live at multiple layers. API gateways (Kong, AWS API Gateway) handle broad protection against abuse. Application middleware enables business-logic-aware limits. Database-layer throttling prevents connection exhaustion.
For most systems, implement rate limiting at two levels: coarse limits at the gateway and fine-grained limits in application middleware where you have user context.
Store rate limit rules in a database for dynamic configuration:
CREATE TABLE rate_limit_rules (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    key_pattern VARCHAR(255) NOT NULL,  -- e.g., "user:{user_id}:endpoint:{path}"
    requests_limit INT NOT NULL,
    window_seconds INT NOT NULL,
    tier VARCHAR(50),  -- 'free', 'pro', 'enterprise'
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE rate_limit_overrides (
    id SERIAL PRIMARY KEY,
    entity_type VARCHAR(50) NOT NULL,  -- 'user', 'organization', 'ip'
    entity_id VARCHAR(255) NOT NULL,
    requests_limit INT NOT NULL,
    window_seconds INT NOT NULL,
    expires_at TIMESTAMP,
    UNIQUE(entity_type, entity_id)
);
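As a sketch of how these tables might feed the limiter, the function below (hypothetical, assuming a psycopg2-style connection and, for simplicity, one rule per tier) resolves an entity's effective limit, letting unexpired per-entity overrides take precedence over the tier default.

def resolve_limit(conn, tier: str, entity_type: str, entity_id: str):
    """Return (requests_limit, window_seconds) for an entity, preferring
    an unexpired override over the tier's default rule."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT requests_limit, window_seconds
            FROM rate_limit_overrides
            WHERE entity_type = %s AND entity_id = %s
              AND (expires_at IS NULL OR expires_at > NOW())
            """,
            (entity_type, entity_id),
        )
        override = cur.fetchone()
        if override:
            return override
        cur.execute(
            "SELECT requests_limit, window_seconds "
            "FROM rate_limit_rules WHERE tier = %s LIMIT 1",
            (tier,),
        )
        return cur.fetchone()

In practice you would cache resolved rules in memory with a short TTL so the database never sits on the request hot path.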
Always return standard rate limit headers so clients can self-regulate:
- X-RateLimit-Limit: Maximum requests allowed
- X-RateLimit-Remaining: Requests remaining in current window
- X-RateLimit-Reset: Unix timestamp when the window resets
- Retry-After: Seconds until the client should retry (on 429 responses)
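The decorator in the single-server example only attaches these headers to 429 responses. One way to return them on every response is HTTP middleware; this sketch reuses the app and limiter objects from that example.

@app.middleware("http")
async def rate_limit_headers(request: Request, call_next):
    client_id = request.headers.get("X-API-Key", request.client.host)
    allowed, remaining, reset_time = limiter.is_allowed(client_id)
    if not allowed:
        return JSONResponse(
            status_code=429,
            content={"error": "Rate limit exceeded"},
            headers={
                "X-RateLimit-Limit": str(limiter.limit),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(reset_time),
                "Retry-After": str(max(0, int(reset_time - time.time()))),
            },
        )
    response = await call_next(request)
    # Attach the standard headers to successful responses as well
    response.headers["X-RateLimit-Limit"] = str(limiter.limit)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(reset_time)
    return response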
Advanced Considerations
Production systems need multi-tier limits. A user might have 1000 requests per hour globally, but only 10 login attempts per minute and 100 requests per minute to expensive endpoints.
from dataclasses import dataclass
from typing import Callable, List, Optional

from fastapi import Request
from fastapi.responses import JSONResponse

@dataclass
class RateLimitRule:
    name: str
    limit: int
    window_seconds: int
    key_func: Callable[[Request], str]

class MultiTierRateLimiter:
    def __init__(self, redis_client, rules: List[RateLimitRule]):
        self.limiter = DistributedRateLimiter(redis_client)
        self.rules = rules

    async def check_limits(self, request: Request) -> Optional[JSONResponse]:
        # Every rule must pass; the first violated rule short-circuits
        for rule in self.rules:
            key = rule.key_func(request)
            allowed, remaining = self.limiter.is_allowed(
                key, rule.limit, rule.window_seconds * 1000
            )
            if not allowed:
                return JSONResponse(
                    status_code=429,
                    content={
                        "error": "Rate limit exceeded",
                        "limit_name": rule.name
                    },
                    headers={"Retry-After": str(rule.window_seconds)}
                )
        return None

# Usage
rules = [
    RateLimitRule(
        "global", 1000, 3600,
        lambda r: f"global:{r.state.user_id}"
    ),
    RateLimitRule(
        "endpoint", 100, 60,
        lambda r: f"endpoint:{r.state.user_id}:{r.url.path}"
    ),
]
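Wiring this into FastAPI could look like the sketch below. It assumes a redis_client is available and that upstream auth middleware has populated request.state.user_id, as the key functions above expect.

multi_limiter = MultiTierRateLimiter(redis_client, rules)

@app.middleware("http")
async def enforce_limits(request: Request, call_next):
    rejection = await multi_limiter.check_limits(request)
    if rejection is not None:
        return rejection
    return await call_next(request)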
Implement exponential backoff in your clients. When a client receives a 429 response, it should wait at least the Retry-After duration before retrying, and add jitter to the delay to prevent thundering herds.
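On the client side, a retry loop along these lines is one option. This is a sketch using the requests library; the 60-second cap and 10% jitter are arbitrary illustrative choices.

import random
import time

import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint; otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        delay = min(delay, 60)                    # cap the wait
        delay += random.uniform(0, delay * 0.1)   # jitter to avoid herds
        time.sleep(delay)
    return response  # last response was still a 429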
Monitor rate limit metrics: rejection rates by endpoint, users hitting limits frequently (potential abuse or undersized limits), and latency impact of your rate limiting layer.
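Instrumentation can be as simple as a rejection counter and a latency histogram. This sketch uses the prometheus_client library; the metric and label names are illustrative, not standard.

from prometheus_client import Counter, Histogram

RATE_LIMIT_REJECTIONS = Counter(
    "rate_limit_rejections_total",
    "Requests rejected by the rate limiter",
    ["endpoint", "limit_name"],
)
RATE_LIMIT_LATENCY = Histogram(
    "rate_limit_check_seconds",
    "Latency added by the rate limit check",
)

def record_rejection(endpoint: str, limit_name: str) -> None:
    RATE_LIMIT_REJECTIONS.labels(endpoint=endpoint, limit_name=limit_name).inc()

# Wrap limiter checks to track the latency they add:
#   with RATE_LIMIT_LATENCY.time():
#       allowed, remaining = limiter.is_allowed(key, limit, window_ms)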
Summary and Production Checklist
Algorithm Selection:
- Token bucket for APIs where burst tolerance improves user experience
- Sliding window counter for strict, accurate enforcement
- Fixed window only for simple, non-critical use cases
Implementation Checklist:
- Use Redis Lua scripts for atomic distributed operations
- Implement local fallback for Redis failures
- Return standard rate limit headers on all responses
- Log rate limit violations for security analysis
- Configure alerts for unusual rejection patterns
- Test behavior under Redis latency and failures
Production Libraries:
- Node.js: rate-limiter-flexible (excellent Redis support)
- Java: Guava RateLimiter, Resilience4j
- Python: limits, slowapi for FastAPI
- Go: golang.org/x/time/rate, go-redis/redis_rate
Start with conservative limits and adjust based on actual usage patterns. It’s easier to increase limits than to deal with the fallout from an overloaded system.