Throttling: Request Rate Control
Key Insights
- Throttling is about protecting your system’s stability, not punishing users—choose algorithms based on whether you need to handle bursts gracefully or enforce strict steady-state limits.
- Distributed throttling requires atomic operations; a Redis Lua script is the most reliable way to prevent race conditions across multiple service instances.
- Always communicate limits clearly through standard HTTP headers; clients can’t respect limits they don’t know about.
Why Throttling Matters
Every production API eventually faces the same problem: too many requests, not enough capacity. Maybe it’s a legitimate traffic spike, a misbehaving client, or a deliberate attack. Without throttling, your service degrades for everyone.
Throttling is the mechanism that controls how many requests a client can make within a given time window. It’s often conflated with rate limiting—and while the terms are frequently used interchangeably, there’s a subtle distinction. Rate limiting typically refers to the policy (“100 requests per minute”), while throttling is the enforcement mechanism that rejects or delays requests exceeding that limit.
The goal isn’t to block legitimate users. It’s to ensure fair resource allocation, protect downstream dependencies, maintain predictable performance, and give your system breathing room during unexpected load.
Common Throttling Algorithms
Choosing the right algorithm depends on your tolerance for bursts, memory constraints, and accuracy requirements.
Token Bucket
The token bucket algorithm allows controlled bursts while maintaining an average rate. Tokens accumulate at a fixed rate up to a maximum capacity. Each request consumes one token. If no tokens are available, the request is rejected.
This is ideal when you want to allow occasional bursts but enforce a long-term average rate.
Leaky Bucket
Requests enter a queue and are processed at a constant rate, like water leaking from a bucket. This smooths out traffic but adds latency since requests wait in the queue rather than being immediately rejected.
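A minimal queue-based sketch of the idea (illustrative only; a production implementation would drain the queue from a background worker rather than lazily on submit):

```python
import time
from collections import deque


class LeakyBucket:
    """Queue-based leaky bucket: requests join a bounded queue and
    drain out at a constant rate, smoothing bursts into steady output."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max requests the queue can hold
        self.leak_rate = leak_rate  # requests drained per second
        self.queue: deque = deque()
        self.last_leak = time.monotonic()

    def _leak(self) -> list:
        """Drain the requests that 'leaked out' since the last call."""
        now = time.monotonic()
        to_drain = int((now - self.last_leak) * self.leak_rate)
        if to_drain > 0:
            # Only advance the clock when we actually drained,
            # so fractional leak time accumulates instead of being lost.
            self.last_leak = now
        drained = []
        while to_drain > 0 and self.queue:
            drained.append(self.queue.popleft())
            to_drain -= 1
        return drained

    def submit(self, request) -> bool:
        """Queue a request; returns False if the bucket overflows."""
        self._leak()
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True
```

Note the trade-off the prose describes: a full bucket rejects immediately, but everything accepted waits its turn in the queue, adding latency in exchange for a perfectly smooth processing rate.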
Fixed Window
Count requests in fixed time intervals (e.g., per minute). Simple to implement but has a boundary problem: a client could make 100 requests at 11:59:59 and another 100 at 12:00:01, effectively doubling their rate.
Sliding Window Log
Track timestamps of all requests and count those within the sliding window. Accurate but memory-intensive—you’re storing every request timestamp.
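A sketch of the log approach, keeping one timestamp per request (in real use you would pass `time.monotonic()` as `now`):

```python
from collections import deque


class SlidingWindowLog:
    """Sliding window log: store every request timestamp and count those
    still inside the window. Accurate, but memory grows with traffic."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.timestamps: deque = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have slid out of the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```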
Sliding Window Counter
A hybrid approach that interpolates between the previous and current fixed windows based on the current position in time. Good balance of accuracy and memory efficiency.
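The interpolation is simple enough to show directly. This sketch weighs the previous fixed window's count by how much of it still overlaps the sliding window (`prev_count` and `curr_count` are the request totals for the previous and current fixed windows; the names are illustrative):

```python
def sliding_window_allowed(prev_count: int, curr_count: int,
                           window: float, now: float, limit: int) -> bool:
    """Sliding window counter estimate: blend the previous window's
    count with the current one, weighted by overlap."""
    elapsed_in_current = now % window
    prev_weight = (window - elapsed_in_current) / window
    estimated = prev_count * prev_weight + curr_count
    return estimated < limit
```

Halfway through a 60-second window, half the previous window's count still contributes to the estimate, which is what smooths out the fixed window's boundary problem at the cost of assuming requests were evenly spread.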
Here’s a token bucket implementation in Python:
```python
import time
from threading import Lock


class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: Maximum tokens the bucket can hold
        refill_rate: Tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now

    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_time(self, tokens: int = 1) -> float:
        """Returns seconds to wait before tokens become available."""
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                return 0.0
            tokens_needed = tokens - self.tokens
            return tokens_needed / self.refill_rate


# Usage
bucket = TokenBucket(capacity=10, refill_rate=2)  # 10 max, 2 tokens/sec
for i in range(15):
    if bucket.consume():
        print(f"Request {i}: allowed")
    else:
        wait = bucket.wait_time()
        print(f"Request {i}: throttled, retry in {wait:.2f}s")
```
Implementing Client-Side Throttling
Good API clients don’t wait to be throttled—they proactively respect rate limits. This reduces wasted requests and improves overall system efficiency.
Build a request queue that enforces limits locally and handles rejections gracefully:
```typescript
interface ThrottledRequest<T> {
  execute: () => Promise<T>;
  resolve: (value: T) => void;
  reject: (error: Error) => void;
  retries: number;
}

class ThrottledClient {
  private queue: ThrottledRequest<any>[] = [];
  private processing = false;
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number = 10,
    private refillRate: number = 5, // tokens per second
    private maxRetries: number = 3,
    private baseDelay: number = 1000
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }

  async request<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ execute: fn, resolve, reject, retries: 0 });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;

    while (this.queue.length > 0) {
      this.refill();
      if (this.tokens < 1) {
        const waitTime = (1 - this.tokens) / this.refillRate * 1000;
        await this.sleep(waitTime);
        continue;
      }

      const request = this.queue.shift()!;
      this.tokens -= 1;

      try {
        const result = await request.execute();
        request.resolve(result);
      } catch (error: any) {
        if (error.status === 429 && request.retries < this.maxRetries) {
          const retryAfter = this.parseRetryAfter(error) ||
            this.baseDelay * Math.pow(2, request.retries);
          request.retries += 1;
          await this.sleep(retryAfter);
          this.queue.unshift(request); // Re-queue at front
        } else {
          request.reject(error);
        }
      }
    }

    this.processing = false;
  }

  private parseRetryAfter(error: any): number | null {
    const header = error.headers?.['retry-after'];
    if (!header) return null;
    const seconds = parseInt(header, 10);
    return isNaN(seconds) ? null : seconds * 1000;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const client = new ThrottledClient(10, 5, 3, 1000);
const results = await Promise.all(
  urls.map(url => client.request(() => fetch(url).then(r => r.json())))
);
```
Server-Side Throttling Strategies
On the server, throttling typically lives in middleware. You’ll need to decide on several dimensions:
- Scope: Per-user, per-API-key, per-IP, or global
- Granularity: Per-endpoint or service-wide
- Tiers: Different limits for different subscription levels
Here’s a FastAPI middleware using Redis for distributed state with a fixed window counter (simple and atomic via `INCR`; see the sliding window Lua script later for boundary-accurate counting):
```python
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
redis_client = redis.from_url("redis://localhost:6379")

RATE_LIMITS = {
    "free": {"requests": 100, "window": 60},
    "pro": {"requests": 1000, "window": 60},
    "enterprise": {"requests": 10000, "window": 60},
}


async def get_user_tier(api_key: str) -> str:
    # In production, look this up from your database
    tier = await redis_client.hget("api_keys", api_key)
    return tier.decode() if tier else "free"


@app.middleware("http")
async def throttle_middleware(request: Request, call_next):
    api_key = request.headers.get("X-API-Key", request.client.host)
    tier = await get_user_tier(api_key)
    limits = RATE_LIMITS[tier]
    window = limits["window"]
    max_requests = limits["requests"]

    now = time.time()
    window_start = int(now // window) * window
    key = f"ratelimit:{api_key}:{window_start}"

    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.expire(key, window * 2)  # Keep for current + next window
    results = await pipe.execute()
    current_count = results[0]

    remaining = max(0, max_requests - current_count)
    reset_time = window_start + window

    if current_count > max_requests:
        return JSONResponse(
            status_code=429,
            content={
                "error": "rate_limit_exceeded",
                "message": f"Rate limit of {max_requests} requests per {window}s exceeded",
                "retry_after": int(reset_time - now)
            },
            headers={
                "X-RateLimit-Limit": str(max_requests),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(reset_time)),
                "Retry-After": str(int(reset_time - now))
            }
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = str(max_requests)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(int(reset_time))
    return response
```
Distributed Throttling Challenges
When your service runs across multiple instances, throttling gets complicated. Each instance needs a consistent view of request counts, but network latency and race conditions conspire against you.
The “thundering herd” problem is particularly nasty: when a rate limit window resets, all queued clients retry simultaneously, potentially overwhelming your service.
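A common mitigation is jittered exponential backoff on the client side: instead of every client retrying after the same fixed delay, each picks a random delay within an exponentially growing window, spreading the retry wave over time. A minimal “full jitter” sketch:

```python
import random


def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: retry after a random delay in
    [0, min(cap, base * 2**attempt)] so clients don't retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Pairing this with the `Retry-After` header works well: honor the server's value when present, and fall back to jittered backoff when it isn't.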
Redis Lua scripts solve the atomicity problem by executing multiple operations as a single unit:
```lua
-- sliding_window_rate_limit.lua
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests in window
local count = redis.call('ZCARD', key)

if count < limit then
  -- Add current request with timestamp as score
  redis.call('ZADD', key, now, now .. '-' .. math.random())
  redis.call('EXPIRE', key, window)
  return {1, limit - count - 1, window} -- allowed, remaining, reset
else
  -- Get oldest entry to calculate reset time
  local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
  local reset = window - (now - oldest[2])
  return {0, 0, reset} -- denied, remaining, reset
end
```
Call it from your application:
```python
SCRIPT = """..."""  # Lua script above
script_sha = await redis_client.script_load(SCRIPT)

result = await redis_client.evalsha(
    script_sha,
    1,                       # number of keys
    f"ratelimit:{user_id}",  # key
    60,                      # window in seconds
    100,                     # limit
    time.time()              # current timestamp
)
allowed, remaining, reset = result
```
Communicating Limits to Clients
Standard HTTP headers make your rate limits discoverable:
- X-RateLimit-Limit: Maximum requests allowed in the window
- X-RateLimit-Remaining: Requests remaining in current window
- X-RateLimit-Reset: Unix timestamp when the window resets
- Retry-After: Seconds to wait before retrying (on 429 responses)
Always return these headers on every response, not just rejections. Clients need visibility into their consumption to throttle proactively.
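For instance, a client can check its remaining quota after each response and pause until the reset time before it ever receives a 429. This sketch (a hypothetical helper, assuming the X-RateLimit-* names above) computes how long to wait:

```python
def seconds_until_safe(headers: dict, now: float) -> float:
    """If no requests remain in the window, return seconds until it
    resets; otherwise 0. Lets a client pause instead of hitting a 429."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", 0))
    return max(0.0, reset - now)
```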
Your 429 response body should include actionable information:
```json
{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute",
  "limit": 100,
  "window_seconds": 60,
  "retry_after": 23
}
```
Monitoring and Tuning
Throttling isn’t set-and-forget. Track these metrics:
- Rejection rate: Percentage of requests returning 429. High rates might indicate limits are too aggressive or a client is misbehaving.
- Headroom: How close legitimate users get to their limits. Consistently hitting 80%+ suggests you should raise limits.
- Latency percentiles: Throttling should improve p99 latency by shedding excess load.
- Retry storms: Spikes in traffic after limit windows reset.
Set alerts for sudden spikes in rejection rates—they often indicate either an attack or a legitimate client with a bug. Review your limits quarterly against actual traffic patterns. The goal is limits that protect your infrastructure without impacting normal usage.
Throttling done right is invisible to well-behaved clients and a firm boundary against abuse. Start conservative, monitor aggressively, and adjust based on real data.