API Rate Limiting: Implementation Strategies


Key Insights

  • Token bucket and sliding window algorithms each solve different rate limiting needs—token bucket allows controlled bursts while sliding window prevents boundary exploitation through precise timestamp tracking.
  • Production rate limiting requires distributed state management via Redis or similar stores; in-memory solutions fail in multi-server deployments and lose state on restarts.
  • Effective rate limiting goes beyond simple request counting—implement tiered limits per user type, return proper HTTP 429 responses with Retry-After headers, and monitor limit hits to tune thresholds based on actual usage patterns.

Introduction to Rate Limiting

Rate limiting protects your API from abuse, ensures fair resource distribution among users, and controls infrastructure costs. Without it, a single misbehaving client can overwhelm your servers, degrade performance for legitimate users, or rack up cloud computing bills.

The core use cases are straightforward: prevent malicious actors from scraping your data or launching denial-of-service attacks, enforce fair usage policies across your user base, and manage costs by capping requests to expensive downstream services or database operations.

Several algorithms handle rate limiting, each with tradeoffs. Fixed windows are simple but allow traffic spikes at boundary transitions. Token bucket permits controlled bursts while maintaining average rates. Sliding window log provides the most accuracy by tracking individual requests. Sliding window counter approximates the log approach with less memory overhead.
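The boundary problem is easiest to see in code. Here is a minimal in-memory fixed-window counter, shown only to illustrate the weakness (the injectable clock is for testing and not part of any standard API):

```javascript
// A minimal fixed-window counter. Window boundaries fall on multiples of
// windowMs, so two bursts straddling a boundary both get through.
class FixedWindowCounter {
  constructor(limit, windowMs, now = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock for testing
    this.windowStart = 0;
    this.count = 0;
  }

  isAllowed() {
    const current = Math.floor(this.now() / this.windowMs) * this.windowMs;
    if (current !== this.windowStart) {
      this.windowStart = current; // new window: reset the counter
      this.count = 0;
    }
    if (this.count < this.limit) {
      this.count += 1;
      return true;
    }
    return false;
  }
}
```

With a limit of 100 per hour, 100 requests at 11:59 exhaust one window and 100 more at 12:01 start a fresh one: 200 requests land inside two minutes.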

Token Bucket Algorithm

The token bucket algorithm maintains a bucket with a maximum capacity of tokens. Tokens refill at a constant rate. Each request consumes one or more tokens. When the bucket is empty, requests are rejected until tokens refill.

This algorithm excels at handling bursty traffic patterns. If your API sits idle, the bucket fills to capacity, allowing a burst of requests when traffic arrives. The refill rate ensures the average request rate stays within limits over time.

Here’s a practical implementation:

class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const timePassed = (now - this.lastRefill) / 1000;
    const tokensToAdd = timePassed * this.refillRate;
    
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  consume(tokens = 1) {
    this.refill();
    
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }
    
    return false;
  }

  getWaitTime() {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil((1 - this.tokens) / this.refillRate * 1000);
  }
}

The refill() method calculates tokens added since the last check based on elapsed time. The consume() method attempts to take tokens, returning success or failure. The getWaitTime() method tells clients how long to wait before retrying.
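The class above tracks a single bucket, but an API needs one bucket per client. A sketch of a per-client registry follows; the bucket logic is inlined in condensed form so the example runs standalone, and the class name is ours, not a standard API:

```javascript
// One token bucket per client, keyed by client id (e.g. IP or API key).
class BucketRegistry {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.refillRate = refillRate; // tokens per second
    this.buckets = new Map();     // client id -> { tokens, lastRefill }
  }

  consume(clientId, tokens = 1, now = Date.now()) {
    let b = this.buckets.get(clientId);
    if (!b) {
      b = { tokens: this.capacity, lastRefill: now };
      this.buckets.set(clientId, b);
    }
    // Refill based on elapsed time, capped at capacity
    const elapsed = (now - b.lastRefill) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsed * this.refillRate);
    b.lastRefill = now;

    if (b.tokens >= tokens) {
      b.tokens -= tokens;
      return true;
    }
    return false;
  }
}
```

Each client drains only its own bucket. A real deployment would also evict idle entries so the Map does not grow without bound.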

Sliding Window Log Algorithm

Sliding window log tracks timestamps of individual requests within a rolling time window. This provides precise rate limiting without the boundary problem that affects fixed windows.

With fixed windows, a user could make 100 requests at 11:59 and another 100 at 12:01, bypassing a 100-requests-per-hour limit. Sliding window log prevents this by checking requests within the last 60 minutes from the current moment.

Redis sorted sets are perfect for this implementation:

class SlidingWindowLimiter {
  constructor(redisClient, limit, windowMs) {
    this.redis = redisClient;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  async isAllowed(identifier) {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    const key = `ratelimit:${identifier}`;

    // Remove old entries outside the window
    await this.redis.zremrangebyscore(key, 0, windowStart);

    // Count requests in current window
    const requestCount = await this.redis.zcard(key);

    if (requestCount < this.limit) {
      // Add current request with timestamp as score
      await this.redis.zadd(key, now, `${now}-${Math.random()}`);
      await this.redis.pexpire(key, this.windowMs);
      return {
        allowed: true,
        remaining: this.limit - requestCount - 1
      };
    }

    // Get oldest request to calculate retry time
    const oldestRequest = await this.redis.zrange(key, 0, 0, 'WITHSCORES');
    const retryAfter = Math.ceil((parseInt(oldestRequest[1]) + this.windowMs - now) / 1000);

    return {
      allowed: false,
      remaining: 0,
      retryAfter
    };
  }
}

This approach stores each request as a sorted set member with its timestamp as the score. We remove expired entries, count the remaining requests, and either allow or reject the new one. Memory overhead is higher than token bucket, but the window is tracked precisely, with no boundary spikes. One caveat: the remove, count, and add steps are separate Redis commands, so two concurrent requests can both pass the count check and push the window slightly over its limit; wrapping the sequence in a Lua script or a MULTI/EXEC transaction makes it atomic.
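The zremrangebyscore/zcard/zadd sequence above is three round trips and not atomic, so concurrent requests can race past the count check. One common fix is to run the whole check server-side in a single Lua script via EVAL. A sketch, assuming ioredis (the function and constant names here are ours):

```javascript
// Atomic sliding-window check as one Lua script.
// KEYS[1] = sorted-set key; ARGV = [now, windowMs, limit, member]
const SLIDING_WINDOW_SCRIPT = `
  local now = tonumber(ARGV[1])
  local windowMs = tonumber(ARGV[2])
  local limit = tonumber(ARGV[3])
  redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - windowMs)
  local count = redis.call('ZCARD', KEYS[1])
  if count < limit then
    redis.call('ZADD', KEYS[1], now, ARGV[4])
    redis.call('PEXPIRE', KEYS[1], windowMs)
    return 1
  end
  return 0
`;

async function isAllowedAtomic(redis, identifier, limit, windowMs) {
  const now = Date.now();
  const member = `${now}-${Math.random()}`; // unique member per request
  const allowed = await redis.eval(
    SLIDING_WINDOW_SCRIPT, 1, `ratelimit:${identifier}`,
    now, windowMs, limit, member
  );
  return allowed === 1;
}
```

Because Redis executes a script as a single atomic unit, no other command can interleave between the count and the insert.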

Implementing Rate Limiting Middleware

Express middleware provides a clean integration point for rate limiting. The middleware should set appropriate headers, return 429 status codes when limits are exceeded, and support both global and route-specific limits.

const express = require('express');
const Redis = require('ioredis');
// Assumes the SlidingWindowLimiter class defined above is in scope

class RateLimitMiddleware {
  constructor(options = {}) {
    this.redis = options.redis || new Redis();
    this.windowMs = options.windowMs || 60000;
    this.max = options.max || 100;
    this.keyGenerator = options.keyGenerator || ((req) => req.ip);
    // Create the limiter once rather than allocating one per request
    this.limiter = new SlidingWindowLimiter(this.redis, this.max, this.windowMs);
  }

  middleware() {
    return async (req, res, next) => {
      const key = this.keyGenerator(req);

      try {
        const result = await this.limiter.isAllowed(key);

        res.setHeader('X-RateLimit-Limit', this.max);
        res.setHeader('X-RateLimit-Remaining', result.remaining);
        res.setHeader('X-RateLimit-Reset', Math.ceil((Date.now() + this.windowMs) / 1000)); // Unix time in seconds

        if (!result.allowed) {
          res.setHeader('Retry-After', result.retryAfter);
          return res.status(429).json({
            error: 'Too many requests',
            retryAfter: result.retryAfter
          });
        }

        next();
      } catch (error) {
        // Fail open: allow request if rate limiter fails
        console.error('Rate limiter error:', error);
        next();
      }
    };
  }
}

// Usage
const app = express();
const limiter = new RateLimitMiddleware({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100
});

app.use('/api/', limiter.middleware());

The middleware sets standard rate limit headers so clients know their limits and remaining quota. When limits are exceeded, it returns HTTP 429 with a Retry-After header indicating when to retry. The fail-open behavior in the catch block ensures that rate limiter failures don’t take down your entire API.
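On the client side, note that Retry-After may carry either a delay in seconds or an HTTP-date (RFC 9110 allows both forms), so robust clients should handle both. A sketch of parsing it into a wait in milliseconds; the helper name is ours, not a standard API:

```javascript
// Parse a Retry-After header value into milliseconds to wait.
// Accepts either delta-seconds ("120") or an HTTP-date.
function retryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null) return 0;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) {
    return Math.max(0, seconds * 1000);
  }
  const date = Date.parse(headerValue);
  if (!Number.isNaN(date)) {
    return Math.max(0, date - now);
  }
  return 0; // unrecognized value: caller decides how to back off
}
```

A client would sleep for this duration before retrying a 429 response, ideally with some added jitter so many throttled clients do not retry in lockstep.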

Advanced Strategies

Real-world APIs need tiered rate limiting based on user subscription levels, API keys, or other criteria. Free users might get 100 requests per hour while premium users get 10,000.

class TieredRateLimiter {
  constructor(redis) {
    this.redis = redis;
    this.tiers = {
      free: { limit: 100, windowMs: 3600000 },
      premium: { limit: 10000, windowMs: 3600000 },
      enterprise: { limit: 100000, windowMs: 3600000 }
    };
  }

  async getUserTier(userId) {
    // Fetch from database or cache
    const cached = await this.redis.get(`user:${userId}:tier`);
    return cached || 'free';
  }

  middleware() {
    return async (req, res, next) => {
      const userId = req.user?.id || req.ip;
      const tier = await this.getUserTier(userId);
      const config = this.tiers[tier] || this.tiers.free; // unknown tier falls back to free

      const limiter = new SlidingWindowLimiter(
        this.redis,
        config.limit,
        config.windowMs
      );

      const result = await limiter.isAllowed(`${tier}:${userId}`);

      res.setHeader('X-RateLimit-Limit', config.limit);
      res.setHeader('X-RateLimit-Remaining', result.remaining);
      res.setHeader('X-RateLimit-Tier', tier);

      if (!result.allowed) {
        res.setHeader('Retry-After', result.retryAfter);
        return res.status(429).json({
          error: 'Rate limit exceeded',
          tier,
          limit: config.limit,
          retryAfter: result.retryAfter
        });
      }

      next();
    };
  }
}

You can also implement dynamic rate limiting that adjusts based on server load. Monitor CPU usage or response times and temporarily reduce limits when systems are stressed.
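A minimal sketch of that idea: scale the configured limit down as a load signal (CPU fraction, p95 latency against a target, and so on) rises. The function name and threshold values here are illustrative assumptions, not recommendations:

```javascript
// Scale a base limit by current load. `load` is a 0..1 signal such as
// CPU utilization; above `softThreshold` the limit shrinks linearly,
// bottoming out at `floor` so traffic is never cut off entirely.
function dynamicLimit(baseLimit, load, { softThreshold = 0.7, floor = 0.2 } = {}) {
  if (load <= softThreshold) return baseLimit;
  const overload = Math.min(1, (load - softThreshold) / (1 - softThreshold));
  const scale = 1 - overload * (1 - floor);
  return Math.max(1, Math.round(baseLimit * scale));
}
```

The result would be fed into the limiter in place of the static tier limit, recomputed on a short interval rather than per request.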

Testing and Monitoring

Rate limiters must be thoroughly tested, especially edge cases around boundary conditions and concurrent requests. Use Jest’s fake timers to control time in tests:

const { TokenBucket } = require('./token-bucket');

describe('TokenBucket', () => {
  beforeEach(() => {
    jest.useFakeTimers();
  });

  afterEach(() => {
    jest.useRealTimers();
  });

  test('allows requests within capacity', () => {
    const bucket = new TokenBucket(10, 1);
    
    expect(bucket.consume()).toBe(true);
    expect(bucket.consume(5)).toBe(true);
    expect(bucket.consume(4)).toBe(true);
    expect(bucket.consume()).toBe(false);
  });

  test('refills tokens over time', () => {
    const bucket = new TokenBucket(10, 2); // 2 tokens per second
    
    bucket.consume(10);
    expect(bucket.consume()).toBe(false);

    jest.advanceTimersByTime(3000); // 3 seconds
    expect(bucket.consume(6)).toBe(true); // 6 tokens refilled
  });

  test('never allows more than capacity across rapid requests', async () => {
    const bucket = new TokenBucket(5, 1);
    
    const results = await Promise.all([
      bucket.consume(),
      bucket.consume(),
      bucket.consume(),
      bucket.consume(),
      bucket.consume(),
      bucket.consume()
    ]);

    const allowed = results.filter(r => r).length;
    expect(allowed).toBe(5);
  });
});

In production, monitor rate limit metrics: how often limits are hit, which endpoints are most constrained, and whether limits are too strict or too lenient. Adjust thresholds based on actual usage patterns and business requirements.
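A sketch of the bookkeeping behind such metrics: an in-memory tally of allowed versus rejected decisions per endpoint, from which a rejection rate can be derived. In production this would typically feed Prometheus, StatsD, or a similar system rather than live in process memory, and the class name is ours:

```javascript
// Track allowed vs. rejected decisions per endpoint so limits can be
// tuned against real traffic. In-memory for illustration only.
class RateLimitMetrics {
  constructor() {
    this.counts = new Map(); // endpoint -> { allowed, rejected }
  }

  record(endpoint, allowed) {
    let c = this.counts.get(endpoint);
    if (!c) {
      c = { allowed: 0, rejected: 0 };
      this.counts.set(endpoint, c);
    }
    if (allowed) c.allowed += 1;
    else c.rejected += 1;
  }

  // Fraction of requests rejected; a persistently high value suggests the
  // limit is too strict for that endpoint, or that a client is misbehaving.
  rejectionRate(endpoint) {
    const c = this.counts.get(endpoint);
    if (!c) return 0;
    const total = c.allowed + c.rejected;
    return total === 0 ? 0 : c.rejected / total;
  }
}
```

The middleware would call record() with each decision, and a periodic job would export the rates.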

Production Considerations

For single-server deployments, in-memory rate limiting is fast and simple. But production systems typically run multiple servers behind load balancers. In-memory state doesn’t synchronize across instances, allowing users to bypass limits by hitting different servers.

Redis provides the distributed state management needed for multi-server deployments. It’s fast enough for rate limiting (sub-millisecond operations) and handles the atomic operations required for accurate counting.

Always return proper HTTP 429 status codes with Retry-After headers. Client libraries and well-behaved applications will automatically back off and retry appropriately. Include clear error messages explaining the limit and when to retry.

Performance matters for rate limiting since it runs on every request. Keep Redis operations minimal, use pipelining for multiple commands, and consider caching user tier information to avoid database lookups. Monitor rate limiter latency—if it adds more than a few milliseconds to request processing, optimize or scale your Redis infrastructure.

Rate limiting is not optional for production APIs. Choose the algorithm that fits your traffic patterns, implement it with distributed state management, test thoroughly, and monitor continuously to protect your infrastructure while providing the best experience for legitimate users.
