Count-Min Sketch: Approximate Frequency Counting
Key Insights
- Count-Min Sketch trades exact accuracy for massive memory savings, using O(1/ε × log(1/δ)) space instead of O(n) to track frequency estimates across billions of items
- The data structure only overestimates frequencies, never underestimates—this one-sided error property makes it ideal for applications like rate limiting and heavy hitter detection where false positives are acceptable
- With just a few kilobytes of memory, you can track frequency estimates for millions of unique items with 1% error bounds at 99% confidence, making CMS essential for real-time stream processing
The Heavy Hitters Problem
Every system at scale eventually hits the same wall: you need to count things, but there are too many things to count exactly.
Consider tracking which URLs receive the most traffic across your CDN. You’re processing 100,000 requests per second, each with a unique-ish URL. Storing exact counts in a hash map means keeping millions of entries in memory. When traffic spikes, your counting infrastructure becomes the bottleneck.
This problem appears everywhere:
- Network monitoring: Identifying IP addresses generating anomalous traffic
- Trending topics: Finding frequently mentioned hashtags in real-time
- Database optimization: Tracking query patterns to inform index creation
- Fraud detection: Counting failed login attempts per account
The naive approach—storing every item with its count—requires O(n) memory where n is the number of unique items. At web scale, n can be billions. You need a different approach.
Count-Min Sketch offers a compelling trade-off: accept small, bounded errors in exchange for constant memory usage regardless of input size.
How Count-Min Sketch Works
A Count-Min Sketch is a 2D array of counters with dimensions width × depth. Each row uses a different hash function to map items to column positions.
When you increment an item’s count, you hash it with each row’s hash function and increment the corresponding counter in each row. When querying, you hash the item again and return the minimum value across all rows.
The minimum-value strategy is crucial. Hash collisions cause counters to be shared between items, inflating counts. By taking the minimum across multiple independent hash functions, you reduce the impact of collisions.
```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        """Generate hash for item using seed to create independent hash functions."""
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Increment counters for item across all rows."""
        for row in range(self.depth):
            col = self._hash(item, row)
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        """Return minimum count across all rows (best estimate)."""
        return min(
            self.table[row][self._hash(item, row)]
            for row in range(self.depth)
        )
```
This basic implementation demonstrates the core mechanics. You add items by incrementing counters across all rows, and estimate by taking the minimum.
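To see the one-sided error in action, here is a quick demo using a condensed copy of the class above, with a deliberately undersized sketch so hash collisions actually occur (the item names and dimensions are arbitrary, chosen for illustration):

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Condensed copy of the class above, for a quick demo."""

    def __init__(self, width: int, depth: int):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))


# Deliberately tiny sketch (20 x 4) so collisions are visible
cms = CountMinSketch(width=20, depth=4)
exact = Counter()
stream = ["a"] * 50 + ["b"] * 30 + [f"noise-{i}" for i in range(200)]
for item in stream:
    cms.add(item)
    exact[item] += 1

for item in ["a", "b", "noise-0"]:
    est, true = cms.estimate(item), exact[item]
    print(f"{item}: true={true}, estimate={est}")
    assert est >= true  # one-sided error: never an undercount
```

Even with a sketch this small, every estimate is at least the true count; the gap between them is the collision-induced overestimation that a properly sized sketch keeps within εN.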
Mathematical Properties and Error Bounds
Count-Min Sketch provides probabilistic guarantees that make it predictable in production. The key properties:
One-sided error: The estimate is always greater than or equal to the true count. You never undercount. This property is valuable for rate limiting—you might occasionally block a legitimate user, but you’ll never let an attacker through.
Error bounds: For an item with true count c and total stream length N, the estimate ĉ satisfies:
c ≤ ĉ ≤ c + εN with probability at least 1 - δ
The parameter ε (epsilon) controls the error magnitude, while δ (delta) controls the failure probability. To achieve these bounds:
- Width = ⌈e/ε⌉ (where e ≈ 2.718)
- Depth = ⌈ln(1/δ)⌉
```python
def calculate_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """
    Calculate CMS dimensions for desired error bounds.

    Args:
        epsilon: Error factor (e.g., 0.01 for 1% of total count)
        delta: Failure probability (e.g., 0.01 for 99% confidence)

    Returns:
        Tuple of (width, depth)
    """
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    memory_bytes = width * depth * 4  # assuming 4-byte counters
    print(f"Dimensions: {width} x {depth}")
    print(f"Memory usage: {memory_bytes:,} bytes ({memory_bytes/1024:.1f} KB)")
    return width, depth


# Example: 1% error with 99% probability
calculate_dimensions(0.01, 0.01)
# Output:
# Dimensions: 272 x 5
# Memory usage: 5,440 bytes (5.3 KB)
```
With just 5KB, you can track frequency estimates for any number of items with 1% error bounds and 99% confidence.
Implementation Deep Dive
Production implementations need more than the basic structure. Here’s a hardened version:
```python
import array
import math

import mmh3  # MurmurHash3 - fast, high-quality hashing


class ProductionCountMinSketch:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        """
        Initialize CMS with error bounds.

        Args:
            epsilon: Additive error factor (estimate <= true + epsilon * N)
            delta: Probability of exceeding error bound
        """
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.total_count = 0
        # Use the array module for memory efficiency ('I' = 4-byte unsigned ints)
        self.tables = [
            array.array('I', [0] * self.width)
            for _ in range(self.depth)
        ]

    def _hash(self, item: str, seed: int) -> int:
        """MurmurHash3 provides excellent distribution and speed."""
        return mmh3.hash(item, seed, signed=False) % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Increment counters for item across all rows (single-writer; not thread-safe)."""
        if count < 0:
            raise ValueError("Count must be non-negative")
        self.total_count += count
        for i, table in enumerate(self.tables):
            idx = self._hash(item, i)
            table[idx] += count

    def estimate(self, item: str) -> int:
        """Return frequency estimate; guaranteed never below the true count."""
        return min(
            table[self._hash(item, i)]
            for i, table in enumerate(self.tables)
        )

    def merge(self, other: 'ProductionCountMinSketch') -> None:
        """Merge another CMS into this one (for distributed counting)."""
        if self.width != other.width or self.depth != other.depth:
            raise ValueError("Dimensions must match for merge")
        for i in range(self.depth):
            for j in range(self.width):
                self.tables[i][j] += other.tables[i][j]
        self.total_count += other.total_count

    def memory_usage(self) -> int:
        """Return approximate memory usage in bytes."""
        return self.width * self.depth * 4  # 4 bytes per counter
```
Key improvements in this version:

- MurmurHash3: Cryptographic hashes like MD5 are overkill. MurmurHash3 is faster and provides sufficient independence for CMS.
- Memory-efficient arrays: Python's `array` module uses contiguous memory with fixed-size integers, reducing overhead compared to lists.
- Merge support: Essential for distributed systems where you aggregate sketches from multiple nodes.
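The merge operation deserves a quick illustration. Here is a sketch of how two nodes can count independently and a coordinator can combine their sketches; this uses a condensed CMS with Python's built-in `hash()` for portability (the production class above would use MurmurHash3), and the node names and counts are made up for the example:

```python
class MiniCMS:
    """Condensed CMS for illustrating merge; hash() stands in for MurmurHash3."""

    def __init__(self, width: int = 64, depth: int = 4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.total_count = 0

    def _hash(self, item: str, seed: int) -> int:
        return hash((seed, item)) % self.width  # Python's % keeps this non-negative

    def add(self, item: str, count: int = 1) -> None:
        self.total_count += count
        for i in range(self.depth):
            self.tables[i][self._hash(item, i)] += count

    def estimate(self, item: str) -> int:
        return min(self.tables[i][self._hash(item, i)] for i in range(self.depth))

    def merge(self, other: "MiniCMS") -> None:
        """Element-wise addition of counters preserves the CMS guarantees."""
        assert (self.width, self.depth) == (other.width, other.depth)
        for i in range(self.depth):
            for j in range(self.width):
                self.tables[i][j] += other.tables[i][j]
        self.total_count += other.total_count


# Two "nodes" count independently, then a coordinator merges their sketches
node_a, node_b = MiniCMS(), MiniCMS()
for _ in range(40):
    node_a.add("/home")
for _ in range(25):
    node_b.add("/home")
node_b.add("/about", 10)

node_a.merge(node_b)
print(node_a.estimate("/home"))  # at least 65: counts from both nodes combined
```

Because merging is just element-wise addition, the merged sketch is identical to the one you would have built by feeding both streams into a single sketch, so the error bounds still hold against the combined total count.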
Practical Applications
The real power of Count-Min Sketch emerges in streaming scenarios. Here’s a practical URL tracker that maintains the top-K most frequent URLs:
```python
import heapq


class StreamingTopK:
    def __init__(self, k: int, epsilon: float = 0.001, delta: float = 0.01):
        self.k = k
        self.cms = ProductionCountMinSketch(epsilon, delta)
        self.candidates = {}  # item -> last known estimate
        self.min_heap = []    # (count, item) for top-k tracking

    def add(self, item: str) -> None:
        """Process a stream element."""
        self.cms.add(item)
        estimate = self.cms.estimate(item)

        # Update candidate tracking (heap entries may go stale, but
        # get_top_k re-estimates before returning results)
        if item in self.candidates:
            self.candidates[item] = estimate
        elif len(self.min_heap) < self.k:
            self.candidates[item] = estimate
            heapq.heappush(self.min_heap, (estimate, item))
        elif estimate > self.min_heap[0][0]:
            # New item might be in top-k
            _, evicted = heapq.heapreplace(self.min_heap, (estimate, item))
            del self.candidates[evicted]
            self.candidates[item] = estimate

    def get_top_k(self) -> list[tuple[str, int]]:
        """Return current top-k items with estimated counts."""
        # Re-estimate all candidates (counts may have increased)
        results = [
            (item, self.cms.estimate(item))
            for item in self.candidates
        ]
        results.sort(key=lambda x: -x[1])
        return results[:self.k]


# Usage
tracker = StreamingTopK(k=10)
for url in stream_of_urls():
    tracker.add(url)

# Get current top 10 URLs
for url, count in tracker.get_top_k():
    print(f"{url}: ~{count} hits")
```
For database integration, Redis Stack (via the RedisBloom module) provides native CMS support:

```python
import redis

r = redis.Redis()

# Initialize sketch (width=2000, depth=5)
r.execute_command('CMS.INITBYDIM', 'page_views', 2000, 5)

# Increment counts
r.execute_command('CMS.INCRBY', 'page_views', '/home', 1, '/about', 1, '/api/users', 5)

# Query estimates
counts = r.execute_command('CMS.QUERY', 'page_views', '/home', '/api/users')
```
Comparison with Alternatives
Count-Min Sketch occupies a specific niche among probabilistic data structures:
| Structure | Query Type | Error Direction | Memory |
|---|---|---|---|
| Count-Min Sketch | Frequency | Overestimates | O(1/ε × log(1/δ)) |
| Bloom Filter | Membership | False positives | O(n) |
| HyperLogLog | Cardinality | Both directions | O(log log n) |
Use Count-Min Sketch when: You need frequency counts for arbitrary items in a stream, and overestimation is acceptable.
Use Bloom filters when: You only need membership queries (“have I seen this before?”), not counts.
Use HyperLogLog when: You need cardinality (“how many unique items?”), not individual frequencies.
The Conservative Update variant improves CMS accuracy by only incrementing counters that equal the current minimum estimate:
```python
def conservative_add(self, item: str, count: int = 1) -> None:
    """Only increment counters at the current minimum value."""
    current_min = self.estimate(item)
    for i, table in enumerate(self.tables):
        idx = self._hash(item, i)
        if table[idx] == current_min:
            table[idx] += count
This reduces overestimation at the cost of slightly more computation per update.
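One way to see the effect is to run both update rules over the same stream and compare. The following is a condensed, self-contained sketch (built-in `hash()` for brevity, synthetic item names) showing that conservative updates never underestimate, yet never exceed the standard estimates:

```python
import random


class TinyCMS:
    """Condensed CMS supporting both standard and conservative updates."""

    def __init__(self, width: int = 16, depth: int = 4, conservative: bool = False):
        self.width, self.depth = width, depth
        self.conservative = conservative
        self.tables = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        return hash((seed, item)) % self.width

    def estimate(self, item: str) -> int:
        return min(self.tables[i][self._hash(item, i)] for i in range(self.depth))

    def add(self, item: str) -> None:
        if self.conservative:
            # Only bump counters sitting at the current minimum
            floor = self.estimate(item)
            for i in range(self.depth):
                idx = self._hash(item, i)
                if self.tables[i][idx] == floor:
                    self.tables[i][idx] += 1
        else:
            for i in range(self.depth):
                self.tables[i][self._hash(item, i)] += 1


random.seed(42)
standard = TinyCMS()
conservative = TinyCMS(conservative=True)
stream = [f"item-{random.randint(0, 100)}" for _ in range(2000)]
exact = {}
for item in stream:
    standard.add(item)
    conservative.add(item)
    exact[item] = exact.get(item, 0) + 1

for item in exact:
    # Both overestimate, but conservative estimates are never larger than standard ones
    assert exact[item] <= conservative.estimate(item) <= standard.estimate(item)
```

The invariant holds because the conservative rule only ever increments a subset of the counters the standard rule would touch, so its counter arrays stay element-wise below the standard ones while the minimum still rises by one on every unit insert.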
Conclusion and Best Practices
Count-Min Sketch is the right choice when you need frequency estimates for high-cardinality streams and can tolerate bounded overestimation. It’s wrong for exact counting or when underestimation would be catastrophic.
Parameter tuning guidelines:
- Start with ε = 0.001 (0.1% error) and δ = 0.01 (99% confidence)
- For rate limiting, use smaller ε to reduce false positives
- For trending detection, larger ε is acceptable since you care about relative rankings
Production checklist:
- Use fast, non-cryptographic hash functions (MurmurHash3, xxHash)
- Implement merge operations for distributed aggregation
- Consider conservative updates if accuracy matters more than throughput
- Monitor total count to understand error magnitude (error ≤ ε × total)
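The last checklist point is cheap to operationalize. A minimal sketch of the idea, assuming you track ε alongside the sketch's running total (the example numbers are illustrative):

```python
def worst_case_error(epsilon: float, total_count: int) -> float:
    """Upper bound on additive overestimation for any single item's estimate."""
    return epsilon * total_count


# With epsilon = 0.001 after 5 million stream elements, any estimate
# is at most about 5,000 above the true count (with probability >= 1 - delta)
print(worst_case_error(0.001, 5_000_000))
```

Exporting this bound as a metric lets you alert when the stream has grown enough that the absolute error is no longer acceptable for your use case.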
The beauty of Count-Min Sketch is its simplicity. A few kilobytes of memory, a handful of hash functions, and you can track frequency estimates across billions of items. That’s a trade-off worth making.