Count-Min Sketch: Frequency Estimation
Key Insights
- Count-Min Sketch provides frequency estimates using O(1/ε × log(1/δ)) space, enabling you to track millions of items in kilobytes of memory with configurable accuracy guarantees.
- The “minimum” operation across multiple hash functions reduces collision-induced overestimation, but CMS always overestimates—never underestimates—making it suitable for applications where false positives are acceptable.
- Conservative updates can reduce overestimation by 2-10x in practice, and combining CMS with a small heap enables efficient heavy hitter detection in streaming data.
The Frequency Counting Problem
Counting how often items appear sounds trivial until you’re processing billions of events per day. A naive HashMap approach works fine for thousands of unique items, but what happens when you’re tracking IP addresses across a global CDN, monitoring search queries, or analyzing database query patterns?
The math is unforgiving. Storing 100 million unique strings with their counts easily consumes gigabytes of memory. When you’re running on constrained edge nodes or need to maintain counts across thousands of dimensions simultaneously, exact counting becomes impossible.
Count-Min Sketch solves this by trading perfect accuracy for bounded memory. You specify how wrong you’re willing to be, and the data structure guarantees it won’t exceed that error bound (with high probability). For many applications—trending topic detection, anomaly identification, query optimization—approximate counts are perfectly acceptable.
How Count-Min Sketch Works
The structure is elegantly simple: a 2D array of counters with dimensions width × depth, paired with depth independent hash functions.
When you increment an item’s count, you hash it with each function to get a column index for each row, then increment all those counters. When querying, you hash again and return the minimum value across all rows.
import mmh3


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        # Each row uses a different seed, giving depth independent hashes
        return mmh3.hash(item, seed) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            col = self._hash(item, row)
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        return min(
            self.table[row][self._hash(item, row)]
            for row in range(self.depth)
        )
Why take the minimum? Every counter in the sketch can only be equal to or greater than the true count—collisions only add noise, never subtract. By taking the minimum across independent hash functions, you’re selecting the counter least affected by collisions. With enough rows (hash functions), at least one counter is likely to be close to the true value.
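The one-sided error is easy to verify with a deliberately undersized, self-contained toy version (hashlib stands in for mmh3 here so the snippet runs with the standard library alone; the structure is the same):

```python
import hashlib

WIDTH, DEPTH = 50, 4   # tiny on purpose, to force collisions
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(item: str, seed: int) -> int:
    # Derive independent hash functions by salting blake2b with the seed
    digest = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def add(item: str, count: int = 1) -> None:
    for row in range(DEPTH):
        table[row][_hash(item, row)] += count

def estimate(item: str) -> int:
    return min(table[row][_hash(item, row)] for row in range(DEPTH))

# 100 distinct items, each appearing 10 times, crammed into 50 columns:
for i in range(1000):
    add(f"item_{i % 100}")

# Collisions inflate some counters, but no estimate ever drops below 10
assert all(estimate(f"item_{i}") >= 10 for i in range(100))
```

With only 50 columns for 100 items, every row has collisions, yet the minimum never falls below the true count of 10.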
Mathematical Properties and Error Bounds
Count-Min Sketch provides precise probabilistic guarantees. Given error parameter ε and probability parameter δ:
- Width: w = ⌈e/ε⌉ (where e ≈ 2.718)
- Depth: d = ⌈ln(1/δ)⌉
The guarantee: for any item with true count c, the estimate ĉ satisfies:
c ≤ ĉ ≤ c + ε × N
with probability at least 1 - δ, where N is the total count of all items.
This means if you want estimates within 0.1% of total traffic (ε = 0.001) with 99.9% confidence (δ = 0.001), you need:
- Width: ⌈2.718/0.001⌉ = 2,719 counters per row
- Depth: ⌈ln(1000)⌉ = 7 rows
That’s roughly 19,000 counters. With 4-byte integers, you’re using 76KB to track frequency estimates for an arbitrarily large universe of items. Compare that to a HashMap storing millions of entries.
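The sizing arithmetic above can be packaged into a small helper (a sketch, assuming 4-byte counters as in the worked example):

```python
from math import ceil, e, log

def cms_dimensions(epsilon: float, delta: float) -> tuple:
    """Return (width, depth, table size in bytes) for 4-byte counters."""
    width = ceil(e / epsilon)       # columns per row
    depth = ceil(log(1 / delta))    # number of rows / hash functions
    return width, depth, width * depth * 4

w, d, size = cms_dimensions(epsilon=0.001, delta=0.001)
# w = 2719, d = 7, size = 76132 bytes (~76 KB)
```

Halving ε doubles the width (and memory), while tightening δ only grows the depth logarithmically, so the error parameter dominates the memory budget.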
The one-sided error property is crucial for certain applications. If you’re detecting heavy hitters (items exceeding a threshold), overestimation means false positives but never false negatives—you’ll never miss a truly frequent item.
Implementation Deep Dive
Production implementations require attention to hash function selection and memory layout.
import mmh3
import numpy as np
from math import ceil, log, e


class ProductionCMS:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.001):
        self.width = ceil(e / epsilon)
        self.depth = ceil(log(1 / delta))
        # Use numpy for a contiguous, cache-friendly memory layout
        self.table = np.zeros((self.depth, self.width), dtype=np.int32)
        self.total_count = 0

    def add(self, item: str, count: int = 1) -> None:
        self.total_count += count
        item_bytes = item.encode('utf-8')
        for row in range(self.depth):
            # Use a different seed per row for independent hashes
            col = mmh3.hash(item_bytes, row, signed=False) % self.width
            self.table[row, col] += count

    def estimate(self, item: str) -> int:
        item_bytes = item.encode('utf-8')
        estimates = [
            self.table[row, mmh3.hash(item_bytes, row, signed=False) % self.width]
            for row in range(self.depth)
        ]
        return min(estimates)

    def memory_bytes(self) -> int:
        return self.table.nbytes
Using NumPy provides contiguous memory allocation, improving cache performance significantly. For extremely high-throughput systems, consider SIMD-optimized implementations or storing the table in a memory-mapped file for persistence.
Here’s a comparison showing the memory advantage:
import sys


def compare_memory(n_items: int):
    # Simulate exact counting with a dict (keys are ~20 characters each)
    hashmap = {f"item_{i:015d}": i % 1000 for i in range(n_items)}
    hashmap_size = sys.getsizeof(hashmap) + sum(
        sys.getsizeof(k) + sys.getsizeof(v)
        for k, v in hashmap.items()
    )
    # CMS with 0.1% error, 99.9% confidence
    cms = ProductionCMS(epsilon=0.001, delta=0.001)
    cms_size = cms.memory_bytes()
    return hashmap_size, cms_size

# Results for 1M items:
# HashMap: ~85 MB
# CMS: ~76 KB (1000x smaller)
Practical Applications
The killer application for Count-Min Sketch is heavy hitter detection in streaming data. Combined with a small heap, you can track the top-k most frequent items in a stream using minimal memory:
import heapq
from typing import List, Tuple


class HeavyHitterTracker:
    def __init__(self, k: int, epsilon: float = 0.0001):
        self.k = k
        self.cms = ProductionCMS(epsilon=epsilon, delta=0.001)
        self.candidates = {}  # item -> last known estimate
        self.threshold = 0

    def add(self, item: str) -> None:
        self.cms.add(item)
        estimate = self.cms.estimate(item)
        # Track the item if its estimate clears the current threshold
        if estimate >= self.threshold:
            self.candidates[item] = estimate
        # Prune if the candidate set grows too large
        if len(self.candidates) > 2 * self.k:
            self._prune()

    def _prune(self) -> None:
        # Keep the top 2k candidates and raise the threshold
        top = heapq.nlargest(
            2 * self.k, self.candidates.items(), key=lambda x: x[1]
        )
        self.candidates = dict(top)
        if len(top) >= self.k:
            self.threshold = top[self.k - 1][1]

    def get_heavy_hitters(self) -> List[Tuple[str, int]]:
        # Re-estimate and return the top k
        results = [(item, self.cms.estimate(item)) for item in self.candidates]
        return heapq.nlargest(self.k, results, key=lambda x: x[1])
Database systems use CMS extensively for query optimization. PostgreSQL’s extended statistics and various distributed databases use sketch-based frequency estimation to predict join cardinalities and optimize query plans.
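A property that makes CMS practical in distributed settings, though not shown above, is mergeability: two sketches built with identical dimensions and hash functions combine by element-wise addition, and the result behaves as if it had seen both streams. A minimal self-contained sketch of the idea (hashlib in place of mmh3):

```python
import hashlib

class MiniCMS:
    def __init__(self, width: int = 100, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
        return int.from_bytes(d[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[r][self._hash(item, r)] for r in range(self.depth))

    def merge(self, other: "MiniCMS") -> None:
        # Valid only when dimensions and hash seeds match exactly
        assert (self.width, self.depth) == (other.width, other.depth)
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

# Two nodes sketch disjoint halves of a stream, then merge:
a, b = MiniCMS(), MiniCMS()
for _ in range(30):
    a.add("x")
for _ in range(70):
    b.add("x")
a.merge(b)
assert a.estimate("x") == 100
```

This is what lets each shard or edge node keep a local sketch and ship only kilobytes to an aggregator.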
Variants and Optimizations
Conservative update is a simple modification that significantly reduces overestimation:
def add_conservative(self, item: str, count: int = 1) -> None:
    """Only raise counters to the current minimum plus count."""
    item_bytes = item.encode('utf-8')
    indices = [
        (row, mmh3.hash(item_bytes, row, signed=False) % self.width)
        for row in range(self.depth)
    ]
    current_min = min(self.table[row, col] for row, col in indices)
    target = current_min + count
    for row, col in indices:
        self.table[row, col] = max(self.table[row, col], target)
Instead of blindly incrementing all counters, conservative update only raises counters to match the minimum plus the increment. This prevents runaway overestimation when popular items collide with rare ones.
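The effect is easy to demonstrate by running plain and conservative updates side by side on the same skewed stream (a self-contained toy comparison, hashlib in place of mmh3; a conservative counter is never larger than its plain counterpart, and the guarantee of never underestimating still holds):

```python
import hashlib

WIDTH, DEPTH = 20, 4   # deliberately tiny to force collisions

def _hash(item: str, seed: int) -> int:
    d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(d[:8], "big") % WIDTH

plain = [[0] * WIDTH for _ in range(DEPTH)]
conservative = [[0] * WIDTH for _ in range(DEPTH)]

def add_plain(item: str) -> None:
    for r in range(DEPTH):
        plain[r][_hash(item, r)] += 1

def add_cons(item: str) -> None:
    idx = [(r, _hash(item, r)) for r in range(DEPTH)]
    target = min(conservative[r][c] for r, c in idx) + 1
    for r, c in idx:
        conservative[r][c] = max(conservative[r][c], target)

def est(table, item: str) -> int:
    return min(table[r][_hash(item, r)] for r in range(DEPTH))

# Skewed stream: one hot item plus many rare ones colliding with it
stream = ["hot"] * 500 + [f"rare_{i}" for i in range(200)]
for item in stream:
    add_plain(item)
    add_cons(item)

assert est(conservative, "hot") <= est(plain, "hot")  # never worse
assert est(conservative, "hot") >= 500                # still one-sided
```

The win is largest for the rare items: under plain updates a rare key sharing a column with "hot" inherits hundreds of spurious counts, while conservative updates leave its counters near the true value.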
For time-series analysis, decaying sketches multiply all counters by a decay factor periodically, giving recent events more weight. Sliding window variants maintain multiple sketches for different time windows.
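The decay step itself is trivial, something like the following applied on a timer (a sketch; truncating to int keeps counters integral and lets stale entries eventually reach zero):

```python
# Exponential decay: periodically scale every counter down so old
# events contribute less to estimates than recent ones.
DECAY = 0.5

def decay(table) -> None:
    for row in table:
        for c in range(len(row)):
            row[c] = int(row[c] * DECAY)

table = [[8, 4, 0], [2, 6, 1]]
decay(table)
assert table == [[4, 2, 0], [1, 3, 0]]
```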
When to Use (and When Not To)
Count-Min Sketch solves frequency estimation. It’s not a replacement for:
- HyperLogLog: Counts distinct elements, not frequencies
- Bloom filters: Tests set membership, not frequency
- t-digest: Estimates quantiles, not frequencies
Use CMS when you need approximate frequency counts for high-cardinality data and can tolerate overestimation. Avoid it when:
- You need exact counts (use a database)
- You need to delete items (standard CMS doesn’t support decrements)
- Underestimation would be catastrophic (CMS only overestimates)
- Your cardinality is low enough for exact counting
For parameter selection in production, start with ε = 0.001 (0.1% error) and δ = 0.001 (99.9% confidence). Monitor actual error rates against ground truth samples and adjust. Most applications can tolerate much higher error rates than engineers initially assume—test with your actual use case before over-engineering precision.
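Monitoring against ground truth can be as simple as replaying a sample alongside an exact Counter (a self-contained toy harness, hashlib in place of mmh3; in production you would sample live traffic instead of a synthetic stream):

```python
import hashlib
import random
from collections import Counter

WIDTH, DEPTH = 272, 7   # roughly epsilon = 0.01, delta = 0.001
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(item: str, seed: int) -> int:
    d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(d[:8], "big") % WIDTH

def add(item: str) -> None:
    for r in range(DEPTH):
        table[r][_hash(item, r)] += 1

def estimate(item: str) -> int:
    return min(table[r][_hash(item, r)] for r in range(DEPTH))

# Replay a sample stream into both the sketch and an exact counter
truth = Counter()
random.seed(42)
for _ in range(10_000):
    item = f"item_{random.randint(0, 999)}"
    truth[item] += 1
    add(item)

# The one-sided guarantee must hold for every item; the mean
# overestimation is the practical figure to compare against eps * N.
assert all(estimate(k) >= truth[k] for k in truth)
mean_error = sum(estimate(k) - truth[k] for k in truth) / len(truth)
```

If the measured mean error sits far below ε × N, the sketch is over-provisioned and ε can be relaxed to save memory.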