Count-Min Sketch: Frequency Estimation
Key Insights
- Count-Min Sketch provides frequency estimates using O(1/ε × log(1/δ)) space, enabling you to track millions of items in kilobytes of memory with configurable accuracy guarantees.
- The “minimum” operation across multiple hash functions reduces collision-induced overestimation, but CMS always overestimates—never underestimates—making it suitable for applications where false positives are acceptable.
- Conservative updates can reduce overestimation by 2-10x in practice, and combining CMS with a small heap enables efficient heavy hitter detection in streaming data.
The Frequency Counting Problem
Counting how often items appear sounds trivial until you’re processing billions of events per day. A naive HashMap approach works fine for thousands of unique items, but what happens when you’re tracking IP addresses across a global CDN, monitoring search queries, or analyzing database query patterns?
The math is unforgiving. Storing 100 million unique strings with their counts easily consumes gigabytes of memory. When you’re running on constrained edge nodes or need to maintain counts across thousands of dimensions simultaneously, exact counting becomes impossible.
Count-Min Sketch solves this by trading perfect accuracy for bounded memory. You specify how wrong you’re willing to be, and the data structure guarantees it won’t exceed that error bound (with high probability). For many applications—trending topic detection, anomaly identification, query optimization—approximate counts are perfectly acceptable.
How Count-Min Sketch Works
The structure is elegantly simple: a 2D array of counters with dimensions width × depth, paired with depth independent hash functions.
When you increment an item’s count, you hash it with each function to get a column index for each row, then increment all those counters. When querying, you hash again and return the minimum value across all rows.
import mmh3


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        # Each row uses a different seed, giving depth independent hashes
        return mmh3.hash(item, seed) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            col = self._hash(item, row)
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        return min(
            self.table[row][self._hash(item, row)]
            for row in range(self.depth)
        )
Why take the minimum? Every counter in the sketch can only be equal to or greater than the true count—collisions only add noise, never subtract. By taking the minimum across independent hash functions, you’re selecting the counter least affected by collisions. With enough rows (hash functions), at least one counter is likely to be close to the true value.
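The one-sided error is easy to verify with a deliberately undersized, self-contained toy version (hashlib stands in for mmh3 here so the snippet runs with the standard library alone; the structure is the same):

```python
import hashlib

WIDTH, DEPTH = 50, 4   # tiny on purpose, to force collisions
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(item: str, seed: int) -> int:
    # Derive independent hash functions by salting blake2b with the seed
    digest = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def add(item: str, count: int = 1) -> None:
    for row in range(DEPTH):
        table[row][_hash(item, row)] += count

def estimate(item: str) -> int:
    return min(table[row][_hash(item, row)] for row in range(DEPTH))

# 100 distinct items, each appearing 10 times, crammed into 50 columns:
for i in range(1000):
    add(f"item_{i % 100}")

# Collisions inflate some counters, but no estimate ever drops below 10
assert all(estimate(f"item_{i}") >= 10 for i in range(100))
```

With only 50 columns for 100 items, every row has collisions, yet the minimum never falls below the true count of 10.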
Mathematical Properties and Error Bounds
Count-Min Sketch provides precise probabilistic guarantees. Given error parameter ε and probability parameter δ:
- Width: w = ⌈e/ε⌉ (where e ≈ 2.718)
- Depth: d = ⌈ln(1/δ)⌉
The guarantee: for any item with true count c, the estimate ĉ satisfies:
c ≤ ĉ ≤ c + ε × N
with probability at least 1 - δ, where N is the total count of all items.
This means if you want estimates within 0.1% of total traffic (ε = 0.001) with 99.9% confidence (δ = 0.001), you need:
- Width: ⌈2.718/0.001⌉ = 2,719 counters per row
- Depth: ⌈ln(1000)⌉ = 7 rows
That’s roughly 19,000 counters. With 4-byte integers, you’re using 76KB to track frequency estimates for an arbitrarily large universe of items. Compare that to a HashMap storing millions of entries.
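The sizing arithmetic above can be packaged into a small helper (a sketch, assuming 4-byte counters as in the worked example):

```python
from math import ceil, e, log

def cms_dimensions(epsilon: float, delta: float) -> tuple:
    """Return (width, depth, table size in bytes) for 4-byte counters."""
    width = ceil(e / epsilon)       # columns per row
    depth = ceil(log(1 / delta))    # number of rows / hash functions
    return width, depth, width * depth * 4

w, d, size = cms_dimensions(epsilon=0.001, delta=0.001)
# w = 2719, d = 7, size = 76132 bytes (~76 KB)
```

Halving ε doubles the width (and memory), while tightening δ only grows the depth logarithmically, so the error parameter dominates the memory budget.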
The one-sided error property is crucial for certain applications. If you’re detecting heavy hitters (items exceeding a threshold), overestimation means false positives but never false negatives—you’ll never miss a truly frequent item.
Implementation Deep Dive
Production implementations require attention to hash function selection and memory layout.
import mmh3
import numpy as np
from math import ceil, log, e


class ProductionCMS:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.001):
        self.width = ceil(e / epsilon)
        self.depth = ceil(log(1 / delta))
        # Use numpy for a contiguous, cache-friendly memory layout
        self.table = np.zeros((self.depth, self.width), dtype=np.int32)
        self.total_count = 0

    def add(self, item: str, count: int = 1) -> None:
        self.total_count += count
        item_bytes = item.encode('utf-8')
        for row in range(self.depth):
            # Use a different seed per row for independent hashes
            col = mmh3.hash(item_bytes, row, signed=False) % self.width
            self.table[row, col] += count

    def estimate(self, item: str) -> int:
        item_bytes = item.encode('utf-8')
        estimates = [
            self.table[row, mmh3.hash(item_bytes, row, signed=False) % self.width]
            for row in range(self.depth)
        ]
        return min(estimates)

    def memory_bytes(self) -> int:
        return self.table.nbytes
Using NumPy provides contiguous memory allocation, improving cache performance significantly. For extremely high-throughput systems, consider SIMD-optimized implementations or storing the table in a memory-mapped file for persistence.
Here’s a comparison showing the memory advantage:
import sys


def compare_memory(n_items: int):
    # Simulate exact counting with a dict (keys are ~20 characters each)
    hashmap = {f"item_{i:015d}": i % 1000 for i in range(n_items)}
    hashmap_size = sys.getsizeof(hashmap) + sum(
        sys.getsizeof(k) + sys.getsizeof(v)
        for k, v in hashmap.items()
    )
    # CMS with 0.1% error, 99.9% confidence
    cms = ProductionCMS(epsilon=0.001, delta=0.001)
    cms_size = cms.memory_bytes()
    return hashmap_size, cms_size

# Results for 1M items:
# HashMap: ~85 MB
# CMS: ~76 KB (1000x smaller)
Practical Applications
The killer application for Count-Min Sketch is heavy hitter detection in streaming data. Combined with a small heap, you can track the top-k most frequent items in a stream using minimal memory:
import heapq
from typing import List, Tuple


class HeavyHitterTracker:
    def __init__(self, k: int, epsilon: float = 0.0001):
        self.k = k
        self.cms = ProductionCMS(epsilon=epsilon, delta=0.001)
        self.candidates = {}  # item -> last known estimate
        self.threshold = 0

    def add(self, item: str) -> None:
        self.cms.add(item)
        estimate = self.cms.estimate(item)
        # Track the item if its estimate clears the current threshold
        if estimate >= self.threshold:
            self.candidates[item] = estimate
        # Prune if the candidate set grows too large
        if len(self.candidates) > 2 * self.k:
            self._prune()

    def _prune(self) -> None:
        # Keep the top 2k candidates and raise the threshold
        top = heapq.nlargest(
            2 * self.k, self.candidates.items(), key=lambda x: x[1]
        )
        self.candidates = dict(top)
        if len(top) >= self.k:
            self.threshold = top[self.k - 1][1]

    def get_heavy_hitters(self) -> List[Tuple[str, int]]:
        # Re-estimate and return the top k
        results = [(item, self.cms.estimate(item)) for item in self.candidates]
        return heapq.nlargest(self.k, results, key=lambda x: x[1])
Database systems use CMS extensively for query optimization. PostgreSQL’s extended statistics and various distributed databases use sketch-based frequency estimation to predict join cardinalities and optimize query plans.
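A property that makes CMS practical in distributed settings, though not shown above, is mergeability: two sketches built with identical dimensions and hash functions combine by element-wise addition, and the result behaves as if it had seen both streams. A minimal self-contained sketch of the idea (hashlib in place of mmh3):

```python
import hashlib

class MiniCMS:
    def __init__(self, width: int = 100, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
        return int.from_bytes(d[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[r][self._hash(item, r)] for r in range(self.depth))

    def merge(self, other: "MiniCMS") -> None:
        # Valid only when dimensions and hash seeds match exactly
        assert (self.width, self.depth) == (other.width, other.depth)
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

# Two nodes sketch disjoint halves of a stream, then merge:
a, b = MiniCMS(), MiniCMS()
for _ in range(30):
    a.add("x")
for _ in range(70):
    b.add("x")
a.merge(b)
assert a.estimate("x") == 100
```

This is what lets each shard or edge node keep a local sketch and ship only kilobytes to an aggregator.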
Variants and Optimizations
Conservative update is a simple modification that significantly reduces overestimation:
def add_conservative(self, item: str, count: int = 1) -> None:
    """Only raise counters to the current minimum plus count."""
    item_bytes = item.encode('utf-8')
    indices = [
        (row, mmh3.hash(item_bytes, row, signed=False) % self.width)
        for row in range(self.depth)
    ]
    current_min = min(self.table[row, col] for row, col in indices)
    target = current_min + count
    for row, col in indices:
        self.table[row, col] = max(self.table[row, col], target)
Instead of blindly incrementing all counters, conservative update only raises counters to match the minimum plus the increment. This prevents runaway overestimation when popular items collide with rare ones.
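The effect is easy to demonstrate by running plain and conservative updates side by side on the same skewed stream (a self-contained toy comparison, hashlib in place of mmh3; a conservative counter is never larger than its plain counterpart, and the guarantee of never underestimating still holds):

```python
import hashlib

WIDTH, DEPTH = 20, 4   # deliberately tiny to force collisions

def _hash(item: str, seed: int) -> int:
    d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(d[:8], "big") % WIDTH

plain = [[0] * WIDTH for _ in range(DEPTH)]
conservative = [[0] * WIDTH for _ in range(DEPTH)]

def add_plain(item: str) -> None:
    for r in range(DEPTH):
        plain[r][_hash(item, r)] += 1

def add_cons(item: str) -> None:
    idx = [(r, _hash(item, r)) for r in range(DEPTH)]
    target = min(conservative[r][c] for r, c in idx) + 1
    for r, c in idx:
        conservative[r][c] = max(conservative[r][c], target)

def est(table, item: str) -> int:
    return min(table[r][_hash(item, r)] for r in range(DEPTH))

# Skewed stream: one hot item plus many rare ones colliding with it
stream = ["hot"] * 500 + [f"rare_{i}" for i in range(200)]
for item in stream:
    add_plain(item)
    add_cons(item)

assert est(conservative, "hot") <= est(plain, "hot")  # never worse
assert est(conservative, "hot") >= 500                # still one-sided
```

The win is largest for the rare items: under plain updates a rare key sharing a column with "hot" inherits hundreds of spurious counts, while conservative updates leave its counters near the true value.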
For time-series analysis, decaying sketches multiply all counters by a decay factor periodically, giving recent events more weight. Sliding window variants maintain multiple sketches for different time windows.
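The decay step itself is trivial, something like the following applied on a timer (a sketch; truncating to int keeps counters integral and lets stale entries eventually reach zero):

```python
# Exponential decay: periodically scale every counter down so old
# events contribute less to estimates than recent ones.
DECAY = 0.5

def decay(table) -> None:
    for row in table:
        for c in range(len(row)):
            row[c] = int(row[c] * DECAY)

table = [[8, 4, 0], [2, 6, 1]]
decay(table)
assert table == [[4, 2, 0], [1, 3, 0]]
```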
When to Use (and When Not To)
Count-Min Sketch solves frequency estimation. It’s not a replacement for:
- HyperLogLog: Counts distinct elements, not frequencies
- Bloom filters: Tests set membership, not frequency
- t-digest: Estimates quantiles, not frequencies
Use CMS when you need approximate frequency counts for high-cardinality data and can tolerate overestimation. Avoid it when:
- You need exact counts (use a database)
- You need to delete items (standard CMS doesn’t support decrements)
- Underestimation would be catastrophic (CMS only overestimates)
- Your cardinality is low enough for exact counting
For parameter selection in production, start with ε = 0.001 (0.1% error) and δ = 0.001 (99.9% confidence). Monitor actual error rates against ground truth samples and adjust. Most applications can tolerate much higher error rates than engineers initially assume—test with your actual use case before over-engineering precision.
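Monitoring against ground truth can be as simple as replaying a sample alongside an exact Counter (a self-contained toy harness, hashlib in place of mmh3; in production you would sample live traffic instead of a synthetic stream):

```python
import hashlib
import random
from collections import Counter

WIDTH, DEPTH = 272, 7   # roughly epsilon = 0.01, delta = 0.001
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(item: str, seed: int) -> int:
    d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(d[:8], "big") % WIDTH

def add(item: str) -> None:
    for r in range(DEPTH):
        table[r][_hash(item, r)] += 1

def estimate(item: str) -> int:
    return min(table[r][_hash(item, r)] for r in range(DEPTH))

# Replay a sample stream into both the sketch and an exact counter
truth = Counter()
random.seed(42)
for _ in range(10_000):
    item = f"item_{random.randint(0, 999)}"
    truth[item] += 1
    add(item)

# The one-sided guarantee must hold for every item; the mean
# overestimation is the practical figure to compare against eps * N.
assert all(estimate(k) >= truth[k] for k in truth)
mean_error = sum(estimate(k) - truth[k] for k in truth) / len(truth)
```

If the measured mean error sits far below ε × N, the sketch is over-provisioned and ε can be relaxed to save memory.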