Count-Min Sketch: Approximate Frequency Counting
Key Insights
- Count-Min Sketch trades exact accuracy for massive memory savings, using O(1/ε × log(1/δ)) space instead of O(n) to track frequency estimates across billions of items
- The data structure only overestimates frequencies, never underestimates—this one-sided error property makes it ideal for applications like rate limiting and heavy hitter detection where false positives are acceptable
- With just a few kilobytes of memory, you can track frequency estimates for millions of unique items with 1% error bounds at 99% confidence, making CMS essential for real-time stream processing
The Heavy Hitters Problem
Every system at scale eventually hits the same wall: you need to count things, but there are too many things to count exactly.
Consider tracking which URLs receive the most traffic across your CDN. You’re processing 100,000 requests per second, each with a unique-ish URL. Storing exact counts in a hash map means keeping millions of entries in memory. When traffic spikes, your counting infrastructure becomes the bottleneck.
This problem appears everywhere:
- Network monitoring: Identifying IP addresses generating anomalous traffic
- Trending topics: Finding frequently mentioned hashtags in real-time
- Database optimization: Tracking query patterns to inform index creation
- Fraud detection: Counting failed login attempts per account
The naive approach—storing every item with its count—requires O(n) memory where n is the number of unique items. At web scale, n can be billions. You need a different approach.
Count-Min Sketch offers a compelling trade-off: accept small, bounded errors in exchange for constant memory usage regardless of input size.
How Count-Min Sketch Works
A Count-Min Sketch is a 2D array of counters with dimensions width × depth. Each row uses a different hash function to map items to column positions.
When you increment an item’s count, you hash it with each row’s hash function and increment the corresponding counter in each row. When querying, you hash the item again and return the minimum value across all rows.
The minimum-value strategy is crucial. Hash collisions cause counters to be shared between items, inflating counts. By taking the minimum across multiple independent hash functions, you reduce the impact of collisions.
```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        """Generate hash for item using seed to create independent hash functions."""
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Increment counters for item across all rows."""
        for row in range(self.depth):
            col = self._hash(item, row)
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        """Return minimum count across all rows (best estimate)."""
        return min(
            self.table[row][self._hash(item, row)]
            for row in range(self.depth)
        )
```
This basic implementation demonstrates the core mechanics. You add items by incrementing counters across all rows, and estimate by taking the minimum.
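To see the one-sided error in action, here is a quick demo using a condensed copy of the class above, with a deliberately undersized sketch so hash collisions actually occur (the item names and dimensions are arbitrary, chosen for illustration):

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Condensed copy of the class above, for a quick demo."""

    def __init__(self, width: int, depth: int):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))


# Deliberately tiny sketch (20 x 4) so collisions are visible
cms = CountMinSketch(width=20, depth=4)
exact = Counter()
stream = ["a"] * 50 + ["b"] * 30 + [f"noise-{i}" for i in range(200)]
for item in stream:
    cms.add(item)
    exact[item] += 1

for item in ["a", "b", "noise-0"]:
    est, true = cms.estimate(item), exact[item]
    print(f"{item}: true={true}, estimate={est}")
    assert est >= true  # one-sided error: never an undercount
```

Even with a sketch this small, every estimate is at least the true count; the gap between them is the collision-induced overestimation that a properly sized sketch keeps within εN.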
Mathematical Properties and Error Bounds
Count-Min Sketch provides probabilistic guarantees that make it predictable in production. The key properties:
One-sided error: The estimate is always greater than or equal to the true count. You never undercount. This property is valuable for rate limiting—you might occasionally block a legitimate user, but you’ll never let an attacker through.
Error bounds: For an item with true count c and total stream length N, the estimate ĉ satisfies:
c ≤ ĉ ≤ c + εN with probability at least 1 - δ
The parameter ε (epsilon) controls the error magnitude, while δ (delta) controls the failure probability. To achieve these bounds:
- Width = ⌈e/ε⌉ (where e ≈ 2.718)
- Depth = ⌈ln(1/δ)⌉
```python
def calculate_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """
    Calculate CMS dimensions for desired error bounds.

    Args:
        epsilon: Error factor (e.g., 0.01 for 1% of total count)
        delta: Failure probability (e.g., 0.01 for 99% confidence)

    Returns:
        Tuple of (width, depth)
    """
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    memory_bytes = width * depth * 4  # assuming 4-byte counters
    print(f"Dimensions: {width} x {depth}")
    print(f"Memory usage: {memory_bytes:,} bytes ({memory_bytes/1024:.1f} KB)")
    return width, depth


# Example: 1% error with 99% probability
calculate_dimensions(0.01, 0.01)
# Output:
# Dimensions: 272 x 5
# Memory usage: 5,440 bytes (5.3 KB)
```
With just 5KB, you can track frequency estimates for any number of items with 1% error bounds and 99% confidence.
Implementation Deep Dive
Production implementations need more than the basic structure. Here’s a hardened version:
```python
import array
import math

import mmh3  # MurmurHash3 - fast, high-quality hashing


class ProductionCountMinSketch:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        """
        Initialize CMS with error bounds.

        Args:
            epsilon: Additive error factor (estimate <= true + epsilon * N)
            delta: Probability of exceeding error bound
        """
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.total_count = 0
        # Use the array module for memory efficiency ('I' = 4-byte unsigned ints)
        self.tables = [
            array.array('I', [0] * self.width)
            for _ in range(self.depth)
        ]

    def _hash(self, item: str, seed: int) -> int:
        """MurmurHash3 provides excellent distribution and speed."""
        return mmh3.hash(item, seed, signed=False) % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Increment counters for item across all rows (single-writer; not thread-safe)."""
        if count < 0:
            raise ValueError("Count must be non-negative")
        self.total_count += count
        for i, table in enumerate(self.tables):
            idx = self._hash(item, i)
            table[idx] += count

    def estimate(self, item: str) -> int:
        """Return frequency estimate; guaranteed never below the true count."""
        return min(
            table[self._hash(item, i)]
            for i, table in enumerate(self.tables)
        )

    def merge(self, other: 'ProductionCountMinSketch') -> None:
        """Merge another CMS into this one (for distributed counting)."""
        if self.width != other.width or self.depth != other.depth:
            raise ValueError("Dimensions must match for merge")
        for i in range(self.depth):
            for j in range(self.width):
                self.tables[i][j] += other.tables[i][j]
        self.total_count += other.total_count

    def memory_usage(self) -> int:
        """Return approximate memory usage in bytes."""
        return self.width * self.depth * 4  # 4 bytes per counter
```
Key improvements in this version:

- MurmurHash3: Cryptographic hashes like MD5 are overkill. MurmurHash3 is faster and provides sufficient independence for CMS.
- Memory-efficient arrays: Python's `array` module uses contiguous memory with fixed-size integers, reducing overhead compared to lists.
- Merge support: Essential for distributed systems where you aggregate sketches from multiple nodes.
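The merge operation deserves a quick illustration. Here is a sketch of how two nodes can count independently and a coordinator can combine their sketches; this uses a condensed CMS with Python's built-in `hash()` for portability (the production class above would use MurmurHash3), and the node names and counts are made up for the example:

```python
class MiniCMS:
    """Condensed CMS for illustrating merge; hash() stands in for MurmurHash3."""

    def __init__(self, width: int = 64, depth: int = 4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.total_count = 0

    def _hash(self, item: str, seed: int) -> int:
        return hash((seed, item)) % self.width  # Python's % keeps this non-negative

    def add(self, item: str, count: int = 1) -> None:
        self.total_count += count
        for i in range(self.depth):
            self.tables[i][self._hash(item, i)] += count

    def estimate(self, item: str) -> int:
        return min(self.tables[i][self._hash(item, i)] for i in range(self.depth))

    def merge(self, other: "MiniCMS") -> None:
        """Element-wise addition of counters preserves the CMS guarantees."""
        assert (self.width, self.depth) == (other.width, other.depth)
        for i in range(self.depth):
            for j in range(self.width):
                self.tables[i][j] += other.tables[i][j]
        self.total_count += other.total_count


# Two "nodes" count independently, then a coordinator merges their sketches
node_a, node_b = MiniCMS(), MiniCMS()
for _ in range(40):
    node_a.add("/home")
for _ in range(25):
    node_b.add("/home")
node_b.add("/about", 10)

node_a.merge(node_b)
print(node_a.estimate("/home"))  # at least 65: counts from both nodes combined
```

Because merging is just element-wise addition, the merged sketch is identical to the one you would have built by feeding both streams into a single sketch, so the error bounds still hold against the combined total count.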
Practical Applications
The real power of Count-Min Sketch emerges in streaming scenarios. Here’s a practical URL tracker that maintains the top-K most frequent URLs:
```python
import heapq


class StreamingTopK:
    def __init__(self, k: int, epsilon: float = 0.001, delta: float = 0.01):
        self.k = k
        self.cms = ProductionCountMinSketch(epsilon, delta)
        self.candidates = {}  # item -> last known estimate
        self.min_heap = []    # (count, item) for top-k tracking

    def add(self, item: str) -> None:
        """Process a stream element."""
        self.cms.add(item)
        estimate = self.cms.estimate(item)

        # Update candidate tracking (heap entries may go stale, but
        # get_top_k re-estimates before returning results)
        if item in self.candidates:
            self.candidates[item] = estimate
        elif len(self.min_heap) < self.k:
            self.candidates[item] = estimate
            heapq.heappush(self.min_heap, (estimate, item))
        elif estimate > self.min_heap[0][0]:
            # New item might be in top-k
            _, evicted = heapq.heapreplace(self.min_heap, (estimate, item))
            del self.candidates[evicted]
            self.candidates[item] = estimate

    def get_top_k(self) -> list[tuple[str, int]]:
        """Return current top-k items with estimated counts."""
        # Re-estimate all candidates (counts may have increased)
        results = [
            (item, self.cms.estimate(item))
            for item in self.candidates
        ]
        results.sort(key=lambda x: -x[1])
        return results[:self.k]


# Usage
tracker = StreamingTopK(k=10)
for url in stream_of_urls():
    tracker.add(url)

# Get current top 10 URLs
for url, count in tracker.get_top_k():
    print(f"{url}: ~{count} hits")
```
For database integration, Redis Stack (via the RedisBloom module) provides native CMS support:

```python
import redis

r = redis.Redis()

# Initialize sketch (width=2000, depth=5)
r.execute_command('CMS.INITBYDIM', 'page_views', 2000, 5)

# Increment counts
r.execute_command('CMS.INCRBY', 'page_views', '/home', 1, '/about', 1, '/api/users', 5)

# Query estimates
counts = r.execute_command('CMS.QUERY', 'page_views', '/home', '/api/users')
```
Comparison with Alternatives
Count-Min Sketch occupies a specific niche among probabilistic data structures:
| Structure | Query Type | Error Direction | Memory |
|---|---|---|---|
| Count-Min Sketch | Frequency | Overestimates | O(1/ε × log(1/δ)) |
| Bloom Filter | Membership | False positives | O(n) |
| HyperLogLog | Cardinality | Both directions | O(log log n) |
Use Count-Min Sketch when: You need frequency counts for arbitrary items in a stream, and overestimation is acceptable.
Use Bloom filters when: You only need membership queries (“have I seen this before?”), not counts.
Use HyperLogLog when: You need cardinality (“how many unique items?”), not individual frequencies.
The Conservative Update variant improves CMS accuracy by only incrementing counters that equal the current minimum estimate:
```python
def conservative_add(self, item: str, count: int = 1) -> None:
    """Only increment counters at the current minimum value."""
    current_min = self.estimate(item)
    for i, table in enumerate(self.tables):
        idx = self._hash(item, i)
        if table[idx] == current_min:
            table[idx] += count
This reduces overestimation at the cost of slightly more computation per update.
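One way to see the effect is to run both update rules over the same stream and compare. The following is a condensed, self-contained sketch (built-in `hash()` for brevity, synthetic item names) showing that conservative updates never underestimate, yet never exceed the standard estimates:

```python
import random


class TinyCMS:
    """Condensed CMS supporting both standard and conservative updates."""

    def __init__(self, width: int = 16, depth: int = 4, conservative: bool = False):
        self.width, self.depth = width, depth
        self.conservative = conservative
        self.tables = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        return hash((seed, item)) % self.width

    def estimate(self, item: str) -> int:
        return min(self.tables[i][self._hash(item, i)] for i in range(self.depth))

    def add(self, item: str) -> None:
        if self.conservative:
            # Only bump counters sitting at the current minimum
            floor = self.estimate(item)
            for i in range(self.depth):
                idx = self._hash(item, i)
                if self.tables[i][idx] == floor:
                    self.tables[i][idx] += 1
        else:
            for i in range(self.depth):
                self.tables[i][self._hash(item, i)] += 1


random.seed(42)
standard = TinyCMS()
conservative = TinyCMS(conservative=True)
stream = [f"item-{random.randint(0, 100)}" for _ in range(2000)]
exact = {}
for item in stream:
    standard.add(item)
    conservative.add(item)
    exact[item] = exact.get(item, 0) + 1

for item in exact:
    # Both overestimate, but conservative estimates are never larger than standard ones
    assert exact[item] <= conservative.estimate(item) <= standard.estimate(item)
```

The invariant holds because the conservative rule only ever increments a subset of the counters the standard rule would touch, so its counter arrays stay element-wise below the standard ones while the minimum still rises by one on every unit insert.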
Conclusion and Best Practices
Count-Min Sketch is the right choice when you need frequency estimates for high-cardinality streams and can tolerate bounded overestimation. It’s wrong for exact counting or when underestimation would be catastrophic.
Parameter tuning guidelines:
- Start with ε = 0.001 (0.1% error) and δ = 0.01 (99% confidence)
- For rate limiting, use smaller ε to reduce false positives
- For trending detection, larger ε is acceptable since you care about relative rankings
Production checklist:
- Use fast, non-cryptographic hash functions (MurmurHash3, xxHash)
- Implement merge operations for distributed aggregation
- Consider conservative updates if accuracy matters more than throughput
- Monitor total count to understand error magnitude (error ≤ ε × total)
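The last checklist point is cheap to operationalize. A minimal sketch of the idea, assuming you track ε alongside the sketch's running total (the example numbers are illustrative):

```python
def worst_case_error(epsilon: float, total_count: int) -> float:
    """Upper bound on additive overestimation for any single item's estimate."""
    return epsilon * total_count


# With epsilon = 0.001 after 5 million stream elements, any estimate
# is at most about 5,000 above the true count (with probability >= 1 - delta)
print(worst_case_error(0.001, 5_000_000))
```

Exporting this bound as a metric lets you alert when the stream has grown enough that the absolute error is no longer acceptable for your use case.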
The beauty of Count-Min Sketch is its simplicity. A few kilobytes of memory, a handful of hash functions, and you can track frequency estimates across billions of items. That’s a trade-off worth making.