HyperLogLog: Approximate Distinct Counting
Key Insights
- HyperLogLog can count billions of unique elements using only 12KB of memory with ~0.81% standard error, making it indispensable for high-cardinality analytics
- The algorithm exploits a simple observation: in random binary strings, the probability of seeing at least n leading zeros is 1/2^n, letting us estimate cardinality from the maximum observed run of leading zeros
- Every major database and analytics platform implements HLL natively—Redis, BigQuery, ClickHouse, Presto—so understanding the primitives helps you use them effectively
The Distinct Count Problem
Counting unique elements sounds trivial until you try it at scale. The naive approach—store every element in a set and count—requires memory proportional to the number of unique elements. For a website tracking unique visitors, that might mean storing hundreds of millions of user IDs. At 8 bytes per ID, you’re looking at gigabytes of memory just for one counter.
Real systems face this problem constantly:
- Unique visitors: How many distinct users visited your site today?
- Distinct IPs: How many unique IP addresses hit your API endpoint?
- Cardinality estimation: How many unique values exist in a database column for query planning?
- Network monitoring: How many distinct flows passed through a router?
The key insight is that most of these use cases don’t need exact counts. Whether you have 1,847,293 or 1,850,000 unique visitors rarely changes your business decision. This tolerance for approximation opens the door to probabilistic data structures that trade accuracy for dramatic memory savings.
How HyperLogLog Works
HyperLogLog rests on a beautifully simple observation about random numbers. When you hash an element to a binary string, each bit has a 50% chance of being 0 or 1. The probability that a hash starts with at least one leading zero is 1/2. At least two leading zeros? 1/4. Three? 1/8. So if the maximum you have observed is 20 leading zeros, you have likely seen around 2^20 (roughly a million) unique elements.
Here’s the intuition in code:
```python
import hashlib

def count_leading_zeros(hash_bytes):
    """Count leading zeros in the binary representation."""
    n = int.from_bytes(hash_bytes, 'big')
    if n == 0:
        return len(hash_bytes) * 8
    return (len(hash_bytes) * 8) - n.bit_length()

def demonstrate_leading_zeros():
    """Show how the maximum leading-zero count grows with cardinality."""
    max_zeros = 0
    for i in range(1_000_000):
        h = hashlib.sha256(str(i).encode()).digest()
        zeros = count_leading_zeros(h)
        max_zeros = max(max_zeros, zeros)
        if i + 1 in (100, 1000, 10000, 100000, 1000000):
            estimated = 2 ** max_zeros
            print(f"After {i+1:>7} elements: max_zeros={max_zeros:>2}, "
                  f"estimate≈{estimated:>10,}, actual={i+1:>7}")

demonstrate_leading_zeros()
```
Running this produces output like:
```
After     100 elements: max_zeros= 9, estimate≈       512, actual=    100
After    1000 elements: max_zeros=12, estimate≈     4,096, actual=   1000
After   10000 elements: max_zeros=15, estimate≈    32,768, actual=  10000
After  100000 elements: max_zeros=18, estimate≈   262,144, actual= 100000
After 1000000 elements: max_zeros=21, estimate≈ 2,097,152, actual=1000000
```
The estimates are rough because we’re using a single observation. HyperLogLog improves accuracy by maintaining multiple registers. It uses the first few bits of each hash to select a register, then tracks the maximum leading zeros seen in the remaining bits for that register. The final estimate combines all registers using a harmonic mean, which handles outliers better than arithmetic mean.
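To see why the harmonic mean is used, here is a quick sketch with made-up register values: a single outlier register dominates an arithmetic mean of the 2^r values but barely moves the harmonic mean.

```python
# Made-up register values: 15 typical registers plus one extreme outlier.
registers = [10] * 15 + [40]

arith = sum(2 ** r for r in registers) / len(registers)
harm = len(registers) / sum(2 ** -r for r in registers)

print(f"arithmetic mean of 2^r: {arith:,.0f}")  # ~6.9e10, ruined by the outlier
print(f"harmonic mean of 2^r:  {harm:,.0f}")    # ~1,092, near the typical 2^10
```

The harmonic mean is dominated by the smallest terms, so one register that happened to see an unusually long run of zeros cannot drag the estimate up by orders of magnitude.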
The name “HyperLogLog” comes from the space complexity: O(log log n). Each register stores the maximum leading-zero count it has seen, which is roughly log(n), and representing a number of that size takes only about log(log(n)) bits. With m registers, the total space is O(m * log log n).
Building a Basic Implementation
Let’s build a working HyperLogLog from scratch:
```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, precision=14):
        """
        Initialize HLL with the given precision.
        precision=14 means 2^14 = 16384 registers, ~0.81% error.
        """
        self.precision = precision
        self.num_registers = 1 << precision
        self.registers = [0] * self.num_registers
        # Alpha constant for bias correction
        if self.num_registers == 16:
            self.alpha = 0.673
        elif self.num_registers == 32:
            self.alpha = 0.697
        elif self.num_registers == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_registers)

    def _hash(self, item):
        """Return a 64-bit hash of item."""
        h = hashlib.sha256(str(item).encode()).digest()[:8]
        return int.from_bytes(h, 'big')

    def _leading_zeros(self, value, max_bits=64):
        """Count leading zeros in a max_bits-wide value."""
        if value == 0:
            return max_bits
        return max_bits - value.bit_length()

    def add(self, item):
        """Add an item to the HLL."""
        h = self._hash(item)
        # First 'precision' bits determine the register
        register_idx = h >> (64 - self.precision)
        # Remaining bits are used for the leading-zero count
        remaining = h & ((1 << (64 - self.precision)) - 1)
        zeros = self._leading_zeros(remaining, 64 - self.precision) + 1
        self.registers[register_idx] = max(self.registers[register_idx], zeros)

    def count(self):
        """Estimate cardinality."""
        # Harmonic mean of 2^register values
        indicator = sum(2 ** (-r) for r in self.registers)
        raw_estimate = self.alpha * (self.num_registers ** 2) / indicator
        # Small-range correction: fall back to linear counting
        if raw_estimate <= 2.5 * self.num_registers:
            zeros = self.registers.count(0)
            if zeros > 0:
                return self.num_registers * math.log(self.num_registers / zeros)
        # Large-range correction (with 64-bit hashes, rarely triggered)
        if raw_estimate > (1 << 32) / 30:
            return -(1 << 64) * math.log(1 - raw_estimate / (1 << 64))
        return raw_estimate
```
This implementation uses 16,384 registers by default. Each register needs only 6 bits (to store values 0-63), so a bit-packed representation fits in about 12KB; the Python list above trades that compactness for readability.
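To make the 12KB figure concrete, here is a hypothetical sketch of the packed layout a production implementation would use: 6-bit registers stored in a `bytearray` (the `PackedRegisters` name and API are illustrative, not from any particular library).

```python
# Hypothetical packed-register storage: 6 bits per register in a bytearray.
class PackedRegisters:
    BITS = 6  # values 0-63 cover any leading-zero count for 64-bit hashes

    def __init__(self, num_registers):
        self.data = bytearray((num_registers * self.BITS + 7) // 8)

    def _word(self, byte):
        # Read two bytes so values straddling a byte boundary are handled
        hi = self.data[byte + 1] << 8 if byte + 1 < len(self.data) else 0
        return self.data[byte] | hi

    def get(self, idx):
        byte, offset = divmod(idx * self.BITS, 8)
        return (self._word(byte) >> offset) & 0x3F

    def set(self, idx, value):
        byte, offset = divmod(idx * self.BITS, 8)
        word = (self._word(byte) & ~(0x3F << offset)) | ((value & 0x3F) << offset)
        self.data[byte] = word & 0xFF
        if byte + 1 < len(self.data):
            self.data[byte + 1] = word >> 8

regs = PackedRegisters(1 << 14)  # precision=14
regs.set(1000, 42)
print(regs.get(1000), len(regs.data))  # 42 12288
```

16,384 registers × 6 bits = 98,304 bits = 12,288 bytes, which is where the 12KB number comes from.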
Accuracy and Error Bounds
The standard error of HyperLogLog is approximately 1.04/√m, where m is the number of registers. With 16,384 registers (precision=14), that’s about 0.81% error. Let’s verify empirically:
```python
def test_accuracy():
    """Compare HLL estimates against exact counts."""
    import random
    test_sizes = [1000, 10000, 100000, 1000000]
    precisions = [10, 12, 14]  # 1024, 4096, 16384 registers

    print(f"{'Actual':>10} | ", end="")
    for p in precisions:
        registers = 1 << p
        expected_error = 1.04 / math.sqrt(registers) * 100
        print(f"p={p} (±{expected_error:.1f}%) | ", end="")
    print()
    print("-" * 60)

    for size in test_sizes:
        # Random 50-bit IDs; duplicates are vanishingly unlikely at these sizes
        items = [random.randint(0, 10**15) for _ in range(size)]
        print(f"{size:>10} | ", end="")
        for p in precisions:
            hll = HyperLogLog(precision=p)
            for item in items:
                hll.add(item)
            estimate = hll.count()
            error = (estimate - size) / size * 100
            print(f"{estimate:>8.0f} ({error:>+5.1f}%) | ", end="")
        print()

test_accuracy()
```
Typical output shows errors well within the theoretical bounds:
```
    Actual | p=10 (±3.2%) | p=12 (±1.6%) | p=14 (±0.8%) |
------------------------------------------------------------
      1000 |      987 ( -1.3%) |     1012 ( +1.2%) |      998 ( -0.2%) |
     10000 |    10247 ( +2.5%) |     9923 ( -0.8%) |    10045 ( +0.4%) |
    100000 |   101893 ( +1.9%) |    99284 ( -0.7%) |   100312 ( +0.3%) |
   1000000 |   978432 ( -2.2%) |  1008234 ( +0.8%) |   997821 ( -0.2%) |
```
Merging and Distributed Counting
HyperLogLog’s killer feature for distributed systems is mergeability. To combine two HLLs, simply take the maximum value for each register:
```python
def merge(self, other):
    """Merge another HLL into this one (a method on the HyperLogLog class)."""
    if self.precision != other.precision:
        raise ValueError("Cannot merge HLLs with different precision")
    for i in range(self.num_registers):
        self.registers[i] = max(self.registers[i], other.registers[i])

# Usage: aggregate counts across shards
def count_unique_across_shards(shard_data):
    """
    Each shard has its own HLL. Merge them for a global count.
    """
    combined = HyperLogLog(precision=14)
    for shard_id, items in shard_data.items():
        shard_hll = HyperLogLog(precision=14)
        for item in items:
            shard_hll.add(item)
        combined.merge(shard_hll)
        print(f"After merging shard {shard_id}: {combined.count():.0f}")
    return combined.count()
```
This enables powerful patterns: count unique users per hour, then merge for daily counts. Aggregate across geographic regions. Combine results from different microservices. All without coordination or data movement.
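Merging gives you unions directly, but HLL has no native intersection. A common trick is inclusion-exclusion over the merged counts. A sketch with illustrative numbers (in practice `count_a`, `count_b`, and `count_union` would come from `count()` on the two HLLs and on their merge):

```python
# Illustrative counts; real values would come from HLL estimates.
count_a = 1_000_000      # unique visitors to site A
count_b = 800_000        # unique visitors to site B
count_union = 1_500_000  # cardinality of the merged HLL

# |A ∩ B| = |A| + |B| - |A ∪ B|
count_intersection = count_a + count_b - count_union
print(count_intersection)  # 300000
```

Beware that the intersection inherits the absolute errors of all three estimates, so its relative error can be large when the true overlap is small compared to the union.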
Production Usage: Redis and Beyond
In production, you’ll rarely implement HLL yourself. Redis provides excellent native support:
```python
import redis

def track_daily_unique_users():
    """Track unique visitors using Redis HyperLogLog."""
    r = redis.Redis()

    # Add users to today's HLL
    today_key = "unique_visitors:2024-01-15"
    # PFADD returns 1 if the internal state changed
    r.pfadd(today_key, "user_123", "user_456", "user_789")
    r.pfadd(today_key, "user_123")  # Duplicate, won't increase the count

    # PFCOUNT returns the approximate cardinality
    daily_count = r.pfcount(today_key)
    print(f"Unique visitors today: {daily_count}")

    # Merge multiple days for a weekly count
    week_keys = [f"unique_visitors:2024-01-{day:02d}" for day in range(9, 16)]
    r.pfmerge("unique_visitors:week_2", *week_keys)
    weekly_count = r.pfcount("unique_visitors:week_2")
    print(f"Unique visitors this week: {weekly_count}")

# Redis HLL uses at most 12KB per key regardless of cardinality,
# so you can store millions of counters efficiently.
```
Other platforms offer similar functionality:
- BigQuery: `APPROX_COUNT_DISTINCT()` uses HLL++ internally
- ClickHouse: `uniqHLL12()` function with configurable precision
- Presto/Trino: `approx_distinct()` with an optional accuracy parameter
- PostgreSQL: `postgresql-hll` extension
When to Use (and When Not To)
HyperLogLog excels when:
- High cardinality: Millions or billions of unique elements
- Memory constrained: Can’t afford linear memory growth
- Approximate is acceptable: Business decisions don’t require exact counts
- Distributed aggregation: Need to merge counts across nodes/time
Consider alternatives when:
- Low cardinality: For thousands of uniques, just use a set
- Exact counts required: Billing, compliance, or audit scenarios
- Need element membership: HLL can’t tell you if a specific item was seen (use Bloom filters)
- Frequency matters: Count-Min Sketch tracks how often elements appear, not just existence
The 12KB footprint and sub-1% error rate make HyperLogLog the default choice for cardinality estimation at scale. Every analytics system you use likely relies on it—now you understand why.