HyperLogLog: Cardinality Estimation
Key Insights
- HyperLogLog estimates unique element counts using only 12KB of memory regardless of whether you’re counting thousands or billions of items, achieving ~0.81% error through clever use of hash function properties and stochastic averaging.
- The algorithm exploits a simple probabilistic insight: if you’ve observed a hash value with many leading zeros, you’ve probably seen many distinct elements—similar to how flipping 10 heads in a row suggests you’ve been flipping coins for a while.
- HyperLogLog sketches merge trivially via element-wise maximum operations, making them ideal for distributed systems where you need to combine counts across shards, time windows, or geographic regions.
The Count Distinct Problem
Counting unique elements sounds trivial until you try it at scale. The naive approach—store every element in a set and return its size—requires memory proportional to the number of distinct elements. When you’re tracking unique visitors across a billion page views or counting distinct IP addresses hitting your CDN, that set becomes untenable.
Consider these real-world scenarios:
- Unique daily visitors: A high-traffic site sees 500 million page views from 50 million unique users. Storing 50 million user IDs (8 bytes each) requires 400MB per day.
- Distinct search queries: A search engine processes 8 billion queries daily. Even with aggressive hashing, tracking exact distinct counts consumes gigabytes.
- Network monitoring: Counting unique source IPs for DDoS detection across multiple collection points requires merging counts efficiently.
The memory cost becomes prohibitive, especially when you need these counts across multiple dimensions (per page, per hour, per region). This is where probabilistic data structures earn their keep.
The Probabilistic Insight
HyperLogLog exploits a beautiful statistical property of hash functions. When you hash random inputs, the resulting bits are uniformly distributed. This means roughly half of all hashes start with 0, a quarter start with 00, an eighth start with 000, and so on.
Here’s the key insight: if you’ve observed a hash with 10 leading zeros, you’ve probably hashed around 2^10 = 1024 distinct elements. It’s like coin flips—seeing 10 heads in a row doesn’t prove you’ve flipped 1024 times, but it’s a reasonable estimate.
```python
import hashlib

def hash_to_binary(value: str) -> str:
    """Hash a value and return its binary representation."""
    h = hashlib.sha256(value.encode()).digest()
    # Take first 8 bytes as a 64-bit integer
    hash_int = int.from_bytes(h[:8], 'big')
    return format(hash_int, '064b')

def count_leading_zeros(binary: str) -> int:
    """Count leading zeros in a binary string."""
    count = 0
    for bit in binary:
        if bit == '0':
            count += 1
        else:
            break
    return count

# Demonstrate the distribution
samples = ["user_1", "user_2", "user_3", "user_100", "user_1000"]
for s in samples:
    binary = hash_to_binary(s)
    zeros = count_leading_zeros(binary)
    print(f"{s}: {binary[:20]}... -> {zeros} leading zeros")
```
Running this reveals the statistical nature—most hashes have 0-3 leading zeros, but occasionally you’ll see 5, 6, or more. The maximum number of leading zeros you’ve observed becomes your cardinality estimator: 2^(max_zeros + 1).
The problem? A single maximum is noisy. One lucky hash can wildly overestimate your count.
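That noise is easy to see by implementing the single-maximum estimator directly. The sketch below reuses SHA-256 as the hash (the specific estimates depend on that choice and on the hypothetical `user_{i}` keys):

```python
import hashlib

def leading_zeros_64(value: str) -> int:
    """Count leading zeros in the 64-bit hash of a value."""
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], 'big')
    return 64 - h.bit_length()  # zeros before the first 1-bit

# Estimate cardinality from the single maximum observed
for n in (1_000, 10_000, 100_000):
    max_zeros = max(leading_zeros_64(f"user_{i}") for i in range(n))
    print(f"n={n:>7}: max zeros = {max_zeros}, estimate = {2 ** (max_zeros + 1)}")
```

Because the estimate can only be a power of two, and one lucky hash moves the maximum, single-maximum estimates are routinely off by a factor of two or more. Bucketing is the cure.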
From LogLog to HyperLogLog
The LogLog algorithm (2003) solved the variance problem through stochastic averaging. Instead of tracking one global maximum, you split hashes into m buckets and track the maximum leading-zero count per bucket. The estimate becomes proportional to m × 2^(arithmetic mean of the per-bucket maximums).
HyperLogLog (2007) improved this further by using the harmonic mean instead of the arithmetic mean. The harmonic mean is less sensitive to outliers, dramatically reducing estimation error.
```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, precision: int = 14):
        """
        Initialize HyperLogLog with the given precision.

        precision=14 means 2^14 = 16384 registers, ~0.81% error.
        """
        self.precision = precision
        self.num_registers = 1 << precision
        self.registers = [0] * self.num_registers
        # Alpha constant for bias correction
        if self.num_registers >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_registers)
        elif self.num_registers == 64:
            self.alpha = 0.709
        elif self.num_registers == 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673

    def _hash(self, value: str) -> int:
        """Hash value to a 64-bit integer."""
        h = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(h[:8], 'big')

    def add(self, value: str) -> None:
        """Add an element to the HLL."""
        h = self._hash(value)
        # First 'precision' bits determine the register
        register_idx = h >> (64 - self.precision)
        # Remaining bits are used for the leading-zero count
        remaining = h & ((1 << (64 - self.precision)) - 1)
        # Rank = leading zeros + 1 (the position of the first 1-bit)
        if remaining == 0:
            rank = 64 - self.precision + 1
        else:
            rank = (64 - self.precision) - remaining.bit_length() + 1
        # Update the register with the maximum rank seen
        self.registers[register_idx] = max(self.registers[register_idx], rank)

    def count(self) -> int:
        """Estimate cardinality."""
        # Harmonic mean of 2^register values
        indicator = sum(2.0 ** (-r) for r in self.registers)
        estimate = self.alpha * (self.num_registers ** 2) / indicator
        return int(estimate)
```
The harmonic mean naturally downweights outliers. If most registers show values around 10 but one shows 25, the harmonic mean stays close to 10 rather than being pulled toward 25.
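That robustness is easy to check numerically with Python's statistics module. The register values below are hypothetical, chosen to match the 10-vs-25 scenario above:

```python
import math
import statistics

# 15 registers agreeing at rank 10, plus one outlier at rank 25
values = [2 ** 10] * 15 + [2 ** 25]

arith = statistics.mean(values)
harm = statistics.harmonic_mean(values)

print(f"arithmetic mean: {arith:,.0f} (~2^{math.log2(arith):.1f})")
print(f"harmonic mean:   {harm:,.0f} (~2^{math.log2(harm):.1f})")
```

The harmonic mean stays within a tenth of a bit of the consensus rank 10, while the arithmetic mean is dragged roughly eleven bits toward the outlier.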
Error Bounds and Memory Tradeoffs
HyperLogLog’s standard error follows a clean formula: 1.04 / √m, where m is the number of registers.
| Precision | Registers | Memory | Standard Error |
|---|---|---|---|
| 10 | 1,024 | 0.75 KB | 3.25% |
| 12 | 4,096 | 3 KB | 1.63% |
| 14 | 16,384 | 12 KB | 0.81% |
| 16 | 65,536 | 48 KB | 0.41% |
Each register needs only 6 bits (max value of 64 for 64-bit hashes), so memory usage is roughly (6 × 2^precision) / 8 bytes. The precision=14 configuration (12KB, 0.81% error) has become the de facto standard, used by Redis and most implementations.
This is remarkable: 12KB of memory can count billions of unique elements with sub-1% error. The memory stays constant regardless of cardinality.
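The table's figures follow directly from the two formulas. A quick check, assuming 6-bit registers:

```python
import math

for precision in (10, 12, 14, 16):
    m = 1 << precision
    error = 1.04 / math.sqrt(m)      # standard error
    memory_kb = m * 6 / 8 / 1024     # 6 bits per register
    print(f"precision={precision}: {m:>6} registers, "
          f"{memory_kb:>5.2f} KB, {error * 100:.2f}% error")
```

Doubling m (one more precision bit) cuts the error by √2 but doubles memory, so each extra digit of accuracy costs about 100x the space.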
Bias Corrections and Edge Cases
Raw HyperLogLog estimates have known biases at the extremes. The raw estimator overshoots at small cardinalities (where linear counting works better), and implementations built on 32-bit hashes need a correction as counts approach 2^32.
```python
import math

def count_with_corrections(self) -> int:
    """Estimate cardinality with bias corrections (method of HyperLogLog)."""
    # Raw estimate using the harmonic mean
    indicator = sum(2.0 ** (-r) for r in self.registers)
    raw_estimate = self.alpha * (self.num_registers ** 2) / indicator

    # Small-range correction: while many registers are still zero,
    # linear counting is more accurate than the raw estimate
    zero_registers = self.registers.count(0)
    if zero_registers > 0:
        linear_estimate = self.num_registers * math.log(
            self.num_registers / zero_registers
        )
        if linear_estimate <= 2.5 * self.num_registers:
            return int(linear_estimate)

    # Large-range correction from the original 32-bit formulation:
    #   if raw_estimate > 2**32 / 30:
    #       raw_estimate = -(2**32) * math.log(1 - raw_estimate / 2**32)
    # With this class's 64-bit hash, collisions are negligible and
    # applying it would distort the estimate, so it is shown only
    # for reference.

    return int(raw_estimate)
```
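To get a feel for the linear-counting branch, plug in some numbers (the register counts here are hypothetical):

```python
import math

m = 16384               # registers at precision 14
zero_registers = 15000  # most registers still empty -> small cardinality

# Linear counting: if a fraction V/m of slots are empty after n
# insertions into m slots, then n is approximately m * ln(m / V)
linear_estimate = m * math.log(m / zero_registers)
print(f"linear counting estimate: {linear_estimate:.0f}")  # ~1446
```

Linear counting is just the classic occupancy estimate for hashing into m slots; it is accurate precisely in the regime where the raw HLL estimator is biased.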
Google’s HyperLogLog++ paper (2013) introduced empirical bias correction using lookup tables derived from extensive simulations. This eliminates the need for linear counting in most cases and provides more accurate estimates across all cardinality ranges.
Merge Operations and Distributed Counting
HyperLogLog’s killer feature for distributed systems: merging is trivial. Since each register tracks the maximum leading zeros seen, merging two HLLs requires only element-wise maximum.
```python
def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
    """Return a new HLL representing the union of both sketches."""
    if self.precision != other.precision:
        raise ValueError("Cannot merge HLLs with different precision")
    result = HyperLogLog(self.precision)
    result.registers = [
        max(a, b)
        for a, b in zip(self.registers, other.registers)
    ]
    return result

# Demonstration
hll1 = HyperLogLog(precision=14)
hll2 = HyperLogLog(precision=14)

# Add overlapping ranges of elements to each
for i in range(50000):
    hll1.add(f"user_{i}")
for i in range(30000, 80000):
    hll2.add(f"user_{i}")

# True distinct count: 80000 (users 0-79999)
merged = hll1.merge(hll2)
print(f"HLL1 count: {hll1.count()}")      # ~50000
print(f"HLL2 count: {hll2.count()}")      # ~50000
print(f"Merged count: {merged.count()}")  # ~80000
```
This enables powerful architectures: count unique users per server, merge hourly into daily, merge daily into weekly. Each merge is O(m) where m is the number of registers—microseconds regardless of cardinality.
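Because element-wise maximum is associative and commutative, per-shard sketches can be folded together in any order. A minimal sketch over raw register arrays (the shard data here is hypothetical and tiny for illustration):

```python
from functools import reduce

def merge_registers(a: list[int], b: list[int]) -> list[int]:
    """Element-wise maximum of two register arrays."""
    return [max(x, y) for x, y in zip(a, b)]

# Hypothetical register arrays from three shards
shard_sketches = [
    [3, 0, 5, 1],
    [1, 4, 2, 1],
    [0, 2, 7, 0],
]

combined = reduce(merge_registers, shard_sketches)
print(combined)  # [3, 4, 7, 1]
```

Order-independence is what makes hourly-to-daily-to-weekly rollups safe: any shard can be merged at any time, and re-merging the same sketch twice never double-counts.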
Production Usage
Most production systems use battle-tested HLL implementations rather than rolling their own.
Redis provides HyperLogLog as a first-class data type:
```
# Track unique visitors per page
PFADD page:/home user:123 user:456 user:789
PFADD page:/home user:123 user:999    # duplicate user:123 ignored

# Get approximate unique count
PFCOUNT page:/home
# Returns: 4

# Merge multiple pages for a site-wide count
PFMERGE site:daily page:/home page:/about page:/products
PFCOUNT site:daily

# Memory usage: at most 12KB per key (Redis uses a compact sparse
# encoding for low cardinalities)
DEBUG OBJECT page:/home
```
BigQuery offers APPROX_COUNT_DISTINCT:
```sql
-- Exact count (expensive for large tables)
SELECT COUNT(DISTINCT user_id) FROM events;

-- HLL-based approximation (much faster)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- Explicit HLL sketches for custom aggregations
SELECT
  date,
  HLL_COUNT.MERGE(hll_sketch) AS unique_users
FROM (
  SELECT date, HLL_COUNT.INIT(user_id) AS hll_sketch
  FROM events
  GROUP BY date, shard_id
)
GROUP BY date;
```
When to choose HLL over exact counting:
- Cardinality exceeds millions and memory is constrained
- You need to merge counts across distributed systems
- Sub-2% error is acceptable for your use case
- Real-time counting is required (HLL add is O(1))
When to stick with exact counting:
- Cardinality is small (under 100K elements)
- You need exact numbers for billing or compliance
- You’re already storing the elements for other purposes
HyperLogLog represents a fundamental tradeoff in systems design: trading perfect accuracy for bounded memory and composable operations. Understanding when to make that trade is what separates pragmatic engineers from those drowning in infrastructure costs.