HyperLogLog: Approximate Distinct Counting
Key Insights
- HyperLogLog can count billions of unique elements using only 12KB of memory with ~0.81% standard error, making it indispensable for high-cardinality analytics
- The algorithm exploits a simple observation: in random binary strings, the probability of seeing at least n leading zeros is 1/2^n, letting us estimate cardinality from the maximum observed run of leading zeros
- Every major database and analytics platform implements HLL natively—Redis, BigQuery, ClickHouse, Presto—so understanding the primitives helps you use them effectively
The Distinct Count Problem
Counting unique elements sounds trivial until you try it at scale. The naive approach—store every element in a set and count—requires memory proportional to the number of unique elements. For a website tracking unique visitors, that might mean storing hundreds of millions of user IDs. At 8 bytes per ID, you’re looking at gigabytes of memory just for one counter.
Real systems face this problem constantly:
- Unique visitors: How many distinct users visited your site today?
- Distinct IPs: How many unique IP addresses hit your API endpoint?
- Cardinality estimation: How many unique values exist in a database column for query planning?
- Network monitoring: How many distinct flows passed through a router?
The key insight is that most of these use cases don’t need exact counts. Whether you have 1,847,293 or 1,850,000 unique visitors rarely changes your business decision. This tolerance for approximation opens the door to probabilistic data structures that trade accuracy for dramatic memory savings.
How HyperLogLog Works
HyperLogLog rests on a beautifully simple observation about random numbers. When you hash an element to a binary string, each bit has a 50% chance of being 0 or 1. The probability that a hash starts with at least one leading zero is 1/2. At least two leading zeros? 1/4. Three? 1/8. So if the maximum you have observed is 20 leading zeros, you have likely seen around 2^20 (roughly a million) unique elements.
Here’s the intuition in code:
```python
import hashlib

def count_leading_zeros(hash_bytes):
    """Count leading zeros in the binary representation."""
    n = int.from_bytes(hash_bytes, 'big')
    if n == 0:
        return len(hash_bytes) * 8
    return (len(hash_bytes) * 8) - n.bit_length()

def demonstrate_leading_zeros():
    """Show how the maximum leading-zero count grows with cardinality."""
    max_zeros = 0
    for i in range(1_000_000):
        h = hashlib.sha256(str(i).encode()).digest()
        zeros = count_leading_zeros(h)
        max_zeros = max(max_zeros, zeros)
        if i + 1 in (100, 1000, 10000, 100000, 1000000):
            estimated = 2 ** max_zeros
            print(f"After {i+1:>7} elements: max_zeros={max_zeros:>2}, "
                  f"estimate≈{estimated:>10,}, actual={i+1:>7}")

demonstrate_leading_zeros()
```
Running this produces output like:
```
After     100 elements: max_zeros= 9, estimate≈       512, actual=    100
After    1000 elements: max_zeros=12, estimate≈     4,096, actual=   1000
After   10000 elements: max_zeros=15, estimate≈    32,768, actual=  10000
After  100000 elements: max_zeros=18, estimate≈   262,144, actual= 100000
After 1000000 elements: max_zeros=21, estimate≈ 2,097,152, actual=1000000
```
The estimates are rough because we’re using a single observation. HyperLogLog improves accuracy by maintaining multiple registers. It uses the first few bits of each hash to select a register, then tracks the maximum leading zeros seen in the remaining bits for that register. The final estimate combines all registers using a harmonic mean, which handles outliers better than arithmetic mean.
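To see why the harmonic mean is used, here is a quick sketch with made-up register values: a single outlier register dominates an arithmetic mean of the 2^r values but barely moves the harmonic mean.

```python
# Made-up register values: 15 typical registers plus one extreme outlier.
registers = [10] * 15 + [40]

arith = sum(2 ** r for r in registers) / len(registers)
harm = len(registers) / sum(2 ** -r for r in registers)

print(f"arithmetic mean of 2^r: {arith:,.0f}")  # ~6.9e10, ruined by the outlier
print(f"harmonic mean of 2^r:  {harm:,.0f}")    # ~1,092, near the typical 2^10
```

The harmonic mean is dominated by the smallest terms, so one register that happened to see an unusually long run of zeros cannot drag the estimate up by orders of magnitude.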
The name “HyperLogLog” comes from the space complexity: O(log log n). Each register stores the maximum leading-zero count it has seen, which is roughly log(n), and representing a number of that size takes only about log(log(n)) bits. With m registers, the total space is O(m * log log n).
Building a Basic Implementation
Let’s build a working HyperLogLog from scratch:
```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, precision=14):
        """
        Initialize HLL with the given precision.
        precision=14 means 2^14 = 16384 registers, ~0.81% error.
        """
        self.precision = precision
        self.num_registers = 1 << precision
        self.registers = [0] * self.num_registers
        # Alpha constant for bias correction
        if self.num_registers == 16:
            self.alpha = 0.673
        elif self.num_registers == 32:
            self.alpha = 0.697
        elif self.num_registers == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_registers)

    def _hash(self, item):
        """Return a 64-bit hash of item."""
        h = hashlib.sha256(str(item).encode()).digest()[:8]
        return int.from_bytes(h, 'big')

    def _leading_zeros(self, value, max_bits=64):
        """Count leading zeros in a max_bits-wide value."""
        if value == 0:
            return max_bits
        return max_bits - value.bit_length()

    def add(self, item):
        """Add an item to the HLL."""
        h = self._hash(item)
        # First 'precision' bits determine the register
        register_idx = h >> (64 - self.precision)
        # Remaining bits are used for the leading-zero count
        remaining = h & ((1 << (64 - self.precision)) - 1)
        zeros = self._leading_zeros(remaining, 64 - self.precision) + 1
        self.registers[register_idx] = max(self.registers[register_idx], zeros)

    def count(self):
        """Estimate cardinality."""
        # Harmonic mean of 2^register values
        indicator = sum(2 ** (-r) for r in self.registers)
        raw_estimate = self.alpha * (self.num_registers ** 2) / indicator
        # Small-range correction: fall back to linear counting
        if raw_estimate <= 2.5 * self.num_registers:
            zeros = self.registers.count(0)
            if zeros > 0:
                return self.num_registers * math.log(self.num_registers / zeros)
        # Large-range correction (with 64-bit hashes, rarely triggered)
        if raw_estimate > (1 << 32) / 30:
            return -(1 << 64) * math.log(1 - raw_estimate / (1 << 64))
        return raw_estimate
```
This implementation uses 16,384 registers by default. Each register needs only 6 bits (to store values 0-63), so a bit-packed representation fits in about 12KB; the Python list above trades that compactness for readability.
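To make the 12KB figure concrete, here is a hypothetical sketch of the packed layout a production implementation would use: 6-bit registers stored in a `bytearray` (the `PackedRegisters` name and API are illustrative, not from any particular library).

```python
# Hypothetical packed-register storage: 6 bits per register in a bytearray.
class PackedRegisters:
    BITS = 6  # values 0-63 cover any leading-zero count for 64-bit hashes

    def __init__(self, num_registers):
        self.data = bytearray((num_registers * self.BITS + 7) // 8)

    def _word(self, byte):
        # Read two bytes so values straddling a byte boundary are handled
        hi = self.data[byte + 1] << 8 if byte + 1 < len(self.data) else 0
        return self.data[byte] | hi

    def get(self, idx):
        byte, offset = divmod(idx * self.BITS, 8)
        return (self._word(byte) >> offset) & 0x3F

    def set(self, idx, value):
        byte, offset = divmod(idx * self.BITS, 8)
        word = (self._word(byte) & ~(0x3F << offset)) | ((value & 0x3F) << offset)
        self.data[byte] = word & 0xFF
        if byte + 1 < len(self.data):
            self.data[byte + 1] = word >> 8

regs = PackedRegisters(1 << 14)  # precision=14
regs.set(1000, 42)
print(regs.get(1000), len(regs.data))  # 42 12288
```

16,384 registers × 6 bits = 98,304 bits = 12,288 bytes, which is where the 12KB number comes from.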
Accuracy and Error Bounds
The standard error of HyperLogLog is approximately 1.04/√m, where m is the number of registers. With 16,384 registers (precision=14), that’s about 0.81% error. Let’s verify empirically:
```python
def test_accuracy():
    """Compare HLL estimates against exact counts."""
    import random
    test_sizes = [1000, 10000, 100000, 1000000]
    precisions = [10, 12, 14]  # 1024, 4096, 16384 registers

    print(f"{'Actual':>10} | ", end="")
    for p in precisions:
        registers = 1 << p
        expected_error = 1.04 / math.sqrt(registers) * 100
        print(f"p={p} (±{expected_error:.1f}%) | ", end="")
    print()
    print("-" * 60)

    for size in test_sizes:
        # Random 50-bit IDs; duplicates are vanishingly unlikely at these sizes
        items = [random.randint(0, 10**15) for _ in range(size)]
        print(f"{size:>10} | ", end="")
        for p in precisions:
            hll = HyperLogLog(precision=p)
            for item in items:
                hll.add(item)
            estimate = hll.count()
            error = (estimate - size) / size * 100
            print(f"{estimate:>8.0f} ({error:>+5.1f}%) | ", end="")
        print()

test_accuracy()
```
Typical output shows errors well within the theoretical bounds:
```
    Actual | p=10 (±3.2%) | p=12 (±1.6%) | p=14 (±0.8%) |
------------------------------------------------------------
      1000 |      987 ( -1.3%) |     1012 ( +1.2%) |      998 ( -0.2%) |
     10000 |    10247 ( +2.5%) |     9923 ( -0.8%) |    10045 ( +0.4%) |
    100000 |   101893 ( +1.9%) |    99284 ( -0.7%) |   100312 ( +0.3%) |
   1000000 |   978432 ( -2.2%) |  1008234 ( +0.8%) |   997821 ( -0.2%) |
```
Merging and Distributed Counting
HyperLogLog’s killer feature for distributed systems is mergeability. To combine two HLLs, simply take the maximum value for each register:
```python
def merge(self, other):
    """Merge another HLL into this one (a method on the HyperLogLog class)."""
    if self.precision != other.precision:
        raise ValueError("Cannot merge HLLs with different precision")
    for i in range(self.num_registers):
        self.registers[i] = max(self.registers[i], other.registers[i])

# Usage: aggregate counts across shards
def count_unique_across_shards(shard_data):
    """
    Each shard has its own HLL. Merge them for a global count.
    """
    combined = HyperLogLog(precision=14)
    for shard_id, items in shard_data.items():
        shard_hll = HyperLogLog(precision=14)
        for item in items:
            shard_hll.add(item)
        combined.merge(shard_hll)
        print(f"After merging shard {shard_id}: {combined.count():.0f}")
    return combined.count()
```
This enables powerful patterns: count unique users per hour, then merge for daily counts. Aggregate across geographic regions. Combine results from different microservices. All without coordination or data movement.
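Merging gives you unions directly, but HLL has no native intersection. A common trick is inclusion-exclusion over the merged counts. A sketch with illustrative numbers (in practice `count_a`, `count_b`, and `count_union` would come from `count()` on the two HLLs and on their merge):

```python
# Illustrative counts; real values would come from HLL estimates.
count_a = 1_000_000      # unique visitors to site A
count_b = 800_000        # unique visitors to site B
count_union = 1_500_000  # cardinality of the merged HLL

# |A ∩ B| = |A| + |B| - |A ∪ B|
count_intersection = count_a + count_b - count_union
print(count_intersection)  # 300000
```

Beware that the intersection inherits the absolute errors of all three estimates, so its relative error can be large when the true overlap is small compared to the union.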
Production Usage: Redis and Beyond
In production, you’ll rarely implement HLL yourself. Redis provides excellent native support:
```python
import redis

def track_daily_unique_users():
    """Track unique visitors using Redis HyperLogLog."""
    r = redis.Redis()

    # Add users to today's HLL
    today_key = "unique_visitors:2024-01-15"
    # PFADD returns 1 if the internal state changed
    r.pfadd(today_key, "user_123", "user_456", "user_789")
    r.pfadd(today_key, "user_123")  # Duplicate, won't increase the count

    # PFCOUNT returns the approximate cardinality
    daily_count = r.pfcount(today_key)
    print(f"Unique visitors today: {daily_count}")

    # Merge multiple days for a weekly count
    week_keys = [f"unique_visitors:2024-01-{day:02d}" for day in range(9, 16)]
    r.pfmerge("unique_visitors:week_2", *week_keys)
    weekly_count = r.pfcount("unique_visitors:week_2")
    print(f"Unique visitors this week: {weekly_count}")

# Redis HLL uses at most 12KB per key regardless of cardinality,
# so you can store millions of counters efficiently.
```
Other platforms offer similar functionality:
- BigQuery: `APPROX_COUNT_DISTINCT()` uses HLL++ internally
- ClickHouse: `uniqHLL12()` function with configurable precision
- Presto/Trino: `approx_distinct()` with an optional accuracy parameter
- PostgreSQL: `postgresql-hll` extension
When to Use (and When Not To)
HyperLogLog excels when:
- High cardinality: Millions or billions of unique elements
- Memory constrained: Can’t afford linear memory growth
- Approximate is acceptable: Business decisions don’t require exact counts
- Distributed aggregation: Need to merge counts across nodes/time
Consider alternatives when:
- Low cardinality: For thousands of uniques, just use a set
- Exact counts required: Billing, compliance, or audit scenarios
- Need element membership: HLL can’t tell you if a specific item was seen (use Bloom filters)
- Frequency matters: Count-Min Sketch tracks how often elements appear, not just existence
The 12KB footprint and sub-1% error rate make HyperLogLog the default choice for cardinality estimation at scale. Every analytics system you use likely relies on it—now you understand why.