Huffman Coding: Prefix-Free Compression

Key Insights

  • Huffman coding achieves optimal prefix-free compression by assigning shorter bit sequences to frequent symbols and longer sequences to rare ones, guaranteeing unambiguous decoding without delimiters.
  • The algorithm’s greedy approach—repeatedly combining the two lowest-frequency nodes—produces a binary tree where no code is a prefix of another, enabling single-pass decoding.
  • While Huffman coding forms the backbone of formats like DEFLATE, JPEG, and MP3, it’s limited to integer bit lengths; modern alternatives like arithmetic coding and ANS can approach Shannon entropy more closely.

The Problem of Data Compression

Every byte you transmit or store costs something. Compression reduces that cost by exploiting redundancy in data. Lossless compression—where the original data is perfectly recoverable—relies on a simple insight: not all symbols appear equally often.

If ‘e’ appears 1000 times in your text and ‘z’ appears twice, why should both consume 8 bits per occurrence? Variable-length encoding assigns shorter codes to frequent symbols. But this creates a problem.

Consider this naive encoding:

# Naive variable-length encoding (BROKEN)
codes = {
    'a': '0',
    'b': '01',
    'c': '1',
}

# Encode "abc"
encoded = "0" + "01" + "1"  # Result: "0011"

# Now try to decode "0011"
# Is it "a" + "a" + "c" + "c"? (0, 0, 1, 1)
# Or "a" + "b" + "c"? (0, 01, 1)
# Ambiguous!

The decoder can’t tell where one symbol ends and another begins. You could add delimiters, but that defeats the purpose of compression. You need codes that are inherently unambiguous.

The Prefix-Free Property

A prefix-free code (also called a prefix code) has one constraint: no codeword is a prefix of any other codeword. If ‘a’ maps to “01”, no other symbol can start with “01”.

This property enables instant, unambiguous decoding. As you read bits left to right, the moment you match a codeword, you know you’ve found a complete symbol—no valid codeword could extend further.

def demonstrate_prefix_property():
    # Prefix-free codes (VALID)
    prefix_free = {
        'a': '0',
        'b': '10',
        'c': '110',
        'd': '111',
    }
    
    # Non-prefix codes (INVALID)
    non_prefix = {
        'a': '0',
        'b': '01',   # '0' is a prefix of '01' - problem!
        'c': '011',
    }
    
    # Decode "01110" with prefix-free codes
    bitstring = "01110"
    result = []
    
    # Build reverse lookup
    decode_map = {v: k for k, v in prefix_free.items()}
    
    buffer = ""
    for bit in bitstring:
        buffer += bit
        if buffer in decode_map:
            result.append(decode_map[buffer])
            buffer = ""
    
    print(f"Decoded: {''.join(result)}")  # Output: "ada"
    return result

demonstrate_prefix_property()

Prefix-free codes map naturally to binary trees. Each symbol sits at a leaf node. Left edges represent ‘0’, right edges represent ‘1’. The path from root to leaf defines the code. Since symbols only occupy leaves, no code can be a prefix of another—you’d have to pass through a leaf to reach another leaf.
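To make the tree correspondence concrete, here is a minimal sketch (illustrative, not part of the implementation below) that recovers the four codes above from a hand-built tree, with nested tuples standing in for internal nodes and strings for leaves:

```python
# Tree for the codes {'a': '0', 'b': '10', 'c': '110', 'd': '111'}.
# Internal nodes are (left, right) pairs; leaves are symbol strings.
tree = ('a', ('b', ('c', 'd')))

def codes_from_tree(node, path=""):
    """Recover each leaf's code as its root-to-leaf path (0=left, 1=right)."""
    if isinstance(node, str):          # leaf: the path so far is the code
        return {node: path}
    left, right = node
    codes = codes_from_tree(left, path + "0")
    codes.update(codes_from_tree(right, path + "1"))
    return codes

print(codes_from_tree(tree))  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because every symbol is a leaf, no root-to-leaf path can continue through another symbol's leaf, which is exactly the prefix-free property.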

Building the Huffman Tree

David Huffman’s 1952 algorithm constructs an optimal prefix-free code using a greedy approach. “Optimal” means no other prefix-free code produces a shorter expected output for the given symbol frequencies.

The algorithm:

  1. Count symbol frequencies
  2. Create a leaf node for each symbol
  3. Insert all nodes into a priority queue (min-heap) ordered by frequency
  4. While more than one node remains:
    • Extract the two nodes with lowest frequency
    • Create a new internal node with these as children
    • Set the new node’s frequency to the sum of its children
    • Insert the new node back into the queue
  5. The remaining node is the root

import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffmanNode:
    freq: int
    symbol: Optional[str] = field(compare=False, default=None)
    left: Optional['HuffmanNode'] = field(compare=False, default=None)
    right: Optional['HuffmanNode'] = field(compare=False, default=None)
    
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def build_huffman_tree(data: bytes) -> Optional[HuffmanNode]:
    """Build Huffman tree from byte data."""
    if not data:
        return None
    
    # Count frequencies
    freq_count = Counter(data)
    
    # Handle single-symbol edge case
    if len(freq_count) == 1:
        symbol, freq = next(iter(freq_count.items()))
        leaf = HuffmanNode(freq=freq, symbol=chr(symbol))
        return HuffmanNode(freq=freq, left=leaf)
    
    # Create priority queue with leaf nodes
    heap: list[HuffmanNode] = []
    for byte_val, freq in freq_count.items():
        node = HuffmanNode(freq=freq, symbol=chr(byte_val))
        heapq.heappush(heap, node)
    
    # Build tree by combining lowest-frequency nodes
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        
        merged = HuffmanNode(
            freq=left.freq + right.freq,
            left=left,
            right=right
        )
        heapq.heappush(heap, merged)
    
    return heap[0]

The greedy choice—always combining the two smallest—works because low-frequency symbols end up deeper in the tree (longer codes), while high-frequency symbols stay near the root (shorter codes).
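One way to see this without building node objects is to track only depths: every time a subtree is merged under a new parent, each symbol inside it moves one level deeper, and a symbol's final depth is its code length. A minimal sketch (the `huffman_code_lengths` helper is illustrative, not part of the implementation above):

```python
import heapq

def huffman_code_lengths(freqs):
    """Code length per symbol via greedy merging: each time a symbol's
    subtree is merged, that symbol moves one level deeper."""
    depth = {s: 0 for s in freqs}
    # Heap entries: (frequency, tie-breaker, symbols in this subtree)
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # every symbol in the merged subtree
            depth[s] += 1              # gains one more edge above it
        heapq.heappush(heap, (f1 + f2, tie, syms1 + syms2))
        tie += 1
    return depth

print(huffman_code_lengths({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}))
# {'a': 1, 'b': 3, 'r': 3, 'c': 3, 'd': 3}
```

The frequent ‘a’ is merged last and stays shallow; the rare ‘c’ and ‘d’ are merged first and end up deepest.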

Generating and Storing Codes

With the tree built, extract codes via depth-first traversal. Track the path as you descend: append ‘0’ for left, ‘1’ for right. When you hit a leaf, the accumulated path is that symbol’s code.

def generate_codes(root: Optional[HuffmanNode]) -> dict[str, str]:
    """Generate Huffman codes from tree via recursive traversal."""
    codes: dict[str, str] = {}
    
    def traverse(node: Optional[HuffmanNode], current_code: str):
        if node is None:
            return
        
        if node.is_leaf():
            # Use "0" for single-node edge case
            codes[node.symbol] = current_code if current_code else "0"
            return
        
        traverse(node.left, current_code + "0")
        traverse(node.right, current_code + "1")
    
    traverse(root, "")
    return codes

# Example usage
sample = b"abracadabra"
tree = build_huffman_tree(sample)
codes = generate_codes(tree)

print("Symbol frequencies:", Counter(sample))
print("Generated codes:")
for symbol, code in sorted(codes.items(), key=lambda x: len(x[1])):
    print(f"  '{symbol}': {code}")

For “abracadabra” (a:5, b:2, r:2, c:1, d:1), you’ll see ‘a’ gets a 1-bit code while ‘c’ and ‘d’ get 3-bit codes.

Canonical Huffman codes standardize the encoding for compact storage. Instead of storing the full tree, you store only code lengths per symbol. The decoder reconstructs codes deterministically: sort symbols by code length, then alphabetically within each length, and assign codes sequentially. DEFLATE and other formats use this technique.
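A sketch of that reconstruction rule, using the code lengths from the “abracadabra” example (the `canonical_codes` helper is illustrative; real formats like DEFLATE add further conventions on top):

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from per-symbol code lengths.
    Symbols are sorted by (length, symbol); each code is the previous
    code + 1, left-shifted whenever the code length increases."""
    code = 0
    prev_len = 0
    codes = {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # grow to the new length
        codes[sym] = format(code, f'0{length}b')
        code += 1
        prev_len = length
    return codes

# Only these lengths need to be stored; the codes follow deterministically.
print(canonical_codes({'a': 1, 'b': 3, 'c': 3, 'd': 3, 'r': 3}))
# {'a': '0', 'b': '100', 'c': '101', 'd': '110', 'r': '111'}
```

The resulting codes differ from the tree-derived ones bit for bit, but they have the same lengths, so compression is identical while the metadata shrinks to one length per symbol.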

Encoding and Decoding Implementation

Real implementation requires careful bit manipulation. You can’t write individual bits to files—you must pack them into bytes and handle boundaries.

class HuffmanCodec:
    def __init__(self, data: bytes):
        self.tree = build_huffman_tree(data)
        self.codes = generate_codes(self.tree)
        # Reverse mapping for encoding bytes
        self.encode_map = {ord(k): v for k, v in self.codes.items()}
    
    def encode(self, data: bytes) -> tuple[bytes, int]:
        """Encode data to compressed bytes. Returns (bytes, padding_bits)."""
        # Build bit string
        bits = []
        for byte in data:
            bits.append(self.encode_map[byte])
        
        bitstring = ''.join(bits)
        
        # Pack into bytes
        padding = (8 - len(bitstring) % 8) % 8
        bitstring += '0' * padding
        
        result = bytearray()
        for i in range(0, len(bitstring), 8):
            byte_str = bitstring[i:i+8]
            result.append(int(byte_str, 2))
        
        return bytes(result), padding
    
    def decode(self, encoded: bytes, padding: int, length: int) -> bytes:
        """Decode compressed bytes back to original data."""
        # Convert bytes to bit string
        bits = ''.join(f'{byte:08b}' for byte in encoded)
        
        # Remove padding
        if padding:
            bits = bits[:-padding]
        
        # Walk tree to decode
        result = bytearray()
        node = self.tree
        
        for bit in bits:
            node = node.left if bit == '0' else node.right
            
            if node.is_leaf():
                result.append(ord(node.symbol))
                node = self.tree
                
                if len(result) == length:
                    break
        
        return bytes(result)

# Test round-trip
original = b"abracadabra"
codec = HuffmanCodec(original)
encoded, padding = codec.encode(original)
decoded = codec.decode(encoded, padding, len(original))

print(f"Original:    {original} ({len(original)} bytes)")
print(f"Encoded:     {encoded.hex()} ({len(encoded)} bytes)")
print(f"Decoded:     {decoded}")
print(f"Compressed size: {len(encoded)/len(original)*100:.1f}% of original")

Performance Analysis

Huffman coding runs in O(n + k log k) time for n input bytes and k distinct symbols: a linear scan counts frequencies, then heap operations build the tree. Space is O(k) for the tree and code table.

How close to optimal is it? Shannon entropy defines the theoretical minimum bits per symbol. Huffman approaches this but can’t beat it. The limitation: Huffman assigns integer bit lengths. If the optimal code for a symbol is 2.3 bits, Huffman must round to 2 or 3.
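A quick illustration of that rounding cost, using a hypothetical two-symbol source skewed 90/10:

```python
import math

# Two-symbol source, heavily skewed: p = 0.9 and 0.1.
p = [0.9, 0.1]
entropy = -sum(q * math.log2(q) for q in p)

# Huffman can only assign whole-bit codes; with two symbols,
# each must get exactly 1 bit regardless of skew.
huffman_avg = 1.0

print(f"entropy: {entropy:.3f} bits/symbol")   # ~0.469
print(f"huffman: {huffman_avg:.3f} bits/symbol")
```

Here Huffman spends more than twice the theoretical minimum; fractional-bit coders like arithmetic coding close most of that gap.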

import math

def analyze_compression(data: bytes):
    """Compare Huffman to Shannon entropy."""
    freq = Counter(data)
    total = len(data)
    
    # Calculate Shannon entropy
    entropy = 0
    for count in freq.values():
        p = count / total
        entropy -= p * math.log2(p)
    
    # Build Huffman and calculate average code length
    codec = HuffmanCodec(data)
    avg_length = sum(
        len(codec.encode_map[byte]) * count 
        for byte, count in freq.items()
    ) / total
    
    print(f"Data size: {total} bytes, {len(freq)} unique symbols")
    print(f"Shannon entropy: {entropy:.3f} bits/symbol")
    print(f"Huffman average: {avg_length:.3f} bits/symbol")
    print(f"Efficiency: {entropy/avg_length*100:.1f}%")
    print(f"Theoretical min: {entropy * total / 8:.1f} bytes")
    
    encoded, _ = codec.encode(data)
    print(f"Actual compressed: {len(encoded)} bytes")

# Test on different data patterns
print("=== Highly skewed (lots of 'a') ===")
analyze_compression(b"a" * 100 + b"bcdef")

print("\n=== English-like text ===")
text = b"the quick brown fox jumps over the lazy dog " * 10
analyze_compression(text)

print("\n=== Random data (incompressible) ===")
import os
analyze_compression(os.urandom(100))

Huffman excels with skewed distributions. It struggles with uniform distributions (random data) and very small files where tree overhead dominates.

Real-World Applications and Variants

Huffman coding underpins ubiquitous formats:

  • DEFLATE (gzip, PNG, ZIP): Combines LZ77 dictionary coding with Huffman
  • JPEG: Huffman-encodes quantized DCT coefficients
  • MP3: Huffman coding in the final compression stage

Adaptive Huffman updates the tree as data streams through, useful when you can’t scan input twice. Canonical Huffman (mentioned earlier) minimizes metadata overhead.

Modern alternatives often outperform standard Huffman:

  • Arithmetic coding: Encodes entire messages as single fractions, achieving near-entropy compression
  • ANS (Asymmetric Numeral Systems): Arithmetic coding speed with simpler implementation; used in Zstandard

Here’s a minimal file compressor tying everything together:

import sys
import struct
import pickle

def compress_file(input_path: str, output_path: str):
    with open(input_path, 'rb') as f:
        data = f.read()
    
    if not data:
        print("Empty file")
        return
    
    codec = HuffmanCodec(data)
    encoded, padding = codec.encode(data)
    
    # Write: original_length (4 bytes) + padding (1 byte) + 
    #        code_table_length (4 bytes) + code_table + encoded_data
    code_table = pickle.dumps(codec.codes)
    
    with open(output_path, 'wb') as f:
        f.write(struct.pack('>I', len(data)))
        f.write(struct.pack('B', padding))
        f.write(struct.pack('>I', len(code_table)))
        f.write(code_table)
        f.write(encoded)
    
    ratio = len(encoded) / len(data) * 100
    print(f"Compressed {len(data)} -> {len(encoded)} bytes ({ratio:.1f}%)")

def decompress_file(input_path: str, output_path: str):
    with open(input_path, 'rb') as f:
        original_length = struct.unpack('>I', f.read(4))[0]
        padding = struct.unpack('B', f.read(1))[0]
        table_length = struct.unpack('>I', f.read(4))[0]
        code_table = pickle.loads(f.read(table_length))
        encoded = f.read()
    
    # Rebuild tree from codes (simplified: use codes directly)
    # In production, rebuild tree or use canonical decoding
    decode_map = {code: symbol for symbol, code in code_table.items()}
    
    bits = ''.join(f'{byte:08b}' for byte in encoded)
    if padding:
        bits = bits[:-padding]
    
    result = bytearray()
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in decode_map:
            result.append(ord(decode_map[buffer]))
            buffer = ""
            if len(result) == original_length:
                break
    
    with open(output_path, 'wb') as f:
        f.write(result)
    
    print(f"Decompressed {len(encoded)} -> {len(result)} bytes")

if __name__ == "__main__":
    # Usage: python huffman.py compress input.txt output.huf
    #        python huffman.py decompress output.huf restored.txt
    command, src, dst = sys.argv[1:4]
    if command == "compress":
        compress_file(src, dst)
    else:
        decompress_file(src, dst)

Huffman coding isn’t cutting-edge, but understanding it illuminates how compression works fundamentally. The prefix-free property, the greedy tree construction, the entropy bounds—these concepts recur throughout information theory. Master Huffman, and you’ve built intuition for everything from video codecs to machine learning tokenizers.
