Huffman Coding: Prefix-Free Compression
Key Insights
- Huffman coding achieves optimal prefix-free compression by assigning shorter bit sequences to frequent symbols and longer sequences to rare ones, guaranteeing unambiguous decoding without delimiters.
- The algorithm’s greedy approach—repeatedly combining the two lowest-frequency nodes—produces a binary tree where no code is a prefix of another, enabling single-pass decoding.
- While Huffman coding forms the backbone of formats like DEFLATE, JPEG, and MP3, it’s limited to integer bit lengths; modern alternatives like arithmetic coding and ANS can approach Shannon entropy more closely.
The Problem of Data Compression
Every byte you transmit or store costs something. Compression reduces that cost by exploiting redundancy in data. Lossless compression—where the original data is perfectly recoverable—relies on a simple insight: not all symbols appear equally often.
If ‘e’ appears 1000 times in your text and ‘z’ appears twice, why should both consume 8 bits per occurrence? Variable-length encoding assigns shorter codes to frequent symbols. But this creates a problem.
Consider this naive encoding:
# Naive variable-length encoding (BROKEN)
codes = {
    'a': '0',
    'b': '01',
    'c': '1',
}
# Encode "abc"
encoded = "0" + "01" + "1" # Result: "0011"
# Now try to decode "0011"
# Is it "a" + "a" + "c" + "c"? (0, 0, 1, 1)
# Or "a" + "b" + "c"? (0, 01, 1)
# Ambiguous!
The decoder can’t tell where one symbol ends and another begins. You could add delimiters, but that defeats the purpose of compression. You need codes that are inherently unambiguous.
The Prefix-Free Property
A prefix-free code (also called a prefix code) has one constraint: no codeword is a prefix of any other codeword. If ‘a’ maps to “01”, no other symbol can start with “01”.
This property enables instant, unambiguous decoding. As you read bits left to right, the moment you match a codeword, you know you’ve found a complete symbol—no valid codeword could extend further.
def demonstrate_prefix_property():
    # Prefix-free codes (VALID)
    prefix_free = {
        'a': '0',
        'b': '10',
        'c': '110',
        'd': '111',
    }
    # Non-prefix codes (INVALID)
    non_prefix = {
        'a': '0',
        'b': '01',   # '0' is a prefix of '01' - problem!
        'c': '011',
    }
    # Decode "01110" with the prefix-free codes
    bitstring = "01110"
    result = []
    # Build reverse lookup
    decode_map = {v: k for k, v in prefix_free.items()}
    buffer = ""
    for bit in bitstring:
        buffer += bit
        if buffer in decode_map:
            result.append(decode_map[buffer])
            buffer = ""
    print(f"Decoded: {''.join(result)}")  # Output: "ada" (0, 111, 0)
    return result

demonstrate_prefix_property()
Prefix-free codes map naturally to binary trees. Each symbol sits at a leaf node. Left edges represent ‘0’, right edges represent ‘1’. The path from root to leaf defines the code. Since symbols only occupy leaves, no code can be a prefix of another—you’d have to pass through a leaf to reach another leaf.
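This tree view can be sketched with nested dicts standing in for tree nodes (`build_trie` and `decode_with_trie` are illustrative helper names, not part of the implementation that follows):

```python
def build_trie(codes: dict[str, str]) -> dict:
    """Insert each codeword into a binary trie; symbols live only at leaves."""
    root: dict = {}
    for symbol, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node['symbol'] = symbol  # leaf marker
    return root

def decode_with_trie(bits: str, root: dict) -> str:
    """Walk the trie bit by bit; emit a symbol at each leaf, restart at the root."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if 'symbol' in node:
            out.append(node['symbol'])
            node = root
    return ''.join(out)

trie = build_trie({'a': '0', 'b': '10', 'c': '110', 'd': '111'})
print(decode_with_trie("01110", trie))  # -> "ada"
```

Because symbols sit only at leaves, the walk never has to guess where a codeword ends: hitting a leaf is the end.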
Building the Huffman Tree
David Huffman’s 1952 algorithm constructs an optimal prefix-free code using a greedy approach. “Optimal” means no other prefix-free code produces a shorter expected output for the given symbol frequencies.
The algorithm:
- Count symbol frequencies
- Create a leaf node for each symbol
- Insert all nodes into a priority queue (min-heap) ordered by frequency
- While more than one node remains:
  - Extract the two nodes with lowest frequency
  - Create a new internal node with these as children
  - Set the new node’s frequency to the sum of its children
  - Insert the new node back into the queue
- The remaining node is the root
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffmanNode:
    freq: int
    symbol: Optional[str] = field(compare=False, default=None)
    left: Optional['HuffmanNode'] = field(compare=False, default=None)
    right: Optional['HuffmanNode'] = field(compare=False, default=None)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def build_huffman_tree(data: bytes) -> Optional[HuffmanNode]:
    """Build Huffman tree from byte data."""
    if not data:
        return None
    # Count frequencies
    freq_count = Counter(data)
    # Handle single-symbol edge case
    if len(freq_count) == 1:
        symbol, freq = next(iter(freq_count.items()))
        leaf = HuffmanNode(freq=freq, symbol=chr(symbol))
        return HuffmanNode(freq=freq, left=leaf)
    # Create priority queue with leaf nodes
    heap: list[HuffmanNode] = []
    for byte_val, freq in freq_count.items():
        node = HuffmanNode(freq=freq, symbol=chr(byte_val))
        heapq.heappush(heap, node)
    # Build tree by combining lowest-frequency nodes
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HuffmanNode(
            freq=left.freq + right.freq,
            left=left,
            right=right
        )
        heapq.heappush(heap, merged)
    return heap[0]
The greedy choice—always combining the two smallest—works because low-frequency symbols end up deeper in the tree (longer codes), while high-frequency symbols stay near the root (shorter codes).
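To see the greedy order concretely, here is a trace of the merge sums for the “abracadabra” frequencies (a:5, b:2, r:2, c:1, d:1); plain integers stand in for nodes, since only the sums matter for the merge order:

```python
import heapq

freqs = [5, 2, 2, 1, 1]  # frequencies for "abracadabra"
heapq.heapify(freqs)
merges = []
while len(freqs) > 1:
    lo = heapq.heappop(freqs)   # smallest remaining
    hi = heapq.heappop(freqs)   # second smallest
    merges.append(lo + hi)
    heapq.heappush(freqs, lo + hi)
print(merges)  # [2, 4, 6, 11] - rarest symbols merge first; the root sums to 11
```

The two singletons (c, d) merge first and end up deepest; ‘a’ joins only at the final merge and stays one edge from the root.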
Generating and Storing Codes
With the tree built, extract codes via depth-first traversal. Track the path as you descend: append ‘0’ for left, ‘1’ for right. When you hit a leaf, the accumulated path is that symbol’s code.
def generate_codes(root: Optional[HuffmanNode]) -> dict[str, str]:
    """Generate Huffman codes from tree via recursive traversal."""
    codes: dict[str, str] = {}

    def traverse(node: Optional[HuffmanNode], current_code: str):
        if node is None:
            return
        if node.is_leaf():
            # Use "0" for single-node edge case
            codes[node.symbol] = current_code if current_code else "0"
            return
        traverse(node.left, current_code + "0")
        traverse(node.right, current_code + "1")

    traverse(root, "")
    return codes

# Example usage
sample = b"abracadabra"
tree = build_huffman_tree(sample)
codes = generate_codes(tree)

print("Symbol frequencies:", Counter(sample))
print("Generated codes:")
for symbol, code in sorted(codes.items(), key=lambda x: len(x[1])):
    print(f"  '{symbol}': {code}")
For “abracadabra” (a:5, b:2, r:2, c:1, d:1), you’ll see ‘a’ gets a 1-bit code while ‘c’ and ‘d’ get 3-bit codes.
Canonical Huffman codes standardize the encoding for compact storage. Instead of storing the full tree, you store only code lengths per symbol. The decoder reconstructs codes deterministically: sort symbols by code length, then alphabetically within each length, and assign codes sequentially. DEFLATE and other formats use this technique.
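The canonical assignment can be sketched in a few lines; `canonical_codes` is an illustrative helper name, not a standard API. Given only per-symbol code lengths, it hands out consecutive integer codewords, shifting left whenever the length grows:

```python
def canonical_codes(lengths: dict[str, int]) -> dict[str, str]:
    """Rebuild a prefix-free code from code lengths alone:
    sort by (length, symbol), then assign consecutive integers,
    left-shifting the counter each time the code length increases."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes: dict[str, str] = {}
    code = 0
    prev_len = 0
    for s in symbols:
        code <<= lengths[s] - prev_len  # make room for the longer length
        codes[s] = format(code, f'0{lengths[s]}b')
        code += 1
        prev_len = lengths[s]
    return codes

print(canonical_codes({'a': 1, 'b': 2, 'c': 3, 'd': 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Both encoder and decoder run this same deterministic procedure, so only the lengths ever need to be transmitted.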
Encoding and Decoding Implementation
Real implementation requires careful bit manipulation. You can’t write individual bits to files—you must pack them into bytes and handle boundaries.
class HuffmanCodec:
    def __init__(self, data: bytes):
        self.tree = build_huffman_tree(data)
        self.codes = generate_codes(self.tree)
        # Reverse mapping for encoding bytes
        self.encode_map = {ord(k): v for k, v in self.codes.items()}

    def encode(self, data: bytes) -> tuple[bytes, int]:
        """Encode data to compressed bytes. Returns (bytes, padding_bits)."""
        # Build bit string
        bits = []
        for byte in data:
            bits.append(self.encode_map[byte])
        bitstring = ''.join(bits)
        # Pack into bytes
        padding = (8 - len(bitstring) % 8) % 8
        bitstring += '0' * padding
        result = bytearray()
        for i in range(0, len(bitstring), 8):
            byte_str = bitstring[i:i+8]
            result.append(int(byte_str, 2))
        return bytes(result), padding

    def decode(self, encoded: bytes, padding: int, length: int) -> bytes:
        """Decode compressed bytes back to original data."""
        # Convert bytes to bit string
        bits = ''.join(f'{byte:08b}' for byte in encoded)
        # Remove padding
        if padding:
            bits = bits[:-padding]
        # Walk tree to decode
        result = bytearray()
        node = self.tree
        for bit in bits:
            node = node.left if bit == '0' else node.right
            if node.is_leaf():
                result.append(ord(node.symbol))
                node = self.tree
                if len(result) == length:
                    break
        return bytes(result)

# Test round-trip
original = b"abracadabra"
codec = HuffmanCodec(original)
encoded, padding = codec.encode(original)
decoded = codec.decode(encoded, padding, len(original))

print(f"Original: {original} ({len(original)} bytes)")
print(f"Encoded:  {encoded.hex()} ({len(encoded)} bytes)")
print(f"Decoded:  {decoded}")
print(f"Compressed size: {len(encoded)/len(original)*100:.1f}% of original")
Performance Analysis
Huffman coding runs in O(n + k log k) time for n input bytes and k distinct symbols: a linear scan counts frequencies, then each of the k − 1 merges costs O(log k) heap operations. Space is O(k) for the tree and code table.
How close to optimal is it? Shannon entropy defines the theoretical minimum bits per symbol, and no lossless code can beat it. Huffman’s average code length is guaranteed to land within one bit of the entropy, but it can’t always close the gap: Huffman assigns integer bit lengths, so if a symbol’s ideal code length is 2.3 bits, Huffman must use 2 or 3.
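A toy two-symbol source (not from the examples above) makes the integer-bit gap stark. With probabilities 0.9 and 0.1, the entropy is about 0.47 bits/symbol, but any prefix-free code must still spend a whole bit on each symbol:

```python
import math

p = 0.9
# Shannon entropy of a 0.9 / 0.1 source
entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
huffman_avg = 1.0  # both symbols must receive a 1-bit code
print(f"entropy={entropy:.3f} bits, huffman={huffman_avg:.1f} bits")
# entropy=0.469 bits, huffman=1.0 bits
```

This is why grouping symbols into blocks, or switching to arithmetic coding, can recover the fractional bits Huffman leaves on the table.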
import math
import os

def analyze_compression(data: bytes):
    """Compare Huffman to Shannon entropy."""
    freq = Counter(data)
    total = len(data)
    # Calculate Shannon entropy
    entropy = 0
    for count in freq.values():
        p = count / total
        entropy -= p * math.log2(p)
    # Build Huffman and calculate average code length
    codec = HuffmanCodec(data)
    avg_length = sum(
        len(codec.encode_map[byte]) * count
        for byte, count in freq.items()
    ) / total
    print(f"Data size: {total} bytes, {len(freq)} unique symbols")
    print(f"Shannon entropy: {entropy:.3f} bits/symbol")
    print(f"Huffman average: {avg_length:.3f} bits/symbol")
    print(f"Efficiency: {entropy/avg_length*100:.1f}%")
    print(f"Theoretical min: {entropy * total / 8:.1f} bytes")
    encoded, _ = codec.encode(data)
    print(f"Actual compressed: {len(encoded)} bytes")

# Test on different data patterns
print("=== Highly skewed (lots of 'a') ===")
analyze_compression(b"a" * 100 + b"bcdef")

print("\n=== English-like text ===")
text = b"the quick brown fox jumps over the lazy dog " * 10
analyze_compression(text)

print("\n=== Random data (incompressible) ===")
analyze_compression(os.urandom(100))
Huffman excels with skewed distributions. It struggles with uniform distributions (random data) and very small files where tree overhead dominates.
Real-World Applications and Variants
Huffman coding underpins ubiquitous formats:
- DEFLATE (gzip, PNG, ZIP): Combines LZ77 dictionary coding with Huffman
- JPEG: Huffman-encodes quantized DCT coefficients
- MP3: Huffman coding in the final compression stage
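Since DEFLATE pairs LZ77 with Huffman coding, Python’s standard `zlib` module gives a quick feel for the combination on repetitive input:

```python
import zlib

# Repetitive text compresses well under DEFLATE (LZ77 + Huffman)
text = b"the quick brown fox jumps over the lazy dog " * 10
compressed = zlib.compress(text)
print(len(text), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == text  # lossless round-trip
```

The exact compressed size depends on the zlib version and level, but for this input it lands far below the original 440 bytes.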
Adaptive Huffman updates the tree as data streams through, useful when you can’t scan input twice. Canonical Huffman (mentioned earlier) minimizes metadata overhead.
Modern alternatives often outperform standard Huffman:
- Arithmetic coding: Encodes entire messages as single fractions, achieving near-entropy compression
- ANS (Asymmetric Numeral Systems): Arithmetic coding speed with simpler implementation; used in Zstandard
Here’s a minimal file compressor tying everything together:
import sys
import struct
import pickle

def compress_file(input_path: str, output_path: str):
    with open(input_path, 'rb') as f:
        data = f.read()
    if not data:
        print("Empty file")
        return
    codec = HuffmanCodec(data)
    encoded, padding = codec.encode(data)
    # Write: original_length (4 bytes) + padding (1 byte) +
    #        code_table_length (4 bytes) + code_table + encoded_data
    code_table = pickle.dumps(codec.codes)
    with open(output_path, 'wb') as f:
        f.write(struct.pack('>I', len(data)))
        f.write(struct.pack('B', padding))
        f.write(struct.pack('>I', len(code_table)))
        f.write(code_table)
        f.write(encoded)
    ratio = len(encoded) / len(data) * 100
    print(f"Compressed {len(data)} -> {len(encoded)} bytes ({ratio:.1f}%)")

def decompress_file(input_path: str, output_path: str):
    with open(input_path, 'rb') as f:
        original_length = struct.unpack('>I', f.read(4))[0]
        padding = struct.unpack('B', f.read(1))[0]
        table_length = struct.unpack('>I', f.read(4))[0]
        code_table = pickle.loads(f.read(table_length))
        encoded = f.read()
    # Rebuild a code->symbol lookup (simplified: match codes directly)
    # In production, rebuild the tree or use canonical decoding
    decode_map = {code: symbol for symbol, code in code_table.items()}
    bits = ''.join(f'{byte:08b}' for byte in encoded)
    if padding:
        bits = bits[:-padding]
    result = bytearray()
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in decode_map:
            result.append(ord(decode_map[buffer]))
            buffer = ""
            if len(result) == original_length:
                break
    with open(output_path, 'wb') as f:
        f.write(result)
    print(f"Decompressed {len(encoded)} -> {len(result)} bytes")

# Usage: python huffman.py compress input.txt output.huf
#        python huffman.py decompress output.huf restored.txt
if __name__ == "__main__":
    command, in_path, out_path = sys.argv[1], sys.argv[2], sys.argv[3]
    if command == "compress":
        compress_file(in_path, out_path)
    elif command == "decompress":
        decompress_file(in_path, out_path)
Huffman coding isn’t cutting-edge, but understanding it illuminates how compression works fundamentally. The prefix-free property, the greedy tree construction, the entropy bounds—these concepts recur throughout information theory. Master Huffman, and you’ve built intuition for everything from video codecs to machine learning tokenizers.