LZ77 and LZ78: Dictionary-Based Compression

Key Insights

  • LZ77 uses a sliding window to find repeated patterns, keeping memory bounded at the cost of slower match searches, while LZ78 builds an explicit dictionary that grows during encoding
  • Dictionary-based compression fundamentally differs from statistical methods by exploiting repeated sequences rather than symbol frequency, making it ideal for structured data like text and code
  • Modern compression tools and formats like gzip, PNG, and Zstandard all descend from these 1970s algorithms, so understanding LZ77/LZ78 remains valuable for any engineer working with data compression

Introduction to Dictionary-Based Compression

Statistical compression methods like Huffman coding and arithmetic coding work by assigning shorter codes to more frequent symbols. They’re elegant, but they miss something obvious: real-world data contains repeated patterns, not just repeated characters.

Dictionary-based compression takes a different approach. Instead of encoding individual symbols based on frequency, it identifies repeated sequences and replaces them with references to earlier occurrences. The “dictionary” is simply a lookup table of patterns the algorithm has already seen.

Consider compressing the string “ABRACADABRA”. A Huffman encoder would assign codes to each letter based on frequency. A dictionary-based encoder would notice that “ABRA” appears twice and replace the second occurrence with a reference to the first. This fundamental insight—that redundancy exists at the sequence level—makes dictionary methods dominant in practical compression.

Abraham Lempel and Jacob Ziv published two foundational algorithms in 1977 and 1978, now known as LZ77 and LZ78. Nearly every compression tool you use today descends from one of these.

LZ77: Sliding Window Compression

LZ77 maintains a “sliding window” over the data stream. This window has two parts: a search buffer containing recently processed data, and a lookahead buffer containing data yet to be encoded. The algorithm scans the search buffer for the longest match to the beginning of the lookahead buffer.

Each output token contains three values: an offset (how far back the match starts), a length (how many characters match), and the next character after the match. If no match exists, the token uses offset=0, length=0, and outputs just the literal character.

Here’s a practical LZ77 encoder:

def lz77_encode(data: str, window_size: int = 4096, lookahead_size: int = 18) -> list[tuple[int, int, str]]:
    """
    Encode data using LZ77 algorithm.
    Returns list of (offset, length, next_char) tokens.
    """
    tokens = []
    pos = 0

    while pos < len(data):
        best_offset = 0
        best_length = 0

        # Define search window and lookahead boundaries
        search_start = max(0, pos - window_size)
        lookahead_end = min(pos + lookahead_size, len(data))

        # Search for the longest match in the window
        for offset in range(1, pos - search_start + 1):
            match_start = pos - offset
            length = 0

            # Extend the match as far as possible. Reading past pos is
            # deliberate: when offset < length, the match repeats into the
            # lookahead (run-length behavior), which the decoder reproduces
            # by copying one character at a time.
            while (pos + length < lookahead_end and
                   data[match_start + length] == data[pos + length]):
                length += 1

            if length > best_length:
                best_length = length
                best_offset = offset

        # Get the next character after the match
        next_pos = pos + best_length
        next_char = data[next_pos] if next_pos < len(data) else ''

        tokens.append((best_offset, best_length, next_char))
        pos = next_pos + 1

    return tokens

# Example usage
text = "ABRACADABRA"
encoded = lz77_encode(text, window_size=10, lookahead_size=5)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
# Output: [(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'R'), (3, 1, 'C'), (2, 1, 'D'), (7, 4, '')]

The window size creates a direct trade-off: larger windows find more matches but require more memory and slower searches. Most implementations use 32KB windows, balancing compression ratio against resource usage.
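You can observe this trade-off directly with Python's zlib, whose wbits parameter sets the DEFLATE history window to 2**wbits bytes. A minimal sketch, using a payload whose only long-range redundancy is a 40-byte block repeated roughly 40 KB apart (exact sizes will vary by zlib version):

```python
import zlib

def deflate_size(data: bytes, wbits: int, level: int = 9) -> int:
    """Compress with a 2**wbits-byte history window; return output size."""
    co = zlib.compressobj(level=level, wbits=wbits)
    return len(co.compress(data) + co.flush())

# The only long-range redundancy: a 40-byte block repeated ~40 KB apart
data = b"pattern " * 5 + b"x" * 40_000 + b"pattern " * 5

for wbits in (9, 12, 15):  # 512 B, 4 KB, and 32 KB windows
    print(f"window={2**wbits:>5} bytes: {deflate_size(data, wbits)} compressed bytes")
```

With a 512-byte window, the second "pattern" block lies outside the window and must be re-emitted as literals; the 32 KB window can reference the first occurrence.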

LZ77 Decompression

Decompression is straightforward and fast. For each token, copy length characters starting offset positions back in the already-decoded output, then append the literal character. No searching required.

def lz77_decode(tokens: list[tuple[int, int, str]]) -> str:
    """
    Decode LZ77 tokens back to original data.
    """
    output = []
    
    for offset, length, next_char in tokens:
        if length > 0:
            # Copy from already-decoded output
            start = len(output) - offset
            for i in range(length):
                # Handle overlapping copies (run-length case)
                output.append(output[start + i])
        
        if next_char:
            output.append(next_char)
    
    return ''.join(output)

# Verify round-trip
decoded = lz77_decode(encoded)
print(f"Decoded: {decoded}")
assert decoded == text

This asymmetry—slow compression, fast decompression—makes LZ77 ideal for scenarios where data is compressed once but decompressed many times (distributing software, serving web content).

LZ78: Explicit Dictionary Approach

LZ78 abandons the sliding window for an explicit dictionary that grows during encoding. Each entry maps an index to a string. The algorithm finds the longest dictionary match, outputs its index plus the next character, then adds the combined string as a new dictionary entry.

def lz78_encode(data: str) -> list[tuple[int, str]]:
    """
    Encode data using LZ78 algorithm.
    Returns list of (dictionary_index, next_char) tokens.
    Index 0 represents empty string (no match).
    """
    dictionary = {'': 0}  # Empty string maps to index 0
    next_index = 1
    tokens = []
    
    pos = 0
    while pos < len(data):
        # Find longest match in dictionary
        current = ''
        while pos < len(data) and current + data[pos] in dictionary:
            current += data[pos]
            pos += 1
        
        # Get the next character (if any)
        next_char = data[pos] if pos < len(data) else ''
        
        # Output token: (index of current match, next character)
        tokens.append((dictionary[current], next_char))
        
        # Add new entry to dictionary
        if next_char:
            dictionary[current + next_char] = next_index
            next_index += 1
            pos += 1
    
    return tokens

def lz78_decode(tokens: list[tuple[int, str]]) -> str:
    """
    Decode LZ78 tokens back to original data.
    """
    dictionary = {0: ''}  # Index 0 is empty string
    next_index = 1
    output = []
    
    for index, next_char in tokens:
        # Reconstruct string from dictionary
        string = dictionary[index] + next_char
        output.append(string)
        
        # Add to dictionary (same as encoder)
        if next_char:
            dictionary[next_index] = string
            next_index += 1
    
    return ''.join(output)

# Example
text = "ABRACADABRA"
lz78_encoded = lz78_encode(text)
print(f"LZ78 Encoded: {lz78_encoded}")
# Output: [(0, 'A'), (0, 'B'), (0, 'R'), (1, 'C'), (1, 'D'), (1, 'B'), (3, 'A')]

lz78_decoded = lz78_decode(lz78_encoded)
assert lz78_decoded == text

The dictionary grows unbounded unless reset. Practical implementations either cap dictionary size and reset when full, or use LZW (a variant that eliminates the explicit next character, improving compression).
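Since LZW comes up repeatedly below, here is a minimal sketch of the idea: pre-seed the dictionary with every single character so that tokens are bare indices with no explicit next character. This is illustrative only; real LZW packs indices into 9-16 bit codes and caps the dictionary.

```python
def lzw_encode(data: str) -> list[int]:
    """LZW: LZ78 with the dictionary pre-seeded with all single
    characters, so tokens are bare indices (no explicit char)."""
    dictionary = {chr(i): i for i in range(256)}
    next_index = 256
    tokens = []
    current = ''
    for ch in data:
        if current + ch in dictionary:
            current += ch  # keep extending the match
        else:
            tokens.append(dictionary[current])
            dictionary[current + ch] = next_index  # learn the new string
            next_index += 1
            current = ch
    if current:
        tokens.append(dictionary[current])
    return tokens

def lzw_decode(tokens: list[int]) -> str:
    if not tokens:
        return ''
    dictionary = {i: chr(i) for i in range(256)}
    next_index = 256
    prev = dictionary[tokens[0]]
    output = [prev]
    for code in tokens[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # Special case: the encoder used a code it created one step
            # earlier (the "cScSc" pattern); reconstruct as prev + prev[0]
            entry = prev + prev[0]
        output.append(entry)
        dictionary[next_index] = prev + entry[0]
        next_index += 1
        prev = entry
    return ''.join(output)

print(lzw_encode("ABRACADABRA"))
# [65, 66, 82, 65, 67, 65, 68, 256, 258] -- 9 tokens for 11 characters
```

Note the decoder's special case: the encoder can emit a code one step before the decoder has built it, which only happens for strings of the form cScSc.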

Practical Comparison: LZ77 vs. LZ78

Let’s benchmark both algorithms on realistic data:

import time
import random
import string

def generate_test_data(size: int, repetition_factor: float = 0.3) -> str:
    """Generate test data with controlled repetition."""
    base = ''.join(random.choices(string.ascii_letters, k=int(size * (1 - repetition_factor))))
    # Add repetitive patterns
    patterns = ['the ', 'and ', 'ing ', 'tion']
    result = list(base)
    for _ in range(int(size * repetition_factor / 4)):
        pos = random.randint(0, len(result))
        result.insert(pos, random.choice(patterns))
    return ''.join(result)[:size]

def benchmark(data: str):
    """Compare LZ77 and LZ78 on given data."""
    print(f"Original size: {len(data)} bytes\n")
    
    # LZ77
    start = time.perf_counter()
    lz77_tokens = lz77_encode(data, window_size=4096)
    lz77_encode_time = time.perf_counter() - start
    
    start = time.perf_counter()
    lz77_decoded = lz77_decode(lz77_tokens)
    lz77_decode_time = time.perf_counter() - start
    
    # Estimate compressed size (simplified)
    lz77_size = len(lz77_tokens) * 3  # Rough estimate
    
    # LZ78
    start = time.perf_counter()
    lz78_tokens = lz78_encode(data)
    lz78_encode_time = time.perf_counter() - start
    
    start = time.perf_counter()
    lz78_decoded = lz78_decode(lz78_tokens)
    lz78_decode_time = time.perf_counter() - start
    
    lz78_size = len(lz78_tokens) * 2  # Rough estimate
    
    print(f"LZ77: {len(lz77_tokens)} tokens, ~{lz77_size} bytes")
    print(f"  Encode: {lz77_encode_time*1000:.2f}ms, Decode: {lz77_decode_time*1000:.2f}ms")
    print(f"\nLZ78: {len(lz78_tokens)} tokens, ~{lz78_size} bytes")
    print(f"  Encode: {lz78_encode_time*1000:.2f}ms, Decode: {lz78_decode_time*1000:.2f}ms")

# Run benchmark
test_data = generate_test_data(5000)
benchmark(test_data)

Key observations from real-world usage:

  • LZ77 uses bounded memory (the window size is fixed) but a naive match search makes compression O(n × w), where w is the window size
  • LZ78 uses unbounded memory (dictionary grows) but has O(n) complexity with hash table lookups
  • LZ77 typically achieves better compression on files with long-range repetition
  • LZ78 variants handle streaming data more naturally

Real implementations: gzip uses DEFLATE (LZ77 + Huffman), while the Unix compress utility used LZW (an LZ78 variant).

Modern Descendants and Applications

DEFLATE combines LZ77 with Huffman coding, encoding the offset/length pairs efficiently. It powers ZIP files, PNG images, and HTTP compression. Understanding LZ77 helps you tune gzip compression levels: higher levels search the (fixed 32 KB) window more exhaustively, spending more time to find longer matches.
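A quick way to see the level trade-off, using Python's zlib (which implements DEFLATE); the timings are machine-dependent:

```python
import time
import zlib

# Repetitive sample text, the kind of input DEFLATE handles well
data = b"the quick brown fox jumps over the lazy dog. " * 2000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(compressed):>6} bytes in {elapsed:.2f} ms")
```

Higher levels should never decompress slower: the output is still plain DEFLATE, so the extra effort is paid only once, at compression time.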

The LZW algorithm (LZ78 with implicit next character) was patented, causing the GIF format controversy in the 1990s. This drove PNG adoption and influenced the open-source community’s approach to compression algorithms. The patents expired in 2003-2004.

Modern algorithms build on these foundations:

  • LZ4: Optimized LZ77 variant prioritizing speed over ratio. Used in Linux kernel, databases
  • Zstandard (zstd): Facebook’s algorithm combining LZ77 with entropy coding and a dictionary. Excellent ratio/speed balance
  • Snappy: Google’s fast compressor for real-time applications

Choose dictionary-based compression when your data has repeated sequences (text, code, logs, structured data). Statistical methods alone work better for random-looking data with skewed symbol distributions.

Conclusion

LZ77 and LZ78 represent two approaches to the same insight: repeated patterns should be stored once and referenced thereafter. LZ77’s sliding window keeps memory bounded at the cost of slower match searches, while LZ78’s explicit dictionary enables fast lookups at the cost of unbounded growth.

For practical work, use established implementations—zlib for DEFLATE compatibility, zstd for modern applications, LZ4 when speed matters most. But understanding these foundational algorithms helps you choose the right tool, tune parameters effectively, and debug compression issues when they arise.

For deeper exploration, read the original papers (IEEE Transactions on Information Theory, 1977 and 1978) and study the zlib source code—it’s a masterclass in practical algorithm implementation.
