Data Compression: Algorithms and Trade-offs
Key Insights
- Compression algorithms sit on a spectrum between speed and ratio—zstd and lz4 favor speed, while LZMA maximizes compression at significant CPU cost. Your choice depends on whether you’re optimizing for storage, bandwidth, or latency.
- Dictionary-based algorithms (LZ family) dominate general-purpose compression because they exploit the repetitive patterns found in most real-world data, from text to binary formats.
- Never compress already-compressed data. JPEG, PNG, MP4, and encrypted content won’t compress further and may actually grow larger with the overhead of compression metadata.
Introduction to Data Compression
Data compression reduces storage costs, speeds up network transfers, and can even improve application performance by reducing I/O bottlenecks. Every time you load a webpage, stream a video, or download an app, compression algorithms are working behind the scenes.
Compression falls into two categories: lossless and lossy. Lossless compression guarantees perfect reconstruction of the original data—essential for source code, databases, and executables. Lossy compression sacrifices some fidelity for dramatically better compression ratios, making it ideal for media where human perception can’t detect the difference.
Understanding these algorithms isn’t just academic. Choosing the wrong compression strategy can tank your application’s performance or balloon your infrastructure costs.
Lossless Compression Fundamentals
All lossless compression exploits one principle: redundancy. Data rarely uses all possible bit patterns with equal frequency. Entropy encoding assigns shorter codes to frequent symbols and longer codes to rare ones.
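To make that concrete, here is a minimal sketch (a hypothetical `shannon_entropy` helper, standard library only) that computes the Shannon entropy of a string: the theoretical lower bound, in bits per symbol, that any lossless symbol-by-symbol coder can approach.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Average bits per symbol an ideal entropy coder would need."""
    counts = Counter(data)
    total = len(data)
    # p * log2(1/p), summed over the symbol alphabet
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy("aaaaaaaa"))     # a single repeated symbol carries no information
print(shannon_entropy("abracadabra"))  # ≈ 2.04 bits/symbol, well under the 8 bits of ASCII
```

Comparing this bound against the Huffman output below shows how close a simple coder gets in practice.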
Huffman coding builds a binary tree where frequent symbols sit near the root (short paths) and rare symbols sit deeper (long paths). It’s optimal for symbol-by-symbol encoding and forms the backbone of many compression formats.
```python
import heapq
from collections import Counter
from typing import Dict, Tuple, Optional


class HuffmanNode:
    def __init__(self, char: Optional[str], freq: int):
        self.char = char
        self.freq = freq
        self.left: Optional[HuffmanNode] = None
        self.right: Optional[HuffmanNode] = None

    def __lt__(self, other: "HuffmanNode") -> bool:
        return self.freq < other.freq


def build_huffman_tree(text: str) -> HuffmanNode:
    frequency = Counter(text)
    heap = [HuffmanNode(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HuffmanNode(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        heapq.heappush(heap, merged)
    return heap[0]


def build_codes(node: HuffmanNode, prefix: str = "",
                codes: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    if codes is None:
        codes = {}
    if node.char is not None:
        # Leaf: "prefix or '0'" handles the single-symbol edge case
        codes[node.char] = prefix or "0"
    else:
        if node.left:
            build_codes(node.left, prefix + "0", codes)
        if node.right:
            build_codes(node.right, prefix + "1", codes)
    return codes


def huffman_encode(text: str) -> Tuple[str, HuffmanNode]:
    tree = build_huffman_tree(text)
    codes = build_codes(tree)
    encoded = "".join(codes[char] for char in text)
    return encoded, tree


def huffman_decode(encoded: str, tree: HuffmanNode) -> str:
    decoded = []
    node = tree
    for bit in encoded:
        node = node.left if bit == "0" else node.right
        if node.char is not None:
            decoded.append(node.char)
            node = tree
    return "".join(decoded)


# Example usage
text = "abracadabra"
encoded, tree = huffman_encode(text)
decoded = huffman_decode(encoded, tree)
print(f"Original: {text} ({len(text) * 8} bits)")
print(f"Encoded: {encoded} ({len(encoded)} bits)")
print(f"Compression ratio: {len(text) * 8 / len(encoded):.2f}x")
```
Arithmetic coding goes further by encoding entire messages as a single fractional number, achieving compression closer to the theoretical entropy limit. It’s more complex to implement but used in modern formats like JPEG 2000 and H.265.
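A production arithmetic coder works on integer ranges with careful renormalization; the float-based toy below (hypothetical `arith_encode`/`arith_decode` helpers) only illustrates the interval-narrowing idea, and assumes the decoder already knows the message length and the symbol probabilities.

```python
from typing import Dict, List


def arith_encode(message: str, probs: Dict[str, float]) -> float:
    """Narrow [0, 1) once per symbol; any number in the final interval encodes the message."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if sym == symbol:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2


def arith_decode(value: float, length: int, probs: Dict[str, float]) -> str:
    out: List[str] = []
    low, high = 0.0, 1.0
    for _ in range(length):
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if low + span * cum <= value < low + span * (cum + p):
                out.append(sym)
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return "".join(out)


probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code = arith_encode("abcab", probs)
print(code, arith_decode(code, 5, probs))
```

Floats run out of precision after a few dozen symbols, which is exactly why real implementations renormalize integer ranges instead.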
Dictionary-Based Compression (LZ Family)
While entropy encoding optimizes symbol frequencies, dictionary-based compression exploits repeated sequences. The LZ family of algorithms maintains a dictionary of previously seen patterns and replaces repetitions with references.
LZ77 uses a sliding window approach: it looks backward in the data stream for matches and encodes them as (offset, length) pairs. This simple idea powers gzip, deflate, and PNG compression.
```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (offset, length, next_char)


def lz77_compress(data: str, window_size: int = 256,
                  lookahead_size: int = 16) -> List[Token]:
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        best_offset = 0
        best_length = 0
        # Search window boundaries
        search_start = max(0, pos - window_size)
        # Find the longest match in the search window
        for i in range(search_start, pos):
            length = 0
            while (length < lookahead_size and
                   pos + length < len(data) and
                   data[i + length] == data[pos + length]):
                length += 1
                # Stop before the match overlaps the current position
                if i + length >= pos:
                    break
            if length > best_length:
                best_length = length
                best_offset = pos - i
        # Literal character that follows the match (empty at end of input)
        next_char = data[pos + best_length] if pos + best_length < len(data) else ""
        tokens.append((best_offset, best_length, next_char))
        pos += best_length + 1
    return tokens


def lz77_decompress(tokens: List[Token]) -> str:
    result: List[str] = []
    for offset, length, next_char in tokens:
        if length > 0:
            start = len(result) - offset
            # Copy byte by byte so overlapping references work correctly
            for i in range(length):
                result.append(result[start + i])
        if next_char:
            result.append(next_char)
    return "".join(result)


# Example
data = "abracadabra_abracadabra"
compressed = lz77_compress(data)
decompressed = lz77_decompress(compressed)
print(f"Original: {data}")
print(f"Tokens: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Token count: {len(compressed)} vs original length: {len(data)}")
```
LZW (used in GIF) builds its dictionary dynamically during compression, eliminating the need to transmit the dictionary separately. LZMA (used in 7-zip and xz) combines LZ77 with range coding for exceptional compression ratios.
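To illustrate the dictionary-growing idea, here is a simplified LZW sketch (hypothetical `lzw_compress`/`lzw_decompress` helpers; real GIF encoders add fixed code widths and dictionary resets, which are omitted here).

```python
from typing import List


def lzw_compress(data: str) -> List[int]:
    # Start with all single-byte entries; grow the table as patterns repeat
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    out: List[int] = []
    for ch in data:
        candidate = current + ch
        if candidate in table:
            current = candidate
        else:
            out.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out


def lzw_decompress(codes: List[int]) -> str:
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Tricky case: the code may refer to the entry we are about to create
        entry = table[code] if code in table else prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)


codes = lzw_compress("abababab")
print(codes, lzw_decompress(codes))  # [97, 98, 256, 258, 98] abababab
```

Note that the decoder rebuilds the exact same table from the code stream alone, which is what makes transmitting the dictionary unnecessary.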
Modern Compression Algorithms
The compression landscape has evolved significantly. Modern algorithms offer tunable trade-offs between speed and ratio.
zstd (Zstandard) from Facebook has become the go-to choice for many applications. It offers compression ratios comparable to zlib at 3-5x faster speeds, with compression levels from 1 (fastest) to 22 (smallest).
lz4 prioritizes speed above all else, achieving gigabytes-per-second compression and decompression. It’s ideal for real-time applications, caching layers, and scenarios where CPU is the bottleneck.
Brotli from Google optimizes for web content, achieving 20-26% better compression than gzip on text. It’s now the standard for HTTP compression alongside gzip.
```python
# Requires third-party packages: pip install zstandard lz4 brotli
import time

import brotli
import lz4.frame
import zstandard as zstd


def benchmark_compression(data: bytes) -> None:
    algorithms = {
        "zstd-1": lambda d: zstd.ZstdCompressor(level=1).compress(d),
        "zstd-10": lambda d: zstd.ZstdCompressor(level=10).compress(d),
        "zstd-22": lambda d: zstd.ZstdCompressor(level=22).compress(d),
        "lz4": lambda d: lz4.frame.compress(d),
        "brotli-4": lambda d: brotli.compress(d, quality=4),
        "brotli-11": lambda d: brotli.compress(d, quality=11),
    }
    original_size = len(data)
    print(f"Original size: {original_size:,} bytes\n")
    print(f"{'Algorithm':<12} {'Compressed':<12} {'Ratio':<8} {'Time (ms)':<10}")
    print("-" * 45)
    for name, compress_func in algorithms.items():
        start = time.perf_counter()
        compressed = compress_func(data)
        elapsed = (time.perf_counter() - start) * 1000
        ratio = original_size / len(compressed)
        print(f"{name:<12} {len(compressed):<12,} {ratio:<8.2f} {elapsed:<10.2f}")


# Generate sample data with realistic patterns
sample_data = (b"The quick brown fox jumps over the lazy dog. " * 1000 +
               b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 1000)
benchmark_compression(sample_data)
```
Lossy Compression Strategies
Lossy compression achieves dramatic size reductions by discarding information humans won’t notice. The key techniques include:
Quantization reduces the precision of values. Instead of storing exact pixel values, you store approximations. JPEG uses this heavily.
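A toy sketch of uniform scalar quantization (hypothetical `quantize`/`dequantize` helpers) shows the core trade: values collapse onto a coarse grid, storage shrinks, and reconstruction error is bounded by half the step size.

```python
from typing import List


def quantize(values: List[float], step: float) -> List[int]:
    # Map each value to the nearest multiple of `step`, stored as a small integer
    return [round(v / step) for v in values]


def dequantize(indices: List[int], step: float) -> List[float]:
    return [i * step for i in indices]


samples = [0.12, 0.49, 0.51, 0.88, 0.93]
coarse = quantize(samples, step=0.25)
print(coarse)                     # small integers entropy-code far better than raw floats
print(dequantize(coarse, 0.25))  # reconstruction error is at most step / 2
```

A larger step means fewer distinct values and better compression, at the cost of a larger worst-case error; that is the quality dial JPEG exposes.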
Transform coding converts data to a domain where energy concentrates in fewer coefficients. The Discrete Cosine Transform (DCT) in JPEG converts 8x8 pixel blocks into frequency components, where high frequencies (fine details) can be discarded with minimal visual impact.
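Energy compaction can be seen even in one dimension. The sketch below (a hypothetical, unnormalized `dct_ii` helper, not JPEG's actual 2D transform) applies a DCT-II to a smooth ramp; nearly all of the energy lands in the first few coefficients, so the rest can be quantized to zero cheaply.

```python
import math
from typing import List


def dct_ii(signal: List[float]) -> List[float]:
    """Unnormalized 1D DCT-II: coefficient k measures the frequency-k content of the signal."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(signal))
            for k in range(n)]


# A smooth ramp: low frequencies dominate, high-frequency coefficients are tiny
ramp = [i / 7 for i in range(8)]
coeffs = dct_ii(ramp)
print([round(c, 3) for c in coeffs])
```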
```python
# Requires Pillow: pip install Pillow
from PIL import Image
import io
import os


def compare_jpeg_quality(image_path: str, qualities: list[int]) -> None:
    """Compare file sizes and demonstrate quality degradation."""
    # Convert to RGB so PNG sources with an alpha channel can be saved as JPEG
    original = Image.open(image_path).convert("RGB")
    original_size = os.path.getsize(image_path)
    print(f"Original: {original_size:,} bytes")
    print(f"{'Quality':<10} {'Size':<15} {'Ratio':<10} {'Size %':<10}")
    print("-" * 45)
    for quality in qualities:
        buffer = io.BytesIO()
        original.save(buffer, format="JPEG", quality=quality)
        compressed_size = buffer.tell()
        ratio = original_size / compressed_size
        percentage = (compressed_size / original_size) * 100
        print(f"{quality:<10} {compressed_size:<15,} {ratio:<10.2f} {percentage:<10.1f}%")
        # Save for visual comparison
        output_path = f"quality_{quality}.jpg"
        original.save(output_path, format="JPEG", quality=quality)


# Usage: compare_jpeg_quality("photo.png", [95, 85, 75, 50, 25, 10])
```
Perceptual models leverage human sensory limitations. MP3 uses psychoacoustic models to discard frequencies masked by louder sounds. Video codecs exploit temporal redundancy—most frames differ only slightly from their neighbors.
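The temporal-redundancy idea can be sketched with simple frame deltas (a toy example using the standard library's `zlib`; real codecs use motion compensation, not raw subtraction). A delta frame is mostly zeros, so it compresses far better than the frame itself.

```python
import random
import zlib

# A pseudo-random "frame" stands in for incompressible pixel data
random.seed(42)
frame_a = bytes(random.randrange(256) for _ in range(4096))

# The next frame differs in only a handful of pixels
frame_b = bytearray(frame_a)
for i in range(0, 4096, 512):
    frame_b[i] = (frame_b[i] + 1) % 256
frame_b = bytes(frame_b)

# Delta frame: byte-wise difference, almost entirely zeros
delta = bytes((b - a) % 256 for a, b in zip(frame_a, frame_b))

print(f"full frame compressed:  {len(zlib.compress(frame_b)):,} bytes")
print(f"delta frame compressed: {len(zlib.compress(delta)):,} bytes")

# The decoder reconstructs frame_b from frame_a plus the tiny delta
restored = bytes((a + d) % 256 for a, d in zip(frame_a, delta))
assert restored == frame_b
```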
Choosing the Right Algorithm: Trade-off Analysis
Selecting a compression algorithm requires understanding your constraints:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class CompressionRequirements:
    compression_ratio: Priority
    compression_speed: Priority
    decompression_speed: Priority
    memory_limit_mb: Optional[int]
    streaming_required: bool


def recommend_algorithm(req: CompressionRequirements) -> str:
    # High decompression speed is critical (real-time, client-side)
    if req.decompression_speed == Priority.HIGH:
        if req.compression_ratio == Priority.HIGH:
            return "zstd (level 10-15): Best balance of ratio and decompression speed"
        return "lz4: Fastest decompression, decent ratio"
    # Compression speed matters (real-time logging, streaming)
    if req.compression_speed == Priority.HIGH:
        if req.streaming_required:
            return "lz4 or zstd (level 1-3): Stream-friendly with fast compression"
        return "lz4: Maximum throughput"
    # Maximum compression (archival, cold storage)
    if req.compression_ratio == Priority.HIGH:
        if req.memory_limit_mb and req.memory_limit_mb < 100:
            return "zstd (level 19): High ratio, moderate memory"
        return "lzma/xz: Maximum compression, high memory usage"
    # Web content
    if req.streaming_required:
        return "brotli (quality 4-6): Optimized for HTTP, good browser support"
    return "zstd (level 3): Good default for most use cases"


# Example usage
requirements = CompressionRequirements(
    compression_ratio=Priority.MEDIUM,
    compression_speed=Priority.HIGH,
    decompression_speed=Priority.HIGH,
    memory_limit_mb=50,
    streaming_required=True,
)
print(recommend_algorithm(requirements))
```
Practical Implementation Considerations
Don’t compress compressed data. JPEG, PNG, MP4, encrypted files, and already-compressed archives won’t benefit from additional compression. Detect these formats and skip compression:
```python
# Magic-byte prefixes of formats that are already compressed
INCOMPRESSIBLE_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG": "PNG",
    b"GIF8": "GIF",
    b"PK\x03\x04": "ZIP",
    b"\x1f\x8b": "GZIP",
    b"\x28\xb5\x2f\xfd": "ZSTD",
}


def should_compress(data: bytes) -> bool:
    for signature in INCOMPRESSIBLE_SIGNATURES:
        if data.startswith(signature):
            return False
    return True
```
Handle errors gracefully. Corrupted compressed data can crash decompressors. Always wrap decompression in try/except (or try-catch) blocks and validate checksums when available.
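A minimal sketch of that defensive pattern, using the standard library's `zlib` (the `safe_decompress` wrapper name is hypothetical):

```python
import zlib
from typing import Optional


def safe_decompress(blob: bytes) -> Optional[bytes]:
    """Return the decompressed payload, or None if the data is corrupt."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


payload = zlib.compress(b"hello world")
print(safe_decompress(payload))                 # b'hello world'
print(safe_decompress(payload[:-3] + b"!!!"))   # None: the stream fails its checksum
```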
Consider streaming. For large files, streaming compression avoids loading everything into memory. Both zstd and lz4 support streaming APIs.
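A rough sketch of the streaming pattern using the standard library's `zlib` (the `stream_compress`/`stream_decompress` generator names are hypothetical; zstd and lz4 expose similar chunked APIs):

```python
import zlib
from typing import Iterable, Iterator


def stream_compress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Compress chunk by chunk; memory use is bounded by the chunk size."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    yield compressor.flush()


def stream_decompress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    yield decompressor.flush()


# Round-trip without ever holding the whole stream in memory
source = (b"log line %d\n" % i for i in range(100_000))
restored = b"".join(stream_decompress(stream_compress(source)))
assert restored.startswith(b"log line 0\n")
```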
Profile before optimizing. Measure actual compression ratios on your data. Synthetic benchmarks rarely reflect real-world performance. A 10% ratio improvement means nothing if it doubles your CPU costs.
Compression is a fundamental tool in every engineer’s toolkit. Understanding the trade-offs lets you make informed decisions rather than cargo-culting whatever your framework defaults to.