Data Compression: Algorithms and Trade-offs
Key Insights
- Compression algorithms sit on a spectrum between speed and ratio—zstd and lz4 favor speed, while LZMA maximizes compression at significant CPU cost. Your choice depends on whether you’re optimizing for storage, bandwidth, or latency.
- Dictionary-based algorithms (LZ family) dominate general-purpose compression because they exploit the repetitive patterns found in most real-world data, from text to binary formats.
- Never compress already-compressed data. JPEG, PNG, MP4, and encrypted content won’t compress further and may actually grow larger with the overhead of compression metadata.
Introduction to Data Compression
Data compression reduces storage costs, speeds up network transfers, and can even improve application performance by reducing I/O bottlenecks. Every time you load a webpage, stream a video, or download an app, compression algorithms are working behind the scenes.
Compression falls into two categories: lossless and lossy. Lossless compression guarantees perfect reconstruction of the original data—essential for source code, databases, and executables. Lossy compression sacrifices some fidelity for dramatically better compression ratios, making it ideal for media where human perception can’t detect the difference.
Understanding these algorithms isn’t just academic. Choosing the wrong compression strategy can tank your application’s performance or balloon your infrastructure costs.
Lossless Compression Fundamentals
All lossless compression exploits one principle: redundancy. Data rarely uses all possible bit patterns with equal frequency. Entropy encoding assigns shorter codes to frequent symbols and longer codes to rare ones.
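To make that concrete, here is a minimal sketch (a hypothetical `shannon_entropy` helper, standard library only) that computes the Shannon entropy of a string: the theoretical lower bound, in bits per symbol, that any lossless symbol-by-symbol coder can approach.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Average bits per symbol an ideal entropy coder would need."""
    counts = Counter(data)
    total = len(data)
    # p * log2(1/p), summed over the symbol alphabet
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy("aaaaaaaa"))     # a single repeated symbol carries no information
print(shannon_entropy("abracadabra"))  # ≈ 2.04 bits/symbol, well under the 8 bits of ASCII
```

Comparing this bound against the Huffman output below shows how close a simple coder gets in practice.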
Huffman coding builds a binary tree where frequent symbols sit near the root (short paths) and rare symbols sit deeper (long paths). It’s optimal for symbol-by-symbol encoding and forms the backbone of many compression formats.
```python
import heapq
from collections import Counter
from typing import Dict, Tuple, Optional


class HuffmanNode:
    def __init__(self, char: Optional[str], freq: int):
        self.char = char
        self.freq = freq
        self.left: Optional[HuffmanNode] = None
        self.right: Optional[HuffmanNode] = None

    def __lt__(self, other: "HuffmanNode") -> bool:
        return self.freq < other.freq


def build_huffman_tree(text: str) -> HuffmanNode:
    frequency = Counter(text)
    heap = [HuffmanNode(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HuffmanNode(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        heapq.heappush(heap, merged)
    return heap[0]


def build_codes(node: HuffmanNode, prefix: str = "",
                codes: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    if codes is None:
        codes = {}
    if node.char is not None:
        # Leaf: "prefix or '0'" handles the single-symbol edge case
        codes[node.char] = prefix or "0"
    else:
        if node.left:
            build_codes(node.left, prefix + "0", codes)
        if node.right:
            build_codes(node.right, prefix + "1", codes)
    return codes


def huffman_encode(text: str) -> Tuple[str, HuffmanNode]:
    tree = build_huffman_tree(text)
    codes = build_codes(tree)
    encoded = "".join(codes[char] for char in text)
    return encoded, tree


def huffman_decode(encoded: str, tree: HuffmanNode) -> str:
    decoded = []
    node = tree
    for bit in encoded:
        node = node.left if bit == "0" else node.right
        if node.char is not None:
            decoded.append(node.char)
            node = tree
    return "".join(decoded)


# Example usage
text = "abracadabra"
encoded, tree = huffman_encode(text)
decoded = huffman_decode(encoded, tree)
print(f"Original: {text} ({len(text) * 8} bits)")
print(f"Encoded: {encoded} ({len(encoded)} bits)")
print(f"Compression ratio: {len(text) * 8 / len(encoded):.2f}x")
```
Arithmetic coding goes further by encoding entire messages as a single fractional number, achieving compression closer to the theoretical entropy limit. It’s more complex to implement but used in modern formats like JPEG 2000 and H.265.
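A production arithmetic coder works on integer ranges with careful renormalization; the float-based toy below (hypothetical `arith_encode`/`arith_decode` helpers) only illustrates the interval-narrowing idea, and assumes the decoder already knows the message length and the symbol probabilities.

```python
from typing import Dict, List


def arith_encode(message: str, probs: Dict[str, float]) -> float:
    """Narrow [0, 1) once per symbol; any number in the final interval encodes the message."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if sym == symbol:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2


def arith_decode(value: float, length: int, probs: Dict[str, float]) -> str:
    out: List[str] = []
    low, high = 0.0, 1.0
    for _ in range(length):
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if low + span * cum <= value < low + span * (cum + p):
                out.append(sym)
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return "".join(out)


probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code = arith_encode("abcab", probs)
print(code, arith_decode(code, 5, probs))
```

Floats run out of precision after a few dozen symbols, which is exactly why real implementations renormalize integer ranges instead.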
Dictionary-Based Compression (LZ Family)
While entropy encoding optimizes symbol frequencies, dictionary-based compression exploits repeated sequences. The LZ family of algorithms maintains a dictionary of previously seen patterns and replaces repetitions with references.
LZ77 uses a sliding window approach: it looks backward in the data stream for matches and encodes them as (offset, length) pairs. This simple idea powers gzip, deflate, and PNG compression.
```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (offset, length, next_char)


def lz77_compress(data: str, window_size: int = 256,
                  lookahead_size: int = 16) -> List[Token]:
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        best_offset = 0
        best_length = 0
        # Search window boundaries
        search_start = max(0, pos - window_size)
        # Find the longest match in the search window
        for i in range(search_start, pos):
            length = 0
            while (length < lookahead_size and
                   pos + length < len(data) and
                   data[i + length] == data[pos + length]):
                length += 1
                # Stop before the match overlaps the current position
                if i + length >= pos:
                    break
            if length > best_length:
                best_length = length
                best_offset = pos - i
        # Literal character that follows the match (empty at end of input)
        next_char = data[pos + best_length] if pos + best_length < len(data) else ""
        tokens.append((best_offset, best_length, next_char))
        pos += best_length + 1
    return tokens


def lz77_decompress(tokens: List[Token]) -> str:
    result: List[str] = []
    for offset, length, next_char in tokens:
        if length > 0:
            start = len(result) - offset
            # Copy byte by byte so overlapping references work correctly
            for i in range(length):
                result.append(result[start + i])
        if next_char:
            result.append(next_char)
    return "".join(result)


# Example
data = "abracadabra_abracadabra"
compressed = lz77_compress(data)
decompressed = lz77_decompress(compressed)
print(f"Original: {data}")
print(f"Tokens: {compressed}")
print(f"Decompressed: {decompressed}")
print(f"Token count: {len(compressed)} vs original length: {len(data)}")
```
LZW (used in GIF) builds its dictionary dynamically during compression, eliminating the need to transmit the dictionary separately. LZMA (used in 7-zip and xz) combines LZ77 with range coding for exceptional compression ratios.
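To illustrate the dictionary-growing idea, here is a simplified LZW sketch (hypothetical `lzw_compress`/`lzw_decompress` helpers; real GIF encoders add fixed code widths and dictionary resets, which are omitted here).

```python
from typing import List


def lzw_compress(data: str) -> List[int]:
    # Start with all single-byte entries; grow the table as patterns repeat
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    out: List[int] = []
    for ch in data:
        candidate = current + ch
        if candidate in table:
            current = candidate
        else:
            out.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out


def lzw_decompress(codes: List[int]) -> str:
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Tricky case: the code may refer to the entry we are about to create
        entry = table[code] if code in table else prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)


codes = lzw_compress("abababab")
print(codes, lzw_decompress(codes))  # [97, 98, 256, 258, 98] abababab
```

Note that the decoder rebuilds the exact same table from the code stream alone, which is what makes transmitting the dictionary unnecessary.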
Modern Compression Algorithms
The compression landscape has evolved significantly. Modern algorithms offer tunable trade-offs between speed and ratio.
zstd (Zstandard) from Facebook has become the go-to choice for many applications. It offers compression ratios comparable to zlib at 3-5x faster speeds, with compression levels from 1 (fastest) to 22 (smallest).
lz4 prioritizes speed above all else, achieving gigabytes-per-second compression and decompression. It’s ideal for real-time applications, caching layers, and scenarios where CPU is the bottleneck.
Brotli from Google optimizes for web content, achieving 20-26% better compression than gzip on text. It’s now the standard for HTTP compression alongside gzip.
```python
# Requires third-party packages: pip install zstandard lz4 brotli
import time

import brotli
import lz4.frame
import zstandard as zstd


def benchmark_compression(data: bytes) -> None:
    algorithms = {
        "zstd-1": lambda d: zstd.ZstdCompressor(level=1).compress(d),
        "zstd-10": lambda d: zstd.ZstdCompressor(level=10).compress(d),
        "zstd-22": lambda d: zstd.ZstdCompressor(level=22).compress(d),
        "lz4": lambda d: lz4.frame.compress(d),
        "brotli-4": lambda d: brotli.compress(d, quality=4),
        "brotli-11": lambda d: brotli.compress(d, quality=11),
    }
    original_size = len(data)
    print(f"Original size: {original_size:,} bytes\n")
    print(f"{'Algorithm':<12} {'Compressed':<12} {'Ratio':<8} {'Time (ms)':<10}")
    print("-" * 45)
    for name, compress_func in algorithms.items():
        start = time.perf_counter()
        compressed = compress_func(data)
        elapsed = (time.perf_counter() - start) * 1000
        ratio = original_size / len(compressed)
        print(f"{name:<12} {len(compressed):<12,} {ratio:<8.2f} {elapsed:<10.2f}")


# Generate sample data with realistic patterns
sample_data = (b"The quick brown fox jumps over the lazy dog. " * 1000 +
               b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 1000)
benchmark_compression(sample_data)
```
Lossy Compression Strategies
Lossy compression achieves dramatic size reductions by discarding information humans won’t notice. The key techniques include:
Quantization reduces the precision of values. Instead of storing exact pixel values, you store approximations. JPEG uses this heavily.
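A toy sketch of uniform scalar quantization (hypothetical `quantize`/`dequantize` helpers) shows the core trade: values collapse onto a coarse grid, storage shrinks, and reconstruction error is bounded by half the step size.

```python
from typing import List


def quantize(values: List[float], step: float) -> List[int]:
    # Map each value to the nearest multiple of `step`, stored as a small integer
    return [round(v / step) for v in values]


def dequantize(indices: List[int], step: float) -> List[float]:
    return [i * step for i in indices]


samples = [0.12, 0.49, 0.51, 0.88, 0.93]
coarse = quantize(samples, step=0.25)
print(coarse)                     # small integers entropy-code far better than raw floats
print(dequantize(coarse, 0.25))  # reconstruction error is at most step / 2
```

A larger step means fewer distinct values and better compression, at the cost of a larger worst-case error; that is the quality dial JPEG exposes.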
Transform coding converts data to a domain where energy concentrates in fewer coefficients. The Discrete Cosine Transform (DCT) in JPEG converts 8x8 pixel blocks into frequency components, where high frequencies (fine details) can be discarded with minimal visual impact.
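Energy compaction can be seen even in one dimension. The sketch below (a hypothetical, unnormalized `dct_ii` helper, not JPEG's actual 2D transform) applies a DCT-II to a smooth ramp; nearly all of the energy lands in the first few coefficients, so the rest can be quantized to zero cheaply.

```python
import math
from typing import List


def dct_ii(signal: List[float]) -> List[float]:
    """Unnormalized 1D DCT-II: coefficient k measures the frequency-k content of the signal."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(signal))
            for k in range(n)]


# A smooth ramp: low frequencies dominate, high-frequency coefficients are tiny
ramp = [i / 7 for i in range(8)]
coeffs = dct_ii(ramp)
print([round(c, 3) for c in coeffs])
```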
```python
# Requires Pillow: pip install Pillow
from PIL import Image
import io
import os


def compare_jpeg_quality(image_path: str, qualities: list[int]) -> None:
    """Compare file sizes and demonstrate quality degradation."""
    # Convert to RGB so PNG sources with an alpha channel can be saved as JPEG
    original = Image.open(image_path).convert("RGB")
    original_size = os.path.getsize(image_path)
    print(f"Original: {original_size:,} bytes")
    print(f"{'Quality':<10} {'Size':<15} {'Ratio':<10} {'Size %':<10}")
    print("-" * 45)
    for quality in qualities:
        buffer = io.BytesIO()
        original.save(buffer, format="JPEG", quality=quality)
        compressed_size = buffer.tell()
        ratio = original_size / compressed_size
        percentage = (compressed_size / original_size) * 100
        print(f"{quality:<10} {compressed_size:<15,} {ratio:<10.2f} {percentage:<10.1f}%")
        # Save for visual comparison
        output_path = f"quality_{quality}.jpg"
        original.save(output_path, format="JPEG", quality=quality)


# Usage: compare_jpeg_quality("photo.png", [95, 85, 75, 50, 25, 10])
```
Perceptual models leverage human sensory limitations. MP3 uses psychoacoustic models to discard frequencies masked by louder sounds. Video codecs exploit temporal redundancy—most frames differ only slightly from their neighbors.
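The temporal-redundancy idea can be sketched with simple frame deltas (a toy example using the standard library's `zlib`; real codecs use motion compensation, not raw subtraction). A delta frame is mostly zeros, so it compresses far better than the frame itself.

```python
import random
import zlib

# A pseudo-random "frame" stands in for incompressible pixel data
random.seed(42)
frame_a = bytes(random.randrange(256) for _ in range(4096))

# The next frame differs in only a handful of pixels
frame_b = bytearray(frame_a)
for i in range(0, 4096, 512):
    frame_b[i] = (frame_b[i] + 1) % 256
frame_b = bytes(frame_b)

# Delta frame: byte-wise difference, almost entirely zeros
delta = bytes((b - a) % 256 for a, b in zip(frame_a, frame_b))

print(f"full frame compressed:  {len(zlib.compress(frame_b)):,} bytes")
print(f"delta frame compressed: {len(zlib.compress(delta)):,} bytes")

# The decoder reconstructs frame_b from frame_a plus the tiny delta
restored = bytes((a + d) % 256 for a, d in zip(frame_a, delta))
assert restored == frame_b
```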
Choosing the Right Algorithm: Trade-off Analysis
Selecting a compression algorithm requires understanding your constraints:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class CompressionRequirements:
    compression_ratio: Priority
    compression_speed: Priority
    decompression_speed: Priority
    memory_limit_mb: Optional[int]
    streaming_required: bool


def recommend_algorithm(req: CompressionRequirements) -> str:
    # High decompression speed is critical (real-time, client-side)
    if req.decompression_speed == Priority.HIGH:
        if req.compression_ratio == Priority.HIGH:
            return "zstd (level 10-15): Best balance of ratio and decompression speed"
        return "lz4: Fastest decompression, decent ratio"
    # Compression speed matters (real-time logging, streaming)
    if req.compression_speed == Priority.HIGH:
        if req.streaming_required:
            return "lz4 or zstd (level 1-3): Stream-friendly with fast compression"
        return "lz4: Maximum throughput"
    # Maximum compression (archival, cold storage)
    if req.compression_ratio == Priority.HIGH:
        if req.memory_limit_mb and req.memory_limit_mb < 100:
            return "zstd (level 19): High ratio, moderate memory"
        return "lzma/xz: Maximum compression, high memory usage"
    # Web content
    if req.streaming_required:
        return "brotli (quality 4-6): Optimized for HTTP, good browser support"
    return "zstd (level 3): Good default for most use cases"


# Example usage
requirements = CompressionRequirements(
    compression_ratio=Priority.MEDIUM,
    compression_speed=Priority.HIGH,
    decompression_speed=Priority.HIGH,
    memory_limit_mb=50,
    streaming_required=True,
)
print(recommend_algorithm(requirements))
```
Practical Implementation Considerations
Don’t compress compressed data. JPEG, PNG, MP4, encrypted files, and already-compressed archives won’t benefit from additional compression. Detect these formats and skip compression:
```python
# Magic-byte prefixes of formats that are already compressed
INCOMPRESSIBLE_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG": "PNG",
    b"GIF8": "GIF",
    b"PK\x03\x04": "ZIP",
    b"\x1f\x8b": "GZIP",
    b"\x28\xb5\x2f\xfd": "ZSTD",
}


def should_compress(data: bytes) -> bool:
    for signature in INCOMPRESSIBLE_SIGNATURES:
        if data.startswith(signature):
            return False
    return True
```
Handle errors gracefully. Corrupted compressed data can crash decompressors. Always wrap decompression in try/except (or try-catch) blocks and validate checksums when available.
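A minimal sketch of that defensive pattern, using the standard library's `zlib` (the `safe_decompress` wrapper name is hypothetical):

```python
import zlib
from typing import Optional


def safe_decompress(blob: bytes) -> Optional[bytes]:
    """Return the decompressed payload, or None if the data is corrupt."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


payload = zlib.compress(b"hello world")
print(safe_decompress(payload))                 # b'hello world'
print(safe_decompress(payload[:-3] + b"!!!"))   # None: the stream fails its checksum
```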
Consider streaming. For large files, streaming compression avoids loading everything into memory. Both zstd and lz4 support streaming APIs.
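A rough sketch of the streaming pattern using the standard library's `zlib` (the `stream_compress`/`stream_decompress` generator names are hypothetical; zstd and lz4 expose similar chunked APIs):

```python
import zlib
from typing import Iterable, Iterator


def stream_compress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Compress chunk by chunk; memory use is bounded by the chunk size."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    yield compressor.flush()


def stream_decompress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    yield decompressor.flush()


# Round-trip without ever holding the whole stream in memory
source = (b"log line %d\n" % i for i in range(100_000))
restored = b"".join(stream_decompress(stream_compress(source)))
assert restored.startswith(b"log line 0\n")
```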
Profile before optimizing. Measure actual compression ratios on your data. Synthetic benchmarks rarely reflect real-world performance. A 10% ratio improvement means nothing if it doubles your CPU costs.
Compression is a fundamental tool in every engineer’s toolkit. Understanding the trade-offs lets you make informed decisions rather than cargo-culting whatever your framework defaults to.