Sparse Arrays: Efficient Storage for Large Datasets
Key Insights
- Sparse arrays can reduce memory usage by 98% or more when your data contains mostly zeros—a 10,000×10,000 matrix drops from 800MB to about 16MB at 1% density
- Format selection matters: COO for construction, CSR for row operations and arithmetic, DOK for random access updates
- The break-even point for arithmetic is typically around 5-15% density—below that, sparse wins; above it, dense arrays perform better
The Problem with Dense Arrays
Every time you allocate a NumPy array, you’re reserving contiguous memory for every single element—whether it contains meaningful data or not. For a 10,000×10,000 matrix of 64-bit floats, that’s 800MB. If only 1% of those values are non-zero, you’re wasting 792MB storing zeros.
This isn’t a theoretical concern. Real-world datasets are sparse by nature:
- Sensor networks: Most sensors report “no change” most of the time
- Social graphs: Users connect to hundreds of others, not millions
- Recommendation systems: Users rate a tiny fraction of available products
- Natural language processing: Document-term matrices have mostly zero counts
- Scientific simulations: Finite element meshes produce banded matrices
When you’re working with datasets at scale, this waste becomes a hard constraint. You either run out of memory, or you spend money on hardware you don’t need.
What Are Sparse Arrays?
Sparse arrays store only the non-zero elements along with their positions. Instead of allocating space for every possible value, you track what exists and where it exists.
The math is straightforward. Let’s calculate the memory difference:
import numpy as np

def memory_comparison(rows: int, cols: int, density: float) -> dict:
    """Compare memory usage between dense and sparse storage."""
    total_elements = rows * cols
    non_zero_elements = int(total_elements * density)
    bytes_per_float = 8  # float64
    bytes_per_index = 4  # int32
    # Dense: store every element
    dense_bytes = total_elements * bytes_per_float
    # COO sparse: store value + row index + col index per non-zero
    sparse_bytes = non_zero_elements * (bytes_per_float + 2 * bytes_per_index)
    return {
        "dense_mb": dense_bytes / (1024 ** 2),
        "sparse_mb": sparse_bytes / (1024 ** 2),
        "savings_percent": (1 - sparse_bytes / dense_bytes) * 100
    }

# 10,000 x 10,000 matrix with 1% fill rate
result = memory_comparison(10_000, 10_000, 0.01)
print(f"Dense: {result['dense_mb']:.1f} MB")
print(f"Sparse: {result['sparse_mb']:.1f} MB")
print(f"Savings: {result['savings_percent']:.1f}%")

# Output:
# Dense: 762.9 MB
# Sparse: 15.3 MB
# Savings: 98.0%
At 1% density, you save 98% of memory. At 0.1% density (common in recommendation systems), savings exceed 99.9%.
Common Sparse Storage Formats
Not all sparse formats are equal. Each optimizes for different access patterns.
COO (Coordinate List)
The simplest format: three parallel arrays for row indices, column indices, and values.
import numpy as np

class COOMatrix:
    """Coordinate list sparse matrix implementation."""
    def __init__(self, shape: tuple[int, int]):
        self.shape = shape
        self.rows: list[int] = []
        self.cols: list[int] = []
        self.data: list[float] = []

    def set(self, row: int, col: int, value: float) -> None:
        if value == 0:
            return
        self.rows.append(row)
        self.cols.append(col)
        self.data.append(value)

    def to_dense(self) -> np.ndarray:
        result = np.zeros(self.shape)
        for r, c, v in zip(self.rows, self.cols, self.data):
            result[r, c] = v
        return result

# Build a sparse matrix
coo = COOMatrix((1000, 1000))
coo.set(0, 0, 1.0)
coo.set(500, 750, 3.14)
coo.set(999, 999, 2.0)
COO excels at incremental construction but has O(n) lookup time for individual elements.
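The cost of that O(n) lookup is easy to see in a standalone sketch (the `coo_get` helper is illustrative, not part of the class above): finding one element means scanning every stored coordinate until a match turns up, or falling through to the implicit zero.

```python
def coo_get(rows: list[int], cols: list[int], data: list[float],
            row: int, col: int) -> float:
    """O(n) element lookup over COO-style parallel lists."""
    for r, c, v in zip(rows, cols, data):
        if r == row and c == col:
            return v  # found a stored non-zero
    return 0.0  # anything not stored is an implicit zero

rows = [0, 500, 999]
cols = [0, 750, 999]
data = [1.0, 3.14, 2.0]
print(coo_get(rows, cols, data, 500, 750))  # 3.14
print(coo_get(rows, cols, data, 1, 1))      # 0.0
```

Three entries are harmless, but with a million non-zeros every single read pays that full scan, which is why COO is treated as a construction format rather than a query format.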
CSR (Compressed Sparse Row)
CSR compresses row information into a pointer array, making row slicing and matrix-vector multiplication fast.
from scipy import sparse
# Create from COO and convert to CSR
data = [1.0, 3.14, 2.0]
rows = [0, 500, 999]
cols = [0, 750, 999]
coo_matrix = sparse.coo_matrix((data, (rows, cols)), shape=(1000, 1000))
csr_matrix = coo_matrix.tocsr()
# CSR internals
print(f"Data: {csr_matrix.data}") # Non-zero values
print(f"Indices: {csr_matrix.indices}") # Column indices
print(f"Indptr: {csr_matrix.indptr[:5]}...") # Row pointers
# Efficient row access
row_500 = csr_matrix.getrow(500)
print(f"Row 500 non-zeros: {row_500.nnz}")
Format Conversion
SciPy makes conversions trivial:
# Start with whatever format is convenient
dok = sparse.dok_matrix((1000, 1000))
dok[0, 0] = 1.0
dok[500, 750] = 3.14
# Convert for computation
csr = dok.tocsr() # For row operations
csc = dok.tocsc() # For column operations
coo = dok.tocoo() # For iteration
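A quick sanity check that conversion changes only the layout, never the stored values (variable names follow the snippet above):

```python
import numpy as np
from scipy import sparse

# Same matrix as above; each conversion yields a new physical layout.
dok = sparse.dok_matrix((1000, 1000))
dok[0, 0] = 1.0
dok[500, 750] = 3.14

csr = dok.tocsr()
csc = dok.tocsc()
coo = dok.tocoo()

# All three agree element-for-element with the original.
assert np.array_equal(csr.toarray(), csc.toarray())
assert np.array_equal(csr.toarray(), coo.toarray())
print(csr[500, 750])  # the stored value survives every round trip
```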
Performance Characteristics and Trade-offs
Sparse arrays aren’t universally faster. The crossover point depends on operation type and sparsity level.
import time
from scipy import sparse
def benchmark_matmul(size: int, densities: list[float]) -> dict:
    """Benchmark matrix multiplication at various sparsity levels."""
    results = {}
    for density in densities:
        # Create random sparse matrix
        sp = sparse.random(size, size, density=density, format='csr')
        dense = sp.toarray()
        # Benchmark sparse
        start = time.perf_counter()
        for _ in range(10):
            _ = sp @ sp
        sparse_time = (time.perf_counter() - start) / 10
        # Benchmark dense
        start = time.perf_counter()
        for _ in range(10):
            _ = dense @ dense
        dense_time = (time.perf_counter() - start) / 10
        results[density] = {
            "sparse_ms": sparse_time * 1000,
            "dense_ms": dense_time * 1000,
            "sparse_wins": sparse_time < dense_time
        }
    return results

# Test with 2000x2000 matrices
results = benchmark_matmul(2000, [0.001, 0.01, 0.05, 0.1, 0.2])
for density, data in results.items():
    winner = "SPARSE" if data["sparse_wins"] else "DENSE"
    print(f"Density {density:.1%}: sparse={data['sparse_ms']:.1f}ms, "
          f"dense={data['dense_ms']:.1f}ms -> {winner}")
# Typical output:
# Density 0.1%: sparse=2.3ms, dense=145.2ms -> SPARSE
# Density 1.0%: sparse=18.7ms, dense=144.8ms -> SPARSE
# Density 5.0%: sparse=89.4ms, dense=146.1ms -> SPARSE
# Density 10.0%: sparse=178.3ms, dense=145.9ms -> DENSE
# Density 20.0%: sparse=356.8ms, dense=147.2ms -> DENSE
The crossover typically occurs between 5-15% density for matrix multiplication. For element access, sparse formats are slower at any density due to index lookups.
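That element-access cost comes from how CSR locates a single entry. A simplified sketch of what an indexing operation like `csr[i, j]` must do under the hood (SciPy's real implementation also handles unsorted indices and duplicates; `csr_get` here is purely illustrative):

```python
import numpy as np
from scipy import sparse

def csr_get(m: sparse.csr_matrix, i: int, j: int) -> float:
    """Simplified single-element read from CSR internals."""
    start, end = m.indptr[i], m.indptr[i + 1]  # extent of row i in data/indices
    cols = m.indices[start:end]                # column indices stored in row i
    pos = int(np.searchsorted(cols, j))        # binary search (needs sorted indices)
    if pos < len(cols) and cols[pos] == j:
        return float(m.data[start + pos])
    return 0.0                                 # not stored -> implicit zero

m = sparse.random(100, 100, density=0.05, format='csr', random_state=0)
m.sort_indices()  # guarantee the sorted order searchsorted relies on
dense = m.toarray()
assert all(csr_get(m, i, j) == dense[i, j]
           for i in range(100) for j in range(100))
```

Even with the binary search, every read does two pointer lookups, a slice, and a search; a dense array does one address computation. That overhead never goes away, regardless of density.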
Practical Implementation Patterns
Here’s a wrapper class that handles format selection automatically:
from scipy import sparse
import numpy as np
class SparseArray:
    """Sparse array with automatic format optimization."""
    def __init__(self, shape: tuple[int, int]):
        self.shape = shape
        self._dok = sparse.dok_matrix(shape, dtype=np.float64)
        self._csr: sparse.csr_matrix | None = None
        self._dirty = False

    def __setitem__(self, key: tuple[int, int], value: float) -> None:
        self._dok[key] = value
        self._dirty = True
        self._csr = None

    def __getitem__(self, key: tuple[int, int]) -> float:
        return self._dok[key]

    def delete(self, row: int, col: int) -> None:
        if (row, col) in self._dok:
            del self._dok[row, col]
            self._dirty = True
            self._csr = None

    def _ensure_csr(self) -> sparse.csr_matrix:
        if self._csr is None or self._dirty:
            self._csr = self._dok.tocsr()
            self._dirty = False
        return self._csr

    def dot(self, other: 'SparseArray') -> 'SparseArray':
        """Matrix multiplication using CSR format."""
        result_csr = self._ensure_csr() @ other._ensure_csr()
        result = SparseArray(result_csr.shape)
        result._csr = result_csr
        result._dok = result_csr.todok()
        return result

    @property
    def density(self) -> float:
        return self._dok.nnz / (self.shape[0] * self.shape[1])

    @property
    def memory_mb(self) -> float:
        csr = self._ensure_csr()
        total_bytes = (csr.data.nbytes + csr.indices.nbytes +
                       csr.indptr.nbytes)
        return total_bytes / (1024 ** 2)
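The `density` and `memory_mb` properties mirror what you can compute on any SciPy CSR matrix directly; a minimal check, assuming a random 1%-dense matrix:

```python
from scipy import sparse

m = sparse.random(1000, 1000, density=0.01, format='csr', random_state=0)

# Density: stored non-zeros over total cells.
density = m.nnz / (m.shape[0] * m.shape[1])

# CSR memory: values + column indices + row pointers.
mem_mb = (m.data.nbytes + m.indices.nbytes + m.indptr.nbytes) / 1024**2

print(f"density={density:.4f}, memory={mem_mb:.2f} MB")
```

A dense 1000×1000 float64 array occupies 7.6MB; the CSR version here fits in well under 1MB.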
Real-World Application: Building a Recommendation Engine
Collaborative filtering relies on user-item interaction matrices. With millions of users and products, these matrices are extremely sparse.
from scipy import sparse
from scipy.sparse.linalg import svds
import numpy as np
class SparseRecommender:
    """Collaborative filtering using sparse SVD."""
    def __init__(self, n_factors: int = 50):
        self.n_factors = n_factors
        self.user_factors: np.ndarray | None = None
        self.item_factors: np.ndarray | None = None

    def fit(self, interactions: sparse.csr_matrix) -> None:
        """Fit model using truncated SVD on sparse matrix."""
        # Center each user's ratings on their mean, touching only
        # stored entries so the matrix stays sparse
        row_sums = np.array(interactions.sum(axis=1)).flatten()
        row_counts = np.diff(interactions.indptr)
        row_means = np.divide(row_sums, row_counts,
                              out=np.zeros_like(row_sums, dtype=np.float64),
                              where=row_counts != 0)
        centered = interactions.astype(np.float64)
        # CSR stores data row by row, so repeating each mean by its
        # row's non-zero count lines up with the data array
        centered.data -= np.repeat(row_means, row_counts)
        # SVD on sparse matrix - no densification needed
        u, s, vt = svds(centered, k=self.n_factors)
        self.user_factors = u @ np.diag(np.sqrt(s))
        self.item_factors = np.diag(np.sqrt(s)) @ vt

    def recommend(self, user_id: int, n: int = 10) -> np.ndarray:
        """Get top-n recommendations for a user."""
        scores = self.user_factors[user_id] @ self.item_factors
        return np.argsort(scores)[-n:][::-1]
# Simulate 100K users, 50K items, 0.01% interaction rate
n_users, n_items = 100_000, 50_000
density = 0.0001
interactions = sparse.random(n_users, n_items, density=density,
                             format='csr')
interactions.data[:] = np.random.randint(1, 6, size=len(interactions.data))
sparse_mb = (interactions.data.nbytes + interactions.indices.nbytes +
             interactions.indptr.nbytes) / 1024**2
print(f"Interactions: {interactions.nnz:,}")
print(f"Sparse size: {sparse_mb:.1f} MB")
print(f"Dense would be: {n_users * n_items * 8 / 1024**3:.1f} GB")

# Train and recommend
recommender = SparseRecommender(n_factors=50)
recommender.fit(interactions)
top_items = recommender.recommend(user_id=0, n=10)
print(f"Top recommendations for user 0: {top_items}")
This handles 100K users × 50K items in megabytes rather than the 37GB a dense matrix would require.
Choosing the Right Tool
Use this decision framework:
- Density < 1%: Sparse is almost always better
- Density 1-10%: Benchmark your specific operations
- Density > 10%: Dense usually wins unless memory-constrained
For format selection:
| Format | Best For |
|---|---|
| COO | Initial construction, format conversion |
| CSR | Row slicing, matrix-vector products, most arithmetic |
| CSC | Column slicing, solving linear systems |
| DOK | Incremental updates, random access patterns |
| LIL | Row-based construction when order matters |
Start with DOK or COO for building, convert to CSR for computation, and only densify when you’ve confirmed it’s actually faster for your workload.
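That workflow can be condensed into a small helper; `build_and_prepare` is an illustrative name, not a SciPy API:

```python
from scipy import sparse

def build_and_prepare(shape: tuple[int, int],
                      entries: dict[tuple[int, int], float]) -> sparse.csr_matrix:
    """Illustrative workflow: build in DOK, hand back CSR for computation."""
    dok = sparse.dok_matrix(shape)
    for (r, c), v in entries.items():
        dok[r, c] = v       # DOK: cheap random-access writes
    return dok.tocsr()      # CSR: fast arithmetic and row slicing

m = build_and_prepare((100, 100), {(0, 0): 1.0, (50, 75): 3.14})
assert sparse.isspmatrix_csr(m)
```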