Design a Recommendation Engine: Collaborative Filtering
Key Insights
- Collaborative filtering assumes users with similar past behavior will have similar future preferences—this simple assumption powers recommendations at Netflix, Spotify, and Amazon
- Item-based collaborative filtering scales better than user-based for most production systems because item similarities are more stable and can be pre-computed efficiently
- Cold start problems are inevitable; plan for hybrid approaches from day one rather than retrofitting them later
Recommendation engines drive engagement across modern applications, from e-commerce product suggestions to streaming service queues. Collaborative filtering remains the foundational technique behind most production systems because it works without understanding content—it only needs behavioral data.
The core assumption is elegant: if users A and B both liked items 1, 2, and 3, and user A also liked item 4, then user B will probably like item 4 too. This approach sidesteps the difficult problem of understanding why users like things and focuses purely on what they like.
Content-based filtering, by contrast, analyzes item attributes (genre, author, color) to find similar items. It works well when you have rich metadata but struggles with serendipity—it won’t recommend jazz to someone who’s only listened to rock, even if similar users made that jump. Collaborative filtering captures these cross-category preferences naturally.
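The core assumption can be made concrete with a toy example (illustrative numbers only): two users who agree on three items, where one has also rated a fourth.

```python
import numpy as np

# Toy ratings: users A and B over items 1-4 (0 = unrated)
ratings = np.array([
    [5, 4, 5, 5],   # user A
    [5, 4, 5, 0],   # user B: item 4 still unknown
], dtype=float)

# Cosine similarity between A and B over the co-rated items 1-3
a, b = ratings[0, :3], ratings[1, :3]
sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict B's rating for item 4 as A's rating weighted by that similarity
predicted = sim * ratings[0, 3]
```

Here the two users agree perfectly on the co-rated items, so the prediction is simply A's rating; with more neighbors, predictions become similarity-weighted averages, as the code later in this article shows.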
User-Based vs. Item-Based Collaborative Filtering
The two main collaborative filtering approaches differ in what they compare. User-based filtering finds users similar to the target user and recommends items those neighbors enjoyed. Item-based filtering finds items similar to what the target user already liked.
User-based seems intuitive but has practical problems. User preferences shift over time, so similarity matrices become stale quickly. With millions of users, computing and storing all pairwise similarities becomes expensive. And user behavior is often sparse—most users interact with a tiny fraction of available items.
Item-based filtering scales better for most applications. Item relationships are more stable (a thriller movie stays similar to other thrillers), the item catalog is usually smaller than the user base, and you can pre-compute similarities during off-peak hours. Amazon popularized this approach in their 2003 paper, and it remains the default choice for many production systems.
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np
from scipy.sparse import csr_matrix


@dataclass
class Recommendation:
    item_id: int
    predicted_score: float


class CollaborativeFilter(ABC):
    def __init__(self, ratings_matrix: csr_matrix, k_neighbors: int = 50):
        self.ratings = ratings_matrix
        self.k = k_neighbors
        self.similarity_matrix = None

    @abstractmethod
    def compute_similarities(self) -> None:
        """Build the similarity matrix."""

    @abstractmethod
    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for a user-item pair."""

    def recommend(self, user_id: int, n: int = 10) -> list[Recommendation]:
        """Return top-N recommendations for a user."""
        user_ratings = self.ratings[user_id].toarray().flatten()
        unrated_items = np.where(user_ratings == 0)[0]
        predictions = [
            Recommendation(item_id=int(item), predicted_score=self.predict(user_id, item))
            for item in unrated_items
        ]
        return sorted(predictions, key=lambda x: x.predicted_score, reverse=True)[:n]


class ItemBasedCF(CollaborativeFilter):
    def compute_similarities(self) -> None:
        # Transpose so items are rows for pairwise comparison;
        # cosine_similarity_sparse is defined in the next section
        item_matrix = self.ratings.T.tocsr()
        self.similarity_matrix = cosine_similarity_sparse(item_matrix)

    def predict(self, user_id: int, item_id: int) -> float:
        user_ratings = self.ratings[user_id].toarray().flatten()
        item_similarities = self.similarity_matrix[item_id]
        rated_items = np.where(user_ratings > 0)[0]
        if len(rated_items) == 0:
            return 0.0

        # Get the k most similar items that the user has rated
        neighbor_sims = item_similarities[rated_items]
        top_k_idx = np.argsort(neighbor_sims)[-self.k:]
        top_k_items = rated_items[top_k_idx]
        top_k_sims = neighbor_sims[top_k_idx]

        # Similarity-weighted average of the user's ratings
        numerator = np.sum(top_k_sims * user_ratings[top_k_items])
        denominator = np.sum(np.abs(top_k_sims))
        return numerator / denominator if denominator > 0 else 0.0
```
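To see the weighted-average step with actual numbers, here is a self-contained walkthrough on a made-up 3×3 ratings matrix (independent of the classes above):

```python
import numpy as np

# Toy ratings: 3 users x 3 items (0 = unrated)
R = np.array([
    [5, 3, 0],
    [4, 0, 4],
    [0, 2, 5],
], dtype=float)

# Item-item cosine similarity (columns are items)
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0)

# Predict user 0's rating for item 2: weight their ratings of items 0 and 1
# by those items' similarity to item 2, then normalize
rated = np.array([0, 1])
numerator = np.sum(sim[2, rated] * R[0, rated])
denominator = np.sum(np.abs(sim[2, rated]))
pred = numerator / denominator
```

The prediction lands between user 0's two observed ratings, pulled toward whichever rated item is more similar to the target.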
Building the Similarity Matrix
Similarity computation is the performance bottleneck in collaborative filtering. For a catalog of 100,000 items, a full pairwise pass means roughly 5 billion similarities (100,000² / 2). Choosing the right metric and implementation matters.
Cosine similarity measures the angle between vectors, ignoring magnitude. It’s the standard choice for sparse rating data because it handles missing values naturally—zeros don’t contribute to the dot product. Pearson correlation adjusts for user rating biases (some users rate everything high) but requires more computation. Jaccard index works well for implicit feedback (clicks, purchases) where you only have binary signals.
```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize


def cosine_similarity_sparse(matrix: csr_matrix, chunk_size: int = 1000) -> np.ndarray:
    """
    Compute cosine similarity for sparse matrices in chunks to manage memory.
    Returns a dense matrix—for very large catalogs, use approximate methods instead.
    """
    # Normalize rows to unit vectors
    normalized = normalize(matrix, norm='l2', axis=1)
    n_items = matrix.shape[0]
    similarity = np.zeros((n_items, n_items), dtype=np.float32)

    # Process in chunks to avoid memory explosion
    for start in range(0, n_items, chunk_size):
        end = min(start + chunk_size, n_items)
        chunk = normalized[start:end]
        # Dot product of normalized vectors = cosine similarity
        similarity[start:end] = chunk.dot(normalized.T).toarray()

    # Zero out self-similarity to avoid recommending the same item
    np.fill_diagonal(similarity, 0)
    return similarity


def pearson_correlation_sparse(matrix: csr_matrix) -> np.ndarray:
    """
    Pearson correlation adjusted for user/item mean ratings.
    More expensive but handles rating bias better.
    """
    # Center ratings by subtracting row means, computed over non-zero entries only
    matrix_dense = matrix.toarray().astype(np.float64)
    counts = (matrix_dense != 0).sum(axis=1)
    row_means = np.divide(
        matrix_dense.sum(axis=1),
        counts,
        out=np.zeros(matrix_dense.shape[0]),  # rows with no ratings get mean 0
        where=counts != 0,
    )

    # Only subtract the mean from non-zero entries
    centered = matrix_dense.copy()
    for i in range(centered.shape[0]):
        mask = centered[i] != 0
        centered[i, mask] -= row_means[i]

    # Cosine similarity on mean-centered data is the Pearson correlation
    return cosine_similarity_sparse(csr_matrix(centered))
```
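For implicit feedback, the Jaccard index mentioned above can be sketched as follows (the `jaccard_similarity_sparse` helper is illustrative, not from a library):

```python
import numpy as np
from scipy.sparse import csr_matrix


def jaccard_similarity_sparse(matrix: csr_matrix) -> np.ndarray:
    """Jaccard index between rows of a binary interaction matrix."""
    binary = (matrix > 0).astype(np.float64)
    intersection = binary.dot(binary.T).toarray()          # |A and B|
    row_sums = np.asarray(binary.sum(axis=1)).flatten()    # |A|, |B|
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Guard the division so all-zero rows get similarity 0, not NaN
    sim = np.where(union > 0, intersection / np.maximum(union, 1e-12), 0.0)
    np.fill_diagonal(sim, 0)
    return sim


# Example: three users' click vectors over three items
clicks = csr_matrix(np.array([[1, 1, 0], [1, 0, 1], [0, 0, 0]]))
jaccard = jaccard_similarity_sparse(clicks)
```

Because it only counts overlapping interactions, Jaccard ignores rating magnitude entirely, which is exactly what you want when the signal is a click or purchase rather than a score.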
Generating Predictions and Rankings
The prediction formula aggregates neighbor opinions weighted by similarity. For item-based filtering: find items similar to what the user rated, weight their ratings by similarity scores, and normalize.
Choosing k (number of neighbors) involves a trade-off. Too few neighbors and predictions are noisy; too many and you dilute signal with weakly similar items. Start with k=20-50 and tune based on your evaluation metrics.
```python
def get_top_n_recommendations(
    cf_model: CollaborativeFilter,
    user_id: int,
    n: int = 10,
    min_similarity: float = 0.1,
    exclude_items: set[int] | None = None,
) -> list[Recommendation]:
    """Generate recommendations with similarity filtering and score normalization."""
    exclude_items = exclude_items or set()
    user_ratings = cf_model.ratings[user_id].toarray().flatten()

    # Candidate items the user hasn't rated
    candidates = [
        i for i in range(len(user_ratings))
        if user_ratings[i] == 0 and i not in exclude_items
    ]

    recommendations = []
    rated_mask = user_ratings > 0
    ratings = user_ratings[rated_mask]
    for item_id in candidates:
        # Similarity of this candidate to every item the user has rated
        similarities = cf_model.similarity_matrix[item_id][rated_mask]

        # Keep only neighbors above the similarity threshold
        valid_mask = similarities >= min_similarity
        if not np.any(valid_mask):
            continue
        valid_sims = similarities[valid_mask]
        valid_ratings = ratings[valid_mask]

        # Take the top-k neighbors
        if len(valid_sims) > cf_model.k:
            top_k_idx = np.argsort(valid_sims)[-cf_model.k:]
            valid_sims = valid_sims[top_k_idx]
            valid_ratings = valid_ratings[top_k_idx]

        # Similarity-weighted average prediction
        score = np.dot(valid_sims, valid_ratings) / np.sum(valid_sims)
        recommendations.append(Recommendation(item_id=item_id, predicted_score=float(score)))

    # Sort by predicted score and return the top N
    recommendations.sort(key=lambda x: x.predicted_score, reverse=True)
    return recommendations[:n]
```
Scaling the Architecture
Production recommendation systems serve millions of users with sub-100ms latency requirements. You can’t compute similarities on every request.
Pre-compute everything possible. Similarity matrices should be rebuilt in batch jobs (nightly or weekly depending on catalog velocity). Store the top-k similar items per item rather than the full matrix—you rarely need similarities below rank 100. Use Redis sorted sets for fast retrieval of similar items.
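One way to truncate the dense similarity matrix to a per-item neighbor table is sketched below (the `top_k_neighbors` helper is an assumption for illustration); each resulting list maps naturally onto a Redis sorted set keyed by item id.

```python
import numpy as np


def top_k_neighbors(similarity: np.ndarray,
                    k: int = 100) -> dict[int, list[tuple[int, float]]]:
    """Reduce a dense item-item matrix to the k most similar neighbors per item."""
    table = {}
    n = similarity.shape[0]
    for item in range(n):
        sims = similarity[item]
        # argpartition finds the k largest entries without a full sort
        candidates = np.argpartition(sims, -k)[-k:] if k < n else np.arange(n)
        order = candidates[np.argsort(sims[candidates])[::-1]]
        table[item] = [(int(j), float(sims[j])) for j in order if sims[j] > 0]
    return table


# Example: keep only the top 2 neighbors per item
sim = np.array([
    [0.0, 0.9, 0.1, 0.4],
    [0.9, 0.0, 0.2, 0.3],
    [0.1, 0.2, 0.0, 0.0],
    [0.4, 0.3, 0.0, 0.0],
])
neighbors = top_k_neighbors(sim, k=2)
```

Storage drops from O(n²) to O(n·k), and serving a "similar items" request becomes a single key lookup.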
For very large catalogs, matrix factorization reduces dimensionality. Instead of storing explicit similarities, you learn latent factor vectors for users and items. The dot product of these vectors approximates the rating. SVD and ALS (Alternating Least Squares) are standard approaches.
```python
import numpy as np
from scipy.sparse import csr_matrix


class MatrixFactorization:
    """
    SGD-based matrix factorization for collaborative filtering.
    Learns latent factor vectors for users and items.
    """

    def __init__(self, n_factors: int = 50, learning_rate: float = 0.01,
                 regularization: float = 0.02, n_epochs: int = 20):
        self.n_factors = n_factors
        self.lr = learning_rate
        self.reg = regularization
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None

    def fit(self, ratings: csr_matrix) -> None:
        n_users, n_items = ratings.shape
        # Initialize factors with small random values
        self.user_factors = np.random.normal(0, 0.1, (n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (n_items, self.n_factors))

        # Train only on observed (non-zero) entries
        rows, cols = ratings.nonzero()
        values = np.asarray(ratings[rows, cols]).flatten()

        for epoch in range(self.n_epochs):
            total_error = 0.0
            for idx in np.random.permutation(len(rows)):
                u, i, r = rows[idx], cols[idx], values[idx]

                # Prediction and error for this observation
                pred = np.dot(self.user_factors[u], self.item_factors[i])
                error = r - pred
                total_error += error ** 2

                # SGD updates with L2 regularization
                user_update = self.lr * (error * self.item_factors[i] - self.reg * self.user_factors[u])
                item_update = self.lr * (error * self.user_factors[u] - self.reg * self.item_factors[i])
                self.user_factors[u] += user_update
                self.item_factors[i] += item_update

            rmse = np.sqrt(total_error / len(rows))
            print(f"Epoch {epoch + 1}: RMSE = {rmse:.4f}")

    def predict(self, user_id: int, item_id: int) -> float:
        return float(np.dot(self.user_factors[user_id], self.item_factors[item_id]))

    def recommend(self, user_id: int, n: int = 10,
                  rated_items: set[int] | None = None) -> list[Recommendation]:
        rated_items = rated_items or set()
        scores = np.dot(self.user_factors[user_id], self.item_factors.T)

        # Mask already-rated items so they can't be recommended
        for item in rated_items:
            scores[item] = -np.inf

        top_items = np.argsort(scores)[-n:][::-1]
        return [Recommendation(item_id=int(i), predicted_score=float(scores[i]))
                for i in top_items]
```
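To see why low-rank factors can stand in for explicit similarities, here is a truncated SVD on a toy, fully observed matrix (a simplification: the SGD trainer above exists precisely because real rating matrices are mostly missing entries, which plain SVD can't handle):

```python
import numpy as np

# Toy, fully observed ratings matrix: two similar users, one different
R = np.array([
    [5, 3, 1],
    [4, 3, 1],
    [1, 1, 5],
], dtype=float)

# Truncated SVD: keep 2 latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]   # fold singular values into the user side
item_factors = Vt[:k, :].T

# The dot product of factor vectors approximates the original ratings
R_hat = user_factors @ item_factors.T
```

Nine ratings are compressed into twelve factor values here; at production scale, the compression is what makes scoring every item for a user a single matrix-vector product.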
Handling Cold Start and Data Sparsity
New users and new items break collaborative filtering—there’s no behavioral data to compute similarities. Plan for this from the start.
For new users: fall back to popularity-based recommendations (most purchased, highest rated), use demographic or contextual signals if available, or prompt for explicit preferences during onboarding. For new items: use content-based features to bootstrap similarity, or employ explore/exploit strategies to gather initial ratings.
```python
class HybridRecommender:
    """
    Falls back to popularity when collaborative filtering can't help.
    """

    def __init__(self, cf_model: CollaborativeFilter, min_ratings: int = 5):
        self.cf = cf_model
        self.min_ratings = min_ratings
        self.popularity_scores = self._compute_popularity()

    def _compute_popularity(self) -> np.ndarray:
        """Popularity = number of ratings * average rating."""
        ratings = self.cf.ratings.toarray()
        n_ratings = (ratings > 0).sum(axis=0)
        avg_ratings = np.divide(
            ratings.sum(axis=0),
            n_ratings,
            out=np.zeros(ratings.shape[1]),  # items with no ratings score 0
            where=n_ratings > 0,
        )
        return n_ratings * avg_ratings

    def recommend(self, user_id: int, n: int = 10) -> list[Recommendation]:
        user_ratings = self.cf.ratings[user_id].toarray().flatten()
        n_user_ratings = (user_ratings > 0).sum()

        # Cold start: fall back to popularity
        if n_user_ratings < self.min_ratings:
            unrated = np.where(user_ratings == 0)[0]
            pop_scores = self.popularity_scores[unrated]
            top_idx = np.argsort(pop_scores)[-n:][::-1]
            return [
                Recommendation(item_id=int(unrated[i]), predicted_score=float(pop_scores[i]))
                for i in top_idx
            ]

        # Enough data: use collaborative filtering
        return self.cf.recommend(user_id, n)
```
Evaluation and Iteration
Offline evaluation gives you directional guidance; online A/B tests tell you what actually works. Use both.
For offline metrics, split your data temporally (train on older data, test on newer) rather than randomly—this simulates production conditions. RMSE measures prediction accuracy but doesn’t reflect ranking quality. Precision@k and recall@k better capture whether users see relevant items in their top recommendations.
```python
def evaluate_recommender(cf_model: CollaborativeFilter,
                         test_ratings: csr_matrix, k: int = 10) -> dict:
    """Compute precision@k and recall@k on held-out ratings."""
    precisions, recalls = [], []
    test_users, test_items = test_ratings.nonzero()
    unique_users = np.unique(test_users)

    for user_id in unique_users:
        # Items the user actually rated in the test set
        actual_items = set(test_items[test_users == user_id])
        if len(actual_items) == 0:
            continue

        # The model's top-k recommendations
        recs = cf_model.recommend(user_id, n=k)
        recommended_items = {r.item_id for r in recs}

        # Hits = recommended items the user actually rated
        hits = len(actual_items & recommended_items)
        precisions.append(hits / k)
        recalls.append(hits / len(actual_items))

    return {
        'precision@k': np.mean(precisions),
        'recall@k': np.mean(recalls),
        'n_users_evaluated': len(precisions),
    }
```
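The temporal split recommended above can be sketched as follows (the `temporal_split` helper and its (user, item, rating, timestamp) tuple format are assumptions for illustration):

```python
from scipy.sparse import csr_matrix


def temporal_split(interactions: list[tuple[int, int, float, float]],
                   n_users: int, n_items: int,
                   train_fraction: float = 0.8) -> tuple[csr_matrix, csr_matrix]:
    """Sort interactions by timestamp: oldest go to train, newest to test."""
    ordered = sorted(interactions, key=lambda t: t[3])
    cutoff = int(len(ordered) * train_fraction)

    def to_matrix(rows):
        users = [u for u, _, _, _ in rows]
        items = [i for _, i, _, _ in rows]
        vals = [r for _, _, r, _ in rows]
        return csr_matrix((vals, (users, items)), shape=(n_users, n_items))

    return to_matrix(ordered[:cutoff]), to_matrix(ordered[cutoff:])


# Example: five interactions, the newest one held out for testing
events = [(0, 0, 5.0, 1), (0, 1, 4.0, 2), (1, 0, 3.0, 3), (1, 1, 2.0, 4), (0, 2, 1.0, 5)]
train, test = temporal_split(events, n_users=2, n_items=3)
```

Unlike a random split, this never asks the model to "predict the past", so offline numbers track production behavior more closely.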
Watch for feedback loops: recommendations influence behavior, which influences future recommendations. Popular items get more exposure, generating more ratings, making them more popular. Inject diversity intentionally and monitor coverage metrics (what percentage of your catalog gets recommended).
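A minimal coverage check looks like this (the `catalog_coverage` helper is illustrative):

```python
def catalog_coverage(recommendation_lists: list[list[int]], n_items: int) -> float:
    """Fraction of the catalog appearing in at least one user's top-N list."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended) / n_items


# Example: two users' top-2 lists cover 3 of 10 catalog items
coverage = catalog_coverage([[0, 1], [1, 2]], n_items=10)
```

Tracked over time, a shrinking coverage number is an early warning that the popularity feedback loop is narrowing what users see.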
Start simple with item-based collaborative filtering, measure rigorously, and add complexity only when you’ve exhausted gains from the basic approach. Most recommendation systems fail not from algorithmic sophistication but from poor data quality and misaligned evaluation metrics.