Design a Recommendation Engine: Collaborative Filtering

Key Insights

  • Collaborative filtering assumes users with similar past behavior will have similar future preferences—this simple assumption powers recommendations at Netflix, Spotify, and Amazon
  • Item-based collaborative filtering scales better than user-based for most production systems because item similarities are more stable and can be pre-computed efficiently
  • Cold start problems are inevitable; plan for hybrid approaches from day one rather than retrofitting them later

Recommendation engines drive engagement across modern applications, from e-commerce product suggestions to streaming service queues. Collaborative filtering remains the foundational technique behind most production systems because it works without understanding content—it only needs behavioral data.

The core assumption is elegant: if users A and B both liked items 1, 2, and 3, and user A also liked item 4, then user B will probably like item 4 too. This approach sidesteps the difficult problem of understanding why users like things and focuses purely on what they like.
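The assumption is easy to see on a toy matrix (the ratings below are made up for illustration):

```python
import numpy as np

# Rows = users A, B; columns = items 1-4. A rated all four; B hasn't rated item 4.
ratings = np.array([
    [5.0, 4.0, 5.0, 4.0],  # user A
    [5.0, 4.0, 5.0, 0.0],  # user B (item 4 unrated)
])

# Cosine similarity between A and B over the items both rated
common = (ratings[0] > 0) & (ratings[1] > 0)
a, b = ratings[0, common], ratings[1, common]
sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(sim, 4))  # → 1.0: A and B agree perfectly on items 1-3
print(ratings[0, 3])  # so A's rating of item 4 becomes the prediction for B
```

Real matrices are far sparser and noisier than this, which is why production systems aggregate over many neighbors rather than trusting one.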

Content-based filtering, by contrast, analyzes item attributes (genre, author, color) to find similar items. It works well when you have rich metadata but struggles with serendipity—it won’t recommend jazz to someone who’s only listened to rock, even if similar users made that jump. Collaborative filtering captures these cross-category preferences naturally.

User-Based vs. Item-Based Collaborative Filtering

The two main collaborative filtering approaches differ in what they compare. User-based filtering finds users similar to the target user and recommends items those neighbors enjoyed. Item-based filtering finds items similar to what the target user already liked.

User-based seems intuitive but has practical problems. User preferences shift over time, so similarity matrices become stale quickly. With millions of users, computing and storing all pairwise similarities becomes expensive. And user behavior is often sparse—most users interact with a tiny fraction of available items.

Item-based filtering scales better for most applications. Item relationships are more stable (a thriller movie stays similar to other thrillers), the item catalog is usually smaller than the user base, and you can pre-compute similarities during off-peak hours. Amazon popularized this approach in their 2003 paper, and it remains the default choice for many production systems.

from abc import ABC, abstractmethod
from dataclasses import dataclass
import numpy as np
from scipy.sparse import csr_matrix

@dataclass
class Recommendation:
    item_id: int
    predicted_score: float

class CollaborativeFilter(ABC):
    def __init__(self, ratings_matrix: csr_matrix, k_neighbors: int = 50):
        self.ratings = ratings_matrix
        self.k = k_neighbors
        self.similarity_matrix = None
    
    @abstractmethod
    def compute_similarities(self) -> None:
        """Build the similarity matrix."""
        pass
    
    @abstractmethod
    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for a user-item pair."""
        pass
    
    def recommend(self, user_id: int, n: int = 10) -> list[Recommendation]:
        """Return top-N recommendations for a user."""
        user_ratings = self.ratings[user_id].toarray().flatten()
        unrated_items = np.where(user_ratings == 0)[0]
        
        predictions = [
            Recommendation(item_id=item, predicted_score=self.predict(user_id, item))
            for item in unrated_items
        ]
        return sorted(predictions, key=lambda x: x.predicted_score, reverse=True)[:n]


class ItemBasedCF(CollaborativeFilter):
    def compute_similarities(self) -> None:
        # Transpose so items are rows for pairwise comparison
        # (.T of a CSR matrix is CSC; convert back for efficient row slicing)
        item_matrix = self.ratings.T.tocsr()
        self.similarity_matrix = cosine_similarity_sparse(item_matrix)
    
    def predict(self, user_id: int, item_id: int) -> float:
        user_ratings = self.ratings[user_id].toarray().flatten()
        item_similarities = self.similarity_matrix[item_id]
        
        rated_items = np.where(user_ratings > 0)[0]
        if len(rated_items) == 0:
            return 0.0
        
        # Get k most similar items that user has rated
        neighbor_sims = item_similarities[rated_items]
        top_k_idx = np.argsort(neighbor_sims)[-self.k:]
        top_k_items = rated_items[top_k_idx]
        top_k_sims = neighbor_sims[top_k_idx]
        
        # Weighted average
        numerator = np.sum(top_k_sims * user_ratings[top_k_items])
        denominator = np.sum(np.abs(top_k_sims))
        
        return numerator / denominator if denominator > 0 else 0.0

Building the Similarity Matrix

Similarity computation is the performance bottleneck in collaborative filtering. For a catalog of 100,000 items, you’re computing 5 billion pairwise similarities. Choosing the right metric and implementation matters.

Cosine similarity measures the angle between vectors, ignoring magnitude. It’s the standard choice for sparse rating data because it handles missing values naturally—zeros don’t contribute to the dot product. Pearson correlation adjusts for user rating biases (some users rate everything high) but requires more computation. Jaccard index works well for implicit feedback (clicks, purchases) where you only have binary signals.

from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import normalize

def cosine_similarity_sparse(matrix: csr_matrix, chunk_size: int = 1000) -> np.ndarray:
    """
    Compute cosine similarity for sparse matrices in chunks to manage memory.
    Returns dense matrix—for very large catalogs, use approximate methods instead.
    """
    # Normalize rows to unit vectors
    normalized = normalize(matrix, norm='l2', axis=1)
    n_items = matrix.shape[0]
    similarity = np.zeros((n_items, n_items), dtype=np.float32)
    
    # Process in chunks to avoid memory explosion
    for start in range(0, n_items, chunk_size):
        end = min(start + chunk_size, n_items)
        chunk = normalized[start:end]
        
        # Dot product of normalized vectors = cosine similarity
        similarity[start:end] = chunk.dot(normalized.T).toarray()
    
    # Zero out self-similarity to avoid recommending same item
    np.fill_diagonal(similarity, 0)
    return similarity


def pearson_correlation_sparse(matrix: csr_matrix) -> np.ndarray:
    """
    Pearson correlation adjusted for user/item mean ratings.
    More expensive but handles rating bias better.
    """
    # Center ratings by subtracting each row's mean over its non-zero entries
    matrix_dense = matrix.toarray().astype(np.float64)
    counts = (matrix_dense != 0).sum(axis=1)
    row_means = np.divide(
        matrix_dense.sum(axis=1),
        counts,
        out=np.zeros(matrix_dense.shape[0]),
        where=counts > 0
    )
    
    # Only subtract mean from non-zero entries
    centered = matrix_dense.copy()
    for i in range(centered.shape[0]):
        mask = centered[i] != 0
        centered[i, mask] -= row_means[i]
    
    # Now compute cosine similarity on centered data
    return cosine_similarity_sparse(csr_matrix(centered))
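The Jaccard option mentioned above has no magnitude to normalize: for binary signals, the similarity of two items is just the overlap of their user sets divided by the union. A minimal sketch (the function name is my own, following the pattern of the two above):

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_similarity_sparse(matrix: csr_matrix) -> np.ndarray:
    """Jaccard index between rows of a binary interaction matrix (rows = items)."""
    binary = (matrix > 0).astype(np.float64)
    # Co-occurrence counts: how many users interacted with both items
    intersection = binary.dot(binary.T).toarray()
    row_sums = np.asarray(binary.sum(axis=1)).flatten()
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = row_sums[:, None] + row_sums[None, :] - intersection
    with np.errstate(divide='ignore', invalid='ignore'):
        sim = np.where(union > 0, intersection / union, 0.0)
    np.fill_diagonal(sim, 0)
    return sim
```

Because it only counts set overlap, this works even when "ratings" are clicks or purchases with no meaningful magnitude.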

Generating Predictions and Rankings

The prediction formula aggregates neighbor opinions weighted by similarity. For item-based filtering: find items similar to what the user rated, weight their ratings by similarity scores, and normalize.

Choosing k (number of neighbors) involves a trade-off. Too few neighbors and predictions are noisy; too many and you dilute signal with weakly similar items. Start with k=20-50 and tune based on your evaluation metrics.

def get_top_n_recommendations(
    cf_model: CollaborativeFilter,
    user_id: int,
    n: int = 10,
    min_similarity: float = 0.1,
    exclude_items: set[int] | None = None
) -> list[Recommendation]:
    """
    Generate recommendations with filtering and score normalization.
    """
    exclude_items = exclude_items or set()
    user_ratings = cf_model.ratings[user_id].toarray().flatten()
    
    # Get items user hasn't rated
    candidates = [
        i for i in range(len(user_ratings))
        if user_ratings[i] == 0 and i not in exclude_items
    ]
    
    recommendations = []
    for item_id in candidates:
        # Get similarity scores to items user has rated
        rated_mask = user_ratings > 0
        similarities = cf_model.similarity_matrix[item_id][rated_mask]
        ratings = user_ratings[rated_mask]
        
        # Filter to neighbors above similarity threshold
        valid_mask = similarities >= min_similarity
        if not np.any(valid_mask):
            continue
        
        valid_sims = similarities[valid_mask]
        valid_ratings = ratings[valid_mask]
        
        # Take top-k neighbors
        if len(valid_sims) > cf_model.k:
            top_k_idx = np.argsort(valid_sims)[-cf_model.k:]
            valid_sims = valid_sims[top_k_idx]
            valid_ratings = valid_ratings[top_k_idx]
        
        # Weighted average prediction
        score = np.dot(valid_sims, valid_ratings) / np.sum(valid_sims)
        recommendations.append(Recommendation(item_id=item_id, predicted_score=score))
    
    # Sort and return top N
    recommendations.sort(key=lambda x: x.predicted_score, reverse=True)
    return recommendations[:n]

Scaling the Architecture

Production recommendation systems serve millions of users with sub-100ms latency requirements. You can’t compute similarities on every request.

Pre-compute everything possible. Similarity matrices should be rebuilt in batch jobs (nightly or weekly depending on catalog velocity). Store the top-k similar items per item rather than the full matrix—you rarely need similarities below rank 100. Use Redis sorted sets for fast retrieval of similar items.
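Truncating the full matrix to its top-k neighbors per item is a one-pass job in the batch pipeline; the Redis key scheme in the trailing comment is illustrative, not prescriptive:

```python
import numpy as np

def top_k_neighbors(similarity: np.ndarray, k: int = 100) -> dict[int, list[tuple[int, float]]]:
    """Keep only the k most similar items per item, sorted by descending similarity."""
    neighbors = {}
    for item_id, row in enumerate(similarity):
        top_idx = np.argsort(row)[-k:][::-1]
        neighbors[item_id] = [(int(j), float(row[j])) for j in top_idx if row[j] > 0]
    return neighbors

# Loading into Redis sorted sets would then look roughly like:
#   r.zadd(f"sim:{item_id}", {str(j): score for j, score in pairs})
#   r.zrevrange(f"sim:{item_id}", 0, 9, withscores=True)  # top-10 at serve time
```

At 100 neighbors per item instead of a full row, storage drops from O(n²) to O(n·k), which is what makes nightly rebuilds and in-memory serving feasible.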

For very large catalogs, matrix factorization reduces dimensionality. Instead of storing explicit similarities, you learn latent factor vectors for users and items. The dot product of these vectors approximates the rating. SVD and ALS (Alternating Least Squares) are standard approaches.

import numpy as np

class MatrixFactorization:
    """
    SGD-based matrix factorization for collaborative filtering.
    Learns latent factors for users and items.
    """
    def __init__(self, n_factors: int = 50, learning_rate: float = 0.01,
                 regularization: float = 0.02, n_epochs: int = 20):
        self.n_factors = n_factors
        self.lr = learning_rate
        self.reg = regularization
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None
    
    def fit(self, ratings: csr_matrix) -> None:
        n_users, n_items = ratings.shape
        
        # Initialize with small random values
        self.user_factors = np.random.normal(0, 0.1, (n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (n_items, self.n_factors))
        
        # Get non-zero entries for training
        rows, cols = ratings.nonzero()
        values = np.array(ratings[rows, cols]).flatten()
        
        for epoch in range(self.n_epochs):
            total_error = 0
            for idx in np.random.permutation(len(rows)):
                u, i, r = rows[idx], cols[idx], values[idx]
                
                # Prediction and error
                pred = np.dot(self.user_factors[u], self.item_factors[i])
                error = r - pred
                total_error += error ** 2
                
                # SGD updates with regularization
                user_update = self.lr * (error * self.item_factors[i] - self.reg * self.user_factors[u])
                item_update = self.lr * (error * self.user_factors[u] - self.reg * self.item_factors[i])
                
                self.user_factors[u] += user_update
                self.item_factors[i] += item_update
            
            rmse = np.sqrt(total_error / len(rows))
            print(f"Epoch {epoch + 1}: RMSE = {rmse:.4f}")
    
    def predict(self, user_id: int, item_id: int) -> float:
        return np.dot(self.user_factors[user_id], self.item_factors[item_id])
    
    def recommend(self, user_id: int, n: int = 10, rated_items: set = None) -> list[Recommendation]:
        rated_items = rated_items or set()
        scores = np.dot(self.user_factors[user_id], self.item_factors.T)
        
        # Mask already-rated items
        for item in rated_items:
            scores[item] = -np.inf
        
        top_items = np.argsort(scores)[-n:][::-1]
        return [Recommendation(item_id=int(i), predicted_score=float(scores[i])) for i in top_items]

Handling Cold Start and Data Sparsity

New users and new items break collaborative filtering—there’s no behavioral data to compute similarities. Plan for this from the start.

For new users: fall back to popularity-based recommendations (most purchased, highest rated), use demographic or contextual signals if available, or prompt for explicit preferences during onboarding. For new items: use content-based features to bootstrap similarity, or employ explore/exploit strategies to gather initial ratings.

class HybridRecommender:
    """
    Falls back to popularity when collaborative filtering can't help.
    """
    def __init__(self, cf_model: CollaborativeFilter, min_ratings: int = 5):
        self.cf = cf_model
        self.min_ratings = min_ratings
        self.popularity_scores = self._compute_popularity()
    
    def _compute_popularity(self) -> np.ndarray:
        """Popularity = number of ratings * average rating."""
        ratings = self.cf.ratings.toarray()
        n_ratings = (ratings > 0).sum(axis=0)
        avg_ratings = np.divide(
            ratings.sum(axis=0),
            n_ratings,
            out=np.zeros(ratings.shape[1]),
            where=n_ratings > 0
        )
        return n_ratings * avg_ratings
    
    def recommend(self, user_id: int, n: int = 10) -> list[Recommendation]:
        user_ratings = self.cf.ratings[user_id].toarray().flatten()
        n_user_ratings = (user_ratings > 0).sum()
        
        # Cold start: use popularity
        if n_user_ratings < self.min_ratings:
            unrated = np.where(user_ratings == 0)[0]
            pop_scores = self.popularity_scores[unrated]
            top_idx = np.argsort(pop_scores)[-n:][::-1]
            return [
                Recommendation(item_id=int(unrated[i]), predicted_score=pop_scores[i])
                for i in top_idx
            ]
        
        # Enough data: use collaborative filtering
        return self.cf.recommend(user_id, n)

Evaluation and Iteration

Offline evaluation gives you directional guidance; online A/B tests tell you what actually works. Use both.

For offline metrics, split your data temporally (train on older data, test on newer) rather than randomly—this simulates production conditions. RMSE measures prediction accuracy but doesn’t reflect ranking quality. Precision@k and recall@k better capture whether users see relevant items in their top recommendations.
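The temporal split itself is just a sort by timestamp, no library helper needed; a sketch, assuming you have one timestamp per rating event:

```python
import numpy as np

def temporal_split(timestamps: np.ndarray, test_fraction: float = 0.2):
    """Return (train_idx, test_idx): oldest events train, newest events test."""
    order = np.argsort(timestamps)
    cutoff = int(len(order) * (1 - test_fraction))
    return order[:cutoff], order[cutoff:]
```

Apply the returned index arrays to your (user, item, rating) triples before building the train and test sparse matrices, so every test interaction is strictly newer than the training data.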

def evaluate_recommender(cf_model: CollaborativeFilter, test_ratings: csr_matrix, k: int = 10):
    """Compute precision@k and recall@k on held-out ratings."""
    precisions, recalls = [], []
    
    test_users, test_items = test_ratings.nonzero()
    unique_users = np.unique(test_users)
    
    for user_id in unique_users:
        # Get actual items user rated in test set
        actual_items = set(test_items[test_users == user_id])
        if len(actual_items) == 0:
            continue
        
        # Get model's top-k recommendations
        recs = cf_model.recommend(user_id, n=k)
        recommended_items = {r.item_id for r in recs}
        
        # Calculate metrics
        hits = len(actual_items & recommended_items)
        precisions.append(hits / k)
        recalls.append(hits / len(actual_items))
    
    return {
        'precision@k': np.mean(precisions),
        'recall@k': np.mean(recalls),
        'n_users_evaluated': len(precisions)
    }

Watch for feedback loops: recommendations influence behavior, which influences future recommendations. Popular items get more exposure, generating more ratings, making them more popular. Inject diversity intentionally and monitor coverage metrics (what percentage of your catalog gets recommended).
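Coverage is cheap to compute alongside precision and recall; a minimal sketch (helper name is my own):

```python
def catalog_coverage(all_recommendations: list[list[int]], n_items: int) -> float:
    """Fraction of the catalog appearing in at least one user's top-N list."""
    recommended = {item for recs in all_recommendations for item in recs}
    return len(recommended) / n_items

# e.g. three users' top-3 lists over a 10-item catalog:
# catalog_coverage([[1, 2, 3], [2, 3, 4], [1, 5, 9]], 10) → 0.6
```

A coverage number that drifts downward over successive retrains is an early warning that the popularity feedback loop is narrowing what users see.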

Start simple with item-based collaborative filtering, measure rigorously, and add complexity only when you’ve exhausted gains from the basic approach. Most recommendation systems fail not from algorithmic sophistication but from poor data quality and misaligned evaluation metrics.
