How to Apply Bayes' Theorem
Key Insights
- Bayes’ Theorem provides a mathematical framework for updating beliefs based on new evidence, powering everything from spam filters to recommendation engines
- The “naive” independence assumption in Naive Bayes classifiers is often violated in practice, yet the algorithm still performs remarkably well for text classification and similar tasks
- Working in log-space and applying Laplace smoothing are essential techniques for numerical stability and handling unseen features in production systems
Introduction to Bayes’ Theorem
Bayes’ Theorem is a fundamental tool for reasoning under uncertainty. In software engineering, you encounter it constantly—even if you don’t realize it. Gmail’s spam filter, Netflix’s recommendation system, and modern A/B testing frameworks all rely on Bayesian reasoning to make decisions from incomplete information.
The theorem itself is deceptively simple:
P(A|B) = P(B|A) × P(A) / P(B)
In plain English: the probability of A given B equals the probability of B given A, multiplied by the prior probability of A, divided by the prior probability of B.
Let’s implement this directly in Python:
```python
def bayes_theorem(p_b_given_a, p_a, p_b):
    """
    Calculate P(A|B) using Bayes' Theorem

    Args:
        p_b_given_a: Probability of B given A (likelihood)
        p_a: Prior probability of A
        p_b: Prior probability of B (evidence)

    Returns:
        Posterior probability of A given B
    """
    return (p_b_given_a * p_a) / p_b

# Example: Medical diagnosis
# P(Disease|Positive Test)
p_positive_given_disease = 0.99  # Test sensitivity
p_disease = 0.01                 # 1% of population has disease
p_positive = 0.02                # 2% test positive overall

p_disease_given_positive = bayes_theorem(
    p_positive_given_disease,
    p_disease,
    p_positive
)
print(f"Probability of disease given positive test: {p_disease_given_positive:.2%}")
# Output: 49.50%
```
This result surprises most people. Even with a test that detects 99% of true cases, a positive result implies only a 49.5% chance of actually having the disease when the base rate is this low.
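The 2% evidence term in the example above was given directly, but it can be derived from the test's error rates via the law of total probability. A minimal sketch, assuming a false-positive rate of about 1.02% so the numbers line up with the example:

```python
# Deriving P(positive) instead of assuming it, via the law of total
# probability. The false-positive rate below is an assumption chosen
# to reproduce the 2% evidence figure used in the example.
sensitivity = 0.99            # P(positive | disease)
p_disease = 0.01              # base rate
false_positive_rate = 0.0102  # P(positive | no disease), assumed

p_positive = (sensitivity * p_disease
              + false_positive_rate * (1 - p_disease))
posterior = sensitivity * p_disease / p_positive

print(f"P(positive) = {p_positive:.4f}")        # 0.0200
print(f"P(disease | positive) = {posterior:.2%}")  # 49.50%
```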
Understanding Prior, Likelihood, and Posterior
Let’s break down the components using email spam detection:
- Prior P(Spam): Your belief before seeing the email content. If 30% of emails are spam, P(Spam) = 0.3
- Likelihood P(Words|Spam): Probability of seeing these specific words given it’s spam
- Evidence P(Words): Overall probability of seeing these words
- Posterior P(Spam|Words): Updated belief after seeing the email
Here’s how to calculate each component:
```python
def spam_detection_example():
    # Prior probabilities
    p_spam = 0.3
    p_ham = 0.7

    # Likelihood: P(contains "FREE MONEY" | Spam)
    p_free_money_given_spam = 0.4
    p_free_money_given_ham = 0.01

    # Evidence: P(contains "FREE MONEY")
    # Using the law of total probability
    p_free_money = (p_free_money_given_spam * p_spam +
                    p_free_money_given_ham * p_ham)

    # Posterior: P(Spam | contains "FREE MONEY")
    p_spam_given_free_money = (
        p_free_money_given_spam * p_spam
    ) / p_free_money

    print(f"Prior probability of spam: {p_spam:.2%}")
    print(f"Posterior probability of spam: {p_spam_given_free_money:.2%}")

spam_detection_example()
# Prior probability of spam: 30.00%
# Posterior probability of spam: 94.49%
```
The phrase “FREE MONEY” dramatically increases our confidence that an email is spam, updating our belief from 30% to 94.49%.
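This updating step also composes: the posterior after one piece of evidence becomes the prior for the next. A minimal sketch of sequential updating, assuming the two signals are conditionally independent given the class; the second signal and its likelihoods are hypothetical numbers, not from the example above:

```python
# Sequential Bayesian updating for a binary hypothesis: feed each
# posterior back in as the next prior.
def update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """One Bayesian update step; returns the new posterior."""
    evidence = (p_evidence_given_h * prior
                + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / evidence

p_spam = 0.3
# First signal: contains "FREE MONEY" (likelihoods from the example above)
p_spam = update(p_spam, 0.4, 0.01)
print(f"After 'FREE MONEY': {p_spam:.2%}")      # 94.49%
# Second signal (hypothetical): sender not in the user's contacts
p_spam = update(p_spam, 0.8, 0.3)
print(f"After unknown sender: {p_spam:.2%}")
```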
Practical Application: Building a Naive Bayes Classifier
The “naive” assumption is that features are independent given the class. For text classification, this means each word’s probability is independent—clearly false in reality, but surprisingly effective in practice.
Here’s a complete implementation:
```python
from collections import defaultdict
import math

class NaiveBayesClassifier:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        self.total_docs = 0

    def train(self, documents, labels):
        """
        Train the classifier on documents and their labels

        Args:
            documents: List of documents (each is a list of words)
            labels: List of corresponding labels
        """
        for doc, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.total_docs += 1
            for word in doc:
                self.vocab.add(word)
                self.word_counts[label][word] += 1

    def predict(self, document):
        """
        Predict the most likely class for a document

        Args:
            document: List of words

        Returns:
            Predicted class label
        """
        scores = {}
        for label in self.class_counts:
            # Prior: P(Class)
            prior = math.log(self.class_counts[label] / self.total_docs)
            # Likelihood: P(Words|Class)
            likelihood = 0
            total_words = sum(self.word_counts[label].values())
            vocab_size = len(self.vocab)
            for word in document:
                # Laplace smoothing
                word_count = self.word_counts[label].get(word, 0)
                word_prob = (word_count + 1) / (total_words + vocab_size)
                likelihood += math.log(word_prob)
            scores[label] = prior + likelihood
        return max(scores, key=scores.get)

    def predict_proba(self, document):
        """Return probability distribution over classes"""
        scores = {}
        for label in self.class_counts:
            prior = math.log(self.class_counts[label] / self.total_docs)
            likelihood = 0
            total_words = sum(self.word_counts[label].values())
            vocab_size = len(self.vocab)
            for word in document:
                word_count = self.word_counts[label].get(word, 0)
                word_prob = (word_count + 1) / (total_words + vocab_size)
                likelihood += math.log(word_prob)
            scores[label] = prior + likelihood
        # Convert log probabilities to probabilities
        max_score = max(scores.values())
        exp_scores = {label: math.exp(score - max_score)
                      for label, score in scores.items()}
        total = sum(exp_scores.values())
        return {label: prob / total for label, prob in exp_scores.items()}
```
```python
# Example usage
spam_docs = [
    ["free", "money", "now", "click"],
    ["win", "prize", "free", "offer"],
    ["congratulations", "winner", "claim"]
]
ham_docs = [
    ["meeting", "tomorrow", "at", "noon"],
    ["project", "update", "attached"],
    ["lunch", "plans", "this", "week"]
]

classifier = NaiveBayesClassifier()
classifier.train(
    spam_docs + ham_docs,
    ["spam"] * len(spam_docs) + ["ham"] * len(ham_docs)
)

test_doc = ["free", "meeting", "prize"]
prediction = classifier.predict(test_doc)
probabilities = classifier.predict_proba(test_doc)
print(f"Prediction: {prediction}")
print(f"Probabilities: {probabilities}")
# Prediction: spam
# Probabilities: roughly 75% spam, 25% ham
```
Real-World Use Case: A/B Test Analysis
Bayesian A/B testing lets you incorporate prior beliefs and make probabilistic statements like “there’s a 95% chance variant B is better than A.” This is more intuitive than p-values.
```python
import numpy as np
from scipy import stats

def bayesian_ab_test(conversions_a, trials_a, conversions_b, trials_b,
                     prior_alpha=1, prior_beta=1, n_samples=100000):
    """
    Bayesian A/B test using Beta distributions

    Args:
        conversions_a: Number of conversions for variant A
        trials_a: Number of trials for variant A
        conversions_b: Number of conversions for variant B
        trials_b: Number of trials for variant B
        prior_alpha: Prior alpha parameter (Beta distribution)
        prior_beta: Prior beta parameter (Beta distribution)
        n_samples: Number of Monte Carlo samples

    Returns:
        Dictionary with test results
    """
    # Posterior distributions (Beta is the conjugate prior for the Binomial)
    posterior_a = stats.beta(
        prior_alpha + conversions_a,
        prior_beta + trials_a - conversions_a
    )
    posterior_b = stats.beta(
        prior_alpha + conversions_b,
        prior_beta + trials_b - conversions_b
    )

    # Sample from posteriors
    samples_a = posterior_a.rvs(n_samples)
    samples_b = posterior_b.rvs(n_samples)

    # Calculate probability B > A
    prob_b_better = np.mean(samples_b > samples_a)

    # Expected loss if we choose the wrong variant
    loss_a = np.mean(np.maximum(samples_b - samples_a, 0))
    loss_b = np.mean(np.maximum(samples_a - samples_b, 0))

    return {
        "prob_b_better_than_a": prob_b_better,
        "expected_loss_choosing_a": loss_a,
        "expected_loss_choosing_b": loss_b,
        "mean_conversion_a": posterior_a.mean(),
        "mean_conversion_b": posterior_b.mean()
    }

# Example: Testing two landing pages
results = bayesian_ab_test(
    conversions_a=120, trials_a=1000,  # 12% conversion
    conversions_b=145, trials_b=1000   # 14.5% conversion
)

print(f"P(B > A): {results['prob_b_better_than_a']:.2%}")
print(f"Expected loss choosing A: {results['expected_loss_choosing_a']:.4f}")
print(f"Expected loss choosing B: {results['expected_loss_choosing_b']:.4f}")
```
If P(B > A) is 97% and the expected loss of choosing B is negligible, you can confidently deploy variant B.
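That decision logic can be made explicit. A sketch of a simple decision rule on top of the results dictionary; the loss threshold and the illustrative numbers are assumptions, not output from the function above:

```python
# Hypothetical decision rule: ship a variant once the expected loss of
# choosing it drops below a threshold. The 0.1-percentage-point
# threshold is an assumption; pick it from the business cost of
# shipping a slightly worse variant.
def decide(results, loss_threshold=0.001):
    """Return which variant to ship, or None if more data is needed."""
    if results["expected_loss_choosing_b"] < loss_threshold:
        return "B"
    if results["expected_loss_choosing_a"] < loss_threshold:
        return "A"
    return None  # keep collecting data

example = {
    "expected_loss_choosing_a": 0.0261,  # illustrative numbers
    "expected_loss_choosing_b": 0.0002,
}
print(decide(example))  # B
```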
Handling Edge Cases and Practical Considerations
Two critical issues arise in production:
Zero-probability problem: If a word never appears in training data for a class, its probability is zero, making the entire product zero. Laplace smoothing (adding 1 to all counts) solves this.
Numerical underflow: Multiplying many small probabilities causes underflow. Working in log-space converts multiplication to addition.
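A quick demonstration of the underflow problem, assuming a document of about a thousand words with a typical per-word probability of 0.001:

```python
import math

# Multiplying 1000 probabilities of 0.001 gives 10^-3000, far below the
# smallest representable float64 (~5e-324), so the product underflows
# to exactly 0.0. The equivalent log-space sum stays representable.
probs = [0.001] * 1000
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_sum)   # about -6907.76
```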
Here’s the improved classifier:
```python
from collections import defaultdict
import math

class ProductionNaiveBayes:
    def __init__(self, alpha=1.0):
        """alpha is the Laplace smoothing parameter"""
        self.alpha = alpha
        self.class_log_priors = {}
        self.feature_log_probs = defaultdict(dict)
        self.classes = []

    def train(self, X, y):
        """X: list of feature dicts, y: list of labels"""
        self.classes = list(set(y))
        class_counts = defaultdict(int)
        feature_counts = defaultdict(lambda: defaultdict(int))
        for features, label in zip(X, y):
            class_counts[label] += 1
            for feature, count in features.items():
                feature_counts[label][feature] += count

        # Calculate log priors
        total = len(y)
        for label in self.classes:
            self.class_log_priors[label] = math.log(
                class_counts[label] / total
            )

        # Calculate log likelihoods with smoothing
        all_features = set()
        for features in X:
            all_features.update(features.keys())
        for label in self.classes:
            total_count = sum(feature_counts[label].values())
            vocab_size = len(all_features)
            for feature in all_features:
                count = feature_counts[label].get(feature, 0)
                # Laplace smoothing
                prob = (count + self.alpha) / (
                    total_count + self.alpha * vocab_size
                )
                self.feature_log_probs[label][feature] = math.log(prob)

    def predict_log_proba(self, features):
        """Return log probabilities for numerical stability"""
        scores = {}
        for label in self.classes:
            score = self.class_log_priors[label]
            for feature, count in features.items():
                if feature in self.feature_log_probs[label]:
                    score += count * self.feature_log_probs[label][feature]
            scores[label] = score
        return scores
```
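One way to build the `{feature: count}` dicts that `train` expects from raw token lists; the `to_features` helper here is a hypothetical convenience, not part of the class above:

```python
from collections import Counter

# Convert token lists into the word-count feature dicts used as
# training input for a multinomial Naive Bayes model.
def to_features(tokens):
    """Convert a token list into a {word: count} feature dict."""
    return dict(Counter(tokens))

docs = [["free", "money", "free"], ["meeting", "tomorrow"]]
X = [to_features(d) for d in docs]
print(X)  # [{'free': 2, 'money': 1}, {'meeting': 1, 'tomorrow': 1}]
```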
When NOT to use Bayes’ Theorem: If features have strong dependencies (like pixels in images), the naive independence assumption breaks down. Use neural networks or other models that capture feature interactions.
Conclusion and Further Resources
Bayes’ Theorem gives you a principled way to update beliefs with evidence. The key insights:
- Start with priors: Your initial belief matters. In A/B testing, this prevents premature decisions from early random fluctuations.
- Independence assumption is powerful: Naive Bayes works despite violated assumptions because classification only needs correct ranking, not calibrated probabilities.
- Log-space and smoothing are non-negotiable: Production systems must handle numerical stability and unseen features.
For production use, leverage existing libraries:
- scikit-learn: MultinomialNB, GaussianNB for standard cases
- PyMC: Full Bayesian inference with MCMC sampling
- Stan: When you need custom probabilistic models
The implementations here teach fundamentals, but battle-tested libraries handle edge cases and optimizations you’ll encounter at scale. Start with these basics, then graduate to specialized tools as your needs grow.