How to Calculate Likelihood
Key Insights
- Likelihood measures how well parameter values explain observed data, while probability predicts future observations—understanding this distinction is crucial for statistical inference and machine learning
- Always use log-likelihood in practice to avoid numerical underflow and simplify optimization, especially when dealing with products of many small probabilities
- Vectorized implementations of likelihood calculations can be 100x faster than loops, making the difference between practical and impractical analysis for real datasets
Introduction to Likelihood
Likelihood is one of the most misunderstood concepts in statistics, yet it’s fundamental to everything from A/B testing to training neural networks. The confusion often starts with the relationship between likelihood and probability—they’re related but distinctly different concepts.
Probability answers the question: “Given these parameter values, what’s the chance of observing this data?” Likelihood flips this around: “Given this observed data, how plausible are these parameter values?” This distinction matters because in real-world applications, you typically have data in hand and need to infer parameters, not the other way around.
Consider a simple scenario: you flip a coin 10 times and get 7 heads. Probability would tell you the chance of getting 7 heads if the coin is fair (p=0.5). Likelihood tells you how well different values of p (0.3, 0.5, 0.7, etc.) explain your observation of 7 heads.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Observed data: 7 heads in 10 flips
n_flips = 10
n_heads = 7
# Probability: Given p=0.5, what's P(7 heads)?
prob_7_heads = binom.pmf(n_heads, n_flips, 0.5)
print(f"Probability of 7 heads with fair coin: {prob_7_heads:.4f}")
# Likelihood: Given 7 heads, how likely are different p values?
p_values = np.linspace(0, 1, 100)
likelihoods = binom.pmf(n_heads, n_flips, p_values)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.bar(range(n_flips + 1), binom.pmf(range(n_flips + 1), n_flips, 0.5))
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title('Probability Distribution (p=0.5)')
plt.axvline(n_heads, color='red', linestyle='--', label='Observed')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(p_values, likelihoods)
plt.xlabel('Parameter p')
plt.ylabel('Likelihood')
plt.title('Likelihood Function (7 heads observed)')
plt.axvline(0.7, color='red', linestyle='--', label='MLE')
plt.legend()
plt.tight_layout()
plt.show()
The Likelihood Function
Mathematically, the likelihood function is defined as L(θ|x) = P(x|θ), where θ represents parameters and x represents observed data. Notice that we write it as a function of θ, not x—the data is fixed, and we’re evaluating different parameter values.
For independent observations, the likelihood is the product of individual probabilities. If you observe data points x₁, x₂, …, xₙ, the likelihood is:
L(θ|x₁, x₂, …, xₙ) = P(x₁|θ) × P(x₂|θ) × … × P(xₙ|θ)
Let’s implement a likelihood function for a normal distribution:
from scipy.stats import norm
def likelihood_normal(data, mu, sigma):
    """
    Calculate likelihood of data given normal distribution parameters.

    Args:
        data: observed data points
        mu: mean parameter
        sigma: standard deviation parameter

    Returns:
        likelihood value
    """
    # Product of probability densities for each data point
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))
# Example usage
data = np.array([1.2, 1.5, 1.8, 2.1, 2.3])
# Test different parameter values
print(f"L(μ=1.5, σ=0.5): {likelihood_normal(data, 1.5, 0.5):.6e}")
print(f"L(μ=1.8, σ=0.5): {likelihood_normal(data, 1.8, 0.5):.6e}")
print(f"L(μ=2.0, σ=0.5): {likelihood_normal(data, 2.0, 0.5):.6e}")
Calculating Likelihood for Common Distributions
Different distributions require different likelihood calculations. Here’s how to handle the most common cases:
from scipy.stats import bernoulli, binom, norm, poisson
# Sample datasets
coin_flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]) # Bernoulli
continuous_data = np.array([2.3, 2.7, 2.1, 2.9, 2.5]) # Normal
count_data = np.array([3, 5, 4, 6, 4, 5, 3, 4]) # Poisson
def likelihood_bernoulli(data, p):
    """Likelihood for Bernoulli trials (coin flips, binary outcomes)"""
    return np.prod(bernoulli.pmf(data, p))

def likelihood_binomial(k, n, p):
    """Likelihood for binomial (number of successes in n trials)"""
    return binom.pmf(k, n, p)

def likelihood_normal(data, mu, sigma):
    """Likelihood for normal/Gaussian distribution"""
    return np.prod(norm.pdf(data, mu, sigma))

def likelihood_poisson(data, lambda_param):
    """Likelihood for Poisson (count data)"""
    return np.prod(poisson.pmf(data, lambda_param))
# Compare likelihoods
print("Bernoulli (p=0.7):", likelihood_bernoulli(coin_flips, 0.7))
print("Normal (μ=2.5, σ=0.3):", likelihood_normal(continuous_data, 2.5, 0.3))
print("Poisson (λ=4.0):", likelihood_poisson(count_data, 4.0))
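A useful sanity check on these functions is that the likelihood peaks at the parameter value that best matches the data. For a Poisson rate, the maximizer is the sample mean; as a quick sketch using the `count_data` array above (whose mean is 4.25), that value should explain the counts at least as well as λ=4.0:

```python
import numpy as np
from scipy.stats import poisson

count_data = np.array([3, 5, 4, 6, 4, 5, 3, 4])

def likelihood_poisson(data, lambda_param):
    """Likelihood for Poisson (count data)"""
    return np.prod(poisson.pmf(data, lambda_param))

# The sample mean is the analytical maximizer of the Poisson likelihood
lam_mle = count_data.mean()  # 4.25

print(f"L(lambda=4.00): {likelihood_poisson(count_data, 4.0):.3e}")
print(f"L(lambda={lam_mle:.2f}): {likelihood_poisson(count_data, lam_mle):.3e}")
```

The second value is at least as large as the first, previewing the maximum likelihood idea developed below.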
Log-Likelihood
In practice, you should almost never use raw likelihood values. Here’s why:
# Demonstrate numerical underflow
large_dataset = np.random.normal(0, 1, 1000)
# Raw likelihood - will underflow to 0
raw_likelihood = likelihood_normal(large_dataset, 0, 1)
print(f"Raw likelihood: {raw_likelihood}") # Will print 0.0
# Log-likelihood - numerically stable
def log_likelihood_normal(data, mu, sigma):
    """Calculate log-likelihood for normal distribution"""
    return np.sum(norm.logpdf(data, mu, sigma))
log_lik = log_likelihood_normal(large_dataset, 0, 1)
print(f"Log-likelihood: {log_lik:.2f}") # Reasonable value
# Log-likelihood has nice properties
def log_likelihood_generic(data, dist, *params):
    """Generic log-likelihood calculation"""
    return np.sum(dist.logpdf(data, *params))
# Easy to work with sums instead of products
ll1 = log_likelihood_normal(large_dataset[:500], 0, 1)
ll2 = log_likelihood_normal(large_dataset[500:], 0, 1)
ll_total = ll1 + ll2 # Sum of log-likelihoods
print(f"Log-likelihood (first half): {ll1:.2f}")
print(f"Log-likelihood (second half): {ll2:.2f}")
print(f"Total log-likelihood: {ll_total:.2f}")
Log-likelihood converts products into sums: log(L) = log(P(x₁|θ)) + log(P(x₂|θ)) + … This prevents underflow and makes optimization easier since derivatives of sums are simpler than derivatives of products.
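The product-to-sum identity is easy to verify numerically on a dataset small enough that the raw product doesn't underflow (here, the five points used earlier):

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.2, 1.5, 1.8, 2.1, 2.3])
mu, sigma = 2.0, 0.5

# Take the log of the product of densities...
log_of_product = np.log(np.prod(norm.pdf(data, mu, sigma)))
# ...and compare with the sum of log-densities
sum_of_logs = np.sum(norm.logpdf(data, mu, sigma))

print(f"log(product): {log_of_product:.10f}")
print(f"sum of logs:  {sum_of_logs:.10f}")  # agree to floating-point precision
```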
Maximum Likelihood Estimation
The most common use of likelihood is finding parameter values that maximize it—Maximum Likelihood Estimation (MLE). This is the foundation of most statistical inference.
from scipy.optimize import minimize
# Generate sample data
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, 100)
# Define negative log-likelihood (we minimize instead of maximize)
def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:  # Constraint: sigma must be positive
        return np.inf
    return -np.sum(norm.logpdf(data, mu, sigma))
# Method 1: Manual gradient descent
def gradient_descent_mle(data, init_params, learning_rate=0.01, n_iter=1000):
    params = np.array(init_params, dtype=float)
    for i in range(n_iter):
        mu, sigma = params
        # Analytical gradients of the negative log-likelihood for a normal
        residuals = data - mu
        grad_mu = -np.sum(residuals) / (sigma ** 2)
        grad_sigma = -np.sum(residuals ** 2 / (sigma ** 3) - 1 / sigma)
        params -= learning_rate * np.array([grad_mu, grad_sigma])
        # Ensure sigma stays positive
        params[1] = max(params[1], 0.01)
    return params
manual_mle = gradient_descent_mle(data, [0.0, 1.0])
print(f"Manual MLE: μ={manual_mle[0]:.3f}, σ={manual_mle[1]:.3f}")
# Method 2: Using scipy.optimize
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,),
method='Nelder-Mead')
scipy_mle = result.x
print(f"Scipy MLE: μ={scipy_mle[0]:.3f}, σ={scipy_mle[1]:.3f}")
# Compare with analytical solution for normal distribution
analytical_mu = np.mean(data)
analytical_sigma = np.std(data, ddof=0)
print(f"Analytical MLE: μ={analytical_mu:.3f}, σ={analytical_sigma:.3f}")
print(f"True values: μ={true_mu:.3f}, σ={true_sigma:.3f}")
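The same machinery closes the loop on the coin example from the introduction, where the likelihood plot showed a peak at p=0.7. As a sketch, a bounded one-dimensional optimization of the binomial negative log-likelihood recovers the analytical MLE p̂ = k/n:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n_flips, n_heads = 10, 7

def neg_log_lik(p):
    return -binom.logpmf(n_heads, n_flips, p)

# Bounded optimization keeps p strictly inside (0, 1)
result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(f"Numerical MLE:  p = {result.x:.4f}")
print(f"Analytical MLE: p = {n_heads / n_flips:.4f}")
```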
Practical Implementation Tips
Performance matters when calculating likelihoods repeatedly during optimization. Here’s how to write efficient code:
import time
# Generate larger dataset
large_data = np.random.normal(0, 1, 10000)
# Naive implementation with loop
def likelihood_loop(data, mu, sigma):
    likelihood = 1.0
    for x in data:
        likelihood *= norm.pdf(x, mu, sigma)
    return likelihood

# Vectorized implementation
def likelihood_vectorized(data, mu, sigma):
    return np.prod(norm.pdf(data, mu, sigma))

# Log-likelihood (best practice)
def log_likelihood_vectorized(data, mu, sigma):
    return np.sum(norm.logpdf(data, mu, sigma))
# Performance comparison
start = time.time()
for _ in range(100):
likelihood_loop(large_data[:100], 0, 1)
time_loop = time.time() - start
start = time.time()
for _ in range(100):
likelihood_vectorized(large_data[:100], 0, 1)
time_vec = time.time() - start
start = time.time()
for _ in range(100):
log_likelihood_vectorized(large_data, 0, 1)
time_log = time.time() - start
print(f"Loop implementation: {time_loop:.4f}s")
print(f"Vectorized implementation: {time_vec:.4f}s")
print(f"Log-likelihood implementation: {time_log:.4f}s")
print(f"Speedup (vectorized vs loop): {time_loop/time_vec:.1f}x")
# Handling edge cases
def safe_log_likelihood(data, mu, sigma, epsilon=1e-10):
    """
    Safe log-likelihood calculation with numerical safeguards.
    """
    # Ensure sigma is positive
    sigma = max(sigma, epsilon)
    # Calculate log-likelihood
    ll = np.sum(norm.logpdf(data, mu, sigma))
    # Check for invalid values
    if np.isnan(ll) or np.isinf(ll):
        return -np.inf  # Return very low log-likelihood
    return ll
# Test with edge case
print(f"Safe log-likelihood (σ=0): {safe_log_likelihood(data, 0, 0)}")
print(f"Safe log-likelihood (σ=2): {safe_log_likelihood(data, 0, 2):.2f}")
The vectorized log-likelihood implementation is typically 100x faster than naive loops and handles numerical issues gracefully. Always add safeguards for edge cases like zero or negative variance parameters, especially when using optimization algorithms that might explore invalid parameter regions.
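An alternative to clamping invalid values (a sketch, not part of the code above) is to reparameterize so the optimizer can't reach invalid regions in the first place: optimize over s = log(σ) and recover σ = exp(s), which is positive by construction:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, 100)

def neg_log_likelihood_reparam(params, data):
    # Optimize over log(sigma): exp() maps any real number to a
    # valid positive sigma, so no constraint check is needed
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, mu, sigma))

result = minimize(neg_log_likelihood_reparam, x0=[0.0, 0.0],
                  args=(data,), method='Nelder-Mead')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"Reparameterized MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

This log-transform trick generalizes to any positivity-constrained parameter (variances, rates, scales) and usually makes unconstrained optimizers more robust.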
Understanding likelihood calculations is essential for statistical inference, machine learning model fitting, and Bayesian analysis. Master these fundamentals, use log-likelihood in practice, and vectorize your implementations—your future self will thank you when debugging convergence issues at 2 AM.