How to Calculate Likelihood

Key Insights

  • Likelihood measures how well parameter values explain observed data, while probability gives the chance of observing data under fixed parameter values—understanding this distinction is crucial for statistical inference and machine learning
  • Always use log-likelihood in practice to avoid numerical underflow and simplify optimization, especially when dealing with products of many small probabilities
  • Vectorized implementations of likelihood calculations can be 100x faster than loops, making the difference between practical and impractical analysis for real datasets

Introduction to Likelihood

Likelihood is one of the most misunderstood concepts in statistics, yet it’s fundamental to everything from A/B testing to training neural networks. The confusion often starts with the relationship between likelihood and probability—they’re related but distinctly different concepts.

Probability answers the question: “Given these parameter values, what’s the chance of observing this data?” Likelihood flips this around: “Given this observed data, how plausible are these parameter values?” This distinction matters because in real-world applications, you always have data and need to infer parameters, not the other way around.

Consider a simple scenario: you flip a coin 10 times and get 7 heads. Probability would tell you the chance of getting 7 heads if the coin is fair (p=0.5). Likelihood tells you how well different values of p (0.3, 0.5, 0.7, etc.) explain your observation of 7 heads.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Observed data: 7 heads in 10 flips
n_flips = 10
n_heads = 7

# Probability: Given p=0.5, what's P(7 heads)?
prob_7_heads = binom.pmf(n_heads, n_flips, 0.5)
print(f"Probability of 7 heads with fair coin: {prob_7_heads:.4f}")

# Likelihood: Given 7 heads, how likely are different p values?
p_values = np.linspace(0, 1, 100)
likelihoods = binom.pmf(n_heads, n_flips, p_values)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.bar(range(n_flips + 1), binom.pmf(range(n_flips + 1), n_flips, 0.5))
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title('Probability Distribution (p=0.5)')
plt.axvline(n_heads, color='red', linestyle='--', label='Observed')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(p_values, likelihoods)
plt.xlabel('Parameter p')
plt.ylabel('Likelihood')
plt.title('Likelihood Function (7 heads observed)')
plt.axvline(0.7, color='red', linestyle='--', label='MLE')
plt.legend()
plt.tight_layout()
plt.show()
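
The peak of the likelihood curve can also be located numerically. A quick check with the same 7-heads-in-10-flips setup confirms the maximum sits at the sample proportion, p = 0.7:

```python
import numpy as np
from scipy.stats import binom

n_flips, n_heads = 10, 7

# Evaluate the likelihood on a grid of candidate p values (step 0.01)
p_values = np.linspace(0, 1, 101)
likelihoods = binom.pmf(n_heads, n_flips, p_values)

# The grid point with the highest likelihood is the (approximate) MLE
p_hat = p_values[np.argmax(likelihoods)]
print(f"Numerical MLE: p = {p_hat:.2f}")  # 0.70, i.e. 7/10
```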

The Likelihood Function

Mathematically, the likelihood function is defined as L(θ|x) = P(x|θ), where θ represents parameters and x represents observed data. Notice that we write it as a function of θ, not x—the data is fixed, and we’re evaluating different parameter values.

For independent observations, the likelihood is the product of individual probabilities. If you observe data points x₁, x₂, …, xₙ, the likelihood is:

L(θ|x₁, x₂, …, xₙ) = P(x₁|θ) × P(x₂|θ) × … × P(xₙ|θ)

Let’s implement a likelihood function for a normal distribution:

from scipy.stats import norm

def likelihood_normal(data, mu, sigma):
    """
    Calculate likelihood of data given normal distribution parameters.
    
    Args:
        data: observed data points
        mu: mean parameter
        sigma: standard deviation parameter
    
    Returns:
        likelihood value
    """
    # Product of probability densities for each data point
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# Example usage
data = np.array([1.2, 1.5, 1.8, 2.1, 2.3])

# Test different parameter values
print(f"L(μ=1.5, σ=0.5): {likelihood_normal(data, 1.5, 0.5):.6e}")
print(f"L(μ=1.8, σ=0.5): {likelihood_normal(data, 1.8, 0.5):.6e}")
print(f"L(μ=2.0, σ=0.5): {likelihood_normal(data, 2.0, 0.5):.6e}")

Calculating Likelihood for Common Distributions

Different distributions require different likelihood calculations. Here’s how to handle the most common cases:

from scipy.stats import bernoulli, binom, norm, poisson

# Sample datasets
coin_flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # Bernoulli
continuous_data = np.array([2.3, 2.7, 2.1, 2.9, 2.5])  # Normal
count_data = np.array([3, 5, 4, 6, 4, 5, 3, 4])        # Poisson

def likelihood_bernoulli(data, p):
    """Likelihood for Bernoulli trials (coin flips, binary outcomes)"""
    return np.prod(bernoulli.pmf(data, p))

def likelihood_binomial(k, n, p):
    """Likelihood for binomial (number of successes in n trials)"""
    return binom.pmf(k, n, p)

def likelihood_normal(data, mu, sigma):
    """Likelihood for normal/Gaussian distribution"""
    return np.prod(norm.pdf(data, mu, sigma))

def likelihood_poisson(data, lambda_param):
    """Likelihood for Poisson (count data)"""
    return np.prod(poisson.pmf(data, lambda_param))

# Compare likelihoods
print("Bernoulli (p=0.7):", likelihood_bernoulli(coin_flips, 0.7))
print("Normal (μ=2.5, σ=0.3):", likelihood_normal(continuous_data, 2.5, 0.3))
print("Poisson (λ=4.0):", likelihood_poisson(count_data, 4.0))
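
For the Poisson case the maximizing λ has a closed form: the sample mean. A quick sanity check, reusing the count data above, shows the likelihood really is highest there:

```python
import numpy as np
from scipy.stats import poisson

count_data = np.array([3, 5, 4, 6, 4, 5, 3, 4])
lambda_hat = count_data.mean()  # MLE for Poisson is the sample mean (4.25 here)

# Compare the likelihood at the MLE against nearby candidate values
for lam in [3.0, lambda_hat, 5.0]:
    lik = np.prod(poisson.pmf(count_data, lam))
    print(f"lambda={lam:.2f}: likelihood={lik:.3e}")
```

The middle value wins: any λ other than the sample mean produces a lower likelihood for this data.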

Log-Likelihood

In practice, you should almost never use raw likelihood values. Here’s why:

# Demonstrate numerical underflow
large_dataset = np.random.normal(0, 1, 1000)

# Raw likelihood - will underflow to 0
raw_likelihood = likelihood_normal(large_dataset, 0, 1)
print(f"Raw likelihood: {raw_likelihood}")  # Will print 0.0

# Log-likelihood - numerically stable
def log_likelihood_normal(data, mu, sigma):
    """Calculate log-likelihood for normal distribution"""
    return np.sum(norm.logpdf(data, mu, sigma))

log_lik = log_likelihood_normal(large_dataset, 0, 1)
print(f"Log-likelihood: {log_lik:.2f}")  # Reasonable value

# Log-likelihood has nice properties
def log_likelihood_generic(data, dist, *params):
    """Generic log-likelihood for a continuous scipy distribution (for discrete distributions, use logpmf instead)."""
    return np.sum(dist.logpdf(data, *params))

# Easy to work with sums instead of products
ll1 = log_likelihood_normal(large_dataset[:500], 0, 1)
ll2 = log_likelihood_normal(large_dataset[500:], 0, 1)
ll_total = ll1 + ll2  # Sum of log-likelihoods

print(f"Log-likelihood (first half): {ll1:.2f}")
print(f"Log-likelihood (second half): {ll2:.2f}")
print(f"Total log-likelihood: {ll_total:.2f}")

Log-likelihood converts products into sums: log(L) = log(P(x₁|θ)) + log(P(x₂|θ)) + … This prevents underflow and makes optimization easier since derivatives of sums are simpler than derivatives of products.
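
For the normal distribution the simplification is concrete: the log-likelihood is a sum of quadratic terms, so differentiating with respect to μ and setting the result to zero recovers the sample mean directly.

```latex
\ell(\mu, \sigma \mid x_1, \dots, x_n)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

\frac{\partial \ell}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

The same calculation for σ yields the ddof=0 standard deviation, which is why the analytical estimates in the next section match the optimizer's output.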

Maximum Likelihood Estimation

The most common use of likelihood is finding parameter values that maximize it—Maximum Likelihood Estimation (MLE). This is the foundation of most statistical inference.

from scipy.optimize import minimize

# Generate sample data
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, 100)

# Define negative log-likelihood (we minimize instead of maximize)
def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:  # Constraint: sigma must be positive
        return np.inf
    return -np.sum(norm.logpdf(data, mu, sigma))

# Method 1: Manual gradient descent
def gradient_descent_mle(data, init_params, learning_rate=0.01, n_iter=1000):
    params = np.array(init_params, dtype=float)
    
    for i in range(n_iter):
        mu, sigma = params
        
        # Analytical gradients of the negative log-likelihood for a normal distribution
        residuals = data - mu
        grad_mu = -np.sum(residuals) / (sigma ** 2)
        grad_sigma = -np.sum(residuals ** 2 / (sigma ** 3) - 1 / sigma)
        
        params -= learning_rate * np.array([grad_mu, grad_sigma])
        
        # Ensure sigma stays positive
        params[1] = max(params[1], 0.01)
    
    return params

manual_mle = gradient_descent_mle(data, [0.0, 1.0])
print(f"Manual MLE: μ={manual_mle[0]:.3f}, σ={manual_mle[1]:.3f}")

# Method 2: Using scipy.optimize
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,), 
                  method='Nelder-Mead')
scipy_mle = result.x
print(f"Scipy MLE: μ={scipy_mle[0]:.3f}, σ={scipy_mle[1]:.3f}")

# Compare with the closed-form MLE for the normal distribution
analytical_mu = np.mean(data)
analytical_sigma = np.std(data, ddof=0)  # ddof=0 gives the (biased) MLE, not the sample std
print(f"Analytical MLE: μ={analytical_mu:.3f}, σ={analytical_sigma:.3f}")
print(f"True values: μ={true_mu:.3f}, σ={true_sigma:.3f}")
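
As a cross-check, SciPy's distribution objects can fit parameters by maximum likelihood directly via their `fit` method. A minimal sketch with freshly generated data (independent of the code above):

```python
import numpy as np
from scipy.stats import norm

np.random.seed(42)
data = np.random.normal(5.0, 2.0, 100)

# norm.fit performs MLE internally and returns (loc, scale) = (mu_hat, sigma_hat)
mu_hat, sigma_hat = norm.fit(data)
print(f"norm.fit MLE: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
```

For the normal distribution, `norm.fit` uses the closed-form solution, so the result matches `np.mean(data)` and `np.std(data, ddof=0)` exactly.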

Practical Implementation Tips

Performance matters when calculating likelihoods repeatedly during optimization. Here’s how to write efficient code:

import time

# Generate larger dataset
large_data = np.random.normal(0, 1, 10000)

# Naive implementation with loop
def likelihood_loop(data, mu, sigma):
    likelihood = 1.0
    for x in data:
        likelihood *= norm.pdf(x, mu, sigma)
    return likelihood

# Vectorized implementation
def likelihood_vectorized(data, mu, sigma):
    return np.prod(norm.pdf(data, mu, sigma))

# Log-likelihood (best practice)
def log_likelihood_vectorized(data, mu, sigma):
    return np.sum(norm.logpdf(data, mu, sigma))

# Performance comparison
start = time.time()
for _ in range(100):
    likelihood_loop(large_data[:100], 0, 1)
time_loop = time.time() - start

start = time.time()
for _ in range(100):
    likelihood_vectorized(large_data[:100], 0, 1)
time_vec = time.time() - start

start = time.time()
for _ in range(100):
    log_likelihood_vectorized(large_data, 0, 1)
time_log = time.time() - start

print(f"Loop implementation: {time_loop:.4f}s")
print(f"Vectorized implementation: {time_vec:.4f}s")
print(f"Log-likelihood implementation: {time_log:.4f}s")
print(f"Speedup (vectorized vs loop): {time_loop/time_vec:.1f}x")

# Handling edge cases
def safe_log_likelihood(data, mu, sigma, epsilon=1e-10):
    """
    Safe log-likelihood calculation with numerical safeguards.
    """
    # Ensure sigma is positive
    sigma = max(sigma, epsilon)
    
    # Calculate log-likelihood
    ll = np.sum(norm.logpdf(data, mu, sigma))
    
    # Check for invalid values
    if np.isnan(ll) or np.isinf(ll):
        return -np.inf  # Return very low likelihood
    
    return ll

# Test with edge cases (σ=0 is clamped to epsilon, so the result is a huge but finite negative number)
print(f"Safe log-likelihood (σ=0): {safe_log_likelihood(data, 0, 0):.3e}")
print(f"Safe log-likelihood (σ=2): {safe_log_likelihood(data, 0, 2):.2f}")

The vectorized log-likelihood implementation is often orders of magnitude faster than a naive loop and handles numerical issues gracefully. Always add safeguards for edge cases like zero or negative variance parameters, especially when using optimization algorithms that might explore invalid parameter regions.
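
A common alternative to clamping is to let the optimizer work on log σ instead of σ, which keeps the standard deviation positive by construction. A sketch of this reparameterization trick (the function name `neg_ll_logsigma` is illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

np.random.seed(0)
data = np.random.normal(5.0, 2.0, 200)

def neg_ll_logsigma(params, data):
    """Negative log-likelihood parameterized by (mu, log_sigma)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # always positive, no clamping needed
    return -np.sum(norm.logpdf(data, mu, sigma))

result = minimize(neg_ll_logsigma, x0=[0.0, 0.0], args=(data,), method='BFGS')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE via log-sigma parameterization: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
```

Because the constraint is built into the parameterization, the optimizer can explore the whole real line for `log_sigma` without ever producing an invalid σ.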

Understanding likelihood calculations is essential for statistical inference, machine learning model fitting, and Bayesian analysis. Master these fundamentals, use log-likelihood in practice, and vectorize your implementations—your future self will thank you when debugging convergence issues at 2 AM.
