How to Use scipy.stats for Probability Distributions in Python

Key Insights

  • scipy.stats provides a unified interface for 100+ probability distributions, with consistent methods like pdf() (or pmf() for discrete distributions), cdf(), ppf(), and rvs() that work the same way across distribution families.
  • The fit() method enables maximum likelihood estimation to match distributions to your data, but always validate your fit with goodness-of-fit tests before using it in production.
  • Freezing distributions with specific parameters creates reusable objects that simplify code and reduce errors when you need the same distribution repeatedly.

Introduction to scipy.stats

The scipy.stats module is Python’s most comprehensive library for probability distributions and statistical functions. Whether you’re running Monte Carlo simulations, fitting models to data, or calculating probabilities for hypothesis tests, this module provides the tools you need.

scipy.stats organizes distributions into two categories: continuous and discrete. Continuous distributions (like normal or exponential) can take any value within a range and use probability density functions. Discrete distributions (like binomial or Poisson) only take integer values and use probability mass functions.

import scipy.stats as stats
import numpy as np

# List all continuous distributions
continuous_dists = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_continuous)]
print(f"Continuous distributions available: {len(continuous_dists)}")

# List all discrete distributions
discrete_dists = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_discrete)]
print(f"Discrete distributions available: {len(discrete_dists)}")

# Some commonly used distributions
common = ['norm', 'expon', 'uniform', 'beta', 'gamma', 'binom', 'poisson', 'geom']
for name in common:
    dist = getattr(stats, name)
    print(f"{name}: {dist.name}")

The real power of scipy.stats lies in its consistent API. Once you learn how to use one distribution, you can use them all. Every distribution object provides the same core methods, making it trivial to swap distributions in your code.
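
To see that consistency in action, here is a small sketch that runs the same method calls against several different continuous distributions (the parameter choices are arbitrary, picked just for illustration):

```python
import scipy.stats as stats

# Frozen distributions with very different shapes, same interface
distributions = {
    "norm": stats.norm(loc=0, scale=1),
    "expon": stats.expon(scale=2),
    "gamma": stats.gamma(a=3, scale=1),
}

for name, dist in distributions.items():
    # Identical calls work for every continuous distribution
    median = dist.ppf(0.5)
    print(f"{name}: mean={dist.mean():.3f}, median={median:.3f}, "
          f"P(X <= median)={dist.cdf(median):.3f}")
```

Swapping one distribution for another means changing a single line, not rewriting your analysis.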

Working with Continuous Distributions

Continuous distributions form the backbone of statistical modeling. The four methods you’ll use constantly are:

  • pdf(x): Probability density function—the height of the distribution curve at point x
  • cdf(x): Cumulative distribution function—probability that a value is less than or equal to x
  • ppf(q): Percent point function (inverse CDF)—the value x where P(X ≤ x) = q
  • rvs(size): Random variates—generate random samples from the distribution

import scipy.stats as stats
import numpy as np

# Create a standard normal distribution (mean=0, std=1)
normal = stats.norm(loc=0, scale=1)

# PDF: What's the density at x=0?
print(f"PDF at x=0: {normal.pdf(0):.4f}")  # ~0.3989

# CDF: What's the probability that X <= 1.96?
print(f"P(X <= 1.96): {normal.cdf(1.96):.4f}")  # ~0.975

# PPF: What value has 95% of the distribution below it?
print(f"95th percentile: {normal.ppf(0.95):.4f}")  # ~1.645

# Working with non-standard normal distribution
# Mean = 100, Standard deviation = 15 (like IQ scores)
iq_dist = stats.norm(loc=100, scale=15)

# What percentage of people have IQ > 130?
prob_above_130 = 1 - iq_dist.cdf(130)
print(f"P(IQ > 130): {prob_above_130:.4f}")  # ~0.0228 or 2.28%

# What IQ separates the top 1%?
top_1_percent = iq_dist.ppf(0.99)
print(f"Top 1% IQ threshold: {top_1_percent:.1f}")  # ~134.9

Other essential continuous distributions include exponential (for modeling wait times), uniform (for equal probability across a range), and beta (for modeling proportions):

# Exponential: average wait time of 5 minutes
wait_time = stats.expon(scale=5)
print(f"P(wait > 10 min): {1 - wait_time.cdf(10):.4f}")

# Uniform: random number between 0 and 10
uniform = stats.uniform(loc=0, scale=10)
print(f"P(3 < X < 7): {uniform.cdf(7) - uniform.cdf(3):.4f}")

# Beta: modeling conversion rates (alpha=5, beta=20)
conversion = stats.beta(a=5, b=20)
print(f"Expected conversion rate: {conversion.mean():.4f}")

Working with Discrete Distributions

Discrete distributions use pmf() (probability mass function) instead of pdf(), since they assign probability to specific integer values rather than continuous ranges.

import scipy.stats as stats

# Binomial: 10 coin flips with fair coin
coin_flips = stats.binom(n=10, p=0.5)

# Probability of exactly 7 heads
print(f"P(X = 7): {coin_flips.pmf(7):.4f}")

# Probability of 7 or more heads
print(f"P(X >= 7): {1 - coin_flips.cdf(6):.4f}")

# Poisson: average of 4 customer arrivals per hour
arrivals = stats.poisson(mu=4)

# Probability of exactly 6 arrivals
print(f"P(X = 6): {arrivals.pmf(6):.4f}")

# Probability of more than 8 arrivals (rare event)
print(f"P(X > 8): {1 - arrivals.cdf(8):.4f}")

# Geometric: probability of first success
# p=0.1 (10% success rate per trial)
first_success = stats.geom(p=0.1)

# Expected number of trials until first success
print(f"Expected trials: {first_success.mean():.1f}")

# Probability first success happens by trial 5
print(f"P(X <= 5): {first_success.cdf(5):.4f}")

The binomial distribution is particularly useful for A/B testing scenarios. If your control converts at 5% and you want to know the probability of seeing 60 or more conversions out of 1000 visitors by chance:

control = stats.binom(n=1000, p=0.05)
prob_60_or_more = 1 - control.cdf(59)
print(f"P(conversions >= 60 | p=0.05): {prob_60_or_more:.4f}")
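
A detail worth knowing: every distribution also exposes a survival function, sf(x) = 1 - cdf(x), which reads more clearly and avoids precision loss in the far tail compared with computing 1 - cdf(x) yourself. A quick sketch using the same A/B numbers:

```python
import scipy.stats as stats

control = stats.binom(n=1000, p=0.05)

# For a discrete distribution, sf(59) is P(X > 59), i.e. P(X >= 60)
via_sf = control.sf(59)
via_cdf = 1 - control.cdf(59)
print(f"P(X >= 60) via sf:  {via_sf:.6f}")
print(f"P(X >= 60) via cdf: {via_cdf:.6f}")
```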

Generating Random Samples

The rvs() method generates random samples from any distribution. Always set a random seed for reproducibility in production code and analysis.

import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
rng = np.random.default_rng(42)

# Generate samples from different distributions
normal_samples = stats.norm(loc=0, scale=1).rvs(size=10000, random_state=rng)
exponential_samples = stats.expon(scale=2).rvs(size=10000, random_state=rng)
beta_samples = stats.beta(a=2, b=5).rvs(size=10000, random_state=rng)

# Visualize the distributions
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(normal_samples, bins=50, density=True, alpha=0.7)
axes[0].set_title('Normal(0, 1)')

axes[1].hist(exponential_samples, bins=50, density=True, alpha=0.7)
axes[1].set_title('Exponential(scale=2)')

axes[2].hist(beta_samples, bins=50, density=True, alpha=0.7)
axes[2].set_title('Beta(2, 5)')

plt.tight_layout()
plt.savefig('distributions.png', dpi=150)

You can also generate a two-dimensional array of independent samples by passing a tuple to the size parameter:

# Generate a 1000x5 matrix of standard normal samples
samples_matrix = stats.norm.rvs(size=(1000, 5), random_state=42)
print(f"Shape: {samples_matrix.shape}")
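
If you need genuinely correlated samples rather than independent columns, scipy.stats also provides multivariate distributions such as multivariate_normal. A minimal sketch (the mean vector and covariance matrix are made up for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Two variables with correlation 0.8
mean = [0.0, 0.0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]

mvn = stats.multivariate_normal(mean=mean, cov=cov)
samples = mvn.rvs(size=10000, random_state=rng)

print(f"Shape: {samples.shape}")  # (10000, 2)
print(f"Sample correlation: {np.corrcoef(samples.T)[0, 1]:.3f}")
```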

Fitting Distributions to Data

When you have observed data and want to find the best-fitting distribution parameters, use the fit() method. This performs maximum likelihood estimation.

import scipy.stats as stats
import numpy as np

# Generate synthetic data (pretend we don't know the true parameters)
rng = np.random.default_rng(42)
true_mean, true_std = 50, 10
observed_data = rng.normal(true_mean, true_std, size=500)

# Fit a normal distribution to the data
fitted_mean, fitted_std = stats.norm.fit(observed_data)
print(f"True parameters: mean={true_mean}, std={true_std}")
print(f"Fitted parameters: mean={fitted_mean:.2f}, std={fitted_std:.2f}")

# Perform goodness-of-fit test (Kolmogorov-Smirnov)
ks_statistic, p_value = stats.kstest(observed_data, 'norm', args=(fitted_mean, fitted_std))
print(f"KS test p-value: {p_value:.4f}")

# If p-value > 0.05, we fail to reject the hypothesis that the data follows
# this distribution. Caveat: because the parameters were estimated from the
# same data, this p-value is optimistic; for a stricter check, use a holdout
# sample or a correction such as the Lilliefors test.
if p_value > 0.05:
    print("Data is consistent with a normal distribution")
else:
    print("Data may not follow a normal distribution")

For fitting other distributions, be aware that scipy.stats uses loc and scale parameters that shift and stretch the distribution. You can fix these if you know them:

# Fit exponential distribution (fix loc=0 since exponential starts at 0)
exp_data = stats.expon(scale=5).rvs(size=1000, random_state=42)
fitted_loc, fitted_scale = stats.expon.fit(exp_data, floc=0)
print(f"Fitted exponential scale: {fitted_scale:.2f}")  # Should be ~5
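
When you don't know which family fits best, one common approach is to fit several candidates and rank them by the same goodness-of-fit statistic. A sketch of that idea, assuming the candidate list is something you would tailor to your own data (and remembering the caveat above: with fitted parameters, use the KS statistic for ranking, not its p-value):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
data = stats.gamma(a=3, scale=2).rvs(size=1000, random_state=rng)

# Candidate families to compare (illustrative, not exhaustive)
candidates = ["norm", "gamma", "lognorm", "expon"]

results = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                        # maximum likelihood fit
    ks_stat, _ = stats.kstest(data, name, args=params)
    results[name] = ks_stat

# Lower KS statistic = fitted CDF tracks the empirical CDF more closely
for name, ks in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} KS statistic: {ks:.4f}")
```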

Statistical Tests and Summary Statistics

Every distribution object provides methods to calculate theoretical statistics without generating samples:

import scipy.stats as stats
import numpy as np

# Create a gamma distribution
gamma_dist = stats.gamma(a=5, scale=2)

# Theoretical statistics
print(f"Mean: {gamma_dist.mean():.4f}")
print(f"Variance: {gamma_dist.var():.4f}")
print(f"Standard deviation: {gamma_dist.std():.4f}")
print(f"Skewness: {float(gamma_dist.stats(moments='s')):.4f}")

# Get multiple moments at once
mean, var, skew, kurt = gamma_dist.stats(moments='mvsk')
print(f"Kurtosis: {kurt:.4f}")

# Central interval containing 95% of the distribution's probability mass
lower, upper = gamma_dist.interval(0.95)
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")

# Expected value of a function of the random variable
# E[X^2] for this distribution
expected_x_squared = gamma_dist.expect(lambda x: x**2)
print(f"E[X^2]: {expected_x_squared:.4f}")

# Verify: E[X^2] should equal Var(X) + E[X]^2
calculated = gamma_dist.var() + gamma_dist.mean()**2
print(f"Var + Mean^2: {calculated:.4f}")
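
As a sanity check, the interval returned by interval() can be verified directly with cdf() and sf(): the probability mass between its endpoints should equal the requested level, split equally between the two tails.

```python
import scipy.stats as stats

gamma_dist = stats.gamma(a=5, scale=2)
lower, upper = gamma_dist.interval(0.95)

mass_inside = gamma_dist.cdf(upper) - gamma_dist.cdf(lower)
print(f"Mass inside [{lower:.2f}, {upper:.2f}]: {mass_inside:.4f}")  # ~0.95
print(f"Lower tail: {gamma_dist.cdf(lower):.4f}")  # ~0.025
print(f"Upper tail: {gamma_dist.sf(upper):.4f}")   # ~0.025
```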

Practical Example: Monte Carlo Simulation

Let’s combine everything into a realistic scenario: estimating the probability that a project exceeds its budget. We’ll model task durations with different distributions and simulate total project cost.

import scipy.stats as stats
import numpy as np

def run_project_simulation(n_simulations=10000, seed=42):
    """
    Simulate project costs where:
    - Development time: Gamma distribution (right-skewed, realistic for dev work)
    - Testing time: Normal distribution
    - Unexpected issues: Poisson-distributed count, each with exponential cost
    """
    rng = np.random.default_rng(seed)
    
    # Define distributions
    dev_time = stats.gamma(a=4, scale=5)      # Mean: 20 days, right-skewed
    test_time = stats.norm(loc=10, scale=2)    # Mean: 10 days
    issue_count = stats.poisson(mu=3)          # Average 3 unexpected issues
    issue_cost = stats.expon(scale=1000)       # Average $1000 per issue
    
    hourly_rate = 150  # $150/hour
    hours_per_day = 8
    
    total_costs = []
    
    for _ in range(n_simulations):
        # Simulate this project run
        dev_days = dev_time.rvs(random_state=rng)
        test_days = max(0, test_time.rvs(random_state=rng))  # Can't be negative
        
        labor_cost = (dev_days + test_days) * hours_per_day * hourly_rate
        
        # Simulate unexpected issues
        n_issues = issue_count.rvs(random_state=rng)
        issues_cost = sum(issue_cost.rvs(random_state=rng) for _ in range(n_issues))
        
        total_costs.append(labor_cost + issues_cost)
    
    return np.array(total_costs)

# Run simulation
costs = run_project_simulation()

# Analyze results
budget = 50000

print("=== Project Cost Simulation Results ===")
print(f"Mean cost: ${np.mean(costs):,.0f}")
print(f"Median cost: ${np.median(costs):,.0f}")
print(f"Std deviation: ${np.std(costs):,.0f}")
print(f"5th percentile: ${np.percentile(costs, 5):,.0f}")
print(f"95th percentile: ${np.percentile(costs, 95):,.0f}")

prob_over_budget = np.mean(costs > budget)
print(f"\nProbability of exceeding ${budget:,} budget: {prob_over_budget:.1%}")

# What budget gives us 90% confidence?
safe_budget = np.percentile(costs, 90)
print(f"Budget for 90% confidence: ${safe_budget:,.0f}")

This simulation demonstrates the real power of scipy.stats: combining multiple distributions to model complex, real-world uncertainty. The gamma distribution captures the right-skewed nature of development work (tasks often take longer than estimated, rarely shorter). The Poisson distribution models the count of discrete events (unexpected issues), while exponential models the cost of each.

By running thousands of simulations, we get a distribution of possible outcomes that’s far more useful than a single point estimate. You can answer questions like “what’s the probability we exceed budget?” or “what budget do we need for 90% confidence?” directly from the simulation results.

scipy.stats makes this kind of analysis accessible. Master its consistent API, and you’ll have a powerful tool for any probabilistic modeling task.