How to Use scipy.stats.norm in Python

Key Insights

  • scipy.stats.norm provides a complete toolkit for working with normal distributions, from probability calculations to random sampling and parameter estimation
  • Frozen distributions (created with norm(loc, scale)) improve code readability and performance when you’re repeatedly using the same parameters
  • The ppf() method (percent point function) is the inverse of cdf() and is essential for finding critical values in hypothesis testing and confidence interval calculations

Introduction to scipy.stats.norm

The normal distribution is the workhorse of statistics. Whether you’re analyzing measurement errors, modeling natural phenomena, or running hypothesis tests, you’ll encounter Gaussian distributions constantly. Python’s scipy.stats.norm module provides everything you need to work with normal distributions without implementing the math yourself.

This module handles the heavy lifting: calculating probability densities, cumulative probabilities, quantiles, and generating random samples. It’s the foundation for statistical analysis in Python, and understanding it well will make your data science work significantly more efficient.

Common use cases include quality control analysis, financial modeling, A/B test evaluation, and any scenario where you need to calculate probabilities or generate synthetic data following a bell curve.

Getting Started: Installation and Basic Setup

If you have SciPy installed, you’re ready to go. If not, install it with pip:

pip install scipy numpy matplotlib

Now let’s import what we need and explore the two approaches to using norm:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Approach 1: Pass parameters each time
pdf_value = norm.pdf(0, loc=0, scale=1)
print(f"PDF at x=0: {pdf_value}")

# Approach 2: Create a frozen distribution
standard_normal = norm(loc=0, scale=1)
pdf_value_frozen = standard_normal.pdf(0)
print(f"PDF at x=0 (frozen): {pdf_value_frozen}")

# Custom distribution: mean=100, std=15 (like IQ scores)
iq_distribution = norm(loc=100, scale=15)
print(f"Mean: {iq_distribution.mean()}, Std: {iq_distribution.std()}")

The loc parameter represents the mean (μ), and scale represents the standard deviation (σ). I strongly recommend using frozen distributions when you’re working with the same parameters repeatedly: it’s cleaner, less error-prone, and can be marginally faster, since the parameters are bound once instead of being passed on every call.
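As a quick sanity check (a minimal sketch; the evaluation point 130 is arbitrary), the frozen and per-call styles hit the same underlying computation, so their results are identical:

```python
from scipy.stats import norm

# Frozen distribution vs. passing parameters on every call
frozen = norm(loc=100, scale=15)

# Both styles produce the same values
assert frozen.pdf(130) == norm.pdf(130, loc=100, scale=15)
assert frozen.cdf(130) == norm.cdf(130, loc=100, scale=15)
print("Frozen and per-call results are identical")
```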

Probability Density and Cumulative Distribution Functions

The probability density function (PDF) tells you the relative likelihood of observing a specific value. The cumulative distribution function (CDF) gives you the probability that a random variable is less than or equal to a given value. These are your bread and butter for probability calculations.

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

# Create a standard normal distribution
dist = norm(loc=0, scale=1)

# PDF: probability density at specific points
x_values = np.array([-2, -1, 0, 1, 2])
pdf_values = dist.pdf(x_values)
print("PDF values:")
for x, pdf in zip(x_values, pdf_values):
    print(f"  f({x}) = {pdf:.4f}")

# CDF: P(X ≤ x)
print("\nCDF values (cumulative probabilities):")
for x in x_values:
    prob = dist.cdf(x)
    print(f"  P(X ≤ {x}) = {prob:.4f}")

# Survival function: P(X > x) = 1 - CDF(x)
print(f"\nP(X > 1.96) = {dist.sf(1.96):.4f}")

# Probability between two values: P(a < X < b)
prob_between = dist.cdf(1) - dist.cdf(-1)
print(f"P(-1 < X < 1) = {prob_between:.4f}")  # ~68.27%

Let’s visualize the classic bell curve:

# Plot the PDF
x = np.linspace(-4, 4, 1000)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# PDF plot
axes[0].plot(x, dist.pdf(x), 'b-', linewidth=2)
axes[0].fill_between(x, dist.pdf(x), alpha=0.3)
axes[0].set_title('Probability Density Function')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].grid(True, alpha=0.3)

# CDF plot
axes[1].plot(x, dist.cdf(x), 'r-', linewidth=2)
axes[1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title('Cumulative Distribution Function')
axes[1].set_xlabel('x')
axes[1].set_ylabel('F(x)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('normal_distribution.png', dpi=150)
plt.show()

The survival function sf() is particularly useful when you’re dealing with upper-tail probabilities. While you could compute 1 - cdf(x), using sf() directly provides better numerical precision for values in the extreme tails.
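To see why this matters, compare the two approaches far out in the tail (z = 10 is just an arbitrary extreme value for illustration):

```python
from scipy.stats import norm

z = 10.0

# 1 - cdf(z) underflows: cdf(10) rounds to exactly 1.0 in double precision
naive = 1 - norm.cdf(z)
print(f"1 - cdf({z}) = {naive}")  # 0.0

# sf(z) computes the upper tail directly and keeps full precision
exact = norm.sf(z)
print(f"sf({z}) = {exact:.4e}")   # ~7.6e-24
```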

Working with Quantiles and the Percent Point Function

The percent point function ppf() is the inverse of the CDF. Given a probability, it returns the value x such that P(X ≤ x) equals that probability. This is essential for finding critical values and constructing confidence intervals.

from scipy.stats import norm

dist = norm(loc=0, scale=1)

# Find z-scores for common confidence levels
confidence_levels = [0.90, 0.95, 0.99]

print("Critical z-values for two-tailed tests:")
for cl in confidence_levels:
    alpha = 1 - cl
    z_critical = dist.ppf(1 - alpha/2)
    print(f"  {cl*100:.0f}% CI: z = ±{z_critical:.4f}")

# Find specific percentiles
percentiles = [0.25, 0.50, 0.75, 0.95]
print("\nPercentile values:")
for p in percentiles:
    value = dist.ppf(p)
    print(f"  {p*100:.0f}th percentile: {value:.4f}")

# Practical example: SAT scores (mean=1060, std=217)
sat_dist = norm(loc=1060, scale=217)

# What score do you need to be in the top 10%?
top_10_cutoff = sat_dist.ppf(0.90)
print(f"\nTop 10% SAT cutoff: {top_10_cutoff:.0f}")

# What percentile is a score of 1400?
percentile_1400 = sat_dist.cdf(1400) * 100
print(f"A score of 1400 is the {percentile_1400:.1f}th percentile")

The interval() method provides a convenient shortcut for symmetric confidence intervals:

# Get the interval containing 95% of the probability mass
lower, upper = dist.interval(0.95)
print(f"95% of values fall between {lower:.4f} and {upper:.4f}")
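Under the hood, interval(0.95) splits the remaining 5% of probability evenly between the two tails, so it agrees with the corresponding ppf calls (a quick check):

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=0, scale=1)

lower, upper = dist.interval(0.95)

# Equivalent quantile calls: 2.5% of the mass in each tail
assert np.isclose(lower, dist.ppf(0.025))
assert np.isclose(upper, dist.ppf(0.975))
print("interval(0.95) matches ppf(0.025) and ppf(0.975)")
```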

Generating Random Samples

The rvs() method generates random variates from the distribution. This is invaluable for simulations, Monte Carlo methods, and creating synthetic datasets.

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
rng = np.random.default_rng(42)

# Generate samples from a normal distribution
dist = norm(loc=50, scale=10)
samples = dist.rvs(size=10000, random_state=rng)

print(f"Sample mean: {samples.mean():.2f} (expected: 50)")
print(f"Sample std: {samples.std():.2f} (expected: 10)")

# Visualize: histogram with theoretical PDF overlay
fig, ax = plt.subplots(figsize=(10, 6))

# Histogram of samples
ax.hist(samples, bins=50, density=True, alpha=0.7, 
        color='steelblue', edgecolor='white', label='Samples')

# Theoretical PDF
x = np.linspace(samples.min(), samples.max(), 200)
ax.plot(x, dist.pdf(x), 'r-', linewidth=2, label='Theoretical PDF')

ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('10,000 Random Samples vs Theoretical Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('random_samples.png', dpi=150)
plt.show()

For reproducibility, always use random_state. You can pass an integer seed or a NumPy random generator object. This ensures your simulations produce identical results across runs.
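For example, two draws made with fresh generators sharing the same seed are bit-for-bit identical (a small sketch; the seed value 0 is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Fresh generators with the same seed produce identical samples
a = norm.rvs(loc=50, scale=10, size=5, random_state=np.random.default_rng(0))
b = norm.rvs(loc=50, scale=10, size=5, random_state=np.random.default_rng(0))

print(np.array_equal(a, b))  # True
```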

Statistical Measures and Fitting Data

When you have real-world data and want to model it as a normal distribution, the fit() method estimates the optimal parameters using maximum likelihood estimation.

from scipy.stats import norm
import numpy as np

# Simulate "real" data (in practice, this would be your actual dataset)
rng = np.random.default_rng(123)
true_mean, true_std = 175, 8  # Heights in cm
observed_data = rng.normal(true_mean, true_std, 500)

# Add a little uniform noise so the data isn't perfectly Gaussian
observed_data += rng.uniform(-2, 2, 500)

# Fit a normal distribution to the data
fitted_mean, fitted_std = norm.fit(observed_data)
print(f"Fitted parameters:")
print(f"  Mean (loc): {fitted_mean:.2f} cm")
print(f"  Std (scale): {fitted_std:.2f} cm")

# Create fitted distribution
fitted_dist = norm(loc=fitted_mean, scale=fitted_std)

# Extract distribution properties
print(f"\nDistribution properties:")
print(f"  Mean: {fitted_dist.mean():.2f}")
print(f"  Variance: {fitted_dist.var():.2f}")
print(f"  Standard deviation: {fitted_dist.std():.2f}")

# 95% interval
lower, upper = fitted_dist.interval(0.95)
print(f"  95% of population between: {lower:.1f} and {upper:.1f} cm")

# Visualize the fit
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(observed_data, bins=30, density=True, alpha=0.7, 
        color='steelblue', edgecolor='white', label='Observed data')

x = np.linspace(observed_data.min(), observed_data.max(), 200)
ax.plot(x, fitted_dist.pdf(x), 'r-', linewidth=2, 
        label=f'Fitted normal (μ={fitted_mean:.1f}, σ={fitted_std:.1f})')

ax.set_xlabel('Height (cm)')
ax.set_ylabel('Density')
ax.set_title('Fitting Normal Distribution to Height Data')
ax.legend()
plt.savefig('fitted_distribution.png', dpi=150)
plt.show()

Practical Application: Hypothesis Testing Example

Let’s tie everything together with a complete hypothesis testing workflow. Suppose you’re a quality control engineer testing whether a manufacturing process is producing bolts with the correct mean diameter.

from scipy.stats import norm
import numpy as np

# Scenario: Bolts should have mean diameter of 10.0 mm
# You've measured 36 bolts and found a sample mean of 10.15 mm
# Historical data shows population std = 0.3 mm
# Is this deviation significant at α = 0.05?

population_mean = 10.0  # Null hypothesis: μ = 10.0
population_std = 0.3
sample_size = 36
sample_mean = 10.15
alpha = 0.05

# Calculate the standard error of the mean
standard_error = population_std / np.sqrt(sample_size)
print(f"Standard error: {standard_error:.4f}")

# Calculate z-score (test statistic)
z_score = (sample_mean - population_mean) / standard_error
print(f"Z-score: {z_score:.4f}")

# Two-tailed test: calculate p-value
# P-value = 2 * P(Z > |z|) for two-tailed test
standard_normal = norm(loc=0, scale=1)
p_value = 2 * standard_normal.sf(abs(z_score))
print(f"P-value: {p_value:.4f}")

# Critical values for α = 0.05 (two-tailed)
z_critical = standard_normal.ppf(1 - alpha/2)
print(f"Critical z-values: ±{z_critical:.4f}")

# Decision
print("\n--- Decision ---")
if p_value < alpha:
    print(f"Reject H₀: p-value ({p_value:.4f}) < α ({alpha})")
    print("The sample mean is significantly different from 10.0 mm")
else:
    print(f"Fail to reject H₀: p-value ({p_value:.4f}) ≥ α ({alpha})")
    print("No significant difference from 10.0 mm")

# Construct 95% confidence interval for the true mean
ci_lower = sample_mean - z_critical * standard_error
ci_upper = sample_mean + z_critical * standard_error
print(f"\n95% CI for true mean: ({ci_lower:.4f}, {ci_upper:.4f}) mm")

# Check if hypothesized mean falls within CI
if ci_lower <= population_mean <= ci_upper:
    print("Note: 10.0 mm falls within the confidence interval")
else:
    print("Note: 10.0 mm falls outside the confidence interval")

This example demonstrates the core workflow: calculate a test statistic using the sampling distribution, find the p-value using sf(), determine critical values with ppf(), and construct confidence intervals. These operations form the backbone of statistical inference in Python.

The scipy.stats.norm module gives you everything needed for serious statistical work with normal distributions. Master these methods, and you’ll handle the majority of parametric statistical analyses with confidence.
