Normal Distribution in Python: Complete Guide
Key Insights
- NumPy and SciPy provide complementary tools for normal distributions—use NumPy for generating samples and SciPy for probability calculations, statistical tests, and analytical work.
- Always test for normality before applying statistical methods that assume it; the Shapiro-Wilk test works well for small samples (n < 5000), while Anderson-Darling handles larger datasets.
- The Central Limit Theorem is your safety net—even when individual data points aren’t normally distributed, sample means will approximate normality with sufficient sample size (typically n ≥ 30).
Introduction to Normal Distribution
The normal distribution, also called the Gaussian distribution or bell curve, is the most important probability distribution in statistics. It describes how continuous data naturally clusters around a central value, with symmetric tails extending in both directions.
Two parameters define a normal distribution completely: the mean (μ), which determines the center, and the standard deviation (σ), which controls the spread. About 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. This “68-95-99.7 rule” makes the normal distribution incredibly useful for understanding data variability.
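These percentages follow directly from the standard normal CDF; here is a quick check with SciPy (the `scipy.stats` module is covered in more depth later in this guide):

```python
from scipy import stats

# P(-k < Z < k) for a standard normal variable, k = 1, 2, 3
for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} SD: {prob:.4f}")  # 0.6827, 0.9545, 0.9973
```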
Why does this matter for Python developers? Machine learning algorithms often assume normally distributed features. Statistical tests require normality assumptions. Quality control, financial modeling, and scientific research all rely on normal distributions. Understanding how to generate, visualize, test, and apply normal distributions in Python is a fundamental skill.
Generating Normal Distributions with NumPy
NumPy provides two primary functions for generating normally distributed random numbers. The numpy.random.normal() function lets you specify mean and standard deviation, while numpy.random.randn() generates samples from the standard normal distribution (mean=0, std=1).
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
# Generate 1000 samples with mean=100, std=15 (like IQ scores)
samples = np.random.normal(loc=100, scale=15, size=1000)
# Basic statistics
print(f"Sample mean: {samples.mean():.2f}")
print(f"Sample std: {samples.std():.2f}")
print(f"Min: {samples.min():.2f}, Max: {samples.max():.2f}")
# Using randn for standard normal, then transform
standard_samples = np.random.randn(1000)
transformed = standard_samples * 15 + 100 # Same distribution as above
print(f"\nTransformed mean: {transformed.mean():.2f}")
print(f"Transformed std: {transformed.std():.2f}")
The output confirms our samples approximate the specified parameters:
Sample mean: 99.77
Sample std: 14.89
Min: 51.93, Max: 145.62
Transformed mean: 100.41
Transformed std: 15.23
For modern NumPy code, prefer the new random generator API:
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=100, scale=15, size=1000)
This approach offers better statistical properties and thread safety.
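It also makes it easy to create independent streams for parallel work. A sketch using `SeedSequence.spawn` (the stream count of three is an arbitrary choice for illustration):

```python
import numpy as np

# One child SeedSequence per worker yields statistically independent streams
child_seeds = np.random.SeedSequence(42).spawn(3)
streams = [np.random.default_rng(s) for s in child_seeds]
for i, rng in enumerate(streams):
    print(f"Stream {i}: {rng.normal(loc=100, scale=15, size=2)}")
```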
Visualizing Normal Distributions
Visualization helps you understand your data’s distribution at a glance. Matplotlib and Seaborn make this straightforward.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Histogram with KDE overlay
ax1 = axes[0]
sns.histplot(data, kde=True, stat='density', ax=ax1, color='steelblue', alpha=0.7)
# Add vertical lines for mean and standard deviations
mean, std = data.mean(), data.std()
ax1.axvline(mean, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean:.1f}')
ax1.axvline(mean - std, color='orange', linestyle=':', linewidth=2, label=f'±1 SD: {std:.1f}')
ax1.axvline(mean + std, color='orange', linestyle=':', linewidth=2)
ax1.axvline(mean - 2*std, color='green', linestyle=':', linewidth=1.5, label='±2 SD')
ax1.axvline(mean + 2*std, color='green', linestyle=':', linewidth=1.5)
ax1.set_title('Histogram with KDE and Standard Deviations')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.legend()
# Q-Q plot to assess normality
from scipy import stats
ax2 = axes[1]
stats.probplot(data, dist="norm", plot=ax2)
ax2.set_title('Q-Q Plot')
plt.tight_layout()
plt.savefig('normal_distribution_viz.png', dpi=150)
plt.show()
The Q-Q plot is particularly valuable. Points falling along the diagonal line indicate normality. Deviations at the tails suggest heavy or light tails, while S-shaped patterns indicate skewness.
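Skewness and excess kurtosis provide a numeric complement to the Q-Q plot: both are near zero for normal data. A quick sketch (sample sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(50, 10, 1000)
skewed_data = rng.exponential(scale=2, size=1000)

# Near-zero skew and excess kurtosis are consistent with normality
for name, d in [("normal", normal_data), ("exponential", skewed_data)]:
    print(f"{name}: skew={stats.skew(d):.2f}, excess kurtosis={stats.kurtosis(d):.2f}")
```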
Working with SciPy’s norm Distribution
While NumPy handles random sampling, SciPy’s scipy.stats.norm provides the analytical tools you need for probability calculations.
from scipy import stats
import numpy as np
# Create a normal distribution object
dist = stats.norm(loc=100, scale=15)
# Probability Density Function (PDF) - height of curve at a point
print(f"PDF at x=100: {dist.pdf(100):.4f}")
print(f"PDF at x=115: {dist.pdf(115):.4f}")
# Cumulative Distribution Function (CDF) - P(X <= x)
print(f"\nP(X <= 100): {dist.cdf(100):.4f}") # Should be 0.5
print(f"P(X <= 115): {dist.cdf(115):.4f}") # ~0.84 (one std above mean)
print(f"P(X <= 130): {dist.cdf(130):.4f}") # ~0.98 (two std above)
# Percent Point Function (PPF) - inverse of CDF
print(f"\n50th percentile: {dist.ppf(0.50):.2f}")
print(f"95th percentile: {dist.ppf(0.95):.2f}")
print(f"99th percentile: {dist.ppf(0.99):.2f}")
# Calculate probability between two values
prob_between = dist.cdf(115) - dist.cdf(85)
print(f"\nP(85 <= X <= 115): {prob_between:.4f}") # Within 1 std
# Confidence intervals
confidence = 0.95
lower, upper = dist.interval(confidence)
print(f"\n95% confidence interval: [{lower:.2f}, {upper:.2f}]")
Output:
PDF at x=100: 0.0266
PDF at x=115: 0.0176
P(X <= 100): 0.5000
P(X <= 115): 0.8413
P(X <= 130): 0.9772
50th percentile: 100.00
95th percentile: 124.67
99th percentile: 134.90
P(85 <= X <= 115): 0.6827
95% confidence interval: [70.60, 129.40]
These calculations are essential for hypothesis testing, setting thresholds, and understanding probabilities in your data.
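One more tool worth knowing: the survival function `sf(x)` gives P(X > x) directly and is numerically more accurate than computing `1 - cdf(x)` in the far tail, with `isf` as its inverse:

```python
from scipy import stats

dist = stats.norm(loc=100, scale=15)

# sf(x) = P(X > x); preferable to 1 - cdf(x) for tail probabilities
print(f"P(X > 145): {dist.sf(145):.6f}")       # ~0.00135 (3 SD above the mean)
# isf gives the value exceeded with a given probability
print(f"Top 1% cutoff: {dist.isf(0.01):.2f}")  # same value as ppf(0.99)
```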
Testing for Normality
Never assume your data is normally distributed. Test it. Python offers several statistical tests for normality, each with different strengths.
from scipy import stats
import numpy as np
np.random.seed(42)
# Generate different distributions for comparison
normal_data = np.random.normal(100, 15, 500)
skewed_data = np.random.exponential(scale=2, size=500)
uniform_data = np.random.uniform(0, 100, 500)
def test_normality(data, name):
    print(f"\n{'='*50}")
    print(f"Testing: {name}")
    print('='*50)

    # Shapiro-Wilk test (best for n < 5000)
    stat, p = stats.shapiro(data)
    print(f"Shapiro-Wilk: statistic={stat:.4f}, p-value={p:.4f}")
    print(f"  → {'Normal' if p > 0.05 else 'Not normal'} (α=0.05)")

    # D'Agostino-Pearson test (requires n >= 20)
    stat, p = stats.normaltest(data)
    print(f"D'Agostino-Pearson: statistic={stat:.4f}, p-value={p:.4f}")
    print(f"  → {'Normal' if p > 0.05 else 'Not normal'} (α=0.05)")

    # Anderson-Darling test
    result = stats.anderson(data, dist='norm')
    print(f"Anderson-Darling: statistic={result.statistic:.4f}")
    for cv, sig in zip(result.critical_values, result.significance_level):
        status = "Normal" if result.statistic < cv else "Not normal"
        print(f"  → {status} at {sig}% significance (critical value: {cv:.3f})")
test_normality(normal_data, "Normal Distribution")
test_normality(skewed_data, "Exponential Distribution")
test_normality(uniform_data, "Uniform Distribution")
Key interpretation rules:
- p-value > 0.05: Fail to reject null hypothesis; data is consistent with normality
- p-value ≤ 0.05: Reject null hypothesis; data is not normally distributed
The Shapiro-Wilk test is most powerful for small samples but becomes overly sensitive with large datasets. For n > 5000, rely more on visual inspection (Q-Q plots) and domain knowledge.
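To see this oversensitivity concretely, here is a small sketch: a t-distribution with 30 degrees of freedom is practically indistinguishable from normal, yet with enough data the test flags the tiny difference (the distribution and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# t(df=30) is very close to normal (only slightly heavier tails)
nearly_normal = rng.standard_t(df=30, size=100000)

_, p_small = stats.normaltest(nearly_normal[:200])
_, p_large = stats.normaltest(nearly_normal)
print(f"n=200:     p={p_small:.4f}")
print(f"n=100,000: p={p_large:.2e}")  # tiny: statistically "not normal"
```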
Practical Applications
Let’s apply normal distribution concepts to real-world problems.
Z-Score Calculation and Outlier Detection
import numpy as np
from scipy import stats
np.random.seed(42)
# Simulated sensor readings with some outliers
sensor_data = np.random.normal(50, 5, 100)
sensor_data = np.append(sensor_data, [20, 85, 90]) # Add outliers
# Calculate z-scores
z_scores = stats.zscore(sensor_data)
# Detect outliers using 3-sigma rule
outlier_mask = np.abs(z_scores) > 3
outliers = sensor_data[outlier_mask]
print(f"Data points: {len(sensor_data)}")
print(f"Mean: {sensor_data.mean():.2f}, Std: {sensor_data.std():.2f}")
print(f"\nOutliers detected (|z| > 3): {len(outliers)}")
print(f"Outlier values: {outliers}")
print(f"Outlier z-scores: {z_scores[outlier_mask]}")
# Clean data
clean_data = sensor_data[~outlier_mask]
print(f"\nClean data mean: {clean_data.mean():.2f}")
print(f"Clean data std: {clean_data.std():.2f}")
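One caveat: the mean and standard deviation used in the z-scores are themselves inflated by the outliers. A robust alternative is the modified z-score of Iglewicz and Hoaglin, based on the median absolute deviation (a sketch on the same kind of synthetic data):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.append(rng.normal(50, 5, 100), [20, 85, 90])

# Median and MAD are barely affected by the outliers themselves
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad  # 0.6745 ≈ Φ⁻¹(0.75)

outliers = data[np.abs(modified_z) > 3.5]  # common threshold for this score
print(f"Outliers: {np.sort(outliers)}")
```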
Central Limit Theorem Demonstration
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats  # needed for the Shapiro-Wilk test below
np.random.seed(42)
# Start with a decidedly non-normal distribution (exponential)
population = np.random.exponential(scale=2, size=100000)
sample_sizes = [5, 30, 100]
n_samples = 1000
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
# Plot original population
axes[0].hist(population, bins=50, density=True, alpha=0.7, color='gray')
axes[0].set_title('Population\n(Exponential)')
axes[0].set_xlabel('Value')
# Sample means for different sample sizes
for idx, n in enumerate(sample_sizes):
    sample_means = [np.random.choice(population, size=n).mean()
                    for _ in range(n_samples)]
    axes[idx + 1].hist(sample_means, bins=30, density=True, alpha=0.7, color='steelblue')
    axes[idx + 1].set_title(f'Sample Means\n(n={n})')
    axes[idx + 1].set_xlabel('Mean Value')

    # Test normality of sample means
    _, p = stats.shapiro(sample_means)
    axes[idx + 1].text(0.05, 0.95, f'Shapiro p={p:.3f}',
                       transform=axes[idx + 1].transAxes, verticalalignment='top')
plt.tight_layout()
plt.savefig('clt_demonstration.png', dpi=150)
plt.show()
This demonstrates the Central Limit Theorem directly: even though the population is exponential, the distribution of the sample means approaches normality as the sample size grows.
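The CLT also predicts how tight that distribution of means should be: its standard deviation is the population standard deviation divided by √n (the standard error). A quick numerical check under the same kind of setup:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2, size=100000)

n = 30
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(2000)])

# CLT prediction: std of the sample means ≈ population std / sqrt(n)
predicted = population.std() / np.sqrt(n)
print(f"Predicted std of means: {predicted:.3f}")
print(f"Observed std of means:  {sample_means.std():.3f}")
```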
Summary and Best Practices
The normal distribution is foundational, but using it correctly requires discipline.
When to assume normality:
- Your normality tests pass (p > 0.05)
- Q-Q plots show points along the diagonal
- You’re working with sample means (CLT applies)
- Domain knowledge supports the assumption
Common pitfalls to avoid:
- Assuming normality without testing
- Using Shapiro-Wilk on very large samples (it becomes too sensitive)
- Ignoring visual diagnostics in favor of p-values alone
- Forgetting that many real-world distributions are skewed
Practical recommendations:
- Always visualize your data first with histograms and Q-Q plots
- Run at least two normality tests for confirmation
- Consider transformations (log, Box-Cox) for skewed data
- When in doubt, use non-parametric methods that don’t assume normality
- Remember that “approximately normal” is often good enough for robust statistical methods
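To illustrate the transformation advice above, here is a sketch applying Box-Cox to synthetic right-skewed data (Box-Cox requires strictly positive values; reach for a log or Yeo-Johnson transform otherwise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2, size=500)  # strictly positive, right-skewed

# boxcox estimates the power-transform parameter lambda by maximum likelihood
transformed, lam = stats.boxcox(skewed)
print(f"Estimated lambda: {lam:.3f}")
print(f"Skew before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```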
Master these tools, and you’ll handle the vast majority of statistical analysis tasks that come your way.