Pareto Distribution in Python: Complete Guide

Key Insights

  • The Pareto distribution models “rich get richer” phenomena where a small fraction of items account for most of the total—understanding it is essential for analyzing wealth, traffic, file sizes, and other heavy-tailed data.
  • NumPy and SciPy use different parameterizations for Pareto distributions; confusing them leads to incorrect results, so always verify which convention your library uses.
  • For shape parameter α ≤ 2, the variance is infinite, and for α ≤ 1, even the mean diverges; ignoring this causes silent failures in statistical analyses.

Introduction to the Pareto Distribution

In the late 1800s, Italian economist Vilfredo Pareto noticed something peculiar: roughly 80% of Italy’s land was owned by 20% of the population. This observation evolved into what we now call the Pareto principle, or the 80/20 rule. But beyond the catchy ratio lies a rigorous probability distribution that models a surprising range of real-world phenomena.

The Pareto distribution appears wherever you find heavy-tailed, power-law behavior. City populations follow it—a few megacities dwarf thousands of small towns. Website traffic follows it—a handful of pages receive most visits. File sizes on your hard drive, insurance claims, stock returns during market crashes, and word frequencies in natural language all exhibit Pareto-like characteristics.

What makes this distribution special is its heavy tail. Unlike the normal distribution where extreme values are vanishingly rare, Pareto distributions assign meaningful probability to extreme outcomes. This matters enormously when you’re modeling risk, analyzing user behavior, or building systems that must handle outliers gracefully.
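
To make the tail difference concrete, here is a small sketch comparing tail probabilities of a Pareto distribution and a normal distribution with the same mean (the normal's scale of 1 is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

# A Pareto with alpha = 1.5, x_m = 1 has mean alpha * x_m / (alpha - 1) = 3;
# compare it to a normal centered at the same mean.
normal = stats.norm(loc=3, scale=1)
pareto = stats.pareto(b=1.5, scale=1)

for k in [5, 10, 50]:
    print(f"P(X > {k:2d}):  normal = {normal.sf(k):.2e},  Pareto = {pareto.sf(k):.2e}")
```

At x = 10 the normal's tail probability is already astronomically small, while the Pareto still assigns about 3% probability; this gap is exactly what "heavy tail" means in practice.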

Mathematical Foundation

The Pareto distribution is defined by two parameters: the shape parameter α (alpha) and the scale parameter xₘ (the minimum possible value). The probability density function is:

$$f(x) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}}$$ for $x \geq x_m$

The cumulative distribution function, which gives the probability that a value is less than or equal to x, is:

$$F(x) = 1 - \left(\frac{x_m}{x}\right)^\alpha$$

The shape parameter α controls how heavy the tail is. Smaller values mean heavier tails and more extreme outliers. This has profound implications for moments:

  • Mean exists only when α > 1: $E[X] = \frac{\alpha x_m}{\alpha - 1}$
  • Variance exists only when α > 2: $Var(X) = \frac{\alpha x_m^2}{(\alpha-1)^2(\alpha-2)}$

Let’s implement these from scratch:

import numpy as np

def pareto_pdf(x, alpha, x_m):
    """Pareto probability density function."""
    x = np.asarray(x)
    result = np.zeros_like(x, dtype=float)
    valid = x >= x_m
    result[valid] = (alpha * x_m**alpha) / (x[valid]**(alpha + 1))
    return result

def pareto_cdf(x, alpha, x_m):
    """Pareto cumulative distribution function."""
    x = np.asarray(x)
    result = np.zeros_like(x, dtype=float)
    valid = x >= x_m
    result[valid] = 1 - (x_m / x[valid])**alpha
    return result

def pareto_mean(alpha, x_m):
    """Mean of Pareto distribution (finite only for alpha > 1)."""
    if alpha <= 1:
        return np.inf
    return (alpha * x_m) / (alpha - 1)

def pareto_variance(alpha, x_m):
    """Variance of Pareto distribution (finite only for alpha > 2)."""
    if alpha <= 2:
        return np.inf
    return (alpha * x_m**2) / ((alpha - 1)**2 * (alpha - 2))

# Test our implementations
alpha, x_m = 3.0, 1.0
print(f"Mean: {pareto_mean(alpha, x_m):.4f}")
print(f"Variance: {pareto_variance(alpha, x_m):.4f}")

Generating Pareto Distributions with NumPy and SciPy

Here’s where things get confusing. NumPy and SciPy use different parameterizations, and mixing them up is a common source of bugs.

NumPy’s convention: numpy.random.pareto(a) actually draws from the Lomax (Pareto II) distribution, which is the classical Pareto shifted to start at zero with scale 1. To get the standard Pareto distribution, you must add 1 to the result (and multiply by xₘ if your scale is not 1).

SciPy’s convention: scipy.stats.pareto(b) uses the shape parameter directly, with a scale parameter for xₘ.

import numpy as np
from scipy import stats

np.random.seed(42)

alpha = 2.5
x_m = 1.0
n_samples = 100000

# NumPy approach - note the +1!
numpy_samples = (np.random.pareto(alpha, n_samples) + 1) * x_m

# SciPy approach
scipy_dist = stats.pareto(b=alpha, scale=x_m)
scipy_samples = scipy_dist.rvs(size=n_samples)

# Compare statistics
print("NumPy samples:")
print(f"  Mean: {numpy_samples.mean():.4f}")
print(f"  Std:  {numpy_samples.std():.4f}")

print("\nSciPy samples:")
print(f"  Mean: {scipy_samples.mean():.4f}")
print(f"  Std:  {scipy_samples.std():.4f}")

# Theoretical values
theoretical_mean = pareto_mean(alpha, x_m)
theoretical_std = np.sqrt(pareto_variance(alpha, x_m))
print(f"\nTheoretical Mean: {theoretical_mean:.4f}")
print(f"Theoretical Std:  {theoretical_std:.4f}")

The key takeaway: always verify your samples match theoretical expectations before using them in analysis.
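
One way to operationalize that check is a small helper. This is a sketch: check_pareto_samples is a hypothetical name, and the 5% relative tolerance is an arbitrary choice.

```python
import numpy as np

def check_pareto_samples(samples, alpha, x_m, rel_tol=0.05):
    """Sanity check: is the sample mean within rel_tol of the
    theoretical mean alpha * x_m / (alpha - 1)? Requires alpha > 1."""
    if alpha <= 1:
        raise ValueError("theoretical mean is infinite for alpha <= 1")
    theoretical = alpha * x_m / (alpha - 1)
    rel_err = abs(samples.mean() - theoretical) / theoretical
    return bool(rel_err < rel_tol)

rng = np.random.default_rng(0)
samples = (rng.pareto(2.5, 200_000) + 1) * 1.0  # classical Pareto, x_m = 1
print(check_pareto_samples(samples, alpha=2.5, x_m=1.0))
```

For heavier tails (smaller α) the sample mean converges slowly, so a loose tolerance or a larger sample is needed before a failed check means anything.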

Visualizing Pareto Distributions

Visualization is crucial for understanding how the shape parameter affects the distribution. Log-log plots are particularly useful because Pareto distributions appear as straight lines in log-log space: the PDF has slope -(α + 1), while the survival function P(X > x) has slope -α.
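
This straight-line behavior can also be checked numerically. The sketch below (sample size and seed are arbitrary choices) fits a line to the empirical survival function in log-log space; for Pareto data the slope should come out close to -α.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 2.5
samples = rng.pareto(alpha, 100_000) + 1  # classical Pareto, x_m = 1

sorted_x = np.sort(samples)
# Empirical survival function: fraction of samples above each sorted value
survival = 1.0 - np.arange(1, len(sorted_x) + 1) / len(sorted_x)
# Drop the last point (survival = 0) before taking logs
slope, intercept = np.polyfit(np.log(sorted_x[:-1]), np.log(survival[:-1]), 1)
print(f"Fitted log-log slope: {slope:.3f} (expected close to {-alpha})")
```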

import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

alphas = [1.5, 2.5, 4.0]
x_m = 1.0
x = np.linspace(1.001, 8, 1000)

# PDF comparison
ax = axes[0, 0]
for alpha in alphas:
    pdf = stats.pareto(b=alpha, scale=x_m).pdf(x)
    ax.plot(x, pdf, label=f'α = {alpha}', linewidth=2)
ax.set_xlabel('x')
ax.set_ylabel('Probability Density')
ax.set_title('Pareto PDF for Different Shape Parameters')
ax.legend()
ax.set_ylim(0, 3)

# CDF comparison
ax = axes[0, 1]
for alpha in alphas:
    cdf = stats.pareto(b=alpha, scale=x_m).cdf(x)
    ax.plot(x, cdf, label=f'α = {alpha}', linewidth=2)
ax.set_xlabel('x')
ax.set_ylabel('Cumulative Probability')
ax.set_title('Pareto CDF for Different Shape Parameters')
ax.legend()

# Histogram of samples
ax = axes[1, 0]
samples = stats.pareto(b=2.5, scale=1.0).rvs(size=10000, random_state=42)
ax.hist(samples, bins=100, density=True, alpha=0.7, edgecolor='black')
x_hist = np.linspace(1, samples.max(), 1000)
ax.plot(x_hist, stats.pareto(b=2.5, scale=1.0).pdf(x_hist), 'r-', lw=2)
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('Histogram vs Theoretical PDF (α = 2.5)')
ax.set_xlim(0.9, 15)

# Log-log plot (power law verification)
ax = axes[1, 1]
x_log = np.logspace(0, 2, 100)
for alpha in alphas:
    survival = 1 - stats.pareto(b=alpha, scale=x_m).cdf(x_log)
    ax.loglog(x_log, survival, label=f'α = {alpha} (slope = -{alpha})', linewidth=2)
ax.set_xlabel('x (log scale)')
ax.set_ylabel('P(X > x) (log scale)')
ax.set_title('Survival Function (Log-Log Scale)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pareto_visualizations.png', dpi=150)
plt.show()

Parameter Estimation and Fitting

When you have data and suspect it follows a Pareto distribution, you need to estimate the parameters. Maximum likelihood estimation gives us:

$$\hat{\alpha} = \frac{n}{\sum_{i=1}^{n} \ln(x_i / x_m)}$$

where the maximum likelihood estimate of xₘ is the minimum observed value.

from scipy import stats
import numpy as np

def fit_pareto_mle(data):
    """Fit Pareto distribution using maximum likelihood estimation."""
    x_m = data.min()
    n = len(data)
    alpha_hat = n / np.sum(np.log(data / x_m))
    return alpha_hat, x_m

# Generate synthetic "city population" data
np.random.seed(42)
true_alpha = 1.8
true_x_m = 10000
synthetic_populations = stats.pareto(b=true_alpha, scale=true_x_m).rvs(size=500)

# Fit using our MLE implementation
alpha_mle, x_m_mle = fit_pareto_mle(synthetic_populations)
print(f"True parameters: α = {true_alpha}, x_m = {true_x_m}")
print(f"MLE estimates:   α = {alpha_mle:.4f}, x_m = {x_m_mle:.2f}")

# Fit using SciPy (note: returns shape, loc, scale)
shape, loc, scale = stats.pareto.fit(synthetic_populations, floc=0)
print(f"SciPy fit:       α = {shape:.4f}, scale = {scale:.2f}")

# Kolmogorov-Smirnov goodness-of-fit test
# (caveat: testing against parameters fitted on the same data makes the p-value optimistic)
fitted_dist = stats.pareto(b=alpha_mle, scale=x_m_mle)
ks_stat, p_value = stats.kstest(synthetic_populations, fitted_dist.cdf)
print(f"\nKS test: statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")
if p_value > 0.05:
    print("Fail to reject null hypothesis: data is consistent with Pareto distribution")

Practical Applications

Let’s analyze a realistic scenario: modeling customer lifetime value where a few customers contribute disproportionately to revenue.

import numpy as np
from scipy import stats

np.random.seed(42)

# Simulate customer revenue data (Pareto-distributed)
n_customers = 10000
alpha = 1.5  # Heavy tail - lots of small customers, few whales
min_revenue = 50  # Minimum revenue per customer

revenues = stats.pareto(b=alpha, scale=min_revenue).rvs(size=n_customers)

# Analyze the distribution
print("Revenue Distribution Analysis")
print("=" * 40)
print(f"Total customers: {n_customers:,}")
print(f"Total revenue: ${revenues.sum():,.2f}")
print(f"Mean revenue: ${revenues.mean():,.2f}")
print(f"Median revenue: ${np.median(revenues):,.2f}")
print(f"Max revenue: ${revenues.max():,.2f}")

# Calculate concentration metrics
sorted_rev = np.sort(revenues)[::-1]
cumulative_rev = np.cumsum(sorted_rev)
total_rev = revenues.sum()

# Find what percentage of customers generate 80% of revenue
# (+1 because searchsorted returns the index of the first customer to cross the threshold)
pct_for_80 = (np.searchsorted(cumulative_rev, 0.8 * total_rev) + 1) / n_customers * 100
print(f"\n{pct_for_80:.1f}% of customers generate 80% of revenue")

# Tail probability analysis: P(revenue > threshold)
thresholds = [100, 500, 1000, 5000]
print("\nTail Probabilities:")
fitted_dist = stats.pareto(b=alpha, scale=min_revenue)
for t in thresholds:
    prob = 1 - fitted_dist.cdf(t)
    print(f"  P(Revenue > ${t:,}) = {prob:.4f} ({prob*100:.2f}%)")

# Risk analysis: Expected value in the tail
def expected_tail_value(dist, threshold):
    """Calculate E[X | X > threshold] using numerical integration."""
    from scipy import integrate
    # Integrate to infinity: truncating the upper limit (e.g. at 100x the
    # threshold) understates the heavy tail's contribution.
    numerator, _ = integrate.quad(
        lambda x: x * dist.pdf(x), threshold, np.inf
    )
    denominator = 1 - dist.cdf(threshold)
    return numerator / denominator if denominator > 0 else np.inf

print("\nExpected Value Given Exceeding Threshold:")
for t in [100, 500, 1000]:
    ev = expected_tail_value(fitted_dist, t)
    print(f"  E[Revenue | Revenue > ${t:,}] = ${ev:,.2f}")

Common Pitfalls and Best Practices

When to use Pareto vs. alternatives: Pareto is appropriate when you have a strict lower bound and observe power-law decay. If your data can approach zero, consider log-normal. If you’re modeling time between events, exponential or Weibull might be better. Always plot your data on log-log axes first—if it’s not approximately linear, Pareto isn’t the right choice.
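
That log-log screening step can be reduced to a quick numeric check. This is a rough sketch: loglog_linearity is a hypothetical helper, and correlation near -1 is a heuristic screen, not a formal test.

```python
import numpy as np
from scipy import stats

def loglog_linearity(data):
    """Pearson correlation of log(x) vs log of the empirical survival
    function. Values very close to -1 suggest power-law-like decay."""
    x = np.sort(np.asarray(data))
    surv = 1.0 - np.arange(1, len(x) + 1) / len(x)
    r, _ = stats.pearsonr(np.log(x[:-1]), np.log(surv[:-1]))
    return r

rng = np.random.default_rng(0)
pareto_data = rng.pareto(2.0, 5000) + 1
normal_data = rng.normal(10, 1, 5000)  # positive in practice at these parameters

r_p = loglog_linearity(pareto_data)
r_n = loglog_linearity(normal_data)
print(f"Pareto data: r = {r_p:.3f}")
print(f"Normal data: r = {r_n:.3f}")
```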

Handling undefined moments: When α ≤ 2, sample variance will fluctuate wildly and never converge. Don’t trust sample statistics blindly. Use median and interquartile range instead of mean and standard deviation for heavy-tailed data.
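
A quick simulation illustrates the point (sample sizes and seed are arbitrary): the running sample variance for α = 1.5 never settles, while for α = 3 it stabilizes near the theoretical value of 0.75.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
heavy = rng.pareto(1.5, n) + 1   # classical Pareto, x_m = 1, infinite variance
light = rng.pareto(3.0, n) + 1   # finite variance: 3 / (2**2 * 1) = 0.75

checkpoints = (1_000, 10_000, 100_000, 200_000)
heavy_vars = [heavy[:k].var() for k in checkpoints]
light_vars = [light[:k].var() for k in checkpoints]
print("alpha=1.5:", [f"{v:.2f}" for v in heavy_vars])
print("alpha=3.0:", [f"{v:.2f}" for v in light_vars])
```

Each new extreme draw in the heavy-tailed series yanks the sample variance upward, which is exactly why summary statistics based on second moments are untrustworthy here.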

Numerical stability: For very small α values, extreme samples can cause overflow. Use log-transforms when computing products or working with likelihoods:

def safe_pareto_logpdf(x, alpha, x_m):
    """Numerically stable log-PDF computation."""
    x = np.asarray(x)
    result = np.full_like(x, -np.inf, dtype=float)
    valid = x >= x_m
    result[valid] = (np.log(alpha) + alpha * np.log(x_m) 
                     - (alpha + 1) * np.log(x[valid]))
    return result
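
As a cross-check, the stable log-PDF can be summed into a log-likelihood and compared against SciPy's logpdf. The function is repeated below so the snippet runs standalone; the data parameters are arbitrary.

```python
import numpy as np
from scipy import stats

def safe_pareto_logpdf(x, alpha, x_m):
    """Numerically stable log-PDF computation."""
    x = np.asarray(x)
    result = np.full_like(x, -np.inf, dtype=float)
    valid = x >= x_m
    result[valid] = (np.log(alpha) + alpha * np.log(x_m)
                     - (alpha + 1) * np.log(x[valid]))
    return result

rng = np.random.default_rng(0)
data = (rng.pareto(2.0, 1000) + 1) * 3.0  # classical Pareto with x_m = 3

ll_ours = safe_pareto_logpdf(data, alpha=2.0, x_m=3.0).sum()
ll_scipy = stats.pareto(b=2.0, scale=3.0).logpdf(data).sum()
print(f"Our log-likelihood:   {ll_ours:.4f}")
print(f"SciPy log-likelihood: {ll_scipy:.4f}")
```

The two sums should agree to floating-point precision; summing log-densities this way avoids the underflow you would hit multiplying a thousand raw densities together.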

The Pareto distribution is a powerful tool for modeling inequality and extreme events. Master its quirks—especially the parameterization differences and moment existence conditions—and you’ll be equipped to analyze heavy-tailed phenomena that break conventional statistical assumptions.
