How to Calculate Cumulative Distribution Functions
Key Insights
- The cumulative distribution function (CDF) tells you the probability that a random variable is less than or equal to a specific value, making it essential for percentile calculations and hypothesis testing
- While probability density functions (PDFs) show probability density at a point, CDFs accumulate probability from negative infinity up to that point, providing a complete picture of the distribution
- Empirical CDFs computed from real data let you compare observed distributions against theoretical models without making parametric assumptions
Introduction to CDFs
A cumulative distribution function (CDF) answers a fundamental question in statistics: “What’s the probability that a random variable X is less than or equal to some value x?” Formally, the CDF is defined as F(x) = P(X ≤ x).
The distinction between probability density functions (PDFs) and CDFs trips up many developers. A PDF shows the relative likelihood of different values—think of it as a probability “rate” for continuous distributions. The CDF, by contrast, accumulates these probabilities from the left tail up to your point of interest. For any continuous distribution, the CDF is the integral of the PDF.
CDFs matter because they directly answer practical questions: “What percentage of users have response times under 200ms?” or “How likely is it that our system handles fewer than 1000 requests per second?” These are CDF questions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate normal distribution
x = np.linspace(-4, 4, 1000)
pdf = stats.norm.pdf(x, loc=0, scale=1)
cdf = stats.norm.cdf(x, loc=0, scale=1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# PDF plot
ax1.plot(x, pdf, 'b-', linewidth=2)
ax1.fill_between(x[x <= 1], pdf[x <= 1], alpha=0.3)
ax1.set_title('PDF: Probability Density')
ax1.set_xlabel('x')
ax1.set_ylabel('Density')
ax1.axvline(1, color='r', linestyle='--', alpha=0.5)
# CDF plot
ax2.plot(x, cdf, 'g-', linewidth=2)
ax2.set_title('CDF: Cumulative Probability')
ax2.set_xlabel('x')
ax2.set_ylabel('P(X ≤ x)')
ax2.axvline(1, color='r', linestyle='--', alpha=0.5)
ax2.axhline(stats.norm.cdf(1), color='r', linestyle='--', alpha=0.5)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Mathematical Foundation
For discrete distributions, the CDF sums probabilities up to and including x:
F(x) = P(X ≤ x) = Σ P(X = k) for all k ≤ x
For continuous distributions, the CDF integrates the PDF:
F(x) = ∫_{-∞}^{x} f(t) dt
CDFs have three critical properties: they’re non-decreasing (monotonic), they approach 0 as x approaches negative infinity, and they approach 1 as x approaches positive infinity. These properties make CDFs well-behaved mathematical objects.
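These properties are easy to verify numerically. Here is a quick sketch using the standard normal distribution (chosen only for illustration): it integrates the PDF up to a point and compares against the closed-form CDF, then checks monotonicity and the tail limits.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# The CDF is the integral of the PDF: integrate the standard normal
# PDF from -inf up to 1 and compare with the closed-form CDF.
area, _ = quad(stats.norm.pdf, -np.inf, 1)
print(f"Integral of PDF up to 1: {area:.4f}")
print(f"stats.norm.cdf(1):       {stats.norm.cdf(1):.4f}")

# Non-decreasing: CDF values over an increasing grid never decrease.
x = np.linspace(-10, 10, 1001)
cdf = stats.norm.cdf(x)
print("Non-decreasing:", np.all(np.diff(cdf) >= 0))

# Limits: F(x) -> 0 in the left tail and -> 1 in the right tail.
print(f"F(-10) = {stats.norm.cdf(-10):.2e}")
print(f"1 - F(10) = {1 - stats.norm.cdf(10):.2e}")
```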
Let’s calculate a CDF manually for a simple discrete case—rolling a fair six-sided die:
def dice_cdf(x):
    """CDF for a single fair die roll"""
    if x < 1:
        return 0
    elif x >= 6:
        return 1
    else:
        # Probability of rolling <= x is floor(x)/6
        return int(x) / 6
# Calculate CDF values
outcomes = np.arange(0, 8, 0.1)
cdf_values = [dice_cdf(x) for x in outcomes]
# Visualize
plt.figure(figsize=(10, 5))
plt.step(outcomes, cdf_values, where='post', linewidth=2)
plt.xlabel('Outcome')
plt.ylabel('P(X ≤ x)')
plt.title('CDF of a Fair Die')
plt.grid(True, alpha=0.3)
plt.ylim(-0.1, 1.1)
# Mark actual outcomes
for i in range(1, 7):
    plt.plot(i, i/6, 'ro', markersize=8)
plt.show()
# Verify probabilities
print(f"P(X ≤ 3) = {dice_cdf(3)}") # 0.5
print(f"P(X ≤ 4.7) = {dice_cdf(4.7):.4f}")  # 0.6667
Calculating CDFs for Common Distributions
Modern statistical libraries handle CDF calculations efficiently. Here’s how to compute CDFs for the most common distributions using scipy.stats:
from scipy import stats
# Normal distribution: mean=100, std=15 (like IQ scores)
normal_dist = stats.norm(loc=100, scale=15)
print(f"P(X ≤ 115) = {normal_dist.cdf(115):.4f}") # ~0.8413
print(f"P(X ≤ 100) = {normal_dist.cdf(100):.4f}") # 0.5
# Binomial distribution: n=10 trials, p=0.3 success probability
binomial_dist = stats.binom(n=10, p=0.3)
print(f"P(X ≤ 3) = {binomial_dist.cdf(3):.4f}") # ~0.6496
print(f"P(X ≤ 5) = {binomial_dist.cdf(5):.4f}") # ~0.9527
# Exponential distribution: lambda=0.5 (rate parameter)
exponential_dist = stats.expon(scale=2) # scale = 1/lambda
print(f"P(X ≤ 2) = {exponential_dist.cdf(2):.4f}") # ~0.6321
print(f"P(X ≤ 4) = {exponential_dist.cdf(4):.4f}") # ~0.8647
# Calculate probability in a range: P(a < X ≤ b) = F(b) - F(a)
prob_range = normal_dist.cdf(115) - normal_dist.cdf(85)
print(f"P(85 < X ≤ 115) = {prob_range:.4f}") # ~0.6827
The pattern is consistent: create a distribution object with parameters, then call .cdf(x). This abstraction handles numerical integration and special functions under the hood.
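The same distribution objects also expose the inverse CDF via .ppf (the percent point function), which turns a probability back into a value. A brief sketch, reusing the IQ-style normal distribution from above:

```python
from scipy import stats

normal_dist = stats.norm(loc=100, scale=15)

# The percent point function (ppf) inverts the CDF: given a probability q,
# it returns the value x such that P(X <= x) = q.
median = normal_dist.ppf(0.5)       # the mean, by symmetry
p95_cutoff = normal_dist.ppf(0.95)

print(f"Median: {median:.2f}")
print(f"95th percentile: {p95_cutoff:.2f}")

# Round trip: cdf(ppf(q)) recovers q.
print(f"cdf(ppf(0.95)) = {normal_dist.cdf(p95_cutoff):.4f}")
```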
Computing CDFs from Data
When you have empirical data rather than a theoretical distribution, you need an empirical CDF (ECDF). The ECDF at value x is simply the fraction of observations less than or equal to x.
Here’s the step-by-step process:
- Sort your data in ascending order
- For each unique value, count how many observations are ≤ that value
- Divide by the total number of observations
import numpy as np
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt
# Generate sample data: response times in milliseconds
np.random.seed(42)
response_times = np.random.lognormal(mean=5, sigma=0.5, size=1000)
# Manual ECDF calculation
def manual_ecdf(data):
    """Calculate ECDF manually"""
    sorted_data = np.sort(data)
    n = len(sorted_data)
    y = np.arange(1, n + 1) / n
    return sorted_data, y
# Using statsmodels
ecdf = ECDF(response_times)
# Compare with theoretical log-normal CDF
x_range = np.linspace(response_times.min(), response_times.max(), 1000)
theoretical_cdf = stats.lognorm.cdf(x_range, s=0.5, scale=np.exp(5))
# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(ecdf.x, ecdf.y, 'b-', linewidth=2, label='Empirical CDF', alpha=0.7)
plt.plot(x_range, theoretical_cdf, 'r--', linewidth=2,
         label='Theoretical (LogNormal)', alpha=0.7)
plt.xlabel('Response Time (ms)')
plt.ylabel('Cumulative Probability')
plt.title('Empirical vs Theoretical CDF')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Key percentiles from ECDF
percentiles = [50, 90, 95, 99]
for p in percentiles:
value = np.percentile(response_times, p)
print(f"P{p}: {value:.2f}ms (P(X ≤ {value:.2f}) = {p/100})")
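The visual ECDF-versus-theoretical comparison above can also be quantified. One common approach is the Kolmogorov-Smirnov test, whose statistic is the largest vertical gap between the two curves; a sketch on the same simulated data (regenerated here so the snippet stands alone):

```python
import numpy as np
from scipy import stats

# Regenerate the simulated response times used above.
np.random.seed(42)
response_times = np.random.lognormal(mean=5, sigma=0.5, size=1000)

# KS statistic = max vertical distance between the ECDF and the
# theoretical CDF; a large p-value means no evidence of mismatch.
statistic, p_value = stats.kstest(
    response_times, 'lognorm', args=(0.5, 0, np.exp(5))
)
print(f"KS statistic: {statistic:.4f}")
print(f"p-value: {p_value:.4f}")
```

Since the data were drawn from exactly this log-normal, the statistic should be small and the p-value unremarkable; against a mismatched model the statistic grows and the p-value collapses.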
Practical Applications
CDFs shine in real-world scenarios. Here’s a complete example analyzing API response times:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulated API response times (in milliseconds)
np.random.seed(123)
response_times = np.concatenate([
    np.random.gamma(shape=2, scale=50, size=950),   # Normal requests
    np.random.gamma(shape=2, scale=200, size=50)    # Slow outliers
])
# Calculate ECDF
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(response_times)
# Application 1: SLA compliance
sla_threshold = 200 # ms
compliance = ecdf(sla_threshold)
print(f"SLA Compliance: {compliance*100:.2f}% of requests under {sla_threshold}ms")
# Application 2: Percentile-based alerting
p95 = np.percentile(response_times, 95)
p99 = np.percentile(response_times, 99)
print(f"P95: {p95:.2f}ms")
print(f"P99: {p99:.2f}ms")
# Application 3: Q-Q plot for distribution checking
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ECDF plot
ax1.plot(ecdf.x, ecdf.y, linewidth=2)
ax1.axvline(sla_threshold, color='r', linestyle='--',
            label=f'SLA: {sla_threshold}ms')
ax1.axhline(compliance, color='r', linestyle='--', alpha=0.5)
ax1.set_xlabel('Response Time (ms)')
ax1.set_ylabel('Cumulative Probability')
ax1.set_title('Response Time CDF')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Q-Q plot against exponential distribution
stats.probplot(response_times, dist="expon", plot=ax2)
ax2.set_title('Q-Q Plot vs Exponential')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Implementation Best Practices
Choose your CDF calculation method based on your use case. For known distributions with closed-form CDFs, use scipy.stats—it’s optimized and numerically stable. For empirical data, statsmodels’ ECDF is efficient and handles edge cases.
Performance matters with large datasets. Here’s an optimized ECDF implementation:
import numpy as np
from timeit import timeit
def naive_ecdf(data, x):
    """Slow: O(n) per query"""
    return np.sum(data <= x) / len(data)

def optimized_ecdf(data):
    """Fast: O(n log n) preprocessing, O(log n) per query"""
    sorted_data = np.sort(data)
    def query(x):
        # Binary search for insertion point
        idx = np.searchsorted(sorted_data, x, side='right')
        return idx / len(sorted_data)
    return query
# Performance comparison
large_dataset = np.random.randn(100000)
query_points = np.random.randn(1000)
# Precompute for optimized version
ecdf_func = optimized_ecdf(large_dataset)
# Benchmark
naive_time = timeit(
    lambda: [naive_ecdf(large_dataset, x) for x in query_points],
    number=10
)
optimized_time = timeit(
    lambda: [ecdf_func(x) for x in query_points],
    number=10
)
print(f"Naive approach: {naive_time:.4f}s")
print(f"Optimized approach: {optimized_time:.4f}s")
print(f"Speedup: {naive_time/optimized_time:.2f}x")
Watch for common pitfalls: ECDFs are step functions with jumps at observed values, not smooth curves. With small samples, they're noisy approximations. If you need a smooth estimate between observed values, integrate a kernel density estimate instead. And remember that ECDF estimates assume independent, identically distributed observations; autocorrelated data requires different techniques.
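The kernel-density smoothing mentioned above can be sketched with scipy's gaussian_kde: integrating the fitted density from negative infinity up to x gives a smooth CDF estimate in place of the staircase. The sample size and distribution here are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Small noisy sample: the raw ECDF would be a coarse 30-step staircase.
np.random.seed(7)
sample = np.random.normal(loc=0, scale=1, size=30)

# Fit a Gaussian KDE; integrating it up to x yields a smooth CDF estimate.
kde = stats.gaussian_kde(sample)

def kde_cdf(x):
    """Smoothed CDF estimate: integral of the KDE from -inf up to x."""
    return kde.integrate_box_1d(-np.inf, x)

for x in (-1.0, 0.0, 1.0):
    print(f"KDE CDF({x:+.1f}) = {kde_cdf(x):.3f}  "
          f"(true normal CDF: {stats.norm.cdf(x):.3f})")
```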
The CDF is your Swiss Army knife for probability questions. Master it, and you’ll handle percentiles, hypothesis tests, and distribution comparisons with confidence.