How to Calculate Probability Density Functions
Key Insights
- Probability density functions (PDFs) describe the likelihood of continuous random variables taking specific values, with the area under the curve representing actual probabilities rather than the y-axis value itself.
- You can calculate PDFs either from theoretical distributions using mathematical formulas or estimate them from empirical data using kernel density estimation (KDE).
- Always validate your PDF calculations by ensuring the function integrates to 1 and use numerical methods to normalize custom functions into valid probability distributions.
Introduction to Probability Density Functions
A probability density function (PDF) describes the relative likelihood of a continuous random variable taking on a specific value. Unlike discrete probability mass functions where you can directly calculate P(X = x), PDFs require integration over intervals to find actual probabilities. The PDF value at any point represents probability density, not probability itself.
PDFs must satisfy two fundamental properties: they must be non-negative everywhere, and they must integrate to 1 over their entire domain. This ensures they represent valid probability distributions. The key distinction from discrete distributions is that for continuous variables, the probability of any exact value is technically zero—we can only meaningfully discuss probabilities over ranges.
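To make this concrete, the density at a point multiplied by a small interval width approximates the probability of landing in that interval. A quick sketch with the standard normal:

```python
import numpy as np
from scipy import stats

# Density at a point is not a probability: for a small width delta,
# P(x <= X <= x + delta) ~= f(x) * delta, which shrinks to 0 as delta -> 0.
x, delta = 0.5, 1e-4
density = stats.norm.pdf(x)  # density at x, can exceed 1 for narrow distributions
interval_prob = stats.norm.cdf(x + delta) - stats.norm.cdf(x)
print(density * delta)   # approximate probability over the tiny interval
print(interval_prob)     # exact probability over the same interval
```

The two printed values agree closely, and both go to zero as delta shrinks, which is why only probabilities over ranges are meaningful for continuous variables.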
Here’s a visualization of the standard normal distribution, one of the most common PDFs:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate x values
x = np.linspace(-4, 4, 1000)
# Calculate PDF values for standard normal distribution
pdf_values = stats.norm.pdf(x, loc=0, scale=1)
# Plot the PDF
plt.figure(figsize=(10, 6))
plt.plot(x, pdf_values, 'b-', linewidth=2, label='N(0,1) PDF')
plt.fill_between(x, pdf_values, alpha=0.3)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Standard Normal Distribution PDF')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
Mathematical Foundation
The relationship between PDFs and cumulative distribution functions (CDFs) is fundamental. The CDF F(x) gives the probability that X ≤ x, while the PDF f(x) is the derivative of the CDF. Conversely, integrating the PDF from negative infinity to x yields the CDF.
To find the probability that a random variable falls within an interval [a, b], we integrate the PDF:
P(a ≤ X ≤ b) = ∫[a to b] f(x)dx
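The derivative relationship stated above can also be checked numerically: a central-difference slope of the CDF should reproduce the PDF.

```python
import numpy as np
from scipy import stats

# The PDF is the derivative of the CDF: approximate the derivative
# with a central difference and compare to the closed-form PDF.
x = np.linspace(-3, 3, 7)
h = 1e-5
cdf_slope = (stats.norm.cdf(x + h) - stats.norm.cdf(x - h)) / (2 * h)
print(np.max(np.abs(cdf_slope - stats.norm.pdf(x))))  # very small discrepancy
```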
Common PDF formulas include:
- Normal distribution: f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))
- Exponential distribution: f(x) = λe^(-λx) for x ≥ 0
- Uniform distribution: f(x) = 1/(b-a) for a ≤ x ≤ b
Here’s how to calculate probabilities over ranges using numerical integration:
from scipy import integrate
from scipy import stats
# Define a normal PDF with mean=2, std=1.5
mu, sigma = 2, 1.5
pdf = lambda x: stats.norm.pdf(x, loc=mu, scale=sigma)
# Calculate P(1 <= X <= 3) using integration
prob_integrate, error = integrate.quad(pdf, 1, 3)
print(f"P(1 <= X <= 3) via integration: {prob_integrate:.4f}")
# Verify using CDF difference
prob_cdf = stats.norm.cdf(3, mu, sigma) - stats.norm.cdf(1, mu, sigma)
print(f"P(1 <= X <= 3) via CDF: {prob_cdf:.4f}")
Calculating PDFs from Theoretical Distributions
When working with known theoretical distributions, scipy.stats provides efficient implementations. You can manipulate parameters to shift, scale, and reshape distributions according to your needs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Set up the plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
x = np.linspace(-5, 15, 1000)
# Normal distribution with different parameters
for mean, std in [(0, 1), (2, 1.5), (5, 2)]:
    pdf = stats.norm.pdf(x, loc=mean, scale=std)
    axes[0].plot(x, pdf, label=f'μ={mean}, σ={std}')
axes[0].set_title('Normal Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Exponential distribution with different rates
x_exp = np.linspace(0, 10, 1000)
for rate in [0.5, 1.0, 2.0]:
    pdf = stats.expon.pdf(x_exp, scale=1/rate)
    axes[1].plot(x_exp, pdf, label=f'λ={rate}')
axes[1].set_title('Exponential Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Beta distribution with different shape parameters
x_beta = np.linspace(0, 1, 1000)
for alpha, beta in [(2, 2), (5, 2), (2, 5)]:
    pdf = stats.beta.pdf(x_beta, alpha, beta)
    axes[2].plot(x_beta, pdf, label=f'α={alpha}, β={beta}')
axes[2].set_title('Beta Distribution')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Estimating PDFs from Empirical Data
When you have sample data but don’t know the underlying distribution, kernel density estimation (KDE) provides a non-parametric way to estimate the PDF. KDE works by placing a kernel (typically Gaussian) at each data point and summing them, creating a smooth continuous estimate.
Bandwidth selection is critical—too small creates an undersmoothed, noisy estimate; too large oversmooths and loses important features. Most implementations use automatic bandwidth selection methods like Scott’s rule or Silverman’s rule.
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data from a mixture of normals
np.random.seed(42)
data = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(5, 1.5, 200)
])
# Method 1: scipy gaussian_kde
kde_scipy = gaussian_kde(data, bw_method='scott')
# Method 2: sklearn KernelDensity
kde_sklearn = KernelDensity(bandwidth=0.5, kernel='gaussian')
kde_sklearn.fit(data.reshape(-1, 1))
# Evaluate PDFs
x_eval = np.linspace(-4, 10, 1000)
pdf_scipy = kde_scipy(x_eval)
pdf_sklearn = np.exp(kde_sklearn.score_samples(x_eval.reshape(-1, 1)))
# Plot comparison
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(data, bins=30, density=True, alpha=0.5, label='Histogram')
plt.plot(x_eval, pdf_scipy, 'r-', linewidth=2, label='KDE (scipy)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram vs KDE')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(x_eval, pdf_scipy, 'r-', linewidth=2, label='scipy (auto bandwidth)')
plt.plot(x_eval, pdf_sklearn, 'b--', linewidth=2, label='sklearn (bw=0.5)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('KDE Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
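To see the under/over-smoothing trade-off directly, you can evaluate the same bimodal sample with a few manually chosen bandwidths. A sketch (the data is regenerated here so the snippet stands alone):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Regenerate the bimodal sample from above
np.random.seed(42)
data = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(5, 1.5, 200)
])

x_eval = np.linspace(-4, 10, 500)
# bw_method scales the data's standard deviation: small values undersmooth
# (noisy spikes), large values oversmooth (the two modes merge)
kdes = {bw: gaussian_kde(data, bw_method=bw) for bw in [0.05, 0.2, 1.0]}

plt.figure(figsize=(10, 5))
plt.hist(data, bins=30, density=True, alpha=0.3, label='Histogram')
for bw, kde in kdes.items():
    plt.plot(x_eval, kde(x_eval), label=f'bw_method={bw}')
plt.legend()
plt.show()
```

With bw_method=0.05 the estimate chases individual points; with 1.0 the two modes blur into one hump; 0.2 recovers the bimodal shape.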
Numerical Methods for Custom PDFs
Sometimes you need to work with custom functions that aren’t standard distributions. The key is normalizing them to ensure they integrate to 1. Here’s how to create a valid PDF from an arbitrary non-negative function:
from scipy import integrate
import numpy as np
import matplotlib.pyplot as plt
def custom_function(x):
    """Arbitrary non-negative function (not yet a valid PDF)."""
    if x < 0 or x > 10:
        return 0
    return x**2 * np.exp(-x/2)
# Vectorize for array operations
custom_vec = np.vectorize(custom_function)
# Calculate normalization constant
normalization_constant, _ = integrate.quad(custom_function, 0, 10)
print(f"Normalization constant: {normalization_constant:.4f}")
def normalized_pdf(x):
    """Normalized version - valid PDF."""
    return custom_function(x) / normalization_constant
normalized_pdf_vec = np.vectorize(normalized_pdf)
# Verify it integrates to 1
verification, _ = integrate.quad(normalized_pdf, 0, 10)
print(f"Integral of normalized PDF: {verification:.6f}")
# Plot both versions
x = np.linspace(0, 10, 1000)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, custom_vec(x), 'b-', linewidth=2)
plt.title('Original Function (Not Normalized)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(x, normalized_pdf_vec(x), 'r-', linewidth=2)
plt.fill_between(x, normalized_pdf_vec(x), alpha=0.3)
plt.title('Normalized PDF')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
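Once normalized, you can also draw samples from a custom PDF. One simple option (a sketch, not the only approach) is rejection sampling against a uniform envelope over the bounded support; the density is restated here so the snippet is self-contained:

```python
import numpy as np
from scipy import integrate

# Same custom density as above, in vectorized form
def custom_function(x):
    return np.where((x >= 0) & (x <= 10), x**2 * np.exp(-x / 2), 0.0)

Z, _ = integrate.quad(lambda t: float(custom_function(t)), 0, 10)
pdf = lambda x: custom_function(x) / Z

# Rejection sampling: propose uniformly on [0, 10] x [0, max_pdf]
# and keep the points that fall under the density curve.
rng = np.random.default_rng(0)
grid = np.linspace(0, 10, 2001)
max_pdf = pdf(grid).max()
n = 20_000
xs = rng.uniform(0, 10, n)
ys = rng.uniform(0, max_pdf, n)
samples = xs[ys < pdf(xs)]
print(len(samples), samples.mean())
```

The accepted points are exact draws from the normalized PDF; their mean converges to the density's true mean (about 5.0 for this truncated gamma-shaped curve).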
Practical Applications and Validation
Once you’ve calculated or estimated a PDF, validation is essential. Q-Q plots help assess whether data follows a theoretical distribution, while goodness-of-fit tests provide statistical evidence.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate sample data
np.random.seed(42)
sample_data = np.random.exponential(scale=2.0, size=500)
# Fit exponential distribution
# Fix loc=0 so the fit matches the pure exponential used below
fit_params = stats.expon.fit(sample_data, floc=0)
fitted_scale = fit_params[1]
print(f"Fitted exponential scale parameter: {fitted_scale:.4f}")
# Create validation plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Histogram with fitted PDF
axes[0, 0].hist(sample_data, bins=30, density=True, alpha=0.6, label='Data')
x_range = np.linspace(0, sample_data.max(), 1000)
fitted_pdf = stats.expon.pdf(x_range, scale=fitted_scale)
axes[0, 0].plot(x_range, fitted_pdf, 'r-', linewidth=2, label='Fitted PDF')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Histogram vs Fitted PDF')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Q-Q plot
# Scale only changes the slope, which probplot's fitted line absorbs,
# so no shape parameters are needed for the exponential
stats.probplot(sample_data, dist=stats.expon, plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)
# 3. CDF comparison
sorted_data = np.sort(sample_data)
empirical_cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
theoretical_cdf = stats.expon.cdf(sorted_data, scale=fitted_scale)
axes[1, 0].plot(sorted_data, empirical_cdf, 'b-', label='Empirical CDF', alpha=0.6)
axes[1, 0].plot(sorted_data, theoretical_cdf, 'r--', label='Theoretical CDF', linewidth=2)
axes[1, 0].set_xlabel('Value')
axes[1, 0].set_ylabel('Cumulative Probability')
axes[1, 0].set_title('CDF Comparison')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# 4. Sample from fitted PDF
samples_from_pdf = stats.expon.rvs(scale=fitted_scale, size=500)
axes[1, 1].hist(sample_data, bins=30, alpha=0.5, density=True, label='Original Data')
axes[1, 1].hist(samples_from_pdf, bins=30, alpha=0.5, density=True, label='Sampled from PDF')
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Original vs Sampled Data')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Kolmogorov-Smirnov test
ks_statistic, p_value = stats.kstest(sample_data, lambda x: stats.expon.cdf(x, scale=fitted_scale))
print(f"\nKolmogorov-Smirnov test:")
print(f"Statistic: {ks_statistic:.4f}, p-value: {p_value:.4f}")
Common Pitfalls and Best Practices
Numerical stability matters. When working with PDFs that have very small or large values, use log-space calculations. The logpdf methods in scipy.stats are specifically designed for this.
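For instance, deep in a distribution's tail the pdf underflows to exactly zero in double precision, while logpdf remains finite and usable:

```python
from scipy import stats

x = 40.0                      # far into the tail of N(0, 1)
print(stats.norm.pdf(x))      # underflows to 0.0
print(stats.norm.logpdf(x))   # finite: -x**2/2 - 0.5*log(2*pi)
```

This matters most when summing log-likelihoods over many points: a single underflowed pdf value of 0 would make the product of densities zero, while log-space sums stay well-behaved.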
Choose distributions wisely. Don’t default to normal distributions for everything. Consider the data’s domain (positive only? bounded?), shape (skewed? heavy-tailed?), and theoretical justification. For count data, use discrete distributions. For waiting times, consider exponential or gamma distributions.
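As a brief illustration, fitting a gamma distribution to positive, right-skewed waiting-time data (simulated here; floc=0 pins the support at zero):

```python
import numpy as np
from scipy import stats

# Simulated waiting times: positive and right-skewed, so a gamma
# is a more natural model than a normal distribution
np.random.seed(0)
waits = np.random.gamma(shape=2.0, scale=3.0, size=1000)

shape, loc, scale = stats.gamma.fit(waits, floc=0)
print(shape, scale)  # close to the generating parameters (2.0, 3.0)
```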
Bandwidth selection in KDE is crucial. The default methods work well for unimodal, roughly symmetric data, but may fail for multimodal or highly skewed distributions. Cross-validation can help select optimal bandwidth:
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.neighbors import KernelDensity
# Grid search for optimal bandwidth
bandwidths = np.logspace(-1, 1, 20)
grid = GridSearchCV(KernelDensity(), {'bandwidth': bandwidths}, cv=5)
grid.fit(data.reshape(-1, 1))
print(f"Optimal bandwidth: {grid.best_params_['bandwidth']:.4f}")
For large datasets, consider computational efficiency. KDE scales poorly beyond tens of thousands of points. Use sampling or binning strategies, or switch to parametric methods when possible.
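One such binning strategy, sketched below: pre-bin the data with a histogram and run a weighted KDE on the bin centers, collapsing hundreds of thousands of points into a fixed number of weighted locations (gaussian_kde has accepted a weights argument since SciPy 1.2).

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
big_data = rng.normal(0, 1, 500_000)

# Reduce 500k points to 512 weighted bin centers, then KDE the centers
counts, edges = np.histogram(big_data, bins=512)
centers = (edges[:-1] + edges[1:]) / 2
kde = gaussian_kde(centers, weights=counts, bw_method='scott')

x = np.linspace(-4, 4, 200)
approx = kde(x)  # close to the full KDE at a fraction of the cost
```

Evaluation cost now scales with the number of bins rather than the number of data points; the approximation error is controlled by the bin width.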
Know when to go parametric vs. non-parametric. If you have theoretical reasons to expect a specific distribution family, use parametric fitting—it’s more efficient and provides interpretable parameters. Use KDE when the distribution shape is unknown or when flexibility is more important than parameter interpretation.
Always validate your results. A PDF that looks reasonable visually might still fail statistical tests. Use multiple validation methods and understand what each test tells you about your model’s adequacy.