Log-Normal Distribution in Python: Complete Guide
Key Insights
- Log-normal distributions model positive-only data where the logarithm follows a normal distribution—ideal for stock prices, response times, and income data where multiplicative effects dominate
- SciPy and NumPy use different parameterizations for log-normal distributions; understanding the conversion between them prevents subtle bugs that can invalidate your analysis
- Always validate your fit with goodness-of-fit tests and Q-Q plots before drawing conclusions—visual inspection alone frequently misleads, especially in the tails
Introduction to Log-Normal Distribution
A log-normal distribution describes a random variable whose logarithm is normally distributed. If X follows a log-normal distribution, then ln(X) follows a normal distribution. This seemingly simple transformation creates a distribution with fundamentally different properties—one that’s strictly positive, right-skewed, and naturally suited to modeling multiplicative processes.
You’ll encounter log-normal distributions everywhere in practice: stock prices (returns compound multiplicatively), response times in web services, income distributions, file sizes on disk, particle sizes in aerosols, and biological measurements like cell counts. The common thread is data that can’t go negative and tends to have a long right tail.
Choose log-normal over normal when your data exhibits positive skewness, has a hard lower bound at zero, or arises from multiplicative rather than additive processes. If you’re modeling something that grows by percentages rather than fixed amounts, log-normal is likely your distribution.
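To see why multiplicative processes produce log-normal data, consider a quantity that compounds a random percentage change at each step. A minimal simulation (the growth-rate parameters here are illustrative, not calibrated to anything):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 20,000 independent "assets", each compounding 250 small random growth steps
n_assets, n_steps = 20_000, 250
growth = 1 + rng.normal(loc=0.0005, scale=0.01, size=(n_steps, n_assets))
final_values = growth.prod(axis=0)

# log(product) = sum of logs, so the CLT makes log(final_values) approximately
# normal -- which means the raw values are approximately log-normal.
print(f"Skew of raw values: {stats.skew(final_values):.3f}")           # clearly positive
print(f"Skew of log values: {stats.skew(np.log(final_values)):.3f}")  # near zero
```

Summing logs is where the normal distribution sneaks in: the central limit theorem applies to the additive log-scale process, and exponentiating back gives the log-normal.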
Mathematical Properties
The probability density function (PDF) of a log-normal distribution is:
$$f(x; \mu, \sigma) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$
Here, μ and σ are the mean and standard deviation of the underlying normal distribution (i.e., of ln(X)), not of X itself. This distinction trips up many practitioners.
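A quick sanity check makes the distinction concrete (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.5
x = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

# The raw sample mean is exp(mu + sigma**2/2), roughly 3.08 -- not mu.
print(f"mean of X:     {x.mean():.3f}")

# mu and sigma describe ln(X), so the log-transformed sample recovers them.
print(f"mean of ln(X): {np.log(x).mean():.3f}")  # close to 1.0
print(f"std of ln(X):  {np.log(x).std():.3f}")   # close to 0.5
```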
The key statistics derive from these parameters:
- Mean: exp(μ + σ²/2)
- Median: exp(μ)
- Mode: exp(μ - σ²)
- Variance: (exp(σ²) - 1) * exp(2μ + σ²)
Notice that mean > median > mode for all log-normal distributions—a direct consequence of the right skew.
```python
import numpy as np

def lognormal_statistics(mu, sigma):
    """Calculate theoretical statistics from log-normal parameters."""
    mean = np.exp(mu + sigma**2 / 2)
    median = np.exp(mu)
    mode = np.exp(mu - sigma**2)
    variance = (np.exp(sigma**2) - 1) * np.exp(2*mu + sigma**2)
    # Skewness is always positive for log-normal
    skewness = (np.exp(sigma**2) + 2) * np.sqrt(np.exp(sigma**2) - 1)
    return {
        'mean': mean,
        'median': median,
        'mode': mode,
        'variance': variance,
        'std': np.sqrt(variance),
        'skewness': skewness,
    }

# Example: μ=0, σ=0.5
stats = lognormal_statistics(mu=0, sigma=0.5)
print(f"Mean: {stats['mean']:.4f}")
print(f"Median: {stats['median']:.4f}")
print(f"Mode: {stats['mode']:.4f}")
print(f"Std Dev: {stats['std']:.4f}")
```
Generating Log-Normal Data in Python
NumPy and SciPy both provide log-normal generators, but they use different parameterizations. This inconsistency causes endless confusion.
NumPy uses the natural parameters directly:

```python
numpy.random.lognormal(mean=0, sigma=1, size=None)
```

SciPy uses a shape-location-scale parameterization:

```python
scipy.stats.lognorm(s, loc=0, scale=1)
```

The relationship: SciPy's `s` equals NumPy's `sigma`, and SciPy's `scale` equals `exp(mu)`. The `loc` parameter shifts the distribution away from zero; for a standard log-normal, leave it at 0.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Parameters for the underlying normal distribution
mu = 1.5     # mean of ln(X)
sigma = 0.8  # std of ln(X)

# Generate samples with NumPy
np.random.seed(42)
samples_numpy = np.random.lognormal(mean=mu, sigma=sigma, size=10000)

# Generate samples with SciPy (note the parameterization!)
samples_scipy = stats.lognorm.rvs(s=sigma, scale=np.exp(mu), size=10000, random_state=42)

# Verify they produce equivalent distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(samples_numpy, bins=50, density=True, alpha=0.7, label='NumPy')
axes[0].set_title('NumPy lognormal')
axes[0].set_xlabel('Value')
axes[1].hist(samples_scipy, bins=50, density=True, alpha=0.7, label='SciPy', color='orange')
axes[1].set_title('SciPy lognorm')
axes[1].set_xlabel('Value')
plt.tight_layout()
plt.savefig('lognormal_comparison.png', dpi=150)
plt.show()

print(f"NumPy - Mean: {samples_numpy.mean():.3f}, Std: {samples_numpy.std():.3f}")
print(f"SciPy - Mean: {samples_scipy.mean():.3f}, Std: {samples_scipy.std():.3f}")
```
Fitting Log-Normal Distribution to Data
When you have observed data and want to estimate the parameters, SciPy’s fit() method performs maximum likelihood estimation:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate some "real" data (in practice, this is your observed data)
np.random.seed(123)
true_mu, true_sigma = 2.0, 0.6
observed_data = np.random.lognormal(mean=true_mu, sigma=true_sigma, size=500)

# Fit using SciPy
# Returns (shape, loc, scale) where shape=sigma, scale=exp(mu)
shape, loc, scale = stats.lognorm.fit(observed_data, floc=0)  # fix loc=0

# Convert back to natural parameters
fitted_sigma = shape
fitted_mu = np.log(scale)

print(f"True parameters: μ={true_mu}, σ={true_sigma}")
print(f"Fitted parameters: μ={fitted_mu:.4f}, σ={fitted_sigma:.4f}")

# Visualize the fit
x = np.linspace(0.01, observed_data.max() * 1.1, 200)
pdf_fitted = stats.lognorm.pdf(x, s=fitted_sigma, scale=np.exp(fitted_mu))

plt.figure(figsize=(10, 5))
plt.hist(observed_data, bins=40, density=True, alpha=0.7, label='Observed data')
plt.plot(x, pdf_fitted, 'r-', lw=2, label=f'Fitted PDF (μ={fitted_mu:.2f}, σ={fitted_sigma:.2f})')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.title('Log-Normal Fit to Observed Data')
plt.savefig('lognormal_fit.png', dpi=150)
plt.show()
```
Handling zeros and negatives: Log-normal distributions support only positive values. If your data contains zeros, you have options: add a small constant before fitting (crude but common), use a zero-inflated model that treats the zeros as a separate component, or reconsider whether log-normal is appropriate. Negative values are a stronger signal that log-normal is the wrong model, unless they arise from a known offset you can subtract out first.
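A sketch of the first two options, on toy data with zeros mixed in (the shift constant and the zero-inflated split are modeling choices, not fixed recipes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Toy data: mostly log-normal, with roughly 5% exact zeros mixed in
data = rng.lognormal(mean=1.0, sigma=0.5, size=1000)
data[rng.random(1000) < 0.05] = 0.0

# Option 1: shift by a small constant before fitting (crude but common).
# The choice of epsilon affects the fit, so report it with your results.
eps = data[data > 0].min() / 2
shape, loc, scale = stats.lognorm.fit(data + eps, floc=0)

# Option 2: zero-inflated model -- fit the positive part separately and
# track the zero probability as its own parameter.
p_zero = (data == 0).mean()
shape_pos, _, scale_pos = stats.lognorm.fit(data[data > 0], floc=0)
print(f"P(zero) = {p_zero:.3f}, "
      f"positive part: mu={np.log(scale_pos):.3f}, sigma={shape_pos:.3f}")
```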
Statistical Analysis and Testing
Never trust a fit without validation. The Kolmogorov-Smirnov test and Q-Q plots are your primary tools. One caveat: when the parameters were estimated from the same data you are testing, the standard K-S p-value is optimistic (the Lilliefors effect), so treat it as a rough screen rather than a definitive verdict.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate test data
np.random.seed(456)
data = np.random.lognormal(mean=1.0, sigma=0.5, size=300)

# Fit the distribution
shape, loc, scale = stats.lognorm.fit(data, floc=0)

# Kolmogorov-Smirnov test
ks_stat, ks_pvalue = stats.kstest(data, 'lognorm', args=(shape, loc, scale))
print(f"K-S Statistic: {ks_stat:.4f}")
print(f"K-S p-value: {ks_pvalue:.4f}")

if ks_pvalue > 0.05:
    print("Fail to reject null hypothesis: data is consistent with log-normal")
else:
    print("Reject null hypothesis: data may not be log-normal")

# Anderson-Darling test (more sensitive to the tails):
# transform to log-space and test for normality there
log_data = np.log(data)
ad_result = stats.anderson(log_data, dist='norm')
print(f"\nAnderson-Darling statistic: {ad_result.statistic:.4f}")
print(f"Critical values: {ad_result.critical_values}")

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Q-Q plot of the log-transformed data against a normal distribution
stats.probplot(log_data, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot (log-transformed data vs normal)')

# Visual comparison of empirical vs theoretical CDF
sorted_data = np.sort(data)
empirical_cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
theoretical_cdf = stats.lognorm.cdf(sorted_data, s=shape, loc=loc, scale=scale)
axes[1].plot(sorted_data, empirical_cdf, 'b-', label='Empirical CDF')
axes[1].plot(sorted_data, theoretical_cdf, 'r--', label='Theoretical CDF')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Cumulative Probability')
axes[1].legend()
axes[1].set_title('CDF Comparison')

plt.tight_layout()
plt.savefig('lognormal_validation.png', dpi=150)
plt.show()
```
Working with PDF, CDF, and Quantiles
Once you’ve validated your fit, you can compute probabilities and percentiles:
```python
import numpy as np
from scipy import stats
from scipy import integrate

# Define our fitted distribution
mu, sigma = 2.0, 0.5
dist = stats.lognorm(s=sigma, scale=np.exp(mu))

# Probability that X exceeds a threshold
threshold = 15
prob_exceed = 1 - dist.cdf(threshold)  # or dist.sf(threshold)
print(f"P(X > {threshold}) = {prob_exceed:.4f}")

# Find the 95th percentile
p95 = dist.ppf(0.95)
print(f"95th percentile: {p95:.4f}")

# Central 90% interval of the distribution
lower = dist.ppf(0.05)
upper = dist.ppf(0.95)
print(f"Central 90% interval: [{lower:.4f}, {upper:.4f}]")

# Expected value above a threshold (conditional mean)
# E[X | X > threshold] via numerical integration
def conditional_mean_above(dist, threshold):
    numerator, _ = integrate.quad(lambda x: x * dist.pdf(x), threshold, np.inf)
    denominator = dist.sf(threshold)
    return numerator / denominator

cond_mean = conditional_mean_above(dist, threshold)
print(f"E[X | X > {threshold}] = {cond_mean:.4f}")
```
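For the log-normal specifically, this conditional mean also has a closed form, which makes a useful cross-check on the numerical integration:

$$E[X \mid X > t] = e^{\mu + \sigma^2/2} \cdot \frac{\Phi\left((\mu + \sigma^2 - \ln t)/\sigma\right)}{\Phi\left((\mu - \ln t)/\sigma\right)}$$

where Φ is the standard normal CDF. Evaluated at the same parameters as above:

```python
import numpy as np
from scipy import stats

mu, sigma, threshold = 2.0, 0.5, 15

# Closed form: E[X | X > t] = exp(mu + sigma^2/2)
#   * Phi((mu + sigma^2 - ln t)/sigma) / Phi((mu - ln t)/sigma)
ln_t = np.log(threshold)
closed_form = (
    np.exp(mu + sigma**2 / 2)
    * stats.norm.cdf((mu + sigma**2 - ln_t) / sigma)
    / stats.norm.cdf((mu - ln_t) / sigma)
)
print(f"Closed-form E[X | X > {threshold}] = {closed_form:.4f}")
```

The two approaches should agree to numerical precision; the closed form is faster and avoids integration error in the far tail.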
Practical Application: Modeling API Response Times
Let’s work through a complete example modeling API response times—a classic log-normal use case.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate realistic API response time data (milliseconds)
# In practice, you'd load this from your monitoring system
np.random.seed(789)
response_times = np.concatenate([
    np.random.lognormal(mean=4.5, sigma=0.6, size=950),  # Normal requests
    np.random.lognormal(mean=6.0, sigma=0.3, size=50),   # Some slow requests
])
np.random.shuffle(response_times)

# Step 1: Exploratory analysis
print("=== Exploratory Statistics ===")
print(f"Sample size: {len(response_times)}")
print(f"Mean: {response_times.mean():.2f} ms")
print(f"Median: {np.median(response_times):.2f} ms")
print(f"Std Dev: {response_times.std():.2f} ms")
print(f"Min: {response_times.min():.2f} ms, Max: {response_times.max():.2f} ms")
print(f"Skewness: {stats.skew(response_times):.2f}")

# Step 2: Fit log-normal distribution
shape, loc, scale = stats.lognorm.fit(response_times, floc=0)
fitted_mu = np.log(scale)
fitted_sigma = shape
print("\n=== Fitted Parameters ===")
print(f"μ (log-space): {fitted_mu:.4f}")
print(f"σ (log-space): {fitted_sigma:.4f}")

# Step 3: Validate fit
ks_stat, ks_pvalue = stats.kstest(response_times, 'lognorm', args=(shape, loc, scale))
print("\n=== Goodness of Fit ===")
print(f"K-S p-value: {ks_pvalue:.4f}")

# Step 4: Compute SLA metrics
dist = stats.lognorm(s=shape, loc=loc, scale=scale)
sla_threshold = 200  # 200ms SLA
prob_sla_breach = dist.sf(sla_threshold)
p99 = dist.ppf(0.99)
p999 = dist.ppf(0.999)
print("\n=== SLA Analysis ===")
print(f"P(response > {sla_threshold}ms): {prob_sla_breach:.4%}")
print(f"99th percentile: {p99:.2f} ms")
print(f"99.9th percentile: {p999:.2f} ms")

# Step 5: Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram with fitted PDF
x = np.linspace(0.1, response_times.max(), 200)
axes[0, 0].hist(response_times, bins=50, density=True, alpha=0.7, label='Observed')
axes[0, 0].plot(x, dist.pdf(x), 'r-', lw=2, label='Fitted log-normal')
axes[0, 0].axvline(sla_threshold, color='g', linestyle='--', label=f'SLA ({sla_threshold}ms)')
axes[0, 0].set_xlabel('Response Time (ms)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].set_title('Response Time Distribution')

# CDF with percentile markers
axes[0, 1].plot(x, dist.cdf(x), 'b-', lw=2)
axes[0, 1].axhline(0.99, color='orange', linestyle='--', alpha=0.7)
axes[0, 1].axvline(p99, color='orange', linestyle='--', alpha=0.7, label=f'P99: {p99:.0f}ms')
axes[0, 1].set_xlabel('Response Time (ms)')
axes[0, 1].set_ylabel('Cumulative Probability')
axes[0, 1].legend()
axes[0, 1].set_title('Cumulative Distribution Function')

# Q-Q plot
stats.probplot(np.log(response_times), dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot (log-transformed)')

# Log-scale histogram (should look normal)
axes[1, 1].hist(np.log(response_times), bins=50, density=True, alpha=0.7)
axes[1, 1].set_xlabel('log(Response Time)')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Log-Transformed Data (should be normal)')

plt.tight_layout()
plt.savefig('response_time_analysis.png', dpi=150)
plt.show()
```
This workflow—explore, fit, validate, analyze—applies to any log-normal modeling task. The key is validating your assumptions before drawing conclusions. A poor fit in the tails can lead to dramatically wrong SLA predictions, especially at high percentiles where the business impact is greatest.
Remember: the log-normal distribution is a model, not reality. Use it when it fits your data well enough for your purposes, and always communicate the uncertainty in your estimates.
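One way to communicate that uncertainty is a bootstrap interval on the fitted parameters. A minimal sketch (the sample size, seed, and resample count are arbitrary; it exploits the fact that with `loc` fixed at 0 the log-normal MLE is simply the mean and standard deviation of the logs):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.0, sigma=0.5, size=400)

# Refit on resampled data to see how much the estimates move.
# With loc fixed at 0, each "refit" is just mean/std of the logs.
n_boot = 1000
boot_mu = np.empty(n_boot)
boot_sigma = np.empty(n_boot)
for i in range(n_boot):
    logs = np.log(rng.choice(data, size=len(data), replace=True))
    boot_mu[i] = logs.mean()
    boot_sigma[i] = logs.std()

mu_lo, mu_hi = np.percentile(boot_mu, [2.5, 97.5])
sig_lo, sig_hi = np.percentile(boot_sigma, [2.5, 97.5])
print(f"μ: 95% bootstrap CI [{mu_lo:.3f}, {mu_hi:.3f}]")
print(f"σ: 95% bootstrap CI [{sig_lo:.3f}, {sig_hi:.3f}]")
```

Reporting an interval rather than a point estimate makes it obvious when a downstream quantity, such as a P99 prediction, rests on shaky parameter estimates.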