Log-Normal Distribution in R: Complete Guide

Key Insights

  • The log-normal distribution models positive-valued data where the logarithm follows a normal distribution—essential for analyzing income, stock returns, and file sizes where multiplicative processes dominate.
  • R’s rlnorm(), dlnorm(), plnorm(), and qlnorm() functions use log-scale parameters (meanlog, sdlog), not the actual mean and standard deviation of the data, which commonly trips up beginners.
  • Always test log-normality assumptions using QQ-plots of log-transformed data combined with Shapiro-Wilk tests—visual inspection alone isn’t sufficient for statistical rigor.

Introduction to Log-Normal Distribution

A random variable X follows a log-normal distribution if its natural logarithm ln(X) follows a normal distribution. This seemingly simple transformation has profound implications for modeling real-world phenomena where values are strictly positive and exhibit right-skewed behavior.

The log-normal distribution appears naturally in processes involving multiplicative effects. When stock returns compound over time, when individual productivity factors multiply to determine income, or when file sizes grow through successive operations, you’ll find log-normal patterns. Unlike the normal distribution’s symmetric bell curve, the log-normal distribution has a long right tail and is bounded at zero on the left.

Mathematically, if Y ~ Normal(μ, σ²), then X = exp(Y) ~ LogNormal(μ, σ²). The parameters μ and σ here represent the mean and standard deviation of the underlying normal distribution (on the log scale), not the mean and standard deviation of X itself.
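
This relationship is easy to verify directly (a quick sketch with simulated data): exponentiating normal draws yields values whose logarithms recover the original parameters, and rlnorm() does the same exponentiation internally, so with the same seed the draws should match.

```r
# Sketch: X = exp(Y) with Y ~ Normal(1, 0.5) is LogNormal(1, 0.5^2)
set.seed(42)
y <- rnorm(100000, mean = 1, sd = 0.5)
x <- exp(y)

mean(log(x))  # close to 1
sd(log(x))    # close to 0.5

# rlnorm() exponentiates normal variates internally, so the same seed
# reproduces the same draws
set.seed(42)
same_draws <- all.equal(x, rlnorm(100000, meanlog = 1, sdlog = 0.5))
```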

# Visual comparison of normal vs log-normal distributions
library(ggplot2)

x_seq <- seq(0, 10, length.out = 500)

# Normal distribution (for comparison)
normal_data <- data.frame(
  x = seq(-5, 15, length.out = 500),
  density = dnorm(seq(-5, 15, length.out = 500), mean = 5, sd = 2),
  type = "Normal"
)

# Log-normal distribution
lognormal_data <- data.frame(
  x = x_seq,
  density = dlnorm(x_seq, meanlog = 1, sdlog = 0.5),
  type = "Log-Normal"
)

combined_data <- rbind(
  normal_data[normal_data$x >= 0, ],
  lognormal_data
)

ggplot(combined_data, aes(x = x, y = density, color = type)) +
  geom_line(linewidth = 1) +
  labs(title = "Normal vs Log-Normal Distribution",
       x = "Value", y = "Density") +
  theme_minimal()

Generating Log-Normal Data in R

The rlnorm() function generates random samples from a log-normal distribution. The critical point: its parameters meanlog and sdlog represent the mean and standard deviation of the underlying normal distribution (on the log scale), not the arithmetic mean and standard deviation of the generated values.

set.seed(123)  # Reproducibility

# Generate 1000 log-normal samples
sample1 <- rlnorm(n = 1000, meanlog = 0, sdlog = 1)
sample2 <- rlnorm(n = 1000, meanlog = 1, sdlog = 0.5)
sample3 <- rlnorm(n = 1000, meanlog = 0, sdlog = 0.25)

# Create comparison histograms
par(mfrow = c(1, 3))
hist(sample1, breaks = 50, main = "meanlog=0, sdlog=1",
     xlab = "Value", col = "lightblue")
hist(sample2, breaks = 50, main = "meanlog=1, sdlog=0.5",
     xlab = "Value", col = "lightgreen")
hist(sample3, breaks = 50, main = "meanlog=0, sdlog=0.25",
     xlab = "Value", col = "lightcoral")
par(mfrow = c(1, 1))

# Verify the parameters
mean(log(sample1))  # Should be close to 0
sd(log(sample1))    # Should be close to 1

The relationship between log-scale parameters and actual distribution moments is:

  • Actual mean: exp(μ + σ²/2)
  • Actual variance: [exp(σ²) - 1] × exp(2μ + σ²)

# Calculate theoretical vs empirical moments
meanlog <- 1
sdlog <- 0.5

theoretical_mean <- exp(meanlog + sdlog^2/2)
theoretical_var <- (exp(sdlog^2) - 1) * exp(2*meanlog + sdlog^2)

cat("Theoretical mean:", theoretical_mean, "\n")
cat("Empirical mean:", mean(sample2), "\n")
cat("Theoretical variance:", theoretical_var, "\n")
cat("Empirical variance:", var(sample2), "\n")

Calculating Probabilities and Quantiles

R provides four essential functions for working with log-normal distributions: dlnorm() for density, plnorm() for cumulative probability, qlnorm() for quantiles, and rlnorm() for random generation.

# Density: probability density at specific points
x_values <- c(1, 2, 5, 10)
densities <- dlnorm(x_values, meanlog = 1, sdlog = 0.5)
data.frame(x = x_values, density = densities)

# Cumulative probability: P(X <= x)
# What's the probability a value is less than or equal to 3?
plnorm(3, meanlog = 1, sdlog = 0.5)

# What proportion of values fall between 2 and 5?
plnorm(5, meanlog = 1, sdlog = 0.5) - plnorm(2, meanlog = 1, sdlog = 0.5)

# Quantiles: inverse CDF
# What value corresponds to the 75th percentile?
qlnorm(0.75, meanlog = 1, sdlog = 0.5)

# Find multiple percentiles
percentiles <- c(0.25, 0.5, 0.75, 0.9, 0.95)
qlnorm(percentiles, meanlog = 1, sdlog = 0.5)

# Plot CDF
x_seq <- seq(0, 10, length.out = 500)
cdf_values <- plnorm(x_seq, meanlog = 1, sdlog = 0.5)

plot(x_seq, cdf_values, type = "l", lwd = 2, col = "blue",
     main = "Log-Normal CDF",
     xlab = "x", ylab = "P(X <= x)")
grid()

Fitting Log-Normal Distribution to Data

When you have empirical data and want to estimate log-normal parameters, you have two primary approaches: method of moments and maximum likelihood estimation (MLE). For the log-normal these nearly coincide: the mean and standard deviation of the log-transformed data are the maximum likelihood estimates (up to the n versus n - 1 divisor in the standard deviation), while the classical method of moments instead matches the raw-scale mean and variance.

# Generate sample data (simulating income data in thousands)
set.seed(456)
income_data <- rlnorm(500, meanlog = 3.5, sdlog = 0.6)

# Method of moments on the log scale: mean and sd of log-transformed data
# (for the log-normal this coincides with the MLE, up to the n vs n-1 divisor)
log_income <- log(income_data)
meanlog_hat <- mean(log_income)
sdlog_hat <- sd(log_income)

cat("Method of Moments:\n")
cat("meanlog =", meanlog_hat, "\n")
cat("sdlog =", sdlog_hat, "\n")

# Maximum Likelihood Estimation using MASS package
library(MASS)
fit_mle <- fitdistr(income_data, "lognormal")
print(fit_mle)

# Extract parameters and standard errors
params <- fit_mle$estimate
param_se <- fit_mle$sd

# Visual assessment of fit
hist(income_data, breaks = 50, probability = TRUE,
     main = "Fitted Log-Normal Distribution",
     xlab = "Income (thousands)", col = "lightgray")

# Overlay fitted density
x_seq <- seq(min(income_data), max(income_data), length.out = 500)
fitted_density <- dlnorm(x_seq, 
                         meanlog = params["meanlog"], 
                         sdlog = params["sdlog"])
lines(x_seq, fitted_density, col = "red", lwd = 2)

# Calculate actual mean and median from fitted parameters
fitted_mean <- exp(params["meanlog"] + params["sdlog"]^2/2)
fitted_median <- exp(params["meanlog"])

cat("\nFitted distribution statistics:\n")
cat("Mean:", fitted_mean, "\n")
cat("Median:", fitted_median, "\n")
cat("Empirical mean:", mean(income_data), "\n")
cat("Empirical median:", median(income_data), "\n")

Testing for Log-Normality

Before applying log-normal models, you must verify the assumption that your data actually follows this distribution. Combine visual diagnostics with formal statistical tests.

# QQ-plot on log-transformed data
par(mfrow = c(1, 2))

# QQ-plot of original data vs log-normal
qqnorm(log(income_data), main = "QQ-Plot: Log-Transformed Data")
qqline(log(income_data), col = "red", lwd = 2)

# Alternative: using car package for better QQ-plots
library(car)
qqPlot(log(income_data), distribution = "norm",
       main = "QQ-Plot with Confidence Bands")

par(mfrow = c(1, 1))

# Shapiro-Wilk test on log-transformed data
shapiro_test <- shapiro.test(log(income_data))
cat("Shapiro-Wilk test p-value:", shapiro_test$p.value, "\n")

if (shapiro_test$p.value > 0.05) {
  cat("Fail to reject normality of log-transformed data (consistent with log-normality)\n")
} else {
  cat("Reject normality of log-transformed data (log-normal assumption questionable)\n")
}

# Kolmogorov-Smirnov test
# (caveat: p-values are only approximate when the parameters were estimated
# from the same data; treat this as a rough diagnostic, not an exact test)
ks_test <- ks.test(income_data, "plnorm", 
                   meanlog = params["meanlog"], 
                   sdlog = params["sdlog"])
cat("Kolmogorov-Smirnov test p-value:", ks_test$p.value, "\n")

# Anderson-Darling test (more powerful for tail behavior)
library(nortest)
ad_test <- ad.test(log(income_data))
cat("Anderson-Darling test p-value:", ad_test$p.value, "\n")

Practical Applications and Examples

Let’s work through a complete analysis simulating household income distribution in a metropolitan area.

# Simulate household income data (in thousands of dollars)
set.seed(789)
n_households <- 1000
income <- rlnorm(n_households, meanlog = 3.8, sdlog = 0.7)

# Fit log-normal distribution
fit <- fitdistr(income, "lognormal")
est_meanlog <- fit$estimate["meanlog"]
est_sdlog <- fit$estimate["sdlog"]

# Calculate key statistics
actual_mean <- exp(est_meanlog + est_sdlog^2/2)
actual_median <- exp(est_meanlog)
actual_mode <- exp(est_meanlog - est_sdlog^2)

cat("Income Distribution Statistics:\n")
cat("Mean income: $", round(actual_mean, 2), "k\n", sep = "")
cat("Median income: $", round(actual_median, 2), "k\n", sep = "")
cat("Modal income: $", round(actual_mode, 2), "k\n", sep = "")

# Calculate proportion earning above certain thresholds
threshold_50k <- 1 - plnorm(50, meanlog = est_meanlog, sdlog = est_sdlog)
threshold_100k <- 1 - plnorm(100, meanlog = est_meanlog, sdlog = est_sdlog)

cat("\nProportion earning over $50k:", round(threshold_50k * 100, 1), "%\n")
cat("Proportion earning over $100k:", round(threshold_100k * 100, 1), "%\n")

# Calculate confidence interval for median income
# Using parametric bootstrap
n_boot <- 1000
boot_medians <- numeric(n_boot)

for (i in 1:n_boot) {
  boot_sample <- rlnorm(n_households, meanlog = est_meanlog, sdlog = est_sdlog)
  boot_fit <- fitdistr(boot_sample, "lognormal")
  boot_medians[i] <- exp(boot_fit$estimate["meanlog"])
}

ci_median <- quantile(boot_medians, c(0.025, 0.975))
cat("\n95% CI for median income: ($", 
    round(ci_median[1], 2), "k, $", 
    round(ci_median[2], 2), "k)\n", sep = "")

# Comprehensive visualization
par(mfrow = c(2, 2))

# Histogram with fitted density
hist(income, breaks = 50, probability = TRUE,
     main = "Income Distribution", xlab = "Income ($1000s)",
     col = "lightblue", border = "white")
x_seq <- seq(0, max(income), length.out = 500)
lines(x_seq, dlnorm(x_seq, est_meanlog, est_sdlog), 
      col = "red", lwd = 2)

# Log-scale histogram
hist(log(income), breaks = 50, probability = TRUE,
     main = "Log-Transformed Income", xlab = "Log(Income)",
     col = "lightgreen", border = "white")
curve(dnorm(x, est_meanlog, est_sdlog), add = TRUE, 
      col = "red", lwd = 2)

# Empirical CDF vs theoretical
plot(ecdf(income), main = "Empirical vs Theoretical CDF",
     xlab = "Income ($1000s)", ylab = "Cumulative Probability")
lines(x_seq, plnorm(x_seq, est_meanlog, est_sdlog), 
      col = "red", lwd = 2)

# QQ-plot
qqnorm(log(income), main = "QQ-Plot (Log-Transformed)")
qqline(log(income), col = "red", lwd = 2)

par(mfrow = c(1, 1))

Common Pitfalls and Best Practices

Parameter Confusion: The most frequent error is confusing meanlog and sdlog with the actual mean and standard deviation of your data. Always remember: these parameters describe the underlying normal distribution of the logarithm, not the log-normal distribution itself. If you need to work backward from desired mean and variance, use: meanlog = log(mean^2 / sqrt(variance + mean^2)) and sdlog = sqrt(log(1 + variance/mean^2)).
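
A quick sketch of that backward conversion, using illustrative target values; the round trip through the forward formulas should recover them exactly.

```r
# Convert a desired raw-scale mean and variance into log-scale parameters
target_mean <- 40   # illustrative target: mean of 40
target_var  <- 900  # illustrative target: variance of 900

sdlog_back   <- sqrt(log(1 + target_var / target_mean^2))
meanlog_back <- log(target_mean^2 / sqrt(target_var + target_mean^2))

# Round trip through the forward formulas recovers the targets
exp(meanlog_back + sdlog_back^2 / 2)                            # 40
(exp(sdlog_back^2) - 1) * exp(2 * meanlog_back + sdlog_back^2)  # 900
```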

Zero and Negative Values: Log-normal distributions only support positive values. If your data contains zeros or negatives, you cannot directly fit a log-normal distribution. Consider adding a small constant (location parameter) or using a different distribution family. Never blindly remove zeros—understand why they exist in your data first.
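
One way to handle a known lower bound is a shifted fit; a minimal sketch, assuming the shift constant is chosen from domain knowledge rather than estimated from the data.

```r
library(MASS)

set.seed(1)
values <- c(rep(0, 20), rlnorm(480, meanlog = 2, sdlog = 0.4))  # toy data with zeros

shift <- 0.5  # assumed constant; pick this deliberately, not by default
fit_shifted <- fitdistr(values + shift, "lognormal")

# Convert back to the original scale by subtracting the shift
median_original <- exp(fit_shifted$estimate["meanlog"]) - shift
```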

Inappropriate Applications: Don’t force log-normal distributions onto data just because it’s positive and right-skewed. Test your assumptions. Data with heavy tails might be better modeled by Pareto or other heavy-tailed distributions. Bounded data (like proportions) requires beta or truncated distributions.
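
One practical way to avoid forcing a poor fit is an information-criterion comparison. A sketch using the gamma distribution as a stand-in alternative (heavy-tailed families like the Pareto are not supported by fitdistr() and need a hand-written likelihood or another package):

```r
library(MASS)

set.seed(99)
dat <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

fit_ln  <- fitdistr(dat, "lognormal")
fit_gam <- fitdistr(dat, "gamma")

AIC(fit_ln, fit_gam)  # lower AIC indicates the better fit/complexity trade-off
```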

Sample Size Matters: Parameter estimation becomes unreliable with small samples (n < 50). At the other extreme, the Shapiro-Wilk test becomes so sensitive with large samples that it rejects normality for trivial deviations, and R's shapiro.test() only accepts between 3 and 5000 observations in any case. Always combine statistical tests with visual diagnostics and domain knowledge.

Computational Efficiency: For the log-normal specifically, the MLE has a closed form (the mean and standard deviation of the log-transformed data), so fitting stays cheap even for millions of observations. The speed concern applies mainly to distributions whose MLE requires numerical optimization; there, consider estimating initial parameters on a subsample, then refining on the full dataset.
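
As a sanity check (a sketch with simulated data), the cheap log-scale estimates and fitdistr()'s output agree closely:

```r
library(MASS)

set.seed(7)
big <- rlnorm(100000, meanlog = 2, sdlog = 0.3)

quick <- c(meanlog = mean(log(big)), sdlog = sd(log(big)))
mle   <- fitdistr(big, "lognormal")$estimate

rbind(quick, mle)  # agreement to several decimal places
```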

The log-normal distribution is a workhorse for modeling positive-valued, right-skewed data across finance, biology, and engineering. Master its parameterization, always validate your assumptions, and you’ll have a powerful tool for both descriptive statistics and predictive modeling.
