Gamma Distribution in R: Complete Guide
Key Insights
- The gamma distribution’s flexibility comes from its two parameters—shape (α) and rate (β)—making it ideal for modeling positive continuous data like wait times, insurance claims, and rainfall amounts.
- R provides four core functions (dgamma(), pgamma(), qgamma(), rgamma()) that handle density, cumulative probability, quantiles, and random generation, but you must understand the rate vs. scale parameterization to avoid costly mistakes.
- Always validate your fitted gamma distribution with goodness-of-fit tests and visual diagnostics before using it for predictions or decision-making.
Introduction to the Gamma Distribution
The gamma distribution is a two-parameter family of continuous probability distributions defined over positive real numbers. It’s characterized by a shape parameter α (alpha) and a rate parameter β (beta), with the probability density function:
$$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$$
where Γ(α) is the gamma function. The shape parameter controls the distribution’s form—values less than 1 produce a J-shaped curve, while values greater than 1 create a bell-like shape with varying skewness. The rate parameter controls how quickly the distribution decays.
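As a sanity check, the density formula above can be evaluated by hand and compared with R's built-in dgamma(); the parameter values below are chosen purely for illustration.

```r
# Evaluate the gamma PDF directly from the formula and compare with dgamma()
alpha <- 3; beta <- 1; x <- 2
manual  <- beta^alpha / gamma(alpha) * x^(alpha - 1) * exp(-beta * x)
builtin <- dgamma(x, shape = alpha, rate = beta)
c(manual = manual, builtin = builtin)  # both ~0.2706706
```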
Real-world applications are everywhere. Insurance companies use gamma distributions to model claim amounts. Hydrologists model rainfall accumulation. Operations researchers analyze wait times in queuing systems. Reliability engineers study time-to-failure data. The gamma distribution works well whenever you have positive, right-skewed continuous data.
The gamma family includes two important special cases. When α = 1, you get the exponential distribution—useful for modeling time between events in a Poisson process. When α = k/2 and β = 1/2 (where k is a positive integer), you get the chi-squared distribution with k degrees of freedom. Understanding these relationships helps you recognize when gamma is the right choice.
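Both identities are easy to verify numerically; the values of x and k below are arbitrary.

```r
# Numerical check of the two special cases
x <- 1.5
# shape = 1 reduces to the exponential distribution
exp_diff <- dgamma(x, shape = 1, rate = 0.5) - dexp(x, rate = 0.5)
# shape = k/2, rate = 1/2 reduces to chi-squared with k degrees of freedom
k <- 4
chisq_diff <- dgamma(x, shape = k / 2, rate = 1 / 2) - dchisq(x, df = k)
c(exp_diff, chisq_diff)  # both effectively zero
```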
Working with Gamma Functions in R
R provides four built-in functions for working with gamma distributions:
| Function | Purpose |
|---|---|
| dgamma() | Probability density function (PDF) |
| pgamma() | Cumulative distribution function (CDF) |
| qgamma() | Quantile function (inverse CDF) |
| rgamma() | Random number generation |
Here’s the critical detail that trips up many R users: R uses shape and rate as default parameters, but also accepts scale (where scale = 1/rate). Never specify both rate and scale—R will throw an error.
# Basic syntax demonstration
x <- 2
# Density at x=2 with shape=3, rate=1
dgamma(x, shape = 3, rate = 1)
# [1] 0.2706706
# Same result using scale
dgamma(x, shape = 3, scale = 1)
# [1] 0.2706706
# Cumulative probability P(X <= 2)
pgamma(x, shape = 3, rate = 1)
# [1] 0.3233236
# 90th percentile
qgamma(0.9, shape = 3, rate = 1)
# [1] 5.322057
# Generate 5 random samples
set.seed(42)
rgamma(5, shape = 3, rate = 1)
# [1] 2.624531 3.721239 1.839441 2.609014 4.147079
Probability Density and Cumulative Distribution
The density function dgamma() returns the height of the PDF at a given point. This isn’t a probability itself—for continuous distributions, point probabilities are zero—but it’s useful for comparing relative likelihoods and for visualization.
The cumulative distribution function pgamma() gives you actual probabilities. It returns P(X ≤ x), the probability that a random variable falls at or below a specified value. Set lower.tail = FALSE to get P(X > x) instead.
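The relationship between the two functions is worth confirming once: the CDF is the integral of the PDF, and lower.tail = FALSE simply returns the complement. The parameter values here are illustrative.

```r
# The CDF is the integral of the density, so these two agree
integrate(dgamma, lower = 0, upper = 2, shape = 3, rate = 1)$value  # ~0.3233236
pgamma(2, shape = 3, rate = 1)                                      # [1] 0.3233236
# Upper-tail probability P(X > 2)
pgamma(2, shape = 3, rate = 1, lower.tail = FALSE)                  # [1] 0.6766764
```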
library(ggplot2)
# Create data for plotting
x_vals <- seq(0, 15, length.out = 200)
# Different parameter combinations
params <- list(
list(shape = 1, rate = 0.5, label = "α=1, β=0.5 (Exponential)"),
list(shape = 2, rate = 0.5, label = "α=2, β=0.5"),
list(shape = 3, rate = 1, label = "α=3, β=1"),
list(shape = 5, rate = 1, label = "α=5, β=1"),
list(shape = 9, rate = 2, label = "α=9, β=2")
)
# Build data frame for plotting
plot_data <- do.call(rbind, lapply(params, function(p) {
data.frame(
x = x_vals,
density = dgamma(x_vals, shape = p$shape, rate = p$rate),
params = p$label
)
}))
# Create the plot
ggplot(plot_data, aes(x = x, y = density, color = params)) +
geom_line(linewidth = 1) +
labs(
title = "Gamma Distribution PDF with Varying Parameters",
x = "x",
y = "Density",
color = "Parameters"
) +
theme_minimal() +
theme(legend.position = "bottom") +
guides(color = guide_legend(ncol = 2))
Notice how shape = 1 produces the exponential distribution’s characteristic decay. As shape increases, the distribution becomes more symmetric and bell-shaped, with the mode shifting rightward.
Quantiles and Random Number Generation
The quantile function qgamma() answers the inverse question: given a probability p, what value x satisfies P(X ≤ x) = p? This is essential for setting thresholds, calculating confidence intervals, and risk analysis.
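For instance, a central 95% interval comes directly from two quantile calls; the shape and rate values below are illustrative.

```r
# Central 95% interval for Gamma(shape = 3, rate = 1)
interval <- qgamma(c(0.025, 0.975), shape = 3, rate = 1)
interval
# Check: the interval covers 95% of the probability mass
diff(pgamma(interval, shape = 3, rate = 1))  # [1] 0.95
```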
Random number generation with rgamma() enables Monte Carlo simulation. You can model uncertainty, test statistical procedures, and validate your understanding of the distribution.
# Insurance claim simulation
set.seed(123)
# Parameters estimated from historical data
claim_shape <- 2.5
claim_rate <- 0.001 # Claims in dollars
# Simulate 10,000 claims
n_claims <- 10000
simulated_claims <- rgamma(n_claims, shape = claim_shape, rate = claim_rate)
# Summary statistics
cat("Simulated Claims Summary:\n")
cat("Mean claim:", round(mean(simulated_claims), 2), "\n")
cat("Median claim:", round(median(simulated_claims), 2), "\n")
cat("95th percentile:", round(quantile(simulated_claims, 0.95), 2), "\n")
# Compare empirical vs theoretical
theoretical_mean <- claim_shape / claim_rate
theoretical_var <- claim_shape / claim_rate^2
cat("\nTheoretical vs Empirical:\n")
cat("Mean - Theoretical:", theoretical_mean, "Empirical:", round(mean(simulated_claims), 2), "\n")
cat("Variance - Theoretical:", theoretical_var, "Empirical:", round(var(simulated_claims), 2), "\n")
# Visualize comparison
empirical_df <- data.frame(claims = simulated_claims)
ggplot(empirical_df, aes(x = claims)) +
geom_histogram(aes(y = after_stat(density)), bins = 50,
fill = "steelblue", alpha = 0.7) +
stat_function(fun = dgamma, args = list(shape = claim_shape, rate = claim_rate),
color = "red", linewidth = 1) +
labs(
title = "Simulated Insurance Claims vs Theoretical Gamma",
x = "Claim Amount ($)",
y = "Density"
) +
theme_minimal()
Parameter Estimation from Data
When you have observed data and want to fit a gamma distribution, you have two primary approaches: method of moments and maximum likelihood estimation.
Method of moments uses sample statistics to solve for parameters. For the gamma distribution:
$$\hat{\alpha} = \frac{\bar{x}^2}{s^2}, \quad \hat{\beta} = \frac{\bar{x}}{s^2}$$
Maximum likelihood estimation (MLE) finds parameters that maximize the probability of observing your data. The fitdistr() function from the MASS package handles this efficiently.
library(MASS)
# Generate sample data (pretend this is real observed data)
set.seed(456)
observed_data <- rgamma(500, shape = 4, rate = 0.8)
# Method of moments estimation
mom_alpha <- mean(observed_data)^2 / var(observed_data)
mom_beta <- mean(observed_data) / var(observed_data)
cat("Method of Moments Estimates:\n")
cat("Shape (α):", round(mom_alpha, 3), "\n")
cat("Rate (β):", round(mom_beta, 3), "\n\n")
# Maximum likelihood estimation
mle_fit <- fitdistr(observed_data, "gamma")
cat("MLE Estimates:\n")
print(mle_fit)
# Extract parameters
mle_shape <- mle_fit$estimate["shape"]
mle_rate <- mle_fit$estimate["rate"]
# Compare estimates
comparison <- data.frame(
Method = c("True", "MoM", "MLE"),
Shape = c(4, mom_alpha, mle_shape),
Rate = c(0.8, mom_beta, mle_rate)
)
print(round(comparison, 3))
MLE generally produces more efficient estimates (lower variance) than method of moments, especially for smaller samples. Use MLE unless you have a specific reason not to.
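The efficiency claim can be checked with a small simulation: fit many samples both ways and compare the spread of the shape estimates. This is a sketch (true parameters, sample size, and replication count are arbitrary), and it assumes the MASS package is installed.

```r
library(MASS)

# Repeatedly draw gamma samples and estimate shape by MoM and MLE
set.seed(99)
ests <- replicate(200, {
  x   <- rgamma(50, shape = 4, rate = 0.8)
  mom <- mean(x)^2 / var(x)
  mle <- suppressWarnings(fitdistr(x, "gamma"))$estimate["shape"]
  c(mom = mom, mle = unname(mle))
})
apply(ests, 1, sd)  # MLE typically shows the smaller spread
```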
Practical Application: Complete Workflow
Let’s work through a complete analysis: modeling customer service wait times at a call center. We’ll load data, fit the distribution, validate the fit, and make predictions.
library(MASS)
library(ggplot2)
# Simulate realistic wait time data (in minutes)
set.seed(789)
wait_times <- rgamma(200, shape = 2.2, rate = 0.4) +
rnorm(200, mean = 0, sd = 0.3) # Add some noise
wait_times <- pmax(wait_times, 0.1) # Ensure positive values
# Step 1: Exploratory analysis
cat("Wait Time Summary:\n")
summary(wait_times)
# Step 2: Fit gamma distribution
gamma_fit <- fitdistr(wait_times, "gamma")
fitted_shape <- gamma_fit$estimate["shape"]
fitted_rate <- gamma_fit$estimate["rate"]
cat("\nFitted Parameters:\n")
cat("Shape:", round(fitted_shape, 3), "\n")
cat("Rate:", round(fitted_rate, 3), "\n")
cat("Mean wait time:", round(fitted_shape/fitted_rate, 2), "minutes\n")
# Step 3: Goodness-of-fit test (Kolmogorov-Smirnov)
# Caveat: the KS p-value is only approximate when the parameters are
# estimated from the same data; a parametric bootstrap gives a sharper test
ks_result <- ks.test(wait_times, "pgamma",
shape = fitted_shape,
rate = fitted_rate)
cat("\nKolmogorov-Smirnov Test:\n")
cat("D statistic:", round(ks_result$statistic, 4), "\n")
cat("p-value:", round(ks_result$p.value, 4), "\n")
cat("Conclusion:", ifelse(ks_result$p.value > 0.05,
"Fail to reject H0 - Gamma fits well",
"Reject H0 - Consider alternative distribution"), "\n")
# Step 4: Visual validation with Q-Q plot
theoretical_quantiles <- qgamma(ppoints(length(wait_times)),
shape = fitted_shape,
rate = fitted_rate)
sample_quantiles <- sort(wait_times)
qq_df <- data.frame(
theoretical = theoretical_quantiles,
sample = sample_quantiles
)
ggplot(qq_df, aes(x = theoretical, y = sample)) +
geom_point(alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
labs(
title = "Q-Q Plot: Wait Times vs Fitted Gamma",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()
# Step 5: Business insights
cat("\nBusiness Insights:\n")
cat("Probability wait > 10 min:",
round(1 - pgamma(10, shape = fitted_shape, rate = fitted_rate), 3), "\n")
cat("90% of customers wait less than:",
round(qgamma(0.9, shape = fitted_shape, rate = fitted_rate), 2), "minutes\n")
Summary and Best Practices
Choose the gamma distribution when your data is continuous, strictly positive, and right-skewed. It’s more flexible than the exponential (which is just gamma with shape = 1), and in many applications it competes directly with the Weibull; when both are plausible, fit each and compare the goodness of fit.
Common pitfalls to avoid:
- Parameter confusion: Always check whether documentation uses rate or scale. R defaults to rate, but other software (and many textbooks) use scale.
- Zero or negative values: The gamma distribution has support only for x > 0. Clean your data or consider a shifted gamma if zeros are meaningful.
- Ignoring validation: A good fit on training data doesn’t guarantee predictive accuracy. Always test with held-out data and formal goodness-of-fit tests.
- Over-relying on MLE: With small samples (n < 30), MLE can be unstable. Consider Bayesian methods or bootstrap confidence intervals.
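A bootstrap confidence interval for a fitted parameter can be sketched in a few lines. This is a minimal nonparametric version, assuming the MASS package is installed; the sample here is simulated purely for illustration.

```r
library(MASS)

# Small illustrative sample (stand-in for real observed data)
set.seed(1)
observed_data <- rgamma(25, shape = 4, rate = 0.8)

# Resample with replacement and refit to get a bootstrap distribution
boot_shapes <- replicate(200, {
  resample <- sample(observed_data, replace = TRUE)
  suppressWarnings(fitdistr(resample, "gamma"))$estimate["shape"]
})

# Percentile confidence interval for the shape parameter
quantile(boot_shapes, c(0.025, 0.975))
```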
For more advanced work, explore the fitdistrplus package for comprehensive distribution fitting, actuar for actuarial applications, and gamlss for regression models with gamma-distributed responses. The gamma distribution is a workhorse—master it, and you’ll have a powerful tool for modeling real-world phenomena.