Negative Binomial Distribution in R: Complete Guide
Key Insights
- The negative binomial distribution handles overdispersed count data where variance exceeds the mean, making it superior to Poisson for real-world scenarios like customer transactions or defect counts
- R provides four core functions (dnbinom, pnbinom, qnbinom, rnbinom) plus MASS::glm.nb() for regression modeling, with two equivalent parameterizations that can confuse beginners
- Always test for overdispersion before choosing between Poisson and negative binomial models: a dispersion parameter significantly greater than 1 indicates the negative binomial is appropriate
Introduction to Negative Binomial Distribution
The negative binomial distribution models count data with inherent variability that exceeds simple random occurrence. Unlike the Poisson distribution, which assumes mean equals variance, the negative binomial allows variance to exceed the mean—a phenomenon called overdispersion that appears constantly in real data.
Two equivalent framings exist: the number of failures before achieving r successes (the convention R uses), or the total number of trials needed to reach r successes. The two differ only by a shift of r but frame problems differently.
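A quick sketch makes the relationship concrete. R's functions count failures, and the "total trials" framing just shifts every outcome by r:

```r
# R counts failures; total trials = failures + r (here r = 3, p = 0.4)
r <- 3; p <- 0.4
k <- 0:5                                  # failures before the 3rd success
probs <- dnbinom(k, size = r, prob = p)
data.frame(failures = k, total_trials = k + r, probability = round(probs, 4))
```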
Real-world applications abound. Insurance companies use it to model claim frequencies where some policyholders file multiple claims while most file none. Epidemiologists track disease outbreaks where case counts vary more than simple random spread predicts. E-commerce analysts model purchase frequencies where customer behavior clusters rather than distributes evenly. Manufacturing quality control uses it for defect counts when defects cluster due to batch effects or equipment degradation.
The distribution’s flexibility in handling overdispersion makes it indispensable for practical data analysis where Poisson assumptions fail.
Mathematical Foundation and Parameters
The negative binomial distribution has two parameters: r (size or number of successes) and p (probability of success on each trial). The probability mass function is:
P(X = k) = C(k+r-1, k) × p^r × (1-p)^k
where k represents the number of failures.
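The formula can be checked directly against R's dnbinom():

```r
# Verify P(X = k) = C(k+r-1, k) * p^r * (1-p)^k against dnbinom()
r <- 3; p <- 0.4
k <- 0:8
manual <- choose(k + r - 1, k) * p^r * (1 - p)^k
all.equal(manual, dnbinom(k, size = r, prob = p))   # TRUE
```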
Key properties:
- Mean: μ = r(1-p)/p
- Variance: σ² = r(1-p)/p²
The critical insight: variance always exceeds the mean whenever r is finite. In terms of the mean, the variance equals μ(1 + μ/r), so the variance-to-mean ratio is 1 + μ/r, showing that overdispersion increases as r decreases.
# Demonstrate mean-variance relationship
r_vals <- c(1, 5, 10, 50)
p <- 0.3
for (r in r_vals) {
  mean_val <- r * (1 - p) / p
  var_val <- r * (1 - p) / p^2
  ratio <- var_val / mean_val
  cat(sprintf("r=%2d: mean=%.2f, var=%.2f, var/mean=%.2f\n",
              r, mean_val, var_val, ratio))
}
Output shows that as r increases, the variance-to-mean ratio approaches 1 (Poisson limit), but with small r, substantial overdispersion exists.
Compare to Poisson: when your count data shows variance roughly equal to the mean, use Poisson. When variance substantially exceeds mean, negative binomial is appropriate. This single criterion guides most model selection decisions.
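A small simulated comparison illustrates the criterion, drawing one Poisson and one negative binomial sample with the same mean:

```r
# Same mean (4), different dispersion
set.seed(1)
x_pois <- rpois(10000, lambda = 4)
x_nb   <- rnbinom(10000, size = 2, mu = 4)
c(poisson = var(x_pois) / mean(x_pois),   # close to 1
  negbin  = var(x_nb)  / mean(x_nb))      # close to 1 + mu/size = 3
```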
Core R Functions for Negative Binomial Distribution
R implements the four standard distribution functions for the nbinom family. Note that R accepts two parameterizations: size (r) combined with either prob (p) or mu (the mean). Mixing them up is a common source of confusion.
# dnbinom: Probability mass function
# What's the probability of exactly 5 failures before 3 successes (p=0.4)?
dnbinom(x = 5, size = 3, prob = 0.4)
# pnbinom: Cumulative distribution function
# Probability of 5 or fewer failures
pnbinom(q = 5, size = 3, prob = 0.4)
# qnbinom: Quantile function
# Smallest k such that P(X <= k) >= 0.90
qnbinom(p = 0.90, size = 3, prob = 0.4)
# rnbinom: Random generation
# Generate 1000 random samples
samples <- rnbinom(n = 1000, size = 3, prob = 0.4)
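All four functions also accept the mu parameterization. Since mu = size * (1 - prob) / prob, the two forms describe exactly the same distribution:

```r
# size/prob and size/mu give identical probabilities
size <- 3; prob <- 0.4
mu <- size * (1 - prob) / prob            # mean = 4.5
all.equal(dnbinom(0:10, size = size, prob = prob),
          dnbinom(0:10, size = size, mu = mu))   # TRUE
```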
Visualizing the PMF and CDF clarifies the distribution’s behavior:
library(ggplot2)
# Create data for visualization
x <- 0:20
pmf <- dnbinom(x, size = 3, prob = 0.4)
cdf <- pnbinom(x, size = 3, prob = 0.4)
# PMF plot
df_pmf <- data.frame(x = x, probability = pmf)
ggplot(df_pmf, aes(x = x, y = probability)) +
geom_col(fill = "steelblue") +
labs(title = "Negative Binomial PMF (r=3, p=0.4)",
x = "Number of Failures", y = "Probability") +
theme_minimal()
# CDF plot
df_cdf <- data.frame(x = x, cumulative = cdf)
ggplot(df_cdf, aes(x = x, y = cumulative)) +
geom_step(color = "darkred", linewidth = 1) +
labs(title = "Negative Binomial CDF (r=3, p=0.4)",
x = "Number of Failures", y = "Cumulative Probability") +
theme_minimal()
Generating and Visualizing Negative Binomial Data
Parameter choices dramatically affect distribution shape. Small r with low p creates right-skewed distributions with long tails. Large r approximates Poisson or even normal distributions.
# Generate samples with different parameters
set.seed(123)
n <- 1000
sample1 <- rnbinom(n, size = 2, prob = 0.3) # High overdispersion
sample2 <- rnbinom(n, size = 10, prob = 0.5) # Moderate overdispersion
sample3 <- rnbinom(n, size = 50, prob = 0.7) # Low overdispersion
# Create comparison data frame
df_compare <- data.frame(
value = c(sample1, sample2, sample3),
distribution = rep(c("r=2, p=0.3", "r=10, p=0.5", "r=50, p=0.7"),
each = n)
)
# Comparative histogram
ggplot(df_compare, aes(x = value, fill = distribution)) +
geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
facet_wrap(~ distribution, scales = "free") +
labs(title = "Effect of Parameters on Distribution Shape",
x = "Count", y = "Frequency") +
theme_minimal() +
theme(legend.position = "none")
Overlay theoretical versus empirical distributions to verify random generation:
# Generate sample and compare to theoretical
sample <- rnbinom(10000, size = 5, prob = 0.4)
x_vals <- 0:max(sample)
theoretical <- dnbinom(x_vals, size = 5, prob = 0.4)
# Create comparison plot
# Unit-width bins centered on integers, so the density scale matches the PMF
hist(sample, breaks = seq(-0.5, max(sample) + 0.5, by = 1), freq = FALSE,
     col = "lightblue",
     main = "Empirical vs Theoretical Distribution",
     xlab = "Count", ylab = "Density")
lines(x_vals, theoretical, col = "red", lwd = 2)
legend("topright", c("Empirical", "Theoretical"),
fill = c("lightblue", NA), border = c("black", NA),
lty = c(NA, 1), col = c(NA, "red"), lwd = c(NA, 2))
Fitting Negative Binomial Models to Real Data
The MASS package provides glm.nb() for negative binomial regression. This handles overdispersed count data where standard Poisson regression fails.
library(MASS)
# Simulate overdispersed count data
set.seed(456)
n <- 200
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- rbinom(n, size = 1, prob = 0.5)
# Generate counts with overdispersion
lambda <- exp(0.5 + 0.3*x1 - 0.4*x2)
y <- rnbinom(n, size = 2, mu = lambda)
df <- data.frame(y = y, x1 = x1, x2 = x2)
# Fit Poisson model (incorrect for overdispersed data)
poisson_model <- glm(y ~ x1 + x2, family = poisson, data = df)
# Fit negative binomial model (correct)
nb_model <- glm.nb(y ~ x1 + x2, data = df)
# Compare models
cat("Poisson AIC:", AIC(poisson_model), "\n")
cat("Negative Binomial AIC:", AIC(nb_model), "\n")
# Check overdispersion formally
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) /
poisson_model$df.residual
cat("Dispersion parameter:", dispersion, "\n")
# Model summary
summary(nb_model)
As a rough rule of thumb, a dispersion statistic above about 1.5 favors the negative binomial. The AIC comparison confirms this: lower AIC indicates better fit.
Interpret coefficients on the log scale: a one-unit increase in x1 multiplies the expected count by exp(β₁). Make predictions:
# Predict for new data
new_data <- data.frame(x1 = c(3, 5, 7), x2 = c(0, 1, 0))
predictions <- predict(nb_model, newdata = new_data,
type = "response", se.fit = TRUE)
# Create prediction data frame with confidence intervals
pred_df <- data.frame(
x1 = new_data$x1,
x2 = new_data$x2,
predicted = predictions$fit,
se = predictions$se.fit
)
# Approximate 95% intervals on the response scale; for skewed counts,
# building intervals on the link scale and exponentiating is more reliable
pred_df$lower <- pred_df$predicted - 1.96 * pred_df$se
pred_df$upper <- pred_df$predicted + 1.96 * pred_df$se
print(pred_df)
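Incidence rate ratios (exponentiated coefficients) are often easier to report than log-scale effects. A minimal sketch, refitting the same simulated model so the snippet runs standalone:

```r
library(MASS)

# Refit the simulated model from above
set.seed(456)
n <- 200
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- rbinom(n, size = 1, prob = 0.5)
y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * x1 - 0.4 * x2))
nb_model <- glm.nb(y ~ x1 + x2)

# exp(beta) = multiplicative change in expected count per unit increase
irr <- exp(cbind(IRR = coef(nb_model), confint(nb_model)))
round(irr, 3)
```

With this simulation, the IRR for x1 should land near exp(0.3), about 1.35.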
Practical Applications and Case Studies
Consider customer purchase frequency analysis. You have transaction counts per customer over a month, along with customer characteristics.
# Simulate customer transaction data
set.seed(789)
n_customers <- 500
customer_data <- data.frame(
customer_id = 1:n_customers,
age = rnorm(n_customers, mean = 40, sd = 15),
loyalty_years = rpois(n_customers, lambda = 3),
premium_member = rbinom(n_customers, size = 1, prob = 0.3)
)
# Generate transaction counts (overdispersed)
mu <- exp(1.2 + 0.02*customer_data$age +
0.15*customer_data$loyalty_years +
0.8*customer_data$premium_member)
customer_data$transactions <- rnbinom(n_customers, size = 3, mu = mu)
# Fit model
transaction_model <- glm.nb(transactions ~ age + loyalty_years + premium_member,
data = customer_data)
summary(transaction_model)
# Diagnostic plots
par(mfrow = c(2, 2))
plot(transaction_model)
par(mfrow = c(1, 1))
# Predict high-value customer transactions
high_value <- data.frame(
age = 45,
loyalty_years = 5,
premium_member = 1
)
predict(transaction_model, newdata = high_value, type = "response")
The model reveals that premium membership has the strongest effect on transaction frequency, followed by loyalty years. Age shows a smaller positive effect.
Common Pitfalls and Best Practices
Parameter confusion: R’s size parameter corresponds to r, and prob is the success probability, not the failure probability. The mu parameterization often proves more intuitive for regression: glm.nb() models the mean mu directly (through a log link).
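One connection worth spelling out: the theta reported by glm.nb() plays the role of size, giving the variance function Var(Y) = mu + mu²/theta, so smaller theta means stronger overdispersion. A quick empirical check:

```r
# theta (glm.nb) is the same quantity as size (rnbinom):
# Var(Y) = mu + mu^2 / theta, so smaller theta means more overdispersion
set.seed(2)
y <- rnbinom(100000, size = 2, mu = 4)
c(theoretical = 4 + 4^2 / 2, empirical = var(y))
```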
Inappropriate use: Don’t use negative binomial for continuous data, proportions, or data without overdispersion. Always test Poisson first—it’s simpler and sometimes sufficient.
Convergence issues: When glm.nb() fails to converge, check for:
- Extreme outliers distorting estimates
- Perfect separation in predictors
- Insufficient sample size relative to parameters
# Formal overdispersion test
check_overdispersion <- function(model) {
  if (family(model)$family != "poisson") {
    stop("Model must be Poisson")
  }
  pearson_resid <- residuals(model, type = "pearson")
  dispersion <- sum(pearson_resid^2) / model$df.residual
  p_value <- pchisq(sum(pearson_resid^2),
                    df = model$df.residual,
                    lower.tail = FALSE)
  list(dispersion = dispersion, p_value = p_value)
}
# Use on Poisson model
result <- check_overdispersion(poisson_model)
cat("Dispersion:", result$dispersion, "\n")
cat("P-value:", result$p_value, "\n")
Model validation: Always examine residual plots. For count data, use Pearson or deviance residuals. Look for patterns suggesting model misspecification—residuals should appear random.
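A minimal residual check along these lines, refitting the earlier simulated model so the snippet runs standalone:

```r
library(MASS)

# Refit the simulated model from the fitting section
set.seed(456)
n <- 200
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- rbinom(n, size = 1, prob = 0.5)
y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * x1 - 0.4 * x2))
nb_model <- glm.nb(y ~ x1 + x2)

# Pearson residuals vs fitted values: look for fan shapes or curvature
pr <- residuals(nb_model, type = "pearson")
plot(fitted(nb_model), pr,
     xlab = "Fitted values", ylab = "Pearson residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```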
The negative binomial distribution solves real problems in count data analysis. Master these R functions and modeling approaches, and you’ll handle overdispersed data with confidence. When variance exceeds mean in your counts, reach for negative binomial—it’s built for exactly this scenario.