Poisson Distribution in R: Complete Guide

The Poisson distribution answers a specific question: how many times will an event occur in a fixed interval? That interval could be time, space, or any other continuous measure. You're counting...

Key Insights

  • The Poisson distribution models count data where events occur independently at a constant average rate—mastering R’s four core functions (dpois, ppois, qpois, rpois) covers 90% of practical use cases
  • Lambda (λ) is both the mean and variance of a Poisson distribution; when your data’s variance significantly exceeds its mean, you’re dealing with overdispersion and need alternative approaches
  • Poisson regression via glm() extends simple count modeling to handle predictors, but always check for overdispersion before trusting your results

Introduction to Poisson Distribution

The Poisson distribution answers a specific question: how many times will an event occur in a fixed interval? That interval could be time, space, or any other continuous measure. You’re counting website visits per hour, manufacturing defects per batch, or customer arrivals per day.

Three assumptions must hold for Poisson to be appropriate:

  1. Events occur independently of each other
  2. The average rate (lambda) stays constant across intervals
  3. Two events can’t happen at exactly the same instant

When these assumptions hold, the Poisson distribution becomes remarkably useful. When they don’t—and you’ll learn to detect this—you need alternatives.

Core R Functions for Poisson Distribution

R provides four functions that follow a consistent naming convention across all distributions:

Function Purpose Returns
dpois() Probability mass function P(X = k)
ppois() Cumulative distribution function P(X ≤ k)
qpois() Quantile function k where P(X ≤ k) = p
rpois() Random number generation Simulated counts

All four take lambda as the key parameter—the average number of events per interval.

# Probability of exactly 3 events when lambda = 5
dpois(x = 3, lambda = 5)
# [1] 0.1403739

# Probability of 3 or fewer events
ppois(q = 3, lambda = 5)
# [1] 0.2650259

# What count gives us the 90th percentile?
qpois(p = 0.90, lambda = 5)
# [1] 8

# Generate 10 random Poisson samples
set.seed(42)
rpois(n = 10, lambda = 5)
# [1] 6 8 2 6 6 6 7 4 5 3

The lower.tail argument flips the CDF direction. For P(X > k), use ppois(k, lambda, lower.tail = FALSE):

# P(X > 7) when lambda = 5
ppois(7, lambda = 5, lower.tail = FALSE)
# [1] 0.1333603

# Equivalent to:
1 - ppois(7, lambda = 5)
# [1] 0.1333603

Visualizing Poisson Distributions

Visualization reveals how lambda shapes the distribution. Low lambda values produce right-skewed distributions; as lambda increases, the distribution approaches symmetry.

library(ggplot2)
library(dplyr)
library(tidyr)

# Create data for multiple lambda values
lambdas <- c(1, 4, 10, 20)
x_vals <- 0:35

poisson_data <- expand.grid(x = x_vals, lambda = lambdas) %>%
  mutate(
    probability = dpois(x, lambda),
    lambda_label = paste("λ =", lambda)
  )

# Plot PMFs
ggplot(poisson_data, aes(x = x, y = probability, fill = lambda_label)) +
  geom_col(position = "dodge", width = 0.7) +
  facet_wrap(~lambda_label, scales = "free_y") +
  labs(
    title = "Poisson PMF for Different Lambda Values",
    x = "Count (k)",
    y = "P(X = k)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

For cumulative distributions, switch to ppois():

# CDF comparison
cdf_data <- expand.grid(x = x_vals, lambda = c(3, 7, 12)) %>%
  mutate(
    cumulative_prob = ppois(x, lambda),
    lambda_label = factor(paste("λ =", lambda))
  )

ggplot(cdf_data, aes(x = x, y = cumulative_prob, color = lambda_label)) +
  geom_step(linewidth = 1.2) +
  labs(
    title = "Poisson CDF Comparison",
    x = "Count (k)",
    y = "P(X ≤ k)",
    color = "Lambda"
  ) +
  theme_minimal()

Parameter Estimation and Fitting

Given observed count data, you’ll need to estimate lambda. The maximum likelihood estimator for lambda is simply the sample mean—mathematically elegant and computationally trivial.

# Observed counts (e.g., daily customer complaints)
complaints <- c(2, 5, 3, 4, 6, 2, 3, 5, 4, 3, 7, 2, 4, 3, 5)

# MLE estimate of lambda
lambda_hat <- mean(complaints)
lambda_hat
# [1] 3.866667

For formal fitting with standard errors, use fitdistr() from the MASS package:

library(MASS)

fit <- fitdistr(complaints, "Poisson")
fit
#      lambda   
#   3.86666667 
#  (0.50773691)

The value in parentheses is the standard error. You can extract components:

fit$estimate  # Point estimate
fit$sd        # Standard error

# 95% confidence interval for lambda
fit$estimate + c(-1.96, 1.96) * fit$sd
# [1] 2.871502 4.861831

Goodness-of-Fit Testing

Does your data actually follow a Poisson distribution? The chi-square goodness-of-fit test provides an answer:

# More data for reliable testing
set.seed(123)
observed_counts <- rpois(200, lambda = 4)

# Tabulate observed frequencies
obs_table <- table(factor(observed_counts, levels = 0:12))

# Calculate expected frequencies under Poisson
lambda_est <- mean(observed_counts)
expected_probs <- dpois(0:11, lambda_est)
expected_probs <- c(expected_probs, 1 - sum(expected_probs))  # Lump 12+
expected_freq <- expected_probs * length(observed_counts)

# Chi-square test (combine cells with expected < 5)
chisq.test(as.vector(obs_table), p = expected_probs, rescale.p = TRUE)

Practical Applications and Case Studies

Let’s work through a complete analysis of call center data:

# Simulated hourly call volumes over 30 days
set.seed(456)
call_data <- data.frame(
  day = rep(1:30, each = 8),
  hour = rep(9:16, times = 30),
  calls = rpois(240, lambda = 12)
)

# Estimate overall lambda
overall_lambda <- mean(call_data$calls)
cat("Estimated calls per hour:", round(overall_lambda, 2), "\n")

# Test if morning (9-12) differs from afternoon (13-16)
morning <- call_data$calls[call_data$hour <= 12]
afternoon <- call_data$calls[call_data$hour > 12]

# Rate comparison using exact Poisson test
poisson.test(
  x = c(sum(morning), sum(afternoon)),
  T = c(length(morning), length(afternoon))
)

For quality control, you might track defects per production batch:

# Defects per batch (target: lambda = 2)
defects <- c(1, 3, 2, 0, 4, 2, 1, 5, 2, 3, 1, 2, 6, 2, 1)

# Test if process is in control (H0: lambda = 2)
poisson.test(sum(defects), T = length(defects), r = 2)

Poisson Regression with glm()

When counts depend on predictors, Poisson regression extends the basic model. The glm() function with family = poisson handles this:

# Example: Website visits by day of week and marketing spend
set.seed(789)
web_data <- data.frame(
  day = factor(rep(c("Weekday", "Weekend"), each = 50)),
  ad_spend = runif(100, 100, 500)
)

# Generate counts with true effects
web_data$visits <- rpois(100, 
  lambda = exp(3 + 0.3 * (web_data$day == "Weekend") + 0.002 * web_data$ad_spend)
)

# Fit Poisson regression
pois_model <- glm(visits ~ day + ad_spend, 
                  family = poisson(link = "log"), 
                  data = web_data)

summary(pois_model)

Coefficients are on the log scale. Exponentiate for multiplicative effects:

# Interpret coefficients
exp(coef(pois_model))

# Weekend effect: visits are multiplied by this factor
exp(coef(pois_model)["dayWeekend"])

# Each $100 increase in ad spend multiplies visits by:
exp(coef(pois_model)["ad_spend"] * 100)

Common Pitfalls and Alternatives

The Poisson distribution assumes mean equals variance. Real data often violates this—variance exceeds the mean (overdispersion). Ignoring overdispersion produces artificially small standard errors and false confidence.

# Check for overdispersion
check_dispersion <- function(model) {
  pearson_resid <- residuals(model, type = "pearson")
  dispersion <- sum(pearson_resid^2) / model$df.residual
  cat("Dispersion parameter:", round(dispersion, 3), "\n")
  cat("Values >> 1 indicate overdispersion\n")
  return(dispersion)
}

check_dispersion(pois_model)

When overdispersion exists, switch to negative binomial regression:

library(MASS)

# Fit negative binomial as alternative
nb_model <- glm.nb(visits ~ day + ad_spend, data = web_data)

# Compare AIC values
AIC(pois_model, nb_model)

For zero-inflated data (more zeros than Poisson predicts), use the pscl package:

library(pscl)

# Zero-inflated Poisson
zip_model <- zeroinfl(visits ~ day + ad_spend | 1, data = web_data)

# Compare with Vuong test
vuong(pois_model, zip_model)

Decision Framework

Use this hierarchy when modeling count data:

  1. Start with Poisson regression
  2. Check dispersion parameter—if > 1.5, consider alternatives
  3. For overdispersion without excess zeros: negative binomial
  4. For excess zeros: zero-inflated Poisson or negative binomial
  5. For underdispersion (rare): consider binomial or beta-binomial

The Poisson distribution remains foundational for count data analysis. Master these R functions, understand when assumptions break down, and you’ll handle the majority of count-based problems you encounter.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.