Hypergeometric Distribution in R: Complete Guide

The hypergeometric distribution answers a fundamental question: what's the probability of getting exactly k successes when drawing n items without replacement from a finite population containing K...

Key Insights

  • The hypergeometric distribution models sampling without replacement, making it essential for quality control, card game probabilities, and ecological studies where the population changes with each draw
  • R provides four core functions (dhyper, phyper, qhyper, rhyper) that handle probability mass, cumulative distribution, quantiles, and random generation—understanding their parameter order (x, m, n, k) prevents common errors
  • For large populations where sampling represents less than 5% of the total, the binomial distribution approximates the hypergeometric with better computational efficiency

Introduction to Hypergeometric Distribution

The hypergeometric distribution answers a fundamental question: what’s the probability of getting exactly k successes when drawing n items without replacement from a finite population containing K successes and N-K failures?

Unlike the binomial distribution where each trial is independent, hypergeometric distribution accounts for the changing probability as items are removed. When you draw a card from a deck and don’t replace it, the odds change for the next draw—that’s hypergeometric territory.

Real-world applications include:

  • Quality control: selecting items from a batch to test for defects
  • Card games: calculating poker hand probabilities
  • Ecological sampling: capture-recapture studies for wildlife populations
  • Clinical trials: selecting patients from specific subgroups

Here’s how hypergeometric differs from binomial distribution:

# Compare hypergeometric vs binomial
# Scenario: 50 items, 20 defective, drawing 10 items

x <- 0:10
# Hypergeometric (without replacement)
hyper_prob <- dhyper(x, m = 20, n = 30, k = 10)
# Binomial (with replacement)
binom_prob <- dbinom(x, size = 10, prob = 20/50)

par(mfrow = c(1, 2))
barplot(hyper_prob, names.arg = x, main = "Hypergeometric",
        xlab = "Number of defects", ylab = "Probability", col = "steelblue")
barplot(binom_prob, names.arg = x, main = "Binomial",
        xlab = "Number of defects", ylab = "Probability", col = "coral")

The difference is subtle but crucial. Hypergeometric probabilities are slightly more spread out because the sampling process affects subsequent draws.

Mathematical Foundation and Parameters

The hypergeometric distribution has four parameters:

  • N: Total population size (m + n in R’s notation)
  • K: Number of success states in the population (m in R)
  • n: Number of draws (k in R)
  • k: Number of observed successes (x in R)

The probability mass function is:

P(X = k) = [C(K,k) × C(N-K, n-k)] / C(N,n)

Where C(n,k) represents “n choose k” or the binomial coefficient.

Here’s a manual calculation compared to R’s built-in function:

# Calculate probability manually
# Example: 52 cards, 13 hearts, draw 5 cards, probability of exactly 2 hearts

N <- 52  # Total cards
K <- 13  # Hearts in deck
n <- 5   # Cards drawn
k <- 2   # Hearts we want

# Manual calculation
numerator <- choose(K, k) * choose(N - K, n - k)
denominator <- choose(N, n)
manual_prob <- numerator / denominator

# Using R's built-in function
# Note: R uses (x, m, n, k) where m=successes, n=failures, k=draws
r_prob <- dhyper(x = 2, m = 13, n = 39, k = 5)

cat("Manual calculation:", manual_prob, "\n")
cat("R's dhyper():", r_prob, "\n")
cat("Difference:", abs(manual_prob - r_prob), "\n")

The parameter naming in R is confusing. Remember: dhyper(x, m, n, k) where m is successes in population, n is failures in population, and k is number of draws.

Core R Functions for Hypergeometric Distribution

R provides four functions following the standard distribution function pattern:

dhyper(): Probability Mass Function

Calculate the probability of exactly x successes:

# Probability of drawing exactly 3 aces when drawing 5 cards
dhyper(x = 3, m = 4, n = 48, k = 5)

# Probability for multiple values
x_values <- 0:4
probs <- dhyper(x_values, m = 4, n = 48, k = 5)
names(probs) <- paste0("P(X=", x_values, ")")
print(probs)

phyper(): Cumulative Distribution Function

Calculate cumulative probabilities:

# Probability of drawing 2 or fewer aces in 5 cards
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = TRUE)

# Probability of drawing more than 2 aces (upper tail)
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = FALSE)

# Verify they sum to 1
lower <- phyper(2, m = 4, n = 48, k = 5, lower.tail = TRUE)
upper <- phyper(2, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Sum:", lower + upper, "\n")

qhyper(): Quantile Function

Find the value corresponding to a given probability:

# What's the median number of aces in 5-card draws?
qhyper(p = 0.5, m = 4, n = 48, k = 5)

# 95th percentile
qhyper(p = 0.95, m = 4, n = 48, k = 5)

# Multiple quantiles
quantiles <- c(0.25, 0.5, 0.75, 0.95)
qhyper(p = quantiles, m = 4, n = 48, k = 5)

rhyper(): Random Number Generation

Generate random samples from the distribution:

# Simulate 1000 five-card draws, counting aces
set.seed(123)
simulated_aces <- rhyper(nn = 1000, m = 4, n = 48, k = 5)

# Compare simulation to theoretical distribution
table(simulated_aces) / 1000  # Empirical
dhyper(0:4, m = 4, n = 48, k = 5)  # Theoretical

Practical Applications and Case Studies

Case Study 1: Quality Control

A shipment contains 100 components, 15 of which are defective. You randomly sample 20 components. What’s the probability of finding exactly 5 defective items?

# Parameters
total_items <- 100
defective <- 15
sample_size <- 20
target_defects <- 5

# Exact probability
prob_exactly_5 <- dhyper(target_defects, defective, 
                          total_items - defective, sample_size)
cat("P(exactly 5 defects):", prob_exactly_5, "\n")

# Probability of 5 or more defects (reject the batch)
prob_5_or_more <- phyper(4, defective, total_items - defective, 
                         sample_size, lower.tail = FALSE)
cat("P(5 or more defects):", prob_5_or_more, "\n")

# Expected number of defects in sample
expected_defects <- sample_size * (defective / total_items)
cat("Expected defects:", expected_defects, "\n")

Case Study 2: Wildlife Population Estimation

Ecologists capture, tag, and release 50 fish in a lake. Later, they capture 30 fish and find 8 are tagged. Estimate the total population.

# Capture-recapture estimation
tagged <- 50
second_sample <- 30
recaptured_tagged <- 8

# Lincoln-Petersen estimator
estimated_population <- (tagged * second_sample) / recaptured_tagged
cat("Estimated population:", round(estimated_population), "\n")

# Calculate probability of observing 8 tagged fish
# for different population sizes
possible_populations <- seq(100, 300, by = 10)
likelihoods <- sapply(possible_populations, function(N) {
  dhyper(recaptured_tagged, tagged, N - tagged, second_sample)
})

# Find maximum likelihood estimate
mle_population <- possible_populations[which.max(likelihoods)]
cat("MLE population:", mle_population, "\n")

Case Study 3: Card Game Probability

What’s the probability of getting at least one ace in a 5-card poker hand?

# At least one ace = 1 - P(no aces)
prob_no_aces <- dhyper(0, m = 4, n = 48, k = 5)
prob_at_least_one <- 1 - prob_no_aces
cat("P(at least one ace):", prob_at_least_one, "\n")

# Alternatively using phyper
prob_at_least_one_alt <- phyper(0, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Using phyper:", prob_at_least_one_alt, "\n")

# Full probability distribution
aces <- 0:4
probs <- dhyper(aces, m = 4, n = 48, k = 5)
data.frame(Aces = aces, Probability = probs, 
           Cumulative = cumsum(probs))

Visualization Techniques

Effective visualization helps understand hypergeometric distributions:

# Base R visualization
x <- 0:10
probs <- dhyper(x, m = 20, n = 30, k = 10)

par(mfrow = c(1, 2))
# PMF
barplot(probs, names.arg = x, col = "steelblue",
        main = "Hypergeometric PMF",
        xlab = "Number of successes", ylab = "Probability")

# CDF
plot(x, cumsum(probs), type = "s", lwd = 2, col = "darkred",
     main = "Hypergeometric CDF",
     xlab = "Number of successes", ylab = "Cumulative Probability")
grid()

Using ggplot2 for publication-quality graphics:

library(ggplot2)

# Compare different sample sizes
sample_sizes <- c(5, 10, 20)
comparison_data <- do.call(rbind, lapply(sample_sizes, function(k) {
  x <- 0:k
  data.frame(
    x = x,
    probability = dhyper(x, m = 20, n = 30, k = k),
    sample_size = as.factor(k)
  )
}))

ggplot(comparison_data, aes(x = x, y = probability, fill = sample_size)) +
  geom_col(position = "dodge") +
  labs(title = "Hypergeometric Distribution by Sample Size",
       x = "Number of Successes", y = "Probability",
       fill = "Sample Size") +
  theme_minimal()

Statistical Testing and Hypothesis Testing

Fisher’s exact test uses the hypergeometric distribution for contingency table analysis:

# Example: Drug effectiveness study
# Treatment group: 12 improved, 8 not improved
# Control group: 5 improved, 15 not improved

# Create contingency table
drug_data <- matrix(c(12, 8, 5, 15), nrow = 2,
                    dimnames = list(c("Treatment", "Control"),
                                   c("Improved", "Not Improved")))
print(drug_data)

# Fisher's exact test
fisher.test(drug_data)

# Manual calculation using hypergeometric
# P-value is sum of probabilities for tables as extreme or more extreme
total_improved <- 12 + 5
total_not_improved <- 8 + 15
treatment_total <- 12 + 8

# Probability of observed table
observed_prob <- dhyper(12, total_improved, total_not_improved, treatment_total)
cat("P(observed table):", observed_prob, "\n")

Common Pitfalls and Best Practices

Pitfall 1: Confusing Parameter Order

# WRONG: Confusing parameters
# dhyper(x, population_size, successes, draws) # INCORRECT

# CORRECT: 
# dhyper(x, successes_in_pop, failures_in_pop, draws)
dhyper(3, m = 20, n = 30, k = 10)  # Use named parameters

Pitfall 2: When to Use Binomial Approximation

When sample size is less than 5% of population, use binomial:

# Large population scenario
N <- 10000
K <- 2000
n <- 100

# Hypergeometric (exact)
system.time({
  hyper_result <- dhyper(20, m = K, n = N - K, k = n)
})

# Binomial approximation (faster)
system.time({
  binom_result <- dbinom(20, size = n, prob = K / N)
})

cat("Difference:", abs(hyper_result - binom_result), "\n")

Pitfall 3: Edge Cases

# Handle impossible scenarios
dhyper(10, m = 5, n = 45, k = 8)  # Returns 0 (can't draw 10 from 5)

# Boundary conditions
dhyper(5, m = 5, n = 45, k = 5)   # Drawing all successes

The hypergeometric distribution is indispensable for scenarios involving sampling without replacement. Master these R functions, understand when approximations are appropriate, and always validate your parameter inputs to avoid subtle errors in your statistical analyses.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.