Hypergeometric Distribution in R: Complete Guide
The hypergeometric distribution answers a fundamental question: what's the probability of getting exactly k successes when drawing n items without replacement from a finite population containing K...
Key Insights
- The hypergeometric distribution models sampling without replacement, making it essential for quality control, card game probabilities, and ecological studies where the population changes with each draw
- R provides four core functions (
dhyper,phyper,qhyper,rhyper) that handle probability mass, cumulative distribution, quantiles, and random generation—understanding their parameter order (x, m, n, k) prevents common errors - For large populations where sampling represents less than 5% of the total, the binomial distribution approximates the hypergeometric with better computational efficiency
Introduction to Hypergeometric Distribution
The hypergeometric distribution answers a fundamental question: what’s the probability of getting exactly k successes when drawing n items without replacement from a finite population containing K successes and N-K failures?
Unlike the binomial distribution where each trial is independent, hypergeometric distribution accounts for the changing probability as items are removed. When you draw a card from a deck and don’t replace it, the odds change for the next draw—that’s hypergeometric territory.
Real-world applications include:
- Quality control: selecting items from a batch to test for defects
- Card games: calculating poker hand probabilities
- Ecological sampling: capture-recapture studies for wildlife populations
- Clinical trials: selecting patients from specific subgroups
Here’s how hypergeometric differs from binomial distribution:
# Compare hypergeometric vs binomial
# Scenario: 50 items, 20 defective, drawing 10 items
x <- 0:10
# Hypergeometric (without replacement)
hyper_prob <- dhyper(x, m = 20, n = 30, k = 10)
# Binomial (with replacement)
binom_prob <- dbinom(x, size = 10, prob = 20/50)
par(mfrow = c(1, 2))
barplot(hyper_prob, names.arg = x, main = "Hypergeometric",
xlab = "Number of defects", ylab = "Probability", col = "steelblue")
barplot(binom_prob, names.arg = x, main = "Binomial",
xlab = "Number of defects", ylab = "Probability", col = "coral")
The difference is subtle but crucial. Hypergeometric probabilities are slightly more spread out because the sampling process affects subsequent draws.
Mathematical Foundation and Parameters
The hypergeometric distribution has four parameters:
- N: Total population size (m + n in R’s notation)
- K: Number of success states in the population (m in R)
- n: Number of draws (k in R)
- k: Number of observed successes (x in R)
The probability mass function is:
P(X = k) = [C(K,k) × C(N-K, n-k)] / C(N,n)
Where C(n,k) represents “n choose k” or the binomial coefficient.
Here’s a manual calculation compared to R’s built-in function:
# Calculate probability manually
# Example: 52 cards, 13 hearts, draw 5 cards, probability of exactly 2 hearts
N <- 52 # Total cards
K <- 13 # Hearts in deck
n <- 5 # Cards drawn
k <- 2 # Hearts we want
# Manual calculation
numerator <- choose(K, k) * choose(N - K, n - k)
denominator <- choose(N, n)
manual_prob <- numerator / denominator
# Using R's built-in function
# Note: R uses (x, m, n, k) where m=successes, n=failures, k=draws
r_prob <- dhyper(x = 2, m = 13, n = 39, k = 5)
cat("Manual calculation:", manual_prob, "\n")
cat("R's dhyper():", r_prob, "\n")
cat("Difference:", abs(manual_prob - r_prob), "\n")
The parameter naming in R is confusing. Remember: dhyper(x, m, n, k) where m is successes in population, n is failures in population, and k is number of draws.
Core R Functions for Hypergeometric Distribution
R provides four functions following the standard distribution function pattern:
dhyper(): Probability Mass Function
Calculate the probability of exactly x successes:
# Probability of drawing exactly 3 aces when drawing 5 cards
dhyper(x = 3, m = 4, n = 48, k = 5)
# Probability for multiple values
x_values <- 0:4
probs <- dhyper(x_values, m = 4, n = 48, k = 5)
names(probs) <- paste0("P(X=", x_values, ")")
print(probs)
phyper(): Cumulative Distribution Function
Calculate cumulative probabilities:
# Probability of drawing 2 or fewer aces in 5 cards
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = TRUE)
# Probability of drawing more than 2 aces (upper tail)
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = FALSE)
# Verify they sum to 1
lower <- phyper(2, m = 4, n = 48, k = 5, lower.tail = TRUE)
upper <- phyper(2, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Sum:", lower + upper, "\n")
qhyper(): Quantile Function
Find the value corresponding to a given probability:
# What's the median number of aces in 5-card draws?
qhyper(p = 0.5, m = 4, n = 48, k = 5)
# 95th percentile
qhyper(p = 0.95, m = 4, n = 48, k = 5)
# Multiple quantiles
quantiles <- c(0.25, 0.5, 0.75, 0.95)
qhyper(p = quantiles, m = 4, n = 48, k = 5)
rhyper(): Random Number Generation
Generate random samples from the distribution:
# Simulate 1000 five-card draws, counting aces
set.seed(123)
simulated_aces <- rhyper(nn = 1000, m = 4, n = 48, k = 5)
# Compare simulation to theoretical distribution
table(simulated_aces) / 1000 # Empirical
dhyper(0:4, m = 4, n = 48, k = 5) # Theoretical
Practical Applications and Case Studies
Case Study 1: Quality Control
A shipment contains 100 components, 15 of which are defective. You randomly sample 20 components. What’s the probability of finding exactly 5 defective items?
# Parameters
total_items <- 100
defective <- 15
sample_size <- 20
target_defects <- 5
# Exact probability
prob_exactly_5 <- dhyper(target_defects, defective,
total_items - defective, sample_size)
cat("P(exactly 5 defects):", prob_exactly_5, "\n")
# Probability of 5 or more defects (reject the batch)
prob_5_or_more <- phyper(4, defective, total_items - defective,
sample_size, lower.tail = FALSE)
cat("P(5 or more defects):", prob_5_or_more, "\n")
# Expected number of defects in sample
expected_defects <- sample_size * (defective / total_items)
cat("Expected defects:", expected_defects, "\n")
Case Study 2: Wildlife Population Estimation
Ecologists capture, tag, and release 50 fish in a lake. Later, they capture 30 fish and find 8 are tagged. Estimate the total population.
# Capture-recapture estimation
tagged <- 50
second_sample <- 30
recaptured_tagged <- 8
# Lincoln-Petersen estimator
estimated_population <- (tagged * second_sample) / recaptured_tagged
cat("Estimated population:", round(estimated_population), "\n")
# Calculate probability of observing 8 tagged fish
# for different population sizes
possible_populations <- seq(100, 300, by = 10)
likelihoods <- sapply(possible_populations, function(N) {
dhyper(recaptured_tagged, tagged, N - tagged, second_sample)
})
# Find maximum likelihood estimate
mle_population <- possible_populations[which.max(likelihoods)]
cat("MLE population:", mle_population, "\n")
Case Study 3: Card Game Probability
What’s the probability of getting at least one ace in a 5-card poker hand?
# At least one ace = 1 - P(no aces)
prob_no_aces <- dhyper(0, m = 4, n = 48, k = 5)
prob_at_least_one <- 1 - prob_no_aces
cat("P(at least one ace):", prob_at_least_one, "\n")
# Alternatively using phyper
prob_at_least_one_alt <- phyper(0, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Using phyper:", prob_at_least_one_alt, "\n")
# Full probability distribution
aces <- 0:4
probs <- dhyper(aces, m = 4, n = 48, k = 5)
data.frame(Aces = aces, Probability = probs,
Cumulative = cumsum(probs))
Visualization Techniques
Effective visualization helps understand hypergeometric distributions:
# Base R visualization
x <- 0:10
probs <- dhyper(x, m = 20, n = 30, k = 10)
par(mfrow = c(1, 2))
# PMF
barplot(probs, names.arg = x, col = "steelblue",
main = "Hypergeometric PMF",
xlab = "Number of successes", ylab = "Probability")
# CDF
plot(x, cumsum(probs), type = "s", lwd = 2, col = "darkred",
main = "Hypergeometric CDF",
xlab = "Number of successes", ylab = "Cumulative Probability")
grid()
Using ggplot2 for publication-quality graphics:
library(ggplot2)
# Compare different sample sizes
sample_sizes <- c(5, 10, 20)
comparison_data <- do.call(rbind, lapply(sample_sizes, function(k) {
x <- 0:k
data.frame(
x = x,
probability = dhyper(x, m = 20, n = 30, k = k),
sample_size = as.factor(k)
)
}))
ggplot(comparison_data, aes(x = x, y = probability, fill = sample_size)) +
geom_col(position = "dodge") +
labs(title = "Hypergeometric Distribution by Sample Size",
x = "Number of Successes", y = "Probability",
fill = "Sample Size") +
theme_minimal()
Statistical Testing and Hypothesis Testing
Fisher’s exact test uses the hypergeometric distribution for contingency table analysis:
# Example: Drug effectiveness study
# Treatment group: 12 improved, 8 not improved
# Control group: 5 improved, 15 not improved
# Create contingency table
drug_data <- matrix(c(12, 8, 5, 15), nrow = 2,
dimnames = list(c("Treatment", "Control"),
c("Improved", "Not Improved")))
print(drug_data)
# Fisher's exact test
fisher.test(drug_data)
# Manual calculation using hypergeometric
# P-value is sum of probabilities for tables as extreme or more extreme
total_improved <- 12 + 5
total_not_improved <- 8 + 15
treatment_total <- 12 + 8
# Probability of observed table
observed_prob <- dhyper(12, total_improved, total_not_improved, treatment_total)
cat("P(observed table):", observed_prob, "\n")
Common Pitfalls and Best Practices
Pitfall 1: Confusing Parameter Order
# WRONG: Confusing parameters
# dhyper(x, population_size, successes, draws) # INCORRECT
# CORRECT:
# dhyper(x, successes_in_pop, failures_in_pop, draws)
dhyper(3, m = 20, n = 30, k = 10) # Use named parameters
Pitfall 2: When to Use Binomial Approximation
When sample size is less than 5% of population, use binomial:
# Large population scenario
N <- 10000
K <- 2000
n <- 100
# Hypergeometric (exact)
system.time({
hyper_result <- dhyper(20, m = K, n = N - K, k = n)
})
# Binomial approximation (faster)
system.time({
binom_result <- dbinom(20, size = n, prob = K / N)
})
cat("Difference:", abs(hyper_result - binom_result), "\n")
Pitfall 3: Edge Cases
# Handle impossible scenarios
dhyper(10, m = 5, n = 45, k = 8) # Returns 0 (can't draw 10 from 5)
# Boundary conditions
dhyper(5, m = 5, n = 45, k = 5) # Drawing all successes
The hypergeometric distribution is indispensable for scenarios involving sampling without replacement. Master these R functions, understand when approximations are appropriate, and always validate your parameter inputs to avoid subtle errors in your statistical analyses.