Hypergeometric Distribution in R: Complete Guide

Key Insights

The hypergeometric distribution models sampling without replacement, making it essential for quality control, card game probabilities, and ecological studies where the population changes with each draw
R provides four core functions (dhyper, phyper, qhyper, rhyper) that handle probability mass, cumulative distribution, quantiles, and random generation—understanding their parameter order (x, m, n, k) prevents common errors
For large populations where sampling represents less than 5% of the total, the binomial distribution approximates the hypergeometric with better computational efficiency

Introduction to Hypergeometric Distribution

The hypergeometric distribution answers a fundamental question: what’s the probability of getting exactly k successes when drawing n items without replacement from a finite population containing K successes and N-K failures?

Unlike the binomial distribution where each trial is independent, hypergeometric distribution accounts for the changing probability as items are removed. When you draw a card from a deck and don’t replace it, the odds change for the next draw—that’s hypergeometric territory.

Real-world applications include:

Quality control: selecting items from a batch to test for defects
Card games: calculating poker hand probabilities
Ecological sampling: capture-recapture studies for wildlife populations
Clinical trials: selecting patients from specific subgroups

Here’s how hypergeometric differs from binomial distribution:

# Compare hypergeometric vs binomial
# Scenario: 50 items, 20 defective, drawing 10 items

x <- 0:10
# Hypergeometric (without replacement)
hyper_prob <- dhyper(x, m = 20, n = 30, k = 10)
# Binomial (with replacement)
binom_prob <- dbinom(x, size = 10, prob = 20/50)

par(mfrow = c(1, 2))
barplot(hyper_prob, names.arg = x, main = "Hypergeometric",
        xlab = "Number of defects", ylab = "Probability", col = "steelblue")
barplot(binom_prob, names.arg = x, main = "Binomial",
        xlab = "Number of defects", ylab = "Probability", col = "coral")

The difference is subtle but crucial. Hypergeometric probabilities are slightly more spread out because the sampling process affects subsequent draws.

Mathematical Foundation and Parameters

The hypergeometric distribution has four parameters:

N: Total population size (m + n in R’s notation)
K: Number of success states in the population (m in R)
n: Number of draws (k in R)
k: Number of observed successes (x in R)

The probability mass function is:

P(X = k) = [C(K,k) × C(N-K, n-k)] / C(N,n)

Where C(n,k) represents “n choose k” or the binomial coefficient.

Here’s a manual calculation compared to R’s built-in function:

# Calculate probability manually
# Example: 52 cards, 13 hearts, draw 5 cards, probability of exactly 2 hearts

N <- 52  # Total cards
K <- 13  # Hearts in deck
n <- 5   # Cards drawn
k <- 2   # Hearts we want

# Manual calculation
numerator <- choose(K, k) * choose(N - K, n - k)
denominator <- choose(N, n)
manual_prob <- numerator / denominator

# Using R's built-in function
# Note: R uses (x, m, n, k) where m=successes, n=failures, k=draws
r_prob <- dhyper(x = 2, m = 13, n = 39, k = 5)

cat("Manual calculation:", manual_prob, "\n")
cat("R's dhyper():", r_prob, "\n")
cat("Difference:", abs(manual_prob - r_prob), "\n")

The parameter naming in R is confusing. Remember: dhyper(x, m, n, k) where m is successes in population, n is failures in population, and k is number of draws.

Core R Functions for Hypergeometric Distribution

R provides four functions following the standard distribution function pattern:

dhyper(): Probability Mass Function

Calculate the probability of exactly x successes:

# Probability of drawing exactly 3 aces when drawing 5 cards
dhyper(x = 3, m = 4, n = 48, k = 5)

# Probability for multiple values
x_values <- 0:4
probs <- dhyper(x_values, m = 4, n = 48, k = 5)
names(probs) <- paste0("P(X=", x_values, ")")
print(probs)

phyper(): Cumulative Distribution Function

Calculate cumulative probabilities:

# Probability of drawing 2 or fewer aces in 5 cards
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = TRUE)

# Probability of drawing more than 2 aces (upper tail)
phyper(q = 2, m = 4, n = 48, k = 5, lower.tail = FALSE)

# Verify they sum to 1
lower <- phyper(2, m = 4, n = 48, k = 5, lower.tail = TRUE)
upper <- phyper(2, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Sum:", lower + upper, "\n")

qhyper(): Quantile Function

Find the value corresponding to a given probability:

# What's the median number of aces in 5-card draws?
qhyper(p = 0.5, m = 4, n = 48, k = 5)

# 95th percentile
qhyper(p = 0.95, m = 4, n = 48, k = 5)

# Multiple quantiles
quantiles <- c(0.25, 0.5, 0.75, 0.95)
qhyper(p = quantiles, m = 4, n = 48, k = 5)

rhyper(): Random Number Generation

Generate random samples from the distribution:

# Simulate 1000 five-card draws, counting aces
set.seed(123)
simulated_aces <- rhyper(nn = 1000, m = 4, n = 48, k = 5)

# Compare simulation to theoretical distribution
table(simulated_aces) / 1000  # Empirical
dhyper(0:4, m = 4, n = 48, k = 5)  # Theoretical

Practical Applications and Case Studies

Case Study 1: Quality Control

A shipment contains 100 components, 15 of which are defective. You randomly sample 20 components. What’s the probability of finding exactly 5 defective items?

# Parameters
total_items <- 100
defective <- 15
sample_size <- 20
target_defects <- 5

# Exact probability
prob_exactly_5 <- dhyper(target_defects, defective, 
                          total_items - defective, sample_size)
cat("P(exactly 5 defects):", prob_exactly_5, "\n")

# Probability of 5 or more defects (reject the batch)
prob_5_or_more <- phyper(4, defective, total_items - defective, 
                         sample_size, lower.tail = FALSE)
cat("P(5 or more defects):", prob_5_or_more, "\n")

# Expected number of defects in sample
expected_defects <- sample_size * (defective / total_items)
cat("Expected defects:", expected_defects, "\n")

Case Study 2: Wildlife Population Estimation

Ecologists capture, tag, and release 50 fish in a lake. Later, they capture 30 fish and find 8 are tagged. Estimate the total population.

# Capture-recapture estimation
tagged <- 50
second_sample <- 30
recaptured_tagged <- 8

# Lincoln-Petersen estimator
estimated_population <- (tagged * second_sample) / recaptured_tagged
cat("Estimated population:", round(estimated_population), "\n")

# Calculate probability of observing 8 tagged fish
# for different population sizes
possible_populations <- seq(100, 300, by = 10)
likelihoods <- sapply(possible_populations, function(N) {
  dhyper(recaptured_tagged, tagged, N - tagged, second_sample)
})

# Find maximum likelihood estimate
mle_population <- possible_populations[which.max(likelihoods)]
cat("MLE population:", mle_population, "\n")

Case Study 3: Card Game Probability

What’s the probability of getting at least one ace in a 5-card poker hand?

# At least one ace = 1 - P(no aces)
prob_no_aces <- dhyper(0, m = 4, n = 48, k = 5)
prob_at_least_one <- 1 - prob_no_aces
cat("P(at least one ace):", prob_at_least_one, "\n")

# Alternatively using phyper
prob_at_least_one_alt <- phyper(0, m = 4, n = 48, k = 5, lower.tail = FALSE)
cat("Using phyper:", prob_at_least_one_alt, "\n")

# Full probability distribution
aces <- 0:4
probs <- dhyper(aces, m = 4, n = 48, k = 5)
data.frame(Aces = aces, Probability = probs, 
           Cumulative = cumsum(probs))

Visualization Techniques

Effective visualization helps understand hypergeometric distributions:

# Base R visualization
x <- 0:10
probs <- dhyper(x, m = 20, n = 30, k = 10)

par(mfrow = c(1, 2))
# PMF
barplot(probs, names.arg = x, col = "steelblue",
        main = "Hypergeometric PMF",
        xlab = "Number of successes", ylab = "Probability")

# CDF
plot(x, cumsum(probs), type = "s", lwd = 2, col = "darkred",
     main = "Hypergeometric CDF",
     xlab = "Number of successes", ylab = "Cumulative Probability")
grid()

Using ggplot2 for publication-quality graphics:

library(ggplot2)

# Compare different sample sizes
sample_sizes <- c(5, 10, 20)
comparison_data <- do.call(rbind, lapply(sample_sizes, function(k) {
  x <- 0:k
  data.frame(
    x = x,
    probability = dhyper(x, m = 20, n = 30, k = k),
    sample_size = as.factor(k)
  )
}))

ggplot(comparison_data, aes(x = x, y = probability, fill = sample_size)) +
  geom_col(position = "dodge") +
  labs(title = "Hypergeometric Distribution by Sample Size",
       x = "Number of Successes", y = "Probability",
       fill = "Sample Size") +
  theme_minimal()

Statistical Testing and Hypothesis Testing

Fisher’s exact test uses the hypergeometric distribution for contingency table analysis:

# Example: Drug effectiveness study
# Treatment group: 12 improved, 8 not improved
# Control group: 5 improved, 15 not improved

# Create contingency table
drug_data <- matrix(c(12, 8, 5, 15), nrow = 2,
                    dimnames = list(c("Treatment", "Control"),
                                   c("Improved", "Not Improved")))
print(drug_data)

# Fisher's exact test
fisher.test(drug_data)

# Manual calculation using hypergeometric
# P-value is sum of probabilities for tables as extreme or more extreme
total_improved <- 12 + 5
total_not_improved <- 8 + 15
treatment_total <- 12 + 8

# Probability of observed table
observed_prob <- dhyper(12, total_improved, total_not_improved, treatment_total)
cat("P(observed table):", observed_prob, "\n")

Common Pitfalls and Best Practices

Pitfall 1: Confusing Parameter Order

# WRONG: Confusing parameters
# dhyper(x, population_size, successes, draws) # INCORRECT

# CORRECT: 
# dhyper(x, successes_in_pop, failures_in_pop, draws)
dhyper(3, m = 20, n = 30, k = 10)  # Use named parameters

Pitfall 2: When to Use Binomial Approximation

When sample size is less than 5% of population, use binomial:

# Large population scenario
N <- 10000
K <- 2000
n <- 100

# Hypergeometric (exact)
system.time({
  hyper_result <- dhyper(20, m = K, n = N - K, k = n)
})

# Binomial approximation (faster)
system.time({
  binom_result <- dbinom(20, size = n, prob = K / N)
})

cat("Difference:", abs(hyper_result - binom_result), "\n")

Pitfall 3: Edge Cases

# Handle impossible scenarios
dhyper(10, m = 5, n = 45, k = 8)  # Returns 0 (can't draw 10 from 5)

# Boundary conditions
dhyper(5, m = 5, n = 45, k = 5)   # Drawing all successes

The hypergeometric distribution is indispensable for scenarios involving sampling without replacement. Master these R functions, understand when approximations are appropriate, and always validate your parameter inputs to avoid subtle errors in your statistical analyses.