Hypergeometric Distribution in Python: Complete Guide
Key Insights
- The hypergeometric distribution models sampling without replacement, making it essential for quality control, A/B testing with small populations, and any scenario where selections affect subsequent probabilities.
- SciPy’s hypergeom provides a complete toolkit for probability calculations, but understanding the underlying math helps you validate results and debug edge cases.
- When the sample size is less than 5% of the population, the binomial distribution approximates the hypergeometric well; for smaller populations, the distinction matters significantly.
Introduction to the Hypergeometric Distribution
The hypergeometric distribution answers a specific question: if you draw items from a finite population without replacement, what’s the probability of getting exactly k successes?
This differs fundamentally from the binomial distribution. With binomial, each trial is independent—like flipping a coin. With hypergeometric, each draw changes the composition of what remains. Draw an ace from a deck, and your probability of drawing another ace drops from 4/52 to 3/51.
Real-world applications include:
- Quality control: Testing a sample of products from a batch for defects
- Card games: Calculating hand probabilities in poker or Magic: The Gathering
- Ecological studies: Estimating wildlife populations via capture-recapture
- A/B testing: When your test population is small enough that sampling affects proportions
Understanding when to use hypergeometric versus binomial prevents subtle statistical errors that compound in production systems.
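The ace example above can be checked with exact fractions (a small sketch using only the standard library):

```python
from fractions import Fraction

# Probability the first card drawn is an ace: 4 aces in 52 cards
p_first = Fraction(4, 52)

# Given an ace was drawn, only 3 aces remain among 51 cards
p_second = Fraction(3, 51)

# Under with-replacement (binomial) assumptions the second draw would
# still be 4/52; the hypergeometric model captures this drop.
print(p_first, p_second)  # 1/13 1/17
```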
Mathematical Foundation
The hypergeometric distribution has three parameters:
- N: Total population size
- K: Number of success states in the population
- n: Number of draws (sample size)
The probability mass function (PMF) calculates the probability of exactly k successes:
$$P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}$$
The numerator counts favorable outcomes: ways to choose k successes from K available, multiplied by ways to choose the remaining (n-k) items from the (N-K) failures. The denominator counts total possible samples.
Mean and variance have clean formulas:
- Mean: $\mu = n \cdot \frac{K}{N}$
- Variance: $\sigma^2 = n \cdot \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1}$
Notice the variance includes a finite population correction factor $\frac{N-n}{N-1}$. This term shrinks variance as sample size approaches population size—when you sample the entire population, there’s no uncertainty.
from math import comb

def hypergeom_pmf(k, N, K, n):
    """
    Calculate hypergeometric PMF manually.

    Parameters:
    - k: number of successes we want
    - N: population size
    - K: number of success states in population
    - n: number of draws
    """
    # Outside the support of the distribution, the probability is zero
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    numerator = comb(K, k) * comb(N - K, n - k)
    denominator = comb(N, n)
    return numerator / denominator
# Example: Drawing 5 cards, what's P(exactly 2 aces)?
N = 52 # cards in deck
K = 4 # aces in deck
n = 5 # cards drawn
k = 2 # aces we want
prob = hypergeom_pmf(k, N, K, n)
print(f"P(exactly 2 aces in 5-card hand) = {prob:.6f}")
# Output: P(exactly 2 aces in 5-card hand) = 0.039930
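The mean and variance formulas above can also be sanity-checked by enumerating the full PMF and comparing the moments against the closed forms (a quick self-contained sketch reusing math.comb):

```python
from math import comb

N, K, n = 52, 4, 5  # deck size, aces, cards drawn

def pmf(k):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Enumerate the support and compute moments directly
support = range(max(0, n - (N - K)), min(n, K) + 1)
mean = sum(k * pmf(k) for k in support)
var = sum((k - mean) ** 2 * pmf(k) for k in support)

# Closed-form values from the formulas above
mean_formula = n * K / N
var_formula = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))

print(f"mean: {mean:.6f} vs {mean_formula:.6f}")
print(f"variance: {var:.6f} vs {var_formula:.6f}")
```

Both pairs agree to floating-point precision, confirming the finite population correction term.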
Implementing with SciPy
SciPy’s scipy.stats.hypergeom handles all hypergeometric calculations. Note the parameter order: hypergeom(M, n, N) where M is population size, n is success states, and N is draws. This differs from some textbook conventions.
from scipy.stats import hypergeom
# Same example: 5-card hand, probability of aces
M = 52 # population size (SciPy calls this M)
n = 4 # success states in population (aces)
N = 5 # number of draws
# Create distribution object
dist = hypergeom(M, n, N)
# Probability of exactly 2 aces
print(f"P(X = 2) = {dist.pmf(2):.6f}")
# Probability of at most 2 aces (CDF)
print(f"P(X <= 2) = {dist.cdf(2):.6f}")
# Probability of more than 2 aces (survival function)
print(f"P(X > 2) = {dist.sf(2):.6f}")
# Distribution statistics
print(f"Mean: {dist.mean():.4f}")
print(f"Variance: {dist.var():.4f}")
print(f"Standard deviation: {dist.std():.4f}")
Output:
P(X = 2) = 0.039930
P(X <= 2) = 0.998245
P(X > 2) = 0.001755
Mean: 0.3846
Variance: 0.3272
Standard deviation: 0.5720
You can also calculate probabilities across multiple values efficiently:
import numpy as np
# Get PMF for all possible outcomes
k_values = np.arange(0, min(n, N) + 1)
probabilities = dist.pmf(k_values)
for k, p in zip(k_values, probabilities):
    print(f"P(X = {k}) = {p:.6f}")
Visualization Techniques
Visualizing the distribution reveals its shape and how parameters affect outcomes.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import hypergeom
def plot_hypergeom_pmf(M, n, N, ax=None, title=None):
    """Plot hypergeometric PMF."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 5))
    dist = hypergeom(M, n, N)
    # Restrict to the support of the distribution
    k_values = np.arange(max(0, N - (M - n)), min(N, n) + 1)
    probabilities = dist.pmf(k_values)
    ax.bar(k_values, probabilities, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_xlabel('Number of Successes (k)')
    ax.set_ylabel('Probability')
    ax.set_title(title or f'Hypergeometric(M={M}, n={n}, N={N})')
    ax.set_xticks(k_values)
    # Add mean line
    ax.axvline(dist.mean(), color='red', linestyle='--', label=f'Mean = {dist.mean():.2f}')
    ax.legend()
    return ax
# Single distribution plot
plot_hypergeom_pmf(M=52, n=4, N=5, title='Aces in 5-Card Hand')
plt.tight_layout()
plt.show()
Comparing different parameter combinations shows how the distribution shifts:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Varying sample size
plot_hypergeom_pmf(100, 20, 10, axes[0, 0], 'N=100, K=20, n=10')
plot_hypergeom_pmf(100, 20, 30, axes[0, 1], 'N=100, K=20, n=30')
# Varying success proportion
plot_hypergeom_pmf(100, 10, 20, axes[1, 0], 'N=100, K=10, n=20 (10% success)')
plot_hypergeom_pmf(100, 50, 20, axes[1, 1], 'N=100, K=50, n=20 (50% success)')
plt.tight_layout()
plt.show()
Practical Applications
Quality Control Scenario
A manufacturer produces batches of 500 components. Quality control tests 30 items per batch. If a batch contains 25 defective items, what’s the probability of detecting at least one defect?
from scipy.stats import hypergeom
# QC parameters
batch_size = 500
defective_count = 25
sample_size = 30
dist = hypergeom(batch_size, defective_count, sample_size)
# P(at least 1 defect) = 1 - P(0 defects)
prob_detect = 1 - dist.pmf(0)
print(f"Probability of detecting at least 1 defect: {prob_detect:.4f}")
# What sample size ensures 95% detection probability?
for n in range(10, 100):
    dist_n = hypergeom(batch_size, defective_count, n)
    if 1 - dist_n.pmf(0) >= 0.95:
        print(f"Sample size needed for 95% detection: {n}")
        break
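Since the sample here is 6% of the batch, just above the 5% rule of thumb, the binomial approximation understates the detection probability slightly. A quick comparison (a sketch using scipy.stats.binom alongside the exact model):

```python
from scipy.stats import hypergeom, binom

batch_size, defective_count, sample_size = 500, 25, 30

exact = hypergeom(batch_size, defective_count, sample_size)
approx = binom(sample_size, defective_count / batch_size)

# Detection probability: 1 - P(no defects in the sample)
p_detect_hyper = 1 - exact.pmf(0)
p_detect_binom = 1 - approx.pmf(0)

print(f"Hypergeometric: {p_detect_hyper:.4f}")
print(f"Binomial approx: {p_detect_binom:.4f}")
# Sampling without replacement removes inspected items from the pool,
# so the exact detection probability is a bit higher than the approximation.
```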
Hypothesis Testing with Fisher’s Exact Test
The hypergeometric distribution underlies Fisher’s exact test for contingency tables. Consider testing whether a treatment affects outcomes:
from scipy.stats import hypergeom
# Contingency table:
# Success Failure Total
# Treatment 8 2 10
# Control 3 7 10
# Total 11 9 20
# Under null hypothesis: treatment has no effect
# We ask: P(seeing 8+ successes in treatment group by chance)
N = 20 # total subjects
K = 11 # total successes
n = 10 # treatment group size
observed = 8 # successes in treatment
dist = hypergeom(N, K, n)
# One-tailed p-value: P(X >= 8)
p_value = dist.sf(observed - 1) # sf gives P(X > x), so use observed-1
print(f"One-tailed p-value: {p_value:.4f}")
# Compare with scipy's fisher_exact
from scipy.stats import fisher_exact
table = [[8, 2], [3, 7]]
odds_ratio, p_fisher = fisher_exact(table, alternative='greater')
print(f"Fisher's exact p-value: {p_fisher:.4f}")
Simulation and Sampling
NumPy provides efficient hypergeometric sampling for Monte Carlo simulations:
import numpy as np
from scipy.stats import hypergeom
# Parameters
ngood = 4 # aces
nbad = 48 # non-aces
nsample = 5 # cards drawn
# Generate 100,000 samples
np.random.seed(42)
samples = np.random.hypergeometric(ngood, nbad, nsample, size=100000)
# Compare simulated vs theoretical probabilities
theoretical_dist = hypergeom(ngood + nbad, ngood, nsample)
print("k | Simulated | Theoretical | Difference")
print("-" * 45)
for k in range(min(ngood, nsample) + 1):
    simulated = np.mean(samples == k)
    theoretical = theoretical_dist.pmf(k)
    diff = abs(simulated - theoretical)
    print(f"{k} | {simulated:.6f} | {theoretical:.6f} | {diff:.6f}")
Output:
k | Simulated | Theoretical | Difference
---------------------------------------------
0 | 0.658640 | 0.658842 | 0.000202
1 | 0.299290 | 0.299474 | 0.000184
2 | 0.039900 | 0.039930 | 0.000030
3 | 0.002130 | 0.001736 | 0.000394
4 | 0.000040 | 0.000018 | 0.000022
The simulated probabilities converge to theoretical values, validating both our understanding and implementation.
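If you already have a frozen hypergeom object, SciPy offers an equivalent sampler via its rvs method (a minimal sketch of the same deck example):

```python
import numpy as np
from scipy.stats import hypergeom

# Same deck example: population 52, 4 aces, 5 cards drawn
dist = hypergeom(52, 4, 5)

# random_state makes the draw reproducible
samples = dist.rvs(size=100_000, random_state=42)

print(f"Simulated mean: {samples.mean():.4f} (theoretical {dist.mean():.4f})")
```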
Summary and Further Resources
The hypergeometric distribution is your tool when sampling without replacement from finite populations. Key decision points:
Use hypergeometric when:
- Sample size exceeds 5% of population
- Exact probabilities matter (quality control acceptance criteria)
- Population is small (card games, small-batch manufacturing)
Use binomial approximation when:
- Sample size is less than 5% of population
- Computational simplicity matters
- Population is effectively infinite
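The 5% rule of thumb can be demonstrated directly by comparing PMFs at two sampling fractions (a sketch; the particular population sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import hypergeom, binom

def max_pmf_gap(N, K, n):
    """Largest absolute PMF difference between the hypergeometric
    distribution and its binomial approximation with p = K/N."""
    k = np.arange(0, n + 1)
    return np.max(np.abs(hypergeom(N, K, n).pmf(k) - binom(n, K / N).pmf(k)))

# 1% sampling fraction: the approximation is tight
print(f"n/N = 1%:  max gap = {max_pmf_gap(2000, 400, 20):.5f}")

# 25% sampling fraction: the distinction matters
print(f"n/N = 25%: max gap = {max_pmf_gap(80, 16, 20):.5f}")
```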
For more complex scenarios, explore:
- Multivariate hypergeometric: Multiple categories of items (e.g., drawing cards of different suits)
- Negative hypergeometric: Sampling until reaching a fixed number of successes
- Fisher’s noncentral hypergeometric: When items have unequal selection probabilities
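For the multivariate case, recent SciPy versions (1.6+) ship scipy.stats.multivariate_hypergeom. A sketch for a 5-card hand with a specified suit breakdown:

```python
from scipy.stats import multivariate_hypergeom

# Deck split into four suits of 13 cards each
suit_counts = [13, 13, 13, 13]

# P(exactly 2 spades, 2 hearts, 1 diamond, 0 clubs in a 5-card hand)
p = multivariate_hypergeom.pmf(x=[2, 2, 1, 0], m=suit_counts, n=5)
print(f"P = {p:.6f}")  # P = 0.030432
```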
The hypergeometric distribution appears simpler than it is. Master it, and you’ll catch errors that slip past engineers who default to binomial for everything.