Hypergeometric Distribution in Python: Complete Guide
Key Insights
- The hypergeometric distribution models sampling without replacement, making it essential for quality control, A/B testing with small populations, and any scenario where selections affect subsequent probabilities.
- SciPy’s hypergeom provides a complete toolkit for probability calculations, but understanding the underlying math helps you validate results and debug edge cases.
- When the sample size is less than 5% of the population, the binomial distribution approximates the hypergeometric well; for smaller populations, the distinction matters significantly.
Introduction to the Hypergeometric Distribution
The hypergeometric distribution answers a specific question: if you draw items from a finite population without replacement, what’s the probability of getting exactly k successes?
This differs fundamentally from the binomial distribution. With binomial, each trial is independent—like flipping a coin. With hypergeometric, each draw changes the composition of what remains. Draw an ace from a deck, and your probability of drawing another ace drops from 4/52 to 3/51.
Real-world applications include:
- Quality control: Testing a sample of products from a batch for defects
- Card games: Calculating hand probabilities in poker or Magic: The Gathering
- Ecological studies: Estimating wildlife populations via capture-recapture
- A/B testing: When your test population is small enough that sampling affects proportions
Understanding when to use hypergeometric versus binomial prevents subtle statistical errors that compound in production systems.
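The ace example above can be checked with exact fractions (a small sketch using only the standard library):

```python
from fractions import Fraction

# Probability the first card drawn is an ace: 4 aces in 52 cards
p_first = Fraction(4, 52)

# Given an ace was drawn, only 3 aces remain among 51 cards
p_second = Fraction(3, 51)

# Under with-replacement (binomial) assumptions the second draw would
# still be 4/52; the hypergeometric model captures this drop.
print(p_first, p_second)  # 1/13 1/17
```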
Mathematical Foundation
The hypergeometric distribution has three parameters:
- N: Total population size
- K: Number of success states in the population
- n: Number of draws (sample size)
The probability mass function (PMF) calculates the probability of exactly k successes:
$$P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}$$
The numerator counts favorable outcomes: ways to choose k successes from K available, multiplied by ways to choose the remaining (n-k) items from the (N-K) failures. The denominator counts total possible samples.
Mean and variance have clean formulas:
- Mean: $\mu = n \cdot \frac{K}{N}$
- Variance: $\sigma^2 = n \cdot \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1}$
Notice the variance includes a finite population correction factor $\frac{N-n}{N-1}$. This term shrinks variance as sample size approaches population size—when you sample the entire population, there’s no uncertainty.
from math import comb

def hypergeom_pmf(k, N, K, n):
    """
    Calculate hypergeometric PMF manually.

    Parameters:
    - k: number of successes we want
    - N: population size
    - K: number of success states in population
    - n: number of draws
    """
    # Outside the support of the distribution, the probability is zero
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    numerator = comb(K, k) * comb(N - K, n - k)
    denominator = comb(N, n)
    return numerator / denominator
# Example: Drawing 5 cards, what's P(exactly 2 aces)?
N = 52 # cards in deck
K = 4 # aces in deck
n = 5 # cards drawn
k = 2 # aces we want
prob = hypergeom_pmf(k, N, K, n)
print(f"P(exactly 2 aces in 5-card hand) = {prob:.6f}")
# Output: P(exactly 2 aces in 5-card hand) = 0.039930
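The mean and variance formulas above can also be sanity-checked by enumerating the full PMF and comparing the moments against the closed forms (a quick self-contained sketch reusing math.comb):

```python
from math import comb

N, K, n = 52, 4, 5  # deck size, aces, cards drawn

def pmf(k):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Enumerate the support and compute moments directly
support = range(max(0, n - (N - K)), min(n, K) + 1)
mean = sum(k * pmf(k) for k in support)
var = sum((k - mean) ** 2 * pmf(k) for k in support)

# Closed-form values from the formulas above
mean_formula = n * K / N
var_formula = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))

print(f"mean: {mean:.6f} vs {mean_formula:.6f}")
print(f"variance: {var:.6f} vs {var_formula:.6f}")
```

Both pairs agree to floating-point precision, confirming the finite population correction term.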
Implementing with SciPy
SciPy’s scipy.stats.hypergeom handles all hypergeometric calculations. Note the parameter order: hypergeom(M, n, N) where M is population size, n is success states, and N is draws. This differs from some textbook conventions.
from scipy.stats import hypergeom
# Same example: 5-card hand, probability of aces
M = 52 # population size (SciPy calls this M)
n = 4 # success states in population (aces)
N = 5 # number of draws
# Create distribution object
dist = hypergeom(M, n, N)
# Probability of exactly 2 aces
print(f"P(X = 2) = {dist.pmf(2):.6f}")
# Probability of at most 2 aces (CDF)
print(f"P(X <= 2) = {dist.cdf(2):.6f}")
# Probability of more than 2 aces (survival function)
print(f"P(X > 2) = {dist.sf(2):.6f}")
# Distribution statistics
print(f"Mean: {dist.mean():.4f}")
print(f"Variance: {dist.var():.4f}")
print(f"Standard deviation: {dist.std():.4f}")
Output:
P(X = 2) = 0.039930
P(X <= 2) = 0.998245
P(X > 2) = 0.001755
Mean: 0.3846
Variance: 0.3272
Standard deviation: 0.5720
You can also calculate probabilities across multiple values efficiently:
import numpy as np
# Get PMF for all possible outcomes
k_values = np.arange(0, min(n, N) + 1)
probabilities = dist.pmf(k_values)
for k, p in zip(k_values, probabilities):
    print(f"P(X = {k}) = {p:.6f}")
Visualization Techniques
Visualizing the distribution reveals its shape and how parameters affect outcomes.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import hypergeom
def plot_hypergeom_pmf(M, n, N, ax=None, title=None):
    """Plot hypergeometric PMF."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 5))
    dist = hypergeom(M, n, N)
    # Restrict to the support of the distribution
    k_values = np.arange(max(0, N - (M - n)), min(N, n) + 1)
    probabilities = dist.pmf(k_values)
    ax.bar(k_values, probabilities, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_xlabel('Number of Successes (k)')
    ax.set_ylabel('Probability')
    ax.set_title(title or f'Hypergeometric(M={M}, n={n}, N={N})')
    ax.set_xticks(k_values)
    # Add mean line
    ax.axvline(dist.mean(), color='red', linestyle='--', label=f'Mean = {dist.mean():.2f}')
    ax.legend()
    return ax
# Single distribution plot
plot_hypergeom_pmf(M=52, n=4, N=5, title='Aces in 5-Card Hand')
plt.tight_layout()
plt.show()
Comparing different parameter combinations shows how the distribution shifts:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Varying sample size
plot_hypergeom_pmf(100, 20, 10, axes[0, 0], 'N=100, K=20, n=10')
plot_hypergeom_pmf(100, 20, 30, axes[0, 1], 'N=100, K=20, n=30')
# Varying success proportion
plot_hypergeom_pmf(100, 10, 20, axes[1, 0], 'N=100, K=10, n=20 (10% success)')
plot_hypergeom_pmf(100, 50, 20, axes[1, 1], 'N=100, K=50, n=20 (50% success)')
plt.tight_layout()
plt.show()
Practical Applications
Quality Control Scenario
A manufacturer produces batches of 500 components. Quality control tests 30 items per batch. If a batch contains 25 defective items, what’s the probability of detecting at least one defect?
from scipy.stats import hypergeom
# QC parameters
batch_size = 500
defective_count = 25
sample_size = 30
dist = hypergeom(batch_size, defective_count, sample_size)
# P(at least 1 defect) = 1 - P(0 defects)
prob_detect = 1 - dist.pmf(0)
print(f"Probability of detecting at least 1 defect: {prob_detect:.4f}")
# What sample size ensures 95% detection probability?
for n in range(10, 100):
    dist_n = hypergeom(batch_size, defective_count, n)
    if 1 - dist_n.pmf(0) >= 0.95:
        print(f"Sample size needed for 95% detection: {n}")
        break
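Since the sample here is 6% of the batch, just above the 5% rule of thumb, the binomial approximation understates the detection probability slightly. A quick comparison (a sketch using scipy.stats.binom alongside the exact model):

```python
from scipy.stats import hypergeom, binom

batch_size, defective_count, sample_size = 500, 25, 30

exact = hypergeom(batch_size, defective_count, sample_size)
approx = binom(sample_size, defective_count / batch_size)

# Detection probability: 1 - P(no defects in the sample)
p_detect_hyper = 1 - exact.pmf(0)
p_detect_binom = 1 - approx.pmf(0)

print(f"Hypergeometric: {p_detect_hyper:.4f}")
print(f"Binomial approx: {p_detect_binom:.4f}")
# Sampling without replacement removes inspected items from the pool,
# so the exact detection probability is a bit higher than the approximation.
```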
Hypothesis Testing with Fisher’s Exact Test
The hypergeometric distribution underlies Fisher’s exact test for contingency tables. Consider testing whether a treatment affects outcomes:
from scipy.stats import hypergeom
# Contingency table:
# Success Failure Total
# Treatment 8 2 10
# Control 3 7 10
# Total 11 9 20
# Under null hypothesis: treatment has no effect
# We ask: P(seeing 8+ successes in treatment group by chance)
N = 20 # total subjects
K = 11 # total successes
n = 10 # treatment group size
observed = 8 # successes in treatment
dist = hypergeom(N, K, n)
# One-tailed p-value: P(X >= 8)
p_value = dist.sf(observed - 1) # sf gives P(X > x), so use observed-1
print(f"One-tailed p-value: {p_value:.4f}")
# Compare with scipy's fisher_exact
from scipy.stats import fisher_exact
table = [[8, 2], [3, 7]]
odds_ratio, p_fisher = fisher_exact(table, alternative='greater')
print(f"Fisher's exact p-value: {p_fisher:.4f}")
Simulation and Sampling
NumPy provides efficient hypergeometric sampling for Monte Carlo simulations:
import numpy as np
from scipy.stats import hypergeom
# Parameters
ngood = 4 # aces
nbad = 48 # non-aces
nsample = 5 # cards drawn
# Generate 100,000 samples
np.random.seed(42)
samples = np.random.hypergeometric(ngood, nbad, nsample, size=100000)
# Compare simulated vs theoretical probabilities
theoretical_dist = hypergeom(ngood + nbad, ngood, nsample)
print("k | Simulated | Theoretical | Difference")
print("-" * 45)
for k in range(min(ngood, nsample) + 1):
    simulated = np.mean(samples == k)
    theoretical = theoretical_dist.pmf(k)
    diff = abs(simulated - theoretical)
    print(f"{k} | {simulated:.6f} | {theoretical:.6f} | {diff:.6f}")
Output:
k | Simulated | Theoretical | Difference
---------------------------------------------
0 | 0.658640 | 0.658842 | 0.000202
1 | 0.299290 | 0.299474 | 0.000184
2 | 0.039900 | 0.039930 | 0.000030
3 | 0.002130 | 0.001736 | 0.000394
4 | 0.000040 | 0.000018 | 0.000022
The simulated probabilities converge to theoretical values, validating both our understanding and implementation.
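If you already have a frozen hypergeom object, SciPy offers an equivalent sampler via its rvs method (a minimal sketch of the same deck example):

```python
import numpy as np
from scipy.stats import hypergeom

# Same deck example: population 52, 4 aces, 5 cards drawn
dist = hypergeom(52, 4, 5)

# random_state makes the draw reproducible
samples = dist.rvs(size=100_000, random_state=42)

print(f"Simulated mean: {samples.mean():.4f} (theoretical {dist.mean():.4f})")
```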
Summary and Further Resources
The hypergeometric distribution is your tool when sampling without replacement from finite populations. Key decision points:
Use hypergeometric when:
- Sample size exceeds 5% of population
- Exact probabilities matter (quality control acceptance criteria)
- Population is small (card games, small-batch manufacturing)
Use binomial approximation when:
- Sample size is less than 5% of population
- Computational simplicity matters
- Population is effectively infinite
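The 5% rule of thumb can be demonstrated directly by comparing PMFs at two sampling fractions (a sketch; the particular population sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import hypergeom, binom

def max_pmf_gap(N, K, n):
    """Largest absolute PMF difference between the hypergeometric
    distribution and its binomial approximation with p = K/N."""
    k = np.arange(0, n + 1)
    return np.max(np.abs(hypergeom(N, K, n).pmf(k) - binom(n, K / N).pmf(k)))

# 1% sampling fraction: the approximation is tight
print(f"n/N = 1%:  max gap = {max_pmf_gap(2000, 400, 20):.5f}")

# 25% sampling fraction: the distinction matters
print(f"n/N = 25%: max gap = {max_pmf_gap(80, 16, 20):.5f}")
```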
For more complex scenarios, explore:
- Multivariate hypergeometric: Multiple categories of items (e.g., drawing cards of different suits)
- Negative hypergeometric: Sampling until reaching a fixed number of successes
- Fisher’s noncentral hypergeometric: When items have unequal selection probabilities
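For the multivariate case, recent SciPy versions (1.6+) ship scipy.stats.multivariate_hypergeom. A sketch for a 5-card hand with a specified suit breakdown:

```python
from scipy.stats import multivariate_hypergeom

# Deck split into four suits of 13 cards each
suit_counts = [13, 13, 13, 13]

# P(exactly 2 spades, 2 hearts, 1 diamond, 0 clubs in a 5-card hand)
p = multivariate_hypergeom.pmf(x=[2, 2, 1, 0], m=suit_counts, n=5)
print(f"P = {p:.6f}")  # P = 0.030432
```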
The hypergeometric distribution appears simpler than it is. Master it, and you’ll catch errors that slip past engineers who default to binomial for everything.