Geometric Distribution in Python: Complete Guide

Key Insights

  • The geometric distribution models the number of independent Bernoulli trials needed to achieve the first success, making it essential for analyzing waiting times, customer conversion, and quality control scenarios.
  • SciPy’s scipy.stats.geom provides a complete implementation, but understanding the underlying mathematics helps you validate results and handle edge cases in production code.
  • Parameter estimation from real data requires maximum likelihood estimation (MLE), which simplifies to the reciprocal of the sample mean for geometric distributions.

Introduction to Geometric Distribution

The geometric distribution answers a fundamental question: how many attempts until something works? Whether you’re modeling sales calls until a conversion, login attempts until success, or manufacturing inspections until finding a defect, this distribution captures the essence of repeated independent trials.

Unlike the binomial distribution (which counts successes in a fixed number of trials) or the Poisson distribution (which models event counts in a time interval), the geometric distribution focuses on the trial number of the first success. This makes it uniquely suited for scenarios where you care about waiting time measured in discrete attempts.

Real-world applications include:

  • Sales and marketing: Calls until first sale, emails until first response
  • Quality control: Inspections until first defective item
  • Network systems: Packet transmissions until successful delivery
  • A/B testing: Users until first conversion event

Mathematical Foundation

The geometric distribution has a single parameter: the probability of success p on each trial, where 0 < p ≤ 1.

The probability mass function (PMF) gives the probability that the first success occurs on trial k:

P(X = k) = (1 - p)^(k-1) * p

The cumulative distribution function (CDF) gives the probability that the first success occurs on or before trial k. It follows from the fact that failing all of the first k trials has probability (1 - p)^k:

P(X ≤ k) = 1 - (1 - p)^k

Key statistics:

  • Mean (expected value): μ = 1/p
  • Variance: σ² = (1 - p) / p²

Let’s implement these from scratch to solidify understanding:

def geometric_pmf(k: int, p: float) -> float:
    """Calculate P(X = k) for geometric distribution."""
    if k < 1:
        return 0.0
    return ((1 - p) ** (k - 1)) * p


def geometric_cdf(k: int, p: float) -> float:
    """Calculate P(X <= k) for geometric distribution."""
    if k < 1:
        return 0.0
    return 1 - ((1 - p) ** k)


def geometric_mean(p: float) -> float:
    """Calculate expected value E[X]."""
    return 1 / p


def geometric_variance(p: float) -> float:
    """Calculate Var(X)."""
    return (1 - p) / (p ** 2)


# Example: probability of success on each trial is 0.3
p = 0.3

print(f"P(X = 3): {geometric_pmf(3, p):.4f}")
print(f"P(X <= 3): {geometric_cdf(3, p):.4f}")
print(f"Expected trials: {geometric_mean(p):.2f}")
print(f"Variance: {geometric_variance(p):.2f}")

Output:

P(X = 3): 0.1470
P(X <= 3): 0.6570
Expected trials: 3.33
Variance: 7.78

Implementing with SciPy

For production code, use scipy.stats.geom. It’s optimized, well-tested, and handles edge cases properly. Note that SciPy uses the same parameterization as above: k counts total trials and starts at 1 (not the alternative convention of counting failures before the first success).

from scipy.stats import geom
import numpy as np

p = 0.3

# Create a frozen distribution object
dist = geom(p)

# Probability mass function
print(f"P(X = 1): {dist.pmf(1):.4f}")
print(f"P(X = 3): {dist.pmf(3):.4f}")
print(f"P(X = 5): {dist.pmf(5):.4f}")

# Cumulative distribution function
print(f"\nP(X <= 3): {dist.cdf(3):.4f}")
print(f"P(X <= 5): {dist.cdf(5):.4f}")

# Survival function P(X > k) - useful for "at least" questions
print(f"\nP(X > 3): {dist.sf(3):.4f}")

# Inverse CDF (quantile function)
print(f"\n50th percentile: {dist.ppf(0.5):.0f}")
print(f"90th percentile: {dist.ppf(0.9):.0f}")

# Distribution statistics
print(f"\nMean: {dist.mean():.2f}")
print(f"Variance: {dist.var():.2f}")
print(f"Std Dev: {dist.std():.2f}")

# Generate random samples
np.random.seed(42)
samples = dist.rvs(size=10)
print(f"\nRandom samples: {samples}")

Output:

P(X = 1): 0.3000
P(X = 3): 0.1470
P(X = 5): 0.0720

P(X <= 3): 0.6570
P(X <= 5): 0.8319

P(X > 3): 0.3430

50th percentile: 2
90th percentile: 7

Mean: 3.33
Variance: 7.78
Std Dev: 2.79

Random samples: [1 1 1 3 2 3 1 2 5 1]
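SciPy’s standard loc shift (available on all its discrete distributions) converts between the two common parameterizations: the default support starts at k = 1 (trial of the first success), while loc=-1 shifts the support to start at 0 (failures before the first success). A quick sketch confirming the two line up:

```python
from scipy.stats import geom

p = 0.3

# Default parameterization: support starts at k = 1 (trial of first success)
trials = geom(p)

# Shifted with loc=-1: support starts at 0 (failures before first success)
failures = geom(p, loc=-1)

# P(first success on trial k) == P(k - 1 failures before first success)
for k in range(1, 6):
    assert abs(trials.pmf(k) - failures.pmf(k - 1)) < 1e-12

# The shifted mean is exactly one less: (1 - p) / p instead of 1 / p
print(f"Trials mean:   {trials.mean():.4f}")
print(f"Failures mean: {failures.mean():.4f}")
```

Knowing which convention a library (or a textbook) uses avoids an off-by-one error in every downstream calculation.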

Visualization Techniques

Visualizing the geometric distribution reveals how the success probability dramatically affects the shape. Lower probabilities create longer tails, representing scenarios where you might wait many trials.

import matplotlib.pyplot as plt
from scipy.stats import geom
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Different success probabilities to compare
probabilities = [0.1, 0.3, 0.5, 0.7]
colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']
k_values = np.arange(1, 16)

# PMF comparison
ax1 = axes[0]
width = 0.2
for i, p in enumerate(probabilities):
    pmf_values = geom.pmf(k_values, p)
    offset = (i - 1.5) * width
    ax1.bar(k_values + offset, pmf_values, width=width, 
            label=f'p = {p}', color=colors[i], alpha=0.8)

ax1.set_xlabel('Number of Trials (k)', fontsize=11)
ax1.set_ylabel('P(X = k)', fontsize=11)
ax1.set_title('Geometric Distribution PMF', fontsize=12)
ax1.legend()
ax1.set_xticks(k_values)
ax1.grid(axis='y', alpha=0.3)

# CDF comparison
ax2 = axes[1]
k_fine = np.arange(1, 20)
for i, p in enumerate(probabilities):
    cdf_values = geom.cdf(k_fine, p)
    ax2.step(k_fine, cdf_values, where='post',
             label=f'p = {p}', color=colors[i], linewidth=2)

ax2.set_xlabel('Number of Trials (k)', fontsize=11)
ax2.set_ylabel('P(X ≤ k)', fontsize=11)
ax2.set_title('Geometric Distribution CDF', fontsize=12)
ax2.legend()
ax2.grid(alpha=0.3)
ax2.set_ylim(0, 1.05)

plt.tight_layout()
plt.savefig('geometric_distribution.png', dpi=150)
plt.show()

The PMF shows the characteristic exponential decay—each subsequent trial is less likely to be the first success. The CDF demonstrates how quickly you approach certainty of having achieved at least one success.
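The constant-rate decay is easy to verify numerically: each PMF value is exactly (1 - p) times the previous one, which is what produces the exponential-looking bars. A minimal check:

```python
import numpy as np
from scipy.stats import geom

p = 0.3
k = np.arange(1, 11)
pmf = geom.pmf(k, p)

# Successive ratio P(X = k + 1) / P(X = k) is constant at (1 - p)
ratios = pmf[1:] / pmf[:-1]
print(np.allclose(ratios, 1 - p))  # True
```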

Practical Applications with Examples

Let’s model a realistic sales scenario. A sales representative has a 15% conversion rate per call. We want to understand the distribution of calls needed to close a deal.

from scipy.stats import geom
import numpy as np

# Sales call simulation
conversion_rate = 0.15
n_simulations = 10000

np.random.seed(42)

# Simulate many sales cycles
calls_until_sale = geom.rvs(conversion_rate, size=n_simulations)

# Analysis
print("Sales Call Analysis (15% conversion rate)")
print("=" * 45)
print(f"Average calls to first sale: {np.mean(calls_until_sale):.1f}")
print(f"Median calls to first sale: {np.median(calls_until_sale):.1f}")
print(f"Standard deviation: {np.std(calls_until_sale):.1f}")
print(f"Maximum observed: {np.max(calls_until_sale)}")

# Probability questions
dist = geom(conversion_rate)

print(f"\nProbability of sale on first call: {dist.pmf(1):.1%}")
print(f"Probability of sale within 5 calls: {dist.cdf(5):.1%}")
print(f"Probability of needing 10+ calls: {dist.sf(9):.1%}")

# Business planning: calls needed for a 90% chance of at least one sale
calls_90_percent = dist.ppf(0.90)
print(f"\nCalls needed for 90% chance of sale: {calls_90_percent:.0f}")

# Revenue forecasting: if rep makes 20 calls/day
calls_per_day = 20
expected_sales_per_day = calls_per_day * conversion_rate
print(f"\nWith {calls_per_day} calls/day:")
print(f"Expected sales per day: {expected_sales_per_day:.1f}")

# Distribution of sales from simulation
simulated_daily_sales = []
for _ in range(1000):
    daily_calls = geom.rvs(conversion_rate, size=100)
    # Count how many complete sales cycles fit in 20 calls
    sales = np.sum(np.cumsum(daily_calls) <= calls_per_day)
    simulated_daily_sales.append(sales)

print(f"Simulated avg daily sales: {np.mean(simulated_daily_sales):.2f}")
print(f"Days with zero sales: {100 * simulated_daily_sales.count(0) / len(simulated_daily_sales):.1f}%")

Output:

Sales Call Analysis (15% conversion rate)
=============================================
Average calls to first sale: 6.7
Median calls to first sale: 4.0
Standard deviation: 6.0
Maximum observed: 58

Probability of sale on first call: 15.0%
Probability of sale within 5 calls: 55.6%
Probability of needing 10+ calls: 23.2%

Calls needed for 90% chance of sale: 15

With 20 calls/day:
Expected sales per day: 3.0
Simulated avg daily sales: 2.97
Days with zero sales: 3.9%
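The ppf result above can be cross-checked against the closed-form inverse of the CDF: solving 1 - (1 - p)^k ≥ q for the smallest integer k gives k = ⌈log(1 - q) / log(1 - p)⌉. A quick sketch:

```python
import math
from scipy.stats import geom

p = 0.15
q = 0.90

# Closed-form quantile: smallest k with 1 - (1 - p)^k >= q
k_manual = math.ceil(math.log(1 - q) / math.log(1 - p))

# SciPy's quantile function should agree
k_scipy = int(geom(p).ppf(q))
print(k_manual, k_scipy)  # both 15
```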

Parameter Estimation and Fitting

When you have observed data, you need to estimate the underlying success probability. For the geometric distribution, the maximum likelihood estimator is elegantly simple:

p_hat = 1 / sample_mean

Here’s how to fit a geometric distribution to data and validate the fit:

from scipy.stats import geom, chisquare
import numpy as np

# Simulate "observed" data (in practice, this comes from real measurements)
true_p = 0.25
np.random.seed(42)
observed_data = geom.rvs(true_p, size=200)

# MLE estimation
sample_mean = np.mean(observed_data)
estimated_p = 1 / sample_mean

print("Parameter Estimation")
print("=" * 40)
print(f"Sample size: {len(observed_data)}")
print(f"Sample mean: {sample_mean:.3f}")
print(f"Estimated p: {estimated_p:.4f}")
print(f"True p: {true_p:.4f}")
print(f"Estimation error: {abs(estimated_p - true_p):.4f}")

# Goodness-of-fit test using chi-square
# Bin the data and compare observed vs expected frequencies
max_k = int(np.percentile(observed_data, 95))
bins = list(range(1, max_k + 1)) + [np.inf]

observed_freq, _ = np.histogram(observed_data, bins=bins)

# Expected frequencies under fitted distribution
fitted_dist = geom(estimated_p)
expected_prob = [fitted_dist.pmf(k) for k in range(1, max_k)]
expected_prob.append(fitted_dist.sf(max_k - 1))  # P(X >= max_k)
expected_freq = np.array(expected_prob) * len(observed_data)

# Chi-square test (combine bins with expected < 5)
mask = expected_freq >= 5
obs_combined = np.append(observed_freq[mask], observed_freq[~mask].sum())
exp_combined = np.append(expected_freq[mask], expected_freq[~mask].sum())

# ddof=1 accounts for the one parameter (p) estimated from the data
chi2_stat, p_value = chisquare(obs_combined, exp_combined, ddof=1)

print(f"\nGoodness-of-Fit Test")
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Fit acceptable (p > 0.05): {p_value > 0.05}")

Choosing the right distribution matters. Here’s when to use geometric versus alternatives:

  • Geometric: counting trials until the first success
  • Negative binomial: counting trials until r successes
  • Poisson: counting events in a fixed time or space interval
  • Exponential: continuous waiting time until an event

from scipy.stats import geom, nbinom, poisson, expon
import numpy as np

np.random.seed(42)
n_samples = 10000

# Scenario: Events with 20% probability per trial
p = 0.2

# Geometric: trials until first success
geom_samples = geom.rvs(p, size=n_samples)

# Negative binomial: SciPy counts failures before r successes,
# so add r = 3 to convert to total trials until the 3rd success
nbinom_samples = nbinom.rvs(n=3, p=p, size=n_samples) + 3

# For comparison: if we observed 5 events on average
# Poisson models count in fixed interval
poisson_samples = poisson.rvs(mu=5, size=n_samples)

# Exponential: continuous analog (rate = p)
expon_samples = expon.rvs(scale=1/p, size=n_samples)

print("Distribution Comparison")
print("=" * 50)
print(f"{'Distribution':<20} {'Mean':>10} {'Std Dev':>10}")
print("-" * 50)
print(f"{'Geometric':<20} {np.mean(geom_samples):>10.2f} {np.std(geom_samples):>10.2f}")
print(f"{'Neg. Binomial (r=3)':<20} {np.mean(nbinom_samples):>10.2f} {np.std(nbinom_samples):>10.2f}")
print(f"{'Poisson (λ=5)':<20} {np.mean(poisson_samples):>10.2f} {np.std(poisson_samples):>10.2f}")
print(f"{'Exponential':<20} {np.mean(expon_samples):>10.2f} {np.std(expon_samples):>10.2f}")

The geometric distribution is your tool when you need discrete trial counts until a single success. For multiple successes, reach for negative binomial. For continuous time, use exponential. Match the distribution to your data’s nature, not the other way around.
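The geometric-exponential connection can be made concrete: the geometric survival function (1 - p)^k approaches e^(-pk) as p shrinks, so for small per-trial probabilities the discrete and continuous models nearly coincide. A rough numerical check, not a formal limit argument:

```python
import numpy as np
from scipy.stats import geom, expon

# Small per-trial success probability
p = 0.01
k = np.arange(1, 501, 50)

# Discrete survival P(X > k) = (1 - p)^k vs continuous analog e^(-p k)
geom_sf = geom.sf(k, p)
expon_sf = expon.sf(k, scale=1 / p)

# The two curves differ by only a few parts in a thousand
print(np.max(np.abs(geom_sf - expon_sf)))
```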
