Geometric Distribution in R: Complete Guide

Key Insights

  • The geometric distribution models the number of Bernoulli trials needed until the first success occurs, making it essential for analyzing conversion rates, equipment failures, and time-to-event data where each trial is independent
  • R provides four core functions (dgeom(), pgeom(), qgeom(), rgeom()) that handle probability calculations, but be aware that R uses zero-indexing by default (counting failures before first success) rather than the traditional trial-counting approach
  • Parameter estimation from sample data is straightforward using maximum likelihood (p̂ = 1/mean when counting trials, or p̂ = 1/(mean + 1) when counting failures as R does), but always validate your model with goodness-of-fit tests before making business decisions based on geometric distribution assumptions

Introduction to Geometric Distribution

The geometric distribution answers a fundamental question: “How many trials until we get our first success?” This makes it invaluable for real-world scenarios like determining how many sales calls until you close a deal, how many product batches until a defect appears, or how many ad impressions until a user converts.

Unlike the binomial distribution, which counts successes in a fixed number of trials, the geometric distribution has no predetermined endpoint: you keep going until success occurs. This makes it the go-to distribution for modeling waiting times in discrete processes.

Key applications include:

  • Customer acquisition: Modeling trials-to-conversion in marketing campaigns
  • Quality control: Predicting when the next defective item will appear
  • A/B testing: Analyzing how quickly users complete desired actions
  • Clinical trials: Measuring time until treatment response

library(ggplot2)

# Compare geometric vs binomial visually
set.seed(123)
x <- 0:15

# Geometric: trials until first success (p=0.3)
geom_probs <- dgeom(x, prob = 0.3)

# Binomial: number of successes in 15 trials (p=0.3)
binom_probs <- dbinom(x, size = 15, prob = 0.3)

comparison_data <- data.frame(
  trials = rep(x, 2),
  probability = c(geom_probs, binom_probs),
  distribution = rep(c("Geometric", "Binomial"), each = length(x))
)

ggplot(comparison_data, aes(x = trials, y = probability, fill = distribution)) +
  geom_col(position = "dodge", alpha = 0.7) +
  labs(title = "Geometric vs Binomial Distribution (p = 0.3)",
       x = "Number of Trials/Successes", y = "Probability") +
  theme_minimal()

Mathematical Foundation

The geometric distribution’s probability mass function is elegantly simple:

P(X = k) = (1-p)^(k-1) × p

Where:

  • k = number of trials until first success (k = 1, 2, 3, …)
  • p = probability of success on each trial

Key properties:

  • Expected value: E(X) = 1/p
  • Variance: Var(X) = (1-p)/p²
  • Memoryless property: P(X > n+k | X > n) = P(X > k)

The memoryless property is crucial—it means past failures don’t affect future probabilities. If you’ve made 10 unsuccessful sales calls, your probability of success on call 11 is exactly the same as it was on call 1.
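The memoryless property can be verified numerically with pgeom(). A small sketch (the values of p, n, and k are illustrative; note that P(trials > m) corresponds to pgeom(m - 1, ..., lower.tail = FALSE) under R's failure-counting convention):

```r
# Numeric check of memorylessness: P(X > n + k | X > n) = P(X > k),
# where X is the trial number of the first success
p <- 0.25
n <- 10
k <- 5

# P(trials > m) = pgeom(m - 1, prob = p, lower.tail = FALSE) in R
lhs <- pgeom(n + k - 1, prob = p, lower.tail = FALSE) /
  pgeom(n - 1, prob = p, lower.tail = FALSE)        # conditional probability
rhs <- pgeom(k - 1, prob = p, lower.tail = FALSE)   # unconditional probability

cat("P(X > 15 | X > 10):", lhs, "\n")
cat("P(X > 5):", rhs, "\n")
cat("Memoryless holds:", isTRUE(all.equal(lhs, rhs)), "\n")
```

Both probabilities equal (1 - p)^k, which is exactly why 10 unsuccessful calls tell you nothing about call 11.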

# Manual calculation vs R's built-in functions
p <- 0.25  # 25% success rate
k <- 5     # First success on 5th trial

# Manual calculation (traditional formulation: trial number)
manual_prob <- (1 - p)^(k - 1) * p
cat("Manual calculation (trial", k, "):", manual_prob, "\n")

# R's dgeom uses zero-indexing (counts failures before success)
# For 5th trial, we have 4 failures before success
r_prob <- dgeom(k - 1, prob = p)
cat("R's dgeom(4, prob =", p, "):", r_prob, "\n")

# Verify they match
cat("Match:", all.equal(manual_prob, r_prob), "\n")

# Expected value
expected_trials <- 1 / p
cat("Expected trials until success:", expected_trials, "\n")

# Variance
variance <- (1 - p) / p^2
cat("Variance:", variance, "\n")

Core R Functions for Geometric Distribution

R provides four essential functions for working with geometric distributions. Critical note: R’s implementation counts the number of failures before the first success, not the trial number itself.

# dgeom(): Probability mass function
# P(X = k failures before first success)
p <- 0.2

# Probability of exactly 3 failures before first success (4th trial)
dgeom(3, prob = p)  # Returns: 0.1024

# pgeom(): Cumulative distribution function
# P(X <= k failures before first success)
pgeom(3, prob = p)  # Probability of success within 4 trials
pgeom(3, prob = p, lower.tail = FALSE)  # Probability of more than 4 trials

# qgeom(): Quantile function
# Find number of failures corresponding to given probability
qgeom(0.5, prob = p)  # Median number of failures
qgeom(0.95, prob = p)  # 95th percentile

# rgeom(): Random number generation
# Generate random samples
set.seed(456)
samples <- rgeom(1000, prob = p)
cat("Sample mean:", mean(samples), "| Theoretical:", (1-p)/p, "\n")
cat("Sample variance:", var(samples), "| Theoretical:", (1-p)/p^2, "\n")

# Practical example: probability calculations
# "What's the probability we need more than 10 attempts?"
prob_more_than_10 <- pgeom(9, prob = p, lower.tail = FALSE)
cat("P(more than 10 trials):", prob_more_than_10, "\n")

# "How many attempts until we're 90% sure of success?"
attempts_90pct <- qgeom(0.90, prob = p) + 1  # +1 to convert failures to trials
cat("Trials for 90% confidence:", attempts_90pct, "\n")

Practical Applications and Use Cases

Use Case: Customer Conversion Analysis

Suppose you’re analyzing an email marketing campaign where historically 8% of recipients convert. You want to understand the distribution of emails sent before each conversion.

# Email marketing conversion analysis
conversion_rate <- 0.08
n_campaigns <- 500

# Simulate actual conversion data
set.seed(789)
emails_until_conversion <- rgeom(n_campaigns, prob = conversion_rate)

# Summary statistics
cat("Average emails before conversion:", mean(emails_until_conversion), "\n")
cat("Theoretical average:", (1 - conversion_rate) / conversion_rate, "\n")
cat("Median emails:", median(emails_until_conversion), "\n")
cat("90th percentile:", quantile(emails_until_conversion, 0.90), "\n")

# Business question: What's the probability of needing more than 20 emails?
prob_over_20 <- pgeom(19, prob = conversion_rate, lower.tail = FALSE)
cat("\nProbability of needing >20 emails:", round(prob_over_20, 3), "\n")

# Budget planning: How many emails to send for 95% conversion probability?
emails_95pct <- qgeom(0.95, prob = conversion_rate) + 1
cat("Emails needed for 95% confidence:", emails_95pct, "\n")

# Create analysis dataframe
analysis_df <- data.frame(
  emails_sent = emails_until_conversion + 1,  # Convert failures to trial number
  campaign_id = 1:n_campaigns
)

# Visualize
ggplot(analysis_df, aes(x = emails_sent)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "steelblue", alpha = 0.7) +
  # round() keeps dgeom() on integer support (dgeom() is 0 at non-integer x)
  stat_function(fun = function(x) dgeom(round(x) - 1, prob = conversion_rate),
                color = "red", size = 1) +
  labs(title = "Emails Until Conversion: Observed vs Theoretical",
       x = "Number of Emails Sent", y = "Density") +
  theme_minimal()

Visualization Techniques

Effective visualization helps communicate geometric distribution characteristics to stakeholders.

library(ggplot2)
library(dplyr)

# Compare multiple success probabilities
probs <- c(0.1, 0.25, 0.5)
x_values <- 0:20

viz_data <- expand.grid(failures = x_values, prob = probs) %>%
  mutate(
    probability = dgeom(failures, prob = prob),
    cumulative = pgeom(failures, prob = prob),
    prob_label = paste0("p = ", prob)
  )

# PMF comparison
pmf_plot <- ggplot(viz_data, aes(x = failures, y = probability, 
                                  color = prob_label, group = prob_label)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Geometric Distribution PMF: Effect of Success Probability",
       x = "Number of Failures Before First Success",
       y = "Probability",
       color = "Success Rate") +
  theme_minimal() +
  theme(legend.position = "top")

# CDF comparison
cdf_plot <- ggplot(viz_data, aes(x = failures, y = cumulative, 
                                  color = prob_label, group = prob_label)) +
  geom_step(size = 1) +
  labs(title = "Geometric Distribution CDF",
       x = "Number of Failures",
       y = "Cumulative Probability",
       color = "Success Rate") +
  theme_minimal() +
  theme(legend.position = "top")

print(pmf_plot)
print(cdf_plot)

Statistical Testing and Model Fitting

When working with real data, you need to verify that a geometric distribution is appropriate and estimate parameters accurately.

# Simulate real-world data with some noise
set.seed(101)
true_p <- 0.15
observed_data <- rgeom(200, prob = true_p)

# Maximum likelihood estimation
# For geometric distribution: p_hat = 1 / (mean + 1)
# Note: mean of failures, so add 1 for trials
estimated_p <- 1 / (mean(observed_data) + 1)
cat("True p:", true_p, "| Estimated p:", round(estimated_p, 4), "\n")

# Goodness-of-fit test using Chi-square
# Group data into bins
breaks <- c(0, 5, 10, 15, 20, Inf)
observed_freq <- table(cut(observed_data, breaks = breaks, right = FALSE))

# Calculate expected frequencies
# Bins from cut(..., right = FALSE) are [0,5), [5,10), ... i.e. counts 0-4, 5-9, ...
expected_probs <- diff(pgeom(c(-1, 4, 9, 14, 19, Inf), prob = estimated_p))
expected_freq <- expected_probs * length(observed_data)

# Chi-square test
chi_sq_stat <- sum((as.numeric(observed_freq) - expected_freq)^2 / expected_freq)
df <- length(observed_freq) - 1 - 1  # bins - 1 - estimated parameters
p_value <- pchisq(chi_sq_stat, df, lower.tail = FALSE)

cat("\nGoodness-of-fit test:\n")
cat("Chi-square statistic:", round(chi_sq_stat, 3), "\n")
cat("p-value:", round(p_value, 4), "\n")
cat("Conclusion:", ifelse(p_value > 0.05, "Fail to reject (consistent with geometric)", 
                          "Reject (poor fit)"), "\n")

# Confidence interval for p (using asymptotic normality of the MLE)
n <- length(observed_data)
# Fisher information for the geometric MLE gives Var(p_hat) ~ p^2 * (1 - p) / n
se <- sqrt(estimated_p^2 * (1 - estimated_p) / n)
ci_lower <- estimated_p - 1.96 * se
ci_upper <- estimated_p + 1.96 * se
cat("\n95% CI for p: [", round(ci_lower, 4), ",", round(ci_upper, 4), "]\n")

Common Pitfalls and Best Practices

Indexing Confusion

The most common error is confusion between R’s zero-indexing (counting failures) and the traditional formulation (counting trials).

# Scenario: First success on the 5th trial
p <- 0.3

# WRONG: Using trial number directly
wrong_prob <- dgeom(5, prob = p)
cat("Wrong (using trial number 5):", wrong_prob, "\n")

# CORRECT: Using number of failures (4 failures before 5th trial)
correct_prob <- dgeom(4, prob = p)
cat("Correct (4 failures before success):", correct_prob, "\n")

# Verify with manual calculation
manual <- (1 - p)^4 * p
cat("Manual calculation:", manual, "\n")

# When converting between representations:
trial_number <- 5
failures_before_success <- trial_number - 1
cat("\nTrial", trial_number, "= dgeom(", failures_before_success, ", prob = p)\n")

When NOT to Use Geometric Distribution

The geometric distribution requires:

  1. Independence: Each trial’s outcome doesn’t affect others
  2. Constant probability: Success rate doesn’t change over time
  3. Binary outcomes: Each trial is success or failure

# Example: Inappropriate use case
# Customer purchases where buying behavior changes after first interaction
# This violates the constant probability assumption

# Simulate data where probability increases after each attempt (learning effect)
set.seed(202)
n_trials <- 100
p_initial <- 0.1
p_increase <- 0.05

inappropriate_data <- numeric(n_trials)
for(i in 1:n_trials) {
  current_p <- min(p_initial + (i - 1) * p_increase, 0.9)
  inappropriate_data[i] <- rgeom(1, prob = current_p)
}

# Fit geometric anyway (wrong!)
estimated_p_wrong <- 1 / (mean(inappropriate_data) + 1)

# The fit will be poor because assumption is violated
cat("This model is inappropriate - probability changes over time\n")
cat("Geometric distribution assumes constant p =", round(estimated_p_wrong, 3), "\n")
cat("But actual p ranges from", p_initial, "to", 
    min(p_initial + (n_trials - 1) * p_increase, 0.9), "\n")

Best Practices

  1. Always validate assumptions: Test for independence and constant probability before applying geometric distribution
  2. Use domain knowledge: If success probability changes over time or trials aren’t independent, consider alternative models (negative binomial, Markov chains)
  3. Mind the indexing: Be explicit about whether you’re counting trials or failures
  4. Visualize your data: Compare empirical distributions to theoretical ones before making decisions
  5. Report uncertainty: Include confidence intervals when estimating parameters from data
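Practice 4 can be sketched in a few lines of base R; the simulated data and the 0.2 success rate below are purely illustrative stand-ins for real observations:

```r
# Visual sanity check: empirical failure counts vs the fitted geometric PMF
set.seed(42)
observed <- rgeom(500, prob = 0.2)      # stand-in for real data (failure counts)
p_hat <- 1 / (mean(observed) + 1)       # MLE under R's failure-counting convention

support <- 0:max(observed)
emp <- as.numeric(table(factor(observed, levels = support))) / length(observed)
theo <- dgeom(support, prob = p_hat)

plot(support, emp, type = "h", lwd = 3,
     xlab = "Failures before first success", ylab = "Relative frequency",
     main = "Empirical vs fitted geometric PMF")
points(support, theo, col = "red", pch = 19)
legend("topright", legend = c("Empirical", "Fitted geometric"),
       col = c("black", "red"), lty = c(1, NA), pch = c(NA, 19))
```

Systematic gaps between the bars and the red points, such as too much mass in the tail, suggest overdispersion and point toward a negative binomial alternative.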

The geometric distribution is powerful when used correctly, but like any statistical tool, it requires careful consideration of underlying assumptions and proper implementation in R.
