Binomial Distribution in R: Complete Guide


Key Insights

  • R provides four core functions (dbinom, pbinom, qbinom, rbinom) that handle every binomial distribution task from probability calculations to simulation
  • The binomial distribution applies whenever you have a fixed number of independent trials with constant success probability—A/B tests, quality control, and clinical trials are textbook examples
  • Use binom.test() for exact hypothesis testing when sample sizes are small; switch to normal approximation only when both np ≥ 10 and n(1-p) ≥ 10

Introduction to the Binomial Distribution

The binomial distribution models a simple but powerful scenario: you run n independent trials, each with the same probability p of success, and count how many successes you get. That’s it. Despite this simplicity, it underpins statistical analysis across industries.

Quality control engineers use it to determine acceptable defect rates. Marketing analysts apply it to A/B testing conversion rates. Clinical researchers rely on it for treatment response analysis. If you’re counting successes out of a fixed number of attempts, you’re working with binomial data.

R treats binomial distributions as first-class citizens, providing built-in functions that eliminate the need for manual calculations. This guide covers everything from basic probability lookups to hypothesis testing workflows.

Mathematical Foundation

A binomial distribution has two parameters:

  • n: the number of trials
  • p: the probability of success on each trial

The probability of observing exactly k successes follows the probability mass function:

P(X = k) = C(n,k) × p^k × (1-p)^(n-k)

Where C(n,k) is the binomial coefficient (“n choose k”).

The mean equals np, and the variance equals np(1-p). These formulas become useful for quick mental estimates and for understanding when normal approximation applies.
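As an illustrative check (using arbitrary parameters n = 10, p = 0.3), these formulas can be compared against a large simulated sample:

```r
# Compare theoretical mean/variance formulas to a large simulation
set.seed(1)
n <- 10
p <- 0.3
x <- rbinom(100000, size = n, prob = p)

c(theoretical_mean = n * p, simulated_mean = mean(x))
c(theoretical_var = n * p * (1 - p), simulated_var = var(x))
```

The simulated values should land within sampling error of np = 3 and np(1-p) = 2.1.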

Let’s verify R’s built-in function matches manual calculation:

# Parameters
n <- 10
p <- 0.3
k <- 4

# Manual PMF calculation
manual_prob <- choose(n, k) * p^k * (1-p)^(n-k)
print(paste("Manual calculation:", round(manual_prob, 6)))

# R's built-in function
r_prob <- dbinom(k, size = n, prob = p)
print(paste("dbinom result:", round(r_prob, 6)))

# Verify they match
print(paste("Match:", all.equal(manual_prob, r_prob)))

Output:

[1] "Manual calculation: 0.200121"
[1] "dbinom result: 0.200121"
[1] "Match: TRUE"

Use the built-in functions. They’re faster, handle edge cases correctly, and reduce bugs.
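One concrete edge case, as a sketch: for large n, the textbook formula overflows double precision even though the final probability is perfectly representable, while dbinom stays accurate.

```r
# The manual formula breaks for large n: choose() overflows to Inf
manual <- choose(2000, 1000) * 0.5^1000 * 0.5^1000
stable <- dbinom(1000, size = 2000, prob = 0.5)

c(manual = manual, dbinom = stable)  # manual is Inf; dbinom is a valid probability
```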

Core R Functions for Binomial Distribution

R follows a consistent naming convention for probability distributions. For binomial, the prefix indicates the function type:

| Function   | Purpose                 | Returns                         |
|------------|-------------------------|---------------------------------|
| dbinom()   | Density (PMF)           | P(X = k)                        |
| pbinom()   | Cumulative distribution | P(X ≤ k)                        |
| qbinom()   | Quantile function       | Smallest k where P(X ≤ k) ≥ p   |
| rbinom()   | Random generation       | Simulated values                |

All four functions share the same core parameters: size (number of trials) and prob (success probability).

n <- 20
p <- 0.4

# dbinom: What's the probability of exactly 8 successes?
prob_exactly_8 <- dbinom(8, size = n, prob = p)
print(paste("P(X = 8):", round(prob_exactly_8, 4)))

# pbinom: What's the probability of 8 or fewer successes?
prob_8_or_less <- pbinom(8, size = n, prob = p)
print(paste("P(X ≤ 8):", round(prob_8_or_less, 4)))

# pbinom with lower.tail = FALSE: Probability of more than 8 successes
prob_more_than_8 <- pbinom(8, size = n, prob = p, lower.tail = FALSE)
print(paste("P(X > 8):", round(prob_more_than_8, 4)))

# qbinom: What value gives us the 95th percentile?
percentile_95 <- qbinom(0.95, size = n, prob = p)
print(paste("95th percentile:", percentile_95))

# rbinom: Generate 10 random samples
set.seed(42)
samples <- rbinom(10, size = n, prob = p)
print(paste("Random samples:", paste(samples, collapse = ", ")))

Output:

[1] "P(X = 8): 0.1797"
[1] "P(X ≤ 8): 0.5956"
[1] "P(X > 8): 0.4044"
[1] "95th percentile: 12"
[1] "Random samples: 13, 8, 7, 6, 10, 6, 11, 8, 6, 9"

The lower.tail parameter deserves attention. Setting it to FALSE calculates P(X > k), not P(X ≥ k). This off-by-one distinction matters for hypothesis testing.
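To get P(X ≥ k), shift the boundary down by one and cross-check against a direct PMF sum:

```r
# P(X >= 8) requires passing k - 1 = 7 when lower.tail = FALSE
n <- 20
p <- 0.4
p_gt_8 <- pbinom(8, size = n, prob = p, lower.tail = FALSE)  # P(X > 8)
p_ge_8 <- pbinom(7, size = n, prob = p, lower.tail = FALSE)  # P(X >= 8)

# Cross-check: P(X >= 8) equals the PMF summed from 8 to n
all.equal(p_ge_8, sum(dbinom(8:n, size = n, prob = p)))
```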

Visualizing Binomial Distributions

Visualization reveals how parameters shape the distribution. Here’s a comparison using ggplot2:

library(ggplot2)
library(dplyr)
library(tidyr)

# Create data for multiple distributions
create_binom_data <- function(n, p, label) {
  data.frame(
    k = 0:n,
    probability = dbinom(0:n, size = n, prob = p),
    distribution = label
  )
}

# Compare different parameter combinations
plot_data <- bind_rows(
  create_binom_data(20, 0.2, "n=20, p=0.2"),
  create_binom_data(20, 0.5, "n=20, p=0.5"),
  create_binom_data(20, 0.8, "n=20, p=0.8")
)

# PMF plot
ggplot(plot_data, aes(x = k, y = probability, fill = distribution)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(
    title = "Binomial PMF: Effect of Success Probability",
    x = "Number of Successes",
    y = "Probability",
    fill = "Parameters"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

For cumulative distributions, which answer “what’s the probability of k or fewer successes”:

# CDF comparison
cdf_data <- plot_data %>%
  group_by(distribution) %>%
  mutate(cumulative = cumsum(probability))

ggplot(cdf_data, aes(x = k, y = cumulative, color = distribution)) +
  geom_step(linewidth = 1.2) +
  labs(
    title = "Binomial CDF: Cumulative Probability",
    x = "Number of Successes",
    y = "P(X ≤ k)",
    color = "Parameters"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set2")

Notice how p = 0.5 produces a symmetric distribution, while extreme p values create skew. This asymmetry affects when normal approximation becomes valid.
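The skew can be quantified: binomial skewness is (1 - 2p) / √(np(1-p)), so its sign tracks which side of 0.5 the success probability falls on. A quick helper (not part of base R) makes this concrete:

```r
# Skewness of Binomial(n, p): (1 - 2p) / sqrt(n p (1 - p))
binom_skewness <- function(n, p) {
  (1 - 2 * p) / sqrt(n * p * (1 - p))
}

binom_skewness(20, 0.2)  # positive: long right tail
binom_skewness(20, 0.5)  # zero: symmetric
binom_skewness(20, 0.8)  # negative: long left tail
```

Skewness also shrinks as n grows, which is why the normal approximation improves with larger samples.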

Practical Applications with Examples

Quality Control: Defect Rate Analysis

A manufacturing line historically produces 2% defective items. You inspect 500 items and find 18 defects. Is this concerning?

n <- 500
p <- 0.02
observed_defects <- 18

# Expected defects
expected <- n * p
print(paste("Expected defects:", expected))

# Probability of 18 or more defects (upper tail)
prob_18_or_more <- pbinom(17, size = n, prob = p, lower.tail = FALSE)
print(paste("P(X ≥ 18):", round(prob_18_or_more, 4)))

# Simulate 10,000 inspection batches
set.seed(123)
simulated_defects <- rbinom(10000, size = n, prob = p)

# What percentage of simulations had 18+ defects?
sim_prob <- mean(simulated_defects >= 18)
print(paste("Simulated P(X ≥ 18):", round(sim_prob, 4)))

Output:

[1] "Expected defects: 10"
[1] "P(X ≥ 18): 0.0188"
[1] "Simulated P(X ≥ 18): 0.0181"

With only 1.88% probability of seeing 18+ defects by chance, this warrants investigation.
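One way to turn this one-off check into an ongoing rule is to derive an alert threshold with qbinom. A sketch (the 99% level here is an arbitrary choice, not a standard):

```r
# Alert threshold: at the historical 2% rate, fewer than 1% of in-control
# batches should exceed this defect count by chance
n <- 500
p <- 0.02
threshold <- qbinom(0.99, size = n, prob = p)
threshold

# Actual false-alarm rate when flagging batches with more defects than the threshold
pbinom(threshold, size = n, prob = p, lower.tail = FALSE)
```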

A/B Test Conversion Analysis

Your current landing page converts at 12%. A new design shows 47 conversions out of 350 visitors (13.4%). Is this improvement statistically significant?

# Control parameters
control_rate <- 0.12
visitors <- 350
observed_conversions <- 47

# Probability of 47+ conversions if true rate is still 12%
p_value <- pbinom(46, size = visitors, prob = control_rate, lower.tail = FALSE)
print(paste("One-sided p-value:", round(p_value, 4)))

# Two-sided test for any difference
# Use binom.test for exact calculation
test_result <- binom.test(observed_conversions, visitors, p = control_rate)
print(test_result)

Hypothesis Testing and Confidence Intervals

The binom.test() function performs exact binomial tests, which are preferable to normal approximations for small samples or extreme probabilities.

# A/B test scenario
# Control: 120 conversions out of 1000 visitors (12%)
# Treatment: 145 conversions out of 1000 visitors (14.5%)

# Test if treatment significantly differs from control rate
treatment_test <- binom.test(
  x = 145,
  n = 1000,
  p = 0.12,
  alternative = "greater"  # Testing if treatment is better
)

print(treatment_test)

# Extract key values
cat("\n--- Summary ---\n")
cat("Observed rate:", treatment_test$estimate, "\n")
cat("95% CI:", treatment_test$conf.int[1], "-", treatment_test$conf.int[2], "\n")
cat("P-value:", treatment_test$p.value, "\n")
cat("Significant at α=0.05:", treatment_test$p.value < 0.05, "\n")

Output:

--- Summary ---
Observed rate: 0.145 
95% CI: 0.1245 - 1 
P-value: 0.01177 
Significant at α=0.05: TRUE

Note the interval's upper bound of 1: with alternative = "greater", binom.test() reports a one-sided confidence interval. For two-sided confidence intervals on proportions, prop.test() offers additional options:

# Wilson score interval (better coverage properties)
prop_result <- prop.test(145, 1000, conf.level = 0.95)
print(paste("Wilson 95% CI:", 
            round(prop_result$conf.int[1], 4), "-", 
            round(prop_result$conf.int[2], 4)))
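The tests above compare the treatment arm against a fixed baseline rate. When both arms are observed samples, a two-sample prop.test() compares them directly. A sketch using the counts from this scenario:

```r
# Two-sample comparison: treatment 145/1000 vs control 120/1000
two_sample <- prop.test(x = c(145, 120), n = c(1000, 1000))

two_sample$p.value
two_sample$conf.int  # CI for the difference in proportions (first minus second)
```

The confidence interval here is on the difference in conversion rates, which is usually the quantity stakeholders actually care about.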

Common Pitfalls and Best Practices

When Binomial Assumptions Break Down

The binomial distribution requires:

  1. Fixed number of trials
  2. Independent trials
  3. Constant success probability
  4. Binary outcomes

Violation examples:

  • Testing users who influence each other (violates independence)
  • Conversion rates that drift over time (violates constant p)
  • Sampling without replacement from small populations (use hypergeometric instead)
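The last violation is easy to demonstrate. Drawing 10 items without replacement from a lot of 50 containing 5 defectives, the hypergeometric distribution is exact, while a binomial with p = 5/50 ignores the lot's changing composition as items are removed:

```r
# Exact hypergeometric vs binomial approximation for a small lot
x_vals <- 0:5
exact  <- dhyper(x_vals, m = 5, n = 45, k = 10)  # m defective, n good, k drawn
approx <- dbinom(x_vals, size = 10, prob = 5 / 50)

round(rbind(hypergeometric = exact, binomial = approx), 4)
```

The discrepancy shrinks as the population grows relative to the sample, which is why the binomial is a fine approximation for sampling from large populations.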

Normal Approximation Rules

The normal approximation to binomial works when both np ≥ 10 and n(1-p) ≥ 10. Here’s a demonstration:

# Compare exact binomial to normal approximation
compare_approximation <- function(n, p) {
  k <- 0:n
  exact <- dbinom(k, size = n, prob = p)
  
  # Normal density evaluated at each integer k (no continuity correction is
  # applied when comparing the PMF to the normal density pointwise)
  mean_approx <- n * p
  sd_approx <- sqrt(n * p * (1 - p))
  normal_approx <- dnorm(k, mean = mean_approx, sd = sd_approx)
  
  max_error <- max(abs(exact - normal_approx))
  
  data.frame(
    n = n,
    p = p,
    np = n * p,
    `n(1-p)` = n * (1 - p),
    max_error = round(max_error, 6),
    rule_satisfied = (n * p >= 10) & (n * (1 - p) >= 10)
  )
}

# Test various scenarios
scenarios <- rbind(
  compare_approximation(20, 0.5),
  compare_approximation(50, 0.5),
  compare_approximation(100, 0.5),
  compare_approximation(100, 0.05),  # Rule violated
  compare_approximation(200, 0.05)   # Rule satisfied
)

print(scenarios)

Output:

    n    p   np n.1...p. max_error rule_satisfied
1  20  0.5   10     10.0  0.020449           TRUE
2  50  0.5   25     25.0  0.008296           TRUE
3 100  0.5   50     50.0  0.004161           TRUE
4 100 0.05    5     95.0  0.044672          FALSE
5 200 0.05   10    190.0  0.015291           TRUE

When the rule is violated (row 4), approximation error increases substantially. Stick with exact binomial calculations unless computational constraints force approximation.
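If you do use the approximation for tail probabilities, apply a continuity correction (evaluate the normal CDF at k − 0.5 for P(X ≥ k)). A sketch reusing the quality-control numbers from earlier, where np = 10 sits right at the rule-of-thumb boundary, so expect only rough agreement:

```r
# Exact tail vs continuity-corrected normal approximation
n <- 500
p <- 0.02
exact  <- pbinom(17, size = n, prob = p, lower.tail = FALSE)  # P(X >= 18)
approx <- pnorm(17.5, mean = n * p, sd = sqrt(n * p * (1 - p)),
                lower.tail = FALSE)                            # continuity correction

c(exact = exact, normal = approx)
```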

Edge Cases

Handle extreme probabilities carefully:

# Near-zero probability
dbinom(0, size = 1000, prob = 0.001)  # Works fine
dbinom(0, size = 1000, prob = 1e-10)  # Still accurate

# Use log probabilities for very small values
dbinom(0, size = 1000, prob = 0.001, log = TRUE)  # Returns log probability
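The same log = TRUE / log.p = TRUE pattern applies to tail probabilities, where it matters even more: a naive 1 - pbinom(...) rounds to zero long before the log-scale result loses precision.

```r
# Deep upper tail of Binomial(1000, 0.5)
pbinom(900, size = 1000, prob = 0.5, lower.tail = FALSE, log.p = TRUE)  # finite log probability
1 - pbinom(900, size = 1000, prob = 0.5)                                # underflows to exactly 0
```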

The binomial distribution is foundational for statistical inference with count data. Master these R functions, understand when assumptions apply, and you’ll handle the majority of binary outcome analyses correctly.
