Bernoulli Distribution in R: Complete Guide

Key Insights

  • The Bernoulli distribution models single binary trials with parameter p, while the Binomial distribution models multiple independent Bernoulli trials—use rbinom(n, size=1, prob=p) for Bernoulli sampling in R
  • Maximum likelihood estimation of p from Bernoulli data is simply the sample mean, with confidence intervals calculated using prop.test() or the Wilson score interval for better small-sample performance
  • Visualizing Bernoulli distributions requires discrete probability mass functions shown as bar charts, not continuous density curves—critical for proper interpretation in A/B testing and quality control applications

Introduction to Bernoulli Distribution

The Bernoulli distribution is the simplest discrete probability distribution, modeling a single trial with exactly two possible outcomes: success (1) or failure (0). Named after Swiss mathematician Jacob Bernoulli, this distribution forms the foundation for understanding more complex discrete distributions.

A Bernoulli trial is characterized by a single parameter p, representing the probability of success, where 0 ≤ p ≤ 1. The probability of failure is consequently (1-p), often denoted as q.

Real-world applications are abundant. Every time a user clicks an ad (success) or doesn’t (failure), a manufactured part passes quality inspection or fails, or a patient responds to treatment or doesn’t—you’re observing a Bernoulli trial. Understanding this distribution is essential for A/B testing, quality control, medical trials, and conversion rate optimization.

# Single Bernoulli trial simulation
set.seed(123)
p <- 0.3  # 30% probability of success

# Generate one trial
trial_result <- rbinom(n = 1, size = 1, prob = p)
cat("Trial result:", trial_result, "\n")

# Simulate 10 independent trials
trials <- rbinom(n = 10, size = 1, prob = p)
cat("10 trials:", trials, "\n")
cat("Number of successes:", sum(trials), "\n")

Mathematical Foundation

The probability mass function (PMF) for a Bernoulli random variable X is remarkably simple:

  • P(X = 1) = p
  • P(X = 0) = 1 - p

This can be written compactly as: P(X = x) = p^x × (1-p)^(1-x) for x ∈ {0, 1}
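The compact form translates directly into a one-line R function (a quick sketch; the function name is my own, and it is vectorized over x for x in {0, 1}):

```r
# Compact Bernoulli PMF: p^x * (1-p)^(1-x), valid for x in {0, 1}
bernoulli_pmf <- function(x, p) p^x * (1 - p)^(1 - x)

bernoulli_pmf(c(0, 1), 0.3)  # 0.7 0.3 -- agrees with dbinom(c(0, 1), 1, 0.3)
```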

The expected value (mean) is E[X] = p, which intuitively makes sense—if you run many trials with p = 0.3, you’d expect about 30% successes.

The variance is Var(X) = p(1-p), maximized when p = 0.5 (maximum uncertainty) and minimized when p approaches 0 or 1 (near certainty).

Bernoulli vs Binomial: The Bernoulli distribution models a single trial, while the Binomial distribution B(n, p) models the number of successes in n independent Bernoulli trials. A Bernoulli distribution is simply B(1, p).
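The B(1, p) relationship can be checked empirically: summing n independent Bernoulli(1, p) draws should behave like a single Binomial(n, p) draw. A quick simulation sketch:

```r
# Sum of n independent Bernoulli(p) trials follows Binomial(n, p)
set.seed(1)
n <- 10
p <- 0.3

bernoulli_sums <- replicate(10000, sum(rbinom(n, size = 1, prob = p)))
binomial_draws <- rbinom(10000, size = n, prob = p)

# Both empirical means should be close to n * p = 3
mean(bernoulli_sums)
mean(binomial_draws)
```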

# Manual PMF calculation
calculate_bernoulli_pmf <- function(x, p) {
  if (x == 0) return(1 - p)
  if (x == 1) return(p)
  return(0)  # P(X = x) = 0 for any x outside {0, 1}
}

# Test different p values
p_values <- c(0.2, 0.5, 0.8)
for (p in p_values) {
  cat(sprintf("p = %.1f: P(X=0) = %.2f, P(X=1) = %.2f\n", 
              p, calculate_bernoulli_pmf(0, p), calculate_bernoulli_pmf(1, p)))
}

# Verify variance formula
p <- 0.3
theoretical_var <- p * (1 - p)
set.seed(123)
simulated_trials <- rbinom(100000, 1, p)
empirical_var <- var(simulated_trials)
cat(sprintf("Theoretical variance: %.4f\n", theoretical_var))
cat(sprintf("Empirical variance: %.4f\n", empirical_var))

Generating Bernoulli Random Variables in R

R doesn’t have a dedicated Bernoulli function, but several approaches work well. The most common approach uses rbinom() with size = 1.

set.seed(42)
n <- 1000
p <- 0.4

# Method 1: rbinom (recommended)
method1 <- rbinom(n, size = 1, prob = p)

# Method 2: sample with replacement
method2 <- sample(c(0, 1), size = n, replace = TRUE, prob = c(1-p, p))

# Method 3: runif with threshold
method3 <- as.integer(runif(n) < p)

# Verify all methods produce similar results
cat("Method 1 mean:", mean(method1), "\n")
cat("Method 2 mean:", mean(method2), "\n")
cat("Method 3 mean:", mean(method3), "\n")

# Performance comparison
library(microbenchmark)
microbenchmark(
  rbinom = rbinom(n, 1, p),
  sample = sample(c(0, 1), n, TRUE, c(1-p, p)),
  runif = as.integer(runif(n) < p),
  times = 1000
)

The rbinom() method is the most idiomatic and performs well. Use sample() when you need explicit control over the outcome labels or weighted sampling among more than two values. The runif() threshold approach is instructive (it makes the underlying uniform-draw mechanism explicit) but offers no practical advantage.

Probability Functions

R’s binomial probability functions work seamlessly for Bernoulli distributions by setting size=1.

p <- 0.35

# Probability mass function: P(X = x)
prob_success <- dbinom(1, size = 1, prob = p)
prob_failure <- dbinom(0, size = 1, prob = p)
cat(sprintf("P(X=1) = %.3f, P(X=0) = %.3f\n", prob_success, prob_failure))

# Cumulative distribution function: P(X ≤ x)
cdf_0 <- pbinom(0, size = 1, prob = p)  # P(X ≤ 0) = P(X = 0)
cdf_1 <- pbinom(1, size = 1, prob = p)  # P(X ≤ 1) = 1
cat(sprintf("P(X ≤ 0) = %.3f, P(X ≤ 1) = %.3f\n", cdf_0, cdf_1))

# Quantile function: smallest x where P(X ≤ x) ≥ q
q_25 <- qbinom(0.25, size = 1, prob = p)
q_75 <- qbinom(0.75, size = 1, prob = p)
cat(sprintf("25th percentile: %d, 75th percentile: %d\n", q_25, q_75))

# Create lookup table for various p values
p_range <- seq(0.1, 0.9, by = 0.1)
results <- data.frame(
  p = p_range,
  P_X_0 = dbinom(0, 1, p_range),
  P_X_1 = dbinom(1, 1, p_range),
  variance = p_range * (1 - p_range)
)
print(results)

Practical Statistical Analysis

The maximum likelihood estimator (MLE) for p from Bernoulli data is the sample proportion: p̂ = (sum of successes) / n.

# Generate sample data
set.seed(100)
true_p <- 0.6
n <- 50
sample_data <- rbinom(n, 1, true_p)

# Estimate p (MLE)
p_hat <- mean(sample_data)
cat(sprintf("True p: %.2f, Estimated p: %.2f\n", true_p, p_hat))

# Standard error
se <- sqrt(p_hat * (1 - p_hat) / n)
cat(sprintf("Standard error: %.4f\n", se))

# 95% Confidence interval using prop.test (Wilson score with continuity correction)
test_result <- prop.test(sum(sample_data), n)
ci <- test_result$conf.int
cat(sprintf("95%% CI: [%.3f, %.3f]\n", ci[1], ci[2]))

# Normal approximation CI (less accurate for small n or extreme p)
z <- qnorm(0.975)
ci_normal <- c(p_hat - z * se, p_hat + z * se)
cat(sprintf("95%% CI (normal approx): [%.3f, %.3f]\n", ci_normal[1], ci_normal[2]))

# Hypothesis test: H0: p = 0.5 vs H1: p ≠ 0.5
test_h0 <- prop.test(sum(sample_data), n, p = 0.5)
cat(sprintf("Test p = 0.5: p-value = %.4f\n", test_h0$p.value))

The Wilson score interval from prop.test() performs better than the normal approximation, especially with small samples or probabilities near 0 or 1.
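The coverage difference can be checked with a short simulation. This is an illustrative sketch, not an exhaustive study; n = 20 and p = 0.05 are my own choices of a small sample with an extreme probability:

```r
# Coverage check: Wilson (prop.test) vs normal-approximation 95% CIs
set.seed(7)
n <- 20
true_p <- 0.05
n_sims <- 5000
z <- qnorm(0.975)

covers <- function(ci) ci[1] <= true_p && true_p <= ci[2]
wilson_hits <- 0
normal_hits <- 0
for (i in 1:n_sims) {
  x <- sum(rbinom(n, 1, true_p))
  p_hat <- x / n
  wilson_ci <- prop.test(x, n)$conf.int
  se <- sqrt(p_hat * (1 - p_hat) / n)
  normal_ci <- c(p_hat - z * se, p_hat + z * se)
  wilson_hits <- wilson_hits + covers(wilson_ci)
  normal_hits <- normal_hits + covers(normal_ci)
}
cat(sprintf("Wilson coverage: %.3f\n", wilson_hits / n_sims))
cat(sprintf("Normal coverage: %.3f\n", normal_hits / n_sims))
```

In settings like this, the normal-approximation interval typically covers the true p far less often than the nominal 95%, while the Wilson-based interval stays much closer to it.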

Visualization Techniques

Effective visualization of Bernoulli distributions requires discrete representations, not continuous curves.

library(ggplot2)

# Single distribution PMF
p <- 0.4
pmf_data <- data.frame(
  x = c(0, 1),
  probability = c(1-p, p)
)

ggplot(pmf_data, aes(x = factor(x), y = probability)) +
  geom_col(fill = "steelblue", width = 0.6) +
  geom_text(aes(label = sprintf("%.2f", probability)), vjust = -0.5) +
  ylim(0, 1) +
  labs(title = sprintf("Bernoulli Distribution (p = %.2f)", p),
       x = "Outcome", y = "Probability") +
  theme_minimal()

# Compare multiple distributions
p_values <- c(0.2, 0.5, 0.8)
comparison_data <- data.frame()
for (p in p_values) {
  temp <- data.frame(
    x = c(0, 1),
    probability = c(1-p, p),
    p_value = factor(p)
  )
  comparison_data <- rbind(comparison_data, temp)
}

ggplot(comparison_data, aes(x = factor(x), y = probability, fill = p_value)) +
  geom_col(position = "dodge") +
  labs(title = "Bernoulli Distributions with Different p Values",
       x = "Outcome", y = "Probability", fill = "p") +
  theme_minimal()

# Simulation vs theoretical
set.seed(200)
p <- 0.35
n_sims <- 1000
simulated <- rbinom(n_sims, 1, p)

sim_df <- data.frame(outcome = simulated)
ggplot(sim_df, aes(x = factor(outcome))) +
  geom_bar(aes(y = after_stat(count)/sum(after_stat(count))), 
           fill = "coral", alpha = 0.7) +
  geom_hline(yintercept = c(1-p, p), linetype = "dashed", color = "blue") +
  annotate("text", x = 1, y = 1-p, label = sprintf("Theoretical: %.2f", 1-p), 
           vjust = -0.5, color = "blue") +
  annotate("text", x = 2, y = p, label = sprintf("Theoretical: %.2f", p), 
           vjust = -0.5, color = "blue") +
  labs(title = "Simulated vs Theoretical Bernoulli Distribution",
       x = "Outcome", y = "Proportion") +
  theme_minimal()

Real-World Application: A/B Testing Case Study

Let’s analyze a complete A/B test scenario for a website button redesign.

# Scenario: testing a new button design
# Control group: 1000 visitors; treatment group: 1000 visitors
set.seed(2024)

control_conversion_rate <- 0.12  # 12% baseline
treatment_conversion_rate <- 0.15  # 15% with new design

# Generate data
control <- rbinom(1000, 1, control_conversion_rate)
treatment <- rbinom(1000, 1, treatment_conversion_rate)

# Summary statistics
control_rate <- mean(control)
treatment_rate <- mean(treatment)
cat(sprintf("Control conversion: %.2f%%\n", control_rate * 100))
cat(sprintf("Treatment conversion: %.2f%%\n", treatment_rate * 100))
cat(sprintf("Absolute lift: %.2f%%\n", (treatment_rate - control_rate) * 100))
cat(sprintf("Relative lift: %.1f%%\n", 
            ((treatment_rate - control_rate) / control_rate) * 100))

# Statistical test
test <- prop.test(c(sum(treatment), sum(control)), c(length(treatment), length(control)))
cat(sprintf("p-value: %.4f\n", test$p.value))
cat(sprintf("95%% CI for difference: [%.4f, %.4f]\n", 
            test$conf.int[1], test$conf.int[2]))

# Visualization
results_df <- data.frame(
  group = rep(c("Control", "Treatment"), each = 1000),
  conversion = c(control, treatment)
)

ggplot(results_df, aes(x = group, fill = factor(conversion))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("0" = "gray70", "1" = "green4"),
                    labels = c("No Conversion", "Conversion")) +
  labs(title = "A/B Test Results: Button Redesign",
       x = "Group", y = "Percentage", fill = "Outcome") +
  theme_minimal()

# Decision framework
alpha <- 0.05
if (test$p.value < alpha && treatment_rate > control_rate) {
  cat("\nDecision: Implement new design (statistically significant improvement)\n")
} else {
  cat("\nDecision: Keep current design (no significant improvement detected)\n")
}

This complete workflow demonstrates data generation, parameter estimation, hypothesis testing, visualization, and decision-making—the essential components of applied Bernoulli distribution analysis. The framework applies equally to quality control (defect/no defect), medical trials (response/no response), or any binary outcome scenario where understanding and quantifying uncertainty drives better decisions.