How to Perform a Two-Proportion Z-Test in R

You have two groups. You want to know if they convert, respond, or succeed at different rates. This is the two-proportion z-test, and it's one of the most practical statistical tools you'll use.

Key Insights

  • The two-proportion z-test compares success rates between two groups and is the statistical backbone of A/B testing, clinical trials, and survey analysis—use prop.test() in base R for most cases
  • Always check your sample size assumptions: each group needs at least 10 successes and 10 failures for the normal approximation to hold; otherwise, use Fisher’s exact test
  • Yates’ continuity correction (enabled by default in prop.test()) makes the test more conservative—disable it with correct = FALSE when you have large samples and want results closer to the manual z-test calculation

Introduction to Two-Proportion Z-Tests

Real-world applications are everywhere:

  • A/B testing: Did version B of your landing page convert better than version A?
  • Clinical trials: Is the treatment group’s recovery rate higher than the control?
  • Survey analysis: Do men and women respond differently to a policy question?

The test works by comparing observed proportions to what you’d expect if both groups had the same underlying rate. If the difference is large enough relative to sampling variability, you reject the null hypothesis.

Assumptions you must satisfy:

  1. Independent observations within and between groups
  2. Random sampling from each population
  3. Large enough samples: typically n₁p₁ ≥ 10, n₁(1-p₁) ≥ 10, and the same for group 2
  4. Binary outcome (success/failure)

Violate these assumptions, and your p-values become meaningless.
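A quick way to screen the sample-size assumption before testing is to count successes and failures per group. Below is a minimal sketch; `check_np_rule` is a hypothetical helper name, not part of base R:

```r
# Sketch: screen the success/failure count rule of thumb for the
# normal approximation. `check_np_rule` is a made-up helper name.
check_np_rule <- function(x, n, min_count = 10) {
  counts <- c(successes = x, failures = n - x)
  ok <- all(counts >= min_count)
  if (!ok) message("Counts below ", min_count, "; consider fisher.test().")
  ok
}

check_np_rule(x = c(45, 62), n = c(200, 250))  # TRUE: approximation is reasonable
check_np_rule(x = c(3, 8), n = c(15, 12))      # FALSE: use an exact test
```

This mirrors the n·p ≥ 10 rule above, using observed counts in place of the true proportions.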

The Mathematics Behind the Test

Understanding the formula helps you interpret results and troubleshoot problems.

Hypotheses:

  • H₀: p₁ = p₂ (proportions are equal)
  • H₁: p₁ ≠ p₂ (two-tailed) or p₁ > p₂ / p₁ < p₂ (one-tailed)

The z-statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}$$

Where $\hat{p}$ is the pooled proportion:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Here’s how to calculate it manually in R:

# Sample data
x1 <- 45   # successes in group 1
n1 <- 200  # total in group 1
x2 <- 62   # successes in group 2
n2 <- 250  # total in group 2

# Sample proportions
p1_hat <- x1 / n1  # 0.225
p2_hat <- x2 / n2  # 0.248

# Pooled proportion under H0
p_pooled <- (x1 + x2) / (n1 + n2)  # 0.238

# Standard error
se <- sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))

# Z-statistic
z_stat <- (p1_hat - p2_hat) / se
print(paste("Z-statistic:", round(z_stat, 4)))
# [1] "Z-statistic: -0.5695"

# Two-tailed p-value
p_value <- 2 * pnorm(-abs(z_stat))
print(paste("P-value:", round(p_value, 4)))
# [1] "P-value: 0.569"

The p-value of 0.569 tells us there’s no significant difference between the groups. The observed difference of 2.3 percentage points could easily occur by chance.

Using prop.test() in Base R

Don’t calculate z-statistics manually in production code. Use prop.test(), which handles the math and gives you confidence intervals.

# Same data as before
result <- prop.test(
  x = c(45, 62),      # successes in each group
  n = c(200, 250),    # sample sizes
  alternative = "two.sided",
  correct = TRUE      # Yates' continuity correction (default)
)

print(result)

Output:

	2-sample test for equality of proportions with continuity correction

data:  c(45, 62) out of c(200, 250)
X-squared = 0.20982, df = 1, p-value = 0.6469
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1063349  0.0603349
sample estimates:
prop 1 prop 2 
 0.225  0.248 

Key points about the output:

  1. prop.test() reports a chi-squared statistic, not z. They’re related: χ² = z². Here, √0.20982 ≈ 0.458, which is smaller than our manual z (0.5695) because of the continuity correction.

  2. The confidence interval (-0.106, 0.060) contains zero, confirming no significant difference.

  3. The p-value (0.647) is higher than our manual calculation due to Yates’ correction.

When to disable continuity correction:

# Without Yates' correction - closer to manual calculation
result_no_correction <- prop.test(
  x = c(45, 62),
  n = c(200, 250),
  correct = FALSE
)
# X-squared = 0.32431, p-value = 0.569

Use correct = FALSE when:

  • Sample sizes are large (both > 100)
  • You want results matching manual z-test calculations
  • You’re comparing to software that doesn’t apply correction

Keep the correction for smaller samples or when you want conservative estimates.
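The χ² = z² relationship is easy to verify in code. This sketch re-runs the uncorrected test and recovers |z| from the reported statistic:

```r
# sqrt of the uncorrected chi-squared statistic recovers the manual |z|
result_no_correction <- prop.test(x = c(45, 62), n = c(200, 250), correct = FALSE)
z_recovered <- sqrt(unname(result_no_correction$statistic))
round(z_recovered, 4)  # equals the magnitude of the manual z-statistic
```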

Performing an Exact Test for Small Samples

When sample sizes are small or expected cell counts fall below 5, the normal approximation breaks down. Use Fisher’s exact test instead.

# Small sample scenario
# Group A: 3 successes out of 15
# Group B: 8 successes out of 12

# Create contingency table
contingency_table <- matrix(
  c(3, 12,    # Group A: successes, failures
    8, 4),    # Group B: successes, failures
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    Group = c("A", "B"),
    Outcome = c("Success", "Failure")
  )
)

print(contingency_table)
#      Outcome
# Group Success Failure
#     A       3      12
#     B       8       4

# Fisher's exact test
fisher_result <- fisher.test(contingency_table)
print(fisher_result)

Output:

	Fisher's Exact Test for Count Data

data:  contingency_table
p-value = 0.02199
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.01675024 0.80816498
sample estimates:
odds ratio 
 0.1329825 

The p-value of 0.022 indicates a significant difference. Fisher’s test reports an odds ratio (0.133) rather than a difference in proportions—Group A has about 87% lower odds of success compared to Group B.
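The odds-ratio framing is easy to reproduce from the raw counts. The sample (unconditional) odds ratio below differs slightly from the conditional maximum-likelihood estimate that fisher.test() reports:

```r
# Sample odds ratio from the 2x2 counts
odds_a <- 3 / 12   # odds of success in Group A
odds_b <- 8 / 4    # odds of success in Group B
sample_or <- odds_a / odds_b
sample_or          # 0.125, close to fisher.test()'s conditional estimate
```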

When to use Fisher’s exact test:

  • Any expected cell count < 5
  • Total sample size < 40
  • When you want exact p-values without approximation assumptions
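To apply the expected-count rule in practice, you can inspect the expected counts that chisq.test() computes under independence (suppressing its small-sample warning):

```r
# Expected cell counts under independence for the small-sample table
tab <- matrix(c(3, 12, 8, 4), nrow = 2, byrow = TRUE)
expected <- suppressWarnings(chisq.test(tab, correct = FALSE)$expected)
expected
any(expected < 5)  # TRUE here, so Fisher's exact test is the safer choice
```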

Practical Example: A/B Test Analysis

Let’s work through a complete A/B test scenario. Your company ran two email campaigns and wants to know if the new design (B) outperforms the control (A).

# Campaign data
campaign_a <- list(
  sent = 5000,
  conversions = 127
)

campaign_b <- list(
  sent = 4800,
  conversions = 156
)

# Calculate conversion rates
rate_a <- campaign_a$conversions / campaign_a$sent
rate_b <- campaign_b$conversions / campaign_b$sent

cat("Campaign A conversion rate:", sprintf("%.2f%%", rate_a * 100), "\n")
cat("Campaign B conversion rate:", sprintf("%.2f%%", rate_b * 100), "\n")
cat("Absolute difference:", sprintf("%.2f%%", (rate_b - rate_a) * 100), "\n")
cat("Relative lift:", sprintf("%.1f%%", ((rate_b - rate_a) / rate_a) * 100), "\n")

Output:

Campaign A conversion rate: 2.54% 
Campaign B conversion rate: 3.25% 
Absolute difference: 0.71% 
Relative lift: 28.0% 

Now test if this difference is statistically significant:

# Two-tailed test: is there any difference?
ab_test_two_tailed <- prop.test(
  x = c(campaign_a$conversions, campaign_b$conversions),
  n = c(campaign_a$sent, campaign_b$sent),
  alternative = "two.sided",
  correct = FALSE
)

print(ab_test_two_tailed)

# One-tailed test: is B specifically better than A?
ab_test_one_tailed <- prop.test(
  x = c(campaign_b$conversions, campaign_a$conversions),  # Note: B first
  n = c(campaign_b$sent, campaign_a$sent),
  alternative = "greater",
  correct = FALSE
)

cat("\nOne-tailed p-value (B > A):", round(ab_test_one_tailed$p.value, 4), "\n")

Output:

	2-sample test for equality of proportions without continuity correction

X-squared = 4.4022, df = 1, p-value = 0.03589
95 percent confidence interval:
 -0.013747067 -0.000452933
sample estimates:
   prop 1    prop 2 
0.02540000 0.03250000 

One-tailed p-value (B > A): 0.0179

Interpretation: The two-tailed p-value (0.036) is below 0.05, indicating a statistically significant difference. The one-tailed test (0.018) provides stronger evidence that B specifically outperforms A. The 95% CI for the difference (prop 1 minus prop 2, i.e., A minus B: -1.37 to -0.05 percentage points) excludes zero, confirming significance.
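For automated reporting, pull these quantities straight from the returned htest object instead of reading the printed output. A sketch using the same inputs (the variable names here are my own):

```r
# Extract p-value, CI, and effect size programmatically
ab <- prop.test(x = c(127, 156), n = c(5000, 4800), correct = FALSE)
p_val     <- ab$p.value
ci        <- ab$conf.int                       # CI for prop 1 - prop 2 (A minus B)
rate_diff <- unname(ab$estimate[1] - ab$estimate[2])
significant <- p_val < 0.05 && (ci[1] > 0 || ci[2] < 0)  # CI excludes zero
significant
```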

Visualizing Results

Clear visualization helps stakeholders understand your findings.

library(ggplot2)

# Prepare data for plotting
plot_data <- data.frame(
  Campaign = c("A (Control)", "B (New Design)"),
  Rate = c(rate_a, rate_b),
  Lower = c(
    prop.test(campaign_a$conversions, campaign_a$sent)$conf.int[1],
    prop.test(campaign_b$conversions, campaign_b$sent)$conf.int[1]
  ),
  Upper = c(
    prop.test(campaign_a$conversions, campaign_a$sent)$conf.int[2],
    prop.test(campaign_b$conversions, campaign_b$sent)$conf.int[2]
  )
)

# Create plot
ggplot(plot_data, aes(x = Campaign, y = Rate, fill = Campaign)) +
  geom_col(width = 0.6, alpha = 0.8) +
  geom_errorbar(
    aes(ymin = Lower, ymax = Upper),
    width = 0.2,
    linewidth = 0.8
  ) +
  geom_text(
    aes(label = sprintf("%.2f%%", Rate * 100)),
    vjust = -0.5,
    size = 4
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.05),
    expand = c(0, 0)
  ) +
  scale_fill_manual(values = c("#6c757d", "#28a745")) +
  labs(
    title = "Email Campaign Conversion Rates",
    subtitle = "Two-proportion z-test: p = 0.036*",
    y = "Conversion Rate",
    x = NULL,
    caption = "Error bars show 95% confidence intervals"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.major.x = element_blank()
  )

This produces a clean bar chart with confidence intervals that clearly shows the difference between campaigns and the uncertainty around each estimate.

Common Pitfalls and Best Practices

Sample size and power: Before running your test, calculate the required sample size. Underpowered tests miss real effects.

# Power analysis for two proportions
power.prop.test(
  p1 = 0.025,           # expected proportion in group 1
  p2 = 0.032,           # minimum detectable effect
  power = 0.80,         # desired power
  sig.level = 0.05,
  alternative = "two.sided"
)
# n = 8868.9 per group

Multiple comparisons: Testing multiple variants inflates false positive rates. If you’re comparing 5 campaigns, apply Bonferroni correction (divide α by number of comparisons) or use p.adjust():

p_values <- c(0.035, 0.12, 0.048, 0.003)
p.adjust(p_values, method = "bonferroni")
# [1] 0.140 0.480 0.192 0.012

Reporting results properly: Include effect sizes, confidence intervals, and sample sizes—not just p-values. A complete report looks like:

Campaign B showed a significantly higher conversion rate (3.25%) compared to Campaign A (2.54%), with an absolute difference of 0.71 percentage points (95% CI: 0.05 to 1.37 percentage points). This represents a 28.0% relative improvement (χ² = 4.40, p = 0.036, n = 9,800 total).

The two-proportion z-test is straightforward but powerful. Use prop.test() for standard analyses, Fisher’s exact test for small samples, and always verify your assumptions before trusting the results.
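Those rules of thumb can be bundled into one dispatcher. This is a sketch under my own naming conventions (`compare_props` is not a standard function), choosing the test from the expected cell counts:

```r
# Sketch: pick prop.test() or fisher.test() based on expected counts
compare_props <- function(x, n) {
  tab <- rbind(c(x[1], n[1] - x[1]),
               c(x[2], n[2] - x[2]))
  expected <- suppressWarnings(chisq.test(tab, correct = FALSE)$expected)
  if (any(expected < 5)) fisher.test(tab) else prop.test(x = x, n = n)
}

compare_props(c(45, 62), c(200, 250))$method  # z-test branch
compare_props(c(3, 8), c(15, 12))$method      # Fisher branch
```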
