How to Perform a Z-Test in Python

Key Insights

  • Z-tests are appropriate for large samples (n ≥ 30) when population variance is known or can be reliably estimated, making them ideal for A/B testing and quality control scenarios.
  • Python’s statsmodels library provides production-ready implementations for one-sample, two-sample, and proportion z-tests through ztest() and proportions_ztest() functions.
  • Always verify assumptions before running a z-test: sample size requirements, normality of sampling distribution, and independence of observations determine whether your results are valid.

Introduction to Z-Tests

A z-test is a statistical hypothesis test that determines whether there’s a significant difference between sample and population means, or between two sample means. The test produces a z-statistic that follows a standard normal distribution under the null hypothesis.

Use a z-test when you have:

  • A large sample size (typically n ≥ 30)
  • Known population variance, or a sample large enough to estimate it reliably
  • Data that’s approximately normally distributed (or a large enough sample for the Central Limit Theorem to apply)

The key distinction from t-tests: z-tests assume you know the population standard deviation. In practice, this means z-tests work best with large samples where sample variance closely approximates population variance. T-tests are more appropriate for small samples with unknown population variance.
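To see why the distinction fades with sample size, compare critical values from the t distribution and the standard normal as n grows. This is an illustrative sketch using scipy:

```python
from scipy import stats

# As n grows, the t distribution's two-tailed critical value approaches
# the z critical value of 1.96, which is why z-tests are a reasonable
# approximation for large samples even when sigma is estimated.
z_crit = stats.norm.ppf(0.975)  # two-tailed, alpha = 0.05
for n in (10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:>4}: t critical = {t_crit:.3f}, z critical = {z_crit:.3f}")
```

By n = 1000 the two critical values agree to two decimal places.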

Types of Z-Tests

One-Sample Z-Test compares a sample mean to a known population mean. Use this when you want to determine if a sample comes from a population with a specific mean. Example: Testing if your website’s average load time differs from the industry standard of 3 seconds.

Two-Sample Z-Test compares means between two independent groups. Use this when comparing two populations or treatment groups. Example: Comparing average order values between customers who received a promotional email versus those who didn’t.

Proportion Z-Test tests hypotheses about population proportions rather than means. This is the workhorse of A/B testing. Example: Determining if a new checkout flow has a different conversion rate than the existing one.

Mathematical Foundation

The z-test formula for a one-sample test is:

z = (x̄ - μ) / (σ / √n)

Where:

  • x̄ = sample mean
  • μ = population mean (null hypothesis value)
  • σ = population standard deviation
  • n = sample size

The resulting z-score tells you how many standard deviations your sample mean is from the hypothesized population mean. A z-score of 1.96 corresponds to the 97.5th percentile, making ±1.96 the critical values for a two-tailed test at α = 0.05.

Here’s how to calculate a z-score manually:

import numpy as np

def calculate_z_score(sample_mean, population_mean, population_std, sample_size):
    """Calculate z-score for a one-sample z-test."""
    standard_error = population_std / np.sqrt(sample_size)
    z_score = (sample_mean - population_mean) / standard_error
    return z_score

# Example: Testing if sample mean differs from population mean
sample_data = np.array([52, 48, 55, 51, 49, 53, 50, 47, 54, 52,
                        51, 48, 53, 50, 49, 52, 51, 54, 48, 50,
                        53, 49, 51, 52, 50, 48, 54, 51, 49, 53])

sample_mean = np.mean(sample_data)
population_mean = 50  # Hypothesized population mean
population_std = 3    # Known population standard deviation
n = len(sample_data)

z = calculate_z_score(sample_mean, population_mean, population_std, n)
print(f"Sample mean: {sample_mean:.2f}")
print(f"Z-score: {z:.4f}")

# Calculate p-value for two-tailed test
from scipy import stats
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"P-value (two-tailed): {p_value:.4f}")

One-Sample Z-Test Implementation

For production code, use statsmodels instead of manual calculations. The library handles edge cases and provides consistent output formatting.

import numpy as np
from statsmodels.stats.weightstats import ztest

# Scenario: A factory claims widgets weigh 100g on average
# We sample 50 widgets to verify this claim
np.random.seed(42)
widget_weights = np.random.normal(loc=101.5, scale=5, size=50)

# Known population standard deviation from historical data
population_std = 5

# One-sample z-test
# H0: μ = 100 (widgets weigh 100g on average)
# H1: μ ≠ 100 (widgets do not weigh 100g on average)
z_statistic, p_value = ztest(widget_weights, value=100)

print("One-Sample Z-Test Results")
print("-" * 40)
print(f"Sample size: {len(widget_weights)}")
print(f"Sample mean: {np.mean(widget_weights):.2f}g")
print(f"Sample std: {np.std(widget_weights, ddof=1):.2f}g")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nReject H0 at α={alpha}: Evidence suggests mean ≠ 100g")
else:
    print(f"\nFail to reject H0 at α={alpha}: No evidence mean ≠ 100g")

Note that statsmodels' ztest() estimates the standard deviation from the sample; it has no parameter for supplying a known population value. If you do know σ from historical data, use the manual approach shown earlier.
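If historical data does give you the true population standard deviation, one option is a small helper along these lines (ztest_known_sigma is a hypothetical name, not a statsmodels function):

```python
import numpy as np
from scipy import stats

def ztest_known_sigma(sample, popmean, sigma):
    """One-sample z-test using a known population standard deviation
    (unlike statsmodels' ztest(), which estimates it from the sample)."""
    n = len(sample)
    z = (np.mean(sample) - popmean) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    return z, p

# Same widget scenario as above, but with sigma fixed at the known value
np.random.seed(42)
widget_weights = np.random.normal(loc=101.5, scale=5, size=50)
z, p = ztest_known_sigma(widget_weights, popmean=100, sigma=5)
print(f"Z-statistic: {z:.4f}, P-value: {p:.4f}")
```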

Two-Sample Z-Test Implementation

The two-sample z-test compares means between two independent groups. This is common in experimental settings where you’re comparing a treatment group to a control group.

import numpy as np
from statsmodels.stats.weightstats import ztest

# Scenario: Comparing response times between two server configurations
np.random.seed(42)

# Server A: Current configuration
server_a_times = np.random.normal(loc=250, scale=30, size=100)  # milliseconds

# Server B: New configuration
server_b_times = np.random.normal(loc=235, scale=28, size=120)  # milliseconds

# Two-sample z-test
# H0: μ_A = μ_B (no difference in response times)
# H1: μ_A ≠ μ_B (response times differ)
z_statistic, p_value = ztest(server_a_times, server_b_times)

print("Two-Sample Z-Test Results")
print("-" * 40)
print(f"Server A - n: {len(server_a_times)}, mean: {np.mean(server_a_times):.2f}ms")
print(f"Server B - n: {len(server_b_times)}, mean: {np.mean(server_b_times):.2f}ms")
print(f"Difference: {np.mean(server_a_times) - np.mean(server_b_times):.2f}ms")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# One-tailed test: Is Server B faster than Server A?
# H0: μ_A ≤ μ_B
# H1: μ_A > μ_B
z_stat_one, p_val_one = ztest(server_a_times, server_b_times, alternative='larger')
print(f"\nOne-tailed test (A > B):")
print(f"P-value: {p_val_one:.4f}")

The alternative parameter accepts 'two-sided' (default), 'larger', or 'smaller' for directional hypotheses.
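As a quick check of how the alternatives relate, this sketch reruns the server comparison under all three; for a positive z-statistic, the 'larger' p-value is half the two-sided one:

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

np.random.seed(42)
server_a_times = np.random.normal(loc=250, scale=30, size=100)
server_b_times = np.random.normal(loc=235, scale=28, size=120)

# Run the same comparison under each alternative hypothesis
results = {alt: ztest(server_a_times, server_b_times, alternative=alt)
           for alt in ('two-sided', 'larger', 'smaller')}

for alt, (z, p) in results.items():
    print(f"{alt:>9}: z = {z:.4f}, p = {p:.4f}")
```

The 'larger' and 'smaller' p-values sum to 1, since they are complementary tails of the same normal distribution.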

Proportion Z-Test

Proportion z-tests are essential for A/B testing in product development. They determine whether conversion rates, click-through rates, or other proportions differ significantly between groups.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Scenario: A/B test for a new checkout button design
# Control (A): Original blue button
# Treatment (B): New green button

# Results after running the experiment
control_conversions = 156
control_visitors = 2000

treatment_conversions = 189
treatment_visitors = 2100

# Prepare data for proportions_ztest
count = np.array([control_conversions, treatment_conversions])
nobs = np.array([control_visitors, treatment_visitors])

# Two-proportion z-test
# H0: p_control = p_treatment (conversion rates are equal)
# H1: p_control ≠ p_treatment (conversion rates differ)
z_statistic, p_value = proportions_ztest(count, nobs)

control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors
relative_lift = (treatment_rate - control_rate) / control_rate * 100

print("A/B Test Results: Checkout Button Experiment")
print("-" * 50)
print(f"Control conversion rate: {control_rate:.2%} ({control_conversions}/{control_visitors})")
print(f"Treatment conversion rate: {treatment_rate:.2%} ({treatment_conversions}/{treatment_visitors})")
print(f"Relative lift: {relative_lift:+.2f}%")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Calculate confidence interval for the difference
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_high = confint_proportions_2indep(
    treatment_conversions, treatment_visitors,
    control_conversions, control_visitors,
    method='wald'
)
print(f"95% CI for difference: [{ci_low:.4f}, {ci_high:.4f}]")

Interpreting Results and Best Practices

The z-statistic measures how far your observed result is from the null expectation, in standard-error units. A z-statistic of 2.5 means your observed difference is 2.5 standard errors from what you’d expect under the null hypothesis.

The p-value represents the probability of observing results at least as extreme as yours, assuming the null hypothesis is true. It does not tell you the probability that the null hypothesis is true.

Choosing significance levels: The standard α = 0.05 isn’t always appropriate. For high-stakes decisions, use α = 0.01. For exploratory analysis where false negatives are costly, α = 0.10 may be acceptable.

Assumptions to verify:

  1. Sample size is adequate (n ≥ 30 for each group)
  2. Observations are independent
  3. For proportion tests, np ≥ 10 and n(1-p) ≥ 10
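The proportion rule of thumb in item 3 is easy to encode as a quick guard before running a test (proportion_test_ok is a hypothetical helper; it uses the observed proportion as the estimate of p):

```python
def proportion_test_ok(successes, n, min_count=10):
    """Rule-of-thumb check for the normal approximation in a proportion
    z-test: expected successes and failures should both be at least 10."""
    p_hat = successes / n
    return n * p_hat >= min_count and n * (1 - p_hat) >= min_count

print(proportion_test_ok(156, 2000))  # True
print(proportion_test_ok(3, 40))      # False: only 3 successes
```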

Here’s a complete workflow that brings everything together:

import numpy as np
from statsmodels.stats.weightstats import ztest
from scipy import stats

def run_z_test_analysis(sample_data, hypothesized_mean, alpha=0.05):
    """
    Complete z-test workflow with assumption checking and interpretation.
    """
    n = len(sample_data)
    sample_mean = np.mean(sample_data)
    sample_std = np.std(sample_data, ddof=1)
    
    # Check assumptions
    print("=" * 60)
    print("Z-TEST ANALYSIS REPORT")
    print("=" * 60)
    
    print("\n1. ASSUMPTION CHECKS")
    print("-" * 40)
    
    # Sample size check
    size_ok = n >= 30
    print(f"   Sample size (n={n}): {'✓ PASS' if size_ok else '✗ FAIL'} (need n ≥ 30)")
    
    # Normality check (Shapiro-Wilk)
    if n <= 5000:  # Shapiro-Wilk has sample size limits
        _, normality_p = stats.shapiro(sample_data)
        normality_ok = normality_p > 0.05
        print(f"   Normality (p={normality_p:.4f}): {'✓ PASS' if normality_ok else '⚠ CHECK'}")
    
    if not size_ok:
        print("\n   WARNING: Consider using a t-test for small samples")
    
    # Hypothesis statement
    print("\n2. HYPOTHESIS")
    print("-" * 40)
    print(f"   H0: μ = {hypothesized_mean}")
    print(f"   H1: μ ≠ {hypothesized_mean}")
    print(f"   Significance level: α = {alpha}")
    
    # Run the test
    z_statistic, p_value = ztest(sample_data, value=hypothesized_mean)
    
    # Calculate confidence interval
    se = sample_std / np.sqrt(n)
    ci_multiplier = stats.norm.ppf(1 - alpha/2)
    ci_lower = sample_mean - ci_multiplier * se
    ci_upper = sample_mean + ci_multiplier * se
    
    print("\n3. RESULTS")
    print("-" * 40)
    print(f"   Sample mean: {sample_mean:.4f}")
    print(f"   Sample std: {sample_std:.4f}")
    print(f"   Standard error: {se:.4f}")
    print(f"   Z-statistic: {z_statistic:.4f}")
    print(f"   P-value: {p_value:.4f}")
    print(f"   {(1-alpha)*100:.0f}% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    
    # Decision
    print("\n4. CONCLUSION")
    print("-" * 40)
    if p_value < alpha:
        print(f"   REJECT H0 (p={p_value:.4f} < α={alpha})")
        print(f"   Evidence suggests the population mean ≠ {hypothesized_mean}")
    else:
        print(f"   FAIL TO REJECT H0 (p={p_value:.4f} ≥ α={alpha})")
        print(f"   Insufficient evidence that population mean ≠ {hypothesized_mean}")
    
    print("=" * 60)
    
    return {
        'z_statistic': z_statistic,
        'p_value': p_value,
        'ci': (ci_lower, ci_upper),
        'reject_null': p_value < alpha
    }

# Example usage
np.random.seed(42)
data = np.random.normal(loc=52, scale=8, size=50)
results = run_z_test_analysis(data, hypothesized_mean=50, alpha=0.05)

Common pitfalls to avoid:

  1. Using z-tests with small samples: Stick to t-tests when n < 30.
  2. Ignoring multiple comparisons: Running many z-tests inflates false positive rates. Apply Bonferroni correction or use false discovery rate methods.
  3. Confusing statistical and practical significance: A p-value of 0.001 with a tiny effect size may not matter for your business decision.
  4. Stopping experiments early: Peeking at p-values during data collection and stopping when significant leads to inflated false positive rates.
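To make pitfall 2 concrete, here is a sketch of p-value adjustment with statsmodels' multipletests; the five p-values are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five z-tests run on the same dataset
p_values = [0.003, 0.021, 0.048, 0.062, 0.210]

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05,
                                          method='bonferroni')

# Benjamini-Hochberg controls the false discovery rate instead,
# which is less conservative for large families of tests
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05,
                                      method='fdr_bh')

print("raw       :", p_values)
print("bonferroni:", [f"{p:.3f}" for p in p_bonf], list(reject_bonf))
print("fdr_bh    :", [f"{p:.3f}" for p in p_bh], list(reject_bh))
```

Note how results that looked significant at the raw α = 0.05 threshold (0.021, 0.048) no longer survive correction.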

Z-tests remain a fundamental tool for statistical inference. Master them, understand their assumptions, and you’ll have a reliable method for making data-driven decisions in production environments.
