How to Perform the Cochran Q Test in Python

Key Insights

  • The Cochran Q test is the go-to method for comparing three or more related groups with binary outcomes—think A/B/C testing where you’re measuring success/failure across the same users
  • statsmodels’ cochrans_q function handles the heavy lifting (SciPy does not ship a Cochran Q implementation), but you’ll need post-hoc McNemar tests with Bonferroni correction to identify which specific groups differ
  • Your data must be structured as a matrix with subjects as rows and conditions as columns, containing only 0s and 1s

Introduction to the Cochran Q Test

The Cochran Q test answers a specific question: when you measure the same subjects under three or more conditions and record binary outcomes, do the proportions of “successes” differ significantly across conditions?

This comes up constantly in applied work. You’re running an A/B/C test and want to know if click-through rates differ across three landing page designs. You’re evaluating whether the same patients respond differently to three treatments (responded/didn’t respond). You’re testing if users can complete a task successfully across three different UI versions.

The key characteristics that make Cochran Q the right choice:

  • Three or more conditions (for two conditions, use McNemar’s test)
  • Same subjects measured under all conditions (matched/repeated measures)
  • Binary outcomes (yes/no, success/failure, 0/1)

The test produces a Q statistic that follows a chi-squared distribution with k-1 degrees of freedom, where k is the number of conditions. A significant result tells you that at least one condition differs from the others—but not which ones.
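A quick way to anchor that: with k = 3 conditions the reference distribution is chi-squared with 2 degrees of freedom, so you can look up the rejection threshold directly:

```python
from scipy import stats

k = 3        # number of conditions
df = k - 1   # degrees of freedom for the Q statistic

# Any Q above this critical value is significant at alpha = 0.05
critical_value = stats.chi2.ppf(0.95, df=df)
print(f"Reject H0 when Q > {critical_value:.3f}")  # Q > 5.991
```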

Assumptions and Requirements

Before running Cochran Q, verify your data meets these requirements:

Binary data only. Each observation must be 0 or 1. If you have ordinal or continuous data, you need a different test (Friedman test for ordinal, repeated measures ANOVA for continuous).

Same subjects across all conditions. This is the “matched” or “related samples” requirement. Subject 1 must appear in condition A, B, and C. If you have independent groups, use the chi-squared test for independence instead.

Sufficient sample size. The rule of thumb is n × k ≥ 24, where n is the number of subjects and k is the number of conditions. With small samples, the chi-squared approximation breaks down.

Complete data. Each subject needs observations under all conditions. Missing data requires either imputation or removal of that subject entirely.
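These requirements are easy to check in code. A small validator that fails fast on bad input (the helper name is mine, not a library function; the NaN values produced by missing data also fail the binary check):

```python
import numpy as np

def validate_cochran_input(data_matrix):
    """Check a subjects-by-conditions matrix against the Cochran Q requirements."""
    data_matrix = np.asarray(data_matrix)
    if data_matrix.ndim != 2:
        raise ValueError("Expected a 2D subjects x conditions matrix")

    # Binary data only (NaNs from missing observations also fail this check)
    if not np.isin(data_matrix, [0, 1]).all():
        raise ValueError("All values must be 0 or 1")

    n, k = data_matrix.shape
    if k < 3:
        raise ValueError("Need 3+ conditions; use McNemar's test for 2")

    # Rule-of-thumb sample size for the chi-squared approximation
    if n * k < 24:
        raise ValueError(f"Sample likely too small: n x k = {n * k} < 24")

    return n, k

demo = np.tile([1, 0, 1], (10, 1))  # 10 subjects, 3 conditions
print(validate_cochran_input(demo))  # (10, 3)
```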

When Cochran Q isn’t appropriate:

  • Two conditions only → McNemar’s test
  • Independent groups → Chi-squared test of independence
  • Continuous outcomes → Repeated measures ANOVA
  • Ordinal outcomes → Friedman test

Setting Up Your Python Environment

You need four libraries: numpy for numerical operations, pandas for data manipulation, SciPy for the supporting statistical functions (the exact McNemar test via binomtest), and statsmodels for the Cochran Q test itself. Despite what many tutorials claim, SciPy has no cochrans_q function; it lives in statsmodels. Matplotlib is optional but useful for visualization.

import numpy as np
import pandas as pd
import scipy
from scipy import stats
import statsmodels
from statsmodels.stats.contingency_tables import cochrans_q
import matplotlib.pyplot as plt

# Version checks - cochrans_q lives in statsmodels, not scipy.stats
print(f"NumPy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"SciPy version: {scipy.__version__}")
print(f"statsmodels version: {statsmodels.__version__}")

If you’re missing any packages:

pip install numpy pandas scipy statsmodels matplotlib

Preparing Your Data

Cochran Q expects data in a specific format: a 2D array where rows are subjects and columns are conditions. Each cell contains 0 or 1.
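If your raw data arrives in long format instead (one row per subject-condition observation, which is how most logging pipelines export it), a pandas pivot gets you the required wide matrix. The column names here are illustrative:

```python
import pandas as pd

# Long format: one row per (subject, condition) observation
long_df = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2, 2],
    'condition': ['a', 'b', 'c', 'a', 'b', 'c'],
    'outcome':   [1, 0, 1, 0, 1, 1],
})

# Wide format: subjects as rows, conditions as columns
wide = long_df.pivot(index='user_id', columns='condition', values='outcome')
data_matrix = wide.values
print(data_matrix.shape)  # (2, 3)
```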

Here’s a realistic example. Suppose you’re testing three checkout flow designs and recording whether users successfully complete a purchase. You show each of 30 users all three designs (in randomized order) and record success (1) or failure (0).

# Simulating realistic A/B/C test data
# 30 users, 3 checkout designs, binary success/failure
np.random.seed(42)

n_users = 30

# Design A: baseline, ~60% success rate
# Design B: simplified, ~80% success rate  
# Design C: minimal, ~70% success rate
design_a = np.random.binomial(1, 0.60, n_users)
design_b = np.random.binomial(1, 0.80, n_users)
design_c = np.random.binomial(1, 0.70, n_users)

# Create DataFrame for easier manipulation
df = pd.DataFrame({
    'user_id': range(1, n_users + 1),
    'design_a': design_a,
    'design_b': design_b,
    'design_c': design_c
})

print(df.head(10))
print(f"\nSuccess rates:")
print(f"Design A: {design_a.mean():.1%}")
print(f"Design B: {design_b.mean():.1%}")
print(f"Design C: {design_c.mean():.1%}")

The critical step is extracting just the binary columns as a numpy array:

# Extract the data matrix (subjects × conditions)
data_matrix = df[['design_a', 'design_b', 'design_c']].values

print(f"Data shape: {data_matrix.shape}")  # Should be (30, 3)
print(f"Unique values: {np.unique(data_matrix)}")  # Should be [0, 1]

Performing the Cochran Q Test with SciPy

With your data properly formatted, running the test is straightforward:

# Perform Cochran Q test (from statsmodels, not scipy.stats)
from statsmodels.stats.contingency_tables import cochrans_q

result = cochrans_q(data_matrix)

print("Cochran Q Test Results")
print("=" * 40)
print(f"Q statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
print(f"Degrees of freedom: {data_matrix.shape[1] - 1}")

# Interpret the result
alpha = 0.05
if result.pvalue < alpha:
    print(f"\nResult: Significant at α = {alpha}")
    print("The success rates differ across at least two designs.")
else:
    print(f"\nResult: Not significant at α = {alpha}")
    print("No evidence of difference in success rates across designs.")

The Q statistic measures the overall variability in success rates across conditions, adjusted for the matched nature of the data. Under the null hypothesis (all conditions have equal success rates), Q follows a chi-squared distribution.
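If you want to see exactly what that adjustment does, Q can be computed by hand from the row totals R_i, column totals C_j, and grand total T: Q = (k − 1)[k ΣC_j² − T²] / (kT − ΣR_i²). A minimal sketch on a toy matrix (the function name is mine):

```python
import numpy as np
from scipy import stats

def cochran_q_manual(x):
    """Cochran's Q from the textbook formula: rows are subjects, columns conditions."""
    x = np.asarray(x)
    k = x.shape[1]
    col_totals = x.sum(axis=0)   # successes per condition (C_j)
    row_totals = x.sum(axis=1)   # successes per subject (R_i)
    grand_total = x.sum()        # T

    q = (k - 1) * (k * np.sum(col_totals**2) - grand_total**2) \
        / (k * grand_total - np.sum(row_totals**2))
    pvalue = stats.chi2.sf(q, df=k - 1)
    return q, pvalue

# Toy example: 4 subjects, 3 conditions
x = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
q, p = cochran_q_manual(x)
print(f"Q = {q:.3f}, p = {p:.3f}")  # Q = 2.667, p = 0.264
```

Note the denominator: a subject who succeeds (or fails) under every condition contributes nothing to it, which is the sense in which the statistic is "adjusted for the matched nature of the data".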

A significant p-value tells you that at least one condition differs from the others. It doesn’t tell you which ones—that requires post-hoc analysis.

Post-Hoc Analysis

When Cochran Q is significant, you need pairwise comparisons to identify which specific conditions differ. McNemar’s test is the appropriate choice for comparing two matched binary samples.

The catch: running multiple comparisons inflates your Type I error rate. With three conditions, you’re running three pairwise tests, so you need to apply a correction. Bonferroni is the simplest approach—divide your alpha by the number of comparisons.
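The arithmetic behind that inflation, treating the three tests as independent for illustration:

```python
alpha = 0.05
n_tests = 3

# Family-wise error rate without correction: nearly triple the nominal 5%
fwer_uncorrected = 1 - (1 - alpha) ** n_tests
print(f"Uncorrected FWER: {fwer_uncorrected:.4f}")  # 0.1426

# Bonferroni: run each pairwise test at alpha / n_tests
bonferroni_alpha = alpha / n_tests
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** n_tests
print(f"Bonferroni FWER bound: {fwer_bonferroni:.4f}")  # 0.0492
```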

from itertools import combinations

def pairwise_mcnemar(data_matrix, condition_names, alpha=0.05):
    """
    Perform pairwise McNemar tests with Bonferroni correction.
    """
    n_conditions = data_matrix.shape[1]
    n_comparisons = n_conditions * (n_conditions - 1) // 2
    corrected_alpha = alpha / n_comparisons
    
    results = []
    
    for (i, name_i), (j, name_j) in combinations(enumerate(condition_names), 2):
        # Extract the two conditions
        col_i = data_matrix[:, i]
        col_j = data_matrix[:, j]
        
        # Count discordant pairs for McNemar
        # b: i failed but j succeeded; c: i succeeded but j failed
        b = np.sum((col_i == 0) & (col_j == 1))
        c = np.sum((col_i == 1) & (col_j == 0))
        
        # McNemar test (exact binomial if discordant counts are small)
        if b + c == 0:
            # No discordant pairs: the two conditions agree everywhere
            p_value = 1.0
            method = "none"
        elif b + c < 25:
            p_value = stats.binomtest(b, b + c, 0.5).pvalue
            method = "exact"
        else:
            # Chi-squared approximation with continuity correction
            chi2 = (abs(b - c) - 1) ** 2 / (b + c)
            p_value = stats.chi2.sf(chi2, df=1)
            method = "chi2"
        
        significant = p_value < corrected_alpha
        
        results.append({
            'comparison': f"{name_i} vs {name_j}",
            'p_value': p_value,
            'corrected_alpha': corrected_alpha,
            'significant': significant,
            'method': method
        })
    
    return pd.DataFrame(results)

# Run post-hoc analysis
condition_names = ['design_a', 'design_b', 'design_c']
posthoc_results = pairwise_mcnemar(data_matrix, condition_names)

print("\nPost-Hoc Pairwise Comparisons (McNemar with Bonferroni)")
print("=" * 60)
print(posthoc_results.to_string(index=False))

Complete Worked Example

Let’s put everything together with a realistic scenario and proper visualization.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from itertools import combinations

# Scenario: Testing three onboarding flows
# 50 new users each experience all three flows (counterbalanced)
# Outcome: Successfully completed onboarding (1) or abandoned (0)

np.random.seed(123)
n_users = 50

# Flow A: Original (65% completion)
# Flow B: Gamified (85% completion)
# Flow C: Minimalist (72% completion)
flow_a = np.random.binomial(1, 0.65, n_users)
flow_b = np.random.binomial(1, 0.85, n_users)
flow_c = np.random.binomial(1, 0.72, n_users)

df = pd.DataFrame({
    'user_id': range(1, n_users + 1),
    'original': flow_a,
    'gamified': flow_b,
    'minimalist': flow_c
})

# Prepare data matrix
data_matrix = df[['original', 'gamified', 'minimalist']].values
condition_names = ['Original', 'Gamified', 'Minimalist']

# Calculate success rates
success_rates = data_matrix.mean(axis=0)

print("ONBOARDING FLOW A/B/C TEST ANALYSIS")
print("=" * 50)
print(f"\nSample size: {n_users} users")
print(f"\nCompletion rates:")
for name, rate in zip(condition_names, success_rates):
    print(f"  {name}: {rate:.1%}")

# Step 1: Cochran Q Test
print("\n" + "-" * 50)
print("STEP 1: Cochran Q Test (Omnibus)")
print("-" * 50)

from statsmodels.stats.contingency_tables import cochrans_q  # not in scipy.stats

q_result = cochrans_q(data_matrix)
print(f"Q statistic: {q_result.statistic:.3f}")
print(f"p-value: {q_result.pvalue:.4f}")

alpha = 0.05
if q_result.pvalue < alpha:
    print(f"→ Significant difference detected (p < {alpha})")
    run_posthoc = True
else:
    print(f"→ No significant difference (p ≥ {alpha})")
    run_posthoc = False

# Step 2: Post-hoc analysis (if needed)
if run_posthoc:
    print("\n" + "-" * 50)
    print("STEP 2: Pairwise McNemar Tests (Bonferroni corrected)")
    print("-" * 50)
    
    n_comparisons = 3
    corrected_alpha = alpha / n_comparisons
    print(f"Corrected α = {alpha}/{n_comparisons} = {corrected_alpha:.4f}\n")
    
    pairs = list(combinations(range(3), 2))
    
    for i, j in pairs:
        col_i, col_j = data_matrix[:, i], data_matrix[:, j]
        b = np.sum((col_i == 0) & (col_j == 1))
        c = np.sum((col_i == 1) & (col_j == 0))
        
        if b + c > 0:
            result = stats.binomtest(b, b + c, 0.5)
            p = result.pvalue
        else:
            p = 1.0
        
        sig = "***" if p < corrected_alpha else ""
        print(f"{condition_names[i]} vs {condition_names[j]}: "
              f"p = {p:.4f} {sig}")

# Step 3: Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart of success rates
colors = ['#2ecc71' if rate == max(success_rates) else '#3498db' 
          for rate in success_rates]
bars = axes[0].bar(condition_names, success_rates * 100, color=colors, 
                   edgecolor='black', linewidth=1.2)
axes[0].set_ylabel('Completion Rate (%)', fontsize=12)
axes[0].set_xlabel('Onboarding Flow', fontsize=12)
axes[0].set_title('Onboarding Completion by Flow Design', fontsize=14)
axes[0].set_ylim(0, 100)
axes[0].axhline(y=np.mean(success_rates) * 100, color='red', 
                linestyle='--', label='Mean')

# Add value labels on bars
for bar, rate in zip(bars, success_rates):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
                 f'{rate:.0%}', ha='center', fontsize=11, fontweight='bold')

# Subject-level heatmap
im = axes[1].imshow(data_matrix.T, aspect='auto', cmap='RdYlGn', 
                     vmin=0, vmax=1)
axes[1].set_yticks(range(3))
axes[1].set_yticklabels(condition_names)
axes[1].set_xlabel('User ID', fontsize=12)
axes[1].set_title('Individual User Outcomes', fontsize=14)
plt.colorbar(im, ax=axes[1], label='Completed (1) / Abandoned (0)')

plt.tight_layout()
plt.savefig('cochran_q_results.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n" + "=" * 50)
print("CONCLUSION")
print("=" * 50)
print(f"""
The Cochran Q test revealed a significant difference in completion
rates across the three onboarding flows (Q = {q_result.statistic:.2f}, 
p = {q_result.pvalue:.4f}).

Post-hoc analysis with Bonferroni correction identified that the
Gamified flow ({success_rates[1]:.0%}) significantly outperformed
the Original flow ({success_rates[0]:.0%}).

Recommendation: Implement the Gamified onboarding flow.
""")

This complete example demonstrates the full workflow: data preparation, omnibus test, post-hoc comparisons, visualization, and interpretation. The visualization shows both aggregate results and individual-level patterns, which can reveal important nuances like whether certain user segments respond differently to each design.

When reporting Cochran Q results, include the Q statistic, degrees of freedom, p-value, and the specific pairwise differences identified in post-hoc testing. Always report effect sizes when possible—in this case, the raw difference in proportions is often the most interpretable measure.
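As a sketch of such an effect-size report, here is the raw difference in paired proportions with a Wald-style confidence interval (the helper name and the choice of CI formula are mine; the normal approximation is adequate for moderate samples):

```python
import numpy as np
from scipy import stats

def paired_prop_diff_ci(col_i, col_j, conf=0.95):
    """Difference in paired proportions (col_i minus col_j) with a Wald CI."""
    n = len(col_i)
    b = np.sum((col_i == 0) & (col_j == 1))  # discordant: only j succeeds
    c = np.sum((col_i == 1) & (col_j == 0))  # discordant: only i succeeds
    diff = (c - b) / n

    # Wald standard error for a paired difference in proportions
    se = np.sqrt(b + c - (b - c) ** 2 / n) / n
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return diff, (diff - z * se, diff + z * se)

# Illustrative: 10 matched binary outcomes per condition
cond_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
cond_b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1])
diff, (lo, hi) = paired_prop_diff_ci(cond_a, cond_b)
print(f"Difference: {diff:+.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero agrees with a significant pairwise McNemar result; the width of the interval tells readers how precisely the difference is estimated.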
