How to Perform McNemar's Test in Python
Key Insights
- McNemar’s test compares paired binary outcomes—use it when you need to determine if two classifiers perform differently on the same dataset or measure before/after treatment effects
- The test focuses only on discordant pairs (cases where the two methods disagree), making it more powerful than naive accuracy comparisons for paired data
- Use the exact binomial test when you have fewer than 25 discordant pairs; switch to the chi-squared approximation for larger samples
Introduction to McNemar’s Test
McNemar’s test answers a simple question: do two binary classifiers (or treatments, or diagnostic methods) perform differently on the same set of subjects? Unlike comparing two independent proportions, McNemar’s test accounts for the paired nature of observations—the same samples evaluated by both methods.
You’ll reach for this test in three common scenarios:
- Machine learning model comparison: Testing whether Model A and Model B make significantly different predictions on the same test set
- Before/after studies: Measuring if a treatment changed outcomes (e.g., patients testing positive before vs. after intervention)
- Diagnostic method comparison: Determining if two medical tests yield different results on the same patients
The test gained popularity in ML circles because simple accuracy comparison ignores the paired structure of predictions. Two models might have identical accuracy but disagree on completely different samples—McNemar’s test captures this nuance.
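A toy illustration of that point, using two hypothetical prediction vectors: both models score identical accuracy, yet they disagree on every single sample.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
pred_a = np.array([1, 1, 0, 0, 0, 0, 1, 1])  # 50% accurate
pred_b = np.array([0, 0, 1, 1, 1, 1, 0, 0])  # also 50% accurate

print((pred_a == y_true).mean())   # 0.5
print((pred_b == y_true).mean())   # 0.5
print((pred_a != pred_b).sum())    # 8 -- they disagree on every sample
```

Accuracy alone cannot distinguish these two situations; the paired disagreement pattern can.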
Understanding the Test Statistic and Assumptions
McNemar’s test operates on a 2x2 contingency table that counts agreement and disagreement between two methods:
| | Method B: Positive | Method B: Negative |
|---|---|---|
| Method A: Positive | a | b |
| Method A: Negative | c | d |
The cells represent:
- a: Both methods positive (concordant)
- d: Both methods negative (concordant)
- b: Method A positive, Method B negative (discordant)
- c: Method A negative, Method B positive (discordant)
The test statistic uses only the discordant pairs (b and c):
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
This follows a chi-squared distribution with 1 degree of freedom under the null hypothesis that both methods have the same error rate.
Key assumptions:
- Observations are paired (same subjects evaluated by both methods)
- Outcomes are binary (positive/negative, correct/incorrect)
- Sufficient discordant pairs exist (typically b + c ≥ 25 for chi-squared approximation)
```python
import numpy as np

# Create a 2x2 contingency table
# Rows: Method A (Positive, Negative)
# Columns: Method B (Positive, Negative)
contingency_table = np.array([
    [30, 12],  # Method A positive: [both positive, only A positive]
    [8, 50]    # Method A negative: [only B positive, both negative]
])

# Extract discordant pairs
b = contingency_table[0, 1]  # A positive, B negative
c = contingency_table[1, 0]  # A negative, B positive
print(f"Discordant pairs: b={b}, c={c}")
print(f"Total discordant: {b + c}")
```
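These discordant counts are all the formula needs. As a quick sanity check, here is a minimal sketch that computes the statistic directly from b = 12 and c = 8 (the counts in the table above):

```python
from scipy import stats

b, c = 12, 8  # discordant counts from the table above

# Chi-squared statistic: (b - c)^2 / (b + c)
statistic = (b - c) ** 2 / (b + c)
p_value = stats.chi2.sf(statistic, df=1)  # upper tail, 1 degree of freedom

print(f"Statistic: {statistic:.4f}")  # 0.8000
print(f"P-value: {p_value:.4f}")
```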
Performing McNemar’s Test with Statsmodels
The statsmodels library provides a clean implementation of McNemar’s test with options for both exact and asymptotic calculations.
```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Sample data: comparing two diagnostic tests on 100 patients
# Rows: Test A result, Columns: Test B result
table = np.array([
    [45, 15],  # Test A positive
    [8, 32]    # Test A negative
])

# Perform McNemar's test with chi-squared approximation (default)
result_chi2 = mcnemar(table, exact=False, correction=True)
print("Chi-squared test (with continuity correction):")
print(f"  Statistic: {result_chi2.statistic:.4f}")
print(f"  P-value: {result_chi2.pvalue:.4f}")

# Exact binomial test (recommended for small samples)
result_exact = mcnemar(table, exact=True)
print("\nExact binomial test:")
print(f"  Statistic: {result_exact.statistic:.4f}")
print(f"  P-value: {result_exact.pvalue:.4f}")

# Chi-squared approximation without continuity correction
result_no_correction = mcnemar(table, exact=False, correction=False)
print("\nChi-squared test (without correction):")
print(f"  Statistic: {result_no_correction.statistic:.4f}")
print(f"  P-value: {result_no_correction.pvalue:.4f}")
```
Output:
```
Chi-squared test (with continuity correction):
  Statistic: 1.5652
  P-value: 0.2109

Exact binomial test:
  Statistic: 8.0000
  P-value: 0.2100

Chi-squared test (without correction):
  Statistic: 2.1304
  P-value: 0.1444
```
The correction=True parameter applies a continuity correction (the |b − c| − 1 adjustment, attributed to Edwards), which accounts for approximating a discrete distribution with the continuous chi-squared. Note that for the exact test, statsmodels reports min(b, c) as the statistic. The exact test uses the binomial distribution directly and is preferred when discordant pairs are few.
Alternative: Using SciPy (Manual Calculation)
When statsmodels isn’t available, you can compute McNemar’s test manually using scipy.stats:
```python
import numpy as np
from scipy import stats

def mcnemar_test(table, correction=True):
    """
    Perform McNemar's test manually.

    Parameters
    ----------
    table : array-like, shape (2, 2)
        Contingency table
    correction : bool
        Apply continuity correction

    Returns
    -------
    statistic : float
        Chi-squared test statistic
    pvalue : float
        Two-sided p-value
    """
    table = np.asarray(table)
    b = table[0, 1]
    c = table[1, 0]
    if b + c == 0:
        return np.nan, 1.0
    if correction:
        # With continuity correction
        statistic = (abs(b - c) - 1) ** 2 / (b + c)
    else:
        statistic = (b - c) ** 2 / (b + c)
    # P-value from chi-squared distribution with 1 df
    pvalue = 1 - stats.chi2.cdf(statistic, df=1)
    # Or equivalently: pvalue = stats.chi2.sf(statistic, df=1)
    return statistic, pvalue

def mcnemar_exact(table):
    """
    Perform exact McNemar's test using the binomial distribution.
    """
    table = np.asarray(table)
    b = table[0, 1]
    c = table[1, 0]
    n = b + c
    if n == 0:
        return 0, 1.0
    # Two-sided binomial test
    # Under null, b ~ Binomial(n, 0.5)
    result = stats.binomtest(b, n, p=0.5, alternative='two-sided')
    return min(b, c), result.pvalue

# Test with sample data
table = np.array([[45, 15], [8, 32]])
stat, pval = mcnemar_test(table, correction=True)
print(f"Manual chi-squared (corrected): stat={stat:.4f}, p={pval:.4f}")
stat_exact, pval_exact = mcnemar_exact(table)
print(f"Manual exact test: stat={stat_exact}, p={pval_exact:.4f}")
```
This approach gives you full control and works in environments where only SciPy is installed.
Real-World Example: Comparing ML Classifiers
Here’s a practical example comparing two classifiers on the same test set:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.contingency_tables import mcnemar

# Generate synthetic classification data
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=10,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train two different classifiers
model_a = LogisticRegression(random_state=42, max_iter=1000)
model_b = RandomForestClassifier(n_estimators=100, random_state=42)
model_a.fit(X_train, y_train)
model_b.fit(X_train, y_train)

# Get predictions on the same test set
pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)

# Build contingency table of correct/incorrect predictions
correct_a = (pred_a == y_test)
correct_b = (pred_b == y_test)

def build_mcnemar_table(correct_a, correct_b):
    """
    Build 2x2 table for McNemar's test from correctness arrays.

    Returns table where:
    - [0,0]: Both correct
    - [0,1]: A correct, B incorrect
    - [1,0]: A incorrect, B correct
    - [1,1]: Both incorrect
    """
    both_correct = np.sum(correct_a & correct_b)
    a_only = np.sum(correct_a & ~correct_b)
    b_only = np.sum(~correct_a & correct_b)
    both_wrong = np.sum(~correct_a & ~correct_b)
    return np.array([[both_correct, a_only],
                     [b_only, both_wrong]])

table = build_mcnemar_table(correct_a, correct_b)
print("Contingency Table:")
print("                 Model B Correct   Model B Wrong")
print(f"Model A Correct  {table[0,0]:^15} {table[0,1]:^13}")
print(f"Model A Wrong    {table[1,0]:^15} {table[1,1]:^13}")
print()

# Calculate accuracies
acc_a = correct_a.mean()
acc_b = correct_b.mean()
print(f"Logistic Regression accuracy: {acc_a:.4f}")
print(f"Random Forest accuracy:       {acc_b:.4f}")
print()

# Perform McNemar's test, choosing the variant by discordant count
discordant = table[0, 1] + table[1, 0]
print(f"Discordant pairs: {discordant}")

if discordant < 25:
    result = mcnemar(table, exact=True)
    test_type = "exact"
else:
    result = mcnemar(table, exact=False, correction=True)
    test_type = "chi-squared"

print(f"\nMcNemar's test ({test_type}):")
print(f"  Statistic: {result.statistic:.4f}")
print(f"  P-value: {result.pvalue:.4f}")

alpha = 0.05
if result.pvalue < alpha:
    print(f"\nConclusion: Significant difference (p < {alpha})")
else:
    print(f"\nConclusion: No significant difference (p >= {alpha})")
```
This example demonstrates the complete workflow: train models, extract predictions, build the contingency table, and interpret results.
Interpreting Results and Common Pitfalls
Reading the p-value:
- p < 0.05: Reject null hypothesis—the classifiers have significantly different error rates
- p ≥ 0.05: Fail to reject null—no evidence of different performance
Choosing exact vs. asymptotic:
- Use exact test when b + c < 25
- Use chi-squared with continuity correction for moderate samples (25-100 discordant pairs)
- Chi-squared without correction is acceptable for large samples (>100 discordant pairs)
Common mistakes to avoid:
- Using unpaired data: McNemar’s test requires the same samples evaluated by both methods. Don’t use it to compare models trained on different datasets.
- Ignoring concordant pairs entirely: While the test statistic uses only discordant pairs, report the full table for transparency.
- Confusing the table orientation: Always verify which cell represents which disagreement type. Swapping b and c doesn’t change the test statistic but affects interpretation.
- Over-interpreting with small samples: With few discordant pairs, the test has low power. A non-significant result doesn’t prove equivalence.
- Multiple comparisons: When comparing many classifier pairs, apply Bonferroni or other corrections to control family-wise error rate.
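On the last point, the simplest family-wise adjustment is Bonferroni: multiply each p-value by the number of comparisons and cap at 1. A minimal sketch, using three hypothetical p-values from pairwise McNemar tests:

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    k = len(pvalues)
    return [min(p * k, 1.0) for p in pvalues]

# Hypothetical p-values from three pairwise comparisons
raw = [0.012, 0.049, 0.300]
print([round(p, 4) for p in bonferroni(raw)])  # [0.036, 0.147, 0.9]
```

For this and less conservative alternatives such as Holm's method, see statsmodels.stats.multitest.multipletests.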
```python
# Checking if you have enough discordant pairs
def recommend_test_variant(table):
    b, c = table[0, 1], table[1, 0]
    discordant = b + c
    if discordant < 10:
        return "exact", "Warning: Very few discordant pairs. Low statistical power."
    elif discordant < 25:
        return "exact", "Use exact test for reliable p-values."
    else:
        return "chi-squared", "Chi-squared approximation is appropriate."

variant, message = recommend_test_variant(table)
print(f"Recommendation: {variant} test")
print(f"Note: {message}")
```
Conclusion
McNemar’s test is the right tool when you need to compare two methods on paired binary data. In machine learning, it provides a statistically rigorous way to determine if one classifier genuinely outperforms another on the same test set—something raw accuracy comparison cannot do.
Remember these key points:
- Build your contingency table correctly, focusing on where the methods disagree
- Choose the exact test for small samples, chi-squared for larger ones
- Report the full table, not just the p-value
- Consider statistical power when interpreting non-significant results
The statsmodels implementation handles the heavy lifting, but understanding the underlying mechanics helps you avoid misapplication and correctly interpret results in your research or production ML pipelines.