How to Use scipy.stats.chi2_contingency in Python

Key Insights

  • scipy.stats.chi2_contingency tests whether two categorical variables are independent, returning a chi-square statistic, p-value, degrees of freedom, and expected frequencies in a single function call.
  • Always validate the chi-square assumption that expected cell counts are at least 5; when violated, switch to Fisher’s exact test for 2x2 tables or consider combining categories.
  • Pass raw counts (not percentages or proportions) to the function—this is the most common mistake that produces meaningless results.

Introduction to Chi-Square Tests for Independence

The chi-square test of independence answers a fundamental question: are two categorical variables related, or do they vary independently? This test compares observed frequencies in a contingency table against the frequencies you’d expect if the variables had no relationship.

You’ll reach for this test constantly in practical work. A/B testing is the obvious use case—did the new button color actually affect conversion rates, or was the difference just noise? Survey analysis relies on it too: does customer satisfaction vary by region? In machine learning, chi-square tests help with feature selection by identifying which categorical features have meaningful relationships with your target variable.

Let’s set up our environment:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

# Set display options for cleaner output
np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)

Understanding the Function Signature and Parameters

The function signature is straightforward:

chi2_contingency(observed, correction=True, lambda_=None)

The observed parameter takes your contingency table as a 2D array-like structure. This can be a NumPy array, a list of lists, or a Pandas DataFrame. The values must be non-negative integers representing counts.

The correction parameter controls Yates’ continuity correction, which adjusts for the fact that we’re using a continuous distribution (chi-square) to approximate a discrete one. It only applies to 2x2 tables. Set correction=False when your sample size is large (total count > 40) or when you want results consistent with other software that doesn’t apply the correction by default.
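
To see how much the correction matters, you can run the same 2x2 table both ways. A quick sketch, using illustrative counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[45, 55],
                  [62, 38]])

# Yates' correction shrinks each |observed - expected| deviation by 0.5,
# so the corrected statistic is always the smaller, more conservative one
chi2_corrected, p_corrected, _, _ = chi2_contingency(table, correction=True)
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print(f"With correction:    chi2={chi2_corrected:.4f}, p={p_corrected:.4f}")
print(f"Without correction: chi2={chi2_plain:.4f}, p={p_plain:.4f}")
```

For this table the corrected statistic comes out around 5.15 versus 5.81 uncorrected, enough to shift the p-value noticeably.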

The lambda_ parameter lets you use alternative statistics from the power divergence family. In practice, you’ll rarely need this—the default Pearson chi-square statistic works for most applications.
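
If you ever do need an alternative, lambda_ accepts the named members of the power divergence family; for instance, lambda_="log-likelihood" gives the G-test. A brief sketch on the same illustrative table:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[45, 55],
                  [62, 38]])

# Default Pearson statistic vs. the G-test (log-likelihood ratio);
# for moderate counts the two are usually very close
chi2_pearson, p_pearson, _, _ = chi2_contingency(table)
g_stat, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")

print(f"Pearson: {chi2_pearson:.4f} (p={p_pearson:.4f})")
print(f"G-test:  {g_stat:.4f} (p={p_g:.4f})")
```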

Here’s how to create contingency tables:

# Method 1: Direct NumPy array
observed = np.array([
    [45, 55],   # Group A: [converted, not_converted]
    [62, 38]    # Group B: [converted, not_converted]
])

# Method 2: From a Pandas DataFrame using crosstab
data = pd.DataFrame({
    'group': ['A'] * 100 + ['B'] * 100,
    'converted': [1]*45 + [0]*55 + [1]*62 + [0]*38
})

contingency_table = pd.crosstab(data['group'], data['converted'])
print(contingency_table)
converted   0   1
group            
A          55  45
B          38  62
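
If you want to eyeball conversion rates while still passing raw counts to the test, crosstab's normalize argument is handy for display only. A small sketch using the same data:

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A'] * 100 + ['B'] * 100,
    'converted': [1]*45 + [0]*55 + [1]*62 + [0]*38
})

# Raw counts: this table is what chi2_contingency needs
counts = pd.crosstab(data['group'], data['converted'])

# Row proportions: for human inspection only, never for the test itself
rates = pd.crosstab(data['group'], data['converted'], normalize='index')

print(counts)
print(rates)
```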

Interpreting the Return Values

The function returns four values (in recent SciPy versions, a Chi2ContingencyResult object whose fields unpack the same way), and understanding each is critical:

observed = np.array([[45, 55], [62, 38]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Expected frequencies:\n{expected}")
Chi-square statistic: 5.1452
P-value: 0.0233
Degrees of freedom: 1
Expected frequencies:
[[53.5 46.5]
 [53.5 46.5]]

The chi-square statistic measures how much the observed counts deviate from expected counts. Larger values indicate greater deviation from independence.

The p-value is what you’ll use for decision-making. It represents the probability of observing this much deviation (or more) if the variables were truly independent. With a p-value of 0.0233, there’s only a 2.33% chance of seeing this result under the null hypothesis of independence.

Degrees of freedom equals (rows - 1) × (columns - 1). For a 2x2 table, that’s always 1.

The expected frequencies array shows what each cell count would be if the variables were independent. Compare these against your observed values to understand where the deviation comes from.
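
Under independence, each expected count is the row total times the column total divided by the grand total. You can reproduce SciPy's expected array yourself with an outer product; a small sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[45, 55],
                     [62, 38]])

row_totals = observed.sum(axis=1)   # [100, 100]
col_totals = observed.sum(axis=0)   # [107, 93]
grand_total = observed.sum()        # 200

# E[i, j] = row_totals[i] * col_totals[j] / grand_total
expected_manual = np.outer(row_totals, col_totals) / grand_total

_, _, _, expected_scipy = chi2_contingency(observed)
print(np.allclose(expected_manual, expected_scipy))  # True
```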

For decision-making, compare the p-value against your significance level (typically 0.05):

alpha = 0.05

if p_value < alpha:
    print(f"Reject null hypothesis (p={p_value:.4f} < {alpha})")
    print("The variables are NOT independent - there's a significant relationship")
else:
    print(f"Fail to reject null hypothesis (p={p_value:.4f} >= {alpha})")
    print("No significant evidence of a relationship between variables")

Practical Example: A/B Test Analysis

Let’s work through a realistic scenario. Your team ran an A/B test on a checkout page redesign. You have 2,000 users split between control and treatment groups:

# Simulate realistic A/B test data
np.random.seed(42)

# Control group: 12% conversion rate
# Treatment group: 15% conversion rate
ab_test_data = pd.DataFrame({
    'variant': ['control'] * 1000 + ['treatment'] * 1000,
    'converted': (
        np.random.binomial(1, 0.12, 1000).tolist() +
        np.random.binomial(1, 0.15, 1000).tolist()
    )
})

# Check the actual conversion rates
print("Conversion rates by variant:")
print(ab_test_data.groupby('variant')['converted'].agg(['sum', 'count', 'mean']))
Conversion rates by variant:
           sum  count   mean
variant                      
control    115   1000  0.115
treatment  156   1000  0.156

Now let’s run the full analysis:

# Create the contingency table
contingency = pd.crosstab(
    ab_test_data['variant'], 
    ab_test_data['converted'],
    margins=True
)
print("Contingency table with margins:")
print(contingency)
print()

# Run the chi-square test (no Yates correction for large samples)
observed_array = pd.crosstab(
    ab_test_data['variant'], 
    ab_test_data['converted']
).values

chi2, p_value, dof, expected = chi2_contingency(observed_array, correction=False)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print()

# Make a decision
alpha = 0.05
if p_value < alpha:
    # Calculate effect size (difference in proportions)
    control_rate = 115 / 1000
    treatment_rate = 156 / 1000
    lift = (treatment_rate - control_rate) / control_rate * 100
    
    print(f"SIGNIFICANT RESULT at α={alpha}")
    print(f"Treatment conversion: {treatment_rate:.1%}")
    print(f"Control conversion: {control_rate:.1%}")
    print(f"Relative lift: {lift:.1f}%")
else:
    print(f"No significant difference detected at α={alpha}")
Contingency table with margins:
converted    0    1   All
variant                   
control    885  115  1000
treatment  844  156  1000
All       1729  271  2000

Chi-square statistic: 7.1752
P-value: 0.0074
Degrees of freedom: 1

SIGNIFICANT RESULT at α=0.05
Treatment conversion: 15.6%
Control conversion: 11.5%
Relative lift: 35.7%
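
A significant p-value tells you the difference is unlikely to be noise; it says nothing about whether the effect is large. A common companion metric is Cramér's V (equivalent to the phi coefficient for a 2x2 table), sketched here for the table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[885, 115],
                     [844, 156]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
n = observed.sum()

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print(f"Cramér's V: {cramers_v:.4f}")
```

Here V comes out around 0.06, a small effect: statistically detectable at n=2,000, but worth weighing against implementation cost before shipping.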

Working with Expected Frequencies

The chi-square test has an important assumption: expected cell frequencies should be at least 5. When this assumption is violated, the p-value becomes unreliable.

Always check the expected frequencies array:

def validate_chi2_assumptions(observed):
    """Check if chi-square test assumptions are met."""
    chi2, p_value, dof, expected = chi2_contingency(observed)
    
    min_expected = expected.min()
    cells_below_5 = (expected < 5).sum()
    total_cells = expected.size
    
    print(f"Minimum expected frequency: {min_expected:.2f}")
    print(f"Cells with expected < 5: {cells_below_5}/{total_cells}")
    
    if min_expected < 1:
        print("WARNING: Expected frequency below 1. Chi-square test invalid.")
        return False
    elif min_expected < 5:
        print("WARNING: Expected frequency below 5. Results may be unreliable.")
        if observed.shape == (2, 2):
            print("Consider using Fisher's exact test instead.")
        return False
    else:
        print("Assumptions satisfied.")
        return True

# Test with a problematic table (small counts)
small_sample = np.array([[3, 7], [8, 2]])
print("Small sample validation:")
validate_chi2_assumptions(small_sample)
Small sample validation:
Minimum expected frequency: 4.50
Cells with expected < 5: 2/4
WARNING: Expected frequency below 5. Results may be unreliable.
Consider using Fisher's exact test instead.

When assumptions fail for 2x2 tables, use Fisher’s exact test:

def analyze_categorical_relationship(observed):
    """Run appropriate test based on expected frequencies."""
    chi2, p_value_chi2, dof, expected = chi2_contingency(observed)
    
    if observed.shape == (2, 2) and expected.min() < 5:
        # Use Fisher's exact test
        odds_ratio, p_value_fisher = fisher_exact(observed)
        print("Using Fisher's exact test (expected frequencies too low)")
        print(f"Odds ratio: {odds_ratio:.4f}")
        print(f"P-value: {p_value_fisher:.4f}")
        return p_value_fisher
    else:
        print("Using chi-square test")
        print(f"Chi-square statistic: {chi2:.4f}")
        print(f"P-value: {p_value_chi2:.4f}")
        return p_value_chi2

# Example with small sample
small_sample = np.array([[3, 7], [8, 2]])
analyze_categorical_relationship(small_sample)

Common Pitfalls and Best Practices

Pitfall 1: Using percentages instead of counts

This is the most common mistake. The chi-square test requires raw counts:

# WRONG: Using percentages
wrong_table = np.array([
    [0.45, 0.55],  # These are proportions, not counts!
    [0.62, 0.38]
])

# Disable Yates' correction so both calls report the plain Pearson statistic
chi2_wrong, p_wrong, _, _ = chi2_contingency(wrong_table, correction=False)
print(f"Wrong result (percentages): chi2={chi2_wrong:.4f}, p={p_wrong:.4f}")

# CORRECT: Using counts
correct_table = np.array([
    [45, 55],  # Actual counts
    [62, 38]
])

chi2_correct, p_correct, _, _ = chi2_contingency(correct_table, correction=False)
print(f"Correct result (counts): chi2={chi2_correct:.4f}, p={p_correct:.4f}")
Wrong result (percentages): chi2=0.0581, p=0.8095
Correct result (counts): chi2=5.8085, p=0.0159

The difference is dramatic. The chi-square statistic scales linearly with the total count, so a table of proportions (which sums to 2 rather than 200) shrinks the statistic a hundredfold and yields a meaningless p-value.

Pitfall 2: Ignoring the independence assumption

Each observation must be independent. If the same user appears multiple times in your data, you’ve violated this assumption.
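
There's no statistical correction for this after the fact; fix it during data preparation. A minimal sketch, assuming a hypothetical user_id column and a keep-the-last-event policy:

```python
import pandas as pd

# Hypothetical event log: user 'u1' appears twice, violating independence
events = pd.DataFrame({
    'user_id':   ['u1', 'u2', 'u3', 'u1', 'u4'],
    'group':     ['A',  'A',  'B',  'A',  'B'],
    'converted': [0,    1,    1,    1,    0]
})

# Collapse to one row per user (keeping the last event is one policy
# among several; choose deliberately for your use case)
per_user = events.drop_duplicates(subset='user_id', keep='last')

table = pd.crosstab(per_user['group'], per_user['converted'])
print(table)
```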

Best practice for larger tables:

For tables beyond 2x2, examine the Pearson residuals (observed minus expected, divided by the square root of expected; often loosely called standardized residuals) to understand which cells contribute most to the chi-square statistic:

def analyze_larger_table(observed):
    """Analyze contingency tables larger than 2x2."""
    chi2, p_value, dof, expected = chi2_contingency(observed)
    
    # Calculate standardized residuals
    residuals = (observed - expected) / np.sqrt(expected)
    
    print(f"Chi-square: {chi2:.4f}, p-value: {p_value:.4f}, dof: {dof}")
    print("\nStandardized residuals (|value| > 2 indicates significant deviation):")
    print(residuals)
    
    return residuals

# Example: 3x3 table (product preference by age group)
product_by_age = np.array([
    [50, 30, 20],   # Young: [Product A, B, C]
    [35, 40, 25],   # Middle: [Product A, B, C]
    [15, 30, 55]    # Senior: [Product A, B, C]
])

analyze_larger_table(product_by_age)

Conclusion and Further Resources

The chi-square test of independence is a workhorse for analyzing categorical data. Remember these essentials: pass raw counts (never percentages), validate that expected frequencies are at least 5, and use Fisher’s exact test when dealing with small samples in 2x2 tables.

For A/B testing specifically, chi2_contingency gives you a quick, reliable way to determine if observed differences are statistically significant. Combine it with effect size calculations to make practical decisions about whether the difference matters for your business.

Related functions worth exploring:

  • scipy.stats.fisher_exact: Exact test for 2x2 tables with small samples
  • scipy.stats.chi2: The chi-square distribution itself, useful for custom calculations
  • scipy.stats.power_divergence: Generalized version supporting G-test and other variants
  • statsmodels.stats.contingency_tables: More detailed contingency table analysis including odds ratios and confidence intervals
