How to Perform a Chi-Square Test of Independence in Python
Key Insights
- The chi-square test of independence determines whether two categorical variables are related, making it essential for A/B testing, survey analysis, and feature selection in machine learning pipelines.
- SciPy's chi2_contingency() function returns four values (statistic, p-value, degrees of freedom, and expected frequencies), giving you everything needed to validate assumptions and interpret results.
- Always check that expected frequencies are at least 5 in each cell; when they're not, use Fisher's exact test or combine categories to avoid misleading conclusions.
Introduction to the Chi-Square Test of Independence
The chi-square test of independence answers a simple question: are two categorical variables related, or are they independent? This makes it one of the most practical statistical tests for software engineers and data analysts working with real-world data.
You’ll reach for this test when analyzing survey responses, evaluating A/B test results across user segments, or investigating whether a bug affects certain user groups disproportionately. Any time you have counts of observations falling into categories defined by two variables, the chi-square test tells you whether the pattern you see is statistically significant or just noise.
Consider a concrete scenario: your product team wants to know if subscription tier (Free, Pro, Enterprise) affects feature adoption (Adopted, Not Adopted). You have counts for each combination. The chi-square test tells you whether tier and adoption are independent or whether there’s a real relationship worth investigating further.
Understanding the Math Behind the Test
The chi-square test compares what you observed against what you’d expect if the variables were truly independent. The core formula is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where O is the observed frequency and E is the expected frequency for each cell. The expected frequency assumes independence: it’s calculated as (row total × column total) / grand total.
The degrees of freedom equal (rows - 1) × (columns - 1). This determines which chi-square distribution to use when calculating the p-value.
Here’s how to calculate expected frequencies manually:
```python
import numpy as np

# Observed frequencies: rows are subscription tiers, columns are adoption status
observed = np.array([
    [120, 80],  # Free: 120 adopted, 80 not adopted
    [95, 25],   # Pro: 95 adopted, 25 not adopted
    [45, 10]    # Enterprise: 45 adopted, 10 not adopted
])

# Calculate row and column totals
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected frequencies under independence
expected = (row_totals * col_totals) / grand_total
print("Expected frequencies:")
print(expected)

# Calculate chi-square statistic manually
chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"\nChi-square statistic: {chi_square:.4f}")
```
A small p-value (typically < 0.05) means the observed distribution differs significantly from what independence would predict. You reject the null hypothesis and conclude the variables are related.
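The step from statistic to p-value can also be done by hand: compare the statistic against the chi-square distribution with the matching degrees of freedom. A minimal sketch using scipy.stats.chi2 and the same observed counts as above:

```python
import numpy as np
from scipy.stats import chi2

# Same contingency table as above: tiers (rows) vs. adoption (columns)
observed = np.array([[120, 80], [95, 25], [45, 10]])
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Survival function gives P(X >= chi_square) under the null hypothesis
p_value = chi2.sf(chi_square, dof)
print(f"dof = {dof}, p-value = {p_value:.6f}")
```

This is exactly what chi2_contingency() does internally (plus a continuity correction for 2×2 tables).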
Setting Up Your Data
Real data rarely arrives as a neat contingency table. You’ll typically have raw observations that need transformation. Pandas makes this straightforward with crosstab().
```python
import pandas as pd

# Simulated raw data: each row is a user
data = pd.DataFrame({
    'user_id': range(1, 376),
    'tier': ['Free'] * 200 + ['Pro'] * 120 + ['Enterprise'] * 55,
    'adopted': ['Yes'] * 120 + ['No'] * 80 +
               ['Yes'] * 95 + ['No'] * 25 +
               ['Yes'] * 45 + ['No'] * 10
})

# Create contingency table
contingency_table = pd.crosstab(
    data['tier'],
    data['adopted'],
    margins=True,  # Include row/column totals
    margins_name='Total'
)
print(contingency_table)
```
Output:

```
adopted      No  Yes  Total
tier
Enterprise   10   45     55
Free         80  120    200
Pro          25   95    120
Total       115  260    375
```
The data format is crucial: each observation must be independent (one user can’t appear twice), and both variables must be categorical. Continuous variables need binning first.
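To illustrate the binning step, here is a sketch with a hypothetical continuous column, account_age_days (not part of the example dataset), discretized with pandas' cut() before it could enter a contingency table:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable: account age in days for 375 users
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(1, 1000, size=375), name='account_age_days')

# Bin into ordered categories; only then can the variable be crosstabbed
age_bins = pd.cut(
    ages,
    bins=[0, 90, 365, 1000],
    labels=['<3 months', '3-12 months', '>1 year']
)
print(age_bins.value_counts())
```

The choice of bin edges matters: too few bins hides structure, too many produces sparse cells that violate the expected-frequency assumption discussed below.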
Performing the Test with SciPy
SciPy’s chi2_contingency() handles everything in one function call. It returns four values that tell the complete story.
```python
from scipy import stats

# Remove margins for the test (use raw counts only)
observed_table = pd.crosstab(data['tier'], data['adopted'])

# Perform chi-square test
chi2, p_value, dof, expected_freq = stats.chi2_contingency(observed_table)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of freedom: {dof}")
print(f"\nExpected frequencies:")
print(pd.DataFrame(
    expected_freq,
    index=observed_table.index,
    columns=observed_table.columns
).round(2))
```
Output:

```
Chi-square statistic: 17.6832
P-value: 0.000145
Degrees of freedom: 2

Expected frequencies:
               No     Yes
tier
Enterprise  16.87   38.13
Free        61.33  138.67
Pro         36.80   83.20
```

Interpreting these results: with a p-value of 0.000145 (well below 0.05), we reject the null hypothesis. Subscription tier and feature adoption are not independent: there is a statistically significant relationship between them.
The expected frequencies show what we’d see if tier didn’t matter. Comparing observed (120 Free users adopted) versus expected (138.67) reveals that Free users adopt less than expected, while Pro and Enterprise users adopt more.
Checking Assumptions and Validity
The chi-square test has assumptions that, when violated, produce unreliable results. The most important: expected frequencies should be at least 5 in every cell.
```python
def validate_chi_square_assumptions(observed_table):
    """Check if chi-square test assumptions are met."""
    chi2, p_value, dof, expected = stats.chi2_contingency(observed_table)

    min_expected = expected.min()
    cells_below_5 = (expected < 5).sum()
    total_cells = expected.size

    print(f"Minimum expected frequency: {min_expected:.2f}")
    print(f"Cells with expected < 5: {cells_below_5} of {total_cells}")

    # Check the stricter violation first: any cell with expected < 1
    if min_expected < 1:
        print("⚠️ WARNING: Some expected frequencies < 1")
        print("   Chi-square test is invalid. Use Fisher's exact test.")
        return False
    if min_expected < 5:
        pct_below = cells_below_5 / total_cells * 100
        if pct_below > 20:
            print("⚠️ WARNING: >20% of cells have expected frequency < 5")
            print("   Consider using Fisher's exact test or combining categories")
            return False

    print("✓ Assumptions met for chi-square test")
    return True

# Validate our data
validate_chi_square_assumptions(observed_table)
```
When assumptions fail, you have two options. For 2×2 tables, use Fisher’s exact test:
```python
# For 2x2 tables only
small_table = np.array([[8, 2], [1, 5]])
odds_ratio, fisher_p = stats.fisher_exact(small_table)
print(f"Fisher's exact test p-value: {fisher_p:.4f}")
```
For larger tables with sparse cells, combine categories that make logical sense (e.g., merge “Pro” and “Enterprise” into “Paid”).
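Combining categories is just a relabeling step before rebuilding the table. A minimal sketch using the same tier/adoption data from earlier, merging the two paid tiers:

```python
import pandas as pd
from scipy import stats

# Same raw data shape as the earlier example: tier vs. adoption
data = pd.DataFrame({
    'tier': ['Free'] * 200 + ['Pro'] * 120 + ['Enterprise'] * 55,
    'adopted': (['Yes'] * 120 + ['No'] * 80
                + ['Yes'] * 95 + ['No'] * 25
                + ['Yes'] * 45 + ['No'] * 10),
})

# Merge sparse tiers into a single 'Paid' category
data['tier_grouped'] = data['tier'].replace({'Pro': 'Paid', 'Enterprise': 'Paid'})

merged_table = pd.crosstab(data['tier_grouped'], data['adopted'])
chi2, p_value, dof, expected = stats.chi2_contingency(merged_table)
print(merged_table)
print(f"dof = {dof}, min expected = {expected.min():.1f}")
```

Note that merging reduces the degrees of freedom (here to 1, since the table becomes 2×2), and chi2_contingency() then applies Yates' continuity correction by default.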
Visualizing Results
Numbers tell the story, but visualizations make it accessible to stakeholders. A heatmap of the contingency table with annotations works well.
```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Observed frequencies heatmap
sns.heatmap(
    observed_table,
    annot=True,
    fmt='d',
    cmap='Blues',
    ax=axes[0]
)
axes[0].set_title('Observed Frequencies')

# Standardized residuals show where observed differs from expected
chi2, p_value, dof, expected = stats.chi2_contingency(observed_table)
residuals = (observed_table.values - expected) / np.sqrt(expected)

sns.heatmap(
    pd.DataFrame(residuals, index=observed_table.index, columns=observed_table.columns),
    annot=True,
    fmt='.2f',
    cmap='RdBu_r',
    center=0,
    ax=axes[1]
)
axes[1].set_title('Standardized Residuals\n(>|2| indicates significant deviation)')

plt.tight_layout()
plt.savefig('chi_square_visualization.png', dpi=150)
plt.show()
```
Standardized residuals greater than 2 or less than -2 indicate cells that contribute most to the chi-square statistic. In our example, the largest deviation is the Free/No cell (residual ≈ 2.4): Free users fail to adopt noticeably more often than independence predicts. The Enterprise cells point the other way, with a positive adoption residual, though it stays below the ±2 threshold.
Practical Example: End-to-End Analysis
Let’s walk through a complete analysis you might present to stakeholders. Suppose you’re analyzing whether marketing channel affects conversion.
```python
import pandas as pd
import numpy as np
from scipy import stats

# Simulate realistic marketing data
np.random.seed(42)
n_users = 1000

channels = np.random.choice(
    ['Organic', 'Paid Search', 'Social', 'Email'],
    size=n_users,
    p=[0.35, 0.30, 0.20, 0.15]
)

# Different conversion rates by channel
conversion_probs = {
    'Organic': 0.12,
    'Paid Search': 0.08,
    'Social': 0.05,
    'Email': 0.18
}
converted = [
    np.random.random() < conversion_probs[ch]
    for ch in channels
]

df = pd.DataFrame({
    'channel': channels,
    'converted': ['Yes' if c else 'No' for c in converted]
})

# Build contingency table
ct = pd.crosstab(df['channel'], df['converted'])
print("Contingency Table:")
print(ct)
print()

# Run chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(ct)

# Validate assumptions
min_expected = expected.min()
if min_expected >= 5:
    print(f"Minimum expected frequency: {min_expected:.1f} ✓")
else:
    print(f"Warning: min expected = {min_expected:.1f}")

# Report results
print(f"\n{'='*50}")
print("CHI-SQUARE TEST RESULTS")
print(f"{'='*50}")
print(f"Chi-square statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")

print(f"\nConclusion at α=0.05: ", end="")
if p_value < 0.05:
    print("SIGNIFICANT - Channel and conversion are related")
else:
    print("NOT SIGNIFICANT - No evidence of relationship")

# Effect size (Cramér's V)
n = ct.sum().sum()
min_dim = min(ct.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nEffect size (Cramér's V): {cramers_v:.3f}")

print("  Interpretation: ", end="")
if cramers_v < 0.1:
    print("Negligible")
elif cramers_v < 0.3:
    print("Small")
elif cramers_v < 0.5:
    print("Medium")
else:
    print("Large")
```
For stakeholders, translate this into actionable language:
“Our analysis shows a statistically significant relationship between marketing channel and conversion rate (χ² = 24.31, p < 0.001). Email campaigns convert at 18% compared to just 5% for Social, suggesting we should reallocate budget toward email marketing. The effect size is small-to-medium (Cramér’s V = 0.16), indicating the relationship is meaningful but other factors also influence conversion.”
The chi-square test told you the relationship exists. Now you need follow-up analysis—comparing specific channels, calculating confidence intervals for conversion rates, and considering confounding variables—to make sound business decisions.
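One common follow-up can be sketched directly: pairwise 2×2 tests between channels, with a Bonferroni adjustment since six comparisons inflate the false-positive rate. This rebuilds the simulated data from the example above; the loop structure is an illustrative sketch, not the only valid post-hoc procedure.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the simulated marketing data from the example above
np.random.seed(42)
channels = np.random.choice(
    ['Organic', 'Paid Search', 'Social', 'Email'],
    size=1000, p=[0.35, 0.30, 0.20, 0.15]
)
probs = {'Organic': 0.12, 'Paid Search': 0.08, 'Social': 0.05, 'Email': 0.18}
converted = ['Yes' if np.random.random() < probs[ch] else 'No' for ch in channels]
ct = pd.crosstab(pd.Series(channels, name='channel'),
                 pd.Series(converted, name='converted'))

# Pairwise 2x2 chi-square tests with Bonferroni-corrected alpha
pairs = list(combinations(ct.index, 2))
alpha = 0.05 / len(pairs)  # 6 comparisons
for a, b in pairs:
    sub = ct.loc[[a, b]]
    chi2, p, dof, _ = stats.chi2_contingency(sub)
    flag = 'significant' if p < alpha else 'n.s.'
    print(f"{a} vs {b}: p = {p:.4f} ({flag})")
```

Each pairwise test is a 2×2 table, so chi2_contingency() applies Yates' continuity correction automatically; only pairs that survive the corrected threshold should drive budget decisions.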