How to Perform a Chi-Square Goodness of Fit Test in Python

Key Insights

  • The chi-square goodness of fit test compares observed categorical frequencies against expected frequencies to determine if your data follows a hypothesized distribution—use it when you have one categorical variable and want to test distributional assumptions.
  • SciPy’s chisquare() function handles the heavy lifting, but understanding the manual calculation helps you interpret results correctly and catch edge cases that break the test’s assumptions.
  • Always verify that expected frequencies are at least 5 per category; violations lead to unreliable p-values and require alternative approaches like combining categories or using exact tests.

Introduction to Chi-Square Goodness of Fit

The chi-square goodness of fit test answers a simple question: does your observed data match what you expected? You’re comparing the frequency distribution of a single categorical variable against a theoretical or hypothesized distribution.

Consider these scenarios where the test applies:

  • Testing whether a die is fair (observed rolls vs. uniform 1/6 probability)
  • Checking if website traffic follows expected seasonal patterns
  • Verifying if survey responses match population demographics

The test works by calculating how much your observed frequencies deviate from expected frequencies. Large deviations produce large chi-square statistics, suggesting your data doesn’t fit the expected distribution.

Three assumptions must hold for valid results:

  1. Independence: Each observation is independent of others
  2. Categorical data: Your variable has discrete categories
  3. Sufficient expected frequencies: Each category should have an expected count of at least 5

Violating these assumptions—especially the third—produces unreliable p-values.

Setting Up Your Python Environment

You need four libraries for chi-square testing: SciPy for the statistical test, NumPy for numerical operations, pandas for data handling, and Matplotlib for visualization.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Create sample dataset: observed counts for 6 categories
observed = np.array([18, 22, 16, 24, 19, 21])
categories = ['A', 'B', 'C', 'D', 'E', 'F']

# Create a DataFrame for easier manipulation
df = pd.DataFrame({
    'category': categories,
    'observed': observed
})

print(df)
  category  observed
0        A        18
1        B        22
2        C        16
3        D        24
4        E        19
5        F        21

This dataset represents 120 total observations across 6 categories. If we expect a uniform distribution, each category should have 20 observations.

Understanding the Test Components

Before using SciPy’s convenience functions, understanding the underlying math helps you interpret results and debug issues.

Observed frequencies are your actual counts per category. Expected frequencies come from your hypothesized distribution. Degrees of freedom equal the number of categories minus one (k - 1), minus any parameters estimated from the data.

The chi-square statistic formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where O_i is observed frequency and E_i is expected frequency for category i.

# Manual chi-square calculation
observed = np.array([18, 22, 16, 24, 19, 21])
n_total = observed.sum()  # 120
n_categories = len(observed)  # 6

# Expected frequencies (uniform distribution)
expected = np.array([n_total / n_categories] * n_categories)
print(f"Expected frequencies: {expected}")

# Calculate chi-square statistic manually
chi_square_components = (observed - expected) ** 2 / expected
chi_square_stat = chi_square_components.sum()

# Degrees of freedom (named dof to avoid clobbering the DataFrame df above)
dof = n_categories - 1

# Calculate p-value from the chi-square survival function
p_value = stats.chi2.sf(chi_square_stat, dof)

print(f"Chi-square statistic: {chi_square_stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")
Expected frequencies: [20. 20. 20. 20. 20. 20.]
Chi-square statistic: 2.1000
Degrees of freedom: 5
P-value: 0.8351

The p-value of 0.84 indicates no significant deviation from a uniform distribution. We fail to reject the null hypothesis that the data follows the expected distribution.
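An equivalent way to reach the same decision is to compare the statistic against the critical value of the chi-square distribution at your chosen alpha. A minimal sketch reusing the same six-category counts:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 24, 19, 21])
expected = np.full(6, observed.sum() / 6)  # uniform expectation

# Recompute the chi-square statistic from the data
chi_square_stat = ((observed - expected) ** 2 / expected).sum()

# Critical value: the cutoff the statistic must exceed to reject H0
alpha = 0.05
dof = len(observed) - 1
critical_value = stats.chi2.ppf(1 - alpha, dof)

print(f"Statistic: {chi_square_stat:.4f}")  # 2.1000
print(f"Critical value (alpha=0.05, df=5): {critical_value:.4f}")  # 11.0705
print(f"Reject H0: {chi_square_stat > critical_value}")  # False
```

The statistic falls well below the critical value, so the critical-value approach and the p-value approach agree, as they always do for the same alpha.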

Performing the Test with SciPy

SciPy’s chisquare() function simplifies the process. It returns the chi-square statistic and p-value directly.

from scipy.stats import chisquare

observed = np.array([18, 22, 16, 24, 19, 21])

# Test with default expected frequencies (uniform distribution)
stat, p_value = chisquare(observed)

print(f"Chi-square statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
Chi-square statistic: 2.1000
P-value: 0.8351

When expected frequencies aren’t uniform, pass them explicitly:

# Custom expected frequencies (must sum to same total as observed)
observed = np.array([45, 35, 15, 5])
expected = np.array([40, 30, 20, 10])  # Hypothesized distribution

stat, p_value = chisquare(observed, f_exp=expected)

print(f"Chi-square statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Verify totals match
print(f"Observed total: {observed.sum()}")
print(f"Expected total: {expected.sum()}")
Chi-square statistic: 5.2083
P-value: 0.1572

Critical point: observed and expected frequencies must sum to the same total. SciPy doesn’t automatically normalize expected values.
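If your hypothesis is stated as proportions rather than counts, scale the proportions by the observed total yourself before passing them in. A sketch using the same four-category data, with the hypothesized distribution written as shares:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([45, 35, 15, 5])
proportions = np.array([0.40, 0.30, 0.20, 0.10])  # hypothesized shares

# Convert proportions to expected counts so the totals match
expected = proportions * observed.sum()

stat, p_value = chisquare(observed, f_exp=expected)
print(f"Expected counts: {expected}")  # [40. 30. 20. 10.]
print(f"Chi-square statistic: {stat:.4f}")  # 5.2083
```

This guarantees the observed and expected totals agree, regardless of the sample size.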

Real-World Example: Dice Fairness Test

Let’s work through a complete example. You suspect a casino die is loaded after observing unusual patterns. You roll it 300 times and record the results.

import numpy as np
from scipy.stats import chisquare

# Observed frequencies from 300 dice rolls
dice_faces = [1, 2, 3, 4, 5, 6]
observed_rolls = np.array([42, 48, 55, 53, 51, 51])

# For a fair die, each face should appear 1/6 of the time
n_rolls = observed_rolls.sum()
expected_rolls = np.array([n_rolls / 6] * 6)

print("Dice Fairness Test")
print("=" * 40)
print(f"Total rolls: {n_rolls}")
print(f"\nObserved frequencies: {observed_rolls}")
print(f"Expected frequencies: {expected_rolls}")

# Perform chi-square test
chi_stat, p_value = chisquare(observed_rolls, f_exp=expected_rolls)

print(f"\nChi-square statistic: {chi_stat:.4f}")
print(f"Degrees of freedom: {len(dice_faces) - 1}")
print(f"P-value: {p_value:.4f}")

# Interpretation at alpha = 0.05
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject null hypothesis (p < {alpha})")
    print("Evidence suggests the die is NOT fair.")
else:
    print(f"\nResult: Fail to reject null hypothesis (p >= {alpha})")
    print("No significant evidence that the die is unfair.")
Dice Fairness Test
========================================
Total rolls: 300
Observed frequencies: [42 48 55 53 51 51]
Expected frequencies: [50. 50. 50. 50. 50. 50.]

Chi-square statistic: 2.0800
Degrees of freedom: 5
P-value: 0.8380

Result: Fail to reject null hypothesis (p >= 0.05)
No significant evidence that the die is unfair.

The high p-value (0.84) indicates the observed deviations are easily explained by random chance. The die appears fair.

Now let’s test a suspicious die with more extreme deviations:

# Suspicious die - face 6 appears too often
observed_suspicious = np.array([35, 38, 42, 45, 52, 88])
expected_fair = np.array([50.0] * 6)

chi_stat, p_value = chisquare(observed_suspicious, f_exp=expected_fair)

print("Suspicious Die Test")
print(f"Observed: {observed_suspicious}")
print(f"Chi-square statistic: {chi_stat:.4f}")
print(f"P-value: {p_value:.2e}")
Suspicious Die Test
Observed: [35 38 42 45 52 88]
Chi-square statistic: 38.1200
P-value: 3.57e-07

With p < 0.001, we have strong evidence this die doesn’t follow a uniform distribution.
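A common follow-up once the overall test rejects is to inspect the standardized residuals, (O - E)/sqrt(E): values beyond roughly ±2 point at the categories driving the misfit. A sketch using the suspicious-die counts:

```python
import numpy as np

observed = np.array([35, 38, 42, 45, 52, 88])
expected = np.full(6, 50.0)

# Standardized residuals: signed deviation of each face in "standard units"
residuals = (observed - expected) / np.sqrt(expected)

for face, r in zip(range(1, 7), residuals):
    flag = " <-- large" if abs(r) > 2 else ""
    print(f"Face {face}: {r:+.2f}{flag}")
```

Face 6's residual (about +5.4) dwarfs the others, confirming that the excess of sixes is what makes the die suspicious rather than a diffuse misfit across all faces.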

Visualizing Results

Visualization makes the comparison between observed and expected frequencies immediately clear.

import matplotlib.pyplot as plt
import numpy as np

# Data from suspicious die test
categories = ['1', '2', '3', '4', '5', '6']
observed = np.array([35, 38, 42, 45, 52, 88])
expected = np.array([50.0] * 6)

# Create grouped bar chart
x = np.arange(len(categories))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - width/2, observed, width, label='Observed', color='steelblue')
bars2 = ax.bar(x + width/2, expected, width, label='Expected', color='coral')

ax.set_xlabel('Dice Face', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Chi-Square Goodness of Fit: Dice Fairness Test', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()

# Add value labels on bars
for bar in bars1:
    height = bar.get_height()
    ax.annotate(f'{int(height)}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),
                textcoords="offset points",
                ha='center', va='bottom')

for bar in bars2:
    height = bar.get_height()
    ax.annotate(f'{int(height)}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),
                textcoords="offset points",
                ha='center', va='bottom')

plt.tight_layout()
plt.savefig('chi_square_visualization.png', dpi=150)
plt.show()

This visualization immediately reveals that face 6 appears far more often than expected—the visual pattern matches our statistical conclusion.

Common Pitfalls and Best Practices

Small expected frequencies break the test. When any expected frequency falls below 5, the chi-square approximation becomes unreliable. Solutions include:

  • Combine adjacent categories to increase expected counts
  • Use Fisher’s exact test for 2x2 tables
  • Apply Yates’ continuity correction for 2 categories
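Combining categories can be as simple as summing the sparse tail before testing. A sketch with hypothetical counts where the last expected cell falls below 5:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([25, 18, 4, 3])
expected = np.array([24.0, 16.0, 6.0, 4.0])  # last cell violates the >= 5 rule

# Merge the two sparse categories into a single combined category
observed_merged = np.append(observed[:-2], observed[-2:].sum())
expected_merged = np.append(expected[:-2], expected[-2:].sum())

stat, p_value = chisquare(observed_merged, f_exp=expected_merged)
print(f"Merged observed: {observed_merged}")  # [25 18  7]
print(f"Merged expected: {expected_merged}")  # [24. 16. 10.]
print(f"P-value: {p_value:.4f}")
```

Note that merging costs a degree of freedom and blurs any distinction between the combined categories, so merge only categories that make sense together substantively.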
# Check for small expected frequencies
def validate_chi_square_assumptions(observed, expected):
    """Validate assumptions before running chi-square test."""
    issues = []
    
    if np.any(expected < 5):
        small_cats = np.where(expected < 5)[0]
        issues.append(f"Expected frequencies < 5 in categories: {small_cats}")
    
    if np.any(expected == 0):
        issues.append("Zero expected frequencies detected - test invalid")
    
    if not np.isclose(observed.sum(), expected.sum()):
        issues.append("Observed and expected totals don't match")
    
    if issues:
        print("WARNING: Assumption violations detected:")
        for issue in issues:
            print(f"  - {issue}")
        return False
    
    print("All assumptions satisfied.")
    return True

# Example with problematic data (last expected count falls below 5)
observed = np.array([45, 30, 20, 3, 2])
expected = np.array([40, 28, 20, 8, 4])

validate_chi_square_assumptions(observed, expected)

Don’t confuse goodness of fit with independence tests. The goodness of fit test examines one variable against a theoretical distribution. The chi-square test of independence examines relationships between two categorical variables—that’s scipy.stats.chi2_contingency().
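For contrast, a minimal independence-test sketch with a hypothetical 2x2 table of two categorical variables (the counts here are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = group, columns = preference
table = np.array([[30, 20],
                  [25, 45]])

stat, p_value, dof, expected = chi2_contingency(table)
print(f"Chi-square statistic: {stat:.4f}")
print(f"Degrees of freedom: {dof}")  # (rows-1)*(cols-1) = 1
print(f"P-value: {p_value:.4f}")
```

Unlike the goodness of fit test, the expected frequencies here are derived from the table's own row and column margins, not from a distribution you supply.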

Choose your significance level before testing. Deciding alpha after seeing results is p-hacking. Standard choices are 0.05 or 0.01, but domain context matters.

Consider the G-test for larger samples. The G-test (log-likelihood ratio test) is asymptotically equivalent to the chi-square test, tends to perform better with large samples, and its statistics add cleanly across nested analyses:

from scipy.stats import power_divergence

observed = np.array([42, 48, 55, 53, 51, 51])
expected = np.array([50.0] * 6)

# G-test (lambda_="log-likelihood")
g_stat, p_value = power_divergence(observed, expected, lambda_="log-likelihood")
print(f"G-test statistic: {g_stat:.4f}, P-value: {p_value:.4f}")

Report effect size alongside p-values. Cramér’s V or the contingency coefficient provides context that p-values alone miss—statistical significance doesn’t imply practical significance.
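For goodness of fit specifically, a convenient effect size is Cohen's w = sqrt(chi2 / N). A sketch reusing the suspicious-die data:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([35, 38, 42, 45, 52, 88])
expected = np.full(6, 50.0)

stat, p_value = chisquare(observed, f_exp=expected)

# Cohen's w: effect size on a scale independent of sample size
n = observed.sum()
w = np.sqrt(stat / n)
print(f"Cohen's w: {w:.3f}")  # conventional benchmarks: ~0.1 small, ~0.3 medium, ~0.5 large
```

Here w comes out around 0.36, a medium-to-large effect, so the result is not just statistically significant but practically meaningful.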

The chi-square goodness of fit test remains a workhorse for categorical data analysis. Master the assumptions, understand the calculation, and you’ll have a reliable tool for testing distributional hypotheses across countless applications.
