How to Determine Independence of Events

Statistical independence is a fundamental concept that determines whether two events influence each other. Two events A and B are independent if and only if P(A ∩ B) = P(A) × P(B). This guide shows how to verify that condition mathematically and how to test it on real data.

Key Insights

  • Two events are independent when P(A ∩ B) = P(A) × P(B), meaning the occurrence of one doesn’t affect the probability of the other—test this mathematically before making assumptions in your models
  • Mutually exclusive events (those that cannot occur together) are NOT independent unless one has zero probability—this is one of the most common mistakes in probability reasoning
  • Use chi-square tests for categorical data independence with sufficient sample sizes (expected counts ≥ 5), but switch to Fisher’s exact test for smaller datasets to avoid spurious results

Introduction to Event Independence

Statistical independence is a fundamental concept that determines whether two events influence each other. Two events A and B are independent if and only if:

P(A ∩ B) = P(A) × P(B)

This means the probability of both events occurring equals the product of their individual probabilities. When this equality holds, knowing that one event occurred gives you no information about whether the other occurred.

Independence matters because it simplifies probability calculations and underpins many statistical methods. Machine learning algorithms often assume feature independence (naive Bayes), A/B tests assume treatment assignment doesn’t affect user characteristics, and time series models may assume independent error terms.

Consider two scenarios: flipping two coins versus drawing two cards from a deck without replacement. Coin flips are independent—the first flip doesn’t affect the second. But card draws without replacement are dependent—drawing an ace first changes the probability of drawing another ace from 4/52 to 3/51.

Mathematical Tests for Independence

The most direct test uses the multiplication rule. Calculate P(A ∩ B) from your data and compare it to P(A) × P(B). If they’re equal (within rounding error for empirical data), the events are independent.

Alternatively, use conditional probability. Events A and B are independent if:

P(A|B) = P(A)

This states that knowing B occurred doesn’t change the probability of A. Similarly, P(B|A) should equal P(B).
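The conditional form can be checked numerically as well. A minimal sketch (the helper name is ours; the probabilities come from the coin and card examples above):

```python
def check_independence_conditional(p_a, p_b, p_a_and_b, tolerance=1e-6):
    """Check independence via the conditional form: P(A|B) == P(A)."""
    if p_b == 0:
        raise ValueError("P(A|B) is undefined when P(B) = 0")
    p_a_given_b = p_a_and_b / p_b  # definition of conditional probability
    return abs(p_a_given_b - p_a) < tolerance

# Coin flips: P(A|B) = 0.25 / 0.5 = 0.5 = P(A)
print(check_independence_conditional(0.5, 0.5, 0.25))  # True

# Card draws: P(A|B) = 3/51, but P(A) = 4/52
print(check_independence_conditional(4/52, 4/52, (4/52) * (3/51)))  # False
```

Both forms are algebraically equivalent whenever P(B) > 0, so use whichever matches the quantities you have at hand.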

Here’s a Python function to test independence:

def check_independence(p_a, p_b, p_a_and_b, tolerance=1e-6):
    """
    Check if two events are independent using the multiplication rule.
    
    Args:
        p_a: Probability of event A
        p_b: Probability of event B
        p_a_and_b: Probability of A and B occurring together
        tolerance: Acceptable difference for floating point comparison
    
    Returns:
        Boolean indicating independence and the calculated difference
    """
    expected_joint = p_a * p_b
    difference = abs(p_a_and_b - expected_joint)
    is_independent = difference < tolerance
    
    print(f"P(A) = {p_a}")
    print(f"P(B) = {p_b}")
    print(f"P(A ∩ B) = {p_a_and_b}")
    print(f"P(A) × P(B) = {expected_joint}")
    print(f"Difference = {difference}")
    
    return is_independent, difference

# Example: Two fair coin flips
is_indep, diff = check_independence(
    p_a=0.5,           # P(first coin heads)
    p_b=0.5,           # P(second coin heads)
    p_a_and_b=0.25     # P(both heads)
)
print(f"Independent: {is_indep}")  # True

# Example: Drawing aces without replacement
is_indep, diff = check_independence(
    p_a=4/52,          # P(first card is ace)
    p_b=4/52,          # P(second card is ace, marginally)
    p_a_and_b=(4/52)*(3/51)  # P(both aces)
)
print(f"Independent: {is_indep}")  # False

Independence in Discrete Events

When working with categorical data, contingency tables display joint frequencies. Under independence, the expected count in each cell is (row total × column total) / grand total; the events are independent when the observed counts match these expectations.

import numpy as np
import pandas as pd

def check_independence_contingency(contingency_table):
    """
    Check independence from a contingency table using expected frequencies.
    
    Args:
        contingency_table: 2D numpy array or DataFrame with observed counts
    
    Returns:
        Boolean indicating independence
    """
    observed = np.array(contingency_table)
    
    # Calculate marginal totals
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    grand_total = observed.sum()
    
    # Calculate expected frequencies under independence
    expected = np.outer(row_totals, col_totals) / grand_total
    
    print("Observed frequencies:")
    print(observed)
    print("\nExpected frequencies (if independent):")
    print(expected)
    
    # Check if observed matches expected (within tolerance)
    difference = np.abs(observed - expected).max()
    print(f"\nMax difference: {difference}")
    
    return difference < 1.0  # Rough threshold for demonstration

# Example: Customer purchase data
# Rows: Existing customer (Yes/No), Cols: Made purchase (Yes/No)
purchase_data = np.array([
    [45, 55],   # Existing customers
    [30, 70]    # New customers
])

is_indep = check_independence_contingency(purchase_data)
print(f"Independent: {is_indep}")

Testing Independence with Real Data

For real datasets, use statistical hypothesis tests. The chi-square test is the standard approach for categorical data:

from scipy.stats import chi2_contingency, fisher_exact
import numpy as np
import pandas as pd

def test_independence_chi_square(contingency_table, alpha=0.05):
    """
    Perform chi-square test for independence.
    
    Args:
        contingency_table: 2D array of observed frequencies
        alpha: Significance level
    
    Returns:
        Test results including p-value and conclusion
    """
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    print(f"Chi-square statistic: {chi2:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Degrees of freedom: {dof}")
    print(f"Expected frequencies:\n{expected}")
    
    # Check minimum expected frequency assumption
    min_expected = expected.min()
    print(f"\nMinimum expected frequency: {min_expected:.2f}")
    
    if min_expected < 5:
        print("WARNING: Chi-square may be unreliable (expected frequency < 5)")
        print("Consider Fisher's exact test instead")
    
    if p_value < alpha:
        print(f"\nReject null hypothesis (p < {alpha})")
        print("Evidence suggests events are DEPENDENT")
        return False
    else:
        print(f"\nFail to reject null hypothesis (p >= {alpha})")
        print("Insufficient evidence to conclude dependence")
        return True

# Example: Email campaign data
# Rows: Opened email (Yes/No), Cols: Clicked link (Yes/No)
email_data = np.array([
    [120, 80],   # Opened email
    [30, 270]    # Did not open
])

is_independent = test_independence_chi_square(email_data)

# For small samples, use Fisher's exact test
small_sample = np.array([
    [3, 1],
    [1, 3]
])

odds_ratio, p_value = fisher_exact(small_sample)
print(f"\nFisher's exact test p-value: {p_value:.4f}")

Common Pitfalls and Edge Cases

The most critical mistake is confusing mutually exclusive events with independent events. Mutually exclusive events CANNOT be independent (unless one has zero probability).

def demonstrate_mutually_exclusive_not_independent():
    """
    Show that mutually exclusive events are dependent.
    """
    # Rolling a die: Event A = roll 1, Event B = roll 6
    # These are mutually exclusive (can't both happen)
    
    p_a = 1/6           # P(roll 1)
    p_b = 1/6           # P(roll 6)
    p_a_and_b = 0       # Can't roll both 1 and 6
    
    print("Mutually Exclusive Events:")
    print(f"P(A) = {p_a:.4f}")
    print(f"P(B) = {p_b:.4f}")
    print(f"P(A ∩ B) = {p_a_and_b}")
    print(f"P(A) × P(B) = {p_a * p_b:.4f}")
    
    if p_a_and_b == p_a * p_b:
        print("Events are independent")
    else:
        print("Events are NOT independent!")
        print("Mutually exclusive events are always dependent")
    
    # Conditional probability confirms this:
    # P(A|B) = P(A ∩ B) / P(B) = 0 / (1/6) = 0, but P(A) = 1/6
    print("\nP(A|B) = 0 (if B occurs, A cannot)")
    print(f"P(A) = {p_a:.4f}")
    print("Since P(A|B) ≠ P(A), events are dependent")

demonstrate_mutually_exclusive_not_independent()

Conditional independence is another subtlety. Events may be marginally dependent but conditionally independent given a third variable, or vice versa.
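A common-cause structure illustrates the first case. In the sketch below (the numbers are illustrative), A and B are each driven by a shared binary variable C: within each value of C they are independent, yet marginally P(A ∩ B) ≠ P(A) × P(B).

```python
# Common cause C drives both A and B; all probabilities are illustrative.
p_c = 0.5                       # P(C = 1)
p_a_given = {1: 0.9, 0: 0.1}    # P(A | C)
p_b_given = {1: 0.9, 0: 0.1}    # P(B | C)

def p_of_c(c):
    return p_c if c == 1 else 1 - p_c

# Marginals via the law of total probability
p_a = sum(p_of_c(c) * p_a_given[c] for c in (0, 1))
p_b = sum(p_of_c(c) * p_b_given[c] for c in (0, 1))

# Joint: A and B are independent within each stratum of C
p_ab = sum(p_of_c(c) * p_a_given[c] * p_b_given[c] for c in (0, 1))

print(f"P(A) × P(B) = {p_a * p_b:.3f}")   # 0.250
print(f"P(A ∩ B)   = {p_ab:.3f}")         # 0.410
# Marginally dependent (0.41 != 0.25), yet conditionally independent:
# P(A ∩ B | C=1) = 0.81 = 0.9 * 0.9
```

This is why Bayesian networks test independence relative to a conditioning set rather than marginally.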

Practical Applications

In A/B testing, we assume treatment assignment is independent of user characteristics. Violating this assumption (e.g., showing variant B only to mobile users) invalidates results.
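The same chi-square machinery can audit a randomization. A sketch with made-up assignment counts (the table values and labels are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: variant A / variant B; Cols: mobile / desktop users
assignment = np.array([
    [480, 520],   # variant A
    [510, 490],   # variant B
])

chi2, p_value, dof, expected = chi2_contingency(assignment)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Assignment may be correlated with device type -- investigate")
else:
    print("No evidence that assignment depends on device type")
```

A small p-value here would suggest the assignment mechanism is entangled with a user characteristic, which is exactly the violation described above.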

For machine learning feature selection, checking independence helps identify redundant features:

from scipy.stats import chi2_contingency
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
import pandas as pd

def check_feature_independence(X, feature_idx1, feature_idx2, n_bins=5):
    """
    Check if two continuous features are independent by discretizing
    and applying chi-square test.
    
    Args:
        X: Feature matrix
        feature_idx1, feature_idx2: Indices of features to test
        n_bins: Number of bins for discretization
    """
    # Discretize continuous features
    discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
    
    f1 = discretizer.fit_transform(X[:, [feature_idx1]]).astype(int).flatten()
    f2 = discretizer.fit_transform(X[:, [feature_idx2]]).astype(int).flatten()
    
    # Create contingency table
    contingency = pd.crosstab(f1, f2)
    
    # Perform chi-square test
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
    
    print(f"Features {feature_idx1} and {feature_idx2}:")
    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print("Features appear DEPENDENT (may be redundant)\n")
    else:
        print("Features appear INDEPENDENT\n")
    
    return p_value >= 0.05

# Example with synthetic data
np.random.seed(42)
X = np.random.randn(1000, 3)
X[:, 2] = X[:, 0] + np.random.randn(1000) * 0.1  # Feature 2 depends on feature 0

check_feature_independence(X, 0, 1)  # Independent
check_feature_independence(X, 0, 2)  # Dependent

Conclusion and Best Practices

Use this checklist when determining independence:

  1. Calculate directly: For known probabilities, verify P(A ∩ B) = P(A) × P(B)
  2. Use appropriate tests: Chi-square for large samples (expected counts ≥ 5), Fisher’s exact for small samples
  3. Check assumptions: Ensure you’re not confusing mutually exclusive with independent
  4. Consider context: In Bayesian networks, check conditional independence given parent nodes
  5. Validate empirically: With real data, independence is rarely perfect—use significance levels appropriate for your domain

Never assume independence without testing when it matters for your analysis. In A/B tests, randomization ensures independence. In observational data, you must verify it. For machine learning, some algorithms (naive Bayes) assume independence for computational efficiency despite knowing it’s violated—understand when this trade-off is acceptable.

The mathematical definition is clean, but real-world application requires judgment about tolerance levels, sample sizes, and the consequences of incorrect assumptions. When in doubt, test rigorously and document your findings.
