How to Calculate Mutual Information

Key Insights

  • Mutual information quantifies the dependence between variables by measuring how much knowing one variable reduces uncertainty about another, making it superior to correlation for detecting non-linear relationships
  • For discrete variables, calculate MI directly from probability distributions using MI(X;Y) = H(X) + H(Y) - H(X,Y), while continuous variables require binning or kernel density estimation approaches
  • Normalized mutual information (NMI) variants enable fair comparison across different variable pairs by scaling MI scores between 0 and 1, essential for feature selection tasks

Introduction to Mutual Information

Mutual information (MI) measures the dependence between two random variables by quantifying how much information one variable contains about another. Unlike Pearson correlation, which only captures linear relationships, mutual information detects any type of dependency—linear, non-linear, or otherwise.

The core concept is intuitive: if two variables are independent, knowing one tells you nothing about the other, yielding zero mutual information. Conversely, if they’re perfectly dependent, knowing one completely determines the other, producing maximum MI.

This makes mutual information invaluable for feature selection in machine learning, where you want to identify which features carry the most information about your target variable. It’s also fundamental in information theory, data compression, and causal inference.

Mathematical Foundation

Mutual information is defined in terms of entropy, which measures uncertainty or randomness in a random variable. For discrete variables, entropy is:

H(X) = -Σ p(x) log₂ p(x)

Mutual information between variables X and Y is:

MI(X;Y) = H(X) + H(Y) - H(X,Y)

Where H(X,Y) is the joint entropy. This formula reveals that MI measures how much the joint entropy falls short of what it would be if X and Y were independent.

An alternative formulation uses Kullback-Leibler divergence:

MI(X;Y) = Σ Σ p(x,y) log₂(p(x,y) / (p(x)p(y)))

This shows MI as the divergence between the joint distribution and the product of marginals.

Here’s a function to calculate Shannon entropy:

import numpy as np

def entropy(probabilities):
    """Calculate Shannon entropy from probability distribution."""
    # Remove zero probabilities to avoid log(0)
    probs = probabilities[probabilities > 0]
    return -np.sum(probs * np.log2(probs))

# Example usage
probs = np.array([0.5, 0.3, 0.2])
h = entropy(probs)
print(f"Entropy: {h:.3f} bits")  # Output: Entropy: 1.485 bits
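With this helper in hand, the identity MI(X;Y) = H(X) + H(Y) - H(X,Y) can be checked directly on a small joint distribution. The 2×2 table below is purely illustrative:

```python
import numpy as np

def entropy(probabilities):
    """Calculate Shannon entropy from probability distribution."""
    probs = probabilities[probabilities > 0]
    return -np.sum(probs * np.log2(probs))

# Illustrative 2x2 joint distribution (rows: X, columns: Y)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)  # marginal distribution of X
p_y = joint.sum(axis=0)  # marginal distribution of Y

# MI(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy(p_x) + entropy(p_y) - entropy(joint.flatten())
print(f"MI: {mi:.3f} bits")  # Output: MI: 0.278 bits
```

Both marginals are uniform here (H(X) = H(Y) = 1 bit), so all of the MI comes from the joint entropy falling short of 2 bits.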

Discrete Variable Implementation

For discrete variables, calculating mutual information is straightforward once you have probability distributions. The key steps are:

  1. Estimate joint probabilities from your data
  2. Calculate marginal probabilities
  3. Compute entropies
  4. Apply the MI formula

Here’s a complete implementation:

import numpy as np
import pandas as pd

def calculate_probabilities(x, y):
    """Calculate joint and marginal probabilities from discrete data."""
    # Create contingency table
    contingency = pd.crosstab(x, y)
    
    # Convert to probabilities
    joint_prob = contingency.values / contingency.values.sum()
    
    # Marginal probabilities
    p_x = joint_prob.sum(axis=1)
    p_y = joint_prob.sum(axis=0)
    
    return joint_prob, p_x, p_y

def mutual_information_discrete(x, y):
    """Calculate mutual information for discrete variables."""
    joint_prob, p_x, p_y = calculate_probabilities(x, y)
    
    mi = 0.0
    for i in range(len(p_x)):
        for j in range(len(p_y)):
            if joint_prob[i, j] > 0:
                mi += joint_prob[i, j] * np.log2(
                    joint_prob[i, j] / (p_x[i] * p_y[j])
                )
    
    return mi

# Example with categorical data
x = np.array(['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'])
y = np.array(['X', 'X', 'Y', 'Y', 'X', 'Y', 'Y', 'X'])

mi = mutual_information_discrete(x, y)
print(f"Mutual Information: {mi:.3f} bits")

This implementation handles the edge case of zero probabilities by checking joint_prob[i, j] > 0 before computing the logarithm; by the usual convention, 0 · log 0 terms contribute nothing to the sum.

Continuous Variable Implementation

Continuous variables present challenges because you can’t directly estimate probability mass functions. Two common approaches are binning and kernel density estimation.

Binning discretizes continuous data into intervals:

from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score
import numpy as np

def mutual_information_binning(x, y, bins=10):
    """Calculate MI for continuous variables using binning."""
    # Discretize both variables
    x_binned = np.digitize(x, bins=np.histogram_bin_edges(x, bins=bins))
    y_binned = np.digitize(y, bins=np.histogram_bin_edges(y, bins=bins))
    
    # Use discrete MI calculation
    return mutual_info_score(x_binned, y_binned)

# Example with continuous data
np.random.seed(42)
x = np.random.randn(1000)
y = x**2 + np.random.randn(1000) * 0.5  # Non-linear relationship

mi_binned = mutual_information_binning(x, y, bins=20)
print(f"MI (binning): {mi_binned:.3f}")

# Using scikit-learn's built-in estimator (uses KNN-based approach)
mi_sklearn = mutual_info_regression(x.reshape(-1, 1), y, random_state=42)[0]
print(f"MI (sklearn): {mi_sklearn:.3f}")

Scikit-learn’s mutual_info_regression uses a k-nearest-neighbors estimator that often works better than simple binning. Note that both mutual_info_score and mutual_info_regression report MI in nats (natural logarithm), not bits, so their values are not directly comparable to the log₂-based implementations above. The choice of bins significantly affects results—too few loses information, too many creates sparse, inflated estimates.
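To see this bin sensitivity concretely, here is a sketch that sweeps the bin count on the same kind of non-linear data (values are in nats, since mutual_info_score uses the natural log):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

np.random.seed(42)
x = np.random.randn(1000)
y = x**2 + np.random.randn(1000) * 0.5

# Estimated MI (in nats) as the bin count grows: coarse bins miss
# detail, while very fine bins inflate the estimate via sparse cells
results = {}
for bins in [5, 20, 100]:
    x_b = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    y_b = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    results[bins] = mutual_info_score(x_b, y_b)
    print(f"{bins:>3} bins: MI = {results[bins]:.3f} nats")
```

The estimate keeps climbing with finer bins even though the underlying dependence is fixed, which is exactly why the bin count needs to be validated rather than picked arbitrarily.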

Normalized Mutual Information

Raw mutual information values are difficult to compare across different variable pairs because they depend on the marginal entropies. Normalized mutual information (NMI) scales MI to [0, 1], enabling fair comparisons.

Common normalization schemes include:

def normalized_mutual_information(x, y, method='arithmetic'):
    """
    Calculate normalized mutual information.
    
    Parameters:
    -----------
    method : str
        'max': MI / max(H(X), H(Y))
        'min': MI / min(H(X), H(Y))
        'arithmetic': MI / ((H(X) + H(Y)) / 2)
        'geometric': MI / sqrt(H(X) * H(Y))
    """
    # Marginals for the entropy terms; the MI itself reuses the
    # discrete implementation defined above
    _, p_x, p_y = calculate_probabilities(x, y)
    mi = mutual_information_discrete(x, y)
    
    # Calculate entropies
    h_x = entropy(p_x)
    h_y = entropy(p_y)
    
    # Normalize
    if method == 'max':
        return mi / max(h_x, h_y)
    elif method == 'min':
        return mi / min(h_x, h_y)
    elif method == 'arithmetic':
        return mi / ((h_x + h_y) / 2)
    elif method == 'geometric':
        return mi / np.sqrt(h_x * h_y)
    else:
        raise ValueError(f"Unknown method: {method}")

# Example
x = np.array(['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'])
y = np.array(['X', 'X', 'Y', 'Y', 'X', 'Y', 'Y', 'X'])

for method in ['max', 'min', 'arithmetic', 'geometric']:
    nmi = normalized_mutual_information(x, y, method=method)
    print(f"NMI ({method}): {nmi:.3f}")

The arithmetic mean normalization is most common, but the choice depends on your application. Use max normalization when you want to penalize high-entropy variables.

Practical Applications & Interpretation

Feature selection is where mutual information shines. Here’s a complete workflow for selecting the top-k most informative features:

from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
import numpy as np
import pandas as pd

# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=20, 
    n_informative=10,
    n_redundant=5,
    random_state=42
)

# Calculate MI for each feature
mi_scores = mutual_info_classif(X, y, random_state=42)

# Create results dataframe
feature_importance = pd.DataFrame({
    'feature': [f'feature_{i}' for i in range(X.shape[1])],
    'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)

print("Top 5 features by MI:")
print(feature_importance.head())

# Select top-k features
def select_features_by_mi(X, y, k=10):
    """Select top-k features by mutual information."""
    mi_scores = mutual_info_classif(X, y, random_state=42)
    top_k_indices = np.argsort(mi_scores)[-k:][::-1]
    return top_k_indices, mi_scores[top_k_indices]

top_features, top_scores = select_features_by_mi(X, y, k=5)
print(f"\nSelected features: {top_features}")
print(f"MI scores: {top_scores}")

Interpretation guidelines:

  • MI = 0: Variables are independent
  • MI > 0: Variables share information; higher values indicate stronger dependence
  • Compare with maximum entropy: MI close to min(H(X), H(Y)) indicates strong dependence

Use MI over correlation when you suspect non-linear relationships or have categorical variables. MI captures any dependency structure, while correlation only detects linear relationships.
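The contrast is easy to demonstrate: for a symmetric non-linear relationship such as y = x², Pearson correlation is near zero while MI is clearly positive. A sketch (the exact MI value depends on the estimator):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x**2  # purely non-linear, symmetric dependency

# Pearson correlation is near zero for this symmetric relationship...
r = np.corrcoef(x, y)[0, 1]

# ...while the KNN-based MI estimator clearly detects the dependence
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {r:.3f}")
print(f"MI (nats): {mi:.3f}")
```

A correlation-based feature selector would discard x here; an MI-based one would keep it.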

Performance Considerations & Best Practices

Computational complexity for discrete MI is O(n + m × k), where n is the sample size and m and k are the numbers of unique values: one pass over the data builds the contingency table, and the double loop over its cells costs m × k. For continuous variables, binning is O(n log n) due to sorting, while KNN-based estimators cost O(n²) with a naive neighbor search, or closer to O(n log n) with tree-based implementations.

Sample size requirements: You need sufficient samples to reliably estimate probabilities. A rule of thumb is at least 5-10 samples per unique value combination for discrete variables. For continuous variables, use at least 100-200 samples.
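A quick way to see why sample size matters: for two independent discrete variables the true MI is zero, but the plug-in estimate is biased upward, and the bias shrinks as n grows. A sketch using mutual_info_score (which reports nats):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(42)

# X and Y are independent, so the true MI is 0; the plug-in
# estimate is biased upward, and the bias shrinks roughly as 1/n
estimates = {}
for n in [50, 500, 5000]:
    x = rng.integers(0, 5, size=n)
    y = rng.integers(0, 5, size=n)
    estimates[n] = mutual_info_score(x, y)
    print(f"n={n:>5}: estimated MI = {estimates[n]:.3f} nats")
```

With only 50 samples spread over 25 value combinations, the estimator reports a spurious positive MI; with thousands of samples it approaches the true value of zero.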

Binning strategy: Start with the Freedman-Diaconis rule for bin width: 2 × IQR × n^(-1/3). Experiment with different bin counts and use cross-validation to validate your choice.
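The Freedman-Diaconis bin width translates into a bin count as follows (NumPy also supports this directly via bins='fd' in np.histogram_bin_edges):

```python
import numpy as np

def fd_bin_count(x):
    """Number of bins implied by the Freedman-Diaconis rule."""
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2 * iqr * len(x) ** (-1 / 3)  # FD bin width
    return max(1, int(np.ceil((x.max() - x.min()) / width)))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(f"FD bins: {fd_bin_count(x)}")

# Equivalent built-in: len(np.histogram_bin_edges(x, bins='fd')) - 1
```

For 1000 standard normal samples this lands around 20-30 bins, a sensible starting point to tune from.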

Limitations:

  • MI cannot detect the direction of relationships (it’s symmetric)
  • Sensitive to estimation method for continuous variables
  • Requires more samples than simple correlation
  • Doesn’t distinguish between direct and indirect dependencies

When to use alternatives: Consider distance correlation for continuous variables with complex dependencies, or conditional mutual information when you need to account for confounding variables.
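For reference, conditional mutual information has a plug-in estimator analogous to the discrete MI above, via the identity CMI(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z). The sketch below uses a deliberately trivial confounder Z, purely for illustration:

```python
import numpy as np
from collections import Counter

def _entropy_from_counts(counts, n):
    """Shannon entropy (bits) from a Counter of observations."""
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of CMI(X;Y|Z) for discrete data, in bits:
    CMI = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    n = len(x)
    h_xz = _entropy_from_counts(Counter(zip(x, z)), n)
    h_yz = _entropy_from_counts(Counter(zip(y, z)), n)
    h_z = _entropy_from_counts(Counter(z), n)
    h_xyz = _entropy_from_counts(Counter(zip(x, y, z)), n)
    return h_xz + h_yz - h_z - h_xyz

# Z drives both X and Y; conditioned on Z they share no information
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)
x = z.copy()
y = z.copy()

cmi = conditional_mutual_information(x, y, z)
print(f"CMI(X;Y|Z) = {cmi:.3f} bits")
```

Here the unconditional MI(X;Y) is high, but conditioning on the confounder Z drives the CMI to zero, which is exactly the distinction between direct and confounded dependence.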

For high-dimensional data, use mutual information for initial feature screening, then apply more sophisticated methods like recursive feature elimination. Always validate your selected features with cross-validation to ensure they generalize.
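One way to sketch that two-stage workflow with scikit-learn (the feature counts here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=8, random_state=0)

# Stage 1: cheap MI screening down to 15 features;
# Stage 2: recursive feature elimination to the final 5
pipe = Pipeline([
    ("screen", SelectKBest(mutual_info_classif, k=15)),
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("Final feature count:", pipe.named_steps["rfe"].n_features_)
```

Wrapping both stages in a Pipeline matters: it keeps the selection steps inside any cross-validation loop, so the chosen features are validated rather than leaked.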
