How to Calculate Entropy in Probability

Key Insights

  • Entropy quantifies uncertainty in probability distributions using the formula H(X) = -Σ p(x) log₂ p(x), where higher values indicate more randomness and lower values indicate more predictability
  • Maximum entropy occurs with uniform distributions, while minimum entropy (zero) happens when one outcome has probability 1—this principle drives decision tree algorithms and feature selection in machine learning
  • Numerical stability requires special handling of zero probabilities, and vectorized NumPy operations can speed up entropy calculations by 10-100x compared to naive Python loops

Introduction to Entropy

Entropy measures uncertainty in probability distributions. When you flip a fair coin, you’re maximally uncertain about the outcome—that’s high entropy. When you flip a two-headed coin, there’s no uncertainty—that’s zero entropy.

Shannon entropy, developed by Claude Shannon in 1948, quantifies this mathematically:

H(X) = -Σ p(x) log₂ p(x)

This formula sums over all possible outcomes, where p(x) is the probability of outcome x. The logarithm base determines the unit: base 2 gives bits, natural log gives nats. Most applications use bits because they align with binary information.
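The unit conversion is just a change of logarithm base: one nat equals 1/ln(2) ≈ 1.4427 bits. A quick sketch of the same distribution measured both ways:

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.25])

# Entropy in bits (base-2 log) and in nats (natural log)
h_bits = -np.sum(probs * np.log2(probs))
h_nats = -np.sum(probs * np.log(probs))

# Conversion: H_bits = H_nats / ln(2)
print(f"{h_bits:.4f} bits == {h_nats:.4f} nats")
print(f"Converted: {h_nats / np.log(2):.4f} bits")
```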

Entropy drives critical decisions in machine learning. Decision trees use it to choose split points. Neural networks minimize cross-entropy loss. Feature selection algorithms identify high-information variables. Understanding entropy calculation is foundational for data science.

import numpy as np
import matplotlib.pyplot as plt

# Low entropy: biased coin
low_entropy_probs = [0.95, 0.05]
low_entropy = -sum(p * np.log2(p) for p in low_entropy_probs if p > 0)

# High entropy: fair coin
high_entropy_probs = [0.5, 0.5]
high_entropy = -sum(p * np.log2(p) for p in high_entropy_probs if p > 0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(['Heads', 'Tails'], low_entropy_probs, color='steelblue')
ax1.set_title(f'Low Entropy: {low_entropy:.3f} bits')
ax1.set_ylabel('Probability')
ax1.set_ylim([0, 1])

ax2.bar(['Heads', 'Tails'], high_entropy_probs, color='coral')
ax2.set_title(f'High Entropy: {high_entropy:.3f} bits')
ax2.set_ylabel('Probability')
ax2.set_ylim([0, 1])

plt.tight_layout()
plt.show()

The Mathematical Foundation

Let’s dissect the entropy formula. Each term -p(x) log₂ p(x) represents the “information content” of outcome x. Rare events (small p(x)) contribute more information than common events. The negative sign ensures entropy is non-negative, since log₂ p(x) is negative for probabilities less than 1.

Why logarithms? They convert multiplication to addition, making independent events’ entropies additive. Base 2 aligns with binary digits—one bit can distinguish between two equally likely outcomes.
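The additivity claim is easy to check numerically: for independent X and Y, the joint distribution is the outer product of the marginals, and its entropy equals the sum of the marginal entropies. A small sketch:

```python
import numpy as np

def h(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = np.array([0.5, 0.5])        # fair coin
py = np.array([0.7, 0.2, 0.1])   # biased three-outcome variable

# Joint distribution of independent X, Y: p(x, y) = p(x) * p(y)
joint = np.outer(px, py)

print(f"H(X) + H(Y) = {h(px) + h(py):.4f} bits")
print(f"H(X, Y)     = {h(joint.ravel()):.4f} bits")  # same value
```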

The edge case p(x) = 0 requires care. Mathematically, lim(p→0) p log p = 0, so we define 0 log 0 = 0. This makes intuitive sense: impossible events contribute no uncertainty.

Entropy ranges from 0 to log₂(n) for n possible outcomes. Minimum entropy (0) occurs when one outcome has probability 1. Maximum entropy (log₂(n)) occurs when all outcomes are equally likely—the uniform distribution.

def entropy_detailed(probabilities):
    """Calculate entropy with step-by-step breakdown."""
    print(f"{'Outcome':<10} {'p(x)':<10} {'log₂ p(x)':<12} {'-p(x)log₂ p(x)':<15}")
    print("-" * 50)
    
    total_entropy = 0
    for i, p in enumerate(probabilities):
        if p > 0:
            log_p = np.log2(p)
            contribution = -p * log_p
            total_entropy += contribution
            print(f"{i:<10} {p:<10.4f} {log_p:<12.4f} {contribution:<15.4f}")
        else:
            print(f"{i:<10} {p:<10.4f} {'undefined':<12} {0:<15.4f}")
    
    print("-" * 50)
    print(f"Total Entropy: {total_entropy:.4f} bits\n")
    return total_entropy

# Example: biased die
die_probs = [0.5, 0.2, 0.15, 0.1, 0.05, 0.0]
entropy_detailed(die_probs)

Calculating Entropy for Discrete Distributions

For discrete distributions, entropy calculation is straightforward: sum the formula over all outcomes. Let’s work through common examples.

A fair coin has two equally likely outcomes: p(H) = p(T) = 0.5. Entropy = -0.5 log₂(0.5) - 0.5 log₂(0.5) = -0.5(-1) - 0.5(-1) = 1 bit. This is maximum entropy for two outcomes.

A biased coin with p(H) = 0.9, p(T) = 0.1 has lower entropy: -0.9 log₂(0.9) - 0.1 log₂(0.1) ≈ 0.469 bits. Less uncertainty means less entropy.

A fair six-sided die has maximum entropy: log₂(6) ≈ 2.585 bits. Any bias reduces this.

def entropy(probabilities):
    """Calculate Shannon entropy from probability distribution."""
    probabilities = np.array(probabilities)
    # Filter out zero probabilities
    probabilities = probabilities[probabilities > 0]
    return -np.sum(probabilities * np.log2(probabilities))

def entropy_from_counts(counts):
    """Calculate entropy from frequency counts."""
    counts = np.array(counts)
    probabilities = counts / counts.sum()
    return entropy(probabilities)

# Fair vs biased coin
print(f"Fair coin entropy: {entropy([0.5, 0.5]):.4f} bits")
print(f"Biased coin entropy: {entropy([0.9, 0.1]):.4f} bits")

# Fair vs loaded die
print(f"\nFair die entropy: {entropy([1/6]*6):.4f} bits")
print(f"Loaded die entropy: {entropy([0.5, 0.2, 0.15, 0.1, 0.04, 0.01]):.4f} bits")

# From counts (e.g., survey responses)
responses = [10, 25, 15, 30, 20]  # 5 categories
print(f"\nSurvey entropy: {entropy_from_counts(responses):.4f} bits")
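If SciPy is available, `scipy.stats.entropy` makes a handy cross-check for implementations like the ones above: it normalizes raw counts for you and takes a `base` argument (defaulting to nats).

```python
from scipy.stats import entropy as scipy_entropy

probs = [0.5, 0.2, 0.15, 0.1, 0.05]

# base=2 gives bits; the default (natural log) gives nats
h_bits = scipy_entropy(probs, base=2)
print(f"SciPy entropy: {h_bits:.4f} bits")

# Raw counts are normalized automatically
counts = [10, 25, 15, 30, 20]
print(f"From counts: {scipy_entropy(counts, base=2):.4f} bits")
```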

Practical Applications

Entropy drives decision tree algorithms. At each node, the algorithm chooses the split that maximizes information gain: the reduction in entropy after splitting.

Information gain = H(parent) - Σ (|child_i| / |parent|) * H(child_i)

Higher information gain means the split better separates the data. This greedy approach builds trees that classify efficiently.

Feature selection uses entropy to identify informative variables. High-entropy features provide more distinguishing power. Low-entropy features (nearly constant) offer little value.

Data compression efficiency relates directly to entropy. Optimal compression for a source with entropy H requires at least H bits per symbol on average. This is Shannon’s source coding theorem.
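To see the source coding theorem in action, here is a minimal Huffman-coding sketch built on `heapq`. It computes only code lengths, not the codes themselves; the average length always falls between H and H + 1 bits per symbol, and matches H exactly when all probabilities are powers of 1/2.

```python
import heapq
import numpy as np

def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) for each symbol."""
    # Heap entries: (probability, unique tiebreaker, symbol indices below this node)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # dyadic, so Huffman is exactly optimal
lengths = huffman_code_lengths(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
h = -sum(p * np.log2(p) for p in probs)

print(f"Entropy:            {h:.4f} bits/symbol")
print(f"Huffman avg length: {avg_len:.4f} bits/symbol")
```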

import pandas as pd
from collections import Counter

def information_gain(parent_labels, left_labels, right_labels):
    """Calculate information gain from a binary split."""
    def label_entropy(labels):
        if len(labels) == 0:
            return 0
        counts = Counter(labels)
        probs = [count/len(labels) for count in counts.values()]
        return entropy(probs)
    
    parent_entropy = label_entropy(parent_labels)
    n = len(parent_labels)
    n_left = len(left_labels)
    n_right = len(right_labels)
    
    weighted_child_entropy = (n_left/n * label_entropy(left_labels) + 
                              n_right/n * label_entropy(right_labels))
    
    return parent_entropy - weighted_child_entropy

# Example: splitting iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Calculate entropy of species column
species_counts = df['species'].value_counts()
print(f"Species entropy: {entropy_from_counts(species_counts.values):.4f} bits")

# Evaluate a split on petal length
threshold = 2.5
left_mask = df['petal length (cm)'] <= threshold
left_labels = df.loc[left_mask, 'species'].values
right_labels = df.loc[~left_mask, 'species'].values

ig = information_gain(df['species'].values, left_labels, right_labels)
print(f"Information gain (petal length <= {threshold}): {ig:.4f} bits")

Cross-Entropy and KL Divergence

Cross-entropy extends entropy to compare two distributions. Given true distribution P and predicted distribution Q:

H(P, Q) = -Σ p(x) log₂ q(x)

Cross-entropy measures the average number of bits needed to encode data from P using an encoding optimized for Q. When P = Q, cross-entropy equals entropy. When P ≠ Q, cross-entropy exceeds entropy.

KL divergence (Kullback-Leibler divergence) quantifies how much Q diverges from P:

D_KL(P || Q) = Σ p(x) log₂(p(x) / q(x)) = H(P, Q) - H(P)

KL divergence is always non-negative and equals zero only when P = Q. It’s asymmetric: D_KL(P || Q) ≠ D_KL(Q || P).

Neural networks minimize cross-entropy loss for classification. This is equivalent to maximizing the likelihood of the true labels under the predicted distribution.

def cross_entropy(p, q):
    """Calculate cross-entropy H(P, Q)."""
    p = np.array(p)
    q = np.array(q)
    # Avoid log(0)
    q = np.clip(q, 1e-15, 1)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """Calculate KL divergence D_KL(P || Q)."""
    p = np.array(p)
    q = np.array(q)
    # Only sum where p > 0
    mask = p > 0
    q = np.clip(q, 1e-15, 1)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# True distribution (fair die)
p = np.array([1/6] * 6)

# Predicted distribution (biased estimate)
q = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])

print(f"Entropy H(P): {entropy(p):.4f} bits")
print(f"Cross-entropy H(P, Q): {cross_entropy(p, q):.4f} bits")
print(f"KL divergence D_KL(P || Q): {kl_divergence(p, q):.4f} bits")
print(f"Verification: H(P,Q) - H(P) = {cross_entropy(p, q) - entropy(p):.4f}")
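The classification-loss connection can be checked directly: when the true distribution is one-hot, every term of the cross-entropy sum vanishes except the true class, leaving the negative log of the probability assigned to the correct label. A sketch in natural log, as deep learning losses typically use:

```python
import numpy as np

# Predicted class probabilities (e.g., a softmax output) over 3 classes
q = np.array([0.7, 0.2, 0.1])

# One-hot "true" distribution: class 0 is correct
p = np.array([1.0, 0.0, 0.0])

# Cross-entropy in nats; only the true class's term survives
ce = -np.sum(p * np.log(np.clip(q, 1e-15, 1)))
nll = -np.log(q[0])  # negative log-likelihood of the correct class

print(f"Cross-entropy:     {ce:.4f} nats")
print(f"NLL of true class: {nll:.4f} nats")  # same value
```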

Implementation Best Practices

Numerical stability is critical. The naive implementation fails when probabilities are zero or very small. Always handle the 0 log 0 case explicitly and clip probabilities away from zero for cross-entropy.

Vectorization dramatically improves performance. NumPy operations on arrays are 10-100x faster than Python loops. Use boolean indexing to filter zero probabilities efficiently.

For large datasets, compute entropy on value counts rather than raw data. Pandas’ value_counts() is optimized and reduces memory usage.

def entropy_robust(probabilities, base=2):
    """Numerically stable entropy calculation."""
    probabilities = np.asarray(probabilities, dtype=np.float64)
    
    # Filter out zero probabilities
    probabilities = probabilities[probabilities > 0]
    
    # Normalize if needed
    prob_sum = probabilities.sum()
    if not np.isclose(prob_sum, 1.0):
        probabilities = probabilities / prob_sum
    
    # Use natural log if base is e, otherwise convert
    if base == np.e:
        return -np.sum(probabilities * np.log(probabilities))
    else:
        return -np.sum(probabilities * np.log(probabilities)) / np.log(base)

def entropy_from_data(data):
    """Efficiently calculate entropy from raw data."""
    # Use pandas for efficient counting
    if isinstance(data, pd.Series):
        counts = data.value_counts().values
    else:
        counts = pd.Series(data).value_counts().values
    
    probabilities = counts / counts.sum()
    return entropy_robust(probabilities)

# Performance comparison
import time

large_data = np.random.choice(['A', 'B', 'C', 'D'], size=1000000)

# Naive approach (slow)
start = time.time()
counts_dict = {}
for item in large_data:
    counts_dict[item] = counts_dict.get(item, 0) + 1
naive_probs = np.array(list(counts_dict.values())) / len(large_data)
naive_result = entropy_robust(naive_probs)
naive_time = time.time() - start

# Vectorized approach (fast)
start = time.time()
vectorized_result = entropy_from_data(large_data)
vectorized_time = time.time() - start

print(f"Naive approach: {naive_result:.4f} bits in {naive_time:.4f}s")
print(f"Vectorized approach: {vectorized_result:.4f} bits in {vectorized_time:.4f}s")
print(f"Speedup: {naive_time/vectorized_time:.1f}x")

Entropy is fundamental to understanding information and uncertainty. Master these calculations and you’ll have a powerful tool for analyzing data, building models, and making informed algorithmic choices. The implementations here handle edge cases and scale to production use—adapt them to your specific needs.
