How to Calculate Point-Biserial Correlation in Python

Key Insights

  • Point-biserial correlation is simply Pearson correlation applied to one binary and one continuous variable—use scipy.stats.pointbiserialr() for quick calculations with built-in p-values.
  • The coefficient ranges from -1 to +1, where values above 0.3 (or below -0.3) typically indicate meaningful relationships in social science contexts.
  • Always check your assumptions: the binary variable should represent a true dichotomy, and the continuous variable should be roughly normally distributed within each group.

Introduction to Point-Biserial Correlation

Point-biserial correlation measures the strength and direction of association between a binary variable and a continuous variable. If you’ve ever needed to answer questions like “Is there a relationship between passing an exam (yes/no) and study hours?” or “Does employment status (employed/unemployed) correlate with income level?”, this is your tool.

The technique appears frequently in educational research, psychology, and A/B testing scenarios. Common applications include:

  • Test item analysis (correct/incorrect vs. total score)
  • Medical studies (treatment/control vs. outcome measure)
  • HR analytics (hired/not hired vs. interview score)
  • Marketing (converted/not converted vs. time on page)

Here’s the key insight that simplifies everything: point-biserial correlation is mathematically identical to Pearson correlation when one variable is dichotomous. Python’s implementation reflects this—you could use either function and get the same result. The separate function exists primarily for clarity and convenience.
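You can confirm the equivalence directly (small made-up arrays for illustration):

```python
import numpy as np
from scipy import stats

# Toy data: binary group membership and a continuous measurement
binary = np.array([1, 0, 1, 1, 0, 0, 1, 0])
continuous = np.array([6.1, 3.2, 5.8, 7.0, 2.9, 3.5, 6.4, 4.0])

r_pb, p_pb = stats.pointbiserialr(binary, continuous)
r_pearson, p_pearson = stats.pearsonr(binary, continuous)

# Identical to floating-point precision
print(f"pointbiserialr: {r_pb:.6f}, pearsonr: {r_pearson:.6f}")
```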

Mathematical Foundation

The point-biserial correlation coefficient (r_pb) is calculated using this formula:

r_pb = ((M₁ - M₀) / s_y) * √(n₁ * n₀ / n²)

Where:

  • M₁ = mean of continuous variable for group coded as 1
  • M₀ = mean of continuous variable for group coded as 0
  • s_y = standard deviation of the entire continuous variable (population form, i.e. dividing by n)
  • n₁ = number of observations in group 1
  • n₀ = number of observations in group 0
  • n = total number of observations

The formula essentially compares the difference in means between your two groups, scaled by the overall variability and adjusted for group sizes.

Assumptions to verify:

  1. True dichotomy: The binary variable should represent genuinely distinct categories (male/female, pass/fail), not an artificially split continuous variable.
  2. Normality: The continuous variable should be approximately normally distributed within each group.
  3. Homoscedasticity: Variance of the continuous variable should be similar across both groups.
  4. Independence: Observations should be independent of each other.
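Assumptions 2 and 3 can be screened with standard SciPy tests. Here's a minimal sketch on simulated data (the variable names and effect size are illustrative):

```python
import numpy as np
from scipy import stats

# Simulated data: group 1 shifted upward by an illustrative effect
rng = np.random.default_rng(0)
binary = rng.integers(0, 2, 60)
continuous = rng.normal(10, 2, 60) + 3 * binary

group_0 = continuous[binary == 0]
group_1 = continuous[binary == 1]

# Assumption 2: normality within each group (Shapiro-Wilk)
_, p_norm_0 = stats.shapiro(group_0)
_, p_norm_1 = stats.shapiro(group_1)

# Assumption 3: homoscedasticity across groups (Levene's test)
_, p_levene = stats.levene(group_0, group_1)

print(f"Shapiro p-values: group 0 = {p_norm_0:.3f}, group 1 = {p_norm_1:.3f}")
print(f"Levene p-value: {p_levene:.3f}")
# p-values below 0.05 flag potential violations
```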

Interpreting the coefficient:

Absolute Value   Interpretation
0.00 - 0.10      Negligible
0.10 - 0.30      Weak
0.30 - 0.50      Moderate
0.50 - 0.70      Strong
0.70 - 1.00      Very strong

These thresholds come from Cohen’s conventions and work well for behavioral sciences. Adjust expectations based on your domain.
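If you want those thresholds in code, a small helper (the labels mirror the table above) might look like:

```python
def interpret_r(r):
    """Label |r| using Cohen-style strength thresholds."""
    cutoffs = [(0.10, "Negligible"), (0.30, "Weak"),
               (0.50, "Moderate"), (0.70, "Strong"), (1.00, "Very strong")]
    for cutoff, label in cutoffs:
        if abs(r) <= cutoff:
            return label
    return "Very strong"  # |r| can't exceed 1, but guard anyway

print(interpret_r(0.25))   # Weak
print(interpret_r(-0.62))  # Strong
```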

Using SciPy’s Built-in Function

SciPy provides pointbiserialr() in its stats module. This is the recommended approach for most use cases—it’s tested, optimized, and returns both the correlation coefficient and p-value.

import numpy as np
from scipy import stats

# Sample data: exam pass/fail (binary) and study hours (continuous)
passed_exam = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
study_hours = np.array([8, 7, 9, 3, 2, 6, 4, 8, 3, 2, 7, 9, 4, 3, 6])

# Calculate point-biserial correlation
correlation, p_value = stats.pointbiserialr(passed_exam, study_hours)

print(f"Point-biserial correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("The correlation is statistically significant at α=0.05")
else:
    print("The correlation is not statistically significant at α=0.05")

Output:

Point-biserial correlation: 0.9186
P-value: 0.0000
The correlation is statistically significant at α=0.05

The function returns a named-tuple-style result object, so you can also access results via .correlation and .pvalue attributes if you prefer that syntax (the object's class name varies by SciPy version).
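For example, reusing the exam data from above:

```python
import numpy as np
from scipy import stats

passed_exam = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
study_hours = np.array([8, 7, 9, 3, 2, 6, 4, 8, 3, 2, 7, 9, 4, 3, 6])

result = stats.pointbiserialr(passed_exam, study_hours)

# Attribute access instead of tuple unpacking
print(f"r = {result.correlation:.4f}, p = {result.pvalue:.4f}")
```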

Manual Calculation from Scratch

Understanding the manual calculation helps you verify results and adapt the method when needed. Here’s a NumPy implementation that follows the formula directly:

import numpy as np
from scipy import stats

def point_biserial_manual(binary_var, continuous_var):
    """
    Calculate point-biserial correlation coefficient manually.
    
    Parameters:
    -----------
    binary_var : array-like
        Binary variable (0s and 1s)
    continuous_var : array-like
        Continuous variable
        
    Returns:
    --------
    float
        Point-biserial correlation coefficient
    """
    binary_var = np.asarray(binary_var)
    continuous_var = np.asarray(continuous_var)
    
    # Separate continuous values by group
    group_1 = continuous_var[binary_var == 1]
    group_0 = continuous_var[binary_var == 0]
    
    # Calculate components
    n = len(continuous_var)
    n1 = len(group_1)
    n0 = len(group_0)
    
    mean_1 = np.mean(group_1)
    mean_0 = np.mean(group_0)
    
    # Use population std (ddof=0) to match the formula
    std_y = np.std(continuous_var, ddof=0)
    
    # Calculate point-biserial correlation
    r_pb = ((mean_1 - mean_0) / std_y) * np.sqrt((n1 * n0) / (n ** 2))
    
    return r_pb

# Test with same data
passed_exam = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
study_hours = np.array([8, 7, 9, 3, 2, 6, 4, 8, 3, 2, 7, 9, 4, 3, 6])

manual_result = point_biserial_manual(passed_exam, study_hours)
scipy_result, _ = stats.pointbiserialr(passed_exam, study_hours)

print(f"Manual calculation: {manual_result:.4f}")
print(f"SciPy calculation:  {scipy_result:.4f}")
print(f"Results match: {np.isclose(manual_result, scipy_result)}")

Output:

Manual calculation: 0.9186
SciPy calculation:  0.9186
Results match: True

Note that SciPy computes this as a Pearson correlation under the hood. Equivalent textbook formulas use the sample standard deviation with the group-size factor adjusted from n² to n(n − 1); either convention yields the same coefficient.
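To see the equivalence, here's a quick check of the sample-standard-deviation variant against SciPy, using the same exam data:

```python
import numpy as np
from scipy import stats

passed_exam = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
study_hours = np.array([8, 7, 9, 3, 2, 6, 4, 8, 3, 2, 7, 9, 4, 3, 6])

g1 = study_hours[passed_exam == 1]
g0 = study_hours[passed_exam == 0]
n, n1, n0 = len(study_hours), len(g1), len(g0)

# Sample-std variant: s uses ddof=1, and n² becomes n(n - 1)
s_sample = np.std(study_hours, ddof=1)
r_sample = (g1.mean() - g0.mean()) / s_sample * np.sqrt(n1 * n0 / (n * (n - 1)))

r_scipy, _ = stats.pointbiserialr(passed_exam, study_hours)
print(np.isclose(r_sample, r_scipy))  # True
```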

Real-World Example with Pandas

Let’s work through a complete analysis workflow using a realistic dataset. We’ll examine whether customer churn (binary) correlates with various continuous metrics.

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Create a realistic customer dataset
np.random.seed(42)
n_customers = 200

# Churned customers tend to have lower satisfaction and fewer purchases
churned = np.random.binomial(1, 0.3, n_customers)

# Generate correlated continuous variables
satisfaction_score = np.where(
    churned == 1,
    np.random.normal(5.5, 1.5, n_customers),
    np.random.normal(7.5, 1.2, n_customers)
).clip(1, 10)

monthly_purchases = np.where(
    churned == 1,
    np.random.normal(3, 2, n_customers),
    np.random.normal(8, 3, n_customers)
).clip(0, 20)

# Account age has weaker relationship
account_age_months = np.random.normal(24, 12, n_customers).clip(1, 60)

# Create DataFrame
df = pd.DataFrame({
    'churned': churned,
    'satisfaction_score': satisfaction_score,
    'monthly_purchases': monthly_purchases,
    'account_age_months': account_age_months
})

print(df.head(10))
print(f"\nChurn rate: {df['churned'].mean():.1%}")

Now let’s calculate point-biserial correlations for all continuous variables:

def analyze_correlations(df, binary_col, continuous_cols):
    """
    Calculate point-biserial correlations between a binary variable
    and multiple continuous variables.
    """
    results = []
    
    for col in continuous_cols:
        corr, p_val = stats.pointbiserialr(df[binary_col], df[col])
        results.append({
            'variable': col,
            'correlation': corr,
            'p_value': p_val,
            'significant': p_val < 0.05
        })
    
    return pd.DataFrame(results).sort_values('correlation', key=abs, ascending=False)

# Analyze all continuous variables
continuous_vars = ['satisfaction_score', 'monthly_purchases', 'account_age_months']
correlation_results = analyze_correlations(df, 'churned', continuous_vars)

print("\nPoint-Biserial Correlation Analysis")
print("=" * 55)
print(correlation_results.to_string(index=False))

Finally, let’s visualize these relationships:

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for idx, var in enumerate(continuous_vars):
    ax = axes[idx]
    
    # Box plot showing distribution by churn status
    df.boxplot(column=var, by='churned', ax=ax)
    
    # Get correlation for title
    corr, p_val = stats.pointbiserialr(df['churned'], df[var])
    ax.set_title(f'{var}\nr = {corr:.3f}, p = {p_val:.4f}')
    ax.set_xlabel('Churned (0=No, 1=Yes)')
    
plt.suptitle('Point-Biserial Correlation Analysis: Churn Predictors', y=1.02)
plt.tight_layout()
plt.savefig('point_biserial_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

Interpreting Results and Statistical Significance

The p-value tests the null hypothesis that the true population correlation is zero. A small p-value (typically < 0.05) suggests the observed correlation is unlikely to have occurred by chance.

However, statistical significance doesn’t equal practical significance. With large samples, even tiny correlations become “significant.” Always consider effect size alongside p-values.
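A quick simulation illustrates the point: with a large sample, a negligible correlation can still be highly "significant" (the effect size of 0.05 here is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

# Binary variable with only a tiny real effect on the continuous one
binary = rng.integers(0, 2, n)
continuous = rng.normal(0, 1, n) + 0.05 * binary

r, p = stats.pointbiserialr(binary, continuous)
print(f"r = {r:.4f}, p = {p:.2e}")
# r is negligible by Cohen's conventions, yet p is far below 0.05
```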

Common pitfalls:

  1. Artificial dichotomization: If you split a continuous variable at the median to create a binary variable, use biserial correlation instead. Point-biserial assumes a natural dichotomy.

  2. Unequal group sizes: Extreme imbalance (e.g., 95%/5% split) reduces statistical power and can distort results.

  3. Outliers: The correlation is sensitive to outliers in the continuous variable. Always visualize your data first.

  4. Causation confusion: Correlation doesn’t imply causation. A strong point-biserial correlation between treatment and outcome doesn’t prove the treatment caused the outcome.

# Check for potential issues
def diagnostic_check(df, binary_col, continuous_col):
    """Run basic diagnostics before interpreting correlation."""
    
    group_0 = df[df[binary_col] == 0][continuous_col]
    group_1 = df[df[binary_col] == 1][continuous_col]
    
    print(f"Group sizes: n0={len(group_0)}, n1={len(group_1)}")
    print(f"Balance ratio: {min(len(group_0), len(group_1)) / max(len(group_0), len(group_1)):.2f}")
    
    # Levene's test for homogeneity of variance
    stat, p = stats.levene(group_0, group_1)
    print(f"Levene's test p-value: {p:.4f} {'(variances may differ)' if p < 0.05 else '(OK)'}")

diagnostic_check(df, 'churned', 'satisfaction_score')

Conclusion

Point-biserial correlation provides a straightforward way to quantify relationships between binary and continuous variables. For most applications, scipy.stats.pointbiserialr() handles everything you need—it’s fast, accurate, and includes p-value calculation.

When to use alternatives:

  • Logistic regression: When you need to control for confounding variables or want to predict the binary outcome.
  • Biserial correlation: When your binary variable is an artificially dichotomized continuous variable.
  • Rank-biserial correlation: When your continuous variable violates normality assumptions.
  • Independent samples t-test: When you’re more interested in comparing group means than measuring association strength.
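SciPy has no dedicated rank-biserial function, but one common formulation derives it from the Mann-Whitney U statistic. A sketch (note that sign conventions vary across sources):

```python
import numpy as np
from scipy import stats

binary = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
values = np.array([8, 7, 9, 3, 2, 6, 4, 8, 3, 2, 7, 9, 4, 3, 6])

g1 = values[binary == 1]
g0 = values[binary == 0]

# Rank-biserial via Mann-Whitney U: r_rb = 1 - 2U / (n1 * n0),
# with U counting pairs where a group-0 value exceeds a group-1 value
# (positive r_rb then means group 1 tends to have larger values)
u_stat, _ = stats.mannwhitneyu(g0, g1, alternative="two-sided")
r_rb = 1 - 2 * u_stat / (len(g1) * len(g0))
print(f"Rank-biserial correlation: {r_rb:.4f}")
```

Because every group-1 value in this toy data exceeds every group-0 value, the result is exactly 1.0.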

The point-biserial correlation shines in exploratory analysis and feature selection. Use it to quickly identify which continuous variables relate to your binary outcome, then follow up with more sophisticated modeling as needed.
