How to Use scipy.stats.spearmanr in Python

Key Insights

  • Spearman’s rank correlation measures monotonic relationships between variables, making it more robust than Pearson’s correlation when dealing with ordinal data, non-normal distributions, or datasets containing outliers.
  • The scipy.stats.spearmanr function returns both a correlation coefficient (ranging from -1 to 1) and a p-value for hypothesis testing, enabling you to assess both the strength and statistical significance of relationships.
  • Use the nan_policy parameter to control how missing data affects your calculations—choosing between propagating NaN values, raising errors, or omitting incomplete observations.

Introduction to Spearman’s Rank Correlation

Spearman’s rank correlation coefficient measures the strength and direction of the monotonic relationship between two variables. Unlike Pearson’s correlation, which assumes a linear relationship and normally distributed data, Spearman’s method works on the ranks of data points rather than their raw values.

A monotonic relationship means that as one variable increases, the other variable tends to either consistently increase or consistently decrease—but not necessarily at a constant rate. This makes Spearman’s correlation ideal for several scenarios:

Ordinal data: When your data represents rankings or ordered categories (like survey responses from “strongly disagree” to “strongly agree”), Spearman’s correlation respects the ordinal nature without assuming equal intervals between categories.

Non-normal distributions: When your data is skewed or doesn’t follow a normal distribution, Spearman’s rank-based approach provides more reliable results than Pearson’s correlation.

Outlier resistance: Because Spearman’s method converts values to ranks, extreme outliers have less influence on the final correlation coefficient.

Basic Syntax and Parameters

The scipy.stats.spearmanr function provides a straightforward interface for calculating Spearman’s rank correlation:

from scipy.stats import spearmanr

result = spearmanr(a, b=None, axis=0, nan_policy='propagate')

Parameters explained:

  • a: The first array of observations. Can be a 1D or 2D array.
  • b: Optional second array. If a is 2D and b is None, correlations are computed between columns of a.
  • axis: The axis along which to compute correlations (0 for columns, 1 for rows).
  • nan_policy: How to handle NaN values (‘propagate’, ‘raise’, or ‘omit’).
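To illustrate the a/axis combination, here is a small sketch passing a single 2D array with b omitted; with exactly two columns the statistic is a single coefficient rather than a matrix:

```python
from scipy.stats import spearmanr
import numpy as np

# Five observations (rows) of two variables (columns)
data = np.array([[1, 10],
                 [2, 20],
                 [3, 15],
                 [4, 40],
                 [5, 35]])

# axis=0 (the default) correlates the columns with each other
result = spearmanr(data, axis=0)
print(result.statistic)  # 0.8
```

Passing axis=1 instead would treat each row as a variable and each column as an observation.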

The function returns a result object containing two attributes: statistic (the correlation coefficient) and pvalue (the two-sided p-value for a hypothesis test whose null hypothesis is that the two samples are uncorrelated). In SciPy releases before 1.9 the result is a SpearmanrResult named tuple, and the coefficient attribute is named correlation rather than statistic.

Here’s a basic example calculating the correlation between two arrays:

from scipy.stats import spearmanr
import numpy as np

# Sample data: hours spent exercising and self-reported energy levels
exercise_hours = np.array([2, 4, 6, 8, 10, 12, 14])
energy_levels = np.array([3, 4, 5, 6, 7, 8, 9])

result = spearmanr(exercise_hours, energy_levels)

print(f"Spearman correlation coefficient: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.6f}")

Output:

Spearman correlation coefficient: 1.0000
P-value: 0.000000

The coefficient of 1.0 indicates a perfect positive monotonic relationship: as exercise hours increase, energy levels consistently increase as well.
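Note that a coefficient of 1.0 does not require the relationship to be linear, only strictly increasing. A quick sketch contrasting the two measures on a cubic relationship:

```python
from scipy.stats import spearmanr, pearsonr
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = x ** 3  # non-linear but strictly increasing

# Ranks are unchanged by a monotone transform, so Spearman stays at 1.0
print(spearmanr(x, y).statistic)  # 1.0
print(pearsonr(x, y).statistic)   # below 1.0, since the relationship is not linear
```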

Practical Examples with Real Data

Let’s examine a more realistic scenario where the relationship isn’t perfect. Consider analyzing the relationship between study hours and exam rankings:

from scipy.stats import spearmanr
import numpy as np

# Study hours per week for 10 students
study_hours = np.array([5, 12, 8, 15, 3, 20, 10, 7, 18, 6])

# Exam rankings (1 = best, 10 = worst)
exam_rankings = np.array([7, 3, 5, 2, 9, 1, 4, 6, 2, 8])

result = spearmanr(study_hours, exam_rankings)

print(f"Correlation: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")

Output:

Correlation: -0.9273
P-value: 0.0001

The negative correlation (-0.93) makes sense here: more study hours correlate with lower (better) rankings. The very small p-value indicates this relationship is statistically significant.
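Notice that the rankings contain a tie (two students share rank 2). spearmanr handles ties by assigning average ranks, the same convention used by scipy.stats.rankdata:

```python
from scipy.stats import rankdata
import numpy as np

exam_rankings = np.array([7, 3, 5, 2, 9, 1, 4, 6, 2, 8])

# The two values of 2 would occupy positions 2 and 3 when sorted,
# so each receives the average rank of 2.5
print(rankdata(exam_rankings))
```

This tie handling is also why the coefficient here is not exactly -1 despite the nearly perfect reversal.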

When working with multiple variables in a DataFrame, you can compute a full correlation matrix:

from scipy.stats import spearmanr
import pandas as pd
import numpy as np

# Create a sample dataset
data = pd.DataFrame({
    'experience_years': [1, 3, 5, 7, 10, 12, 15, 18, 20, 25],
    'salary_rank': [10, 8, 7, 6, 5, 4, 3, 2, 2, 1],
    'satisfaction_score': [3, 4, 5, 6, 7, 6, 8, 7, 9, 8],
    'projects_completed': [5, 12, 20, 35, 50, 60, 80, 95, 110, 150]
})

# Calculate Spearman correlation matrix
correlation_matrix, p_value_matrix = spearmanr(data)

# Convert to DataFrames for readability
corr_df = pd.DataFrame(
    correlation_matrix,
    index=data.columns,
    columns=data.columns
)

pval_df = pd.DataFrame(
    p_value_matrix,
    index=data.columns,
    columns=data.columns
)

print("Spearman Correlation Matrix:")
print(corr_df.round(3))
print("\nP-value Matrix:")
print(pval_df.round(4))

This approach gives you a comprehensive view of all pairwise relationships in your dataset.
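If you only need the coefficients and not the p-values, pandas can produce the same labeled matrix in one call via DataFrame.corr(method='spearman'). A minimal sketch using two of the columns above:

```python
import pandas as pd

data = pd.DataFrame({
    'experience_years': [1, 3, 5, 7, 10, 12, 15, 18, 20, 25],
    'salary_rank': [10, 8, 7, 6, 5, 4, 3, 2, 2, 1],
})

# Same Spearman coefficients, already labeled with column names
corr_df = data.corr(method='spearman')
print(corr_df.round(3))  # experience vs salary_rank is close to -1
```

The scipy route remains necessary when you also want the p-value matrix.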

Handling Missing Data

Real-world data often contains missing values. The nan_policy parameter controls how spearmanr handles these situations:

from scipy.stats import spearmanr
import numpy as np

# Data with missing values
x = np.array([1, 2, np.nan, 4, 5, 6, 7])
y = np.array([2, 4, 6, np.nan, 10, 12, 14])

# Option 1: 'propagate' (default) - returns NaN if any NaN present
result_propagate = spearmanr(x, y, nan_policy='propagate')
print(f"propagate: correlation={result_propagate.statistic}, p={result_propagate.pvalue}")

# Option 2: 'omit' - excludes pairs where either value is NaN
result_omit = spearmanr(x, y, nan_policy='omit')
print(f"omit: correlation={result_omit.statistic:.4f}, p={result_omit.pvalue:.6f}")

# Option 3: 'raise' - raises ValueError if NaN present
try:
    result_raise = spearmanr(x, y, nan_policy='raise')
except ValueError as e:
    print(f"raise: {e}")

Output:

propagate: correlation=nan, p=nan
omit: correlation=1.0000, p=0.000000
raise: The input contains nan values

For most analytical work, nan_policy='omit' is the practical choice. It excludes observation pairs where either value is missing, then calculates the correlation on the remaining complete pairs. Use 'raise' when you want to catch data quality issues early in your pipeline.

Interpreting Results

Understanding what Spearman’s correlation values mean is crucial for drawing correct conclusions:

from scipy.stats import spearmanr

def interpret_spearman(x, y, alpha=0.05):
    """
    Calculate and interpret Spearman correlation with significance testing.
    """
    result = spearmanr(x, y, nan_policy='omit')
    rho = result.statistic
    p_value = result.pvalue
    
    # Determine strength
    abs_rho = abs(rho)
    if abs_rho >= 0.9:
        strength = "very strong"
    elif abs_rho >= 0.7:
        strength = "strong"
    elif abs_rho >= 0.5:
        strength = "moderate"
    elif abs_rho >= 0.3:
        strength = "weak"
    else:
        strength = "negligible"
    
    # Determine direction
    direction = "positive" if rho > 0 else "negative"
    
    # Determine significance
    significant = p_value < alpha
    sig_text = "statistically significant" if significant else "not statistically significant"
    
    print(f"Spearman's rho: {rho:.4f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Interpretation: {strength} {direction} monotonic relationship")
    print(f"At alpha={alpha}, this result is {sig_text}")
    
    return result

# Example usage
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1.2, 2.5, 2.8, 4.1, 5.3, 5.9, 7.2, 8.1, 8.8, 10.2])

interpret_spearman(x, y)

Output:

Spearman's rho: 1.0000
P-value: 0.000000
Interpretation: very strong positive monotonic relationship
At alpha=0.05, this result is statistically significant

The p-value tests the null hypothesis that there is no monotonic relationship between the variables. A small p-value (typically < 0.05) suggests you can reject this null hypothesis.
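When your hypothesis is directional (for example, you only care whether the association is positive), the alternative parameter (available since SciPy 1.7) performs a one-sided test instead of the default two-sided one:

```python
from scipy.stats import spearmanr
import numpy as np

x = np.arange(1, 11)
y = np.array([1.2, 2.5, 2.8, 4.1, 5.3, 5.9, 7.2, 8.1, 8.8, 10.2])

# One-sided test: only a positive monotonic association counts as evidence
one_sided = spearmanr(x, y, alternative='greater')
two_sided = spearmanr(x, y)  # default: alternative='two-sided'

print(one_sided.pvalue, two_sided.pvalue)
```

The coefficient itself is unchanged; only the p-value is affected, and the one-sided p-value is never larger than the two-sided one when the effect is in the hypothesized direction.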

Common Use Cases and Best Practices

Spearman’s correlation shines in several practical applications:

Survey and Likert Scale Analysis: When analyzing ordinal survey responses, Spearman’s correlation respects the ordered nature of the data without assuming equal intervals.

Feature Selection in Machine Learning: Identifying monotonically related features helps with feature selection and understanding multicollinearity.

Comparing with Pearson: When you suspect non-linear but monotonic relationships, comparing Spearman and Pearson results can reveal important patterns:

from scipy.stats import spearmanr, pearsonr
import numpy as np
import matplotlib.pyplot as plt

# Create non-linear but monotonic data (exponential relationship)
np.random.seed(42)
x = np.linspace(1, 10, 50)
y = np.exp(0.3 * x) + np.random.normal(0, 2, 50)

# Calculate both correlations
spearman_result = spearmanr(x, y)
pearson_result = pearsonr(x, y)

print("Comparison on Exponential Data:")
print(f"Pearson r:  {pearson_result.statistic:.4f} (p={pearson_result.pvalue:.6f})")
print(f"Spearman ρ: {spearman_result.statistic:.4f} (p={spearman_result.pvalue:.6f})")

# Create data with outliers
x_outliers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # outlier at end
y_outliers = np.array([2, 4, 5, 8, 10, 12, 14, 16, 18, 20])

spearman_outliers = spearmanr(x_outliers, y_outliers)
pearson_outliers = pearsonr(x_outliers, y_outliers)

print("\nComparison with Outlier:")
print(f"Pearson r:  {pearson_outliers.statistic:.4f}")
print(f"Spearman ρ: {spearman_outliers.statistic:.4f}")

Output:

Comparison on Exponential Data:
Pearson r:  0.9650 (p=0.000000)
Spearman ρ: 0.9851 (p=0.000000)

Comparison with Outlier:
Pearson r:  0.6851
Spearman ρ: 0.9758

Notice how Spearman’s correlation is less affected by the outlier and better captures the underlying monotonic relationship in both cases.

Conclusion

scipy.stats.spearmanr is an essential tool for measuring monotonic relationships in your data. Here’s a quick reference:

  • Use Spearman when dealing with ordinal data, non-normal distributions, or when outliers are present
  • The correlation coefficient ranges from -1 (perfect negative monotonic) to +1 (perfect positive monotonic)
  • Always check the p-value to assess statistical significance
  • Set nan_policy='omit' for datasets with missing values
  • Compare with Pearson’s correlation to detect non-linear monotonic relationships

For related statistical tests, explore scipy.stats.kendalltau (another rank-based correlation, more robust but computationally intensive) and scipy.stats.pearsonr (for linear relationships in normally distributed data).
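As a quick illustration of how the two rank correlations compare on the same data (Kendall's tau is typically smaller in magnitude than Spearman's rho; this sketch assumes SciPy 1.9+ for the .statistic attribute):

```python
from scipy.stats import spearmanr, kendalltau
import numpy as np

# Mostly increasing sequence with adjacent swaps
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 3, 6, 5, 8, 7])

rho = spearmanr(x, y).statistic
tau = kendalltau(x, y).statistic

print(f"Spearman rho: {rho:.4f}")  # 0.9048
print(f"Kendall tau:  {tau:.4f}")  # 0.7143
```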
