How to Perform the Shapiro-Wilk Test in Python

Key Insights

  • The Shapiro-Wilk test is most reliable for sample sizes under 50; for larger datasets, it becomes overly sensitive and will reject normality for trivial deviations that don’t matter practically.
  • Never rely on a single normality test—combine the Shapiro-Wilk p-value with visual inspection (histograms and Q-Q plots) to make informed decisions about your data’s distribution.
  • When data fails normality tests, you have two paths: transform the data (log, Box-Cox) and re-test, or switch to non-parametric statistical methods that don’t assume normality.

Why Normality Testing Matters

Many statistical methods assume your data follows a normal distribution. T-tests, ANOVA, linear regression, and Pearson correlation all make this assumption. Violating it can lead to incorrect conclusions—inflated Type I errors, unreliable confidence intervals, and misleading p-values.

The Shapiro-Wilk test is the go-to method for checking normality in small to medium-sized samples. It outperforms alternatives like Kolmogorov-Smirnov in statistical power for detecting non-normality. However, it’s not a silver bullet. Understanding when and how to use it properly separates competent data analysis from cargo-cult statistics.

Understanding the Shapiro-Wilk Test

The test calculates a W statistic that measures how well your data’s order statistics match what you’d expect from a normal distribution. The W value ranges from 0 to 1, where 1 indicates perfect normality.

Here’s what you need to know about interpretation:

  • Null hypothesis: The data comes from a normally distributed population
  • P-value > 0.05: Fail to reject the null—data is consistent with normality
  • P-value ≤ 0.05: Reject the null—data likely isn’t normally distributed

The test has important limitations. It works best for samples between 3 and 50 observations. With larger samples (n > 50), the test becomes hypersensitive. Even minor, practically irrelevant deviations from normality will produce significant p-values. This is a feature of all hypothesis tests with large samples, not a bug specific to Shapiro-Wilk.
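A quick simulation makes the large-sample sensitivity concrete. The sketch below draws from a t-distribution with 10 degrees of freedom, which has slightly heavier tails than a normal but is close enough that the difference rarely matters in practice; the seed, distribution, and sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Mildly non-normal data: t(10) has slightly heavier tails than a normal
mild = rng.standard_t(df=10, size=5000)

# The same underlying distribution passes at small n and fails at large n
for n in (30, 500, 5000):
    w, p = stats.shapiro(mild[:n])
    print(f"n = {n:4d}: W = {w:.4f}, p = {p:.4f}")
```

At n = 5000 the test will typically reject, even though a histogram of this data is hard to distinguish from a normal curve by eye.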

Basic Implementation with SciPy

Let’s start with the fundamentals. SciPy’s shapiro() function makes this straightforward:

import numpy as np
from scipy import stats

# Generate sample data from a normal distribution
np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=30)

# Perform Shapiro-Wilk test
statistic, p_value = stats.shapiro(normal_data)

print(f"W statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")

Output:

W statistic: 0.9732
P-value: 0.6304

A W statistic close to 1 and a high p-value indicate the data is consistent with normality. Now let’s add decision logic:

def test_normality(data, alpha=0.05):
    """
    Perform Shapiro-Wilk test with interpretation.
    
    Parameters:
        data: array-like, sample data
        alpha: significance level (default 0.05)
    
    Returns:
        dict with test results and interpretation
    """
    statistic, p_value = stats.shapiro(data)
    
    is_normal = p_value > alpha
    
    result = {
        'w_statistic': statistic,
        'p_value': p_value,
        'alpha': alpha,
        'is_normal': is_normal,
        'interpretation': (
            f"Data appears normally distributed (p={p_value:.4f} > {alpha})"
            if is_normal else
            f"Data does NOT appear normally distributed (p={p_value:.4f} ≤ {alpha})"
        )
    }
    
    return result

# Test with normal data
result = test_normality(normal_data)
print(result['interpretation'])

# Test with non-normal data (exponential distribution)
skewed_data = np.random.exponential(scale=10, size=30)
result_skewed = test_normality(skewed_data)
print(result_skewed['interpretation'])

Visualizing Normality Alongside the Test

Statistical tests give you a number. Visual inspection gives you understanding. Always use both.

import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

def normality_diagnostic_plot(data, title="Normality Diagnostic"):
    """
    Create histogram and Q-Q plot with Shapiro-Wilk test results.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Perform Shapiro-Wilk test
    w_stat, p_value = stats.shapiro(data)
    
    # Histogram with normal curve overlay
    ax1 = axes[0]
    ax1.hist(data, bins='auto', density=True, alpha=0.7, edgecolor='black')
    
    # Overlay theoretical normal distribution
    mu, std = np.mean(data), np.std(data)
    x = np.linspace(min(data), max(data), 100)
    ax1.plot(x, stats.norm.pdf(x, mu, std), 'r-', linewidth=2, label='Normal fit')
    
    ax1.set_xlabel('Value')
    ax1.set_ylabel('Density')
    ax1.set_title('Histogram with Normal Curve')
    ax1.legend()
    
    # Q-Q plot
    ax2 = axes[1]
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title('Q-Q Plot')
    
    # Add test results as annotation
    result_text = f"Shapiro-Wilk Test\nW = {w_stat:.4f}\np = {p_value:.4f}"
    normality_status = "Normal" if p_value > 0.05 else "Non-normal"
    result_text += f"\nConclusion: {normality_status}"
    
    fig.text(0.02, 0.98, result_text, transform=fig.transFigure,
             fontsize=10, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.subplots_adjust(top=0.85)
    
    return fig, (w_stat, p_value)

# Example usage
np.random.seed(42)
sample_data = np.random.normal(100, 15, 40)
fig, results = normality_diagnostic_plot(sample_data, "Sample Data Normality Check")
plt.show()

The Q-Q plot is particularly valuable. Points falling along the diagonal line indicate normality. Systematic deviations reveal specific problems: an S-shape with both tails pulling away from the line suggests heavy tails, while a consistent concave or convex arc indicates skewness.
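To see these patterns for yourself, feed deliberately non-normal samples into probplot. The distributions, seed, and sample sizes below are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)

# Heavy tails: Student's t with few degrees of freedom
heavy = rng.standard_t(df=3, size=40)
# Right skew: exponential
skewed = rng.exponential(scale=5, size=40)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(heavy, dist="norm", plot=axes[0])
axes[0].set_title("Heavy tails: S-shaped deviation")
stats.probplot(skewed, dist="norm", plot=axes[1])
axes[1].set_title("Right skew: curved arc")
plt.tight_layout()
plt.show()
```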

Practical Examples with Real Data

Real-world analysis usually involves DataFrames, not standalone arrays. Here’s how to test a single column:

import pandas as pd
from scipy import stats

# Create sample DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'revenue': np.random.normal(50000, 10000, 45),
    'response_time': np.random.exponential(2, 45),
    'satisfaction_score': np.random.uniform(1, 5, 45),
    'age': np.random.normal(35, 8, 45)
})

# Test single column
column = 'revenue'
stat, p = stats.shapiro(df[column].dropna())
print(f"{column}: W={stat:.4f}, p={p:.4f}")

For batch testing multiple columns, create a summary table:

def batch_normality_test(df, columns=None, alpha=0.05):
    """
    Test normality for multiple DataFrame columns.
    
    Parameters:
        df: pandas DataFrame
        columns: list of column names (default: all numeric columns)
        alpha: significance level
    
    Returns:
        DataFrame with test results
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()
    
    results = []
    
    for col in columns:
        data = df[col].dropna()
        
        if len(data) < 3:
            results.append({
                'column': col,
                'n': len(data),
                'w_statistic': np.nan,
                'p_value': np.nan,
                'is_normal': None,
                'note': 'Insufficient data (n < 3)'
            })
            continue
        
        if len(data) > 5000:
            # For n > 5000, SciPy warns that the shapiro p-value
            # may be inaccurate, so test a random subsample instead
            data = data.sample(5000, random_state=42)
            note = 'Sampled to 5000'
        else:
            note = ''
        
        w_stat, p_value = stats.shapiro(data)
        
        results.append({
            'column': col,
            'n': len(df[col].dropna()),
            'w_statistic': round(w_stat, 4),
            'p_value': round(p_value, 4),
            'is_normal': p_value > alpha,
            'note': note
        })
    
    return pd.DataFrame(results)

# Run batch test
results_df = batch_normality_test(df)
print(results_df.to_string(index=False))

Output:

            column   n  w_statistic  p_value  is_normal note
           revenue  45       0.9847   0.8142       True
     response_time  45       0.8234   0.0001      False
satisfaction_score  45       0.9512   0.0571       True
               age  45       0.9789   0.5891       True

Handling Non-Normal Data

When data fails the normality test, you have options. Transformations can often normalize skewed distributions:

from scipy.stats import boxcox

def transform_and_test(data, transform_type='log'):
    """
    Apply transformation and re-test for normality.
    """
    original_stat, original_p = stats.shapiro(data)
    
    if transform_type == 'log':
        # Add small constant if data contains zeros
        min_val = data.min()
        if min_val <= 0:
            transformed = np.log(data - min_val + 1)
        else:
            transformed = np.log(data)
    elif transform_type == 'sqrt':
        transformed = np.sqrt(data - data.min() + 1)
    elif transform_type == 'boxcox':
        # Box-Cox requires positive data
        shifted = data - data.min() + 1
        transformed, lambda_param = boxcox(shifted)
        print(f"Box-Cox lambda: {lambda_param:.4f}")
    else:
        raise ValueError(f"Unknown transform: {transform_type}")
    
    trans_stat, trans_p = stats.shapiro(transformed)
    
    print(f"Original:    W={original_stat:.4f}, p={original_p:.4f}")
    print(f"Transformed: W={trans_stat:.4f}, p={trans_p:.4f}")
    
    improvement = trans_p > original_p
    print(f"Improvement: {'Yes' if improvement else 'No'}")
    
    return transformed, (trans_stat, trans_p)

# Test with skewed data
skewed = np.random.exponential(scale=10, size=40)
print("Log transformation:")
log_transformed, _ = transform_and_test(skewed, 'log')

print("\nBox-Cox transformation:")
boxcox_transformed, _ = transform_and_test(skewed, 'boxcox')

If transformations don’t work, switch to non-parametric methods:

Parametric test        Non-parametric alternative
T-test                 Mann-Whitney U
Paired t-test          Wilcoxon signed-rank
ANOVA                  Kruskal-Wallis
Pearson correlation    Spearman correlation
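As a sketch of the first and last rows, here is how two skewed groups could be compared with Mann-Whitney U instead of a t-test, and paired with Spearman instead of Pearson. The data is simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two right-skewed groups where a t-test's normality assumption is doubtful
group_a = rng.exponential(scale=10, size=30)
group_b = rng.exponential(scale=14, size=30)

# Mann-Whitney U: non-parametric alternative to the independent t-test
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")

# Spearman correlation: non-parametric alternative to Pearson
rho, p_rho = stats.spearmanr(group_a, group_b)
print(f"Spearman rho = {rho:.4f}, p = {p_rho:.4f}")
```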

Best Practices and Common Pitfalls

Sample size matters enormously. With n < 20, the test lacks power to detect non-normality. With n > 50, it detects trivial deviations. The sweet spot is 20-50 observations.
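You can check the low-power end of this claim with a small simulation: draw clearly skewed (exponential) samples and count how often Shapiro-Wilk rejects. The seed and trial count here are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejection_rate(n, trials=500, alpha=0.05):
    """Fraction of exponential samples of size n that Shapiro-Wilk rejects."""
    rejections = 0
    for _ in range(trials):
        sample = rng.exponential(scale=1, size=n)
        _, p = stats.shapiro(sample)
        if p <= alpha:
            rejections += 1
    return rejections / trials

for n in (10, 30, 50):
    print(f"n = {n:2d}: rejection rate ≈ {rejection_rate(n):.2f}")
```

Even though every sample comes from a clearly non-normal distribution, small samples frequently pass the test.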

Combine multiple tests for robustness:

def comprehensive_normality_check(data, alpha=0.05):
    """
    Run multiple normality tests for robust assessment.
    """
    results = {}
    
    # Shapiro-Wilk
    stat, p = stats.shapiro(data)
    results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}
    
    # D'Agostino-Pearson (requires n >= 20)
    if len(data) >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}
    
    # Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Compare with critical value at 5% significance (index 2)
    is_normal = result.statistic < result.critical_values[2]
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5pct': result.critical_values[2],
        'normal': is_normal
    }
    
    # Consensus
    normal_votes = sum(1 for r in results.values() if r.get('normal', False))
    results['consensus'] = f"{normal_votes}/{len(results)} tests suggest normality"
    
    return results

# Example
data = np.random.normal(0, 1, 35)
check = comprehensive_normality_check(data)
for test, result in check.items():
    print(f"{test}: {result}")

Know when normality doesn’t matter. Many methods are robust to moderate non-normality, especially with larger samples. The Central Limit Theorem means that sampling distributions of means approach normality regardless of the underlying distribution. Don’t abandon a well-suited parametric test just because Shapiro-Wilk returned p = 0.04.
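A quick check of the CLT point above: individual exponential values are strongly skewed, but means of modest samples are far closer to symmetric. The scale and sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Individual exponential values are strongly skewed
raw = rng.exponential(scale=10, size=2000)

# Means of samples of size 30 are much closer to symmetric (CLT at work)
sample_means = rng.exponential(scale=10, size=(2000, 30)).mean(axis=1)

print(f"Skewness of raw values:   {stats.skew(raw):.2f}")
print(f"Skewness of sample means: {stats.skew(sample_means):.2f}")
```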

Use the Shapiro-Wilk test as one tool in your diagnostic toolkit, not as an absolute gatekeeper. Combine it with visual inspection, domain knowledge, and practical judgment about whether deviations from normality actually matter for your specific analysis.
