How to Perform the Shapiro-Wilk Test in Python
Key Insights
- The Shapiro-Wilk test is most reliable for sample sizes under 50; for larger datasets, it becomes overly sensitive and will reject normality for trivial deviations that don’t matter practically.
- Never rely on a single normality test—combine the Shapiro-Wilk p-value with visual inspection (histograms and Q-Q plots) to make informed decisions about your data’s distribution.
- When data fails normality tests, you have two paths: transform the data (log, Box-Cox) and re-test, or switch to non-parametric statistical methods that don’t assume normality.
Why Normality Testing Matters
Many statistical methods assume your data follows a normal distribution. T-tests, ANOVA, linear regression, and Pearson correlation all make this assumption. Violating it can lead to incorrect conclusions—inflated Type I errors, unreliable confidence intervals, and misleading p-values.
The Shapiro-Wilk test is the go-to method for checking normality in small to medium-sized samples. It outperforms alternatives like Kolmogorov-Smirnov in statistical power for detecting non-normality. However, it’s not a silver bullet. Understanding when and how to use it properly separates competent data analysis from cargo-cult statistics.
Understanding the Shapiro-Wilk Test
The test calculates a W statistic that measures how well your data’s order statistics match what you’d expect from a normal distribution. The W value ranges from 0 to 1, where 1 indicates perfect normality.
Here’s what you need to know about interpretation:
- Null hypothesis: The data comes from a normally distributed population
- P-value > 0.05: Fail to reject the null—data is consistent with normality
- P-value ≤ 0.05: Reject the null—data likely isn’t normally distributed
The test has important limitations. It works best for samples between 3 and 50 observations. With larger samples (n > 50), the test becomes hypersensitive. Even minor, practically irrelevant deviations from normality will produce significant p-values. This is a feature of all hypothesis tests with large samples, not a bug specific to Shapiro-Wilk.
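To see this hypersensitivity in action, here is a small sketch (the t-distribution and the sample sizes are illustrative choices, not from a specific dataset) comparing the same mildly heavy-tailed distribution at small and large n:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A t-distribution with 5 degrees of freedom: close to normal, but heavier-tailed
small_sample = rng.standard_t(df=5, size=30)
large_sample = rng.standard_t(df=5, size=4000)

w_small, p_small = stats.shapiro(small_sample)
w_large, p_large = stats.shapiro(large_sample)

# With n=30 the test frequently fails to flag this deviation;
# with n=4000 it almost certainly rejects normality
print(f"n=30:   W={w_small:.4f}, p={p_small:.4f}")
print(f"n=4000: W={w_large:.4f}, p={p_large:.6f}")
```

The distribution didn't change between the two calls; only the sample size did. That is the large-sample sensitivity in a nutshell.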
Basic Implementation with SciPy
Let’s start with the fundamentals. SciPy’s shapiro() function makes this straightforward:
```python
import numpy as np
from scipy import stats

# Generate sample data from a normal distribution
np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=30)

# Perform Shapiro-Wilk test
statistic, p_value = stats.shapiro(normal_data)

print(f"W statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
```

Output:

```
W statistic: 0.9732
P-value: 0.6304
```
A W statistic close to 1 and a high p-value indicate the data is consistent with normality. Now let’s add decision logic:
```python
def test_normality(data, alpha=0.05):
    """
    Perform Shapiro-Wilk test with interpretation.

    Parameters:
        data: array-like, sample data
        alpha: significance level (default 0.05)

    Returns:
        dict with test results and interpretation
    """
    statistic, p_value = stats.shapiro(data)
    is_normal = p_value > alpha

    result = {
        'w_statistic': statistic,
        'p_value': p_value,
        'alpha': alpha,
        'is_normal': is_normal,
        'interpretation': (
            f"Data appears normally distributed (p={p_value:.4f} > {alpha})"
            if is_normal else
            f"Data does NOT appear normally distributed (p={p_value:.4f} ≤ {alpha})"
        )
    }
    return result

# Test with normal data
result = test_normality(normal_data)
print(result['interpretation'])

# Test with non-normal data (exponential distribution)
skewed_data = np.random.exponential(scale=10, size=30)
result_skewed = test_normality(skewed_data)
print(result_skewed['interpretation'])
```
Visualizing Normality Alongside the Test
Statistical tests give you a number. Visual inspection gives you understanding. Always use both.
```python
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

def normality_diagnostic_plot(data, title="Normality Diagnostic"):
    """
    Create histogram and Q-Q plot with Shapiro-Wilk test results.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Perform Shapiro-Wilk test
    w_stat, p_value = stats.shapiro(data)

    # Histogram with normal curve overlay
    ax1 = axes[0]
    ax1.hist(data, bins='auto', density=True, alpha=0.7, edgecolor='black')

    # Overlay theoretical normal distribution
    mu, std = np.mean(data), np.std(data)
    x = np.linspace(min(data), max(data), 100)
    ax1.plot(x, stats.norm.pdf(x, mu, std), 'r-', linewidth=2, label='Normal fit')
    ax1.set_xlabel('Value')
    ax1.set_ylabel('Density')
    ax1.set_title('Histogram with Normal Curve')
    ax1.legend()

    # Q-Q plot
    ax2 = axes[1]
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title('Q-Q Plot')

    # Add test results as annotation
    result_text = f"Shapiro-Wilk Test\nW = {w_stat:.4f}\np = {p_value:.4f}"
    normality_status = "Normal" if p_value > 0.05 else "Non-normal"
    result_text += f"\nConclusion: {normality_status}"
    fig.text(0.02, 0.98, result_text, transform=fig.transFigure,
             fontsize=10, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.subplots_adjust(top=0.85)
    return fig, (w_stat, p_value)

# Example usage
np.random.seed(42)
sample_data = np.random.normal(100, 15, 40)
fig, results = normality_diagnostic_plot(sample_data, "Sample Data Normality Check")
plt.show()
```
The Q-Q plot is particularly valuable. Points falling along the diagonal line indicate normality. Systematic deviations reveal specific problems: an S-shaped pattern suggests heavy or light tails, while points bending away from the line at one end indicate skewness.
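Under the hood, a Q-Q plot simply pairs your sorted sample values with theoretical normal quantiles. As a quick sketch (the variable names are my own), `stats.probplot` called without `plot=` also returns the fitted line and its correlation coefficient, which works as a numeric summary of how straight the points lie:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
data = np.random.normal(loc=10, scale=2, size=100)

# Without plot=..., probplot returns the quantile pairs and the fitted line
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist="norm")

# r close to 1 means the points hug the diagonal, consistent with normality
print(f"Fitted line: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"Correlation of Q-Q points: r={r:.4f}")
```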
Practical Examples with Real Data
Real-world analysis usually involves DataFrames, not standalone arrays. Here’s how to test a single column:
```python
import numpy as np
import pandas as pd
from scipy import stats

# Create sample DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'revenue': np.random.normal(50000, 10000, 45),
    'response_time': np.random.exponential(2, 45),
    'satisfaction_score': np.random.uniform(1, 5, 45),
    'age': np.random.normal(35, 8, 45)
})

# Test single column
column = 'revenue'
stat, p = stats.shapiro(df[column].dropna())
print(f"{column}: W={stat:.4f}, p={p:.4f}")
```
For batch testing multiple columns, create a summary table:
```python
def batch_normality_test(df, columns=None, alpha=0.05):
    """
    Test normality for multiple DataFrame columns.

    Parameters:
        df: pandas DataFrame
        columns: list of column names (default: all numeric columns)
        alpha: significance level

    Returns:
        DataFrame with test results
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()

    results = []
    for col in columns:
        data = df[col].dropna()
        if len(data) < 3:
            results.append({
                'column': col,
                'n': len(data),
                'w_statistic': np.nan,
                'p_value': np.nan,
                'is_normal': None,
                'note': 'Insufficient data (n < 3)'
            })
            continue

        if len(data) > 5000:
            # Shapiro-Wilk p-values become unreliable above ~5000 samples
            data = data.sample(5000, random_state=42)
            note = 'Sampled to 5000'
        else:
            note = ''

        w_stat, p_value = stats.shapiro(data)
        results.append({
            'column': col,
            'n': len(df[col].dropna()),
            'w_statistic': round(w_stat, 4),
            'p_value': round(p_value, 4),
            'is_normal': p_value > alpha,
            'note': note
        })
    return pd.DataFrame(results)

# Run batch test
results_df = batch_normality_test(df)
print(results_df.to_string(index=False))
```
Output:

```
            column   n  w_statistic  p_value  is_normal  note
           revenue  45       0.9847   0.8142       True
     response_time  45       0.8234   0.0001      False
satisfaction_score  45       0.9512   0.0571       True
               age  45       0.9789   0.5891       True
```
Handling Non-Normal Data
When data fails the normality test, you have options. Transformations can often normalize skewed distributions:
```python
from scipy.stats import boxcox

def transform_and_test(data, transform_type='log'):
    """
    Apply transformation and re-test for normality.
    """
    original_stat, original_p = stats.shapiro(data)

    if transform_type == 'log':
        # Shift data if it contains zeros or negative values
        min_val = data.min()
        if min_val <= 0:
            transformed = np.log(data - min_val + 1)
        else:
            transformed = np.log(data)
    elif transform_type == 'sqrt':
        transformed = np.sqrt(data - data.min() + 1)
    elif transform_type == 'boxcox':
        # Box-Cox requires strictly positive data
        shifted = data - data.min() + 1
        transformed, lambda_param = boxcox(shifted)
        print(f"Box-Cox lambda: {lambda_param:.4f}")
    else:
        raise ValueError(f"Unknown transform: {transform_type}")

    trans_stat, trans_p = stats.shapiro(transformed)
    print(f"Original: W={original_stat:.4f}, p={original_p:.4f}")
    print(f"Transformed: W={trans_stat:.4f}, p={trans_p:.4f}")

    improvement = trans_p > original_p
    print(f"Improvement: {'Yes' if improvement else 'No'}")
    return transformed, (trans_stat, trans_p)

# Test with skewed data
skewed = np.random.exponential(scale=10, size=40)

print("Log transformation:")
log_transformed, _ = transform_and_test(skewed, 'log')

print("\nBox-Cox transformation:")
boxcox_transformed, _ = transform_and_test(skewed, 'boxcox')
```
If transformations don’t work, switch to non-parametric methods:
| Parametric Test | Non-Parametric Alternative |
|---|---|
| T-test | Mann-Whitney U |
| Paired t-test | Wilcoxon signed-rank |
| ANOVA | Kruskal-Wallis |
| Pearson correlation | Spearman correlation |
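As a sketch of the non-parametric route (the sample data here is illustrative), this is a Mann-Whitney U test comparing two skewed groups where a t-test's normality assumption would be dubious:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two right-skewed groups with genuinely different typical values
group_a = rng.exponential(scale=1.0, size=100)
group_b = rng.exponential(scale=3.0, size=100)

# Mann-Whitney U compares ranks, so it needs no normality assumption
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U={u_stat:.1f}, p={p_value:.6f}")
```

The switch costs some statistical power when the data really is normal, but buys validity when it isn't.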
Best Practices and Common Pitfalls
Sample size matters enormously. With n < 20, the test lacks power to detect non-normality. With n > 50, it detects trivial deviations. The sweet spot is 20-50 observations.
Combine multiple tests for robustness:
```python
def comprehensive_normality_check(data, alpha=0.05):
    """
    Run multiple normality tests for robust assessment.
    """
    results = {}

    # Shapiro-Wilk
    stat, p = stats.shapiro(data)
    results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # D'Agostino-Pearson (requires n >= 20)
    if len(data) >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Compare with critical value at 5% significance (index 2)
    is_normal = result.statistic < result.critical_values[2]
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5pct': result.critical_values[2],
        'normal': is_normal
    }

    # Consensus
    normal_votes = sum(1 for r in results.values() if r.get('normal', False))
    results['consensus'] = f"{normal_votes}/{len(results)} tests suggest normality"
    return results

# Example
data = np.random.normal(0, 1, 35)
check = comprehensive_normality_check(data)
for test, result in check.items():
    print(f"{test}: {result}")
```
Know when normality doesn’t matter. Many methods are robust to moderate non-normality, especially with larger samples. The Central Limit Theorem means that sampling distributions of means approach normality regardless of the underlying distribution. Don’t abandon a well-suited parametric test just because Shapiro-Wilk returned p = 0.04.
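A quick simulation illustrates the point (the exponential distribution and the sample sizes are my own illustrative choices): individual observations can be strongly skewed while the sampling distribution of their means is far closer to symmetric:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Raw exponential data is strongly right-skewed (theoretical skewness = 2)
raw = rng.exponential(scale=10, size=50_000)

# Means of samples of 50 observations each are much closer to symmetric
sample_means = rng.exponential(scale=10, size=(1000, 50)).mean(axis=1)

print(f"Skewness of raw data:     {stats.skew(raw):.3f}")
print(f"Skewness of sample means: {stats.skew(sample_means):.3f}")
```

This is why a t-test on means can behave well even when each individual observation is clearly non-normal.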
Use the Shapiro-Wilk test as one tool in your diagnostic toolkit, not as an absolute gatekeeper. Combine it with visual inspection, domain knowledge, and practical judgment about whether deviations from normality actually matter for your specific analysis.