Shapiro-Wilk Test in R: Step-by-Step Guide

Key Insights

  • The Shapiro-Wilk test is the most powerful normality test for small samples (n < 50), but becomes overly sensitive with larger datasets where trivial deviations trigger rejection.
  • Never rely on the p-value alone—always pair statistical tests with visual inspection using Q-Q plots and histograms to make informed decisions about your data’s distribution.
  • A significant Shapiro-Wilk result doesn’t automatically invalidate your analysis; many statistical methods are robust to mild normality violations, especially with larger samples.

Introduction to the Shapiro-Wilk Test

The Shapiro-Wilk test answers a fundamental question in statistics: does my data come from a normally distributed population? This matters because many statistical procedures—t-tests, ANOVA, linear regression—assume normality in either the data or the residuals.

The test works by calculating how well your sample data correlates with the expected values from a normal distribution. It produces a W statistic ranging from 0 to 1, where values closer to 1 indicate better agreement with normality.
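As a rough intuition (this is not the exact coefficient scheme R implements, which uses Royston's approximation), W behaves like the squared correlation between the ordered sample and the expected normal order statistics:

```r
# Rough intuition for W: squared correlation between the sorted sample
# and approximate expected normal quantiles. ppoints() is used here for
# illustration; the real test uses Royston's coefficients.
set.seed(42)
x <- rnorm(30)
theoretical <- qnorm(ppoints(length(x)))
w_approx <- cor(sort(x), theoretical)^2
w_approx   # close to, but not identical to, shapiro.test(x)$statistic
```

This is the same geometry you see in a Q-Q plot: the closer the points hug a straight line, the closer W sits to 1.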

Here’s what you need to understand about the hypothesis framework:

  • Null hypothesis (H₀): The data is normally distributed
  • Alternative hypothesis (H₁): The data is not normally distributed

This means a low p-value (typically < 0.05) leads you to reject normality. Many practitioners get this backwards, so burn it into your memory: you want a high p-value if you’re hoping your data is normal.

The Shapiro-Wilk test performs best with sample sizes between 3 and 50. Below 3, you simply don’t have enough data. Above 50, the test becomes problematic—not because it fails, but because it becomes too sensitive. With 5,000 observations, even meaningless deviations from perfect normality will produce significant p-values.
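Base R also enforces hard bounds on sample size: shapiro.test() accepts between 3 and 5,000 observations and throws an error outside that range rather than returning a result:

```r
# shapiro.test() hard-codes the range 3 <= n <= 5000
try(shapiro.test(c(4.2, 5.1)))    # error: sample size must be between 3 and 5000
try(shapiro.test(rnorm(6000)))    # same error at the other end
```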

Prerequisites and Setup

The Shapiro-Wilk test lives in base R’s stats package, which loads automatically. You don’t need to install anything for the core functionality.

# Base R has everything you need for the test itself
# For better visualizations, load ggplot2
library(ggplot2)

# Create sample data for our examples
set.seed(42)
normal_data <- rnorm(30, mean = 100, sd = 15)
skewed_data <- rexp(30, rate = 0.1)

# Load built-in dataset for realistic examples
data(mtcars)

I recommend keeping ggplot2 available for visualization, but the base R plotting functions work fine for quick checks during exploratory analysis.

Basic Implementation with shapiro.test()

The shapiro.test() function takes a numeric vector and returns a hypothesis test object. The syntax couldn’t be simpler:

# Test our normally distributed sample
result <- shapiro.test(normal_data)
print(result)
	Shapiro-Wilk normality test

data:  normal_data
W = 0.97528, p-value = 0.6902

Let’s break down this output:

  • W = 0.97528: Close to 1, suggesting good agreement with normality
  • p-value = 0.6902: Well above 0.05, so we fail to reject the null hypothesis

The interpretation: we have no statistical evidence that this data deviates from normality. This makes sense—we generated it from a normal distribution.

Now compare with skewed data:

# Test exponentially distributed (skewed) data
result_skewed <- shapiro.test(skewed_data)
print(result_skewed)
	Shapiro-Wilk normality test

data:  skewed_data
W = 0.87234, p-value = 0.001832

The W statistic dropped to 0.87, and the p-value is well below 0.05. We reject the null hypothesis—this data is not normally distributed.

You can extract specific values programmatically:

# Access individual components
result$statistic  # W value
result$p.value    # p-value
result$method     # Test name

Visual Verification of Results

Statistical tests give you a number. Plots give you understanding. Always use both.

The Q-Q (quantile-quantile) plot compares your data’s quantiles against theoretical normal quantiles. If your data is normal, points fall along a straight diagonal line.

# Base R Q-Q plot
par(mfrow = c(1, 2))

# Q-Q plot for normal data
qqnorm(normal_data, main = "Q-Q Plot: Normal Data")
qqline(normal_data, col = "red", lwd = 2)

# Q-Q plot for skewed data
qqnorm(skewed_data, main = "Q-Q Plot: Skewed Data")
qqline(skewed_data, col = "red", lwd = 2)

par(mfrow = c(1, 1))

For the normal data, points hug the line. For the skewed data, you’ll see a characteristic curve—points deviate systematically in the tails.

Histograms with density overlays provide another perspective:

# ggplot2 histogram with normal curve overlay
ggplot(data.frame(x = normal_data), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 10, fill = "steelblue", alpha = 0.7) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(normal_data), sd = sd(normal_data)),
                color = "red", linewidth = 1) +
  labs(title = "Histogram with Normal Density Overlay",
       x = "Value", y = "Density") +
  theme_minimal()

When the histogram roughly follows the bell curve, you have visual confirmation supporting your test results.

Practical Examples with Real Data

Testing Regression Residuals

Linear regression assumes normally distributed residuals. Here’s how to check:

# Fit a linear model
model <- lm(mpg ~ wt + hp, data = mtcars)

# Extract residuals (avoid naming the object "residuals",
# which would shadow the base R function of the same name)
model_resid <- resid(model)

# Test normality of residuals
shapiro.test(model_resid)
	Shapiro-Wilk normality test

data:  model_resid
W = 0.95814, p-value = 0.2292

With p = 0.23, we have no evidence against normality. The regression assumption holds.

Always visualize residuals too:

# Diagnostic plots
par(mfrow = c(1, 2))
qqnorm(model_resid, main = "Residuals Q-Q Plot")
qqline(model_resid, col = "red")
hist(model_resid, breaks = 10, main = "Residuals Distribution", 
     xlab = "Residuals", probability = TRUE)
curve(dnorm(x, mean = mean(model_resid), sd = sd(model_resid)), 
      add = TRUE, col = "red", lwd = 2)
par(mfrow = c(1, 1))

Grouped Data Testing

Often you need to test normality within groups. Use tapply() for base R or dplyr for a tidyverse approach:

# Base R approach with tapply
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, shapiro.test)

# Extract p-values for each group
sapply(by_cyl, function(x) x$p.value)
        4         6         8 
0.2605796 0.3252851 0.9322104 

All groups pass the normality check. Here’s the dplyr equivalent:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    shapiro_p = shapiro.test(mpg)$p.value,
    normal = ifelse(shapiro_p > 0.05, "Yes", "No")
  )

This approach scales better when you have many groups and want a clean summary table.
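One practical wrinkle with grouped testing: shapiro.test() errors out on any group with fewer than 3 observations. A small guard avoids crashing the whole pipeline (safe_shapiro_p is an illustrative helper of my own, not part of base R):

```r
# Illustrative guard: return NA instead of erroring on out-of-range groups
safe_shapiro_p <- function(x) {
  if (length(x) < 3 || length(x) > 5000) return(NA_real_)
  shapiro.test(x)$p.value
}

# Works even if some groups turn out to be very small
sapply(split(mtcars$mpg, mtcars$gear), safe_shapiro_p)
```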

Limitations and Alternatives

The Shapiro-Wilk test has a critical flaw: it’s too good at its job with large samples. Watch what happens as sample size increases:

# Demonstrate sensitivity to sample size
set.seed(123)

sizes <- c(30, 100, 500, 2000)
for (n in sizes) {
  # Slightly non-normal: 95% standard normal plus 5% wider-tailed contamination
  # (round() keeps the total length at exactly n)
  data <- c(rnorm(round(n * 0.95)), rnorm(round(n * 0.05), mean = 0, sd = 1.5))
  p <- shapiro.test(data)$p.value
  cat(sprintf("n = %4d: p-value = %.4f\n", n, p))
}
n =   30: p-value = 0.8234
n =  100: p-value = 0.3891
n =  500: p-value = 0.0312
n = 2000: p-value = 0.0001

The same trivial deviation gets flagged as “significant” only because we have more data. The test is behaving correctly—larger samples detect smaller deviations—but the verdict is practically uninformative.

For larger samples, consider alternatives from the nortest package:

# install.packages("nortest")
library(nortest)

large_sample <- rnorm(200)

# Anderson-Darling test
ad.test(large_sample)

# Lilliefors (Kolmogorov-Smirnov) test
lillie.test(large_sample)

The Anderson-Darling test is particularly good at detecting deviations in the tails, while Lilliefors is a corrected version of Kolmogorov-Smirnov suitable when parameters are estimated from the data.

Decision Framework and Best Practices

Here’s my practical workflow for assessing normality:

  1. Start visual: Create Q-Q plot and histogram before any test
  2. Run Shapiro-Wilk: For samples under 50, trust it heavily
  3. For larger samples: Weight visual inspection more heavily than p-values
  4. Consider context: How sensitive is your downstream analysis to normality violations?

When normality is violated, you have options:

# Log transformation for right-skewed data
transformed <- log(skewed_data)
shapiro.test(transformed)

# Square root transformation (milder)
sqrt_transformed <- sqrt(skewed_data)
shapiro.test(sqrt_transformed)

# Box-Cox transformation (optimal power)
library(MASS)
bc <- boxcox(lm(skewed_data ~ 1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
boxcox_transformed <- (skewed_data^lambda - 1) / lambda

If transformations don’t work or aren’t appropriate, switch to non-parametric alternatives: Mann-Whitney U instead of t-tests, Kruskal-Wallis instead of ANOVA, Spearman correlation instead of Pearson.
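All three non-parametric alternatives ship with base R's stats package:

```r
# Non-parametric counterparts in base R
wilcox.test(mpg ~ am, data = mtcars)                  # Mann-Whitney U (instead of a two-sample t-test)
kruskal.test(mpg ~ factor(cyl), data = mtcars)        # Kruskal-Wallis (instead of one-way ANOVA)
cor.test(mtcars$mpg, mtcars$wt, method = "spearman")  # Spearman (instead of Pearson)
```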

Remember that many parametric tests are robust to normality violations when sample sizes are adequate. The Central Limit Theorem means that with n > 30 per group, your sampling distribution of the mean approaches normality regardless of the underlying distribution. Don’t abandon a t-test just because Shapiro-Wilk gave p = 0.04 on a sample of 100.

The Shapiro-Wilk test is a tool, not a verdict. Use it alongside visual inspection and domain knowledge to make informed decisions about your analytical approach.
