How to Create a QQ Plot in R
Key Insights
- QQ plots compare your sample data’s distribution against a theoretical distribution by plotting quantiles against each other—points falling on the diagonal line indicate a good fit
- Base R’s qqnorm() and qqline() functions provide quick normality checks, while ggplot2’s geom_qq() and geom_qq_line() produce publication-ready visualizations to which confidence bands can be added
- Learning to read QQ plot patterns (S-curves, banana shapes, flaring tails) gives you immediate diagnostic power to identify skewness, heavy tails, and outliers before running parametric tests
Introduction to QQ Plots
Before running a t-test, fitting a linear regression, or applying ANOVA, you need to check the normality assumption behind the method (for regression and ANOVA, that assumption concerns the residuals). The QQ (quantile-quantile) plot is one of the most effective visual tools for this job.
Unlike histograms, which can mislead through the choice of bin width, or density plots, which can smooth away important details, QQ plots give you a direct comparison between your data’s distribution and a theoretical one. If your data is normally distributed, points fall close to a straight line. Deviations from that line show how your data differs from normal: whether it’s skewed, has heavy tails, or contains outliers.
This article walks you through creating and interpreting QQ plots in R, from basic implementations to production-ready visualizations.
Understanding the Theory Behind QQ Plots
A QQ plot works by sorting your sample data and plotting each observation against where it should fall if the data came from your theoretical distribution.
Here’s the process:
- Sort your n data points from smallest to largest
- Calculate theoretical quantiles: for each rank i, compute the expected value at the (i - 0.5)/n quantile of the theoretical distribution
- Plot sample values (y-axis) against theoretical quantiles (x-axis)
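If your sample comes from the theoretical distribution, points cluster around a straight line. When raw data is plotted against standard normal quantiles, the slope of that line approximates your data’s standard deviation and the intercept approximates the mean; the line sits at 45 degrees only when sample and theoretical quantiles share the same scale (for example, after standardizing the data).
The reference line drawn by qqline() passes, by default, through the first and third quartiles of the sample and of the theoretical distribution, giving you a robust fit that isn’t distorted by outliers.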
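The steps above can be reproduced by hand, which makes it clearer what qqnorm() and qqline() compute. The following is a minimal sketch that assumes the default settings of both functions; the variable names (x, p, theoretical, slope, intercept) are illustrative only. Note that R’s ppoints() uses (i - 0.5)/n only when n > 10 and a slightly different formula for smaller samples.

```r
# Manual QQ plot construction (sketch; names are illustrative)
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)
n <- length(x)

# Step 1: sort the sample
sample_q <- sort(x)

# Step 2: probability points and theoretical quantiles
p <- (seq_len(n) - 0.5) / n          # matches ppoints(n) when n > 10
theoretical <- qnorm(p)

# Step 3: plot sample quantiles (y) against theoretical quantiles (x)
plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "QQ Plot Built by Hand", pch = 19, col = "steelblue")

# Reference line through the first and third quartiles (qqline() default)
q_sample <- quantile(x, c(0.25, 0.75), names = FALSE)
q_theory <- qnorm(c(0.25, 0.75))
slope <- diff(q_sample) / diff(q_theory)
intercept <- q_sample[1] - slope * q_theory[1]
abline(intercept, slope, col = "red", lwd = 2)

# For normal data, slope is close to sd(x) and intercept close to mean(x)
round(c(slope = slope, sd = sd(x), intercept = intercept, mean = mean(x)), 2)
```

Because the line is built from quartiles rather than from the mean and standard deviation, a few extreme values barely move it.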
Creating Basic QQ Plots with Base R
R’s base graphics provide two essential functions: qqnorm() for comparing against the normal distribution and qqline() for adding the reference line.
# Generate sample data
set.seed(42)
normal_data <- rnorm(200, mean = 50, sd = 10)
# Create QQ plot
qqnorm(normal_data,
main = "QQ Plot: Normal Data",
xlab = "Theoretical Quantiles",
ylab = "Sample Quantiles",
pch = 19,
col = "steelblue")
# Add reference line
qqline(normal_data, col = "red", lwd = 2)
This produces a clean diagnostic plot in seconds. The pch = 19 gives solid points, and col arguments let you customize colors.
For quick exploratory analysis, base R is hard to beat. You can run qqnorm(your_data); qqline(your_data) as a one-liner during data exploration.
# Quick one-liner for interactive use
par(mfrow = c(1, 2)) # Two plots side by side
# Compare two variables
qqnorm(mtcars$mpg, main = "MPG"); qqline(mtcars$mpg)
qqnorm(mtcars$hp, main = "Horsepower"); qqline(mtcars$hp)
par(mfrow = c(1, 1)) # Reset layout
Enhanced QQ Plots with ggplot2
For reports and publications, ggplot2 provides superior customization. The geom_qq() and geom_qq_line() functions integrate seamlessly with the grammar of graphics.
library(ggplot2)
# Create sample data
set.seed(123)
df <- data.frame(values = rnorm(150, mean = 100, sd = 15))
# Basic ggplot2 QQ plot
ggplot(df, aes(sample = values)) +
geom_qq(color = "#2C3E50", alpha = 0.7, size = 2) +
geom_qq_line(color = "#E74C3C", linewidth = 1) +
labs(
title = "QQ Plot with ggplot2",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 14)
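Adding confidence bands helps you judge whether deviations are statistically meaningful or just sampling noise. ggplot2 does not draw them for you, so the following code computes approximate pointwise 95% bands around the line implied by the sample mean and standard deviation: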
library(ggplot2)
library(dplyr)
# Generate data and calculate confidence bands
set.seed(456)
n <- 100
sample_data <- rnorm(n, mean = 50, sd = 8)
# Create dataframe with theoretical quantiles and confidence intervals
qq_df <- data.frame(
sample = sort(sample_data),
theoretical = qnorm(ppoints(n))
) %>%
mutate(
# Approximate pointwise 95% confidence bands (normal approximation for order statistics)
se = (sd(sample_data) / dnorm(theoretical)) * sqrt(ppoints(n) * (1 - ppoints(n)) / n),
lower = theoretical * sd(sample_data) + mean(sample_data) - 1.96 * se,
upper = theoretical * sd(sample_data) + mean(sample_data) + 1.96 * se
)
# Plot with confidence ribbon
ggplot(qq_df, aes(x = theoretical)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
fill = "gray80", alpha = 0.5) +
geom_point(aes(y = sample), color = "#3498DB", size = 2) +
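# Reference line through the center of the band (sample mean and sd)
geom_abline(intercept = mean(sample_data), slope = sd(sample_data),
color = "#E74C3C", linewidth = 1) +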
labs(
title = "QQ Plot with 95% Confidence Bands",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_bw(base_size = 12)
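An alternative to formula-based bands is a simulation envelope: draw many normal samples with the same size, mean, and standard deviation as the data, sort each one, and take percentiles of every order statistic. This is a different technique from the one in the code above, sketched here in base R under the assumption that the fitted normal is the reference; the names sims and envelope and the choice of 1000 simulations are illustrative.

```r
# Simulation envelope for a normal QQ plot (sketch)
set.seed(456)
n <- 100
x <- rnorm(n, mean = 50, sd = 8)

# Simulate B sorted normal samples using the estimated mean and sd
B <- 1000
sims <- replicate(B, sort(rnorm(n, mean = mean(x), sd = sd(x))))  # n x B matrix

# Pointwise 2.5% and 97.5% percentiles for each order statistic
envelope <- apply(sims, 1, quantile, probs = c(0.025, 0.975))     # 2 x n matrix

z <- qnorm(ppoints(n))
plot(z, sort(x), pch = 19, col = "steelblue",
     ylim = range(envelope, x),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "QQ Plot with Simulation Envelope")
lines(z, envelope[1, ], lty = 2, col = "gray40")
lines(z, envelope[2, ], lty = 2, col = "gray40")
abline(mean(x), sd(x), col = "red", lwd = 2)
```

Like the bands above, this envelope is pointwise, so a single point just outside it is weak evidence; a run of consecutive points outside is more telling.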
Comparing Against Different Distributions
The normal distribution isn’t always your target. Use qqplot() to compare your data against any theoretical distribution.
# Generate data from exponential distribution
set.seed(789)
exp_data <- rexp(200, rate = 0.5)
# Compare against exponential distribution
theoretical_exp <- qexp(ppoints(200), rate = 0.5)
qqplot(theoretical_exp, exp_data,
main = "QQ Plot: Exponential Distribution",
xlab = "Theoretical Exponential Quantiles",
ylab = "Sample Quantiles",
pch = 19, col = "darkgreen")
abline(0, 1, col = "red", lwd = 2)
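The same comparison can be sketched in ggplot2, since geom_qq() and geom_qq_line() accept a distribution function and a dparams list of its parameters. This example assumes the rate is known in advance; the object names exp_df and p_exp are made up for the illustration.

```r
library(ggplot2)

set.seed(789)
exp_df <- data.frame(values = rexp(200, rate = 0.5))

# Pass the quantile function and its parameters to both layers
p_exp <- ggplot(exp_df, aes(sample = values)) +
  geom_qq(distribution = stats::qexp, dparams = list(rate = 0.5),
          color = "darkgreen", alpha = 0.7) +
  geom_qq_line(distribution = stats::qexp, dparams = list(rate = 0.5),
               color = "red", linewidth = 1) +
  labs(
    title = "QQ Plot: Exponential Distribution (ggplot2)",
    x = "Theoretical Exponential Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 12)

print(p_exp)
```

Here the reference line runs through the quartiles, as geom_qq_line() does by default, rather than being fixed at y = x.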
For uniform distributions:
# Compare against uniform distribution
set.seed(101)
uniform_data <- runif(150, min = 0, max = 10)
theoretical_unif <- qunif(ppoints(150), min = 0, max = 10)
qqplot(theoretical_unif, uniform_data,
main = "QQ Plot: Uniform Distribution",
xlab = "Theoretical Uniform Quantiles",
ylab = "Sample Quantiles",
pch = 19, col = "purple")
abline(0, 1, col = "red", lwd = 2)
This flexibility makes QQ plots invaluable for model validation. Fitting a Weibull distribution to survival data? Generate theoretical Weibull quantiles and check the fit visually.
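As a rough sketch of that Weibull check, the code below estimates the parameters with MASS::fitdistr() and plots the data against quantiles of the fitted distribution. The survival times are simulated, and the true shape and scale (1.5 and 10) are arbitrary assumptions chosen for illustration; with real data only the fitting and plotting steps would apply.

```r
library(MASS)  # provides fitdistr(); shipped with standard R installations

set.seed(321)
# Hypothetical survival times, simulated for the example
surv_times <- rweibull(300, shape = 1.5, scale = 10)

# Maximum-likelihood fit; the optimizer can emit harmless NaN warnings
fit <- suppressWarnings(fitdistr(surv_times, densfun = "weibull"))
shape_hat <- fit$estimate[["shape"]]
scale_hat <- fit$estimate[["scale"]]

# Theoretical quantiles from the fitted Weibull distribution
theoretical_weib <- qweibull(ppoints(length(surv_times)),
                             shape = shape_hat, scale = scale_hat)

qqplot(theoretical_weib, surv_times,
       main = "QQ Plot: Fitted Weibull",
       xlab = "Theoretical Weibull Quantiles",
       ylab = "Sample Quantiles",
       pch = 19, col = "darkorange")
abline(0, 1, col = "red", lwd = 2)  # y = x, since both axes share one scale
```

Because the parameters are estimated from the same data, the plot looks somewhat more favorable than it would against independently known parameters.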
Interpreting Common QQ Plot Patterns
Knowing what patterns mean transforms QQ plots from pretty pictures into diagnostic tools.
# Generate datasets with known characteristics
set.seed(2024)
n <- 200
# Right-skewed data (log-normal)
right_skew <- rlnorm(n, meanlog = 0, sdlog = 0.5)
# Left-skewed data (negated log-normal; a shift would not change the plot's shape)
left_skew <- -rlnorm(n, meanlog = 0, sdlog = 0.5)
# Heavy-tailed data (t-distribution with low df)
heavy_tails <- rt(n, df = 3)
# Light-tailed data (uniform)
light_tails <- runif(n, -2, 2)
# Create comparison plots
par(mfrow = c(2, 2))
qqnorm(right_skew, main = "Right Skew (Log-normal)", pch = 19, col = "coral")
qqline(right_skew, col = "black", lwd = 2)
qqnorm(left_skew, main = "Left Skew", pch = 19, col = "coral")
qqline(left_skew, col = "black", lwd = 2)
qqnorm(heavy_tails, main = "Heavy Tails (t-dist, df=3)", pch = 19, col = "steelblue")
qqline(heavy_tails, col = "black", lwd = 2)
qqnorm(light_tails, main = "Light Tails (Uniform)", pch = 19, col = "steelblue")
qqline(light_tails, col = "black", lwd = 2)
par(mfrow = c(1, 1))
Here’s your pattern recognition guide:
| Pattern | Description | Indicates |
|---|---|---|
| Upward-bending curve (concave up, "banana") | Points above the line at both ends | Right skew |
| Downward-bending curve (concave down) | Points below the line at both ends | Left skew |
| S-curve flaring away from the line | Points below the line on the left, above on the right | Heavy tails |
| Inverted S-curve, pinched toward the center | Points above the line on the left, below on the right | Light tails |
| Individual points far from line | Isolated deviations | Outliers |
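One way to make these patterns concrete is to measure on which side of the reference line the tail points fall. The helper below, tail_deviation(), is a hypothetical function written for this illustration (it is not part of base R); it rebuilds the default qqline() and averages the residuals over the lowest and highest 10% of points. The sample size of 1000 and the 10% tail fraction are arbitrary choices.

```r
# Average distance of tail points from the qqline() reference line (sketch)
tail_deviation <- function(x, tail_frac = 0.10) {
  x <- sort(x)
  n <- length(x)
  z <- qnorm(ppoints(n))
  # Same quartile-based line that qqline() draws by default
  qs <- quantile(x, c(0.25, 0.75), names = FALSE)
  slope <- diff(qs) / diff(qnorm(c(0.25, 0.75)))
  intercept <- qs[1] - slope * qnorm(0.25)
  resid <- x - (intercept + slope * z)
  k <- max(1, floor(n * tail_frac))
  c(lower = mean(resid[seq_len(k)]),
    upper = mean(resid[(n - k + 1):n]))
}

set.seed(2024)
n <- 1000
samples <- list(
  right_skew  = rlnorm(n, meanlog = 0, sdlog = 0.5),
  left_skew   = -rlnorm(n, meanlog = 0, sdlog = 0.5),
  heavy_tails = rt(n, df = 3),
  light_tails = runif(n, -2, 2)
)

# Positive values: points above the line; negative: below the line
devs <- sapply(samples, tail_deviation)
round(devs, 2)
```

With sample quantiles on the y-axis, skewed data puts both tails on the same side of the line, while heavy or light tails put them on opposite sides.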
Practical Application: Real Dataset Example
Let’s work through a complete normality assessment workflow using R’s built-in iris dataset. We’ll check whether sepal length is normally distributed before deciding on parametric versus non-parametric tests.
# Load and examine data
data(iris)
sepal_length <- iris$Sepal.Length
# Step 1: Visual inspection with histogram
hist(sepal_length, breaks = 20,
main = "Distribution of Sepal Length",
xlab = "Sepal Length (cm)",
col = "lightblue", border = "white")
# Step 2: QQ plot for normality assessment
qqnorm(sepal_length,
main = "QQ Plot: Iris Sepal Length",
pch = 19, col = "darkblue")
qqline(sepal_length, col = "red", lwd = 2)
# Step 3: Formal statistical test (Shapiro-Wilk)
shapiro_result <- shapiro.test(sepal_length)
print(shapiro_result)
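The QQ plot reveals slight deviations at the tails, and the Shapiro-Wilk test returns a p-value of about 0.01, below the conventional 0.05 threshold. But here’s where judgment matters: with 150 observations, even minor deviations can produce significant p-values, and the pooled sample mixes three species with different means.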
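The sample-size effect can be illustrated with a small simulation. The sketch below assumes a mildly right-skewed log-normal distribution (sdlog = 0.2) as a stand-in for a "minor deviation" and counts how often Shapiro-Wilk rejects at the 5% level for different sample sizes; rejection_rate() is a hypothetical helper, and the sizes and repetition count are arbitrary.

```r
set.seed(2024)

# Share of Shapiro-Wilk rejections (p < 0.05) for mildly skewed data
rejection_rate <- function(n, reps = 200) {
  p_values <- replicate(
    reps,
    shapiro.test(rlnorm(n, meanlog = 0, sdlog = 0.2))$p.value
  )
  mean(p_values < 0.05)
}

sample_sizes <- c(30, 150, 1000)
rates <- sapply(sample_sizes, rejection_rate)

# The same underlying distribution is rejected more often as n grows
data.frame(n = sample_sizes, rejection_rate = rates)
```

The distribution never changes between runs; only the test’s ability to detect the same small departure does, which is why the QQ plot should carry weight alongside the p-value.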
Now let’s check normality within each species:
# QQ plots by species
library(ggplot2)
ggplot(iris, aes(sample = Sepal.Length)) +
geom_qq(color = "#2980B9", alpha = 0.7) +
geom_qq_line(color = "#C0392B", linewidth = 0.8) +
facet_wrap(~Species, scales = "free") +
labs(
title = "QQ Plots by Species",
subtitle = "Checking normality assumption for ANOVA",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 11)
# Shapiro-Wilk tests by species
by(iris$Sepal.Length, iris$Species, shapiro.test)
This grouped analysis reveals that normality holds reasonably well within each species, supporting the use of ANOVA for comparing means across groups.
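Since ANOVA’s normality assumption is formally about the residuals, a complementary check is a single QQ plot of the residuals from the fitted model. A minimal sketch using aov() on the same data follows; fit_aov and res are illustrative names.

```r
# Fit the one-way ANOVA and extract its residuals
data(iris)
fit_aov <- aov(Sepal.Length ~ Species, data = iris)
res <- residuals(fit_aov)

# QQ plot of the residuals
qqnorm(res,
       main = "QQ Plot: ANOVA Residuals",
       pch = 19, col = "darkblue")
qqline(res, col = "red", lwd = 2)

# Formal test on the residuals
shapiro.test(res)
```

This pools all 150 residuals into one plot, which gives more points per panel than the per-species view while still removing the differences in group means.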
The decision framework:
- If QQ plot shows minor deviations and sample size > 30, parametric tests are usually robust
- If QQ plot shows systematic patterns (clear S-curves, heavy tails), consider transformations or non-parametric alternatives
- Always combine visual assessment with formal tests—neither alone tells the complete story
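To illustrate the transformation route in the second point, the sketch below applies a log transform to right-skewed data and compares the QQ plots and Shapiro-Wilk p-values before and after. The data are simulated from a log-normal distribution, an assumption chosen so that the log transform is known to work; real data may call for a different transformation or a non-parametric test.

```r
set.seed(99)
# Hypothetical right-skewed measurements
skewed <- rlnorm(200, meanlog = 2, sdlog = 0.8)
logged <- log(skewed)

# Side-by-side QQ plots, restoring the previous layout afterwards
op <- par(mfrow = c(1, 2))
qqnorm(skewed, main = "Original", pch = 19, col = "coral")
qqline(skewed, col = "black", lwd = 2)
qqnorm(logged, main = "Log-transformed", pch = 19, col = "steelblue")
qqline(logged, col = "black", lwd = 2)
par(op)

# Compare formal test results before and after the transformation
p_before <- shapiro.test(skewed)$p.value
p_after <- shapiro.test(logged)$p.value
c(before = p_before, after = p_after)
```

If a transformation straightens the QQ plot, run the parametric test on the transformed scale and interpret the results on that scale.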
QQ plots aren’t just diagnostic checkboxes. They’re windows into your data’s structure. Master reading them, and you’ll catch problems that summary statistics miss entirely.