R ggplot2 - Box Plot with Examples
Key Insights
- Box plots in ggplot2 reveal distribution characteristics through quartiles, medians, and outliers, making them essential for comparing multiple groups and identifying data anomalies
- The geom_boxplot() function provides extensive customization options, including notched boxes for median confidence intervals, variable width based on sample size, and outlier styling
- Combining box plots with jittered points, violin plots, or statistical annotations creates more informative visualizations that show both summary statistics and underlying data patterns
Basic Box Plot Syntax
Box plots display the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In ggplot2, creating a box plot requires mapping a categorical variable to the x-axis and a continuous variable to the y-axis.
library(ggplot2)
# Using the built-in mtcars dataset
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Number of Cylinders", y = "Miles Per Gallon")
The box spans the interquartile range (IQR), which contains the middle 50% of the data, and the line inside the box marks the median. Each whisker extends to the most extreme observation that lies within 1.5 times the IQR of the box edge; points beyond the whiskers are plotted individually as outliers.
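To make the whisker rule concrete, the following base R sketch recomputes the quartiles and whisker ends for mpg within each cylinder group. The helper name box_summary is hypothetical, and the sketch assumes the default method of quantile(); other box plot implementations may compute the box edges slightly differently.

```r
# Hypothetical helper: quartiles and whisker ends for one numeric vector
box_summary <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  # Fences sit coef * IQR beyond the box edges
  lower_fence <- q[1] - coef * iqr
  upper_fence <- q[3] + coef * iqr
  # Whiskers end at the most extreme observations inside the fences
  inside <- x[x >= lower_fence & x <= upper_fence]
  c(
    whisker_low  = min(inside),
    q1           = q[1],
    median       = q[2],
    q3           = q[3],
    whisker_high = max(inside)
  )
}

# One column per cylinder group, one row per statistic
box_stats <- sapply(split(mtcars$mpg, mtcars$cyl), box_summary)
print(round(box_stats, 2))
```

Any observation below whisker_low or above whisker_high in its group is what the plot would draw as an outlier point.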
Customizing Box Plot Appearance
Control visual elements through parameters in geom_boxplot() and theme functions.
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(
    fill = "#4ECDC4",
    color = "#1A535C",
    alpha = 0.7,
    outlier.colour = "#FF6B6B",
    outlier.shape = 16,
    outlier.size = 3,
    notch = FALSE,
    width = 0.6
  ) +
  theme_minimal() +
  labs(x = "Cylinders", y = "MPG")
Setting notch = TRUE adds notches around the median that act as an approximate confidence interval. If the notches of two boxes do not overlap, this suggests that their medians differ significantly, at roughly the 95% confidence level.
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(notch = TRUE, fill = "#95E1D3") +
  theme_classic()
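The ggplot2 documentation describes the notches as extending 1.58 * IQR / sqrt(n) on either side of the median. The base R sketch below reproduces that arithmetic for each cylinder group; the helper name notch_interval is hypothetical, and the quartiles assume the default method of quantile().

```r
# Hypothetical helper: approximate notch limits for one numeric vector
notch_interval <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  # Half-width of the notch: 1.58 * IQR / sqrt(n)
  half_width <- 1.58 * (q[3] - q[1]) / sqrt(length(x))
  c(lower = q[2] - half_width, median = q[2], upper = q[2] + half_width)
}

# One column per cylinder group
notches <- sapply(split(mtcars$mpg, mtcars$cyl), notch_interval)
print(round(notches, 2))
```

Because the half-width shrinks with sqrt(n), small groups get wide notches, which can extend past the edges of the box.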
Grouped and Faceted Box Plots
Compare distributions across multiple categorical variables using fill aesthetics or faceting.
# Using the ToothGrowth dataset
ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot(position = position_dodge(0.8)) +
  scale_fill_manual(
    values = c("#E63946", "#457B9D"),
    labels = c("Orange Juice", "Vitamin C")
  ) +
  labs(
    x = "Dose (mg/day)",
    y = "Tooth Length",
    fill = "Supplement Type"
  ) +
  theme_bw()
Faceting splits the plot into separate panels:
ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
  geom_boxplot(fill = "#F4A261", alpha = 0.8) +
  facet_wrap(~ supp, labeller = labeller(
    supp = c(OJ = "Orange Juice", VC = "Vitamin C")
  )) +
  theme_minimal() +
  theme(strip.background = element_rect(fill = "#E9C46A"))
Box Plots with Overlaid Data Points
Overlaying raw data points reveals sample size and distribution patterns hidden in summary statistics.
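ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#A8DADC", outlier.shape = NA) +
  geom_jitter(
    width = 0.2,
    height = 0,  # jitter horizontally only so plotted mpg values stay exact
    alpha = 0.5,
    color = "#1D3557",
    size = 2
  ) +
  theme_minimal()
Setting outlier.shape = NA hides the outlier points drawn by geom_boxplot(), so they are not plotted twice. geom_jitter() adds random noise to the x-coordinates to reduce overplotting; height = 0 turns off the vertical jitter it would otherwise apply by default.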
For reproducible point positions, use geom_point() with position_jitter(), which accepts a seed:
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#DDA15E", alpha = 0.6, outlier.shape = NA) +
  geom_point(
    # height = 0 keeps the jitter horizontal; seed fixes the random offsets
    position = position_jitter(width = 0.15, height = 0, seed = 123),
    alpha = 0.7,
    size = 2.5,
    color = "#BC6C25"
  )
Variable Width Box Plots
Setting varwidth = TRUE draws each box with a width proportional to the square root of its group's sample size:
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(varwidth = TRUE, fill = "#B5838D") +
  labs(
    x = "Cylinders",
    y = "MPG",
    caption = "Box width proportional to the square root of sample size"
  ) +
  theme_light()
This immediately shows which groups have more observations, providing context for statistical reliability.
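According to the ggplot2 documentation, varwidth scales box widths with the square roots of the group sizes. The base R sketch below shows the relative widths this implies for the mtcars cylinder groups. The variable names are illustrative, and rescaling so that the widest box equals 1 is an assumption made for readability, not ggplot2's internal arithmetic.

```r
# Number of cars in each cylinder group
counts <- sapply(split(mtcars$mpg, mtcars$cyl), length)

# Widths relative to the largest group, using the square-root rule
relative_width <- sqrt(counts) / max(sqrt(counts))

print(counts)
print(round(relative_width, 2))
```

The square-root scaling means a group with half as many observations is drawn at roughly 71% of the width, not 50%.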
Horizontal Box Plots
Flip coordinates for better readability with long category names:
# dplyr supplies the pipe and data-manipulation verbs used in later sections
library(dplyr)
# Create sample data with longer labels
set.seed(42)  # make the simulated values reproducible
engine_data <- data.frame(
  category = rep(c("High Performance Engine",
                   "Standard Engine",
                   "Economy Engine"), each = 30),
  efficiency = c(rnorm(30, 15, 3),
                 rnorm(30, 22, 4),
                 rnorm(30, 32, 5))
)
ggplot(engine_data, aes(x = category, y = efficiency)) +
  geom_boxplot(fill = "#8ECAE6") +
  coord_flip() +
  labs(x = NULL, y = "Fuel Efficiency (MPG)") +
  theme_minimal()
Combining Box Plots with Violin Plots
Violin plots show the full distribution density. Combining both provides comprehensive insights:
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.2, alpha = 0.8, outlier.shape = NA) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal() +
  theme(legend.position = "none")
The violin shows distribution shape while the box plot provides exact quartile values.
Adding Statistical Annotations
Include mean values, sample sizes, or statistical test results:
# Calculate means for each group
means <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#ADB5BD") +
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 23,
    size = 3,
    fill = "#E63946"
  ) +
  geom_text(
    data = means,
    aes(x = factor(cyl), y = mean_mpg, label = round(mean_mpg, 1)),
    vjust = -1,
    color = "#E63946",
    fontface = "bold"
  ) +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal()
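Sample sizes can be annotated in a similar way. The sketch below is one possible approach: a hypothetical helper, n_label(), returns a label and a y position for each group, and stat_summary() draws it with geom = "text". The helper name, the label format, and the placement above the highest point are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical helper: label text and vertical position for one group
n_label <- function(y) {
  data.frame(y = max(y), label = paste0("n = ", length(y)))
}

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#ADB5BD") +
  # fun.data is applied per group; the returned label column feeds geom_text
  stat_summary(fun.data = n_label, geom = "text", vjust = -0.8) +
  # Extra headroom so the labels above the top points are not clipped
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.12))) +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal()

print(p)
```

Computing the labels inside stat_summary() keeps them tied to the plotted groups, so no separate summary table is needed.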
Reordering Box Plots by Median
Order categories by their median values for clearer patterns:
ggplot(mtcars, aes(x = reorder(factor(cyl), mpg, FUN = median), y = mpg)) +
  geom_boxplot(aes(fill = factor(cyl))) +
  scale_fill_manual(values = c("#264653", "#2A9D8F", "#E76F51")) +
  labs(x = "Cylinders (ordered by median MPG)", y = "MPG") +
  theme_minimal() +
  theme(legend.position = "none")
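To see what reorder() does to the factor before it reaches the plot, the following base R sketch inspects the resulting level order. The variable names are illustrative; negating the median is one possible way to get a descending order.

```r
# Ascending order: levels sorted by each group's median mpg
cyl_ascending <- reorder(factor(mtcars$cyl), mtcars$mpg, FUN = median)
print(levels(cyl_ascending))

# Descending order: sort by the negated median instead
cyl_descending <- reorder(
  factor(mtcars$cyl), mtcars$mpg,
  FUN = function(v) -median(v)
)
print(levels(cyl_descending))
```

The x-axis follows the level order, so the first level is drawn leftmost.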
Custom Outlier Detection
Define custom outlier thresholds beyond the default 1.5 * IQR:
# Custom function to flag values outside the coef * IQR fences
is_outlier <- function(x, coef = 1.5) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1  # lowercase name avoids masking stats::IQR()
  x < (q1 - coef * iqr) | x > (q3 + coef * iqr)
}
# Flag outliers within each cylinder group using the wider threshold
mtcars_outliers <- mtcars %>%
  group_by(cyl) %>%
  mutate(outlier = is_outlier(mpg, coef = 2)) %>%
  ungroup()
ggplot(mtcars_outliers, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA, coef = 2) +
  geom_point(
    data = filter(mtcars_outliers, outlier),
    color = "#D62828",
    size = 3
  ) +
  labs(
    x = "Cylinders",
    y = "MPG",
    caption = "Outliers: more than 2 * IQR beyond the quartiles"
  ) +
  theme_minimal()
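Before settling on a threshold, it can help to see how many points each choice would flag. The base R sketch below counts flagged mpg values per cylinder group for a few coef values. The helper count_flagged and the specific coef values are illustrative assumptions, and is_outlier is redefined so that the block runs on its own.

```r
# Same rule as above: TRUE for values beyond coef * IQR from the quartiles
is_outlier <- function(x, coef = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x < (q1 - coef * iqr) | x > (q3 + coef * iqr)
}

# Hypothetical helper: flagged points per cylinder group for one coef
count_flagged <- function(coef) {
  sapply(split(mtcars$mpg, mtcars$cyl), function(x) sum(is_outlier(x, coef)))
}

# Rows are cylinder groups, columns are thresholds
flagged <- sapply(c(coef_1.5 = 1.5, coef_2 = 2, coef_3 = 3), count_flagged)
print(flagged)
```

A larger coef widens the fences, so the counts can only stay the same or shrink as coef increases.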
Box Plots with Custom Color Gradients
Apply gradient fills based on a continuous variable:
mtcars_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))
mtcars_joined <- mtcars %>%
  left_join(mtcars_summary, by = "cyl")
ggplot(mtcars_joined, aes(x = factor(cyl), y = mpg, fill = mean_hp)) +
  geom_boxplot() +
  scale_fill_gradient(low = "#F8F9FA", high = "#343A40") +
  labs(
    x = "Cylinders",
    y = "MPG",
    fill = "Mean HP"
  ) +
  theme_minimal()
Box plots in ggplot2 offer flexibility for exploratory data analysis and publication-quality visualizations. Combine multiple geoms, adjust statistical parameters, and customize aesthetics to match your analytical needs. The key is choosing the right combination of features that best communicates your data’s story without overwhelming the viewer.