R ggplot2 - Box Plot with Examples

Key Insights

  • Box plots in ggplot2 reveal distribution characteristics through quartiles, medians, and outliers, making them essential for comparing multiple groups and identifying data anomalies
  • The geom_boxplot() function provides extensive customization options including notched boxes for median confidence intervals, variable width based on sample size, and outlier styling
  • Combining box plots with jittered points, violin plots, or statistical annotations creates more informative visualizations that show both summary statistics and underlying data patterns

Basic Box Plot Syntax

Box plots summarize a distribution with five statistics: the lower whisker, first quartile (Q1), median, third quartile (Q3), and upper whisker. When a group has no outliers, the whiskers coincide with the minimum and maximum. In ggplot2, creating a box plot typically means mapping a categorical variable to the x-axis and a continuous variable to the y-axis.

library(ggplot2)

# Using the built-in mtcars dataset
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Number of Cylinders", y = "Miles Per Gallon")

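The box spans the interquartile range (IQR), which contains the middle 50% of the data. The line inside the box marks the median. Each whisker extends from the box edge to the most extreme observation that lies within 1.5 × IQR of that edge, and any points beyond the whiskers are plotted individually as outliers.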
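
To make the whisker rule concrete, the sketch below recomputes the box plot statistics for the 8-cylinder group by hand in base R. It assumes the default threshold of 1.5 × IQR and the default quantile() method, which should correspond to what geom_boxplot() draws; the variable names are only illustrative.

```r
# mpg values for the 8-cylinder cars in mtcars
mpg8 <- mtcars$mpg[mtcars$cyl == 8]

# Box: first quartile, median, third quartile
q <- quantile(mpg8, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the box edges
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside <- mpg8[mpg8 >= lower_fence & mpg8 <= upper_fence]
whiskers <- range(inside)

# Anything outside the fences is drawn as an outlier point
outliers <- sort(mpg8[mpg8 < lower_fence | mpg8 > upper_fence])

whiskers  # 13.3 18.7
outliers  # 10.4 10.4 19.2
```

Note that the whiskers end at actual observations (13.3 and 18.7), not at the fences themselves.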

Customizing Box Plot Appearance

Control visual elements through parameters in geom_boxplot() and theme functions.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(
    fill = "#4ECDC4",
    color = "#1A535C",
    alpha = 0.7,
    outlier.colour = "#FF6B6B",
    outlier.shape = 16,
    outlier.size = 3,
    notch = FALSE,
    width = 0.6
  ) +
  theme_minimal() +
  labs(x = "Cylinders", y = "MPG")

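Setting notch = TRUE adds notches around the median that extend 1.58 × IQR / sqrt(n) on either side, giving a rough 95% confidence interval for the median. If the notches of two boxes don’t overlap, that suggests their medians differ. Treat this as a visual guide rather than a formal test, especially for small groups, where a notch can extend past the edge of the box.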

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(notch = TRUE, fill = "#95E1D3") +
  theme_classic()
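
The notch limits can also be computed directly, which helps when checking what the plot shows. This is a minimal base R sketch that assumes the 1.58 × IQR / sqrt(n) rule described in the ggplot2 documentation; notch_limits() and notches_overlap() are hypothetical helper names, not ggplot2 functions.

```r
# Notch limits for one group: median +/- 1.58 * IQR / sqrt(n)
notch_limits <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  half_width <- 1.58 * (q[3] - q[1]) / sqrt(length(x))
  c(lower = q[2] - half_width, median = q[2], upper = q[2] + half_width)
}

# One column per cylinder group, rows: lower, median, upper
limits <- sapply(split(mtcars$mpg, mtcars$cyl), notch_limits)
round(limits, 2)

# Two groups differ by the notch rule when their intervals do not overlap
notches_overlap <- function(a, b) {
  limits["lower", a] <= limits["upper", b] &&
    limits["lower", b] <= limits["upper", a]
}

notches_overlap("4", "8")  # FALSE
```

For the 6-cylinder group, which has only seven cars, the upper notch limit comes out slightly above Q3, so the notch extends past the box; ggplot2 may report this with a message when it draws the plot.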

Grouped and Faceted Box Plots

Compare distributions across multiple categorical variables using fill aesthetics or faceting.

# Using the ToothGrowth dataset
ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot(position = position_dodge(0.8)) +
  scale_fill_manual(
    values = c("#E63946", "#457B9D"),
    labels = c("Orange Juice", "Vitamin C")
  ) +
  labs(
    x = "Dose (mg/day)",
    y = "Tooth Length",
    fill = "Supplement Type"
  ) +
  theme_bw()

Faceting splits the plot into separate panels:

ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
  geom_boxplot(fill = "#F4A261", alpha = 0.8) +
  facet_wrap(~ supp, labeller = labeller(
    supp = c(OJ = "Orange Juice", VC = "Vitamin C")
  )) +
  theme_minimal() +
  theme(strip.background = element_rect(fill = "#E9C46A"))

Box Plots with Overlaid Data Points

Overlaying raw data points reveals sample size and distribution patterns hidden in summary statistics.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#A8DADC", outlier.shape = NA) +
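  geom_jitter(
    width = 0.2,
    height = 0,
    alpha = 0.5,
    color = "#1D3557",
    size = 2
  ) +
  theme_minimal()

Setting outlier.shape = NA hides the box plot's own outlier markers so those points are not drawn twice. geom_jitter() adds random horizontal noise to reduce overplotting; height = 0 turns off the vertical jitter it would otherwise apply, so each point keeps its true mpg value.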

For reproducible point positions, use geom_point() with position_jitter() and a fixed seed:

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#DDA15E", alpha = 0.6, outlier.shape = NA) +
  geom_point(
    position = position_jitter(width = 0.15, height = 0, seed = 123),
    alpha = 0.7,
    size = 2.5,
    color = "#BC6C25"
  )
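
One way to confirm that a seeded jitter is stable is to inspect the data the point layer actually draws. The sketch below uses layer_data() from ggplot2 for that; it assumes the point layer is the second layer of the plot, and the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.15, height = 0, seed = 123))

# layer_data() returns the data a layer draws; layer 2 is the points
first_build  <- layer_data(p, 2)
second_build <- layer_data(p, 2)

# Same seed, so the horizontal offsets match on every build
identical(first_build$x, second_build$x)

# height = 0 leaves the measured values untouched
all.equal(sort(first_build$y), sort(mtcars$mpg))
```

Without a seed, the offsets change each time the plot is rendered, which makes figures hard to reproduce exactly.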

Variable Width Box Plots

Setting varwidth = TRUE makes each box's width proportional to the square root of its group's sample size:

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(varwidth = TRUE, fill = "#B5838D") +
  labs(
    x = "Cylinders",
    y = "MPG",
    caption = "Box width proportional to the square root of sample size"
  ) +
  theme_light()

This immediately shows which groups have more observations, providing context for statistical reliability.
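
According to the ggplot2 documentation, varwidth scales widths with the square root of the group size rather than with the size itself. The short base R sketch below shows what that means for the three cylinder groups; rel_width is an illustrative name, and the scaling to a maximum of 1 is an assumption made here for readability.

```r
# Number of cars in each cylinder group
n <- table(mtcars$cyl)

# Relative widths: proportional to sqrt(n), scaled so the largest group is 1
rel_width <- sqrt(n) / max(sqrt(n))
round(rel_width, 2)
```

The 6-cylinder group has half as many cars as the 8-cylinder group (7 versus 14), yet its box is about 71% as wide rather than 50%, so differences in width understate differences in sample size.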

Horizontal Box Plots

Flip coordinates for better readability with long category names:

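library(dplyr)  # not needed here; used for the pipelines in later sections

# Create sample data with longer labels
set.seed(42)  # fixed seed so the simulated values are reproducible
data <- data.frame(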
  category = rep(c("High Performance Engine", 
                   "Standard Engine", 
                   "Economy Engine"), each = 30),
  efficiency = c(rnorm(30, 15, 3), 
                 rnorm(30, 22, 4), 
                 rnorm(30, 32, 5))
)

ggplot(data, aes(x = category, y = efficiency)) +
  geom_boxplot(fill = "#8ECAE6") +
  coord_flip() +
  labs(x = NULL, y = "Fuel Efficiency (MPG)") +
  theme_minimal()
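
In ggplot2 3.3.0 and later, a horizontal box plot can also be drawn without coord_flip() by mapping the continuous variable to x and the categorical one to y. The sketch below assumes that version or newer; the engines data frame is simulated in the same way as above and its name is illustrative.

```r
library(ggplot2)

set.seed(42)  # reproducible simulated data
engines <- data.frame(
  category = rep(c("High Performance Engine",
                   "Standard Engine",
                   "Economy Engine"), each = 30),
  efficiency = c(rnorm(30, 15, 3),
                 rnorm(30, 22, 4),
                 rnorm(30, 32, 5))
)

# Continuous variable on x, categorical on y: boxes are drawn horizontally
p_horizontal <- ggplot(engines, aes(x = efficiency, y = category)) +
  geom_boxplot(fill = "#8ECAE6") +
  labs(x = "Fuel Efficiency (MPG)", y = NULL) +
  theme_minimal()

p_horizontal
```

With this approach the axis labels in labs() refer to the axes as displayed, which avoids the x/y swap that coord_flip() introduces.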

Combining Box Plots with Violin Plots

Violin plots show the full distribution density. Overlaying a narrow box plot adds the median and quartiles to that shape:

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.2, alpha = 0.8, outlier.shape = NA) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal() +
  theme(legend.position = "none")

The violin shows distribution shape while the box plot provides exact quartile values.

Adding Statistical Annotations

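Annotations such as group means, sample sizes, or statistical test results add context that the box alone does not show. The following example marks each group's mean with a diamond and labels it with its value: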

# Calculate means for each group
means <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#ADB5BD") +
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 23,
    size = 3,
    fill = "#E63946"
  ) +
  geom_text(
    data = means,
    aes(x = factor(cyl), y = mean_mpg, label = round(mean_mpg, 1)),
    vjust = -1,
    color = "#E63946",
    fontface = "bold"
  ) +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal()
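
Sample sizes can be added in a similar way. This is a minimal sketch that counts the observations per group with base R and places an "n = ..." label below each box; the counts data frame, its columns, and the 1.5 mpg offset are illustrative choices.

```r
library(ggplot2)

# Count observations per cylinder group and build the label text
counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "n")
counts$label <- paste0("n = ", counts$n)

# Put every label just below the lowest observed mpg
counts$y <- min(mtcars$mpg) - 1.5

p_counts <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#ADB5BD") +
  geom_text(data = counts, aes(x = cyl, y = y, label = label)) +
  labs(x = "Cylinders", y = "MPG") +
  theme_minimal()

p_counts
```

Showing n next to each box helps readers judge how much weight to give the quartiles of small groups.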

Reordering Box Plots by Median

Order categories by their median values for clearer patterns:

ggplot(mtcars, aes(x = reorder(factor(cyl), mpg, FUN = median), y = mpg)) +
  geom_boxplot(aes(fill = factor(cyl))) +
  scale_fill_manual(values = c("#264653", "#2A9D8F", "#E76F51")) +
  labs(x = "Cylinders (ordered by median MPG)", y = "MPG") +
  theme_minimal() +
  theme(legend.position = "none")
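
To see the ordering that reorder() produces before plotting, inspect the factor levels directly. The base R sketch below also shows one possible way to reverse the order by negating the values; the object names are illustrative.

```r
# Median mpg per cylinder group
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
medians

# reorder() sorts factor levels by the summary value, ascending
by_median <- reorder(factor(mtcars$cyl), mtcars$mpg, FUN = median)
levels(by_median)  # "8" "6" "4"

# Negating the values gives descending order instead
by_median_desc <- reorder(factor(mtcars$cyl), -mtcars$mpg, FUN = median)
levels(by_median_desc)  # "4" "6" "8"
```

Because reorder() uses the mean when FUN is omitted, passing FUN = median explicitly keeps the order consistent with the line drawn inside each box.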

Custom Outlier Detection

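The coef argument controls how far the whiskers may reach, in multiples of the IQR beyond the quartiles (default 1.5). To use a different threshold, pass the same coef to geom_boxplot() and to a helper that flags the outliers:

# Flag values more than coef * IQR below Q1 or above Q3
is_outlier <- function(x, coef = 1.5) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1  # lower-case name avoids masking stats::IQR()
  x < (q1 - coef * iqr) | x > (q3 + coef * iqr)
}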

mtcars_outliers <- mtcars %>%
  group_by(cyl) %>%
  mutate(outlier = is_outlier(mpg, coef = 2)) %>%
  ungroup()

ggplot(mtcars_outliers, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA, coef = 2) +
  geom_point(
    data = filter(mtcars_outliers, outlier),
    color = "#D62828",
    size = 3
  ) +
  labs(
    x = "Cylinders",
    y = "MPG",
    caption = "Outliers: more than 2 * IQR beyond Q1 or Q3"
  ) +
  theme_minimal()
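
Before settling on a threshold, it can help to count how many points each choice would flag. The base R sketch below compares the default of 1.5 with the value of 2 used above; count_outliers() is a hypothetical helper, and is_outlier() is redefined so the block runs on its own.

```r
# Flag values more than coef * IQR below Q1 or above Q3
is_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Count flagged cars per cylinder group for a given threshold
count_outliers <- function(coef) {
  tapply(mtcars$mpg, mtcars$cyl, function(x) sum(is_outlier(x, coef)))
}

rbind(
  coef_1.5 = count_outliers(1.5),
  coef_2   = count_outliers(2)
)
```

For mtcars, only the 8-cylinder group has flagged points: three at the default threshold and two when coef is raised to 2.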

Box Plots with Custom Color Gradients

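Fill each box according to a group-level continuous summary, here the mean horsepower of each cylinder group. The fill value must be constant within a box, so compute it per group and join it back to the data: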

mtcars_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

mtcars_joined <- mtcars %>%
  left_join(mtcars_summary, by = "cyl")

ggplot(mtcars_joined, aes(x = factor(cyl), y = mpg, fill = mean_hp)) +
  geom_boxplot() +
  scale_fill_gradient(low = "#F8F9FA", high = "#343A40") +
  labs(
    x = "Cylinders",
    y = "MPG",
    fill = "Mean HP"
  ) +
  theme_minimal()

Box plots in ggplot2 offer flexibility for exploratory data analysis and publication-quality visualizations. Combine multiple geoms, adjust statistical parameters, and customize aesthetics to match your analytical needs. The key is choosing the right combination of features that best communicates your data’s story without overwhelming the viewer.
