How to Create a Violin Plot in ggplot2
Key Insights
- Violin plots reveal distribution shapes that box plots hide, making them superior for detecting bimodal distributions, skewness, and multiple peaks in your data
- Layer geom_violin() with geom_boxplot() and geom_jitter() to show distribution density, quartiles, and individual observations simultaneously
- Use the scale parameter to control violin width normalization: “area” keeps total area constant across groups, while “width” standardizes maximum width
Understanding Violin Plots and When to Use Them
Violin plots combine the summary statistics of box plots with the distribution visualization of kernel density plots. While a box plot shows you five numbers (min, Q1, median, Q3, max), a violin plot reveals the entire probability density of your data at different values.
The critical advantage? Violin plots expose distribution characteristics that box plots completely miss. Two datasets can have identical five-number summaries but wildly different distributions—one might be bimodal, another uniform, and a third normally distributed. Violin plots make these differences immediately visible.
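To make this concrete, here is a minimal sketch with simulated data (the data frame `shapes` and its parameters are invented for illustration). The three groups are built so that their medians and quartiles come out roughly similar, though not identical, and their extremes differ.

```r
library(ggplot2)

set.seed(1)
n <- 500

# Three simulated groups with similar centers and quartiles but different shapes
shapes <- data.frame(
  shape = rep(c("bimodal", "normal", "uniform"), each = n),
  value = c(
    c(rnorm(n / 2, mean = -2, sd = 0.5), rnorm(n / 2, mean = 2, sd = 0.5)),
    rnorm(n, mean = 0, sd = 3),
    runif(n, min = -4, max = 4)
  )
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
summaries <- tapply(shapes$value, shapes$shape, fivenum)
print(summaries)

# Violins show the shape; the narrow box plots show the quartiles
p <- ggplot(shapes, aes(x = shape, y = value)) +
  geom_violin(fill = "grey85") +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  theme_minimal()
print(p)
```

The boxes should look much alike, while the violins make the two peaks of the bimodal group and the flat profile of the uniform group easy to see.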
Use violin plots when you have sufficient data points (ideally 30+) per category and when understanding distribution shape matters for your analysis. They’re particularly valuable when comparing multiple groups where distribution differences are as important as central tendency differences.
Building Your First Violin Plot
Let’s start with the essentials. You’ll need ggplot2, and I recommend having dplyr loaded for data manipulation.
library(ggplot2)
library(dplyr)
# Using the iris dataset
data(iris)
# Basic violin plot
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  labs(title = "Sepal Length Distribution by Species",
       x = "Species",
       y = "Sepal Length (cm)") +
  theme_minimal()
This basic syntax requires two aesthetics: a categorical variable for the x-axis and a continuous variable for the y-axis. The geom_violin() function calculates kernel density estimates and mirrors them to create the characteristic violin shape.
The width of the violin at any point represents the density of observations at that value—wider sections indicate more data points, narrower sections indicate fewer.
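The mirroring idea can be sketched by hand with base R's density() function. This is only an illustration of the principle, not a reproduction of what geom_violin() does internally, and the names `dens` and `outline` are made up for this example.

```r
library(ggplot2)

# Kernel density estimate for one group (Gaussian kernel, default bandwidth)
x <- iris$Sepal.Length[iris$Species == "setosa"]
dens <- density(x)

# Mirror the density curve around zero to trace a closed violin outline
outline <- data.frame(
  value      = c(dens$x, rev(dens$x)),
  half_width = c(dens$y, -rev(dens$y))
)

p <- ggplot(outline, aes(x = half_width, y = value)) +
  geom_polygon(fill = "grey85", colour = "black") +
  labs(title = "Hand-built violin for setosa",
       x = "Estimated density (mirrored)",
       y = "Sepal Length (cm)") +
  theme_minimal()
print(p)
```

Note that density() evaluates the curve a little beyond the observed range by default, so the tails of this outline taper to a point.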
Customizing Violin Aesthetics
Raw violin plots rarely tell the complete story. Let’s enhance them with color, adjust their appearance, and make them publication-ready.
# Enhanced violin plot with custom styling
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(trim = FALSE,
              scale = "width",
              alpha = 0.7) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
  labs(title = "Sepal Length Distribution by Species",
       x = "Species",
       y = "Sepal Length (cm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 14, face = "bold"))
Key parameters to understand:
- trim: When TRUE (the default), the violin tails are cut off at the minimum and maximum of the data. When FALSE, the density estimate is drawn past the observed range, so the tails taper to a point.
- scale: Controls width normalization. “area” (the default) gives every violin the same area, “count” scales area in proportion to the number of observations, and “width” gives every violin the same maximum width.
- alpha: Controls transparency, useful when overlaying multiple geoms.
The scale parameter deserves special attention. Use “area” (the default) when you want to emphasize that all groups are equally important regardless of sample size. Use “count” when sample size differences matter and should be visually apparent. Use “width” when you want easy visual comparison of distribution shapes.
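The effect of scale is easiest to see when group sizes differ. The sketch below uses a simulated data frame (`unequal`, invented for this example) with a 40-observation group and a 400-observation group drawn from the same distribution, and draws one plot per scale option.

```r
library(ggplot2)

set.seed(7)

# Two groups from the same distribution, differing only in sample size
unequal <- data.frame(
  group = rep(c("small (n = 40)", "large (n = 400)"), times = c(40, 400)),
  value = c(rnorm(40, mean = 10, sd = 2), rnorm(400, mean = 10, sd = 2))
)

# One plot per scale option so the three normalizations can be compared
scale_options <- c("area", "count", "width")
plots <- lapply(scale_options, function(s) {
  ggplot(unequal, aes(x = group, y = value)) +
    geom_violin(scale = s, fill = "grey85") +
    labs(title = paste0('scale = "', s, '"'), x = NULL, y = "Value") +
    theme_minimal()
})
names(plots) <- scale_options

for (p in plots) print(p)
```

With “count”, the small group should appear as a much thinner violin. With “area” and “width”, the two violins should look broadly similar because the underlying distributions are the same.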
Layering Multiple Geoms for Rich Visualizations
The real power of violin plots emerges when you combine them with other geometric objects. This approach provides distribution shape, summary statistics, and individual data points in a single visualization.
# Violin plot with overlaid box plot and points
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.2,
               fill = "white",
               outlier.shape = NA) +
  geom_jitter(width = 0.1,
              alpha = 0.3,
              size = 1) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
  labs(title = "Comprehensive Distribution Comparison",
       x = "Species",
       y = "Sepal Length (cm)") +
  theme_minimal() +
  theme(legend.position = "none")
This layered approach gives you three levels of information:
- Violin: Overall distribution shape and density
- Box plot: Median, quartiles, and whiskers (outlier points are suppressed in this example)
- Jittered points: Individual observations
Set outlier.shape = NA in geom_boxplot() to avoid plotting outliers twice (once in the box plot, once in the jittered points). Adjust the box plot width to keep it narrow—you want it to summarize, not dominate.
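One detail worth checking in the jitter layer: when only width is supplied, geom_jitter() still applies a small default vertical jitter, which nudges the plotted values away from the measured ones, and the random offsets change on every render. A possible variation, sketched below, uses geom_point() with position_jitter() so that height is set to 0 and a seed fixes the layout. The seed value is arbitrary.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) +
  # Horizontal jitter only; the seed makes the point layout reproducible
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 42),
             alpha = 0.3, size = 1) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
  theme_minimal() +
  theme(legend.position = "none")
print(p)
```

With height = 0, each point sits at its true y value, so the points remain an honest record of the individual observations.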
Advanced Techniques: Split Violins and Faceting
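Comparing two groups within each category is particularly effective for A/B testing or gender comparisons. A true split violin draws one half of the violin per group; that is usually done with an extension package or a custom geom. The example below takes the simpler route of placing the two groups side by side with position_dodge(), which supports the same comparison using only ggplot2.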
# Create a dataset with two groups
set.seed(42)
comparison_data <- data.frame(
  category = rep(c("A", "B", "C"), each = 100),
  # 50 Control then 50 Treatment per category, matching the order of `value`
  group = rep(rep(c("Control", "Treatment"), each = 50), times = 3),
  value = c(rnorm(50, 10, 2), rnorm(50, 12, 2),
            rnorm(50, 15, 3), rnorm(50, 16, 3),
            rnorm(50, 20, 2.5), rnorm(50, 19, 2.5))
)
# Dodged (side-by-side) violin plot
ggplot(comparison_data, aes(x = category, y = value, fill = group)) +
  geom_violin(position = position_dodge(width = 0.9),
              alpha = 0.7) +
  geom_boxplot(position = position_dodge(width = 0.9),
               width = 0.2,
               outlier.shape = NA) +
  scale_fill_manual(values = c("#D55E00", "#0072B2")) +
  labs(title = "Control vs Treatment Across Categories",
       x = "Category",
       y = "Response Value",
       fill = "Group") +
  theme_minimal()
For exploring multiple dimensions, faceting creates small multiples:
# Faceted violin plots
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_violin(show.legend = FALSE) +
  facet_wrap(~ cut(Sepal.Width, breaks = 3,
                   labels = c("Narrow", "Medium", "Wide"))) +
  labs(title = "Petal Length by Species, Faceted by Sepal Width",
       x = "Species",
       y = "Petal Length (cm)") +
  theme_minimal()
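An alternative to calling cut() inside facet_wrap() is to compute the facet variable first with dplyr. This gives it a readable name and lets you count observations per panel before plotting. The sketch below assumes a new column called `width_class`, a name chosen only for this example.

```r
library(ggplot2)
library(dplyr)

# Bin Sepal.Width into three equal-width classes as an explicit column
iris_binned <- iris %>%
  mutate(width_class = cut(Sepal.Width, breaks = 3,
                           labels = c("Narrow", "Medium", "Wide")))

# Check how many flowers land in each facet/species cell
cell_counts <- count(iris_binned, width_class, Species)
print(cell_counts)

p <- ggplot(iris_binned, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_violin(show.legend = FALSE) +
  facet_wrap(~ width_class) +
  labs(title = "Petal Length by Species, Faceted by Sepal Width",
       x = "Species",
       y = "Petal Length (cm)") +
  theme_minimal()
print(p)
```

Some species and width combinations may contain only a handful of flowers. A violin drawn from so few points should be read with caution.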
Practical Applications and Best Practices
Violin plots excel in specific scenarios. Here’s a real-world example analyzing API response times across different endpoints:
# Simulated API response time data
set.seed(123)
api_data <- data.frame(
  endpoint = rep(c("/users", "/products", "/orders", "/search"), each = 200),
  response_time = c(
    rgamma(200, shape = 2, rate = 0.05),       # users: right-skewed
    c(rnorm(150, 40, 5), rnorm(50, 80, 10)),   # products: bimodal
    rnorm(200, 60, 15),                        # orders: normal
    rexp(200, rate = 0.02)                     # search: exponential
  )
)
# Comprehensive API performance visualization
ggplot(api_data, aes(x = reorder(endpoint, response_time, median),
                     y = response_time,
                     fill = endpoint)) +
  geom_violin(alpha = 0.6) +
  geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) +
  stat_summary(fun = median, geom = "point",
               size = 3, color = "red", shape = 18) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "API Response Time Distributions by Endpoint",
       subtitle = "Red diamonds indicate median response time",
       x = "Endpoint (ordered by median response time)",
       y = "Response Time (ms)") +
  theme_minimal() +
  theme(legend.position = "none")
This visualization immediately reveals that /products has a bimodal distribution (suggesting two different performance profiles), /search has a long right tail (occasional slow queries), while /orders shows consistent performance.
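A numeric summary can back up what the plot suggests. The sketch below rebuilds the same simulated data and computes a few per-endpoint statistics with dplyr. The column names (`median_ms`, `p95_ms`, `max_ms`) are illustrative choices.

```r
library(dplyr)

# Same simulated data as in the plotting example
set.seed(123)
api_data <- data.frame(
  endpoint = rep(c("/users", "/products", "/orders", "/search"), each = 200),
  response_time = c(
    rgamma(200, shape = 2, rate = 0.05),
    c(rnorm(150, 40, 5), rnorm(50, 80, 10)),
    rnorm(200, 60, 15),
    rexp(200, rate = 0.02)
  )
)

# Median, 95th percentile, and maximum response time per endpoint
api_summary <- api_data %>%
  group_by(endpoint) %>%
  summarise(
    n         = n(),
    median_ms = median(response_time),
    p95_ms    = quantile(response_time, 0.95, names = FALSE),
    max_ms    = max(response_time),
    .groups   = "drop"
  ) %>%
  arrange(median_ms)

print(api_summary)
```

A large gap between the median and the 95th percentile points to a long right tail. Bimodality, on the other hand, does not show up in these numbers at all, which is the gap the violin plot fills.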
Critical best practices:
- Sample size matters: As a rule of thumb, aim for at least 30 observations per category. With fewer points, the kernel density estimate becomes unreliable. Consider box plots or strip charts for small samples.
- Bandwidth selection: The default bandwidth usually works well, but you can adjust it with the bw parameter in geom_violin(). Smaller values show more detail but risk overfitting; larger values create smoother but potentially oversimplified distributions.
- Avoid violin plots for discrete data: If your y variable only takes a few distinct values, violin plots create misleading density estimates. Use bar charts or dot plots instead.
- Order matters: Reorder categories by median or mean to make comparisons easier, as shown in the API example with reorder().
- Color accessibility: Always consider colorblind-friendly palettes. The viridis or ColorBrewer palettes work well.
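To illustrate the bandwidth point, the sketch below draws the same simulated bimodal sample three times using the adjust argument, which multiplies the default bandwidth. A fixed bandwidth can be supplied through bw instead. The data frame `bimodal` and the chosen adjust values are assumptions made for this example.

```r
library(ggplot2)

set.seed(99)

# A bimodal sample similar in spirit to the /products endpoint above
bimodal <- data.frame(
  group = "products",
  value = c(rnorm(150, mean = 40, sd = 5), rnorm(50, mean = 80, sd = 10))
)

# adjust < 1 narrows the bandwidth (more detail); adjust > 1 widens it (smoother)
adjust_values <- c(0.3, 1, 3)
bw_plots <- lapply(adjust_values, function(a) {
  ggplot(bimodal, aes(x = group, y = value)) +
    geom_violin(adjust = a, fill = "grey85") +
    labs(title = paste("adjust =", a), x = NULL, y = "Value") +
    theme_minimal()
})

for (p in bw_plots) print(p)
```

At the smallest setting the outline tends to look bumpy, and at the largest the two peaks can blur together. That is the detail-versus-smoothness trade-off described in the list.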
Violin plots transform distribution comparison from a statistical exercise into an intuitive visual experience. They reveal patterns that summary statistics obscure and make distribution differences immediately apparent. Master them, and you’ll communicate data insights far more effectively than with box plots alone.