How to Create a Scatter Plot in ggplot2
Key Insights
- ggplot2’s layered grammar of graphics makes scatter plots intuitive: start with ggplot() to define data and aesthetics, then add geom_point() to render points
- Map variables to aesthetics (color, size, shape) inside aes() for data-driven styling, but set fixed values outside aes() for consistent appearance across all points
- Combine geom_smooth() with scatter plots to reveal trends, and use faceting to compare relationships across categorical groups without cluttering a single plot
Introduction to ggplot2 and Scatter Plots
ggplot2 is R’s most popular visualization package, built on Leland Wilkinson’s grammar of graphics. Rather than providing pre-built chart types, ggplot2 treats plots as layered compositions of data, aesthetics, and geometric objects. This approach gives you precise control over every visual element.
Scatter plots excel at revealing relationships between two continuous variables. Use them to identify correlations, spot outliers, detect clusters, and visualize trends. They’re essential for exploratory data analysis and communicating findings about how variables relate to each other.
The grammar of graphics philosophy means you build plots incrementally. You start with a base layer defining your data and aesthetic mappings, then add geometric layers (points, lines, etc.), statistical transformations, and styling. This compositional approach makes ggplot2 code readable and modifications straightforward.
Basic Scatter Plot Setup
Every ggplot2 visualization starts with the ggplot() function, which initializes the plot object. You specify your dataset and map variables to aesthetic properties like x and y coordinates. Then you add layers using the + operator.
Here’s a minimal scatter plot using the built-in mtcars dataset, which contains specifications for 32 automobiles (1973-74 models) drawn from the 1974 Motor Trend magazine:
library(ggplot2)
# Basic scatter plot: fuel efficiency vs. horsepower
ggplot(data = mtcars, aes(x = hp, y = mpg)) +
  geom_point()
This creates a scatter plot with horsepower on the x-axis and miles per gallon on the y-axis. The aes() function establishes aesthetic mappings—it tells ggplot2 which variables control which visual properties. The geom_point() layer renders the actual points.
You can also pipe data into ggplot2 with the native pipe (|>, available from R 4.1.0), which is cleaner when preprocessing:
mtcars |>
  ggplot(aes(x = hp, y = mpg)) +
  geom_point()
The negative relationship is immediately visible—cars with more horsepower generally have lower fuel efficiency. This is the power of scatter plots: patterns emerge instantly.
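To put a number on the pattern, one option is the Pearson correlation coefficient. The sketch below uses base R's cor() on the same two columns; the variable name hp_mpg_cor is only illustrative, and a single coefficient summarizes just the linear part of the relationship.

```r
# Pearson correlation between horsepower and fuel efficiency in mtcars
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)

# A value near -1 indicates a strong negative linear association
print(round(hp_mpg_cor, 2))
```

A strongly negative value here agrees with the downward slope seen in the plot, though the scatter plot remains the better tool for spotting curvature or outliers.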
Customizing Point Aesthetics
Points have multiple aesthetic properties you can control: color, size, shape, and transparency (alpha). Understanding when to map variables versus setting fixed values is crucial.
Mapping variables to aesthetics (inside aes()): Use this when you want the aesthetic to represent data. ggplot2 automatically creates legends and scales.
Setting fixed aesthetics (outside aes()): Use this when you want all points to have the same appearance.
# Map cylinder count to color (inside aes)
# Set fixed size and transparency (outside aes)
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7)
Here, color = factor(cyl) maps the number of cylinders to point color. We wrap cyl in factor() to treat it as categorical rather than continuous. The size = 3 and alpha = 0.7 arguments apply to all points—they’re not mapped to data.
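A common slip is putting a constant inside aes(). The sketch below contrasts the two forms and uses layer_data() to inspect the colors ggplot2 actually computes; the object names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# Pitfall: a constant inside aes() is treated as data, so ggplot2 builds a
# one-level legend and assigns a color from its default palette.
p_mapped <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(color = "blue"))

# Intended: a constant outside aes() is applied directly to every point.
p_fixed <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue")

# Inspect the colors that are actually drawn in each case
mapped_colours <- unique(layer_data(p_mapped)$colour)
fixed_colours <- unique(layer_data(p_fixed)$colour)
print(mapped_colours)
print(fixed_colours)
```

If a plot shows an unexpected legend entry named after a color, this is usually the cause.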
You can map multiple variables simultaneously:
ggplot(mtcars, aes(x = hp, y = mpg,
                   color = factor(cyl),
                   size = wt)) +
  geom_point(alpha = 0.6)
Now color represents cylinders and size represents weight. The alpha transparency helps when points overlap. This multi-dimensional approach lets you visualize four variables in a single plot—x position, y position, color, and size.
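Available point shapes are numbered 0-25. Shapes 0-14 are outlines or line symbols with no fill, 15-20 are solid, and 21-25 have separate fill and border colors. When you map a variable to shape, ggplot2 chooses the shapes for you: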
ggplot(mtcars, aes(x = hp, y = mpg, shape = factor(cyl))) +
  geom_point(size = 3)
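For shapes 21-25, the interior is controlled by fill and the outline by color. A minimal sketch, assuming you want cylinder count shown by fill with a fixed black border (the name p_filled and the stroke width are arbitrary choices):

```r
library(ggplot2)

# Shape 21 is a circle with separate fill and border:
# fill is mapped to data, the border color and width are fixed.
p_filled <- ggplot(mtcars, aes(x = hp, y = mpg, fill = factor(cyl))) +
  geom_point(shape = 21, color = "black", size = 3, stroke = 0.8) +
  labs(fill = "Cylinders")

p_filled
```

Note that the legend title is set through fill rather than color, because fill is the mapped aesthetic here.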
Adding Layers and Annotations
Scatter plots become more informative when you add contextual layers. Trend lines reveal overall patterns, reference lines highlight thresholds, and labels identify specific points.
Add a linear regression line with geom_smooth():
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue")
The method = "lm" argument fits a linear regression. Setting se = TRUE (the default) draws the confidence interval as a shaded ribbon. For non-linear relationships, use method = "loess" for locally weighted smoothing; when method is left unset, ggplot2 picks loess for fewer than 1,000 observations and a GAM otherwise.
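As a rough illustration of the non-linear option, the sketch below requests a loess fit explicitly. The span value of 0.9 is an arbitrary choice for this small dataset (larger spans give smoother curves), and p_loess is an illustrative name.

```r
library(ggplot2)

# Loess smoother with an explicit formula and span; se = FALSE hides the ribbon
p_loess <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9, se = FALSE)

p_loess
```

Passing formula explicitly also silences the message ggplot2 otherwise prints about the formula it chose.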
Reference lines mark important values:
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2) +
  geom_hline(yintercept = 20, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 150, linetype = "dashed", color = "red")
Label specific points with geom_text() or geom_label():
library(dplyr)
# Identify cars with extreme values; capture the names before filtering,
# because rownames(mtcars) no longer lines up with the filtered rows
extreme_cars <- mtcars |>
  mutate(car_name = rownames(mtcars)) |>
  filter(mpg > 30 | hp > 250)
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_label(data = extreme_cars,
             aes(label = car_name),
             nudge_y = 1, size = 3)
The nudge_y argument shifts labels vertically to avoid overlapping points. Use geom_label() for labels with backgrounds or geom_text() for plain text.
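If you would rather label every point than a hand-picked subset, geom_text() accepts check_overlap = TRUE, which skips labels that would collide with ones already drawn. A minimal sketch; the cars data frame and p_text name are illustrative, and which labels survive depends on the plot dimensions.

```r
library(ggplot2)

# Copy row names into a column so they can be mapped to the label aesthetic
cars <- mtcars
cars$car_name <- rownames(mtcars)

# check_overlap = TRUE drops labels that would overlap earlier ones at draw time
p_text <- ggplot(cars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_text(aes(label = car_name), check_overlap = TRUE,
            vjust = -0.8, size = 3)

p_text
```

This trades completeness for legibility: some points go unlabeled, so it suits exploration better than final figures.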
Enhancing with Themes and Labels
Default ggplot2 plots are functional but bland. Professional visualizations need clear labels, appropriate themes, and thoughtful styling.
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "gray30") +
  labs(
    title = "Fuel Efficiency Decreases with Horsepower",
    subtitle = "1974 Motor Trend automobile data",
    x = "Horsepower",
    y = "Miles per Gallon",
    color = "Cylinders",
    caption = "Source: 1974 Motor Trend magazine"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )
The labs() function sets the title, subtitle, caption, axis labels, and legend titles. Always label your axes clearly: raw variable names like “hp” are meaningless to most audiences.
ggplot2 includes several complete themes:
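- theme_minimal(): Clean, with no background fill or panel border; light grid lines remain
- theme_bw(): White background with a dark panel border and grid lines
- theme_classic(): Axis lines only, no grid
- theme_light(): Light gray grid lines and panel border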
You can further customize with theme(). Common adjustments include legend position, text sizes, and grid line appearance.
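The adjustments mentioned above might look like the following sketch. The specific values (bottom legend, 12-point axis titles, no minor grid) are illustrative choices, not recommendations from the original text.

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7) +
  labs(color = "Cylinders") +
  theme_minimal() +
  theme(
    legend.position = "bottom",            # move the legend below the panel
    axis.title = element_text(size = 12),  # resize both axis titles
    panel.grid.minor = element_blank()     # remove minor grid lines
  )

p_theme
```

Calls to theme() should come after the complete theme, since a complete theme such as theme_minimal() replaces earlier theme() settings.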
For publications, consider the theme_set() function to apply a theme globally:
theme_set(theme_minimal(base_size = 12))
Advanced Techniques
Faceting splits your data into multiple panels, each showing a subset. This is invaluable for comparing relationships across categories:
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~cyl, labeller = label_both) +
  labs(
    title = "Horsepower-MPG Relationship by Cylinder Count",
    x = "Horsepower",
    y = "Miles per Gallon"
  ) +
  theme_bw()
Use facet_wrap() to facet by one variable, or facet_grid() to lay out two variables as a row-by-column matrix of panels.
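A two-variable layout could be sketched as follows, using transmission type (am, where 0 is automatic and 1 is manual) for rows and cylinder count for columns. The choice of am is an assumption made for illustration.

```r
library(ggplot2)

# Rows: transmission (am); columns: cylinders (cyl)
p_grid <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.7) +
  facet_grid(am ~ cyl, labeller = label_both) +
  labs(x = "Horsepower", y = "Miles per Gallon")

p_grid
```

With only 32 cars spread over six panels, some panels hold just two or three points, so a grid like this is more useful on larger datasets.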
When points overlap heavily (overplotting), patterns become invisible. geom_jitter() adds random noise to separate points:
# Example with discrete data
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.6, size = 2) +
  labs(x = "Cylinders", y = "Miles per Gallon") +
  theme_minimal()
The width and height arguments control jitter magnitude. Set height = 0 when your y-axis represents precise measurements.
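Because jitter is random, the plot changes slightly on every render. One way to make it reproducible, sketched below, is to pass a seed through position_jitter(); the seed value 42 and the name p_jitter are arbitrary.

```r
library(ggplot2)

# position_jitter() with a fixed seed gives the same offsets on every render
p_jitter <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 42),
    alpha = 0.6, size = 2
  ) +
  labs(x = "Cylinders", y = "Miles per Gallon")

p_jitter
```

This matters mostly for reports and version-controlled figures, where an unchanged analysis should produce an unchanged image.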
For large datasets, consider geom_hex() (which needs the hexbin package installed) or geom_bin2d() to create heatmap-style density plots:
# Simulating larger dataset
large_data <- data.frame(
  x = rnorm(10000),
  y = rnorm(10000)
)
ggplot(large_data, aes(x = x, y = y)) +
  geom_hex(bins = 30) +
  scale_fill_viridis_c() +
  theme_minimal()
For interactive plots, convert ggplot2 objects to plotly:
library(plotly)
p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  theme_minimal()
ggplotly(p)
This adds hover tooltips, zoom, and pan capabilities—useful for web-based reports and presentations.
Practical Recommendations
Start simple and add complexity incrementally. A basic scatter plot with clear labels often communicates better than an over-designed visualization.
Always consider your audience. Academic papers might need theme_bw() and precise statistical annotations. Business presentations benefit from theme_minimal() and bold titles.
Use color purposefully. Map color to meaningful categories, not arbitrary groupings. Stick to colorblind-friendly palettes like viridis for continuous data.
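ggplot2 ships viridis scales for both cases: scale_color_viridis_d() for discrete variables and scale_color_viridis_c() for continuous ones. A brief sketch, with illustrative object names and legend titles:

```r
library(ggplot2)

# Discrete variable: one viridis color per cylinder count
p_discrete <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_viridis_d(name = "Cylinders")

# Continuous variable: weight shown on a viridis gradient
p_continuous <- ggplot(mtcars, aes(x = hp, y = mpg, color = wt)) +
  geom_point(size = 3) +
  scale_color_viridis_c(name = "Weight (1000 lbs)")

p_discrete
p_continuous
```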
When showing multiple relationships, faceting beats cramming everything into one plot. Your audience can compare patterns across panels more easily than decoding overlapping geometries.
Export plots at appropriate resolutions. Use ggsave() with explicit dimensions:
ggsave("scatter_plot.png", width = 8, height = 6, dpi = 300)
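By default ggsave() writes whichever plot was displayed last, which can surprise you in longer scripts. A slightly more defensive sketch passes the plot object explicitly; it writes to a temporary directory here only so the example leaves no files behind, and the names p and out_file are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# Save a specific plot object; width and height are in inches by default
out_file <- file.path(tempdir(), "scatter_plot.png")
ggsave(out_file, plot = p, width = 8, height = 6, dpi = 300)
```

In a real project you would replace the temporary path with a location inside your output folder.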
Master scatter plots in ggplot2 and you’ll have a foundation for understanding the entire package. The same principles—layering, aesthetic mapping, and incremental refinement—apply to every plot type ggplot2 supports.