R ggplot2 - Scatter Plot with Examples
Key Insights
- ggplot2 uses a layered grammar of graphics approach where you build plots by adding geometric objects, scales, and themes to a base ggplot object
- Scatter plots in ggplot2 are created using geom_point(), with extensive customization options for aesthetics like color, size, shape, and transparency
- Advanced scatter plot techniques include faceting, trend lines, marginal distributions, and interactive annotations that transform basic visualizations into analytical tools
Basic Scatter Plot Construction
The fundamental ggplot2 scatter plot requires a dataset, aesthetic mappings, and a point geometry layer. Here’s the minimal implementation:
library(ggplot2)
# Create sample data
set.seed(123)
df <- data.frame(
x = rnorm(100, mean = 50, sd = 10),
y = rnorm(100, mean = 50, sd = 10)
)
# Basic scatter plot
ggplot(df, aes(x = x, y = y)) +
geom_point()
The ggplot() function initializes the plot object with data and aesthetic mappings. The aes() function maps variables to visual properties. The + operator adds layers—in this case, geom_point() renders the scatter plot.
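Because each + returns a new plot object, a plot can be stored in a variable and extended step by step. The sketch below assumes the same kind of simulated data as above; the names base_plot and scatter are illustrative only.

```r
library(ggplot2)

# Same kind of simulated data as in the example above
set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Step 1: data and aesthetic mappings only; printing this shows empty axes
base_plot <- ggplot(df, aes(x = x, y = y))

# Step 2: adding a layer returns a new object and leaves base_plot unchanged
scatter <- base_plot + geom_point()

# layer_data() returns the data frame that the first layer actually draws
head(layer_data(scatter, 1))
```

Keeping the base object around is convenient when several variants of a plot share the same data and mappings.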
Customizing Point Aesthetics
Control point appearance through size, color, shape, and transparency:
# Fixed aesthetics (applied to all points)
ggplot(df, aes(x = x, y = y)) +
geom_point(color = "steelblue", size = 3, shape = 17, alpha = 0.6)
# Variable aesthetics (mapped to data)
df$category <- sample(c("A", "B", "C"), 100, replace = TRUE)
df$value <- runif(100, 1, 10)
ggplot(df, aes(x = x, y = y, color = category, size = value)) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(1, 8))
Shape codes: 0-25 represent different shapes. Common choices: 16 (filled circle), 17 (filled triangle), 15 (filled square). Alpha values range from 0 (transparent) to 1 (opaque).
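To see what each shape code looks like, one option is to draw all 26 of them in a reference chart. This is a small sketch, not part of the original examples; the data frame shapes and its grid layout are made up for illustration. It also shows that shapes 21-25 take a separate fill color in addition to the outline color.

```r
library(ggplot2)

# One row per shape code, laid out on a grid with 7 columns
shapes <- data.frame(
  code = 0:25,
  col  = 0:25 %% 7,
  row  = -(0:25 %/% 7)
)

shape_chart <- ggplot(shapes, aes(x = col, y = row)) +
  # Shapes 21-25 use `color` for the outline and `fill` for the interior;
  # the other shapes ignore `fill`
  geom_point(aes(shape = code), size = 5, color = "black", fill = "steelblue") +
  # Print the code number just below each symbol
  geom_text(aes(label = code), nudge_y = -0.35, size = 3) +
  # Use the numeric codes directly instead of mapping them to a legend
  scale_shape_identity() +
  theme_void()

shape_chart
```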
Color Scales and Palettes
ggplot2 provides multiple color scale functions for different data types:
# Discrete colors
ggplot(df, aes(x = x, y = y, color = category)) +
geom_point(size = 3) +
scale_color_manual(values = c("A" = "#E41A1C", "B" = "#377EB8", "C" = "#4DAF4A"))
# Using ColorBrewer palettes
ggplot(df, aes(x = x, y = y, color = category)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1")
# Continuous color gradient
ggplot(df, aes(x = x, y = y, color = value)) +
geom_point(size = 3) +
scale_color_gradient(low = "yellow", high = "red")
# Viridis color scales (perceptually uniform)
ggplot(df, aes(x = x, y = y, color = value)) +
geom_point(size = 3) +
scale_color_viridis_c(option = "plasma")
Adding Trend Lines and Statistical Summaries
Overlay regression lines, smoothers, or confidence intervals:
# Linear regression line
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red")
# LOESS smoother
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "loess", span = 0.3)
# Multiple regression lines by group
ggplot(df, aes(x = x, y = y, color = category)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# Custom formula
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE)
The se parameter controls whether the confidence band around the fitted line is drawn. The span parameter in LOESS controls the degree of smoothing: smaller values follow the data more closely and produce a wigglier curve.
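To check what the confidence band contains, the values computed by geom_smooth() can be pulled out with layer_data(). The sketch below uses a hypothetical dataset with a built-in linear relationship so the fit is visible; the names p and fit_band are illustrative.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(x = rnorm(100, mean = 50, sd = 10))
df$y <- 0.5 * df$x + rnorm(100, sd = 5)   # hypothetical linear relationship

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

# Layer 2 is the smoother: y is the fitted line, ymin/ymax bound the ribbon
fit_band <- layer_data(p, 2)
head(fit_band[, c("x", "y", "ymin", "ymax")])
```

The level argument sets the confidence level of the band; 0.95 is the default and is spelled out here only for clarity.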
Faceting for Multi-Panel Plots
Split data into subplots based on categorical variables:
# Create richer dataset
df$group1 <- sample(c("Group1", "Group2"), 100, replace = TRUE)
df$group2 <- sample(c("Type_X", "Type_Y"), 100, replace = TRUE)
# Facet by one variable
ggplot(df, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ category, ncol = 2)
# Facet by two variables (grid)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
facet_grid(group1 ~ group2)
# Free scales for each facet
ggplot(df, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ category, scales = "free")
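Use scales = "free" when subgroups have different ranges, so that each panel gets its own axis limits. facet_grid() lays panels out in a matrix with one variable defining the rows and another the columns, while facet_wrap() wraps a one-dimensional sequence of panels into rows and columns.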
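Free scales matter most when groups differ by orders of magnitude. The following sketch uses a made-up dataset (df_ranges) in which the three categories have very different y ranges, and frees only the y axis with scales = "free_y" so the x axis stays comparable across panels.

```r
library(ggplot2)

# Hypothetical groups whose y ranges differ by an order of magnitude
set.seed(123)
df_ranges <- data.frame(
  category = rep(c("A", "B", "C"), each = 30),
  x = rnorm(90, mean = 50, sd = 10),
  y = c(rnorm(30, 5, 1), rnorm(30, 50, 10), rnorm(30, 500, 100))
)

# scales = "free_y" frees only the y axis; x stays shared across panels
p_free_y <- ggplot(df_ranges, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ category, scales = "free_y")

p_free_y
```

With fixed scales, groups A and B would be squashed against the bottom of their panels; the trade-off of free scales is that panel heights are no longer directly comparable.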
Handling Overplotting
When points overlap, use transparency, jittering, or 2D density representations:
# Create overlapping data
df_overlap <- data.frame(
x = rnorm(5000, mean = 50, sd = 5),
y = rnorm(5000, mean = 50, sd = 5)
)
# Alpha transparency
ggplot(df_overlap, aes(x = x, y = y)) +
geom_point(alpha = 0.1)
# Jittering (most effective when values are discrete or rounded)
ggplot(df_overlap, aes(x = x, y = y)) +
geom_jitter(width = 0.5, height = 0.5, alpha = 0.3)
# 2D density contours
ggplot(df_overlap, aes(x = x, y = y)) +
geom_density_2d() +
geom_point(alpha = 0.2)
# Hexagonal binning (requires the hexbin package)
ggplot(df_overlap, aes(x = x, y = y)) +
geom_hex(bins = 30) +
scale_fill_viridis_c()
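Jittering tends to help most when many observations share exactly the same coordinates. As a rough illustration, the sketch below uses hypothetical integer-valued data (df_discrete) and passes a seed to position_jitter() so the random offsets are the same on every render.

```r
library(ggplot2)

# Hypothetical survey-style data: integer scores, so many points coincide
set.seed(123)
df_discrete <- data.frame(
  x = sample(1:5, 500, replace = TRUE),
  y = sample(1:5, 500, replace = TRUE)
)

# At most 25 distinct positions exist, so geom_point() alone hides most rows
nrow(unique(df_discrete))

# A fixed seed in position_jitter() makes the random offsets reproducible
p_jitter <- ggplot(df_discrete, aes(x = x, y = y)) +
  geom_point(
    position = position_jitter(width = 0.2, height = 0.2, seed = 42),
    alpha = 0.4
  )

p_jitter
```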
Annotations and Labels
Add text, arrows, and reference lines:
# Text labels for specific points
df_labeled <- df[1:5, ]
df_labeled$label <- paste("Point", 1:5)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_text(data = df_labeled, aes(label = label),
vjust = -0.5, hjust = 0.5, size = 3)
# Using ggrepel to avoid label overlap
library(ggrepel)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_text_repel(data = df_labeled, aes(label = label))
# Reference lines
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_hline(yintercept = mean(df$y), linetype = "dashed", color = "red") +
geom_vline(xintercept = mean(df$x), linetype = "dashed", color = "blue") +
annotate("rect", xmin = 40, xmax = 60, ymin = 40, ymax = 60,
alpha = 0.2, fill = "yellow")
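The examples above cover text and reference lines; an arrow can be drawn in a similar way with annotate("segment") and the arrow argument. This is a minimal sketch on the same kind of simulated data, where the highlighted point (the largest y value) and the label offsets are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Point to highlight: the observation with the largest y value
top <- df[which.max(df$y), ]

p_arrow <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  # Arrow from an offset start position to the highlighted point
  annotate("segment",
           x = top$x + 8, y = top$y - 3, xend = top$x, yend = top$y,
           arrow = arrow(length = unit(0.2, "cm")), color = "red") +
  # Text placed just below the tail of the arrow
  annotate("text", x = top$x + 8, y = top$y - 3.5,
           label = "Largest y", vjust = 1, color = "red")

p_arrow
```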
Theme Customization
Control non-data plot elements:
ggplot(df, aes(x = x, y = y, color = category)) +
geom_point(size = 3) +
labs(
title = "Scatter Plot Analysis",
subtitle = "Sample data with categorical grouping",
x = "X Variable (units)",
y = "Y Variable (units)",
color = "Category"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "bottom",
panel.grid.minor = element_blank(),
axis.text = element_text(size = 10)
)
Built-in themes include theme_minimal(), theme_classic(), theme_bw(), and theme_dark(). The theme() function provides granular control over individual non-data elements such as titles, legends, grid lines, and axis text.
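When several plots should share the same look, a built-in theme and a set of overrides can be combined into one object and reused. The sketch below is one way to do that; my_theme is a hypothetical name and the specific overrides are only examples.

```r
library(ggplot2)

# Hypothetical house style: a built-in theme plus a few overrides
my_theme <- theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )

# The same theme object can be added to any plot
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(color = "Cylinders") +
  my_theme

p_themed
```

Calling theme_set(my_theme) would instead apply the theme to every subsequent plot in the session.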
Real-World Example: Correlation Analysis
A complete example using the built-in mtcars dataset:
library(ggplot2)
library(dplyr)
# Prepare data
mtcars_prep <- mtcars %>%
mutate(
cyl_factor = factor(cyl),
am_label = factor(am, labels = c("Automatic", "Manual"))
)
# Comprehensive scatter plot
ggplot(mtcars_prep, aes(x = wt, y = mpg)) +
geom_point(aes(color = cyl_factor, size = hp), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
facet_wrap(~ am_label) +
scale_color_manual(
values = c("4" = "#2E86AB", "6" = "#A23B72", "8" = "#F18F01"),
name = "Cylinders"
) +
scale_size_continuous(range = c(3, 10), name = "Horsepower") +
labs(
title = "Fuel Efficiency vs Vehicle Weight",
subtitle = "Grouped by transmission type and engine configuration",
x = "Weight (1000 lbs)",
y = "Miles per Gallon"
) +
theme_bw() +
theme(
legend.position = "right",
strip.background = element_rect(fill = "lightblue"),
plot.title = element_text(hjust = 0.5, face = "bold")
)
This example demonstrates multiple techniques: faceting by transmission type, color-coding by cylinder count, sizing by horsepower, and adding regression lines to identify trends within each transmission category.
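Since the plot is meant to support a correlation analysis, it can be useful to compute the correlation itself alongside it. The following base R sketch calculates the Pearson correlation between weight and fuel efficiency overall and within each transmission type; the variable names overall and by_am are illustrative.

```r
# Pearson correlation between weight and fuel efficiency, all cars
overall <- cor(mtcars$wt, mtcars$mpg)
round(overall, 2)

# Same statistic within each transmission type (0 = automatic, 1 = manual)
by_am <- sapply(split(mtcars, mtcars$am), function(d) cor(d$wt, d$mpg))
round(by_am, 2)
```

The overall value is strongly negative, which matches the downward-sloping regression lines in both facets.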