R ggplot2 - Scatter Plot with Examples

Key Insights

  • ggplot2 uses a layered grammar of graphics approach where you build plots by adding geometric objects, scales, and themes to a base ggplot object
  • Scatter plots in ggplot2 are created using geom_point(), with extensive customization options for aesthetics like color, size, shape, and transparency
  • Advanced scatter plot techniques include faceting, trend lines, overplotting remedies, and annotations that turn basic visualizations into analytical tools

Basic Scatter Plot Construction

The fundamental ggplot2 scatter plot requires a dataset, aesthetic mappings, and a point geometry layer. Here’s the minimal implementation:

library(ggplot2)

# Create sample data
set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Basic scatter plot
ggplot(df, aes(x = x, y = y)) +
  geom_point()

The ggplot() function initializes the plot object with data and aesthetic mappings. The aes() function maps variables to visual properties such as position, color, and size. The + operator adds layers; in this case, geom_point() renders the scatter plot.
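Because a plot is an ordinary R object, it can be stored and extended one layer at a time. The following is a minimal sketch of that idea; the names p_base and p_points are chosen here for illustration, and looking at the layers element is only an informal way to peek inside the object.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Data and mappings only: printing this would draw empty axes, no points
p_base <- ggplot(df, aes(x = x, y = y))

# Adding a geom returns a new plot object; p_base itself is unchanged
p_points <- p_base + geom_point()

length(p_base$layers)    # layers before adding the geom
length(p_points$layers)  # layers after adding the geom

# Printing the object is what triggers rendering
print(p_points)
```

Storing the base plot this way makes it easy to try several geoms or themes on the same data without repeating the ggplot() call.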

Customizing Point Aesthetics

Control point appearance through size, color, shape, and transparency:

# Fixed aesthetics (applied to all points)
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "steelblue", size = 3, shape = 17, alpha = 0.6)

# Variable aesthetics (mapped to data)
df$category <- sample(c("A", "B", "C"), 100, replace = TRUE)
df$value <- runif(100, 1, 10)

ggplot(df, aes(x = x, y = y, color = category, size = value)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(1, 8))

Shape codes 0-25 select the point symbol. Common choices are 16 (filled circle), 17 (filled triangle), and 15 (filled square). Alpha ranges from 0 (fully transparent) to 1 (fully opaque).
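Shapes 21 to 25 behave a little differently from the rest: they take a separate outline color and interior fill. The sketch below illustrates this with shape 21; the specific colors and the stroke width are arbitrary choices for the example.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# For shapes 21-25, `color` sets the outline, `fill` sets the interior,
# and `stroke` sets the outline width
p_filled <- ggplot(df, aes(x = x, y = y)) +
  geom_point(shape = 21, color = "black", fill = "steelblue",
             size = 3, stroke = 0.8, alpha = 0.7)

p_filled
```

With the other shape codes, fill is ignored and color alone determines how the point looks.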

Color Scales and Palettes

ggplot2 provides multiple color scale functions for different data types:

# Discrete colors
ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("A" = "#E41A1C", "B" = "#377EB8", "C" = "#4DAF4A"))

# Using ColorBrewer palettes
ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1")

# Continuous color gradient
ggplot(df, aes(x = x, y = y, color = value)) +
  geom_point(size = 3) +
  scale_color_gradient(low = "yellow", high = "red")

# Viridis color scales (perceptually uniform)
ggplot(df, aes(x = x, y = y, color = value)) +
  geom_point(size = 3) +
  scale_color_viridis_c(option = "plasma")
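
When a continuous variable has a meaningful midpoint, a diverging gradient may read better than a two-color one. This is a small sketch using scale_color_gradient2(); the deviation column is a hypothetical variable constructed only for this example.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Hypothetical signed variable: how far each y value sits from the mean
df$deviation <- df$y - mean(df$y)

# Diverging gradient centered on zero
p_div <- ggplot(df, aes(x = x, y = y, color = deviation)) +
  geom_point(size = 3) +
  scale_color_gradient2(low = "#2166AC", mid = "grey90", high = "#B2182B",
                        midpoint = 0)

p_div
```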

Adding Trend Lines and Statistical Summaries

Overlay regression lines, smoothers, or confidence intervals:

# Linear regression line
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red")

# LOESS smoother
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.3)

# Multiple regression lines by group
ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Custom formula
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE)

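The se argument toggles the confidence band around the fitted line (95% by default, adjustable with level). For LOESS, span sets the fraction of points used in each local fit: smaller values follow the data more closely and produce a less smooth curve.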
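To see the effect of span directly, two LOESS curves can be overlaid on the same points. This is an illustrative sketch; the span values 0.3 and 0.9 and the colors are arbitrary, and formula = y ~ x is spelled out only to make the fitted model explicit.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

p_span <- ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.4) +
  # Small span: each local fit uses fewer points, so the curve is wigglier
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3,
              se = FALSE, color = "red") +
  # Large span: each local fit uses most of the data, so the curve is smoother
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9,
              se = FALSE, color = "blue")

p_span
```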

Faceting for Multi-Panel Plots

Split data into subplots based on categorical variables:

# Create richer dataset
df$group1 <- sample(c("Group1", "Group2"), 100, replace = TRUE)
df$group2 <- sample(c("Type_X", "Type_Y"), 100, replace = TRUE)

# Facet by one variable
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ category, ncol = 2)

# Facet by two variables (grid)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(group1 ~ group2)

# Free scales for each facet
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ category, scales = "free")

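Use scales = "free" when subgroups span different ranges, keeping in mind that panels with different axes are harder to compare directly. facet_grid() lays panels out as a rows-by-columns matrix defined by two variables, while facet_wrap() wraps a one-dimensional sequence of panels into rows and columns.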
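Two variations that often come up are freeing only one axis and labeling grid panels with the variable name. The sketch below rebuilds the sample data so it runs on its own; the plot names are chosen for the example.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10),
  category = sample(c("A", "B", "C"), 100, replace = TRUE),
  group1 = sample(c("Group1", "Group2"), 100, replace = TRUE),
  group2 = sample(c("Type_X", "Type_Y"), 100, replace = TRUE)
)

# Free only the y axis; the x axis stays shared across panels
p_free_y <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ category, scales = "free_y")

# label_both shows "variable: value" in each strip
p_labeled <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(group1 ~ group2, labeller = label_both)

p_free_y
p_labeled
```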

Handling Overplotting

When points overlap, use transparency, jittering, or 2D density representations:

# Create overlapping data
df_overlap <- data.frame(
  x = rnorm(5000, mean = 50, sd = 5),
  y = rnorm(5000, mean = 50, sd = 5)
)

# Alpha transparency
ggplot(df_overlap, aes(x = x, y = y)) +
  geom_point(alpha = 0.1)

# Jittering (most useful when values are discrete or rounded)
ggplot(df_overlap, aes(x = x, y = y)) +
  geom_jitter(width = 0.5, height = 0.5, alpha = 0.3)

# 2D density contours
ggplot(df_overlap, aes(x = x, y = y)) +
  geom_density_2d() +
  geom_point(alpha = 0.2)

# Hexagonal binning (requires the hexbin package)
ggplot(df_overlap, aes(x = x, y = y)) +
  geom_hex(bins = 30) +
  scale_fill_viridis_c()
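
Jittering tends to pay off most when many points share exactly the same coordinates, as with rounded or integer data. The sketch below uses a hypothetical data frame, df_round, to compare jittering with geom_count(), which sizes each point by the number of observations at that location.

```r
library(ggplot2)

set.seed(42)
# Hypothetical rounded data: 500 observations on a 5 x 5 integer grid
df_round <- data.frame(
  x = sample(1:5, 500, replace = TRUE),
  y = sample(1:5, 500, replace = TRUE)
)

# Small random offsets separate points that would otherwise stack
p_jitter <- ggplot(df_round, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4)

# One point per unique location, sized by how many observations fall there
p_count <- ggplot(df_round, aes(x = x, y = y)) +
  geom_count()

p_jitter
p_count
```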

Annotations and Labels

Add text, arrows, and reference lines:

# Text labels for specific points
df_labeled <- df[1:5, ]
df_labeled$label <- paste("Point", 1:5)

ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_text(data = df_labeled, aes(label = label), 
            vjust = -0.5, hjust = 0.5, size = 3)

# Using ggrepel to avoid label overlap
library(ggrepel)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_text_repel(data = df_labeled, aes(label = label))

# Reference lines
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_hline(yintercept = mean(df$y), linetype = "dashed", color = "red") +
  geom_vline(xintercept = mean(df$x), linetype = "dashed", color = "blue") +
  annotate("rect", xmin = 40, xmax = 60, ymin = 40, ymax = 60, 
           alpha = 0.2, fill = "yellow")
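
Arrows can be drawn with annotate("segment", ...) plus an arrow() specification. The sketch below points at the observation with the largest y value; the offsets and the label text are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10)
)

# Row with the largest y value
top <- df[which.max(df$y), ]

p_arrow <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  # Segment ending just short of the point, with an arrowhead at the end
  annotate("segment",
           x = top$x + 8, y = top$y + 2,
           xend = top$x + 0.8, yend = top$y + 0.2,
           arrow = arrow(length = unit(0.25, "cm"))) +
  # Label placed at the tail of the arrow
  annotate("text", x = top$x + 8.5, y = top$y + 2,
           label = "Largest y", hjust = 0, size = 3.5)

p_arrow
```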

Theme Customization

Control non-data plot elements:

ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot Analysis",
    subtitle = "Sample data with categorical grouping",
    x = "X Variable (units)",
    y = "Y Variable (units)",
    color = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 10)
  )

Built-in themes include theme_minimal(), theme_classic(), theme_bw(), and theme_dark(). The theme() function gives granular control over individual non-data elements such as titles, legends, gridlines, and axis text.
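When several plots should share one look, the theme can be stored as an object and added like any other component. This is a sketch of that pattern; theme_report is a hypothetical name, and the settings inside it are only examples.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = rnorm(100, mean = 50, sd = 10),
  category = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Hypothetical house style: a built-in theme plus a few overrides
theme_report <- theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

# The same theme object is reused across plots
p_a <- ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point() +
  labs(title = "Plot A") +
  theme_report

p_b <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Plot B") +
  theme_report

p_a
p_b
```

Calling theme_set(theme_report) instead would make it the default for every plot in the session.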

Real-World Example: Correlation Analysis

A complete example using the built-in mtcars dataset:

library(ggplot2)
library(dplyr)

# Prepare data
mtcars_prep <- mtcars %>%
  mutate(
    cyl_factor = factor(cyl),
    am_label = factor(am, labels = c("Automatic", "Manual"))
  )

# Comprehensive scatter plot
ggplot(mtcars_prep, aes(x = wt, y = mpg)) +
  geom_point(aes(color = cyl_factor, size = hp), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  facet_wrap(~ am_label) +
  scale_color_manual(
    values = c("4" = "#2E86AB", "6" = "#A23B72", "8" = "#F18F01"),
    name = "Cylinders"
  ) +
  scale_size_continuous(range = c(3, 10), name = "Horsepower") +
  labs(
    title = "Fuel Efficiency vs Vehicle Weight",
    subtitle = "Grouped by transmission type and engine configuration",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_bw() +
  theme(
    legend.position = "right",
    strip.background = element_rect(fill = "lightblue"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

This example demonstrates multiple techniques: faceting by transmission type, color-coding by cylinder count, sizing by horsepower, and adding regression lines to identify trends within each transmission category.
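
The plot shows the trend visually; the numbers behind it can be checked with a few lines of base R. The sketch below computes the Pearson correlation between weight and fuel efficiency, overall and per transmission type, along with the slope of a separate linear fit for each group. The variable names are chosen for this example, and base R is used here rather than dplyr only to keep the snippet dependency-free.

```r
# Overall Pearson correlation between weight and mpg
overall_r <- cor(mtcars$wt, mtcars$mpg)

# Same labels as in the plot: 0 = Automatic, 1 = Manual
am_label <- factor(mtcars$am, labels = c("Automatic", "Manual"))
by_am <- split(mtcars, am_label)

# Correlation within each transmission group
r_by_am <- sapply(by_am, function(d) cor(d$wt, d$mpg))

# Slope of mpg on wt within each group, matching the per-panel lm lines
slope_by_am <- sapply(by_am, function(d) coef(lm(mpg ~ wt, data = d))[["wt"]])

round(overall_r, 3)
round(r_by_am, 3)
round(slope_by_am, 2)
```

A negative correlation and a negative slope in both groups would agree with the downward-sloping regression lines in each facet.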
