How to Use Scale Functions in ggplot2
Key Insights
- Scales in ggplot2 control the mapping between your data values and visual properties. Every aesthetic (position, color, size) has an associated scale that can be customized using the scale_<aesthetic>_<type>() pattern.
- Position scales control axis behavior, including limits, breaks, and transformations, while color/fill scales determine palette choices for both continuous and categorical data.
- The scales package provides essential formatting functions for labels, and colorblind-friendly palettes like viridis should be your default for accessibility.
Understanding ggplot2 Scales
Scales are the bridge between your data and what appears on your plot. Every time you map a variable to an aesthetic—whether that’s position, color, size, or shape—ggplot2 creates a scale to handle that mapping. Understanding scales transforms you from someone who makes plots into someone who crafts precise, publication-ready visualizations.
The naming convention is straightforward: scale_<aesthetic>_<type>(). The aesthetic is what you’re controlling (x, y, color, fill, size, etc.), and the type indicates how you’re controlling it (continuous, discrete, manual, gradient, etc.). The example below builds the same boxplot twice, once with the default scales and once with customized ones:
library(ggplot2)
library(dplyr)
# Create sample data (seeded so the random draws are reproducible)
set.seed(42)
df <- data.frame(
category = rep(c("A", "B", "C"), each = 30),
value = c(rnorm(30, 10, 2), rnorm(30, 15, 2), rnorm(30, 12, 2)),
size_var = runif(90, 1, 10)
)
# Default scales
p1 <- ggplot(df, aes(x = category, y = value, fill = category)) +
geom_boxplot() +
ggtitle("Default Scales")
# Customized scales
p2 <- ggplot(df, aes(x = category, y = value, fill = category)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0, 20, 2.5), limits = c(0, 20)) +
scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
ggtitle("Customized Scales")
library(patchwork)
p1 | p2
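The difference is immediately apparent. The customized plot pins the y-axis to a fixed range of 0 to 20 with ticks every 2.5 units and replaces the default fill palette with three hand-picked colors.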
Position Scales: Controlling Your Axes
Position scales control the x and y axes. For continuous data, use scale_x_continuous() and scale_y_continuous(). For categorical data, use scale_x_discrete() and scale_y_discrete().
The most common parameters you’ll adjust are limits, breaks, and labels. Limits define the range, breaks determine where tick marks appear, and labels control what text shows up at those ticks.
# Sample sales data
sales <- data.frame(
month = 1:12,
revenue = c(45000, 52000, 48000, 61000, 73000, 89000,
95000, 88000, 79000, 71000, 84000, 102000)
)
ggplot(sales, aes(x = month, y = revenue)) +
geom_line(linewidth = 1) +
geom_point(size = 3) +
scale_x_continuous(
breaks = 1:12,
labels = month.abb
) +
scale_y_continuous(
limits = c(0, 110000),
breaks = seq(0, 110000, 20000),
labels = scales::dollar_format(scale = 1e-3, suffix = "K")
) +
labs(title = "Monthly Revenue", x = "Month", y = "Revenue")
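One consequence of setting limits on a scale is worth checking before relying on it: values outside the range are replaced with NA before the layer is drawn, whereas coord_cartesian() only zooms the view. The sketch below uses a small made-up data frame (demo is an illustrative name, not part of the sales example) and layer_data() to count how many points survive each approach.

```r
library(ggplot2)

# Hypothetical toy data: two of the five y values exceed 35
demo <- data.frame(x = 1:5, y = c(10, 20, 30, 40, 50))

# Scale limits: out-of-range values are censored to NA
p_limits <- ggplot(demo, aes(x, y)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 35))

# Coordinate zoom: every value is kept, only the visible window changes
p_zoom <- ggplot(demo, aes(x, y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 35))

# Count the y values that are still usable in each built layer
kept_limits <- sum(!is.na(suppressWarnings(layer_data(p_limits))$y))  # 3
kept_zoom <- sum(!is.na(layer_data(p_zoom)$y))                        # 5
```

This matters most for geoms that compute summaries (boxplots, smoothers), because censored values are excluded from the calculation as well as from the drawing.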
Transformations are powerful for data with exponential relationships or wide ranges. Use the trans parameter (renamed transform in ggplot2 3.5.0, where trans still works but is deprecated):
# Exponential growth data
growth <- data.frame(
time = 1:20,
population = 100 * 1.15^(1:20)
)
ggplot(growth, aes(x = time, y = population)) +
geom_line() +
scale_y_continuous(
trans = "log10",
breaks = c(100, 500, 1000, 5000, 10000),
labels = scales::comma
) +
labs(title = "Population Growth (Log Scale)")
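To see why exponential growth plots as a straight line on a log scale, it can help to run the arithmetic directly. The sketch below reuses the growth series from the example above and the log10 transformation object from the scales package; the variable names are illustrative.

```r
# Same growth series as in the example above: 15% growth per time step
time <- 1:20
population <- 100 * 1.15^time

# On a log10 scale every step adds the same constant, log10(1.15),
# which is why the line is straight
log_steps <- diff(log10(population))

# The transformation object maps data values to positions on the axis
log_tr <- scales::log10_trans()
tick_positions <- log_tr$transform(c(100, 1000, 10000))  # 2, 3, 4
```

Breaks that differ by a constant factor (100, 1000, 10000) end up evenly spaced, while breaks such as 500 and 5000 sit closer to the next power of ten than to the previous one.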
For discrete scales, you control the order and labels of categories:
df_discrete <- data.frame(
priority = factor(c("Low", "Medium", "High", "Critical"),
levels = c("Low", "Medium", "High", "Critical")),
count = c(45, 78, 34, 12)
)
ggplot(df_discrete, aes(x = priority, y = count)) +
geom_col(fill = "steelblue") +
scale_x_discrete(
labels = c("Low\nPriority", "Medium\nPriority",
"High\nPriority", "Critical\nIssue")
)
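The example above sets the category order through factor levels. A possible alternative, sketched here under the assumption that the column is a plain character vector, is to set the order with the limits argument of scale_x_discrete() and to name the labels so they are matched by category rather than by position.

```r
library(ggplot2)

# Same counts as above, but priority is left as a character column
df_discrete <- data.frame(
  priority = c("Low", "Medium", "High", "Critical"),
  count = c(45, 78, 34, 12)
)

p_ordered <- ggplot(df_discrete, aes(x = priority, y = count)) +
  geom_col(fill = "steelblue") +
  scale_x_discrete(
    # limits fixes the left-to-right order of the categories
    limits = c("Critical", "High", "Medium", "Low"),
    # named labels are matched to categories by name, not by position
    labels = c(Low = "Low\nPriority", Medium = "Medium\nPriority",
               High = "High\nPriority", Critical = "Critical\nIssue")
  )

# Axis position assigned to the "Critical" bar (the one with count 12)
bars <- layer_data(p_ordered)
critical_x <- as.numeric(bars$x[bars$y == 12])  # 1, the leftmost slot
```

Named labels are a little more verbose, but they keep working if the category order changes later.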
Color and Fill Scales: Making Data Visible
Color scales are where many visualizations succeed or fail. The right color choice makes patterns obvious; the wrong choice obscures them or excludes colorblind viewers.
For categorical data with specific color requirements, use scale_color_manual() or scale_fill_manual():
regions <- data.frame(
region = c("North", "South", "East", "West"),
sales = c(234000, 189000, 267000, 198000)
)
ggplot(regions, aes(x = region, y = sales, fill = region)) +
geom_col() +
scale_fill_manual(
values = c(
"North" = "#2E86AB",
"South" = "#A23B72",
"East" = "#F18F01",
"West" = "#C73E1D"
)
) +
theme(legend.position = "none")
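Naming the values vector, as above, ties each color to a category regardless of the order in which the colors are written. The sketch below lists the same colors in a deliberately different order and uses layer_data() to confirm that the East bar still receives its color; region_colors is an illustrative name.

```r
library(ggplot2)

regions <- data.frame(
  region = c("North", "South", "East", "West"),
  sales = c(234000, 189000, 267000, 198000)
)

# Same colors as above, deliberately listed in a different order
region_colors <- c(
  "West" = "#C73E1D",
  "East" = "#F18F01",
  "South" = "#A23B72",
  "North" = "#2E86AB"
)

p_named <- ggplot(regions, aes(x = region, y = sales, fill = region)) +
  geom_col() +
  scale_fill_manual(values = region_colors)

# Look up the fill assigned to the East bar (sales = 267000)
bars <- layer_data(p_named)
east_fill <- bars$fill[bars$y == 267000]  # "#F18F01"
```

With an unnamed vector, colors are assigned in the order of the scale's categories (alphabetical for character data), which is easy to get wrong when categories are added or removed.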
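For continuous data, gradient scales work well. The example below uses scale_fill_gradient2(), which builds a diverging gradient around a chosen midpoint: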
# Heatmap data
heatmap_df <- expand.grid(
x = 1:10,
y = 1:10
) %>%
mutate(value = x * y + rnorm(100, 0, 5))
ggplot(heatmap_df, aes(x = x, y = y, fill = value)) +
geom_tile() +
scale_fill_gradient2(
low = "#2166AC",
mid = "#F7F7F7",
high = "#B2182B",
midpoint = 50
)
The ColorBrewer palettes are excellent for categorical data:
iris_summary <- iris %>%
group_by(Species) %>%
summarise(avg_sepal = mean(Sepal.Length))
ggplot(iris_summary, aes(x = Species, y = avg_sepal, fill = Species)) +
geom_col() +
scale_fill_brewer(palette = "Set2")
But the viridis scales should be your default for accessibility:
ggplot(heatmap_df, aes(x = x, y = y, fill = value)) +
geom_tile() +
scale_fill_viridis_c(option = "plasma") +
labs(title = "Colorblind-Friendly Heatmap")
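The heatmap above uses the continuous variant, scale_fill_viridis_c(). For categorical data the discrete variant is scale_fill_viridis_d(). A minimal sketch using the built-in iris data:

```r
library(ggplot2)

# Discrete viridis palette: one color per species
p_viridis_d <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  scale_fill_viridis_d()

# Collect the distinct fill colors that were assigned to the boxes
species_fills <- unique(layer_data(p_viridis_d)$fill)
```

The _c and _d suffixes follow the same pattern for the color aesthetic (scale_color_viridis_c() and scale_color_viridis_d()).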
Size, Shape, and Alpha Scales
These scales add additional dimensions to your visualizations. Use them wisely—too many aesthetic mappings create cluttered plots.
# Multi-dimensional scatter plot
scatter_data <- data.frame(
x = rnorm(100),
y = rnorm(100),
category = sample(c("Type A", "Type B", "Type C"), 100, replace = TRUE),
importance = runif(100, 1, 10),
confidence = runif(100, 0.3, 1)
)
ggplot(scatter_data, aes(x = x, y = y)) +
geom_point(aes(size = importance, shape = category, alpha = confidence)) +
scale_size_continuous(range = c(2, 10), name = "Importance") +
scale_shape_manual(values = c(16, 17, 15), name = "Type") +
scale_alpha_continuous(range = c(0.3, 1), name = "Confidence") +
theme_minimal()
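For size scales specifically, use scale_size_area() when the variable represents counts or magnitudes. It maps a value of 0 to a point of size 0, so each point’s area is directly proportional to its value. The default scale_size() also scales area rather than radius, but it rescales the data onto its range argument, so the smallest value still gets a visible point and the areas are not strictly proportional to the values: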
city_data <- data.frame(
city = c("NYC", "LA", "Chicago", "Houston"),
population = c(8.3, 4.0, 2.7, 2.3) * 1e6,
x = c(1, 2, 3, 4),
y = c(1, 1, 1, 1)
)
ggplot(city_data, aes(x = x, y = y, size = population)) +
geom_point(alpha = 0.6) +
scale_size_area(max_size = 30, labels = scales::comma) +
labs(size = "Population")
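The proportionality can be checked numerically. The sketch below uses made-up values in the ratio 1 : 4 : 9 (bubbles is an illustrative name) and reads the computed point sizes back with layer_data(). Because point size is a linear dimension, area grows with the square of size.

```r
library(ggplot2)

# Hypothetical values chosen so the ratios are easy to read: 1 : 4 : 9
bubbles <- data.frame(x = 1:3, y = 1, value = c(1, 4, 9))

p_area <- ggplot(bubbles, aes(x = x, y = y, size = value)) +
  geom_point() +
  scale_size_area(max_size = 30)

# Sizes computed by the scale; the largest value gets max_size
sizes <- layer_data(p_area)$size

# Squared sizes (areas) relative to the largest point
area_ratio <- sizes^2 / max(sizes)^2  # 1/9, 4/9, 1
```

The area ratios reproduce the ratios of the underlying values, which is the property that makes bubble sizes comparable by eye.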
Advanced Scale Techniques
The scales package provides formatting functions that make your axes publication-ready:
library(scales)
# Financial time series
dates <- seq.Date(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "month")
stock_data <- data.frame(
date = dates,
price = 100 + cumsum(rnorm(12, 0, 5)),
volume = sample(1e6:5e6, 12)
)
ggplot(stock_data, aes(x = date, y = price)) +
geom_line() +
scale_x_date(
date_breaks = "2 months",
date_labels = "%b\n%Y"
) +
scale_y_continuous(
labels = dollar_format(),
breaks = seq(80, 120, 10)
) +
labs(title = "Stock Price Over Time")
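The formatting helpers are ordinary functions, so you can try them on a few numbers before wiring them into a scale. A short sketch using the helpers that appear in this article (recent scales releases also offer the same behavior under label_dollar(), label_comma(), and label_percent()):

```r
library(scales)

# dollar_format() returns a function; call it on the values to format
dollar_labels <- dollar_format()(c(1000, 2500))         # "$1,000" "$2,500"

# comma() and percent() format a numeric vector directly
comma_labels <- comma(c(1234567, 89000))                # "1,234,567" "89,000"
percent_labels <- percent(c(0.25, 0.5), accuracy = 1)   # "25%" "50%"
```

Passing the function itself (for example labels = scales::comma, without parentheses) lets ggplot2 call it on whatever breaks the scale ends up using.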
Reverse scales are useful for rankings or when convention dictates (like depth below surface):
ranking_data <- data.frame(
team = paste("Team", LETTERS[1:10]),
rank = 1:10,
score = seq(95, 50, length.out = 10)
)
ggplot(ranking_data, aes(x = reorder(team, rank), y = rank)) +
geom_col(aes(fill = score)) +
scale_y_reverse() +
scale_fill_viridis_c(option = "magma") +
coord_flip() +
labs(x = NULL, y = "Rank")
Best Practices and Common Pitfalls
Always consider colorblind accessibility. Roughly 8% of men and 0.5% of women have some form of color vision deficiency. Use viridis palettes or test your colors with tools like the colorBlindness package.
Don’t truncate y-axes unless you have a good reason. Starting a bar chart at a non-zero value exaggerates differences and misleads viewers. If you must truncate, make it obvious:
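# Bad practice: zooming in so the bars no longer start at zero
# (coord_cartesian() is used here because scale limits of c(180000, 270000)
# would drop the bars entirely: their base at 0 falls outside that range)
bad_plot <- ggplot(regions, aes(x = region, y = sales)) +
geom_col() +
coord_cartesian(ylim = c(180000, 270000)) +
labs(title = "Misleading: Truncated Y-axis")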
# Better practice
good_plot <- ggplot(regions, aes(x = region, y = sales)) +
geom_col() +
scale_y_continuous(limits = c(0, 300000), labels = scales::comma) +
labs(title = "Clear: Full Y-axis")
Match your scale to your data type. Continuous data gets continuous scales, discrete data gets discrete scales. Forcing a continuous variable into discrete bins loses information unless that’s specifically your goal.
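A common way to run into a mismatch is a category stored as a number. The sketch below uses the built-in mtcars data, where cyl is numeric: pairing it with a discrete scale fails when the plot is built, while wrapping it in factor() states the intent explicitly.

```r
library(ggplot2)

# cyl is stored as a number, so ggplot2 treats it as continuous by default
p_mismatch <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")  # a discrete scale

# Building the plot fails because a continuous value meets a discrete scale;
# capture the error message instead of stopping
mismatch_error <- tryCatch(
  {
    ggplot_build(p_mismatch)
    NULL
  },
  error = function(e) conditionMessage(e)
)

# Converting to a factor lets the discrete scale apply
p_match <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Set2")

# One color per cylinder count (4, 6, 8)
n_colors <- length(unique(layer_data(p_match)$colour))
```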
Use consistent scales when comparing plots. If you’re showing multiple related visualizations, keep the scales identical so viewers can make direct comparisons.
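One way to keep scales identical is to define the scale once and add it to every related plot. The sketch below assumes two made-up quarterly data frames (q1 and q2) and a hypothetical helper named shared_y_scale(); none of these names come from the examples above.

```r
library(ggplot2)

# Hypothetical helper: returns the y scale every related plot should use
shared_y_scale <- function() {
  scale_y_continuous(limits = c(0, 300000), labels = scales::comma)
}

# Made-up data for two periods with quite different magnitudes
q1 <- data.frame(region = c("North", "South"), sales = c(234000, 189000))
q2 <- data.frame(region = c("North", "South"), sales = c(120000, 95000))

p_q1 <- ggplot(q1, aes(x = region, y = sales)) + geom_col() + shared_y_scale()
p_q2 <- ggplot(q2, aes(x = region, y = sales)) + geom_col() + shared_y_scale()

# Both plots report the same y limits, so bar heights are directly comparable
limits_q1 <- layer_scales(p_q1)$y$get_limits()
limits_q2 <- layer_scales(p_q2)$y$get_limits()
```

Without the shared scale, each plot would fit its y-axis to its own data, and the smaller second-period bars would look as tall as the first-period ones.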
Mastering scales gives you precise control over every aspect of your visualizations. Start with sensible defaults like viridis colors and full y-axes, then customize deliberately to highlight the story your data tells.