R ggplot2 - Histogram with Examples
Key Insights
- Histograms in ggplot2 use `geom_histogram()` to visualize continuous variable distributions through binned data, with automatic bin calculation or manual specification via `bins`, `binwidth`, or `breaks`
- The layered grammar of graphics allows precise control over aesthetics (fill, color, alpha), faceting for multi-group comparisons, and statistical transformations like density curves overlaid on histograms
- Production-ready histograms require attention to bin selection (Sturges, Freedman-Diaconis, or Scott rules), appropriate scaling (count vs. density), and visual refinement through themes and annotations
Basic Histogram Construction
The fundamental histogram in ggplot2 requires a dataset and a continuous variable mapped to the x-axis. The `geom_histogram()` function automatically bins the data and counts observations.
library(ggplot2)
# Create sample data
set.seed(123)
data <- data.frame(
  values = rnorm(1000, mean = 50, sd = 10)
)
# Basic histogram
ggplot(data, aes(x = values)) +
  geom_histogram()
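This produces a histogram with the default of 30 bins. ggplot2 also prints a message (not a warning) recommending that you pick a better value with binwidth. The message is intentional: 30 bins is rarely the best choice, so ggplot2 prompts you to choose a binning strategy explicitly.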
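To see what the default binning actually produced, the computed layer data can be inspected. The sketch below assumes `layer_data()` from ggplot2 and uses `bin_table` as an illustrative name; it rebuilds the sample data so it runs on its own.

```r
library(ggplot2)

# Same sample data as in the basic example
set.seed(123)
data <- data.frame(values = rnorm(1000, mean = 50, sd = 10))

p <- ggplot(data, aes(x = values)) +
  geom_histogram()

# layer_data() returns the data frame ggplot2 computed for a layer:
# one row per bin, with the bin edges (xmin, xmax) and the count
bin_table <- layer_data(p, 1)
head(bin_table[, c("xmin", "xmax", "count")])

# Number of bins actually drawn
nrow(bin_table)

# Every observation is counted in exactly one bin
sum(bin_table$count)
```

Checking the bin edges this way is a quick sanity test before deciding whether `bins`, `binwidth`, or `breaks` needs to be set by hand.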
Controlling Bins
Bin specification directly impacts histogram interpretation. Three primary methods control binning behavior.
# Method 1: Specify number of bins
ggplot(data, aes(x = values)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  labs(title = "50 Bins", y = "Frequency")
# Method 2: Specify bin width
ggplot(data, aes(x = values)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Bin Width = 2", y = "Frequency")
# Method 3: Specify exact breaks
ggplot(data, aes(x = values)) +
  geom_histogram(breaks = seq(20, 80, by = 5),
                 fill = "steelblue", color = "white") +
  labs(title = "Custom Breaks", y = "Frequency")
For data-driven bin selection, use the established rules that ship with base R (package grDevices):
# Sturges' formula
sturges_bins <- nclass.Sturges(data$values)
# Freedman-Diaconis rule
fd_bins <- nclass.FD(data$values)
# Scott's rule
scott_bins <- nclass.scott(data$values)
ggplot(data, aes(x = values)) +
  geom_histogram(bins = fd_bins, fill = "steelblue", color = "white") +
  labs(title = paste("Freedman-Diaconis:", fd_bins, "bins"))
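The three rules are short enough to write out by hand, which makes it clearer what each one responds to. The sketch below is a plain base R reconstruction under the usual textbook definitions; the `*_manual` and `*_width` names are illustrative. `nclass.FD()` rounds the data internally before taking the interquartile range, so the hand-written Freedman-Diaconis value may differ slightly from it.

```r
set.seed(123)
x <- rnorm(1000, mean = 50, sd = 10)
n <- length(x)

# Sturges: the bin count depends only on the sample size
sturges_manual <- ceiling(log2(n) + 1)

# Scott: bin width from the standard deviation
scott_width  <- 3.5 * sd(x) * n^(-1/3)
scott_manual <- ceiling(diff(range(x)) / scott_width)

# Freedman-Diaconis: bin width from the interquartile range,
# which makes it less sensitive to outliers than Scott's rule
fd_width  <- 2 * IQR(x) * n^(-1/3)
fd_manual <- ceiling(diff(range(x)) / fd_width)

# Hand-written values next to the built-in helpers
c(sturges = sturges_manual, scott = scott_manual, fd = fd_manual)
c(sturges = nclass.Sturges(x), scott = nclass.scott(x), fd = nclass.FD(x))
```

Scott and Freedman-Diaconis are bin width rules at heart, so passing `binwidth = fd_width` to `geom_histogram()` is a reasonable alternative to converting the width to a bin count first.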
Aesthetic Customization
Histograms support multiple aesthetic parameters for visual differentiation and clarity.
# Fill, color, and alpha
ggplot(data, aes(x = values)) +
  geom_histogram(bins = 30,
                 fill = "#2C3E50",
                 color = "#ECF0F1",
                 alpha = 0.8,
                 linewidth = 0.3) +
  theme_minimal() +
  labs(x = "Values", y = "Count")
# Conditional coloring based on value ranges
data$category <- cut(data$values,
                     breaks = c(-Inf, 40, 60, Inf),
                     labels = c("Low", "Medium", "High"))
# binwidth = 2 with boundary = 40 puts bin edges on the category cutoffs
# (40 and 60), so no bar mixes two categories
ggplot(data, aes(x = values, fill = category)) +
  geom_histogram(binwidth = 2, boundary = 40, color = "white", linewidth = 0.2) +
  scale_fill_manual(values = c("Low" = "#E74C3C",
                               "Medium" = "#F39C12",
                               "High" = "#27AE60")) +
  theme_minimal()
Density Scaling and Overlays
Converting histograms to density scale enables overlay with density curves and comparison across different sample sizes.
# Density histogram with density curve
ggplot(data, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "lightblue",
                 color = "white") +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(y = "Density") +
  theme_minimal()
# Multiple density curves
ggplot(data, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "gray80",
                 color = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(data$values),
                            sd = sd(data$values)),
                color = "red", linewidth = 1) +
  labs(title = "Histogram with Normal Distribution Overlay") +
  theme_minimal()
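What makes the density scale comparable to a density curve is that the bars' total area is 1. A minimal check, assuming `layer_data()` exposes the `count`, `density`, `xmin`, and `xmax` columns computed by the histogram stat (the names `bins` and `bar_width` are illustrative):

```r
library(ggplot2)

set.seed(123)
data <- data.frame(values = rnorm(1000, mean = 50, sd = 10))

p <- ggplot(data, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

bins <- layer_data(p, 1)
bar_width <- bins$xmax - bins$xmin

# Total bar area on the density scale
total_area <- sum(bins$density * bar_width)
total_area

# Density is the count divided by (sample size * bin width)
all.equal(bins$density, bins$count / (sum(bins$count) * bar_width))
```

Because the area is fixed at 1 regardless of sample size, two groups with very different numbers of observations can share a y-axis on the density scale.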
Faceted Histograms
Faceting enables multi-group comparison while maintaining consistent scales.
# Create multi-group data
set.seed(456)
multi_data <- data.frame(
  values = c(rnorm(500, mean = 45, sd = 8),
             rnorm(500, mean = 55, sd = 12),
             rnorm(500, mean = 50, sd = 10)),
  group = rep(c("A", "B", "C"), each = 500)
)
# Faceted by group
ggplot(multi_data, aes(x = values)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~group, ncol = 1) +
  theme_minimal() +
  labs(x = "Values", y = "Frequency")
# Overlapping histograms with transparency
ggplot(multi_data, aes(x = values, fill = group)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()
# Side-by-side comparison with dodge
ggplot(multi_data, aes(x = values, fill = group)) +
  geom_histogram(bins = 30, position = "dodge", color = "white") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
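The overlapping example sets `position = "identity"` explicitly because a histogram with a mapped `fill` stacks the groups by default. The sketch below compares the two by looking at where the bars start; `p_stack`, `p_identity`, and the `*_bars` names are illustrative, and the data is regenerated so the block runs on its own.

```r
library(ggplot2)

set.seed(456)
multi_data <- data.frame(
  values = c(rnorm(500, mean = 45, sd = 8),
             rnorm(500, mean = 55, sd = 12),
             rnorm(500, mean = 50, sd = 10)),
  group = rep(c("A", "B", "C"), each = 500)
)

base <- ggplot(multi_data, aes(x = values, fill = group))

# Default position: groups are stacked on top of each other
p_stack <- base + geom_histogram(bins = 30)
# Identity position: each group is drawn from the baseline
p_identity <- base + geom_histogram(bins = 30, position = "identity")

stack_bars    <- layer_data(p_stack, 1)
identity_bars <- layer_data(p_identity, 1)

# With stacking, some bars start above zero
any(stack_bars$ymin > 0)
# With position = "identity", every bar starts at zero
all(identity_bars$ymin == 0)
```

In a stacked histogram only the bottom group can be read directly against the y-axis, which is why the identity and dodge variants are often easier to interpret for group comparisons.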
Statistical Annotations
Adding statistical information enhances histogram interpretability.
# Calculate statistics
mean_val <- mean(data$values)
median_val <- median(data$values)
sd_val <- sd(data$values)
# Histogram with statistical lines
# xintercept is set outside aes() because it is a single constant, not a data column;
# inside aes() the line would be redrawn once per row of data
ggplot(data, aes(x = values)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  geom_vline(xintercept = mean_val,
             color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_val,
             color = "darkgreen", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean_val + 2, y = Inf,
           label = paste("Mean:", round(mean_val, 2)),
           vjust = 2, color = "red") +
  annotate("text", x = median_val - 2, y = Inf,
           label = paste("Median:", round(median_val, 2)),
           vjust = 4, color = "darkgreen") +
  theme_minimal()
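When the plot is faceted, each panel usually needs its own reference line. One way to sketch this is to summarize the statistic into a small data frame and hand it to `geom_vline()` through its `data` argument. The names `group_means`, `mean_value`, and `p_means` are illustrative, and the grouped data is regenerated so the block is self-contained.

```r
library(ggplot2)

set.seed(456)
multi_data <- data.frame(
  values = c(rnorm(500, mean = 45, sd = 8),
             rnorm(500, mean = 55, sd = 12),
             rnorm(500, mean = 50, sd = 10)),
  group = rep(c("A", "B", "C"), each = 500)
)

# One row per group holding that group's mean
group_means <- aggregate(values ~ group, data = multi_data, FUN = mean)
names(group_means)[2] <- "mean_value"

# The summary data frame carries the facet variable (group),
# so each panel gets only its own line
p_means <- ggplot(multi_data, aes(x = values)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  geom_vline(data = group_means, aes(xintercept = mean_value),
             color = "red", linetype = "dashed", linewidth = 1) +
  facet_wrap(~group, ncol = 1) +
  theme_minimal()
p_means
```

Here mapping `xintercept` inside `aes()` is appropriate, because the value comes from a column with one row per panel rather than from a single constant.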
Real-World Example: Distribution Analysis
A complete workflow using the diamonds dataset that ships with ggplot2.
# Load built-in dataset
data(diamonds, package = "ggplot2")
# Analyze price distribution
price_stats <- data.frame(
  metric = c("Mean", "Median", "SD"),
  value = c(mean(diamonds$price),
            median(diamonds$price),
            sd(diamonds$price))
)
# Comprehensive histogram
ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 50,
                 fill = "#3498DB",
                 color = "white",
                 alpha = 0.7) +
  geom_density(color = "#E74C3C", linewidth = 1.2) +
  geom_vline(xintercept = median(diamonds$price),
             color = "#2ECC71", linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(title = "Diamond Price Distribution",
       subtitle = paste("n =", nrow(diamonds), "| Median =",
                        scales::dollar(median(diamonds$price))),
       x = "Price", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(color = "gray40"))
# Price by cut quality
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(bins = 40, alpha = 0.8, color = "white", linewidth = 0.1) +
  facet_wrap(~cut, ncol = 1, scales = "free_y") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_fill_brewer(palette = "RdYlGn", direction = 1) +
  labs(title = "Price Distribution by Cut Quality",
       x = "Price", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))
Performance Considerations
For large datasets, keep the bin count modest and time the full render: ggplot2 bins and draws the data only when the plot is printed, not when the plot object is created.
# Large dataset simulation
set.seed(789)
large_data <- data.frame(values = rnorm(1000000))
# Use fewer bins for large datasets
# print() sits inside system.time() so the timing covers binning and drawing
system.time({
  p1 <- ggplot(large_data, aes(x = values)) +
    geom_histogram(bins = 50, fill = "steelblue")
  print(p1)
})
# Pre-calculate bins for reuse
breaks <- seq(min(large_data$values),
              max(large_data$values),
              length.out = 51)
p2 <- ggplot(large_data, aes(x = values)) +
  geom_histogram(breaks = breaks, fill = "steelblue")
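A further option, offered here as a sketch rather than a measured optimization, is to bin the data once with base R's `hist(plot = FALSE)` and give ggplot2 only the resulting counts. The plot object then holds 50 rows instead of a million. The names `binned` and `p3` are illustrative.

```r
library(ggplot2)

set.seed(789)
x <- rnorm(1000000)

# 51 break points give 50 bins spanning the full range of the data
breaks <- seq(min(x), max(x), length.out = 51)

# Bin once with base R; plot = FALSE returns the counts without drawing
h <- hist(x, breaks = breaks, plot = FALSE)

# One row per bin: left edge, right edge, and count
binned <- data.frame(
  xmin  = head(h$breaks, -1),
  xmax  = tail(h$breaks, -1),
  count = h$counts
)

# Draw the pre-binned counts as rectangles
p3 <- ggplot(binned) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = count),
            fill = "steelblue") +
  labs(x = "values", y = "count")
p3
```

The small `binned` table can be saved and reused across several plots or themes without repeating the binning step. Whether this is noticeably faster than `geom_histogram()` depends on the data size and graphics device, so it is worth timing both.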
Histograms in ggplot2 provide granular control over data visualization. Proper bin selection, aesthetic refinement, and statistical annotation transform raw distributions into actionable insights. The layered approach enables progressive enhancement from basic frequency plots to publication-ready figures with density overlays, faceting, and custom themes.