How to Create a Correlation Matrix in ggplot2

Key Insights

  • Correlation matrices reveal linear relationships between multiple variables simultaneously, making them essential for exploratory data analysis and feature selection in machine learning pipelines.
  • ggplot2 requires correlation data in long format rather than the wide matrix format produced by cor(), necessitating a reshape step using pivot_longer() or melt().
  • Strategic use of geom_tile(), diverging color scales, and triangle filtering transforms raw correlation coefficients into publication-ready visualizations that highlight meaningful patterns.

Introduction & Use Cases

Correlation matrices are workhorses of exploratory data analysis. They provide an immediate visual summary of linear relationships across multiple variables, helping you identify multicollinearity before building regression models, discover unexpected associations in your data, or select features for machine learning algorithms.

While the corrplot package (a CRAN add-on, not part of base R) offers quick solutions, ggplot2 gives you finer-grained customization for publication-ready graphics. You get complete control over colors, themes, annotations, and layout, which is essential when a visualization must match your organization’s style guide or a journal’s requirements. Because the plot is built from layers, the code is also easier to maintain and modify.

Preparing Your Data

Correlation matrices work exclusively with numeric data. Your first step is selecting appropriate variables and handling any data quality issues that could distort your results.

Start by loading necessary packages and examining your data structure:

library(ggplot2)
library(tidyr)
library(dplyr)

# Load example dataset
data(mtcars)

# Inspect structure
str(mtcars)

Select only numeric columns that make sense to correlate. In mtcars, every column is stored as a number, but some are really categorical codes (vs and am are 0/1 indicators) or small counts (gear and carb), so the example below leaves them out:

# Select relevant numeric columns
numeric_data <- mtcars %>%
  select(mpg, cyl, disp, hp, drat, wt, qsec)

# Check for missing values
sum(is.na(numeric_data))

# If missing values exist, decide on handling strategy
# Option 1: Remove rows with any NA
numeric_data_complete <- na.omit(numeric_data)

# Option 2: Use pairwise complete observations in cor()
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
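To see how the two strategies differ, the sketch below blanks out a few values in a copy of the data and compares the results. The NA positions are arbitrary and chosen purely for illustration; mtcars itself has no missing values.

```r
# Copy the numeric subset and introduce a few artificial NAs (illustration only)
demo <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
demo$mpg[c(1, 5)] <- NA
demo$wt[10] <- NA

# Option 1: listwise deletion drops every row that contains any NA
demo_complete <- na.omit(demo)
nrow(demo)           # 32 rows before
nrow(demo_complete)  # 29 rows after (rows 1, 5, and 10 removed)
cor_listwise <- cor(demo_complete)

# Option 2: pairwise deletion uses every row available for each pair
cor_pairwise <- cor(demo, use = "pairwise.complete.obs")

# hp and disp have no NAs, so pairwise deletion uses all 32 rows for this
# pair while listwise deletion uses only 29; the two values can differ
cor_listwise["hp", "disp"]
cor_pairwise["hp", "disp"]

# Without a `use` argument, NAs propagate to the affected cells
cor(demo)["mpg", "wt"]  # NA
```

Pairwise deletion keeps more data, but each cell may then rest on a different set of rows, and the resulting matrix is not guaranteed to be positive semi-definite. That matters if you pass it on to methods that assume a valid correlation matrix.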

For datasets without missing values, computing the correlation matrix is straightforward:

# Compute correlation matrix
cor_matrix <- cor(numeric_data)

# Preview the matrix
print(round(cor_matrix, 2))

This produces a symmetric matrix where each cell represents the Pearson correlation coefficient between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).
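The properties described above are easy to confirm before plotting. The following is a minimal sketch using base R only; the Spearman line is an optional addition not covered in the text, shown here because cor() accepts a method argument for rank-based correlation.

```r
# Same numeric subset as above
numeric_data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(numeric_data)  # method = "pearson" is the default

# Structural checks: symmetric, ones on the diagonal, values within [-1, 1]
isSymmetric(cor_matrix)
all(abs(diag(cor_matrix) - 1) < 1e-12)
range(cor_matrix)

# Rank-based alternative for monotonic but non-linear relationships
cor_spearman <- cor(numeric_data, method = "spearman")
round(cor_spearman["mpg", "hp"], 2)
```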

Reshaping Data for ggplot2

ggplot2 follows tidy data principles and expects a long-format data frame in which each row is one observation, here one pair of variables and its correlation. The output of cor() is a wide matrix with variables as both rows and columns, so you need to reshape it.

The transformation converts your matrix into three columns: the first variable, the second variable, and their correlation value:

# Convert matrix to data frame with row names as a column
cor_df <- as.data.frame(cor_matrix)
cor_df$var1 <- rownames(cor_df)

# Reshape to long format using tidyr
cor_long <- cor_df %>%
  pivot_longer(cols = -var1, 
               names_to = "var2", 
               values_to = "correlation")

# Preview transformed data
head(cor_long)

Alternatively, using reshape2::melt():

library(reshape2)

# Melt the correlation matrix
cor_long <- melt(cor_matrix)

# Rename columns for clarity
colnames(cor_long) <- c("var1", "var2", "correlation")

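Both approaches produce the same three columns, though the objects are not strictly identical: pivot_longer() returns a tibble with var1 and var2 as character columns, ordered row by row, while melt() returns a data frame with them as factors, ordered column by column. The plots below work with either. Choose based on your existing dependencies: tidyr if you’re already in the tidyverse ecosystem, reshape2 if your project already relies on it (note that reshape2 is retired and receives only maintenance fixes).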
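If you would rather avoid an extra dependency for this step, the same long format can be built in base R. This is a sketch of an alternative not mentioned above; the name cor_long_base is made up for illustration.

```r
numeric_data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(numeric_data)

# as.table() + as.data.frame() unrolls the matrix into one row per cell
cor_long_base <- as.data.frame(as.table(cor_matrix),
                               responseName = "correlation",
                               stringsAsFactors = FALSE)
names(cor_long_base)[1:2] <- c("var1", "var2")

# A 7 x 7 matrix becomes 49 rows: one per ordered pair of variables
nrow(cor_long_base)
head(cor_long_base, 3)

# Spot-check one cell against the original matrix
subset(cor_long_base, var1 == "mpg" & var2 == "wt")$correlation
cor_matrix["mpg", "wt"]
```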

Creating the Basic Heatmap

With long-format data ready, building the basic heatmap requires just a few lines:

ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  coord_fixed()

This creates a grid where each tile’s color encodes the sign and strength of a correlation. coord_fixed() keeps the tiles square rather than rectangular, so the matrix looks the same regardless of the plot window’s shape.

However, this basic version needs improvement. The default color scale doesn’t effectively distinguish positive from negative correlations, and variable names may overlap:

# Improved basic heatmap
ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

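Adding white borders between tiles (color = "white") improves readability, while rotating x-axis labels prevents overlap. The linewidth argument requires ggplot2 3.4.0 or later; on older versions, use size to set the border width instead.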

Customizing the Visualization

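Professional correlation matrices require careful color selection and informative annotations. Diverging color scales work best, using distinct colors for positive and negative correlations with white or a neutral tone at zero. Fixing the scale limits at -1 and 1 keeps the color mapping the same from one plot to the next, even when the data never reach those extremes: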

ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank())

Add correlation coefficients directly on tiles for precise reading:

ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(correlation, 2)), color = "black", size = 3) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        legend.position = "right")
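One detail worth knowing about the labels: round() returns numbers, so trailing zeros disappear when the values are drawn as text. The short comparison below uses made-up values to show the difference from sprintf(), which always prints a fixed number of decimals.

```r
r <- c(0.5, -0.8677, 1)

# round() returns numbers, so trailing zeros vanish when converted to text
as.character(round(r, 2))   # "0.5"   "-0.87" "1"

# sprintf() returns strings with a fixed number of decimals
sprintf("%.2f", r)          # "0.50"  "-0.87" "1.00"
```

Use sprintf("%.2f", correlation) as the label if you want every tile to show the same number of digits.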

For better text contrast on dark tiles, make text color conditional:

cor_long <- cor_long %>%
  mutate(label_color = ifelse(abs(correlation) > 0.5, "white", "black"))

ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(correlation, 2), color = label_color), size = 3) +
  scale_color_identity() +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank())

Advanced Techniques

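Correlation matrices are symmetric: the correlation between A and B equals the correlation between B and A, so displaying both triangles is redundant. To show only the lower triangle (plus the diagonal), convert both variables to factors with the same level order and keep the rows where the position of var1 is at least the position of var2: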

# Filter for lower triangle
cor_long_lower <- cor_long %>%
  mutate(var1 = factor(var1, levels = colnames(cor_matrix)),
         var2 = factor(var2, levels = colnames(cor_matrix))) %>%
  filter(as.numeric(var1) >= as.numeric(var2))

ggplot(cor_long_lower, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank())
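The same triangle can also be selected on the matrix itself, before reshaping. The sketch below is an alternative to the factor-based filter above, using base R’s upper.tri(); the names cor_lower and cor_long_tri are made up for illustration.

```r
numeric_data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(numeric_data)

# Blank out everything strictly above the diagonal
cor_lower <- cor_matrix
cor_lower[upper.tri(cor_lower)] <- NA

# Unroll to long format and drop the blanked cells
cor_long_tri <- as.data.frame(as.table(cor_lower), responseName = "correlation")
names(cor_long_tri)[1:2] <- c("var1", "var2")
cor_long_tri <- cor_long_tri[!is.na(cor_long_tri$correlation), ]

# n variables leave n * (n + 1) / 2 cells: 7 * 8 / 2 = 28
nrow(cor_long_tri)
```

Pass diag = TRUE to upper.tri() if you also want to drop the diagonal, which always equals 1 and carries no information.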

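Reorder variables by hierarchical clustering to group correlated variables together. Using 1 - r as the distance places strongly positively correlated variables close to each other (distance near 0) and strongly negatively correlated ones far apart (distance near 2):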

# Perform hierarchical clustering
hc <- hclust(as.dist(1 - cor_matrix))
var_order <- colnames(cor_matrix)[hc$order]

# Apply ordering to long format data
cor_long_ordered <- cor_long %>%
  mutate(var1 = factor(var1, levels = var_order),
         var2 = factor(var2, levels = var_order))

ggplot(cor_long_ordered, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank())
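The choice of distance affects the ordering. As a rough sketch, the code below compares the signed distance used above with an absolute-value variant, which is an alternative not used elsewhere in this article and which groups variables by strength of association regardless of sign. Both calls rely on hclust()’s default linkage ("complete").

```r
numeric_data <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(numeric_data)

# Signed distance: r = +1 gives 0, r = -1 gives 2 (opposites end up far apart)
hc_signed <- hclust(as.dist(1 - cor_matrix))
order_signed <- colnames(cor_matrix)[hc_signed$order]

# Absolute distance: r = +1 and r = -1 both give 0 (strength only)
hc_abs <- hclust(as.dist(1 - abs(cor_matrix)))
order_abs <- colnames(cor_matrix)[hc_abs$order]

# Compare the two orderings
order_signed
order_abs
```

With the signed distance, the heatmap tends to show blocks of one color; with the absolute distance, strongly related variables sit together even when their signs differ.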

Complete Working Example

Here’s a complete script combining clustering-based ordering, triangle filtering, and labeled tiles:

library(ggplot2)
library(tidyr)
library(dplyr)

# Load and prepare data
data(mtcars)
numeric_data <- mtcars %>%
  select(mpg, cyl, disp, hp, drat, wt, qsec)

# Compute correlation matrix
cor_matrix <- cor(numeric_data)

# Hierarchical clustering for ordering
hc <- hclust(as.dist(1 - cor_matrix))
var_order <- colnames(cor_matrix)[hc$order]

# Reshape to long format
cor_long <- as.data.frame(cor_matrix) %>%
  mutate(var1 = rownames(cor_matrix)) %>%
  pivot_longer(cols = -var1, names_to = "var2", values_to = "correlation") %>%
  mutate(var1 = factor(var1, levels = var_order),
         var2 = factor(var2, levels = var_order)) %>%
  filter(as.numeric(var1) >= as.numeric(var2))

# Create polished visualization
ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", correlation)), 
            color = "black", size = 3.5) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Pearson\nCorrelation") +
  coord_fixed() +
  labs(title = "Correlation Matrix: Motor Trend Car Road Tests",
       subtitle = "Variables ordered by hierarchical clustering") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(face = "bold"),
        legend.position = "right")

This code produces a publication-ready correlation matrix with clustered variables, clear color encoding, precise correlation values, and clean typography. Modify the color palette, text size, or filtering logic to match your specific requirements. The modular structure makes adjustments straightforward—change one component without rewriting the entire visualization.
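If you build this plot for more than one dataset, one option is to wrap the pipeline in a function. The version below is only a sketch: plot_cor_matrix(), digits, and label_size are hypothetical names invented here, not part of ggplot2 or any other package, and the function assumes every column of its input is numeric and free of missing values.

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Hypothetical helper (not from any package): builds the clustered,
# lower-triangle heatmap from a data frame of numeric columns
plot_cor_matrix <- function(data, digits = 2, label_size = 3.5) {
  cor_matrix <- cor(data)

  # Order variables by hierarchical clustering on 1 - r
  hc <- hclust(as.dist(1 - cor_matrix))
  var_order <- colnames(cor_matrix)[hc$order]

  # Fixed-decimal label format, e.g. "%.2f"
  fmt <- paste0("%.", digits, "f")

  cor_long <- as.data.frame(cor_matrix) %>%
    mutate(var1 = rownames(cor_matrix)) %>%
    pivot_longer(cols = -var1, names_to = "var2", values_to = "correlation") %>%
    mutate(var1 = factor(var1, levels = var_order),
           var2 = factor(var2, levels = var_order),
           label = sprintf(fmt, correlation)) %>%
    filter(as.numeric(var1) >= as.numeric(var2))

  ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
    geom_tile(color = "white", linewidth = 0.5) +
    geom_text(aes(label = label), color = "black", size = label_size) +
    scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                         midpoint = 0, limits = c(-1, 1),
                         name = "Correlation") +
    coord_fixed() +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
          axis.title = element_blank(),
          panel.grid = element_blank())
}

# Usage: same variables as the script above
p <- plot_cor_matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")])
p
```

Because the function returns an ordinary ggplot object, titles and further theme tweaks can still be added with + at the call site.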
