How to Calculate the Correlation Matrix in R
Key Insights
- R’s base cor() function calculates correlation matrices instantly, but you’ll need the Hmisc package to get p-values for statistical significance testing
- Always handle missing data explicitly: the use parameter in cor() determines whether R drops rows globally or calculates pairwise complete observations
- Visualization transforms correlation matrices from number grids into actionable insights; corrplot and ggcorrplot are your best options for publication-ready graphics
Introduction to Correlation Matrices
A correlation matrix is a table showing correlation coefficients between multiple variables simultaneously. Each cell represents the relationship strength between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). Zero indicates no linear relationship.
When you’re exploring a dataset with dozens of numeric variables, calculating individual correlations becomes tedious. A correlation matrix gives you the full picture in one computation, revealing which variables move together, which move in opposite directions, and which have no relationship at all.
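To make the idea concrete, here is a minimal sketch using two columns of the built-in mtcars dataset: cor() on a pair of vectors returns a single coefficient, while cor() on a data frame returns the full matrix in one call.

```r
# One pair of variables: a single coefficient
cor(mtcars$mpg, mtcars$wt)   # about -0.87, a strong negative relationship

# Several variables at once: a square, symmetric matrix
m <- cor(mtcars[, c("mpg", "wt", "hp")])
m
```

The matrix is symmetric (m["mpg", "wt"] equals m["wt", "mpg"]) with 1s on the diagonal, since every variable correlates perfectly with itself.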
R supports three correlation types out of the box:
- Pearson: Measures linear relationships between continuous variables. This is the default and most common choice.
- Spearman: Measures monotonic relationships using ranked data. Use this when your data isn’t normally distributed or contains outliers.
- Kendall: Another rank-based method that’s more robust with small samples or many tied values.
Choose Pearson for normally distributed continuous data. Switch to Spearman or Kendall when those assumptions fail.
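The practical difference is easy to demonstrate. In this toy sketch (simulated data, not from the article), a single outlier drags the Pearson coefficient down dramatically, while the rank-based Spearman coefficient stays high:

```r
set.seed(42)
x <- 1:20
y <- x + rnorm(20, sd = 1)   # a nearly perfect linear relationship
y[1] <- 100                  # one extreme value against the trend

# Pearson is dragged around by the outlier; Spearman,
# computed on ranks, is affected far less
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
```

With the outlier removed, both methods would report a correlation near 1; with it in place, only Spearman still reflects the underlying monotonic trend.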
Preparing Your Data
Before calculating correlations, you need clean numeric data. Let’s work with the built-in mtcars dataset, which contains specifications for 32 automobiles from 1974.
# Load and inspect the data
data(mtcars)
head(mtcars)
# Check dimensions and structure
dim(mtcars)
str(mtcars)
# Check for missing values
sum(is.na(mtcars))
colSums(is.na(mtcars))
The mtcars dataset has no missing values, but real-world data usually does. Here’s how to handle NAs:
# Create sample data with missing values
sample_data <- mtcars[1:10, 1:4]
sample_data[2, 1] <- NA
sample_data[5, 3] <- NA
# Option 1: Remove rows with any NA
clean_data <- na.omit(sample_data)
# Option 2: Keep only rows with complete cases
complete.cases(sample_data)   # logical vector, one entry per row
clean_data2 <- sample_data[complete.cases(sample_data), ]
The na.omit() approach is aggressive—it drops entire rows even if only one value is missing. For correlation matrices, you have better options through the use parameter, which we’ll cover shortly.
Basic Correlation Matrix with cor()
R’s base cor() function is straightforward. Pass it a numeric data frame or matrix, and you get a correlation matrix back.
# Basic Pearson correlation matrix
cor_matrix <- cor(mtcars)
round(cor_matrix, 2)
# Subset to first 6 variables for readability
cor_subset <- cor(mtcars[, 1:6])
round(cor_subset, 2)
Output:
      mpg   cyl  disp    hp  drat    wt
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78
disp -0.85  0.90  1.00  0.79 -0.71  0.89
hp   -0.78  0.83  0.79  1.00 -0.45  0.66
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71
wt   -0.87  0.78  0.89  0.66 -0.71  1.00
This immediately reveals strong relationships: mpg correlates negatively with cyl, disp, and wt (heavier cars with more cylinders get worse mileage).
To switch correlation methods:
# Spearman correlation (rank-based)
cor_spearman <- cor(mtcars[, 1:6], method = "spearman")
round(cor_spearman, 2)
# Kendall correlation
cor_kendall <- cor(mtcars[, 1:6], method = "kendall")
round(cor_kendall, 2)
For handling missing values in cor(), use the use parameter:
# Only use complete observations (drops any row with NA)
cor(sample_data, use = "complete.obs")
# Use pairwise complete observations (maximizes data usage)
cor(sample_data, use = "pairwise.complete.obs")
The pairwise.complete.obs option calculates each correlation using all available data for that specific pair of variables. This preserves more data but can produce correlation matrices that aren’t positive semi-definite—a problem for some downstream analyses.
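If downstream work (factor analysis, Cholesky decompositions) requires a valid correlation matrix, a quick sanity check is to inspect the eigenvalues: a positive semi-definite matrix has none below zero. A small sketch, rebuilding the toy sample_data from above:

```r
# Rebuild the toy data with scattered NAs
sample_data <- mtcars[1:10, 1:4]
sample_data[2, 1] <- NA
sample_data[5, 3] <- NA

pairwise_cor <- cor(sample_data, use = "pairwise.complete.obs")

# A valid correlation matrix is positive semi-definite:
# every eigenvalue should be >= 0 (up to numerical tolerance)
eigen(pairwise_cor, only.values = TRUE)$values
```

With this mild missingness the eigenvalues typically all come out positive; the problem tends to surface with heavier, patterned missingness across many variables. If you do hit an indefinite matrix, the Matrix package's nearPD() can repair it.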
Statistical Significance with Hmisc and corrplot
The cor() function gives you coefficients but no p-values. You don’t know if a correlation of 0.3 is statistically significant or just noise. The Hmisc package solves this.
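For a single pair, base R already has this covered: cor.test() reports the coefficient, a p-value, and a confidence interval. rcorr() is essentially this test applied to every pair at once.

```r
# Base R significance test for one pair of variables
result <- cor.test(mtcars$mpg, mtcars$wt)

result$estimate   # the correlation coefficient (about -0.87)
result$p.value    # p-value for H0: true correlation equals zero
result$conf.int   # 95% confidence interval for the correlation
```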
# Install if needed
install.packages("Hmisc")
library(Hmisc)
# rcorr requires a matrix, not a data frame
mtcars_matrix <- as.matrix(mtcars[, 1:6])
correlation_results <- rcorr(mtcars_matrix)
# View the structure
correlation_results
The rcorr() function returns a list with three components:
# Correlation coefficients
round(correlation_results$r, 2)
# P-values
round(correlation_results$P, 4)
# Sample sizes (useful with missing data)
correlation_results$n
The p-value matrix shows which correlations are statistically significant:
# Extract correlations and p-values
cor_coef <- correlation_results$r
p_values <- correlation_results$P
# Find significant correlations (p < 0.05)
significant <- p_values < 0.05
significant
For Spearman correlations with p-values:
spearman_results <- rcorr(mtcars_matrix, type = "spearman")
round(spearman_results$r, 2)
round(spearman_results$P, 4)
Visualizing Correlation Matrices
Numbers are hard to scan. Visualization makes patterns obvious. The corrplot package is the standard choice.
install.packages("corrplot")
library(corrplot)
# Basic correlation matrix
cor_matrix <- cor(mtcars)
# Color-coded heatmap
corrplot(cor_matrix, method = "color")
# Upper triangle only (avoids redundancy)
corrplot(cor_matrix, method = "color", type = "upper")
# Circle plot with correlation values
corrplot(cor_matrix, method = "circle", type = "upper",
addCoef.col = "black", number.cex = 0.7)
The corrplot function offers many customization options:
# Hierarchical clustering to group related variables
corrplot(cor_matrix, method = "color", type = "upper",
order = "hclust", addrect = 3)
# Custom color palette
corrplot(cor_matrix, method = "color", type = "upper",
col = colorRampPalette(c("#BB4444", "#FFFFFF", "#4477AA"))(200))
# Add significance indicators
p_matrix <- rcorr(as.matrix(mtcars))$P
corrplot(cor_matrix, method = "color", type = "upper",
p.mat = p_matrix, sig.level = 0.05, insig = "blank")
The last example blanks out non-significant correlations—extremely useful for identifying meaningful relationships.
For ggplot2 users, ggcorrplot provides a familiar interface:
install.packages("ggcorrplot")
library(ggcorrplot)
# Basic ggplot2-style correlation plot
ggcorrplot(cor_matrix)
# Customized version
ggcorrplot(cor_matrix,
type = "upper",
lab = TRUE,
lab_size = 3,
colors = c("#6D9EC1", "white", "#E46726"),
title = "Motor Trend Car Correlations",
ggtheme = theme_minimal())
# With p-value filtering
ggcorrplot(cor_matrix,
type = "upper",
p.mat = p_matrix,
insig = "blank")
Handling Common Issues
Real datasets aren’t as clean as mtcars. Here are solutions to frequent problems.
Non-numeric columns: The cor() function only accepts numeric data. Filter your data frame first:
library(dplyr)
# Using dplyr (select_if() still works but is superseded by where())
numeric_data <- mtcars %>% select(where(is.numeric))
# Base R alternative
numeric_data <- mtcars[, sapply(mtcars, is.numeric)]
# Then calculate correlation
cor(numeric_data)
Mixed data types: If you have factors that should be numeric, convert them:
# Convert specific columns (note: this assigns arbitrary integer
# codes to the levels, which only makes sense for ordered categories)
df$category <- as.numeric(as.factor(df$category))
# Or use model.matrix() to create dummy variables
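Here is a sketch of the model.matrix() approach on a made-up data frame (the column names are invented for illustration). Each factor level becomes a 0/1 dummy column, which can then be correlated like any other numeric variable:

```r
# Toy data frame with one numeric and one categorical column
df <- data.frame(
  score    = c(10, 12, 9, 15, 11, 14),
  category = c("a", "b", "a", "c", "b", "c")
)

# ~ 0 + category keeps a dummy column for every level
# (the default ~ category drops the first level as a baseline)
dummies <- model.matrix(~ 0 + category, data = df)

# Correlate the numeric column against the dummy columns
cor(cbind(score = df$score, dummies))
```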
Missing data strategies: Choose based on your situation:
# complete.obs: Conservative, uses only rows with no NAs anywhere
cor(data, use = "complete.obs")
# pairwise.complete.obs: Liberal, maximizes data per pair
cor(data, use = "pairwise.complete.obs")
# For small datasets with many NAs, consider imputation first
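As a crude baseline (dedicated packages such as mice or missForest do this far more carefully), the sketch below fills each NA with its column mean so cor() can use every row:

```r
# Toy data with scattered NAs
sample_data <- mtcars[1:10, 1:4]
sample_data[2, 1] <- NA
sample_data[5, 3] <- NA

# Replace each NA with its column mean
imputed <- as.data.frame(
  lapply(sample_data, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
)

sum(is.na(imputed))   # 0: nothing missing anymore
cor(imputed)          # no use argument needed now
```

Mean imputation shrinks variance and can bias correlations toward zero, so treat it as a quick exploration step, not a final analysis.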
Large correlation matrices: With many variables, focus on strong correlations:
# Flatten matrix and filter
cor_matrix <- cor(mtcars)
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")
# Remove self-correlations and duplicates
cor_df <- cor_df[cor_df$Var1 != cor_df$Var2, ]
cor_df <- cor_df[abs(cor_df$Correlation) > 0.7, ]
cor_df[order(-abs(cor_df$Correlation)), ]
Conclusion
Calculating correlation matrices in R is simple once you know the right tools. Start with cor() for quick exploration, add Hmisc::rcorr() when you need statistical significance, and use corrplot or ggcorrplot to communicate findings visually.
Choose your correlation method deliberately: Pearson for linear relationships with normal data, Spearman for monotonic relationships or when outliers are present, and Kendall for small samples with many ties.
Handle missing data explicitly rather than letting R’s defaults surprise you. And when presenting results, always visualize—a well-designed correlation plot communicates more in seconds than a matrix of numbers ever could.
For next steps, consider how these correlations inform your regression models. Strong correlations between predictors signal multicollinearity problems, while strong correlations with your outcome variable identify promising features for prediction.