How to Calculate Pearson Correlation in R
Key Insights
- R’s built-in cor() function calculates Pearson correlation instantly, but you need cor.test() to get p-values and confidence intervals for statistical inference.
- Always check for missing values before computing correlations: by default cor() returns NA, and complete-case handling drops entire rows, which can silently shrink your sample.
- Pearson correlation assumes linearity; if your relationship is monotonic but non-linear, switch to Spearman’s rank correlation with method = "spearman".
Introduction to Pearson Correlation
Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It produces a value between -1 and +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
The interpretation is straightforward:
- 0.7 to 1.0 (or -0.7 to -1.0): Strong correlation
- 0.4 to 0.7 (or -0.4 to -0.7): Moderate correlation
- 0.1 to 0.4 (or -0.1 to -0.4): Weak correlation
- 0 to 0.1 (or 0 to -0.1): Negligible correlation
These thresholds are guidelines, not rules. Context matters. A correlation of 0.3 might be impressive in social science research but underwhelming in physics.
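As a quick sketch, the guideline thresholds above can be wrapped in a small helper; label_strength is a hypothetical function written here for illustration, not part of base R:

```r
# Hypothetical helper applying the guideline labels above to |r|
label_strength <- function(r) {
  a <- abs(r)
  if (a >= 0.7) "strong"
  else if (a >= 0.4) "moderate"
  else if (a >= 0.1) "weak"
  else "negligible"
}

label_strength(-0.87)  # "strong"
label_strength(0.25)   # "weak"
```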
Prerequisites and Setup
The good news: you don’t need to install anything. R’s stats package includes everything for basic correlation analysis, and it loads automatically. For visualization, you’ll want ggplot2 and corrplot.
# Load visualization packages (install first if needed)
# install.packages(c("ggplot2", "corrplot"))
library(ggplot2)
library(corrplot)
# We'll use mtcars, a built-in dataset
data(mtcars)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The mtcars dataset contains 32 observations of 11 numeric variables describing automobile characteristics. It’s perfect for demonstrating correlation because the variables have obvious relationships—heavier cars get worse gas mileage, more cylinders mean more horsepower.
Basic Correlation Calculation with cor()
The cor() function is your workhorse for Pearson correlation. At its simplest, pass two numeric vectors:
# Correlation between car weight and miles per gallon
cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594
The result of -0.87 tells us there’s a strong negative relationship: as weight increases, fuel efficiency decreases. This makes intuitive sense.
You can also specify the method explicitly, though Pearson is the default:
cor(mtcars$wt, mtcars$mpg, method = "pearson")
Handling Missing Values
Real data has gaps. By default, cor() returns NA if either vector contains missing values, unless you tell it how to handle them:
# Create sample data with missing values
x <- c(1, 2, 3, NA, 5, 6, 7, 8, 9, 10)
y <- c(2, 4, 5, 8, 10, NA, 14, 16, 18, 20)
# This returns NA
cor(x, y)
[1] NA
# Use complete observations only
cor(x, y, use = "complete.obs")
[1] 0.9970182
The use parameter has several options:
"everything": Default. Returns NA if any missing values exist."complete.obs": Uses only rows where both variables have values."pairwise.complete.obs": For matrices, uses all available pairs per correlation.
For a single pair of vectors, "complete.obs" and "pairwise.complete.obs" produce identical results. The difference matters when computing correlation matrices, which we’ll cover shortly.
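A minimal sketch of why the choice matters for matrices, using made-up data where only one column has missing values: listwise deletion then shrinks the sample used for the a~b correlation even though neither a nor b is missing anything.

```r
# Column c has NAs; listwise deletion drops those rows for EVERY pair
set.seed(42)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
df$c[1:4] <- NA

r_listwise <- cor(df, use = "complete.obs")["a", "b"]          # rows 5-10 only
r_pairwise <- cor(df, use = "pairwise.complete.obs")["a", "b"] # all 10 rows

c(listwise = r_listwise, pairwise = r_pairwise)  # generally differ
```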
Statistical Testing with cor.test()
Knowing that two variables correlate at -0.87 is useful, but is it statistically significant? The cor.test() function provides hypothesis testing:
# Full statistical test
result <- cor.test(mtcars$wt, mtcars$mpg)
print(result)
Pearson's product-moment correlation
data: mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
This output tells you everything you need:
- t-statistic: -9.559, the test statistic
- df: 30 degrees of freedom (n - 2)
- p-value: 1.294e-10, extremely significant
- 95% CI: The true correlation likely falls between -0.93 and -0.74
- cor: The point estimate, -0.868
The result is an object you can extract from programmatically:
# Extract specific values
result$estimate # correlation coefficient
result$p.value # p-value
result$conf.int # confidence interval
result$statistic # t-statistic
cor
-0.8676594
[1] 1.293959e-10
[1] -0.9338264 -0.7440872
attr(,"conf.level")
[1] 0.95
t
-9.559044
This extraction is essential when you’re running correlations in loops or building automated reports.
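As one possible sketch of that looping pattern, the snippet below runs cor.test() for mpg against a few predictors (chosen here purely for illustration) and collects the extracted pieces into a tidy data frame:

```r
# Loop cor.test() over several predictors and collect a summary table
predictors <- c("wt", "hp", "disp", "qsec")
results <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  data.frame(variable = v,
             r       = unname(ct$estimate),
             p_value = ct$p.value,
             ci_low  = ct$conf.int[1],
             ci_high = ct$conf.int[2])
}))
print(results)
```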
Correlation Matrices for Multiple Variables
When analyzing datasets with many variables, you’ll want to see all pairwise correlations at once:
# Select numeric columns of interest
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]
# Compute correlation matrix
cor_matrix <- cor(vars)
print(cor_matrix)
mpg hp wt qsec
mpg 1.0000000 -0.7761684 -0.8676594 0.4186840
hp -0.7761684 1.0000000 0.6587479 -0.7082234
wt -0.8676594 0.6587479 1.0000000 -0.1747159
qsec 0.4186840 -0.7082234 -0.1747159 1.0000000
The diagonal is always 1 (a variable correlates perfectly with itself). The matrix is symmetric—the correlation between mpg and hp equals the correlation between hp and mpg.
For cleaner output, round the values:
round(cor_matrix, 2)
mpg hp wt qsec
mpg 1.00 -0.78 -0.87 0.42
hp -0.78 1.00 0.66 -0.71
wt -0.87 0.66 1.00 -0.17
qsec 0.42 -0.71 -0.17 1.00
With missing data in your matrix, the use parameter becomes critical:
# Pairwise deletion: maximizes data usage per correlation
cor(vars, use = "pairwise.complete.obs")
# Listwise deletion: uses only complete cases across ALL variables
cor(vars, use = "complete.obs")
Pairwise deletion uses more data but can produce correlation matrices that aren’t positive semi-definite (a mathematical property some downstream analyses require). Listwise deletion is more conservative but consistent.
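If you need to check that property, one way is to inspect the eigenvalues of the matrix: a positive semi-definite matrix has no (meaningfully) negative eigenvalues. The sketch below uses the built-in airquality dataset, which contains NAs:

```r
# Check positive semi-definiteness of a pairwise-deleted correlation matrix
m  <- cor(airquality, use = "pairwise.complete.obs")
ev <- eigen(m, only.values = TRUE)$values
all(ev >= -1e-8)  # TRUE means PSD within numerical tolerance
```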
Visualizing Correlations
Numbers tell part of the story. Visualization reveals patterns, outliers, and non-linearity that correlation coefficients hide.
Scatterplot with Trend Line
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
labs(
title = "Weight vs. Fuel Efficiency",
subtitle = paste("Pearson r =", round(cor(mtcars$wt, mtcars$mpg), 3)),
x = "Weight (1000 lbs)",
y = "Miles per Gallon"
) +
theme_minimal()
The geom_smooth(method = "lm") call adds a linear regression line, and the shaded band shows the 95% confidence interval for the fit. Always plot your data before trusting a correlation coefficient.
Correlation Heatmap
For correlation matrices, heatmaps provide instant visual insight:
# Compute correlation matrix
cor_matrix <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
# Create heatmap
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
The corrplot package offers numerous customization options. Setting type = "upper" shows only the upper triangle (since the matrix is symmetric). The order = "hclust" groups similar variables together using hierarchical clustering.
Assumptions and Best Practices
Pearson correlation rests on assumptions. Violate them, and your results become unreliable.
Linearity
Pearson measures linear relationships only. Two variables can have a perfect curvilinear relationship and still show a Pearson correlation near zero. Always plot your data first.
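A quick constructed example of that failure mode: a perfect quadratic relationship, where y is fully determined by x, still yields a Pearson correlation of essentially zero because the symmetric halves cancel.

```r
# Perfect curvilinear relationship, near-zero Pearson correlation
x <- (-30:30) / 10   # symmetric around 0
y <- x^2             # y is completely determined by x, but not linearly
cor(x, y)            # essentially 0
```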
Normality
For hypothesis testing (p-values, confidence intervals), both variables should be approximately normally distributed. The Shapiro-Wilk test provides a quick check:
shapiro.test(mtcars$mpg)
Shapiro-Wilk normality test
data: mtcars$mpg
W = 0.94756, p-value = 0.1229
A p-value above 0.05 suggests the data doesn’t significantly deviate from normality. Here, mpg passes the test.
shapiro.test(mtcars$hp)
Shapiro-Wilk normality test
data: mtcars$hp
W = 0.93342, p-value = 0.04881
Horsepower fails (p < 0.05), suggesting non-normality. For small samples with non-normal data, consider Spearman’s correlation instead:
cor(mtcars$wt, mtcars$mpg, method = "spearman")
[1] -0.886422
Outliers
Pearson correlation is sensitive to outliers. A single extreme point can dramatically inflate or deflate the correlation. Visual inspection catches what statistical tests miss.
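To see how dramatic this can be, here is a sketch with simulated noise: two unrelated vectors show near-zero correlation until a single extreme point is appended, at which point the coefficient jumps sharply.

```r
# One extreme point manufactures a strong correlation from pure noise
set.seed(1)
x <- rnorm(20)
y <- rnorm(20)                        # unrelated to x
r_clean   <- cor(x, y)                # near zero

r_outlier <- cor(c(x, 10), c(y, 10))  # add a single (10, 10) point

c(clean = r_clean, with_outlier = r_outlier)
```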
When to Use Alternatives
- Spearman: Ordinal data, non-linear monotonic relationships, or when normality fails
- Kendall: Small samples, many tied ranks, or when you want a more robust measure
Both alternatives use method = "spearman" or method = "kendall" in cor() and cor.test().
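For example, cor.test() accepts the same method argument; passing exact = FALSE avoids the warning both methods emit when the data contain tied ranks (as mtcars$wt does):

```r
# Same interface, different correlation measure
cor.test(mtcars$wt, mtcars$mpg, method = "spearman", exact = FALSE)
cor.test(mtcars$wt, mtcars$mpg, method = "kendall",  exact = FALSE)
```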
Common Pitfalls
Correlation is not causation. A strong correlation between ice cream sales and drowning deaths doesn’t mean ice cream causes drowning. Both increase in summer.
Restricted range reduces correlation. If you only sample tall people, the correlation between height and weight will appear weaker than in the general population.
Aggregated data inflates correlation. Correlations computed on group means are typically higher than correlations on individual observations. This is Simpson’s paradox territory.
Multiple testing requires correction. If you compute 100 correlations, expect 5 to be “significant” at p < 0.05 by chance alone. Use Bonferroni correction or false discovery rate adjustment for large correlation matrices.
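Base R's p.adjust() handles both adjustments mentioned above. A sketch, with the predictor list chosen purely for illustration:

```r
# Adjust a vector of correlation p-values for multiple testing
vars  <- c("hp", "wt", "qsec", "disp", "drat")
p_raw <- sapply(vars, function(v) cor.test(mtcars$mpg, mtcars[[v]])$p.value)

p.adjust(p_raw, method = "bonferroni")  # family-wise error control
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg false discovery rate
```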
R makes correlation analysis accessible, but accessibility doesn’t excuse carelessness. Check your assumptions, visualize your data, and interpret results within their proper context.