How to Calculate Pearson Correlation in R
Key Insights
- R’s built-in cor() function calculates Pearson correlation instantly, but you need cor.test() to get p-values and confidence intervals for statistical inference.
- Always check for missing values before computing correlations: by default cor() returns NA, and complete-case handling drops entire rows, which can silently shrink your sample.
- Pearson correlation assumes linearity; if your relationship is monotonic but non-linear, switch to Spearman’s rank correlation with method = "spearman".
Introduction to Pearson Correlation
Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It produces a value between -1 and +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
The interpretation is straightforward:
- 0.7 to 1.0 (or -0.7 to -1.0): Strong correlation
- 0.4 to 0.7 (or -0.4 to -0.7): Moderate correlation
- 0.1 to 0.4 (or -0.1 to -0.4): Weak correlation
- 0 to 0.1 (or 0 to -0.1): Negligible correlation
These thresholds are guidelines, not rules. Context matters. A correlation of 0.3 might be impressive in social science research but underwhelming in physics.
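As a quick sketch, the guideline thresholds above can be wrapped in a small helper; label_strength is a hypothetical function written here for illustration, not part of base R:

```r
# Hypothetical helper applying the guideline labels above to |r|
label_strength <- function(r) {
  a <- abs(r)
  if (a >= 0.7) "strong"
  else if (a >= 0.4) "moderate"
  else if (a >= 0.1) "weak"
  else "negligible"
}

label_strength(-0.87)  # "strong"
label_strength(0.25)   # "weak"
```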
Prerequisites and Setup
The good news: you don’t need to install anything. R’s stats package includes everything for basic correlation analysis, and it loads automatically. For visualization, you’ll want ggplot2 and corrplot.
# Load visualization packages (install first if needed)
# install.packages(c("ggplot2", "corrplot"))
library(ggplot2)
library(corrplot)
# We'll use mtcars, a built-in dataset
data(mtcars)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The mtcars dataset contains 32 observations of 11 numeric variables describing automobile characteristics. It’s perfect for demonstrating correlation because the variables have obvious relationships—heavier cars get worse gas mileage, more cylinders mean more horsepower.
Basic Correlation Calculation with cor()
The cor() function is your workhorse for Pearson correlation. At its simplest, pass two numeric vectors:
# Correlation between car weight and miles per gallon
cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594
The result of -0.87 tells us there’s a strong negative relationship: as weight increases, fuel efficiency decreases. This makes intuitive sense.
You can also specify the method explicitly, though Pearson is the default:
cor(mtcars$wt, mtcars$mpg, method = "pearson")
Handling Missing Values
Real data has gaps. By default, cor() returns NA if either vector contains missing values, unless you tell it how to handle them:
# Create sample data with missing values
x <- c(1, 2, 3, NA, 5, 6, 7, 8, 9, 10)
y <- c(2, 4, 5, 8, 10, NA, 14, 16, 18, 20)
# This returns NA
cor(x, y)
[1] NA
# Use complete observations only
cor(x, y, use = "complete.obs")
[1] 0.9970182
The use parameter has several options:
"everything": Default. Returns NA if any missing values exist."complete.obs": Uses only rows where both variables have values."pairwise.complete.obs": For matrices, uses all available pairs per correlation.
For a single pair of vectors, "complete.obs" and "pairwise.complete.obs" produce identical results. The difference matters when computing correlation matrices, which we’ll cover shortly.
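A minimal sketch of why the choice matters for matrices, using made-up data where only one column has missing values: listwise deletion then shrinks the sample used for the a~b correlation even though neither a nor b is missing anything.

```r
# Column c has NAs; listwise deletion drops those rows for EVERY pair
set.seed(42)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
df$c[1:4] <- NA

r_listwise <- cor(df, use = "complete.obs")["a", "b"]          # rows 5-10 only
r_pairwise <- cor(df, use = "pairwise.complete.obs")["a", "b"] # all 10 rows

c(listwise = r_listwise, pairwise = r_pairwise)  # generally differ
```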
Statistical Testing with cor.test()
Knowing that two variables correlate at -0.87 is useful, but is it statistically significant? The cor.test() function provides hypothesis testing:
# Full statistical test
result <- cor.test(mtcars$wt, mtcars$mpg)
print(result)
Pearson's product-moment correlation
data: mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
This output tells you everything you need:
- t-statistic: -9.559, the test statistic
- df: 30 degrees of freedom (n - 2)
- p-value: 1.294e-10, extremely significant
- 95% CI: The true correlation likely falls between -0.93 and -0.74
- cor: The point estimate, -0.868
The result is an object you can extract from programmatically:
# Extract specific values
result$estimate # correlation coefficient
result$p.value # p-value
result$conf.int # confidence interval
result$statistic # t-statistic
cor
-0.8676594
[1] 1.293959e-10
[1] -0.9338264 -0.7440872
attr(,"conf.level")
[1] 0.95
t
-9.559044
This extraction is essential when you’re running correlations in loops or building automated reports.
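As one possible sketch of that looping pattern, the snippet below runs cor.test() for mpg against a few predictors (chosen here purely for illustration) and collects the extracted pieces into a tidy data frame:

```r
# Loop cor.test() over several predictors and collect a summary table
predictors <- c("wt", "hp", "disp", "qsec")
results <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  data.frame(variable = v,
             r       = unname(ct$estimate),
             p_value = ct$p.value,
             ci_low  = ct$conf.int[1],
             ci_high = ct$conf.int[2])
}))
print(results)
```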
Correlation Matrices for Multiple Variables
When analyzing datasets with many variables, you’ll want to see all pairwise correlations at once:
# Select numeric columns of interest
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]
# Compute correlation matrix
cor_matrix <- cor(vars)
print(cor_matrix)
mpg hp wt qsec
mpg 1.0000000 -0.7761684 -0.8676594 0.4186840
hp -0.7761684 1.0000000 0.6587479 -0.7082234
wt -0.8676594 0.6587479 1.0000000 -0.1747159
qsec 0.4186840 -0.7082234 -0.1747159 1.0000000
The diagonal is always 1 (a variable correlates perfectly with itself). The matrix is symmetric—the correlation between mpg and hp equals the correlation between hp and mpg.
For cleaner output, round the values:
round(cor_matrix, 2)
mpg hp wt qsec
mpg 1.00 -0.78 -0.87 0.42
hp -0.78 1.00 0.66 -0.71
wt -0.87 0.66 1.00 -0.17
qsec 0.42 -0.71 -0.17 1.00
With missing data in your matrix, the use parameter becomes critical:
# Pairwise deletion: maximizes data usage per correlation
cor(vars, use = "pairwise.complete.obs")
# Listwise deletion: uses only complete cases across ALL variables
cor(vars, use = "complete.obs")
Pairwise deletion uses more data but can produce correlation matrices that aren’t positive semi-definite (a mathematical property some downstream analyses require). Listwise deletion is more conservative but consistent.
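If you need to check that property, one way is to inspect the eigenvalues of the matrix: a positive semi-definite matrix has no (meaningfully) negative eigenvalues. The sketch below uses the built-in airquality dataset, which contains NAs:

```r
# Check positive semi-definiteness of a pairwise-deleted correlation matrix
m  <- cor(airquality, use = "pairwise.complete.obs")
ev <- eigen(m, only.values = TRUE)$values
all(ev >= -1e-8)  # TRUE means PSD within numerical tolerance
```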
Visualizing Correlations
Numbers tell part of the story. Visualization reveals patterns, outliers, and non-linearity that correlation coefficients hide.
Scatterplot with Trend Line
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
labs(
title = "Weight vs. Fuel Efficiency",
subtitle = paste("Pearson r =", round(cor(mtcars$wt, mtcars$mpg), 3)),
x = "Weight (1000 lbs)",
y = "Miles per Gallon"
) +
theme_minimal()
The geom_smooth(method = "lm") call adds a linear regression line, and the shaded band shows the 95% confidence interval for the fit. Always plot your data before trusting a correlation coefficient.
Correlation Heatmap
For correlation matrices, heatmaps provide instant visual insight:
# Compute correlation matrix
cor_matrix <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
# Create heatmap
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
The corrplot package offers numerous customization options. Setting type = "upper" shows only the upper triangle (since the matrix is symmetric). The order = "hclust" groups similar variables together using hierarchical clustering.
Assumptions and Best Practices
Pearson correlation rests on assumptions. Violate them, and your results become unreliable.
Linearity
Pearson measures linear relationships only. Two variables can have a perfect curvilinear relationship and still show a Pearson correlation near zero. Always plot your data first.
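A quick constructed example of that failure mode: a perfect quadratic relationship, where y is fully determined by x, still yields a Pearson correlation of essentially zero because the symmetric halves cancel.

```r
# Perfect curvilinear relationship, near-zero Pearson correlation
x <- (-30:30) / 10   # symmetric around 0
y <- x^2             # y is completely determined by x, but not linearly
cor(x, y)            # essentially 0
```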
Normality
For hypothesis testing (p-values, confidence intervals), both variables should be approximately normally distributed. The Shapiro-Wilk test provides a quick check:
shapiro.test(mtcars$mpg)
Shapiro-Wilk normality test
data: mtcars$mpg
W = 0.94756, p-value = 0.1229
A p-value above 0.05 suggests the data doesn’t significantly deviate from normality. Here, mpg passes the test.
shapiro.test(mtcars$hp)
Shapiro-Wilk normality test
data: mtcars$hp
W = 0.93342, p-value = 0.04881
Horsepower fails (p < 0.05), suggesting non-normality. For small samples with non-normal data, consider Spearman’s correlation instead:
cor(mtcars$wt, mtcars$mpg, method = "spearman")
[1] -0.886422
Outliers
Pearson correlation is sensitive to outliers. A single extreme point can dramatically inflate or deflate the correlation. Visual inspection catches what statistical tests miss.
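To see how dramatic this can be, here is a sketch with simulated noise: two unrelated vectors show near-zero correlation until a single extreme point is appended, at which point the coefficient jumps sharply.

```r
# One extreme point manufactures a strong correlation from pure noise
set.seed(1)
x <- rnorm(20)
y <- rnorm(20)                        # unrelated to x
r_clean   <- cor(x, y)                # near zero

r_outlier <- cor(c(x, 10), c(y, 10))  # add a single (10, 10) point

c(clean = r_clean, with_outlier = r_outlier)
```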
When to Use Alternatives
- Spearman: Ordinal data, non-linear monotonic relationships, or when normality fails
- Kendall: Small samples, many tied ranks, or when you want a more robust measure
Both alternatives use method = "spearman" or method = "kendall" in cor() and cor.test().
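For example, cor.test() accepts the same method argument; passing exact = FALSE avoids the warning both methods emit when the data contain tied ranks (as mtcars$wt does):

```r
# Same interface, different correlation measure
cor.test(mtcars$wt, mtcars$mpg, method = "spearman", exact = FALSE)
cor.test(mtcars$wt, mtcars$mpg, method = "kendall",  exact = FALSE)
```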
Common Pitfalls
Correlation is not causation. A strong correlation between ice cream sales and drowning deaths doesn’t mean ice cream causes drowning. Both increase in summer.
Restricted range reduces correlation. If you only sample tall people, the correlation between height and weight will appear weaker than in the general population.
Aggregated data inflates correlation. Correlations computed on group means are typically higher than correlations on individual observations. This is Simpson’s paradox territory.
Multiple testing requires correction. If you compute 100 correlations, expect 5 to be “significant” at p < 0.05 by chance alone. Use Bonferroni correction or false discovery rate adjustment for large correlation matrices.
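Base R's p.adjust() handles both adjustments mentioned above. A sketch, with the predictor list chosen purely for illustration:

```r
# Adjust a vector of correlation p-values for multiple testing
vars  <- c("hp", "wt", "qsec", "disp", "drat")
p_raw <- sapply(vars, function(v) cor.test(mtcars$mpg, mtcars[[v]])$p.value)

p.adjust(p_raw, method = "bonferroni")  # family-wise error control
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg false discovery rate
```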
R makes correlation analysis accessible, but accessibility doesn’t excuse carelessness. Check your assumptions, visualize your data, and interpret results within their proper context.