R - Correlation (cor, cor.test)
Key Insights
- R provides cor() for calculating correlation coefficients between variables using Pearson, Spearman, or Kendall methods, with built-in handling of missing values through the use parameter
- The cor.test() function extends basic correlation by providing statistical significance testing, confidence intervals, and p-values for hypothesis testing on correlation coefficients
- Understanding when to use each correlation method is critical: Pearson for linear relationships with normally distributed data, Spearman for monotonic relationships or ordinal data, and Kendall for small sample sizes or data with many tied ranks
Basic Correlation with cor()
The cor() function computes correlation coefficients between numeric vectors or matrices. The most common method is Pearson correlation, which measures linear relationships between variables.
# Simple correlation between two vectors
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 4, 5, 7, 8, 10, 11, 13, 14, 16)
# Pearson correlation (default)
cor(x, y)
# [1] 0.9960783
# Spearman correlation (rank-based)
cor(x, y, method = "spearman")
# [1] 1
# Kendall correlation
cor(x, y, method = "kendall")
# [1] 1
The function returns values between -1 and 1, where 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no linear correlation.
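The boundary cases can be seen directly. This small sketch (made-up vectors) produces a perfect negative correlation and a near-zero one:

```r
x <- 1:10
# A perfectly decreasing sequence gives exactly -1
y_neg <- 10:1
cor(x, y_neg)
# [1] -1
# Pure noise has no systematic relationship with x, so the coefficient lands near 0
set.seed(1)
y_none <- rnorm(10)
cor(x, y_none)
```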
Correlation Matrices
When working with multiple variables, cor() generates a correlation matrix showing pairwise correlations between all variables.
# Create sample dataset
data <- data.frame(
height = c(165, 170, 175, 180, 185, 190),
weight = c(60, 65, 70, 75, 80, 85),
age = c(25, 30, 35, 40, 45, 50),
income = c(35000, 42000, 48000, 55000, 62000, 70000)
)
# Correlation matrix
cor_matrix <- cor(data)
print(cor_matrix)
#           height    weight       age    income
# height 1.0000000 1.0000000 1.0000000 0.9991814
# weight 1.0000000 1.0000000 1.0000000 0.9991814
# age    1.0000000 1.0000000 1.0000000 0.9991814
# income 0.9991814 0.9991814 0.9991814 1.0000000
# height, weight, and age rise in equal steps, so they correlate perfectly;
# income's uneven steps keep its correlations just below 1
# Round for readability
round(cor_matrix, 3)
For selecting specific variables from a data frame:
# Correlation between specific columns
cor(data[, c("height", "weight")])
# Correlation between one variable and multiple others
cor(data$height, data[, c("weight", "age", "income")])
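Building on the matrix above, a small helper sketch (the 0.9 threshold and variable names are illustrative) pulls out the variable pairs whose absolute correlation exceeds a cutoff:

```r
data <- data.frame(
  height = c(165, 170, 175, 180, 185, 190),
  weight = c(60, 65, 70, 75, 80, 85),
  age    = c(25, 30, 35, 40, 45, 50),
  income = c(35000, 42000, 48000, 55000, 62000, 70000)
)
cor_matrix <- cor(data)
# Keep only the upper triangle so each pair is reported once
strong <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[strong[, 1]],
  var2 = colnames(cor_matrix)[strong[, 2]],
  r    = cor_matrix[strong]
)
strong_pairs
```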
Handling Missing Values
Real-world data often contains missing values. The use parameter controls how cor() handles NA values.
# Data with missing values
x_na <- c(1, 2, NA, 4, 5, 6, 7, 8, NA, 10)
y_na <- c(2, 4, 5, NA, 8, 10, 11, 13, 14, 16)
# Default behavior returns NA
cor(x_na, y_na)
# [1] NA
# Use only rows where both values are present (listwise deletion)
cor(x_na, y_na, use = "complete.obs")
# [1] 0.9987128
# Pairwise complete observations (identical to the above for two vectors;
# the distinction matters for matrices)
cor(x_na, y_na, use = "pairwise.complete.obs")
# [1] 0.9987128
# "everything" (the default): any NA propagates to the result
cor(x_na, y_na, use = "everything")
# [1] NA
For matrices with missing values:
data_na <- data.frame(
a = c(1, 2, NA, 4, 5),
b = c(2, NA, 6, 8, 10),
c = c(3, 6, 9, 12, 15)
)
# Use pairwise complete observations
cor(data_na, use = "pairwise.complete.obs")
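The difference between the two strategies only shows up with three or more columns. In this constructed example (values chosen deliberately so the two disagree), listwise deletion leaves just two complete rows, while pairwise deletion keeps three rows for each pair:

```r
df_na <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 1, 4, NA, 5),
  c = c(NA, 3, 2, 5, 4)
)
# Listwise: only rows 2 and 3 are complete, so every pair is fit to 2 points
cor(df_na, use = "complete.obs")
# Pairwise: each pair keeps its own complete rows (3 per pair here),
# so the coefficients differ from the listwise matrix
cor(df_na, use = "pairwise.complete.obs")
```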
Statistical Testing with cor.test()
While cor() calculates correlation coefficients, cor.test() performs hypothesis testing to determine if correlations are statistically significant.
# Generate sample data
set.seed(123)
x <- rnorm(30, mean = 50, sd = 10)
y <- 2 * x + rnorm(30, mean = 0, sd = 5)
# Perform correlation test
test_result <- cor.test(x, y)
print(test_result)
#
# Pearson's product-moment correlation
#
# data: x and y
# t = 28.447, df = 28, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.9631748 0.9927844
# sample estimates:
# cor
# 0.9828571
# Extract specific values
test_result$estimate # correlation coefficient
test_result$p.value # p-value
test_result$conf.int # confidence interval
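These components can be gathered into a one-row data frame, which is convenient when results feed into a report (a small convenience sketch; the column names are illustrative):

```r
set.seed(123)
x <- rnorm(30, mean = 50, sd = 10)
y <- 2 * x + rnorm(30, mean = 0, sd = 5)
test_result <- cor.test(x, y)
# unname() drops the "cor" label that cor.test attaches to the estimate
summary_row <- data.frame(
  r        = unname(test_result$estimate),
  p_value  = test_result$p.value,
  ci_lower = test_result$conf.int[1],
  ci_upper = test_result$conf.int[2]
)
summary_row
```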
Choosing the Right Correlation Method
Different correlation methods suit different data characteristics:
# Pearson: Linear relationships, normally distributed
x_linear <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y_linear <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
cor.test(x_linear, y_linear, method = "pearson")
# Spearman: Monotonic relationships, ordinal data
x_rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y_rank <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100) # squared relationship
cor.test(x_rank, y_rank, method = "spearman")
# Kendall: Small samples, many tied ranks
x_small <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
y_small <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)
cor.test(x_small, y_small, method = "kendall")
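Comparing the coefficients side by side on the squared relationship makes the distinction concrete: Pearson is pulled below 1 by the curvature, while Spearman sees a perfect monotonic ordering:

```r
x_rank <- 1:10
y_rank <- x_rank^2                        # curved but strictly increasing
cor(x_rank, y_rank, method = "pearson")   # high, but below 1
cor(x_rank, y_rank, method = "spearman")  # exactly 1: ranks agree perfectly
# [1] 1
```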
Practical Example: Real Dataset Analysis
Here’s a complete workflow analyzing relationships in a dataset:
# Load built-in mtcars dataset
data(mtcars)
# Examine correlations between fuel efficiency and other variables
variables <- c("mpg", "hp", "wt", "qsec", "disp")
cor_subset <- cor(mtcars[, variables])
print(round(cor_subset, 3))
# Test specific hypothesis: mpg vs weight
mpg_wt_test <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
cat(sprintf("Correlation: %.3f\n", mpg_wt_test$estimate))
cat(sprintf("P-value: %.2e\n", mpg_wt_test$p.value))
cat(sprintf("95%% CI: [%.3f, %.3f]\n",
mpg_wt_test$conf.int[1],
mpg_wt_test$conf.int[2]))
# Test for non-linear relationship using Spearman
mpg_hp_spearman <- cor.test(mtcars$mpg, mtcars$hp, method = "spearman")
mpg_hp_pearson <- cor.test(mtcars$mpg, mtcars$hp, method = "pearson")
cat(sprintf("Spearman: %.3f, Pearson: %.3f\n",
mpg_hp_spearman$estimate,
mpg_hp_pearson$estimate))
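The pairwise tests above can also be run in one pass. This sketch (the predictor list is illustrative) collects the coefficient and p-value for mpg against several variables at once:

```r
data(mtcars)
predictors <- c("hp", "wt", "qsec", "disp")
results <- t(sapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  c(r = unname(ct$estimate), p = ct$p.value)
}))
round(results, 4)
```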
Alternative Hypothesis Testing
The alternative parameter allows one-sided hypothesis tests:
# Test if correlation is greater than 0 (wt and disp are positively related)
cor.test(mtcars$wt, mtcars$disp, alternative = "greater")
# Test if correlation is less than 0 (mpg falls as hp rises)
cor.test(mtcars$mpg, mtcars$hp, alternative = "less")
# Two-sided test (default)
cor.test(mtcars$mpg, mtcars$disp, alternative = "two.sided")
Correlation with Exact P-values
For small samples, the Spearman and Kendall tests can compute exact p-values instead of relying on large-sample approximations (by default, cor.test() uses the exact computation for Kendall when n is small):
# Small sample
x_small <- c(1, 2, 3, 4, 5)
y_small <- c(2, 3, 5, 4, 6)
# Exact test for Kendall (default for small n)
cor.test(x_small, y_small, method = "kendall", exact = TRUE)
# Exact test for Spearman
cor.test(x_small, y_small, method = "spearman", exact = TRUE)
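One caveat worth knowing: exact p-values require untied data. When ties are present, R falls back to an approximation and emits a warning. A small sketch with deliberately tied values:

```r
x_ties <- c(1, 2, 2, 3, 4)
y_ties <- c(1, 1, 2, 3, 4)
# Requesting an exact Spearman p-value here triggers a warning that the
# exact value cannot be computed with ties; an approximation is used instead
cor.test(x_ties, y_ties, method = "spearman", exact = TRUE)
```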
Performance Considerations
When working with large datasets, consider computational efficiency:
# Large dataset
set.seed(456)
large_data <- matrix(rnorm(10000 * 50), ncol = 50)
# Pearson is fastest
system.time(cor(large_data, method = "pearson"))
# Spearman requires ranking (slower)
system.time(cor(large_data, method = "spearman"))
# Kendall is slowest (O(n²) per pair of columns); expect this call to take
# substantially longer than the other two at this size
system.time(cor(large_data, method = "kendall"))
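When the full Kendall computation is impractical at this scale, one pragmatic workaround (a rough estimate, not a substitute for the exact statistic) is to compute tau on a random row subsample:

```r
set.seed(789)
n <- 100000
x <- rnorm(n)
y <- x + rnorm(n)        # for this model the true Kendall tau is 0.5
idx <- sample(n, 2000)   # subsample size is a tunable assumption
tau_approx <- cor(x[idx], y[idx], method = "kendall")
tau_approx
```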
Understanding correlation analysis in R enables data-driven decision-making. Use cor() for exploratory analysis and cor.test() when statistical significance matters. Choose methods based on data characteristics: Pearson for linear relationships, Spearman for monotonic patterns, and Kendall for robust analysis with small samples.