How to Calculate R-Squared in R
Key Insights
- R-squared measures the proportion of variance in your dependent variable explained by your model, but it always increases when you add predictors—use adjusted R-squared for multiple regression to account for model complexity.
- The `summary()` function on an `lm` object gives you both R-squared and adjusted R-squared instantly, but packages like `broom` provide cleaner programmatic access for pipelines and reporting.
- A high R-squared doesn’t guarantee a good model—Anscombe’s quartet proves that wildly different data patterns can produce identical R-squared values, so always visualize your data.
Introduction to R-Squared
R-squared, also called the coefficient of determination, tells you how much of the variation in your outcome variable is explained by your predictors. It ranges from 0 to 1, where 0 means your model explains nothing and 1 means it explains everything perfectly.
In practice, you’ll rarely see either extreme. An R-squared of 0.75 means your model accounts for 75% of the variance in the dependent variable. The remaining 25% is unexplained—either due to missing predictors, measurement error, or inherent randomness.
Here’s the thing most tutorials won’t tell you: R-squared is easy to misuse. It’s not a universal measure of model quality. It doesn’t tell you if your model is correctly specified, if your predictions are accurate in absolute terms, or if your coefficients are meaningful. But when used appropriately, it’s a quick and useful diagnostic.
Let’s dig into how to calculate it in R, from scratch and using built-in tools.
The Math Behind R-Squared
The formula is straightforward:
R² = 1 - (SS_res / SS_tot)
Where:
- SS_res (Residual Sum of Squares) = Σ(yᵢ - ŷᵢ)² — the sum of squared differences between actual and predicted values
- SS_tot (Total Sum of Squares) = Σ(yᵢ - ȳ)² — the sum of squared differences between actual values and the mean
When your model predicts perfectly, SS_res equals zero, and R² equals 1. When your model is no better than just predicting the mean, SS_res equals SS_tot, and R² equals 0.
Here’s how to calculate it manually in base R:
# Create sample data
set.seed(42)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 10)
# Fit a simple model manually (or use lm, we'll do both)
y_mean <- mean(y)
y_predicted <- predict(lm(y ~ x))
# Calculate sum of squares
ss_tot <- sum((y - y_mean)^2)
ss_res <- sum((y - y_predicted)^2)
# Calculate R-squared
r_squared_manual <- 1 - (ss_res / ss_tot)
print(r_squared_manual)
# Roughly 0.89 with this seed; identical to summary(lm(y ~ x))$r.squared
This manual approach helps you understand what’s happening under the hood. In real work, you’ll use built-in functions.
Calculating R-Squared with Linear Models
The lm() function fits linear models, and summary() extracts everything you need, including R-squared.
# Fit a linear model
model <- lm(y ~ x)
# Get the summary
model_summary <- summary(model)
# Extract R-squared
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared
cat("R-squared:", r_squared, "\n")
cat("Adjusted R-squared:", adj_r_squared, "\n")
# Full summary output
print(model_summary)
The output includes:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-22.814 -6.282 -0.283 6.891 19.847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.2891 2.8912 1.483 0.145
x 1.9654 0.0989 19.873 <2e-16 ***
Residual standard error: 10.02 on 48 degrees of freedom
Multiple R-squared: 0.8917, Adjusted R-squared: 0.8895
Notice both R-squared values are reported. For simple linear regression with one predictor, they’re nearly identical. The difference becomes important with multiple predictors.
R-Squared for Multiple Regression
Here’s the problem with R-squared: it never decreases when you add predictors. Even a completely irrelevant variable will leave R-squared unchanged or nudge it upward, because the model can always fit a little extra noise. This makes raw R-squared unreliable for comparing models with different numbers of predictors.
Adjusted R-squared fixes this by penalizing model complexity:
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
Where n is the sample size and k is the number of predictors.
Let’s demonstrate:
# Create dataset with multiple predictors
set.seed(123)
n <- 100
data <- data.frame(
y = rnorm(n, 50, 15), # placeholder; overwritten below
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
x4 = rnorm(n), # noise variables
x5 = rnorm(n),
x6 = rnorm(n)
)
# Make y actually depend on x1 and x2
data$y <- 10 + 5 * data$x1 + 3 * data$x2 + rnorm(n, sd = 5)
# Fit models with increasing predictors
model1 <- lm(y ~ x1, data = data)
model2 <- lm(y ~ x1 + x2, data = data)
model3 <- lm(y ~ x1 + x2 + x3, data = data)
model4 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data)
# Compare R-squared values
results <- data.frame(
predictors = c("x1", "x1+x2", "x1+x2+x3", "all six"),
r_squared = c(
summary(model1)$r.squared,
summary(model2)$r.squared,
summary(model3)$r.squared,
summary(model4)$r.squared
),
adj_r_squared = c(
summary(model1)$adj.r.squared,
summary(model2)$adj.r.squared,
summary(model3)$adj.r.squared,
summary(model4)$adj.r.squared
)
)
print(results)
Output:
predictors r_squared adj_r_squared
1 x1 0.4892 0.4840
2 x1+x2 0.6234 0.6156
3 x1+x2+x3 0.6241 0.6124
4 all six 0.6298 0.6059
Notice how R-squared keeps climbing (0.4892 → 0.6298) even as we add useless predictors. But adjusted R-squared tells the truth: it peaks at model 2 (the correct specification) and then declines as we add noise variables.
Rule of thumb: Use adjusted R-squared when comparing models with different numbers of predictors.
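To confirm that `summary()`'s adjusted value really is this formula, here is a minimal standalone sketch (regenerating data like the simulation above) that computes it by hand:

```r
# Regenerate a small dataset like the one above (standalone)
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 10 + 5 * x1 + 3 * x2 + rnorm(n, sd = 5)

fit <- lm(y ~ x1 + x2)
k <- 2  # number of predictors

# Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)
r2 <- summary(fit)$r.squared
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

all.equal(adj_manual, summary(fit)$adj.r.squared)  # TRUE
```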
Using Packages for R-Squared Calculations
Base R works fine, but packages make extraction cleaner, especially in pipelines.
Using broom
The broom package provides tidy output from statistical models:
library(broom)
model <- lm(mpg ~ wt + hp, data = mtcars)
# glance() returns model-level statistics as a tibble
model_stats <- glance(model)
print(model_stats)
# Easy extraction
model_stats$r.squared
model_stats$adj.r.squared
# Works great in pipelines
library(dplyr)
mtcars %>%
lm(mpg ~ wt + hp, data = .) %>%
glance() %>%
select(r.squared, adj.r.squared, AIC, BIC)
Using Metrics
The Metrics package provides standalone functions for model evaluation. It doesn’t export an rsq() function, but its rse() (relative squared error) is exactly SS_res / SS_tot, so R-squared is one minus that value:
library(Metrics)
# Actual and predicted values
actual <- mtcars$mpg
predicted <- predict(lm(mpg ~ wt + hp, data = mtcars))
# rse() returns SS_res / SS_tot, so R-squared = 1 - rse()
rsq_value <- 1 - rse(actual, predicted)
print(rsq_value)
Using caret
For machine learning workflows, caret integrates R-squared into its model evaluation:
library(caret)
# Using postResample for quick metrics
actual <- mtcars$mpg
predicted <- predict(lm(mpg ~ wt + hp, data = mtcars))
metrics <- postResample(pred = predicted, obs = actual)
print(metrics)
# Named vector: RMSE, Rsquared, MAE
# Rsquared is about 0.827 for this model; note that caret computes it
# as the squared correlation between predictions and observations
Choose based on your workflow. For quick exploration, base R is fine. For reproducible reports and pipelines, broom is excellent. For ML projects, caret or yardstick (tidymodels) integrate naturally.
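If you work in tidymodels, yardstick exposes the same metric as a plain vector function; a minimal sketch, assuming yardstick’s rsq_vec() (which, like caret, computes squared correlation):

```r
library(yardstick)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# rsq_vec() takes the observed (truth) and predicted (estimate) vectors
rsq_vec(truth = mtcars$mpg, estimate = predict(fit))
```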
Common Pitfalls and Limitations
R-squared lies. Not intentionally, but it can mislead you if you don’t understand its limitations.
Pitfall 1: Ignoring Non-Linear Relationships
Anscombe’s quartet is the classic demonstration. Four datasets with identical statistical properties—same mean, variance, correlation, and R-squared—but completely different structures:
# Anscombe's quartet is built into R
data(anscombe)
# Fit four models
models <- list(
lm(y1 ~ x1, data = anscombe),
lm(y2 ~ x2, data = anscombe),
lm(y3 ~ x3, data = anscombe),
lm(y4 ~ x4, data = anscombe)
)
# All have the same R-squared!
sapply(models, function(m) summary(m)$r.squared)
# [1] 0.6665425 0.6662420 0.6663240 0.6667073
# But look at the plots
par(mfrow = c(2, 2))
plot(anscombe$x1, anscombe$y1, main = "Dataset 1")
abline(models[[1]], col = "red")
plot(anscombe$x2, anscombe$y2, main = "Dataset 2 (curved)")
abline(models[[2]], col = "red")
plot(anscombe$x3, anscombe$y3, main = "Dataset 3 (outlier)")
abline(models[[3]], col = "red")
plot(anscombe$x4, anscombe$y4, main = "Dataset 4 (leverage point)")
abline(models[[4]], col = "red")
Dataset 1 is fine. Dataset 2 is clearly curved—linear regression is wrong. Dataset 3 has an outlier destroying the fit. Dataset 4’s entire relationship depends on one extreme point. Same R-squared, completely different stories.
Lesson: Always plot your data. R-squared alone is not sufficient.
Pitfall 2: Overfitting
Adding enough predictors will inflate R-squared even when those predictors are meaningless. We demonstrated this earlier. Use adjusted R-squared, cross-validation, or information criteria (AIC/BIC) to guard against overfitting.
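As a sketch of the information-criteria route (AIC() and BIC() ship with base R’s stats package; lower is better, and both penalize extra parameters):

```r
# Standalone data: y depends on x1 and x2 only
set.seed(123)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 10 + 5 * dat$x1 + 3 * dat$x2 + rnorm(n, sd = 5)

m_true  <- lm(y ~ x1 + x2, data = dat)       # correct specification
m_noise <- lm(y ~ x1 + x2 + x3, data = dat)  # adds a pure noise term

# Both criteria charge for the extra parameter; they usually
# (though not always) favor the correct model in cases like this
c(AIC(m_true), AIC(m_noise))
c(BIC(m_true), BIC(m_noise))
```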
Pitfall 3: Comparing Models with Different Dependent Variables
You cannot compare R-squared values between models with different outcome variables. An R-squared of 0.3 for predicting stock prices might be excellent, while 0.3 for predicting height from weight might be poor. The scale depends entirely on the inherent predictability of the outcome.
Conclusion
Calculating R-squared in R is trivially easy—summary(lm(...))$r.squared gets you there in seconds. The harder part is knowing when to use it and when it’s misleading.
Use base R’s summary() for quick exploration. Use broom::glance() when you need tidy output for reports or pipelines. Use adjusted R-squared when comparing models with different numbers of predictors. And always, always visualize your data before trusting any single metric.
R-squared answers one question: what proportion of variance does my model explain? It doesn’t tell you if your model is correct, if your predictions are useful, or if your coefficients are meaningful. Pair it with residual plots, RMSE for prediction accuracy, and domain knowledge for a complete picture.
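As a final sketch of that pairing, RMSE plus a residual plot for the mtcars model used earlier:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

r2   <- summary(fit)$r.squared
rmse <- sqrt(mean(residuals(fit)^2))  # typical prediction error, in mpg

cat("R-squared:", round(r2, 3), "| RMSE:", round(rmse, 2), "mpg\n")

# Residuals vs fitted: look for curvature or funnel shapes,
# which R-squared alone will never reveal
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```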