How to Perform Linear Regression in R


Key Insights

  • Linear regression in R requires just one function (lm()), but understanding the output and validating assumptions separates useful analysis from misleading conclusions.
  • Always perform exploratory data analysis before fitting a model—visualizing relationships helps you catch non-linear patterns and outliers that could invalidate your results.
  • The summary() output tells you everything you need: coefficient estimates, their significance, and how much variance your model explains through R-squared.

Introduction to Linear Regression

Linear regression models the relationship between a dependent variable (what you’re trying to predict) and one or more independent variables (your predictors). The goal is finding the “line of best fit”—the straight line that minimizes the distance between predicted and actual values.

The equation is straightforward: y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ is the slope coefficient, and ε represents error. When you have one predictor, it’s simple linear regression. Multiple predictors? Multiple linear regression.
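The OLS estimates have closed forms you can verify in a few lines: the slope is cov(x, y) / var(x), and the intercept follows from the means. A minimal sketch with simulated data (the variable names are illustrative):

```r
# Simulate data from y = 2 + 3x + noise, then recover the coefficients
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10)

# Closed-form OLS estimates
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# These match what lm() computes
c(b0, b1)
coef(lm(y ~ x))
```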

Use linear regression when you need to:

  • Predict continuous outcomes (sales, temperature, prices)
  • Understand the strength and direction of relationships between variables
  • Quantify how much change in X affects Y
  • Control for confounding variables in observational studies

Linear regression assumes a linear relationship between variables, normally distributed residuals, constant variance (homoscedasticity), and independence of observations. Violating these assumptions doesn’t always break your analysis, but it can lead to unreliable conclusions.

Setting Up Your R Environment

R’s stats package includes everything you need for linear regression—it loads automatically. For better visualizations and tidier output, add ggplot2 and broom.

# Install optional packages (run once)
install.packages(c("ggplot2", "broom"))

# Load packages
library(ggplot2)
library(broom)

We’ll use the built-in mtcars dataset, which contains fuel consumption and vehicle characteristics for 32 automobiles. Start by understanding your data structure:

# Load and inspect the data
data(mtcars)

# First few rows (output truncated)
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# ...

# Data structure (output truncated)
str(mtcars)
# 'data.frame':	32 obs. of  11 variables:
#  $ mpg : num  21 21 22.8 21.4 18.7 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  ...

# Quick summary statistics
summary(mtcars[, c("mpg", "wt", "hp")])

For this tutorial, we’ll predict miles per gallon (mpg) using vehicle weight (wt). This relationship is intuitive—heavier cars typically consume more fuel.

Exploratory Data Analysis

Never fit a model blind. Exploring your data first reveals whether linear regression is appropriate and highlights potential problems.

Start with correlation analysis:

# Correlation between mpg and weight
cor(mtcars$mpg, mtcars$wt)
# [1] -0.8676594

# Correlation matrix for multiple variables
cor(mtcars[, c("mpg", "wt", "hp", "disp")])
#            mpg         wt         hp       disp
# mpg   1.0000000 -0.8676594 -0.7761684 -0.8475514
# wt   -0.8676594  1.0000000  0.6587479  0.8879799

The correlation of -0.87 indicates a strong negative relationship—as weight increases, fuel efficiency decreases. This is promising for linear regression.

Now visualize the relationship:

# Base R scatter plot
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs. Vehicle Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "steelblue")

# ggplot2 version with trend line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(title = "MPG vs. Vehicle Weight",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()

The scatter plot confirms a clear linear trend. No obvious curves, clusters, or extreme outliers that would complicate modeling. We’re ready to proceed.

Building a Simple Linear Regression Model

R’s lm() function fits linear models using a formula interface. The syntax is dependent_variable ~ independent_variable, read as “mpg modeled by weight.”

# Fit the linear regression model
model <- lm(mpg ~ wt, data = mtcars)

# View basic model information
model
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
# 
# Coefficients:
# (Intercept)           wt  
#      37.285       -5.344

This output gives you the regression equation: mpg = 37.285 - 5.344 × wt. For every 1,000-pound increase in vehicle weight, fuel efficiency drops by approximately 5.3 miles per gallon. The intercept (37.3 mpg for a hypothetical weightless car) is an extrapolation far outside the observed data, so don't interpret it literally.

For multiple predictors, extend the formula:

# Multiple linear regression
model_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)
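Before trusting the bigger model, compare the two. One sketch of a comparison uses adjusted R-squared and a nested-model F-test via anova():

```r
# Fit both models on the built-in mtcars data
model <- lm(mpg ~ wt, data = mtcars)
model_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Adjusted R-squared penalizes extra predictors
summary(model)$adj.r.squared
summary(model_multi)$adj.r.squared

# Nested-model F-test: do hp and disp improve the fit beyond wt alone?
anova(model, model_multi)
```

A higher adjusted R-squared and a significant F-test together suggest the extra predictors earn their keep; otherwise the simpler model is usually preferable.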

Interpreting Model Output

The summary() function provides everything you need to evaluate your model:

summary(model)
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.5432 -2.3647 -0.1252  1.4096  6.8727 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
# wt           -5.3445     0.5591  -9.559 1.29e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 3.046 on 30 degrees of freedom
# Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
# F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Here’s what matters:

Coefficients Table: The Estimate column shows your regression coefficients. The Pr(>|t|) column shows p-values—values below 0.05 indicate statistical significance. Both our intercept and weight coefficient are highly significant (p < 0.001).

R-squared: 0.7528 means weight explains 75.28% of the variance in fuel efficiency. That’s strong explanatory power for a single predictor. Adjusted R-squared (0.7446) penalizes the addition of predictors—use it when comparing models with different numbers of variables.

F-statistic: Tests whether the model as a whole is significant. A p-value of 1.29e-10 confirms our model is significantly better than a model with no predictors.
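p-values alone don't convey the range of plausible effect sizes. confint() returns 95% confidence intervals for each coefficient:

```r
model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and slope
confint(model)
```

The interval for wt stays well below zero, consistent with the strongly significant negative slope in the summary output.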

Extract specific values programmatically:

# Get coefficients
coef(model)
# (Intercept)          wt 
#   37.285126   -5.344472

# Get R-squared
summary(model)$r.squared
# [1] 0.7528328

# Tidy output with broom
tidy(model)
# # A tibble: 2 × 5
#   term        estimate std.error statistic  p.value
#   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 (Intercept)    37.3      1.88      19.9  8.24e-19
# 2 wt             -5.34     0.559     -9.56 1.29e-10
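broom's glance() complements tidy(): it returns a one-row data frame of model-level statistics (R-squared, AIC, residual standard error, and so on), which is handy when you're comparing many candidate models:

```r
# Requires the broom package loaded earlier
library(broom)

# One-row summary of model-level statistics
glance(lm(mpg ~ wt, data = mtcars))
```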

Model Diagnostics and Validation

A statistically significant model isn’t necessarily a valid one. Check the underlying assumptions with diagnostic plots:

# Generate four diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))

These four plots reveal:

  1. Residuals vs. Fitted: Points should scatter randomly around zero with no pattern. A curve suggests non-linearity; a funnel shape indicates heteroscedasticity.

  2. Normal Q-Q: Points should follow the diagonal line. Deviations at the ends suggest non-normal residuals.

  3. Scale-Location: Checks for constant variance. The red line should be roughly horizontal with randomly scattered points.

  4. Residuals vs. Leverage: Identifies influential observations. Points outside Cook’s distance lines (dashed) may be unduly affecting your model.
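To follow up on the fourth plot numerically, cooks.distance() returns each observation's influence. A common rule of thumb flags points above 4/n, though that threshold is a convention, not a hard cutoff:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for every observation
cooks_d <- cooks.distance(model)

# Flag observations above the 4/n rule of thumb
threshold <- 4 / nrow(mtcars)
names(cooks_d[cooks_d > threshold])
```

Flagged cars deserve a closer look—not automatic removal. An influential point may simply be a legitimate vehicle at the edge of the weight range.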

For deeper residual analysis:

# Extract residuals (named "res" to avoid masking the base resid() function)
res <- residuals(model)

# Shapiro-Wilk test for normality
shapiro.test(res)
# W = 0.94508, p-value = 0.1044

# Fitted values
fitted_vals <- fitted(model)

# Manual residual plot
plot(fitted_vals, res,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, col = "red", lty = 2)

The Shapiro-Wilk test p-value of 0.10 exceeds 0.05, so we can’t reject the null hypothesis of normality. Our residuals appear normally distributed.
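The Shapiro-Wilk test covers normality; constant variance deserves its own check. One sketch of a Breusch-Pagan-style check in base R (the lmtest package's bptest() does this more rigorously) is to regress the squared residuals on the fitted values and inspect the slope:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Auxiliary regression: squared residuals on fitted values.
# A significant slope would suggest non-constant variance.
aux <- lm(residuals(model)^2 ~ fitted(model))
summary(aux)$coefficients
```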

Making Predictions

Once validated, use your model to predict outcomes for new data:

# Create new data for prediction
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5, 4.0))

# Point predictions
predict(model, newdata = new_cars)
#        1        2        3        4 
# 23.92380 21.25171 18.57962 15.90753

# Confidence intervals (uncertainty in the mean prediction)
predict(model, newdata = new_cars, interval = "confidence")
#        fit      lwr      upr
# 1 23.92380 22.55310 25.29450
# 2 21.25171 20.12444 22.37898
# 3 18.57962 17.44762 19.71163
# 4 15.90753 14.48629 17.32878

# Prediction intervals (uncertainty for individual observations)
predict(model, newdata = new_cars, interval = "prediction")
#        fit      lwr      upr
# 1 23.92380 17.55702 30.29058
# 2 21.25171 14.92987 27.57355
# 3 18.57962 12.24850 24.91075
# 4 15.90753  9.48458 22.33049

The difference matters: confidence intervals estimate uncertainty around the mean response at a given X value. Prediction intervals estimate uncertainty for a single new observation—they’re always wider because they account for both model uncertainty and individual variation.

For a car weighing 3,000 pounds, we predict 21.25 mpg. We’re 95% confident the true mean for all cars at this weight falls between 20.12 and 22.38 mpg. For any individual car at this weight, we expect mpg between 14.93 and 27.57.
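You can verify predict() against the regression equation by hand—plug the weight directly into the fitted coefficients:

```r
model <- lm(mpg ~ wt, data = mtcars)
b <- coef(model)

# mpg = 37.285 - 5.344 * wt, evaluated at wt = 3.0
manual <- unname(b["(Intercept)"] + b["wt"] * 3.0)
manual
# [1] 21.25171

# Identical to predict(model, newdata = data.frame(wt = 3.0))
```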

Visualize predictions with your data:

# Add predictions to original data
mtcars$predicted <- predict(model)

# Plot with prediction line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3) +
  geom_line(aes(y = predicted), color = "red", linewidth = 1) +
  labs(title = "Linear Regression: MPG vs. Weight",
       subtitle = paste("R² =", round(summary(model)$r.squared, 3))) +
  theme_minimal()

Linear regression in R is straightforward once you understand the workflow: explore your data, fit the model with lm(), interpret summary() output, validate assumptions with diagnostics, and make predictions. The hard part isn’t the code—it’s knowing when your model is telling you something real versus when it’s misleading you. Always check those diagnostic plots.
