How to Perform Multiple Linear Regression in R
Key Insights
- Multiple linear regression in R requires just one function (lm()), but building a reliable model demands rigorous assumption checking—skipping diagnostics is the fastest path to misleading results.
- The summary() output contains everything you need: coefficients tell you effect sizes, p-values indicate statistical significance, and adjusted R-squared measures explanatory power while penalizing unnecessary predictors.
- Always check for multicollinearity using VIF scores before interpreting coefficients—correlated predictors inflate standard errors and make individual effects unreliable.
Introduction to Multiple Linear Regression
Multiple linear regression (MLR) extends simple linear regression to model relationships between one continuous outcome variable and two or more predictor variables. The fundamental equation is:
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon$$
Use MLR when you need to understand how multiple factors simultaneously influence an outcome, control for confounding variables, or predict values based on several inputs. Common applications include predicting house prices from square footage, bedrooms, and location; estimating salary based on experience, education, and industry; or modeling patient outcomes from multiple clinical measurements.
Before building any model, understand the four key assumptions:
- Linearity: The relationship between predictors and outcome is linear
- Independence: Observations are independent of each other
- Homoscedasticity: Residuals have constant variance across all fitted values
- Normality: Residuals are approximately normally distributed
Violating these assumptions doesn’t necessarily invalidate your analysis, but it affects how you interpret results and may require transformations or alternative methods.
Preparing Your Data
Start by loading your data and conducting exploratory analysis. I’ll use the built-in mtcars dataset to predict fuel efficiency (mpg) from vehicle characteristics.
# Load the dataset
data(mtcars)
# Examine structure and summary statistics
str(mtcars)
summary(mtcars)
# Check for missing values
colSums(is.na(mtcars))
# Select predictors of interest
df <- mtcars[, c("mpg", "wt", "hp", "disp", "cyl")]
Before modeling, examine relationships between variables using a correlation matrix and scatterplot matrix:
# Correlation matrix
cor_matrix <- cor(df)
round(cor_matrix, 2)
# Visual correlation matrix (requires corrplot package)
# install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix, method = "number", type = "upper")
# Scatterplot matrix for visual inspection
pairs(df,
main = "Scatterplot Matrix",
pch = 19,
col = "steelblue")
The correlation matrix reveals potential issues early. In mtcars, you’ll notice disp, cyl, wt, and hp are highly correlated with each other—a warning sign for multicollinearity that we’ll address later.
Look for absolute correlations above 0.7 between predictors. Such pairs are likely to cause problems in your regression model, inflating standard errors and making coefficient interpretation unreliable.
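You can also flag the offending pairs programmatically rather than scanning the matrix by eye. A small base-R sketch (the 0.7 cutoff is a rule of thumb, not a hard rule):

```r
# Flag predictor pairs with |r| above 0.7 (outcome mpg excluded)
df <- mtcars[, c("mpg", "wt", "hp", "disp", "cyl")]
cm <- cor(df[, setdiff(names(df), "mpg")])
cm[lower.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, 1]],
                         var2 = colnames(cm)[idx[, 2]],
                         r    = round(cm[idx], 2))
high_pairs
```

In mtcars, every pair among wt, disp, and cyl (plus hp with disp and cyl) clears the threshold, which previews the multicollinearity issue addressed below.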
Building the Regression Model
R’s lm() function fits linear models using ordinary least squares. The formula syntax uses ~ to separate the outcome from predictors, with + joining multiple predictors.
# Fit the multiple regression model
model <- lm(mpg ~ wt + hp + disp, data = df)
# View the full summary
summary(model)
The output looks like this:
Call:
lm(formula = mpg ~ wt + hp + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.891 -1.640 -0.578 1.119 5.861
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
wt -3.800891 1.066191 -3.565 0.00133 **
hp -0.031157 0.011436 -2.724 0.01097 *
disp -0.000937 0.010350 -0.091 0.92851
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.639 on 28 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
F-statistic: 44.57 on 3 and 28 DF, p-value: 6.379e-11
Each coefficient represents the expected change in mpg for a one-unit increase in that predictor, holding all other predictors constant. Here, each additional 1000 lbs of weight (wt) decreases fuel efficiency by about 3.8 mpg.
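To make "holding all other predictors constant" concrete, you can reproduce a fitted value by hand from the coefficient vector. A quick sketch using the first row of mtcars:

```r
# Reproduce the first fitted value (Mazda RX4) by hand from the coefficients
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
b <- coef(model)
x <- mtcars[1, ]
manual <- b["(Intercept)"] + b["wt"] * x$wt + b["hp"] * x$hp + b["disp"] * x$disp
all.equal(unname(manual), unname(fitted(model)[1]))  # TRUE
```

This is exactly the fundamental equation from the introduction with the estimated betas plugged in.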
Evaluating Model Assumptions
Never skip diagnostics. R provides built-in diagnostic plots that reveal assumption violations:
# Generate diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
This produces four plots:
- Residuals vs Fitted: Check for linearity and homoscedasticity. You want random scatter around zero with no patterns.
- Normal Q-Q: Assess normality of residuals. Points should follow the diagonal line.
- Scale-Location: Another homoscedasticity check. Look for horizontal spread.
- Residuals vs Leverage: Identify influential observations. Watch for points outside Cook’s distance lines.
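To follow up on the Residuals vs Leverage plot numerically, cooks.distance() returns each observation's influence directly. A common (though arbitrary) flag is any value above 4/n:

```r
# Cook's distance for each observation; flag values above 4/n
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
cd <- cooks.distance(model)
sort(cd[cd > 4 / nrow(mtcars)], decreasing = TRUE)
```

Flagged points deserve a second look, not automatic removal—investigate whether they are data errors or genuinely unusual cars.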
For formal testing, use the Shapiro-Wilk test for normality:
# Test normality of residuals
shapiro.test(residuals(model))
A p-value above 0.05 suggests residuals are approximately normal. However, with larger samples, rely more on Q-Q plots than formal tests—Shapiro-Wilk becomes overly sensitive with more data.
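Homoscedasticity also has a formal test. In practice you would typically reach for bptest() in the lmtest package; the sketch below instead hand-rolls the studentized (Koenker) variant of the Breusch-Pagan test in base R so no extra package is needed:

```r
# Studentized Breusch-Pagan check in base R:
# regress squared residuals on the predictors; LM statistic = n * R-squared
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
aux <- lm(residuals(model)^2 ~ wt + hp + disp, data = mtcars)
lm_stat <- nrow(mtcars) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 3, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)
```

As with Shapiro-Wilk, a p-value above 0.05 is consistent with constant variance; treat it as a complement to the Scale-Location plot, not a replacement.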
Check multicollinearity using Variance Inflation Factors (VIF):
# Install car package if needed
# install.packages("car")
library(car)
# Calculate VIF
vif(model)
VIF values above 5 indicate moderate multicollinearity; above 10 suggests serious problems. In our model, disp likely shows high VIF due to its correlation with wt and hp. Consider removing highly collinear predictors or using techniques like principal component regression.
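VIF is also simple to compute by hand, which demystifies the number: regress each predictor on the others and take 1/(1 − R²). A base-R sketch for wt:

```r
# VIF by hand: regress wt on the other predictors, then 1 / (1 - R^2)
model_wt <- lm(wt ~ hp + disp, data = mtcars)
vif_wt <- 1 / (1 - summary(model_wt)$r.squared)
round(vif_wt, 2)
```

A high R² in this auxiliary regression means the predictor is largely redundant given the others, which is exactly what inflates its standard error in the main model.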
Interpreting Results
Let’s break down the key statistics from summary(model):
Coefficients: The Estimate column shows effect sizes. For wt, the coefficient of -3.80 means each additional 1000 lbs reduces mpg by 3.8, controlling for horsepower and displacement.
Standard Errors and t-values: Smaller standard errors relative to estimates produce larger t-values and smaller p-values. The wt coefficient has a t-value of -3.565, indicating it’s significantly different from zero.
P-values: Values below 0.05 (or your chosen threshold) indicate statistical significance. Notice disp has a p-value of 0.93—it adds no explanatory power beyond what wt and hp already provide.
R-squared vs Adjusted R-squared: R-squared (0.8268) measures variance explained, but always increases when adding predictors. Adjusted R-squared (0.8083) penalizes model complexity, making it better for comparing models with different numbers of predictors.
F-statistic: Tests whether the model as a whole is significant. A small p-value (here, 6.379e-11) indicates at least one predictor has a non-zero effect.
Get confidence intervals for coefficients:
# 95% confidence intervals
confint(model, level = 0.95)
# ANOVA table for sequential tests
anova(model)
The ANOVA table shows each predictor’s contribution when added sequentially. This order matters—the first predictor gets credit for shared variance with later predictors.
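You can see this order dependence directly by refitting with the predictors reversed; the per-predictor sums of squares change even though the overall fit is identical:

```r
# The same model with predictors entered in reverse order
m1 <- lm(mpg ~ wt + hp + disp, data = mtcars)
m2 <- lm(mpg ~ disp + hp + wt, data = mtcars)
anova(m1)["wt", "Sum Sq"]  # wt entered first: credited with shared variance
anova(m2)["wt", "Sum Sq"]  # wt entered last: only its unique contribution
```

If you want order-independent (Type II or III) tests, the car package's Anova() function provides them.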
Making Predictions
Use your model to predict values for new observations:
# Create new data for prediction
new_cars <- data.frame(
wt = c(2.5, 3.0, 3.5),
hp = c(100, 150, 200),
disp = c(150, 200, 250)
)
# Point predictions
predict(model, newdata = new_cars)
# Predictions with confidence intervals (mean response)
predict(model, newdata = new_cars, interval = "confidence", level = 0.95)
# Predictions with prediction intervals (individual observations)
predict(model, newdata = new_cars, interval = "prediction", level = 0.95)
Confidence intervals estimate uncertainty around the mean response for given predictor values. Prediction intervals are wider because they account for both estimation uncertainty and individual variation—use these when predicting specific observations.
Model Refinement
The non-significant disp coefficient suggests our model might benefit from simplification. Use stepwise selection to automate predictor selection:
# Start with full model
full_model <- lm(mpg ~ wt + hp + disp + cyl, data = df)
# Backward stepwise selection using AIC
step_model <- step(full_model, direction = "backward")
# View selected model
summary(step_model)
Compare models using information criteria:
# Compare AIC (lower is better)
AIC(model, step_model)
# Compare BIC (penalizes complexity more heavily)
BIC(model, step_model)
# Formal comparison with ANOVA (for nested models)
reduced_model <- lm(mpg ~ wt + hp, data = df)
anova(reduced_model, model)
A non-significant ANOVA comparison (p > 0.05) suggests the simpler model is adequate—the additional predictor doesn’t improve fit enough to justify its inclusion.
For production use, consider these refinements:
# Final recommended model
final_model <- lm(mpg ~ wt + hp, data = df)
summary(final_model)
# Verify assumptions on final model
par(mfrow = c(2, 2))
plot(final_model)
vif(final_model)
This streamlined model explains nearly the same variance with fewer predictors, making it more interpretable and less prone to overfitting. The VIF values will also improve with the removal of collinear predictors.
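For a more honest estimate of predictive performance than in-sample R-squared, you can cross-validate the final model. A base-R sketch of 5-fold cross-validation using RMSE (the fold count and seed are arbitrary choices):

```r
# 5-fold cross-validation of the final model (fold count and seed are arbitrary)
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
rmse <- sapply(1:k, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})
mean(rmse)  # average out-of-fold RMSE, in mpg units
```

With only 32 observations the fold-to-fold estimates will be noisy, but the average out-of-fold error is still a more realistic gauge of how the model will predict new cars than any in-sample statistic.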
Multiple linear regression in R is straightforward to implement but requires careful attention to assumptions and diagnostics. Always explore your data before modeling, check assumptions after fitting, and consider simpler models when complex ones don’t add value. The goal isn’t the highest R-squared—it’s a reliable, interpretable model that answers your research question.