How to Implement Linear Regression in R
Key Insights
- Linear regression in R requires just one function (lm()), but proper model evaluation through diagnostic plots and statistical metrics separates good analysis from misleading conclusions
- The mtcars dataset provides an ideal sandbox for learning regression mechanics, but real-world applications demand train/test splits and careful handling of multicollinearity
- R's formula syntax (y ~ x1 + x2) makes building models intuitive, while functions like predict() and summary() handle the heavy mathematical lifting behind the scenes
Introduction to Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The fundamental form is y = mx + b, where y is your outcome, x is your predictor, m is the slope, and b is the intercept.
Use simple linear regression when examining the relationship between one predictor and an outcome—like how car weight affects fuel efficiency. Switch to multiple linear regression when you need to account for several factors simultaneously, such as weight, horsepower, and number of cylinders all influencing miles per gallon.
Linear regression works best when relationships are approximately linear, residuals are normally distributed, and predictors aren’t highly correlated with each other. It’s interpretable, fast, and serves as the foundation for understanding more complex algorithms.
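To make the y = mx + b form concrete, the slope and intercept can be computed directly from the classic least-squares formulas before ever calling lm(). A minimal sketch using toy data invented for illustration:

```r
# Toy data (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: m = cov(x, y) / var(x)
m <- cov(x, y) / var(x)
# Intercept: b = mean(y) - m * mean(x)
b <- mean(y) - m * mean(x)

c(slope = m, intercept = b)

# lm() reproduces exactly the same estimates
coef(lm(y ~ x))
```

Seeing the formulas agree with lm() demystifies what the function does under the hood.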
Setting Up Your R Environment
Base R includes everything needed for linear regression in the stats package, which loads automatically. You’ll want ggplot2 for professional visualizations.
# Install ggplot2 if you don't have it
install.packages("ggplot2")
# Load required libraries
library(ggplot2)
# Load the mtcars dataset (built into R)
data(mtcars)
# Examine the first few rows
head(mtcars)
# Check structure
str(mtcars)
The mtcars dataset contains fuel consumption and 10 aspects of automobile design for 32 cars from 1973-74. It’s perfect for learning because it’s clean, well-documented, and small enough to understand completely.
Building a Simple Linear Regression Model
The lm() function (linear model) handles regression in R. The syntax uses R’s formula notation: dependent_variable ~ independent_variable.
# Simple linear regression: mpg predicted by weight
simple_model <- lm(mpg ~ wt, data = mtcars)
# View the model summary
summary(simple_model)
The output shows several critical pieces:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The coefficient for wt is -5.34, meaning each 1,000-pound increase in weight decreases fuel efficiency by 5.34 mpg. The intercept (37.29) represents the theoretical mpg for a weightless car—not physically meaningful but mathematically necessary.
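Rather than reading these numbers off the printed summary, you can extract them programmatically with base R helpers like coef(), confint(), and the coefficient matrix inside summary():

```r
simple_model <- lm(mpg ~ wt, data = mtcars)

# Named vector: intercept and slope
coef(simple_model)

# 95% confidence intervals for each coefficient
confint(simple_model)

# Full table (estimates, std. errors, t values, p values) as a matrix
summary(simple_model)$coefficients
```

This matters once you start looping over models or building reports, where parsing printed output is fragile.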
Evaluating Model Performance
R-squared (0.7528) indicates that weight explains about 75% of the variance in fuel efficiency. The p-value (1.29e-10) for weight is far below 0.05, confirming statistical significance.
But statistics alone don’t validate a model. Check assumptions visually:
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(simple_model)
par(mfrow = c(1, 1))
These four plots reveal:
- Residuals vs Fitted: Should show random scatter. Patterns indicate non-linearity.
- Q-Q Plot: Points should follow the diagonal line, confirming normal distribution of residuals.
- Scale-Location: Tests homoscedasticity (constant variance). Horizontal line is ideal.
- Residuals vs Leverage: Identifies influential outliers (Cook’s distance > 0.5 is concerning).
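The influence measures behind that fourth plot are also available as plain numbers. A quick sketch for flagging observations with large Cook's distance:

```r
simple_model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for every observation
cooks <- cooks.distance(simple_model)

# Flag cars above the 0.5 guideline shown in the plot
cooks[cooks > 0.5]

# Standardized residuals help spot outliers as well
head(sort(abs(rstandard(simple_model)), decreasing = TRUE), 3)
```

Working with the raw values lets you name the influential cars instead of squinting at point labels.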
For the weight-mpg model, you’ll notice some curvature in the residuals plot, suggesting a non-linear relationship might fit better—but the model is still reasonable.
Multiple Linear Regression
Real-world scenarios rarely involve single predictors. Add variables with the + operator:
# Multiple regression with weight, horsepower, and cylinders
multi_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(multi_model)
Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
wt -3.16697 0.74058 -4.276 0.000199 ***
hp -0.01804 0.01188 -1.519 0.140015
cyl -0.94162 0.55092 -1.709 0.098480 .
Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
R-squared improved to 0.84, but notice horsepower isn’t statistically significant (p = 0.14). This might indicate multicollinearity—when predictors correlate with each other.
Check correlations:
# Correlation matrix for predictors
cor(mtcars[, c("wt", "hp", "cyl")])
High correlations (> 0.7) between predictors can inflate standard errors and make coefficients unreliable. Consider removing redundant variables or using regularization techniques like ridge regression.
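If multicollinearity persists, ridge regression shrinks correlated coefficients rather than dropping variables. A minimal sketch using the glmnet package (assumes glmnet is installed; alpha = 0 selects the ridge penalty):

```r
# install.packages("glmnet")  # if not already installed
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "cyl")])
y <- mtcars$mpg

# Cross-validation chooses the penalty strength lambda
set.seed(123)
ridge_cv <- cv.glmnet(x, y, alpha = 0)

# Coefficients at the lambda with lowest cross-validated error
coef(ridge_cv, s = "lambda.min")
```

Note that glmnet requires a numeric matrix of predictors rather than a formula, which is why the data is converted first.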
Categorical variables work seamlessly when converted to factors:
# Convert cyl to factor (treating it as categories, not continuous)
mtcars$cyl_factor <- as.factor(mtcars$cyl)
categorical_model <- lm(mpg ~ wt + cyl_factor, data = mtcars)
summary(categorical_model)
R automatically creates dummy variables for each category.
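You can inspect that dummy coding directly with model.matrix(); the first factor level (here, 4 cylinders) becomes the reference category absorbed into the intercept:

```r
mtcars$cyl_factor <- as.factor(mtcars$cyl)

# Design matrix: intercept, wt, and indicator columns for cyl 6 and cyl 8
head(model.matrix(mpg ~ wt + cyl_factor, data = mtcars))
```

Each cyl_factor coefficient is then interpreted as the mpg difference relative to 4-cylinder cars at equal weight.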
Making Predictions with Your Model
The predict() function applies your model to new data:
# Create new data for prediction
new_cars <- data.frame(
wt = c(2.5, 3.0, 3.5),
hp = c(100, 120, 140),
cyl = c(4, 6, 6)
)
# Predict mpg
predictions <- predict(multi_model, newdata = new_cars,
interval = "prediction", level = 0.95)
print(predictions)
The interval = "prediction" argument provides 95% prediction intervals, showing the range where you expect individual observations to fall. Use interval = "confidence" for the range where the mean prediction falls—confidence intervals are narrower.
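The difference between the two interval types is easy to see side by side; a sketch comparing their widths for a single hypothetical car:

```r
multi_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
one_car <- data.frame(wt = 3.0, hp = 120, cyl = 6)

pred_int <- predict(multi_model, newdata = one_car,
                    interval = "prediction", level = 0.95)
conf_int <- predict(multi_model, newdata = one_car,
                    interval = "confidence", level = 0.95)

# Prediction intervals cover individual cars; confidence intervals
# cover the mean response, so the former are always wider
pred_int[, "upr"] - pred_int[, "lwr"]
conf_int[, "upr"] - conf_int[, "lwr"]
```

Reporting the wrong interval type is a common mistake: use prediction intervals when forecasting a specific car's mpg.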
Visualize predictions against actual data:
# Create prediction data across weight range
wt_range <- data.frame(
wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100),
hp = mean(mtcars$hp),
cyl = median(mtcars$cyl)
)
wt_range$predicted_mpg <- predict(multi_model, newdata = wt_range)
# Plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, alpha = 0.6) +
geom_line(data = wt_range, aes(y = predicted_mpg),
color = "blue", linewidth = 1) +
labs(title = "MPG vs Weight with Regression Line",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_minimal()
Best Practices and Common Pitfalls
Always split your data into training and testing sets. The model should be built on training data and evaluated on unseen test data:
# Set seed for reproducibility
set.seed(123)
# Create 70/30 split
train_indices <- sample(1:nrow(mtcars), floor(0.7 * nrow(mtcars)))
train_data <- mtcars[train_indices, ]
test_data <- mtcars[-train_indices, ]
# Build model on training data
train_model <- lm(mpg ~ wt + hp, data = train_data)
# Evaluate on test data
test_predictions <- predict(train_model, newdata = test_data)
test_rmse <- sqrt(mean((test_data$mpg - test_predictions)^2))
print(paste("Test RMSE:", round(test_rmse, 2)))
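With only 32 rows, a single 70/30 split gives a noisy RMSE estimate; k-fold cross-validation averages over several splits instead. A base-R sketch with 5 folds (no extra packages needed):

```r
set.seed(123)
k <- 5
# Assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))
})

# Average RMSE across folds is a steadier performance estimate
mean(rmse_per_fold)
```

Each row serves as test data exactly once, so no observation is wasted on such a small dataset.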
Detect multicollinearity using Variance Inflation Factor (VIF):
# Install car package for VIF
install.packages("car")
library(car)
vif(multi_model)
VIF > 5 indicates problematic multicollinearity. Remove or combine correlated predictors.
Feature scaling isn’t required for linear regression coefficients to be correct, but it helps with interpretation when predictors have vastly different scales. Use scale() to standardize:
mtcars_scaled <- mtcars
mtcars_scaled[, c("wt", "hp")] <- scale(mtcars[, c("wt", "hp")])
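After scaling, coefficients become directly comparable: each is the expected change in mpg per one standard deviation of that predictor. A sketch contrasting raw and standardized slopes:

```r
mtcars_scaled <- mtcars
mtcars_scaled[, c("wt", "hp")] <- scale(mtcars[, c("wt", "hp")])

raw_fit    <- lm(mpg ~ wt + hp, data = mtcars)
scaled_fit <- lm(mpg ~ wt + hp, data = mtcars_scaled)

# Raw slopes are per 1,000 lbs and per horsepower, so their magnitudes
# can't be compared; scaled slopes are per standard deviation
coef(raw_fit)
coef(scaled_fit)
```

The fit itself is unchanged: each scaled slope equals the raw slope multiplied by that predictor's standard deviation.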
Know when to move on. If residual plots show clear non-linearity, consider polynomial regression (lm(mpg ~ poly(wt, 2))), generalized additive models (GAMs), or tree-based methods. If you have many predictors, regularization techniques (LASSO, ridge) prevent overfitting better than standard linear regression.
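The curvature noted in the earlier diagnostics can be tested formally; a sketch comparing the straight-line fit against a quadratic term with anova():

```r
linear_fit <- lm(mpg ~ wt, data = mtcars)
poly_fit   <- lm(mpg ~ poly(wt, 2), data = mtcars)

# F-test: does the quadratic term significantly improve fit?
anova(linear_fit, poly_fit)

# Adjusted R-squared for each model
summary(linear_fit)$adj.r.squared
summary(poly_fit)$adj.r.squared
```

If the F-test is significant and adjusted R-squared rises, the quadratic term is earning its keep; otherwise prefer the simpler model.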
Linear regression’s simplicity is both its strength and limitation. Master these fundamentals in R, and you’ll have a solid foundation for understanding virtually every supervised learning algorithm that follows.