How to Implement Ridge Regression in R
Key Insights
- Ridge regression solves multicollinearity and overfitting by adding an L2 penalty that shrinks coefficients toward zero without eliminating them entirely, unlike LASSO which can zero out coefficients completely
- The lambda parameter controls regularization strength—higher values increase bias but reduce variance, and cross-validation is essential for finding the optimal value rather than guessing
- Always standardize your predictors before applying ridge regression since the penalty is scale-dependent, meaning variables on larger scales would be penalized less without normalization
Introduction to Ridge Regression
Ridge regression extends ordinary least squares (OLS) regression by adding a penalty term proportional to the sum of squared coefficients. This L2 regularization shrinks coefficient estimates, reducing model variance at the cost of introducing some bias. The technique excels in two scenarios: when you have multicollinearity (highly correlated predictors) that inflates coefficient variance, and when you have more predictors than observations or risk overfitting.
Unlike variable selection methods that eliminate predictors entirely, ridge regression retains all variables but constrains their coefficients. This makes it particularly valuable when you believe all predictors contribute meaningful information, even if their individual effects are difficult to isolate due to correlation.
Let’s visualize how ridge regression shrinks coefficients compared to OLS:
library(glmnet)
library(MASS)
# Generate data with multicollinearity
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, 0, 0.3) # Highly correlated with x1
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)
y <- 2*x1 + 3*x2 + 1.5*x3 + rnorm(n)
# OLS coefficients
ols_coef <- coef(lm(y ~ X))[-1]
# Ridge coefficients at several lambdas (glmnet sorts user-supplied
# lambdas into decreasing order, so index the columns accordingly)
ridge_model <- glmnet(X, y, alpha = 0, lambda = c(50, 10, 1, 0))
ridge_coefs <- as.matrix(coef(ridge_model)[-1, ])
# Compare: column 1 is lambda = 50, ..., column 4 is lambda = 0
comparison <- data.frame(
Variable = paste0("x", 1:3),
OLS = ols_coef,
Lambda_0 = ridge_coefs[, 4],
Lambda_1 = ridge_coefs[, 3],
Lambda_10 = ridge_coefs[, 2],
Lambda_50 = ridge_coefs[, 1]
)
print(round(comparison, 3))
Notice how coefficients shrink progressively as lambda increases, with the correlated variables (x1 and x2) showing more dramatic changes.
Understanding the Lambda Parameter
Lambda (λ) controls the strength of the penalty term. The ridge regression objective function minimizes:
RSS + λ × Σ βⱼ²
Where RSS is the residual sum of squares and the βⱼ are the coefficients (the intercept is not penalized). When λ = 0, you get standard OLS. As λ increases, coefficients shrink toward zero, reducing model complexity.
The bias-variance tradeoff is central here. Low lambda values produce models similar to OLS—low bias but potentially high variance. High lambda values create simpler models with more bias but lower variance. The optimal lambda balances these competing concerns.
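For centered, standardized data, this objective has a closed-form solution, β̂ = (XᵀX + λI)⁻¹Xᵀy. A minimal base-R sketch (independent of glmnet, on simulated data) makes the shrinkage concrete:

```r
# Closed-form ridge solution: beta = (X'X + lambda * I)^(-1) X'y
# (illustrative sketch on simulated, standardized data)
set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 2), n, 2))   # standardized predictors
y <- 2 * X[, 1] + rnorm(n)
y <- y - mean(y)                         # center y so no intercept is needed

ridge_solve <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

beta_ols   <- ridge_solve(X, y, 0)    # lambda = 0 reproduces OLS
beta_ridge <- ridge_solve(X, y, 10)   # positive lambda shrinks the solution
sum(beta_ridge^2) < sum(beta_ols^2)   # TRUE: the ridge L2 norm never exceeds the OLS norm
```

Adding λ to the diagonal of XᵀX also makes the system invertible even when predictors are perfectly collinear or p exceeds n, which is why ridge handles cases where OLS breaks down.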
# Demonstrate lambda's effect on coefficients
lambda_seq <- 10^seq(3, -2, length.out = 100)
ridge_path <- glmnet(X, y, alpha = 0, lambda = lambda_seq)
# Plot coefficient paths
plot(ridge_path, xvar = "lambda", label = TRUE)
title("Ridge Regression Coefficient Paths")
# Extract coefficients at specific lambdas from the fitted path
# (coef() with s = interpolates along the path, in the order supplied)
lambda_values <- c(0.01, 1, 10, 100)
for (i in seq_along(lambda_values)) {
cat(sprintf("\nLambda = %.2f:\n", lambda_values[i]))
print(round(as.matrix(coef(ridge_path, s = lambda_values[i])), 4))
}
This visualization shows how each coefficient traces a path toward zero as lambda increases, with the rate of shrinkage depending on the variable’s relationship with the response and other predictors.
Preparing Data for Ridge Regression
Ridge regression is scale-sensitive—the penalty applies equally to all coefficients regardless of the original predictor scales. A variable measured in thousands would receive less penalty than one measured in decimals, purely due to scaling. Standardization is mandatory.
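To see the scale dependence concretely, here is a small base-R sketch (using the one-dimensional closed-form ridge solution rather than glmnet, and hypothetical kilometer vs. meter units) that fits the same predictor expressed in two different units under the same penalty:

```r
# Same information, different units: ridge shrinks them differently
set.seed(42)
n <- 200
x_km <- rnorm(n)        # predictor in (hypothetical) kilometers
x_m  <- 1000 * x_km     # identical predictor, measured in meters
y <- 2 * x_km + rnorm(n)
y <- y - mean(y)

# 1-D closed-form ridge coefficient: sum(x*y) / (sum(x^2) + lambda)
ridge_1d <- function(x, y, lambda) sum(x * y) / (sum(x^2) + lambda)

b_km <- ridge_1d(x_km, y, lambda = 50)   # noticeably shrunk toward zero
b_m  <- ridge_1d(x_m,  y, lambda = 50)   # barely shrunk: sum(x_m^2) dwarfs lambda
b_km         # per-km effect from the km model, pulled toward zero
b_m * 1000   # per-km effect implied by the meter model, close to the OLS fit
```

After standardization, both versions become the identical column and receive identical shrinkage, which is exactly the point.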
# Load example data
data(mtcars)
# Create response and predictor matrix
y <- mtcars$mpg
X_raw <- mtcars[, c("cyl", "disp", "hp", "wt", "qsec")]
# Check scales before standardization
summary(X_raw)
# Standardize predictors (mean = 0, sd = 1)
X_scaled <- scale(X_raw)
# Verify standardization
colMeans(X_scaled) # Should be ~0
apply(X_scaled, 2, sd) # Should be 1
# For categorical variables, use model.matrix
mtcars$gear_factor <- factor(mtcars$gear)
X_with_categorical <- model.matrix(mpg ~ cyl + disp + hp + wt + gear_factor,
data = mtcars)[, -1] # Remove intercept
X_with_categorical <- scale(X_with_categorical)
# Train-test split
set.seed(456)
train_idx <- sample(seq_len(nrow(mtcars)), floor(0.7 * nrow(mtcars)))
X_train <- X_scaled[train_idx, ]
X_test <- X_scaled[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
The scale() function centers and scales each column. For categorical variables, model.matrix() creates dummy variables, which should also be standardized. Note that glmnet standardizes by default, but doing it explicitly gives you control and ensures consistency across your workflow.
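One caveat the workflow above glosses over: scaling the full dataset before splitting lets test-set information leak into preprocessing. A stricter variant, sketched here in base R, estimates the centers and scales on the training rows only and reuses them for the test rows:

```r
# Fit the scaler on training rows only, then apply it to the test rows
data(mtcars)
X_raw <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt", "qsec")])
set.seed(456)
train_idx <- sample(seq_len(nrow(X_raw)), floor(0.7 * nrow(X_raw)))

X_train <- scale(X_raw[train_idx, ])        # estimates center and scale
ctr <- attr(X_train, "scaled:center")       # reuse the training parameters
std <- attr(X_train, "scaled:scale")
X_test <- scale(X_raw[-train_idx, ], center = ctr, scale = std)

round(colMeans(X_train), 12)   # exactly zero by construction
round(colMeans(X_test), 2)     # close to, but not exactly, zero
```

With only 32 rows the practical difference is small here, but on real pipelines this discipline keeps test metrics honest.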
Implementing Ridge Regression with glmnet
The glmnet package provides efficient ridge regression implementation. The key parameter is alpha: set it to 0 for ridge regression (1 for LASSO, values between 0 and 1 for elastic net).
library(glmnet)
# Fit ridge regression across a sequence of lambdas
ridge_model <- glmnet(X_train, y_train, alpha = 0)
# glmnet automatically generates lambda sequence
print(ridge_model$lambda[1:10]) # First 10 lambda values
# Fit with custom lambda sequence
custom_lambdas <- 10^seq(2, -2, length.out = 100)
ridge_custom <- glmnet(X_train, y_train, alpha = 0, lambda = custom_lambdas)
# Extract coefficients for a specific lambda
coef(ridge_model, s = 1) # s specifies lambda value
# Get coefficients for multiple lambdas
coef_at_lambdas <- coef(ridge_model, s = c(0.1, 1, 10))
print(coef_at_lambdas)
# glmnet's df counts non-zero coefficients, so for ridge it always
# equals the number of predictors
ridge_model$df
Unlike LASSO, ridge regression never sets coefficients exactly to zero—all predictors remain in the model with shrunk coefficients. For that reason glmnet's df, which counts non-zero coefficients, stays at the number of predictors for every lambda; the effective number of parameters, however, does shrink as lambda grows.
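Although glmnet's df is a simple count, the effective number of parameters under ridge can be computed from the singular values dⱼ of the design matrix as df(λ) = Σⱼ dⱼ²/(dⱼ² + λ), a standard result. A base-R sketch on simulated data:

```r
# Effective degrees of freedom for ridge, from the SVD of X:
# df(lambda) = sum(d_j^2 / (d_j^2 + lambda))
set.seed(7)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
d2 <- svd(X)$d^2

ridge_edf <- function(lambda) sum(d2 / (d2 + lambda))

ridge_edf(0)     # 5: with no penalty, same as OLS with 5 predictors
ridge_edf(100)   # strictly smaller: the penalty reduces model complexity
```

As λ grows, df(λ) decays smoothly toward zero even though all five coefficients remain non-zero, which is the precise sense in which ridge "simplifies" a model without dropping variables.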
Selecting Optimal Lambda via Cross-Validation
Cross-validation systematically tests different lambda values to find the one that minimizes prediction error on held-out data. The cv.glmnet() function automates this process.
# Perform 10-fold cross-validation
set.seed(789)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
# Plot cross-validation results
plot(cv_ridge)
title("Ridge Regression Cross-Validation")
# Optimal lambda values
lambda_min <- cv_ridge$lambda.min # Minimizes CV error
lambda_1se <- cv_ridge$lambda.1se # Largest lambda within 1 SE of minimum
cat(sprintf("Lambda min: %.4f\n", lambda_min))
cat(sprintf("Lambda 1se: %.4f\n", lambda_1se))
# Coefficients at optimal lambdas
coef_min <- coef(cv_ridge, s = "lambda.min")
coef_1se <- coef(cv_ridge, s = "lambda.1se")
comparison_optimal <- data.frame(
Variable = rownames(coef_min),
Lambda_Min = as.vector(coef_min),
Lambda_1SE = as.vector(coef_1se)
)
print(comparison_optimal)
The lambda.min gives the best predictive performance, while lambda.1se provides a more parsimonious model that performs nearly as well. The one-standard-error rule is conservative and often preferred to avoid overfitting, especially with smaller datasets.
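Mechanically, lambda.1se is the largest lambda whose mean CV error lies within one standard error of the minimum. A base-R sketch with made-up CV numbers (illustrative only, not from a real fit) shows the selection rule:

```r
# One-standard-error rule on illustrative CV output:
# lambda.1se = largest lambda with cvm <= min(cvm) + cvsd at the minimum
lambda <- c(100, 50, 10, 5, 1, 0.5, 0.1)        # decreasing, as glmnet stores it
cvm    <- c(9.0, 7.5, 6.0, 5.6, 5.5, 5.7, 6.1)  # mean CV error (made up)
cvsd   <- c(0.8, 0.7, 0.6, 0.5, 0.5, 0.6, 0.7)  # its standard error (made up)

i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

lambda_min   # 1: lowest mean CV error
lambda_1se   # 10: the most regularized model still within one SE
```

Because CV error estimates are themselves noisy, picking the most regularized model statistically indistinguishable from the best one is a sensible hedge against chasing noise.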
Model Evaluation and Prediction
Evaluate ridge regression against OLS using standard regression metrics. Ridge should outperform on test data when multicollinearity or overfitting exists.
# Predictions using optimal lambda (as.vector() drops the one-column matrix)
ridge_pred_min <- as.vector(predict(cv_ridge, newx = X_test, s = "lambda.min"))
ridge_pred_1se <- as.vector(predict(cv_ridge, newx = X_test, s = "lambda.1se"))
# OLS for comparison, fit on a data frame so predict() can match columns
train_df <- data.frame(y = y_train, X_train)
ols_model <- lm(y ~ ., data = train_df)
ols_pred <- predict(ols_model, newdata = data.frame(X_test))
# Calculate metrics
calculate_metrics <- function(actual, predicted) {
rmse <- sqrt(mean((actual - predicted)^2))
mae <- mean(abs(actual - predicted))
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
return(c(RMSE = rmse, MAE = mae, R_Squared = r_squared))
}
results <- data.frame(
OLS = calculate_metrics(y_test, ols_pred),
Ridge_Min = calculate_metrics(y_test, ridge_pred_min),
Ridge_1SE = calculate_metrics(y_test, ridge_pred_1se)
)
print(round(results, 4))
Lower RMSE and MAE indicate better predictions. Ridge typically shows improvement when OLS suffers from high variance due to multicollinearity.
Practical Example: Real-World Dataset
Let’s apply ridge regression to the Boston housing dataset, which exhibits multicollinearity among predictors.
library(MASS)
data(Boston)
# Prepare data
set.seed(100)
train_idx <- sample(seq_len(nrow(Boston)), floor(0.75 * nrow(Boston)))
X <- as.matrix(Boston[, -14]) # Exclude medv (response)
y <- Boston$medv
X_train <- X[train_idx, ]
X_test <- X[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Check for multicollinearity
correlation_matrix <- cor(X_train)
high_cor <- which(abs(correlation_matrix) > 0.7 & correlation_matrix != 1,
arr.ind = TRUE)
cat("High correlations found:", nrow(high_cor)/2, "\n")
# Fit ridge with cross-validation (glmnet standardizes predictors
# internally by default, so the raw matrix is acceptable here)
cv_ridge_boston <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)
# Visualize
par(mfrow = c(1, 2))
plot(cv_ridge_boston)
plot(glmnet(X_train, y_train, alpha = 0), xvar = "lambda")
# Predictions and evaluation
ridge_pred <- as.vector(predict(cv_ridge_boston, newx = X_test, s = "lambda.min"))
# OLS comparison, fit on a data frame so predict() can match columns
boston_train <- data.frame(y = y_train, X_train)
ols_boston <- lm(y ~ ., data = boston_train)
ols_pred <- predict(ols_boston, newdata = data.frame(X_test))
# Performance comparison
performance <- data.frame(
Model = c("OLS", "Ridge"),
RMSE = c(sqrt(mean((y_test - ols_pred)^2)),
sqrt(mean((y_test - ridge_pred)^2))),
R_Squared = c(cor(y_test, ols_pred)^2,
cor(y_test, ridge_pred)^2)
)
print(performance)
# Examine coefficients
final_coefs <- coef(cv_ridge_boston, s = "lambda.min")
coef_df <- data.frame(
Variable = rownames(final_coefs),
Coefficient = as.vector(final_coefs)
)
print(coef_df[order(-abs(coef_df$Coefficient)), ])
This complete workflow demonstrates ridge regression’s value on real data. The Boston dataset’s multicollinearity makes it an ideal candidate, and ridge regression typically reduces test error compared to OLS by controlling coefficient variance.
Ridge regression is a fundamental tool for handling correlated predictors and preventing overfitting. Master it before moving to more complex regularization techniques—the concepts transfer directly to LASSO, elastic net, and beyond.