How to Implement Lasso Regression in R
Key Insights
- Lasso regression automatically performs feature selection by shrinking irrelevant coefficients to exactly zero, making it superior to Ridge regression when you need interpretable models with fewer predictors
- The glmnet package requires matrix input and standardizes features by default, but you must manually create dummy variables for categorical predictors using model.matrix()
- Always compare lambda.min (lowest CV error) with lambda.1se (most regularized model within one standard error) because the latter often provides better generalization with fewer features
Introduction to Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) regression adds an L1 penalty term to ordinary least squares regression. The key difference from Ridge regression is mathematical: Lasso uses the absolute value of coefficients (L1 norm) while Ridge uses squared coefficients (L2 norm). This seemingly small distinction has major practical implications.
The Lasso objective function minimizes: RSS + λ∑|βj|, where λ controls regularization strength. This penalty forces some coefficients to exactly zero, effectively performing automatic feature selection. Ridge regression, by contrast, only shrinks coefficients toward zero but never eliminates them entirely.
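The objective above can be computed directly in base R. A minimal sketch, with toy data and coefficients invented purely for illustration:

```r
# Lasso objective: RSS + lambda * sum(|beta_j|)
lasso_objective <- function(X, y, beta, lambda) {
  residuals <- y - X %*% beta
  rss <- sum(residuals^2)          # residual sum of squares
  penalty <- lambda * sum(abs(beta))  # L1 penalty
  rss + penalty
}

# Toy example: 3 observations, 2 predictors (values are illustrative)
X <- matrix(c(1, 2, 3,
              0, 1, 1), nrow = 3)
y <- c(1, 3, 4)
beta <- c(1, 1)

lasso_objective(X, y, beta, lambda = 0.5)  # fit is exact here, so RSS = 0 and the result is 0.5 * 2 = 1
```

Swapping `sum(abs(beta))` for `sum(beta^2)` gives the Ridge penalty, which makes the contrast between the two objectives concrete.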
Use Lasso when you suspect many features are irrelevant or when model interpretability matters. It excels with high-dimensional data where the number of predictors approaches or exceeds the number of observations. If you believe all features contribute meaningfully, Ridge or Elastic Net might perform better.
Setting Up Your R Environment
The glmnet package is the gold standard for Lasso regression in R. Install it alongside caret for additional utilities:
# Install required packages
install.packages(c("glmnet", "caret"))
# Load libraries
library(glmnet)
library(caret)
# Load dataset - using Boston housing data
library(MASS)
data(Boston)
# Examine the data structure
str(Boston)
head(Boston)
# Check for missing values
sum(is.na(Boston))
The Boston housing dataset contains 506 observations with 13 predictors and median home value as the target variable. It’s ideal for demonstrating Lasso because it includes correlated features where regularization provides clear benefits.
Data Preparation and Splitting
Lasso regression requires predictor variables in matrix format. The model.matrix() function handles this conversion and automatically creates dummy variables for factors:
# Separate predictors and response
x <- model.matrix(medv ~ ., Boston)[, -1] # Remove intercept column
y <- Boston$medv
# Check dimensions
dim(x) # Should show 506 rows, 13 columns
# Create train/test split (80/20)
set.seed(123)
train_indices <- sample(1:nrow(x), floor(0.8 * nrow(x)))
test_indices <- setdiff(1:nrow(x), train_indices)
x_train <- x[train_indices, ]
y_train <- y[train_indices]
x_test <- x[test_indices, ]
y_test <- y[test_indices]
# Verify split proportions
length(train_indices) / nrow(x) # Should be ~0.8
Note that glmnet standardizes features internally by default (standardize = TRUE), so manual scaling isn’t necessary. However, if you need to scale manually for other purposes, use scale():
# Manual scaling (optional with glmnet)
x_train_scaled <- scale(x_train)
x_test_scaled <- scale(x_test,
center = attr(x_train_scaled, "scaled:center"),
scale = attr(x_train_scaled, "scaled:scale"))
Building the Lasso Model
The glmnet() function requires setting alpha = 1 for Lasso. Use cv.glmnet() for cross-validation to find the optimal lambda:
# Fit Lasso model with cross-validation
set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
# View the cross-validation object
print(cv_lasso)
# Extract optimal lambda values
lambda_min <- cv_lasso$lambda.min
lambda_1se <- cv_lasso$lambda.1se
cat("Lambda min:", lambda_min, "\n")
cat("Lambda 1se:", lambda_1se, "\n")
# Fit final model using optimal lambda
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min)
The cv.glmnet() function performs k-fold cross-validation (default k=10) across a range of lambda values. It returns two key lambda values: lambda.min gives the lowest cross-validation error, while lambda.1se gives the most regularized model within one standard error of the minimum. The latter often generalizes better to new data.
Selecting Optimal Lambda
Visualizing the cross-validation results helps understand the bias-variance tradeoff:
# Plot cross-validation results
plot(cv_lasso)
title("Lasso Cross-Validation", line = 2.5)
# Plot coefficient paths
plot(cv_lasso$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(lambda_min), col = "red", lty = 2)
abline(v = log(lambda_1se), col = "blue", lty = 2)
legend("topright",
legend = c("lambda.min", "lambda.1se"),
col = c("red", "blue"),
lty = 2)
# View number of non-zero coefficients at each lambda
print(cv_lasso$nzero[cv_lasso$lambda == lambda_min])
print(cv_lasso$nzero[cv_lasso$lambda == lambda_1se])
The coefficient path plot shows how each predictor’s coefficient shrinks toward zero as lambda increases. You’ll notice some coefficients hit zero earlier than others—these are the features Lasso deems less important. The vertical lines mark the optimal lambda values.
Model Evaluation and Interpretation
Evaluate model performance on the test set and identify selected features:
# Make predictions using both lambda values
pred_min <- predict(cv_lasso, s = lambda_min, newx = x_test)
pred_1se <- predict(cv_lasso, s = lambda_1se, newx = x_test)
# Calculate RMSE
rmse_min <- sqrt(mean((pred_min - y_test)^2))
rmse_1se <- sqrt(mean((pred_1se - y_test)^2))
cat("RMSE (lambda.min):", rmse_min, "\n")
cat("RMSE (lambda.1se):", rmse_1se, "\n")
# Calculate R-squared
ss_res_min <- sum((y_test - pred_min)^2)
ss_tot <- sum((y_test - mean(y_test))^2)
r2_min <- 1 - (ss_res_min / ss_tot)
cat("R-squared (lambda.min):", r2_min, "\n")
# Extract and display non-zero coefficients
coef_min <- coef(cv_lasso, s = lambda_min)
coef_1se <- coef(cv_lasso, s = lambda_1se)
# Show selected features (non-zero coefficients)
selected_features_min <- coef_min[coef_min[,1] != 0,]
selected_features_1se <- coef_1se[coef_1se[,1] != 0,]
cat("\nSelected features (lambda.min):\n")
print(selected_features_min)
cat("\nSelected features (lambda.1se):\n")
print(selected_features_1se)
The coefficient output reveals which features survived regularization. In the Boston dataset, you’ll typically see that rm (average number of rooms) and lstat (percentage lower status population) have large coefficients, while some features like age might be eliminated entirely.
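To rank the surviving features, you can sort the nonzero coefficients by absolute magnitude. A base-R sketch using a hypothetical coefficient vector (the names echo Boston predictors, but the values are invented for illustration, not actual model output):

```r
# Hypothetical nonzero Lasso coefficients (illustrative values only)
coefs <- c(`(Intercept)` = 22.5, rm = 3.8, lstat = -0.52,
           ptratio = -0.9, chas = 2.1)

# Drop the intercept, then order by absolute magnitude, largest first
feature_coefs <- coefs[names(coefs) != "(Intercept)"]
ranked <- feature_coefs[order(abs(feature_coefs), decreasing = TRUE)]
print(ranked)  # rm, chas, ptratio, lstat
```

Remember that coefficient magnitudes are only comparable because glmnet standardizes features before fitting; on unstandardized data this ranking would be meaningless.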
Practical Tips and Best Practices
Feature scaling matters even though glmnet handles it automatically. If you pre-process data separately, ensure test data uses training set parameters to avoid data leakage.
For categorical variables with many levels, consider grouping rare categories before creating dummy variables:
# Example: handling categorical variables
data_with_factors <- data.frame(
price = rnorm(100),
category = sample(c("A", "B", "C", "D"), 100, replace = TRUE)
)
# Create model matrix (automatically handles factors)
x_matrix <- model.matrix(price ~ . - 1, data_with_factors)
Compare Lasso with Ridge and Elastic Net to determine the best approach for your data:
# Compare regularization methods
set.seed(123)
# Ridge (alpha = 0)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
pred_ridge <- predict(cv_ridge, s = "lambda.min", newx = x_test)
rmse_ridge <- sqrt(mean((pred_ridge - y_test)^2))
# Elastic Net (alpha = 0.5)
cv_elastic <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 10)
pred_elastic <- predict(cv_elastic, s = "lambda.min", newx = x_test)
rmse_elastic <- sqrt(mean((pred_elastic - y_test)^2))
# Comparison
comparison <- data.frame(
Method = c("Lasso", "Ridge", "Elastic Net"),
RMSE = c(rmse_min, rmse_ridge, rmse_elastic),
N_Features = c(
sum(coef(cv_lasso, s = lambda_min) != 0) - 1,
sum(coef(cv_ridge, s = "lambda.min") != 0) - 1,
sum(coef(cv_elastic, s = "lambda.min") != 0) - 1
)
)
print(comparison)
This comparison reveals the tradeoff between model complexity and performance. Lasso typically selects fewer features than Elastic Net, while Ridge retains every feature because it shrinks coefficients without ever zeroing them out. If Lasso and Ridge perform similarly but Lasso uses half the features, choose Lasso for interpretability.
When working with time series or panel data, use time-aware cross-validation (rolling-origin or blocked splits) instead of random splits to respect temporal ordering. The caret package provides tools such as createTimeSlices() for building these schemes.
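A simple rolling-origin split can also be built in base R; this sketch uses arbitrary window sizes chosen for illustration, and caret's createTimeSlices() implements the same idea with more options:

```r
# Rolling-origin splits: train on observations 1..k, test on the next block
rolling_origin_splits <- function(n, initial_window, horizon) {
  starts <- seq(initial_window, n - horizon, by = horizon)
  lapply(starts, function(k) {
    list(train = 1:k,                    # all data up to the cutoff
         test  = (k + 1):(k + horizon))  # the block immediately after
  })
}

splits <- rolling_origin_splits(n = 100, initial_window = 60, horizon = 10)
length(splits)        # 4 folds
splits[[1]]$test      # observations 61..70
```

Each fold trains only on observations that precede its test block, so no future information leaks into model fitting.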
Finally, always examine residual plots to verify model assumptions. Lasso doesn’t fix fundamental problems like non-linear relationships or heteroscedasticity—it only performs regularization and feature selection on linear models.
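A basic residuals-vs-fitted check looks like this sketch; the predictions here are simulated stand-ins, since in practice you would use the test-set predictions from your fitted model:

```r
# Residual diagnostics: look for non-linearity or heteroscedasticity
# (simulated values used here in place of real Lasso predictions)
set.seed(42)
y_test <- rnorm(100, mean = 22, sd = 5)
pred   <- y_test + rnorm(100, sd = 2)   # stand-in for predict() output

residuals <- y_test - pred
plot(pred, residuals,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)
```

A healthy fit shows residuals scattered symmetrically around zero with roughly constant spread; a funnel shape or curvature signals problems that regularization alone cannot fix.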