How to Perform Lasso Regression in Python
Key Insights
- Lasso regression uses L1 regularization to automatically perform feature selection by driving irrelevant coefficients to exactly zero, making it invaluable for high-dimensional datasets where you need interpretable, sparse models.
- The regularization parameter alpha controls the trade-off between model fit and sparsity; use LassoCV to automatically find the optimal value through cross-validation rather than guessing.
- Always standardize your features before applying Lasso; the L1 penalty treats all coefficients equally, so variables on different scales will be penalized unfairly without normalization.
Introduction to Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) regression adds an L1 penalty to ordinary least squares, fundamentally changing how the model handles coefficients. While Ridge regression uses L2 regularization (squared coefficients), Lasso uses the absolute value of coefficients. This seemingly minor difference has major practical implications.
The L1 penalty creates sparse solutions. Coefficients don’t just shrink toward zero—they hit zero exactly. This makes Lasso a feature selection algorithm disguised as a regression technique. When you have hundreds of features and suspect only a handful matter, Lasso identifies them automatically.
Use Lasso when you need interpretability, when you’re working with high-dimensional data (more features than observations), or when you want automatic feature selection baked into your modeling process.
Mathematical Foundation
The Lasso cost function minimizes:
$$J(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p}|\beta_j|$$
The first term is your standard residual sum of squares. The second term penalizes the absolute values of coefficients, scaled by alpha. As alpha increases, more coefficients get pushed to zero.
Why does L1 produce zeros while L2 doesn’t? The geometry explains it. L1 creates a diamond-shaped constraint region; L2 creates a sphere. The optimal solution sits where the contours of the loss function first touch the constraint region. Diamond corners lie on axes where some coefficients equal zero. Spheres have no corners—solutions rarely land exactly on axes.
import numpy as np
import matplotlib.pyplot as plt
# Visualize L1 vs L2 constraint regions
theta = np.linspace(0, 2 * np.pi, 100)
# L2 (Ridge): circle
l2_x = np.cos(theta)
l2_y = np.sin(theta)
# L1 (Lasso): diamond, traced through its four vertices on the axes
l1_x = np.array([1, 0, -1, 0, 1])
l1_y = np.array([0, 1, 0, -1, 0])
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(l2_x, l2_y, 'b-', linewidth=2)
axes[0].set_title('L2 (Ridge) Constraint Region')
axes[0].set_aspect('equal')
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].axvline(x=0, color='k', linewidth=0.5)
axes[1].plot(l1_x, l1_y, 'r-', linewidth=2)
axes[1].set_title('L1 (Lasso) Constraint Region')
axes[1].set_aspect('equal')
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.savefig('constraint_regions.png', dpi=150)
plt.show()
The diamond’s corners on the axes explain why Lasso solutions often have exactly zero coefficients—the optimization naturally lands on these corners.
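The same fact can be seen algebraically. Coordinate descent, the solver scikit-learn uses for Lasso, updates each coefficient with a soft-thresholding operator: whenever a coefficient's unpenalized value falls within the penalty band, the update returns exactly zero. A minimal sketch of that operator (a standalone illustration, not scikit-learn's internal code):

```python
def soft_threshold(rho, alpha):
    """Soft-thresholding: the closed-form coordinate-descent update
    for a single Lasso coefficient (assuming standardized features).
    Any value with |rho| <= alpha is mapped to exactly zero."""
    if rho < -alpha:
        return rho + alpha
    elif rho > alpha:
        return rho - alpha
    return 0.0

print(soft_threshold(0.3, 0.5))   # -> 0.0 (inside the band: zeroed exactly)
print(soft_threshold(2.0, 0.5))   # -> 1.5 (outside the band: shrunk by alpha)
print(soft_threshold(-2.0, 0.5))  # -> -1.5
```

L2's update, by contrast, multiplies coefficients by a factor strictly between 0 and 1, so they shrink but never reach zero.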
Implementing Lasso with Scikit-learn
Scikit-learn’s Lasso class makes implementation straightforward. Let’s work through a complete example with synthetic data that has known properties.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate synthetic data with sparse true coefficients
np.random.seed(42)
X, y, true_coef = make_regression(
n_samples=200,
n_features=20,
n_informative=5, # Only 5 features actually matter
noise=10,
coef=True
)
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit Lasso model
lasso = Lasso(alpha=1.0, max_iter=10000, tol=1e-4)
lasso.fit(X_train_scaled, y_train)
# Display coefficients
print("Lasso Coefficients:")
for i, coef in enumerate(lasso.coef_):
if coef != 0:
print(f" Feature {i}: {coef:.4f}")
print(f"\nNon-zero coefficients: {np.sum(lasso.coef_ != 0)} out of 20")
print(f"True informative features: 5")
Key parameters to understand:
- alpha: Regularization strength. Higher values mean more coefficients become zero. Default is 1.0, but you’ll almost always tune this.
- max_iter: Maximum iterations for the coordinate descent solver. Increase if you get convergence warnings.
- tol: Tolerance for optimization. Smaller values mean more precise solutions but longer computation.
Tuning the Regularization Parameter (Alpha)
Choosing alpha manually is guesswork. LassoCV performs cross-validation across a range of alpha values and selects the best one automatically.
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
# LassoCV automatically tests multiple alpha values
lasso_cv = LassoCV(
alphas=np.logspace(-4, 2, 100), # Test 100 values from 0.0001 to 100
cv=5, # 5-fold cross-validation
max_iter=10000
)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")
# Plot MSE vs alpha
mse_mean = np.mean(lasso_cv.mse_path_, axis=1)
mse_std = np.std(lasso_cv.mse_path_, axis=1)
plt.figure(figsize=(10, 6))
plt.semilogx(lasso_cv.alphas_, mse_mean, 'b-', label='Mean MSE')
plt.fill_between(
lasso_cv.alphas_,
mse_mean - mse_std,
mse_mean + mse_std,
alpha=0.2
)
plt.axvline(lasso_cv.alpha_, color='r', linestyle='--', label=f'Optimal α={lasso_cv.alpha_:.4f}')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation: MSE vs Alpha')
plt.legend()
plt.savefig('lasso_cv_mse.png', dpi=150)
plt.show()
The plot reveals the bias-variance tradeoff. Too-small alpha leads to overfitting (high variance). Too-large alpha oversimplifies the model (high bias). The optimal alpha balances both.
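You can also see the tradeoff numerically by comparing train and test error at extreme alphas. A small sketch on fresh synthetic data (50 features, 70 training rows, chosen here to invite overfitting):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Deliberately many features relative to sample size
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_tr, X_te = scaler.fit_transform(X_tr), scaler.transform(X_te)

train_mse = []
for alpha in [0.001, 1.0, 1000.0]:
    m = Lasso(alpha=alpha, max_iter=50000).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, m.predict(X_tr))
    te = mean_squared_error(y_te, m.predict(X_te))
    train_mse.append(tr)
    print(f"alpha={alpha:>8}: train MSE={tr:.1f}, test MSE={te:.1f}")
```

Training error can only rise as alpha increases; the interesting quantity is the test error, which typically dips somewhere in the middle.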
Feature Selection with Lasso
Lasso’s killer feature is automatic feature selection. Extract selected features programmatically:
import pandas as pd
# Get feature importance from Lasso
feature_names = [f'Feature_{i}' for i in range(20)]
coef_df = pd.DataFrame({
'feature': feature_names,
'coefficient': lasso_cv.coef_,
'abs_coefficient': np.abs(lasso_cv.coef_)
})
# Filter and sort
selected_features = coef_df[coef_df['coefficient'] != 0].sort_values(
'abs_coefficient', ascending=False
)
print("Selected Features (ranked by importance):")
print(selected_features.to_string(index=False))
# Extract feature indices for downstream use
selected_indices = np.where(lasso_cv.coef_ != 0)[0]
print(f"\nSelected feature indices: {selected_indices}")
# Compare with true informative features
true_informative = np.where(true_coef != 0)[0]
print(f"True informative features: {true_informative}")
print(f"Correctly identified: {len(set(selected_indices) & set(true_informative))}/{len(true_informative)}")
This approach lets you reduce dimensionality before feeding data to other models, or simply understand which variables drive your outcome.
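For use inside a pipeline, scikit-learn's SelectFromModel wraps this pattern: it fits the estimator and keeps only the columns whose coefficients clear a threshold. A self-contained sketch on the same kind of synthetic data (the explicit tiny threshold is an assumption to keep every strictly non-zero coefficient):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X = StandardScaler().fit_transform(X)

# Keep only columns whose fitted Lasso coefficient is non-zero
selector = SelectFromModel(LassoCV(cv=5, max_iter=10000), threshold=1e-10)
X_reduced = selector.fit_transform(X, y)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
```

The reduced matrix can then feed any downstream model, with the selection step tuned by cross-validation rather than by hand.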
Model Evaluation and Comparison
Compare Lasso against OLS and Ridge to understand when each excels:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
# Fit all three models
models = {
'OLS': LinearRegression(),
'Ridge': Ridge(alpha=lasso_cv.alpha_), # Same alpha value, though Ridge's penalty operates on a different scale
'Lasso': Lasso(alpha=lasso_cv.alpha_)
}
results = []
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
results.append({
'Model': name,
'MSE': mean_squared_error(y_test, y_pred),
'R²': r2_score(y_test, y_pred),
'Non-zero Coefs': np.sum(model.coef_ != 0) if hasattr(model, 'coef_') else 'N/A'
})
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.to_string(index=False))
Expected output shows Lasso achieving comparable predictive performance with far fewer features. When true sparsity exists in your data, Lasso often matches or beats OLS while providing interpretability.
Practical Considerations and Best Practices
Standardization is mandatory. Lasso penalizes coefficients equally. A feature measured in millions will have a tiny coefficient compared to one measured in decimals—the penalty hits them differently. Always standardize first.
# Always do this before Lasso
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Multicollinearity causes instability. When features correlate highly, Lasso arbitrarily picks one and zeros out others. This selection can flip between runs. If you need stable selection among correlated features, consider Elastic Net.
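The arbitrariness is easy to demonstrate with an extreme case: two identical copies of the same feature. In this sketch (synthetic data, an assumed alpha of 0.1), the coordinate-descent solver puts essentially all the weight on one copy and zeroes the other, even though the data gives no reason to prefer either:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 3 * x + 0.1 * rng.standard_normal(200)

# Two perfectly collinear copies of the same feature
X = np.column_stack([x, x])
coef = Lasso(alpha=0.1, max_iter=10000).fit(X, y).coef_
print("Coefficients for two identical columns:", coef)
# One copy carries (nearly) all the weight; the other is numerically zero.
# Which copy wins is an artifact of the solver's update order, not the data:
# any split of the total weight between the columns fits equally well.
```

Elastic Net's L2 component breaks this tie by encouraging correlated features to share weight, which is why it gives more stable selection in this setting.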
When to use Elastic Net instead:
from sklearn.linear_model import ElasticNetCV
# Elastic Net combines L1 and L2 penalties
elastic_net = ElasticNetCV(
l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], # 1.0 = pure Lasso
alphas=np.logspace(-4, 2, 50),
cv=5
)
elastic_net.fit(X_train_scaled, y_train)
print(f"Optimal l1_ratio: {elastic_net.l1_ratio_}")
print(f"Optimal alpha: {elastic_net.alpha_:.6f}")
Use Elastic Net when you have groups of correlated features and want to include or exclude them together, or when Lasso’s feature selection seems unstable across different data samples.
Lasso regression remains one of the most practical tools for building interpretable models from high-dimensional data. Master it, and you’ll have a reliable method for both prediction and understanding which features actually matter.