How to Perform Elastic Net Regression in Python

Key Insights

  • Elastic Net combines L1 and L2 regularization, making it superior to pure Lasso when features are correlated—it selects groups of correlated features rather than arbitrarily picking one
  • The l1_ratio parameter controls the mix between Lasso (1.0) and Ridge (0.0), with values between 0.5 and 0.7 often working well as starting points
  • Prefer ElasticNetCV to manual hyperparameter tuning—it cross-validates along the full regularization path, which is far cheaper than refitting the model from scratch for every alpha in a grid search

Introduction to Elastic Net Regression

Elastic Net regression solves a fundamental problem with Lasso regression: when you have correlated features, Lasso arbitrarily selects one and zeros out the others. This behavior is problematic when you need stable coefficient estimates or when the correlated features all carry meaningful information.

Ridge regression handles correlated features better by shrinking coefficients toward zero without eliminating them entirely. But it never performs feature selection—you’re stuck with all your predictors.

Elastic Net gives you both. It combines the L1 penalty (which drives coefficients to exactly zero) with the L2 penalty (which handles correlated features gracefully). Use Elastic Net when:

  • Your features exhibit multicollinearity
  • You want automatic feature selection but need stable coefficient estimates
  • You have more features than observations (high-dimensional data)
  • Pure Lasso is giving you inconsistent results across different data samples
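If you're unsure whether the first condition applies, a quick pairwise correlation check is often enough. A minimal sketch using the diabetes dataset (the same one used throughout this article):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data
corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
np.fill_diagonal(corr, 0.0)          # ignore each feature's correlation with itself

print(f"Strongest pairwise correlation: {np.abs(corr).max():.2f}")
```

A strongest pairwise correlation near 0.9 (as in this dataset's serum measurements) is exactly the situation where pure Lasso becomes unstable.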

The Math Behind Elastic Net

The Elastic Net cost function adds both L1 and L2 penalty terms to the ordinary least squares objective:

# Elastic Net Cost Function:
# minimize: (1 / 2n) * ||y - Xw||² + alpha * l1_ratio * ||w||₁ + alpha * (1 - l1_ratio) * 0.5 * ||w||²
#
# Where:
# - ||y - Xw||² is the residual sum of squares
# - ||w||₁ is the L1 norm (sum of absolute values of coefficients)
# - ||w||² is the squared L2 norm (sum of squared coefficients)
# - alpha controls overall regularization strength
# - l1_ratio controls the mix between L1 and L2 penalties
#
# When l1_ratio = 1: Pure Lasso
# When l1_ratio = 0: Pure Ridge
# When 0 < l1_ratio < 1: Elastic Net

The key insight is that alpha controls how much total regularization you apply, while l1_ratio determines the character of that regularization. A higher l1_ratio produces sparser models with more zero coefficients. A lower l1_ratio keeps more features but with smaller coefficients.
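The endpoints are easy to verify empirically: with l1_ratio=1.0, scikit-learn's ElasticNet solves the same objective as Lasso, so the fitted coefficients should agree. A quick sanity check (alpha=0.1 is arbitrary here):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso

X, y = load_diabetes(return_X_y=True)

# l1_ratio=1.0 makes the L2 term vanish, leaving the pure Lasso objective
enet = ElasticNet(alpha=0.1, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

print(np.allclose(enet.coef_, lasso.coef_))
```

At the other endpoint (l1_ratio=0), scikit-learn's documentation advises using Ridge instead, since the coordinate descent solver is not designed for a pure L2 penalty.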

Preparing Your Data

Feature scaling is non-negotiable for regularized regression. The penalty is applied uniformly to all coefficients, but a feature measured on a large scale needs only a small coefficient to have the same effect, so it is effectively penalized less than a feature on a small scale. Left unscaled, your model is biased toward large-scale features.

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
feature_names = diabetes.feature_names

print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")

# Check for missing values (this dataset is clean, but always verify)
print(f"Missing values: {np.isnan(X).sum()}")

# Split data before scaling to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features using training data statistics only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

A common mistake: fitting the scaler on the entire dataset before splitting. This leaks information from your test set into training, giving you overly optimistic performance estimates. Always fit the scaler on training data only.
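An alternative that makes this mistake impossible is to wrap the scaler and the model in a Pipeline, so scaling is re-fit on the training folds only, even inside cross-validation. A sketch (the alpha and l1_ratio values are placeholders, not tuned):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is re-fit on each training fold, so no test-fold statistics leak in
pipe = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
)
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

The standalone scaler in the code above is fine for a single train/test split; the Pipeline version pays off once cross-validation enters the picture.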

Implementing Elastic Net with Scikit-Learn

The basic implementation is straightforward. Start with reasonable defaults and refine from there:

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Initialize Elastic Net with moderate regularization
elastic_net = ElasticNet(
    alpha=0.1,          # Overall regularization strength
    l1_ratio=0.5,       # Equal mix of L1 and L2
    max_iter=10000,     # Increase if convergence warnings appear
    random_state=42
)

# Fit the model
elastic_net.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = elastic_net.predict(X_train_scaled)
y_test_pred = elastic_net.predict(X_test_scaled)

# Evaluate performance
print("Training Performance:")
print(f"  MSE: {mean_squared_error(y_train, y_train_pred):.2f}")
print(f"  R²: {r2_score(y_train, y_train_pred):.4f}")

print("\nTest Performance:")
print(f"  MSE: {mean_squared_error(y_test, y_test_pred):.2f}")
print(f"  R²: {r2_score(y_test, y_test_pred):.4f}")

# Examine coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': elastic_net.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("\nCoefficients:")
print(coef_df.to_string(index=False))

Watch for convergence warnings. If you see them, increase max_iter or adjust tol. The defaults work for most cases, but high-dimensional or ill-conditioned data may require tuning.
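If you want convergence problems to fail loudly rather than scroll past as warnings, one option is to escalate ConvergenceWarning to an error and refit with a larger iteration budget. A sketch (the tiny max_iter and tol are deliberate, to force the warning):

```python
import warnings

from sklearn.datasets import load_diabetes
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Deliberately starved of iterations so convergence fails
model = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=2, tol=1e-12)

try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", ConvergenceWarning)
        model.fit(X, y)
except ConvergenceWarning:
    # Retry with a realistic iteration budget and tolerance
    model.set_params(max_iter=10000, tol=1e-4).fit(X, y)

print(f"Converged in {model.n_iter_} iterations")
```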

Hyperparameter Tuning with Cross-Validation

Manual hyperparameter selection is tedious and suboptimal. ElasticNetCV automates this process using efficient coordinate descent along a regularization path:

from sklearn.linear_model import ElasticNetCV
import matplotlib.pyplot as plt

# Define parameter ranges to search
alphas = np.logspace(-4, 1, 50)  # 50 values from 0.0001 to 10
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99]

# Fit ElasticNetCV with 5-fold cross-validation
elastic_cv = ElasticNetCV(
    alphas=alphas,
    l1_ratio=l1_ratios,
    cv=5,
    max_iter=10000,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

elastic_cv.fit(X_train_scaled, y_train)

# Best parameters found
print(f"Best alpha: {elastic_cv.alpha_:.6f}")
print(f"Best l1_ratio: {elastic_cv.l1_ratio_:.2f}")

# Evaluate the tuned model
y_test_pred_cv = elastic_cv.predict(X_test_scaled)
print(f"\nTest MSE: {mean_squared_error(y_test, y_test_pred_cv):.2f}")
print(f"Test R²: {r2_score(y_test, y_test_pred_cv):.4f}")

# Plot cross-validation results for the best l1_ratio
best_l1_idx = l1_ratios.index(elastic_cv.l1_ratio_)
mse_path = elastic_cv.mse_path_[best_l1_idx]

plt.figure(figsize=(10, 6))
plt.semilogx(elastic_cv.alphas_, mse_path.mean(axis=1), 'b-', linewidth=2)
plt.fill_between(
    elastic_cv.alphas_,
    mse_path.mean(axis=1) - mse_path.std(axis=1),
    mse_path.mean(axis=1) + mse_path.std(axis=1),
    alpha=0.2
)
plt.axvline(elastic_cv.alpha_, color='r', linestyle='--', label=f'Best alpha: {elastic_cv.alpha_:.4f}')
plt.xlabel('Alpha (regularization strength)')
plt.ylabel('Mean Squared Error')
plt.title(f'Cross-Validation Results (l1_ratio={elastic_cv.l1_ratio_})')
plt.legend()
plt.tight_layout()
plt.show()

The MSE path plot shows how model performance changes with regularization strength. The optimal alpha balances bias and variance—too low and you overfit, too high and you underfit.

Model Evaluation and Interpretation

Beyond performance metrics, examine which features Elastic Net selected and how it weighted them:

# Detailed coefficient analysis
coef_analysis = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': elastic_cv.coef_,
    'Abs_Coefficient': np.abs(elastic_cv.coef_),
    'Selected': elastic_cv.coef_ != 0
})
coef_analysis = coef_analysis.sort_values('Abs_Coefficient', ascending=False)

print("Feature Selection Results:")
print("=" * 50)
print(coef_analysis.to_string(index=False))
print(f"\nFeatures selected: {coef_analysis['Selected'].sum()} / {len(feature_names)}")
print(f"Features eliminated: {(~coef_analysis['Selected']).sum()}")

# Visualize coefficient magnitudes
plt.figure(figsize=(10, 6))
colors = ['steelblue' if c != 0 else 'lightgray' for c in coef_analysis['Coefficient']]
plt.barh(coef_analysis['Feature'], coef_analysis['Coefficient'], color=colors)
plt.xlabel('Coefficient Value')
plt.title('Elastic Net Coefficients (Gray = Eliminated)')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()

Features with zero coefficients have been eliminated from the model. This is Elastic Net performing automatic feature selection. The remaining coefficients indicate relative feature importance—but remember, this interpretation only holds when features are standardized.
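If the goal is to feed the surviving features into another model, you don't need to threshold coefficients by hand—scikit-learn's SelectFromModel can wrap the estimator and do it for you. A sketch (the alpha, l1_ratio, and threshold values are illustrative, not tuned):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# threshold near zero keeps any feature with a non-zero coefficient
selector = SelectFromModel(
    ElasticNet(alpha=20.0, l1_ratio=0.9, max_iter=10000),
    threshold=1e-8,
)
X_reduced = selector.fit_transform(X_scaled, y)

print(f"Kept {X_reduced.shape[1]} of {X_scaled.shape[1]} features")
```

The reduced matrix can then go straight into any downstream estimator, keeping the selection step inside a single fit.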

Elastic Net vs. Lasso vs. Ridge: A Practical Comparison

Let’s compare all three approaches on the same data to see when Elastic Net provides advantages:

from sklearn.linear_model import LassoCV, RidgeCV

# Fit all three models with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000, random_state=42)
ridge_cv = RidgeCV(alphas=alphas, cv=5)

lasso_cv.fit(X_train_scaled, y_train)
ridge_cv.fit(X_train_scaled, y_train)

# Compare test performance
models = {
    'Ridge': ridge_cv,
    'Lasso': lasso_cv,
    'Elastic Net': elastic_cv
}

results = []
for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    results.append({
        'Model': name,
        'Test MSE': mean_squared_error(y_test, y_pred),
        'Test R²': r2_score(y_test, y_pred),
        'Non-zero Coefs': np.sum(model.coef_ != 0),
        'Best Alpha': model.alpha_
    })

results_df = pd.DataFrame(results)
print("Model Comparison:")
print("=" * 70)
print(results_df.to_string(index=False))

# Compare coefficient stability across models
coef_comparison = pd.DataFrame({
    'Feature': feature_names,
    'Ridge': ridge_cv.coef_,
    'Lasso': lasso_cv.coef_,
    'Elastic Net': elastic_cv.coef_
})

print("\nCoefficient Comparison:")
print("=" * 70)
print(coef_comparison.round(4).to_string(index=False))

# Highlight differences in feature selection
print("\nFeature Selection Differences:")
for i, feature in enumerate(feature_names):
    lasso_coef = lasso_cv.coef_[i]
    enet_coef = elastic_cv.coef_[i]

    if lasso_coef == 0 and enet_coef != 0:
        print(f"  {feature}: Lasso eliminated, Elastic Net retained ({enet_coef:.4f})")

On this dataset, the differences may be subtle because the diabetes dataset doesn’t have extreme multicollinearity. In real-world scenarios with highly correlated features—think genomics data, financial indicators, or sensor readings—Elastic Net’s advantages become pronounced.

The key practical takeaway: when Lasso gives you unstable feature selection (different features selected on different data samples), switch to Elastic Net. Start with l1_ratio=0.5 and adjust based on how much sparsity you need. Lower values retain more features; higher values eliminate more aggressively.
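One way to see this instability concretely is to refit both models on bootstrap resamples and count how often each feature is selected; rates near 0 or 1 indicate stable decisions, while mid-range rates signal instability. A rough sketch (the alphas are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
rng = np.random.default_rng(42)

n_boot = 30
lasso_counts = np.zeros(X.shape[1])
enet_counts = np.zeros(X.shape[1])

for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
    lasso_counts += Lasso(alpha=5.0, max_iter=10000).fit(X[idx], y[idx]).coef_ != 0
    enet_counts += ElasticNet(alpha=5.0, l1_ratio=0.5, max_iter=10000).fit(X[idx], y[idx]).coef_ != 0

print("Lasso selection rates:      ", np.round(lasso_counts / n_boot, 2))
print("Elastic Net selection rates:", np.round(enet_counts / n_boot, 2))
```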
