How to Perform Ridge Regression in Python

Key Insights

  • Ridge regression adds an L2 penalty to ordinary least squares, shrinking coefficients toward zero to combat multicollinearity and overfitting without eliminating features entirely.
  • Always standardize your features before applying ridge regression—the regularization penalty treats all coefficients equally, so features on different scales will be penalized unfairly.
  • Use RidgeCV with a logarithmic range of alpha values (e.g., 0.001 to 1000) to efficiently find the optimal regularization strength through cross-validation.

Introduction to Ridge Regression

Standard linear regression has a dirty secret: it falls apart when your features are correlated. When you have multicollinearity—predictors that move together—ordinary least squares (OLS) produces wildly unstable coefficient estimates. Small changes in your data cause coefficients to swing dramatically, making your model unreliable and uninterpretable.

Ridge regression solves this by adding a penalty term that constrains coefficient magnitudes. Instead of finding coefficients that minimize only the sum of squared errors, ridge finds coefficients that minimize errors while keeping coefficients small. This L2 penalty shrinks coefficients toward zero but never exactly to zero, which distinguishes ridge from Lasso regression.

The result is a more stable, generalizable model—especially valuable when you’re working with correlated features, high-dimensional data, or situations where you have more features than observations.

The Math Behind Ridge Regression

The ordinary least squares objective is straightforward: minimize the sum of squared residuals.

$$\text{OLS Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Ridge regression adds a penalty term proportional to the squared magnitude of coefficients:

$$\text{Ridge Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p}\beta_j^2$$

The hyperparameter α (alpha) controls regularization strength. When α = 0, you get OLS. As α increases, coefficients shrink more aggressively toward zero.
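
You can verify both limits directly. A quick sanity check on synthetic data (intercepts disabled so the comparison is purely about the coefficients):

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([3.0, 2.0, 0.5]) + rng.standard_normal(100) * 0.1

ols = LinearRegression(fit_intercept=False).fit(X, y)
near_zero = Ridge(alpha=1e-8, fit_intercept=False).fit(X, y)
heavy = Ridge(alpha=1000.0, fit_intercept=False).fit(X, y)

# As alpha approaches 0, ridge recovers the OLS solution
print(np.allclose(ols.coef_, near_zero.coef_, atol=1e-6))  # True

# A heavy penalty shrinks every coefficient toward zero
print(np.abs(heavy.coef_) < np.abs(ols.coef_))
```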

The closed-form solution for ridge coefficients is:

$$\hat{\beta}_{ridge} = (X^TX + \alpha I)^{-1}X^Ty$$

That αI term is the key: it adds α to every diagonal entry of X^TX, which makes the matrix invertible even when features are perfectly correlated.
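
A small sketch with hypothetical data shows the effect: with two perfectly correlated columns, X^TX is singular, and adding αI restores full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
X = np.column_stack([x, x])  # two perfectly correlated features

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 1: singular, no inverse exists

# Adding alpha to the diagonal makes the matrix full rank
alpha = 1.0
regularized = XtX + alpha * np.eye(2)
print(np.linalg.matrix_rank(regularized))  # 2: invertible
```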

Here’s a manual implementation to understand what’s happening under the hood:

import numpy as np

def ridge_regression_manual(X, y, alpha):
    """
    Calculate ridge regression coefficients using the closed-form solution.
    
    Parameters:
    -----------
    X : array-like, shape (n_samples, n_features)
        Feature matrix (should be standardized)
    y : array-like, shape (n_samples,)
        Target values
    alpha : float
        Regularization strength
    
    Returns:
    --------
    coefficients : array, shape (n_features,)
        Ridge regression coefficients
    """
    n_features = X.shape[1]
    identity = np.eye(n_features)
    
    # Ridge closed-form: (X^T X + αI)^(-1) X^T y
    XtX = X.T @ X
    Xty = X.T @ y
    # Solving the linear system is more stable than computing the inverse explicitly
    coefficients = np.linalg.solve(XtX + alpha * identity, Xty)
    
    return coefficients

# Example usage
np.random.seed(42)
X = np.random.randn(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] + np.random.randn(100) * 0.5

coeffs = ridge_regression_manual(X, y, alpha=1.0)
print(f"Ridge coefficients: {coeffs.round(2)}")
# Output: Ridge coefficients: [2.93 1.96 0.48]

Implementing Ridge Regression with Scikit-learn

In practice, use scikit-learn’s optimized implementation. The workflow follows the standard fit-predict pattern, but feature scaling is critical.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Generate sample data with some multicollinearity
np.random.seed(42)
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, 
                       noise=10, random_state=42)

# Add correlated features to simulate multicollinearity
X[:, 5] = X[:, 0] + np.random.randn(200) * 0.1
X[:, 6] = X[:, 1] + np.random.randn(200) * 0.1

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# IMPORTANT: Scale features before ridge regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Make predictions
y_pred = ridge.predict(X_test_scaled)

print(f"Ridge coefficients: {ridge.coef_.round(2)}")
print(f"R² score: {ridge.score(X_test_scaled, y_test):.4f}")

Why is scaling essential? The L2 penalty treats all coefficients equally. If one feature ranges from 0 to 1 and another from 0 to 1000, the large-scale feature needs only a tiny coefficient, so the penalty barely touches it while disproportionately shrinking the small-scale feature's coefficient. Standardizing puts all features on equal footing.
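
A small synthetic demonstration (the 1000x scale factor and alpha value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
X_raw = np.column_stack([z1, 1000 * z2])  # second column on a 1000x larger scale
y = 2 * z1 + 2 * z2 + rng.standard_normal(n) * 0.1  # equal signal from both

# Unscaled: the large-scale feature only needs a coefficient near 0.002,
# so the penalty barely touches it while crushing the unit-scale coefficient
raw = Ridge(alpha=2000).fit(X_raw, y)
print(raw.coef_[0], raw.coef_[1] * 1000)  # effective contributions are very unequal

# Scaled: both coefficients are shrunk by a comparable factor
X_std = StandardScaler().fit_transform(X_raw)
std = Ridge(alpha=2000).fit(X_std, y)
print(std.coef_)  # roughly equal values
```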

Choosing the Optimal Alpha with Cross-Validation

The alpha hyperparameter makes or breaks your ridge model. Too small, and you’re back to OLS with its instability. Too large, and you’ve over-regularized, underfitting your data.

RidgeCV searches multiple alpha values efficiently. By default it uses leave-one-out cross-validation; passing cv switches to k-fold:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Use logarithmically spaced alphas
alphas = np.logspace(-3, 3, 100)  # 0.001 to 1000

# RidgeCV finds optimal alpha automatically
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Optimal alpha: {ridge_cv.alpha_:.4f}")
print(f"R² with optimal alpha: {ridge_cv.score(X_test_scaled, y_test):.4f}")

# Visualize cross-validation scores across alphas
cv_scores = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X_train_scaled, y_train, cv=5, 
                            scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())

plt.figure(figsize=(10, 6))
plt.semilogx(alphas, cv_scores, linewidth=2)
plt.axvline(ridge_cv.alpha_, color='red', linestyle='--', 
            label=f'Optimal α = {ridge_cv.alpha_:.4f}')
plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('Mean Squared Error (CV)', fontsize=12)
plt.title('Cross-Validation Error vs. Alpha')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('ridge_cv_alpha.png', dpi=150)
plt.show()

The logarithmic spacing is intentional—alpha’s effect is multiplicative, so you need to search across orders of magnitude.
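
For instance, seven log-spaced points cover the whole range evenly in multiplicative terms, while a linear grid of the same size barely samples the small-alpha region:

```python
import numpy as np

# Log spacing: one point per order of magnitude from 0.001 to 1000
log_grid = np.logspace(-3, 3, 7)
print(log_grid)

# Linear spacing over the same range: every point after the first
# lands above 166, leaving the small-alpha region unexplored
lin_grid = np.linspace(0.001, 1000, 7)
print(lin_grid)
```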

Evaluating Model Performance

Compare ridge against OLS to see the regularization benefit, especially with multicollinear data:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Create highly multicollinear data
np.random.seed(42)
n_samples = 100
X_base = np.random.randn(n_samples, 3)

# Create correlated features
X_collinear = np.column_stack([
    X_base[:, 0],
    X_base[:, 0] + np.random.randn(n_samples) * 0.01,  # Nearly identical to feature 0
    X_base[:, 1],
    X_base[:, 1] + np.random.randn(n_samples) * 0.01,  # Nearly identical to feature 2
    X_base[:, 2]
])

y = 3 * X_base[:, 0] + 2 * X_base[:, 1] + X_base[:, 2] + np.random.randn(n_samples) * 0.5

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X_collinear, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit both models
ols = LinearRegression()
ridge = Ridge(alpha=1.0)

ols.fit(X_train_scaled, y_train)
ridge.fit(X_train_scaled, y_train)

# Evaluate
models = {'OLS': ols, 'Ridge': ridge}

print(f"{'Model':<10} {'R²':>10} {'RMSE':>10} {'Coef Std':>12}")
print("-" * 45)

for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    coef_std = np.std(model.coef_)
    
    print(f"{name:<10} {r2:>10.4f} {rmse:>10.4f} {coef_std:>12.4f}")

print(f"\nOLS coefficients:   {ols.coef_.round(2)}")
print(f"Ridge coefficients: {ridge.coef_.round(2)}")

Notice how OLS coefficients for correlated features are wildly inflated in opposite directions (they cancel out), while ridge produces stable, interpretable coefficients.

Visualizing Ridge Coefficients

The coefficient path plot reveals how regularization progressively shrinks coefficients:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Generate data with known coefficients
np.random.seed(42)
n_samples, n_features = 100, 8
X = np.random.randn(n_samples, n_features)
true_coefs = np.array([5, 3, 2, 1, 0.5, 0.1, 0, 0])
y = X @ true_coefs + np.random.randn(n_samples) * 0.5

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute coefficients for range of alphas
alphas = np.logspace(-2, 4, 200)
coef_paths = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_scaled, y)
    coef_paths.append(ridge.coef_)

coef_paths = np.array(coef_paths)

# Plot coefficient paths
plt.figure(figsize=(12, 7))
for i in range(n_features):
    plt.semilogx(alphas, coef_paths[:, i], linewidth=2, 
                 label=f'Feature {i} (true={true_coefs[i]})')

plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Ridge Coefficient Paths: Shrinkage as Alpha Increases')
plt.legend(loc='upper right', fontsize=9)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.savefig('ridge_coefficient_paths.png', dpi=150)
plt.show()

This visualization shows ridge’s key behavior: coefficients shrink toward zero but never reach it. Features with larger true effects maintain higher coefficients even at high alpha values.

When to Use Ridge Regression

Use ridge when:

  1. You detect multicollinearity. Check the Variance Inflation Factor (VIF)—values above 5-10 indicate problematic correlation between predictors.

  2. You have more features than samples. OLS fails entirely in this case; ridge remains stable.

  3. You want to keep all features. Unlike Lasso, ridge doesn’t zero out coefficients, which matters when you believe all predictors contribute.

  4. Coefficient stability matters more than sparsity. For interpretable models where you need stable coefficient estimates across samples.
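
To check point 1, you can compute VIF with statsmodels' variance_inflation_factor, or by hand as below (the vif helper and synthetic data here are illustrative):

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # Auxiliary regression with an intercept
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
# Append a near-duplicate of column 0
X = np.column_stack([X, X[:, 0] + rng.standard_normal(200) * 0.1])

print(vif(X).round(1))  # the correlated pair shows VIF far above 10
```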

Choose Lasso instead when you want automatic feature selection—Lasso’s L1 penalty drives some coefficients exactly to zero.
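
A quick way to see the difference on synthetic data (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X @ np.array([5, 3, 2, 1, 0.5, 0.1, 0.0, 0.0]) + rng.standard_normal(100) * 0.5

# Even a huge L2 penalty drives coefficients toward zero without reaching it
ridge = Ridge(alpha=1e4).fit(X, y)
# The L1 penalty zeroes out the weakest features entirely
lasso = Lasso(alpha=1.0).fit(X, y)

print(np.count_nonzero(ridge.coef_ == 0))  # 0
print(np.count_nonzero(lasso.coef_ == 0))  # several
```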

Consider ElasticNet when you want the best of both worlds. It combines L1 and L2 penalties:

from sklearn.linear_model import ElasticNet

# l1_ratio=0 is pure ridge, l1_ratio=1 is pure lasso
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)

Ridge regression is a fundamental tool for any data scientist working with correlated features. Master the alpha tuning process, always scale your features, and you’ll have a robust regularization technique that improves model stability without sacrificing predictive power.
