How to Implement SVM for Regression in Python

Key Insights

  • Support Vector Regression (SVR) creates a “tube” around predictions where errors within epsilon are ignored, making it robust to outliers and effective for non-linear relationships
  • Feature scaling is mandatory for SVR since the algorithm relies on distance calculations—unscaled features will completely break your model
  • The RBF kernel with properly tuned C, epsilon, and gamma parameters typically outperforms linear models on complex datasets, but requires grid search to find optimal values

Introduction to Support Vector Regression (SVR)

While Support Vector Machines are famous for classification, Support Vector Regression applies the same principles to predict continuous values. The key difference lies in the objective: instead of finding a hyperplane that maximizes the margin between classes, SVR finds a function that deviates from actual target values by no more than epsilon (ε), while remaining as flat as possible.

The epsilon-insensitive loss function is what makes SVR unique. It creates a “tube” of width 2ε around the regression line. Data points inside this tube don’t contribute to the loss—they’re considered correct predictions. Only points outside the tube incur penalties. This makes SVR inherently robust to small errors and outliers, unlike ordinary least squares regression which penalizes every deviation.

The regularization parameter C controls the trade-off between the flatness of the function and how strongly deviations larger than epsilon are penalized. A small C creates a flatter function that tolerates more error, while a large C fits the training data more closely.
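
The epsilon-insensitive loss itself is only a few lines of NumPy. This toy illustration (not part of scikit-learn's API) shows how errors inside the tube contribute nothing:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors within the epsilon tube cost nothing; larger errors are
    penalized linearly by the amount they exceed the tube."""
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 3.0])  # absolute errors: 0.05, 0.5, 0.0
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))
# only the middle point lies outside the tube, so only it incurs loss (0.4)
```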

Understanding SVR Kernels and Hyperparameters

SVR’s power comes from kernel functions that map input features into higher-dimensional spaces where a linear fit can capture relationships that are non-linear in the original space. The three primary kernels are:

Linear kernel: Use when your relationship is approximately linear or when you have many features. It’s computationally efficient and interpretable.

RBF (Radial Basis Function) kernel: The default choice for non-linear data. It can model complex relationships and works well when you don’t know the underlying pattern. Requires tuning the gamma parameter, which defines how far the influence of a single training example reaches.

Polynomial kernel: Good when you suspect polynomial relationships. Requires tuning the degree parameter, but can be computationally expensive for high degrees.

Key hyperparameters to understand:

  • C: Inverse regularization strength. Higher values mean less regularization and a tighter fit to the training data
  • epsilon: Width of the tube where no penalty is given to errors. Larger values create wider tubes and simpler models
  • gamma: Defines how far the influence of a single training example reaches (RBF and polynomial kernels only). Low values mean far reach, high values mean close reach

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

# Generate non-linear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Compare different kernels
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr_linear = SVR(kernel='linear', C=100, epsilon=0.1)
svr_poly = SVR(kernel='poly', C=100, degree=3, epsilon=0.1)

# Fit models
svr_rbf.fit(X, y)
svr_linear.fit(X, y)
svr_poly.fit(X, y)

# Predictions
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_rbf = svr_rbf.predict(X_test)
y_linear = svr_linear.predict(X_test)
y_poly = svr_poly.predict(X_test)

# Visualize
plt.figure(figsize=(12, 4))
plt.scatter(X, y, color='black', label='Data', s=20)
plt.plot(X_test, y_rbf, color='red', label='RBF kernel')
plt.plot(X_test, y_linear, color='blue', label='Linear kernel')
plt.plot(X_test, y_poly, color='green', label='Polynomial kernel')
plt.legend()
plt.title('SVR with Different Kernels')
plt.show()

The RBF kernel captures the sinusoidal pattern, while the linear kernel fails completely. The polynomial kernel does reasonably well but may overfit at the boundaries.
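
The reach of gamma described earlier can also be observed directly by refitting the RBF model at a few illustrative values (the values here are arbitrary, and the data is regenerated so the snippet stands alone):

```python
import numpy as np
from sklearn.svm import SVR

# Same synthetic sine data as above
np.random.seed(42)
X = np.sort(5 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Small gamma: each training point influences a wide region (smoother fit).
# Large gamma: influence is local, so the curve can wiggle and overfit noise.
for gamma in [0.01, 1, 100]:
    model = SVR(kernel='rbf', C=100, gamma=gamma, epsilon=0.1).fit(X, y)
    print(f"gamma={gamma}: {len(model.support_)} support vectors, "
          f"train R² = {model.score(X, y):.3f}")
```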

Preparing Data for SVR

Feature scaling is not optional for SVR—it’s mandatory. SVR uses distance calculations, so features with larger scales will dominate the model. Always scale your features before training.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Load or create dataset
# Using a simple example with multiple features
np.random.seed(42)
X = np.random.rand(200, 3) * 10  # 3 features with different scales
y = 2 * X[:, 0] + 3 * X[:, 1] - X[:, 2] + np.random.normal(0, 1, 200)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features - CRITICAL for SVR
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Optionally scale the target variable for better numerical stability
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).ravel()
# If you train on y_train_scaled, inverse_transform the predictions afterwards

print(f"Original feature ranges: {X_train.min(axis=0)} to {X_train.max(axis=0)}")
print(f"Scaled feature ranges: {X_train_scaled.min(axis=0)} to {X_train_scaled.max(axis=0)}")

Never fit the scaler on test data—this causes data leakage. Always fit on training data and transform both sets using the same scaler.
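
A convenient way to make this kind of leakage impossible, if you are using scikit-learn as above, is to bundle the scaler and model in a Pipeline: cross-validation then refits the scaler on each training fold automatically. A minimal sketch with the same synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = np.random.rand(200, 3) * 10  # 3 features with different scales
y = 2 * X[:, 0] + 3 * X[:, 1] - X[:, 2] + np.random.normal(0, 1, 200)

# The pipeline fits StandardScaler only on each CV training split,
# so scaling statistics never see the held-out fold
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, epsilon=0.1))
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```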

Building a Basic SVR Model

Start with default parameters and iterate. Here’s a complete implementation with a more interesting synthetic dataset:

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic dataset with non-linear relationship
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.5 * X.ravel() + np.random.normal(0, 0.5, X.shape[0])

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train SVR model
svr_model = SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1)
svr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = svr_model.predict(X_train_scaled)
y_pred_test = svr_model.predict(X_test_scaled)

# Evaluate
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
test_r2 = r2_score(y_test, y_pred_test)

print(f"Training RMSE: {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Number of support vectors: {len(svr_model.support_)}")

The number of support vectors tells you how many training points are critical to the model. More support vectors generally mean a more complex regression function.
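
The effect of epsilon on that complexity is easy to see: widening the tube swallows more training points, leaving fewer support vectors. The same synthetic data is regenerated here so the snippet is self-contained:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.5 * X.ravel() + np.random.normal(0, 0.5, X.shape[0])
X_scaled = StandardScaler().fit_transform(X)

# Points inside the epsilon tube contribute nothing, so a wider tube
# needs fewer support vectors to define the function
sv_counts = {}
for eps in [0.01, 0.1, 0.5, 1.0]:
    model = SVR(kernel='rbf', C=100, epsilon=eps).fit(X_scaled, y)
    sv_counts[eps] = len(model.support_)
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors")
```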

Hyperparameter Tuning with GridSearchCV

Default parameters rarely give optimal results. Use grid search with cross-validation to systematically find the best combination:

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'epsilon': [0.01, 0.1, 0.2, 0.5],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Create grid search object
svr = SVR()
grid_search = GridSearchCV(
    svr, 
    param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# Fit grid search
grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score (negative MSE): {grid_search.best_score_:.4f}")

# Use best model
best_svr = grid_search.best_estimator_
y_pred_optimized = best_svr.predict(X_test_scaled)
optimized_rmse = np.sqrt(mean_squared_error(y_test, y_pred_optimized))
optimized_r2 = r2_score(y_test, y_pred_optimized)

print(f"Optimized Test RMSE: {optimized_rmse:.4f}")
print(f"Optimized Test R²: {optimized_r2:.4f}")

Grid search can be time-consuming. Start with a coarse grid, identify promising regions, then refine. For large datasets, consider RandomizedSearchCV instead.
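
A sketch of the randomized alternative, sampling from log-uniform distributions instead of enumerating an explicit grid (the distribution ranges here are illustrative, not tuned):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.5 * X.ravel() + np.random.normal(0, 0.5, X.shape[0])
X_scaled = StandardScaler().fit_transform(X)

# Evaluate 20 sampled settings rather than every grid combination
param_distributions = {
    'C': loguniform(1e-1, 1e3),
    'epsilon': loguniform(1e-2, 1e0),
    'gamma': loguniform(1e-3, 1e1),
}
search = RandomizedSearchCV(
    SVR(kernel='rbf'), param_distributions,
    n_iter=20, cv=5, scoring='neg_mean_squared_error',
    random_state=42, n_jobs=-1,
)
search.fit(X_scaled, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV score (negative MSE): {search.best_score_:.4f}")
```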

Model Evaluation and Comparison

Never rely on a single metric. Compare SVR against simpler baselines to justify the added complexity:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Train multiple models
models = {
    'Linear Regression': LinearRegression(),
    'SVR (RBF)': SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    results[name] = {
        'MAE': mean_absolute_error(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'R²': r2_score(y_test, y_pred)
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df)

# Visualize predictions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, (name, model) in enumerate(models.items()):
    y_pred = model.predict(X_test_scaled)
    axes[idx].scatter(y_test, y_pred, alpha=0.6)
    axes[idx].plot([y_test.min(), y_test.max()], 
                   [y_test.min(), y_test.max()], 
                   'r--', lw=2)
    axes[idx].set_xlabel('Actual')
    axes[idx].set_ylabel('Predicted')
    axes[idx].set_title(f'{name}\nR² = {results[name]["R²"]:.3f}')

plt.tight_layout()
plt.show()

Real-World Application: Energy Consumption Prediction

Let’s apply SVR to predict energy consumption based on temperature and other factors:

from sklearn.datasets import fetch_california_housing

# Load dataset (using California housing as a proxy for energy data)
data = fetch_california_housing()
X, y = data.data, data.target

# Use a subset of features: MedInc, HouseAge, AveRooms, AveBedrms
X = X[:, :4]

# Kernel SVR training scales poorly with sample count; subsample to keep this fast
X, y = X[:5000], y[:5000]

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Optimized SVR model
svr_final = SVR(kernel='rbf', C=10, gamma=0.1, epsilon=0.1)
svr_final.fit(X_train_scaled, y_train)

# Predictions and evaluation
y_pred = svr_final.predict(X_test_scaled)

print("Final Model Performance:")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")

# Feature importance approximation (using permutation)
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    svr_final, X_test_scaled, y_test, n_repeats=10, random_state=42
)

feature_names = data.feature_names[:4]
for i, name in enumerate(feature_names):
    print(f"{name}: {perm_importance.importances_mean[i]:.4f}")

SVR excels when relationships are non-linear and you need robust predictions. It’s particularly effective for medium-sized datasets (hundreds to thousands of samples) with complex patterns. For very large datasets, consider using LinearSVR or other scalable alternatives. Always start with proper feature scaling, use cross-validation for hyperparameter tuning, and compare against simpler baselines to ensure the complexity is justified.
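
For the large-dataset case, LinearSVR drops the kernel trick in exchange for training times that scale to far more samples. A minimal sketch on synthetic data (the sizes and coefficients are arbitrary):

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 50k samples would be painfully slow for kernel SVR but is quick here
rng = np.random.default_rng(42)
X = rng.random((50_000, 5)) * 10
y = X @ np.array([2.0, -1.0, 0.5, 3.0, 0.0]) + rng.normal(0, 1, 50_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# LinearSVR solves the same epsilon-insensitive problem with a linear model
model = make_pipeline(StandardScaler(),
                      LinearSVR(C=1.0, epsilon=0.1, max_iter=10_000))
model.fit(X_train, y_train)
print(f"Test R²: {r2_score(y_test, model.predict(X_test)):.3f}")
```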
