How to Perform Bayesian Optimization in Python

Key Insights

  • Bayesian optimization uses probabilistic models to intelligently search parameter spaces, requiring 10-100x fewer evaluations than grid search for expensive functions like neural network training
  • The algorithm balances exploration (sampling uncertain regions) and exploitation (refining known good areas) through acquisition functions like Expected Improvement and Upper Confidence Bound
  • Use scikit-optimize for quick implementations, but switch to Optuna for production ML pipelines due to its pruning capabilities and distributed optimization support

Introduction to Bayesian Optimization

Bayesian optimization solves a fundamental problem in machine learning: how do you find optimal hyperparameters when each evaluation takes minutes or hours? Grid search is exhaustive but wasteful. Random search improves efficiency but ignores information from previous trials. Bayesian optimization treats hyperparameter tuning as a sequential decision problem, using past evaluations to inform where to search next.

The algorithm builds a probabilistic model (typically a Gaussian process) of your objective function, then uses this model to decide which parameters to try next. This approach shines when evaluations are expensive—training deep learning models, running simulations, or optimizing chemical processes. Where grid search might require 1000 evaluations, Bayesian optimization often finds comparable results in 50-100 trials.

Core Concepts

Bayesian optimization has three key components working together:

Surrogate Model: A Gaussian process (GP) approximates your objective function. The GP provides not just predictions but uncertainty estimates—crucial for knowing where we’re confident and where we should explore. As you evaluate more points, the GP updates its beliefs about the function’s shape.

Acquisition Function: This function decides where to sample next by balancing exploitation (sampling where the model predicts good values) and exploration (sampling where uncertainty is high). Common choices include:

  • Expected Improvement (EI): Samples points with the highest expected improvement over the current best
  • Upper Confidence Bound (UCB): Samples points with high predicted value plus uncertainty bonus
  • Probability of Improvement (PI): Samples points most likely to beat the current best

Sequential Optimization Loop: Evaluate the objective, update the surrogate model, use the acquisition function to select the next point, repeat.

Here’s a visualization of how a Gaussian process models uncertainty:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# True function (unknown to optimizer)
def true_function(x):
    return np.sin(3 * x) + 0.3 * np.cos(9 * x)

# Sample a few points
X_observed = np.array([0.1, 0.4, 0.8]).reshape(-1, 1)
y_observed = true_function(X_observed).ravel()

# Fit Gaussian process
kernel = ConstantKernel(1.0) * RBF(length_scale=0.3)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6)
gp.fit(X_observed, y_observed)

# Predict on dense grid
X_pred = np.linspace(0, 1, 200).reshape(-1, 1)
y_pred, sigma = gp.predict(X_pred, return_std=True)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(X_pred, true_function(X_pred), 'g--', label='True function')
plt.plot(X_pred, y_pred, 'b-', label='GP mean')
plt.fill_between(X_pred.ravel(), y_pred - 2*sigma, y_pred + 2*sigma, 
                 alpha=0.3, label='95% confidence')
plt.scatter(X_observed, y_observed, c='red', s=100, zorder=10, label='Observations')
plt.legend()
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gaussian Process Surrogate Model')
plt.show()

Notice how uncertainty grows in regions far from observed points. The acquisition function exploits this information.
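To make this concrete, here is a minimal sketch of Expected Improvement computed against a GP fit on the same three toy observations as above. The closed-form EI under a Gaussian posterior is standard; the candidate grid size and the small xi jitter term are illustrative choices:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def true_function(x):
    return np.sin(3 * x) + 0.3 * np.cos(9 * x)

# Fit the surrogate on the same three observations as above
X_observed = np.array([0.1, 0.4, 0.8]).reshape(-1, 1)
y_observed = true_function(X_observed).ravel()
gp = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=0.3), alpha=1e-6
)
gp.fit(X_observed, y_observed)

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    """Closed-form EI for minimization under a Gaussian posterior."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)   # Guard against division by zero
    improvement = y_best - mu - xi     # We minimize, so lower mu is better
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# Score a dense candidate grid and pick the next point to evaluate
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
ei = expected_improvement(X_cand, gp, y_best=y_observed.min())
x_next = X_cand[np.argmax(ei)]
print(f"Next point to evaluate: x = {x_next[0]:.3f}")
```

The argmax of EI is where a full optimizer would evaluate next; libraries like scikit-optimize do exactly this selection internally, just with a smarter inner optimizer than a grid.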

Setting Up Your Environment

Install the necessary libraries. I recommend scikit-optimize for learning and Optuna for production use:

pip install scikit-optimize numpy matplotlib
pip install optuna  # Alternative, more modern library
pip install scikit-learn  # For the ML examples

Basic imports you’ll use repeatedly:

import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.plots import plot_convergence, plot_objective
from skopt.utils import use_named_args

Basic Implementation with Scikit-Optimize

Let’s optimize the Branin function, a standard benchmark with known global minima:

from skopt import gp_minimize
from skopt.space import Real
from skopt.plots import plot_convergence
import numpy as np

# Branin function (has 3 global minima)
def branin(params):
    x1, x2 = params
    a = 1.0
    b = 5.1 / (4.0 * np.pi**2)
    c = 5.0 / np.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * np.pi)
    
    term1 = a * (x2 - b * x1**2 + c * x1 - r)**2
    term2 = s * (1 - t) * np.cos(x1)
    term3 = s
    
    return term1 + term2 + term3

# Define search space
space = [
    Real(-5.0, 10.0, name='x1'),
    Real(0.0, 15.0, name='x2')
]

# Run optimization
result = gp_minimize(
    branin,                  # Objective function
    space,                   # Search space
    n_calls=50,             # Number of evaluations
    n_initial_points=10,    # Random exploration first
    acq_func='EI',          # Expected Improvement
    random_state=42
)

print(f"Best parameters: {result.x}")
print(f"Best value: {result.fun:.4f}")
print(f"Known global minimum: 0.397887")

# Visualize convergence
plot_convergence(result)

The optimizer finds near-optimal parameters in 50 evaluations. Grid search with similar granularity would require thousands.

Real-World Application: Hyperparameter Tuning

Here’s a complete example optimizing random forest hyperparameters on a classification task:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
import numpy as np

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                          n_informative=15, n_redundant=5, 
                          random_state=42)

# Define search space
space = [
    Integer(10, 200, name='n_estimators'),
    Integer(2, 20, name='max_depth'),
    Integer(2, 10, name='min_samples_split'),
    Real(0.01, 0.5, name='min_samples_leaf'),
    Real(0.5, 1.0, name='max_features')
]

# Objective function
@use_named_args(space)
def objective(**params):
    model = RandomForestClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        min_samples_leaf=params['min_samples_leaf'],
        max_features=params['max_features'],
        random_state=42,
        n_jobs=-1
    )
    
    # Use cross-validation score (negative because we minimize)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy', n_jobs=-1)
    return -np.mean(scores)  # Minimize negative accuracy

# Run optimization
result = gp_minimize(
    objective,
    space,
    n_calls=40,
    n_initial_points=10,
    random_state=42,
    verbose=True
)

print(f"\nBest hyperparameters:")
print(f"n_estimators: {result.x[0]}")
print(f"max_depth: {result.x[1]}")
print(f"min_samples_split: {result.x[2]}")
print(f"min_samples_leaf: {result.x[3]:.3f}")
print(f"max_features: {result.x[4]:.3f}")
print(f"Best CV accuracy: {-result.fun:.4f}")

This approach evaluates only 40 configurations but typically matches or beats grid search with hundreds of trials.
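One step the loop above doesn't show: after optimization finishes, refit a final model on the full training data using the best configuration. A minimal sketch, where the specific parameter values are hypothetical stand-ins for what result.x would contain:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same dataset as the tuning example
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5,
                           random_state=42)

# Hypothetical best configuration, in the same order as the search space
best = {'n_estimators': 150, 'max_depth': 12, 'min_samples_split': 4,
        'min_samples_leaf': 0.01, 'max_features': 0.7}

# Refit on all available training data with the tuned hyperparameters
final_model = RandomForestClassifier(**best, random_state=42, n_jobs=-1)
final_model.fit(X, y)
print(f"Training accuracy: {final_model.score(X, y):.3f}")
```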

Advanced Techniques

Choosing Acquisition Functions: Expected Improvement (EI) works well for most cases. Use Upper Confidence Bound (UCB) when you want more exploration:

# gp_hedge adaptively picks among EI, PI, and LCB each iteration
result = gp_minimize(objective, space, n_calls=50, acq_func='gp_hedge')

# LCB with custom kappa (higher = more exploration)
result = gp_minimize(
    objective, 
    space, 
    n_calls=50,
    acq_func='LCB',         # Lower Confidence Bound (UCB for minimization)
    kappa=2.5,              # Default is 1.96; raise for more exploration
    acq_optimizer='sampling',
    n_initial_points=10
)

Parallel Optimization: Evaluate multiple points simultaneously when you have computational resources:

from skopt import Optimizer

opt = Optimizer(space, n_initial_points=10, acq_func='EI')

# Ask for 4 points to evaluate in parallel
x_batch = opt.ask(n_points=4)

# Evaluate in parallel (pseudo-code)
y_batch = [objective(x) for x in x_batch]

# Tell optimizer about results
opt.tell(x_batch, y_batch)

Handling Categorical Variables: Mix continuous and categorical parameters:

from skopt.space import Categorical

space = [
    Integer(10, 200, name='n_estimators'),
    Categorical(['gini', 'entropy'], name='criterion'),
    Categorical([True, False], name='bootstrap')
]

Best Practices and Gotchas

Define Sensible Search Spaces: Don’t search a linear range spanning several orders of magnitude. Use a log-uniform prior for scale parameters like learning rates:

from skopt.space import Real

# Bad: linear space for learning rate
Real(0.00001, 0.1, name='learning_rate')

# Good: log-uniform space
Real(1e-5, 1e-1, prior='log-uniform', name='learning_rate')
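To see why this matters, compare how the two priors allocate samples across the range. This numpy-only sketch mimics the two distributions directly (the sample count and the 1e-3 threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Linear prior: nearly every draw lands in the top decade of the range
linear = rng.uniform(1e-5, 1e-1, n)

# Log-uniform prior: each decade [1e-5, 1e-4), ..., [1e-2, 1e-1)
# receives an equal share of samples
log_uniform = 10 ** rng.uniform(-5, -1, n)

for name, samples in [("linear", linear), ("log-uniform", log_uniform)]:
    frac_small = np.mean(samples < 1e-3)
    print(f"{name:12s}: {frac_small:.1%} of samples below 1e-3")
```

With a linear prior, only about 1% of samples fall below 1e-3, so small learning rates are barely explored; the log-uniform prior covers them with half its budget.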

Budget Your Evaluations: Start with 10-20x the dimensionality of your search space. For 5 hyperparameters, use 50-100 evaluations minimum.

Use Cross-Validation: Always use CV in your objective function. Single train-test splits lead to overfitting the validation set.

When NOT to Use Bayesian Optimization:

  • Cheap functions (< 1 second): Random search is fine
  • Very high dimensions (> 20): Consider random search or genetic algorithms
  • Discrete combinatorial problems: Use specialized solvers

Save Your Results: Optimization can crash. Save intermediate results:

import pickle

def checkpoint_callback(result):
    with open('optimization_checkpoint.pkl', 'wb') as f:
        pickle.dump(result, f)

result = gp_minimize(objective, space, n_calls=100, callback=[checkpoint_callback])

Dealing with Noise: If your objective function is noisy (e.g., stochastic training), increase n_initial_points and consider averaging multiple evaluations per configuration.
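A minimal sketch of the averaging idea, where the noisy quadratic stands in for a stochastic training run and the repeat count is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_objective(x, noise_std=0.5):
    """Stand-in for a stochastic training run."""
    return (x - 2.0) ** 2 + rng.normal(0.0, noise_std)

def averaged_objective(x, n_repeats=5):
    """Average several evaluations; variance shrinks by 1/n_repeats."""
    return np.mean([noisy_objective(x) for _ in range(n_repeats)])

# Repeated single evaluations fluctuate; averaged ones are more stable
singles = [noisy_objective(2.0) for _ in range(200)]
averaged = [averaged_objective(2.0) for _ in range(200)]
print(f"Std of single evals:   {np.std(singles):.3f}")
print(f"Std of averaged evals: {np.std(averaged):.3f}")
```

The trade-off is cost: each averaged evaluation is n_repeats times as expensive, so this only pays off when noise would otherwise mislead the surrogate.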

Bayesian optimization transforms hyperparameter tuning from a tedious grid search into an intelligent, sample-efficient process. Start with scikit-optimize for simplicity, but consider Optuna for production systems where pruning and distributed optimization matter. The key is defining good search spaces and budgeting enough evaluations—typically 50-100 for most ML problems.
