How to Implement Bagging in Python

Key Insights

  • Bagging reduces model variance by training multiple models on bootstrapped samples and aggregating their predictions, making it particularly effective for high-variance algorithms like decision trees
  • You can implement bagging from scratch in under 50 lines of Python, but scikit-learn’s BaggingClassifier and BaggingRegressor provide production-ready implementations with better performance and flexibility
  • Bagging typically improves model accuracy by 2-10% over single models while providing more stable predictions, with diminishing returns beyond 50-100 base estimators

Introduction to Bagging

Bagging, short for Bootstrap Aggregating, is an ensemble learning technique that combines predictions from multiple models to produce more robust results. The core idea is simple: train several models on different random subsets of your training data, then aggregate their predictions through voting or averaging.

The technique excels at reducing variance in unstable models—those whose predictions change significantly with small variations in training data. Decision trees are the classic example. A single decision tree might overfit to noise in your training set, but a bagged ensemble of trees learns more generalizable patterns.

Real-world applications are everywhere. Random Forests, one of the most popular machine learning algorithms, is essentially bagged decision trees with an additional twist. Financial institutions use bagging for credit scoring, healthcare systems employ it for disease diagnosis, and e-commerce platforms leverage it for customer churn prediction.

The Mathematics Behind Bagging

Bagging works through two key mechanisms: bootstrap sampling and aggregation.

Bootstrap sampling creates multiple training sets by randomly sampling with replacement from your original dataset. Each bootstrap sample has the same size as the original but contains duplicates of some instances while omitting others. On average, each bootstrap sample contains about 63.2% of the unique instances from the original dataset.
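That 63.2% figure is 1 − 1/e, the limit of 1 − (1 − 1/n)^n as the dataset size n grows. A quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# One bootstrap sample: n draws with replacement from n indices
indices = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(indices)) / n

print(f"Empirical unique fraction: {unique_fraction:.4f}")
print(f"Theoretical 1 - 1/e:       {1 - np.exp(-1):.4f}")
```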

The aggregation step combines predictions from all models. For classification, we use majority voting—the class that receives the most votes wins. For regression, we average the predictions. This aggregation reduces variance because random errors from individual models tend to cancel out.
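In NumPy terms, both aggregation rules are one-liners. The prediction matrices below are made up for illustration (rows are models, columns are instances):

```python
import numpy as np

# Hypothetical class predictions from 5 base models (rows) for 4 instances (columns)
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
])

# Classification: majority vote down each column
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, class_preds)
print(f"Majority votes: {votes}")  # [0 1 1 0]

# Hypothetical regression predictions from 3 base models for 2 instances
reg_preds = np.array([
    [2.1, 3.0],
    [1.9, 3.2],
    [2.0, 2.8],
])
print(f"Averaged predictions: {reg_preds.mean(axis=0)}")  # [2. 3.]
```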

Here’s how bootstrap sampling works in practice:

import numpy as np
import matplotlib.pyplot as plt

# Original dataset
original_data = np.arange(1, 11)
print(f"Original data: {original_data}")

# Create 5 bootstrap samples
n_samples = 5
bootstrap_samples = []

for i in range(n_samples):
    # Sample with replacement
    bootstrap = np.random.choice(original_data, size=len(original_data), replace=True)
    bootstrap_samples.append(bootstrap)
    print(f"Bootstrap sample {i+1}: {sorted(bootstrap)}")

# Visualize which instances appear in each bootstrap
fig, ax = plt.subplots(figsize=(10, 6))
for i, sample in enumerate(bootstrap_samples):
    unique, counts = np.unique(sample, return_counts=True)
    ax.scatter(unique, [i+1]*len(unique), s=counts*100, alpha=0.6)

ax.set_yticks(range(1, n_samples+1))
ax.set_xlabel('Original Instance')
ax.set_ylabel('Bootstrap Sample')
ax.set_title('Bootstrap Sampling Visualization')
plt.tight_layout()
plt.show()

Manual Bagging Implementation from Scratch

Let’s build a bagging classifier from the ground up. This implementation will help you understand exactly what’s happening under the hood.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from collections import Counter

class SimpleBaggingClassifier:
    def __init__(self, n_estimators=10, max_samples=1.0):
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.estimators = []
        
    def _bootstrap_sample(self, X, y):
        """Generate a bootstrap sample from the dataset"""
        n_samples = int(len(X) * self.max_samples)
        indices = np.random.choice(len(X), size=n_samples, replace=True)
        return X[indices], y[indices]
    
    def fit(self, X, y):
        """Train multiple base estimators on bootstrap samples"""
        self.estimators = []
        
        for _ in range(self.n_estimators):
            # Create bootstrap sample
            X_sample, y_sample = self._bootstrap_sample(X, y)
            
            # Train a decision tree on this sample
            tree = DecisionTreeClassifier(max_depth=10)
            tree.fit(X_sample, y_sample)
            self.estimators.append(tree)
        
        return self
    
    def predict(self, X):
        """Aggregate predictions using majority voting"""
        # Get predictions from all estimators
        predictions = np.array([estimator.predict(X) for estimator in self.estimators])
        
        # Majority vote for each instance
        final_predictions = []
        for i in range(predictions.shape[1]):
            votes = predictions[:, i]
            most_common = Counter(votes).most_common(1)[0][0]
            final_predictions.append(most_common)
        
        return np.array(final_predictions)

# Test the implementation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, 
                          n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate
bagging = SimpleBaggingClassifier(n_estimators=50)
bagging.fit(X_train, y_train)
predictions = bagging.predict(X_test)

print(f"Custom Bagging Accuracy: {accuracy_score(y_test, predictions):.4f}")

# Compare with single decision tree
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_tree.fit(X_train, y_train)
single_predictions = single_tree.predict(X_test)
print(f"Single Tree Accuracy: {accuracy_score(y_test, single_predictions):.4f}")

Using Scikit-learn’s BaggingClassifier and BaggingRegressor

While building from scratch is educational, scikit-learn provides optimized implementations you should use in production. Let’s explore both classification and regression scenarios.

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Classification example with Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Create bagging classifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named 'base_estimator' before scikit-learn 1.2
    n_estimators=100,
    max_samples=0.8,  # Use 80% of data for each bootstrap
    max_features=0.8,  # Use 80% of features for each estimator
    bootstrap=True,
    n_jobs=-1,  # Use all CPU cores
    random_state=42
)

# Cross-validation
scores = cross_val_score(bagging_clf, X_iris, y_iris, cv=5)
print(f"Bagging Classifier CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Regression example with California Housing
housing = fetch_california_housing()
X_housing, y_housing = housing.data[:1000], housing.target[:1000]  # Subset for speed

bagging_reg = BaggingRegressor(
    estimator=LinearRegression(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)

# Cross-validation with negative MSE
scores = cross_val_score(bagging_reg, X_housing, y_housing, 
                        cv=5, scoring='neg_mean_squared_error')
print(f"Bagging Regressor CV MSE: {-scores.mean():.4f} (+/- {scores.std():.4f})")

Key parameters to tune:

  • n_estimators: Number of base models (typically 50-500)
  • max_samples: Fraction of samples for each bootstrap (0.5-1.0)
  • max_features: Fraction of features to consider (0.5-1.0)
  • bootstrap: Whether to use bootstrap sampling (almost always True)

Practical Example: Building a Bagging Model for Real Data

Let’s walk through a complete workflow using a credit card fraud detection scenario with imbalanced data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Simulate credit card transaction data (imbalanced)
X, y = make_classification(
    n_samples=10000,
    n_features=30,
    n_informative=20,
    n_redundant=10,
    n_classes=2,
    weights=[0.97, 0.03],  # 3% fraud rate
    flip_y=0.01,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training set fraud rate: {y_train.mean():.2%}")
print(f"Test set fraud rate: {y_test.mean():.2%}")

# Hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.7, 0.8, 1.0],
    'max_features': [0.7, 0.8, 1.0]
}

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)

grid_search = GridSearchCV(
    bagging_model,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Performance Comparison and Best Practices

Let’s quantify the benefits of bagging with a comprehensive comparison.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, learning_curve

# Compare single model vs bagging
models = {
    'Single Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Bagging (10 estimators)': BaggingClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=10, random_state=42
    ),
    'Bagging (50 estimators)': BaggingClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=50, random_state=42
    ),
    'Bagging (100 estimators)': BaggingClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=100, random_state=42
    )
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    results[name] = {
        'mean': scores.mean(),
        'std': scores.std()
    }
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Visualize variance reduction
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
names = list(results.keys())
means = [results[name]['mean'] for name in names]
stds = [results[name]['std'] for name in names]

ax1.bar(range(len(names)), means, yerr=stds, capsize=5)
ax1.set_xticks(range(len(names)))
ax1.set_xticklabels(names, rotation=45, ha='right')
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Performance Comparison')

# Learning curves for single tree vs bagging
train_sizes, train_scores, val_scores = learning_curve(
    models['Bagging (50 estimators)'], X_train, y_train, 
    cv=5, n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10)
)

ax2.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
ax2.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
ax2.fill_between(train_sizes, 
                  val_scores.mean(axis=1) - val_scores.std(axis=1),
                  val_scores.mean(axis=1) + val_scores.std(axis=1), 
                  alpha=0.2)
ax2.set_xlabel('Training Set Size')
ax2.set_ylabel('Score')
ax2.set_title('Learning Curve: Bagging Ensemble')
ax2.legend()

plt.tight_layout()
plt.show()

When to use bagging:

  • Your base model has high variance (decision trees, neural networks, k-NN)
  • You have sufficient computational resources for training multiple models
  • You need more stable predictions than a single model provides
  • Your dataset is large enough to create meaningful bootstrap samples (typically 1000+ instances)

When NOT to use bagging:

  • Your base model already has low variance (linear regression, naive Bayes)
  • You need model interpretability (ensembles are harder to explain)
  • Training time is critical and you can’t parallelize
  • Your dataset is too small (< 500 instances)

Computational considerations:

Bagging parallelizes naturally since each base model trains independently. Always set n_jobs=-1 in scikit-learn to use all CPU cores. For large datasets, start with 50 estimators and increase only if validation performance improves. Beyond 100 estimators, gains are usually marginal.

The sweet spot for max_samples is typically 0.7-0.8, giving each model enough data while maintaining diversity. For max_features, use 0.8-1.0 unless you have many irrelevant features.

Bagging won’t fix fundamentally weak models or poor feature engineering, but it’s one of the most reliable ways to squeeze extra performance from good base models. Combine it with proper cross-validation and hyperparameter tuning for production-ready machine learning systems.
