How to Perform Cross-Validation in Python

Cross-validation is a statistical method for evaluating machine learning models by partitioning data into subsets, training on some subsets, and validating on others.

Key Insights

  • Cross-validation provides a more reliable estimate of model performance than a single train-test split by testing on multiple data subsets, reducing the risk of overfitting to a particular split
  • Stratified K-fold is essential for imbalanced datasets and classification problems, maintaining class proportions across folds to prevent biased evaluation metrics
  • Always apply preprocessing within cross-validation folds using pipelines to avoid data leakage—fitting scalers or encoders on the entire dataset before CV will artificially inflate performance metrics

Introduction to Cross-Validation

Cross-validation is a statistical method for evaluating machine learning models by partitioning data into subsets, training on some subsets, and validating on others. The fundamental problem it solves is simple: a single train-test split can be misleading. Your model might perform exceptionally well on one particular split due to luck, or poorly due to an unrepresentative test set.

When you train a model on data, it learns patterns—both genuine signals and noise specific to that training set. Overfitting occurs when your model memorizes training data rather than learning generalizable patterns. A single 80-20 split might hide this problem if your 20% test set happens to be similar to your training data. Cross-validation addresses this by systematically rotating which data serves as the test set, giving you a more robust estimate of how your model will perform on unseen data.

The core value proposition: instead of one performance number that might be an outlier, you get multiple measurements that can be averaged and analyzed for variance. High variance across folds signals that your model is unstable or that your dataset has problematic characteristics.

K-Fold Cross-Validation

K-fold cross-validation divides your dataset into K equal-sized folds. The algorithm trains K times, each time using K-1 folds for training and the remaining fold for testing. This means every data point gets used for testing exactly once and for training K-1 times.

The typical choice is K=5 or K=10. Smaller K means faster computation but higher bias in your performance estimate. Larger K gives lower bias but increases computational cost and can increase variance in your estimates. K=10 has become a default through empirical research showing it provides a good bias-variance tradeoff for most datasets.

Here’s how to implement K-fold cross-validation with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Create K-fold cross-validator
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

The shuffle=True parameter is important—it randomizes the data before splitting, preventing issues if your data is ordered by class or some other feature. The random_state ensures reproducibility.
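To make the ordering issue concrete, here's a small sketch comparing which classes land in each test fold with and without shuffling. The iris dataset is stored sorted by class, so unshuffled folds each test on a narrow slice of the label space:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

plain = KFold(n_splits=5, shuffle=False)
shuffled = KFold(n_splits=5, shuffle=True, random_state=42)

# Classes present in each test fold
plain_classes = [sorted(set(y[test].tolist())) for _, test in plain.split(X)]
shuf_classes = [sorted(set(y[test].tolist())) for _, test in shuffled.split(X)]

# Without shuffling, folds follow the class-sorted order, so each
# test fold sees only one or two of the three classes
print(f"Unshuffled test folds: {plain_classes}")
print(f"Shuffled test folds:   {shuf_classes}")
```

The unshuffled folds cover classes `[[0], [0, 1], [1], [1, 2], [2]]`, so each fold evaluates on an unrepresentative slice of the data.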

You can also manually iterate through folds for more control:

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"Fold {fold}: {score:.3f}")

Stratified K-Fold for Imbalanced Datasets

Standard K-fold has a critical weakness with imbalanced datasets: random splits might create folds where minority classes are underrepresented or absent entirely. If you have a binary classification problem with 95% class A and 5% class B, a random fold might contain zero class B samples, making that fold useless for evaluation.

Stratified K-fold solves this by ensuring each fold maintains approximately the same class distribution as the complete dataset. This is essential for classification problems, especially imbalanced ones.

from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                          n_features=20, random_state=42)

print(f"Overall class distribution: {np.bincount(y)}")

# Compare regular K-fold vs Stratified K-fold
print("\nRegular K-Fold class distributions:")
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold}: {test_dist} ({test_dist[1]/len(test_idx)*100:.1f}% minority)")

print("\nStratified K-Fold class distributions:")
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skfold.split(X, y), 1):
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold}: {test_dist} ({test_dist[1]/len(test_idx)*100:.1f}% minority)")

Notice that StratifiedKFold.split() requires both X and y, since it needs class labels to stratify. The output shows that stratified folds maintain consistent class ratios, while regular folds can vary significantly.

For classification tasks, use StratifiedKFold by default. The only exception is when you have specific reasons to use regular K-fold, such as multi-label classification where stratification becomes complex.

Leave-One-Out and Time Series Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is K-fold where K equals the number of samples. Each iteration uses a single sample for testing and all others for training. This gives you the lowest bias possible but comes with extreme computational cost—you train N models for N samples.

LOOCV is only practical for small datasets (hundreds of samples, not thousands) and fast-training models. It also has high variance because test sets are so small (one sample each).

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X[:100], y[:100], cv=loo)
print(f"LOOCV mean accuracy: {scores.mean():.3f}")
print(f"Number of folds: {loo.get_n_splits(X[:100])}")

More important for practitioners is TimeSeriesSplit, which handles temporal data where random splitting causes data leakage. With time series, you cannot train on future data and test on past data—that would give you impossibly optimistic results.

TimeSeriesSplit creates expanding training sets with forward-looking test sets:

from sklearn.model_selection import TimeSeriesSplit
import matplotlib.pyplot as plt

# Simulate time series data
n_samples = 100
X_time = np.arange(n_samples).reshape(-1, 1)
y_time = np.sin(X_time.ravel() * 0.1) + np.random.normal(0, 0.1, n_samples)

tscv = TimeSeriesSplit(n_splits=5)

plt.figure(figsize=(12, 6))
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_time), 1):
    plt.scatter(train_idx, [fold] * len(train_idx), c='blue', marker='s', s=20)
    plt.scatter(test_idx, [fold] * len(test_idx), c='red', marker='o', s=20)

plt.ylabel('Fold')
plt.xlabel('Sample Index')
plt.title('TimeSeriesSplit Visualization')
plt.legend(['Train', 'Test'])
plt.tight_layout()
plt.savefig('timeseries_cv.png')

Notice how training sets grow over time and test sets always follow training sets chronologically. This respects temporal ordering and prevents future information from leaking into your model during training.
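If matplotlib isn't available, the same structure can be verified by printing the index ranges each split produces, using the same 100-sample setup as above:

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X_time = np.arange(100).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)

splits = list(tscv.split(X_time))
for fold, (train_idx, test_idx) in enumerate(splits, 1):
    # Training always ends strictly before testing begins
    print(f"Fold {fold}: train [0..{train_idx[-1]}], "
          f"test [{test_idx[0]}..{test_idx[-1]}]")
```

Every training range starts at index 0 and grows, and every test range begins where training ends, so no future sample ever appears in a training set.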

Cross-Validation with Hyperparameter Tuning

Cross-validation’s real power emerges when combined with hyperparameter tuning. GridSearchCV and RandomizedSearchCV use cross-validation internally to evaluate each parameter combination, preventing overfitting to your validation set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize model
rf = RandomForestClassifier(random_state=42)

# Grid search with cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit grid search
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# Note: this scores on the same data used for fitting, so it is optimistic.
# Hold out a separate test set for an unbiased final estimate.
print(f"Score on training data: {grid_search.score(X, y):.3f}")

# Access detailed results
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
print(results_df[['params', 'mean_test_score', 'std_test_score']].head())

Each parameter combination is evaluated using 5-fold cross-validation, meaning if you test 100 combinations, you’re training 500 models. The n_jobs=-1 parameter parallelizes this across all CPU cores.
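When the full grid is too expensive, RandomizedSearchCV (mentioned above) samples a fixed number of combinations instead of exhausting all of them. A sketch using the same parameter space, on a stand-in synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,  # sample only 10 of the 144 possible combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

With n_iter=10 and 5 folds, this trains 50 models instead of the 720 an exhaustive grid search would require, usually at little cost in final score.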

Best Practices and Common Pitfalls

The most critical mistake is data leakage through preprocessing. If you fit a scaler, imputer, or feature selector on your entire dataset before cross-validation, information from test folds leaks into training, inflating your performance metrics.

The solution is pipelines that encapsulate preprocessing and modeling:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate

# Wrong way - data leakage
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on entire dataset
scores = cross_val_score(model, X_scaled, y, cv=5)  # Leakage!

# Right way - preprocessing in pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])

# Cross-validation fits scaler separately for each fold
cv_results = cross_validate(
    pipeline, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring=['accuracy', 'precision_weighted', 'recall_weighted'],
    return_train_score=True
)

print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f}")
print(f"Train accuracy: {cv_results['train_accuracy'].mean():.3f}")

The cross_validate function is more powerful than cross_val_score—it returns multiple metrics, training scores, and fit times. Comparing training and test scores helps detect overfitting.
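Those extra outputs are easy to inspect; a minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
res = cross_validate(LogisticRegression(max_iter=200), X, y, cv=5,
                     return_train_score=True)

# Per-fold fit and scoring times come back alongside the scores
print(f"Fit times (s): {res['fit_time']}")
print(f"Mean train accuracy: {res['train_score'].mean():.3f}")
print(f"Mean test accuracy:  {res['test_score'].mean():.3f}")
```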

Other best practices:

Choose K based on dataset size: Use K=5 for small datasets (hundreds of samples), K=10 for medium datasets (thousands), and consider K=3 for very large datasets where computation is expensive.

Always shuffle unless temporal: Set shuffle=True for non-temporal data to avoid ordering artifacts.

Use stratification for classification: Default to StratifiedKFold unless you have specific reasons not to.

Consider computational cost: Cross-validation multiplies training time by K. For large datasets or slow models, reduce K or use train_test_split with multiple random seeds.

Report both mean and standard deviation: High standard deviation across folds indicates model instability or dataset issues worth investigating.
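The "multiple random seeds" alternative mentioned above can be sketched as repeated stratified hold-out. Unlike K-fold, test sets may overlap across seeds, but each model trains only once per seed:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Repeated stratified hold-out: cheaper than full K-fold
# when you use fewer seeds than folds
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```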

Cross-validation is your primary defense against overfitting and unreliable performance estimates. Implement it correctly with pipelines, choose appropriate splitting strategies for your data type, and you’ll build models that actually generalize to production data.
