How to Perform Cross-Validation in R
Key Insights
• Cross-validation provides more reliable performance estimates than single train-test splits by evaluating models across multiple data partitions, reducing the impact of random sampling variation.
• The caret package simplifies cross-validation implementation in R with standardized interfaces for dozens of algorithms, making it the de facto standard for production model evaluation workflows.
• Stratified cross-validation is critical for classification tasks with imbalanced classes—failing to preserve class distributions across folds leads to unreliable performance estimates and poor model selection decisions.
Introduction to Cross-Validation
Single train-test splits are fundamentally unreliable. The performance you measure depends heavily on which observations randomly end up in your test set. You might get lucky with an easy test set and overestimate performance, or unlucky with a difficult one and underestimate it.
Cross-validation solves this by evaluating your model multiple times on different subsets of data, then averaging the results. This gives you a more stable estimate of how your model will perform on unseen data. More importantly, it helps you detect overfitting—when your model memorizes training data instead of learning generalizable patterns.
The basic principle is simple: partition your data into multiple folds, train on some folds, validate on others, rotate which folds serve which purpose, then aggregate the results. This approach uses your data more efficiently than holding out a large test set and gives you confidence intervals around performance metrics.
K-Fold Cross-Validation Basics
K-fold cross-validation divides your dataset into k equal-sized subsets. The algorithm trains k different models, each time using k-1 folds for training and the remaining fold for validation. After k iterations, every observation has been used for validation exactly once.
Common choices for k are 5 or 10. Smaller k values (like 3) train faster but give noisier estimates. Larger k values (like 20) are more computationally expensive but reduce bias. The value k=10 has become standard because it balances these tradeoffs well for most datasets.
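For intuition about how the folds partition the data, here is a quick look at the fold sizes the cut()-plus-sample() idiom (used in the manual implementation below) produces for a 32-row dataset at several values of k:

# How fold sizes work out for n = 32 rows at different k
n <- 32
for (k in c(3, 5, 10)) {
  set.seed(42)
  folds <- sample(cut(seq_len(n), breaks = k, labels = FALSE))
  cat("k =", k, "fold sizes:", as.vector(table(folds)), "\n")
}

Fold sizes differ by at most one observation; sample() only shuffles which rows land in which fold.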
Here’s a manual implementation using base R with linear regression on the mtcars dataset:
# Load data
data(mtcars)
# Set parameters
k <- 5
set.seed(42) # Reproducibility
# Create fold assignments
folds <- cut(seq(1, nrow(mtcars)), breaks = k, labels = FALSE)
folds <- sample(folds) # Randomize
# Storage for results
mse_values <- numeric(k)
# Perform k-fold CV
for(i in 1:k) {
  # Split data
  test_indices <- which(folds == i)
  train_data <- mtcars[-test_indices, ]
  test_data <- mtcars[test_indices, ]
  # Train model
  model <- lm(mpg ~ wt + hp + cyl, data = train_data)
  # Predict and evaluate
  predictions <- predict(model, newdata = test_data)
  mse_values[i] <- mean((test_data$mpg - predictions)^2)
}
# Average performance
mean_mse <- mean(mse_values)
sd_mse <- sd(mse_values)
cat(sprintf("Mean MSE: %.2f (+/- %.2f)\n", mean_mse, sd_mse))
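The loop above generalizes naturally into a small helper. This is a sketch, not a library function — cv_mse is a name introduced here for illustration:

# Hypothetical helper wrapping the manual k-fold loop for any lm formula
cv_mse <- function(formula, data, k = 5, seed = 42) {
  set.seed(seed)
  folds <- sample(cut(seq_len(nrow(data)), breaks = k, labels = FALSE))
  mse <- numeric(k)
  for (i in seq_len(k)) {
    test_idx <- which(folds == i)
    fit <- lm(formula, data = data[-test_idx, ])
    pred <- predict(fit, newdata = data[test_idx, ])
    # Pull the response column named on the left side of the formula
    y <- model.response(model.frame(formula, data[test_idx, ]))
    mse[i] <- mean((y - pred)^2)
  }
  c(mean = mean(mse), sd = sd(mse))
}

data(mtcars)
cv_mse(mpg ~ wt + hp + cyl, mtcars, k = 5)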
This manual approach works but becomes tedious when comparing multiple models or algorithms. That’s where caret comes in.
Using the caret Package
The caret (Classification And Regression Training) package provides a unified interface for cross-validation across hundreds of machine learning algorithms. It handles the fold creation, model training, and result aggregation automatically.
The workflow centers on two functions: trainControl() configures the cross-validation strategy, and train() fits models using that strategy.
library(caret)
# Configure 10-fold CV
train_control <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final",
  verboseIter = FALSE
)
# Train linear model with CV
set.seed(42)
lm_model <- train(
  mpg ~ wt + hp + cyl,
  data = mtcars,
  method = "lm",
  trControl = train_control
)
# View results
print(lm_model)
print(lm_model$results)
For a random forest model with hyperparameter tuning:
# Configure CV with more options
train_control <- trainControl(
  method = "cv",
  number = 10,
  search = "grid"
)
# Define hyperparameter grid
rf_grid <- expand.grid(
  mtry = c(2, 3, 4) # Number of variables at each split
)
# Train random forest with CV
set.seed(42)
rf_model <- train(
  mpg ~ .,
  data = mtcars,
  method = "rf",
  trControl = train_control,
  tuneGrid = rf_grid,
  ntree = 100
)
# Extract best parameters and performance
print(rf_model$bestTune)
print(rf_model$results)
The train() function automatically performs cross-validation for each hyperparameter combination, selecting the best configuration based on the optimization metric (RMSE by default for regression).
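You can change the selection criterion through train()'s metric argument. A sketch, assuming a reasonably recent caret whose default regression summary reports RMSE, Rsquared, and MAE:

library(caret)
data(mtcars)
set.seed(42)
# Select the model configuration by mean absolute error instead of RMSE
mae_model <- train(
  mpg ~ wt + hp + cyl,
  data = mtcars,
  method = "lm",
  trControl = trainControl(method = "cv", number = 5),
  metric = "MAE"
)
print(mae_model$results[, c("RMSE", "Rsquared", "MAE")])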
Leave-One-Out Cross-Validation (LOOCV)
LOOCV is an extreme case where k equals the number of observations. Each model trains on all data except one observation, which serves as the validation set. This maximizes training data and eliminates randomness in fold assignment.
LOOCV works well for small datasets (n < 100) where you can’t afford to hold out much data. The downside is computational cost—you train n models instead of 5 or 10.
Manual implementation:
# LOOCV manually
n <- nrow(mtcars)
loocv_errors <- numeric(n)
for(i in 1:n) {
  train_data <- mtcars[-i, ]
  test_data <- mtcars[i, ]
  model <- lm(mpg ~ wt + hp + cyl, data = train_data)
  prediction <- predict(model, newdata = test_data)
  loocv_errors[i] <- (test_data$mpg - prediction)^2
}
loocv_mse <- mean(loocv_errors)
print(loocv_mse)
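For linear models specifically, LOOCV does not actually require refitting n models: each leave-one-out residual equals the ordinary residual divided by one minus that observation's leverage, so the whole estimate comes from a single fit:

# Closed-form LOOCV for lm via hat values (the PRESS shortcut)
data(mtcars)
fit <- lm(mpg ~ wt + hp + cyl, data = mtcars)
loocv_mse_fast <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)
print(loocv_mse_fast)  # matches the loop above

This shortcut only applies to models fit by ordinary least squares; for anything else you still need the explicit loop or caret.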
Using caret is simpler:
# LOOCV with caret
loocv_control <- trainControl(method = "LOOCV")
loocv_model <- train(
  mpg ~ wt + hp + cyl,
  data = mtcars,
  method = "lm",
  trControl = loocv_control
)
print(loocv_model)
For datasets with thousands of observations, LOOCV becomes impractical. Stick with k-fold CV in those cases.
Stratified Cross-Validation for Classification
For classification problems, random fold assignment can create problems. If you have 100 observations with 90 in class A and 10 in class B, random 5-fold CV can easily produce folds containing only one or two class B observations, or none at all. This leads to unreliable performance estimates.
Stratified cross-validation maintains the class distribution in each fold. If your overall dataset is 90% class A, each fold will also be approximately 90% class A.
library(caret)
data(iris)
# Create stratified folds manually
set.seed(42)
stratified_folds <- createFolds(
  iris$Species,
  k = 5,
  list = TRUE,
  returnTrain = FALSE
)
# Verify class distributions
for(i in 1:5) {
  fold_data <- iris[stratified_folds[[i]], ]
  print(table(fold_data$Species))
}
# Use stratified CV in train()
train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = multiClassSummary # requires the MLmetrics package
)
# Train classification model
set.seed(42)
knn_model <- train(
  Species ~ .,
  data = iris,
  method = "knn",
  trControl = train_control,
  tuneGrid = data.frame(k = c(3, 5, 7, 9))
)
print(knn_model)
caret automatically stratifies folds on the outcome variable when train() handles a classification task, but you can take full control by supplying your own fold indices through trainControl()'s index argument.
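One way to control the folds explicitly is to build stratified training indices with createFolds() and hand them to trainControl() via index. A sketch:

library(caret)
data(iris)
# returnTrain = TRUE yields, for each fold, the rows to train on
set.seed(42)
train_idx <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = train_idx)
set.seed(42)
knn_fixed <- train(Species ~ ., data = iris, method = "knn",
                   trControl = ctrl, tuneGrid = data.frame(k = 5))
print(knn_fixed$results)

Because the indices are fixed up front, every model trained with this ctrl object sees exactly the same stratified folds.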
Comparing Models with Cross-Validation
Cross-validation’s real power emerges when comparing multiple models. By using the same folds for each model, you ensure fair comparisons.
library(caret)
# Shared CV configuration
train_control <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final"
)
# Train multiple models, resetting the seed before each call so all three
# train() calls generate identical fold assignments
set.seed(42)
lm_model <- train(mpg ~ ., data = mtcars, method = "lm", trControl = train_control)
set.seed(42)
ridge_model <- train(mpg ~ ., data = mtcars, method = "ridge", trControl = train_control)
set.seed(42)
rf_model <- train(mpg ~ ., data = mtcars, method = "rf", trControl = train_control, ntree = 100)
# Compare results
results <- resamples(list(
  Linear = lm_model,
  Ridge = ridge_model,
  RandomForest = rf_model
))
summary(results)
# Visualize comparisons
dotplot(results)
bwplot(results)
The resamples() function aligns results from the same folds, enabling paired statistical tests:
# Statistical comparison
diff_results <- diff(results)
summary(diff_results)
This tells you whether performance differences are statistically significant or just random variation.
Best Practices and Common Pitfalls
Choose k wisely. Use k=5 or k=10 for most applications. Use LOOCV only for small datasets (n < 100). Larger k values increase computation time without necessarily improving reliability.
Set random seeds. Always use set.seed() before cross-validation to ensure reproducibility. Without it, you’ll get different results each run.
Avoid data leakage. Perform feature scaling, imputation, and feature selection inside the CV loop, not before. If you normalize using the entire dataset, information from test folds leaks into training, inflating performance estimates.
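With caret, one way to keep preprocessing inside the loop is train()'s preProcess argument, which re-estimates the transformation parameters on each fold's training portion only:

library(caret)
data(mtcars)
set.seed(42)
# Centering and scaling are learned per-fold from training rows,
# then applied to that fold's held-out rows — no leakage
leak_free <- train(
  mpg ~ wt + hp + cyl,
  data = mtcars,
  method = "lm",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 5)
)
print(leak_free$results)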
Use stratification for classification. This is automatic in caret but verify it’s enabled, especially with imbalanced datasets.
Consider computational costs. Cross-validation multiplies training time by k. For expensive models (deep learning, large random forests), use smaller k values or consider a single validation set for initial exploration.
Don’t tune on CV results directly. Cross-validation estimates generalization performance. If you repeatedly adjust models based on CV scores, you’re indirectly overfitting to the CV folds. Use nested cross-validation for hyperparameter tuning within model selection.
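A minimal nested CV sketch in base R, assuming the model choice is among a few candidate formulas: the inner loop picks a candidate, the outer loop estimates how well that selection procedure generalizes.

data(mtcars)
candidates <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + cyl)
set.seed(42)
outer <- sample(cut(seq_len(nrow(mtcars)), breaks = 5, labels = FALSE))
outer_mse <- numeric(5)
for (i in 1:5) {
  dev <- mtcars[outer != i, ]   # outer training set
  hold <- mtcars[outer == i, ]  # outer validation set
  # Inner 3-fold CV on dev only, to choose among the candidates
  inner <- sample(cut(seq_len(nrow(dev)), breaks = 3, labels = FALSE))
  inner_mse <- sapply(candidates, function(f) {
    mean(sapply(1:3, function(j) {
      fit <- lm(f, data = dev[inner != j, ])
      mean((dev$mpg[inner == j] - predict(fit, dev[inner == j, ]))^2)
    }))
  })
  # Refit the winner on all of dev, score once on the outer holdout
  best <- candidates[[which.min(inner_mse)]]
  fit <- lm(best, data = dev)
  outer_mse[i] <- mean((hold$mpg - predict(fit, hold))^2)
}
cat(sprintf("Nested CV MSE: %.2f\n", mean(outer_mse)))

The outer score is an honest estimate because the outer holdout never influences which candidate wins.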
Cross-validation is your primary defense against overfitting and unreliable performance estimates. Master these techniques and you’ll build models that actually work in production.