How to Split Data into Train and Test Sets in R

Key Insights

  • Random splitting works for most datasets, but stratified splitting (via caret or rsample) preserves class distributions and prevents imbalanced train/test sets—critical for classification problems with rare classes.
  • Time series and grouped data require specialized splitting methods that respect temporal order or group boundaries; random splits will leak future information and artificially inflate performance metrics.
  • Setting a seed before splitting is non-negotiable for reproducible research, and you should always verify that your split maintains representative distributions of key variables across both sets.

Introduction to Train-Test Splitting

Splitting your data into training and testing sets is fundamental to building reliable machine learning models. The training set teaches your model patterns in the data, while the test set—data the model has never seen—validates whether those patterns generalize to new observations. Without this separation, you’re essentially grading your model on the same material it studied, which tells you nothing about real-world performance.

The standard practice is an 80/20 or 70/30 split, where the larger portion trains the model and the smaller portion tests it. For datasets under 1,000 observations, consider 70/30 to ensure your test set has enough samples for meaningful evaluation. With larger datasets (10,000+ rows), you can safely use 90/10 since even 10% provides adequate test samples.
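These rules of thumb are easy to sanity-check numerically before you commit to a ratio. A quick illustrative table (the dataset sizes here are hypothetical):

```r
# Test-set sizes produced by common split ratios at different dataset sizes
n_obs     <- c(500, 1000, 10000, 100000)  # illustrative dataset sizes
test_prop <- c(0.30, 0.30, 0.10, 0.10)    # suggested test proportions

data.frame(
  n = n_obs,
  test_proportion = test_prop,
  test_size = floor(n_obs * test_prop)
)
```

Even a 10% test set on 100,000 rows leaves 10,000 observations for evaluation, while 10% of 500 rows would leave only 50.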

The real danger isn’t just overfitting—it’s overconfidence. A model that performs brilliantly on training data but fails on test data is worse than useless; it’s misleading. Proper train-test splitting is your first line of defense against this problem.

Basic Random Splitting with Base R

The simplest approach uses base R’s sample() function to randomly select row indices for training. This works well for most regression problems and balanced classification tasks.

# Load example dataset
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Create training indices (80% of data)
train_indices <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))

# Split the data
train_data <- mtcars[train_indices, ]
test_data <- mtcars[-train_indices, ]

# Verify the split
cat("Training set:", nrow(train_data), "observations\n")
cat("Test set:", nrow(test_data), "observations\n")

The set.seed() call is critical. Without it, you’ll get different splits each time you run your code, making your results impossible to reproduce. Choose any integer—the specific number doesn’t matter, but consistency does.

This approach has one significant limitation: it doesn’t account for class imbalances. If you’re predicting a rare event that occurs in 5% of cases, random sampling might give you a test set with 2% or 8% of that class, distorting your performance metrics.
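A quick simulation makes the distortion concrete (the 5% event rate and 400-row sample size here are illustrative):

```r
set.seed(99)

# Simulate a rare event occurring in roughly 5% of 400 observations
outcome <- rbinom(400, size = 1, prob = 0.05)

# Plain random 20% test split, as above
test_idx  <- sample(400, size = 0.2 * 400)
test_rate <- mean(outcome[test_idx])

cat("Overall event rate: ", mean(outcome), "\n")
cat("Test-set event rate:", test_rate, "\n")
```

Run this with a few different seeds and the test-set event rate will drift noticeably around the overall rate, which is exactly the problem stratified sampling fixes.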

Using the caret Package

The caret package’s createDataPartition() function solves the imbalance problem through stratified sampling. It maintains the proportion of your target variable across both sets, which is essential for classification problems.

library(caret)

# Load iris dataset
data(iris)

# Set seed
set.seed(456)

# Create stratified split maintaining species proportions
train_indices <- createDataPartition(iris$Species, 
                                     p = 0.8, 
                                     list = FALSE)

train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Verify stratification
cat("Training set species distribution:\n")
print(prop.table(table(train_data$Species)))

cat("\nTest set species distribution:\n")
print(prop.table(table(test_data$Species)))

cat("\nOriginal species distribution:\n")
print(prop.table(table(iris$Species)))

The output shows identical proportions across all three sets—exactly 33.3% for each species. This is crucial when working with imbalanced datasets. If you’re predicting customer churn with a 10% churn rate, stratified splitting ensures both your training and test sets have approximately 10% churned customers.

The list = FALSE parameter returns a vector of indices rather than a list, which simplifies the subsetting syntax. If you need several independent partitions, use the times argument rather than calling the function repeatedly.
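createDataPartition() can also generate several independent stratified splits in one call via its times argument; with list = FALSE they come back as columns of a matrix:

```r
library(caret)

data(iris)
set.seed(456)

# Three independent stratified 80% splits, one per matrix column
multi_idx <- createDataPartition(iris$Species, times = 3, p = 0.8, list = FALSE)
dim(multi_idx)  # 120 rows (80% of 150), 3 columns

# Use the first split exactly as before
train_1 <- iris[multi_idx[, 1], ]
test_1  <- iris[-multi_idx[, 1], ]
```

Each column preserves the species proportions, so this is a convenient way to check how sensitive your results are to the particular split you drew.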

Using the rsample Package (tidymodels)

The rsample package, part of the tidymodels ecosystem, provides a modern, pipe-friendly approach to data splitting. It creates a special split object that cleanly separates the splitting logic from data extraction.

library(rsample)
library(dplyr)

# Set seed
set.seed(789)

# Create split object (80/20)
data_split <- initial_split(mtcars, prop = 0.8, strata = NULL)

# Extract training and testing data
train_data <- training(data_split)
test_data <- testing(data_split)

# For stratified split on a specific variable
iris_split <- initial_split(iris, prop = 0.8, strata = Species)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)

# Verify stratification
iris_train %>%
  count(Species) %>%
  mutate(proportion = n / sum(n))

The rsample approach shines in larger workflows. The split object can be passed through pipelines, stored for documentation, and integrates seamlessly with other tidymodels packages like recipes and parsnip. The strata parameter provides the same stratification benefits as caret but with cleaner syntax.

This is my preferred method for new projects. The separation between creating the split and extracting the data makes code more readable and easier to debug.

Time Series and Grouped Data Considerations

Random splitting breaks catastrophically with temporal data. If you’re predicting stock prices and randomly split your data, your training set will contain future information that “predicts” the past in your test set. This data leakage produces artificially impressive results that fail spectacularly in production.

For time series, use chronological splits where training data comes entirely before test data:

library(rsample)

# Create example time series data
set.seed(42)
dates <- seq.Date(from = as.Date("2020-01-01"), 
                  to = as.Date("2023-12-31"), 
                  by = "day")
ts_data <- data.frame(
  date = dates,
  value = cumsum(rnorm(length(dates)))
)

# Time-based split: last 20% for testing
time_split <- initial_time_split(ts_data, prop = 0.8)

train_ts <- training(time_split)
test_ts <- testing(time_split)

# Verify temporal separation
cat("Training period ends:", format(max(train_ts$date)), "\n")
cat("Testing period starts:", format(min(test_ts$date)), "\n")

The initial_time_split() function assumes your data is already ordered chronologically and takes the first prop portion for training. No randomization occurs—the split respects temporal boundaries.
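Because the function trusts row order, sort by date before splitting if there is any chance your rows arrived shuffled. A minimal guard, using a deliberately out-of-order toy dataset:

```r
library(rsample)

# Toy data with rows deliberately out of chronological order
ts_data <- data.frame(
  date  = sample(seq.Date(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")),
  value = rnorm(365)
)

# Sort first -- initial_time_split() assumes rows are already ordered
ts_data <- ts_data[order(ts_data$date), ]
time_split <- initial_time_split(ts_data, prop = 0.8)

# Every training date should now precede every test date
stopifnot(max(training(time_split)$date) < min(testing(time_split)$date))
```

The stopifnot() check is cheap insurance; if it ever fails, you have a leakage bug, not a modeling problem.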

Similarly, grouped data (multiple observations per patient, customer, or entity) requires splitting at the group level, not the observation level. If patient 42 has 10 measurements and 8 end up in training while 2 end up in testing, your model learns patient 42’s specific patterns and gets tested on the same patient. That’s not generalization; that’s memorization.

library(rsample)

# Grouped data example
set.seed(42)
patient_data <- data.frame(
  patient_id = rep(1:50, each = 5),
  measurement = rnorm(250),
  outcome = sample(c(0, 1), 250, replace = TRUE)
)

# Split by group (patient_id)
grouped_split <- group_initial_split(patient_data, 
                                     group = patient_id, 
                                     prop = 0.8)

train_grouped <- training(grouped_split)
test_grouped <- testing(grouped_split)

# Verify no patient appears in both sets
train_patients <- unique(train_grouped$patient_id)
test_patients <- unique(test_grouped$patient_id)
cat("Patient overlap:", length(intersect(train_patients, test_patients)), "\n")

Validation Sets and Cross-Validation Preview

For hyperparameter tuning, you need three sets: training (build models), validation (tune parameters), and test (final evaluation). Using your test set for tuning is another form of data leakage—you’re optimizing for the test set, which defeats its purpose.

library(rsample)

set.seed(321)

# First split: 80% for development, 20% for final test
full_split <- initial_split(iris, prop = 0.8, strata = Species)
development_data <- training(full_split)
test_data <- testing(full_split)

# Second split: divide development into 75% train, 25% validation
dev_split <- initial_split(development_data, prop = 0.75, strata = Species)
train_data <- training(dev_split)
validation_data <- testing(dev_split)

cat("Training:", nrow(train_data), "observations\n")
cat("Validation:", nrow(validation_data), "observations\n")
cat("Test:", nrow(test_data), "observations\n")

This gives you approximately 60% training, 20% validation, and 20% test. The exact proportions matter less than the principle: tune on validation, evaluate on test, and never let your test set influence any modeling decisions.
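If you are on a recent version of rsample (1.2.0 or later, to my knowledge), initial_validation_split() produces all three sets in a single call; the two-step approach above works everywhere:

```r
library(rsample)

set.seed(321)

# 60% train / 20% validation / 20% test in one stratified call
three_way <- initial_validation_split(iris, prop = c(0.6, 0.2), strata = Species)

train_data      <- training(three_way)
validation_data <- validation(three_way)
test_data       <- testing(three_way)
```

The prop vector gives the training and validation fractions; the remainder becomes the test set.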

For more robust evaluation, k-fold cross-validation splits training data into k subsets, training k models where each subset serves as validation once. The rsample package’s vfold_cv() function handles this elegantly, but that’s a topic for dedicated cross-validation coverage.

Best Practices and Common Pitfalls

Always set a seed. Reproducibility isn’t optional in scientific computing. Document your seed value in comments or project documentation.

Verify your splits. Print summary statistics and class distributions for both sets. They should look similar. A test set with dramatically different characteristics than training data will give misleading performance estimates.

Match your split strategy to your data structure. Random for independent observations, stratified for imbalanced classes, chronological for time series, grouped for hierarchical data. Using the wrong strategy is worse than not splitting at all because it gives you false confidence.

Watch for data leakage. Feature engineering, scaling, and imputation must use only training data statistics. If you normalize using the mean of your entire dataset before splitting, information from the test set influences your training process.
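In practice, that means computing preprocessing statistics on the training set and reusing them on the test set. A base-R sketch of leakage-safe standardization (using mtcars and its mpg column purely for illustration):

```r
set.seed(123)
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Compute scaling parameters from the training set ONLY
mu    <- mean(train$mpg)
sigma <- sd(train$mpg)

# Apply the same training-derived parameters to both sets
train$mpg_scaled <- (train$mpg - mu) / sigma
test$mpg_scaled  <- (test$mpg - mu) / sigma
```

The test set's scaled values will not have mean 0 and standard deviation 1, and that is correct: the test set must be transformed as new data would be in production, using only what the model "knew" at training time.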

Adjust split ratios for dataset size. With 100,000 observations, a 95/5 split gives you 5,000 test samples—plenty for evaluation. With 500 observations, you need at least 20% for testing to get meaningful metrics.

Don’t split multiple times and cherry-pick results. If you try five different seeds and report the best test performance, you’ve effectively used the test set for model selection. Pick one seed and stick with it.

The train-test split is the foundation of honest model evaluation. Get it right, and you build reliable models. Get it wrong, and you build expensive mistakes that fail in production. Choose your splitting strategy deliberately, implement it carefully, and your future self will thank you when your models actually work on real data.
