How to Split Data into Train and Test Sets in Python
Key Insights
- Always split your data before applying transformations like scaling or encoding to prevent data leakage that inflates model performance metrics
- Use stratified splitting for classification tasks with imbalanced classes to maintain representative class distributions in both train and test sets
- Never randomly shuffle time series data—temporal ordering contains critical information that random splits destroy
Why Split Data? The Fundamentals
Every machine learning model needs honest evaluation. Training and testing on the same data is like a student grading their own exam—the results look great but mean nothing. You’ll get near-perfect accuracy that evaporates the moment real-world data arrives.
The train/test split solves this by dividing your dataset into two portions. The training set teaches your model patterns, while the test set acts as unseen data to measure real-world performance. This separation reveals overfitting—when your model memorizes training data instead of learning generalizable patterns.
The standard split ratios are 80/20 or 70/30 (train/test). Larger datasets can afford smaller test percentages since you’ll still have enough test samples for reliable evaluation. Smaller datasets might need 70/30 to ensure adequate test coverage.
# Conceptual flow of data splitting
# Original Dataset (1000 samples)
# ↓
# ├─ Training Set (800 samples) → Train Model
# └─ Test Set (200 samples) → Evaluate Model
The critical rule: your test set must remain completely isolated until final evaluation. Touch it during development, and you’re back to grading your own exam.
Basic Train/Test Split with Scikit-learn
Scikit-learn’s train_test_split() handles 95% of splitting scenarios. It randomizes your data and divides it according to your specified ratio.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import pandas as pd
# Load sample data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')
# Perform 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # Reproducibility
    shuffle=True      # Randomize before splitting (default)
)
# Verify split sizes
print(f"Training samples: {len(X_train)}") # 120
print(f"Test samples: {len(X_test)}") # 30
print(f"Total: {len(X)}") # 150
print(f"Split ratio: {len(X_test)/len(X):.1%}") # 20.0%
The random_state parameter is non-negotiable for reproducible research. Set it to any integer, and you’ll get identical splits across runs. Without it, every execution produces different train/test sets, making debugging and collaboration nightmarish.
The shuffle=True default randomizes data before splitting. This prevents order-based bias if your dataset is sorted by class or feature values.
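When the last rows should be held out instead—say, the data arrives in a meaningful order—shuffle=False produces a purely sequential split. A minimal sketch on a toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False keeps the original order: first 80% train, last 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(y_train.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(y_test.tolist())   # [8, 9]
```

Note that shuffle=False cannot be combined with stratify—scikit-learn raises an error, since stratification requires shuffling.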
Stratified Splitting for Imbalanced Data
Random splitting has a fatal flaw with imbalanced datasets. Imagine a fraud detection dataset with 95% legitimate transactions and 5% fraud. A random split can underrepresent the rare class in your test set—and when positive examples are very scarce, it can miss them almost entirely—leaving your test set unable to evaluate fraud detection reliably.
Stratified splitting maintains class proportions across both sets. If your original data is 95/5, both train and test sets will be 95/5.
import numpy as np
from collections import Counter
# Create imbalanced dataset
np.random.seed(42)
X_imbalanced = np.random.randn(1000, 5)
y_imbalanced = np.array([0]*950 + [1]*50) # 95% class 0, 5% class 1
# Regular split - risky
X_train, X_test, y_train, y_test = train_test_split(
    X_imbalanced, y_imbalanced,
    test_size=0.2,
    random_state=42
)
print("Regular split:")
print(f"Train distribution: {Counter(y_train)}")
print(f"Test distribution: {Counter(y_test)}")
# Stratified split - maintains proportions
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X_imbalanced, y_imbalanced,
    test_size=0.2,
    random_state=42,
    stratify=y_imbalanced  # Key parameter
)
print("\nStratified split:")
print(f"Train distribution: {Counter(y_train_strat)}")
print(f"Test distribution: {Counter(y_test_strat)}")
print(f"Test class 1 percentage: {sum(y_test_strat)/len(y_test_strat):.1%}")
The stratified split guarantees both sets represent your data’s true distribution. Use stratify=y for every classification task unless you have specific reasons not to.
Time Series Data Splitting
Time series data breaks the shuffling rule. Stock prices, sensor readings, and user behavior logs contain temporal dependencies—past influences future. Shuffling destroys this chronological information and creates unrealistic evaluation scenarios.
For time series, split sequentially: early data for training, recent data for testing. This simulates real deployment where you predict future events based on historical patterns.
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
# Create time series data (seeded for reproducibility)
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
data = pd.DataFrame({
    'date': dates,
    'value': np.cumsum(np.random.randn(365)) + 100,
    'feature': np.random.randn(365)
})
# Sequential split (80/20)
split_point = int(len(data) * 0.8)
train_data = data.iloc[:split_point]
test_data = data.iloc[split_point:]
print(f"Training period: {train_data['date'].min()} to {train_data['date'].max()}")
print(f"Test period: {test_data['date'].min()} to {test_data['date'].max()}")
# For cross-validation: TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(data)):
    print(f"Fold {fold + 1}:")
    print(f"  Train: {data.iloc[train_idx]['date'].min()} to {data.iloc[train_idx]['date'].max()}")
    print(f"  Test:  {data.iloc[test_idx]['date'].min()} to {data.iloc[test_idx]['date'].max()}")
TimeSeriesSplit provides expanding window cross-validation where each fold uses all previous data for training and the next chunk for testing. This respects temporal order while providing multiple evaluation points.
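The expanding window is easiest to see on a tiny index. A sketch with 10 samples and 3 splits, using scikit-learn's default fold sizing:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    folds.append((train_idx.tolist(), test_idx.tolist()))

# The training window grows while the test window slides forward:
# Fold 1: train [0..3], test [4, 5]
# Fold 2: train [0..5], test [6, 7]
# Fold 3: train [0..7], test [8, 9]
```

Notice that no fold ever trains on data that comes after its test window—exactly the constraint real forecasting imposes.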
Manual Splitting and Custom Approaches
Sometimes you need control beyond what train_test_split() offers. Manual splitting with pandas or numpy handles edge cases like grouped data or three-way splits.
import pandas as pd
import numpy as np
# Sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'user_id': np.repeat(range(100), 10),  # 100 users, 10 rows each
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'target': np.random.randint(0, 2, 1000)
})
# Manual three-way split (60% train, 20% validation, 20% test)
n = len(df)
train_end = int(0.6 * n)
val_end = int(0.8 * n)
# Shuffle first
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train = df_shuffled.iloc[:train_end]
validation = df_shuffled.iloc[train_end:val_end]
test = df_shuffled.iloc[val_end:]
print(f"Train: {len(train)}, Val: {len(validation)}, Test: {len(test)}")
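The same 60/20/20 split can also be built from two train_test_split calls, which is convenient when you want stratification in a three-way split. The second test_size is 0.25 because 25% of the remaining 80% equals 20% of the original (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 2)
y = np.random.randint(0, 2, 1000)

# First carve off the 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Then split the remaining 80% into 60% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```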
# Group-based split - keep all records from same user together
unique_users = df['user_id'].unique()
np.random.shuffle(unique_users)
train_users = unique_users[:60]
val_users = unique_users[60:80]
test_users = unique_users[80:]
train_grouped = df[df['user_id'].isin(train_users)]
val_grouped = df[df['user_id'].isin(val_users)]
test_grouped = df[df['user_id'].isin(test_users)]
print(f"\nGrouped split:")
print(f"Train users: {len(train_users)}, samples: {len(train_grouped)}")
print(f"Val users: {len(val_users)}, samples: {len(val_grouped)}")
print(f"Test users: {len(test_users)}, samples: {len(test_grouped)}")
Group-based splitting is crucial when samples aren’t independent—like multiple measurements per patient or transactions per user. Splitting by group prevents data leakage where the model learns user-specific patterns that appear in both train and test.
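Scikit-learn packages this pattern as GroupShuffleSplit, which shuffles and splits at the group level—its test_size refers to the proportion of groups, not rows. A sketch on the same shape of data (100 users with 10 rows each, hypothetical features):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 2))
y = rng.integers(0, 2, 1000)
groups = np.repeat(np.arange(100), 10)  # 100 users, 10 rows each

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# Every user lands entirely on one side of the split
train_users = set(groups[train_idx])
test_users = set(groups[test_idx])
print(len(train_users), len(test_users))   # 80 20
print(train_users.isdisjoint(test_users))  # True
```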
Common Pitfalls and Best Practices
The most dangerous mistake is data leakage through preprocessing. If you scale, encode, or impute missing values before splitting, information from the test set leaks into training through global statistics.
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# WRONG: Scaling before split causes data leakage
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X) # Uses ALL data statistics
X_train_wrong, X_test_wrong, y_train, y_test = train_test_split(
    X_scaled_wrong, y, test_size=0.2, random_state=42
)
# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler_correct.transform(X_test) # Apply train statistics
print("Wrong approach - test set statistics leaked into training")
print(f"Test mean: {X_test_wrong.mean():.6f}")
print("\nCorrect approach - test set remains unseen")
print(f"Test mean: {X_test_scaled.mean():.6f}")
The correct workflow: split first, fit preprocessing on training data only, then apply those fitted transformers to test data. The test set should never influence any fitting operation.
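A scikit-learn Pipeline makes that ordering automatic: calling fit fits every step on the training data only, and scoring reuses the fitted transformers on the test data. A sketch pairing a StandardScaler with a logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),            # fit on X_train only, inside pipe.fit
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)        # scaler applies training statistics
```

The same pipeline also prevents leakage inside cross-validation, because the scaler is refit from scratch on each fold's training portion.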
Other critical practices:
- Set random_state consistently across your project for reproducibility
- Split before any data exploration to avoid unconscious bias from test set patterns
- Use stratification for classification unless you have massive balanced datasets
- Never tune hyperparameters on test data—use cross-validation on training data or create a separate validation set
- Document your split strategy so collaborators understand your evaluation methodology
Small datasets (under 1000 samples) present special challenges. Consider k-fold cross-validation instead of a single train/test split to maximize data usage while maintaining evaluation integrity.
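With cross_val_score, every sample is used for both training and evaluation across the folds, and the spread of scores gives a sense of how noisy the estimate is. A sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified 5-fold: each fold preserves the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())  # mean accuracy and its spread across folds
```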
Splitting data correctly is foundational to trustworthy machine learning. Get it wrong, and every metric, every conclusion, every deployment decision rests on contaminated ground. Get it right, and you build models that actually work when they meet reality.