NumPy - Random Shuffle and Permutation
NumPy provides two primary methods for randomizing array elements: `shuffle()` and `permutation()`. The fundamental difference lies in how they handle the original array.
Key Insights
- shuffle() modifies arrays in-place while permutation() returns a new shuffled copy, making permutation safer for preserving original data
- Understanding the difference between shuffling rows versus entire arrays is critical for maintaining data integrity in multi-dimensional datasets
- Setting random seeds ensures reproducible results across development, testing, and production environments
Understanding shuffle() vs permutation()
shuffle() operates in-place, directly modifying the original array without creating a copy. This approach is memory-efficient but destructive:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
np.random.shuffle(arr)
print(arr) # e.g. [3 1 5 2 4] (original array modified; unseeded output varies)
permutation() returns a shuffled copy, leaving the original array intact:
arr = np.array([1, 2, 3, 4, 5])
shuffled = np.random.permutation(arr)
print(arr) # Output: [1 2 3 4 5] (original unchanged)
print(shuffled) # e.g. [4 2 5 1 3] (new shuffled array)
For production code, prefer permutation() unless memory constraints require in-place operations; treating inputs as immutable prevents bugs caused by unexpected mutation.
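The risk is easy to demonstrate with a pair of illustrative helpers (risky_shuffle and safe_shuffle are hypothetical names, not NumPy functions):

```python
import numpy as np
from numpy.random import default_rng

def risky_shuffle(arr, seed=0):
    # Mutates the caller's array in place: surprising for the caller
    default_rng(seed).shuffle(arr)
    return arr

def safe_shuffle(arr, seed=0):
    # Returns a new array; the caller's data stays intact
    return default_rng(seed).permutation(arr)

data = np.array([1, 2, 3, 4, 5])
safe = safe_shuffle(data)
print(np.array_equal(data, [1, 2, 3, 4, 5]))  # True: original untouched
```

Had the caller passed data to risky_shuffle instead, the original ordering would be silently destroyed.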
Modern Random Generator API
NumPy's legacy np.random.shuffle() and np.random.permutation() functions still work, but the newer Generator API offers a better underlying bit generator (PCG64 by default) and explicit per-instance state rather than a shared global stream:
from numpy.random import default_rng
rng = default_rng(seed=42)
arr = np.array([10, 20, 30, 40, 50])
# Using the new API
rng.shuffle(arr) # In-place shuffle
print(arr)
# Create new generator for permutation
rng = default_rng(seed=42)
shuffled = rng.permutation(arr)
print(shuffled)
The Generator API supports parallel random number generation without state conflicts, essential for multi-threaded applications:
from concurrent.futures import ThreadPoolExecutor

def shuffle_data(seed):
    # Each worker builds its own Generator, so no state is shared
    rng = default_rng(seed)
    data = np.arange(1000)
    return rng.permutation(data)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(shuffle_data, range(4)))
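Because each worker derives its stream from its own seed, the results are deterministic regardless of thread scheduling; a quick self-contained check:

```python
import numpy as np
from numpy.random import default_rng
from concurrent.futures import ThreadPoolExecutor

def shuffle_data(seed):
    # Each worker owns a private Generator: no shared state, no locks
    rng = default_rng(seed)
    return rng.permutation(np.arange(1000))

with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(shuffle_data, range(4)))

# Serial recomputation with the same seeds matches the threaded results
serial = [shuffle_data(s) for s in range(4)]
print(all(np.array_equal(t, s) for t, s in zip(threaded, serial)))  # True
```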
Shuffling Multi-Dimensional Arrays
Multi-dimensional array shuffling requires careful consideration of which axis to randomize. By default, both methods shuffle along the first axis (rows):
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
rng = default_rng(seed=42)
shuffled = rng.permutation(matrix)
print(shuffled)
# Possible output: rows reordered, each row keeps its internal order
# [[7 8 9]
#  [1 2 3]
#  [4 5 6]]
To shuffle along a different axis, use the axis parameter of the Generator API's permutation():
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
rng = default_rng(seed=42)
# Shuffle columns instead of rows
shuffled = rng.permutation(matrix, axis=1)
print(shuffled)
# Possible output: the same column permutation applied to every row
# [[2 3 1]
#  [5 6 4]
#  [8 9 7]]
For complete element randomization regardless of structure, flatten and reshape:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
rng = default_rng(seed=42)
flat_shuffled = rng.permutation(matrix.flatten())
completely_shuffled = flat_shuffled.reshape(matrix.shape)
print(completely_shuffled)
# Possible output: all elements redistributed across the matrix
# [[6 1 4]
#  [3 5 2]]
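When the matrix is C-contiguous, an in-place variant avoids the copy entirely: ravel() returns a view in that case, so shuffling the view scrambles every element of the original array (a sketch; ravel() silently returns a copy for non-contiguous arrays, in which case the original would be untouched):

```python
import numpy as np
from numpy.random import default_rng

matrix = np.array([[1, 2, 3], [4, 5, 6]])
rng = default_rng(seed=42)

flat_view = matrix.ravel()                  # a view for C-contiguous arrays
assert np.shares_memory(flat_view, matrix)  # confirm we shuffle in place
rng.shuffle(flat_view)                      # matrix itself is now fully shuffled
print(matrix)
```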
Practical Use Case: Train-Test Split
A common application involves shuffling datasets before splitting into training and testing sets:
def train_test_split(X, y, test_size=0.2, seed=None):
    """
    Split data into training and testing sets with shuffling.

    Parameters:
        X: Feature array (n_samples, n_features)
        y: Target array (n_samples,)
        test_size: Fraction of data for testing
        seed: Random seed for reproducibility
    """
    rng = default_rng(seed)
    n_samples = X.shape[0]
    # Generate shuffled indices
    indices = rng.permutation(n_samples)
    # Calculate split point
    split_idx = int(n_samples * (1 - test_size))
    # Split indices
    train_idx = indices[:split_idx]
    test_idx = indices[split_idx:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
# Example usage
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, seed=42)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
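A quick sanity check confirms the index logic behind this kind of split: the train and test sets are disjoint and together cover every sample exactly once (this snippet mirrors the index arithmetic inside train_test_split above):

```python
import numpy as np
from numpy.random import default_rng

# Mirrors the index logic inside a shuffled train/test split
rng = default_rng(seed=42)
n_samples, test_size = 100, 0.2
indices = rng.permutation(n_samples)
split_idx = int(n_samples * (1 - test_size))
train_idx, test_idx = indices[:split_idx], indices[split_idx:]

print(len(np.intersect1d(train_idx, test_idx)))           # 0: sets are disjoint
print(len(np.union1d(train_idx, test_idx)) == n_samples)  # True: full coverage
```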
Maintaining Row Relationships in Paired Data
When working with paired datasets (features and labels), maintain row correspondence by shuffling indices rather than arrays directly:
# Dataset with features and labels
features = np.array([
[1.0, 2.0],
[3.0, 4.0],
[5.0, 6.0],
[7.0, 8.0]
])
labels = np.array(['A', 'B', 'C', 'D'])
rng = default_rng(seed=42)
# Generate shuffled indices
indices = rng.permutation(len(features))
# Apply same shuffle to both arrays
shuffled_features = features[indices]
shuffled_labels = labels[indices]
print("Shuffled features:\n", shuffled_features)
print("Shuffled labels:", shuffled_labels)
# Both maintain their row relationships
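One way to verify the pairing survives the shuffle is to map each shuffled label back to its original feature row:

```python
import numpy as np
from numpy.random import default_rng

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
labels = np.array(['A', 'B', 'C', 'D'])

rng = default_rng(seed=42)
indices = rng.permutation(len(features))
shuffled_features = features[indices]
shuffled_labels = labels[indices]

# Every (feature row, label) pair from the original survives intact
original_pairs = {lab: row.tolist() for row, lab in zip(features, labels)}
ok = all(original_pairs[lab] == row.tolist()
         for row, lab in zip(shuffled_features, shuffled_labels))
print(ok)  # True
```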
Weighted Random Sampling
While permutation() provides uniform shuffling, weighted sampling requires choice():
# Sample indices with probabilities
data = np.array([10, 20, 30, 40, 50])
probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
rng = default_rng(seed=42)
sampled_indices = rng.choice(
len(data),
size=3,
replace=False, # No duplicates
p=probabilities
)
sampled_data = data[sampled_indices]
print(sampled_data)
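To see the weights at work, sampling with replacement many times should make the empirical frequencies track p (a rough statistical check, not an exact equality):

```python
import numpy as np
from numpy.random import default_rng

data = np.array([10, 20, 30, 40, 50])
probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

rng = default_rng(seed=42)
draws = rng.choice(len(data), size=100_000, replace=True, p=probabilities)
freqs = np.bincount(draws, minlength=len(data)) / len(draws)
print(np.round(freqs, 2))  # close to [0.1 0.2 0.4 0.2 0.1]
```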
Reproducibility with Seeds
Setting seeds ensures identical shuffles across runs, critical for debugging and reproducible research:
# Same seed produces identical results
rng1 = default_rng(seed=42)
arr1 = rng1.permutation(np.arange(10))
rng2 = default_rng(seed=42)
arr2 = rng2.permutation(np.arange(10))
print(np.array_equal(arr1, arr2)) # Output: True
For production systems requiring different shuffles per execution while maintaining reproducibility:
from datetime import datetime

def get_daily_seed():
    """Generate a seed from the current date for daily reproducibility"""
    return int(datetime.now().strftime('%Y%m%d'))

rng = default_rng(seed=get_daily_seed())
shuffled = rng.permutation(np.arange(100))
Performance Considerations
shuffle() outperforms permutation() for large arrays because it avoids allocating and copying a second array:
import time
large_array = np.arange(10_000_000)
# Measure shuffle (in-place)
start = time.perf_counter()
rng = default_rng(seed=42)
rng.shuffle(large_array)
shuffle_time = time.perf_counter() - start
# Measure permutation (copy)
large_array = np.arange(10_000_000)
start = time.perf_counter()
rng = default_rng(seed=42)
result = rng.permutation(large_array)
perm_time = time.perf_counter() - start
print(f"Shuffle: {shuffle_time:.4f}s")
print(f"Permutation: {perm_time:.4f}s")
For memory-constrained environments processing large datasets, use shuffle() with explicit copies when preservation is needed:
original = np.arange(1_000_000)
working_copy = original.copy()
rng = default_rng(seed=42)
rng.shuffle(working_copy)
# Original preserved, memory overhead controlled
These techniques form the foundation for data preprocessing pipelines, ensuring proper randomization while maintaining data integrity and reproducibility across machine learning workflows.