NumPy - Random Shuffle and Permutation
NumPy provides two primary methods for randomizing array elements: `shuffle()` and `permutation()`. The fundamental difference lies in how they handle the original array.
Key Insights
- shuffle() modifies arrays in-place while permutation() returns a new shuffled copy, making permutation safer for preserving original data
- Understanding the difference between shuffling rows versus entire arrays is critical for maintaining data integrity in multi-dimensional datasets
- Setting random seeds ensures reproducible results across development, testing, and production environments
Understanding shuffle() vs permutation()
shuffle() operates in-place, directly modifying the original array without creating a copy. This approach is memory-efficient but destructive:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
np.random.shuffle(arr)
print(arr) # e.g. [3 1 5 2 4] (original array modified; unseeded output varies)
permutation() returns a shuffled copy, leaving the original array intact:
arr = np.array([1, 2, 3, 4, 5])
shuffled = np.random.permutation(arr)
print(arr) # Output: [1 2 3 4 5] (original unchanged)
print(shuffled) # e.g. [4 2 5 1 3] (new shuffled array)
For production code, prefer permutation() unless memory constraints require in-place operations; treating inputs as immutable prevents bugs caused by unexpected mutation.
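The risk is easy to demonstrate with a pair of illustrative helpers (risky_shuffle and safe_shuffle are hypothetical names, not NumPy functions):

```python
import numpy as np
from numpy.random import default_rng

def risky_shuffle(arr, seed=0):
    # Mutates the caller's array in place: surprising for the caller
    default_rng(seed).shuffle(arr)
    return arr

def safe_shuffle(arr, seed=0):
    # Returns a new array; the caller's data stays intact
    return default_rng(seed).permutation(arr)

data = np.array([1, 2, 3, 4, 5])
safe = safe_shuffle(data)
print(np.array_equal(data, [1, 2, 3, 4, 5]))  # True: original untouched
```

Had the caller passed data to risky_shuffle instead, the original ordering would be silently destroyed.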
Modern Random Generator API
NumPy's legacy np.random.shuffle() and np.random.permutation() functions still work, but the newer Generator API offers a better underlying bit generator (PCG64 by default) and explicit per-instance state rather than a shared global stream:
from numpy.random import default_rng
rng = default_rng(seed=42)
arr = np.array([10, 20, 30, 40, 50])
# Using the new API
rng.shuffle(arr) # In-place shuffle
print(arr)
# Create new generator for permutation
rng = default_rng(seed=42)
shuffled = rng.permutation(arr)
print(shuffled)
The Generator API supports parallel random number generation without state conflicts, essential for multi-threaded applications:
from concurrent.futures import ThreadPoolExecutor

def shuffle_data(seed):
    # Each worker builds its own Generator, so no state is shared
    rng = default_rng(seed)
    data = np.arange(1000)
    return rng.permutation(data)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(shuffle_data, range(4)))
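Because each worker derives its stream from its own seed, the results are deterministic regardless of thread scheduling; a quick self-contained check:

```python
import numpy as np
from numpy.random import default_rng
from concurrent.futures import ThreadPoolExecutor

def shuffle_data(seed):
    # Each worker owns a private Generator: no shared state, no locks
    rng = default_rng(seed)
    return rng.permutation(np.arange(1000))

with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(shuffle_data, range(4)))

# Serial recomputation with the same seeds matches the threaded results
serial = [shuffle_data(s) for s in range(4)]
print(all(np.array_equal(t, s) for t, s in zip(threaded, serial)))  # True
```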
Shuffling Multi-Dimensional Arrays
Multi-dimensional array shuffling requires careful consideration of which axis to randomize. By default, both methods shuffle along the first axis (rows):
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
rng = default_rng(seed=42)
shuffled = rng.permutation(matrix)
print(shuffled)
# Possible output: rows reordered, each row keeps its internal order
# [[7 8 9]
#  [1 2 3]
#  [4 5 6]]
To shuffle along a different axis, use the axis parameter of the Generator API's permutation():
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
rng = default_rng(seed=42)
# Shuffle columns instead of rows
shuffled = rng.permutation(matrix, axis=1)
print(shuffled)
# Possible output: the same column permutation applied to every row
# [[2 3 1]
#  [5 6 4]
#  [8 9 7]]
For complete element randomization regardless of structure, flatten and reshape:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
rng = default_rng(seed=42)
flat_shuffled = rng.permutation(matrix.flatten())
completely_shuffled = flat_shuffled.reshape(matrix.shape)
print(completely_shuffled)
# Possible output: all elements redistributed across the matrix
# [[6 1 4]
#  [3 5 2]]
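When the matrix is C-contiguous, an in-place variant avoids the copy entirely: ravel() returns a view in that case, so shuffling the view scrambles every element of the original array (a sketch; ravel() silently returns a copy for non-contiguous arrays, in which case the original would be untouched):

```python
import numpy as np
from numpy.random import default_rng

matrix = np.array([[1, 2, 3], [4, 5, 6]])
rng = default_rng(seed=42)

flat_view = matrix.ravel()                  # a view for C-contiguous arrays
assert np.shares_memory(flat_view, matrix)  # confirm we shuffle in place
rng.shuffle(flat_view)                      # matrix itself is now fully shuffled
print(matrix)
```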
Practical Use Case: Train-Test Split
A common application involves shuffling datasets before splitting into training and testing sets:
def train_test_split(X, y, test_size=0.2, seed=None):
    """
    Split data into training and testing sets with shuffling.

    Parameters:
        X: Feature array (n_samples, n_features)
        y: Target array (n_samples,)
        test_size: Fraction of data for testing
        seed: Random seed for reproducibility
    """
    rng = default_rng(seed)
    n_samples = X.shape[0]
    # Generate shuffled indices
    indices = rng.permutation(n_samples)
    # Calculate split point
    split_idx = int(n_samples * (1 - test_size))
    # Split indices
    train_idx = indices[:split_idx]
    test_idx = indices[split_idx:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
# Example usage
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, seed=42)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
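A quick sanity check confirms the index logic behind this kind of split: the train and test sets are disjoint and together cover every sample exactly once (this snippet mirrors the index arithmetic inside train_test_split above):

```python
import numpy as np
from numpy.random import default_rng

# Mirrors the index logic inside a shuffled train/test split
rng = default_rng(seed=42)
n_samples, test_size = 100, 0.2
indices = rng.permutation(n_samples)
split_idx = int(n_samples * (1 - test_size))
train_idx, test_idx = indices[:split_idx], indices[split_idx:]

print(len(np.intersect1d(train_idx, test_idx)))           # 0: sets are disjoint
print(len(np.union1d(train_idx, test_idx)) == n_samples)  # True: full coverage
```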
Maintaining Row Relationships in Paired Data
When working with paired datasets (features and labels), maintain row correspondence by shuffling indices rather than arrays directly:
# Dataset with features and labels
features = np.array([
[1.0, 2.0],
[3.0, 4.0],
[5.0, 6.0],
[7.0, 8.0]
])
labels = np.array(['A', 'B', 'C', 'D'])
rng = default_rng(seed=42)
# Generate shuffled indices
indices = rng.permutation(len(features))
# Apply same shuffle to both arrays
shuffled_features = features[indices]
shuffled_labels = labels[indices]
print("Shuffled features:\n", shuffled_features)
print("Shuffled labels:", shuffled_labels)
# Both maintain their row relationships
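One way to verify the pairing survives the shuffle is to map each shuffled label back to its original feature row:

```python
import numpy as np
from numpy.random import default_rng

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
labels = np.array(['A', 'B', 'C', 'D'])

rng = default_rng(seed=42)
indices = rng.permutation(len(features))
shuffled_features = features[indices]
shuffled_labels = labels[indices]

# Every (feature row, label) pair from the original survives intact
original_pairs = {lab: row.tolist() for row, lab in zip(features, labels)}
ok = all(original_pairs[lab] == row.tolist()
         for row, lab in zip(shuffled_features, shuffled_labels))
print(ok)  # True
```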
Weighted Random Sampling
While permutation() provides uniform shuffling, weighted sampling requires choice():
# Sample indices with probabilities
data = np.array([10, 20, 30, 40, 50])
probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
rng = default_rng(seed=42)
sampled_indices = rng.choice(
len(data),
size=3,
replace=False, # No duplicates
p=probabilities
)
sampled_data = data[sampled_indices]
print(sampled_data)
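To see the weights at work, sampling with replacement many times should make the empirical frequencies track p (a rough statistical check, not an exact equality):

```python
import numpy as np
from numpy.random import default_rng

data = np.array([10, 20, 30, 40, 50])
probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

rng = default_rng(seed=42)
draws = rng.choice(len(data), size=100_000, replace=True, p=probabilities)
freqs = np.bincount(draws, minlength=len(data)) / len(draws)
print(np.round(freqs, 2))  # close to [0.1 0.2 0.4 0.2 0.1]
```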
Reproducibility with Seeds
Setting seeds ensures identical shuffles across runs, critical for debugging and reproducible research:
# Same seed produces identical results
rng1 = default_rng(seed=42)
arr1 = rng1.permutation(np.arange(10))
rng2 = default_rng(seed=42)
arr2 = rng2.permutation(np.arange(10))
print(np.array_equal(arr1, arr2)) # Output: True
For production systems requiring different shuffles per execution while maintaining reproducibility:
from datetime import datetime

def get_daily_seed():
    """Generate a seed from the current date for daily reproducibility"""
    return int(datetime.now().strftime('%Y%m%d'))

rng = default_rng(seed=get_daily_seed())
shuffled = rng.permutation(np.arange(100))
Performance Considerations
shuffle() outperforms permutation() for large arrays because it avoids allocating and copying a second array:
import time
large_array = np.arange(10_000_000)
# Measure shuffle (in-place)
start = time.perf_counter()
rng = default_rng(seed=42)
rng.shuffle(large_array)
shuffle_time = time.perf_counter() - start
# Measure permutation (copy)
large_array = np.arange(10_000_000)
start = time.perf_counter()
rng = default_rng(seed=42)
result = rng.permutation(large_array)
perm_time = time.perf_counter() - start
print(f"Shuffle: {shuffle_time:.4f}s")
print(f"Permutation: {perm_time:.4f}s")
For memory-constrained environments processing large datasets, use shuffle() with explicit copies when preservation is needed:
original = np.arange(1_000_000)
working_copy = original.copy()
rng = default_rng(seed=42)
rng.shuffle(working_copy)
# Original preserved, memory overhead controlled
These techniques form the foundation for data preprocessing pipelines, ensuring proper randomization while maintaining data integrity and reproducibility across machine learning workflows.