NumPy - Random Seed for Reproducibility
Key Insights
- Random seeds ensure reproducible results across NumPy’s random number generation, critical for debugging, testing, and scientific reproducibility
- NumPy offers multiple approaches: the legacy np.random.seed(), the modern Generator API with explicit state management, and SeedSequence for parallel workflows
- Proper seed management prevents common pitfalls like seed reuse in parallel processes and ensures deterministic behavior in machine learning pipelines
Why Random Seeds Matter
Random number generation in NumPy produces pseudorandom numbers—sequences that appear random but are deterministic given an initial state. Without controlling this state, you’ll get different results every time you run your code, making debugging impossible and violating scientific reproducibility standards.
import numpy as np
# Without seed - different results each run
print(np.random.rand(3)) # [0.417, 0.720, 0.000] (example)
print(np.random.rand(3)) # [0.302, 0.147, 0.092] (example)
# With seed - reproducible results
np.random.seed(42)
print(np.random.rand(3)) # [0.374, 0.950, 0.731]
np.random.seed(42)
print(np.random.rand(3)) # [0.374, 0.950, 0.731] - identical
The seed initializes the random number generator’s internal state. The same seed always produces the same sequence, enabling you to reproduce bugs, validate tests, and ensure research reproducibility.
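To make "internal state" concrete, the legacy API lets you capture and restore that state directly, without reseeding. A small sketch using np.random.get_state and np.random.set_state:

```python
import numpy as np

np.random.seed(42)
state = np.random.get_state()   # snapshot the global generator's state
first = np.random.rand(3)

np.random.set_state(state)      # rewind to the snapshot instead of reseeding
again = np.random.rand(3)
print(np.allclose(first, again))  # True
```

Restoring the saved state replays the exact same draws, which is occasionally handy when you need to rerun one stretch of random work without disturbing the rest of the sequence.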
Legacy API: np.random.seed()
The traditional approach uses the global random state through np.random.seed(). This method modifies NumPy’s global random number generator, affecting all subsequent random operations in your program.
import numpy as np
def simulate_experiment():
    np.random.seed(123)
    data = np.random.normal(loc=50, scale=10, size=100)
    noise = np.random.uniform(-5, 5, size=100)
    return data + noise
# Always produces the same result
result1 = simulate_experiment()
result2 = simulate_experiment()
print(np.allclose(result1, result2)) # True
While convenient, the global state creates problems in larger applications. Functions that modify the global seed affect unrelated code, making behavior unpredictable in multi-threaded environments or when combining third-party libraries.
import numpy as np
np.random.seed(42)
print(np.random.rand()) # 0.374
# Some other code changes the seed
np.random.seed(99)
print(np.random.rand()) # 0.889
# Your code now produces different results
print(np.random.rand()) # 0.094 - unexpected!
Modern API: Generator Objects
NumPy 1.17 introduced the Generator API, which uses explicit random number generator instances instead of global state. This approach provides better isolation and thread safety.
import numpy as np
# Create independent generators
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
# Both produce identical sequences
print(rng1.random(3)) # [0.773, 0.438, 0.858]
print(rng2.random(3)) # [0.773, 0.438, 0.858]
# But they're independent objects
print(rng1.random(3)) # [0.697, 0.094, 0.975]
print(rng2.random(3)) # [0.697, 0.094, 0.975]
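Generators also expose their state explicitly, so you can snapshot and rewind a single instance without touching anything global. A sketch using the bit_generator.state property:

```python
import numpy as np

rng = np.random.default_rng(42)
saved = rng.bit_generator.state   # the underlying PCG64 state, as a dict
first = rng.random(3)

rng.bit_generator.state = saved   # rewind this generator only
again = rng.random(3)
print(np.allclose(first, again))  # True
```

This is the "explicit state management" advantage in miniature: the state lives on the object you hold, not in a hidden module-level variable.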
The Generator approach allows you to pass RNG instances to functions, making dependencies explicit and testable:
import numpy as np
def generate_training_data(n_samples, rng):
    """Generate synthetic data with explicit RNG dependency."""
    X = rng.normal(0, 1, size=(n_samples, 5))
    noise = rng.normal(0, 0.1, size=n_samples)
    y = X[:, 0] * 2 + X[:, 1] * -1 + noise
    return X, y
# Reproducible data generation
rng = np.random.default_rng(42)
X_train, y_train = generate_training_data(1000, rng)
X_test, y_test = generate_training_data(200, rng)
# Verify reproducibility
rng_verify = np.random.default_rng(42)
X_verify, y_verify = generate_training_data(1000, rng_verify)
print(np.allclose(X_train, X_verify)) # True
SeedSequence for Parallel Workflows
When running parallel processes or threads, each worker needs an independent but reproducible random state. SeedSequence generates multiple independent seeds from a single master seed.
import numpy as np
from multiprocessing import Pool
def worker_task(seed):
    """Simulate work with independent random state."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0, 1, size=10000)
    return samples.mean(), samples.std()
# Create independent seeds for workers
master_seed = 12345
ss = np.random.SeedSequence(master_seed)
child_seeds = ss.spawn(4) # 4 independent seeds
# Parallel execution with reproducible results
with Pool(4) as pool:
    results = pool.map(worker_task, child_seeds)
print(results)
# [(0.0023, 1.0012), (-0.0045, 0.9987), (0.0067, 1.0034), (-0.0012, 0.9965)]
This pattern ensures each worker has independent random streams while maintaining overall reproducibility. Running the same code with the same master seed always produces identical results across all workers.
import numpy as np
def hierarchical_seeding():
    """Demonstrate nested seed generation."""
    master = np.random.SeedSequence(999)
    # Create seeds for different experiment components
    data_seed, model_seed, eval_seed = master.spawn(3)
    # Each component gets an independent RNG
    data_rng = np.random.default_rng(data_seed)
    model_rng = np.random.default_rng(model_seed)
    eval_rng = np.random.default_rng(eval_seed)
    return data_rng, model_rng, eval_rng
# Reproducible multi-component system
data_rng, model_rng, eval_rng = hierarchical_seeding()
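Spawning is itself deterministic: rebuilding the same SeedSequence and spawning again yields matching children, which is what makes a hierarchy like the one above reproducible end to end. A quick check:

```python
import numpy as np

children_a = np.random.SeedSequence(999).spawn(3)
children_b = np.random.SeedSequence(999).spawn(3)

# Matching children drive their generators to identical streams
for a, b in zip(children_a, children_b):
    assert np.allclose(np.random.default_rng(a).random(4),
                       np.random.default_rng(b).random(4))
```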
Machine Learning Pipeline Example
Real-world ML pipelines require careful seed management across data splitting, augmentation, and model initialization:
import numpy as np
class ReproduciblePipeline:
    def __init__(self, master_seed=42):
        self.master_seed = master_seed
        ss = np.random.SeedSequence(master_seed)
        seeds = ss.spawn(3)
        self.data_rng = np.random.default_rng(seeds[0])
        self.augment_rng = np.random.default_rng(seeds[1])
        self.model_rng = np.random.default_rng(seeds[2])

    def train_test_split(self, X, y, test_size=0.2):
        """Split data reproducibly."""
        n = len(X)
        indices = np.arange(n)
        self.data_rng.shuffle(indices)
        split_idx = int(n * (1 - test_size))
        train_idx = indices[:split_idx]
        test_idx = indices[split_idx:]
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    def augment_data(self, X):
        """Apply random augmentation."""
        noise = self.augment_rng.normal(0, 0.1, size=X.shape)
        return X + noise

    def initialize_weights(self, shape):
        """Initialize model weights."""
        return self.model_rng.normal(0, 0.01, size=shape)
# Usage
pipeline = ReproduciblePipeline(master_seed=42)
# Generate synthetic dataset
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
# Reproducible pipeline execution
X_train, X_test, y_train, y_test = pipeline.train_test_split(X, y)
X_train_aug = pipeline.augment_data(X_train)
weights = pipeline.initialize_weights((10, 5))
# Verify reproducibility
pipeline2 = ReproduciblePipeline(master_seed=42)
X_train2, X_test2, y_train2, y_test2 = pipeline2.train_test_split(X, y)
print(np.allclose(X_train, X_train2)) # True
Common Pitfalls and Solutions
Pitfall 1: Setting seed inside loops
import numpy as np

# WRONG - resets to the same state each iteration
for i in range(5):
    np.random.seed(42)
    print(np.random.rand())  # Always prints 0.374

# CORRECT - set the seed once before the loop
np.random.seed(42)
for i in range(5):
    print(np.random.rand())  # Different value each iteration
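The same fix in the modern API: create the generator once, outside the loop, and draw from it inside (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(42)   # created once, outside the loop
values = [rng.random() for _ in range(5)]
print(len(set(values)))  # 5 - five distinct draws
```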
Pitfall 2: Forgetting to seed test fixtures
import numpy as np
def test_model_training():
    # WRONG - non-deterministic test
    X = np.random.rand(100, 5)
    # Test may pass or fail randomly

def test_model_training_fixed():
    # CORRECT - deterministic test
    rng = np.random.default_rng(42)
    X = rng.random((100, 5))
    # Test always behaves identically
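One way to keep that discipline across a whole test suite (the make_rng helper name here is our own, not a NumPy API) is a tiny factory that every test calls:

```python
import numpy as np

def make_rng(seed=42):
    """Each test builds its own seeded generator - deterministic and isolated."""
    return np.random.default_rng(seed)

def test_training_data_is_stable():
    # Two independent builds of the fixture produce identical data
    assert np.allclose(make_rng().random((100, 5)),
                       make_rng().random((100, 5)))

test_training_data_is_stable()
```

Because each test constructs its own generator, tests stay deterministic even when run in a different order or in parallel.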
Pitfall 3: Seed reuse in parallel contexts
import numpy as np
from multiprocessing import Pool

# WRONG - all workers use the same seed
def worker(seed):
    rng = np.random.default_rng(seed)
    return rng.random(100)

with Pool(4) as pool:
    results = pool.map(worker, [42, 42, 42, 42])  # Identical results!

# CORRECT - use SeedSequence for independent seeds
ss = np.random.SeedSequence(42)
child_seeds = ss.spawn(4)
with Pool(4) as pool:
    results = pool.map(worker, child_seeds)  # Independent results
Best Practices
Set seeds at the highest level of your application, preferably as command-line arguments or configuration parameters. Use Generator objects instead of the global state for new code. For parallel workflows, always use SeedSequence.spawn() to create independent child seeds.
Document your seeding strategy in docstrings and comments. When publishing research or sharing code, always specify the seed values used. For production systems, log the seed values used for each run to enable post-hoc debugging.
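For the logging point, one minimal sketch (the choice of secrets.randbits as the entropy source is ours) is to draw a seed when none is supplied and record it before any random work happens:

```python
import logging
import secrets
import numpy as np

logging.basicConfig(level=logging.INFO)

# Draw a fresh seed from the OS, then log it so this exact run
# can be replayed later with the same value.
seed = secrets.randbits(64)
logging.info("random seed for this run: %d", seed)
rng = np.random.default_rng(seed)
```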
import numpy as np
import argparse
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=42)
    args = parser.parse_args()
    rng = np.random.default_rng(args.seed)
    print(f"Using random seed: {args.seed}")
    # Your application logic here
    data = rng.normal(0, 1, size=1000)

if __name__ == '__main__':
    main()
Proper seed management transforms random operations from sources of frustration into controlled, reproducible components of your data pipeline.