NumPy - Random Seed for Reproducibility
Key Insights
- Random seeds ensure reproducible results across NumPy’s random number generation, critical for debugging, testing, and scientific reproducibility
- NumPy offers multiple approaches: the legacy np.random.seed(), the modern Generator API with explicit state management, and SeedSequence for parallel workflows
- Proper seed management prevents common pitfalls like seed reuse in parallel processes and ensures deterministic behavior in machine learning pipelines
Why Random Seeds Matter
Random number generation in NumPy produces pseudorandom numbers—sequences that appear random but are deterministic given an initial state. Without controlling this state, you’ll get different results every time you run your code, making debugging impossible and violating scientific reproducibility standards.
import numpy as np
# Without seed - different results each run
print(np.random.rand(3)) # [0.417, 0.720, 0.000] (example)
print(np.random.rand(3)) # [0.302, 0.147, 0.092] (example)
# With seed - reproducible results
np.random.seed(42)
print(np.random.rand(3)) # [0.374, 0.950, 0.731]
np.random.seed(42)
print(np.random.rand(3)) # [0.374, 0.950, 0.731] - identical
The seed initializes the random number generator’s internal state. The same seed always produces the same sequence, enabling you to reproduce bugs, validate tests, and ensure research reproducibility.
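To make "internal state" concrete, the legacy API lets you capture and restore that state directly, without reseeding. A small sketch using np.random.get_state and np.random.set_state:

```python
import numpy as np

np.random.seed(42)
state = np.random.get_state()   # snapshot the global generator's state
first = np.random.rand(3)

np.random.set_state(state)      # rewind to the snapshot instead of reseeding
again = np.random.rand(3)
print(np.allclose(first, again))  # True
```

Restoring the saved state replays the exact same draws, which is occasionally handy when you need to rerun one stretch of random work without disturbing the rest of the sequence.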
Legacy API: np.random.seed()
The traditional approach uses the global random state through np.random.seed(). This method modifies NumPy’s global random number generator, affecting all subsequent random operations in your program.
import numpy as np
def simulate_experiment():
    np.random.seed(123)
    data = np.random.normal(loc=50, scale=10, size=100)
    noise = np.random.uniform(-5, 5, size=100)
    return data + noise
# Always produces the same result
result1 = simulate_experiment()
result2 = simulate_experiment()
print(np.allclose(result1, result2)) # True
While convenient, the global state creates problems in larger applications. Functions that modify the global seed affect unrelated code, making behavior unpredictable in multi-threaded environments or when combining third-party libraries.
import numpy as np
np.random.seed(42)
print(np.random.rand()) # 0.374
# Some other code changes the seed
np.random.seed(99)
print(np.random.rand()) # 0.889
# Your code now produces different results
print(np.random.rand()) # 0.094 - unexpected!
Modern API: Generator Objects
NumPy 1.17 introduced the Generator API, which uses explicit random number generator instances instead of global state. This approach provides better isolation and thread safety.
import numpy as np
# Create independent generators
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
# Both produce identical sequences
print(rng1.random(3)) # [0.773, 0.438, 0.858]
print(rng2.random(3)) # [0.773, 0.438, 0.858]
# But they're independent objects
print(rng1.random(3)) # [0.697, 0.094, 0.975]
print(rng2.random(3)) # [0.697, 0.094, 0.975]
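Generators also expose their state explicitly, so you can snapshot and rewind a single instance without touching anything global. A sketch using the bit_generator.state property:

```python
import numpy as np

rng = np.random.default_rng(42)
saved = rng.bit_generator.state   # the underlying PCG64 state, as a dict
first = rng.random(3)

rng.bit_generator.state = saved   # rewind this generator only
again = rng.random(3)
print(np.allclose(first, again))  # True
```

This is the "explicit state management" advantage in miniature: the state lives on the object you hold, not in a hidden module-level variable.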
The Generator approach allows you to pass RNG instances to functions, making dependencies explicit and testable:
import numpy as np
def generate_training_data(n_samples, rng):
    """Generate synthetic data with explicit RNG dependency."""
    X = rng.normal(0, 1, size=(n_samples, 5))
    noise = rng.normal(0, 0.1, size=n_samples)
    y = X[:, 0] * 2 + X[:, 1] * -1 + noise
    return X, y
# Reproducible data generation
rng = np.random.default_rng(42)
X_train, y_train = generate_training_data(1000, rng)
X_test, y_test = generate_training_data(200, rng)
# Verify reproducibility
rng_verify = np.random.default_rng(42)
X_verify, y_verify = generate_training_data(1000, rng_verify)
print(np.allclose(X_train, X_verify)) # True
SeedSequence for Parallel Workflows
When running parallel processes or threads, each worker needs an independent but reproducible random state. SeedSequence generates multiple independent seeds from a single master seed.
import numpy as np
from multiprocessing import Pool
def worker_task(seed):
    """Simulate work with independent random state."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0, 1, size=10000)
    return samples.mean(), samples.std()
# Create independent seeds for workers
master_seed = 12345
ss = np.random.SeedSequence(master_seed)
child_seeds = ss.spawn(4) # 4 independent seeds
# Parallel execution with reproducible results
with Pool(4) as pool:
    results = pool.map(worker_task, child_seeds)
print(results)
# [(0.0023, 1.0012), (-0.0045, 0.9987), (0.0067, 1.0034), (-0.0012, 0.9965)]
This pattern ensures each worker has independent random streams while maintaining overall reproducibility. Running the same code with the same master seed always produces identical results across all workers.
import numpy as np
def hierarchical_seeding():
    """Demonstrate nested seed generation."""
    master = np.random.SeedSequence(999)
    # Create seeds for different experiment components
    data_seed, model_seed, eval_seed = master.spawn(3)
    # Each component gets an independent RNG
    data_rng = np.random.default_rng(data_seed)
    model_rng = np.random.default_rng(model_seed)
    eval_rng = np.random.default_rng(eval_seed)
    return data_rng, model_rng, eval_rng
# Reproducible multi-component system
data_rng, model_rng, eval_rng = hierarchical_seeding()
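Spawning is itself deterministic: rebuilding the same SeedSequence and spawning again yields matching children, which is what makes a hierarchy like the one above reproducible end to end. A quick check:

```python
import numpy as np

children_a = np.random.SeedSequence(999).spawn(3)
children_b = np.random.SeedSequence(999).spawn(3)

# Matching children drive their generators to identical streams
for a, b in zip(children_a, children_b):
    assert np.allclose(np.random.default_rng(a).random(4),
                       np.random.default_rng(b).random(4))
```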
Machine Learning Pipeline Example
Real-world ML pipelines require careful seed management across data splitting, augmentation, and model initialization:
import numpy as np
class ReproduciblePipeline:
    def __init__(self, master_seed=42):
        self.master_seed = master_seed
        ss = np.random.SeedSequence(master_seed)
        seeds = ss.spawn(3)
        self.data_rng = np.random.default_rng(seeds[0])
        self.augment_rng = np.random.default_rng(seeds[1])
        self.model_rng = np.random.default_rng(seeds[2])

    def train_test_split(self, X, y, test_size=0.2):
        """Split data reproducibly."""
        n = len(X)
        indices = np.arange(n)
        self.data_rng.shuffle(indices)
        split_idx = int(n * (1 - test_size))
        train_idx = indices[:split_idx]
        test_idx = indices[split_idx:]
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    def augment_data(self, X):
        """Apply random augmentation."""
        noise = self.augment_rng.normal(0, 0.1, size=X.shape)
        return X + noise

    def initialize_weights(self, shape):
        """Initialize model weights."""
        return self.model_rng.normal(0, 0.01, size=shape)
# Usage
pipeline = ReproduciblePipeline(master_seed=42)
# Generate synthetic dataset
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
# Reproducible pipeline execution
X_train, X_test, y_train, y_test = pipeline.train_test_split(X, y)
X_train_aug = pipeline.augment_data(X_train)
weights = pipeline.initialize_weights((10, 5))
# Verify reproducibility
pipeline2 = ReproduciblePipeline(master_seed=42)
X_train2, X_test2, y_train2, y_test2 = pipeline2.train_test_split(X, y)
print(np.allclose(X_train, X_train2)) # True
Common Pitfalls and Solutions
Pitfall 1: Setting seed inside loops
import numpy as np

# WRONG - resets to the same state each iteration
for i in range(5):
    np.random.seed(42)
    print(np.random.rand())  # Always prints 0.374

# CORRECT - set the seed once before the loop
np.random.seed(42)
for i in range(5):
    print(np.random.rand())  # Different value each iteration
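The same fix in the modern API: create the generator once, outside the loop, and draw from it inside (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(42)   # created once, outside the loop
values = [rng.random() for _ in range(5)]
print(len(set(values)))  # 5 - five distinct draws
```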
Pitfall 2: Forgetting to seed test fixtures
import numpy as np
def test_model_training():
    # WRONG - non-deterministic test
    X = np.random.rand(100, 5)
    # Test may pass or fail randomly

def test_model_training_fixed():
    # CORRECT - deterministic test
    rng = np.random.default_rng(42)
    X = rng.random((100, 5))
    # Test always behaves identically
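One way to keep that discipline across a whole test suite (the make_rng helper name here is our own, not a NumPy API) is a tiny factory that every test calls:

```python
import numpy as np

def make_rng(seed=42):
    """Each test builds its own seeded generator - deterministic and isolated."""
    return np.random.default_rng(seed)

def test_training_data_is_stable():
    # Two independent builds of the fixture produce identical data
    assert np.allclose(make_rng().random((100, 5)),
                       make_rng().random((100, 5)))

test_training_data_is_stable()
```

Because each test constructs its own generator, tests stay deterministic even when run in a different order or in parallel.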
Pitfall 3: Seed reuse in parallel contexts
import numpy as np
from multiprocessing import Pool

# WRONG - all workers use the same seed
def worker(seed):
    rng = np.random.default_rng(seed)
    return rng.random(100)

with Pool(4) as pool:
    results = pool.map(worker, [42, 42, 42, 42])  # Identical results!

# CORRECT - use SeedSequence for independent seeds
ss = np.random.SeedSequence(42)
child_seeds = ss.spawn(4)
with Pool(4) as pool:
    results = pool.map(worker, child_seeds)  # Independent results
Best Practices
Set seeds at the highest level of your application, preferably as command-line arguments or configuration parameters. Use Generator objects instead of the global state for new code. For parallel workflows, always use SeedSequence.spawn() to create independent child seeds.
Document your seeding strategy in docstrings and comments. When publishing research or sharing code, always specify the seed values used. For production systems, log the seed values used for each run to enable post-hoc debugging.
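For the logging point, one minimal sketch (the choice of secrets.randbits as the entropy source is ours) is to draw a seed when none is supplied and record it before any random work happens:

```python
import logging
import secrets
import numpy as np

logging.basicConfig(level=logging.INFO)

# Draw a fresh seed from the OS, then log it so this exact run
# can be replayed later with the same value.
seed = secrets.randbits(64)
logging.info("random seed for this run: %d", seed)
rng = np.random.default_rng(seed)
```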
import numpy as np
import argparse
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=42)
    args = parser.parse_args()
    rng = np.random.default_rng(args.seed)
    print(f"Using random seed: {args.seed}")
    # Your application logic here
    data = rng.normal(0, 1, size=1000)

if __name__ == '__main__':
    main()
Proper seed management transforms random operations from sources of frustration into controlled, reproducible components of your data pipeline.