How to Set Random Seed in NumPy
Key Insights
- NumPy’s default_rng() function is the modern, preferred way to set random seeds—it provides better statistical properties, thread safety, and isolated state compared to the legacy np.random.seed() approach.
- Global random state is a hidden dependency that causes subtle bugs in larger codebases; passing Generator instances explicitly makes your code more testable and predictable.
- Always document your random seeds in experiments and version control them alongside your code—reproducibility requires knowing exactly which seed produced your results.
Introduction
Random number generation sits at the heart of modern data science and machine learning. From shuffling datasets and initializing neural network weights to running Monte Carlo simulations, we rely on randomness constantly. But here’s the catch: we need that randomness to be reproducible.
A random seed is an integer that initializes a pseudorandom number generator (PRNG) to a known state. Given the same seed, a PRNG will produce the identical sequence of “random” numbers every time. This reproducibility is non-negotiable when you need to debug model behavior, share experiments with colleagues, or publish research that others can verify.
NumPy provides two distinct APIs for seeding random number generation. Understanding when and how to use each will save you from subtle bugs and make your code more robust.
The Legacy Approach: np.random.seed()
The traditional way to set a random seed in NumPy uses the global np.random.seed() function. This approach has been available since NumPy’s early days and remains widely used in tutorials and older codebases.
import numpy as np
# Set the global random seed
np.random.seed(42)
# Generate random numbers
print(np.random.rand(5))
# Output: [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
# Reset the seed to get the same sequence
np.random.seed(42)
print(np.random.rand(5))
# Output: [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
The function modifies a global RandomState instance that all np.random.* functions share. Any subsequent call to np.random.rand(), np.random.randint(), np.random.shuffle(), or similar functions will draw from this global state.
np.random.seed(123)
# All these functions use the same global state
random_floats = np.random.rand(3)
random_integers = np.random.randint(0, 100, size=3)
random_normal = np.random.randn(3)
print(f"Floats: {random_floats}")
print(f"Integers: {random_integers}")
print(f"Normal: {random_normal}")
This approach works, but it carries significant baggage that becomes problematic as your codebase grows.
The Modern Approach: np.random.Generator and default_rng()
NumPy 1.17 introduced a redesigned random number generation system built around the Generator class and the default_rng() factory function. This is now the recommended approach for all new code.
import numpy as np
# Create a Generator with a specific seed
rng = np.random.default_rng(seed=42)
# Generate random numbers using the Generator instance
print(rng.random(5))
# Output: [0.77395605 0.43887844 0.85859792 0.69736803 0.09417735]
# Create another Generator with the same seed
rng2 = np.random.default_rng(seed=42)
print(rng2.random(5))
# Output: [0.77395605 0.43887844 0.85859792 0.69736803 0.09417735]
Notice that the output differs from np.random.seed(42) followed by np.random.rand(5). The new Generator uses a different underlying algorithm (PCG64 by default) with better statistical properties than the legacy Mersenne Twister.
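If you want the bit generator to be explicit rather than implied, you can construct the Generator yourself. A small sketch: since default_rng() wraps PCG64 by default, both generators below should produce the same stream for the same seed.

```python
import numpy as np

# default_rng() wraps the PCG64 bit generator, so constructing it
# explicitly yields the identical stream for the same seed.
rng_default = np.random.default_rng(42)
rng_explicit = np.random.Generator(np.random.PCG64(42))

print(rng_default.random(3))
print(rng_explicit.random(3))  # identical values
```

Being explicit like this also lets you swap in another bit generator (e.g. np.random.Philox) without touching the rest of your code.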
The Generator API mirrors the legacy functions but with slightly different names:
rng = np.random.default_rng(42)
# Method equivalents
random_floats = rng.random(3) # Like np.random.rand()
random_integers = rng.integers(0, 100, size=3) # Like np.random.randint()
random_normal = rng.standard_normal(3) # Like np.random.randn()
random_choice = rng.choice([1, 2, 3, 4, 5], size=2) # Like np.random.choice()
Legacy vs. Modern: Key Differences
The fundamental difference is state management. The legacy API uses global state; the modern API uses explicit, isolated state. This distinction has profound implications for code quality.
import numpy as np
# Legacy approach: global state causes hidden dependencies
def legacy_function_a():
    return np.random.rand(3)

def legacy_function_b():
    return np.random.rand(3)
np.random.seed(42)
print("Legacy A:", legacy_function_a())
print("Legacy B:", legacy_function_b())
np.random.seed(42)
# If we call B first, A gets different values!
print("Legacy B:", legacy_function_b())
print("Legacy A:", legacy_function_a())
The order in which you call functions affects their outputs because they share state. This creates action-at-a-distance bugs that are notoriously difficult to track down.
# Modern approach: isolated state, predictable behavior
def modern_function_a(rng):
    return rng.random(3)

def modern_function_b(rng):
    return rng.random(3)
# Each function gets its own Generator
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print("Modern A:", modern_function_a(rng_a))
print("Modern B:", modern_function_b(rng_b))
# Order doesn't matter—each has independent state
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print("Modern B:", modern_function_b(rng_b))
print("Modern A:", modern_function_a(rng_a))
# A and B produce the same values regardless of call order
Thread safety is another critical advantage. The legacy global state isn’t thread-safe—concurrent access from multiple threads can corrupt the state or produce unpredictable results. Generator instances, being independent objects, can be safely used in parallel code as long as each thread has its own instance.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
def generate_samples(seed):
    rng = np.random.default_rng(seed)
    return rng.random(1000).mean()
# Safe: each thread creates its own Generator
with ThreadPoolExecutor(max_workers=4) as executor:
    seeds = [42, 43, 44, 45]
    results = list(executor.map(generate_samples, seeds))
print(results)
Common Use Cases
Machine Learning Train/Test Splits
Reproducible data splitting is essential for fair model comparisons:
import numpy as np
def train_test_split(X, y, test_size=0.2, seed=None):
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    n_test = int(n_samples * test_size)
    indices = rng.permutation(n_samples)
    test_indices = indices[:n_test]
    train_indices = indices[n_test:]
    return X[train_indices], X[test_indices], y[train_indices], y[test_indices]
# Usage
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=42)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
Unit Testing with Randomness
Tests involving random operations should be deterministic:
import numpy as np
class DataAugmenter:
    def __init__(self, noise_level=0.1, rng=None):
        self.noise_level = noise_level
        self.rng = rng if rng is not None else np.random.default_rng()

    def add_noise(self, data):
        noise = self.rng.normal(0, self.noise_level, data.shape)
        return data + noise
# In your test file
def test_augmenter_reproducibility():
    data = np.array([1.0, 2.0, 3.0])
    augmenter1 = DataAugmenter(rng=np.random.default_rng(42))
    augmenter2 = DataAugmenter(rng=np.random.default_rng(42))
    result1 = augmenter1.add_noise(data)
    result2 = augmenter2.add_noise(data)
    assert np.allclose(result1, result2), "Results should be identical with same seed"
Monte Carlo Simulations
For simulations requiring multiple independent runs:
import numpy as np
def monte_carlo_pi(n_points, rng):
    """Estimate pi using Monte Carlo sampling."""
    x = rng.random(n_points)
    y = rng.random(n_points)
    inside_circle = (x**2 + y**2) <= 1
    return 4 * inside_circle.sum() / n_points
# Run multiple independent simulations
base_seed = 42
n_simulations = 10
estimates = []
for i in range(n_simulations):
    rng = np.random.default_rng(base_seed + i)
    estimate = monte_carlo_pi(100000, rng)
    estimates.append(estimate)
print(f"Pi estimates: mean={np.mean(estimates):.4f}, std={np.std(estimates):.4f}")
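Sequential seeds like base_seed + i work fine in practice, but SeedSequence.spawn() is NumPy's documented mechanism for deriving independent streams from a single parent seed. A sketch of the same loop using spawned children:

```python
import numpy as np

def monte_carlo_pi(n_points, rng):
    """Estimate pi using Monte Carlo sampling."""
    x = rng.random(n_points)
    y = rng.random(n_points)
    return 4 * ((x**2 + y**2) <= 1).sum() / n_points

# One parent SeedSequence; spawn() derives independent child seeds,
# each of which feeds its own Generator.
parent = np.random.SeedSequence(42)
estimates = [monte_carlo_pi(100_000, np.random.default_rng(child))
             for child in parent.spawn(10)]
print(f"Pi estimate: {np.mean(estimates):.4f}")
```

Only the single parent seed (42 here) needs to be logged to reproduce all ten runs.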
Best Practices and Pitfalls
Pass Generators explicitly. Don’t rely on global state. Design functions to accept an rng parameter:
# Good: explicit dependency
def sample_data(n, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return rng.random(n)

# Bad: hidden global dependency
def sample_data_bad(n):
    np.random.seed(42)  # Modifies global state!
    return np.random.rand(n)
Document your seeds. When running experiments, log the seed value alongside results. Store seeds in configuration files that are version-controlled.
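One way to follow this advice, sketched with a hypothetical config layout: when you don't pick a seed up front, SeedSequence draws fresh entropy from the OS, and recording its entropy attribute lets you replay the run later.

```python
import json
import numpy as np

# No seed chosen up front: SeedSequence pulls fresh OS entropy.
seed_seq = np.random.SeedSequence()
rng = np.random.default_rng(seed_seq)

# Log the entropy alongside your results (hypothetical config layout).
config = {"seed": seed_seq.entropy, "n_samples": 1000}
print(json.dumps(config))

# Later: rebuild an identical generator from the logged value.
rng_replay = np.random.default_rng(np.random.SeedSequence(config["seed"]))
```

Commit the logged seed with your experiment code and you can reconstruct the exact random stream on any machine.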
Don’t reseed repeatedly. Calling seed() or creating new Generators in tight loops defeats the purpose of the PRNG:
# Bad: reseeding destroys randomness properties
for i in range(1000):
    np.random.seed(i)  # Don't do this
    value = np.random.rand()

# Good: seed once, sample many times
rng = np.random.default_rng(42)
for i in range(1000):
    value = rng.random()
Use spawn() for parallel streams. When you need multiple independent streams from one seed:
rng = np.random.default_rng(42)
child_rngs = rng.spawn(4) # Create 4 independent child generators
for i, child_rng in enumerate(child_rngs):
    print(f"Stream {i}: {child_rng.random(3)}")
Conclusion
For any new NumPy code, use default_rng(). It provides better algorithms, isolated state, and thread safety. The legacy np.random.seed() works but introduces global state that makes code harder to test and reason about.
Here’s a quick reference to get you started:
import numpy as np
# Create a seeded Generator
rng = np.random.default_rng(seed=42)
# Common operations (arr stands for any NumPy array)
rng.random(10)            # Uniform [0, 1)
rng.integers(0, 100, 10)  # Random integers
rng.standard_normal(10)   # Standard normal
rng.choice(arr, size=5)   # Random selection
rng.shuffle(arr)          # In-place shuffle
rng.permutation(arr)      # Shuffled copy
Treat your Generator like any other dependency: create it once, pass it where needed, and document the seed that produced your results.