NumPy - Random Generator (np.random.default_rng)
Key Insights
- NumPy's `default_rng()` provides a modern, statistically superior random number generator that replaces legacy methods like `np.random.rand()` and `np.random.seed()`
- The Generator API offers better performance, independent random streams through bit generators, and reproducible results across different NumPy versions
- Understanding generator methods like `integers()`, `choice()`, and `shuffle()` enables efficient sampling, simulation, and data augmentation in production systems
Why default_rng() Replaced Legacy Random Functions
NumPy introduced default_rng() in version 1.17 as part of a complete overhaul of its random number generation infrastructure. The legacy RandomState and module-level functions (np.random.rand(), np.random.randint()) suffer from global state issues and use outdated algorithms.
```python
import numpy as np

# Legacy approach (avoid in new code)
np.random.seed(42)
legacy_samples = np.random.rand(5)

# Modern approach
rng = np.random.default_rng(42)
modern_samples = rng.random(5)

print(f"Legacy: {legacy_samples}")
print(f"Modern: {modern_samples}")
```
The default_rng() function returns a Generator instance backed by PCG64, a permuted congruential generator with excellent statistical properties and a period of 2^128. This bit generator passes rigorous statistical test suites such as TestU01 and PractRand, making it well suited to Monte Carlo simulation and other work requiring high-quality randomness. Note that PCG64 is not cryptographically secure; for security-sensitive randomness, use Python's secrets module or os.urandom instead.
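Because Generator is decoupled from its bit generator, PCG64 is only the default: you can construct a Generator around another NumPy bit generator when your workload calls for it. A minimal sketch, using the Philox counter-based bit generator that ships with NumPy:

```python
import numpy as np

# default_rng() wraps PCG64, but Generator accepts any BitGenerator.
# Philox is a counter-based alternative sometimes preferred for parallel work.
rng_pcg = np.random.Generator(np.random.PCG64(42))
rng_philox = np.random.Generator(np.random.Philox(42))

print(f"PCG64:  {rng_pcg.random(3)}")
print(f"Philox: {rng_philox.random(3)}")
```

The two generators expose the same method surface (`random()`, `integers()`, and so on), so swapping bit generators does not change calling code.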
Creating and Seeding Generators
Proper seeding ensures reproducibility in scientific computing and machine learning pipelines. The Generator API supports multiple seeding strategies.
```python
import numpy as np
from numpy.random import SeedSequence

# Seed with integer
rng1 = np.random.default_rng(12345)

# Seed with SeedSequence for advanced use cases
ss = SeedSequence(67890)
rng2 = np.random.default_rng(ss)

# Create independent streams
child_seeds = ss.spawn(3)
parallel_rngs = [np.random.default_rng(s) for s in child_seeds]

# Each generator produces a different sequence
for i, rng in enumerate(parallel_rngs):
    print(f"Stream {i}: {rng.integers(0, 100, 3)}")
```
The spawn() method creates statistically independent generators for parallel processing. This eliminates correlation issues that plague naive approaches like incrementing seeds.
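In a parallel workload, each spawned child seed travels to its own worker, which builds its generator locally. The sketch below runs the tasks sequentially for simplicity; in production each call would typically execute in a separate process, and the seed value 2024 is an arbitrary example:

```python
import numpy as np
from numpy.random import SeedSequence

def run_task(child_seed, n=100_000):
    # Each task owns a generator built from its spawned child seed,
    # so streams stay independent even when tasks run concurrently.
    rng = np.random.default_rng(child_seed)
    return rng.random(n).mean()

ss = SeedSequence(2024)
results = [run_task(child) for child in ss.spawn(4)]
print(results)  # four sample means, each near 0.5, from independent streams
```

Because spawning is deterministic, re-running with the same root seed reproduces every worker's stream exactly.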
Generating Random Integers and Floats
The Generator API provides optimized methods for common distributions with clearer semantics than legacy functions.
```python
import numpy as np

rng = np.random.default_rng(42)

# Random integers in [0, 100) - half-open interval
integers = rng.integers(0, 100, size=10)
print(f"Integers: {integers}")

# Closed interval [0, 100] using endpoint parameter
integers_closed = rng.integers(0, 100, size=10, endpoint=True)
print(f"Integers (closed): {integers_closed}")

# Random floats in [0.0, 1.0)
floats = rng.random(10)
print(f"Floats [0, 1): {floats}")

# Random floats in custom range [10.0, 20.0)
custom_floats = rng.uniform(10.0, 20.0, size=10)
print(f"Floats [10, 20): {custom_floats}")

# 2D array of integers
matrix = rng.integers(0, 256, size=(3, 4))
print(f"Matrix:\n{matrix}")
```
Note the endpoint parameter in integers() - this explicit control prevents off-by-one errors common with legacy randint().
Sampling and Choice Operations
Random sampling from arrays is fundamental for bootstrapping, cross-validation, and data augmentation.
```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array(['A', 'B', 'C', 'D', 'E'])

# Sample with replacement
sample_replace = rng.choice(data, size=10, replace=True)
print(f"With replacement: {sample_replace}")

# Sample without replacement
sample_no_replace = rng.choice(data, size=3, replace=False)
print(f"Without replacement: {sample_no_replace}")

# Weighted sampling
weights = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
weighted_sample = rng.choice(data, size=100, p=weights)
unique, counts = np.unique(weighted_sample, return_counts=True)
print(f"Weighted distribution: {dict(zip(unique, counts))}")

# Sample indices instead of values
indices = rng.choice(len(data), size=3, replace=False)
print(f"Random indices: {indices}")
The p parameter enables probability-weighted sampling, essential for importance sampling and stratified data selection in machine learning.
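Sampling with replacement is the core of the bootstrap: resample the dataset many times and study how the statistic of interest varies. A minimal sketch (the seed and dataset are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for illustration
data = rng.normal(loc=50, scale=5, size=200)

# Bootstrap: resample with replacement, recompute the statistic each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The percentile interval of the bootstrap distribution gives a distribution-free confidence interval without any closed-form variance derivation.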
Shuffling and Permutations
In-place shuffling and permutation generation are optimized operations in the Generator API.
```python
import numpy as np

rng = np.random.default_rng(42)

# In-place shuffle
arr = np.arange(10)
rng.shuffle(arr)
print(f"Shuffled: {arr}")

# Shuffle along first axis (for 2D arrays)
matrix = np.arange(20).reshape(5, 4)
rng.shuffle(matrix)  # Shuffles rows
print(f"Row-shuffled matrix:\n{matrix}")

# Generate permutation (returns shuffled indices)
perm = rng.permutation(10)
print(f"Permutation: {perm}")

# Apply permutation to array
original = np.array(['a', 'b', 'c', 'd', 'e'])
permuted = original[rng.permutation(len(original))]
print(f"Permuted array: {permuted}")

# Permute multidimensional array along axis
data = np.arange(12).reshape(3, 4)
permuted_data = rng.permutation(data)  # Permutes along first axis
print(f"Permuted data:\n{permuted_data}")
```
The shuffle() method modifies arrays in-place for memory efficiency, while permutation() returns a new array with shuffled indices or values.
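When you need each row (or column) shuffled independently rather than rows moved as units, the Generator also offers permuted(), which takes an explicit axis. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
matrix = np.arange(12).reshape(3, 4)

# shuffle() and permutation() move whole rows; permuted() shuffles the
# elements of each row independently along the given axis.
row_wise = rng.permuted(matrix, axis=1)
print(row_wise)
```

Each row of the result contains the same values as the corresponding input row, just in a new order.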
Statistical Distributions
The Generator provides methods for dozens of probability distributions with efficient implementations.
```python
import numpy as np

rng = np.random.default_rng(42)

# Normal distribution (mean=0, std=1)
normal = rng.normal(loc=0, scale=1, size=1000)
print(f"Normal mean: {normal.mean():.3f}, std: {normal.std():.3f}")

# Exponential distribution
exponential = rng.exponential(scale=2.0, size=1000)
print(f"Exponential mean: {exponential.mean():.3f}")

# Binomial distribution (n trials, p probability)
binomial = rng.binomial(n=10, p=0.3, size=1000)
print(f"Binomial mean: {binomial.mean():.3f}")

# Poisson distribution
poisson = rng.poisson(lam=5.0, size=1000)
print(f"Poisson mean: {poisson.mean():.3f}")

# Multivariate normal
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
multivariate = rng.multivariate_normal(mean, cov, size=100)
print(f"Multivariate shape: {multivariate.shape}")
```
These distributions support vectorized operations and accept array-like parameters for broadcasting, enabling efficient generation of large datasets.
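The broadcasting behavior means you can draw from many parameterizations in one vectorized call instead of looping. A small sketch drawing each column from a different normal distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Array-valued parameters broadcast: column i draws from N(means[i], stds[i]).
means = np.array([0.0, 10.0, 100.0])
stds = np.array([1.0, 2.0, 5.0])
samples = rng.normal(loc=means, scale=stds, size=(10_000, 3))
print(samples.mean(axis=0))  # roughly [0, 10, 100]
```

One call produces a (10000, 3) array; the trailing dimension of `size` must be broadcast-compatible with the parameter arrays.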
Performance Comparison and Best Practices
The Generator API outperforms legacy methods through better algorithms and reduced overhead.
```python
import numpy as np
import time

# Performance comparison
n = 10_000_000

# Legacy method
start = time.perf_counter()
np.random.seed(42)
legacy = np.random.rand(n)
legacy_time = time.perf_counter() - start

# Modern method
start = time.perf_counter()
rng = np.random.default_rng(42)
modern = rng.random(n)
modern_time = time.perf_counter() - start

print(f"Legacy: {legacy_time:.4f}s")
print(f"Modern: {modern_time:.4f}s")
print(f"Speedup: {legacy_time/modern_time:.2f}x")
```
Best practices for production code:
- Create generator instances once: initialize default_rng() at module or class level rather than inside loops
- Use explicit seeding: always seed generators in reproducible workflows; omit seeds only when true randomness is required
- Prefer Generator methods: use rng.integers() over np.random.randint() for better performance and clarity
- Leverage independent streams: use spawn() for parallel processing instead of manual seed manipulation
- Avoid global state: never use module-level functions like np.random.rand() in library code
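The "avoid global state" practice has a convenient idiom for library code: accept an optional seed or Generator and normalize it with default_rng(), which passes an existing Generator through unchanged. A sketch using a hypothetical helper named jitter:

```python
import numpy as np

def jitter(data, scale=0.01, rng=None):
    # jitter is a hypothetical library helper. default_rng() returns an
    # existing Generator unaltered and builds a fresh one from None or an int,
    # so callers may pass nothing, a seed, or their own generator.
    rng = np.random.default_rng(rng)
    return data + rng.normal(0.0, scale, size=np.shape(data))

x = np.zeros(5)
print(jitter(x, rng=42))                         # reproducible via int seed
print(jitter(x, rng=np.random.default_rng(42)))  # same stream, caller-owned
```

This keeps the function free of module-level state while letting callers thread one generator through an entire pipeline.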
Integration with Machine Learning Workflows
Random number generation integrates deeply with data preprocessing and model training.
```python
import numpy as np

class DataAugmenter:
    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)

    def add_noise(self, data, noise_level=0.1):
        noise = self.rng.normal(0, noise_level, data.shape)
        return data + noise

    def random_crop(self, image, crop_size):
        h, w = image.shape[:2]
        top = self.rng.integers(0, h - crop_size[0] + 1)
        left = self.rng.integers(0, w - crop_size[1] + 1)
        return image[top:top+crop_size[0], left:left+crop_size[1]]

    def train_test_split(self, data, test_ratio=0.2):
        n = len(data)
        indices = self.rng.permutation(n)
        split_idx = int(n * (1 - test_ratio))
        return data[indices[:split_idx]], data[indices[split_idx:]]

# Usage
augmenter = DataAugmenter(seed=42)
X = np.random.default_rng(0).standard_normal((100, 10))
X_train, X_test = augmenter.train_test_split(X)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
```
This pattern encapsulates random state within objects, enabling reproducible pipelines while maintaining clean interfaces.
The Generator API represents NumPy's commitment to modern random number generation. Its superior statistical properties, performance characteristics, and ergonomic design make it the clear choice for scientific computing, simulation, and machine learning workloads in Python.