NumPy - Random Generator (np.random.default_rng)
Key Insights
- NumPy's `default_rng()` provides a modern, statistically superior random number generator that replaces legacy methods like `np.random.rand()` and `np.random.seed()`
- The Generator API offers better performance, independent random streams through bit generators, and reproducible results across different NumPy versions
- Understanding generator methods like `integers()`, `choice()`, and `shuffle()` enables efficient sampling, simulation, and data augmentation in production systems
Why default_rng() Replaced Legacy Random Functions
NumPy introduced default_rng() in version 1.17 as part of a complete overhaul of its random number generation infrastructure. The legacy RandomState and module-level functions (np.random.rand(), np.random.randint()) suffer from global state issues and use outdated algorithms.
```python
import numpy as np

# Legacy approach (avoid in new code)
np.random.seed(42)
legacy_samples = np.random.rand(5)

# Modern approach
rng = np.random.default_rng(42)
modern_samples = rng.random(5)

print(f"Legacy: {legacy_samples}")
print(f"Modern: {modern_samples}")
```
The default_rng() function returns a Generator instance backed by PCG64, a permuted congruential generator with excellent statistical properties and a period of 2^128. This bit generator passes rigorous statistical test suites such as TestU01 and PractRand, making it well suited to Monte Carlo simulation and other work requiring high-quality randomness. Note that PCG64 is not cryptographically secure; for security-sensitive randomness, use Python's secrets module or os.urandom instead.
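Because Generator is decoupled from its bit generator, PCG64 is only the default: you can construct a Generator around another NumPy bit generator when your workload calls for it. A minimal sketch, using the Philox counter-based bit generator that ships with NumPy:

```python
import numpy as np

# default_rng() wraps PCG64, but Generator accepts any BitGenerator.
# Philox is a counter-based alternative sometimes preferred for parallel work.
rng_pcg = np.random.Generator(np.random.PCG64(42))
rng_philox = np.random.Generator(np.random.Philox(42))

print(f"PCG64:  {rng_pcg.random(3)}")
print(f"Philox: {rng_philox.random(3)}")
```

The two generators expose the same method surface (`random()`, `integers()`, and so on), so swapping bit generators does not change calling code.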
Creating and Seeding Generators
Proper seeding ensures reproducibility in scientific computing and machine learning pipelines. The Generator API supports multiple seeding strategies.
```python
import numpy as np
from numpy.random import SeedSequence

# Seed with integer
rng1 = np.random.default_rng(12345)

# Seed with SeedSequence for advanced use cases
ss = SeedSequence(67890)
rng2 = np.random.default_rng(ss)

# Create independent streams
child_seeds = ss.spawn(3)
parallel_rngs = [np.random.default_rng(s) for s in child_seeds]

# Each generator produces a different sequence
for i, rng in enumerate(parallel_rngs):
    print(f"Stream {i}: {rng.integers(0, 100, 3)}")
```
The spawn() method creates statistically independent generators for parallel processing. This eliminates correlation issues that plague naive approaches like incrementing seeds.
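In a parallel workload, each spawned child seed travels to its own worker, which builds its generator locally. The sketch below runs the tasks sequentially for simplicity; in production each call would typically execute in a separate process, and the seed value 2024 is an arbitrary example:

```python
import numpy as np
from numpy.random import SeedSequence

def run_task(child_seed, n=100_000):
    # Each task owns a generator built from its spawned child seed,
    # so streams stay independent even when tasks run concurrently.
    rng = np.random.default_rng(child_seed)
    return rng.random(n).mean()

ss = SeedSequence(2024)
results = [run_task(child) for child in ss.spawn(4)]
print(results)  # four sample means, each near 0.5, from independent streams
```

Because spawning is deterministic, re-running with the same root seed reproduces every worker's stream exactly.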
Generating Random Integers and Floats
The Generator API provides optimized methods for common distributions with clearer semantics than legacy functions.
```python
import numpy as np

rng = np.random.default_rng(42)

# Random integers in [0, 100) - half-open interval
integers = rng.integers(0, 100, size=10)
print(f"Integers: {integers}")

# Closed interval [0, 100] using endpoint parameter
integers_closed = rng.integers(0, 100, size=10, endpoint=True)
print(f"Integers (closed): {integers_closed}")

# Random floats in [0.0, 1.0)
floats = rng.random(10)
print(f"Floats [0, 1): {floats}")

# Random floats in custom range [10.0, 20.0)
custom_floats = rng.uniform(10.0, 20.0, size=10)
print(f"Floats [10, 20): {custom_floats}")

# 2D array of integers
matrix = rng.integers(0, 256, size=(3, 4))
print(f"Matrix:\n{matrix}")
```
Note the endpoint parameter in integers() - this explicit control prevents off-by-one errors common with legacy randint().
Sampling and Choice Operations
Random sampling from arrays is fundamental for bootstrapping, cross-validation, and data augmentation.
```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array(['A', 'B', 'C', 'D', 'E'])

# Sample with replacement
sample_replace = rng.choice(data, size=10, replace=True)
print(f"With replacement: {sample_replace}")

# Sample without replacement
sample_no_replace = rng.choice(data, size=3, replace=False)
print(f"Without replacement: {sample_no_replace}")

# Weighted sampling
weights = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
weighted_sample = rng.choice(data, size=100, p=weights)
unique, counts = np.unique(weighted_sample, return_counts=True)
print(f"Weighted distribution: {dict(zip(unique, counts))}")

# Sample indices instead of values
indices = rng.choice(len(data), size=3, replace=False)
print(f"Random indices: {indices}")
The p parameter enables probability-weighted sampling, essential for importance sampling and stratified data selection in machine learning.
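Sampling with replacement is the core of the bootstrap: resample the dataset many times and study how the statistic of interest varies. A minimal sketch (the seed and dataset are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for illustration
data = rng.normal(loc=50, scale=5, size=200)

# Bootstrap: resample with replacement, recompute the statistic each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The percentile interval of the bootstrap distribution gives a distribution-free confidence interval without any closed-form variance derivation.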
Shuffling and Permutations
In-place shuffling and permutation generation are optimized operations in the Generator API.
```python
import numpy as np

rng = np.random.default_rng(42)

# In-place shuffle
arr = np.arange(10)
rng.shuffle(arr)
print(f"Shuffled: {arr}")

# Shuffle along first axis (for 2D arrays)
matrix = np.arange(20).reshape(5, 4)
rng.shuffle(matrix)  # Shuffles rows
print(f"Row-shuffled matrix:\n{matrix}")

# Generate permutation (returns shuffled indices)
perm = rng.permutation(10)
print(f"Permutation: {perm}")

# Apply permutation to array
original = np.array(['a', 'b', 'c', 'd', 'e'])
permuted = original[rng.permutation(len(original))]
print(f"Permuted array: {permuted}")

# Permute multidimensional array along axis
data = np.arange(12).reshape(3, 4)
permuted_data = rng.permutation(data)  # Permutes along first axis
print(f"Permuted data:\n{permuted_data}")
```
The shuffle() method modifies arrays in-place for memory efficiency, while permutation() returns a new array with shuffled indices or values.
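When you need each row (or column) shuffled independently rather than rows moved as units, the Generator also offers permuted(), which takes an explicit axis. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
matrix = np.arange(12).reshape(3, 4)

# shuffle() and permutation() move whole rows; permuted() shuffles the
# elements of each row independently along the given axis.
row_wise = rng.permuted(matrix, axis=1)
print(row_wise)
```

Each row of the result contains the same values as the corresponding input row, just in a new order.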
Statistical Distributions
The Generator provides methods for dozens of probability distributions with efficient implementations.
```python
import numpy as np

rng = np.random.default_rng(42)

# Normal distribution (mean=0, std=1)
normal = rng.normal(loc=0, scale=1, size=1000)
print(f"Normal mean: {normal.mean():.3f}, std: {normal.std():.3f}")

# Exponential distribution
exponential = rng.exponential(scale=2.0, size=1000)
print(f"Exponential mean: {exponential.mean():.3f}")

# Binomial distribution (n trials, p probability)
binomial = rng.binomial(n=10, p=0.3, size=1000)
print(f"Binomial mean: {binomial.mean():.3f}")

# Poisson distribution
poisson = rng.poisson(lam=5.0, size=1000)
print(f"Poisson mean: {poisson.mean():.3f}")

# Multivariate normal
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
multivariate = rng.multivariate_normal(mean, cov, size=100)
print(f"Multivariate shape: {multivariate.shape}")
```
These distributions support vectorized operations and accept array-like parameters for broadcasting, enabling efficient generation of large datasets.
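The broadcasting behavior means you can draw from many parameterizations in one vectorized call instead of looping. A small sketch drawing each column from a different normal distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Array-valued parameters broadcast: column i draws from N(means[i], stds[i]).
means = np.array([0.0, 10.0, 100.0])
stds = np.array([1.0, 2.0, 5.0])
samples = rng.normal(loc=means, scale=stds, size=(10_000, 3))
print(samples.mean(axis=0))  # roughly [0, 10, 100]
```

One call produces a (10000, 3) array; the trailing dimension of `size` must be broadcast-compatible with the parameter arrays.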
Performance Comparison and Best Practices
The Generator API outperforms legacy methods through better algorithms and reduced overhead.
```python
import numpy as np
import time

# Performance comparison
n = 10_000_000

# Legacy method
start = time.perf_counter()
np.random.seed(42)
legacy = np.random.rand(n)
legacy_time = time.perf_counter() - start

# Modern method
start = time.perf_counter()
rng = np.random.default_rng(42)
modern = rng.random(n)
modern_time = time.perf_counter() - start

print(f"Legacy: {legacy_time:.4f}s")
print(f"Modern: {modern_time:.4f}s")
print(f"Speedup: {legacy_time/modern_time:.2f}x")
```
Best practices for production code:
- Create generator instances once: initialize default_rng() at module or class level rather than inside loops
- Use explicit seeding: always seed generators in reproducible workflows; omit seeds only when true randomness is required
- Prefer Generator methods: use rng.integers() over np.random.randint() for better performance and clarity
- Leverage independent streams: use spawn() for parallel processing instead of manual seed manipulation
- Avoid global state: never use module-level functions like np.random.rand() in library code
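The "avoid global state" practice has a convenient idiom for library code: accept an optional seed or Generator and normalize it with default_rng(), which passes an existing Generator through unchanged. A sketch using a hypothetical helper named jitter:

```python
import numpy as np

def jitter(data, scale=0.01, rng=None):
    # jitter is a hypothetical library helper. default_rng() returns an
    # existing Generator unaltered and builds a fresh one from None or an int,
    # so callers may pass nothing, a seed, or their own generator.
    rng = np.random.default_rng(rng)
    return data + rng.normal(0.0, scale, size=np.shape(data))

x = np.zeros(5)
print(jitter(x, rng=42))                         # reproducible via int seed
print(jitter(x, rng=np.random.default_rng(42)))  # same stream, caller-owned
```

This keeps the function free of module-level state while letting callers thread one generator through an entire pipeline.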
Integration with Machine Learning Workflows
Random number generation integrates deeply with data preprocessing and model training.
```python
import numpy as np

class DataAugmenter:
    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)

    def add_noise(self, data, noise_level=0.1):
        noise = self.rng.normal(0, noise_level, data.shape)
        return data + noise

    def random_crop(self, image, crop_size):
        h, w = image.shape[:2]
        top = self.rng.integers(0, h - crop_size[0] + 1)
        left = self.rng.integers(0, w - crop_size[1] + 1)
        return image[top:top+crop_size[0], left:left+crop_size[1]]

    def train_test_split(self, data, test_ratio=0.2):
        n = len(data)
        indices = self.rng.permutation(n)
        split_idx = int(n * (1 - test_ratio))
        return data[indices[:split_idx]], data[indices[split_idx:]]

# Usage
augmenter = DataAugmenter(seed=42)
X = np.random.default_rng(0).standard_normal((100, 10))
X_train, X_test = augmenter.train_test_split(X)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
```
This pattern encapsulates random state within objects, enabling reproducible pipelines while maintaining clean interfaces.
The Generator API represents NumPy's commitment to modern random number generation. Its superior statistical properties, performance characteristics, and ergonomic design make it the clear choice for scientific computing, simulation, and machine learning workloads in Python.