NumPy - Random Integer (np.random.randint)

The `np.random.randint()` function generates random integers within a specified range. The basic signature takes a low bound (inclusive), high bound (exclusive), and optional size parameter.

Key Insights

  • np.random.randint() generates random integers from a discrete uniform distribution, with parameters for low (inclusive), high (exclusive), and size for array dimensions
  • The function supports both the legacy RandomState API and the modern Generator API introduced in NumPy 1.17, with the latter providing better statistical properties and performance
  • Understanding dtype selection, reproducibility through seeding, and vectorized operations enables efficient random integer generation for simulations, sampling, and testing scenarios

Basic Syntax and Parameters

The full signature is np.random.randint(low, high=None, size=None, dtype=int). Values are drawn from a discrete uniform distribution over [low, high): the low bound is inclusive, the high bound exclusive, and size controls the output array shape.

import numpy as np

# Single random integer between 0 and 10 (exclusive)
single = np.random.randint(10)
print(single)  # Output: 7 (example)

# Single random integer between 5 and 15 (exclusive)
single_range = np.random.randint(5, 15)
print(single_range)  # Output: 12 (example)

# Array of 5 random integers
array_1d = np.random.randint(0, 100, size=5)
print(array_1d)  # Output: [23 67 45 89 12] (example)

# 2D array of random integers
array_2d = np.random.randint(1, 50, size=(3, 4))
print(array_2d)
# Output:
# [[12 34 23 45]
#  [8  19 42 31]
#  [17 29 38 11]] (example)

The low parameter is inclusive while high is exclusive. When only one argument is provided, it’s interpreted as high with low defaulting to 0.
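A quick sanity check of the one-argument form confirms the bounds (a minimal sketch):

```python
import numpy as np

# One-argument form: the single value is high, low defaults to 0
vals = np.random.randint(10, size=1000)
print(vals.min(), vals.max())  # min >= 0, max <= 9

assert vals.min() >= 0
assert vals.max() <= 9
```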

Legacy vs Modern Random Generation

NumPy provides two approaches for random number generation. The legacy np.random.randint() uses a global RandomState instance, while the modern approach uses the Generator class.

# Legacy approach (global state)
legacy_random = np.random.randint(0, 100, size=5)
print(legacy_random)

# Modern approach with Generator
rng = np.random.default_rng(seed=42)
modern_random = rng.integers(0, 100, size=5)
print(modern_random)  # Reproducible, but a different sequence than the legacy seed(42) stream

# Multiple generators with independent state
rng1 = np.random.default_rng(seed=100)
rng2 = np.random.default_rng(seed=200)

print(rng1.integers(10, size=3))  # Output: [7 9 3] (example)
print(rng2.integers(10, size=3))  # Output: [2 8 2] (example)
print(rng1.integers(10, size=3))  # Output: [5 0 3] (example)

The Generator.integers() method is the modern equivalent of randint(). It offers better performance, improved statistical properties, and cleaner state management. For new code, prefer default_rng() over the legacy API.
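Migrating is mostly a matter of moving the same arguments onto a Generator instance. A side-by-side sketch (the two APIs use different underlying bit generators, so the values differ even with matching seeds):

```python
import numpy as np

rng = np.random.default_rng(0)

# Legacy call and its modern equivalent, with identical arguments
legacy = np.random.randint(5, 15, size=4)
modern = rng.integers(5, 15, size=4)

# Both draw from [5, 15), but from independent streams
print(legacy)
print(modern)
```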

Reproducibility with Seeds

Seeding ensures reproducible random sequences, critical for debugging, testing, and scientific reproducibility.

# Legacy seeding
np.random.seed(42)
result1 = np.random.randint(0, 100, size=5)
print(result1)  # Output: [51 92 14 71 60]

np.random.seed(42)
result2 = np.random.randint(0, 100, size=5)
print(result2)  # Output: [51 92 14 71 60] (identical)

# Modern seeding with Generator
rng = np.random.default_rng(seed=42)
result3 = rng.integers(0, 100, size=5)
print(result3)  # Reproducible across runs, but not the same values as the legacy stream

# Creating independent streams
seed_sequence = np.random.SeedSequence(12345)
child_seeds = seed_sequence.spawn(3)
rngs = [np.random.default_rng(s) for s in child_seeds]

for i, rng in enumerate(rngs):
    print(f"Stream {i}: {rng.integers(0, 10, size=3)}")
# Output (example):
# Stream 0: [7 4 8]
# Stream 1: [2 9 1]
# Stream 2: [5 3 6]

The SeedSequence approach enables parallel random number generation with independent, non-overlapping streams.
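As a sketch of how spawned streams might feed parallel workers (here the workers run sequentially for brevity; worker_sum is a hypothetical task, not part of NumPy):

```python
import numpy as np

def worker_sum(rng, n):
    # Hypothetical worker task: sum n random integers from its own stream
    return int(rng.integers(0, 10, size=n).sum())

seed_sequence = np.random.SeedSequence(12345)
child_rngs = [np.random.default_rng(s) for s in seed_sequence.spawn(4)]

# Each worker owns one generator: no shared state, no overlapping draws
results = [worker_sum(rng, 1000) for rng in child_rngs]
print(results)
```

In a real parallel setting, each child generator (or its SeedSequence) would be passed to a separate process or thread.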

Data Type Control

The dtype parameter controls the integer type of generated values, affecting memory usage and range limitations.

# Default dtype (np.int64 on most systems)
default = np.random.randint(0, 100, size=5)
print(default.dtype)  # Output: int64

# 8-bit signed integers (range: -128 to 127)
int8_array = np.random.randint(0, 128, size=5, dtype=np.int8)
print(int8_array.dtype)  # Output: int8
print(int8_array.nbytes)  # Output: 5 (one byte per element)

# 16-bit integers
int16_array = np.random.randint(0, 1000, size=5, dtype=np.int16)
print(int16_array.dtype)  # Output: int16

# 64-bit integers for large ranges
int64_array = np.random.randint(0, 10**15, size=5, dtype=np.int64)
print(int64_array)

# Memory comparison
large_array_64 = np.random.randint(0, 100, size=1_000_000, dtype=np.int64)
large_array_8 = np.random.randint(0, 100, size=1_000_000, dtype=np.int8)
print(f"int64: {large_array_64.nbytes / 1024 / 1024:.2f} MB")  # 7.63 MB
print(f"int8: {large_array_8.nbytes / 1024 / 1024:.2f} MB")    # 0.95 MB

Choosing appropriate dtypes reduces memory footprint, especially for large arrays where values fit within smaller ranges.
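One consequence worth knowing: randint validates the requested range against the dtype and raises a ValueError when the bounds do not fit. A quick demonstration:

```python
import numpy as np

error_message = None
try:
    # 300 exceeds the int8 maximum of 127, so the call is rejected
    np.random.randint(0, 300, size=5, dtype=np.int8)
except ValueError as e:
    error_message = str(e)

print(error_message)
```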

Practical Applications

Random Sampling and Indexing

# Random sample indices from a dataset (sampling with replacement)
data = np.arange(1000, 2000)
sample_indices = np.random.randint(0, len(data), size=10)
sample = data[sample_indices]
print(sample)

# Stratified sampling
rng = np.random.default_rng(seed=42)
categories = np.array(['A'] * 100 + ['B'] * 150 + ['C'] * 80)
samples_per_category = {'A': 10, 'B': 15, 'C': 8}

stratified_sample = []
for cat, n_samples in samples_per_category.items():
    cat_indices = np.where(categories == cat)[0]
    sample_idx = rng.choice(cat_indices, size=n_samples, replace=False)
    stratified_sample.extend(sample_idx)

print(f"Sampled {len(stratified_sample)} items")

Monte Carlo Simulation

# Simulate dice rolls
rng = np.random.default_rng(seed=123)
n_simulations = 100_000
dice1 = rng.integers(1, 7, size=n_simulations)
dice2 = rng.integers(1, 7, size=n_simulations)
sums = dice1 + dice2

# Probability of rolling 7
prob_seven = np.sum(sums == 7) / n_simulations
print(f"P(sum=7): {prob_seven:.4f}")  # ~0.1667 (theoretical: 1/6)

# Simulate random walk
steps = rng.integers(0, 2, size=1000) * 2 - 1  # Convert to -1, 1
position = np.cumsum(steps)
print(f"Final position: {position[-1]}")
print(f"Max distance: {np.max(np.abs(position))}")

Test Data Generation

# Generate synthetic user IDs and ages
rng = np.random.default_rng(seed=999)
n_users = 1000

user_ids = rng.integers(10_000, 100_000, size=n_users)  # 5-digit IDs (high is exclusive)
ages = rng.integers(18, 80, size=n_users)
scores = rng.integers(0, 101, size=n_users)

# Create realistic distribution (weighted towards certain values)
weighted_ages = np.concatenate([
    rng.integers(25, 35, size=400),  # 40% young adults
    rng.integers(35, 50, size=400),  # 40% middle-aged
    rng.integers(50, 70, size=200)   # 20% seniors
])
rng.shuffle(weighted_ages)

print(f"Mean age: {weighted_ages.mean():.1f}")
print(f"Age distribution: 25-35: {np.sum((weighted_ages >= 25) & (weighted_ages < 35))}")

Performance Considerations

Vectorized operations with randint() significantly outperform loop-based generation.

import time

# Inefficient: loop-based generation
start = time.time()
result_loop = np.array([np.random.randint(0, 100) for _ in range(1_000_000)])
loop_time = time.time() - start

# Efficient: vectorized generation
start = time.time()
result_vec = np.random.randint(0, 100, size=1_000_000)
vec_time = time.time() - start

print(f"Loop time: {loop_time:.4f}s")
print(f"Vectorized time: {vec_time:.4f}s")
print(f"Speedup: {loop_time / vec_time:.1f}x")

# Modern Generator performance
rng = np.random.default_rng()
start = time.time()
result_gen = rng.integers(0, 100, size=1_000_000)
gen_time = time.time() - start
print(f"Generator time: {gen_time:.4f}s")

The vectorized approach typically runs 50-100x faster than loops. The modern Generator API provides additional performance improvements, especially for parallel workloads.

Edge Cases and Validation

# Endpoint handling
rng = np.random.default_rng(seed=42)

# endpoint=True makes high inclusive (Generator only)
inclusive = rng.integers(1, 10, size=100, endpoint=True)
print(f"Max value with endpoint=True: {inclusive.max()}")  # Can be 10

# Default behavior (high exclusive)
exclusive = rng.integers(1, 10, size=100, endpoint=False)
print(f"Max value with endpoint=False: {exclusive.max()}")  # Max is 9

# Validation: ensure generated values are in range
test_array = np.random.randint(5, 15, size=10000)
assert test_array.min() >= 5, "Values below minimum"
assert test_array.max() < 15, "Values at or above maximum"
print("Validation passed")

# Handling edge case: low == high (raises ValueError)
try:
    invalid = np.random.randint(10, 10, size=5)
except ValueError as e:
    print(f"Error: {e}")

Understanding the exclusive upper bound prevents off-by-one errors. The endpoint parameter in the modern API provides flexibility for inclusive ranges when needed.
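For instance, the two spellings of a six-sided die roll agree (a small sketch using the Generator API):

```python
import numpy as np

rng = np.random.default_rng(7)

# Equivalent ways to draw die faces 1..6 with the Generator API
roll_exclusive = rng.integers(1, 7, size=1000)                 # high exclusive
roll_inclusive = rng.integers(1, 6, size=1000, endpoint=True)  # high inclusive

print(np.unique(roll_exclusive))  # faces observed, all within 1..6
print(np.unique(roll_inclusive))
```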
