NumPy - Random Choice from Array (np.random.choice)

Key Insights

  • np.random.choice() provides efficient random sampling from arrays with options for replacement, probability weighting, and multi-dimensional output
  • Understanding the replace parameter is critical: replace=False ensures unique selections while replace=True allows duplicate picks
  • Custom probability distributions via the p parameter enable weighted sampling for simulations, bootstrapping, and Monte Carlo methods

Basic Random Selection

np.random.choice() selects random elements from a 1-D array. The simplest form picks a single element:

import numpy as np

arr = np.array([10, 20, 30, 40, 50])
random_element = np.random.choice(arr)
print(random_element)  # Output: 30 (varies each run)

For selecting multiple elements, specify the size parameter:

arr = np.array(['apple', 'banana', 'cherry', 'date', 'elderberry'])
random_fruits = np.random.choice(arr, size=3)
print(random_fruits)  # Output: ['cherry' 'apple' 'cherry']

The function also accepts integers directly. When passed an integer n, it treats it as np.arange(n):

# Equivalent to np.random.choice(np.arange(10), size=5)
random_indices = np.random.choice(10, size=5)
print(random_indices)  # Output: [7 2 9 2 4]

Sampling With and Without Replacement

The replace parameter controls whether elements can be selected multiple times. By default, replace=True allows duplicates:

arr = np.array([1, 2, 3, 4, 5])

# With replacement (default)
with_replacement = np.random.choice(arr, size=10, replace=True)
print(with_replacement)  # Output: [3 3 1 5 2 3 4 1 2 5]

# Without replacement - unique elements only
without_replacement = np.random.choice(arr, size=5, replace=False)
print(without_replacement)  # Output: [4 2 5 1 3]

Attempting to sample more elements than available without replacement raises an error:

arr = np.array([1, 2, 3])

try:
    invalid = np.random.choice(arr, size=5, replace=False)
except ValueError as e:
    print(f"Error: {e}")
    # Output: Error: Cannot take a larger sample than population when 'replace=False'

This behavior is essential for applications like random train-test splits or shuffling:

# Randomly shuffle array indices (np.random.permutation(len(data)) is equivalent)
data = np.array([100, 200, 300, 400, 500])
shuffled_indices = np.random.choice(len(data), size=len(data), replace=False)
shuffled_data = data[shuffled_indices]
print(shuffled_data)  # Output: [300 500 100 400 200]
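The train-test split mentioned above follows the same pattern: draw unique test indices, then take the complement as the training set. A minimal sketch (sizes and the test fraction are arbitrary choices for the example):

```python
import numpy as np

np.random.seed(0)  # reproducible split
n_samples = 10
test_frac = 0.3

# Draw unique test indices; the remaining indices form the training set
test_idx = np.random.choice(n_samples, size=int(n_samples * test_frac),
                            replace=False)
train_idx = np.setdiff1d(np.arange(n_samples), test_idx)

print(len(train_idx), len(test_idx))  # 7 3
```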

Weighted Random Selection

The p parameter assigns custom probabilities to each element. It must be the same length as the array, and its entries must sum to 1.0:

outcomes = np.array(['win', 'lose', 'draw'])
probabilities = np.array([0.6, 0.3, 0.1])  # 60% win, 30% lose, 10% draw

results = np.random.choice(outcomes, size=1000, p=probabilities)
unique, counts = np.unique(results, return_counts=True)

for outcome, count in zip(unique, counts):
    print(f"{outcome}: {count} ({count/1000*100:.1f}%)")

# Output:
# draw: 94 (9.4%)
# lose: 312 (31.2%)
# win: 594 (59.4%)
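If the probabilities do not sum to 1.0, NumPy rejects the call with a ValueError rather than silently renormalizing, which is worth demonstrating (the weights below are deliberately wrong):

```python
import numpy as np

outcomes = np.array(['win', 'lose', 'draw'])
bad_probs = np.array([0.5, 0.3, 0.1])  # sums to 0.9, not 1.0

try:
    np.random.choice(outcomes, p=bad_probs)
except ValueError as e:
    print(f"Error: {e}")  # probabilities do not sum to 1
```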

Weighted sampling is powerful for simulations:

# Simulate biased dice rolls
dice_faces = np.array([1, 2, 3, 4, 5, 6])
# Loaded dice: face 6 appears 30% of the time
loaded_probabilities = np.array([0.14, 0.14, 0.14, 0.14, 0.14, 0.30])

rolls = np.random.choice(dice_faces, size=10000, p=loaded_probabilities)
print(f"Average roll: {rolls.mean():.2f}")  # Output: ~3.88 (higher than fair 3.5)
print(f"Sixes rolled: {(rolls == 6).sum()}")  # Output: ~3000

Multi-Dimensional Output

Generate multi-dimensional arrays by passing a tuple to size:

# Create 3x4 matrix of random choices
arr = np.array([10, 20, 30])
matrix = np.random.choice(arr, size=(3, 4))
print(matrix)
# Output:
# [[20 10 30 20]
#  [30 20 10 10]
#  [20 30 30 10]]

This is useful for batch sampling in machine learning:

# Generate random mini-batches
data_size = 1000
batch_size = 32
num_batches = 5

# Sample batch indices
batch_indices = np.random.choice(
    data_size, 
    size=(num_batches, batch_size), 
    replace=False
)

print(f"Batch indices shape: {batch_indices.shape}")  # (5, 32)
print(f"First batch indices: {batch_indices[0][:5]}")  # [742 891 234 567 123]
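Note that with a tuple size and replace=False, uniqueness holds across the entire output, not per row: the total draw (num_batches * batch_size) must not exceed the population, and no index repeats anywhere in the result. A small check with illustrative sizes:

```python
import numpy as np

np.random.seed(0)

# 4 batches of 10 unique indices drawn from a population of 100
idx = np.random.choice(100, size=(4, 10), replace=False)
flat = idx.ravel()
print(len(np.unique(flat)) == flat.size)  # True: no index repeats anywhere
```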

Practical Applications

Bootstrap Resampling

Bootstrap methods rely on sampling with replacement to estimate statistical properties:

# Original dataset
data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# Generate 1000 bootstrap samples
n_bootstraps = 1000
bootstrap_means = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means[i] = bootstrap_sample.mean()

# Calculate 95% confidence interval
confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"Mean: {data.mean():.2f}")
print(f"95% CI: [{confidence_interval[0]:.2f}, {confidence_interval[1]:.2f}]")
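The Python loop above can also be replaced by a single vectorized call, drawing all bootstrap samples at once with a 2-D size; this is a sketch of that variant using the same data:

```python
import numpy as np

np.random.seed(0)
data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# One row per bootstrap resample: 1000 resamples of len(data) draws each
samples = np.random.choice(data, size=(1000, len(data)), replace=True)
bootstrap_means = samples.mean(axis=1)

ci = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```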

Stratified Sampling

Maintain class proportions when sampling from imbalanced datasets:

# Dataset with class labels
labels = np.array([0]*70 + [1]*30)  # 70% class 0, 30% class 1
data_indices = np.arange(len(labels))

# Stratified sample maintaining 70-30 ratio
sample_size = 20
class_0_indices = data_indices[labels == 0]
class_1_indices = data_indices[labels == 1]

sampled_class_0 = np.random.choice(class_0_indices, size=14, replace=False)
sampled_class_1 = np.random.choice(class_1_indices, size=6, replace=False)

stratified_sample = np.concatenate([sampled_class_0, sampled_class_1])
print(f"Sampled indices: {stratified_sample}")
print(f"Class distribution maintained: {len(sampled_class_0)/20:.0%} / {len(sampled_class_1)/20:.0%}")
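The per-class steps above generalize to any number of classes. Here is a sketch of a hypothetical helper (stratified_indices is not a NumPy function) that loops over the unique labels and gives each class a quota proportional to its share:

```python
import numpy as np

def stratified_indices(labels, sample_size, seed=None):
    """Hypothetical helper: sample indices keeping class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picks = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Quota proportional to this class's share of the dataset
        n_cls = round(sample_size * len(cls_idx) / len(labels))
        picks.append(rng.choice(cls_idx, size=n_cls, replace=False))
    return np.concatenate(picks)

labels = np.array([0] * 70 + [1] * 30)
idx = stratified_indices(labels, sample_size=20, seed=0)
print(np.bincount(labels[idx]))  # [14  6]
```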

Monte Carlo Simulation

Model random processes with custom probability distributions:

# Simulate customer arrivals (Poisson-like discrete distribution)
arrival_counts = np.array([0, 1, 2, 3, 4, 5])
arrival_probs = np.array([0.05, 0.15, 0.30, 0.25, 0.15, 0.10])

# Simulate 30 days
daily_arrivals = np.random.choice(
    arrival_counts, 
    size=30, 
    p=arrival_probs
)

print(f"Total customers: {daily_arrivals.sum()}")
print(f"Average daily arrivals: {daily_arrivals.mean():.2f}")
print(f"Max arrivals in a day: {daily_arrivals.max()}")
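A useful sanity check for such simulations is that, with enough draws, the empirical mean converges to the distribution's expected value, here sum(counts * probs) = 2.6. A sketch with a larger sample:

```python
import numpy as np

np.random.seed(0)
arrival_counts = np.array([0, 1, 2, 3, 4, 5])
arrival_probs = np.array([0.05, 0.15, 0.30, 0.25, 0.15, 0.10])

# With many simulated days, the empirical mean approaches the expected value
expected = (arrival_counts * arrival_probs).sum()  # 2.6
simulated = np.random.choice(arrival_counts, size=100_000,
                             p=arrival_probs).mean()
print(f"expected {expected:.2f}, simulated {simulated:.2f}")
```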

Performance Considerations

np.random.choice() is vectorized and usually outperforms element-by-element Python sampling, though the margin depends on the array size and the options used:

import time
import random

arr = np.arange(1000000)

# NumPy approach
start = time.perf_counter()
samples = np.random.choice(arr, size=10000, replace=False)
numpy_time = time.perf_counter() - start

# Pure-Python approach (the list(arr) conversion dominates this timing)
start = time.perf_counter()
samples_py = random.sample(list(arr), 10000)
python_time = time.perf_counter() - start

print(f"NumPy: {numpy_time:.4f}s")
print(f"Python: {python_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
# The margin varies with sizes. Note that with replace=False the legacy
# choice() permutes the entire population before slicing, so for very
# large arrays the Generator interface below is faster still.

For reproducible results, set the random seed:

np.random.seed(42)
sample1 = np.random.choice(10, size=5)
print(sample1)  # [6 3 7 4 6]

np.random.seed(42)
sample2 = np.random.choice(10, size=5)
print(sample2)  # [6 3 7 4 6] - identical

Since NumPy 1.17, the recommended interface is the Generator returned by np.random.default_rng(), which uses a faster, statistically stronger bit generator (PCG64) than the legacy RandomState functions:

rng = np.random.default_rng(seed=42)
sample = rng.choice(10, size=5, replace=False)
print(sample)  # [0 7 6 4 8]

The Generator.choice() method provides the same functionality, plus features the legacy version lacks (such as an axis parameter and support for multi-dimensional input arrays), making it the preferred approach for new code.
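For example, Generator.choice() can sample whole rows of a 2-D array via its axis parameter, something the legacy function cannot do; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = np.arange(12).reshape(4, 3)

# Sample 2 distinct rows of the matrix (axis=0 selects along rows)
rows = rng.choice(matrix, size=2, replace=False, axis=0)
print(rows.shape)  # (2, 3)
```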
