How to Sample Random Rows in Pandas

Key Insights

  • The sample() method is your primary tool for random row selection, offering control through n (count), frac (proportion), random_state (reproducibility), replace (duplicates), and weights (probability distribution).
  • Always use random_state in production code and shared notebooks—reproducibility isn’t optional when others need to verify your results or when debugging sampling-related issues.
  • Combine groupby() with sample() for stratified sampling, ensuring proportional representation across categories in your dataset.

Introduction

Random sampling is fundamental to practical data work. You need it for exploratory data analysis when you can’t eyeball a million rows. You need it for creating train/test splits in machine learning pipelines. You need it for bootstrapping statistical estimates. You need it for generating reproducible subsets for debugging.

Pandas provides the sample() method on both DataFrames and Series, giving you a clean API for all these use cases. This article covers everything you need to know to use it effectively, from basic syntax to weighted sampling strategies.

Basic Random Sampling with sample()

The sample() method offers two primary ways to specify how many rows you want: an absolute count or a fraction of the total.

import pandas as pd

# Create a sample dataset
df = pd.DataFrame({
    'id': range(1, 101),
    'value': range(100, 200),
    'category': ['A', 'B', 'C', 'D'] * 25
})

# Sample exactly 5 rows
sample_by_count = df.sample(n=5)
print(sample_by_count)
    id  value category
42  43    142        C
87  88    187        D
15  16    115        D
63  64    163        D
29  30    129        B

Use n when you need a specific number of rows regardless of dataset size. This is common for quick data inspection or when your downstream process expects a fixed batch size.

# Sample 10% of rows
sample_by_fraction = df.sample(frac=0.1)
print(f"Original rows: {len(df)}, Sampled rows: {len(sample_by_fraction)}")
Original rows: 100, Sampled rows: 10

Use frac when you want proportional sampling. This is the standard approach for train/test splits where you want, say, 80% of your data for training. The frac parameter accepts values between 0 and 1 (unless you’re sampling with replacement, covered later).

One constraint: you can’t specify both n and frac simultaneously. Pandas will raise a ValueError if you try.
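A quick sketch of that failure mode, using a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# Requesting both a count and a fraction is ambiguous, so pandas rejects it
try:
    df.sample(n=3, frac=0.5)
except ValueError as e:
    print(f"ValueError: {e}")
```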

Controlling Randomness with random_state

Every call to sample() produces different results by default. This is fine for interactive exploration but problematic for reproducible workflows.

# Without random_state: different results each run
print(df.sample(n=3)['id'].tolist())  # [47, 12, 89]
print(df.sample(n=3)['id'].tolist())  # [23, 67, 4]

# With random_state: identical results each run
print(df.sample(n=3, random_state=42)['id'].tolist())  # [52, 15, 73]
print(df.sample(n=3, random_state=42)['id'].tolist())  # [52, 15, 73]

The random_state parameter accepts an integer seed or a NumPy random generator (np.random.Generator, or the legacy RandomState). Use integers for simplicity.
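If you prefer NumPy's newer Generator API, recent pandas versions accept a seeded Generator for random_state as well; it behaves like an integer seed, reproducing the same sample on every call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 101)})

# A freshly seeded Generator per call gives identical, reproducible samples
first = df.sample(n=3, random_state=np.random.default_rng(42))['id'].tolist()
second = df.sample(n=3, random_state=np.random.default_rng(42))['id'].tolist()
print(first == second)  # True
```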

Set random_state in these situations:

  • Shared notebooks and scripts: Colleagues should get the same sample when running your code
  • Unit tests: Assertions against sampled data need deterministic inputs
  • Debugging: Reproduce the exact conditions that caused an issue
  • Model training: Ensure identical train/test splits across experiments

I recommend defining a constant at the top of your script or notebook:

RANDOM_SEED = 42

train_df = df.sample(frac=0.8, random_state=RANDOM_SEED)
test_df = df.drop(train_df.index)

This makes the seed easy to find and modify, and ensures consistency across all sampling operations in your code.

Sampling With and Without Replacement

By default, sample() draws without replacement—each row can appear at most once in the result. The replace parameter changes this behavior.

# Without replacement (default): all unique rows
sample_unique = df.sample(n=10, replace=False, random_state=42)
print(f"Unique indices: {sample_unique.index.nunique()}")  # 10

# With replacement: rows can repeat
sample_with_dupes = df.sample(n=10, replace=True, random_state=42)
print(f"Unique indices: {sample_with_dupes.index.nunique()}")  # Could be fewer than 10

Without replacement is appropriate for most use cases: creating holdout sets, random previews, and subsampling for performance. You’re selecting a subset of your data.

With replacement unlocks two capabilities:

  1. Sampling more rows than exist: You can request n=200 from a 100-row DataFrame
  2. Bootstrapping: Statistical technique for estimating confidence intervals
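The first capability is easy to demonstrate on the 100-row frame from earlier:

```python
import pandas as pd

df = pd.DataFrame({'id': range(1, 101)})

# With replace=True you can draw more rows than the frame contains
oversampled = df.sample(n=200, replace=True, random_state=42)
print(len(oversampled))             # 200
print(oversampled['id'].nunique())  # at most 100, so duplicates are guaranteed
```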

Here’s a practical bootstrapping example:

import numpy as np

def bootstrap_mean_ci(series, n_iterations=1000, confidence=0.95):
    """Calculate confidence interval for mean using bootstrap."""
    bootstrap_means = []
    
    for i in range(n_iterations):
        # Resample with replacement, same size as original
        resampled = series.sample(n=len(series), replace=True, random_state=i)
        bootstrap_means.append(resampled.mean())
    
    # Calculate percentile-based confidence interval
    lower = np.percentile(bootstrap_means, (1 - confidence) / 2 * 100)
    upper = np.percentile(bootstrap_means, (1 + confidence) / 2 * 100)
    
    return lower, upper

# Example usage
values = pd.Series([23, 45, 67, 34, 89, 12, 56, 78, 43, 65])
ci_lower, ci_upper = bootstrap_mean_ci(values)
print(f"95% CI for mean: [{ci_lower:.2f}, {ci_upper:.2f}]")
95% CI for mean: [35.80, 62.50]

Each bootstrap iteration samples the full dataset with replacement, creating a slightly different distribution. The variation across iterations estimates the sampling distribution of your statistic.

Weighted Random Sampling

The weights parameter lets you assign sampling probabilities to each row. Rows with higher weights are more likely to be selected.

df_weighted = pd.DataFrame({
    'transaction_id': range(1, 11),
    'amount': [10, 500, 25, 1000, 15, 750, 30, 2000, 20, 100],
    'date': pd.date_range('2024-01-01', periods=10)
})

# Sample weighted by transaction amount (higher amounts more likely)
high_value_sample = df_weighted.sample(
    n=5, 
    weights='amount',  # Column name as string
    random_state=42
)
print(high_value_sample)
   transaction_id  amount       date
7               8    2000 2024-01-08
3               4    1000 2024-01-04
5               6     750 2024-01-06
1               2     500 2024-01-02
9              10     100 2024-01-10

Notice how the sample skews toward higher-amount transactions. The weights parameter accepts a column name (string), a Series, or an array-like object with the same length as the DataFrame.

Weights are automatically normalized, so they don’t need to sum to 1. A row with weight 100 is twice as likely to be selected as a row with weight 50.
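You can check that 2:1 claim empirically with a minimal two-row frame: draw many times with replacement and compare the counts.

```python
import pandas as pd

df = pd.DataFrame({'row': ['heavy', 'light'], 'w': [100, 50]})

# Over many draws, 'heavy' should appear roughly twice as often as 'light'
draws = df.sample(n=30_000, replace=True, weights='w', random_state=42)
counts = draws['row'].value_counts()
print(counts['heavy'] / counts['light'])  # close to 2.0
```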

Common use cases for weighted sampling:

  • Prioritize recent data: Weight by recency for time-sensitive analysis
  • Oversample rare classes: Balance imbalanced datasets before training
  • Importance sampling: Focus on high-impact records for auditing

# Weight by recency (more recent = higher weight)
df_weighted['recency_weight'] = range(1, 11)  # Older to newer
recent_biased = df_weighted.sample(
    n=5, 
    weights='recency_weight', 
    random_state=42
)

One caveat: rows with zero weight are never selected, and negative weights raise an error.
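Both behaviors are quick to confirm on a small frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'w': [0, 1, 1, 1]})

# id=1 has zero weight, so it can never appear in the sample
picked = df.sample(n=3, weights='w', random_state=42)
print(1 in picked['id'].tolist())  # False

# Negative weights are rejected outright
try:
    df.sample(n=2, weights=[-1, 1, 1, 1])
except ValueError as e:
    print(f"ValueError: {e}")
```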

Practical Use Cases

Let’s look at common patterns you’ll use repeatedly.

Train/Test Split

# Simple 80/20 split
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(f"Train: {len(train)}, Test: {len(test)}")
Train: 80, Test: 20

For machine learning, consider sklearn.model_selection.train_test_split instead—it handles stratification and shuffling more robustly. But for quick exploratory splits, this pattern works.
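Whichever splitter you use, a cheap sanity check confirms the split is a clean partition: the two index sets should be disjoint and together cover every row.

```python
import pandas as pd

df = pd.DataFrame({'id': range(100)})

train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Disjoint indices, and no rows lost
print(train.index.intersection(test.index).empty)  # True
print(len(train) + len(test) == len(df))           # True
```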

Random Data Preview

# Better than df.head() for understanding data distribution
def random_preview(df, n=10, seed=None):
    """Display random sample instead of first n rows."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# See rows from throughout the dataset, not just the beginning
random_preview(df, n=5, seed=42)

Stratified Sampling with groupby()

When you need proportional representation across categories:

# Sample 2 rows from each category
stratified = df.groupby('category').sample(n=2, random_state=42)
print(stratified.sort_values('category'))
    id  value category
16  17    116        A
44  45    144        A
65  66    165        B
29  30    129        B
74  75    174        C
42  43    142        C
71  72    171        D
87  88    187        D

For proportional stratified sampling (maintaining original category ratios):

# Sample 20% from each category
stratified_proportional = df.groupby('category').sample(frac=0.2, random_state=42)
print(stratified_proportional.groupby('category').size())
category
A    5
B    5
C    5
D    5
dtype: int64

This ensures your sample maintains the same category distribution as the original dataset—critical for representative analysis.
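You can verify that claim directly by comparing category proportions before and after sampling, a quick check reusing the 100-row df from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 101),
    'category': ['A', 'B', 'C', 'D'] * 25
})

original = df['category'].value_counts(normalize=True).sort_index()
sampled = (df.groupby('category')
             .sample(frac=0.2, random_state=42)['category']
             .value_counts(normalize=True)
             .sort_index())

# Proportions match because every group was sampled at the same rate
print((original == sampled).all())  # True
```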

Conclusion

The sample() method covers the full spectrum of random sampling needs in Pandas. Master these five parameters:

  • n: Absolute number of rows to sample
  • frac: Proportion of rows to sample (0 to 1)
  • random_state: Seed for reproducibility (always set this in production)
  • replace: Allow duplicate selections (required for bootstrapping)
  • weights: Non-uniform sampling probabilities

Combine with groupby() for stratified sampling across categories. For complex machine learning workflows, graduate to scikit-learn’s splitting utilities, but sample() handles the majority of everyday sampling tasks efficiently.

For complete parameter documentation, see the official Pandas sample() reference.
