How to Sample Rows in Polars

Key Insights

  • Polars’ sample() method offers both count-based (n) and percentage-based (fraction) sampling with excellent performance on large datasets
  • Always use the seed parameter when you need reproducible results for testing, debugging, or sharing analysis with teammates
  • Weighted sampling isn’t built into sample(), but a few lines of NumPy let you bias selection toward specific rows—useful for prioritizing recent data, high-value records, or custom stratification logic

Introduction

Row sampling is one of those operations you reach for constantly in data work. You need a quick subset to test a pipeline, want to explore a massive dataset without loading everything into memory, or need to create train/test splits for machine learning. Whatever the reason, efficient sampling matters.

Polars handles sampling remarkably well. Unlike pandas, which can struggle with large datasets, Polars samples quickly even on multi-million-row frames, and the API gives you precise control over exactly how rows get selected. One caveat worth knowing up front: sample() lives on DataFrame (and on expressions), not on LazyFrame, though lazy queries can still sample with a small workaround.

Let’s walk through everything you need to know about sampling in Polars, from basic operations to advanced weighted sampling techniques.

Basic Random Sampling with sample()

The sample() method is your primary tool for row sampling. It offers two approaches: sampling a fixed number of rows with n, or sampling a fraction of the dataset with fraction.

Here’s the straightforward approach for grabbing a specific number of rows:

import polars as pl

# Create a sample dataset
df = pl.DataFrame({
    "id": range(1, 10001),
    "value": [i * 1.5 for i in range(1, 10001)],
    "category": ["A", "B", "C", "D"] * 2500
})

# Sample exactly 100 rows
sample_fixed = df.sample(n=100)
print(f"Sampled {sample_fixed.height} rows")
# Output: Sampled 100 rows

When you want a percentage of your data instead, use fraction:

# Sample 10% of the dataset
sample_fraction = df.sample(fraction=0.1)
print(f"Sampled {sample_fraction.height} rows from {df.height} total")
# Output: Sampled 1000 rows from 10000 total

One thing to note: you can’t use both n and fraction at the same time. Polars will raise an error if you try. Pick the approach that makes sense for your use case—fixed counts work well for consistent batch sizes, while fractions are better when you want proportional samples regardless of dataset size.

Reproducible Sampling with Seeds

Random sampling is great until you need to reproduce your results. Maybe you’re debugging an issue that only appears with certain rows, or you want a colleague to work with the exact same subset. The seed parameter solves this:

# Without a seed, each run gives different results
sample1 = df.sample(n=5)
sample2 = df.sample(n=5)
print(sample1["id"].to_list())  # e.g., [4521, 892, 7234, 156, 9087]
print(sample2["id"].to_list())  # e.g., [2341, 6789, 123, 8456, 3012]

# With a seed, results are reproducible
sample_seeded1 = df.sample(n=5, seed=42)
sample_seeded2 = df.sample(n=5, seed=42)
print(sample_seeded1["id"].to_list())
print(sample_seeded2["id"].to_list())  # same list as above - identical! (exact IDs vary by Polars version)

I recommend always using seeds in production code and tests. It costs nothing and saves hours of debugging when you need to reproduce an issue. Pick a memorable number—42 is traditional, but use whatever works for your team.

Sampling With and Without Replacement

By default, Polars samples without replacement, meaning each row can only appear once in your sample. Sometimes you want replacement—bootstrap sampling being the classic example.

# Default: without replacement (each row appears at most once)
sample_no_replace = df.sample(n=100, seed=42, with_replacement=False)
unique_ids = sample_no_replace["id"].n_unique()
print(f"Unique IDs: {unique_ids}")  # Always 100

# With replacement: rows can appear multiple times
sample_with_replace = df.sample(n=100, seed=42, with_replacement=True)
unique_ids_replace = sample_with_replace["id"].n_unique()
print(f"Unique IDs with replacement: {unique_ids_replace}")  # Often less than 100

Here’s when to use each:

Without replacement (default): Train/test splits, creating subsets for analysis, any situation where duplicate rows would skew your results.

With replacement: Bootstrap sampling for confidence intervals, oversampling minority classes, Monte Carlo simulations where you need to draw from the same distribution repeatedly.

One important detail: when sampling without replacement, n cannot exceed the number of rows in your DataFrame. With replacement, you can sample more rows than exist in the original data:

# This works fine with replacement
large_sample = df.sample(n=50000, with_replacement=True)
print(f"Sampled {large_sample.height} rows from {df.height} original rows")
# Output: Sampled 50000 rows from 10000 original rows

Weighted Sampling

Sometimes uniform random sampling isn’t what you need. Maybe recent records should be more likely to appear, or high-value transactions deserve more representation. Polars’ sample() doesn’t take a weights parameter (unlike its pandas counterpart), but weighted sampling is easy to build with NumPy.

The recipe: turn your weight column into a probability array, draw row indices with np.random.choice, and select those rows:

import numpy as np
import polars as pl

# Create dataset with timestamps and transaction values
df_weighted = pl.DataFrame({
    "id": range(1, 1001),
    "days_ago": [i % 365 for i in range(1, 1001)],  # 0-364 days ago
    "transaction_value": [100 + (i * 0.5) for i in range(1, 1001)]
})

# Weight by recency (newer = higher weight), normalized to probabilities
weights = (365 - df_weighted["days_ago"]).to_numpy().astype(float)
probs = weights / weights.sum()

# Draw 100 distinct row indices, biased by the probabilities
rng = np.random.default_rng(42)
idx = rng.choice(df_weighted.height, size=100, replace=False, p=probs)

recent_biased_sample = (
    df_weighted.with_row_index()
    .filter(pl.col("index").is_in(idx.tolist()))
    .drop("index")
)

# Check average days_ago - should be lower than uniform sampling
print(f"Weighted sample avg days ago: {recent_biased_sample['days_ago'].mean():.1f}")

uniform_sample = df_weighted.sample(n=100, seed=42)
print(f"Uniform sample avg days ago: {uniform_sample['days_ago'].mean():.1f}")

The same pattern works with any non-negative column—here, biasing toward high transaction values:

# Weight by transaction value
value_weights = df_weighted["transaction_value"].to_numpy()
idx = rng.choice(df_weighted.height, size=100, replace=False,
                 p=value_weights / value_weights.sum())

high_value_sample = (
    df_weighted.with_row_index()
    .filter(pl.col("index").is_in(idx.tolist()))
    .drop("index")
)

print(f"High-value biased avg: ${high_value_sample['transaction_value'].mean():.2f}")
print(f"Uniform avg: ${df_weighted.sample(n=100, seed=42)['transaction_value'].mean():.2f}")

Weights don’t need to sum to 1 up front—just normalize them yourself before passing p= to np.random.choice, and make sure every weight is non-negative.

Sampling in Lazy Mode

Polars really shines when working with large datasets in lazy mode. LazyFrame doesn’t expose a sample() method of its own, but you can keep sampling inside the query plan with a shuffled row index: number the rows 0 to len-1, shuffle those numbers with a seed, and keep the rows whose shuffled number falls below your target count:

# Simulate a large dataset scenario
large_df = pl.DataFrame({
    "id": range(1, 1000001),
    "category": ["A", "B", "C", "D", "E"] * 200000,
    "value": [i * 0.001 for i in range(1, 1000001)],
    "is_valid": [i % 7 != 0 for i in range(1, 1000001)]
})

# Lazy sampling with filtering: keep 1000 rows at random
result = (
    large_df.lazy()
    .filter(pl.col("is_valid"))
    .filter(pl.col("category").is_in(["A", "B"]))
    .filter(pl.int_range(pl.len()).shuffle(seed=42) < 1000)
    .collect()
)

print(f"Final sample size: {result.height}")
print(f"Categories in sample: {result['category'].unique().to_list()}")

This approach is memory-efficient because Polars optimizes the entire query plan. The shuffled-index filter is evaluated on the already-filtered rows within the same execution pass, so you never materialize an unsampled intermediate DataFrame just to draw from it.

For truly massive datasets that don’t fit in memory, combine lazy mode with scanning:

# Sampling from a large parquet file with the same shuffled-index trick
result = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("status") == "active")
    .filter(pl.int_range(pl.len()).shuffle(seed=42) < 5000)
    .collect()
)

Practical Use Cases

Let’s look at real-world applications that combine these techniques.

Train/Test Split

Creating a proper train/test split requires sampling your training set and then grabbing everything else for testing:

df = pl.DataFrame({
    "id": range(1, 10001),
    "features": [i * 0.1 for i in range(1, 10001)],
    "target": [i % 2 for i in range(1, 10001)]
})

# Sample 80% for training
train = df.sample(fraction=0.8, seed=42)

# Anti-join to get test set (rows NOT in training)
test = df.join(train, on="id", how="anti")

print(f"Train: {train.height}, Test: {test.height}, Total: {train.height + test.height}")
# Output: Train: 8000, Test: 2000, Total: 10000

Stratified-Like Sampling

When you need proportional representation across categories, use group_by with map_groups:

df = pl.DataFrame({
    "id": range(1, 1001),
    "category": ["rare"] * 50 + ["common"] * 950,
    "value": range(1, 1001)
})

# Sample 10% from each category
stratified = df.group_by("category").map_groups(
    lambda group: group.sample(fraction=0.1, seed=42)
)

print(stratified.group_by("category").len())
# Shows proportional sampling: ~5 rare, ~95 common

Debugging with Subset Data

When debugging pipelines, grab a reproducible subset that includes edge cases:

# Get a sample that includes specific categories you want to test
debug_sample = (
    df.lazy()
    .filter(pl.col("category").is_in(["rare", "common"]))
    .sample(n=100, seed=42)
    .collect()
)

Row sampling in Polars is fast, flexible, and integrates naturally into both eager and lazy workflows. Master these techniques and you’ll handle everything from quick data exploration to production-grade train/test splits with confidence.
