Pandas - Best Practices for Large DataFrames

Key Insights

  • Optimizing data types alone can reduce DataFrame memory usage by 50-90%, often making the difference between fitting data in memory or crashing your process.
  • Chunked processing isn’t just for files that don’t fit in memory—it’s a defensive strategy that prevents unexpected OOM errors when data grows.
  • Before reaching for Dask or Spark, exhaust pandas optimizations first; most “big data” problems under 10GB are actually pandas problems in disguise.

Understanding Memory Footprint

Pandas DataFrames are deceptively memory-hungry. A 500MB CSV can easily balloon to 2-3GB in memory because pandas defaults to generous data types and stores strings as Python objects with significant overhead.

Before optimizing anything, measure what you’re working with:

import pandas as pd
import numpy as np

df = pd.read_csv('sales_data.csv')

# Surface-level memory (misleading for object columns)
print(f"Shallow memory: {df.memory_usage().sum() / 1e6:.2f} MB")

# True memory including object contents
print(f"Deep memory: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")

# Per-column breakdown
memory_by_col = df.memory_usage(deep=True) / 1e6
print(memory_by_col.sort_values(ascending=False).head(10))

The deep=True parameter is critical. Without it, pandas reports only pointer sizes for object columns, not the actual string data. I’ve seen developers confused about why their “50MB DataFrame” crashed a 16GB machine: they were looking at shallow memory.

Object dtype columns are almost always the culprit. A column with 1 million strings averaging 20 characters consumes roughly 70MB as objects but only 20MB as a proper string array. The overhead comes from Python’s object model: each string carries type information, reference counts, and memory allocation metadata.

# Demonstrating object vs category memory difference
n_rows = 1_000_000
df_test = pd.DataFrame({
    'status_object': np.random.choice(['pending', 'completed', 'failed', 'cancelled'], n_rows),
})
df_test['status_category'] = df_test['status_object'].astype('category')

print(f"Object dtype: {df_test['status_object'].memory_usage(deep=True) / 1e6:.2f} MB")
print(f"Category dtype: {df_test['status_category'].memory_usage(deep=True) / 1e6:.2f} MB")
# Typical output: Object dtype: ~59 MB | Category dtype: ~0.95 MB

Optimizing Data Types

The single highest-impact optimization is fixing data types at load time or immediately after. Here’s a function I use on every project:

def optimize_dtypes(df, categorical_threshold=0.5):
    """
    Optimize DataFrame memory usage by downcasting types.
    
    Args:
        df: Input DataFrame
        categorical_threshold: Convert object columns to category if 
                               unique_values / total_rows < threshold
    """
    start_mem = df.memory_usage(deep=True).sum() / 1e6
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type == 'object':
            # Convert low-cardinality strings to category
            num_unique = df[col].nunique()
            num_total = len(df[col])
            if num_unique / num_total < categorical_threshold:
                df[col] = df[col].astype('category')
        
        elif col_type == 'float64':
            # Downcast floats
            df[col] = pd.to_numeric(df[col], downcast='float')
        
        elif col_type == 'int64':
            # Downcast integers to the smallest width that fits the range
            col_min, col_max = df[col].min(), df[col].max()
            if col_min >= 0:
                if col_max <= 255:
                    df[col] = df[col].astype('uint8')
                elif col_max <= 65535:
                    df[col] = df[col].astype('uint16')
                elif col_max <= 4294967295:
                    df[col] = df[col].astype('uint32')
            else:
                if col_min >= -128 and col_max <= 127:
                    df[col] = df[col].astype('int8')
                elif col_min >= -32768 and col_max <= 32767:
                    df[col] = df[col].astype('int16')
                elif col_min >= -2147483648 and col_max <= 2147483647:
                    df[col] = df[col].astype('int32')
    
    end_mem = df.memory_usage(deep=True).sum() / 1e6
    print(f"Memory reduced from {start_mem:.2f} MB to {end_mem:.2f} MB "
          f"({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)")
    
    return df

For CSVs, specify dtypes upfront to avoid the initial memory spike:

dtypes = {
    'user_id': 'uint32',
    'amount': 'float32',
    'status': 'category',
    'country_code': 'category',
}
df = pd.read_csv('transactions.csv', dtype=dtypes)
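When exploring a new file, usecols and nrows keep the first load cheap. A minimal sketch, using a small in-memory CSV as a stand-in for the real transactions file (the column names here are assumptions carried over from the example above):

```python
import io
import pandas as pd

# An in-memory CSV stands in for the real transactions file
csv = io.StringIO("user_id,amount,status,notes\n"
                  "1,9.99,active,a\n"
                  "2,5.00,inactive,b\n")

# Load only the needed columns with explicit dtypes; 'notes' is never parsed
df = pd.read_csv(
    csv,
    usecols=['user_id', 'amount', 'status'],
    dtype={'user_id': 'uint32', 'amount': 'float32', 'status': 'category'},
)
print(df.dtypes)
```

Combined with nrows for a first look at an unfamiliar file, this avoids ever materializing columns you don't need.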

Chunked Reading and Processing

When files exceed available memory, process them in chunks. The key insight: don’t concatenate chunks back together unless absolutely necessary.

def process_large_csv(filepath, chunksize=100_000):
    """Process large CSV with running aggregations."""
    
    total_revenue = 0
    category_counts = pd.Series(dtype='int64')
    
    chunks = pd.read_csv(
        filepath,
        chunksize=chunksize,
        dtype={'category': 'category', 'revenue': 'float32'}
    )
    
    for i, chunk in enumerate(chunks):
        # Running sum
        total_revenue += chunk['revenue'].sum()
        
        # Combine category counts
        chunk_counts = chunk['category'].value_counts()
        category_counts = category_counts.add(chunk_counts, fill_value=0)
        
        if i % 10 == 0:
            print(f"Processed {i * chunksize + len(chunk):,} rows")
    
    return {
        'total_revenue': total_revenue,
        'category_distribution': category_counts.astype('int64')
    }

For operations requiring the full dataset (like median or complex joins), consider whether you can sample, pre-aggregate, or restructure the problem. Often you can compute partial results per chunk and combine them mathematically.
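For instance, a global mean decomposes into per-chunk sums and counts. A sketch with synthetic data, where iloc slices simulate the chunks a reader would yield:

```python
import numpy as np
import pandas as pd

# Synthetic data, split with iloc to simulate chunked reading
data = pd.DataFrame({'revenue': np.random.default_rng(0).random(1_000_000)})
chunks = [data.iloc[i:i + 100_000] for i in range(0, len(data), 100_000)]

# Each chunk contributes only a partial sum and a row count
total, count = 0.0, 0
for chunk in chunks:
    total += chunk['revenue'].sum()
    count += len(chunk)

# Combine the partials: matches data['revenue'].mean() up to float error
global_mean = total / count
```

The same decomposition works for variance (sums of squares), min/max, and counts; medians and exact quantiles are the ones that genuinely resist chunking.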

Efficient Filtering and Querying

Boolean indexing creates intermediate arrays. For complex conditions, query() and eval() are more memory-efficient:

import time

n = 5_000_000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.choice(['x', 'y', 'z'], n),
})

# Method 1: Chained boolean indexing (creates multiple intermediate arrays)
start = time.perf_counter()
result1 = df[(df['a'] > 0) & (df['b'] < 0.5) & (df['c'] == 'x')]
print(f"Boolean indexing: {time.perf_counter() - start:.4f}s")

# Method 2: query() with string expression (more memory efficient)
start = time.perf_counter()
result2 = df.query("a > 0 and b < 0.5 and c == 'x'")
print(f"query(): {time.perf_counter() - start:.4f}s")

# Method 3: eval() for computed columns
start = time.perf_counter()
df['d'] = df.eval('a * 2 + b')
print(f"eval(): {time.perf_counter() - start:.4f}s")

Avoid chained indexing like df[df['a'] > 0]['b'] = 5. It’s unpredictable and creates copies. Use loc instead: df.loc[df['a'] > 0, 'b'] = 5.
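A minimal illustration of the loc pattern, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'a': [-1.0, 1.0, 2.0], 'b': [0.0, 0.0, 0.0]})

# Chained form: df[df['a'] > 0]['b'] = 5 may assign into a temporary copy
# and warn; whether df actually changes is an implementation detail.

# Single loc call: one indexing operation, guaranteed to modify df in place
df.loc[df['a'] > 0, 'b'] = 5
print(df['b'].tolist())  # [0.0, 5.0, 5.0]
```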

Leveraging Vectorization

Every apply() call should make you uncomfortable. It’s a loop in disguise, processing one row at a time with Python overhead.

# Slow: apply() with lambda
def categorize_slow(df):
    return df.apply(
        lambda row: 'high' if row['value'] > 100 else 'low',
        axis=1
    )

# Fast: vectorized with np.where
def categorize_fast(df):
    return np.where(df['value'] > 100, 'high', 'low')

# Even faster for multiple conditions: np.select
def categorize_fastest(df):
    conditions = [
        df['value'] > 100,
        df['value'] > 50,
        df['value'] > 0,
    ]
    choices = ['high', 'medium', 'low']
    return np.select(conditions, choices, default='negative')

# Benchmark
df = pd.DataFrame({'value': np.random.randn(500_000) * 100})

start = time.perf_counter()
df['cat_slow'] = categorize_slow(df)
print(f"apply(): {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
df['cat_fast'] = categorize_fast(df)
print(f"np.where(): {time.perf_counter() - start:.3f}s")

# Typical output: apply(): 8.234s | np.where(): 0.012s

That’s not a typo—vectorized operations are often 100-1000x faster than apply().
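The same principle covers lookups: a row-wise apply() that reads from a dict can usually be replaced with Series.map(). A sketch using hypothetical country codes (the mapping is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from country code to region
region = {'US': 'Americas', 'DE': 'EMEA', 'JP': 'APAC'}
df = pd.DataFrame({'code': np.random.default_rng(1).choice(list(region), 100_000)})

# Slow: a Python-level function call per row
slow = df['code'].apply(lambda c: region[c])

# Fast: map() performs the dict lookup without a per-element lambda
fast = df['code'].map(region)

assert slow.equals(fast)
```

Note that map() returns NaN for keys missing from the dict, whereas the lambda above would raise a KeyError; pick whichever failure mode you actually want.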

When to Consider Alternatives

If you’ve optimized dtypes, used chunking, and vectorized everything, but still hit walls, consider these alternatives:

# Polars: Rust-based, lazy evaluation, excellent for large datasets
import polars as pl

# Same operation: filter and aggregate
# Pandas
df_pd = pd.read_csv('large_file.csv')
result_pd = (df_pd[df_pd['status'] == 'active']
             .groupby('category')['amount']
             .sum())

# Polars (lazy evaluation, optimized execution plan)
df_pl = pl.scan_csv('large_file.csv')  # Lazy, doesn't load yet
result_pl = (df_pl
             .filter(pl.col('status') == 'active')
             .group_by('category')
             .agg(pl.col('amount').sum())
             .collect())  # Executes optimized query

Switch to Polars when: Data fits in memory but pandas is too slow, you need parallel processing, or you’re doing heavy groupby/join operations.

Switch to Dask when: Data doesn’t fit in memory and you want pandas-like syntax with lazy evaluation across a cluster.

Switch to a database when: You’re repeatedly querying the same large dataset—DuckDB handles analytical queries on CSV/Parquet files remarkably well.

Practical Checklist

Before loading data:

  • Define explicit dtypes in read_csv()
  • Use usecols to load only needed columns
  • Consider nrows for initial exploration

After loading:

  • Run df.memory_usage(deep=True) to identify bloated columns
  • Convert low-cardinality strings to category
  • Downcast numeric types to smallest sufficient size

During processing:

  • Replace apply() with vectorized operations
  • Use query() for complex filters
  • Process in chunks when memory is constrained
  • Delete intermediate DataFrames and call gc.collect() in tight loops
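The last point looks like this in practice; the loop body here is a dummy stand-in for real per-batch work:

```python
import gc
import numpy as np
import pandas as pd

results = []
for seed in range(3):
    rng = np.random.default_rng(seed)
    big = pd.DataFrame(rng.random((100_000, 10)))   # large intermediate
    results.append(float(big.to_numpy().sum()))     # keep only the aggregate
    del big                                         # drop the reference...
    gc.collect()                                    # ...then reclaim promptly
```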

Profiling workflow:

  1. %memit in Jupyter to track memory per cell
  2. memory_profiler for line-by-line analysis
  3. Monitor system memory during long jobs with htop or glances

The goal isn’t premature optimization—it’s building habits that prevent your notebook from crashing at 3 AM when the data pipeline delivers twice the expected volume. Optimize once, run reliably forever.
