Pandas - Speed Up Your Code (Performance Tips)
Key Insights
- Vectorized operations can be 100x faster than iterating with loops. Always prefer built-in Pandas and NumPy functions over iterrows() or apply() with custom lambdas.
- Proper dtype selection (especially category for low-cardinality strings) can reduce memory usage by 90% and significantly speed up operations.
- When Pandas becomes the bottleneck on large datasets, libraries like Polars offer dramatic speedups with minimal code changes.
Why Pandas Performance Matters
Pandas is the workhorse of data analysis in Python. It’s intuitive, well-documented, and handles most tabular data tasks elegantly. But that convenience comes with a cost: it’s surprisingly easy to write slow code.
When you’re exploring data in a Jupyter notebook with a few thousand rows, performance rarely matters. But when you’re processing millions of records in a production pipeline, or running the same transformation hundreds of times daily, inefficient Pandas code becomes a real problem. I’ve seen data pipelines that took hours reduced to minutes with straightforward optimizations.
The good news? Most Pandas performance issues stem from a handful of common antipatterns. Fix these, and you’ll handle significantly larger datasets without reaching for distributed computing frameworks.
Avoid Iterating with Loops
The single most impactful optimization you can make is eliminating row-by-row iteration. When you call iterrows() or write a Python for loop over a DataFrame, you’re abandoning everything that makes Pandas fast.
Pandas is built on NumPy, which performs operations on entire arrays at once using optimized C code. When you iterate row by row, you’re forcing Python to handle each element individually, with all the overhead that entails.
Here’s a concrete example. Suppose you need to calculate a discount price based on a percentage:
import pandas as pd
import numpy as np
# Create a sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'price': np.random.uniform(10, 1000, 1_000_000),
    'discount_pct': np.random.uniform(0.05, 0.30, 1_000_000)
})

# SLOW: Using iterrows
def calculate_with_loop(df):
    discounted_prices = []
    for idx, row in df.iterrows():
        discounted_prices.append(row['price'] * (1 - row['discount_pct']))
    df['discounted_price'] = discounted_prices
    return df

# FAST: Vectorized operation
def calculate_vectorized(df):
    df['discounted_price'] = df['price'] * (1 - df['discount_pct'])
    return df
The timing difference is stark:
%timeit calculate_with_loop(df.copy())
# 45.2 s ± 1.3 s per loop
%timeit calculate_vectorized(df.copy())
# 12.4 ms ± 0.8 ms per loop
That’s roughly 3,600x faster. The vectorized version completes in milliseconds what the loop takes nearly a minute to accomplish.
The rule is simple: if you’re reaching for iterrows(), stop and think about whether a vectorized operation exists. It almost always does.
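Conditional logic is where people most often assume a loop is required, but it vectorizes too. A minimal sketch (the column and tier names are invented for illustration) using np.where for a single condition and np.select for several:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'price': [120.0, 45.0, 800.0, 15.0]})

# Instead of an if/else inside a loop, branch over the whole column at once
df['tier'] = np.where(df['price'] > 100, 'premium', 'standard')

# np.select handles multiple conditions, checked in order; first match wins
conditions = [df['price'] > 500, df['price'] > 100]
choices = ['luxury', 'premium']
df['tier_detailed'] = np.select(conditions, choices, default='standard')
```

Both run in C over the entire array, so they scale like the vectorized discount example above rather than like iterrows().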
Use Appropriate Data Types
Pandas defaults to generous data types. Integers become int64, floats become float64, and strings become object dtype. This wastes memory and slows down operations.
The biggest win usually comes from converting low-cardinality string columns to category dtype. If a column contains repeated values (like country names, status codes, or product categories), categorical encoding stores each unique value once and uses integer codes internally.
# Create a DataFrame with repetitive string data
df = pd.DataFrame({
    'user_id': range(1_000_000),
    'country': np.random.choice(['USA', 'UK', 'Germany', 'France', 'Japan'], 1_000_000),
    'status': np.random.choice(['active', 'inactive', 'pending'], 1_000_000),
    'score': np.random.randint(0, 100, 1_000_000)
})
print("Before optimization:")
df.info(memory_usage='deep')  # info() prints directly; wrapping it in print() would also print None
# Memory usage: 95.4 MB
# Convert to appropriate types
df_optimized = df.copy()
df_optimized['country'] = df_optimized['country'].astype('category')
df_optimized['status'] = df_optimized['status'].astype('category')
df_optimized['score'] = df_optimized['score'].astype('int8') # Values 0-100 fit in int8
print("\nAfter optimization:")
df_optimized.info(memory_usage='deep')
# Memory usage: 10.7 MB
We reduced memory usage by nearly 90%. This isn’t just about RAM—smaller data means faster operations because more fits in CPU cache.
For numeric columns, consider downcasting when you know the value range:
# Downcast integers to smallest type that fits
df['small_int'] = pd.to_numeric(df['score'], downcast='integer')
# For nullable integers (with NaN support)
df['nullable_int'] = df['score'].astype('Int8') # Capital I for nullable
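When a DataFrame has many numeric columns, downcasting them one by one gets tedious. One way to sweep all of them at once is to combine select_dtypes with pd.to_numeric; this helper is a sketch, not a canonical pandas recipe:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast every numeric column to the smallest dtype that holds its values."""
    out = df.copy()
    for col in out.select_dtypes(include='integer').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include='float').columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Illustrative data: int64/float64 by default
df = pd.DataFrame({'score': np.arange(100, dtype='int64'),
                   'ratio': np.linspace(0, 1, 100)})
small = downcast_numeric(df)
# score: int64 -> int8, ratio: float64 -> float32
```

Note that downcast='float' stops at float32, which can lose precision; keep float64 for columns where that matters.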
Leverage Built-in Methods Over Apply
The apply() function is seductively flexible. You can pass any Python function and it “just works.” But that flexibility comes at a steep performance cost.
When you use apply() with a custom lambda, Pandas can’t optimize the operation. It falls back to calling your Python function for each row or element, similar to explicit iteration.
String operations are a common offender. Pandas provides a .str accessor with vectorized string methods that are dramatically faster:
df = pd.DataFrame({
    'email': [f'user{i}@example.com' for i in range(500_000)]
})
# SLOW: Using apply with lambda
%timeit df['email'].apply(lambda x: x.split('@')[0])
# 312 ms ± 15 ms per loop
# FAST: Using .str accessor
%timeit df['email'].str.split('@').str[0]
# 89 ms ± 4 ms per loop
# ALTERNATIVE: str.extract with regex (still vectorized, though slower than split here)
%timeit df['email'].str.extract(r'(.+)@')[0]
# 156 ms ± 8 ms per loop
The .str accessor covers most common string operations: contains(), replace(), lower(), strip(), len(), and many more. Check the documentation before writing a custom apply().
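Those accessor methods also chain, which keeps multi-step cleanup vectorized end to end. A small sketch with invented data, including a missing value to show that NaN propagates through the chain:

```python
import pandas as pd

s = pd.Series(['  Alice@Example.COM ', 'bob@test.org', None])

# Strip whitespace, normalize case, then test membership: all vectorized
cleaned = s.str.strip().str.lower()
is_example = cleaned.str.contains('example.com', regex=False)
# cleaned    -> 'alice@example.com', 'bob@test.org', NaN
# is_example -> True, False, NaN (missing values stay missing)
```

Each step returns a new Series, so the chain never drops back into per-element Python calls the way an equivalent apply() would.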
The same principle applies to numeric operations. Use NumPy functions directly instead of wrapping them in apply():
# SLOW
df['log_value'] = df['value'].apply(lambda x: np.log(x))
# FAST
df['log_value'] = np.log(df['value'])
Optimize GroupBy and Aggregations
GroupBy operations are central to data analysis, but they’re easy to write inefficiently. The key insight is that setting up a groupby is expensive—you want to extract maximum value from each one.
Instead of chaining multiple groupby calls, use agg() with a dictionary to compute multiple aggregations at once:
df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000),
    'subcategory': np.random.choice(['X', 'Y', 'Z'], 1_000_000),
    'value': np.random.randn(1_000_000),
    'quantity': np.random.randint(1, 100, 1_000_000)
})

# SLOW: Multiple separate groupby calls
def slow_aggregation(df):
    result = df.groupby('category')['value'].mean().reset_index()
    result['value_sum'] = df.groupby('category')['value'].sum().values
    result['quantity_mean'] = df.groupby('category')['quantity'].mean().values
    return result

# FAST: Single groupby with agg dictionary
def fast_aggregation(df):
    return df.groupby('category').agg({
        'value': ['mean', 'sum'],
        'quantity': 'mean'
    }).reset_index()
%timeit slow_aggregation(df)
# 89 ms ± 3 ms per loop
%timeit fast_aggregation(df)
# 31 ms ± 1 ms per loop
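One wrinkle with the dictionary form: it produces MultiIndex columns. If you want flat, readable column names from the same single groupby, pandas' named-aggregation syntax (keyword arguments of the form name=(column, func)) does exactly that. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
    'quantity': [10, 20, 30, 40],
})

# One groupby, multiple aggregations, flat output columns
result = df.groupby('category').agg(
    value_mean=('value', 'mean'),
    value_sum=('value', 'sum'),
    quantity_mean=('quantity', 'mean'),
).reset_index()
```

The cost is the same single groupby pass; only the column labeling differs.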
Also consider transform() when you need to broadcast aggregated values back to the original DataFrame shape. It’s more efficient than merging a grouped result back:
# Add group mean as a new column
df['category_mean'] = df.groupby('category')['value'].transform('mean')
Consider Alternative Libraries for Large Data
Sometimes Pandas itself is the bottleneck. If you’re working with datasets that push the limits of your RAM, or you need faster execution for production workloads, consider alternatives.
Polars has emerged as a compelling option. It’s written in Rust, uses lazy evaluation, and often outperforms Pandas by 10-50x on larger datasets:
import polars as pl
# Pandas approach
# Pandas approach
def pandas_workflow(path):
    df = pd.read_csv(path)
    df = df[df['value'] > 0]
    result = df.groupby('category').agg({'value': 'sum', 'quantity': 'mean'})
    return result

# Polars approach
def polars_workflow(path):
    df = pl.read_csv(path)
    result = (
        df.filter(pl.col('value') > 0)
        .group_by('category')
        .agg([
            pl.col('value').sum(),
            pl.col('quantity').mean()
        ])
    )
    return result
# On a 10 million row CSV:
# Pandas: 8.2 seconds
# Polars: 0.9 seconds
For datasets that don’t fit in memory, Dask provides a Pandas-like API that works with chunked data. Modin offers a drop-in replacement that parallelizes Pandas operations across cores.
Profiling Before Optimizing
Before optimizing anything, measure. Intuition about performance is often wrong, and premature optimization wastes time.
Use %timeit in Jupyter for quick comparisons. For memory profiling, memory_profiler shows line-by-line memory consumption. For CPU profiling, line_profiler reveals which lines are actually slow.
# Install: pip install memory_profiler line_profiler
# In Jupyter:
%load_ext memory_profiler
%memit your_function(df)
%load_ext line_profiler
%lprun -f your_function your_function(df)
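The magics above only work inside IPython or Jupyter. In a plain script, the standard-library timeit module gives comparable best-of-several-runs numbers; this is a minimal sketch with an illustrative function and size:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.uniform(10, 1000, 100_000),
                   'discount_pct': np.random.uniform(0.05, 0.30, 100_000)})

def vectorized():
    return df['price'] * (1 - df['discount_pct'])

# repeat() returns total seconds per batch of `number` calls;
# taking min() of the repeats mimics %timeit's best-run reporting
best = min(timeit.repeat(vectorized, number=10, repeat=3)) / 10
print(f"{best * 1e3:.3f} ms per call")
```

Taking the minimum rather than the mean filters out runs slowed by unrelated system activity.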
Profile first, optimize the actual bottlenecks, and measure again to confirm improvement. The techniques in this article will handle most cases, but your specific data and operations might have unique characteristics worth investigating.