Pandas: Memory Usage Reduction
Key Insights
- Downcasting numeric types and converting low-cardinality strings to categories can reduce DataFrame memory usage by 50-90% with minimal code changes.
- Always profile memory usage with `df.info(memory_usage='deep')` before optimizing—object columns often consume far more memory than their apparent size suggests.
- Chunked processing isn’t just for files that don’t fit in memory; it’s a defensive programming practice that makes your code more resilient and scalable.
Introduction: Why Memory Matters in Pandas
Pandas defaults to memory-hungry data types. Load a CSV with a million rows, and Pandas will happily allocate 64-bit integers for columns that only contain values 0-10, and store repeated strings like “active” and “inactive” as full Python objects. The result? A 100MB CSV can balloon into a 2GB DataFrame.
This matters for three reasons. First, you’ll hit memory limits faster, causing crashes or forcing you to rent bigger machines. Second, operations on bloated DataFrames are slower—more memory means more cache misses and more time spent shuffling bytes. Third, excessive memory usage limits your ability to work with multiple DataFrames simultaneously or perform memory-intensive operations like joins and groupbys.
The good news: Pandas provides straightforward tools to slash memory consumption. Most optimizations require just a few lines of code and can reduce memory usage by 50-90%.
Profiling Memory Usage
Before optimizing, measure. Pandas provides several ways to inspect memory consumption.
The most useful method is `df.info()` with the `memory_usage='deep'` parameter:
```python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'id': range(100000),
    'status': np.random.choice(['active', 'inactive', 'pending'], 100000),
    'value': np.random.random(100000),
    'count': np.random.randint(0, 100, 100000)
})

df.info(memory_usage='deep')
```
Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   id      100000 non-null  int64
 1   status  100000 non-null  object
 2   value   100000 non-null  float64
 3   count   100000 non-null  int64
dtypes: float64(1), int64(2), object(1)
memory usage: 6.9 MB
```
The 'deep' parameter is critical for object columns. Without it, Pandas reports only the memory used by the object pointers, not the actual string data. For a more granular view:
```python
memory_by_column = df.memory_usage(deep=True)
print(memory_by_column)
print(f"\nTotal: {memory_by_column.sum() / 1024**2:.2f} MB")
```
This reveals which columns are the worst offenders. Object columns almost always top the list.
Optimizing Numeric Data Types
Pandas defaults to int64 and float64 for all numeric data. This is wasteful when your data doesn’t need that precision.
Consider the ranges:
- int8: -128 to 127
- int16: -32,768 to 32,767
- int32: -2.1 billion to 2.1 billion
- int64: -9.2 quintillion to 9.2 quintillion
If your column contains values 0-100, you’re wasting 7 bytes per value with int64.
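The ranges in the list above can be checked programmatically rather than memorized. As a quick sketch, NumPy's `np.iinfo` reports the representable range and per-value size of each integer dtype:

```python
import numpy as np

# Inspect the representable range and per-value size of each integer dtype
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    size = np.dtype(dtype).itemsize
    print(f"{info.dtype}: {info.min} to {info.max} ({size} byte(s) per value)")
```

Comparing a column's actual min/max against these bounds tells you how far it can safely be downcast.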
Use pd.to_numeric() with the downcast parameter for automatic optimization:
```python
def optimize_numeric_columns(df):
    """Downcast numeric columns to smallest possible dtype."""
    df_optimized = df.copy()
    for col in df_optimized.select_dtypes(include=['int64']).columns:
        df_optimized[col] = pd.to_numeric(df_optimized[col], downcast='integer')
    for col in df_optimized.select_dtypes(include=['float64']).columns:
        df_optimized[col] = pd.to_numeric(df_optimized[col], downcast='float')
    return df_optimized

df_optimized = optimize_numeric_columns(df)
df_optimized.info(memory_usage='deep')
```
For more control, convert explicitly:
```python
# Check the actual range before converting
print(f"id range: {df['id'].min()} to {df['id'].max()}")
print(f"count range: {df['count'].min()} to {df['count'].max()}")

# Convert based on actual data ranges
df['count'] = df['count'].astype('int8')     # Values 0-100 fit in int8
df['id'] = df['id'].astype('int32')          # 100k values fit in int32
df['value'] = df['value'].astype('float32')  # Usually sufficient precision
```
A word of caution: downcasting floats to float32 reduces precision. For financial calculations or scientific computing where precision matters, keep float64. For most analytics and machine learning tasks, float32 is fine.
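To put numbers on that precision difference, you can inspect each float type's machine epsilon (the smallest representable gap around 1.0) with `np.finfo`:

```python
import numpy as np

# Machine epsilon: the smallest gap representable around 1.0
print(np.finfo(np.float32).eps)  # ~1.19e-07, roughly 7 significant decimal digits
print(np.finfo(np.float64).eps)  # ~2.22e-16, roughly 15-16 significant decimal digits
```

If your values need more than about 7 significant digits to be meaningful, stay with float64.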
Optimizing String/Object Columns
Object columns are memory hogs. Each string is stored as a separate Python object with its own overhead. For columns with repeated values—status codes, country names, categories—this is massively inefficient.
The category dtype stores each unique value once and uses integer codes to reference them:
```python
# Before optimization
print(f"Status column memory: {df['status'].memory_usage(deep=True) / 1024**2:.2f} MB")
print(f"Unique values: {df['status'].nunique()}")

# Convert to category
df['status'] = df['status'].astype('category')

# After optimization
print(f"Status column memory: {df['status'].memory_usage(deep=True) / 1024**2:.2f} MB")
```
The savings scale with the ratio of total rows to unique values. A column with 1 million rows but only 10 unique values will see dramatic improvement.
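As a quick illustration of that scaling (with made-up labels), a million-row column holding only ten distinct strings shrinks to roughly the size of its integer codes:

```python
import numpy as np
import pandas as pd

# 1 million rows, only 10 unique labels
labels = pd.Series(np.random.choice([f"label_{i}" for i in range(10)], 1_000_000))
obj_mb = labels.memory_usage(deep=True) / 1024**2
cat_mb = labels.astype("category").memory_usage(deep=True) / 1024**2
print(f"string column: {obj_mb:.1f} MB, category: {cat_mb:.1f} MB")
```

With only 10 categories, the codes fit in int8, so the converted column costs about one byte per row plus a tiny lookup table.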
Here’s a function to automatically convert low-cardinality object columns:
```python
def optimize_object_columns(df, cardinality_threshold=0.5):
    """
    Convert object columns to category if cardinality is below threshold.

    cardinality_threshold: Convert if unique values < threshold * total rows
    """
    df_optimized = df.copy()
    for col in df_optimized.select_dtypes(include=['object']).columns:
        num_unique = df_optimized[col].nunique()
        num_total = len(df_optimized[col])
        cardinality = num_unique / num_total
        if cardinality < cardinality_threshold:
            df_optimized[col] = df_optimized[col].astype('category')
            print(f"Converted {col}: {num_unique} unique values, "
                  f"cardinality {cardinality:.2%}")
    return df_optimized
```
Don’t convert high-cardinality columns like names, emails, or free-text fields. The category overhead can actually increase memory usage when most values are unique.
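A small sketch of that backfire, using hypothetical all-unique email strings: category must store every unique string anyway, plus an integer code per row, so it ends up larger than the plain string column.

```python
import pandas as pd

# Every value unique: category stores all the strings *plus* integer codes
unique_strings = pd.Series([f"user_{i}@example.com" for i in range(100_000)])
as_object = unique_strings.memory_usage(deep=True)
as_category = unique_strings.astype("category").memory_usage(deep=True)
print(f"plain:    {as_object / 1024**2:.2f} MB")
print(f"category: {as_category / 1024**2:.2f} MB")  # larger than plain here
```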
Chunked Processing for Large Files
When datasets exceed available memory, process them in chunks:
```python
def process_large_csv(filepath, chunksize=50000):
    """Process a large CSV file in chunks."""
    results = []
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        # Optimize memory within each chunk
        chunk = optimize_numeric_columns(chunk)
        chunk = optimize_object_columns(chunk)
        # Process the chunk (example: aggregate)
        chunk_result = chunk.groupby('category')['value'].sum()
        results.append(chunk_result)
    # Combine results
    return pd.concat(results).groupby(level=0).sum()

# Usage
# daily_totals = process_large_csv('massive_dataset.csv')
```
For operations that require the full dataset, like sorting or computing percentiles, you’ll need a different approach:
```python
def get_column_stats_chunked(filepath, column, chunksize=50000):
    """Calculate statistics for a column without loading the full DataFrame.

    Note: the single column is still accumulated in memory, so this helps
    when the file is wide, not when one column is itself enormous.
    """
    values = []
    for chunk in pd.read_csv(filepath, usecols=[column], chunksize=chunksize):
        values.extend(chunk[column].dropna().tolist())
    series = pd.Series(values)
    return {
        'mean': series.mean(),
        'median': series.median(),
        'std': series.std(),
        'percentile_95': series.quantile(0.95)
    }
```
The usecols parameter is another memory saver—only load the columns you actually need.
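Better still, you can combine column selection with the dtype optimizations from earlier sections at parse time, so the wasteful int64/object defaults never materialize. A sketch with a small in-memory CSV standing in for a large file (column names are illustrative):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a large file on disk
csv_data = io.StringIO(
    "id,status,count,value,notes\n"
    "1,active,5,0.1,foo\n"
    "2,inactive,99,0.2,bar\n"
)

# usecols skips columns we never need; dtype applies the optimized
# types while parsing, so no oversized intermediate DataFrame exists
df = pd.read_csv(
    csv_data,
    usecols=["status", "count", "value"],
    dtype={"status": "category", "count": "int8", "value": "float32"},
)
print(df.dtypes)
```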
Sparse Data Structures
Datasets with many null values, zeros, or repeated values benefit from sparse storage:
```python
# Create a DataFrame with many zeros
sparse_data = pd.DataFrame({
    'dense': np.random.choice([0, 0, 0, 0, 1], 100000),
})
print(f"Dense memory: {sparse_data['dense'].memory_usage(deep=True) / 1024:.2f} KB")

# Convert to sparse
sparse_data['sparse'] = pd.arrays.SparseArray(sparse_data['dense'], fill_value=0)
print(f"Sparse memory: {sparse_data['sparse'].memory_usage() / 1024:.2f} KB")
```
Sparse arrays store only non-fill values and their positions. The more uniform your data, the better the compression.
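You can estimate the benefit before converting: the fraction of values that differ from the intended fill value is what a sparse array must actually store, and after conversion the `.sparse.density` accessor reports the same figure. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([0, 0, 0, 0, 1], 100_000))

# Fraction of values differing from the fill value; the lower
# this is, the more sparse storage will save
nonfill_fraction = (s != 0).mean()
print(f"non-fill fraction: {nonfill_fraction:.2%}")

sparse = s.astype(pd.SparseDtype("int64", fill_value=0))
print(f"density after conversion: {sparse.sparse.density:.2%}")
```

As a rule of thumb, sparse storage only pays off when the density is well below 50%.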
For DataFrames with many NaN values:
```python
# Create DataFrame with many NaNs
df_with_nans = pd.DataFrame(
    np.random.choice([np.nan, np.nan, np.nan, 1.0, 2.0], (10000, 10))
)

# Convert to sparse
df_sparse = df_with_nans.astype(pd.SparseDtype("float", np.nan))
print(f"Dense: {df_with_nans.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"Sparse: {df_sparse.memory_usage(deep=True).sum() / 1024:.2f} KB")
```
Sparse arrays have trade-offs: some operations are slower, and not all Pandas functions support them. Test your specific workflow before committing to sparse storage.
Summary: Best Practices Checklist
Apply these techniques in order of impact and ease:
| Technique | Expected Savings | When to Use | Trade-offs |
|---|---|---|---|
| Downcast integers | 50-87% | Always | None for most cases |
| Downcast floats | 50% | Non-precision-critical work | Reduced precision |
| Category dtype | 70-95% | Low-cardinality strings | Overhead for high cardinality |
| Chunked processing | N/A | Files exceeding memory | More complex code |
| Sparse arrays | 50-90% | Data with many nulls/zeros | Limited operation support |
A practical optimization workflow:
```python
def optimize_dataframe(df, category_threshold=0.5):
    """Apply all memory optimizations to a DataFrame."""
    initial_memory = df.memory_usage(deep=True).sum() / 1024**2
    # Optimize numerics
    df = optimize_numeric_columns(df)
    # Optimize objects
    df = optimize_object_columns(df, category_threshold)
    final_memory = df.memory_usage(deep=True).sum() / 1024**2
    reduction = (1 - final_memory / initial_memory) * 100
    print(f"Memory reduced from {initial_memory:.2f} MB to {final_memory:.2f} MB")
    print(f"Reduction: {reduction:.1f}%")
    return df
```
Run this on every DataFrame you load. The few seconds of optimization time will pay dividends in faster operations and fewer memory errors. Make it a habit, and memory problems become rare exceptions rather than daily frustrations.