Pandas - eval() for Performance

Key Insights

  • The eval() and query() methods use the numexpr library under the hood to evaluate expressions, often several times faster than standard pandas operations on DataFrames with 10,000+ rows
  • Expression evaluation works by compiling the whole expression and running it in numexpr's multithreaded C virtual machine, eliminating Python interpreter overhead and the memory allocations for intermediate results
  • Greatest performance gains occur with arithmetic operations and boolean indexing on large DataFrames, while small datasets (<10k rows) see negligible or negative performance impact

Why eval() Exists

Standard pandas operations create intermediate objects for each step in a calculation. When you write df['A'] + df['B'] + df['C'], pandas allocates memory for df['A'] + df['B'], then adds df['C'] to that result. With large DataFrames, this memory allocation and Python overhead becomes expensive.

The eval() method bypasses this by parsing the entire expression and evaluating it in one pass with the numexpr library's C virtual machine. This eliminates intermediate allocations and reduces the overhead of Python's interpreter loop.

import pandas as pd
import numpy as np

# Create a large DataFrame for testing
n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n),
    'C': np.random.randn(n),
    'D': np.random.randn(n)
})

# Standard approach - creates intermediate arrays
result_standard = df['A'] + df['B'] - df['C'] * df['D']

# Using eval - single optimized operation
result_eval = df.eval('A + B - C * D')

# Verify they're equivalent
print(np.allclose(result_standard, result_eval))  # True

Performance Benchmarks

Let’s measure the actual performance difference across different DataFrame sizes:

import time

def benchmark_operations(n_rows):
    df = pd.DataFrame({
        'A': np.random.randn(n_rows),
        'B': np.random.randn(n_rows),
        'C': np.random.randn(n_rows),
        'D': np.random.randn(n_rows)
    })
    
    # Standard method
    start = time.perf_counter()
    for _ in range(100):
        _ = df['A'] + df['B'] - df['C'] * df['D']
    standard_time = time.perf_counter() - start
    
    # eval method
    start = time.perf_counter()
    for _ in range(100):
        _ = df.eval('A + B - C * D')
    eval_time = time.perf_counter() - start
    
    speedup = standard_time / eval_time
    print(f"Rows: {n_rows:>10,} | Standard: {standard_time:.3f}s | "
          f"eval: {eval_time:.3f}s | Speedup: {speedup:.2f}x")

for size in [1_000, 10_000, 100_000, 1_000_000]:
    benchmark_operations(size)

Output on typical hardware:

Rows:      1,000 | Standard: 0.042s | eval: 0.045s | Speedup: 0.93x
Rows:     10,000 | Standard: 0.048s | eval: 0.046s | Speedup: 1.04x
Rows:    100,000 | Standard: 0.089s | eval: 0.051s | Speedup: 1.74x
Rows:  1,000,000 | Standard: 0.512s | eval: 0.098s | Speedup: 5.22x

The crossover point is around 10,000 rows. Below that, the expression parsing overhead outweighs the optimization benefits.
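You can probe that crossover yourself: eval() accepts an engine argument, and engine='python' parses the expression but evaluates it with ordinary pandas operations, which makes a convenient baseline (the default engine is numexpr when it is installed):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n),
    'C': np.random.randn(n),
    'D': np.random.randn(n),
})

# Default engine: numexpr if installed, otherwise falls back to 'python'
r_default = df.eval('A + B - C * D')

# Forced pure-Python evaluation - same result, no numexpr optimization
r_python = df.eval('A + B - C * D', engine='python')

print(np.allclose(r_default, r_python))  # True
```

Timing the two engines on your own data is the most reliable way to find where eval() starts paying off on your hardware.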

Column Assignment with eval()

You can assign results directly to new or existing columns inside the expression; external Python variables are referenced with the @ prefix:

df = pd.DataFrame({
    'price': [100, 200, 150, 300],
    'quantity': [10, 5, 8, 3],
    'tax_rate': [0.08, 0.08, 0.10, 0.10]
})

# Create new column with calculation
df.eval('total = price * quantity * (1 + tax_rate)', inplace=True)

print(df)
#    price  quantity  tax_rate   total
# 0    100        10      0.08  1080.0
# 1    200         5      0.08  1080.0
# 2    150         8      0.10  1320.0
# 3    300         3      0.10   990.0

# Use external variables with @
discount = 0.15
df.eval('discounted_total = total * (1 - @discount)', inplace=True)

The inplace=True parameter modifies the DataFrame directly rather than returning a copy, saving additional memory.
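eval() also accepts several assignments in one call, one per line, and later lines can reference columns created by earlier ones. A small sketch reusing a price/quantity frame:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [100, 200, 150, 300],
    'quantity': [10, 5, 8, 3],
})

# Multi-line expression: each assignment becomes a column, and
# 'revenue' is visible to the line that follows it
df = df.eval("""
revenue = price * quantity
double_revenue = revenue * 2
""")

print(df['revenue'].tolist())         # [1000, 1000, 1200, 900]
print(df['double_revenue'].tolist())  # [2000, 2000, 2400, 1800]
```

This keeps a chain of derived columns in a single parsed expression instead of several separate eval() calls.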

Boolean Indexing with query()

The query() method is eval() specialized for filtering rows. It’s particularly effective for complex boolean conditions:

# Large dataset for meaningful comparison
n = 500_000
df = pd.DataFrame({
    'age': np.random.randint(18, 80, n),
    'income': np.random.randint(20_000, 200_000, n),
    'credit_score': np.random.randint(300, 850, n),
    'debt': np.random.randint(0, 100_000, n)
})

# Standard boolean indexing
mask = (df['age'] > 30) & (df['income'] > 50000) & \
       (df['credit_score'] > 700) & (df['debt'] < 20000)
result_standard = df[mask]

# Using query - cleaner and faster
result_query = df.query('age > 30 and income > 50000 and '
                       'credit_score > 700 and debt < 20000')

# More complex expressions work naturally
threshold_income = 75000
threshold_ratio = 0.3

result = df.query('income > @threshold_income and '
                 'debt / income < @threshold_ratio and '
                 '(age < 25 or credit_score > 750)')

Benchmark results show query() is typically 1.5-3x faster than boolean indexing for complex conditions on large DataFrames.
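query() also supports a few conveniences that are awkward in plain boolean indexing: membership tests with in / not in, and backticks for column names containing spaces. A short sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['CA', 'NY', 'TX', 'CA'],
    'annual income': [90000, 60000, 45000, 120000],  # note the space
})

# Membership tests read naturally with in / not in
west = ['CA', 'WA', 'OR']
print(df.query('state in @west')['state'].tolist())  # ['CA', 'CA']

# Column names with spaces need backticks
high = df.query('`annual income` > 80000')
print(len(high))  # 2
```

The equivalent boolean-indexing versions (`df[df['state'].isin(west)]`) work too; query() simply keeps complex conditions on one readable line.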

Supported Operations and Limitations

The eval() method supports most arithmetic and comparison operators:

df = pd.DataFrame({
    'x': range(1, 6),
    'y': range(10, 60, 10)
})

# Arithmetic: +, -, *, /, //, %, **
# (without inplace=True, eval returns a new DataFrame, so assign it back)
df = df.eval('z = (x ** 2 + y) / 2')

# Comparisons: ==, !=, <, <=, >, >=
df = df.eval('is_large = x * y > 100')

# Boolean: &, |, ~
df = df.eval('flag = (x > 2) & (y < 40)')

# A handful of built-in functions, e.g. abs(), sqrt()
df = df.eval('w = abs(x - 3)')

Key limitations to be aware of:

# These DON'T work in eval():
# - String operations: df.eval('name.upper()')  # Error
# - Custom functions: df.eval('my_func(x)')     # Error
# - Method chaining: df.eval('x.fillna(0)')     # Error
# - Complex indexing: df.eval('x.iloc[0]')      # Error

# Workarounds - use standard pandas for these
# (assumes a 'name' column and a my_func callable exist):
df['name_upper'] = df['name'].str.upper()
df['custom'] = df['x'].apply(my_func)
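Expressions that span more than one DataFrame are also possible through the top-level pd.eval() function, which resolves frame names from the calling namespace and evaluates the whole expression in one pass. A sketch with two small frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 2)), columns=['a', 'b'])

# pd.eval looks up df1 and df2 in the surrounding scope
result = pd.eval('df1 + 2 * df2')

print(result['a'].tolist())  # [2.0, 4.0, 6.0]
```

For chains of frame-level arithmetic this avoids the intermediate DataFrames that `df1 + 2 * df2` would otherwise allocate.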

Memory Usage Comparison

Beyond speed, eval() significantly reduces memory consumption:

import tracemalloc

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n),
    'C': np.random.randn(n)
})

# Measure standard approach
tracemalloc.start()
result = df['A'] + df['B'] * df['C'] - df['A'] / df['B']
standard_mem = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# Measure eval approach
tracemalloc.start()
result = df.eval('A + B * C - A / B')
eval_mem = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f"Standard: {standard_mem / 1024**2:.2f} MB")
print(f"eval:     {eval_mem / 1024**2:.2f} MB")
print(f"Reduction: {(1 - eval_mem/standard_mem)*100:.1f}%")

# Typical output:
# Standard: 91.55 MB
# eval:     30.52 MB
# Reduction: 66.7%

Production Best Practices

Use eval() strategically where it provides clear benefits:

class DataProcessor:
    def __init__(self, use_eval=True, min_rows=10_000):
        self.use_eval = use_eval
        self.min_rows = min_rows
    
    def calculate_metrics(self, df):
        """Calculate derived metrics with conditional eval usage."""
        
        # Decide based on DataFrame size
        if self.use_eval and len(df) >= self.min_rows:
            df = df.copy()  # Avoid modifying original
            df.eval('revenue = price * quantity', inplace=True)
            df.eval('profit = revenue - cost', inplace=True)
            df.eval('margin = profit / revenue', inplace=True)
        else:
            df = df.copy()
            df['revenue'] = df['price'] * df['quantity']
            df['profit'] = df['revenue'] - df['cost']
            df['margin'] = df['profit'] / df['revenue']
        
        return df
    
    def filter_data(self, df, min_margin, max_cost):
        """Filter with query for complex conditions."""
        
        if self.use_eval and len(df) >= self.min_rows:
            return df.query('margin > @min_margin and cost < @max_cost')
        else:
            return df[(df['margin'] > min_margin) & (df['cost'] < max_cost)]

For data pipelines processing large datasets, eval() and query() provide measurable performance improvements with minimal code changes. The key is understanding when the optimization overhead pays off: generally at 10,000+ rows for arithmetic operations and complex boolean conditions. Below that threshold, stick with standard pandas syntax for clarity and simplicity.
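Since numexpr is an optional dependency, it is worth confirming in production that it is actually installed and enabled; otherwise eval() and query() quietly fall back to the slower Python engine. A quick check via the compute.use_numexpr option:

```python
import importlib.util
import pandas as pd

# Is the numexpr package importable at all?
has_numexpr = importlib.util.find_spec('numexpr') is not None
print(f"numexpr installed:  {has_numexpr}")

# Does pandas have numexpr acceleration switched on?
print(f"use_numexpr option: {pd.get_option('compute.use_numexpr')}")
```

Adding a check like this to a pipeline's startup logging makes silent fallbacks to the slow path easy to spot.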
