How to Use Eval in Pandas

Key Insights

  • Pandas eval() uses the NumExpr engine to execute string expressions with significant memory and speed benefits on datasets larger than 10,000 rows
  • The @ prefix lets you reference local Python variables inside eval expressions, enabling dynamic filtering and computation without string concatenation
  • Reserve eval() for complex expressions on large datasets—for simple operations or small DataFrames, standard Pandas syntax is faster and more debuggable

Introduction to pd.eval() and DataFrame.eval()

Pandas provides two eval functions that let you evaluate string expressions against your data: the top-level pd.eval() and the DataFrame method df.eval(). Both parse and execute expressions written as strings, but they serve slightly different purposes.

pd.eval() operates on any array-like objects you pass to it and returns the result. DataFrame.eval() executes expressions in the context of a specific DataFrame, giving you direct access to column names without qualification.

Why bother with string expressions when Pandas already has expressive syntax? Three reasons: memory efficiency, speed on large datasets, and sometimes cleaner expression of complex logic. The eval functions use NumExpr under the hood, which avoids creating intermediate arrays and can parallelize operations across CPU cores.

That said, eval isn’t always the right choice. For small datasets or simple operations, standard Pandas is faster due to the overhead of parsing string expressions. Understanding when to reach for eval—and when to avoid it—separates efficient Pandas code from cargo-culted patterns.

Basic Syntax and Simple Expressions

The fundamental syntax is straightforward. For DataFrame.eval(), you pass a string expression that references column names directly:

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.randn(100000)
})

# Standard Pandas approach
df['D'] = df['A'] + df['B'] * df['C']

# Using eval - returns a Series
result = df.eval('A + B * C')

# Using eval to create a new column
df = df.eval('D = A + B * C')
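A quick sanity check (on a toy frame, not the 100,000-row example above) confirms that the eval spelling and the standard spelling agree, and that normal operator precedence applies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [5.0, 6.0]})

# B * C is evaluated before the addition, just as in plain Python
series_eval = df.eval('A + B * C')
series_std = df['A'] + df['B'] * df['C']
```

Both produce the same Series, so switching between the two styles never changes results, only how they are computed.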

The pd.eval() function requires you to specify the DataFrame context or pass arrays explicitly:

# Top-level eval with explicit DataFrame reference
result = pd.eval('df.A + df.B * df.C')

# Or with local_dict parameter
result = pd.eval('A + B * C', local_dict={'A': df['A'], 'B': df['B'], 'C': df['C']})

For most use cases, df.eval() is cleaner and more readable. Reserve pd.eval() for operations spanning multiple DataFrames or when you need more control over the evaluation context.
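As a sketch of the multi-DataFrame case (toy data assumed), pd.eval resolves frame names from the surrounding scope, so one expression can span several DataFrames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': np.arange(5.0)})
df2 = pd.DataFrame({'A': np.arange(5.0) * 2})

# pd.eval looks up df1 and df2 in the calling scope,
# combining columns from both frames in a single expression
combined = pd.eval('df1.A + df2.A')
```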

Performance Benefits and When to Use Eval

The performance story for eval centers on NumExpr, a library that compiles expressions into an efficient internal program evaluated over cache-sized chunks, minimizing memory allocation. Standard Pandas operations create intermediate arrays at each step; eval computes the entire expression in a single pass.

Here’s a concrete benchmark:

import pandas as pd
import numpy as np

# Large dataset where eval shines
df_large = pd.DataFrame({
    'A': np.random.randn(1_000_000),
    'B': np.random.randn(1_000_000),
    'C': np.random.randn(1_000_000),
    'D': np.random.randn(1_000_000)
})

# Complex expression to benchmark
# Standard Pandas
%timeit df_large['A'] + df_large['B'] * df_large['C'] - df_large['D'] / df_large['A']
# Typical result: ~15-20ms

# Using eval
%timeit df_large.eval('A + B * C - D / A')
# Typical result: ~8-12ms

# The gap widens with more complex expressions
%timeit (df_large['A'] > 0) & (df_large['B'] < df_large['C']) | (df_large['D'] > df_large['A'])
# Typical result: ~25-35ms

%timeit df_large.eval('(A > 0) & (B < C) | (D > A)')
# Typical result: ~10-15ms

The general rule: eval becomes advantageous around 10,000 rows, with benefits increasing as data size grows. Below that threshold, the parsing overhead makes standard Pandas faster.

Memory efficiency matters too. With a complex expression on a 1GB DataFrame, standard Pandas might allocate several GB for intermediates. Eval keeps memory usage close to the input size.
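Outside IPython, the same comparison can be run with the stdlib timeit module. This sketch uses a smaller frame so it finishes quickly; absolute timings vary by machine, so none are shown:

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the benchmark frame above, purely to keep the script fast
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list('ABCD'))

# Time the standard vectorized expression vs. the eval equivalent
t_std = timeit.timeit(
    lambda: df['A'] + df['B'] * df['C'] - df['D'] / df['A'], number=20)
t_eval = timeit.timeit(lambda: df.eval('A + B * C - D / A'), number=20)

print(f'standard: {t_std:.4f}s  eval: {t_eval:.4f}s')
```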

Local and Global Variable References

Real-world code rarely operates on columns alone. You need to compare against thresholds, use computed values, or incorporate configuration parameters. The @ prefix lets you reference Python variables inside eval expressions:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': np.random.uniform(10, 100, 100000),
    'quantity': np.random.randint(1, 50, 100000),
    'discount_rate': np.random.uniform(0, 0.3, 100000)
})

# Reference local variables with @
min_price = 25.0
max_quantity = 30
tax_rate = 0.08

# Complex calculation using external variables
df = df.eval('''
    total = price * quantity * (1 - discount_rate)
    taxed_total = total * (1 + @tax_rate)
    is_valid = (price >= @min_price) & (quantity <= @max_quantity)
''')

# Dynamic filtering
threshold = df['price'].mean()
high_value = df.eval('price > @threshold')

This approach is cleaner than string formatting and safer than f-strings with user input. The @ prefix clearly signals “this comes from outside the DataFrame.”
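The related df.query() method runs on the same expression machinery and accepts the same @ references, but returns the filtered rows directly rather than a boolean Series:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 40.0, 90.0], 'quantity': [5, 2, 8]})
cutoff = 25.0

# query() evaluates the expression and keeps only the matching rows
expensive = df.query('price > @cutoff')
```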

You can also pull values out of other DataFrames. Indexing into an @-referenced frame inside the expression string is fragile across engines, so extract the scalar first and reference that:

reference_df = pd.DataFrame({'benchmark': [50.0]})

# Compare against a value from another DataFrame
benchmark = reference_df['benchmark'].iloc[0]
df = df.eval('above_benchmark = price > @benchmark')

Supported Operations and Limitations

Eval supports a specific subset of operations. Knowing the boundaries prevents frustrating debugging sessions.

Supported operations:

  • Arithmetic: +, -, *, /, **, //, %
  • Comparisons: <, >, <=, >=, ==, !=
  • Boolean: &, |, ~, and, or, not
  • Indexing: [] for column access
  • Attribute access: . for DataFrame attributes

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'revenue': np.random.uniform(1000, 10000, 100000),
    'costs': np.random.uniform(500, 8000, 100000),
    'units': np.random.randint(10, 1000, 100000),
    'is_premium': np.random.choice([True, False], 100000)
})

# Complex boolean expression
df = df.eval('''
    profitable = (revenue > costs) & (units > 100)
    high_margin = ((revenue - costs) / revenue > 0.3) | is_premium
    priority = profitable & high_margin & (units > 500)
''')

# Chained comparisons work
df = df.eval('mid_range = 2000 < revenue < 8000')

Not supported:

  • Custom Python functions (calls like np.log(col) fail, although a small whitelist of math functions such as log and sqrt is available)
  • String methods
  • Aggregations (sum, mean, etc.)
  • Complex indexing beyond simple column access

# These will fail
# df.eval('upper_name = name.str.upper()')  # No string methods
# df.eval('total = revenue.sum()')  # No aggregations
# df.eval('log_revenue = np.log(revenue)')  # No NumPy functions

# Workaround: compute outside eval, reference with @
log_revenue = np.log(df['revenue'])
df = df.eval('scaled = @log_revenue * 100')
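One nuance worth noting: pandas eval whitelists a small set of math functions (log, sqrt, sin, and friends), so this particular workaround can also stay inside the expression; it is the np.log spelling specifically that fails:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'revenue': [1.0, np.e, np.e ** 2]})

# log is on eval's whitelist of math functions, unlike np.log
df = df.eval('log_revenue = log(revenue)')
```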

In-Place Column Assignment with inplace Parameter

By default, df.eval() returns a new DataFrame with any assigned columns. The inplace=True parameter modifies the DataFrame directly:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'base_price': np.random.uniform(100, 500, 10000),
    'markup_pct': np.random.uniform(0.1, 0.5, 10000)
})

# Without inplace - returns new DataFrame
df_new = df.eval('final_price = base_price * (1 + markup_pct)')
# df is unchanged, df_new has the new column

# With inplace - modifies df directly
df.eval('final_price = base_price * (1 + markup_pct)', inplace=True)
# df now has final_price column, returns None

# Multiple assignments in one call
df.eval('''
    margin = final_price - base_price
    margin_pct = margin / final_price
    is_high_margin = margin_pct > 0.25
''', inplace=True)

The inplace parameter follows the same pattern as other Pandas methods—it returns None and modifies the object. Be aware that Pandas has been moving away from inplace operations in general, so the non-inplace pattern (reassigning the result) is often preferred for clarity.

Best Practices and Common Pitfalls

Use eval for the right reasons. Reach for eval when you have large datasets (tens of thousands of rows or more) and complex expressions. Don’t use it just because you saw it in a tutorial.

Keep expressions readable. Multi-line strings with clear formatting beat dense one-liners:

# Hard to read
df.eval('result=(A>0)&(B<C)|(D>A)&~(E==F)', inplace=True)

# Much better
df.eval('''
    result = (A > 0) & (B < C) | 
             (D > A) & ~(E == F)
''', inplace=True)

Debug incrementally. String expressions don’t give helpful error messages. Build complex expressions piece by piece:

# Debug by testing parts
print(df.eval('A > 0').head())
print(df.eval('B < C').head())
print(df.eval('(A > 0) & (B < C)').head())

Never use eval with untrusted input. While Pandas eval is more restricted than Python’s built-in eval(), it’s still a code execution vector. Never construct eval strings from user input:

# DANGEROUS - don't do this
user_column = request.form['column']  # Could be malicious
df.eval(f'{user_column} > 0')  # Security vulnerability

# Instead, validate against known columns
allowed_columns = {'price', 'quantity', 'total'}
if user_column in allowed_columns:
    result = df[user_column] > 0  # Use standard Pandas

Consider the engine parameter. Pandas eval supports two engines: numexpr (default when available) and python. The Python engine supports more operations but loses the performance benefits:

# Force Python engine for broader compatibility
df.eval('result = A + B', engine='python')

# Check which engine is being used
pd.eval('1 + 1', engine='numexpr')  # Raises if numexpr not installed

Eval is a power tool in the Pandas toolkit. Used appropriately on large datasets with complex expressions, it delivers real performance gains. Used indiscriminately, it adds complexity without benefit. Profile your actual workload, and let the numbers guide your decision.
