Pandas - Query Method for Filtering

The `query()` method accepts a string expression containing column names and comparison operators. Unlike traditional bracket notation, it eliminates the need for repetitive DataFrame references.

Key Insights

  • The query() method provides a more readable alternative to boolean indexing by using string expressions that resemble SQL WHERE clauses, reducing bracket clutter and improving code maintainability.
  • Query expressions support variable interpolation with @ syntax, enabling dynamic filtering based on external values without breaking the expression flow.
  • Performance characteristics vary between query() and boolean indexing depending on DataFrame size, with query() leveraging numexpr for optimization on larger datasets (typically >10,000 rows).

Basic Query Syntax

The query() method accepts a string expression containing column names and comparison operators. Unlike traditional bracket notation, it eliminates the need for repetitive DataFrame references.

import pandas as pd
import numpy as np

# Create sample dataset
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'price': [1200, 25, 75, 300, 80],
    'stock': [15, 150, 45, 30, 60],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})

# Traditional boolean indexing
result_traditional = df[(df['price'] > 50) & (df['stock'] < 100)]

# Equivalent query method
result_query = df.query('price > 50 and stock < 100')

print(result_query)
    product  price  stock     category
0    Laptop   1200     15  Electronics
2  Keyboard     75     45  Accessories
3   Monitor    300     30  Electronics
4    Webcam     80     60  Accessories

The query syntax uses and, or, and not operators instead of &, |, and ~, making expressions more readable without requiring parentheses around each condition.

Column Names with Spaces

When working with column names containing spaces or special characters, wrap them in backticks within the query string.

df_sales = pd.DataFrame({
    'Product Name': ['Widget A', 'Widget B', 'Widget C'],
    'Sales Amount': [5000, 3000, 7500],
    'Return Rate': [0.02, 0.05, 0.01]
})

# Query with backticks for column names with spaces
high_performers = df_sales.query('`Sales Amount` > 4000 and `Return Rate` < 0.03')

print(high_performers)
  Product Name  Sales Amount  Return Rate
0     Widget A          5000         0.02
2     Widget C          7500         0.01

Variable Interpolation

The @ symbol allows referencing variables from the local namespace, enabling dynamic filtering without string concatenation.

# Define filter criteria as variables
min_price = 100
max_stock = 50
target_category = 'Electronics'

# Use @ to reference external variables
filtered = df.query('price >= @min_price and stock <= @max_stock and category == @target_category')

print(filtered)
   product  price  stock     category
3  Monitor    300     30  Electronics

This approach maintains clean separation between filter logic and values, particularly useful when criteria come from user input or configuration files.

# Dynamic filtering with multiple variables
def filter_products(df, price_range, categories):
    min_p, max_p = price_range
    return df.query('price >= @min_p and price <= @max_p and category in @categories')

result = filter_products(df, (50, 500), ['Electronics', 'Accessories'])
print(result)

String Operations and Methods

Query expressions support string methods through the str accessor, enabling text-based filtering.

df_customers = pd.DataFrame({
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Alice Williams'],
    'email': ['alice@example.com', 'bob@test.com', 'charlie@example.com', 'awill@other.com'],
    'age': [28, 35, 42, 31]
})

# Filter using string methods
gmail_users = df_customers.query('email.str.contains("example")')
print(gmail_users)
            name                email  age
0  Alice Johnson   alice@example.com   28
2  Charlie Brown  charlie@example.com   42

Combine string operations with other conditions:

# Multiple string conditions
result = df_customers.query('name.str.startswith("Alice") and age < 30')
print(result)

Index-Based Filtering

The query() method can filter based on index values using the index keyword.

df_indexed = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
}, index=['A', 'B', 'C', 'D', 'E'])

# Filter by index
result = df_indexed.query('index in ["A", "C", "E"]')
print(result)
   value
A     10
C     30
E     50

For numeric indices:

df_numeric_idx = pd.DataFrame({
    'data': np.random.randn(100)
}, index=range(100))

# Filter based on index range
subset = df_numeric_idx.query('index >= 20 and index < 30')
print(f"Retrieved {len(subset)} rows")

Complex Expressions with Functions

Query expressions support Python built-in functions and mathematical operations.

df_metrics = pd.DataFrame({
    'revenue': [10000, 15000, 8000, 20000],
    'cost': [7000, 9000, 6000, 12000],
    'units': [100, 150, 80, 200]
})

# Calculate profit margin in query
profitable = df_metrics.query('(revenue - cost) / revenue > 0.3')
print(profitable)
   revenue   cost  units
1    15000   9000    150
3    20000  12000    200

Use mathematical functions:

# Complex calculations
df_metrics['efficiency'] = df_metrics['revenue'] / df_metrics['units']
optimized = df_metrics.query('efficiency > 100 and cost < 10000')
print(optimized)

Performance Considerations

The query() method uses the numexpr library for optimization when available, providing performance benefits on larger DataFrames.

import time

# Create large DataFrame
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})

# Benchmark query method
start = time.time()
result_query = large_df.query('A > 0 and B < 0.5 and C == "X"')
query_time = time.time() - start

# Benchmark boolean indexing
start = time.time()
result_bool = large_df[(large_df['A'] > 0) & (large_df['B'] < 0.5) & (large_df['C'] == 'X')]
bool_time = time.time() - start

print(f"Query method: {query_time:.4f}s")
print(f"Boolean indexing: {bool_time:.4f}s")

Set the engine parameter explicitly when needed:

# Force use of python engine
result = df.query('price > 100', engine='python')

Handling Missing Values

Query expressions handle NaN values differently than boolean indexing. Use isna() or notna() methods.

df_incomplete = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50]
})

# Filter non-null values
complete_rows = df_incomplete.query('A == A and B == B')  # NaN != NaN
print(complete_rows)
     A     B
0  1.0  10.0
3  4.0  40.0
4  5.0  50.0

More explicit approach:

# Using notna()
result = df_incomplete.query('A.notna() and B > 20')
print(result)

Inplace Operations and Method Chaining

The query() method supports the inplace parameter and chains naturally with other DataFrame operations.

# Method chaining
result = (df.query('price > 50')
           .query('category == "Electronics"')
           .sort_values('price', ascending=False))

print(result)
   product  price  stock     category
0   Laptop   1200     15  Electronics
3  Monitor    300     30  Electronics

Combining with other operations:

# Complex pipeline
summary = (df.query('stock > 20')
            .groupby('category')['price']
            .agg(['mean', 'count'])
            .reset_index())

print(summary)

The query() method provides a clean, readable alternative to boolean indexing while maintaining flexibility for complex filtering scenarios. Its SQL-like syntax reduces cognitive load when building multi-condition filters, and variable interpolation enables dynamic querying without sacrificing readability. For production code handling large datasets, leverage its numexpr integration for optimal performance.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.