Pandas - Query Method for Filtering
The `query()` method accepts a string expression containing column names and comparison operators. Unlike traditional bracket notation, it eliminates the need for repetitive DataFrame references.
Key Insights
- The
query()method provides a more readable alternative to boolean indexing by using string expressions that resemble SQL WHERE clauses, reducing bracket clutter and improving code maintainability. - Query expressions support variable interpolation with
@syntax, enabling dynamic filtering based on external values without breaking the expression flow. - Performance characteristics vary between
query()and boolean indexing depending on DataFrame size, withquery()leveraging numexpr for optimization on larger datasets (typically >10,000 rows).
Basic Query Syntax
The query() method accepts a string expression containing column names and comparison operators. Unlike traditional bracket notation, it eliminates the need for repetitive DataFrame references.
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'price': [1200, 25, 75, 300, 80],
'stock': [15, 150, 45, 30, 60],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})
# Traditional boolean indexing
result_traditional = df[(df['price'] > 50) & (df['stock'] < 100)]
# Equivalent query method
result_query = df.query('price > 50 and stock < 100')
print(result_query)
product price stock category
0 Laptop 1200 15 Electronics
2 Keyboard 75 45 Accessories
3 Monitor 300 30 Electronics
4 Webcam 80 60 Accessories
The query syntax uses and, or, and not operators instead of &, |, and ~, making expressions more readable without requiring parentheses around each condition.
Column Names with Spaces
When working with column names containing spaces or special characters, wrap them in backticks within the query string.
df_sales = pd.DataFrame({
'Product Name': ['Widget A', 'Widget B', 'Widget C'],
'Sales Amount': [5000, 3000, 7500],
'Return Rate': [0.02, 0.05, 0.01]
})
# Query with backticks for column names with spaces
high_performers = df_sales.query('`Sales Amount` > 4000 and `Return Rate` < 0.03')
print(high_performers)
Product Name Sales Amount Return Rate
0 Widget A 5000 0.02
2 Widget C 7500 0.01
Variable Interpolation
The @ symbol allows referencing variables from the local namespace, enabling dynamic filtering without string concatenation.
# Define filter criteria as variables
min_price = 100
max_stock = 50
target_category = 'Electronics'
# Use @ to reference external variables
filtered = df.query('price >= @min_price and stock <= @max_stock and category == @target_category')
print(filtered)
product price stock category
3 Monitor 300 30 Electronics
This approach maintains clean separation between filter logic and values, particularly useful when criteria come from user input or configuration files.
# Dynamic filtering with multiple variables
def filter_products(df, price_range, categories):
min_p, max_p = price_range
return df.query('price >= @min_p and price <= @max_p and category in @categories')
result = filter_products(df, (50, 500), ['Electronics', 'Accessories'])
print(result)
String Operations and Methods
Query expressions support string methods through the str accessor, enabling text-based filtering.
df_customers = pd.DataFrame({
'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Alice Williams'],
'email': ['alice@example.com', 'bob@test.com', 'charlie@example.com', 'awill@other.com'],
'age': [28, 35, 42, 31]
})
# Filter using string methods
gmail_users = df_customers.query('email.str.contains("example")')
print(gmail_users)
name email age
0 Alice Johnson alice@example.com 28
2 Charlie Brown charlie@example.com 42
Combine string operations with other conditions:
# Multiple string conditions
result = df_customers.query('name.str.startswith("Alice") and age < 30')
print(result)
Index-Based Filtering
The query() method can filter based on index values using the index keyword.
df_indexed = pd.DataFrame({
'value': [10, 20, 30, 40, 50]
}, index=['A', 'B', 'C', 'D', 'E'])
# Filter by index
result = df_indexed.query('index in ["A", "C", "E"]')
print(result)
value
A 10
C 30
E 50
For numeric indices:
df_numeric_idx = pd.DataFrame({
'data': np.random.randn(100)
}, index=range(100))
# Filter based on index range
subset = df_numeric_idx.query('index >= 20 and index < 30')
print(f"Retrieved {len(subset)} rows")
Complex Expressions with Functions
Query expressions support Python built-in functions and mathematical operations.
df_metrics = pd.DataFrame({
'revenue': [10000, 15000, 8000, 20000],
'cost': [7000, 9000, 6000, 12000],
'units': [100, 150, 80, 200]
})
# Calculate profit margin in query
profitable = df_metrics.query('(revenue - cost) / revenue > 0.3')
print(profitable)
revenue cost units
1 15000 9000 150
3 20000 12000 200
Use mathematical functions:
# Complex calculations
df_metrics['efficiency'] = df_metrics['revenue'] / df_metrics['units']
optimized = df_metrics.query('efficiency > 100 and cost < 10000')
print(optimized)
Performance Considerations
The query() method uses the numexpr library for optimization when available, providing performance benefits on larger DataFrames.
import time
# Create large DataFrame
large_df = pd.DataFrame({
'A': np.random.randn(100000),
'B': np.random.randn(100000),
'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})
# Benchmark query method
start = time.time()
result_query = large_df.query('A > 0 and B < 0.5 and C == "X"')
query_time = time.time() - start
# Benchmark boolean indexing
start = time.time()
result_bool = large_df[(large_df['A'] > 0) & (large_df['B'] < 0.5) & (large_df['C'] == 'X')]
bool_time = time.time() - start
print(f"Query method: {query_time:.4f}s")
print(f"Boolean indexing: {bool_time:.4f}s")
Set the engine parameter explicitly when needed:
# Force use of python engine
result = df.query('price > 100', engine='python')
Handling Missing Values
Query expressions handle NaN values differently than boolean indexing. Use isna() or notna() methods.
df_incomplete = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [10, np.nan, 30, 40, 50]
})
# Filter non-null values
complete_rows = df_incomplete.query('A == A and B == B') # NaN != NaN
print(complete_rows)
A B
0 1.0 10.0
3 4.0 40.0
4 5.0 50.0
More explicit approach:
# Using notna()
result = df_incomplete.query('A.notna() and B > 20')
print(result)
Inplace Operations and Method Chaining
The query() method supports the inplace parameter and chains naturally with other DataFrame operations.
# Method chaining
result = (df.query('price > 50')
.query('category == "Electronics"')
.sort_values('price', ascending=False))
print(result)
product price stock category
0 Laptop 1200 15 Electronics
3 Monitor 300 30 Electronics
Combining with other operations:
# Complex pipeline
summary = (df.query('stock > 20')
.groupby('category')['price']
.agg(['mean', 'count'])
.reset_index())
print(summary)
The query() method provides a clean, readable alternative to boolean indexing while maintaining flexibility for complex filtering scenarios. Its SQL-like syntax reduces cognitive load when building multi-condition filters, and variable interpolation enables dynamic querying without sacrificing readability. For production code handling large datasets, leverage its numexpr integration for optimal performance.