How to Use Query in Pandas
Pandas gives you two main ways to filter DataFrames: boolean indexing and the `query()` method. Most tutorials focus on boolean indexing because it's the traditional approach, but `query()` often...
Key Insights
- The
query()method offers a more readable, SQL-like syntax for filtering DataFrames, especially when combining multiple conditions withand,or, andnotoperators. - Use the
@prefix to reference external Python variables within query strings, keeping your filtering logic clean and maintainable. - While
query()excels at readability for complex filters, it has limitations—no method calls allowed, and performance benefits only appear with large datasets (100k+ rows).
Introduction to the Query Method
Pandas gives you two main ways to filter DataFrames: boolean indexing and the query() method. Most tutorials focus on boolean indexing because it’s the traditional approach, but query() often produces cleaner, more maintainable code.
Boolean indexing looks like this:
df[(df['age'] > 30) & (df['salary'] < 100000) & (df['department'] == 'Engineering')]
The same filter with query():
df.query('age > 30 and salary < 100000 and department == "Engineering"')
The query() version reads like English. You don’t need parentheses around each condition, you use and instead of &, and you reference columns by name without the df['column'] syntax.
Use query() when you have multiple conditions, when readability matters more than micro-optimization, or when you’re building dynamic filters from user input. Stick with boolean indexing for simple single-condition filters or when you need to call methods on columns during filtering.
Basic Query Syntax
The query() method accepts a string expression that evaluates to a boolean result for each row. Rows where the expression is True are returned.
import pandas as pd
# Sample data
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [28, 35, 42, 31, 25],
'status': ['active', 'inactive', 'active', 'active', 'inactive'],
'salary': [75000, 82000, 95000, 68000, 71000]
})
# Filter by numeric comparison
young_employees = df.query('age < 30')
print(young_employees)
Output:
name age status salary
0 Alice 28 active 75000
4 Eve 25 inactive 71000
For string comparisons, use quotes inside the query string. You can use either single quotes for the query and double quotes for the value, or vice versa:
# Filter by string equality
active_employees = df.query('status == "active"')
# Or equivalently:
active_employees = df.query("status == 'active'")
All standard comparison operators work: ==, !=, <, >, <=, >=.
# Not equal
non_active = df.query('status != "active"')
# Greater than or equal
senior_salary = df.query('salary >= 80000')
Combining Multiple Conditions
Here’s where query() really shines. Instead of the awkward &, |, ~ operators with mandatory parentheses, you use plain English: and, or, not.
# Multiple conditions with boolean indexing (verbose)
result = df[(df['age'] > 30) & (df['salary'] < 90000) & (df['status'] == 'active')]
# Same filter with query (clean)
result = df.query('age > 30 and salary < 90000 and status == "active"')
print(result)
Output:
name age status salary
3 Diana 31 active 68000
The or operator works as expected:
# Employees who are either young OR highly paid
result = df.query('age < 30 or salary > 90000')
print(result)
Output:
name age status salary
0 Alice 28 active 75000
2 Charlie 42 active 95000
4 Eve 25 inactive 71000
Use not to negate conditions:
# Everyone except active employees under 30
result = df.query('not (status == "active" and age < 30)')
print(result)
You can build complex logic by combining operators:
# Complex filter: active employees over 30, OR inactive employees with high salary
result = df.query('(status == "active" and age > 30) or (status == "inactive" and salary > 75000)')
print(result)
Output:
name age status salary
1 Bob 35 inactive 82000
2 Charlie 42 active 95000
3 Diana 31 active 68000
Using Variables in Queries
Hard-coding values in query strings limits reusability. The @ prefix lets you reference Python variables from outside the query:
min_age = 30
max_salary = 85000
target_status = 'active'
result = df.query('age >= @min_age and salary <= @max_salary and status == @target_status')
print(result)
Output:
name age status salary
3 Diana 31 active 68000
This pattern is essential for building reusable filtering functions:
def filter_employees(df, min_age=None, max_salary=None, status=None):
conditions = []
if min_age is not None:
conditions.append('age >= @min_age')
if max_salary is not None:
conditions.append('salary <= @max_salary')
if status is not None:
conditions.append('status == @status')
if not conditions:
return df
query_string = ' and '.join(conditions)
return df.query(query_string)
# Usage
result = filter_employees(df, min_age=25, status='active')
You can also reference list variables:
valid_names = ['Alice', 'Bob', 'Charlie']
result = df.query('name in @valid_names')
print(result)
Output:
name age status salary
0 Alice 28 active 75000
1 Bob 35 inactive 82000
2 Charlie 42 active 95000
Working with Column Names Containing Spaces or Special Characters
Real-world data often has messy column names. The query() method handles these with backticks:
# DataFrame with problematic column names
df_messy = pd.DataFrame({
'employee name': ['Alice', 'Bob', 'Charlie'],
'total sales': [15000, 22000, 18000],
'commission %': [5.0, 7.5, 6.0],
'start-date': ['2020-01-15', '2019-06-01', '2021-03-10']
})
# Use backticks for column names with spaces
high_performers = df_messy.query('`total sales` > 16000')
print(high_performers)
Output:
employee name total sales commission % start-date
1 Bob 22000 7.5 2019-06-01
2 Charlie 18000 6.0 2021-03-10
Backticks work for any problematic character:
# Columns with special characters
result = df_messy.query('`commission %` >= 6.0 and `total sales` > 15000')
print(result)
While backticks solve the immediate problem, consider renaming columns for long-term maintainability:
df_clean = df_messy.rename(columns={
'employee name': 'employee_name',
'total sales': 'total_sales',
'commission %': 'commission_pct',
'start-date': 'start_date'
})
Advanced Query Techniques
List Membership with in
The in operator checks if values exist in a list:
# Filter for specific categories
df_products = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo', 'Thingamajig', 'Doohickey'],
'category': ['A', 'B', 'A', 'C', 'B'],
'price': [10, 25, 15, 30, 20]
})
result = df_products.query('category in ["A", "B"]')
print(result)
Output:
product category price
0 Widget A 10
1 Gadget B 25
2 Gizmo A 15
4 Doohickey B 20
Use not in for exclusion:
result = df_products.query('category not in ["C"]')
Querying the Index
Reference the index directly using index:
df_indexed = df.set_index('name')
# Query by index value
result = df_indexed.query('index == "Alice" or index == "Bob"')
print(result)
For numeric indexes:
# Filter rows where index is greater than 2
result = df.query('index > 2')
print(result)
Output:
name age status salary
3 Diana 31 active 68000
4 Eve 25 inactive 71000
Arithmetic in Queries
You can perform calculations within the query:
# Find employees where salary per year of age is above a threshold
result = df.query('salary / age > 2500')
print(result)
# Compare columns to each other
df_compare = pd.DataFrame({
'budget': [1000, 2000, 1500],
'spent': [800, 2100, 1400],
'project': ['Alpha', 'Beta', 'Gamma']
})
over_budget = df_compare.query('spent > budget')
print(over_budget)
Output:
budget spent project
1 2000 2100 Beta
Performance Considerations and Best Practices
When Query Performs Better
For small DataFrames (under 100k rows), query() and boolean indexing perform similarly. The parsing overhead of query() might even make it slightly slower.
For large DataFrames, query() can be faster because it uses numexpr under the hood when available. Install it with pip install numexpr to enable this optimization.
import numpy as np
# Large DataFrame for benchmarking
large_df = pd.DataFrame({
'a': np.random.randint(0, 100, 1_000_000),
'b': np.random.randint(0, 100, 1_000_000),
'c': np.random.choice(['x', 'y', 'z'], 1_000_000)
})
# With numexpr installed, query() can be 2-3x faster on large data
result = large_df.query('a > 50 and b < 30')
The inplace Parameter
Like many Pandas methods, query() supports inplace=True to modify the DataFrame directly:
df_copy = df.copy()
df_copy.query('age > 30', inplace=True)
# df_copy now only contains rows where age > 30
However, I recommend avoiding inplace=True. It makes code harder to reason about and doesn’t actually save memory in most cases.
Limitations to Know
The query() method cannot call methods on columns. This fails:
# This won't work
df.query('name.str.startswith("A")') # Error!
For string methods, use boolean indexing:
# Use boolean indexing instead
result = df[df['name'].str.startswith('A')]
Similarly, you can’t use custom functions:
# This won't work
df.query('my_function(age) > 10') # Error!
Best Practices Summary
- Use
query()for multi-condition filters where readability matters - Use boolean indexing for single conditions or when you need method calls
- Always use
@for variables instead of f-strings or concatenation - Use backticks for messy column names, but consider renaming columns
- Install
numexprif you work with large datasets frequently - Avoid
inplace=Truefor clearer, more maintainable code
The query() method won’t replace boolean indexing entirely, but it’s a powerful tool for writing cleaner filtering code. Use it when your conditions get complex, and your future self will thank you.