How to Use Query in Pandas

Pandas gives you two main ways to filter DataFrames: boolean indexing and the `query()` method. Most tutorials focus on boolean indexing because it's the traditional approach, but `query()` often...

Key Insights

  • The query() method offers a more readable, SQL-like syntax for filtering DataFrames, especially when combining multiple conditions with and, or, and not operators.
  • Use the @ prefix to reference external Python variables within query strings, keeping your filtering logic clean and maintainable.
  • While query() excels at readability for complex filters, it has limitations—no method calls allowed, and performance benefits only appear with large datasets (100k+ rows).

Introduction to the Query Method

Pandas gives you two main ways to filter DataFrames: boolean indexing and the query() method. Most tutorials focus on boolean indexing because it’s the traditional approach, but query() often produces cleaner, more maintainable code.

Boolean indexing looks like this:

df[(df['age'] > 30) & (df['salary'] < 100000) & (df['department'] == 'Engineering')]

The same filter with query():

df.query('age > 30 and salary < 100000 and department == "Engineering"')

The query() version reads like English. You don’t need parentheses around each condition, you use and instead of &, and you reference columns by name without the df['column'] syntax.

Use query() when you have multiple conditions, when readability matters more than micro-optimization, or when you’re building dynamic filters from user input. Stick with boolean indexing for simple single-condition filters or when you need to call methods on columns during filtering.

Basic Query Syntax

The query() method accepts a string expression that evaluates to a boolean result for each row. Rows where the expression is True are returned.

import pandas as pd

# Sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [28, 35, 42, 31, 25],
    'status': ['active', 'inactive', 'active', 'active', 'inactive'],
    'salary': [75000, 82000, 95000, 68000, 71000]
})

# Filter by numeric comparison
young_employees = df.query('age < 30')
print(young_employees)

Output:

    name  age  status  salary
0  Alice   28  active   75000
4    Eve   25  inactive 71000

For string comparisons, use quotes inside the query string. You can use either single quotes for the query and double quotes for the value, or vice versa:

# Filter by string equality
active_employees = df.query('status == "active"')
# Or equivalently:
active_employees = df.query("status == 'active'")

All standard comparison operators work: ==, !=, <, >, <=, >=.

# Not equal
non_active = df.query('status != "active"')

# Greater than or equal
senior_salary = df.query('salary >= 80000')

Combining Multiple Conditions

Here’s where query() really shines. Instead of the awkward &, |, ~ operators with mandatory parentheses, you use plain English: and, or, not.

# Multiple conditions with boolean indexing (verbose)
result = df[(df['age'] > 30) & (df['salary'] < 90000) & (df['status'] == 'active')]

# Same filter with query (clean)
result = df.query('age > 30 and salary < 90000 and status == "active"')
print(result)

Output:

    name  age  status  salary
3  Diana   31  active   68000

The or operator works as expected:

# Employees who are either young OR highly paid
result = df.query('age < 30 or salary > 90000')
print(result)

Output:

      name  age    status  salary
0    Alice   28    active   75000
2  Charlie   42    active   95000
4      Eve   25  inactive   71000

Use not to negate conditions:

# Everyone except active employees under 30
result = df.query('not (status == "active" and age < 30)')
print(result)

You can build complex logic by combining operators:

# Complex filter: active employees over 30, OR inactive employees with high salary
result = df.query('(status == "active" and age > 30) or (status == "inactive" and salary > 75000)')
print(result)

Output:

      name  age    status  salary
1      Bob   35  inactive   82000
2  Charlie   42    active   95000
3    Diana   31    active   68000

Using Variables in Queries

Hard-coding values in query strings limits reusability. The @ prefix lets you reference Python variables from outside the query:

min_age = 30
max_salary = 85000
target_status = 'active'

result = df.query('age >= @min_age and salary <= @max_salary and status == @target_status')
print(result)

Output:

    name  age  status  salary
3  Diana   31  active   68000

This pattern is essential for building reusable filtering functions:

def filter_employees(df, min_age=None, max_salary=None, status=None):
    conditions = []
    
    if min_age is not None:
        conditions.append('age >= @min_age')
    if max_salary is not None:
        conditions.append('salary <= @max_salary')
    if status is not None:
        conditions.append('status == @status')
    
    if not conditions:
        return df
    
    query_string = ' and '.join(conditions)
    return df.query(query_string)

# Usage
result = filter_employees(df, min_age=25, status='active')

You can also reference list variables:

valid_names = ['Alice', 'Bob', 'Charlie']
result = df.query('name in @valid_names')
print(result)

Output:

      name  age    status  salary
0    Alice   28    active   75000
1      Bob   35  inactive   82000
2  Charlie   42    active   95000

Working with Column Names Containing Spaces or Special Characters

Real-world data often has messy column names. The query() method handles these with backticks:

# DataFrame with problematic column names
df_messy = pd.DataFrame({
    'employee name': ['Alice', 'Bob', 'Charlie'],
    'total sales': [15000, 22000, 18000],
    'commission %': [5.0, 7.5, 6.0],
    'start-date': ['2020-01-15', '2019-06-01', '2021-03-10']
})

# Use backticks for column names with spaces
high_performers = df_messy.query('`total sales` > 16000')
print(high_performers)

Output:

  employee name  total sales  commission %  start-date
1           Bob        22000           7.5  2019-06-01
2       Charlie        18000           6.0  2021-03-10

Backticks work for any problematic character:

# Columns with special characters
result = df_messy.query('`commission %` >= 6.0 and `total sales` > 15000')
print(result)

While backticks solve the immediate problem, consider renaming columns for long-term maintainability:

df_clean = df_messy.rename(columns={
    'employee name': 'employee_name',
    'total sales': 'total_sales',
    'commission %': 'commission_pct',
    'start-date': 'start_date'
})

Advanced Query Techniques

List Membership with in

The in operator checks if values exist in a list:

# Filter for specific categories
df_products = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Thingamajig', 'Doohickey'],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'price': [10, 25, 15, 30, 20]
})

result = df_products.query('category in ["A", "B"]')
print(result)

Output:

   product category  price
0   Widget        A     10
1   Gadget        B     25
2    Gizmo        A     15
4  Doohickey       B     20

Use not in for exclusion:

result = df_products.query('category not in ["C"]')

Querying the Index

Reference the index directly using index:

df_indexed = df.set_index('name')

# Query by index value
result = df_indexed.query('index == "Alice" or index == "Bob"')
print(result)

For numeric indexes:

# Filter rows where index is greater than 2
result = df.query('index > 2')
print(result)

Output:

    name  age    status  salary
3  Diana   31    active   68000
4    Eve   25  inactive   71000

Arithmetic in Queries

You can perform calculations within the query:

# Find employees where salary per year of age is above a threshold
result = df.query('salary / age > 2500')
print(result)
# Compare columns to each other
df_compare = pd.DataFrame({
    'budget': [1000, 2000, 1500],
    'spent': [800, 2100, 1400],
    'project': ['Alpha', 'Beta', 'Gamma']
})

over_budget = df_compare.query('spent > budget')
print(over_budget)

Output:

   budget  spent project
1    2000   2100    Beta

Performance Considerations and Best Practices

When Query Performs Better

For small DataFrames (under 100k rows), query() and boolean indexing perform similarly. The parsing overhead of query() might even make it slightly slower.

For large DataFrames, query() can be faster because it uses numexpr under the hood when available. Install it with pip install numexpr to enable this optimization.

import numpy as np

# Large DataFrame for benchmarking
large_df = pd.DataFrame({
    'a': np.random.randint(0, 100, 1_000_000),
    'b': np.random.randint(0, 100, 1_000_000),
    'c': np.random.choice(['x', 'y', 'z'], 1_000_000)
})

# With numexpr installed, query() can be 2-3x faster on large data
result = large_df.query('a > 50 and b < 30')

The inplace Parameter

Like many Pandas methods, query() supports inplace=True to modify the DataFrame directly:

df_copy = df.copy()
df_copy.query('age > 30', inplace=True)
# df_copy now only contains rows where age > 30

However, I recommend avoiding inplace=True. It makes code harder to reason about and doesn’t actually save memory in most cases.

Limitations to Know

The query() method cannot call methods on columns. This fails:

# This won't work
df.query('name.str.startswith("A")')  # Error!

For string methods, use boolean indexing:

# Use boolean indexing instead
result = df[df['name'].str.startswith('A')]

Similarly, you can’t use custom functions:

# This won't work
df.query('my_function(age) > 10')  # Error!

Best Practices Summary

  1. Use query() for multi-condition filters where readability matters
  2. Use boolean indexing for single conditions or when you need method calls
  3. Always use @ for variables instead of f-strings or concatenation
  4. Use backticks for messy column names, but consider renaming columns
  5. Install numexpr if you work with large datasets frequently
  6. Avoid inplace=True for clearer, more maintainable code

The query() method won’t replace boolean indexing entirely, but it’s a powerful tool for writing cleaner filtering code. Use it when your conditions get complex, and your future self will thank you.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.