How to Filter Rows in Pandas
Key Insights
- Boolean indexing is the foundation of pandas filtering—master it first, and every other method becomes intuitive
- Use .query() for complex conditions with multiple columns; it’s more readable and often faster on large datasets
- Always wrap multiple conditions in parentheses when using & and | operators; this is the most common source of filtering bugs
Introduction
Row filtering is something you’ll do in virtually every pandas workflow. Whether you’re cleaning messy data, preparing subsets for analysis, or extracting records that meet specific criteria, filtering is foundational. Yet I regularly see developers struggle with the syntax, especially when combining multiple conditions.
This article covers the practical techniques you need. We’ll start with boolean indexing basics, work through multi-condition filtering, and explore specialized methods like .query(), .isin(), and .between(). By the end, you’ll know exactly which approach to use for any filtering scenario.
Let’s work with a sample dataset throughout:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [28, 35, 42, 31, None],
'department': ['Engineering', 'Sales', 'Engineering', 'Marketing', 'Sales'],
'salary': [75000, 52000, 95000, 61000, 48000],
'years_employed': [3, 7, 12, 5, 2]
})
Boolean Indexing Basics
Boolean indexing is the core mechanism behind pandas filtering. The concept is simple: you create a Series of True and False values (a boolean mask), then pass it to the DataFrame. Rows where the mask is True are kept; rows where it’s False are dropped.
When you apply a comparison operator to a pandas Series, you get a boolean Series back:
# This creates a boolean Series
mask = df['age'] > 30
print(mask)
# 0 False
# 1 True
# 2 True
# 3 True
# 4 False
# Name: age, dtype: bool
Pass this mask to the DataFrame using bracket notation to filter:
# Filter rows where age > 30
older_employees = df[df['age'] > 30]
print(older_employees)
# name age department salary years_employed
# 1 Bob 35.0 Sales 52000 7
# 2 Charlie 42.0 Engineering 95000 12
# 3 Diana 31.0 Marketing 61000 5
String comparisons work identically:
# Filter rows where department equals 'Engineering'
engineers = df[df['department'] == 'Engineering']
print(engineers)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 2 Charlie 42.0 Engineering 95000 12
You can also use other comparison operators: <, <=, >=, !=. The pattern is always the same: df[df['column'] <operator> value].
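For instance, a quick sketch of the same pattern with the != operator (re-creating the sample frame so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [28, 35, 42, 31, None],
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing', 'Sales'],
    'salary': [75000, 52000, 95000, 61000, 48000],
    'years_employed': [3, 7, 12, 5, 2]
})

# != keeps every row whose department is not 'Engineering'
non_engineers = df[df['department'] != 'Engineering']
print(non_engineers['name'].tolist())  # → ['Bob', 'Diana', 'Eve']
```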
Filtering with Multiple Conditions
Real-world filtering usually involves multiple conditions. This is where pandas syntax gets tricky if you’re used to plain Python or SQL.
Use & for AND, | for OR, and ~ for NOT. Crucially, you must wrap each condition in parentheses. This isn’t optional—it’s required because of Python’s operator precedence rules.
# Filter: age > 30 AND salary < 70000
mid_career_moderate_pay = df[(df['age'] > 30) & (df['salary'] < 70000)]
print(mid_career_moderate_pay)
# name age department salary years_employed
# 1 Bob 35.0 Sales 52000 7
# 3 Diana 31.0 Marketing 61000 5
Without parentheses, you’ll get a cryptic error about ambiguous truth values. I’ve seen experienced developers waste hours debugging this.
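To see why, here is a minimal sketch of what happens when the parentheses are dropped (the exact wording of the message may vary by pandas version):

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 35, 42], 'salary': [75000, 52000, 95000]})

try:
    # Missing parentheses: & binds tighter than >, so Python evaluates
    # 30 & df['salary'] first, then a chained comparison that ends up
    # calling bool() on a whole Series
    df[df['age'] > 30 & df['salary'] < 70000]
except ValueError as e:
    # "The truth value of a Series is ambiguous..."
    print(type(e).__name__)  # → ValueError
```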
OR conditions follow the same pattern:
# Filter: department is Engineering OR salary > 60000
high_value = df[(df['department'] == 'Engineering') | (df['salary'] > 60000)]
print(high_value)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 2 Charlie 42.0 Engineering 95000 12
# 3 Diana 31.0 Marketing 61000 5
Negate conditions with ~:
# Filter: NOT in Sales department
not_sales = df[~(df['department'] == 'Sales')]
print(not_sales)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 2 Charlie 42.0 Engineering 95000 12
# 3 Diana 31.0 Marketing 61000 5
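For a single equality test, != is an equivalent alternative; ~ really earns its keep when negating a compound condition. A small sketch on the sample columns:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing', 'Sales'],
    'salary': [75000, 52000, 95000, 61000, 48000]
})

# Single condition: these two filters select exactly the same rows
a = df[~(df['department'] == 'Sales')]
b = df[df['department'] != 'Sales']

# Compound condition: ~ negates the whole parenthesized expression,
# here "is a well-paid engineer"
not_well_paid_engineers = df[~((df['department'] == 'Engineering') & (df['salary'] > 70000))]
print(a.equals(b))  # → True
```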
When you have more than two or three conditions, the boolean indexing syntax becomes unwieldy. That’s when you should reach for .query().
Using the .query() Method
The .query() method lets you write filter conditions as strings using a SQL-like syntax. It’s more readable for complex conditions and often performs better on large datasets.
Here’s the boolean indexing example rewritten with .query():
# Boolean indexing version
result = df[(df['age'] > 30) & (df['salary'] < 70000)]
# Equivalent .query() version
result = df.query('age > 30 and salary < 70000')
The .query() version is cleaner—no repeated df, no parentheses around each condition, and you use and/or/not instead of &/|/~.
For string comparisons, use quotes within the query string:
# Filter by string value
engineers = df.query('department == "Engineering"')
Reference external variables with the @ prefix:
min_salary = 60000
target_dept = 'Engineering'
# Reference Python variables in query
filtered = df.query('salary >= @min_salary and department == @target_dept')
print(filtered)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 2 Charlie 42.0 Engineering 95000 12
This variable referencing is particularly useful in functions where filter values come from parameters.
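As a sketch, a hypothetical helper function (the name and signature are illustrative, not from the article) whose parameters feed straight into the query:

```python
import pandas as pd

def filter_by_dept(df, dept, min_salary):
    # @-prefixed names resolve to the enclosing Python scope,
    # here the function's parameters
    return df.query('department == @dept and salary >= @min_salary')

df = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering'],
    'salary': [75000, 52000, 95000]
})
result = filter_by_dept(df, 'Engineering', 80000)
print(len(result))  # → 1 (only the 95000 row qualifies)
```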
One limitation: .query() doesn’t work well with column names containing spaces or special characters. Stick to boolean indexing for those cases, or rename your columns first.
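That said, recent pandas versions (0.25+) do let you escape such names with backticks inside the query string, which is often enough to avoid a rename:

```python
import pandas as pd

df = pd.DataFrame({'annual salary': [75000, 52000, 95000]})

# Backticks let .query() reference a column name containing a space
high = df.query('`annual salary` > 60000')
print(len(high))  # → 2
```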
Filtering with .isin() and .between()
When checking if values are in a list or within a range, pandas provides specialized methods that are both cleaner and faster than chaining OR conditions.
Use .isin() to check membership in a collection:
# Filter: department is Engineering or Marketing
target_depts = ['Engineering', 'Marketing']
filtered = df[df['department'].isin(target_depts)]
print(filtered)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 2 Charlie 42.0 Engineering 95000 12
# 3 Diana 31.0 Marketing 61000 5
Compare this to the OR-chain alternative:
# Equivalent but worse
filtered = df[(df['department'] == 'Engineering') | (df['department'] == 'Marketing')]
The .isin() version scales better—adding more values to the list doesn’t make the code longer or harder to read.
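It also composes cleanly with ~ for a NOT IN filter, sketched here on the sample columns:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing', 'Sales']
})

# ~ + .isin() reads as "department NOT IN the list"
excluded = ['Engineering', 'Marketing']
others = df[~df['department'].isin(excluded)]
print(others['name'].tolist())  # → ['Bob', 'Eve']
```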
For numeric ranges, .between() is your friend:
# Filter: salary between 50000 and 80000 (inclusive)
mid_range_salary = df[df['salary'].between(50000, 80000)]
print(mid_range_salary)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 1 Bob 35.0 Sales 52000 7
# 3 Diana 31.0 Marketing 61000 5
By default, .between() is inclusive on both ends. Use the inclusive parameter to change this:
# Exclusive on both ends
df[df['salary'].between(50000, 80000, inclusive='neither')]
# Inclusive only on left
df[df['salary'].between(50000, 80000, inclusive='left')]
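A quick sketch of how the boundary rows move with inclusive, using values that sit exactly on the endpoints (the string values for inclusive need pandas 1.3+):

```python
import pandas as pd

s = pd.Series([50000, 60000, 80000, 90000])

# 'both' (the default) keeps both endpoints; 'neither' drops them;
# 'left' keeps only the lower endpoint
print(s.between(50000, 80000).sum())                       # → 3
print(s.between(50000, 80000, inclusive='neither').sum())  # → 1
print(s.between(50000, 80000, inclusive='left').sum())     # → 2
```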
Handling Missing Values in Filters
Missing values (NaN/None) require special attention. Standard comparison operators don’t work intuitively with NaN—NaN == NaN returns False, and NaN > 5 also returns False.
Use .isna() and .notna() for explicit null checking:
# Filter rows where age is missing
missing_age = df[df['age'].isna()]
print(missing_age)
# name age department salary years_employed
# 4 Eve NaN Sales 48000 2
# Filter rows where age is NOT missing
has_age = df[df['age'].notna()]
print(has_age)
# name age department salary years_employed
# 0 Alice 28.0 Engineering 75000 3
# 1 Bob 35.0 Sales 52000 7
# 2 Charlie 42.0 Engineering 95000 12
# 3 Diana 31.0 Marketing 61000 5
Combine null checks with other conditions:
# Filter: age > 30 OR age is missing
result = df[(df['age'] > 30) | (df['age'].isna())]
For quick removal of rows with any missing values, use dropna():
# Drop rows with any missing values
clean_df = df.dropna()
# Drop rows with missing values in specific columns only
clean_df = df.dropna(subset=['age', 'salary'])
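For a middle ground between dropping any-NaN rows and keeping everything, dropna() also takes a thresh parameter; a small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, None, 6],
    'c': [7, 8, 9]
})

# thresh keeps rows that have at least N non-missing values
kept = df.dropna(thresh=2)
print(len(kept))  # → 2 (the middle row has only one non-null value)
```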
Performance Tips and Summary
For small to medium datasets (under a million rows), all filtering methods perform similarly. For larger datasets, keep these guidelines in mind:
Boolean indexing is the most flexible and works everywhere. It’s your default choice for simple conditions.
.query() can be faster on large DataFrames because it uses numexpr under the hood for numeric operations. It’s also more readable for complex multi-column conditions.
.isin() is significantly faster than chaining OR conditions, especially with large lists of values.
Avoid applying filters in loops. If you need to filter by multiple criteria separately, consider using groupby() instead.
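As a sketch of that last point, a single groupby() pass can replace a loop that filters once per category:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing'],
    'salary': [75000, 52000, 95000, 61000]
})

# Instead of filtering once per department in a loop...
# slow = {d: df[df['department'] == d] for d in df['department'].unique()}

# ...one groupby pass yields the same per-group frames
groups = {dept: sub for dept, sub in df.groupby('department')}
print(sorted(groups))  # → ['Engineering', 'Marketing', 'Sales']
```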
Here’s a quick reference for choosing your filtering approach:
| Scenario | Recommended Method |
|---|---|
| Simple single condition | Boolean indexing |
| Multiple conditions (3+) | .query() |
| Value in list | .isin() |
| Numeric range | .between() |
| Missing value check | .isna() / .notna() |
| Column names with spaces | Boolean indexing |
Filtering is one of those operations you’ll use constantly. Get comfortable with these patterns, and you’ll write cleaner, faster pandas code. Start with boolean indexing to understand the fundamentals, then reach for .query() and the specialized methods when they make your code clearer.