How to Filter by Column Value in Pandas
Key Insights
- Boolean indexing with comparison operators is the foundation of pandas filtering—master the syntax of wrapping conditions in parentheses and using &, |, ~ for combining conditions
- The query() method offers cleaner, more readable syntax for complex filters and can be faster on large DataFrames due to expression evaluation optimization
- Always use .loc[] when filtering and selecting columns simultaneously to avoid the SettingWithCopyWarning and ensure predictable behavior
Filtering DataFrames by column values is something you’ll do constantly in pandas. Whether you’re cleaning data, preparing features for machine learning, or generating reports, selecting rows that meet specific criteria is fundamental. This article covers every practical filtering technique you need, from basic comparisons to performance-optimized approaches for large datasets.
Basic Comparison Filtering
The most straightforward way to filter a DataFrame is using comparison operators directly on columns. When you apply a comparison to a pandas Series (a column), you get back a boolean Series—a sequence of True and False values. Passing this boolean Series inside square brackets returns only the rows where the condition is True.
import pandas as pd
# Sample employee data
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
'salary': [75000, 52000, 68000, 45000, 61000],
'hire_date': pd.to_datetime(['2019-03-15', '2021-07-01', '2020-01-10', '2018-11-20', '2022-02-28'])
})
# Filter employees with salary greater than 50000
high_earners = employees[employees['salary'] > 50000]
print(high_earners)
Output:
name department salary hire_date
0 Alice Engineering 75000 2019-03-15
1 Bob Sales 52000 2021-07-01
2 Charlie Engineering 68000 2020-01-10
4 Eve Sales 61000 2022-02-28
All standard comparison operators work: == for equality, != for inequality, >, <, >=, and <= for numeric comparisons. These also work on datetime columns, which pandas handles intelligently.
# Filter employees hired before 2020
veterans = employees[employees['hire_date'] < '2020-01-01']
# Filter by exact department match
hr_staff = employees[employees['department'] == 'HR']
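It's worth inspecting the intermediate boolean Series at least once to see what the square brackets actually receive. A minimal sketch, using a trimmed version of the employees data above:

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [75000, 52000, 68000, 45000, 61000],
})

# The comparison alone yields a boolean Series aligned to the index
mask = employees['salary'] > 50000
print(mask)
# 0     True
# 1     True
# 2     True
# 3    False
# 4     True

# Indexing with the mask keeps only the rows where it is True
print(employees[mask])
```

Because the mask is aligned by index, the filtered result keeps the original row labels (0, 1, 2, 4 here), which is why filtered DataFrames often have gaps in their index.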
Filtering with Multiple Conditions
Real-world filtering rarely involves a single condition. Combining multiple conditions requires three operators: & for AND, | for OR, and ~ for NOT. The critical rule: always wrap each condition in parentheses. Without them, Python's operator precedence binds & and | more tightly than the comparisons, which raises an error or silently produces the wrong result.
# Filter Sales employees hired after 2020-01-01
sales_recent = employees[
(employees['department'] == 'Sales') &
(employees['hire_date'] > '2020-01-01')
]
print(sales_recent)
Output:
name department salary hire_date
1 Bob Sales 52000 2021-07-01
4 Eve Sales 61000 2022-02-28
The | operator handles OR conditions:
# Filter employees in Engineering OR with salary above 60000
eng_or_high_salary = employees[
(employees['department'] == 'Engineering') |
(employees['salary'] > 60000)
]
Use ~ to negate conditions:
# Filter employees NOT in HR
non_hr = employees[~(employees['department'] == 'HR')]
# Equivalent to:
non_hr = employees[employees['department'] != 'HR']
For complex logic, build conditions separately for clarity:
is_engineering = employees['department'] == 'Engineering'
is_senior = employees['hire_date'] < '2020-01-01'
high_salary = employees['salary'] > 65000
# Engineering veterans OR high earners
result = employees[(is_engineering & is_senior) | high_salary]
Using isin() for Multiple Values
When filtering for rows where a column matches any value from a list, isin() is cleaner and faster than chaining multiple OR conditions.
products = pd.DataFrame({
'product_id': range(1, 8),
'name': ['Laptop', 'T-Shirt', 'Novel', 'Headphones', 'Jeans', 'Cookbook', 'Tablet'],
'category': ['Electronics', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics'],
'price': [999, 29, 15, 149, 59, 25, 499]
})
# Filter products in specific categories
target_categories = ['Electronics', 'Clothing', 'Books']
filtered = products[products['category'].isin(target_categories)]
# This is equivalent to (but cleaner than):
filtered_verbose = products[
(products['category'] == 'Electronics') |
(products['category'] == 'Clothing') |
(products['category'] == 'Books')
]
Negate isin() with ~ to exclude values:
# Exclude certain categories
non_books = products[~products['category'].isin(['Books'])]
String-Based Filtering with str Accessor
The .str accessor unlocks powerful string operations for filtering text columns. The most common methods are contains(), startswith(), and endswith().
customers = pd.DataFrame({
'customer_id': range(1, 7),
'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams', 'Charlie Brown', 'Diana Ross'],
'email': ['john@gmail.com', 'jane@company.com', 'bob@gmail.com', 'alice@yahoo.com', 'charlie@gmail.com', 'diana@company.com']
})
# Filter customers with Gmail addresses
gmail_users = customers[customers['email'].str.contains('@gmail.com')]
print(gmail_users)
Output:
customer_id name email
0 1 John Smith john@gmail.com
2 3 Bob Johnson bob@gmail.com
4 5 Charlie Brown charlie@gmail.com
Other useful string methods:
# Names starting with 'J'
j_names = customers[customers['name'].str.startswith('J')]
# Emails ending with '.com' (all of them in this case)
dot_com = customers[customers['email'].str.endswith('.com')]
# Case-insensitive search
gmail_case_insensitive = customers[
customers['email'].str.contains('gmail', case=False)
]
# Regex patterns (contains() supports regex by default)
has_numbers = customers[customers['email'].str.contains(r'\d+', regex=True)]
Handle missing values explicitly to avoid warnings:
# Use na=False to treat NaN as non-matching
filtered = customers[customers['email'].str.contains('@gmail.com', na=False)]
Using query() Method
The query() method provides SQL-like syntax that many find more readable, especially for complex conditions. It evaluates string expressions against the DataFrame.
# Basic comparison
high_earners = employees.query('salary > 50000')
# Multiple conditions (no parentheses needed for simple AND/OR)
sales_recent = employees.query('department == "Sales" and hire_date > "2020-01-01"')
# OR conditions
eng_or_rich = employees.query('department == "Engineering" or salary > 60000')
# Using variables with @
min_salary = 55000
dept = 'Engineering'
filtered = employees.query('salary > @min_salary and department == @dept')
The query() method shines with complex conditions:
# Compare this query() version:
result = employees.query(
'(department == "Engineering" and salary > 65000) or '
'(department == "Sales" and hire_date > "2021-01-01")'
)
# To the bracket notation equivalent:
result = employees[
((employees['department'] == 'Engineering') & (employees['salary'] > 65000)) |
((employees['department'] == 'Sales') & (employees['hire_date'] > '2021-01-01'))
]
When to use query():
- Complex multi-condition filters where readability matters
- When you're building filter strings dynamically
- Large DataFrames, where query() can be faster due to expression evaluation
When to stick with bracket notation:
- Simple single-condition filters
- When you need methods like isin() or .str.contains()
- When working with column names containing spaces or special characters
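The dynamic-filter-string case is worth a concrete illustration. A minimal sketch, assuming the filter criteria arrive at runtime as a hypothetical dict of column minimums:

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [75000, 52000, 68000],
    'bonus': [5000, 1000, 3000],
})

# Hypothetical runtime input: column -> minimum value
criteria = {'salary': 60000, 'bonus': 2000}

# Join per-column conditions into a single query expression
expr = ' and '.join(f'{col} > {val}' for col, val in criteria.items())
# expr == 'salary > 60000 and bonus > 2000'

result = employees.query(expr)
print(result)
```

Building the equivalent bracket-notation filter dynamically would mean accumulating and combining boolean masks in a loop; composing a query string is usually shorter and easier to log or debug.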
Performance Considerations
For small to medium DataFrames, filtering performance rarely matters. But with millions of rows, your choice of technique does.
Use .loc[] for filtering and column selection together:
# Good: Single operation with .loc[]
result = employees.loc[employees['salary'] > 50000, ['name', 'salary']]
# Avoid: Chained indexing (can cause SettingWithCopyWarning)
result = employees[employees['salary'] > 50000][['name', 'salary']]
The .loc[] approach is not only safer but often faster because it performs the operation in one step rather than creating an intermediate DataFrame.
Avoid repeated filtering on the same condition:
# Inefficient: Computing the same boolean mask multiple times
high_salary_names = employees[employees['salary'] > 50000]['name']
high_salary_depts = employees[employees['salary'] > 50000]['department']
# Better: Compute mask once, reuse it
mask = employees['salary'] > 50000
high_salary_names = employees.loc[mask, 'name']
high_salary_depts = employees.loc[mask, 'department']
Consider query() for large DataFrames:
For DataFrames with millions of rows, query() can outperform boolean indexing because it uses numexpr for optimized expression evaluation when available.
# Can be faster on very large DataFrames
large_df.query('column_a > 100 and column_b < 50')
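Whether query() actually wins depends on the data size, the dtypes, and whether numexpr is installed, so it's worth measuring on your own workload. A rough timing sketch on synthetic data (column names and sizes are illustrative):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
large_df = pd.DataFrame({
    'column_a': rng.integers(0, 200, size=500_000),
    'column_b': rng.integers(0, 100, size=500_000),
})

start = time.perf_counter()
via_mask = large_df[(large_df['column_a'] > 100) & (large_df['column_b'] < 50)]
mask_time = time.perf_counter() - start

start = time.perf_counter()
via_query = large_df.query('column_a > 100 and column_b < 50')
query_time = time.perf_counter() - start

# Both approaches return identical rows; only the timings differ by machine
assert via_mask.equals(via_query)
print(f'boolean indexing: {mask_time:.4f}s, query(): {query_time:.4f}s')
```

On small DataFrames query() often loses because of its expression-parsing overhead; the crossover point varies, which is why a quick benchmark like this beats guessing.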
Use categorical dtypes for repeated filtering:
If you’re filtering on the same column repeatedly, converting it to categorical can speed things up:
employees['department'] = employees['department'].astype('category')
# Subsequent filters on department will be faster
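The gain comes from how categoricals are stored: each unique string is kept once, and rows hold small integer codes, so equality filters compare integers instead of strings. A minimal sketch on synthetic data showing the memory difference and that filtering behaves identically:

```python
import pandas as pd

# Synthetic low-cardinality column: 3 unique values repeated 100,000 times
departments = pd.Series(['Engineering', 'Sales', 'HR'] * 100_000, name='department')

as_object = departments.memory_usage(deep=True)
as_category = departments.astype('category').memory_usage(deep=True)

# Categorical stores the 3 unique strings once plus compact integer codes
print(f'object: {as_object:,} bytes, category: {as_category:,} bytes')

cat = departments.astype('category')
print((cat == 'Sales').sum())  # same result as filtering the object column
```

The conversion only pays off when the column has few unique values relative to its length; a high-cardinality column (like a free-text field) can get larger as a categorical.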
Filtering is one of those operations you’ll perform hundreds of times in any data project. Master these techniques—basic comparisons, multiple conditions with proper parentheses, isin() for lists, string methods for text, and query() for readability—and you’ll handle any filtering task efficiently. Start with the simplest approach that works, and reach for more advanced techniques when you need cleaner code or better performance.