How to Filter by Column Value in Pandas


Key Insights

  • Boolean indexing with comparison operators is the foundation of pandas filtering—master the syntax of wrapping conditions in parentheses and using &, |, ~ for combining conditions
  • The query() method offers cleaner, more readable syntax for complex filters and can be faster on large DataFrames due to expression evaluation optimization
  • Always use .loc[] when filtering and selecting columns simultaneously to avoid the SettingWithCopyWarning and ensure predictable behavior

Filtering DataFrames by column values is something you’ll do constantly in pandas. Whether you’re cleaning data, preparing features for machine learning, or generating reports, selecting rows that meet specific criteria is fundamental. This article covers every practical filtering technique you need, from basic comparisons to performance-optimized approaches for large datasets.

Basic Comparison Filtering

The most straightforward way to filter a DataFrame is using comparison operators directly on columns. When you apply a comparison to a pandas Series (a column), you get back a boolean Series—a sequence of True and False values. Passing this boolean Series inside square brackets returns only the rows where the condition is True.

import pandas as pd

# Sample employee data
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [75000, 52000, 68000, 45000, 61000],
    'hire_date': pd.to_datetime(['2019-03-15', '2021-07-01', '2020-01-10', '2018-11-20', '2022-02-28'])
})

# Filter employees with salary greater than 50000
high_earners = employees[employees['salary'] > 50000]
print(high_earners)

Output:

      name   department  salary  hire_date
0    Alice  Engineering   75000 2019-03-15
1      Bob        Sales   52000 2021-07-01
2  Charlie  Engineering   68000 2020-01-10
4      Eve        Sales   61000 2022-02-28

All standard comparison operators work: == for equality, != for inequality, >, <, >=, and <= for numeric comparisons. They also work on datetime columns: pandas automatically parses comparison strings like '2020-01-01' into timestamps.

# Filter employees hired before 2020
veterans = employees[employees['hire_date'] < '2020-01-01']

# Filter by exact department match
hr_staff = employees[employees['department'] == 'HR']
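Since hire_date is a datetime column, the .dt accessor also works for component-based filters, for example by year. A quick sketch using the same employees frame as above:

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'hire_date': pd.to_datetime(['2019-03-15', '2021-07-01', '2020-01-10',
                                 '2018-11-20', '2022-02-28'])
})

# Filter on the year component instead of a full date comparison
hired_2021_or_later = employees[employees['hire_date'].dt.year >= 2021]
print(hired_2021_or_later['name'].tolist())  # ['Bob', 'Eve']
```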

Filtering with Multiple Conditions

Real-world filtering rarely involves a single condition. Combining multiple conditions requires three operators: & for AND, | for OR, and ~ for NOT. The critical rule: always wrap each condition in parentheses. Without them, Python evaluates & and | before the comparison operators, which typically produces a ValueError about the ambiguous truth value of a Series, or a silently wrong result, rather than the filter you intended.

# Filter Sales employees hired after 2020-01-01
sales_recent = employees[
    (employees['department'] == 'Sales') & 
    (employees['hire_date'] > '2020-01-01')
]
print(sales_recent)

Output:

  name department  salary  hire_date
1  Bob      Sales   52000 2021-07-01
4  Eve      Sales   61000 2022-02-28

The | operator handles OR conditions:

# Filter employees in Engineering OR with salary above 60000
eng_or_high_salary = employees[
    (employees['department'] == 'Engineering') | 
    (employees['salary'] > 60000)
]

Use ~ to negate conditions:

# Filter employees NOT in HR
non_hr = employees[~(employees['department'] == 'HR')]

# Equivalent to:
non_hr = employees[employees['department'] != 'HR']

For complex logic, build conditions separately for clarity:

is_engineering = employees['department'] == 'Engineering'
is_senior = employees['hire_date'] < '2020-01-01'
high_salary = employees['salary'] > 65000

# Engineering veterans OR high earners
result = employees[(is_engineering & is_senior) | high_salary]

Using isin() for Multiple Values

When filtering for rows where a column matches any value from a list, isin() is cleaner and faster than chaining multiple OR conditions.

products = pd.DataFrame({
    'product_id': range(1, 8),
    'name': ['Laptop', 'T-Shirt', 'Novel', 'Headphones', 'Jeans', 'Cookbook', 'Tablet'],
    'category': ['Electronics', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics'],
    'price': [999, 29, 15, 149, 59, 25, 499]
})

# Filter products in specific categories
target_categories = ['Electronics', 'Clothing', 'Books']
filtered = products[products['category'].isin(target_categories)]

# This is equivalent to (but cleaner than):
filtered_verbose = products[
    (products['category'] == 'Electronics') |
    (products['category'] == 'Clothing') |
    (products['category'] == 'Books')
]

Negate isin() with ~ to exclude values:

# Exclude certain categories
non_books = products[~products['category'].isin(['Books'])]

String-Based Filtering with str Accessor

The .str accessor unlocks powerful string operations for filtering text columns. The most common methods are contains(), startswith(), and endswith().

customers = pd.DataFrame({
    'customer_id': range(1, 7),
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams', 'Charlie Brown', 'Diana Ross'],
    'email': ['john@gmail.com', 'jane@company.com', 'bob@gmail.com', 'alice@yahoo.com', 'charlie@gmail.com', 'diana@company.com']
})

# Filter customers with Gmail addresses
gmail_users = customers[customers['email'].str.contains('@gmail.com')]
print(gmail_users)

Output:

   customer_id           name              email
0            1     John Smith     john@gmail.com
2            3    Bob Johnson      bob@gmail.com
4            5  Charlie Brown  charlie@gmail.com

Other useful string methods:

# Names starting with 'J'
j_names = customers[customers['name'].str.startswith('J')]

# Emails ending with '.com' (all of them in this case)
dot_com = customers[customers['email'].str.endswith('.com')]

# Case-insensitive search
gmail_case_insensitive = customers[
    customers['email'].str.contains('gmail', case=False)
]

# Regex patterns (contains() supports regex by default)
has_numbers = customers[customers['email'].str.contains(r'\d+', regex=True)]

Handle missing values explicitly: NaN values in the column propagate into the boolean mask, and indexing with a mask containing NaN raises an error.

# Use na=False to treat NaN as non-matching
filtered = customers[customers['email'].str.contains('@gmail.com', na=False)]
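To see the effect, here is a small sketch with a deliberately missing email. With na=False, the missing row is treated as a non-match and simply excluded:

```python
import pandas as pd

contacts = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob'],
    'email': ['john@gmail.com', None, 'bob@gmail.com']
})

# na=False turns the NaN result for the missing email into False,
# so the mask stays purely boolean and indexing works
gmail = contacts[contacts['email'].str.contains('@gmail.com', na=False)]
print(gmail['name'].tolist())  # ['John', 'Bob']
```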

Using query() Method

The query() method provides SQL-like syntax that many find more readable, especially for complex conditions. It evaluates string expressions against the DataFrame.

# Basic comparison
high_earners = employees.query('salary > 50000')

# Multiple conditions (no parentheses needed for simple AND/OR)
sales_recent = employees.query('department == "Sales" and hire_date > "2020-01-01"')

# OR conditions
eng_or_rich = employees.query('department == "Engineering" or salary > 60000')

# Using variables with @
min_salary = 55000
dept = 'Engineering'
filtered = employees.query('salary > @min_salary and department == @dept')

The query() method shines with complex conditions:

# Compare this query() version:
result = employees.query(
    '(department == "Engineering" and salary > 65000) or '
    '(department == "Sales" and hire_date > "2021-01-01")'
)

# To the bracket notation equivalent:
result = employees[
    ((employees['department'] == 'Engineering') & (employees['salary'] > 65000)) |
    ((employees['department'] == 'Sales') & (employees['hire_date'] > '2021-01-01'))
]

When to use query():

  • Complex multi-condition filters where readability matters
  • When you’re building filter strings dynamically
  • Large DataFrames where query() can be faster due to expression evaluation

When to stick with bracket notation:

  • Simple single-condition filters
  • When you need to use methods like isin() or .str.contains()
  • When working with column names containing spaces or special characters
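One caveat on those last two points: recent pandas versions soften them somewhat. query() supports an in operator that mirrors isin(), and column names with spaces can be wrapped in backticks (pandas 0.25+). A sketch, using a made-up frame with a spaced column name:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'HR'],
    'annual salary': [75000, 52000, 45000]  # note the space in the name
})

# 'in' inside query() behaves like isin()
subset = df.query('department in ["Sales", "HR"]')

# Backticks handle column names containing spaces
well_paid = df.query('`annual salary` > 50000')

print(subset['department'].tolist())     # ['Sales', 'HR']
print(well_paid['department'].tolist())  # ['Engineering', 'Sales']
```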

Performance Considerations

For small to medium DataFrames, filtering performance rarely matters. But with millions of rows, technique matters.

Use .loc[] for filtering and column selection together:

# Good: Single operation with .loc[]
result = employees.loc[employees['salary'] > 50000, ['name', 'salary']]

# Avoid: chained indexing (creates an intermediate DataFrame and can
# trigger SettingWithCopyWarning if you later assign to the result)
result = employees[employees['salary'] > 50000][['name', 'salary']]

The .loc[] approach is not only safer but often faster because it performs the operation in one step rather than creating an intermediate DataFrame.
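The same rule applies when writing to filtered rows. A minimal sketch of a filtered update: going through .loc in a single step guarantees the original frame is modified, whereas chained indexing may silently update a copy.

```python
import pandas as pd

df = pd.DataFrame({'dept': ['Eng', 'Sales', 'Eng'],
                   'salary': [75000, 52000, 68000]})

# Safe: filter and assign in one .loc operation
df.loc[df['dept'] == 'Eng', 'salary'] += 5000

print(df['salary'].tolist())  # [80000, 52000, 73000]
```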

Avoid repeated filtering on the same condition:

# Inefficient: Computing the same boolean mask multiple times
high_salary_names = employees[employees['salary'] > 50000]['name']
high_salary_depts = employees[employees['salary'] > 50000]['department']

# Better: Compute mask once, reuse it
mask = employees['salary'] > 50000
high_salary_names = employees.loc[mask, 'name']
high_salary_depts = employees.loc[mask, 'department']

Consider query() for large DataFrames:

For DataFrames with millions of rows, query() can outperform boolean indexing because it uses numexpr for optimized expression evaluation when available.

# Can be faster on very large DataFrames
large_df.query('column_a > 100 and column_b < 50')
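A self-contained version of that snippet (the million-row frame and column names are made up for illustration; the actual speedup depends on numexpr being installed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
large_df = pd.DataFrame({
    'column_a': rng.integers(0, 200, size=1_000_000),
    'column_b': rng.integers(0, 100, size=1_000_000),
})

# Both return the same rows; query() can delegate evaluation to numexpr
via_query = large_df.query('column_a > 100 and column_b < 50')
via_mask = large_df[(large_df['column_a'] > 100) & (large_df['column_b'] < 50)]
assert via_query.equals(via_mask)
```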

Use categorical dtypes for repeated filtering:

If you’re filtering on the same column repeatedly, converting it to categorical can speed things up:

employees['department'] = employees['department'].astype('category')
# Subsequent filters on department will be faster
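A minimal sketch of the conversion, reusing the employees data from earlier. Filtering syntax is unchanged; the speedup comes from comparing small integer category codes instead of strings:

```python
import pandas as pd

employees = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [75000, 52000, 68000, 45000, 61000],
})

# One-time conversion to a categorical dtype
employees['department'] = employees['department'].astype('category')

# Filters look exactly the same as before
eng = employees[employees['department'] == 'Engineering']
print(len(eng))  # 2
```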

Filtering is one of those operations you’ll perform hundreds of times in any data project. Master these techniques—basic comparisons, multiple conditions with proper parentheses, isin() for lists, string methods for text, and query() for readability—and you’ll handle any filtering task efficiently. Start with the simplest approach that works, and reach for more advanced techniques when you need cleaner code or better performance.
