Pandas - Select Rows Where Column Equals Value
Key Insights
- Use boolean indexing with df[df['column'] == value] for simple equality filtering—it's the most readable and Pythonic approach for selecting rows
- Leverage df.query() for complex multi-condition filters with cleaner syntax, especially when chaining multiple comparisons or working with string column names
- Apply df.loc[] when you need to select both specific rows and columns simultaneously, or when working with label-based indexing alongside conditions
Boolean Indexing: The Standard Approach
Boolean indexing is the most straightforward method for filtering DataFrame rows. It creates a boolean mask where each row is evaluated against your condition, returning True or False.
import pandas as pd
# Sample dataset
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Mouse'],
'price': [1200, 25, 75, 300, 30],
'stock': [15, 100, 45, 20, 85]
})
# Select rows where product equals 'Mouse'
mice = df[df['product'] == 'Mouse']
print(mice)
Output:
product price stock
1 Mouse 25 100
4 Mouse 30 85
The expression df['product'] == 'Mouse' creates a Series of boolean values. Pandas then uses this mask to filter the DataFrame, returning only rows where the condition is True.
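To see the mask itself, you can print the comparison result before using it as an indexer:

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Mouse'],
    'price': [1200, 25, 75, 300, 30],
    'stock': [15, 100, 45, 20, 85]
})

# The comparison alone produces a boolean Series aligned to df's index
mask = df['product'] == 'Mouse'
print(mask.tolist())  # [False, True, False, False, True]
```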
For numeric comparisons:
# Products priced at exactly 300
expensive = df[df['price'] == 300]
# Multiple value matching with isin()
peripherals = df[df['product'].isin(['Mouse', 'Keyboard'])]
print(peripherals)
Output:
product price stock
1 Mouse 25 100
2 Keyboard 75 45
4 Mouse 30 85
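The same mask logic can be inverted with the ~ operator to select everything except the listed values:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Mouse'],
    'price': [1200, 25, 75, 300, 30],
    'stock': [15, 100, 45, 20, 85]
})

# ~ negates the boolean mask: rows whose product is NOT in the list
non_peripherals = df[~df['product'].isin(['Mouse', 'Keyboard'])]
print(non_peripherals['product'].tolist())  # ['Laptop', 'Monitor']
```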
Handling String Comparisons
String matching requires attention to case sensitivity and whitespace. Pandas provides several methods for flexible string filtering.
df = pd.DataFrame({
'customer': ['John Smith', 'jane doe', 'ALICE JONES', 'Bob Wilson'],
'status': ['active', 'Active', 'ACTIVE', 'inactive']
})
# Case-sensitive exact match
active_exact = df[df['status'] == 'active']
print(f"Exact match: {len(active_exact)} rows")
# Case-insensitive matching
active_any_case = df[df['status'].str.lower() == 'active']
print(f"Case-insensitive: {len(active_any_case)} rows")
# Partial string matching
smith_customers = df[df['customer'].str.contains('Smith', case=False)]
print(smith_customers)
Output:
Exact match: 1 rows
Case-insensitive: 3 rows
customer status
0 John Smith active
For pattern matching:
# Using regex patterns
pattern_match = df[df['customer'].str.match(r'^[Jj]')]
print(pattern_match)
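Whitespace is the other pitfall mentioned above; stripping it before comparing handles padded values. A minimal sketch with hypothetical padded data:

```python
import pandas as pd

# Hypothetical data with stray whitespace around values
df = pd.DataFrame({'status': ['active ', ' active', 'active', 'inactive']})

# Strip leading/trailing whitespace before the equality check
clean_active = df[df['status'].str.strip() == 'active']
print(len(clean_active))  # 3
```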
The query() Method for Complex Conditions
The query() method offers cleaner syntax for complex filtering operations, particularly when combining multiple conditions.
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics'],
'price': [1200, 25, 75, 300, 150],
'stock': [15, 100, 45, 20, 60]
})
# Single condition with query()
mice = df.query('product == "Mouse"')
# Multiple conditions
affordable_electronics = df.query('category == "Electronics" and price < 500')
print(affordable_electronics)
Output:
product category price stock
3 Monitor Electronics 300 20
4 Webcam Electronics 150 60
The query() method shines with variable interpolation:
target_category = 'Electronics'
max_price = 500
# Using @ to reference variables
results = df.query('category == @target_category and price <= @max_price')
# Complex expressions
in_stock_affordable = df.query('stock > 50 and price < 100')
print(in_stock_affordable)
Output:
product category price stock
1 Mouse Accessories 25 100
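query() also handles column names that aren't valid Python identifiers, such as names containing spaces, via backticks. A sketch with a hypothetical 'unit price' column:

```python
import pandas as pd

# Hypothetical column name containing a space
df = pd.DataFrame({'product': ['Laptop', 'Mouse'], 'unit price': [1200, 25]})

# Backticks let query() reference columns that aren't valid identifiers
cheap = df.query('`unit price` < 100')
print(cheap['product'].tolist())  # ['Mouse']
```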
Using loc[] for Precise Selection
The loc[] accessor provides label-based indexing and works seamlessly with boolean conditions when you need to select specific columns alongside row filtering.
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
'salary': [50000, 75000, 55000, 60000, 80000],
'years': [2, 5, 3, 7, 4]
})
# Select specific columns for filtered rows
sales_info = df.loc[df['department'] == 'Sales', ['employee_id', 'salary']]
print(sales_info)
Output:
employee_id salary
0 101 50000
2 103 55000
Combining multiple conditions with loc[]:
# AND conditions
experienced_it = df.loc[(df['department'] == 'IT') & (df['years'] >= 5)]
# OR conditions
high_earners_or_hr = df.loc[(df['salary'] > 70000) | (df['department'] == 'HR')]
print(high_earners_or_hr)
Output:
employee_id department salary years
1 102 IT 75000 5
3 104 HR 60000 7
4 105 IT 80000 4
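A minimal sketch of a related use: loc[] also accepts assignment, so the same boolean condition can update the selected rows in place rather than just read them:

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'department': ['Sales', 'IT', 'Sales'],
    'salary': [50000, 75000, 55000]
})

# Writing through loc updates the original DataFrame directly,
# avoiding the SettingWithCopyWarning that chained indexing can trigger
df.loc[df['department'] == 'Sales', 'salary'] += 5000
print(df['salary'].tolist())  # [55000, 75000, 60000]
```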
Performance Considerations
When working with large datasets, method selection impacts performance significantly.
import numpy as np
import time
# Create large dataset
large_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
'value': np.random.randint(0, 1000, 1000000)
})
# Boolean indexing
start = time.time()
result1 = large_df[large_df['category'] == 'A']
time1 = time.time() - start
# query() method
start = time.time()
result2 = large_df.query('category == "A"')
time2 = time.time() - start
print(f"Boolean indexing: {time1:.4f}s")
print(f"query() method: {time2:.4f}s")
For optimal performance with repeated filtering:
# Create boolean mask once, reuse multiple times
mask = large_df['category'] == 'A'
subset1 = large_df[mask]
subset2 = large_df.loc[mask, 'value'].sum()  # .loc avoids chained indexing
# Use categorical data types for frequently filtered columns
large_df['category'] = large_df['category'].astype('category')
filtered = large_df[large_df['category'] == 'A'] # Faster with categorical
Handling Missing Values in Comparisons
Equality comparisons with NaN values require special handling: NaN never compares equal to anything, including itself, so NaN == NaN evaluates to False.
df = pd.DataFrame({
'product': ['Laptop', None, 'Mouse', 'Keyboard', None],
'price': [1200, 25, None, 75, 300]
})
# This won't catch None/NaN values
result = df[df['product'] == None] # Returns empty DataFrame
# Correct approach for finding missing values
missing_products = df[df['product'].isna()]
print(missing_products)
# Combining null checks with other conditions
has_price_no_product = df[df['product'].isna() & df['price'].notna()]
print(has_price_no_product)
Output:
product price
1 None 25.0
4 None 300.0
product price
1 None 25.0
4 None 300.0
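If you instead want missing values treated as a specific placeholder during comparison, one option is filling them with fillna() first. A sketch using a hypothetical 'unknown' placeholder:

```python
import pandas as pd

df = pd.DataFrame({'product': ['Laptop', None, 'Mouse']})

# Treat missing products as the placeholder 'unknown' before filtering
unknown = df[df['product'].fillna('unknown') == 'unknown']
print(len(unknown))  # 1
```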
Chaining Filters Efficiently
When applying multiple sequential filters, consider method chaining versus combined conditions.
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'East', 'West', 'North'],
'sales': [15000, 22000, 18000, 25000, 12000, 30000],
'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q2']
})
# Chained filtering (less efficient)
result = df[df['region'] == 'North']
result = result[result['quarter'] == 'Q2']
# Combined conditions (more efficient)
result = df[(df['region'] == 'North') & (df['quarter'] == 'Q2')]
print(result)
# Using query for readability
result = df.query('region == "North" and quarter == "Q2"')
print(result)
Output:
region sales quarter
2 North 18000 Q2
5 North 30000 Q2
Choose boolean indexing for simple, single-condition filters. Use query() when readability matters with complex conditions. Apply loc[] when you need precise control over both row and column selection. Each method has its place in a data engineer’s toolkit.