Pandas - Select Rows by Condition

Filtering rows by condition is one of the most common operations in Pandas. This guide covers boolean indexing, the query() method, membership tests with isin(), range checks with between(), and string pattern matching.

Key Insights

  • Boolean indexing with comparison operators is the most direct way to filter DataFrames, returning rows where conditions evaluate to True
  • The query() method provides a cleaner syntax for complex conditions using string expressions, especially when combining multiple criteria
  • The isin(), between(), and str.contains() methods handle specialized filtering scenarios more efficiently than manual boolean operations

Basic Boolean Indexing

Boolean indexing forms the foundation of conditional row selection in Pandas. You create a boolean mask by applying a condition to a column, then use that mask to filter the DataFrame.

import pandas as pd
import numpy as np

# Create sample dataset
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'price': [1200, 25, 75, 350, 80],
    'stock': [15, 120, 45, 30, 60],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})

# Select rows where price > 100
expensive_items = df[df['price'] > 100]
print(expensive_items)

The expression df['price'] > 100 creates a Series of boolean values. Pandas then returns only rows where the condition is True. This approach works with all comparison operators: ==, !=, <, >, <=, >=.

# Equal to comparison
electronics = df[df['category'] == 'Electronics']

# Not equal to
non_accessories = df[df['category'] != 'Accessories']

# Less than or equal to
affordable = df[df['price'] <= 80]

Multiple Conditions with Logical Operators

Combine multiple conditions using bitwise operators: & (and), | (or), and ~ (not). Each condition must be wrapped in parentheses due to operator precedence.

# AND condition: expensive electronics
expensive_electronics = df[(df['price'] > 100) & (df['category'] == 'Electronics')]

# OR condition: either expensive or low stock
critical_items = df[(df['price'] > 500) | (df['stock'] < 20)]

# NOT condition: exclude accessories
non_accessories = df[~(df['category'] == 'Accessories')]

# Complex combination
target_products = df[
    ((df['price'] > 50) & (df['price'] < 500)) & 
    (df['stock'] > 20)
]

The bitwise operators work element-wise on boolean Series. Never use Python's and, or, not keywords with Pandas boolean indexing; they raise a ValueError because the truth value of a whole Series is ambiguous.
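A minimal sketch of the element-wise behavior, and of the error that Python's keywords trigger (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [1200, 25, 75], 'stock': [15, 120, 45]})

# Bitwise & combines the two boolean Series element by element
mask = (df['price'] > 50) & (df['stock'] < 100)
print(mask.tolist())  # [True, False, True]

# Python's `and` asks for a single truth value, which is
# ambiguous for a Series, so Pandas raises a ValueError
try:
    df[(df['price'] > 50) and (df['stock'] < 100)]
except ValueError as e:
    print('ValueError:', e)
```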

Using the query() Method

The query() method accepts conditions as strings, offering cleaner syntax for complex filters. It’s particularly useful when working with multiple conditions or variable column names.

# Basic query
expensive = df.query('price > 100')

# Multiple conditions
target = df.query('price > 50 and stock < 100')

# Using variables with @ prefix
min_price = 100
max_price = 500
price_range = df.query('@min_price <= price <= @max_price')

# Complex queries (avoid shadowing the built-in name `complex`)
complex_query = df.query(
    'category == "Electronics" and (price > 1000 or stock < 20)'
)

The query() method supports Python comparison operators (and, or, not) within the string expression. Use the @ prefix to reference variables from the surrounding scope.

# String matching in query
accessories = df.query('category == "Accessories"')

# In operator for membership testing
categories = ['Electronics', 'Accessories']
filtered = df.query('category in @categories')

Filtering with isin() for Membership Tests

The isin() method efficiently checks if values exist in a provided list, array, or Series. It’s more readable and performant than chaining multiple OR conditions.

# Select specific products
products_of_interest = ['Laptop', 'Monitor', 'Keyboard']
selected = df[df['product'].isin(products_of_interest)]

# Exclude specific categories
unwanted = ['Clearance', 'Discontinued']
active_products = df[~df['category'].isin(unwanted)]

# Using with numeric ranges
valid_prices = [25, 75, 80, 350]
exact_prices = df[df['price'].isin(valid_prices)]

For negation, use the ~ operator before the entire boolean expression. The isin() method works with any iterable and handles missing values gracefully.

Range-Based Filtering with between()

The between() method provides an inclusive range check, equivalent to (value >= lower) & (value <= upper) but more concise.

# Price range selection
mid_range = df[df['price'].between(50, 500)]

# Exclusive bounds using inclusive parameter
exclusive_range = df[df['price'].between(50, 500, inclusive='neither')]

# Only left bound inclusive
left_inclusive = df[df['price'].between(50, 500, inclusive='left')]

# Stock level ranges
optimal_stock = df[df['stock'].between(20, 100)]

The inclusive parameter accepts 'both' (default), 'neither', 'left', or 'right'. This method is particularly useful for date ranges and numeric intervals.
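Since date ranges are mentioned, here is a hypothetical sketch applying between() to a datetime column (the orders DataFrame is invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': pd.to_datetime(
        ['2024-01-05', '2024-02-14', '2024-03-20', '2024-04-02']
    ),
})

# Inclusive range check; string bounds are coerced to Timestamps
q1 = orders[orders['order_date'].between('2024-01-01', '2024-03-31')]
print(q1['order_id'].tolist())  # [1, 2, 3]
```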

String Pattern Matching

For string columns, use str accessor methods to filter based on text patterns. These methods return boolean masks suitable for indexing.

# Create DataFrame with string data
products_df = pd.DataFrame({
    'name': ['Gaming Laptop', 'Wireless Mouse', 'Mechanical Keyboard', 
             'Gaming Monitor', 'USB Webcam'],
    'sku': ['LAP-001', 'MOU-002', 'KEY-003', 'MON-004', 'WEB-005'],
    'price': [1200, 25, 75, 350, 80]
})

# Contains substring
gaming_products = products_df[products_df['name'].str.contains('Gaming')]

# Case-insensitive search
wireless = products_df[
    products_df['name'].str.contains('wireless', case=False)
]

# Starts with pattern
lap_products = products_df[products_df['sku'].str.startswith('LAP')]

# Ends with pattern
webcam_products = products_df[products_df['name'].str.endswith('Webcam')]

# Regular expression matching
usb_or_wireless = products_df[
    products_df['name'].str.contains('USB|Wireless', regex=True)
]

The str.contains() method supports regex patterns by default. Set regex=False for literal string matching or case=False for case-insensitive searches.
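One caveat worth knowing: when the column contains missing values, str.contains() returns NaN for those rows, and a mask containing NaN cannot be used for indexing. The na parameter supplies a fill value. A sketch:

```python
import pandas as pd

names = pd.DataFrame({'name': ['Gaming Laptop', None, 'USB Webcam']})

# na=False turns the NaN result for the missing row into False,
# producing a clean boolean mask
mask = names['name'].str.contains('Gaming', na=False)
print(mask.tolist())  # [True, False, False]

gaming = names[mask]
print(gaming['name'].tolist())  # ['Gaming Laptop']
```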

Using loc and iloc with Conditions

The loc and iloc indexers allow simultaneous row and column selection based on conditions.

# Select specific columns for filtered rows
expensive_details = df.loc[df['price'] > 100, ['product', 'price']]

# Multiple conditions with column selection
target_info = df.loc[
    (df['price'] > 50) & (df['stock'] < 100),
    ['product', 'category', 'stock']
]

# Reorder columns while filtering
reordered = df.loc[
    df['category'] == 'Electronics',
    ['price', 'product', 'stock']
]

While loc uses label-based indexing, iloc uses integer positions. Combine iloc with boolean indexing by first filtering, then selecting column positions.

# Get first two columns of expensive items
expensive_subset = df[df['price'] > 100].iloc[:, :2]

Handling Missing Values in Conditions

Conditions involving NaN values require special handling: comparisons such as ==, <, and > with NaN evaluate to False (while != evaluates to True), so NaN rows silently drop out of most filters.

# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, np.nan, 200, 150],
    'stock': [10, 20, np.nan, 30]
})

# Filter non-null values
valid_prices = df_missing[df_missing['price'].notna()]

# Include only rows with complete data
complete_rows = df_missing[df_missing.notna().all(axis=1)]

# Filter with null-safe comparisons
# This excludes NaN values automatically
expensive = df_missing[df_missing['price'] > 150]

# Explicitly include NaN in results
price_or_null = df_missing[
    (df_missing['price'] > 150) | (df_missing['price'].isna())
]

Use notna() and isna() methods to explicitly check for missing values. Standard comparison operators silently exclude NaN values, which can lead to unexpected results if not anticipated.

Performance Considerations

For large DataFrames, method choice impacts performance significantly. Boolean indexing with NumPy arrays is generally fastest, while query() can be faster for complex conditions due to numexpr optimization.

# Create larger dataset for timing
large_df = pd.DataFrame({
    'value': np.random.randint(0, 1000, 100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

# Boolean indexing - fast for simple conditions
result1 = large_df[large_df['value'] > 500]

# query() - optimized for complex conditions
result2 = large_df.query('value > 500 and category == "A"')

# isin() - optimized for membership tests
result3 = large_df[large_df['category'].isin(['A', 'B'])]

Avoid chaining multiple filter operations when a single compound condition suffices. Each intermediate step creates a new DataFrame, consuming memory and processing time. Use query() for readable complex filters or compound boolean indexing for maximum control.
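To illustrate the point about chaining, a sketch contrasting two equivalent filters; the compound version builds one mask and skips the intermediate DataFrame:

```python
import numpy as np
import pandas as pd

large_df = pd.DataFrame({
    'value': np.random.randint(0, 1000, 100_000),
    'category': np.random.choice(['A', 'B', 'C'], 100_000),
})

# Chained: materializes an intermediate DataFrame for the first condition
chained = large_df[large_df['value'] > 500]
chained = chained[chained['category'] == 'A']

# Compound: one boolean mask, one filtering pass
compound = large_df[(large_df['value'] > 500) & (large_df['category'] == 'A')]

print(len(chained) == len(compound))  # True
```

Both select the same rows with the same index; the compound form simply does it in one step.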
