Pandas - Get Index of Rows Matching Condition
• Use boolean indexing with `.index` to retrieve index values of rows matching conditions, returning an Index object that preserves the original index type and structure
Key Insights
• Use boolean indexing with .index to retrieve index values of rows matching conditions, returning an Index object that preserves the original index type and structure
• The .query() method combined with .index provides a cleaner syntax for complex conditions, especially when working with column names containing spaces or special characters
• For multiple conditions or performance-critical operations, np.where() returns integer positions while boolean masks return actual index labels—choose based on whether you need locations or labels
Basic Boolean Indexing with .index
The most straightforward approach to get indices of matching rows uses boolean indexing combined with the .index attribute. This returns an Index object containing all index labels where the condition evaluates to True.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'price': [1200, 25, 75, 350, 80],
'stock': [15, 120, 45, 8, 60]
})
# Get index of products priced above 100
expensive_items = df[df['price'] > 100].index
print(expensive_items)
# Int64Index([0, 3], dtype='int64')
# Convert to list if needed
print(expensive_items.tolist())
# [0, 3]
This approach works seamlessly with custom indices. When your DataFrame uses non-integer indices, the boolean mask returns the actual index labels:
df_custom = pd.DataFrame({
'temperature': [72, 85, 91, 68, 78],
'humidity': [45, 62, 71, 38, 55]
}, index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
hot_days = df_custom[df_custom['temperature'] > 80].index
print(hot_days)
# Index(['Tue', 'Wed'], dtype='object')
Multiple Conditions with Logical Operators
Combine multiple conditions using & (and), | (or), and ~ (not) operators. Always wrap individual conditions in parentheses due to operator precedence.
df = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
'salary': [65000, 85000, 72000, 90000, 68000],
'years': [3, 7, 5, 10, 4]
})
# IT department with salary above 80000
high_earners_it = df[(df['department'] == 'IT') & (df['salary'] > 80000)].index
print(high_earners_it)
# Int64Index([1, 3], dtype='int64')
# Sales department OR salary above 85000
condition = (df['department'] == 'Sales') | (df['salary'] > 85000)
selected_indices = df[condition].index
print(selected_indices)
# Int64Index([0, 2, 3], dtype='int64')
# NOT in HR department with experience over 5 years
experienced_non_hr = df[~(df['department'] == 'HR') & (df['years'] > 5)].index
print(experienced_non_hr)
# Int64Index([1, 3], dtype='int64')
Using .query() for Complex Conditions
The .query() method offers cleaner syntax for complex filtering, particularly useful when dealing with multiple conditions or column names with spaces.
df = pd.DataFrame({
'product_name': ['Widget A', 'Widget B', 'Gadget C', 'Widget D'],
'unit_price': [15.99, 24.50, 8.75, 31.00],
'units_sold': [150, 89, 320, 45],
'category': ['Tools', 'Tools', 'Electronics', 'Tools']
})
# Complex condition using query
indices = df.query('unit_price > 20 and units_sold < 100').index
print(indices)
# Int64Index([1, 3], dtype='int64')
# Using variables in query with @
min_price = 10
max_units = 200
result_idx = df.query('unit_price > @min_price and units_sold < @max_units').index
print(result_idx)
# Int64Index([1, 3], dtype='int64')
# String operations in query
tools_indices = df.query('category == "Tools" and unit_price < 30').index
print(tools_indices)
# Int64Index([0], dtype='int64')
np.where() for Integer Positions
While boolean indexing returns index labels, np.where() returns integer positions. This is crucial when you need array positions rather than index labels, especially with non-sequential or custom indices.
df = pd.DataFrame({
'sensor_id': ['S001', 'S002', 'S003', 'S004', 'S005'],
'reading': [23.5, 45.2, 19.8, 52.1, 31.7]
}, index=[100, 105, 110, 115, 120])
# Get integer positions (not index labels)
positions = np.where(df['reading'] > 30)[0]
print(positions)
# [1, 3, 4]
# Get actual index labels
labels = df[df['reading'] > 30].index
print(labels)
# Int64Index([105, 115, 120], dtype='int64')
# Use positions to access by iloc
print(df.iloc[positions])
The distinction matters when working with MultiIndex or when you need to manipulate data by position:
# Using positions for slicing
high_reading_positions = np.where(df['reading'] > 30)[0]
first_position = high_reading_positions[0]
subset = df.iloc[first_position:first_position+2]
print(subset)
String Pattern Matching
For text-based conditions, combine string methods with boolean indexing to filter rows based on pattern matching.
df = pd.DataFrame({
'email': ['alice@company.com', 'bob@gmail.com', 'charlie@company.com',
'david@yahoo.com', 'eve@company.com'],
'status': ['active', 'inactive', 'active', 'active', 'pending']
})
# Contains pattern
company_emails = df[df['email'].str.contains('company')].index
print(company_emails)
# Int64Index([0, 2, 4], dtype='int64')
# Starts with pattern
gmail_users = df[df['email'].str.startswith('bob')].index
print(gmail_users)
# Int64Index([1], dtype='int64')
# Regex pattern matching
active_company = df[df['email'].str.contains('@company') &
(df['status'] == 'active')].index
print(active_company)
# Int64Index([0, 2], dtype='int64')
# Case-insensitive matching
pattern_idx = df[df['email'].str.contains('GMAIL', case=False)].index
print(pattern_idx)
# Int64Index([1], dtype='int64')
Working with Missing Values
Handle NaN values explicitly when filtering to avoid unexpected results, as comparisons with NaN always return False.
df = pd.DataFrame({
'transaction_id': range(1, 6),
'amount': [100.0, np.nan, 250.0, np.nan, 175.0],
'verified': [True, False, True, False, True]
})
# Get indices where amount is NOT null
valid_amounts = df[df['amount'].notna()].index
print(valid_amounts)
# Int64Index([0, 2, 4], dtype='int64')
# Get indices where amount IS null
missing_amounts = df[df['amount'].isna()].index
print(missing_amounts)
# Int64Index([1, 3], dtype='int64')
# Combine with other conditions
verified_with_amount = df[df['amount'].notna() & df['verified']].index
print(verified_with_amount)
# Int64Index([0, 2, 4], dtype='int64')
Performance Considerations
For large DataFrames, choose the appropriate method based on your specific use case. Boolean indexing is generally fastest for simple conditions, while .query() can be more efficient for complex expressions due to numexpr optimization.
# Create large DataFrame for testing
large_df = pd.DataFrame({
'value': np.random.randn(1000000),
'category': np.random.choice(['A', 'B', 'C'], 1000000)
})
# Boolean indexing - fast for simple conditions
%timeit large_df[large_df['value'] > 0].index
# Query - optimized for complex expressions
%timeit large_df.query('value > 0 and category == "A"').index
# For repeated operations, store the boolean mask
mask = large_df['value'] > 0
indices = large_df[mask].index # Reuse mask if needed multiple times
When you only need to check if any rows match without retrieving all indices, use .any() for early termination:
# Check if condition matches any row
has_matches = (df['price'] > 1000).any()
# Get indices only if matches exist
if has_matches:
matching_indices = df[df['price'] > 1000].index