Pandas - Drop Rows by Condition
Key Insights
• Pandas offers multiple methods to drop rows based on conditions: boolean indexing with bracket notation, drop() with index labels, and query() for SQL-like syntax—each with distinct performance characteristics for different dataset sizes.
• Understanding the difference between filtering (creating a new DataFrame) and dropping (modifying via index) is critical for memory efficiency, especially when chaining operations or working with large datasets.
• Combining multiple conditions using bitwise operators (&, |, ~) and leveraging isin(), between(), and str methods enables complex row removal patterns without iterative loops.
Basic Row Dropping with Boolean Indexing
Boolean indexing is the most intuitive approach for dropping rows. You build a boolean mask from the drop condition and use it to filter the DataFrame, keeping only the rows where that condition is False.
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Cable'],
'price': [1200, 25, 75, 300, 15],
'stock': [5, 150, 45, 12, 200],
'rating': [4.5, 4.2, 3.8, 4.7, 3.5]
})
# Drop rows where price is less than 50
df_filtered = df[df['price'] >= 50]
print(df_filtered)
product price stock rating
0 Laptop 1200 5 4.5
2 Keyboard 75 45 3.8
3 Monitor 300 12 4.7
The inverse approach uses the NOT operator (~) to explicitly drop rows matching a condition:
# Drop rows where stock is above 100
df_filtered = df[~(df['stock'] > 100)]
print(df_filtered)
This creates a new DataFrame without modifying the original. Note that this is not a true in-place operation; when memory is constrained on large datasets, reassign the result to the same variable so the original DataFrame can be garbage-collected.
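A minimal sketch of that pattern, using a small hypothetical DataFrame: reassignment releases the reference to the old object, and resetting the index removes the gaps left by the removed rows.

```python
import pandas as pd

df = pd.DataFrame({'price': [1200, 25, 75], 'stock': [5, 150, 45]})

# Reassigning to the same name lets the original DataFrame be garbage-collected
df = df[df['price'] >= 50]

# Dropping rows leaves gaps in the index; reset it if positional labels matter
df = df.reset_index(drop=True)
print(df)
```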
Combining Multiple Conditions
Complex filtering requires combining conditions with bitwise operators. Use & for AND, | for OR, and ~ for NOT. Parentheses are mandatory around each condition.
# Drop rows where price < 50 OR rating < 4.0
df_filtered = df[~((df['price'] < 50) | (df['rating'] < 4.0))]
print(df_filtered)
product price stock rating
0 Laptop 1200 5 4.5
3 Monitor 300 12 4.7
For multiple AND conditions:
# Drop rows where price > 100 AND stock < 20
df_filtered = df[~((df['price'] > 100) & (df['stock'] < 20))]
print(df_filtered)
When dealing with many conditions, assign intermediate boolean Series for readability:
expensive = df['price'] > 100
low_stock = df['stock'] < 20
poor_rating = df['rating'] < 4.0
# Drop rows that are expensive AND low-stock, or that are poorly rated
# (& binds tighter than |; explicit parentheses make the intent unambiguous)
df_filtered = df[~((expensive & low_stock) | poor_rating)]
Using drop() with Index-Based Filtering
The drop() method removes rows by index labels. Combine it with boolean indexing to get index positions first:
# Get indices where condition is True
indices_to_drop = df[df['price'] < 50].index
# Drop those rows
df_dropped = df.drop(indices_to_drop)
print(df_dropped)
This approach is useful when you need to inspect which rows will be dropped before removal:
# Identify rows to drop
rows_to_remove = df[(df['stock'] > 100) & (df['price'] < 30)]
print(f"Removing {len(rows_to_remove)} rows:")
print(rows_to_remove)
# Perform the drop
df_cleaned = df.drop(rows_to_remove.index)
For in-place modification:
df.drop(indices_to_drop, inplace=True)
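One caveat: if some labels may already be absent (for example, after an earlier drop), passing errors='ignore' makes drop() skip missing labels instead of raising a KeyError. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'price': [1200, 25, 75]})

# Label 99 is not in the index; errors='ignore' silently skips it
df = df.drop([1, 99], errors='ignore')
print(df)
```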
Query Method for Readable Filtering
The query() method provides SQL-like syntax for filtering, improving readability with complex conditions:
# Drop rows where price < 50 or stock > 100
df_filtered = df.query('price >= 50 and stock <= 100')
print(df_filtered)
Query syntax supports variables using @ prefix:
min_price = 50
max_stock = 100
df_filtered = df.query('price >= @min_price and stock <= @max_stock')
For string columns:
# Drop rows where product name contains "Cable"
df_filtered = df.query('product != "Cable"')
# Using string methods
df_filtered = df[~df['product'].str.contains('Cable')]
The query() method can be faster on large datasets because it evaluates expressions with the numexpr engine when available, but it doesn't support every pandas operation.
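One such limitation: column names that are not valid Python identifiers (spaces, punctuation) must be wrapped in backticks inside the expression. A sketch with a hypothetical "unit price" column:

```python
import pandas as pd

df = pd.DataFrame({'unit price': [1200, 25, 75], 'stock': [5, 150, 45]})

# Backticks let query() reference columns whose names contain spaces
result = df.query('`unit price` >= 50')
print(result)
```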
Dropping Rows with Missing Values
Handling NaN values is a common row-dropping scenario:
df_with_nan = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Drop rows with any NaN
df_cleaned = df_with_nan.dropna()
# Drop rows where all values are NaN
df_cleaned = df_with_nan.dropna(how='all')
# Drop rows with NaN in specific columns
df_cleaned = df_with_nan.dropna(subset=['A', 'B'])
Combine with conditions:
# Drop rows where A is NaN OR B > 6
mask = df_with_nan['A'].isna() | (df_with_nan['B'] > 6)
df_cleaned = df_with_nan[~mask]
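dropna() also accepts a thresh parameter for partial tolerance, keeping only rows with at least that many non-NaN values. A sketch on the same df_with_nan:

```python
import pandas as pd
import numpy as np

df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Keep rows with at least 2 non-NaN values; the row with two NaNs is dropped
df_cleaned = df_with_nan.dropna(thresh=2)
print(df_cleaned)
```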
Advanced Filtering with isin() and between()
The isin() method efficiently filters rows against a list of values:
# Drop rows where product is in a specific list
products_to_remove = ['Mouse', 'Cable']
df_filtered = df[~df['product'].isin(products_to_remove)]
print(df_filtered)
product price stock rating
0 Laptop 1200 5 4.5
2 Keyboard 75 45 3.8
3 Monitor 300 12 4.7
The between() method simplifies range filtering:
# Drop rows where price is between 20 and 100 (inclusive)
df_filtered = df[~df['price'].between(20, 100, inclusive='both')]
print(df_filtered)
Combine both for complex patterns:
# Drop rows where price is in range OR product is in list
price_range = df['price'].between(10, 80)
product_list = df['product'].isin(['Monitor'])
df_filtered = df[~(price_range | product_list)]
String-Based Row Filtering
For text data, pandas string methods enable pattern-based dropping:
df_text = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Smith'],
'email': ['john@example.com', 'jane@test.com', 'bob@example.com', 'alice@other.com']
})
# Drop rows where name contains "Smith"
df_filtered = df_text[~df_text['name'].str.contains('Smith')]
# Drop rows where email ends with "example.com"
df_filtered = df_text[~df_text['email'].str.endswith('example.com')]
# Case-insensitive matching
df_filtered = df_text[~df_text['name'].str.contains('smith', case=False)]
# Regex patterns
df_filtered = df_text[~df_text['email'].str.match(r'.*@example\.com$')]
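If the text column contains missing values, str.contains returns NaN for them, which cannot be used directly as a boolean mask. Passing na=False fills those positions with False so negation keeps the rows; a sketch with a hypothetical NaN entry:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['John Smith', np.nan, 'Jane Doe']})

# na=False treats missing names as non-matches, so ~mask keeps those rows
mask = df['name'].str.contains('Smith', na=False)
df_filtered = df[~mask]
print(df_filtered)
```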
Performance Considerations
For large datasets, method choice impacts performance:
import time
# Create large dataset
large_df = pd.DataFrame({
'A': np.random.randint(0, 100, 1000000),
'B': np.random.randint(0, 100, 1000000)
})
# Boolean indexing
start = time.perf_counter()
result1 = large_df[large_df['A'] > 50]
print(f"Boolean indexing: {time.perf_counter() - start:.4f}s")
# Query method
start = time.perf_counter()
result2 = large_df.query('A > 50')
print(f"Query method: {time.perf_counter() - start:.4f}s")
# Drop method
start = time.perf_counter()
indices = large_df[large_df['A'] <= 50].index
result3 = large_df.drop(indices)
print(f"Drop method: {time.perf_counter() - start:.4f}s")
Boolean indexing typically offers the best balance of readability and performance. The query() method can be faster for complex conditions due to numexpr optimization. The drop() method has overhead from index creation but is clearer when you need to track removed rows.
For memory efficiency with method chaining, avoid creating intermediate DataFrames:
# Less efficient - creates multiple intermediate DataFrames
df_temp = df[df['price'] > 50]
df_temp = df_temp[df_temp['stock'] < 100]
result = df_temp[df_temp['rating'] > 4.0]
# More efficient - single boolean mask
result = df[(df['price'] > 50) & (df['stock'] < 100) & (df['rating'] > 4.0)]
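When the number of conditions grows beyond a handful, np.logical_and.reduce (or functools.reduce) combines an arbitrary list of masks without one long chained expression. A sketch reusing the sample columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [1200, 25, 75, 300, 15],
    'stock': [5, 150, 45, 12, 200],
    'rating': [4.5, 4.2, 3.8, 4.7, 3.5]
})

# Combine any number of boolean Series with a single AND
masks = [df['price'] > 50, df['stock'] < 100, df['rating'] > 4.0]
result = df[np.logical_and.reduce(masks)]
print(result)
```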