Pandas - Select Rows Using isin()
• The `isin()` method filters DataFrame rows by checking if column values exist in a specified list, array, or set, providing a cleaner alternative to multiple OR conditions
Key Insights
• The isin() method filters DataFrame rows by checking if column values exist in a specified list, array, or set, providing a cleaner alternative to multiple OR conditions
• Combining isin() with boolean indexing, ~ negation, and multi-column filtering enables complex row selection patterns with minimal code
• Performance optimization requires understanding when to use isin() versus alternatives like query() or direct boolean indexing, especially with large datasets
Basic Row Selection with isin()
The isin() method checks whether each element in a DataFrame column exists in a given sequence. It returns a boolean Series that you can use for filtering rows.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
'price': [1200, 25, 75, 350, 80],
'stock': [15, 150, 90, 30, 45]
})
# Select rows where product is in the list
selected_products = ['Laptop', 'Monitor', 'Webcam']
filtered_df = df[df['product'].isin(selected_products)]
print(filtered_df)
product category price stock
0 Laptop Electronics 1200 15
3 Monitor Electronics 350 30
4 Webcam Accessories 80 45
This approach is significantly cleaner than chaining multiple OR conditions:
# Without isin() - verbose and error-prone
filtered_df = df[(df['product'] == 'Laptop') |
(df['product'] == 'Monitor') |
(df['product'] == 'Webcam')]
# With isin() - concise and readable
filtered_df = df[df['product'].isin(selected_products)]
Negating isin() with the Tilde Operator
Use the ~ operator to select rows where values are NOT in the specified list:
# Select all products except specific ones
excluded_products = ['Mouse', 'Keyboard']
df_filtered = df[~df['product'].isin(excluded_products)]
print(df_filtered)
product category price stock
0 Laptop Electronics 1200 15
3 Monitor Electronics 350 30
4 Webcam Accessories 80 45
This pattern is particularly useful when you want to exclude outliers or filter out specific categories:
# Exclude low-stock items
low_stock_threshold = [15, 25, 30]
high_stock_items = df[~df['stock'].isin(low_stock_threshold)]
Multi-Column Filtering with isin()
Apply isin() across multiple columns by combining boolean conditions:
# Create a more complex DataFrame
df_sales = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'North', 'South'],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Laptop'],
'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3'],
'revenue': [50000, 5000, 8000, 30000, 12000, 48000]
})
# Filter by multiple columns
target_regions = ['North', 'South']
target_products = ['Laptop', 'Webcam']
filtered_sales = df_sales[
df_sales['region'].isin(target_regions) &
df_sales['product'].isin(target_products)
]
print(filtered_sales)
region product quarter revenue
0 North Laptop Q1 50000
4 North Webcam Q3 12000
5 South Laptop Q3 48000
Using isin() with Different Data Types
The isin() method works with various data types including integers, floats, and datetime objects:
# Numeric filtering
df_inventory = pd.DataFrame({
'item_id': [101, 102, 103, 104, 105],
'quantity': [50, 0, 25, 0, 100],
'reorder_level': [20, 15, 30, 25, 50]
})
# Find items with specific IDs
target_ids = [101, 103, 105]
selected_items = df_inventory[df_inventory['item_id'].isin(target_ids)]
# Date filtering
df_orders = pd.DataFrame({
'order_id': range(1, 6),
'order_date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10',
'2024-04-05', '2024-05-12']),
'amount': [150, 200, 175, 300, 225]
})
# Filter by specific months
target_dates = pd.to_datetime(['2024-01-15', '2024-03-10', '2024-05-12'])
specific_orders = df_orders[df_orders['order_date'].isin(target_dates)]
print(specific_orders)
order_id order_date amount
0 1 2024-01-15 150
2 3 2024-03-10 175
4 5 2024-05-12 225
isin() with Sets and NumPy Arrays
While lists are common, isin() accepts any iterable including sets and NumPy arrays. Sets can offer performance benefits for large lookup collections:
# Using a set for faster lookups with large data
large_product_list = set(['Product_' + str(i) for i in range(1000)])
df_large = pd.DataFrame({
'product_code': ['Product_' + str(i) for i in range(0, 2000, 2)],
'sales': np.random.randint(100, 1000, 1000)
})
# Filter using set
filtered_large = df_large[df_large['product_code'].isin(large_product_list)]
# Using NumPy array
target_values = np.array([100, 200, 300, 400, 500])
df_numeric = pd.DataFrame({
'value': np.random.randint(0, 600, 100)
})
filtered_numeric = df_numeric[df_numeric['value'].isin(target_values)]
Combining isin() with Other Filtering Methods
Integrate isin() with standard boolean indexing for sophisticated queries:
df_employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales'],
'salary': [60000, 75000, 65000, 55000, 80000, 62000],
'years': [3, 5, 2, 7, 4, 6]
})
# Complex filtering: specific departments AND salary above threshold
target_depts = ['Sales', 'IT']
filtered_employees = df_employees[
df_employees['department'].isin(target_depts) &
(df_employees['salary'] > 60000) &
(df_employees['years'] >= 3)
]
print(filtered_employees)
name department salary years
1 Bob IT 75000 5
2 Charlie Sales 65000 2
4 Eve IT 80000 4
5 Frank Sales 62000 6
Performance Considerations
For large DataFrames, isin() generally outperforms multiple OR conditions. However, understanding performance characteristics helps optimize queries:
import time
# Create large DataFrame
df_large = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000),
'value': np.random.randn(1000000)
})
# Method 1: Using isin()
start = time.time()
result1 = df_large[df_large['category'].isin(['A', 'C', 'E'])]
time_isin = time.time() - start
# Method 2: Multiple OR conditions
start = time.time()
result2 = df_large[(df_large['category'] == 'A') |
(df_large['category'] == 'C') |
(df_large['category'] == 'E')]
time_or = time.time() - start
print(f"isin() time: {time_isin:.4f}s")
print(f"OR conditions time: {time_or:.4f}s")
For extremely large lookup lists (10,000+ items), converting the lookup collection to a set before passing to isin() can provide marginal performance improvements.
Handling Missing Values with isin()
The isin() method treats NaN values distinctly. By default, NaN does not match NaN:
df_with_nan = pd.DataFrame({
'product': ['Laptop', 'Mouse', np.nan, 'Monitor', 'Webcam'],
'price': [1200, 25, 75, np.nan, 80]
})
# NaN won't match even if NaN is in the list
result = df_with_nan[df_with_nan['product'].isin(['Laptop', np.nan])]
print(result) # Only returns Laptop row
# To include NaN rows, combine with isna()
result_with_nan = df_with_nan[
df_with_nan['product'].isin(['Laptop']) |
df_with_nan['product'].isna()
]
print(result_with_nan)
product price
0 Laptop 1200.0
2 NaN 75.0
Practical Application: Data Validation
Use isin() to validate data against allowed values:
# Define valid categories
VALID_CATEGORIES = ['Electronics', 'Accessories', 'Software', 'Hardware']
VALID_REGIONS = ['North', 'South', 'East', 'West']
df_products = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Unknown Item'],
'category': ['Electronics', 'Accessories', 'Invalid'],
'region': ['North', 'South', 'Central']
})
# Find invalid entries
invalid_categories = df_products[~df_products['category'].isin(VALID_CATEGORIES)]
invalid_regions = df_products[~df_products['region'].isin(VALID_REGIONS)]
print("Invalid categories:")
print(invalid_categories)
print("\nInvalid regions:")
print(invalid_regions)
This validation pattern helps identify data quality issues during ETL processes or data ingestion pipelines.