Pandas - Select Rows Containing String
Key Insights
- Use str.contains() with boolean indexing to filter DataFrame rows based on partial string matches, with support for regex patterns and case-insensitive searches
- Combine multiple string conditions using & (AND) and | (OR) operators, or use str.contains() with regex alternation for complex filtering scenarios
- Handle missing values explicitly with the na parameter to avoid unexpected results when filtering columns containing NaN values
Basic String Filtering with str.contains()
The most straightforward method to select rows containing a specific string uses the str.contains() method combined with boolean indexing. This approach works on any column containing string data.
import pandas as pd
# Create sample DataFrame
df = pd.DataFrame({
'product': ['iPhone 14', 'Samsung Galaxy', 'iPhone 13', 'Google Pixel', 'iPad Pro'],
'category': ['Smartphone', 'Smartphone', 'Smartphone', 'Smartphone', 'Tablet'],
'price': [999, 899, 799, 699, 1099]
})
# Select rows where 'product' contains 'iPhone'
iphone_rows = df[df['product'].str.contains('iPhone')]
print(iphone_rows)
Output:
product category price
0 iPhone 14 Smartphone 999
2 iPhone 13 Smartphone 799
The str.contains() method returns a boolean Series indicating which rows match the pattern. This boolean mask is then used to filter the DataFrame.
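The intermediate boolean mask can be inspected directly before filtering; a minimal sketch reusing the product data above:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['iPhone 14', 'Samsung Galaxy', 'iPhone 13', 'Google Pixel', 'iPad Pro'],
})

# str.contains() returns a boolean Series aligned with the DataFrame index
mask = df['product'].str.contains('iPhone')
print(mask.tolist())  # [True, False, True, False, False]

# Passing the mask to df[...] keeps only the rows where the mask is True
print(df[mask]['product'].tolist())  # ['iPhone 14', 'iPhone 13']
```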
Case-Insensitive String Matching
By default, str.contains() is case-sensitive. Use the case parameter to perform case-insensitive searches.
df = pd.DataFrame({
'email': ['user@GMAIL.com', 'admin@yahoo.com', 'contact@Gmail.com', 'info@outlook.com'],
'status': ['active', 'inactive', 'active', 'active']
})
# Case-sensitive search (default)
gmail_sensitive = df[df['email'].str.contains('gmail')]
print("Case-sensitive:", len(gmail_sensitive)) # Returns 0
# Case-insensitive search
gmail_insensitive = df[df['email'].str.contains('gmail', case=False)]
print(gmail_insensitive)
Output:
Case-sensitive: 0
email status
0 user@GMAIL.com active
2 contact@Gmail.com active
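An equivalent case-insensitive search can pass re.IGNORECASE through the flags parameter, which is handy when you want to combine several regex flags at once; a sketch reusing the same email data:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'email': ['user@GMAIL.com', 'admin@yahoo.com', 'contact@Gmail.com', 'info@outlook.com'],
})

# flags is forwarded to the re module, so re.IGNORECASE behaves like case=False
mask = df['email'].str.contains('gmail', flags=re.IGNORECASE)
print(mask.sum())  # 2 matching rows
```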
Handling Missing Values
When your DataFrame contains NaN values, str.contains() will return NaN for those rows by default, which can cause issues with boolean indexing. Use the na parameter to control this behavior.
df = pd.DataFrame({
'description': ['New product launch', None, 'Special offer', 'Product update', ''],
'id': [1, 2, 3, 4, 5]
})
# Without na handling, the mask itself contains NaN, which boolean indexing rejects
try:
    result = df[df['description'].str.contains('product')]
except ValueError as e:
    print(f"Error: {type(e).__name__}")  # Error: ValueError
# Exclude NaN rows (na=False)
result = df[df['description'].str.contains('product', case=False, na=False)]
print(result)
Output:
description id
0 New product launch 1
3 Product update 4
Setting na=False treats NaN values as non-matches, while na=True would treat them as matches.
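The three behaviors can be compared side by side; a sketch using the same description column:

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['New product launch', None, 'Special offer', 'Product update', ''],
})

# With no na argument, the missing value propagates into the mask
raw = df['description'].str.contains('product', case=False)
print(raw.isna().sum())  # 1 missing entry in the mask

# na=False and na=True replace that missing entry before indexing
print(df['description'].str.contains('product', case=False, na=False).sum())  # 2
print(df['description'].str.contains('product', case=False, na=True).sum())   # 3
```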
Using Regex Patterns for Advanced Matching
The str.contains() method supports regular expressions by default, enabling complex pattern matching.
df = pd.DataFrame({
'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '5551234567', 'invalid'],
'customer': ['John', 'Jane', 'Bob', 'Alice', 'Charlie']
})
# Match phone numbers with different formats using regex
pattern = r'\d{3}[-.)]\s*\d{3}[-.)]\s*\d{4}|\(\d{3}\)\s*\d{3}-\d{4}'
valid_phones = df[df['phone'].str.contains(pattern, na=False)]
print(valid_phones)
# Match email domains
df_email = pd.DataFrame({
'email': ['user@gmail.com', 'admin@company.org', 'test@yahoo.co.uk', 'info@test.com']
})
# Find emails ending with .com or .org
pattern = r'\.(com|org)$'
result = df_email[df_email['email'].str.contains(pattern)]
print(result)
Output:
phone customer
0 123-456-7890 John
1 (555) 123-4567 Jane
2 555.123.4567 Bob
email
0 user@gmail.com
1 admin@company.org
3 info@test.com
To disable regex and search for literal strings (useful when your search string contains special regex characters), set regex=False.
df = pd.DataFrame({
'code': ['item.001', 'item.002', 'item+003', 'special.item']
})
# Search for literal dot (not regex wildcard)
result = df[df['code'].str.contains('item.', regex=False)]
print(result)
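When the literal search term is built at runtime, re.escape() is an alternative to regex=False: it escapes any regex metacharacters in the term, which lets you still combine the literal text with regex features such as anchors. A sketch reusing the same codes:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'code': ['item.001', 'item.002', 'item+003', 'special.item']
})

term = 'item+'          # '+' is a regex quantifier, so it must be escaped
escaped = re.escape(term)

# Anchored search: codes that start with the literal term
mask = df['code'].str.contains('^' + escaped)
print(df[mask]['code'].tolist())  # ['item+003']
```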
Combining Multiple String Conditions
Combine multiple string filters using logical operators & (AND), | (OR), and ~ (NOT).
df = pd.DataFrame({
'title': ['Python Developer', 'Java Developer', 'Python Data Scientist',
'Senior Python Engineer', 'JavaScript Developer'],
'location': ['Remote', 'New York', 'Remote', 'San Francisco', 'Remote'],
'salary': [90000, 85000, 95000, 120000, 80000]
})
# AND condition: Python jobs that are Remote
python_remote = df[
(df['title'].str.contains('Python', case=False)) &
(df['location'] == 'Remote')
]
print("Python Remote Jobs:")
print(python_remote)
# OR condition: Python or JavaScript positions
dev_jobs = df[
df['title'].str.contains('Python|JavaScript', case=False)
]
print("\nPython or JavaScript Jobs:")
print(dev_jobs)
# NOT condition: Non-remote positions
office_jobs = df[~(df['location'] == 'Remote')]
print("\nOffice Jobs:")
print(office_jobs)
Output:
Python Remote Jobs:
title location salary
0 Python Developer Remote 90000
2 Python Data Scientist Remote 95000
Python or JavaScript Jobs:
title location salary
0 Python Developer Remote 90000
2 Python Data Scientist Remote 95000
3 Senior Python Engineer San Francisco 120000
4 JavaScript Developer Remote 80000
Office Jobs:
title location salary
1 Java Developer New York 85000
3 Senior Python Engineer San Francisco 120000
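The ~ operator also negates string masks directly, not just equality checks; a sketch excluding Python roles from the same job data:

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Python Developer', 'Java Developer', 'Python Data Scientist',
              'Senior Python Engineer', 'JavaScript Developer'],
})

# ~ inverts the boolean mask: rows whose title does NOT mention Python
non_python = df[~df['title'].str.contains('Python', case=False)]
print(non_python['title'].tolist())  # ['Java Developer', 'JavaScript Developer']
```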
Searching Across Multiple Columns
To search for a string across multiple columns, combine conditions or use apply() with any().
df = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
'email': ['john@email.com', 'jane@smith.com', 'bob@email.com'],
'notes': ['Contact Smith Corp', 'Follow up needed', 'Smith Industries client']
})
# Search for 'Smith' in any column
search_term = 'Smith'
mask = (
df['name'].str.contains(search_term, case=False, na=False) |
df['email'].str.contains(search_term, case=False, na=False) |
df['notes'].str.contains(search_term, case=False, na=False)
)
result = df[mask]
print(result)
# Alternative using apply
def contains_in_row(row, term):
return row.astype(str).str.contains(term, case=False).any()
result_apply = df[df.apply(lambda row: contains_in_row(row, 'Smith'), axis=1)]
print(result_apply)
Output:
name email notes
0 John Smith john@email.com Contact Smith Corp
1 Jane Doe jane@smith.com Follow up needed
2 Bob Johnson bob@email.com Smith Industries client
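The row-wise apply above can also be expressed column-wise, which lets each column be searched with a vectorized str.contains() call before the per-column masks are reduced with any(axis=1); a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
    'email': ['john@email.com', 'jane@smith.com', 'bob@email.com'],
    'notes': ['Contact Smith Corp', 'Follow up needed', 'Smith Industries client']
})

# Build one boolean Series per column, then OR across columns for a row mask
mask = df.apply(
    lambda col: col.astype(str).str.contains('Smith', case=False, na=False)
).any(axis=1)
print(df[mask]['name'].tolist())  # all three rows match
```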
Performance Optimization for Large DataFrames
When working with large datasets, optimize string searches by compiling regex patterns and using appropriate data types.
import re
# Create large DataFrame
df = pd.DataFrame({
'text': ['sample text'] * 100000 + ['special case'] * 100000
})
# Compile regex pattern for reuse
pattern = re.compile(r'special', re.IGNORECASE)
# Method 1: Using compiled pattern
result1 = df[df['text'].str.contains(pattern, na=False)]
# Method 2: Convert to categorical for repeated filtering
df['text'] = df['text'].astype('category')
result2 = df[df['text'].str.contains('special', case=False, na=False)]
# For exact matches, use == instead of str.contains()
exact_match = df[df['text'] == 'special case'] # Much faster
print(f"Found {len(result1)} rows")
For exact string matching, direct comparison with == is significantly faster than str.contains(). Reserve str.contains() for partial matches and pattern matching scenarios where its flexibility is necessary.
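A quick way to verify this on your own machine is a timeit comparison; the sketch below uses a smaller DataFrame, and the absolute numbers will vary with hardware and pandas version:

```python
import timeit
import pandas as pd

df = pd.DataFrame({'text': ['sample text'] * 50000 + ['special case'] * 50000})

# Both expressions produce the same mask here, since the full string is the pattern
t_contains = timeit.timeit(lambda: df['text'].str.contains('special case'), number=5)
t_exact = timeit.timeit(lambda: df['text'] == 'special case', number=5)
print(f"str.contains: {t_contains:.4f}s, exact ==: {t_exact:.4f}s")
```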