Pandas - Select Rows Containing String

Key Insights

  • Use str.contains() with boolean indexing to filter DataFrame rows based on partial string matches, with support for regex patterns and case-insensitive searches
  • Combine multiple string conditions using & (AND) and | (OR) operators, or use str.contains() with regex alternation for complex filtering scenarios
  • Handle missing values explicitly with the na parameter to avoid unexpected results when filtering columns containing NaN values

Basic String Filtering with str.contains()

The most straightforward method to select rows containing a specific string uses the str.contains() method combined with boolean indexing. This approach works on any column containing string data.

import pandas as pd

# Create sample DataFrame
df = pd.DataFrame({
    'product': ['iPhone 14', 'Samsung Galaxy', 'iPhone 13', 'Google Pixel', 'iPad Pro'],
    'category': ['Smartphone', 'Smartphone', 'Smartphone', 'Smartphone', 'Tablet'],
    'price': [999, 899, 799, 699, 1099]
})

# Select rows where 'product' contains 'iPhone'
iphone_rows = df[df['product'].str.contains('iPhone')]
print(iphone_rows)

Output:

     product    category  price
0  iPhone 14  Smartphone    999
2  iPhone 13  Smartphone    799

The str.contains() method returns a boolean Series indicating which rows match the pattern. This boolean mask is then used to filter the DataFrame.
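The intermediate mask can be inspected on its own before indexing; a quick sketch using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['iPhone 14', 'Samsung Galaxy', 'iPhone 13', 'Google Pixel', 'iPad Pro'],
    'category': ['Smartphone', 'Smartphone', 'Smartphone', 'Smartphone', 'Tablet'],
    'price': [999, 899, 799, 699, 1099]
})

# str.contains() produces the boolean mask itself -- one entry per row
mask = df['product'].str.contains('iPhone')
print(mask.tolist())  # [True, False, True, False, False]
```

Printing the mask is a useful debugging step when a filter returns fewer (or more) rows than expected.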

Case-Insensitive String Matching

By default, str.contains() is case-sensitive. Use the case parameter to perform case-insensitive searches.

df = pd.DataFrame({
    'email': ['user@GMAIL.com', 'admin@yahoo.com', 'contact@Gmail.com', 'info@outlook.com'],
    'status': ['active', 'inactive', 'active', 'active']
})

# Case-sensitive search (default)
gmail_sensitive = df[df['email'].str.contains('gmail')]
print("Case-sensitive:", len(gmail_sensitive))  # Returns 0

# Case-insensitive search
gmail_insensitive = df[df['email'].str.contains('gmail', case=False)]
print(gmail_insensitive)

Output:

Case-sensitive: 0
               email  status
0     user@GMAIL.com  active
2  contact@Gmail.com  active
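An equivalent case-insensitive search can pass a standard `re` flag through the flags parameter instead of case=False; a sketch with the same data:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'email': ['user@GMAIL.com', 'admin@yahoo.com', 'contact@Gmail.com', 'info@outlook.com'],
    'status': ['active', 'inactive', 'active', 'active']
})

# flags accepts the standard re module flags, e.g. re.IGNORECASE
gmail = df[df['email'].str.contains('gmail', flags=re.IGNORECASE)]
print(len(gmail))  # 2
```

This form is handy when you also need other flags such as re.MULTILINE.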

Handling Missing Values

When your DataFrame contains NaN values, str.contains() returns NaN for those rows by default, and boolean indexing with a mask containing NaN raises a ValueError. Use the na parameter to control this behavior.

df = pd.DataFrame({
    'description': ['New product launch', None, 'Special offer', 'Product update', ''],
    'id': [1, 2, 3, 4, 5]
})

# Without na handling, the NaN in the mask makes boolean indexing fail
try:
    result = df[df['description'].str.contains('product')]
except Exception as e:
    print(f"Error: {type(e).__name__}")

# Exclude NaN rows (na=False)
result = df[df['description'].str.contains('product', case=False, na=False)]
print(result)

Output:

          description  id
0  New product launch   1
3      Product update   4

Setting na=False treats NaN values as non-matches, while na=True would treat them as matches.
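To see the difference, a sketch with na=True, which keeps the None row in the result:

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['New product launch', None, 'Special offer', 'Product update', ''],
    'id': [1, 2, 3, 4, 5]
})

# na=True: missing values count as matches, so the row with id=2 survives
result = df[df['description'].str.contains('product', case=False, na=True)]
print(result['id'].tolist())  # [1, 2, 4]
```

na=True is rarely what you want for filtering, but it can be useful when NaN means "unknown, keep for manual review".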

Using Regex Patterns for Advanced Matching

The str.contains() method supports regular expressions by default, enabling complex pattern matching.

df = pd.DataFrame({
    'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '5551234567', 'invalid'],
    'customer': ['John', 'Jane', 'Bob', 'Alice', 'Charlie']
})

# Match dash- or dot-separated numbers, or the (XXX) XXX-XXXX format
pattern = r'\d{3}[-.]\d{3}[-.]\d{4}|\(\d{3}\)\s*\d{3}-\d{4}'
valid_phones = df[df['phone'].str.contains(pattern, na=False)]
print(valid_phones)

# Match email domains
df_email = pd.DataFrame({
    'email': ['user@gmail.com', 'admin@company.org', 'test@yahoo.co.uk', 'info@test.com']
})

# Find emails ending with .com or .org
pattern = r'\.(com|org)$'
result = df_email[df_email['email'].str.contains(pattern)]
print(result)

Output:

            phone  customer
0    123-456-7890      John
1  (555) 123-4567      Jane
2    555.123.4567       Bob

               email
0     user@gmail.com
1  admin@company.org
3      info@test.com
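For plain suffix checks like the .com/.org filter above, the regex can be avoided entirely: str.endswith() accepts a tuple of literal suffixes (pandas 1.4+). A sketch:

```python
import pandas as pd

df_email = pd.DataFrame({
    'email': ['user@gmail.com', 'admin@company.org', 'test@yahoo.co.uk', 'info@test.com']
})

# str.endswith with a tuple of literal suffixes -- no regex involved
result = df_email[df_email['email'].str.endswith(('.com', '.org'))]
print(result['email'].tolist())
```

str.startswith() supports tuples the same way for prefix checks.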

To disable regex and search for literal strings (useful when your search string contains special regex characters), set regex=False.

df = pd.DataFrame({
    'code': ['item.001', 'item.002', 'item+003', 'special.item']
})

# Search for a literal dot (not the regex any-character wildcard)
result = df[df['code'].str.contains('item.', regex=False)]
print(result)  # matches 'item.001' and 'item.002' only
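When a literal fragment has to be combined with regex features, re.escape() neutralizes the special characters in the literal part; a sketch using the same codes:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'code': ['item.001', 'item.002', 'item+003', 'special.item']
})

# Escape the literal 'item.' prefix, then append a regex quantifier for digits
pattern = re.escape('item.') + r'\d+'
result = df[df['code'].str.contains(pattern)]
print(result['code'].tolist())  # ['item.001', 'item.002']
```

This gives you literal safety for user-supplied search terms while keeping regex power where you need it.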

Combining Multiple String Conditions

Combine multiple string filters using logical operators & (AND), | (OR), and ~ (NOT).

df = pd.DataFrame({
    'title': ['Python Developer', 'Java Developer', 'Python Data Scientist', 
              'Senior Python Engineer', 'JavaScript Developer'],
    'location': ['Remote', 'New York', 'Remote', 'San Francisco', 'Remote'],
    'salary': [90000, 85000, 95000, 120000, 80000]
})

# AND condition: Python jobs that are Remote
python_remote = df[
    (df['title'].str.contains('Python', case=False)) & 
    (df['location'] == 'Remote')
]
print("Python Remote Jobs:")
print(python_remote)

# OR condition: Python or JavaScript positions
dev_jobs = df[
    df['title'].str.contains('Python|JavaScript', case=False)
]
print("\nPython or JavaScript Jobs:")
print(dev_jobs)

# NOT condition: Non-remote positions
office_jobs = df[~(df['location'] == 'Remote')]
print("\nOffice Jobs:")
print(office_jobs)

Output:

Python Remote Jobs:
                   title  location  salary
0       Python Developer    Remote   90000
2  Python Data Scientist    Remote   95000

Python or JavaScript Jobs:
                    title       location  salary
0        Python Developer         Remote   90000
2   Python Data Scientist         Remote   95000
3  Senior Python Engineer  San Francisco  120000
4    JavaScript Developer         Remote   80000

Office Jobs:
                    title       location  salary
1          Java Developer       New York   85000
3  Senior Python Engineer  San Francisco  120000
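The ~ operator composes with str.contains() itself, not just with equality tests; a sketch excluding Python roles from the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Python Developer', 'Java Developer', 'Python Data Scientist',
              'Senior Python Engineer', 'JavaScript Developer'],
    'location': ['Remote', 'New York', 'Remote', 'San Francisco', 'Remote'],
    'salary': [90000, 85000, 95000, 120000, 80000]
})

# ~ inverts the boolean mask: rows whose title does NOT mention Python
non_python = df[~df['title'].str.contains('Python', case=False)]
print(non_python['title'].tolist())  # ['Java Developer', 'JavaScript Developer']
```

When the column can contain NaN, add na=False before inverting, otherwise NaN rows end up included.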

Searching Across Multiple Columns

To search for a string across multiple columns, combine conditions or use apply() with any().

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
    'email': ['john@email.com', 'jane@smith.com', 'bob@email.com'],
    'notes': ['Contact Smith Corp', 'Follow up needed', 'Smith Industries client']
})

# Search for 'Smith' in any column
search_term = 'Smith'
mask = (
    df['name'].str.contains(search_term, case=False, na=False) |
    df['email'].str.contains(search_term, case=False, na=False) |
    df['notes'].str.contains(search_term, case=False, na=False)
)
result = df[mask]
print(result)

# Alternative using apply
def contains_in_row(row, term):
    return row.astype(str).str.contains(term, case=False).any()

result_apply = df[df.apply(lambda row: contains_in_row(row, 'Smith'), axis=1)]
print(result_apply)

Output:

          name           email                    notes
0   John Smith  john@email.com       Contact Smith Corp
1     Jane Doe  jane@smith.com         Follow up needed
2  Bob Johnson   bob@email.com  Smith Industries client
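A column-wise variant of the same search typically outperforms the row-wise apply(), because each column is processed as one vectorized str.contains() call rather than one call per row; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
    'email': ['john@email.com', 'jane@smith.com', 'bob@email.com'],
    'notes': ['Contact Smith Corp', 'Follow up needed', 'Smith Industries client']
})

# Apply str.contains per column (a boolean DataFrame results),
# then collapse it with any(axis=1): True if any column matched
mask = df.apply(
    lambda col: col.astype(str).str.contains('Smith', case=False, na=False)
).any(axis=1)
print(df[mask]['name'].tolist())  # ['John Smith', 'Jane Doe', 'Bob Johnson']
```

The astype(str) step guards against non-string columns, at the cost of also matching against stringified numbers.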

Performance Optimization for Large DataFrames

When working with large datasets, optimize string searches by compiling regex patterns and using appropriate data types.

import re

# Create large DataFrame
df = pd.DataFrame({
    'text': ['sample text'] * 100000 + ['special case'] * 100000
})

# Compile regex pattern for reuse
pattern = re.compile(r'special', re.IGNORECASE)

# Method 1: Using compiled pattern
result1 = df[df['text'].str.contains(pattern, na=False)]

# Method 2: Convert to categorical for repeated filtering
df['text'] = df['text'].astype('category')
result2 = df[df['text'].str.contains('special', case=False, na=False)]

# For exact matches, use == instead of str.contains()
exact_match = df[df['text'] == 'special case']  # Much faster

print(f"Found {len(result1)} rows")

For exact string matching, direct comparison with == is significantly faster than str.contains(). Reserve str.contains() for partial matches and pattern matching scenarios where its flexibility is necessary.
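Similarly, for matches anchored at the start or end of the string, str.startswith() and str.endswith() test a literal prefix or suffix without invoking the regex engine; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['special case', 'sample text', 'special offer', 'a special case']
})

# Literal prefix test -- no regex compilation or scanning involved
result = df[df['text'].str.startswith('special')]
print(result['text'].tolist())  # ['special case', 'special offer']
```

Note that 'a special case' is excluded: the prefix must start at position zero, unlike a str.contains() substring match.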
