How to Filter by String Contains in Pandas

Key Insights

  • The str.contains() method is Pandas’ primary tool for filtering rows based on substring matches, supporting regex patterns, case-insensitive matching, and NaN handling out of the box.
  • Always use na=False when filtering with str.contains() to avoid unexpected NaN propagation that can break your boolean indexing.
  • For simple prefix matching, str.startswith() outperforms str.contains() with regex—choose the right tool for your specific pattern matching needs.

Introduction

String filtering is one of the most common operations you’ll perform in data analysis. Whether you’re searching through server logs for error messages, filtering customer names by keyword, or extracting rows containing specific product codes, you need reliable substring matching.

Pandas provides the str.contains() method as your primary tool for this job. It’s flexible, supports regular expressions, and integrates seamlessly with Pandas’ boolean indexing. But it has quirks that trip up developers—especially around NaN handling and performance.

This guide covers everything you need to filter DataFrames by string content effectively, from basic usage to advanced regex patterns.

Basic String Contains Filtering

The core syntax is straightforward: apply str.contains() to a Series and use the resulting boolean mask to filter your DataFrame.

import pandas as pd

# Sample product data
df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6],
    'name': ['MacBook Pro 16"', 'MacBook Air', 'iPhone Pro Max', 
             'iPad Premium', 'AirPods', 'Apple Watch Premium'],
    'price': [2499, 1299, 1199, 899, 249, 799]
})

# Filter products containing "Pro"
pro_products = df[df['name'].str.contains('Pro')]
print(pro_products)

Output:

   product_id             name  price
0           1  MacBook Pro 16"   2499
2           3   iPhone Pro Max   1199

You can also filter for multiple terms by chaining conditions:

# Products containing "Pro" OR "Premium"
premium_tier = df[df['name'].str.contains('Pro') | df['name'].str.contains('Premium')]
print(premium_tier)

Output:

   product_id                 name  price
0           1      MacBook Pro 16"   2499
2           3       iPhone Pro Max   1199
3           4         iPad Premium    899
5           6  Apple Watch Premium    799

This works, but there’s a cleaner way using regex—we’ll cover that shortly.

Case-Insensitive Matching

Real-world data is messy. Email addresses might be stored as “USER@EXAMPLE.COM” or “user@example.com” depending on the source. The case parameter handles this.

# Customer email data with inconsistent casing
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'email': ['john@GMAIL.COM', 'sarah@yahoo.com', 'mike@Gmail.com', 
              'lisa@outlook.com', 'tom@YAHOO.COM'],
    'signup_date': ['2024-01-15', '2024-02-20', '2024-03-10', 
                    '2024-03-15', '2024-04-01']
})

# Case-sensitive (default) - misses every Gmail user here
gmail_strict = customers[customers['email'].str.contains('gmail')]
print(f"Case-sensitive matches: {len(gmail_strict)}")

# Case-insensitive - catches all variations
gmail_all = customers[customers['email'].str.contains('gmail', case=False)]
print(gmail_all)

Output:

Case-sensitive matches: 0
   customer_id           email signup_date
0          101  john@GMAIL.COM  2024-01-15
2          103  mike@Gmail.com  2024-03-10

Always use case=False when you’re searching for domain names, product codes, or any text where capitalization varies. It’s a small parameter that prevents frustrating debugging sessions.

Handling NaN Values

Here’s where many developers get burned. If your column contains NaN values, str.contains() returns NaN for those rows—not False. This breaks boolean indexing.

# Data with missing values
logs = pd.DataFrame({
    'timestamp': ['2024-01-01 10:00', '2024-01-01 10:05', '2024-01-01 10:10',
                  '2024-01-01 10:15', '2024-01-01 10:20'],
    'message': ['User login successful', 'ERROR: Connection timeout', None,
                'User logout', 'ERROR: Database unreachable']
})

# This will cause issues
try:
    errors = logs[logs['message'].str.contains('ERROR')]
    print(errors)
except Exception as e:
    print(f"Error: {e}")

The mask produced by str.contains() holds NaN for the missing row, and recent versions of Pandas refuse to index with it, raising ValueError: Cannot mask with non-boolean array containing NA / NaN values. The fix is the na parameter:

# na=False: Treat NaN as "does not contain" (most common use case)
errors_safe = logs[logs['message'].str.contains('ERROR', na=False)]
print("With na=False:")
print(errors_safe)

# na=True: Treat NaN as "does contain" (rare, but useful for inverse filtering)
errors_inclusive = logs[logs['message'].str.contains('ERROR', na=True)]
print("\nWith na=True:")
print(errors_inclusive)

Output:

With na=False:
           timestamp                     message
1  2024-01-01 10:05  ERROR: Connection timeout
4  2024-01-01 10:20  ERROR: Database unreachable

With na=True:
           timestamp                     message
1  2024-01-01 10:05  ERROR: Connection timeout
2  2024-01-01 10:10                        None
4  2024-01-01 10:20  ERROR: Database unreachable

My recommendation: always explicitly set na=False unless you have a specific reason not to. Make it a habit.
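
If you'd rather not pass na= on every call, an equivalent alternative for filtering is to fill missing values with an empty string first. An empty string never contains the search term, so the resulting mask is pure True/False:

```python
import pandas as pd

logs = pd.DataFrame({
    'message': ['User login successful', 'ERROR: Connection timeout', None,
                'User logout', 'ERROR: Database unreachable']
})

# fillna('') replaces NaN with '', which can never contain 'ERROR',
# so the mask contains no NaN values
mask = logs['message'].fillna('').str.contains('ERROR')
print(logs[mask])
```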

Using Regular Expressions

The str.contains() method interprets patterns as regular expressions by default. This is powerful for complex matching scenarios.

# Transaction data spanning multiple years
transactions = pd.DataFrame({
    'transaction_id': ['TXN-2022-001', 'TXN-2023-042', 'TXN-2023-108',
                       'TXN-2024-003', 'TXN-2024-089', 'TXN-2021-200'],
    'amount': [150.00, 299.99, 75.50, 1200.00, 89.99, 450.00],
    'status': ['completed', 'completed', 'pending', 'completed', 'failed', 'completed']
})

# Filter for 2023 OR 2024 transactions using regex OR operator
recent = transactions[transactions['transaction_id'].str.contains('2023|2024', na=False)]
print("Recent transactions (2023-2024):")
print(recent)

Output:

Recent transactions (2023-2024):
  transaction_id   amount     status
1   TXN-2023-042   299.99  completed
2   TXN-2023-108    75.50    pending
3   TXN-2024-003  1200.00  completed
4   TXN-2024-089    89.99     failed

You can use anchors for more precise matching:

# Match strings starting with specific prefix
sku_data = pd.DataFrame({
    'sku': ['ELEC-001', 'ELEC-002', 'FURN-001', 'ELECTRONIC-SPECIAL', 'FURN-002'],
    'category': ['Electronics', 'Electronics', 'Furniture', 'Electronics', 'Furniture']
})

# Using ^ anchor to match only at start
electronics = sku_data[sku_data['sku'].str.contains('^ELEC-', na=False)]
print(electronics)

Output:

        sku     category
0  ELEC-001  Electronics
1  ELEC-002  Electronics

Notice that “ELECTRONIC-SPECIAL” wasn’t matched because the pattern requires “ELEC-” at the start, not just “ELEC” anywhere.
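
A related shortcut: str.match() anchors the pattern at the start of each string for you, so it behaves like str.contains() with a leading ^:

```python
import pandas as pd

sku_data = pd.DataFrame({
    'sku': ['ELEC-001', 'ELEC-002', 'FURN-001', 'ELECTRONIC-SPECIAL', 'FURN-002']
})

# str.match() is implicitly anchored at the start - no ^ needed
electronics = sku_data[sku_data['sku'].str.match('ELEC-', na=False)]
print(electronics)
```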

If your search string contains regex special characters (like ., *, ?, (, )), either escape them or disable regex:

# Searching for literal "file.txt" - the dot is a regex wildcard
files = pd.DataFrame({'filename': ['file.txt', 'filetxt', 'file1txt', 'data.csv']})

# Wrong: matches more than expected
print(files[files['filename'].str.contains('file.txt', na=False)])

# Correct: disable regex interpretation
print(files[files['filename'].str.contains('file.txt', regex=False, na=False)])
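
If the literal string arrives at runtime (a user-supplied search term, for example) and you still want other regex features in the same pattern, re.escape() from Python's standard library neutralizes any special characters:

```python
import re
import pandas as pd

files = pd.DataFrame({'filename': ['file.txt', 'filetxt', 'file1txt', 'data.csv']})

# re.escape() turns 'file.txt' into 'file\\.txt', making the dot literal
pattern = re.escape('file.txt')
literal = files[files['filename'].str.contains(pattern, na=False)]
print(literal)
```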

Negating the Filter (Does NOT Contain)

Use the ~ operator to invert your filter and exclude matching rows.

# Dataset with test and demo entries mixed with real data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'username': ['john_doe', 'test_user_1', 'jane_smith', 
                 'demo_account', 'bob_wilson', 'test_admin'],
    'email': ['john@company.com', 'test@test.com', 'jane@company.com',
              'demo@demo.com', 'bob@company.com', 'admin@test.com']
})

# Exclude test and demo accounts
real_users = users[~users['username'].str.contains('test|demo', case=False, na=False)]
print("Production users only:")
print(real_users)

Output:

Production users only:
   user_id    username             email
0        1    john_doe  john@company.com
2        3  jane_smith  jane@company.com
4        5  bob_wilson   bob@company.com

This pattern is invaluable for data cleaning—removing test records, filtering out placeholder entries, or excluding specific categories from analysis.
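
One subtlety: na interacts with ~ in reverse. With na=False, NaN rows are False in the mask, and negation flips them to True, so missing values end up included in the "does not contain" result. Pass na=True if you want NaN rows dropped after negation:

```python
import pandas as pd

comments = pd.DataFrame({'text': ['spam link here', 'great post', None]})

# na=False: the None row is False in the mask, True after ~, so it's kept
kept = comments[~comments['text'].str.contains('spam', na=False)]
print(len(kept))         # 2 rows: 'great post' and the None row

# na=True: the None row is True in the mask, False after ~, so it's dropped
kept_strict = comments[~comments['text'].str.contains('spam', na=True)]
print(len(kept_strict))  # 1 row: just 'great post'
```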

Performance Tips and Alternatives

For large datasets, method choice matters. Here’s when to use what:

Use str.startswith() for prefix matching:

import time

# Create a larger dataset for benchmarking
large_df = pd.DataFrame({
    'code': ['PRD-' + str(i).zfill(6) for i in range(100000)]
})

# Method 1: str.contains with regex anchor
start = time.time()
result1 = large_df[large_df['code'].str.contains('^PRD-', na=False)]
contains_time = time.time() - start

# Method 2: str.startswith (no regex overhead)
start = time.time()
result2 = large_df[large_df['code'].str.startswith('PRD-')]
startswith_time = time.time() - start

print(f"str.contains with ^: {contains_time:.4f}s")
print(f"str.startswith: {startswith_time:.4f}s")
print(f"Speedup: {contains_time / startswith_time:.2f}x")

On my machine, str.startswith() is typically 2-3x faster for prefix matching because it doesn’t involve regex compilation.

Use isin() for exact matches:

If you’re checking whether values exactly match items in a list (not substring matching), isin() is both clearer and faster:

# Sample data with a status column
statuses = pd.DataFrame({
    'status': ['active', 'inactive', 'pending', 'approved', 'rejected']
})

# Exact match - use isin()
valid_statuses = ['active', 'pending', 'approved']
df_exact = statuses[statuses['status'].isin(valid_statuses)]

# Substring match - use str.contains() ('activ' also matches 'inactive')
df_substring = statuses[statuses['status'].str.contains('activ', na=False)]

Disable regex when not needed:

If you’re searching for a literal string without any pattern matching, set regex=False. This skips regex compilation and can improve performance:

# Faster for literal string matching
df[df['column'].str.contains('simple text', regex=False, na=False)]

The performance difference becomes significant when you’re filtering millions of rows or running the same filter repeatedly in a loop.
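
One more trick for repeated filtering: str.contains() also accepts a precompiled re.Pattern, which avoids recompiling the regex on every call. Case-insensitivity must then be baked into the pattern's flags, since the case and flags parameters can't be combined with a compiled pattern:

```python
import re
import pandas as pd

df = pd.DataFrame({'msg': ['ERROR: disk full', 'ok', 'error: retry', 'warning']})

# Compile once; re.IGNORECASE stands in for case=False
pattern = re.compile('error', flags=re.IGNORECASE)
errors = df[df['msg'].str.contains(pattern, na=False)]
print(errors)
```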

Wrapping Up

String filtering with str.contains() is essential for practical data analysis. The key points to remember: always handle NaN values explicitly with na=False, use case=False for flexible matching, and leverage regex for complex patterns. When performance matters, choose the right method—str.startswith() for prefixes, isin() for exact matches, and regex=False when you don’t need pattern matching.

These techniques will handle the vast majority of your string filtering needs in Pandas.
