How to Use str.contains in Pandas

Key Insights

  • str.contains() is pandas’ primary method for filtering DataFrames based on substring or pattern matching, returning a boolean Series you can use directly for indexing.
  • Always set regex=False when matching literal strings—it’s faster and prevents unexpected behavior from special characters like ., *, or $.
  • The na parameter is critical for real-world data; forgetting to handle NaN values will propagate missing values into your boolean mask and cause filtering errors.

Introduction

String matching is one of the most common operations when working with text data in pandas. Whether you’re filtering customer names, searching product descriptions, or parsing log files, you need a reliable way to find rows containing specific text patterns.

The str.contains() method is pandas’ answer to this problem. It checks each element in a Series for the presence of a pattern and returns a boolean Series—perfect for filtering DataFrames. Unlike Python’s in operator, which works on individual strings, str.contains() operates on entire columns efficiently.
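To make that contrast concrete, here is a minimal sketch (the example Series is made up for illustration):

```python
import pandas as pd

names = pd.Series(['Alice Smith', 'Bob Jones'])

# Python's `in` operator tests one string at a time...
single = 'Smith' in 'Alice Smith'

# ...while str.contains evaluates every element of the Series at once
mask = names.str.contains('Smith')
print(single, mask.tolist())
```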

This article covers everything you need to use str.contains() effectively: from basic substring matching to complex regex patterns, handling missing values, and avoiding common performance pitfalls.

Basic Syntax and Parameters

The method signature looks like this:

Series.str.contains(pat, case=True, flags=0, na=None, regex=True)

Here’s what each parameter does:

  • pat: The pattern or substring to search for. Can be a string or regex pattern.
  • case: Boolean controlling case sensitivity. Default is True (case-sensitive).
  • flags: Regex flags from the re module (e.g., re.IGNORECASE). Only applies when regex=True.
  • na: Value to use for missing data. Default None keeps NaN as NaN in results.
  • regex: Boolean indicating whether pat is a regex. Default is True.

Let’s start with a basic example. Say you have a product catalog and want to find all phones:

import pandas as pd

products = pd.DataFrame({
    'name': ['iPhone 15 Pro', 'Samsung Galaxy S24', 'AirPods Pro', 
             'Pixel 8 Phone', 'MacBook Pro', 'Android Phone Case'],
    'category': ['phone', 'phone', 'audio', 'phone', 'laptop', 'accessory'],
    'price': [999, 849, 249, 699, 1999, 29]
})

# Filter products with "Phone" in the name
phone_products = products[products['name'].str.contains('Phone')]
print(phone_products)

Output:

                 name   category  price
0       iPhone 15 Pro      phone    999
3       Pixel 8 Phone      phone    699
5  Android Phone Case  accessory     29

All three rows contain the exact substring “Phone” with a capital P. Note that “iPhone 15 Pro” matches too, since “iPhone” itself ends in “Phone”. The search is case-sensitive by default, so a lowercase pattern like “phone” would match nothing here.

Case-Sensitive vs. Case-Insensitive Matching

The case parameter controls whether matching respects letter case. This is crucial when dealing with inconsistent data entry or when you want flexible matching.

# Case-sensitive (default) - lowercase 'phone' matches nothing here
case_sensitive = products[products['name'].str.contains('phone', case=True)]
print("Case-sensitive results:")
print(case_sensitive['name'].tolist())

# Case-insensitive - catches all variations
case_insensitive = products[products['name'].str.contains('phone', case=False)]
print("\nCase-insensitive results:")
print(case_insensitive['name'].tolist())

Output:

Case-sensitive results:
[]

Case-insensitive results:
['iPhone 15 Pro', 'Pixel 8 Phone', 'Android Phone Case']

The case-insensitive search matched “phone” regardless of capitalization: inside “iPhone”, “Pixel 8 Phone”, and “Android Phone Case”. For most real-world text filtering, case=False is what you want.

You can also use regex flags for case insensitivity:

import re

# Alternative using regex flags
result = products[products['name'].str.contains('phone', flags=re.IGNORECASE)]

This approach is useful when you need to combine multiple regex flags.
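For instance, re.IGNORECASE can be combined with re.VERBOSE, which lets the pattern carry inline comments. A sketch on a made-up frame:

```python
import re
import pandas as pd

products = pd.DataFrame({
    'name': ['iPhone 15 Pro', 'Android Phone Case', 'MacBook Pro']
})

# re.VERBOSE ignores whitespace and comments inside the pattern;
# multiple flags are combined with the bitwise OR operator |
pattern = r'''
    phone           # the word "phone"
    (?:\s+case)?    # optionally followed by " case"
'''
mask = products['name'].str.contains(pattern, flags=re.IGNORECASE | re.VERBOSE)
print(products.loc[mask, 'name'].tolist())
```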

Using Regular Expressions

By default, str.contains() interprets the pattern as a regular expression. This gives you powerful pattern matching capabilities.

Matching String Positions

Use anchors to match patterns at specific positions:

data = pd.DataFrame({
    'email': ['john@company.com', 'admin@company.org', 'support@company.com',
              'john.doe@external.com', 'company@partner.org']
})

# Emails starting with "john"
starts_with_john = data[data['email'].str.contains('^john')]
print("Starts with 'john':")
print(starts_with_john['email'].tolist())

# Emails ending with ".com"
ends_with_com = data[data['email'].str.contains(r'\.com$')]
print("\nEnds with '.com':")
print(ends_with_com['email'].tolist())

Output:

Starts with 'john':
['john@company.com', 'john.doe@external.com']

Ends with '.com':
['john@company.com', 'support@company.com', 'john.doe@external.com']

Note the escaped dot (\.) in the second pattern—without the backslash, . matches any character.

Matching Multiple Patterns

Use the pipe operator | to match any of several patterns:

logs = pd.DataFrame({
    'message': ['ERROR: Connection failed', 'INFO: User logged in',
                'WARNING: High memory usage', 'ERROR: Timeout exceeded',
                'DEBUG: Cache cleared', 'WARNING: Disk space low']
})

# Find all errors and warnings
issues = logs[logs['message'].str.contains('ERROR|WARNING')]
print(issues)

Output:

                      message
0    ERROR: Connection failed
2  WARNING: High memory usage
3     ERROR: Timeout exceeded
5     WARNING: Disk space low
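When the alternatives live in a Python list, you can build the alternation programmatically; re.escape keeps any special characters in the terms literal. A sketch reusing the shape of the frame above:

```python
import re
import pandas as pd

logs = pd.DataFrame({
    'message': ['ERROR: Connection failed', 'INFO: User logged in',
                'WARNING: High memory usage', 'DEBUG: Cache cleared']
})

levels = ['ERROR', 'WARNING']
# Escape each term so regex metacharacters can't sneak into the pattern
pattern = '|'.join(re.escape(level) for level in levels)
issues = logs[logs['message'].str.contains(pattern)]
print(issues['message'].tolist())
```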

Disabling Regex for Literal Matching

When you want to match literal strings containing regex special characters, set regex=False:

files = pd.DataFrame({
    'filename': ['report.csv', 'data[2024].xlsx', 'backup (1).zip',
                 'config.json', 'notes[draft].txt']
})

# With regex=True (the default), the brackets form a character class:
# '[2024]' silently matches any filename containing '2', '0', or '4'
# files[files['filename'].str.contains('[2024]')]

# Correct approach for literal matching
bracketed = files[files['filename'].str.contains('[2024]', regex=False)]
print(bracketed)

Output:

          filename
1  data[2024].xlsx

Setting regex=False treats the pattern as a literal string, so [2024] matches exactly those characters.
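If you need the literal characters inside a larger regex, re.escape is an alternative to regex=False: it neutralizes the special characters while keeping regex mode available around them. A sketch:

```python
import re
import pandas as pd

files = pd.DataFrame({
    'filename': ['report.csv', 'data[2024].xlsx', 'notes[draft].txt']
})

# re.escape turns '[2024]' into a literal match, which can still be
# combined with real regex syntax such as the anchored extension here
mask = files['filename'].str.contains(re.escape('[2024]') + r'\.xlsx$')
print(files.loc[mask, 'filename'].tolist())
```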

Handling Missing Values

Real-world data contains missing values. By default, str.contains() returns NaN for NaN inputs, which can cause problems when filtering:

customers = pd.DataFrame({
    'name': ['Alice Smith', 'Bob Johnson', None, 'Carol Williams', 'Dave Brown'],
    'email': ['alice@email.com', None, 'charlie@email.com', 'carol@email.com', None]
})

# Default behavior - NaN propagates
mask = customers['name'].str.contains('Smith')
print("Mask with NaN:")
print(mask)

Output:

Mask with NaN:
0     True
1    False
2      NaN
3    False
4    False
Name: name, dtype: object

That NaN makes the mask an object Series rather than a boolean one, and indexing with it raises ValueError: Cannot mask with non-boolean array containing NA / NaN values. Use the na parameter to control this:

# Treat NaN as False (exclude missing values)
mask_false = customers['name'].str.contains('Smith', na=False)
print("With na=False:")
print(customers[mask_false])

# Treat NaN as True (include missing values)
mask_true = customers['name'].str.contains('Smith', na=True)
print("\nWith na=True:")
print(customers[mask_true])

Output:

With na=False:
          name            email
0  Alice Smith  alice@email.com

With na=True:
          name              email
0  Alice Smith    alice@email.com
2         None  charlie@email.com

For most filtering operations, na=False is the safe choice—it excludes rows with missing values in the searched column.
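An equivalent, explicit alternative is to fill the missing values before matching. A sketch on a reduced Series:

```python
import pandas as pd

names = pd.Series(['Alice Smith', None, 'Dave Brown'])

# An empty string never contains 'Smith', so missing rows become False
# and the resulting mask is a clean boolean Series
mask = names.fillna('').str.contains('Smith')
print(mask.tolist())
```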

Practical Use Cases

Let’s look at real-world patterns combining str.contains() with other DataFrame operations.

Filter and Select Columns

orders = pd.DataFrame({
    'order_id': ['ORD-001', 'ORD-002', 'ORD-003', 'ORD-004'],
    'product': ['Wireless Mouse', 'USB-C Cable', 'Wireless Keyboard', 'HDMI Adapter'],
    'quantity': [2, 5, 1, 3],
    'status': ['shipped', 'pending', 'shipped', 'cancelled']
})

# Find wireless products and show only relevant columns
wireless = orders.loc[
    orders['product'].str.contains('Wireless', case=False),
    ['order_id', 'product', 'quantity']
]
print(wireless)

Negation with the Tilde Operator

Use ~ to find rows that do NOT contain a pattern:

# Find orders that are NOT shipped
not_shipped = orders[~orders['status'].str.contains('shipped')]
print(not_shipped)

Output:

  order_id       product  quantity     status
1  ORD-002   USB-C Cable         5    pending
3  ORD-004  HDMI Adapter         3  cancelled
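One gotcha: when the column has missing values, ~ also flips whatever you chose for na. A sketch with a made-up missing status:

```python
import pandas as pd

status = pd.Series(['shipped', 'pending', None])

# na=False marks the missing row False, so ~ flips it to True
loose = ~status.str.contains('shipped', na=False)

# na=True marks it True, so ~ keeps it out of the negated result
strict = ~status.str.contains('shipped', na=True)
print(loose.tolist(), strict.tolist())
```

For “does NOT contain” filters, na=True is usually the safer choice, mirroring the role na=False plays in positive filters.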

Combining Multiple Conditions

Chain conditions with & (and) and | (or):

# Wireless products that are shipped
wireless_shipped = orders[
    orders['product'].str.contains('Wireless', case=False) & 
    orders['status'].str.contains('shipped')
]

# Products containing "USB" or "HDMI"
cables_adapters = orders[
    orders['product'].str.contains('USB', case=False) | 
    orders['product'].str.contains('HDMI', case=False)
]
print(cables_adapters)

Performance Tips and Common Pitfalls

Use regex=False for simple substring matching. Regex matching is slower than literal string matching. If you don’t need pattern matching, disable it:

# Slower - uses regex engine
df[df['col'].str.contains('simple text')]

# Faster - literal string matching
df[df['col'].str.contains('simple text', regex=False)]

Escape special characters when using regex. Characters like ., *, +, ?, [, ], (, ), $, ^, and | have special meanings. Either escape them with backslashes or use regex=False.

Always handle NaN values explicitly. Don’t assume your data is clean. Add na=False to avoid unexpected NaN propagation in boolean masks.

Pre-compile patterns for repeated use. If you’re applying the same regex pattern multiple times, compile it first:

import re
pattern = re.compile(r'error|warning', re.IGNORECASE)
df[df['log'].str.contains(pattern)]

When passing a compiled pattern, leave case and flags at their defaults; pandas raises a ValueError if you set them. Case-insensitivity comes from the flags compiled into the pattern.

The str.contains() method is a workhorse for text filtering in pandas. Master its parameters, understand regex basics, and always handle missing values—you’ll filter DataFrames efficiently and avoid the common bugs that trip up many developers.
