How to Use str.contains in Pandas
Key Insights
- str.contains() is pandas' primary method for filtering DataFrames based on substring or pattern matching, returning a boolean Series you can use directly for indexing.
- Always set regex=False when matching literal strings; it's faster and prevents unexpected behavior from special characters like ., *, or $.
- The na parameter is critical for real-world data; forgetting to handle NaN values will propagate missing values into your boolean mask and cause filtering errors.
Introduction
String matching is one of the most common operations when working with text data in pandas. Whether you’re filtering customer names, searching product descriptions, or parsing log files, you need a reliable way to find rows containing specific text patterns.
The str.contains() method is pandas’ answer to this problem. It checks each element in a Series for the presence of a pattern and returns a boolean Series—perfect for filtering DataFrames. Unlike Python’s in operator, which works on individual strings, str.contains() operates on entire columns efficiently.
This article covers everything you need to use str.contains() effectively: from basic substring matching to complex regex patterns, handling missing values, and avoiding common performance pitfalls.
Basic Syntax and Parameters
The method signature looks like this:
Series.str.contains(pat, case=True, flags=0, na=None, regex=True)
Here’s what each parameter does:
- pat: The pattern or substring to search for. Can be a string or regex pattern.
- case: Boolean controlling case sensitivity. Default is True (case-sensitive).
- flags: Regex flags from the re module (e.g., re.IGNORECASE). Only applies when regex=True.
- na: Value to use for missing data. Default None keeps NaN as NaN in results.
- regex: Boolean indicating whether pat is a regex. Default is True.
Let’s start with a basic example. Say you have a product catalog and want to find all phones:
import pandas as pd
products = pd.DataFrame({
'name': ['iPhone 15 Pro', 'Samsung Galaxy S24', 'AirPods Pro',
'Pixel 8 Phone', 'MacBook Pro', 'Android Phone Case'],
'category': ['phone', 'phone', 'audio', 'phone', 'laptop', 'accessory'],
'price': [999, 849, 249, 699, 1999, 29]
})
# Filter products with "Phone" in the name
phone_products = products[products['name'].str.contains('Phone')]
print(phone_products)
Output:
name category price
0 iPhone 15 Pro phone 999
3 Pixel 8 Phone phone 699
5 Android Phone Case accessory 29
Notice that "iPhone 15 Pro" matched as well: "iPhone" literally contains the capitalized substring "Phone". Matching is case-sensitive by default, though, so the lowercase pattern 'phone' would have matched nothing here.
Case-Sensitive vs. Case-Insensitive Matching
The case parameter controls whether matching respects letter case. This is crucial when dealing with inconsistent data entry or when you want flexible matching.
# Case-sensitive (default) - misses "iPhone"
case_sensitive = products[products['name'].str.contains('phone', case=True)]
print("Case-sensitive results:")
print(case_sensitive['name'].tolist())
# Case-insensitive - catches all variations
case_insensitive = products[products['name'].str.contains('phone', case=False)]
print("\nCase-insensitive results:")
print(case_insensitive['name'].tolist())
Output:
Case-sensitive results:
[]
Case-insensitive results:
['iPhone 15 Pro', 'Pixel 8 Phone', 'Android Phone Case']
The case-insensitive search matched "iPhone 15 Pro", "Pixel 8 Phone", and "Android Phone Case" alike, regardless of how each name capitalizes "phone". For most real-world text filtering, case=False is what you want.
You can also use regex flags for case insensitivity:
import re
# Alternative using regex flags
result = products[products['name'].str.contains('phone', flags=re.IGNORECASE)]
This approach is useful when you need to combine multiple regex flags.
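As a small sketch of combining flags (using a toy Series, not the products data above), you can OR flags together with `|`; here re.IGNORECASE is paired with re.VERBOSE, which lets the pattern carry whitespace and comments:

```python
import re
import pandas as pd

names = pd.Series(['iPhone 15 Pro', 'Pixel 8 Phone', 'MacBook Pro'])

# Combine flags with the bitwise OR operator. re.VERBOSE ignores
# whitespace and "#" comments inside the pattern itself.
pattern = r"""
    phone      # the literal word "phone", matched in any case
"""
mask = names.str.contains(pattern, flags=re.IGNORECASE | re.VERBOSE)
print(names[mask].tolist())  # ['iPhone 15 Pro', 'Pixel 8 Phone']
```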
Using Regular Expressions
By default, str.contains() interprets the pattern as a regular expression. This gives you powerful pattern matching capabilities.
Matching String Positions
Use anchors to match patterns at specific positions:
data = pd.DataFrame({
'email': ['john@company.com', 'admin@company.org', 'support@company.com',
'john.doe@external.com', 'company@partner.org']
})
# Emails starting with "john"
starts_with_john = data[data['email'].str.contains('^john')]
print("Starts with 'john':")
print(starts_with_john['email'].tolist())
# Emails ending with ".com"
ends_with_com = data[data['email'].str.contains(r'\.com$')]
print("\nEnds with '.com':")
print(ends_with_com['email'].tolist())
Output:
Starts with 'john':
['john@company.com', 'john.doe@external.com']
Ends with '.com':
['john@company.com', 'support@company.com', 'john.doe@external.com']
Note the escaped dot (\.) in the second pattern—without the backslash, . matches any character.
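A quick sketch of why the escape matters, on a cut-down version of the same email data: unescaped, the dot is a wildcard and also matches the "@com" inside the .org addresses.

```python
import pandas as pd

emails = pd.Series(['john@company.com', 'admin@company.org',
                    'company@partner.org'])

# Unescaped "." means "any character, then com", so it also matches
# the "@com" in "admin@company.org".
loose = emails[emails.str.contains('.com')]
print(loose.tolist())   # ['john@company.com', 'admin@company.org']

# Escaped "\." matches only a literal dot.
strict = emails[emails.str.contains(r'\.com')]
print(strict.tolist())  # ['john@company.com']
```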
Matching Multiple Patterns
Use the pipe operator | to match any of several patterns:
logs = pd.DataFrame({
'message': ['ERROR: Connection failed', 'INFO: User logged in',
'WARNING: High memory usage', 'ERROR: Timeout exceeded',
'DEBUG: Cache cleared', 'WARNING: Disk space low']
})
# Find all errors and warnings
issues = logs[logs['message'].str.contains('ERROR|WARNING')]
print(issues)
Output:
message
0 ERROR: Connection failed
2 WARNING: High memory usage
3 ERROR: Timeout exceeded
5 WARNING: Disk space low
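The boolean mask is useful beyond filtering: since True sums as 1, str.contains doubles as a counter. A minimal sketch on a subset of the log data above:

```python
import pandas as pd

messages = pd.Series(['ERROR: Connection failed', 'INFO: User logged in',
                      'WARNING: High memory usage', 'ERROR: Timeout exceeded'])

# Summing a boolean mask counts its True values.
n_errors = messages.str.contains('ERROR').sum()
n_issues = messages.str.contains('ERROR|WARNING').sum()
print(n_errors, n_issues)  # 2 3
```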
Disabling Regex for Literal Matching
When you want to match literal strings containing regex special characters, set regex=False:
files = pd.DataFrame({
'filename': ['report.csv', 'data[2024].xlsx', 'backup (1).zip',
'config.json', 'notes[draft].txt']
})
# Wrong without regex=False - brackets define a regex character class
# files[files['filename'].str.contains('[2024]')] # Matches any one of '0', '2', '4'!
# Correct approach for literal matching
bracketed = files[files['filename'].str.contains('[2024]', regex=False)]
print(bracketed)
Output:
filename
1 data[2024].xlsx
Setting regex=False treats the pattern as a literal string, so [2024] matches exactly those characters.
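When you need regex features and literal special characters in the same pattern, regex=False won't do. One option, sketched below, is re.escape(), which backslash-escapes the literal part so the rest of the pattern can stay a regex:

```python
import re
import pandas as pd

filenames = pd.Series(['report.csv', 'data[2024].xlsx', 'backup (1).zip'])

# re.escape('[2024]') produces '\[2024\]', so the brackets are matched
# literally while the anchored '\.xlsx$' part remains a regex.
pattern = re.escape('[2024]') + r'\.xlsx$'
mask = filenames.str.contains(pattern)
print(filenames[mask].tolist())  # ['data[2024].xlsx']
```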
Handling Missing Values
Real-world data contains missing values. By default, str.contains() returns NaN for NaN inputs, which can cause problems when filtering:
customers = pd.DataFrame({
'name': ['Alice Smith', 'Bob Johnson', None, 'Carol Williams', 'Dave Brown'],
'email': ['alice@email.com', None, 'charlie@email.com', 'carol@email.com', None]
})
# Default behavior - NaN propagates
mask = customers['name'].str.contains('Smith')
print("Mask with NaN:")
print(mask)
Output:
Mask with NaN:
0 True
1 False
2 NaN
3 False
4 False
dtype: object
That NaN in the mask causes issues when filtering. Use the na parameter to control this:
# Treat NaN as False (exclude missing values)
mask_false = customers['name'].str.contains('Smith', na=False)
print("With na=False:")
print(customers[mask_false])
# Treat NaN as True (include missing values)
mask_true = customers['name'].str.contains('Smith', na=True)
print("\nWith na=True:")
print(customers[mask_true])
Output:
With na=False:
name email
0 Alice Smith alice@email.com
With na=True:
name email
0 Alice Smith alice@email.com
2 None charlie@email.com
For most filtering operations, na=False is the safe choice—it excludes rows with missing values in the searched column.
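An equivalent approach, sketched here on a cut-down Series, is to replace missing values before matching; fillna('') leaves no NaN for str.contains to propagate, so the resulting mask is plain boolean:

```python
import pandas as pd

names = pd.Series(['Alice Smith', None, 'Dave Brown'])

# Behaves like na=False: the empty string never contains 'Smith',
# so the former NaN row simply becomes False.
mask = names.fillna('').str.contains('Smith')
print(mask.tolist())  # [True, False, False]
```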
Practical Use Cases
Let’s look at real-world patterns combining str.contains() with other DataFrame operations.
Filter and Select Columns
orders = pd.DataFrame({
'order_id': ['ORD-001', 'ORD-002', 'ORD-003', 'ORD-004'],
'product': ['Wireless Mouse', 'USB-C Cable', 'Wireless Keyboard', 'HDMI Adapter'],
'quantity': [2, 5, 1, 3],
'status': ['shipped', 'pending', 'shipped', 'cancelled']
})
# Find wireless products and show only relevant columns
wireless = orders.loc[
orders['product'].str.contains('Wireless', case=False),
['order_id', 'product', 'quantity']
]
print(wireless)
Negation with the Tilde Operator
Use ~ to find rows that do NOT contain a pattern:
# Find orders that are NOT shipped
not_shipped = orders[~orders['status'].str.contains('shipped')]
print(not_shipped)
Output:
order_id product quantity status
1 ORD-002 USB-C Cable 5 pending
3 ORD-004 HDMI Adapter 3 cancelled
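Negation interacts with missing values: whatever you pass as na gets flipped by ~ too. A sketch on a toy status column with a NaN; with na=True, the missing row becomes False after negation and is excluded from the "not shipped" result:

```python
import pandas as pd

status = pd.Series(['shipped', 'pending', None, 'shipped'])

# na=True marks the NaN row as a match, so ~ excludes it.
# With na=False, the NaN row would be *included* after negation.
not_shipped = status[~status.str.contains('shipped', na=True)]
print(not_shipped.tolist())  # ['pending']
```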
Combining Multiple Conditions
Chain conditions with & (and) and | (or):
# Wireless products that are shipped
wireless_shipped = orders[
orders['product'].str.contains('Wireless', case=False) &
orders['status'].str.contains('shipped')
]
# Products containing "USB" or "HDMI"
cables_adapters = orders[
orders['product'].str.contains('USB', case=False) |
orders['product'].str.contains('HDMI', case=False)
]
print(cables_adapters)
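When the conditions all test the same column, the two chained str.contains calls can collapse into a single alternation pattern, as this sketch shows on the product names alone:

```python
import pandas as pd

products = pd.Series(['Wireless Mouse', 'USB-C Cable',
                      'Wireless Keyboard', 'HDMI Adapter'])

# One regex alternation replaces two chained str.contains conditions.
cables_adapters = products[products.str.contains('USB|HDMI', case=False)]
print(cables_adapters.tolist())  # ['USB-C Cable', 'HDMI Adapter']
```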
Performance Tips and Common Pitfalls
Use regex=False for simple substring matching. Regex matching is slower than literal string matching. If you don’t need pattern matching, disable it:
# Slower - uses regex engine
df[df['col'].str.contains('simple text')]
# Faster - literal string matching
df[df['col'].str.contains('simple text', regex=False)]
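A rough, machine-dependent way to see the difference yourself is to time both variants with time.perf_counter; the exact numbers will vary, but the two masks are always identical:

```python
import time
import pandas as pd

# Synthetic data: 100,000 identical strings.
s = pd.Series(['needle in a haystack'] * 100_000)

start = time.perf_counter()
regex_mask = s.str.contains('needle')                 # regex engine
regex_time = time.perf_counter() - start

start = time.perf_counter()
literal_mask = s.str.contains('needle', regex=False)  # plain substring search
literal_time = time.perf_counter() - start

print(f"regex:   {regex_time:.4f}s")
print(f"literal: {literal_time:.4f}s")
# Only the speed differs; the results are the same boolean Series.
```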
Escape special characters when using regex. Characters like ., *, +, ?, [, ], (, ), $, ^, and | have special meanings. Either escape them with backslashes or use regex=False.
Always handle NaN values explicitly. Don’t assume your data is clean. Add na=False to avoid unexpected NaN propagation in boolean masks.
Pre-compile patterns for repeated use. If you're applying the same regex pattern multiple times, compile it first (note that when you pass a compiled pattern, case and flags must be left at their defaults, since the compiled pattern already carries its own flags):
import re
pattern = re.compile(r'error|warning', re.IGNORECASE)
df[df['log'].str.contains(pattern)]
The str.contains() method is a workhorse for text filtering in pandas. Master its parameters, understand regex basics, and always handle missing values—you’ll filter DataFrames efficiently and avoid the common bugs that trip up many developers.