Pandas - str.contains() with Examples
The `str.contains()` method checks whether a pattern exists in each string element of a pandas Series. It returns a boolean Series indicating matches.
Key Insights
- str.contains() searches for patterns in pandas Series string data using regex or literal strings, returning a boolean mask for filtering DataFrames efficiently
- The method offers key parameters like case, flags, na, and regex that control matching behavior and handle missing values during pattern searches
- Combining str.contains() with logical operators, multiple patterns, and proper regex escaping enables complex filtering scenarios for real-world data analysis
Basic String Matching
Start with a simple substring search on a small product catalog:
import pandas as pd
df = pd.DataFrame({
'product': ['iPhone 14', 'Samsung Galaxy', 'iPad Pro', 'MacBook Air', 'Galaxy Watch'],
'price': [999, 849, 799, 1199, 399]
})
# Basic pattern matching
mask = df['product'].str.contains('Galaxy')
print(mask)
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Filter DataFrame using the mask
filtered = df[mask]
print(filtered)
# product price
# 1 Samsung Galaxy 849
# 4 Galaxy Watch 399
The method performs substring matching by default. Any string containing the pattern returns True, regardless of position.
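To make the "regardless of position" point concrete, here is a quick sketch contrasting str.contains() with the position-anchored str.startswith() (both standard pandas Series.str methods):

```python
import pandas as pd

s = pd.Series(['iPhone 14', 'Samsung Galaxy', 'Galaxy Watch'])

# contains: matches anywhere in the string
print(s.str.contains('Galaxy').tolist())    # [False, True, True]

# startswith: matches only at the start of the string
print(s.str.startswith('Galaxy').tolist())  # [False, False, True]
```

'Samsung Galaxy' passes the contains check but fails startswith, since 'Galaxy' is not at the beginning.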
Case Sensitivity Control
By default, str.contains() performs case-sensitive matching. Use the case parameter to modify this behavior.
df = pd.DataFrame({
'email': ['user@GMAIL.com', 'admin@gmail.com', 'test@Yahoo.com', 'info@outlook.com']
})
# Case-sensitive (default)
gmail_sensitive = df['email'].str.contains('gmail')
print(gmail_sensitive)
# 0 False
# 1 True
# 2 False
# 3 False
# Case-insensitive matching
gmail_insensitive = df['email'].str.contains('gmail', case=False)
print(gmail_insensitive)
# 0 True
# 1 True
# 2 False
# 3 False
# Filter for Gmail addresses (case-insensitive)
gmail_users = df[df['email'].str.contains('gmail', case=False)]
print(gmail_users)
# email
# 0 user@GMAIL.com
# 1 admin@gmail.com
Handling Missing Values
Missing values (NaN) require explicit handling. The na parameter controls what boolean value to assign to null entries.
df = pd.DataFrame({
'description': ['Python developer', None, 'Java engineer', 'Python architect', '']
})
# Default behavior: NaN in result
result_default = df['description'].str.contains('Python')
print(result_default)
# 0 True
# 1 NaN
# 2 False
# 3 True
# 4 False
# Set NaN values to False
result_false = df['description'].str.contains('Python', na=False)
print(result_false)
# 0 True
# 1 False
# 2 False
# 3 True
# 4 False
# Set NaN values to True
result_true = df['description'].str.contains('Python', na=True)
print(result_true)
# 0 True
# 1 True
# 2 False
# 3 True
# 4 False
Setting na=False is common when filtering DataFrames, as it treats missing values as non-matches.
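One reason na=False matters in practice: with the default, the result is an object-dtype Series containing NaN, and pandas refuses to use such a Series as a boolean mask. A small sketch of the failure mode:

```python
import pandas as pd

df = pd.DataFrame({'description': ['Python developer', None, 'Java engineer']})

mask = df['description'].str.contains('Python')  # object dtype, contains NaN
try:
    df[mask]  # masking with NaN entries is rejected
except ValueError as e:
    print('masking failed:', e)

# na=False produces a clean boolean mask that filters safely
safe = df[df['description'].str.contains('Python', na=False)]
print(safe)
```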
Regex Pattern Matching
str.contains() uses regex by default. This enables powerful pattern matching beyond simple substrings.
df = pd.DataFrame({
'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', 'invalid', '9876543210']
})
# Match phone numbers with dashes
dash_pattern = df['phone'].str.contains(r'\d{3}-\d{3}-\d{4}')
print(dash_pattern)
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
# Match common phone formats (dashes, dots, or parentheses)
any_format = df['phone'].str.contains(r'\d{3}[-.)]\s?\d{3}[-.)]\d{4}')
print(any_format)
# 0 True
# 1 True
# 2 True
# 3 False
# 4 False
# Match email addresses
emails = pd.DataFrame({
'contact': ['john@example.com', 'invalid.email', 'jane@test.co.uk', 'not-an-email']
})
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
valid_emails = emails[emails['contact'].str.contains(email_pattern, na=False)]
print(valid_emails)
# contact
# 0 john@example.com
# 2 jane@test.co.uk
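For full-string validation like this, pandas (1.1+) also offers Series.str.fullmatch, which requires the entire string to match and makes the ^ and $ anchors unnecessary; a sketch using the same pattern:

```python
import pandas as pd

emails = pd.Series(['john@example.com', 'invalid.email', 'jane@test.co.uk'])

# fullmatch anchors the pattern to the whole string automatically
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
print(emails[emails.str.fullmatch(pattern, na=False)])
```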
Literal String Matching
When searching for strings containing regex special characters, disable regex interpretation using regex=False.
df = pd.DataFrame({
'code': ['item[1]', 'item[2]', 'item.test', 'item*special', 'item1']
})
# Wrong: treats brackets as a regex character class
wrong = df['code'].str.contains('[1]')  # matches any '1' character
print(wrong)
# 0 True
# 1 False
# 2 False
# 3 False
# 4 True
# Correct: literal '[1]' search
correct = df['code'].str.contains('[1]', regex=False)
print(correct)
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
# Search for literal asterisk
asterisk_items = df['code'].str.contains('*', regex=False)
print(asterisk_items)
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
Alternatively, escape special characters when using regex mode:
escaped = df['code'].str.contains(r'\[1\]', regex=True)
print(escaped)
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
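When the search string arrives in a variable, hand-escaping is error-prone; the standard library's re.escape builds the escaped pattern for you:

```python
import re
import pandas as pd

codes = pd.Series(['item[1]', 'item[2]', 'item.test'])

query = 'item[1]'  # may contain regex metacharacters
mask = codes.str.contains(re.escape(query))  # escapes [ and ] automatically
print(mask.tolist())  # [True, False, False]
```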
Multiple Pattern Matching
Combine multiple patterns using regex alternation or logical operators on boolean Series.
df = pd.DataFrame({
'skill': ['Python', 'Java', 'JavaScript', 'Python, Java', 'C++', 'Ruby']
})
# OR matching with regex pipe operator
python_or_java = df['skill'].str.contains('Python|Java', na=False)
print(python_or_java)
# 0 True
# 1 True
# 2 True
# 3 True
# 4 False
# 5 False
# AND matching using multiple conditions
has_python = df['skill'].str.contains('Python', na=False)
has_java = df['skill'].str.contains('Java', na=False)
both = has_python & has_java
print(df[both])
# skill
# 3 Python, Java
# NOT matching
not_python = ~df['skill'].str.contains('Python', na=False)
print(df[not_python])
# skill
# 1 Java
# 2 JavaScript
# 4 C++
# 5 Ruby
# Complex conditions
scripting = df['skill'].str.contains('Python|Ruby|JavaScript', na=False)
compiled = df['skill'].str.contains(r'Java|C\+\+', na=False)
either = scripting | compiled
print(df[either])
# skill
# 0 Python
# 1 Java
# 2 JavaScript
# 3 Python, Java
# 4 C++
# 5 Ruby
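When the set of patterns lives in a list, the alternation can be built programmatically; combining join with re.escape keeps literal terms like 'C++' safe in regex mode (a sketch using the same skills data):

```python
import re
import pandas as pd

df = pd.DataFrame({
'skill': ['Python', 'Java', 'JavaScript', 'Python, Java', 'C++', 'Ruby']
})

terms = ['Python', 'C++']  # 'C++' needs escaping in regex mode
pattern = '|'.join(re.escape(t) for t in terms)
mask = df['skill'].str.contains(pattern, na=False)
print(df[mask])
```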
Regex Flags
Use the flags parameter to pass regex compilation flags for advanced pattern matching.
import re
df = pd.DataFrame({
'text': ['First line\nSecond line', 'Single line', 'UPPERCASE\ntext', 'normal text']
})
# Case-insensitive with IGNORECASE flag
case_insensitive = df['text'].str.contains('uppercase', flags=re.IGNORECASE, na=False)
print(case_insensitive)
# 0 False
# 1 False
# 2 True
# 3 False
# Match across newlines with the DOTALL flag
# Dot matches newline characters
multiline = df['text'].str.contains('First.*Second', flags=re.DOTALL, na=False)
print(multiline)
# 0 True
# 1 False
# 2 False
# 3 False
# Combine multiple flags
combined = df['text'].str.contains('FIRST.*SECOND', flags=re.IGNORECASE | re.DOTALL, na=False)
print(combined)
# 0 True
# 1 False
# 2 False
# 3 False
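str.contains() also accepts a pre-compiled pattern, which bundles the flags with the regex (note that when a compiled pattern is passed, the case, flags, and regex parameters must be left at their defaults):

```python
import re
import pandas as pd

text = pd.Series(['First line\nSecond line', 'Single line'])

# Compile the flags into the pattern itself
pat = re.compile('first.*second', re.IGNORECASE | re.DOTALL)
print(text.str.contains(pat).tolist())  # [True, False]
```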
Real-World Application: Log Analysis
Practical example analyzing server logs to identify errors and specific request patterns.
logs = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=6, freq='h'),
'message': [
'INFO: User login successful',
'ERROR: Database connection failed',
'WARNING: High memory usage detected',
'ERROR: NullPointerException in auth module',
'INFO: Request processed in 120ms',
'ERROR: Timeout connecting to external API'
],
'source': ['auth', 'database', 'system', 'auth', 'api', 'api']
})
# Find all errors
errors = logs[logs['message'].str.contains('ERROR', na=False)]
print(f"Total errors: {len(errors)}")
# Total errors: 3
# Find database and connection issues ('connect' also matches 'connecting')
db_issues = logs[logs['message'].str.contains('database|connect', case=False, na=False)]
print(db_issues)
# timestamp message source
# 1 2024-01-01 01:00:00 ERROR: Database connection failed database
# 5 2024-01-01 05:00:00 ERROR: Timeout connecting to external API api
# Find authentication errors
auth_errors = logs[
(logs['message'].str.contains('ERROR', na=False)) &
(logs['source'] == 'auth')
]
print(auth_errors)
# timestamp message source
# 3 2024-01-01 03:00:00 ERROR: NullPointerException in auth module auth
# Find request-timing entries (e.g., "processed in 120ms")
performance = logs[logs['message'].str.contains(r'processed in \d+ms', na=False)]
print(performance)
# timestamp message source
# 4 2024-01-01 04:00:00 INFO: Request processed in 120ms api
# Complex filtering: warnings or errors excluding specific sources
critical = logs[
logs['message'].str.contains('ERROR|WARNING', na=False) &
~logs['source'].str.contains('system', na=False)
]
print(critical)
# timestamp message source
# 1 2024-01-01 01:00:00 ERROR: Database connection failed database
# 3 2024-01-01 03:00:00 ERROR: NullPointerException in auth module auth
# 5 2024-01-01 05:00:00 ERROR: Timeout connecting to external API api
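Masks from str.contains() also compose with aggregation, for example counting errors per source (a sketch reusing the logs messages above):

```python
import pandas as pd

logs = pd.DataFrame({
'message': [
'INFO: User login successful',
'ERROR: Database connection failed',
'WARNING: High memory usage detected',
'ERROR: NullPointerException in auth module',
'INFO: Request processed in 120ms',
'ERROR: Timeout connecting to external API'
],
'source': ['auth', 'database', 'system', 'auth', 'api', 'api']
})

# Filter to error rows, then count occurrences per source
error_counts = logs[logs['message'].str.contains('ERROR', na=False)]['source'].value_counts()
print(error_counts)
```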
The str.contains() method provides flexible string matching for data filtering and analysis. Understanding its parameters and combining it with regex enables efficient DataFrame operations for text-heavy datasets.