Pandas - str.contains() with Examples

The `str.contains()` method checks whether a pattern exists in each string element of a pandas Series. It returns a boolean Series indicating matches.

Key Insights

  • str.contains() searches for patterns in pandas Series string data using regex or literal strings, returning a boolean mask for filtering DataFrames efficiently
  • The method offers critical parameters like case, flags, na, and regex that control matching behavior and handle missing values during pattern searches
  • Combining str.contains() with logical operators, multiple patterns, and proper regex escaping enables complex filtering scenarios for real-world data analysis

Basic String Matching

Pass a pattern to str.contains() and it tests each element of the Series, returning a boolean mask you can use to filter rows.

import pandas as pd

df = pd.DataFrame({
    'product': ['iPhone 14', 'Samsung Galaxy', 'iPad Pro', 'MacBook Air', 'Galaxy Watch'],
    'price': [999, 849, 799, 1199, 399]
})

# Basic pattern matching
mask = df['product'].str.contains('Galaxy')
print(mask)
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True

# Filter DataFrame using the mask
filtered = df[mask]
print(filtered)
#           product  price
# 1  Samsung Galaxy    849
# 4    Galaxy Watch    399

The method performs substring matching by default. Any string containing the pattern returns True, regardless of position.
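To restrict a match to a particular position, anchor the pattern. A small sketch using the regex `^` anchor (str.startswith is the literal-string alternative):

```python
import pandas as pd

products = pd.Series(['iPhone 14', 'Samsung Galaxy', 'iPad Pro', 'Galaxy Watch'])

# Substring match: 'Galaxy' anywhere in the string
anywhere = products.str.contains('Galaxy')

# Anchored match: 'Galaxy' only at the start of the string
at_start = products.str.contains('^Galaxy')

print(anywhere.tolist())  # [False, True, False, True]
print(at_start.tolist())  # [False, False, False, True]
```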

Case Sensitivity Control

By default, str.contains() performs case-sensitive matching. Use the case parameter to modify this behavior.

df = pd.DataFrame({
    'email': ['user@GMAIL.com', 'admin@gmail.com', 'test@Yahoo.com', 'info@outlook.com']
})

# Case-sensitive (default)
gmail_sensitive = df['email'].str.contains('gmail')
print(gmail_sensitive)
# 0    False
# 1     True
# 2    False
# 3    False

# Case-insensitive matching
gmail_insensitive = df['email'].str.contains('gmail', case=False)
print(gmail_insensitive)
# 0    True
# 1    True
# 2    False
# 3    False

# Filter for Gmail addresses (case-insensitive)
gmail_users = df[df['email'].str.contains('gmail', case=False)]
print(gmail_users)
#              email
# 0   user@GMAIL.com
# 1  admin@gmail.com

Handling Missing Values

Missing values (NaN) require explicit handling. The na parameter controls what boolean value to assign to null entries.

df = pd.DataFrame({
    'description': ['Python developer', None, 'Java engineer', 'Python architect', '']
})

# Default behavior: NaN in result
result_default = df['description'].str.contains('Python')
print(result_default)
# 0     True
# 1      NaN
# 2    False
# 3     True
# 4    False

# Set NaN values to False
result_false = df['description'].str.contains('Python', na=False)
print(result_false)
# 0     True
# 1    False
# 2    False
# 3     True
# 4    False

# Set NaN values to True
result_true = df['description'].str.contains('Python', na=True)
print(result_true)
# 0    True
# 1    True
# 2    False
# 3    True
# 4    False

Setting na=False is common when filtering DataFrames, as it treats missing values as non-matches.
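This is more than a convenience: in recent pandas versions, a mask containing NaN cannot be used for boolean indexing at all. A sketch of what happens without na=False:

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['Python developer', None, 'Java engineer']
})

# Without na=, the mask keeps NaN for the missing entry,
# and boolean indexing rejects it
mask = df['description'].str.contains('Python')
try:
    df[mask]
except ValueError as exc:
    print(f"ValueError: {exc}")

# With na=False the mask is purely boolean, so filtering works
clean = df[df['description'].str.contains('Python', na=False)]
print(clean)
```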

Regex Pattern Matching

str.contains() uses regex by default. This enables powerful pattern matching beyond simple substrings.

df = pd.DataFrame({
    'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', 'invalid', '9876543210']
})

# Match phone numbers with dashes
dash_pattern = df['phone'].str.contains(r'\d{3}-\d{3}-\d{4}')
print(dash_pattern)
# 0     True
# 1    False
# 2    False
# 3    False
# 4    False

# Match common phone formats (dashes, dots, or a closing parenthesis)
any_format = df['phone'].str.contains(r'\d{3}[-.)]\s?\d{3}[-.)]\d{4}')
print(any_format)
# 0     True
# 1     True
# 2     True
# 3    False
# 4    False

# Match email addresses
emails = pd.DataFrame({
    'contact': ['john@example.com', 'invalid.email', 'jane@test.co.uk', 'not-an-email']
})

email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
valid_emails = emails[emails['contact'].str.contains(email_pattern, na=False)]
print(valid_emails)
#             contact
# 0  john@example.com
# 2   jane@test.co.uk
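When a pattern is anchored at both ends like this, str.fullmatch (available since pandas 1.1) is a natural alternative; a sketch where the `^` and `$` anchors become unnecessary:

```python
import pandas as pd

contacts = pd.Series(['john@example.com', 'invalid.email', 'jane@test.co.uk'])

# str.fullmatch requires the entire string to match the pattern,
# so no explicit anchors are needed
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
valid = contacts.str.fullmatch(pattern, na=False)
print(valid.tolist())  # [True, False, True]
```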

Literal String Matching

When searching for strings containing regex special characters, disable regex interpretation using regex=False.

df = pd.DataFrame({
    'code': ['item[1]', 'item[2]', 'item1', 'item*special', 'normal_item']
})

# Wrong: regex treats the brackets as a character class,
# so '[1]' matches any string containing the character '1'
wrong = df['code'].str.contains('[1]')
print(wrong)
# 0     True
# 1    False
# 2     True
# 3    False
# 4    False

# Correct: literal bracket search
correct = df['code'].str.contains('[1]', regex=False)
print(correct)
# 0     True
# 1    False
# 2    False
# 3    False
# 4    False

# Search for literal asterisk
asterisk_items = df['code'].str.contains('*', regex=False)
print(asterisk_items)
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False

Alternatively, escape special characters when using regex mode:

escaped = df['code'].str.contains(r'\[1\]', regex=True)
print(escaped)
# 0     True
# 1    False
# 2    False
# 3    False
# 4    False
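When the literal search term arrives in a variable (for example from user input), re.escape can build the escaped pattern for you; a small sketch:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'code': ['item[1]', 'item[2]', 'item.test', 'item*special']
})

# re.escape backslash-escapes every regex metacharacter in the term,
# so the search is effectively literal even in regex mode
term = 'item[1]'
mask = df['code'].str.contains(re.escape(term))
print(mask.tolist())  # [True, False, False, False]
```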

Multiple Pattern Matching

Combine multiple patterns using regex alternation or logical operators on boolean Series.

df = pd.DataFrame({
    'skill': ['Python', 'Java', 'JavaScript', 'Python, Java', 'C++', 'Ruby']
})

# OR matching with regex pipe operator
python_or_java = df['skill'].str.contains('Python|Java', na=False)
print(python_or_java)
# 0     True
# 1     True
# 2     True
# 3     True
# 4    False
# 5    False

# AND matching using multiple conditions
has_python = df['skill'].str.contains('Python', na=False)
has_java = df['skill'].str.contains('Java', na=False)
both = has_python & has_java
print(df[both])
#           skill
# 3  Python, Java

# NOT matching
not_python = ~df['skill'].str.contains('Python', na=False)
print(df[not_python])
#        skill
# 1       Java
# 2  JavaScript
# 4         C++
# 5       Ruby

# Complex conditions
scripting = df['skill'].str.contains('Python|Ruby|JavaScript', na=False)
compiled = df['skill'].str.contains(r'Java|C\+\+', na=False)
either = scripting | compiled
print(df[either])
#           skill
# 0        Python
# 1          Java
# 2    JavaScript
# 3  Python, Java
# 4           C++
# 5          Ruby
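A common variation is building the alternation from a list of terms, escaping each one so that special characters such as '+' stay literal; a sketch:

```python
import re

import pandas as pd

skills = pd.Series(['Python', 'Java', 'C++', 'Ruby', 'Python, Java'])

# Join the escaped terms with the regex alternation operator
terms = ['Python', 'C++']
pattern = '|'.join(re.escape(t) for t in terms)
mask = skills.str.contains(pattern, na=False)
print(mask.tolist())  # [True, False, True, False, True]
```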

Regex Flags

Use the flags parameter to pass regex compilation flags for advanced pattern matching.

import re

df = pd.DataFrame({
    'text': ['First line\nSecond line', 'Single line', 'UPPERCASE\ntext', 'normal text']
})

# Case-insensitive with IGNORECASE flag
case_insensitive = df['text'].str.contains('uppercase', flags=re.IGNORECASE, na=False)
print(case_insensitive)
# 0    False
# 1    False
# 2     True
# 3    False

# Match text spanning newlines with the DOTALL flag
# (dot then matches newline characters too)
multiline = df['text'].str.contains('First.*Second', flags=re.DOTALL, na=False)
print(multiline)
# 0     True
# 1    False
# 2    False
# 3    False

# Combine multiple flags
combined = df['text'].str.contains('FIRST.*SECOND', flags=re.IGNORECASE | re.DOTALL, na=False)
print(combined)
# 0     True
# 1    False
# 2    False
# 3    False
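Standard regex inline flags also work here, so placing `(?i)` at the start of the pattern is a compact alternative to passing flags=re.IGNORECASE; a sketch:

```python
import pandas as pd

texts = pd.Series(['UPPERCASE text', 'normal text'])

# The (?i) inline flag enables case-insensitive matching
# without importing re
mask = texts.str.contains('(?i)uppercase')
print(mask.tolist())  # [True, False]
```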

Real-World Application: Log Analysis

Practical example analyzing server logs to identify errors and specific request patterns.

logs = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=6, freq='h'),
    'message': [
        'INFO: User login successful',
        'ERROR: Database connection failed',
        'WARNING: High memory usage detected',
        'ERROR: NullPointerException in auth module',
        'INFO: Request processed in 120ms',
        'ERROR: Timeout connecting to external API'
    ],
    'source': ['auth', 'database', 'system', 'auth', 'api', 'api']
})

# Find all errors
errors = logs[logs['message'].str.contains('ERROR', na=False)]
print(f"Total errors: {len(errors)}")
# Total errors: 3

# Find database-related issues ('connect' also catches 'connecting')
db_issues = logs[logs['message'].str.contains('database|connect', case=False, na=False)]
print(db_issues)
#             timestamp                                    message    source
# 1 2024-01-01 01:00:00          ERROR: Database connection failed  database
# 5 2024-01-01 05:00:00  ERROR: Timeout connecting to external API       api

# Find authentication errors
auth_errors = logs[
    (logs['message'].str.contains('ERROR', na=False)) & 
    (logs['source'] == 'auth')
]
print(auth_errors)
#             timestamp                                    message source
# 3 2024-01-01 03:00:00  ERROR: NullPointerException in auth module   auth

# Extract messages that report a request processing time
performance = logs[logs['message'].str.contains(r'processed in \d+ms', na=False)]
print(performance)
#             timestamp                        message source
# 4 2024-01-01 04:00:00  INFO: Request processed in 120ms    api

# Complex filtering: warnings or errors excluding specific sources
critical = logs[
    logs['message'].str.contains('ERROR|WARNING', na=False) & 
    ~logs['source'].str.contains('system', na=False)
]
print(critical)
#             timestamp                                    message    source
# 1 2024-01-01 01:00:00         ERROR: Database connection failed  database
# 3 2024-01-01 03:00:00  ERROR: NullPointerException in auth module      auth
# 5 2024-01-01 05:00:00  ERROR: Timeout connecting to external API       api

The str.contains() method provides flexible string matching for data filtering and analysis. Understanding its parameters and combining it with regex enables efficient DataFrame operations for text-heavy datasets.
