Pandas - str.findall() with Regex
Key Insights
• str.findall() returns all non-overlapping matches of a regex pattern as lists within a Series, making it ideal for extracting multiple occurrences from text data
• Unlike str.extract() which returns only the first match, findall() captures every instance, returning empty lists for non-matching rows instead of NaN values
• Combining str.findall() with capturing groups, lookaheads, and quantifiers enables complex text parsing tasks like extracting emails, phone numbers, or structured data from unstructured text
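The contrast with str.extract() in the second insight can be seen side by side; a small sketch using made-up sample strings:

```python
import pandas as pd

# Made-up sample strings to contrast the two accessors
s = pd.Series(['a1 b2 c3', 'no digits'])

# str.extract() keeps only the first match and gives NaN on no match
first = s.str.extract(r'(\d)')
# str.findall() keeps every match and gives [] on no match
every = s.str.findall(r'\d')

print(first[0].tolist())  # ['1', nan]
print(every.tolist())     # [['1', '2', '3'], []]
```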
Basic Usage and Return Format
The str.findall() method searches for all occurrences of a regex pattern within each string element of a pandas Series. It returns a Series where each element is a list containing all matches found.
import pandas as pd
import re
df = pd.DataFrame({
    'text': [
        'Call me at 555-1234 or 555-5678',
        'No numbers here',
        'Single number: 555-9999',
        'Multiple: 111-2222, 333-4444, 555-6666'
    ]
})
# Extract all phone numbers
df['phone_numbers'] = df['text'].str.findall(r'\d{3}-\d{4}')
print(df)
Output:
text phone_numbers
0 Call me at 555-1234 or 555-5678 [555-1234, 555-5678]
1 No numbers here []
2 Single number: 555-9999 [555-9999]
3 Multiple: 111-2222, 333-4444, 555-6666 [111-2222, 333-4444, 555-6666]
Notice that non-matching rows return empty lists rather than NaN values. This behavior differs from str.extract() and makes subsequent processing more predictable.
Working with Capturing Groups
When your regex contains multiple capturing groups (parentheses), findall() returns a list of tuples instead of strings. Each tuple holds the captured groups for one match.
df = pd.DataFrame({
    'email': [
        'Contact john.doe@company.com for details',
        'Reach out to jane_smith@example.org or admin@example.org',
        'No emails in this text',
        'Support: help@site.co.uk'
    ]
})
# Extract username and domain separately
pattern = r'([a-zA-Z0-9._]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
df['email_parts'] = df['email'].str.findall(pattern)
print(df['email_parts'])
Output:
0 [(john.doe, company.com)]
1 [(jane_smith, example.org), (admin, example.org)]
2 []
3 [(help, site.co.uk)]
Name: email_parts, dtype: object
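One subtlety worth noting: tuples only appear when the pattern has two or more groups. With exactly one capturing group, findall() returns plain strings containing just the captured part; a small sketch with a made-up address:

```python
import pandas as pd

s = pd.Series(['john@a.com jane@b.org'])

# Two or more groups: list of tuples per row
pairs = s.str.findall(r'(\w+)@(\w+\.\w+)').iloc[0]
# Exactly one group: list of the captured substrings
names = s.str.findall(r'(\w+)@\w+\.\w+').iloc[0]

print(pairs)  # [('john', 'a.com'), ('jane', 'b.org')]
print(names)  # ['john', 'jane']
```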
To work with these tuples effectively, you can expand them into separate columns:
# Get first email only
df['first_username'] = df['email_parts'].apply(lambda x: x[0][0] if x else None)
df['first_domain'] = df['email_parts'].apply(lambda x: x[0][1] if x else None)
# Count total emails found
df['email_count'] = df['email_parts'].apply(len)
print(df[['first_username', 'first_domain', 'email_count']])
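As an alternative to unpacking tuples by hand, pandas also provides str.extractall(), which returns one row per match with the captured groups as ordinary columns; a brief sketch reusing two of the sample rows above:

```python
import pandas as pd

df = pd.DataFrame({'email': [
    'Contact john.doe@company.com for details',
    'Reach out to jane_smith@example.org or admin@example.org',
]})

# str.extractall() returns one row per match, indexed by (row, match number),
# so the captured groups land in columns instead of lists of tuples
parts = df['email'].str.extractall(r'([a-zA-Z0-9._]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})')
parts.columns = ['username', 'domain']
print(parts)
```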
Extracting Structured Data from Text
str.findall() excels at parsing semi-structured text where multiple data points appear in varying formats.
log_data = pd.DataFrame({
    'log_entry': [
        'ERROR [2024-01-15 14:23:45] Database connection failed',
        'INFO [2024-01-15 14:24:01] User login successful WARNING [2024-01-15 14:24:05] High memory usage',
        'DEBUG [2024-01-15 14:25:12] Cache cleared',
        'ERROR [2024-01-15 14:26:33] Timeout ERROR [2024-01-15 14:26:45] Retry failed'
    ]
})
# Extract all log levels and timestamps
pattern = r'(ERROR|WARNING|INFO|DEBUG) \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]'
log_data['events'] = log_data['log_entry'].str.findall(pattern)
# Count events by type
def count_errors(events):
    return sum(1 for level, _ in events if level == 'ERROR')
log_data['error_count'] = log_data['events'].apply(count_errors)
log_data['total_events'] = log_data['events'].apply(len)
print(log_data[['error_count', 'total_events', 'events']])
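Rather than counting inside the lists, you can also explode the (level, timestamp) tuples into a tidy frame first; a sketch reusing two of the log entries above:

```python
import pandas as pd

log_data = pd.DataFrame({'log_entry': [
    'ERROR [2024-01-15 14:23:45] Database connection failed',
    'ERROR [2024-01-15 14:26:33] Timeout ERROR [2024-01-15 14:26:45] Retry failed',
]})
pattern = r'(ERROR|WARNING|INFO|DEBUG) \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]'

# Explode each row's list of (level, timestamp) tuples into one tuple per row;
# rows with no matches explode to NaN and are dropped
events = log_data['log_entry'].str.findall(pattern).explode().dropna()
# Rebuild a tidy frame with one column per captured group
tidy = pd.DataFrame(events.tolist(), columns=['level', 'timestamp'])
print(tidy['level'].value_counts())
```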
Combining with explode() for Row Expansion
When you need to create one row per match instead of lists, combine findall() with explode():
df = pd.DataFrame({
    'product_id': ['A001', 'B002', 'C003'],
    'tags': [
        'electronics, portable, bluetooth',
        'furniture, office',
        'clothing, summer, sale, clearance'
    ]
})
# Extract all tags
df['tag_list'] = df['tags'].str.findall(r'\w+')
# Create one row per tag
df_exploded = df[['product_id', 'tag_list']].explode('tag_list')
df_exploded.rename(columns={'tag_list': 'tag'}, inplace=True)
print(df_exploded)
Output:
product_id tag
0 A001 electronics
0 A001 portable
0 A001 bluetooth
1 B002 furniture
1 B002 office
2 C003 clothing
2 C003 summer
2 C003 sale
2 C003 clearance
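Once exploded, ordinary pandas aggregation applies to the matches; for example, tag frequencies fall out of value_counts(). A small sketch with abbreviated, made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['A001', 'B002'],
    'tags': ['electronics, portable', 'electronics, office'],
})

# Explode the extracted tags, then count how often each tag occurs
exploded = df.assign(tag=df['tags'].str.findall(r'\w+')).explode('tag')
print(exploded['tag'].value_counts())
```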
Advanced Pattern Matching
Lookaheads and Lookbehinds
Lookarounds let you anchor a match on surrounding context without consuming it. Here a lookbehind extracts each dollar amount that follows a $ sign, keeping the sign itself out of the results:
df = pd.DataFrame({
    'text': [
        'Price: $25.99, Tax: $2.60',
        'Total: $100.00, Discount: $15.50, Final: $84.50',
        'Amount: $0.99'
    ]
})
# Lookbehind: match amounts preceded by a dollar sign
df['amounts'] = df['text'].str.findall(r'(?<=\$)\d+\.\d{2}')
print(df['amounts'])
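A lookahead anchors on what follows the match instead, without consuming it; a small sketch with made-up sample text:

```python
import pandas as pd

s = pd.Series(['Battery 85% CPU 42% Disk 7GB'])

# Lookahead: digits count as a match only when '%' follows; '%' is not consumed
percents = s.str.findall(r'\d+(?=%)').iloc[0]
print(percents)  # ['85', '42']
```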
Non-capturing Groups
Use (?:...) for grouping without capturing when you want to apply quantifiers or alternation:
df = pd.DataFrame({
    'urls': [
        'Visit https://example.com and http://test.org',
        'Check out ftp://files.net',
        'Links: https://secure.site.com, http://old.site.com, https://new.site.com'
    ]
})
# Extract all URLs regardless of protocol
pattern = r'(?:https?|ftp)://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
df['url_list'] = df['urls'].str.findall(pattern)
print(df['url_list'])
Performance Considerations
For large datasets, str.findall() can be computationally expensive. Consider these optimization strategies:
import numpy as np
# Create sample data (only odd-numbered rows contain digits)
df = pd.DataFrame({
    'text': ['sample text ' + str(i) if i % 2 else 'plain sample text' for i in range(100000)]
})
# Pre-filter with str.contains() so findall() only runs on rows that can match;
# rows excluded by the filter are left as NaN in the new column
df['has_numbers'] = df['text'].str.contains(r'\d+', na=False)
df.loc[df['has_numbers'], 'numbers'] = df.loc[df['has_numbers'], 'text'].str.findall(r'\d+')
# For very large datasets, compile regex once
compiled_pattern = re.compile(r'\d+')
df['numbers_compiled'] = df['text'].apply(lambda x: compiled_pattern.findall(x))
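When you only need the number of matches per row, str.count() skips building the intermediate lists entirely, which is typically cheaper than calling len() on findall() output; a sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'text': ['a1 b22', 'no digits', 'c3']})

# str.count() returns the number of matches per row directly, skipping the
# intermediate lists that str.findall() would build
counts = df['text'].str.count(r'\d+')
print(counts.tolist())  # [2, 0, 1]
```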
Common Pitfalls and Solutions
Empty Lists vs NaN
Remember that findall() returns empty lists, not NaN, for non-matching strings; genuinely missing inputs such as None still produce NaN:
df = pd.DataFrame({'text': ['has123numbers', 'no numbers', None]})
df['numbers'] = df['text'].str.findall(r'\d+')
# This won't work as expected: isna() catches only the None row,
# not the 'no numbers' row, whose result is [] rather than NaN
# df[df['numbers'].isna()]
# Correct approach: test for a non-empty list explicitly
df['has_matches'] = df['numbers'].apply(lambda x: isinstance(x, list) and len(x) > 0)
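If downstream code assumes lists everywhere, you can normalize the NaN produced by None inputs into empty lists up front; a sketch reusing the same small frame:

```python
import pandas as pd

df = pd.DataFrame({'text': ['has123numbers', 'no numbers', None]})
df['numbers'] = df['text'].str.findall(r'\d+')

# None inputs yield NaN rather than []; normalize so len() and iteration
# work uniformly across the column
df['numbers'] = df['numbers'].apply(lambda x: x if isinstance(x, list) else [])
print(df['numbers'].tolist())  # [['123'], [], []]
```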
Overlapping Matches
findall() only returns non-overlapping matches:
text = pd.Series(['AAAA'])
# Only finds non-overlapping 'AA'
print(text.str.findall(r'AA')) # [['AA', 'AA']]
# For overlapping matches, use lookahead
print(text.str.findall(r'(?=(AA))')) # [['AA', 'AA', 'AA']]
Practical Example: Log Analysis
Here’s a complete example analyzing server logs:
logs = pd.DataFrame({
    'entry': [
        '192.168.1.1 - - [15/Jan/2024:14:23:45] "GET /api/users HTTP/1.1" 200 1234',
        '10.0.0.5 - - [15/Jan/2024:14:24:01] "POST /api/login HTTP/1.1" 401 567 "Invalid credentials"',
        '192.168.1.1 - - [15/Jan/2024:14:25:12] "GET /api/products HTTP/1.1" 200 8901'
    ]
})
# Extract IP addresses
logs['ips'] = logs['entry'].str.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
# Extract HTTP methods
logs['methods'] = logs['entry'].str.findall(r'"(GET|POST|PUT|DELETE)')
# Extract status codes
logs['status_codes'] = logs['entry'].str.findall(r'" (\d{3}) ')
# Extract response sizes
logs['sizes'] = logs['entry'].str.findall(r'" \d{3} (\d+)')
print(logs[['ips', 'methods', 'status_codes', 'sizes']])
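The extracted lists feed directly into numeric analysis via explode(); for instance, totaling response sizes. A sketch using two abbreviated log lines:

```python
import pandas as pd

logs = pd.DataFrame({'entry': [
    '"GET /api/users HTTP/1.1" 200 1234',
    '"GET /api/products HTTP/1.1" 200 8901',
]})

# Explode the per-row size lists, convert to numbers, and aggregate
sizes = logs['entry'].str.findall(r'" \d{3} (\d+)').explode()
total = pd.to_numeric(sizes).sum()
print(total)  # 10135
```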
The str.findall() method provides robust pattern matching capabilities essential for text processing pipelines. Master regex patterns and understand the list-based return format to unlock powerful data extraction workflows in pandas.