Pandas - str.findall() with Regex

Key Insights

• str.findall() returns all non-overlapping matches of a regex pattern as lists within a Series, making it ideal for extracting multiple occurrences from text data
• Unlike str.extract(), which returns only the first match, findall() captures every instance, returning empty lists for non-matching rows instead of NaN values
• Combining str.findall() with capturing groups, lookaheads, and quantifiers enables complex text parsing tasks like extracting emails, phone numbers, or structured data from unstructured text

Basic Usage and Return Format

The str.findall() method searches for all occurrences of a regex pattern within each string element of a pandas Series. It returns a Series where each element is a list containing all matches found.

import pandas as pd
import re

df = pd.DataFrame({
    'text': [
        'Call me at 555-1234 or 555-5678',
        'No numbers here',
        'Single number: 555-9999',
        'Multiple: 111-2222, 333-4444, 555-6666'
    ]
})

# Extract all phone numbers
df['phone_numbers'] = df['text'].str.findall(r'\d{3}-\d{4}')
print(df)

Output:

                                     text                   phone_numbers
0         Call me at 555-1234 or 555-5678            [555-1234, 555-5678]
1                         No numbers here                              []
2                 Single number: 555-9999                      [555-9999]
3  Multiple: 111-2222, 333-4444, 555-6666  [111-2222, 333-4444, 555-6666]

Notice that non-matching rows return empty lists rather than NaN values. This behavior differs from str.extract() and makes subsequent processing more predictable.
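The contrast is easy to see side by side. A minimal sketch comparing the two methods on the same data:

```python
import pandas as pd

df = pd.DataFrame({'text': ['Call me at 555-1234 or 555-5678',
                            'No numbers here']})

# str.extract returns only the first match, and NaN when nothing matches
first = df['text'].str.extract(r'(\d{3}-\d{4})')
print(first)

# str.findall returns every match as a list, and [] when nothing matches
all_matches = df['text'].str.findall(r'\d{3}-\d{4}')
print(all_matches.tolist())  # [['555-1234', '555-5678'], []]
```

Because empty lists are truthy-testable and have a length, downstream code rarely needs special-case NaN handling.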

Working with Capturing Groups

When your regex contains capturing groups (parentheses), findall() returns tuples instead of strings. Each tuple contains the captured groups for one match.

df = pd.DataFrame({
    'email': [
        'Contact john.doe@company.com for details',
        'Reach out to jane_smith@example.org or admin@example.org',
        'No emails in this text',
        'Support: help@site.co.uk'
    ]
})

# Extract username and domain separately
pattern = r'([a-zA-Z0-9._]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
df['email_parts'] = df['email'].str.findall(pattern)
print(df['email_parts'])

Output:

0                             [(john.doe, company.com)]
1    [(jane_smith, example.org), (admin, example.org)]
2                                                   []
3                                 [(help, site.co.uk)]
Name: email_parts, dtype: object

To work with these tuples effectively, you can expand them into separate columns:

# Get first email only
df['first_username'] = df['email_parts'].apply(lambda x: x[0][0] if x else None)
df['first_domain'] = df['email_parts'].apply(lambda x: x[0][1] if x else None)

# Count total emails found
df['email_count'] = df['email_parts'].apply(len)
print(df[['first_username', 'first_domain', 'email_count']])
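If you want every match rather than just the first, an alternative to the lambda approach is to explode the tuple lists into one row per email and split each tuple into named columns. A sketch, rebuilding a subset of the data above:

```python
import pandas as pd

df = pd.DataFrame({'email': [
    'Contact john.doe@company.com for details',
    'Reach out to jane_smith@example.org or admin@example.org',
]})
pattern = r'([a-zA-Z0-9._]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
df['email_parts'] = df['email'].str.findall(pattern)

# One element per (username, domain) tuple; rows with empty lists become NaN
exploded = df['email_parts'].explode().dropna()

# Split each tuple into named columns, keeping the original row index
parts = pd.DataFrame(exploded.tolist(),
                     index=exploded.index,
                     columns=['username', 'domain'])
print(parts)
```

The preserved index lets you join `parts` back to the original frame if needed.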

Extracting Structured Data from Text

str.findall() excels at parsing semi-structured text where multiple data points appear in varying formats.

log_data = pd.DataFrame({
    'log_entry': [
        'ERROR [2024-01-15 14:23:45] Database connection failed',
        'INFO [2024-01-15 14:24:01] User login successful WARNING [2024-01-15 14:24:05] High memory usage',
        'DEBUG [2024-01-15 14:25:12] Cache cleared',
        'ERROR [2024-01-15 14:26:33] Timeout ERROR [2024-01-15 14:26:45] Retry failed'
    ]
})

# Extract all log levels and timestamps
pattern = r'(ERROR|WARNING|INFO|DEBUG) \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]'
log_data['events'] = log_data['log_entry'].str.findall(pattern)

# Count events by type
def count_errors(events):
    return sum(1 for level, _ in events if level == 'ERROR')

log_data['error_count'] = log_data['events'].apply(count_errors)
log_data['total_events'] = log_data['events'].apply(len)

print(log_data[['error_count', 'total_events', 'events']])
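Per-row counting is one option; to aggregate across the whole frame, the same tuples can be flattened with explode() and tallied with value_counts(). A sketch on three of the log lines above:

```python
import pandas as pd

log_data = pd.DataFrame({'log_entry': [
    'ERROR [2024-01-15 14:23:45] Database connection failed',
    'INFO [2024-01-15 14:24:01] User login successful WARNING [2024-01-15 14:24:05] High memory usage',
    'ERROR [2024-01-15 14:26:33] Timeout ERROR [2024-01-15 14:26:45] Retry failed',
]})
pattern = r'(ERROR|WARNING|INFO|DEBUG) \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]'
events = log_data['log_entry'].str.findall(pattern)

# Flatten to one (level, timestamp) tuple per row, then count levels
flat = events.explode().dropna()
levels = flat.str[0]          # .str indexing also works on tuples
print(levels.value_counts())
```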

Combining with explode() for Row Expansion

When you need to create one row per match instead of lists, combine findall() with explode():

df = pd.DataFrame({
    'product_id': ['A001', 'B002', 'C003'],
    'tags': [
        'electronics, portable, bluetooth',
        'furniture, office',
        'clothing, summer, sale, clearance'
    ]
})

# Extract all tags
df['tag_list'] = df['tags'].str.findall(r'\w+')

# Create one row per tag
df_exploded = df[['product_id', 'tag_list']].explode('tag_list')
df_exploded.rename(columns={'tag_list': 'tag'}, inplace=True)
print(df_exploded)

Output:

  product_id         tag
0       A001 electronics
0       A001    portable
0       A001   bluetooth
1       B002   furniture
1       B002      office
2       C003    clothing
2       C003      summer
2       C003        sale
2       C003   clearance
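Once matches are exploded into rows, ordinary aggregations apply. For example, counting how often each tag appears across all products (a sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['A001', 'B002', 'C003'],
    'tags': ['electronics, portable, bluetooth',
             'furniture, office',
             'clothing, summer, sale, clearance'],
})

# findall + explode, then tally tag frequency
tag_counts = (df['tags'].str.findall(r'\w+')
                        .explode()
                        .value_counts())
print(tag_counts)
```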

Advanced Pattern Matching

Lookaheads and Lookbehinds

Use lookarounds to anchor a match on surrounding context without including that context in the result. Here a lookbehind requires a preceding dollar sign but keeps it out of the extracted value:

df = pd.DataFrame({
    'text': [
        'Price: $25.99, Tax: $2.60',
        'Total: $100.00, Discount: $15.50, Final: $84.50',
        'Amount: $0.99'
    ]
})

# Extract all dollar amounts; the lookbehind keeps the $ out of the match
df['amounts'] = df['text'].str.findall(r'(?<=\$)\d+\.\d{2}')
print(df['amounts'])

Non-capturing Groups

Use (?:...) for grouping without capturing when you want to apply quantifiers or alternation:

df = pd.DataFrame({
    'urls': [
        'Visit https://example.com and http://test.org',
        'Check out ftp://files.net',
        'Links: https://secure.site.com, http://old.site.com, https://new.site.com'
    ]
})

# Extract all URLs regardless of protocol
pattern = r'(?:https?|ftp)://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
df['url_list'] = df['urls'].str.findall(pattern)
print(df['url_list'])

Performance Considerations

For large datasets, str.findall() can be computationally expensive. Consider these optimization strategies:

import numpy as np

# Create sample data
df = pd.DataFrame({
    'text': ['sample text ' + str(i) for i in range(100000)]
})

# Pre-filter to reduce regex operations
df['has_numbers'] = df['text'].str.contains(r'\d+', na=False)
df.loc[df['has_numbers'], 'numbers'] = df.loc[df['has_numbers'], 'text'].str.findall(r'\d+')

# For very large datasets, compile regex once
compiled_pattern = re.compile(r'\d+')
df['numbers_compiled'] = df['text'].apply(lambda x: compiled_pattern.findall(x))
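When you only need the number of matches rather than the matches themselves, str.count() avoids materializing per-row lists entirely. A small sketch of this cheaper alternative:

```python
import pandas as pd

df = pd.DataFrame({'text': ['a1 b22 c333', 'no digits', 'x9']})

# Counts non-overlapping regex matches without building lists
df['n_matches'] = df['text'].str.count(r'\d+')
print(df['n_matches'].tolist())  # [3, 0, 1]
```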

Common Pitfalls and Solutions

Empty Lists vs NaN

Remember that findall() returns empty lists, not NaN, for non-matches:

df = pd.DataFrame({'text': ['has123numbers', 'no numbers', None]})
df['numbers'] = df['text'].str.findall(r'\d+')

# This won't work as expected: isna() flags only the None row,
# not 'no numbers', which holds an empty list rather than NaN
# df[df['numbers'].isna()]

# Correct approach: test for non-empty lists explicitly
df['has_matches'] = df['numbers'].apply(lambda x: len(x) > 0 if isinstance(x, list) else False)
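Since the .str accessor also works element-wise on lists, a vectorized alternative is to filter on list length; NaN lengths compare as False, so None rows drop out automatically. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'text': ['has123numbers', 'no numbers', None]})
df['numbers'] = df['text'].str.findall(r'\d+')

# str.len returns 1, 0, and NaN here; NaN > 0 evaluates to False
matched = df[df['numbers'].str.len() > 0]
print(matched)
```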

Overlapping Matches

findall() only returns non-overlapping matches:

text = pd.Series(['AAAA'])

# Only finds non-overlapping 'AA'
print(text.str.findall(r'AA').tolist())  # [['AA', 'AA']]

# For overlapping matches, use a capturing group inside a lookahead
print(text.str.findall(r'(?=(AA))').tolist())  # [['AA', 'AA', 'AA']]

Practical Example: Log Analysis

Here’s a complete example analyzing server logs:

logs = pd.DataFrame({
    'entry': [
        '192.168.1.1 - - [15/Jan/2024:14:23:45] "GET /api/users HTTP/1.1" 200 1234',
        '10.0.0.5 - - [15/Jan/2024:14:24:01] "POST /api/login HTTP/1.1" 401 567 "Invalid credentials"',
        '192.168.1.1 - - [15/Jan/2024:14:25:12] "GET /api/products HTTP/1.1" 200 8901'
    ]
})

# Extract IP addresses
logs['ips'] = logs['entry'].str.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

# Extract HTTP methods
logs['methods'] = logs['entry'].str.findall(r'"(GET|POST|PUT|DELETE)')

# Extract status codes
logs['status_codes'] = logs['entry'].str.findall(r'" (\d{3}) ')

# Extract response sizes
logs['sizes'] = logs['entry'].str.findall(r'" \d{3} (\d+)')

print(logs[['ips', 'methods', 'status_codes', 'sizes']])
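Since each log line contains exactly one status code and one size, the single-element lists can be collapsed to scalar columns with .str[0] and converted to numeric types. A sketch on two of the entries above:

```python
import pandas as pd

logs = pd.DataFrame({'entry': [
    '192.168.1.1 - - [15/Jan/2024:14:23:45] "GET /api/users HTTP/1.1" 200 1234',
    '10.0.0.5 - - [15/Jan/2024:14:24:01] "POST /api/login HTTP/1.1" 401 567 "Invalid credentials"',
]})
logs['status_codes'] = logs['entry'].str.findall(r'" (\d{3}) ')

# One match per row: take the first list element and cast to int
logs['status'] = logs['status_codes'].str[0].astype(int)
print(logs['status'].tolist())  # [200, 401]
```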

The str.findall() method provides robust pattern matching capabilities essential for text processing pipelines. Master regex patterns and understand the list-based return format to unlock powerful data extraction workflows in pandas.
