Python - Regex Match, Search, FindAll

Key Insights

• match() checks patterns only at the string’s beginning, search() finds the first occurrence anywhere, and findall() returns all non-overlapping matches as a list
• Compiled regex patterns with re.compile() offer better performance for repeated operations and cleaner code when using the same pattern multiple times
• Raw strings (r"pattern") prevent Python from interpreting backslashes, making regex patterns more readable and avoiding double-escaping issues

Understanding the Core Differences

Python’s re module provides three fundamental methods for pattern matching, each serving distinct use cases. The match() method anchors to the start of the string, search() scans the entire string for the first occurrence, and findall() extracts all matches into a list.

import re

text = "The price is $25 and the discount is $10"

# match() - only checks beginning
match_result = re.match(r'\$\d+', text)
print(match_result)  # None - pattern doesn't start at beginning

# search() - finds first occurrence anywhere
search_result = re.search(r'\$\d+', text)
print(search_result.group())  # $25

# findall() - returns all matches
findall_result = re.findall(r'\$\d+', text)
print(findall_result)  # ['$25', '$10']
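
The raw-string point from the key insights is easiest to see with a word boundary. The following is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "The cat sat on the catalog"

# Without the r prefix, Python turns '\b' into a backspace character,
# so the regex engine never sees a word-boundary token.
plain = re.findall('\bcat\b', text)
print(plain)  # []

# With a raw string, the backslash reaches the regex engine intact.
raw = re.findall(r'\bcat\b', text)
print(raw)  # ['cat']

# Doubling the backslashes also works, but it is harder to read.
doubled = re.findall('\\bcat\\b', text)
print(doubled)  # ['cat']
```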

Working with Match Objects

When match() or search() finds a pattern, it returns a Match object containing position information and captured groups; when nothing matches, both return None. Understanding these objects is critical for extracting precise data.

import re

log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"

# Extract timestamp and level
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)'
match = re.search(pattern, log_entry)

if match:
    print(match.group(0))  # Full match: 2024-01-15 14:23:45 ERROR
    print(match.group(1))  # First group: 2024-01-15
    print(match.group(2))  # Second group: 14:23:45
    print(match.group(3))  # Third group: ERROR
    
# Named groups for clarity
pattern_named = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+)'
match_named = re.search(pattern_named, log_entry)
if match_named:
    print(match_named.group('level'))  # ERROR
    print(match_named.groupdict())  # {'date': '2024-01-15', 'time': '14:23:45', 'level': 'ERROR'}
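
Match objects also carry the position information mentioned above. As a small sketch (the pattern and variable names here are illustrative assumptions), start(), end(), and span() report where the match sits in the original string, which is handy for slicing out whatever follows it.

```python
import re

log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"

level_match = re.search(r'(?P<level>ERROR|WARNING|INFO)', log_entry)

if level_match:
    span = level_match.span()                  # (start, end) of the whole match
    level_start = level_match.start('level')   # start index of the named group
    # Everything after the matched level is the log message
    message = log_entry[level_match.end():].strip()

    print(span)         # (20, 25)
    print(level_start)  # 20
    print(message)      # Database connection failed
```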

Compiled Patterns for Performance

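Compiling a pattern once with re.compile() and reusing the resulting object avoids repeating the pattern lookup on every call. The re module caches recently used patterns internally, so the gain is usually modest, but it adds up when processing large datasets, and a named compiled pattern keeps the code cleaner.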

import re
import time

emails = [
    "contact@example.com",
    "invalid-email",
    "admin@test.org",
    "user@domain.co.uk"
] * 10000

# Module-level function - the pattern is looked up in re's cache on every call
start = time.perf_counter()
pattern_str = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
valid_emails_1 = [email for email in emails if re.match(pattern_str, email)]
time_uncompiled = time.perf_counter() - start

# Precompiled pattern - compiled once, then matched directly
start = time.perf_counter()
pattern_compiled = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
valid_emails_2 = [email for email in emails if pattern_compiled.match(email)]
time_compiled = time.perf_counter() - start

print(f"Uncompiled: {time_uncompiled:.4f}s")
print(f"Compiled: {time_compiled:.4f}s")
print(f"Speedup: {time_uncompiled/time_compiled:.2f}x")
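
A single timed run like the one above can be noisy. One way to get steadier numbers, sketched here under the assumption that the standard timeit module is acceptable for your benchmark, is to repeat each variant several times and keep the best run. The function names and list size are illustrative, and the actual timings depend on your machine, so no output is shown.

```python
import re
import timeit

EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
compiled_email = re.compile(EMAIL_PATTERN)
sample_emails = ["contact@example.com", "invalid-email", "admin@test.org"] * 1000

def with_module_function():
    # re.match() fetches the compiled pattern from re's cache on each call
    return [e for e in sample_emails if re.match(EMAIL_PATTERN, e)]

def with_compiled_pattern():
    # The compiled object is reused directly
    return [e for e in sample_emails if compiled_email.match(e)]

# Best of 3 repeats, 20 calls each, to reduce noise from other processes
t_module = min(timeit.repeat(with_module_function, number=20, repeat=3))
t_compiled = min(timeit.repeat(with_compiled_pattern, number=20, repeat=3))

print(f"Module-level: {t_module:.4f}s")
print(f"Compiled:     {t_compiled:.4f}s")
```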

Advanced findall() Techniques

The findall() method behaves differently depending on whether your pattern contains groups. With no groups, it returns matched strings. With one group, it returns that group’s values. With multiple groups, it returns tuples.

import re

html = """
<div class="product" data-id="101">Laptop</div>
<div class="product" data-id="102">Mouse</div>
<div class="product" data-id="103">Keyboard</div>
"""

# No groups - returns full matches
tags = re.findall(r'<div[^>]*>', html)
print(tags)  # ['<div class="product" data-id="101">', ...]

# Single group - returns only group content
ids = re.findall(r'data-id="(\d+)"', html)
print(ids)  # ['101', '102', '103']

# Multiple groups - returns tuples
products = re.findall(r'data-id="(\d+)">([^<]+)', html)
print(products)  # [('101', 'Laptop'), ('102', 'Mouse'), ('103', 'Keyboard')]

# Non-capturing groups with (?:...) when you need grouping but not capture
urls = "Visit https://example.com and http://test.org"
domains = re.findall(r'(?:https?|ftp)://([a-z.]+)', urls)
print(domains)  # ['example.com', 'test.org']
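
A common surprise follows from these rules: adding a group just for alternation changes what findall() returns. The sketch below uses a made-up file list to show the pitfall and two ways around it.

```python
import re

files = "photo.jpg, logo.png, notes.txt"

# The capturing group hijacks the result: only the extension comes back
extensions = re.findall(r'\w+\.(jpg|png)', files)
print(extensions)  # ['jpg', 'png']

# A non-capturing group keeps the alternation but returns the full match
image_names = re.findall(r'\w+\.(?:jpg|png)', files)
print(image_names)  # ['photo.jpg', 'logo.png']

# finditer() with group(0) also gives the full match, whatever groups exist
via_finditer = [m.group(0) for m in re.finditer(r'\w+\.(jpg|png)', files)]
print(via_finditer)  # ['photo.jpg', 'logo.png']
```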

Practical Data Extraction Patterns

Real-world applications often require extracting structured data from unstructured text. Here are common patterns for API logs and configuration files.

import re

# Extract API endpoints and response times from logs
api_log = """
GET /api/users 200 45ms
POST /api/orders 201 123ms
GET /api/products 200 34ms
DELETE /api/users/123 404 12ms
"""

# Using findall with multiple groups
pattern = r'(\w+) (/[\w/]+) (\d+) (\d+)ms'
requests = re.findall(pattern, api_log)

for method, endpoint, status, elapsed_ms in requests:
    # findall() returns strings, so convert before comparing
    if int(elapsed_ms) > 50:
        print(f"Slow request: {method} {endpoint} took {elapsed_ms}ms")

# Parse configuration key-value pairs
config = """
database.host=localhost
database.port=5432
cache.enabled=true
cache.ttl=3600
"""

config_dict = {}
pattern = re.compile(r'^([a-z.]+)=(.+)$', re.MULTILINE)
for match in pattern.finditer(config):
    key, value = match.groups()
    config_dict[key] = value

print(config_dict)
# {'database.host': 'localhost', 'database.port': '5432', ...}
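
The tuple-based loop over the API log can also be written with named groups, which avoids relying on group order. This is one possible sketch, not the only way to do it; the group names and the records structure are assumptions chosen for illustration.

```python
import re

api_log = """
GET /api/users 200 45ms
POST /api/orders 201 123ms
GET /api/products 200 34ms
DELETE /api/users/123 404 12ms
"""

LINE_RE = re.compile(
    r'^(?P<method>[A-Z]+) (?P<endpoint>/\S*) (?P<status>\d{3}) (?P<ms>\d+)ms$',
    re.MULTILINE,
)

# Build one dict per log line, converting the numeric fields to int
records = [
    {**m.groupdict(), "status": int(m["status"]), "ms": int(m["ms"])}
    for m in LINE_RE.finditer(api_log)
]

slow = [r for r in records if r["ms"] > 50]
print(slow)
# [{'method': 'POST', 'endpoint': '/api/orders', 'status': 201, 'ms': 123}]
```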

Handling Edge Cases and Validation

Production code requires robust pattern matching that handles malformed input gracefully. Always validate Match objects before accessing groups and consider using try-except blocks for critical operations.

import re

def extract_phone_numbers(text):
    """Extract US phone numbers in various formats"""
    # Matches: (123) 456-7890, 123-456-7890, 123.456.7890
    pattern = re.compile(r'\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})')
    
    matches = pattern.findall(text)
    return [f"({area}) {prefix}-{line}" for area, prefix, line in matches]

contact_info = """
Call us at (555) 123-4567 or 555.987.6543
Emergency: 555-111-2222
"""

phones = extract_phone_numbers(contact_info)
print(phones)  # ['(555) 123-4567', '(555) 987-6543', '(555) 111-2222']

# Validate and extract with error handling
def safe_extract_version(text):
    """Safely extract semantic version numbers"""
    pattern = r'v?(\d+)\.(\d+)\.(\d+)'
    match = re.search(pattern, text)
    
    if not match:
        return None
    
    try:
        major, minor, patch = map(int, match.groups())
        return {"major": major, "minor": minor, "patch": patch}
    except (ValueError, AttributeError):
        return None

print(safe_extract_version("Release v2.3.1"))  # {'major': 2, 'minor': 3, 'patch': 1}
print(safe_extract_version("Invalid"))  # None
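
Note that search() and match() both accept trailing text after the pattern. When the whole string must conform, re.fullmatch() is one option. The sketch below reuses the version pattern from above; the helper name is_strict_version is hypothetical.

```python
import re

VERSION_RE = re.compile(r'v?(\d+)\.(\d+)\.(\d+)')

def is_strict_version(text):
    """Return True only if the entire string is a version like v2.3.1."""
    return VERSION_RE.fullmatch(text) is not None

candidate = "v2.3.1-beta"

prefix_ok = VERSION_RE.match(candidate) is not None
print(prefix_ok)                     # True: the prefix "v2.3.1" matches
print(is_strict_version(candidate))  # False: "-beta" is left over
print(is_strict_version("v2.3.1"))   # True
```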

Using finditer() for Memory Efficiency

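When processing large files or streams, finditer() returns an iterator that yields Match objects one at a time instead of building the full list of results up front as findall() does. The input string itself still has to be in memory; the saving is in the results.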

import re

# Simulate large log file
large_log = "ERROR: Connection timeout\n" * 100000 + \
            "WARNING: Slow query\n" * 50000 + \
            "INFO: Request processed\n" * 200000

# Memory-efficient iteration
pattern = re.compile(r'^(ERROR|WARNING): (.+)$', re.MULTILINE)
error_count = 0
warning_count = 0

for match in pattern.finditer(large_log):
    level, message = match.groups()
    if level == "ERROR":
        error_count += 1
    elif level == "WARNING":
        warning_count += 1

print(f"Errors: {error_count}, Warnings: {warning_count}")
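
For a real file, an alternative that never holds the whole text in memory is to read line by line and apply a compiled pattern to each line. The sketch below assumes one log record per line and uses io.StringIO as a stand-in for an open file; count_levels is a hypothetical helper.

```python
import io
import re
from collections import Counter

LEVEL_RE = re.compile(r'(ERROR|WARNING): (.+)')

def count_levels(lines):
    """Count ERROR and WARNING lines from any iterable of strings."""
    counts = Counter()
    for line in lines:
        m = LEVEL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Stand-in for open("app.log"); a file object is iterated the same way
fake_file = io.StringIO(
    "ERROR: Connection timeout\n" * 3
    + "WARNING: Slow query\n" * 2
    + "INFO: Request processed\n" * 4
)

counts = count_levels(fake_file)
print(counts["ERROR"], counts["WARNING"])  # 3 2
```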

Flags and Multiline Patterns

Regex flags modify pattern behavior. The most common are re.IGNORECASE, re.MULTILINE, and re.DOTALL. Understanding when to apply each flag prevents subtle bugs.

import re

markdown = """# Header
This is **bold** text.
This is *italic* text.
## Subheader
More **bold** content."""

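# MULTILINE makes ^ and $ match at the start and end of every line
headers = re.findall(r'^#{1,6}\s+(.+)$', markdown, re.MULTILINE)
print(headers)  # ['Header', 'Subheader']

# IGNORECASE matches letters regardless of case
more_lines = re.findall(r'^more .+$', markdown, re.MULTILINE | re.IGNORECASE)
print(more_lines)  # ['More **bold** content.']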

# Extract bold text with non-greedy matching
bold_text = re.findall(r'\*\*(.+?)\*\*', markdown)
print(bold_text)  # ['bold', 'bold']

# DOTALL flag makes . match newlines
code_block = """```python
def hello():
    print("world")
```"""

code = re.search(r'```python(.+?)```', code_block, re.DOTALL)
if code:
    print(code.group(1).strip())
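
To see what each flag changes, it helps to run the same pattern with and without it. This is a minimal sketch on a made-up two-line string.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the very start of the string
no_multiline = re.findall(r'^\w+', text)
with_multiline = re.findall(r'^\w+', text, re.MULTILINE)
print(no_multiline)    # ['first']
print(with_multiline)  # ['first', 'second']

# Without DOTALL, . stops at the newline, so the pattern cannot span lines
no_dotall = re.search(r'first.+second', text)
with_dotall = re.search(r'first.+second', text, re.DOTALL)
print(no_dotall)                    # None
print(repr(with_dotall.group()))    # 'first line\nsecond'
```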

Choose the appropriate method based on your needs: match() for checking that a string starts with a given format (anchor the end of the pattern or use fullmatch() to validate the whole string), search() for finding specific patterns, findall() for batch extraction, and finditer() for memory-efficient processing. Compile patterns when reusing them, and always handle Match objects defensively in production code.
