Python Regular Expressions: re Module Complete Guide

Key Insights

The re module provides powerful pattern matching beyond simple string methods, essential for validation, parsing, and complex text extraction—but use plain string methods when they suffice to avoid unnecessary complexity.
Understanding the distinction between search(), match(), and fullmatch() prevents common bugs; search() finds patterns anywhere, match() only at the start, and fullmatch() requires the entire string to match.
Compile frequently used patterns with re.compile() to give them a name and reuse them cleanly; the speedup is usually modest because re already caches recently used patterns. Leverage named groups (?P<name>...) to make data extraction self-documenting.

Introduction to Regular Expressions in Python

Regular expressions (regex) are pattern-matching tools for text processing. Python’s re module provides a complete implementation for searching, matching, and manipulating strings based on patterns. While string methods like str.find() or str.startswith() work well for simple cases, regex excels at complex pattern matching, validation, and extraction.

Use regex when you need to:

Validate formats (emails, phone numbers, URLs)
Extract structured data from unstructured text
Search for patterns with variations (case-insensitive, optional components)
Split or replace text based on complex rules

Here’s when string methods are sufficient versus when regex is necessary:

import re

text = "Contact: john@example.com"

# String method - simple and fast for exact matches
if "@" in text:
    print("Contains email")

# Regex - necessary for actual validation
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # no '|' inside [...]: it would match a literal pipe
if re.search(email_pattern, text):
    print("Contains valid email format")

The basic re.search() function scans through a string looking for the first location where the pattern matches:

import re

text = "The price is $49.99"
match = re.search(r'\$\d+\.\d{2}', text)
if match:
    print(f"Found price: {match.group()}")  # Output: Found price: $49.99

Core Pattern Matching Functions

The re module provides several functions for different matching scenarios. Understanding their differences prevents common bugs.

search() vs. match() vs. fullmatch()

import re

pattern = r'\d{3}'
text = "Call 555-1234 for info"

# search() - finds pattern anywhere in string
print(re.search(pattern, text).group())  # Output: 555

# match() - only matches at start of string
print(re.match(pattern, text))  # Output: None (doesn't start with digits)

# fullmatch() - entire string must match pattern
print(re.fullmatch(pattern, "555"))  # Output: <re.Match object; span=(0, 3), match='555'>
print(re.fullmatch(pattern, "555-1234"))  # Output: None (extra characters)
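
All three functions return None when nothing matches, so chaining .group() directly onto the call raises AttributeError on a miss. A minimal sketch of a guard follows; the helper name first_three_digits is hypothetical and used only for illustration:

```python
import re

def first_three_digits(text):
    """Return the first run of three digits in text, or None if there is none.

    Hypothetical helper for illustration; not part of the re module.
    """
    match = re.search(r'\d{3}', text)
    # re.search() returns None on a miss, so check before calling .group()
    return match.group() if match else None

print(first_three_digits("Call 555-1234 for info"))  # 555
print(first_three_digits("No digits here"))          # None
```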

findall() for extracting all matches

findall() returns a list of all non-overlapping matches, perfect for extraction tasks:

import re

text = """
Contact us at support@company.com or sales@company.com.
For urgent matters, reach admin@company.org.
"""

emails = re.findall(r'\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b', text)
print(emails)
# Output: ['support@company.com', 'sales@company.com', 'admin@company.org']
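
One behavior that often surprises people: once the pattern contains capturing groups, findall() returns the groups rather than the whole match. The sketch below uses a simplified email pattern and made-up addresses to show the three cases side by side:

```python
import re

text = "support@company.com, admin@company.org"

# No groups: each item is the whole match
whole = re.findall(r'[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}', text)

# One group: each item is only that group's text
domains = re.findall(r'[\w.%+-]+@([\w.-]+\.[A-Za-z]{2,})', text)

# Several groups: each item is a tuple of the groups
pairs = re.findall(r'([\w.%+-]+)@([\w.-]+\.[A-Za-z]{2,})', text)

print(whole)    # ['support@company.com', 'admin@company.org']
print(domains)  # ['company.com', 'company.org']
print(pairs)    # [('support', 'company.com'), ('admin', 'company.org')]
```

If you need grouping without this effect, a non-capturing group (?:...) keeps findall() returning whole matches.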

finditer() for memory efficiency

When processing large texts, finditer() returns an iterator of Match objects instead of storing all matches in memory:

import re

log = "ERROR at 10:23:45, WARNING at 10:24:12, ERROR at 10:25:03"
pattern = r'(ERROR|WARNING) at (\d{2}:\d{2}:\d{2})'

for match in re.finditer(pattern, log):
    level, timestamp = match.groups()
    print(f"{level}: {timestamp}")
# Output:
# ERROR: 10:23:45
# WARNING: 10:24:12
# ERROR: 10:25:03

Pattern Syntax and Special Characters

Regex patterns use special characters to define matching rules. Master these fundamentals:

Character classes and predefined sets

import re

text = "User123 logged in at 2024-01-15"

# \d matches digits, \w matches word characters (letters, digits, underscore)
# \s matches whitespace
user_id = re.search(r'User\d+', text).group()  # User123
date = re.search(r'\d{4}-\d{2}-\d{2}', text).group()  # 2024-01-15

# Custom character class
code = "Product: AB-123-XY"
product_code = re.search(r'[A-Z]{2}-\d{3}-[A-Z]{2}', code).group()
print(product_code)  # AB-123-XY
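
The example above does not exercise \s or negated classes, so here is a small sketch that does. The input line and its key=value layout are assumptions made up for the demonstration:

```python
import re

line = "name = Ada Lovelace ; id=42"

# \s* absorbs optional whitespace around the '=' sign
# [^;]+ is a negated class: one or more characters that are not ';'
pairs = re.findall(r'(\w+)\s*=\s*([^;]+)', line)

# Trim the whitespace that [^;]+ picked up before the ';'
cleaned = [(key, value.strip()) for key, value in pairs]
print(cleaned)  # [('name', 'Ada Lovelace'), ('id', '42')]
```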

Quantifiers for repetition

import re

# * (zero or more), + (one or more), ? (zero or one), {n,m} (between n and m)
patterns = {
    r'colou?r': "color or colour",  # ? makes 'u' optional
    r'\d{3,5}': "3 to 5 digits",
    r'ha+': "ha, haa, haaa, etc.",
    r'(?:ab)+': "ab, abab, etc."  # non-capturing, so findall() returns whole matches
}

text = "colour color haaaa 12345 abab"
for pattern, description in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")

Capturing groups and named groups

Groups extract specific parts of matches:

import re

# Standard capturing groups
phone = "Call (555) 123-4567"
match = re.search(r'\((\d{3})\) (\d{3})-(\d{4})', phone)
area, prefix, line = match.groups()
print(f"Area: {area}, Prefix: {prefix}, Line: {line}")

# Named groups for clarity
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
date = "2024-01-15"
match = re.search(pattern, date)
print(match.groupdict())  # {'year': '2024', 'month': '01', 'day': '15'}
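
Two related features are worth a quick sketch: non-capturing groups, which group without adding an entry to groups(), and named backreferences, which require a later part of the pattern to repeat what a group matched. The sample strings are invented for illustration:

```python
import re

# (?:...) groups without capturing, so groups() only holds what you care about
match = re.search(r'(?:Mr|Ms|Dr)\. (\w+)', "Please ask Dr. Hopper")
print(match.groups())  # ('Hopper',)

# (?P=quote) must repeat whatever the named group 'quote' matched
quoted = re.compile(r'(?P<quote>[\'"])(?P<body>.*?)(?P=quote)')
text = """say "it's fine" or 'ok'"""
bodies = [m.group('body') for m in quoted.finditer(text)]
print(bodies)  # ["it's fine", 'ok']
```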

Lookahead and lookbehind assertions

These match positions without consuming characters:

import re

# Positive lookahead: match X only if followed by Y
text = "Price: $100, Discount: 20%"
percents = re.findall(r'\d+(?=%)', text)  # Digits followed by %
print(percents)  # ['20']

# Lookahead as a validation condition
passwords = ["Pass123!", "weak", "Secure#99"]
# Match passwords of 6+ characters that contain at least one digit
valid = [p for p in passwords if re.search(r'^(?=.*\d).{6,}$', p)]
print(valid)  # ['Pass123!', 'Secure#99']

# Positive lookbehind: match X only if preceded by Y
text = "Price $100, Cost $50"
amounts = re.findall(r'(?<=\$)\d+', text)  # Digits preceded by $
print(amounts)  # ['100', '50']
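
The negative forms, (?!...) and (?<!...), assert that something does not follow or precede the match. A minimal sketch on an invented string that pulls out numbers that are neither prices nor part of a code:

```python
import re

text = "Price $100, Cost $50, Qty 3, Code 7A"

# (?<![\$\d]) : not preceded by '$' or by another digit
# (?!\w)      : not followed by a word character
plain_numbers = re.findall(r'(?<![\$\d])\d+(?!\w)', text)
print(plain_numbers)  # ['3']
```

The digit in the lookbehind matters: without it, the engine could start a match in the middle of "100" and return "00".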

Match Objects and Extracting Data

When a pattern matches, you get a Match object with methods for accessing the matched data:

import re

phone = "Contact: (555) 123-4567 ext. 890"
pattern = r'\((?P<area>\d{3})\) (?P<prefix>\d{3})-(?P<line>\d{4})'
match = re.search(pattern, phone)

# group() - get matched text
print(match.group())  # (555) 123-4567
print(match.group(1))  # 555 (first group)
print(match.group('area'))  # 555 (named group)

# groups() - tuple of all groups
print(match.groups())  # ('555', '123', '4567')

# groupdict() - dictionary of named groups
print(match.groupdict())  # {'area': '555', 'prefix': '123', 'line': '4567'}

# span() - start and end positions
print(match.span())  # (9, 23)
print(f"Found at position {match.start()} to {match.end()}")

This is particularly useful for parsing structured data:

import re

log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'

match = re.search(pattern, log_entry)
log_data = match.groupdict()
print(log_data)
# {'date': '2024-01-15', 'time': '14:23:45', 'level': 'ERROR', 
#  'message': 'Database connection failed'}
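
Captured groups are always strings, so a common next step is converting them to richer types. This sketch assumes the same log layout as above and parses the date and time groups into a datetime:

```python
import re
from datetime import datetime

log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"
pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<level>\w+) (?P<message>.*)'
)

match = pattern.match(log_entry)
if match:
    # match['name'] is shorthand for match.group('name')
    stamp = f"{match['date']} {match['time']}"
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    print(when.isoformat())  # 2024-01-15T14:23:45
```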

String Substitution and Splitting

The re module provides powerful text transformation capabilities beyond simple string replacement.

re.sub() with backreferences

import re

# Reformat dates from MM/DD/YYYY to YYYY-MM-DD
dates = "Meeting on 01/15/2024 and 03/22/2024"
reformatted = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print(reformatted)  # Meeting on 2024-01-15 and 2024-03-22

# Using named groups
pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'
reformatted = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', dates)
print(reformatted)
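
Two small variations are often handy here: the count argument limits how many matches are replaced, and subn() also reports how many replacements were made. A brief sketch using the same date string:

```python
import re

dates = "Meeting on 01/15/2024 and 03/22/2024"
pattern = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

# count=1 rewrites only the first match
first_only = pattern.sub(r'\3-\1-\2', dates, count=1)
print(first_only)  # Meeting on 2024-01-15 and 03/22/2024

# subn() returns (new_string, number_of_substitutions)
result, n = pattern.subn(r'\3-\1-\2', dates)
print(result)  # Meeting on 2024-01-15 and 2024-03-22
print(n)       # 2
```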

Function-based replacement

For complex transformations, pass a function to re.sub():

import re

def celsius_to_fahrenheit(match):
    celsius = float(match.group(1))
    fahrenheit = (celsius * 9/5) + 32
    return f"{fahrenheit:.1f}°F"

text = "Temperature: 20°C, Max: 25°C"
converted = re.sub(r'(-?\d+(?:\.\d+)?)°C', celsius_to_fahrenheit, text)  # also accepts negatives and decimals
print(converted)  # Temperature: 68.0°F, Max: 77.0°F

re.split() with multiple delimiters

import re

# Split on multiple delimiters
text = "apple,banana;cherry|date:elderberry"
fruits = re.split(r'[,;|:]', text)
print(fruits)  # ['apple', 'banana', 'cherry', 'date', 'elderberry']

# Split on whitespace but keep delimiters
text = "Hello    world\t\tfrom\nPython"
parts = re.split(r'(\s+)', text)
print(parts)  # ['Hello', '    ', 'world', '\t\t', 'from', '\n', 'Python']
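
re.split() also accepts maxsplit, and it produces empty strings when a delimiter sits at the start or end of the input. Both behaviors are sketched below on made-up strings:

```python
import re

# maxsplit limits the number of splits; the remainder stays intact
entry = "level=ERROR; msg=disk full; retry later"
key_part, rest = re.split(r';\s*', entry, maxsplit=1)
print(key_part)  # level=ERROR
print(rest)      # msg=disk full; retry later

# A delimiter at the start or end produces empty strings
parts = re.split(r',', ",a,b,")
print(parts)                    # ['', 'a', 'b', '']
print([p for p in parts if p])  # ['a', 'b']
```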

Compilation and Performance Optimization

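Compile patterns you use repeatedly. The re module caches recently used patterns, so the module-level functions do not re-parse a pattern on every call, but a compiled object skips the cache lookup in hot loops and gives the pattern a reusable name: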

import re

# Without compilation - re looks the pattern up in its internal cache on each call
for _ in range(1000):
    re.search(r'\d{3}-\d{4}', "Call 555-1234")

# With compilation - compile once and reuse the pattern object
phone_pattern = re.compile(r'\d{3}-\d{4}')
for _ in range(1000):
    phone_pattern.search("Call 555-1234")
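
Whether compiling pays off in a given program is easy to measure rather than guess. Below is a rough timing sketch with timeit; the function names are illustrative, and results depend on the machine and Python version, so no numbers are shown:

```python
import re
import timeit

PHONE = re.compile(r'\d{3}-\d{4}')
text = "Call 555-1234"

def with_module_function():
    # Goes through re's internal pattern cache on every call
    return re.search(r'\d{3}-\d{4}', text)

def with_compiled_pattern():
    # Uses the pattern object directly
    return PHONE.search(text)

for fn in (with_module_function, with_compiled_pattern):
    seconds = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {seconds:.3f}s")
```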

Using flags for behavior modification

import re

text = "Hello WORLD\nPython Regex"

# Case-insensitive matching
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches)  # ['Python']

# MULTILINE: ^ and $ match line boundaries
lines = re.findall(r'^[A-Z].*', text, re.MULTILINE)
print(lines)  # ['Hello WORLD', 'Python Regex']

# DOTALL: . matches newlines too
match = re.search(r'Hello.*Regex', text, re.DOTALL)
print(match.group())  # Hello WORLD\nPython Regex (printed on two lines)

# VERBOSE: write readable patterns with comments
email_pattern = re.compile(r'''
    \b                  # Word boundary
    [\w.%+-]+           # Username part
    @                   # @ symbol
    [\w.-]+             # Domain name
    \.[A-Za-z]{2,}      # Top-level domain
    \b                  # Word boundary
''', re.VERBOSE | re.IGNORECASE)

# Combine flags with bitwise OR
pattern = re.compile(r'^python', re.IGNORECASE | re.MULTILINE)
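
Flags can also be written inside the pattern itself, and re.ASCII is worth knowing because \d, \w, and \s are Unicode-aware by default for str patterns. A short sketch of both follows, with the Arabic-Indic digits chosen purely as a test case:

```python
import re

# (?i) at the start of the pattern is the inline form of re.IGNORECASE
found = re.findall(r'(?i)python', "Python PYTHON python")
print(found)  # ['Python', 'PYTHON', 'python']

# For str patterns, \d matches any Unicode decimal digit by default
arabic_indic = "\u0663\u0664"  # Arabic-Indic digits three and four
unicode_ok = re.fullmatch(r'\d+', arabic_indic) is not None
ascii_ok = re.fullmatch(r'\d+', arabic_indic, re.ASCII) is not None
print(unicode_ok)  # True
print(ascii_ok)    # False
```

If input should contain only 0-9, either pass re.ASCII or spell the class out as [0-9].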

Real-World Examples and Best Practices

Email validation

import re

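# Compile once at import time rather than on every call
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def validate_email(email):
    # fullmatch() requires the whole string to match, including any trailing newline
    return EMAIL_PATTERN.fullmatch(email) is not None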

emails = ["user@example.com", "invalid@", "test@domain.co.uk"]
for email in emails:
    print(f"{email}: {validate_email(email)}")
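
One detail worth checking in any anchored validator: $ also matches just before a trailing newline, so match() with a ^...$ pattern accepts input that fullmatch() rejects. A small sketch with the same simplified email pattern:

```python
import re

pattern = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
candidate = "user@example.com\n"  # note the trailing newline

via_match = pattern.match(candidate) is not None
via_fullmatch = pattern.fullmatch(candidate) is not None

print(via_match)      # True: $ matches before the final newline
print(via_fullmatch)  # False: the newline is left over
```

Keep in mind that a pattern like this checks only the rough shape of an address, not whether it exists or conforms to the full email grammar.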

Parsing log files

import re

log_pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)

logs = """
2024-01-15 10:23:45 [ERROR] Connection timeout
2024-01-15 10:24:12 [INFO] User logged in
2024-01-15 10:25:03 [WARNING] High memory usage
"""

for line in logs.strip().split('\n'):
    match = log_pattern.match(line)
    if match:
        data = match.groupdict()
        if data['level'] == 'ERROR':
            print(f"Error at {data['timestamp']}: {data['message']}")
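
The same pattern extends naturally to aggregation. The sketch below counts entries per level and tracks lines that fail to parse; the sample log lines, including the malformed one, are invented for the demonstration:

```python
import re
from collections import Counter

log_pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)

logs = """\
2024-01-15 10:23:45 [ERROR] Connection timeout
2024-01-15 10:24:12 [INFO] User logged in
not a log line
2024-01-15 10:25:03 [ERROR] Disk full
"""

counts = Counter()
skipped = 0
for line in logs.splitlines():
    match = log_pattern.match(line)
    if match is None:
        skipped += 1  # keep track of lines the pattern could not parse
        continue
    counts[match['level']] += 1

print(dict(counts))  # {'ERROR': 2, 'INFO': 1}
print(skipped)       # 1
```

Counting unparsed lines instead of silently dropping them makes it easier to notice when the log format drifts away from the pattern.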

Sanitizing user input

import re

def sanitize_filename(filename):
    # Remove or replace invalid characters
    sanitized = re.sub(r'[<>:"/\\|?*]', '_', filename)
    # Remove leading/trailing dots and spaces
    sanitized = re.sub(r'^[.\s]+|[.\s]+$', '', sanitized)
    return sanitized

filenames = ["report.txt", "file<name>.doc", "..hidden.txt"]
for name in filenames:
    print(f"{name} -> {sanitize_filename(name)}")
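
A related case is user input that ends up inside a pattern. re.escape() turns metacharacters into literals so that the text is searched for as written. The helper name count_occurrences below is hypothetical:

```python
import re

def count_occurrences(needle, haystack):
    """Count literal, case-insensitive occurrences of needle in haystack.

    Hypothetical helper; re.escape() neutralizes regex metacharacters in needle.
    """
    return len(re.findall(re.escape(needle), haystack, re.IGNORECASE))

text = "Fee: $5.00 (was $5x00)"
print(re.escape("$5.00"))                # \$5\.00
print(count_occurrences("$5.00", text))  # 1

# Unescaped, '$' acts as an end-of-string anchor and '.' matches any character
print(len(re.findall("$5.00", text)))    # 0
```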

Common pitfalls

Greedy vs. non-greedy quantifiers can dramatically affect results:

import re

html = "<div>Content</div><div>More</div>"

# Greedy - matches as much as possible
greedy = re.findall(r'<div>.*</div>', html)
print(greedy)  # ['<div>Content</div><div>More</div>']

# Non-greedy - matches as little as possible
non_greedy = re.findall(r'<div>.*?</div>', html)
print(non_greedy)  # ['<div>Content</div>', '<div>More</div>']
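
When the content cannot contain the delimiter, a negated character class is an alternative to the non-greedy quantifier and states the intent more directly. A minimal sketch on the same string:

```python
import re

html = "<div>Content</div><div>More</div>"

# [^<]* cannot run past the next '<', so it stops at the first closing tag
contents = re.findall(r'<div>([^<]*)</div>', html)
print(contents)  # ['Content', 'More']
```

This works for flat snippets like the one above. For nested or real-world HTML, a parser such as the standard library's html.parser is usually the more robust choice.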

Avoid catastrophic backtracking with nested quantifiers:

import re

# BAD - can cause exponential backtracking
# pattern = r'(a+)+b'

# GOOD - more specific, faster
pattern = r'a+b'

# On Python 3.11+, re also supports atomic groups (?>...) and possessive
# quantifiers (a++); on older versions, simplify the pattern to avoid nested repetition
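
Before swapping a nested pattern for a simpler one, it helps to confirm the two accept the same inputs. The sketch below checks both patterns exhaustively on short strings; inputs are deliberately kept short, since the nested form slows down sharply on long runs of 'a' that are not followed by 'b':

```python
import re
from itertools import product

nested = re.compile(r'(a+)+b')
simple = re.compile(r'a+b')

# Compare both patterns on every string over {'a', 'b'} up to length 6
checked = 0
for length in range(7):
    for chars in product('ab', repeat=length):
        s = ''.join(chars)
        assert bool(nested.fullmatch(s)) == bool(simple.fullmatch(s))
        checked += 1

print(f"patterns agree on {checked} inputs")  # patterns agree on 127 inputs
```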

The re module is indispensable for text processing in Python. Start with simple patterns, compile frequently-used ones, use named groups for clarity, and always test your patterns with edge cases. When in doubt, prefer readability over cleverness—your future self will thank you.