Python Regular Expressions: re Module Complete Guide
Key Insights
- The re module provides powerful pattern matching beyond simple string methods, essential for validation, parsing, and complex text extraction, but use plain string methods when they suffice to avoid unnecessary complexity.
- Understanding the distinction between search(), match(), and fullmatch() prevents common bugs: search() finds patterns anywhere, match() only at the start, and fullmatch() requires the entire string to match.
- Compile frequently used patterns with re.compile() for better performance and readability, and leverage named groups (?P<name>...) to make data extraction self-documenting.
Introduction to Regular Expressions in Python
Regular expressions (regex) are pattern-matching tools for text processing. Python’s re module provides a complete implementation for searching, matching, and manipulating strings based on patterns. While string methods like str.find() or str.startswith() work well for simple cases, regex excels at complex pattern matching, validation, and extraction.
Use regex when you need to:
- Validate formats (emails, phone numbers, URLs)
- Extract structured data from unstructured text
- Search for patterns with variations (case-insensitive, optional components)
- Split or replace text based on complex rules
Here’s when string methods are sufficient versus when regex is necessary:
import re
text = "Contact: john@example.com"
# String method - simple and fast for exact matches
if "@" in text:
    print("Contains email")
# Regex - necessary for checking the overall format
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
if re.search(email_pattern, text):
    print("Contains valid email format")
The basic re.search() function scans through a string looking for the first location where the pattern matches:
import re
text = "The price is $49.99"
match = re.search(r'\$\d+\.\d{2}', text)
if match:
    print(f"Found price: {match.group()}")  # Output: Found price: $49.99
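Because re.search() returns None when nothing matches, calling .group() on the result without a check can raise AttributeError. A minimal sketch of a guard follows; the helper name find_price is hypothetical and used only for illustration.

```python
import re

PRICE_PATTERN = r'\$\d+\.\d{2}'

def find_price(text):
    """Return the first price-like substring, or None (hypothetical helper)."""
    match = re.search(PRICE_PATTERN, text)
    # re.search() returns None on failure, so check before calling .group()
    return match.group() if match else None

print(find_price("The price is $49.99"))  # $49.99
print(find_price("No price listed"))      # None
```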
Core Pattern Matching Functions
The re module provides several functions for different matching scenarios. Understanding their differences prevents common bugs.
search() vs. match() vs. fullmatch()
import re
pattern = r'\d{3}'
text = "Call 555-1234 for info"
# search() - finds pattern anywhere in string
print(re.search(pattern, text).group()) # Output: 555
# match() - only matches at start of string
print(re.match(pattern, text)) # Output: None (doesn't start with digits)
# fullmatch() - entire string must match pattern
print(re.fullmatch(pattern, "555")) # Output: <re.Match object; span=(0, 3), match='555'>
print(re.fullmatch(pattern, "555-1234")) # Output: None (extra characters)
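One place this distinction tends to matter is validation. The sketch below, reusing the same three-digit pattern, shows match() accepting input with trailing characters that fullmatch() rejects. The variable names are illustrative.

```python
import re

pattern = r'\d{3}'
candidate = "555-1234"

# match() anchors only at the start, so the trailing "-1234" is ignored
starts_ok = re.match(pattern, candidate) is not None
# fullmatch() requires the whole string to match the pattern
whole_ok = re.fullmatch(pattern, candidate) is not None
# search() with explicit \A and \Z anchors gives the same answer as fullmatch()
anchored_ok = re.search(r'\A\d{3}\Z', candidate) is not None

print(starts_ok)    # True
print(whole_ok)     # False
print(anchored_ok)  # False
```

For "does this whole string have the right shape" checks, fullmatch() is usually the safer default.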
findall() for extracting all matches
findall() returns a list of all non-overlapping matches, perfect for extraction tasks:
import re
text = """
Contact us at support@company.com or sales@company.com.
For urgent matters, reach admin@company.org.
"""
emails = re.findall(r'\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b', text)
print(emails)
# Output: ['support@company.com', 'sales@company.com', 'admin@company.org']
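One behavior worth knowing: when the pattern contains capturing groups, findall() returns the group contents rather than the full match. A short sketch with a simplified email pattern (the sample text is illustrative):

```python
import re

text = "support@company.com, admin@company.org"

# No groups: findall() returns the full matches
full = re.findall(r'[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}', text)
print(full)   # ['support@company.com', 'admin@company.org']

# Two groups: findall() returns (user, domain) tuples instead
pairs = re.findall(r'([\w.%+-]+)@([\w.-]+\.[A-Za-z]{2,})', text)
print(pairs)  # [('support', 'company.com'), ('admin', 'company.org')]
```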
finditer() for memory efficiency
When processing large texts, finditer() returns an iterator of Match objects instead of storing all matches in memory:
import re
log = "ERROR at 10:23:45, WARNING at 10:24:12, ERROR at 10:25:03"
pattern = r'(ERROR|WARNING) at (\d{2}:\d{2}:\d{2})'
for match in re.finditer(pattern, log):
    level, timestamp = match.groups()
    print(f"{level}: {timestamp}")
# Output:
# ERROR: 10:23:45
# WARNING: 10:24:12
# ERROR: 10:25:03
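Since finditer() is lazy, matches are produced only as the iterator is consumed. The following sketch pulls the first match with next() and then counts the rest. It reuses the log string from the example above.

```python
import re

log = "ERROR at 10:23:45, WARNING at 10:24:12, ERROR at 10:25:03"
pattern = r'(ERROR|WARNING) at (\d{2}:\d{2}:\d{2})'

matches = re.finditer(pattern, log)

# Only the first match is computed at this point
first = next(matches)
print(first.group(1), first.span())  # ERROR (0, 17)

# The remaining matches are produced on demand as the iterator is consumed
remaining = sum(1 for _ in matches)
print(remaining)  # 2
```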
Pattern Syntax and Special Characters
Regex patterns use special characters to define matching rules. Master these fundamentals:
Character classes and predefined sets
import re
text = "User123 logged in at 2024-01-15"
# \d matches digits, \w matches word characters (letters, digits, underscore)
# \s matches whitespace
user_id = re.search(r'User\d+', text).group() # User123
date = re.search(r'\d{4}-\d{2}-\d{2}', text).group() # 2024-01-15
# Custom character class
code = "Product: AB-123-XY"
product_code = re.search(r'[A-Z]{2}-\d{3}-[A-Z]{2}', code).group()
print(product_code) # AB-123-XY
Quantifiers for repetition
import re
# * (zero or more), + (one or more), ? (zero or one), {n,m} (between n and m)
patterns = {
    r'colou?r': "color or colour",  # ? makes 'u' optional
    r'\d{3,5}': "3 to 5 digits",
    r'ha+': "ha, haa, haaa, etc.",
    r'(ab)*': "empty, ab, abab, etc.",
}
text = "colour color haaaa 12345"
for pattern, description in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
# Note: (ab)* can match the empty string, so findall() returns an empty
# string at every position here; the text contains no "ab".
Capturing groups and named groups
Groups extract specific parts of matches:
import re
# Standard capturing groups
phone = "Call (555) 123-4567"
match = re.search(r'\((\d{3})\) (\d{3})-(\d{4})', phone)
area, prefix, line = match.groups()
print(f"Area: {area}, Prefix: {prefix}, Line: {line}")
# Named groups for clarity
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
date = "2024-01-15"
match = re.search(pattern, date)
print(match.groupdict()) # {'year': '2024', 'month': '01', 'day': '15'}
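When parentheses are needed only for grouping, not extraction, a non-capturing group (?:...) keeps the result tuple free of unwanted entries. A small sketch, assuming dates may use either - or / as the separator:

```python
import re

text = "2024-01-15 and 2023/12/31"

# (?:-|/) groups the separator alternatives without creating a capture group,
# so each result tuple holds only year, month, and day
pattern = r'(\d{4})(?:-|/)(\d{2})(?:-|/)(\d{2})'
dates = re.findall(pattern, text)
print(dates)  # [('2024', '01', '15'), ('2023', '12', '31')]
```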
Lookahead and lookbehind assertions
These match positions without consuming characters:
import re
# Positive lookahead: match X only if followed by Y
text = "Price: $100, Discount: 20%"
percents = re.findall(r'\d+(?=%)', text)  # Digits followed by %
print(percents)  # ['20']
# Lookahead can also assert a condition on the rest of the string
passwords = ["Pass123!", "weak", "Secure#99"]
# Match passwords of 6+ characters with at least one digit (lookahead)
valid = [p for p in passwords if re.search(r'^(?=.*\d).{6,}$', p)]
print(valid) # ['Pass123!', 'Secure#99']
# Positive lookbehind: match X only if preceded by Y
text = "Price $100, Cost $50"
amounts = re.findall(r'(?<=\$)\d+', text) # Digits preceded by $
print(amounts) # ['100', '50']
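Both assertions also have negative forms, (?!...) and (?<!...), which succeed when the sub-pattern does not match at that position. The sketch below uses illustrative sample data. Note that the lookbehind excludes digits as well as $, otherwise the tail of a number such as 100 would still match.

```python
import re

text = "Price $100, Cost $50, Quantity 30"

# Negative lookbehind: digits not preceded by '$' or by another digit.
# Excluding \d matters: (?<!\$)\d+ alone would match the "00" in "$100".
plain_numbers = re.findall(r'(?<![$\d])\d+', text)
print(plain_numbers)  # ['30']

# Negative lookahead: keep names that do not end in ".tmp"
names = ["a.txt", "b.tmp", "c.py"]
kept = [n for n in names if re.search(r'^(?!.*\.tmp$)', n)]
print(kept)  # ['a.txt', 'c.py']
```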
Match Objects and Extracting Data
When a pattern matches, you get a Match object with methods for accessing the matched data:
import re
phone = "Contact: (555) 123-4567 ext. 890"
pattern = r'\((?P<area>\d{3})\) (?P<prefix>\d{3})-(?P<line>\d{4})'
match = re.search(pattern, phone)
# group() - get matched text
print(match.group()) # (555) 123-4567
print(match.group(1)) # 555 (first group)
print(match.group('area')) # 555 (named group)
# groups() - tuple of all groups
print(match.groups()) # ('555', '123', '4567')
# groupdict() - dictionary of named groups
print(match.groupdict()) # {'area': '555', 'prefix': '123', 'line': '4567'}
# span() - start and end positions
print(match.span()) # (9, 23)
print(f"Found at position {match.start()} to {match.end()}")
This is particularly useful for parsing structured data:
import re
log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'
match = re.search(pattern, log_entry)
log_data = match.groupdict()
print(log_data)
# {'date': '2024-01-15', 'time': '14:23:45', 'level': 'ERROR',
# 'message': 'Database connection failed'}
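The values in groupdict() are always strings, so a typical next step is converting them to richer types. A minimal sketch, assuming the timestamp format shown in the log entry above:

```python
import re
from datetime import datetime

log_entry = "2024-01-15 14:23:45 ERROR Database connection failed"
pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<level>\w+) (?P<message>.*)'
)

match = pattern.search(log_entry)
if match:
    fields = match.groupdict()
    # Combine the date and time strings into a single datetime object
    timestamp = datetime.strptime(
        f"{fields['date']} {fields['time']}", "%Y-%m-%d %H:%M:%S"
    )
    print(timestamp.isoformat())  # 2024-01-15T14:23:45
    print(fields['level'])        # ERROR
```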
String Substitution and Splitting
The re module provides powerful text transformation capabilities beyond simple string replacement.
re.sub() with backreferences
import re
# Reformat dates from MM/DD/YYYY to YYYY-MM-DD
dates = "Meeting on 01/15/2024 and 03/22/2024"
reformatted = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print(reformatted) # Meeting on 2024-01-15 and 2024-03-22
# Using named groups
pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'
reformatted = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', dates)
print(reformatted)
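Two related options are sometimes useful here: the count argument limits how many replacements re.sub() makes, and re.subn() also reports how many were made. A short sketch using the same date string:

```python
import re

dates = "Meeting on 01/15/2024 and 03/22/2024"
pattern = r'(\d{2})/(\d{2})/(\d{4})'

# count=1 limits the substitution to the first match
first_only = re.sub(pattern, r'\3-\1-\2', dates, count=1)
print(first_only)  # Meeting on 2024-01-15 and 03/22/2024

# subn() returns the new string together with the number of replacements
result, n = re.subn(pattern, r'\3-\1-\2', dates)
print(result)  # Meeting on 2024-01-15 and 2024-03-22
print(n)       # 2
```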
Function-based replacement
For complex transformations, pass a function to re.sub():
import re
def celsius_to_fahrenheit(match):
    celsius = float(match.group(1))
    fahrenheit = (celsius * 9/5) + 32
    return f"{fahrenheit:.1f}°F"
text = "Temperature: 20°C, Max: 25°C"
converted = re.sub(r'(\d+)°C', celsius_to_fahrenheit, text)
print(converted) # Temperature: 68.0°F, Max: 77.0°F
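The pattern (\d+)°C captures only whole, unsigned numbers, so a value like 36.6°C would be converted incorrectly. One possible extension, assuming inputs may carry a minus sign or a decimal part (the name temp_pattern is illustrative):

```python
import re

def celsius_to_fahrenheit(match):
    celsius = float(match.group(1))
    return f"{celsius * 9 / 5 + 32:.1f}°F"

# Optional minus sign, digits, and an optional fractional part
temp_pattern = r'(-?\d+(?:\.\d+)?)°C'

text = "Low: -5°C, Body: 36.6°C"
converted = re.sub(temp_pattern, celsius_to_fahrenheit, text)
print(converted)  # Low: 23.0°F, Body: 97.9°F
```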
re.split() with multiple delimiters
import re
# Split on multiple delimiters
text = "apple,banana;cherry|date:elderberry"
fruits = re.split(r'[,;|:]', text)
print(fruits) # ['apple', 'banana', 'cherry', 'date', 'elderberry']
# Split on whitespace but keep delimiters
text = "Hello world\t\tfrom\nPython"
parts = re.split(r'(\s+)', text)
print(parts) # ['Hello', ' ', 'world', '\t\t', 'from', '\n', 'Python']
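re.split() also accepts a maxsplit argument, which helps when only the first delimiter is meaningful. A brief sketch with an illustrative key/value line:

```python
import re

line = "key = value = with = equals"

# maxsplit=1 stops after the first delimiter, leaving the rest intact
key, value = re.split(r'\s*=\s*', line, maxsplit=1)
print(key)    # key
print(value)  # value = with = equals
```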
Compilation and Performance Optimization
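Compile patterns you use repeatedly. The module-level functions cache recently compiled patterns, so the speedup is usually modest, but a compiled pattern skips the cache lookup and gives the pattern a reusable name:
import re
# Without compilation - each call looks the pattern up in re's internal cache
for _ in range(1000):
    re.search(r'\d{3}-\d{4}', "Call 555-1234")
# With compilation - compiled once, then reused directly
phone_pattern = re.compile(r'\d{3}-\d{4}')
for _ in range(1000):
    phone_pattern.search("Call 555-1234")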
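A rough way to see the difference on a given machine is to time both forms with timeit. This is only a sketch: the numbers vary by system and Python version, so no expected output is shown.

```python
import re
import timeit

text = "Call 555-1234"
phone_pattern = re.compile(r'\d{3}-\d{4}')

# Module-level function: resolves the pattern string on every call
module_time = timeit.timeit(
    lambda: re.search(r'\d{3}-\d{4}', text), number=100_000
)
# Precompiled pattern: calls the pattern object's method directly
compiled_time = timeit.timeit(
    lambda: phone_pattern.search(text), number=100_000
)

print(f"module-level: {module_time:.3f}s, compiled: {compiled_time:.3f}s")
```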
Using flags for behavior modification
import re
text = "Hello WORLD\nPython Regex"
# Case-insensitive matching
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches) # ['Python']
# MULTILINE: ^ and $ match line boundaries
lines = re.findall(r'^[A-Z].*', text, re.MULTILINE)
print(lines) # ['Hello WORLD', 'Python Regex']
# DOTALL: . matches newlines too
match = re.search(r'Hello.*Regex', text, re.DOTALL)
print(match.group()) # Hello WORLD\nPython Regex
# VERBOSE: write readable patterns with comments
email_pattern = re.compile(r'''
    \b                # Word boundary
    [\w.%+-]+         # Username part
    @                 # @ symbol
    [\w.-]+           # Domain name
    \.[A-Za-z]{2,}    # Top-level domain
    \b                # Word boundary
''', re.VERBOSE | re.IGNORECASE)
# Combine flags with bitwise OR
pattern = re.compile(r'^python', re.IGNORECASE | re.MULTILINE)
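Flags can also be written inside the pattern itself, which is handy when the pattern is stored as a plain string. A short sketch of the inline forms, assuming Python 3.6+ for the scoped (?i:...) variant:

```python
import re

text = "Hello WORLD\nPython Regex"

# (?i) at the start of the pattern is equivalent to passing re.IGNORECASE
ci = re.findall(r'(?i)python', text)
print(ci)  # ['Python']

# Several inline flags can be combined: i = IGNORECASE, m = MULTILINE
line_start = re.findall(r'(?im)^p\w+', text)
print(line_start)  # ['Python']

# (?i:...) applies the flag only to the enclosed part of the pattern
scoped = re.findall(r'(?i:hello) WORLD', text)
print(scoped)  # ['Hello WORLD']
```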
Real-World Examples and Best Practices
Email validation
import re
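# Compile once at module level so the pattern is not rebuilt on every call
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
def validate_email(email):
    # fullmatch() requires the whole string to match; match() with a
    # trailing $ would also accept an address followed by a newline
    return EMAIL_PATTERN.fullmatch(email) is not None
emails = ["user@example.com", "invalid@", "test@domain.co.uk"]
for email in emails:
    print(f"{email}: {validate_email(email)}")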
Parsing log files
import re
log_pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)
logs = """
2024-01-15 10:23:45 [ERROR] Connection timeout
2024-01-15 10:24:12 [INFO] User logged in
2024-01-15 10:25:03 [WARNING] High memory usage
"""
for line in logs.strip().split('\n'):
    match = log_pattern.match(line)
    if match:
        data = match.groupdict()
        if data['level'] == 'ERROR':
            print(f"Error at {data['timestamp']}: {data['message']}")
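The same kind of pattern can feed a quick summary of log levels. The sketch below uses finditer() with re.MULTILINE and collections.Counter. The fourth log line is made-up sample data added for illustration.

```python
import re
from collections import Counter

logs = """
2024-01-15 10:23:45 [ERROR] Connection timeout
2024-01-15 10:24:12 [INFO] User logged in
2024-01-15 10:25:03 [WARNING] High memory usage
2024-01-15 10:26:40 [ERROR] Disk quota exceeded
"""

# With MULTILINE, ^ matches at the start of every line
level_pattern = re.compile(
    r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(?P<level>\w+)\]',
    re.MULTILINE,
)

level_counts = Counter(m.group('level') for m in level_pattern.finditer(logs))
print(level_counts.most_common(1))  # [('ERROR', 2)]
```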
Sanitizing user input
import re
def sanitize_filename(filename):
    # Remove or replace invalid characters
    sanitized = re.sub(r'[<>:"/\\|?*]', '_', filename)
    # Remove leading/trailing dots and spaces
    sanitized = re.sub(r'^[.\s]+|[.\s]+$', '', sanitized)
    return sanitized
filenames = ["report.txt", "file<name>.doc", "..hidden.txt"]
for name in filenames:
    print(f"{name} -> {sanitize_filename(name)}")
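A related concern is user input that ends up inside a pattern. re.escape() turns metacharacters into literals so the text is matched verbatim. A minimal sketch follows; count_occurrences is a hypothetical helper, not part of the re module.

```python
import re

def count_occurrences(needle, haystack):
    """Count literal occurrences of user-supplied text (hypothetical helper)."""
    # re.escape() neutralizes metacharacters such as '.', '+', and '('
    return len(re.findall(re.escape(needle), haystack))

text = "1+1=2 and 11=2? a.b vs axb"
print(count_occurrences("1+1", text))  # 1
print(count_occurrences("a.b", text))  # 1

# Without escaping, '.' matches any character, so 'axb' is counted too
unescaped = len(re.findall("a.b", text))
print(unescaped)  # 2
```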
Common pitfalls
Greedy vs. non-greedy quantifiers can dramatically affect results:
import re
html = "<div>Content</div><div>More</div>"
# Greedy - matches as much as possible
greedy = re.findall(r'<div>.*</div>', html)
print(greedy) # ['<div>Content</div><div>More</div>']
# Non-greedy - matches as little as possible
non_greedy = re.findall(r'<div>.*?</div>', html)
print(non_greedy) # ['<div>Content</div>', '<div>More</div>']
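An alternative to the non-greedy quantifier is a negated character class, which says explicitly what the content may not contain. The sketch below assumes the div contents hold no nested tags. For real HTML, a dedicated parser is generally a better fit than regex.

```python
import re

html = "<div>Content</div><div>More</div>"

# [^<]* cannot run past the next '<', so each match stops at its own closing tag
items = re.findall(r'<div>([^<]*)</div>', html)
print(items)  # ['Content', 'More']
```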
Avoid catastrophic backtracking with nested quantifiers:
import re
# BAD - can cause exponential backtracking
# pattern = r'(a+)+b'
# GOOD - more specific, faster
pattern = r'a+b'
# For complex patterns, use atomic groups (?>...) or possessive quantifiers
# such as a++ (both require Python 3.11+), or simplify the pattern to avoid
# nested repetition
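As a sanity check when flattening a nested quantifier, it can help to confirm on a few short inputs that the rewritten pattern still accepts the same strings. The sketch below does this for the two patterns above. The inputs are kept deliberately small, since the nested form can become very slow on long runs of a with no trailing b.

```python
import re

nested = re.compile(r'(a+)+b')  # nested quantifiers
flat = re.compile(r'a+b')       # matches the same strings without nesting

samples = ["ab", "aaab", "b", "aaa", "xaab"]
agreement = [bool(nested.search(s)) == bool(flat.search(s)) for s in samples]
print(all(agreement))  # True

# The flat pattern fails quickly even on a longer non-matching input
print(flat.search("a" * 30))  # None
```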
The re module is indispensable for text processing in Python. Start with simple patterns, compile frequently-used ones, use named groups for clarity, and always test your patterns with edge cases. When in doubt, prefer readability over cleverness—your future self will thank you.