Python - Regex Replace (re.sub) | Application Architect

Key Insights

re.sub() performs pattern-based string replacement with support for backreferences, groups, and replacement functions for complex transformations
Understanding capture groups, lookaheads, and replacement patterns enables powerful text manipulation beyond simple string replace operations
Performance considerations matter: compile regex patterns for repeated use and choose the right flags for case sensitivity and multiline matching

Basic Pattern Replacement

The re.sub() function replaces every non-overlapping occurrence of a pattern in a string and returns a new string. The signature is re.sub(pattern, repl, string, count=0, flags=0), where repl is the replacement string or function.

import re

text = "The price is $100 and the discount is $20"
result = re.sub(r'\$\d+', '$XX', text)
print(result)
# Output: The price is $XX and the discount is $XX

# Limit replacements with count parameter
result = re.sub(r'\$\d+', '$XX', text, count=1)
print(result)
# Output: The price is $XX and the discount is $20

The pattern \$\d+ matches a dollar sign (escaped because $ is a special regex character) followed by one or more digits. Without the count parameter, all matches are replaced.
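
When the text to match comes from outside the program, escaping each special character by hand is error-prone. The sketch below assumes a hypothetical search_term variable and uses re.escape() to turn it into a literal pattern before it is passed to re.sub().

```python
import re

# Hypothetical user-supplied search term containing regex metacharacters.
search_term = "$5.00 (approx.)"
text = "Fee: $5.00 (approx.) per item, not $5x00"

# re.escape() backslash-escapes metacharacters so the term matches literally.
literal_pattern = re.escape(search_term)
result = re.sub(literal_pattern, "[FEE]", text)
print(result)
# Output: Fee: [FEE] per item, not $5x00
```

Without the escaping step, the unescaped "." would also match the "x" in "$5x00" and "$" would act as an end-of-string anchor.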

Capture Groups and Backreferences

Capture groups let you extract parts of the matched pattern and reuse them in the replacement string using backreferences (\1, \2, etc.).

import re

# Swap first and last names
names = "John Smith, Jane Doe, Bob Johnson"
result = re.sub(r'(\w+) (\w+)', r'\2, \1', names)
print(result)
# Output: Smith, John, Doe, Jane, Johnson, Bob

# Format phone numbers
phones = "Contact: 5551234567 or 5559876543"
result = re.sub(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', phones)
print(result)
# Output: Contact: (555) 123-4567 or (555) 987-6543

# Extract and reformat dates
dates = "Meeting on 2024-03-15 and 2024-12-25"
result = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', dates)
print(result)
# Output: Meeting on 03/15/2024 and 12/25/2024

Parentheses create capture groups. \1 references the first group, \2 the second, and so on. This enables complex transformations while preserving parts of the original match.
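
One edge case worth knowing: a numeric backreference followed directly by a digit is ambiguous. The short sketch below (the sample text is made up for illustration) shows the \g<1> form, which marks where the group number ends.

```python
import re

text = "Item 5 and item 12"

# r'\10' is read as a reference to group 10, which does not exist here,
# so re raises an error instead of appending a zero.
try:
    re.sub(r'(\d+)', r'\10', text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

# \g<1> ends the group reference explicitly, so a literal 0 can follow it.
result = re.sub(r'(\d+)', r'\g<1>0', text)
print(result)
# Output: Item 50 and item 120
```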

Replacement Functions

For complex logic, pass a function as the replacement parameter. The function receives a match object and returns the replacement string.

import re

# Convert numbers to their doubled value
def double_number(match):
    number = int(match.group(0))
    return str(number * 2)

text = "I have 5 apples and 10 oranges"
result = re.sub(r'\d+', double_number, text)
print(result)
# Output: I have 10 apples and 20 oranges

# Capitalize words selectively
def capitalize_tech_terms(match):
    word = match.group(1)
    tech_terms = {'python', 'javascript', 'sql', 'api'}
    return word.upper() if word.lower() in tech_terms else word

text = "Learn python and javascript for api development"
result = re.sub(r'\b(\w+)\b', capitalize_tech_terms, text)
print(result)
# Output: Learn PYTHON and JAVASCRIPT for API development

# Format currency with proper separators
def format_currency(match):
    amount = match.group(1)
    return f"${int(amount):,}"

text = "Revenue: $1000000 and expenses: $250000"
result = re.sub(r'\$(\d+)', format_currency, text)
print(result)
# Output: Revenue: $1,000,000 and expenses: $250,000

The match object provides methods like group(), groups(), start(), and end() for accessing matched content and positions.
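
As a small illustration of the position methods, the sketch below uses a hypothetical annotate() helper that appends each match's start and end offsets. The offsets refer to the original string, not to the partially rewritten result.

```python
import re

def annotate(match):
    # start() and end() give the match's span in the original input string.
    return f"{match.group(0)}@{match.start()}-{match.end()}"

text = "ab 12 cd 345"
result = re.sub(r'\d+', annotate, text)
print(result)
# Output: ab 12@3-5 cd 345@9-12
```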

Named Groups

Named groups improve readability and make replacement patterns self-documenting.

import re

# Extract and reformat email addresses
email = "Contact john.doe@example.com or jane_smith@company.org"
pattern = r'(?P<user>[\w.]+)@(?P<domain>[\w.]+)'
result = re.sub(pattern, r'\g<user> [at] \g<domain>', email)
print(result)
# Output: Contact john.doe [at] example.com or jane_smith [at] company.org

# Parse and reformat log entries
log = "[2024-03-15 14:30:22] ERROR: Database connection failed"
pattern = r'\[(?P<date>\S+) (?P<time>\S+)\] (?P<level>\w+): (?P<message>.+)'
result = re.sub(pattern, r'[\g<level>] \g<date> - \g<message>', log)
print(result)
# Output: [ERROR] 2024-03-15 - Database connection failed

# Using named groups in replacement functions
def format_log(match):
    return f"{match.group('level'):8} | {match.group('date')} | {match.group('message')}"

result = re.sub(pattern, format_log, log)
print(result)
# Output: ERROR    | 2024-03-15 | Database connection failed

Named groups use (?P<name>pattern) syntax and are referenced with \g<name> in replacement strings or match.group('name') in functions.
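
A named group can also be referenced inside the pattern itself with (?P=name). The sketch below uses that to collapse accidentally doubled words; the group name and sample sentence are illustrative only.

```python
import re

text = "the the cat sat sat down"

# (?P=word) matches the same text that the group named 'word' captured.
pattern = r'\b(?P<word>\w+)\s+(?P=word)\b'
result = re.sub(pattern, r'\g<word>', text)
print(result)
# Output: the cat sat down
```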

Flags for Advanced Matching

Flags modify regex behavior for case sensitivity, multiline matching, and more.

import re

# Case-insensitive replacement
text = "Python is great. PYTHON is awesome. python rocks!"
result = re.sub(r'python', 'JavaScript', text, flags=re.IGNORECASE)
print(result)
# Output: JavaScript is great. JavaScript is awesome. JavaScript rocks!

# Multiline mode: ^ and $ match line boundaries
content = """Title: Introduction
Author: John
Title: Chapter 1
Author: Jane"""

result = re.sub(r'^Title: ', 'Section: ', content, flags=re.MULTILINE)
print(result)
# Output: Section: Introduction
# Author: John
# Section: Chapter 1
# Author: Jane

# Dot matches newlines with DOTALL
html = "<div>\n  <p>Content</p>\n</div>"
result = re.sub(r'<div>.*?</div>', '[REMOVED]', html, flags=re.DOTALL)
print(result)
# Output: [REMOVED]

# Combine multiple flags
text = "ERROR: Failed\nWARNING: Issue\nerror: Problem"
result = re.sub(r'^error:', 'CRITICAL:', text, flags=re.MULTILINE | re.IGNORECASE)
print(result)
# Output: CRITICAL: Failed
# WARNING: Issue
# CRITICAL: Problem

Common flags: re.IGNORECASE (case-insensitive), re.MULTILINE (^ and $ match line boundaries), re.DOTALL (. matches newlines), re.VERBOSE (ignores whitespace in patterns and allows comments).
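
re.VERBOSE is the one flag in this list that the examples above do not demonstrate. A minimal sketch, reusing the phone-number format from earlier with illustrative group names:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts a comment.
phone_pattern = re.compile(r"""
    (?P<area>\d{3})     # area code
    (?P<prefix>\d{3})   # exchange prefix
    (?P<line>\d{4})     # line number
""", re.VERBOSE)

result = phone_pattern.sub(r'(\g<area>) \g<prefix>-\g<line>', "Contact: 5551234567")
print(result)
# Output: Contact: (555) 123-4567
```

To match a literal space or '#' in verbose mode, escape it or place it inside a character class.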

Compiled Patterns for Performance

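Compile a pattern once when you reuse it many times. The re module caches recently used patterns, so re.sub() does not recompile on every call, but a compiled pattern object skips the cache lookup and keeps the pattern definition in one place.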

import re
import time

text = "The quick brown fox jumps over the lazy dog" * 1000

# Module-level function: looks the pattern up in re's internal cache on every call
start = time.perf_counter()
for _ in range(1000):
    result = re.sub(r'\b\w{4}\b', 'XXXX', text)
print(f"Without compilation: {time.perf_counter() - start:.4f}s")

# Precompiled pattern: skips the cache lookup on each call
pattern = re.compile(r'\b\w{4}\b')
start = time.perf_counter()
for _ in range(1000):
    result = pattern.sub('XXXX', text)
print(f"With compilation: {time.perf_counter() - start:.4f}s")

# Practical example: sanitizing user input
sanitizer = re.compile(r'[<>"\']')

def clean_input(user_data):
    return sanitizer.sub('', user_data)

inputs = ["<script>alert('xss')</script>", "Normal text", "Quote: \"hello\""]
cleaned = [clean_input(inp) for inp in inputs]
print(cleaned)
# Output: ['scriptalert(xss)/script', 'Normal text', 'Quote: hello']

Lookahead and Lookbehind Assertions

Lookaround assertions check what comes before or after a match without consuming it, so the asserted text is not part of the match and is not replaced.

import re

# Positive lookahead: match digits followed by 'px'
css = "width: 100px; height: 200px; margin: 50px"
result = re.sub(r'\d+(?=px)', lambda m: str(int(m.group()) * 2), css)
print(result)
# Output: width: 200px; height: 400px; margin: 100px

# Negative lookahead: match numbers NOT followed by '%'
# (\d in the lookahead stops backtracking to a partial number such as the "5" in "50%")
text = "Discount: 20 on items priced at 50%"
result = re.sub(r'\d+(?![\d%])', 'XX', text)
print(result)
# Output: Discount: XX on items priced at 50%

# Positive lookbehind: match digits preceded by '$'
# (format to two decimals; plain str() would expose float rounding error)
prices = "Item costs $100 and quantity is 5"
result = re.sub(r'(?<=\$)\d+', lambda m: f"{int(m.group()) * 1.1:.2f}", prices)
print(result)
# Output: Item costs $110.00 and quantity is 5

# Negative lookbehind: match words NOT preceded by 'not '
text = "This is good. This is not good. That is great."
result = re.sub(r'(?<!not )\b(good|great)\b', 'excellent', text)
print(result)
# Output: This is excellent. This is not good. That is excellent.

Lookahead: (?=...) (positive), (?!...) (negative). Lookbehind: (?<=...) (positive), (?<!...) (negative). In Python's re module, a lookbehind pattern must match a fixed number of characters.
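
The fixed-width restriction is easy to run into. The sketch below, with a made-up sample string, shows the error raised by a variable-width lookbehind and one possible workaround: capture the prefix and put it back in the replacement.

```python
import re

# A lookbehind containing \s* has no fixed width, so compilation fails.
try:
    re.compile(r'(?<=\$\s*)\d+')
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

# Workaround: capture the variable-width prefix and re-emit it with \1.
text = "Cost: $ 100 or $200"
result = re.sub(r'(\$\s*)\d+', r'\1###', text)
print(result)
# Output: Cost: $ ### or $###
```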

Real-World Applications

import re

# Sanitize SQL-like strings (basic example)
def sanitize_sql_input(query):
    # Replace dangerous keywords with a placeholder
    pattern = re.compile(r'\b(DROP|DELETE|TRUNCATE|ALTER)\b', re.IGNORECASE)
    return pattern.sub('[BLOCKED]', query)

query = "SELECT * FROM users; DROP TABLE users;"
print(sanitize_sql_input(query))
# Output: SELECT * FROM users; [BLOCKED] TABLE users;

# Redact sensitive information
def redact_pii(text):
    # Redact SSN patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', 'XXX-XX-XXXX', text)
    # Redact email addresses
    text = re.sub(r'\b[\w.]+@[\w.]+\b', '[EMAIL REDACTED]', text)
    # Redact credit card numbers
    text = re.sub(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b', 
                  'XXXX-XXXX-XXXX-XXXX', text)
    return text

data = "SSN: 123-45-6789, Email: user@example.com, Card: 1234-5678-9012-3456"
print(redact_pii(data))
# Output: SSN: XXX-XX-XXXX, Email: [EMAIL REDACTED], Card: XXXX-XXXX-XXXX-XXXX

# Convert markdown-style links to HTML
def markdown_to_html_links(text):
    pattern = r'\[([^\]]+)\]\(([^\)]+)\)'
    return re.sub(pattern, r'<a href="\2">\1</a>', text)

markdown = "Check [Python docs](https://python.org) and [GitHub](https://github.com)"
print(markdown_to_html_links(markdown))
# Output: Check <a href="https://python.org">Python docs</a> and <a href="https://github.com">GitHub</a>
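
# Replace several different terms in one pass

Another pattern that comes up often is replacing many different terms in a single pass. The sketch below combines an alternation pattern with a replacement function that looks each match up in a dictionary. The abbreviation table and the expand() helper are hypothetical.

```python
import re

# Hypothetical abbreviation table for illustration.
replacements = {
    "approx.": "approximately",
    "e.g.": "for example",
    "vs.": "versus",
}

# Escape each key (the '.' is a metacharacter) and try longer keys first,
# so a longer key is preferred over a shorter one that shares its prefix.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(key) for key in keys))

def expand(match):
    # The matched text is always one of the dictionary keys.
    return replacements[match.group(0)]

text = "Use tabs vs. spaces, e.g. approx. 4 per level"
result = pattern.sub(expand, text)
print(result)
# Output: Use tabs versus spaces, for example approximately 4 per level
```

A single pass also avoids the problem of chained re.sub() calls rewriting text that an earlier call produced.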

The re.sub() function handles everything from simple text replacement to complex transformations using capture groups, replacement functions, and assertions. Compile patterns for performance in loops, use named groups for clarity, and leverage lookahead/lookbehind for precise matching without consumption.