PySpark - Replace Column Values (regexp_replace)
Key Insights
- regexp_replace is PySpark's most powerful string transformation function, enabling pattern-based replacements that go far beyond simple string substitution; use it for complex data cleaning tasks involving variable patterns
- Chain multiple regexp_replace operations or combine them with other column transformations to handle multi-step data standardization workflows efficiently within a single DataFrame operation
- For simple literal string replacements without patterns, use PySpark's replace() function instead; it's significantly faster and avoids the regex engine overhead
Introduction to regexp_replace in PySpark
Data cleaning is messy. Real-world datasets arrive with inconsistent formatting, unwanted characters, and patterns that vary just enough to make simple string replacement useless. PySpark’s regexp_replace function from pyspark.sql.functions solves this by applying regular expression pattern matching to transform column values across distributed datasets.
Unlike basic string replacement methods that only match exact literals, regexp_replace handles variable patterns. Need to strip all special characters from user input? Standardize phone numbers formatted a dozen different ways? Remove HTML tags from scraped content? regexp_replace handles all of these scenarios.
The function operates on DataFrame columns and returns a new column with replacements applied, making it perfect for ETL pipelines where data standardization is critical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col
spark = SparkSession.builder.appName("RegexReplace").getOrCreate()
# Sample dataset with messy data
data = [
    (1, "Contact: (555) 123-4567", "$1,234.56"),
    (2, "Phone: 555.987.6543", "$987.00"),
    (3, "Call 5551234567", "$12,345.67"),
    (4, "Mobile: +1-555-246-8135", "$5,678.90")
]
df = spark.createDataFrame(data, ["id", "contact_info", "amount"])
df.show(truncate=False)
Basic Syntax and Simple Pattern Replacement
The regexp_replace function takes three arguments: the column to transform, the regex pattern to match, and the replacement string. The basic signature is:
regexp_replace(col, pattern, replacement)
Start with simple patterns before moving to complex regex. Here’s how to replace exact phrases:
# Replace the word "Contact:" with "Phone:"
df_cleaned = df.withColumn(
    "contact_info",
    regexp_replace(col("contact_info"), "Contact:", "Phone:")
)
df_cleaned.show(truncate=False)
For phone number formatting, remove common separators:
# Remove parentheses, dashes, and dots from phone numbers
# (keep the dash at the end of the class so it matches a literal dash)
df_phones = df.withColumn(
    "clean_phone",
    regexp_replace(col("contact_info"), "[().-]", "")
)
df_phones.show(truncate=False)
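The dash's position inside a character class matters: in the middle of the class it defines a range, so keep it first or last (or escape it) to match a literal dash. A quick local check with Python's re module, which shares this class syntax with Spark's Java regex engine, shows the difference:

```python
import re

# Dash at the end of the class is a literal dash
assert re.sub(r"[().-]", "", "(555) 123-4567") == "555 1234567"

# Dash in the middle forms a range: ')' through '.' also covers '*' and '+'
assert re.sub(r"[()-.]", "", "a+b*c") == "abc"
```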
By default, regexp_replace is case-sensitive. To make it case-insensitive, use the (?i) flag at the start of your pattern:
# Case-insensitive replacement
df_case = df.withColumn(
    "contact_info",
    regexp_replace(col("contact_info"), "(?i)phone|contact|call|mobile", "Tel")
)
df_case.show(truncate=False)
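(?i) is one of several inline flags that Spark's Java regex engine shares with Python's re, which makes local prototyping straightforward. Another useful one is (?s), which lets . match newlines:

```python
import re

# (?i) ignores case, as in the Spark example above
assert re.sub(r"(?i)phone|contact|call|mobile", "Tel", "MOBILE: 555") == "Tel: 555"

# (?s) makes . match newlines too (useful for multi-line content)
assert re.sub(r"(?s)<p>.*</p>", "", "<p>line1\nline2</p>") == ""
```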
Advanced Regular Expression Patterns
The real power emerges when you leverage full regex capabilities. Character classes, quantifiers, anchors, and capture groups enable sophisticated transformations.
Remove all non-alphanumeric characters using character class negation:
# Keep only letters and numbers
df_alphanum = df.withColumn(
    "clean_contact",
    regexp_replace(col("contact_info"), "[^a-zA-Z0-9]", "")
)
df_alphanum.show(truncate=False)
Extract and reformat patterns using capture groups. The parentheses in the pattern create groups you can reference in the replacement string with $1, $2, etc:
# Reformat phone numbers from various formats to XXX-XXX-XXXX
df_formatted = df.withColumn(
    "formatted_phone",
    regexp_replace(
        col("contact_info"),
        r".*?(\d{3}).*?(\d{3}).*?(\d{4}).*",
        "$1-$2-$3"
    )
)
df_formatted.show(truncate=False)
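One portability note when testing patterns outside Spark: regexp_replace uses Java's replacement syntax, where backreferences are written $1, $2, while Python's re.sub writes the same groups as \1, \2. The pattern above can be prototyped locally like this:

```python
import re

pattern = r".*?(\d{3}).*?(\d{3}).*?(\d{4}).*"
# Python's re uses \1 in the replacement where Spark's regexp_replace uses $1
assert re.sub(pattern, r"\1-\2-\3", "Contact: (555) 123-4567") == "555-123-4567"
assert re.sub(pattern, r"\1-\2-\3", "Call 5551234567") == "555-123-4567"
```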
Clean and standardize email addresses:
email_data = [
    (1, "user@EXAMPLE.COM"),
    (2, " admin@test.com "),
    (3, "contact+spam@site.org")
]
df_email = spark.createDataFrame(email_data, ["id", "email"])
# Lowercase, trim, and remove plus-addressing
from pyspark.sql.functions import lower, trim

df_email_clean = df_email.withColumn(
    "email",
    regexp_replace(lower(trim(col("email"))), r"\+.*?@", "@")
)
df_email_clean.show()
Multiple Replacements and Chaining
Complex data cleaning often requires multiple transformation steps. Chain regexp_replace calls to apply sequential patterns:
# Multi-step phone number cleaning
df_multi = df.withColumn(
    "phone",
    regexp_replace(col("contact_info"), r"[^\d]", "")  # Remove non-digits
).withColumn(
    "phone",
    regexp_replace(col("phone"), r"^1", "")  # Remove leading 1
).withColumn(
    "phone",
    regexp_replace(col("phone"), r"(\d{3})(\d{3})(\d{4})", "$1-$2-$3")  # Format
)
df_multi.select("contact_info", "phone").show(truncate=False)
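When a cleanup grows beyond a few steps, it can help to keep the (pattern, replacement) pairs in an ordered list. The sketch below prototypes the three-step phone cleanup locally with Python's re and functools.reduce; on Spark, each pair would become one regexp_replace call (with $1-style backreferences instead of \1):

```python
import re
from functools import reduce

# Ordered cleanup rules, mirroring the chained regexp_replace calls above
rules = [
    (r"[^\d]", ""),                           # remove non-digits
    (r"^1", ""),                              # drop a leading country code 1
    (r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3"),  # insert dashes
]

def clean_phone(value):
    # Apply each rule in order, feeding the result of one into the next
    return reduce(lambda s, rule: re.sub(rule[0], rule[1], s), rules, value)

print(clean_phone("+1-555-246-8135"))  # 555-246-8135
```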
Apply different transformations to multiple columns simultaneously:
# Clean multiple columns in one operation
df_all_clean = df.select(
    col("id"),
    regexp_replace(col("contact_info"), r"[^\d]", "").alias("phone_digits"),
    regexp_replace(col("amount"), r"[$,]", "").alias("numeric_amount")
)
df_all_clean.show(truncate=False)
Common Use Cases and Practical Examples
Here are production-ready patterns for frequent data cleaning scenarios.
Standardizing phone numbers across different input formats:
phone_data = [
    (1, "(555) 123-4567"),
    (2, "555.987.6543"),
    (3, "+1 555-246-8135"),
    (4, "5551234567")
]
df_phones = spark.createDataFrame(phone_data, ["id", "raw_phone"])
df_std_phone = df_phones.withColumn(
    "standardized",
    regexp_replace(
        regexp_replace(col("raw_phone"), r"[^\d]", ""),
        r"^1?(\d{3})(\d{3})(\d{4})$",
        "($1) $2-$3"
    )
)
df_std_phone.show(truncate=False)
Cleaning currency values for numeric operations:
# Remove currency symbols and convert to numeric
df_currency = df.withColumn(
    "clean_amount",
    regexp_replace(col("amount"), r"[$,]", "").cast("double")
)
df_currency.show()
Sanitizing user input by removing HTML tags and normalizing whitespace:
html_data = [
    (1, "<p>Hello World</p>"),
    (2, "<div>Test <span>content</span></div>"),
    (3, "Normal text with spaces")
]
df_html = spark.createDataFrame(html_data, ["id", "content"])
df_sanitized = df_html.withColumn(
    "clean_content",
    regexp_replace(col("content"), r"<[^>]+>", "")  # Remove HTML tags
).withColumn(
    "clean_content",
    regexp_replace(col("clean_content"), r"\s+", " ")  # Normalize whitespace
)
df_sanitized.show(truncate=False)
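The same two-step sanitization can be prototyped locally with Python's re. One detail worth noting: collapsing whitespace can still leave a leading or trailing space, which .strip() handles here and Spark's trim() function would handle in the DataFrame version:

```python
import re

def sanitize(html):
    text = re.sub(r"<[^>]+>", "", html)       # strip HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim edges

print(sanitize("<div>Test <span>content</span></div>"))  # Test content
print(sanitize("<p>Hello World</p>"))                    # Hello World
```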
Masking sensitive data for PII compliance:
pii_data = [
    (1, "SSN: 123-45-6789", "card: 4532-1234-5678-9010"),
    (2, "SSN: 987-65-4321", "card: 5555-6666-7777-8888")
]
df_pii = spark.createDataFrame(pii_data, ["id", "ssn_field", "cc_field"])
df_masked = df_pii.withColumn(
    "ssn_field",
    regexp_replace(col("ssn_field"), r"\d{3}-\d{2}-(\d{4})", "XXX-XX-$1")
).withColumn(
    "cc_field",
    regexp_replace(col("cc_field"), r"\d{4}-\d{4}-\d{4}-(\d{4})", "XXXX-XXXX-XXXX-$1")
)
df_masked.show(truncate=False)
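A lookahead-based variant can mask every digit except the last four regardless of how the value is grouped. This is a hypothetical generalization of the fixed-format patterns above, checked here with Python's re (lookaheads behave the same way in Spark's Java regex engine):

```python
import re

def mask_digits(value):
    # Replace each digit that still has at least four digits somewhere after it
    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)

print(mask_digits("card: 4532-1234-5678-9010"))  # card: XXXX-XXXX-XXXX-9010
print(mask_digits("SSN: 123-45-6789"))           # SSN: XXX-XX-6789
```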
Performance Considerations and Best Practices
Regular expressions are powerful but computationally expensive. When you don’t need pattern matching, use simpler alternatives.
For literal substring replacement without patterns, use the replace() column function (available since Spark 3.5). Note that the DataFrame-level df.replace() method is not a substitute here: it matches entire cell values, not substrings.
# Using regexp_replace means escaping the dollar sign, since $ is a regex anchor
df_regex = df.withColumn("amount", regexp_replace(col("amount"), r"\$", ""))
# Faster: the replace() column function skips the regex engine entirely
from pyspark.sql.functions import replace, lit
df_literal = df.withColumn("amount", replace(col("amount"), lit("$"), lit("")))
When processing massive datasets, consider these optimization strategies:
- Filter before transforming: apply regexp_replace only to rows that need it
- Compile patterns once: while PySpark doesn't expose regex compilation directly, avoid complex patterns in tight loops
- Use simpler functions when possible: substring, trim, and lower are faster than their regex equivalents
- Partition appropriately: ensure your data is well-partitioned before applying transformations
# Optimize by filtering first
df_optimized = df.filter(col("contact_info").contains("555")).withColumn(
    "clean_phone",
    regexp_replace(col("contact_info"), r"[^\d]", "")
)
Troubleshooting and Common Pitfalls
Escaping special regex characters is the most common source of errors. Characters like ., *, +, ?, [, ], (, ), {, }, ^, $, |, and \ have special meaning in regex. Escape them with backslashes:
# Wrong: dot matches any character
df.withColumn("test", regexp_replace(col("text"), ".", "X"))
# Correct: escape the dot to match literal period
df.withColumn("test", regexp_replace(col("text"), r"\.", "X"))
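When the literal you want to match is full of special characters, quoting the whole string beats escaping each character by hand: Spark's Java regex engine supports \Q...\E quoting (everything between the two markers is treated literally), and when prototyping locally, Python's re.escape does the equivalent:

```python
import re

literal = "$1,234.56 (approx.)"
# re.escape backslash-escapes every regex metacharacter in the string;
# in regexp_replace you could instead wrap the literal as "\\Q" + literal + "\\E"
pattern = re.escape(literal)
assert re.sub(pattern, "", "price: $1,234.56 (approx.)") == "price: "
```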
Handling null values requires explicit checks. regexp_replace returns null when the input is null:
from pyspark.sql.functions import when, isnull
# Handle nulls before regex operation
df_null_safe = df.withColumn(
    "clean_phone",
    when(isnull(col("contact_info")), "UNKNOWN")
    .otherwise(regexp_replace(col("contact_info"), r"[^\d]", ""))
)
Debug complex patterns by testing incrementally:
# Test your regex pattern on sample data first
test_df = spark.createDataFrame([(1, "test-data-123")], ["id", "text"])
# Test pattern step by step
test_df.withColumn("step1", regexp_replace(col("text"), r"-", " ")).show()
test_df.withColumn("step2", regexp_replace(col("text"), r"\d+", "NUM")).show()
test_df.withColumn("step3", regexp_replace(col("text"), r"(\w+)-(\w+)-(\d+)", "$3-$2-$1")).show()
The regexp_replace function is indispensable for production data pipelines dealing with inconsistent input. Master the regex patterns relevant to your domain, test thoroughly with edge cases, and you’ll handle even the messiest data transformations with confidence.