PySpark - SQL LIKE Pattern Matching
Key Insights
- PySpark’s LIKE operator works identically to SQL’s, using % for any sequence of characters and _ for a single character, and can be accessed through both SQL strings and DataFrame API methods
- For complex patterns beyond basic wildcards, use rlike() with regular expressions, though this comes with performance trade-offs that matter at scale
- Simple string methods like startswith() and contains() often outperform LIKE for basic patterns and should be your first choice when wildcards aren’t needed
Introduction to LIKE Pattern Matching in PySpark
Pattern matching is fundamental to data filtering and cleaning in big data workflows. Whether you’re analyzing server logs, validating customer records, or categorizing products, you need efficient ways to find strings that match specific patterns. PySpark provides the LIKE operator, borrowed directly from SQL, to handle these scenarios at scale.
The LIKE operator excels at simple pattern matching where you need flexibility beyond exact matches but don’t require the full power of regular expressions. In distributed computing environments, understanding how to use LIKE effectively—and when to choose alternatives—directly impacts query performance and cluster resource utilization.
Let’s start with a realistic dataset to work with throughout this article:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, upper
spark = SparkSession.builder.appName("LikePatterns").getOrCreate()
data = [
("John Smith", "john.smith@gmail.com", "PRD-A001", "New York"),
("Alice Johnson", "alice.j@yahoo.com", "PRD-B102", "San Francisco"),
("Bob Anderson", "bob@company.com", "SRV-C001", "New York"),
("Carol White", "carol.white@gmail.com", "PRD-A205", "Chicago"),
("David Brown", "dbrown@outlook.com", "PRD-D301", "New York"),
("Eve Martinez", "eve.m@gmail.com", "SRV-A001", "Los Angeles"),
("Frank Wilson", "frank@company.com", "PRD-B999", "Chicago"),
(None, "noemail@test.com", "PRD-NULL", "Boston")
]
df = spark.createDataFrame(data, ["name", "email", "product_code", "city"])
df.createOrReplaceTempView("customers")
Basic LIKE Patterns with Wildcards
The LIKE operator uses two wildcards: % matches any sequence of characters (including zero characters), and _ matches exactly one character. These simple tools cover most pattern matching needs.
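To make the wildcard semantics precise, here is a small pure-Python sketch (outside Spark, so it assumes nothing about your cluster) that translates a LIKE pattern into the equivalent regular expression — % becomes .* and _ becomes a single dot:

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent Python regex.

    % -> .* (any sequence, including the empty one)
    _ -> .  (exactly one character)
    Every other character is escaped so it matches literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# % matches zero or more characters
print(bool(re.fullmatch(like_to_regex("A%"), "A")))   # True: % can match nothing
# _ must match exactly one character
print(bool(re.fullmatch(like_to_regex("A_"), "A")))   # False: _ needs one character
print(bool(re.fullmatch(like_to_regex("A_"), "AB")))  # True
```

The key asymmetry to remember: 'A%' matches the bare string "A", but 'A_' does not.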
Finding all customers whose names start with ‘A’:
# Using DataFrame API
result = df.filter(col("name").like("A%"))
result.show()
# Output:
# +-------------+------------------+------------+-------------+
# | name| email|product_code| city|
# +-------------+------------------+------------+-------------+
# |Alice Johnson|alice.j@yahoo.com| PRD-B102|San Francisco|
# +-------------+------------------+------------+-------------+
Filtering for Gmail users:
gmail_users = df.filter(col("email").like("%@gmail.com"))
gmail_users.show()
# Output shows John Smith, Carol White, and Eve Martinez
The single character wildcard _ is useful for structured codes with specific formats:
# Find product codes with pattern PRD-X001 (any single letter, then 001)
specific_pattern = df.filter(col("product_code").like("PRD-_001"))
specific_pattern.show()
# Matches only PRD-A001: the _ fills the single letter slot,
# and SRV-A001 fails because of the PRD- prefix
You can combine wildcards for more complex patterns:
# Product codes starting with PRD, any character, then ending in 01
pattern = df.filter(col("product_code").like("PRD-__01"))
pattern.show()
# Matches PRD-A001 and PRD-D301 (any two characters between PRD- and 01)
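Multi-wildcard patterns are easy to misread, so it can help to check them against the sample values outside Spark. Here the two _ wildcards translate to two regex dots, and the check confirms which product codes from the dataset actually match:

```python
import re

# LIKE "PRD-__01" is equivalent to the regex "PRD-..01" (each _ becomes .)
codes = ["PRD-A001", "PRD-B102", "SRV-C001", "PRD-A205",
         "PRD-D301", "SRV-A001", "PRD-B999", "PRD-NULL"]
matches = [c for c in codes if re.fullmatch(r"PRD-..01", c)]
print(matches)  # ['PRD-A001', 'PRD-D301']
```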
Using LIKE with PySpark SQL and DataFrame API
PySpark offers two approaches to pattern matching: SQL strings and DataFrame API methods. Choose based on your team’s preferences and existing codebase style.
SQL approach using spark.sql():
# Simple SQL LIKE query
sql_result = spark.sql("""
SELECT name, email, city
FROM customers
WHERE email LIKE '%@gmail.com'
""")
sql_result.show()
# Multiple conditions with AND
sql_multiple = spark.sql("""
SELECT name, product_code, city
FROM customers
WHERE city LIKE 'New%'
AND product_code LIKE 'PRD%'
""")
sql_multiple.show()
DataFrame API equivalent:
# Single condition
df_result = df.filter(col("email").like("%@gmail.com")) \
.select("name", "email", "city")
# Multiple conditions - more readable with line breaks
df_multiple = df.filter(
(col("city").like("New%")) &
(col("product_code").like("PRD%"))
).select("name", "product_code", "city")
# OR conditions
df_or = df.filter(
(col("email").like("%@gmail.com")) |
(col("email").like("%@yahoo.com"))
)
df_or.show()
The DataFrame API is generally more maintainable for complex logic, while SQL strings work well for analysts familiar with SQL syntax.
RLIKE and Regular Expressions for Advanced Patterns
When LIKE’s wildcards aren’t sufficient, use rlike() for full regular expression support. This is essential for validating formats, extracting structured data, or matching complex patterns.
Validating product code formats (three letters, a dash, one letter, then three digits, matching codes like PRD-A001):
# Strict format validation
valid_codes = df.filter(col("product_code").rlike("^[A-Z]{3}-[A-Z][0-9]{3}$"))
valid_codes.show()
Email validation with regex:
# Basic email pattern (simplified)
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
valid_emails = df.filter(col("email").rlike(email_pattern))
valid_emails.show()
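Before deploying a regex in rlike(), it is worth exercising it against a few positive and negative samples. This sketch uses Python's re module (Spark's rlike uses Java regex, but the two agree for this pattern's syntax); the sample strings below are illustrative, not from the dataset:

```python
import re

# Same simplified email pattern used with rlike() above
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

samples = ["john.smith@gmail.com", "not-an-email", "x@y"]
for s in samples:
    print(s, bool(re.match(email_pattern, s)))
# john.smith@gmail.com True
# not-an-email False
# x@y False  (no dot-separated TLD of 2+ letters)
```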
Finding emails from specific domain patterns:
# Match common free email providers
free_email_pattern = r"@(gmail|yahoo|outlook|hotmail)\.com$"
free_emails = df.filter(col("email").rlike(free_email_pattern))
free_emails.show()
Extracting specific patterns:
from pyspark.sql.functions import regexp_extract
# Extract product category (first three letters before dash)
df_with_category = df.withColumn(
"category",
regexp_extract(col("product_code"), "^([A-Z]{3})-", 1)
)
df_with_category.select("product_code", "category").show()
Performance Considerations and Best Practices
Performance matters when processing terabytes of data. Spark has no traditional indexes, but pattern filters still differ in cost: exact matches and prefix patterns can benefit from predicate pushdown and file-level statistics in formats like Parquet, while leading wildcards force a full scan of the column.
Avoid leading wildcards when possible. Patterns like LIKE '%value' must examine every row. If you search for suffixes frequently, consider storing reversed strings in a separate column and matching on the prefix instead.
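The reversed-string trick works because a suffix match on the original string is a prefix match on the reversed one. A minimal sketch of the idea in plain Python (in Spark you would precompute the reversed column once, e.g. with pyspark.sql.functions.reverse, and filter it with a prefix pattern):

```python
def endswith_via_reverse(value: str, suffix: str) -> bool:
    """Suffix test expressed as a prefix test on reversed strings.

    Storing value[::-1] in a separate column lets a prefix-style
    pattern (suffix reversed, then %) replace a leading-wildcard scan.
    """
    return value[::-1].startswith(suffix[::-1])

print(endswith_via_reverse("john.smith@gmail.com", "@gmail.com"))  # True
print(endswith_via_reverse("bob@company.com", "@gmail.com"))       # False
```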
Use simpler methods for simple patterns:
from pyspark.sql.functions import col
# Instead of LIKE for simple prefix matching
# SLOWER:
df.filter(col("email").like("john%"))
# FASTER:
df.filter(col("email").startswith("john"))
# Instead of LIKE for substring matching
# SLOWER:
df.filter(col("email").like("%@gmail.com%"))
# FASTER:
df.filter(col("email").contains("@gmail.com"))
Leverage partition pruning:
# If data is partitioned by date or category
# Filter on partition columns first
optimized = df.filter(col("partition_date") == "2024-01-01") \
.filter(col("email").like("%@gmail.com"))
# This reduces data scanned before pattern matching
Combine filters strategically:
# Put selective filters first (Catalyst may reorder predicates anyway,
# but the ordering still documents intent)
result = (
    df.filter(col("city") == "New York")          # Highly selective
      .filter(col("email").like("%@gmail.com"))   # Less selective
)
Common Pitfalls and Case Sensitivity
LIKE matching in PySpark is case-sensitive by default. This catches many developers off guard.
Handle case sensitivity explicitly:
# Case-insensitive search
case_insensitive = df.filter(lower(col("name")).like("john%"))
case_insensitive.show()
# Alternative: convert both pattern and column
# Alternative: normalize the column and write the pattern in the same case
df.filter(upper(col("email")).like("%@GMAIL.COM")).show()
Escaping special characters:
When you need to search for literal % or _ characters, you must escape them:
# Searching a hypothetical "notes" column for a literal underscore
# This treats _ as a wildcard, so it also matches "discountXcode":
# df.filter(col("notes").like("%discount_code%"))
# Escape the underscore so it matches only a literal "_":
df.filter(col("notes").like("%discount\\_code%")).show()
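To see why the escape matters, compare the two patterns in plain regex terms: an unescaped _ behaves like a regex dot, while the escaped \_ matches only a literal underscore. The notes strings below are hypothetical, not part of the sample dataset:

```python
import re

# LIKE "%discount_code%"  -> regex ".*discount.code.*" (_ is a wildcard)
# LIKE "%discount\_code%" -> regex ".*discount_code.*" (literal underscore)
notes = ["use discount_code at checkout", "use discountXcode at checkout"]

for n in notes:
    unescaped = bool(re.fullmatch(r".*discount.code.*", n))
    escaped = bool(re.fullmatch(r".*discount_code.*", n))
    print(n, unescaped, escaped)
# use discount_code at checkout True True
# use discountXcode at checkout True False
```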
NULL handling:
LIKE on a NULL value evaluates to NULL rather than false, and filter() drops NULL results, so NULL rows silently disappear:
# This excludes NULL names
df.filter(col("name").like("A%")).show()
# Explicitly handle NULLs if needed
from pyspark.sql.functions import when
df.withColumn(
"matches_pattern",
when(col("name").isNull(), False)
.when(col("name").like("A%"), True)
.otherwise(False)
).show()
Practical Use Cases
Data quality validation:
# Find invalid email formats in customer database
invalid_emails = df.filter(
~col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
)
# Flag product codes not matching expected format
invalid_products = df.filter(
~col("product_code").rlike("^(PRD|SRV)-[A-Z][0-9]{3}$")
)
Customer segmentation:
# Identify corporate vs personal email addresses
corporate_customers = df.filter(
~col("email").rlike(r"@(gmail|yahoo|hotmail|outlook)\.com$")
)
personal_customers = df.filter(
col("email").rlike(r"@(gmail|yahoo|hotmail|outlook)\.com$")
)
Complete ETL pipeline example:
def clean_customer_data(input_df):
"""
Filter and clean customer records based on pattern matching rules
"""
cleaned = input_df.filter(
# Valid email format
col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
).filter(
# Valid product code format
col("product_code").rlike("^(PRD|SRV)-[A-Z][0-9]{3}$")
).filter(
# Name is not null and contains at least one letter
col("name").isNotNull() & col("name").rlike("[a-zA-Z]")
).filter(
# Exclude test/dummy emails
~lower(col("email")).like("%test%") &
~lower(col("email")).like("%dummy%")
)
return cleaned
# Apply cleaning pipeline
clean_df = clean_customer_data(df)
clean_df.show()
Pattern matching with LIKE is a fundamental skill for PySpark developers. Master the basics, understand when to use alternatives, and always consider performance implications at scale. Your data pipelines will be faster and more maintainable.