PySpark - SQL LIKE Pattern Matching
Key Insights
- PySpark’s LIKE operator works identically to SQL’s, using % for any sequence of characters and _ for a single character, and can be accessed through both SQL strings and DataFrame API methods
- For complex patterns beyond basic wildcards, use rlike() with regular expressions, though this comes with performance trade-offs that matter at scale
- Simple string methods like startswith() and contains() often outperform LIKE for basic patterns and should be your first choice when wildcards aren’t needed
Introduction to LIKE Pattern Matching in PySpark
Pattern matching is fundamental to data filtering and cleaning in big data workflows. Whether you’re analyzing server logs, validating customer records, or categorizing products, you need efficient ways to find strings that match specific patterns. PySpark provides the LIKE operator, borrowed directly from SQL, to handle these scenarios at scale.
The LIKE operator excels at simple pattern matching where you need flexibility beyond exact matches but don’t require the full power of regular expressions. In distributed computing environments, understanding how to use LIKE effectively—and when to choose alternatives—directly impacts query performance and cluster resource utilization.
Let’s start with a realistic dataset to work with throughout this article:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, upper
spark = SparkSession.builder.appName("LikePatterns").getOrCreate()
data = [
("John Smith", "john.smith@gmail.com", "PRD-A001", "New York"),
("Alice Johnson", "alice.j@yahoo.com", "PRD-B102", "San Francisco"),
("Bob Anderson", "bob@company.com", "SRV-C001", "New York"),
("Carol White", "carol.white@gmail.com", "PRD-A205", "Chicago"),
("David Brown", "dbrown@outlook.com", "PRD-D301", "New York"),
("Eve Martinez", "eve.m@gmail.com", "SRV-A001", "Los Angeles"),
("Frank Wilson", "frank@company.com", "PRD-B999", "Chicago"),
(None, "noemail@test.com", "PRD-NULL", "Boston")
]
df = spark.createDataFrame(data, ["name", "email", "product_code", "city"])
df.createOrReplaceTempView("customers")
Basic LIKE Patterns with Wildcards
The LIKE operator uses two wildcards: % matches any sequence of characters (including zero characters), and _ matches exactly one character. These simple tools cover most pattern matching needs.
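To make the wildcard semantics precise, here is a small pure-Python sketch (outside Spark, so it assumes nothing about your cluster) that translates a LIKE pattern into the equivalent regular expression — % becomes .* and _ becomes a single dot:

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent Python regex.

    % -> .* (any sequence, including the empty one)
    _ -> .  (exactly one character)
    Every other character is escaped so it matches literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# % matches zero or more characters
print(bool(re.fullmatch(like_to_regex("A%"), "A")))   # True: % can match nothing
# _ must match exactly one character
print(bool(re.fullmatch(like_to_regex("A_"), "A")))   # False: _ needs one character
print(bool(re.fullmatch(like_to_regex("A_"), "AB")))  # True
```

The key asymmetry to remember: 'A%' matches the bare string "A", but 'A_' does not.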
Finding all customers whose names start with ‘A’:
# Using DataFrame API
result = df.filter(col("name").like("A%"))
result.show()
# Output:
# +-------------+------------------+------------+-------------+
# | name| email|product_code| city|
# +-------------+------------------+------------+-------------+
# |Alice Johnson|alice.j@yahoo.com| PRD-B102|San Francisco|
# +-------------+------------------+------------+-------------+
Filtering for Gmail users:
gmail_users = df.filter(col("email").like("%@gmail.com"))
gmail_users.show()
# Output shows John Smith, Carol White, and Eve Martinez
The single character wildcard _ is useful for structured codes with specific formats:
# Find product codes with pattern PRD-X001 (any single letter, then 001)
specific_pattern = df.filter(col("product_code").like("PRD-_001"))
specific_pattern.show()
# Matches only PRD-A001: the _ fills the single letter slot,
# and SRV-A001 fails because of the PRD- prefix
You can combine wildcards for more complex patterns:
# Product codes starting with PRD, any character, then ending in 01
pattern = df.filter(col("product_code").like("PRD-__01"))
pattern.show()
# Matches PRD-A001 and PRD-D301 (any two characters between PRD- and 01)
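Multi-wildcard patterns are easy to misread, so it can help to check them against the sample values outside Spark. Here the two _ wildcards translate to two regex dots, and the check confirms which product codes from the dataset actually match:

```python
import re

# LIKE "PRD-__01" is equivalent to the regex "PRD-..01" (each _ becomes .)
codes = ["PRD-A001", "PRD-B102", "SRV-C001", "PRD-A205",
         "PRD-D301", "SRV-A001", "PRD-B999", "PRD-NULL"]
matches = [c for c in codes if re.fullmatch(r"PRD-..01", c)]
print(matches)  # ['PRD-A001', 'PRD-D301']
```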
Using LIKE with PySpark SQL and DataFrame API
PySpark offers two approaches to pattern matching: SQL strings and DataFrame API methods. Choose based on your team’s preferences and existing codebase style.
SQL approach using spark.sql():
# Simple SQL LIKE query
sql_result = spark.sql("""
SELECT name, email, city
FROM customers
WHERE email LIKE '%@gmail.com'
""")
sql_result.show()
# Multiple conditions with AND
sql_multiple = spark.sql("""
SELECT name, product_code, city
FROM customers
WHERE city LIKE 'New%'
AND product_code LIKE 'PRD%'
""")
sql_multiple.show()
DataFrame API equivalent:
# Single condition
df_result = df.filter(col("email").like("%@gmail.com")) \
.select("name", "email", "city")
# Multiple conditions - more readable with line breaks
df_multiple = df.filter(
(col("city").like("New%")) &
(col("product_code").like("PRD%"))
).select("name", "product_code", "city")
# OR conditions
df_or = df.filter(
(col("email").like("%@gmail.com")) |
(col("email").like("%@yahoo.com"))
)
df_or.show()
The DataFrame API is generally more maintainable for complex logic, while SQL strings work well for analysts familiar with SQL syntax.
RLIKE and Regular Expressions for Advanced Patterns
When LIKE’s wildcards aren’t sufficient, use rlike() for full regular expression support. This is essential for validating formats, extracting structured data, or matching complex patterns.
Validating product code formats (three letters, a dash, one letter, then three digits, matching codes like PRD-A001):
# Strict format validation
valid_codes = df.filter(col("product_code").rlike("^[A-Z]{3}-[A-Z][0-9]{3}$"))
valid_codes.show()
Email validation with regex:
# Basic email pattern (simplified)
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
valid_emails = df.filter(col("email").rlike(email_pattern))
valid_emails.show()
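Before deploying a regex in rlike(), it is worth exercising it against a few positive and negative samples. This sketch uses Python's re module (Spark's rlike uses Java regex, but the two agree for this pattern's syntax); the sample strings below are illustrative, not from the dataset:

```python
import re

# Same simplified email pattern used with rlike() above
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

samples = ["john.smith@gmail.com", "not-an-email", "x@y"]
for s in samples:
    print(s, bool(re.match(email_pattern, s)))
# john.smith@gmail.com True
# not-an-email False
# x@y False  (no dot-separated TLD of 2+ letters)
```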
Finding emails from specific domain patterns:
# Match common free email providers
free_email_pattern = r"@(gmail|yahoo|outlook|hotmail)\.com$"
free_emails = df.filter(col("email").rlike(free_email_pattern))
free_emails.show()
Extracting specific patterns:
from pyspark.sql.functions import regexp_extract
# Extract product category (first three letters before dash)
df_with_category = df.withColumn(
"category",
regexp_extract(col("product_code"), "^([A-Z]{3})-", 1)
)
df_with_category.select("product_code", "category").show()
Performance Considerations and Best Practices
Performance matters when processing terabytes of data. Spark has no traditional indexes, but pattern filters still differ in cost: exact matches and prefix patterns can benefit from predicate pushdown and file-level statistics in formats like Parquet, while leading wildcards force a full scan of the column.
Avoid leading wildcards when possible. Patterns like LIKE '%value' must examine every row. If you search for suffixes frequently, consider storing reversed strings in a separate column and matching on the prefix instead.
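The reversed-string trick works because a suffix match on the original string is a prefix match on the reversed one. A minimal sketch of the idea in plain Python (in Spark you would precompute the reversed column once, e.g. with pyspark.sql.functions.reverse, and filter it with a prefix pattern):

```python
def endswith_via_reverse(value: str, suffix: str) -> bool:
    """Suffix test expressed as a prefix test on reversed strings.

    Storing value[::-1] in a separate column lets a prefix-style
    pattern (suffix reversed, then %) replace a leading-wildcard scan.
    """
    return value[::-1].startswith(suffix[::-1])

print(endswith_via_reverse("john.smith@gmail.com", "@gmail.com"))  # True
print(endswith_via_reverse("bob@company.com", "@gmail.com"))       # False
```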
Use simpler methods for simple patterns:
from pyspark.sql.functions import col
# Instead of LIKE for simple prefix matching
# SLOWER:
df.filter(col("email").like("john%"))
# FASTER:
df.filter(col("email").startswith("john"))
# Instead of LIKE for substring matching
# SLOWER:
df.filter(col("email").like("%@gmail.com%"))
# FASTER:
df.filter(col("email").contains("@gmail.com"))
Leverage partition pruning:
# If data is partitioned by date or category
# Filter on partition columns first
optimized = df.filter(col("partition_date") == "2024-01-01") \
.filter(col("email").like("%@gmail.com"))
# This reduces data scanned before pattern matching
Combine filters strategically:
# Put selective filters first (Catalyst may reorder predicates anyway,
# but the ordering still documents intent)
result = (
    df.filter(col("city") == "New York")          # Highly selective
      .filter(col("email").like("%@gmail.com"))   # Less selective
)
Common Pitfalls and Case Sensitivity
LIKE matching in PySpark is case-sensitive by default. This catches many developers off guard.
Handle case sensitivity explicitly:
# Case-insensitive search
case_insensitive = df.filter(lower(col("name")).like("john%"))
case_insensitive.show()
# Alternative: convert both pattern and column
# Alternative: normalize the column and write the pattern in the same case
df.filter(upper(col("email")).like("%@GMAIL.COM")).show()
Escaping special characters:
When you need to search for literal % or _ characters, you must escape them:
# Searching a hypothetical "notes" column for a literal underscore
# This treats _ as a wildcard, so it also matches "discountXcode":
# df.filter(col("notes").like("%discount_code%"))
# Escape the underscore so it matches only a literal "_":
df.filter(col("notes").like("%discount\\_code%")).show()
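To see why the escape matters, compare the two patterns in plain regex terms: an unescaped _ behaves like a regex dot, while the escaped \_ matches only a literal underscore. The notes strings below are hypothetical, not part of the sample dataset:

```python
import re

# LIKE "%discount_code%"  -> regex ".*discount.code.*" (_ is a wildcard)
# LIKE "%discount\_code%" -> regex ".*discount_code.*" (literal underscore)
notes = ["use discount_code at checkout", "use discountXcode at checkout"]

for n in notes:
    unescaped = bool(re.fullmatch(r".*discount.code.*", n))
    escaped = bool(re.fullmatch(r".*discount_code.*", n))
    print(n, unescaped, escaped)
# use discount_code at checkout True True
# use discountXcode at checkout True False
```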
NULL handling:
LIKE on a NULL value evaluates to NULL rather than false, and filter() drops NULL results, so NULL rows silently disappear:
# This excludes NULL names
df.filter(col("name").like("A%")).show()
# Explicitly handle NULLs if needed
from pyspark.sql.functions import when
df.withColumn(
"matches_pattern",
when(col("name").isNull(), False)
.when(col("name").like("A%"), True)
.otherwise(False)
).show()
Practical Use Cases
Data quality validation:
# Find invalid email formats in customer database
invalid_emails = df.filter(
~col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
)
# Flag product codes not matching expected format
invalid_products = df.filter(
~col("product_code").rlike("^(PRD|SRV)-[A-Z][0-9]{3}$")
)
Customer segmentation:
# Identify corporate vs personal email addresses
corporate_customers = df.filter(
~col("email").rlike(r"@(gmail|yahoo|hotmail|outlook)\.com$")
)
personal_customers = df.filter(
col("email").rlike(r"@(gmail|yahoo|hotmail|outlook)\.com$")
)
Complete ETL pipeline example:
def clean_customer_data(input_df):
"""
Filter and clean customer records based on pattern matching rules
"""
cleaned = input_df.filter(
# Valid email format
col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
).filter(
# Valid product code format
col("product_code").rlike("^(PRD|SRV)-[A-Z][0-9]{3}$")
).filter(
# Name is not null and contains at least one letter
col("name").isNotNull() & col("name").rlike("[a-zA-Z]")
).filter(
# Exclude test/dummy emails
~lower(col("email")).like("%test%") &
~lower(col("email")).like("%dummy%")
)
return cleaned
# Apply cleaning pipeline
clean_df = clean_customer_data(df)
clean_df.show()
Pattern matching with LIKE is a fundamental skill for PySpark developers. Master the basics, understand when to use alternatives, and always consider performance implications at scale. Your data pipelines will be faster and more maintainable.