PySpark - Filter Rows Using like and rlike

Key Insights

  • Use like() for simple SQL-style wildcards (% for any characters, _ for single character) when you need basic pattern matching—it’s more readable and performs better for straightforward cases.
  • Use rlike() for complex pattern matching with full regex support, such as validating email formats, extracting specific patterns, or matching multiple alternatives in a single expression.
  • For simple prefix/suffix matching, prefer startswith() and endswith() methods over like() as they’re more explicit and can be optimized better by Spark’s Catalyst optimizer.

Introduction

Pattern matching is a fundamental operation when working with DataFrames in PySpark. Whether you’re cleaning data, validating formats, or filtering records based on text patterns, you’ll frequently need to match strings against patterns. PySpark provides two primary methods for pattern-based filtering: like() and rlike().

The like() method uses SQL-style wildcards—the same pattern matching you’d use in a SQL WHERE clause. It’s simple, readable, and perfect for straightforward pattern matching. The rlike() method (short for “regex like”) leverages the full power of regular expressions, enabling complex pattern matching that would be impossible or extremely cumbersome with SQL wildcards alone.

Choosing between these methods isn’t just about capability—it’s about readability, maintainability, and performance. This article provides practical guidance on when and how to use each method effectively.

Using like() for SQL-Style Pattern Matching

The like() method supports two wildcard characters: % (matches zero or more characters) and _ (matches exactly one character). This simplicity makes it ideal for basic pattern matching where you don’t need the complexity of regular expressions.

Here’s how to filter rows where a column starts with specific text:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PatternMatching").getOrCreate()

# Sample data
data = [
    ("John Doe", "john.doe@gmail.com"),
    ("Jane Smith", "jane.smith@yahoo.com"),
    ("John Adams", "jadams@gmail.com"),
    ("Bob Johnson", "bob@company.com")
]

df = spark.createDataFrame(data, ["name", "email"])

# Filter names starting with "John"
john_df = df.filter(df.name.like("John%"))
john_df.show()
# +----------+------------------+
# |      name|             email|
# +----------+------------------+
# |  John Doe|john.doe@gmail.com|
# |John Adams|  jadams@gmail.com|
# +----------+------------------+

To filter rows where a column ends with specific text:

# Filter Gmail addresses
gmail_df = df.filter(df.email.like("%@gmail.com"))
gmail_df.show()
# +----------+------------------+
# |      name|             email|
# +----------+------------------+
# |  John Doe|john.doe@gmail.com|
# |John Adams|  jadams@gmail.com|
# +----------+------------------+

For matching text anywhere within a string, place % on both sides:

# Sample product data
products = [
    ("Dell Laptop Pro",),
    ("HP Desktop Computer",),
    ("Apple MacBook Laptop",),
    ("Samsung Tablet",)
]

product_df = spark.createDataFrame(products, ["product"])

# Filter products containing "Laptop" (like() is case-sensitive)
laptop_df = product_df.filter(product_df.product.like("%Laptop%"))
laptop_df.show()
# +--------------------+
# |             product|
# +--------------------+
# |     Dell Laptop Pro|
# |Apple MacBook Laptop|
# +--------------------+

The underscore wildcard matches exactly one character, useful for fixed-length patterns:

# Sample code data
codes = [
    ("ABC",),
    ("A1C",),
    ("A2C",),
    ("ABCD",),
    ("XYZ",)
]

code_df = spark.createDataFrame(codes, ["code"])

# Match codes like "A_C" (A, any single character, C)
filtered_codes = code_df.filter(code_df.code.like("A_C"))
filtered_codes.show()
# +----+
# |code|
# +----+
# | A1C|
# | A2C|
# +----+

Using rlike() for Regular Expression Matching

When you need more sophisticated pattern matching, rlike() provides full regular expression support. This is essential for validating formats, matching complex patterns, or using character classes.

Here’s how to match alphanumeric patterns with specific formats:

# Sample order IDs
orders = [
    ("ABC1234",),
    ("XYZ5678",),
    ("AB123",),
    ("ABCD12345",),
    ("xyz9999",)
]

order_df = spark.createDataFrame(orders, ["order_id"])

# Match pattern: exactly 3 uppercase letters followed by 4 digits
valid_orders = order_df.filter(order_df.order_id.rlike("^[A-Z]{3}[0-9]{4}$"))
valid_orders.show()
# +--------+
# |order_id|
# +--------+
# | ABC1234|
# | XYZ5678|
# +--------+

Email validation is a common use case for regex patterns:

# Sample email data with various formats
emails = [
    ("user@example.com",),
    ("invalid.email",),
    ("user.name+tag@example.co.uk",),
    ("@example.com",),
    ("user@domain",)
]

email_df = spark.createDataFrame(emails, ["email"])

# Validate email format with regex
valid_emails = email_df.filter(
    email_df.email.rlike("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
)
valid_emails.show(truncate=False)
# +---------------------------+
# |email                      |
# +---------------------------+
# |user@example.com           |
# |user.name+tag@example.co.uk|
# +---------------------------+

Note the double backslashes (\\) in the regex pattern—this is necessary because the string is interpreted by both Python and the regex engine.

For matching multiple phone number formats:

# Sample phone numbers in different formats
phones = [
    ("555-123-4567",),
    ("(555)123-4567",),
    ("555.123.4567",),
    ("5551234567",),
    ("123-456",)
]

phone_df = spark.createDataFrame(phones, ["phone"])

# Match two common formats
valid_phones = phone_df.filter(
    phone_df.phone.rlike("^\\d{3}-\\d{3}-\\d{4}$|^\\(\\d{3}\\)\\d{3}-\\d{4}$")
)
valid_phones.show()
# +-------------+
# |        phone|
# +-------------+
# | 555-123-4567|
# |(555)123-4567|
# +-------------+

For case-insensitive matching, use the (?i) flag at the start of your regex:

# Case-insensitive product search
product_df.filter(product_df.product.rlike("(?i)laptop")).show()
# +--------------------+
# |             product|
# +--------------------+
# |     Dell Laptop Pro|
# |Apple MacBook Laptop|
# +--------------------+

Negating Patterns with NOT

Sometimes you need to filter out rows that match a pattern. Use the ~ operator to negate the condition:

# Exclude Gmail addresses
non_gmail = df.filter(~df.email.like("%@gmail.com"))
non_gmail.show()
# +-----------+--------------------+
# |       name|               email|
# +-----------+--------------------+
# | Jane Smith|jane.smith@yahoo.com|
# |Bob Johnson|     bob@company.com|
# +-----------+--------------------+

# Exclude specific patterns with regex
non_john = df.filter(~df.name.rlike("^John"))
non_john.show()
# +-----------+--------------------+
# |       name|               email|
# +-----------+--------------------+
# | Jane Smith|jane.smith@yahoo.com|
# |Bob Johnson|     bob@company.com|
# +-----------+--------------------+

The same negation works when referencing the column through col():

from pyspark.sql.functions import col

df.filter(~col("email").like("%@gmail.com")).show()

Performance Considerations and Best Practices

Performance matters when filtering large datasets. The like() method generally performs better than rlike() for simple patterns because it doesn’t invoke the regex engine. However, for simple prefix or suffix matching, PySpark provides even more optimized methods.

Here’s a performance comparison approach:

from pyspark.sql.functions import col
import time

# Create a larger dataset for benchmarking
large_df = spark.range(0, 1000000).selectExpr("concat('user', id, '@example.com') as email")

# Method 1: Using like()
start = time.time()
large_df.filter(col("email").like("user1%")).count()
like_time = time.time() - start

# Method 2: Using rlike()
start = time.time()
large_df.filter(col("email").rlike("^user1")).count()
rlike_time = time.time() - start

# Method 3: Using startswith()
start = time.time()
large_df.filter(col("email").startswith("user1")).count()
startswith_time = time.time() - start

print(f"like(): {like_time:.3f}s")
print(f"rlike(): {rlike_time:.3f}s")
print(f"startswith(): {startswith_time:.3f}s")

Best practices for choosing your method:

  • Use startswith() or endswith() for simple prefix/suffix matching—they’re explicit and optimized
  • Use like() when you need SQL-style wildcards for moderate complexity
  • Use rlike() only when you genuinely need regex features like character classes, alternation, or complex validation
  • Avoid SQL wildcards in rlike() patterns: % and _ are ordinary literal characters in a regex. If you're thinking in SQL-style patterns, stick with like()

Combining Multiple Pattern Filters

Real-world filtering often requires combining multiple conditions. Use & for AND logic and | for OR logic:

# Sample customer data
customers = [
    ("John Doe", "john@gmail.com", "Premium"),
    ("Jane Smith", "jane@yahoo.com", "Basic"),
    ("Bob Johnson", "bob@gmail.com", "Basic"),
    ("Alice Williams", "alice@gmail.com", "Premium")
]

customer_df = spark.createDataFrame(customers, ["name", "email", "plan"])

# Find Premium customers with Gmail addresses
premium_gmail = customer_df.filter(
    (customer_df.email.like("%@gmail.com")) & 
    (customer_df.plan == "Premium")
)
premium_gmail.show()
# +--------------+---------------+-------+
# |          name|          email|   plan|
# +--------------+---------------+-------+
# |      John Doe| john@gmail.com|Premium|
# |Alice Williams|alice@gmail.com|Premium|
# +--------------+---------------+-------+

# Find customers whose names start with 'J' OR have Yahoo emails
j_or_yahoo = customer_df.filter(
    (customer_df.name.like("J%")) | 
    (customer_df.email.like("%@yahoo.com"))
)
j_or_yahoo.show()

You can mix like() and rlike() in the same query:

# Gmail users whose full name is two capitalized words
mixed_filter = customer_df.filter(
    (customer_df.email.like("%@gmail.com")) &
    (customer_df.name.rlike("^[A-Z][a-z]+\\s[A-Z][a-z]+$"))
)
mixed_filter.show()

Conclusion

Choosing between like() and rlike() comes down to pattern complexity and performance requirements. Use like() as your default for simple SQL-style patterns—it’s readable, familiar to SQL developers, and performs well. Reach for rlike() when you need regex features like character classes, quantifiers, or complex validation logic.

For the simplest cases—prefix and suffix matching—prefer startswith() and endswith(). They communicate intent clearly and allow Spark’s optimizer to work more effectively.

Remember that both methods are case-sensitive by default. For case-insensitive matching, use the (?i) flag with rlike() or convert columns to lowercase with lower() before using like().

Beyond like() and rlike(), explore related PySpark string functions like contains(), substr(), and regexp_extract() for additional text manipulation capabilities. The key is matching the tool to the task—simple patterns deserve simple solutions, while complex patterns justify the power of regular expressions.
