PySpark - Filter Rows Using contains()

Key Insights

  • PySpark’s contains() method is case-sensitive by default—combine it with lower() or upper() for case-insensitive filtering across large distributed datasets
  • Chain multiple contains() conditions using & (AND), | (OR), and ~ (NOT) operators, but always wrap each condition in parentheses to avoid operator precedence issues
  • contains() returns null when the column value is null, so those rows are silently dropped from filter results; check isNotNull() explicitly when you need predictable counts

Introduction to String Filtering in PySpark

When working with large-scale data processing in PySpark, filtering rows based on substring matches is one of the most common operations you’ll perform. Whether you’re analyzing server logs, filtering customer records, or cleaning product catalogs, you need a reliable way to find rows where a column contains specific text patterns.

The contains() method, available on PySpark’s Column class, provides exactly this capability. Unlike pandas, where you might use str.contains(), PySpark’s contains() is designed to work efficiently across distributed datasets, leveraging Spark’s parallel processing capabilities.

Let’s start with a practical dataset to demonstrate these concepts:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, upper

# Initialize Spark session
spark = SparkSession.builder.appName("ContainsExample").getOrCreate()

# Create sample data
data = [
    (1, "john.doe@gmail.com", "Premium User", "New York"),
    (2, "jane.smith@yahoo.com", "Basic User", "Los Angeles"),
    (3, "bob.wilson@GMAIL.COM", "Premium User", "Chicago"),
    (4, "alice.brown@outlook.com", "Premium User", "Houston"),
    (5, "charlie.davis@gmail.com", None, "Phoenix"),
    (6, None, "Basic User", "Philadelphia")
]

df = spark.createDataFrame(data, ["id", "email", "account_type", "city"])
df.show(truncate=False)

This creates a DataFrame with email addresses, account types, and cities—a realistic scenario where you’d need substring filtering.

Basic contains() Syntax and Usage

The fundamental syntax for contains() is straightforward: col('column_name').contains('substring'). You use this expression with either filter() or where() methods (they’re interchangeable in PySpark).

Here’s how to filter for all Gmail users:

# Find all Gmail users
gmail_users = df.filter(col("email").contains("gmail.com"))
gmail_users.show(truncate=False)

# Alternative syntax using where()
gmail_users_alt = df.where(col("email").contains("gmail.com"))

# You can also use the string column name directly
gmail_users_simple = df.filter(df.email.contains("gmail.com"))

Output:

+---+------------------------+------------+--------+
|id |email                   |account_type|city    |
+---+------------------------+------------+--------+
|1  |john.doe@gmail.com      |Premium User|New York|
|5  |charlie.davis@gmail.com |null        |Phoenix |
+---+------------------------+------------+--------+

Notice that row 3 with “GMAIL.COM” (uppercase) wasn’t included. This brings us to an important characteristic of contains().

Case-Sensitive vs Case-Insensitive Filtering

By default, contains() performs case-sensitive matching. This is often not what you want when dealing with real-world data where inconsistent capitalization is common.

Here’s the case-sensitive behavior:

# Case-sensitive search - misses uppercase variants
case_sensitive = df.filter(col("email").contains("gmail"))
print(f"Case-sensitive count: {case_sensitive.count()}")  # Returns 2

# Case-insensitive search using lower()
case_insensitive = df.filter(lower(col("email")).contains("gmail"))
print(f"Case-insensitive count: {case_insensitive.count()}")  # Returns 3

case_insensitive.show(truncate=False)

Output:

+---+------------------------+------------+--------+
|id |email                   |account_type|city    |
+---+------------------------+------------+--------+
|1  |john.doe@gmail.com      |Premium User|New York|
|3  |bob.wilson@GMAIL.COM    |Premium User|Chicago |
|5  |charlie.davis@gmail.com |null        |Phoenix |
+---+------------------------+------------+--------+

The lower() function converts the entire column to lowercase before applying the contains check. You can also use upper() if you prefer—just make sure your search string matches the case transformation you apply.

# Using upper() instead
case_insensitive_upper = df.filter(upper(col("email")).contains("GMAIL"))

Advanced Filtering Patterns

Real-world scenarios often require combining multiple conditions. PySpark uses bitwise operators for logical operations: & (AND), | (OR), and ~ (NOT). Critical point: always wrap each condition in parentheses.

# Find Premium Gmail users (AND condition)
premium_gmail = df.filter(
    (col("email").contains("gmail")) & 
    (col("account_type").contains("Premium"))
)
premium_gmail.show(truncate=False)

# Find users with Gmail OR Yahoo email (OR condition)
gmail_or_yahoo = df.filter(
    (lower(col("email")).contains("gmail")) | 
    (lower(col("email")).contains("yahoo"))
)
print(f"Gmail or Yahoo users: {gmail_or_yahoo.count()}")

# Find non-Gmail users (NOT condition)
non_gmail = df.filter(~col("email").contains("gmail"))
non_gmail.show(truncate=False)

You can also filter across multiple columns simultaneously:

# Find rows where email contains "gmail" OR city contains "New"
multi_column = df.filter(
    (lower(col("email")).contains("gmail")) | 
    (col("city").contains("New"))
)
multi_column.show(truncate=False)

Combining contains() with other string functions provides even more power:

# startswith() and endswith() are Column methods, like contains(),
# so no additional imports are needed

# Find Gmail users with emails starting with 'j'
complex_filter = df.filter(
    (col("email").startswith("j")) & 
    (col("email").contains("gmail"))
)
complex_filter.show(truncate=False)

# Find Premium users in cities ending with 'o'
city_account_filter = df.filter(
    (col("city").endswith("o")) & 
    (col("account_type") == "Premium User")
)
city_account_filter.show(truncate=False)

Performance Considerations and Best Practices

When working with large datasets, performance matters. Here are key considerations:

Partition Pruning: If your data is partitioned and you’re filtering on partition columns, Spark can skip entire partitions. However, contains() on non-partition columns requires scanning all data.

Push Down Predicates: When reading from data sources like Parquet or databases, simple filters can be pushed down to the source, reducing data transfer. However, complex string operations like lower(col()).contains() may prevent predicate pushdown.

Alternatives to Consider:

# like() and rlike() are Column methods; no imports from
# pyspark.sql.functions are needed here

# Using SQL LIKE syntax (often optimized by Spark SQL)
like_filter = df.filter(col("email").like("%gmail%"))

# Using regex for more complex patterns
# Note: rlike is generally slower than contains for simple substring matching
regex_filter = df.filter(col("email").rlike(".*gmail.*"))

# For exact matching, use equality (much faster)
exact_match = df.filter(col("account_type") == "Premium User")

For simple substring matching, contains() is typically faster than regex. Use like() when you need SQL-style wildcards, and reserve regex for complex patterns.

Caching Strategy: If you’re applying multiple filters on the same DataFrame, consider caching:

# Cache the DataFrame if you'll filter it multiple times
df.cache()

gmail_users = df.filter(col("email").contains("gmail"))
yahoo_users = df.filter(col("email").contains("yahoo"))

# Don't forget to unpersist when done
df.unpersist()

Common Pitfalls and Troubleshooting

Null Value Handling: The most common issue with contains() is null values. If a column contains nulls, contains() will return null for those rows, which are excluded from filter results.

# Rows with null emails are silently excluded - row 6 never appears
gmail_only = df.filter(col("email").contains("gmail"))
print(f"Count without null handling: {gmail_only.count()}")

# Proper null handling - explicitly check for non-null values
safe_filter = df.filter(
    col("email").isNotNull() & 
    col("email").contains("gmail")
)
print(f"Count with null handling: {safe_filter.count()}")

# Or handle nulls in account_type
safe_account_filter = df.filter(
    col("account_type").isNotNull() & 
    col("account_type").contains("Premium")
)
safe_account_filter.show(truncate=False)

Special Characters: When searching for strings with special regex characters, remember that contains() treats them literally, unlike rlike():

# Create data with special characters
special_data = [
    (1, "user@example.com", "test.file.txt"),
    (2, "admin@site.org", "data[1].csv"),
    (3, "info@domain.net", "report(2023).pdf")
]

df_special = spark.createDataFrame(special_data, ["id", "email", "filename"])

# contains() treats dots literally (no escaping needed)
dot_filter = df_special.filter(col("filename").contains(".txt"))
dot_filter.show()

# Searching for brackets - also literal
bracket_filter = df_special.filter(col("filename").contains("[1]"))
bracket_filter.show()

Type Errors: contains() is intended for string columns. Applying it to a numeric column may raise an analysis error or fall back to implicit casting, depending on your Spark version, so cast explicitly to be safe:

# Risky - behavior varies by Spark version
# numeric_filter = df.filter(col("id").contains("1"))

# Convert to string first if needed
string_id_filter = df.filter(col("id").cast("string").contains("1"))
string_id_filter.show()

Operator Precedence: Always use parentheses around each condition when combining filters. Without them, operator precedence can produce unexpected results:

# WRONG - & binds tighter than ==, so this parses as
# (col("email").contains("gmail") & col("account_type")) == "Premium User"
# wrong_filter = df.filter(col("email").contains("gmail") & col("account_type") == "Premium User")

# CORRECT - explicit parentheses
correct_filter = df.filter(
    (col("email").contains("gmail")) & 
    (col("account_type") == "Premium User")
)

The contains() method is your go-to tool for substring filtering in PySpark. Master these patterns, handle nulls properly, and understand the case-sensitivity behavior, and you’ll be able to filter massive datasets efficiently and reliably.
