PySpark - Substring from Column
Key Insights
- PySpark provides two primary methods for substring extraction: the substring() SQL function and the substr() Column method, both of which use 1-based indexing, unlike Python's native string slicing
- Built-in substring functions significantly outperform custom UDFs because they integrate with the Catalyst optimizer and avoid serialization overhead between the JVM and Python processes
- For pattern-based extraction, regexp_extract() offers more flexibility than fixed-position substrings and handles variable-length data more elegantly
Introduction to String Manipulation in PySpark
String manipulation is fundamental to data engineering workflows, especially when dealing with raw data that requires cleaning, parsing, or transformation. PySpark’s DataFrame API provides a comprehensive suite of string functions that operate at scale across distributed datasets.
Substring operations are among the most frequently used string transformations. Whether you’re extracting date components from timestamps, parsing structured codes, or isolating specific portions of text fields, substring functionality is essential. PySpark offers two primary approaches: the substring() SQL function from pyspark.sql.functions and the substr() method available on Column objects.
Let’s start with a sample dataset to demonstrate these techniques:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
spark = SparkSession.builder.appName("SubstringExample").getOrCreate()
data = [
("PROD-2024-001", "John Smith", "john.smith@example.com"),
("PROD-2024-002", "Jane Doe", "jane.doe@company.org"),
("TEST-2023-999", "Bob Wilson", "bob@test.io"),
("PROD-2024-003", "Alice Brown", "alice.brown@example.com")
]
df = spark.createDataFrame(data, ["product_id", "name", "email"])
df.show(truncate=False)
Using the substring() Function
The substring() function follows SQL conventions with the signature substring(str, pos, len). The critical detail: PySpark uses 1-based indexing, not the 0-based indexing Python developers are accustomed to. The first character is at position 1, not 0.
Parameters:
str: The column or string expression to extract from
pos: Starting position (1-based index)
len: Number of characters to extract
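Since the 1-based convention trips up Python developers, it helps to see how substring(str, pos, len) maps onto Python's 0-based slicing. The helper below is a pure-Python mirror for illustration only (sql_substring is a hypothetical name, not part of PySpark, and it ignores Spark's support for negative positions):

```python
def sql_substring(s, pos, length):
    """Illustrative pure-Python mirror of SQL substring semantics:
    pos is 1-based, so SQL position 1 maps to Python index 0."""
    start = pos - 1  # convert 1-based SQL position to 0-based Python index
    return s[start:start + length]

print(sql_substring("PROD-2024-001", 1, 4))  # -> PROD
print(sql_substring("PROD-2024-001", 6, 4))  # -> 2024
```

In other words, substring(col, pos, len) behaves like the Python slice s[pos-1 : pos-1+len].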
Here’s how to extract the first four characters (the product type prefix):
from pyspark.sql.functions import substring
df_with_prefix = df.withColumn(
"product_type",
substring("product_id", 1, 4)
)
df_with_prefix.select("product_id", "product_type").show()
Output:
+-------------+------------+
| product_id|product_type|
+-------------+------------+
|PROD-2024-001| PROD|
|PROD-2024-002| PROD|
|TEST-2023-999| TEST|
|PROD-2024-003| PROD|
+-------------+------------+
To extract the year portion (characters 6-9), we start at position 6 and take 4 characters:
df_with_year = df.withColumn(
"year",
substring("product_id", 6, 4)
)
df_with_year.select("product_id", "year").show()
When you need everything from a specific position to the end of the string, provide a length value larger than any string you expect to encounter; the extra length is simply clipped to whatever remains:
# Extract everything after the second hyphen (position 11 onward)
df_with_sequence = df.withColumn(
"sequence_number",
substring("product_id", 11, 100) # Large length extracts to end
)
df_with_sequence.select("product_id", "sequence_number").show()
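In plain Python terms, "from position 11 to the end" is just s[10:], and the oversized length achieves the same result because slicing clips at the end of the string (the literal below reuses a value from the sample data):

```python
product_id = "PROD-2024-001"

# An oversized length is clipped to whatever remains,
# which matches Python's open-ended slice s[10:]
print(product_id[10:10 + 100])  # -> 001
print(product_id[10:])          # -> 001
```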
Using Column Methods with substr()
The substr() method provides an object-oriented alternative that’s particularly useful when chaining multiple transformations. It’s a method on Column objects with the same parameters: substr(startPos, length).
from pyspark.sql.functions import col
df_name_prefix = df.withColumn(
"name_prefix",
col("name").substr(1, 4)  # First four characters of the name
)
df_name_prefix.select("name", "name_prefix").show()
The real power of substr() emerges when chaining operations. Here's extracting a substring and converting it to uppercase. Note that Column objects have no upper() method, so the upper() function from pyspark.sql.functions wraps the chained expression:
from pyspark.sql.functions import upper, trim
df_chained = df.withColumn(
"product_type_upper",
upper(col("product_id").substr(1, 4))
).withColumn(
"clean_name",
trim(col("name")).substr(1, 10)
)
df_chained.select("product_type_upper", "clean_name").show()
Both substring() and substr() produce identical results; the choice is stylistic. Use substring() when working with SQL-style function composition and substr() when leveraging Column method chaining.
Advanced Substring Techniques
Real-world data rarely has fixed positions for extraction. You’ll often need conditional logic or dynamic substring operations based on other column values or content analysis.
Conditional Substring Extraction
Use when() to apply different substring logic based on conditions:
from pyspark.sql.functions import when, length
df_conditional = df.withColumn(
"email_prefix",
when(col("email").contains("@example.com"),
substring("email", 1, 4))
.when(col("email").contains("@company.org"),
substring("email", 1, 5))
.otherwise(substring("email", 1, 3))
)
df_conditional.select("email", "email_prefix").show(truncate=False)
Finding Delimiters with instr()
The instr() function locates the 1-based position of a substring (returning 0 when it is absent), enabling dynamic extraction. Passing Column arguments to substring() is only supported in recent Spark releases, so the portable approach is the substr() Column method, which accepts Column arguments for both position and length:
from pyspark.sql.functions import instr, lit
# Extract everything before the @ symbol in email addresses
df_username = df.withColumn(
"at_position",
instr("email", "@")
).withColumn(
"username",
col("email").substr(lit(1), col("at_position") - 1)
)
df_username.select("email", "username").show(truncate=False)
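For intuition, the same row-level logic can be sketched in plain Python. Both helpers below (instr_like and username_of) are hypothetical names for illustration, not PySpark APIs; they mirror instr()'s 1-based, 0-when-absent convention:

```python
def instr_like(s, sub):
    """Illustrative mirror of SQL instr(): 1-based position of sub in s,
    0 when sub is absent (str.find returns -1, so +1 gives 0)."""
    return s.find(sub) + 1

def username_of(email):
    at_pos = instr_like(email, "@")
    # Take length at_pos - 1 starting at position 1, i.e. everything before the @
    return email[:at_pos - 1] if at_pos > 0 else email

print(username_of("john.smith@example.com"))  # -> john.smith
```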
Handling Edge Cases
PySpark's substring functions handle edge cases gracefully. If the starting position exceeds the string length, you get an empty string; if the requested length exceeds the available characters, you get whatever remains; and a null input yields a null output rather than an error:
from pyspark.sql.functions import coalesce
# Fall back to the full name when it is shorter than 5 characters
df_safe = df.withColumn(
"safe_extract",
coalesce(
when(length("name") >= 5, substring("name", 1, 5)),
col("name") # Use full name if shorter than 5 chars
)
)
df_safe.select("name", "safe_extract").show()
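The out-of-range and null rules described above can be illustrated with a plain-Python mirror of the substring semantics (sql_substring is a hypothetical helper for illustration; Spark applies the same rules to Column values):

```python
def sql_substring(s, pos, length):
    """Illustrative pure-Python mirror of SQL substring edge-case behavior."""
    if s is None:
        return None  # null-safe: null in, null out
    start = pos - 1  # 1-based SQL position -> 0-based Python index
    return s[start:start + length]

print(sql_substring("Bob", 10, 4))   # -> "" (start past end of string)
print(sql_substring("Bob", 2, 100))  # -> "ob" (length clipped to remainder)
print(sql_substring(None, 1, 4))     # -> None (null propagates)
```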
Regular Expression Alternative with regexp_extract()
When you need pattern-based extraction rather than fixed positions, regexp_extract() is more appropriate. This function uses Java regex patterns and capture groups.
Extract email domains:
from pyspark.sql.functions import regexp_extract
df_domain = df.withColumn(
"email_domain",
regexp_extract("email", r"@([\w\.]+)", 1)
)
df_domain.select("email", "email_domain").show(truncate=False)
Extract numeric portions from product IDs:
df_numbers = df.withColumn(
"product_number",
regexp_extract("product_id", r"-(\d{3})$", 1)
)
df_numbers.select("product_id", "product_number").show()
The pattern r"-(\d{3})$" matches a hyphen followed by exactly three digits at the end of the string. The 1 parameter indicates we want the first capture group (the digits).
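A notable difference from substring(): when the pattern does not match, regexp_extract() returns an empty string rather than null. The sketch below mirrors that behavior in plain Python (regexp_extract_like is a hypothetical helper, and Python's re dialect differs slightly from the Java regex Spark uses):

```python
import re

def regexp_extract_like(s, pattern, group):
    """Illustrative mirror of Spark's regexp_extract: return the requested
    capture group, or "" when the pattern does not match."""
    m = re.search(pattern, s)
    return m.group(group) if m else ""

print(regexp_extract_like("PROD-2024-001", r"-(\d{3})$", 1))  # -> 001
print(regexp_extract_like("PROD-2024", r"-(\d{3})$", 1))      # -> "" (no match)
```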
For extracting first and last names when separated by spaces:
df_names = df.withColumn(
"first_name",
regexp_extract("name", r"^(\w+)", 1)
).withColumn(
"last_name",
regexp_extract("name", r"\s(\w+)$", 1)
)
df_names.select("name", "first_name", "last_name").show()
Performance Considerations and Best Practices
PySpark’s built-in string functions are implemented in the JVM and optimized by the Catalyst query optimizer. This makes them significantly faster than user-defined functions (UDFs) that require serialization between Python and JVM.
Avoid this anti-pattern:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# DON'T DO THIS - UDF is much slower
@udf(StringType())
def extract_substring_udf(s):
return s[0:4] if s else None
df_slow = df.withColumn("prefix", extract_substring_udf("product_id"))
Use built-in functions instead:
# DO THIS - Built-in function is optimized
df_fast = df.withColumn("prefix", substring("product_id", 1, 4))
The performance difference becomes dramatic with large datasets. Built-in functions can be 5-10x faster or more, because Python UDFs pay per-row serialization costs between the JVM and the Python worker processes.
Additional best practices:
- Cache strategically: If you're performing multiple substring operations on the same DataFrame, cache it after the initial load to avoid recomputing transformations.
- Minimize column creation: Instead of creating intermediate columns for each substring operation, combine logic where possible.
- Use column pruning: Select only the columns you need before performing expensive string operations.
# Efficient: select columns first, then transform
df_efficient = df.select("product_id", "name") \
.withColumn("type", substring("product_id", 1, 4)) \
.withColumn("year", substring("product_id", 6, 4))
- Understand filter placement: Filtering on a derived substring column still lets Catalyst collapse the projection and filter into a single stage, but a predicate on a computed substring generally cannot be pushed down to the data source or used for partition pruning the way a filter on a raw column can:
# Catalyst rewrites this filter in terms of substring(product_id, 6, 4)
df_filtered = df.withColumn("year", substring("product_id", 6, 4)) \
.filter(col("year") == "2024")
Substring operations are fundamental building blocks in PySpark data pipelines. Master these techniques, understand the performance implications, and you’ll handle string manipulation efficiently at any scale. Remember that PySpark’s 1-based indexing differs from Python’s conventions, and always prefer built-in functions over UDFs for optimal performance.