Spark SQL - String Functions Complete List
Key Insights
- Spark SQL provides 50+ string functions for data manipulation, from basic operations like concat and trim to advanced pattern matching with regexp_extract and regexp_replace
- String functions in Spark SQL are optimized for distributed processing and can handle null values gracefully, making them production-ready for large-scale data transformations
- Understanding the difference between similar functions like substring vs substr, or split vs split_part, prevents common bugs and improves query performance
Basic String Operations
The foundational string functions handle concatenation, case conversion, and trimming operations that form the building blocks of text processing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("StringFunctions").getOrCreate()
df = spark.createDataFrame([
(" John Doe ", "john.doe@email.com"),
("Jane Smith", "JANE.SMITH@EMAIL.COM"),
("Bob Johnson", None)
], ["name", "email"])
# Concatenation
df.select(
concat(col("name"), lit(" - "), col("email")).alias("concat_result"),
concat_ws("|", col("name"), col("email")).alias("concat_ws_result")
).show(truncate=False)
# Case conversion
df.select(
upper(col("name")).alias("upper"),
lower(col("email")).alias("lower"),
initcap(col("name")).alias("initcap")
).show(truncate=False)
# Trimming
df.select(
trim(col("name")).alias("trim"),
ltrim(col("name")).alias("ltrim"),
rtrim(col("name")).alias("rtrim")
).show(truncate=False)
The concat_ws function is particularly useful as it handles null values by skipping them rather than returning null for the entire result.
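This null-skipping behavior can be illustrated with a plain-Python analogue (a sketch of the semantics, not Spark code):

```python
def concat_ws_like(sep, *values):
    # concat_ws skips NULL (None) arguments instead of nulling the result
    return sep.join(v for v in values if v is not None)

def concat_like(*values):
    # plain concat returns NULL as soon as any argument is NULL
    return None if any(v is None for v in values) else "".join(values)

concat_ws_like("|", "Bob Johnson", None)   # "Bob Johnson"
concat_like("Bob Johnson", None)           # None
```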
Length and Substring Operations
These functions extract portions of strings or measure their dimensions, critical for data validation and parsing.
data = [("Hello World", "2024-01-15"), ("Spark SQL", "2024-02-20")]
df = spark.createDataFrame(data, ["text", "date"])
# Length functions
df.select(
col("text"),
length(col("text")).alias("length"),
char_length(col("text")).alias("char_length"),
bit_length(col("text")).alias("bit_length")
).show()
# Substring extraction
df.select(
substring(col("text"), 1, 5).alias("substring"),
col("text").substr(7, 5).alias("substr"),  # substr is also a Column method
left(col("text"), lit(5)).alias("left"),
right(col("text"), lit(5)).alias("right")
).show()
# Position finding
df.select(
instr(col("text"), "o").alias("instr"),
locate("o", col("text")).alias("locate"),
locate("o", col("text"), 5).alias("locate_from_pos")
).show()
Note that substring and substr are functionally equivalent in Spark SQL; substr is also available as a Column method, which accepts plain integer arguments. The locate function, unlike instr, lets you specify a 1-based starting position for the search.
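The 1-based position convention trips up Python users used to 0-based indexing. A plain-Python sketch of the semantics (not Spark code):

```python
def instr_like(s, sub):
    # Spark string positions are 1-based; 0 means "not found"
    return s.find(sub) + 1

def locate_like(sub, s, pos=1):
    # locate searches starting from a 1-based position (inclusive)
    return s.find(sub, pos - 1) + 1

instr_like("Hello World", "o")        # 5
locate_like("o", "Hello World", 6)    # 8
```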
Pattern Matching and Regular Expressions
Regular expression functions provide powerful pattern matching capabilities for complex string transformations.
log_data = [
("2024-01-15 ERROR: Database connection failed at 192.168.1.100",),
("2024-01-15 INFO: User admin logged in from 10.0.0.50",),
("2024-01-15 WARN: High memory usage detected",)
]
df = spark.createDataFrame(log_data, ["log"])
# Extract patterns
df.select(
col("log"),
regexp_extract(col("log"), r"(\d+\.\d+\.\d+\.\d+)", 1).alias("ip_address"),
regexp_extract(col("log"), r"(ERROR|INFO|WARN)", 1).alias("log_level")
).show(truncate=False)
# Replace patterns
df.select(
regexp_replace(col("log"), r"\d+\.\d+\.\d+\.\d+", "XXX.XXX.XXX.XXX").alias("masked_ip"),
translate(col("log"), "aeiou", "12345").alias("vowels_replaced")
).show(truncate=False)
# Boolean pattern matching
df.select(
col("log"),
col("log").rlike(r"\d+\.\d+\.\d+\.\d+").alias("contains_ip"),
col("log").like("%ERROR%").alias("is_error")
).show(truncate=False)
The regexp_extract function uses capture groups (parentheses in the regex), and its third argument specifies which group to extract; group 0 is the entire match.
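The group-extraction behavior mirrors Python's own re module, which makes for a convenient plain-Python sketch (note that on no match Spark returns an empty string rather than NULL):

```python
import re

def regexp_extract_like(s, pattern, idx):
    # Spark's regexp_extract returns '' (not NULL) when nothing matches
    m = re.search(pattern, s)
    return m.group(idx) if m else ""

log = "2024-01-15 ERROR: Database connection failed at 192.168.1.100"
regexp_extract_like(log, r"(\d+\.\d+\.\d+\.\d+)", 1)   # "192.168.1.100"
regexp_extract_like(log, r"(ERROR|INFO|WARN)", 1)      # "ERROR"
```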
String Splitting and Array Operations
Converting strings to arrays enables complex parsing scenarios, especially for delimited data.
csv_data = [
("John,Doe,35,Engineer",),
("Jane,Smith,28,Designer",),
("Bob,Johnson,42,Manager",)
]
df = spark.createDataFrame(csv_data, ["csv_line"])
# Split strings
df.select(
split(col("csv_line"), ",").alias("fields"),
split(col("csv_line"), ",").getItem(0).alias("first_name"),
split(col("csv_line"), ",").getItem(1).alias("last_name")
).show(truncate=False)
# Array to string
df_array = spark.createDataFrame([
(["apple", "banana", "cherry"],),
(["dog", "cat", "bird"],)
], ["items"])
df_array.select(
concat_ws(", ", col("items")).alias("joined"),
array_join(col("items"), " | ").alias("array_joined")
).show(truncate=False)
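The split_part function mentioned in the key insights differs from split by returning a single field directly instead of an array. Its basic behavior can be sketched in plain Python (Spark's split_part also supports negative part numbers counting from the end, omitted here):

```python
def split_part_like(s, delim, part):
    # split_part is 1-based and returns '' when part is out of range
    fields = s.split(delim)
    return fields[part - 1] if 1 <= part <= len(fields) else ""

split_part_like("John,Doe,35,Engineer", ",", 2)   # "Doe"
```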
Padding and Formatting
Padding functions ensure consistent string lengths, essential for fixed-width file formats and alignment.
products = [
("A1", "Widget", 9.99),
("B22", "Gadget", 149.50),
("C333", "Tool", 1299.00)
]
df = spark.createDataFrame(products, ["code", "name", "price"])
# Padding
df.select(
lpad(col("code"), 10, "0").alias("lpad_code"),
rpad(col("name"), 20, ".").alias("rpad_name"),
format_number(col("price"), 2).alias("formatted_price"),
format_string("Product: %s - $%.2f", col("name"), col("price")).alias("formatted")
).show(truncate=False)
# Repeating strings
df.select(
repeat(col("code"), 3).alias("repeated"),
repeat(lit("-"), 20).alias("separator")
).show(truncate=False)
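One detail of lpad worth remembering: it truncates strings that already exceed the target length rather than leaving them intact. A plain-Python sketch of the behavior:

```python
def lpad_like(s, length, pad):
    # lpad truncates when the input is already longer than length
    if len(s) >= length:
        return s[:length]
    # repeat the pad string and cut it to the exact number of missing chars
    fill = (pad * length)[: length - len(s)]
    return fill + s

lpad_like("A1", 10, "0")            # "00000000A1"
lpad_like("ABCDEFGHIJK", 4, "0")    # "ABCD" (truncated)
```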
Advanced String Manipulation
These functions handle encoding, reversing, and specialized transformations.
mixed_data = [
("Hello World", "data:text/plain;base64,SGVsbG8="),
("Spark SQL", "test@example.com")
]
df = spark.createDataFrame(mixed_data, ["text", "encoded"])
# Reversing and overlaying
df.select(
reverse(col("text")).alias("reversed"),
overlay(col("text"), lit("***"), 7, 5).alias("overlay_result")
).show(truncate=False)
# Encoding operations
df.select(
base64(col("text")).alias("base64_encoded"),
unbase64(lit("SGVsbG8gV29ybGQ=")).alias("base64_decoded"),
encode(col("text"), "UTF-8").alias("encoded_bytes")
).show(truncate=False)
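The literal decoded above round-trips as expected, which you can verify with Python's standard base64 module (Spark uses the same standard Base64 alphabet):

```python
import base64

# "Hello World" encodes to the literal passed to unbase64 above
encoded = base64.b64encode(b"Hello World").decode("ascii")
decoded = base64.b64decode("SGVsbG8gV29ybGQ=")
print(encoded)   # SGVsbG8gV29ybGQ=
print(decoded)   # b'Hello World'
```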
# Soundex for phonetic matching
names = [("Smith",), ("Smyth",), ("Schmidt",)]
df_names = spark.createDataFrame(names, ["name"])
df_names.select(
col("name"),
soundex(col("name")).alias("soundex_code")
).show()
The soundex function is particularly useful for fuzzy name matching in data deduplication scenarios.
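The algorithm itself is simple enough to sketch in plain Python. This is the classic American Soundex scheme (first letter plus three digits); edge cases such as empty or non-ASCII input are ignored, and Spark's built-in implementation may differ in such details:

```python
def soundex_like(name):
    # Map consonants to the standard Soundex digit classes
    mapping = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"),
                           ("3", "dt"), ("4", "l"),
                           ("5", "mn"), ("6", "r")]:
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    result = [name[0].upper()]          # keep the first letter as-is
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                    # h and w do not separate duplicate codes
        code = mapping.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code                     # vowels reset prev, allowing repeats
    return ("".join(result) + "000")[:4]

[soundex_like(n) for n in ("Smith", "Smyth", "Schmidt")]   # all "S530"
```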
Null Handling and Conditional Operations
Proper null handling prevents unexpected errors in production pipelines.
user_data = [
("John", "Doe", "john@email.com"),
("Jane", None, "jane@email.com"),
(None, "Smith", None)
]
df = spark.createDataFrame(user_data, ["first_name", "last_name", "email"])
# Coalesce and null handling
df.select(
coalesce(col("first_name"), lit("Unknown")).alias("first_name_safe"),
coalesce(col("email"), lit("no-email@domain.com")).alias("email_safe"),
when(col("last_name").isNull(), lit("N/A")).otherwise(col("last_name")).alias("last_name_safe")
).show()
# Null-safe concatenation
df.select(
concat_ws(" ", col("first_name"), col("last_name")).alias("full_name"),
nvl(col("email"), lit("no-email")).alias("email_nvl"),
nvl2(col("last_name"), lit("Has last name"), lit("Missing")).alias("status")
).show()
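The semantics of nvl and nvl2 are compact enough to state as a plain-Python sketch:

```python
def nvl_like(value, default):
    # nvl falls back to the default only when value is NULL
    return default if value is None else value

def nvl2_like(value, if_not_null, if_null):
    # nvl2 chooses between two results based on whether value is NULL
    return if_null if value is None else if_not_null

nvl_like(None, "no-email")                      # "no-email"
nvl2_like("Smith", "Has last name", "Missing")  # "Has last name"
```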
Performance Considerations
String functions in Spark SQL execute in a distributed manner, but certain patterns optimize better than others.
# Efficient: Single pass with multiple operations
df.select(
trim(lower(col("name"))).alias("normalized_name"),
regexp_replace(col("email"), r"@.*$", "@company.com").alias("masked_email")
)
# Less efficient: repeated withColumn calls grow the logical plan
# (Catalyst typically collapses them into one projection, but each call adds analysis overhead)
df_step1 = df.withColumn("trimmed", trim(col("name")))
df_step2 = df_step1.withColumn("lowered", lower(col("trimmed")))
df_step3 = df_step2.withColumn("final", col("lowered"))
# Use SQL expressions for complex logic
df.selectExpr(
"trim(lower(name)) as normalized_name",
"regexp_replace(email, '@.*$', '@company.com') as masked_email"
)
Chaining operations in a single select statement allows Spark’s Catalyst optimizer to create more efficient execution plans. When working with large datasets, prefer built-in functions over UDFs, as they benefit from Spark’s optimizations and avoid serialization overhead.
String functions form the backbone of data cleaning and transformation pipelines. Master these operations to build robust, performant data processing workflows that scale across distributed clusters.