PySpark - Concatenate Two or More Columns

Key Insights

  • Use concat_ws() instead of concat() when you need separators—it handles nulls gracefully by skipping them rather than making the entire result null
  • Cast numeric or date columns to strings before concatenation—implicit type coercion varies across Spark versions and ANSI settings, so explicit casts keep jobs from failing with type mismatch errors
  • For complex formatting needs, format_string() provides printf-style templates that are more readable than chaining multiple concat operations

Introduction

Column concatenation is one of those bread-and-butter operations you’ll perform constantly in PySpark. Whether you’re building composite keys for joins, creating human-readable display names, or generating formatted output for downstream systems, knowing how to efficiently combine column values is essential.

Common scenarios include merging first and last names into full names, combining address components into complete mailing addresses, creating unique identifiers from multiple fields, or generating formatted strings for reporting. The challenge isn’t just joining the values—it’s handling nulls properly, managing data types, and choosing the right function for your specific use case.

Let’s start with a simple example that illustrates why you need concatenation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, col

spark = SparkSession.builder.appName("concatenation").getOrCreate()

# Sample data
data = [
    ("John", "Doe", "john.doe@email.com"),
    ("Jane", "Smith", "jane.smith@email.com"),
    ("Bob", None, "bob@email.com")
]

df = spark.createDataFrame(data, ["first_name", "last_name", "email"])
df.show()

# Output:
# +----------+---------+--------------------+
# |first_name|last_name|               email|
# +----------+---------+--------------------+
# |      John|      Doe|  john.doe@email.com|
# |      Jane|    Smith|jane.smith@email.com|
# |       Bob|     null|        bob@email.com|
# +----------+---------+--------------------+

We want to create a full_name column, but notice that Bob has a null last name—this is where choosing the right concatenation method matters.

Basic Concatenation with concat()

The concat() function is your starting point for combining columns. It takes multiple column expressions and joins them together without any separator. The syntax is straightforward, but there’s a critical gotcha: if any input column contains null, the entire result becomes null.

from pyspark.sql.functions import concat, lit

# Basic concatenation of two columns
df_basic = df.withColumn("full_name", concat(col("first_name"), col("last_name")))
df_basic.select("first_name", "last_name", "full_name").show()

# Output:
# +----------+---------+---------+
# |first_name|last_name|full_name|
# +----------+---------+---------+
# |      John|      Doe|  JohnDoe|
# |      Jane|    Smith|JaneSmith|
# |       Bob|     null|     null|  # <- Problem!
# +----------+---------+---------+

Notice Bob’s full name is null because his last name is null. This is rarely what you want. You can add literal separators using the lit() function:

# Adding a space separator with lit()
df_with_space = df.withColumn(
    "full_name", 
    concat(col("first_name"), lit(" "), col("last_name"))
)
df_with_space.select("full_name").show()

# Still produces null for Bob

You can concatenate as many columns as needed:

# Concatenating multiple columns
df_multi = df.withColumn(
    "contact_info",
    concat(
        col("first_name"),
        lit(" "),
        col("last_name"),
        lit(" <"),
        col("email"),
        lit(">")
    )
)
df_multi.select("contact_info").show(truncate=False)

# Output for non-null rows:
# John Doe <john.doe@email.com>
# Jane Smith <jane.smith@email.com>
# null  # <- Bob's row

The null-propagation behavior makes concat() unsuitable for most real-world scenarios. That’s where concat_ws() comes in.

Concatenation with Separators using concat_ws()

The concat_ws() function (concatenate with separator) is almost always the better choice. It takes a separator string as the first argument, followed by the columns to concatenate. The killer feature: it skips null values instead of making the entire result null.

# Using concat_ws with space separator
df_safe = df.withColumn(
    "full_name",
    concat_ws(" ", col("first_name"), col("last_name"))
)
df_safe.select("first_name", "last_name", "full_name").show()

# Output:
# +----------+---------+----------+
# |first_name|last_name| full_name|
# +----------+---------+----------+
# |      John|      Doe|  John Doe|
# |      Jane|    Smith|Jane Smith|
# |       Bob|     null|       Bob|  # <- Works correctly!
# +----------+---------+----------+

Perfect! Bob’s full name is just “Bob” since the null last name is skipped. Here are more separator examples:

# Comma separator for CSV-style output
df.withColumn("csv_name", concat_ws(",", col("last_name"), col("first_name"))) \
  .select("csv_name").show()

# Pipe separator for database keys
df.withColumn("composite_key", concat_ws("|", col("first_name"), col("email"))) \
  .select("composite_key").show(truncate=False)

# Output:
# +-------------------------+
# |composite_key            |
# +-------------------------+
# |John|john.doe@email.com  |
# |Jane|jane.smith@email.com|
# |Bob|bob@email.com        |
# +-------------------------+

For real-world address concatenation:

address_data = [
    ("123 Main St", "Apt 4B", "New York", "NY", "10001"),
    ("456 Oak Ave", None, "Boston", "MA", "02101"),
    ("789 Pine Rd", "", "Chicago", "IL", "60601")
]

addr_df = spark.createDataFrame(
    address_data, 
    ["street", "unit", "city", "state", "zip"]
)

# Create formatted address
addr_formatted = addr_df.withColumn(
    "full_address",
    concat_ws(", ", col("street"), col("unit"), col("city"), 
              concat_ws(" ", col("state"), col("zip")))
)
addr_formatted.select("full_address").show(truncate=False)

# Output:
# +---------------------------------------+
# |full_address                           |
# +---------------------------------------+
# |123 Main St, Apt 4B, New York, NY 10001|
# |456 Oak Ave, Boston, MA 02101          |  # No awkward comma for missing unit
# |789 Pine Rd, , Chicago, IL 60601       |  # Empty string still appears!
# +---------------------------------------+

Notice that empty strings are NOT skipped—only nulls are. We’ll address this in the edge cases section.

Advanced Concatenation Techniques

For template-based formatting, format_string() provides printf-style syntax:

from pyspark.sql.functions import format_string

# Template-based formatting
df_formatted = df.withColumn(
    "display_name",
    format_string("%s, %s (%s)", col("last_name"), col("first_name"), col("email"))
)
df_formatted.select("display_name").show(truncate=False)

# More complex formatting with padding
numbers_df = spark.createDataFrame([(1, 42), (2, 7), (3, 123)], ["id", "value"])
numbers_df.withColumn(
    "formatted",
    format_string("ID: %03d | Value: %05d", col("id"), col("value"))
).show(truncate=False)

# Output:
# +----------------------+
# |formatted             |
# +----------------------+
# |ID: 001 | Value: 00042|
# |ID: 002 | Value: 00007|
# |ID: 003 | Value: 00123|
# +----------------------+

When working with numeric or date columns, cast them to strings explicitly. Depending on your Spark version and ANSI settings, concat() may coerce types for you, but an explicit cast is portable and makes the intent obvious:

from pyspark.sql.types import StringType

mixed_data = [
    ("Product", 101, 29.99),
    ("Service", 202, 149.50)
]

mixed_df = spark.createDataFrame(mixed_data, ["type", "id", "price"])

# This may fail with a type mismatch error, depending on your Spark
# version and ANSI settings:
# mixed_df.withColumn("description", concat(col("type"), col("id"))).show()
# e.g. cannot resolve 'concat(type, id)' due to data type mismatch

# Correct approach:
mixed_df.withColumn(
    "description",
    concat_ws("-", col("type"), col("id").cast(StringType()), col("price").cast(StringType()))
).show()

# Output:
# +-----------------+
# |      description|
# +-----------------+
# |Product-101-29.99|
# |Service-202-149.5|
# +-----------------+

For conditional concatenation:

from pyspark.sql.functions import when

df.withColumn(
    "full_name",
    when(col("last_name").isNull(), col("first_name"))
    .otherwise(concat_ws(" ", col("first_name"), col("last_name")))
).select("full_name").show()

Handling Edge Cases

Empty strings require special handling since concat_ws() only skips nulls:

from pyspark.sql.functions import coalesce, when, length, trim

# Replace empty strings with null before concatenation
addr_cleaned = addr_df.withColumn(
    "unit_clean",
    when(trim(col("unit")) == "", None).otherwise(col("unit"))
).withColumn(
    "full_address",
    concat_ws(", ", col("street"), col("unit_clean"), col("city"),
              concat_ws(" ", col("state"), col("zip")))
)
addr_cleaned.select("full_address").show(truncate=False)

For comprehensive null handling with defaults:

# Provide default values for nulls
df.withColumn(
    "full_name",
    concat_ws(" ", 
              coalesce(col("first_name"), lit("Unknown")),
              coalesce(col("last_name"), lit("User")))
).select("full_name").show()

Type conversion strategies for complex scenarios:

from pyspark.sql.functions import date_format
from datetime import datetime

complex_data = [
    (1, "Event", datetime(2024, 1, 15), 99.99),
    (2, "Meeting", datetime(2024, 2, 20), None)
]

complex_df = spark.createDataFrame(
    complex_data, 
    ["id", "name", "event_date", "price"]
)

# Comprehensive type handling
complex_df.withColumn(
    "summary",
    concat_ws(" | ",
              concat(lit("ID:"), col("id").cast("string")),
              col("name"),
              date_format(col("event_date"), "yyyy-MM-dd"),
              concat(lit("$"), coalesce(col("price").cast("string"), lit("FREE"))))
).select("summary").show(truncate=False)

# Output:
# +-----------------------------------+
# |summary                            |
# +-----------------------------------+
# |ID:1 | Event | 2024-01-15 | $99.99 |
# |ID:2 | Meeting | 2024-02-20 | $FREE|
# +-----------------------------------+

Performance Considerations

For large datasets, concatenation performance is generally good, but there are optimization opportunities:

# Avoid repeated casting inside one expression - cast once and reuse
# (illustrative: assumes a DataFrame with name, id, and price columns)
df_optimized = df \
    .withColumn("id_str", col("id").cast("string")) \
    .withColumn("price_str", col("price").cast("string")) \
    .withColumn("summary", concat_ws("-", col("name"), col("id_str"), col("price_str"))) \
    .drop("id_str", "price_str")

# Prefer concat_ws over concat with literal separators
# Harder to read, and null-unsafe:
# concat(col("a"), lit(","), col("b"), lit(","), col("c"))
# Cleaner, and skips nulls:
# concat_ws(",", col("a"), col("b"), col("c"))

When working with partitioned data, ensure your concatenation doesn’t create data skew:

# If creating composite keys, consider the cardinality.
# This could create skew if user_id has an uneven distribution
# (illustrative: assumes user_id and session_id columns):
df.withColumn("composite_key", concat_ws("_", col("user_id"), col("session_id")))

# Spot-check partition sizes after concatenation-heavy transformations
# (this triggers a full scan and collects one count per partition - use sparingly)
df_result.rdd.glom().map(len).collect()

Conclusion

PySpark offers multiple concatenation approaches, each suited for different scenarios:

  • concat(): Basic concatenation without separators, but nulls propagate to the result. Use only when you’re certain no nulls exist or null results are acceptable.

  • concat_ws(): The workhorse function for most use cases. Handles nulls gracefully by skipping them and allows clean separator insertion. This should be your default choice.

  • format_string(): Best for complex formatting with templates, padding, or precision control. Ideal for generating fixed-width reports or formatted output.

Quick Reference:

Function          Separator           Null Handling                  Best For
concat()          Manual with lit()   Propagates nulls               No nulls expected
concat_ws()       First argument      Skips nulls                    General use, variable data
format_string()   In template         Renders null as "null" text    Formatted output, reports

Remember to cast non-string columns before concatenation, handle empty strings separately from nulls when needed, and choose concat_ws() as your default for robust, production-ready code.
