PySpark - Pad String with lpad and rpad

Key Insights

  • PySpark’s lpad() and rpad() functions pad strings to a specified length by adding characters to the left or right side, essential for generating fixed-width files and formatting data for external systems.
  • Both functions truncate strings that exceed the target length rather than throwing errors, which can lead to silent data loss if not handled carefully in production pipelines.
  • Padding operations are computationally inexpensive but should be applied strategically—prefer a single select() containing all transformations over long withColumn() chains, since each withColumn() call adds another projection to the query plan and increases driver-side analysis overhead.

Introduction

String padding is a fundamental operation when working with data integration, reporting, and legacy system compatibility. In PySpark, the lpad() and rpad() functions from pyspark.sql.functions provide efficient column-level operations for padding strings to a fixed width. These functions are indispensable when you need to format output for mainframe systems expecting fixed-width records, align data in console outputs, or ensure consistent formatting for identifiers like account numbers and transaction IDs.

Unlike simple string concatenation, these functions intelligently handle variable-length inputs and ensure every value in a column meets your specified length requirements. Let’s look at a quick comparison:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad, rpad

spark = SparkSession.builder.appName("StringPadding").getOrCreate()

# Create sample data
data = [("123",), ("45",), ("6789",)]
df = spark.createDataFrame(data, ["id"])

# Apply padding
result = df.select(
    "id",
    lpad("id", 6, "0").alias("padded_left"),
    rpad("id", 6, "0").alias("padded_right")
)

result.show()
# +----+-----------+------------+
# |  id|padded_left|padded_right|
# +----+-----------+------------+
# | 123|     000123|      123000|
# |  45|     000045|      450000|
# |6789|     006789|      678900|
# +----+-----------+------------+

Understanding lpad() - Left Padding

The lpad() function pads strings on the left side until they reach the specified length. The syntax is straightforward: lpad(col, len, pad) where col is your column name, len is the target length, and pad is the character(s) to use for padding.

Left padding is particularly common in financial and enterprise systems where numeric identifiers need consistent formatting. Here’s how it works with different scenarios:

from pyspark.sql.functions import lpad, col

# Sample transaction data
transactions = [
    ("TXN123",),
    ("TXN45",),
    ("TXN9",),
    ("TXN123456789",)
]
df = spark.createDataFrame(transactions, ["transaction_id"])

# Pad with zeros to create 10-character IDs
df_padded = df.select(
    col("transaction_id"),
    lpad("transaction_id", 10, "0").alias("formatted_id")
)

df_padded.show(truncate=False)
# +--------------+------------+
# |transaction_id|formatted_id|
# +--------------+------------+
# |TXN123        |0000TXN123  |
# |TXN45         |00000TXN45  |
# |TXN9          |000000TXN9  |
# |TXN123456789  |TXN1234567  |
# +--------------+------------+

Notice that the last row, which already exceeds 10 characters, isn't padded—it's truncated to the first 10 characters. This is critical behavior to understand.

For numeric account numbers, zero-padding is the standard approach:

accounts = [("123",), ("4567",), ("89",)]
df_accounts = spark.createDataFrame(accounts, ["account_num"])

df_accounts.select(
    lpad("account_num", 8, "0").alias("account_number")
).show()
# +--------------+
# |account_number|
# +--------------+
# |      00000123|
# |      00004567|
# |      00000089|
# +--------------+

When a string is already exactly at the target length, lpad() returns it unchanged; when it exceeds the target, it is cut down to the first len characters:

mixed_lengths = [("12345",), ("123456789",), ("12",)]
df_mixed = spark.createDataFrame(mixed_lengths, ["value"])

df_mixed.select(
    col("value"),
    lpad("value", 5, "X").alias("padded")
).show()
# +---------+------+
# |    value|padded|
# +---------+------+
# |    12345| 12345|
# |123456789| 12345|  <- truncated!
# |       12|  XX12|
# +---------+------+

Understanding rpad() - Right Padding

The rpad() function mirrors lpad() but adds padding characters to the right side. The syntax is identical: rpad(col, len, pad). Right padding is essential for creating fixed-width text fields, especially for name fields, descriptions, and textual data in legacy file formats.

# Customer names with varying lengths
customers = [
    ("Alice",),
    ("Bob",),
    ("Christopher",)
]
df_customers = spark.createDataFrame(customers, ["name"])

df_customers.select(
    col("name"),
    rpad("name", 15, " ").alias("padded_name")
).show(truncate=False)
# +-----------+---------------+
# |name       |padded_name    |
# +-----------+---------------+
# |Alice      |Alice          |
# |Bob        |Bob            |
# |Christopher|Christopher    |
# +-----------+---------------+

You can use custom padding characters for visual effects or specific format requirements:

# Create a formatted report header
headers = [("Product",), ("Price",), ("Qty",)]
df_headers = spark.createDataFrame(headers, ["header"])

df_headers.select(
    rpad("header", 20, ".").alias("formatted_header")
).show(truncate=False)
# +--------------------+
# |formatted_header    |
# +--------------------+
# |Product.............|
# |Price...............|
# |Qty.................|
# +--------------------+

For fixed-width file generation, right padding ensures each field occupies exactly the required space:

# Generate fixed-width export format
records = [
    ("ITEM001", "Widget", "125"),
    ("ITEM002", "Gadget", "75"),
]
df_export = spark.createDataFrame(records, ["item_code", "description", "price"])

df_fixed = df_export.select(
    rpad("item_code", 10, " ").alias("item_code"),
    rpad("description", 30, " ").alias("description"),
    lpad("price", 8, "0").alias("price")
)

df_fixed.show(truncate=False)
# +----------+------------------------------+--------+
# |item_code |description                   |price   |
# +----------+------------------------------+--------+
# |ITEM001   |Widget                        |00000125|
# |ITEM002   |Gadget                        |00000075|
# +----------+------------------------------+--------+

Practical Use Cases

Real-world data engineering often requires precise formatting for interoperability with external systems. Here are battle-tested scenarios where padding functions are essential.

Fixed-Width File Generation for Banking Systems:

from pyspark.sql.functions import lpad, rpad, concat

# Banking transaction export
transactions = [
    ("ACC12345", "John Smith", "1250.50", "2024-01-15"),
    ("ACC789", "Jane Doe", "75.25", "2024-01-15"),
]
df_bank = spark.createDataFrame(
    transactions, 
    ["account", "name", "amount", "date"]
)

# Format for mainframe: 10-char account, 30-char name, 12-char amount, 10-char date
df_mainframe = df_bank.select(
    concat(
        lpad("account", 10, "0"),
        rpad("name", 30, " "),
        lpad("amount", 12, "0"),
        rpad("date", 10, " ")
    ).alias("fixed_width_record")
)

df_mainframe.show(truncate=False)
# +--------------------------------------------------------------+
# |fixed_width_record                                            |
# +--------------------------------------------------------------+
# |00ACC12345John Smith                    000001250.502024-01-15|
# |0000ACC789Jane Doe                      000000075.252024-01-15|
# +--------------------------------------------------------------+

Creating Aligned Console Output:

# Sales report with aligned columns
sales = [
    ("Q1", "North", "125000"),
    ("Q1", "South", "98000"),
    ("Q2", "North", "142000"),
]
df_sales = spark.createDataFrame(sales, ["quarter", "region", "revenue"])

df_report = df_sales.select(
    rpad("quarter", 8, " ").alias("Quarter"),
    rpad("region", 12, " ").alias("Region"),
    lpad("revenue", 15, " ").alias("Revenue")
)

df_report.show(truncate=False)
# +--------+------------+---------------+
# |Quarter |Region      |Revenue        |
# +--------+------------+---------------+
# |Q1      |North       |         125000|
# |Q1      |South       |          98000|
# |Q2      |North       |         142000|
# +--------+------------+---------------+

Centering Text with Combined Padding:

from pyspark.sql.functions import expr, length

# Center text by calculating padding on both sides
titles = [("Report",), ("Summary",), ("Data",)]
df_titles = spark.createDataFrame(titles, ["title"])

# lpad()/rpad() take a fixed integer length, so a per-row target for the
# inner pad has to go through a SQL expression via expr()
target_width = 20
half = f"cast(({target_width} + length(title)) / 2 as int)"
df_centered = df_titles.select(
    col("title"),
    expr(f"rpad(lpad(title, {half}, ' '), {target_width}, ' ')").alias("centered")
)

df_centered.show(truncate=False)
# +-------+--------------------+
# |title  |centered            |
# +-------+--------------------+
# |Report |       Report       |
# |Summary|      Summary       |
# |Data   |        Data        |
# +-------+--------------------+

Working with NULL Values and Edge Cases

Understanding how padding functions handle edge cases prevents data quality issues in production pipelines.

NULL Handling:

from pyspark.sql.functions import coalesce, lit

# Data with NULL values
data_with_nulls = [("123",), (None,), ("45",)]
df_nulls = spark.createDataFrame(data_with_nulls, ["value"])

df_nulls.select(
    col("value"),
    lpad("value", 5, "0").alias("padded"),
    lpad(coalesce("value", lit("")), 5, "0").alias("padded_safe")
).show()
# +-----+------+-----------+
# |value|padded|padded_safe|
# +-----+------+-----------+
# |  123| 00123|      00123|
# | NULL|  NULL|      00000|
# |   45| 00045|      00045|
# +-----+------+-----------+

NULL values propagate through padding functions. Use coalesce() to provide default values when NULLs aren’t acceptable in your output.

Truncation Behavior:

# Strings exceeding target length are truncated
long_strings = [("ABCDEFGHIJ",), ("SHORT",)]
df_long = spark.createDataFrame(long_strings, ["text"])

df_long.select(
    col("text"),
    length("text").alias("original_length"),
    lpad("text", 7, "X").alias("padded"),
    length(lpad("text", 7, "X")).alias("final_length")
).show(truncate=False)
# +----------+---------------+-------+------------+
# |text      |original_length|padded |final_length|
# +----------+---------------+-------+------------+
# |ABCDEFGHIJ|10             |ABCDEFG|7           |
# |SHORT     |5              |XXSHORT|7           |
# +----------+---------------+-------+------------+

Always validate your data's length distribution before applying padding; otherwise over-length values are lost without any error or warning.

Performance Considerations and Best Practices

Padding operations are computationally cheap—they’re simple string manipulations—but inefficient usage patterns can impact performance on large datasets.

Use select() for Multiple Transformations:

# Less efficient: each withColumn() call adds another projection to the
# logical plan, increasing driver-side analysis overhead
df_inefficient = df \
    .withColumn("padded_id", lpad("id", 10, "0")) \
    .withColumn("padded_name", rpad("name", 30, " ")) \
    .withColumn("padded_amount", lpad("amount", 12, "0"))

# Efficient: Single select with all transformations
df_efficient = df.select(
    "*",
    lpad("id", 10, "0").alias("padded_id"),
    rpad("name", 30, " ").alias("padded_name"),
    lpad("amount", 12, "0").alias("padded_amount")
)

Consider Padding After Filtering:

# Apply padding only to the subset that needs it
df_filtered = df.filter(col("status") == "ACTIVE")
df_formatted = df_filtered.select(
    lpad("account_id", 10, "0").alias("account_id"),
    # ... other columns
)

Partition-Aware Processing:

When generating fixed-width files, align your padding operations with partitioning strategy:

# Partition by date before generating fixed-width output
df_partitioned = df_formatted.repartition("transaction_date")

# Write partitioned fixed-width files; note the text writer expects a
# single string column in addition to the partition column
df_partitioned.write \
    .partitionBy("transaction_date") \
    .mode("overwrite") \
    .text("/output/fixed_width_export")

The key takeaway: lpad() and rpad() are essential tools for data formatting in PySpark. Use them confidently for fixed-width file generation, report formatting, and system integration, but always validate length requirements and handle NULLs explicitly to avoid surprises in production.
