PySpark - Convert Integer to String

Key Insights

  • Use the cast() method for type conversion in PySpark—it’s the standard approach that works with both column expressions and SQL-style syntax, handling null values automatically without errors.
  • Converting multiple columns efficiently requires using select() with list comprehensions rather than chaining multiple withColumn() calls, which creates unnecessary intermediate DataFrames and degrades performance.
  • PySpark string conversions preserve null values as null (not “null” strings), but you can use format_string() or lpad() for specific formatting needs like zero-padding or fixed-width outputs.

Introduction

Type conversion is a fundamental operation when working with PySpark DataFrames. Converting integers to strings is particularly common when preparing data for export to systems that expect string types, joining tables with mismatched key types, or formatting numeric identifiers that shouldn’t be treated as mathematical values (like ZIP codes, product IDs, or account numbers).

Unlike pandas where type conversion is straightforward with astype(), PySpark requires explicit casting operations that are distributed across your cluster. Understanding the proper methods ensures your transformations are both correct and performant.
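For readers coming from pandas, here is a minimal side-by-side sketch of the difference. The pandas call executes eagerly; the PySpark equivalent (shown as a comment) only builds a lazy column expression that runs when an action such as show() is triggered:

```python
import pandas as pd

# pandas: astype() converts eagerly, in one call
pdf = pd.DataFrame({"id": [1001, 1002]})
pdf["id"] = pdf["id"].astype(str)
print(pdf["id"].tolist())  # ['1001', '1002']

# PySpark: cast() builds a lazy expression instead; nothing runs
# until an action like show() or count() is called:
# df = df.withColumn("id", col("id").cast("string"))
```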

Let’s start with a sample DataFrame containing integer columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("IntToString").getOrCreate()

data = [
    (1001, 12345, 50),
    (1002, 67890, 75),
    (1003, None, 100),
    (1004, 11111, 25)
]

schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("zip_code", IntegerType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.printSchema()

This creates a DataFrame where zip_code should be treated as a string (it’s an identifier, not a number), and customer_id might need string conversion for joining with external systems.

Using the cast() Method

The cast() method is the primary way to convert data types in PySpark. It works as a column transformation and can be used with either string literals or explicit type imports.

Here’s the most straightforward approach using string type specification:

from pyspark.sql.functions import col

# Convert single column using cast with string literal
df_converted = df.withColumn("zip_code", col("zip_code").cast("string"))

df_converted.printSchema()
# zip_code is now StringType

For better type safety and IDE support, import StringType explicitly:

from pyspark.sql.types import StringType

df_converted = df.withColumn(
    "zip_code", 
    col("zip_code").cast(StringType())
)

Both approaches produce identical results, but StringType() gives you IDE autocomplete and import checking: a misspelled class name is flagged immediately, while a misspelled string literal like "strng" only fails when Spark analyzes the plan at runtime.

You can also use SQL-style casting syntax via expr(), which some developers find more readable:

from pyspark.sql.functions import expr

df_converted = df.withColumn("customer_id", expr("CAST(customer_id AS STRING)"))

# Verify the conversion
df_converted.select("customer_id").show()

The cast() method automatically handles null values—they remain null in the string column rather than converting to the string “None” or “null”.

Using selectExpr() with SQL CAST

If you’re more comfortable with SQL syntax or migrating SQL-based transformations to PySpark, selectExpr() provides a familiar interface:

df_converted = df.selectExpr(
    "CAST(customer_id AS STRING) as customer_id",
    "CAST(zip_code AS STRING) as zip_code",
    "age"
)

df_converted.printSchema()

This approach is particularly useful when you’re already using other SQL expressions in your transformation pipeline. You can mix and match SQL expressions with column selections:

df_converted = df.selectExpr(
    "CAST(customer_id AS STRING) as customer_id_str",
    "CAST(zip_code AS STRING) as zip_code_str",
    "age",
    "age * 2 as age_doubled"
)

The main limitation of selectExpr() is that you must explicitly list all columns you want to keep. This becomes verbose with wide DataFrames, making withColumn() more practical for selective transformations.

Converting Multiple Columns at Once

When you need to convert several columns, chaining withColumn() calls is inefficient: each call adds another projection to the logical plan, and with many columns the plan-analysis overhead grows. Catalyst usually collapses the projections in the end, but you still pay a resolution cost per call:

# AVOID: This creates multiple intermediate DataFrames
df_bad = df.withColumn("customer_id", col("customer_id").cast("string")) \
           .withColumn("zip_code", col("zip_code").cast("string")) \
           .withColumn("age", col("age").cast("string"))

Instead, use select() with a list comprehension to perform all conversions in a single transformation:

from pyspark.sql.functions import col

# Columns to convert
cols_to_convert = ["customer_id", "zip_code"]

# Build the select expression
select_expr = [
    col(c).cast("string").alias(c) if c in cols_to_convert else col(c)
    for c in df.columns
]

df_converted = df.select(select_expr)
df_converted.printSchema()

This approach scales well to any number of columns and maintains the original column order. For more complex scenarios where you want to rename columns during conversion:

# Dictionary mapping old names to new names for converted columns
conversion_map = {
    "customer_id": "customer_id_str",
    "zip_code": "zip_code_str"
}

select_expr = []
for c in df.columns:
    if c in conversion_map:
        select_expr.append(col(c).cast("string").alias(conversion_map[c]))
    else:
        select_expr.append(col(c))

df_converted = df.select(select_expr)

You can also use a functional approach with reduce() if you prefer that style, though it’s less readable:

from functools import reduce

cols_to_convert = ["customer_id", "zip_code"]

df_converted = reduce(
    lambda df, col_name: df.withColumn(col_name, col(col_name).cast("string")),
    cols_to_convert,
    df
)

Handling Null Values and Edge Cases

PySpark’s cast() preserves null values during conversion—nulls remain null rather than becoming empty strings or the string “null”:

# Create DataFrame with nulls
data_with_nulls = [
    (1, 100),
    (2, None),
    (None, 300)
]

df_nulls = spark.createDataFrame(data_with_nulls, ["id", "value"])

df_converted = df_nulls.select(
    col("id").cast("string").alias("id"),
    col("value").cast("string").alias("value")
)

df_converted.show()
# Nulls are preserved, not converted to "None"

For formatting requirements like zero-padding ZIP codes or fixed-width identifiers, use lpad() or format_string():

from pyspark.sql.functions import lpad, format_string

# Pad ZIP codes to 5 digits with leading zeros
df_formatted = df.withColumn(
    "zip_code_padded",
    lpad(col("zip_code").cast("string"), 5, "0")
)

df_formatted.show()

For more complex formatting:

# Format with custom patterns
df_formatted = df.withColumn(
    "formatted_id",
    format_string("CUST-%05d", col("customer_id"))
)

df_formatted.show()

Note that the %d specifier in format_string() requires an integer input, so apply this formatting before casting (or instead of casting), not after.

Performance Considerations

Type conversions are generally fast operations in Spark, but they do require a full pass through your data. For large datasets, consider these optimization strategies:

  1. Perform conversions early: Convert types as soon as you read the data rather than carrying around incorrect types through multiple transformations.

  2. Use select() over multiple withColumn() calls: As shown earlier, this creates a more efficient execution plan.

  3. Consider caching after conversion: If you’ll use the converted DataFrame multiple times, cache it to avoid recomputing the conversion:

df_converted = df.select([col(c).cast("string") for c in df.columns])
df_converted.cache()

You can examine the execution plan to understand the overhead:

df_converted = df.withColumn("zip_code", col("zip_code").cast("string"))
df_converted.explain(True)

The physical plan will show a Project operation that includes the cast. Type conversions don’t trigger shuffles or wide transformations, so they’re relatively inexpensive.

Complete Working Example

Here’s a comprehensive example that demonstrates the concepts in a realistic scenario:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lpad
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("IntToStringExample").getOrCreate()

# Sample data: customer records with numeric fields
data = [
    (1001, 2101, 12345, 50000),
    (1002, 2102, 67890, 75000),
    (1003, 2103, None, 60000),
    (1004, 2104, 1234, 45000)
]

schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("account_number", IntegerType(), True),
    StructField("zip_code", IntegerType(), True),
    StructField("annual_income", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)

print("Original Schema:")
df.printSchema()

# Convert identifier columns to strings, keep numeric columns as integers
# Also format ZIP codes with leading zeros
string_columns = ["customer_id", "account_number"]

select_expr = []
for c in df.columns:
    if c in string_columns:
        select_expr.append(col(c).cast("string").alias(c))
    elif c == "zip_code":
        # ZIP codes need zero-padding
        select_expr.append(lpad(col(c).cast("string"), 5, "0").alias(c))
    else:
        select_expr.append(col(c))

df_final = df.select(select_expr)

print("\nConverted Schema:")
df_final.printSchema()

print("\nSample Data:")
df_final.show()

# Verify types
for field in df_final.schema.fields:
    print(f"{field.name}: {field.dataType}")

This example shows a production-ready pattern: converting identifiers to strings while preserving numeric types for actual numbers, with proper null handling and formatting for special cases like ZIP codes.

Converting integers to strings in PySpark is straightforward once you understand the available methods. Use cast() for most scenarios, leverage select() for batch conversions, and apply formatting functions when you need more than simple type conversion. Always verify your schema after transformations to ensure the conversions were applied correctly.
