PySpark - Rename Column Name in DataFrame

Key Insights

  • PySpark offers several methods for renaming columns, each suited to a different scenario: withColumnRenamed() for one or a few specific columns, toDF() for renaming every column at once, select() with alias() for renaming during projections, selectExpr() for SQL-style renames, and dictionary- or pattern-driven renaming for dynamic schemas.
  • Column renaming in PySpark always returns a new DataFrame due to immutability—the original DataFrame remains unchanged, which is critical for understanding transformation chains and debugging.
  • For production environments handling dynamic schemas, use dictionary-based renaming with reduce() or list comprehensions to programmatically rename multiple columns while maintaining code readability and performance.

Introduction

PySpark DataFrames are the backbone of distributed data processing, but real-world datasets rarely arrive with clean, consistent column names. You’ll encounter spaces, special characters, inconsistent casing, and cryptic abbreviations that make your code harder to read and maintain. Column renaming becomes essential when standardizing data from multiple sources, preparing datasets for machine learning pipelines, or resolving naming conflicts before joins.

Consider a typical scenario: you’ve ingested data from a legacy system where columns are named with spaces and mixed case, and you need to join it with another dataset. Without proper renaming, you’re stuck with unwieldy column references and potential errors.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ColumnRenaming").getOrCreate()

# Sample DataFrame with problematic column names
data = [
    (1, "John Doe", 75000, "Engineering"),
    (2, "Jane Smith", 82000, "Marketing"),
    (3, "Bob Johnson", 68000, "Sales")
]

df = spark.createDataFrame(data, ["Employee ID", "Full Name", "Annual Salary", "Dept"])
df.show()

# Output shows columns with spaces - problematic for many operations
# +-----------+-----------+-------------+-----------+
# |Employee ID|  Full Name|Annual Salary|       Dept|
# +-----------+-----------+-------------+-----------+

Let’s explore the most effective methods to handle this common challenge.

Using withColumnRenamed() Method

The withColumnRenamed() method is the most intuitive approach for renaming columns. It takes two arguments: the existing column name and the new name. Since PySpark DataFrames are immutable, this method returns a new DataFrame with the renamed column.

# Rename a single column
df_renamed = df.withColumnRenamed("Employee ID", "employee_id")
df_renamed.printSchema()

# Output:
# root
#  |-- employee_id: long (nullable = true)
#  |-- Full Name: string (nullable = true)
#  |-- Annual Salary: long (nullable = true)
#  |-- Dept: string (nullable = true)

For multiple columns, chain withColumnRenamed() calls. While this might seem verbose, it’s explicit and easy to debug:

df_clean = (df
    .withColumnRenamed("Employee ID", "employee_id")
    .withColumnRenamed("Full Name", "full_name")
    .withColumnRenamed("Annual Salary", "annual_salary")
    .withColumnRenamed("Dept", "department")
)

df_clean.show()

# +-----------+-----------+-------------+-----------+
# |employee_id|  full_name|annual_salary| department|
# +-----------+-----------+-------------+-----------+

This method is ideal when you need to rename a few specific columns and want code that’s self-documenting. The main drawback is verbosity when dealing with many columns.

Using alias() with select()

The alias() method shines when you’re already selecting specific columns. It allows you to rename columns inline during projection operations, making your code more concise.

# Rename one column during selection
df_selected = df.select(
    col("Employee ID").alias("employee_id"),
    col("Full Name"),
    col("Annual Salary")
)

df_selected.show()

For renaming multiple columns, combine alias() with select() in a single statement:

df_aliased = df.select(
    col("Employee ID").alias("employee_id"),
    col("Full Name").alias("full_name"),
    col("Annual Salary").alias("annual_salary"),
    col("Dept").alias("department")
)

df_aliased.show()

You can also mix renamed and non-renamed columns, which is useful when only some columns need standardization:

# Keep some columns as-is, rename others
df_mixed = df.select(
    col("Employee ID").alias("id"),
    "Full Name",  # Keep original name
    col("Annual Salary").alias("salary"),
    "Dept"  # Keep original name
)

Use this approach when column selection and renaming happen together: one select() handles both in a single, readable projection instead of two separate steps.

Using toDF() Method

When you need to rename all columns in one operation, toDF() provides the cleanest syntax. Pass a list of new column names in the same order as the existing columns:

# Rename all columns at once
new_column_names = ["employee_id", "full_name", "annual_salary", "department"]
df_all_renamed = df.toDF(*new_column_names)

df_all_renamed.show()

# +-----------+-----------+-------------+-----------+
# |employee_id|  full_name|annual_salary| department|
# +-----------+-----------+-------------+-----------+

This method is particularly useful when you have a predetermined naming convention for all columns. However, be cautious: the number of new names must exactly match the number of existing columns, or you’ll get a runtime error. This makes toDF() less flexible for partial renaming scenarios.

# This will fail if counts don't match
try:
    df.toDF("id", "name")  # Only 2 names for 4 columns
except Exception as e:
    print(f"Error: {e}")

Using selectExpr() for Complex Renaming

The selectExpr() method accepts SQL expressions as strings, enabling SQL-style renaming with the AS keyword. This approach is powerful when combining renaming with transformations:

# Rename using SQL syntax
df_sql = df.selectExpr(
    "`Employee ID` AS employee_id",
    "`Full Name` AS full_name",
    "`Annual Salary` AS annual_salary",
    "Dept AS department"
)

df_sql.show()

Note the backticks around column names containing spaces: in Spark SQL, backticks escape identifiers with special characters.

Where selectExpr() truly excels is combining renaming with transformations:

# Rename while transforming
df_transformed = df.selectExpr(
    "`Employee ID` AS employee_id",
    "UPPER(`Full Name`) AS full_name_upper",
    # ROUND avoids floating-point noise like 82500.00000000001
    "ROUND(`Annual Salary` * 1.1, 2) AS projected_salary",
    "LOWER(Dept) AS department"
)

df_transformed.show()

# +-----------+---------------+----------------+-----------+
# |employee_id|full_name_upper|projected_salary| department|
# +-----------+---------------+----------------+-----------+
# |          1|       JOHN DOE|         82500.0|engineering|
# |          2|     JANE SMITH|         90200.0|  marketing|
# |          3|    BOB JOHNSON|         74800.0|      sales|
# +-----------+---------------+----------------+-----------+

This method is ideal when you’re already writing SQL-like transformations and want to keep everything in one consistent style.

Renaming Multiple Columns Dynamically

Production code often requires dynamic column renaming based on patterns or mappings. Dictionary-based approaches offer the most flexibility:

from functools import reduce

# Define mapping of old names to new names
column_mapping = {
    "Employee ID": "employee_id",
    "Full Name": "full_name",
    "Annual Salary": "annual_salary",
    "Dept": "department"
}

# Apply renaming using reduce; 'acc' is the accumulating DataFrame
# (avoids shadowing the outer df)
df_mapped = reduce(
    lambda acc, col_name: acc.withColumnRenamed(col_name, column_mapping[col_name]),
    column_mapping.keys(),
    df
)

df_mapped.show()

For pattern-based renaming, use list comprehensions with select():

# Convert all columns to lowercase and replace spaces with underscores
df_pattern = df.select([
    col(c).alias(c.lower().replace(" ", "_")) 
    for c in df.columns
])

df_pattern.show()

# +-----------+-----------+-------------+-----------+
# |employee_id|  full_name|annual_salary|       dept|
# +-----------+-----------+-------------+-----------+

You can also implement conditional renaming logic:

# Rename only columns containing spaces
df_conditional = df.select([
    col(c).alias(c.replace(" ", "_")) if " " in c else col(c)
    for c in df.columns
])

df_conditional.show()

This programmatic approach scales well and keeps your code DRY (Don’t Repeat Yourself) when dealing with dozens or hundreds of columns.

Best Practices and Performance Considerations

Choose your renaming method based on context. Use withColumnRenamed() for clarity when renaming one or two columns. Opt for toDF() when renaming all columns with a predefined list. Prefer select() with alias() when you’re already projecting columns. Reserve selectExpr() for cases where you’re mixing renaming with SQL transformations.

Handle special characters carefully. Column names with spaces, dots, or other special characters require backticks in SQL expressions or the col() function in Python expressions:

# Problematic: unescaped names with spaces break SQL expressions
# df.selectExpr("Employee ID AS id")  # Fails: parsed as two identifiers

# Proper handling
df.select(col("Employee ID"))         # Works: col() takes the name literally
df.selectExpr("`Employee ID` AS id")  # Works: backticks escape the name

# Better: rename to remove special characters
df_safe = df.select([
    col(c).alias(c.replace(" ", "_").replace(".", "_").lower())
    for c in df.columns
])

Performance-wise, all of these renaming operations are metadata-only changes in Spark’s logical plan; they never trigger data movement or a shuffle. However, chaining hundreds of withColumnRenamed() calls bloats the logical plan and slows query analysis on the driver. When renaming many columns, select() with a list comprehension or toDF() produces a flatter, cleaner plan.

Always validate your renamed DataFrame schema, especially in dynamic renaming scenarios:

# Verify the renaming worked as expected
print("Original columns:", df.columns)
print("Renamed columns:", df_clean.columns)

# Check for duplicates (which cause errors)
if len(df_clean.columns) != len(set(df_clean.columns)):
    raise ValueError("Duplicate column names detected after renaming")

Column renaming is a fundamental skill in PySpark data engineering. Master these techniques, and you’ll handle any schema transformation challenge with confidence.
