How to Rename Columns in PySpark
Key Insights
- withColumnRenamed() is the most readable option for renaming a few columns, but chaining many calls becomes unwieldy—use toDF() or select() with alias() for bulk operations instead.
- All column renaming methods in PySpark are lazy transformations that don’t affect performance differently; choose based on code readability and maintainability.
- Programmatic renaming patterns (adding prefixes, cleaning special characters) are essential for production pipelines where schema consistency matters across multiple data sources.
Introduction
Column renaming in PySpark seems trivial until you’re knee-deep in a data pipeline with inconsistent schemas, spaces in column names, or the need to align datasets from different sources. Whether you’re cleaning messy CSV imports, preparing data for joins, or making column names more readable for downstream consumers, knowing the right renaming technique matters.
PySpark offers several approaches to rename columns, each with its own strengths. Some are better for quick, one-off renames. Others shine when you need to transform an entire schema programmatically. This article covers all the practical methods you’ll encounter, with code examples you can use immediately.
Let’s start with sample data we’ll use throughout:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("column_rename").getOrCreate()
data = [
(1, "Alice", 50000, "Engineering"),
(2, "Bob", 60000, "Marketing"),
(3, "Carol", 55000, "Engineering")
]
df = spark.createDataFrame(data, ["emp_id", "emp_name", "salary", "dept"])
df.show()
+------+--------+------+-----------+
|emp_id|emp_name|salary| dept|
+------+--------+------+-----------+
| 1| Alice| 50000|Engineering|
| 2| Bob| 60000| Marketing|
| 3| Carol| 55000|Engineering|
+------+--------+------+-----------+
Using withColumnRenamed()
The withColumnRenamed() method is the most intuitive approach. It takes two arguments: the existing column name and the new name. The syntax reads almost like plain English.
# Rename a single column
df_renamed = df.withColumnRenamed("emp_name", "employee_name")
df_renamed.printSchema()
root
|-- emp_id: long (nullable = true)
|-- employee_name: string (nullable = true)
|-- salary: long (nullable = true)
|-- dept: string (nullable = true)
For multiple columns, chain the method calls:
# Chain multiple renames
df_renamed = (df
.withColumnRenamed("emp_id", "employee_id")
.withColumnRenamed("emp_name", "employee_name")
.withColumnRenamed("dept", "department")
)
df_renamed.show()
+-----------+-------------+------+-----------+
|employee_id|employee_name|salary| department|
+-----------+-------------+------+-----------+
| 1| Alice| 50000|Engineering|
| 2| Bob| 60000| Marketing|
| 3| Carol| 55000|Engineering|
+-----------+-------------+------+-----------+
This approach works well for renaming one to three columns. Beyond that, the chaining becomes verbose and harder to maintain. If the original column name doesn’t exist, PySpark silently returns the DataFrame unchanged—no error, no warning. This behavior can mask bugs, so verify your column names exist before renaming.
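Because a typo in the old name fails silently, a small guard is worth the extra line. The helper below is a hypothetical wrapper, not part of the PySpark API:

```python
def rename_or_fail(df, old_name, new_name):
    """Rename a column, raising instead of silently doing nothing."""
    if old_name not in df.columns:
        raise ValueError(
            f"Column '{old_name}' not found; available columns: {df.columns}"
        )
    return df.withColumnRenamed(old_name, new_name)
```

Now `rename_or_fail(df, "emp_nam", "employee_name")` surfaces the typo immediately instead of passing an unchanged DataFrame downstream.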
Using select() with alias()
The select() method combined with alias() gives you more control. You explicitly select which columns to include and rename them in the same operation. This approach is particularly useful when you want to rename columns while also reordering or filtering them.
# Using alias() with col()
df_renamed = df.select(
col("emp_id").alias("employee_id"),
col("emp_name").alias("employee_name"),
col("salary"),
col("dept").alias("department")
)
df_renamed.show()
You can also import the functions module under an alias, the convention you’ll see in most PySpark codebases:
# Alternative syntax using F.col
from pyspark.sql import functions as F
df_renamed = df.select(
F.col("emp_id").alias("id"),
F.col("emp_name").alias("name"),
F.col("salary").alias("annual_salary"),
F.col("dept").alias("department")
)
df_renamed.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- annual_salary: long (nullable = true)
|-- department: string (nullable = true)
The select() approach forces you to be explicit about every column in the output. This is both a feature and a limitation—you won’t accidentally carry forward unwanted columns, but you must list everything you want to keep.
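If only a few columns change name, you can keep select()’s explicitness without typing every alias by computing the target names from df.columns. This helper is a sketch, and the renames mapping is an invented example:

```python
def target_names(columns, renames):
    """Return new names in the original column order; unmapped columns keep their names."""
    return [renames.get(c, c) for c in columns]

# Applied with select()/alias(), assuming df and col() from the earlier examples:
#   renames = {"emp_id": "employee_id", "dept": "department"}
#   df_renamed = df.select([col(c).alias(n) for c, n in
#                           zip(df.columns, target_names(df.columns, renames))])
```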
Using toDF() for Bulk Renaming
When you need to rename all columns at once, toDF() is the cleanest solution. Pass a list of new column names in the same order as the existing columns.
# Rename all columns at once
new_column_names = ["id", "name", "compensation", "department"]
df_renamed = df.toDF(*new_column_names)
df_renamed.show()
+---+-----+------------+-----------+
| id| name|compensation| department|
+---+-----+------------+-----------+
| 1|Alice| 50000|Engineering|
| 2| Bob| 60000| Marketing|
| 3|Carol| 55000|Engineering|
+---+-----+------------+-----------+
This method is ideal when you’re replacing an entire schema, such as when reading files without headers or when standardizing column names from external sources. The critical requirement: the number of new names must exactly match the number of existing columns, and the order matters.
# Practical example: standardizing column names from a CSV
raw_df = spark.read.csv("data.csv", header=True)
standardized_names = ["user_id", "first_name", "last_name", "email", "created_at"]
clean_df = raw_df.toDF(*standardized_names)
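A length mismatch makes toDF() raise, but the error can be cryptic when the schema comes from an external source. A check of your own (a sketch, reusing the standardized_names list above) makes the failure obvious:

```python
def checked_names(columns, new_names):
    """Validate that a replacement name list matches the column count before toDF()."""
    if len(new_names) != len(columns):
        raise ValueError(
            f"Got {len(new_names)} names for {len(columns)} columns: {columns}"
        )
    return new_names

# clean_df = raw_df.toDF(*checked_names(raw_df.columns, standardized_names))
```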
Renaming with selectExpr() and SQL Expressions
If you’re more comfortable with SQL syntax, selectExpr() lets you use the familiar AS keyword for aliasing:
# SQL-style renaming
df_renamed = df.selectExpr(
"emp_id AS employee_id",
"emp_name AS employee_name",
"salary",
"dept AS department"
)
df_renamed.show()
+-----------+-------------+------+-----------+
|employee_id|employee_name|salary| department|
+-----------+-------------+------+-----------+
| 1| Alice| 50000|Engineering|
| 2| Bob| 60000| Marketing|
| 3| Carol| 55000|Engineering|
+-----------+-------------+------+-----------+
This approach is particularly useful when you’re also performing transformations:
# Combine renaming with transformations
df_transformed = df.selectExpr(
"emp_id AS id",
"UPPER(emp_name) AS name_upper",
"salary * 1.1 AS salary_with_raise",
"dept AS department"
)
df_transformed.show()
+---+----------+-----------------+-----------+
| id|name_upper|salary_with_raise| department|
+---+----------+-----------------+-----------+
| 1| ALICE| 55000.0|Engineering|
| 2| BOB| 66000.0| Marketing|
| 3| CAROL| 60500.0|Engineering|
+---+----------+-----------------+-----------+
Programmatic Renaming (Dynamic Patterns)
Production pipelines often require dynamic column renaming—adding prefixes to distinguish sources after joins, removing special characters, or converting to snake_case. Here’s how to handle these scenarios:
# Add prefix to all columns
prefix = "emp_"
df_prefixed = df.toDF(*[f"{prefix}{c}" for c in df.columns])
df_prefixed.printSchema()
root
|-- emp_emp_id: long (nullable = true)
|-- emp_emp_name: string (nullable = true)
|-- emp_salary: long (nullable = true)
|-- emp_dept: string (nullable = true)
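A common refinement is to prefix everything except the join keys, so the keys still line up after the join. The helper below is a sketch with a hypothetical exclude list:

```python
def prefixed_names(columns, prefix, exclude=()):
    """Prefix every column name except those listed in exclude (e.g. join keys)."""
    return [c if c in exclude else f"{prefix}{c}" for c in columns]

# df_prefixed = df.toDF(*prefixed_names(df.columns, "emp_", exclude=["emp_id"]))
```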
For more complex transformations, iterate through columns with reduce():
from functools import reduce
import re
# Clean column names: lowercase and replace spaces/special chars with underscores
messy_df = spark.createDataFrame(
[(1, "test", 100)],
["Employee ID", "Full Name!", "Salary ($)"]
)
def clean_column_name(name):
# Lowercase, replace non-alphanumeric with underscore, remove consecutive underscores
cleaned = re.sub(r'[^a-zA-Z0-9]', '_', name.lower())
cleaned = re.sub(r'_+', '_', cleaned)
return cleaned.strip('_')
clean_df = reduce(
lambda df, col_name: df.withColumnRenamed(col_name, clean_column_name(col_name)),
messy_df.columns,
messy_df
)
clean_df.printSchema()
root
|-- employee_id: long (nullable = true)
|-- full_name: string (nullable = true)
|-- salary: long (nullable = true)
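The introduction mentioned snake_case conversion; for camelCase sources (common in JSON APIs) a small regex handles it. This is a sketch, and the example names are invented:

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital following a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

# df_snake = df.toDF(*[camel_to_snake(c) for c in df.columns])
```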
For adding suffixes to specific columns (useful when joining tables with overlapping names):
# Add suffix to specific columns before a join
columns_to_suffix = ["salary", "dept"]
suffix = "_left"
for col_name in columns_to_suffix:
if col_name in df.columns:
df = df.withColumnRenamed(col_name, f"{col_name}{suffix}")
Best Practices and Performance Considerations
All column renaming operations in PySpark are lazy transformations. They modify the logical plan but don’t execute until an action triggers computation. This means there’s no performance difference between methods—choose based on readability and maintainability.
When to use each method:
- withColumnRenamed(): Best for renaming 1-3 columns when readability matters most.
- select() with alias(): Use when you’re also filtering columns or want explicit control over the output schema.
- toDF(): Ideal for bulk renaming when you have a complete list of new names.
- selectExpr(): Choose when combining renames with SQL transformations.
- Programmatic patterns: Essential for dynamic schemas and production pipelines.
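One more option worth knowing: PySpark 3.4 added withColumnsRenamed() (note the plural), which accepts a dict of old-to-new names in a single call. The fallback helper below is a sketch for code that must also run on older versions:

```python
def bulk_rename(df, mapping):
    """Rename several columns at once, chaining calls on pre-3.4 versions."""
    if hasattr(df, "withColumnsRenamed"):  # PySpark 3.4+
        return df.withColumnsRenamed(mapping)
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df
```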
Handling nested struct columns requires a different approach. You can’t rename nested fields directly with withColumnRenamed(). Instead, reconstruct the struct:
from pyspark.sql.functions import struct
# For nested structs, rebuild the structure
df_nested = df.select(struct("emp_id", "emp_name").alias("employee"), "salary")
# To rename fields within the struct, you need to recreate it
df_renamed = df_nested.select(
struct(
col("employee.emp_id").alias("id"),
col("employee.emp_name").alias("name")
).alias("employee"),
"salary"
)
Column renaming is a fundamental operation you’ll use constantly in PySpark. Master these techniques, and you’ll handle any schema transformation your pipeline throws at you.