How to Delete a Column in PySpark

Key Insights

  • The drop() method is PySpark’s primary tool for column deletion, accepting single columns, multiple columns, or unpacked lists—and it silently ignores non-existent columns by default
  • When you need to remove many columns but keep a few, use select() as the inverse approach—it’s often cleaner than listing dozens of columns to drop
  • PySpark DataFrames are immutable, so every drop() call returns a new DataFrame; always assign the result to a variable or chain operations together

Introduction

Column deletion is one of those operations you’ll perform constantly in PySpark. Whether you’re cleaning up raw data, removing sensitive fields before export, trimming unnecessary columns to reduce shuffle overhead, or just tidying up intermediate DataFrames in a complex pipeline, knowing how to efficiently drop columns matters.

PySpark provides several approaches for column removal, each suited to different scenarios. The method you choose depends on whether you’re dropping one column or many, whether you know the column names upfront or need to match patterns, and whether you’re building a defensive pipeline that handles schema variations gracefully.

Let’s work through each approach with practical examples you can apply immediately.

Using drop() for Single Column Deletion

The drop() method is your go-to for removing columns. The syntax is straightforward:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColumnDeletion").getOrCreate()

# Sample DataFrame
data = [
    (1, "Alice", "alice@example.com", 50000),
    (2, "Bob", "bob@example.com", 60000),
    (3, "Carol", "carol@example.com", 55000)
]
df = spark.createDataFrame(data, ["id", "name", "email", "salary"])

# Drop a single column
df_without_email = df.drop("email")
df_without_email.show()

Output:

+---+-----+------+
| id| name|salary|
+---+-----+------+
|  1|Alice| 50000|
|  2|  Bob| 60000|
|  3|Carol| 55000|
+---+-----+------+

You can also pass a Column object instead of a string:

from pyspark.sql.functions import col

df_without_salary = df.drop(col("salary"))

Both approaches work identically. Use strings for simplicity; use Column objects when you’re already working with column expressions in your code.

One critical point: drop() returns a new DataFrame. The original df remains unchanged. This immutability is fundamental to Spark’s design, but it trips up developers coming from pandas where in-place operations are common.

Dropping Multiple Columns at Once

When you need to remove several columns, drop() accepts multiple arguments:

# Drop multiple columns by listing them
df_minimal = df.drop("email", "salary")
df_minimal.show()

Output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
+---+-----+

If your columns to drop are stored in a list—common when column names come from configuration or are computed dynamically—use the unpacking operator:

columns_to_remove = ["email", "salary"]
df_minimal = df.drop(*columns_to_remove)

The asterisk unpacks the list into separate arguments. Without it, you’d be passing the list itself as a single argument, which raises a TypeError on current Spark versions.

This pattern is particularly useful in ETL pipelines where you might read column names from a config file:

import json

# Imagine this comes from a config file
config = '{"drop_columns": ["email", "salary", "internal_id"]}'
drop_list = json.loads(config)["drop_columns"]

# Filter to only columns that actually exist (more on this later)
existing_drops = [c for c in drop_list if c in df.columns]
df_cleaned = df.drop(*existing_drops)

Using select() as an Alternative Approach

Sometimes it’s easier to specify what you want to keep rather than what you want to remove. The select() method handles this elegantly:

# Keep only id and name
df_selected = df.select("id", "name")

When the columns to drop outnumber the columns to keep, select() produces cleaner code. But the real power comes from combining select() with list comprehensions:

columns_to_drop = {"email", "salary"}  # Using a set for O(1) lookup
df_filtered = df.select([c for c in df.columns if c not in columns_to_drop])

This approach shines when you’re working with wide DataFrames. Imagine a DataFrame with 200 columns where you need to drop 195 of them. Listing 5 columns to keep beats listing 195 to drop.

You can also use select() with column expressions for more complex scenarios:

from pyspark.sql.functions import col

# Select all columns except those in the drop list, and rename 'id' to 'user_id'
keep_cols = [
    col("id").alias("user_id") if c == "id" else col(c)
    for c in df.columns 
    if c not in columns_to_drop
]
df_transformed = df.select(keep_cols)

Handling Non-Existent Columns

Here’s a behavior that surprises many developers: drop() silently ignores columns that don’t exist. No error, no warning:

# This works fine even though 'nonexistent' isn't a column
df_result = df.drop("nonexistent", "email")
print(df_result.columns)  # ['id', 'name', 'salary']

This is actually useful for building robust pipelines. Your code won’t break if upstream schema changes remove a column you were dropping anyway.

However, sometimes you want to know if a column exists before operating on it. Use the columns property for explicit checks:

def safe_drop_with_logging(df, columns_to_drop):
    """Drop columns with logging for missing ones."""
    existing = set(df.columns)
    
    for col_name in columns_to_drop:
        if col_name not in existing:
            print(f"Warning: Column '{col_name}' not found in DataFrame")
    
    return df.drop(*columns_to_drop)

# Usage
df_result = safe_drop_with_logging(df, ["email", "nonexistent", "also_missing"])

For strict validation where missing columns should cause failures:

def strict_drop(df, columns_to_drop):
    """Drop columns, raising an error if any don't exist."""
    existing = set(df.columns)
    missing = [c for c in columns_to_drop if c not in existing]
    
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    
    return df.drop(*columns_to_drop)

Dropping Columns by Pattern or Condition

Real-world data often has columns that follow naming conventions. Temporary columns might end with _tmp, metadata columns might start with meta_, or you might have versioned columns like value_v1, value_v2. Pattern-based dropping handles these cases cleanly.

Using regex to match column names:

import re

# Sample DataFrame with temporary columns
data = [(1, "Alice", 100, 200, 300)]
df_with_temps = spark.createDataFrame(
    data, 
    ["id", "name", "score_tmp", "calc_tmp", "final_score"]
)

# Drop all columns ending with '_tmp'
pattern = re.compile(r".*_tmp$")
cols_to_drop = [c for c in df_with_temps.columns if pattern.match(c)]
df_clean = df_with_temps.drop(*cols_to_drop)

print(df_clean.columns)  # ['id', 'name', 'final_score']

You can also drop columns based on data characteristics, though this requires inspecting the data:

from pyspark.sql.functions import col, isnan

def get_empty_columns(df, threshold=1.0):
    """Find columns where the null ratio meets or exceeds threshold."""
    total_rows = df.count()
    if total_rows == 0:
        return list(df.columns)

    empty_cols = []
    for col_name, dtype in df.dtypes:
        condition = col(col_name).isNull()
        # isnan() only applies to float/double columns; it errors on other types
        if dtype in ("float", "double"):
            condition = condition | isnan(col_name)

        null_count = df.filter(condition).count()
        if null_count / total_rows >= threshold:
            empty_cols.append(col_name)

    return empty_cols

# Drop columns that are 100% null
empty_columns = get_empty_columns(df, threshold=1.0)
df_no_empty = df.drop(*empty_columns)

For columns matching multiple patterns:

patterns_to_drop = [
    r"^temp_",      # Starts with 'temp_'
    r"_backup$",    # Ends with '_backup'
    r"^__",         # Starts with double underscore
]

def matches_any_pattern(col_name, patterns):
    return any(re.match(p, col_name) for p in patterns)

cols_to_drop = [
    c for c in df.columns 
    if matches_any_pattern(c, patterns_to_drop)
]
df_clean = df.drop(*cols_to_drop)

Summary and Best Practices

Here’s a quick reference for choosing the right approach:

Scenario                      Method                        Example
Drop one column               drop()                        df.drop("col")
Drop multiple known columns   drop() with multiple args     df.drop("a", "b", "c")
Drop columns from a list      drop() with unpacking         df.drop(*column_list)
Keep specific columns         select()                      df.select("a", "b")
Drop by pattern               list comprehension + drop()   df.drop(*[c for c in df.columns if c.endswith("_tmp")])
Drop with validation          custom function               check df.columns first

A few final recommendations:

Chain operations when possible. Instead of creating intermediate variables, chain transformations:

df_final = (df
    .drop("temp_col", "debug_col")
    .filter(col("status") == "active")
    .select("id", "name", "value"))

Be explicit in production code. While drop() silently ignores missing columns, consider adding validation in production pipelines where schema drift could indicate upstream problems.

Consider performance with wide DataFrames. Dropping columns is a metadata operation—it doesn’t trigger computation. But selecting fewer columns early in your pipeline reduces data shuffled in subsequent operations.

Document your drops. When removing columns for compliance reasons (PII, sensitive data), add comments explaining why. Future maintainers will thank you.

Column deletion in PySpark is simple in isolation but becomes nuanced in production pipelines. Master these patterns, and you’ll handle schema manipulation confidently regardless of the complexity.
