PySpark - Drop Multiple Columns

Key Insights

  • PySpark offers multiple methods to drop columns: drop() with arguments, list unpacking, or inverse selection with select(), each suited for different scenarios
  • Dropping unnecessary columns early in your pipeline reduces memory footprint and improves query performance, especially with wide DataFrames containing hundreds of columns
  • Use select() with column filtering when you have more columns to drop than keep, and use drop() when removing a small subset of columns

Introduction

Working with large datasets in PySpark often means dealing with DataFrames that contain far more columns than you actually need. Whether you’re cleaning data, reducing memory consumption, removing personally identifiable information (PII), or simply streamlining your dataset for analysis, dropping columns is a fundamental operation you’ll perform regularly.

The performance implications are real. A DataFrame with 100 columns when you only need 10 wastes memory and slows down every subsequent transformation. Spark must track, serialize, and shuffle all columns during operations like joins and aggregations. By dropping unnecessary columns early in your data pipeline, you reduce the data volume that moves across your cluster, leading to faster execution times and lower resource consumption.

Let’s start with a sample DataFrame that represents a common scenario—customer data with various attributes, some of which we don’t need:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("DropColumns").getOrCreate()

# Sample data with many columns (ZIP codes are strings so leading zeros survive)
data = [
    (1, "John", "Doe", "john@email.com", "555-1234", "123 Main St", "New York", "NY", "10001", 45000.0, "2020-01-15", "temp_id_123", "internal_code_A"),
    (2, "Jane", "Smith", "jane@email.com", "555-5678", "456 Oak Ave", "Boston", "MA", "02101", 52000.0, "2019-06-20", "temp_id_456", "internal_code_B"),
    (3, "Bob", "Johnson", "bob@email.com", "555-9012", "789 Pine Rd", "Chicago", "IL", "60601", 48000.0, "2021-03-10", "temp_id_789", "internal_code_C")
]

columns = ["id", "first_name", "last_name", "email", "phone", "address", 
           "city", "state", "zip", "salary", "hire_date", "temp_session_id", "internal_reference"]

df = spark.createDataFrame(data, columns)
df.show()

This DataFrame contains temporary session IDs and internal reference codes that we don’t need for our analysis. Let’s explore the various methods to drop them.

Using drop() with Multiple Arguments

The most straightforward approach is using the drop() method with multiple column names as separate arguments. This method is clean, readable, and works perfectly when you know exactly which columns to remove.

# Drop multiple columns by passing them as separate arguments
df_cleaned = df.drop('temp_session_id', 'internal_reference')

print(f"Original columns: {len(df.columns)}")
print(f"After drop: {len(df_cleaned.columns)}")

df_cleaned.show()

This approach is ideal when you’re dropping a small, fixed number of columns that you can explicitly name. The code is self-documenting—anyone reading it immediately understands which columns are being removed.

You can also chain multiple drop() calls:

# Chained calls - each adds a projection to the logical plan
df_cleaned = df.drop('temp_session_id').drop('internal_reference')

Because transformations are lazy, each drop() merely adds a node to Spark's logical plan, and Catalyst typically collapses adjacent projections, so the runtime cost is usually negligible. Still, passing all names to a single drop() call keeps the intent in one place and reads better.

Using drop() with List Unpacking

When your columns to drop are stored in a list—perhaps read from a configuration file or determined dynamically—you need the unpacking operator (*). This is the most flexible approach for production code.

# Define columns to drop in a list
cols_to_drop = ['temp_session_id', 'internal_reference', 'phone']

# Use unpacking operator to pass list elements as arguments
df_cleaned = df.drop(*cols_to_drop)

df_cleaned.show()

This pattern shines in real-world scenarios where column removal logic is configurable:

# Configuration-driven approach
config = {
    'remove_pii': True,
    'remove_temp_columns': True,
    'remove_internal_fields': True
}

cols_to_drop = []

if config['remove_pii']:
    cols_to_drop.extend(['email', 'phone', 'address'])

if config['remove_temp_columns']:
    cols_to_drop.extend([col for col in df.columns if col.startswith('temp_')])

if config['remove_internal_fields']:
    cols_to_drop.extend([col for col in df.columns if col.startswith('internal_')])

# Remove duplicates and drop
cols_to_drop = list(set(cols_to_drop))
df_cleaned = df.drop(*cols_to_drop)

print(f"Dropped columns: {cols_to_drop}")

Using select() with Column Exclusion

Sometimes it’s easier to specify what you want to keep rather than what you want to remove. When you’re dropping more than half your columns, the inverse approach with select() is more concise and often more readable.

# Define columns to keep
cols_to_keep = ['id', 'first_name', 'last_name', 'city', 'state', 'salary', 'hire_date']

df_selected = df.select(cols_to_keep)
df_selected.show()

You can also use list comprehension to dynamically filter columns:

# Keep all columns except those in the exclusion list
cols_to_exclude = ['temp_session_id', 'internal_reference', 'phone', 'address', 'zip']

df_selected = df.select([col for col in df.columns if col not in cols_to_exclude])
df_selected.show()

This approach is particularly useful when combined with set operations:

# Keep only columns that match certain criteria
required_cols = {'id', 'first_name', 'last_name', 'email', 'salary'}
available_cols = set(df.columns)

# Keep only required columns that exist in the DataFrame
cols_to_select = list(required_cols.intersection(available_cols))

df_selected = df.select(cols_to_select)
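One caveat with the intersection approach: required columns that are absent from the DataFrame are silently left out of the selection. Since df.columns is just a Python list of strings, a plain set difference (no Spark needed) makes the gap visible — a sketch with hypothetical column names:

```python
# Column names are plain Python strings, so this check needs no Spark
required_cols = {'id', 'first_name', 'last_name', 'email', 'salary'}
available_cols = {'id', 'first_name', 'salary', 'city'}  # e.g. set(df.columns)

# Surface required columns that the DataFrame doesn't have
missing = required_cols - available_cols
if missing:
    print(f"Warning: required columns missing from DataFrame: {sorted(missing)}")

# Select only what both sides agree on (sorted for a stable order)
cols_to_select = sorted(required_cols & available_cols)
print(cols_to_select)
```

Sorting the result also sidesteps the nondeterministic iteration order of Python sets, so the selected column order is reproducible across runs.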

Dropping Columns by Pattern or Condition

Advanced use cases often require dropping columns based on patterns or data types rather than explicit names. This is common when dealing with DataFrames generated by automated processes that create temporary or metadata columns.

# Drop all columns starting with 'temp_'
temp_columns = [col for col in df.columns if col.startswith('temp_')]
df_no_temp = df.drop(*temp_columns)

# Drop all columns starting with 'internal_'
internal_columns = [col for col in df.columns if col.startswith('internal_')]
df_no_internal = df.drop(*internal_columns)

# Combine both patterns
pattern_columns = [col for col in df.columns 
                   if col.startswith('temp_') or col.startswith('internal_')]
df_cleaned = df.drop(*pattern_columns)

print(f"Dropped columns matching patterns: {pattern_columns}")
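If the patterns grow beyond simple prefixes, the standard library's fnmatch module can express shell-style wildcards over the column-name list. A sketch over plain strings (the sample names are illustrative), since df.columns is just a Python list:

```python
from fnmatch import fnmatch

# df.columns is a plain Python list of strings; sample names shown here
columns = ["id", "first_name", "temp_session_id", "internal_reference", "salary"]
patterns = ["temp_*", "internal_*"]

# Keep a name if it matches any wildcard pattern
pattern_columns = [c for c in columns if any(fnmatch(c, p) for p in patterns)]
print(pattern_columns)  # ['temp_session_id', 'internal_reference']
# then: df.drop(*pattern_columns)
```

This stays readable as the pattern list grows, and the patterns themselves can live in configuration alongside the rest of your pipeline settings.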

You can also drop columns based on data type:

from pyspark.sql.types import StringType, IntegerType, DoubleType

# Drop all string columns (be careful with this!)
string_columns = [field.name for field in df.schema.fields 
                  if isinstance(field.dataType, StringType)]

print(f"String columns: {string_columns}")

# More practical: drop all columns except specific types
numeric_columns = [field.name for field in df.schema.fields 
                   if isinstance(field.dataType, (IntegerType, DoubleType))]

df_numeric_only = df.select(numeric_columns)
df_numeric_only.show()

Using regular expressions for more complex patterns:

import re

# Drop columns matching a regex pattern
pattern = re.compile(r'^(temp_|internal_)')
regex_columns = [col for col in df.columns if pattern.match(col)]

df_cleaned = df.drop(*regex_columns)
print(f"Dropped columns matching regex: {regex_columns}")

Performance Considerations and Best Practices

Different methods have different performance characteristics, though for most use cases the differences are negligible. The key is choosing the right tool for your specific scenario.

Method Selection Guidelines:

  1. Use drop() with explicit arguments when removing 1-5 known columns. Most readable for small, fixed sets.

  2. Use drop() with list unpacking when columns are determined dynamically or come from configuration. Best for maintainable, flexible code.

  3. Use select() with filtering when keeping fewer columns than dropping, or when you need to ensure specific columns exist.

  4. Use pattern-based filtering for automated cleanup of generated columns or metadata fields.

Common Pitfalls:

# AVOID: Dropping columns in a loop
for col in cols_to_drop:
    df = df.drop(col)  # Creates multiple DataFrame references

# PREFER: Single drop operation
df = df.drop(*cols_to_drop)

# NOTE: drop() silently ignores column names that don't exist,
# so this does NOT raise - but it can hide typos in column names
df_cleaned = df.drop('nonexistent_column')

# PREFER: check existence explicitly so typos surface
cols_to_drop = ['temp_session_id', 'nonexistent_column']
existing_cols = [col for col in cols_to_drop if col in df.columns]
missing = set(cols_to_drop) - set(existing_cols)
if missing:
    print(f"Warning: columns not found in DataFrame: {sorted(missing)}")
df_cleaned = df.drop(*existing_cols)
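The existence check is worth wrapping in a small reusable helper. drop_existing below is a hypothetical convenience function (not part of the PySpark API); it relies only on the DataFrame exposing .columns and .drop(*names), so the filtering logic itself is plain Python:

```python
def drop_existing(df, cols_to_drop):
    """Drop only those columns that actually exist on df.

    Works with any object exposing `.columns` and `.drop(*names)`,
    such as a PySpark DataFrame. Returns df unchanged when nothing matches.
    """
    existing = [c for c in cols_to_drop if c in df.columns]
    return df.drop(*existing) if existing else df

# Usage with a PySpark DataFrame:
# df_cleaned = drop_existing(df, ['temp_session_id', 'nonexistent_column'])
```

Returning df unchanged when nothing matches avoids adding a pointless projection to the plan and makes the helper safe to call unconditionally in configuration-driven pipelines.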

Memory and Performance:

Drop columns as early as possible in your pipeline. If you're reading a Parquet file with 200 columns but only need 20, select those 20 immediately:

# Explicit: prune at read time
df = spark.read.parquet("large_file.parquet").select(needed_columns)

# Also works for Parquet: Catalyst usually pushes the pruning into the scan
df = spark.read.parquet("large_file.parquet").drop(*unwanted_columns)

Either way, stating the projection up front keeps later stages from dragging unused columns through caches, shuffles, and UDFs.

Conclusion

PySpark provides flexible methods for dropping columns, each suited to different scenarios. Here’s a quick reference:

Method                 Best For                      Example
drop('col1', 'col2')   Small, fixed set of columns   df.drop('temp_id', 'internal_ref')
drop(*list)            Dynamic column lists          df.drop(*cols_to_drop)
select() filtering     Keeping fewer than dropping   df.select([c for c in df.columns if c not in exclude])
Pattern-based          Automated cleanup             df.drop(*[c for c in df.columns if c.startswith('temp_')])

Remember: drop columns early, drop them once, and choose the method that makes your intent clearest to future readers of your code. The best approach is the one that makes your data pipeline both performant and maintainable.
