Pandas - Drop Column from DataFrame
Key Insights
• Pandas offers multiple ways to drop columns: drop(), pop(), direct deletion with del, and column selection. Each suits different use cases and performance requirements
• Passing inplace=True to drop() modifies the original DataFrame and returns None, while the default leaves the original untouched and returns a new DataFrame, giving you control over mutability
• Dropping multiple columns simultaneously is more efficient than iterative deletion, and understanding label-based versus position-based removal prevents common errors
Using drop() Method
The drop() method is the most versatile approach for removing columns from a DataFrame. It accepts column labels and provides fine-grained control over the operation.
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com'],
'age': [25, 30, 35, 28],
'city': ['NYC', 'LA', 'Chicago', 'Houston']
})
# Drop a single column - returns new DataFrame
df_new = df.drop('email', axis=1)
print(df_new.columns)
# Output: Index(['user_id', 'name', 'age', 'city'], dtype='object')
# Drop column in place - modifies original DataFrame
df.drop('city', axis=1, inplace=True)
print(df.columns)
# Output: Index(['user_id', 'name', 'email', 'age'], dtype='object')
The axis=1 parameter specifies column-wise operation. Alternatively, use axis='columns' for readability. Without inplace=True, the original DataFrame remains unchanged.
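As a quick check of the points above, the following sketch uses a made-up two-column DataFrame (not the one from the example) to show that axis=1 and axis='columns' give the same result, and that a call with inplace=True returns None rather than a DataFrame.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'email': ['alice@example.com', 'bob@example.com'],
})

# axis=1 and axis='columns' are two spellings of the same operation
by_number = df.drop('email', axis=1)
by_label = df.drop('email', axis='columns')
print(by_number.equals(by_label))  # True

# Without inplace=True the original is untouched
print(list(df.columns))  # ['name', 'email']

# With inplace=True the call returns None, so do not assign its result
result = df.drop('email', axis=1, inplace=True)
print(result)            # None
print(list(df.columns))  # ['name']
```

Writing df = df.drop(..., inplace=True) is a common slip: it replaces df with None.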
Dropping Multiple Columns
When removing multiple columns, pass a list of column names to drop(). A single call builds the result once, whereas dropping columns one at a time creates an intermediate DataFrame at every step.
df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'temp_col1': [10, 20, 30],
'temp_col2': [40, 50, 60],
'temp_col3': [70, 80, 90],
'score': [95, 87, 92]
})
# Drop multiple columns at once
columns_to_drop = ['temp_col1', 'temp_col2', 'temp_col3']
df_cleaned = df.drop(columns=columns_to_drop)
print(df_cleaned.columns)
# Output: Index(['id', 'name', 'score'], dtype='object')
# Alternative syntax using axis parameter
df_cleaned = df.drop(['temp_col1', 'temp_col2', 'temp_col3'], axis=1)
The columns parameter provides clearer intent than axis=1 when your code emphasizes which columns to remove.
Using columns Parameter
Pandas 0.21.0 introduced the columns parameter as a more explicit alternative to the axis parameter.
df = pd.DataFrame({
'product_id': [101, 102, 103],
'product_name': ['Widget', 'Gadget', 'Tool'],
'internal_code': ['W01', 'G02', 'T03'],
'price': [19.99, 29.99, 39.99]
})
# Using columns parameter (more readable)
df_public = df.drop(columns=['internal_code'])
# Equivalent to using axis=1
df_public = df.drop(['internal_code'], axis=1)
This syntax improves code clarity, especially for developers less familiar with the axis convention.
Dropping Columns by Position
When you need to drop columns by their position rather than name, use iloc for selection or convert positions to column names.
df = pd.DataFrame({
'col_a': [1, 2, 3],
'col_b': [4, 5, 6],
'col_c': [7, 8, 9],
'col_d': [10, 11, 12],
'col_e': [13, 14, 15]
})
# Keep all columns except positions 1 and 3 (0-indexed)
columns_to_keep = [i for i in range(len(df.columns)) if i not in [1, 3]]
df_filtered = df.iloc[:, columns_to_keep]
print(df_filtered.columns)
# Output: Index(['col_a', 'col_c', 'col_e'], dtype='object')
# Alternative: Get column names by position and drop them
cols_to_drop = [df.columns[i] for i in [1, 3]]
df_filtered = df.drop(columns=cols_to_drop)
Position-based dropping is useful when working with programmatically generated DataFrames where column names aren’t known in advance.
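A slightly shorter variant of the second approach, sketched here on a one-row copy of the same column layout, indexes df.columns directly with a list of positions. It relies on the fact that a pandas Index accepts positional list and negative indexing.

```python
import pandas as pd

df = pd.DataFrame({'col_a': [1], 'col_b': [2], 'col_c': [3], 'col_d': [4], 'col_e': [5]})

# Indexing the column Index with a list of positions returns the labels there
labels = df.columns[[1, 3]]
df_filtered = df.drop(columns=labels)
print(list(df_filtered.columns))  # ['col_a', 'col_c', 'col_e']

# Negative positions count from the end: drop the last column
df_no_last = df.drop(columns=df.columns[-1])
print(list(df_no_last.columns))  # ['col_a', 'col_b', 'col_c', 'col_d']
```

Because drop() still works on labels, a DataFrame with duplicate column names loses every column sharing a dropped label. The iloc selection shown earlier avoids that.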
Using del and pop()
For single-column removal with side effects, use del or pop(). These methods always modify the DataFrame in place.
df = pd.DataFrame({
'user_id': [1, 2, 3],
'username': ['alice', 'bob', 'charlie'],
'temp_data': ['x', 'y', 'z'],
'score': [100, 200, 300]
})
# Using del - removes column, no return value
del df['temp_data']
print(df.columns)
# Output: Index(['user_id', 'username', 'score'], dtype='object')
# Using pop() - removes and returns the column as a Series
score_column = df.pop('score')
print(score_column)
# Output:
# 0    100
# 1    200
# 2    300
# Name: score, dtype: int64
print(df.columns)
# Output: Index(['user_id', 'username'], dtype='object')
Use pop() when you need the column data for further processing. Use del for straightforward removal when you don’t need the data.
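One place where pop() is handy is moving a column, since the Series it returns can be passed straight to insert(). The sketch below uses a small made-up DataFrame and also shows a guard for del, because both del and pop() raise KeyError when the column is missing.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'user_id': [1, 2],
    'username': ['alice', 'bob'],
    'score': [100, 200],
})

# pop() removes 'score' and hands it back; insert() places it at position 0
df.insert(0, 'score', df.pop('score'))
print(list(df.columns))  # ['score', 'user_id', 'username']

# Guard the deletion when the column may not exist
if 'temp_data' in df.columns:
    del df['temp_data']
print(list(df.columns))  # ['score', 'user_id', 'username']
```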
Column Selection (Inverse Approach)
Instead of dropping unwanted columns, select the columns you want. This approach is cleaner when you keep fewer columns than you drop.
df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'email': ['a@ex.com', 'b@ex.com', 'c@ex.com'],
'phone': ['111', '222', '333'],
'address': ['Addr1', 'Addr2', 'Addr3'],
'notes': ['Note1', 'Note2', 'Note3']
})
# Keep only specific columns
df_minimal = df[['id', 'name', 'email']]
print(df_minimal.columns)
# Output: Index(['id', 'name', 'email'], dtype='object')
# Using filter() with regex
df_filtered = df.filter(regex='^(id|name|email)$')
This pattern is more maintainable when your DataFrame has many columns and you only need a small subset.
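Besides regex, filter() also accepts items and like arguments. The sketch below, on a made-up DataFrame with a deliberately absent 'nickname' column, contrasts them with bracket selection, which raises KeyError when a requested label is missing.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'id': [1], 'name': ['Alice'], 'email': ['a@ex.com'], 'phone': ['111']})

# 'nickname' is deliberately absent from the DataFrame
wanted = ['id', 'name', 'nickname']

# Bracket selection insists that every label exists
try:
    df[wanted]
except KeyError as e:
    print(f"Error: {e}")

# filter(items=...) keeps only the labels that are present
df_items = df.filter(items=wanted)
print(list(df_items.columns))  # ['id', 'name']

# filter(like=...) keeps columns whose name contains the substring
df_like = df.filter(like='mail')
print(list(df_like.columns))  # ['email']
```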
Handling Errors When Dropping Columns
By default, drop() raises a KeyError if a column doesn’t exist. Use the errors parameter to control this behavior.
df = pd.DataFrame({
'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]
})
# This raises KeyError because 'nonexistent_col' is not in the DataFrame
try:
    df_new = df.drop(['col2', 'nonexistent_col'], axis=1)
except KeyError as e:
    print(f"Error: {e}")
# Ignore errors silently
df_new = df.drop(['col2', 'nonexistent_col'], axis=1, errors='ignore')
print(df_new.columns)
# Output: Index(['col1', 'col3'], dtype='object')
Setting errors='ignore' is particularly useful when dropping columns that may or may not exist, such as in data pipeline scenarios with varying input schemas.
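In a pipeline it can still be useful to know which columns were skipped. The following is a minimal sketch of that idea. The helper drop_optional and the column names are hypothetical, invented for illustration.

```python
import pandas as pd

def drop_optional(df, optional):
    """Drop the optional columns that exist and report the ones that do not."""
    missing = [col for col in optional if col not in df.columns]
    if missing:
        print(f"Not present, skipped: {missing}")
    return df.drop(columns=optional, errors='ignore')

# Hypothetical columns a pipeline wants removed whenever they appear
optional_columns = ['debug_info', 'internal_code']

# Two input batches with different schemas
batch_a = pd.DataFrame({'id': [1], 'internal_code': ['W01']})
batch_b = pd.DataFrame({'id': [2], 'debug_info': ['x'], 'internal_code': ['G02']})

clean_a = drop_optional(batch_a, optional_columns)  # reports 'debug_info' as skipped
clean_b = drop_optional(batch_b, optional_columns)
print(list(clean_a.columns))  # ['id']
print(list(clean_b.columns))  # ['id']
```

Reporting the skipped names keeps a typo in the drop list from going unnoticed, which errors='ignore' alone would hide.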
Conditional Column Dropping
Drop columns based on conditions like data type, name patterns, or content analysis.
df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'temp_score': [10, 20, 30],
'temp_rank': [1, 2, 3],
'final_score': [95, 87, 92]
})
# Drop all columns starting with 'temp_'
temp_cols = [col for col in df.columns if col.startswith('temp_')]
df_cleaned = df.drop(columns=temp_cols)
print(df_cleaned.columns)
# Output: Index(['id', 'name', 'final_score'], dtype='object')
# Drop columns by data type (here: every numeric column, whatever its width)
numeric_cols = df.select_dtypes(include='number').columns
df_non_numeric = df.drop(columns=numeric_cols)
# Drop columns with all NaN values
df_with_nans = pd.DataFrame({
'a': [1, 2, 3],
'b': [None, None, None],
'c': [4, 5, 6]
})
df_cleaned = df_with_nans.dropna(axis=1, how='all')
print(df_cleaned.columns)
# Output: Index(['a', 'c'], dtype='object')
Conditional dropping enables dynamic DataFrame cleaning based on runtime data characteristics.
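Content-based rules can go beyond the all-NaN case. The sketch below drops columns that are mostly missing or that hold a single distinct value. The sample data and the 0.5 threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'constant': ['x', 'x', 'x', 'x'],
    'b': [4.0, np.nan, 6.0, 7.0],
})

# Fraction of missing values per column; 0.5 is an arbitrary cutoff
missing_share = df.isna().mean()
too_sparse = missing_share[missing_share > 0.5].index

# Columns with at most one distinct value (nunique() ignores NaN by default)
n_unique = df.nunique()
constant_cols = n_unique[n_unique <= 1].index

df_cleaned = df.drop(columns=too_sparse.union(constant_cols))
print(list(df_cleaned.columns))  # ['a', 'b']
```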
Performance Considerations
When dropping many columns or working with large DataFrames, method choice impacts performance.
import pandas as pd
import numpy as np
# Create large DataFrame
df = pd.DataFrame(np.random.randn(10000, 100))
cols_to_drop = list(range(0, 50))
# Efficient: Drop multiple columns at once
df_new = df.drop(columns=cols_to_drop)
# Inefficient: Iterative dropping (uses a separate name so df stays intact)
df_iter = df
for col in cols_to_drop:
    df_iter = df_iter.drop(columns=[col])
# Often efficient when keeping few columns: direct selection
drop_set = set(cols_to_drop)  # set membership tests are O(1)
df_new = df[[col for col in df.columns if col not in drop_set]]
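Direct column selection tends to be the better choice when you keep only a minority of the columns, although the exact crossover depends on the pandas version, the dtypes, and the shape of the DataFrame, so measure on your own data. For removing a small number of columns from many, drop() provides better readability with acceptable performance.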
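To compare the approaches on your own machine, a rough timing harness along these lines may help. It is a sketch using the standard-library timeit module. The function names are made up for illustration, and no timings are quoted here because they vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))
cols_to_drop = list(range(50))
drop_set = set(cols_to_drop)

def drop_at_once():
    # One call, one new DataFrame
    return df.drop(columns=cols_to_drop)

def drop_one_by_one():
    # One intermediate DataFrame per dropped column
    out = df
    for col in cols_to_drop:
        out = out.drop(columns=[col])
    return out

def select_kept():
    # Build the list of columns to keep and select them directly
    return df[[col for col in df.columns if col not in drop_set]]

runs = 5
for fn in (drop_at_once, drop_one_by_one, select_kept):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1000:.2f} ms per call")
```

All three functions return the same columns, so any difference in the printed timings reflects the method rather than the result.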