Pandas - Select Multiple Columns
Key Insights
- Pandas offers multiple syntaxes for selecting columns: bracket notation with lists, .loc[], .iloc[], and .filter(), each suited to different use cases
- Column selection returns a DataFrame when multiple columns are specified, but a Series when selecting a single column; understanding this distinction prevents type errors in data pipelines
- Advanced selection techniques like regex patterns, callable functions, and conditional selection enable dynamic column filtering without hardcoding column names
Basic Column Selection with Bracket Notation
The most straightforward method for selecting multiple columns uses bracket notation with a list of column names. This approach is readable and works well when you know the exact column names.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'first_name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'last_name': ['Doe', 'Smith', 'Johnson', 'Williams', 'Brown'],
    'age': [28, 34, 45, 29, 38],
    'purchase_amount': [150.50, 220.00, 89.99, 310.25, 175.80],
    'email': ['john@email.com', 'jane@email.com', 'bob@email.com',
              'alice@email.com', 'charlie@email.com']
})
# Select multiple columns
selected = df[['customer_id', 'first_name', 'purchase_amount']]
print(selected)
Output:
customer_id first_name purchase_amount
0 1 John 150.50
1 2 Jane 220.00
2 3 Bob 89.99
3 4 Alice 310.25
4 5 Charlie 175.80
Note the double brackets: the outer brackets are the indexing operator, the inner brackets create a list. Using single brackets with a single column name returns a Series, not a DataFrame.
# Returns a Series
series_result = df['first_name']
print(type(series_result)) # <class 'pandas.core.series.Series'>
# Returns a DataFrame
df_result = df[['first_name']]
print(type(df_result)) # <class 'pandas.core.frame.DataFrame'>
Using .loc[] for Label-Based Selection
The .loc[] accessor provides label-based indexing and supports more complex selection patterns. It’s particularly useful when combining row and column selection.
# Select all rows, specific columns
result = df.loc[:, ['first_name', 'last_name', 'age']]
# Select specific rows and columns (label-based row slices include both endpoints)
result = df.loc[0:2, ['customer_id', 'purchase_amount']]
print(result)
Output:
customer_id purchase_amount
0 1 150.50
1 2 220.00
2 3 89.99
The .loc[] method also accepts column slices:
# Select column range (inclusive on both ends)
result = df.loc[:, 'first_name':'age']
print(result.columns.tolist())
# ['first_name', 'last_name', 'age']
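One subtlety worth knowing: label slices follow the DataFrame's current left-to-right column order, not any intrinsic ordering of the labels. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# Slice follows the current column order
print(df.loc[:, 'a':'b'].columns.tolist())
# ['a', 'b']

# After reordering, the same labels slice differently
reordered = df[['c', 'b', 'a']]
print(reordered.loc[:, 'c':'b'].columns.tolist())
# ['c', 'b']
```

This means a column slice can silently change meaning if an upstream step reorders columns, which is one reason explicit lists are safer in pipelines.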
Position-Based Selection with .iloc[]
When working with columns by position rather than name, .iloc[] provides integer-based indexing. This is valuable when column positions are known but names might change.
# Select columns by position (0-indexed)
result = df.iloc[:, [0, 2, 4]] # customer_id, last_name, purchase_amount
print(result)
# Select column range by position
result = df.iloc[:, 1:4] # Columns 1, 2, 3 (end exclusive)
print(result.columns.tolist())
# ['first_name', 'last_name', 'age']
# Combine row and column position selection
result = df.iloc[0:3, [1, 2]] # First 3 rows, columns 1 and 2
print(result)
Output:
first_name last_name
0 John Doe
1 Jane Smith
2 Bob Johnson
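For non-contiguous position selections, NumPy's np.r_ can combine slices and individual indices into a single integer array for .iloc. A sketch reusing a trimmed version of the sample DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'first_name': ['John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Smith', 'Johnson'],
    'age': [28, 34, 45],
    'purchase_amount': [150.50, 220.00, 89.99],
})

# np.r_[0:2, 4] concatenates the slice and scalar into array([0, 1, 4])
result = df.iloc[:, np.r_[0:2, 4]]
print(result.columns.tolist())
# ['customer_id', 'first_name', 'purchase_amount']
```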
Dynamic Selection with .filter()
The .filter() method enables dynamic column selection using patterns, regex, or callable functions. This approach scales better than hardcoding column names.
# Select columns containing specific substring
name_columns = df.filter(like='name')
print(name_columns.columns.tolist())
# ['first_name', 'last_name']
# Select columns using regex pattern
df_extended = df.copy()
df_extended['purchase_date'] = pd.date_range('2024-01-01', periods=5)
df_extended['purchase_category'] = ['A', 'B', 'A', 'C', 'B']
# Select all columns starting with 'purchase'
purchase_cols = df_extended.filter(regex='^purchase')
print(purchase_cols.columns.tolist())
# ['purchase_amount', 'purchase_date', 'purchase_category']
# Select columns by exact list
result = df.filter(items=['customer_id', 'age', 'email'])
print(result.columns.tolist())
# ['customer_id', 'age', 'email']
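Regex anchors also work at the end of a column name; a small sketch selecting columns that end with 'name':

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John'],
    'last_name': ['Doe'],
    'age': [28],
})

# '$' anchors the pattern at the end of the column name
ending = df.filter(regex='name$')
print(ending.columns.tolist())
# ['first_name', 'last_name']
```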
Conditional Column Selection
Select columns based on data types or custom conditions using boolean indexing and comprehensions.
# Select all numeric columns
numeric_cols = df.select_dtypes(include=[np.number])
print(numeric_cols.columns.tolist())
# ['customer_id', 'age', 'purchase_amount']
# Select non-numeric columns
text_cols = df.select_dtypes(exclude=[np.number])
print(text_cols.columns.tolist())
# ['first_name', 'last_name', 'email']
# Select columns based on custom condition
high_variance_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                      if df[col].std() > 50]
print(high_variance_cols)
# ['purchase_amount']
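The same comprehension pattern extends to other per-column statistics. For example, keeping only columns below a missing-value threshold (a sketch; the 50% cutoff and column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, np.nan],  # 75% missing
    'c': [1, np.nan, 3, 4],            # 25% missing
})

# Keep columns where fewer than half the values are missing
mostly_complete = [col for col in df.columns if df[col].isna().mean() < 0.5]
print(mostly_complete)
# ['a', 'c']
```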
Column Selection with Callable Functions
Pass callable functions to .loc[] for dynamic selection logic based on DataFrame state.
# Select columns where all values are non-null
# A callable passed to .loc[] receives the DataFrame and returns a column mask
result = df.loc[:, lambda d: d.notna().all()]

# Equivalent explicit version using a helper function
def all_non_null(col):
    return df[col].notna().all()

non_null_cols = [col for col in df.columns if all_non_null(col)]
result = df[non_null_cols]
# More complex selection logic
def select_columns(df_cols):
    # Select columns that don't end with '_id' and aren't email
    return [col for col in df_cols if not col.endswith('_id') and col != 'email']
result = df.loc[:, select_columns(df.columns)]
print(result.columns.tolist())
# ['first_name', 'last_name', 'age', 'purchase_amount']
Reordering and Renaming During Selection
Combine selection with reordering or renaming in a single operation.
# Reorder columns during selection
ordered = df[['customer_id', 'purchase_amount', 'first_name', 'last_name']]
# Select and rename simultaneously
renamed = df[['customer_id', 'first_name', 'purchase_amount']].rename(columns={
    'customer_id': 'id',
    'first_name': 'name',
    'purchase_amount': 'amount'
})
print(renamed)
Output:
id name amount
0 1 John 150.50
1 2 Jane 220.00
2 3 Bob 89.99
3 4 Alice 310.25
4 5 Charlie 175.80
Performance Considerations
Column selection performance varies by method. For large DataFrames, choose the appropriate technique:
import time
# Create large DataFrame
large_df = pd.DataFrame(np.random.randn(100000, 50))
large_df.columns = [f'col_{i}' for i in range(50)]
# Benchmark different selection methods
columns_to_select = [f'col_{i}' for i in range(0, 50, 5)]
# Method 1: Direct bracket notation
start = time.time()
result1 = large_df[columns_to_select]
time1 = time.time() - start
# Method 2: .loc[]
start = time.time()
result2 = large_df.loc[:, columns_to_select]
time2 = time.time() - start
# Method 3: .filter()
start = time.time()
result3 = large_df.filter(items=columns_to_select)
time3 = time.time() - start
print(f"Bracket notation: {time1:.6f}s")
print(f".loc[]: {time2:.6f}s")
print(f".filter(): {time3:.6f}s")
Bracket notation typically performs best for simple column lists. Use .loc[] when combining row and column selection. Reserve .filter() for pattern-based selection where the performance overhead is justified by code maintainability.
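One related caveat: selecting columns with a list always copies the data, so writing into the result does not modify the original DataFrame (a sketch; under older pandas without copy-on-write the write may also emit a SettingWithCopyWarning):

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 34], 'name': ['John', 'Jane']})

subset = df[['age']]       # list-based selection returns a copy
subset.loc[0, 'age'] = 99  # writes to the copy only

print(df.loc[0, 'age'])
# 28 -- the original is unchanged
```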
Handling Missing Columns
Implement error handling for column selection to prevent KeyErrors in production code:
# Columns that may or may not exist
desired_columns = ['customer_id', 'first_name', 'nonexistent_column']
# Select only existing columns
existing_columns = [col for col in desired_columns if col in df.columns]
safe_result = df[existing_columns]
# Alternative: use .filter() which ignores missing columns
safe_result = df.filter(items=desired_columns)
# Raise custom error if required columns are missing
required_columns = ['customer_id', 'purchase_amount']
missing = [col for col in required_columns if col not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")
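The validation pattern above can be wrapped in a small reusable helper (a sketch; select_required is an illustrative name, not a pandas API):

```python
import pandas as pd

def select_required(df, required, optional=()):
    """Return required columns plus any available optional ones;
    raise ValueError if a required column is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    present_optional = [col for col in optional if col in df.columns]
    return df[list(required) + present_optional]

df = pd.DataFrame({'customer_id': [1, 2], 'purchase_amount': [150.5, 220.0]})

# 'email' is silently skipped because it is optional and absent
result = select_required(df, ['customer_id', 'purchase_amount'], optional=['email'])
print(result.columns.tolist())
# ['customer_id', 'purchase_amount']
```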
Choose your column selection method based on use case: bracket notation for explicit lists, .loc[] for combined row-column operations, .iloc[] for position-based selection, and .filter() for pattern matching. Understanding these options enables you to write more maintainable and performant pandas code.