Pandas - Select Multiple Columns
Key Insights
- Pandas offers multiple syntaxes for selecting columns: bracket notation with lists, .loc[], .iloc[], and .filter(), each suited to different use cases
- Column selection returns a DataFrame when multiple columns are specified, but a Series when selecting a single column; understanding this distinction prevents type errors in data pipelines
- Advanced selection techniques like regex patterns, callable functions, and conditional selection enable dynamic column filtering without hardcoding column names
Basic Column Selection with Bracket Notation
The most straightforward method for selecting multiple columns uses bracket notation with a list of column names. This approach is readable and works well when you know the exact column names.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'first_name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'last_name': ['Doe', 'Smith', 'Johnson', 'Williams', 'Brown'],
    'age': [28, 34, 45, 29, 38],
    'purchase_amount': [150.50, 220.00, 89.99, 310.25, 175.80],
    'email': ['john@email.com', 'jane@email.com', 'bob@email.com',
              'alice@email.com', 'charlie@email.com']
})
# Select multiple columns
selected = df[['customer_id', 'first_name', 'purchase_amount']]
print(selected)
Output:
customer_id first_name purchase_amount
0 1 John 150.50
1 2 Jane 220.00
2 3 Bob 89.99
3 4 Alice 310.25
4 5 Charlie 175.80
Note the double brackets: the outer brackets are the indexing operator, the inner brackets create a list. Using single brackets with a single column name returns a Series, not a DataFrame.
# Returns a Series
series_result = df['first_name']
print(type(series_result)) # <class 'pandas.core.series.Series'>
# Returns a DataFrame
df_result = df[['first_name']]
print(type(df_result)) # <class 'pandas.core.frame.DataFrame'>
Using .loc[] for Label-Based Selection
The .loc[] accessor provides label-based indexing and supports more complex selection patterns. It’s particularly useful when combining row and column selection.
# Select all rows, specific columns
result = df.loc[:, ['first_name', 'last_name', 'age']]
# Select specific rows and columns (label-based row slices include both endpoints)
result = df.loc[0:2, ['customer_id', 'purchase_amount']]
print(result)
Output:
customer_id purchase_amount
0 1 150.50
1 2 220.00
2 3 89.99
The .loc[] method also accepts column slices:
# Select column range (inclusive on both ends)
result = df.loc[:, 'first_name':'age']
print(result.columns.tolist())
# ['first_name', 'last_name', 'age']
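One subtlety worth knowing: label slices follow the DataFrame's current left-to-right column order, not any intrinsic ordering of the labels. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# Slice follows the current column order
print(df.loc[:, 'a':'b'].columns.tolist())
# ['a', 'b']

# After reordering, the same labels slice differently
reordered = df[['c', 'b', 'a']]
print(reordered.loc[:, 'c':'b'].columns.tolist())
# ['c', 'b']
```

This means a column slice can silently change meaning if an upstream step reorders columns, which is one reason explicit lists are safer in pipelines.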
Position-Based Selection with .iloc[]
When working with columns by position rather than name, .iloc[] provides integer-based indexing. This is valuable when column positions are known but names might change.
# Select columns by position (0-indexed)
result = df.iloc[:, [0, 2, 4]] # customer_id, last_name, purchase_amount
print(result)
# Select column range by position
result = df.iloc[:, 1:4] # Columns 1, 2, 3 (end exclusive)
print(result.columns.tolist())
# ['first_name', 'last_name', 'age']
# Combine row and column position selection
result = df.iloc[0:3, [1, 2]] # First 3 rows, columns 1 and 2
print(result)
Output:
first_name last_name
0 John Doe
1 Jane Smith
2 Bob Johnson
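For non-contiguous position selections, NumPy's np.r_ can combine slices and individual indices into a single integer array for .iloc. A sketch reusing a trimmed version of the sample DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'first_name': ['John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Smith', 'Johnson'],
    'age': [28, 34, 45],
    'purchase_amount': [150.50, 220.00, 89.99],
})

# np.r_[0:2, 4] concatenates the slice and scalar into array([0, 1, 4])
result = df.iloc[:, np.r_[0:2, 4]]
print(result.columns.tolist())
# ['customer_id', 'first_name', 'purchase_amount']
```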
Dynamic Selection with .filter()
The .filter() method enables dynamic column selection using patterns, regex, or callable functions. This approach scales better than hardcoding column names.
# Select columns containing specific substring
name_columns = df.filter(like='name')
print(name_columns.columns.tolist())
# ['first_name', 'last_name']
# Select columns using regex pattern
df_extended = df.copy()
df_extended['purchase_date'] = pd.date_range('2024-01-01', periods=5)
df_extended['purchase_category'] = ['A', 'B', 'A', 'C', 'B']
# Select all columns starting with 'purchase'
purchase_cols = df_extended.filter(regex='^purchase')
print(purchase_cols.columns.tolist())
# ['purchase_amount', 'purchase_date', 'purchase_category']
# Select columns by exact list
result = df.filter(items=['customer_id', 'age', 'email'])
print(result.columns.tolist())
# ['customer_id', 'age', 'email']
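Regex anchors also work at the end of a column name; a small sketch selecting columns that end with 'name':

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John'],
    'last_name': ['Doe'],
    'age': [28],
})

# '$' anchors the pattern at the end of the column name
ending = df.filter(regex='name$')
print(ending.columns.tolist())
# ['first_name', 'last_name']
```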
Conditional Column Selection
Select columns based on data types or custom conditions using boolean indexing and comprehensions.
# Select all numeric columns
numeric_cols = df.select_dtypes(include=[np.number])
print(numeric_cols.columns.tolist())
# ['customer_id', 'age', 'purchase_amount']
# Select non-numeric columns
text_cols = df.select_dtypes(exclude=[np.number])
print(text_cols.columns.tolist())
# ['first_name', 'last_name', 'email']
# Select columns based on custom condition
high_variance_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                      if df[col].std() > 50]
print(high_variance_cols)
# ['purchase_amount']
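The same comprehension pattern extends to other per-column statistics. For example, keeping only columns below a missing-value threshold (a sketch; the 50% cutoff and column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, np.nan],  # 75% missing
    'c': [1, np.nan, 3, 4],            # 25% missing
})

# Keep columns where fewer than half the values are missing
mostly_complete = [col for col in df.columns if df[col].isna().mean() < 0.5]
print(mostly_complete)
# ['a', 'c']
```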
Column Selection with Callable Functions
Pass callable functions to .loc[] for dynamic selection logic based on DataFrame state.
# Select columns where all values are non-null
# A callable passed to .loc[] receives the DataFrame and returns a column mask
result = df.loc[:, lambda d: d.notna().all()]

# Equivalent explicit version using a helper function
def all_non_null(col):
    return df[col].notna().all()

non_null_cols = [col for col in df.columns if all_non_null(col)]
result = df[non_null_cols]
# More complex selection logic
def select_columns(df_cols):
    # Select columns that don't end with '_id' and aren't email
    return [col for col in df_cols if not col.endswith('_id') and col != 'email']
result = df.loc[:, select_columns(df.columns)]
print(result.columns.tolist())
# ['first_name', 'last_name', 'age', 'purchase_amount']
Reordering and Renaming During Selection
Combine selection with reordering or renaming in a single operation.
# Reorder columns during selection
ordered = df[['customer_id', 'purchase_amount', 'first_name', 'last_name']]
# Select and rename simultaneously
renamed = df[['customer_id', 'first_name', 'purchase_amount']].rename(columns={
    'customer_id': 'id',
    'first_name': 'name',
    'purchase_amount': 'amount'
})
print(renamed)
Output:
id name amount
0 1 John 150.50
1 2 Jane 220.00
2 3 Bob 89.99
3 4 Alice 310.25
4 5 Charlie 175.80
Performance Considerations
Column selection performance varies by method. For large DataFrames, choose the appropriate technique:
import time
# Create large DataFrame
large_df = pd.DataFrame(np.random.randn(100000, 50))
large_df.columns = [f'col_{i}' for i in range(50)]
# Benchmark different selection methods
columns_to_select = [f'col_{i}' for i in range(0, 50, 5)]
# Method 1: Direct bracket notation
start = time.time()
result1 = large_df[columns_to_select]
time1 = time.time() - start
# Method 2: .loc[]
start = time.time()
result2 = large_df.loc[:, columns_to_select]
time2 = time.time() - start
# Method 3: .filter()
start = time.time()
result3 = large_df.filter(items=columns_to_select)
time3 = time.time() - start
print(f"Bracket notation: {time1:.6f}s")
print(f".loc[]: {time2:.6f}s")
print(f".filter(): {time3:.6f}s")
Bracket notation typically performs best for simple column lists. Use .loc[] when combining row and column selection. Reserve .filter() for pattern-based selection where the performance overhead is justified by code maintainability.
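One related caveat: selecting columns with a list always copies the data, so writing into the result does not modify the original DataFrame (a sketch; under older pandas without copy-on-write the write may also emit a SettingWithCopyWarning):

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 34], 'name': ['John', 'Jane']})

subset = df[['age']]       # list-based selection returns a copy
subset.loc[0, 'age'] = 99  # writes to the copy only

print(df.loc[0, 'age'])
# 28 -- the original is unchanged
```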
Handling Missing Columns
Implement error handling for column selection to prevent KeyErrors in production code:
# Columns that may or may not exist
desired_columns = ['customer_id', 'first_name', 'nonexistent_column']
# Select only existing columns
existing_columns = [col for col in desired_columns if col in df.columns]
safe_result = df[existing_columns]
# Alternative: use .filter() which ignores missing columns
safe_result = df.filter(items=desired_columns)
# Raise custom error if required columns are missing
required_columns = ['customer_id', 'purchase_amount']
missing = [col for col in required_columns if col not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")
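The validation pattern above can be wrapped in a small reusable helper (a sketch; select_required is an illustrative name, not a pandas API):

```python
import pandas as pd

def select_required(df, required, optional=()):
    """Return required columns plus any available optional ones;
    raise ValueError if a required column is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    present_optional = [col for col in optional if col in df.columns]
    return df[list(required) + present_optional]

df = pd.DataFrame({'customer_id': [1, 2], 'purchase_amount': [150.5, 220.0]})

# 'email' is silently skipped because it is optional and absent
result = select_required(df, ['customer_id', 'purchase_amount'], optional=['email'])
print(result.columns.tolist())
# ['customer_id', 'purchase_amount']
```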
Choose your column selection method based on use case: bracket notation for explicit lists, .loc[] for combined row-column operations, .iloc[] for position-based selection, and .filter() for pattern matching. Understanding these options enables you to write more maintainable and performant pandas code.