How to Select Columns in Pandas


Key Insights

  • Bracket notation df['column'] is safer than dot notation df.column because it works with all column names, including those with spaces or that match DataFrame method names.
  • Use loc for label-based slicing and iloc for position-based slicing—mixing them up is a common source of off-by-one errors since loc includes the endpoint while iloc doesn’t.
  • For dynamic column selection in production code, prefer filter() with regex patterns or list comprehensions over hardcoded column names to make your pipelines more maintainable.

Introduction

Column selection is the bread and butter of pandas work. Before you can clean, transform, or analyze data, you need to extract the specific columns you care about. Whether you’re dropping irrelevant features before training a model, selecting columns for a report, or just exploring a new dataset, you’ll use column selection constantly.

This article covers every practical method for selecting columns in pandas, from basic bracket notation to advanced pattern matching. By the end, you’ll know exactly which approach to use for any situation.

Basic Column Selection Methods

Let’s start with a sample DataFrame to work with throughout this article:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'sales_2022': [15000, 23000, 18000, 31000, 27000],
    'sales_2023': [17500, 21000, 22000, 35000, 29000],
    'region': ['North', 'South', 'North', 'East', 'West'],
    'signup_date': pd.to_datetime(['2020-01-15', '2019-06-20', '2021-03-10', '2018-11-05', '2022-07-22'])
})

Single Column Selection

You have two options for selecting a single column:

# Bracket notation - always works
name_series = df['name']

# Dot notation - convenient but limited
name_series = df.name

Both return a pandas Series. However, bracket notation is the safer choice. Dot notation fails when:

  • The column name contains spaces: df.first name is a syntax error
  • The column name matches a DataFrame method: df.count returns the method, not a column
  • The column name starts with a number: df.2023_sales is invalid Python

Stick with bracket notation in production code. Dot notation is fine for quick exploration in a notebook, but it will eventually bite you.
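A quick sketch of the second pitfall: with a hypothetical DataFrame whose column happens to be named count, dot notation silently gives you the DataFrame.count method instead of the column.

```python
import pandas as pd

# Hypothetical DataFrame whose column name collides with a DataFrame method
df = pd.DataFrame({'count': [10, 20, 30], 'name': ['a', 'b', 'c']})

print(type(df['count']))  # <class 'pandas.core.series.Series'> -- the column
print(type(df.count))     # a bound method, not the column

# No error is raised, which makes this bug easy to miss downstream
```

Note that nothing fails loudly here; the wrong object just flows into the rest of your code.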

Multiple Column Selection

To select multiple columns, pass a list of column names inside brackets:

# Returns a DataFrame, not a Series
subset = df[['name', 'sales_2023', 'region']]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>

# Single column with double brackets also returns DataFrame
single_col_df = df[['name']]
print(type(single_col_df))  # <class 'pandas.core.frame.DataFrame'>

This distinction matters. If you need a DataFrame (for example, to pass to a function expecting one), use double brackets even for a single column.
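The shape attribute makes the difference concrete. A minimal sketch using a two-row version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'region': ['North', 'South']})

# Single brackets -> 1-D Series; double brackets -> 2-D DataFrame
print(df['name'].shape)    # (2,)
print(df[['name']].shape)  # (2, 1)
```

The 2-D shape is what APIs like scikit-learn estimators expect for feature input.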

Selection with loc and iloc

The loc and iloc accessors give you more control, especially when you need to select both rows and columns simultaneously.

Label-Based Selection with loc

loc uses column names (labels) for selection:

# Select all rows, specific columns
df.loc[:, ['name', 'region']]

# Select all rows, column range (inclusive on both ends!)
df.loc[:, 'name':'region']

The slice 'name':'region' includes both endpoints. This is different from standard Python slicing and catches many people off guard.

# Combine row and column selection
df.loc[df['sales_2023'] > 25000, ['name', 'sales_2023']]

This returns only the name and sales_2023 columns for customers with 2023 sales above 25,000.

Position-Based Selection with iloc

iloc uses integer positions, starting from 0:

# First three columns
df.iloc[:, 0:3]

# Last two columns
df.iloc[:, -2:]

# Specific positions
df.iloc[:, [0, 2, 4]]

Unlike loc, iloc follows standard Python slicing conventions—the endpoint is excluded. So df.iloc[:, 0:3] returns columns at positions 0, 1, and 2.

# Compare the difference
print(df.loc[:, 'customer_id':'sales_2022'].columns)
# Index(['customer_id', 'name', 'sales_2022'], dtype='object')

print(df.iloc[:, 0:3].columns)
# Index(['customer_id', 'name', 'sales_2022'], dtype='object')

# Same result here, but loc includes endpoint, iloc excludes it

When to use which: Use loc when you know column names and want readable code. Use iloc when working with column positions programmatically or when column names might change but positions won’t.
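As a sketch of the position-based case: if a data feed always appends its newest metric columns on the right, iloc can select them without hardcoding names (the column names below are illustrative, not from the sample DataFrame):

```python
import pandas as pd

# Hypothetical feed where the two rightmost columns are always the newest metrics
df = pd.DataFrame({
    'id': [1, 2],
    'metric_a': [0.1, 0.2],
    'metric_b': [0.3, 0.4],
})

# Grab the two rightmost columns by position -- no names hardcoded
latest = df.iloc[:, -2:]
print(list(latest.columns))  # ['metric_a', 'metric_b']
```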

Filtering Columns by Data Type

Real-world DataFrames contain mixed types. The select_dtypes() method lets you grab columns by their data type:

# Select only numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])
print(numeric_cols.columns)
# Index(['customer_id', 'sales_2022', 'sales_2023'], dtype='object')

# Select only object (string) columns
string_cols = df.select_dtypes(include=['object'])
print(string_cols.columns)
# Index(['name', 'region'], dtype='object')

# Use numpy types for broader matching
numeric_cols = df.select_dtypes(include=[np.number])

# Exclude specific types
non_numeric = df.select_dtypes(exclude=[np.number])

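The sample DataFrame also has a datetime column (signup_date) that none of the selections above catches. A small sketch of selecting it by dtype:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'signup_date': pd.to_datetime(['2020-01-15', '2019-06-20']),
})

# Select datetime columns by their dtype
date_cols = df.select_dtypes(include=['datetime64[ns]'])
print(list(date_cols.columns))  # ['signup_date']
```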
This is invaluable for preprocessing pipelines. Need to apply scaling to all numeric features? Need to encode all categorical columns? select_dtypes() makes it trivial:

# Common preprocessing pattern
from sklearn.preprocessing import StandardScaler

numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
df[numeric_features] = StandardScaler().fit_transform(df[numeric_features])

Dynamic Column Selection with filter()

The filter() method provides pattern matching for column names. It’s underused but extremely powerful for datasets with naming conventions.

Using the regex Parameter

# Select columns starting with 'sales_'
sales_columns = df.filter(regex='^sales_')
print(sales_columns.columns)
# Index(['sales_2022', 'sales_2023'], dtype='object')

# Select columns ending with a year
year_columns = df.filter(regex='_20[0-9]{2}$')
print(year_columns.columns)
# Index(['sales_2022', 'sales_2023'], dtype='object')

# Select columns containing 'id' (case insensitive)
id_columns = df.filter(regex='(?i)id')

Using the like Parameter

For simpler substring matching without regex:

# Columns containing 'sales'
df.filter(like='sales')

# Columns containing 'date'
df.filter(like='date')

filter() shines when working with wide datasets that follow naming conventions. Financial data often has columns like revenue_q1_2023, revenue_q2_2023, etc. Pattern matching lets you select all Q1 columns or all 2023 columns without listing them explicitly.
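A sketch of that scenario with made-up quarterly revenue columns:

```python
import pandas as pd

# Hypothetical wide financial dataset following a quarter/year naming convention
fin = pd.DataFrame({
    'revenue_q1_2023': [100], 'revenue_q2_2023': [110],
    'revenue_q1_2022': [90],  'revenue_q2_2022': [95],
})

# All Q1 columns, regardless of year
q1 = fin.filter(regex='_q1_')
print(list(q1.columns))  # ['revenue_q1_2023', 'revenue_q1_2022']

# All 2023 columns, regardless of quarter
y2023 = fin.filter(regex='_2023$')
print(list(y2023.columns))  # ['revenue_q1_2023', 'revenue_q2_2023']
```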

Programmatic Column Selection

Sometimes you need custom logic that goes beyond built-in methods. List comprehensions with df.columns handle these cases.

List Comprehensions

# Columns containing a substring
cols_with_sales = [col for col in df.columns if 'sales' in col]
df[cols_with_sales]

# Columns NOT containing a substring
non_sales_cols = [col for col in df.columns if 'sales' not in col]
df[non_sales_cols]

# Columns meeting multiple conditions
selected = [col for col in df.columns 
            if col.startswith('sales') or col == 'name']
df[selected]

Boolean Indexing on Columns

You can also use pandas string methods on the column index:

# Using str accessor on column index
mask = df.columns.str.contains('sales')
df.loc[:, mask]

# Combine conditions
mask = df.columns.str.startswith('sales') | (df.columns == 'name')
df.loc[:, mask]

# Case-insensitive matching
mask = df.columns.str.lower().str.contains('id')
df.loc[:, mask]

This approach integrates well with pandas’ vectorized operations and can be more readable than list comprehensions for complex conditions.

Summary and Best Practices

Here’s a quick reference for choosing the right method:

Method                 | Use Case                             | Returns
df['col']              | Single column, safe syntax           | Series
df[['col1', 'col2']]   | Multiple known columns               | DataFrame
df.loc[:, 'a':'c']     | Label-based range (inclusive)        | DataFrame
df.iloc[:, 0:3]        | Position-based range (exclusive end) | DataFrame
df.select_dtypes()     | By data type                         | DataFrame
df.filter(regex=...)   | Pattern matching                     | DataFrame
List comprehension     | Custom logic                         | List (use with df[...])

Performance considerations: For large DataFrames, all these methods are roughly equivalent in performance—the bottleneck is typically the data copy, not the selection logic. However, avoid repeated column selection in loops. Select once and reuse:

# Bad: selects columns on every iteration
for i in range(1000):
    process(df[['col1', 'col2']])

# Better: select once
subset = df[['col1', 'col2']]
for i in range(1000):
    process(subset)

Final recommendations:

  • Default to bracket notation for clarity and safety
  • Use loc when combining row and column selection
  • Use select_dtypes() for type-based preprocessing pipelines
  • Use filter() with regex for datasets following naming conventions
  • Fall back to list comprehensions for custom logic

Master these methods and you’ll handle any column selection task efficiently.
