Pandas - Select Columns by Index Position

Key Insights

Pandas provides several ways to select columns by index position: iloc[], indexing df.columns with integers to recover labels, and take() when the axis is passed as an argument
Understanding the difference between iloc[] (position-based) and loc[] (label-based) prevents common indexing errors when working with non-sequential column positions
Combining positional selection with Python slicing, lists, and boolean arrays enables flexible column subsetting for data transformation pipelines
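The difference between iloc[] and loc[] is easiest to see when the column labels are themselves integers that do not match their positions. The sketch below uses a small made-up frame (the labels 10, 20, 0 are illustrative only) to show the same integer meaning two different things.

```python
import pandas as pd

# Column labels are integers, but not in positional order
df = pd.DataFrame([[1, 2, 3]], columns=[10, 20, 0])

by_position = df.iloc[:, 0]  # first column, whatever its label is
by_label = df.loc[:, 0]      # the column whose label is 0 (third position)

print(by_position.name, by_position.iloc[0])  # 10 1
print(by_label.name, by_label.iloc[0])        # 0 3
```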

Using iloc for Positional Column Selection

The iloc[] indexer is the primary method for position-based column selection in Pandas. It uses zero-based integer indexing, making it ideal when you know the exact position of columns regardless of their names.

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 29.99, 79.99, 299.99],
    'quantity': [5, 50, 30, 15],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
})

# Select single column by position (second column, position 1)
name_column = df.iloc[:, 1]
print(name_column)
# Output:
# 0      Laptop
# 1       Mouse
# 2    Keyboard
# 3     Monitor
# Name: product_name, dtype: object

# Select multiple columns by position
subset = df.iloc[:, [0, 2, 3]]
print(subset)
#    product_id   price  quantity
# 0         101  999.99         5
# 1         102   29.99        50
# 2         103   79.99        30
# 3         104  299.99        15

The syntax df.iloc[:, column_positions] uses the colon to select all rows and specifies column positions after the comma. Single integers return a Series, while lists return a DataFrame.
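The Series-versus-DataFrame distinction matters when downstream code expects a two-dimensional object. A minimal sketch, using a cut-down version of the sample data, of how wrapping a single position in a list keeps the result a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [101, 102],
    "product_name": ["Laptop", "Mouse"],
    "price": [999.99, 29.99],
})

as_series = df.iloc[:, 1]    # bare integer -> Series
as_frame = df.iloc[:, [1]]   # one-element list -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```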

Slicing Columns by Position Range

Python’s slice notation works seamlessly with iloc[] for selecting consecutive columns. This approach is cleaner than listing individual positions when you need a range.

# Select first three columns (positions 0, 1, 2)
first_three = df.iloc[:, 0:3]
print(first_three)
#    product_id product_name   price
# 0         101       Laptop  999.99
# 1         102        Mouse   29.99
# 2         103     Keyboard   79.99
# 3         104      Monitor  299.99

# Select columns from position 2 to end
from_third = df.iloc[:, 2:]
print(from_third)
#     price  quantity     category
# 0  999.99         5  Electronics
# 1   29.99        50  Accessories
# 2   79.99        30  Accessories
# 3  299.99        15  Electronics

# Select every other column
alternate = df.iloc[:, ::2]
print(alternate)
#    product_id   price     category
# 0         101  999.99  Electronics
# 1         102   29.99  Accessories
# 2         103   79.99  Accessories
# 3         104  299.99  Electronics

# Select columns in reverse order
reversed_df = df.iloc[:, ::-1]
print(reversed_df.columns.tolist())
# ['category', 'quantity', 'price', 'product_name', 'product_id']

Slice notation follows the pattern start:stop:step. Remember that the stop position is exclusive, so 0:3 selects positions 0, 1, and 2.
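Slices also differ from lists in how they treat out-of-range positions. The sketch below, on a small illustrative frame, shows that a slice running past the last column is clipped the way a Python list slice would be, while a list containing an invalid position raises IndexError.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=list("abcde"))

last_two = df.iloc[:, -2:]    # negative start counts from the end
clipped = df.iloc[:, 3:100]   # stop beyond the last column is clipped
empty = df.iloc[:, 10:20]     # entirely out of range -> zero columns

print(last_two.columns.tolist())  # ['d', 'e']
print(clipped.columns.tolist())   # ['d', 'e']
print(empty.shape)                # (1, 0)

# A list with an out-of-bounds position is an error, not a silent clip
try:
    df.iloc[:, [0, 10]]
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```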

Selecting Non-Consecutive Columns

When you need specific columns that aren’t adjacent, pass a list of positions to iloc[]. This is particularly useful when restructuring data or selecting features for machine learning.

# Select first, third, and last columns
selected = df.iloc[:, [0, 2, -1]]
print(selected)
#    product_id   price     category
# 0         101  999.99  Electronics
# 1         102   29.99  Accessories
# 2         103   79.99  Accessories
# 3         104  299.99  Electronics

# Reorder columns by position
reordered = df.iloc[:, [4, 1, 2, 3, 0]]
print(reordered.columns.tolist())
# ['category', 'product_name', 'price', 'quantity', 'product_id']

# Combine ranges and individual positions
complex_selection = df.iloc[:, [0] + list(range(2, 5))]
print(complex_selection.columns.tolist())
# ['product_id', 'price', 'quantity', 'category']

Negative indexing works with iloc[], where -1 refers to the last column, -2 to the second-to-last, and so on.

Using take() for Advanced Selection

The take() method is an alternative to iloc[] that accepts a list of positions plus an axis argument, which is convenient when the axis is chosen at runtime. Like iloc[] with a list, it raises IndexError for out-of-bounds positions.

# Select columns using take()
subset = df.take([0, 2, 3], axis=1)
print(subset)
#    product_id   price  quantity
# 0         101  999.99         5
# 1         102   29.99        50
# 2         103   79.99        30
# 3         104  299.99        15

# DataFrame.take() raises IndexError for out-of-bounds positions
# (it has no allow_fill option), so validate positions when they might not exist
positions = [0, 2, 10]  # Position 10 doesn't exist
n_cols = df.shape[1]
try:
    result = df.take(positions, axis=1)
except IndexError as e:
    print(f"Error: {e}")
    # Fall back to the positions that are actually in range
    valid = [p for p in positions if -n_cols <= p < n_cols]
    result = df.take(valid, axis=1)

# Duplicate columns by repeating positions
duplicated = df.take([0, 0, 1, 1], axis=1)
print(duplicated.columns.tolist())
# ['product_id', 'product_id', 'product_name', 'product_name']

The axis=1 parameter specifies column selection (axis=0 would select rows). As with a list passed to iloc[], repeating a position in take() duplicates that column.
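To check how take() relates to iloc[], the following sketch runs both on the same list of positions, including a repeated one and a negative one. The three-column frame is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

positions = [0, 0, -1]  # repeat the first column, then take the last
via_take = df.take(positions, axis=1)
via_iloc = df.iloc[:, positions]

print(via_take.columns.tolist())  # ['a', 'a', 'c']
print(via_iloc.columns.tolist())  # ['a', 'a', 'c']

# Same values either way
same_values = (via_take.to_numpy() == via_iloc.to_numpy()).all()
print(bool(same_values))  # True
```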

Conditional Selection Based on Position

Combine boolean arrays with positional indexing for dynamic column selection based on conditions.

# Select columns at even positions
n_cols = len(df.columns)
even_positions = [i for i in range(n_cols) if i % 2 == 0]
even_cols = df.iloc[:, even_positions]
print(even_cols)
#    product_id   price     category
# 0         101  999.99  Electronics
# 1         102   29.99  Accessories
# 2         103   79.99  Accessories
# 3         104  299.99  Electronics

# Select columns based on position condition
# Get first half of columns
half_point = len(df.columns) // 2
first_half = df.iloc[:, :half_point]
print(first_half.columns.tolist())
# ['product_id', 'product_name']

# Boolean array selection
bool_array = np.array([True, False, True, False, True])
selected = df.iloc[:, bool_array]
print(selected.columns.tolist())
# ['product_id', 'price', 'category']

Boolean arrays must match the number of columns exactly. This technique is powerful when combined with programmatic column analysis.
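As one example of programmatic column analysis, a mask can be computed from the data itself. The sketch below builds a per-column "has missing values" mask and converts it to a plain NumPy array before passing it to iloc[]. The frame and the discount column are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product_id": [101, 102, 103],
    "price": [999.99, np.nan, 79.99],
    "quantity": [5, 50, 30],
    "discount": [np.nan, np.nan, 0.1],
})

# One boolean per column: True where the column has at least one NaN
has_missing = df.isna().any().to_numpy()
print(has_missing.tolist())  # [False, True, False, True]

# Keep only the columns without missing values
complete_cols = df.iloc[:, ~has_missing]
print(complete_cols.columns.tolist())  # ['product_id', 'quantity']
```

Converting with to_numpy() keeps the mask purely positional, so its length only has to match the number of columns.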

Direct Column Index Manipulation

Access the underlying column index positions directly for more control over column selection logic.

# Get column positions by name
col_positions = [df.columns.get_loc(col) for col in ['price', 'quantity']]
subset = df.iloc[:, col_positions]
print(subset)
#     price  quantity
# 0  999.99         5
# 1   29.99        50
# 2   79.99        30
# 3  299.99        15

# Find positions of numeric columns by dtype
numeric_positions = [i for i, dtype in enumerate(df.dtypes)
                     if pd.api.types.is_numeric_dtype(dtype)]
numeric_df = df.iloc[:, numeric_positions]
print(numeric_df.columns.tolist())
# ['product_id', 'price', 'quantity']

# Exclude specific positions
all_positions = set(range(len(df.columns)))
exclude_positions = {1, 3}  # Exclude positions 1 and 3
keep_positions = sorted(all_positions - exclude_positions)
filtered = df.iloc[:, keep_positions]
print(filtered.columns.tolist())
# ['product_id', 'price', 'category']

This approach bridges the gap between label-based and position-based selection, enabling complex selection logic based on column metadata.
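The translation can go in either direction. A minimal sketch, assuming unique column names: indexing df.columns with integers turns positions into labels, and Index.get_indexer() turns several labels into positions in one call, returning -1 for any label it cannot find. The label "missing" is a hypothetical name used only to show that case.

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [101],
    "product_name": ["Laptop"],
    "price": [999.99],
    "quantity": [5],
})

# Positions -> labels: index the columns Index, then select by label
labels = df.columns[[0, 2]]
by_label = df[labels]
print(labels.tolist())  # ['product_id', 'price']

# Labels -> positions; -1 marks a label that does not exist
positions = df.columns.get_indexer(["price", "quantity", "missing"])
print(positions.tolist())  # [2, 3, -1]

# Drop the not-found markers before handing positions to iloc
found = positions[positions >= 0]
print(df.iloc[:, found].columns.tolist())  # ['price', 'quantity']
```

Filtering out -1 matters because iloc[] would otherwise read it as "last column".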

Performance Considerations

Position-based selection can be somewhat faster than label-based selection because it skips the label-to-position lookup, but the size of the gap depends on the pandas version, the index type, and the data, so measure it rather than assume it.

import time

# Create large DataFrame
large_df = pd.DataFrame(np.random.rand(10000, 100))

# Position-based selection
start = time.perf_counter()
for _ in range(1000):
    subset = large_df.iloc[:, [0, 10, 20, 30]]
position_time = time.perf_counter() - start

# Label-based selection (the labels here happen to be the integers 0..99)
start = time.perf_counter()
for _ in range(1000):
    subset = large_df.loc[:, [0, 10, 20, 30]]
label_time = time.perf_counter() - start

print(f"Position-based: {position_time:.4f}s")
print(f"Label-based: {label_time:.4f}s")
# The difference varies by pandas version and hardware; run it on your own setup

# Resolve positions once, outside the loop, and reuse them
positions = [0, 10, 20, 30]
start = time.perf_counter()
for _ in range(1000):
    subset = large_df.iloc[:, positions]
optimized_time = time.perf_counter() - start
print(f"Precomputed positions: {optimized_time:.4f}s")

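For production code processing large datasets, resolve column positions once outside loops instead of looking them up on every iteration, and benchmark on your own data before assuming iloc[] is faster. Position-based indexing also stays stable when column names change but positions do not. The reverse holds too: if the column order changes, the same positions silently return different columns.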
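The stability point can be illustrated with a small sketch. The helper first_and_last and both frames are hypothetical; the idea is only that the same positional code works on two versions of a table whose columns were renamed but not reordered.

```python
import pandas as pd

def first_and_last(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the first and last columns by position."""
    return frame.iloc[:, [0, -1]]

v1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "total": [10.0, 20.0]})
v2 = v1.rename(columns={"id": "ID", "total": "amount"})  # same layout, new names

print(first_and_last(v1).columns.tolist())  # ['id', 'total']
print(first_and_last(v2).columns.tolist())  # ['ID', 'amount']
```

A label-based version of the helper would need updating for v2, while this one would need updating if the column order changed.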