PySpark - Select Columns by Index


Key Insights

  • PySpark DataFrames don’t natively support integer-based indexing like pandas, but you can select columns by index using the df.columns list attribute combined with standard selection methods
  • The most practical approach is df.select(*[df.columns[i] for i in indices]) which allows flexible index-based selection including ranges, specific positions, and conditional logic
  • Index-based column selection is essential when working with headerless data, dynamically generated DataFrames, or when column positions are more reliable than names in your pipeline

Introduction & Use Case

PySpark DataFrames are designed around named column access, but there are legitimate scenarios where selecting columns by their positional index becomes necessary. You might be processing CSV files without headers, working with DataFrames where column names are unreliable or auto-generated, or building dynamic ETL pipelines that need to select columns based on their position rather than name.

Unlike pandas, which offers iloc for positional indexing, PySpark requires a different approach. Understanding how to select columns by index in PySpark is crucial for building flexible data pipelines that can adapt to varying data structures.
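To make the contrast concrete, here is a minimal pure-Python sketch of how a pandas iloc call translates into the PySpark pattern used throughout this article (the column names mirror the sample DataFrame created below; the Spark call itself is left as a comment):

```python
# pandas-style positional selection, for comparison:
#   pdf.iloc[:, [0, 2]]          # columns 0 and 2 by position
#
# PySpark has no iloc; instead, translate positions to names first.
columns = ["id", "name", "age", "department", "salary"]
indices = [0, 2]
names = [columns[i] for i in indices]
assert names == ["id", "age"]
# then: df.select(*names)
```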

Let’s create a sample DataFrame to work with throughout this article:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ColumnIndexing").getOrCreate()

data = [
    (1, "Alice", 29, "Engineering", 95000),
    (2, "Bob", 35, "Marketing", 75000),
    (3, "Charlie", 28, "Engineering", 88000),
    (4, "Diana", 42, "Sales", 82000)
]

df = spark.createDataFrame(data, ["id", "name", "age", "department", "salary"])
df.show()

This gives us a simple DataFrame with five columns that we’ll use to demonstrate various indexing techniques.

Understanding PySpark DataFrame Columns

PySpark DataFrames are fundamentally different from pandas DataFrames. They’re distributed, immutable, and optimized for columnar operations. The DataFrame structure stores column names as metadata, and the columns attribute returns a standard Python list of column name strings.

# Display schema
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- department: string (nullable = true)
#  |-- salary: long (nullable = true)

# Access column names as a list
print(df.columns)
# ['id', 'name', 'age', 'department', 'salary']

print(type(df.columns))
# <class 'list'>

Since df.columns is a regular Python list, you can leverage standard list indexing and slicing operations. The key is combining this list with PySpark’s select() method, which expects column names or Column objects.
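Because df.columns is an ordinary list, every standard list operation applies, including going the other way and looking up a position from a name. A small pure-Python illustration using the sample schema's names:

```python
# df.columns is a plain Python list, so standard list operations apply.
# Using the sample schema's names directly here for illustration:
columns = ["id", "name", "age", "department", "salary"]

assert columns[1:4] == ["name", "age", "department"]  # slicing
assert columns[-1] == "salary"                        # negative indexing
assert columns.index("age") == 2                      # name -> position
```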

Method 1: Using df.columns with List Indexing

The most straightforward approach is to use the columns attribute with Python’s indexing syntax. This works because select() accepts column names as strings.

# Select single column by index (first column)
df.select(df.columns[0]).show()
# +---+
# | id|
# +---+
# |  1|
# |  2|
# |  3|
# |  4|
# +---+

# Select multiple specific columns by index
df.select(df.columns[1], df.columns[3]).show()
# +-------+------------+
# |   name|  department|
# +-------+------------+
# |  Alice| Engineering|
# |    Bob|   Marketing|
# |Charlie| Engineering|
# |  Diana|       Sales|
# +-------+------------+

# Select a range of columns using slicing
df.select(*df.columns[1:4]).show()
# +-------+---+------------+
# |   name|age|  department|
# +-------+---+------------+
# |  Alice| 29| Engineering|
# |    Bob| 35|   Marketing|
# |Charlie| 28| Engineering|
# |  Diana| 42|       Sales|
# +-------+---+------------+

Note the asterisk (*) operator when using slices. df.columns[1:4] returns a list, and unpacking it passes each name to select() as a separate argument. (Recent PySpark versions also accept a single list directly, but unpacking is the more explicit form and works when combining the slice with other columns.)

You can also use negative indexing:

# Select last column
df.select(df.columns[-1]).show()

# Select last three columns
df.select(*df.columns[-3:]).show()

Method 2: Using List Comprehension for Advanced Selection

List comprehensions provide powerful pattern-based column selection. This is where index-based selection really shines compared to name-based approaches.

# Select even-indexed columns (0, 2, 4)
even_cols = [df.columns[i] for i in range(len(df.columns)) if i % 2 == 0]
df.select(*even_cols).show()
# +---+---+------+
# | id|age|salary|
# +---+---+------+
# |  1| 29| 95000|
# |  2| 35| 75000|
# |  3| 28| 88000|
# |  4| 42| 82000|
# +---+---+------+

# Select first 2 and last 2 columns
first_last = df.columns[:2] + df.columns[-2:]
df.select(*first_last).show()
# +---+-------+------------+------+
# | id|   name|  department|salary|
# +---+-------+------------+------+
# |  1|  Alice| Engineering| 95000|
# |  2|    Bob|   Marketing| 75000|
# |  3|Charlie| Engineering| 88000|
# |  4|  Diana|       Sales| 82000|
# +---+-------+------------+------+

# Exclude specific indices (e.g., exclude columns 0 and 3)
exclude_indices = {0, 3}
selected_cols = [df.columns[i] for i in range(len(df.columns)) if i not in exclude_indices]
df.select(*selected_cols).show()
# +-------+---+------+
# |   name|age|salary|
# +-------+---+------+
# |  Alice| 29| 95000|
# |    Bob| 35| 75000|
# |Charlie| 28| 88000|
# |  Diana| 42| 82000|
# +-------+---+------+

This approach is particularly useful when you need to apply business logic to determine which columns to select based on their position.
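For instance, the positional rule can come from pipeline configuration rather than being hard-coded. A pure-Python sketch (the config dict is illustrative, not from the article; the Spark call is left as a comment):

```python
# Config-driven positional selection: positions live in configuration,
# so the same pipeline code works across differently named schemas.
config = {"keep_positions": [0, 1, 4]}  # illustrative config
columns = ["id", "name", "age", "department", "salary"]

selected = [columns[i] for i in config["keep_positions"]]
assert selected == ["id", "name", "salary"]
# df.select(*selected)
```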

Method 3: Using col() Function with Dynamic Operations

When you need to perform transformations while selecting by index, combine col() with indexed column names:

from pyspark.sql.functions import col, upper

# Select and transform columns by index
df.select(
    col(df.columns[0]),
    upper(col(df.columns[1])).alias("name_upper"),
    col(df.columns[4])
).show()
# +---+----------+------+
# | id|name_upper|salary|
# +---+----------+------+
# |  1|     ALICE| 95000|
# |  2|       BOB| 75000|
# |  3|   CHARLIE| 88000|
# |  4|     DIANA| 82000|
# +---+----------+------+

# Rename columns while selecting by index
df.select(
    col(df.columns[0]).alias("employee_id"),
    col(df.columns[1]).alias("employee_name"),
    col(df.columns[4]).alias("annual_salary")
).show()

This pattern is valuable when building reusable functions that operate on DataFrames with varying schemas but consistent column positions.

Practical Examples & Best Practices

Here’s a real-world scenario: processing a CSV file without headers where you know the column positions but not the names:

# Reading CSV without header - PySpark assigns default names
df_no_header = spark.read.csv("data.csv", header=False)
# Columns will be named: _c0, _c1, _c2, etc.

# Select columns by position and assign meaningful names
result = df_no_header.select(
    col(df_no_header.columns[0]).alias("customer_id"),
    col(df_no_header.columns[2]).alias("purchase_amount"),
    col(df_no_header.columns[5]).alias("transaction_date")
)
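When you want to rename every column rather than a subset, DataFrame.toDF() is a compact alternative, since it assigns names strictly by position. A hedged sketch (the six target names are illustrative placeholders; the Spark call is left as a comment):

```python
# toDF() renames all columns at once, purely by position.
# The target names below are illustrative placeholders.
new_names = ["customer_id", "name", "purchase_amount",
             "region", "channel", "transaction_date"]
# df_named = df_no_header.toDF(*new_names)

# The positional mapping toDF() applies is equivalent to:
defaults = [f"_c{i}" for i in range(len(new_names))]
mapping = dict(zip(defaults, new_names))
assert mapping["_c2"] == "purchase_amount"
assert mapping["_c5"] == "transaction_date"
```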

For dynamic column selection based on user input or configuration:

def select_columns_by_indices(df, indices):
    """
    Select DataFrame columns by their positional indices.
    
    Args:
        df: PySpark DataFrame
        indices: List of integers or a slice object
    
    Returns:
        DataFrame with selected columns
    """
    if isinstance(indices, slice):
        selected = df.columns[indices]
    else:
        selected = [df.columns[i] for i in indices]
    
    return df.select(*selected)

# Usage
result = select_columns_by_indices(df, [0, 2, 4])
result = select_columns_by_indices(df, slice(1, 4))
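In production pipelines you may want to fail fast on bad positions instead of letting a raw IndexError surface deep in a job. A defensive variant of the same translation step (this validation helper is an illustrative addition, not part of the original function):

```python
def select_indices_safe(column_names, indices):
    """Translate positional indices to names, failing fast when out of range."""
    n = len(column_names)
    bad = [i for i in indices if not -n <= i < n]
    if bad:
        raise IndexError(f"indices {bad} out of range for {n} columns")
    return [column_names[i] for i in indices]

cols = ["id", "name", "age", "department", "salary"]
assert select_indices_safe(cols, [0, -1]) == ["id", "salary"]
# df.select(*select_indices_safe(df.columns, [0, -1]))
```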

Performance-wise, these methods have minimal overhead since they’re simply translating indices to column names before the actual Spark operation. The select() operation itself performs identically whether you use indexed or named access.

Common Pitfalls & Alternatives

The biggest pitfall is using index-based selection when name-based would be clearer and more maintainable:

# Avoid this when column names are stable
df.select(df.columns[1], df.columns[3])  # What are these columns?

# Prefer this for readability
df.select("name", "department")  # Clear intent

Index-based selection breaks if your DataFrame schema changes. If a new column is added at position 0, all your indices shift:

# This is fragile
important_data = df.select(df.columns[2], df.columns[4])

# This is robust
important_data = df.select("age", "salary")

A related subtlety involves passing lists directly to select(). Modern PySpark accepts a single list of column names, but mixing a list with other column arguments fails:

# Works in recent PySpark: select() accepts a single list
df.select(df.columns[1:4])

# Fails: a list cannot be mixed with other column arguments
df.select("id", df.columns[1:4])

# Unpacking works in every case and makes the intent explicit
df.select("id", *df.columns[1:4])

Use index-based selection when:

  • Processing files without headers
  • Building generic pipeline functions that work across varying schemas
  • Column positions are guaranteed but names might vary
  • Implementing positional transformations (e.g., “process every third column”)
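The last bullet maps directly onto a slice with a step. A pure-Python sketch over the sample column names (pass the result to df.select(*selected) on a real DataFrame):

```python
# "Process every third column" via a slice step
columns = ["id", "name", "age", "department", "salary"]
selected = columns[::3]   # every third column, starting at index 0
assert selected == ["id", "department"]
# df.select(*selected)
```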

Use name-based selection when:

  • Schema is known and stable
  • Code readability is important
  • Working with well-defined data models

By understanding both approaches, you can choose the right tool for each situation in your PySpark data pipelines.
