Pandas: DataFrame Indexing and Selection

DataFrame indexing is where Pandas beginners stumble and intermediates get bitten by subtle bugs. The library offers multiple ways to select and modify data, each with distinct behaviors that can...

Key Insights

  • Use .loc[] for label-based selection and .iloc[] for position-based selection—mixing them up causes subtle bugs that waste hours of debugging time
  • Boolean indexing with .loc[] is the safest way to filter and modify data, avoiding the dreaded SettingWithCopyWarning
  • Direct bracket notation [] is convenient for column access but dangerous for row slicing—prefer explicit methods for production code

DataFrame indexing is where Pandas beginners stumble and intermediates get bitten by subtle bugs. The library offers multiple ways to select and modify data, each with distinct behaviors that can produce unexpected results if you don’t understand the underlying mechanics. This guide cuts through the confusion and gives you a mental model for choosing the right approach every time.

Setting Up Our Sample Data

Let’s create a DataFrame we’ll use throughout this article:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing', 'Sales'],
    'salary': [95000, 65000, 87000, 72000, 58000],
    'years_experience': [5, 3, 4, 6, 2],
    'remote': [True, False, True, False, True]
}, index=['emp_001', 'emp_002', 'emp_003', 'emp_004', 'emp_005'])

print(df)
            name   department  salary  years_experience  remote
emp_001    Alice  Engineering   95000                 5    True
emp_002      Bob        Sales   65000                 3   False
emp_003  Charlie  Engineering   87000                 4    True
emp_004    Diana    Marketing   72000                 6   False
emp_005      Eve        Sales   58000                 2    True

Label-Based Selection with .loc[]

The .loc[] accessor selects data by label—the actual index values and column names. This is your go-to method when you know the names of what you want.

# Single row by label
df.loc['emp_001']

# Single cell: row label, column label
df.loc['emp_001', 'salary']  # Returns: 95000

# Multiple rows
df.loc[['emp_001', 'emp_003']]

# Multiple rows and specific columns
df.loc[['emp_001', 'emp_003'], ['name', 'salary']]

A critical detail: label slicing with .loc[] is inclusive on both ends. This differs from standard Python slicing:

# Includes both emp_001 AND emp_003
df.loc['emp_001':'emp_003']

Boolean conditions work naturally with .loc[]:

# Filter rows where salary exceeds 70000
df.loc[df['salary'] > 70000]

# Filter with column selection
df.loc[df['salary'] > 70000, ['name', 'department']]

Position-Based Selection with .iloc[]

When you need to select by position rather than label, use .iloc[]. This works like standard Python indexing—zero-based with exclusive end points.

# First row (position 0)
df.iloc[0]

# First three rows
df.iloc[:3]  # Rows at positions 0, 1, 2

# Specific row and column positions
df.iloc[0, 2]  # First row, third column (salary)

# Multiple rows and columns by position
df.iloc[[0, 2, 4], [0, 2]]  # Rows 0,2,4 and columns 0,2

Negative indexing works as expected:

# Last row
df.iloc[-1]

# Last two rows, all columns except the last
df.iloc[-2:, :-1]

When to use which? Use .loc[] when your code’s meaning depends on column names or meaningful index labels. Use .iloc[] when you need positional access—like “the first 10 rows” or “every other column.”

Direct Bracket Notation

Direct bracket notation [] is the shorthand you’ll see everywhere. It’s convenient but has quirks that trip people up.

# Single column - returns Series
df['name']

# Multiple columns - returns DataFrame
df[['name', 'salary']]

Row slicing works but behaves differently than you might expect:

# Slicing rows by position (not label!)
df[0:3]  # First three rows

# This does NOT work - raises KeyError
# df[0]  # Trying to get first row

Here’s the problem: df[x] interprets x as a column name if it’s a single value, but as row positions if it’s a slice. This inconsistency causes bugs:

# If you have integer column names, things get confusing
df_int_cols = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 1])
df_int_cols[0]  # Returns column named 0, not first row

My recommendation: use bracket notation only for column access. For anything involving rows, use .loc[] or .iloc[] explicitly.

Boolean Indexing and Filtering

Boolean indexing is how you filter DataFrames based on conditions. You create a boolean mask—a Series of True/False values—and use it to select rows.

# Create a mask
high_salary = df['salary'] > 70000
print(high_salary)
emp_001     True
emp_002    False
emp_003     True
emp_004     True
emp_005    False
Name: salary, dtype: bool
# Apply the mask
df[high_salary]
# Or inline
df[df['salary'] > 70000]

Combine conditions using & (and), | (or), and ~ (not). Parentheses are mandatory due to operator precedence:

# Engineers with high salary
df[(df['department'] == 'Engineering') & (df['salary'] > 80000)]

# Remote workers OR high experience
df[(df['remote'] == True) | (df['years_experience'] > 4)]

# NOT in Sales
df[~(df['department'] == 'Sales')]

For complex filters, .query() offers better readability:

# Same filter, cleaner syntax
df.query('department == "Engineering" and salary > 80000')

# Variables work with @ prefix
min_salary = 70000
df.query('salary > @min_salary')

# Multiple conditions
df.query('department in ["Engineering", "Marketing"] and remote == True')

Modifying Data with Indexing

This is where most Pandas bugs live. Setting values incorrectly triggers SettingWithCopyWarning or silently fails to modify your data.

The safe way to modify data uses .loc[] or .iloc[]:

# Update a single cell
df.loc['emp_001', 'salary'] = 100000

# Update multiple cells
df.loc[['emp_001', 'emp_002'], 'salary'] = [98000, 68000]

# Conditional update
df.loc[df['department'] == 'Sales', 'salary'] = df.loc[df['department'] == 'Sales', 'salary'] * 1.1

Chained assignment is the enemy. This pattern looks reasonable but often fails:

# BAD - chained assignment
df[df['department'] == 'Sales']['salary'] = 60000  # May not work!

# GOOD - single assignment with .loc[]
df.loc[df['department'] == 'Sales', 'salary'] = 60000

The chained version creates an intermediate copy, and you end up modifying the copy instead of the original DataFrame.

When you intentionally want a copy, be explicit:

# Create an explicit copy
sales_df = df[df['department'] == 'Sales'].copy()
sales_df['bonus'] = sales_df['salary'] * 0.1  # Safe to modify

Best Practices and Performance Tips

Choose your indexing method deliberately:

Scenario Method
Select columns by name df['col'] or df[['col1', 'col2']]
Select rows by label df.loc[label]
Select rows by position df.iloc[pos]
Filter with conditions df.loc[condition] or df.query()
Modify data Always .loc[] or .iloc[]

Performance matters at scale. Here’s a quick comparison:

import timeit

large_df = pd.DataFrame(np.random.randn(100000, 10), 
                        columns=[f'col_{i}' for i in range(10)])

# Boolean indexing
%timeit large_df[large_df['col_0'] > 0]
# ~1.5 ms

# .query() method
%timeit large_df.query('col_0 > 0')
# ~2.5 ms for simple queries, faster for complex ones

# .loc with boolean
%timeit large_df.loc[large_df['col_0'] > 0]
# ~1.5 ms

For simple conditions, direct boolean indexing and .loc[] perform similarly. .query() has overhead for parsing but becomes competitive with complex multi-condition filters due to expression optimization.

Watch for index alignment. Pandas automatically aligns data by index during operations, which can produce unexpected NaN values:

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64

If you want positional alignment instead, use .values or .to_numpy() to work with the underlying arrays.

Master these indexing patterns and you’ll write cleaner, faster, and more reliable Pandas code. The key insight is that explicit is better than implicit—when in doubt, use .loc[] or .iloc[] and state your intentions clearly.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.