Pandas - Iterate Over Rows (iterrows, itertuples)
Key Insights
- itertuples() is 100x faster than iterrows() for most row iteration tasks, returning named tuples with minimal overhead compared to Series objects
- Vectorized operations outperform any row iteration by 300-1000x — only iterate when operations cannot be vectorized (complex conditionals, API calls, external dependencies)
- apply() with axis=1 provides middle-ground performance when you need row-wise operations but can encapsulate logic in a function, though it’s still slower than true vectorization
When You Actually Need Row Iteration
Pandas is built for vectorized operations. Before iterating over rows, exhaust these alternatives:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})

# Bad: Row iteration
total = 0
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']

# Good: Vectorized
total = (df['price'] * df['quantity']).sum()
Valid row iteration use cases:
- Making API calls with row data
- Complex conditional logic that can’t be vectorized
- Interacting with external systems (databases, files)
- Debugging and data inspection
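Many conditionals that look loop-shaped can still be vectorized. A minimal sketch using np.where (the 120 threshold and 10% discount are hypothetical values, not from the text above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 150]})

# Apply a 10% discount only to items above the threshold, element-wise
df['final'] = np.where(df['price'] > 120, df['price'] * 0.9, df['price'])
print(df['final'].tolist())  # [100.0, 180.0, 135.0]
```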
iterrows(): Convenient but Slow
iterrows() yields index and Series objects for each row. The Series creation overhead makes it the slowest iteration method.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

for index, row in df.iterrows():
    print(f"Index: {index}")
    print(f"Name: {row['name']}, Age: {row['age']}")
    print(f"Type of row: {type(row)}")  # pandas.Series
    print("---")
Output:
Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.series.Series'>
---
Index: 1
Name: Bob, Age: 30
Type of row: <class 'pandas.core.series.Series'>
---
The Series object provides convenient label-based access but creates significant overhead: each row is converted to a Series with its own index, dtype checks, and metadata. Because a Series holds a single dtype, mixed-type rows are also upcast to a common type.
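A small sketch of the dtype pitfall (column names are illustrative): with one integer and one float column, iterrows() hands back float values for both, while itertuples() keeps the integer an integer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})  # int column + float column

_, first = next(df.iterrows())
print(type(first['a']))  # a float type: the row was upcast to a common dtype

tup = next(df.itertuples(index=False))
print(type(tup.a))       # an integer type: original column dtype preserved
```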
itertuples(): The Fast Alternative
itertuples() returns named tuples, which are lightweight and fast to create. This is your default choice when row iteration is necessary.
for row in df.itertuples():
    print(f"Index: {row.Index}")
    print(f"Name: {row.name}, Age: {row.age}")
    print(f"Type of row: {type(row)}")  # pandas named tuple
    print("---")

# Access by position (Index is always position 0)
for row in df.itertuples():
    print(f"{row[1]} is {row[2]} years old")
Output:
Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.frame.Pandas'>
---
Control the tuple name and index inclusion:
# Custom tuple name, exclude index
for row in df.itertuples(index=False, name='Employee'):
    print(f"{row.name}: ${row.salary}")

# Without index, position 0 is first column
for row in df.itertuples(index=False):
    print(row[0])  # name column
Performance Comparison
Here’s a realistic benchmark with 10,000 rows:
import time

# Create test DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 100, 10000),
    'B': np.random.randint(0, 100, 10000),
    'C': np.random.randint(0, 100, 10000)
})

# Method 1: iterrows
start = time.perf_counter()
result = []
for idx, row in df.iterrows():
    result.append(row['A'] + row['B'] + row['C'])
print(f"iterrows: {time.perf_counter() - start:.4f}s")

# Method 2: itertuples
start = time.perf_counter()
result = []
for row in df.itertuples(index=False):
    result.append(row.A + row.B + row.C)
print(f"itertuples: {time.perf_counter() - start:.4f}s")

# Method 3: apply
start = time.perf_counter()
result = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(f"apply: {time.perf_counter() - start:.4f}s")

# Method 4: Vectorized
start = time.perf_counter()
result = df['A'] + df['B'] + df['C']
print(f"vectorized: {time.perf_counter() - start:.4f}s")
Typical results:
iterrows: 1.2450s
itertuples: 0.0125s
apply: 0.0890s
vectorized: 0.0012s
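Another option worth benchmarking on your own data is iterating over plain dicts via to_dict('records'). It keeps the original column names (no named-tuple renaming) at the cost of materializing the whole record list up front:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100, 200]})

result = []
for rec in df.to_dict('records'):
    # Each rec is an ordinary dict keyed by the original column names
    result.append(rec['A'] + rec['B'] + rec['C'])
print(result)  # [111, 222]
```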
Handling Column Names with Spaces
Named tuples require valid Python identifiers. When a column name contains spaces, hyphens, or other invalid characters, itertuples() renames it to a positional placeholder (_1, _2, ...):

df = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Age-Years': [25, 30]
})

for row in df.itertuples():
    # Invalid names become _1, _2, _3 (Index occupies position 0)
    print(f"{row._1} {row._2}, {row._3}")
If column names conflict with reserved words or start with numbers, access by position:
df = pd.DataFrame({
    'class': ['A', 'B'],
    '1st_score': [95, 88]
})

for row in df.itertuples(index=False):
    # Access by position when names are problematic
    print(f"Class: {row[0]}, Score: {row[1]}")
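If you'd rather keep the original names than count positions, one workaround is zipping each tuple back against df.columns:

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'B'],
    '1st_score': [95, 88]
})

for row in df.itertuples(index=False):
    rec = dict(zip(df.columns, row))  # restore the original column names
    print(f"Class: {rec['class']}, Score: {rec['1st_score']}")
```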
Modifying DataFrames During Iteration
Never modify a DataFrame while iterating over it. The row objects are copies, not views: writing to them never propagates back to the DataFrame, and assigning with .loc inside the loop is slow and error-prone.
# Wrong: Modifying during iteration
for idx, row in df.iterrows():
    df.loc[idx, 'new_column'] = row['age'] * 2  # Slow and error-prone

# Right: Collect results, then assign
results = []
for row in df.itertuples():
    results.append(row.age * 2)
df['new_column'] = results

# Better: Vectorized
df['new_column'] = df['age'] * 2
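When you do need the loop, the collect-then-assign pattern compresses naturally into a list comprehension:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# One pass over named tuples, then a single column assignment
df['new_column'] = [row.age * 2 for row in df.itertuples()]
print(df['new_column'].tolist())  # [50, 60, 70]
```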
Practical Example: API Calls with Rate Limiting
A legitimate use case where row iteration is appropriate:
import time

import requests

def enrich_user_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fetch additional data from API for each user."""
    enriched_data = []

    for row in df.itertuples(index=False):
        try:
            # Simulate API call
            response = requests.get(
                f"https://api.example.com/users/{row.user_id}",
                timeout=5
            )
            data = response.json()
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': data.get('premium', False),
                'joined_date': data.get('joined_date')
            })
        except requests.RequestException as e:
            print(f"Error fetching user {row.user_id}: {e}")
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': None,
                'joined_date': None
            })

        # Rate limiting
        time.sleep(0.1)

    return pd.DataFrame(enriched_data)
# Usage
users_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
enriched_df = enrich_user_data(users_df)
Complex Conditional Logic
When business logic has multiple interdependent conditions:
def categorize_transaction(df: pd.DataFrame) -> pd.DataFrame:
    """Complex categorization that's hard to vectorize."""
    categories = []

    for row in df.itertuples():
        if row.amount > 1000 and row.country == 'US':
            if row.customer_type == 'premium':
                category = 'high_value_domestic'
            else:
                category = 'review_required'
        elif row.amount > 500 and row.fraud_score > 0.7:
            category = 'fraud_check'
        elif row.merchant_category in ['gambling', 'crypto']:
            category = 'restricted'
        else:
            category = 'standard'
        categories.append(category)

    df['category'] = categories
    return df
Even here, consider np.select() for vectorization:
conditions = [
    (df['amount'] > 1000) & (df['country'] == 'US') & (df['customer_type'] == 'premium'),
    (df['amount'] > 1000) & (df['country'] == 'US'),
    (df['amount'] > 500) & (df['fraud_score'] > 0.7),
    df['merchant_category'].isin(['gambling', 'crypto'])
]
choices = ['high_value_domestic', 'review_required', 'fraud_check', 'restricted']
df['category'] = np.select(conditions, choices, default='standard')
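np.select evaluates conditions in order and the first match wins, so the list must mirror the if/elif chain with the most specific condition first. A quick check on a few hypothetical rows (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'amount': [1500, 1200, 600, 100, 50],
    'country': ['US', 'US', 'CA', 'US', 'CA'],
    'customer_type': ['premium', 'basic', 'basic', 'premium', 'basic'],
    'fraud_score': [0.1, 0.2, 0.9, 0.1, 0.3],
    'merchant_category': ['retail', 'retail', 'retail', 'crypto', 'retail'],
})

conditions = [
    (df['amount'] > 1000) & (df['country'] == 'US') & (df['customer_type'] == 'premium'),
    (df['amount'] > 1000) & (df['country'] == 'US'),
    (df['amount'] > 500) & (df['fraud_score'] > 0.7),
    df['merchant_category'].isin(['gambling', 'crypto']),
]
choices = ['high_value_domestic', 'review_required', 'fraud_check', 'restricted']
df['category'] = np.select(conditions, choices, default='standard')

print(df['category'].tolist())
# ['high_value_domestic', 'review_required', 'fraud_check', 'restricted', 'standard']
```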
Memory Considerations
For massive DataFrames, iteration can be memory-efficient compared to creating intermediate arrays:
# Process in chunks without loading everything into memory
def process_large_file(filepath: str, chunksize: int = 10000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        for row in chunk.itertuples(index=False):
            # process_row is a placeholder for your per-row logic
            yield process_row(row)
Always prefer itertuples() over iterrows(); the performance difference becomes critical at scale. But first, ask whether you really need row iteration at all. Vectorized operations aren't just faster; they're more maintainable and more Pythonic.