Pandas - Iterate Over Rows (iterrows, itertuples)
Key Insights
- itertuples() is 100x faster than iterrows() for most row iteration tasks, returning named tuples with minimal overhead compared to Series objects
- Vectorized operations outperform any row iteration by 300-1000x — only iterate when operations cannot be vectorized (complex conditionals, API calls, external dependencies)
- apply() with axis=1 provides middle-ground performance when you need row-wise operations but can encapsulate logic in a function, though it’s still slower than true vectorization
When You Actually Need Row Iteration
Pandas is built for vectorized operations. Before iterating over rows, exhaust these alternatives:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})

# Bad: Row iteration
total = 0
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']

# Good: Vectorized
total = (df['price'] * df['quantity']).sum()
Valid row iteration use cases:
- Making API calls with row data
- Complex conditional logic that can’t be vectorized
- Interacting with external systems (databases, files)
- Debugging and data inspection
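Many conditionals that look loop-shaped can still be vectorized. A minimal sketch using np.where (the 120 threshold and 10% discount are hypothetical values, not from the text above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 150]})

# Apply a 10% discount only to items above the threshold, element-wise
df['final'] = np.where(df['price'] > 120, df['price'] * 0.9, df['price'])
print(df['final'].tolist())  # [100.0, 180.0, 135.0]
```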
iterrows(): Convenient but Slow
iterrows() yields index and Series objects for each row. The Series creation overhead makes it the slowest iteration method.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

for index, row in df.iterrows():
    print(f"Index: {index}")
    print(f"Name: {row['name']}, Age: {row['age']}")
    print(f"Type of row: {type(row)}")  # pandas.Series
    print("---")
Output:
Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.series.Series'>
---
Index: 1
Name: Bob, Age: 30
Type of row: <class 'pandas.core.series.Series'>
---
The Series object provides convenient label-based access but creates significant overhead: each row is converted to a Series with its own index, dtype checks, and metadata. Because a Series holds a single dtype, mixed-type rows are also upcast to a common type.
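A small sketch of the dtype pitfall (column names are illustrative): with one integer and one float column, iterrows() hands back float values for both, while itertuples() keeps the integer an integer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})  # int column + float column

_, first = next(df.iterrows())
print(type(first['a']))  # a float type: the row was upcast to a common dtype

tup = next(df.itertuples(index=False))
print(type(tup.a))       # an integer type: original column dtype preserved
```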
itertuples(): The Fast Alternative
itertuples() returns named tuples, which are lightweight and fast to create. This is your default choice when row iteration is necessary.
for row in df.itertuples():
    print(f"Index: {row.Index}")
    print(f"Name: {row.name}, Age: {row.age}")
    print(f"Type of row: {type(row)}")  # pandas named tuple
    print("---")

# Access by position (Index is always position 0)
for row in df.itertuples():
    print(f"{row[1]} is {row[2]} years old")
Output:
Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.frame.Pandas'>
---
Control the tuple name and index inclusion:
# Custom tuple name, exclude index
for row in df.itertuples(index=False, name='Employee'):
    print(f"{row.name}: ${row.salary}")

# Without index, position 0 is first column
for row in df.itertuples(index=False):
    print(row[0])  # name column
Performance Comparison
Here’s a realistic benchmark with 10,000 rows:
import time

# Create test DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 100, 10000),
    'B': np.random.randint(0, 100, 10000),
    'C': np.random.randint(0, 100, 10000)
})

# Method 1: iterrows
start = time.perf_counter()
result = []
for idx, row in df.iterrows():
    result.append(row['A'] + row['B'] + row['C'])
print(f"iterrows: {time.perf_counter() - start:.4f}s")

# Method 2: itertuples
start = time.perf_counter()
result = []
for row in df.itertuples(index=False):
    result.append(row.A + row.B + row.C)
print(f"itertuples: {time.perf_counter() - start:.4f}s")

# Method 3: apply
start = time.perf_counter()
result = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(f"apply: {time.perf_counter() - start:.4f}s")

# Method 4: Vectorized
start = time.perf_counter()
result = df['A'] + df['B'] + df['C']
print(f"vectorized: {time.perf_counter() - start:.4f}s")
Typical results:
iterrows: 1.2450s
itertuples: 0.0125s
apply: 0.0890s
vectorized: 0.0012s
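Another option worth benchmarking on your own data is iterating over plain dicts via to_dict('records'). It keeps the original column names (no named-tuple renaming) at the cost of materializing the whole record list up front:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100, 200]})

result = []
for rec in df.to_dict('records'):
    # Each rec is an ordinary dict keyed by the original column names
    result.append(rec['A'] + rec['B'] + rec['C'])
print(result)  # [111, 222]
```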
Handling Column Names with Spaces
Named tuples require valid Python identifiers. When a column name contains spaces, hyphens, or other invalid characters, itertuples() renames it to a positional placeholder (_1, _2, ...):

df = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Age-Years': [25, 30]
})

for row in df.itertuples():
    # Invalid names become _1, _2, _3 (Index occupies position 0)
    print(f"{row._1} {row._2}, {row._3}")
If column names conflict with reserved words or start with numbers, access by position:
df = pd.DataFrame({
    'class': ['A', 'B'],
    '1st_score': [95, 88]
})

for row in df.itertuples(index=False):
    # Access by position when names are problematic
    print(f"Class: {row[0]}, Score: {row[1]}")
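If you'd rather keep the original names than count positions, one workaround is zipping each tuple back against df.columns:

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'B'],
    '1st_score': [95, 88]
})

for row in df.itertuples(index=False):
    rec = dict(zip(df.columns, row))  # restore the original column names
    print(f"Class: {rec['class']}, Score: {rec['1st_score']}")
```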
Modifying DataFrames During Iteration
Never modify a DataFrame while iterating over it. The row objects are copies, not views: writing to them never propagates back to the DataFrame, and assigning with .loc inside the loop is slow and error-prone.
# Wrong: Modifying during iteration
for idx, row in df.iterrows():
    df.loc[idx, 'new_column'] = row['age'] * 2  # Slow and error-prone

# Right: Collect results, then assign
results = []
for row in df.itertuples():
    results.append(row.age * 2)
df['new_column'] = results

# Better: Vectorized
df['new_column'] = df['age'] * 2
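When you do need the loop, the collect-then-assign pattern compresses naturally into a list comprehension:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# One pass over named tuples, then a single column assignment
df['new_column'] = [row.age * 2 for row in df.itertuples()]
print(df['new_column'].tolist())  # [50, 60, 70]
```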
Practical Example: API Calls with Rate Limiting
A legitimate use case where row iteration is appropriate:
import time

import requests

def enrich_user_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fetch additional data from API for each user."""
    enriched_data = []

    for row in df.itertuples(index=False):
        try:
            # Simulate API call
            response = requests.get(
                f"https://api.example.com/users/{row.user_id}",
                timeout=5
            )
            data = response.json()
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': data.get('premium', False),
                'joined_date': data.get('joined_date')
            })
        except requests.RequestException as e:
            print(f"Error fetching user {row.user_id}: {e}")
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': None,
                'joined_date': None
            })

        # Rate limiting
        time.sleep(0.1)

    return pd.DataFrame(enriched_data)
# Usage
users_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
enriched_df = enrich_user_data(users_df)
Complex Conditional Logic
When business logic has multiple interdependent conditions:
def categorize_transaction(df: pd.DataFrame) -> pd.DataFrame:
    """Complex categorization that's hard to vectorize."""
    categories = []

    for row in df.itertuples():
        if row.amount > 1000 and row.country == 'US':
            if row.customer_type == 'premium':
                category = 'high_value_domestic'
            else:
                category = 'review_required'
        elif row.amount > 500 and row.fraud_score > 0.7:
            category = 'fraud_check'
        elif row.merchant_category in ['gambling', 'crypto']:
            category = 'restricted'
        else:
            category = 'standard'
        categories.append(category)

    df['category'] = categories
    return df
Even here, consider np.select() for vectorization:
conditions = [
    (df['amount'] > 1000) & (df['country'] == 'US') & (df['customer_type'] == 'premium'),
    (df['amount'] > 1000) & (df['country'] == 'US'),
    (df['amount'] > 500) & (df['fraud_score'] > 0.7),
    df['merchant_category'].isin(['gambling', 'crypto'])
]
choices = ['high_value_domestic', 'review_required', 'fraud_check', 'restricted']
df['category'] = np.select(conditions, choices, default='standard')
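np.select evaluates conditions in order and the first match wins, so the list must mirror the if/elif chain with the most specific condition first. A quick check on a few hypothetical rows (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'amount': [1500, 1200, 600, 100, 50],
    'country': ['US', 'US', 'CA', 'US', 'CA'],
    'customer_type': ['premium', 'basic', 'basic', 'premium', 'basic'],
    'fraud_score': [0.1, 0.2, 0.9, 0.1, 0.3],
    'merchant_category': ['retail', 'retail', 'retail', 'crypto', 'retail'],
})

conditions = [
    (df['amount'] > 1000) & (df['country'] == 'US') & (df['customer_type'] == 'premium'),
    (df['amount'] > 1000) & (df['country'] == 'US'),
    (df['amount'] > 500) & (df['fraud_score'] > 0.7),
    df['merchant_category'].isin(['gambling', 'crypto']),
]
choices = ['high_value_domestic', 'review_required', 'fraud_check', 'restricted']
df['category'] = np.select(conditions, choices, default='standard')

print(df['category'].tolist())
# ['high_value_domestic', 'review_required', 'fraud_check', 'restricted', 'standard']
```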
Memory Considerations
For massive DataFrames, iteration can be memory-efficient compared to creating intermediate arrays:
# Process in chunks without loading everything into memory
def process_large_file(filepath: str, chunksize: int = 10000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        for row in chunk.itertuples(index=False):
            # process_row is a placeholder for your per-row logic
            yield process_row(row)
Always prefer itertuples() over iterrows(); the performance difference becomes critical at scale. But first, ask whether you really need row iteration at all. Vectorized operations aren't just faster; they're more maintainable and more Pythonic.