Pandas - Iterate Over Rows (iterrows, itertuples)

Row iteration is one of the most common pandas anti-patterns, yet sometimes it is genuinely the right tool. This guide covers when iteration is justified and how to do it as fast as possible.

Key Insights

  • itertuples() is roughly 100x faster than iterrows() for most row iteration tasks, returning lightweight named tuples instead of full Series objects
  • Vectorized operations outperform any row iteration, often by 300-1000x; only iterate when the operation genuinely cannot be vectorized (complex conditionals, API calls, external dependencies)
  • apply() with axis=1 offers middle-ground performance when you need row-wise logic wrapped in a function, though it is still far slower than true vectorization

When You Actually Need Row Iteration

Pandas is built for vectorized operations. Before iterating over rows, exhaust these alternatives:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})

# Bad: Row iteration
total = 0
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']

# Good: Vectorized
total = (df['price'] * df['quantity']).sum()

Valid row iteration use cases:

  • Making API calls with row data
  • Complex conditional logic that can’t be vectorized
  • Interacting with external systems (databases, files)
  • Debugging and data inspection
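One more iteration option worth knowing about: df.to_dict('records') materializes every row as a plain dict. It builds all the dicts up front, so it trades memory for convenience, but column names survive unchanged (even with spaces). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})

# Each element is a plain dict: {'price': 100, 'quantity': 2}, ...
total = 0
for rec in df.to_dict('records'):
    total += rec['price'] * rec['quantity']

print(total)  # 850
```

This reads almost like the iterrows() version but without the per-row Series overhead.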

iterrows(): Convenient but Slow

iterrows() yields index and Series objects for each row. The Series creation overhead makes it the slowest iteration method.

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

for index, row in df.iterrows():
    print(f"Index: {index}")
    print(f"Name: {row['name']}, Age: {row['age']}")
    print(f"Type of row: {type(row)}")  # pandas.Series
    print("---")

Output:

Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.series.Series'>
---
Index: 1
Name: Bob, Age: 30
Type of row: <class 'pandas.core.series.Series'>
---

The Series object provides convenient attribute access but creates significant overhead. Each row is converted to a Series with its own index, dtype checking, and metadata.
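A related caveat worth demonstrating: because each row becomes a single Series, iterrows() coerces the row's values to one common dtype, which can silently change types in mixed-dtype frames:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1], 'float_col': [1.5]})

# The row Series must hold one dtype, so the int is upcast to float
_, row = next(df.iterrows())
print(row['int_col'])        # 1.0, not 1
print(df['int_col'].dtype)   # int64 -- the column itself is unchanged

# itertuples() keeps each value's own type
tup = next(df.itertuples(index=False))
print(tup.int_col)           # 1
```

This dtype coercion is a documented pandas caveat and another reason to prefer itertuples() when types matter.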

itertuples(): The Fast Alternative

itertuples() returns named tuples, which are lightweight and fast to create. This is your default choice when row iteration is necessary.

for row in df.itertuples():
    print(f"Index: {row.Index}")
    print(f"Name: {row.name}, Age: {row.age}")
    print(f"Type of row: {type(row)}")  # pandas named tuple
    print("---")

# Access by position (Index is always position 0)
for row in df.itertuples():
    print(f"{row[1]} is {row[2]} years old")

Output:

Index: 0
Name: Alice, Age: 25
Type of row: <class 'pandas.core.frame.Pandas'>
---

Control the tuple name and index inclusion:

# Custom tuple name, exclude index
for row in df.itertuples(index=False, name='Employee'):
    print(f"{row.name}: ${row.salary}")

# Without index, position 0 is first column
for row in df.itertuples(index=False):
    print(row[0])  # name column
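When downstream code expects mappings rather than tuples (JSON payloads, keyword arguments), the standard namedtuple method _asdict() converts a row to a dict; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

# _asdict() maps column name -> value for each row tuple
records = [row._asdict() for row in df.itertuples(index=False)]
print(records[0]['name'], records[0]['salary'])  # Alice 50000
```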

Performance Comparison

Here’s a realistic benchmark with 10,000 rows:

import time

# Create test DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 100, 10000),
    'B': np.random.randint(0, 100, 10000),
    'C': np.random.randint(0, 100, 10000)
})

# Method 1: iterrows
start = time.perf_counter()
result = []
for idx, row in df.iterrows():
    result.append(row['A'] + row['B'] + row['C'])
print(f"iterrows: {time.perf_counter() - start:.4f}s")

# Method 2: itertuples
start = time.perf_counter()
result = []
for row in df.itertuples(index=False):
    result.append(row.A + row.B + row.C)
print(f"itertuples: {time.perf_counter() - start:.4f}s")

# Method 3: apply
start = time.perf_counter()
result = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(f"apply: {time.perf_counter() - start:.4f}s")

# Method 4: Vectorized
start = time.perf_counter()
result = df['A'] + df['B'] + df['C']
print(f"vectorized: {time.perf_counter() - start:.4f}s")

Typical results:

iterrows: 1.2450s
itertuples: 0.0125s
apply: 0.0890s
vectorized: 0.0012s
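If itertuples() is still too slow and the logic truly resists vectorization, one further option is iterating over zipped NumPy arrays, which skips per-row object creation entirely, at the cost of losing column names inside the loop. A sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.random.randint(0, 100, 10000),
    'B': np.random.randint(0, 100, 10000),
    'C': np.random.randint(0, 100, 10000)
})

# zip over the raw arrays: typically faster than itertuples()
result = [a + b + c for a, b, c in zip(df['A'].to_numpy(),
                                       df['B'].to_numpy(),
                                       df['C'].to_numpy())]

# Same answer as the vectorized version
assert result == (df['A'] + df['B'] + df['C']).tolist()
```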

Handling Column Names with Spaces

Named tuples require valid Python identifiers. When a column name contains spaces, hyphens, or other invalid characters, pandas renames that field to a positional name (_1, _2, ...):

df = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Age-Years': [25, 30]
})

for row in df.itertuples():
    # Invalid identifiers become positional fields _1, _2, _3
    # (the index occupies position 0 by default)
    print(f"{row._1} {row._2}, {row._3}")

If column names conflict with reserved words or start with numbers, access by position:

df = pd.DataFrame({
    'class': ['A', 'B'],
    '1st_score': [95, 88]
})

for row in df.itertuples(index=False):
    # Access by position when names are problematic
    print(f"Class: {row[0]}, Score: {row[1]}")
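An alternative to positional access is normalizing column names once before the loop, so itertuples() yields readable fields. A sketch, where the lambda is just one possible cleanup rule:

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Age-Years': [25, 30]
})

# Replace spaces and hyphens so every column is a valid identifier
clean = df.rename(columns=lambda c: c.replace(' ', '_').replace('-', '_'))

for row in clean.itertuples(index=False):
    print(f"{row.First_Name}: {row.Age_Years}")
```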

Modifying DataFrames During Iteration

Never modify a DataFrame while iterating over it. The iterators return copies, not views: writing to the row object has no effect on the DataFrame, and growing or shrinking the DataFrame mid-loop leads to unexpected behavior.

# Wrong: Modifying during iteration
for idx, row in df.iterrows():
    df.loc[idx, 'new_column'] = row['age'] * 2  # Slow and error-prone

# Right: Collect results, then assign
results = []
for row in df.itertuples():
    results.append(row.age * 2)
df['new_column'] = results

# Better: Vectorized
df['new_column'] = df['age'] * 2

Practical Example: API Calls with Rate Limiting

A legitimate use case where row iteration is appropriate:

import time
import requests

def enrich_user_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fetch additional data from API for each user."""
    enriched_data = []
    
    for row in df.itertuples(index=False):
        try:
            # Simulate API call
            response = requests.get(
                f"https://api.example.com/users/{row.user_id}",
                timeout=5
            )
            data = response.json()
            
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': data.get('premium', False),
                'joined_date': data.get('joined_date')
            })
            
        except requests.RequestException as e:
            print(f"Error fetching user {row.user_id}: {e}")
            enriched_data.append({
                'user_id': row.user_id,
                'name': row.name,
                'premium': None,
                'joined_date': None
            })
        
        # Rate limiting
        time.sleep(0.1)
    
    return pd.DataFrame(enriched_data)

# Usage
users_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

enriched_df = enrich_user_data(users_df)

Complex Conditional Logic

When business logic has multiple interdependent conditions:

def categorize_transaction(df: pd.DataFrame) -> pd.DataFrame:
    """Complex categorization that's hard to vectorize."""
    categories = []
    
    for row in df.itertuples():
        if row.amount > 1000 and row.country == 'US':
            if row.customer_type == 'premium':
                category = 'high_value_domestic'
            else:
                category = 'review_required'
        elif row.amount > 500 and row.fraud_score > 0.7:
            category = 'fraud_check'
        elif row.merchant_category in ['gambling', 'crypto']:
            category = 'restricted'
        else:
            category = 'standard'
        
        categories.append(category)
    
    df['category'] = categories
    return df

Even here, consider np.select() for vectorization:

conditions = [
    (df['amount'] > 1000) & (df['country'] == 'US') & (df['customer_type'] == 'premium'),
    (df['amount'] > 1000) & (df['country'] == 'US'),
    (df['amount'] > 500) & (df['fraud_score'] > 0.7),
    df['merchant_category'].isin(['gambling', 'crypto'])
]

choices = ['high_value_domestic', 'review_required', 'fraud_check', 'restricted']

df['category'] = np.select(conditions, choices, default='standard')

Memory Considerations

For massive DataFrames, iteration can be memory-efficient compared to creating intermediate arrays:

# Process in chunks without loading everything to memory
def process_large_file(filepath: str, chunksize: int = 10000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        for row in chunk.itertuples():
            # process_row is your per-row handler, defined elsewhere
            yield process_row(row)
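A self-contained version of the pattern, using io.StringIO to stand in for a file on disk (the column names and chunk size here are illustrative):

```python
import io
import pandas as pd

csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

totals = []
# read_csv with chunksize yields DataFrames lazily, so only one
# chunk's rows are in memory at a time
for chunk in pd.read_csv(csv_data, chunksize=2):
    for row in chunk.itertuples(index=False):
        totals.append(int(row.a + row.b))

print(totals)  # [3, 7, 11]
```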

Prefer itertuples() over iterrows() in every case; the performance difference becomes critical at scale. But first, ask whether you really need row iteration at all. Vectorized operations aren’t just faster; they’re more maintainable and more Pythonic.
