How to Iterate Over Rows in Pandas
Key Insights
- Row iteration in Pandas is almost always the wrong approach—vectorized operations can be 100x faster and should be your default choice
- When you must iterate, itertuples() outperforms iterrows() by 10-100x because it returns lightweight namedtuples instead of Series objects
- The apply() method with axis=1 offers a middle ground, but it’s still just a dressed-up loop and won’t match vectorized performance
Row iteration is one of those topics where knowing how to do something is less important than knowing when to do it. Pandas is built on NumPy, which processes entire arrays in optimized C code. The moment you write a Python for loop over DataFrame rows, you’re throwing away that performance advantage. That said, sometimes iteration is genuinely necessary—complex conditional logic, API calls per row, or operations that depend on previous row state. Let’s cover all your options, from the methods you should avoid to the ones you should reach for first.
Using iterrows() - The Basic Approach
The iterrows() method is what most developers discover first. It yields pairs of (index, Series) for each row, making it intuitive to work with.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket'],
'price': [25.99, 49.99, 12.50],
'quantity': [100, 45, 200]
})
for index, row in df.iterrows():
print(f"Row {index}: {row['product']} costs ${row['price']}")
Output:
Row 0: Widget costs $25.99
Row 1: Gadget costs $49.99
Row 2: Sprocket costs $12.50
The API is clean and readable. You access columns by name, and the index is right there if you need it. However, iterrows() has two significant problems.
First, it’s slow. Each row gets converted into a Pandas Series object, which involves substantial overhead. Second, it doesn’t preserve dtypes. If you have a DataFrame with integers and floats, iterrows() may convert everything to float64 to fit into a single Series. This can cause subtle bugs:
df = pd.DataFrame({'id': [1, 2, 3], 'value': [1.5, 2.5, 3.5]})
for index, row in df.iterrows():
print(f"id type: {type(row['id'])}") # <class 'numpy.float64'>, not int64
Use iterrows() only for quick debugging, small datasets where performance doesn’t matter, or when you genuinely need the row as a Series for downstream processing.
Using itertuples() - The Faster Alternative
The itertuples() method returns each row as a namedtuple, which is a lightweight Python object. This eliminates the Series creation overhead and preserves dtypes.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket'],
'price': [25.99, 49.99, 12.50],
'quantity': [100, 45, 200]
})
for row in df.itertuples():
print(f"Row {row.Index}: {row.product} costs ${row.price}")
Output:
Row 0: Widget costs $25.99
Row 1: Gadget costs $49.99
Row 2: Sprocket costs $12.50
Notice that you access values as attributes (row.product) rather than dictionary-style access. The index is available as row.Index (capital I).
There’s one gotcha: column names that aren’t valid Python identifiers get replaced with positional names (_1, _2, and so on). If your column is named unit price or 2024_sales, you’ll need to access it by position:
df = pd.DataFrame({'unit price': [25.99, 49.99]})
for row in df.itertuples():
# row.unit price won't work
print(row[1]) # Access by position instead
You can also disable the index and customize the tuple name:
for row in df.itertuples(index=False, name='Product'):
print(type(row)) # a namedtuple class named 'Product'
When iteration is unavoidable, itertuples() should be your default choice. It’s consistently 10-100x faster than iterrows().
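A common pattern when iterating to build a new column is to collect results in a plain list and assign once at the end. A minimal sketch (reusing the products DataFrame from above):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'price': [25.99, 49.99, 12.50],
    'quantity': [100, 45, 200]
})

# Build results in a plain Python list, then assign the column once.
# Writing into the DataFrame cell-by-cell inside the loop would be
# far slower than a single column assignment.
totals = []
for row in df.itertuples(index=False):
    totals.append(row.price * row.quantity)
df['total'] = totals
print(df[['product', 'total']])
```

The single assignment at the end keeps the expensive DataFrame machinery out of the loop body.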
Using apply() - Row-wise Function Application
The apply() method with axis=1 lets you run a function on each row without writing an explicit loop. It’s cleaner syntax, but don’t be fooled—it’s still iteration under the hood.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket'],
'price': [25.99, 49.99, 12.50],
'quantity': [100, 45, 200]
})
def calculate_total(row):
base = row['price'] * row['quantity']
# Apply bulk discount for large quantities
if row['quantity'] > 100:
return base * 0.9
return base
df['total'] = df.apply(calculate_total, axis=1)
print(df)
Output:
product price quantity total
0 Widget 25.99 100 2599.00
1 Gadget 49.99 45 2249.55
2 Sprocket 12.50 200 2250.00
The apply() approach shines when you have complex row-wise logic that would be awkward to vectorize. It’s more readable than a loop with manual assignment, and it integrates naturally with method chaining.
For simple operations, you can use lambda functions:
df['revenue'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)
But here’s the thing: that lambda is slower than the equivalent vectorized operation. Use apply() when the logic genuinely requires row context—like calling external functions, handling complex conditionals, or when readability trumps performance for your use case.
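As a sketch of logic that genuinely requires row context, here is an invented labeling rule that mixes string building with a conditional — the kind of thing that is awkward to express as pure column arithmetic:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'price': [25.99, 49.99]
})

# Hypothetical rule for illustration: combine a string field with a
# price-based tier in one formatted label.
def make_label(row):
    tier = 'premium' if row['price'] > 30 else 'standard'
    return f"{row['product']} ({tier})"

df['label'] = df.apply(make_label, axis=1)
print(df['label'].tolist())  # ['Widget (standard)', 'Gadget (premium)']
```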
Vectorized Operations - The Preferred Approach
Vectorization means expressing operations on entire columns at once, letting Pandas and NumPy handle the iteration in optimized compiled code. This should be your first instinct for any data transformation.
Let’s rewrite the previous examples without any iteration:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket'],
'price': [25.99, 49.99, 12.50],
'quantity': [100, 45, 200]
})
# Simple calculation - just use column operations
df['revenue'] = df['price'] * df['quantity']
# Conditional logic - use np.where or boolean indexing
df['total'] = np.where(
df['quantity'] > 100,
df['price'] * df['quantity'] * 0.9,
df['price'] * df['quantity']
)
print(df)
Output:
product price quantity revenue total
0 Widget 25.99 100 2599.00 2599.00
1 Gadget 49.99 45 2249.55 2249.55
2 Sprocket 12.50 200 2500.00 2250.00
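The comment above also mentions boolean indexing as an alternative to np.where. A minimal sketch of that style, reusing the same bulk-discount rule:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [25.99, 49.99, 12.50],
    'quantity': [100, 45, 200]
})

# Start from the undiscounted total, then overwrite only the rows
# matching the bulk-discount condition via a boolean mask and .loc.
df['total'] = df['price'] * df['quantity']
mask = df['quantity'] > 100
df.loc[mask, 'total'] *= 0.9
print(df['total'].tolist())
```

This reads naturally when the discounted case is the exception: compute the default for everyone, then patch the special rows in one masked assignment.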
For more complex conditions, use np.select:
conditions = [
df['quantity'] > 150,
df['quantity'] > 100,
df['quantity'] > 50
]
choices = [0.85, 0.90, 0.95] # Discount multipliers
df['discount_rate'] = np.select(conditions, choices, default=1.0)
df['final_price'] = df['price'] * df['quantity'] * df['discount_rate']
String operations are also vectorized through the .str accessor:
df['product_upper'] = df['product'].str.upper()
df['product_length'] = df['product'].str.len()
The mental shift is this: stop thinking “for each row, do X” and start thinking “take column A, transform it, store in column B.” Once you internalize this pattern, you’ll rarely need explicit iteration.
Performance Comparison
Let’s quantify the performance differences with a realistic benchmark:
import pandas as pd
import numpy as np
import time
# Create a larger dataset
n_rows = 100_000
df = pd.DataFrame({
'a': np.random.randn(n_rows),
'b': np.random.randn(n_rows),
'c': np.random.randint(0, 100, n_rows)
})
def benchmark(name, func):
start = time.perf_counter()
func()
elapsed = time.perf_counter() - start
print(f"{name}: {elapsed:.4f} seconds")
# Method 1: iterrows()
def using_iterrows():
result = []
for idx, row in df.iterrows():
result.append(row['a'] * row['b'] + row['c'])
return result
# Method 2: itertuples()
def using_itertuples():
result = []
for row in df.itertuples():
result.append(row.a * row.b + row.c)
return result
# Method 3: apply()
def using_apply():
return df.apply(lambda row: row['a'] * row['b'] + row['c'], axis=1)
# Method 4: Vectorized
def using_vectorized():
return df['a'] * df['b'] + df['c']
benchmark("iterrows()", using_iterrows)
benchmark("itertuples()", using_itertuples)
benchmark("apply()", using_apply)
benchmark("vectorized", using_vectorized)
Typical output on a modern machine:
iterrows(): 4.2341 seconds
itertuples(): 0.1823 seconds
apply(): 1.0567 seconds
vectorized: 0.0012 seconds
The vectorized approach is roughly 3,500x faster than iterrows() and 150x faster than itertuples(). These ratios hold across different operations and dataset sizes. The performance gap only widens as your data grows.
Conclusion
The decision tree is straightforward:
1. Try vectorization first. If you can express your operation using column arithmetic, np.where, np.select, or built-in Pandas methods, do that. It’s faster and usually more readable once you’re comfortable with the syntax.
2. Use itertuples() when iteration is unavoidable. This includes scenarios like making API calls per row, writing to external systems, or operations where each row genuinely depends on complex state. It’s the fastest iteration method and preserves dtypes.
3. Use apply() for complex row logic that’s awkward to vectorize. It won’t be fast, but it’s cleaner than a manual loop and works well in method chains.
4. Reserve iterrows() for debugging and exploration. It’s the slowest option, but the Series return type can be convenient when you’re poking around in a notebook.
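For the "API call per row" case, a minimal itertuples() sketch — send_to_api here is a stand-in for a real external call, not a library function:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})

sent = []

def send_to_api(row_id, value):
    # Stand-in for a real HTTP request or external write;
    # here we just record what would have been sent.
    sent.append((row_id, value))

# itertuples() keeps per-row overhead low, though in practice the
# dominant cost would be the external call itself.
for row in df.itertuples(index=False):
    send_to_api(row.id, row.value)

print(f"sent {len(sent)} records")  # sent 3 records
```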
The broader lesson: Pandas rewards you for thinking in columns rather than rows. Invest time learning np.where, np.select, and the various Pandas transform methods. That investment pays dividends every time you process data.