How to Use Apply in Pandas
Key Insights
- The apply() function is pandas' Swiss Army knife for custom transformations, but it's often slower than vectorized alternatives; use it when built-in methods won't cut it, not as your default approach.
- Understanding the axis parameter is crucial: axis=0 applies your function to each column (the default), while axis=1 applies it to each row. Getting this wrong is one of the most common pandas mistakes.
- Always prefer vectorized operations for performance-critical code; reserve apply() for complex logic that genuinely requires row-by-row or element-by-element processing.
Introduction to Apply
The apply() function in pandas lets you run custom functions across your data. It’s the escape hatch you reach for when pandas’ built-in methods don’t cover your use case. Need to parse a custom date format? Apply. Want to categorize values based on complex business logic? Apply. Have to call an external API for each row? Apply (though you probably shouldn’t).
Here’s the thing: apply() is powerful but often misused. New pandas users treat it as a hammer for every nail, when vectorized operations would be faster and cleaner. By the end of this article, you’ll know exactly when apply() is the right tool and when to reach for something else.
Apply on Series
The simplest use of apply() is on a Series—a single column of your DataFrame. You pass a function, and pandas calls it once for each element.
import pandas as pd

# Sample data
df = pd.DataFrame({
    'name': [' John Smith ', 'jane doe', 'BOB WILSON'],
    'age': [25, 34, 45],
    'salary': [50000, 75000, 120000]
})

# Clean up names: strip whitespace and title case
df['name_clean'] = df['name'].apply(lambda x: x.strip().title())
print(df['name_clean'])
Output:
0 John Smith
1 Jane Doe
2 Bob Wilson
Name: name_clean, dtype: object
You can also use apply() to categorize values:
def categorize_salary(salary):
    if salary < 60000:
        return 'entry'
    elif salary < 100000:
        return 'mid'
    else:
        return 'senior'

df['level'] = df['salary'].apply(categorize_salary)
print(df[['salary', 'level']])
Output:
salary level
0 50000 entry
1 75000 mid
2 120000 senior
For the string cleaning example, you could use df['name'].str.strip().str.title() instead—it’s faster. But the categorization example shows where apply() shines: complex conditional logic that doesn’t map cleanly to vectorized operations.
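For completeness, here is that faster vectorized version of the cleaning step, using the same sample names as above:

```python
import pandas as pd

df = pd.DataFrame({'name': [' John Smith ', 'jane doe', 'BOB WILSON']})

# Vectorized string methods run in pandas' optimized code path,
# avoiding a Python-level function call per element
df['name_clean'] = df['name'].str.strip().str.title()
print(df['name_clean'].tolist())  # ['John Smith', 'Jane Doe', 'Bob Wilson']
```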
Apply on DataFrames (Row-wise vs Column-wise)
When you call apply() on a DataFrame, the axis parameter determines how your function receives data:
- axis=0 (default): Your function receives each column as a Series
- axis=1: Your function receives each row as a Series
This trips up everyone at first. Think of it this way: axis=0 means “apply along the rows” (moving down), which gives you columns. axis=1 means “apply along the columns” (moving across), which gives you rows.
df = pd.DataFrame({
    'q1_sales': [100, 200, 150],
    'q2_sales': [120, 180, 200],
    'q3_sales': [90, 220, 175],
    'q4_sales': [150, 250, 225]
})
# Column-wise: get the mean of each column (axis=0)
column_means = df.apply(lambda col: col.mean(), axis=0)
print("Column means:")
print(column_means)
# Row-wise: get the total for each row (axis=1)
df['annual_total'] = df.apply(lambda row: row.sum(), axis=1)
print("\nWith row totals:")
print(df)
Output:
Column means:
q1_sales 150.0
q2_sales 166.666667
q3_sales 161.666667
q4_sales 208.333333
dtype: float64
With row totals:
q1_sales q2_sales q3_sales q4_sales annual_total
0 100 120 90 150 460
1 200 180 220 250 850
2 150 200 175 225 750
For simple aggregations like this, use df.sum(axis=1) instead. But apply() with axis=1 becomes essential when you need to access multiple columns with complex logic:
def calculate_bonus(row):
    base_bonus = row['annual_total'] * 0.1
    if row['q4_sales'] > row['q1_sales']:
        return base_bonus * 1.2  # 20% extra for growth
    return base_bonus
df['bonus'] = df.apply(calculate_bonus, axis=1)
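Even branching logic like this can often be vectorized with np.where if the row-wise version ever becomes a bottleneck. A sketch of the same bonus rule, using the quarterly totals from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'q1_sales': [100, 200, 150],
    'q4_sales': [150, 250, 225],
    'annual_total': [460, 850, 750],
})

base = df['annual_total'] * 0.1
# Same rule as calculate_bonus: 20% extra when Q4 beats Q1
df['bonus'] = np.where(df['q4_sales'] > df['q1_sales'], base * 1.2, base)
print(df['bonus'].tolist())
```

The trade-off is readability: once a rule needs three or four branches, the named function with apply() is often easier to review than nested np.where calls.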
Using Lambda Functions vs Named Functions
Lambda functions are convenient for simple, one-off transformations. Named functions are better when:
- The logic spans multiple lines
- You need to reuse the function
- You want to add docstrings or type hints
- Debugging is easier with a named function in the stack trace
# Lambda: fine for simple transformations
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)
# Named function: better for complex logic
def parse_employee_code(code):
    """
    Parse employee codes like 'EMP-2023-0042' into components.
    Returns a dict with year and sequence number.
    """
    parts = code.split('-')
    if len(parts) != 3:
        return {'year': None, 'sequence': None}
    try:
        return {
            'year': int(parts[1]),
            'sequence': int(parts[2])
        }
    except ValueError:
        return {'year': None, 'sequence': None}
# Using the named function
codes = pd.Series(['EMP-2023-0042', 'EMP-2022-0108', 'INVALID'])
parsed = codes.apply(parse_employee_code)
print(parsed)
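A Series of dicts is awkward to work with directly. One common follow-up (a sketch, not part of the original example) is to expand the dicts into proper columns by chaining a second apply with pd.Series:

```python
import pandas as pd

def parse_employee_code(code):
    """Parse codes like 'EMP-2023-0042' into year and sequence number."""
    parts = code.split('-')
    if len(parts) != 3:
        return {'year': None, 'sequence': None}
    try:
        return {'year': int(parts[1]), 'sequence': int(parts[2])}
    except ValueError:
        return {'year': None, 'sequence': None}

codes = pd.Series(['EMP-2023-0042', 'EMP-2022-0108', 'INVALID'])
# apply(pd.Series) turns each dict into a row of a new DataFrame,
# with the dict keys becoming column names
expanded = codes.apply(parse_employee_code).apply(pd.Series)
print(expanded)
```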
My rule of thumb: if your lambda has more than one operation or any conditional logic, extract it to a named function.
Passing Additional Arguments
Sometimes your function needs parameters beyond the row or element. Use args for positional arguments and **kwargs for keyword arguments:
def apply_threshold(value, threshold, default=0):
    """Return value if at or above threshold, otherwise return default."""
    return value if value >= threshold else default
df = pd.DataFrame({'scores': [85, 42, 91, 67, 38]})
# Using args for positional arguments
df['passed_70'] = df['scores'].apply(apply_threshold, args=(70,))
# Using kwargs for keyword arguments
df['passed_50_with_default'] = df['scores'].apply(
    apply_threshold,
    threshold=50,
    default=-1
)
print(df)
Output:
scores passed_70 passed_50_with_default
0 85 85 85
1 42 0 -1
2 91 91 91
3 67 0 67
4 38 0 -1
This pattern is especially useful when you want to reuse a function with different configurations without rewriting it.
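If you prefer standard-library tools, functools.partial achieves the same thing by baking the configuration into a new one-argument function before apply() ever sees it:

```python
from functools import partial

import pandas as pd

def apply_threshold(value, threshold, default=0):
    """Return value if at or above threshold, otherwise return default."""
    return value if value >= threshold else default

df = pd.DataFrame({'scores': [85, 42, 91, 67, 38]})

# partial() pre-binds threshold, so apply() calls a one-arg function
pass_70 = partial(apply_threshold, threshold=70)
df['passed_70'] = df['scores'].apply(pass_70)
print(df['passed_70'].tolist())  # [85, 0, 91, 0, 0]
```

A partial object is also reusable outside pandas, e.g. with plain map() or in tests.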
Performance Considerations
Here’s the uncomfortable truth: apply() is slow. It’s essentially a Python loop under the hood, which means you lose pandas’ optimized C-based operations.
Let’s benchmark:
import numpy as np
import time
# Create a large DataFrame
n = 1_000_000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n)
})
# Method 1: apply() with lambda
start = time.time()
result1 = df['a'].apply(lambda x: x ** 2 + 1)
apply_time = time.time() - start
# Method 2: Vectorized NumPy operation
start = time.time()
result2 = df['a'] ** 2 + 1
vectorized_time = time.time() - start
print(f"apply() time: {apply_time:.3f}s")
print(f"Vectorized time: {vectorized_time:.3f}s")
print(f"Speedup: {apply_time / vectorized_time:.1f}x")
Typical output:
apply() time: 0.412s
Vectorized time: 0.008s
Speedup: 51.5x
That’s a 50x difference. For row-wise operations, consider these alternatives:
# Instead of apply with axis=1 for simple cases:
df['sum'] = df.apply(lambda row: row['a'] + row['b'], axis=1) # Slow
# Use vectorized operations:
df['sum'] = df['a'] + df['b'] # Fast
# For conditional logic, use np.where or np.select:
df['category'] = df['a'].apply(lambda x: 'high' if x > 0 else 'low') # Slow
df['category'] = np.where(df['a'] > 0, 'high', 'low') # Fast
Use apply() when you genuinely need custom Python logic that can’t be vectorized. For everything else, find the vectorized alternative.
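When the conditional logic has more than two branches, np.select extends the np.where idea. Here is the salary categorization from earlier, vectorized:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [50000, 75000, 120000]})

# Conditions are checked in order; the first match wins,
# and 'senior' is the fall-through default
conditions = [df['salary'] < 60000, df['salary'] < 100000]
choices = ['entry', 'mid']
df['level'] = np.select(conditions, choices, default='senior')
print(df['level'].tolist())  # ['entry', 'mid', 'senior']
```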
Common Pitfalls and Best Practices
Pitfall 1: Inconsistent return types
Your function must return consistent types, or you’ll get unexpected results:
def bad_categorize(value):
    if value > 100:
        return 'high'
    elif value > 50:
        return 'medium'
    # Bug: no return for value <= 50, returns None implicitly
df = pd.DataFrame({'values': [120, 75, 30]})
df['category'] = df['values'].apply(bad_categorize)
print(df)
print(df['category'].dtype) # object, but contains None
Always ensure your function handles all cases explicitly.
Pitfall 2: Modifying data in place
Never modify the original data inside apply():
# DON'T do this
def bad_function(row):
    row['new_col'] = row['a'] * 2  # Modifying the row
    return row

# DO this instead
def good_function(row):
    return row['a'] * 2

df['new_col'] = df.apply(good_function, axis=1)
Pitfall 3: Ignoring NaN values
Handle missing values explicitly:
def safe_parse(value):
    if pd.isna(value):
        return None
    try:
        return float(value) * 2
    except (ValueError, TypeError):
        return None
df['parsed'] = df['messy_column'].apply(safe_parse)
Best practices summary:
- Check if a vectorized alternative exists before using apply()
- Use named functions for anything beyond trivial transformations
- Always handle edge cases: None, NaN, empty strings, unexpected types
- Return consistent types from your function
- Test your function on edge cases before applying to the full DataFrame
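One lightweight way to follow that last point is to probe the function on a handful of tricky inputs before touching the real data. A sketch using the safe_parse function from Pitfall 3:

```python
import pandas as pd

def safe_parse(value):
    """Parse a value to float and double it; None for anything unparseable."""
    if pd.isna(value):
        return None
    try:
        return float(value) * 2
    except (ValueError, TypeError):
        return None

# Probe edge cases up front instead of discovering them mid-pipeline
edge_cases = pd.Series([None, float('nan'), '', 'abc', '3.5', 0])
checked = edge_cases.apply(safe_parse)
print(checked.tolist())
```

The first four inputs all come back missing, while '3.5' and 0 parse to 7.0 and 0.0, confirming the function degrades gracefully before you run it on a million rows.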
The apply() function is essential in your pandas toolkit, but it’s a specialized tool. Master it, but don’t overuse it.