How to Apply a Function to Multiple Columns in Pandas
Key Insights
- Vectorized operations should be your default choice—they're 10-100x faster than apply() and more readable
- Use apply() with axis=1 for row-wise operations that genuinely require complex logic across multiple columns
- The assign() method enables clean, chainable transformations when creating multiple derived columns simultaneously
Introduction
Applying functions to multiple columns is one of the most common operations in pandas. Whether you’re calculating derived metrics, cleaning inconsistent data, or engineering features for machine learning, you’ll frequently need to transform data across several columns at once.
The challenge is that pandas offers multiple ways to accomplish this—apply(), vectorized operations, map(), assign()—and choosing the wrong approach can result in code that’s either painfully slow or unnecessarily complex.
This guide covers each method, when to use it, and the performance implications of your choices. By the end, you’ll know exactly which tool to reach for in any multi-column transformation scenario.
Using apply() with Multiple Columns
The apply() method with axis=1 processes your DataFrame row by row, giving you access to all column values in each row. This is the most flexible approach, but that flexibility comes at a cost.
import pandas as pd
# Sample health data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'weight_kg': [65, 80, 72, 58],
    'height_m': [1.65, 1.80, 1.75, 1.60]
})
# Calculate BMI using apply with axis=1
df['bmi'] = df.apply(
    lambda row: row['weight_kg'] / row['height_m'] ** 2,
    axis=1
)
print(df)
Output:
name weight_kg height_m bmi
0 Alice 65 1.65 23.875115
1 Bob 80 1.80 24.691358
2 Charlie 72 1.75 23.510204
3 Diana 58 1.60 22.656250
For more complex logic, use a named function instead of a lambda:
def calculate_health_score(row):
    bmi = row['weight_kg'] / row['height_m'] ** 2
    if 18.5 <= bmi < 25:
        return 'healthy'
    elif 25 <= bmi < 30:
        return 'overweight'
    elif bmi >= 30:
        return 'obese'
    else:
        return 'underweight'
df['health_category'] = df.apply(calculate_health_score, axis=1)
Named functions are easier to test, debug, and reuse. Reserve lambdas for truly simple one-liners.
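To make that testability concrete, here's a minimal sketch (the sample rows are illustrative, not from the original dataset): because apply() with axis=1 passes each row as a Series, the named function can be exercised directly with hand-built rows, no DataFrame required.

```python
import pandas as pd

def calculate_health_score(row):
    # Same categorization logic as above, repeated so this snippet is self-contained
    bmi = row['weight_kg'] / row['height_m'] ** 2
    if 18.5 <= bmi < 25:
        return 'healthy'
    elif 25 <= bmi < 30:
        return 'overweight'
    elif bmi >= 30:
        return 'obese'
    else:
        return 'underweight'

# Each "row" is just a Series, so unit tests are straightforward
assert calculate_health_score(pd.Series({'weight_kg': 65, 'height_m': 1.65})) == 'healthy'
assert calculate_health_score(pd.Series({'weight_kg': 100, 'height_m': 1.70})) == 'obese'
```

A lambda buried inside an apply() call can't be tested in isolation like this, which is the practical reason to prefer named functions for anything non-trivial.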
Vectorized Operations (The Preferred Approach)
Here’s the truth: most multi-column operations don’t need apply() at all. Vectorized operations work directly on entire columns using optimized C code under the hood, making them dramatically faster.
import numpy as np
# Sales data
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Thingamajig'],
    'price': [29.99, 49.99, 19.99, 99.99],
    'quantity': [100, 50, 200, 25],
    'discount_pct': [0.1, 0.15, 0.05, 0.2]
})
# Vectorized approach (fast and readable)
df['subtotal'] = df['price'] * df['quantity']
df['discount'] = df['subtotal'] * df['discount_pct']
df['total'] = df['subtotal'] - df['discount']
# Compare to apply approach (slow and verbose)
df['total_apply'] = df.apply(
    lambda row: (row['price'] * row['quantity']) * (1 - row['discount_pct']),
    axis=1
)
Let’s quantify the performance difference:
import time
# Create a larger dataset for timing
large_df = pd.DataFrame({
    'price': np.random.uniform(10, 100, 100_000),
    'quantity': np.random.randint(1, 100, 100_000),
    'discount_pct': np.random.uniform(0, 0.3, 100_000)
})
# Time vectorized approach
start = time.perf_counter()
large_df['total_vec'] = large_df['price'] * large_df['quantity'] * (1 - large_df['discount_pct'])
vec_time = time.perf_counter() - start
# Time apply approach
start = time.perf_counter()
large_df['total_apply'] = large_df.apply(
    lambda row: row['price'] * row['quantity'] * (1 - row['discount_pct']),
    axis=1
)
apply_time = time.perf_counter() - start
print(f"Vectorized: {vec_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Apply is {apply_time/vec_time:.1f}x slower")
Typical output:
Vectorized: 0.0015s
Apply: 2.3421s
Apply is 1561.4x slower
The vectorized approach isn’t just faster—it’s also clearer about what’s happening. When you see df['a'] * df['b'], the intent is obvious.
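Even conditional logic like the earlier BMI categorization can usually be vectorized. One way (a sketch, not the only option) is np.select, which checks a list of conditions in order and picks the first match for each row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight_kg': [65, 80, 72, 58],
    'height_m': [1.65, 1.80, 1.75, 1.60]
})

# Compute BMI for all rows at once
bmi = df['weight_kg'] / df['height_m'] ** 2

# np.select evaluates conditions top-down; the first True wins per row
conditions = [bmi < 18.5, bmi < 25, bmi < 30]
choices = ['underweight', 'healthy', 'overweight']
df['health_category'] = np.select(conditions, choices, default='obese')
```

This replaces a row-by-row apply() with three boolean Series and one NumPy call, so it scales the same way as the arithmetic examples above.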
Using apply() on Selected Columns
Sometimes you need to apply the same transformation to multiple columns. Select your target columns first, then apply:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'test_1': [85, 92, 78],
    'test_2': [88, 85, 92],
    'test_3': [90, 88, 85]
})
# Normalize test scores (z-score normalization)
score_columns = ['test_1', 'test_2', 'test_3']
df[score_columns] = df[score_columns].apply(
    lambda x: (x - x.mean()) / x.std()
)
print(df.round(2))
Note that when you apply to selected columns without axis=1, the function operates column-wise by default. Each column x in the lambda is a Series containing all values for that column.
For row-wise operations on selected columns:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'math': [85, 92, 78],
    'science': [88, 85, 92],
    'english': [90, 88, 85]
})
# Calculate average across selected columns for each row
subject_cols = ['math', 'science', 'english']
df['average'] = df[subject_cols].apply(lambda row: row.mean(), axis=1)
# But this is better done with:
df['average'] = df[subject_cols].mean(axis=1)
Again, prefer built-in methods when they exist.
The map() Method for Element-wise Operations
For element-wise transformations across a DataFrame, use map() (renamed from applymap() in pandas 2.1+):
# For pandas < 2.1, use applymap()
# For pandas >= 2.1, use map()
df = pd.DataFrame({
    'product': ['widget', 'GADGET', 'Gizmo'],
    'category': ['electronics', 'TOOLS', 'Home'],
    'supplier': ['acme', 'GLOBEX', 'Initech']
})
# Standardize all string columns to title case
string_cols = df.select_dtypes(include='object').columns
df[string_cols] = df[string_cols].map(str.title) # or .applymap() for older pandas
print(df)
Output:
product category supplier
0 Widget Electronics Acme
1 Gadget Tools Globex
2 Gizmo Home Initech
For numeric formatting:
df = pd.DataFrame({
    'revenue': [1234567.891, 987654.321, 456789.012],
    'margin': [0.234567, 0.345678, 0.123456],
    'growth': [0.156789, -0.023456, 0.089012]
})
# Round all values to 2 decimal places
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].map(lambda x: round(x, 2))
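In keeping with the "prefer built-ins" rule, this particular transformation doesn't need map() at all: DataFrame.round does per-column rounding directly, and also accepts a dict when columns need different precision. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'revenue': [1234567.891, 987654.321],
    'margin': [0.234567, 0.345678]
})

# round() is vectorized and hits every numeric column at once
rounded = df.round(2)

# Per-column precision via a dict: whole dollars for revenue, 3 places for margin
mixed = df.round({'revenue': 0, 'margin': 3})
```

Reserve map() for element-wise transformations that have no built-in equivalent, like the string title-casing example above.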
Using assign() for Multiple New Columns
The assign() method creates new columns in a chainable, functional style. It’s particularly useful in data pipelines:
df = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Jones', 'Brown'],
    'birth_year': [1990, 1985, 1978],
    'salary': [75000, 82000, 95000]
})
# Create multiple derived columns in one chain
result = (
    df
    .assign(
        full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
        age=lambda x: 2024 - x['birth_year'],
        salary_bracket=lambda x: pd.cut(
            x['salary'],
            bins=[0, 50000, 80000, 100000, float('inf')],
            labels=['entry', 'mid', 'senior', 'executive']
        ),
        tax_estimate=lambda x: x['salary'] * 0.25
    )
)
print(result)
The lambda functions in assign() receive the DataFrame as it exists at that point in the chain, allowing you to reference columns created earlier in the same assign() call (in pandas 0.23+).
# Reference columns created in the same assign
result = df.assign(
    age=lambda x: 2024 - x['birth_year'],
    age_group=lambda x: pd.cut(x['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
)
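Because assign() returns a new DataFrame rather than mutating in place, it also composes with other chainable methods like query() and sort_values(). A sketch of a small pipeline using the columns from the examples above (the 40-year age cutoff is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Jones', 'Brown'],
    'birth_year': [1990, 1985, 1978],
    'salary': [75000, 82000, 95000]
})

# Each step returns a new DataFrame, so the chain reads top to bottom
senior_staff = (
    df
    .assign(
        full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
        age=lambda x: 2024 - x['birth_year']
    )
    .query('age >= 40')                       # filter on a column created just above
    .sort_values('salary', ascending=False)
)
```

Columns created by assign() are immediately available to later steps in the chain, which keeps intermediate variables out of your pipeline.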
Performance Considerations and Best Practices
Here’s a practical hierarchy for choosing your approach:
| Approach | Speed | Use When |
|---|---|---|
| Vectorized operations | Fastest | Simple arithmetic, comparisons, string methods |
| Built-in aggregations | Fast | mean(), sum(), std() across columns |
| apply() with NumPy | Moderate | Complex logic that NumPy can optimize |
| apply() with pure Python | Slow | Complex conditional logic, external API calls |
import numpy as np
# When you must use apply, leverage NumPy inside
def complex_calculation(row):
    values = np.array([row['a'], row['b'], row['c']])
    return np.sqrt(np.sum(values ** 2))  # Euclidean norm

df['norm_apply'] = df.apply(complex_calculation, axis=1)
# Better: use NumPy directly on the DataFrame
df['norm'] = np.sqrt(df['a']**2 + df['b']**2 + df['c']**2)
For genuinely complex row-wise operations on large datasets, consider parallelization libraries:
# pip install swifter
import swifter
# Automatically parallelizes apply operations
df['result'] = df.swifter.apply(complex_function, axis=1)
The bottom line: start with vectorized operations. Only reach for apply() when your logic genuinely can’t be expressed as column operations. When you do use apply(), keep the function simple and consider whether NumPy functions can help inside it.