How to Apply a Function to Multiple Columns in Pandas

Key Insights

  • Vectorized operations should be your default choice—they’re often orders of magnitude faster than apply() and more readable
  • Use apply() with axis=1 for row-wise operations that genuinely require complex logic across multiple columns
  • The assign() method enables clean, chainable transformations when creating multiple derived columns simultaneously

Introduction

Applying functions to multiple columns is one of the most common operations in pandas. Whether you’re calculating derived metrics, cleaning inconsistent data, or engineering features for machine learning, you’ll frequently need to transform data across several columns at once.

The challenge is that pandas offers multiple ways to accomplish this—apply(), vectorized operations, map(), assign()—and choosing the wrong approach can result in code that’s either painfully slow or unnecessarily complex.

This guide covers each method, when to use it, and the performance implications of your choices. By the end, you’ll know exactly which tool to reach for in any multi-column transformation scenario.

Using apply() with Multiple Columns

The apply() method with axis=1 processes your DataFrame row by row, giving you access to all column values in each row. This is the most flexible approach, but that flexibility comes at a cost.

import pandas as pd

# Sample health data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'weight_kg': [65, 80, 72, 58],
    'height_m': [1.65, 1.80, 1.75, 1.60]
})

# Calculate BMI using apply with axis=1
df['bmi'] = df.apply(
    lambda row: row['weight_kg'] / row['height_m'] ** 2, 
    axis=1
)

print(df)

Output:

      name  weight_kg  height_m        bmi
0    Alice         65      1.65  23.875115
1      Bob         80      1.80  24.691358
2  Charlie         72      1.75  23.510204
3    Diana         58      1.60  22.656250

For more complex logic, use a named function instead of a lambda:

def calculate_health_score(row):
    bmi = row['weight_kg'] / row['height_m'] ** 2
    
    if 18.5 <= bmi < 25:
        return 'healthy'
    elif 25 <= bmi < 30:
        return 'overweight'
    elif bmi >= 30:
        return 'obese'
    else:
        return 'underweight'

df['health_category'] = df.apply(calculate_health_score, axis=1)

Named functions are easier to test, debug, and reuse. Reserve lambdas for truly simple one-liners.
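That said, bin-based categorization like this doesn’t strictly require row-wise apply(). As a sketch, pd.cut can express the same BMI thresholds in one vectorized call (left-closed bins to match the if/elif boundaries above):

```python
import pandas as pd

df = pd.DataFrame({
    'weight_kg': [65, 80, 72, 58],
    'height_m': [1.65, 1.80, 1.75, 1.60]
})

# Vectorized BMI, then bin it with pd.cut instead of row-wise if/elif
bmi = df['weight_kg'] / df['height_m'] ** 2
df['health_category'] = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['underweight', 'healthy', 'overweight', 'obese'],
    right=False  # left-closed intervals like [18.5, 25), matching the if/elif
)
```

Keep apply() for branching that pd.cut’s fixed bins can’t express.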

Vectorized Operations (The Preferred Approach)

Here’s the truth: most multi-column operations don’t need apply() at all. Vectorized operations work directly on entire columns using optimized C code under the hood, making them dramatically faster.

import numpy as np

# Sales data
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Thingamajig'],
    'price': [29.99, 49.99, 19.99, 99.99],
    'quantity': [100, 50, 200, 25],
    'discount_pct': [0.1, 0.15, 0.05, 0.2]
})

# Vectorized approach (fast and readable)
df['subtotal'] = df['price'] * df['quantity']
df['discount'] = df['subtotal'] * df['discount_pct']
df['total'] = df['subtotal'] - df['discount']

# Compare to apply approach (slow and verbose)
df['total_apply'] = df.apply(
    lambda row: (row['price'] * row['quantity']) * (1 - row['discount_pct']),
    axis=1
)

Let’s quantify the performance difference:

import time

# Create a larger dataset for timing
large_df = pd.DataFrame({
    'price': np.random.uniform(10, 100, 100_000),
    'quantity': np.random.randint(1, 100, 100_000),
    'discount_pct': np.random.uniform(0, 0.3, 100_000)
})

# Time vectorized approach
start = time.perf_counter()
large_df['total_vec'] = large_df['price'] * large_df['quantity'] * (1 - large_df['discount_pct'])
vec_time = time.perf_counter() - start

# Time apply approach
start = time.perf_counter()
large_df['total_apply'] = large_df.apply(
    lambda row: row['price'] * row['quantity'] * (1 - row['discount_pct']),
    axis=1
)
apply_time = time.perf_counter() - start

print(f"Vectorized: {vec_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Apply is {apply_time/vec_time:.1f}x slower")

Typical output:

Vectorized: 0.0015s
Apply: 2.3421s
Apply is 1561.4x slower

The vectorized approach isn’t just faster—it’s also clearer about what’s happening. When you see df['a'] * df['b'], the intent is obvious.
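Even conditional logic can often stay vectorized. As a sketch, np.select evaluates each condition on the whole column at once and picks the first match per row (the quantity thresholds here are illustrative, not from the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [29.99, 49.99, 19.99, 99.99],
    'quantity': [100, 50, 200, 25]
})

# np.select checks conditions in order and returns the first matching choice
conditions = [df['quantity'] >= 150, df['quantity'] >= 50]
discounts = [0.20, 0.10]
df['discount_pct'] = np.select(conditions, discounts, default=0.0)
```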

Using apply() on Selected Columns

Sometimes you need to apply the same transformation to multiple columns. Select your target columns first, then apply:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'test_1': [85, 92, 78],
    'test_2': [88, 85, 92],
    'test_3': [90, 88, 85]
})

# Normalize test scores (z-score normalization)
score_columns = ['test_1', 'test_2', 'test_3']

df[score_columns] = df[score_columns].apply(
    lambda x: (x - x.mean()) / x.std()
)

print(df.round(2))

Note that when you apply to selected columns without axis=1, the function operates column-wise by default. Each column x in the lambda is a Series containing all values for that column.
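In fact, this particular z-score normalization doesn’t need apply() either, since arithmetic on a column subset broadcasts against the per-column statistics; a vectorized sketch of the same transformation:

```python
import pandas as pd

df = pd.DataFrame({
    'test_1': [85, 92, 78],
    'test_2': [88, 85, 92],
    'test_3': [90, 88, 85]
})

score_columns = ['test_1', 'test_2', 'test_3']
# Subtraction and division broadcast column-wise against the per-column stats
scores = df[score_columns]
df[score_columns] = (scores - scores.mean()) / scores.std()
```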

For row-wise operations on selected columns:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'math': [85, 92, 78],
    'science': [88, 85, 92],
    'english': [90, 88, 85]
})

# Calculate average across selected columns for each row
subject_cols = ['math', 'science', 'english']
df['average'] = df[subject_cols].apply(lambda row: row.mean(), axis=1)

# But this is better done with:
df['average'] = df[subject_cols].mean(axis=1)

Again, prefer built-in methods when they exist.

The map() Method for Element-wise Operations

For element-wise transformations across a DataFrame, use map() (renamed from applymap() in pandas 2.1+):

# For pandas < 2.1, use applymap()
# For pandas >= 2.1, use map()

df = pd.DataFrame({
    'product': ['widget', 'GADGET', 'Gizmo'],
    'category': ['electronics', 'TOOLS', 'Home'],
    'supplier': ['acme', 'GLOBEX', 'Initech']
})

# Standardize all string columns to title case
string_cols = df.select_dtypes(include='object').columns
df[string_cols] = df[string_cols].map(str.title)  # or .applymap() for older pandas

print(df)

Output:

   product     category supplier
0   Widget  Electronics     Acme
1   Gadget        Tools   Globex
2    Gizmo         Home  Initech
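When only string columns are involved, the per-column .str accessor is the vectorized alternative to an element-wise callable; a sketch of the same title-casing:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['widget', 'GADGET', 'Gizmo'],
    'category': ['electronics', 'TOOLS', 'Home']
})

# .str.title() is vectorized per column; no per-element Python callable needed
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.title()
```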

For numeric formatting:

df = pd.DataFrame({
    'revenue': [1234567.891, 987654.321, 456789.012],
    'margin': [0.234567, 0.345678, 0.123456],
    'growth': [0.156789, -0.023456, 0.089012]
})

# Round all values to 2 decimal places
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].map(lambda x: round(x, 2))

Though for rounding specifically, the built-in df[numeric_cols].round(2) is simpler and faster—map() earns its keep when no built-in covers the element-wise transformation you need.

Using assign() for Multiple New Columns

The assign() method creates new columns in a chainable, functional style. It’s particularly useful in data pipelines:

df = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Jones', 'Brown'],
    'birth_year': [1990, 1985, 1978],
    'salary': [75000, 82000, 95000]
})

# Create multiple derived columns in one chain
result = (
    df
    .assign(
        full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
        age=lambda x: 2024 - x['birth_year'],
        salary_bracket=lambda x: pd.cut(
            x['salary'], 
            bins=[0, 50000, 80000, 100000, float('inf')],
            labels=['entry', 'mid', 'senior', 'executive']
        ),
        tax_estimate=lambda x: x['salary'] * 0.25
    )
)

print(result)

The lambda functions in assign() receive the DataFrame as it exists at that point in the chain, allowing you to reference columns created earlier in the same assign() call (in pandas 0.23+).

# Reference columns created in the same assign
result = df.assign(
    age=lambda x: 2024 - x['birth_year'],
    age_group=lambda x: pd.cut(x['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
)

Performance Considerations and Best Practices

Here’s a practical hierarchy for choosing your approach:

Approach                   Speed     Use When
Vectorized operations      Fastest   Simple arithmetic, comparisons, string methods
Built-in aggregations      Fast      mean(), sum(), std() across columns
apply() with NumPy         Moderate  Complex logic that NumPy can optimize
apply() with pure Python   Slow      Complex conditional logic, external API calls

import numpy as np

# When you must use apply, leverage NumPy inside
def complex_calculation(row):
    values = np.array([row['a'], row['b'], row['c']])
    return np.sqrt(np.sum(values ** 2))  # Euclidean norm

# Better: use NumPy directly on the DataFrame
df['norm'] = np.sqrt(df['a']**2 + df['b']**2 + df['c']**2)
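For this particular calculation, NumPy also provides np.linalg.norm, which computes the row-wise Euclidean norm directly on the underlying array; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3.0, 6.0], 'b': [4.0, 8.0], 'c': [0.0, 0.0]})

# axis=1 takes the norm across the selected columns for each row
df['norm'] = np.linalg.norm(df[['a', 'b', 'c']].to_numpy(), axis=1)
```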

For genuinely complex row-wise operations on large datasets, consider parallelization libraries:

# pip install swifter
import swifter

# Automatically parallelizes apply operations
df['result'] = df.swifter.apply(complex_calculation, axis=1)

The bottom line: start with vectorized operations. Only reach for apply() when your logic genuinely can’t be expressed as column operations. When you do use apply(), keep the function simple and consider whether NumPy functions can help inside it.
