Pandas - Apply Function to Column

• The `apply()` method transforms DataFrame columns using custom functions, lambda expressions, or built-in functions, offering more flexibility than vectorized operations for complex transformations

Key Insights

• The apply() method transforms DataFrame columns using custom functions, lambda expressions, or built-in functions, offering more flexibility than vectorized operations for complex transformations
• Performance varies significantly: vectorized operations are fastest, apply() with NumPy functions is moderate, and apply() with Python functions is slowest due to row-by-row iteration
• Choose between apply(), map(), applymap() (deprecated in favor of DataFrame.map), and vectorized operations based on whether you're transforming a Series, a DataFrame, or need element-wise operations
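The Series/DataFrame distinction in the last point can be sketched in a few lines (a minimal illustration; note that `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, which the sketch checks for):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Series.apply and Series.map both operate element-wise on a Series
doubled = s.apply(lambda x: x * 2)

# Series.map additionally accepts a dict; unmapped values become NaN
labels = s.map({1: 'one', 2: 'two'})

# DataFrame.apply passes one whole column (or row, with axis=1) at a time
col_sums = df.apply(sum)

# Element-wise over every cell: DataFrame.map in pandas >= 2.1,
# falling back to the deprecated applymap on older versions
squared = df.map(lambda x: x ** 2) if hasattr(df, 'map') else df.applymap(lambda x: x ** 2)
```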

Basic Apply Syntax

The apply() method executes a function along an axis of a DataFrame or on the values of a Series. Pass the function object itself as the argument, without parentheses, so pandas can call it for each element.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 250, 175, 300],
    'quantity': [2, 1, 3, 2],
    'discount': [0.1, 0.15, 0.05, 0.2]
})

# Apply built-in function to a column
df['price_log'] = df['price'].apply(np.log)

# Apply lambda function
df['total'] = df['price'].apply(lambda x: x * 1.08)

print(df)
   price  quantity  discount  price_log   total
0    100         2      0.10   4.605170  108.00
1    250         1      0.15   5.521461  270.00
2    175         3      0.05   5.164786  189.00
3    300         2      0.20   5.703782  324.00

Custom Functions with Apply

Define custom functions for complex transformations that require conditional logic or multiple operations.

def categorize_price(price):
    if price < 150:
        return 'Budget'
    elif price < 250:
        return 'Mid-range'
    else:
        return 'Premium'

df['category'] = df['price'].apply(categorize_price)

# Function with multiple parameters
def calculate_final_price(price, discount_rate=0.1, tax_rate=0.08):
    discounted = price * (1 - discount_rate)
    return discounted * (1 + tax_rate)

# Pass additional arguments
df['final_price'] = df['price'].apply(
    calculate_final_price, 
    discount_rate=0.15, 
    tax_rate=0.08
)

print(df[['price', 'category', 'final_price']])
   price   category  final_price
0    100     Budget        91.80
1    250    Premium       229.50
2    175  Mid-range       160.65
3    300    Premium       275.40

Accessing Multiple Columns

Use apply() on the entire DataFrame with axis=1 to access multiple columns in your function.

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 250, 175, 300],
    'quantity': [2, 1, 3, 2],
    'discount': [0.1, 0.15, 0.05, 0.2]
})

def calculate_revenue(row):
    base_price = row['price'] * row['quantity']
    discount_amount = base_price * row['discount']
    return base_price - discount_amount

df['revenue'] = df.apply(calculate_revenue, axis=1)

# Return multiple values using a Series
def compute_metrics(row):
    revenue = row['price'] * row['quantity']
    discount_amt = revenue * row['discount']
    return pd.Series({
        'revenue': revenue,
        'discount_amount': discount_amt,
        'net_revenue': revenue - discount_amt
    })

metrics = df.apply(compute_metrics, axis=1)
df = pd.concat([df, metrics], axis=1)

print(df)
  product  price  quantity  discount  revenue  discount_amount  net_revenue
0       A    100         2      0.10    200.0            20.00       180.00
1       B    250         1      0.15    250.0            37.50       212.50
2       C    175         3      0.05    525.0            26.25       498.75
3       D    300         2      0.20    600.0           120.00       480.00

Apply vs Map vs Vectorized Operations

Understanding the differences helps you choose the right tool for performance optimization.

import time

# Create larger dataset for timing
df = pd.DataFrame({
    'values': np.random.randint(1, 1000, 100000)
})

# Vectorized operation (fastest)
start = time.time()
df['vectorized'] = df['values'] * 2 + 10
vectorized_time = time.time() - start

# Apply with lambda
start = time.time()
df['applied'] = df['values'].apply(lambda x: x * 2 + 10)
apply_time = time.time() - start

# Map with dictionary (for categorical mappings)
mapping = {i: i * 2 + 10 for i in range(1, 1000)}
start = time.time()
df['mapped'] = df['values'].map(mapping)
map_time = time.time() - start

print(f"Vectorized: {vectorized_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Map: {map_time:.4f}s")

For simple arithmetic operations, vectorized operations are 50-100x faster than apply(). Use apply() when you need conditional logic or complex transformations that can’t be vectorized.

String Operations with Apply

String manipulation often requires apply() when the vectorized `.str` accessor methods don't suffice.

df = pd.DataFrame({
    'email': ['john.doe@company.com', 'jane.smith@firm.org', 
              'bob.jones@startup.io', 'alice.wong@corp.net']
})

# Extract domain
df['domain'] = df['email'].apply(lambda x: x.split('@')[1])

# Complex string transformation
def format_email(email):
    username, domain = email.split('@')
    name_parts = username.split('.')
    formatted_name = ' '.join([part.capitalize() for part in name_parts])
    return f"{formatted_name} ({domain})"

df['formatted'] = df['email'].apply(format_email)

# Using regex with apply
import re

def extract_tld(email):
    match = re.search(r'\.([a-z]+)$', email)
    return match.group(1) if match else None

df['tld'] = df['email'].apply(extract_tld)

print(df)
                  email       domain               formatted  tld
0  john.doe@company.com  company.com  John Doe (company.com)  com
1   jane.smith@firm.org     firm.org   Jane Smith (firm.org)  org
2  bob.jones@startup.io   startup.io  Bob Jones (startup.io)   io
3   alice.wong@corp.net     corp.net   Alice Wong (corp.net)  net
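For simple extractions like the domain split above, the vectorized `.str` accessor does the same work without apply(), and `str.extract` covers the regex case. A quick sketch of the equivalents:

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['john.doe@company.com', 'jane.smith@firm.org']
})

# Vectorized equivalent of apply(lambda x: x.split('@')[1])
df['domain'] = df['email'].str.split('@').str[1]

# Vectorized equivalent of the regex-based extract_tld
df['tld'] = df['email'].str.extract(r'\.([a-z]+)$')
```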

Handling Errors in Apply

Implement error handling to prevent entire operations from failing due to bad data.

df = pd.DataFrame({
    'values': ['100', '250', 'invalid', '300', None, '175']
})

def safe_convert(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan

df['numeric'] = df['values'].apply(safe_convert)

# More sophisticated error handling with logging
def convert_with_logging(value, index):
    try:
        return float(value)
    except (ValueError, TypeError) as e:
        print(f"Error at index {index}: {e}")
        return np.nan

df['numeric_logged'] = df.reset_index().apply(
    lambda row: convert_with_logging(row['values'], row['index']), 
    axis=1
)

print(df)
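For this particular conversion, pandas also ships a vectorized alternative: pd.to_numeric with errors='coerce' turns every unparseable value into NaN in a single call, no custom function needed:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'values': ['100', '250', 'invalid', '300', None, '175']
})

# Unparseable strings and None both become NaN instead of raising
df['numeric'] = pd.to_numeric(df['values'], errors='coerce')
```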

Performance Optimization Strategies

When apply() is necessary, optimize its execution.

# Bad: Repeated expensive operations
def inefficient_transform(x):
    # Computes same value repeatedly
    threshold = df['values'].mean()
    return x > threshold

# Good: Compute once, reuse
threshold = df['values'].mean()
def efficient_transform(x):
    return x > threshold

# Use NumPy functions when possible
df = pd.DataFrame({
    'values': np.random.rand(10000)
})

# Slower
df['sqrt_apply'] = df['values'].apply(lambda x: x ** 0.5)

# Faster
df['sqrt_numpy'] = np.sqrt(df['values'])

# For complex conditions, use np.where or np.select
conditions = [
    df['values'] < 0.3,
    df['values'] < 0.7,
    df['values'] >= 0.7
]
choices = ['Low', 'Medium', 'High']
df['category'] = np.select(conditions, choices)

print(df.head())
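When the conditions are contiguous numeric ranges, as above, pd.cut expresses the same binning even more directly. A small sketch (note that pd.cut uses right-inclusive bin edges, so the boundary semantics differ slightly from the strict `<` comparisons in the np.select version):

```python
import pandas as pd

df = pd.DataFrame({'values': [0.1, 0.5, 0.9]})

# Bin into the same three ranges as the np.select example
df['category'] = pd.cut(
    df['values'],
    bins=[0, 0.3, 0.7, 1.0],
    labels=['Low', 'Medium', 'High'],
    include_lowest=True
)
```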

Parallel Processing with Apply

For CPU-intensive operations on large datasets, consider parallel execution.

from multiprocessing import Pool
import numpy as np

df = pd.DataFrame({
    'values': np.random.randint(1, 1000000, 10000)
})

def expensive_computation(x):
    # Simulate expensive operation
    return sum([i for i in range(int(x ** 0.5))])

# Sequential apply
result_sequential = df['values'].apply(expensive_computation)

# Parallel apply using swifter (install: pip install swifter)
# import swifter
# result_parallel = df['values'].swifter.apply(expensive_computation)

# Manual parallel processing
def parallel_apply(series, func, n_jobs=4):
    with Pool(n_jobs) as pool:
        result = pool.map(func, series)
    return pd.Series(result, index=series.index)

# result_manual = parallel_apply(df['values'], expensive_computation)
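One caveat worth noting, assuming the default spawn start method used on Windows and macOS: the worker function must be defined at module level so workers can pickle it, and the Pool call should sit behind a __main__ guard. A minimal sketch:

```python
import pandas as pd
from multiprocessing import Pool

def square(x):  # top-level definition so child processes can pickle it
    return x * x

def parallel_apply(series, func, n_jobs=2):
    with Pool(n_jobs) as pool:
        result = pool.map(func, series)
    return pd.Series(result, index=series.index)

if __name__ == '__main__':  # required under the spawn start method
    s = pd.Series([1, 2, 3, 4])
    print(parallel_apply(s, square).tolist())  # [1, 4, 9, 16]
```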

The apply() method provides essential flexibility for data transformations in Pandas. While vectorized operations should be your first choice for performance, apply() becomes invaluable for complex logic, conditional transformations, and operations requiring multiple column access. Balance readability and performance by profiling your code and choosing the appropriate method for each use case.
