Pandas - Apply Function to Column
Key Insights
• The apply() method transforms DataFrame columns using custom functions, lambda expressions, or built-in functions, offering more flexibility than vectorized operations for complex transformations
• Performance varies significantly: vectorized operations are fastest, apply() with NumPy functions is moderate, and apply() with Python functions is slowest due to row-by-row iteration
• Choose between apply(), map(), applymap() (deprecated), and vectorized operations based on whether you’re transforming a Series, DataFrame, or need element-wise operations
Basic Apply Syntax
The apply() method executes a function along an axis of a DataFrame or on values of a Series. For columns, you pass the function as an argument without parentheses.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'price': [100, 250, 175, 300],
    'quantity': [2, 1, 3, 2],
    'discount': [0.1, 0.15, 0.05, 0.2]
})
# Apply built-in function to a column
df['price_log'] = df['price'].apply(np.log)
# Apply a lambda function (add 8% tax)
df['total'] = df['price'].apply(lambda x: x * 1.08)
print(df)
   price  quantity  discount  price_log   total
0    100         2      0.10   4.605170  108.00
1    250         1      0.15   5.521461  270.00
2    175         3      0.05   5.164786  189.00
3    300         2      0.20   5.703782  324.00
Custom Functions with Apply
Define custom functions for complex transformations that require conditional logic or multiple operations.
def categorize_price(price):
    if price < 150:
        return 'Budget'
    elif price < 250:
        return 'Mid-range'
    else:
        return 'Premium'
df['category'] = df['price'].apply(categorize_price)
# Function with multiple parameters
def calculate_final_price(price, discount_rate=0.1, tax_rate=0.08):
    discounted = price * (1 - discount_rate)
    return discounted * (1 + tax_rate)

# Pass additional arguments
df['final_price'] = df['price'].apply(
    calculate_final_price,
    discount_rate=0.15,
    tax_rate=0.08
)
print(df[['price', 'category', 'final_price']])
   price   category  final_price
0    100     Budget        91.80
1    250    Premium       229.50
2    175  Mid-range       160.65
3    300    Premium       275.40
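Alongside keyword arguments, `Series.apply` also accepts an `args` tuple of positional arguments passed after the element itself. A short sketch of the same calculation using `args`:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 250, 175, 300]})

def calculate_final_price(price, discount_rate=0.1, tax_rate=0.08):
    discounted = price * (1 - discount_rate)
    return discounted * (1 + tax_rate)

# args supplies positional arguments after the element value
df['final_price'] = df['price'].apply(calculate_final_price, args=(0.15, 0.08))
```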
Accessing Multiple Columns
Use apply() on the entire DataFrame with axis=1 to access multiple columns in your function.
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 250, 175, 300],
    'quantity': [2, 1, 3, 2],
    'discount': [0.1, 0.15, 0.05, 0.2]
})

def calculate_revenue(row):
    base_price = row['price'] * row['quantity']
    discount_amount = base_price * row['discount']
    return base_price - discount_amount
df['revenue'] = df.apply(calculate_revenue, axis=1)
# Return multiple values using a Series
def compute_metrics(row):
    revenue = row['price'] * row['quantity']
    discount_amt = revenue * row['discount']
    return pd.Series({
        'revenue': revenue,
        'discount_amount': discount_amt,
        'net_revenue': revenue - discount_amt
    })
metrics = df.apply(compute_metrics, axis=1)
df = pd.concat([df, metrics], axis=1)
print(df)
  product  price  quantity  discount  revenue  discount_amount  net_revenue
0       A    100         2      0.10    200.0            20.00       180.00
1       B    250         1      0.15    250.0            37.50       212.50
2       C    175         3      0.05    525.0            26.25       498.75
3       D    300         2      0.20    600.0           120.00       480.00
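Building a `pd.Series` per row is convenient but adds overhead. An alternative worth knowing is returning a plain tuple and letting `result_type='expand'` split it into columns; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [100, 250],
    'quantity': [2, 1],
    'discount': [0.10, 0.15],
})

def compute_metrics(row):
    revenue = row['price'] * row['quantity']
    discount_amt = revenue * row['discount']
    return revenue, discount_amt, revenue - discount_amt

# result_type='expand' turns each returned tuple into DataFrame columns
metrics = df.apply(compute_metrics, axis=1, result_type='expand')
metrics.columns = ['revenue', 'discount_amount', 'net_revenue']
```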
Apply vs Map vs Vectorized Operations
Understanding the differences helps you choose the right tool for performance optimization.
import time
# Create larger dataset for timing
df = pd.DataFrame({
    'values': np.random.randint(1, 1000, 100000)
})
# Vectorized operation (fastest)
start = time.time()
df['vectorized'] = df['values'] * 2 + 10
vectorized_time = time.time() - start
# Apply with lambda
start = time.time()
df['applied'] = df['values'].apply(lambda x: x * 2 + 10)
apply_time = time.time() - start
# Map with dictionary (for categorical mappings)
mapping = {i: i * 2 + 10 for i in range(1, 1000)}
start = time.time()
df['mapped'] = df['values'].map(mapping)
map_time = time.time() - start
print(f"Vectorized: {vectorized_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Map: {map_time:.4f}s")
For simple arithmetic operations, vectorized operations are 50-100x faster than apply(). Use apply() when you need conditional logic or complex transformations that can’t be vectorized.
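Note that some conditional logic can still be vectorized. A small sketch showing an `apply()` branch and its `np.where` equivalent (the threshold of 200 is an arbitrary illustrative value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [50, 500, 120, 900]})

# Conditional logic via apply() -- one Python call per element
flagged_apply = df['values'].apply(lambda x: 'high' if x > 200 else 'low')

# The same branch vectorized with np.where -- one call on the whole array
flagged_vec = np.where(df['values'] > 200, 'high', 'low')
```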
String Operations with Apply
String manipulation often requires apply() when built-in string methods don’t suffice.
df = pd.DataFrame({
    'email': ['john.doe@company.com', 'jane.smith@firm.org',
              'bob.jones@startup.io', 'alice.wong@corp.net']
})
# Extract domain
df['domain'] = df['email'].apply(lambda x: x.split('@')[1])
# Complex string transformation
def format_email(email):
    username, domain = email.split('@')
    name_parts = username.split('.')
    formatted_name = ' '.join([part.capitalize() for part in name_parts])
    return f"{formatted_name} ({domain})"
df['formatted'] = df['email'].apply(format_email)
# Using regex with apply
import re
def extract_tld(email):
    match = re.search(r'\.([a-z]+)$', email)
    return match.group(1) if match else None
df['tld'] = df['email'].apply(extract_tld)
print(df)
                  email       domain               formatted  tld
0  john.doe@company.com  company.com  John Doe (company.com)  com
1   jane.smith@firm.org     firm.org   Jane Smith (firm.org)  org
2  bob.jones@startup.io   startup.io  Bob Jones (startup.io)   io
3   alice.wong@corp.net     corp.net   Alice Wong (corp.net)  net
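Before reaching for `apply()` on strings, it is worth checking pandas' vectorized `.str` accessor, which covers the simpler cases above directly:

```python
import pandas as pd

df = pd.DataFrame({'email': ['john.doe@company.com', 'jane.smith@firm.org']})

# Vectorized equivalents of the apply() versions above
df['domain'] = df['email'].str.split('@').str[1]
# expand=False makes str.extract return a Series for a single group
df['tld'] = df['email'].str.extract(r'\.([a-z]+)$', expand=False)
```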
Handling Errors in Apply
Implement error handling to prevent entire operations from failing due to bad data.
df = pd.DataFrame({
    'values': ['100', '250', 'invalid', '300', None, '175']
})

def safe_convert(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan
df['numeric'] = df['values'].apply(safe_convert)
# More sophisticated error handling with logging
def convert_with_logging(value, index):
    try:
        return float(value)
    except (ValueError, TypeError) as e:
        print(f"Error at index {index}: {e}")
        return np.nan

df['numeric_logged'] = df.reset_index().apply(
    lambda row: convert_with_logging(row['values'], row['index']),
    axis=1
)
print(df)
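For numeric conversion specifically, pandas ships a built-in that makes the custom `safe_convert` above unnecessary: `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` in one vectorized call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': ['100', '250', 'invalid', '300', None, '175']})

# errors='coerce' replaces anything unparseable with NaN
df['numeric'] = pd.to_numeric(df['values'], errors='coerce')
```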
Performance Optimization Strategies
When apply() is necessary, optimize its execution.
df = pd.DataFrame({
    'values': np.random.rand(10000)
})

# Bad: recomputes the same aggregate on every call
def inefficient_transform(x):
    threshold = df['values'].mean()
    return x > threshold

# Good: compute once, reuse
threshold = df['values'].mean()

def efficient_transform(x):
    return x > threshold

# Use NumPy functions when possible
# Slower: element-by-element in Python
df['sqrt_apply'] = df['values'].apply(lambda x: x ** 0.5)
# Faster: one vectorized call
df['sqrt_numpy'] = np.sqrt(df['values'])
# For complex conditions, use np.where or np.select
conditions = [
    df['values'] < 0.3,
    df['values'] < 0.7,
    df['values'] >= 0.7
]
choices = ['Low', 'Medium', 'High']
df['category'] = np.select(conditions, choices)
print(df.head())
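When a row-wise `apply()` genuinely cannot be avoided, `raw=True` can shave off overhead: it hands the function a plain NumPy array per row instead of constructing a `Series`, so fields must be accessed by position rather than by label. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 250], 'quantity': [2, 1]})

# raw=True passes each row as an ndarray; index by position, not label
revenue = df.apply(lambda arr: arr[0] * arr[1], axis=1, raw=True)
```

This only helps when the function needs no label-based access and the DataFrame has a single shared dtype per row slice; otherwise the regular form is safer.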
Parallel Processing with Apply
For CPU-intensive operations on large datasets, consider parallel execution.
from multiprocessing import Pool
import numpy as np
df = pd.DataFrame({
    'values': np.random.randint(1, 1000000, 10000)
})

def expensive_computation(x):
    # Simulate an expensive operation
    return sum(i for i in range(int(x ** 0.5)))
# Sequential apply
result_sequential = df['values'].apply(expensive_computation)
# Parallel apply using swifter (install: pip install swifter)
# import swifter
# result_parallel = df['values'].swifter.apply(expensive_computation)
# Manual parallel processing
def parallel_apply(series, func, n_jobs=4):
    with Pool(n_jobs) as pool:
        result = pool.map(func, series)
    return pd.Series(result, index=series.index)

# Guard the call with `if __name__ == '__main__':` on platforms that
# spawn worker processes (Windows, macOS), or each worker will
# re-import this script
# result_manual = parallel_apply(df['values'], expensive_computation)
The apply() method provides essential flexibility for data transformations in Pandas. While vectorized operations should be your first choice for performance, apply() becomes invaluable for complex logic, conditional transformations, and operations requiring multiple column access. Balance readability and performance by profiling your code and choosing the appropriate method for each use case.