Pandas - Apply Lambda Function to Column
Key Insights
• Lambda functions with apply() provide a concise way to transform DataFrame columns without writing separate function definitions, ideal for simple operations like string manipulation, mathematical transformations, and conditional logic.
• Use apply() with axis=1 for row-wise operations on a DataFrame, map() for element-wise transformations of a Series, and vectorized operations when performance matters. A lambda with apply() is slower than the vectorized alternatives but can be more readable for complex logic.
• Lambda functions can access multiple columns simultaneously using axis=1, enabling sophisticated transformations that depend on relationships between different fields in your dataset.
Basic Lambda Application on Single Column
The apply() method combined with lambda functions offers a straightforward approach to transform DataFrame columns. Here’s the fundamental pattern:
import pandas as pd
df = pd.DataFrame({
'price': [100, 250, 175, 300, 425],
'quantity': [2, 1, 3, 2, 1]
})
# Apply lambda to calculate 20% discount
df['discounted_price'] = df['price'].apply(lambda x: x * 0.8)
print(df)
Output:
price quantity discounted_price
0 100 2 80.0
1 250 1 200.0
2 175 3 140.0
3 300 2 240.0
4 425 1 340.0
The lambda function receives each element from the price column as x and returns the transformed value. Pandas automatically creates a new Series with the results.
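For simple arithmetic like this discount, the same result can be reached without apply(). The sketch below rebuilds the price column from the example above and compares apply(), map(), and plain column arithmetic. The variable names via_apply, via_map, and via_vectorized are illustrative, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 250, 175, 300, 425]})

# Element-wise lambda via apply(), as in the example above
via_apply = df['price'].apply(lambda x: x * 0.8)

# Series.map() accepts the same lambda and also works element by element
via_map = df['price'].map(lambda x: x * 0.8)

# Vectorized arithmetic operates on the whole column at once
via_vectorized = df['price'] * 0.8

# Check that the three approaches agree
print(via_apply.equals(via_map))
print((via_apply - via_vectorized).abs().max())
```

When the transformation is a single arithmetic expression, the vectorized form is usually the shortest and the fastest of the three.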
String Transformations
Lambda functions excel at string manipulation tasks that would otherwise require verbose code:
df = pd.DataFrame({
'name': ['john doe', 'jane smith', 'bob wilson'],
'email': ['john@example.com', 'jane@example.com', 'bob@example.com']
})
# Capitalize names
df['name_formatted'] = df['name'].apply(lambda x: x.title())
# Extract email domain
df['domain'] = df['email'].apply(lambda x: x.split('@')[1])
# Create username from email
df['username'] = df['email'].apply(lambda x: x.split('@')[0].upper())
print(df)
Output:
name email name_formatted domain username
0 john doe john@example.com John Doe example.com JOHN
1 jane smith jane@example.com Jane Smith example.com JANE
2 bob wilson bob@example.com Bob Wilson example.com BOB
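Many of these string operations also have vectorized counterparts on the .str accessor. The following is a minimal sketch of the same three transformations written that way, assuming every email contains exactly one "@". The intermediate name parts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john doe', 'jane smith', 'bob wilson'],
    'email': ['john@example.com', 'jane@example.com', 'bob@example.com']
})

# Capitalize names with the vectorized string method
df['name_formatted'] = df['name'].str.title()

# Split each email once; expand=True returns a DataFrame with columns 0 and 1
parts = df['email'].str.split('@', expand=True)
df['username'] = parts[0].str.upper()
df['domain'] = parts[1]

print(df)
```

One practical difference is that .str methods propagate missing values instead of raising, whereas a lambda calling x.title() fails on None or NaN.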
Conditional Logic with Lambda
Lambda functions support inline conditional expressions using Python’s ternary operator:
df = pd.DataFrame({
'temperature': [15, 25, 30, 10, 35],
'humidity': [60, 70, 80, 50, 85]
})
# Categorize temperature
df['temp_category'] = df['temperature'].apply(
lambda x: 'Hot' if x > 28 else ('Warm' if x > 20 else 'Cold')
)
# Flag high humidity
df['high_humidity'] = df['humidity'].apply(lambda x: x > 75)  # the comparison already yields True/False
print(df)
Output:
temperature humidity temp_category high_humidity
0 15 60 Cold False
1 25 70 Warm False
2 30 80 Hot True
3 10 50 Cold False
4 35 85 Hot True
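As a point of comparison, the same categorization can be sketched with numpy.select, which evaluates a list of conditions in order and picks the first match. This is one possible vectorized alternative, not the only one. The names conditions and choices are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temperature': [15, 25, 30, 10, 35],
    'humidity': [60, 70, 80, 50, 85]
})

# Conditions are checked in order, so 'Hot' takes priority over 'Warm'
conditions = [df['temperature'] > 28, df['temperature'] > 20]
choices = ['Hot', 'Warm']
df['temp_category'] = np.select(conditions, choices, default='Cold')

# A plain comparison already produces a boolean column
df['high_humidity'] = df['humidity'] > 75

print(df)
```

The nested ternary in the lambda and the ordered condition list express the same priority: the first branch that matches wins.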
Multi-Column Operations with axis=1
Access multiple columns simultaneously by setting axis=1, which passes entire rows to the lambda function:
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'cost': [50, 75, 100, 125],
'selling_price': [80, 100, 140, 150],
'units_sold': [100, 150, 80, 120]
})
# Calculate profit margin percentage
df['profit_margin'] = df.apply(
lambda row: ((row['selling_price'] - row['cost']) / row['selling_price']) * 100,
axis=1
)
# Calculate total revenue
df['revenue'] = df.apply(
lambda row: row['selling_price'] * row['units_sold'],
axis=1
)
# Create status based on multiple conditions
df['status'] = df.apply(
lambda row: 'High Performer' if row['profit_margin'] > 30 and row['units_sold'] > 100 else 'Standard',
axis=1
)
print(df.round(2))
Output:
product cost selling_price units_sold profit_margin revenue status
0 A 50 80 100 37.50 8000 Standard
1 B 75 100 150 25.00 15000 Standard
2 C 100 140 80 28.57 11200 Standard
3 D 125 150 120 16.67 18000 Standard
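Row-wise apply() calls the lambda once per row, which can become slow on large frames. When the logic is plain arithmetic and comparisons, a column-wise version is often possible. The sketch below reproduces the three calculations above that way. The helper name is_high is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'cost': [50, 75, 100, 125],
    'selling_price': [80, 100, 140, 150],
    'units_sold': [100, 150, 80, 120]
})

# Column arithmetic replaces the row-wise lambdas
df['profit_margin'] = (df['selling_price'] - df['cost']) / df['selling_price'] * 100
df['revenue'] = df['selling_price'] * df['units_sold']

# Combine boolean columns with &; each comparison needs its own parentheses
is_high = (df['profit_margin'] > 30) & (df['units_sold'] > 100)
df['status'] = np.where(is_high, 'High Performer', 'Standard')

print(df.round(2))
```

Note that both conditions are strict inequalities, so a row with exactly 100 units sold does not qualify as a high performer.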
Lambda with External Functions
Combine lambda functions with external libraries or custom functions for complex transformations:
import numpy as np
from datetime import datetime
df = pd.DataFrame({
'date_string': ['2024-01-15', '2024-02-20', '2024-03-10'],
'values': [10, -5, 15],
'scores': [85, 92, 78]
})
# Parse dates
df['parsed_date'] = df['date_string'].apply(
lambda x: datetime.strptime(x, '%Y-%m-%d')
)
# Apply numpy function
df['abs_values'] = df['values'].apply(lambda x: np.abs(x))
# Custom function with lambda wrapper
def calculate_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    else:
        return 'C'
df['grade'] = df['scores'].apply(lambda x: calculate_grade(x))
print(df)
Output:
date_string values scores parsed_date abs_values grade
0 2024-01-15 10 85 2024-01-15 10 B
1 2024-02-20 -5 92 2024-02-20 5 A
2 2024-03-10 15 78 2024-03-10 15 C
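A lambda that only forwards its argument to another function can usually be dropped, and some of these steps have built-in pandas equivalents. The following sketch shows the same three columns produced with pd.to_datetime, Series.abs, and a direct function reference. It is one reasonable rewrite, not a required one.

```python
import pandas as pd

def calculate_grade(score):
    """Map a numeric score to a letter grade (same thresholds as above)."""
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    return 'C'

df = pd.DataFrame({
    'date_string': ['2024-01-15', '2024-02-20', '2024-03-10'],
    'values': [10, -5, 15],
    'scores': [85, 92, 78]
})

# Parse the whole column at once instead of calling strptime per element
df['parsed_date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')

# Series has its own abs() method
df['abs_values'] = df['values'].abs()

# Pass the function itself; wrapping it in a lambda adds nothing here
df['grade'] = df['scores'].apply(calculate_grade)

print(df)
```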
Handling None and NaN Values
Lambda functions need explicit handling for missing data:
df = pd.DataFrame({
'values': [10, None, 25, np.nan, 30],
'names': ['Alice', None, 'Bob', 'Charlie', None]
})
# Safe mathematical operation
df['doubled'] = df['values'].apply(
lambda x: x * 2 if pd.notna(x) else 0
)
# Safe string operation
df['name_length'] = df['names'].apply(
lambda x: len(x) if pd.notna(x) else 0
)
print(df)
Output:
values names doubled name_length
0 10.0 Alice 20.0 5
1 NaN None 0.0 0
2 25.0 Bob 50.0 3
3 NaN Charlie 0.0 7
4 30.0 None 60.0 0
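Built-in pandas operations propagate missing values on their own, so the explicit pd.notna() check can sometimes be replaced by fillna(). The sketch below reproduces the two columns above that way, assuming 0 is the desired fallback in both cases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'values': [10, None, 25, np.nan, 30],
    'names': ['Alice', None, 'Bob', 'Charlie', None]
})

# NaN * 2 stays NaN, so fill the gaps after the arithmetic
df['doubled'] = (df['values'] * 2).fillna(0)

# .str.len() yields a missing value for missing names; fill, then cast to int
df['name_length'] = df['names'].str.len().fillna(0).astype(int)

print(df)
```

Whether 0 is a sensible replacement depends on the data. In many analyses it is safer to leave the missing values in place and handle them downstream.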
Performance Considerations
Lambda with apply() is convenient but slower than vectorized operations. Compare approaches:
import time
df = pd.DataFrame({
'values': range(100000)
})
# Lambda approach (perf_counter() is the clock intended for measuring short intervals)
start = time.perf_counter()
df['lambda_result'] = df['values'].apply(lambda x: x * 2 + 10)
lambda_time = time.perf_counter() - start
# Vectorized approach
start = time.perf_counter()
df['vectorized_result'] = df['values'] * 2 + 10
vectorized_time = time.perf_counter() - start
print(f"Lambda time: {lambda_time:.4f}s")
print(f"Vectorized time: {vectorized_time:.4f}s")
print(f"Speedup: {lambda_time/vectorized_time:.1f}x")
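A single timing run like the one above can be noisy, because the first call may include warm-up costs and other processes compete for the CPU. The sketch below uses the standard library's timeit.repeat to run each approach several times and keep the best result. The function names with_lambda and vectorized and the repeat counts are arbitrary choices for illustration.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'values': range(100000)})

def with_lambda():
    # Element-wise Python call for every value
    return df['values'].apply(lambda x: x * 2 + 10)

def vectorized():
    # Whole-column arithmetic
    return df['values'] * 2 + 10

# Three trials of five calls each; the minimum is the least disturbed trial
runs = 5
lambda_best = min(timeit.repeat(with_lambda, repeat=3, number=runs)) / runs
vectorized_best = min(timeit.repeat(vectorized, repeat=3, number=runs)) / runs

print(f"Lambda best:     {lambda_best:.6f}s per call")
print(f"Vectorized best: {vectorized_best:.6f}s per call")
```

The absolute numbers depend on the machine and the pandas version, so no specific speedup is claimed here.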
Use vectorized operations when possible. Reserve lambda functions for:
- Complex logic that can’t be vectorized
- String operations that the .str accessor does not cover
- Conditional transformations whose branches are awkward to express with np.where or np.select
- Operations requiring external function calls
Common Patterns and Best Practices
df = pd.DataFrame({
'text': ['hello world', 'PYTHON pandas', 'Data Science'],
'numbers': [1, 2, 3]
})
# Chain multiple string methods
df['cleaned'] = df['text'].apply(
lambda x: x.lower().strip().replace(' ', '_')
)
# Type conversion guarded by a type check
df['safe_int'] = df['numbers'].apply(
lambda x: int(x) if isinstance(x, (int, float)) else 0
)
# Complex extraction
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
print(df)
For better readability with complex logic, define a named function instead of cramming everything into a lambda:
def process_text(text):
    text = text.lower()
    text = text.strip()
    return text.replace(' ', '_')
df['processed'] = df['text'].apply(process_text)
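A named function also makes it easy to add parameters, since Series.apply() forwards extra keyword arguments to the function it calls. The sketch below extends process_text with a separator parameter. That parameter and the slug column are hypothetical additions for illustration, not part of the original example.

```python
import pandas as pd

def process_text(text, separator='_'):
    """Lowercase, trim, and join words with `separator` (hypothetical parameter)."""
    return text.lower().strip().replace(' ', separator)

df = pd.DataFrame({
    'text': ['hello world', 'PYTHON pandas', 'Data Science']
})

# Default behaviour matches the version above
df['processed'] = df['text'].apply(process_text)

# Extra keyword arguments are passed through to process_text
df['slug'] = df['text'].apply(process_text, separator='-')

print(df)
```

This keeps the call site short without needing a lambda just to bind an argument.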
Lambda functions with apply() strike a balance between code brevity and functionality. Use them judiciously, understanding the performance trade-offs, and switch to vectorized operations or named functions when lambda expressions become unwieldy.