Pandas - assign() to Add Computed Columns
• The `assign()` method enables functional-style column creation by returning a new DataFrame rather than modifying in place, making it ideal for method chaining and immutable data pipelines.
Key Insights
• The assign() method enables functional-style column creation by returning a new DataFrame rather than modifying in place, making it ideal for method chaining and immutable data pipelines.
• You can reference newly created columns within the same assign() call by using callable functions instead of direct values, allowing sequential column dependencies.
• Lambda functions and named functions work interchangeably with assign(), but named functions improve readability for complex transformations and enable better testing.
Understanding assign() Fundamentals
The assign() method adds new columns to a DataFrame while preserving the original. Unlike direct assignment with bracket notation, assign() returns a new DataFrame object, leaving the source unchanged.
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [100, 150, 200, 120],
'quantity': [5, 3, 7, 4]
})
# Traditional assignment (modifies in place)
df['revenue'] = df['price'] * df['quantity']
# Using assign() (returns new DataFrame)
df_new = df.assign(revenue=df['price'] * df['quantity'])
The key difference: assign() supports method chaining, making it valuable in data transformation pipelines where you want to preserve immutability or create readable sequential operations.
Method Chaining with assign()
Method chaining transforms verbose multi-step operations into clean, readable pipelines. Each assign() call returns a DataFrame, allowing you to chain multiple operations.
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [100, 150, 200, 120],
'quantity': [5, 3, 7, 4],
'cost': [60, 90, 130, 75]
})
result = (df
.assign(revenue=lambda x: x['price'] * x['quantity'])
.assign(profit=lambda x: x['revenue'] - (x['cost'] * x['quantity']))
.assign(margin=lambda x: (x['profit'] / x['revenue']) * 100)
.query('margin > 30')
.sort_values('profit', ascending=False)
)
print(result)
Output:
product price quantity cost revenue profit margin
0 A 100 5 60 500 200 40.000000
2 C 200 7 130 1400 490 35.000000
Multiple Columns in Single assign()
You can add multiple columns in one assign() call using keyword arguments. This approach is cleaner than multiple separate operations.
df = pd.DataFrame({
'temperature_f': [32, 68, 86, 104],
'humidity': [45, 60, 75, 80]
})
df_converted = df.assign(
temperature_c=lambda x: (x['temperature_f'] - 32) * 5/9,
temperature_k=lambda x: ((x['temperature_f'] - 32) * 5/9) + 273.15,
comfort_index=lambda x: x['temperature_f'] - (0.55 * (1 - x['humidity']/100) * (x['temperature_f'] - 58))
)
print(df_converted.round(2))
Output:
temperature_f humidity temperature_c temperature_k comfort_index
0 32 45 0.00 273.15 17.14
1 68 60 20.00 293.15 65.80
2 86 75 30.00 303.15 83.15
3 104 80 40.00 313.15 101.92
Referencing Newly Created Columns
A powerful feature of assign() is the ability to reference columns created earlier in the same call. Use callable functions (lambdas or named functions) to access the DataFrame state at evaluation time.
df = pd.DataFrame({
'sales_q1': [10000, 15000, 12000],
'sales_q2': [11000, 14000, 13000],
'sales_q3': [12000, 16000, 14000],
'sales_q4': [13000, 17000, 15000]
})
df_analysis = df.assign(
total_sales=lambda x: x['sales_q1'] + x['sales_q2'] + x['sales_q3'] + x['sales_q4'],
avg_quarterly=lambda x: x['total_sales'] / 4,
q4_vs_avg=lambda x: ((x['sales_q4'] - x['avg_quarterly']) / x['avg_quarterly']) * 100
)
print(df_analysis)
Output:
sales_q1 sales_q2 sales_q3 sales_q4 total_sales avg_quarterly q4_vs_avg
0 10000 11000 12000 13000 46000 11500.0 13.043478
1 15000 14000 16000 17000 62000 15500.0 9.677419
2 12000 13000 14000 15000 54000 13500.0 11.111111
Using Named Functions for Complex Logic
For complex transformations, named functions improve readability and testability compared to lambda expressions.
def calculate_discount_tier(row):
"""Assign discount tier based on purchase amount and customer type"""
if row['customer_type'] == 'premium':
if row['purchase_amount'] >= 1000:
return 'platinum'
elif row['purchase_amount'] >= 500:
return 'gold'
else:
return 'silver'
else:
if row['purchase_amount'] >= 1000:
return 'gold'
else:
return 'bronze'
def apply_discount(row):
"""Calculate discount percentage based on tier"""
discount_map = {
'platinum': 0.20,
'gold': 0.15,
'silver': 0.10,
'bronze': 0.05
}
return discount_map.get(row['discount_tier'], 0)
df = pd.DataFrame({
'customer_type': ['premium', 'standard', 'premium', 'standard'],
'purchase_amount': [1200, 800, 450, 1500]
})
df_discounted = (df
.assign(discount_tier=lambda x: x.apply(calculate_discount_tier, axis=1))
.assign(discount_rate=lambda x: x.apply(apply_discount, axis=1))
.assign(final_amount=lambda x: x['purchase_amount'] * (1 - x['discount_rate']))
)
print(df_discounted)
Output:
customer_type purchase_amount discount_tier discount_rate final_amount
0 premium 1200 platinum 0.20 960.0
1 standard 800 bronze 0.05 760.0
2 premium 450 silver 0.10 405.0
3 standard 1500 gold 0.15 1275.0
Conditional Column Assignment
Combine assign() with np.where() or np.select() for conditional logic without explicit loops.
df = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'David'],
'years_service': [2, 5, 8, 12],
'performance_score': [85, 92, 78, 95]
})
conditions = [
(df['years_service'] >= 10) & (df['performance_score'] >= 90),
(df['years_service'] >= 5) & (df['performance_score'] >= 85),
(df['years_service'] >= 3) & (df['performance_score'] >= 80)
]
choices = ['Senior', 'Mid-level', 'Junior']
df_categorized = df.assign(
bonus_eligible=lambda x: np.where(x['performance_score'] >= 85, 'Yes', 'No'),
level=lambda x: np.select(conditions, choices, default='Entry')
)
print(df_categorized)
Output:
employee years_service performance_score bonus_eligible level
0 Alice 2 85 Yes Entry
1 Bob 5 92 Yes Mid-level
2 Charlie 8 78 No Entry
3 David 12 95 Yes Senior
Performance Considerations
While assign() creates new DataFrames, pandas optimizes memory usage through copy-on-write mechanics in recent versions. However, for very large datasets with many sequential operations, consider the performance implications.
import time
# Large dataset
df_large = pd.DataFrame({
'col1': np.random.rand(1000000),
'col2': np.random.rand(1000000)
})
# Method 1: Multiple assign() calls
start = time.time()
df1 = (df_large
.assign(col3=lambda x: x['col1'] * 2)
.assign(col4=lambda x: x['col2'] * 3)
.assign(col5=lambda x: x['col3'] + x['col4'])
)
time1 = time.time() - start
# Method 2: Single assign() call
start = time.time()
df2 = df_large.assign(
col3=lambda x: x['col1'] * 2,
col4=lambda x: x['col2'] * 3,
col5=lambda x: x['col1'] * 2 + x['col2'] * 3
)
time2 = time.time() - start
print(f"Multiple assign() calls: {time1:.4f}s")
print(f"Single assign() call: {time2:.4f}s")
Single assign() calls generally perform better by reducing intermediate DataFrame creation, though the difference narrows with copy-on-write enabled in pandas 2.0+.
Integration with Data Pipelines
The assign() method excels in ETL pipelines where immutability and traceability matter. Combine it with other DataFrame methods for complete transformations.
def process_sales_data(df):
"""Complete sales data processing pipeline"""
return (df
.assign(
revenue=lambda x: x['price'] * x['quantity'],
revenue_category=lambda x: pd.cut(
x['price'] * x['quantity'],
bins=[0, 500, 1000, float('inf')],
labels=['Low', 'Medium', 'High']
)
)
.assign(
cumulative_revenue=lambda x: x.groupby('product')['revenue'].cumsum()
)
.sort_values(['product', 'cumulative_revenue'])
.reset_index(drop=True)
)
df_sales = pd.DataFrame({
'product': ['A', 'A', 'B', 'B', 'C'],
'price': [100, 120, 150, 140, 200],
'quantity': [5, 4, 3, 6, 7]
})
result = process_sales_data(df_sales)
print(result)
This pattern creates testable, reusable transformation functions that clearly express data processing intent without side effects.