How to Use Assign in Pandas
Key Insights
- The assign() method creates new columns while returning a new DataFrame, making it essential for method chaining and functional programming patterns in pandas.
- Lambda functions inside assign() reference the DataFrame being operated on, which prevents bugs when chaining multiple operations together.
- Use assign() for transformation pipelines and direct assignment (df['col'] = value) for simple, standalone modifications; each has its place.
Introduction to the Assign Method
The assign() method is one of pandas’ most underappreciated features. It creates new columns on a DataFrame and returns a copy with those columns added. This might sound trivial—after all, you can just write df['new_col'] = values—but the distinction matters enormously when you’re building data transformation pipelines.
The key difference is immutability. Direct assignment modifies your DataFrame in place. assign() returns a new DataFrame, leaving the original untouched. This functional programming approach enables method chaining, where you pipe data through a series of transformations in a single, readable expression.
If you’ve ever written code that looks like a wall of intermediate variables—df1, df2, df_filtered, df_final—then assign() will change how you work with pandas.
Basic Syntax and Simple Column Assignment
The basic syntax is straightforward: pass keyword arguments where the keyword becomes the column name and the value becomes the column data.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'quantity': [100, 50, 75]
})
# Add a single column with static values
df_with_category = df.assign(category='Electronics')
print(df_with_category)
Output:
product quantity category
0 Widget 100 Electronics
1 Gadget 50 Electronics
2 Gizmo 75 Electronics
You can also pass a list, NumPy array, or Series as the value:
# Add a column from a list
prices = [9.99, 24.99, 14.99]
df_with_prices = df.assign(unit_price=prices)
print(df_with_prices)
Output:
product quantity unit_price
0 Widget 100 9.99
1 Gadget 50 24.99
2 Gizmo 75 14.99
Notice that the original df remains unchanged. This is the immutability principle in action.
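A quick check makes this concrete. The following is a minimal sketch that reuses the sample frame from above and confirms that assign() hands back a separate object while the caller's DataFrame keeps its original columns:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75]
})

df_with_prices = df.assign(unit_price=[9.99, 24.99, 14.99])

# The original still has only its two columns
print(list(df.columns))              # ['product', 'quantity']
print(list(df_with_prices.columns))  # ['product', 'quantity', 'unit_price']

# assign() returns a different object rather than mutating df
print(df_with_prices is df)          # False
```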
Creating Columns from Existing Data
The real power of assign() emerges when you derive new columns from existing ones. The usual way to do this is to pass a callable, typically a lambda: assign() calls it with the DataFrame and uses the return value as the new column.
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'quantity': [100, 50, 75],
'unit_price': [9.99, 24.99, 14.99]
})
# Calculate total price using a lambda
df_with_total = df.assign(total_price=lambda x: x['quantity'] * x['unit_price'])
print(df_with_total)
Output:
product quantity unit_price total_price
0 Widget 100 9.99 999.00
1 Gadget 50 24.99 1249.50
2 Gizmo 75 14.99 1124.25
The lambda function receives the DataFrame as its argument (conventionally named x or df), and you perform your calculation on it.
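A lambda is not required: assign() accepts any callable that takes the DataFrame and returns the column values. As a small sketch, the same calculation can be written with a named function (total_price here is a helper name chosen for illustration), which is easier to reuse and test when the logic grows:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75],
    'unit_price': [9.99, 24.99, 14.99]
})

def total_price(frame):
    """Return quantity * unit_price for whatever DataFrame assign() passes in."""
    return frame['quantity'] * frame['unit_price']

# Pass the function itself (no parentheses); assign() calls it with the DataFrame
df_with_total = df.assign(total_price=total_price)
print(df_with_total)
```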
Using Callable Functions (Lambdas)
You might wonder why we use lambdas instead of just writing df['quantity'] * df['unit_price'] directly. Both work in isolation, but lambdas become essential in method chains.
Consider this problematic code:
# This works but is fragile
df = pd.DataFrame({'price': [100, 200, 300]})
result = df.assign(discount=df['price'] * 0.1) # References original df
Now consider what happens in a chain:
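# This is fragile: the result depends on index alignment
df = pd.DataFrame({'price': [100, 200, 300], 'active': [True, False, True]})
# The df['price'] reference points to the ORIGINAL df, not the filtered one
result = (df
.query('active == True')
.assign(discount=df['price'] * 0.1) # Computed from all 3 original rows, then aligned by index
)
The df['price'] reference is evaluated immediately against the original DataFrame, not the filtered intermediate result. In this particular case pandas aligns the three-row Series to the two remaining index labels, so the output happens to be correct. The code breaks as soon as an earlier step changes the index (for example with reset_index()) or modifies price itself: the discount is then matched to the wrong rows or computed from stale values, often without any error being raised.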
The lambda solves this by deferring evaluation:
# Correct approach with lambda
result = (df
.query('active == True')
.assign(discount=lambda x: x['price'] * 0.1) # x is the filtered DataFrame
)
print(result)
Output:
price active discount
0 100 True 10.0
2 300 True 30.0
The lambda receives whatever DataFrame exists at that point in the chain. Always use lambdas when chaining operations.
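The risk of referencing the outer df becomes visible once a step in the chain changes the index. The sketch below adds a hypothetical reset_index(drop=True) step (not part of the example above) and compares the eager reference with the lambda:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 300], 'active': [True, False, True]})

# Hypothetical chain step: filter, then renumber the remaining rows 0..n-1
filtered = df.query('active == True').reset_index(drop=True)

# Eager reference: computed from the original 3 rows, then aligned on labels 0 and 1
eager = filtered.assign(discount=df['price'] * 0.1)

# Deferred reference: computed from the 2 rows that are actually in the chain
deferred = filtered.assign(discount=lambda x: x['price'] * 0.1)

print(eager)     # the row with price 300 is paired with the discount of the dropped row (20.0)
print(deferred)  # the row with price 300 gets its own discount (30.0)
```

No exception is raised in the eager version, which is what makes this kind of bug hard to spot.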
Assigning Multiple Columns at Once
You can add multiple columns in a single assign() call by passing multiple keyword arguments:
df = pd.DataFrame({
'order_date': pd.to_datetime(['2024-01-15', '2024-03-22', '2024-07-08']),
'amount': [150.00, 275.50, 89.99]
})
# Extract multiple date components at once
df_expanded = df.assign(
year=lambda x: x['order_date'].dt.year,
month=lambda x: x['order_date'].dt.month,
day=lambda x: x['order_date'].dt.day,
quarter=lambda x: x['order_date'].dt.quarter
)
print(df_expanded)
Output:
order_date amount year month day quarter
0 2024-01-15 150.00 2024 1 15 1
1 2024-03-22 275.50 2024 3 22 1
2 2024-07-08 89.99 2024 7 8 3
With Python 3.6 or later and pandas 0.23 or later, the columns are added in the order you specify them. This matters when later columns depend on earlier ones:
# Columns can reference previously assigned columns in the same call
df = pd.DataFrame({'base_price': [100, 200, 300]})
df_calculated = df.assign(
tax=lambda x: x['base_price'] * 0.08,
total=lambda x: x['base_price'] + x['tax'] # References 'tax' created above
)
print(df_calculated)
Output:
base_price tax total
0 100 8.0 108.0
1 200 16.0 216.0
2 300 24.0 324.0
This sequential evaluation within a single assign() call keeps related calculations together.
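The same ordering applies when a keyword names a column that already exists: assign() overwrites it in the returned DataFrame, and later keywords see the new values. A minimal sketch, assuming prices that arrive as strings (an invented scenario for illustration):

```python
import pandas as pd

df = pd.DataFrame({'base_price': ['100', '200', '300']})  # prices stored as strings

df_clean = df.assign(
    # Overwrite the existing column with a numeric version
    base_price=lambda x: x['base_price'].astype(int),
    # Evaluated afterwards, so it sees the converted integers
    tax=lambda x: x['base_price'] * 0.08
)
print(df_clean)
print(df['base_price'].tolist())  # ['100', '200', '300'] (the original still holds strings)
```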
Method Chaining with Assign
Here’s where assign() truly shines. Method chaining lets you express a complete data transformation as a single, readable pipeline:
from io import StringIO
import pandas as pd
# Sample sales data
sales_data = """
date,product,quantity,unit_price,region
2024-01-15,Widget,10,9.99,North
2024-01-16,Gadget,5,24.99,South
2024-01-17,Widget,8,9.99,North
2024-02-01,Gizmo,12,14.99,North
2024-02-15,Widget,20,9.99,South
2024-03-01,Gadget,3,24.99,North
"""
# Complete transformation pipeline
result = (
pd.read_csv(StringIO(sales_data))
.assign(
date=lambda x: pd.to_datetime(x['date']),
revenue=lambda x: x['quantity'] * x['unit_price'],
month=lambda x: x['date'].dt.to_period('M')
)
.query('region == "North"')
.groupby(['month', 'product'])
.agg(
total_quantity=('quantity', 'sum'),
total_revenue=('revenue', 'sum')
)
.reset_index()
)
print(result)
Output:
month product total_quantity total_revenue
0 2024-01 Widget 18 179.82
1 2024-02 Gizmo 12 179.88
2 2024-03 Gadget 3 74.97
This pipeline reads data, adds computed columns, filters rows, aggregates, and resets the index—all without a single intermediate variable. The code reads top to bottom, describing exactly what happens to your data at each step.
Compare this to the imperative alternative:
# Imperative style - harder to follow
df = pd.read_csv(StringIO(sales_data))
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['unit_price']
df['month'] = df['date'].dt.to_period('M')
df_north = df[df['region'] == 'North']
result = df_north.groupby(['month', 'product']).agg(
total_quantity=('quantity', 'sum'),
total_revenue=('revenue', 'sum')
).reset_index()
The chained version is more concise and makes the data flow explicit.
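If you are converting existing imperative code to a chain, it helps to verify that both versions agree. The following sketch runs the two styles side by side on the same sample data and compares them with pd.testing.assert_frame_equal; the aggregations dictionary is a small helper introduced here to avoid repeating the aggregation spec:

```python
from io import StringIO

import pandas as pd

sales_data = """date,product,quantity,unit_price,region
2024-01-15,Widget,10,9.99,North
2024-01-16,Gadget,5,24.99,South
2024-01-17,Widget,8,9.99,North
2024-02-01,Gizmo,12,14.99,North
2024-02-15,Widget,20,9.99,South
2024-03-01,Gadget,3,24.99,North
"""

# Shared aggregation spec (helper name chosen for this sketch)
aggregations = dict(
    total_quantity=('quantity', 'sum'),
    total_revenue=('revenue', 'sum'),
)

# Chained style
chained = (
    pd.read_csv(StringIO(sales_data))
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        revenue=lambda x: x['quantity'] * x['unit_price'],
        month=lambda x: x['date'].dt.to_period('M'),
    )
    .query('region == "North"')
    .groupby(['month', 'product'])
    .agg(**aggregations)
    .reset_index()
)

# Imperative style
df = pd.read_csv(StringIO(sales_data))
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['unit_price']
df['month'] = df['date'].dt.to_period('M')
imperative = (
    df[df['region'] == 'North']
    .groupby(['month', 'product'])
    .agg(**aggregations)
    .reset_index()
)

# Raises AssertionError if the two results differ in values, dtypes, or index
pd.testing.assert_frame_equal(chained, imperative)
```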
Assign vs. Direct Column Assignment
Both approaches have their place. Here’s a direct comparison:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'score': [85, 92, 78]
})
# Direct assignment - modifies in place
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')
print("After direct assignment:")
print(df)
# Reset for comparison
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'score': [85, 92, 78]
})
# Assign - returns new DataFrame
df_graded = df.assign(
grade=lambda x: x['score'].apply(lambda s: 'A' if s >= 90 else 'B' if s >= 80 else 'C')
)
print("\nOriginal df unchanged:")
print(df)
print("\nNew df with grade:")
print(df_graded)
Use direct assignment when:
- You’re doing simple, one-off modifications
- Memory efficiency matters (no copy created)
- You’re working interactively in a notebook and want to see changes immediately
Use assign() when:
- Building transformation pipelines
- You need the original DataFrame preserved
- Working in production code where immutability prevents bugs
- Chaining multiple operations together
The overhead of creating a new DataFrame is negligible for most workloads. When in doubt, prefer assign() for its clarity and safety. Reserve direct assignment for exploratory work or performance-critical sections where profiling shows it matters.
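To check whether the copy matters for your own data, measure it. This is a rough sketch using the standard timeit module on a synthetic frame; the row count, column name, and repeat count are arbitrary choices for illustration, and the timings will vary with hardware and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.integers(0, 100, size=200_000)

df = pd.DataFrame({'score': scores})          # used with assign(), never mutated
df_inplace = pd.DataFrame({'score': scores})  # mutated by direct assignment

def with_assign():
    # Builds and returns a new DataFrame on every call
    return df.assign(doubled=lambda x: x['score'] * 2)

def with_direct():
    # Writes (or overwrites) the column on the existing DataFrame
    df_inplace['doubled'] = df_inplace['score'] * 2

# Total seconds for 50 runs of each approach
print('assign():', timeit.timeit(with_assign, number=50))
print('direct:  ', timeit.timeit(with_direct, number=50))
```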