How to Use Assign in Pandas

Key Insights

  • The assign() method creates new columns while returning a new DataFrame, making it essential for method chaining and functional programming patterns in pandas.
  • Lambda functions inside assign() reference the DataFrame being operated on, which prevents bugs when chaining multiple operations together.
  • Use assign() for transformation pipelines and direct assignment (df['col'] = value) for simple, standalone modifications—each has its place.

Introduction to the Assign Method

The assign() method is one of pandas’ most underappreciated features. It creates new columns on a DataFrame and returns a copy with those columns added. This might sound trivial—after all, you can just write df['new_col'] = values—but the distinction matters enormously when you’re building data transformation pipelines.

The key difference is immutability. Direct assignment modifies your DataFrame in place. assign() returns a new DataFrame, leaving the original untouched. This functional programming approach enables method chaining, where you pipe data through a series of transformations in a single, readable expression.

If you’ve ever written code that looks like a wall of intermediate variables—df1, df2, df_filtered, df_final—then assign() will change how you work with pandas.

Basic Syntax and Simple Column Assignment

The basic syntax is straightforward: pass keyword arguments where the keyword becomes the column name and the value becomes the column data.

import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75]
})

# Add a single column with static values
df_with_category = df.assign(category='Electronics')

print(df_with_category)

Output:

   product  quantity     category
0   Widget       100  Electronics
1   Gadget        50  Electronics
2    Gizmo        75  Electronics

You can also pass a list, NumPy array, or Series as the value:

# Add a column from a list
prices = [9.99, 24.99, 14.99]
df_with_prices = df.assign(unit_price=prices)

print(df_with_prices)

Output:

   product  quantity  unit_price
0   Widget       100        9.99
1   Gadget        50       24.99
2    Gizmo        75       14.99

Notice that the original df remains unchanged. This is the immutability principle in action.
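A quick way to confirm this is to compare the columns of both objects. The sketch below reuses the sample data from above and only checks what the prose already claims:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75]
})

df_with_prices = df.assign(unit_price=[9.99, 24.99, 14.99])

# The new column exists only on the returned DataFrame
print(list(df.columns))              # ['product', 'quantity']
print(list(df_with_prices.columns))  # ['product', 'quantity', 'unit_price']
print(df_with_prices is df)          # False
```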

Creating Columns from Existing Data

The real power of assign() emerges when you derive new columns from existing ones. The catch is that you should reach those columns through the DataFrame that assign() hands to a callable, rather than through an outer variable.

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75],
    'unit_price': [9.99, 24.99, 14.99]
})

# Calculate total price using a lambda
df_with_total = df.assign(total_price=lambda x: x['quantity'] * x['unit_price'])

print(df_with_total)

Output:

   product  quantity  unit_price  total_price
0   Widget       100        9.99       999.00
1   Gadget        50       24.99      1249.50
2    Gizmo        75       14.99      1124.25

The lambda function receives the DataFrame as its argument (conventionally named x or df), and you perform your calculation on it.
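The value does not have to be a lambda; any callable that accepts the DataFrame should work. Here is a minimal sketch using a named function, which can be easier to test and reuse. The helper name total_price is hypothetical and chosen only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'quantity': [100, 50, 75],
    'unit_price': [9.99, 24.99, 14.99]
})

def total_price(frame):
    """Hypothetical helper: row-wise quantity times unit price."""
    return frame['quantity'] * frame['unit_price']

# assign() calls total_price(df) and stores the returned Series
df_with_total = df.assign(total_price=total_price)

print(df_with_total)
```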

Using Callable Functions (Lambdas)

You might wonder why we use lambdas instead of just writing df['quantity'] * df['unit_price'] directly. Both work in isolation, but lambdas become essential in method chains.

Consider this problematic code:

# This works but is fragile
df = pd.DataFrame({'price': [100, 200, 300]})
result = df.assign(discount=df['price'] * 0.1)  # References original df

Now consider what happens in a chain:

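# This is fragile in a chain
df = pd.DataFrame({'price': [100, 200, 300], 'active': [True, False, True]})

# The df['price'] reference points to the ORIGINAL df, not the filtered one
result = (df
    .query('active == True')
    .assign(discount=df['price'] * 0.1)  # RISK: computed from all 3 rows, then aligned by index
)

The df['price'] expression is evaluated immediately against the original DataFrame, not the filtered intermediate result. Because the value is a Series, pandas aligns it to the filtered rows by index, so this particular chain happens to return the right numbers. That safety net disappears as soon as an earlier step rebuilds the index or changes the column's values, in which case you get silently wrong data. A list or NumPy array of the wrong length raises a ValueError instead.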
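To make the failure modes concrete, here is a small sketch of three variants of the chain above. The reset_index and to_numpy steps are additions for illustration and are not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 300], 'active': [True, False, True]})

# Case 1: the filtered index (0, 2) still matches the original,
# so index alignment happens to pick the right values
aligned = df.query('active == True').assign(discount=df['price'] * 0.1)
print(aligned)

# Case 2: the index is rebuilt to (0, 1), so alignment pairs the wrong rows
misaligned = (df
    .query('active == True')
    .reset_index(drop=True)
    .assign(discount=df['price'] * 0.1)
)
print(misaligned)  # the price-300 row receives the discount computed for price 200

# Case 3: a plain array has no index to align on, so the length mismatch raises
length_error = None
try:
    df.query('active == True').assign(discount=df['price'].to_numpy() * 0.1)
except ValueError as exc:
    length_error = str(exc)
    print(f'ValueError: {length_error}')
```

Case 2 is the dangerous one because nothing fails: the output simply contains the wrong discount.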

The lambda solves this by deferring evaluation:

# Correct approach with lambda
result = (df
    .query('active == True')
    .assign(discount=lambda x: x['price'] * 0.1)  # x is the filtered DataFrame
)

print(result)

Output:

   price  active  discount
0    100    True      10.0
2    300    True      30.0

The lambda receives whatever DataFrame exists at that point in the chain. Always use lambdas when chaining operations.

Assigning Multiple Columns at Once

You can add multiple columns in a single assign() call by passing multiple keyword arguments:

df = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-01-15', '2024-03-22', '2024-07-08']),
    'amount': [150.00, 275.50, 89.99]
})

# Extract multiple date components at once
df_expanded = df.assign(
    year=lambda x: x['order_date'].dt.year,
    month=lambda x: x['order_date'].dt.month,
    day=lambda x: x['order_date'].dt.day,
    quarter=lambda x: x['order_date'].dt.quarter
)

print(df_expanded)

Output:

  order_date  amount  year  month  day  quarter
0 2024-01-15  150.00  2024      1   15        1
1 2024-03-22  275.50  2024      3   22        1
2 2024-07-08   89.99  2024      7    8        3

With Python 3.6 or later and pandas 0.23 or later, the columns are added in the order you specify them. This matters when later columns depend on earlier ones:

# Columns can reference previously assigned columns in the same call
df = pd.DataFrame({'base_price': [100, 200, 300]})

df_calculated = df.assign(
    tax=lambda x: x['base_price'] * 0.08,
    total=lambda x: x['base_price'] + x['tax']  # References 'tax' created above
)

print(df_calculated)

Output:

   base_price   tax  total
0         100   8.0  108.0
1         200  16.0  216.0
2         300  24.0  324.0

This sequential evaluation within a single assign() call keeps related calculations together.
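The same ordering applies when a keyword overwrites an existing column: later callables see the overwritten values. The sketch below assumes, purely for illustration, that base_price arrived as text and needs converting before the arithmetic:

```python
import pandas as pd

# Assumption for this example: prices were loaded as strings
df = pd.DataFrame({'base_price': ['100', '200', '300']})

df_calculated = df.assign(
    base_price=lambda x: x['base_price'].astype(int),  # overwrite with integers
    tax=lambda x: x['base_price'] * 0.08,              # sees the integer column
    total=lambda x: x['base_price'] + x['tax'],        # sees both of the above
)

print(df_calculated)
print(df['base_price'].tolist())  # the original still holds the strings
```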

Method Chaining with Assign

Here’s where assign() truly shines. Method chaining lets you express a complete data transformation as a single, readable pipeline:

from io import StringIO

import pandas as pd

# Sample sales data
sales_data = """
date,product,quantity,unit_price,region
2024-01-15,Widget,10,9.99,North
2024-01-16,Gadget,5,24.99,South
2024-01-17,Widget,8,9.99,North
2024-02-01,Gizmo,12,14.99,North
2024-02-15,Widget,20,9.99,South
2024-03-01,Gadget,3,24.99,North
"""

# Complete transformation pipeline
result = (
    pd.read_csv(StringIO(sales_data))
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        revenue=lambda x: x['quantity'] * x['unit_price'],
        month=lambda x: x['date'].dt.to_period('M')
    )
    .query('region == "North"')
    .groupby(['month', 'product'])
    .agg(
        total_quantity=('quantity', 'sum'),
        total_revenue=('revenue', 'sum')
    )
    .reset_index()
)

print(result)

Output:

     month product  total_quantity  total_revenue
0  2024-01  Widget              18         179.82
1  2024-02   Gizmo              12         179.88
2  2024-03  Gadget               3          74.97

This pipeline reads data, adds computed columns, filters rows, aggregates, and resets the index—all without a single intermediate variable. The code reads top to bottom, describing exactly what happens to your data at each step.

Compare this to the imperative alternative:

# Imperative style - harder to follow
df = pd.read_csv(StringIO(sales_data))
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['unit_price']
df['month'] = df['date'].dt.to_period('M')
df_north = df[df['region'] == 'North']
result = df_north.groupby(['month', 'product']).agg(
    total_quantity=('quantity', 'sum'),
    total_revenue=('revenue', 'sum')
).reset_index()

The chained version is more concise and makes the data flow explicit.
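If you are migrating existing imperative code to a chain, it can help to check that both versions agree before deleting the old one. This is one possible sketch; the function names chained and imperative are hypothetical wrappers around the two snippets above:

```python
from io import StringIO

import pandas as pd

sales_data = """date,product,quantity,unit_price,region
2024-01-15,Widget,10,9.99,North
2024-01-16,Gadget,5,24.99,South
2024-01-17,Widget,8,9.99,North
2024-02-01,Gizmo,12,14.99,North
2024-02-15,Widget,20,9.99,South
2024-03-01,Gadget,3,24.99,North
"""

def chained(raw):
    """Pipeline style: every step returns a new DataFrame."""
    return (
        pd.read_csv(StringIO(raw))
        .assign(
            date=lambda x: pd.to_datetime(x['date']),
            revenue=lambda x: x['quantity'] * x['unit_price'],
            month=lambda x: x['date'].dt.to_period('M'),
        )
        .query('region == "North"')
        .groupby(['month', 'product'])
        .agg(total_quantity=('quantity', 'sum'), total_revenue=('revenue', 'sum'))
        .reset_index()
    )

def imperative(raw):
    """Step-by-step style: columns are added in place."""
    df = pd.read_csv(StringIO(raw))
    df['date'] = pd.to_datetime(df['date'])
    df['revenue'] = df['quantity'] * df['unit_price']
    df['month'] = df['date'].dt.to_period('M')
    df_north = df[df['region'] == 'North']
    return df_north.groupby(['month', 'product']).agg(
        total_quantity=('quantity', 'sum'),
        total_revenue=('revenue', 'sum'),
    ).reset_index()

# Raises AssertionError if the two results differ in values, dtypes, or labels
pd.testing.assert_frame_equal(chained(sales_data), imperative(sales_data))
print('Both styles produce the same result')
```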

Assign vs. Direct Column Assignment

Both approaches have their place. Here’s a direct comparison:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

# Direct assignment - modifies in place
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')
print("After direct assignment:")
print(df)

# Reset for comparison
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

# Assign - returns new DataFrame
df_graded = df.assign(
    grade=lambda x: x['score'].apply(lambda s: 'A' if s >= 90 else 'B' if s >= 80 else 'C')
)
print("\nOriginal df unchanged:")
print(df)
print("\nNew df with grade:")
print(df_graded)

Use direct assignment when:

  • You’re doing simple, one-off modifications
  • Memory efficiency matters (no copy created)
  • You’re working interactively in a notebook and want to see changes immediately

Use assign() when:

  • Building transformation pipelines
  • You need the original DataFrame preserved
  • Working in production code where immutability prevents bugs
  • Chaining multiple operations together

The overhead of creating a new DataFrame is negligible for most workloads. When in doubt, prefer assign() for its clarity and safety. Reserve direct assignment for exploratory work or performance-critical sections where profiling shows it matters.
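The "immutability prevents bugs" point is easiest to see with helper functions. In this sketch, both function names and the passed column are hypothetical; the contrast is whether the caller's DataFrame changes behind its back:

```python
import pandas as pd

def flag_passed_in_place(frame):
    """Hypothetical helper using direct assignment: mutates its argument."""
    frame['passed'] = frame['score'] >= 80
    return frame

def flag_passed_with_assign(frame):
    """Hypothetical helper using assign(): leaves its argument alone."""
    return frame.assign(passed=lambda x: x['score'] >= 80)

scores = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

safe = flag_passed_with_assign(scores)
unchanged_after_assign = 'passed' not in scores.columns
print(unchanged_after_assign)       # True: the caller's data is untouched

mutated = flag_passed_in_place(scores)
print('passed' in scores.columns)   # True: the caller's data now has an extra column
```

With the in-place helper, any other code holding a reference to scores sees the new column too, which is the kind of side effect that is hard to trace in larger programs.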
