Pandas - assign() to Add Computed Columns

Key Insights

• The assign() method enables functional-style column creation by returning a new DataFrame rather than modifying in place, making it ideal for method chaining and immutable data pipelines. • You can reference newly created columns within the same assign() call by using callable functions instead of direct values, allowing sequential column dependencies. • Lambda functions and named functions work interchangeably with assign(), but named functions improve readability for complex transformations and enable better testing.

Understanding assign() Fundamentals

The assign() method adds new columns to a DataFrame while preserving the original. Unlike direct assignment with bracket notation, assign() returns a new DataFrame object, leaving the source unchanged.

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 120],
    'quantity': [5, 3, 7, 4]
})

# Traditional assignment (modifies in place)
df['revenue'] = df['price'] * df['quantity']

# Using assign() (returns new DataFrame)
df_new = df.assign(revenue=df['price'] * df['quantity'])

The key difference: assign() supports method chaining, making it valuable in data transformation pipelines where you want to preserve immutability or create readable sequential operations.

Method Chaining with assign()

Method chaining transforms verbose multi-step operations into clean, readable pipelines. Each assign() call returns a DataFrame, allowing you to chain multiple operations.

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 120],
    'quantity': [5, 3, 7, 4],
    'cost': [60, 90, 130, 75]
})

result = (df
    .assign(revenue=lambda x: x['price'] * x['quantity'])
    .assign(profit=lambda x: x['revenue'] - (x['cost'] * x['quantity']))
    .assign(margin=lambda x: (x['profit'] / x['revenue']) * 100)
    .query('margin > 30')
    .sort_values('profit', ascending=False)
)

print(result)

Output:

  product  price  quantity  cost  revenue  profit     margin
0       A    100         5    60      500     200  40.000000
2       C    200         7   130     1400     490  35.000000

Multiple Columns in Single assign()

You can add multiple columns in one assign() call using keyword arguments. This approach is cleaner than multiple separate operations.

df = pd.DataFrame({
    'temperature_f': [32, 68, 86, 104],
    'humidity': [45, 60, 75, 80]
})

df_converted = df.assign(
    temperature_c=lambda x: (x['temperature_f'] - 32) * 5/9,
    temperature_k=lambda x: ((x['temperature_f'] - 32) * 5/9) + 273.15,
    comfort_index=lambda x: x['temperature_f'] - (0.55 * (1 - x['humidity']/100) * (x['temperature_f'] - 58))
)

print(df_converted.round(2))

Output:

   temperature_f  humidity  temperature_c  temperature_k  comfort_index
0             32        45           0.00         273.15          17.14
1             68        60          20.00         293.15          65.80
2             86        75          30.00         303.15          83.15
3            104        80          40.00         313.15         101.92

Referencing Newly Created Columns

A powerful feature of assign() is the ability to reference columns created earlier in the same call. Use callable functions (lambdas or named functions) to access the DataFrame state at evaluation time.

df = pd.DataFrame({
    'sales_q1': [10000, 15000, 12000],
    'sales_q2': [11000, 14000, 13000],
    'sales_q3': [12000, 16000, 14000],
    'sales_q4': [13000, 17000, 15000]
})

df_analysis = df.assign(
    total_sales=lambda x: x['sales_q1'] + x['sales_q2'] + x['sales_q3'] + x['sales_q4'],
    avg_quarterly=lambda x: x['total_sales'] / 4,
    q4_vs_avg=lambda x: ((x['sales_q4'] - x['avg_quarterly']) / x['avg_quarterly']) * 100
)

print(df_analysis)

Output:

   sales_q1  sales_q2  sales_q3  sales_q4  total_sales  avg_quarterly  q4_vs_avg
0     10000     11000     12000     13000        46000        11500.0  13.043478
1     15000     14000     16000     17000        62000        15500.0   9.677419
2     12000     13000     14000     15000        54000        13500.0  11.111111

Using Named Functions for Complex Logic

For complex transformations, named functions improve readability and testability compared to lambda expressions.

def calculate_discount_tier(row):
    """Assign discount tier based on purchase amount and customer type"""
    if row['customer_type'] == 'premium':
        if row['purchase_amount'] >= 1000:
            return 'platinum'
        elif row['purchase_amount'] >= 500:
            return 'gold'
        else:
            return 'silver'
    else:
        if row['purchase_amount'] >= 1000:
            return 'gold'
        else:
            return 'bronze'

def apply_discount(row):
    """Calculate discount percentage based on tier"""
    discount_map = {
        'platinum': 0.20,
        'gold': 0.15,
        'silver': 0.10,
        'bronze': 0.05
    }
    return discount_map.get(row['discount_tier'], 0)

df = pd.DataFrame({
    'customer_type': ['premium', 'standard', 'premium', 'standard'],
    'purchase_amount': [1200, 800, 450, 1500]
})

df_discounted = (df
    .assign(discount_tier=lambda x: x.apply(calculate_discount_tier, axis=1))
    .assign(discount_rate=lambda x: x.apply(apply_discount, axis=1))
    .assign(final_amount=lambda x: x['purchase_amount'] * (1 - x['discount_rate']))
)

print(df_discounted)

Output:

  customer_type  purchase_amount discount_tier  discount_rate  final_amount
0       premium             1200      platinum           0.20        960.0
1      standard              800        bronze           0.05        760.0
2       premium              450        silver           0.10        405.0
3      standard             1500          gold           0.15       1275.0

Conditional Column Assignment

Combine assign() with np.where() or np.select() for conditional logic without explicit loops.

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'years_service': [2, 5, 8, 12],
    'performance_score': [85, 92, 78, 95]
})

conditions = [
    (df['years_service'] >= 10) & (df['performance_score'] >= 90),
    (df['years_service'] >= 5) & (df['performance_score'] >= 85),
    (df['years_service'] >= 3) & (df['performance_score'] >= 80)
]

choices = ['Senior', 'Mid-level', 'Junior']

df_categorized = df.assign(
    bonus_eligible=lambda x: np.where(x['performance_score'] >= 85, 'Yes', 'No'),
    level=lambda x: np.select(conditions, choices, default='Entry')
)

print(df_categorized)

Output:

  employee  years_service  performance_score bonus_eligible      level
0    Alice              2                 85            Yes      Entry
1      Bob              5                 92            Yes  Mid-level
2  Charlie              8                 78             No      Entry
3    David             12                 95            Yes     Senior

Performance Considerations

While assign() creates new DataFrames, pandas optimizes memory usage through copy-on-write mechanics in recent versions. However, for very large datasets with many sequential operations, consider the performance implications.

import time

# Large dataset
df_large = pd.DataFrame({
    'col1': np.random.rand(1000000),
    'col2': np.random.rand(1000000)
})

# Method 1: Multiple assign() calls
start = time.time()
df1 = (df_large
    .assign(col3=lambda x: x['col1'] * 2)
    .assign(col4=lambda x: x['col2'] * 3)
    .assign(col5=lambda x: x['col3'] + x['col4'])
)
time1 = time.time() - start

# Method 2: Single assign() call
start = time.time()
df2 = df_large.assign(
    col3=lambda x: x['col1'] * 2,
    col4=lambda x: x['col2'] * 3,
    col5=lambda x: x['col1'] * 2 + x['col2'] * 3
)
time2 = time.time() - start

print(f"Multiple assign() calls: {time1:.4f}s")
print(f"Single assign() call: {time2:.4f}s")

Single assign() calls generally perform better by reducing intermediate DataFrame creation, though the difference narrows with copy-on-write enabled in pandas 2.0+.

Integration with Data Pipelines

The assign() method excels in ETL pipelines where immutability and traceability matter. Combine it with other DataFrame methods for complete transformations.

def process_sales_data(df):
    """Complete sales data processing pipeline"""
    return (df
        .assign(
            revenue=lambda x: x['price'] * x['quantity'],
            revenue_category=lambda x: pd.cut(
                x['price'] * x['quantity'],
                bins=[0, 500, 1000, float('inf')],
                labels=['Low', 'Medium', 'High']
            )
        )
        .assign(
            cumulative_revenue=lambda x: x.groupby('product')['revenue'].cumsum()
        )
        .sort_values(['product', 'cumulative_revenue'])
        .reset_index(drop=True)
    )

df_sales = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'C'],
    'price': [100, 120, 150, 140, 200],
    'quantity': [5, 4, 3, 6, 7]
})

result = process_sales_data(df_sales)
print(result)

This pattern creates testable, reusable transformation functions that clearly express data processing intent without side effects.

Understanding assign() Fundamentals

Method Chaining with assign()

Multiple Columns in Single assign()

Referencing Newly Created Columns

Using Named Functions for Complex Logic

Conditional Column Assignment

Performance Considerations

Integration with Data Pipelines

Liked this? There's more.