Pandas - Add Column Based on Another Column

Key Insights

  • Creating new columns from existing ones is fundamental to data transformation in Pandas, with methods ranging from simple arithmetic operations to complex conditional logic using apply(), map(), and vectorized operations.
  • Vectorized operations outperform iterative approaches by 10-100x for most column transformations, making them the preferred choice for production data pipelines.
  • The np.select() function provides a readable, vectorized solution for multi-condition column creation, avoiding both deeply nested np.where() calls and slow row-wise lambda functions.

Basic Column Creation with Arithmetic Operations

The simplest way to add a column based on another is through direct arithmetic operations. Pandas broadcasts these operations across the entire column efficiently.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 250],
    'quantity': [5, 3, 7, 2]
})

# Simple arithmetic
df['total'] = df['price'] * df['quantity']

# Multiple column operations
df['discounted_total'] = (df['price'] * 0.9) * df['quantity']

# Combining operations
df['profit_margin'] = (df['price'] - 50) / df['price'] * 100

print(df)

Output:

  product  price  quantity  total  discounted_total  profit_margin
0       A    100         5    500             450.0      50.000000
1       B    150         3    450             405.0      66.666667
2       C    200         7   1400            1260.0      75.000000
3       D    250         2    500             450.0      80.000000
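The same derived columns can also be built in one chained step with DataFrame.assign(), which returns a new DataFrame and keeps method chains readable. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 250],
    'quantity': [5, 3, 7, 2]
})

# assign() returns a new DataFrame, so several derived columns
# can be added in a single chainable call; a lambda can refer to
# columns created earlier in the same assign() call
result = df.assign(
    total=df['price'] * df['quantity'],
    discounted_total=lambda d: d['total'] * 0.9,
)
print(result[['product', 'total', 'discounted_total']])
```

Because assign() leaves the original DataFrame untouched, it fits naturally into pipelines that chain several transformations.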

Conditional Column Creation with np.where()

For binary conditions, np.where() provides a clean, vectorized approach that’s significantly faster than iterating through rows.

df = pd.DataFrame({
    'employee': ['John', 'Sarah', 'Mike', 'Emma'],
    'sales': [45000, 62000, 38000, 71000],
    'years_experience': [2, 5, 1, 7]
})

# Single condition
df['performance'] = np.where(df['sales'] > 50000, 'High', 'Low')

# Multiple conditions with nested np.where()
df['bonus_tier'] = np.where(
    df['sales'] > 60000, 
    'Tier 1',
    np.where(df['sales'] > 45000, 'Tier 2', 'Tier 3')
)

# Combining multiple columns
df['eligible_promotion'] = np.where(
    (df['sales'] > 50000) & (df['years_experience'] >= 3),
    'Yes',
    'No'
)

print(df)

Output:

  employee  sales  years_experience performance bonus_tier eligible_promotion
0     John  45000                 2         Low     Tier 3                 No
1    Sarah  62000                 5        High     Tier 1                Yes
2     Mike  38000                 1         Low     Tier 3                 No
3     Emma  71000                 7        High     Tier 1                Yes
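An equivalent pattern for binary labels, sketched here on the same data, is to initialize the column with the default value and then overwrite the matching rows with .loc and a boolean mask:

```python
import pandas as pd

df = pd.DataFrame({
    'employee': ['John', 'Sarah', 'Mike', 'Emma'],
    'sales': [45000, 62000, 38000, 71000],
})

# Start from the default label, then overwrite only the rows
# selected by the boolean mask
df['performance'] = 'Low'
df.loc[df['sales'] > 50000, 'performance'] = 'High'
print(df)
```

This reads well when the "else" case is a plain default, and it extends to several sequential masks without nesting.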

Multi-Condition Logic with np.select()

When dealing with multiple conditions, np.select() offers superior readability and performance compared to nested np.where() calls.

df = pd.DataFrame({
    'customer_id': range(1, 6),
    'purchase_amount': [150, 450, 850, 1200, 2500],
    'customer_type': ['new', 'returning', 'new', 'vip', 'vip']
})

# Define conditions and corresponding values
conditions = [
    (df['purchase_amount'] < 200),
    (df['purchase_amount'] >= 200) & (df['purchase_amount'] < 500),
    (df['purchase_amount'] >= 500) & (df['purchase_amount'] < 1000),
    (df['purchase_amount'] >= 1000) & (df['customer_type'] == 'vip'),
    (df['purchase_amount'] >= 1000)
]

choices = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Gold']

df['loyalty_tier'] = np.select(conditions, choices, default='None')

# Calculate discount based on tier
discount_conditions = [
    df['loyalty_tier'] == 'Bronze',
    df['loyalty_tier'] == 'Silver',
    df['loyalty_tier'] == 'Gold',
    df['loyalty_tier'] == 'Platinum'
]

discount_values = [0.05, 0.10, 0.15, 0.25]

df['discount_rate'] = np.select(discount_conditions, discount_values, default=0)
df['final_amount'] = df['purchase_amount'] * (1 - df['discount_rate'])

print(df)

Output:

   customer_id  purchase_amount customer_type loyalty_tier  discount_rate  final_amount
0            1              150           new       Bronze           0.05        142.50
1            2              450     returning       Silver           0.10        405.00
2            3              850           new         Gold           0.15        722.50
3            4             1200           vip     Platinum           0.25        900.00
4            5             2500           vip     Platinum           0.25       1875.00
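When tiers are defined purely by numeric thresholds (ignoring the customer_type override above), pd.cut can express the binning more compactly. A sketch using the same boundaries:

```python
import pandas as pd

df = pd.DataFrame({'purchase_amount': [150, 450, 850, 1200, 2500]})

# pd.cut bins a numeric column into labeled intervals;
# right=False makes each bin include its left edge, so
# [0, 200) -> Bronze, [200, 500) -> Silver, and so on
df['loyalty_tier'] = pd.cut(
    df['purchase_amount'],
    bins=[0, 200, 500, 1000, float('inf')],
    labels=['Bronze', 'Silver', 'Gold', 'Platinum'],
    right=False,
)
print(df)
```

Prefer np.select when conditions mix several columns, and pd.cut when the logic is a single numeric column cut at fixed thresholds.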

Using apply() for Complex Transformations

When vectorized operations aren’t sufficient, apply() allows custom functions. While slower than vectorized methods, it handles complex logic elegantly.

df = pd.DataFrame({
    'order_id': ['A001', 'A002', 'A003', 'A004'],
    'items': [['laptop', 'mouse'], ['keyboard'], ['monitor', 'cable', 'stand'], ['laptop']],
    'prices': [[1200, 25], [75], [300, 15, 50], [1100]]
})

# Apply a function to a single column
def calculate_order_total(prices_list):
    return sum(prices_list)

df['order_total'] = df['prices'].apply(calculate_order_total)

# More complex logic with multiple conditions
def categorize_order(row):
    total = sum(row['prices'])
    item_count = len(row['items'])
    
    if total > 1000 and item_count == 1:
        return 'High-Value Single Item'
    elif total > 1000:
        return 'High-Value Bundle'
    elif item_count > 2:
        return 'Multi-Item'
    else:
        return 'Standard'

df['order_category'] = df.apply(categorize_order, axis=1)

print(df)

Output:

  order_id                    items         prices  order_total          order_category
0     A001          [laptop, mouse]     [1200, 25]         1225       High-Value Bundle
1     A002               [keyboard]           [75]           75                Standard
2     A003  [monitor, cable, stand]  [300, 15, 50]          365              Multi-Item
3     A004                 [laptop]         [1100]         1100  High-Value Single Item
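For simple per-row aggregations like the order total, a plain list comprehension over the column is often noticeably faster than df.apply(..., axis=1), since it skips the per-row Series construction that makes axis=1 apply slow. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'prices': [[1200, 25], [75], [300, 15, 50], [1100]],
})

# Iterate over the raw column values directly instead of
# building a Series object for every row
df['order_total'] = [sum(p) for p in df['prices']]
print(df)
```

Reserve df.apply(..., axis=1) for logic that genuinely needs several columns at once, as in categorize_order above.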

Mapping Values with map() and replace()

For straightforward value substitutions based on another column, map() and replace() provide clean, efficient solutions.

df = pd.DataFrame({
    'product_code': ['ELC', 'CLT', 'FRN', 'ELC', 'TOY'],
    'units_sold': [150, 200, 80, 300, 120]
})

# Using map() with dictionary
category_mapping = {
    'ELC': 'Electronics',
    'CLT': 'Clothing',
    'FRN': 'Furniture',
    'TOY': 'Toys'
}

df['category'] = df['product_code'].map(category_mapping)

# Using map() with Series for more complex lookups
commission_rates = pd.Series({
    'Electronics': 0.08,
    'Clothing': 0.12,
    'Furniture': 0.10,
    'Toys': 0.15
})

df['commission_rate'] = df['category'].map(commission_rates)
df['commission'] = df['units_sold'] * 50 * df['commission_rate']  # Assuming $50 per unit

print(df)

Output:

  product_code  units_sold      category  commission_rate  commission
0          ELC         150   Electronics             0.08       600.0
1          CLT         200      Clothing             0.12      1200.0
2          FRN          80     Furniture             0.10       400.0
3          ELC         300   Electronics             0.08      1200.0
4          TOY         120          Toys             0.15       900.0
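One caveat worth noting: map() returns NaN for any key missing from the mapping, so unmapped codes silently become missing values. A small sketch of guarding against that with fillna() (the 'XXX' code here is a made-up unmapped value):

```python
import pandas as pd

df = pd.DataFrame({'product_code': ['ELC', 'CLT', 'XXX']})

category_mapping = {'ELC': 'Electronics', 'CLT': 'Clothing'}

# 'XXX' has no entry, so map() produces NaN there;
# fillna() supplies an explicit fallback label
df['category'] = df['product_code'].map(category_mapping).fillna('Unknown')
print(df)
```

By contrast, replace() leaves unmatched values unchanged rather than setting them to NaN, which is sometimes the behavior you actually want.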

String Operations for Column Creation

String methods in Pandas enable powerful text-based column transformations without explicit loops.

df = pd.DataFrame({
    'full_name': ['John Smith', 'Sarah Johnson', 'Mike Davis', 'Emma Wilson'],
    'email': ['john.smith@company.com', 'sarah.j@company.com', 
              'mike.davis@company.com', 'emma.w@company.com']
})

# Extract parts of strings
df['first_name'] = df['full_name'].str.split().str[0]
df['last_name'] = df['full_name'].str.split().str[1]

# String manipulation
df['username'] = df['email'].str.split('@').str[0]
df['domain'] = df['email'].str.split('@').str[1]

# Conditional string operations
df['name_length_category'] = df['full_name'].str.len().apply(
    lambda x: 'Short' if x < 12 else 'Long'
)

print(df)

Output:

       full_name                   email first_name last_name    username       domain name_length_category
0     John Smith  john.smith@company.com       John     Smith  john.smith  company.com                Short
1  Sarah Johnson     sarah.j@company.com      Sarah   Johnson     sarah.j  company.com                 Long
2     Mike Davis  mike.davis@company.com       Mike     Davis  mike.davis  company.com                Short
3    Emma Wilson      emma.w@company.com       Emma    Wilson      emma.w  company.com                Short
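The two split() calls on the email column can also be done in one pass with str.extract(), which pulls named regex groups straight into columns. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['john.smith@company.com', 'sarah.j@company.com'],
})

# Named capture groups become the column names of the result,
# so both parts are extracted in a single vectorized pass
parts = df['email'].str.extract(r'(?P<username>[^@]+)@(?P<domain>.+)')
df[['username', 'domain']] = parts
print(df)
```

str.extract() is also the tool of choice when the pattern is more irregular than a single fixed delimiter.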

Performance Considerations

Understanding performance differences between approaches is critical for large datasets.

import time

# Create large dataset
df_large = pd.DataFrame({
    'value': np.random.randint(0, 100, 1000000)
})

# Vectorized approach
# perf_counter() is preferable to time() for timing short intervals
start = time.perf_counter()
df_large['category_vec'] = np.where(df_large['value'] > 50, 'High', 'Low')
vectorized_time = time.perf_counter() - start

# Apply approach
start = time.perf_counter()
df_large['category_apply'] = df_large['value'].apply(lambda x: 'High' if x > 50 else 'Low')
apply_time = time.perf_counter() - start

print(f"Vectorized: {vectorized_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Speedup: {apply_time/vectorized_time:.2f}x")

For production pipelines processing millions of rows, always prefer vectorized operations. Reserve apply() for genuinely complex logic that cannot be vectorized, and consider using numba or cython for further optimization when necessary.
