Pandas - Add Column Based on Another Column
Key Insights
- Creating new columns from existing ones is fundamental to data transformation in Pandas, with methods ranging from simple arithmetic operations to complex conditional logic using apply(), map(), and vectorized operations.
- Vectorized operations outperform iterative approaches by 10-100x for most column transformations, making them the preferred choice for production data pipelines.
- The np.select() function provides the most readable and performant solution for multi-condition column creation, surpassing nested np.where() calls and lambda functions.
Basic Column Creation with Arithmetic Operations
The simplest way to add a column based on another is through direct arithmetic operations. Pandas broadcasts these operations across the entire column efficiently.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 250],
    'quantity': [5, 3, 7, 2]
})
# Simple arithmetic
df['total'] = df['price'] * df['quantity']
# Multiple column operations
df['discounted_total'] = (df['price'] * 0.9) * df['quantity']
# Combining operations
df['profit_margin'] = (df['price'] - 50) / df['price'] * 100
print(df)
Output:
  product  price  quantity  total  discounted_total  profit_margin
0       A    100         5    500             450.0      50.000000
1       B    150         3    450             405.0      66.666667
2       C    200         7   1400            1260.0      75.000000
3       D    250         2    500             450.0      80.000000
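The same derived columns can also be built with assign(), which returns a new DataFrame instead of mutating in place; this is a sketch of that chainable style, reusing the columns above.

```python
import pandas as pd

df = pd.DataFrame({
    'price': [100, 150, 200, 250],
    'quantity': [5, 3, 7, 2],
})

# assign() returns a new DataFrame, leaving df untouched. Keyword
# arguments are evaluated in order, so a later column can reference
# one created earlier in the same call via the lambda's argument.
result = df.assign(
    total=lambda d: d['price'] * d['quantity'],
    discounted_total=lambda d: d['total'] * 0.9,
)
print(result)
```

This style fits method chains (e.g. after a query() or merge()) where you do not want intermediate mutation of the source frame.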
Conditional Column Creation with np.where()
For binary conditions, np.where() provides a clean, vectorized approach that’s significantly faster than iterating through rows.
df = pd.DataFrame({
    'employee': ['John', 'Sarah', 'Mike', 'Emma'],
    'sales': [45000, 62000, 38000, 71000],
    'years_experience': [2, 5, 1, 7]
})
# Single condition
df['performance'] = np.where(df['sales'] > 50000, 'High', 'Low')
# Multiple conditions with nested np.where()
df['bonus_tier'] = np.where(
    df['sales'] > 60000,
    'Tier 1',
    np.where(df['sales'] > 45000, 'Tier 2', 'Tier 3')
)
# Combining multiple columns
df['eligible_promotion'] = np.where(
    (df['sales'] > 50000) & (df['years_experience'] >= 3),
    'Yes',
    'No'
)
print(df)
Output:
employee sales years_experience performance bonus_tier eligible_promotion
0 John 45000 2 Low Tier 3 No
1 Sarah 62000 5 High Tier 1 Yes
2 Mike 38000 1 Low Tier 3 No
3 Emma 71000 7 High Tier 1 Yes
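An equally vectorized alternative for binary conditions is boolean indexing with .loc: set a default, then overwrite only the matching rows. A minimal sketch using the same sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'employee': ['John', 'Sarah', 'Mike', 'Emma'],
    'sales': [45000, 62000, 38000, 71000],
})

# Fill the new column with the default value, then overwrite
# only the rows where the boolean mask is True.
df['performance'] = 'Low'
df.loc[df['sales'] > 50000, 'performance'] = 'High'
print(df)
```

This pattern is handy when you later need to update the same column under additional conditions without rebuilding it from scratch.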
Multi-Condition Logic with np.select()
When dealing with multiple conditions, np.select() offers superior readability and performance compared to nested np.where() calls.
df = pd.DataFrame({
    'customer_id': range(1, 6),
    'purchase_amount': [150, 450, 850, 1200, 2500],
    'customer_type': ['new', 'returning', 'new', 'vip', 'vip']
})
# Define conditions and corresponding values
conditions = [
    (df['purchase_amount'] < 200),
    (df['purchase_amount'] >= 200) & (df['purchase_amount'] < 500),
    (df['purchase_amount'] >= 500) & (df['purchase_amount'] < 1000),
    (df['purchase_amount'] >= 1000) & (df['customer_type'] == 'vip'),
    (df['purchase_amount'] >= 1000)
]
choices = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Gold']
df['loyalty_tier'] = np.select(conditions, choices, default='None')
# Calculate discount based on tier
discount_conditions = [
    df['loyalty_tier'] == 'Bronze',
    df['loyalty_tier'] == 'Silver',
    df['loyalty_tier'] == 'Gold',
    df['loyalty_tier'] == 'Platinum'
]
discount_values = [0.05, 0.10, 0.15, 0.25]
df['discount_rate'] = np.select(discount_conditions, discount_values, default=0)
df['final_amount'] = df['purchase_amount'] * (1 - df['discount_rate'])
print(df)
Output:
customer_id purchase_amount customer_type loyalty_tier discount_rate final_amount
0 1 150 new Bronze 0.05 142.50
1 2 450 returning Silver 0.10 405.00
2 3 850 new Gold 0.15 722.50
3 4 1200 vip Platinum 0.25 900.00
4 5 2500 vip Platinum 0.25 1875.00
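When the tiers depend only on numeric ranges (i.e. without a cross-column rule like the vip condition above), pd.cut can express the same binning more compactly. A sketch under that simplifying assumption:

```python
import pandas as pd

df = pd.DataFrame({'purchase_amount': [150, 450, 850, 1200, 2500]})

# right=False makes each bin closed on the left, matching the >=
# comparisons in the np.select() conditions: [0, 200), [200, 500),
# [500, 1000), [1000, inf).
df['loyalty_tier'] = pd.cut(
    df['purchase_amount'],
    bins=[0, 200, 500, 1000, float('inf')],
    labels=['Bronze', 'Silver', 'Gold', 'Platinum'],
    right=False,
)
print(df)
```

pd.cut returns a Categorical column with an ordered tier hierarchy, which also sorts correctly out of the box; np.select remains the better fit once conditions mix multiple columns.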
Using apply() for Complex Transformations
When vectorized operations aren’t sufficient, apply() allows custom functions. While slower than vectorized methods, it handles complex logic elegantly.
df = pd.DataFrame({
    'order_id': ['A001', 'A002', 'A003', 'A004'],
    'items': [['laptop', 'mouse'], ['keyboard'], ['monitor', 'cable', 'stand'], ['laptop']],
    'prices': [[1200, 25], [75], [300, 15, 50], [1100]]
})
# Apply function to single column
# Apply function to single column
def calculate_order_total(items_list, prices_list):
    # items_list is passed through to illustrate multi-column apply;
    # only the prices are summed here
    return sum(prices_list)

df['order_total'] = df.apply(
    lambda row: calculate_order_total(row['items'], row['prices']),
    axis=1
)
# More complex logic with multiple conditions
def categorize_order(row):
    total = sum(row['prices'])
    item_count = len(row['items'])
    if total > 1000 and item_count == 1:
        return 'High-Value Single Item'
    elif total > 1000:
        return 'High-Value Bundle'
    elif item_count > 2:
        return 'Multi-Item'
    else:
        return 'Standard'

df['order_category'] = df.apply(categorize_order, axis=1)
print(df)
Output:
order_id items prices order_total order_category
0 A001 [laptop, mouse] [1200, 25] 1225 High-Value Bundle
1 A002 [keyboard] [75] 75 Standard
2 A003 [monitor, cable, stand] [300, 15, 50] 365 Multi-Item
3 A004 [laptop] [1100] 1100 High-Value Single Item
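Row-wise apply() builds a Series object for every row, which carries real overhead. When the logic only touches one or two columns, a plain list comprehension over the underlying values is often noticeably faster while staying readable; a sketch on a trimmed version of the orders data:

```python
import pandas as pd

df = pd.DataFrame({
    'items': [['laptop', 'mouse'], ['keyboard']],
    'prices': [[1200, 25], [75]],
})

# Iterate the raw column values directly instead of constructing
# a row Series per record as df.apply(axis=1) does.
df['order_total'] = [sum(p) for p in df['prices']]
df['item_count'] = [len(i) for i in df['items']]
print(df)
```

For logic spanning several columns, zip the columns (`for i, p in zip(df['items'], df['prices'])`) and keep the same pattern.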
Mapping Values with map() and replace()
For straightforward value substitutions based on another column, map() and replace() provide clean, efficient solutions.
df = pd.DataFrame({
    'product_code': ['ELC', 'CLT', 'FRN', 'ELC', 'TOY'],
    'units_sold': [150, 200, 80, 300, 120]
})
# Using map() with dictionary
category_mapping = {
    'ELC': 'Electronics',
    'CLT': 'Clothing',
    'FRN': 'Furniture',
    'TOY': 'Toys'
}
df['category'] = df['product_code'].map(category_mapping)
# Using map() with Series for more complex lookups
commission_rates = pd.Series({
    'Electronics': 0.08,
    'Clothing': 0.12,
    'Furniture': 0.10,
    'Toys': 0.15
})
df['commission_rate'] = df['category'].map(commission_rates)
df['commission'] = df['units_sold'] * 50 * df['commission_rate'] # Assuming $50 per unit
print(df)
Output:
product_code units_sold category commission_rate commission
0 ELC 150 Electronics 0.08 600.0
1 CLT 200 Clothing 0.12 1200.0
2 FRN 80 Furniture 0.10 400.0
3 ELC 300 Electronics 0.08 1200.0
4 TOY 120 Toys 0.15 900.0
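One behavioral difference worth remembering: map() returns NaN for any key missing from the mapping, whereas replace() leaves unmatched values unchanged. A short sketch with a hypothetical unmapped code ('BKS') and a fillna() fallback:

```python
import pandas as pd

# 'BKS' is deliberately absent from the mapping below
df = pd.DataFrame({'product_code': ['ELC', 'CLT', 'BKS']})

category_mapping = {'ELC': 'Electronics', 'CLT': 'Clothing'}

# map() yields NaN for 'BKS'; fillna() supplies an explicit default.
# replace() would instead pass 'BKS' through untouched.
df['category'] = df['product_code'].map(category_mapping).fillna('Other')
print(df)
```

Choosing between the two therefore depends on whether unmatched values should be flagged (map + NaN) or preserved (replace).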
String Operations for Column Creation
String methods in Pandas enable powerful text-based column transformations without explicit loops.
df = pd.DataFrame({
    'full_name': ['John Smith', 'Sarah Johnson', 'Mike Davis', 'Emma Wilson'],
    'email': ['john.smith@company.com', 'sarah.j@company.com',
              'mike.davis@company.com', 'emma.w@company.com']
})
# Extract parts of strings
df['first_name'] = df['full_name'].str.split().str[0]
df['last_name'] = df['full_name'].str.split().str[1]
# String manipulation
df['username'] = df['email'].str.split('@').str[0]
df['domain'] = df['email'].str.split('@').str[1]
# Conditional string operations
df['name_length_category'] = df['full_name'].str.len().apply(
    lambda x: 'Short' if x < 12 else 'Long'
)
print(df)
Output:
full_name email first_name last_name username domain name_length_category
0 John Smith john.smith@company.com John Smith john.smith company.com Short
1 Sarah Johnson sarah.j@company.com Sarah Johnson sarah.j company.com Long
2 Mike Davis mike.davis@company.com Mike Davis mike.davis company.com Short
3 Emma Wilson emma.w@company.com Emma Wilson emma.w company.com Short
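The chained split()/str[i] calls above make two passes over the email column. str.extract() with named regex groups pulls both parts in a single pass; a sketch assuming the same user@domain layout:

```python
import pandas as pd

df = pd.DataFrame({'email': ['john.smith@company.com', 'sarah.j@company.com']})

# Named groups become the column names of the extracted frame,
# so one regex pass replaces two separate split() chains.
parts = df['email'].str.extract(r'(?P<username>[^@]+)@(?P<domain>.+)')
df = df.join(parts)
print(df)
```

Rows that fail to match the pattern come back as NaN, which doubles as a cheap validity check on the input strings.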
Performance Considerations
Understanding performance differences between approaches is critical for large datasets.
import time
# Create large dataset
df_large = pd.DataFrame({
    'value': np.random.randint(0, 100, 1000000)
})
# Vectorized approach
start = time.time()
df_large['category_vec'] = np.where(df_large['value'] > 50, 'High', 'Low')
vectorized_time = time.time() - start
# Apply approach
start = time.time()
df_large['category_apply'] = df_large['value'].apply(lambda x: 'High' if x > 50 else 'Low')
apply_time = time.time() - start
print(f"Vectorized: {vectorized_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Speedup: {apply_time/vectorized_time:.2f}x")
For production pipelines processing millions of rows, always prefer vectorized operations. Reserve apply() for genuinely complex logic that cannot be vectorized, and consider Numba or Cython for further optimization when necessary.