Pandas - Add New Column to DataFrame
Key Insights
- Adding columns to DataFrames can be done through direct assignment, assign(), insert(), or apply(), each suited to different scenarios and performance requirements
- Conditional column creation using np.where(), np.select(), or apply() with lambda functions enables complex business logic with varying performance characteristics
- Understanding vectorized operations versus row-wise operations is critical: vectorized approaches can be 100x faster for large datasets
Direct Column Assignment
The simplest method to add a column is direct assignment using bracket notation. This approach works for scalar values, lists, arrays, or Series objects.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [100, 150, 200, 120],
'quantity': [5, 3, 7, 2]
})
# Add scalar value
df['discount_rate'] = 0.1
# Add calculated column
df['total'] = df['price'] * df['quantity']
# Add from list
df['category'] = ['Electronics', 'Clothing', 'Electronics', 'Clothing']
print(df)
Output:
product price quantity discount_rate total category
0 A 100 5 0.1 500 Electronics
1 B 150 3 0.1 450 Clothing
2 C 200 7 0.1 1400 Electronics
3 D 120 2 0.1 240 Clothing
Direct assignment modifies the DataFrame in place. The column is appended to the rightmost position by default.
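One subtlety worth noting: when the right-hand side is a Series, assignment aligns on index labels rather than position, which can silently introduce NaN. A minimal sketch (the column and index values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 150, 200]}, index=[0, 1, 2])

# A Series with a mismatched index aligns by label, not by position
s = pd.Series([0.1, 0.2, 0.3], index=[1, 2, 3])
df['rate'] = s  # index 0 has no match (NaN); index 3 is dropped

# To assign positionally instead, strip the index first
df['rate_positional'] = s.to_numpy()
print(df)
```

Assigning `s.to_numpy()` (or `s.values`) bypasses alignment entirely, so the values land row by row.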
Using assign() for Method Chaining
The assign() method returns a new DataFrame with added columns, enabling clean method chaining without modifying the original DataFrame.
df_original = pd.DataFrame({
'revenue': [10000, 15000, 12000],
'costs': [7000, 9000, 8000]
})
df_new = (df_original
.assign(profit=lambda x: x['revenue'] - x['costs'])
.assign(profit_margin=lambda x: (x['profit'] / x['revenue']) * 100)
.assign(performance=lambda x: np.where(x['profit_margin'] > 25, 'Good', 'Poor'))
)
print(df_new)
Output:
revenue costs profit profit_margin performance
0 10000 7000 3000 30.000000 Good
1 15000 9000 6000 40.000000 Good
2 12000 8000 4000 33.333333 Good
The lambda functions in assign() receive the DataFrame being constructed, allowing you to reference previously created columns within the same chain.
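Because keyword arguments are evaluated in order, the same chain can be collapsed into a single assign() call whose later callables reference earlier keys. A sketch using the same columns:

```python
import pandas as pd
import numpy as np

df_original = pd.DataFrame({
    'revenue': [10000, 15000, 12000],
    'costs': [7000, 9000, 8000]
})

# Later keyword arguments may reference columns created by earlier ones
df_new = df_original.assign(
    profit=lambda x: x['revenue'] - x['costs'],
    profit_margin=lambda x: x['profit'] / x['revenue'] * 100,
    performance=lambda x: np.where(x['profit_margin'] > 25, 'Good', 'Poor'),
)
print(df_new)
```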
Inserting Columns at Specific Positions
Use insert() when column position matters for readability or downstream processing requirements.
df = pd.DataFrame({
'user_id': [101, 102, 103],
'purchase_amount': [250, 400, 175]
})
# Insert at position 1 (between user_id and purchase_amount)
df.insert(1, 'user_name', ['Alice', 'Bob', 'Charlie'])
# Insert at the beginning
df.insert(0, 'record_id', range(1, 4))
print(df)
Output:
record_id user_id user_name purchase_amount
0 1 101 Alice 250
1 2 102 Bob 400
2 3 103 Charlie 175
The insert() method modifies the DataFrame in place and raises an error if the column already exists.
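To see the duplicate check in action, along with the allow_duplicates escape hatch, a short sketch (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [101, 102, 103]})
df.insert(1, 'flag', [True, False, True])

# Inserting an existing column name raises ValueError by default
try:
    df.insert(0, 'flag', [False, False, False])
except ValueError as e:
    print(f"Rejected: {e}")

# allow_duplicates=True permits a second column with the same name
df.insert(0, 'flag', [False, False, False], allow_duplicates=True)
print(df.columns.tolist())
```

Duplicate column names are rarely a good idea, but the flag exists for cases like staging raw data before a rename.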
Conditional Column Creation with np.where()
For binary conditions, np.where() provides vectorized performance superior to iterative approaches.
df = pd.DataFrame({
'temperature': [72, 85, 90, 68, 95, 78],
'humidity': [45, 60, 70, 40, 75, 55]
})
# Simple condition
df['comfort_level'] = np.where(df['temperature'] < 75, 'Comfortable', 'Hot')
# Nested conditions
df['alert'] = np.where(
df['temperature'] > 90,
'High Temp Alert',
np.where(df['humidity'] > 65, 'High Humidity Alert', 'Normal')
)
print(df)
Output:
temperature humidity comfort_level alert
0 72 45 Comfortable Normal
1 85 60 Hot Normal
2 90 70 Hot High Humidity Alert
3 68 40 Comfortable Normal
4 95 75 Hot High Temp Alert
5 78 55 Hot Normal
Multi-Condition Logic with np.select()
When handling multiple conditions, np.select() offers better readability than nested np.where() calls. Conditions are evaluated in order and the first match wins, so list the most restrictive thresholds first.
df = pd.DataFrame({
'score': [45, 67, 89, 92, 55, 73, 38, 81],
})
conditions = [
df['score'] >= 90,
df['score'] >= 80,
df['score'] >= 70,
df['score'] >= 60,
df['score'] >= 50
]
choices = ['A', 'B', 'C', 'D', 'E']
df['grade'] = np.select(conditions, choices, default='F')
print(df)
Output:
score grade
0 45 F
1 67 D
2 89 B
3 92 A
4 55 E
5 73 C
6 38 F
7 81 B
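When the conditions are ordered numeric thresholds like these, pd.cut is an alternative worth knowing. A sketch of the same grading (the bin edges are an assumption chosen to mirror the thresholds above):

```python
import pandas as pd

df = pd.DataFrame({'score': [45, 67, 89, 92, 55, 73, 38, 81]})

# right=False makes each bin left-inclusive, matching the >= comparisons
df['grade'] = pd.cut(
    df['score'],
    bins=[0, 50, 60, 70, 80, 90, 101],
    labels=['F', 'E', 'D', 'C', 'B', 'A'],
    right=False,
)
print(df)
```

pd.cut also returns an ordered categorical, which sorts grades correctly without extra work.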
Using apply() for Complex Logic
When vectorized operations aren’t feasible, apply() enables row-wise or column-wise operations with custom functions.
df = pd.DataFrame({
'text': ['Python Programming', 'Data Science', 'ML', 'AI'],
'value': [100, 200, 150, 175]
})
# Row-wise operation
def categorize_row(row):
if len(row['text']) > 10 and row['value'] > 150:
return 'Premium Long'
elif len(row['text']) > 10:
return 'Standard Long'
elif row['value'] > 150:
return 'Premium Short'
else:
return 'Standard Short'
df['category'] = df.apply(categorize_row, axis=1)
print(df)
Output:
text value category
0 Python Programming 100 Standard Long
1 Data Science 200 Premium Long
2 ML 150 Standard Short
3 AI 175 Premium Short
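Even logic like this can often be vectorized. One way to express categorize_row with np.select, equivalent to the row-wise version above (a sketch, not the only formulation):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'text': ['Python Programming', 'Data Science', 'ML', 'AI'],
    'value': [100, 200, 150, 175]
})

is_long = df['text'].str.len() > 10
is_premium = df['value'] > 150

# Conditions are checked in order; the combined case must come first
df['category'] = np.select(
    [is_long & is_premium, is_long, is_premium],
    ['Premium Long', 'Standard Long', 'Premium Short'],
    default='Standard Short',
)
print(df)
```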
Performance Considerations
Vectorized operations dramatically outperform iterative approaches. Here’s a comparison:
import time
# Create large DataFrame
df_large = pd.DataFrame({
'value': np.random.randint(1, 100, 100000)
})
# Vectorized approach
start = time.time()
df_large['category_vec'] = np.where(df_large['value'] > 50, 'High', 'Low')
vec_time = time.time() - start
# Apply approach
start = time.time()
df_large['category_apply'] = df_large['value'].apply(lambda x: 'High' if x > 50 else 'Low')
apply_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Speedup: {apply_time/vec_time:.1f}x")
Typical output shows vectorized operations are 50-100x faster for simple conditions.
Adding Multiple Columns Simultaneously
Add multiple columns efficiently using dictionary assignment or multiple assign() arguments.
df = pd.DataFrame({
'sales': [1000, 1500, 1200],
'returns': [50, 75, 60]
})
# Dictionary approach
new_columns = {
'net_sales': df['sales'] - df['returns'],
'return_rate': (df['returns'] / df['sales']) * 100,
'status': 'Active'
}
df = df.assign(**new_columns)
print(df)
Output:
sales returns net_sales return_rate status
0 1000 50 950 5.000000 Active
1 1500 75 1425 5.000000 Active
2 1200 60 1140 5.000000 Active
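When many columns are added one at a time, pandas may warn about a fragmented frame; building the new columns as their own DataFrame and concatenating once avoids the repeated inserts. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'sales': [1000, 1500, 1200], 'returns': [50, 75, 60]})

new_cols = pd.DataFrame({
    'net_sales': df['sales'] - df['returns'],
    'return_rate': df['returns'] / df['sales'] * 100,
})

# One concat along axis=1 instead of repeated single-column assignments
df = pd.concat([df, new_cols], axis=1)
print(df)
```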
Working with DateTime Columns
Extract components from datetime columns to create derived features for time-series analysis.
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=5, freq='D'),
'value': [100, 150, 120, 180, 140]
})
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day_of_week'] = df['timestamp'].dt.day_name()
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5
print(df)
This approach efficiently creates multiple time-based features essential for temporal pattern analysis and forecasting models.
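One practical caveat: the .dt accessor only works on datetime64 columns, so string timestamps must go through pd.to_datetime first. A minimal sketch with illustrative dates:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['2024-01-06', '2024-01-07', '2024-01-08']})

# Convert strings to datetime64 before using the .dt accessor
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.day_name()
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5
print(df)
```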