Pandas - Add New Column to DataFrame
Key Insights
- Adding columns to DataFrames can be done through direct assignment, assign(), insert(), or apply(), each suited to different scenarios and performance requirements
- Conditional column creation using np.where(), np.select(), or apply() with lambda functions enables complex business logic with varying performance characteristics
- Understanding vectorized operations versus row-wise operations is critical: vectorized approaches can be 100x faster for large datasets
Direct Column Assignment
The simplest method to add a column is direct assignment using bracket notation. This approach works for scalar values, lists, arrays, or Series objects.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [100, 150, 200, 120],
'quantity': [5, 3, 7, 2]
})
# Add scalar value
df['discount_rate'] = 0.1
# Add calculated column
df['total'] = df['price'] * df['quantity']
# Add from list
df['category'] = ['Electronics', 'Clothing', 'Electronics', 'Clothing']
print(df)
Output:
product price quantity discount_rate total category
0 A 100 5 0.1 500 Electronics
1 B 150 3 0.1 450 Clothing
2 C 200 7 0.1 1400 Electronics
3 D 120 2 0.1 240 Clothing
Direct assignment modifies the DataFrame in place. The column is appended to the rightmost position by default.
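One subtlety worth noting: when the right-hand side is a Series, assignment aligns on index labels rather than position, which can silently introduce NaN. A minimal sketch (the column and index values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 150, 200]}, index=[0, 1, 2])

# A Series with a mismatched index aligns by label, not by position
s = pd.Series([0.1, 0.2, 0.3], index=[1, 2, 3])
df['rate'] = s  # index 0 has no match (NaN); index 3 is dropped

# To assign positionally instead, strip the index first
df['rate_positional'] = s.to_numpy()
print(df)
```

Assigning `s.to_numpy()` (or `s.values`) bypasses alignment entirely, so the values land row by row.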
Using assign() for Method Chaining
The assign() method returns a new DataFrame with added columns, enabling clean method chaining without modifying the original DataFrame.
df_original = pd.DataFrame({
'revenue': [10000, 15000, 12000],
'costs': [7000, 9000, 8000]
})
df_new = (df_original
.assign(profit=lambda x: x['revenue'] - x['costs'])
.assign(profit_margin=lambda x: (x['profit'] / x['revenue']) * 100)
.assign(performance=lambda x: np.where(x['profit_margin'] > 25, 'Good', 'Poor'))
)
print(df_new)
Output:
revenue costs profit profit_margin performance
0 10000 7000 3000 30.000000 Good
1 15000 9000 6000 40.000000 Good
2 12000 8000 4000 33.333333 Good
The lambda functions in assign() receive the DataFrame being constructed, allowing you to reference previously created columns within the same chain.
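Because keyword arguments are evaluated in order, the same chain can be collapsed into a single assign() call whose later callables reference earlier keys. A sketch using the same columns:

```python
import pandas as pd
import numpy as np

df_original = pd.DataFrame({
    'revenue': [10000, 15000, 12000],
    'costs': [7000, 9000, 8000]
})

# Later keyword arguments may reference columns created by earlier ones
df_new = df_original.assign(
    profit=lambda x: x['revenue'] - x['costs'],
    profit_margin=lambda x: x['profit'] / x['revenue'] * 100,
    performance=lambda x: np.where(x['profit_margin'] > 25, 'Good', 'Poor'),
)
print(df_new)
```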
Inserting Columns at Specific Positions
Use insert() when column position matters for readability or downstream processing requirements.
df = pd.DataFrame({
'user_id': [101, 102, 103],
'purchase_amount': [250, 400, 175]
})
# Insert at position 1 (between user_id and purchase_amount)
df.insert(1, 'user_name', ['Alice', 'Bob', 'Charlie'])
# Insert at the beginning
df.insert(0, 'record_id', range(1, 4))
print(df)
Output:
record_id user_id user_name purchase_amount
0 1 101 Alice 250
1 2 102 Bob 400
2 3 103 Charlie 175
The insert() method modifies the DataFrame in place and raises an error if the column already exists.
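To see the duplicate check in action, along with the allow_duplicates escape hatch, a short sketch (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [101, 102, 103]})
df.insert(1, 'flag', [True, False, True])

# Inserting an existing column name raises ValueError by default
try:
    df.insert(0, 'flag', [False, False, False])
except ValueError as e:
    print(f"Rejected: {e}")

# allow_duplicates=True permits a second column with the same name
df.insert(0, 'flag', [False, False, False], allow_duplicates=True)
print(df.columns.tolist())
```

Duplicate column names are rarely a good idea, but the flag exists for cases like staging raw data before a rename.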
Conditional Column Creation with np.where()
For binary conditions, np.where() provides vectorized performance superior to iterative approaches.
df = pd.DataFrame({
'temperature': [72, 85, 90, 68, 95, 78],
'humidity': [45, 60, 70, 40, 75, 55]
})
# Simple condition
df['comfort_level'] = np.where(df['temperature'] < 75, 'Comfortable', 'Hot')
# Nested conditions
df['alert'] = np.where(
df['temperature'] > 90,
'High Temp Alert',
np.where(df['humidity'] > 65, 'High Humidity Alert', 'Normal')
)
print(df)
Output:
temperature humidity comfort_level alert
0 72 45 Comfortable Normal
1 85 60 Hot Normal
2 90 70 Hot High Humidity Alert
3 68 40 Comfortable Normal
4 95 75 Hot High Temp Alert
5 78 55 Hot Normal
Multi-Condition Logic with np.select()
When handling multiple conditions, np.select() offers better readability than nested np.where() calls. Conditions are evaluated in order and the first match wins, so list the most restrictive thresholds first.
df = pd.DataFrame({
'score': [45, 67, 89, 92, 55, 73, 38, 81],
})
conditions = [
df['score'] >= 90,
df['score'] >= 80,
df['score'] >= 70,
df['score'] >= 60,
df['score'] >= 50
]
choices = ['A', 'B', 'C', 'D', 'E']
df['grade'] = np.select(conditions, choices, default='F')
print(df)
Output:
score grade
0 45 F
1 67 D
2 89 B
3 92 A
4 55 E
5 73 C
6 38 F
7 81 B
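When the conditions are ordered numeric thresholds like these, pd.cut is an alternative worth knowing. A sketch of the same grading (the bin edges are an assumption chosen to mirror the thresholds above):

```python
import pandas as pd

df = pd.DataFrame({'score': [45, 67, 89, 92, 55, 73, 38, 81]})

# right=False makes each bin left-inclusive, matching the >= comparisons
df['grade'] = pd.cut(
    df['score'],
    bins=[0, 50, 60, 70, 80, 90, 101],
    labels=['F', 'E', 'D', 'C', 'B', 'A'],
    right=False,
)
print(df)
```

pd.cut also returns an ordered categorical, which sorts grades correctly without extra work.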
Using apply() for Complex Logic
When vectorized operations aren’t feasible, apply() enables row-wise or column-wise operations with custom functions.
df = pd.DataFrame({
'text': ['Python Programming', 'Data Science', 'ML', 'AI'],
'value': [100, 200, 150, 175]
})
# Row-wise operation
def categorize_row(row):
if len(row['text']) > 10 and row['value'] > 150:
return 'Premium Long'
elif len(row['text']) > 10:
return 'Standard Long'
elif row['value'] > 150:
return 'Premium Short'
else:
return 'Standard Short'
df['category'] = df.apply(categorize_row, axis=1)
print(df)
Output:
text value category
0 Python Programming 100 Standard Long
1 Data Science 200 Premium Long
2 ML 150 Standard Short
3 AI 175 Premium Short
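Even logic like this can often be vectorized. One way to express categorize_row with np.select, equivalent to the row-wise version above (a sketch, not the only formulation):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'text': ['Python Programming', 'Data Science', 'ML', 'AI'],
    'value': [100, 200, 150, 175]
})

is_long = df['text'].str.len() > 10
is_premium = df['value'] > 150

# Conditions are checked in order; the combined case must come first
df['category'] = np.select(
    [is_long & is_premium, is_long, is_premium],
    ['Premium Long', 'Standard Long', 'Premium Short'],
    default='Standard Short',
)
print(df)
```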
Performance Considerations
Vectorized operations dramatically outperform iterative approaches. Here’s a comparison:
import time
# Create large DataFrame
df_large = pd.DataFrame({
'value': np.random.randint(1, 100, 100000)
})
# Vectorized approach
start = time.time()
df_large['category_vec'] = np.where(df_large['value'] > 50, 'High', 'Low')
vec_time = time.time() - start
# Apply approach
start = time.time()
df_large['category_apply'] = df_large['value'].apply(lambda x: 'High' if x > 50 else 'Low')
apply_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")
print(f"Apply: {apply_time:.4f}s")
print(f"Speedup: {apply_time/vec_time:.1f}x")
Typical output shows vectorized operations are 50-100x faster for simple conditions.
Adding Multiple Columns Simultaneously
Add multiple columns efficiently using dictionary assignment or multiple assign() arguments.
df = pd.DataFrame({
'sales': [1000, 1500, 1200],
'returns': [50, 75, 60]
})
# Dictionary approach
new_columns = {
'net_sales': df['sales'] - df['returns'],
'return_rate': (df['returns'] / df['sales']) * 100,
'status': 'Active'
}
df = df.assign(**new_columns)
print(df)
Output:
sales returns net_sales return_rate status
0 1000 50 950 5.000000 Active
1 1500 75 1425 5.000000 Active
2 1200 60 1140 5.000000 Active
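When many columns are added one at a time, pandas may warn about a fragmented frame; building the new columns as their own DataFrame and concatenating once avoids the repeated inserts. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'sales': [1000, 1500, 1200], 'returns': [50, 75, 60]})

new_cols = pd.DataFrame({
    'net_sales': df['sales'] - df['returns'],
    'return_rate': df['returns'] / df['sales'] * 100,
})

# One concat along axis=1 instead of repeated single-column assignments
df = pd.concat([df, new_cols], axis=1)
print(df)
```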
Working with DateTime Columns
Extract components from datetime columns to create derived features for time-series analysis.
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=5, freq='D'),
'value': [100, 150, 120, 180, 140]
})
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day_of_week'] = df['timestamp'].dt.day_name()
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5
print(df)
This approach efficiently creates multiple time-based features essential for temporal pattern analysis and forecasting models.
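One practical caveat: the .dt accessor only works on datetime64 columns, so string timestamps must go through pd.to_datetime first. A minimal sketch with illustrative dates:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['2024-01-06', '2024-01-07', '2024-01-08']})

# Convert strings to datetime64 before using the .dt accessor
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.day_name()
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5
print(df)
```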