How to Add a New Column in Pandas
Adding columns to a Pandas DataFrame is one of the most common operations you'll perform in data analysis. Whether you're calculating derived metrics, categorizing data, or preparing features for...
Key Insights
- Direct bracket assignment (
df['new_col'] = values) is the fastest and most readable approach for most column additions—use it as your default method. - The
assign()method shines in method chaining scenarios where you want to keep transformations readable and avoid intermediate variables. - Always prefer vectorized operations (
np.where(), arithmetic on columns) overapply()with custom functions—the performance difference can be 100x or more on large datasets.
Adding columns to a Pandas DataFrame is one of the most common operations you’ll perform in data analysis. Whether you’re calculating derived metrics, categorizing data, or preparing features for machine learning, you need to know the right tool for each situation. Pandas offers several approaches, each with distinct trade-offs in readability, performance, and flexibility.
This guide covers the practical methods you’ll actually use, with clear recommendations on when to reach for each one.
Adding a Column with Direct Assignment
The bracket notation is your bread and butter. It’s simple, fast, and immediately readable to anyone familiar with Pandas.
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
'price': [25.99, 49.99, 15.50, 99.00],
'quantity': [100, 50, 200, 25]
})
# Add a column with a scalar value (broadcasts to all rows)
df['currency'] = 'USD'
# Add a column from a list (must match DataFrame length)
df['warehouse'] = ['A', 'B', 'A', 'C']
# Add a column from a Series
df['discount_rate'] = pd.Series([0.1, 0.15, 0.05, 0.2])
print(df)
Output:
product price quantity currency warehouse discount_rate
0 Widget 25.99 100 USD A 0.10
1 Gadget 49.99 50 USD B 0.15
2 Gizmo 15.50 200 USD A 0.05
3 Doohickey 99.00 25 USD C 0.20
A few things to note: scalar values automatically broadcast to fill every row. Lists must have exactly the same length as the DataFrame—Pandas will raise a ValueError otherwise. When using a Series, Pandas aligns by index, which can cause unexpected NaN values if the indices don’t match.
This method modifies the DataFrame in place. If you need to preserve the original, make a copy first with df.copy().
Using the assign() Method
The assign() method returns a new DataFrame with the added columns, leaving the original untouched. This makes it ideal for method chaining—those fluent pipelines where you transform data step by step.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
'price': [25.99, 49.99, 15.50, 99.00],
'quantity': [100, 50, 200, 25],
'category': ['electronics', 'electronics', 'tools', 'tools']
})
# Chain assign with other operations
result = (
df
.assign(
inventory_value=lambda x: x['price'] * x['quantity'],
is_expensive=lambda x: x['price'] > 30
)
.query('category == "electronics"')
.sort_values('inventory_value', ascending=False)
)
print(result)
Output:
product price quantity category inventory_value is_expensive
0 Widget 25.99 100 electronics 2599.00 False
1 Gadget 49.99 50 electronics 2499.50 True
The lambda functions receive the DataFrame as input, which lets you reference columns that were just created in the same assign() call (in Python 3.6+ where dictionary order is preserved).
Use assign() when you’re building transformation pipelines or when you explicitly want immutable operations. Stick with bracket notation for simple, one-off additions where the overhead of a method call isn’t justified.
Adding Columns Based on Existing Columns
Derived columns are everywhere in real-world data work. Pandas makes arithmetic between columns straightforward—operations are vectorized and align automatically by index.
import pandas as pd
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
'unit_price': [25.99, 49.99, 15.50, 99.00],
'quantity': [100, 50, 200, 25],
'tax_rate': [0.08, 0.08, 0.06, 0.10]
})
# Simple arithmetic between columns
df['subtotal'] = df['unit_price'] * df['quantity']
# Multiple columns in one calculation
df['tax_amount'] = df['subtotal'] * df['tax_rate']
df['total'] = df['subtotal'] + df['tax_amount']
# Using built-in string methods on string columns
df['product_upper'] = df['product'].str.upper()
print(df[['product', 'subtotal', 'tax_amount', 'total']])
Output:
product subtotal tax_amount total
0 Widget 2599.00 207.92 2806.92
1 Gadget 2499.50 199.96 2699.46
2 Gizmo 3100.00 186.00 3286.00
3 Doohickey 2475.00 247.50 2722.50
For more complex transformations, you can use apply() with a custom function. But be warned: apply() is slow because it processes rows one at a time in Python rather than using optimized C code.
# Using apply for complex logic (slower, use sparingly)
def calculate_shipping(row):
if row['quantity'] > 100:
return 0 # Free shipping for bulk orders
elif row['total'] > 2700:
return 10 # Flat rate for large orders
else:
return 25 # Standard shipping
df['shipping'] = df.apply(calculate_shipping, axis=1)
Whenever possible, express your logic using vectorized operations instead of apply(). The next section shows you how.
Conditional Column Creation with np.where() and apply()
Creating columns based on conditions is a daily task. For binary conditions (if-else), np.where() is your best friend—it’s vectorized and fast.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
'salary': [95000, 65000, 120000, 55000, 78000],
'years_employed': [5, 2, 8, 1, 4]
})
# Binary condition with np.where()
df['salary_band'] = np.where(df['salary'] >= 80000, 'Senior', 'Junior')
# Nested np.where() for multiple conditions
df['experience_level'] = np.where(
df['years_employed'] >= 5, 'Veteran',
np.where(df['years_employed'] >= 2, 'Intermediate', 'New')
)
print(df)
Output:
employee department salary years_employed salary_band experience_level
0 Alice Engineering 95000 5 Senior Veteran
1 Bob Sales 65000 2 Junior Intermediate
2 Charlie Engineering 120000 8 Senior Veteran
3 Diana HR 55000 1 Junior New
4 Eve Sales 78000 4 Junior Intermediate
For more than two or three conditions, nested np.where() becomes unreadable. Use np.select() instead:
# Multiple conditions with np.select() - much cleaner
conditions = [
df['salary'] >= 100000,
df['salary'] >= 75000,
df['salary'] >= 50000
]
choices = ['Executive', 'Senior', 'Mid-Level']
df['compensation_tier'] = np.select(conditions, choices, default='Entry')
print(df[['employee', 'salary', 'compensation_tier']])
Output:
employee salary compensation_tier
0 Alice 95000 Senior
1 Bob 65000 Mid-Level
2 Charlie 120000 Executive
3 Diana 55000 Mid-Level
4 Eve 78000 Senior
Reserve apply() for truly complex logic that can’t be expressed with vectorized operations—things like parsing nested JSON, calling external APIs, or business rules that span many columns with intricate dependencies.
Inserting Columns at Specific Positions
By default, new columns appear at the end of the DataFrame. When column order matters—for reports, exports, or just readability—use the insert() method.
import pandas as pd
df = pd.DataFrame({
'first_name': ['Alice', 'Bob', 'Charlie'],
'last_name': ['Smith', 'Jones', 'Brown'],
'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
})
# Insert 'id' column at position 0 (first column)
df.insert(0, 'id', [1001, 1002, 1003])
# Insert 'full_name' at position 3 (after last_name)
df.insert(3, 'full_name', df['first_name'] + ' ' + df['last_name'])
print(df)
Output:
id first_name last_name full_name email
0 1001 Alice Smith Alice Smith alice@example.com
1 1002 Bob Jones Bob Jones bob@example.com
2 1003 Charlie Brown Charlie Brown charlie@example.com
The insert() method takes three required arguments: the position index, the column name, and the values. It modifies the DataFrame in place and raises a ValueError if the column name already exists—pass allow_duplicates=True to override this behavior (though duplicate column names are generally a bad idea).
One caveat: insert() doesn’t support method chaining since it returns None. If you need to insert columns in a pipeline, you’ll have to break the chain or use a workaround like defining a helper function.
Conclusion
You now have a complete toolkit for adding columns to Pandas DataFrames:
- Direct assignment (
df['col'] = values): Your default choice for simplicity and speed. assign()method: Best for method chaining and immutable transformations.- Arithmetic operations: Vectorized calculations between existing columns.
np.where()andnp.select(): Conditional logic that stays fast at scale.insert()method: When column position matters.
The pattern I follow: start with direct assignment, reach for assign() when building pipelines, and always prefer vectorized operations over apply(). Your future self (and your colleagues) will thank you for the readable, performant code.