How to Add a New Column in Pandas

Adding columns to a Pandas DataFrame is one of the most common operations you'll perform in data analysis. Whether you're calculating derived metrics, categorizing data, or preparing features for...

Key Insights

  • Direct bracket assignment (df['new_col'] = values) is the fastest and most readable approach for most column additions—use it as your default method.
  • The assign() method shines in method chaining scenarios where you want to keep transformations readable and avoid intermediate variables.
  • Always prefer vectorized operations (np.where(), arithmetic on columns) over apply() with custom functions—the performance difference can be 100x or more on large datasets.

Adding columns to a Pandas DataFrame is one of the most common operations you’ll perform in data analysis. Whether you’re calculating derived metrics, categorizing data, or preparing features for machine learning, you need to know the right tool for each situation. Pandas offers several approaches, each with distinct trade-offs in readability, performance, and flexibility.

This guide covers the practical methods you’ll actually use, with clear recommendations on when to reach for each one.

Adding a Column with Direct Assignment

The bracket notation is your bread and butter. It’s simple, fast, and immediately readable to anyone familiar with Pandas.

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
    'price': [25.99, 49.99, 15.50, 99.00],
    'quantity': [100, 50, 200, 25]
})

# Add a column with a scalar value (broadcasts to all rows)
df['currency'] = 'USD'

# Add a column from a list (must match DataFrame length)
df['warehouse'] = ['A', 'B', 'A', 'C']

# Add a column from a Series
df['discount_rate'] = pd.Series([0.1, 0.15, 0.05, 0.2])

print(df)

Output:

     product  price  quantity currency warehouse  discount_rate
0     Widget  25.99       100      USD         A           0.10
1     Gadget  49.99        50      USD         B           0.15
2      Gizmo  15.50       200      USD         A           0.05
3  Doohickey  99.00        25      USD         C           0.20

A few things to note: scalar values automatically broadcast to fill every row. Lists must have exactly the same length as the DataFrame—Pandas will raise a ValueError otherwise. When using a Series, Pandas aligns by index, which can cause unexpected NaN values if the indices don’t match.

This method modifies the DataFrame in place. If you need to preserve the original, make a copy first with df.copy().

Using the assign() Method

The assign() method returns a new DataFrame with the added columns, leaving the original untouched. This makes it ideal for method chaining—those fluent pipelines where you transform data step by step.

import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
    'price': [25.99, 49.99, 15.50, 99.00],
    'quantity': [100, 50, 200, 25],
    'category': ['electronics', 'electronics', 'tools', 'tools']
})

# Chain assign with other operations
result = (
    df
    .assign(
        inventory_value=lambda x: x['price'] * x['quantity'],
        is_expensive=lambda x: x['price'] > 30
    )
    .query('category == "electronics"')
    .sort_values('inventory_value', ascending=False)
)

print(result)

Output:

  product  price  quantity     category  inventory_value  is_expensive
0  Widget  25.99       100  electronics          2599.00         False
1  Gadget  49.99        50  electronics          2499.50          True

The lambda functions receive the DataFrame as input, which lets you reference columns that were just created in the same assign() call (in Python 3.6+ where dictionary order is preserved).

Use assign() when you’re building transformation pipelines or when you explicitly want immutable operations. Stick with bracket notation for simple, one-off additions where the overhead of a method call isn’t justified.

Adding Columns Based on Existing Columns

Derived columns are everywhere in real-world data work. Pandas makes arithmetic between columns straightforward—operations are vectorized and align automatically by index.

import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
    'unit_price': [25.99, 49.99, 15.50, 99.00],
    'quantity': [100, 50, 200, 25],
    'tax_rate': [0.08, 0.08, 0.06, 0.10]
})

# Simple arithmetic between columns
df['subtotal'] = df['unit_price'] * df['quantity']

# Multiple columns in one calculation
df['tax_amount'] = df['subtotal'] * df['tax_rate']
df['total'] = df['subtotal'] + df['tax_amount']

# Using built-in string methods on string columns
df['product_upper'] = df['product'].str.upper()

print(df[['product', 'subtotal', 'tax_amount', 'total']])

Output:

     product  subtotal  tax_amount     total
0     Widget   2599.00      207.92   2806.92
1     Gadget   2499.50      199.96   2699.46
2      Gizmo   3100.00      186.00   3286.00
3  Doohickey   2475.00      247.50   2722.50

For more complex transformations, you can use apply() with a custom function. But be warned: apply() is slow because it processes rows one at a time in Python rather than using optimized C code.

# Using apply for complex logic (slower, use sparingly)
def calculate_shipping(row):
    if row['quantity'] > 100:
        return 0  # Free shipping for bulk orders
    elif row['total'] > 2700:
        return 10  # Flat rate for large orders
    else:
        return 25  # Standard shipping

df['shipping'] = df.apply(calculate_shipping, axis=1)

Whenever possible, express your logic using vectorized operations instead of apply(). The next section shows you how.

Conditional Column Creation with np.where() and apply()

Creating columns based on conditions is a daily task. For binary conditions (if-else), np.where() is your best friend—it’s vectorized and fast.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [95000, 65000, 120000, 55000, 78000],
    'years_employed': [5, 2, 8, 1, 4]
})

# Binary condition with np.where()
df['salary_band'] = np.where(df['salary'] >= 80000, 'Senior', 'Junior')

# Nested np.where() for multiple conditions
df['experience_level'] = np.where(
    df['years_employed'] >= 5, 'Veteran',
    np.where(df['years_employed'] >= 2, 'Intermediate', 'New')
)

print(df)

Output:

  employee   department  salary  years_employed salary_band experience_level
0    Alice  Engineering   95000               5      Senior          Veteran
1      Bob        Sales   65000               2      Junior     Intermediate
2  Charlie  Engineering  120000               8      Senior          Veteran
3    Diana           HR   55000               1      Junior              New
4      Eve        Sales   78000               4      Junior     Intermediate

For more than two or three conditions, nested np.where() becomes unreadable. Use np.select() instead:

# Multiple conditions with np.select() - much cleaner
conditions = [
    df['salary'] >= 100000,
    df['salary'] >= 75000,
    df['salary'] >= 50000
]
choices = ['Executive', 'Senior', 'Mid-Level']

df['compensation_tier'] = np.select(conditions, choices, default='Entry')

print(df[['employee', 'salary', 'compensation_tier']])

Output:

  employee  salary compensation_tier
0    Alice   95000            Senior
1      Bob   65000         Mid-Level
2  Charlie  120000         Executive
3    Diana   55000         Mid-Level
4      Eve   78000            Senior

Reserve apply() for truly complex logic that can’t be expressed with vectorized operations—things like parsing nested JSON, calling external APIs, or business rules that span many columns with intricate dependencies.

Inserting Columns at Specific Positions

By default, new columns appear at the end of the DataFrame. When column order matters—for reports, exports, or just readability—use the insert() method.

import pandas as pd

df = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Jones', 'Brown'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
})

# Insert 'id' column at position 0 (first column)
df.insert(0, 'id', [1001, 1002, 1003])

# Insert 'full_name' at position 3 (after last_name)
df.insert(3, 'full_name', df['first_name'] + ' ' + df['last_name'])

print(df)

Output:

     id first_name last_name       full_name                email
0  1001      Alice     Smith     Alice Smith    alice@example.com
1  1002        Bob     Jones       Bob Jones      bob@example.com
2  1003    Charlie     Brown   Charlie Brown  charlie@example.com

The insert() method takes three required arguments: the position index, the column name, and the values. It modifies the DataFrame in place and raises a ValueError if the column name already exists—pass allow_duplicates=True to override this behavior (though duplicate column names are generally a bad idea).

One caveat: insert() doesn’t support method chaining since it returns None. If you need to insert columns in a pipeline, you’ll have to break the chain or use a workaround like defining a helper function.

Conclusion

You now have a complete toolkit for adding columns to Pandas DataFrames:

  • Direct assignment (df['col'] = values): Your default choice for simplicity and speed.
  • assign() method: Best for method chaining and immutable transformations.
  • Arithmetic operations: Vectorized calculations between existing columns.
  • np.where() and np.select(): Conditional logic that stays fast at scale.
  • insert() method: When column position matters.

The pattern I follow: start with direct assignment, reach for assign() when building pipelines, and always prefer vectorized operations over apply(). Your future self (and your colleagues) will thank you for the readable, performant code.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.