How to Add a New Column in Polars
Key Insights
- Polars DataFrames are immutable, so with_columns() returns a new DataFrame rather than modifying in place. This design enables powerful optimizations and predictable behavior.
- The expression API (pl.col(), pl.lit(), pl.when()) is the idiomatic way to create columns, offering both readability and performance over row-wise operations.
- In lazy mode, with_columns() allows Polars to optimize multiple column additions into a single pass over your data, dramatically improving performance on large datasets.
Why Adding Columns in Polars Differs from Pandas
If you’re coming from pandas, your first instinct might be to write df['new_col'] = value. That won’t work in Polars. The library takes an immutable approach to DataFrames—every transformation returns a new DataFrame rather than modifying the original. This isn’t a limitation; it’s a deliberate design choice that enables Polars to parallelize operations and optimize query plans.
The with_columns() method is your primary tool for adding columns. Once you internalize this pattern, you’ll find it more expressive and less error-prone than pandas’ assignment syntax.
Using with_columns() for Basic Column Addition
The with_columns() method accepts one or more expressions and returns a new DataFrame with those columns added. Let’s start with the simplest cases.
import polars as pl
# Create a sample DataFrame
df = pl.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo"],
    "price": [10.00, 25.50, 15.75],
    "quantity": [100, 50, 75]
})

# Add a constant column
df_with_status = df.with_columns(
    pl.lit("active").alias("status")
)
print(df_with_status)
Output:
shape: (3, 4)
┌─────────┬───────┬──────────┬────────┐
│ product ┆ price ┆ quantity ┆ status │
│ ---     ┆ ---   ┆ ---      ┆ ---    │
│ str     ┆ f64   ┆ i64      ┆ str    │
╞═════════╪═══════╪══════════╪════════╡
│ Widget  ┆ 10.0  ┆ 100      ┆ active │
│ Gadget  ┆ 25.5  ┆ 50       ┆ active │
│ Gizmo   ┆ 15.75 ┆ 75       ┆ active │
└─────────┴───────┴──────────┴────────┘
The pl.lit() function creates a literal value expression, and alias() names the resulting column. For calculated columns based on existing data, reference columns with pl.col():
# Add a calculated column
df_with_total = df.with_columns(
    (pl.col("price") * pl.col("quantity")).alias("total_value")
)
print(df_with_total)
Output:
shape: (3, 4)
┌─────────┬───────┬──────────┬─────────────┐
│ product ┆ price ┆ quantity ┆ total_value │
│ ---     ┆ ---   ┆ ---      ┆ ---         │
│ str     ┆ f64   ┆ i64      ┆ f64         │
╞═════════╪═══════╪══════════╪═════════════╡
│ Widget  ┆ 10.0  ┆ 100      ┆ 1000.0      │
│ Gadget  ┆ 25.5  ┆ 50       ┆ 1275.0      │
│ Gizmo   ┆ 15.75 ┆ 75       ┆ 1181.25     │
└─────────┴───────┴──────────┴─────────────┘
Creating Columns with Expressions
Polars expressions are where the library really shines. They’re composable, optimizable, and far more powerful than simple arithmetic.
Conditional Logic with when().then().otherwise()
This is Polars’ equivalent of SQL’s CASE WHEN or numpy’s where():
df = pl.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo", "Thingamajig"],
    "price": [10.00, 25.50, 15.75, 99.99],
    "quantity": [100, 50, 75, 10]
})

# Create a tier column based on price
df_with_tier = df.with_columns(
    pl.when(pl.col("price") < 15)
    .then(pl.lit("budget"))
    .when(pl.col("price") < 50)
    .then(pl.lit("standard"))
    .otherwise(pl.lit("premium"))
    .alias("price_tier")
)
print(df_with_tier)
Output:
shape: (4, 4)
┌─────────────┬───────┬──────────┬────────────┐
│ product     ┆ price ┆ quantity ┆ price_tier │
│ ---         ┆ ---   ┆ ---      ┆ ---        │
│ str         ┆ f64   ┆ i64      ┆ str        │
╞═════════════╪═══════╪══════════╪════════════╡
│ Widget      ┆ 10.0  ┆ 100      ┆ budget     │
│ Gadget      ┆ 25.5  ┆ 50       ┆ standard   │
│ Gizmo       ┆ 15.75 ┆ 75       ┆ standard   │
│ Thingamajig ┆ 99.99 ┆ 10       ┆ premium    │
└─────────────┴───────┴──────────┴────────────┘
String Manipulation
Polars provides a rich set of string operations through the .str namespace:
df = pl.DataFrame({
    "email": ["alice@example.com", "bob@company.org", "charlie@startup.io"]
})

# Extract domain from email
df_with_domain = df.with_columns(
    pl.col("email").str.split("@").list.last().alias("domain"),
    pl.col("email").str.split("@").list.first().alias("username"),
    pl.col("email").str.to_uppercase().alias("email_upper")
)
print(df_with_domain)
Output:
shape: (3, 4)
┌────────────────────┬─────────────┬──────────┬────────────────────┐
│ email              ┆ domain      ┆ username ┆ email_upper        │
│ ---                ┆ ---         ┆ ---      ┆ ---                │
│ str                ┆ str         ┆ str      ┆ str                │
╞════════════════════╪═════════════╪══════════╪════════════════════╡
│ alice@example.com  ┆ example.com ┆ alice    ┆ ALICE@EXAMPLE.COM  │
│ bob@company.org    ┆ company.org ┆ bob      ┆ BOB@COMPANY.ORG    │
│ charlie@startup.io ┆ startup.io  ┆ charlie  ┆ CHARLIE@STARTUP.IO │
└────────────────────┴─────────────┴──────────┴────────────────────┘
Adding Multiple Columns at Once
One of Polars’ strengths is efficiently handling multiple operations. Pass multiple expressions to a single with_columns() call:
df = pl.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo"],
    "price": [10.00, 25.50, 15.75],
    "quantity": [100, 50, 75],
    "cost": [6.00, 15.00, 9.50]
})

# Add multiple calculated columns at once
df_enriched = df.with_columns(
    (pl.col("price") * pl.col("quantity")).alias("revenue"),
    (pl.col("cost") * pl.col("quantity")).alias("total_cost"),
    (pl.col("price") - pl.col("cost")).alias("margin_per_unit"),
    ((pl.col("price") - pl.col("cost")) / pl.col("price") * 100).alias("margin_pct")
)
print(df_enriched)
This is more efficient than chaining multiple with_columns() calls because Polars can optimize and parallelize the operations.
For dynamic column creation, use list comprehensions:
# Create percentage columns for multiple numeric fields
numeric_cols = ["price", "cost"]

df_with_pcts = df.with_columns([
    (pl.col(col) / pl.col(col).sum() * 100).alias(f"{col}_pct_of_total")
    for col in numeric_cols
])
print(df_with_pcts)
Adding Columns in Lazy Mode
For large datasets, lazy evaluation is essential. Instead of executing each operation immediately, Polars builds a query plan and optimizes it before execution:
# Simulate a larger dataset
df_large = pl.DataFrame({
    "id": range(1_000_000),
    "value": [i * 0.5 for i in range(1_000_000)],
    "category": ["A", "B", "C"] * 333_333 + ["A"]
})

# Lazy evaluation with multiple column additions
result = (
    df_large
    .lazy()
    .with_columns(
        (pl.col("value") * 2).alias("doubled"),
        pl.col("value").log().alias("log_value"),
        pl.when(pl.col("category") == "A")
        .then(pl.col("value") * 1.1)
        .otherwise(pl.col("value"))
        .alias("adjusted_value")
    )
    .filter(pl.col("doubled") > 100)
    .collect()
)
The .lazy() call converts the DataFrame to a LazyFrame, and .collect() triggers execution. Between these calls, Polars optimizes the query—it might push filters before column creation, parallelize independent operations, or eliminate unused columns.
You can inspect the query plan with .explain():
query = (
    df_large
    .lazy()
    .with_columns((pl.col("value") * 2).alias("doubled"))
    .filter(pl.col("doubled") > 100)
)
print(query.explain())
Common Patterns and Use Cases
Date Component Extraction
Working with dates often requires extracting components:
df = pl.DataFrame({
    "order_date": ["2024-01-15", "2024-03-22", "2024-12-01"]
}).with_columns(pl.col("order_date").str.to_date())

df_with_date_parts = df.with_columns(
    pl.col("order_date").dt.year().alias("year"),
    pl.col("order_date").dt.month().alias("month"),
    pl.col("order_date").dt.weekday().alias("day_of_week"),
    pl.col("order_date").dt.quarter().alias("quarter")
)
print(df_with_date_parts)
Normalization and Scaling
Common in data preprocessing:
df = pl.DataFrame({
    "feature": [10, 20, 30, 40, 50]
})

df_normalized = df.with_columns(
    # Min-max normalization
    ((pl.col("feature") - pl.col("feature").min()) /
     (pl.col("feature").max() - pl.col("feature").min())).alias("normalized"),
    # Z-score standardization
    ((pl.col("feature") - pl.col("feature").mean()) /
     pl.col("feature").std()).alias("standardized")
)
print(df_normalized)
Flag Columns from Conditions
Creating boolean flags for filtering or analysis:
df = pl.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "total_purchases": [150, 500, 75, 1200, 300],
    "account_age_days": [365, 30, 180, 720, 90]
})

df_with_flags = df.with_columns(
    (pl.col("total_purchases") >= 500).alias("is_high_value"),
    (pl.col("account_age_days") >= 365).alias("is_established"),
    ((pl.col("total_purchases") >= 500) &
     (pl.col("account_age_days") >= 365)).alias("is_vip")
)
print(df_with_flags)
Conclusion
Adding columns in Polars centers on the with_columns() method combined with the expression API. The key patterns to remember:
- Use pl.lit() for constant values and pl.col() for referencing existing columns
- Chain conditional logic with when().then().otherwise()
- Add multiple columns in a single with_columns() call for better performance
- Switch to lazy mode for large datasets to benefit from query optimization
The immutable design might feel unfamiliar at first, but it enables Polars’ impressive performance characteristics. Once you embrace expressions over imperative operations, you’ll write cleaner, faster data transformations.
For more details, the Polars user guide on expressions is an excellent resource.