How to Use Expressions in Polars
Key Insights
- Polars expressions are composable building blocks that describe transformations without immediately executing them, enabling the query optimizer to dramatically improve performance
- Unlike pandas where operations happen row-by-row or require vectorized functions, Polars expressions operate on entire columns and can be parallelized automatically
- Mastering expressions is the single most important skill for Polars proficiency—they’re used everywhere from simple column selection to complex aggregations
Introduction to Polars Expressions
If you’re coming from pandas, you probably think of data manipulation as a series of method calls that immediately transform your DataFrame. Polars takes a fundamentally different approach. Expressions are declarative descriptions of what you want to compute, not imperative commands that execute immediately.
This distinction matters. When you write df["price"] * df["quantity"] in pandas, the multiplication happens right then. In Polars, col("price") * col("quantity") creates an expression object that describes the computation. The actual work happens when you pass that expression to a context like select() or with_columns().
This design enables Polars to optimize your queries before running them. It can reorder operations, eliminate redundant work, and parallelize across CPU cores. You write readable code; Polars figures out the fastest way to execute it.
Basic Column Selection and Transformation
Every Polars expression starts with selecting data. The col() function is your primary tool for referencing columns, while lit() creates constant values.
import polars as pl
df = pl.DataFrame({
"product": ["Widget", "Gadget", "Sprocket"],
"price": [10.50, 25.00, 8.75],
"quantity": [100, 50, 200],
"discount_pct": [0.1, 0.15, 0.05]
})
# Basic column selection
result = df.select(
pl.col("product"),
pl.col("price"),
)
# Arithmetic operations create new expressions
result = df.select(
pl.col("product"),
(pl.col("price") * pl.col("quantity")).alias("gross_revenue"),
(pl.col("price") * (1 - pl.col("discount_pct"))).alias("discounted_price"),
(pl.lit(100) - pl.col("quantity")).alias("restock_needed"),
)
print(result)
shape: (3, 4)
┌──────────┬───────────────┬──────────────────┬────────────────┐
│ product ┆ gross_revenue ┆ discounted_price ┆ restock_needed │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ i64 │
╞══════════╪═══════════════╪══════════════════╪════════════════╡
│ Widget ┆ 1050.0 ┆ 9.45 ┆ 0 │
│ Gadget ┆ 1250.0 ┆ 21.25 ┆ 50 │
│ Sprocket ┆ 1750.0 ┆ 8.3125 ┆ -100 │
└──────────┴───────────────┴──────────────────┴────────────────┘
Notice how alias() names the resulting column. Without it, Polars generates a name based on the expression, which gets unwieldy for complex calculations.
You can also select multiple columns at once using patterns:
# Select all columns
df.select(pl.all())
# Select columns by pattern
df.select(pl.col("^price.*$")) # regex; must be anchored with ^ and $
# Select by dtype
df.select(pl.col(pl.Float64))
# Exclude specific columns
df.select(pl.all().exclude("discount_pct"))
String and Numeric Expressions
Polars organizes type-specific operations into namespaces. String operations live under .str, datetime operations under .dt, list operations under .list, and so on.
df = pl.DataFrame({
"name": ["ALICE SMITH", "bob jones", "Charlie Brown"],
"email": ["alice@example.com", "BOB@COMPANY.ORG", "charlie@test.net"],
"salary": [75432.567, 82100.123, 69999.999],
"start_date": ["2023-01-15", "2022-06-01", "2023-09-20"]
})
result = df.select(
# String operations
pl.col("name").str.to_lowercase().alias("name_lower"),
pl.col("name").str.split(" ").list.first().alias("first_name"),
pl.col("email").str.to_lowercase().str.contains("@company").alias("is_company_email"),
# Numeric operations
pl.col("salary").round(2).alias("salary_rounded"),
pl.col("salary").cast(pl.Int64).alias("salary_int"),
(pl.col("salary") / 12).round(2).alias("monthly_salary"),
# Date parsing and extraction
pl.col("start_date").str.to_date().dt.month().alias("start_month"),
)
print(result)
shape: (3, 7)
┌────────────────┬────────────┬──────────────────┬────────────────┬────────────┬────────────────┬─────────────┐
│ name_lower ┆ first_name ┆ is_company_email ┆ salary_rounded ┆ salary_int ┆ monthly_salary ┆ start_month │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ bool ┆ f64 ┆ i64 ┆ f64 ┆ i8 │
╞════════════════╪════════════╪══════════════════╪════════════════╪════════════╪════════════════╪═════════════╡
│ alice smith ┆ ALICE ┆ false ┆ 75432.57 ┆ 75432 ┆ 6286.05 ┆ 1 │
│ bob jones ┆ bob ┆ true ┆ 82100.12 ┆ 82100 ┆ 6841.68 ┆ 6 │
│ charlie brown ┆ Charlie ┆ false ┆ 70000.0 ┆ 69999 ┆ 5833.33 ┆ 9 │
└────────────────┴────────────┴──────────────────┴────────────────┴────────────┴────────────────┴─────────────┘
The namespace pattern keeps the API organized and discoverable. When you type col("name").str., your IDE shows all available string methods.
Aggregation Expressions
Expressions really shine in aggregation contexts. Within a group_by(), you can compute multiple aggregations in a single pass over the data.
df = pl.DataFrame({
"department": ["Engineering", "Engineering", "Sales", "Sales", "Engineering", "Sales"],
"employee": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
"salary": [95000, 87000, 72000, 68000, 102000, 75000],
"tenure_years": [5, 3, 7, 2, 8, 4],
"performance_score": [4.2, 3.8, 4.5, 3.9, 4.8, 4.1]
})
result = df.group_by("department").agg(
pl.col("employee").count().alias("headcount"),
pl.col("salary").sum().alias("total_salary"),
pl.col("salary").mean().alias("avg_salary"),
pl.col("salary").max().alias("max_salary"),
pl.col("salary").min().alias("min_salary"),
pl.col("tenure_years").mean().alias("avg_tenure"),
pl.col("performance_score").mean().round(2).alias("avg_performance"),
pl.col("employee").filter(pl.col("salary") == pl.col("salary").max()).first().alias("top_earner"),
)
print(result)
shape: (2, 9)
┌─────────────┬───────────┬──────────────┬────────────┬────────────┬────────────┬────────────┬─────────────────┬────────────┐
│ department ┆ headcount ┆ total_salary ┆ avg_salary ┆ max_salary ┆ min_salary ┆ avg_tenure ┆ avg_performance ┆ top_earner │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 ┆ f64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ str │
╞═════════════╪═══════════╪══════════════╪════════════╪════════════╪════════════╪════════════╪═════════════════╪════════════╡
│ Sales ┆ 3 ┆ 215000 ┆ 71666.67 ┆ 75000 ┆ 68000 ┆ 4.333333 ┆ 4.17 ┆ Frank │
│ Engineering ┆ 3 ┆ 284000 ┆ 94666.67 ┆ 102000 ┆ 87000 ┆ 5.333333 ┆ 4.27 ┆ Eve │
└─────────────┴───────────┴──────────────┴────────────┴────────────┴────────────┴────────────┴─────────────────┴────────────┘
That last expression—finding the top earner—demonstrates how you can combine filtering and aggregation within a single expression. In pandas, this would require multiple steps or a custom aggregation function.
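For contrast, a rough pandas equivalent of the top-earner lookup (one common approach, not the only one):

```python
import pandas as pd

pdf = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Sales"],
    "employee": ["Alice", "Bob", "Charlie"],
    "salary": [95000, 87000, 72000],
})

# idxmax returns the row label of the max salary per group,
# then .loc pulls the matching employee names
top = pdf.loc[pdf.groupby("department")["salary"].idxmax(), ["department", "employee"]]
print(top)
```

Note that the pandas version is a separate step from any other aggregations, whereas the Polars expression slots into the same `agg()` call.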
Conditional Expressions with when/then/otherwise
The when/then/otherwise pattern replaces if-else logic and np.where() calls. It’s more readable and handles multiple conditions elegantly.
df = pl.DataFrame({
"product": ["A", "B", "C", "D", "E"],
"revenue": [150000, 45000, 89000, 12000, 250000],
"margin_pct": [0.35, 0.12, 0.28, -0.05, 0.42],
"region": ["North", "South", None, "East", "West"]
})
result = df.with_columns(
# Simple condition
pl.when(pl.col("revenue") > 100000)
.then(pl.lit("High"))
.otherwise(pl.lit("Standard"))
.alias("revenue_tier"),
# Multiple conditions (chained when/then)
pl.when(pl.col("margin_pct") < 0)
.then(pl.lit("Loss"))
.when(pl.col("margin_pct") < 0.2)
.then(pl.lit("Low"))
.when(pl.col("margin_pct") < 0.35)
.then(pl.lit("Medium"))
.otherwise(pl.lit("High"))
.alias("margin_category"),
# Handling nulls explicitly
pl.when(pl.col("region").is_null())
.then(pl.lit("Unknown"))
.otherwise(pl.col("region"))
.alias("region_clean"),
# Conditional calculation
pl.when(pl.col("margin_pct") > 0.3)
.then(pl.col("revenue") * 1.1) # 10% bonus allocation
.otherwise(pl.col("revenue"))
.alias("adjusted_revenue"),
)
print(result)
shape: (5, 8)
┌─────────┬─────────┬────────────┬────────┬──────────────┬─────────────────┬──────────────┬──────────────────┐
│ product ┆ revenue ┆ margin_pct ┆ region ┆ revenue_tier ┆ margin_category ┆ region_clean ┆ adjusted_revenue │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ f64 │
╞═════════╪═════════╪════════════╪════════╪══════════════╪═════════════════╪══════════════╪══════════════════╡
│ A ┆ 150000 ┆ 0.35 ┆ North ┆ High ┆ High ┆ North ┆ 165000.0 │
│ B ┆ 45000 ┆ 0.12 ┆ South ┆ Standard ┆ Low ┆ South ┆ 45000.0 │
│ C ┆ 89000 ┆ 0.28 ┆ null ┆ Standard ┆ Medium ┆ Unknown ┆ 89000.0 │
│ D ┆ 12000 ┆ -0.05 ┆ East ┆ Standard ┆ Loss ┆ East ┆ 12000.0 │
│ E ┆ 250000 ┆ 0.42 ┆ West ┆ High ┆ High ┆ West ┆ 275000.0 │
└─────────┴─────────┴────────────┴────────┴──────────────┴─────────────────┴──────────────┴──────────────────┘
For null handling specifically, Polars also provides fill_null() and coalesce() as more concise alternatives when that’s all you need.
Combining and Chaining Expressions
Real-world data transformations require combining multiple expressions. Polars makes this natural through method chaining and expression lists.
df = pl.DataFrame({
"transaction_id": range(1, 7),
"customer_name": [" John Doe ", "jane smith", "ROBERT JOHNSON", "Mary Williams", "james brown", "Patricia Davis"],
"amount": [1250.00, 89.50, 3400.00, 567.25, 12.99, 890.00],
"category": ["electronics", "books", "electronics", "clothing", "books", "electronics"],
"timestamp": ["2024-01-15 10:30:00", "2024-01-15 11:45:00", "2024-01-16 09:00:00",
"2024-01-16 14:30:00", "2024-01-17 08:15:00", "2024-01-17 16:45:00"]
})
# Complex transformation pipeline
result = df.select(
pl.col("transaction_id"),
# Chain multiple string operations
pl.col("customer_name")
.str.strip_chars()
.str.to_titlecase()
.str.replace_all(r"\s+", " ")
.alias("customer_name_clean"),
# Combine arithmetic with conditional logic
pl.when(pl.col("amount") > 1000)
.then(pl.col("amount") * 0.95) # 5% discount for large orders
.otherwise(pl.col("amount"))
.round(2)
.alias("final_amount"),
pl.col("category").str.to_titlecase(),
# Parse and transform datetime
pl.col("timestamp")
.str.to_datetime()
.alias("parsed_timestamp"),
pl.col("timestamp")
.str.to_datetime()
.dt.strftime("%A")
.alias("day_of_week"),
)
print(result)
You can also build expression lists programmatically:
# Define reusable expressions
numeric_cols = ["amount"]
transformations = [
pl.col(c).mean().alias(f"{c}_mean") for c in numeric_cols
] + [
pl.col(c).std().alias(f"{c}_std") for c in numeric_cols
]
stats = df.select(transformations)
Performance Tips and Common Patterns
The biggest performance gains come from using lazy evaluation. Instead of eager functions like pl.read_csv, use their lazy counterparts such as pl.scan_csv, or call .lazy() on an existing DataFrame.
# Eager execution - each operation runs immediately
df_eager = pl.read_csv("large_file.csv")
result_eager = df_eager.filter(pl.col("status") == "active")
result_eager = result_eager.select(pl.col("revenue").sum())
# Lazy execution - builds query plan, optimizes, then executes
result_lazy = (
pl.scan_csv("large_file.csv")
.filter(pl.col("status") == "active")
.select(pl.col("revenue").sum())
.collect() # Execution happens here
)
With lazy evaluation, Polars can push filters down to the file reader (predicate pushdown), only read necessary columns (projection pushdown), and eliminate redundant operations.
Common anti-patterns to avoid:
# Bad: Multiple separate operations
df = df.with_columns(pl.col("a") * 2)
df = df.with_columns(pl.col("b") + 1)
df = df.with_columns(pl.col("c").str.to_lowercase())
# Good: Single operation with multiple expressions
df = df.with_columns(
pl.col("a") * 2,
pl.col("b") + 1,
pl.col("c").str.to_lowercase(),
)
# Bad: Using Python loops
for col in ["a", "b", "c"]:
df = df.with_columns(pl.col(col).fill_null(0))
# Good: Expression list
df = df.with_columns([pl.col(c).fill_null(0) for c in ["a", "b", "c"]])
Expressions are the heart of Polars. Once you internalize the pattern of building expressions and passing them to contexts, you’ll find the API remarkably consistent. Whether you’re selecting columns, filtering rows, aggregating groups, or transforming data, the same expression syntax applies everywhere.