How to Sort by Multiple Columns in Polars
Key Insights
- Polars’ sort() method accepts a list of column names and a corresponding list of booleans for descending to control sort direction per column independently
- Use nulls_last=True (or a list of booleans) to push null values to the end of sorted results, preventing them from disrupting your data ordering
- Lazy mode sorting with lazy().sort().collect() enables Polars to optimize query execution and reduce memory usage on large datasets
Introduction
Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a focus on parallel execution, it routinely outperforms pandas by 10-100x on common operations. But raw speed means nothing if you can’t express the transformations you need.
Multi-column sorting is one of those bread-and-butter operations that comes up constantly: sort sales data by region then date, rank products by category then revenue, order log entries by severity then timestamp. Polars handles this elegantly, but the API has a few nuances worth understanding.
This guide covers everything from basic multi-column sorts to expression-based sorting and lazy evaluation. By the end, you’ll know exactly which approach to use for any sorting scenario.
Basic Multi-Column Sorting with sort()
The sort() method is your primary tool for ordering DataFrames. Pass a list of column names, and Polars sorts by the first column, then uses subsequent columns as tiebreakers.
import polars as pl
# Create a sample sales dataset
sales = pl.DataFrame({
"region": ["West", "East", "West", "East", "West", "East"],
"date": ["2024-01-15", "2024-01-15", "2024-01-10", "2024-01-10", "2024-01-15", "2024-01-10"],
"revenue": [1500, 2200, 1800, 1900, 1200, 2500]
})
# Sort by region first, then by date
sorted_sales = sales.sort(["region", "date"])
print(sorted_sales)
Output:
shape: (6, 3)
┌────────┬────────────┬─────────┐
│ region ┆ date ┆ revenue │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞════════╪════════════╪═════════╡
│ East ┆ 2024-01-10 ┆ 1900 │
│ East ┆ 2024-01-10 ┆ 2500 │
│ East ┆ 2024-01-15 ┆ 2200 │
│ West ┆ 2024-01-10 ┆ 1800 │
│ West ┆ 2024-01-15 ┆ 1500 │
│ West ┆ 2024-01-15 ┆ 1200 │
└────────┴────────────┴─────────┘
By default, all columns sort in ascending order. The “East” region appears before “West” alphabetically, and within each region, earlier dates come first.
Notice that rows with identical region and date values (like the two “East” rows on 2024-01-10) appear here in their original relative order. Be aware, though, that Polars does not guarantee a stable sort by default: pass maintain_order=True when you need deterministic ordering of tied rows.
Controlling Sort Direction Per Column
Real-world sorting rarely uses ascending order for every column. You typically want to see the most recent dates first, the highest revenues at the top, or some combination thereof. The descending parameter accepts a list of booleans matching your column list.
# Create a product performance dataset
products = pl.DataFrame({
"category": ["Electronics", "Electronics", "Clothing", "Clothing", "Electronics", "Clothing"],
"product": ["Laptop", "Phone", "Jacket", "Shirt", "Tablet", "Pants"],
"revenue": [45000, 38000, 12000, 8000, 22000, 15000]
})
# Sort by category (ascending) and revenue (descending)
# This shows top performers within each category
top_performers = products.sort(
["category", "revenue"],
descending=[False, True]
)
print(top_performers)
Output:
shape: (6, 3)
┌─────────────┬─────────┬─────────┐
│ category ┆ product ┆ revenue │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════════╪═════════╪═════════╡
│ Clothing ┆ Pants ┆ 15000 │
│ Clothing ┆ Jacket ┆ 12000 │
│ Clothing ┆ Shirt ┆ 8000 │
│ Electronics ┆ Laptop ┆ 45000 │
│ Electronics ┆ Phone ┆ 38000 │
│ Electronics ┆ Tablet ┆ 22000 │
└─────────────┴─────────┴─────────┘
The descending list must match the length of your column list. Each boolean corresponds positionally to its column: False for category means ascending (A-Z), True for revenue means descending (highest first).
You can also pass a single boolean to apply the same direction to all columns:
# Sort all columns descending
all_desc = products.sort(["category", "revenue"], descending=True)
Handling Null Values in Multi-Column Sorts
Null values are inevitable in real datasets. By default (nulls_last=False), Polars places nulls at the beginning of the sorted result. The nulls_last parameter gives you explicit control.
# Dataset with missing values
inventory = pl.DataFrame({
"warehouse": ["NYC", "LA", "NYC", "LA", "Chicago", "Chicago"],
"product": ["Widget", "Widget", "Gadget", None, "Widget", None],
"stock": [100, None, 50, 200, None, 75]
})
# Default behavior: nulls appear first in ascending sort
default_sort = inventory.sort(["product", "stock"])
print("Default null handling:")
print(default_sort)
# Push nulls to the end for both columns
nulls_end = inventory.sort(
["product", "stock"],
nulls_last=True
)
print("\nNulls pushed to end:")
print(nulls_end)
Output:
Default null handling:
shape: (6, 3)
┌───────────┬─────────┬───────┐
│ warehouse ┆ product ┆ stock │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════╪═════════╪═══════╡
│ Chicago ┆ null ┆ 75 │
│ LA ┆ null ┆ 200 │
│ NYC ┆ Gadget ┆ 50 │
│ LA ┆ Widget ┆ null │
│ Chicago ┆ Widget ┆ null │
│ NYC ┆ Widget ┆ 100 │
└───────────┴─────────┴───────┘
Nulls pushed to end:
shape: (6, 3)
┌───────────┬─────────┬───────┐
│ warehouse ┆ product ┆ stock │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════╪═════════╪═══════╡
│ NYC ┆ Gadget ┆ 50 │
│ NYC ┆ Widget ┆ 100 │
│ LA ┆ Widget ┆ null │
│ Chicago ┆ Widget ┆ null │
│ Chicago ┆ null ┆ 75 │
│ LA ┆ null ┆ 200 │
└───────────┴─────────┴───────┘
Like descending, you can pass a list of booleans to nulls_last for per-column control:
# Nulls last for product, nulls first for stock
mixed_nulls = inventory.sort(
["product", "stock"],
nulls_last=[True, False]
)
Sorting with Expressions
Sometimes you need to sort by computed values rather than raw column data. Polars expressions let you define complex sort keys inline: sort() accepts expressions directly in place of column names, and the related Expr.sort_by() method applies the same idea inside a select() or with_columns() context.
# Dataset where we want to sort by derived values
users = pl.DataFrame({
"username": ["alice_smith", "bob", "charlie_brown", "dan", "eve_wilson"],
"signup_date": ["2024-01-15", "2024-01-10", "2024-01-15", "2024-01-10", "2024-01-12"],
"posts": [45, 120, 30, 85, 60]
})
# Sort by username length (ascending), then by posts (descending)
sorted_users = users.sort(
pl.col("username").str.len_chars(),
pl.col("posts"),
descending=[False, True]
)
print(sorted_users)
Output:
shape: (5, 3)
┌───────────────┬─────────────┬───────┐
│ username ┆ signup_date ┆ posts │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════════╪═════════════╪═══════╡
│ bob ┆ 2024-01-10 ┆ 120 │
│ dan ┆ 2024-01-10 ┆ 85 │
│ eve_wilson ┆ 2024-01-12 ┆ 60 │
│ alice_smith ┆ 2024-01-15 ┆ 45 │
│ charlie_brown ┆ 2024-01-15 ┆ 30 │
└───────────────┴─────────────┴───────┘
Expression-based sorting opens up powerful possibilities: sort by absolute values, by date components, by string patterns, or by any transformation Polars supports. The sort happens on the computed values, but the original data remains intact.
Multi-Column Sorting in Lazy Mode
For large datasets, lazy evaluation is essential. Polars’ lazy API builds a query plan that gets optimized before execution. Sorting in lazy mode allows the query optimizer to push predicates, eliminate unnecessary columns, and parallelize effectively.
# Simulate a larger dataset
import random
large_sales = pl.DataFrame({
"region": random.choices(["North", "South", "East", "West"], k=100000),
"quarter": random.choices(["Q1", "Q2", "Q3", "Q4"], k=100000),
"revenue": [random.randint(1000, 50000) for _ in range(100000)]
})
# Lazy mode: build query plan, then execute
result = (
large_sales
.lazy()
.filter(pl.col("revenue") > 10000)
.sort(
["region", "quarter", "revenue"],
descending=[False, False, True]
)
.head(1000)
.collect()
)
print(f"Result shape: {result.shape}")
print(result.head(10))
The key insight here: Polars optimizes the entire query plan. The filter reduces data volume before the sort, and because the sort is immediately followed by head(1000), the optimizer can execute it as a top-k selection rather than fully sorting and materializing the filtered dataset. In eager mode, you’d sort the entire filtered dataset before taking the head.
You can inspect the query plan to see optimizations:
query = (
large_sales
.lazy()
.filter(pl.col("revenue") > 10000)
.sort(["region", "revenue"], descending=[False, True])
.head(100)
)
print(query.explain())
Performance Considerations
Polars sorting is fast for several reasons worth understanding:
Parallel execution: Polars automatically parallelizes sort operations across available CPU cores. A multi-column sort on millions of rows will saturate your machine without any configuration.
Memory efficiency: Unlike pandas, which often creates multiple intermediate copies during complex operations, Polars uses Apache Arrow’s memory format and minimizes allocations. For large datasets, this difference is dramatic.
When to use lazy mode: If your workflow involves filtering, selecting columns, or other transformations before sorting, lazy mode lets Polars optimize the entire pipeline. The rule of thumb: use lazy mode when your data exceeds a few hundred thousand rows or when you’re chaining multiple operations.
Comparison with pandas: A quick benchmark sorting 1 million rows by two columns typically shows Polars completing in 50-100ms versus pandas taking 500-1000ms. The gap widens with more columns and larger datasets.
# Time the Polars sort (run an equivalent pandas sort_values on your
# own data to compare)
import time
# Polars timing
start = time.perf_counter()
_ = large_sales.sort(["region", "quarter", "revenue"], descending=[False, False, True])
polars_time = time.perf_counter() - start
print(f"Polars sort: {polars_time:.3f}s")
For production workloads, always prefer lazy mode with explicit collect() calls. It gives Polars maximum flexibility to optimize your queries and keeps memory usage predictable.
Multi-column sorting in Polars is straightforward once you understand the parameter patterns. Use lists for column names, match them with lists for descending and nulls_last, and reach for expressions when you need computed sort keys. The performance will take care of itself.