How to Sort a DataFrame in Polars

Key Insights

  • Polars sorting is significantly faster than pandas due to its Rust-based implementation and parallel execution, making it ideal for large datasets where sort operations become bottlenecks.
  • The sort() method accepts expressions, not just column names, allowing you to sort by computed values without creating intermediate columns.
  • For top-N queries, use top_k() instead of sorting the entire DataFrame—it’s algorithmically more efficient and can be orders of magnitude faster on large datasets.

Introduction

Sorting is one of the most common DataFrame operations, yet it’s also one where performance differences between libraries become painfully obvious. If you’ve ever waited minutes for pandas to sort a multi-million row DataFrame, you’ll appreciate what Polars brings to the table.

Polars implements sorting in Rust with parallel execution by default. On a typical 8-core machine sorting 10 million rows, you’ll see 5-10x speedups compared to pandas. But speed isn’t the only advantage—Polars also offers a more expressive API that lets you sort by computed expressions, handle nulls explicitly, and integrate sorting into lazy query plans for automatic optimization.

This article covers everything you need to know about sorting DataFrames in Polars, from basic operations to performance-critical patterns.

Basic Single-Column Sorting

The simplest sorting operation uses the sort() method with a column name. By default, Polars sorts in ascending order.

import polars as pl

# Create sample data
df = pl.DataFrame({
    "product": ["Widget", "Gadget", "Sprocket", "Gizmo", "Doohickey"],
    "price": [29.99, 149.99, 9.99, 79.99, 19.99],
    "quantity": [100, 25, 500, 75, 200]
})

# Sort by price (ascending by default)
sorted_df = df.sort("price")
print(sorted_df)

Output:

shape: (5, 3)
┌───────────┬────────┬──────────┐
│ product   ┆ price  ┆ quantity │
│ ---       ┆ ---    ┆ ---      │
│ str       ┆ f64    ┆ i64      │
╞═══════════╪════════╪══════════╡
│ Sprocket  ┆ 9.99   ┆ 500      │
│ Doohickey ┆ 19.99  ┆ 200      │
│ Widget    ┆ 29.99  ┆ 100      │
│ Gizmo     ┆ 79.99  ┆ 75       │
│ Gadget    ┆ 149.99 ┆ 25       │
└───────────┴────────┴──────────┘

The sort() method returns a new DataFrame—Polars DataFrames are immutable by design. This immutability enables safe parallel operations and prevents the subtle bugs that come from in-place modifications.

Descending Order and Multiple Columns

Real-world sorting often requires descending order or multi-column sort keys. Polars handles both with the descending parameter.

# Sort by price in descending order
expensive_first = df.sort("price", descending=True)
print(expensive_first)

For multi-column sorting, pass a list of column names. The sort priority follows the list order—the first column is the primary sort key, the second breaks ties, and so on.

# Create data with ties
df_with_ties = pl.DataFrame({
    "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Electronics"],
    "product": ["Phone", "Laptop", "Shirt", "Pants", "Tablet"],
    "price": [699.99, 999.99, 29.99, 49.99, 399.99]
})

# Sort by category (ascending), then by price (descending)
sorted_multi = df_with_ties.sort(
    ["category", "price"], 
    descending=[False, True]
)
print(sorted_multi)

Output:

shape: (5, 3)
┌─────────────┬─────────┬────────┐
│ category    ┆ product ┆ price  │
│ ---         ┆ ---     ┆ ---    │
│ str         ┆ str     ┆ f64    │
╞═════════════╪═════════╪════════╡
│ Clothing    ┆ Pants   ┆ 49.99  │
│ Clothing    ┆ Shirt   ┆ 29.99  │
│ Electronics ┆ Laptop  ┆ 999.99 │
│ Electronics ┆ Phone   ┆ 699.99 │
│ Electronics ┆ Tablet  ┆ 399.99 │
└─────────────┴─────────┴────────┘

Notice that the descending parameter accepts either a single boolean (applied to all columns) or a list matching the column order. This flexibility lets you mix ascending and descending sorts without awkward workarounds.

Sorting with Expressions

Here’s where Polars really shines. Instead of creating temporary columns for computed sort keys, you can pass expressions directly to sort().

# Sort by total value (price * quantity) without creating a new column
df_sorted_by_value = df.sort(pl.col("price") * pl.col("quantity"))
print(df_sorted_by_value)

Output:

shape: (5, 3)
┌───────────┬────────┬──────────┐
│ product   ┆ price  ┆ quantity │
│ ---       ┆ ---    ┆ ---      │
│ str       ┆ f64    ┆ i64      │
╞═══════════╪════════╪══════════╡
│ Widget    ┆ 29.99  ┆ 100      │
│ Gadget    ┆ 149.99 ┆ 25       │
│ Doohickey ┆ 19.99  ┆ 200      │
│ Sprocket  ┆ 9.99   ┆ 500      │
│ Gizmo     ┆ 79.99  ┆ 75       │
└───────────┴────────┴──────────┘

This pattern extends to any valid Polars expression. You can sort by string lengths, date components, or complex conditional logic:

# Sort by product name length
df.sort(pl.col("product").str.len_chars())

# Sort by absolute difference from a target price
target_price = 50.0
df.sort((pl.col("price") - target_price).abs())

# Sort with conditional logic: prioritize items in stock
df_inventory = pl.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [10, 20, 15, 25],
    "in_stock": [True, False, True, False]
})

# In-stock items first, then by price
df_inventory.sort(
    [pl.col("in_stock").not_(), pl.col("price")],
    descending=[False, False]
)

The expression-based sorting eliminates the need for temporary columns and makes your intent clearer in the code.

Handling Null Values

Null handling in sorting is often overlooked until it causes problems. In Polars, null placement is controlled by the nulls_last parameter, which defaults to False: nulls are placed first, regardless of sort direction. Pass nulls_last=True to push them to the end.

df_with_nulls = pl.DataFrame({
    "product": ["Widget", "Gadget", "Sprocket", "Gizmo"],
    "rating": [4.5, None, 3.8, None]
})

# Default behavior: nulls first
print("Default (nulls first):")
print(df_with_nulls.sort("rating"))

# Push nulls to the end
print("\nNulls last:")
print(df_with_nulls.sort("rating", nulls_last=True))

# Descending with nulls at the end
print("\nDescending with nulls last:")
print(df_with_nulls.sort("rating", descending=True, nulls_last=True))

Output:

Default (nulls first):
shape: (4, 2)
┌──────────┬────────┐
│ product  ┆ rating │
│ ---      ┆ ---    │
│ str      ┆ f64    │
╞══════════╪════════╡
│ Gadget   ┆ null   │
│ Gizmo    ┆ null   │
│ Sprocket ┆ 3.8    │
│ Widget   ┆ 4.5    │
└──────────┴────────┘

Nulls last:
shape: (4, 2)
┌──────────┬────────┐
│ product  ┆ rating │
│ ---      ┆ ---    │
│ str      ┆ f64    │
╞══════════╪════════╡
│ Sprocket ┆ 3.8    │
│ Widget   ┆ 4.5    │
│ Gadget   ┆ null   │
│ Gizmo    ┆ null   │
└──────────┴────────┘

Explicit null handling is crucial for data pipelines where null placement affects downstream operations like window functions or joins.

Sorting in Lazy Mode

Polars’ lazy evaluation mode is where performance optimizations really kick in. When you sort in lazy mode, Polars can optimize the query plan—for example, pushing filters before sorts to reduce the data volume being sorted.

# Create a lazy frame
lf = pl.DataFrame({
    "id": range(1_000_000),
    "value": [i % 100 for i in range(1_000_000)],
    "category": ["A", "B", "C", "D"] * 250_000
}).lazy()

# Build a query with filtering and sorting
result = (
    lf
    .filter(pl.col("category") == "A")
    .sort("value", descending=True)
    .head(10)
    .collect()  # Execute the query
)
print(result)

The lazy API also enables you to inspect the query plan:

# See the optimized query plan
query = (
    lf
    .filter(pl.col("value") > 50)
    .sort("value")
    .head(100)
)
print(query.explain())

In this plan, Polars will filter first, then sort only the filtered rows. This automatic optimization can dramatically reduce execution time on large datasets.

Performance Tips and Conclusion

Let’s talk about when sorting might not be the right choice. If you only need the top N rows, top_k() is significantly faster than sorting the entire DataFrame:

import time

# Create a large DataFrame
large_df = pl.DataFrame({
    "value": range(10_000_000)
}).with_columns(pl.col("value").shuffle())

# Approach 1: Sort and take head (slower)
start = time.time()
result1 = large_df.sort("value", descending=True).head(5)
sort_time = time.time() - start

# Approach 2: Use top_k (faster)
start = time.time()
result2 = large_df.select(pl.col("value").top_k(5))
topk_time = time.time() - start

print(f"Sort + head: {sort_time:.3f}s")
print(f"top_k: {topk_time:.3f}s")
print(f"Speedup: {sort_time/topk_time:.1f}x")

On my machine, top_k() is roughly 10x faster for this operation because it uses a partial sort algorithm that doesn’t need to order the entire dataset.

Additional performance considerations:

  1. Sort stability: Polars sort is not stable by default (equal elements may not preserve their original order). Use maintain_order=True if you need stability, but expect a performance cost.

  2. Pre-sorted data: Polars tracks a per-column sorted flag (set after a sort, or manually via set_sorted()) and can take fast paths when it knows a column is already ordered.

  3. Memory usage: Sorting requires additional memory for intermediate results. For extremely large datasets, consider sorting in lazy mode with streaming enabled.

  4. Parallel execution: Polars automatically parallelizes sorting across available cores. Ensure you’re not limiting this with POLARS_MAX_THREADS.

Sorting in Polars is fast, expressive, and integrates seamlessly with the rest of the API. Whether you’re doing simple column sorts or complex expression-based ordering, the patterns covered here will handle the vast majority of real-world use cases. Start with the basics, leverage expressions for computed sort keys, and reach for top_k() when you only need partial results.
