How to Sort a DataFrame in PySpark

Key Insights

  • PySpark provides orderBy() and sort() as interchangeable methods for DataFrame sorting, but understanding their behavior with null values and multiple columns is essential for correct results.
  • Full DataFrame sorts trigger expensive shuffle operations across the cluster; use sortWithinPartitions() when you only need ordered data within each partition for downstream processing.
  • Always explicitly handle null positioning with asc_nulls_first(), desc_nulls_last(), and related variants; relying on implicit defaults obscures your intent and invites subtle ordering bugs.

Introduction

Sorting is one of the most common operations in data processing, yet it’s also one of the most expensive in distributed systems. When you sort a DataFrame in PySpark, you’re coordinating data movement across potentially hundreds of nodes, shuffling millions of records to produce a globally ordered result.

Understanding how PySpark handles sorting—and when to avoid full sorts entirely—can mean the difference between a pipeline that runs in minutes versus one that times out. This article covers the practical mechanics of DataFrame sorting, from basic single-column operations to performance-optimized partition-level sorting.

Basic Sorting with orderBy() and sort()

PySpark provides two primary methods for sorting DataFrames: orderBy() and sort(). These methods are functionally identical—sort() is simply an alias for orderBy(). Use whichever reads better in your codebase, but pick one and stay consistent.

Here’s a basic example with sample employee data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SortingDemo").getOrCreate()

# Create sample data
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Sales", 72000),
    ("Charlie", "Engineering", 110000),
    ("Diana", "Marketing", 68000),
    ("Eve", "Sales", 85000),
]

df = spark.createDataFrame(data, ["name", "department", "salary"])

# Ascending sort (default)
df.orderBy("salary").show()

# Descending sort using col()
df.orderBy(col("salary").desc()).show()

# Descending sort using string syntax
df.sort("salary", ascending=False).show()

The ascending parameter accepts a boolean for single-column sorts or a list of booleans for multi-column sorts. However, I recommend using the col().desc() syntax—it’s more explicit and scales better when you add complexity.

# Both produce the same result, but the second is clearer
df.orderBy("salary", ascending=False)
df.orderBy(col("salary").desc())

Multi-Column Sorting

Real-world sorting typically involves multiple columns. You might want employees sorted by department alphabetically, then by salary within each department in descending order to see top earners first.

from pyspark.sql.functions import col, asc, desc

# Sort by department ascending, then salary descending
df.orderBy(
    col("department").asc(),
    col("salary").desc()
).show()

# Alternative syntax using asc() and desc() functions
df.orderBy(
    asc("department"),
    desc("salary")
).show()

# Using the ascending parameter with a list (less readable)
df.orderBy(["department", "salary"], ascending=[True, False]).show()

Output:

+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|Charlie|Engineering|110000|
|  Alice|Engineering| 95000|
|  Diana|  Marketing| 68000|
|    Eve|      Sales| 85000|
|    Bob|      Sales| 72000|
+-------+-----------+------+

The column order in your orderBy() call matters. The first column is the primary sort key, the second breaks ties in the first, and so on. This seems obvious, but I’ve seen production bugs where developers listed columns in the wrong order and got subtly incorrect results.

Handling Null Values

Null handling is where sorting gets tricky. By default, PySpark places nulls first for ascending sorts and last for descending sorts. Relying on that default hides your intent from readers and invites subtle bugs when code is reused under different conventions, so always be explicit.

# Data with null values
data_with_nulls = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Sales", None),
    ("Charlie", None, 110000),
    ("Diana", "Marketing", 68000),
    ("Eve", "Sales", 85000),
]

df_nulls = spark.createDataFrame(data_with_nulls, ["name", "department", "salary"])

# Explicit null positioning
from pyspark.sql.functions import col

# Nulls first in ascending order
df_nulls.orderBy(col("salary").asc_nulls_first()).show()

# Nulls last in ascending order
df_nulls.orderBy(col("salary").asc_nulls_last()).show()

# Nulls first in descending order
df_nulls.orderBy(col("salary").desc_nulls_first()).show()

# Nulls last in descending order
df_nulls.orderBy(col("salary").desc_nulls_last()).show()

Here’s what asc_nulls_first() produces:

+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|    Bob|      Sales|  null|
|  Diana|  Marketing| 68000|
|    Eve|      Sales| 85000|
|  Alice|Engineering| 95000|
|Charlie|       null|110000|
+-------+-----------+------+

For multi-column sorts with nulls, specify null handling for each column:

df_nulls.orderBy(
    col("department").asc_nulls_last(),
    col("salary").desc_nulls_first()
).show()

Sorting with sortWithinPartitions()

Here’s where understanding distributed systems pays off. orderBy() performs a global sort, which requires shuffling data across all partitions. This is expensive. If you only need data sorted within each partition—for example, before writing to partitioned Parquet files or applying partition-level window functions—use sortWithinPartitions().

# Create a larger dataset to demonstrate partitioning
from pyspark.sql.functions import spark_partition_id

large_data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Sales", 72000),
    ("Charlie", "Engineering", 110000),
    ("Diana", "Marketing", 68000),
    ("Eve", "Sales", 85000),
    ("Frank", "Engineering", 88000),
    ("Grace", "Marketing", 92000),
    ("Henry", "Sales", 78000),
]

df_large = spark.createDataFrame(large_data, ["name", "department", "salary"])

# Repartition by department
df_partitioned = df_large.repartition(3, "department")

# Global sort - shuffles data across partitions
df_global = df_partitioned.orderBy("salary")

# Partition-local sort - no shuffle, sorts within each partition
df_local = df_partitioned.sortWithinPartitions("salary")

# Show partition IDs to see the difference
df_global.withColumn("partition", spark_partition_id()).show()
df_local.withColumn("partition", spark_partition_id()).show()

With orderBy(), records move between partitions to achieve global ordering. With sortWithinPartitions(), each partition is sorted independently—records stay on their original nodes.

Use sortWithinPartitions() when:

  • Writing partitioned output files where each partition should be internally sorted
  • Preparing data for partition-level aggregations
  • Sorting before a mapPartitions() operation that assumes ordered input

Use orderBy() when:

  • You need a globally ordered result (e.g., top-N queries)
  • The downstream consumer expects total ordering
  • You’re collecting results to the driver

Performance Considerations

Sorting is a shuffle operation, and shuffles are the primary performance bottleneck in Spark jobs. Let’s examine what actually happens when you sort.

# Create a DataFrame and examine the execution plan
df = spark.createDataFrame(
    [("Alice", 95000), ("Bob", 72000), ("Charlie", 110000)],
    ["name", "salary"]
)

# Show the physical execution plan
df.orderBy("salary").explain()

This outputs something like:

== Physical Plan ==
*(1) Sort [salary#1L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(salary#1L ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#42]
   +- *(1) Scan ExistingRDD[name#0,salary#1L]

The Exchange operator indicates a shuffle. Data is redistributed across partitions using range partitioning to prepare for the sort. The number 200 is the default shuffle partition count (spark.sql.shuffle.partitions).

For large datasets, tune this parameter:

# Reduce shuffle partitions for smaller datasets
spark.conf.set("spark.sql.shuffle.partitions", 50)

# Or increase for very large datasets
spark.conf.set("spark.sql.shuffle.partitions", 500)

Compare the plans for orderBy() versus sortWithinPartitions():

# Global sort - includes Exchange (shuffle)
df.repartition(4).orderBy("salary").explain()

# Partition sort - no Exchange
df.repartition(4).sortWithinPartitions("salary").explain()

The partition-local sort eliminates the shuffle entirely, which can reduce job runtime by orders of magnitude on large datasets.

Other performance tips:

  1. Filter before sorting: Reduce the dataset size before expensive operations
  2. Cache if reusing: If you sort once and use the result multiple times, cache it
  3. Consider approximate methods: For top-N queries, limit() after orderBy() enables optimizations
  4. Watch for skew: If one sort key value dominates, you’ll get partition skew

# Efficient top-N pattern
df.orderBy(col("salary").desc()).limit(10).show()

# Spark optimizes this to avoid full sort when possible

Conclusion

PySpark sorting is straightforward for simple cases but requires careful thought for production workloads. Use orderBy() or sort() interchangeably for global sorting, but always specify null handling explicitly with asc_nulls_first() and related methods.

For performance-critical pipelines, question whether you actually need a global sort. If partition-level ordering suffices, sortWithinPartitions() eliminates shuffle overhead entirely. Use explain() to verify your assumptions about execution plans, and tune spark.sql.shuffle.partitions based on your data size.

The right sorting strategy depends on your downstream requirements. Global sorts are expensive but sometimes necessary. Partition sorts are cheap but provide weaker guarantees. Know the difference, and your Spark jobs will thank you.
