PySpark - OrderBy (Sort) DataFrame
Key Insights
- orderBy() and sort() are functionally identical in PySpark—both perform global sorting across all partitions, which can be expensive in distributed environments
- Multi-column sorting follows strict precedence: PySpark sorts by the first column, then breaks ties using subsequent columns in the order specified
- Use sortWithinPartitions() instead of global sorting when you only need ordered data within each partition—it's significantly faster and avoids the costly shuffle operation
Introduction to DataFrame Sorting in PySpark
Sorting is a fundamental operation in data analysis, whether you’re preparing reports, identifying top performers, or organizing data for downstream processing. In PySpark, you have two methods that accomplish the same task: orderBy() and sort(). Despite the different names, these methods are functionally equivalent—use whichever feels more natural to you.
Unlike sorting in pandas or single-machine environments, PySpark sorting operates across a distributed cluster. This means data from all partitions must be shuffled and reorganized, making it one of the more expensive operations you’ll perform. Understanding how to sort efficiently can dramatically impact your application’s performance.
Let’s create a sample dataset we’ll use throughout this article:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, asc, upper, when
spark = SparkSession.builder.appName("SortingExample").getOrCreate()
data = [
("John", "Engineering", 75000, None),
("Jane", "Marketing", 82000, "Senior"),
("Mike", "Engineering", 68000, "Junior"),
("Sarah", "Sales", 95000, "Senior"),
("Tom", "Engineering", 82000, "Mid"),
("Emily", "Marketing", None, "Junior"),
("David", "Sales", 71000, "Mid"),
("Lisa", "Engineering", 95000, "Senior")
]
df = spark.createDataFrame(data, ["name", "department", "salary", "level"])
df.show()
Basic Sorting Operations
The simplest sorting operation orders your DataFrame by a single column in ascending order. By default, both orderBy() and sort() arrange data from smallest to largest (or A to Z for strings):
# Sort by salary in ascending order (default)
df.orderBy("salary").show()
# Identical result using sort()
df.sort("salary").show()
# Explicit ascending order using col() and asc()
df.orderBy(col("salary").asc()).show()
To reverse the order, use the desc() function:
from pyspark.sql.functions import desc
# Sort by salary in descending order
df.orderBy(desc("salary")).show()
# Alternative syntax using col()
df.orderBy(col("salary").desc()).show()
Both orderBy() and sort() accept column names as strings or Column objects. The Column object approach (using col()) provides more flexibility for complex expressions, while string names offer cleaner syntax for simple cases.
Multi-Column Sorting
Real-world scenarios often require sorting by multiple columns. PySpark processes these in order: it sorts by the first column, then uses subsequent columns to break ties.
# Sort by department (ascending), then salary (descending)
df.orderBy("department", desc("salary")).show()
# Sort by multiple columns with different orders
df.orderBy(
col("department").asc(),
col("salary").desc(),
col("name").asc()
).show()
The precedence matters significantly. Consider these two different sorting operations:
# Salary first, then department
df.orderBy(desc("salary"), "department").show()
# Department first, then salary
df.orderBy("department", desc("salary")).show()
The first groups all high salaries together regardless of department, while the second creates department-based groups with salaries sorted within each group. Choose your column order based on your analytical needs.
You can mix string and Column notation in the same operation:
# Mixed notation - perfectly valid
df.orderBy("department", col("salary").desc(), "name").show()
Advanced Sorting Techniques
Handling Null Values
By default, PySpark places null values first when sorting in ascending order and last when sorting in descending order. You can override this behavior:
from pyspark.sql.functions import asc_nulls_first, asc_nulls_last
from pyspark.sql.functions import desc_nulls_first, desc_nulls_last
# Nulls first with ascending order
df.orderBy(asc_nulls_first("salary")).show()
# Nulls last with descending order
df.orderBy(desc_nulls_last("salary")).show()
Sorting on Computed Columns
You don’t need to create new columns to sort by transformed values. You can sort directly on expressions:
# Sort by salary bonus (salary * 1.1)
df.orderBy((col("salary") * 1.1).desc()).show()
# Case-insensitive sorting by name
df.orderBy(upper("name")).show()
# Sort by conditional logic
df.orderBy(
when(col("department") == "Engineering", 1)
.when(col("department") == "Sales", 2)
.otherwise(3)
).show()
Sorting on expressions keeps your pipeline cleaner than adding a throwaway column just to sort by it, and avoids the extra withColumn/drop steps in your code.
Complex String Sorting
For case-insensitive or locale-specific sorting:
# Case-insensitive department sorting
df.orderBy(upper("department"), desc("salary")).show()
# You can also use lower() for the same effect
from pyspark.sql.functions import lower
df.orderBy(lower("name")).show()
Performance Considerations
Sorting in PySpark triggers a full shuffle—data from all partitions gets redistributed across the cluster. This is expensive. Here’s what you need to know:
Global vs. Partition-Level Sorting
If you only need data sorted within each partition (not globally), use sortWithinPartitions(). This avoids the shuffle and runs much faster:
# Global sort - expensive, shuffles all data
df.orderBy("salary").show()
# Partition-level sort - much faster, no shuffle
df.sortWithinPartitions("salary").show()
Use sortWithinPartitions() when:
- You’re writing data to partitioned storage and want each file sorted
- You’re performing partition-wise operations that benefit from local ordering
- You don’t need global ordering across the entire dataset
Repartitioning Strategy
Sometimes repartitioning before sorting can improve performance, especially if your data is heavily skewed:
# Repartition by department before sorting
df.repartition("department").orderBy("department", desc("salary")).show()
# Control partition count
df.repartition(10).orderBy("salary").show()
However, repartitioning also causes a shuffle, so test both approaches with your data to see which performs better.
When to Avoid Sorting
Don’t sort unless you need to. Consider these alternatives:
- For top-N queries, consider sampling or approximate methods for very large datasets
- For grouped aggregations, you don’t need to sort first—groupBy() handles it
- For joins, PySpark optimizes internally—pre-sorting rarely helps
Common Use Cases and Best Practices
Finding Top-N Records
Combine sorting with limit() for top-N queries:
# Top 3 highest salaries
df.orderBy(desc("salary")).limit(3).show()
# Top 2 earners per department using window functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("department").orderBy(desc("salary"))
df.withColumn("rank", row_number().over(window_spec)) \
.filter(col("rank") <= 2) \
.drop("rank") \
.show()
Ranking with Window Functions
Window functions often require ordering. The sorting happens within the window specification:
from pyspark.sql.functions import rank, dense_rank
window_spec = Window.partitionBy("department").orderBy(desc("salary"))
df.withColumn("rank", rank().over(window_spec)) \
.withColumn("dense_rank", dense_rank().over(window_spec)) \
.orderBy("department", "rank") \
.show()
Chaining Operations
Sorting integrates cleanly with other DataFrame operations:
# Filter, sort, and select in one pipeline
result = df.filter(col("salary").isNotNull()) \
.orderBy("department", desc("salary")) \
.select("name", "department", "salary")
result.show()
# Complex pipeline with transformations
from pyspark.sql.functions import round
df.filter(col("department").isin(["Engineering", "Sales"])) \
.withColumn("salary_k", round(col("salary") / 1000, 1)) \
.orderBy(desc("salary_k"), "name") \
.show()
Best Practices Summary
- Minimize sorting operations: Each sort triggers a shuffle. Cache the result if you'll use it multiple times.
- Use sortWithinPartitions() when possible: If global ordering isn't required, this saves significant resources.
- Sort late in your pipeline: Filter and select columns before sorting to reduce the data volume being shuffled.
- Be explicit about null handling: Don't rely on default behavior—use asc_nulls_first() or desc_nulls_last() to make your intentions clear.
- Consider alternatives: For simple aggregations or top-N queries on massive datasets, explore approximate methods or sampling.
Sorting is powerful but expensive in distributed systems. Use it deliberately, understand its performance implications, and choose the right sorting method for your use case. When you need global ordering, orderBy() and sort() deliver consistent, reliable results. When partition-level ordering suffices, sortWithinPartitions() provides a faster alternative. Match your sorting strategy to your analytical needs, and your PySpark applications will run efficiently at scale.