PySpark - Sort DataFrame by Multiple Columns
Key Insights
- PySpark offers orderBy() for global sorting and sortWithinPartitions() for partition-level sorting; choose based on whether you need total ordering or just organized partitions
- Combine asc() and desc() functions to apply different sort orders across columns, and use null positioning methods like asc_nulls_last() to control where null values appear in your sorted results
- Multi-column sorting triggers a shuffle operation in PySpark, making it expensive on large datasets; consider partitioning strategies and whether you truly need global ordering before applying sorts
Introduction
Sorting DataFrames by multiple columns is a fundamental operation in PySpark that you’ll use constantly for data analysis, reporting, and preparation workflows. Whether you’re ranking sales performance by region and revenue, organizing log data by timestamp and severity, or preparing data for window functions, understanding how to control sort order across multiple dimensions is essential.
Unlike single-column sorting, multi-column sorts let you establish hierarchical ordering—think of it like sorting a spreadsheet where you first group by one column, then break ties using subsequent columns. PySpark provides flexible APIs to handle ascending and descending orders, null value positioning, and performance trade-offs between global and partition-level sorting.
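The spreadsheet analogy maps directly onto Python's own tuple-key sorting, which can make the tie-breaking rule concrete before we touch Spark. A minimal plain-Python sketch:

```python
# Multi-column sorting is tuple-key sorting: rows are compared on the first
# key, and only a tie falls through to the next key.
rows = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Sales", 75000),
    ("Charlie", "Engineering", 88000),
]

# Primary key: department (index 1); tie-breaker: salary (index 2).
ordered = sorted(rows, key=lambda r: (r[1], r[2]))

for name, dept, salary in ordered:
    print(name, dept, salary)
# The two Engineering rows come first, ordered between themselves by salary.
```

PySpark's multi-column orderBy() applies exactly this comparison, just distributed across partitions.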
Basic Single Column Sorting (Quick Recap)
Before diving into multi-column scenarios, let’s quickly review single-column sorting. PySpark provides two primary methods: orderBy() and sort(). They’re functionally identical—use whichever reads better to you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("SortExample").getOrCreate()
# Create sample employee data
data = [
("Alice", "Engineering", 95000),
("Bob", "Sales", 75000),
("Charlie", "Engineering", 88000),
("Diana", "Marketing", 82000),
("Eve", "Sales", 79000)
]
df = spark.createDataFrame(data, ["name", "department", "salary"])
# Sort by salary - both methods are equivalent
df.orderBy("salary").show()
df.sort("salary").show()
# Descending order
df.orderBy(col("salary").desc()).show()
The default behavior is ascending order. For descending, you need to explicitly use the desc() function on the column.
Sorting by Multiple Columns - Ascending Order
Multi-column sorting in PySpark follows a left-to-right priority. The first column specified is the primary sort key, the second breaks ties in the first, and so on.
# Sort by department first, then by salary within each department
df.orderBy("department", "salary").show()
# Alternative syntax using col()
df.orderBy(col("department"), col("salary")).show()
Output:
+-------+------------+------+
| name| department|salary|
+-------+------------+------+
|Charlie| Engineering| 88000|
| Alice| Engineering| 95000|
| Diana| Marketing| 82000|
| Bob| Sales| 75000|
| Eve| Sales| 79000|
+-------+------------+------+
Notice how employees are first grouped by department (alphabetically), then within each department, they’re sorted by salary from lowest to highest. This hierarchical ordering is exactly what you need for most reporting scenarios.
Mixed Sort Orders (Ascending and Descending)
Real-world requirements often demand different sort directions for different columns. You might want departments in alphabetical order but salaries from highest to lowest within each department.
from pyspark.sql.functions import asc, desc
# Department ascending, salary descending
df.orderBy(asc("department"), desc("salary")).show()
# Alternative using col() - more explicit
df.orderBy(col("department").asc(), col("salary").desc()).show()
Output:
+-------+------------+------+
| name| department|salary|
+-------+------------+------+
| Alice| Engineering| 95000|
|Charlie| Engineering| 88000|
| Diana| Marketing| 82000|
| Eve| Sales| 79000|
| Bob| Sales| 75000|
+-------+------------+------+
Now Alice appears before Charlie because we’ve reversed the salary order within each department. This pattern is perfect for ranking scenarios where you want to see top performers first.
For complex expressions, use F.col() with transformations:
from pyspark.sql import functions as F
# Sort by department length (descending), then salary (ascending)
df.orderBy(
F.length(F.col("department")).desc(),
F.col("salary").asc()
).show()
This flexibility lets you sort by derived values without creating temporary columns.
Handling Null Values in Multi-Column Sorts
Null handling becomes critical when working with real-world data. PySpark provides fine-grained control over where nulls appear in your sorted output.
# Create DataFrame with null values
data_with_nulls = [
("Alice", "Engineering", 95000),
("Bob", None, 75000),
("Charlie", "Engineering", None),
("Diana", "Marketing", 82000),
("Eve", None, None)
]
df_nulls = spark.createDataFrame(data_with_nulls, ["name", "department", "salary"])
# Default behavior - nulls appear first in ascending sorts, last in descending
df_nulls.orderBy("department", "salary").show()
# Explicit null positioning
df_nulls.orderBy(
col("department").asc_nulls_first(),
col("salary").asc_nulls_last()
).show()
# Another example: nulls first for both columns
df_nulls.orderBy(
col("department").asc_nulls_first(),
col("salary").desc_nulls_first()
).show()
The four null positioning methods are:
- asc_nulls_first(): Ascending with nulls at the beginning
- asc_nulls_last(): Ascending with nulls at the end
- desc_nulls_first(): Descending with nulls at the beginning
- desc_nulls_last(): Descending with nulls at the end
By default, Spark places nulls first in ascending sorts and last in descending sorts, but don't rely on defaults; be explicit when null positioning matters for your use case.
Performance Considerations
Sorting in PySpark is expensive because it requires a full shuffle—data must be redistributed across partitions to achieve global ordering. Understanding the performance implications helps you make informed decisions.
Global Ordering vs Partition-Level Sorting
orderBy() produces a globally sorted DataFrame, meaning the entire dataset follows your specified order. sortWithinPartitions() only sorts within each partition, which is much faster but doesn’t guarantee global order.
# Global ordering - expensive, full shuffle
df_global = df.orderBy("department", "salary")
# Partition-level sorting - faster, no global guarantee
df_partitioned = df.repartition("department").sortWithinPartitions("salary")
# Compare execution plans
df_global.explain()
df_partitioned.explain()
When examining the execution plans, you'll see that orderBy() includes an Exchange operation (the shuffle) for the sort itself, while sortWithinPartitions() adds no exchange of its own. Note that the repartition("department") call in the example does introduce a shuffle, but it's a one-time redistribution that subsequent partition-wise operations can reuse.
When to use each approach:
Use orderBy() when:
- You need guaranteed global ordering (e.g., taking the top N records overall)
- You’re working with small to medium datasets
- The sorted result is your final output
Use sortWithinPartitions() when:
- You’re preparing data for partition-wise operations (window functions, aggregations)
- Global ordering isn’t required
- You’re processing large datasets and want to avoid shuffle overhead
Best practices for large datasets:
# Bad: Sorting huge dataset without partitioning strategy
large_df.orderBy("date", "user_id") # Expensive shuffle
# Better: Partition by common group, then sort within partitions
large_df.repartition("date").sortWithinPartitions("user_id")
# Best: Combine with filtering to reduce data volume first
large_df.filter(col("date") >= "2024-01-01") \
.repartition("date") \
.sortWithinPartitions("user_id")
If you must use orderBy() on large datasets, consider:
- Filtering data first to reduce volume
- Increasing shuffle partitions: spark.conf.set("spark.sql.shuffle.partitions", "400")
- Using coalesce() after sorting if you need fewer partitions downstream
Conclusion
Multi-column sorting in PySpark is straightforward once you understand the hierarchy of sort keys and the available control mechanisms. Here’s your quick reference:
Sorting Methods:
- orderBy() / sort(): Global ordering, full shuffle
- sortWithinPartitions(): Partition-level ordering, no shuffle
Sort Direction:
- asc() / desc(): Control ascending/descending per column
- Default is ascending if not specified
Null Handling:
- asc_nulls_first() / asc_nulls_last()
- desc_nulls_first() / desc_nulls_last()
Performance Strategy:
- Use sortWithinPartitions() when global ordering isn't required
- Partition data strategically before sorting
- Filter before sorting to reduce data volume
- Monitor execution plans to understand shuffle behavior
The key is matching your sorting approach to your actual requirements. Don’t pay the cost of global ordering if partition-level sorting suffices. Be explicit about null handling when it matters. And always consider the data volume and cluster resources when choosing between sorting strategies.