PySpark - Window Functions (Row Number, Rank, Dense Rank)
Key Insights
- Window functions allow you to perform calculations across rows related to the current row without collapsing the result set, unlike GROUP BY aggregations that reduce rows.
- ROW_NUMBER assigns unique sequential integers regardless of ties, RANK creates gaps after ties, and DENSE_RANK maintains consecutive numbering without gaps—choose based on whether you need uniqueness or how you want ties handled.
- Proper partitioning strategy is critical for performance: partition by columns with moderate cardinality to distribute work evenly and avoid data skew that can bottleneck your Spark job.
Introduction to Window Functions in PySpark
Window functions in PySpark operate on a set of rows related to the current row, performing calculations without reducing the number of rows in your result set. This is fundamentally different from standard aggregations like groupBy(), which collapse multiple rows into summary rows.
Consider a scenario where you want to rank employees by salary within each department while still maintaining all employee records. A standard groupBy() would only give you aggregated department-level statistics. Window functions let you add that ranking information to each individual employee row.
The three ranking window functions—ROW_NUMBER, RANK, and DENSE_RANK—each assign integer values to rows based on ordering criteria, but they handle ties differently. Understanding these differences is crucial for choosing the right function for your use case.
Setting Up Window Specifications
Before applying window functions, you need to define a window specification that tells PySpark how to partition and order your data. The Window class provides partitionBy() and orderBy() methods for this purpose.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, col
# Initialize Spark session
spark = SparkSession.builder.appName("WindowFunctions").getOrCreate()
# Create sample employee dataset
data = [
("Alice", "Engineering", 95000),
("Bob", "Engineering", 87000),
("Charlie", "Engineering", 87000),
("David", "Sales", 72000),
("Eve", "Sales", 81000),
("Frank", "Sales", 81000),
("Grace", "Marketing", 68000),
("Henry", "Marketing", 75000)
]
df = spark.createDataFrame(data, ["name", "department", "salary"])
# Define window specification
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
The partitionBy() clause divides your data into separate groups, and calculations happen independently within each partition. The orderBy() clause determines the sequence for ranking. Without proper ordering, ranking functions produce non-deterministic results.
ROW_NUMBER - Sequential Numbering
ROW_NUMBER assigns a unique sequential integer to each row within a partition, starting from 1. Even when rows have identical values in the ordering column, ROW_NUMBER assigns different numbers. The order among tied rows is arbitrary unless you specify additional ordering columns.
# Apply ROW_NUMBER to rank employees by salary within departments
df_with_row_number = df.withColumn(
"row_num",
row_number().over(window_spec)
)
df_with_row_number.orderBy("department", "row_num").show()
Output:
+-------+------------+------+-------+
| name| department|salary|row_num|
+-------+------------+------+-------+
| Alice| Engineering| 95000| 1|
| Bob| Engineering| 87000| 2|
|Charlie| Engineering| 87000| 3|
| Henry| Marketing| 75000| 1|
| Grace| Marketing| 68000| 2|
| Eve| Sales| 81000| 1|
| Frank| Sales| 81000| 2|
| David| Sales| 72000| 3|
+-------+------------+------+-------+
ROW_NUMBER is ideal for deduplication scenarios. If you have duplicate records and need to keep only one, filter for row_num = 1. It’s also perfect for pagination—you can select rows 11-20 for page 2 of results. The guaranteed uniqueness within each partition makes it the go-to choice when you need distinct identifiers.
RANK - Ranking with Gaps
RANK assigns the same rank to rows with identical values in the ordering column, but it creates gaps in the sequence. If two rows tie for rank 2, the next row gets rank 4, skipping 3.
# Create dataset with student test scores
student_data = [
("Alice", "Class A", 95),
("Bob", "Class A", 87),
("Charlie", "Class A", 87),
("David", "Class A", 82),
("Eve", "Class B", 91),
("Frank", "Class B", 91),
("Grace", "Class B", 88),
("Henry", "Class B", 88)
]
students_df = spark.createDataFrame(student_data, ["name", "class", "score"])
window_spec_students = Window.partitionBy("class").orderBy(col("score").desc())
students_with_rank = students_df.withColumn(
"rank",
rank().over(window_spec_students)
)
students_with_rank.orderBy("class", "rank").show()
Output:
+-------+-------+-----+----+
| name| class|score|rank|
+-------+-------+-----+----+
| Alice|Class A| 95| 1|
| Bob|Class A| 87| 2|
|Charlie|Class A| 87| 2|
| David|Class A| 82| 4|
| Eve|Class B| 91| 1|
| Frank|Class B| 91| 1|
| Grace|Class B| 88| 3|
| Henry|Class B| 88| 3|
+-------+-------+-----+----+
Notice how Bob and Charlie both receive rank 2, and David gets rank 4. This gap reflects that David is actually the fourth-best performer when you count all students above him. RANK is appropriate when you want rankings that reflect actual position in a competition or leaderboard scenario.
DENSE_RANK - Ranking without Gaps
DENSE_RANK also assigns the same rank to tied rows, but it doesn’t create gaps. After a tie at rank 2, the next distinct value gets rank 3, maintaining consecutive numbering.
# Product revenue dataset
product_data = [
("Product A", "Electronics", 50000),
("Product B", "Electronics", 45000),
("Product C", "Electronics", 45000),
("Product D", "Electronics", 38000),
("Product E", "Clothing", 28000),
("Product F", "Clothing", 28000),
("Product G", "Clothing", 22000),
("Product H", "Clothing", 18000)
]
products_df = spark.createDataFrame(product_data, ["product", "category", "revenue"])
window_spec_products = Window.partitionBy("category").orderBy(col("revenue").desc())
products_with_dense_rank = products_df.withColumn(
"dense_rank",
dense_rank().over(window_spec_products)
)
products_with_dense_rank.orderBy("category", "dense_rank").show()
Output:
+---------+-----------+-------+----------+
| product| category|revenue|dense_rank|
+---------+-----------+-------+----------+
|Product A|Electronics| 50000| 1|
|Product B|Electronics| 45000| 2|
|Product C|Electronics| 45000| 2|
|Product D|Electronics| 38000| 3|
|Product E| Clothing| 28000| 1|
|Product F| Clothing| 28000| 1|
|Product G| Clothing| 22000| 2|
|Product H| Clothing| 18000| 3|
+---------+-----------+-------+----------+
DENSE_RANK works well when you need to know how many distinct performance levels exist. In the example above, there are exactly 3 distinct revenue levels in each category, which DENSE_RANK makes immediately clear.
Practical Comparison and Use Cases
Let’s apply all three functions to the same dataset to see their differences clearly:
# Apply all three ranking functions simultaneously
comparison_df = df.withColumn(
"row_number",
row_number().over(window_spec)
).withColumn(
"rank",
rank().over(window_spec)
).withColumn(
"dense_rank",
dense_rank().over(window_spec)
)
comparison_df.orderBy("department", col("salary").desc()).show()
Output:
+-------+------------+------+----------+----+----------+
| name| department|salary|row_number|rank|dense_rank|
+-------+------------+------+----------+----+----------+
| Alice| Engineering| 95000| 1| 1| 1|
| Bob| Engineering| 87000| 2| 2| 2|
|Charlie| Engineering| 87000| 3| 2| 2|
| Henry| Marketing| 75000| 1| 1| 1|
| Grace| Marketing| 68000| 2| 2| 2|
| Eve| Sales| 81000| 1| 1| 1|
| Frank| Sales| 81000| 2| 1| 1|
| David| Sales| 72000| 3| 3| 2|
+-------+------------+------+----------+----+----------+
Decision matrix for choosing the right function:
- Use ROW_NUMBER when you need unique identifiers within partitions, for deduplication, or when implementing pagination
- Use RANK when gaps matter—typically in competitive scenarios where you want to show true position (e.g., “4th place out of 10”)
- Use DENSE_RANK when you want consecutive numbering or need to count distinct levels of performance
Performance Considerations and Best Practices
Window functions trigger a shuffle operation in Spark, which can be expensive. Here are strategies to optimize performance:
Partition wisely: Choose partition columns with reasonable cardinality. Partitioning by a column with millions of unique values creates too many small partitions. Partitioning by a column with only two values creates data skew.
# Good: Reasonable cardinality
Window.partitionBy("department", "location")
# Problematic: Too few partitions, potential skew
Window.partitionBy("country") # if most data is in one country
# Problematic: Too many partitions
Window.partitionBy("employee_id") # unique per row
Limit data early: Apply filters before window functions to reduce the data volume being shuffled.
# Filter first, then apply window function
filtered_df = df.filter(col("year") == 2024)
result = filtered_df.withColumn("rank", rank().over(window_spec))
Reuse window specifications: If you’re applying multiple window functions with the same partitioning and ordering, define the window spec once.
Cache strategically: If you’re applying multiple window functions to the same base dataset, cache it after initial transformations but before window operations.
base_df = df.filter(col("active") == True).cache()
result1 = base_df.withColumn("row_num", row_number().over(window_spec))
result2 = base_df.withColumn("rank", rank().over(window_spec))
Monitor data skew: Use Spark UI to identify if certain partitions are processing significantly more data. If you see skew, consider salting techniques or repartitioning strategies.
Window functions are powerful tools in PySpark for analytical queries that require ranking, running totals, or moving averages while preserving row-level detail. Master these three ranking functions, understand their behavioral differences with ties, and apply appropriate partitioning strategies to build efficient data pipelines.