Window Functions in PySpark vs Pandas vs SQL
Key Insights
- Window functions let you perform calculations across related rows while keeping your original data intact—unlike GROUP BY which collapses rows. Master these in one technology, and the concepts transfer directly to the others.
- PySpark and SQL share nearly identical window function semantics, making migration straightforward. Pandas takes a different approach with groupby().transform() and rolling(), which can trip up developers switching between tools.
- Frame specifications (ROWS vs RANGE) are where most bugs hide. Understanding the difference between physical row offsets and logical value ranges will save you hours of debugging incorrect running totals and moving averages.
Introduction to Window Functions
Window functions solve a specific problem: you need to perform calculations across groups of rows, but you don’t want to collapse your data. Think calculating a running total, ranking items within categories, or comparing each row to the previous one.
With GROUP BY, you lose individual rows. With window functions, every original row stays intact while gaining access to aggregate or positional information from related rows.
All three technologies—SQL, PySpark, and Pandas—support window functions, but the syntax and mental models differ enough to cause confusion when switching between them. This article gives you a practical reference for translating window operations across all three.
Core Syntax Comparison
Let’s start with the fundamental structure. Here’s how you define a window and apply a basic ROW_NUMBER() in each technology:
SQL:
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS row_num
FROM orders;
PySpark:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df = df.withColumn("row_num", row_number().over(window_spec))
Pandas:
import numpy as np
# Both approaches number rows in the DataFrame's current order,
# so sort by order_date within each customer first
df = df.sort_values(["customer_id", "order_date"])
df["row_num"] = df.groupby("customer_id")["order_date"].transform(
    lambda x: np.arange(1, len(x) + 1)
)
# Or more idiomatically for row_number specifically:
df["row_num"] = df.groupby("customer_id").cumcount() + 1
Notice the pattern: SQL and PySpark both use explicit OVER clauses with PARTITION BY and ORDER BY. Pandas takes a different route—groupby() combined with transform() or specialized methods like cumcount().
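To make the Pandas side concrete, here is a minimal, self-contained sketch. The orders data is made up for illustration; the point is that sorting plus cumcount() reproduces SQL's ROW_NUMBER numbering within each customer:

```python
import pandas as pd

# Hypothetical orders table (values invented for this example)
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-01", "2024-01-03", "2024-02-01", "2024-01-15"]
    ),
    "amount": [50, 20, 30, 70, 40],
})

# Sort to match ORDER BY order_date, then number rows within each customer
df = df.sort_values(["customer_id", "order_date"]).reset_index(drop=True)
df["row_num"] = df.groupby("customer_id").cumcount() + 1
print(df[["customer_id", "order_date", "row_num"]])
```

Each customer's earliest order gets row_num 1, exactly as the SQL and PySpark versions would produce.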
Ranking Functions
Ranking functions assign positions to rows within partitions. The four main ones behave identically across SQL and PySpark, but require workarounds in Pandas.
ROW_NUMBER: Sequential integers, no ties. Row 1, 2, 3, 4 regardless of duplicate values.
RANK: Allows ties, but skips numbers. If two rows tie for position 2, the next row gets position 4.
DENSE_RANK: Allows ties, no gaps. Two rows tie for position 2, next row gets position 3.
NTILE(n): Divides rows into n roughly equal buckets.
Here’s finding the top 2 orders per customer by amount:
SQL:
WITH ranked AS (
SELECT
customer_id,
order_id,
amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS amount_rank
FROM orders
)
SELECT * FROM ranked WHERE amount_rank <= 2;
PySpark:
from pyspark.sql.functions import col, rank
window_spec = Window.partitionBy("customer_id").orderBy(col("amount").desc())
df_ranked = df.withColumn("amount_rank", rank().over(window_spec))
df_top2 = df_ranked.filter(col("amount_rank") <= 2)
Pandas:
df["amount_rank"] = df.groupby("customer_id")["amount"].rank(
method="min", # "min" = RANK, "dense" = DENSE_RANK, "first" = ROW_NUMBER
ascending=False
)
df_top2 = df[df["amount_rank"] <= 2]
The Pandas rank() method’s method parameter controls tie-handling: "min" mimics SQL’s RANK, "dense" mimics DENSE_RANK, and "first" mimics ROW_NUMBER.
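A small runnable check makes the tie-handling mapping concrete. With one tie at 90 (invented data), the three method values reproduce the three SQL ranking functions:

```python
import pandas as pd

# Scores with a tie at 90 to show how each method numbers it
s = pd.Series([100, 90, 90, 80])

rank_min = s.rank(method="min", ascending=False)      # RANK: ties share, gaps follow
rank_dense = s.rank(method="dense", ascending=False)  # DENSE_RANK: ties share, no gaps
rank_first = s.rank(method="first", ascending=False)  # ROW_NUMBER: ties broken by order

print(rank_min.tolist())    # [1.0, 2.0, 2.0, 4.0]
print(rank_dense.tolist())  # [1.0, 2.0, 2.0, 3.0]
print(rank_first.tolist())  # [1.0, 2.0, 3.0, 4.0]
```

Note that Pandas returns float ranks; cast with .astype(int) if you need integers.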
Aggregate Window Functions
Running totals, cumulative counts, and moving averages use aggregate functions with window specifications.
Cumulative sum by customer:
SQL:
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_amount
FROM orders;
PySpark:
from pyspark.sql.functions import sum as spark_sum
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(
Window.unboundedPreceding, Window.currentRow
)
df = df.withColumn("cumulative_amount", spark_sum("amount").over(window_spec))
Pandas:
df = df.sort_values(["customer_id", "order_date"])
df["cumulative_amount"] = df.groupby("customer_id")["amount"].cumsum()
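As a quick sanity check, here is the cumulative sum end to end on toy data (amounts invented). Sorting first matters: cumsum() accumulates in the DataFrame's current row order, which is what stands in for the SQL ORDER BY:

```python
import pandas as pd

# Toy orders: running total per customer, ordered by date
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-04", "2024-01-05"]
    ),
    "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
})

df = df.sort_values(["customer_id", "order_date"])
df["cumulative_amount"] = df.groupby("customer_id")["amount"].cumsum()
print(df["cumulative_amount"].tolist())  # [10.0, 30.0, 5.0, 20.0, 45.0]
```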
7-day rolling average:
SQL:
SELECT
date,
revenue,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_revenue;
PySpark:
from pyspark.sql.functions import avg
window_spec = Window.orderBy("date").rowsBetween(-6, Window.currentRow)
df = df.withColumn("rolling_7day_avg", avg("revenue").over(window_spec))
Pandas:
df = df.sort_values("date")
df["rolling_7day_avg"] = df["revenue"].rolling(window=7, min_periods=1).mean()
Pandas’ rolling() method is purpose-built for this use case and often more readable than the SQL/PySpark approach.
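The min_periods=1 argument is what makes the Pandas version match the SQL frame exactly. A short sketch with made-up revenue shows the early rows averaging over whatever history exists, just like ROWS BETWEEN 6 PRECEDING AND CURRENT ROW:

```python
import pandas as pd

# Toy daily revenue (invented numbers) to sanity-check the rolling average
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "revenue": [100, 200, 300, 400, 500, 600, 700, 800],
})

daily = daily.sort_values("date")
# min_periods=1: row 1 averages 1 value, row 2 averages 2, ... up to 7
daily["rolling_7day_avg"] = daily["revenue"].rolling(window=7, min_periods=1).mean()
print(daily["rolling_7day_avg"].tolist())
```

Without min_periods=1, the first six rows would be NaN, which differs from what the SQL frame produces.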
Offset Functions (LAG/LEAD)
LAG and LEAD access values from previous or subsequent rows—essential for time-series analysis and change detection.
Month-over-month growth:
SQL:
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
(revenue - LAG(revenue, 1) OVER (ORDER BY month)) /
LAG(revenue, 1) OVER (ORDER BY month) * 100 AS mom_growth_pct
FROM monthly_revenue;
PySpark:
from pyspark.sql.functions import col, lag
window_spec = Window.orderBy("month")
df = df.withColumn("prev_month_revenue", lag("revenue", 1).over(window_spec))
df = df.withColumn(
"mom_growth_pct",
((col("revenue") - col("prev_month_revenue")) / col("prev_month_revenue") * 100)
)
Pandas:
df = df.sort_values("month")
df["prev_month_revenue"] = df["revenue"].shift(1)
df["mom_growth_pct"] = (df["revenue"] - df["prev_month_revenue"]) / df["prev_month_revenue"] * 100
Pandas uses shift() instead of lag()/lead(). Positive values shift down (like LAG), negative values shift up (like LEAD).
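Here is the month-over-month calculation runnable end to end on invented revenue figures, including the LEAD-style negative shift:

```python
import pandas as pd

# Made-up monthly revenue for illustration
rev = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03"],
    "revenue": [100.0, 120.0, 90.0],
}).sort_values("month")

rev["prev_month_revenue"] = rev["revenue"].shift(1)   # like LAG(revenue, 1)
rev["next_month_revenue"] = rev["revenue"].shift(-1)  # like LEAD(revenue, 1)
rev["mom_growth_pct"] = (
    (rev["revenue"] - rev["prev_month_revenue"]) / rev["prev_month_revenue"] * 100
)
print(rev["mom_growth_pct"].tolist())  # first value is NaN (no prior month)
```

Pandas also has pct_change(), which computes the same prior-row ratio in one step: rev["revenue"].pct_change() * 100.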
Detecting value changes:
# PySpark
window_spec = Window.partitionBy("user_id").orderBy("event_time")
df = df.withColumn("prev_status", lag("status", 1).over(window_spec))
df = df.withColumn("status_changed", col("status") != col("prev_status"))
# Pandas
df["prev_status"] = df.groupby("user_id")["status"].shift(1)
df["status_changed"] = df["status"] != df["prev_status"]
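One subtlety worth testing: for each user's first event, prev_status is NaN, and NaN != "active" evaluates to True, so the plain comparison flags every first event as a change. A hedged sketch with invented events shows one way to mask that out:

```python
import pandas as pd

# Hypothetical event log (data invented for this example)
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "status": ["active", "active", "churned", "trial", "active"],
})

events = events.sort_values(["user_id", "event_time"])
events["prev_status"] = events.groupby("user_id")["status"].shift(1)
# Mask out each user's first event, where prev_status is NaN
events["status_changed"] = (
    (events["status"] != events["prev_status"]) & events["prev_status"].notna()
)
print(events["status_changed"].tolist())  # [False, False, True, False, True]
```

The SQL/PySpark versions have the same edge case: LAG returns NULL on the first row, and NULL comparisons propagate NULL rather than True, so the two tools disagree on first rows unless you handle them explicitly.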
Frame Specifications
Frame specifications define exactly which rows participate in your window calculation. This is where subtle bugs creep in.
ROWS BETWEEN: Physical row offsets. “3 rows before current row” means exactly 3 rows, regardless of values.
RANGE BETWEEN: Logical value ranges. “Values within 3 of current row’s value” depends on the ORDER BY column’s actual values.
SQL example showing the difference:
-- ROWS: Always includes exactly 3 preceding rows
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
-- RANGE: Includes all rows where order_date is within 3 days
-- (interval syntax varies by dialect; standard SQL writes INTERVAL '3' DAY)
SUM(amount) OVER (
ORDER BY order_date
RANGE BETWEEN INTERVAL 3 DAY PRECEDING AND CURRENT ROW
)
PySpark:
# ROWS-based frame
rows_window = Window.orderBy("order_date").rowsBetween(-3, Window.currentRow)
# RANGE-based frame: rangeBetween requires a numeric ORDER BY expression,
# so convert dates first -- here to epoch seconds via a timestamp cast,
# with the offset sized to match (3 days = 3 * 86400 seconds)
range_window = Window.orderBy(
    col("order_date").cast("timestamp").cast("long")
).rangeBetween(-3 * 86400, Window.currentRow)
Pandas doesn’t have direct ROWS vs RANGE equivalents. The rolling() method works on row counts by default, but you can use time-based windows with datetime indices:
# Row-based rolling
df["rolling_sum"] = df["amount"].rolling(window=4).sum() # Current + 3 preceding
# Time-based rolling (requires datetime index)
df = df.set_index("order_date")
df["rolling_sum"] = df["amount"].rolling(window="3D").sum() # 3-day window
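The distinction becomes visible with irregular dates. In this sketch (invented data with a gap between Jan 2 and Jan 10), the row-based window always sums the last 2 rows, while the time-based window only sums rows within the last 3 days:

```python
import pandas as pd

# Irregular dates expose the row-count vs time-window difference
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-10", "2024-01-11"]
    ),
    "amount": [10.0, 20.0, 30.0, 40.0],
}).set_index("order_date")

# Row-based: current row + 1 preceding, regardless of dates (like ROWS)
row_sum = df["amount"].rolling(window=2, min_periods=1).sum()
# Time-based: rows within the trailing 3 days (like RANGE over an interval)
time_sum = df["amount"].rolling(window="3D").sum()

print(row_sum.tolist())   # [10.0, 30.0, 50.0, 70.0]
print(time_sum.tolist())  # [10.0, 30.0, 30.0, 70.0]
```

On Jan 10 the two disagree: the row-based window reaches back across the gap to Jan 2, while the time-based window sees no rows in the prior 3 days and sums only the current row.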
Performance Considerations and When to Use Each
Choose SQL when:
- You’re working directly in a database
- Your team is SQL-fluent and the logic stays in queries
- You need the query optimizer to handle execution planning
Choose PySpark when:
- Data exceeds single-machine memory
- You’re building data pipelines with complex transformations
- You need to chain window operations with other Spark operations
Choose Pandas when:
- Data fits in memory (generally under 1-2GB)
- You need rapid prototyping and exploration
- You’re doing ad-hoc analysis in notebooks
PySpark optimization tips:
- Partition wisely. Window functions require shuffling data by partition key. Choose partition columns with reasonable cardinality—too few partitions create bottlenecks, too many create overhead.
# Bad: Partitioning by high-cardinality column
Window.partitionBy("transaction_id") # Millions of partitions
# Better: Partition by natural grouping
Window.partitionBy("customer_id") # Thousands of partitions
- Combine windows when possible. Multiple window functions with identical specifications share the same shuffle:
# Single shuffle for both operations
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df = df.withColumn("row_num", row_number().over(window_spec))
df = df.withColumn("running_total", spark_sum("amount").over(window_spec))
- Cache before multiple window passes. If you need different partition schemes, cache the DataFrame to avoid recomputing upstream transformations.
Pandas optimization tips:
- Sort once, then use cumulative methods. Sorting is expensive; do it once before applying multiple window operations.
- Use built-in methods over transform() with lambdas. Methods like cumsum(), rank(), and shift() are vectorized and significantly faster than custom lambda functions.
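To see that the built-in and the lambda produce identical results (the speed difference is the only reason to prefer one), here is a small equivalence check on invented data:

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2, 2], "x": [1.0, 2.0, 3.0, 4.0]})

fast = df.groupby("g")["x"].cumsum()                         # vectorized built-in
slow = df.groupby("g")["x"].transform(lambda s: s.cumsum())  # per-group Python call

assert fast.equals(slow)  # same answer; the built-in skips the per-group Python overhead
print(fast.tolist())  # [1.0, 3.0, 3.0, 7.0]
```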
Window functions are one of those features that, once mastered, fundamentally change how you approach data problems. The syntax varies across technologies, but the mental model stays consistent: define your partition, specify your ordering, set your frame, and apply your function. Get comfortable with that pattern, and you’ll move fluidly between SQL, PySpark, and Pandas without missing a beat.