Window Functions in PySpark vs Pandas vs SQL

Key Insights

  • Window functions let you perform calculations across related rows while keeping your original data intact—unlike GROUP BY which collapses rows. Master these in one technology, and the concepts transfer directly to the others.
  • PySpark and SQL share nearly identical window function semantics, making migration straightforward. Pandas takes a different approach with groupby().transform() and rolling(), which can trip up developers switching between tools.
  • Frame specifications (ROWS vs RANGE) are where most bugs hide. Understanding the difference between physical row offsets and logical value ranges will save you hours of debugging incorrect running totals and moving averages.

Introduction to Window Functions

Window functions solve a specific problem: you need to perform calculations across groups of rows, but you don’t want to collapse your data. Think calculating a running total, ranking items within categories, or comparing each row to the previous one.

With GROUP BY, you lose individual rows. With window functions, every original row stays intact while gaining access to aggregate or positional information from related rows.

All three technologies—SQL, PySpark, and Pandas—support window functions, but the syntax and mental models differ enough to cause confusion when switching between them. This article gives you a practical reference for translating window operations across all three.

Core Syntax Comparison

Let’s start with the fundamental structure. Here’s how you define a window and apply a basic ROW_NUMBER() in each technology:

SQL:

SELECT 
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_date
    ) AS row_num
FROM orders;

PySpark:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("customer_id").orderBy("order_date")

df = df.withColumn("row_num", row_number().over(window_spec))

Pandas:

import numpy as np

# Sort first so numbering follows order_date within each customer
df = df.sort_values(["customer_id", "order_date"])
df["row_num"] = df.groupby("customer_id")["order_date"].transform(
    lambda x: np.arange(1, len(x) + 1)
)

# Or more idiomatically for row_number specifically:
df["row_num"] = df.groupby("customer_id").cumcount() + 1

Notice the pattern: SQL and PySpark both use explicit OVER clauses with PARTITION BY and ORDER BY. Pandas takes a different route—groupby() combined with transform() or specialized methods like cumcount().

Ranking Functions

Ranking functions assign positions to rows within partitions. The four main ones behave identically across SQL and PySpark; Pandas covers the first three through rank() but needs a workaround for NTILE.

ROW_NUMBER: Sequential integers, no ties. Row 1, 2, 3, 4 regardless of duplicate values.

RANK: Allows ties, but skips numbers. If two rows tie for position 2, the next row gets position 4.

DENSE_RANK: Allows ties, no gaps. Two rows tie for position 2, next row gets position 3.

NTILE(n): Divides rows into n roughly equal buckets.
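To make the tie-handling rules concrete, here's a small self-contained Pandas sketch (using a hypothetical score series) that reproduces all four behaviors:

```python
import pandas as pd

# Hypothetical scores with a tie at 90, ranked from highest to lowest
s = pd.Series([100, 90, 90, 80])

row_number = s.rank(method="first", ascending=False)  # 1, 2, 3, 4 -- no ties
rank_ = s.rank(method="min", ascending=False)         # 1, 2, 2, 4 -- tie, then gap
dense_rank = s.rank(method="dense", ascending=False)  # 1, 2, 2, 3 -- tie, no gap

# NTILE(2) has no direct Pandas method; a common workaround is qcut over a first-rank
ntile_2 = pd.qcut(s.rank(method="first", ascending=False), 2, labels=False) + 1  # 1, 1, 2, 2
```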

Here’s finding the top 2 orders per customer by amount:

SQL:

WITH ranked AS (
    SELECT 
        customer_id,
        order_id,
        amount,
        RANK() OVER (
            PARTITION BY customer_id 
            ORDER BY amount DESC
        ) AS amount_rank
    FROM orders
)
SELECT * FROM ranked WHERE amount_rank <= 2;

PySpark:

from pyspark.sql.functions import col, rank

window_spec = Window.partitionBy("customer_id").orderBy(col("amount").desc())

df_ranked = df.withColumn("amount_rank", rank().over(window_spec))
df_top2 = df_ranked.filter(col("amount_rank") <= 2)

Pandas:

df["amount_rank"] = df.groupby("customer_id")["amount"].rank(
    method="min",  # "min" = RANK, "dense" = DENSE_RANK, "first" = ROW_NUMBER
    ascending=False
)
df_top2 = df[df["amount_rank"] <= 2]

The Pandas rank() method’s method parameter controls tie-handling: "min" mimics SQL’s RANK, "dense" mimics DENSE_RANK, and "first" mimics ROW_NUMBER.

Aggregate Window Functions

Running totals, cumulative counts, and moving averages use aggregate functions with window specifications.

Cumulative sum by customer:

SQL:

SELECT 
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id 
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_amount
FROM orders;

PySpark:

from pyspark.sql.functions import sum as spark_sum

window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(
    Window.unboundedPreceding, Window.currentRow
)

df = df.withColumn("cumulative_amount", spark_sum("amount").over(window_spec))

Pandas:

df = df.sort_values(["customer_id", "order_date"])
df["cumulative_amount"] = df.groupby("customer_id")["amount"].cumsum()

7-day rolling average (assuming one row per day):

SQL:

SELECT 
    date,
    revenue,
    AVG(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM daily_revenue;

PySpark:

from pyspark.sql.functions import avg

# Note: a window with no partitionBy moves all data to a single partition
window_spec = Window.orderBy("date").rowsBetween(-6, Window.currentRow)

df = df.withColumn("rolling_7day_avg", avg("revenue").over(window_spec))

Pandas:

df = df.sort_values("date")
df["rolling_7day_avg"] = df["revenue"].rolling(window=7, min_periods=1).mean()

Pandas’ rolling() method is purpose-built for this use case and often more readable than the SQL/PySpark approach.
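One detail worth noting: SQL's ROWS BETWEEN 6 PRECEDING AND CURRENT ROW averages over however many rows exist at the start of the data, and min_periods=1 is what makes Pandas match that behavior. A minimal sketch with toy numbers:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])

# Without min_periods, the first window-1 results are NaN
strict = s.rolling(window=3).mean()

# With min_periods=1, partial windows are averaged -- matching SQL's frame behavior
sql_like = s.rolling(window=3, min_periods=1).mean()  # 10.0, 15.0, 20.0
```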

Offset Functions (LAG/LEAD)

LAG and LEAD access values from previous or subsequent rows—essential for time-series analysis and change detection.

Month-over-month growth:

SQL:

SELECT 
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
    (revenue - LAG(revenue, 1) OVER (ORDER BY month)) / 
        LAG(revenue, 1) OVER (ORDER BY month) * 100 AS mom_growth_pct
FROM monthly_revenue;

PySpark:

from pyspark.sql.functions import col, lag

window_spec = Window.orderBy("month")

df = df.withColumn("prev_month_revenue", lag("revenue", 1).over(window_spec))
df = df.withColumn(
    "mom_growth_pct",
    ((col("revenue") - col("prev_month_revenue")) / col("prev_month_revenue") * 100)
)

Pandas:

df = df.sort_values("month")
df["prev_month_revenue"] = df["revenue"].shift(1)
df["mom_growth_pct"] = (df["revenue"] - df["prev_month_revenue"]) / df["prev_month_revenue"] * 100

Pandas uses shift() instead of lag()/lead(). Positive values shift down (like LAG), negative values shift up (like LEAD).
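A quick sketch of that mapping on a toy series:

```python
import pandas as pd

s = pd.Series([10, 20, 30])

lag_1 = s.shift(1)    # NaN, 10.0, 20.0 -- previous row, like LAG(x, 1)
lead_1 = s.shift(-1)  # 20.0, 30.0, NaN -- next row, like LEAD(x, 1)
```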

Detecting value changes:

# PySpark
window_spec = Window.partitionBy("user_id").orderBy("event_time")
df = df.withColumn("prev_status", lag("status", 1).over(window_spec))
df = df.withColumn("status_changed", col("status") != col("prev_status"))

# Pandas
df["prev_status"] = df.groupby("user_id")["status"].shift(1)
df["status_changed"] = df["status"] != df["prev_status"]
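One gotcha with the Pandas version: the first row in each group has no previous value, and NaN != anything evaluates to True, so each user's first event is always flagged as changed. A small sketch with hypothetical data, masking that case explicitly:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "status": ["active", "active", "idle", "idle"],
})

df["prev_status"] = df.groupby("user_id")["status"].shift(1)
df["status_changed"] = df["status"] != df["prev_status"]

# Each user's first row compares against NaN and comes out True --
# mask it if that's not the behavior you want
df.loc[df["prev_status"].isna(), "status_changed"] = False
```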

Frame Specifications

Frame specifications define exactly which rows participate in your window calculation. This is where subtle bugs creep in.

ROWS BETWEEN: Physical row offsets. “3 rows before current row” means exactly 3 rows, regardless of values.

RANGE BETWEEN: Logical value ranges. “Values within 3 of current row’s value” depends on the ORDER BY column’s actual values.

SQL example showing the difference:

-- ROWS: Always includes exactly 3 preceding rows
SUM(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)

-- RANGE: Includes all rows where order_date is within 3 days
-- (interval syntax varies by database; this is the standard form)
SUM(amount) OVER (
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND CURRENT ROW
)

PySpark:

from pyspark.sql.functions import col, datediff, lit

# ROWS-based frame
rows_window = Window.orderBy("order_date").rowsBetween(-3, Window.currentRow)

# RANGE-based frame: rangeBetween requires a numeric ORDER BY column,
# so convert the date to a day count first
df = df.withColumn("day_num", datediff(col("order_date"), lit("1970-01-01")))
range_window = Window.orderBy("day_num").rangeBetween(-3, Window.currentRow)

Pandas doesn’t have direct ROWS vs RANGE equivalents. The rolling() method works on row counts by default, but you can use time-based windows with datetime indices:

# Row-based rolling
df["rolling_sum"] = df["amount"].rolling(window=4).sum()  # Current + 3 preceding

# Time-based rolling (requires datetime index)
df = df.set_index("order_date")
df["rolling_sum"] = df["amount"].rolling(window="3D").sum()  # 3-day window
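To see why the distinction matters, consider duplicate dates — the classic case where physical-row and logical-value windows disagree. A sketch with hypothetical orders:

```python
import pandas as pd

# Two orders land on 2024-01-02: row-based and time-based windows disagree here
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
    "amount": [10.0, 20.0, 30.0, 40.0],
}).set_index("order_date")

rows_sum = df["amount"].rolling(window=2).sum()      # exactly 2 physical rows
range_sum = df["amount"].rolling(window="2D").sum()  # all rows in a 2-calendar-day span

# On 2024-01-03, the row window sees only [30, 40],
# while the time window sees every order from 01-02 and 01-03: [20, 30, 40]
```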

Performance Considerations and When to Use Each

Choose SQL when:

  • You’re working directly in a database
  • Your team is SQL-fluent and the logic stays in queries
  • You need the query optimizer to handle execution planning

Choose PySpark when:

  • Data exceeds single-machine memory
  • You’re building data pipelines with complex transformations
  • You need to chain window operations with other Spark operations

Choose Pandas when:

  • Data fits in memory (generally under 1-2GB)
  • You need rapid prototyping and exploration
  • You’re doing ad-hoc analysis in notebooks

PySpark optimization tips:

  1. Partition wisely. Window functions require shuffling data by partition key. Choose partition columns with reasonable cardinality—too few partitions create bottlenecks, too many create overhead.
# Bad: Partitioning by high-cardinality column
Window.partitionBy("transaction_id")  # Millions of partitions

# Better: Partition by natural grouping
Window.partitionBy("customer_id")  # Thousands of partitions
  2. Combine windows when possible. Multiple window functions with identical specifications share the same shuffle:
# Single shuffle for both operations
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df = df.withColumn("row_num", row_number().over(window_spec))
df = df.withColumn("running_total", spark_sum("amount").over(window_spec))
  3. Cache before multiple window passes. If you need different partition schemes, cache the DataFrame to avoid recomputing upstream transformations.

Pandas optimization tips:

  1. Sort once, then use cumulative methods. Sorting is expensive; do it once before applying multiple window operations.

  2. Use built-in methods over transform() with lambdas. Methods like cumsum(), rank(), and shift() are vectorized and significantly faster than custom lambda functions.
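The two approaches produce identical results; only the speed differs. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})

# Vectorized built-in: runs in optimized C code
fast = df.groupby("g")["x"].cumsum()

# Lambda-based transform: same output, but invokes Python once per group
slow = df.groupby("g")["x"].transform(lambda s: s.cumsum())
```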

Window functions are one of those features that, once mastered, fundamentally change how you approach data problems. The syntax varies across technologies, but the mental model stays consistent: define your partition, specify your ordering, set your frame, and apply your function. Get comfortable with that pattern, and you’ll move fluidly between SQL, PySpark, and Pandas without missing a beat.
