How to Calculate Cumulative Sum in Polars

Key Insights

  • Polars’ cum_sum() expression calculates running totals efficiently, and you can combine it with over() for grouped cumulative sums that would require complex groupby().transform() calls in pandas.
  • Null values leave gaps in a cumulative sum by default: the null row's running total is null, though accumulation continues on later rows. Use fill_null(0) before the calculation, or fill the result, to handle missing data appropriately for your use case.
  • Lazy evaluation with scan_csv() and deferred collect() lets you compute cumulative sums on datasets larger than memory without changing your calculation logic.

Introduction

Cumulative sums appear everywhere in data analysis. You need them for running totals in financial reports, year-to-date calculations in sales dashboards, and cumulative metrics in time series analysis. The operation is simple: for each row, sum all values from the beginning up to and including that row.

Polars handles cumulative sums with the cum_sum() expression, which outperforms pandas’ equivalent in both speed and memory efficiency. The performance gap widens as your data grows—Polars’ Rust-based engine and lazy evaluation model make it particularly well-suited for cumulative operations on large datasets.

This article covers everything you need to calculate cumulative sums in Polars, from basic single-column operations to grouped calculations and memory-efficient processing of large files.

Basic Cumulative Sum with cum_sum()

The cum_sum() expression calculates a running total over a column. You typically use it inside with_columns() to add the cumulative sum as a new column while preserving the original data.

import polars as pl

df = pl.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "sales": [100, 150, 200, 175, 225]
})

result = df.with_columns(
    pl.col("sales").cum_sum().alias("cumulative_sales")
)

print(result)

Output:

shape: (5, 3)
┌─────┬───────┬──────────────────┐
│ day ┆ sales ┆ cumulative_sales │
│ --- ┆ ---   ┆ ---              │
│ i64 ┆ i64   ┆ i64              │
╞═════╪═══════╪══════════════════╡
│ 1   ┆ 100   ┆ 100              │
│ 2   ┆ 150   ┆ 250              │
│ 3   ┆ 200   ┆ 450              │
│ 4   ┆ 175   ┆ 625              │
│ 5   ┆ 225   ┆ 850              │
└─────┴───────┴──────────────────┘

The cumulative sum starts at 100 (the first value) and accumulates: 100 + 150 = 250, 250 + 200 = 450, and so on. Each row contains the sum of all previous rows plus itself.

You can also calculate cumulative sums on multiple columns simultaneously:

df = pl.DataFrame({
    "day": [1, 2, 3, 4],
    "revenue": [1000, 1200, 800, 1500],
    "expenses": [400, 500, 450, 600]
})

result = df.with_columns(
    pl.col("revenue").cum_sum().alias("cumulative_revenue"),
    pl.col("expenses").cum_sum().alias("cumulative_expenses")
)

Handling Null Values

Null values in your data affect cumulative sum calculations. By default, Polars keeps nulls in place: the row with the null gets a null cumulative value, while the running total skips it and continues accumulating on subsequent rows.

df = pl.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "sales": [100, 150, None, 175, 225]
})

result = df.with_columns(
    pl.col("sales").cum_sum().alias("cumulative_sales")
)

print(result)

Output:

shape: (5, 3)
┌─────┬───────┬──────────────────┐
│ day ┆ sales ┆ cumulative_sales │
│ --- ┆ ---   ┆ ---              │
│ i64 ┆ i64   ┆ i64              │
╞═════╪═══════╪══════════════════╡
│ 1   ┆ 100   ┆ 100              │
│ 2   ┆ 150   ┆ 250              │
│ 3   ┆ null  ┆ null             │
│ 4   ┆ 175   ┆ 425              │
│ 5   ┆ 225   ┆ 650              │
└─────┴───────┴──────────────────┘

This gap in the running total might not be what you want. The most common solution is to fill nulls with zero before calculating the cumulative sum, so every row gets a value:

result = df.with_columns(
    pl.col("sales").fill_null(0).cum_sum().alias("cumulative_sales")
)

print(result)

Output:

shape: (5, 3)
┌─────┬───────┬──────────────────┐
│ day ┆ sales ┆ cumulative_sales │
│ --- ┆ ---   ┆ ---              │
│ i64 ┆ i64   ┆ i64              │
╞═════╪═══════╪══════════════════╡
│ 1   ┆ 100   ┆ 100              │
│ 2   ┆ 150   ┆ 250              │
│ 3   ┆ null  ┆ 250              │
│ 4   ┆ 175   ┆ 425              │
│ 5   ┆ 225   ┆ 650              │
└─────┴───────┴──────────────────┘

Now every row has a cumulative value; the null row simply carries the running total forward at 250. The original sales column retains its null value, which is usually what you want for data integrity.

Reverse Cumulative Sums

The cum_sum() function also accepts a reverse parameter. When set to True, it calculates the cumulative sum from the bottom of the column upward:

df = pl.DataFrame({
    "value": [10, 20, 30, 40]
})

result = df.with_columns(
    pl.col("value").cum_sum().alias("forward"),
    pl.col("value").cum_sum(reverse=True).alias("backward")
)

print(result)

Output:

shape: (4, 3)
┌───────┬─────────┬──────────┐
│ value ┆ forward ┆ backward │
│ ---   ┆ ---     ┆ ---      │
│ i64   ┆ i64     ┆ i64      │
╞═══════╪═════════╪══════════╡
│ 10    ┆ 10      ┆ 100      │
│ 20    ┆ 30      ┆ 90       │
│ 30    ┆ 60      ┆ 70       │
│ 40    ┆ 100     ┆ 40       │
└───────┴─────────┴──────────┘

Reverse cumulative sums are useful for calculating “remaining totals”—how much is left from the current row to the end.

Grouped Cumulative Sums with over()

Real-world data often requires cumulative sums within groups. Sales by region, transactions by customer, or metrics by category all need grouped running totals. Polars handles this with the over() window function.

df = pl.DataFrame({
    "region": ["East", "East", "East", "West", "West", "West"],
    "month": [1, 2, 3, 1, 2, 3],
    "sales": [100, 150, 200, 80, 120, 160]
})

result = df.with_columns(
    pl.col("sales").cum_sum().over("region").alias("regional_cumsum")
)

print(result)

Output:

shape: (6, 4)
┌────────┬───────┬───────┬─────────────────┐
│ region ┆ month ┆ sales ┆ regional_cumsum │
│ ---    ┆ ---   ┆ ---   ┆ ---             │
│ str    ┆ i64   ┆ i64   ┆ i64             │
╞════════╪═══════╪═══════╪═════════════════╡
│ East   ┆ 1     ┆ 100   ┆ 100             │
│ East   ┆ 2     ┆ 150   ┆ 250             │
│ East   ┆ 3     ┆ 200   ┆ 450             │
│ West   ┆ 1     ┆ 80    ┆ 80              │
│ West   ┆ 2     ┆ 120   ┆ 200             │
│ West   ┆ 3     ┆ 160   ┆ 360             │
└────────┴───────┴───────┴─────────────────┘

The cumulative sum resets for each region. East accumulates to 450, while West accumulates separately to 360.

You can group by multiple columns for more granular cumulative sums:

df = pl.DataFrame({
    "year": [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "region": ["East", "East", "West", "West", "East", "East", "West", "West"],
    "quarter": [1, 2, 1, 2, 1, 2, 1, 2],
    "sales": [100, 150, 80, 120, 110, 160, 90, 130]
})

result = df.with_columns(
    pl.col("sales").cum_sum().over(["year", "region"]).alias("ytd_sales")
)

This calculates year-to-date sales separately for each year-region combination.

Cumulative Sum with Sorting

Cumulative sums depend on row order. For time series data, you must sort by date before calculating the running total. Otherwise, you’ll get meaningless results.

df = pl.DataFrame({
    "date": ["2024-01-15", "2024-01-01", "2024-01-22", "2024-01-08"],
    "amount": [200, 100, 400, 150]
})

# Wrong: cumulative sum on unsorted data
wrong_result = df.with_columns(
    pl.col("amount").cum_sum().alias("running_total")
)

# Correct: sort first, then calculate
correct_result = (
    df.with_columns(pl.col("date").str.to_date())
    .sort("date")
    .with_columns(pl.col("amount").cum_sum().alias("running_total"))
)

print(correct_result)

Output:

shape: (4, 3)
┌────────────┬────────┬───────────────┐
│ date       ┆ amount ┆ running_total │
│ ---        ┆ ---    ┆ ---           │
│ date       ┆ i64    ┆ i64           │
╞════════════╪════════╪═══════════════╡
│ 2024-01-01 ┆ 100    ┆ 100           │
│ 2024-01-08 ┆ 150    ┆ 250           │
│ 2024-01-15 ┆ 200    ┆ 450           │
│ 2024-01-22 ┆ 400    ┆ 850           │
└────────────┴────────┴───────────────┘

For grouped cumulative sums with time series data, combine sorting with over():

df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 15, 25]
})

result = (
    df.with_columns(pl.col("date").str.to_date())
    .sort("date")
    .with_columns(
        pl.col("value").cum_sum().over("category").alias("category_cumsum")
    )
)

Lazy Evaluation for Large Datasets

When working with files that exceed available memory, Polars’ lazy API lets you compute cumulative sums without loading everything at once. The query optimizer determines the most efficient execution plan.

# Memory-efficient cumulative sum on a large CSV
result = (
    pl.scan_csv("large_sales_data.csv")
    .with_columns(pl.col("date").str.to_date())
    .sort("date")
    .with_columns(
        pl.col("amount").cum_sum().alias("running_total")
    )
    .collect()
)

The scan_csv() function creates a lazy frame that doesn’t read the file immediately. All transformations are recorded as a query plan. Only when you call collect() does Polars execute the optimized plan.

For grouped operations on large data:

result = (
    pl.scan_csv("transactions.csv")
    .sort("transaction_date")
    .with_columns(
        pl.col("amount").cum_sum().over("customer_id").alias("customer_total")
    )
    .filter(pl.col("customer_total") > 10000)  # Filter after calculation
    .collect()
)

The lazy API also enables streaming execution for truly massive datasets. When the data won't fit in memory even after optimization, request the streaming engine at collection time. Older releases spell this collect(streaming=True); Polars 1.x uses collect(engine="streaming"):

result = (
    pl.scan_csv("massive_file.csv")
    .with_columns(pl.col("value").cum_sum().alias("cumsum"))
    .collect(engine="streaming")
)

Note that a cumulative sum is inherently sequential, so not every plan containing one can stream end to end; Polars falls back to in-memory execution for steps the streaming engine doesn't support.

Conclusion

Polars provides a clean, efficient interface for cumulative sum calculations. The cum_sum() expression handles basic running totals, while over() enables grouped calculations without the verbose syntax required in pandas. For large datasets, the lazy API with scan_csv() keeps memory usage manageable.

Key methods to remember:

  • pl.col("x").cum_sum() for basic cumulative sums
  • pl.col("x").fill_null(0).cum_sum() to handle missing values
  • pl.col("x").cum_sum().over("group") for grouped running totals
  • pl.col("x").cum_sum(reverse=True) for reverse cumulative sums

Polars also offers related cumulative functions: cum_max() for running maximums, cum_min() for running minimums, and cum_prod() for cumulative products. These follow the same patterns shown here and work with over() for grouped calculations.
