Pandas - Cumulative Sum (cumsum)
Key Insights
- The `cumsum()` method calculates running totals across rows or columns, essential for tracking accumulated values like sales totals, running balances, or time-series aggregations
- Cumulative sums control missing-value handling through the `skipna` parameter and work across numeric data types including integers, floats, and timedeltas
- Group-wise cumulative calculations enable analytics like running totals per category, percentage contributions, and threshold-breach detection in segmented data
Basic Cumulative Sum Operations
The cumsum() method computes the cumulative sum of elements along a specified axis. By default, it operates on each column independently, returning a DataFrame or Series with the same shape as the input.
import pandas as pd
import numpy as np
# Simple Series cumulative sum
sales = pd.Series([100, 150, 200, 120, 180])
cumulative_sales = sales.cumsum()
print(cumulative_sales)
# 0 100
# 1 250
# 2 450
# 3 570
# 4 750
# DataFrame cumulative sum
df = pd.DataFrame({
'revenue': [1000, 1500, 1200, 1800],
'costs': [600, 800, 700, 900],
'profit': [400, 700, 500, 900]
})
cumulative_df = df.cumsum()
print(cumulative_df)
# revenue costs profit
# 0 1000 600 400
# 1 2500 1400 1100
# 2 3700 2100 1600
# 3 5500 3000 2500
The axis parameter controls direction: axis=0 (default) accumulates down each column, while axis=1 accumulates across each row.
# Cumulative sum across columns
df_horizontal = df.cumsum(axis=1)
print(df_horizontal)
# revenue costs profit
# 0 1000 1600 2000
# 1 1500 2300 3000
# 2 1200 1900 2400
# 3 1800 2700 3600
Handling Missing Values
The skipna parameter determines how cumsum() handles NaN values. When skipna=True (default), missing values are ignored and the cumulative sum continues. When False, any NaN propagates through all subsequent values.
# Data with missing values
data_with_nan = pd.Series([10, 20, np.nan, 30, 40])
# Default behavior: skip NaN
cumsum_skip = data_with_nan.cumsum()
print(cumsum_skip)
# 0 10.0
# 1 30.0
# 2 NaN
# 3 60.0
# 4 100.0
# Propagate NaN through results
cumsum_propagate = data_with_nan.cumsum(skipna=False)
print(cumsum_propagate)
# 0 10.0
# 1 30.0
# 2 NaN
# 3 NaN
# 4 NaN
For financial or scientific applications where missing data should halt accumulation, use skipna=False. For analytics where you want to continue despite gaps, the default behavior works best.
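A related pattern arises when a NaN should restart the running total rather than merely be skipped: label the stretches between NaNs with isna().cumsum() and accumulate within each label. A minimal sketch (the readings data here is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical readings where each NaN marks a reset point
readings = pd.Series([10, 20, np.nan, 30, 40])

# isna().cumsum() labels the segments 0, 0, 1, 1, 1 ...
segments = readings.isna().cumsum()
restarted = readings.groupby(segments).cumsum()
print(restarted)
# 0    10.0
# 1    30.0
# 2     NaN
# 3    30.0
# 4    70.0
```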
Group-wise Cumulative Sums
The real power of cumsum() emerges when combined with groupby(). This enables calculating running totals within categories, essential for customer lifetime value, departmental budgets, or product-line analytics.
# Sales data by region
sales_data = pd.DataFrame({
'region': ['North', 'North', 'South', 'North', 'South', 'South'],
'month': [1, 2, 1, 3, 2, 3],
'sales': [1000, 1200, 800, 1500, 900, 1100]
})
# Cumulative sum per region
sales_data['cumulative_sales'] = sales_data.groupby('region')['sales'].cumsum()
print(sales_data)
# region month sales cumulative_sales
# 0 North 1 1000 1000
# 1 North 2 1200 2200
# 2 South 1 800 800
# 3 North 3 1500 3700
# 4 South 2 900 1700
# 5 South 3 1100 2800
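One caveat: cumsum() accumulates in the order rows appear, so if rows are not already sorted chronologically within each group, sort first. A sketch reusing the same sales data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'region': ['North', 'North', 'South', 'North', 'South', 'South'],
    'month': [1, 2, 1, 3, 2, 3],
    'sales': [1000, 1200, 800, 1500, 900, 1100]
})

# Sort by group key and time key so the running total follows month order
ordered = sales_data.sort_values(['region', 'month'])
ordered['cumulative_sales'] = ordered.groupby('region')['sales'].cumsum()
print(ordered)
```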
Multiple grouping levels enable hierarchical cumulative calculations:
# Multi-level grouping
detailed_sales = pd.DataFrame({
'region': ['North', 'North', 'North', 'South', 'South', 'South'],
'product': ['A', 'B', 'A', 'A', 'B', 'A'],
'sales': [100, 150, 120, 90, 110, 95]
})
detailed_sales['cumsum'] = detailed_sales.groupby(['region', 'product'])['sales'].cumsum()
print(detailed_sales)
# region product sales cumsum
# 0 North A 100 100
# 1 North B 150 150
# 2 North A 120 220
# 3 South A 90 90
# 4 South B 110 110
# 5 South A 95 185
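The key insights above mention percentage contributions; combining the group-wise cumsum with transform('sum') gives a running percentage of each group's total. A sketch using the same detailed_sales data (the pct_of_group column name is illustrative):

```python
import pandas as pd

detailed_sales = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B', 'A'],
    'sales': [100, 150, 120, 90, 110, 95]
})

grouped = detailed_sales.groupby(['region', 'product'])['sales']
detailed_sales['cumsum'] = grouped.cumsum()
# transform('sum') broadcasts each group's total back onto its rows
detailed_sales['pct_of_group'] = (
    detailed_sales['cumsum'] / grouped.transform('sum') * 100
)
print(detailed_sales)
```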
Time Series Applications
Cumulative sums are fundamental for time-series analysis, particularly for calculating running totals, year-to-date figures, and detecting trend changes.
# Daily transaction data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
transactions = pd.DataFrame({
'date': dates,
'amount': [250, 180, 320, 290, 410, 220, 380, 270, 340, 300]
})
transactions.set_index('date', inplace=True)
# Running total
transactions['running_total'] = transactions['amount'].cumsum()
# Calculate percentage of total at each point
total_amount = transactions['amount'].sum()
transactions['pct_of_total'] = (transactions['running_total'] / total_amount) * 100
print(transactions.head())
#             amount  running_total  pct_of_total
# date
# 2024-01-01     250            250      8.445946
# 2024-01-02     180            430     14.527027
# 2024-01-03     320            750     25.337838
# 2024-01-04     290           1040     35.135135
# 2024-01-05     410           1450     48.986486
For monthly or yearly aggregations, resample first then apply cumulative sum:
# Monthly cumulative totals
monthly_data = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=90, freq='D'),
'value': np.random.randint(50, 200, 90)
})
monthly_data.set_index('date', inplace=True)
# Resample to month-end frequency, then cumsum
# ('ME' replaces the deprecated 'M' alias in recent pandas; use 'M' on older versions)
monthly_cumsum = monthly_data.resample('ME').sum().cumsum()
print(monthly_cumsum.head())
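For the year-to-date figures mentioned above, one approach is to group by the year of the DatetimeIndex so the running total restarts at each year boundary. An illustrative sketch with hypothetical daily data (the ytd column name is an assumption):

```python
import pandas as pd
import numpy as np

# Hypothetical daily values spanning two calendar years
daily = pd.DataFrame(
    {'value': np.arange(1, 731)},
    index=pd.date_range('2023-01-01', periods=730, freq='D')
)

# Grouping by index.year restarts the cumulative sum every January 1st
daily['ytd'] = daily.groupby(daily.index.year)['value'].cumsum()
print(daily.loc['2023-12-29':'2024-01-02'])
```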
Performance Considerations and Alternatives
For large datasets, cumsum() is highly optimized using NumPy’s underlying C implementations. However, certain patterns can impact performance.
# Efficient: vectorized cumsum on the entire column
df_large = pd.DataFrame({'values': np.random.randint(1, 100, 1000000)})
%timeit df_large['values'].cumsum()  # IPython magic; use the timeit module in plain scripts
# Less efficient: accumulating in a Python loop -- avoid this pattern
result = []
running_sum = 0
for val in df_large['values']:
    running_sum += val
    result.append(running_sum)
When you need conditional cumulative sums (accumulate only when certain conditions are met), combine boolean indexing with cumsum:
# Cumulative sum only for positive values
data = pd.Series([10, -5, 15, -3, 20, 8, -12, 25])
positive_mask = data > 0
# Create cumulative sum that only counts positive values
cumsum_positive = (data * positive_mask).cumsum()
print(cumsum_positive)
# 0 10
# 1 10
# 2 25
# 3 25
# 4 45
# 5 53
# 6 53
# 7 78
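An arguably clearer equivalent uses Series.where() to zero out the excluded values before accumulating:

```python
import pandas as pd

data = pd.Series([10, -5, 15, -3, 20, 8, -12, 25])

# where() keeps values meeting the condition and fills the rest with 0
cumsum_positive = data.where(data > 0, 0).cumsum()
print(cumsum_positive.tolist())
# [10, 10, 25, 25, 45, 53, 53, 78]
```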
Detecting Threshold Breaches
A practical application involves identifying when cumulative values exceed thresholds, useful for budget tracking, inventory management, or alert systems.
# Budget tracking with threshold detection
expenses = pd.DataFrame({
'category': ['Marketing', 'Marketing', 'IT', 'Marketing', 'IT', 'IT'],
'amount': [5000, 3000, 8000, 4000, 6000, 5000],
'budget_limit': [15000, 15000, 20000, 15000, 20000, 20000]
})
expenses['cumulative'] = expenses.groupby('category')['amount'].cumsum()
expenses['over_budget'] = expenses['cumulative'] > expenses['budget_limit']
expenses['remaining'] = expenses['budget_limit'] - expenses['cumulative']
print(expenses)
# category amount budget_limit cumulative over_budget remaining
# 0 Marketing 5000 15000 5000 False 10000
# 1 Marketing 3000 15000 8000 False 7000
# 2 IT 8000 20000 8000 False 12000
# 3 Marketing 4000 15000 12000 False 3000
# 4 IT 6000 20000 14000 False 6000
# 5 IT 5000 20000 19000 False 1000
This pattern enables real-time monitoring and early warning systems for resource consumption across multiple dimensions simultaneously.
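To act on breaches, you often want the first row where each category crosses its limit. A sketch with hypothetical data where the limits are actually exceeded (categories that never breach simply drop out of the result):

```python
import pandas as pd

# Hypothetical spending data in which both budgets are eventually exceeded
spend = pd.DataFrame({
    'category': ['Marketing', 'Marketing', 'IT', 'Marketing', 'IT'],
    'amount': [5000, 6000, 8000, 7000, 13000],
    'budget_limit': [15000, 15000, 20000, 15000, 20000]
})
spend['cumulative'] = spend.groupby('category')['amount'].cumsum()
spend['over_budget'] = spend['cumulative'] > spend['budget_limit']

# Keep only breaching rows, then take the first one per category
first_breach = spend[spend['over_budget']].groupby('category').head(1)
print(first_breach)
```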