Pandas - Cumulative Sum (cumsum)
Key Insights
- The `cumsum()` method calculates running totals across rows or columns, essential for tracking accumulated values like sales totals, running balances, or time-series aggregations
- Cumulative sums control missing-value handling through the `skipna` parameter and work across numeric data types including integers, floats, and timedeltas
- Group-wise cumulative calculations enable analytics like running totals per category, percentage contributions, and threshold-breach detection in segmented data
Basic Cumulative Sum Operations
The cumsum() method computes the cumulative sum of elements along a specified axis. By default, it operates on each column independently, returning a DataFrame or Series with the same shape as the input.
import pandas as pd
import numpy as np
# Simple Series cumulative sum
sales = pd.Series([100, 150, 200, 120, 180])
cumulative_sales = sales.cumsum()
print(cumulative_sales)
# 0 100
# 1 250
# 2 450
# 3 570
# 4 750
# DataFrame cumulative sum
df = pd.DataFrame({
'revenue': [1000, 1500, 1200, 1800],
'costs': [600, 800, 700, 900],
'profit': [400, 700, 500, 900]
})
cumulative_df = df.cumsum()
print(cumulative_df)
# revenue costs profit
# 0 1000 600 400
# 1 2500 1400 1100
# 2 3700 2100 1600
# 3 5500 3000 2500
The axis parameter controls direction: axis=0 (default) accumulates down each column, while axis=1 accumulates across each row.
# Cumulative sum across columns
df_horizontal = df.cumsum(axis=1)
print(df_horizontal)
# revenue costs profit
# 0 1000 1600 2000
# 1 1500 2300 3000
# 2 1200 1900 2400
# 3 1800 2700 3600
Handling Missing Values
The skipna parameter determines how cumsum() handles NaN values. When skipna=True (default), missing values are ignored and the cumulative sum continues. When False, any NaN propagates through all subsequent values.
# Data with missing values
data_with_nan = pd.Series([10, 20, np.nan, 30, 40])
# Default behavior: skip NaN
cumsum_skip = data_with_nan.cumsum()
print(cumsum_skip)
# 0 10.0
# 1 30.0
# 2 NaN
# 3 60.0
# 4 100.0
# Propagate NaN through results
cumsum_propagate = data_with_nan.cumsum(skipna=False)
print(cumsum_propagate)
# 0 10.0
# 1 30.0
# 2 NaN
# 3 NaN
# 4 NaN
For financial or scientific applications where missing data should halt accumulation, use skipna=False. For analytics where you want to continue despite gaps, the default behavior works best.
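A related pattern arises when a NaN should restart the running total rather than merely be skipped: label the stretches between NaNs with isna().cumsum() and accumulate within each label. A minimal sketch (the readings data here is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical readings where each NaN marks a reset point
readings = pd.Series([10, 20, np.nan, 30, 40])

# isna().cumsum() labels the segments 0, 0, 1, 1, 1 ...
segments = readings.isna().cumsum()
restarted = readings.groupby(segments).cumsum()
print(restarted)
# 0    10.0
# 1    30.0
# 2     NaN
# 3    30.0
# 4    70.0
```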
Group-wise Cumulative Sums
The real power of cumsum() emerges when combined with groupby(). This enables calculating running totals within categories, essential for customer lifetime value, departmental budgets, or product-line analytics.
# Sales data by region
sales_data = pd.DataFrame({
'region': ['North', 'North', 'South', 'North', 'South', 'South'],
'month': [1, 2, 1, 3, 2, 3],
'sales': [1000, 1200, 800, 1500, 900, 1100]
})
# Cumulative sum per region
sales_data['cumulative_sales'] = sales_data.groupby('region')['sales'].cumsum()
print(sales_data)
# region month sales cumulative_sales
# 0 North 1 1000 1000
# 1 North 2 1200 2200
# 2 South 1 800 800
# 3 North 3 1500 3700
# 4 South 2 900 1700
# 5 South 3 1100 2800
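One caveat: cumsum() accumulates in the order rows appear, so if rows are not already sorted chronologically within each group, sort first. A sketch reusing the same sales data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'region': ['North', 'North', 'South', 'North', 'South', 'South'],
    'month': [1, 2, 1, 3, 2, 3],
    'sales': [1000, 1200, 800, 1500, 900, 1100]
})

# Sort by group key and time key so the running total follows month order
ordered = sales_data.sort_values(['region', 'month'])
ordered['cumulative_sales'] = ordered.groupby('region')['sales'].cumsum()
print(ordered)
```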
Multiple grouping levels enable hierarchical cumulative calculations:
# Multi-level grouping
detailed_sales = pd.DataFrame({
'region': ['North', 'North', 'North', 'South', 'South', 'South'],
'product': ['A', 'B', 'A', 'A', 'B', 'A'],
'sales': [100, 150, 120, 90, 110, 95]
})
detailed_sales['cumsum'] = detailed_sales.groupby(['region', 'product'])['sales'].cumsum()
print(detailed_sales)
# region product sales cumsum
# 0 North A 100 100
# 1 North B 150 150
# 2 North A 120 220
# 3 South A 90 90
# 4 South B 110 110
# 5 South A 95 185
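The key insights above mention percentage contributions; combining the group-wise cumsum with transform('sum') gives a running percentage of each group's total. A sketch using the same detailed_sales data (the pct_of_group column name is illustrative):

```python
import pandas as pd

detailed_sales = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B', 'A'],
    'sales': [100, 150, 120, 90, 110, 95]
})

grouped = detailed_sales.groupby(['region', 'product'])['sales']
detailed_sales['cumsum'] = grouped.cumsum()
# transform('sum') broadcasts each group's total back onto its rows
detailed_sales['pct_of_group'] = (
    detailed_sales['cumsum'] / grouped.transform('sum') * 100
)
print(detailed_sales)
```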
Time Series Applications
Cumulative sums are fundamental for time-series analysis, particularly for calculating running totals, year-to-date figures, and detecting trend changes.
# Daily transaction data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
transactions = pd.DataFrame({
'date': dates,
'amount': [250, 180, 320, 290, 410, 220, 380, 270, 340, 300]
})
transactions.set_index('date', inplace=True)
# Running total
transactions['running_total'] = transactions['amount'].cumsum()
# Calculate percentage of total at each point
total_amount = transactions['amount'].sum()
transactions['pct_of_total'] = (transactions['running_total'] / total_amount) * 100
print(transactions.head())
#             amount  running_total  pct_of_total
# date
# 2024-01-01     250            250      8.445946
# 2024-01-02     180            430     14.527027
# 2024-01-03     320            750     25.337838
# 2024-01-04     290           1040     35.135135
# 2024-01-05     410           1450     48.986486
For monthly or yearly aggregations, resample first then apply cumulative sum:
# Monthly cumulative totals
monthly_data = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=90, freq='D'),
'value': np.random.randint(50, 200, 90)
})
monthly_data.set_index('date', inplace=True)
# Resample to month-end frequency, then cumsum
# ('ME' replaces the deprecated 'M' alias in recent pandas; use 'M' on older versions)
monthly_cumsum = monthly_data.resample('ME').sum().cumsum()
print(monthly_cumsum.head())
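For the year-to-date figures mentioned above, one approach is to group by the year of the DatetimeIndex so the running total restarts at each year boundary. An illustrative sketch with hypothetical daily data (the ytd column name is an assumption):

```python
import pandas as pd
import numpy as np

# Hypothetical daily values spanning two calendar years
daily = pd.DataFrame(
    {'value': np.arange(1, 731)},
    index=pd.date_range('2023-01-01', periods=730, freq='D')
)

# Grouping by index.year restarts the cumulative sum every January 1st
daily['ytd'] = daily.groupby(daily.index.year)['value'].cumsum()
print(daily.loc['2023-12-29':'2024-01-02'])
```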
Performance Considerations and Alternatives
For large datasets, cumsum() is highly optimized using NumPy’s underlying C implementations. However, certain patterns can impact performance.
# Efficient: vectorized cumsum on the entire column
df_large = pd.DataFrame({'values': np.random.randint(1, 100, 1000000)})
%timeit df_large['values'].cumsum()  # IPython magic; use the timeit module in plain scripts
# Less efficient: accumulating in a Python loop -- avoid this pattern
result = []
running_sum = 0
for val in df_large['values']:
    running_sum += val
    result.append(running_sum)
When you need conditional cumulative sums (accumulate only when certain conditions are met), combine boolean indexing with cumsum:
# Cumulative sum only for positive values
data = pd.Series([10, -5, 15, -3, 20, 8, -12, 25])
positive_mask = data > 0
# Create cumulative sum that only counts positive values
cumsum_positive = (data * positive_mask).cumsum()
print(cumsum_positive)
# 0 10
# 1 10
# 2 25
# 3 25
# 4 45
# 5 53
# 6 53
# 7 78
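An arguably clearer equivalent uses Series.where() to zero out the excluded values before accumulating:

```python
import pandas as pd

data = pd.Series([10, -5, 15, -3, 20, 8, -12, 25])

# where() keeps values meeting the condition and fills the rest with 0
cumsum_positive = data.where(data > 0, 0).cumsum()
print(cumsum_positive.tolist())
# [10, 10, 25, 25, 45, 53, 53, 78]
```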
Detecting Threshold Breaches
A practical application involves identifying when cumulative values exceed thresholds, useful for budget tracking, inventory management, or alert systems.
# Budget tracking with threshold detection
expenses = pd.DataFrame({
'category': ['Marketing', 'Marketing', 'IT', 'Marketing', 'IT', 'IT'],
'amount': [5000, 3000, 8000, 4000, 6000, 5000],
'budget_limit': [15000, 15000, 20000, 15000, 20000, 20000]
})
expenses['cumulative'] = expenses.groupby('category')['amount'].cumsum()
expenses['over_budget'] = expenses['cumulative'] > expenses['budget_limit']
expenses['remaining'] = expenses['budget_limit'] - expenses['cumulative']
print(expenses)
# category amount budget_limit cumulative over_budget remaining
# 0 Marketing 5000 15000 5000 False 10000
# 1 Marketing 3000 15000 8000 False 7000
# 2 IT 8000 20000 8000 False 12000
# 3 Marketing 4000 15000 12000 False 3000
# 4 IT 6000 20000 14000 False 6000
# 5 IT 5000 20000 19000 False 1000
This pattern enables real-time monitoring and early warning systems for resource consumption across multiple dimensions simultaneously.
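To act on breaches, you often want the first row where each category crosses its limit. A sketch with hypothetical data where the limits are actually exceeded (categories that never breach simply drop out of the result):

```python
import pandas as pd

# Hypothetical spending data in which both budgets are eventually exceeded
spend = pd.DataFrame({
    'category': ['Marketing', 'Marketing', 'IT', 'Marketing', 'IT'],
    'amount': [5000, 6000, 8000, 7000, 13000],
    'budget_limit': [15000, 15000, 20000, 15000, 20000]
})
spend['cumulative'] = spend.groupby('category')['amount'].cumsum()
spend['over_budget'] = spend['cumulative'] > spend['budget_limit']

# Keep only breaching rows, then take the first one per category
first_breach = spend[spend['over_budget']].groupby('category').head(1)
print(first_breach)
```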