NumPy - np.cumsum() and np.cumprod()
Key Insights
• np.cumsum() and np.cumprod() compute running totals and products across arrays, essential for time-series analysis, financial calculations, and statistical transformations
• Both functions support multi-dimensional operations with axis control, enabling column-wise or row-wise cumulative calculations on matrices
• Both run in linear O(n) time and handle large datasets efficiently, but each call allocates a fresh output array unless you reuse a buffer via the out parameter
Understanding Cumulative Operations
Cumulative operations transform sequences by computing running aggregates. np.cumsum() calculates cumulative sums, where each output element is the sum of all input elements up to and including that position. np.cumprod() does the same for products. These operations appear frequently in financial modeling (running balances, compound returns), signal processing (discrete integration), and probability (cumulative distribution functions).
```python
import numpy as np

# Basic cumulative sum
arr = np.array([1, 2, 3, 4, 5])
cumsum = np.cumsum(arr)
print(f"Original: {arr}")
print(f"Cumsum: {cumsum}")
# Original: [1 2 3 4 5]
# Cumsum: [ 1  3  6 10 15]

# Basic cumulative product
cumprod = np.cumprod(arr)
print(f"Cumprod: {cumprod}")
# Cumprod: [  1   2   6  24 120]
```
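A useful sanity check, added here as an aside: np.diff with prepend=0 inverts np.cumsum, recovering the original increments.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
cumsum = np.cumsum(arr)

# np.diff undoes the running total; prepend=0 preserves the first element
recovered = np.diff(cumsum, prepend=0)
print(recovered)
# [1 2 3 4 5]
```

This round trip is handy when debugging pipelines that interleave differencing and accumulation.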
Working with Multi-Dimensional Arrays
The axis parameter controls the dimension along which the cumulative operation runs: axis=0 accumulates down each column, axis=1 accumulates across each row, and the default axis=None flattens the array first.
```python
# 2D array operations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Cumsum along axis=0 (accumulates down each column)
cumsum_axis0 = np.cumsum(matrix, axis=0)
print("Cumsum axis=0 (down columns):")
print(cumsum_axis0)
# [[ 1  2  3]
#  [ 5  7  9]
#  [12 15 18]]

# Cumsum along axis=1 (accumulates across each row)
cumsum_axis1 = np.cumsum(matrix, axis=1)
print("\nCumsum axis=1 (across rows):")
print(cumsum_axis1)
# [[ 1  3  6]
#  [ 4  9 15]
#  [ 7 15 24]]

# Flattened cumsum (axis=None, the default)
cumsum_flat = np.cumsum(matrix)
print(f"\nCumsum flattened: {cumsum_flat}")
# [ 1  3  6 10 15 21 28 36 45]
```
Financial Time Series Applications
Cumulative operations excel at calculating running balances, portfolio values, and compound returns. Here’s a practical implementation for financial analysis:
```python
# Daily returns to cumulative portfolio value
initial_investment = 10000
daily_returns = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02])

# Method 1: Using cumprod for compound returns
growth_factors = 1 + daily_returns
cumulative_growth = np.cumprod(growth_factors)
portfolio_values = initial_investment * cumulative_growth
print("Portfolio values:")
print(portfolio_values)
# [10100.  9898.  10046.47  10347.8641  10244.385459  10449.27316818]

# Method 2: Calculate drawdown (distance from running peak)
cummax = np.maximum.accumulate(portfolio_values)
drawdown = (portfolio_values - cummax) / cummax * 100
print("\nDrawdown percentages:")
print(drawdown)
# [ 0.   -2.   -0.53  0.   -1.    0.  ]

# Running profit/loss
daily_pnl = np.diff(portfolio_values, prepend=initial_investment)
cumulative_pnl = np.cumsum(daily_pnl)
print(f"\nCumulative P&L: {cumulative_pnl[-1]:.2f}")
# Cumulative P&L: 449.27
```
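An equivalent formulation, shown here as an optional sketch: compounding via cumprod is mathematically the same as summing log-returns with cumsum and exponentiating, a form often preferred for numerical stability over long horizons.

```python
import numpy as np

initial_investment = 10000
daily_returns = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02])

# exp(cumsum(log(1 + r))) == cumprod(1 + r)
log_growth = np.log1p(daily_returns)  # log(1 + r), accurate for small r
portfolio_log = initial_investment * np.exp(np.cumsum(log_growth))
portfolio_prod = initial_investment * np.cumprod(1 + daily_returns)

print(np.allclose(portfolio_log, portfolio_prod))
# True
```

The log-space form also makes annualization and averaging of returns a simple linear operation.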
Data Type Handling and Overflow
For cumulative operations, NumPy upcasts small integer inputs to the default platform integer, but np.multiply.accumulate (and an explicitly requested small dtype) keeps the input type and can overflow silently. Specify dtype explicitly when in doubt:
```python
# Integer overflow example: np.multiply.accumulate keeps the input dtype
small_ints = np.array([100, 100, 100, 100], dtype=np.int8)
unsafe_cumprod = np.multiply.accumulate(small_ints)
print(f"Unsafe (int8): {unsafe_cumprod}")
# Unsafe (int8): [100  16  64   0]  <- silent wraparound!

# np.cumprod upcasts small integers to the platform integer by default;
# an explicit dtype makes the intent unambiguous
safe_cumprod = np.cumprod(small_ints, dtype=np.int64)
print(f"Safe (int64): {safe_cumprod}")
# Safe (int64): [      100     10000   1000000 100000000]

# Float precision considerations
precise_values = np.array([1.1, 2.2, 3.3, 4.4], dtype=np.float64)
cumprod_f64 = np.cumprod(precise_values)
print(f"Float64 precision: {cumprod_f64}")

# Using float32 loses precision
cumprod_f32 = np.cumprod(precise_values, dtype=np.float32)
print(f"Float32 precision: {cumprod_f32}")
```
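Even float64 overflows to inf once a running product passes roughly 1e308. One workaround, sketched here as an aside rather than taken from the examples above, is to accumulate in log space:

```python
import numpy as np

values = np.array([1e100, 1e100, 1e100, 1e100])

# Direct cumprod overflows float64 to inf past ~1e308
direct = np.cumprod(values)
print(direct)  # the last element is inf

# Summing logarithms stays finite and preserves the magnitudes
log_cumprod = np.cumsum(np.log10(values))
print(log_cumprod)  # approximately [100. 200. 300. 400.], i.e. 10**these
```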
Performance Optimization with Output Arrays
Pre-allocating output arrays reduces memory allocations and improves performance in loops or repeated calculations:
```python
# Pre-allocated output array
data = np.random.rand(1000000)
output = np.empty_like(data)

# The out parameter writes results into the existing buffer
np.cumsum(data, out=output)

# Benchmark comparison
import time

# Without pre-allocation
start = time.perf_counter()
for _ in range(100):
    result = np.cumsum(data)
elapsed_new = time.perf_counter() - start

# With pre-allocation
output = np.empty_like(data)
start = time.perf_counter()
for _ in range(100):
    np.cumsum(data, out=output)
elapsed_reuse = time.perf_counter() - start

print(f"New allocation: {elapsed_new:.4f}s")
print(f"Reused array: {elapsed_reuse:.4f}s")
print(f"Speedup: {elapsed_new/elapsed_reuse:.2f}x")
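When the original values are no longer needed, the out parameter can even target the input array itself, making the operation fully in-place. A minimal sketch (note that it overwrites the input):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Writing into the input avoids any extra allocation,
# at the cost of destroying the original values
np.cumsum(data, out=data)
print(data)
# [ 1.  3.  6. 10.]
```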
Statistical Applications
Cumulative operations enable efficient computation of moving statistics and probability distributions:
```python
# Empirical Cumulative Distribution Function (ECDF)
data = np.random.normal(loc=50, scale=10, size=1000)
sorted_data = np.sort(data)
cumulative_prob = np.arange(1, len(sorted_data) + 1) / len(sorted_data)

# Find percentiles from the cumulative probabilities
def find_percentile(sorted_data, cum_prob, percentile):
    idx = np.searchsorted(cum_prob, percentile / 100)
    return sorted_data[idx]

p50 = find_percentile(sorted_data, cumulative_prob, 50)
p95 = find_percentile(sorted_data, cumulative_prob, 95)
print(f"50th percentile: {p50:.2f}")
print(f"95th percentile: {p95:.2f}")

# Weighted cumulative sums for a running weighted average
prices = np.array([100, 102, 101, 105, 107, 106])
weights = np.array([1, 1, 1, 2, 2, 2])
weighted_sum = np.cumsum(prices * weights)
weight_sum = np.cumsum(weights)
weighted_avg = weighted_sum / weight_sum
print(f"Running weighted average: {weighted_avg}")
# [100. 101. 101. 102.6 103.85714286 104.33333333]
```
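The cumsum trick also yields fixed-width rolling means without Python loops. This sketch (not one of the original examples) differences the running total at the window edges:

```python
import numpy as np

def rolling_mean(x, window):
    """Fixed-width moving average via cumulative sums."""
    c = np.cumsum(x, dtype=float)
    # sum of window ending at i = cumsum[i] - cumsum[i - window]
    window_sums = c[window - 1:].copy()
    window_sums[1:] -= c[:-window]
    return window_sums / window

prices = np.array([100, 102, 101, 105, 107, 106])
print(rolling_mean(prices, 3))
# [101.         102.66666667 104.33333333 106.        ]
```

For very long arrays, note that a single long cumsum can accumulate floating-point error; pandas' rolling machinery uses more careful summation when that matters.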
Handling Missing Data and Masking
NumPy's masked arrays work with cumulative operations, skipping masked entries instead of propagating NaN:
```python
# Data with NaN values
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# Standard cumsum propagates NaN
standard_cumsum = np.cumsum(data_with_nan)
print(f"Standard: {standard_cumsum}")
# [ 1.  3. nan nan nan]

# Using masked arrays
masked_data = np.ma.masked_invalid(data_with_nan)
masked_cumsum = np.cumsum(masked_data)
print(f"Masked: {masked_cumsum}")
# [1.0 3.0 -- 7.0 12.0]

# Alternative: a custom implementation that skips NaN
def cumsum_skipnan(arr):
    result = np.empty_like(arr)
    cumulative = 0.0
    for i, val in enumerate(arr):
        if not np.isnan(val):
            cumulative += val
        result[i] = cumulative
    return result

skipnan_result = cumsum_skipnan(data_with_nan)
print(f"Skip NaN: {skipnan_result}")
# [ 1.  3.  3.  7. 12.]
```
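NumPy also ships built-ins for this (since 1.12): np.nancumsum treats NaN as zero and np.nancumprod treats it as one, making the manual loop above unnecessary for the common case. A brief sketch:

```python
import numpy as np

data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# NaN is treated as 0 for sums and 1 for products
print(np.nancumsum(data_with_nan))   # [ 1.  3.  3.  7. 12.]
print(np.nancumprod(data_with_nan))  # [ 1.  2.  2.  8. 40.]
```

Both accept the same axis and dtype arguments as their non-NaN counterparts.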
Integration with Pandas for Time Series
While NumPy provides the computational engine, combining with Pandas adds datetime indexing and business logic:
```python
import pandas as pd

# Time series with cumulative operations
dates = pd.date_range('2024-01-01', periods=6, freq='D')
sales = np.array([100, 150, 120, 200, 180, 220])
df = pd.DataFrame({
    'sales': sales,
    'cumulative_sales': np.cumsum(sales),
    'running_product': np.cumprod(1 + sales / 1000)
}, index=dates)
print(df)
#             sales  cumulative_sales  running_product
# 2024-01-01    100               100         1.100000
# 2024-01-02    150               250         1.265000
# 2024-01-03    120               370         1.416800
# 2024-01-04    200               570         1.700160
# 2024-01-05    180               750         2.006189
# 2024-01-06    220               970         2.447550
```
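Pandas also exposes cumulative methods directly on Series and DataFrame; unlike np.cumsum, they skip NaN by default (skipna=True), which is usually what time-series code wants. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 150.0, np.nan, 200.0])

# NaN stays NaN in place, but accumulation continues past it
print(s.cumsum().tolist())
# [100.0, 250.0, nan, 450.0]
```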
These cumulative operations form the foundation for complex analytical pipelines. Understanding axis control, dtype management, and performance characteristics enables efficient implementation of financial models, statistical analysis, and time-series processing at scale.