NumPy - np.cumsum() and np.cumprod()

• `np.cumsum()` and `np.cumprod()` compute running totals and products across arrays, essential for time-series analysis, financial calculations, and statistical transformations

Key Insights

• Both functions support multi-dimensional operations with axis control, enabling column-wise or row-wise cumulative calculations on matrices
• Performance scales linearly, O(n), and handles large datasets efficiently, but memory usage doubles when a fresh output array is allocated instead of reusing one via the out parameter

Understanding Cumulative Operations

Cumulative operations transform sequences by computing running aggregates. np.cumsum() calculates cumulative sums, where each output element is the sum of all input elements up to and including that position. np.cumprod() does the same for products. These operations appear frequently in financial modeling (running balances, compound returns), signal processing (numerical integration), and probability and statistics (cumulative distribution functions).

import numpy as np

# Basic cumulative sum
arr = np.array([1, 2, 3, 4, 5])
cumsum = np.cumsum(arr)
print(f"Original: {arr}")
print(f"Cumsum:   {cumsum}")
# Original: [1 2 3 4 5]
# Cumsum:   [ 1  3  6 10 15]

# Basic cumulative product
cumprod = np.cumprod(arr)
print(f"Cumprod:  {cumprod}")
# Cumprod:  [  1   2   6  24 120]
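
The integration use case mentioned above can be made concrete with a minimal sketch: multiplying samples by the step size and accumulating with np.cumsum approximates a running integral (the function and step size here are illustrative).

```python
import numpy as np

# Approximate the running integral of f(t) = 2t on [0, 1]
# (the exact antiderivative is t**2, so the total should be near 1.0)
dt = 0.001
t = np.arange(0, 1, dt)
f = 2 * t

# Left Riemann sum accumulated with cumsum
running_integral = np.cumsum(f) * dt

# The last element approximates the definite integral over [0, 1]
print(running_integral[-1])  # close to 1.0
```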

Working with Multi-Dimensional Arrays

The axis parameter controls the dimension along which the accumulation runs. For a 2-D array, axis=0 accumulates down each column, axis=1 accumulates across each row, and the default axis=None flattens the array first.

# 2D array operations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Cumulative sum down each column (axis=0)
cumsum_axis0 = np.cumsum(matrix, axis=0)
print("Cumsum axis=0 (down columns):")
print(cumsum_axis0)
# [[ 1  2  3]
#  [ 5  7  9]
#  [12 15 18]]

# Cumulative sum across each row (axis=1)
cumsum_axis1 = np.cumsum(matrix, axis=1)
print("\nCumsum axis=1 (across rows):")
print(cumsum_axis1)
# [[ 1  3  6]
#  [ 4  9 15]
#  [ 7 15 24]]

# Flattened cumsum
cumsum_flat = np.cumsum(matrix)
print(f"\nCumsum flattened: {cumsum_flat}")
# [ 1  3  6 10 15 21 28 36 45]
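
A related pattern not shown above: a reverse (suffix) cumulative sum, where each element holds the sum of itself and everything after it, can be built by flipping before and after the accumulation. This is a sketch of the idiom, not a dedicated NumPy function.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Reverse (suffix) cumulative sum: flip, accumulate, flip back
suffix_sum = np.flip(np.cumsum(np.flip(arr)))
print(suffix_sum)
# [15 14 12  9  5]
```

The same trick works on matrices by passing the same axis to np.flip and np.cumsum.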

Financial Time Series Applications

Cumulative operations excel at calculating running balances, portfolio values, and compound returns. Here’s a practical implementation for financial analysis:

# Daily returns to cumulative portfolio value
initial_investment = 10000
daily_returns = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02])

# Method 1: Using cumprod for compound returns
growth_factors = 1 + daily_returns
cumulative_growth = np.cumprod(growth_factors)
portfolio_values = initial_investment * cumulative_growth

print("Portfolio values:")
print(portfolio_values)
# [10100.  9898.  10046.47  10347.8641  10244.3855  10449.2732]  (rounded)

# Method 2: Calculate drawdown (distance from peak)
cummax = np.maximum.accumulate(portfolio_values)
drawdown = (portfolio_values - cummax) / cummax * 100

print("\nDrawdown percentages:")
print(drawdown)
# [ 0.  -2.  -0.53  0.  -1.  0.]  (approximately)

# Running profit/loss
daily_pnl = np.diff(portfolio_values, prepend=initial_investment)
cumulative_pnl = np.cumsum(daily_pnl)
print(f"\nCumulative P&L: {cumulative_pnl[-1]:.2f}")
# Cumulative P&L: 449.27
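
An equivalent formulation, common in quantitative work, replaces the running product with a running sum of log returns: exponentiating the cumulative log return recovers the same growth path. A sketch using the same data:

```python
import numpy as np

initial_investment = 10_000
daily_returns = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02])

# cumprod of growth factors and cumsum of log growth factors agree
growth_factors = 1 + daily_returns
via_cumprod = initial_investment * np.cumprod(growth_factors)
via_logsum = initial_investment * np.exp(np.cumsum(np.log(growth_factors)))

print(np.allclose(via_cumprod, via_logsum))  # True
```

The log form turns compounding into addition, which is numerically convenient over long horizons.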

Data Type Handling and Overflow

NumPy accumulates in the input's dtype by default, with one exception: integer inputs narrower than the default platform integer are promoted automatically, so small integer types rarely overflow on their own. Overflow can still occur at the accumulator's own width, so specify dtype explicitly when a running total or product may exceed it:

# Narrow integers are promoted to the platform integer automatically
small_ints = np.array([100, 100, 100, 100], dtype=np.int8)
print(f"Promoted: {np.cumprod(small_ints)}")
# Promoted: [      100     10000   1000000 100000000]

# Forcing the narrow dtype shows the overflow that promotion prevents
print(f"Forced int8: {np.cumprod(small_ints, dtype=np.int8)}")
# Forced int8: [100  16  64   0]  # Silent wraparound!

# int64 itself overflows once the running product exceeds ~9.2e18
big_ints = np.full(4, 10**6, dtype=np.int64)
print(f"int64:   {np.cumprod(big_ints)}")  # last element wraps around
print(f"float64: {np.cumprod(big_ints, dtype=np.float64)}")
# float64: [1.e+06 1.e+12 1.e+18 1.e+24]

# Float precision considerations
precise_values = np.array([1.1, 2.2, 3.3, 4.4], dtype=np.float64)
cumprod_f64 = np.cumprod(precise_values)
print(f"Float64 precision: {cumprod_f64}")

# Using float32 loses precision
cumprod_f32 = np.cumprod(precise_values, dtype=np.float32)
print(f"Float32 precision: {cumprod_f32}")

Performance Optimization with Output Arrays

Pre-allocating output arrays reduces memory allocations and improves performance in loops or repeated calculations:

# Pre-allocated output array
data = np.random.rand(1000000)
output = np.empty_like(data)

# Using out parameter
np.cumsum(data, out=output)

# Benchmark comparison
import time

# Without pre-allocation
start = time.perf_counter()
for _ in range(100):
    result = np.cumsum(data)
elapsed_new = time.perf_counter() - start

# With pre-allocation
output = np.empty_like(data)
start = time.perf_counter()
for _ in range(100):
    np.cumsum(data, out=output)
elapsed_reuse = time.perf_counter() - start

print(f"New allocation: {elapsed_new:.4f}s")
print(f"Reused array:   {elapsed_reuse:.4f}s")
print(f"Speedup: {elapsed_new/elapsed_reuse:.2f}x")
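
The linear-scaling claim can also be made concrete: computing every prefix sum naively with one reduction per prefix is O(n²), while a single np.cumsum pass is O(n). A sketch (timings vary by machine; the array size is arbitrary):

```python
import numpy as np
import time

data = np.random.rand(20_000)

# O(n^2): one full reduction per prefix
start = time.perf_counter()
naive = np.array([data[:i + 1].sum() for i in range(len(data))])
naive_time = time.perf_counter() - start

# O(n): a single accumulation pass
start = time.perf_counter()
fast = np.cumsum(data)
fast_time = time.perf_counter() - start

print(np.allclose(naive, fast))  # True (up to float rounding)
print(f"naive: {naive_time:.4f}s, cumsum: {fast_time:.6f}s")
```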

Statistical Applications

Cumulative operations enable efficient computation of moving statistics and probability distributions:

# Empirical Cumulative Distribution Function (ECDF)
data = np.random.normal(loc=50, scale=10, size=1000)
sorted_data = np.sort(data)
cumulative_prob = np.arange(1, len(sorted_data) + 1) / len(sorted_data)

# Look up percentiles from the ECDF
def find_percentile(sorted_data, cum_prob, percentile):
    idx = np.searchsorted(cum_prob, percentile / 100)
    return sorted_data[idx]

p50 = find_percentile(sorted_data, cumulative_prob, 50)
p95 = find_percentile(sorted_data, cumulative_prob, 95)
print(f"50th percentile: {p50:.2f}")
print(f"95th percentile: {p95:.2f}")
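
As a sanity check on the ECDF-based lookup, np.percentile computes the same quantile directly; it interpolates between order statistics, so the two values agree only approximately on sampled data. A sketch with a seeded generator for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)
sorted_data = np.sort(data)
cumulative_prob = np.arange(1, len(sorted_data) + 1) / len(sorted_data)

# ECDF lookup vs. NumPy's direct quantile computation
idx = np.searchsorted(cumulative_prob, 0.95)
ecdf_p95 = sorted_data[idx]
direct_p95 = np.percentile(data, 95)
print(ecdf_p95, direct_p95)  # close, but not identical
```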

# Weighted running (expanding) average via two cumulative sums
prices = np.array([100, 102, 101, 105, 107, 106])
weights = np.array([1, 1, 1, 2, 2, 2])

weighted_sum = np.cumsum(prices * weights)
weight_sum = np.cumsum(weights)
weighted_avg = weighted_sum / weight_sum

print(f"Weighted running average: {weighted_avg}")
# [100. 101. 101. 102.6 103.85714286 104.33333333]

Handling Missing Data and Masking

NumPy’s masked arrays work with cumulative operations, properly handling NaN and missing values:

# Data with NaN values
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# Standard cumsum propagates NaN
standard_cumsum = np.cumsum(data_with_nan)
print(f"Standard: {standard_cumsum}")
# [1. 3. nan nan nan]

# Using masked arrays
masked_data = np.ma.masked_invalid(data_with_nan)
masked_cumsum = np.cumsum(masked_data)
print(f"Masked: {masked_cumsum}")
# [1.0 3.0 -- 7.0 12.0]

# Alternative: skip NaN values with a custom running sum
def cumsum_skipnan(arr):
    result = np.empty_like(arr)
    cumulative = 0
    for i, val in enumerate(arr):
        if not np.isnan(val):
            cumulative += val
        result[i] = cumulative
    return result

skipnan_result = cumsum_skipnan(data_with_nan)
print(f"Skip NaN: {skipnan_result}")
# [1. 3. 3. 7. 12.]
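
NumPy also ships built-ins for exactly this: np.nancumsum treats NaN as zero and np.nancumprod treats it as one, matching the skip-NaN behavior above without a Python loop.

```python
import numpy as np

data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# NaN contributes 0 to the running sum, 1 to the running product
print(np.nancumsum(data_with_nan))   # [ 1.  3.  3.  7. 12.]
print(np.nancumprod(data_with_nan))  # [ 1.  2.  2.  8. 40.]
```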

Integration with Pandas for Time Series

While NumPy provides the computational engine, combining with Pandas adds datetime indexing and business logic:

import pandas as pd

# Time series with cumulative operations
dates = pd.date_range('2024-01-01', periods=6, freq='D')
sales = np.array([100, 150, 120, 200, 180, 220])

df = pd.DataFrame({
    'sales': sales,
    'cumulative_sales': np.cumsum(sales),
    'running_product': np.cumprod(1 + sales/1000)
}, index=dates)

print(df)
#             sales  cumulative_sales  running_product
# 2024-01-01    100               100         1.100000
# 2024-01-02    150               250         1.265000
# 2024-01-03    120               370         1.416800
# 2024-01-04    200               570         1.700160
# 2024-01-05    180               750         2.006189
# 2024-01-06    220               970         2.447550
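
Pandas also exposes these as Series/DataFrame methods, so the same columns can be built without calling NumPy directly. Note one behavioral difference: pandas' .cumsum() and .cumprod() skip NaN by default, whereas np.cumsum propagates it.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=6, freq='D')
sales = pd.Series([100, 150, 120, 200, 180, 220], index=dates)

# Pandas-native equivalents of the NumPy calls above
cumulative_sales = sales.cumsum()
running_product = (1 + sales / 1000).cumprod()

print(cumulative_sales.iloc[-1])  # 970
```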

These cumulative operations form the foundation for complex analytical pipelines. Understanding axis control, dtype management, and performance characteristics enables efficient implementation of financial models, statistical analysis, and time-series processing at scale.
