R dplyr - lag() and lead() Functions
Key Insights
• The lag() and lead() functions shift values within a vector by a specified number of positions, essential for time-series analysis, calculating differences between consecutive rows, and comparing current values with previous or future observations.
• Both functions preserve vector length by filling shifted positions with NA by default, though you can specify custom default values—critical when working with grouped data where you need explicit control over boundary conditions.
• When combined with group_by(), these functions operate within each group independently, making them powerful for panel data analysis where you need to compare observations within the same entity over time without cross-contamination between groups.
Basic Syntax and Behavior
The lag() function shifts values backward (retrieving previous values), while lead() shifts values forward (retrieving next values). Both are part of the dplyr package and work seamlessly within data pipelines. Be aware that dplyr's lag() masks base R's stats::lag(), which behaves quite differently (it operates on time-series objects), so write dplyr::lag() explicitly if there is any risk of ambiguity.
library(dplyr)
# Basic vector operations
values <- c(10, 20, 30, 40, 50)
lag(values) # [NA, 10, 20, 30, 40]
lag(values, 2) # [NA, NA, 10, 20, 30]
lead(values) # [20, 30, 40, 50, NA]
lead(values, 2) # [30, 40, 50, NA, NA]
The n parameter controls how many positions to shift. The default parameter specifies what value to use for positions that fall outside the vector boundaries:
lag(values, default = 0) # [0, 10, 20, 30, 40]
lead(values, default = 99) # [20, 30, 40, 50, 99]
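Both functions also accept an order_by argument, which shifts values according to another variable rather than physical row position. A minimal sketch (the data frame here is made up for illustration):

```r
library(dplyr)

# Rows are deliberately out of chronological order
df <- tibble(
  year  = c(2022, 2020, 2021),
  sales = c(300, 100, 200)
)

# order_by = year shifts by chronological order, not row order:
# prev_sales is NA for 2020, 100 for 2021, and 200 for 2022
res <- df %>%
  mutate(prev_sales = lag(sales, order_by = year))
print(res)
```

This avoids a silent bug where lag() on unsorted data returns the previous *row*, not the previous *period*.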
Calculating Period-over-Period Changes
A common use case involves calculating differences or percentage changes between consecutive observations. This is fundamental for financial analysis, sales metrics, and any time-series data.
sales_data <- tibble(
month = 1:12,
revenue = c(100, 120, 115, 130, 145, 140, 160, 175, 170, 180, 195, 200)
)
sales_analysis <- sales_data %>%
mutate(
prev_revenue = lag(revenue),
revenue_change = revenue - lag(revenue),
pct_change = (revenue - lag(revenue)) / lag(revenue) * 100,
next_revenue = lead(revenue),
will_increase = lead(revenue) > revenue
)
print(sales_analysis)
This produces a data frame showing each month's revenue alongside its previous value, the absolute change, the percentage change, and whether the next month's revenue will be higher.
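If the leading NA in the first row is unwanted, the default argument can anchor the first period to itself, so the change column starts at 0. A short sketch on a reduced version of the data:

```r
library(dplyr)

sales_data <- tibble(
  month   = 1:3,
  revenue = c(100, 120, 115)
)

# default = first(revenue) treats month 1 as its own baseline,
# so revenue_change is 0 rather than NA in the first row
result <- sales_data %>%
  mutate(revenue_change = revenue - lag(revenue, default = first(revenue)))
print(result)
```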
Working with Grouped Data
The real power of lag() and lead() emerges when working with grouped data. These functions respect group boundaries, ensuring that comparisons only occur within the same group.
customer_orders <- tibble(
customer_id = rep(c("A", "B", "C"), each = 4),
order_date = rep(seq.Date(as.Date("2024-01-01"), by = "month", length.out = 4), 3),
order_value = c(100, 150, 120, 180, # Customer A
200, 220, 210, 250, # Customer B
80, 90, 85, 100) # Customer C
)
customer_analysis <- customer_orders %>%
group_by(customer_id) %>%
arrange(customer_id, order_date) %>%
mutate(
prev_order = lag(order_value),
days_since_last = as.numeric(order_date - lag(order_date)),
value_change = order_value - lag(order_value),
avg_last_two = (order_value + lag(order_value)) / 2,
next_order_higher = lead(order_value) > order_value
) %>%
ungroup()
Notice how the first order for each customer has NA for prev_order—the function doesn’t pull values from the previous group.
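To see why group_by() matters here, compare the same calculation with and without grouping. Without it, the last order of one customer leaks into the first row of the next (a minimal sketch with made-up data):

```r
library(dplyr)

orders <- tibble(
  customer_id = rep(c("A", "B"), each = 2),
  order_value = c(100, 150, 200, 220)
)

# Without group_by(): customer A's last order (150) leaks into
# customer B's first row
no_groups <- orders %>%
  mutate(prev = lag(order_value))

# With group_by(): each customer starts fresh with NA
by_group <- orders %>%
  group_by(customer_id) %>%
  mutate(prev = lag(order_value)) %>%
  ungroup()
```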
Multi-Period Lags and Leads
For more complex analyses, you might need to compare values across multiple periods simultaneously. This is common in seasonal analysis or when implementing moving average strategies.
stock_prices <- tibble(
date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
price = c(100, 102, 101, 105, 107, 106, 110, 112, 111, 115)
)
technical_analysis <- stock_prices %>%
mutate(
lag_1 = lag(price, 1),
lag_2 = lag(price, 2),
lag_3 = lag(price, 3),
ma_3 = (price + lag(price, 1) + lag(price, 2)) / 3,
price_momentum = price - lag(price, 3),
lead_1 = lead(price, 1),
lead_2 = lead(price, 2),
future_gain = lead(price, 2) - price
)
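Typing lag_1, lag_2, lag_3 by hand quickly becomes tedious when many lags are needed. One possible approach (a plain-loop sketch, not the only idiom) generates the columns programmatically:

```r
library(dplyr)

stock_prices <- tibble(
  date  = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
  price = c(100, 102, 101, 105, 107, 106, 110, 112, 111, 115)
)

# Generate lag_1 .. lag_3 in a loop instead of writing each by hand
for (k in 1:3) {
  stock_prices[[paste0("lag_", k)]] <- lag(stock_prices$price, k)
}
```

Note that this plain loop ignores grouping; for grouped data you would need to stay inside a group-aware mutate().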
Conditional Logic with Lag and Lead
Combining these functions with conditional statements enables sophisticated pattern detection and rule-based analysis.
temperature_data <- tibble(
day = 1:15,
temp = c(20, 22, 21, 25, 27, 26, 24, 23, 25, 28, 30, 29, 27, 26, 28)
)
weather_patterns <- temperature_data %>%
mutate(
warming_trend = temp > lag(temp) & lag(temp) > lag(temp, 2),
cooling_trend = temp < lag(temp) & lag(temp) < lag(temp, 2),
peak = temp > lag(temp) & temp > lead(temp),
valley = temp < lag(temp) & temp < lead(temp),
stable = abs(temp - lag(temp)) <= 1 & abs(temp - lead(temp)) <= 1
)
This identifies various temperature patterns: consecutive warming or cooling trends, local peaks and valleys, and stable periods.
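Beyond flagging single-row patterns, lag() can also drive streak counting, e.g. how many consecutive days the temperature has been rising. A sketch using the common cumsum run-id trick (the shortened data here is for illustration):

```r
library(dplyr)

temperature_data <- tibble(
  day  = 1:8,
  temp = c(20, 22, 21, 25, 27, 26, 24, 25)
)

streaks <- temperature_data %>%
  mutate(
    rising = coalesce(temp > lag(temp), FALSE),
    # cumsum(!rising) produces a run id that stays constant during a
    # rising streak; counting rising days within each run gives the
    # streak length so far (0 on non-rising days)
    streak = ave(as.integer(rising), cumsum(!rising), FUN = cumsum)
  )
```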
Handling Missing Values and Edge Cases
When your data contains NA values, lag() and lead() propagate them through calculations. You need explicit handling strategies:
incomplete_data <- tibble(
id = 1:8,
value = c(10, NA, 30, 40, NA, 60, 70, 80)
)
handled_data <- incomplete_data %>%
mutate(
# Standard lag propagates NA
simple_lag = lag(value),
# Calculate change only when both values exist
change = if_else(!is.na(value) & !is.na(lag(value)),
value - lag(value),
NA_real_),
# Fill with the previous value when the current one is NA
# (the previous value may itself be NA)
filled = if_else(is.na(value), lag(value), value),
# Flag rows where both the current and previous values exist
has_history = !is.na(value) & !is.na(lag(value))
)
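Note that the lag()-based fill above only looks one row back, so a run of several consecutive NAs is not fully repaired. When you want to carry the last observed value forward through arbitrary gaps, tidyr::fill() is the purpose-built tool:

```r
library(dplyr)
library(tidyr)

incomplete_data <- tibble(
  id    = 1:6,
  value = c(10, NA, NA, 40, NA, 60)
)

# fill() carries the last non-NA value forward through runs of NAs,
# unlike a single lag() which only looks one row back
filled_data <- incomplete_data %>%
  fill(value, .direction = "down")
```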
Performance Considerations with Large Datasets
For large datasets, these functions are highly optimized, but combining multiple lag/lead operations with complex grouping can impact performance. Consider computing intermediate results:
# Less efficient: repeated lag calculations
inefficient <- large_data %>%
group_by(category) %>%
mutate(
calc1 = value - lag(value),
calc2 = (value - lag(value)) / lag(value),
calc3 = value + lag(value)
)
# More efficient: compute lag once
efficient <- large_data %>%
group_by(category) %>%
mutate(
prev_value = lag(value),
calc1 = value - prev_value,
calc2 = (value - prev_value) / prev_value,
calc3 = value + prev_value
)
Practical Application: Customer Retention Analysis
Here’s a complete example analyzing customer purchase patterns to identify retention risks:
library(dplyr)
library(lubridate)
set.seed(42)  # make the simulated purchase amounts reproducible
purchases <- tibble(
customer_id = rep(1:5, each = 6),
purchase_date = rep(seq.Date(as.Date("2024-01-01"), by = "month", length.out = 6), 5),
amount = runif(30, 50, 500)
)
retention_analysis <- purchases %>%
group_by(customer_id) %>%
arrange(customer_id, purchase_date) %>%
mutate(
prev_amount = lag(amount),
amount_change_pct = (amount - prev_amount) / prev_amount * 100,
declining_spend = amount < prev_amount & prev_amount < lag(amount, 2),
next_amount = lead(amount),
will_churn = is.na(lead(amount, 2)), # No purchase 2 months ahead
avg_spend_trend = (amount + lag(amount) + lag(amount, 2)) / 3,
risk_score = case_when(
declining_spend & amount < 100 ~ "High Risk",
amount_change_pct < -20 ~ "Medium Risk",
TRUE ~ "Low Risk"
)
) %>%
ungroup()
This analysis identifies customers with declining spending patterns, calculates trend indicators, and assigns risk scores based on multiple factors—all made possible through strategic use of lag() and lead() functions.