R dplyr - lag() and lead() Functions
Key Insights
• The lag() and lead() functions shift values within a vector by a specified number of positions, essential for time-series analysis, calculating differences between consecutive rows, and comparing current values with previous or future observations.
• Both functions preserve vector length by filling shifted positions with NA by default, though you can specify custom default values—critical when working with grouped data where you need explicit control over boundary conditions.
• When combined with group_by(), these functions operate within each group independently, making them powerful for panel data analysis where you need to compare observations within the same entity over time without cross-contamination between groups.
Basic Syntax and Behavior
The lag() function shifts values backward (retrieving previous values), while lead() shifts values forward (retrieving next values). Both are part of the dplyr package and work seamlessly within data pipelines. Be aware that dplyr's lag() masks base R's stats::lag(), which behaves quite differently (it operates on time-series objects), so write dplyr::lag() explicitly if there is any risk of ambiguity.
library(dplyr)
# Basic vector operations
values <- c(10, 20, 30, 40, 50)
lag(values) # [NA, 10, 20, 30, 40]
lag(values, 2) # [NA, NA, 10, 20, 30]
lead(values) # [20, 30, 40, 50, NA]
lead(values, 2) # [30, 40, 50, NA, NA]
The n parameter controls how many positions to shift. The default parameter specifies what value to use for positions that fall outside the vector boundaries:
lag(values, default = 0) # [0, 10, 20, 30, 40]
lead(values, default = 99) # [20, 30, 40, 50, 99]
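Both functions also accept an order_by argument, which shifts values according to another variable rather than physical row position. A minimal sketch (the data frame here is made up for illustration):

```r
library(dplyr)

# Rows are deliberately out of chronological order
df <- tibble(
  year  = c(2022, 2020, 2021),
  sales = c(300, 100, 200)
)

# order_by = year shifts by chronological order, not row order:
# prev_sales is NA for 2020, 100 for 2021, and 200 for 2022
res <- df %>%
  mutate(prev_sales = lag(sales, order_by = year))
print(res)
```

This avoids a silent bug where lag() on unsorted data returns the previous *row*, not the previous *period*.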
Calculating Period-over-Period Changes
A common use case involves calculating differences or percentage changes between consecutive observations. This is fundamental for financial analysis, sales metrics, and any time-series data.
sales_data <- tibble(
month = 1:12,
revenue = c(100, 120, 115, 130, 145, 140, 160, 175, 170, 180, 195, 200)
)
sales_analysis <- sales_data %>%
mutate(
prev_revenue = lag(revenue),
revenue_change = revenue - lag(revenue),
pct_change = (revenue - lag(revenue)) / lag(revenue) * 100,
next_revenue = lead(revenue),
will_increase = lead(revenue) > revenue
)
print(sales_analysis)
This produces a data frame showing each month's revenue alongside its previous value, the absolute change, the percentage change, and whether the next month's revenue will be higher.
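If the leading NA in the first row is unwanted, the default argument can anchor the first period to itself, so the change column starts at 0. A short sketch on a reduced version of the data:

```r
library(dplyr)

sales_data <- tibble(
  month   = 1:3,
  revenue = c(100, 120, 115)
)

# default = first(revenue) treats month 1 as its own baseline,
# so revenue_change is 0 rather than NA in the first row
result <- sales_data %>%
  mutate(revenue_change = revenue - lag(revenue, default = first(revenue)))
print(result)
```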
Working with Grouped Data
The real power of lag() and lead() emerges when working with grouped data. These functions respect group boundaries, ensuring that comparisons only occur within the same group.
customer_orders <- tibble(
customer_id = rep(c("A", "B", "C"), each = 4),
order_date = rep(seq.Date(as.Date("2024-01-01"), by = "month", length.out = 4), 3),
order_value = c(100, 150, 120, 180, # Customer A
200, 220, 210, 250, # Customer B
80, 90, 85, 100) # Customer C
)
customer_analysis <- customer_orders %>%
group_by(customer_id) %>%
arrange(customer_id, order_date) %>%
mutate(
prev_order = lag(order_value),
days_since_last = as.numeric(order_date - lag(order_date)),
value_change = order_value - lag(order_value),
avg_last_two = (order_value + lag(order_value)) / 2,
next_order_higher = lead(order_value) > order_value
) %>%
ungroup()
Notice how the first order for each customer has NA for prev_order—the function doesn’t pull values from the previous group.
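To see why group_by() matters here, compare the same calculation with and without grouping. Without it, the last order of one customer leaks into the first row of the next (a minimal sketch with made-up data):

```r
library(dplyr)

orders <- tibble(
  customer_id = rep(c("A", "B"), each = 2),
  order_value = c(100, 150, 200, 220)
)

# Without group_by(): customer A's last order (150) leaks into
# customer B's first row
no_groups <- orders %>%
  mutate(prev = lag(order_value))

# With group_by(): each customer starts fresh with NA
by_group <- orders %>%
  group_by(customer_id) %>%
  mutate(prev = lag(order_value)) %>%
  ungroup()
```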
Multi-Period Lags and Leads
For more complex analyses, you might need to compare values across multiple periods simultaneously. This is common in seasonal analysis or when implementing moving average strategies.
stock_prices <- tibble(
date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
price = c(100, 102, 101, 105, 107, 106, 110, 112, 111, 115)
)
technical_analysis <- stock_prices %>%
mutate(
lag_1 = lag(price, 1),
lag_2 = lag(price, 2),
lag_3 = lag(price, 3),
ma_3 = (price + lag(price, 1) + lag(price, 2)) / 3,
price_momentum = price - lag(price, 3),
lead_1 = lead(price, 1),
lead_2 = lead(price, 2),
future_gain = lead(price, 2) - price
)
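Typing lag_1, lag_2, lag_3 by hand quickly becomes tedious when many lags are needed. One possible approach (a plain-loop sketch, not the only idiom) generates the columns programmatically:

```r
library(dplyr)

stock_prices <- tibble(
  date  = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
  price = c(100, 102, 101, 105, 107, 106, 110, 112, 111, 115)
)

# Generate lag_1 .. lag_3 in a loop instead of writing each by hand
for (k in 1:3) {
  stock_prices[[paste0("lag_", k)]] <- lag(stock_prices$price, k)
}
```

Note that this plain loop ignores grouping; for grouped data you would need to stay inside a group-aware mutate().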
Conditional Logic with Lag and Lead
Combining these functions with conditional statements enables sophisticated pattern detection and rule-based analysis.
temperature_data <- tibble(
day = 1:15,
temp = c(20, 22, 21, 25, 27, 26, 24, 23, 25, 28, 30, 29, 27, 26, 28)
)
weather_patterns <- temperature_data %>%
mutate(
warming_trend = temp > lag(temp) & lag(temp) > lag(temp, 2),
cooling_trend = temp < lag(temp) & lag(temp) < lag(temp, 2),
peak = temp > lag(temp) & temp > lead(temp),
valley = temp < lag(temp) & temp < lead(temp),
stable = abs(temp - lag(temp)) <= 1 & abs(temp - lead(temp)) <= 1
)
This identifies various temperature patterns: consecutive warming or cooling trends, local peaks and valleys, and stable periods.
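Beyond flagging single-row patterns, lag() can also drive streak counting, e.g. how many consecutive days the temperature has been rising. A sketch using the common cumsum run-id trick (the shortened data here is for illustration):

```r
library(dplyr)

temperature_data <- tibble(
  day  = 1:8,
  temp = c(20, 22, 21, 25, 27, 26, 24, 25)
)

streaks <- temperature_data %>%
  mutate(
    rising = coalesce(temp > lag(temp), FALSE),
    # cumsum(!rising) produces a run id that stays constant during a
    # rising streak; counting rising days within each run gives the
    # streak length so far (0 on non-rising days)
    streak = ave(as.integer(rising), cumsum(!rising), FUN = cumsum)
  )
```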
Handling Missing Values and Edge Cases
When your data contains NA values, lag() and lead() propagate them through calculations. You need explicit handling strategies:
incomplete_data <- tibble(
id = 1:8,
value = c(10, NA, 30, 40, NA, 60, 70, 80)
)
handled_data <- incomplete_data %>%
mutate(
# Standard lag propagates NA
simple_lag = lag(value),
# Calculate change only when both values exist
change = if_else(!is.na(value) & !is.na(lag(value)),
value - lag(value),
NA_real_),
# Fill with the previous value when the current one is NA
# (the previous value may itself be NA)
filled = if_else(is.na(value), lag(value), value),
# Flag rows where both the current and previous values exist
has_history = !is.na(value) & !is.na(lag(value))
)
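Note that the lag()-based fill above only looks one row back, so a run of several consecutive NAs is not fully repaired. When you want to carry the last observed value forward through arbitrary gaps, tidyr::fill() is the purpose-built tool:

```r
library(dplyr)
library(tidyr)

incomplete_data <- tibble(
  id    = 1:6,
  value = c(10, NA, NA, 40, NA, 60)
)

# fill() carries the last non-NA value forward through runs of NAs,
# unlike a single lag() which only looks one row back
filled_data <- incomplete_data %>%
  fill(value, .direction = "down")
```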
Performance Considerations with Large Datasets
For large datasets, these functions are highly optimized, but combining multiple lag/lead operations with complex grouping can impact performance. Consider computing intermediate results:
# Less efficient: repeated lag calculations
inefficient <- large_data %>%
group_by(category) %>%
mutate(
calc1 = value - lag(value),
calc2 = (value - lag(value)) / lag(value),
calc3 = value + lag(value)
)
# More efficient: compute lag once
efficient <- large_data %>%
group_by(category) %>%
mutate(
prev_value = lag(value),
calc1 = value - prev_value,
calc2 = (value - prev_value) / prev_value,
calc3 = value + prev_value
)
Practical Application: Customer Retention Analysis
Here’s a complete example analyzing customer purchase patterns to identify retention risks:
library(dplyr)
library(lubridate)
set.seed(42)  # make the simulated purchase amounts reproducible
purchases <- tibble(
customer_id = rep(1:5, each = 6),
purchase_date = rep(seq.Date(as.Date("2024-01-01"), by = "month", length.out = 6), 5),
amount = runif(30, 50, 500)
)
retention_analysis <- purchases %>%
group_by(customer_id) %>%
arrange(customer_id, purchase_date) %>%
mutate(
prev_amount = lag(amount),
amount_change_pct = (amount - prev_amount) / prev_amount * 100,
declining_spend = amount < prev_amount & prev_amount < lag(amount, 2),
next_amount = lead(amount),
will_churn = is.na(lead(amount, 2)), # No purchase 2 months ahead
avg_spend_trend = (amount + lag(amount) + lag(amount, 2)) / 3,
risk_score = case_when(
declining_spend & amount < 100 ~ "High Risk",
amount_change_pct < -20 ~ "Medium Risk",
TRUE ~ "Low Risk"
)
) %>%
ungroup()
This analysis identifies customers with declining spending patterns, calculates trend indicators, and assigns risk scores based on multiple factors—all made possible through strategic use of lag() and lead() functions.