R tidyr - complete() - Fill in Missing Combinations
Key Insights
- The complete() function identifies implicit missing values by expanding data frames to include all combinations of specified variables, turning structural gaps into explicit NA values
- Unlike expand(), which only generates combinations, complete() preserves existing data while filling gaps with NA or custom values through the fill parameter
- Combining complete() with fill() or using the fill parameter enables sophisticated missing data handling patterns essential for time series analysis, panel data, and cross-tabulations
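The difference between expand() and complete() can be seen on a small example. The sketch below uses a hypothetical toy tibble (the names toy, store, month, and units are illustrative assumptions, not part of the article's data):

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: store "S2" has no row for month 2
toy <- tibble(
  store = c("S1", "S1", "S2"),
  month = c(1, 2, 1),
  units = c(10, 12, 7)
)

# expand() returns only the grid of combinations; `units` is dropped
grid <- toy %>% expand(store, month)

# complete() returns the same grid joined back to the data, keeping `units`
full <- toy %>% complete(store, month)

names(grid)             # "store" "month"
nrow(grid)              # 4
names(full)             # "store" "month" "units"
sum(is.na(full$units))  # 1 (the S2 / month 2 row)
```

In practice expand() is handy for inspecting the grid or for anti-joins, while complete() is the one to reach for when the measured columns must survive.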
Understanding Implicit vs Explicit Missing Values
Implicit missing values are combinations of variables that don’t appear in your dataset but should exist based on the data’s structure. These are fundamentally different from explicit NA values that appear in your data.
library(tidyr)
library(dplyr)
# Sales data with implicit missing values
sales <- tibble(
product = c("A", "A", "B", "B", "C"),
quarter = c("Q1", "Q2", "Q1", "Q3", "Q1"),
revenue = c(100, 150, 200, 180, 90)
)
print(sales)
# product quarter revenue
# <chr> <chr> <dbl>
# 1 A Q1 100
# 2 A Q2 150
# 3 B Q1 200
# 4 B Q3 180
# 5 C Q1 90
Product A has no Q3 data, Product B has no Q2 data, and Product C has only Q1. No product has a Q4 row at all. These missing combinations are implicit: they don’t exist as rows with NA values.
Basic complete() Syntax
The complete() function takes column names as arguments and creates all possible combinations of the unique values found in those columns, keeping existing rows and adding the missing ones.
# Make implicit missing values explicit
sales_complete <- sales %>%
complete(product, quarter)
print(sales_complete)
# product quarter revenue
# <chr> <chr> <dbl>
# 1 A Q1 100
# 2 A Q2 150
# 3 A Q3 NA
# 4 B Q1 200
# 5 B Q2 NA
# 6 B Q3 180
# 7 C Q1 90
# 8 C Q2 NA
# 9 C Q3 NA
Now every combination of an observed product and an observed quarter exists explicitly. Missing values are NA rather than absent rows. Q4 does not appear because no row contains it: complete() only combines values present in the data unless you supply them yourself.
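Because "Q4" never occurs in the quarter column, complete(product, quarter) has no way to know it should exist. One way to include it, sketched here on the same sales data, is to pass the full set of quarters as a named argument (the name sales_all_quarters is illustrative):

```r
library(tidyr)
library(dplyr)

sales <- tibble(
  product = c("A", "A", "B", "B", "C"),
  quarter = c("Q1", "Q2", "Q1", "Q3", "Q1"),
  revenue = c(100, 150, 200, 180, 90)
)

# Supply the expected quarters explicitly instead of relying on observed values
sales_all_quarters <- sales %>%
  complete(product, quarter = c("Q1", "Q2", "Q3", "Q4"))

nrow(sales_all_quarters)                # 12 (3 products x 4 quarters)
sum(is.na(sales_all_quarters$revenue))  # 7 (12 combinations - 5 observed)
```

Storing quarter as a factor with all four levels would be an alternative way to get the same grid.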
Filling Missing Values During Completion
The fill parameter allows you to specify default values for newly created rows instead of NA.
# Fill missing revenue with 0
sales_filled <- sales %>%
complete(product, quarter, fill = list(revenue = 0))
print(sales_filled)
# product quarter revenue
# <chr> <chr> <dbl>
# 1 A Q1 100
# 2 A Q2 150
# 3 A Q3 0
# 4 B Q1 200
# 5 B Q2 0
# 6 B Q3 180
# 7 C Q1 90
# 8 C Q2 0
# 9 C Q3 0
Use this only when the default value is meaningful. Zero and NA are semantically different: zero revenue says no sales occurred, whereas NA says the value is unknown or was not collected.
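By default the fill values also overwrite NA values that were already in the data. If pre-existing NAs should stay NA, the explicit argument of complete() (available in tidyr 1.2.0 and later, to my knowledge) can restrict filling to the newly created rows. A minimal sketch on a hypothetical variant of the sales data, where B/Q2 was recorded but its value is unknown:

```r
library(tidyr)
library(dplyr)

# Hypothetical data: A/Q2 is implicitly missing, B/Q2 is an explicit NA
sales_na <- tibble(
  product = c("A", "B", "B"),
  quarter = c("Q1", "Q1", "Q2"),
  revenue = c(100, 200, NA)
)

# Default (explicit = TRUE): both the new row and the existing NA become 0
filled_all <- sales_na %>%
  complete(product, quarter, fill = list(revenue = 0))

# explicit = FALSE: only the newly created A/Q2 row is filled
filled_implicit <- sales_na %>%
  complete(product, quarter, fill = list(revenue = 0), explicit = FALSE)

filled_all$revenue       # 100 0 200 0
filled_implicit$revenue  # 100 0 200 NA
```

This keeps "no sales" and "not recorded" distinguishable after completion.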
Completing with Sequences
Use helper functions like full_seq() to complete numeric sequences with specific intervals.
# Time series data with gaps
temperature <- tibble(
day = c(1, 2, 4, 7, 10),
temp_c = c(20, 21, 19, 22, 23)
)
# Complete all days from 1 to 10
temperature_complete <- temperature %>%
complete(day = full_seq(day, 1))
print(temperature_complete)
# day temp_c
# <dbl> <dbl>
# 1 1 20
# 2 2 21
# 3 3 NA
# 4 4 19
# 5 5 NA
# 6 6 NA
# 7 7 22
# 8 8 NA
# 9 9 NA
# 10 10 23
For date sequences, use base R’s seq() with a by argument; no additional package is required:
# Date-based time series
events <- tibble(
date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-07")),
events = c(5, 3, 8)
)
# Complete all dates in range
events_complete <- events %>%
complete(date = seq(min(date), max(date), by = "day"),
fill = list(events = 0))
print(events_complete)
# date events
# <date> <dbl>
# 1 2024-01-01 5
# 2 2024-01-02 0
# 3 2024-01-03 3
# 4 2024-01-04 0
# 5 2024-01-05 0
# 6 2024-01-06 0
# 7 2024-01-07 8
Nested Completion with nesting()
Use nesting() to complete only combinations that should logically exist together, based on combinations already present in your data.
# Patient visits - not all patients visit every clinic
visits <- tibble(
patient_id = c(1, 1, 2, 2, 3),
clinic = c("A", "A", "B", "B", "A"),
visit_date = as.Date(c("2024-01-01", "2024-01-15",
"2024-01-01", "2024-01-15",
"2024-01-01")),
blood_pressure = c(120, 118, 135, 133, 125)
)
# Complete all visit dates for each patient-clinic combination
visits_complete <- visits %>%
complete(nesting(patient_id, clinic), visit_date)
print(visits_complete)
# patient_id clinic visit_date blood_pressure
# <dbl> <chr> <date> <dbl>
# 1 1 A 2024-01-01 120
# 2 1 A 2024-01-15 118
# 3 2 B 2024-01-01 135
# 4 2 B 2024-01-15 133
# 5 3 A 2024-01-01 125
# 6 3 A 2024-01-15 NA
Patient 1 only gets rows for clinic A, patient 2 only for clinic B, and patient 3 only for clinic A. Without nesting(), you’d get all patients × all clinics × all dates.
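The effect of nesting() is easiest to see by comparing row counts with and without it. The sketch below rebuilds the visits data and runs both versions (the names crossed and nested are illustrative):

```r
library(tidyr)
library(dplyr)

visits <- tibble(
  patient_id = c(1, 1, 2, 2, 3),
  clinic = c("A", "A", "B", "B", "A"),
  visit_date = as.Date(c("2024-01-01", "2024-01-15",
                         "2024-01-01", "2024-01-15",
                         "2024-01-01")),
  blood_pressure = c(120, 118, 135, 133, 125)
)

# Full crossing: 3 patients x 2 clinics x 2 dates
crossed <- visits %>% complete(patient_id, clinic, visit_date)

# Only patient-clinic pairs that occur in the data, crossed with dates
nested <- visits %>% complete(nesting(patient_id, clinic), visit_date)

nrow(crossed)  # 12
nrow(nested)   # 6
```

The six extra rows in the crossed version are patient-clinic pairs that never occurred, such as patient 1 at clinic B.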
Practical Application: Panel Data Analysis
Many panel-data workflows expect a balanced dataset in which every entity has a row for every time period.
# Company financial data - unbalanced panel
financials <- tibble(
company = c("ACME", "ACME", "ACME", "TechCo", "TechCo", "GlobalInc"),
year = c(2020, 2021, 2023, 2021, 2022, 2020),
revenue_m = c(50, 55, 65, 120, 135, 200),
employees = c(100, 110, 125, 500, 550, 1000)
)
# Create balanced panel
balanced_panel <- financials %>%
complete(company, year = full_seq(year, 1)) # new rows get NA by default
print(balanced_panel)
# company year revenue_m employees
# <chr> <dbl> <dbl> <dbl>
# 1 ACME 2020 50 100
# 2 ACME 2021 55 110
# 3 ACME 2022 NA NA
# 4 ACME 2023 65 125
# 5 GlobalInc 2020 200 1000
# 6 GlobalInc 2021 NA NA
# 7 GlobalInc 2022 NA NA
# 8 GlobalInc 2023 NA NA
# 9 TechCo 2020 NA NA
# 10 TechCo 2021 120 500
# 11 TechCo 2022 135 550
# 12 TechCo 2023 NA NA
Now you can apply panel data models or calculate year-over-year growth rates with proper handling of missing periods.
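As one illustration of why balancing matters, here is a sketch of a year-over-year growth calculation using dplyr's lag(). The column name revenue_growth is an assumption introduced for this example. Because the gap years exist as NA rows, growth is not silently computed across a missing year:

```r
library(tidyr)
library(dplyr)

financials <- tibble(
  company = c("ACME", "ACME", "ACME", "TechCo", "TechCo", "GlobalInc"),
  year = c(2020, 2021, 2023, 2021, 2022, 2020),
  revenue_m = c(50, 55, 65, 120, 135, 200),
  employees = c(100, 110, 125, 500, 550, 1000)
)

growth <- financials %>%
  complete(company, year = full_seq(year, 1)) %>%
  group_by(company) %>%
  arrange(year, .by_group = TRUE) %>%
  # lag() looks at the previous row within each company
  mutate(revenue_growth = revenue_m / lag(revenue_m) - 1) %>%
  ungroup()

growth %>% filter(company == "ACME") %>% select(year, revenue_growth)
# ACME 2021 is about 0.10; 2022 and 2023 are NA because 2022 revenue is missing
```

On the unbalanced data, the same lag() call would compare ACME's 2023 revenue with 2021 and report it as a one-year change.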
Combining complete() with fill()
After using complete(), apply fill() to propagate values forward or backward.
# Sensor readings with irregular intervals
sensor_data <- tibble(
timestamp = c(1, 3, 7, 8),
status = c("OK", "OK", "WARNING", "OK"),
value = c(100, 105, 98, 101)
)
# Complete sequence and carry forward status
sensor_complete <- sensor_data %>%
complete(timestamp = full_seq(timestamp, 1)) %>%
fill(status, .direction = "down")
print(sensor_complete)
# timestamp status value
# <dbl> <chr> <dbl>
# 1 1 OK 100
# 2 2 OK NA
# 3 3 OK 105
# 4 4 OK NA
# 5 5 OK NA
# 6 6 OK NA
# 7 7 WARNING 98
# 8 8 OK 101
This pattern suits time series where a categorical state should persist until it changes. Note that only status is carried forward here; value stays NA for the inserted rows.
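With more than one series in the same table, an ungrouped fill() would carry the last value of one series into the start of the next. A sketch of the grouped variant, using hypothetical two-sensor data (the readings tibble and sensor column are assumptions for illustration):

```r
library(tidyr)
library(dplyr)

# Hypothetical readings from two sensors; s2 has no reading at timestamp 1
readings <- tibble(
  sensor = c("s1", "s1", "s2", "s2"),
  timestamp = c(1, 3, 2, 3),
  status = c("OK", "WARNING", "FAULT", "OK")
)

filled <- readings %>%
  complete(sensor, timestamp = full_seq(timestamp, 1)) %>%
  group_by(sensor) %>%
  # fill() respects groups, so values never cross from s1 into s2
  fill(status, .direction = "down") %>%
  ungroup()

filled$status
# "OK" "OK" "WARNING" NA "FAULT" "OK"
```

The first s2 row stays NA because there is no earlier s2 reading to carry forward; without group_by() it would have inherited "WARNING" from s1.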
Edge Cases and Gotchas
Be aware of memory implications: the completed row count is the product of the number of distinct values in each completed column, so it grows quickly on large datasets:
# This creates 1000 * 1000 = 1,000,000 rows
large_complete <- tibble(x = 1:1000, y = 1:1000, value = 1:1000) %>%
complete(x, y) # Probably not what you want
# Use nesting() to maintain existing combinations
large_sensible <- tibble(x = 1:1000, y = 1:1000, value = 1:1000) %>%
complete(nesting(x, y)) # Returns original data
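A cheap safeguard is to estimate the size of the grid before calling complete(). The sketch below multiplies the distinct counts of the columns to be crossed (the names dat and grid_rows are illustrative):

```r
library(dplyr)

dat <- tibble(x = 1:1000, y = 1:1000, value = 1:1000)

# Rows complete(x, y) would produce: distinct x values times distinct y values
# (converted to numeric to avoid integer overflow on very large grids)
grid_rows <- as.numeric(n_distinct(dat$x)) * as.numeric(n_distinct(dat$y))

grid_rows  # 1e+06
nrow(dat)  # 1000
```

If the estimate is far larger than the original row count, that is a hint to reconsider which columns are crossed and which belong inside nesting().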
When working with factors, complete() respects factor levels:
# Factor levels control completion
data_factor <- tibble(
category = factor(c("A", "C"), levels = c("A", "B", "C", "D")),
value = c(1, 3)
)
completed_factor <- data_factor %>%
complete(category, fill = list(value = 0))
print(completed_factor)
# category value
# <fct> <dbl>
# 1 A 1
# 2 B 0
# 3 C 3
# 4 D 0
Unused levels (B and D here) are included, so the completed dataset follows the predefined categorical structure, which helps keep reports and plots consistent.