R tidyr - complete() - Fill in Missing Combinations

Key Insights

The complete() function identifies implicit missing values by expanding data frames to include all combinations of specified variables, turning structural gaps into explicit NA values
Unlike expand(), which only generates combinations, complete() keeps the existing data and fills the gaps with NA or with custom values supplied through the fill argument
Combining complete() with fill() or using the fill parameter enables sophisticated missing data handling patterns essential for time series analysis, panel data, and cross-tabulations

Understanding Implicit vs Explicit Missing Values

Implicit missing values are combinations of variables that don’t appear in your dataset but should exist based on the data’s structure. These are fundamentally different from explicit NA values that appear in your data.

library(tidyr)
library(dplyr)

# Sales data with implicit missing values
sales <- tibble(
  product = c("A", "A", "B", "B", "C"),
  quarter = c("Q1", "Q2", "Q1", "Q3", "Q1"),
  revenue = c(100, 150, 200, 180, 90)
)

print(sales)
#   product quarter revenue
#   <chr>   <chr>     <dbl>
# 1 A       Q1          100
# 2 A       Q2          150
# 3 B       Q1          200
# 4 B       Q3          180
# 5 C       Q1           90

Product A has no Q3 row, Product B has no Q2 row, and Product C has only Q1. These missing combinations are implicit: they don’t exist as rows with NA values. Q4 appears nowhere in the data, so complete() cannot infer it from the column alone.
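
One way to list the implicit gaps before filling them is to build every combination with expand() and anti-join the observed rows. This is a minimal sketch using the sales tibble from above; the names all_combos and missing_combos are illustrative only.

```r
library(tidyr)
library(dplyr)

sales <- tibble(
  product = c("A", "A", "B", "B", "C"),
  quarter = c("Q1", "Q2", "Q1", "Q3", "Q1"),
  revenue = c(100, 150, 200, 180, 90)
)

# expand() returns only the key combinations, without the revenue column
all_combos <- sales %>%
  expand(product, quarter)

# Combinations with no matching row in `sales` are the implicit missing values
missing_combos <- all_combos %>%
  anti_join(sales, by = c("product", "quarter"))

nrow(all_combos)      # 3 products x 3 observed quarters
nrow(missing_combos)  # combinations that have no row in `sales`
```

Comparing the two row counts gives a quick sense of how many rows complete() would add.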

Basic complete() Syntax

The complete() function takes column names as arguments and creates all possible combinations of their unique values.

# Make implicit missing values explicit
sales_complete <- sales %>%
  complete(product, quarter)

print(sales_complete)
#   product quarter revenue
#   <chr>   <chr>     <dbl>
# 1 A       Q1          100
# 2 A       Q2          150
# 3 A       Q3           NA
# 4 B       Q1          200
# 5 B       Q2           NA
# 6 B       Q3          180
# 7 C       Q1           90
# 8 C       Q2           NA
# 9 C       Q3           NA

Now every product-quarter combination exists explicitly. Missing values are NA rather than absent rows.
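
complete() only combines values it can find in the data. When a value never occurs at all (Q4 in the sales data), it can be supplied through a named argument. The following is a minimal sketch, assuming a four-quarter reporting year; the vector all_quarters is an assumption made for illustration.

```r
library(tidyr)
library(dplyr)

sales <- tibble(
  product = c("A", "A", "B", "B", "C"),
  quarter = c("Q1", "Q2", "Q1", "Q3", "Q1"),
  revenue = c(100, 150, 200, 180, 90)
)

# Assumed full set of quarters; Q4 is absent from `sales`
all_quarters <- c("Q1", "Q2", "Q3", "Q4")

# A named argument replaces the observed values with the supplied ones
sales_full_year <- sales %>%
  complete(product, quarter = all_quarters)

nrow(sales_full_year)  # 3 products x 4 quarters
```

Supplying the values explicitly also protects against a quarter silently disappearing from a report when no product happened to sell in it.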

Filling Missing Values During Completion

The fill parameter allows you to specify default values for newly created rows instead of NA.

# Fill missing revenue with 0
sales_filled <- sales %>%
  complete(product, quarter, fill = list(revenue = 0))

print(sales_filled)
#   product quarter revenue
#   <chr>   <chr>     <dbl>
# 1 A       Q1          100
# 2 A       Q2          150
# 3 A       Q3            0
# 4 B       Q1          200
# 5 B       Q2            0
# 6 B       Q3          180
# 7 C       Q1           90
# 8 C       Q2            0
# 9 C       Q3            0

This is particularly useful when zero and NA mean different things: zero revenue means no sales occurred, while NA might mean the data wasn’t collected.
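
One detail worth checking: in tidyr 1.2.0 and later, fill also replaces NA values that were already present in the data, unless explicit = FALSE is set. The sketch below assumes that version and uses a made-up tibble (sales_na) with one recorded NA.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: B/Q1 was recorded, but its revenue is unknown
sales_na <- tibble(
  product = c("A", "A", "B"),
  quarter = c("Q1", "Q2", "Q1"),
  revenue = c(100, 150, NA)
)

# Default (explicit = TRUE): the pre-existing NA is replaced with 0 as well
filled_all <- sales_na %>%
  complete(product, quarter, fill = list(revenue = 0))

# explicit = FALSE: only the newly created B/Q2 row is filled
filled_new <- sales_na %>%
  complete(product, quarter, fill = list(revenue = 0), explicit = FALSE)
```

Using explicit = FALSE keeps "not collected" and "no sales" distinguishable after completion.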

Completing with Sequences

Use helper functions like full_seq() to complete numeric sequences with specific intervals.

# Time series data with gaps
temperature <- tibble(
  day = c(1, 2, 4, 7, 10),
  temp_c = c(20, 21, 19, 22, 23)
)

# Complete all days from 1 to 10
temperature_complete <- temperature %>%
  complete(day = full_seq(day, 1))

print(temperature_complete)
#      day temp_c
#    <dbl>  <dbl>
#  1     1     20
#  2     2     21
#  3     3     NA
#  4     4     19
#  5     5     NA
#  6     6     NA
#  7     7     22
#  8     8     NA
#  9     9     NA
# 10    10     23
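
To see what full_seq() contributes on its own, it can be called directly on a vector. A small sketch follows; the input vectors are invented for illustration.

```r
library(tidyr)

# full_seq() spans the range from min to max in steps of `period`
daily <- full_seq(c(1, 2, 4, 7, 10), 1)
tens  <- full_seq(c(0, 10, 30), 10)

# Values that do not fall on the grid raise an error rather than being rounded
off_grid <- tryCatch(
  full_seq(c(1, 4), 2),
  error = function(e) "not a regular sequence"
)
```

The error case is a useful safeguard: if the observations are not multiples of the chosen period, the period is probably wrong for the data.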

For date sequences, base R’s seq() works directly on Date values, so no extra package is needed:

# Date-based time series
events <- tibble(
  date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-07")),
  events = c(5, 3, 8)
)

# Complete all dates in range
events_complete <- events %>%
  complete(date = seq(min(date), max(date), by = "day"),
           fill = list(events = 0))

print(events_complete)
#   date       events
#   <date>      <dbl>
# 1 2024-01-01      5
# 2 2024-01-02      0
# 3 2024-01-03      3
# 4 2024-01-04      0
# 5 2024-01-05      0
# 6 2024-01-06      0
# 7 2024-01-07      8
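
When each group has its own date range, complete() can be applied to a grouped data frame so that min() and max() are evaluated within each group. This is a sketch under the assumption of a tidyr version with grouped-data support for complete(); the readings tibble and its site column are hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical counts from two sites that were active on different days
readings <- tibble(
  site  = c("north", "north", "south", "south"),
  date  = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05", "2024-01-06")),
  count = c(5, 3, 8, 2)
)

# Within each site, fill in the days between that site's first and last date
readings_complete <- readings %>%
  group_by(site) %>%
  complete(date = seq(min(date), max(date), by = "day"),
           fill = list(count = 0)) %>%
  ungroup()
```

Grouping avoids padding every site out to the overall date range, which an ungrouped complete(site, date = ...) would do.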

Nested Completion with nesting()

Use nesting() to complete only combinations that should logically exist together, based on combinations already present in your data.

# Patient visits - not all patients visit every clinic
visits <- tibble(
  patient_id = c(1, 1, 2, 2, 3),
  clinic = c("A", "A", "B", "B", "A"),
  visit_date = as.Date(c("2024-01-01", "2024-01-15", 
                         "2024-01-01", "2024-01-15",
                         "2024-01-01")),
  blood_pressure = c(120, 118, 135, 133, 125)
)

# Complete all visit dates for each patient-clinic combination
visits_complete <- visits %>%
  complete(nesting(patient_id, clinic), visit_date)

print(visits_complete)
#   patient_id clinic visit_date blood_pressure
#        <dbl> <chr>  <date>              <dbl>
# 1          1 A      2024-01-01            120
# 2          1 A      2024-01-15            118
# 3          2 B      2024-01-01            135
# 4          2 B      2024-01-15            133
# 5          3 A      2024-01-01            125
# 6          3 A      2024-01-15             NA

Patient 1 only gets rows for clinic A, patient 2 only for clinic B, and patient 3 only for clinic A. Without nesting(), you’d get all patients × all clinics × all dates.
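
The difference is easy to confirm by counting rows. The sketch below reuses the visits data and compares the fully crossed completion with the nested one; the names crossed and nested are illustrative.

```r
library(tidyr)
library(dplyr)

visits <- tibble(
  patient_id = c(1, 1, 2, 2, 3),
  clinic = c("A", "A", "B", "B", "A"),
  visit_date = as.Date(c("2024-01-01", "2024-01-15",
                         "2024-01-01", "2024-01-15",
                         "2024-01-01")),
  blood_pressure = c(120, 118, 135, 133, 125)
)

# Every patient x every clinic x every date
crossed <- visits %>%
  complete(patient_id, clinic, visit_date)

# Only the observed patient-clinic pairs x every date
nested <- visits %>%
  complete(nesting(patient_id, clinic), visit_date)

nrow(crossed)  # 3 patients x 2 clinics x 2 dates
nrow(nested)   # 3 observed pairs x 2 dates
```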

Practical Application: Panel Data Analysis

Many panel data methods are easier to apply to a balanced dataset, in which each entity has an observation for every time period.

# Company financial data - unbalanced panel
financials <- tibble(
  company = c("ACME", "ACME", "ACME", "TechCo", "TechCo", "GlobalInc"),
  year = c(2020, 2021, 2023, 2021, 2022, 2020),
  revenue_m = c(50, 55, 65, 120, 135, 200),
  employees = c(100, 110, 125, 500, 550, 1000)
)

# Create balanced panel
balanced_panel <- financials %>%
  complete(company,
           year = full_seq(year, 1))  # new cells default to NA

print(balanced_panel)
#    company    year revenue_m employees
#    <chr>     <dbl>     <dbl>     <dbl>
#  1 ACME       2020        50       100
#  2 ACME       2021        55       110
#  3 ACME       2022        NA        NA
#  4 ACME       2023        65       125
#  5 GlobalInc  2020       200      1000
#  6 GlobalInc  2021        NA        NA
#  7 GlobalInc  2022        NA        NA
#  8 GlobalInc  2023        NA        NA
#  9 TechCo     2020        NA        NA
# 10 TechCo     2021       120       500
# 11 TechCo     2022       135       550
# 12 TechCo     2023        NA        NA

Now you can apply panel data models or calculate year-over-year growth rates with proper handling of missing periods.
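
As an illustration of why balancing matters for growth rates, the sketch below computes year-over-year revenue growth with dplyr's lag() on both the unbalanced and the balanced panel. The helper add_growth() is a hypothetical name introduced for this example.

```r
library(tidyr)
library(dplyr)

financials <- tibble(
  company = c("ACME", "ACME", "ACME", "TechCo", "TechCo", "GlobalInc"),
  year = c(2020, 2021, 2023, 2021, 2022, 2020),
  revenue_m = c(50, 55, 65, 120, 135, 200)
)

# Hypothetical helper: growth relative to the previous row within a company
add_growth <- function(df) {
  df %>%
    group_by(company) %>%
    arrange(year, .by_group = TRUE) %>%
    mutate(yoy_growth = revenue_m / lag(revenue_m) - 1) %>%
    ungroup()
}

# Unbalanced: ACME's 2023 row is compared with 2021, a two-year change
unbalanced <- add_growth(financials)

# Balanced: the 2022 gap is an explicit NA, so 2023 growth is NA
balanced <- financials %>%
  complete(company, year = full_seq(year, 1)) %>%
  add_growth()
```

On the balanced panel, a missing prior year yields NA instead of a multi-year change that looks like an annual one.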

Combining complete() with fill()

After using complete(), apply fill() to propagate values forward or backward.

# Sensor readings with irregular intervals
sensor_data <- tibble(
  timestamp = c(1, 3, 7, 8),
  status = c("OK", "OK", "WARNING", "OK"),
  value = c(100, 105, 98, 101)
)

# Complete sequence and carry forward status
sensor_complete <- sensor_data %>%
  complete(timestamp = full_seq(timestamp, 1)) %>%
  fill(status, .direction = "down")

print(sensor_complete)
#   timestamp status  value
#       <dbl> <chr>   <dbl>
# 1         1 OK        100
# 2         2 OK         NA
# 3         3 OK        105
# 4         4 OK         NA
# 5         5 OK         NA
# 6         6 OK         NA
# 7         7 WARNING    98
# 8         8 OK        101

This pattern is essential for time series where categorical variables should persist until they change.
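
With more than one series in the same table, fill() should usually run per group so that one series' last value does not spill into the next. A minimal sketch with a hypothetical two-device tibble (readings):

```r
library(tidyr)
library(dplyr)

# Hypothetical readings: device "b" has no reading at timestamp 1
readings <- tibble(
  device    = c("a", "a", "b", "b"),
  timestamp = c(1, 3, 2, 3),
  status    = c("OK", "WARNING", "OK", "OK")
)

completed <- readings %>%
  complete(device, timestamp = full_seq(timestamp, 1))

# Ungrouped: the first row of device "b" inherits the last status of device "a"
leaky <- completed %>%
  fill(status, .direction = "down")

# Grouped: filling restarts for each device, so that row stays NA
per_device <- completed %>%
  group_by(device) %>%
  fill(status, .direction = "down") %>%
  ungroup()
```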

Edge Cases and Gotchas

Be aware of memory implications when completing large datasets with many variables:

# This creates 1000 * 1000 = 1,000,000 rows
large_complete <- tibble(x = 1:1000, y = 1:1000, value = 1:1000) %>%
  complete(x, y)  # Probably not what you want

# Use nesting() to maintain existing combinations
large_sensible <- tibble(x = 1:1000, y = 1:1000, value = 1:1000) %>%
  complete(nesting(x, y))  # Returns original data
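
Before running complete() on a large table, the size of the result can be estimated from the number of distinct values in each column. A small sketch, with expected_rows as an illustrative name:

```r
library(dplyr)

df <- tibble(x = 1:1000, y = 1:1000, value = 1:1000)

# Rows after complete(x, y) = product of the distinct counts;
# as.numeric() avoids integer overflow for very large products
expected_rows <- prod(as.numeric(c(n_distinct(df$x), n_distinct(df$y))))

# How many times larger the completed table would be
growth_factor <- expected_rows / nrow(df)
```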

When working with factors, complete() respects factor levels:

# Factor levels control completion
data_factor <- tibble(
  category = factor(c("A", "C"), levels = c("A", "B", "C", "D")),
  value = c(1, 3)
)

completed_factor <- data_factor %>%
  complete(category, fill = list(value = 0))

print(completed_factor)
#   category value
#   <fct>    <dbl>
# 1 A            1
# 2 B            0
# 3 C            3
# 4 D            0

This ensures your completed dataset respects predefined categorical structures, crucial for consistent reporting and visualization.
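
For comparison, a character column carries no information about unobserved categories, so complete() has nothing to add. The sketch below runs the same call on a character and a factor version of the data; chr_data and fct_data are illustrative names.

```r
library(tidyr)
library(dplyr)

chr_data <- tibble(category = c("A", "C"), value = c(1, 3))

# Same values, but with the full set of levels declared
fct_data <- chr_data %>%
  mutate(category = factor(category, levels = c("A", "B", "C", "D")))

# Character column: only the observed values A and C are used
chr_completed <- chr_data %>%
  complete(category, fill = list(value = 0))

# Factor column: unused levels B and D become new rows
fct_completed <- fct_data %>%
  complete(category, fill = list(value = 0))
```

Converting to a factor with explicit levels is therefore one way to declare the full set of categories up front.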