R tidyr - expand_grid() and crossing()

Key Insights

  • expand_grid() and crossing() both generate all combinations of input vectors, but expand_grid() preserves input order and duplicates while crossing() de-duplicates and sorts its inputs first
  • These functions supersede base R's expand.grid() with faster, more predictable behavior and better handling of data frames and factors
  • Use expand_grid() for systematic parameter sweeps, simulation scenarios, and creating complete factorial designs where order matters

Understanding Combination Generation

Both expand_grid() and crossing() create data frames containing all possible combinations of their input vectors. They’re essential for generating test scenarios, creating complete datasets for modeling, and preparing data for joins.

library(tidyr)

# Basic expand_grid usage
expand_grid(
  method = c("GET", "POST"),
  status = c(200, 404, 500)
)
# A tibble: 6 × 2
  method status
  <chr>   <dbl>
1 GET       200
2 GET       404
3 GET       500
4 POST      200
5 POST      404
6 POST      500

The function maintains the order of inputs, with the first argument varying slowest and the last varying fastest. This predictable ordering makes it ideal for structured analyses.

Key Differences Between expand_grid() and crossing()

While functionally similar, these functions have distinct behaviors that matter in production code.

# expand_grid preserves duplicates and order
expand_grid(
  x = c(2, 1, 2),
  y = c("b", "a")
)
# A tibble: 6 × 2
      x y    
  <dbl> <chr>
1     2 b    
2     2 a    
3     1 b    
4     1 a    
5     2 b    
6     2 a    

# crossing sorts and removes duplicates
crossing(
  x = c(2, 1, 2),
  y = c("b", "a")
)
# A tibble: 4 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     1 b    
3     2 a    
4     2 b    

Use expand_grid() when order matters for your analysis or when duplicates represent distinct scenarios. Use crossing() when you need a sorted, deduplicated set of unique combinations.
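A useful mental model, taken from how crossing() is documented: it behaves like expand_grid() applied to sorted, de-duplicated inputs. A minimal sketch of that equivalence:

```r
library(tidyr)

a <- crossing(x = c(2, 1, 2), y = c("b", "a"))

# Manually de-duplicate and sort the inputs, then expand
b <- expand_grid(x = sort(unique(c(2, 1, 2))),
                 y = sort(unique(c("b", "a"))))

# Both yield the same 4-row tibble: x = 1, 1, 2, 2; y = "a", "b", "a", "b"
```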

Generating Parameter Grids for Model Tuning

Parameter sweeps are a common use case. Here’s a practical example for testing database connection configurations:

connection_tests <- expand_grid(
  pool_size = c(5, 10, 20),
  timeout_ms = c(1000, 5000, 10000),
  retry_attempts = c(1, 3, 5)
)

# Add simulation results
library(dplyr)

connection_tests <- connection_tests %>%
  mutate(success_rate = runif(n()))

# Find optimal configuration
connection_tests %>%
  arrange(desc(success_rate)) %>%
  slice(1)

This approach systematically tests all 27 configurations (3 × 3 × 3) without manual enumeration.

Working with Date Sequences

Combining expand_grid() with date sequences creates time-series scaffolds:

# Generate daily metrics template for multiple services
metrics_template <- expand_grid(
  date = seq(as.Date("2024-01-01"), as.Date("2024-01-07"), by = "day"),
  service = c("api", "web", "worker"),
  metric = c("requests", "errors", "latency_p95")
)

head(metrics_template, 9)
# A tibble: 9 × 3
  date       service metric     
  <date>     <chr>   <chr>      
1 2024-01-01 api     requests   
2 2024-01-01 api     errors     
3 2024-01-01 api     latency_p95
4 2024-01-01 web     requests   
5 2024-01-01 web     errors     
6 2024-01-01 web     latency_p95
7 2024-01-01 worker  requests   
8 2024-01-01 worker  errors     
9 2024-01-01 worker  latency_p95

This structure is perfect for left-joining actual metrics data, ensuring no date-service-metric combinations are missing from reports.
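As a minimal sketch of that scaffold-plus-join pattern (the `actual_metrics` values below are hypothetical), missing combinations survive the join as NA and can then be zero-filled:

```r
library(dplyr)
library(tidyr)

# Hypothetical observed data: only two of the 63 combinations reported
actual_metrics <- tibble(
  date = as.Date(c("2024-01-01", "2024-01-01")),
  service = c("api", "web"),
  metric = c("requests", "requests"),
  value = c(1520, 987)
)

metrics_template <- expand_grid(
  date = seq(as.Date("2024-01-01"), as.Date("2024-01-07"), by = "day"),
  service = c("api", "web", "worker"),
  metric = c("requests", "errors", "latency_p95")
)

# Every scaffold row survives the join; unreported combinations become 0
complete_metrics <- metrics_template %>%
  left_join(actual_metrics, by = c("date", "service", "metric")) %>%
  replace_na(list(value = 0))
```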

Nested Data Frames and List Columns

Both functions handle data frames and list columns, enabling complex nested structures:

# Create test scenarios with nested configurations
test_scenarios <- expand_grid(
  environment = c("dev", "staging", "prod"),
  config = list(
    list(cache = TRUE, workers = 2),
    list(cache = FALSE, workers = 4)
  )
)

test_scenarios
# A tibble: 6 × 2
  environment config          
  <chr>       <list>          
1 dev         <named list [2]>
2 dev         <named list [2]>
3 staging     <named list [2]>
4 staging     <named list [2]>
5 prod        <named list [2]>
6 prod        <named list [2]>

Access nested values using tidyr::unnest_wider() or purrr's typed map functions:

library(purrr)

test_scenarios %>%
  mutate(
    cache_enabled = map_lgl(config, "cache"),
    worker_count = map_dbl(config, "workers")
  ) %>%
  select(-config)
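The same flattening can be done declaratively with unnest_wider(), which spreads each named element of the list column into its own column:

```r
library(tidyr)
library(dplyr)

test_scenarios <- expand_grid(
  environment = c("dev", "staging", "prod"),
  config = list(
    list(cache = TRUE, workers = 2),
    list(cache = FALSE, workers = 4)
  )
)

# `config` is replaced by one column per named element: cache, workers
flat <- test_scenarios %>%
  unnest_wider(config)
```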

A/B Testing Scenario Generation

Generate complete factorial designs for experiments:

# E-commerce checkout flow experiment
ab_test_design <- expand_grid(
  button_color = c("green", "blue", "orange"),
  button_text = c("Buy Now", "Add to Cart", "Purchase"),
  layout = c("single_column", "two_column"),
  user_segment = c("new", "returning", "premium")
)

# 54 unique combinations
nrow(ab_test_design)
[1] 54

Assign traffic allocation:

ab_test_design %>%
  mutate(
    variant_id = row_number(),
    traffic_pct = 100 / n()
  ) %>%
  select(variant_id, everything())

Handling Missing Combinations with Joins

Real-world data often has gaps. Use expand_grid() to create a complete scaffold, then join actual data:

# Actual API response times (incomplete data)
actual_data <- tibble(
  endpoint = c("/users", "/users", "/orders"),
  method = c("GET", "POST", "GET"),
  avg_ms = c(45, 120, 89)
)

# Complete grid of expected combinations
expected_combinations <- expand_grid(
  endpoint = c("/users", "/orders", "/products"),
  method = c("GET", "POST", "PUT", "DELETE")
)

# Identify missing data
expected_combinations %>%
  anti_join(actual_data, by = c("endpoint", "method"))
# A tibble: 9 × 2
  endpoint  method
  <chr>     <chr> 
1 /users    PUT   
2 /users    DELETE
3 /orders   POST  
4 /orders   PUT   
5 /orders   DELETE
6 /products GET   
7 /products POST  
8 /products PUT   
9 /products DELETE
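When you want the filled-out data itself rather than just the list of gaps, tidyr::complete() collapses the scaffold-then-join pattern into one call by crossing the supplied levels and adding the missing rows:

```r
library(dplyr)
library(tidyr)

actual_data <- tibble(
  endpoint = c("/users", "/users", "/orders"),
  method = c("GET", "POST", "GET"),
  avg_ms = c(45, 120, 89)
)

# complete() crosses the supplied levels and adds the missing rows,
# leaving avg_ms as NA where no observation exists
full_report <- actual_data %>%
  complete(
    endpoint = c("/users", "/orders", "/products"),
    method = c("GET", "POST", "PUT", "DELETE")
  )
```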

Performance Considerations

Both functions are optimized for speed, but size grows multiplicatively:

# Small inputs create large outputs
big_grid <- expand_grid(
  a = 1:100,
  b = 1:100,
  c = 1:100
)

nrow(big_grid)  # 1,000,000 rows
[1] 1000000

For large grids, consider:

  1. Generate combinations in chunks
  2. Pipe straight into filter() so unneeded rows are dropped immediately
  3. Use crossing() to de-duplicate inputs when appropriate

# Filter during generation using dplyr
expand_grid(
  x = 1:100,
  y = 1:100
) %>%
  filter(x <= y)  # Only upper triangle: 5,050 rows instead of 10,000
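The chunking idea can be sketched by iterating over the slowest-varying variable and filtering each slice before binding the survivors, so the full million-row grid never exists at once. The `b < c` predicate here is a hypothetical stand-in for whatever condition your analysis needs:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Each iteration materializes a 10,000-row slice instead of 1,000,000 rows
filtered_grid <- map(1:100, function(a_val) {
  expand_grid(a = a_val, b = 1:100, c = 1:100) %>%
    filter(b < c)
}) %>%
  bind_rows()
```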

Replacing Base R expand.grid()

The tidyr functions offer several advantages over expand.grid():

# Base R (old approach)
base_result <- expand.grid(
  x = c("a", "b"),
  y = 1:2,
  stringsAsFactors = FALSE
)

# tidyr (modern approach)
tidyr_result <- expand_grid(
  x = c("a", "b"),
  y = 1:2
)

Key improvements:

  • No automatic factor conversion (expand.grid() still defaults to stringsAsFactors = TRUE)
  • Rows ordered with the first input varying slowest (expand.grid() cycles it fastest)
  • Better performance on large grids
  • Returns tibbles with enhanced printing
  • Handles data frame and list-column inputs naturally
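The factor and row-ordering differences are easy to see side by side:

```r
library(tidyr)

base_result  <- expand.grid(x = c("a", "b"), y = 1:2)
tidyr_result <- expand_grid(x = c("a", "b"), y = 1:2)

# expand.grid() cycles the FIRST argument fastest and returns a factor
as.character(base_result$x)   # "a" "b" "a" "b"

# expand_grid() varies the first argument slowest and keeps characters
tidyr_result$x                # "a" "a" "b" "b"
```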

Practical Simulation Example

Combine everything for Monte Carlo simulations:

set.seed(42)

simulation <- expand_grid(
  run = 1:1000,
  initial_investment = c(10000, 50000, 100000),
  annual_return = seq(0.05, 0.15, by = 0.05)
) %>%
  mutate(
    years = 10,
    volatility = rnorm(n(), mean = 0, sd = 0.1),
    final_value = initial_investment * (1 + annual_return + volatility)^years
  )

# Summarize results by scenario
simulation %>%
  group_by(initial_investment, annual_return) %>%
  summarise(
    mean_return = mean(final_value),
    sd_return = sd(final_value),
    .groups = "drop"
  )

This generates 9,000 simulation runs across 9 scenarios, enabling robust statistical analysis of investment strategies.

Both expand_grid() and crossing() are fundamental tools for systematic data generation. Choose expand_grid() for ordered, complete expansions and crossing() when you need sorted, unique combinations.
