R tidyr - expand_grid() and crossing()
Key Insights
- `expand_grid()` and `crossing()` both generate all combinations of input vectors, but `expand_grid()` preserves input order while `crossing()` sorts outputs and removes duplicates
- These functions supersede base R's `expand.grid()` with faster, more predictable behavior and better handling of data frames and factors
- Use `expand_grid()` for systematic parameter sweeps, simulation scenarios, and complete factorial designs where order matters
Understanding Combination Generation
Both expand_grid() and crossing() create data frames containing all possible combinations of their input vectors. They’re essential for generating test scenarios, creating complete datasets for modeling, and preparing data for joins.
library(tidyr)
# Basic expand_grid usage
expand_grid(
method = c("GET", "POST"),
status = c(200, 404, 500)
)
# A tibble: 6 × 2
method status
<chr> <dbl>
1 GET 200
2 GET 404
3 GET 500
4 POST 200
5 POST 404
6 POST 500
The function maintains the order of inputs, with the first argument varying slowest and the last varying fastest. This predictable ordering makes it ideal for structured analyses.
Key Differences Between expand_grid() and crossing()
While functionally similar, these functions have distinct behaviors that matter in production code.
# expand_grid preserves duplicates and order
expand_grid(
x = c(2, 1, 2),
y = c("b", "a")
)
# A tibble: 6 × 2
x y
<dbl> <chr>
1 2 b
2 2 a
3 1 b
4 1 a
5 2 b
6 2 a
# crossing sorts and removes duplicates
crossing(
x = c(2, 1, 2),
y = c("b", "a")
)
# A tibble: 4 × 2
x y
<dbl> <chr>
1 1 a
2 1 b
3 2 a
4 2 b
Use expand_grid() when order matters for your analysis or when duplicates represent distinct scenarios. Use crossing() when you need a sorted, deduplicated set of unique combinations.
Generating Parameter Grids for Model Tuning
Parameter sweeps are a common use case. Here’s a practical example for testing database connection configurations:
connection_tests <- expand_grid(
pool_size = c(5, 10, 20),
timeout_ms = c(1000, 5000, 10000),
retry_attempts = c(1, 3, 5)
)
# Add simulation results
connection_tests$success_rate <- runif(nrow(connection_tests))
# Find optimal configuration
library(dplyr)
connection_tests %>%
arrange(desc(success_rate)) %>%
slice(1)
This approach systematically tests 27 configurations without manual enumeration.
Working with Date Sequences
Combining expand_grid() with date sequences creates time-series scaffolds:
# Generate daily metrics template for multiple services
metrics_template <- expand_grid(
date = seq(as.Date("2024-01-01"), as.Date("2024-01-07"), by = "day"),
service = c("api", "web", "worker"),
metric = c("requests", "errors", "latency_p95")
)
head(metrics_template, 9)
# A tibble: 9 × 3
date service metric
<date> <chr> <chr>
1 2024-01-01 api requests
2 2024-01-01 api errors
3 2024-01-01 api latency_p95
4 2024-01-01 web requests
5 2024-01-01 web errors
6 2024-01-01 web latency_p95
7 2024-01-01 worker requests
8 2024-01-01 worker errors
9 2024-01-01 worker latency_p95
This structure is perfect for left-joining actual metrics data, ensuring no date-service-metric combinations are missing from reports.
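That join can be sketched as follows; the `actual_metrics` tibble below is a hypothetical stand-in for real observations:

```r
library(dplyr)
library(tidyr)

# Hypothetical observed metrics -- only two of the 63 combinations
actual_metrics <- tibble(
  date = as.Date(c("2024-01-01", "2024-01-02")),
  service = c("api", "web"),
  metric = c("requests", "errors"),
  value = c(1520, 7)
)

metrics_template <- expand_grid(
  date = seq(as.Date("2024-01-01"), as.Date("2024-01-07"), by = "day"),
  service = c("api", "web", "worker"),
  metric = c("requests", "errors", "latency_p95")
)

# Every scaffold row survives the left join; unmatched combinations
# get NA, which replace_na() converts to explicit zeros
complete_metrics <- metrics_template %>%
  left_join(actual_metrics, by = c("date", "service", "metric")) %>%
  mutate(value = replace_na(value, 0))

nrow(complete_metrics)  # 63: 7 days x 3 services x 3 metrics
```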
Nested Data Frames and List Columns
Both functions handle data frames and list columns, enabling complex nested structures:
# Create test scenarios with nested configurations
test_scenarios <- expand_grid(
environment = c("dev", "staging", "prod"),
config = list(
list(cache = TRUE, workers = 2),
list(cache = FALSE, workers = 4)
)
)
test_scenarios
# A tibble: 6 × 2
environment config
<chr> <list>
1 dev <named list [2]>
2 dev <named list [2]>
3 staging <named list [2]>
4 staging <named list [2]>
5 prod <named list [2]>
6 prod <named list [2]>
Access nested values using tidyr::unnest_wider() or purrr::map():
library(purrr)
test_scenarios %>%
mutate(
cache_enabled = map_lgl(config, "cache"),
worker_count = map_dbl(config, "workers")
) %>%
select(-config)
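As an alternative to the `map_*()` approach above, `unnest_wider()` spreads each named list element into its own column in a single step:

```r
library(tidyr)

test_scenarios <- expand_grid(
  environment = c("dev", "staging", "prod"),
  config = list(
    list(cache = TRUE, workers = 2),
    list(cache = FALSE, workers = 4)
  )
)

# Each name inside the config lists becomes its own column
flat_scenarios <- test_scenarios %>%
  unnest_wider(config)

names(flat_scenarios)  # "environment" "cache" "workers"
```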
A/B Testing Scenario Generation
Generate complete factorial designs for experiments:
# E-commerce checkout flow experiment
ab_test_design <- expand_grid(
button_color = c("green", "blue", "orange"),
button_text = c("Buy Now", "Add to Cart", "Purchase"),
layout = c("single_column", "two_column"),
user_segment = c("new", "returning", "premium")
)
# 54 unique combinations
nrow(ab_test_design)
[1] 54
Assign traffic allocation:
ab_test_design %>%
mutate(
variant_id = row_number(),
traffic_pct = 100 / n()
) %>%
select(variant_id, everything())
Handling Missing Combinations with Joins
Real-world data often has gaps. Use expand_grid() to create a complete scaffold, then join actual data:
# Actual API response times (incomplete data)
actual_data <- tibble(
endpoint = c("/users", "/users", "/orders"),
method = c("GET", "POST", "GET"),
avg_ms = c(45, 120, 89)
)
# Complete grid of expected combinations
expected_combinations <- expand_grid(
endpoint = c("/users", "/orders", "/products"),
method = c("GET", "POST", "PUT", "DELETE")
)
# Identify missing data
expected_combinations %>%
anti_join(actual_data, by = c("endpoint", "method"))
# A tibble: 9 × 2
endpoint method
<chr> <chr>
1 /users PUT
2 /users DELETE
3 /orders POST
4 /orders PUT
5 /orders DELETE
6 /products GET
7 /products POST
8 /products PUT
9 /products DELETE
Performance Considerations
Both functions are optimized for speed, but output size grows multiplicatively: the row count equals the product of the input lengths.
# Small inputs create large outputs
big_grid <- expand_grid(
a = 1:100,
b = 1:100,
c = 1:100
)
nrow(big_grid) # 1,000,000 rows
[1] 1000000
For large grids, consider:
- Generating combinations in chunks rather than all at once
- Filtering early with an immediate `filter()` call so intermediate results stay small
- Using `crossing()` to deduplicate inputs when appropriate
# Filter during generation using dplyr
expand_grid(
x = 1:100,
y = 1:100
) %>%
filter(x <= y) # Only upper triangle: 5,050 rows instead of 10,000
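The chunked approach can be sketched like this (the chunk size and helper structure are illustrative, not a fixed recipe): split one input into batches, expand each batch against the rest, filter immediately, and bind the survivors so the full grid never materializes in memory at once.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Split x into 10 chunks of 10 values each; each chunk is expanded
# and filtered before the next one is generated
x_chunks <- split(1:100, ceiling(seq_along(1:100) / 10))

upper_triangle <- map_dfr(x_chunks, function(x_chunk) {
  expand_grid(x = x_chunk, y = 1:100) %>%
    filter(x <= y)
})

nrow(upper_triangle)  # 5050, matching the filter-after-the-fact result
```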
Replacing Base R expand.grid()
The tidyr functions offer several advantages over expand.grid():
# Base R (old approach)
base_result <- expand.grid(
x = c("a", "b"),
y = 1:2,
stringsAsFactors = FALSE
)
# tidyr (modern approach)
tidyr_result <- expand_grid(
x = c("a", "b"),
y = 1:2
)
Key improvements:
- No automatic factor conversion (no `stringsAsFactors` argument needed)
- Column ordering that matches input order, with the first input varying slowest
- Better performance on large datasets
- Returns tibbles with enhanced printing
- Handles list columns naturally
Practical Simulation Example
Combine everything for Monte Carlo simulations:
set.seed(42)
simulation <- expand_grid(
run = 1:1000,
initial_investment = c(10000, 50000, 100000),
annual_return = seq(0.05, 0.15, by = 0.05)
) %>%
mutate(
years = 10,
volatility = rnorm(n(), mean = 0, sd = 0.1),
final_value = initial_investment * (1 + annual_return + volatility)^years
)
# Summarize results by scenario
simulation %>%
group_by(initial_investment, annual_return) %>%
summarise(
mean_return = mean(final_value),
sd_return = sd(final_value),
.groups = "drop"
)
This generates 9,000 simulation runs across 9 scenarios, enabling robust statistical analysis of investment strategies.
Both expand_grid() and crossing() are fundamental tools for systematic data generation. Choose expand_grid() for ordered, complete expansions and crossing() when you need sorted, unique combinations.