R tidyr - pivot_wider() (Long to Wide)
Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. `pivot_wider()` from the tidyr package...
Key Insights
pivot_wider()transforms long-format data into wide format by spreading key-value pairs across multiple columns, essential for creating analysis-ready datasets and summary tables- The function uses
names_fromto specify which column(s) contain future column names andvalues_fromto identify which column(s) hold the values to spread - Advanced features like
names_prefix,values_fill, andvalues_fnhandle edge cases including duplicate identifiers, missing values, and multiple observations per cell
Understanding Long to Wide Transformation
Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. pivot_wider() from the tidyr package performs this transformation, making data suitable for human readability, certain statistical analyses, and reporting.
The basic syntax requires identifying which columns define the structure:
library(tidyr)
library(dplyr)
# Long format data
sales_long <- tibble(
region = c("North", "North", "South", "South"),
quarter = c("Q1", "Q2", "Q1", "Q2"),
revenue = c(100, 150, 120, 180)
)
# Convert to wide format
sales_wide <- sales_long %>%
pivot_wider(
names_from = quarter,
values_from = revenue
)
print(sales_wide)
Output:
# A tibble: 2 × 3
region Q1 Q2
<chr> <dbl> <dbl>
1 North 100 150
2 South 120 180
Working with Multiple Value Columns
Real-world datasets often require spreading multiple metrics simultaneously. pivot_wider() handles this by accepting vectors for values_from:
# Multiple metrics per observation
metrics_long <- tibble(
store = rep(c("Store_A", "Store_B"), each = 3),
month = rep(c("Jan", "Feb", "Mar"), 2),
sales = c(1000, 1200, 1100, 900, 950, 1050),
costs = c(600, 700, 650, 550, 580, 630)
)
# Spread both sales and costs
metrics_wide <- metrics_long %>%
pivot_wider(
names_from = month,
values_from = c(sales, costs)
)
print(metrics_wide)
Output:
# A tibble: 2 × 7
store sales_Jan sales_Feb sales_Mar costs_Jan costs_Feb costs_Mar
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Store_A 1000 1200 1100 600 700 650
2 Store_B 900 950 1050 550 580 630
The column names combine the metric name with the month value. Control this naming with names_sep or names_glue.
Handling Missing Values and Duplicates
Missing combinations in long data create NA values in wide format. Use values_fill to replace these with defaults:
# Incomplete data
incomplete_data <- tibble(
product = c("Widget", "Widget", "Gadget"),
year = c(2022, 2023, 2022),
units = c(100, 120, 80)
)
# Fill missing values with 0
complete_wide <- incomplete_data %>%
pivot_wider(
names_from = year,
values_from = units,
values_fill = 0
)
print(complete_wide)
Output:
# A tibble: 2 × 3
product `2022` `2023`
<chr> <dbl> <dbl>
1 Widget 100 120
2 Gadget 80 0
When multiple observations exist for the same identifier combination, specify an aggregation function with values_fn:
# Duplicate entries
duplicates <- tibble(
category = c("A", "A", "A", "B", "B"),
type = c("X", "X", "Y", "X", "Y"),
amount = c(10, 15, 20, 25, 30)
)
# Aggregate duplicates by summing
aggregated_wide <- duplicates %>%
pivot_wider(
names_from = type,
values_from = amount,
values_fn = sum
)
print(aggregated_wide)
Output:
# A tibble: 2 × 3
category X Y
<chr> <dbl> <dbl>
1 A 25 20
2 B 25 30
Advanced Column Naming Strategies
Control output column names precisely using names_prefix, names_sep, and names_glue:
# Survey responses
survey <- tibble(
respondent = c(1, 1, 2, 2),
question = c("Q1", "Q2", "Q1", "Q2"),
score = c(5, 4, 3, 5)
)
# Add prefix to column names
survey_wide_prefix <- survey %>%
pivot_wider(
names_from = question,
values_from = score,
names_prefix = "score_"
)
print(survey_wide_prefix)
Output:
# A tibble: 2 × 3
respondent score_Q1 score_Q2
<dbl> <dbl> <dbl>
1 1 5 4
2 2 3 5
For complex naming with multiple columns in names_from, use names_glue:
# Multi-dimensional data
multi_dim <- tibble(
id = c(1, 1, 1, 1),
metric = c("temp", "temp", "humidity", "humidity"),
time = c("AM", "PM", "AM", "PM"),
value = c(20, 25, 60, 55)
)
# Custom column naming template
multi_wide <- multi_dim %>%
pivot_wider(
names_from = c(metric, time),
values_from = value,
names_glue = "{metric}_{time}"
)
print(multi_wide)
Output:
# A tibble: 1 × 5
id temp_AM temp_PM humidity_AM humidity_PM
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 25 60 55
Pivoting with Multiple Name Columns
Spread data based on combinations of multiple columns by passing a vector to names_from:
# Sales by product and channel
sales_multi <- tibble(
date = rep("2024-01", 4),
product = c("A", "A", "B", "B"),
channel = c("Online", "Store", "Online", "Store"),
revenue = c(1000, 800, 1200, 900)
)
# Create columns from product-channel combinations
sales_multi_wide <- sales_multi %>%
pivot_wider(
names_from = c(product, channel),
values_from = revenue,
names_sep = "_"
)
print(sales_multi_wide)
Output:
# A tibble: 1 × 5
date A_Online A_Store B_Online B_Store
<chr> <dbl> <dbl> <dbl> <dbl>
1 2024-01 1000 800 1200 900
Practical Use Case: Creating Comparison Tables
A common application involves creating side-by-side comparison tables from time-series data:
# Monthly performance data
performance <- tibble(
department = rep(c("Sales", "Marketing", "Operations"), each = 3),
month = rep(c("Jan", "Feb", "Mar"), 3),
budget = c(50000, 52000, 51000, 30000, 31000, 32000, 40000, 41000, 42000),
actual = c(48000, 53000, 52000, 29000, 30500, 33000, 39000, 40500, 41500)
)
# Create comparison table with calculated variance
comparison <- performance %>%
mutate(variance = actual - budget) %>%
pivot_wider(
names_from = month,
values_from = c(budget, actual, variance),
names_glue = "{month}_{.value}"
) %>%
select(department, starts_with("Jan"), starts_with("Feb"), starts_with("Mar"))
print(comparison)
This produces a structured comparison table where each month’s budget, actual, and variance appear together, facilitating quick analysis across departments and time periods.
Performance Considerations
For large datasets, pivot_wider() performance depends on the number of unique combinations in names_from. Pre-filter data and ensure identifier columns are properly indexed. When dealing with millions of rows, consider:
# Efficient approach for large data
large_data %>%
filter(relevant_condition) %>% # Reduce rows first
distinct() %>% # Remove duplicates
pivot_wider(
names_from = category,
values_from = value,
values_fn = list # Returns lists for multiple values
)
The values_fn = list option creates list-columns when aggregation isn’t straightforward, preserving all values for manual inspection or custom processing downstream.