R tidyr - pivot_wider() (Long to Wide)

Key Insights

pivot_wider() transforms long-format data into wide format by spreading key-value pairs across multiple columns, essential for creating analysis-ready datasets and summary tables
The function uses names_from to specify which column(s) contain future column names and values_from to identify which column(s) hold the values to spread
Advanced features like names_prefix, values_fill, and values_fn handle edge cases including duplicate identifiers, missing values, and multiple observations per cell

Understanding Long to Wide Transformation

Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. pivot_wider() from the tidyr package performs this transformation, making data suitable for human readability, certain statistical analyses, and reporting.

The basic syntax requires identifying which columns define the structure:

library(tidyr)
library(dplyr)

# Long format data
sales_long <- tibble(
  region = c("North", "North", "South", "South"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(100, 150, 120, 180)
)

# Convert to wide format
sales_wide <- sales_long %>%
  pivot_wider(
    names_from = quarter,
    values_from = revenue
  )

print(sales_wide)

Output:

# A tibble: 2 × 3
  region    Q1    Q2
  <chr>  <dbl> <dbl>
1 North    100   150
2 South    120   180

Working with Multiple Value Columns

Real-world datasets often require spreading multiple metrics simultaneously. pivot_wider() handles this by accepting vectors for values_from:

# Multiple metrics per observation
metrics_long <- tibble(
  store = rep(c("Store_A", "Store_B"), each = 3),
  month = rep(c("Jan", "Feb", "Mar"), 2),
  sales = c(1000, 1200, 1100, 900, 950, 1050),
  costs = c(600, 700, 650, 550, 580, 630)
)

# Spread both sales and costs
metrics_wide <- metrics_long %>%
  pivot_wider(
    names_from = month,
    values_from = c(sales, costs)
  )

print(metrics_wide)

Output:

# A tibble: 2 × 7
  store   sales_Jan sales_Feb sales_Mar costs_Jan costs_Feb costs_Mar
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Store_A      1000      1200      1100       600       700       650
2 Store_B       900       950      1050       550       580       630

The column names combine the metric name with the month value. Control this naming with names_sep or names_glue.

Handling Missing Values and Duplicates

Missing combinations in long data create NA values in wide format. Use values_fill to replace these with defaults:

# Incomplete data
incomplete_data <- tibble(
  product = c("Widget", "Widget", "Gadget"),
  year = c(2022, 2023, 2022),
  units = c(100, 120, 80)
)

# Fill missing values with 0
complete_wide <- incomplete_data %>%
  pivot_wider(
    names_from = year,
    values_from = units,
    values_fill = 0
  )

print(complete_wide)

Output:

# A tibble: 2 × 3
  product `2022` `2023`
  <chr>    <dbl>  <dbl>
1 Widget     100    120
2 Gadget      80      0

When multiple observations exist for the same identifier combination, specify an aggregation function with values_fn:

# Duplicate entries
duplicates <- tibble(
  category = c("A", "A", "A", "B", "B"),
  type = c("X", "X", "Y", "X", "Y"),
  amount = c(10, 15, 20, 25, 30)
)

# Aggregate duplicates by summing
aggregated_wide <- duplicates %>%
  pivot_wider(
    names_from = type,
    values_from = amount,
    values_fn = sum
  )

print(aggregated_wide)

Output:

# A tibble: 2 × 3
  category     X     Y
  <chr>    <dbl> <dbl>
1 A           25    20
2 B           25    30

Advanced Column Naming Strategies

Control output column names precisely using names_prefix, names_sep, and names_glue:

# Survey responses
survey <- tibble(
  respondent = c(1, 1, 2, 2),
  question = c("Q1", "Q2", "Q1", "Q2"),
  score = c(5, 4, 3, 5)
)

# Add prefix to column names
survey_wide_prefix <- survey %>%
  pivot_wider(
    names_from = question,
    values_from = score,
    names_prefix = "score_"
  )

print(survey_wide_prefix)

Output:

# A tibble: 2 × 3
  respondent score_Q1 score_Q2
       <dbl>    <dbl>    <dbl>
1          1        5        4
2          2        3        5

For complex naming with multiple columns in names_from, use names_glue:

# Multi-dimensional data
multi_dim <- tibble(
  id = c(1, 1, 1, 1),
  metric = c("temp", "temp", "humidity", "humidity"),
  time = c("AM", "PM", "AM", "PM"),
  value = c(20, 25, 60, 55)
)

# Custom column naming template
multi_wide <- multi_dim %>%
  pivot_wider(
    names_from = c(metric, time),
    values_from = value,
    names_glue = "{metric}_{time}"
  )

print(multi_wide)

Output:

# A tibble: 1 × 5
     id temp_AM temp_PM humidity_AM humidity_PM
  <dbl>   <dbl>   <dbl>       <dbl>       <dbl>
1     1      20      25          60          55

Pivoting with Multiple Name Columns

Spread data based on combinations of multiple columns by passing a vector to names_from:

# Sales by product and channel
sales_multi <- tibble(
  date = rep("2024-01", 4),
  product = c("A", "A", "B", "B"),
  channel = c("Online", "Store", "Online", "Store"),
  revenue = c(1000, 800, 1200, 900)
)

# Create columns from product-channel combinations
sales_multi_wide <- sales_multi %>%
  pivot_wider(
    names_from = c(product, channel),
    values_from = revenue,
    names_sep = "_"
  )

print(sales_multi_wide)

Output:

# A tibble: 1 × 5
  date    A_Online A_Store B_Online B_Store
  <chr>      <dbl>   <dbl>    <dbl>   <dbl>
1 2024-01     1000     800     1200     900

Practical Use Case: Creating Comparison Tables

A common application involves creating side-by-side comparison tables from time-series data:

# Monthly performance data
performance <- tibble(
  department = rep(c("Sales", "Marketing", "Operations"), each = 3),
  month = rep(c("Jan", "Feb", "Mar"), 3),
  budget = c(50000, 52000, 51000, 30000, 31000, 32000, 40000, 41000, 42000),
  actual = c(48000, 53000, 52000, 29000, 30500, 33000, 39000, 40500, 41500)
)

# Create comparison table with calculated variance
comparison <- performance %>%
  mutate(variance = actual - budget) %>%
  pivot_wider(
    names_from = month,
    values_from = c(budget, actual, variance),
    names_glue = "{month}_{.value}"
  ) %>%
  select(department, starts_with("Jan"), starts_with("Feb"), starts_with("Mar"))

print(comparison)

This produces a structured comparison table where each month’s budget, actual, and variance appear together, facilitating quick analysis across departments and time periods.

Performance Considerations

For large datasets, pivot_wider() performance depends on the number of unique combinations in names_from. Pre-filter data and ensure identifier columns are properly indexed. When dealing with millions of rows, consider:

# Efficient approach for large data
large_data %>%
  filter(relevant_condition) %>%  # Reduce rows first
  distinct() %>%                   # Remove duplicates
  pivot_wider(
    names_from = category,
    values_from = value,
    values_fn = list                # Returns lists for multiple values
  )

The values_fn = list option creates list-columns when aggregation isn’t straightforward, preserving all values for manual inspection or custom processing downstream.

Understanding Long to Wide Transformation

Working with Multiple Value Columns

Handling Missing Values and Duplicates

Advanced Column Naming Strategies

Pivoting with Multiple Name Columns

Practical Use Case: Creating Comparison Tables

Performance Considerations

Liked this? There's more.