R tidyr - pivot_wider() (Long to Wide)

Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. `pivot_wider()` from the tidyr package...

Key Insights

  • pivot_wider() transforms long-format data into wide format by spreading key-value pairs across multiple columns, essential for creating analysis-ready datasets and summary tables
  • The function uses names_from to specify which column(s) contain future column names and values_from to identify which column(s) hold the values to spread
  • Advanced features like names_prefix, values_fill, and values_fn handle edge cases including duplicate identifiers, missing values, and multiple observations per cell

Understanding Long to Wide Transformation

Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. pivot_wider() from the tidyr package performs this transformation, making data suitable for human readability, certain statistical analyses, and reporting.

The basic syntax requires identifying which columns define the structure:

library(tidyr)
library(dplyr)

# Long format data
sales_long <- tibble(
  region = c("North", "North", "South", "South"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(100, 150, 120, 180)
)

# Convert to wide format
sales_wide <- sales_long %>%
  pivot_wider(
    names_from = quarter,
    values_from = revenue
  )

print(sales_wide)

Output:

# A tibble: 2 × 3
  region    Q1    Q2
  <chr>  <dbl> <dbl>
1 North    100   150
2 South    120   180

Working with Multiple Value Columns

Real-world datasets often require spreading multiple metrics simultaneously. pivot_wider() handles this by accepting vectors for values_from:

# Multiple metrics per observation
metrics_long <- tibble(
  store = rep(c("Store_A", "Store_B"), each = 3),
  month = rep(c("Jan", "Feb", "Mar"), 2),
  sales = c(1000, 1200, 1100, 900, 950, 1050),
  costs = c(600, 700, 650, 550, 580, 630)
)

# Spread both sales and costs
metrics_wide <- metrics_long %>%
  pivot_wider(
    names_from = month,
    values_from = c(sales, costs)
  )

print(metrics_wide)

Output:

# A tibble: 2 × 7
  store   sales_Jan sales_Feb sales_Mar costs_Jan costs_Feb costs_Mar
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Store_A      1000      1200      1100       600       700       650
2 Store_B       900       950      1050       550       580       630

The column names combine the metric name with the month value. Control this naming with names_sep or names_glue.

Handling Missing Values and Duplicates

Missing combinations in long data create NA values in wide format. Use values_fill to replace these with defaults:

# Incomplete data
incomplete_data <- tibble(
  product = c("Widget", "Widget", "Gadget"),
  year = c(2022, 2023, 2022),
  units = c(100, 120, 80)
)

# Fill missing values with 0
complete_wide <- incomplete_data %>%
  pivot_wider(
    names_from = year,
    values_from = units,
    values_fill = 0
  )

print(complete_wide)

Output:

# A tibble: 2 × 3
  product `2022` `2023`
  <chr>    <dbl>  <dbl>
1 Widget     100    120
2 Gadget      80      0

When multiple observations exist for the same identifier combination, specify an aggregation function with values_fn:

# Duplicate entries
duplicates <- tibble(
  category = c("A", "A", "A", "B", "B"),
  type = c("X", "X", "Y", "X", "Y"),
  amount = c(10, 15, 20, 25, 30)
)

# Aggregate duplicates by summing
aggregated_wide <- duplicates %>%
  pivot_wider(
    names_from = type,
    values_from = amount,
    values_fn = sum
  )

print(aggregated_wide)

Output:

# A tibble: 2 × 3
  category     X     Y
  <chr>    <dbl> <dbl>
1 A           25    20
2 B           25    30

Advanced Column Naming Strategies

Control output column names precisely using names_prefix, names_sep, and names_glue:

# Survey responses
survey <- tibble(
  respondent = c(1, 1, 2, 2),
  question = c("Q1", "Q2", "Q1", "Q2"),
  score = c(5, 4, 3, 5)
)

# Add prefix to column names
survey_wide_prefix <- survey %>%
  pivot_wider(
    names_from = question,
    values_from = score,
    names_prefix = "score_"
  )

print(survey_wide_prefix)

Output:

# A tibble: 2 × 3
  respondent score_Q1 score_Q2
       <dbl>    <dbl>    <dbl>
1          1        5        4
2          2        3        5

For complex naming with multiple columns in names_from, use names_glue:

# Multi-dimensional data
multi_dim <- tibble(
  id = c(1, 1, 1, 1),
  metric = c("temp", "temp", "humidity", "humidity"),
  time = c("AM", "PM", "AM", "PM"),
  value = c(20, 25, 60, 55)
)

# Custom column naming template
multi_wide <- multi_dim %>%
  pivot_wider(
    names_from = c(metric, time),
    values_from = value,
    names_glue = "{metric}_{time}"
  )

print(multi_wide)

Output:

# A tibble: 1 × 5
     id temp_AM temp_PM humidity_AM humidity_PM
  <dbl>   <dbl>   <dbl>       <dbl>       <dbl>
1     1      20      25          60          55

Pivoting with Multiple Name Columns

Spread data based on combinations of multiple columns by passing a vector to names_from:

# Sales by product and channel
sales_multi <- tibble(
  date = rep("2024-01", 4),
  product = c("A", "A", "B", "B"),
  channel = c("Online", "Store", "Online", "Store"),
  revenue = c(1000, 800, 1200, 900)
)

# Create columns from product-channel combinations
sales_multi_wide <- sales_multi %>%
  pivot_wider(
    names_from = c(product, channel),
    values_from = revenue,
    names_sep = "_"
  )

print(sales_multi_wide)

Output:

# A tibble: 1 × 5
  date    A_Online A_Store B_Online B_Store
  <chr>      <dbl>   <dbl>    <dbl>   <dbl>
1 2024-01     1000     800     1200     900

Practical Use Case: Creating Comparison Tables

A common application involves creating side-by-side comparison tables from time-series data:

# Monthly performance data
performance <- tibble(
  department = rep(c("Sales", "Marketing", "Operations"), each = 3),
  month = rep(c("Jan", "Feb", "Mar"), 3),
  budget = c(50000, 52000, 51000, 30000, 31000, 32000, 40000, 41000, 42000),
  actual = c(48000, 53000, 52000, 29000, 30500, 33000, 39000, 40500, 41500)
)

# Create comparison table with calculated variance
comparison <- performance %>%
  mutate(variance = actual - budget) %>%
  pivot_wider(
    names_from = month,
    values_from = c(budget, actual, variance),
    names_glue = "{month}_{.value}"
  ) %>%
  select(department, starts_with("Jan"), starts_with("Feb"), starts_with("Mar"))

print(comparison)

This produces a structured comparison table where each month’s budget, actual, and variance appear together, facilitating quick analysis across departments and time periods.

Performance Considerations

For large datasets, pivot_wider() performance depends on the number of unique combinations in names_from. Pre-filter data and ensure identifier columns are properly indexed. When dealing with millions of rows, consider:

# Efficient approach for large data
large_data %>%
  filter(relevant_condition) %>%  # Reduce rows first
  distinct() %>%                   # Remove duplicates
  pivot_wider(
    names_from = category,
    values_from = value,
    values_fn = list                # Returns lists for multiple values
  )

The values_fn = list option creates list-columns when aggregation isn’t straightforward, preserving all values for manual inspection or custom processing downstream.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.