R dplyr - group_by() and summarise()

Key Insights

group_by() partitions data into groups based on one or more variables, enabling grouped operations without splitting your data frame into separate objects
summarise() reduces grouped data to summary statistics, collapsing each group into a single row with computed aggregations
Combining these functions creates a powerful pattern for data aggregation that handles grouped calculations, missing values, and multiple summary statistics efficiently

Understanding the group_by() Mechanism

The group_by() function transforms a regular data frame into a grouped tibble, which subsequent operations treat as separate partitions. This grouping is metadata—the physical data structure remains intact.

library(dplyr)

# Sample sales data
sales <- data.frame(
  region = c("North", "North", "South", "South", "East", "East"),
  product = c("A", "B", "A", "B", "A", "B"),
  revenue = c(1000, 1500, 2000, 2500, 1200, 1800),
  units = c(10, 15, 20, 25, 12, 18)
)

# Create grouped data
grouped_sales <- sales %>% group_by(region)

# Check the structure
class(grouped_sales)
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# View grouping metadata
group_vars(grouped_sales)
# [1] "region"

The grouped object looks identical when printed, but operations now respect the grouping structure. You can group by multiple variables, creating nested partitions.

# Multiple grouping variables
multi_grouped <- sales %>% 
  group_by(region, product)

# Shows 6 groups (3 regions × 2 products)
n_groups(multi_grouped)
# [1] 6

Basic summarise() Operations

The summarise() function (or summarize() with American spelling) computes summary statistics for each group. Each summary reduces the group to a single value.

# Single summary statistic
sales %>%
  group_by(region) %>%
  summarise(total_revenue = sum(revenue))

# Output:
#   region total_revenue
#   <chr>          <dbl>
# 1 East            3000
# 2 North           2500
# 3 South           4500

Multiple summary statistics in one call:

sales %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    total_units = sum(units),
    n_products = n()
  )

# Output:
#   region total_revenue avg_revenue total_units n_products
#   <chr>          <dbl>       <dbl>       <dbl>      <int>
# 1 East            3000        1500          30          2
# 2 North           2500        1250          25          2
# 3 South           4500        2250          45          2

The n() function counts rows per group—a common pattern for frequency analysis.

Handling Multiple Grouping Variables

When grouping by multiple variables, summarise() progressively “peels off” the rightmost grouping variable by default.

sales %>%
  group_by(region, product) %>%
  summarise(total_revenue = sum(revenue))

# Output with grouping message:
# `summarise()` has grouped output by 'region'. 
#   region product total_revenue
#   <chr>  <chr>           <dbl>
# 1 East   A                1200
# 2 East   B                1800
# 3 North  A                1000
# 4 North  B                1500
# 5 South  A                2000
# 6 South  B                2500

Control this behavior with .groups argument:

# Drop all grouping
sales %>%
  group_by(region, product) %>%
  summarise(total_revenue = sum(revenue), .groups = "drop")

# Keep all grouping
sales %>%
  group_by(region, product) %>%
  summarise(total_revenue = sum(revenue), .groups = "keep")

# Drop last grouping (default)
sales %>%
  group_by(region, product) %>%
  summarise(total_revenue = sum(revenue), .groups = "drop_last")

Advanced Summary Functions

Beyond basic aggregations, summarise() works with any function that returns a single value per group.

# Quantile calculations
sales %>%
  group_by(region) %>%
  summarise(
    median_revenue = median(revenue),
    q25 = quantile(revenue, 0.25),
    q75 = quantile(revenue, 0.75),
    iqr = IQR(revenue)
  )

# Statistical measures
sales %>%
  group_by(region) %>%
  summarise(
    mean_rev = mean(revenue),
    sd_rev = sd(revenue),
    cv = sd(revenue) / mean(revenue),  # Coefficient of variation
    min_rev = min(revenue),
    max_rev = max(revenue),
    range = max(revenue) - min(revenue)
  )

Custom functions work seamlessly:

# Calculate weighted average
sales %>%
  group_by(region) %>%
  summarise(
    weighted_avg = sum(revenue * units) / sum(units),
    total_value = sum(revenue * units)
  )

Managing Missing Values

Missing values require explicit handling in summary operations.

# Data with missing values
sales_na <- sales
sales_na$revenue[c(2, 5)] <- NA

# Default behavior - NA propagates
sales_na %>%
  group_by(region) %>%
  summarise(total_revenue = sum(revenue))

# Output:
#   region total_revenue
#   <chr>          <dbl>
# 1 East              NA
# 2 North             NA
# 3 South           4500

# Remove NAs explicitly
sales_na %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue, na.rm = TRUE),
    na_count = sum(is.na(revenue)),
    valid_count = sum(!is.na(revenue))
  )

Combining with Other dplyr Verbs

The real power emerges when chaining operations:

# Filter, group, and summarise
sales %>%
  filter(revenue > 1000) %>%
  group_by(region) %>%
  summarise(
    high_value_revenue = sum(revenue),
    high_value_count = n()
  )

# Add computed columns before grouping
sales %>%
  mutate(revenue_per_unit = revenue / units) %>%
  group_by(region) %>%
  summarise(
    avg_price = mean(revenue_per_unit),
    total_units = sum(units)
  )

Using across() for multiple column operations:

# Summarise multiple columns with same function
sales %>%
  group_by(region) %>%
  summarise(across(c(revenue, units), 
                   list(sum = sum, mean = mean),
                   .names = "{.col}_{.fn}"))

# Output:
#   region revenue_sum revenue_mean units_sum units_mean
#   <chr>        <dbl>        <dbl>     <dbl>      <dbl>
# 1 East          3000         1500        30         15
# 2 North         2500         1250        25       12.5
# 3 South         4500         2250        45       22.5

Performance Considerations

For large datasets, consider these optimization strategies:

# Use data.table backend for better performance
library(dtplyr)

sales_dt <- lazy_dt(sales)

sales_dt %>%
  group_by(region) %>%
  summarise(total_revenue = sum(revenue)) %>%
  as_tibble()

# Pre-filter to reduce data volume
sales %>%
  filter(revenue > 0) %>%  # Remove zero-revenue rows
  group_by(region) %>%
  summarise(total_revenue = sum(revenue))

# Use summarise() instead of multiple mutate() calls
# Bad: Creates intermediate columns
sales %>%
  group_by(region) %>%
  mutate(sum_rev = sum(revenue)) %>%
  mutate(pct = revenue / sum_rev)

# Better: Single summarise for aggregation
sales %>%
  group_by(region) %>%
  mutate(pct = revenue / sum(revenue))

Practical Pattern: Grouped Rankings

A common pattern combines group_by() with window functions:

# Top product per region by revenue
sales %>%
  group_by(region) %>%
  mutate(rank = dense_rank(desc(revenue))) %>%
  filter(rank == 1) %>%
  select(region, product, revenue)

# Cumulative sums within groups
sales %>%
  arrange(region, product) %>%
  group_by(region) %>%
  mutate(cumulative_revenue = cumsum(revenue))

Ungrouping Data

Always ungroup when finished with grouped operations to avoid unexpected behavior:

# Explicit ungrouping
result <- sales %>%
  group_by(region) %>%
  summarise(total_revenue = sum(revenue)) %>%
  ungroup()

# Check grouping status
is.grouped_df(result)  # FALSE

# Grouped operations on ungrouped data
result %>%
  mutate(pct_of_total = total_revenue / sum(total_revenue))

The combination of group_by() and summarise() forms the foundation of data aggregation in dplyr. Master these functions, understand their interaction with other verbs, and you’ll handle most analytical aggregation tasks efficiently.