R dplyr - group_by() and summarise()
The `group_by()` function transforms a regular data frame into a grouped tibble, which subsequent operations treat as separate partitions. This grouping is metadata—the physical data structure...
Key Insights
- group_by() partitions data into groups based on one or more variables, enabling grouped operations without splitting your data frame into separate objects
- summarise() reduces grouped data to summary statistics, collapsing each group into a single row with computed aggregations
- Combining these functions creates a powerful pattern for data aggregation that handles grouped calculations, missing values, and multiple summary statistics efficiently
Understanding the group_by() Mechanism
The group_by() function transforms a regular data frame into a grouped tibble, which subsequent operations treat as separate partitions. This grouping is metadata—the physical data structure remains intact.
library(dplyr)
# Sample sales data
sales <- data.frame(
region = c("North", "North", "South", "South", "East", "East"),
product = c("A", "B", "A", "B", "A", "B"),
revenue = c(1000, 1500, 2000, 2500, 1200, 1800),
units = c(10, 15, 20, 25, 12, 18)
)
# Create grouped data
grouped_sales <- sales %>% group_by(region)
# Check the structure
class(grouped_sales)
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
# View grouping metadata
group_vars(grouped_sales)
# [1] "region"
The grouped object looks identical when printed, but operations now respect the grouping structure. You can group by multiple variables, creating nested partitions.
# Multiple grouping variables
multi_grouped <- sales %>%
group_by(region, product)
# Shows 6 groups (3 regions × 2 products)
n_groups(multi_grouped)
# [1] 6
Basic summarise() Operations
The summarise() function (or summarize() with American spelling) computes summary statistics for each group. Each summary reduces the group to a single value.
# Single summary statistic
sales %>%
group_by(region) %>%
summarise(total_revenue = sum(revenue))
# Output:
# region total_revenue
# <chr> <dbl>
# 1 East 3000
# 2 North 2500
# 3 South 4500
Multiple summary statistics in one call:
sales %>%
group_by(region) %>%
summarise(
total_revenue = sum(revenue),
avg_revenue = mean(revenue),
total_units = sum(units),
n_products = n()
)
# Output:
# region total_revenue avg_revenue total_units n_products
# <chr> <dbl> <dbl> <dbl> <int>
# 1 East 3000 1500 30 2
# 2 North 2500 1250 25 2
# 3 South 4500 2250 45 2
The n() function counts rows per group—a common pattern for frequency analysis.
Handling Multiple Grouping Variables
When grouping by multiple variables, summarise() progressively “peels off” the rightmost grouping variable by default.
sales %>%
group_by(region, product) %>%
summarise(total_revenue = sum(revenue))
# Output with grouping message:
# `summarise()` has grouped output by 'region'.
# region product total_revenue
# <chr> <chr> <dbl>
# 1 East A 1200
# 2 East B 1800
# 3 North A 1000
# 4 North B 1500
# 5 South A 2000
# 6 South B 2500
Control this behavior with .groups argument:
# Drop all grouping
sales %>%
group_by(region, product) %>%
summarise(total_revenue = sum(revenue), .groups = "drop")
# Keep all grouping
sales %>%
group_by(region, product) %>%
summarise(total_revenue = sum(revenue), .groups = "keep")
# Drop last grouping (default)
sales %>%
group_by(region, product) %>%
summarise(total_revenue = sum(revenue), .groups = "drop_last")
Advanced Summary Functions
Beyond basic aggregations, summarise() works with any function that returns a single value per group.
# Quantile calculations
sales %>%
group_by(region) %>%
summarise(
median_revenue = median(revenue),
q25 = quantile(revenue, 0.25),
q75 = quantile(revenue, 0.75),
iqr = IQR(revenue)
)
# Statistical measures
sales %>%
group_by(region) %>%
summarise(
mean_rev = mean(revenue),
sd_rev = sd(revenue),
cv = sd(revenue) / mean(revenue), # Coefficient of variation
min_rev = min(revenue),
max_rev = max(revenue),
range = max(revenue) - min(revenue)
)
Custom functions work seamlessly:
# Calculate weighted average
sales %>%
group_by(region) %>%
summarise(
weighted_avg = sum(revenue * units) / sum(units),
total_value = sum(revenue * units)
)
Managing Missing Values
Missing values require explicit handling in summary operations.
# Data with missing values
sales_na <- sales
sales_na$revenue[c(2, 5)] <- NA
# Default behavior - NA propagates
sales_na %>%
group_by(region) %>%
summarise(total_revenue = sum(revenue))
# Output:
# region total_revenue
# <chr> <dbl>
# 1 East NA
# 2 North NA
# 3 South 4500
# Remove NAs explicitly
sales_na %>%
group_by(region) %>%
summarise(
total_revenue = sum(revenue, na.rm = TRUE),
na_count = sum(is.na(revenue)),
valid_count = sum(!is.na(revenue))
)
Combining with Other dplyr Verbs
The real power emerges when chaining operations:
# Filter, group, and summarise
sales %>%
filter(revenue > 1000) %>%
group_by(region) %>%
summarise(
high_value_revenue = sum(revenue),
high_value_count = n()
)
# Add computed columns before grouping
sales %>%
mutate(revenue_per_unit = revenue / units) %>%
group_by(region) %>%
summarise(
avg_price = mean(revenue_per_unit),
total_units = sum(units)
)
Using across() for multiple column operations:
# Summarise multiple columns with same function
sales %>%
group_by(region) %>%
summarise(across(c(revenue, units),
list(sum = sum, mean = mean),
.names = "{.col}_{.fn}"))
# Output:
# region revenue_sum revenue_mean units_sum units_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 East 3000 1500 30 15
# 2 North 2500 1250 25 12.5
# 3 South 4500 2250 45 22.5
Performance Considerations
For large datasets, consider these optimization strategies:
# Use data.table backend for better performance
library(dtplyr)
sales_dt <- lazy_dt(sales)
sales_dt %>%
group_by(region) %>%
summarise(total_revenue = sum(revenue)) %>%
as_tibble()
# Pre-filter to reduce data volume
sales %>%
filter(revenue > 0) %>% # Remove zero-revenue rows
group_by(region) %>%
summarise(total_revenue = sum(revenue))
# Use summarise() instead of multiple mutate() calls
# Bad: Creates intermediate columns
sales %>%
group_by(region) %>%
mutate(sum_rev = sum(revenue)) %>%
mutate(pct = revenue / sum_rev)
# Better: Single summarise for aggregation
sales %>%
group_by(region) %>%
mutate(pct = revenue / sum(revenue))
Practical Pattern: Grouped Rankings
A common pattern combines group_by() with window functions:
# Top product per region by revenue
sales %>%
group_by(region) %>%
mutate(rank = dense_rank(desc(revenue))) %>%
filter(rank == 1) %>%
select(region, product, revenue)
# Cumulative sums within groups
sales %>%
arrange(region, product) %>%
group_by(region) %>%
mutate(cumulative_revenue = cumsum(revenue))
Ungrouping Data
Always ungroup when finished with grouped operations to avoid unexpected behavior:
# Explicit ungrouping
result <- sales %>%
group_by(region) %>%
summarise(total_revenue = sum(revenue)) %>%
ungroup()
# Check grouping status
is.grouped_df(result) # FALSE
# Grouped operations on ungrouped data
result %>%
mutate(pct_of_total = total_revenue / sum(total_revenue))
The combination of group_by() and summarise() forms the foundation of data aggregation in dplyr. Master these functions, understand their interaction with other verbs, and you’ll handle most analytical aggregation tasks efficiently.