R dplyr - summarise() with Examples
The `summarise()` function from dplyr condenses data frames into summary statistics. At its core, it takes a data frame and returns a smaller one containing computed aggregate values.
Key Insights
- summarise() reduces data frames to summary statistics by computing aggregate values like means, counts, and sums across groups or entire datasets
- Combined with group_by(), it enables split-apply-combine operations that calculate statistics for each category in your data without writing loops
- The function returns a new data frame with one row per group (or one row total if ungrouped), making it essential for generating reports and statistical summaries
Basic Summarise Operations
library(dplyr)
# Sample dataset
sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  units = c(10, 15, 20, 12, 18, 9)
)
# Basic summarise - entire dataset
sales %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    total_units = sum(units)
  )
# total_revenue avg_revenue total_units
# 1 840 140 84
Without grouping, summarise() collapses the entire data frame into a single row. Each argument creates a new column in the output using the specified aggregation function.
Grouping with summarise
The real power emerges when combining summarise() with group_by(). This pattern performs calculations separately for each group.
# Group by product and summarise
product_summary <- sales %>%
  group_by(product) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    count = n()
  )
print(product_summary)
# # A tibble: 3 × 4
# product total_revenue avg_revenue count
# <chr> <dbl> <dbl> <int>
# 1 A 390 130 3
# 2 B 330 165 2
# 3 C 120 120 1
The n() helper counts the rows in each group. It takes no arguments, so unlike length() it does not need a column to be named.
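As a rough illustration of the row-counting options, the sketch below rebuilds the same sales data and compares n(), length(), and dplyr's count() shorthand. The object names via_n, via_length, and via_count are made up for this example.

```r
library(dplyr)

# Same sales data as in the examples above
sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  units = c(10, 15, 20, 12, 18, 9)
)

# n() inside summarise(): one row count per group
via_n <- sales %>%
  group_by(product) %>%
  summarise(count = n())

# length() gives the same numbers but has to be pointed at a column
via_length <- sales %>%
  group_by(product) %>%
  summarise(count = length(revenue))

# count() is shorthand for group_by() followed by summarise(n = n())
via_count <- sales %>%
  count(product, name = "count")
```

All three objects should hold the same counts per product. count() is convenient when the row count is the only summary needed.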
Multiple Grouping Variables
You can group by multiple columns to create hierarchical summaries. By default, the result keeps every grouping level except the last one; the example below overrides this with .groups = "drop".
# Extended dataset with regions
sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2")
)
# Multi-level grouping
regional_summary <- sales_extended %>%
  group_by(region, quarter) %>%
  summarise(
    total_revenue = sum(revenue),
    products_sold = n(),
    .groups = "drop"
  )
print(regional_summary)
# # A tibble: 4 × 4
# region quarter total_revenue products_sold
# <chr> <chr> <dbl> <int>
# 1 North Q1 250 2
# 2 North Q2 180 1
# 3 South Q1 200 1
# 4 South Q2 210 2
The .groups argument controls grouping in the output. "drop_last" (the default when each group yields one row) removes only the last grouping level, "drop" removes all grouping, and "keep" preserves the full grouping structure.
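To make the effect of .groups visible, the following sketch runs the same grouped summary three times and inspects the remaining grouping with group_vars(). The helper summarise_by_region_quarter() and the result names are hypothetical and exist only for this illustration.

```r
library(dplyr)

sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2")
)

# Hypothetical helper: same summary, different .groups setting
summarise_by_region_quarter <- function(groups_option) {
  sales_extended %>%
    group_by(region, quarter) %>%
    summarise(total_revenue = sum(revenue), .groups = groups_option)
}

last_dropped <- summarise_by_region_quarter("drop_last")
kept <- summarise_by_region_quarter("keep")
dropped <- summarise_by_region_quarter("drop")

group_vars(last_dropped)  # still grouped by region
group_vars(kept)          # still grouped by region and quarter
group_vars(dropped)       # no grouping left
```

Leftover grouping matters because any later mutate() or summarise() call will keep operating per group until ungroup() is called or the grouping is dropped.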
Advanced Aggregation Functions
Beyond basic statistics, summarise() works with any function that returns a single value from a vector.
# Complex aggregations
advanced_summary <- sales_extended %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    median_revenue = median(revenue),
    sd_revenue = sd(revenue),
    min_revenue = min(revenue),
    max_revenue = max(revenue),
    revenue_range = max(revenue) - min(revenue),
    distinct_products = n_distinct(product)
  )
print(advanced_summary)
# # A tibble: 2 × 9
# region total_revenue avg_revenue median_revenue sd_revenue min_revenue max_revenue revenue_range distinct_products
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 North 430 143. 150 40.4 100 180 80 2
# 2 South 410 137. 120 56.9 90 200 110 2
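Any expression that yields exactly one value per group can be used, not only the named statistics above. The sketch below is one possible illustration: it adds a 75th percentile and looks up the product behind the largest revenue in each region. The column names p75_revenue and top_product are invented for this example.

```r
library(dplyr)

sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2")
)

extra_summary <- sales_extended %>%
  group_by(region) %>%
  summarise(
    # names = FALSE keeps quantile() from returning a named vector
    p75_revenue = quantile(revenue, 0.75, names = FALSE),
    # Index one column by a position computed from another column
    top_product = product[which.max(revenue)]
  )
```

which.max() returns the position of the first maximum, so ties resolve to the earliest row within the group.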
Summarising Multiple Columns with across
The across() function enables applying the same operation to multiple columns simultaneously.
# Dataset with multiple numeric columns
metrics <- data.frame(
  category = c("X", "Y", "X", "Y"),
  metric1 = c(10, 20, 15, 25),
  metric2 = c(100, 200, 150, 250),
  metric3 = c(5, 10, 7, 12)
)
# Summarise multiple columns
summary_across <- metrics %>%
  group_by(category) %>%
  summarise(
    across(starts_with("metric"),
           list(mean = mean, sum = sum),
           .names = "{.col}_{.fn}")
  )
print(summary_across)
# # A tibble: 2 × 7
# category metric1_mean metric1_sum metric2_mean metric2_sum metric3_mean metric3_sum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 X 12.5 25 125 250 6 12
# 2 Y 22.5 45 225 450 11 22
The across() syntax accepts column selectors (like starts_with(), ends_with(), contains()) and applies functions to matching columns.
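Columns can also be selected by type rather than by name. As a small sketch, the example below uses where(is.numeric) together with an anonymous function so that extra arguments such as na.rm can be passed along. The result name numeric_means is made up for this illustration.

```r
library(dplyr)

metrics <- data.frame(
  category = c("X", "Y", "X", "Y"),
  metric1 = c(10, 20, 15, 25),
  metric2 = c(100, 200, 150, 250),
  metric3 = c(5, 10, 7, 12)
)

numeric_means <- metrics %>%
  group_by(category) %>%
  summarise(
    # where() selects columns by a predicate; the grouping column is skipped
    across(where(is.numeric), function(x) mean(x, na.rm = TRUE))
  )
```

With a single function and no .names argument, the output columns keep their original names.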
Handling Missing Values
Most aggregation functions, including sum(), mean(), and sd(), return NA when any input value is NA, so missing data requires explicit handling.
# Dataset with NAs
sales_na <- data.frame(
  product = c("A", "B", "A", "C", "B"),
  revenue = c(100, NA, 200, 120, 180),
  units = c(10, 15, NA, 12, 18)
)
# Default behavior - NAs propagate
sales_na %>%
  group_by(product) %>%
  summarise(avg_revenue = mean(revenue))
# # A tibble: 3 × 2
# product avg_revenue
# <chr> <dbl>
# 1 A 150
# 2 B NA
# 3 C 120
# Remove NAs explicitly
sales_na %>%
  group_by(product) %>%
  summarise(
    avg_revenue = mean(revenue, na.rm = TRUE),
    count_revenue = sum(!is.na(revenue)),
    count_units = sum(!is.na(units))
  )
# # A tibble: 3 × 4
# product avg_revenue count_revenue count_units
# <chr> <dbl> <int> <int>
# 1 A 150 2 1
# 2 B 180 1 2
# 3 C 120 1 1
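One edge case is worth a quick sketch: when every value in a group is missing, mean(x, na.rm = TRUE) returns NaN rather than NA. The data frame all_na_sales and the column names below are hypothetical and serve only to show one way of guarding against that.

```r
library(dplyr)

# Hypothetical data: every revenue value for product B is missing
all_na_sales <- data.frame(
  product = c("A", "A", "B"),
  revenue = c(100, 200, NA)
)

guarded <- all_na_sales %>%
  group_by(product) %>%
  summarise(
    n_valid = sum(!is.na(revenue)),
    # Unguarded: NaN when no non-missing values remain
    avg_revenue = mean(revenue, na.rm = TRUE),
    # Guarded: fall back to NA when the group has nothing to average
    avg_or_na = if (n_valid > 0) mean(revenue, na.rm = TRUE) else NA_real_
  )
```

Reporting the count of valid values next to each average makes it easier to see how much data a figure actually rests on.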
Creating Custom Summary Functions
You can define custom functions for domain-specific calculations.
# Custom coefficient of variation function
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}
# Apply custom function
sales_extended %>%
  group_by(region) %>%
  summarise(
    mean_revenue = mean(revenue),
    cv_revenue = cv(revenue),
    stability = case_when(
      cv_revenue < 20 ~ "Stable",
      cv_revenue < 40 ~ "Moderate",
      TRUE ~ "Volatile"
    )
  )
# # A tibble: 2 × 4
# region mean_revenue cv_revenue stability
# <chr> <dbl> <dbl> <chr>
# 1 North 143. 28.2 Moderate
# 2 South 137. 41.6 Volatile
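A custom summary like cv() deserves a check for small groups, because sd() returns NA for a single observation and the catch-all TRUE branch of case_when() would then label that group "Volatile". The sketch below applies the same cv() definition to the per-product sales data and adds an explicit NA branch. The "Insufficient data" label and the name product_stability are assumptions made for this example.

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  units = c(10, 15, 20, 12, 18, 9)
)

# Same coefficient of variation helper as above
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}

product_stability <- sales %>%
  group_by(product) %>%
  summarise(
    cv_revenue = cv(revenue),
    stability = case_when(
      is.na(cv_revenue) ~ "Insufficient data",  # e.g. product C has one row
      cv_revenue < 20 ~ "Stable",
      cv_revenue < 40 ~ "Moderate",
      TRUE ~ "Volatile"
    )
  )
```

Putting the is.na() branch first matters, since case_when() uses the first condition that evaluates to TRUE.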
Summarise vs Mutate
Understanding when to use summarise() versus mutate() is critical. summarise() reduces rows, while mutate() preserves them.
# summarise - reduces to one row per group
sales %>%
  group_by(product) %>%
  summarise(total = sum(revenue))
# # A tibble: 3 × 2
# product total
# <chr> <dbl>
# 1 A 390
# 2 B 330
# 3 C 120
# mutate - keeps all rows, adds summary as new column
sales %>%
  group_by(product) %>%
  mutate(total = sum(revenue))
# # A tibble: 6 × 4
# # Groups: product [3]
# product revenue units total
# <chr> <dbl> <dbl> <dbl>
# 1 A 100 10 390
# 2 B 150 15 330
# 3 A 200 20 390
# 4 C 120 12 120
# 5 B 180 18 330
# 6 A 90 9 390
Use summarise() for reports and aggregated views. Use mutate() when you need both detail and summary information in the same data frame.
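A typical case for mutate() is comparing each row with its own group total. The following sketch computes each sale's share of its product's revenue. The column name share_of_product and the result name with_share are invented for this illustration.

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  units = c(10, 15, 20, 12, 18, 9)
)

with_share <- sales %>%
  group_by(product) %>%
  # sum(revenue) is evaluated per group and recycled across the group's rows
  mutate(share_of_product = revenue / sum(revenue)) %>%
  ungroup()  # drop grouping so later steps work on the whole table
```

Every original row is kept, and the shares within each product add up to 1.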
Performance Considerations
For large datasets, a grouped summarise() is usually much faster than a hand-written loop over the groups, although the size of the gain depends on the data and on the number of groups.
# Efficient summarisation
# Build the test data first so that only the summarise step is timed
set.seed(42)  # reproducible sample
large_data <- data.frame(
  group = sample(LETTERS[1:10], 1000000, replace = TRUE),
  value = rnorm(1000000)
)
system.time({
  result <- large_data %>%
    group_by(group) %>%
    summarise(
      mean_val = mean(value),
      sd_val = sd(value),
      count = n()
    )
})
dplyr handles grouping in compiled code, which is why grouped summaries typically beat hand-written R loops. Vectorised base R tools such as tapply() and aggregate() can be competitive, so benchmark on your own data before assuming a speed-up.
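One way to check both speed and correctness is to time a base R equivalent on the same data and confirm that the numbers agree. The sketch below does this with tapply(). The sample size of 100,000 rows and the variable names are arbitrary choices for this illustration, and timings will vary by machine and package version, so no expected timings are shown.

```r
library(dplyr)

set.seed(42)  # reproducible sample
n_rows <- 100000  # arbitrary size chosen for this sketch
large_data <- data.frame(
  group = sample(LETTERS[1:10], n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

# Time the dplyr version
dplyr_time <- system.time(
  dplyr_result <- large_data %>%
    group_by(group) %>%
    summarise(mean_val = mean(value))
)

# Time a base R equivalent on the same data
base_time <- system.time(
  base_result <- tapply(large_data$value, large_data$group, mean)
)

# Both approaches return groups in alphabetical order, so the values line up
same_values <- isTRUE(all.equal(dplyr_result$mean_val, as.numeric(base_result)))
```

Comparing dplyr_time["elapsed"] with base_time["elapsed"] over several runs gives a fairer picture than a single measurement.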