R dplyr - summarise() with Examples | Application Architect

Key Insights

summarise() reduces data frames to summary statistics by computing aggregate values like means, counts, and sums across groups or entire datasets
Combined with group_by(), it enables split-apply-combine operations that calculate statistics for each category in your data without writing loops
The function returns a new data frame with one row per group (or one row total if ungrouped), making it essential for generating reports and statistical summaries

Basic Summarise Operations

The summarise() function from dplyr condenses data frames into summary statistics. At its core, it takes a data frame and returns a smaller one containing computed aggregate values.

library(dplyr)

# Sample dataset
sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  units = c(10, 15, 20, 12, 18, 9)
)

# Basic summarise - entire dataset
sales %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    total_units = sum(units)
  )

#   total_revenue avg_revenue total_units
# 1           840         140          84

Without grouping, summarise() collapses the entire data frame into a single row. Each argument creates a new column in the output using the specified aggregation function.

Grouping with summarise

The real power emerges when combining summarise() with group_by(). This pattern performs calculations separately for each group.

# Group by product and summarise
product_summary <- sales %>%
  group_by(product) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    count = n()
  )

print(product_summary)

# # A tibble: 3 × 4
#   product total_revenue avg_revenue count
#   <chr>           <dbl>       <dbl> <int>
# 1 A                 390        130      3
# 2 B                 330        165      2
# 3 C                 120        120      1

The n() helper counts the rows in each group. It takes no arguments, which makes it cleaner than calling length() on an arbitrary column.
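When a row count is the only statistic needed, dplyr's count() should give the same numbers as the group_by() plus summarise(n()) pattern. A minimal sketch, reusing the sales data from above (the names via_summarise and via_count are illustrative only):

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90)
)

# Explicit pattern: group, then count rows with n()
via_summarise <- sales %>%
  group_by(product) %>%
  summarise(n = n())

# Shortcut: count() groups, counts, and ungroups in one step
via_count <- sales %>%
  count(product)

# Both hold the same per-product row counts
identical(via_summarise$n, via_count$n)
```

The explicit pattern is still the better fit as soon as other statistics sit next to the count.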

Multiple Grouping Variables

You can group by multiple columns to create hierarchical summaries. By default, the result stays grouped by every level except the last one; the example below passes .groups = "drop" to return an ungrouped result instead.

# Extended dataset with regions
sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2")
)

# Multi-level grouping
regional_summary <- sales_extended %>%
  group_by(region, quarter) %>%
  summarise(
    total_revenue = sum(revenue),
    products_sold = n(),
    .groups = "drop"
  )

print(regional_summary)

# # A tibble: 4 × 4
#   region quarter total_revenue products_sold
#   <chr>  <chr>           <dbl>         <int>
# 1 North  Q1                250             2
# 2 North  Q2                180             1
# 3 South  Q1                200             1
# 4 South  Q2                210             2

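The .groups argument controls grouping in the output. Setting it to "drop" removes all grouping, "keep" preserves the full grouping structure, and "drop_last" (the default when every group produces a single row) removes only the last grouping level.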
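The effect of each .groups setting can be checked with group_vars(), which lists the grouping columns of a result. The sketch below assumes a trimmed copy of the sales_extended data; the object names are illustrative.

```r
library(dplyr)

sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  revenue = c(100, 150, 200, 120, 180, 90),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q2")
)

by_region_quarter <- sales_extended %>%
  group_by(region, quarter)

# "drop_last" peels off the innermost grouping level (quarter)
kept_region <- by_region_quarter %>%
  summarise(total_revenue = sum(revenue), .groups = "drop_last")

# "drop" returns an ungrouped tibble
ungrouped <- by_region_quarter %>%
  summarise(total_revenue = sum(revenue), .groups = "drop")

# "keep" retains both grouping variables
kept_both <- by_region_quarter %>%
  summarise(total_revenue = sum(revenue), .groups = "keep")

group_vars(kept_region)  # "region"
group_vars(ungrouped)    # character(0)
group_vars(kept_both)    # "region" "quarter"
```

Leftover grouping silently changes how later mutate() or summarise() calls behave, so stating .groups explicitly tends to be the safer habit.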

Advanced Aggregation Functions

Beyond basic statistics, summarise() works with any function that returns a single value from a vector.

# Complex aggregations
advanced_summary <- sales_extended %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    median_revenue = median(revenue),
    sd_revenue = sd(revenue),
    min_revenue = min(revenue),
    max_revenue = max(revenue),
    revenue_range = max(revenue) - min(revenue),
    distinct_products = n_distinct(product)
  )

print(advanced_summary)

# # A tibble: 2 × 9
#   region total_revenue avg_revenue median_revenue sd_revenue min_revenue max_revenue revenue_range distinct_products
#   <chr>          <dbl>       <dbl>          <dbl>      <dbl>       <dbl>       <dbl>         <dbl>             <int>
# 1 North            430        143.           150        40.4         100         180            80                 2
# 2 South            410        137.           120        56.9          90         200           110                 2
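Any expression that collapses a group to one value fits the same pattern, including quantiles with a single probability, ratios of aggregates, and dplyr's position helpers first() and last(). A small sketch, assuming a trimmed copy of sales_extended (the column names in the summary are made up for illustration):

```r
library(dplyr)

sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  revenue = c(100, 150, 200, 120, 180, 90)
)

extra_summary <- sales_extended %>%
  group_by(region) %>%
  summarise(
    # One probability in, one value out
    p90_revenue = quantile(revenue, 0.9, names = FALSE),
    # Largest single sale as a share of the group's total
    top_share = max(revenue) / sum(revenue),
    # First and last values in the group's row order
    first_revenue = first(revenue),
    last_revenue = last(revenue)
  )

print(extra_summary)
```

Note that first() and last() depend on row order, so sort with arrange() beforehand if the order carries meaning.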

Summarising Multiple Columns with across

The across() function enables applying the same operation to multiple columns simultaneously.

# Dataset with multiple numeric columns
metrics <- data.frame(
  category = c("X", "Y", "X", "Y"),
  metric1 = c(10, 20, 15, 25),
  metric2 = c(100, 200, 150, 250),
  metric3 = c(5, 10, 7, 12)
)

# Summarise multiple columns
summary_across <- metrics %>%
  group_by(category) %>%
  summarise(
    across(starts_with("metric"), 
           list(mean = mean, sum = sum),
           .names = "{.col}_{.fn}")
  )

print(summary_across)

# # A tibble: 2 × 7
#   category metric1_mean metric1_sum metric2_mean metric2_sum metric3_mean metric3_sum
#   <chr>           <dbl>       <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
# 1 X                12.5          25          125         250          6            12
# 2 Y                22.5          45          225         450         11            22

The across() syntax accepts column selectors (like starts_with(), ends_with(), contains()) and applies functions to matching columns.
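Columns can also be selected by type with where(), and a formula-style lambda passes extra arguments such as na.rm to the function. The following is a sketch on a hypothetical metrics_na data frame, not the article's metrics data:

```r
library(dplyr)

# Hypothetical data: two numeric columns (one with an NA) and a text column
metrics_na <- data.frame(
  category = c("X", "Y", "X", "Y"),
  metric1 = c(10, 20, NA, 25),
  metric2 = c(100, 200, 150, 250),
  label = c("a", "b", "c", "d")
)

numeric_means <- metrics_na %>%
  group_by(category) %>%
  summarise(
    # where(is.numeric) picks metric1 and metric2; label is left out
    across(where(is.numeric),
           ~ mean(.x, na.rm = TRUE),
           .names = "{.col}_mean")
  )

print(numeric_means)
```

Selecting by type keeps the call working when numeric columns are added later, at the cost of being less explicit about which columns are summarised.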

Handling Missing Values

Most aggregation functions return NA as soon as any input value is missing, so missing data requires explicit handling.

# Dataset with NAs
sales_na <- data.frame(
  product = c("A", "B", "A", "C", "B"),
  revenue = c(100, NA, 200, 120, 180),
  units = c(10, 15, NA, 12, 18)
)

# Default behavior - NAs propagate
sales_na %>%
  group_by(product) %>%
  summarise(avg_revenue = mean(revenue))

# # A tibble: 3 × 2
#   product avg_revenue
#   <chr>         <dbl>
# 1 A               150
# 2 B                NA
# 3 C               120

# Remove NAs explicitly
sales_na %>%
  group_by(product) %>%
  summarise(
    avg_revenue = mean(revenue, na.rm = TRUE),
    count_revenue = sum(!is.na(revenue)),
    count_units = sum(!is.na(units))
  )

# # A tibble: 3 × 4
#   product avg_revenue count_revenue count_units
#   <chr>         <dbl>         <int>       <int>
# 1 A               150             2           1
# 2 B               180             1           2
# 3 C               120             1           1
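One edge case is worth checking: when every value in a group is missing, mean(x, na.rm = TRUE) returns NaN rather than NA. The sketch below uses a hypothetical sales_gap data frame to report how much data is missing and to return NA explicitly for all-missing groups.

```r
library(dplyr)

# Hypothetical data: product B has no recorded revenue at all
sales_gap <- data.frame(
  product = c("A", "A", "B"),
  revenue = c(100, NA, NA)
)

gap_summary <- sales_gap %>%
  group_by(product) %>%
  summarise(
    n_missing = sum(is.na(revenue)),       # count of NA values
    share_missing = mean(is.na(revenue)),  # proportion of NA values
    # Plain mean(revenue, na.rm = TRUE) would give NaN for product B
    avg_revenue = if (all(is.na(revenue))) NA_real_ else mean(revenue, na.rm = TRUE)
  )

print(gap_summary)
```

Reporting the share of missing values next to each average makes it easier to judge how far that average can be trusted.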

Creating Custom Summary Functions

You can define custom functions for domain-specific calculations.

# Custom coefficient of variation function
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}

# Apply custom function
sales_extended %>%
  group_by(region) %>%
  summarise(
    mean_revenue = mean(revenue),
    cv_revenue = cv(revenue),
    stability = case_when(
      cv_revenue < 20 ~ "Stable",
      cv_revenue < 40 ~ "Moderate",
      TRUE ~ "Volatile"
    )
  )

# # A tibble: 2 × 4
#   region mean_revenue cv_revenue stability
#   <chr>         <dbl>      <dbl> <chr>    
# 1 North          143.       28.2 Moderate 
# 2 South          137.       41.6 Volatile
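The stability column works because summarise() evaluates its arguments in order, so later expressions can see columns created earlier in the same call. The same rule has a known pitfall: reusing an input column's name hides the original values from later expressions. A short sketch, assuming a trimmed copy of sales_extended:

```r
library(dplyr)

sales_extended <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  revenue = c(100, 150, 200, 120, 180, 90)
)

# Pitfall: revenue is overwritten first, so sd() sees one value per group
overwritten <- sales_extended %>%
  group_by(region) %>%
  summarise(
    revenue = mean(revenue),
    sd_revenue = sd(revenue)  # NA: standard deviation of a single number
  )

# Fix: give the summary its own name so the raw column stays visible
distinct_names <- sales_extended %>%
  group_by(region) %>%
  summarise(
    mean_revenue = mean(revenue),
    sd_revenue = sd(revenue)
  )

print(overwritten)
print(distinct_names)
```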

Summarise vs Mutate

Understanding when to use summarise() versus mutate() is critical. summarise() reduces rows, while mutate() preserves them.

# summarise - reduces to one row per group
sales %>%
  group_by(product) %>%
  summarise(total = sum(revenue))

# # A tibble: 3 × 2
#   product total
#   <chr>   <dbl>
# 1 A         390
# 2 B         330
# 3 C         120

# mutate - keeps all rows, adds summary as new column
sales %>%
  group_by(product) %>%
  mutate(total = sum(revenue))

# # A tibble: 6 × 4
# # Groups:   product [3]
#   product revenue units total
#   <chr>     <dbl> <dbl> <dbl>
# 1 A           100    10   390
# 2 B           150    15   330
# 3 A           200    20   390
# 4 C           120    12   120
# 5 B           180    18   330
# 6 A            90     9   390

Use summarise() for reports and aggregated views. Use mutate() when you need both detail and summary information in the same data frame.
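A common case for the mutate() form is comparing each row with its group total. The sketch below computes every sale's share of its product's revenue; share_of_product is an illustrative column name.

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "C", "B", "A"),
  revenue = c(100, 150, 200, 120, 180, 90)
)

revenue_share <- sales %>%
  group_by(product) %>%
  # sum(revenue) is computed per group and recycled to every row in it
  mutate(share_of_product = revenue / sum(revenue)) %>%
  ungroup()  # drop the grouping once the per-group step is done

print(revenue_share)
```

The result keeps all six rows, and the shares within each product add up to 1.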

Performance Considerations

For large datasets, summarise() with group_by() is usually much faster than looping over groups by hand, and often faster than base R helpers such as aggregate().

# Efficient summarisation
set.seed(42)  # make the simulated data reproducible

# Build the data outside the timer so only the summary is measured
large_data <- data.frame(
  group = sample(LETTERS[1:10], 1000000, replace = TRUE),
  value = rnorm(1000000)
)

system.time({
  result <- large_data %>%
    group_by(group) %>%
    summarise(
      mean_val = mean(value),
      sd_val = sd(value),
      count = n()
    )
})

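dplyr implements grouping and summarisation in compiled code, so grouped operations usually run much faster than equivalent loops written in R. The actual speedup depends on the number of groups and the functions applied, so measure on your own data.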
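One way to check the claim on your own machine is to time a grouped summarise() against a deliberately naive loop and confirm that both give the same answer. This is only a sketch: the group count, the row count, and the loop baseline are arbitrary choices made for illustration, and timings will differ between machines.

```r
library(dplyr)

set.seed(1)
n_rows <- 100000
group_labels <- sprintf("g%04d", 1:1000)  # 1000 hypothetical group labels

large_data <- data.frame(
  group = sample(group_labels, n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

# Grouped summary with dplyr
dplyr_time <- system.time(
  dplyr_means <- large_data %>%
    group_by(group) %>%
    summarise(mean_val = mean(value))
)

# Naive baseline: scan the full data once per group
loop_time <- system.time({
  groups <- sort(unique(large_data$group))
  loop_means <- numeric(length(groups))
  for (i in seq_along(groups)) {
    loop_means[i] <- mean(large_data$value[large_data$group == groups[i]])
  }
})

# Align by group label before comparing the two results
aligned <- loop_means[match(dplyr_means$group, groups)]
all.equal(dplyr_means$mean_val, aligned)

# Elapsed times side by side; no particular ratio is guaranteed
rbind(dplyr = dplyr_time, loop = loop_time)
```

With only a handful of groups the gap can shrink or even reverse, which is why the sketch uses many groups.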