R - aggregate() Function
Key Insights
• The aggregate() function provides a straightforward approach to split-apply-combine operations, computing summary statistics across grouped data without external dependencies
• Understanding the formula interface (value ~ grouping) versus the data frame method enables flexible aggregation strategies for different data structures
• Multiple grouping variables and custom functions extend aggregate() beyond basic summaries, making it suitable for complex analytical workflows
Basic Syntax and Data Preparation
The aggregate() function computes summary statistics by splitting data into subsets, applying a function to each subset, and combining results. It operates on data frames and formulas with minimal setup.
# Create sample sales data
sales <- data.frame(
region = c("North", "South", "North", "East", "South", "East", "North", "East"),
product = c("A", "A", "B", "A", "B", "B", "A", "B"),
revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
units = c(30, 40, 25, 28, 35, 22, 38, 24)
)
# Basic aggregation - mean revenue by region
aggregate(revenue ~ region, data = sales, FUN = mean)
Output:
region revenue
1 East 1033.333
2 North 1166.667
3 South 1400.000
The formula interface (y ~ x) specifies the variable to aggregate (left side) and the grouping variable (right side). FUN is applied to each group's values and usually returns a single value per group; functions that return a fixed-length vector also work.
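The left side of the formula can also be a dot, which stands for every column not named on the right side. A minimal sketch, assuming the same sales data and a subset called sales_num (an illustrative name) that holds only the grouping column and the numeric columns:

```r
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East", "North", "East"),
  product = c("A", "A", "B", "A", "B", "B", "A", "B"),
  revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
  units   = c(30, 40, 25, 28, 35, 22, 38, 24)
)

# Keep only the grouping column and the numeric columns to summarise
# (sales_num is a hypothetical helper name, not part of the original data)
sales_num <- sales[, c("region", "revenue", "units")]

# "." expands to all columns that do not appear on the right-hand side
dot_result <- aggregate(. ~ region, data = sales_num, FUN = mean)
print(dot_result)
```

Dropping the product column first matters here: mean() on a character column would only produce NA values and warnings.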
Multiple Grouping Variables
Combine multiple grouping variables using the + operator in formulas. This creates hierarchical groupings for more granular analysis.
# Aggregate by region AND product
aggregate(revenue ~ region + product, data = sales, FUN = sum)
Output:
region product revenue
1 East A 1100
2 North A 2600
3 South A 1500
4 East B 2000
5 North B 900
6 South B 1300
For aggregating multiple response variables simultaneously:
# Aggregate both revenue and units
aggregate(cbind(revenue, units) ~ region, data = sales, FUN = mean)
Output:
region revenue units
1 East 1033.333 24.66667
2 North 1166.667 31.00000
3 South 1400.000 37.50000
The cbind() function binds multiple columns into a matrix on the left side of the formula.
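Because every response in cbind() becomes an ordinary column of the result, derived metrics can be computed afterwards with plain column arithmetic. A small sketch on the same sales data; the revenue_per_unit column is an illustrative addition:

```r
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East", "North", "East"),
  product = c("A", "A", "B", "A", "B", "B", "A", "B"),
  revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
  units   = c(30, 40, 25, 28, 35, 22, 38, 24)
)

# Sum both response variables per region in one call
totals <- aggregate(cbind(revenue, units) ~ region, data = sales, FUN = sum)

# Derived metric from the aggregated columns (hypothetical column name)
totals$revenue_per_unit <- totals$revenue / totals$units
print(totals)
```

Dividing the summed columns gives a ratio of totals, which generally differs from averaging per-row ratios.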
Data Frame Method
When working with non-formula syntax, aggregate() accepts a data frame and grouping list directly. This method offers more control over column selection.
# Select columns to aggregate
agg_data <- sales[, c("revenue", "units")]
# Define grouping
grouping <- list(region = sales$region)
# Aggregate using data frame method
aggregate(agg_data, by = grouping, FUN = sum)
Output:
region revenue units
1 East 3100 74
2 North 3500 93
3 South 2800 75
This approach proves valuable when grouping variables aren’t part of the aggregated data frame or when programmatically selecting columns.
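Programmatic column selection might look like the sketch below. It assumes the same sales data; value_cols and group_cols are illustrative names, and the choice of grouping columns is arbitrary.

```r
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East", "North", "East"),
  product = c("A", "A", "B", "A", "B", "B", "A", "B"),
  revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
  units   = c(30, 40, 25, 28, 35, 22, 38, 24)
)

# Pick every numeric column without naming it explicitly
value_cols <- names(sales)[sapply(sales, is.numeric)]

# Grouping columns held in a character vector (hypothetical choice)
group_cols <- c("region", "product")

# A data frame is a list, so sales[group_cols] can be passed to `by` directly
by_group <- aggregate(sales[value_cols], by = sales[group_cols], FUN = sum)
print(by_group)
```

Since both column sets are plain character vectors, the same call can be wrapped in a function and reused across data frames.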
Custom Aggregation Functions
Beyond built-in functions like mean() and sum(), aggregate() accepts custom functions. The function must return a single value or vector of consistent length.
# Calculate coefficient of variation
cv <- function(x) {
sd(x) / mean(x) * 100
}
aggregate(revenue ~ region, data = sales, FUN = cv)
For functions requiring additional arguments:
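# Trimmed mean: drops 10% of observations from each end before averaging
# (with only 2-3 values per group here, nothing is actually trimmed)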
aggregate(revenue ~ region, data = sales,
FUN = function(x) mean(x, trim = 0.1))
# Multiple statistics using custom function
multi_stats <- function(x) {
c(mean = mean(x),
median = median(x),
sd = sd(x))
}
result <- aggregate(revenue ~ region, data = sales, FUN = multi_stats)
# Result contains matrix columns - extract with do.call
do.call(data.frame, result)
Output:
region revenue.mean revenue.median revenue.sd
1 East 1033.33 1050.0 76.38
2 North 1166.67 1200.0 251.66
3 South 1400.00 1400.0 141.42
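The reason for the do.call() step is easier to see by inspecting the intermediate result. A short sketch on the same sales data, showing that a vector-valued FUN yields one matrix column rather than three separate columns:

```r
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East", "North", "East"),
  product = c("A", "A", "B", "A", "B", "B", "A", "B"),
  revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
  units   = c(30, 40, 25, 28, 35, 22, 38, 24)
)

multi_stats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
result <- aggregate(revenue ~ region, data = sales, FUN = multi_stats)

# The data frame has only two columns: region and a matrix named revenue
print(dim(result))
print(is.matrix(result$revenue))
print(colnames(result$revenue))

# Individual statistics are reached by indexing into the matrix column
print(result$revenue[, "sd"])

# Flattening turns each matrix column into a regular data frame column
flat <- do.call(data.frame, result)
print(names(flat))
```

Skipping the flattening step is a common source of confusion, because result$revenue.mean does not exist until the matrix has been expanded.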
Handling Missing Values
Control NA handling through the na.action parameter or within custom functions.
# Data with missing values
sales_na <- sales
sales_na$revenue[c(2, 5)] <- NA
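# Default behavior (na.action = na.omit): rows with NA are dropped before grouping.
# Both South rows are NA here, so South disappears from the result entirely.
aggregate(revenue ~ region, data = sales_na, FUN = mean)
# Handle NAs in the function: na.pass keeps the NA rows so FUN actually sees them
# (South yields NaN because it has no non-NA values)
aggregate(revenue ~ region, data = sales_na,
FUN = function(x) mean(x, na.rm = TRUE), na.action = na.pass)
# Count non-NA values per group (South reports 0)
aggregate(revenue ~ region, data = sales_na,
FUN = function(x) sum(!is.na(x)), na.action = na.pass)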
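The na.action parameter of the formula method accepts functions like na.omit (default) or na.pass to control row-level NA handling before grouping. With na.omit, FUN never sees the NA values, so na.rm = TRUE inside FUN only has an effect when combined with na.action = na.pass. The data frame method has no na.action argument and passes NA values in the aggregated columns straight to FUN.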
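Row-level removal has a side effect when several responses are aggregated together: an NA in one column discards that row for every column. The sketch below assumes the same sales data with a single missing revenue value; omit_result and pass_result are illustrative names.

```r
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East", "North", "East"),
  product = c("A", "A", "B", "A", "B", "B", "A", "B"),
  revenue = c(1200, 1500, 900, 1100, 1300, 950, 1400, 1050),
  units   = c(30, 40, 25, 28, 35, 22, 38, 24)
)

sales_na <- sales
sales_na$revenue[2] <- NA   # one South row is missing revenue only

# Default na.omit: row 2 is dropped as a whole, so its units are lost as well
omit_result <- aggregate(cbind(revenue, units) ~ region,
                         data = sales_na, FUN = sum)

# na.pass keeps the row; FUN then skips NAs column by column
pass_result <- aggregate(cbind(revenue, units) ~ region,
                         data = sales_na,
                         FUN = function(x) sum(x, na.rm = TRUE),
                         na.action = na.pass)

print(omit_result)
print(pass_result)
```

The South unit totals differ between the two results even though no units value is missing, which is worth checking for whenever columns have different missingness patterns.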
Performance Considerations and Alternatives
While aggregate() handles most scenarios efficiently, understanding its limitations guides tool selection.
# Timing comparison with larger dataset
set.seed(123)
large_data <- data.frame(
group = sample(LETTERS[1:10], 100000, replace = TRUE),
value = rnorm(100000)
)
# aggregate() approach
system.time({
agg_result <- aggregate(value ~ group, data = large_data, FUN = mean)
})
# tapply() alternative
system.time({
tapply_result <- tapply(large_data$value, large_data$group, mean)
})
# data.table alternative (runs only if the package is installed)
if (requireNamespace("data.table", quietly = TRUE)) {
library(data.table)
dt <- as.data.table(large_data)
print(system.time({
dt_result <- dt[, .(mean_value = mean(value)), by = group]
}))
}
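As a rough guide, once data grows well beyond 100,000 rows or the transformations become complex, data.table or dplyr are usually the better choice. For a single vector and a single grouping factor, tapply() is typically faster because it skips the formula and data frame overhead, though it returns a named array rather than a data frame.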
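When switching between aggregate() and tapply(), it helps to confirm that both produce the same numbers, since the return types differ. A minimal sketch using base R only; small_data and tap_df are illustrative names and the data is simulated:

```r
set.seed(123)
small_data <- data.frame(
  group = sample(LETTERS[1:10], 1000, replace = TRUE),
  value = rnorm(1000)
)

# aggregate() returns a data frame with one row per group
agg_result <- aggregate(value ~ group, data = small_data, FUN = mean)

# tapply() returns a named array; convert it to a comparable data frame
tap <- tapply(small_data$value, small_data$group, mean)
tap_df <- data.frame(group = names(tap), value = as.vector(tap))

# Both approaches order groups by sorted level, so the values line up
print(all.equal(agg_result$value, tap_df$value))
```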
Practical Applications
Sales Analysis Pipeline
# Multi-level aggregation workflow
sales_extended <- data.frame(
date = as.Date(c("2024-01-15", "2024-01-20", "2024-02-10",
"2024-02-15", "2024-01-18", "2024-02-12")),
region = c("North", "South", "North", "South", "North", "South"),
revenue = c(5000, 6000, 5500, 6200, 4800, 5900)
)
# Add month column
sales_extended$month <- format(sales_extended$date, "%Y-%m")
# Aggregate by month and region
monthly <- aggregate(revenue ~ month + region,
data = sales_extended, FUN = sum)
# Calculate regional contribution
monthly$total <- ave(monthly$revenue, monthly$month, FUN = sum)
monthly$pct_contribution <- round(monthly$revenue / monthly$total * 100, 1)
print(monthly)
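The long-format result can also be pivoted into a month-by-region table for reporting. One possible approach, sketched here with base R's xtabs() and prop.table() rather than anything the pipeline above requires:

```r
sales_extended <- data.frame(
  date = as.Date(c("2024-01-15", "2024-01-20", "2024-02-10",
                   "2024-02-15", "2024-01-18", "2024-02-12")),
  region = c("North", "South", "North", "South", "North", "South"),
  revenue = c(5000, 6000, 5500, 6200, 4800, 5900)
)
sales_extended$month <- format(sales_extended$date, "%Y-%m")

monthly <- aggregate(revenue ~ month + region,
                     data = sales_extended, FUN = sum)

# Pivot to a wide table: one row per month, one column per region
wide <- xtabs(revenue ~ month + region, data = monthly)
print(wide)

# Row-wise shares reproduce the regional contribution as percentages
share <- round(prop.table(wide, margin = 1) * 100, 1)
print(share)
```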
Quality Control Metrics
# Manufacturing defect analysis (simulated data; seed fixed for reproducibility)
set.seed(42)
defects <- data.frame(
shift = rep(c("Morning", "Evening", "Night"), each = 20),
line = rep(1:3, 20),
defect_rate = runif(60, 0, 5)
)
# Aggregate with multiple statistics
qc_summary <- aggregate(defect_rate ~ shift + line,
data = defects,
FUN = function(x) {
c(mean = mean(x),
max = max(x),
violations = sum(x > 3))
})
# Format output
qc_formatted <- do.call(data.frame, qc_summary)
names(qc_formatted) <- c("shift", "line", "avg_rate", "max_rate", "violations")
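Once flattened, the summary behaves like any other data frame, so the shift and line combinations with the most violations can be ranked with order(). A sketch on simulated data; the seed and the ranked name are illustrative assumptions:

```r
set.seed(42)  # arbitrary seed so the simulated rates are reproducible
defects <- data.frame(
  shift = rep(c("Morning", "Evening", "Night"), each = 20),
  line = rep(1:3, 20),
  defect_rate = runif(60, 0, 5)
)

qc_summary <- aggregate(defect_rate ~ shift + line,
                        data = defects,
                        FUN = function(x) {
                          c(mean = mean(x),
                            max = max(x),
                            violations = sum(x > 3))
                        })
qc_formatted <- do.call(data.frame, qc_summary)
names(qc_formatted) <- c("shift", "line", "avg_rate", "max_rate", "violations")

# Rank by violation count, breaking ties by the higher average rate
ranked <- qc_formatted[order(-qc_formatted$violations,
                             -qc_formatted$avg_rate), ]
print(head(ranked, 3))
```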
The aggregate() function remains essential for quick data summarization in R. Its formula interface provides clarity, while the data frame method offers flexibility. For production pipelines processing millions of rows, transition to specialized packages, but for exploratory analysis and moderate datasets, aggregate() delivers reliable performance with minimal code.