R dplyr - count() and tally()
Key Insights
- `count()` and `tally()` are dplyr's specialized functions for frequency counting: `count()` groups by the specified variables automatically, while `tally()` works on already-grouped data
- `count()` is syntactic sugar that combines `group_by()`, `summarise()`, and `n()` in a single operation, substantially reducing boilerplate
- Both functions support weighting, sorting, and custom naming, making them essential for exploratory data analysis and data validation workflows
Understanding count() and tally()
The dplyr package provides two complementary functions for counting observations: count() and tally(). While both produce frequency counts, they differ in their workflow position. count() performs grouping and counting in one step, while tally() counts observations in pre-grouped data.
library(dplyr)
# Sample dataset
sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  product = c("A", "B", "A", "A", "B", "C"),
  quantity = c(10, 15, 8, 12, 20, 5)
)
# Using count()
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Equivalent using group_by() and summarise()
sales %>%
  group_by(region) %>%
  summarise(n = n())
The count() function eliminates boilerplate code. Instead of chaining group_by() and summarise(), you specify grouping variables directly.
Multiple Grouping Variables
count() accepts multiple variables for hierarchical counting. This creates a frequency table for unique combinations.
# Count by multiple variables
sales %>% count(region, product)
# region product n
# 1 East A 1
# 2 North A 2
# 3 North C 1
# 4 South B 2
# Three-way counting
orders <- data.frame(
  year = c(2023, 2023, 2024, 2024, 2023),
  quarter = c("Q1", "Q2", "Q1", "Q1", "Q1"),
  status = c("Complete", "Pending", "Complete", "Complete", "Complete")
)
orders %>% count(year, quarter, status)
# year quarter status n
# 1 2023 Q1 Complete 2
# 2 2023 Q2 Pending 1
# 3 2024 Q1 Complete 2
Each combination of grouping variables produces a separate count. Combinations with zero observations are omitted from the output by default.
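When zero-count rows matter (for instance, when building a complete frequency table), one option is to store the grouping variables as factors and pass `count()`'s `.drop = FALSE` argument. A minimal sketch, reusing the orders data with factor columns:

```r
library(dplyr)

# Same orders data, but with year and quarter as factors so that
# unobserved level combinations can be retained
orders <- data.frame(
  year = factor(c(2023, 2023, 2024, 2024, 2023)),
  quarter = factor(c("Q1", "Q2", "Q1", "Q1", "Q1"))
)

# .drop = FALSE keeps every combination of factor levels,
# including 2024/Q2, which has zero observations
orders %>% count(year, quarter, .drop = FALSE)
#   year quarter n
# 1 2023      Q1 2
# 2 2023      Q2 1
# 3 2024      Q1 2
# 4 2024      Q2 0
```

For non-factor columns, `tidyr::complete()` offers an alternative way to fill in missing combinations.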
Weighted Counts with wt Parameter
The wt parameter enables weighted counting, summing a variable instead of counting rows. This is crucial for aggregating quantities, revenues, or other metrics.
# Regular count
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Weighted count by quantity
sales %>% count(region, wt = quantity)
# region n
# 1 East 12
# 2 North 23
# 3 South 35
# Weighted count with multiple groups
sales %>% count(region, product, wt = quantity)
# region product n
# 1 East A 12
# 2 North A 18
# 3 North C 5
# 4 South B 35
The weighted count replaces row counting with summation. The column name remains n by default, but represents the sum of the weight variable.
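Under the hood, the weighted form is just a grouped sum. The sketch below confirms the equivalence on the sales data from above (redefined here so the snippet is self-contained):

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# count(..., wt = quantity) ...
weighted <- sales %>% count(region, wt = quantity)

# ... is shorthand for summing the weight variable per group
by_hand <- sales %>%
  group_by(region) %>%
  summarise(n = sum(quantity), .groups = "drop")

all(weighted$n == by_hand$n)  # TRUE
```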
Sorting Results with sort Parameter
The sort parameter orders results by frequency in descending order. This immediately highlights the most common values.
# Unsorted count
sales %>% count(product)
# product n
# 1 A 3
# 2 B 2
# 3 C 1
# Sorted count
sales %>% count(product, sort = TRUE)
# product n
# 1 A 3
# 2 B 2
# 3 C 1
# Sorted weighted count
sales %>% count(product, wt = quantity, sort = TRUE)
# product n
# 1 B 35
# 2 A 30
# 3 C 5
Sorting integrates directly into count(), eliminating the need for a separate arrange() call.
Custom Column Names with name Parameter
By default, the count column is named n. The name parameter customizes this output column name.
# Default name
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Custom name
sales %>% count(region, name = "total_sales")
# region total_sales
# 1 East 1
# 2 North 3
# 3 South 2
# Custom name with weighting
sales %>% count(region, wt = quantity, name = "total_quantity")
# region total_quantity
# 1 East 12
# 2 North 23
# 3 South 35
Custom naming improves code readability, especially when joining count results with other datasets.
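A common pattern is joining a named count back onto the original rows; a descriptive name avoids a cryptic n column after the join. A small sketch (the share column name is purely illustrative):

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Named count: one row per region with its total quantity
region_totals <- sales %>%
  count(region, wt = quantity, name = "total_quantity")

# Join back and compute each row's share of its region's total
sales %>%
  left_join(region_totals, by = "region") %>%
  mutate(share = quantity / total_quantity)
```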
Using tally() with Grouped Data
tally() operates on already-grouped data frames (on ungrouped data it simply returns the total row count). It's typically the final step after group_by(), counting observations per group.
# tally() with grouped data
sales %>%
  group_by(region) %>%
  tally()
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# tally() with weights
sales %>%
  group_by(region) %>%
  tally(wt = quantity)
# region n
# 1 East 12
# 2 North 23
# 3 South 35
# tally() with sorting
sales %>%
  group_by(product) %>%
  tally(wt = quantity, sort = TRUE)
# product n
# 1 B 35
# 2 A 30
# 3 C 5
Use tally() when you’ve already performed grouping operations or need to insert additional transformations before counting.
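For example, you can filter within groups before counting, something a single count() call cannot express. A minimal sketch on the sales data:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Keep only orders of at least 10 units, then tally what survives
sales %>%
  group_by(region) %>%
  filter(quantity >= 10) %>%
  tally()
# East 1, North 1, South 2
```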
Combining count() with add_count()
add_count() adds a count column without collapsing the dataset. This preserves original rows while adding frequency information.
# add_count() preserves all rows
sales %>% add_count(region)
# region product quantity n
# 1 North A 10 3
# 2 South B 15 2
# 3 North A 8 3
# 4 East A 12 1
# 5 South B 20 2
# 6 North C 5 3
# Compare with count()
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# add_count() with multiple variables
sales %>% add_count(region, product, name = "combo_count")
# region product quantity combo_count
# 1 North A 10 2
# 2 South B 15 2
# 3 North A 8 2
# 4 East A 12 1
# 5 South B 20 2
# 6 North C 5 1
add_count() is invaluable for filtering based on frequency or calculating percentages within groups.
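For instance, a within-group percentage takes one add_count() plus one mutate(); the column names here are illustrative:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Attach each region's total, then express every row as a share of it
sales %>%
  add_count(region, wt = quantity, name = "region_total") %>%
  mutate(pct_of_region = quantity / region_total * 100)
```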
Practical Applications
Count functions excel in data validation and exploratory analysis. Here are real-world patterns:
# Find duplicate records
transactions <- data.frame(
  transaction_id = c("T1", "T2", "T3", "T2", "T4"),
  amount = c(100, 200, 150, 200, 300)
)
transactions %>%
  count(transaction_id, sort = TRUE) %>%
  filter(n > 1)
# transaction_id n
# 1 T2 2
# Calculate percentages
sales %>%
  count(region) %>%
  mutate(percentage = n / sum(n) * 100)
# region n percentage
# 1 East 1 16.67
# 2 North 3 50.00
# 3 South 2 33.33
# Identify rare categories
sales %>%
  count(product, wt = quantity, sort = TRUE) %>%
  mutate(cumulative_pct = cumsum(n) / sum(n) * 100)
# product n cumulative_pct
# 1 B 35 38.89
# 2 A 30 72.22
# 3 C 5 100.00
# Filter by frequency threshold
sales %>%
  add_count(region) %>%
  filter(n >= 2)
# region product quantity n
# 1 North A 10 3
# 2 South B 15 2
# 3 North A 8 3
# 4 South B 20 2
# 5 North C 5 3
Performance Considerations
Both functions are optimized for large datasets. dplyr's grouping and counting operations run in compiled code, which typically gives substantial performance benefits over base R alternatives such as table().
# Efficient for large datasets
# count() handles millions of rows efficiently
large_data <- data.frame(
  category = sample(letters[1:10], 1000000, replace = TRUE),
  value = rnorm(1000000)
)
system.time({
  large_data %>% count(category)
})
# user system elapsed
# 0.015 0.000 0.015
# Equivalent base R (slower)
system.time({
  as.data.frame(table(large_data$category))
})
# user system elapsed
# 0.089 0.004 0.093
The performance advantage generally grows with dataset size and the number of grouping variables; for grouped operations over several columns, count() typically outperforms base R alternatives by a wider margin.
When to Use Each Function
Choose count() when starting fresh or when grouping variables are known upfront. Use tally() when working within a pipeline that already includes group_by() or when you need intermediate transformations before counting. Use add_count() when you need frequency information alongside original data for filtering or percentage calculations.
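The three patterns side by side, on the sales data used throughout:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# count(): group and count in one step -> one row per region
sales %>% count(region)

# tally(): count an existing grouping -> same result, grouping explicit
sales %>% group_by(region) %>% tally()

# add_count(): attach counts without collapsing -> still six rows
sales %>% add_count(region)
```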