R dplyr - count() and tally()
Key Insights
- `count()` and `tally()` are dplyr's specialized functions for frequency counting: `count()` groups by the specified variables automatically, while `tally()` works on already-grouped data
- `count()` is syntactic sugar that combines `group_by()`, `summarise()`, and `n()` in a single operation, substantially reducing boilerplate
- Both functions support weighting, sorting, and custom naming, making them essential for exploratory data analysis and data validation workflows
Understanding count() and tally()
The dplyr package provides two complementary functions for counting observations: count() and tally(). While both produce frequency counts, they differ in their workflow position. count() performs grouping and counting in one step, while tally() counts observations in pre-grouped data.
library(dplyr)
# Sample dataset
sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  product = c("A", "B", "A", "A", "B", "C"),
  quantity = c(10, 15, 8, 12, 20, 5)
)
# Using count()
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Equivalent using group_by() and summarise()
sales %>%
  group_by(region) %>%
  summarise(n = n())
The count() function eliminates boilerplate code. Instead of chaining group_by() and summarise(), you specify grouping variables directly.
Multiple Grouping Variables
count() accepts multiple variables for hierarchical counting. This creates a frequency table for unique combinations.
# Count by multiple variables
sales %>% count(region, product)
# region product n
# 1 East A 1
# 2 North A 2
# 3 North C 1
# 4 South B 2
# Three-way counting
orders <- data.frame(
  year = c(2023, 2023, 2024, 2024, 2023),
  quarter = c("Q1", "Q2", "Q1", "Q1", "Q1"),
  status = c("Complete", "Pending", "Complete", "Complete", "Complete")
)
orders %>% count(year, quarter, status)
# year quarter status n
# 1 2023 Q1 Complete 2
# 2 2023 Q2 Pending 1
# 3 2024 Q1 Complete 2
Each combination of grouping variables produces a separate count. Combinations with zero observations are omitted from the output by default.
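When zero-count rows matter (for instance, when building a complete frequency table), one option is to store the grouping variables as factors and pass `count()`'s `.drop = FALSE` argument. A minimal sketch, reusing the orders data with factor columns:

```r
library(dplyr)

# Same orders data, but with year and quarter as factors so that
# unobserved level combinations can be retained
orders <- data.frame(
  year = factor(c(2023, 2023, 2024, 2024, 2023)),
  quarter = factor(c("Q1", "Q2", "Q1", "Q1", "Q1"))
)

# .drop = FALSE keeps every combination of factor levels,
# including 2024/Q2, which has zero observations
orders %>% count(year, quarter, .drop = FALSE)
#   year quarter n
# 1 2023      Q1 2
# 2 2023      Q2 1
# 3 2024      Q1 2
# 4 2024      Q2 0
```

For non-factor columns, `tidyr::complete()` offers an alternative way to fill in missing combinations.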
Weighted Counts with wt Parameter
The wt parameter enables weighted counting, summing a variable instead of counting rows. This is crucial for aggregating quantities, revenues, or other metrics.
# Regular count
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Weighted count by quantity
sales %>% count(region, wt = quantity)
# region n
# 1 East 12
# 2 North 23
# 3 South 35
# Weighted count with multiple groups
sales %>% count(region, product, wt = quantity)
# region product n
# 1 East A 12
# 2 North A 18
# 3 North C 5
# 4 South B 35
The weighted count replaces row counting with summation. The column name remains n by default, but represents the sum of the weight variable.
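Under the hood, the weighted form is just a grouped sum. The sketch below confirms the equivalence on the sales data from above (redefined here so the snippet is self-contained):

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# count(..., wt = quantity) ...
weighted <- sales %>% count(region, wt = quantity)

# ... is shorthand for summing the weight variable per group
by_hand <- sales %>%
  group_by(region) %>%
  summarise(n = sum(quantity), .groups = "drop")

all(weighted$n == by_hand$n)  # TRUE
```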
Sorting Results with sort Parameter
The sort parameter orders results by frequency in descending order. This immediately highlights the most common values.
# Unsorted count
sales %>% count(product)
# product n
# 1 A 3
# 2 B 2
# 3 C 1
# Sorted count
sales %>% count(product, sort = TRUE)
# product n
# 1 A 3
# 2 B 2
# 3 C 1
# Sorted weighted count
sales %>% count(product, wt = quantity, sort = TRUE)
# product n
# 1 B 35
# 2 A 30
# 3 C 5
Sorting integrates directly into count(), eliminating the need for a separate arrange() call.
Custom Column Names with name Parameter
By default, the count column is named n. The name parameter customizes this output column name.
# Default name
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# Custom name
sales %>% count(region, name = "total_sales")
# region total_sales
# 1 East 1
# 2 North 3
# 3 South 2
# Custom name with weighting
sales %>% count(region, wt = quantity, name = "total_quantity")
# region total_quantity
# 1 East 12
# 2 North 23
# 3 South 35
Custom naming improves code readability, especially when joining count results with other datasets.
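A common pattern is joining a named count back onto the original rows; a descriptive name avoids a cryptic n column after the join. A small sketch (the share column name is purely illustrative):

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Named count: one row per region with its total quantity
region_totals <- sales %>%
  count(region, wt = quantity, name = "total_quantity")

# Join back and compute each row's share of its region's total
sales %>%
  left_join(region_totals, by = "region") %>%
  mutate(share = quantity / total_quantity)
```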
Using tally() with Grouped Data
tally() operates on already-grouped data frames (on ungrouped data it simply returns the total row count). It's typically the final step after group_by(), counting observations per group.
# tally() with grouped data
sales %>%
  group_by(region) %>%
  tally()
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# tally() with weights
sales %>%
  group_by(region) %>%
  tally(wt = quantity)
# region n
# 1 East 12
# 2 North 23
# 3 South 35
# tally() with sorting
sales %>%
  group_by(product) %>%
  tally(wt = quantity, sort = TRUE)
# product n
# 1 B 35
# 2 A 30
# 3 C 5
Use tally() when you’ve already performed grouping operations or need to insert additional transformations before counting.
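For example, you can filter within groups before counting, something a single count() call cannot express. A minimal sketch on the sales data:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Keep only orders of at least 10 units, then tally what survives
sales %>%
  group_by(region) %>%
  filter(quantity >= 10) %>%
  tally()
# East 1, North 1, South 2
```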
Combining count() with add_count()
add_count() adds a count column without collapsing the dataset. This preserves original rows while adding frequency information.
# add_count() preserves all rows
sales %>% add_count(region)
# region product quantity n
# 1 North A 10 3
# 2 South B 15 2
# 3 North A 8 3
# 4 East A 12 1
# 5 South B 20 2
# 6 North C 5 3
# Compare with count()
sales %>% count(region)
# region n
# 1 East 1
# 2 North 3
# 3 South 2
# add_count() with multiple variables
sales %>% add_count(region, product, name = "combo_count")
# region product quantity combo_count
# 1 North A 10 2
# 2 South B 15 2
# 3 North A 8 2
# 4 East A 12 1
# 5 South B 20 2
# 6 North C 5 1
add_count() is invaluable for filtering based on frequency or calculating percentages within groups.
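For instance, a within-group percentage takes one add_count() plus one mutate(); the column names here are illustrative:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# Attach each region's total, then express every row as a share of it
sales %>%
  add_count(region, wt = quantity, name = "region_total") %>%
  mutate(pct_of_region = quantity / region_total * 100)
```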
Practical Applications
Count functions excel in data validation and exploratory analysis. Here are real-world patterns:
# Find duplicate records
transactions <- data.frame(
  transaction_id = c("T1", "T2", "T3", "T2", "T4"),
  amount = c(100, 200, 150, 200, 300)
)
transactions %>%
  count(transaction_id, sort = TRUE) %>%
  filter(n > 1)
# transaction_id n
# 1 T2 2
# Calculate percentages
sales %>%
  count(region) %>%
  mutate(percentage = n / sum(n) * 100)
# region n percentage
# 1 East 1 16.67
# 2 North 3 50.00
# 3 South 2 33.33
# Identify rare categories
sales %>%
  count(product, wt = quantity, sort = TRUE) %>%
  mutate(cumulative_pct = cumsum(n) / sum(n) * 100)
# product n cumulative_pct
# 1 B 35 38.89
# 2 A 30 72.22
# 3 C 5 100.00
# Filter by frequency threshold
sales %>%
  add_count(region) %>%
  filter(n >= 2)
# region product quantity n
# 1 North A 10 3
# 2 South B 15 2
# 3 North A 8 3
# 4 South B 20 2
# 5 North C 5 3
Performance Considerations
Both functions are optimized for large datasets. dplyr's grouping and counting operations run in compiled code, which typically gives substantial performance benefits over base R alternatives such as table().
# Efficient for large datasets
# count() handles millions of rows efficiently
large_data <- data.frame(
  category = sample(letters[1:10], 1000000, replace = TRUE),
  value = rnorm(1000000)
)
system.time({
  large_data %>% count(category)
})
# user system elapsed
# 0.015 0.000 0.015
# Equivalent base R (slower)
system.time({
  as.data.frame(table(large_data$category))
})
# user system elapsed
# 0.089 0.004 0.093
The performance advantage generally grows with dataset size and the number of grouping variables; for grouped operations over several columns, count() typically outperforms base R alternatives by a wider margin.
When to Use Each Function
Choose count() when starting fresh or when grouping variables are known upfront. Use tally() when working within a pipeline that already includes group_by() or when you need intermediate transformations before counting. Use add_count() when you need frequency information alongside original data for filtering or percentage calculations.
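The three patterns side by side, on the sales data used throughout:

```r
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "North"),
  quantity = c(10, 15, 8, 12, 20, 5)
)

# count(): group and count in one step -> one row per region
sales %>% count(region)

# tally(): count an existing grouping -> same result, grouping explicit
sales %>% group_by(region) %>% tally()

# add_count(): attach counts without collapsing -> still six rows
sales %>% add_count(region)
```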