R dplyr - n() and n_distinct()

Key Insights

• `n()` counts rows within groups while `n_distinct()` counts unique values, forming the foundation of aggregation operations in dplyr
• Both functions work seamlessly with `group_by()` and `summarise()`, but `n()` takes no arguments while `n_distinct()` needs at least one column
• Knowing when to use each prevents common mistakes like double-counting in grouped operations and enables efficient data quality checks

Understanding n() for Row Counting

The n() function counts the number of rows in the current group. It’s the dplyr equivalent of SQL’s COUNT(*) and requires no arguments.

library(dplyr)

# Sample sales data
sales <- data.frame(
  region = c("East", "East", "West", "West", "East", "North"),
  product = c("A", "B", "A", "A", "A", "C"),
  amount = c(100, 150, 200, 175, 125, 300)
)

# Count total rows
sales %>%
  summarise(total_transactions = n())
#   total_transactions
# 1                  6

# Count by group
sales %>%
  group_by(region) %>%
  summarise(transactions = n())
#   region transactions
# 1 East              3
# 2 North             1
# 3 West              2

The power of n() becomes apparent when combined with filtering and grouping operations. Unlike base R’s nrow(), n() respects grouping structure automatically.

# Count after filtering
sales %>%
  filter(amount > 150) %>%
  group_by(region) %>%
  summarise(
    high_value_transactions = n(),
    avg_amount = mean(amount)
  )
#   region high_value_transactions avg_amount
# 1 North                       1        300
# 2 West                        2      187.5

Using n_distinct() for Unique Value Counting

The n_distinct() function counts unique values in one or more columns. It’s equivalent to SQL’s COUNT(DISTINCT column) and requires column names as arguments.

# Count distinct products
sales %>%
  summarise(unique_products = n_distinct(product))
#   unique_products
# 1               3

# Count distinct products by region
sales %>%
  group_by(region) %>%
  summarise(
    transactions = n(),
    unique_products = n_distinct(product)
  )
#   region transactions unique_products
# 1 East              3               2
# 2 North             1               1
# 3 West              2               1

You can pass multiple columns to n_distinct() to count unique combinations:

# Count unique region-product combinations
sales %>%
  summarise(unique_combinations = n_distinct(region, product))
#   unique_combinations
# 1                   4

# Verify the combinations
sales %>%
  distinct(region, product) %>%
  arrange(region, product)
#   region product
# 1 East   A      
# 2 East   B      
# 3 North  C      
# 4 West   A

Practical Applications in Data Quality

These functions excel at identifying data quality issues and summarizing dataset characteristics.

# Customer transaction data with potential duplicates
customers <- data.frame(
  customer_id = c(1, 1, 2, 2, 2, 3, 4, 4),
  transaction_date = as.Date(c(
    "2024-01-15", "2024-01-15", "2024-01-16", 
    "2024-01-17", "2024-01-17", "2024-01-18",
    "2024-01-19", "2024-01-19"
  )),
  amount = c(50, 50, 75, 100, 100, 200, 150, 150)
)

# Detect potential duplicates
duplicate_check <- customers %>%
  group_by(customer_id, transaction_date) %>%
  summarise(
    record_count = n(),
    unique_amounts = n_distinct(amount),
    .groups = "drop"
  ) %>%
  filter(record_count > 1)

duplicate_check
#   customer_id transaction_date record_count unique_amounts
# 1           1       2024-01-15            2              1
# 2           2       2024-01-17            2              1
# 3           4       2024-01-19            2              1

This pattern quickly identifies exact duplicates (where unique_amounts == 1) versus legitimate multiple transactions on the same day.
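Once exact duplicates are confirmed, a natural follow-up (a minimal sketch reusing the same customers data) is to drop them with distinct(), which keeps one row per unique combination of the listed columns:

```r
library(dplyr)

# Same customers data as defined above
customers <- data.frame(
  customer_id = c(1, 1, 2, 2, 2, 3, 4, 4),
  transaction_date = as.Date(c(
    "2024-01-15", "2024-01-15", "2024-01-16",
    "2024-01-17", "2024-01-17", "2024-01-18",
    "2024-01-19", "2024-01-19"
  )),
  amount = c(50, 50, 75, 100, 100, 200, 150, 150)
)

# Keep one row per (customer, date, amount) combination:
# exact duplicates collapse, while same-day transactions
# with differing amounts survive
deduped <- customers %>%
  distinct(customer_id, transaction_date, amount)

nrow(customers)  # 8
nrow(deduped)    # 5
```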

# Customer activity summary
customer_summary <- customers %>%
  group_by(customer_id) %>%
  summarise(
    total_transactions = n(),
    unique_dates = n_distinct(transaction_date),
    unique_amounts = n_distinct(amount),
    total_spent = sum(amount)
  )

customer_summary
#   customer_id total_transactions unique_dates unique_amounts total_spent
# 1           1                  2            1              1         100
# 2           2                  3            2              2         275
# 3           3                  1            1              1         200
# 4           4                  2            1              1         300

Combining with Window Functions

Both functions work within window operations using mutate(), enabling row-level calculations based on group statistics.

# Add group statistics to each row
sales_enriched <- sales %>%
  group_by(region) %>%
  mutate(
    region_transaction_count = n(),
    region_product_diversity = n_distinct(product),
    pct_of_region_transactions = 1 / n() * 100
  ) %>%
  ungroup()

sales_enriched
#   region product amount region_transaction_count region_product_diversity
# 1 East   A        100                         3                        2
# 2 East   B        150                         3                        2
# 3 West   A        200                         2                        1
# 4 West   A        175                         2                        1
# 5 East   A        125                         3                        2
# 6 North  C        300                         1                        1
#   pct_of_region_transactions
# 1                   33.33333
# 2                   33.33333
# 3                   50.00000
# 4                   50.00000
# 5                   33.33333
# 6                  100.00000

Handling Missing Values

Both functions handle NA values differently than you might expect.

data_with_na <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 2, NA, 1, 1, NA)
)

# n() counts all rows including NAs
data_with_na %>%
  group_by(group) %>%
  summarise(row_count = n())
#   group row_count
# 1 A             3
# 2 B             3

# n_distinct() counts NA as a distinct value by default
data_with_na %>%
  group_by(group) %>%
  summarise(
    distinct_with_na = n_distinct(value),
    distinct_without_na = n_distinct(value[!is.na(value)])
  )
#   group distinct_with_na distinct_without_na
# 1 A                    3                   2
# 2 B                    2                   1

To exclude NAs from distinct counts, pass na.rm = TRUE to n_distinct() or filter them out first.
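The na.rm argument to n_distinct() reads more cleanly than subsetting and gives the same result:

```r
library(dplyr)

# Same data_with_na as defined above
data_with_na <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 2, NA, 1, 1, NA)
)

# na.rm = TRUE drops NAs before counting unique values
data_with_na %>%
  group_by(group) %>%
  summarise(distinct_without_na = n_distinct(value, na.rm = TRUE))
#   group distinct_without_na
# 1 A                       2
# 2 B                       1
```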

Performance Considerations

These functions are optimized for performance, especially on grouped data frames.

# Efficient: single pass through data
sales %>%
  group_by(region) %>%
  summarise(
    n = n(),
    distinct_products = n_distinct(product),
    distinct_amounts = n_distinct(amount)
  )

# Less efficient: multiple separate operations
regions <- unique(sales$region)
results <- data.frame()
for(r in regions) {
  subset_data <- sales[sales$region == r, ]
  results <- rbind(results, data.frame(
    region = r,
    n = nrow(subset_data),
    distinct_products = length(unique(subset_data$product))
  ))
}

The dplyr approach vectorizes operations and minimizes data copying, making it substantially faster on large datasets.

Common Patterns and Anti-Patterns

Pattern: Calculating percentages

sales %>%
  group_by(region) %>%
  summarise(transactions = n()) %>%
  mutate(percentage = transactions / sum(transactions) * 100)
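The group_by() + summarise(n = n()) combination is common enough that dplyr provides count() as a shortcut (with add_count() as the mutate()-style equivalent):

```r
library(dplyr)

# Same sales data as defined earlier
sales <- data.frame(
  region = c("East", "East", "West", "West", "East", "North"),
  product = c("A", "B", "A", "A", "A", "C"),
  amount = c(100, 150, 200, 175, 125, 300)
)

# Equivalent to group_by(region) %>% summarise(n = n()),
# with sort = TRUE ordering by descending count
sales %>%
  count(region, sort = TRUE)
#   region n
# 1   East 3
# 2   West 2
# 3  North 1
```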

Anti-pattern: Using n() outside summarise/mutate

# This won't work
# count <- n()  # Error: must be used within a dplyr verb

# Instead use
count <- nrow(sales)

Pattern: Conditional distinct counting

sales %>%
  group_by(region) %>%
  summarise(
    high_value_products = n_distinct(product[amount > 150])
  )
#   region high_value_products
# 1 East                     0
# 2 North                    1
# 3 West                     1

These functions form the backbone of exploratory data analysis in dplyr. Master them to write cleaner, more expressive data transformation pipelines.
