R dplyr - row_number(), rank(), dense_rank()

Key Insights

row_number(), rank(), and dense_rank() assign sequential positions to rows, but handle ties differently: row_number() breaks ties arbitrarily, rank() skips positions after ties, and dense_rank() maintains consecutive numbering
These ranking functions integrate seamlessly with group_by() to calculate rankings within partitions, enabling per-category top-N queries and relative position analysis
Combining ranking functions with filter() and arrange() provides efficient solutions for common analytical tasks like finding top performers, detecting duplicates, and implementing pagination

Understanding the Three Ranking Functions

The dplyr package provides three distinct ranking functions that assign positional values to rows. While they appear similar, their handling of tied values creates fundamentally different outputs.

library(dplyr)

# Sample dataset with tied values
sales <- tibble(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(100, 150, 150, 200, 100)
)

sales %>%
  arrange(desc(revenue)) %>%
  mutate(
    row_num = row_number(desc(revenue)),
    rank_val = rank(desc(revenue)),
    dense_rank_val = dense_rank(desc(revenue))
  )

Output:

# A tibble: 5 × 5
  product revenue row_num rank_val dense_rank_val
  <chr>     <dbl>   <int>    <dbl>          <int>
1 D           200       1        1              1
2 B           150       2        2              2
3 C           150       3        2              2
4 A           100       4        4              3
5 E           100       5        4              3

Notice how row_number() assigns 2 and 3 to the tied values arbitrarily, rank() assigns both ties the value 2 but skips 3, and dense_rank() assigns both 2 and continues with 3 (no gaps).

Grouped Rankings for Partitioned Data

Ranking functions become powerful when combined with group_by(), allowing you to calculate positions within categories.

quarterly_sales <- tibble(
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), each = 4),
  salesperson = rep(c("Alice", "Bob", "Carol", "Dave"), 4),
  deals_closed = c(12, 15, 15, 10,  # Q1
                   18, 14, 14, 16,  # Q2
                   20, 22, 19, 22,  # Q3
                   25, 23, 23, 21)  # Q4
)

top_performers <- quarterly_sales %>%
  group_by(quarter) %>%
  mutate(
    rank_position = rank(desc(deals_closed)),
    dense_position = dense_rank(desc(deals_closed))
  ) %>%
  filter(rank_position <= 2) %>%
  arrange(quarter, rank_position)

print(top_performers)

This query identifies the top 2 performers per quarter. When using rank() with ties, you might get more than 2 results per group—Q3 returns three people because two tied for first place.

Top-N Queries with row_number()

For strict top-N queries where you need exactly N results per group regardless of ties, row_number() is the appropriate choice.

# Get exactly top 3 products per category
product_data <- tibble(
  category = rep(c("Electronics", "Clothing", "Food"), each = 6),
  product = paste0("Product_", 1:18),
  units_sold = sample(50:500, 18)
)

top_3_per_category <- product_data %>%
  group_by(category) %>%
  mutate(position = row_number(desc(units_sold))) %>%
  filter(position <= 3) %>%
  arrange(category, position)

This guarantees exactly 3 products per category. If you need to control tie-breaking behavior, add additional sorting criteria:

# Break ties alphabetically by product name
product_data %>%
  group_by(category) %>%
  arrange(category, desc(units_sold), product) %>%
  mutate(position = row_number()) %>%
  filter(position <= 3)

Percentile Rankings and Relative Position

Combine ranking functions with group size calculations to determine relative positions.

employee_performance <- tibble(
  department = rep(c("Engineering", "Sales", "Marketing"), c(10, 8, 6)),
  employee_id = 1:24,
  performance_score = rnorm(24, mean = 75, sd = 10)
)

percentile_ranks <- employee_performance %>%
  group_by(department) %>%
  mutate(
    rank_pos = rank(desc(performance_score)),
    total_count = n(),
    percentile = (1 - (rank_pos - 1) / total_count) * 100,
    top_quartile = percentile >= 75
  ) %>%
  arrange(department, rank_pos)

# Find employees in top 25% of their department
top_quartile_employees <- percentile_ranks %>%
  filter(top_quartile)

Handling Duplicates and Data Quality

Use row_number() to identify and handle duplicate records.

customer_records <- tibble(
  customer_id = c(101, 101, 102, 103, 103, 103, 104),
  transaction_date = as.Date(c(
    "2024-01-15", "2024-01-20", "2024-01-18",
    "2024-02-01", "2024-02-01", "2024-02-05", "2024-02-10"
  )),
  amount = c(250, 300, 150, 400, 400, 200, 350)
)

# Keep most recent transaction per customer
deduplicated <- customer_records %>%
  group_by(customer_id) %>%
  arrange(desc(transaction_date), desc(amount)) %>%
  mutate(record_num = row_number()) %>%
  filter(record_num == 1) %>%
  select(-record_num)

# Identify all duplicates for review
duplicates <- customer_records %>%
  group_by(customer_id, transaction_date) %>%
  mutate(
    occurrence = row_number(),
    total_occurrences = n()
  ) %>%
  filter(total_occurrences > 1)

Window Functions for Running Rankings

Calculate running ranks within ordered windows for time-series analysis.

stock_prices <- tibble(
  date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
  closing_price = c(150, 152, 151, 155, 154, 158, 157, 160, 159, 162)
)

stock_analysis <- stock_prices %>%
  arrange(date) %>%
  mutate(
    # Rank each day's price against all historical prices
    all_time_rank = rank(desc(closing_price)),
    # Rolling 5-day rank
    price_position = row_number()
  ) %>%
  # Add 5-day rolling window rank
  mutate(
    rolling_rank = sapply(price_position, function(i) {
      window_start <- max(1, i - 4)
      rank(desc(closing_price[window_start:i]))[i - window_start + 1]
    })
  )

Pagination Implementation

Implement efficient pagination for large datasets using row_number().

# Simulate large product catalog
catalog <- tibble(
  product_id = 1:1000,
  product_name = paste0("Product_", 1:1000),
  price = runif(1000, 10, 1000)
)

# Pagination function
get_page <- function(data, page_number, page_size = 20) {
  data %>%
    arrange(product_id) %>%
    mutate(row_num = row_number()) %>%
    filter(
      row_num > (page_number - 1) * page_size,
      row_num <= page_number * page_size
    ) %>%
    select(-row_num)
}

# Get page 3
page_3 <- get_page(catalog, page_number = 3, page_size = 20)

Performance Considerations

When working with large datasets, the ordering of operations impacts performance significantly.

# Inefficient: ranking before filtering
large_dataset %>%
  mutate(rank = row_number(desc(value))) %>%
  filter(category == "A") %>%
  filter(rank <= 10)

# Efficient: filter first, then rank
large_dataset %>%
  filter(category == "A") %>%
  mutate(rank = row_number(desc(value))) %>%
  filter(rank <= 10)

For very large grouped operations, consider using slice_max() or slice_min() as optimized alternatives:

# Instead of this
data %>%
  group_by(category) %>%
  mutate(rank = row_number(desc(sales))) %>%
  filter(rank <= 5)

# Use this
data %>%
  group_by(category) %>%
  slice_max(sales, n = 5, with_ties = FALSE)

The slice_* functions are optimized for top-N operations and provide clearer intent while maintaining comparable performance characteristics.