# R dplyr - top_n() and slice_max()
## Key Insights

- `top_n()` is deprecated in favor of `slice_max()` and `slice_min()`, which offer clearer syntax and better handling of ties through the `with_ties` parameter
- `slice_max()` provides more predictable behavior when selecting top records, automatically handling grouped data and offering precise control over tie-breaking
- Understanding the differences between these functions prevents unexpected results in production code, especially when working with ranked data or generating reports
## Why the Transition from top_n() to slice_max()

The dplyr package deprecated `top_n()` in version 1.0.0, recommending `slice_max()` and `slice_min()` as replacements. This wasn't arbitrary: `top_n()` had ambiguous tie handling and a confusing argument order (the count comes before the ranking column) that tripped up developers.
```r
library(dplyr)

# Sample dataset
sales_data <- tibble(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, 1500, 1500, 800, 2000),
  units   = c(50, 75, 60, 40, 100)
)

# Old way - top_n() (deprecated)
sales_data %>% top_n(3, revenue)

# New way - slice_max()
sales_data %>% slice_max(revenue, n = 3)
```
The slice_max() syntax is clearer: the column comes first, followed by named parameters. This aligns with dplyr’s general design philosophy of explicit, readable code.
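For completeness, a quick sketch of the `slice_min()` counterpart, reusing the sales_data tibble from above:

```r
library(dplyr)

sales_data <- tibble(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, 1500, 1500, 800, 2000)
)

# Bottom 2 products by revenue; slice_min() returns rows
# sorted ascending by the ranking column
bottom_two <- sales_data %>% slice_min(revenue, n = 2)
bottom_two
# Rows D (800) and A (1000)
```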
## Handling Ties with the with_ties Parameter
The critical difference between these functions lies in tie handling. top_n() always includes ties, which can return more rows than requested. slice_max() gives you control.
```r
# Data with ties
products <- tibble(
  id = 1:6,
  category = c("Electronics", "Electronics", "Clothing",
               "Clothing", "Food", "Food"),
  score = c(95, 95, 88, 92, 95, 87)
)

# slice_max with ties included (default)
products %>%
  slice_max(score, n = 2, with_ties = TRUE)
# Returns 3 rows: all three products tied at 95

# slice_max without ties
products %>%
  slice_max(score, n = 2, with_ties = FALSE)
# Returns exactly 2 rows: the first two products with score 95

# For comparison, the old top_n() behavior
products %>%
  top_n(2, score)
# Always includes ties, so it also returns 3 rows
```
In production environments, this distinction matters. If you’re generating a “Top 10” report and ties push it to 13 items, your dashboard layout might break. Use with_ties = FALSE when you need exact counts.
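As a sketch of that dashboard scenario (the leaderboard data is invented here), ties at the cutoff inflate the row count unless `with_ties = FALSE` pins it down:

```r
library(dplyr)

# Hypothetical leaderboard with many scores tied at the cutoff
leaderboard <- tibble(
  player = paste0("P", 1:15),
  score  = c(90, 90, 90, 90, 85, 85, 85, 80, 80, 80, 80, 80, 75, 70, 65)
)

# Ties at the 10th position inflate the "Top 10" to 12 rows...
inflated <- leaderboard %>% slice_max(score, n = 10)
nrow(inflated)  # 12

# ...while with_ties = FALSE guarantees exactly 10
exact_ten <- leaderboard %>% slice_max(score, n = 10, with_ties = FALSE)
nrow(exact_ten)  # 10
```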
## Working with Grouped Data

Both functions respect `group_by()`, but `slice_max()` produces clearer, more predictable per-group results, particularly around ties.
```r
# Sales data by region
regional_sales <- tibble(
  region = rep(c("North", "South", "East", "West"), each = 5),
  salesperson = paste0("Rep_", 1:20),
  sales = c(
    45000, 52000, 48000, 51000, 47000,  # North
    38000, 42000, 44000, 41000, 39000,  # South
    55000, 53000, 54000, 52000, 51000,  # East
    49000, 47000, 48000, 50000, 46000   # West
  )
)

# Get top 2 performers per region
top_performers <- regional_sales %>%
  group_by(region) %>%
  slice_max(sales, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_performers)

# Compare with proportion-based selection:
# get the top 20% from each region
top_percent <- regional_sales %>%
  group_by(region) %>%
  slice_max(sales, prop = 0.2) %>%
  ungroup()
```
The prop parameter is particularly useful for dynamic datasets where absolute counts don’t make sense. If you’re analyzing datasets of varying sizes, prop = 0.1 consistently gives you the top 10% regardless of group size.
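A minimal sketch of that scaling behavior, using made-up groups of deliberately different sizes:

```r
library(dplyr)

# Two groups of different sizes: 10 rows vs 50 rows
set.seed(42)
mixed <- tibble(
  grp = c(rep("small", 10), rep("large", 50)),
  val = c(rnorm(10), rnorm(50))
)

# prop = 0.1 takes the top 10% of EACH group
top10pct <- mixed %>%
  group_by(grp) %>%
  slice_max(val, prop = 0.1, with_ties = FALSE) %>%
  ungroup()

count(top10pct, grp)
# 5 rows from the 50-row group, 1 from the 10-row group
```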
## slice_max() vs slice_min() vs arrange() + slice()
Understanding when to use each function optimizes both readability and performance.
```r
# Dataset for comparison
employee_data <- tibble(
  employee_id = 1:1000,
  salary = rnorm(1000, mean = 75000, sd = 15000),
  performance_score = runif(1000, min = 1, max = 5)
)

# Method 1: slice_max - most concise
top_earners_v1 <- employee_data %>%
  slice_max(salary, n = 10)

# Method 2: arrange + slice - more verbose
top_earners_v2 <- employee_data %>%
  arrange(desc(salary)) %>%
  slice(1:10)

# Method 3: arrange + head - base R style
top_earners_v3 <- employee_data %>%
  arrange(desc(salary)) %>%
  head(10)

# Benchmark performance
library(microbenchmark)
microbenchmark(
  slice_max = employee_data %>% slice_max(salary, n = 10),
  arrange_slice = employee_data %>% arrange(desc(salary)) %>% slice(1:10),
  times = 100
)
```
`slice_max()` is typically faster because it doesn't need to sort the entire dataset; it only has to identify the top n values. The gap grows with dataset size, so benchmark on your own data to confirm.
## Practical Use Cases with Multiple Columns
Real-world scenarios often require selecting based on multiple criteria or columns.
```r
# E-commerce product data
products <- tibble(
  product_id = 1:50,
  category = sample(c("Electronics", "Clothing", "Home"), 50, replace = TRUE),
  rating = runif(50, min = 3, max = 5),
  review_count = sample(10:1000, 50),
  price = runif(50, min = 10, max = 500)
)

# Top 5 products by rating, then the top 3 of those by review count
top_products <- products %>%
  slice_max(order_by = rating, n = 5, with_ties = FALSE) %>%
  slice_max(order_by = review_count, n = 3, with_ties = FALSE)

# Better approach: use arrange() for multiple criteria
top_products_better <- products %>%
  arrange(desc(rating), desc(review_count)) %>%
  slice(1:5)

# Top 3 in each category by weighted score
products <- products %>%
  mutate(weighted_score = rating * log1p(review_count))

top_by_category <- products %>%
  group_by(category) %>%
  slice_max(weighted_score, n = 3, with_ties = FALSE) %>%
  arrange(category, desc(weighted_score)) %>%
  ungroup()
```
When you need complex sorting logic, arrange() followed by slice() often provides clearer intent than chaining multiple slice_max() calls.
## Handling NA Values
Both functions handle NA values, but you need to understand the behavior to avoid surprises.
```r
# Data with missing values
incomplete_data <- tibble(
  id = 1:8,
  value = c(100, 200, NA, 150, NA, 300, 250, 175)
)

# slice_max drops NAs by default
incomplete_data %>%
  slice_max(value, n = 3)
# Returns only non-NA values

# Explicitly handle NAs first
incomplete_data %>%
  filter(!is.na(value)) %>%
  slice_max(value, n = 3)

# Or replace NAs before selection (replace_na() comes from tidyr)
incomplete_data %>%
  mutate(value = tidyr::replace_na(value, 0)) %>%
  slice_max(value, n = 3)

# Get bottom values, NAs excluded
incomplete_data %>%
  slice_min(value, n = 3)
```
Always validate your data for NAs before using these functions in production pipelines. Unexpected NA behavior is a common source of bugs in data processing workflows.
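One lightweight way to enforce that validation is a guard that fails fast before the selection step (the `assert_no_na()` helper below is our own, not a dplyr function):

```r
library(dplyr)

# Hypothetical guard: stop the pipeline if the ranking column has NAs
assert_no_na <- function(data, col) {
  n_missing <- sum(is.na(pull(data, {{ col }})))
  if (n_missing > 0) {
    stop(n_missing, " NA value(s) found in ranking column")
  }
  invisible(data)
}

incomplete_data <- tibble(
  id = 1:4,
  value = c(100, NA, 150, NA)
)

# Errors with "2 NA value(s) found in ranking column"
try(
  incomplete_data %>% assert_no_na(value) %>% slice_max(value, n = 2)
)
```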
## Migration Strategy from top_n()
If you’re maintaining legacy code, here’s a systematic approach to migration:
```r
# Legacy code pattern
legacy_result <- data %>%
  group_by(category) %>%
  top_n(5, metric) %>%
  ungroup()

# Direct replacement (maintains tie behavior)
modern_result <- data %>%
  group_by(category) %>%
  slice_max(metric, n = 5, with_ties = TRUE) %>%
  ungroup()

# Improved version (explicit tie handling)
improved_result <- data %>%
  group_by(category) %>%
  slice_max(metric, n = 5, with_ties = FALSE) %>%
  ungroup()

# Test equivalence; note that slice_max() sorts rows by the metric
# while top_n() keeps the input order, so sort both before comparing
all.equal(
  legacy_result %>% arrange(category, desc(metric)),
  modern_result %>% arrange(category, desc(metric))
)
```
Run both versions in parallel initially, comparing results to ensure behavioral consistency before fully deprecating top_n() usage.
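One way to sketch that side-by-side comparison is with anti_join(), which surfaces any row present in one result but not the other (toy data and column names below follow the pattern above):

```r
library(dplyr)

# Toy data matching the legacy pattern's column names
data <- tibble(
  category = rep(c("x", "y"), each = 10),
  metric   = c(1:10, 11:20)
)

legacy_result <- data %>%
  group_by(category) %>% top_n(5, metric) %>% ungroup()

modern_result <- data %>%
  group_by(category) %>% slice_max(metric, n = 5, with_ties = TRUE) %>% ungroup()

# Rows in one result but not the other; both empty means behavior matches
only_legacy <- anti_join(legacy_result, modern_result,
                         by = c("category", "metric"))
only_modern <- anti_join(modern_result, legacy_result,
                         by = c("category", "metric"))
```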
## Performance Considerations for Large Datasets
When working with millions of rows, function choice impacts execution time significantly.
```r
# Large dataset simulation
large_data <- tibble(
  id = 1:1e6,
  category = sample(letters[1:10], 1e6, replace = TRUE),
  value = rnorm(1e6)
)

# Efficient: slice_max on grouped data
system.time({
  result1 <- large_data %>%
    group_by(category) %>%
    slice_max(value, n = 100, with_ties = FALSE) %>%
    ungroup()
})

# Less efficient: arrange the entire dataset first
system.time({
  result2 <- large_data %>%
    group_by(category) %>%
    arrange(desc(value)) %>%
    slice(1:100) %>%
    ungroup()
})
```
For datasets exceeding memory limits, consider the dtplyr or arrow backends, which optimize these operations for larger-than-RAM data processing. The `slice_max()` syntax carries over with minimal changes; you mostly just wrap the data in a lazy frame first.
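A minimal dtplyr sketch, assuming a recent dtplyr is installed (one that translates `slice_max()` to data.table code); the data here is smaller than above just to keep the example quick:

```r
library(dplyr)
library(dtplyr)

large_data <- tibble(
  id = 1:1e5,
  category = sample(letters[1:10], 1e5, replace = TRUE),
  value = rnorm(1e5)
)

# lazy_dt() converts the tibble to a lazy data.table; the same dplyr
# verbs then compile to data.table code instead of executing eagerly
top_per_category <- large_data %>%
  lazy_dt() %>%
  group_by(category) %>%
  slice_max(value, n = 100, with_ties = FALSE) %>%
  ungroup() %>%
  as_tibble()   # forces execution and collects the result

nrow(top_per_category)  # 10 categories x 100 rows = 1000
```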