# R dplyr - top_n() and slice_max()
## Key Insights

- `top_n()` is deprecated in favor of `slice_max()` and `slice_min()`, which offer clearer syntax and better handling of ties through the `with_ties` parameter
- `slice_max()` provides more predictable behavior when selecting top records, automatically handling grouped data and offering precise control over tie-breaking
- Understanding the differences between these functions prevents unexpected results in production code, especially when working with ranked data or generating reports
## Why the Transition from top_n() to slice_max()

The dplyr package deprecated `top_n()` in version 1.0.0, recommending `slice_max()` and `slice_min()` as replacements. This wasn't arbitrary: `top_n()` had ambiguous tie handling and a confusing argument order (the count comes before the ranking column) that tripped up developers.
```r
library(dplyr)

# Sample dataset
sales_data <- tibble(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, 1500, 1500, 800, 2000),
  units   = c(50, 75, 60, 40, 100)
)

# Old way - top_n() (deprecated)
sales_data %>% top_n(3, revenue)

# New way - slice_max()
sales_data %>% slice_max(revenue, n = 3)
```
The slice_max() syntax is clearer: the column comes first, followed by named parameters. This aligns with dplyr’s general design philosophy of explicit, readable code.
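For completeness, a quick sketch of the `slice_min()` counterpart, reusing the sales_data tibble from above:

```r
library(dplyr)

sales_data <- tibble(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, 1500, 1500, 800, 2000)
)

# Bottom 2 products by revenue; slice_min() returns rows
# sorted ascending by the ranking column
bottom_two <- sales_data %>% slice_min(revenue, n = 2)
bottom_two
# Rows D (800) and A (1000)
```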
## Handling Ties with the with_ties Parameter
The critical difference between these functions lies in tie handling. top_n() always includes ties, which can return more rows than requested. slice_max() gives you control.
```r
# Data with ties
products <- tibble(
  id = 1:6,
  category = c("Electronics", "Electronics", "Clothing",
               "Clothing", "Food", "Food"),
  score = c(95, 95, 88, 92, 95, 87)
)

# slice_max with ties included (default)
products %>%
  slice_max(score, n = 2, with_ties = TRUE)
# Returns 3 rows: all three products tied at 95

# slice_max without ties
products %>%
  slice_max(score, n = 2, with_ties = FALSE)
# Returns exactly 2 rows: the first two products with score 95

# For comparison, the old top_n() behavior
products %>%
  top_n(2, score)
# Always includes ties, so it also returns 3 rows
```
In production environments, this distinction matters. If you’re generating a “Top 10” report and ties push it to 13 items, your dashboard layout might break. Use with_ties = FALSE when you need exact counts.
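As a sketch of that dashboard scenario (the leaderboard data is invented here), ties at the cutoff inflate the row count unless `with_ties = FALSE` pins it down:

```r
library(dplyr)

# Hypothetical leaderboard with many scores tied at the cutoff
leaderboard <- tibble(
  player = paste0("P", 1:15),
  score  = c(90, 90, 90, 90, 85, 85, 85, 80, 80, 80, 80, 80, 75, 70, 65)
)

# Ties at the 10th position inflate the "Top 10" to 12 rows...
inflated <- leaderboard %>% slice_max(score, n = 10)
nrow(inflated)  # 12

# ...while with_ties = FALSE guarantees exactly 10
exact_ten <- leaderboard %>% slice_max(score, n = 10, with_ties = FALSE)
nrow(exact_ten)  # 10
```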
## Working with Grouped Data

Both functions respect `group_by()`, but `slice_max()` produces clearer, more predictable per-group results, particularly around ties.
```r
# Sales data by region
regional_sales <- tibble(
  region = rep(c("North", "South", "East", "West"), each = 5),
  salesperson = paste0("Rep_", 1:20),
  sales = c(
    45000, 52000, 48000, 51000, 47000,  # North
    38000, 42000, 44000, 41000, 39000,  # South
    55000, 53000, 54000, 52000, 51000,  # East
    49000, 47000, 48000, 50000, 46000   # West
  )
)

# Get top 2 performers per region
top_performers <- regional_sales %>%
  group_by(region) %>%
  slice_max(sales, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_performers)

# Compare with proportion-based selection:
# get the top 20% from each region
top_percent <- regional_sales %>%
  group_by(region) %>%
  slice_max(sales, prop = 0.2) %>%
  ungroup()
```
The prop parameter is particularly useful for dynamic datasets where absolute counts don’t make sense. If you’re analyzing datasets of varying sizes, prop = 0.1 consistently gives you the top 10% regardless of group size.
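A minimal sketch of that scaling behavior, using made-up groups of deliberately different sizes:

```r
library(dplyr)

# Two groups of different sizes: 10 rows vs 50 rows
set.seed(42)
mixed <- tibble(
  grp = c(rep("small", 10), rep("large", 50)),
  val = c(rnorm(10), rnorm(50))
)

# prop = 0.1 takes the top 10% of EACH group
top10pct <- mixed %>%
  group_by(grp) %>%
  slice_max(val, prop = 0.1, with_ties = FALSE) %>%
  ungroup()

count(top10pct, grp)
# 5 rows from the 50-row group, 1 from the 10-row group
```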
## slice_max() vs slice_min() vs arrange() + slice()
Understanding when to use each function optimizes both readability and performance.
```r
# Dataset for comparison
employee_data <- tibble(
  employee_id = 1:1000,
  salary = rnorm(1000, mean = 75000, sd = 15000),
  performance_score = runif(1000, min = 1, max = 5)
)

# Method 1: slice_max - most concise
top_earners_v1 <- employee_data %>%
  slice_max(salary, n = 10)

# Method 2: arrange + slice - more verbose
top_earners_v2 <- employee_data %>%
  arrange(desc(salary)) %>%
  slice(1:10)

# Method 3: arrange + head - base R style
top_earners_v3 <- employee_data %>%
  arrange(desc(salary)) %>%
  head(10)

# Benchmark performance
library(microbenchmark)
microbenchmark(
  slice_max = employee_data %>% slice_max(salary, n = 10),
  arrange_slice = employee_data %>% arrange(desc(salary)) %>% slice(1:10),
  times = 100
)
```
`slice_max()` is typically faster because it doesn't need to sort the entire dataset; it only has to identify the top n values. The gap grows with dataset size, so benchmark on your own data to confirm.
## Practical Use Cases with Multiple Columns
Real-world scenarios often require selecting based on multiple criteria or columns.
```r
# E-commerce product data
products <- tibble(
  product_id = 1:50,
  category = sample(c("Electronics", "Clothing", "Home"), 50, replace = TRUE),
  rating = runif(50, min = 3, max = 5),
  review_count = sample(10:1000, 50),
  price = runif(50, min = 10, max = 500)
)

# Top 5 products by rating, then the top 3 of those by review count
top_products <- products %>%
  slice_max(order_by = rating, n = 5, with_ties = FALSE) %>%
  slice_max(order_by = review_count, n = 3, with_ties = FALSE)

# Better approach: use arrange() for multiple criteria
top_products_better <- products %>%
  arrange(desc(rating), desc(review_count)) %>%
  slice(1:5)

# Top 3 in each category by weighted score
products <- products %>%
  mutate(weighted_score = rating * log1p(review_count))

top_by_category <- products %>%
  group_by(category) %>%
  slice_max(weighted_score, n = 3, with_ties = FALSE) %>%
  arrange(category, desc(weighted_score)) %>%
  ungroup()
```
When you need complex sorting logic, arrange() followed by slice() often provides clearer intent than chaining multiple slice_max() calls.
## Handling NA Values
Both functions handle NA values, but you need to understand the behavior to avoid surprises.
```r
# Data with missing values
incomplete_data <- tibble(
  id = 1:8,
  value = c(100, 200, NA, 150, NA, 300, 250, 175)
)

# slice_max drops NAs by default
incomplete_data %>%
  slice_max(value, n = 3)
# Returns only non-NA values

# Explicitly handle NAs first
incomplete_data %>%
  filter(!is.na(value)) %>%
  slice_max(value, n = 3)

# Or replace NAs before selection (replace_na() comes from tidyr)
incomplete_data %>%
  mutate(value = tidyr::replace_na(value, 0)) %>%
  slice_max(value, n = 3)

# Get bottom values, NAs excluded
incomplete_data %>%
  slice_min(value, n = 3)
```
Always validate your data for NAs before using these functions in production pipelines. Unexpected NA behavior is a common source of bugs in data processing workflows.
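One lightweight way to enforce that validation is a guard that fails fast before the selection step (the `assert_no_na()` helper below is our own, not a dplyr function):

```r
library(dplyr)

# Hypothetical guard: stop the pipeline if the ranking column has NAs
assert_no_na <- function(data, col) {
  n_missing <- sum(is.na(pull(data, {{ col }})))
  if (n_missing > 0) {
    stop(n_missing, " NA value(s) found in ranking column")
  }
  invisible(data)
}

incomplete_data <- tibble(
  id = 1:4,
  value = c(100, NA, 150, NA)
)

# Errors with "2 NA value(s) found in ranking column"
try(
  incomplete_data %>% assert_no_na(value) %>% slice_max(value, n = 2)
)
```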
## Migration Strategy from top_n()
If you’re maintaining legacy code, here’s a systematic approach to migration:
```r
# Legacy code pattern
legacy_result <- data %>%
  group_by(category) %>%
  top_n(5, metric) %>%
  ungroup()

# Direct replacement (maintains tie behavior)
modern_result <- data %>%
  group_by(category) %>%
  slice_max(metric, n = 5, with_ties = TRUE) %>%
  ungroup()

# Improved version (explicit tie handling)
improved_result <- data %>%
  group_by(category) %>%
  slice_max(metric, n = 5, with_ties = FALSE) %>%
  ungroup()

# Test equivalence; note that slice_max() sorts rows by the metric
# while top_n() keeps the input order, so sort both before comparing
all.equal(
  legacy_result %>% arrange(category, desc(metric)),
  modern_result %>% arrange(category, desc(metric))
)
```
Run both versions in parallel initially, comparing results to ensure behavioral consistency before fully deprecating top_n() usage.
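One way to sketch that side-by-side comparison is with anti_join(), which surfaces any row present in one result but not the other (toy data and column names below follow the pattern above):

```r
library(dplyr)

# Toy data matching the legacy pattern's column names
data <- tibble(
  category = rep(c("x", "y"), each = 10),
  metric   = c(1:10, 11:20)
)

legacy_result <- data %>%
  group_by(category) %>% top_n(5, metric) %>% ungroup()

modern_result <- data %>%
  group_by(category) %>% slice_max(metric, n = 5, with_ties = TRUE) %>% ungroup()

# Rows in one result but not the other; both empty means behavior matches
only_legacy <- anti_join(legacy_result, modern_result,
                         by = c("category", "metric"))
only_modern <- anti_join(modern_result, legacy_result,
                         by = c("category", "metric"))
```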
## Performance Considerations for Large Datasets
When working with millions of rows, function choice impacts execution time significantly.
```r
# Large dataset simulation
large_data <- tibble(
  id = 1:1e6,
  category = sample(letters[1:10], 1e6, replace = TRUE),
  value = rnorm(1e6)
)

# Efficient: slice_max on grouped data
system.time({
  result1 <- large_data %>%
    group_by(category) %>%
    slice_max(value, n = 100, with_ties = FALSE) %>%
    ungroup()
})

# Less efficient: arrange the entire dataset first
system.time({
  result2 <- large_data %>%
    group_by(category) %>%
    arrange(desc(value)) %>%
    slice(1:100) %>%
    ungroup()
})
```
For datasets exceeding memory limits, consider the dtplyr or arrow backends, which optimize these operations for larger-than-RAM data processing. The `slice_max()` syntax carries over with minimal changes; you mostly just wrap the data in a lazy frame first.
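A minimal dtplyr sketch, assuming a recent dtplyr is installed (one that translates `slice_max()` to data.table code); the data here is smaller than above just to keep the example quick:

```r
library(dplyr)
library(dtplyr)

large_data <- tibble(
  id = 1:1e5,
  category = sample(letters[1:10], 1e5, replace = TRUE),
  value = rnorm(1e5)
)

# lazy_dt() converts the tibble to a lazy data.table; the same dplyr
# verbs then compile to data.table code instead of executing eagerly
top_per_category <- large_data %>%
  lazy_dt() %>%
  group_by(category) %>%
  slice_max(value, n = 100, with_ties = FALSE) %>%
  ungroup() %>%
  as_tibble()   # forces execution and collects the result

nrow(top_per_category)  # 10 categories x 100 rows = 1000
```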