R dplyr - arrange() - Sort Data Frame
The `arrange()` function from dplyr provides an intuitive interface for sorting data frames. Unlike base R's `order()`, it returns the entire data frame in sorted order rather than just indices.
Key Insights
arrange()sorts data frames by one or more columns in ascending or descending order, maintaining all rows while reordering them based on specified criteria- Multiple column sorting applies hierarchical ordering where subsequent columns break ties in previous columns, enabling complex multi-level sorting logic
- Helper functions like
desc(),across(), andna_lastparameter provide fine-grained control over sort direction, multiple column operations, and NA value placement
Basic Single Column Sorting
The arrange() function from dplyr provides an intuitive interface for sorting data frames. Unlike base R’s order(), it returns the entire data frame in sorted order rather than just indices.
library(dplyr)
# Sample dataset
employees <- data.frame(
name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
salary = c(75000, 62000, 85000, 58000, 92000),
department = c("IT", "HR", "IT", "HR", "Finance"),
years = c(5, 3, 7, 2, 10)
)
# Sort by salary (ascending)
employees %>% arrange(salary)
name salary department years
1 Diana 58000 HR 2
2 Bob 62000 HR 3
3 Alice 75000 IT 5
4 Charlie 85000 IT 7
5 Eve 92000 Finance 10
The default behavior is ascending order. For numeric columns, this means smallest to largest. For character columns, alphabetical A-Z. For dates, earliest to latest.
Descending Order with desc()
Use the desc() wrapper function to reverse sort order for any column type.
# Sort by salary descending
employees %>% arrange(desc(salary))
# Sort by name in reverse alphabetical order
employees %>% arrange(desc(name))
# desc(salary) output:
name salary department years
1 Eve 92000 Finance 10
2 Charlie 85000 IT 7
3 Alice 75000 IT 5
4 Bob 62000 HR 3
5 Diana 58000 HR 2
You can also use the minus sign for numeric columns: arrange(-salary). However, desc() works universally across all data types and is more explicit in intent.
Multi-Column Sorting
Specify multiple columns to create hierarchical sorting. Each subsequent column breaks ties in the previous column.
# Sort by department, then by salary within each department
employees %>% arrange(department, salary)
# Mix ascending and descending
employees %>% arrange(department, desc(salary))
# department (asc), salary (desc):
name salary department years
1 Eve 92000 Finance 10
2 Bob 62000 HR 3
3 Diana 58000 HR 2
4 Charlie 85000 IT 7
5 Alice 75000 IT 5
The order of columns matters significantly. arrange(salary, department) produces different results than arrange(department, salary).
Handling Missing Values
By default, arrange() places NA values at the end regardless of sort direction. Control this with .by_group and manual NA handling.
# Data with missing values
sales <- data.frame(
product = c("A", "B", "C", "D", "E"),
revenue = c(1000, NA, 1500, 800, NA),
units = c(50, 30, NA, 40, 25)
)
# Default: NAs at end
sales %>% arrange(revenue)
# Force NAs first using is.na()
sales %>% arrange(!is.na(revenue), revenue)
# NAs last explicitly (default behavior shown)
sales %>% arrange(is.na(revenue), revenue)
# Default arrange(revenue):
product revenue units
1 D 800 40
2 A 1000 50
3 C 1500 NA
4 B NA 30
5 E NA 25
For complete control, combine is.na() checks with your sort columns to position NAs precisely where needed.
Sorting Across Multiple Columns with across()
When sorting by many columns sharing characteristics, use across() to avoid repetition.
# Sample data with multiple numeric columns
metrics <- data.frame(
region = c("North", "South", "East", "West"),
q1_sales = c(100, 150, 120, 90),
q2_sales = c(110, 140, 130, 95),
q3_sales = c(120, 160, 125, 100),
q4_sales = c(130, 155, 135, 105)
)
# Sort by all quarterly sales columns in descending order
metrics %>% arrange(across(starts_with("q"), desc))
# Sort all numeric columns ascending
metrics %>% arrange(across(where(is.numeric)))
This approach scales efficiently when dealing with datasets containing dozens of similar columns that require identical sorting treatment.
Sorting Grouped Data
Combine arrange() with group_by() to sort within groups. Note that arrange() ignores grouping by default unless you set .by_group = TRUE.
sales_data <- data.frame(
region = rep(c("North", "South"), each = 3),
month = rep(c("Jan", "Feb", "Mar"), 2),
revenue = c(1000, 1200, 900, 1100, 950, 1300)
)
# Sort within groups
sales_data %>%
group_by(region) %>%
arrange(desc(revenue), .by_group = TRUE)
# Arrange ignoring groups (default)
sales_data %>%
group_by(region) %>%
arrange(desc(revenue))
# With .by_group = TRUE:
region month revenue
<chr> <chr> <dbl>
1 North Feb 1200
2 North Jan 1000
3 North Mar 900
4 South Mar 1300
5 South Jan 1100
6 South Feb 950
The .by_group = TRUE parameter ensures groups remain together and are sorted by the grouping variable first.
Sorting with Custom Expressions
arrange() accepts expressions and transformations directly in the function call.
# Sort by calculated values
employees %>% arrange(salary / years) # Salary per year
# Sort by string transformations
employees %>% arrange(nchar(name)) # Sort by name length
# Sort by conditional logic
employees %>% arrange(ifelse(department == "IT", 0, 1), desc(salary))
# Sort by salary per year:
name salary department years
1 Diana 58000 HR 2
2 Bob 62000 HR 3
3 Eve 92000 Finance 10
4 Charlie 85000 IT 7
5 Alice 75000 IT 5
This eliminates the need to create temporary columns for sorting purposes, keeping your pipeline clean and efficient.
Performance Considerations
For large datasets, arrange() performs efficiently but consider these optimizations:
# Use data.table for very large datasets (millions of rows)
library(data.table)
dt <- as.data.table(employees)
dt[order(salary)] # Often faster than dplyr for large data
# Sort only necessary columns
employees %>%
select(name, salary) %>%
arrange(salary)
# Avoid repeated sorting in loops
# Sort once, then filter/subset
sorted_data <- employees %>% arrange(desc(salary))
The arrange() function uses efficient algorithms, but on datasets exceeding several million rows, data.table’s order() or base R’s order() with proper indexing may provide better performance.
Combining arrange() with Other dplyr Verbs
The real power emerges when chaining arrange() with filtering, selection, and mutation operations.
# Complex pipeline
employees %>%
filter(years >= 3) %>%
mutate(salary_grade = case_when(
salary >= 80000 ~ "High",
salary >= 65000 ~ "Medium",
TRUE ~ "Low"
)) %>%
arrange(salary_grade, desc(salary)) %>%
select(name, salary, salary_grade, years)
name salary salary_grade years
1 Eve 92000 High 10
2 Charlie 85000 High 7
3 Alice 75000 Medium 5
This pattern—filter, transform, sort, select—represents a common analytical workflow where arrange() positions results for final presentation or further processing.