R dplyr - arrange() - Sort Data Frame

The `arrange()` function from dplyr provides an intuitive interface for sorting data frames. Unlike base R's `order()`, it returns the entire data frame in sorted order rather than just indices.

Key Insights

  • arrange() sorts data frames by one or more columns in ascending or descending order, maintaining all rows while reordering them based on specified criteria
  • Multiple column sorting applies hierarchical ordering where subsequent columns break ties in previous columns, enabling complex multi-level sorting logic
  • Helper functions like desc(), across(), and na_last parameter provide fine-grained control over sort direction, multiple column operations, and NA value placement

Basic Single Column Sorting

The arrange() function from dplyr provides an intuitive interface for sorting data frames. Unlike base R’s order(), it returns the entire data frame in sorted order rather than just indices.

library(dplyr)

# Sample dataset
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(75000, 62000, 85000, 58000, 92000),
  department = c("IT", "HR", "IT", "HR", "Finance"),
  years = c(5, 3, 7, 2, 10)
)

# Sort by salary (ascending)
employees %>% arrange(salary)
     name salary department years
1   Diana  58000         HR     2
2     Bob  62000         HR     3
3   Alice  75000         IT     5
4 Charlie  85000         IT     7
5     Eve  92000    Finance    10

The default behavior is ascending order. For numeric columns, this means smallest to largest. For character columns, alphabetical A-Z. For dates, earliest to latest.

Descending Order with desc()

Use the desc() wrapper function to reverse sort order for any column type.

# Sort by salary descending
employees %>% arrange(desc(salary))

# Sort by name in reverse alphabetical order
employees %>% arrange(desc(name))
# desc(salary) output:
     name salary department years
1     Eve  92000    Finance    10
2 Charlie  85000         IT     7
3   Alice  75000         IT     5
4     Bob  62000         HR     3
5   Diana  58000         HR     2

You can also use the minus sign for numeric columns: arrange(-salary). However, desc() works universally across all data types and is more explicit in intent.

Multi-Column Sorting

Specify multiple columns to create hierarchical sorting. Each subsequent column breaks ties in the previous column.

# Sort by department, then by salary within each department
employees %>% arrange(department, salary)

# Mix ascending and descending
employees %>% arrange(department, desc(salary))
# department (asc), salary (desc):
     name salary department years
1     Eve  92000    Finance    10
2     Bob  62000         HR     3
3   Diana  58000         HR     2
4 Charlie  85000         IT     7
5   Alice  75000         IT     5

The order of columns matters significantly. arrange(salary, department) produces different results than arrange(department, salary).

Handling Missing Values

By default, arrange() places NA values at the end regardless of sort direction. Control this with .by_group and manual NA handling.

# Data with missing values
sales <- data.frame(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, NA, 1500, 800, NA),
  units = c(50, 30, NA, 40, 25)
)

# Default: NAs at end
sales %>% arrange(revenue)

# Force NAs first using is.na()
sales %>% arrange(!is.na(revenue), revenue)

# NAs last explicitly (default behavior shown)
sales %>% arrange(is.na(revenue), revenue)
# Default arrange(revenue):
  product revenue units
1       D     800    40
2       A    1000    50
3       C    1500    NA
4       B      NA    30
5       E      NA    25

For complete control, combine is.na() checks with your sort columns to position NAs precisely where needed.

Sorting Across Multiple Columns with across()

When sorting by many columns sharing characteristics, use across() to avoid repetition.

# Sample data with multiple numeric columns
metrics <- data.frame(
  region = c("North", "South", "East", "West"),
  q1_sales = c(100, 150, 120, 90),
  q2_sales = c(110, 140, 130, 95),
  q3_sales = c(120, 160, 125, 100),
  q4_sales = c(130, 155, 135, 105)
)

# Sort by all quarterly sales columns in descending order
metrics %>% arrange(across(starts_with("q"), desc))

# Sort all numeric columns ascending
metrics %>% arrange(across(where(is.numeric)))

This approach scales efficiently when dealing with datasets containing dozens of similar columns that require identical sorting treatment.

Sorting Grouped Data

Combine arrange() with group_by() to sort within groups. Note that arrange() ignores grouping by default unless you set .by_group = TRUE.

sales_data <- data.frame(
  region = rep(c("North", "South"), each = 3),
  month = rep(c("Jan", "Feb", "Mar"), 2),
  revenue = c(1000, 1200, 900, 1100, 950, 1300)
)

# Sort within groups
sales_data %>%
  group_by(region) %>%
  arrange(desc(revenue), .by_group = TRUE)

# Arrange ignoring groups (default)
sales_data %>%
  group_by(region) %>%
  arrange(desc(revenue))
# With .by_group = TRUE:
  region month revenue
  <chr>  <chr>   <dbl>
1 North  Feb      1200
2 North  Jan      1000
3 North  Mar       900
4 South  Mar      1300
5 South  Jan      1100
6 South  Feb       950

The .by_group = TRUE parameter ensures groups remain together and are sorted by the grouping variable first.

Sorting with Custom Expressions

arrange() accepts expressions and transformations directly in the function call.

# Sort by calculated values
employees %>% arrange(salary / years)  # Salary per year

# Sort by string transformations
employees %>% arrange(nchar(name))  # Sort by name length

# Sort by conditional logic
employees %>% arrange(ifelse(department == "IT", 0, 1), desc(salary))
# Sort by salary per year:
     name salary department years
1   Diana  58000         HR     2
2     Bob  62000         HR     3
3   Eve    92000    Finance    10
4 Charlie  85000         IT     7
5   Alice  75000         IT     5

This eliminates the need to create temporary columns for sorting purposes, keeping your pipeline clean and efficient.

Performance Considerations

For large datasets, arrange() performs efficiently but consider these optimizations:

# Use data.table for very large datasets (millions of rows)
library(data.table)
dt <- as.data.table(employees)
dt[order(salary)]  # Often faster than dplyr for large data

# Sort only necessary columns
employees %>%
  select(name, salary) %>%
  arrange(salary)

# Avoid repeated sorting in loops
# Sort once, then filter/subset
sorted_data <- employees %>% arrange(desc(salary))

The arrange() function uses efficient algorithms, but on datasets exceeding several million rows, data.table’s order() or base R’s order() with proper indexing may provide better performance.

Combining arrange() with Other dplyr Verbs

The real power emerges when chaining arrange() with filtering, selection, and mutation operations.

# Complex pipeline
employees %>%
  filter(years >= 3) %>%
  mutate(salary_grade = case_when(
    salary >= 80000 ~ "High",
    salary >= 65000 ~ "Medium",
    TRUE ~ "Low"
  )) %>%
  arrange(salary_grade, desc(salary)) %>%
  select(name, salary, salary_grade, years)
     name salary salary_grade years
1     Eve  92000         High    10
2 Charlie  85000         High     7
3   Alice  75000       Medium     5

This pattern—filter, transform, sort, select—represents a common analytical workflow where arrange() positions results for final presentation or further processing.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.