R dplyr - distinct() - Remove Duplicates

Key Insights

distinct() removes duplicate rows based on all columns by default, or specific columns when specified, making it essential for data deduplication tasks
The .keep_all = TRUE parameter preserves all columns when filtering by specific variables, while default behavior only returns the distinct columns
distinct() maintains the first occurrence of each unique combination and integrates seamlessly with pipe workflows for multi-step data transformations

Understanding distinct() Basics

The distinct() function from dplyr identifies and removes duplicate rows from data frames. Unlike base R’s unique(), it works naturally with tibbles and integrates into pipe-based workflows.

library(dplyr)

# Sample dataset with duplicates
customers <- data.frame(
  id = c(1, 2, 2, 3, 4, 4, 4),
  name = c("Alice", "Bob", "Bob", "Charlie", "David", "David", "David"),
  city = c("NYC", "LA", "LA", "Chicago", "Boston", "Boston", "Boston"),
  purchase_amount = c(100, 200, 200, 150, 300, 400, 500)
)

# Remove completely duplicate rows
customers %>% distinct()

  id    name     city purchase_amount
1  1   Alice      NYC             100
2  2     Bob       LA             200
3  3 Charlie  Chicago             150
4  4   David   Boston             300
5  4   David   Boston             400
6  4   David   Boston             500

The function keeps the first occurrence when rows are identical across all columns. Rows 5-7 remain because purchase_amount differs.

Removing Duplicates by Specific Columns

Specify columns to define uniqueness criteria. This removes rows with duplicate combinations of those columns only.

# Keep unique customer IDs only
customers %>% distinct(id)

# Unique combinations of name and city
customers %>% distinct(name, city)

     name     city
1   Alice      NYC
2     Bob       LA
3 Charlie  Chicago
4   David   Boston

By default, distinct() returns only the specified columns. To retain all columns while filtering by specific ones, use .keep_all = TRUE.

# Keep all columns, filter by id
customers %>% distinct(id, .keep_all = TRUE)

  id    name     city purchase_amount
1  1   Alice      NYC             100
2  2     Bob       LA             200
3  3 Charlie  Chicago             150
4  4   David   Boston             300

This preserves the first occurrence’s complete row data.

Practical Deduplication Scenarios

Time-Series Data Deduplication

When working with sensor data or logs, remove duplicate timestamps while keeping the most recent or relevant reading.

sensor_data <- data.frame(
  sensor_id = c("A1", "A1", "A1", "B2", "B2"),
  timestamp = as.POSIXct(c("2024-01-01 10:00:00", "2024-01-01 10:00:00", 
                            "2024-01-01 11:00:00", "2024-01-01 10:00:00", 
                            "2024-01-01 10:00:00")),
  temperature = c(22.5, 22.5, 23.1, 19.8, 19.9),
  humidity = c(45, 45, 47, 52, 53)
)

# Remove exact duplicates
sensor_data %>% 
  distinct()

  sensor_id           timestamp temperature humidity
1        A1 2024-01-01 10:00:00        22.5       45
2        A1 2024-01-01 11:00:00        23.1       47
3        B2 2024-01-01 10:00:00        19.8       52
4        B2 2024-01-01 10:00:00        19.9       53

For duplicate timestamps per sensor, keep only one reading:

# Keep first reading per sensor per timestamp
sensor_data %>% 
  distinct(sensor_id, timestamp, .keep_all = TRUE)

  sensor_id           timestamp temperature humidity
1        A1 2024-01-01 10:00:00        22.5       45
2        A1 2024-01-01 11:00:00        23.1       47
3        B2 2024-01-01 10:00:00        19.8       52

Database Query Result Cleaning

API responses or database joins often produce duplicates. Clean them before analysis.

orders <- data.frame(
  order_id = c(1001, 1001, 1002, 1003, 1003, 1003),
  customer_id = c(50, 50, 51, 52, 52, 52),
  product = c("Laptop", "Laptop", "Mouse", "Keyboard", "Keyboard", "Monitor"),
  quantity = c(1, 1, 2, 1, 1, 1)
)

# Remove duplicate order-product combinations
orders %>% 
  distinct(order_id, product, .keep_all = TRUE)

  order_id customer_id  product quantity
1     1001          50   Laptop        1
2     1002          51    Mouse        2
3     1003          52 Keyboard        1
4     1003          52  Monitor        1

This handles cases where joins create duplicate rows for the same logical entity.

Combining distinct() with Other dplyr Verbs

Chain distinct() with filtering, grouping, and summarization for complex deduplication logic.

transactions <- data.frame(
  user_id = c(1, 1, 2, 2, 3, 3, 3),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02", 
                    "2024-01-03", "2024-01-01", "2024-01-01", "2024-01-02")),
  amount = c(50, 50, 100, 150, 75, 75, 200),
  status = c("completed", "completed", "completed", "pending", 
             "completed", "completed", "completed")
)

# Get unique users with completed transactions
transactions %>% 
  filter(status == "completed") %>% 
  distinct(user_id, .keep_all = TRUE)

  user_id       date amount    status
1       1 2024-01-01     50 completed
2       2 2024-01-02    100 completed
3       3 2024-01-01     75 completed

For more control over which duplicate to keep, combine with arrange():

# Keep the highest transaction per user
transactions %>% 
  arrange(user_id, desc(amount)) %>% 
  distinct(user_id, .keep_all = TRUE)

  user_id       date amount    status
1       1 2024-01-01     50 completed
2       2 2024-01-03    150   pending
3       3 2024-01-02    200 completed

Performance Considerations

distinct() uses hash-based algorithms internally, making it efficient for large datasets. However, column selection impacts performance.

# Faster: fewer columns to hash
large_data %>% distinct(key_column, .keep_all = TRUE)

# Slower: all columns must be hashed
large_data %>% distinct()

For data.table users, distinct() converts to tibble internally. Use data.table’s native unique() for maximum performance on very large datasets:

library(data.table)

dt <- as.data.table(customers)
unique(dt, by = c("id"))  # data.table approach

Handling NA Values

distinct() treats NA values as distinct entities by default, which differs from some SQL implementations.

data_with_na <- data.frame(
  id = c(1, 2, 2, 3, NA, NA),
  value = c(10, 20, 20, 30, 40, 50)
)

data_with_na %>% distinct(id, .keep_all = TRUE)

Both NA rows remain because distinct() considers each NA unique. To treat all NA values as a single group, replace them before deduplication:

data_with_na %>% 
  mutate(id = replace_na(id, -999)) %>% 
  distinct(id, .keep_all = TRUE)

    id value
1    1    10
2    2    20
3    3    30
4 -999    40

Return Value and Side Effects

distinct() returns a tibble or data frame with the same class as the input. It never modifies the original object.

original <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
result <- original %>% distinct()

# Original unchanged
nrow(original)  # 3
nrow(result)    # 2

For in-place modification, reassign:

customers <- customers %>% distinct(id, .keep_all = TRUE)

The function integrates with grouped data frames, applying deduplication within each group:

sales <- data.frame(
  region = c("North", "North", "South", "South"),
  product = c("A", "A", "A", "A"),
  quarter = c("Q1", "Q1", "Q1", "Q2")
)

sales %>% 
  group_by(region) %>% 
  distinct(product, .keep_all = TRUE)

  region product quarter
  <chr>  <chr>   <chr>  
1 North  A       Q1     
2 South  A       Q1

This removes duplicates per group rather than across the entire dataset, providing granular control over deduplication scope.