R dplyr - distinct() - Remove Duplicates
The `distinct()` function from dplyr identifies and removes duplicate rows from data frames. Unlike base R's `unique()`, it works naturally with tibbles and integrates into pipe-based workflows.
Key Insights
distinct()removes duplicate rows based on all columns by default, or specific columns when specified, making it essential for data deduplication tasks- The
.keep_all = TRUEparameter preserves all columns when filtering by specific variables, while default behavior only returns the distinct columns distinct()maintains the first occurrence of each unique combination and integrates seamlessly with pipe workflows for multi-step data transformations
Understanding distinct() Basics
The distinct() function from dplyr identifies and removes duplicate rows from data frames. Unlike base R’s unique(), it works naturally with tibbles and integrates into pipe-based workflows.
library(dplyr)
# Sample dataset with duplicates
customers <- data.frame(
id = c(1, 2, 2, 3, 4, 4, 4),
name = c("Alice", "Bob", "Bob", "Charlie", "David", "David", "David"),
city = c("NYC", "LA", "LA", "Chicago", "Boston", "Boston", "Boston"),
purchase_amount = c(100, 200, 200, 150, 300, 400, 500)
)
# Remove completely duplicate rows
customers %>% distinct()
id name city purchase_amount
1 1 Alice NYC 100
2 2 Bob LA 200
3 3 Charlie Chicago 150
4 4 David Boston 300
5 4 David Boston 400
6 4 David Boston 500
The function keeps the first occurrence when rows are identical across all columns. Rows 5-7 remain because purchase_amount differs.
Removing Duplicates by Specific Columns
Specify columns to define uniqueness criteria. This removes rows with duplicate combinations of those columns only.
# Keep unique customer IDs only
customers %>% distinct(id)
id
1 1
2 2
3 3
4 4
# Unique combinations of name and city
customers %>% distinct(name, city)
name city
1 Alice NYC
2 Bob LA
3 Charlie Chicago
4 David Boston
By default, distinct() returns only the specified columns. To retain all columns while filtering by specific ones, use .keep_all = TRUE.
# Keep all columns, filter by id
customers %>% distinct(id, .keep_all = TRUE)
id name city purchase_amount
1 1 Alice NYC 100
2 2 Bob LA 200
3 3 Charlie Chicago 150
4 4 David Boston 300
This preserves the first occurrence’s complete row data.
Practical Deduplication Scenarios
Time-Series Data Deduplication
When working with sensor data or logs, remove duplicate timestamps while keeping the most recent or relevant reading.
sensor_data <- data.frame(
sensor_id = c("A1", "A1", "A1", "B2", "B2"),
timestamp = as.POSIXct(c("2024-01-01 10:00:00", "2024-01-01 10:00:00",
"2024-01-01 11:00:00", "2024-01-01 10:00:00",
"2024-01-01 10:00:00")),
temperature = c(22.5, 22.5, 23.1, 19.8, 19.9),
humidity = c(45, 45, 47, 52, 53)
)
# Remove exact duplicates
sensor_data %>%
distinct()
sensor_id timestamp temperature humidity
1 A1 2024-01-01 10:00:00 22.5 45
2 A1 2024-01-01 11:00:00 23.1 47
3 B2 2024-01-01 10:00:00 19.8 52
4 B2 2024-01-01 10:00:00 19.9 53
For duplicate timestamps per sensor, keep only one reading:
# Keep first reading per sensor per timestamp
sensor_data %>%
distinct(sensor_id, timestamp, .keep_all = TRUE)
sensor_id timestamp temperature humidity
1 A1 2024-01-01 10:00:00 22.5 45
2 A1 2024-01-01 11:00:00 23.1 47
3 B2 2024-01-01 10:00:00 19.8 52
Database Query Result Cleaning
API responses or database joins often produce duplicates. Clean them before analysis.
orders <- data.frame(
order_id = c(1001, 1001, 1002, 1003, 1003, 1003),
customer_id = c(50, 50, 51, 52, 52, 52),
product = c("Laptop", "Laptop", "Mouse", "Keyboard", "Keyboard", "Monitor"),
quantity = c(1, 1, 2, 1, 1, 1)
)
# Remove duplicate order-product combinations
orders %>%
distinct(order_id, product, .keep_all = TRUE)
order_id customer_id product quantity
1 1001 50 Laptop 1
2 1002 51 Mouse 2
3 1003 52 Keyboard 1
4 1003 52 Monitor 1
This handles cases where joins create duplicate rows for the same logical entity.
Combining distinct() with Other dplyr Verbs
Chain distinct() with filtering, grouping, and summarization for complex deduplication logic.
transactions <- data.frame(
user_id = c(1, 1, 2, 2, 3, 3, 3),
date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02",
"2024-01-03", "2024-01-01", "2024-01-01", "2024-01-02")),
amount = c(50, 50, 100, 150, 75, 75, 200),
status = c("completed", "completed", "completed", "pending",
"completed", "completed", "completed")
)
# Get unique users with completed transactions
transactions %>%
filter(status == "completed") %>%
distinct(user_id, .keep_all = TRUE)
user_id date amount status
1 1 2024-01-01 50 completed
2 2 2024-01-02 100 completed
3 3 2024-01-01 75 completed
For more control over which duplicate to keep, combine with arrange():
# Keep the highest transaction per user
transactions %>%
arrange(user_id, desc(amount)) %>%
distinct(user_id, .keep_all = TRUE)
user_id date amount status
1 1 2024-01-01 50 completed
2 2 2024-01-03 150 pending
3 3 2024-01-02 200 completed
Performance Considerations
distinct() uses hash-based algorithms internally, making it efficient for large datasets. However, column selection impacts performance.
# Faster: fewer columns to hash
large_data %>% distinct(key_column, .keep_all = TRUE)
# Slower: all columns must be hashed
large_data %>% distinct()
For data.table users, distinct() converts to tibble internally. Use data.table’s native unique() for maximum performance on very large datasets:
library(data.table)
dt <- as.data.table(customers)
unique(dt, by = c("id")) # data.table approach
Handling NA Values
distinct() treats NA values as distinct entities by default, which differs from some SQL implementations.
data_with_na <- data.frame(
id = c(1, 2, 2, 3, NA, NA),
value = c(10, 20, 20, 30, 40, 50)
)
data_with_na %>% distinct(id, .keep_all = TRUE)
id value
1 1 10
2 2 20
3 3 30
4 NA 40
5 NA 50
Both NA rows remain because distinct() considers each NA unique. To treat all NA values as a single group, replace them before deduplication:
data_with_na %>%
mutate(id = replace_na(id, -999)) %>%
distinct(id, .keep_all = TRUE)
id value
1 1 10
2 2 20
3 3 30
4 -999 40
Return Value and Side Effects
distinct() returns a tibble or data frame with the same class as the input. It never modifies the original object.
original <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
result <- original %>% distinct()
# Original unchanged
nrow(original) # 3
nrow(result) # 2
For in-place modification, reassign:
customers <- customers %>% distinct(id, .keep_all = TRUE)
The function integrates with grouped data frames, applying deduplication within each group:
sales <- data.frame(
region = c("North", "North", "South", "South"),
product = c("A", "A", "A", "A"),
quarter = c("Q1", "Q1", "Q1", "Q2")
)
sales %>%
group_by(region) %>%
distinct(product, .keep_all = TRUE)
region product quarter
<chr> <chr> <chr>
1 North A Q1
2 South A Q1
This removes duplicates per group rather than across the entire dataset, providing granular control over deduplication scope.