R dplyr - Complete Tutorial with Examples
dplyr transforms data manipulation in R by providing a grammar of data manipulation. Instead of learning dozens of functions with inconsistent interfaces, you master five verbs that combine to solve...
Key Insights
- dplyr provides five core verbs (filter, select, mutate, arrange, summarize) that handle 90% of data manipulation tasks with a consistent, readable syntax
- The pipe operator (%>%) chains operations together, eliminating nested function calls and making data transformations read like sequential instructions
- Group operations with group_by() enable split-apply-combine workflows that are significantly faster than base R alternatives on large datasets
Understanding dplyr’s Core Philosophy
dplyr transforms data manipulation in R by providing a grammar of data manipulation. Instead of learning dozens of functions with inconsistent interfaces, you master five verbs that combine to solve complex problems. Each verb operates on data frames and returns data frames, making them composable and predictable.
library(dplyr)
# Sample dataset
employees <- data.frame(
id = 1:6,
name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"),
department = c("Sales", "IT", "Sales", "IT", "HR", "Sales"),
salary = c(65000, 75000, 68000, 82000, 71000, 63000),
years = c(3, 5, 2, 7, 4, 1)
)
Filtering Rows with filter()
The filter() verb selects rows based on conditions. Multiple conditions combine with logical operators, and all conditions must evaluate to TRUE for a row to be included.
# Single condition
high_earners <- filter(employees, salary > 70000)
# Multiple conditions (AND)
experienced_it <- filter(employees, department == "IT", years > 5)
# OR conditions
sales_or_high_salary <- filter(employees, department == "Sales" | salary > 75000)
# Complex conditions
mid_career <- filter(employees, years >= 2, years <= 5, salary < 80000)
# Using %in% for multiple values
target_depts <- filter(employees, department %in% c("IT", "HR"))
The filter() function automatically removes rows with NA values in the condition columns. To explicitly handle NAs:
# Keep rows where salary is NA or greater than 70000
filter(employees, is.na(salary) | salary > 70000)
Selecting Columns with select()
The select() verb chooses which columns to keep, drop, or reorder. It includes helper functions for pattern-based selection.
# Select specific columns
basic_info <- select(employees, name, department)
# Select range of columns
select(employees, name:salary)
# Exclude columns with minus
select(employees, -id, -years)
# Reorder columns
select(employees, department, name, everything())
# Helper functions
select(employees, starts_with("s")) # salary
select(employees, ends_with("ment")) # department
select(employees, contains("ar")) # years, salary
select(employees, matches("^[a-z]+$")) # lowercase column names
# Select by column type
select(employees, where(is.numeric))
select(employees, where(is.character))
Rename columns during selection:
select(employees, employee_name = name, dept = department)
# Rename without dropping other columns
rename(employees, employee_name = name)
Creating Columns with mutate()
The mutate() verb adds new columns or modifies existing ones. New columns can reference columns created in the same mutate() call.
# Add single column
employees_bonus <- mutate(employees,
bonus = salary * 0.10
)
# Multiple columns
employees_enhanced <- mutate(employees,
bonus = salary * 0.10,
total_comp = salary + bonus,
salary_per_year = salary / years
)
# Conditional mutations with case_when()
employees_level <- mutate(employees,
level = case_when(
years < 2 ~ "Junior",
years >= 2 & years < 5 ~ "Mid",
years >= 5 ~ "Senior"
)
)
# Using if_else for binary conditions
employees_status <- mutate(employees,
high_performer = if_else(salary > 70000, "Yes", "No")
)
Use mutate() to transform existing columns:
# Convert salary to thousands
mutate(employees, salary = salary / 1000)
# Only keep new columns with transmute()
transmute(employees,
name,
annual_salary = salary,
monthly_salary = salary / 12
)
Sorting with arrange()
The arrange() verb orders rows by column values. Default is ascending order.
# Single column ascending
arrange(employees, salary)
# Descending order
arrange(employees, desc(salary))
# Multiple columns
arrange(employees, department, desc(salary))
# Complex sorting
arrange(employees, desc(years), salary)
Aggregating with summarize()
The summarize() verb reduces multiple rows into summary statistics. It’s most powerful when combined with group_by().
# Overall summaries
summarize(employees,
avg_salary = mean(salary),
total_employees = n(),
max_years = max(years)
)
# Multiple statistics
summarize(employees,
salary_mean = mean(salary),
salary_sd = sd(salary),
salary_median = median(salary),
salary_min = min(salary),
salary_max = max(salary)
)
Grouping Operations with group_by()
Grouping transforms how other verbs operate. Instead of working on the entire dataset, they work on each group independently.
# Group and summarize
dept_summary <- employees %>%
group_by(department) %>%
summarize(
count = n(),
avg_salary = mean(salary),
total_years = sum(years)
)
# Multiple grouping variables
employees %>%
mutate(experience_level = if_else(years >= 5, "Senior", "Junior")) %>%
group_by(department, experience_level) %>%
summarize(
count = n(),
avg_salary = mean(salary),
.groups = "drop"
)
# Group and mutate (add group-level statistics)
employees %>%
group_by(department) %>%
mutate(
dept_avg_salary = mean(salary),
diff_from_dept_avg = salary - dept_avg_salary
)
# Group and filter (keep rows meeting group conditions)
employees %>%
group_by(department) %>%
filter(salary == max(salary)) # Highest paid per department
Always ungroup after grouped operations to avoid unexpected behavior:
employees %>%
group_by(department) %>%
summarize(avg_salary = mean(salary)) %>%
ungroup()
Chaining Operations with the Pipe
The pipe operator (%>%) passes the result of one operation as the first argument to the next function. This creates readable, left-to-right workflows.
# Without pipe (nested, hard to read)
arrange(
filter(
mutate(employees, bonus = salary * 0.10),
department == "Sales"
),
desc(bonus)
)
# With pipe (sequential, readable)
employees %>%
mutate(bonus = salary * 0.10) %>%
filter(department == "Sales") %>%
arrange(desc(bonus))
# Complex pipeline
analysis <- employees %>%
filter(years >= 2) %>%
mutate(
salary_category = case_when(
salary < 65000 ~ "Low",
salary < 75000 ~ "Medium",
TRUE ~ "High"
)
) %>%
group_by(department, salary_category) %>%
summarize(
count = n(),
avg_years = mean(years),
total_salary = sum(salary),
.groups = "drop"
) %>%
arrange(department, desc(total_salary))
Joining Datasets
dplyr provides SQL-style joins for combining datasets.
# Sample datasets
projects <- data.frame(
id = c(1, 2, 3, 5),
project = c("Alpha", "Beta", "Gamma", "Delta")
)
# Inner join (only matching rows)
inner_join(employees, projects, by = "id")
# Left join (all employees, matching projects)
left_join(employees, projects, by = "id")
# Right join (all projects, matching employees)
right_join(employees, projects, by = "id")
# Full join (all rows from both)
full_join(employees, projects, by = "id")
# Anti join (employees without projects)
anti_join(employees, projects, by = "id")
# Semi join (employees with projects, no project columns)
semi_join(employees, projects, by = "id")
Join on different column names:
# If columns have different names
other_data <- data.frame(
emp_id = c(1, 2, 3),
rating = c("A", "B", "A")
)
left_join(employees, other_data, by = c("id" = "emp_id"))
Advanced Patterns
Combine multiple operations for sophisticated analyses:
# Window functions with group_by
employees %>%
group_by(department) %>%
mutate(
rank = row_number(desc(salary)),
percentile = percent_rank(salary),
running_total = cumsum(salary)
) %>%
ungroup()
# Conditional aggregation
employees %>%
group_by(department) %>%
summarize(
total = n(),
high_earners = sum(salary > 70000),
pct_high_earners = mean(salary > 70000) * 100
)
# Pivot operations with tidyr
library(tidyr)
employees %>%
group_by(department) %>%
summarize(
avg_salary = mean(salary),
avg_years = mean(years)
) %>%
pivot_longer(
cols = starts_with("avg_"),
names_to = "metric",
values_to = "value"
)
dplyr’s consistent interface and composability make it the standard for data manipulation in R. Master these verbs and patterns, and you’ll handle any data transformation task efficiently.