R - subset() Function with Examples
• The `subset()` function provides an intuitive way to filter rows and select columns from data frames using logical conditions without repetitive bracket notation or the `$` operator
Key Insights
• The subset() function provides an intuitive way to filter rows and select columns from data frames using logical conditions without repetitive bracket notation or the $ operator
• Unlike bracket subsetting, subset() evaluates expressions within the data frame environment, allowing direct reference to column names and automatic handling of NA values in logical conditions
• While convenient for interactive analysis, subset() uses non-standard evaluation which can cause unexpected behavior in functions and production code—making bracket notation or dplyr alternatives more suitable for programming contexts
Understanding subset() Syntax
The subset() function filters data frames and matrices based on specified conditions. Its basic syntax takes three primary arguments: the data object, a logical condition for row selection, and an optional vector of column names to retain.
# Basic syntax
subset(x, subset, select, drop = FALSE)
# Example dataset
employees <- data.frame(
name = c("Alice", "Bob", "Carol", "David", "Eve"),
department = c("Sales", "IT", "Sales", "HR", "IT"),
salary = c(65000, 75000, 68000, 62000, 80000),
years = c(3, 5, 2, 7, 4),
active = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)
# Filter rows where salary > 65000
subset(employees, salary > 65000)
name department salary years active
2 Bob IT 75000 5 TRUE
3 Carol Sales 68000 2 FALSE
5 Eve IT 80000 4 TRUE
The function returns a new data frame containing only rows meeting the specified condition, preserving the original data structure.
Row Filtering with Logical Conditions
subset() excels at filtering rows using complex logical expressions. You can combine multiple conditions using logical operators without repeatedly referencing the data frame name.
# Multiple conditions with AND
subset(employees, department == "IT" & salary > 70000)
name department salary years active
2 Bob IT 75000 5 TRUE
5 Eve IT 80000 4 TRUE
# OR conditions
subset(employees, department == "Sales" | years > 5)
name department salary years active
1 Alice Sales 65000 3 TRUE
3 Carol Sales 68000 2 FALSE
4 David HR 62000 7 TRUE
# Negation
subset(employees, !active)
name department salary years active
3 Carol Sales 68000 2 FALSE
# Using %in% operator
subset(employees, department %in% c("IT", "HR"))
name department salary years active
2 Bob IT 75000 5 TRUE
4 David HR 62000 7 TRUE
5 Eve IT 80000 4 TRUE
The function automatically handles NA values in logical conditions by excluding rows where the condition evaluates to NA, unlike bracket subsetting which would include NA rows.
Column Selection
The select argument controls which columns appear in the output. You can specify columns by name, position, or using helper functions.
# Select specific columns
subset(employees, salary > 65000, select = c(name, department, salary))
name department salary
2 Bob IT 75000
3 Carol Sales 68000
5 Eve IT 80000
# Select columns by position
subset(employees, department == "IT", select = c(1, 2, 3))
name department salary
2 Bob IT 75000
5 Eve IT 80000
# Exclude columns using negative selection
subset(employees, years > 3, select = -c(active, years))
name department salary
2 Bob IT 75000
4 David HR 62000
5 Eve IT 80000
# Select range of columns
subset(employees, active == TRUE, select = name:salary)
name department salary
1 Alice Sales 65000
2 Bob IT 75000
4 David HR 62000
5 Eve IT 80000
Working with Missing Values
subset() provides cleaner handling of NA values compared to bracket notation, automatically excluding rows where conditions evaluate to NA.
# Dataset with missing values
sales <- data.frame(
product = c("A", "B", "C", "D", "E"),
revenue = c(1000, NA, 1500, 2000, NA),
region = c("North", "South", NA, "East", "West")
)
# subset() excludes NA automatically
subset(sales, revenue > 1200)
product revenue region
3 C 1500 <NA>
4 D 2000 East
# Equivalent bracket notation requires explicit NA handling
sales[!is.na(sales$revenue) & sales$revenue > 1200, ]
To explicitly include or filter NA values:
# Find rows with NA revenue
subset(sales, is.na(revenue))
product revenue region
2 B NA South
5 E NA West
# Find rows with complete cases
subset(sales, !is.na(revenue) & !is.na(region))
product revenue region
1 A 1000 North
4 D 2000 East
Practical Applications
Financial Data Analysis
# Stock portfolio analysis
portfolio <- data.frame(
ticker = c("AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"),
price = c(178.50, 142.30, 380.00, 155.20, 242.80),
shares = c(100, 50, 75, 30, 40),
sector = c("Tech", "Tech", "Tech", "Retail", "Auto")
)
portfolio$value <- portfolio$price * portfolio$shares
# Find high-value tech positions
subset(portfolio, sector == "Tech" & value > 10000,
select = c(ticker, value, shares))
ticker value shares
2 GOOGL 7115.0 50
3 MSFT 28500.0 75
Data Quality Checks
# Survey responses
survey <- data.frame(
id = 1:6,
age = c(25, 150, 35, NA, 42, 28),
income = c(50000, 60000, NA, 75000, 80000, 55000),
satisfied = c(TRUE, TRUE, FALSE, TRUE, NA, TRUE)
)
# Find invalid age entries
subset(survey, age > 120 | age < 18, select = c(id, age))
id age
2 2 150
# Find complete responses
subset(survey, !is.na(age) & !is.na(income) & !is.na(satisfied))
id age income satisfied
1 1 25 50000 TRUE
2 2 150 60000 TRUE
6 6 28 55000 TRUE
Time Series Filtering
# Temperature readings
temps <- data.frame(
date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
"2024-01-04", "2024-01-05")),
temp_f = c(32, 28, 35, 42, 38),
humidity = c(65, 70, 60, 55, 62)
)
# Freezing days with high humidity
subset(temps, temp_f <= 32 & humidity > 60)
date temp_f humidity
1 2024-01-01 32 65
Limitations and Alternatives
subset() uses non-standard evaluation, which creates issues in programming contexts:
# Problematic in functions
filter_data <- function(df, threshold) {
# This fails - subset looks for 'threshold' column
subset(df, salary > threshold)
}
# Better approach with bracket notation
filter_data <- function(df, threshold) {
df[df$salary > threshold & !is.na(df$salary), ]
}
# Or use dplyr
library(dplyr)
filter_data <- function(df, threshold) {
filter(df, salary > threshold)
}
For production code, prefer explicit approaches:
# subset() - interactive use
result <- subset(employees, department == "IT", select = c(name, salary))
# Bracket notation - programming
result <- employees[employees$department == "IT" &
!is.na(employees$department),
c("name", "salary")]
# dplyr - modern approach
result <- employees %>%
filter(department == "IT") %>%
select(name, salary)
The subset() function remains valuable for exploratory data analysis and interactive sessions where its concise syntax accelerates workflow. For functions, packages, or automated scripts, use bracket notation or dplyr’s filter() and select() functions to ensure predictable behavior and avoid scoping issues.