R - subset() Function with Examples

• The `subset()` function provides an intuitive way to filter rows and select columns from data frames using logical conditions without repetitive bracket notation or the `$` operator

Key Insights

• The subset() function provides an intuitive way to filter rows and select columns from data frames using logical conditions without repetitive bracket notation or the $ operator • Unlike bracket subsetting, subset() evaluates expressions within the data frame environment, allowing direct reference to column names and automatic handling of NA values in logical conditions • While convenient for interactive analysis, subset() uses non-standard evaluation which can cause unexpected behavior in functions and production code—making bracket notation or dplyr alternatives more suitable for programming contexts

Understanding subset() Syntax

The subset() function filters data frames and matrices based on specified conditions. Its basic syntax takes three primary arguments: the data object, a logical condition for row selection, and an optional vector of column names to retain.

# Basic syntax
subset(x, subset, select, drop = FALSE)

# Example dataset
employees <- data.frame(
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  department = c("Sales", "IT", "Sales", "HR", "IT"),
  salary = c(65000, 75000, 68000, 62000, 80000),
  years = c(3, 5, 2, 7, 4),
  active = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Filter rows where salary > 65000
subset(employees, salary > 65000)
   name department salary years active
2   Bob         IT  75000     5   TRUE
3 Carol      Sales  68000     2  FALSE
5   Eve         IT  80000     4   TRUE

The function returns a new data frame containing only rows meeting the specified condition, preserving the original data structure.

Row Filtering with Logical Conditions

subset() excels at filtering rows using complex logical expressions. You can combine multiple conditions using logical operators without repeatedly referencing the data frame name.

# Multiple conditions with AND
subset(employees, department == "IT" & salary > 70000)
  name department salary years active
2  Bob         IT  75000     5   TRUE
5  Eve         IT  80000     4   TRUE
# OR conditions
subset(employees, department == "Sales" | years > 5)
    name department salary years active
1  Alice      Sales  65000     3   TRUE
3  Carol      Sales  68000     2  FALSE
4  David         HR  62000     7   TRUE
# Negation
subset(employees, !active)
   name department salary years active
3 Carol      Sales  68000     2  FALSE
# Using %in% operator
subset(employees, department %in% c("IT", "HR"))
   name department salary years active
2   Bob         IT  75000     5   TRUE
4 David         HR  62000     7   TRUE
5   Eve         IT  80000     4   TRUE

The function automatically handles NA values in logical conditions by excluding rows where the condition evaluates to NA, unlike bracket subsetting which would include NA rows.

Column Selection

The select argument controls which columns appear in the output. You can specify columns by name, position, or using helper functions.

# Select specific columns
subset(employees, salary > 65000, select = c(name, department, salary))
   name department salary
2   Bob         IT  75000
3 Carol      Sales  68000
5   Eve         IT  80000
# Select columns by position
subset(employees, department == "IT", select = c(1, 2, 3))
  name department salary
2  Bob         IT  75000
5  Eve         IT  80000
# Exclude columns using negative selection
subset(employees, years > 3, select = -c(active, years))
   name department salary
2   Bob         IT  75000
4 David         HR  62000
5   Eve         IT  80000
# Select range of columns
subset(employees, active == TRUE, select = name:salary)
   name department salary
1 Alice      Sales  65000
2   Bob         IT  75000
4 David         HR  62000
5   Eve         IT  80000

Working with Missing Values

subset() provides cleaner handling of NA values compared to bracket notation, automatically excluding rows where conditions evaluate to NA.

# Dataset with missing values
sales <- data.frame(
  product = c("A", "B", "C", "D", "E"),
  revenue = c(1000, NA, 1500, 2000, NA),
  region = c("North", "South", NA, "East", "West")
)

# subset() excludes NA automatically
subset(sales, revenue > 1200)
  product revenue region
3       C    1500   <NA>
4       D    2000   East
# Equivalent bracket notation requires explicit NA handling
sales[!is.na(sales$revenue) & sales$revenue > 1200, ]

To explicitly include or filter NA values:

# Find rows with NA revenue
subset(sales, is.na(revenue))
  product revenue region
2       B      NA  South
5       E      NA   West
# Find rows with complete cases
subset(sales, !is.na(revenue) & !is.na(region))
  product revenue region
1       A    1000  North
4       D    2000   East

Practical Applications

Financial Data Analysis

# Stock portfolio analysis
portfolio <- data.frame(
  ticker = c("AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"),
  price = c(178.50, 142.30, 380.00, 155.20, 242.80),
  shares = c(100, 50, 75, 30, 40),
  sector = c("Tech", "Tech", "Tech", "Retail", "Auto")
)

portfolio$value <- portfolio$price * portfolio$shares

# Find high-value tech positions
subset(portfolio, sector == "Tech" & value > 10000, 
       select = c(ticker, value, shares))
  ticker  value shares
2 GOOGL 7115.0     50
3  MSFT 28500.0     75

Data Quality Checks

# Survey responses
survey <- data.frame(
  id = 1:6,
  age = c(25, 150, 35, NA, 42, 28),
  income = c(50000, 60000, NA, 75000, 80000, 55000),
  satisfied = c(TRUE, TRUE, FALSE, TRUE, NA, TRUE)
)

# Find invalid age entries
subset(survey, age > 120 | age < 18, select = c(id, age))
  id age
2  2 150
# Find complete responses
subset(survey, !is.na(age) & !is.na(income) & !is.na(satisfied))
  id age income satisfied
1  1  25  50000      TRUE
2  2 150  60000      TRUE
6  6  28  55000      TRUE

Time Series Filtering

# Temperature readings
temps <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", 
                   "2024-01-04", "2024-01-05")),
  temp_f = c(32, 28, 35, 42, 38),
  humidity = c(65, 70, 60, 55, 62)
)

# Freezing days with high humidity
subset(temps, temp_f <= 32 & humidity > 60)
        date temp_f humidity
1 2024-01-01     32       65

Limitations and Alternatives

subset() uses non-standard evaluation, which creates issues in programming contexts:

# Problematic in functions
filter_data <- function(df, threshold) {
  # This fails - subset looks for 'threshold' column
  subset(df, salary > threshold)
}

# Better approach with bracket notation
filter_data <- function(df, threshold) {
  df[df$salary > threshold & !is.na(df$salary), ]
}

# Or use dplyr
library(dplyr)
filter_data <- function(df, threshold) {
  filter(df, salary > threshold)
}

For production code, prefer explicit approaches:

# subset() - interactive use
result <- subset(employees, department == "IT", select = c(name, salary))

# Bracket notation - programming
result <- employees[employees$department == "IT" & 
                    !is.na(employees$department), 
                    c("name", "salary")]

# dplyr - modern approach
result <- employees %>%
  filter(department == "IT") %>%
  select(name, salary)

The subset() function remains valuable for exploratory data analysis and interactive sessions where its concise syntax accelerates workflow. For functions, packages, or automated scripts, use bracket notation or dplyr’s filter() and select() functions to ensure predictable behavior and avoid scoping issues.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.