R - which() Function with Examples

Key Insights

• The which() function returns integer positions of TRUE values in logical vectors, enabling precise element selection and manipulation in R data structures
• Unlike subset operators that return values, which() returns indices, making it essential for conditional operations, data frame row selection, and matrix element location
• Understanding which() with its parameters arr.ind and useNames unlocks advanced filtering patterns and multi-dimensional array operations

Understanding which() Fundamentals

The which() function identifies the positions where a logical condition evaluates to TRUE. While subsetting with brackets returns the actual values, which() returns their indices.

# Basic vector example
numbers <- c(10, 25, 30, 15, 40, 5)
numbers > 20
# [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE

which(numbers > 20)
# [1] 2 3 5

This distinction matters when you need to know where elements are located, not just what they are. which() also treats NA as FALSE, so positions where the comparison yields NA never appear in the result, whereas logical subsetting returns an NA for each of them.

values <- c(5, NA, 15, 20, NA, 30)
which(values > 10)
# [1] 3 4 6

# Compare with direct subsetting: each NA comparison yields an NA element
values[values > 10]
# [1] NA 15 20 NA 30
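
The useNames parameter mentioned in the key insights controls whether names are carried over to the result. A minimal sketch, assuming a hypothetical named vector called scores that is not part of the original examples:

```r
# Hypothetical named vector, used only for illustration
scores <- c(alice = 72, bob = 48, carol = 91, dave = 55)

# Default (useNames = TRUE): positions carry the element names
passed <- which(scores >= 60)
passed
# alice carol 
#     1     3 

# useNames = FALSE: plain, unnamed integer positions
passed_plain <- which(scores >= 60, useNames = FALSE)
passed_plain
# [1] 1 3

# The names are handy for reporting which elements matched
names(passed)
# [1] "alice" "carol"
```

Dropping the names is mostly useful when the result feeds into code that compares or prints index vectors and the names would be noise.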

Finding Positions in Data Frames

Data frame filtering with which() provides row indices that meet specific criteria, enabling targeted operations on subsets.

employees <- data.frame(
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  salary = c(50000, 75000, 60000, 90000, 55000),
  department = c("Sales", "IT", "Sales", "IT", "HR"),
  years = c(2, 5, 3, 7, 1)
)

# Find row indices for IT department
it_indices <- which(employees$department == "IT")
it_indices
# [1] 2 4

# Use indices for targeted updates
employees$salary[it_indices] <- employees$salary[it_indices] * 1.1
employees[it_indices, ]
#    name salary department years
# 2   Bob  82500         IT     5
# 4 David  99000         IT     7

To combine multiple conditions, join them with the vectorized operators & and | inside which():

# High earners with experience
senior_high_earners <- which(employees$salary > 70000 & employees$years >= 5)
employees[senior_high_earners, ]
#    name salary department years
# 2   Bob  82500         IT     5
# 4 David  99000         IT     7
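
The NA behavior matters for data frames too. The sketch below uses a hypothetical staff table (not the employees data above) with one missing salary to show how logical row selection and which() can differ:

```r
# Hypothetical data with one missing salary, for illustration only
staff <- data.frame(
  name = c("Ann", "Ben", "Cam"),
  salary = c(80000, NA, 65000)
)

# Logical indexing turns the NA comparison into a row filled with NAs
by_logical <- staff[staff$salary > 70000, ]
nrow(by_logical)
# [1] 2

# which() drops the NA comparison, leaving only the true match
by_which <- staff[which(staff$salary > 70000), ]
nrow(by_which)
# [1] 1
```

If a column may contain missing values, wrapping the condition in which() is one way to keep phantom NA rows out of the result.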

Matrix Operations with arr.ind

The arr.ind parameter transforms which() for multi-dimensional structures, returning row-column pairs instead of linear indices.

matrix_data <- matrix(c(3, 7, 2, 8, 1, 9, 4, 6, 5), nrow = 3, ncol = 3)
matrix_data
#      [,1] [,2] [,3]
# [1,]    3    8    4
# [2,]    7    1    6
# [3,]    2    9    5

# Default behavior: linear indices
which(matrix_data > 6)
# [1] 2 4 6

# With arr.ind = TRUE: row-column coordinates
which(matrix_data > 6, arr.ind = TRUE)
#      row col
# [1,]   2   1
# [2,]   1   2
# [3,]   3   2

This becomes critical when you need to identify and manipulate specific matrix positions:

# Find positions of values between 4 and 8 (inclusive)
positions <- which(matrix_data >= 4 & matrix_data <= 8, arr.ind = TRUE)
positions
#      row col
# [1,]   2   1
# [2,]   1   2
# [3,]   1   3
# [4,]   2   3
# [5,]   3   3

# Replace those values; seq_len() skips the loop entirely when nothing matches
for(i in seq_len(nrow(positions))) {
  matrix_data[positions[i, "row"], positions[i, "col"]] <- 0
}
matrix_data
#      [,1] [,2] [,3]
# [1,]    3    0    0
# [2,]    0    1    0
# [3,]    2    9    0
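
The same replacement can be done without a loop, because R accepts a two-column matrix of row and column numbers as an index. A minimal sketch, rebuilding the matrix under the assumed names m and idx so it runs on its own:

```r
# Rebuild the example matrix so this sketch is self-contained
m <- matrix(c(3, 7, 2, 8, 1, 9, 4, 6, 5), nrow = 3, ncol = 3)

# Two-column (row, col) matrix of matching positions
idx <- which(m >= 4 & m <= 8, arr.ind = TRUE)

# Indexing with a two-column matrix addresses one cell per row of idx
matched <- m[idx]
matched
# [1] 7 8 4 6 5

# Vectorized replacement of exactly those cells
m[idx] <- 0
m
#      [,1] [,2] [,3]
# [1,]    3    0    0
# [2,]    0    1    0
# [3,]    2    9    0
```

This form also behaves sensibly when nothing matches: an index matrix with zero rows selects and replaces nothing.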

Practical Pattern: Finding Duplicates

The which() function excels at identifying duplicate positions, especially when you need indices rather than logical vectors.

customer_ids <- c(101, 102, 103, 102, 104, 103, 105, 101)

# Find all duplicate positions
duplicated_positions <- which(duplicated(customer_ids) | duplicated(customer_ids, fromLast = TRUE))
duplicated_positions
# [1] 1 2 3 4 6 8

# Get unique duplicated values
unique(customer_ids[duplicated_positions])
# [1] 101 102 103

# Find first occurrence of each duplicate
first_occurrences <- which(!duplicated(customer_ids) & customer_ids %in% customer_ids[duplicated(customer_ids)])
first_occurrences
# [1] 1 2 3
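
Once the duplicate positions are known, they can be grouped by value to see where each repeated ID occurs. This is a small sketch using base R's split(); the names dup_pos and positions_by_id are chosen here for illustration:

```r
customer_ids <- c(101, 102, 103, 102, 104, 103, 105, 101)

# Positions of every element that appears more than once
dup_pos <- which(duplicated(customer_ids) | duplicated(customer_ids, fromLast = TRUE))

# Group those positions by the ID found at each one
positions_by_id <- split(dup_pos, customer_ids[dup_pos])
positions_by_id
# $`101`
# [1] 1 8
#
# $`102`
# [1] 2 4
#
# $`103`
# [1] 3 6
```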

Handling Missing Data

The which() function automatically excludes NA values, but you can explicitly target them using is.na():

survey_responses <- c(5, NA, 3, 4, NA, 2, 5, NA, 4)

# Find positions of missing values
missing_indices <- which(is.na(survey_responses))
missing_indices
# [1] 2 5 8

# Find positions of complete cases
complete_indices <- which(!is.na(survey_responses))
complete_indices
# [1] 1 3 4 6 7 9

# Impute with median (using complete cases)
median_value <- median(survey_responses, na.rm = TRUE)
survey_responses[missing_indices] <- median_value
survey_responses
# [1] 5 4 3 4 4 2 5 4 4
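
The same idea extends to tables: is.na() on a data frame returns a logical matrix, so arr.ind = TRUE can report the row and column of each missing cell. A sketch with a hypothetical survey data frame (its column names are assumptions made for this example):

```r
# Hypothetical survey table, for illustration only
survey <- data.frame(
  q1 = c(5, NA, 3),
  q2 = c(4, 2, NA),
  q3 = c(NA, 1, 5)
)

# is.na() on a data frame returns a logical matrix
missing_cells <- which(is.na(survey), arr.ind = TRUE)

# Drop dimnames to show just the (row, col) coordinates
unname(missing_cells)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    1    3

# Column name for each missing cell
missing_cols <- names(survey)[missing_cells[, "col"]]
missing_cols
# [1] "q1" "q2" "q3"
```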

which.min() and which.max()

These specialized variants return the index of the first minimum or maximum value, skipping NA entries. They are the direct route when you need a single position rather than every match.

stock_prices <- c(145.2, 142.8, 150.3, 138.9, 147.6)

# Find day with lowest price
lowest_day <- which.min(stock_prices)
lowest_day
# [1] 4

# Find day with highest price
highest_day <- which.max(stock_prices)
highest_day
# [1] 3

# Get actual values
c(min = stock_prices[lowest_day], max = stock_prices[highest_day])
#    min    max 
# 138.9  150.3

For data frames, these functions work on individual columns:

products <- data.frame(
  name = c("Widget A", "Widget B", "Widget C", "Widget D"),
  price = c(29.99, 15.99, 45.99, 22.50),
  rating = c(4.5, 3.8, 4.9, 4.2)
)

# Find cheapest product
cheapest <- which.min(products$price)
products[cheapest, ]
#       name price rating
# 2 Widget B 15.99    3.8

# Find highest rated product
best_rated <- which.max(products$rating)
products[best_rated, ]
#       name price rating
# 3 Widget C 45.99    4.9
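
Because which.max() and which.min() report only the first extreme value, ties need a different approach. A short sketch with a hypothetical ratings vector containing a tie and a missing value:

```r
# Hypothetical ratings with a tie for the top score and one NA
ratings <- c(4.2, 4.9, NA, 4.9, 3.8)

# which.max() skips the NA and reports only the first maximum
first_best <- which.max(ratings)
first_best
# [1] 2

# To get every tied position, compare against the maximum explicitly
all_best <- which(ratings == max(ratings, na.rm = TRUE))
all_best
# [1] 2 4
```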

Performance Considerations

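When working with large datasets, which() can help when the same positions are needed more than once, because the condition is evaluated a single time. Timings vary by machine and R version, so measure both approaches on your own data: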

# Create large vector
large_vector <- sample(1:1000, 1e6, replace = TRUE)

# Using which() for index-based operations
system.time({
  indices <- which(large_vector > 500)
  result <- large_vector[indices] * 2
})

# Direct logical subsetting, for comparison
system.time({
  result <- large_vector[large_vector > 500] * 2
})

The which() function becomes particularly valuable when you need to perform multiple operations on the same subset or when working with complex conditional logic requiring index tracking.
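
As a rough illustration of reusing one set of indices, the sketch below computes positions once and applies them to two parallel vectors. The vectors order_ids and order_totals are hypothetical and exist only for this example:

```r
# Hypothetical parallel vectors describing the same six orders
order_ids <- c("A1", "A2", "A3", "A4", "A5", "A6")
order_totals <- c(120, 480, 75, 310, 260, 90)

# Evaluate the condition once and keep the positions
big <- which(order_totals > 250)
big
# [1] 2 4 5

# Reuse the same positions for several operations
big_ids <- order_ids[big]
big_totals <- order_totals[big]
big_share <- length(big) / length(order_totals)

big_ids
# [1] "A2" "A4" "A5"
big_share
# [1] 0.5
```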

Common Pitfalls

Empty results return integer(0), not NULL or NA. Check the length before using the indices wherever the code assumes at least one match:

values <- c(1, 2, 3, 4, 5)
high_values <- which(values > 10)
high_values
# integer(0)

# Safe usage
if(length(high_values) > 0) {
  values[high_values] <- 0
}
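
The empty result is most dangerous with negative indexing, where it silently selects nothing instead of dropping nothing. A minimal sketch of the problem and two possible workarounds (the names to_drop, kept_logical, and kept_guarded are illustrative):

```r
values <- c(1, 2, 3, 4, 5)
to_drop <- which(values > 10)   # integer(0): nothing matches

# Pitfall: -integer(0) is still integer(0), so this selects nothing
dropped_all <- values[-to_drop]
dropped_all
# numeric(0)

# Option 1: negate the logical condition (assumes values has no NAs)
kept_logical <- values[!(values > 10)]
kept_logical
# [1] 1 2 3 4 5

# Option 2: guard the negative index with a length check
kept_guarded <- if (length(to_drop) > 0) values[-to_drop] else values
kept_guarded
# [1] 1 2 3 4 5
```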

Remember that which() returns positions, not logical vectors. Don't wrap a condition in which() when plain logical indexing does the same job:

# Redundant when the vector has no NAs
bad_approach <- numbers[which(numbers > 20)]

# Simpler
good_approach <- numbers[numbers > 20]

Use which() when you need indices for manipulation, tracking, or complex operations. Use logical indexing for simple subsetting.
