R - Access Rows and Columns in Data Frame

Key Insights

• R data frames support several indexing methods: single brackets [], double brackets [[]], and the $ operator, each with distinct behavior when subsetting rows and columns.
• Logical indexing and the subset() function handle filtering, while dplyr verbs such as select() and filter() offer more readable alternatives for complex operations.
• Knowing whether an operation returns a data frame or a vector is critical: df[rows, cols] keeps the data frame structure except when it selects a single column (unless drop = FALSE is set), while [[]] and $ always extract the column itself.

Bracket Notation for Basic Access

The fundamental way to access data frame elements uses bracket notation with the syntax df[rows, columns]. This approach gives you precise control over which data to extract.

# Create sample data frame
employees <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(75000, 82000, 68000, 91000, 79000),
  department = c("IT", "Sales", "IT", "HR", "Sales")
)

# Access single element (row 2, column 3)
employees[2, 3]  # Returns: 82000

# Access entire row
employees[3, ]   # Returns row 3 as a data frame

# Access entire column (returns vector)
employees[, 2]   # Returns: "Alice" "Bob" "Charlie" "Diana" "Eve"

# Access multiple rows
employees[c(1, 3, 5), ]

# Access multiple columns
employees[, c("name", "salary")]

When you omit the row or column index, R selects everything in that dimension. Note that selecting a single column with df[, j] returns a vector by default, not a data frame. To preserve the data frame structure, pass drop = FALSE:

# Returns a vector
employees[, "salary"]

# Returns a data frame with one column
employees[, "salary", drop = FALSE]
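
To make the return types concrete, here is a minimal sketch that checks what each form gives back. It rebuilds a shortened version of the employees data frame so it can run on its own; the variable names as_vector, kept_df, and list_style are illustrative only.

```r
# Shortened copy of the sample data so the snippet is self-contained
employees <- data.frame(
  id = 1:3,
  name = c("Alice", "Bob", "Charlie"),
  salary = c(75000, 82000, 68000),
  stringsAsFactors = FALSE
)

# Single column via [rows, cols]: dropped to a plain vector by default
as_vector <- employees[, "salary"]
is.data.frame(as_vector)    # FALSE

# drop = FALSE keeps a one-column data frame
kept_df <- employees[, "salary", drop = FALSE]
is.data.frame(kept_df)      # TRUE

# List-style single bracket (no comma) also keeps a data frame
list_style <- employees["salary"]
is.data.frame(list_style)   # TRUE

# A single row stays a data frame because its columns may differ in type
is.data.frame(employees[2, ])   # TRUE
```

Checking with is.data.frame() before passing a result to code that expects columns is a cheap way to catch an accidental drop to a vector.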

Column Access with $ and [[]]

The $ operator is shorthand for accessing a column by name and returns the column itself, typically a vector. Double brackets [[]] do the same but accept both names and numeric indices.

# Using $ operator (name only)
employees$name
employees$salary

# Using [[ ]] with name
employees[["department"]]

# Using [[ ]] with index
employees[[2]]  # Second column

# Chaining for specific element
employees$salary[3]  # Third salary value
employees[["name"]][1]  # First name

The key difference: $ only accepts literal column names, while [[]] accepts variables:

col_name <- "salary"

# This works
employees[[col_name]]

# This returns NULL: $ looks for a column literally named "col_name"
employees$col_name
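
This distinction matters most inside functions, where the column name usually arrives as an argument. The following is a minimal sketch; column_mean() is a hypothetical helper written for illustration, not a base R function.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(75000, 82000, 68000, 91000, 79000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: averages whichever column name is passed in
column_mean <- function(df, col) {
  stopifnot(is.data.frame(df), col %in% names(df))
  mean(df[[col]], na.rm = TRUE)   # [[ ]] evaluates `col` to get the name
}

avg_salary <- column_mean(employees, "salary")
avg_salary   # 79000

# With $, the argument name would be taken literally and nothing is found
missing_col <- employees$col
is.null(missing_col)   # TRUE
```

The stopifnot() check is one possible design choice: it fails early on a misspelled column name instead of silently returning the mean of NULL.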

Logical Indexing for Filtering

Logical vectors provide powerful filtering capabilities. Create boolean conditions that return TRUE/FALSE for each row, then use them to subset your data.

# Find employees with salary > 75000
high_earners <- employees[employees$salary > 75000, ]
print(high_earners)

# Multiple conditions with & (AND)
it_high_earners <- employees[employees$department == "IT" & employees$salary > 70000, ]

# Multiple conditions with | (OR)
it_or_hr <- employees[employees$department == "IT" | employees$department == "HR", ]

# Using %in% for multiple values
sales_or_hr <- employees[employees$department %in% c("Sales", "HR"), ]

# Exclude rows with ! (parentheses make the negated condition explicit)
not_sales <- employees[!(employees$department == "Sales"), ]

When combining conditions, use the vectorized & (AND) and | (OR) operators. The && and || operators expect single TRUE/FALSE values and are meant for control flow such as if statements. Handle NA values explicitly:

# Add some NA values
employees$bonus <- c(5000, NA, 3000, NA, 4000)

# Each NA in the condition yields an all-NA row in the result (problematic)
employees[employees$bonus > 3500, ]

# Properly exclude NA
employees[!is.na(employees$bonus) & employees$bonus > 3500, ]
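
The effect of NA in a logical index is easiest to see by counting rows. This is a minimal sketch on a two-column copy of the data; the names cond, naive, explicit, and via_which are illustrative. It also shows which(), which drops NA positions and is a common alternative to the explicit is.na() check.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  bonus = c(5000, NA, 3000, NA, 4000),
  stringsAsFactors = FALSE
)

cond <- employees$bonus > 3500   # TRUE NA FALSE NA TRUE

# NA in the index produces a row filled with NA for each missing value
naive <- employees[cond, ]
nrow(naive)      # 4 (two real matches plus two all-NA rows)

# Explicitly treating NA as "no match"
explicit <- employees[!is.na(cond) & cond, ]
nrow(explicit)   # 2

# which() keeps only the positions that are TRUE, so NA is dropped
via_which <- employees[which(cond), ]
via_which$name   # "Alice" "Eve"
```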

The subset() Function

The subset() function provides cleaner syntax for filtering. It treats NA results in the condition as FALSE, so those rows are dropped, and it lets you reference columns directly without the $ operator.

# Basic filtering
subset(employees, salary > 75000)

# Multiple conditions
subset(employees, department == "IT" & salary > 70000)

# Select specific columns
subset(employees, salary > 75000, select = c(name, department))

# Exclude columns with negative selection
subset(employees, department == "Sales", select = -id)

# Using subset with bonus column (rows where the condition is NA are dropped)
subset(employees, bonus > 3500)

While subset() is convenient for interactive use, avoid it in production functions because it uses non-standard evaluation that can behave unexpectedly in different scopes.
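
Inside a function, standard evaluation with [[ ]] and a logical index is one way to avoid this problem. The following is a minimal sketch; filter_above() is a hypothetical helper for illustration, and it handles NA the same way subset() does by treating it as no match.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(75000, 82000, 68000, 91000, 79000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keeps rows where column `col` exceeds `threshold`
filter_above <- function(df, col, threshold) {
  values <- df[[col]]                         # column name passed as a string
  keep <- !is.na(values) & values > threshold # NA counts as "no match"
  df[keep, , drop = FALSE]
}

high <- filter_above(employees, "salary", 75000)
high$name   # "Bob" "Diana" "Eve"
```

Because the column name is an ordinary string, the function behaves the same no matter which variables happen to exist in the calling environment.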

dplyr for Modern Data Manipulation

The dplyr package offers intuitive verbs for data frame operations that often make code more readable. How its speed compares with base R depends on the operation and the size of the data.

library(dplyr)

# filter() for row selection
employees %>%
  filter(salary > 75000)

# Multiple conditions
employees %>%
  filter(department == "IT", salary > 70000)

# select() for column selection
employees %>%
  select(name, salary)

# Combine filter and select
employees %>%
  filter(department %in% c("IT", "Sales")) %>%
  select(name, salary, department)

# Helper functions for select
employees %>%
  select(starts_with("s"))  # salary column

employees %>%
  select(where(is.numeric))  # All numeric columns

# slice() for row numbers
employees %>%
  slice(1:3)

# arrange() for sorting before selection
employees %>%
  arrange(desc(salary)) %>%
  slice(1:2)  # Top 2 earners

The pipe operator %>% (or native |> in R 4.1+) chains operations for readable data pipelines:

# Complex pipeline
result <- employees %>%
  filter(!is.na(bonus)) %>%
  mutate(total_comp = salary + bonus) %>%
  filter(total_comp > 80000) %>%
  select(name, department, total_comp) %>%
  arrange(desc(total_comp))
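
The native pipe does not require dplyr. As a minimal sketch, assuming R 4.1 or later, the same chaining style works with base R functions; the variable name result_base is illustrative.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(75000, 82000, 68000, 91000, 79000),
  stringsAsFactors = FALSE
)

# |> passes the left-hand side as the first argument of the call on the right
result_base <- employees |>
  subset(salary > 75000, select = c(name, salary)) |>
  head(2)

result_base$name   # "Bob" "Diana"
```

Unlike %>%, the native pipe requires an explicit function call with parentheses on the right-hand side, so head(2) works but a bare head does not.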

Row Names and Row Selection

Data frames can have row names, though this feature is less commonly used in modern R programming.

# Create data frame with row names
df <- data.frame(
  x = 1:3,
  y = 4:6,
  row.names = c("row1", "row2", "row3")
)

# Access by row name
df["row2", ]

# Get row names
rownames(df)

# Access multiple rows by name
df[c("row1", "row3"), ]

# which() returns row indices
high_salary_indices <- which(employees$salary > 75000)
employees[high_salary_indices, ]

Negative Indexing for Exclusion

Negative indices remove specified rows or columns rather than selecting them.

# Exclude first row
employees[-1, ]

# Exclude multiple rows
employees[-c(1, 3), ]

# Exclude columns by index
employees[, -c(1, 4)]

# Exclude by name: a logical test on names() stays safe even if nothing matches
employees[, names(employees) != "id"]

# dplyr alternative
employees %>%
  select(-id, -department)

Never mix positive and negative indices in the same subscript; R will throw an error.
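
The error can be observed safely with tryCatch(). This is a minimal sketch on a two-column copy of the data; the names mixed, without_first, and second_of_rest are illustrative. It also shows a two-step alternative when both exclusion and selection are needed.

```r
employees <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  stringsAsFactors = FALSE
)

# Mixing signs in one subscript fails; capture the error instead of stopping
mixed <- tryCatch(
  employees[c(-1, 2), ],
  error = function(e) e
)
inherits(mixed, "error")   # TRUE

# Two-step alternative: drop row 1 first, then select from what remains
without_first <- employees[-1, ]
second_of_rest <- without_first[2, ]
second_of_rest$name        # "Charlie" (row 3 of the original)
```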

Performance Considerations

For large data frames, method selection impacts performance significantly.

# Benchmark different approaches
library(microbenchmark)

large_df <- data.frame(
  x = 1:1000000,
  y = rnorm(1000000)
)

microbenchmark(
  bracket = large_df[large_df$x > 500000, ],
  subset_fn = subset(large_df, x > 500000),
  dplyr_filter = dplyr::filter(large_df, x > 500000),
  times = 100
)

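For data frames with millions of rows, dplyr can outperform base R methods, though results vary with the operation, the data types, and the package version, so benchmark on your own data. For small datasets (fewer than about 10,000 rows), the difference is usually negligible, and base R syntax avoids extra dependencies.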

Understanding these access methods lets you choose the right tool for your use case, whether you prioritize readability, performance, or minimal dependencies in your R applications.