R stringr - str_detect() with Examples
Key Insights
- str_detect() returns a logical vector indicating whether each element matches a pattern, making it ideal for filtering operations with dplyr::filter()
- Use fixed() for literal string matching to gain significant performance improvements over regex, especially on large datasets
- The negate = TRUE parameter provides a cleaner alternative to wrapping calls in ! when you need to find non-matching elements
Introduction to str_detect()
The str_detect() function from R’s stringr package answers a simple question: does this string contain this pattern? It examines each element of a character vector and returns TRUE or FALSE based on whether the pattern exists within that element.
This function sits at the foundation of string manipulation workflows. While str_extract() pulls out matched content and str_replace() swaps patterns for new text, str_detect() simply tells you what’s there. That logical output makes it the go-to choice for subsetting and filtering operations.
library(stringr)
# Basic syntax: str_detect(string, pattern, negate = FALSE)
# Simple example
fruits <- c("apple", "banana", "cherry", "apricot")
str_detect(fruits, "ap")
# [1] TRUE FALSE FALSE TRUE
The function processes each element independently, returning a logical vector of the same length as your input. This vectorized behavior eliminates the need for explicit loops and integrates cleanly with tidyverse pipelines.
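A minimal sketch of that vectorized behavior, reusing the fruits vector from above. The variable name hits and the second set of patterns are illustrative assumptions, not part of the original example:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry", "apricot")

# One logical value per input element
hits <- str_detect(fruits, "ap")
length(hits) == length(fruits)
# [1] TRUE

# The logical vector can index the original vector directly
fruits[hits]
# [1] "apple" "apricot"

# The pattern argument is vectorized too: element i is checked against pattern i
str_detect(fruits, c("^a", "^a", "^c", "^b"))
# [1] TRUE FALSE TRUE FALSE
```

Passing a pattern vector of the same length as the input is less common than a single pattern, but it shows that both arguments are processed element by element.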
Basic Pattern Matching
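At its simplest, str_detect() searches for a plain substring. Patterns are interpreted as regular expressions by default, but a pattern made only of letters and digits matches itself literally. Pass a string vector and a pattern, and you get back a logical vector.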
sentences <- c(
"The quick brown fox jumps over the lazy dog",
"Pack my box with five dozen liquor jugs",
"How vexingly quick daft zebras jump"
)
# Detect sentences containing "quick"
str_detect(sentences, "quick")
# [1] TRUE FALSE TRUE
# Case sensitivity matters
str_detect(sentences, "Quick")
# [1] FALSE FALSE FALSE
str_detect(sentences, "the")
# [1] TRUE FALSE FALSE
Notice that str_detect() is case-sensitive by default. The pattern “Quick” with a capital Q doesn’t match “quick” in the text. When case-insensitive matching is required, wrap your pattern with regex() and set the ignore_case argument:
str_detect(sentences, regex("quick", ignore_case = TRUE))
# [1] TRUE FALSE TRUE
str_detect(sentences, regex("THE", ignore_case = TRUE))
# [1] TRUE FALSE FALSE
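Two other ways to get case-insensitive matching are sketched below: the inline (?i) flag supported by the regex engine stringr uses, and lowercasing the input with str_to_lower() before matching. Which one reads best is a matter of taste; this is an illustration rather than a recommendation.

```r
library(stringr)

sentences <- c(
  "The quick brown fox jumps over the lazy dog",
  "Pack my box with five dozen liquor jugs",
  "How vexingly quick daft zebras jump"
)

# Inline flag: (?i) switches on case-insensitive matching inside the pattern
str_detect(sentences, "(?i)QUICK")
# [1] TRUE FALSE TRUE

# Normalize the input instead of the pattern
str_detect(str_to_lower(sentences), "the")
# [1] TRUE FALSE FALSE
```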
For checking multiple elements against a single pattern, the vectorization handles everything:
email_addresses <- c(
"user@example.com",
"admin@company.org",
"support@example.com",
"hello@domain.net"
)
# Find all example.com addresses (the dot is escaped so it matches a literal ".")
str_detect(email_addresses, "example\\.com")
# [1] TRUE FALSE TRUE FALSE
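Because patterns are regular expressions by default, a bare dot matches any character, not just a literal period. The sketch below uses made-up addresses (the hosts vector is hypothetical) to show how an unescaped dot can match more than intended:

```r
library(stringr)

# Hypothetical addresses chosen to expose the difference
hosts <- c("user@example.com", "user@example-com.net", "user@exampleXcom.org")

# Unescaped dot: "." matches "-", "X", or any other single character
str_detect(hosts, "example.com")
# [1] TRUE TRUE TRUE

# Escaped dot: only a literal "." matches
str_detect(hosts, "example\\.com")
# [1] TRUE FALSE FALSE
```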
Using Regular Expressions
The real power of str_detect() emerges when you move beyond literal patterns to regular expressions. Regex lets you define flexible patterns that match entire categories of strings.
Anchors pin your pattern to specific positions:
words <- c("apple", "application", "pineapple", "app")
# Strings starting with "app"
str_detect(words, "^app")
# [1] TRUE TRUE FALSE TRUE
# Strings ending with "app"
str_detect(words, "app$")
# [1] FALSE FALSE FALSE TRUE
# Strings that ARE exactly "app"
str_detect(words, "^app$")
# [1] FALSE FALSE FALSE TRUE
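For the common prefix and suffix checks, stringr also provides str_starts() and str_ends(), which add the anchor for you. A short sketch, assuming stringr 1.4.0 or later (where these helpers were introduced):

```r
library(stringr)

words <- c("apple", "application", "pineapple", "app")

# Same result as str_detect(words, "^app")
str_starts(words, "app")
# [1] TRUE TRUE FALSE TRUE

# Same result as str_detect(words, "app$")
str_ends(words, "app")
# [1] FALSE FALSE FALSE TRUE
```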
Character classes match categories of characters:
mixed_data <- c("order123", "invoice456", "report", "data2024", "summary")
# Contains any digit
str_detect(mixed_data, "\\d")
# [1] TRUE TRUE FALSE TRUE FALSE
# Contains only letters
str_detect(mixed_data, "^[a-zA-Z]+$")
# [1] FALSE FALSE TRUE FALSE TRUE
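A few more character-class forms, sketched against the same mixed_data vector: a POSIX class, a custom set, and a negated set. The specific patterns are illustrative choices, not from the original article:

```r
library(stringr)

mixed_data <- c("order123", "invoice456", "report", "data2024", "summary")

# POSIX class: matches the same elements as "\\d" for this input
str_detect(mixed_data, "[[:digit:]]")
# [1] TRUE TRUE FALSE TRUE FALSE

# Custom set: starts with a lowercase vowel
str_detect(mixed_data, "^[aeiou]")
# [1] TRUE TRUE FALSE FALSE FALSE

# Negated set: contains at least one character that is not a lowercase letter
str_detect(mixed_data, "[^a-z]")
# [1] TRUE TRUE FALSE TRUE FALSE
```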
Quantifiers control how many times a pattern must appear:
codes <- c("A1", "AB12", "ABC123", "ABCD1234", "A")
# At least two consecutive digits
str_detect(codes, "\\d{2,}")
# [1] FALSE TRUE TRUE TRUE FALSE
# Exactly three letters followed by exactly three digits
str_detect(codes, "^[A-Z]{3}\\d{3}$")
# [1] FALSE FALSE TRUE FALSE FALSE
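The shorthand quantifiers ?, + and * follow the same idea as the braces above. The sketch below uses a hypothetical spellings vector to contrast them:

```r
library(stringr)

# Hypothetical input chosen to separate the three quantifiers
spellings <- c("color", "colour", "colouur", "colr")

# ? : zero or one "u"
str_detect(spellings, "^colou?r$")
# [1] TRUE TRUE FALSE FALSE

# + : one or more "u"
str_detect(spellings, "^colou+r$")
# [1] FALSE TRUE TRUE FALSE

# * : zero or more "u"
str_detect(spellings, "^colou*r$")
# [1] TRUE TRUE TRUE FALSE
```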
Alternation matches one pattern or another:
file_names <- c("report.pdf", "data.csv", "image.png", "document.pdf", "sheet.xlsx")
# PDF or CSV files
str_detect(file_names, "\\.(pdf|csv)$")
# [1] TRUE TRUE FALSE TRUE FALSE
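When the alternatives live in a vector, the pattern can be assembled with str_c() instead of typed by hand. This is a sketch under the assumption that the alternatives contain no regex metacharacters; wanted_ext and ext_pattern are hypothetical names:

```r
library(stringr)

file_names <- c("report.pdf", "data.csv", "image.png", "document.pdf", "sheet.xlsx")

# Extensions to accept (assumed to be plain letters, so no escaping is needed)
wanted_ext <- c("pdf", "csv")

# Join the alternatives with "|" and wrap them in the same anchored group as above
ext_pattern <- str_c("\\.(", str_c(wanted_ext, collapse = "|"), ")$")

str_detect(file_names, ext_pattern)
# [1] TRUE TRUE FALSE TRUE FALSE
```

Building the pattern this way keeps the list of extensions in one place, which helps when it comes from configuration or another column.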
Practical Use Cases with Data Frames
The combination of str_detect() and dplyr::filter() handles most real-world string filtering tasks. When you need to subset rows based on text content, this pairing delivers clean, readable code.
library(dplyr)
# Sample customer data
customers <- tibble(
id = 1:6,
name = c("John Smith", "Jane Doe", "Bob Johnson", "Alice Smith", "Charlie Brown", "Diana Prince"),
email = c("john@gmail.com", "jane@company.org", "bob@gmail.com", "alice@yahoo.com", "charlie@company.org", "diana@gmail.com"),
notes = c("Premium customer", "New signup", "Requested refund", "Premium tier", "Trial user", "Premium member")
)
# Filter customers with Gmail addresses
customers %>%
filter(str_detect(email, "gmail\\.com"))
# Returns rows 1, 3, and 6
# Filter customers with "Premium" in notes
customers %>%
filter(str_detect(notes, "Premium"))
# Returns rows 1, 4, and 6
# Filter by name containing "Smith"
customers %>%
filter(str_detect(name, "Smith"))
# Returns rows 1 and 4
Combining multiple string conditions creates powerful filters:
# Gmail users who are also Premium
customers %>%
filter(
str_detect(email, "gmail\\.com"),
str_detect(notes, "Premium")
)
# Returns rows 1 and 6
# Company email OR Premium status
customers %>%
filter(
str_detect(email, "company\\.org") | str_detect(notes, "Premium")
)
# Returns rows 1, 2, 4, 5, and 6
You can also use str_detect() within mutate() to create indicator columns:
customers %>%
mutate(
is_gmail = str_detect(email, "gmail\\.com"),
is_premium = str_detect(notes, "Premium")
)
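Because TRUE counts as 1 and FALSE as 0, the same logical output can be aggregated inside summarise(). A small sketch with a trimmed-down copy of the customers data; the column names n_gmail and share_premium are illustrative:

```r
library(dplyr)
library(stringr)

# Reduced copy of the customers table, keeping only the columns used here
customers <- tibble(
  email = c("john@gmail.com", "jane@company.org", "bob@gmail.com",
            "alice@yahoo.com", "charlie@company.org", "diana@gmail.com"),
  notes = c("Premium customer", "New signup", "Requested refund",
            "Premium tier", "Trial user", "Premium member")
)

# sum() counts matches, mean() gives the proportion of matching rows
match_summary <- customers %>%
  summarise(
    n_gmail = sum(str_detect(email, "gmail\\.com")),
    share_premium = mean(str_detect(notes, "Premium"))
  )

match_summary
# n_gmail is 3 and share_premium is 0.5
```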
Negation and Advanced Options
Sometimes you need to find strings that don’t match a pattern. The negate parameter handles this cleanly:
log_entries <- c(
"[INFO] Application started",
"[ERROR] Database connection failed",
"[INFO] User logged in",
"[WARNING] Memory usage high",
"[ERROR] File not found"
)
# Find non-error entries
str_detect(log_entries, "ERROR", negate = TRUE)
# [1] TRUE FALSE TRUE TRUE FALSE
# Equivalent but less readable
!str_detect(log_entries, "ERROR")
# [1] TRUE FALSE TRUE TRUE FALSE
The negate parameter becomes especially valuable in filter operations where the intent is clearer:
logs_df <- tibble(entry = log_entries)
# Filter out errors - clear intent
logs_df %>%
filter(str_detect(entry, "ERROR", negate = TRUE))
# Same result, less obvious
logs_df %>%
filter(!str_detect(entry, "ERROR"))
Handling NA values requires attention. By default, str_detect() returns NA when the input is NA:
data_with_na <- c("apple", NA, "banana", "cherry", NA)
str_detect(data_with_na, "a")
# [1] TRUE NA TRUE FALSE NA
When filtering, these NA values get dropped automatically by filter(). If you need explicit control, handle them before or after detection:
# Replace NA with FALSE (coalesce() comes from dplyr)
coalesce(str_detect(data_with_na, "a"), FALSE)
# [1] TRUE FALSE TRUE FALSE FALSE
# Or filter out NA first
data_with_na[!is.na(data_with_na)] %>%
str_detect("a")
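The interaction with filter() can be sketched as follows. The data frame na_df is a hypothetical wrapper around the same values; the point is that rows where the condition evaluates to NA are dropped unless you keep them explicitly:

```r
library(dplyr)
library(stringr)

na_df <- tibble(fruit = c("apple", NA, "banana", "cherry", NA))

# filter() keeps only rows where the condition is TRUE, so NA rows disappear
matched <- na_df %>% filter(str_detect(fruit, "a"))
matched$fruit
# [1] "apple" "banana"

# negate = TRUE does not bring the NA rows back: NA stays NA
unmatched <- na_df %>% filter(str_detect(fruit, "a", negate = TRUE))
unmatched$fruit
# [1] "cherry"

# Keep the NA rows on purpose by adding an is.na() condition
with_na <- na_df %>% filter(str_detect(fruit, "a") | is.na(fruit))
nrow(with_na)
# [1] 4
```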
Combining with str_subset() provides a shortcut when you want the actual matching strings rather than a logical vector:
# These produce the same result
fruits[str_detect(fruits, "ap")]
str_subset(fruits, "ap")
# [1] "apple" "apricot"
Performance Considerations
For literal string matching without regex features, fixed() delivers substantial performance gains. It tells str_detect() to skip regex parsing and perform a direct character comparison:
# Large dataset simulation
large_vector <- rep(c("error_log_2024", "info_message", "debug_trace", "error_report"), 250000)
# Regex matching (slower)
system.time(str_detect(large_vector, "error"))
# user system elapsed
# 0.156 0.004 0.160
# Fixed matching (faster)
system.time(str_detect(large_vector, fixed("error")))
# user system elapsed
# 0.052 0.000 0.052
The speedup varies by pattern complexity, data size, and machine (the timings above are illustrative), but fixed() is typically faster than regex for literal matches. Use it when your pattern contains no special regex characters and you don’t need regex features.
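Beyond speed, fixed() also changes what the pattern means: every character is taken literally. The sketch below uses a hypothetical versions vector to show the difference, along with the ignore_case argument that fixed() accepts:

```r
library(stringr)

# Hypothetical version strings
versions <- c("v1.2", "v1x2", "V1.2")

# Regex: the dot matches any character, so "1x2" also matches
str_detect(versions, "1.2")
# [1] TRUE TRUE TRUE

# fixed(): the dot is a literal period
str_detect(versions, fixed("1.2"))
# [1] TRUE FALSE TRUE

# fixed() can still ignore case
str_detect(versions, fixed("v1.2", ignore_case = TRUE))
# [1] TRUE FALSE TRUE
```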
Comparison with base R’s grepl():
# These return the same result for this input
# (they differ on missing values: grepl() gives FALSE for NA, str_detect() gives NA)
str_detect(fruits, "ap")
grepl("ap", fruits)
The stringr version offers consistent argument order (data first, pattern second) that works better in pipes, plus the negate parameter. Performance is comparable for most use cases. Choose based on your coding style and whether you’re already using stringr for other operations.
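One behavioral difference worth knowing when switching between the two is how missing values are treated. A short sketch with a hypothetical input vector:

```r
library(stringr)

with_missing <- c("apple", NA, "cherry")

# stringr propagates the missing value
str_detect(with_missing, "a")
# [1] TRUE NA FALSE

# base R reports a missing input as "no match"
grepl("a", with_missing)
# [1] TRUE FALSE FALSE
```

If downstream code counts or negates matches, this difference can change the result, so it is worth checking which behavior you want.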
Tips for large datasets:
- Use fixed() for literal patterns
- Store complex regex patterns in a variable and reuse it rather than rebuilding the pattern on every call
- Consider str_which() when you only need indices, not the full logical vector
- For very large datasets, data.table’s string functions may offer better performance
Summary
The str_detect() function provides the foundation for string-based filtering in R. Its key parameters include the input string vector, the pattern to match, and the optional negate argument for inverse matching.
Common pitfalls to avoid:
- Forgetting case sensitivity (use regex(pattern, ignore_case = TRUE))
- Not escaping special regex characters like . when matching literally
- Using regex when fixed() would be faster and sufficient
Related stringr functions worth exploring:
- str_which(): returns indices of matching elements instead of a logical vector
- str_subset(): returns the actual matching strings directly
- str_count(): counts how many times a pattern appears in each string
- str_locate(): finds the position of the first match within each string
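As a quick orientation, the four related functions can be compared side by side on the fruits vector used earlier. The patterns here are illustrative:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry", "apricot")

# Indices of matching elements
str_which(fruits, "an")
# [1] 2

# The matching strings themselves
str_subset(fruits, "an")
# [1] "banana"

# Number of matches per element
str_count(fruits, "a")
# [1] 1 3 0 1

# Position of the first match: a matrix with "start" and "end" columns,
# NA where there is no match
first_a <- str_locate(fruits, "a")
first_a[, "start"]
# [1] 1 2 NA 1
```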
Master str_detect() and you’ll handle the majority of string filtering tasks in R with clean, readable code that integrates naturally into tidyverse workflows.