R stringr - str_extract() and str_extract_all()

Key Insights

  • str_extract() returns only the first match per string, while str_extract_all() returns a list of all matches; choose based on whether you expect one or several matches per input
  • The simplify = TRUE parameter in str_extract_all() converts list output to a character matrix, which is easier to work with in data frames but pads rows that have fewer matches with empty strings
  • Define a pattern once with regex() and reuse it when applying the same extraction across large datasets; regex() attaches options such as ignore_case to the pattern rather than compiling it, so treat any speed benefit as something to measure

Introduction to String Extraction in R

The stringr package sits at the heart of text manipulation in R’s tidyverse ecosystem. Built on top of the stringi package, it provides consistent, human-readable functions that make regex operations less painful than base R alternatives.

String extraction—pulling specific patterns out of text—is fundamental to data cleaning. You’ll need it when parsing log files, extracting IDs from messy columns, pulling hashtags from social media data, or isolating dates buried in free-text fields. The two workhorses for this job are str_extract() and str_extract_all().

# Setup
library(stringr)
library(dplyr)

# Quick verification
packageVersion("stringr")

Both functions take the same core arguments: a character vector and a regex pattern. The difference lies in what they return and how many matches they capture.

Understanding str_extract() Basics

str_extract() finds the first match of your pattern in each string and returns it. No match? You get NA. It’s that simple.

# Basic syntax: str_extract(string, pattern)

# Extract first number from strings
texts <- c("Order 12345 shipped", "Invoice 67890 pending", "No numbers here")
str_extract(texts, "\\d+")
# [1] "12345" "67890" NA

# Extract first word (sequence of letters)
str_extract("Hello World 123", "[A-Za-z]+")
# [1] "Hello"

# Extract dates in YYYY-MM-DD format
logs <- c("Error on 2024-01-15: timeout", "Warning 2024-02-20: retry", "Info: no date")
str_extract(logs, "\\d{4}-\\d{2}-\\d{2}")
# [1] "2024-01-15" "2024-02-20" NA

The function is vectorized, meaning it processes each element of your input vector independently. Each input string produces exactly one output: either the matched substring or NA.

This behavior makes str_extract() ideal when you know each string contains at most one instance of your pattern, or when you only care about the first occurrence.
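Because every input yields exactly one output, the result lines up with the original vector and missing matches can be handled with ordinary NA tools. The following is a minimal sketch of that idea; the "unknown" placeholder is an arbitrary choice for illustration, not something stringr prescribes.

```r
library(stringr)

texts <- c("Order 12345 shipped", "Invoice 67890 pending", "No numbers here")

# One result per input; NA marks the strings that had no match
ids <- str_extract(texts, "\\d+")

# Flag the inputs without a match
no_match <- is.na(ids)

# Substitute a placeholder for missing values (the label is an assumption)
ids_filled <- ifelse(no_match, "unknown", ids)
ids_filled
# [1] "12345"   "67890"   "unknown"
```

Keeping the NA values is usually the safer default; fill them only when a downstream step cannot handle missing data.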

str_extract_all() for Multiple Matches

When strings contain multiple instances of your pattern, str_extract_all() captures them all. The trade-off: it returns a list, not a simple vector.

# Extract all hashtags from tweets
tweets <- c(
  "Loving #R and #datascience today!",
  "Just #coding",
  "No hashtags in this one"
)

str_extract_all(tweets, "#\\w+")
# [[1]]
# [1] "#R" "#datascience"
# [[2]]
# [1] "#coding"
# [[3]]
# character(0)

# Extract all email addresses
text <- "Contact john@example.com or jane@company.org for help"
str_extract_all(text, "[\\w.]+@[\\w.]+\\.[a-z]{2,}")
# [[1]]
# [1] "john@example.com" "jane@company.org"

Notice that strings with no matches return character(0), not NA. This distinction matters when processing results.
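One way to work with that distinction is to count matches with base R's lengths() and to convert empty results to NA where a single value per input is wanted. This is a sketch rather than the only approach; the variable names are made up for the example.

```r
library(stringr)

tweets <- c(
  "Loving #R and #datascience today!",
  "Just #coding",
  "No hashtags in this one"
)

matches <- str_extract_all(tweets, "#\\w+")

# Number of matches per input; 0 identifies the character(0) entries
n_tags <- lengths(matches)
n_tags
# [1] 2 1 0

# Take the first match per input, mapping "no match" to NA
first_tag <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else m[1],
  character(1)
)
first_tag
# [1] "#R"      "#coding" NA
```

The last result is the same vector that str_extract(tweets, "#\\w+") would return, which is a quick way to check that the two functions agree.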

The simplify parameter converts list output to a matrix:

hashtags <- str_extract_all(tweets, "#\\w+", simplify = TRUE)
hashtags
#      [,1]           [,2]
# [1,] "#R"           "#datascience"
# [2,] "#coding"      ""
# [3,] ""             ""

Use simplify = TRUE when you need consistent column counts for data frame operations, but be aware that shorter match lists get padded with empty strings.
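If the padding gets in the way, one option is to turn the empty strings back into NA before building a data frame. The sketch below assumes the pattern can never legitimately match an empty string (true for "#\\w+"); the tag_1, tag_2 column names are invented for illustration.

```r
library(stringr)

tweets <- c(
  "Loving #R and #datascience today!",
  "Just #coding",
  "No hashtags in this one"
)

m <- str_extract_all(tweets, "#\\w+", simplify = TRUE)

# Treat padding as missing; safe only because "#\\w+" cannot match ""
m[m == ""] <- NA

# Convert to a data frame with readable (hypothetical) column names
tags <- as.data.frame(m, stringsAsFactors = FALSE)
names(tags) <- paste0("tag_", seq_len(ncol(tags)))
tags
```

Each row still corresponds to one input string, so the result can be bound column-wise to the original data.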

Regex Patterns for Common Extraction Tasks

Effective extraction depends on precise patterns. Here are practical starting patterns for common scenarios; they are deliberately permissive, so tighten them before relying on them for validation:

# Phone numbers (US format variations)
phone_text <- "Call 555-123-4567 or (555) 987-6543 or 5551234567"
str_extract_all(phone_text, "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}")
# [[1]]
# [1] "555-123-4567"   "(555) 987-6543" "5551234567"

# URLs
web_text <- "Visit https://example.com or http://test.org/page?id=1"
str_extract_all(web_text, "https?://[\\w./\\-?=&]+")
# [[1]]
# [1] "https://example.com"       "http://test.org/page?id=1"

# Currency values
prices <- "Items cost $19.99, $5, and $149.50"
str_extract_all(prices, "\\$\\d+\\.?\\d*")
# [[1]]
# [1] "$19.99"  "$5"      "$149.50"

# IP addresses (IPv4)
logs <- "Requests from 192.168.1.1 and 10.0.0.255 blocked"
str_extract_all(logs, "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}")
# [[1]]
# [1] "192.168.1.1" "10.0.0.255"
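The IPv4 pattern above accepts any group of one to three digits, so it will also pick up strings such as 999.300.1.2. A possible follow-up, sketched here, is to extract candidates first and then check each octet numerically. The is_valid_ipv4() helper and the extra invalid address in the sample text are assumptions added for illustration.

```r
library(stringr)

logs <- "Requests from 192.168.1.1, 10.0.0.255 and 999.300.1.2 blocked"

# Step 1: permissive extraction of anything shaped like an IPv4 address
candidates <- str_extract_all(
  logs, "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"
)[[1]]

# Step 2: hypothetical helper that checks every octet is in 0..255
is_valid_ipv4 <- function(ip) {
  octets <- as.integer(str_split(ip, fixed("."))[[1]])
  length(octets) == 4 && all(octets >= 0 & octets <= 255)
}

valid <- candidates[vapply(candidates, is_valid_ipv4, logical(1))]
valid
# [1] "192.168.1.1" "10.0.0.255"
```

Splitting extraction from validation keeps the regex readable; encoding the 0 to 255 range directly in the pattern is possible but much harder to maintain.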

Capture groups in parentheses do not change what str_extract() returns: the function returns the entire match, not the captured group, which is a common source of confusion. We’ll address this in the pitfalls section.

Working with Vectorized Input

Real-world extraction happens inside data frames. Here’s how to integrate these functions with dplyr workflows:

# Sample data
orders <- tibble(
  order_text = c(
    "Order #A123 - 3 items - $45.99",
    "Order #B456 - 1 item - $12.00",
    "Order #C789 - 5 items - $89.50",
    NA,
    "Order #D012 - 2 items - $33.25"
  )
)

# Extract order IDs, item counts, and prices
orders_clean <- orders %>%
  mutate(
    order_id = str_extract(order_text, "#[A-Z]\\d{3}"),
    item_count = as.integer(str_extract(order_text, "\\d+(?= items?)")),
    price = as.numeric(str_extract(order_text, "(?<=\\$)\\d+\\.\\d{2}"))
  )

orders_clean
# # A tibble: 5 × 4
#   order_text                       order_id item_count price
#   <chr>                            <chr>         <int> <dbl>
# 1 Order #A123 - 3 items - $45.99   #A123             3  46.0
# 2 Order #B456 - 1 item - $12.00    #B456             1  12
# 3 Order #C789 - 5 items - $89.50   #C789             5  89.5
# 4 NA                               NA               NA  NA
# 5 Order #D012 - 2 items - $33.25   #D012             2  33.2

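Notice how NA input propagates to NA output automatically, with no special handling required. The lookahead (?= items?) and lookbehind (?<=\\$) patterns extract values without including the surrounding text. The price column prints with three significant digits (33.25 shows as 33.2), but that is only how tibbles display numbers; the full values are stored.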
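To see why the lookarounds matter, it can help to compare the same extraction with and without them. This is a small sketch on a single string; the variable names are made up for the example.

```r
library(stringr)

order_text <- "Order #A123 - 3 items - $45.99"

# Without a lookbehind the dollar sign is part of the match
with_symbol <- str_extract(order_text, "\\$\\d+\\.\\d{2}")
with_symbol
# [1] "$45.99"

# as.numeric() cannot parse the "$", so the conversion yields NA
suppressWarnings(as.numeric(with_symbol))
# [1] NA

# The lookbehind requires a preceding "$" but leaves it out of the match
without_symbol <- str_extract(order_text, "(?<=\\$)\\d+\\.\\d{2}")
as.numeric(without_symbol)
# [1] 45.99

# The lookahead requires a following " item" or " items" without consuming it
item_count <- str_extract(order_text, "\\d+(?= items?)")
item_count
# [1] "3"
```

The lookahead is also what stops the digits in #A123 from being returned, since they are not followed by " item".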

For str_extract_all() in data frames, you’ll often need to unnest the list column:

social_posts <- tibble(
  post = c(
    "Great day! #sunshine #happy #blessed",
    "Working hard #coding",
    "Just a regular post"
  )
)

# Extract and unnest hashtags
social_posts %>%
  mutate(hashtags = str_extract_all(post, "#\\w+")) %>%
  tidyr::unnest(hashtags, keep_empty = TRUE)
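When a long format is not what you want, or tidyr is not available, the list column can be reshaped with base R instead. The sketch below shows two options under that assumption: collapsing each post's hashtags into one string, and building a long table by hand. Unlike keep_empty = TRUE, the hand-built table drops posts without hashtags.

```r
library(stringr)

posts <- c(
  "Great day! #sunshine #happy #blessed",
  "Working hard #coding",
  "Just a regular post"
)

tag_list <- str_extract_all(posts, "#\\w+")

# Option 1: one row per post, hashtags collapsed into a single string
tags_joined <- vapply(tag_list, paste, character(1), collapse = " ")
tags_joined
# [1] "#sunshine #happy #blessed" "#coding"                   ""

# Option 2: one row per hashtag; post_id links back to the input position
long <- data.frame(
  post_id = rep(seq_along(posts), lengths(tag_list)),
  hashtag = unlist(tag_list),
  stringsAsFactors = FALSE
)
long
```

Option 1 keeps the data frame at one row per post, which is convenient for display; option 2 is the shape you need for counting or joining on individual hashtags.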

Practical Use Cases and Performance Tips

Log file parsing demonstrates the real power of string extraction:

# Apache-style log entries
log_lines <- c(
  '192.168.1.100 - - [15/Jan/2024:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234',
  '10.0.0.50 - - [15/Jan/2024:10:31:02 +0000] "POST /api/login HTTP/1.1" 401 89',
  '192.168.1.100 - - [15/Jan/2024:10:31:15 +0000] "GET /api/data HTTP/1.1" 500 0'
)

parsed_logs <- tibble(raw = log_lines) %>%
  mutate(
    ip = str_extract(raw, "^[\\d.]+"),
    timestamp = str_extract(raw, "(?<=\\[)[^\\]]+"),
    method = str_extract(raw, "(GET|POST|PUT|DELETE)"),
    endpoint = str_extract(raw, "(?<=(GET|POST|PUT|DELETE) )[^\\s]+"),
    status = str_extract(raw, "(?<=HTTP/1\\.1\" )\\d{3}")
  )

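For repeated extraction on large datasets, define the pattern once with regex() and reuse it. Note that regex() attaches matching options to the pattern rather than compiling it, and a plain string pattern is treated as a regex anyway, so time both forms before assuming a speedup:

# Reusable pattern with options attached
email_pattern <- regex("[\\w.]+@[\\w.]+\\.[a-z]{2,}", ignore_case = TRUE)

# Use across large vector
large_text_vector <- rep("Contact support@example.com", 100000)

# Time the extraction; compare against a plain string pattern on your own data
system.time(str_extract(large_text_vector, email_pattern))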

The regex() function also lets you set flags like ignore_case, multiline, and comments for more readable complex patterns.
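As a brief illustration of those flags, the sketch below uses ignore_case for case-insensitive matching and comments to spread a date pattern over several annotated lines. The sample strings are invented for the example.

```r
library(stringr)

# ignore_case: match "error" regardless of capitalization
statuses <- c("Status: ERROR", "status: error", "all good")
found <- str_extract(statuses, regex("error", ignore_case = TRUE))
found
# [1] "ERROR" "error" NA

# comments: whitespace in the pattern is ignored and "#" starts a comment,
# so each piece of the pattern can be documented inline
date_pattern <- regex("
  \\d{4}   # four-digit year
  -
  \\d{2}   # two-digit month
  -
  \\d{2}   # two-digit day
", comments = TRUE)

date_found <- str_extract("Error on 2024-01-15: timeout", date_pattern)
date_found
# [1] "2024-01-15"
```

With comments = TRUE, a literal space has to be written as an escape such as \\s or "\\ ", because unescaped whitespace is dropped from the pattern.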

Common Pitfalls and Alternatives

Greedy matching catches many developers off guard:

html <- "<div>Hello</div><div>World</div>"

# Greedy: matches too much
str_extract(html, "<div>.*</div>")
# [1] "<div>Hello</div><div>World</div>"

# Lazy: matches minimally
str_extract(html, "<div>.*?</div>")
# [1] "<div>Hello</div>"
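The lazy quantifier pairs naturally with str_extract_all() when every block is wanted, and a negated character class is a common alternative to laziness. The sketch below shows both on the same flat string; it assumes the markup is not nested, which is where regex-based HTML handling stops being reliable.

```r
library(stringr)

html <- "<div>Hello</div><div>World</div>"

# Lazy quantifier: each match stops at the first closing tag
lazy <- str_extract_all(html, "<div>.*?</div>")[[1]]
lazy
# [1] "<div>Hello</div>" "<div>World</div>"

# Negated class: "anything except <" cannot run past the closing tag
negated <- str_extract_all(html, "<div>[^<]*</div>")[[1]]
negated
# [1] "<div>Hello</div>" "<div>World</div>"

# Lookarounds return only the text between the tags
inner <- str_extract_all(html, "(?<=<div>)[^<]*(?=</div>)")[[1]]
inner
# [1] "Hello" "World"
```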

By default, str_extract() returns the whole match and ignores capture groups. If you need the captured subgroups, use str_match():

text <- "John Smith (age 35)"

# str_extract returns entire match
str_extract(text, "(\\w+) (\\w+) \\(age (\\d+)\\)")
# [1] "John Smith (age 35)"

# str_match returns matrix with full match and captured groups
str_match(text, "(\\w+) (\\w+) \\(age (\\d+)\\)")
#      [,1]                  [,2]   [,3]    [,4]
# [1,] "John Smith (age 35)" "John" "Smith" "35"

Use str_match() when you need to extract multiple components from a structured pattern simultaneously. Use str_extract() when you need the whole match or a single component.
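Since str_match() is vectorized as well, its matrix can be turned into tidy columns in one step. The following is a sketch under assumed inputs: the extra sample strings and the column names full, first, last, and age are invented for illustration.

```r
library(stringr)

people <- c("John Smith (age 35)", "Ana Lopez (age 41)", "no match here")

# One row per input: the full match first, then one column per capture group
parts <- str_match(people, "(\\w+) (\\w+) \\(age (\\d+)\\)")

# Label the columns (hypothetical names) so they can be referenced by name
colnames(parts) <- c("full", "first", "last", "age")

result <- data.frame(
  first = parts[, "first"],
  last = parts[, "last"],
  age = as.integer(parts[, "age"]),
  stringsAsFactors = FALSE
)
result
#   first  last age
# 1  John Smith  35
# 2   Ana Lopez  41
# 3  <NA>  <NA>  NA
```

Inputs that do not match produce a row of NA values, so the result stays aligned with the original vector.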

Empty string vs NA confusion: Remember that str_extract() returns NA for no match, but str_extract_all() returns character(0). When using simplify = TRUE, missing matches become empty strings. Handle these cases explicitly in downstream processing.

String extraction forms the foundation of text data cleaning in R. Master these two functions and their regex patterns, and you’ll handle the vast majority of pattern extraction tasks efficiently.
