R stringr - str_sub() - Substring
Key Insights
- str_sub() supports negative indexing to count from the end of strings, making it far more intuitive than base R’s substr() for common operations like extracting file extensions or last N characters.
- The assignment form str_sub() <- enables in-place substring replacement, which is essential for data masking, formatting corrections, and transforming fixed-width data.
- Full vectorization means str_sub() integrates seamlessly with tidyverse workflows, processing entire columns without explicit loops or apply() calls.
Introduction to str_sub()
String manipulation is one of those tasks that seems simple until you’re knee-deep in edge cases. The str_sub() function from the stringr package handles substring extraction and replacement with a clean, consistent interface that eliminates most of the friction you’ll encounter with base R alternatives.
At its core, str_sub() extracts or replaces substrings based on character positions. What makes it worth using over substr() is the thoughtful API design: negative indexing, consistent vectorization, and predictable behavior with edge cases. If you’re doing any serious text processing in R, this function belongs in your toolkit.
library(stringr)
library(dplyr)
Basic Syntax and Parameters
The function signature is straightforward:
str_sub(string, start = 1L, end = -1L)
Three parameters: the input string, the starting position, and the ending position. Both start and end are inclusive, meaning the characters at those positions are included in the result. The defaults extract the entire string.
text <- "Application Architect"
# Extract characters 1 through 11
str_sub(text, 1, 11)
#> [1] "Application"
# Extract from position 13 to end (default)
str_sub(text, 13)
#> [1] "Architect"
# Extract first 5 characters
str_sub(text, 1, 5)
#> [1] "Appli"
# Omit both start and end to get entire string
str_sub(text)
#> [1] "Application Architect"
Position counting starts at 1, not 0. This matches R’s general indexing convention and feels natural for most users. If you specify positions beyond the string length, str_sub() silently handles it by returning what’s available rather than throwing an error.
# Request more characters than exist
str_sub("hello", 1, 100)
#> [1] "hello"
# Start position beyond string length
str_sub("hello", 10, 15)
#> [1] ""
This graceful degradation prevents crashes when processing variable-length data.
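To make the variable-length case concrete, here is a minimal sketch using a made-up vector of IDs (the `ids` values are illustrative, not from any real dataset). It includes a short string, an empty string, and a missing value to show what comes back for each.

```r
library(stringr)

# Hypothetical IDs of uneven length, including an empty string and a missing value
ids <- c("AB-1234", "AB-12", "", NA)

# Characters 4 to 7 where they exist; shorter inputs return whatever is available
suffix <- str_sub(ids, 4, 7)
suffix
#> [1] "1234" "12"   ""     NA
```

Note that a missing input stays missing rather than becoming an empty string, so NA values remain distinguishable from short strings downstream.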
Negative Indexing
Here’s where str_sub() pulls ahead of base R. Negative indices count backward from the end of the string, with -1 representing the last character.
filename <- "report_2024.csv"
# Extract last 3 characters (file extension without dot)
str_sub(filename, -3, -1)
#> [1] "csv"
# Extract last 4 characters (extension with dot)
str_sub(filename, -4)
#> [1] ".csv"
# Extract everything except last 4 characters
str_sub(filename, 1, -5)
#> [1] "report_2024"
# Extract everything except first 7 characters
str_sub(filename, 8)
#> [1] "2024.csv"
You can mix positive and negative indices freely:
code <- "ABC-12345-XYZ"
# From position 5 to the 5th character from the end
str_sub(code, 5, -5)
#> [1] "12345"
# Last 3 characters
str_sub(code, -3)
#> [1] "XYZ"
# The same call read another way: drop the first 4 and the last 4 characters
str_sub(code, 5, -5)
#> [1] "12345"
This negative indexing is invaluable when you need the end of strings but don’t know (or don’t want to calculate) their lengths. Extracting file extensions, domain suffixes, or trailing codes becomes trivial.
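As a small illustration of trailing codes on strings of different lengths, the sketch below splits a value from a fixed two-character unit suffix. The `weights` vector is invented for the example, and the approach assumes every element really does end in a two-character unit.

```r
library(stringr)

# Hypothetical measurements with a two-character unit suffix
weights <- c("120kg", "75kg", "1500kg")

unit  <- str_sub(weights, -2)                 # last two characters
value <- as.numeric(str_sub(weights, 1, -3))  # everything before the unit

unit
#> [1] "kg" "kg" "kg"
value
#> [1]  120   75 1500
```

No string lengths are computed anywhere, even though the numeric part varies from two to four digits.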
Working with Vectors
str_sub() is fully vectorized across all three arguments. Pass a vector of strings, and it processes each element. Pass vectors for start or end, and it applies them element-wise.
product_codes <- c("PRD-001-A", "PRD-002-B", "PRD-003-C", "PRD-004-D")
# Extract prefix from all codes
str_sub(product_codes, 1, 3)
#> [1] "PRD" "PRD" "PRD" "PRD"
# Extract numeric portion
str_sub(product_codes, 5, 7)
#> [1] "001" "002" "003" "004"
# Extract suffix
str_sub(product_codes, -1)
#> [1] "A" "B" "C" "D"
This vectorization integrates naturally with dplyr workflows:
orders <- tibble(
  order_id = c("ORD-2024-0001", "ORD-2024-0002", "ORD-2023-0157"),
  customer = c("Alice", "Bob", "Carol")
)
orders %>%
  mutate(
    year = str_sub(order_id, 5, 8),
    sequence = str_sub(order_id, -4)
  )
#> # A tibble: 3 × 4
#>   order_id      customer year  sequence
#>   <chr>         <chr>    <chr> <chr>
#> 1 ORD-2024-0001 Alice    2024  0001
#> 2 ORD-2024-0002 Bob      2024  0002
#> 3 ORD-2023-0157 Carol    2023  0157
You can also use different positions for different elements by passing vectors to start and end:
strings <- c("abcdef", "ghijkl", "mnopqr")
starts <- c(1, 2, 3)
ends <- c(2, 4, 6)
str_sub(strings, starts, ends)
#> [1] "ab" "hij" "opqr"
This is useful when parsing data where the relevant positions vary by row.
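One way the per-row positions might be obtained in practice is to look up a delimiter with str_locate() and feed the result into str_sub(). This is a sketch under the assumption that every element contains exactly one hyphen; the `codes` vector is hypothetical.

```r
library(stringr)

# Hypothetical codes whose prefix length varies by row
codes <- c("A-100", "BB-2000", "CCC-30")

# str_locate() returns a matrix with "start" and "end" columns for the first match
dash <- str_locate(codes, "-")[, "start"]

prefix <- str_sub(codes, 1, dash - 1)  # everything before the hyphen
number <- str_sub(codes, dash + 1)     # everything after the hyphen

prefix
#> [1] "A"   "BB"  "CCC"
number
#> [1] "100"  "2000" "30"
```

Because `dash` is a vector, each string is cut at its own delimiter position in a single vectorized call.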
Substring Replacement
The assignment form str_sub() <- replaces substrings in place. This is cleaner than concatenating pieces together manually.
greeting <- "Hello World"
str_sub(greeting, 7, 11) <- "R User"
greeting
#> [1] "Hello R User"
The replacement string doesn’t need to match the length of the substring being replaced:
text <- "I love Python"
str_sub(text, 8, 13) <- "R"
text
#> [1] "I love R"
A practical application is masking sensitive data:
phone_numbers <- c("555-123-4567", "555-987-6543", "555-456-7890")
# Mask middle digits
masked <- phone_numbers
str_sub(masked, 5, 7) <- "XXX"
masked
#> [1] "555-XXX-4567" "555-XXX-6543" "555-XXX-7890"
For credit card masking:
card_numbers <- c("4111111111111111", "5500000000000004")
# Show only last 4 digits
masked_cards <- card_numbers
str_sub(masked_cards, 1, -5) <- strrep("*", nchar(card_numbers) - 4)
masked_cards
#> [1] "************1111" "************0004"
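The masking pattern above can be wrapped in a small helper so it works for inputs of different lengths. The function name `mask_all_but_last` and its arguments are hypothetical, and the sketch assumes every input is longer than `keep`; shorter inputs are not handled.

```r
library(stringr)

# Hypothetical helper: mask everything except the last `keep` characters.
# Assumes every input has more than `keep` characters.
mask_all_but_last <- function(x, keep = 4, mask_char = "*") {
  n_masked <- str_length(x) - keep
  # Replace positions 1 through (keep + 1) from the end with one mask_char each
  str_sub(x, 1, -(keep + 1)) <- str_dup(mask_char, n_masked)
  x
}

cards <- c("4111111111111111", "378282246310005")
masked <- mask_all_but_last(cards)
masked
#> [1] "************1111" "***********0005"
```

Using str_length() and str_dup() keeps the helper within stringr, and the mask length adapts to each element (12 characters for the 16-digit number, 11 for the 15-digit one).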
Practical Use Cases
Parsing Fixed-Width Data
Fixed-width formats are common in legacy systems and government data. str_sub() handles them cleanly:
# Census-style fixed width records
# Columns: first name (1-10), last name (11-20), birth date (21-28), location (29-31)
records <- c(
  "John      Smith     19850315NYC",
  "Jane      Doe       19901122LAX",
  "Bob       Johnson   19780704CHI"
)
parsed <- tibble(raw = records) %>%
  mutate(
    first_name = str_trim(str_sub(raw, 1, 10)),
    last_name = str_trim(str_sub(raw, 11, 20)),
    birth_date = str_sub(raw, 21, 28),
    location = str_sub(raw, 29, 31)
  ) %>%
  select(-raw)
parsed
#> # A tibble: 3 × 4
#>   first_name last_name birth_date location
#>   <chr>      <chr>     <chr>      <chr>
#> 1 John       Smith     19850315   NYC
#> 2 Jane       Doe       19901122   LAX
#> 3 Bob        Johnson   19780704   CHI
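When a layout has many fields, typing start and end positions by hand gets error-prone. A possible alternative, sketched here, derives the positions from a named vector of column widths. The `widths` vector and the use of str_pad() to build the sample records are assumptions made for illustration.

```r
library(stringr)

# Build sample fixed-width records explicitly so the padding is unambiguous
records <- c(
  paste0(str_pad("John", 10, "right"), str_pad("Smith", 10, "right"), "19850315", "NYC"),
  paste0(str_pad("Jane", 10, "right"), str_pad("Doe", 10, "right"), "19901122", "LAX")
)

# Hypothetical layout: field name -> width in characters
widths <- c(first_name = 10, last_name = 10, birth_date = 8, location = 3)
ends   <- cumsum(widths)       # inclusive end position of each field
starts <- ends - widths + 1    # inclusive start position of each field

# Extract and trim one field at a time, across all records
fields <- lapply(seq_along(widths), function(i) {
  str_trim(str_sub(records, starts[[i]], ends[[i]]))
})
names(fields) <- names(widths)
parsed_fields <- as.data.frame(fields, stringsAsFactors = FALSE)
parsed_fields
```

Changing the layout then means editing only `widths`; the start and end positions follow from it.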
Extracting Date Components
When dates come as strings in a known format:
timestamps <- c("2024-03-15 14:30:00", "2024-03-16 09:15:30", "2024-03-17 18:45:00")
tibble(timestamp = timestamps) %>%
  mutate(
    date = str_sub(timestamp, 1, 10),
    year = str_sub(timestamp, 1, 4),
    month = str_sub(timestamp, 6, 7),
    day = str_sub(timestamp, 9, 10),
    time = str_sub(timestamp, 12, 19),
    hour = str_sub(timestamp, 12, 13)
  )
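Every column extracted this way is still a character vector. If typed values are needed, one option is to convert the pieces right after slicing, as in this sketch (the two-element `timestamps` vector is a shortened stand-in for the data above).

```r
library(stringr)

timestamps <- c("2024-03-15 14:30:00", "2024-03-16 09:15:30")

date <- as.Date(str_sub(timestamps, 1, 10))      # ISO yyyy-mm-dd parses directly
hour <- as.integer(str_sub(timestamps, 12, 13))  # "09" becomes 9

date
#> [1] "2024-03-15" "2024-03-16"
hour
#> [1] 14  9
```

This only works when every timestamp follows the same layout; a differently formatted row would yield wrong or missing values without warning from str_sub() itself.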
Working with Standardized Codes
ZIP codes, ISBNs, and product identifiers often have meaningful segments:
zip_codes <- c("10001-1234", "90210-5678", "60601-9999")
tibble(zip = zip_codes) %>%
  mutate(
    zip5 = str_sub(zip, 1, 5),
    plus4 = str_sub(zip, -4)
  )
#> # A tibble: 3 × 3
#>   zip        zip5  plus4
#>   <chr>      <chr> <chr>
#> 1 10001-1234 10001 1234
#> 2 90210-5678 90210 5678
#> 3 60601-9999 60601 9999
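Position-based extraction assumes every value has the expected shape. If a column mixes ZIP+4 codes with plain five-digit ZIPs, taking the last four characters of a short value would return part of the ZIP itself. A minimal guard, assuming a ZIP+4 value is always exactly 10 characters long, might look like this:

```r
library(stringr)

# Hypothetical mix of ZIP+4 and plain 5-digit ZIP codes
zips <- c("10001-1234", "90210", "60601-9999")

has_plus4 <- str_length(zips) == 10
zip5  <- str_sub(zips, 1, 5)
# Only take the last four characters when the +4 segment is actually present
plus4 <- ifelse(has_plus4, str_sub(zips, -4), NA_character_)

zip5
#> [1] "10001" "90210" "60601"
plus4
#> [1] "1234" NA     "9999"
```

Without the length check, the second element would come back as "0210", which looks plausible but is wrong.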
Comparison with Alternatives
Base R offers substr() and substring(). Here’s how they compare:
text <- "Hello World"
# All three work for basic extraction
substr(text, 1, 5)
#> [1] "Hello"
substring(text, 1, 5)
#> [1] "Hello"
str_sub(text, 1, 5)
#> [1] "Hello"
# Negative indexing: only str_sub supports it
str_sub(text, -5)
#> [1] "World"
# substr() does not count from the end, so this returns an empty string
substr(text, -5, -1)
#> [1] ""
# Out-of-bounds handling
substr(text, 1, 100)
#> [1] "Hello World"
str_sub(text, 1, 100)
#> [1] "Hello World" # Same behavior here
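For reference, the base R way to get the last few characters is to do the length arithmetic yourself. The sketch below, using an illustrative `files` vector, shows both approaches producing the same result.

```r
library(stringr)

files <- c("report_2024.csv", "notes.txt", "archive.tar.gz")

n <- nchar(files)
base_ext    <- substr(files, n - 2, n)  # manual arithmetic on string lengths
stringr_ext <- str_sub(files, -3)       # same result, no length bookkeeping

base_ext
#> [1] "csv" "txt" ".gz"
identical(base_ext, stringr_ext)
#> [1] TRUE
```

Both versions are vectorized; the difference is only in how much position bookkeeping the caller has to do.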
Use str_extract() when you need pattern-based extraction rather than position-based:
# Position-based: use str_sub
str_sub("order-12345", 7, 11)
#> [1] "12345"
# Pattern-based: use str_extract
str_extract("order-12345", "\\d+")
#> [1] "12345"
The rule is simple: if you know the positions, use str_sub(). If you need to find patterns, use str_extract(). For most fixed-format data processing, str_sub() is the right choice: it involves no regular-expression matching, which usually makes it faster, and it is explicit about exactly which characters you’re extracting.
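The distinction matters most when the layout is not actually fixed. In this sketch, the second ID (invented for the example) has a shorter prefix, so the position-based call quietly returns the wrong piece while the pattern-based call still finds the digits.

```r
library(stringr)

ids <- c("order-12345", "ord-987")

by_position <- str_sub(ids, 7, 11)       # assumes a 6-character prefix
by_pattern  <- str_extract(ids, "\\d+")  # first run of digits, wherever it starts

by_position
#> [1] "12345" "7"
by_pattern
#> [1] "12345" "987"
```

Neither call raises an error on the second element, so it is worth checking that the data really is fixed-format before relying on positions.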