R stringr - str_sub() - Substring | Application Architect

Key Insights

str_sub() supports negative indexing to count from the end of strings, making it far more intuitive than base R’s substr() for common operations like extracting file extensions or last N characters.
The assignment form str_sub() <- enables in-place substring replacement, which is essential for data masking, formatting corrections, and transforming fixed-width data.
Full vectorization means str_sub() integrates seamlessly with tidyverse workflows, processing entire columns without explicit loops or apply() calls.

Introduction to str_sub()

String manipulation is one of those tasks that seems simple until you’re knee-deep in edge cases. The str_sub() function from the stringr package handles substring extraction and replacement with a clean, consistent interface that eliminates most of the friction you’ll encounter with base R alternatives.

At its core, str_sub() extracts or replaces substrings based on character positions. What makes it worth using over substr() is the thoughtful API design: negative indexing, consistent vectorization, and predictable behavior with edge cases. If you’re doing any serious text processing in R, this function belongs in your toolkit.

library(stringr)
library(dplyr)

Basic Syntax and Parameters

The function signature is straightforward:

str_sub(string, start = 1L, end = -1L)

Three parameters: the input string, the starting position, and the ending position. Both start and end are inclusive, meaning the characters at those positions are included in the result. The defaults extract the entire string.

text <- "Application Architect"

# Extract characters 1 through 11
str_sub(text, 1, 11)
#> [1] "Application"

# Extract from position 13 to end (default)
str_sub(text, 13)
#> [1] "Architect"

# Extract first 5 characters
str_sub(text, 1, 5)
#> [1] "Appli"

# Omit both start and end to get entire string
str_sub(text)
#> [1] "Application Architect"

Position counting starts at 1, not 0. This matches R’s general indexing convention and feels natural for most users. If you specify positions beyond the string length, str_sub() silently handles it by returning what’s available rather than throwing an error.

# Request more characters than exist
str_sub("hello", 1, 100)
#> [1] "hello"

# Start position beyond string length
str_sub("hello", 10, 15)
#> [1] ""

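This graceful degradation prevents errors when processing variable-length data. The flip side is that truncated or malformed records pass through silently, so check string lengths yourself when that matters.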
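
Two more edge cases are worth knowing about. The following is a small illustrative sketch (the inputs are made up for this example): missing values stay missing, and a start position that falls after the end position produces an empty string rather than an error.

```r
library(stringr)

# A missing value stays missing rather than becoming "NA" or ""
str_sub(NA_character_, 1, 3)
#> [1] NA

# A start position after the end position yields an empty string
str_sub("hello", 4, 2)
#> [1] ""

# Mixed-length input: short strings are truncated, not padded
mixed <- str_sub(c("hello", "hi", NA), 1, 3)
mixed
#> [1] "hel" "hi"  NA
```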

Negative Indexing

Here’s where str_sub() pulls ahead of base R. Negative indices count backward from the end of the string, with -1 representing the last character.

filename <- "report_2024.csv"

# Extract last 3 characters (file extension without dot)
str_sub(filename, -3, -1)
#> [1] "csv"

# Extract last 4 characters (extension with dot)
str_sub(filename, -4)
#> [1] ".csv"

# Extract everything except last 4 characters
str_sub(filename, 1, -5)
#> [1] "report_2024"

# Extract everything except first 7 characters
str_sub(filename, 8)
#> [1] "2024.csv"

You can mix positive and negative indices freely:

code <- "ABC-12345-XYZ"

# From position 5 to the 5th character from the end
# (drops the first 4 and last 4 characters)
str_sub(code, 5, -5)
#> [1] "12345"

# Last 3 characters
str_sub(code, -3)
#> [1] "XYZ"

This negative indexing is invaluable when you need the end of strings but don’t know (or don’t want to calculate) their lengths. Extracting file extensions, domain suffixes, or trailing codes becomes trivial.
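
To illustrate the point about unknown lengths, here is a short sketch using hypothetical invoice IDs (the `ids` vector is invented for this example). The same negative indices work on every element even though the strings differ in length.

```r
library(stringr)

# Hypothetical invoice IDs of different lengths
ids <- c("INV-7-0042", "INV-2024-0107", "X-0009")

# Last four characters, regardless of total length
str_sub(ids, -4)
#> [1] "0042" "0107" "0009"

# Everything before the trailing "-NNNN" block
str_sub(ids, 1, -6)
#> [1] "INV-7"    "INV-2024" "X"
```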

Working with Vectors

str_sub() is fully vectorized across all three arguments. Pass a vector of strings, and it processes each element. Pass vectors for start or end, and it applies them element-wise.

product_codes <- c("PRD-001-A", "PRD-002-B", "PRD-003-C", "PRD-004-D")

# Extract prefix from all codes
str_sub(product_codes, 1, 3)
#> [1] "PRD" "PRD" "PRD" "PRD"

# Extract numeric portion
str_sub(product_codes, 5, 7)
#> [1] "001" "002" "003" "004"

# Extract suffix
str_sub(product_codes, -1)
#> [1] "A" "B" "C" "D"

This vectorization integrates naturally with dplyr workflows:

orders <- tibble(
  order_id = c("ORD-2024-0001", "ORD-2024-0002", "ORD-2023-0157"),
  customer = c("Alice", "Bob", "Carol")
)

orders %>%
  mutate(
    year = str_sub(order_id, 5, 8),
    sequence = str_sub(order_id, -4)
  )
#> # A tibble: 3 × 4
#>   order_id      customer year  sequence
#>   <chr>         <chr>    <chr> <chr>   
#> 1 ORD-2024-0001 Alice    2024  0001    
#> 2 ORD-2024-0002 Bob      2024  0002    
#> 3 ORD-2023-0157 Carol    2023  0157

You can also use different positions for different elements by passing vectors to start and end:

strings <- c("abcdef", "ghijkl", "mnopqr")
starts <- c(1, 2, 3)
ends <- c(2, 4, 6)

str_sub(strings, starts, ends)
#> [1] "ab"   "hij"  "opqr"

This is useful when parsing data where the relevant positions vary by row.
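
One way to get per-row positions is to compute them with str_locate() and pass the result straight to str_sub(). This is a minimal sketch with made-up email addresses, assuming every string contains exactly one "@".

```r
library(stringr)

emails <- c("alice@example.com", "bob@test.org")

# Position of the "@" in each string (the "start" column of the match matrix)
at <- str_locate(emails, "@")[, "start"]
at
#> [1] 6 4

# Everything before the "@"
str_sub(emails, 1, at - 1)
#> [1] "alice" "bob"

# Everything after the "@"
str_sub(emails, at + 1)
#> [1] "example.com" "test.org"
```

If a string has no "@", str_locate() returns NA for it and the corresponding result is NA as well, so real data may need a check first.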

Substring Replacement

The assignment form str_sub() <- replaces substrings in place. This is cleaner than concatenating pieces together manually.

greeting <- "Hello World"
str_sub(greeting, 7, 11) <- "R User"
greeting
#> [1] "Hello R User"

The replacement string doesn’t need to match the length of the substring being replaced:

text <- "I love Python"
str_sub(text, 8, 13) <- "R"
text
#> [1] "I love R"

A practical application is masking sensitive data:

phone_numbers <- c("555-123-4567", "555-987-6543", "555-456-7890")

# Mask middle digits
masked <- phone_numbers
str_sub(masked, 5, 7) <- "XXX"
masked
#> [1] "555-XXX-4567" "555-XXX-6543" "555-XXX-7890"

For credit card masking:

card_numbers <- c("4111111111111111", "5500000000000004")

# Show only last 4 digits
masked_cards <- card_numbers
str_sub(masked_cards, 1, -5) <- strrep("*", nchar(card_numbers) - 4)
masked_cards
#> [1] "************1111" "************0004"
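
The assignment form needs a variable to modify, which does not fit directly inside a mutate() call. One possible workaround, sketched here, is to wrap it in a small helper. The name `mask_middle` and the `contacts` data are hypothetical and exist only for this example.

```r
library(stringr)
library(dplyr)

# Hypothetical helper: the assignment form modifies the local copy `x`,
# which is then returned to the caller
mask_middle <- function(x, start = 5, end = 7, mask = "XXX") {
  str_sub(x, start, end) <- mask
  x
}

contacts <- tibble(
  name = c("Alice", "Bob"),
  phone = c("555-123-4567", "555-987-6543")
)

masked_contacts <- contacts %>%
  mutate(phone = mask_middle(phone))

masked_contacts$phone
#> [1] "555-XXX-4567" "555-XXX-6543"
```

Because R copies on modification, the original `contacts` tibble is left unchanged.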

Practical Use Cases

Parsing Fixed-Width Data

Fixed-width formats are common in legacy systems and government data. str_sub() handles them cleanly:

# Census-style fixed width records
records <- c(
  "John      Smith     19850315NYC",
  "Jane      Doe       19901122LAX",
  "Bob       Johnson   19780704CHI"
)

parsed <- tibble(raw = records) %>%
  mutate(
    first_name = str_trim(str_sub(raw, 1, 10)),
    last_name = str_trim(str_sub(raw, 11, 20)),
    birth_date = str_sub(raw, 21, 28),
    location = str_sub(raw, 29, 31)
  ) %>%
  select(-raw)

parsed
#> # A tibble: 3 × 4
#>   first_name last_name birth_date location
#>   <chr>      <chr>     <chr>      <chr>   
#> 1 John       Smith     19850315   NYC     
#> 2 Jane       Doe       19901122   LAX     
#> 3 Bob        Johnson   19780704   CHI
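
When a layout has many fields, hard-coding every start and end position gets error-prone. A possible alternative, shown as a sketch, is to describe the layout as a named vector of widths and derive the positions from it. The `widths` vector below is an assumption that mirrors the record layout above, and the records are rebuilt with str_pad() so the column widths are explicit.

```r
library(stringr)

# Rebuild the sample records with explicit padding to 10 characters per name
records <- paste0(
  str_pad(c("John", "Jane", "Bob"), 10, side = "right"),
  str_pad(c("Smith", "Doe", "Johnson"), 10, side = "right"),
  c("19850315", "19901122", "19780704"),
  c("NYC", "LAX", "CHI")
)

# Assumed layout: field name -> width in characters
widths <- c(first_name = 10, last_name = 10, birth_date = 8, location = 3)

# Derive inclusive start/end positions from the widths
ends <- cumsum(widths)
starts <- ends - widths + 1

# Extract and trim each field across all records; names carry over from `starts`
fields <- Map(
  function(s, e) str_trim(str_sub(records, s, e)),
  starts, ends
)
parsed_spec <- as.data.frame(fields, stringsAsFactors = FALSE)

parsed_spec$last_name
#> [1] "Smith"   "Doe"     "Johnson"
```

For whole files, readr's read_fwf() covers the same ground; the approach here is mainly useful when the records are already in a character vector or a data frame column.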

Extracting Date Components

When dates arrive as strings in a known format, each component sits at a fixed position:

timestamps <- c("2024-03-15 14:30:00", "2024-03-16 09:15:30", "2024-03-17 18:45:00")

tibble(timestamp = timestamps) %>%
  mutate(
    date = str_sub(timestamp, 1, 10),
    year = str_sub(timestamp, 1, 4),
    month = str_sub(timestamp, 6, 7),
    day = str_sub(timestamp, 9, 10),
    time = str_sub(timestamp, 12, 19),
    hour = str_sub(timestamp, 12, 13)
  )
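
Keep in mind that str_sub() always returns character vectors, so the extracted pieces usually need converting before any arithmetic or date logic. A brief sketch using base R conversions on a subset of the timestamps:

```r
library(stringr)

timestamps <- c("2024-03-15 14:30:00", "2024-03-16 09:15:30")

# The raw result is character, including the leading zero
month_chr <- str_sub(timestamps, 6, 7)
month_chr
#> [1] "03" "03"

# Convert to integer for numeric work
as.integer(month_chr)
#> [1] 3 3

hours <- as.integer(str_sub(timestamps, 12, 13))
hours
#> [1] 14  9

# The first 10 characters are already in ISO format, so as.Date() parses them directly
dates <- as.Date(str_sub(timestamps, 1, 10))
dates
#> [1] "2024-03-15" "2024-03-16"
```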

Working with Standardized Codes

ZIP codes, ISBNs, and product identifiers often have meaningful segments:

zip_codes <- c("10001-1234", "90210-5678", "60601-9999")

tibble(zip = zip_codes) %>%
  mutate(
    zip5 = str_sub(zip, 1, 5),
    plus4 = str_sub(zip, -4)
  )
#> # A tibble: 3 × 3
#>   zip        zip5  plus4
#>   <chr>      <chr> <chr>
#> 1 10001-1234 10001 1234 
#> 2 90210-5678 90210 5678 
#> 3 60601-9999 60601 9999

Comparison with Alternatives

Base R offers substr() and substring(). Here’s how they compare:

text <- "Hello World"

# All three work for basic extraction
substr(text, 1, 5)
#> [1] "Hello"
substring(text, 1, 5)
#> [1] "Hello"
str_sub(text, 1, 5)
#> [1] "Hello"

# Negative indexing: only str_sub supports it
str_sub(text, -5)
#> [1] "World"
# substr(text, -5, -1)  # Returns "" because substr() does not count from the end

# Out-of-bounds handling: both truncate to the available characters
substr(text, 1, 100)
#> [1] "Hello World"
str_sub(text, 1, 100)
#> [1] "Hello World"
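
For comparison, this is roughly what the "last five characters" operation looks like in base R, where the start position has to be computed from each string's length. The `words` vector is invented for this example.

```r
library(stringr)

words <- c("Hello World", "stringr")

# Base R: derive the start position from nchar()
n <- nchar(words)
base_result <- substr(words, n - 4, n)
base_result
#> [1] "World" "ringr"

# stringr: a negative start does the same job
stringr_result <- str_sub(words, -5)
stringr_result
#> [1] "World" "ringr"
```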

Use str_extract() when you need pattern-based extraction rather than position-based:

# Position-based: use str_sub
str_sub("order-12345", 7, 11)
#> [1] "12345"

# Pattern-based: use str_extract
str_extract("order-12345", "\\d+")
#> [1] "12345"
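
The difference shows up as soon as the layout varies. In this sketch, with a made-up second ID that uses a shorter prefix, the fixed positions no longer line up while the pattern still finds the digits.

```r
library(stringr)

ids <- c("order-12345", "ord-987")

# Fixed positions assume every string has the same layout
by_position <- str_sub(ids, 7, 11)
by_position
#> [1] "12345" "7"

# A pattern adapts to where the digits actually are
by_pattern <- str_extract(ids, "\\d+")
by_pattern
#> [1] "12345" "987"
```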

The rule is simple: if you know the positions, use str_sub(). If you need to find patterns, use str_extract(). For most fixed-format data processing, str_sub() is the right choice: it states exactly which characters you are extracting and involves no pattern matching at all.