R stringr - str_split() with Examples
Key Insights
- str_split() returns a list by default, which handles variable-length results when splitting vectors of strings. Use simplify = TRUE for matrix output or str_split_1() for single strings.
- The n parameter caps the number of pieces, which makes it the right tool for extracting portions such as “everything before the first delimiter” while leaving the rest of the string unsplit.
- Combine regex patterns with str_split() for powerful parsing: split on multiple delimiters, whitespace sequences, or complex patterns in a single operation.
Introduction to str_split()
String manipulation sits at the heart of data cleaning and text processing. The str_split() function from R’s stringr package provides a consistent, readable way to break strings into pieces based on patterns. Whether you’re parsing log files, processing CSV data, or extracting components from formatted strings, str_split() handles the job cleanly.
The function splits a string wherever it finds a matching pattern, returning the pieces between matches. By default, it returns a list—a design choice that becomes clear when you process vectors of strings with varying numbers of splits.
library(stringr)
# Basic split on comma
text <- "apple,banana,cherry"
str_split(text, ",")
[[1]]
[1] "apple" "banana" "cherry"
Notice the list wrapper around the result. This structure accommodates the reality that different strings may split into different numbers of pieces.
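To see that variable-length behavior directly, the sketch below splits a small vector whose elements contain different numbers of commas and counts the pieces with base R's lengths(). The basket vector is made up for illustration.

```r
library(stringr)

# Hypothetical input: each element has a different number of items
basket <- c("apple,banana", "kiwi", "plum,pear,fig,lime")

pieces <- str_split(basket, ",")

# One list element per input string
length(pieces)
# [1] 3

# Number of pieces inside each list element
piece_counts <- lengths(pieces)
piece_counts
# [1] 2 1 4
```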
Function Syntax and Parameters
The full function signature gives you control over the splitting behavior:
str_split(string, pattern, n = Inf, simplify = FALSE)
The parameters break down as follows:
- string: Character vector to split
- pattern: What to split on (string or regex)
- n: Maximum number of pieces (default: unlimited)
- simplify: Return a matrix instead of a list
The simplify parameter deserves attention. When FALSE (the default), you get a list. When TRUE, you get a character matrix with one row per input string. The choice determines how you index the result later: [[i]] for list elements, [i, j] for matrix cells.
fruits <- "apple,banana,cherry"
# Default: returns a list
str_split(fruits, ",", simplify = FALSE)
[[1]]
[1] "apple" "banana" "cherry"
# Simplified: returns a matrix
str_split(fruits, ",", simplify = TRUE)
     [,1]    [,2]     [,3]
[1,] "apple" "banana" "cherry"
The matrix form integrates better with data frame operations, while the list form preserves the variable-length nature of split results.
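One detail worth knowing before relying on the matrix form: when the inputs split into different numbers of pieces, simplify = TRUE pads the shorter rows with empty strings so that every row has the same width. A minimal sketch, using a made-up uneven vector:

```r
library(stringr)

# Hypothetical input: the first string has 3 pieces, the second has 2
uneven <- c("apple,banana,cherry", "kiwi,lime")

m <- str_split(uneven, ",", simplify = TRUE)

dim(m)
# [1] 2 3

# The missing third piece of the second row is an empty string
m[2, 3]
# [1] ""
```

If empty strings would be ambiguous in your data, keeping the list form and handling lengths explicitly may be the safer route.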
Splitting with Different Pattern Types
Real data rarely uses just commas. You’ll encounter various delimiters and need different splitting strategies.
Single character delimiters:
# Split on pipe
str_split("red|green|blue", "\\|")[[1]]
[1] "red" "green" "blue"
Note the escaped pipe—it’s a regex metacharacter.
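If the delimiter is meant literally, wrapping it in stringr's fixed() avoids the escaping altogether. The sketch below should give the same result as the escaped version; which form reads better is largely a matter of taste.

```r
library(stringr)

# fixed() treats the pattern as a literal string, so "|" needs no escaping
colors <- str_split("red|green|blue", fixed("|"))[[1]]
colors
# [1] "red"   "green" "blue"

# The same idea works for other metacharacters such as "."
version_parts <- str_split("1.4.2", fixed("."))[[1]]
version_parts
# [1] "1" "4" "2"
```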
Multi-character delimiters:
# Split on " - " (space-dash-space)
str_split("New York - Los Angeles - Chicago", " - ")[[1]]
[1] "New York" "Los Angeles" "Chicago"
Whitespace sequences:
# Split on any whitespace (handles multiple spaces, tabs)
messy_text <- "word1   word2\tword3  word4"
str_split(messy_text, "\\s+")[[1]]
[1] "word1" "word2" "word3" "word4"
The \\s+ pattern matches one or more whitespace characters, handling inconsistent spacing gracefully.
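One caveat: if the string starts or ends with whitespace, the split produces empty strings at the edges, because there is nothing before the first match or after the last one. A common remedy, sketched here with a made-up input, is to normalize the whitespace with str_squish() first.

```r
library(stringr)

# Hypothetical input with leading and trailing spaces
padded <- "  alpha   beta\tgamma  "

# Splitting directly leaves empty strings at both ends
raw_pieces <- str_split(padded, "\\s+")[[1]]
raw_pieces
# [1] ""      "alpha" "beta"  "gamma" ""

# str_squish() trims the ends and collapses internal runs to single spaces
clean_pieces <- str_split(str_squish(padded), " ")[[1]]
clean_pieces
# [1] "alpha" "beta"  "gamma"
```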
Multiple delimiters with a character class:
# Split on comma, semicolon, or pipe
mixed <- "a,b;c|d,e"
str_split(mixed, "[,;|]")[[1]]
[1] "a" "b" "c" "d" "e"
Character classes ([...]) provide a clean way to specify multiple single-character delimiters.
Multi-character delimiters with regex alternation:
# Split on "and" or "or"; the surrounding \\s+ keeps the pattern from matching inside longer words
sentence <- "cats and dogs or birds and fish"
str_split(sentence, "\\s+(and|or)\\s+")[[1]]
[1] "cats" "dogs" "birds" "fish"
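The whitespace on both sides of the alternation is what keeps the split from firing inside longer words. As a rough illustration with a made-up sentence, compare a bare and|or pattern with the whitespace-delimited one:

```r
library(stringr)

# Hypothetical input: "sandy" contains "and", "gordon" contains "or"
people <- "sandy and gordon"

# Bare alternation also matches inside the names
naive <- str_split(people, "and|or")[[1]]
naive
# [1] "s"   "y "  " g"  "don"

# Requiring surrounding whitespace matches only the standalone word
safe <- str_split(people, "\\s+(and|or)\\s+")[[1]]
safe
# [1] "sandy"  "gordon"
```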
Controlling the Number of Splits with n
The n parameter limits how many pieces you get. This proves invaluable when you only need specific portions of a string.
# Split into at most 2 pieces
path <- "first-middle-last"
str_split(path, "-", n = 2)[[1]]
[1] "first" "middle-last"
With n = 2, you get the first piece and everything else. This pattern appears constantly in real parsing tasks.
Extract the first element only:
# Get everything before the first colon
log_entry <- "ERROR: Connection failed: timeout"
str_split(log_entry, ": ", n = 2)[[1]][1]
[1] "ERROR"
Extract the last element:
# Get file extension (split and take last)
filename <- "report.2024.final.pdf"
parts <- str_split(filename, "\\.")[[1]]
parts[length(parts)]
[1] "pdf"
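Note that n always counts pieces from the left, so it cannot by itself return the trailing segments. A simple way to get the last few pieces, sketched below, is to split fully and pass the result to base R's tail().

```r
library(stringr)

filename <- "report.2024.final.pdf"
parts <- str_split(filename, "\\.")[[1]]

# Last two segments
last_two <- tail(parts, 2)
last_two
# [1] "final" "pdf"

# Everything except the extension, rejoined with dots
stem <- paste(head(parts, -1), collapse = ".")
stem
# [1] "report.2024.final"
```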
Practical example—parsing key-value pairs:
config_line <- "database_host=localhost:5432"
str_split(config_line, "=", n = 2)[[1]]
[1] "database_host" "localhost:5432"
Using n = 2 ensures that values containing = don’t get incorrectly split.
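To make the difference concrete, here is a small sketch with a hypothetical config line whose value itself contains an equals sign:

```r
library(stringr)

# Hypothetical line: the value part contains "="
config_line <- "jdbc_options=ssl=true"

# An unlimited split breaks the value apart
all_pieces <- str_split(config_line, "=")[[1]]
all_pieces
# [1] "jdbc_options" "ssl"          "true"

# n = 2 stops after the first delimiter and keeps the value intact
key_value <- str_split(config_line, "=", n = 2)[[1]]
key_value
# [1] "jdbc_options" "ssl=true"
```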
Working with Vectors of Strings
The list return type shines when processing multiple strings with potentially different split counts.
names <- c("John Smith", "Mary Jane Watson", "Cher")
str_split(names, " ")
[[1]]
[1] "John" "Smith"
[[2]]
[1] "Mary" "Jane" "Watson"
[[3]]
[1] "Cher"
Each input string produces its own vector within the list. To extract specific positions, use sapply() or purrr functions:
# Extract first names
sapply(str_split(names, " "), `[`, 1)
[1] "John" "Mary" "Cher"
# Extract last names (last element of each split)
sapply(str_split(names, " "), function(x) x[length(x)])
[1] "Smith" "Watson" "Cher"
With purrr:
library(purrr)
# First names
map_chr(str_split(names, " "), 1)
# Last names
map_chr(str_split(names, " "), ~ .x[length(.x)])
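sapply() picks its return type from the results, which can surprise you on empty input. If you prefer base R but want a guaranteed character vector, vapply() is a stricter alternative. A minimal sketch, using the same names under a different variable name (full_names) to keep the block self-contained:

```r
library(stringr)

full_names <- c("John Smith", "Mary Jane Watson", "Cher")
pieces <- str_split(full_names, " ")

# character(1) declares that each result must be a single string
first_names <- vapply(pieces, function(x) x[1], character(1))
last_names  <- vapply(pieces, function(x) x[length(x)], character(1))

first_names
# [1] "John" "Mary" "Cher"
last_names
# [1] "Smith"  "Watson" "Cher"
```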
When all strings split into the same number of pieces, simplify = TRUE creates a clean matrix:
dates <- c("2024-01-15", "2024-02-20", "2024-03-25")
str_split(dates, "-", simplify = TRUE)
     [,1]   [,2] [,3]
[1,] "2024" "01" "15"
[2,] "2024" "02" "20"
[3,] "2024" "03" "25"
This integrates directly into data frame creation:
date_parts <- str_split(dates, "-", simplify = TRUE)
data.frame(
  year = date_parts[, 1],
  month = date_parts[, 2],
  day = date_parts[, 3]
)
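The matrix holds character data, so the resulting columns are strings. If numeric columns are wanted, one option is to convert each column while building the data frame, as in this sketch:

```r
library(stringr)

dates <- c("2024-01-15", "2024-02-20", "2024-03-25")
date_parts <- str_split(dates, "-", simplify = TRUE)

# as.integer() turns "01" into 1, "02" into 2, and so on
date_df <- data.frame(
  year  = as.integer(date_parts[, 1]),
  month = as.integer(date_parts[, 2]),
  day   = as.integer(date_parts[, 3])
)

# Inspect the column types
str(date_df)
```

For real date work, parsing with as.Date() is usually the more robust choice; the split is mainly useful when the components are needed separately.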
Related Functions: str_split_fixed() and str_split_1()
stringr provides specialized variants for common scenarios.
str_split_1() handles single strings directly, returning a vector instead of a list:
# No list wrapper needed
str_split_1("a,b,c", ",")
[1] "a" "b" "c"
This eliminates the [[1]] indexing when you know you have exactly one string.
str_split_fixed() always returns a matrix with a specified number of columns:
str_split_fixed("a,b,c,d,e", ",", n = 3)
     [,1] [,2] [,3]
[1,] "a"  "b"  "c,d,e"
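When a string has fewer than n pieces, str_split_fixed() pads the remaining columns with empty strings, so the shape stays predictable. A short sketch with made-up inputs:

```r
library(stringr)

# Hypothetical inputs with 5, 2, and 1 pieces
rows <- c("a,b,c,d,e", "x,y", "z")

fixed_parts <- str_split_fixed(rows, ",", n = 3)
fixed_parts
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c,d,e"
# [2,] "x"  "y"  ""
# [3,] "z"  ""   ""
```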
Compare all three on the same input:
text <- "one:two:three"
# str_split: list
str_split(text, ":")
# [[1]]
# [1] "one" "two" "three"
# str_split_1: vector (single string only)
str_split_1(text, ":")
# [1] "one" "two" "three"
# str_split_fixed: matrix with fixed columns
str_split_fixed(text, ":", n = 3)
#      [,1]  [,2]  [,3]
# [1,] "one" "two" "three"
Use str_split_1() for single strings, str_split_fixed() when you need consistent column counts, and str_split() for general-purpose splitting.
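One practical difference: str_split_1() is strict about its input and raises an error when given anything other than a single string. The sketch below catches that error with tryCatch() to show the behavior without stopping the script; returning NULL from the handler is just an illustrative choice.

```r
library(stringr)

two_strings <- c("a,b", "c,d")

# str_split_1() expects exactly one string, so this call errors
result <- tryCatch(
  str_split_1(two_strings, ","),
  error = function(e) NULL
)
is.null(result)
# [1] TRUE

# str_split() handles the vector without complaint
n_results <- length(str_split(two_strings, ","))
n_results
# [1] 2
```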
Practical Use Cases
Parsing file paths:
path <- "/home/user/documents/report.pdf"
# Get filename
str_split_1(path, "/") |> tail(1)
# [1] "report.pdf"
# Get directory
parts <- str_split_1(path, "/")
paste(parts[-length(parts)], collapse = "/")
# [1] "/home/user/documents"
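For paths specifically, base R already ships basename() and dirname(), which should return the same pieces without manual splitting. A quick comparison sketch:

```r
library(stringr)

path <- "/home/user/documents/report.pdf"

# Manual approach with str_split_1()
parts <- str_split_1(path, "/")
manual_file <- parts[length(parts)]
manual_dir  <- paste(parts[-length(parts)], collapse = "/")

# Compare against the base R helpers
identical(manual_file, basename(path))
identical(manual_dir, dirname(path))
```

The manual split remains useful when you need the intermediate directories as separate elements.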
Extracting domains from email addresses:
emails <- c("john@example.com", "mary@company.org", "bob@mail.co.uk")
# Extract domains
sapply(str_split(emails, "@"), `[`, 2)
# [1] "example.com" "company.org" "mail.co.uk"
# Extract usernames
sapply(str_split(emails, "@"), `[`, 1)
# [1] "john" "mary" "bob"
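If both halves are needed, splitting twice is avoidable. One option is str_split_fixed() with n = 2, which yields a two-column matrix in a single call. The column names below are labels chosen for this sketch.

```r
library(stringr)

emails <- c("john@example.com", "mary@company.org", "bob@mail.co.uk")

# One split, two columns: text before the first "@" and everything after it
email_parts <- str_split_fixed(emails, "@", n = 2)
colnames(email_parts) <- c("username", "domain")

email_parts[, "username"]
# [1] "john" "mary" "bob"
email_parts[, "domain"]
# [1] "example.com" "company.org" "mail.co.uk"
```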
Processing log entries:
log_lines <- c(
  "2024-01-15 10:30:45 INFO User logged in",
  "2024-01-15 10:31:02 ERROR Database connection failed",
  "2024-01-15 10:31:15 WARN High memory usage detected"
)
# Extract log levels
sapply(str_split(log_lines, " "), `[`, 3)
# [1] "INFO" "ERROR" "WARN"
# Extract messages (everything after the third space)
sapply(str_split(log_lines, " ", n = 4), `[`, 4)
# [1] "User logged in" "Database connection failed"
# [3] "High memory usage detected"
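The two extractions can also be combined. As a sketch, str_split_fixed() with n = 4 produces date, time, level, and message columns at once, which can then be wrapped in a data frame. The column names are assumptions made for this example.

```r
library(stringr)

log_lines <- c(
  "2024-01-15 10:30:45 INFO User logged in",
  "2024-01-15 10:31:02 ERROR Database connection failed",
  "2024-01-15 10:31:15 WARN High memory usage detected"
)

# Four columns: the fourth keeps the rest of the line, spaces included
fields <- str_split_fixed(log_lines, " ", n = 4)

log_df <- data.frame(
  date    = fields[, 1],
  time    = fields[, 2],
  level   = fields[, 3],
  message = fields[, 4],
  stringsAsFactors = FALSE  # keep columns as character on older R versions
)

log_df$level
# [1] "INFO"  "ERROR" "WARN"
```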
Parsing URL components:
url <- "https://api.example.com/v2/users/123"
# Remove protocol and split path
parts <- str_split_1(url, "://")
path_parts <- str_split_1(parts[2], "/")
list(
  host = path_parts[1],
  path = path_parts[-1]
)
The str_split() function handles the vast majority of string-splitting needs in R. Master its parameters—especially n and simplify—and you’ll parse text data efficiently. For single strings, reach for str_split_1() to skip the list unwrapping. For consistent tabular output, use str_split_fixed() or simplify = TRUE.