R stringr - str_length() - String Length
Key Insights
- str_length() behaves consistently with NA values and Unicode characters: it returns NA for missing data, including a bare NA, where nchar(NA) returns a confusing 2.
- The function counts Unicode code points rather than bytes, which matters for applications dealing with emojis, accented characters, or international text where byte counts differ from character counts.
- Integration with dplyr pipelines makes str_length() a practical choice for data validation, filtering, and transformation tasks in tidyverse workflows.
Introduction to str_length()
The stringr package is one of the core tidyverse packages, designed to make string manipulation in R consistent and intuitive. While base R provides string functions, they often have inconsistent naming conventions and surprising edge case behaviors. stringr wraps these operations in a coherent API where every function starts with str_ and follows predictable patterns.
str_length() measures the number of characters in a string. This sounds trivial until you encounter NA values, empty strings, or Unicode characters. That’s where the difference between str_length() and base R’s nchar() becomes apparent.
library(stringr)
# Basic comparison
text <- "hello"
nchar(text)
#> [1] 5
str_length(text)
#> [1] 5
# Where things get interesting
na_text <- NA
nchar(na_text)
#> [1] 2
str_length(na_text)
#> [1] NA
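That 2 comes from nchar() coercing the bare NA, which is a logical value, to the string “NA” and counting its two characters. For a missing value that is already a character (NA_character_), nchar() returns NA under its default type = "chars", so the surprise is limited to non-character NAs, but it is easy to trip over. str_length() returns NA in both cases.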
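The difference between a logical NA and a character NA is easy to miss, so here is a minimal sketch of it. It uses only base R's nchar() and stringr's str_length(); the variable names are illustrative.

```r
library(stringr)

logical_na   <- NA             # a bare NA is a logical value
character_na <- NA_character_  # a missing string

nchar(logical_na)      # 2: the NA is coerced to the string "NA"
nchar(character_na)    # NA with the default type = "chars"
nchar(c("a", NA))      # 1 NA: inside a character vector, NA is a character NA

str_length(logical_na)    # NA
str_length(character_na)  # NA
```

The practical risk with nchar() is therefore a stray logical NA, for example a scalar default argument, rather than NAs inside a character column.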
Syntax and Parameters
The function signature is straightforward:
str_length(string)
The string parameter accepts a character vector. It can be a single string, a vector of strings, or a column from a dataframe. The function returns an integer vector of the same length as the input, where each element represents the character count of the corresponding input string.
str_length("application architect")
#> [1] 21
The return value is always an integer vector. For a single string input, you get a length-1 integer vector. This consistency matters when you’re building functions that need predictable return types.
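A quick way to confirm the return type and length is sketched below. The input vector is made up for illustration; the factor case mirrors an example in the stringr documentation.

```r
library(stringr)

inputs <- c("a", "bb", NA)          # illustrative input
lengths_out <- str_length(inputs)

is.integer(lengths_out)                 # TRUE
length(lengths_out) == length(inputs)   # TRUE

# Factors are coerced to character before counting
str_length(factor("abc"))               # 3
```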
Basic Usage
For single strings, the usage is direct:
str_length("R programming")
#> [1] 13
str_length("")
#> [1] 0
str_length(" ")
#> [1] 1
Note that empty strings return 0 and strings containing only whitespace return the count of whitespace characters. This is correct behavior, but worth remembering when validating user input.
The real power emerges with character vectors:
languages <- c("R", "Python", "JavaScript", "Go", "Rust")
str_length(languages)
#> [1] 1 6 10 2 4
# Combine with names for clarity
data.frame(
language = languages,
name_length = str_length(languages)
)
#> language name_length
#> 1 R 1
#> 2 Python 6
#> 3 JavaScript 10
#> 4 Go 2
#> 5 Rust 4
Vectorization means you never need to loop over strings manually. Pass the entire vector and get results for every element in one call.
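Because the result is an ordinary integer vector, it can feed directly into indexing and summary functions. The following sketch reuses the languages vector from above; name_lengths is an illustrative variable name.

```r
library(stringr)

languages <- c("R", "Python", "JavaScript", "Go", "Rust")
name_lengths <- str_length(languages)

# Longest name, found without an explicit loop
languages[which.max(name_lengths)]   # "JavaScript"

# Names with at most four characters
languages[name_lengths <= 4]         # "R" "Go" "Rust"

# Summary statistics work directly on the integer result
mean(name_lengths)                   # 4.6
```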
Handling Special Cases
Production code encounters messy data. Understanding how str_length() handles edge cases prevents runtime surprises.
NA Values
mixed_data <- c("valid", NA, "also valid", NA)
str_length(mixed_data)
#> [1] 5 NA 10 NA
NA propagation is consistent. You can filter or handle these values predictably using standard R idioms like is.na().
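As a rough sketch of those idioms, the snippet below counts missing entries and filters on length while guarding against NA. Replacing NA with 0 in the last step is an assumption that only suits some analyses.

```r
library(stringr)

mixed_data <- c("valid", NA, "also valid", NA)
lens <- str_length(mixed_data)

sum(is.na(lens))                      # 2 missing entries
mixed_data[!is.na(lens) & lens > 5]   # "also valid"

# which() silently drops the NA comparisons
mixed_data[which(lens > 5)]           # "also valid"

# Treat missing strings as length 0 (an assumption, not always appropriate)
ifelse(is.na(lens), 0L, lens)         # 5 0 10 0
```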
Empty Strings
empty_variants <- c("", " ", "  ", "\t", "\n")
str_length(empty_variants)
#> [1] 0 1 2 1 1
Empty strings have length zero. Whitespace characters each count as one character. If you need to treat whitespace-only strings as effectively empty, combine with str_trim():
str_length(str_trim(empty_variants))
#> [1] 0 0 0 0 0
Unicode and Multi-byte Characters
This is where str_length() proves its worth. Modern applications deal with international text, emojis, and special characters constantly.
# Accented characters
str_length("café")
#> [1] 4
str_length("naïve")
#> [1] 5
# Emojis
str_length("👍")
#> [1] 1
str_length("Hello 🌍!")
#> [1] 8
# Mixed international text
international <- c("日本語", "한국어", "العربية", "ελληνικά")
str_length(international)
#> [1] 3 3 7 8
str_length() counts Unicode code points, not bytes. The Japanese word “日本語” contains three characters, and that’s what you get. Compare this to byte-counting approaches that would return different values based on encoding.
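To make the code point versus byte distinction concrete, here is a small comparison using base R's nchar() with its type argument. The string is written with \u escapes so that R marks it as UTF-8 regardless of the session locale; the variable name is illustrative.

```r
library(stringr)

# \u escapes produce a UTF-8 string independent of the locale
japanese <- "\u65e5\u672c\u8a9e"   # 日本語

str_length(japanese)               # 3 code points
nchar(japanese, type = "chars")    # 3
nchar(japanese, type = "bytes")    # 9: three bytes per character in UTF-8
```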
# Emoji sequences can be tricky
str_length("\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466") # Family emoji, written with explicit ZWJ (U+200D) escapes
#> [1] 7
# This is a ZWJ sequence - multiple code points rendered as one glyph
# str_length counts code points, not visual glyphs
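For most applications, code point counting is what you want. If you need to count grapheme clusters (user-perceived characters), stringi’s stri_length() will not help, because it also counts code points. Use str_count() with boundary("character") instead, or stringi’s stri_count_boundaries() with type = "character".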
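A minimal sketch of grapheme counting with stringr's str_count() and boundary("character") follows. The combining-accent case is stable; the result for the emoji sequence depends on the ICU version behind stringi, so no expected value is given for it.

```r
library(stringr)

# "e" followed by a combining acute accent (U+0301):
# two code points, one user-perceived character
decomposed <- "e\u0301"
str_length(decomposed)                        # 2
str_count(decomposed, boundary("character"))  # 1

# Family emoji as a ZWJ sequence: four emoji joined by three U+200D
family <- "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
str_length(family)                            # 7
str_count(family, boundary("character"))      # grapheme count; varies with ICU version
```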
Practical Applications
Filtering Strings by Length
A common task is selecting strings that meet length criteria:
library(dplyr)
products <- tibble(
sku = c("A1", "B123", "C12345", "D1234567", "E12"),
name = c("Widget", "Gadget", "Thingamajig", "Doohickey", "Gizmo")
)
# Find SKUs between 3 and 6 characters
products %>%
filter(between(str_length(sku), 3, 6))
#> # A tibble: 3 × 2
#> sku name
#> <chr> <chr>
#> 1 B123 Gadget
#> 2 C12345 Thingamajig
#> 3 E12 Gizmo
Data Validation
Validating string lengths is essential for data quality:
validate_password <- function(password) {
  # Named n_chars to avoid shadowing base::length()
  n_chars <- str_length(password)
  # str_length() returns NA for a missing password; an empty string has length 0
  if (is.na(n_chars)) {
    return(list(valid = FALSE, message = "Password cannot be missing"))
  }
  if (n_chars < 8) {
    return(list(valid = FALSE, message = "Password must be at least 8 characters"))
  }
  if (n_chars > 128) {
    return(list(valid = FALSE, message = "Password cannot exceed 128 characters"))
  }
  list(valid = TRUE, message = "Password meets length requirements")
}
validate_password("short")
#> $valid
#> [1] FALSE
#> $message
#> [1] "Password must be at least 8 characters"
validate_password("adequately_long_password")
#> $valid
#> [1] TRUE
#> $message
#> [1] "Password meets length requirements"
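validate_password() relies on if(), so it handles one password at a time. For a whole column, a vectorized variant is one option. The sketch below uses dplyr's case_when(); check_password_length() and its status labels are hypothetical names chosen for illustration.

```r
library(stringr)
library(dplyr)

# Hypothetical vectorized helper: returns one status label per input string
check_password_length <- function(password) {
  n_chars <- str_length(password)
  case_when(
    is.na(n_chars) ~ "missing",    # checked first so NA never reaches the comparisons
    n_chars < 8    ~ "too short",
    n_chars > 128  ~ "too long",
    TRUE           ~ "ok"
  )
}

check_password_length(c("short", NA, "adequately_long_password"))
#> [1] "too short" "missing"   "ok"
```

Because the helper returns a character vector of the same length as its input, it can be used directly inside mutate().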
Use Within dplyr Pipelines
str_length() integrates seamlessly with tidyverse workflows:
customer_data <- tibble(
id = 1:5,
email = c("a@b.co", "user@example.com", "x@y.z", "contact@company.org", NA),
phone = c("555-1234", "5551234", "555-123-4567", NA, "555.123.4567")
)
customer_data %>%
mutate(
email_length = str_length(email),
phone_length = str_length(phone),
email_valid = str_length(email) >= 5 & !is.na(email)
) %>%
filter(email_valid)
#> # A tibble: 4 × 6
#> id email phone email_length phone_length email_valid
#> <int> <chr> <chr> <int> <int> <lgl>
#> 1 1 a@b.co 555-1234 6 8 TRUE
#> 2 2 user@example.com 5551234 16 7 TRUE
#> 3 3 x@y.z 555-123-4567 5 12 TRUE
#> 4 4 contact@company.org NA 19 NA TRUE
You can also use str_length() for grouping and summarization:
words <- tibble(
word = c("a", "an", "the", "cat", "dogs", "elephant", "programming")
)
words %>%
mutate(length = str_length(word)) %>%
group_by(length) %>%
summarise(
count = n(),
examples = paste(word, collapse = ", ")
)
#> # A tibble: 6 × 3
#> length count examples
#> <int> <int> <chr>
#> 1 1 1 a
#> 2 2 1 an
#> 3 3 2 the, cat
#> 4 4 1 dogs
#> 5 8 1 elephant
#> 6 11 1 programming
Performance Considerations
For most use cases, the performance difference between str_length() and nchar() is negligible. Both are vectorized and efficient. Choose based on correctness and consistency, not speed.
# Both handle large vectors efficiently (timings below are illustrative and vary by machine)
large_vector <- rep("test string", 1000000)
system.time(nchar(large_vector))
#> user system elapsed
#> 0.02 0.00 0.02
system.time(str_length(large_vector))
#> user system elapsed
#> 0.03 0.00 0.03
The marginal speed difference rarely justifies switching to nchar(), given that it returns 2 for a bare NA as shown earlier. If you’re in a performance-critical loop processing billions of strings, profile your actual code before optimizing.
Use str_length() when:
- You’re working within a tidyverse pipeline
- Your data might contain NA values
- You’re processing Unicode text
- Code readability and consistency matter
Use nchar() when:
- You’re in a base R environment without tidyverse
- You explicitly need nchar()’s type parameter for byte or width counting
- You’ve profiled and confirmed it’s a bottleneck (rare)
Summary
str_length() is a simple function that does one thing well: count characters in strings. Its value lies in consistent behavior across edge cases, proper Unicode handling, and seamless tidyverse integration.
Key takeaways:
- Returns NA for NA inputs, not 2
- Counts Unicode code points correctly
- Vectorized for efficient processing
- Works naturally in dplyr pipelines
Related stringr functions worth exploring:
- str_sub(): extract substrings by position
- str_trim(): remove whitespace from string ends
- str_pad(): pad strings to a specified length
- str_trunc(): truncate strings to a maximum length
For complete documentation, see the stringr package documentation or run ?str_length in your R console.