R stringr - str_length() - String Length

Key Insights

str_length() provides consistent, predictable behavior with NA values and Unicode characters, returning NA for every missing input, including a bare logical NA, which nchar() counts as the confusing 2.
The function counts Unicode characters correctly, making it essential for modern applications dealing with emojis, accented characters, or international text where byte counts differ from character counts.
Integration with dplyr pipelines makes str_length() the practical choice for data validation, filtering, and transformation tasks in tidyverse workflows.

Introduction to str_length()

The stringr package is one of the core tidyverse packages, designed to make string manipulation in R consistent and intuitive. While base R provides string functions, they often have inconsistent naming conventions and surprising edge case behaviors. stringr wraps these operations in a coherent API where every function starts with str_ and follows predictable patterns.

str_length() measures the number of characters in a string. This sounds trivial until you encounter NA values, empty strings, or Unicode characters. That’s where the difference between str_length() and base R’s nchar() becomes apparent.

library(stringr)

# Basic comparison
text <- "hello"
nchar(text)
#> [1] 5

str_length(text)
#> [1] 5

# Where things get interesting
na_text <- NA
nchar(na_text)
#> [1] 2

str_length(na_text)
#> [1] NA

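That 2 from nchar() appears because a bare NA is a logical value, which nchar() coerces to the literal string “NA” and then counts. A missing value that is already character (NA_character_, or an NA inside a character vector) makes nchar() return NA under its default settings, so the surprise mostly affects inputs that were never typed as character, such as an all-NA column. str_length() returns NA in both cases.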
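To make the role of the NA's type concrete, here is a minimal sketch comparing a logical NA with NA_character_. The variable names are illustrative, and the comments describe the behavior of nchar() with its default arguments.

```r
library(stringr)

logical_na   <- NA             # a bare NA is logical
character_na <- NA_character_  # a missing string

nchar(logical_na)              # 2: the logical NA is coerced to the string "NA"
nchar(character_na)            # NA: missing strings stay missing by default
nchar(c("valid", NA))          # 5 NA: an NA inside a character vector is NA_character_

str_length(logical_na)         # NA
str_length(character_na)       # NA
```

If a column might arrive as all-NA logical values, str_length() gives the same answer either way, which removes one thing to check.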

Syntax and Parameters

The function signature is straightforward:

str_length(string)

The string parameter accepts a character vector, or an object that can be coerced to one. It can be a single string, a vector of strings, or a column from a data frame. The function returns an integer vector of the same length as the input, where each element is the character count of the corresponding input string.

str_length("application architect")
#> [1] 21

The return value is always an integer vector. For a single string input, you get a length-1 integer vector. This consistency matters when you’re building functions that need predictable return types.
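The return contract can be checked directly. This short sketch, using an illustrative input vector, confirms that the result is an integer vector with one element per input, including for NA and empty strings.

```r
library(stringr)

inputs <- c("a", "bb", NA, "")
lengths_out <- str_length(inputs)

lengths_out
#> [1]  1  2 NA  0

typeof(lengths_out)                    # storage type of the result
#> [1] "integer"

length(lengths_out) == length(inputs)  # one result per input element
#> [1] TRUE
```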

Basic Usage

For single strings, the usage is direct:

str_length("R programming")
#> [1] 13

str_length("")
#> [1] 0

str_length(" ")
#> [1] 1

Note that empty strings return 0 and strings containing only whitespace return the count of whitespace characters. This is correct behavior, but worth remembering when validating user input.

The real power emerges with character vectors:

languages <- c("R", "Python", "JavaScript", "Go", "Rust")
str_length(languages)
#> [1]  1  6 10  2  4

# Combine with names for clarity
data.frame(
  language = languages,
  name_length = str_length(languages)
)
#>     language name_length
#> 1          R           1
#> 2     Python           6
#> 3 JavaScript          10
#> 4         Go           2
#> 5       Rust           4

Vectorization means you never need to loop over strings manually. Pass the entire vector and get results for every element in one call.

Handling Special Cases

Production code encounters messy data. Understanding how str_length() handles edge cases prevents runtime surprises.

NA Values

mixed_data <- c("valid", NA, "also valid", NA)
str_length(mixed_data)
#> [1]  5 NA 10 NA

NA propagation is consistent. You can filter or handle these values predictably using standard R idioms like is.na().
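As a sketch of those idioms, the snippet below reuses mixed_data and shows three common ways to deal with the missing lengths. Whether dropping, skipping, or zero-filling is appropriate depends on the analysis, so treat these as options rather than a recommendation.

```r
library(stringr)

mixed_data <- c("valid", NA, "also valid", NA)
lens <- str_length(mixed_data)

# Keep only the elements whose length is known
mixed_data[!is.na(lens)]
#> [1] "valid"      "also valid"

# Aggregate while skipping missing lengths
sum(lens, na.rm = TRUE)
#> [1] 15

# Or treat missing strings as length zero, if that suits the analysis
lens_zero <- ifelse(is.na(lens), 0L, lens)
lens_zero
#> [1]  5  0 10  0
```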

Empty Strings

empty_variants <- c("", " ", "  ", "\t", "\n")
str_length(empty_variants)
#> [1] 0 1 2 1 1

Empty strings have length zero. Whitespace characters each count as one character. If you need to treat whitespace-only strings as effectively empty, combine with str_trim():

str_length(str_trim(empty_variants))
#> [1] 0 0 0 0 0
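For input validation, the trim-then-measure pattern can be wrapped in a small predicate. The helper name is_blank() below is hypothetical, not part of stringr, and this sketch assumes that NA should count as blank.

```r
library(stringr)

# Hypothetical helper: TRUE for NA, "", or whitespace-only strings
is_blank <- function(x) {
  is.na(x) | str_length(str_trim(x)) == 0
}

is_blank(c("", "  ", "\t", "text", NA))
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE
```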

Unicode and Multi-byte Characters

This is where str_length() proves its worth. Modern applications deal with international text, emojis, and special characters constantly.

# Accented characters
str_length("café")
#> [1] 4

str_length("naïve")
#> [1] 5

# Emojis
str_length("👍")
#> [1] 1

str_length("Hello 🌍!")
#> [1] 8

# Mixed international text
international <- c("日本語", "한국어", "العربية", "ελληνικά")
str_length(international)
#> [1] 3 3 7 8

str_length() counts Unicode code points, not bytes. The Japanese word “日本語” contains three characters, and that’s what you get. Compare this to byte-counting approaches that would return different values based on encoding.
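The gap between characters and bytes is easy to see by putting str_length() next to nchar(type = "bytes"). This sketch writes the strings with \u escapes so that they are stored as UTF-8 regardless of the session locale; the byte counts shown apply to that encoding.

```r
library(stringr)

# \u escapes produce UTF-8 strings regardless of the session locale
words <- c("caf\u00e9", "\u65e5\u672c\u8a9e")  # "café", "日本語"

str_length(words)             # code points
#> [1] 4 3

nchar(words, type = "bytes")  # bytes in the UTF-8 encoding
#> [1] 5 9
```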

# Emoji sequences can be tricky
str_length("👨‍👩‍👧‍👦")  # Family emoji (may vary by system)
#> [1] 7

# This is a ZWJ sequence - multiple code points rendered as one glyph
# str_length counts code points, not visual glyphs

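For most applications, code point counting is what you want. If you need to count grapheme clusters (what a reader perceives as single characters), stringi’s stri_length() will not help, because it also counts code points. Count character boundaries instead, for example with str_count(x, boundary("character")) or stringi’s stri_count_boundaries(x, type = "character").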
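One way to count perceived characters rather than code points is to count character boundaries with str_count(). The sketch below uses a combining-accent example because its result is stable; results for emoji sequences may depend on the ICU version that stringi is built against.

```r
library(stringr)

u <- c("\u00fc", "u\u0308")  # precomposed ü vs. u + combining diaeresis

str_length(u)                        # code points
#> [1] 1 2

str_count(u, boundary("character"))  # grapheme clusters
#> [1] 1 1
```

Both strings render identically, but only the boundary-based count treats them as the same length.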

Practical Applications

Filtering Strings by Length

A common task is selecting strings that meet length criteria:

library(dplyr)

products <- tibble(
  sku = c("A1", "B123", "C12345", "D1234567", "E12"),
  name = c("Widget", "Gadget", "Thingamajig", "Doohickey", "Gizmo")
)

# Find SKUs between 3 and 6 characters
products %>%
  filter(between(str_length(sku), 3, 6))
#> # A tibble: 3 × 2
#>   sku    name       
#>   <chr>  <chr>      
#> 1 B123   Gadget     
#> 2 C12345 Thingamajig
#> 3 E12    Gizmo      

Data Validation

Validating string lengths is essential for data quality:

validate_password <- function(password) {
  # Named n_chars rather than length to avoid shadowing base::length()
  n_chars <- str_length(password)
  
  # NA means no password was supplied; "" has zero characters
  if (is.na(n_chars) || n_chars == 0) {
    return(list(valid = FALSE, message = "Password cannot be empty"))
  }
  
  if (n_chars < 8) {
    return(list(valid = FALSE, message = "Password must be at least 8 characters"))
  }
  
  if (n_chars > 128) {
    return(list(valid = FALSE, message = "Password cannot exceed 128 characters"))
  }
  
  list(valid = TRUE, message = "Password meets length requirements")
}

validate_password("short")
#> $valid
#> [1] FALSE
#> $message
#> [1] "Password must be at least 8 characters"

validate_password("adequately_long_password")
#> $valid
#> [1] TRUE
#> $message
#> [1] "Password meets length requirements"
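validate_password() relies on if, so it expects a single password at a time. When a whole column needs checking, a vectorized variant can be sketched with dplyr::case_when(). The function name password_status() and its status labels are hypothetical, chosen only for illustration.

```r
library(stringr)
library(dplyr)

# Hypothetical vectorized check: returns one status label per password
password_status <- function(password) {
  n_chars <- str_length(password)
  case_when(
    is.na(n_chars) ~ "missing",
    n_chars < 8    ~ "too short",
    n_chars > 128  ~ "too long",
    TRUE           ~ "ok"
  )
}

password_status(c("short", "adequately_long_password", NA))
#> [1] "too short" "ok"        "missing"
```

Because the conditions are evaluated in order, the NA check must come first; otherwise the comparisons on a missing length would themselves be NA.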

Use Within dplyr Pipelines

str_length() integrates seamlessly with tidyverse workflows:

customer_data <- tibble(
  id = 1:5,
  email = c("a@b.co", "user@example.com", "x@y.z", "contact@company.org", NA),
  phone = c("555-1234", "5551234", "555-123-4567", NA, "555.123.4567")
)

customer_data %>%
  mutate(
    email_length = str_length(email),
    phone_length = str_length(phone),
    email_valid = str_length(email) >= 5 & !is.na(email)
  ) %>%
  filter(email_valid)
#> # A tibble: 4 × 6
#>      id email               phone        email_length phone_length email_valid
#>   <int> <chr>               <chr>               <int>        <int> <lgl>      
#> 1     1 a@b.co              555-1234                6            8 TRUE       
#> 2     2 user@example.com    5551234                16            7 TRUE       
#> 3     3 x@y.z               555-123-4567            5           12 TRUE       
#> 4     4 contact@company.org NA                     19           NA TRUE       

You can also use str_length() for grouping and summarization:

words <- tibble(
  word = c("a", "an", "the", "cat", "dogs", "elephant", "programming")
)

words %>%
  mutate(length = str_length(word)) %>%
  group_by(length) %>%
  summarise(
    count = n(),
    examples = paste(word, collapse = ", ")
  )
#> # A tibble: 6 × 3
#>   length count examples   
#>    <int> <int> <chr>      
#> 1      1     1 a          
#> 2      2     1 an         
#> 3      3     2 the, cat   
#> 4      4     1 dogs       
#> 5      8     1 elephant
#> 6     11     1 programming

Performance Considerations

For most use cases, the performance difference between str_length() and nchar() is negligible. Both are vectorized and efficient. Choose based on correctness and consistency, not speed.

# Both handle large vectors efficiently
large_vector <- rep("test string", 1000000)

system.time(nchar(large_vector))
#>    user  system elapsed 
#>   0.02    0.00    0.02 

system.time(str_length(large_vector))
#>    user  system elapsed 
#>   0.03    0.00    0.03

These timings are illustrative and will vary by machine. The marginal difference rarely justifies choosing nchar() for speed alone, particularly when a stray logical NA could silently be counted as 2. If you’re in a performance-critical loop processing billions of strings, profile your actual code before optimizing.

Use str_length() when:

You’re working within a tidyverse pipeline
Your data might contain NA values
You’re processing Unicode text
Code readability and consistency matter

Use nchar() when:

You’re in a base R environment without tidyverse
You explicitly need nchar()’s type parameter for byte or width counting
You’ve profiled and confirmed it’s a bottleneck (rare)

Summary

str_length() is a simple function that does one thing well: count characters in strings. Its value lies in consistent behavior across edge cases, proper Unicode handling, and seamless tidyverse integration.

Key takeaways:

Returns NA for every NA input, including a bare logical NA that nchar() counts as 2
Counts Unicode code points correctly
Vectorized for efficient processing
Works naturally in dplyr pipelines

Related stringr functions worth exploring:

str_sub() - extract substrings by position
str_trim() - remove whitespace from string ends
str_pad() - pad strings to a specified length
str_trunc() - truncate strings to a maximum length
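As a quick orientation, the following sketch calls each of these functions once on illustrative inputs, with str_length() confirming the width of the truncated result.

```r
library(stringr)

str_sub("programming", 1, 3)         # characters 1 through 3
#> [1] "pro"

str_trim("  padded  ")               # strip leading and trailing whitespace
#> [1] "padded"

str_pad("7", width = 3, pad = "0")   # pads on the left by default
#> [1] "007"

truncated <- str_trunc("programming", width = 8)
truncated                            # the ellipsis counts toward the width
#> [1] "progr..."

str_length(truncated)
#> [1] 8
```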

For complete documentation, see the stringr package documentation or run ?str_length in your R console.