R stringr - str_count() - Count Matches

Key Insights

  • str_count() returns the number of pattern matches in each string, making it invaluable for text analysis, data validation, and feature engineering in R workflows.
  • The function is fully vectorized and integrates seamlessly with dplyr pipelines, allowing you to process entire columns of text data with a single line of code.
  • Combining str_count() with regular expressions unlocks powerful pattern matching—from counting digits in phone numbers to validating password complexity requirements.

Introduction to str_count()

The str_count() function from the stringr package does exactly what its name suggests: it counts the number of times a pattern appears in a string. Unlike str_detect() which returns a boolean, or str_extract() which pulls out matches, str_count() gives you a numeric value representing match frequency.

The basic syntax is straightforward:

str_count(string, pattern)

Here’s a simple example to get started:

library(stringr)

# Count occurrences of the letter 'a'
str_count("banana", "a")
# [1] 3

# Count a substring
str_count("mississippi", "ss")
# [1] 2

# Works with vectors too
fruits <- c("apple", "banana", "cherry")
str_count(fruits, "a")
# [1] 1 3 0

The function returns an integer vector with the same length as the input, making it perfect for adding count columns to your data frames.
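
One behavior worth calling out, since it matters once you start filling data frame columns: like other stringr functions, str_count() propagates missing values, so an NA input produces an NA count rather than 0.

```r
library(stringr)

# One count per input element; NA inputs stay NA
x <- c("apple", NA, "banana")
str_count(x, "a")
# [1]  1 NA  3
```

If you want missing strings counted as zero, replace the NAs afterwards with something like dplyr::coalesce() or tidyr::replace_na().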

Basic Pattern Matching

At its simplest, str_count() matches literal strings. This covers most everyday use cases—counting specific characters, words, or substrings.

# Count vowels in words
words <- c("algorithm", "data", "science")
str_count(words, "[aeiou]")
# [1] 3 2 3

# Count specific words in sentences
sentences <- c(
  "The cat sat on the mat",
  "The dog ran to the park",
  "Birds fly in the sky"
)
str_count(sentences, "the")
# [1] 1 1 1

Notice that last example only found one match per sentence. That’s because str_count() is case-sensitive by default. The uppercase “The” at the start of each sentence doesn’t match “the”.

# Case-insensitive counting with regex flag
str_count(sentences, regex("the", ignore_case = TRUE))
# [1] 2 2 1

# Alternative: convert to lowercase first
str_count(str_to_lower(sentences), "the")
# [1] 2 2 1

I prefer the regex() wrapper approach—it’s more explicit about intent and doesn’t modify your original data.

Using Regular Expressions

The real power of str_count() emerges when you combine it with regular expressions. Regex patterns let you match character classes, repetitions, and complex structures.

# Count digits in strings
phone_numbers <- c("555-123-4567", "1-800-555-0199", "123.456.7890")
str_count(phone_numbers, "\\d")
# [1] 10 11 10

# Count words (sequences of word characters)
text <- "Hello, world! How are you today?"
str_count(text, "\\w+")
# [1] 6

# Count whitespace characters
str_count(text, "\\s")
# [1] 5

Here are some regex patterns you’ll use frequently with str_count():

sample_text <- "Order #12345 shipped on 2024-01-15 for $99.99"

# Count all numbers (digit sequences)
str_count(sample_text, "\\d+")
# [1] 6

# Count uppercase letters
str_count(sample_text, "[A-Z]")
# [1] 1

# Count punctuation (note: stringr's ICU engine treats symbols like $ as non-punctuation, unlike base R)
str_count(sample_text, "[[:punct:]]")
# [1] 4

# Count words starting with specific letter
words_text <- "Peter Piper picked a peck of pickled peppers"
str_count(words_text, "\\b[Pp]\\w+")
# [1] 6

The \\b in that last example is a word boundary anchor—it ensures we match words starting with P or p, not just any p that happens to sit in the middle of a word.
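
To see what the boundary buys you, compare it against counting every p in the same sentence, regardless of position:

```r
# Without the boundary: every p or P, including those mid-word
str_count(words_text, "[Pp]")
# [1] 9

# With the boundary: only p or P at the start of a word
str_count(words_text, "\\b[Pp]")
# [1] 6
```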

Working with Vectors and Data Frames

In practice, you’ll rarely count patterns in single strings. Most text analysis involves processing columns of data. str_count() handles this naturally because it’s vectorized.

library(dplyr)

# Sample social media data
tweets <- tibble(
  user = c("user_a", "user_b", "user_c", "user_d"),
  text = c(
    "Loving this weather! #sunny #happy #blessed",
    "Just finished my workout #fitness",
    "No hashtags here, just vibes",
    "#R #rstats #datascience #programming #code"
  )
)

# Count hashtags in each tweet
tweets <- tweets %>%
  mutate(hashtag_count = str_count(text, "#\\w+"))

tweets
# # A tibble: 4 × 3
#   user   text                                          hashtag_count
#   <chr>  <chr>                                                 <int>
# 1 user_a Loving this weather! #sunny #happy #blessed               3
# 2 user_b Just finished my workout #fitness                         1
# 3 user_c No hashtags here, just vibes                              0
# 4 user_d #R #rstats #datascience #programming #code                5

You can use these counts for filtering, grouping, or feature engineering:

# Filter to tweets with multiple hashtags
tweets %>%
  filter(hashtag_count >= 2)

# Analyze engagement patterns
tweets %>%
  mutate(
    word_count = str_count(text, "\\w+"),
    exclamation_count = str_count(text, "!"),
    mention_count = str_count(text, "@\\w+")
  )

This pattern—adding multiple count columns in a single mutate() call—is how I typically start any text analysis project. It gives you immediate insight into your data’s structure.

Practical Use Cases

Let’s look at three real-world scenarios where str_count() proves essential.

Password Validation

Validating that passwords meet complexity requirements is a classic use case:

validate_password <- function(password) {
  tibble(
    password = password,
    length_ok = nchar(password) >= 8,
    has_upper = str_count(password, "[A-Z]") >= 1,
    has_lower = str_count(password, "[a-z]") >= 1,
    has_digit = str_count(password, "\\d") >= 1,
    has_special = str_count(password, "[!@#$%^&*]") >= 1
  ) %>%
    mutate(valid = length_ok & has_upper & has_lower & has_digit & has_special)
}

passwords <- c("weak", "Better123", "Str0ng!Pass", "12345678")
validate_password(passwords)
# # A tibble: 4 × 7
#   password    length_ok has_upper has_lower has_digit has_special valid
#   <chr>       <lgl>     <lgl>     <lgl>     <lgl>     <lgl>       <lgl>
# 1 weak        FALSE     FALSE     TRUE      FALSE     FALSE       FALSE
# 2 Better123   TRUE      TRUE      TRUE      TRUE      FALSE       FALSE
# 3 Str0ng!Pass TRUE      TRUE      TRUE      TRUE      TRUE        TRUE
# 4 12345678    TRUE      FALSE     FALSE     TRUE      FALSE       FALSE

Log File Analysis

Parsing server logs to count error occurrences:

log_entries <- c(
  "2024-01-15 10:23:45 INFO User login successful",
  "2024-01-15 10:24:01 ERROR Database connection failed",
  "2024-01-15 10:24:15 ERROR ERROR Retry failed - ERROR state",
  "2024-01-15 10:25:00 WARN Memory usage high",
  "2024-01-15 10:25:30 INFO Request processed"
)

log_analysis <- tibble(entry = log_entries) %>%
  mutate(
    error_count = str_count(entry, "ERROR"),
    is_critical = error_count > 1,
    timestamp = str_extract(entry, "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")
  )

# Find entries with multiple errors
log_analysis %>%
  filter(is_critical)

Text Corpus Analysis

Analyzing document characteristics for NLP preprocessing:

documents <- c(
  "The quick brown fox jumps over the lazy dog.",
  "R is a programming language for statistical computing.",
  "Machine learning models require careful feature engineering!"
)

corpus_stats <- tibble(doc = documents) %>%
  mutate(
    doc_id = row_number(),
    word_count = str_count(doc, "\\b\\w+\\b"),
    sentence_count = str_count(doc, "[.!?]"),
    avg_word_length = (nchar(doc) - str_count(doc, "\\s") - str_count(doc, "[[:punct:]]")) / word_count
  )

corpus_stats

Performance Tips and Alternatives

For small to medium datasets, str_count() performs excellently. But when processing millions of strings, you might consider alternatives.

library(microbenchmark)

# Generate test data
large_vector <- rep("The quick brown fox jumps over the lazy dog", 10000)

# Benchmark different approaches
microbenchmark(
  stringr = str_count(large_vector, "the"),
  base_r = sapply(gregexpr("the", large_vector), function(x) sum(x > 0)),
  times = 50
)

In my benchmarks, str_count() typically matches or beats base R alternatives while providing cleaner syntax. The stringr package uses the stringi library under the hood, which is highly optimized.
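
If you need to shave off a little more, you can call stringi directly. As a sketch: stri_count_regex() is essentially what str_count() delegates to for regex patterns, and stri_count_fixed() skips regex interpretation entirely, much like wrapping the pattern in fixed().

```r
library(stringi)

# Direct stringi equivalents of str_count()
stri_count_regex(large_vector, "the")   # regex matching
stri_count_fixed(large_vector, "the")   # literal matching, no regex engine
```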

A few performance tips:

  1. Store patterns in variables for repeated use—wrapping with regex() keeps options like ignore_case attached to the pattern object
  2. Avoid unnecessary captures—use (?:...) instead of (...) when you don’t need backreferences
  3. Pre-filter when possible—use str_detect() first if you only need to process matching strings

# More efficient for complex patterns used repeatedly
pattern <- regex("\\b[A-Z][a-z]+\\b", ignore_case = FALSE)
str_count(large_vector, pattern)
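
Tip 3 might look like this in practice—a sketch that skips counting for strings that cannot match at all. Whether it wins depends on the hit rate; it helps most when matches are rare and the pattern is expensive.

```r
# Pre-filter with str_detect(), then count only the candidates
hits <- str_detect(large_vector, "fox")
counts <- integer(length(large_vector))   # zero-filled by default
counts[hits] <- str_count(large_vector[hits], "fox")
```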

Summary and Quick Reference

str_count() is deceptively simple but incredibly useful. It bridges the gap between detecting patterns and extracting them, giving you quantitative insight into your text data.
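
That bridge is easy to see side by side:

```r
s <- "one fish two fish red fish blue fish"
str_detect(s, "fish")       # TRUE              - is it there?
str_count(s, "fish")        # 4                 - how many times?
str_extract_all(s, "fish")  # list of 4 "fish"  - what matched?
```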

Common pitfalls to avoid:

  • Forgetting case sensitivity (use regex(..., ignore_case = TRUE))
  • Not escaping special regex characters (use fixed() for literal matching)
  • Expecting overlapping matches (they’re not counted—“aaa” has one “aa”, not two)
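
If you genuinely need overlapping counts, the usual workaround is a zero-width lookahead: because the lookahead consumes no characters, the engine can advance one position at a time.

```r
# Non-overlapping (the default): "aaa" contains one "aa"
str_count("aaa", "aa")
# [1] 1

# Overlapping: match an "a" only when another "a" follows
str_count("aaa", "a(?=a)")
# [1] 2
```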

Quick reference patterns:

# Literal string (no regex interpretation)
str_count(text, fixed("$100"))

# Any digit
str_count(text, "\\d")

# Any word character
str_count(text, "\\w")

# Whole words only
str_count(text, "\\bword\\b")

# Case insensitive
str_count(text, regex("pattern", ignore_case = TRUE))

# Multiple alternatives
str_count(text, "cat|dog|bird")

# Character class
str_count(text, "[aeiouAEIOU]")

Master str_count() and you’ll find yourself reaching for it constantly—whether you’re validating input, analyzing text, or engineering features for machine learning models. It’s one of those functions that does one thing well, and that’s exactly what good software should do.
