R stringr - str_replace() and str_replace_all()
Key Insights
- str_replace() targets only the first match in each string while str_replace_all() replaces every occurrence; choosing the wrong one is a common source of data cleaning bugs
- Named vectors with str_replace_all() enable powerful batch replacements, letting you standardize dozens of variations in a single function call
- stringr functions provide consistent, predictable behavior compared to base R's gsub() and sub(), making your text manipulation code easier to read and debug
Introduction to String Replacement in R
Text manipulation is unavoidable in data work. Whether you’re cleaning survey responses, standardizing product names, or preparing data for analysis, you’ll spend significant time replacing patterns in strings. The stringr package, part of the tidyverse, provides a consistent and intuitive interface for these operations.
The two workhorses for string replacement are str_replace() and str_replace_all(). Understanding when to use each—and how to leverage their full capabilities—will save you hours of debugging and make your data cleaning pipelines more robust.
# Load the package
library(stringr)
# Sample data we'll use throughout
product_names <- c("iPhone 12 Pro", "iPhone 13 Pro Max", "Samsung Galaxy S21")
messy_text <- c("Hello   World", "Too    many     spaces", "Normal text")  # runs of 3, 4, and 5 spaces
codes <- c("ABC-123-XYZ", "DEF-456-UVW", "GHI-789-RST")
str_replace() - Single Pattern Replacement
The str_replace() function replaces only the first occurrence of a pattern in each string. This behavior is intentional and useful when you know you only need to modify the first match.
# Syntax: str_replace(string, pattern, replacement)
# Replace first space with underscore
str_replace(messy_text, " ", "_")
# [1] "Hello_  World" "Too_   many     spaces" "Normal_text"
# Fix a specific typo (only first occurrence)
typos <- c("recieve the package", "recieve and recieve again")
str_replace(typos, "recieve", "receive")
# [1] "receive the package" "receive and recieve again"
Notice in the second example how only the first “recieve” gets corrected. The second occurrence remains unchanged. This is exactly what you want when fixing a known single error, but it’s a trap if you expect all instances to be replaced.
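One way to catch this trap early is to count matches per string before deciding which function to use. The following is a minimal sketch using stringr's str_count() on the same typos vector; the names match_counts, has_leftovers, and fixed_all are only illustrative.

```r
library(stringr)

typos <- c("recieve the package", "recieve and recieve again")

# Count how many times the pattern occurs in each string
match_counts <- str_count(typos, "recieve")

# A count above 1 means str_replace() would leave matches behind
has_leftovers <- match_counts > 1

# Replace every occurrence where that matters
fixed_all <- str_replace_all(typos, "recieve", "receive")
# "receive the package" "receive and receive again"
```

If has_leftovers is TRUE for any element, str_replace() alone will not finish the job for that string.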
# Practical example: Extracting and modifying file extensions
files <- c("report.txt", "data.csv", "analysis.txt")
str_replace(files, "\\.txt$", ".md")
# [1] "report.md" "data.csv" "analysis.md"
The regex pattern \\.txt$ matches “.txt” only at the end of strings. The double backslash escapes the dot (which otherwise matches any character), and the dollar sign anchors the match to the string’s end.
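To see why the escape and the anchor both matter, here is a small sketch with a deliberately awkward, made-up filename (tricky is a hypothetical input, not part of the sample data above).

```r
library(stringr)

tricky <- "notes_txt.txt"

# Unescaped dot matches any character, so "_txt" is the first match
unescaped <- str_replace(tricky, ".txt", ".md")
# "notes.md.txt"

# Escaped dot plus end anchor touches only the real extension
anchored <- str_replace(tricky, "\\.txt$", ".md")
# "notes_txt.md"
```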
str_replace_all() - Global Pattern Replacement
When you need to replace every occurrence of a pattern, str_replace_all() is your tool. The syntax is identical, but the behavior affects all matches.
# Replace ALL spaces with underscores
str_replace_all(messy_text, " ", "_")
# [1] "Hello___World" "Too____many_____spaces" "Normal_text"
# Remove all punctuation
sentences <- c("Hello, world! How are you?", "Fine, thanks.")
str_replace_all(sentences, "[[:punct:]]", "")
# [1] "Hello world How are you" "Fine thanks"
# Normalize whitespace (replace multiple spaces with single space)
str_replace_all(messy_text, "\\s+", " ")
# [1] "Hello World" "Too many spaces" "Normal text"
The \\s+ pattern matches one or more whitespace characters. This is a common cleaning operation that collapses inconsistent spacing into a uniform single space.
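Collapsing runs of whitespace does not remove leading or trailing spaces. If you also want those trimmed, stringr's str_squish() does both in one step. A short sketch, using a made-up padded vector:

```r
library(stringr)

padded <- c("  Hello   World  ", "Too    many     spaces")

# Collapses internal runs, but the outer spaces survive as single spaces
collapsed <- str_replace_all(padded, "\\s+", " ")
# " Hello World " "Too many spaces"

# Collapses internal runs and trims both ends
squished <- str_squish(padded)
# "Hello World" "Too many spaces"
```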
# Convert to snake_case (simplified)
titles <- c("Product Name", "Order Date", "Customer ID")
titles %>%
  str_to_lower() %>%
  str_replace_all(" ", "_")
# [1] "product_name" "order_date" "customer_id"
Using Regular Expressions for Pattern Matching
Both functions accept regular expressions, which dramatically expands their power. Learning a few key patterns will handle most real-world scenarios.
# Replace all digits with X (masking)
phone_numbers <- c("Call 555-1234", "Fax: 555-5678")
str_replace_all(phone_numbers, "\\d", "X")
# [1] "Call XXX-XXXX" "Fax: XXX-XXXX"
# Case-insensitive replacement using regex()
mixed_case <- c("The QUICK brown Fox", "QUICK quick QuIcK")
str_replace_all(mixed_case, regex("quick", ignore_case = TRUE), "slow")
# [1] "The slow brown Fox" "slow slow slow"
# Replace words at boundaries only
text <- c("cat category catfish", "the cat sat")
str_replace_all(text, "\\bcat\\b", "dog")
# [1] "dog category catfish" "the dog sat"
The word boundary anchor \\b prevents partial matches. Without it, “cat” would match inside “category” and “catfish”, producing “dogegory” and “dogfish”—almost certainly not what you want.
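Running the same replacement with and without the boundaries makes the difference concrete. This sketch reuses the text vector from the example; unbounded and bounded are illustrative names.

```r
library(stringr)

text <- c("cat category catfish", "the cat sat")

# No boundaries: every "cat" substring is replaced
unbounded <- str_replace_all(text, "cat", "dog")
# "dog dogegory dogfish" "the dog sat"

# With boundaries: only the standalone word is replaced
bounded <- str_replace_all(text, "\\bcat\\b", "dog")
# "dog category catfish" "the dog sat"
```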
# Match and replace patterns with captured groups
dates <- c("2023-12-25", "2024-01-15", "2023-06-30")
str_replace_all(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1")
# [1] "12/25/2023" "01/15/2024" "06/30/2023"
Captured groups (parentheses in the pattern) can be referenced in the replacement string with \\1, \\2, etc. This enables sophisticated reformatting without multiple operations.
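The same idea works for any reordering task. As a further sketch, here two captured groups swap a "Last, First" name into "First Last" order; the people vector is made up for illustration.

```r
library(stringr)

people <- c("Doe, Jane", "Smith, John")

# Group 1 is the surname, group 2 the given name; emit them swapped
reordered <- str_replace(people, "(\\w+), (\\w+)", "\\2 \\1")
# "Jane Doe" "John Smith"
```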
Named Vector for Multiple Replacements
One of str_replace_all()’s most powerful features is accepting a named vector for batch replacements. This lets you define multiple substitutions in a single call.
# Standardize abbreviations
abbreviations <- c(
  "St\\." = "Street",
  "Ave\\." = "Avenue",
  "Blvd\\." = "Boulevard",
  "Dr\\." = "Drive"
)
addresses <- c("123 Main St.", "456 Oak Ave.", "789 Sunset Blvd.")
str_replace_all(addresses, abbreviations)
# [1] "123 Main Street" "456 Oak Avenue" "789 Sunset Boulevard"
# Standardize state names
# Word boundaries keep "NY" from matching inside "NYC"
state_map <- c(
  "\\bCA\\b" = "California",
  "\\bNY\\b" = "New York",
  "\\bTX\\b" = "Texas"
)
locations <- c("San Francisco, CA", "Austin, TX", "NYC, NY")
str_replace_all(locations, state_map)
# [1] "San Francisco, California" "Austin, Texas" "NYC, New York"
Be aware that replacements are applied sequentially. If one replacement creates a pattern that matches another, you might get unexpected results.
# Order matters - be careful with overlapping patterns
problematic <- c("a" = "b", "b" = "c")
str_replace_all("aaa", problematic)
# [1] "ccc" # 'a' becomes 'b', then 'b' becomes 'c'
# Solution: use patterns that won't overlap, or apply in separate steps
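When the chain is a simple one-way dependency like this, reordering the vector is often enough: apply the replacement whose output could be re-matched last. The sketch below assumes the sequential behavior described above and uses "aab" as a made-up input. It will not help with cycles (for example swapping "a" and "b"), which need a placeholder or separate steps.

```r
library(stringr)

input <- "aab"

# 'a' -> 'b' runs first, then 'b' -> 'c' catches the new b's as well
chained <- str_replace_all(input, c("a" = "b", "b" = "c"))
# "ccc"

# 'b' -> 'c' runs first, so the b's created by 'a' -> 'b' are left alone
reordered <- str_replace_all(input, c("b" = "c", "a" = "b"))
# "bbc"
```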
Practical Use Cases and Best Practices
Real data cleaning often requires combining these techniques. Here are patterns I use regularly.
# Clean phone numbers to digits only
raw_phones <- c("(555) 123-4567", "555.123.4567", "555 123 4567")
str_replace_all(raw_phones, "[^0-9]", "")
# [1] "5551234567" "5551234567" "5551234567"
# Sanitize filenames (remove problematic characters)
filenames <- c("Report Q1/2023.xlsx", "Data: Final (v2).csv")
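filenames %>%
  str_replace_all("[/:()\\s]", "_") %>%   # problem characters and spaces
  str_replace_all("_+", "_") %>%          # collapse runs of underscores
  str_replace_all("^_|_$|_(?=\\.)", "")   # drop leading, trailing, and pre-extension underscores
# [1] "Report_Q1_2023.xlsx" "Data_Final_v2.csv"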
# Clean currency values for numeric conversion
prices <- c("$1,234.56", "$999.00", "$12,345.67")
prices %>%
  str_replace_all("[$,]", "") %>%
  as.numeric()
# [1] 1234.56 999.00 12345.67
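For large datasets, you can store a pattern and its options once with regex() and reuse it. Note that regex() only attaches matching options to the pattern; it does not pre-compile anything, so time the call on your own data rather than assuming a speedup:
# Store the pattern and its options for repeated use
pattern <- regex("\\d{3}", ignore_case = FALSE)
# Use in vectorized operations
large_vector <- rep(codes, 10000)
system.time(str_replace_all(large_vector, pattern, "XXX"))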
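When the pattern is a literal string rather than a regular expression, stringr's fixed() modifier is worth trying, since it compares bytes directly instead of running the regex engine. The sketch below checks that both approaches agree and then times them; actual timings depend on your machine and data, so none are shown.

```r
library(stringr)

codes <- c("ABC-123-XYZ", "DEF-456-UVW", "GHI-789-RST")
large_vector <- rep(codes, 10000)

# Same literal replacement, once as a regex and once as a fixed string
regex_result <- str_replace_all(large_vector, "-", "_")
fixed_result <- str_replace_all(large_vector, fixed("-"), "_")

# Both approaches should agree before you compare their speed
same_output <- identical(regex_result, fixed_result)

# Compare timings on your own data
system.time(str_replace_all(large_vector, "-", "_"))
system.time(str_replace_all(large_vector, fixed("-"), "_"))
```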
Common pitfalls to avoid:
- Forgetting to escape special characters: Dots, brackets, and other regex metacharacters need escaping with double backslashes
- Using str_replace() when you need str_replace_all(): Always verify your output
- Not anchoring patterns: Without ^ and $, you might match substrings unintentionally
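The anchoring pitfall is easy to reproduce. In this sketch, a hypothetical responses vector uses the literal text "NA" as a missing-value marker, and the unanchored pattern damages unrelated values.

```r
library(stringr)

responses <- c("NA", "NAVY", "BANANA")

# Unanchored: "NA" is replaced wherever it appears
unanchored <- str_replace_all(responses, "NA", "missing")
# "missing" "missingVY" "BAmissingmissing"

# Anchored: only strings that are exactly "NA" are replaced
anchored <- str_replace_all(responses, "^NA$", "missing")
# "missing" "NAVY" "BANANA"
```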
Comparison with Base R Alternatives
Base R provides sub() and gsub() for the same operations. Here’s how they compare:
text <- "the quick brown fox"
# Equivalent operations
sub("quick", "slow", text) # Base R - first match
str_replace(text, "quick", "slow") # stringr - first match
gsub("o", "0", text) # Base R - all matches
str_replace_all(text, "o", "0") # stringr - all matches
For simple patterns like these, the functions produce identical results, but stringr offers advantages:
# stringr: consistent argument order (data first, enables piping)
text %>% str_replace_all("o", "0")
# Base R: pattern comes first (less pipe-friendly)
gsub("o", "0", text)
# stringr: named vector replacement built-in
str_replace_all(text, c("quick" = "slow", "brown" = "red"))
# Base R: requires multiple calls or mgsub package
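For comparison, the "multiple calls" route in base R can be folded into one expression with Reduce(). This is only a sketch of one possible approach, not a drop-in equivalent: the replacements vector and base_result name are illustrative, and like the stringr version it applies the substitutions one after another.

```r
text <- "the quick brown fox"
replacements <- c("quick" = "slow", "brown" = "red")

# Apply gsub() once per named pattern, feeding each result into the next call
base_result <- Reduce(
  function(acc, pat) gsub(pat, replacements[[pat]], acc),
  names(replacements),
  text
)
# "the slow red fox"
```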
Choose stringr when you’re already using the tidyverse, need named vector replacements, or value consistent function signatures. Stick with base R when minimizing dependencies matters or you’re writing package code that shouldn’t require stringr.
String replacement is fundamental to data cleaning. Master str_replace() and str_replace_all(), learn the essential regex patterns, and you’ll handle most text manipulation tasks efficiently. The investment in understanding these functions pays dividends every time you face messy data.