R stringr - str_to_lower()/str_to_upper()/str_to_title()
Key Insights
- stringr’s case conversion functions (str_to_lower(), str_to_upper(), str_to_title()) provide consistent, locale-aware text transformation that integrates seamlessly with tidyverse pipelines.
- Locale handling matters more than most developers realize. The Turkish “i” problem alone has caused countless bugs in production systems, and stringr handles it correctly once you pass the appropriate locale.
- While base R’s tolower() and toupper() work fine for simple cases, stringr’s functions offer better Unicode support and predictable behavior across different operating systems.
Introduction to String Case Conversion
Case conversion sounds trivial until you’re debugging why your user authentication fails for Turkish users or why your data join missed 30% of records. Standardizing text case is fundamental to data cleaning, and getting it wrong creates silent failures that propagate through your entire analysis.
The stringr package, part of the tidyverse, provides a consistent interface for string manipulation in R. Its case conversion functions follow the same design philosophy as the rest of the package: predictable behavior, clear naming conventions, and proper Unicode handling.
library(stringr)
library(dplyr)
# Basic examples of each function
text <- "Hello World"
str_to_lower(text) # "hello world"
str_to_upper(text) # "HELLO WORLD"
str_to_title(text) # "Hello World"
str_to_sentence(text) # "Hello world"
These four functions cover the vast majority of case conversion needs. Let’s examine each one and understand when to use them.
str_to_lower() - Converting to Lowercase
The str_to_lower() function converts all characters in a string to lowercase. Its signature is simple:
str_to_lower(string, locale = "en")
Lowercase conversion is your default choice for standardization. When comparing strings, joining datasets, or validating input, convert to lowercase first. This eliminates an entire category of matching failures.
# Standardizing messy survey responses
survey_responses <- c("Yes", "YES", "yes", "yEs", "NO", "no", "No")
str_to_lower(survey_responses)
# [1] "yes" "yes" "yes" "yes" "no" "no" "no"
# Normalizing email addresses for deduplication
emails <- c("John.Doe@Company.com", "JANE.SMITH@company.COM", "bob@COMPANY.com")
str_to_lower(emails)
# [1] "john.doe@company.com" "jane.smith@company.com" "bob@company.com"
# Case-insensitive matching in data cleaning
customer_input <- "PREMIUM"
valid_tiers <- c("basic", "premium", "enterprise")
str_to_lower(customer_input) %in% valid_tiers # TRUE
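The email normalization example illustrates a common pattern. Strictly speaking, RFC 5321 makes only the domain part case-insensitive and leaves the local part (before the @) to the receiving server, but in practice nearly all providers ignore case there too, and users type addresses inconsistently. Lowercase emails before comparing or deduplicating them.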
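The same idea applies to join keys. The sketch below uses two made-up tables (orders and accounts are hypothetical names, not from any real dataset) to show how a join on raw keys can silently drop rows, and how lowercasing both sides first recovers them.

```r
library(dplyr)
library(stringr)

# Hypothetical tables whose join keys differ only in case
orders <- tibble(
  email  = c("John.Doe@Company.com", "JANE.SMITH@company.COM"),
  amount = c(120, 75)
)
accounts <- tibble(
  email = c("john.doe@company.com", "jane.smith@company.com"),
  tier  = c("premium", "basic")
)

# Joining on the raw keys finds no matches
raw_matches <- inner_join(orders, accounts, by = "email")
nrow(raw_matches)   # 0

# Lowercasing both keys before the join recovers every row
matched <- inner_join(
  mutate(orders,   email = str_to_lower(email)),
  mutate(accounts, email = str_to_lower(email)),
  by = "email"
)
nrow(matched)       # 2
```

Normalizing inside mutate() keeps the original tables untouched, which makes it easier to audit what changed.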
str_to_upper() - Converting to Uppercase
The str_to_upper() function converts all characters to uppercase:
str_to_upper(string, locale = "en")
Uppercase conversion is less common than lowercase but essential for specific formatting requirements. Use it for codes, identifiers, and situations where uppercase is the established convention.
# Formatting product SKUs
raw_skus <- c("abc-123", "Def-456", "ghi-789")
str_to_upper(raw_skus)
# [1] "ABC-123" "DEF-456" "GHI-789"
# Creating consistent category labels for reports
categories <- c("electronics", "clothing", "Home & Garden")
str_to_upper(categories)
# [1] "ELECTRONICS" "CLOTHING" "HOME & GARDEN"
# Formatting state abbreviations
states <- c("ca", "Ny", "TX", "fl")
str_to_upper(states)
# [1] "CA" "NY" "TX" "FL"
One practical tip: if you’re building a system where users enter codes or identifiers, convert to uppercase on input and store uppercase. This prevents the “we have both ABC-123 and abc-123 in the database” problem.
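A minimal sketch of that normalize-on-input step might look like the following. The helper name normalize_code() is an assumption made for illustration. It is not part of stringr.

```r
library(stringr)

# Hypothetical helper: trim surrounding whitespace, then uppercase
normalize_code <- function(x) {
  str_to_upper(str_trim(x))
}

incoming <- c("abc-123", " ABC-123", "Abc-123 ")
normalized <- normalize_code(incoming)
normalized
# [1] "ABC-123" "ABC-123" "ABC-123"

# After normalization the three entries collapse to one distinct code
length(unique(normalized))   # 1
```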
str_to_title() and str_to_sentence()
Title case capitalizes the first letter of each word. Sentence case capitalizes only the first letter of the string. Both functions exist because they serve different formatting needs.
str_to_title(string, locale = "en")
str_to_sentence(string, locale = "en")
Here’s where they differ:
text <- "the quick brown fox jumps over the lazy dog"
str_to_title(text)
# [1] "The Quick Brown Fox Jumps Over The Lazy Dog"
str_to_sentence(text)
# [1] "The quick brown fox jumps over the lazy dog"
Title case is useful for names and headings. Sentence case works for normalizing text that should read like natural prose.
# Formatting names in a dataset
messy_names <- c("john doe", "JANE SMITH", "bob JOHNSON")
str_to_title(messy_names)
# [1] "John Doe" "Jane Smith" "Bob Johnson"
# Cleaning book titles
titles <- c("THE GREAT GATSBY", "to kill a mockingbird", "1984")
str_to_title(titles)
# [1] "The Great Gatsby" "To Kill A Mockingbird" "1984"
A word of caution: str_to_title() capitalizes every word, including articles and prepositions. “To Kill A Mockingbird” isn’t technically correct title case (it should be “To Kill a Mockingbird”). For publication-quality title formatting, you’ll need additional logic or a specialized package.
# Handling edge cases with names
names_with_particles <- c("ludwig van beethoven", "vincent van gogh")
str_to_title(names_with_particles)
# [1] "Ludwig Van Beethoven" "Vincent Van Gogh"
# Note: "van" should arguably stay lowercase
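One way to approximate conventional title case is to apply str_to_title() and then re-lowercase a list of small words wherever they follow a space. The helper below is a rough sketch under that assumption. Its name, title_case_with_exceptions(), and its default word list are illustrative and not part of stringr.

```r
library(stringr)

# Hypothetical helper: title-case, then lowercase selected words that are
# preceded by a space (so a word at the start of the string stays capitalized)
title_case_with_exceptions <- function(
    x,
    small = c("a", "an", "the", "of", "to", "in", "on", "and")) {
  out <- str_to_title(x)
  for (w in small) {
    # Lookbehind for a space, then the title-cased word, then a word boundary
    pattern <- paste0("(?<= )", str_to_title(w), "\\b")
    out <- str_replace_all(out, pattern, w)
  }
  out
}

title_case_with_exceptions(c("THE GREAT GATSBY", "to kill a mockingbird"))
# [1] "The Great Gatsby"      "To Kill a Mockingbird"

# The same helper covers name particles when given a different word list
title_case_with_exceptions("ludwig van beethoven", small = "van")
# [1] "Ludwig van Beethoven"
```

A fixed word list will never cover every style guide, so treat this as a starting point rather than a complete solution.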
Locale-Aware Case Conversion
The locale parameter exists because case conversion isn’t universal. The most famous example is the Turkish “i” problem.
In English, “i” becomes “I” when uppercased. In Turkish, “i” becomes “İ” (with a dot), and there’s a separate letter “ı” (dotless i) that becomes “I”. This distinction has broken countless software systems.
# English locale (default)
str_to_upper("i", locale = "en")
# [1] "I"
str_to_lower("I", locale = "en")
# [1] "i"
# Turkish locale
str_to_upper("i", locale = "tr")
# [1] "İ"
str_to_lower("I", locale = "tr")
# [1] "ı"
# This matters for real data
turkish_city <- "istanbul"
str_to_upper(turkish_city, locale = "en") # "ISTANBUL"
str_to_upper(turkish_city, locale = "tr") # "İSTANBUL"
German has similar considerations with the eszett (ß), which uppercases to two letters:
german_word <- "straße" # street
str_to_upper(german_word, locale = "de")
# [1] "STRASSE"
If you’re processing international text, set the locale explicitly. If you’re unsure, “en” is a safe default for most Western European languages, but be aware of its limitations.
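To see why the locale choice matters for matching, the short sketch below compares a lowercased value against a plain ASCII key under both locales, and checks how uppercasing “ß” changes string length. The variable names are made up for illustration.

```r
library(stringr)

city_upper <- "ISTANBUL"

# Turkish rules map "I" to dotless "ı", so the result no longer
# equals the ASCII key
lower_tr <- str_to_lower(city_upper, locale = "tr")
lower_en <- str_to_lower(city_upper, locale = "en")
lower_tr == "istanbul"   # FALSE
lower_en == "istanbul"   # TRUE

# Uppercasing "ß" produces "SS", so the string grows by one character
street <- "straße"
str_length(street)                 # 6
str_length(str_to_upper(street))   # 7
```

The practical rule is to use the same locale on both sides of any comparison, and to avoid assuming that case conversion preserves string length.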
Practical Data Cleaning Workflow
Real data cleaning combines multiple operations. Here’s a complete pipeline for cleaning a messy customer names column:
library(stringr)
library(dplyr)
# Simulated messy customer data
customers <- tibble(
  id = 1:6,
  raw_name = c(
    " JOHN DOE ",
    "jane smith",
    "Bob JOHNSON",
    "MARY-JANE WATSON",
    " tim o'brien ",
    "josé garcía"
  ),
  email = c(
    "John.Doe@Gmail.COM",
    "JANE@company.com",
    "bob@Company.Com",
    "mj@email.COM",
    "Tim@EMAIL.com",
    "jose@correo.COM"
  )
)
# Complete cleaning pipeline
customers_clean <- customers %>%
  mutate(
    # Clean and standardize names
    clean_name = raw_name %>%
      str_squish() %>%          # Remove extra whitespace
      str_to_title(),           # Proper case
    # Normalize emails to lowercase
    clean_email = str_to_lower(email),
    # Create uppercase last name for sorting/indexing
    last_name_upper = raw_name %>%
      str_squish() %>%
      str_extract("\\S+$") %>%  # Extract last word
      str_to_upper()
  )
print(customers_clean)
# # A tibble: 6 × 6
#      id raw_name           email              clean_name       clean_email        last_name_upper
#   <int> <chr>              <chr>              <chr>            <chr>              <chr>
# 1     1 " JOHN DOE "       John.Doe@Gmail.COM John Doe         john.doe@gmail.com DOE
# 2     2 "jane smith"       JANE@company.com   Jane Smith       jane@company.com   SMITH
# 3     3 "Bob JOHNSON"      bob@Company.Com    Bob Johnson      bob@company.com    JOHNSON
# 4     4 "MARY-JANE WATSON" mj@email.COM       Mary-Jane Watson mj@email.com       WATSON
# 5     5 " tim o'brien "    Tim@EMAIL.com      Tim O'brien      tim@email.com      O'BRIEN
# 6     6 "josé garcía"      jose@correo.COM    José García      jose@correo.com    GARCÍA
Notice how the pipeline handles accented characters correctly: “josé garcía” becomes “José García” with proper capitalization of accented letters. Apostrophes are a different story. str_to_title() treats “o'brien” as a single word and returns “O'brien”, so names like this need a manual fix-up after the pipeline runs.
Base R Comparison and Performance
Base R provides tolower(), toupper(), and tools::toTitleCase() for case conversion. Here’s how they compare:
text <- "Hello World"
# Base R
tolower(text) # "hello world"
toupper(text) # "HELLO WORLD"
tools::toTitleCase(text) # "Hello World"
# stringr
str_to_lower(text) # "hello world"
str_to_upper(text) # "HELLO WORLD"
str_to_title(text) # "Hello World"
For this simple input, the output is identical. The differences emerge in three areas:
1. Locale handling: stringr makes locale explicit; base R uses system locale.
# stringr - explicit locale
str_to_upper("i", locale = "tr") # "İ"
# Base R - depends on system settings
toupper("i") # Usually "I", but system-dependent
2. Consistency across platforms: stringr uses the stringi package internally, which is built on the ICU library, providing consistent behavior across Windows, macOS, and Linux. Base R’s behavior can vary.
3. Integration with tidyverse: stringr functions work naturally in dplyr pipelines.
# This reads naturally
df %>% mutate(name = str_to_title(name))
# This works but feels less integrated
df %>% mutate(name = tools::toTitleCase(name))
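One caveat when swapping the two title-case functions: tools::toTitleCase() is documented as targeting English title style and leaves certain small words lowercase, so it should not be assumed to agree with str_to_title() on arbitrary input. The sketch below simply flags where the two disagree. It deliberately shows no expected output for the base R calls, so run it on your own data to see the differences.

```r
library(stringr)

shouting  <- "THE GREAT GATSBY"
lowercase <- "to kill a mockingbird"

# stringr lowercases everything, then capitalizes each word
str_to_title(shouting)    # "The Great Gatsby"
str_to_title(lowercase)   # "To Kill A Mockingbird"

# Base R follows its own English title-style rules
base_shouting  <- tools::toTitleCase(shouting)
base_lowercase <- tools::toTitleCase(lowercase)

# Flag the inputs where the two approaches disagree
disagree <- c(
  shouting  = base_shouting  != str_to_title(shouting),
  lowercase = base_lowercase != str_to_title(lowercase)
)
disagree
```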
For performance, base R functions are marginally faster for simple operations on small vectors. For most practical data science work, the difference is negligible, and stringr’s consistency and features outweigh the minor performance cost.
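If you want to check the performance claim on your own machine, a rough timing sketch along these lines is enough. The workload is made up for illustration, and system.time() gives only a coarse measurement, so no timings are shown here.

```r
library(stringr)

# Hypothetical workload: one million short mixed-case ASCII strings
set.seed(1)
x <- sample(c("Hello World", "FOO bar", "MiXeD CaSe"), 1e6, replace = TRUE)

base_time    <- system.time(base_out    <- tolower(x))
stringr_time <- system.time(stringr_out <- str_to_lower(x))

# Both approaches agree on this ASCII input
identical(base_out, stringr_out)

# Compare elapsed seconds; results depend on hardware and R version
base_time["elapsed"]
stringr_time["elapsed"]
```

For more careful comparisons, a dedicated benchmarking package that repeats each expression many times will give steadier numbers than a single system.time() call.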
My recommendation: Use stringr for data analysis and pipeline work. Use base R if you’re writing a package with minimal dependencies or processing millions of strings in a tight loop where every millisecond matters.