R stringr - str_to_lower()/str_to_upper()/str_to_title()

Key Insights

  • stringr’s case conversion functions (str_to_lower(), str_to_upper(), str_to_title()) provide consistent, locale-aware text transformation that integrates seamlessly with tidyverse pipelines.
  • Locale handling matters more than most developers realize. The Turkish “i” problem alone has caused countless bugs in production systems, and stringr handles it correctly once you pass the right locale (the default is "en").
  • While base R’s tolower() and toupper() work fine for simple cases, stringr’s functions offer better Unicode support and predictable behavior across different operating systems.

Introduction to String Case Conversion

Case conversion sounds trivial until you’re debugging why your user authentication fails for Turkish users or why your data join missed 30% of records. Standardizing text case is fundamental to data cleaning, and getting it wrong creates silent failures that propagate through your entire analysis.

The stringr package, part of the tidyverse, provides a consistent interface for string manipulation in R. Its case conversion functions follow the same design philosophy as the rest of the package: predictable behavior, clear naming conventions, and proper Unicode handling.

library(stringr)
library(dplyr)

# Basic examples of each function
text <- "Hello World"

str_to_lower(text)  # "hello world"
str_to_upper(text)  # "HELLO WORLD"
str_to_title(text)  # "Hello World"
str_to_sentence(text)  # "Hello world"

These four functions cover the vast majority of case conversion needs. Let’s examine each one and understand when to use them.

str_to_lower() - Converting to Lowercase

The str_to_lower() function converts all characters in a string to lowercase. Its signature is simple:

str_to_lower(string, locale = "en")

Lowercase conversion is your default choice for standardization. When comparing strings, joining datasets, or validating input, convert to lowercase first. This eliminates an entire category of matching failures.
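
For the joining case in particular, here is a minimal sketch of how mismatched case can silently drop rows. The orders and regions tables, their columns, and the key column are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical tables whose customer names differ only by case
orders <- tibble(
  customer = c("ACME", "Globex", "initech"),
  amount   = c(100, 250, 75)
)
regions <- tibble(
  customer = c("acme", "globex", "Initech"),
  region   = c("West", "East", "South")
)

# Joining on the raw column finds no matches at all
raw_join <- inner_join(orders, regions, by = "customer")
nrow(raw_join)  # 0

# Joining on a lowercased key recovers every row
joined <- inner_join(
  orders  %>% mutate(key = str_to_lower(customer)),
  regions %>% mutate(key = str_to_lower(customer)),
  by = "key",
  suffix = c("_orders", "_regions")
)
nrow(joined)  # 3
```

Keeping the original column next to the normalized key makes it easy to audit which spellings were merged.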

# Standardizing messy survey responses
survey_responses <- c("Yes", "YES", "yes", "yEs", "NO", "no", "No")
str_to_lower(survey_responses)
# [1] "yes" "yes" "yes" "yes" "no"  "no"  "no"

# Normalizing email addresses for deduplication
emails <- c("John.Doe@Company.com", "JANE.SMITH@company.COM", "bob@COMPANY.com")
str_to_lower(emails)
# [1] "john.doe@company.com" "jane.smith@company.com" "bob@company.com"

# Case-insensitive matching in data cleaning
customer_input <- "PREMIUM"
valid_tiers <- c("basic", "premium", "enterprise")
str_to_lower(customer_input) %in% valid_tiers  # TRUE

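The email normalization example illustrates a common pattern. Strictly speaking, RFC 5321 makes only the domain part case-insensitive and leaves the local part (before the @) to the receiving server, but in practice nearly all providers ignore case there too, and users type addresses inconsistently. Lowercasing emails before storing or comparing them is therefore a sensible default for deduplication.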
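
If you would rather be conservative about the part before the @ (some mail systems may treat it as case-sensitive), one option is to lowercase only the domain. The sketch below compares both approaches; lower_domain() is a hypothetical helper, not part of stringr.

```r
library(stringr)

emails <- c("John.Doe@Gmail.COM", "john.doe@gmail.com", "JANE@company.com")

# Option 1: lowercase the whole address (simplest, good for deduplication)
full_lower <- unique(str_to_lower(emails))
full_lower
# [1] "john.doe@gmail.com" "jane@company.com"

# Option 2: lowercase only the domain, leave the local part untouched
lower_domain <- function(x) {
  parts <- str_split_fixed(x, "@", 2)   # column 1: local part, column 2: domain
  str_c(parts[, 1], "@", str_to_lower(parts[, 2]))
}
domain_lower <- lower_domain(emails)
domain_lower
# [1] "John.Doe@gmail.com" "john.doe@gmail.com" "JANE@company.com"
```

Option 2 preserves what the user typed but no longer collapses the first two addresses, so which one fits depends on whether you are deduplicating or delivering mail.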

str_to_upper() - Converting to Uppercase

The str_to_upper() function converts all characters to uppercase:

str_to_upper(string, locale = "en")

Uppercase conversion is less common than lowercase but essential for specific formatting requirements. Use it for codes, identifiers, and situations where uppercase is the established convention.

# Formatting product SKUs
raw_skus <- c("abc-123", "Def-456", "ghi-789")
str_to_upper(raw_skus)
# [1] "ABC-123" "DEF-456" "GHI-789"

# Creating consistent category labels for reports
categories <- c("electronics", "clothing", "Home & Garden")
str_to_upper(categories)
# [1] "ELECTRONICS" "CLOTHING" "HOME & GARDEN"

# Formatting state abbreviations
states <- c("ca", "Ny", "TX", "fl")
str_to_upper(states)
# [1] "CA" "NY" "TX" "FL"

One practical tip: if you’re building a system where users enter codes or identifiers, convert to uppercase on input and store uppercase. This prevents the “we have both ABC-123 and abc-123 in the database” problem.
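
A small sketch of that convert-on-input idea follows. It assumes codes should also be trimmed of stray whitespace; normalize_sku() and the sample data are hypothetical.

```r
library(stringr)

# Hypothetical helper: trim whitespace, then uppercase
normalize_sku <- function(x) {
  str_to_upper(str_trim(x))
}

existing_skus <- c("ABC-123", "DEF-456")           # already stored in uppercase
incoming      <- c("abc-123", " def-456 ", "ghi-789")

# Check which incoming codes already exist once normalized
already_stored <- normalize_sku(incoming) %in% existing_skus
already_stored
# [1]  TRUE  TRUE FALSE
```

Because normalization happens before the lookup, "abc-123" and "ABC-123" can never end up as separate records.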

str_to_title() and str_to_sentence()

Title case capitalizes the first letter of each word and lowercases the rest. Sentence case capitalizes only the first letter of the string and lowercases everything after it. Both functions exist because they serve different formatting needs.

str_to_title(string, locale = "en")
str_to_sentence(string, locale = "en")

Here’s where they differ:

text <- "the quick brown fox jumps over the lazy dog"

str_to_title(text)
# [1] "The Quick Brown Fox Jumps Over The Lazy Dog"

str_to_sentence(text)
# [1] "The quick brown fox jumps over the lazy dog"

Title case is useful for names and headings. Sentence case works for normalizing text that should read like natural prose.

# Formatting names in a dataset
messy_names <- c("john doe", "JANE SMITH", "bob JOHNSON")
str_to_title(messy_names)
# [1] "John Doe" "Jane Smith" "Bob Johnson"

# Cleaning book titles
titles <- c("THE GREAT GATSBY", "to kill a mockingbird", "1984")
str_to_title(titles)
# [1] "The Great Gatsby" "To Kill A Mockingbird" "1984"

A word of caution: str_to_title() capitalizes every word, including articles and prepositions. “To Kill A Mockingbird” isn’t technically correct title case (it should be “To Kill a Mockingbird”). For publication-quality title formatting, you’ll need additional logic or a specialized package.
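
As a rough sketch of what that additional logic might look like, the hypothetical helper below applies str_to_title() and then lowercases a short, hand-picked list of small words, except when one starts the title. The word list is an assumption and far from complete; real style guides have more rules (for example, capitalizing the last word).

```r
library(stringr)

# Assumed list of words to keep lowercase; adjust to your style guide
small_words <- c("a", "an", "and", "the", "of", "in", "on", "to", "for")

to_headline_case <- function(x) {
  vapply(x, function(s) {
    words <- str_split(str_to_title(s), " ")[[1]]
    lower <- str_to_lower(words)
    is_small <- lower %in% small_words
    is_small[1] <- FALSE                 # the first word is always capitalized
    words[is_small] <- lower[is_small]
    str_c(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

headlines <- to_headline_case(c("to kill a mockingbird", "THE GREAT GATSBY"))
headlines
# [1] "To Kill a Mockingbird" "The Great Gatsby"
```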

# Handling edge cases with names
names_with_particles <- c("ludwig van beethoven", "vincent van gogh")
str_to_title(names_with_particles)
# [1] "Ludwig Van Beethoven" "Vincent Van Gogh"
# Note: "van" should arguably stay lowercase

Locale-Aware Case Conversion

The locale parameter exists because case conversion isn’t universal. The most famous example is the Turkish “i” problem.

In English, “i” becomes “I” when uppercased. In Turkish, “i” becomes “İ” (with a dot), and there’s a separate letter “ı” (dotless i) that becomes “I”. This distinction has broken countless software systems.

# English locale (default)
str_to_upper("i", locale = "en")
# [1] "I"

str_to_lower("I", locale = "en")
# [1] "i"

# Turkish locale
str_to_upper("i", locale = "tr")
# [1] "İ"

str_to_lower("I", locale = "tr")
# [1] "ı"

# This matters for real data
turkish_city <- "istanbul"
str_to_upper(turkish_city, locale = "en")  # "ISTANBUL"
str_to_upper(turkish_city, locale = "tr")  # "İSTANBUL"

German has similar considerations with the eszett (ß):

german_word <- "straße"  # street
str_to_upper(german_word, locale = "de")
# [1] "STRASSE"

If you’re processing international text, set the locale explicitly. If you’re unsure, “en” is a safe default for most Western European languages, but be aware of its limitations.
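
To see why the explicit locale matters for matching, consider this sketch, in which an uppercase Turkish city name is compared against a lowercase lookup key. The variable names are illustrative, and the strings use Unicode escapes so the example does not depend on the file encoding.

```r
library(stringr)

# \u0130 is the dotted capital I, \u0131 is the dotless small i
stored_city <- "D\u0130YARBAKIR"
lookup_key  <- "diyarbak\u0131r"

# Turkish rules: dotted capital I -> i, plain I -> dotless i
match_tr <- str_to_lower(stored_city, locale = "tr") == lookup_key

# English rules: plain I -> i, so the result no longer equals the key
match_en <- str_to_lower(stored_city, locale = "en") == lookup_key

match_tr  # TRUE
match_en  # FALSE
```

The English-locale comparison fails silently, which is exactly the kind of bug that shows up as missed joins or failed lookups.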

Practical Data Cleaning Workflow

Real data cleaning combines multiple operations. Here’s a complete pipeline for cleaning a messy customer names column:

library(stringr)
library(dplyr)

# Simulated messy customer data
customers <- tibble(
  id = 1:6,
  raw_name = c(
    "  JOHN   DOE  ",
    "jane smith",
    "Bob    JOHNSON",
    "MARY-JANE WATSON",
    "  tim o'brien  ",
    "josé garcía"
  ),
  email = c(
    "John.Doe@Gmail.COM",
    "JANE@company.com",
    "bob@Company.Com",
    "mj@email.COM",
    "Tim@EMAIL.com",
    "jose@correo.COM"
  )
)

# Complete cleaning pipeline
customers_clean <- customers %>%
  mutate(
    # Clean and standardize names
    clean_name = raw_name %>%
      str_squish() %>%           # Remove extra whitespace
      str_to_title(),            # Proper case
    
    # Normalize emails to lowercase
    clean_email = str_to_lower(email),
    
    # Create uppercase last name for sorting/indexing
    last_name_upper = raw_name %>%
      str_squish() %>%
      str_extract("\\S+$") %>%   # Extract last word
      str_to_upper()
  )

print(customers_clean)
# # A tibble: 6 × 6
#      id raw_name           email              clean_name        clean_email          last_name_upper
#   <int> <chr>              <chr>              <chr>             <chr>                <chr>
# 1     1 "  JOHN   DOE  "   John.Doe@Gmail.COM John Doe          john.doe@gmail.com   DOE
# 2     2 "jane smith"       JANE@company.com   Jane Smith        jane@company.com     SMITH
# 3     3 "Bob    JOHNSON"   bob@Company.Com    Bob Johnson       bob@company.com      JOHNSON
# 4     4 "MARY-JANE WATSON" mj@email.COM       Mary-Jane Watson  mj@email.com         WATSON
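# 5     5 "  tim o'brien  "  Tim@EMAIL.com      Tim O'brien       tim@email.com        O'BRIEN
# 6     6 "josé garcía"      jose@correo.COM    José García       jose@correo.com      GARCÍA

Notice how the pipeline handles accented characters correctly: “josé garcía” becomes “José García” with proper capitalization of accented letters. Row 5 shows a limitation, though. str_to_title() treats “o'brien” as a single word, so it produces “O'brien” rather than “O'Brien”. Names like this need a manual fix.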

Base R Comparison and Performance

Base R provides tolower(), toupper(), and tools::toTitleCase() for case conversion. Here’s how they compare:

text <- "Hello World"

# Base R
tolower(text)           # "hello world"
toupper(text)           # "HELLO WORLD"
tools::toTitleCase(text) # "Hello World"

# stringr
str_to_lower(text)      # "hello world"
str_to_upper(text)      # "HELLO WORLD"
str_to_title(text)      # "Hello World"

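For simple cases like this one, the output is identical. Note that tools::toTitleCase() follows English title-case conventions, so it is not a drop-in equivalent of str_to_title() on every input. Beyond that, the differences emerge in three areas: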

1. Locale handling: stringr makes locale explicit; base R uses system locale.

# stringr - explicit locale
str_to_upper("i", locale = "tr")  # "İ"

# Base R - depends on system settings
toupper("i")  # Usually "I", but system-dependent

2. Consistency across platforms: stringr is built on the stringi package, which wraps the ICU library, so behavior is consistent across Windows, macOS, and Linux. Base R’s behavior can vary with the platform and system locale.

3. Integration with tidyverse: stringr functions work naturally in dplyr pipelines.

# This reads naturally
df %>% mutate(name = str_to_title(name))

# This works but feels less integrated
df %>% mutate(name = tools::toTitleCase(name))

For performance, base R functions are marginally faster for simple operations on small vectors. For most practical data science work, the difference is negligible, and stringr’s consistency and features outweigh the minor performance cost.
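
If you want to check this on your own machine, a quick timing sketch using only system.time() looks like the following. The vector size and random ASCII strings are arbitrary choices, and timings will vary by system, so no numbers are shown here.

```r
library(stringr)

set.seed(42)

# Build 20,000 random 8-letter ASCII strings with mixed case
words <- vapply(
  seq_len(20000),
  function(i) paste(sample(c(letters, LETTERS), 8, replace = TRUE), collapse = ""),
  character(1)
)

# Time each approach once; use a benchmarking package for serious comparisons
base_time    <- system.time(base_res <- tolower(words))
stringr_time <- system.time(stringr_res <- str_to_lower(words))

base_time["elapsed"]
stringr_time["elapsed"]

# For plain ASCII input both approaches should agree
same_result <- identical(base_res, stringr_res)
same_result
```

A single system.time() call is noisy, so treat the result as a rough indication rather than a benchmark.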

My recommendation: Use stringr for data analysis and pipeline work. Use base R if you’re writing a package with minimal dependencies or processing millions of strings in a tight loop where every millisecond matters.
