R stringr - str_trim()/str_pad()

Key Insights

  • str_trim() removes unwanted whitespace from string ends while str_pad() adds controlled padding—together they give you complete control over string boundaries in your data cleaning pipelines.
  • The side parameter in both functions accepts “left”, “right”, or “both”, so you control exactly which end of each string is modified. The defaults differ: str_trim() defaults to “both”, while str_pad() defaults to “left”.
  • str_squish() complements str_trim() by collapsing internal whitespace runs into single spaces, making it essential for cleaning messy text data with irregular spacing.

Introduction to String Whitespace Management

Whitespace problems are everywhere in real-world data. CSV exports with trailing spaces that break joins. User input with invisible characters that cause silent matching failures. IDs that need consistent formatting for downstream systems. If you’ve spent any time cleaning data, you’ve fought these battles.

The stringr package, part of the tidyverse ecosystem, provides a consistent, readable interface for string manipulation in R. Unlike base R’s scattered string functions with inconsistent naming and argument patterns, stringr functions all start with str_ and follow predictable conventions. This article focuses on two complementary functions: str_trim() for removing whitespace and str_pad() for adding it in a controlled way.

library(stringr)
library(dplyr)

str_trim(): Removing Unwanted Whitespace

The str_trim() function strips whitespace from the beginning and end of strings. Its signature is straightforward:

str_trim(string, side = c("both", "left", "right"))

The side parameter controls which end gets trimmed. The default is “both”, which handles the most common case where you don’t know or care which side has the problem.
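
As a quick illustration of the three side values, here is a minimal sketch on a single made-up string (the variable names are only for this example). The results are wrapped in brackets so the surviving spaces are visible:

```r
library(stringr)

x <- "  padded  "

trim_both  <- str_trim(x)                  # default: side = "both"
trim_left  <- str_trim(x, side = "left")   # removes leading spaces only
trim_right <- str_trim(x, side = "right")  # removes trailing spaces only

# Show each result in brackets so the remaining spaces are visible
cat(sprintf("[%s]\n", c(trim_both, trim_left, trim_right)), sep = "")
# [padded]
# [padded  ]
# [  padded]
```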

Consider a typical scenario: you’ve imported a CSV file where some fields have leading or trailing spaces due to inconsistent data entry or export quirks.

# Messy data imported from CSV
customer_data <- tibble(
  customer_id = c("  A001", "A002 ", "  A003  ", "A004"),
  name = c("John Smith  ", "  Jane Doe", "Bob Wilson", "  Alice Brown  "),
  city = c("New York  ", "  Los Angeles", "Chicago", "  Houston  ")
)

# These spaces cause real problems: even the first ID fails to match
customer_data$customer_id == "A001"
# [1] FALSE FALSE FALSE FALSE

# Clean all character columns
customer_data_clean <- customer_data %>%
  mutate(across(where(is.character), str_trim))

# Now matching works correctly
customer_data_clean$customer_id == "A001"
# [1] TRUE FALSE FALSE FALSE

This pattern—using across() with str_trim()—is something you’ll use constantly. It’s defensive programming: apply trimming to all character columns regardless of whether you think they need it.
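
The same defensive trimming is what rescues joins on keys with stray spaces. The sketch below uses two small invented tables (orders and customers are hypothetical names, not data from this article) to show a left join that silently fails to match until the key is trimmed:

```r
library(stringr)
library(dplyr)

# Hypothetical tables: the order keys carry stray spaces, the lookup keys do not
orders <- tibble(customer_id = c("A001 ", " A002"), amount = c(10, 20))
customers <- tibble(customer_id = c("A001", "A002"), name = c("John", "Jane"))

# Joining on the raw keys finds no matches, so every name is NA
unmatched <- left_join(orders, customers, by = "customer_id")
sum(is.na(unmatched$name))
# [1] 2

# Trimming the key first lets every row match
matched <- orders %>%
  mutate(customer_id = str_trim(customer_id)) %>%
  left_join(customers, by = "customer_id")
sum(is.na(matched$name))
# [1] 0
```

Note that the failed join raises no error or warning; the only symptom is the missing values.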

Sometimes you need selective trimming. If you’re dealing with formatted text where leading spaces are intentional (like indentation) but trailing spaces are garbage:

code_comments <- c("  # Main function  ", "    # Helper method  ", "# Utility  ")

# Preserve indentation, remove trailing spaces
str_trim(code_comments, side = "right")
# [1] "  # Main function"   "    # Helper method" "# Utility"

str_squish(): Bonus Companion Function

While str_trim() handles the ends, str_squish() goes further by also collapsing runs of internal whitespace into single spaces. This function is invaluable when dealing with text that’s been copied from PDFs, scraped from websites, or entered by users who got heavy-handed with the spacebar.

messy_text <- c(
  "John    Smith",
  "Jane   Doe  ",
  "  Bob    Wilson  ",
  "Alice\t\tBrown"  # tabs count as whitespace too
)

str_trim(messy_text)
# [1] "John    Smith" "Jane   Doe"   "Bob    Wilson" "Alice\t\tBrown"

str_squish(messy_text)
# [1] "John Smith"   "Jane Doe"     "Bob Wilson"   "Alice Brown"

Notice that str_trim() leaves the internal spacing intact, while str_squish() normalizes everything. Choose based on whether internal whitespace is meaningful in your context. For names and addresses, str_squish() is usually correct. For code or formatted text, you might want str_trim() only.

# Real-world example: cleaning address data
addresses <- c(
  "123   Main    Street",
  "456  Oak   Avenue  ",
  "  789   Pine   Road  "
)

addresses %>%
  str_squish() %>%
  str_to_title()
# [1] "123 Main Street" "456 Oak Avenue"  "789 Pine Road"

str_pad(): Adding Controlled Whitespace

The inverse problem is equally common: you need strings to be a specific length, padded with spaces or other characters. The str_pad() function handles this:

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

The width parameter specifies the target length. Strings shorter than this get padded; strings already at or above this length pass through unchanged. The pad parameter lets you specify what character to use (defaulting to a space).
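
To make the "pass through unchanged" behavior concrete, here is a small sketch with made-up codes. It shows that str_pad() treats width as a minimum and does not truncate longer strings:

```r
library(stringr)

codes <- c("7", "12345", "1234567")

# width is a minimum: shorter strings are padded, longer ones are left alone
padded <- str_pad(codes, width = 5, side = "left", pad = "0")
padded
# [1] "00007"   "12345"   "1234567"

str_length(padded)
# [1] 5 5 7
```

If a hard maximum is required as well, the string has to be shortened separately before padding.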

The classic use case is formatting numeric IDs with leading zeros:

product_ids <- c("1", "42", "7", "123", "9999")

# Pad to 5 digits with leading zeros
str_pad(product_ids, width = 5, side = "left", pad = "0")
# [1] "00001" "00042" "00007" "00123" "09999"

This is arguably cleaner and more readable than the base R equivalent, sprintf("%05d", x), which also expects whole-number numeric input rather than strings. It’s also more flexible: you can pad with any single character, not just zeros.
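
As one example of padding with a character other than zero or space, the sketch below builds dot leaders for a table-of-contents style listing. The chapter names and page numbers are invented for illustration:

```r
library(stringr)

chapters <- c("Intro", "Methods")
pages <- c("1", "12")

# Right-pad the titles and left-pad the page numbers with dots
toc <- paste0(
  str_pad(chapters, width = 12, side = "right", pad = "."),
  str_pad(pages, width = 4, side = "left", pad = ".")
)
toc
# [1] "Intro..........1" "Methods.......12"
```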

Right-padding is useful for creating fixed-width text output:

items <- c("Apple", "Banana", "Cherry")
prices <- c("$1.99", "$0.59", "$3.49")

# Create aligned columns
paste0(str_pad(items, 10, "right"), " | ", str_pad(prices, 6, "left"))
# [1] "Apple      |  $1.99" "Banana     |  $0.59" "Cherry     |  $3.49"

Center padding with side = "both" is less common but useful for creating centered headers or labels:

str_pad("REPORT", width = 20, side = "both", pad = "-")
# [1] "-------REPORT-------"

Practical Applications

Let’s work through a realistic scenario: you’re receiving product data from multiple vendors, each with their own formatting quirks. You need to standardize everything before loading it into your database.

# Raw vendor data with inconsistent formatting
vendor_data <- tibble(
  raw_sku = c("  SKU-001  ", "sku-42", "SKU-7  ", "  sku-123"),
  raw_name = c("Widget   Pro", "  Gadget    Basic", "Gizmo  Standard  ", "Tool   Premium"),
  raw_category = c("ELECTRONICS", "  electronics", "Electronics  ", "  ELECTRONICS  ")
)

# Comprehensive cleaning pipeline
clean_data <- vendor_data %>%
  mutate(
    # Step 1: Trim and squish all text fields
    sku = raw_sku %>% str_squish() %>% str_to_upper(),
    name = raw_name %>% str_squish() %>% str_to_title(),
    category = raw_category %>% str_squish() %>% str_to_lower(),
    
    # Step 2: Extract numeric portion and pad to standard format
    sku_number = sku %>%
      str_extract("\\d+") %>%
      str_pad(width = 5, side = "left", pad = "0"),
    
    # Step 3: Create standardized SKU
    standard_sku = paste0("PRD-", sku_number)
  ) %>%
  select(standard_sku, name, category)

clean_data
# # A tibble: 4 × 3
#   standard_sku name            category   
#   <chr>        <chr>           <chr>      
# 1 PRD-00001    Widget Pro      electronics
# 2 PRD-00042    Gadget Basic    electronics
# 3 PRD-00007    Gizmo Standard  electronics
# 4 PRD-00123    Tool Premium    electronics
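
One edge case worth guarding against in a pipeline like this: if a SKU contains no digits, str_extract() returns NA, and paste0() would turn that into the literal string "PRD-NA". The sketch below wraps the same steps in a hypothetical helper (standardize_sku() is not part of stringr) and uses str_c(), which propagates NA instead:

```r
library(stringr)

# Hypothetical helper for illustration; not part of stringr
standardize_sku <- function(raw_sku, width = 5) {
  digits <- str_extract(str_squish(raw_sku), "\\d+")  # NA when no digits found
  padded <- str_pad(digits, width = width, side = "left", pad = "0")
  str_c("PRD-", padded)  # str_c() keeps NA as NA; paste0() would give "PRD-NA"
}

skus <- standardize_sku(c("  SKU-001  ", "sku-42", "unknown"))
skus
# [1] "PRD-00001" "PRD-00042" NA

paste0("PRD-", NA)
# [1] "PRD-NA"
```

Whether a missing SKU should stay NA or raise an error depends on the downstream system; the point is only that the failure should be visible.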

Another common scenario is preparing data for fixed-width file exports, which some legacy systems still require:

# Prepare data for fixed-width export
export_data <- tibble(
  account = c("12345", "67890", "11111"),
  name = c("Smith, John", "Doe, Jane", "Wilson, Bob"),
  balance = c("1500.00", "250.50", "10000.00")
)

fixed_width_output <- export_data %>%
  mutate(
    account_fixed = str_pad(account, 10, "right"),
    name_fixed = str_pad(str_trunc(name, 20), 20, "right"),
    balance_fixed = str_pad(balance, 12, "left"),
    record = paste0(account_fixed, name_fixed, balance_fixed)
  ) %>%
  pull(record)

cat(fixed_width_output, sep = "\n")
# 12345     Smith, John              1500.00
# 67890     Doe, Jane                 250.50
# 11111     Wilson, Bob             10000.00
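
None of the names above exceed 20 characters, so str_trunc() never fires in that example. When it does, it replaces the tail with an ellipsis by default, which may not be what a fixed-width consumer expects. The sketch below, using an invented long name, contrasts that with a plain cut via str_sub():

```r
library(stringr)

long_name <- "Featherstonehaugh, Bartholomew"

# str_trunc() keeps the total width at 20 but spends 3 characters on "..."
with_ellipsis <- str_trunc(long_name, 20)
with_ellipsis
# [1] "Featherstonehaugh..."

# str_sub() simply keeps the first 20 characters
hard_cut <- str_sub(long_name, 1, 20)
hard_cut
# [1] "Featherstonehaugh, B"

# Either way, the padded field is exactly 20 characters wide
str_length(str_pad(c(with_ellipsis, hard_cut), 20, "right"))
# [1] 20 20
```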

Performance Considerations

For most data cleaning tasks, stringr functions are fast enough that performance isn’t a concern. However, if you’re processing millions of strings, it’s worth knowing your options.

Base R provides trimws() for trimming, which has similar performance to str_trim():

# Base R equivalent
trimws("  hello  ", which = "both")

# For padding, base R uses sprintf or formatC
sprintf("%05d", 42)  # "00042"
formatC(42, width = 5, flag = "0")  # "00042"

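The stringr functions are built on the stringi package, which uses the ICU library for robust Unicode handling. This matters when your data contains non-ASCII characters. By default, base R’s trimws() strips only spaces, tabs, carriage returns, and newlines (its whitespace argument can widen this), whereas str_trim() also removes Unicode whitespace such as the non-breaking space (U+00A0) that often sneaks in from web pages and spreadsheets.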
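
A minimal sketch of the difference, using a made-up ID wrapped in non-breaking spaces (U+00A0). It assumes trimws() is called with its default whitespace argument:

```r
library(stringr)

# An ID surrounded by non-breaking spaces, which look like ordinary spaces
nbsp_id <- "\u00a0A001\u00a0"

# Default trimws() leaves the non-breaking spaces in place
base_result <- trimws(nbsp_id)
base_result == "A001"
# [1] FALSE

# str_trim() treats U+00A0 as whitespace and removes it
stringr_result <- str_trim(nbsp_id)
stringr_result == "A001"
# [1] TRUE
```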

For large-scale operations, both approaches are vectorized and perform well. The readability advantage of stringr usually outweighs any minor performance differences:

# Both handle vectors efficiently
# (timings are illustrative and will vary by machine and package version)
large_vector <- rep("  test  ", 1000000)
system.time(str_trim(large_vector))       # ~0.15 seconds
system.time(trimws(large_vector))         # ~0.12 seconds

Summary and Quick Reference

Function      Purpose                                    Key Parameters
str_trim()    Remove whitespace from ends                side: “left”, “right”, “both”
str_squish()  Trim ends + collapse internal whitespace   None
str_pad()     Add padding to reach target width          width, side, pad

Use str_trim() as a defensive measure on all imported character data. Use str_squish() when internal whitespace is noise rather than signal. Use str_pad() when you need consistent string lengths for formatting or system requirements.

These functions pair naturally with other stringr operations like str_to_lower(), str_replace(), and str_extract(). Master them, and you’ll handle a large share of routine string cleaning tasks with clean, readable code.
