Regex (Regular Expressions) in R

Key Insights

  • R supports both POSIX and Perl-compatible regular expressions, with perl = TRUE offering better performance and advanced features like lookarounds
  • The stringr package provides a consistent, readable API that wraps base R functions, making regex operations more predictable in data pipelines
  • Double backslash escaping (\\d instead of \d) is the most common source of regex errors in R—use raw strings (r"(...)") in R 4.0+ to avoid this pain

Introduction to Regular Expressions in R

Regular expressions are the Swiss Army knife of text processing. Whether you’re cleaning survey responses, parsing log files, or extracting features from unstructured text, regex skills will save you hours of manual work.

R provides robust regex support through both base functions and the tidyverse ecosystem. You get two regex flavors: POSIX extended regular expressions (the default) and Perl-compatible regular expressions (PCRE), enabled with perl = TRUE. For most work, I recommend defaulting to Perl mode—it’s faster and supports features like lookarounds and non-greedy quantifiers that you’ll eventually need.
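
For instance, a lookbehind is rejected by the default engine but works once you switch on Perl mode:

```r
# Lookbehind: match "bar" only when immediately preceded by "foo"
grepl("(?<=foo)bar", c("foobar", "bazbar"), perl = TRUE)
# [1]  TRUE FALSE

# The same call without perl = TRUE fails with an "invalid regexp" error
```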

Let’s get practical.

Core String Functions for Pattern Matching

Base R gives you four essential functions for finding patterns:

  • grep() returns indices or values of matching elements
  • grepl() returns a logical vector (TRUE/FALSE for each element)
  • regexpr() returns the position of the first match
  • gregexpr() returns positions of all matches

Here’s when to use each:

emails <- c(
  "contact@company.com",
  "invalid-email",
  "user.name+tag@domain.org",
  "another@test.co.uk",
  "not an email at all"
)

# Pattern for basic email validation
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# grep() returns indices by default
grep(email_pattern, emails)
# [1] 1 3 4

# grep() with value = TRUE returns the actual matches
grep(email_pattern, emails, value = TRUE)
# [1] "contact@company.com" "user.name+tag@domain.org" "another@test.co.uk"

# grepl() returns logical vector - perfect for subsetting data frames
grepl(email_pattern, emails)
# [1] TRUE FALSE TRUE TRUE FALSE

# Use grepl() for filtering
valid_emails <- emails[grepl(email_pattern, emails)]

Use grep() when you need indices or want to extract matching values directly. Use grepl() when filtering data frames or creating logical conditions. The logical output of grepl() integrates cleanly with dplyr::filter() and base R subsetting.
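
As a sketch, the same idea applied to a small (hypothetical) data frame:

```r
# Hypothetical free-text responses
responses <- data.frame(
  id   = 1:4,
  text = c("Great product!", "reach me at a@b.com", "No comment", "see c@d.org"),
  stringsAsFactors = FALSE
)

# grepl() gives a logical vector, so it slots straight into row subsetting
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
responses[grepl(email_pattern, responses$text), ]
# rows 2 and 4 remain
```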

Search and Replace with sub() and gsub()

sub() replaces the first match. gsub() replaces all matches. This distinction matters more than you’d think.
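
A quick illustration:

```r
x <- "one, two, three"

sub(",", ";", x)   # replaces only the first comma
# [1] "one; two, three"

gsub(",", ";", x)  # replaces every comma
# [1] "one; two; three"
```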

The real power comes from capture groups and backreferences. Wrap parts of your pattern in parentheses to capture them, then reference them with \\1, \\2, etc.

# Messy phone number data
phones <- c(
  "(555) 123-4567",
  "555.123.4567",
  "555-123-4567",
  "5551234567",
  "+1 555 123 4567"
)

# Extract just the digits, then format consistently
# First, remove all non-digits
digits_only <- gsub("[^0-9]", "", phones)
digits_only
# [1] "5551234567" "5551234567" "5551234567" "5551234567" "15551234567"

# Now format with capture groups (handling 10 or 11 digit numbers)
standardized <- gsub(
  "^1?(\\d{3})(\\d{3})(\\d{4})$",
  "(\\1) \\2-\\3",
  digits_only
)
standardized
# [1] "(555) 123-4567" "(555) 123-4567" "(555) 123-4567" "(555) 123-4567"
# [5] "(555) 123-4567"

This two-step approach—strip unwanted characters, then reformat—is cleaner than trying to handle every input variation in a single pattern.

The stringr Package: A Tidyverse Approach

The stringr package wraps base R’s regex functions with a consistent interface. Every function starts with str_, takes the string as the first argument (pipe-friendly), and uses consistent naming conventions.

library(stringr)

tweets <- c(
  "Loving the new #RStats features! #DataScience is amazing",
  "No hashtags here",
  "#MachineLearning #AI #DeepLearning are trending",
  "Check out #tidyverse for #rstats work"
)

# str_detect() is grepl()'s tidyverse equivalent
str_detect(tweets, "#\\w+")
# [1] TRUE FALSE TRUE TRUE

# str_extract() gets the first match
str_extract(tweets, "#\\w+")
# [1] "#RStats" NA "#MachineLearning" "#tidyverse"

# str_extract_all() gets ALL matches - returns a list
str_extract_all(tweets, "#\\w+")
# [[1]]
# [1] "#RStats" "#DataScience"
# [[2]]
# character(0)
# [[3]]
# [1] "#MachineLearning" "#AI" "#DeepLearning"
# [[4]]
# [1] "#tidyverse" "#rstats"

# Flatten to a single vector of all hashtags
all_hashtags <- unlist(str_extract_all(tweets, "#\\w+"))
# Convert to lowercase and count
table(tolower(all_hashtags))

str_match() deserves special attention: it returns a matrix with the full match in the first column and each capture group in its own column:

log_entries <- c(
  "2024-01-15 ERROR: Connection timeout",
  "2024-01-15 INFO: Process started",
  "2024-01-16 WARN: Memory usage high"
)

# Extract date, level, and message separately
str_match(log_entries, "(\\d{4}-\\d{2}-\\d{2}) (\\w+): (.+)")
#      [,1]                                   [,2]         [,3]    [,4]
# [1,] "2024-01-15 ERROR: Connection timeout" "2024-01-15" "ERROR" "Connection timeout"
# [2,] "2024-01-15 INFO: Process started"     "2024-01-15" "INFO"  "Process started"
# [3,] "2024-01-16 WARN: Memory usage high"   "2024-01-16" "WARN"  "Memory usage high"

Essential Regex Syntax and Patterns

Here’s the syntax you’ll use constantly:

Metacharacters:

  • . matches any single character (except newline)
  • ^ anchors to start, $ anchors to end
  • | for alternation (OR)
  • () for grouping and capture

Quantifiers:

  • * zero or more, + one or more, ? zero or one
  • {n} exactly n, {n,} n or more, {n,m} between n and m
  • Add ? after quantifier for non-greedy: *?, +?

Character Classes:

  • [abc] matches a, b, or c
  • [^abc] matches anything except a, b, c
  • [a-z] range
  • \\d digit, \\w word character, \\s whitespace
  • \\D, \\W, \\S are negations

Lookarounds (require perl = TRUE in base R; supported by default in stringr):

  • (?=...) positive lookahead
  • (?!...) negative lookahead
  • (?<=...) positive lookbehind
  • (?<!...) negative lookbehind
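
For example, a lookbehind pulls out the amount after a currency symbol without including the symbol itself (illustrative strings):

```r
prices <- c("cost: $19.99", "about $5", "no price here")

# Lookbehind (?<=\$): match digits only when preceded by a dollar sign
regmatches(prices, regexpr("(?<=\\$)\\d+(\\.\\d+)?", prices, perl = TRUE))
# [1] "19.99" "5"
```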

# Validate URLs with a practical (not perfect) pattern
urls <- c(
  "https://example.com/path",
  "http://sub.domain.org",
  "ftp://files.server.net",
  "not-a-url",
  "https://api.service.io/v2/users?id=123"
)

url_pattern <- "^https?://[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}(/[^\\s]*)?$"
grepl(url_pattern, urls, perl = TRUE)
# [1] TRUE TRUE FALSE FALSE TRUE

Practical Applications and Performance Tips

Real-world regex work often involves parsing semi-structured data. Here’s a common scenario: extracting key-value pairs from configuration or log data.

config_text <- "
server=production.example.com
port=8080
timeout=30
debug=false
api_key=sk-abc123xyz
"

# Extract all key=value pairs
lines <- unlist(strsplit(config_text, "\n"))
lines <- lines[nzchar(lines)]  # Remove empty lines

# Parse into a named list
config <- setNames(
  str_extract(lines, "(?<==).+"),  # Value: everything after =
  str_extract(lines, "^[^=]+")     # Key: everything before =
)
config
#                  server                    port                 timeout 
# "production.example.com"                  "8080"                    "30" 
#                   debug                 api_key 
#                 "false"           "sk-abc123xyz"

Performance tips:

  1. Always use perl = TRUE for complex patterns—it’s significantly faster
  2. Wrap reused patterns in stringr::regex() so options like ignore_case are set once, in one place
  3. Pre-filter with simple patterns before applying complex ones
  4. Use fixed = TRUE or stringr::fixed() for literal string matching—no regex overhead

# Fast literal matching
grepl("error", log_data, fixed = TRUE)  # Much faster than regex

# Pattern object with explicit options, reusable across calls
pattern <- regex("\\d{4}-\\d{2}-\\d{2}", ignore_case = FALSE)
str_detect(large_vector, pattern)
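
Tip 3 in action, on an illustrative vector: a cheap fixed-string pass narrows the input before the regex runs.

```r
log_lines <- c(
  "ERROR 2024-01-15 disk full",
  "INFO all good",
  "ERROR 2024-02-01 out of memory"
)

# Cheap literal pre-filter, then the expensive pattern on the survivors only
errors <- log_lines[grepl("ERROR", log_lines, fixed = TRUE)]
regmatches(errors, regexpr("\\d{4}-\\d{2}-\\d{2}", errors))
# [1] "2024-01-15" "2024-02-01"
```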

Common Pitfalls and Debugging

The number one regex mistake in R: forgetting to double-escape backslashes.

# WRONG - R's string parser rejects \d as an unrecognized escape
pattern <- "\d+"
# Error: '\d' is an unrecognized escape in character string

# RIGHT - escape the backslash for R's string parser
pattern <- "\\d+"

# ALSO RIGHT (R 4.0+) - raw strings avoid escaping entirely
pattern <- r"(\d+)"

The raw string syntax r"(...)" is a game-changer. Use it.

Greedy vs. lazy matching:

html <- "<div>First</div><div>Second</div>"

# Greedy: matches as much as possible
str_extract(html, "<div>.*</div>")
# [1] "<div>First</div><div>Second</div>"

# Lazy: matches as little as possible
str_extract(html, "<div>.*?</div>")
# [1] "<div>First</div>"

Debugging strategies:

  1. Build patterns incrementally, testing each addition
  2. Use str_view() from stringr to visualize matches
  3. Test with regmatches(x, regexpr(pattern, x)) to see exactly what matched
  4. Use online tools like regex101.com (set to PCRE mode for R compatibility)

# str_view() highlights matches in the console
str_view(emails, "[a-zA-Z0-9._%+-]+@")
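
Strategy 3 looks like this (the emails vector is redefined here so the snippet stands alone):

```r
emails <- c("contact@company.com", "invalid-email", "user.name+tag@domain.org")

# regmatches() returns exactly the substring each match covered
m <- regexpr("[a-zA-Z0-9._%+-]+@", emails)
regmatches(emails, m)
# [1] "contact@"       "user.name+tag@"
```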

Regular expressions reward investment. Master the fundamentals covered here, and you’ll handle 90% of text processing tasks without breaking a sweat. The remaining 10%? That’s when you reach for dedicated parsing libraries—but that’s another article.
