Scala - Regular Expressions in Scala


Key Insights

• Scala provides scala.util.matching.Regex class with pattern matching integration, making regex operations more idiomatic than Java’s verbose approach
• The .r method converts strings to Regex objects, while triple-quoted strings eliminate escape character headaches when writing complex patterns
• Pattern matching with regex extractors enables declarative data parsing that’s both type-safe and readable compared to imperative extraction methods

Creating Regular Expressions

Scala offers multiple ways to create regular expressions. The most common approach uses the .r method on strings:

val pattern = "\\d{3}-\\d{4}".r
val simplePattern = """\d{3}-\d{4}""".r // triple quotes avoid escaping

Triple-quoted strings ("""...""") are particularly useful for regex patterns because they treat backslashes literally, eliminating the need for double-escaping:

// Without triple quotes - needs double escaping
val emailPattern = "\\w+@\\w+\\.\\w+".r

// With triple quotes - cleaner syntax
val betterEmailPattern = """\w+@\w+\.\w+""".r

// Complex pattern example
val urlPattern = """https?://[\w\-\.]+/[\w\-\./?%&=]*""".r

You can also use the Regex constructor directly with optional group names:

import scala.util.matching.Regex

val datePattern = new Regex(
  """(\d{4})-(\d{2})-(\d{2})""",
  "year", "month", "day"
)
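As an alternative, group names can also be written inside the pattern itself with Java's (?<name>...) syntax, which keeps each name next to the group it labels. The following is a minimal sketch, assuming Scala 2.12 or later; the names isoDate and sample are illustrative and do not appear in the examples above.

```scala
import scala.util.matching.Regex

// Inline named groups: each name sits next to the group it labels
val isoDate: Regex = """(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})""".r

val sample = "2024-01-15"

// group(name) resolves the name against the compiled pattern
val year: Option[String] = isoDate.findFirstMatchIn(sample).map(_.group("year"))
val day: Option[String] = isoDate.findFirstMatchIn(sample).map(_.group("day"))

println(year) // Some(2024)
println(day)  // Some(15)
```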

Finding Matches

The findFirstIn and findAllIn methods search for pattern matches within strings:

val phonePattern = """\d{3}-\d{4}""".r
val text = "Call me at 555-1234 or 555-5678"

// Find first match
val firstMatch: Option[String] = phonePattern.findFirstIn(text)
println(firstMatch) // Some(555-1234)

// Find all matches
val allMatches: Iterator[String] = phonePattern.findAllIn(text)
allMatches.foreach(println)
// Output:
// 555-1234
// 555-5678
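Because findAllIn returns an Iterator, the matches can be traversed only once. If they are needed more than once, one option is to copy them into a List first. A small sketch reusing the phone pattern from above (the name matchList is illustrative):

```scala
val phonePattern = """\d{3}-\d{4}""".r
val text = "Call me at 555-1234 or 555-5678"

// Materialize the single-use iterator into a reusable List
val matchList: List[String] = phonePattern.findAllIn(text).toList

println(matchList)      // List(555-1234, 555-5678)
println(matchList.size) // 2
```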

For match location information, use findFirstMatchIn and findAllMatchIn:

val pattern = """\b\w{4}\b""".r
val sentence = "This code runs fast"

pattern.findFirstMatchIn(sentence).foreach { m =>
  println(s"Match: ${m.matched}")
  println(s"Start: ${m.start}")
  println(s"End: ${m.end}")
}
// Output:
// Match: This
// Start: 0
// End: 4
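The same location details are available for every match through findAllMatchIn. A short sketch using the sentence from above, collecting each four-letter word with its start and end offsets (the name positions is illustrative):

```scala
val pattern = """\b\w{4}\b""".r
val sentence = "This code runs fast"

// Collect (matched text, start, end) for every four-letter word
val positions: List[(String, Int, Int)] =
  pattern.findAllMatchIn(sentence).map(m => (m.matched, m.start, m.end)).toList

positions.foreach { case (word, start, end) =>
  println(s"$word: $start-$end")
}
// Output:
// This: 0-4
// code: 5-9
// runs: 10-14
// fast: 15-19
```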

Pattern Matching with Regex

Scala’s pattern matching integrates seamlessly with regex, enabling elegant extraction:

val datePattern = """(\d{4})-(\d{2})-(\d{2})""".r

val input = "2024-01-15"

input match {
  case datePattern(year, month, day) =>
    println(s"Year: $year, Month: $month, Day: $day")
  case _ =>
    println("Invalid date format")
}
// Output: Year: 2024, Month: 01, Day: 15

This works because Scala’s Regex class implements an unapplySeq method, which pattern matching calls to bind each capture group to a variable. The same technique scales to multi-field data parsing:

val logPattern = """(\w+)\s+(\d{4}-\d{2}-\d{2})\s+(.+)""".r

val logEntries = List(
  "ERROR 2024-01-15 Database connection failed",
  "INFO 2024-01-15 Application started",
  "WARN 2024-01-16 High memory usage"
)

logEntries.foreach {
  case logPattern(level, date, message) =>
    println(s"[$level] $date: $message")
  case _ =>
    println("Invalid log format")
}
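One behavior to keep in mind: a regex used as an extractor succeeds only when the pattern matches the entire input. To match a pattern that appears somewhere inside a longer string, the unanchored method is one option. A minimal sketch; the helper names yearOf and yearIn are hypothetical.

```scala
val datePattern = """(\d{4})-(\d{2})-(\d{2})""".r

// Succeeds only when the whole input is a date
def yearOf(input: String): Option[String] = input match {
  case datePattern(year, _, _) => Some(year)
  case _ => None
}

// unanchored lets the extractor match anywhere in the input
val anywhere = datePattern.unanchored

def yearIn(input: String): Option[String] = input match {
  case anywhere(year, _, _) => Some(year)
  case _ => None
}

println(yearOf("2024-01-15"))             // Some(2024)
println(yearOf("Released on 2024-01-15")) // None
println(yearIn("Released on 2024-01-15")) // Some(2024)
```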

Extracting Groups

Named groups make extraction more readable. When you define a regex with the Regex constructor and provide group names, you can access them directly:

val emailPattern = new Regex(
  """([\w\.]+)@([\w\.]+)\.(\w+)""",
  "user", "domain", "tld"
)

val email = "john.doe@example.com"

emailPattern.findFirstMatchIn(email).foreach { m =>
  println(s"User: ${m.group("user")}")
  println(s"Domain: ${m.group("domain")}")
  println(s"TLD: ${m.group("tld")}")
}
// Output:
// User: john.doe
// Domain: example
// TLD: com

For unnamed groups, use numeric indices:

val phonePattern = """(\d{3})-(\d{4})""".r

phonePattern.findFirstMatchIn("555-1234").foreach { m =>
  println(s"Prefix: ${m.group(1)}")
  println(s"Line number: ${m.group(2)}")
}
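One detail worth knowing when accessing groups by index: a group that did not take part in the match is returned as null. A hedged sketch using a hypothetical pattern with an optional area code, wrapping the result in Option so callers never see the null (optionalArea and areaCode are illustrative names):

```scala
// Hypothetical pattern: optional three-digit area code, then the number
val optionalArea = """(?:(\d{3})-)?(\d{3})-(\d{4})""".r

// Option(...) turns a null group into None
def areaCode(input: String): Option[String] =
  optionalArea.findFirstMatchIn(input).flatMap(m => Option(m.group(1)))

println(areaCode("212-555-1234")) // Some(212)
println(areaCode("555-1234"))     // None
```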

Replacing Text

The replaceAllIn and replaceFirstIn methods perform substitutions:

val pattern = """\d+""".r
val text = "I have 3 cats and 2 dogs"

val result = pattern.replaceAllIn(text, "many")
println(result) // I have many cats and many dogs

val firstOnly = pattern.replaceFirstIn(text, "several")
println(firstOnly) // I have several cats and 2 dogs

Use replacement functions for dynamic substitutions:

val numberPattern = """\d+""".r
val text = "10 + 20 + 30"

val doubled = numberPattern.replaceAllIn(text, m => 
  (m.matched.toInt * 2).toString
)
println(doubled) // 20 + 40 + 60

Access captured groups in replacements:

val namePattern = """(\w+)\s+(\w+)""".r
val names = "John Doe, Jane Smith"

val swapped = namePattern.replaceAllIn(names, m =>
  s"${m.group(2)}, ${m.group(1)}"
)
println(swapped) // Doe, John, Smith, Jane
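Replacement strings are not inserted verbatim: a dollar sign followed by a number refers to a captured group, and backslashes act as escapes. This applies to the string returned by a replacement function as well. The sketch below shows both sides, using Regex.quoteReplacement when a literal dollar sign is wanted; pricePattern and prices are illustrative names.

```scala
import scala.util.matching.Regex

val pricePattern = """(\d+) USD""".r
val prices = "10 USD, 25 USD"

// "$1" in a replacement string refers to the first captured group
val reordered = pricePattern.replaceAllIn(prices, "USD $1")
println(reordered) // USD 10, USD 25

// quoteReplacement escapes "$" so it is inserted literally
val withSymbol = pricePattern.replaceAllIn(prices, m =>
  Regex.quoteReplacement("$" + m.group(1))
)
println(withSymbol) // $10, $25
```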

Splitting Strings

Use regex patterns to split strings on complex delimiters:

val pattern = """[,;]\s*""".r
val text = "apple,banana; cherry,  date;elderberry"

val fruits = pattern.split(text)
fruits.foreach(println)
// Output:
// apple
// banana
// cherry
// date
// elderberry

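Regex.split does not take a limit parameter. To cap the number of pieces, call split on the underlying java.util.regex.Pattern, which Regex exposes as pattern:

val pattern = """\s+""".r
val text = "one two three four five"

// pattern.pattern is the compiled java.util.regex.Pattern
val limited = pattern.pattern.split(text, 3)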
limited.foreach(println)
// Output:
// one
// two
// three four five

Anchors and Boundaries

Use anchors to match entire strings or specific positions:

val exactPattern = """^\d{3}-\d{4}$""".r

exactPattern.findFirstIn("555-1234") // Some(555-1234)
exactPattern.findFirstIn("Call 555-1234") // None

// Word boundaries
val wordPattern = """\bcat\b""".r
println(wordPattern.findFirstIn("cat")) // Some(cat)
println(wordPattern.findFirstIn("category")) // None
println(wordPattern.findFirstIn("the cat sat")) // Some(cat)
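By default ^ and $ anchor to the start and end of the whole input. When the input spans several lines, the (?m) flag makes them match at line boundaries instead. A brief sketch with made-up input; linePattern and lines are illustrative names.

```scala
// (?m) makes ^ and $ match at the start and end of each line
val linePattern = """(?m)^\d{3}-\d{4}$""".r

val lines = "555-1234\nCall 555-5678\n555-9999"

// Only lines consisting entirely of a phone number match
val wholeLines = linePattern.findAllIn(lines).toList
println(wholeLines) // List(555-1234, 555-9999)
```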

Practical Example: Log Parser

Here’s a complete example parsing structured log files:

import scala.util.matching.Regex
import scala.io.Source

case class LogEntry(
  timestamp: String,
  level: String,
  thread: String,
  message: String
)

object LogParser {
  val logPattern: Regex = new Regex(
    """(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[(\w+)\]\s+\[([^\]]+)\]\s+(.+)""",
    "timestamp", "level", "thread", "message"
  )

  def parseLine(line: String): Option[LogEntry] = line match {
    case logPattern(ts, level, thread, msg) =>
      Some(LogEntry(ts, level, thread, msg))
    case _ => None
  }

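  def parseFile(filename: String): List[LogEntry] = {
    val source = Source.fromFile(filename)
    // toList reads every line before the file handle is closed
    try source.getLines().flatMap(parseLine).toList
    finally source.close()
  }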

  def filterByLevel(entries: List[LogEntry], level: String): List[LogEntry] =
    entries.filter(_.level == level)
}

// Usage
val logs = List(
  "2024-01-15 10:30:45 [ERROR] [main-thread] Connection timeout",
  "2024-01-15 10:30:46 [INFO] [worker-1] Processing request",
  "2024-01-15 10:30:47 [ERROR] [worker-2] Invalid input"
)

val parsed = logs.flatMap(LogParser.parseLine)
val errors = LogParser.filterByLevel(parsed, "ERROR")
errors.foreach(e => println(s"${e.timestamp}: ${e.message}"))

This approach combines regex pattern matching with case classes for type-safe log processing, demonstrating how Scala’s features create maintainable parsing code.
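As a possible extension, the parsed entries can be summarized with ordinary collection methods. The sketch below is self-contained and assumes the same log format as above; it uses collect with the regex extractor so malformed lines are skipped, then counts entries per level. The names rawLines and countsByLevel are illustrative.

```scala
case class LogEntry(timestamp: String, level: String, thread: String, message: String)

// Same pattern as LogParser.logPattern, written with .r
val logPattern =
  """(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[(\w+)\]\s+\[([^\]]+)\]\s+(.+)""".r

val rawLines = List(
  "2024-01-15 10:30:45 [ERROR] [main-thread] Connection timeout",
  "not a log line",
  "2024-01-15 10:30:46 [INFO] [worker-1] Processing request",
  "2024-01-15 10:30:47 [ERROR] [worker-2] Invalid input"
)

// collect keeps only the lines the extractor accepts
val entries: List[LogEntry] = rawLines.collect {
  case logPattern(ts, level, thread, msg) => LogEntry(ts, level, thread, msg)
}

// Count entries per level
val countsByLevel: Map[String, Int] =
  entries.groupBy(_.level).map { case (level, group) => level -> group.size }

println(entries.size)                       // 3
println(countsByLevel("ERROR"))             // 2
println(countsByLevel.getOrElse("WARN", 0)) // 0
```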
