Regular Expressions in Scala
Key Insights
• Scala provides scala.util.matching.Regex class with pattern matching integration, making regex operations more idiomatic than Java’s verbose approach
• The .r method converts strings to Regex objects, while triple-quoted strings eliminate escape character headaches when writing complex patterns
• Pattern matching with regex extractors enables declarative data parsing that is more readable than imperative extraction; captured groups are bound as Strings, so any further conversion (for example to Int) remains an explicit step
Creating Regular Expressions
Scala offers multiple ways to create regular expressions. The most common approach uses the .r method on strings:
val pattern = "\\d{3}-\\d{4}".r
val simplePattern = """\d{3}-\d{4}""".r // triple quotes avoid escaping
Triple-quoted strings ("""...""") are particularly useful for regex patterns because they treat backslashes literally, eliminating the need for double-escaping:
// Without triple quotes - needs double escaping
val emailPattern = "\\w+@\\w+\\.\\w+".r
// With triple quotes - cleaner syntax
val betterEmailPattern = """\w+@\w+\.\w+""".r
// Complex pattern example
val urlPattern = """https?://[\w\-\.]+/[\w\-\./?%&=]*""".r
You can also use the Regex constructor directly with optional group names:
import scala.util.matching.Regex
val datePattern = new Regex(
  """(\d{4})-(\d{2})-(\d{2})""",
  "year", "month", "day"
)
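Group names can also be declared inside the pattern itself using Java's `(?<name>...)` syntax, which keeps each name next to the group it labels. The following is a minimal sketch, assuming a reasonably recent Scala version (2.12 or later); the names `isoDate` and `year` are illustrative only.

```scala
import scala.util.matching.Regex

// Group names are declared inline with (?<name>...)
val isoDate: Regex = """(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})""".r

// Match.group(name) resolves the inline name
val year: Option[String] =
  isoDate.findFirstMatchIn("Released on 2024-01-15").map(_.group("year"))
// year == Some("2024")
```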
Finding Matches
The findFirstIn and findAllIn methods search for pattern matches within strings:
val phonePattern = """\d{3}-\d{4}""".r
val text = "Call me at 555-1234 or 555-5678"
// Find first match
val firstMatch: Option[String] = phonePattern.findFirstIn(text)
println(firstMatch) // Some(555-1234)
// Find all matches
val allMatches: Iterator[String] = phonePattern.findAllIn(text)
allMatches.foreach(println)
// Output:
// 555-1234
// 555-5678
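Because findAllIn returns an iterator, the results can be traversed only once. When the matches are needed more than once, one option is to materialize them into a List first. A small sketch (the names `numbers`, `count`, and `last` are illustrative):

```scala
val phonePattern = """\d{3}-\d{4}""".r
val text = "Call me at 555-1234 or 555-5678"

// Materialize once so the matches can be reused freely
val numbers: List[String] = phonePattern.findAllIn(text).toList

val count = numbers.size        // 2
val last = numbers.lastOption   // Some("555-5678")
```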
For match location information, use findFirstMatchIn and findAllMatchIn:
val pattern = """\b\w{4}\b""".r
val sentence = "This code runs fast"
pattern.findFirstMatchIn(sentence).foreach { m =>
  println(s"Match: ${m.matched}")
  println(s"Start: ${m.start}")
  println(s"End: ${m.end}")
}
// Output:
// Match: This
// Start: 0
// End: 4
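The same Match objects are available for every occurrence through findAllMatchIn. As a sketch, the following pairs each four-letter word with its start offset (the names `fourLetter` and `positions` are illustrative):

```scala
val fourLetter = """\b\w{4}\b""".r
val sentence = "This code runs fast"

// Pair each matched word with the index where it starts
val positions: List[(String, Int)] =
  fourLetter.findAllMatchIn(sentence).map(m => (m.matched, m.start)).toList
// List((This,0), (code,5), (runs,10), (fast,15))
```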
Pattern Matching with Regex
Scala’s pattern matching integrates seamlessly with regex, enabling elegant extraction:
val datePattern = """(\d{4})-(\d{2})-(\d{2})""".r
val input = "2024-01-15"
input match {
  case datePattern(year, month, day) =>
    println(s"Year: $year, Month: $month, Day: $day")
  case _ =>
    println("Invalid date format")
}
// Output: Year: 2024, Month: 01, Day: 15
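This works because Scala’s Regex class implements an unapplySeq method, which binds each capturing group to a variable. The extractor succeeds only when the pattern matches the entire input string, not just a substring. You can use this for complex data parsing: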
val logPattern = """(\w+)\s+(\d{4}-\d{2}-\d{2})\s+(.+)""".r
val logEntries = List(
  "ERROR 2024-01-15 Database connection failed",
  "INFO 2024-01-15 Application started",
  "WARN 2024-01-16 High memory usage"
)
logEntries.foreach {
  case logPattern(level, date, message) =>
    println(s"[$level] $date: $message")
  case _ =>
    println("Invalid log format")
}
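A regex used as an extractor has to match the whole input. To extract from text that merely contains a match, the `unanchored` method on Regex returns a variant that accepts a match anywhere in the string. A minimal sketch (the helper names `describe` and `describeLoose` are illustrative):

```scala
val datePattern = """(\d{4})-(\d{2})-(\d{2})""".r

// Default extractor: the whole string must be a date
def describe(s: String): String = s match {
  case datePattern(y, m, d) => s"exact: $y/$m/$d"
  case _                    => "no match"
}

// Unanchored extractor: a date anywhere in the string is enough
val embedded = datePattern.unanchored
def describeLoose(s: String): String = s match {
  case embedded(y, m, d) => s"found: $y/$m/$d"
  case _                 => "no match"
}

// describe("2024-01-15")           == "exact: 2024/01/15"
// describe("Due 2024-01-15.")      == "no match"
// describeLoose("Due 2024-01-15.") == "found: 2024/01/15"
```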
Extracting Groups
Named groups make extraction more readable. When you define a regex with the Regex constructor and provide group names, you can look up each group by name on the resulting Match:
val emailPattern = new Regex(
  """([\w\.]+)@([\w\.]+)\.(\w+)""",
  "user", "domain", "tld"
)
val email = "john.doe@example.com"
emailPattern.findFirstMatchIn(email).foreach { m =>
  println(s"User: ${m.group("user")}")
  println(s"Domain: ${m.group("domain")}")
  println(s"TLD: ${m.group("tld")}")
}
// Output:
// User: john.doe
// Domain: example
// TLD: com
For unnamed groups, use numeric indices:
val phonePattern = """(\d{3})-(\d{4})""".r
phonePattern.findFirstMatchIn("555-1234").foreach { m =>
  // Group 0 is the whole match; capturing groups start at 1
  println(s"Prefix: ${m.group(1)}")
  println(s"Line number: ${m.group(2)}")
}
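One detail worth knowing when a group is optional: a group that did not participate in the match comes back as null, both from `group` and from the extractor. Wrapping it in `Option` is one way to handle that. The sketch below assumes a hypothetical phone format with an optional ` x<digits>` extension; the pattern and the `extension` helper are illustrative only.

```scala
// Third group is optional: "555-1234" or "555-1234 x42"
val phone = """(\d{3})-(\d{4})(?: x(\d+))?""".r

def extension(s: String): Option[String] = s match {
  // Option(...) turns the null of an unmatched group into None
  case phone(_, _, ext) => Option(ext)
  case _                => None
}

// extension("555-1234 x42") == Some("42")
// extension("555-1234")     == None
```

Note that this helper returns None both for a number without an extension and for input that is not a phone number at all; a real parser would probably want to distinguish the two.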
Replacing Text
The replaceAllIn and replaceFirstIn methods perform substitutions:
val pattern = """\d+""".r
val text = "I have 3 cats and 2 dogs"
val result = pattern.replaceAllIn(text, "many")
println(result) // I have many cats and many dogs
val firstOnly = pattern.replaceFirstIn(text, "several")
println(firstOnly) // I have several cats and 2 dogs
Use replacement functions for dynamic substitutions:
val numberPattern = """\d+""".r
val text = "10 + 20 + 30"
val doubled = numberPattern.replaceAllIn(text, m =>
  (m.matched.toInt * 2).toString
)
println(doubled) // 20 + 40 + 60
Access captured groups in replacements:
val namePattern = """(\w+)\s+(\w+)""".r
val names = "John Doe, Jane Smith"
val swapped = namePattern.replaceAllIn(names, m =>
  s"${m.group(2)}, ${m.group(1)}"
)
println(swapped) // Doe, John, Smith, Jane
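Replacement text follows Java's replacement syntax, in which `$1`, `$2`, and so on refer to captured groups and `\` is an escape character. That allows a compact string form of the swap above, but it also means a literal `$` in the replacement has to be escaped, for example with `Regex.quoteReplacement`. A short sketch (the price example and its names are illustrative):

```scala
import scala.util.matching.Regex

// String form: $2 and $1 refer to the captured groups
val namePattern = """(\w+)\s+(\w+)""".r
val swappedOne = namePattern.replaceAllIn("John Doe", "$2, $1")
// swappedOne == "Doe, John"

// Literal '$' in the output: escape it so it is not read as a group reference
val pricePattern = """(\d+) USD""".r
val converted = pricePattern.replaceAllIn("Total: 25 USD", m =>
  Regex.quoteReplacement("$" + m.group(1))
)
// converted == "Total: $25"
```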
Splitting Strings
Use regex patterns to split strings on complex delimiters:
val pattern = """[,;]\s*""".r
val text = "apple,banana; cherry, date;elderberry"
val fruits = pattern.split(text)
fruits.foreach(println)
// Output:
// apple
// banana
// cherry
// date
// elderberry
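Regex.split itself takes only the input string. To limit the number of pieces, call split on the underlying java.util.regex.Pattern, which Regex exposes as pattern:
val pattern = """\s+""".r
val text = "one two three four five"
val limited = pattern.pattern.split(text, 3) // java.util.regex.Pattern.split(input, limit)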
limited.foreach(println)
// Output:
// one
// two
// three four five
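Regex.split delegates to Java's `Pattern.split`, so it inherits one behavior that can be surprising: trailing empty strings are dropped by default. Passing a negative limit to the underlying Pattern keeps them. A small sketch (the names `trimmed` and `kept` are illustrative):

```scala
val comma = ",".r
val row = "a,b,,"

// Default behavior: trailing empty fields are discarded
val trimmed = comma.split(row)             // Array(a, b)

// Negative limit on the Java Pattern: trailing empty fields are kept
val kept = comma.pattern.split(row, -1)    // Array(a, b, "", "")
```

This matters for CSV-like input, where the number of fields per row is usually expected to stay constant.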
Anchors and Boundaries
Use anchors to match entire strings or specific positions:
val exactPattern = """^\d{3}-\d{4}$""".r
exactPattern.findFirstIn("555-1234") // Some(555-1234)
exactPattern.findFirstIn("Call 555-1234") // None
// Word boundaries
val wordPattern = """\bcat\b""".r
println(wordPattern.findFirstIn("cat")) // Some(cat)
println(wordPattern.findFirstIn("category")) // None
println(wordPattern.findFirstIn("the cat sat")) // Some(cat)
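By default `^` and `$` anchor to the start and end of the whole input. With the inline `(?m)` (multiline) flag they anchor to the start and end of each line instead, which is handy for scanning multi-line text. A minimal sketch (the sample text and names are illustrative):

```scala
val block = "555-1234\nnot a number\n555-5678"

// Without (?m): the anchors refer to the whole input, so nothing matches
val wholeOnly = """^\d{3}-\d{4}$""".r.findAllIn(block).toList
// List()

// With (?m): the anchors apply per line
val perLine = """(?m)^\d{3}-\d{4}$""".r.findAllIn(block).toList
// List(555-1234, 555-5678)
```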
Practical Example: Log Parser
Here’s a complete example parsing structured log files:
import scala.util.matching.Regex
import scala.io.Source
case class LogEntry(
  timestamp: String,
  level: String,
  thread: String,
  message: String
)
object LogParser {
  // Groups: timestamp, [level], [thread], free-form message
  val logPattern: Regex = new Regex(
    """(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[(\w+)\]\s+\[([^\]]+)\]\s+(.+)""",
    "timestamp", "level", "thread", "message"
  )
  // Returns None for lines that do not match the whole pattern
  def parseLine(line: String): Option[LogEntry] = line match {
    case logPattern(ts, level, thread, msg) =>
      Some(LogEntry(ts, level, thread, msg))
    case _ => None
  }
  def parseFile(filename: String): List[LogEntry] = {
    val source = Source.fromFile(filename)
    // getLines() is lazy, so build the list before closing the file
    try source.getLines().flatMap(parseLine).toList
    finally source.close()
  }
  def filterByLevel(entries: List[LogEntry], level: String): List[LogEntry] =
    entries.filter(_.level == level)
}
// Usage
val logs = List(
  "2024-01-15 10:30:45 [ERROR] [main-thread] Connection timeout",
  "2024-01-15 10:30:46 [INFO] [worker-1] Processing request",
  "2024-01-15 10:30:47 [ERROR] [worker-2] Invalid input"
)
val parsed = logs.flatMap(LogParser.parseLine)
val errors = LogParser.filterByLevel(parsed, "ERROR")
errors.foreach(e => println(s"${e.timestamp}: ${e.message}"))
This approach combines regex pattern matching with case classes: each line that matches the pattern becomes a typed LogEntry, while lines that do not match yield None and are dropped by flatMap instead of raising an error.
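Silently dropping malformed lines is not always desirable. One possible variation, sketched below under the assumption that rejected lines should be reported, returns an Either so the raw text of each unparsed line is preserved. The `parseLineEither` name and the sample "garbage line" input are hypothetical.

```scala
case class LogEntry(timestamp: String, level: String, thread: String, message: String)

val logPattern =
  """(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[(\w+)\]\s+\[([^\]]+)\]\s+(.+)""".r

// Left carries the raw line that failed to parse, Right the parsed entry
def parseLineEither(line: String): Either[String, LogEntry] = line match {
  case logPattern(ts, level, thread, msg) => Right(LogEntry(ts, level, thread, msg))
  case _                                  => Left(line)
}

val lines = List(
  "2024-01-15 10:30:45 [ERROR] [main-thread] Connection timeout",
  "garbage line"
)

val results = lines.map(parseLineEither)
val entries = results.collect { case Right(entry) => entry }
val rejected = results.collect { case Left(raw) => raw }
// entries.map(_.level) == List("ERROR")
// rejected             == List("garbage line")
```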