Scala - Read CSV File

Key Insights

  • Scala offers multiple approaches for CSV parsing: native Scala collections for simple cases, scala-csv for type-safe parsing, and Apache Commons CSV for complex enterprise scenarios
  • Memory-efficient streaming with Iterator-based processing prevents OutOfMemoryError when handling large CSV files that exceed available heap space
  • Case classes combined with CSV libraries enable automatic marshalling to domain objects with compile-time type safety and minimal boilerplate code

Reading CSV with Native Scala

For simple CSV files without complex quoting or escaping, Scala's standard library provides sufficient functionality. Use `scala.io.Source` to read files line by line and split on delimiters.

import scala.io.Source
import scala.util.{Try, Using}

def readSimpleCSV(filename: String): Try[List[Map[String, String]]] = {
  Using(Source.fromFile(filename)) { source =>
    val lines = source.getLines().toList
    val headers = lines.head.split(",").map(_.trim)
    
    lines.tail.map { line =>
      val values = line.split(",").map(_.trim)
      headers.zip(values).toMap
    }
  }
}

// Usage
val result = readSimpleCSV("data.csv")
result.foreach { records =>
  records.foreach(println)
}

This approach works for basic CSV files but fails with quoted fields containing commas or newlines. The `Using` construct ensures proper resource cleanup even if exceptions occur.
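To see that limitation concretely, here is a minimal sketch (with hypothetical data) of how a naive comma split mishandles a quoted field:

```scala
// A quoted field containing a comma gets split in two by String.split.
object NaiveSplitPitfall extends App {
  val line = """1,"Smith, John",london"""
  val naive = line.split(",").map(_.trim)

  println(naive.mkString(" | ")) // prints: 1 | "Smith | John" | london
  // Expected 3 columns, but the naive split produces 4:
  assert(naive.length == 4)
}
```

This is why the quoted-field-aware libraries in the next sections are worth the dependency for anything beyond trivial input.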

Using scala-csv for Production Code

The scala-csv library provides robust CSV parsing with proper RFC 4180 compliance. Add the dependency to your build.sbt:

libraryDependencies += "com.github.tototoshi" %% "scala-csv" % "1.3.10"

Basic reading with automatic type conversion:

import com.github.tototoshi.csv._
import java.io.File

def readWithScalaCSV(filename: String): List[Map[String, String]] = {
  val reader = CSVReader.open(new File(filename))
  try reader.allWithHeaders()
  finally reader.close()
}

// With custom format
implicit val format: CSVFormat = new DefaultCSVFormat {
  override val delimiter = ';'
  override val quoteChar = '\''
}

def readCustomFormat(filename: String): List[List[String]] = {
  val reader = CSVReader.open(new File(filename))
  try reader.all()
  finally reader.close()
}

For streaming large files without loading everything into memory:

import com.github.tototoshi.csv._
import java.io.File

def streamCSV(filename: String)(process: Map[String, String] => Unit): Unit = {
  val reader = CSVReader.open(new File(filename))
  try {
    val headers = reader.readNext().get // throws NoSuchElementException if the file is empty
    reader.foreach { row =>
      val record = headers.zip(row).toMap
      process(record)
    }
  } finally {
    reader.close()
  }
}

// Usage - process one row at a time
streamCSV("large-file.csv") { record =>
  println(s"Processing: ${record("id")}")
  // Database insert, API call, etc.
}

Mapping CSV to Case Classes

Type-safe domain modeling eliminates runtime errors and provides IDE autocompletion. Parse CSV rows directly into case classes:

import com.github.tototoshi.csv._
import java.io.File
import scala.util.Try

case class Employee(
  id: Int,
  name: String,
  email: String,
  salary: Double,
  active: Boolean
)

object Employee {
  def fromCSVRow(row: Map[String, String]): Try[Employee] = Try {
    Employee(
      id = row("id").toInt,
      name = row("name"),
      email = row("email"),
      salary = row("salary").toDouble,
      active = row("active").toBoolean
    )
  }
}

def readEmployees(filename: String): List[Employee] = {
  val reader = CSVReader.open(new File(filename))
  try {
    reader.allWithHeaders().flatMap { row =>
      Employee.fromCSVRow(row).toOption
    }
  } finally {
    reader.close()
  }
}

For automatic marshalling without manual field mapping, use kantan.csv:

// build.sbt
libraryDependencies += "com.nrinaudo" %% "kantan.csv-generic" % "0.7.0"

import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
import java.io.File

case class Product(sku: String, name: String, price: Double, stock: Int)

def readProducts(filename: String): List[Product] = {
  new File(filename)
    .readCsv[List, Product](rfc.withHeader)
    .collect { case Right(product) => product }
}

Handling Large Files with Iterator

Processing gigabyte-sized CSV files requires streaming to avoid memory exhaustion. Use Iterator for lazy evaluation:

import com.github.tototoshi.csv._
import java.io.File

def processLargeCSV[T](
  filename: String,
  batchSize: Int = 1000
)(transform: Map[String, String] => T): Iterator[List[T]] = {
  val reader = CSVReader.open(new File(filename))
  val headers = reader.readNext().get

  // Note: the reader stays open until the returned iterator is fully
  // consumed; exhaust it (or close the reader yourself) to free the handle.
  reader.iterator.map { row =>
    transform(headers.zip(row).toMap)
  }.grouped(batchSize)

// Process in batches
processLargeCSV("huge-file.csv") { row =>
  row("id").toInt
}.foreach { batch =>
  // Insert batch into database
  println(s"Processed ${batch.size} records")
}

For parallel processing with Akka Streams:

// build.sbt -- CsvParsing lives in the Alpakka CSV connector, not core akka-stream
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-stream" % "2.8.0",
  "com.lightbend.akka" %% "akka-stream-alpakka-csv" % "6.0.0"
)

import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.stream.alpakka.csv.scaladsl.CsvParsing
import akka.util.ByteString
import java.nio.file.Paths
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("csv-processor")

def processWithAkkaStreams(filename: String): Future[Done] = {
  FileIO.fromPath(Paths.get(filename))
    .via(CsvParsing.lineScanner())
    .map(_.map(_.utf8String))
    .runForeach { fields =>
      println(s"Row: ${fields.mkString(", ")}")
    }
}

Error Handling and Validation

Production code requires robust error handling for malformed CSV data:

import scala.util.{Try, Success, Failure}
import com.github.tototoshi.csv._
import java.io.File

case class ValidationError(line: Int, field: String, message: String)

def readWithValidation(filename: String): Either[List[ValidationError], List[Employee]] = {
  val reader = CSVReader.open(new File(filename))
  try {
    val rows = reader.allWithHeaders()
    val results = rows.zipWithIndex.map { case (row, idx) =>
      Employee.fromCSVRow(row) match {
        case Success(emp) => Right(emp)
        case Failure(ex) => Left(ValidationError(idx + 2, "", ex.getMessage)) // +2: header row plus 1-based numbering
      }
    }
    
    val errors = results.collect { case Left(err) => err }
    val employees = results.collect { case Right(emp) => emp }
    
    if (errors.isEmpty) Right(employees)
    else Left(errors)
  } finally {
    reader.close()
  }
}

// Usage
readWithValidation("employees.csv") match {
  case Right(employees) => 
    println(s"Successfully loaded ${employees.size} employees")
  case Left(errors) =>
    errors.foreach { err =>
      println(s"Line ${err.line}: ${err.message}")
    }
}

Apache Commons CSV for Complex Requirements

For advanced features like multi-line records, custom null handling, or strict RFC 4180 compliance:

// build.sbt
libraryDependencies += "org.apache.commons" % "commons-csv" % "1.10.0"

import org.apache.commons.csv.{CSVFormat, CSVParser}
import java.io.{FileReader, Reader}
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters._

def readWithCommonsCSV(filename: String): List[Map[String, String]] = {
  val reader: Reader = new FileReader(filename, StandardCharsets.UTF_8) // charset overload requires Java 11+
  val format = CSVFormat.DEFAULT.builder()
    .setHeader()
    .setSkipHeaderRecord(true)
    .setIgnoreEmptyLines(true)
    .setTrim(true)
    .build()
  
  val parser = new CSVParser(reader, format)
  try {
    parser.getRecords.asScala.toList.map { record =>
      record.toMap.asScala.toMap
    }
  } finally {
    parser.close()
  }
}

Choose your CSV library based on requirements: native Scala for prototypes, scala-csv for most applications, kantan.csv for type-safe marshalling, and Apache Commons CSV for complex enterprise scenarios. Always prefer streaming over loading entire files into memory when processing large datasets.
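As a standard-library illustration of that last point (with a hypothetical temp file), `Source.getLines()` already returns an `Iterator`, so rows can be counted or transformed lazily without ever materialising the whole file:

```scala
import java.nio.file.Files
import scala.io.Source
import scala.util.Using

// Minimal sketch: getLines() yields an Iterator, so only one line is held
// in memory at a time; calling .toList would load the entire file instead.
object StreamingSketch extends App {
  val path = Files.createTempFile("demo", ".csv") // hypothetical sample data
  Files.write(path, "id,name\n1,alice\n2,bob\n".getBytes("UTF-8"))

  val rowCount = Using.resource(Source.fromFile(path.toFile)) { source =>
    source.getLines().drop(1).count(_.nonEmpty) // skip the header lazily
  }
  println(rowCount) // prints: 2

  Files.delete(path)
}
```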
