Spark Scala - Create DataFrame from Seq/List

Key Insights

  • Use toDF() with case classes for type-safe schema inference in production code; reserve tuple-based approaches for quick prototyping and REPL exploration
  • Always import spark.implicits._ after creating your SparkSession—forgetting this is the most common source of “value toDF is not a member” compilation errors
  • Prefer createDataFrame() with explicit StructType schemas when you need precise control over nullability, data types, or when working with data from external sources

Introduction

Creating DataFrames from in-memory Scala collections is a fundamental skill that every Spark developer uses regularly. Whether you’re writing unit tests, prototyping transformations in the REPL, or working with small reference datasets, knowing how to efficiently convert a Seq or List into a DataFrame saves significant development time.

Spark provides multiple approaches for this conversion, each with distinct trade-offs. Understanding when to use toDF() versus createDataFrame(), and when case classes outperform tuples, will make your code more maintainable and your debugging sessions shorter.

This article covers the practical patterns you’ll encounter daily, from simple tuple-based DataFrames to complex nested structures with explicit schemas.

Prerequisites & Setup

Before creating DataFrames from collections, you need a properly configured SparkSession. In local development or testing, you’ll typically create a session with the local master.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrame Creation Examples")
  .master("local[*]")
  .getOrCreate()

// Critical: Import implicits for toDF() method
import spark.implicits._

The spark.implicits._ import is essential. It provides the implicit conversions that enable the toDF() method on Scala collections. Without this import, you’ll encounter the cryptic error: value toDF is not a member of Seq[...].

For the examples in this article, I’m assuming Spark 3.x and Scala 2.12 or 2.13. The APIs remain consistent across minor versions, but some edge cases around type inference have improved in recent releases.

Creating DataFrame from Seq of Tuples

The quickest way to create a DataFrame is from a sequence of tuples. This approach works well for ad-hoc exploration and simple test fixtures.

import spark.implicits._

val df = Seq(
  (1, "Alice", 28),
  (2, "Bob", 35),
  (3, "Charlie", 42)
).toDF("id", "name", "age")

df.show()
// +---+-------+---+
// | id|   name|age|
// +---+-------+---+
// |  1|  Alice| 28|
// |  2|    Bob| 35|
// |  3|Charlie| 42|
// +---+-------+---+

df.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)

The toDF() method accepts column names as varargs. If you omit column names, Spark assigns default names like _1, _2, _3 based on tuple positions—rarely what you want.
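For illustration, a quick sketch of that fallback (assuming the SparkSession and spark.implicits._ from the setup above; the variable names are mine):

```scala
import spark.implicits._

// Omitting column names leaves the positional defaults in place
val unnamed = Seq((1, "Alice", 28)).toDF()
println(unnamed.columns.mkString(", "))  // _1, _2, _3

// Dataset.toDF(colNames: String*) can rename after the fact
val renamed = unnamed.toDF("id", "name", "age")
```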

You can also use List interchangeably with Seq:

val dfFromList = List(
  ("Engineering", 50),
  ("Marketing", 25),
  ("Sales", 30)
).toDF("department", "headcount")

The tuple approach has limitations. Spark infers types from the Scala types, which usually works but can produce unexpected results with numeric types. An Int becomes IntegerType, a Long becomes LongType, and a Double becomes DoubleType. If you need a DecimalType or specific precision, you’ll need explicit schema definition.
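As a sketch of that mapping (the column names are illustrative; assumes the session and implicits from the setup above):

```scala
import spark.implicits._
import org.apache.spark.sql.types.{DecimalType, DoubleType, LongType}

// Scala types map one-to-one onto Spark SQL types
val inferred = Seq((1L, 19.99)).toDF("id", "price")
// id: LongType, price: DoubleType -- never DecimalType

// For exact precision, cast after creation (or use an explicit schema)
val priced = inferred.withColumn("price", inferred("price").cast(DecimalType(10, 2)))
```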

Creating DataFrame from Seq of Case Classes

Case classes provide type-safe schema inference and are the recommended approach for production code. Spark uses reflection to derive the schema from the case class definition.

case class Employee(
  id: Int,
  name: String,
  department: String,
  salary: Double
)

val employees = Seq(
  Employee(1, "Alice", "Engineering", 95000.0),
  Employee(2, "Bob", "Marketing", 75000.0),
  Employee(3, "Charlie", "Engineering", 105000.0),
  Employee(4, "Diana", "Sales", 85000.0)
)

val employeeDf = employees.toDF()

employeeDf.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- department: string (nullable = true)
//  |-- salary: double (nullable = false)

Case classes offer several advantages over tuples:

  1. Self-documenting code: Field names appear in the class definition, not scattered across toDF() calls
  2. Compile-time safety: Typos in field names cause compilation errors, not runtime failures
  3. Refactoring support: IDEs can rename fields across your codebase
  4. Reusability: Define once, use across multiple test fixtures

For nullable fields, use Option types:

case class Customer(
  id: Int,
  name: String,
  email: Option[String],
  phoneNumber: Option[String]
)

val customers = Seq(
  Customer(1, "Alice", Some("alice@example.com"), None),
  Customer(2, "Bob", None, Some("555-1234"))
)

val customerDf = customers.toDF()
customerDf.show()
// +---+-----+-----------------+-----------+
// | id| name|            email|phoneNumber|
// +---+-----+-----------------+-----------+
// |  1|Alice|alice@example.com|       null|
// |  2|  Bob|             null|   555-1234|
// +---+-----+-----------------+-----------+
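The schema confirms the mapping: Option fields unwrap to their element type with nullable = true, while the primitive Int stays non-nullable.

```scala
customerDf.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- email: string (nullable = true)
//  |-- phoneNumber: string (nullable = true)
```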

Using createDataFrame() with Explicit Schema

When you need precise control over data types, nullability constraints, or when working with data that doesn’t map cleanly to case classes, use createDataFrame() with an explicit schema.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("balance", DecimalType(10, 2), nullable = true),
  StructField("active", BooleanType, nullable = false)
))

val data = Seq(
  Row(1L, "Alice", BigDecimal("1500.50").bigDecimal, true),
  Row(2L, "Bob", BigDecimal("2300.75").bigDecimal, true),
  Row(3L, "Charlie", null, false)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- name: string (nullable = false)
//  |-- balance: decimal(10,2) (nullable = true)
//  |-- active: boolean (nullable = false)

This approach is verbose but provides complete control. Use it when:

  • You need specific decimal precision for financial data
  • You’re testing schema validation logic
  • You’re creating fixtures that must match an external schema exactly
  • You’re working with legacy code that uses Row objects

A cleaner alternative combines case classes with schema override:

case class Account(id: Long, name: String, balance: Double, active: Boolean)

val accounts = Seq(
  Account(1L, "Alice", 1500.50, true),
  Account(2L, "Bob", 2300.75, true)
)

// Create with case class, then cast specific columns
import org.apache.spark.sql.functions.col

val accountDf = accounts.toDF()
  .withColumn("balance", col("balance").cast(DecimalType(10, 2)))

Working with Complex/Nested Types

Real-world data often contains arrays, maps, and nested structures. Spark handles these elegantly with both case classes and explicit schemas.

case class Order(
  orderId: String,
  customerId: Int,
  items: Seq[String],
  quantities: Map[String, Int]
)

val orders = Seq(
  Order("ORD-001", 1, Seq("Widget", "Gadget"), Map("Widget" -> 2, "Gadget" -> 1)),
  Order("ORD-002", 2, Seq("Sprocket"), Map("Sprocket" -> 5))
)

val orderDf = orders.toDF()
orderDf.printSchema()
// root
//  |-- orderId: string (nullable = true)
//  |-- customerId: integer (nullable = false)
//  |-- items: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- quantities: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: integer (valueContainsNull = false)

For deeply nested structures, use nested case classes:

case class Address(street: String, city: String, zipCode: String)
case class Person(name: String, age: Int, addresses: Seq[Address])

val people = Seq(
  Person("Alice", 28, Seq(
    Address("123 Main St", "Boston", "02101"),
    Address("456 Oak Ave", "Cambridge", "02139")
  )),
  Person("Bob", 35, Seq(
    Address("789 Pine Rd", "Somerville", "02143")
  ))
)

val peopleDf = people.toDF()
peopleDf.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
//  |-- addresses: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- street: string (nullable = true)
//  |    |    |-- city: string (nullable = true)
//  |    |    |-- zipCode: string (nullable = true)
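Nested fields are queried with dot notation, and arrays flatten with explode. A sketch reusing peopleDf from above:

```scala
import org.apache.spark.sql.functions.explode

// One output row per (person, address) pair
val flattened = peopleDf
  .select(peopleDf("name"), explode(peopleDf("addresses")).as("addr"))
  .select("name", "addr.city", "addr.zipCode")

flattened.show()
// Alice appears twice (two addresses), Bob once
```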

Common Pitfalls & Best Practices

Missing implicits import: The most frequent error when starting with Spark.

// This fails:
val df = Seq((1, "a")).toDF()
// error: value toDF is not a member of Seq[(Int, String)]

// Fix: Import implicits after creating SparkSession
import spark.implicits._
val df = Seq((1, "a")).toDF()

Type inference surprises: the element type of your collection is fixed by the Scala compiler, not by Spark. Mixing Int and Long in the tuples widens the inferred type to (AnyVal, String), for which no encoder exists, so toDF() is not even available.

// Problematic: mixing Int and Long
val mixedDf = Seq(
  (1, "Alice"),    // (Int, String)
  (2L, "Bob")      // (Long, String) -- Seq widens to (AnyVal, String)
).toDF("id", "name")
// error: value toDF is not a member of Seq[(AnyVal, String)]

// Fix: Be explicit about types
val fixedDf = Seq(
  (1L, "Alice"),
  (2L, "Bob")
).toDF("id", "name")

Performance with large collections: Converting large in-memory collections to DataFrames defeats the purpose of distributed computing. If your Seq has millions of elements, you should be reading from a distributed source.

// Acceptable: Small reference data
val countryCodes = (1 to 200).map(i => (i, s"Country$i")).toDF("id", "name")

// Problematic: Large dataset in memory
val massiveSeq = (1 to 10000000).map(i => (i, s"Value$i"))
val badIdea = massiveSeq.toDF("id", "value") // Don't do this

Case class scope: Define case classes at the top level or inside a stable object, not inside the method that calls toDF(). A case class defined inside a method has no stable TypeTag, which leads to "No TypeTag available" compilation errors or serialization failures.

// Good: Case class at object/class level
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Record(id: Int, value: String)

object MyApp {
  def createDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Record(1, "a")).toDF()
  }
}

For testing, prefer case classes with explicit schemas for fixtures that validate schema-sensitive logic. For quick REPL exploration, tuples with toDF() provide the fastest iteration cycle. Match your approach to your use case, and you’ll write cleaner, more maintainable Spark code.
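As a sketch of that fixture pattern (the case class and schema names are mine; assumes spark.implicits._ is in scope):

```scala
import org.apache.spark.sql.types._

case class Event(id: Long, name: String)

// The schema a schema-sensitive test would assert against
val expectedSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val fixture = Seq(Event(1L, "start"), Event(2L, "stop")).toDF()
assert(fixture.schema == expectedSchema)
```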
