Spark Scala - Build with SBT
Key Insights
- SBT’s incremental compilation and native Scala support make it significantly faster and more ergonomic than Maven or Gradle for Spark Scala projects, especially during iterative development.
- The provided scope is critical for Spark dependencies—cluster environments already include Spark libraries, and bundling them creates bloated JARs and version conflicts.
- Fat JAR assembly requires careful merge strategy configuration; discarding META-INF conflicts and concatenating reference.conf files prevents most deployment failures.
Why SBT for Spark Scala Projects
If you’re building Spark applications in Scala, SBT should be your default choice. While Maven has broader enterprise adoption and Gradle offers flexibility, SBT provides native Scala support that eliminates the impedance mismatch you’ll experience with JVM-centric tools.
The practical benefits are immediate: incremental compilation means your second build takes seconds instead of minutes, the Scala REPL integration lets you test transformations interactively, and dependency resolution understands Scala’s binary compatibility requirements out of the box. Maven requires the scala-maven-plugin and careful configuration to achieve what SBT does natively.
That said, SBT has a learning curve. Its DSL can feel foreign if you’re coming from XML-based build tools. This article focuses on practical patterns that work—not exhaustive documentation of every SBT feature.
Project Setup and Directory Structure
SBT follows a convention-over-configuration approach similar to Maven. Here’s the standard layout for a Spark application:
spark-etl-project/
├── build.sbt
├── project/
│   ├── build.properties
│   └── plugins.sbt
├── src/
│   ├── main/
│   │   ├── scala/
│   │   │   └── com/
│   │   │       └── example/
│   │   │           └── SparkApp.scala
│   │   └── resources/
│   │       └── application.conf
│   └── test/
│       ├── scala/
│       │   └── com/
│       │       └── example/
│       │           └── SparkAppTest.scala
│       └── resources/
└── target/
Start by creating project/build.properties to lock your SBT version:
sbt.version=1.9.7
This ensures reproducible builds across developer machines and CI environments. Never rely on whatever SBT version happens to be installed globally.
Your minimal build.sbt for a Spark project:
ThisBuild / scalaVersion := "2.12.18"
ThisBuild / version := "0.1.0"
ThisBuild / organization := "com.example"
lazy val root = (project in file("."))
  .settings(
    name := "spark-etl-project"
  )
Note the Scala version: Spark 3.x supports Scala 2.12 and 2.13, but many organizations still run Spark 2.4 clusters locked to Scala 2.11. Verify your cluster’s Scala version before writing code—binary incompatibility between Scala versions will cause runtime failures with cryptic NoSuchMethodError exceptions.
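A quick way to check is to inspect the Scala library actually on the runtime classpath. This sketch prints the binary version (the first two components); run the println inside spark-shell on the target cluster and compare the result with the suffix of your artifact names (e.g. _2.12):

```scala
// Sketch: report the Scala binary version of the scala-library on the
// runtime classpath. On a cluster, run this inside spark-shell and
// compare with your compiled artifact suffix (e.g. _2.12).
val full   = scala.util.Properties.versionNumberString // e.g. "2.12.18"
val binary = full.split('.').take(2).mkString(".")     // e.g. "2.12"
println(s"Runtime Scala binary version: $binary")
```

If the printed binary version differs from the one you compiled against, rebuild before deploying.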
Configuring Dependencies
Here’s a production-ready build.sbt with properly scoped Spark dependencies:
ThisBuild / scalaVersion := "2.12.18"
ThisBuild / version := "0.1.0"
ThisBuild / organization := "com.example"
val sparkVersion = "3.5.0"
lazy val root = (project in file("."))
  .settings(
    name := "spark-etl-project",
    libraryDependencies ++= Seq(
      // Spark dependencies - provided scope for cluster deployment
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",

      // Your application dependencies - bundled in the fat JAR
      "com.typesafe" % "config" % "1.4.3",
      "com.github.scopt" %% "scopt" % "4.1.0",

      // Delta Lake (if using)
      "io.delta" %% "delta-core" % "2.4.0" % "provided",

      // Testing
      "org.scalatest" %% "scalatest" % "3.2.17" % Test,
      "org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "tests",
      "org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "tests"
    ),

    // Ensure tests run in a forked JVM to avoid classloader issues
    Test / fork := true,

    // Java options for Spark
    javaOptions ++= Seq(
      "-Xms512M",
      "-Xmx2048M",
      "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
    )
  )
The provided scope is crucial. When you submit a job to a Spark cluster, the cluster already has Spark libraries available. Including them in your JAR creates a 200MB+ artifact instead of a few megabytes, and risks version conflicts between your bundled Spark and the cluster’s Spark.
The %% operator automatically appends the Scala version suffix to artifact names. "org.apache.spark" %% "spark-core" resolves to spark-core_2.12 when your project uses Scala 2.12.
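The mechanics are simple enough to illustrate with a plain function (this is an illustration of the naming convention, not sbt's actual API):

```scala
// Illustration only: what %% does under the hood is append the Scala
// binary version as a suffix to the artifact name.
def crossArtifactName(artifact: String, scalaBinaryVersion: String): String =
  s"${artifact}_$scalaBinaryVersion"

// "org.apache.spark" %% "spark-core" on a Scala 2.12 project resolves to:
println(crossArtifactName("spark-core", "2.12")) // spark-core_2.12
```

If you must support clusters on different Scala versions, sbt's crossScalaVersions setting plus `sbt +assembly` produces one artifact per version.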
Building Fat JARs with sbt-assembly
Cluster deployment requires a fat JAR (also called an uber JAR) containing your code and all non-provided dependencies. The sbt-assembly plugin handles this.
Create project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
Now configure assembly in your build.sbt:
// Add to your existing build.sbt
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case "reference.conf"                          => MergeStrategy.concat
  case "application.conf"                        => MergeStrategy.concat
  case x if x.endsWith(".proto")                 => MergeStrategy.first
  case x if x.endsWith("module-info.class")      => MergeStrategy.discard
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}

assembly / assemblyJarName := s"${name.value}-${version.value}.jar"

// Exclude the Scala library if the cluster provides it
assembly / assemblyOption := (assembly / assemblyOption).value
  .withIncludeScala(false)
The merge strategy configuration handles common conflicts:
- META-INF files: Multiple JARs include manifests and signatures. Discarding them is usually safe.
- reference.conf: Typesafe Config files must be concatenated, not overwritten, or you’ll lose configuration from dependencies.
- Service loader files: These need concatenation to preserve all service implementations.
Build your fat JAR with:
sbt assembly
The output lands in target/scala-2.12/spark-etl-project-0.1.0.jar.
Writing and Running a Sample Spark Application
Here’s a practical example that reads JSON, performs transformations, and writes Parquet:
package com.example

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object CustomerETL {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CustomerETL")
      .getOrCreate()

    val inputPath  = args(0)
    val outputPath = args(1)

    val customers   = spark.read.json(inputPath)
    val transformed = transformCustomers(customers)

    transformed.write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }

  def transformCustomers(df: DataFrame): DataFrame = {
    df.select(
      col("id"),
      upper(col("name")).as("name_upper"),
      col("email"),
      when(col("age") >= 18, "adult")
        .otherwise("minor")
        .as("age_category"),
      current_timestamp().as("processed_at")
    ).filter(col("email").isNotNull)
  }
}
Run locally during development:
sbt "runMain com.example.CustomerETL data/input.json data/output"
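One gotcha: because the Spark artifacts are in the provided scope, sbt's run tasks use the runtime classpath, which excludes them, so the command above fails with NoClassDefFoundError out of the box. A widely used workaround (sketch; add to build.sbt) is to point the run task at the compile classpath, which does include provided dependencies:

```scala
// build.sbt sketch: let `sbt run` / `runMain` see provided Spark jars
// during local development by running on the Compile classpath.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```

The fat JAR you deploy is unaffected; this only changes what the local run tasks can see.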
For local runs, add a master configuration. Create src/main/resources/application.conf:
spark {
  master = "local[*]"
  master = ${?SPARK_MASTER}
}
Then modify your SparkSession builder to use it for local development while respecting cluster settings in production.
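One way to wire that up is sketched below. It assumes the application.conf above and the Typesafe Config dependency already in build.sbt; the sys.props check is a heuristic (spark-submit exposes driver configuration such as spark.master as system properties in client mode), so treat this as illustrative rather than the only approach:

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

// Sketch: use spark.master from application.conf for local runs, but
// defer to spark-submit when it has already supplied a master.
val appConf = ConfigFactory.load()
val builder = SparkSession.builder().appName("CustomerETL")

val spark: SparkSession =
  if (sys.props.contains("spark.master")) builder.getOrCreate()
  else builder.master(appConf.getString("spark.master")).getOrCreate()
```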
Submit to a cluster:
spark-submit \
  --class com.example.CustomerETL \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  target/scala-2.12/spark-etl-project-0.1.0.jar \
  s3://bucket/input/ \
  s3://bucket/output/
Testing Spark Applications
Extract transformation logic into pure functions that accept and return DataFrames. This makes testing straightforward:
package com.example
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.BeforeAndAfterAll
class CustomerETLTest extends AnyFunSuite with BeforeAndAfterAll {

  var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .appName("CustomerETLTest")
      .master("local[2]")
      .config("spark.ui.enabled", "false")
      .config("spark.driver.bindAddress", "127.0.0.1")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) {
      spark.stop()
    }
  }

  test("transformCustomers filters null emails") {
    import spark.implicits._

    val input = Seq(
      (1, "alice", "alice@example.com", 25),
      (2, "bob", null, 30)
    ).toDF("id", "name", "email", "age")

    val result = CustomerETL.transformCustomers(input)

    assert(result.count() == 1)
    assert(result.filter($"id" === 1).count() == 1)
  }

  test("transformCustomers categorizes age correctly") {
    import spark.implicits._

    val input = Seq(
      (1, "alice", "alice@example.com", 25),
      (2, "charlie", "charlie@example.com", 16)
    ).toDF("id", "name", "email", "age")

    val result = CustomerETL.transformCustomers(input)
    val categories = result.select("age_category").as[String].collect()

    assert(categories.contains("adult"))
    assert(categories.contains("minor"))
  }
}
Run tests with:
sbt test
Tips and Common Pitfalls
Scala version mismatches cause the most deployment failures. If your cluster runs Spark 2.4 with Scala 2.11, but you compiled with Scala 2.12, you’ll see NoSuchMethodError or ClassNotFoundException at runtime. Always verify your target cluster’s Scala version.
Memory settings for SBT matter for large projects. Create .sbtopts in your project root:
-J-Xms1024M
-J-Xmx4096M
-J-XX:+UseG1GC
Dependency shading becomes necessary when your dependencies conflict with Spark’s bundled libraries (Guava is a common culprit). Use sbt-assembly’s shading feature:
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)
Incremental compilation is SBT’s killer feature. Avoid clean before every build—it defeats the purpose. Only clean when you suspect cache corruption or after changing Scala versions.
Parallel test execution can cause issues with shared SparkSessions. Either disable parallelism in tests or use separate SparkSessions per test class:
Test / parallelExecution := false
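If you'd rather keep parallel execution, a shared trait that lazily builds one local session is a reasonable alternative. This is a sketch (SparkSessionTestWrapper is an illustrative name, not a library type); note that getOrCreate() returns the same underlying session within a JVM, which is exactly what makes concurrent suites safe:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: mix this into each test suite. The session is created lazily,
// so suites that never touch Spark pay no startup cost.
trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("test")
    .master("local[2]")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
}
```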
SBT rewards investment in understanding its model. The patterns in this article cover the vast majority of Spark project needs. Start simple, add complexity only when required, and always verify your assumptions against your target cluster's configuration.