Apache Spark - RDD vs DataFrame vs Dataset
Key Insights
- RDDs provide low-level control and flexibility but require manual optimization, making them ideal for unstructured data and custom partitioning schemes where you need fine-grained control over distributed operations.
- DataFrames offer SQL-like optimizations through Catalyst and Tungsten engines with 2-10x performance improvements over RDDs, but sacrifice compile-time type safety for runtime schema validation.
- Datasets combine type safety with DataFrame optimizations, delivering the best of both worlds for Scala and Java applications, though they introduce serialization overhead and aren’t available in PySpark.
Understanding RDDs: The Foundation
Resilient Distributed Datasets (RDDs) are Spark’s fundamental data structure—immutable, distributed collections of objects partitioned across a cluster. They expose low-level transformations and actions, giving you complete control over how data is processed.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("RDD Example")
val sc = new SparkContext(conf)
// Creating an RDD from a collection
val numbersRDD: RDD[Int] = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Loading data from HDFS
val logsRDD: RDD[String] = sc.textFile("hdfs://logs/*.txt")
// Transformations are lazy
val squaredRDD = numbersRDD.map(x => x * x)
val evenRDD = squaredRDD.filter(x => x % 2 == 0)
// Actions trigger computation
val result = evenRDD.collect()
RDDs shine when working with unstructured data or when you need custom partitioning logic. They’re also the only option for operations that don’t fit the DataFrame/Dataset model.
case class WebLog(ip: String, timestamp: Long, url: String, statusCode: Int)
val logsRDD = sc.textFile("logs.txt")
.map(line => {
val parts = line.split(",")
WebLog(parts(0), parts(1).toLong, parts(2), parts(3).toInt)
})
// Custom partitioning by IP address
// (partitionBy is only defined on pair RDDs, so key by IP first)
val partitionedRDD = logsRDD
.map(log => (log.ip, log))
.partitionBy(new org.apache.spark.HashPartitioner(100))
// Complex aggregations
val ipStats = partitionedRDD
.mapValues(log => (1, if (log.statusCode >= 400) 1 else 0))
.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
The downside? Spark can’t optimize RDD operations. You’re responsible for choosing efficient transformations, and serialization overhead can be significant.
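A classic example of the manual optimization burden is choosing `reduceByKey` over `groupByKey` for aggregations. The sketch below (assuming the keyed `(ip, log)` pairs from above, reduced to simple counts) illustrates the choice; Spark will not rewrite the inefficient version for you:

```scala
import org.apache.spark.rdd.RDD

// Counting requests per IP from an RDD[(String, Int)] of (ip, 1) pairs.
def countPerIp(pairs: RDD[(String, Int)]): RDD[(String, Int)] = {
  // Inefficient: groupByKey shuffles every record across the network
  // before any counting happens:
  //   pairs.groupByKey().mapValues(_.sum)

  // Efficient: reduceByKey combines values map-side first, so only
  // partial sums cross the network:
  pairs.reduceByKey(_ + _)
}
```

With DataFrames, Catalyst makes this kind of choice for you; with RDDs, picking the wrong primitive silently costs a full shuffle of the raw data.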
DataFrames: SQL-Powered Optimization
DataFrames introduce a schema-based abstraction layer with named columns and types. They leverage the Catalyst optimizer to generate efficient physical execution plans and the Tungsten engine for memory management and code generation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder()
.appName("DataFrame Example")
.getOrCreate()
// Reading structured data
val logsDF = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("logs.csv")
// Schema inspection
logsDF.printSchema()
// root
//  |-- ip: string (nullable = true)
//  |-- timestamp: long (nullable = true)
//  |-- url: string (nullable = true)
//  |-- status_code: integer (nullable = true)
// SQL-like operations with optimization
val errorStats = logsDF
.filter(col("status_code") >= 400)
.groupBy("ip")
.agg(
count("*").as("error_count"),
avg("status_code").as("avg_status")
)
.orderBy(desc("error_count"))
errorStats.show()
The Catalyst optimizer rewrites your query plan, pushing down filters, pruning columns, and choosing optimal join strategies automatically.
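You can see these rewrites yourself with `explain`. A minimal sketch, assuming a hypothetical Parquet copy of the logs (Parquet sources report pushed-down predicates in the physical plan):

```scala
import org.apache.spark.sql.functions.col

// With a Parquet source, the filter below shows up as a PushedFilters
// entry in the physical plan, and only ip/status_code are read from disk.
val parquetLogs = spark.read.parquet("logs.parquet") // hypothetical path
parquetLogs
  .filter(col("status_code") >= 400)
  .select("ip", "status_code")
  .explain(true) // prints the parsed, analyzed, optimized, and physical plans
```

Reading the physical plan is the fastest way to confirm that a filter or column prune actually reached the data source.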
// Complex aggregations with window functions
import org.apache.spark.sql.expressions.Window
val windowSpec = Window
.partitionBy("ip")
.orderBy(desc("timestamp"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val enrichedLogs = logsDF
.withColumn("request_number", row_number().over(windowSpec))
.withColumn("cumulative_errors",
sum(when(col("status_code") >= 400, 1).otherwise(0)).over(windowSpec)
)
// Join optimization happens automatically
val geoDF = spark.read.parquet("geo_ip_data.parquet")
val logsWithGeo = logsDF.join(broadcast(geoDF), "ip")
DataFrames integrate seamlessly with Spark SQL, allowing you to mix DataFrame API calls with SQL queries:
logsDF.createOrReplaceTempView("logs")
val result = spark.sql("""
SELECT ip, COUNT(*) as request_count,
SUM(CASE WHEN status_code >= 400 THEN 1 ELSE 0 END) as errors
FROM logs
WHERE timestamp > unix_timestamp() - 86400
GROUP BY ip
HAVING errors > 10
""")
The tradeoff is runtime errors. Column name typos or type mismatches only surface when your job runs.
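A small sketch of what that failure mode looks like in practice (the misspelled column name is deliberate):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.col

// This compiles without complaint; the typo only surfaces when Spark
// analyzes the plan against the actual schema.
try {
  logsDF.select(col("status_cde")).show() // misspelled: should be "status_code"
} catch {
  case e: AnalysisException =>
    println(s"Caught only at runtime: ${e.getMessage}")
}
```

In a long pipeline, such an error can appear minutes into a job, which is exactly the gap Datasets close.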
Datasets: Type-Safe DataFrames
Datasets extend the DataFrame API with compile-time type safety; in fact, DataFrame is simply a type alias for Dataset[Row]. A Dataset pairs your JVM objects with an encoder that maps them to Spark's internal binary format.
case class WebLog(
ip: String,
timestamp: Long,
url: String,
statusCode: Int,
responseTime: Int
)
import org.apache.spark.sql.Dataset
import spark.implicits._ // brings the WebLog encoder into scope
val logsDS: Dataset[WebLog] = spark.read
.option("header", "true")
.option("inferSchema", "true") // without this, every CSV column is a string and .as[WebLog] fails at runtime
.csv("logs.csv")
.as[WebLog]
// Type-safe transformations
val slowRequests: Dataset[WebLog] = logsDS
.filter(log => log.responseTime > 1000)
.map(log => log.copy(url = log.url.toLowerCase))
// Compile error if field doesn't exist
// logsDS.filter(log => log.invalidField > 0) // Won't compile
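To make the encoder explicit rather than implicit, you can materialize it yourself. A small sketch using the `Encoders` factory for case classes:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// An encoder is what lets Spark store WebLog objects in Tungsten's binary
// row format instead of as Java objects on the heap. For any case class,
// Encoders.product derives one from the constructor fields.
val webLogEncoder: Encoder[WebLog] = Encoders.product[WebLog]

// The derived schema mirrors the case class field names and types.
println(webLogEncoder.schema.treeString)
```

This is the same encoder `import spark.implicits._` supplies behind the scenes when you call `.as[WebLog]`.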
Datasets support both typed and untyped operations. You can mix Dataset methods with DataFrame operations:
val errorAnalysis = logsDS
.filter(_.statusCode >= 400) // Typed
.groupByKey(_.ip) // Typed
.mapGroups { case (ip, logs) =>
val logsList = logs.toList
// toDouble avoids silent integer division in the average
(ip, logsList.size, logsList.map(_.statusCode).sum.toDouble / logsList.size)
}
.toDF("ip", "error_count", "avg_status_code") // Convert to DataFrame
.orderBy(desc("error_count"))
Custom aggregations become type-safe and composable:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
case class RequestStats(count: Long, totalTime: Long, errors: Long)
object RequestStatsAggregator extends Aggregator[WebLog, RequestStats, RequestStats] {
def zero: RequestStats = RequestStats(0, 0, 0)
def reduce(buffer: RequestStats, log: WebLog): RequestStats = {
RequestStats(
buffer.count + 1,
buffer.totalTime + log.responseTime,
buffer.errors + (if (log.statusCode >= 400) 1 else 0)
)
}
def merge(b1: RequestStats, b2: RequestStats): RequestStats = {
RequestStats(b1.count + b2.count, b1.totalTime + b2.totalTime, b1.errors + b2.errors)
}
def finish(reduction: RequestStats): RequestStats = reduction
def bufferEncoder: Encoder[RequestStats] = Encoders.product
def outputEncoder: Encoder[RequestStats] = Encoders.product
}
val stats = logsDS
.groupByKey(_.ip)
.agg(RequestStatsAggregator.toColumn)
Performance Comparison
Benchmarking a word count operation on 10GB of text data reveals the performance hierarchy:
// RDD approach
val wordCountRDD = sc.textFile("data.txt")
.flatMap(_.split("\\s+"))
.map(word => (word, 1))
.reduceByKey(_ + _)
// Execution time: ~45 seconds
// DataFrame approach
val wordCountDF = spark.read.text("data.txt")
.select(explode(split(col("value"), "\\s+")).as("word"))
.groupBy("word")
.count()
// Execution time: ~22 seconds
// Dataset approach (requires import spark.implicits._ for the encoders)
case class Word(value: String)
val wordCountDS = spark.read.text("data.txt")
.flatMap(_.getString(0).split("\\s+"))
.map(Word(_))
.groupByKey(_.value)
.count()
// Execution time: ~25 seconds
DataFrames win on performance due to Catalyst optimization and Tungsten execution. Datasets add minimal overhead—typically 5-15%—while providing type safety.
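Where that Dataset overhead comes from is worth seeing: a typed lambda is an opaque JVM function that Catalyst cannot look inside, so it cannot push the predicate down or skip deserialization. A sketch comparing the two forms of the same filter:

```scala
import org.apache.spark.sql.functions.col

// Column expression: Catalyst understands the predicate and can push it
// down to the source, operating directly on the binary format.
logsDS.filter(col("statusCode") >= 400).explain()

// Typed lambda: every row must be deserialized into a WebLog object
// before the function runs, and no pushdown is possible.
logsDS.filter(_.statusCode >= 400).explain()
```

Comparing the two physical plans side by side shows the extra deserialization step that accounts for most of the 5-15% gap.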
Choosing the Right Abstraction
Use RDDs when you need:
- Fine-grained control over partitioning and data placement
- Operations on unstructured data that don’t fit tabular models
- Custom partitioners or complex stateful transformations
- Access to each partition’s data directly
Use DataFrames when you need:
- Maximum performance for structured data operations
- SQL-like queries and aggregations
- Integration with BI tools and external systems
- Working in PySpark where Datasets aren’t available
Use Datasets when you need:
- Compile-time type safety in Scala/Java applications
- Complex domain models with business logic
- Refactoring safety and IDE support
- Balance between performance and type safety
// Converting between abstractions
import org.apache.spark.sql.{DataFrame, Row}
val rdd: RDD[WebLog] = logsDS.rdd
val df: DataFrame = logsDS.toDF()
val ds: Dataset[WebLog] = df.as[WebLog]
// You can also work with Row objects
val rowRDD: RDD[Row] = df.rdd
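Going the other direction, an RDD[Row] carries no schema of its own, so rebuilding a DataFrame from it requires supplying one explicitly. A minimal sketch (field names mirror the WebLog case class above):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// An RDD[Row] has lost its schema; spark.createDataFrame reattaches one.
val webLogSchema = StructType(Seq(
  StructField("ip", StringType),
  StructField("timestamp", LongType),
  StructField("url", StringType),
  StructField("statusCode", IntegerType),
  StructField("responseTime", IntegerType)
))
val rebuiltDF = spark.createDataFrame(rowRDD, webLogSchema)
```

This round trip (Dataset to RDD[Row] to DataFrame) is occasionally useful when a low-level RDD transformation sits in the middle of an otherwise structured pipeline.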
Modern Spark applications typically use DataFrames and Datasets almost exclusively. RDDs remain relevant for specific use cases requiring low-level control, but the optimizations in the higher-level APIs make DataFrames and Datasets the default choice for most workloads.