Apache Spark - How Spark Works Internally

Key Insights

  • Spark’s execution model transforms high-level operations into a DAG of stages, where each stage contains tasks that process data partitions in parallel across the cluster
  • The driver program coordinates execution by splitting jobs into stages at shuffle boundaries, while executors run tasks and cache data in memory for iterative algorithms
  • Understanding RDD lineage, lazy evaluation, and the catalyst optimizer helps you write efficient Spark applications that minimize shuffles and maximize data locality

Spark Architecture Components

Spark operates on a master-worker architecture with three primary components: the driver program, cluster manager, and executors.

The driver runs your main() function and creates the SparkContext, which coordinates the entire application. It converts your code into a logical execution plan, optimizes it, and schedules tasks across the cluster.

// Driver creates SparkContext and defines transformations
val conf = new SparkConf().setAppName("InternalWorkings")
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs://data/logs.txt")
val filtered = data.filter(line => line.contains("ERROR"))
val counts = filtered.map(line => (line.split(" ")(0), 1))
                    .reduceByKey(_ + _)
counts.collect()

Executors are JVM processes running on worker nodes that execute tasks and store data. Each executor has multiple cores and memory slots for parallel task execution. The cluster manager (YARN, Kubernetes, or Spark's standalone manager; Mesos support is deprecated as of Spark 3.2) allocates resources and manages the executor lifecycle.

RDD Lineage and Lazy Evaluation

Resilient Distributed Datasets (RDDs) are Spark’s fundamental abstraction. Rather than immediately executing operations, Spark builds a lineage graph—a directed acyclic graph (DAG) of transformations.

# These transformations build lineage but don't execute
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 2)
rdd3 = rdd2.filter(lambda x: x > 5)

# Only when action is called does execution begin
result = rdd3.collect()  # Triggers computation

# Check lineage
print(rdd3.toDebugString().decode())  # toDebugString() returns bytes in PySpark
# Output resembles (map and filter are pipelined into a single PythonRDD):
# (2) PythonRDD[1] at collect ...
#  |  ParallelCollectionRDD[0] at parallelize ...

Lazy evaluation enables Spark to optimize the entire pipeline before execution. When you call an action like collect(), count(), or saveAsTextFile(), the driver examines the complete lineage and creates an optimized physical plan.

This approach provides fault tolerance without replication. If a partition is lost, Spark recomputes only that partition using its lineage information.
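
Lineage-based recovery can be sketched with a toy RDD class (purely illustrative—`MiniRDD` and its methods are not Spark APIs): each dataset remembers only its parent and the function that derives it, so any lost partition can be rebuilt on demand instead of being replicated.

```python
# Toy lineage graph: each "RDD" stores its parent and transformation,
# so any partition can be recomputed from the source data.
class MiniRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # only source RDDs hold data
        self.parent = parent
        self.fn = fn                   # transformation applied to a parent partition

    def map(self, f):
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, p):
        return MiniRDD(parent=self, fn=lambda part: [x for x in part if p(x)])

    def compute(self, i):
        # Walk the lineage back to the source, then replay the transformations.
        if self.parent is None:
            return self._partitions[i]
        return self.fn(self.parent.compute(i))

source = MiniRDD(partitions=[[1, 2], [3, 4, 5]])
pipeline = source.map(lambda x: x * 2).filter(lambda x: x > 5)

# "Losing" partition 1 is harmless: recompute it from lineage alone.
print(pipeline.compute(1))  # [6, 8, 10]
```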

Job Execution and DAG Scheduling

When an action triggers execution, the DAGScheduler divides the job into stages based on shuffle boundaries. Narrow transformations (map, filter) stay within a stage, while wide transformations (groupByKey, reduceByKey) create stage boundaries.

val logs = sc.textFile("logs/*.txt")  // Stage 0
val errors = logs.filter(_.contains("ERROR"))  // Stage 0
val warnings = logs.filter(_.contains("WARN"))  // Stage 0

// Union is narrow - still Stage 0
val combined = errors.union(warnings)

// reduceByKey requires shuffle - creates Stage 1
val categorized = combined.map(line => {
  val parts = line.split(" ")
  (parts(1), 1)  // (category, count)
}).reduceByKey(_ + _)  // Shuffle boundary

// Another shuffle - creates Stage 2
val sorted = categorized.sortBy(_._2, ascending = false)
sorted.take(10)

The DAGScheduler creates a logical execution plan:

  • Stage 0: Read, filter, union, map (all narrow)
  • Stage 1: Shuffle read, reduce (wide)
  • Stage 2: Shuffle read, sort (wide)
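
The stage-splitting rule can be sketched as a simple walk over the operator sequence: narrow operations pipeline into the current stage, wide operations cut a boundary. (This is a simplification—in real Spark the map side of a shuffle belongs to the earlier stage and the reduce side to the next, and the `WIDE` set here is illustrative.)

```python
# Illustrative stage splitter: cut a new stage at every wide dependency.
WIDE = {"reduceByKey", "groupByKey", "sortBy", "join"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:          # shuffle boundary: close the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

ops = ["textFile", "filter", "union", "map", "reduceByKey", "sortBy", "take"]
for i, stage in enumerate(split_into_stages(ops)):
    print(f"Stage {i}: {stage}")
# Stage 0: ['textFile', 'filter', 'union', 'map', 'reduceByKey']
# Stage 1: ['sortBy']
# Stage 2: ['take']
```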

Each stage runs as a set of tasks—one task per partition. The TaskScheduler assigns tasks to executors based on data locality preferences: PROCESS_LOCAL (data in executor cache), NODE_LOCAL (data on same node), RACK_LOCAL (data in same rack), or ANY (data anywhere).
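
The locality preference order can be sketched as a ranking problem (the executor/block dictionaries below are invented for illustration; Spark's TaskScheduler also applies per-level wait timeouts via `spark.locality.wait`):

```python
# Illustrative locality ranking for one task's input block.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def locality(executor, block):
    if block["id"] in executor["cached"]:
        return "PROCESS_LOCAL"          # block already in this executor's memory
    if executor["node"] == block["node"]:
        return "NODE_LOCAL"             # same machine, different process
    if executor["rack"] == block["rack"]:
        return "RACK_LOCAL"             # same rack, network hop required
    return "ANY"

def best_executor(executors, block):
    return min(executors, key=lambda e: LEVELS.index(locality(e, block)))

executors = [
    {"id": "exec-1", "node": "n1", "rack": "r1", "cached": set()},
    {"id": "exec-2", "node": "n2", "rack": "r1", "cached": {"blk-7"}},
]
block = {"id": "blk-7", "node": "n1", "rack": "r1"}
print(best_executor(executors, block)["id"])  # exec-2 (has blk-7 cached)
```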

Shuffle Mechanism

Shuffles are the most expensive operations in Spark. During a shuffle, data is redistributed across partitions, requiring disk I/O, network transfer, and serialization.

# This triggers a shuffle
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
grouped = pairs.groupByKey()  # Shuffle: all 'a' keys to one partition

# More efficient - combines before shuffle
reduced = pairs.reduceByKey(lambda x, y: x + y)  # Partial aggregation
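
The saving is easy to quantify. Assuming all four records sit in one map partition, counting what each strategy would send over the network:

```python
# groupByKey vs reduceByKey: records crossing the shuffle boundary.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: every record is shuffled as-is.
shuffled_group = len(pairs)

# reduceByKey: map-side combine first, one record per key per map task.
combined = defaultdict(int)
for k, v in pairs:
    combined[k] += v          # partial aggregation before the shuffle
shuffled_reduce = len(combined)

print(shuffled_group, shuffled_reduce)  # 4 2
```

Half the records cross the network here; on skewed real data the reduction is usually far larger.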

The shuffle process has two phases:

Shuffle Write (Map Side): Each task writes its output to local disk, bucketed by reducer partition. The legacy hash-based shuffle created a separate file per reducer—M × R files for M map tasks and R reduce tasks—which is the problem the modern sort-based writer was built to avoid.

Shuffle Read (Reduce Side): Tasks in the next stage fetch their required data blocks over the network, merge them, and perform the aggregation.
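
The two phases can be simulated in miniature (lists stand in for files, and `hash(key) % R` stands in for Spark's hash partitioner):

```python
# Illustrative two-phase shuffle: M map tasks hash-partition their output
# into R buckets (write), then each reducer gathers and merges its bucket (read).
from collections import defaultdict

R = 2
map_outputs = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]  # M = 2 map tasks

# Shuffle write: each map task buckets its records by target reducer.
written = []
for records in map_outputs:
    buckets = defaultdict(list)
    for k, v in records:
        buckets[hash(k) % R].append((k, v))   # reducer chosen by key hash
    written.append(buckets)

# Shuffle read: reducer r fetches bucket r from every map task and aggregates.
reduced = []
for r in range(R):
    merged = defaultdict(int)
    for buckets in written:
        for k, v in buckets[r]:
            merged[k] += v
    reduced.append(dict(merged))

print(reduced)
```

Which reducer each key lands on depends on the hash, but every occurrence of a key always lands on the same one—that invariant is what makes the reduce-side aggregation correct.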

// Configure shuffle behavior
val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")
  .set("spark.sql.shuffle.partitions", "200")  // DataFrame/SQL shuffles (default 200)

Spark 2.0+ uses the sort-based shuffle writer (with Tungsten's serialized sort path), which sorts records by partition ID and writes a single data file plus an index file per map task—2M files instead of M × R.
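
The sort-shuffle write can be sketched in miniature: sort records by partition ID, emit one data sequence, and keep an index of segment offsets so each reducer reads only its slice (lists stand in for files here):

```python
# Illustrative sort-based shuffle write: one data "file" plus an offset index.
records = [(1, "b2"), (0, "a1"), (1, "b4"), (0, "a3")]  # (partition_id, record)

records.sort(key=lambda r: r[0])            # cluster records by reducer partition
data_file = [rec for _, rec in records]     # single sorted data file

# Index: start offset of each partition's segment in the data file.
index, offset = [], 0
for pid in range(2):
    index.append(offset)
    offset += sum(1 for p, _ in records if p == pid)
index.append(offset)

# Reducer 1 seeks directly to its segment via the index.
print(data_file[index[1]:index[2]])  # ['b2', 'b4']
```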

Memory Management and Caching

Spark divides executor memory into several regions (the unified memory model, Spark 1.6+):

  • Unified memory (spark.memory.fraction, default 60% of heap minus reserved): shared by execution (shuffles, joins, sorts, aggregations) and storage (cached RDDs, broadcast variables)
  • Storage share (spark.memory.storageFraction, default 50% of unified): a soft boundary—execution can borrow unused storage memory, and cached blocks can be evicted to give it back
  • User memory (the remaining ~40%): user data structures and UDFs
  • Reserved memory (fixed 300 MB): Spark internal objects
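
Under the unified memory model (defaults spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, 300 MB reserved), the region sizes work out as follows—the 4 GB heap is an assumed example:

```python
# Back-of-envelope executor memory sizing (values in MB).
heap = 4 * 1024          # executor heap (assumed example: 4 GB)
reserved = 300           # fixed reserved memory
fraction = 0.6           # spark.memory.fraction (default)
storage_fraction = 0.5   # spark.memory.storageFraction (default)

unified = (heap - reserved) * fraction          # execution + storage pool
storage = unified * storage_fraction            # soft boundary inside the pool
user = (heap - reserved) * (1 - fraction)       # user data structures, UDFs

print(round(unified), round(storage), round(user))  # 2278 1139 1518
```
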
val data = sc.textFile("large_dataset.txt")
             .map(parseLine)
             .cache()  // Or persist()

// First action materializes and caches
val count1 = data.count()  // Reads from source, caches result

// Subsequent actions read from cache
val count2 = data.count()  // Reads from memory
val sample = data.take(10)  // Reads from memory

// Check cache status
println(data.getStorageLevel)

Storage levels control caching behavior:

from pyspark import StorageLevel

# Memory only (deserialized)
rdd.persist(StorageLevel.MEMORY_ONLY)

# Memory and disk (deserialized)
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Memory only (serialized) - more space efficient
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

# Replicated across 2 nodes
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

When memory is full, Spark uses LRU eviction for storage memory. Execution memory cannot be evicted—if insufficient, Spark spills to disk.
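
LRU eviction for the storage region can be sketched with an OrderedDict as the recency queue (`StoragePool` is a toy, not Spark's BlockManager, and capacity is counted in blocks rather than bytes for simplicity):

```python
# Toy LRU storage pool: touching a block refreshes it; inserts evict the
# least recently used block once capacity is exceeded.
from collections import OrderedDict

class StoragePool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            evicted, _ = self.blocks.popitem(last=False)  # evict LRU block
            print(f"evicted {evicted}")

    def get(self, block_id):
        self.blocks.move_to_end(block_id)  # access refreshes recency
        return self.blocks[block_id]

pool = StoragePool(capacity=2)
pool.put("rdd_1_0", "...")
pool.put("rdd_1_1", "...")
pool.get("rdd_1_0")         # rdd_1_0 is now most recently used
pool.put("rdd_2_0", "...")  # evicts rdd_1_1, the least recently used
```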

Catalyst Optimizer and Tungsten

For DataFrames and Datasets, the Catalyst optimizer transforms logical plans into optimized physical plans through multiple phases:

// Requires import spark.implicits._ for the $"col" syntax
val df = spark.read.parquet("users.parquet")
val result = df.filter($"age" > 21)
               .select($"name", $"city")
               .groupBy($"city")
               .count()

// View execution plan
result.explain(true)

Catalyst applies optimizations:

  • Predicate pushdown: Filters applied at data source level
  • Column pruning: Only required columns read from storage
  • Constant folding: Compile-time expression evaluation
  • Join reordering: Optimal join sequence based on statistics
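
Constant folding, the simplest of these rules, can be sketched on a tiny tuple-based expression tree (this toy AST is invented for illustration—Catalyst operates on its own Scala tree types):

```python
# Toy constant folding: nodes are ("lit", value), ("col", name),
# or (op, left, right) for op in {"+", "*"}.
def fold(expr):
    if expr[0] in ("lit", "col"):
        return expr
    op, l, r = expr[0], fold(expr[1]), fold(expr[2])
    if l[0] == "lit" and r[0] == "lit":   # both sides known before execution
        return ("lit", l[1] + r[1] if op == "+" else l[1] * r[1])
    return (op, l, r)

# 7 + 14 folds to 21 before any row is read; column references are untouched.
print(fold(("+", ("lit", 7), ("lit", 14))))     # ('lit', 21)
print(fold(("+", ("col", "age"), ("lit", 1))))  # ('+', ('col', 'age'), ('lit', 1))
```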

Tungsten provides runtime optimizations:

  • Code generation: Generates Java bytecode for entire queries
  • Off-heap memory management: Reduces GC overhead
  • Cache-aware computation: Optimizes CPU cache usage

// Code generation example
val df = spark.range(1000000)
val result = df.selectExpr("id * 2 as doubled")
               .filter("doubled > 1000")

// Tungsten generates optimized code similar to:
// while (input.hasNext()) {
//   long id = input.next();
//   long doubled = id * 2;
//   if (doubled > 1000) output.write(doubled);
// }

Task Execution and Speculation

Each task is a unit of work on a single partition. The TaskScheduler uses a TaskSetManager to track task completion and handle failures.

# Configure task execution
spark.conf.set("spark.task.cpus", "1")
spark.conf.set("spark.task.maxFailures", "4")
spark.conf.set("spark.speculation", "true")
spark.conf.set("spark.speculation.multiplier", "1.5")

Speculative execution launches duplicate tasks for stragglers. Once a configurable fraction of a stage's tasks has finished (spark.speculation.quantile, default 0.75), any still-running task that has taken more than 1.5× the median completed-task duration gets a backup copy on another executor; whichever attempt finishes first wins, and the other is killed.
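
The straggler check itself is simple (this sketch only applies the median-multiplier rule; real Spark also gates on the quantile of finished tasks, and the task names here are invented):

```python
# Illustrative straggler detection: tasks running longer than
# multiplier x median(completed durations) get a backup copy.
import statistics

def stragglers(completed, running, multiplier=1.5):
    threshold = multiplier * statistics.median(completed)
    return [task for task, elapsed in running.items() if elapsed > threshold]

completed = [10.0, 11.0, 12.0, 10.5]        # durations of finished tasks (s)
running = {"task-7": 25.0, "task-8": 12.0}  # elapsed time of running tasks
print(stragglers(completed, running))       # ['task-7']
```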

Understanding these internals helps you diagnose performance issues, optimize resource allocation, and write efficient Spark applications that leverage the framework’s distributed computing capabilities.
