Apache Spark Interview Questions (Top 50)
Spark interviews keep returning to the same core areas: architecture, the RDD/DataFrame/Dataset hierarchy, transformations and shuffles, memory management, Spark SQL, and streaming. The 50 questions below walk through each area with the concise answers and code examples interviewers expect.
Key Insights
- Understanding the difference between narrow and wide transformations is crucial—it determines shuffle behavior and directly impacts job performance in production systems.
- DataFrames should be your default choice over RDDs in modern Spark applications; the Catalyst optimizer provides optimizations that hand-written RDD code cannot match.
- Most Spark interview failures stem from inability to explain data skew solutions and memory management—these are the areas that separate junior from senior engineers.
Spark Fundamentals & Architecture (Questions 1-10)
Q1: What is Apache Spark and how does it differ from Hadoop MapReduce?
Spark is a distributed computing engine that processes data in-memory, making it 10-100x faster than MapReduce for iterative algorithms. MapReduce writes intermediate results to disk; Spark keeps them in memory across operations.
Q2: Explain the Spark architecture.
The driver program runs your main function and creates SparkContext. It coordinates with cluster managers (YARN, Mesos, Kubernetes, or Standalone) to allocate executors on worker nodes. Executors run tasks and store data for caching.
Q3: What is lazy evaluation?
Transformations don’t execute immediately—Spark builds a DAG (Directed Acyclic Graph) of operations. Execution only happens when an action is called. This allows Spark to optimize the entire computation chain.
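A plain-Scala analogy (not Spark itself) makes the idea concrete: a Scala Iterator is lazy the same way a transformation chain is.

```scala
// Lazy pipeline in plain Scala: map() runs nothing until a terminal op forces it.
var evaluated = 0
val doubled = Iterator(1, 2, 3, 4).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)  // "transformation" declared, no element processed yet
val total = doubled.sum // "action": forces the whole chain in one pass
assert(evaluated == 4 && total == 20)
```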
Q4: What’s the difference between SparkContext and SparkSession?
SparkSession is the unified entry point introduced in Spark 2.0. It encapsulates SparkContext, SQLContext, and HiveContext. Always use SparkSession in modern applications:
// Modern approach (Spark 2.0+)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

// Access the underlying SparkContext if needed
val sc = spark.sparkContext
Q5: What are the supported cluster managers?
Standalone (built-in), YARN (Hadoop), Mesos (deprecated), and Kubernetes. YARN dominates enterprise deployments; Kubernetes is gaining traction for cloud-native architectures.
Q6-10 cover DAG scheduler, task scheduler, stages, and job execution flow—understand that stages are divided by shuffle boundaries.
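To make the shuffle-boundary point concrete, here is a sketch of a classic word count (assuming an existing SparkContext `sc` and an input file `docs.txt`); Spark splits it into two stages at reduceByKey:

```scala
val counts = sc.textFile("docs.txt")
  .flatMap(_.split("\\s+"))  // Stage 1: narrow, pipelined
  .map(word => (word, 1))    // Stage 1: narrow, pipelined
  .reduceByKey(_ + _)        // shuffle boundary: starts Stage 2
counts.collect()             // action: submits the job (2 stages)
```

The flatMap and map run back to back over each partition in a single pass; only the shuffle forces a new stage.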
RDDs, DataFrames & Datasets (Questions 11-18)
Q11: What is an RDD?
Resilient Distributed Dataset—an immutable, partitioned collection of records. RDDs provide fault tolerance through lineage (the sequence of transformations that created them).
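You can inspect an RDD's lineage directly with `toDebugString` (sketch assumes an existing SparkContext `sc`):

```scala
val base    = sc.parallelize(1 to 1000)
val derived = base.map(_ * 2).filter(_ % 3 == 0)
// Prints the chain of transformations Spark would replay
// to rebuild any lost partitions:
println(derived.toDebugString)
```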
Q12: When should you use RDDs over DataFrames?
Use RDDs only when you need fine-grained control over physical data placement, are working with unstructured data, or are maintaining legacy code. For everything else, use DataFrames.
Q13: What are the key differences between DataFrame and Dataset?
// DataFrame: untyped, schema checked at runtime
val df: DataFrame = spark.read.json("people.json")
df.select("name").show() // AnalysisException at runtime if "name" doesn't exist

// Dataset: typed, compile-time safety
case class Person(name: String, age: Long) // Long: Spark infers JSON integers as LongType
import spark.implicits._                   // provides the Person encoder
val ds: Dataset[Person] = spark.read.json("people.json").as[Person]
ds.map(_.name) // Compile-time error if the field doesn't exist
DataFrames are actually Dataset[Row]. Datasets provide type safety but can be slower due to serialization overhead.
Q14: How do you define a schema explicitly?
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("salary", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("employees.csv")
Always define schemas for production jobs—schema inference is slow and unreliable.
Q15-18 cover conversions between abstractions, Catalyst optimizer benefits, and encoder requirements for Datasets.
Transformations & Actions (Questions 19-26)
Q19: What’s the difference between narrow and wide transformations?
Narrow transformations (map, filter, union) process data within a single partition—no shuffle required. Wide transformations (groupByKey, reduceByKey, join) require data movement across partitions—triggering expensive shuffles.
Q20: Why is reduceByKey preferred over groupByKey?
// BAD: groupByKey shuffles all values
val grouped = rdd.groupByKey().mapValues(_.sum)
// GOOD: reduceByKey combines locally before shuffle
val reduced = rdd.reduceByKey(_ + _)
reduceByKey performs map-side combining, dramatically reducing shuffle data. groupByKey sends all values across the network before aggregation.
Q21: Explain map vs flatMap.
val lines = sc.parallelize(Seq("hello world", "spark sql"))
// map: one-to-one transformation
lines.map(_.split(" "))     // RDD[Array[String]]: one array per line
// flatMap: one-to-many, flattens the result
lines.flatMap(_.split(" ")) // RDD[String]: "hello", "world", "spark", "sql"
Q22: What are common actions?
collect() returns all data to driver (dangerous for large datasets), count() returns row count, take(n) returns first n elements, reduce() aggregates using a function, saveAsTextFile() writes to storage.
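A quick sketch of those actions side by side (assumes an existing SparkContext `sc`):

```scala
val rdd = sc.parallelize(1 to 100)
rdd.count()                // 100
rdd.take(5)                // Array(1, 2, 3, 4, 5)
rdd.reduce(_ + _)          // 5050
rdd.takeSample(false, 3)   // a safer spot-check than collect()
rdd.saveAsTextFile("out/") // one output file per partition
```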
Q23-26 cover transformation lineage, coalesce vs repartition, and the implications of calling actions multiple times without caching.
Memory Management & Performance Tuning (Questions 27-34)
Q27: Explain Spark’s memory management.
Unified memory management (Spark 1.6+) divides executor memory into execution (shuffles, joins, aggregations) and storage (caching). The boundary is soft—execution can borrow from storage when needed.
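The two knobs behind this model can be set at session creation; the values below are just the defaults, shown for illustration (the app name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MemoryTuningDemo") // hypothetical app name
  // Fraction of (heap - ~300MB reserved) shared by execution and storage; default 0.6
  .config("spark.memory.fraction", "0.6")
  // Portion of that region protected from eviction for cached blocks; default 0.5
  .config("spark.memory.storageFraction", "0.5")
  .getOrCreate()
```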
Q28: What are the storage levels for persistence?
import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)     // Deserialized in heap (default for RDD persist)
df.persist(StorageLevel.MEMORY_ONLY_SER) // Serialized: more compact, more CPU
df.persist(StorageLevel.MEMORY_AND_DISK) // Spill to disk if needed (default for DataFrame persist)
df.persist(StorageLevel.DISK_ONLY)       // Only on disk
df.persist(StorageLevel.OFF_HEAP)        // Requires spark.memory.offHeap.enabled=true

// Don't forget to unpersist when done
df.unpersist()
Use MEMORY_AND_DISK for production—it prevents recomputation if memory is insufficient.
Q29: What are broadcast variables?
val lookupMap = Map("US" -> "United States", "UK" -> "United Kingdom")
val broadcastLookup = spark.sparkContext.broadcast(lookupMap)

// Read inside transformations: shipped once per executor, not once per task
import spark.implicits._ // encoder for the resulting Dataset[String]
df.map(row => broadcastLookup.value.getOrElse(row.getString(0), "Unknown"))
Broadcast variables send read-only data to all executors once, avoiding repeated serialization.
Q30: When should you use repartition vs coalesce?
// Increase partitions (full shuffle)
df.repartition(200)
// Decrease partitions (no shuffle, combines existing partitions)
df.coalesce(50)
// Repartition by column (for even distribution)
df.repartition(200, col("country"))
Use coalesce to reduce partitions; use repartition to increase or redistribute data evenly.
Q31-34 cover accumulators, garbage collection tuning, and partition optimization strategies.
Spark SQL & Query Optimization (Questions 35-41)
Q35: What is the Catalyst optimizer?
Catalyst is Spark SQL’s query optimizer. It transforms logical plans through rule-based and cost-based optimization, then generates physical execution plans. This is why DataFrames outperform hand-optimized RDD code.
Q36: How do you analyze query plans?
df.filter(col("age") > 25)
  .groupBy("department")
  .agg(avg("salary"))
  .explain(true) // Shows parsed, analyzed, optimized, and physical plans
Look for PushedFilters in the physical plan to verify predicate pushdown is working.
Q37: What are the join strategies in Spark?
// Broadcast Hash Join: small side below spark.sql.autoBroadcastJoinThreshold (10MB default)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760) // bytes
largeDf.join(broadcast(smallDf), "key") // broadcast() hints the SMALL side

// Sort Merge Join (default for two large tables)
// Shuffle Hash Join (when one side is much smaller but too large to broadcast)
// Cartesian Join (cross join; avoid if possible)
Q38-41 cover adaptive query execution, partition pruning, and handling skewed joins with AQE.
Spark Streaming & Structured Streaming (Questions 42-46)
Q42: What’s the difference between DStreams and Structured Streaming?
DStreams (legacy) process micro-batches of RDDs. Structured Streaming treats streaming data as an unbounded table, using the same DataFrame API. Always use Structured Streaming for new projects.
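A minimal sketch of the "unbounded table" idea, assuming a socket source feeding text on localhost:9999:

```scala
import spark.implicits._ // encoder for Dataset[String]

val words = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split(" "))

// The running count is just a query over an ever-growing input table
val query = words.groupBy("value").count()
  .writeStream
  .outputMode("complete") // re-emit the full result table each trigger
  .format("console")
  .start()

query.awaitTermination()
```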
Q43: Explain watermarking.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // required by the Kafka source
  .option("subscribe", "events")                    // required by the Kafka source
  .load()
  // ...parse the Kafka `value` bytes into timestamp and userId columns here...

val windowedCounts = events
  .withWatermark("timestamp", "10 minutes") // Late-data threshold
  .groupBy(
    window(col("timestamp"), "5 minutes"),
    col("userId")
  )
  .count()
Watermarks tell Spark how long to wait for late data before finalizing aggregation results.
Q44-46 cover exactly-once semantics, checkpointing, and output modes (append, complete, update).
Advanced Topics & Troubleshooting (Questions 47-50)
Q47: How do you handle data skew?
// Salting technique: add a random suffix to skewed keys
val saltedDf = skewedDf.withColumn(
  "salted_key",
  concat(col("key"), lit("_"), (rand() * 10).cast("int"))
)

// Replicate every row of the other side once per salt value (0-9)
val explodedDf = normalDf
  .withColumn("salt", explode(array((0 to 9).map(lit): _*)))
  .withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))

saltedDf.join(explodedDf, "salted_key")
Salting distributes skewed keys across multiple partitions, preventing single-executor bottlenecks.
Q48: How do you debug OOM errors?
Check driver vs executor OOM. Driver OOM usually means collect() on large data or too many broadcast variables. Executor OOM means insufficient memory for shuffles or caching. Increase spark.executor.memory, increase the partition count so each task processes less data, or add more executors.
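A hedged sketch of the settings usually adjusted first; the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.memory", "8g")         // heap per executor
  .config("spark.executor.memoryOverhead", "2g") // off-heap overhead (YARN/Kubernetes)
  .config("spark.sql.shuffle.partitions", "400") // more, smaller tasks per shuffle
  .getOrCreate()
```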
Q49: What is speculative execution?
Spark can launch duplicate tasks for slow-running tasks. Enable with spark.speculation=true. Useful for heterogeneous clusters but can waste resources.
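The related tuning knobs, shown at their defaults for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.speculation", "true")
  // Speculate only after 75% of tasks in the stage have finished (default 0.75)
  .config("spark.speculation.quantile", "0.75")
  // ...and only for tasks running 1.5x slower than the median (default 1.5)
  .config("spark.speculation.multiplier", "1.5")
  .getOrCreate()
```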
Q50: How do you implement a custom partitioner?
import org.apache.spark.Partitioner

class DomainPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val domain = key.asInstanceOf[String].split("@")(1) // assumes email-address keys
    math.abs(domain.hashCode % numPartitions)
  }
  // Lets Spark recognize co-partitioned RDDs and skip redundant shuffles
  override def equals(other: Any): Boolean = other match {
    case p: DomainPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
}

rdd.partitionBy(new DomainPartitioner(100))
Custom partitioners ensure related data lands on the same partition, optimizing subsequent operations.
Master these 50 questions and you’ll handle any Spark interview. Focus especially on memory management and data skew—these separate candidates who’ve run production workloads from those who’ve only completed tutorials.