PySpark Interview Questions and Answers (Top 50)
Key Insights
- PySpark interviews test both theoretical understanding and practical problem-solving—expect questions ranging from basic DataFrame operations to complex optimization scenarios involving shuffle and memory management.
- The distinction between transformations and actions, narrow vs wide operations, and when to use RDDs vs DataFrames forms the foundation of most interview discussions.
- Real-world debugging skills matter more than memorized definitions—interviewers want to see you reason through data skew, OOM errors, and execution plan analysis.
Introduction & PySpark Fundamentals (Questions 1-8)
Q1: What is PySpark and how does it relate to Apache Spark?
PySpark is the Python API for Apache Spark. It allows you to write Spark applications using Python while leveraging Spark's distributed computing engine written in Scala. Under the hood, PySpark uses Py4J so the Python process can invoke objects and methods on the JVM.
Q2: Explain the difference between SparkContext and SparkSession.
SparkContext was the original entry point for Spark functionality, handling cluster connections and RDD creation. SparkSession, introduced in Spark 2.0, unifies SparkContext, SQLContext, and HiveContext into a single entry point. Always use SparkSession in modern applications.
from pyspark.sql import SparkSession
# Modern approach - SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()
# Access SparkContext if needed
sc = spark.sparkContext
Q3: What are RDDs and when should you use them over DataFrames?
RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark—immutable, distributed collections of objects. Use DataFrames for structured data with schema; use RDDs when you need fine-grained control over physical data placement or when working with unstructured data that doesn’t fit a tabular model.
Q4: Explain lazy evaluation in Spark.
Spark doesn’t execute transformations immediately. Instead, it builds a DAG (Directed Acyclic Graph) of operations and only executes when an action is called. This allows Spark to optimize the entire computation plan before execution.
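A plain-Python generator (an analogy for illustration, not the Spark API) shows the same defer-until-needed behavior: building the pipeline does nothing until something consumes it.

```python
log = []

def double(x):
    log.append(x)        # records when the "transformation" actually runs
    return x * 2

# Building the pipeline runs nothing, like Spark assembling its DAG
pipeline = (double(x) for x in [1, 2, 3])
assert log == []                           # still lazy

# Consuming it is the "action" that forces evaluation
result = list(pipeline)                    # now double() finally runs
```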
Q5: What’s the difference between transformations and actions?
Transformations create new RDDs/DataFrames from existing ones (map, filter, join) and are lazy. Actions trigger computation and return results (collect, count, write). Understanding this distinction is critical for performance optimization.
# Transformations (lazy - nothing happens yet)
rdd = sc.parallelize([1, 2, 3, 4, 5])
filtered = rdd.filter(lambda x: x > 2)
mapped = filtered.map(lambda x: x * 2)
# Action (triggers execution)
result = mapped.collect() # [6, 8, 10]
Q6-Q8: Basic RDD Creation and Operations
# Q6: Creating RDDs
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5], numSlices=4)
rdd_from_file = sc.textFile("hdfs://path/to/file.txt")
# Q7: Basic transformations
doubled = rdd_from_list.map(lambda x: x * 2)
evens = rdd_from_list.filter(lambda x: x % 2 == 0)
# Q8: Common actions
count = rdd_from_list.count()
first = rdd_from_list.first()
taken = rdd_from_list.take(3)
DataFrame Operations & Transformations (Questions 9-18)
Q9: How do you create a DataFrame from various sources?
# From Python objects
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])
# From RDD with schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.createDataFrame(rdd, schema)
Q10: Explain different types of joins in PySpark.
# Inner join (default)
result = df1.join(df2, df1.id == df2.id, "inner")
# Left outer join
result = df1.join(df2, "id", "left")
# Broadcast join for small tables
from pyspark.sql.functions import broadcast
result = df_large.join(broadcast(df_small), "id")
Q11-Q13: Filtering, Selecting, and Aggregations
from pyspark.sql.functions import col, sum, avg, count
# Q11: Filtering
df.filter(col("age") > 30)
df.where("salary > 50000 AND department = 'Engineering'")
# Q12: Selecting and aliasing
df.select("name", (col("salary") * 1.1).alias("new_salary"))
# Q13: GroupBy aggregations
df.groupBy("department") \
    .agg(
        sum("salary").alias("total_salary"),
        avg("age").alias("avg_age"),
        count("*").alias("employee_count")
    )
Q14-Q16: Window Functions
Window functions are essential for ranking, running totals, and comparing rows within partitions.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, lag, lead, sum
# Q14: Ranking within groups
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("rank", rank().over(window_spec)) \
    .withColumn("row_num", row_number().over(window_spec))
# Q15: Running totals
running_window = Window.partitionBy("department") \
    .orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", sum("amount").over(running_window))
# Q16: Accessing previous/next rows
df.withColumn("prev_salary", lag("salary", 1).over(window_spec))
Q17-Q18: Handling Null Values
# Q17: Dropping nulls
df.dropna() # Drop rows with any null
df.dropna(subset=["name", "age"]) # Drop if specific columns are null
# Q18: Filling nulls
df.fillna(0) # Fill all numeric nulls with 0
df.fillna({"age": 0, "name": "Unknown"}) # Column-specific fills
RDD Operations & Internals (Questions 19-26)
Q19: Explain map vs flatMap.
rdd = sc.parallelize(["hello world", "foo bar"])
# map: one-to-one transformation
rdd.map(lambda x: x.split(" ")).collect()
# [['hello', 'world'], ['foo', 'bar']]
# flatMap: one-to-many, flattens results
rdd.flatMap(lambda x: x.split(" ")).collect()
# ['hello', 'world', 'foo', 'bar']
Q20: Why is reduceByKey preferred over groupByKey?
reduceByKey performs local aggregation before shuffling, reducing network traffic. groupByKey shuffles all data before aggregation, causing potential memory issues.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# Preferred - combines locally first
pairs.reduceByKey(lambda x, y: x + y).collect()
# Avoid for large datasets - shuffles everything
pairs.groupByKey().mapValues(sum).collect()
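The network savings can be sketched in plain Python (a toy model for illustration, not the PySpark API): count how many records would cross the shuffle boundary under each strategy.

```python
# Two "partitions" of (key, value) pairs
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

def local_combine(part):
    # reduceByKey-style map-side combine: one record per key per partition
    out = {}
    for k, v in part:
        out[k] = out.get(k, 0) + v
    return list(out.items())

# groupByKey-style: every record crosses the network
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey-style: only the locally combined records cross
combined = [local_combine(p) for p in partitions]
shuffled_reduce = sum(len(p) for p in combined)

# The reduce-side merge produces identical totals either way
final = {}
for p in combined:
    for k, v in p:
        final[k] = final.get(k, 0) + v
```

Here 5 records shrink to 4 before the shuffle; with millions of rows and few keys, the reduction is dramatic.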
Q21-Q22: Narrow vs Wide Transformations
Narrow transformations (map, filter) don’t require shuffling—each input partition contributes to one output partition. Wide transformations (groupBy, join, repartition) require shuffling data across partitions.
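The dependency difference can be modeled in plain Python (a toy model, not PySpark, with partitions represented as lists): a narrow transformation touches each partition independently, while a wide one must redistribute rows across partitions.

```python
partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow (map-like): each output partition depends on exactly one input
# partition, so no data movement is needed
narrow = [[x * 2 for x in p] for p in partitions]

# Wide (groupBy-like): each output partition needs rows from every input
# partition -- this redistribution is exactly what a shuffle does
num_out = 2
wide = [[] for _ in range(num_out)]
for p in partitions:
    for x in p:
        wide[x % num_out].append(x)  # hash-style redistribution
```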
Q23-Q24: Partitioning Strategies
# Q23: repartition vs coalesce
df.repartition(100) # Full shuffle, can increase or decrease
df.coalesce(10) # No shuffle, can only decrease
# Q24: Custom partitioning for RDDs (works on key-value pair RDDs)
rdd.partitionBy(100, lambda k: hash(k) % 100)
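The idea behind the partition function can be sketched in plain Python (the toy hash below is an assumption for illustration; Spark's default partitioner uses portable_hash):

```python
def part_for(key, num_partitions):
    # Toy deterministic hash (illustration only, not Spark's portable_hash)
    h = sum(ord(c) for c in key)
    return h % num_partitions

# The same key always maps to the same partition, which is what lets
# reduceByKey aggregate co-located values without a second shuffle
keys = ["user_1", "user_2", "user_3"]
assignments = {k: part_for(k, 4) for k in keys}
```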
Q25-Q26: Caching and Persistence
from pyspark import StorageLevel
# Q25: Basic caching
df.cache()  # MEMORY_AND_DISK for DataFrames (RDD.cache() uses MEMORY_ONLY)
# Q26: Custom storage levels
df.persist(StorageLevel.MEMORY_ONLY)
df.persist(StorageLevel.DISK_ONLY)
df.persist(StorageLevel.MEMORY_AND_DISK_2)  # replicated across two nodes
# Always unpersist when done
df.unpersist()
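Why caching pays off can be shown with a plain-Python analogy (an illustration, not the Spark API): without caching, every action re-runs the whole lineage from the source.

```python
compute_count = 0

def expensive_lineage():
    # Stand-in for re-reading and re-transforming the source data
    global compute_count
    compute_count += 1
    return [x * x for x in range(5)]

# Uncached: two "actions" trigger two full recomputations
total = sum(expensive_lineage())
n = len(expensive_lineage())
assert compute_count == 2

# "Cached": materialize once, then both actions reuse the result
cached = expensive_lineage()
total2 = sum(cached)
n2 = len(cached)
assert compute_count == 3   # one extra computation served both actions
```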
Performance Optimization (Questions 27-35)
Q27-Q28: Broadcast Variables and Accumulators
# Q27: Broadcast for lookup tables
lookup_dict = {"A": 1, "B": 2, "C": 3}
broadcast_lookup = sc.broadcast(lookup_dict)
rdd.map(lambda x: broadcast_lookup.value.get(x, 0))
# Q28: Accumulators for counters
error_count = sc.accumulator(0)
def process_row(row):
    global error_count
    if row is None:
        error_count += 1
    return row
Q29-Q31: Shuffle Optimization
# Q29: Tune shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", 200)
# Q30: Avoid shuffles with broadcast joins
from pyspark.sql.functions import broadcast
small_df = spark.read.parquet("small_table")
large_df = spark.read.parquet("large_table")
# Forces a broadcast hint regardless of size
result = large_df.join(broadcast(small_df), "key")
# Q31: Set broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024) # 100MB
Q32-Q35: Memory Configuration
# Q32-Q33: Key memory settings
spark = SparkSession.builder \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.memory.fraction", 0.8) \
    .config("spark.memory.storageFraction", 0.3) \
    .getOrCreate()
# Q34-Q35: Adaptive Query Execution (Spark 3.0+)
spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)
PySpark SQL & Data Sources (Questions 36-42)
Q36-Q38: Reading and Writing Data
# Q36: Reading various formats
df_parquet = spark.read.parquet("path/to/data.parquet")
df_json = spark.read.json("path/to/data.json")
df_csv = spark.read.option("header", True).csv("path/to/data.csv")
# Q37: Writing with partitioning
df.write \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .parquet("output/path")
# Q38: Schema handling
df = spark.read \
    .schema(explicit_schema) \
    .option("mode", "FAILFAST") \
    .json("path/to/data.json")
Q39-Q42: SQL Operations
# Q39: Temporary views
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")
# Q40: Global temporary views
df.createOrReplaceGlobalTempView("global_employees")
spark.sql("SELECT * FROM global_temp.global_employees")
# Q41-Q42: Catalog operations
spark.catalog.listTables()
spark.catalog.listColumns("employees")
Streaming & Advanced Topics (Questions 43-47)
Q43-Q45: Structured Streaming Basics
# Q43: Basic streaming setup
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic") \
    .load()
# Q44: Windowed aggregations
from pyspark.sql.functions import window
stream_df \
    .groupBy(window("timestamp", "10 minutes", "5 minutes")) \
    .count()
# Q45: Watermarking for late data
stream_df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()
Q46-Q47: Checkpointing and Output
query = stream_df.writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/checkpoint/path") \
    .option("path", "/output/path") \
    .outputMode("append") \
    .start()
Debugging & Real-World Scenarios (Questions 48-50)
Q48: How do you read and interpret execution plans?
# Physical plan
df.explain()
# Detailed plan with all stages
df.explain(mode="extended")
# Formatted plan (Spark 3.0+)
df.explain(mode="formatted")
Look for Exchange (shuffles), BroadcastHashJoin vs SortMergeJoin, and filter pushdown into scans.
Q49: How do you handle data skew?
# Salting technique for skewed keys
from pyspark.sql.functions import col, lit, concat, rand
salt_count = 10
# Add salt to skewed table
df_salted = df_large.withColumn(
    "salted_key",
    concat(col("key"), lit("_"), (rand() * salt_count).cast("int"))
)
# Explode small table to match salts
df_small_exploded = df_small.crossJoin(
    spark.range(salt_count).withColumnRenamed("id", "salt")
).withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))
# Join on salted keys
result = df_salted.join(df_small_exploded, "salted_key")
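The effect of salting can be illustrated without Spark. A plain-Python sketch (an analogy, not PySpark; the row counts are illustrative assumptions) of how one hot key spreads across salt buckets:

```python
import random

random.seed(0)                       # deterministic for the example
salt_count = 10
hot_rows = ["hot_key"] * 1000        # one key owns every row (extreme skew)

# Append a random salt, mirroring concat(key, "_", rand() * salt_count)
salted = [f"{k}_{random.randrange(salt_count)}" for k in hot_rows]

buckets = {}
for k in salted:
    buckets[k] = buckets.get(k, 0) + 1

# One 1000-row bucket becomes ~10 buckets of ~100 rows each,
# so no single task has to process the entire hot key
```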
Q50: Common OOM errors and solutions
Driver OOM usually means you’re collecting too much data. Executor OOM indicates partition-level issues. Solutions include increasing memory, reducing partition sizes, using coalesce after filters, and avoiding collect() on large datasets. Always check the Spark UI’s Storage and Executors tabs for memory pressure indicators.
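The collect() point can be demonstrated with a plain-Python analogy (an illustration, not the Spark API): fold results incrementally, in the spirit of iterating toLocalIterator(), instead of materializing everything on the driver at once.

```python
def rows():
    # Stand-in for a large dataset streamed to the driver one row at a time
    for i in range(1_000_000):
        yield i

# Risky pattern: data = list(rows()); sum(data) holds a million objects at once,
# which is how collect() blows up the driver.
# Safer pattern: stream and fold, keeping O(1) driver memory.
total = 0
for r in rows():
    total += r
```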