Machine Learning with PySpark Interview Questions

Key Insights

  • PySpark ML interviews focus heavily on the distinction between Transformers and Estimators, and your ability to construct reproducible Pipeline workflows that scale across clusters.
  • Performance optimization questions separate senior candidates from juniors—know your partitioning strategies, caching patterns, and how to diagnose bottlenecks in the Spark UI.
  • Real-world scenarios dominate advanced interviews: expect questions on feature stores, data skew handling, and integrating PySpark ML with MLOps tools like MLflow.

PySpark ML Fundamentals

PySpark’s machine learning ecosystem has evolved significantly. The critical distinction interviewers test is between the legacy RDD-based mllib package and the modern DataFrame-based ml package. Always default to pyspark.ml—it’s actively maintained, integrates with Spark SQL optimizations, and provides the Pipeline API that production systems require.

Here’s the standard setup interviewers expect you to know:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MLInterview") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

# Verify ML packages are available
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
print(f"Spark version: {spark.version}")

When asked why DataFrame-based ML is preferred, cite three reasons: Catalyst optimizer integration, schema enforcement for data validation, and native support for structured streaming inference.

Data Preparation & Feature Engineering Questions

Interviewers love feature engineering questions because they reveal whether you understand distributed data processing constraints. The most common question: “How do you handle missing values and categorical encoding at scale?”

from pyspark.ml.feature import (
    VectorAssembler, StringIndexer, OneHotEncoder,
    StandardScaler, Imputer
)
from pyspark.ml import Pipeline

# Sample data with missing values and categorical features
df = spark.createDataFrame([
    (1, 25.0, "engineer", 50000.0, 1),
    (2, None, "manager", 75000.0, 0),
    (3, 35.0, "engineer", None, 1),
    (4, 45.0, "director", 120000.0, 0),
], ["id", "age", "role", "salary", "label"])

# Impute missing numeric values
imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"],
    strategy="median"
)

# Encode categorical variables
role_indexer = StringIndexer(
    inputCol="role",
    outputCol="role_indexed",
    handleInvalid="keep"  # Critical for production—handles unseen categories
)

role_encoder = OneHotEncoder(
    inputCol="role_indexed",
    outputCol="role_encoded"
)

# Assemble features into single vector
assembler = VectorAssembler(
    inputCols=["age_imputed", "salary_imputed", "role_encoded"],
    outputCol="features_raw"
)

# Scale features (note: withMean=True densifies sparse vectors—watch memory with wide one-hot features)
scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features",
    withStd=True,
    withMean=True
)

A follow-up question often asks about handleInvalid="keep" in StringIndexer. This parameter assigns unseen categories to an additional index, preventing pipeline failures in production when new category values appear.

ML Pipelines & Transformers

The Transformer vs Estimator distinction is fundamental. Transformers apply deterministic transformations via transform(). Estimators learn from data via fit() and produce Transformers. Pipelines chain these together.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# Build complete pipeline
lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    maxIter=100
)

pipeline = Pipeline(stages=[
    imputer,
    role_indexer,
    role_encoder,
    assembler,
    scaler,
    lr
])

# Split data
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Fit pipeline (Estimator -> PipelineModel Transformer)
pipeline_model = pipeline.fit(train_df)

# Transform produces predictions
predictions = pipeline_model.transform(test_df)
predictions.select("id", "features", "prediction", "probability").show()

Interviewers often ask: “What happens when you call fit() on a Pipeline?” Walk through the stages: each Estimator is fitted sequentially, its output Transformer is applied to the data, and the result passes to the next stage. The final PipelineModel contains all fitted Transformers.

Algorithm Implementation Questions

Hyperparameter tuning questions test your understanding of distributed cross-validation. CrossValidator and ParamGridBuilder are the standard tools:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    seed=42
)

# Define parameter grid
param_grid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 200]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .addGrid(rf.minInstancesPerNode, [1, 5, 10]) \
    .build()

# Configure cross-validator
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    metricName="areaUnderROC"
)

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4,  # Parallel model training
    seed=42
)

# 27 parameter combinations (3 * 3 * 3) x 3 folds = 81 model fits
cv_model = cv.fit(train_features)  # train_features: training data with an assembled "features" column
best_model = cv_model.bestModel
print(f"Best numTrees: {best_model.getNumTrees}")

For clustering, interviewers test your understanding of when to use different algorithms:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(
    featuresCol="features",
    predictionCol="cluster",
    k=5,
    seed=42,
    initMode="k-means||"  # Distributed initialization
)

kmeans_model = kmeans.fit(feature_df)  # feature_df: DataFrame with an assembled "features" column

# Evaluate using silhouette score
evaluator = ClusteringEvaluator(
    predictionCol="cluster",
    featuresCol="features",
    metricName="silhouette"
)

silhouette = evaluator.evaluate(kmeans_model.transform(feature_df))
print(f"Silhouette score: {silhouette}")

Performance Optimization & Debugging

This section separates candidates who’ve run PySpark in production from those who’ve only done tutorials. Key questions focus on partitioning and caching:

# BAD: Default partitioning may cause skew
predictions = model.transform(large_df)

# GOOD: Repartition before ML operations
optimized_df = large_df.repartition(200, "user_id")
optimized_df.cache()
optimized_df.count()  # Materialize cache

predictions = model.transform(optimized_df)

# For small lookup tables, use broadcast
from pyspark.sql.functions import broadcast

feature_lookup = spark.read.parquet("s3://bucket/features")
enriched = large_df.join(
    broadcast(feature_lookup),
    on="user_id",
    how="left"
)

When asked about debugging slow ML jobs, mention these Spark UI indicators: shuffle read/write sizes indicating data skew, task duration variance across executors, and spill metrics showing memory pressure.

Model Evaluation & Deployment

Production deployment questions test your understanding of model serialization and batch inference:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Comprehensive evaluation
evaluators = {
    "accuracy": MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    ),
    "f1": MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="f1"
    ),
    "precision": MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="weightedPrecision"
    ),
}

for metric_name, evaluator in evaluators.items():
    score = evaluator.evaluate(predictions)
    print(f"{metric_name}: {score:.4f}")

# Save model for production
model_path = "s3://models/rf_classifier_v1"
pipeline_model.write().overwrite().save(model_path)

# Load in production batch job
from pyspark.ml import PipelineModel
production_model = PipelineModel.load(model_path)

# Batch inference
daily_data = spark.read.parquet("s3://data/daily/")
daily_predictions = production_model.transform(daily_data)
daily_predictions.write.parquet("s3://predictions/daily/")

Advanced Scenarios & System Design

Senior interviews include system design questions. A common scenario: “Design a real-time recommendation system using PySpark ML.”

from pyspark.sql.functions import col, struct, to_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Load pre-trained model
model = PipelineModel.load("s3://models/recommendation_v2")

# Structured Streaming inference
streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user_events") \
    .load()

# Parse incoming events
parsed = streaming_df.select(
    col("key").cast("string").alias("user_id"),
    col("value").cast("string").alias("event_data")
)

# Apply the model (its pipeline stages must expect the parsed columns)
predictions = model.transform(parsed)

# Write predictions back to Kafka
query = predictions \
    .select(
        col("user_id").alias("key"),
        to_json(struct("prediction", "probability")).alias("value")
    ) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "recommendations") \
    .option("checkpointLocation", "s3://checkpoints/reco/") \
    .start()

For MLOps integration questions, demonstrate MLflow tracking:

import mlflow
import mlflow.spark

mlflow.set_experiment("pyspark_classification")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("num_trees", 100)
    mlflow.log_param("max_depth", 10)
    
    # Train model
    model = pipeline.fit(train_df)
    
    # Log metrics
    predictions = model.transform(test_df)
    auc = evaluator.evaluate(predictions)  # the BinaryClassificationEvaluator defined earlier
    mlflow.log_metric("auc", auc)
    
    # Log model artifact
    mlflow.spark.log_model(model, "model")

When discussing data skew, explain salting techniques: append random suffixes to skewed keys, perform the join, then aggregate results. For feature stores, mention that PySpark integrates with Feast and Tecton for consistent feature serving between training and inference.

These questions cover the breadth of what production ML engineering with PySpark requires. Master these patterns, and you’ll handle most technical interviews in this space.
