Spark MLlib - StringIndexer and OneHotEncoder
Key Insights
- StringIndexer converts categorical string labels into numerical indices based on label frequency, essential for algorithms that require numeric input, while OneHotEncoder transforms these indices into binary vectors to prevent ordinal relationships
- Proper handling of unseen labels during prediction requires setting the handleInvalid parameter to "keep" or "skip" to avoid pipeline failures in production environments
- Combining both transformers in a Pipeline with VectorAssembler creates a robust feature engineering workflow that handles categorical variables correctly for Spark ML algorithms
Understanding StringIndexer
StringIndexer maps categorical string values to numerical indices. The most frequent label receives index 0.0, the second most frequent gets 1.0, and so on. This transformation is critical because most machine learning algorithms in Spark MLlib operate on numerical features.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("StringIndexerExample")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val data = Seq(
("red", 1),
("blue", 2),
("red", 3),
("green", 4),
("blue", 5),
("red", 6)
).toDF("color", "value")
val indexer = new StringIndexer()
.setInputCol("color")
.setOutputCol("color_index")
.fit(data)
val indexed = indexer.transform(data)
indexed.show()
// Output:
// +-----+-----+-----------+
// |color|value|color_index|
// +-----+-----+-----------+
// | red| 1| 0.0|
// | blue| 2| 1.0|
// | red| 3| 0.0|
// |green| 4| 2.0|
// | blue| 5| 1.0|
// | red| 6| 0.0|
// +-----+-----+-----------+
The fit() method computes the label frequencies and creates the mapping. Red appears 3 times (index 0), blue appears 2 times (index 1), and green appears once (index 2).
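The frequency-based ordering can be sketched in plain Python, with no Spark required; `build_index` is a hypothetical helper that mimics StringIndexer's default frequencyDesc ordering, breaking frequency ties alphabetically:

```python
from collections import Counter

def build_index(labels):
    """Mimic StringIndexer's default ordering: the most frequent label
    gets index 0.0; ties are broken alphabetically (hypothetical helper)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
    return {lbl: float(i) for i, lbl in enumerate(ordered)}

colors = ["red", "blue", "red", "green", "blue", "red"]
mapping = build_index(colors)
print(mapping)  # {'red': 0.0, 'blue': 1.0, 'green': 2.0}
```

This matches the table above: red (3 occurrences) maps to 0.0, blue (2) to 1.0, green (1) to 2.0. Note that Spark also exposes a stringOrderType parameter to change this ordering if needed.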
Handling Unseen Labels
Production systems encounter labels not present in training data. The handleInvalid parameter controls this behavior:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("HandleInvalid").getOrCreate()
train_data = spark.createDataFrame([
("cat", 1),
("dog", 2),
("cat", 3),
], ["animal", "value"])
test_data = spark.createDataFrame([
("cat", 4),
("bird", 5), # Unseen label
("dog", 6),
], ["animal", "value"])
# Option 1: error (default) - throws exception
indexer_error = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="error"
)
# Option 2: skip - removes rows with unseen labels
indexer_skip = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="skip"
).fit(train_data)
result_skip = indexer_skip.transform(test_data)
result_skip.show()
# bird row is removed
# Option 3: keep - assigns unseen labels to numLabels index
indexer_keep = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="keep"
).fit(train_data)
result_keep = indexer_keep.transform(test_data)
result_keep.show()
# bird gets index 2.0 (numLabels)
Use "keep" for production pipelines where you need to process all records and handle unknowns explicitly downstream.
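The three handleInvalid modes can be illustrated with a plain-Python sketch; `apply_index` is a hypothetical stand-in for the fitted model's transform, not Spark's actual implementation:

```python
def apply_index(labels, mapping, handle_invalid="error"):
    """Hypothetical stand-in for StringIndexerModel.transform: maps labels
    to indices, treating unseen labels per the handleInvalid setting."""
    out = []
    for lbl in labels:
        if lbl in mapping:
            out.append(mapping[lbl])
        elif handle_invalid == "skip":
            continue                         # drop the row entirely
        elif handle_invalid == "keep":
            out.append(float(len(mapping)))  # extra bucket at numLabels
        else:
            raise ValueError(f"Unseen label: {lbl}")
    return out

mapping = {"cat": 0.0, "dog": 1.0}           # learned from training data
print(apply_index(["cat", "bird", "dog"], mapping, "skip"))  # [0.0, 1.0]
print(apply_index(["cat", "bird", "dog"], mapping, "keep"))  # [0.0, 2.0, 1.0]
```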
OneHotEncoder Transformation
StringIndexer creates ordinal relationships (0 < 1 < 2) that don’t exist in categorical data. OneHotEncoder solves this by creating binary vectors where only one element is 1 (hot) and others are 0.
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val data = Seq(
("small", 10),
("medium", 20),
("large", 30),
("small", 15),
("large", 35)
).toDF("size", "price")
val indexer = new StringIndexer()
.setInputCol("size")
.setOutputCol("size_index")
.fit(data)
val indexed = indexer.transform(data)
val encoder = new OneHotEncoder()
.setInputCol("size_index")
.setOutputCol("size_vec")
// In Spark 3.x, OneHotEncoder is an Estimator and must be fit before transforming
val encoded = encoder.fit(indexed).transform(indexed)
encoded.select("size", "size_index", "size_vec").show(false)
// Output:
// +------+----------+-------------+
// |size |size_index|size_vec |
// +------+----------+-------------+
// |small |1.0 |(2,[1],[1.0])|
// |medium|2.0 |(2,[],[]) |
// |large |0.0 |(2,[0],[1.0])|
// |small |1.0 |(2,[1],[1.0])|
// |large |0.0 |(2,[0],[1.0])|
// +------+----------+-------------+
The output uses sparse vector notation: (size, [indices], [values]). For 3 categories, we get 2 dimensions (n-1) to avoid multicollinearity. The last category becomes the reference with all zeros.
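The sparse encoding rule can be sketched in plain Python; `one_hot_sparse` is a hypothetical helper that returns the (size, [indices], [values]) triple shown in the table above:

```python
def one_hot_sparse(index, num_labels, drop_last=True):
    """Mimic OneHotEncoder's sparse output: (size, [indices], [values]).
    With drop_last=True, the highest index becomes the all-zeros reference
    (hypothetical helper for illustration)."""
    size = num_labels - 1 if drop_last else num_labels
    idx = int(index)
    if idx < size:
        return (size, [idx], [1.0])
    return (size, [], [])  # dropped reference category

print(one_hot_sparse(1.0, 3))  # (2, [1], [1.0]) -> small
print(one_hot_sparse(2.0, 3))  # (2, [], [])     -> medium, the reference
print(one_hot_sparse(0.0, 3))  # (2, [0], [1.0]) -> large
```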
Setting dropLast Parameter
Control the output dimensionality with dropLast:
from pyspark.ml.feature import OneHotEncoder
# dropLast=True (default) - creates n-1 columns
encoder_drop = OneHotEncoder(
inputCol="size_index",
outputCol="size_vec",
dropLast=True
)
# dropLast=False - creates n columns
encoder_keep = OneHotEncoder(
inputCol="size_index",
outputCol="size_vec",
dropLast=False
)
# In Spark 3.x, OneHotEncoder is an Estimator and must be fit before transforming
encoded_keep = encoder_keep.fit(indexed).transform(indexed)
encoded_keep.select("size", "size_index", "size_vec").show(truncate=False)
# With dropLast=False:
# +------+----------+-------------+
# |size  |size_index|size_vec     |
# +------+----------+-------------+
# |small |1.0       |(3,[1],[1.0])|
# |medium|2.0       |(3,[2],[1.0])|
# |large |0.0       |(3,[0],[1.0])|
# +------+----------+-------------+
Keep dropLast=True for linear models to prevent the dummy variable trap. Set dropLast=False for tree-based algorithms that handle collinearity naturally.
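The dummy variable trap can be seen directly: with dropLast=False the encoded columns always sum to 1, making them perfectly collinear with an intercept column. A small plain-Python sketch, using a hypothetical dense helper:

```python
def one_hot_dense(index, num_labels, drop_last=True):
    """Dense one-hot row (hypothetical helper for illustration)."""
    size = num_labels - 1 if drop_last else num_labels
    row = [0.0] * size
    if int(index) < size:
        row[int(index)] = 1.0
    return row

# dropLast=False: every row sums to 1.0, collinear with an intercept
rows_full = [one_hot_dense(i, 3, drop_last=False) for i in (0, 1, 2)]
print([sum(r) for r in rows_full])  # [1.0, 1.0, 1.0]

# dropLast=True: the reference category's all-zeros row breaks the collinearity
rows_drop = [one_hot_dense(i, 3, drop_last=True) for i in (0, 1, 2)]
print([sum(r) for r in rows_drop])  # [1.0, 1.0, 0.0]
```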
Building Complete Feature Pipeline
Combine multiple categorical features with numerical features using Pipeline and VectorAssembler:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
val data = Seq(
("red", "small", 100, 1.5),
("blue", "large", 200, 2.1),
("red", "medium", 150, 1.8),
("green", "small", 120, 1.6)
).toDF("color", "size", "quantity", "weight")
// String indexers
val colorIndexer = new StringIndexer()
.setInputCol("color")
.setOutputCol("color_index")
.setHandleInvalid("keep")
val sizeIndexer = new StringIndexer()
.setInputCol("size")
.setOutputCol("size_index")
.setHandleInvalid("keep")
// One-hot encoders
val colorEncoder = new OneHotEncoder()
.setInputCol("color_index")
.setOutputCol("color_vec")
val sizeEncoder = new OneHotEncoder()
.setInputCol("size_index")
.setOutputCol("size_vec")
// Combine all features
val assembler = new VectorAssembler()
.setInputCols(Array("color_vec", "size_vec", "quantity", "weight"))
.setOutputCol("features")
// Create pipeline
val pipeline = new Pipeline()
.setStages(Array(colorIndexer, sizeIndexer, colorEncoder, sizeEncoder, assembler))
val model = pipeline.fit(data)
val result = model.transform(data)
result.select("color", "size", "features").show(false)
This pipeline handles the complete transformation from raw categorical strings to a feature vector ready for ML algorithms.
Multiple Column Transformation
Process multiple columns efficiently using array-based APIs:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
data = spark.createDataFrame([
("red", "small", "A", 100),
("blue", "large", "B", 200),
("red", "medium", "A", 150),
], ["color", "size", "category", "value"])
# Index multiple columns at once
indexer = StringIndexer(
inputCols=["color", "size", "category"],
outputCols=["color_idx", "size_idx", "category_idx"],
handleInvalid="keep"
)
indexed = indexer.fit(data).transform(data)
# Encode multiple columns at once
encoder = OneHotEncoder(
inputCols=["color_idx", "size_idx", "category_idx"],
outputCols=["color_vec", "size_vec", "category_vec"]
)
encoded = encoder.fit(indexed).transform(indexed)
encoded.select("color_vec", "size_vec", "category_vec").show(truncate=False)
This approach reduces boilerplate code and improves pipeline readability when handling many categorical features.
Inverse Transformation
Convert indices back to original labels using IndexToString:
import org.apache.spark.ml.feature.IndexToString
val converter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predicted_label")
.setLabels(indexer.labelsArray(0)) // labelsArray replaces the deprecated labels accessor in Spark 3.x
// After model prediction
val predictions = model.transform(testData)
val withLabels = converter.transform(predictions)
withLabels.select("predicted_label", "prediction").show()
This is essential for interpreting classification results in the original categorical space.
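Conceptually, IndexToString is just a lookup into the fitted label array; a plain-Python equivalent, with `index_to_string` as a hypothetical helper:

```python
labels = ["red", "blue", "green"]  # e.g. a fitted indexer's label array

def index_to_string(predictions, labels):
    """Inverse of StringIndexer: map each numeric index back to its
    original string label (hypothetical helper)."""
    return [labels[int(p)] for p in predictions]

print(index_to_string([0.0, 2.0, 1.0], labels))  # ['red', 'green', 'blue']
```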
Performance Considerations
StringIndexer and OneHotEncoder are lightweight transformations, but consider these optimizations:
- Cache intermediate DataFrames when reusing transformations during experimentation
- Use handleInvalid="keep" to avoid expensive exception handling in streaming pipelines
- Limit cardinality for one-hot encoding: high-cardinality features (>100 categories) should use feature hashing or embeddings instead
- Persist fitted models to avoid recomputing label mappings on the same dataset
For high-cardinality categorical features, consider FeatureHasher as an alternative that creates fixed-size vectors without requiring a vocabulary fit.
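The hashing trick behind FeatureHasher can be sketched in plain Python: each category string is hashed into a fixed number of buckets, so no vocabulary fit is ever needed. This sketch uses MD5 purely for a stable demo hash (Spark's FeatureHasher uses MurmurHash3 internally), and `hash_feature` is a hypothetical helper:

```python
import hashlib

def hash_feature(value, num_features=16):
    """Hashing trick: map a category string to a fixed-size bucket without
    fitting a vocabulary. MD5 here is only for a stable, reproducible demo."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_features

vec = [0.0] * 16
for value in ["color=red", "size=small"]:
    vec[hash_feature(value)] += 1.0  # hash collisions simply add up

print(sum(vec))  # 2.0 -- two active features, regardless of collisions
```

The trade-off: the vector size is fixed up front and collisions lose some information, but memory stays bounded no matter how many distinct categories appear.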