Spark MLlib - StringIndexer and OneHotEncoder
Key Insights
- StringIndexer converts categorical string labels into numerical indices based on label frequency, essential for algorithms that require numeric input, while OneHotEncoder transforms these indices into binary vectors to prevent ordinal relationships
- Proper handling of unseen labels during prediction requires setting the handleInvalid parameter to "keep" or "skip" to avoid pipeline failures in production environments
- Combining both transformers in a Pipeline with VectorAssembler creates a robust feature engineering workflow that handles categorical variables correctly for Spark ML algorithms
Understanding StringIndexer
StringIndexer maps categorical string values to numerical indices. The most frequent label receives index 0.0, the second most frequent gets 1.0, and so on. This transformation is critical because most machine learning algorithms in Spark MLlib operate on numerical features.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("StringIndexerExample")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val data = Seq(
("red", 1),
("blue", 2),
("red", 3),
("green", 4),
("blue", 5),
("red", 6)
).toDF("color", "value")
val indexer = new StringIndexer()
.setInputCol("color")
.setOutputCol("color_index")
.fit(data)
val indexed = indexer.transform(data)
indexed.show()
// Output:
// +-----+-----+-----------+
// |color|value|color_index|
// +-----+-----+-----------+
// | red| 1| 0.0|
// | blue| 2| 1.0|
// | red| 3| 0.0|
// |green| 4| 2.0|
// | blue| 5| 1.0|
// | red| 6| 0.0|
// +-----+-----+-----------+
The fit() method computes the label frequencies and creates the mapping. Red appears 3 times (index 0), blue appears 2 times (index 1), and green appears once (index 2).
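The frequency-based ordering can be sketched in plain Python, with no Spark required; `build_index` is a hypothetical helper that mimics StringIndexer's default frequencyDesc ordering, breaking frequency ties alphabetically:

```python
from collections import Counter

def build_index(labels):
    """Mimic StringIndexer's default ordering: the most frequent label
    gets index 0.0; ties are broken alphabetically (hypothetical helper)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
    return {lbl: float(i) for i, lbl in enumerate(ordered)}

colors = ["red", "blue", "red", "green", "blue", "red"]
mapping = build_index(colors)
print(mapping)  # {'red': 0.0, 'blue': 1.0, 'green': 2.0}
```

This matches the table above: red (3 occurrences) maps to 0.0, blue (2) to 1.0, green (1) to 2.0. Note that Spark also exposes a stringOrderType parameter to change this ordering if needed.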
Handling Unseen Labels
Production systems encounter labels not present in training data. The handleInvalid parameter controls this behavior:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("HandleInvalid").getOrCreate()
train_data = spark.createDataFrame([
("cat", 1),
("dog", 2),
("cat", 3),
], ["animal", "value"])
test_data = spark.createDataFrame([
("cat", 4),
("bird", 5), # Unseen label
("dog", 6),
], ["animal", "value"])
# Option 1: error (default) - throws exception
indexer_error = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="error"
)
# Option 2: skip - removes rows with unseen labels
indexer_skip = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="skip"
).fit(train_data)
result_skip = indexer_skip.transform(test_data)
result_skip.show()
# bird row is removed
# Option 3: keep - assigns unseen labels to numLabels index
indexer_keep = StringIndexer(
inputCol="animal",
outputCol="animal_index",
handleInvalid="keep"
).fit(train_data)
result_keep = indexer_keep.transform(test_data)
result_keep.show()
# bird gets index 2.0 (numLabels)
Use "keep" for production pipelines where you need to process all records and handle unknowns explicitly downstream.
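The three handleInvalid modes can be illustrated with a plain-Python sketch; `apply_index` is a hypothetical stand-in for the fitted model's transform, not Spark's actual implementation:

```python
def apply_index(labels, mapping, handle_invalid="error"):
    """Hypothetical stand-in for StringIndexerModel.transform: maps labels
    to indices, treating unseen labels per the handleInvalid setting."""
    out = []
    for lbl in labels:
        if lbl in mapping:
            out.append(mapping[lbl])
        elif handle_invalid == "skip":
            continue                         # drop the row entirely
        elif handle_invalid == "keep":
            out.append(float(len(mapping)))  # extra bucket at numLabels
        else:
            raise ValueError(f"Unseen label: {lbl}")
    return out

mapping = {"cat": 0.0, "dog": 1.0}           # learned from training data
print(apply_index(["cat", "bird", "dog"], mapping, "skip"))  # [0.0, 1.0]
print(apply_index(["cat", "bird", "dog"], mapping, "keep"))  # [0.0, 2.0, 1.0]
```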
OneHotEncoder Transformation
StringIndexer creates ordinal relationships (0 < 1 < 2) that don’t exist in categorical data. OneHotEncoder solves this by creating binary vectors where only one element is 1 (hot) and others are 0.
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val data = Seq(
("small", 10),
("medium", 20),
("large", 30),
("small", 15),
("large", 35)
).toDF("size", "price")
val indexer = new StringIndexer()
.setInputCol("size")
.setOutputCol("size_index")
.fit(data)
val indexed = indexer.transform(data)
val encoder = new OneHotEncoder()
.setInputCol("size_index")
.setOutputCol("size_vec")
// In Spark 3.x, OneHotEncoder is an Estimator and must be fit before transforming
val encoded = encoder.fit(indexed).transform(indexed)
encoded.select("size", "size_index", "size_vec").show(false)
// Output:
// +------+----------+-------------+
// |size |size_index|size_vec |
// +------+----------+-------------+
// |small |1.0 |(2,[1],[1.0])|
// |medium|2.0 |(2,[],[]) |
// |large |0.0 |(2,[0],[1.0])|
// |small |1.0 |(2,[1],[1.0])|
// |large |0.0 |(2,[0],[1.0])|
// +------+----------+-------------+
The output uses sparse vector notation: (size, [indices], [values]). For 3 categories, we get 2 dimensions (n-1) to avoid multicollinearity. The last category becomes the reference with all zeros.
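The sparse encoding rule can be sketched in plain Python; `one_hot_sparse` is a hypothetical helper that returns the (size, [indices], [values]) triple shown in the table above:

```python
def one_hot_sparse(index, num_labels, drop_last=True):
    """Mimic OneHotEncoder's sparse output: (size, [indices], [values]).
    With drop_last=True, the highest index becomes the all-zeros reference
    (hypothetical helper for illustration)."""
    size = num_labels - 1 if drop_last else num_labels
    idx = int(index)
    if idx < size:
        return (size, [idx], [1.0])
    return (size, [], [])  # dropped reference category

print(one_hot_sparse(1.0, 3))  # (2, [1], [1.0]) -> small
print(one_hot_sparse(2.0, 3))  # (2, [], [])     -> medium, the reference
print(one_hot_sparse(0.0, 3))  # (2, [0], [1.0]) -> large
```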
Setting dropLast Parameter
Control the output dimensionality with dropLast:
from pyspark.ml.feature import OneHotEncoder
# dropLast=True (default) - creates n-1 columns
encoder_drop = OneHotEncoder(
inputCol="size_index",
outputCol="size_vec",
dropLast=True
)
# dropLast=False - creates n columns
encoder_keep = OneHotEncoder(
inputCol="size_index",
outputCol="size_vec",
dropLast=False
)
# In Spark 3.x, OneHotEncoder is an Estimator and must be fit before transforming
encoded_keep = encoder_keep.fit(indexed).transform(indexed)
encoded_keep.select("size", "size_index", "size_vec").show(truncate=False)
# With dropLast=False:
# +------+----------+-------------+
# |size  |size_index|size_vec     |
# +------+----------+-------------+
# |small |1.0       |(3,[1],[1.0])|
# |medium|2.0       |(3,[2],[1.0])|
# |large |0.0       |(3,[0],[1.0])|
# +------+----------+-------------+
Keep dropLast=True for linear models to prevent the dummy variable trap. Set dropLast=False for tree-based algorithms that handle collinearity naturally.
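The dummy variable trap can be seen directly: with dropLast=False the encoded columns always sum to 1, making them perfectly collinear with an intercept column. A small plain-Python sketch, using a hypothetical dense helper:

```python
def one_hot_dense(index, num_labels, drop_last=True):
    """Dense one-hot row (hypothetical helper for illustration)."""
    size = num_labels - 1 if drop_last else num_labels
    row = [0.0] * size
    if int(index) < size:
        row[int(index)] = 1.0
    return row

# dropLast=False: every row sums to 1.0, collinear with an intercept
rows_full = [one_hot_dense(i, 3, drop_last=False) for i in (0, 1, 2)]
print([sum(r) for r in rows_full])  # [1.0, 1.0, 1.0]

# dropLast=True: the reference category's all-zeros row breaks the collinearity
rows_drop = [one_hot_dense(i, 3, drop_last=True) for i in (0, 1, 2)]
print([sum(r) for r in rows_drop])  # [1.0, 1.0, 0.0]
```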
Building Complete Feature Pipeline
Combine multiple categorical features with numerical features using Pipeline and VectorAssembler:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
val data = Seq(
("red", "small", 100, 1.5),
("blue", "large", 200, 2.1),
("red", "medium", 150, 1.8),
("green", "small", 120, 1.6)
).toDF("color", "size", "quantity", "weight")
// String indexers
val colorIndexer = new StringIndexer()
.setInputCol("color")
.setOutputCol("color_index")
.setHandleInvalid("keep")
val sizeIndexer = new StringIndexer()
.setInputCol("size")
.setOutputCol("size_index")
.setHandleInvalid("keep")
// One-hot encoders
val colorEncoder = new OneHotEncoder()
.setInputCol("color_index")
.setOutputCol("color_vec")
val sizeEncoder = new OneHotEncoder()
.setInputCol("size_index")
.setOutputCol("size_vec")
// Combine all features
val assembler = new VectorAssembler()
.setInputCols(Array("color_vec", "size_vec", "quantity", "weight"))
.setOutputCol("features")
// Create pipeline
val pipeline = new Pipeline()
.setStages(Array(colorIndexer, sizeIndexer, colorEncoder, sizeEncoder, assembler))
val model = pipeline.fit(data)
val result = model.transform(data)
result.select("color", "size", "features").show(false)
This pipeline handles the complete transformation from raw categorical strings to a feature vector ready for ML algorithms.
Multiple Column Transformation
Process multiple columns efficiently using array-based APIs:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
data = spark.createDataFrame([
("red", "small", "A", 100),
("blue", "large", "B", 200),
("red", "medium", "A", 150),
], ["color", "size", "category", "value"])
# Index multiple columns at once
indexer = StringIndexer(
inputCols=["color", "size", "category"],
outputCols=["color_idx", "size_idx", "category_idx"],
handleInvalid="keep"
)
indexed = indexer.fit(data).transform(data)
# Encode multiple columns at once
encoder = OneHotEncoder(
inputCols=["color_idx", "size_idx", "category_idx"],
outputCols=["color_vec", "size_vec", "category_vec"]
)
encoded = encoder.fit(indexed).transform(indexed)
encoded.select("color_vec", "size_vec", "category_vec").show(truncate=False)
This approach reduces boilerplate code and improves pipeline readability when handling many categorical features.
Inverse Transformation
Convert indices back to original labels using IndexToString:
import org.apache.spark.ml.feature.IndexToString
val converter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predicted_label")
.setLabels(indexer.labelsArray(0)) // labelsArray replaces the deprecated labels accessor in Spark 3.x
// After model prediction
val predictions = model.transform(testData)
val withLabels = converter.transform(predictions)
withLabels.select("predicted_label", "prediction").show()
This is essential for interpreting classification results in the original categorical space.
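Conceptually, IndexToString is just a lookup into the fitted label array; a plain-Python equivalent, with `index_to_string` as a hypothetical helper:

```python
labels = ["red", "blue", "green"]  # e.g. a fitted indexer's label array

def index_to_string(predictions, labels):
    """Inverse of StringIndexer: map each numeric index back to its
    original string label (hypothetical helper)."""
    return [labels[int(p)] for p in predictions]

print(index_to_string([0.0, 2.0, 1.0], labels))  # ['red', 'green', 'blue']
```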
Performance Considerations
StringIndexer and OneHotEncoder are lightweight transformations, but consider these optimizations:
- Cache intermediate DataFrames when reusing transformations during experimentation
- Use handleInvalid="keep" to avoid expensive exception handling in streaming pipelines
- Limit cardinality for one-hot encoding: high-cardinality features (>100 categories) should use feature hashing or embeddings instead
- Persist fitted models to avoid recomputing label mappings on the same dataset
For high-cardinality categorical features, consider FeatureHasher as an alternative that creates fixed-size vectors without requiring a vocabulary fit.
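The hashing trick behind FeatureHasher can be sketched in plain Python: each category string is hashed into a fixed number of buckets, so no vocabulary fit is ever needed. This sketch uses MD5 purely for a stable demo hash (Spark's FeatureHasher uses MurmurHash3 internally), and `hash_feature` is a hypothetical helper:

```python
import hashlib

def hash_feature(value, num_features=16):
    """Hashing trick: map a category string to a fixed-size bucket without
    fitting a vocabulary. MD5 here is only for a stable, reproducible demo."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_features

vec = [0.0] * 16
for value in ["color=red", "size=small"]:
    vec[hash_feature(value)] += 1.0  # hash collisions simply add up

print(sum(vec))  # 2.0 -- two active features, regardless of collisions
```

The trade-off: the vector size is fixed up front and collisions lose some information, but memory stays bounded no matter how many distinct categories appear.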