Spark MLlib - StandardScaler and MinMaxScaler
Key Insights
- StandardScaler transforms features to have zero mean and unit variance, making it ideal for algorithms sensitive to feature magnitude like logistic regression and SVMs, while MinMaxScaler bounds features to a specified range (typically [0, 1]), which works better for neural networks and distance-based algorithms.
- Both scalers in Spark MLlib operate as Estimator-Transformer pairs where the Estimator learns parameters from training data and the Transformer applies those parameters to any dataset, ensuring consistent scaling between training and test sets.
- Choosing between scalers depends on your data distribution and algorithm requirements: StandardScaler handles outliers better by not compressing them into a fixed range, while MinMaxScaler guarantees exact minimum/maximum bounds (at the cost of densifying sparse input, since zero entries are generally shifted).
Understanding Feature Scaling in Distributed Environments
Feature scaling is critical in machine learning pipelines because algorithms that compute distances or assume normally distributed data perform poorly when features exist on different scales. In Spark MLlib, scaling operations must work efficiently across distributed datasets while maintaining statistical accuracy.
StandardScaler and MinMaxScaler follow Spark’s Estimator-Transformer pattern. The Estimator calculates statistics (mean/std for StandardScaler, min/max for MinMaxScaler) by scanning the dataset once. The resulting Transformer model applies these pre-computed statistics to transform features without additional passes through the data.
StandardScaler Implementation
StandardScaler standardizes features using the formula (x - μ) / σ, giving each feature mean 0 and standard deviation 1 when both centering and scaling are enabled. This transformation makes features comparable regardless of their original units.
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("StandardScalerExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
// Create sample dataset with features on different scales
val data = Seq(
  (1, Vectors.dense(100.0, 0.001, 50.0)),
  (2, Vectors.dense(200.0, 0.002, 75.0)),
  (3, Vectors.dense(150.0, 0.0015, 60.0)),
  (4, Vectors.dense(300.0, 0.003, 90.0)),
  (5, Vectors.dense(250.0, 0.0025, 80.0))
).toDF("id", "features")
// Configure StandardScaler
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true) // Center data (subtract mean)
  .setWithStd(true)  // Scale to unit variance
// Fit the scaler to compute statistics
val scalerModel = scaler.fit(data)
// Transform the data
val scaledData = scalerModel.transform(data)
scaledData.select("features", "scaledFeatures").show(false)
The setWithMean and setWithStd parameters control the transformation behavior. Setting both to true performs full standardization. Setting only setWithStd(true) (Spark's default) scales each feature to unit standard deviation without centering, which is useful for sparse data where subtracting the mean would destroy sparsity.
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StandardScalerExample").getOrCreate()
# Create DataFrame with features
data = [(1, Vectors.dense([100.0, 0.001, 50.0])),
        (2, Vectors.dense([200.0, 0.002, 75.0])),
        (3, Vectors.dense([150.0, 0.0015, 60.0])),
        (4, Vectors.dense([300.0, 0.003, 90.0])),
        (5, Vectors.dense([250.0, 0.0025, 80.0]))]
df = spark.createDataFrame(data, ["id", "features"])
scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withMean=True,
    withStd=True
)
scaler_model = scaler.fit(df)
scaled_df = scaler_model.transform(df)
# Access computed statistics
print(f"Mean: {scaler_model.mean}")
print(f"Std Dev: {scaler_model.std}")
scaled_df.select("features", "scaledFeatures").show(truncate=False)
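To sanity-check the result, the per-feature statistics of the scaled column can be inspected with Summarizer (a minimal sketch, assuming Spark 2.4+ for pyspark.ml.stat and the scaled_df produced above):
from pyspark.ml.stat import Summarizer
# The mean of each scaled feature should be ~0 and the variance ~1
# (Spark divides by the sample standard deviation)
scaled_df.select(
    Summarizer.mean(scaled_df.scaledFeatures).alias("mean"),
    Summarizer.variance(scaled_df.scaledFeatures).alias("variance")
).show(truncate=False)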
MinMaxScaler Implementation
MinMaxScaler transforms features to a specified range using the formula: (x - min) / (max - min) * (max_bound - min_bound) + min_bound. The default range is [0, 1].
import org.apache.spark.ml.feature.MinMaxScaler
// Using the same data from previous example
val minMaxScaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("minMaxScaledFeatures")
  .setMin(0.0) // Lower bound of range
  .setMax(1.0) // Upper bound of range
val minMaxModel = minMaxScaler.fit(data)
val minMaxScaled = minMaxModel.transform(data)
minMaxScaled.select("features", "minMaxScaledFeatures").show(false)
// Access the computed min/max values
println(s"Original min values: ${minMaxModel.originalMin}")
println(s"Original max values: ${minMaxModel.originalMax}")
MinMaxScaler is particularly useful when you need features bounded to a specific range, such as for neural networks with sigmoid/tanh activations or when visualizing data.
from pyspark.ml.feature import MinMaxScaler
minmax_scaler = MinMaxScaler(
    inputCol="features",
    outputCol="minMaxScaledFeatures",
    min=0.0,
    max=1.0
)
minmax_model = minmax_scaler.fit(df)
minmax_scaled_df = minmax_model.transform(df)
# Retrieve original min/max values
print(f"Original Min: {minmax_model.originalMin}")
print(f"Original Max: {minmax_model.originalMax}")
minmax_scaled_df.select("features", "minMaxScaledFeatures").show(truncate=False)
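To connect the output back to the formula above, here is a quick manual check for the first feature of this toy dataset, using the model's originalMin/originalMax values (the row with x = 200.0 is an arbitrary choice for illustration):
# First feature: observed min = 100.0, max = 300.0, target range [0, 1]
e_min = minmax_model.originalMin[0]
e_max = minmax_model.originalMax[0]
x = 200.0
scaled = (x - e_min) / (e_max - e_min) * (1.0 - 0.0) + 0.0
print(scaled)  # 0.5, matching minMaxScaledFeatures for id=2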
Building ML Pipelines with Scalers
In production systems, scalers integrate into ML pipelines to ensure consistent preprocessing across training and prediction phases.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
// Create training data with labels
val trainingData = Seq(
  (0, Vectors.dense(100.0, 0.001, 50.0)),
  (1, Vectors.dense(200.0, 0.002, 75.0)),
  (0, Vectors.dense(150.0, 0.0015, 60.0)),
  (1, Vectors.dense(300.0, 0.003, 90.0)),
  (1, Vectors.dense(250.0, 0.0025, 80.0))
).toDF("label", "features")
// Define pipeline stages
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("label")
  .setMaxIter(10)
val pipeline = new Pipeline()
  .setStages(Array(scaler, lr))
// Fit pipeline - scaler statistics computed only on training data
val pipelineModel = pipeline.fit(trainingData)
// Test data uses the same scaler parameters
val testData = Seq(
  (0, Vectors.dense(180.0, 0.0018, 65.0)),
  (1, Vectors.dense(280.0, 0.0028, 85.0))
).toDF("label", "features")
val predictions = pipelineModel.transform(testData)
predictions.select("label", "features", "scaledFeatures", "prediction").show(false)
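For completeness, here is a PySpark sketch of the same pipeline, reusing the spark session, Vectors, and StandardScaler import from the earlier Python examples:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
train_df = spark.createDataFrame([
    (0, Vectors.dense([100.0, 0.001, 50.0])),
    (1, Vectors.dense([200.0, 0.002, 75.0])),
    (0, Vectors.dense([150.0, 0.0015, 60.0])),
    (1, Vectors.dense([300.0, 0.003, 90.0])),
    (1, Vectors.dense([250.0, 0.0025, 80.0]))
], ["label", "features"])
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="label", maxIter=10)
pipeline = Pipeline(stages=[scaler, lr])
# Scaler statistics are computed only on the training data
pipeline_model = pipeline.fit(train_df)
test_df = spark.createDataFrame([
    (0, Vectors.dense([180.0, 0.0018, 65.0])),
    (1, Vectors.dense([280.0, 0.0028, 85.0]))
], ["label", "features"])
pipeline_model.transform(test_df).select("label", "prediction").show()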
Handling Sparse Features
StandardScaler with withMean=true converts sparse vectors to dense, which can cause memory issues with high-dimensional sparse data. MinMaxScaler does not help here either: because zero entries are generally rescaled to non-zero values, its output is dense as well. For sparse features, the safe choice is StandardScaler with withMean=false (or MaxAbsScaler, which keeps zeros at zero).
from pyspark.ml.linalg import Vectors, SparseVector
# Sparse feature vectors (common in text/categorical features)
sparse_data = [
    (1, Vectors.sparse(10000, {5: 2.0, 100: 3.5, 1000: 1.2})),
    (2, Vectors.sparse(10000, {5: 1.5, 200: 2.8, 2000: 0.9})),
    (3, Vectors.sparse(10000, {10: 3.0, 100: 4.1, 1500: 2.1}))
]
sparse_df = spark.createDataFrame(sparse_data, ["id", "features"])
# StandardScaler without mean centering preserves sparsity
sparse_scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withMean=False,  # Critical for sparse data: centering would densify the vectors
    withStd=True
)
sparse_model = sparse_scaler.fit(sparse_df)
scaled_sparse = sparse_model.transform(sparse_df)
# MinMaxScaler accepts sparse input, but its output is dense (zero entries are shifted)
minmax_sparse_scaler = MinMaxScaler(
    inputCol="features",
    outputCol="minMaxScaledFeatures"
)
minmax_sparse_model = minmax_sparse_scaler.fit(sparse_df)
minmax_scaled_sparse = minmax_sparse_model.transform(sparse_df)
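A quick way to confirm this on the toy data above is to inspect the vector types coming out of each model (collecting a single row is cheap here; the output classes reflect Spark's current behavior):
# StandardScaler with withMean=False keeps the sparse representation
print(type(scaled_sparse.select("scaledFeatures").first()[0]))
# expected: <class 'pyspark.ml.linalg.SparseVector'>
# MinMaxScaler shifts zero entries, so its output is materialized as dense vectors
print(type(minmax_scaled_sparse.select("minMaxScaledFeatures").first()[0]))
# expected: <class 'pyspark.ml.linalg.DenseVector'>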
Performance Considerations and Best Practices
When working with large datasets, scaler performance depends mainly on dataset size and cluster configuration. Both scalers compute their statistics in a single pass over the data: StandardScaler aggregates a per-feature mean and variance, while MinMaxScaler tracks only per-feature minimum and maximum, which can make its fit step slightly cheaper on very wide datasets.
For datasets with extreme outliers, StandardScaler degrades more gracefully: an outlier inflates the mean and standard deviation but is not forced into a fixed range, whereas MinMaxScaler compresses all values, including outliers, into the specified range, which can shrink the relative differences between typical values.
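A tiny illustration of that compression effect, using a single made-up feature with one extreme value (assumes the spark session and imports from the earlier Python examples):
outlier_df = spark.createDataFrame(
    [(Vectors.dense([1.0]),), (Vectors.dense([2.0]),),
     (Vectors.dense([3.0]),), (Vectors.dense([1000.0]),)],
    ["features"])
# MinMaxScaler maps the outlier to 1.0 and squeezes the remaining rows near 0.0
mm_model = MinMaxScaler(inputCol="features", outputCol="mm").fit(outlier_df)
mm_model.transform(outlier_df).show(truncate=False)
# StandardScaler shifts and rescales, but imposes no fixed bounds on the output
std_model = StandardScaler(inputCol="features", outputCol="std",
                           withMean=True, withStd=True).fit(outlier_df)
std_model.transform(outlier_df).show(truncate=False)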
Always fit scalers only on training data and apply the learned parameters to validation and test sets. Fitting on test data causes data leakage and inflates performance metrics.
# Correct approach: fit on train, transform on train and test
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scaler_model = scaler.fit(train_df) # Fit only on training data
scaled_train = scaler_model.transform(train_df)
scaled_test = scaler_model.transform(test_df) # Use same parameters
# Incorrect: fitting separately on test data
# wrong_scaler_model = scaler.fit(test_df) # DON'T DO THIS
For streaming applications, persist the fitted scaler model and load it for transforming incoming data streams to maintain consistency with the training distribution.
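A minimal sketch of that persist-and-reload step, assuming the scaler_model fitted earlier and a placeholder path:
from pyspark.ml.feature import StandardScalerModel
# Save the fitted model (the path here is just a placeholder)
scaler_model.write().overwrite().save("/models/feature_scaler")
# In the streaming job, load the model and apply the training-time statistics
loaded_scaler = StandardScalerModel.load("/models/feature_scaler")
# streaming_df is a hypothetical streaming DataFrame with a "features" column:
# scaled_stream = loaded_scaler.transform(streaming_df)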