How to Scale Features in R

Key Insights

  • Feature scaling is critical for distance-based algorithms (KNN, SVM, k-means) and neural networks, but tree-based models work fine without it
  • Always fit scaling parameters on training data only, then apply those same parameters to test data—fitting on the entire dataset causes data leakage
  • Standardization (z-score) is generally more robust than min-max normalization, especially when your data contains outliers or you don’t know the theoretical bounds

Introduction to Feature Scaling

Feature scaling transforms your numeric variables to a common scale without distorting differences in the ranges of values. This matters because many machine learning algorithms are sensitive to the magnitude of features.

Consider k-means clustering with two features: annual income (ranging from $30,000 to $150,000) and age (ranging from 18 to 65). Without scaling, the income feature will dominate distance calculations simply because its numeric values are larger, not because it’s more important.

library(tidyverse)

# Create sample data with different scales
set.seed(123)
data <- tibble(
  income = rnorm(100, mean = 80000, sd = 25000),
  age = rnorm(100, mean = 40, sd = 10)
)

# K-means without scaling
km_unscaled <- kmeans(data, centers = 3)

# K-means with scaling
km_scaled <- kmeans(scale(data), centers = 3)

# Compare cluster assignments
table(km_unscaled$cluster, km_scaled$cluster)

The cluster assignments differ significantly because unscaled k-means essentially ignores the age variable. Algorithms that use distance metrics (KNN, SVM, k-means) or gradient descent (neural networks, logistic regression) need scaled features. Tree-based algorithms (random forests, XGBoost) don’t require scaling since they split on individual features independently.

Normalization (Min-Max Scaling)

Min-max scaling transforms features to a fixed range, typically [0,1]. The formula is: (x - min(x)) / (max(x) - min(x)). This preserves the original distribution shape while compressing it into the target range.

Use normalization when you know the theoretical bounds of your data or when you need features in a specific range (like [0,1] for certain neural network activations).

# Manual min-max scaling
min_max_scale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Example data
values <- c(10, 20, 30, 40, 50)
normalized <- min_max_scale(values)
print(normalized)  # [1] 0.00 0.25 0.50 0.75 1.00

For production code, use the scales package:

library(scales)

# Rescale to [0,1]
rescale(values)

# Rescale to custom range [0,100]
rescale(values, to = c(0, 100))

Apply normalization to multiple columns in a data frame:

library(dplyr)

df <- tibble(
  feature1 = c(10, 20, 30, 40, 50),
  feature2 = c(100, 200, 300, 400, 500),
  category = c("A", "B", "A", "B", "A")
)

df_normalized <- df %>%
  mutate(across(where(is.numeric), ~rescale(.x)))

print(df_normalized)

The limitation of min-max scaling is sensitivity to outliers. A single extreme value will compress the rest of your data into a narrow range.
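
A quick sketch makes the effect concrete (the helper from above is re-defined here so the snippet runs on its own): a single outlier claims almost the entire [0,1] range for itself.

```r
# Min-max helper, repeated so this snippet is self-contained
min_max_scale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# The same 10-50 values, plus one extreme observation
values_outlier <- c(10, 20, 30, 40, 50, 1000)
round(min_max_scale(values_outlier), 3)
# [1] 0.000 0.010 0.020 0.030 0.040 1.000
```

The five original values now occupy about 4% of the target range. Standardization, or the robust scaling discussed below, handles this case more gracefully.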

Standardization (Z-Score Scaling)

Standardization transforms features to have mean 0 and standard deviation 1 using the formula: (x - mean(x)) / sd(x). Unlike normalization, standardized values aren’t bounded to a specific range—they typically fall between -3 and 3 for normally distributed data.

Prefer standardization when your data contains outliers or when you don’t know the theoretical min/max bounds. It’s also the default choice for many machine learning algorithms.

# Base R scale() function
values <- c(10, 20, 30, 40, 50)
standardized <- scale(values)
print(standardized)

# Verify mean ≈ 0 and sd ≈ 1
mean(standardized)  # approximately 0
sd(standardized)    # approximately 1

Custom standardization with dplyr:

# Custom standardization function
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

df_standardized <- df %>%
  mutate(across(where(is.numeric), z_score))

Here’s a practical example showing how standardization affects linear regression:

# Generate data where one feature has much larger scale
set.seed(456)
data <- tibble(
  small_feature = rnorm(100, mean = 5, sd = 2),
  large_feature = rnorm(100, mean = 5000, sd = 2000),
  outcome = 3 * small_feature + 0.002 * large_feature + rnorm(100)
)

# Model without scaling
model_unscaled <- lm(outcome ~ small_feature + large_feature, data = data)

# Model with scaling
data_scaled <- data %>%
  mutate(across(c(small_feature, large_feature),
                ~ as.numeric(scale(.x))))  # as.numeric() drops the matrix attributes scale() attaches

model_scaled <- lm(outcome ~ small_feature + large_feature, data = data_scaled)

# Compare coefficients
summary(model_unscaled)$coefficients
summary(model_scaled)$coefficients

Both models fit equally well here (lm() solves least squares in closed form, so there is no convergence to speed up), but the scaled model produces more interpretable coefficients: each one estimates the change in the outcome per one standard deviation of its feature, making their magnitudes directly comparable. For gradient-descent learners such as neural networks, scaling genuinely does speed convergence as well.

Robust Scaling Methods

When your data contains outliers, standard scaling methods can produce poor results. Robust scaling uses the median and interquartile range (IQR) instead of mean and standard deviation.

The formula is: (x - median(x)) / IQR(x)

# Robust scaling function
robust_scale <- function(x) {
  med <- median(x, na.rm = TRUE)
  iqr_val <- IQR(x, na.rm = TRUE)
  (x - med) / iqr_val
}

# Data with outliers
values_with_outliers <- c(10, 12, 11, 13, 12, 14, 100)

# Compare standard vs robust scaling
standard <- scale(values_with_outliers)
robust <- robust_scale(values_with_outliers)

tibble(
  original = values_with_outliers,
  standard = as.vector(standard),
  robust = robust
)

The robust method handles the outlier (100) more gracefully, keeping the other values in a reasonable range.

Scaling in ML Pipelines with caret and recipes

In real machine learning workflows, you need to scale training and test data consistently. The caret and recipes packages handle this automatically.

Using caret::preProcess():

library(caret)

# Split data
set.seed(789)
train_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

# Fit preprocessing on training data only
preproc <- preProcess(train_data[, 1:4], method = c("center", "scale"))

# Apply to both sets
train_scaled <- predict(preproc, train_data[, 1:4])
test_scaled <- predict(preproc, test_data[, 1:4])

The recipes package provides a more flexible pipeline approach:

library(recipes)

# Create recipe
recipe_obj <- recipe(Species ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())  # z-score scaling

# Or use step_range() for min-max scaling
recipe_minmax <- recipe(Species ~ ., data = train_data) %>%
  step_range(all_numeric_predictors(), min = 0, max = 1)

# Prep the recipe (fit parameters on training data)
prepped_recipe <- prep(recipe_obj, training = train_data)

# Apply to data
train_processed <- bake(prepped_recipe, new_data = train_data)
test_processed <- bake(prepped_recipe, new_data = test_data)

The recipes workflow is particularly powerful for complex preprocessing pipelines involving multiple steps.
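
As a sketch of such a pipeline (assuming a recent recipes version where step_impute_median() is available; the imputation step is illustrative, since iris has no missing values), steps run in order, so the scaling step sees the imputed values:

```r
library(recipes)

# Multi-step pipeline: impute first, then scale.
# Step order matters: step_normalize() operates on the imputed data.
recipe_multi <- recipe(Species ~ ., data = train_data) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill NAs with training medians
  step_normalize(all_numeric_predictors())          # then z-score scale

prepped_multi <- prep(recipe_multi, training = train_data)
test_multi <- bake(prepped_multi, new_data = test_data)
```

Because prep() fits every step on the training data, the medians and the scaling parameters are both learned once and reused, which keeps the train/test boundary intact across the whole pipeline.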

Best Practices and Common Pitfalls

The most critical rule: always fit scaling parameters on training data only. Fitting on the entire dataset causes data leakage, where information from the test set influences your model.

Here’s the wrong way:

# WRONG: Scaling before splitting
data_scaled_wrong <- as.data.frame(scale(iris[, 1:4]))
data_scaled_wrong$Species <- iris$Species

train_wrong <- data_scaled_wrong[train_idx, ]
test_wrong <- data_scaled_wrong[-train_idx, ]

The correct way:

# RIGHT: Split first, then scale
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

# Fit scaler on training data
train_means <- colMeans(train_data[, 1:4])
train_sds <- apply(train_data[, 1:4], 2, sd)

# Apply same transformation to both sets
train_scaled <- scale(train_data[, 1:4], 
                      center = train_means, 
                      scale = train_sds)

test_scaled <- scale(test_data[, 1:4], 
                     center = train_means, 
                     scale = train_sds)

Additional best practices:

Don’t scale categorical variables: Scaling only applies to numeric features. One-hot encoded variables should remain as 0/1.
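
A minimal sketch with hypothetical column names: name the continuous columns explicitly so the dummies stay untouched, since a blanket where(is.numeric) would sweep up the 0/1 columns too.

```r
library(dplyr)

# Hypothetical frame mixing a continuous feature with one-hot dummies
df_mixed <- tibble(
  income      = c(30000, 52000, 87000, 120000),
  region_east = c(1, 0, 0, 1),  # one-hot encoded: keep as 0/1
  region_west = c(0, 1, 1, 0)
)

# Scale only the genuinely continuous column
df_mixed %>%
  mutate(across(income, ~ as.numeric(scale(.x))))
```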

Don’t scale your target variable for regression unless you have a specific reason (like neural networks with bounded outputs). If you do, remember to inverse transform predictions.
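
If you do scale the target, the round trip looks like this (pred_scaled here is a stand-in for real model output):

```r
# Standardize the target and remember its parameters
y <- c(200, 250, 310, 400, 520)
y_mean <- mean(y)
y_sd   <- sd(y)
y_scaled <- (y - y_mean) / y_sd

# ... fit a model on y_scaled; here we fake its predictions ...
pred_scaled <- y_scaled

# Inverse transform predictions back to the original units
pred_original <- pred_scaled * y_sd + y_mean
all.equal(pred_original, y)  # TRUE
```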

Know when NOT to scale: Tree-based algorithms (random forests, gradient boosting) don’t require scaling. Neither do naive Bayes classifiers.
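
You can check this with a single decision tree (rpart ships with R as a recommended package): tree splits depend only on the ordering of each feature, and z-scoring is monotone, so the fitted trees are equivalent.

```r
library(rpart)

# Fit the same tree on raw and on standardized features
fit_raw <- rpart(Species ~ ., data = iris)

iris_scaled <- iris
iris_scaled[, 1:4] <- scale(iris_scaled[, 1:4])
fit_scaled <- rpart(Species ~ ., data = iris_scaled)

# Compare predicted classes from the two trees
identical(
  predict(fit_raw, iris, type = "class"),
  predict(fit_scaled, iris_scaled, type = "class")
)
```

Because a monotone transformation preserves every candidate split's ordering, the two trees choose equivalent splits and their predictions agree.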

Handle new data carefully: When deploying models, save your scaling parameters and apply them to new incoming data:

# Save scaling parameters
scaling_params <- list(
  means = train_means,
  sds = train_sds
)

saveRDS(scaling_params, "scaling_params.rds")

# Apply to new data later
new_data_scaled <- scale(new_data, 
                         center = scaling_params$means,
                         scale = scaling_params$sds)

Feature scaling is a fundamental preprocessing step that can significantly impact model performance. Choose standardization as your default, use normalization when you need bounded ranges, and switch to robust scaling when outliers are present. Most importantly, always maintain the training/test boundary to avoid data leakage.
