How to Normalize Data in Python

Key Insights

  • Features with vastly different scales cause distance-based algorithms to be dominated by high-magnitude features, leading to poor model performance and biased predictions.
  • Min-max scaling works well for bounded distributions without outliers, while standardization is preferred for algorithms that assume roughly Gaussian features (e.g., linear models, LDA) and for gradient-trained models such as neural networks.
  • Always fit normalization transformers on training data only and apply the same transformation to test data—fitting on the entire dataset causes data leakage and inflates performance metrics.

Introduction to Data Normalization

Data normalization transforms features to a common scale without distorting differences in value ranges. In machine learning, algorithms that calculate distances between data points—like k-nearest neighbors, support vector machines, and neural networks—are highly sensitive to feature magnitudes. When one feature ranges from 0 to 1 while another ranges from 0 to 100,000, the larger-scale feature dominates distance calculations, rendering smaller-scale features nearly irrelevant.

Consider a simple example with age and income predicting loan approval:

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Sample dataset with different scales
data = pd.DataFrame({
    'age': [25, 35, 45, 55, 30, 40],
    'income': [30000, 80000, 120000, 150000, 45000, 95000],
    'approved': [0, 1, 1, 1, 0, 1]
})

X = data[['age', 'income']]
y = data['approved']

# Calculate Euclidean distance between first two samples
dist_raw = np.sqrt((X.iloc[0]['age'] - X.iloc[1]['age'])**2 + 
                   (X.iloc[0]['income'] - X.iloc[1]['income'])**2)
print(f"Distance without normalization: {dist_raw:.2f}")
# Output: Distance without normalization: 50000.00

# Age difference: 10 years, Income difference: $50,000
# Income completely dominates the distance calculation

The income difference of $50,000 overwhelms the age difference of 10 years, making age essentially meaningless in the model’s decision-making process. Normalization solves this by bringing all features to comparable scales.
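To see how scaling restores age's influence, here is a quick sketch (using a small made-up feature matrix, not the loan dataset above) that recomputes the distance between the first two samples after min-max scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (age, income) samples for illustration
X = np.array([[25.0, 30000.0],
              [35.0, 80000.0],
              [45.0, 120000.0],
              [55.0, 150000.0]])

# Bring both features into [0, 1] so neither dominates the distance
X_scaled = MinMaxScaler().fit_transform(X)

dist_raw = np.linalg.norm(X[0] - X[1])        # dominated by income
dist_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"Raw: {dist_raw:.2f}, scaled: {dist_scaled:.2f}")
# Raw: 50000.00, scaled: 0.53
```

After scaling, the age gap (10 of a 30-year range) and the income gap ($50,000 of a $120,000 range) contribute comparably to the distance.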

Min-Max Normalization (Scaling)

Min-max scaling transforms features to a fixed range, typically [0, 1], using the formula:

X_scaled = (X - X_min) / (X_max - X_min)

This technique preserves the original distribution shape while compressing it into the specified range. Use min-max scaling when you know the approximate upper and lower bounds of your features, and when your data doesn’t contain significant outliers.

from sklearn.preprocessing import MinMaxScaler

# Manual implementation
def manual_minmax(data):
    return (data - data.min()) / (data.max() - data.min())

# Create sample data
ages = np.array([22, 35, 48, 52, 28, 41, 55, 30]).reshape(-1, 1)

# Manual scaling
ages_manual = manual_minmax(ages)

# Scikit-learn implementation
scaler = MinMaxScaler()
ages_scaled = scaler.fit_transform(ages)

print("Original ages:", ages.flatten())
print("Scaled ages:", ages_scaled.flatten().round(2))
print(f"Min: {ages_scaled.min()}, Max: {ages_scaled.max()}")

# Output:
# Original ages: [22 35 48 52 28 41 55 30]
# Scaled ages: [0.    0.39  0.79  0.91  0.18  0.58  1.    0.24]
# Min: 0.0, Max: 1.0

Min-max scaling is deterministic and preserves zero entries only when a feature's minimum is 0 (for sparse data, MaxAbsScaler is the usual choice). It works well for neural networks and image processing, where pixel values naturally fall within [0, 255]. However, it's sensitive to outliers—a single extreme value can compress the rest of your data into a tiny range.
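To make the outlier sensitivity concrete, here is a small sketch that appends one extreme (hypothetical) value to the ages above and rescales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same ages as before, plus one extreme outlier
ages = np.array([22, 35, 48, 52, 28, 41, 55, 30, 500]).reshape(-1, 1)

scaled = MinMaxScaler().fit_transform(ages)
print(scaled.flatten().round(3))
# The genuine ages (22-55) are squeezed into [0, 0.069];
# the single outlier claims the rest of the [0, 1] range.
```

One bad value has collapsed the meaningful variation among real ages into less than 7% of the target range.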

Standardization (Z-score Normalization)

Standardization transforms data to have a mean of 0 and standard deviation of 1 using:

X_standardized = (X - μ) / σ

where μ is the mean and σ is the standard deviation. Unlike min-max scaling, standardization doesn’t bound values to a specific range—standardized data can extend beyond [-3, 3] if outliers exist.

from sklearn.preprocessing import StandardScaler

# Manual implementation
def manual_standardize(data):
    return (data - data.mean()) / data.std()

# Create data with different distributions
income = np.array([35000, 42000, 38000, 95000, 41000, 39000, 43000, 40000]).reshape(-1, 1)

# Manual standardization
income_manual = manual_standardize(income)

# Scikit-learn implementation
std_scaler = StandardScaler()
income_standardized = std_scaler.fit_transform(income)

print("Original income:", income.flatten())
print("Standardized income:", income_standardized.flatten().round(2))
print(f"Mean: {income_standardized.mean():.2e}, Std: {income_standardized.std():.2f}")

# Output:
# Original income: [35000 42000 38000 95000 41000 39000 43000 40000]
# Standardized income: [-0.63 -0.25 -0.47  2.62 -0.31 -0.41 -0.2  -0.36]
# Mean: ~0 (order 1e-16, floating-point error), Std: 1.00

Standardization is preferred for algorithms that assume normally distributed features (linear regression, logistic regression, LDA) and for neural networks. It handles outliers better than min-max scaling because it doesn’t compress the entire distribution—outliers remain outliers, just on a different scale. However, the outlier at $95,000 still creates a large standardized value (2.62), and it inflates the standard deviation used to scale every other point, which brings us to robust scaling.

Robust Scaling for Outlier-Heavy Data

Robust scaling uses the median and interquartile range (IQR) instead of mean and standard deviation:

X_robust = (X - median) / IQR

This approach is resistant to outliers because median and IQR aren’t affected by extreme values the way mean and standard deviation are.

from sklearn.preprocessing import RobustScaler

# Create dataset with outliers
data_with_outliers = np.array([20, 22, 21, 23, 22, 150, 24, 21, 23, 200]).reshape(-1, 1)

# Standard scaling
std_scaler = StandardScaler()
standard_scaled = std_scaler.fit_transform(data_with_outliers)

# Robust scaling
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(data_with_outliers)

comparison = pd.DataFrame({
    'Original': data_with_outliers.flatten(),
    'Standard': standard_scaled.flatten().round(2),
    'Robust': robust_scaled.flatten().round(2)
})

print(comparison)

# Output: the outliers inflate the mean and standard deviation, so standard
# scaling squeezes the inliers together; robust scaling keeps them spread out:
#    Original  Standard  Robust
# 0        20     -0.52    -1.0
# 1        22     -0.49    -0.2
# 2        21     -0.51    -0.6
# 3        23     -0.48     0.2
# 4        22     -0.49    -0.2
# 5       150      1.57    51.0
# 6        24     -0.46     0.6
# 7        21     -0.51    -0.6
# 8        23     -0.48     0.2
# 9       200      2.37    71.0

Notice how standard scaling compresses the typical values (20-24) into a very tight range (-0.52 to -0.46), because the outliers inflate the standard deviation, while robust scaling spreads them across a much wider range (-1.0 to 0.6). The outliers remain outliers under both methods, but robust scaling preserves the structure of the non-outlier data.
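To confirm what RobustScaler computes under the hood, the same result can be reproduced by hand with NumPy's median and percentile functions:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = np.array([20, 22, 21, 23, 22, 150, 24, 21, 23, 200]).reshape(-1, 1)

# RobustScaler centers on the median and divides by the IQR (Q3 - Q1)
median = np.median(data)                 # 22.5
q1, q3 = np.percentile(data, [25, 75])   # 21.25 and 23.75, so IQR = 2.5
manual = (data - median) / (q3 - q1)

sk = RobustScaler().fit_transform(data)
print(np.allclose(manual, sk))  # True
```

Because the two extreme values can move the mean arbitrarily far but barely touch the median or the quartiles, the scaling statistics stay anchored to the bulk of the data.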

Normalization in Real ML Pipelines

The cardinal rule of normalization: fit on training data only, transform both training and test data. Fitting on the entire dataset causes data leakage—your model gains information about the test set it shouldn’t have access to.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, 
                          n_redundant=1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# WRONG WAY - Data leakage
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(np.vstack([X_train, X_test]))
X_train_wrong = X_all_scaled[:len(X_train)]
X_test_wrong = X_all_scaled[len(X_train):]

model_wrong = LogisticRegression()
model_wrong.fit(X_train_wrong, y_train)
score_wrong = model_wrong.score(X_test_wrong, y_test)

# RIGHT WAY - Proper pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
score_right = pipeline.score(X_test, y_test)

print(f"Wrong approach score: {score_wrong:.4f}")
print(f"Correct approach score: {score_right:.4f}")

# The wrong approach often shows artificially inflated performance

Using scikit-learn’s Pipeline ensures transformations are applied correctly. The pipeline fits the scaler on training data during pipeline.fit(), then automatically applies the same transformation (with training statistics) to test data during pipeline.score() or pipeline.predict().
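The same fit-on-train, transform-both discipline can also be written out by hand when a Pipeline is inconvenient; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training split only...
scaler = StandardScaler().fit(X_train)

# ...then reuse the *training* mean and std for both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training set is exactly standardized; the test set only approximately,
# because it is scaled with statistics estimated from the training set.
print(X_train_scaled.mean(axis=0).round(6))  # essentially [0, 0]
print(X_test_scaled.mean(axis=0).round(2))   # close to, but not exactly, 0
```

The learned statistics live on the fitted scaler (`scaler.mean_`, `scaler.scale_`), which is also what you would persist alongside the model for inference.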

Choosing the Right Normalization Technique

Your choice depends on three factors: algorithm requirements, data distribution, and outlier presence.

Use Min-Max Scaling when:

  • Working with neural networks or algorithms requiring bounded inputs
  • Features have known, fixed ranges
  • Data is uniformly distributed without outliers
  • Working with image data or other naturally bounded features

Use Standardization when:

  • Algorithm assumes normally distributed features (linear models, LDA)
  • Features have Gaussian-like distributions
  • Working with gradient descent optimization
  • Features don’t have meaningful bounds

Use Robust Scaling when:

  • Data contains significant outliers
  • Median better represents central tendency than mean
  • Outliers are legitimate data points (not errors) that shouldn’t dominate scaling

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Compare normalization techniques
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scalers = {
    'None': None,
    'MinMax': MinMaxScaler(),
    'Standard': StandardScaler(),
    'Robust': RobustScaler()
}

models = {
    'SVM': SVC(),
    'LogisticRegression': LogisticRegression(),
    'DecisionTree': DecisionTreeClassifier()
}

results = []
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        if scaler:
            pipeline = Pipeline([('scaler', scaler), ('model', model)])
        else:
            pipeline = Pipeline([('model', model)])
        
        pipeline.fit(X_train, y_train)
        score = pipeline.score(X_test, y_test)
        results.append({
            'Scaler': scaler_name,
            'Model': model_name,
            'Accuracy': score
        })

results_df = pd.DataFrame(results).pivot(index='Model', columns='Scaler', values='Accuracy')
print(results_df.round(3))

This comparison reveals that tree-based algorithms such as decision trees are invariant to monotonic feature scaling—they split on thresholds, so they perform identically regardless of normalization—while distance-based and gradient-based algorithms (SVM, logistic regression) typically improve markedly with proper scaling.

Normalization isn’t optional for most machine learning workflows—it’s a fundamental preprocessing step that can mean the difference between a model that works and one that fails. Master these three techniques, understand when to apply each, and always use proper train-test splitting to avoid data leakage.
