How to Implement Gaussian Naive Bayes in Python

Key Insights

  • Gaussian Naive Bayes excels at classification tasks with continuous features by modeling each feature’s distribution per class as a Gaussian (normal) distribution, making it incredibly fast and memory-efficient for high-dimensional data.
  • The “naive” assumption that features are independent given the class label rarely holds in practice, yet the algorithm often performs surprisingly well because classification only requires correct ranking of probabilities, not their absolute accuracy.
  • Implementation from scratch requires just three components: calculating mean and variance per class, implementing the Gaussian probability density function, and applying Bayes’ theorem to compute posterior probabilities for prediction.

Introduction to Gaussian Naive Bayes

Gaussian Naive Bayes is a probabilistic classifier based on Bayes’ theorem with a critical assumption: features follow a Gaussian (normal) distribution within each class. This makes it particularly suitable for continuous numerical data, unlike its cousins Multinomial and Bernoulli Naive Bayes which handle discrete features.

The algorithm shines in several scenarios. When you have limited training data, its simplicity prevents overfitting. When you need real-time predictions, its computational efficiency delivers. When features are roughly independent and normally distributed, its accuracy rivals far more complex models. Common applications include medical diagnosis (continuous measurements like blood pressure and cholesterol), spam filtering with continuous features, and sentiment analysis with normalized word frequencies.

The key assumptions are straightforward but important: features are conditionally independent given the class, and each feature follows a Gaussian distribution within each class. While these assumptions are often violated in practice, the algorithm remains remarkably robust.
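
One quick way to probe the Gaussian assumption is a normality test on each feature (in practice, run it per feature within each class). The sketch below uses SciPy's D'Agostino-Pearson test on synthetic data; the feature names and the 0.05 threshold are illustrative choices, not part of the algorithm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: one roughly Gaussian feature and one clearly skewed feature
gaussian_feature = rng.normal(loc=5.0, scale=1.5, size=300)
skewed_feature = rng.exponential(scale=2.0, size=300)

for name, values in [("gaussian", gaussian_feature), ("skewed", skewed_feature)]:
    stat, p = stats.normaltest(values)  # D'Agostino-Pearson normality test
    verdict = "looks Gaussian" if p > 0.05 else "likely non-Gaussian"
    print(f"{name}: p-value={p:.4f} -> {verdict}")
```

A feature that fails this test badly is a hint to transform it (e.g. log-transform) or to consider a different classifier.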

Mathematical Foundation

Bayes’ theorem forms the backbone of this classifier:

P(y|X) = P(X|y) * P(y) / P(X)

Where y is the class label and X is the feature vector. For classification, we need to find the class with the highest posterior probability. Since P(X) is constant across all classes, we can ignore it:

y_pred = argmax_y P(X|y) * P(y)

The “naive” part comes from assuming feature independence:

P(X|y) = P(x1|y) * P(x2|y) * ... * P(xn|y)

For Gaussian Naive Bayes, each P(xi|y) is calculated using the Gaussian probability density function:

P(xi|y) = (1 / sqrt(2π * σ²)) * exp(-(xi - μ)² / (2σ²))

Where μ is the mean and σ² is the variance of feature xi for class y.
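
To connect the formula to code, here is the density computed by hand with example values for μ and σ², checked against scipy.stats.norm.pdf (note that norm.pdf takes the standard deviation σ, not the variance):

```python
import numpy as np
from scipy.stats import norm

mu, var = 5.0, 2.25   # example parameters for one feature in one class
xi = 6.2              # example feature value

# The formula above, translated directly
manual = (1.0 / np.sqrt(2 * np.pi * var)) * np.exp(-((xi - mu) ** 2) / (2 * var))

# scipy's norm.pdf expects the standard deviation, so pass sqrt(var)
reference = norm.pdf(xi, loc=mu, scale=np.sqrt(var))

print(f"manual={manual:.6f}, scipy={reference:.6f}")  # the two values agree
```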

Let’s visualize how features are modeled as Gaussian distributions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Simulate two classes with different distributions
np.random.seed(42)
class_0_feature = np.random.normal(loc=5, scale=1.5, size=100)
class_1_feature = np.random.normal(loc=8, scale=1.2, size=100)

# Calculate parameters
mu_0, sigma_0 = class_0_feature.mean(), class_0_feature.std()
mu_1, sigma_1 = class_1_feature.mean(), class_1_feature.std()

# Plot distributions
x = np.linspace(0, 14, 1000)
plt.figure(figsize=(10, 6))
plt.hist(class_0_feature, bins=20, density=True, alpha=0.5, label='Class 0 data')
plt.hist(class_1_feature, bins=20, density=True, alpha=0.5, label='Class 1 data')
plt.plot(x, norm.pdf(x, mu_0, sigma_0), 'b-', linewidth=2, label=f'Class 0 Gaussian (μ={mu_0:.2f})')
plt.plot(x, norm.pdf(x, mu_1, sigma_1), 'r-', linewidth=2, label=f'Class 1 Gaussian (μ={mu_1:.2f})')
plt.xlabel('Feature Value')
plt.ylabel('Probability Density')
plt.legend()
plt.title('Gaussian Distribution Modeling in Naive Bayes')
plt.show()

Implementation from Scratch

Understanding the internals makes you a better practitioner. Here’s a complete implementation:

import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        
        # Initialize storage for parameters
        self.mean = np.zeros((n_classes, n_features))
        self.var = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)
        
        # Calculate mean, variance, and prior for each class
        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            self.mean[idx, :] = X_c.mean(axis=0)
            self.var[idx, :] = X_c.var(axis=0)
            self.priors[idx] = X_c.shape[0] / n_samples
    
    def _gaussian_pdf(self, class_idx, x):
        """Calculate Gaussian probability density function"""
        # Assumes every per-class variance is positive; a zero-variance
        # feature would cause division by zero here (scikit-learn guards
        # against this with its var_smoothing term, discussed later)
        mean = self.mean[class_idx]
        var = self.var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
    
    def _predict_single(self, x):
        """Predict class for a single sample"""
        posteriors = []
        
        for idx, c in enumerate(self.classes):
            # Log probabilities to avoid numerical underflow
            prior = np.log(self.priors[idx])
            likelihood = np.sum(np.log(self._gaussian_pdf(idx, x)))
            posterior = prior + likelihood
            posteriors.append(posterior)
        
        return self.classes[np.argmax(posteriors)]
    
    def predict(self, X):
        """Predict classes for multiple samples"""
        return np.array([self._predict_single(x) for x in X])

# Test the implementation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, 
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)
predictions = gnb.predict(X_test)
print(f"Custom Implementation Accuracy: {accuracy_score(y_test, predictions):.4f}")
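
As a cross-check on the custom class, scikit-learn's GaussianNB can be trained on the same synthetic split; its accuracy should be nearly identical to the from-scratch result (any tiny gap comes from sklearn's var_smoothing term):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same synthetic data and split as the test above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sk_gnb = GaussianNB()
sk_gnb.fit(X_train, y_train)
sk_acc = accuracy_score(y_test, sk_gnb.predict(X_test))
print(f"scikit-learn GaussianNB Accuracy: {sk_acc:.4f}")
```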

Using Scikit-learn’s GaussianNB

For production use, leverage scikit-learn’s optimized implementation:

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)
y_pred_proba = gnb.predict_proba(X_test)

# Evaluate
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Show probability predictions for first 5 samples
print("\nPrediction Probabilities (first 5 samples):")
print(pd.DataFrame(y_pred_proba[:5], columns=iris.target_names))

Handling Real-World Scenarios

Feature scaling typically doesn’t affect Gaussian Naive Bayes since it models each feature’s distribution independently. However, features with zero variance cause division by zero. Scikit-learn guards against this by adding a small constant to every variance (var_smoothing, which defaults to 1e-9 times the largest feature variance):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import time

# Compare with and without scaling
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
gnb_unscaled = GaussianNB()
start = time.time()
scores_unscaled = cross_val_score(gnb_unscaled, X_train, y_train, cv=5)
time_unscaled = time.time() - start

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
gnb_scaled = GaussianNB()
start = time.time()
scores_scaled = cross_val_score(gnb_scaled, X_train_scaled, y_train, cv=5)
time_scaled = time.time() - start

print(f"Without scaling: {scores_unscaled.mean():.4f} (+/- {scores_unscaled.std():.4f}) - {time_unscaled:.4f}s")
print(f"With scaling: {scores_scaled.mean():.4f} (+/- {scores_scaled.std():.4f}) - {time_scaled:.4f}s")
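
To see var_smoothing at work, here is a tiny illustrative dataset where the second feature has zero variance in every class; the naive formula would divide by zero, yet GaussianNB fits and predicts cleanly:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Feature 1 is constant (3.0) everywhere, so its per-class variance is zero
X = np.array([[1.0, 3.0], [2.0, 3.0], [8.0, 3.0], [9.0, 3.0]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()            # var_smoothing=1e-9 by default
model.fit(X, y)                 # no division-by-zero error
preds = model.predict([[1.5, 3.0], [8.5, 3.0]])
print(preds)                    # classified by the informative first feature
```

The constant feature contributes the same (smoothed) density to every class, so it simply cancels out of the comparison.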

# Handle imbalanced datasets by supplying class priors

# Adjust priors based on domain knowledge
gnb_adjusted = GaussianNB(priors=[0.7, 0.3])  # If you know class distribution
gnb_adjusted.fit(X_train, y_train)

Performance Comparison and Best Practices

Gaussian Naive Bayes offers significant computational advantages. Let’s benchmark it:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import time

# Generate larger dataset
X, y = make_classification(n_samples=50000, n_features=50, n_informative=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Gaussian NB': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = []
for name, model in models.items():
    # Training time
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    
    # Prediction time
    start = time.time()
    y_pred = model.predict(X_test)
    pred_time = time.time() - start
    
    accuracy = accuracy_score(y_test, y_pred)
    results.append({
        'Model': name,
        'Accuracy': f"{accuracy:.4f}",
        'Train Time (s)': f"{train_time:.4f}",
        'Predict Time (s)': f"{pred_time:.4f}"
    })

print(pd.DataFrame(results))

Choose Gaussian Naive Bayes when:

  • You need fast training and prediction with limited computational resources
  • You have high-dimensional data (hundreds or thousands of features)
  • Features are roughly continuous and independent
  • You need a probabilistic classifier with interpretable outputs
  • You’re building a baseline model before trying complex algorithms

Avoid it when:

  • Features have strong correlations (violates independence assumption)
  • Features are clearly non-Gaussian (consider kernel density estimation)
  • You need state-of-the-art accuracy and have sufficient data for complex models
  • You have categorical features (use Multinomial or Categorical NB instead)

The algorithm’s speed makes it ideal for real-time applications, online learning scenarios, and as a quick baseline to beat. Its simplicity means fewer hyperparameters to tune and less risk of overfitting on small datasets. While more sophisticated algorithms often achieve higher accuracy, Gaussian Naive Bayes remains a practical choice when speed, interpretability, and simplicity matter.
