Naive Bayes: Complete Guide with Examples
Key Insights
- Naive Bayes achieves surprisingly strong performance despite its “naive” independence assumption, often matching complex models while training 10-100x faster
- The algorithm excels at high-dimensional problems like text classification where feature counts far exceed sample sizes—a scenario where many algorithms struggle
- Understanding the three variants (Gaussian, Multinomial, Bernoulli) is crucial: using the wrong type for your data distribution will tank performance regardless of tuning
Introduction to Naive Bayes
Naive Bayes is a probabilistic classifier that punches well above its weight. Despite making an unrealistic assumption—that all features are independent—it consistently delivers competitive results across text classification, spam filtering, sentiment analysis, and medical diagnosis tasks.
The algorithm’s foundation is Bayes’ theorem from probability theory, combined with a “naive” independence assumption that simplifies calculations dramatically. This simplification is what makes Naive Bayes fast enough to handle millions of features while requiring minimal training data.
The real-world applications are extensive. Email providers use it to filter billions of spam messages daily. News aggregators classify articles into categories. Healthcare systems assist in disease diagnosis based on symptoms. The common thread: problems where you need to classify something based on multiple observed features.
Mathematical Foundation
Bayes’ theorem describes the probability of an event based on prior knowledge of conditions related to that event:
P(A|B) = P(B|A) × P(A) / P(B)
For classification, we want to find the probability of a class given observed features:
P(class|features) = P(features|class) × P(class) / P(features)
The naive assumption treats all features as independent, so:
P(features|class) = P(f₁|class) × P(f₂|class) × ... × P(fₙ|class)
This independence assumption is almost never true in reality—word frequencies in documents are correlated, pixel values in images depend on neighbors—but it works anyway. Why? Because we only care about which class has the highest probability, not the exact probability values.
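A practical consequence of multiplying many per-feature likelihoods: the product underflows 64-bit floats long before the comparison stops being meaningful, which is why implementations (including the from-scratch one later in this guide) sum log-probabilities instead. A quick illustration:

```python
import numpy as np

# 1000 features, each with likelihood 1e-5: the product underflows to zero
probs = np.full(1000, 1e-5)
print(np.prod(probs))  # 0.0 -- the true value 1e-5000 is far below float range

# Summing logs preserves the same argmax without underflow
log_score = np.sum(np.log(probs))
print(log_score)  # a large negative but finite, comparable number
```

Because log is monotonic, the class with the highest log-score is always the class with the highest probability, so nothing is lost by working in log space.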
Here’s a concrete example with toy data:
import numpy as np
# Toy dataset: [height (cm), weight (kg), shoe_size]
# Labels: 0 = female, 1 = male
data = np.array([
    [165, 55, 37, 0],
    [170, 65, 39, 0],
    [180, 75, 42, 1],
    [175, 70, 41, 1],
    [160, 50, 36, 0],
    [185, 80, 43, 1]
])
# Calculate P(male) and P(female)
p_male = np.sum(data[:, 3] == 1) / len(data)
p_female = np.sum(data[:, 3] == 0) / len(data)
print(f"P(male) = {p_male:.2f}, P(female) = {p_female:.2f}")
# For a new person: [172, 68, 40]
# Calculate P(features|male) using Gaussian distribution
male_data = data[data[:, 3] == 1][:, :3]
male_mean = male_data.mean(axis=0)
male_std = male_data.std(axis=0)
test_features = np.array([172, 68, 40])
likelihood_male = np.prod(
    1 / (male_std * np.sqrt(2 * np.pi)) *
    np.exp(-0.5 * ((test_features - male_mean) / male_std) ** 2)
)
print(f"P(features|male) ∝ {likelihood_male:.2e}")
# Posterior: P(male|features) ∝ likelihood_male × p_male
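To finish the comparison, the same Gaussian computation for the female class plus a normalization step yields actual posteriors; since P(features) is identical for both classes, dividing by the sum of the two unnormalized scores is enough. A self-contained sketch that repeats the toy setup (note that for this borderline test point the two classes may come out closer than intuition suggests; the argmax, not the raw value, drives the decision):

```python
import numpy as np

# Same toy dataset as above: [height (cm), weight (kg), shoe_size, label]
data = np.array([
    [165, 55, 37, 0], [170, 65, 39, 0], [160, 50, 36, 0],
    [180, 75, 42, 1], [175, 70, 41, 1], [185, 80, 43, 1],
])
test_features = np.array([172, 68, 40])

def class_posterior(label):
    """Unnormalized P(label|features) = P(features|label) * P(label)."""
    X_c = data[data[:, 3] == label][:, :3]
    mean, std = X_c.mean(axis=0), X_c.std(axis=0)
    likelihood = np.prod(
        np.exp(-0.5 * ((test_features - mean) / std) ** 2)
        / (std * np.sqrt(2 * np.pi))
    )
    prior = len(X_c) / len(data)
    return likelihood * prior

posteriors = {c: class_posterior(c) for c in (0, 1)}
total = sum(posteriors.values())  # normalizing constant P(features)
for c in (0, 1):
    name = "male" if c == 1 else "female"
    print(f"P({name}|features) = {posteriors[c] / total:.3f}")
```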
Types of Naive Bayes Classifiers
Choosing the right variant is critical. Each assumes a different probability distribution for features.
Gaussian Naive Bayes assumes features follow a normal distribution. Use it for continuous measurements like sensor readings, physical measurements, or financial indicators.
Multinomial Naive Bayes works with discrete counts. It’s the go-to for text classification where features represent word frequencies or TF-IDF scores.
Bernoulli Naive Bayes handles binary features (present/absent). Use it when you only care whether a feature exists, not how many times.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.datasets import load_iris, fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Gaussian NB: Continuous features (Iris dataset)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, gnb.predict(X_test)):.3f}")
# Multinomial NB: Text classification
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(newsgroups.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_text, newsgroups.target, test_size=0.3, random_state=42
)
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
print(f"Multinomial NB Accuracy: {accuracy_score(y_test, mnb.predict(X_test)):.3f}")
# Bernoulli NB: Binary features
X_binary = (X_text > 0).astype(int) # Convert counts to binary
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, newsgroups.target, test_size=0.3, random_state=42
)
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
print(f"Bernoulli NB Accuracy: {accuracy_score(y_test, bnb.predict(X_test)):.3f}")
Implementation from Scratch
Building Naive Bayes from scratch reveals its elegant simplicity:
import numpy as np

class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_log_prior = {}
        self.feature_log_prob = {}

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        for c in self.classes:
            # Class prior: P(class)
            n_c = np.sum(y == c)
            self.class_log_prior[c] = np.log(n_c / n_samples)
            # Feature likelihoods: P(feature|class) with Laplace smoothing
            X_c = X[y == c]
            feature_counts = X_c.sum(axis=0) + self.alpha
            total_count = feature_counts.sum()
            self.feature_log_prob[c] = np.log(feature_counts / total_count)

    def predict(self, X):
        predictions = []
        for x in X:
            class_scores = {}
            for c in self.classes:
                # Log probability: log P(class) + sum(count_i * log P(feature_i|class))
                score = self.class_log_prior[c]
                score += np.sum(x * self.feature_log_prob[c])
                class_scores[c] = score
            predictions.append(max(class_scores, key=class_scores.get))
        return np.array(predictions)
# Test on simple text data
from sklearn.feature_extraction.text import CountVectorizer
docs = [
    "buy cheap watches now",
    "limited time offer buy",
    "meeting scheduled tomorrow",
    "project deadline next week",
    "click here buy now",
    "team sync at 3pm"
]
labels = np.array([1, 1, 0, 0, 1, 0]) # 1=spam, 0=ham
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()
nb = MultinomialNaiveBayes(alpha=1.0)
nb.fit(X, labels)
test_docs = ["buy now limited", "meeting at 3pm"]
X_test = vectorizer.transform(test_docs).toarray()
predictions = nb.predict(X_test)
print(f"Predictions: {predictions}") # [1, 0] - spam, ham
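As a sanity check, scikit-learn's MultinomialNB with the same alpha should reproduce the scratch implementation's predictions on this corpus, since both compute the same smoothed log-probabilities:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same toy corpus and labels as the from-scratch example
docs = [
    "buy cheap watches now", "limited time offer buy",
    "meeting scheduled tomorrow", "project deadline next week",
    "click here buy now", "team sync at 3pm",
]
labels = np.array([1, 1, 0, 0, 1, 0])  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# alpha=1.0 matches the Laplace smoothing in the scratch version
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

X_test = vectorizer.transform(["buy now limited", "meeting at 3pm"])
print(clf.predict(X_test))  # [1 0] - agrees with the scratch classifier
```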
Real-World Application: Email Spam Detection
Let’s build a complete spam classifier with proper evaluation:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
# Load spam dataset (using SMS Spam Collection as example)
# Download from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
# Feature extraction with TF-IDF
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english',
                             ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train classifier
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_vec, y_train)
# Evaluate
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Show most informative features
feature_names = vectorizer.get_feature_names_out()
spam_prob = clf.feature_log_prob_[1]
top_spam_indices = spam_prob.argsort()[-10:][::-1]
print("\nTop spam indicators:")
for idx in top_spam_indices:
    print(f" {feature_names[idx]}: {np.exp(spam_prob[idx]):.4f}")
This classifier typically achieves 97%+ accuracy on spam detection. The TF-IDF vectorizer captures word importance better than raw counts, and the alpha parameter (Laplace smoothing) prevents zero probabilities for unseen words.
Advantages, Limitations, and Best Practices
Use Naive Bayes when:
- You have high-dimensional sparse data (text, genomics)
- Training data is limited (it needs fewer examples than discriminative models)
- You need fast training and prediction
- Features are relatively independent or independence violations don’t hurt much
Avoid it when:
- Feature dependencies are crucial (image classification, spatial data)
- You need probability calibration (raw probabilities are often overconfident)
- You have abundant data and computational resources (deep learning will likely win)
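On the calibration point: if you do need trustworthy probabilities from Naive Bayes, scikit-learn's CalibratedClassifierCV can wrap the model and rescale its outputs. A sketch on synthetic data with deliberately redundant (correlated) features, which is exactly the situation that makes Naive Bayes overconfident:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Redundant features violate the independence assumption on purpose
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=5)
calibrated.fit(X_train, y_train)

# Raw NB probabilities tend to pile up near 0 and 1; calibration tempers them
print("Raw mean max-prob:       ", raw.predict_proba(X_test).max(axis=1).mean())
print("Calibrated mean max-prob:", calibrated.predict_proba(X_test).max(axis=1).mean())
```

Isotonic calibration is a reasonable default here with a few thousand samples; for small datasets, method='sigmoid' is the safer choice because isotonic regression can overfit.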
Laplace smoothing is essential. Without it, a single unseen word can zero out the entire probability:
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Create simple dataset
X_train = np.array([[1, 2, 0], [0, 1, 3], [2, 0, 1]])
y_train = np.array([0, 1, 0])
X_test = np.array([[0, 0, 5]]) # Feature 2 never seen in class 0
# Effectively no smoothing (sklearn rejects alpha=0 exactly, so use a tiny value)
clf_no_smooth = MultinomialNB(alpha=1e-10)
clf_no_smooth.fit(X_train, y_train)
# With smoothing (alpha=1)
clf_smooth = MultinomialNB(alpha=1.0)
clf_smooth.fit(X_train, y_train)
print(f"Without smoothing: {clf_no_smooth.predict_proba(X_test)}")
print(f"With smoothing: {clf_smooth.predict_proba(X_test)}")
Comparison with Other Algorithms
Naive Bayes shines in specific scenarios. Here’s a benchmark:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time
# Load data
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVM': LinearSVC(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:20s} - Accuracy: {accuracy:.3f}, Time: {train_time:.2f}s")
Typical results show Naive Bayes trains 5-10x faster than Logistic Regression and 50-100x faster than Random Forest, with accuracy within 1-3% of the best model. For text classification with limited data, it often wins outright.
The algorithm’s speed and simplicity make it ideal for prototyping. Start with Naive Bayes to establish a baseline, then invest in complex models only if the performance gain justifies the cost. In many production systems, that investment never pays off—Naive Bayes remains the deployed solution because it’s fast, interpretable, and good enough.