How to Implement Naive Bayes in Python

Key Insights

  • Naive Bayes achieves 80-90% accuracy on text classification tasks while training 10-100x faster than deep learning models, making it ideal for baseline models and production systems with limited resources
  • The “naive” independence assumption rarely holds in practice, yet the algorithm performs remarkably well because it only needs to rank classes correctly, not estimate exact probabilities
  • Choose Gaussian NB for continuous features, Multinomial for word counts or frequencies, and Bernoulli for binary features—using the wrong variant can reduce accuracy by 20% or more

Introduction to Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes’ theorem with a strong independence assumption between features. Despite this “naive” assumption that all features are independent given the class label, the algorithm performs surprisingly well in practice, especially for text classification tasks.

The algorithm calculates the probability of each class given the input features and selects the class with the highest probability. It’s called “naive” because it assumes features don’t influence each other—for example, in spam detection, it assumes the presence of the word “free” is independent of the word “money,” which is clearly not true. Yet this simplification makes the algorithm computationally efficient and remarkably effective.

Common applications include spam filtering, sentiment analysis, document categorization, medical diagnosis, and recommendation systems. Its speed and simplicity make it an excellent baseline model before investing in more complex algorithms.

Mathematical Foundation

Bayes’ theorem forms the core of this classifier:

P(A|B) = P(B|A) * P(A) / P(B)

In classification terms:

  • P(A|B): Posterior probability—probability of class A given features B
  • P(B|A): Likelihood—probability of features B given class A
  • P(A): Prior probability—probability of class A in the dataset
  • P(B): Evidence—probability of observing features B

Let’s implement a simple example:

import numpy as np

# Toy dataset: hours studied vs pass/fail
# [hours_studied, passed (1) or failed (0)]
data = np.array([
    [2, 0], [3, 0], [4, 1], [5, 1], 
    [6, 1], [7, 1], [8, 1], [9, 1]
])

# Calculate P(Pass) - prior probability
total_samples = len(data)
passed = np.sum(data[:, 1] == 1)
p_pass = passed / total_samples
print(f"P(Pass) = {p_pass:.2f}")  # 0.75

# Calculate P(hours >= 6 | Pass) - likelihood
pass_samples = data[data[:, 1] == 1]
high_hours_and_pass = np.sum(pass_samples[:, 0] >= 6)
p_high_hours_given_pass = high_hours_and_pass / len(pass_samples)
print(f"P(hours >= 6 | Pass) = {p_high_hours_given_pass:.2f}")  # 0.67

# Calculate P(hours >= 6) - evidence
high_hours = np.sum(data[:, 0] >= 6)
p_high_hours = high_hours / total_samples
print(f"P(hours >= 6) = {p_high_hours:.2f}")  # 0.50

# Apply Bayes' theorem: P(Pass | hours >= 6)
p_pass_given_high_hours = (p_high_hours_given_pass * p_pass) / p_high_hours
print(f"P(Pass | hours >= 6) = {p_pass_given_high_hours:.2f}")  # 1.00

This demonstrates how we update our belief about passing based on observed study hours.

Types of Naive Bayes Classifiers

Gaussian Naive Bayes assumes features follow a normal distribution. Use it for continuous numerical data like sensor readings, measurements, or real-valued features. It calculates mean and standard deviation for each feature per class.

Multinomial Naive Bayes works with discrete counts or frequencies. It’s the go-to choice for text classification where features represent word counts or TF-IDF scores. It models the probability of observing count data.

Bernoulli Naive Bayes handles binary features (present/absent). Use it when you only care whether a feature exists, not how many times. Common in document classification with binary word occurrence vectors.

Choosing the wrong type can significantly hurt performance. Using Gaussian NB on text data or Multinomial NB on continuous measurements will produce poor results.
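As an illustrative sketch (synthetic data with hypothetical feature shapes, not a benchmark), each scikit-learn variant pairs with a different feature type:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=100)

# Continuous measurements -> Gaussian NB
X_continuous = rng.normal(loc=y[:, None], scale=1.0, size=(100, 3))
gnb = GaussianNB().fit(X_continuous, y)

# Non-negative counts (e.g., word counts) -> Multinomial NB
X_counts = rng.poisson(lam=y[:, None] + 1, size=(100, 3))
mnb = MultinomialNB().fit(X_counts, y)

# Binary presence/absence -> Bernoulli NB
X_binary = (X_counts > 1).astype(int)
bnb = BernoulliNB().fit(X_binary, y)

print(gnb.score(X_continuous, y),
      mnb.score(X_counts, y),
      bnb.score(X_binary, y))
```

The point is the pairing, not the scores: feeding the count matrix to GaussianNB, or the continuous matrix to MultinomialNB (which rejects negative values), is exactly the mismatch that degrades accuracy.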

Implementation from Scratch

Building a Gaussian Naive Bayes classifier from scratch reveals its simplicity:

import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.parameters = {}
        
        for c in self.classes:
            X_c = X[y == c]
            self.parameters[c] = {
                'mean': X_c.mean(axis=0),
                'std': X_c.std(axis=0),
                'prior': len(X_c) / len(X)
            }
    
    def _gaussian_probability(self, x, mean, std):
        """Calculate the Gaussian probability density."""
        std = std + 1e-9  # guard against zero variance within a class
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent
    
    def _calculate_class_probability(self, x, c):
        """Calculate P(class) * P(x1|class) * P(x2|class) * ..."""
        params = self.parameters[c]
        prior = np.log(params['prior'])
        
        # Sum log probabilities to avoid underflow
        likelihood = np.sum(np.log(
            self._gaussian_probability(x, params['mean'], params['std'])
        ))
        return prior + likelihood
    
    def predict(self, X):
        predictions = []
        for x in X:
            probabilities = [
                self._calculate_class_probability(x, c) 
                for c in self.classes
            ]
            predictions.append(self.classes[np.argmax(probabilities)])
        return np.array(predictions)

# Test with synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, 
                          n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy:.2f}")  # ~0.90

This implementation captures the essence: calculate mean and standard deviation per class, then use Gaussian probability to classify new samples.
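As a quick cross-check (same synthetic data, assuming scikit-learn is installed), sklearn's own GaussianNB should land in the same accuracy range as the scratch version:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # sklearn's implementation

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

sk_model = GaussianNB().fit(X_train, y_train)
print(f"sklearn accuracy: {sk_model.score(X_test, y_test):.2f}")
```

Matching accuracy within a point or two is a reasonable sanity check that the scratch implementation's per-class means, standard deviations, and priors are computed correctly.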

Using Scikit-learn

For production systems, use scikit-learn’s optimized implementations. Here’s a complete spam detection example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Sample email dataset
emails = [
    "Free money now! Click here!", "Meeting at 3pm tomorrow",
    "Claim your prize today!", "Project deadline reminder",
    "You won the lottery!", "Lunch next week?",
    "Buy now! Limited offer!", "Code review scheduled",
    "Get rich quick scheme", "Team standup at 10am",
    "Congratulations! You won!", "Please review the document",
    "Cheap meds online", "Calendar invite sent",
    "Hot singles in your area", "Quarterly report attached"
]

labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 1=spam, 0=ham

# Convert text to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42
)

# Train model
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {classifier.score(X_test, y_test):.2f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Predict new emails
new_emails = ["Free prize money!", "Meeting notes from yesterday"]
new_X = vectorizer.transform(new_emails)
predictions = classifier.predict(new_X)
print(f"\nPredictions: {['Spam' if p == 1 else 'Ham' for p in predictions]}")

Handling Real-World Challenges

The zero probability problem occurs when a feature value never appears with a class during training. This causes the entire probability calculation to become zero. Laplace smoothing (additive smoothing) solves this by adding a small constant to all counts:

# Minimal smoothing (alpha close to zero)
nb_low_smooth = MultinomialNB(alpha=0.01)
nb_low_smooth.fit(X_train, y_train)

# Standard Laplace smoothing (alpha=1.0 is scikit-learn's default)
nb_smooth = MultinomialNB(alpha=1.0)
nb_smooth.fit(X_train, y_train)

print(f"Low smoothing accuracy: {nb_low_smooth.score(X_test, y_test):.2f}")
print(f"Standard smoothing accuracy: {nb_smooth.score(X_test, y_test):.2f}")
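The arithmetic behind additive smoothing is simple. A sketch with hypothetical counts (not taken from the spam dataset above):

```python
# Hypothetical training counts: the word "lottery" never appears in ham
word_count_in_class = 0     # times the word appeared in this class
total_words_in_class = 120  # total word tokens seen in this class
vocab_size = 50             # number of distinct words in the vocabulary
alpha = 1.0                 # Laplace smoothing constant

# P(word | class) with additive smoothing: small, but never exactly zero
p_word = (word_count_in_class + alpha) / (total_words_in_class + alpha * vocab_size)
print(f"P(word | class) = {p_word:.4f}")  # 1/170 ~ 0.0059
```

Without the alpha terms, this probability would be 0/120 = 0, and multiplying it into the likelihood product would zero out the entire class score regardless of the other features.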

In principle, Naive Bayes handles missing data gracefully: a missing feature can simply be dropped from the probability product. In practice, scikit-learn's implementations require complete inputs, so impute missing values before fitting; imputation often improves results anyway.
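A minimal imputation sketch (mean imputation is just one option; the tiny array here is purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# Replace each NaN with its column mean before fitting Naive Bayes
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # NaNs become the column means 1.5 and 3.5
```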

When features are highly correlated, consider feature selection or dimensionality reduction before applying Naive Bayes. Despite violations of the independence assumption, the algorithm often remains effective because it only needs to rank classes correctly, not estimate exact probabilities.
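One way to prune correlated or redundant features beforehand is univariate selection. A sketch using chi-squared scoring, which suits the non-negative count features Multinomial NB expects (random data here, so the selection itself is arbitrary):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 10))  # non-negative count features
y = rng.integers(0, 2, size=100)

# Keep the 4 features most associated with the labels
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (100, 4)
```

On real data, choose k by cross-validation; PCA is an alternative for continuous features, though its components are no longer interpretable as original features.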

Performance Comparison and Best Practices

Naive Bayes excels when you need fast training and prediction with limited data. Here’s a comparison with Logistic Regression:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import time

# Load larger dataset
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
X = TfidfVectorizer(max_features=1000).fit_transform(newsgroups.data)
y = newsgroups.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Naive Bayes
start = time.time()
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_time = time.time() - start
nb_accuracy = nb.score(X_test, y_test)

# Logistic Regression
start = time.time()
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_time = time.time() - start
lr_accuracy = lr.score(X_test, y_test)

print(f"Naive Bayes - Time: {nb_time:.4f}s, Accuracy: {nb_accuracy:.2f}")
print(f"Logistic Reg - Time: {lr_time:.4f}s, Accuracy: {lr_accuracy:.2f}")

Use Naive Bayes when:

  • You need a fast baseline model
  • Training data is limited
  • Features are relatively independent
  • Working with text classification
  • Real-time predictions are required

Avoid Naive Bayes when:

  • Features are highly correlated
  • You need probability calibration
  • Maximum accuracy is critical regardless of speed
  • Working with complex non-linear relationships
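On the calibration point: Naive Bayes tends to push predicted probabilities toward 0 or 1 even when its class rankings are good. If you need trustworthy probabilities, one option is to wrap it in scikit-learn's CalibratedClassifierCV (a sketch on synthetic data):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Calibrate NB's probabilities with a sigmoid (Platt) fit via 3-fold CV
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)
print(proba[:2])  # rows are [P(class 0), P(class 1)]
```

This keeps NB's speed for feature processing while making predict_proba outputs usable for thresholding or risk scores.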

Naive Bayes remains one of the most practical algorithms in machine learning. Its speed, simplicity, and effectiveness on text data make it indispensable for rapid prototyping and production systems where computational resources are constrained. Start with Naive Bayes, establish a baseline, then decide if the complexity of more advanced algorithms is justified.
