How to Implement Multinomial Naive Bayes in Python

Key Insights

  • Multinomial Naive Bayes excels at text classification by treating documents as bags of word counts, making it ideal for spam detection, sentiment analysis, and document categorization with minimal computational overhead.
  • The algorithm’s “naive” assumption of feature independence rarely holds in practice, yet it consistently delivers strong performance because it only needs to get the ranking of probabilities correct, not their exact values.
  • Laplace smoothing (controlled by the alpha parameter) is essential to prevent zero probabilities when encountering unseen words, and tuning this hyperparameter can significantly impact model performance on sparse datasets.

Introduction to Multinomial Naive Bayes

Multinomial Naive Bayes (MNB) is a probabilistic classifier based on Bayes’ theorem with the “naive” assumption that features are conditionally independent given the class label. Despite this unrealistic assumption, MNB performs remarkably well for text classification tasks where features represent discrete counts—specifically, word frequencies in documents.

The algorithm shines in scenarios like spam email detection, sentiment analysis, document categorization, and topic classification. Its effectiveness stems from working with discrete feature counts (how many times each word appears) rather than binary presence/absence indicators. This makes it particularly suited for the multinomial distribution of word counts in text data.

The key advantage of MNB is computational efficiency. Training is fast, prediction is near-instantaneous, and it handles high-dimensional sparse data elegantly—critical when dealing with vocabularies containing thousands or millions of unique words.

Understanding the Mathematics

At its core, Multinomial Naive Bayes applies Bayes’ theorem:

P(class|document) = P(document|class) × P(class) / P(document)

For classification, we only need to compare probabilities across classes, so we can ignore the denominator and focus on:

P(class|document) ∝ P(class) × P(document|class)

The multinomial assumption means we model the document as a bag of word counts drawn from a multinomial distribution. For a document with counts n₁, n₂, …, nᵥ over a vocabulary of V words, the likelihood factorizes (dropping the multinomial coefficient, which is constant across classes) as:

P(document|class) = P(n₁, n₂, …, nᵥ|class) ∝ ∏ᵢ P(wordᵢ|class)^nᵢ

Where nᵢ is the count of word i in the document.

Laplace smoothing (additive smoothing) prevents zero probabilities when a word appears in the test set but never appeared with a particular class in training. We add a small constant α (typically 1.0) to all word counts:

P(word|class) = (count(word, class) + α) / (count(all words, class) + α × vocabulary_size)
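To see the effect numerically, here is a toy calculation with made-up counts for a four-word vocabulary, where one word never appears in the class:

```python
import numpy as np

# Toy word counts for one class (values are made up for illustration);
# the second word was never observed with this class
word_counts = np.array([30, 0, 10, 60])
alpha = 1.0
vocab_size = len(word_counts)
total = word_counts.sum()  # 100

# Without smoothing, the unseen word gets probability exactly 0,
# which would zero out the whole document likelihood
unsmoothed = word_counts / total

# With Laplace smoothing, every word gets a small nonzero probability
# and the distribution still sums to 1
smoothed = (word_counts + alpha) / (total + alpha * vocab_size)

print("Unsmoothed:", unsmoothed)
print("Smoothed:  ", smoothed)
```

Note how smoothing barely changes the frequent words but rescues the unseen one from probability zero.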

Here’s a simple demonstration of Bayes’ theorem with toy data:

import numpy as np

# Toy example: P(Spam|contains "free")
# Prior probabilities
p_spam = 0.3
p_ham = 0.7

# Likelihoods: P("free"|Spam) and P("free"|Ham)
p_free_given_spam = 0.8
p_free_given_ham = 0.1

# P("free") = P("free"|Spam)*P(Spam) + P("free"|Ham)*P(Ham)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior: P(Spam|"free")
p_spam_given_free = (p_free_given_spam * p_spam) / p_free

print(f"P(Spam|contains 'free'): {p_spam_given_free:.3f}")
print(f"P(Ham|contains 'free'): {1 - p_spam_given_free:.3f}")

Preparing Text Data for Classification

Text must be converted into numerical features before feeding it to the algorithm. The standard pipeline involves tokenization (splitting text into words) and vectorization (converting text to word count vectors).

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Load sample dataset - using 4 categories for faster training
categories = ['alt.atheism', 'soc.religion.christian', 
              'comp.graphics', 'sci.med']

# Fetch training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      categories=categories,
                                      remove=('headers', 'footers', 'quotes'))

# Fetch test data
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories,
                                     remove=('headers', 'footers', 'quotes'))

# Create train/test splits
X_train, X_test = newsgroups_train.data, newsgroups_test.data
y_train, y_test = newsgroups_train.target, newsgroups_test.target

# Vectorize text into word count features
vectorizer = CountVectorizer(max_features=5000,  # limit vocabulary size
                            stop_words='english',  # remove common words
                            min_df=2,  # ignore words appearing in < 2 docs
                            max_df=0.8)  # ignore words in > 80% of docs

X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

print(f"Training set shape: {X_train_counts.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

The CountVectorizer parameters are crucial. max_features limits vocabulary size to control dimensionality. stop_words='english' removes common words like “the” and “is” that add noise. min_df and max_df filter out rare and overly common words respectively.

Building the Model with Scikit-learn

Scikit-learn’s implementation makes training and prediction straightforward:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the model
mnb = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing parameter
mnb.fit(X_train_counts, y_train)

# Make predictions
y_pred = mnb.predict(X_test_counts)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, 
                          target_names=newsgroups_test.target_names))

# Get prediction probabilities
y_proba = mnb.predict_proba(X_test_counts)
print(f"\nProbabilities shape: {y_proba.shape}")
print(f"Sample probabilities for first document:\n{y_proba[0]}")

The model typically achieves 85-90% accuracy on this dataset. The predict_proba method returns probability estimates for each class, useful for ranking predictions by confidence.
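Here is a small self-contained sketch of ranking by confidence, using a toy spam/ham corpus (the documents and labels are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["free money now", "win a free prize",
              "meeting at noon", "project schedule update"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_labels)

test_docs = ["free prize money", "noon meeting", "schedule a prize"]
proba = clf.predict_proba(vec.transform(test_docs))

# Rank test documents by confidence in the spam class
spam_col = list(clf.classes_).index(1)
order = np.argsort(proba[:, spam_col])[::-1]
for i in order:
    print(f"{proba[i, spam_col]:.3f}  {test_docs[i]}")
```

Sorting on a single column of predict_proba like this is handy when you only want to act on the most confident predictions (e.g. auto-filing only high-confidence spam).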

Implementing from Scratch

Understanding the internals helps debug issues and customize behavior:

import numpy as np
from collections import defaultdict

class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_log_prior = {}
        self.feature_log_prob = {}
        self.classes = None
        
    def fit(self, X, y):
        """Train the Multinomial Naive Bayes classifier"""
        self.classes = np.unique(y)
        n_samples = X.shape[0]
        
        for c in self.classes:
            # Get all documents for this class
            X_c = X[y == c]
            
            # Calculate log prior: log(P(class))
            self.class_log_prior[c] = np.log(X_c.shape[0] / n_samples)
            
            # Calculate log likelihood: log(P(word|class))
            # Sum word counts across all documents in class
            word_counts = np.asarray(X_c.sum(axis=0)).flatten()
            
            # Total words in class (with smoothing)
            total_words = word_counts.sum()
            vocab_size = X.shape[1]
            
            # Apply Laplace smoothing
            smoothed_counts = word_counts + self.alpha
            smoothed_total = total_words + self.alpha * vocab_size
            
            # Calculate log probabilities
            self.feature_log_prob[c] = np.log(smoothed_counts / smoothed_total)
        
        return self
    
    def predict(self, X):
        """Predict class labels for samples in X"""
        return np.array([self._predict_single(x) for x in X])
    
    def _predict_single(self, x):
        """Predict class for a single sample"""
        posteriors = {}
        
        for c in self.classes:
            # Start with log prior
            log_posterior = self.class_log_prior[c]
            
            # Add log likelihoods for each word
            # Multiply word count by log probability (in log space: add)
            x_array = x.toarray().ravel()
            log_posterior += np.sum(x_array * self.feature_log_prob[c])
            
            posteriors[c] = log_posterior
        
        # Return class with highest posterior probability
        return max(posteriors, key=posteriors.get)

# Train custom implementation
custom_mnb = MultinomialNaiveBayes(alpha=1.0)
custom_mnb.fit(X_train_counts, y_train)

# Predict and evaluate
y_pred_custom = custom_mnb.predict(X_test_counts)
custom_accuracy = accuracy_score(y_test, y_pred_custom)

print(f"Custom implementation accuracy: {custom_accuracy:.3f}")
print(f"Scikit-learn accuracy: {accuracy:.3f}")
print(f"Difference: {abs(custom_accuracy - accuracy):.6f}")

The custom implementation should match scikit-learn’s results closely (within floating-point precision).
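The per-document Python loop in _predict_single is easy to follow but slow. The whole computation collapses into a single matrix product, which is essentially what scikit-learn does internally. A sketch on made-up dense toy counts (with sparse input, the same `X @ log_prob.T` expression works):

```python
import numpy as np

# Toy setup: 4 documents, 3-word vocabulary, 2 classes (values made up)
X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 1],
              [0, 0, 2]])
y = np.array([0, 1, 0, 1])
alpha = 1.0

classes = np.unique(y)
log_prior = np.log(np.bincount(y) / len(y))

# Per-class smoothed word counts, stacked into (n_classes, n_features)
counts = np.vstack([X[y == c].sum(axis=0) for c in classes]) + alpha
log_prob = np.log(counts / counts.sum(axis=1, keepdims=True))

# All log posteriors at once: one matrix product replaces the Python loop
joint = X @ log_prob.T + log_prior
pred = classes[np.argmax(joint, axis=1)]
print(pred)  # [0 1 0 1]
```

Because the heavy lifting is a single (sparse) matrix multiply, prediction stays fast even with hundreds of thousands of features.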

Model Evaluation and Hyperparameter Tuning

The alpha parameter controls smoothing strength. Too little smoothing (alpha → 0) risks zero probabilities; too much (large alpha) over-smooths and loses discriminative power.

from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

# Grid search for optimal alpha
param_grid = {'alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}

grid_search = GridSearchCV(MultinomialNB(), param_grid, 
                          cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_counts, y_train)

print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

# Plot alpha vs accuracy
results = grid_search.cv_results_
plt.figure(figsize=(10, 6))
plt.semilogx(param_grid['alpha'], results['mean_test_score'], marker='o')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Cross-validation Accuracy')
plt.title('Hyperparameter Tuning: Alpha vs Accuracy')
plt.grid(True)
plt.show()

# Confusion matrix with best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_counts)

cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=newsgroups_test.target_names,
           yticklabels=newsgroups_test.target_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

Practical Tips and Limitations

When to use Multinomial Naive Bayes:

  • Text classification with large vocabularies
  • Real-time prediction requirements (extremely fast inference)
  • Limited training data (works well with small datasets)
  • Baseline model before trying complex algorithms

Handling imbalanced datasets: Scikit-learn’s MultinomialNB doesn’t support per-class weights directly, but you can override the learned priors via its class_prior parameter, oversample minority classes, or use stratified sampling.
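For example, MultinomialNB accepts a class_prior argument that replaces the priors estimated from (imbalanced) class frequencies. A minimal sketch with made-up counts, three documents of one class against one of the other:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0], [2, 1], [4, 0], [0, 2]])  # toy word counts
y = np.array([0, 0, 0, 1])                       # imbalanced: 3 vs 1

# Default: priors estimated from class frequencies (0.75 vs 0.25)
default_nb = MultinomialNB().fit(X, y)

# Override with uniform priors so the minority class isn't
# penalized up front by its rarity in the training set
uniform_nb = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, y)

print(np.exp(default_nb.class_log_prior_))
print(np.exp(uniform_nb.class_log_prior_))
```

This only rebalances the prior term; the per-word likelihoods are still estimated from whatever data each class has.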

Performance optimization: MNB trains in linear time O(n × d) where n is samples and d is features. Prediction is also linear. For massive datasets, use sparse matrices (scipy.sparse) to reduce memory usage—scikit-learn handles this automatically.

Key limitations:

  • Assumes feature independence (rarely true for text where word order and context matter)
  • Requires non-negative features such as word counts (fractional counts like tf-idf also work in practice; use GaussianNB for continuous real-valued features)
  • Can’t capture feature interactions or complex patterns
  • Probability estimates are often poorly calibrated
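If you need trustworthy probabilities rather than just rankings, one standard remedy is to wrap the model in scikit-learn’s CalibratedClassifierCV. A sketch on synthetic count data (the Poisson-generated counts are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
# Synthetic word counts: two classes with different word-frequency profiles
X0 = rng.poisson([5, 1, 1], size=(50, 3))
X1 = rng.poisson([1, 1, 5], size=(50, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Refit MNB inside cross-validation and recalibrate its probability
# estimates with Platt scaling (method='sigmoid')
calibrated = CalibratedClassifierCV(MultinomialNB(), method='sigmoid', cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)
print(proba[:2])
```

Calibration leaves the class ranking largely intact but pulls MNB’s characteristically overconfident probabilities toward realistic values.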

Despite limitations, Multinomial Naive Bayes remains a strong baseline that’s hard to beat for speed and simplicity. It often outperforms more complex models on small datasets and high-dimensional sparse data. Always start with MNB for text classification—you might not need anything fancier.
