How to Implement Sentiment Analysis in TensorFlow
Key Insights
- Sentiment analysis requires proper text preprocessing including tokenization, vocabulary mapping, and sequence padding to convert raw text into numerical tensors that neural networks can process
- Bidirectional LSTM architectures with embedding layers consistently outperform simpler approaches by capturing both forward and backward context in text sequences
- Pre-trained embeddings like GloVe can boost model accuracy by 5-10% compared to training embeddings from scratch, especially with limited training data
Introduction to Sentiment Analysis
Sentiment analysis is one of the most practical applications of natural language processing. Companies use it to monitor brand reputation on social media, analyze product reviews at scale, and prioritize customer support tickets based on emotional urgency. At its core, sentiment analysis classifies text into emotional categories.
Binary sentiment classification (positive vs. negative) is the most common approach and the one we’ll focus on here. Multi-class sentiment analysis extends this with a neutral class or fine-grained emotions like anger, joy, or sadness. For most business applications, binary classification provides sufficient signal while being easier to train and maintain.
TensorFlow’s high-level Keras API makes implementing production-ready sentiment analysis models straightforward. Let’s build one from scratch.
Dataset Preparation
Text data requires significant preprocessing before it can feed into a neural network. Neural networks operate on numbers, not strings, so we need a systematic way to convert text into numerical representations.
The IMDB movie reviews dataset is perfect for learning sentiment analysis. It contains 50,000 reviews labeled as positive or negative, split evenly for training and testing.
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Load dataset with top 10,000 most frequent words
vocab_size = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
# Sequences have variable lengths - let's check
print(f"First review length: {len(X_train[0])}")
print(f"Second review length: {len(X_train[1])}")
# Pad sequences to uniform length
max_length = 200
X_train_padded = pad_sequences(X_train, maxlen=max_length, padding='post', truncating='post')
X_test_padded = pad_sequences(X_test, maxlen=max_length, padding='post', truncating='post')
print(f"Padded shape: {X_train_padded.shape}")
The num_words parameter limits the vocabulary to the 10,000 most frequent words, reducing model complexity and training time. Words outside this vocabulary are not dropped; imdb.load_data replaces them with a special out-of-vocabulary token (index 2 by default). The pad_sequences function ensures all reviews have the same length by truncating longer sequences and padding shorter ones with zeros.
Setting max_length=200 is a pragmatic choice. Longer sequences capture more context but increase computational cost. For movie reviews, 200 words typically capture the essential sentiment while keeping training efficient.
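To make the padding and truncation behavior concrete, here is the same logic in plain Python (a sketch of what pad_sequences does with padding='post' and truncating='post', not the Keras implementation itself):

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post', truncating='post')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncate long sequences
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad short ones with zeros
    return padded

print(pad_post([[5, 8, 3], [9, 1, 4, 7, 2, 6]], maxlen=4))
# [[5, 8, 3, 0], [9, 1, 4, 7]]
```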
Building the Model Architecture
The architecture for sentiment analysis typically follows a pattern: embedding layer → sequence processing layer → dense classification layer.
Embedding layers convert integer-encoded words into dense vector representations. Instead of one-hot encoding (which creates sparse 10,000-dimensional vectors), embeddings map each word to a learned 128-dimensional space where semantically similar words cluster together.
Bidirectional LSTMs process sequences in both forward and backward directions, capturing context from both sides of each word. This is crucial for sentiment analysis where phrases like “not good” have opposite meaning to “good.”
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
def create_sentiment_model(vocab_size, embedding_dim=128, max_length=200):
    model = Sequential([
        # Embedding layer: maps word indices to dense vectors
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_dim,
                  input_length=max_length),
        # Bidirectional LSTM: processes sequence forward and backward
        Bidirectional(LSTM(64, return_sequences=False)),
        # Dropout for regularization
        Dropout(0.5),
        # Dense layer for classification
        Dense(64, activation='relu'),
        Dropout(0.5),
        # Output layer: sigmoid for binary classification
        Dense(1, activation='sigmoid')
    ])
    return model
model = create_sentiment_model(vocab_size)
model.summary()
The return_sequences=False parameter in LSTM means we only want the final output, not outputs at each time step. For classification, we only need the final hidden state that summarizes the entire sequence.
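The difference is easiest to see from the output shapes (a quick sketch using random inputs shaped like the model's embedded sequences; the bidirectional layer outputs 128 dimensions because the 64-unit forward and backward states are concatenated):

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Bidirectional

x = tf.random.normal((2, 200, 128))  # (batch, time steps, embedding dim)

# return_sequences=True: one 128-dim output (64 forward + 64 backward) per time step
print(Bidirectional(LSTM(64, return_sequences=True))(x).shape)   # (2, 200, 128)

# return_sequences=False: only the final summary vector for each review
print(Bidirectional(LSTM(64, return_sequences=False))(x).shape)  # (2, 128)
```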
Dropout layers with 0.5 probability randomly disable half the neurons during training, preventing overfitting. This is essential for text classification where models easily memorize training examples.
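Dropout's effect can be sketched in NumPy. Keras uses "inverted" dropout: surviving activations are scaled up by 1/(1 - rate) during training so the expected value stays the same, and the layer becomes a no-op at inference time.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the survivors."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.5, rng=rng)

print(f"fraction zeroed: {(dropped == 0).mean():.3f}")  # close to 0.5
print(f"mean activation: {dropped.mean():.3f}")         # close to 1.0: expected value preserved
```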
Training and Evaluation
Binary classification uses binary crossentropy loss, which penalizes confident wrong predictions more than uncertain ones. Adam optimizer adapts learning rates automatically, making it reliable for most scenarios.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.metrics import Precision, Recall
# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall()]
)
# Set up callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)
checkpoint = ModelCheckpoint(
    'best_sentiment_model.h5',
    monitor='val_accuracy',
    save_best_only=True
)
# Train model
history = model.fit(
    X_train_padded, y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=128,
    callbacks=[early_stopping, checkpoint],
    verbose=1
)
# Evaluate on test set
test_loss, test_accuracy, test_precision, test_recall = model.evaluate(X_test_padded, y_test)
f1_score = 2 * (test_precision * test_recall) / (test_precision + test_recall)
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test F1-Score: {f1_score:.4f}")
Early stopping monitors validation loss and stops training when it stops improving, preventing overfitting. The patience=3 parameter allows three epochs without improvement before stopping.
For sentiment analysis, accuracy alone can be misleading if classes are imbalanced. Precision (how many predicted positives are actually positive) and recall (how many actual positives were found) provide a more complete picture. F1-score balances both metrics.
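A quick worked example makes the trade-off concrete (hypothetical counts, computed from true positives, false positives, and false negatives):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Suppose the test set has 100 positive reviews; the model flags 80 reviews
# as positive, 60 of them correctly (tp=60, fp=20), and misses 40 (fn=40)
p, r, f = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```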
Making Predictions
A trained model is useless if you can’t use it on new data. Production inference requires the same preprocessing pipeline used during training.
def predict_sentiment(text, model, word_index, max_length=200):
    """
    Predict sentiment for raw text input.
    Args:
        text: Raw text string
        model: Trained Keras model
        word_index: Dictionary mapping words to indices (from imdb.get_word_index())
        max_length: Maximum sequence length
    Returns:
        Tuple of (sentiment_label, confidence_score)
    """
    # Preprocess text: lowercase and strip surrounding punctuation
    words = [w.strip('.,!?;:"\'()') for w in text.lower().split()]
    # imdb.load_data offsets word indices by 3 (0=padding, 1=start, 2=out-of-vocabulary),
    # so raw word_index values must be shifted to match the training encoding
    sequence = [1]  # start token
    for word in words:
        index = word_index.get(word, -1) + 3
        sequence.append(index if 3 <= index < vocab_size else 2)
    # Pad sequence
    padded = pad_sequences([sequence], maxlen=max_length, padding='post', truncating='post')
    # Predict
    prediction = model.predict(padded, verbose=0)[0][0]
    sentiment = "Positive" if prediction > 0.5 else "Negative"
    confidence = prediction if prediction > 0.5 else 1 - prediction
    return sentiment, confidence
# Get word index from IMDB dataset
word_index = imdb.get_word_index()
# Test predictions
test_reviews = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "Terrible waste of time. The plot made no sense and acting was awful.",
    "It was okay, nothing special but not terrible either."
]
for review in test_reviews:
    sentiment, confidence = predict_sentiment(review, model, word_index)
    print(f"Review: {review[:60]}...")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})\n")
The prediction function handles the full pipeline: tokenization, vocabulary lookup, index alignment, padding, and inference. Unknown words (not in the training vocabulary) map to the out-of-vocabulary index, the same token the model saw for rare words during training.
Improving Model Performance
The basic LSTM model achieves around 85-87% accuracy on IMDB reviews. Pre-trained embeddings can push this to 90%+ by leveraging semantic knowledge learned from billions of words.
GloVe (Global Vectors for Word Representation) embeddings trained on Wikipedia and news articles capture word relationships that transfer well to sentiment analysis.
import numpy as np
def load_glove_embeddings(glove_file, word_index, embedding_dim=100):
    """Load GloVe vectors into an embedding matrix aligned with the IMDB encoding."""
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print(f"Loaded {len(embeddings_index)} word vectors")
    # Create embedding matrix; rows use the same +3 index offset that
    # imdb.load_data applies (relies on the module-level vocab_size from earlier)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        if i + 3 < vocab_size:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i + 3] = embedding_vector
    return embedding_matrix
# Create model with pre-trained embeddings
def create_model_with_glove(vocab_size, embedding_matrix, max_length=200):
    embedding_dim = embedding_matrix.shape[1]
    model = Sequential([
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_dim,
                  input_length=max_length,
                  weights=[embedding_matrix],
                  trainable=False),  # Freeze embeddings initially
        Bidirectional(LSTM(64)),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    return model
# Load GloVe embeddings (download from https://nlp.stanford.edu/projects/glove/)
# embedding_matrix = load_glove_embeddings('glove.6B.100d.txt', word_index, embedding_dim=100)
# model_glove = create_model_with_glove(vocab_size, embedding_matrix)
Setting trainable=False freezes the embedding layer, preventing updates during training. This works well when you have limited training data. With larger datasets, you can set trainable=True to fine-tune embeddings for your specific domain.
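The two-phase recipe looks like this in practice (a sketch on a deliberately tiny stand-in model rather than the full sentiment network; the three steps are the same). Note that Keras only picks up a change to trainable when the model is recompiled.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

# Tiny stand-in model: frozen embedding + pooling + classifier
model = Sequential([
    Embedding(input_dim=50, output_dim=8, trainable=False),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

X = np.random.randint(0, 50, size=(32, 10))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)  # phase 1: embeddings stay fixed

# Phase 2: unfreeze, recompile with a lower learning rate, continue training
model.layers[0].trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy')
model.fit(X, y, epochs=1, verbose=0)  # embeddings now fine-tune gently
```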
Other performance improvements include:
- Data augmentation: Synonym replacement, back-translation, or random insertion/deletion of words
- Ensemble methods: Combine predictions from multiple models
- Transfer learning with BERT: Use transformer models pre-trained on massive corpora (requires more compute but achieves state-of-the-art results)
- Hyperparameter tuning: Experiment with LSTM units, embedding dimensions, dropout rates, and learning rates
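For example, a simple soft-voting ensemble averages the sigmoid outputs of several independently trained models before thresholding (illustrated here with hypothetical prediction arrays standing in for model outputs):

```python
import numpy as np

# Hypothetical sigmoid outputs from three independently trained models
preds_a = np.array([0.91, 0.40, 0.75])
preds_b = np.array([0.85, 0.55, 0.62])
preds_c = np.array([0.88, 0.35, 0.48])

# Average the probabilities, then apply the usual 0.5 threshold
ensemble = np.mean([preds_a, preds_b, preds_c], axis=0)
labels = (ensemble > 0.5).astype(int)
print(ensemble)
print(labels)  # [1 0 1]
```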
For production systems, monitor model performance over time. Language evolves, and models trained on 2020 reviews may underperform on 2024 text. Schedule regular retraining with fresh data to maintain accuracy.