How to Implement Sentiment Analysis in PyTorch

Key Insights

  • Sentiment analysis with PyTorch requires three core components: an embedding layer to convert tokens to vectors, a recurrent layer (LSTM/GRU) to capture sequential patterns, and a classifier head for binary or multi-class prediction
  • Proper text preprocessing—including tokenization, vocabulary building with frequency thresholds, and sequence padding—directly impacts model performance and should handle edge cases like unknown tokens and variable-length inputs
  • Training stability improves significantly when using BCEWithLogitsLoss (which combines sigmoid and binary cross-entropy) over separate sigmoid + BCELoss, and gradient clipping prevents exploding gradients in recurrent networks

Introduction & Problem Overview

Sentiment analysis is the task of determining emotional tone from text—whether a review is positive or negative, whether a tweet expresses anger or joy. It’s fundamental to modern NLP applications: e-commerce platforms analyze product reviews, brands monitor social media sentiment, and customer service systems route tickets based on message tone.

In this article, you’ll build a binary sentiment classifier for movie reviews using PyTorch and the IMDB dataset. This dataset contains 50,000 highly polarized reviews labeled as positive or negative, making it ideal for learning sentiment classification. By the end, you’ll have a working model that can classify new text with confidence scores.

Dataset Preparation & Preprocessing

Text preprocessing is where most beginners stumble. Neural networks need numbers, not strings, so you must convert text into numerical sequences while preserving semantic meaning.

First, load the IMDB dataset. We’ll use the datasets library from Hugging Face, which provides clean access to common NLP datasets:

from datasets import load_dataset
from collections import Counter
import torch
from torch.utils.data import DataLoader
import re

# Load IMDB dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")

Next, implement tokenization. Keep it simple—split on whitespace and punctuation, convert to lowercase:

def tokenize(text):
    """Simple tokenization: lowercase and split on non-alphanumeric"""
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

# Build vocabulary from training data
def build_vocab(texts, min_freq=5):
    counter = Counter()
    for text in texts:
        counter.update(tokenize(text))
    
    # Special tokens
    vocab = {'<PAD>': 0, '<UNK>': 1}
    
    # Add words meeting frequency threshold
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    
    return vocab

vocab = build_vocab(train_data['text'], min_freq=5)
print(f"Vocabulary size: {len(vocab)}")

The min_freq threshold prevents rare words from bloating your vocabulary. Words appearing fewer than 5 times become <UNK> (unknown) tokens.
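
To make that lookup behavior concrete, here is a toy check (the miniature vocabulary and the out-of-vocabulary word are illustrative, not drawn from the real IMDB vocab):

```python
# A toy vocabulary illustrating the lookup rule (indices are arbitrary)
vocab = {'<PAD>': 0, '<UNK>': 1, 'movie': 2, 'great': 3}

tokens = ['great', 'movie', 'transcendent']  # 'transcendent' was never added
indices = [vocab.get(t, vocab['<UNK>']) for t in tokens]
print(indices)  # [3, 2, 1] -- the unseen word maps to <UNK>
```

Any word that fell below the frequency threshold at build time simply isn't a key in the dict, so `dict.get` with the `<UNK>` default handles it at lookup time.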

Now convert text to sequences of indices and handle variable lengths:

def text_to_sequence(text, vocab, max_len=256):
    """Convert text to padded sequence of vocabulary indices"""
    tokens = tokenize(text)
    sequence = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    
    # Truncate or pad
    if len(sequence) > max_len:
        sequence = sequence[:max_len]
    else:
        sequence = sequence + [vocab['<PAD>']] * (max_len - len(sequence))
    
    return sequence

# Create PyTorch datasets
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, vocab, max_len=256):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        sequence = text_to_sequence(self.texts[idx], self.vocab, self.max_len)
        return torch.tensor(sequence, dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.float)

# Create data loaders
train_dataset = IMDBDataset(train_data['text'], train_data['label'], vocab)
test_dataset = IMDBDataset(test_data['text'], test_data['label'], vocab)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

Setting max_len=256 balances computational efficiency with capturing enough context. Most reviews convey sentiment within the first few hundred tokens.
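
The truncate/pad rule can be exercised in isolation; `pad_or_truncate` here is a hypothetical helper that mirrors the logic inside text_to_sequence:

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Mirror of the truncate/pad branch in text_to_sequence."""
    if len(seq) > max_len:
        return seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

print(pad_or_truncate([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```

Every sequence comes out exactly max_len long, which is what lets DataLoader stack them into a single (batch_size, max_len) tensor.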

Building the Neural Network Architecture

The architecture follows a standard sequence classification pattern: embeddings → recurrent layer → pooling → classifier.

import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, num_layers=2, dropout=0.5):
        super(SentimentLSTM, self).__init__()
        
        # Embedding layer converts token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # LSTM processes sequences
        self.lstm = nn.LSTM(
            embedding_dim, 
            hidden_dim, 
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True,
            bidirectional=False
        )
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
        
        # Fully connected output layer
        self.fc = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        
        # LSTM output
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out shape: (batch_size, seq_len, hidden_dim)
        
        # Use final hidden state for classification
        # hidden shape: (num_layers, batch_size, hidden_dim)
        hidden_final = hidden[-1]  # Take last layer
        
        # Apply dropout and classifier
        dropped = self.dropout(hidden_final)
        output = self.fc(dropped)  # (batch_size, 1)
        
        return output.squeeze(1)  # (batch_size)

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentimentLSTM(len(vocab), embedding_dim=128, hidden_dim=256, num_layers=2)
model = model.to(device)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

Key architectural decisions:

  • Embedding dimension (128): Smaller than typical word2vec (300) but sufficient for sentiment, reducing parameters
  • Hidden dimension (256): Captures complex patterns without excessive computation
  • Two LSTM layers: Adds representational power; dropout between layers prevents overfitting
  • Final hidden state: We use the last LSTM hidden state as the sequence representation, which works well for sentiment, where the overall tone matters more than individual token positions. Note that with right-padded batches the LSTM also consumes <PAD> tokens before reaching that state; packing sequences with nn.utils.rnn.pack_padded_sequence avoids this if you want to squeeze out a bit more accuracy

Training Loop Implementation

Use BCEWithLogitsLoss for numerical stability—it combines sigmoid activation with binary cross-entropy:

import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for sequences, labels in loader:
        sequences, labels = sequences.to(device), labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping prevents exploding gradients in RNNs
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        
        optimizer.step()
        
        # Calculate accuracy
        predictions = torch.sigmoid(outputs) > 0.5
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
        total_loss += loss.item()
    
    return total_loss / len(loader), correct / total

def validate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for sequences, labels in loader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            loss = criterion(outputs, labels)
            
            predictions = torch.sigmoid(outputs) > 0.5
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
            total_loss += loss.item()
    
    return total_loss / len(loader), correct / total

# Training loop
num_epochs = 5
best_val_acc = 0

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = validate(model, test_loader, criterion, device)
    
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}\n")
    
    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_sentiment_model.pt')

Gradient clipping with max_norm=5.0 is critical for LSTM training—it prevents gradients from exploding during backpropagation through time. You should expect validation accuracy around 85-88% after 5 epochs.
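
To see clipping in action, here is a small standalone sketch (the parameter and its gradient values are fabricated for illustration):

```python
import torch

# A toy parameter with a deliberately oversized gradient
p = torch.nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([30.0, 40.0, 0.0])  # gradient norm = sqrt(30^2 + 40^2) = 50

# Rescales the gradient in place so its total norm is at most max_norm
torch.nn.utils.clip_grad_norm_([p], max_norm=5.0)
print(p.grad)  # approximately [3., 4., 0.], norm 5.0
```

The direction of the gradient is preserved; only its magnitude is scaled down, so the update still points the same way, just with a bounded step.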

Evaluation & Inference

After training, evaluate on the test set and create an inference function for new text:

def predict_sentiment(model, text, vocab, device, max_len=256):
    """Predict sentiment for a single text sample"""
    model.eval()
    
    # Preprocess text
    sequence = text_to_sequence(text, vocab, max_len)
    sequence_tensor = torch.tensor([sequence], dtype=torch.long).to(device)
    
    # Get prediction
    with torch.no_grad():
        output = model(sequence_tensor)
        probability = torch.sigmoid(output).item()
    
    sentiment = "Positive" if probability > 0.5 else "Negative"
    confidence = probability if probability > 0.5 else 1 - probability
    
    return sentiment, confidence

# Load best model
model.load_state_dict(torch.load('best_sentiment_model.pt', map_location=device))

# Test predictions
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout.",
    "Terrible waste of time. Poor acting and a nonsensical plot that went nowhere.",
    "It was okay, nothing special but not terrible either."
]

for review in test_reviews:
    sentiment, confidence = predict_sentiment(model, review, vocab, device)
    print(f"Review: {review[:60]}...")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.2%})\n")

This inference function handles the complete pipeline: tokenization, sequence conversion, model prediction, and probability interpretation.

Improvements & Next Steps

This implementation provides a solid baseline, but several enhancements can boost performance:

Bidirectional LSTMs: Process sequences in both directions to capture context from past and future tokens. Set bidirectional=True, double the fc input dimension to hidden_dim * 2, and concatenate the final forward and backward hidden states in forward().
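
As a rough sketch of those changes (a hypothetical BiSentimentLSTM, single-layer for brevity):

```python
import torch
import torch.nn as nn

class BiSentimentLSTM(nn.Module):
    """Illustrative bidirectional variant of the model above."""
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # fc input doubles: forward and backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, 1)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_dim); the last two entries
        # are the final forward and backward states of the top layer
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final).squeeze(1)

model = BiSentimentLSTM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (4, 16)))
print(logits.shape)  # torch.Size([4])
```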

Attention mechanisms: Instead of using only the final hidden state, compute weighted averages of all hidden states based on their relevance to the classification task.
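
A minimal sketch of such an attention pooling layer (additive scoring; the names are illustrative, and a production version should also mask <PAD> positions before the softmax):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Weighted average of LSTM outputs, with learned per-token scores."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, lstm_out):
        # lstm_out: (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.score(lstm_out), dim=1)  # (batch, seq_len, 1)
        return (weights * lstm_out).sum(dim=1)                # (batch, hidden_dim)

pool = AttentionPool(hidden_dim=256)
context = pool(torch.randn(4, 32, 256))
print(context.shape)  # torch.Size([4, 256])
```

The pooled context vector replaces hidden[-1] as the input to the classifier head, letting the model weight sentiment-bearing tokens anywhere in the review.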

Pre-trained embeddings: Initialize the embedding layer with GloVe or FastText vectors trained on large corpora. This provides better semantic representations, especially for words with limited training examples.
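
A sketch of how that initialization might look, assuming GloVe's plain-text format of one word followed by its vector per line (load_pretrained and the toy vectors here are illustrative):

```python
import torch

def load_pretrained(embedding, vocab, lines, dim):
    """Copy GloVe-style vectors ("word v1 v2 ...") into an embedding matrix.
    Words absent from the file keep their random initialization."""
    with torch.no_grad():
        for line in lines:
            parts = line.split()
            word, values = parts[0], parts[1:]
            if word in vocab and len(values) == dim:
                embedding.weight[vocab[word]] = torch.tensor([float(v) for v in values])

# Illustrative 2-d vectors; a real GloVe file has one such line per word
glove_lines = ["good 0.1 0.2", "bad -0.1 -0.2"]
vocab = {'<PAD>': 0, '<UNK>': 1, 'good': 2}
emb = torch.nn.Embedding(len(vocab), 2, padding_idx=0)
load_pretrained(emb, vocab, glove_lines, dim=2)
print(emb.weight[2])  # row for 'good' now holds the pre-trained vector
```

In practice you would read the lines from the downloaded GloVe file and pass embedding_dim as dim; the embedding remains trainable afterward and fine-tunes during training.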

Transformer models: For production systems, consider fine-tuning BERT or RoBERTa using Hugging Face’s transformers library. These models achieve 92-95% accuracy on IMDB but require more computational resources.

Data augmentation: Generate synthetic training examples through back-translation, synonym replacement, or paraphrasing to improve robustness.

The architecture presented here gives you a strong foundation for understanding sentiment analysis mechanics. Master these fundamentals before moving to more complex transformer-based approaches—the core concepts of embeddings, sequence modeling, and classification remain the same.
