How to Implement Text Classification in PyTorch

Key Insights

  • Text classification with PyTorch requires a clear pipeline: tokenization → vocabulary building → numerical encoding → model training, with each step directly impacting model performance
  • LSTM-based architectures remain effective for text classification, offering a good balance between accuracy and computational cost compared to transformer models
  • Proper data preprocessing, including vocabulary size limits and sequence padding strategies, often matters more than complex model architectures for achieving good results

Introduction & Problem Setup

Text classification is one of the most common NLP tasks in production systems. Whether you’re filtering spam emails, routing customer support tickets, analyzing product reviews, or categorizing news articles, you need a reliable way to assign labels to text documents.

In this article, we’ll build a binary sentiment classifier for movie reviews using PyTorch. We’ll use the IMDB dataset, which contains 50,000 movie reviews labeled as positive or negative. This is a real-world problem that demonstrates the core concepts you’ll need for any text classification task.

Let’s start by loading the dataset:

from datasets import load_dataset
import torch
from collections import Counter
import re

# Load IMDB dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"\nExample review: {train_data[0]['text'][:200]}...")
print(f"Label: {train_data[0]['label']}")  # 0 = negative, 1 = positive

Data Preprocessing & Tokenization

Before feeding text into a neural network, we need to convert it into numerical representations. This involves several steps: cleaning the text, splitting it into tokens, building a vocabulary, and converting tokens to indices.

Here’s a practical tokenizer and vocabulary builder:

class Vocabulary:
    def __init__(self, max_size=10000, min_freq=2):
        self.max_size = max_size
        self.min_freq = min_freq
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        
    def build(self, texts):
        # Count word frequencies
        counter = Counter()
        for text in texts:
            tokens = self.tokenize(text)
            counter.update(tokens)
        
        # Keep the most frequent words; most_common() is sorted by
        # descending frequency, so once a word falls below min_freq
        # every remaining word does too and we can stop
        for word, freq in counter.most_common(self.max_size - 2):
            if freq < self.min_freq:
                break
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word
    
    def tokenize(self, text):
        # Simple word-level tokenization
        text = text.lower()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        return text.split()
    
    def encode(self, text):
        tokens = self.tokenize(text)
        return [self.word2idx.get(token, 1) for token in tokens]  # 1 = <UNK>
    
    def __len__(self):
        return len(self.word2idx)

# Build vocabulary from training data
vocab = Vocabulary(max_size=10000, min_freq=5)
vocab.build([sample['text'] for sample in train_data])
print(f"Vocabulary size: {len(vocab)}")
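As a quick sanity check on the encoding logic, out-of-vocabulary words should fall back to index 1 (`<UNK>`). A minimal standalone illustration of the same dictionary lookup, using a toy `word2idx` rather than the real vocabulary:

```python
# Toy vocabulary illustrating the encode() fallback: unseen words map to <UNK>
word2idx = {'<PAD>': 0, '<UNK>': 1, 'great': 2, 'movie': 3}

tokens = 'great movie terrible plot'.split()
encoded = [word2idx.get(token, 1) for token in tokens]  # 1 = <UNK>
print(encoded)  # [2, 3, 1, 1]
```

The words "terrible" and "plot" are not in the toy vocabulary, so both map to 1. In the real pipeline, `min_freq` controls how many rare words end up taking this fallback path.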

Now we need to handle variable-length sequences by padding them to the same length:

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

class IMDBDataset(Dataset):
    def __init__(self, data, vocab, max_len=256):
        self.data = data
        self.vocab = vocab
        self.max_len = max_len
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        text = self.data[idx]['text']
        label = self.data[idx]['label']
        
        # Encode and truncate
        encoded = self.vocab.encode(text)[:self.max_len]
        return torch.tensor(encoded), torch.tensor(label)

def collate_batch(batch):
    texts, labels = zip(*batch)
    # Pad sequences to the same length
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=0)
    labels = torch.stack(labels)
    return texts_padded, labels

# Create data loaders
train_dataset = IMDBDataset(train_data, vocab)
test_dataset = IMDBDataset(test_data, vocab)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, 
                         collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False,
                        collate_fn=collate_batch)
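To see exactly what `collate_batch` produces, here is `pad_sequence` applied to three toy sequences of different lengths (the token indices are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three encoded "reviews" of different lengths
seqs = [torch.tensor([4, 9, 2]), torch.tensor([7, 3]), torch.tensor([5])]

# Shorter sequences are right-padded with 0, the <PAD> index
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
# tensor([[4, 9, 2],
#         [7, 3, 0],
#         [5, 0, 0]])
```

Because the embedding layer is built with `padding_idx=0`, these padding positions contribute zero vectors and receive no gradient updates.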

Building the Classification Model

Our model architecture consists of three main components: an embedding layer to convert token indices into dense vectors, an LSTM to process the sequence and capture context, and a linear layer to produce the final classification.

import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, 
                 num_layers=2, dropout=0.5):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0,
                           bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, 2)  # *2 for bidirectional
        
    def forward(self, text):
        # text shape: [batch_size, seq_len]
        embedded = self.embedding(text)  # [batch_size, seq_len, embedding_dim]
        
        # LSTM output
        lstm_out, (hidden, cell) = self.lstm(embedded)
        
        # Concatenate final forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        hidden = self.dropout(hidden)
        
        # Classification
        output = self.fc(hidden)  # [batch_size, 2]
        return output

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentimentClassifier(vocab_size=len(vocab)).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
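The hidden-state concatenation in `forward` is the easiest place to get shapes wrong. A standalone shape check with toy dimensions (smaller than the model's real sizes) confirms the logic:

```python
import torch
import torch.nn as nn

# Toy bidirectional LSTM: 2 layers, hidden_dim=16
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 8)  # [batch_size, seq_len, embedding_dim]

out, (hidden, cell) = lstm(x)
# hidden: [num_layers * 2, batch_size, hidden_dim]
# hidden[-2] and hidden[-1] are the top layer's final forward/backward states
combined = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(combined.shape)  # torch.Size([4, 32]) -> hidden_dim * 2
```

This is why the classifier's linear layer takes `hidden_dim * 2` inputs: the forward and backward summaries of the top LSTM layer are concatenated along the feature dimension.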

Training Loop Implementation

With the model defined, we implement the training loop with proper loss calculation and optimization:

import torch.optim as optim

def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for texts, labels in dataloader:
        texts, labels = texts.to(device), labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        
        # Track metrics
        total_loss += loss.item()
        predictions = outputs.argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for texts, labels in dataloader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            predictions = outputs.argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 5
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, 
                                       optimizer, device)
    val_loss, val_acc = evaluate(model, test_loader, criterion, device)
    
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}\n")

Evaluation & Inference

After training, we need a clean interface for making predictions on new text:

def predict_sentiment(model, text, vocab, device):
    model.eval()
    
    # Preprocess
    encoded = vocab.encode(text)
    tensor = torch.tensor(encoded).unsqueeze(0).to(device)  # Add batch dim
    
    # Predict
    with torch.no_grad():
        output = model(tensor)
        probabilities = torch.softmax(output, dim=1)
        prediction = output.argmax(dim=1).item()
    
    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = probabilities[0][prediction].item()
    
    return sentiment, confidence

# Test predictions
test_reviews = [
    "This movie was absolutely fantastic! Best film I've seen all year.",
    "Terrible waste of time. The plot made no sense and acting was awful.",
    "It was okay, nothing special but not terrible either."
]

for review in test_reviews:
    sentiment, confidence = predict_sentiment(model, review, vocab, device)
    print(f"Review: {review[:60]}...")
    print(f"Prediction: {sentiment} (confidence: {confidence:.2%})\n")

For comprehensive evaluation, compute additional metrics:

from sklearn.metrics import classification_report

def get_predictions(model, dataloader, device):
    model.eval()
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for texts, labels in dataloader:
            texts = texts.to(device)
            outputs = model(texts)
            predictions = outputs.argmax(dim=1).cpu().numpy()
            all_preds.extend(predictions)
            all_labels.extend(labels.numpy())
    
    return all_labels, all_preds

true_labels, predictions = get_predictions(model, test_loader, device)
print(classification_report(true_labels, predictions, 
                          target_names=['Negative', 'Positive']))

Improvements & Next Steps

This baseline model achieves solid performance, but several enhancements can boost accuracy:

Pre-trained embeddings provide better semantic representations than random initialization:

import numpy as np

def load_pretrained_embeddings(vocab, embedding_dim=100):
    # Assumes glove.6B.100d.txt has been downloaded from the GloVe site
    # Randomly initialize, then overwrite rows for words GloVe covers
    embeddings = np.random.randn(len(vocab), embedding_dim).astype(np.float32) * 0.1
    embeddings[0] = 0.0  # keep the <PAD> row at zero, matching padding_idx=0
    
    with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in vocab.word2idx:
                idx = vocab.word2idx[word]
                embeddings[idx] = np.array(values[1:], dtype=np.float32)
    
    return torch.tensor(embeddings, dtype=torch.float32)

# The GloVe vectors are 100-dimensional, so the model's embedding_dim must
# match (the model above used the default of 128)
model = SentimentClassifier(vocab_size=len(vocab), embedding_dim=100).to(device)
pretrained_embeddings = load_pretrained_embeddings(vocab)
model.embedding.weight.data.copy_(pretrained_embeddings)
model.embedding.weight.requires_grad = True  # set to False to freeze the embeddings

Other improvements worth exploring:

  • Use attention mechanisms to focus on important words
  • Experiment with CNN-based architectures for faster training
  • Implement learning rate scheduling for better convergence
  • Add data augmentation through back-translation or synonym replacement
  • Try transformer models (BERT, RoBERTa) for state-of-the-art performance
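As one concrete example from the list above, learning rate scheduling takes only a few extra lines. A sketch using `ReduceLROnPlateau`, with a toy model and made-up validation losses standing in for the real training loop:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # stand-in for the SentimentClassifier
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate once validation loss stops improving for 2 epochs
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                 factor=0.5, patience=2)

# Inside the epoch loop, step the scheduler on the validation loss
for val_loss in [0.70, 0.72, 0.73, 0.74, 0.75]:
    scheduler.step(val_loss)

print(optimizer.param_groups[0]['lr'])
```

In the article's training loop, `scheduler.step(val_loss)` would go right after the `evaluate(...)` call each epoch.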

The LSTM approach shown here provides a solid foundation. It is easy to debug, trains quickly, and works well with limited computational resources. For production systems handling millions of documents, this architecture can be far more cost-effective than heavier transformer alternatives while maintaining competitive accuracy.
