How to Implement Text Classification in TensorFlow
Key Insights
- Text classification with TensorFlow requires three core steps: tokenizing text into sequences, building an embedding layer to represent words as vectors, and using recurrent or convolutional layers to capture patterns before classification.
- The IMDB movie review dataset provides an ideal starting point with 50,000 labeled reviews, allowing you to build a working sentiment classifier in roughly 100 lines of code.
- Pre-trained embeddings like GloVe can dramatically improve model performance on limited datasets by leveraging semantic relationships learned from billions of words.
Introduction to Text Classification
Text classification assigns predefined categories to text documents. Common applications include sentiment analysis (positive/negative reviews), spam detection (spam/not spam emails), and topic categorization (news articles into sports, politics, technology). These systems power recommendation engines, content moderation, and customer feedback analysis across the industry.
TensorFlow excels at text classification due to its comprehensive ecosystem. The tf.keras API provides high-level abstractions for preprocessing, model building, and training, while maintaining flexibility for custom architectures. TensorFlow’s production deployment tools also make it straightforward to move from prototype to production.
We’ll build a binary sentiment classifier using the IMDB movie review dataset—25,000 reviews for training and 25,000 for testing, each labeled as positive or negative. This is a real-world problem that demonstrates the complete pipeline from raw text to predictions.
Data Preparation and Preprocessing
Text data requires significant preprocessing before neural networks can process it. We need to convert words into numbers, handle variable-length sequences, and split data appropriately.
```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Load IMDB dataset (built into Keras)
vocab_size = 10000
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=vocab_size)

# Get the word index mapping
word_index = tf.keras.datasets.imdb.get_word_index()

# Reverse the word index to decode reviews
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(encoded_review):
    # Indices are offset by 3 reserved values (padding, start, unknown)
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Pad sequences to uniform length
max_length = 256
train_data = pad_sequences(train_data, maxlen=max_length, padding='post', truncating='post')
test_data = pad_sequences(test_data, maxlen=max_length, padding='post', truncating='post')

# Create validation split
validation_split = 5000
x_val = train_data[:validation_split]
y_val = train_labels[:validation_split]
x_train = train_data[validation_split:]
y_train = train_labels[validation_split:]

print(f"Training samples: {len(x_train)}")
print(f"Validation samples: {len(x_val)}")
print(f"Test samples: {len(test_data)}")
```
The IMDB dataset comes pre-tokenized, but in real projects, you’ll use Tokenizer to build vocabulary from raw text. The num_words parameter limits vocabulary to the 10,000 most frequent words, reducing dimensionality and noise. Padding ensures all sequences have identical length (256 tokens), enabling batch processing.
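For raw text, the Tokenizer workflow looks like this minimal sketch (the two-sentence corpus, num_words value, and maxlen are illustrative, not tuned):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative corpus; in practice, fit on your full training set only
texts = ["the movie was great", "the plot was dull"]

tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)                    # builds the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # words -> integer ids
padded = pad_sequences(sequences, maxlen=6, padding='post')

print(tokenizer.word_index)  # most frequent words get the lowest indices
print(padded.shape)          # (2, 6)
```

The OOV token reserves index 1, so words unseen at fit time still map to a valid input for the embedding layer.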
Setting aside 5,000 samples for validation allows us to monitor overfitting during training without touching the test set.
Building the Classification Model
The model architecture follows a proven pattern: embedding layer for word representations, recurrent layers to capture sequential dependencies, and dense layers for classification.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

def build_model(vocab_size, embedding_dim=128, max_length=256):
    model = Sequential([
        # Embedding layer converts word indices to dense vectors
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_dim,
                  input_length=max_length),
        # Bidirectional LSTM processes sequences forward and backward
        Bidirectional(LSTM(64, return_sequences=False)),
        # Dropout for regularization
        Dropout(0.5),
        # Dense layer for learning complex patterns
        Dense(64, activation='relu'),
        Dropout(0.5),
        # Output layer with sigmoid for binary classification
        Dense(1, activation='sigmoid')
    ])
    return model

model = build_model(vocab_size=vocab_size)
model.summary()
The embedding layer is crucial—it learns dense vector representations where semantically similar words have similar vectors. Starting with 128 dimensions balances expressiveness and computational cost.
Bidirectional LSTM processes text in both directions, capturing context from both past and future words. The return_sequences=False parameter means we only keep the final output, which encodes the entire sequence.
Dropout layers (50% rate) prevent overfitting by randomly disabling neurons during training. The sigmoid activation in the output layer produces probabilities between 0 and 1 for binary classification.
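To see the tensor shapes these layers produce, here is a quick sketch with toy dimensions (a 100-word vocabulary, 8-dimensional embeddings, and 16 LSTM units are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy batch: 2 sequences of 4 word indices each (values are arbitrary)
batch = np.array([[5, 9, 2, 0], [7, 3, 0, 0]])

embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
vectors = embedding(batch)   # one 8-dimensional vector per token
print(vectors.shape)         # (2, 4, 8)

bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))
summary = bilstm(vectors)    # final forward and backward states, concatenated
print(summary.shape)         # (2, 32)
```

The bidirectional wrapper doubles the output width (16 forward units plus 16 backward), which is why the dense layers that follow see a single fixed-size vector per review.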
Training and Evaluation
Proper training configuration and monitoring are essential for good results.
```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt

# Compile model with appropriate loss and optimizer
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Configure callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)
checkpoint = ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True
)

# Train the model
history = model.fit(
    x_train, y_train,
    epochs=20,
    batch_size=128,
    validation_data=(x_val, y_val),
    callbacks=[early_stopping, checkpoint],
    verbose=1
)

# Evaluate on test set
test_loss, test_accuracy = model.evaluate(test_data, test_labels)
print(f"\nTest accuracy: {test_accuracy:.4f}")

# Plot training history
def plot_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    # Accuracy plot
    ax1.plot(history.history['accuracy'], label='Training')
    ax1.plot(history.history['val_accuracy'], label='Validation')
    ax1.set_title('Model Accuracy')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Accuracy')
    ax1.legend()
    ax1.grid(True)
    # Loss plot
    ax2.plot(history.history['loss'], label='Training')
    ax2.plot(history.history['val_loss'], label='Validation')
    ax2.set_title('Model Loss')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Loss')
    ax2.legend()
    ax2.grid(True)
    plt.tight_layout()
    plt.savefig('training_history.png')
    plt.show()

plot_history(history)
```
Binary cross-entropy is the standard loss function for binary classification. Adam optimizer adapts learning rates automatically, requiring minimal tuning.
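For intuition, binary cross-entropy can be computed by hand. This numpy-only sketch mirrors what the Keras loss does; the epsilon clip guards against log(0):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Batch average of -[y*log(p) + (1-y)*log(1-p)]
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])

print(binary_crossentropy(y_true, y_pred))      # ≈ 0.1446: predictions match labels
print(binary_crossentropy(y_true, 1 - y_pred))  # ≈ 2.0715: predictions inverted
```

Confident wrong predictions are penalized far more heavily than confident right ones, which is exactly the gradient signal a sigmoid-output classifier needs.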
Early stopping monitors validation loss and stops training when it stops improving for 3 consecutive epochs, preventing overfitting. The restore_best_weights parameter ensures we use the best model, not the final one.
Expect around 85-88% test accuracy with this architecture. If training and validation curves diverge significantly, increase dropout or reduce model capacity.
Making Predictions on New Text
The trained model needs the same preprocessing pipeline applied to new text.
```python
import re

def predict_sentiment(text, model, vocab_size=10000, max_length=256):
    # In production, save and load the tokenizer used during training;
    # here we reuse the IMDB word index the model was trained on
    word_index = tf.keras.datasets.imdb.get_word_index()
    # Lowercase and strip punctuation so words match the IMDB vocabulary
    words = re.findall(r"[a-z']+", text.lower())
    # Shift indices by the 3 reserved values; words that are missing
    # or outside the vocabulary map to index 2 (unknown)
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is not None and index + 3 < vocab_size:
            sequence.append(index + 3)
        else:
            sequence.append(2)
    # Pad sequence
    padded = pad_sequences([sequence], maxlen=max_length, padding='post', truncating='post')
    # Get prediction
    prediction = model.predict(padded, verbose=0)[0][0]
    sentiment = "Positive" if prediction > 0.5 else "Negative"
    confidence = prediction if prediction > 0.5 else 1 - prediction
    return sentiment, confidence

# Test with custom reviews
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged.",
    "Terrible waste of time. Poor acting and boring storyline.",
    "It was okay, nothing special but not terrible either."
]

for review in test_reviews:
    sentiment, confidence = predict_sentiment(review, model)
    print(f"Review: {review[:60]}...")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.2%})\n")
```
In production, save the tokenizer with pickle or joblib so inference uses exactly the preprocessing seen during training. The offset of 3 accounts for the IMDB dataset's reserved indices (0 for padding, 1 for start-of-sequence, 2 for unknown); words outside the 10,000-word vocabulary also map to the unknown index.
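Persisting the fitted tokenizer is a few lines with pickle; in this sketch the corpus and the tokenizer.pkl filename are illustrative:

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(["fit on your real training texts", "not this toy corpus"])

# Save alongside the model weights
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# At inference time, load the exact tokenizer used during training
with open('tokenizer.pkl', 'rb') as f:
    loaded_tokenizer = pickle.load(f)
```

The loaded object reproduces the original vocabulary, so new text is encoded with the same word-to-index mapping the model learned against.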
Optimization Techniques
Pre-trained embeddings leverage knowledge from massive text corpora, often improving performance significantly.
```python
import numpy as np

def load_glove_embeddings(glove_file, word_index, vocab_size, embedding_dim=100):
    # Parse the GloVe file into a word -> vector lookup
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    # Build an embedding matrix sized to the model's vocabulary;
    # IMDB word indices are shifted by the 3 reserved values
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        if i + 3 >= vocab_size:
            continue  # word falls outside the 10,000-word vocabulary
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i + 3] = embedding_vector
    return embedding_matrix

# Use pre-trained embeddings in the model
embedding_matrix = load_glove_embeddings('glove.6B.100d.txt', word_index, vocab_size, 100)

model_with_glove = Sequential([
    Embedding(vocab_size, 100,
              weights=[embedding_matrix],
              input_length=max_length,
              trainable=False),  # Freeze embeddings initially
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
```
Setting trainable=False freezes the embedding weights, which works well with limited data. With larger datasets, allow fine-tuning by setting it to True.
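A common two-phase recipe: train with frozen embeddings first, then unfreeze, recompile with a smaller learning rate, and continue. The sketch below uses random stand-in data and a pooling head so it runs standalone; in practice you would apply the same steps to model_with_glove and the real training split:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real embedding matrix and training data
vocab_size, embedding_dim, max_length = 50, 8, 10
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype('float32')
x = np.random.randint(0, vocab_size, size=(32, max_length))
y = np.random.randint(0, 2, size=(32,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.build(input_shape=(None, max_length))

# Phase 1: embeddings frozen, only the head trains
model.compile(optimizer='adam', loss='binary_crossentropy')
emb_before = model.layers[0].get_weights()[0].copy()
model.fit(x, y, epochs=1, verbose=0)
emb_after = model.layers[0].get_weights()[0]  # unchanged while frozen

# Phase 2: unfreeze, recompile with a smaller learning rate, keep training
model.layers[0].trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy')
model.fit(x, y, epochs=1, verbose=0)
```

Recompiling after flipping trainable is required for the change to take effect; the reduced learning rate keeps the fine-tuning step from overwriting the pre-trained structure.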
Alternative architectures worth exploring: 1D CNNs are faster and work well for shorter texts, while Transformer-based models (BERT, DistilBERT) achieve state-of-the-art results but require more computational resources.
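As a sketch, a 1D-CNN variant of the classifier above might look like this (the filter count and kernel size are typical starting points, not tuned values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, GlobalMaxPooling1D,
                                     Dense, Dropout)

cnn_model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    Conv1D(filters=128, kernel_size=5, activation='relu'),  # learns 5-gram features
    GlobalMaxPooling1D(),  # keeps the strongest response per filter
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

Because convolutions run in parallel across the sequence rather than step by step, this model trains noticeably faster than the BiLSTM at some cost in long-range context.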
Conclusion and Next Steps
You now have a complete text classification pipeline: preprocessing text into padded sequences, building neural architectures with embeddings and recurrent layers, training with proper validation, and making predictions on new data.
To improve performance, experiment with data augmentation (synonym replacement, back-translation), ensemble multiple models, or fine-tune pre-trained language models. For production deployment, consider TensorFlow Serving or converting models to TensorFlow Lite for mobile devices.
The techniques here generalize to multi-class classification (change output layer to Dense(num_classes, activation='softmax') and use categorical cross-entropy) and other NLP tasks. Master these fundamentals, and you’ll be equipped to tackle complex text classification challenges in production systems.
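A sketch of that multi-class variant (the four classes and layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

num_classes = 4  # e.g. sports, politics, technology, business

multi_model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(num_classes, activation='softmax')  # one probability per class, summing to 1
])
# sparse_categorical_crossentropy expects integer labels (0..num_classes-1);
# use categorical_crossentropy if your labels are one-hot encoded
multi_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
```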