How to Implement Text Classification in TensorFlow
Key Insights
- Text classification with TensorFlow requires three core steps: tokenizing text into sequences, building an embedding layer to represent words as vectors, and using recurrent or convolutional layers to capture patterns before classification.
- The IMDB movie review dataset provides an ideal starting point with 50,000 labeled reviews, allowing you to build a working sentiment classifier in roughly 100 lines of code.
- Pre-trained embeddings like GloVe can dramatically improve model performance on limited datasets by leveraging semantic relationships learned from billions of words.
Introduction to Text Classification
Text classification assigns predefined categories to text documents. Common applications include sentiment analysis (positive/negative reviews), spam detection (spam/not spam emails), and topic categorization (news articles into sports, politics, technology). These systems power recommendation engines, content moderation, and customer feedback analysis across the industry.
TensorFlow excels at text classification due to its comprehensive ecosystem. The tf.keras API provides high-level abstractions for preprocessing, model building, and training, while maintaining flexibility for custom architectures. TensorFlow’s production deployment tools also make it straightforward to move from prototype to production.
We’ll build a binary sentiment classifier using the IMDB movie review dataset—25,000 reviews for training and 25,000 for testing, each labeled as positive or negative. This is a real-world problem that demonstrates the complete pipeline from raw text to predictions.
Data Preparation and Preprocessing
Text data requires significant preprocessing before neural networks can process it. We need to convert words into numbers, handle variable-length sequences, and split data appropriately.
```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Load IMDB dataset (built into Keras)
vocab_size = 10000
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=vocab_size)

# Get the word index mapping
word_index = tf.keras.datasets.imdb.get_word_index()

# Reverse the word index to decode reviews
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(encoded_review):
    # Indices are offset by 3 reserved values (padding, start, unknown)
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Pad sequences to uniform length
max_length = 256
train_data = pad_sequences(train_data, maxlen=max_length, padding='post', truncating='post')
test_data = pad_sequences(test_data, maxlen=max_length, padding='post', truncating='post')

# Create validation split
validation_split = 5000
x_val = train_data[:validation_split]
y_val = train_labels[:validation_split]
x_train = train_data[validation_split:]
y_train = train_labels[validation_split:]

print(f"Training samples: {len(x_train)}")
print(f"Validation samples: {len(x_val)}")
print(f"Test samples: {len(test_data)}")
```
The IMDB dataset comes pre-tokenized, but in real projects, you’ll use Tokenizer to build vocabulary from raw text. The num_words parameter limits vocabulary to the 10,000 most frequent words, reducing dimensionality and noise. Padding ensures all sequences have identical length (256 tokens), enabling batch processing.
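For raw text, the Tokenizer workflow looks like this minimal sketch (the two-sentence corpus, num_words value, and maxlen are illustrative, not tuned):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative corpus; in practice, fit on your full training set only
texts = ["the movie was great", "the plot was dull"]

tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)                    # builds the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # words -> integer ids
padded = pad_sequences(sequences, maxlen=6, padding='post')

print(tokenizer.word_index)  # most frequent words get the lowest indices
print(padded.shape)          # (2, 6)
```

The OOV token reserves index 1, so words unseen at fit time still map to a valid input for the embedding layer.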
Setting aside 5,000 samples for validation allows us to monitor overfitting during training without touching the test set.
Building the Classification Model
The model architecture follows a proven pattern: embedding layer for word representations, recurrent layers to capture sequential dependencies, and dense layers for classification.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

def build_model(vocab_size, embedding_dim=128, max_length=256):
    model = Sequential([
        # Embedding layer converts word indices to dense vectors
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_dim,
                  input_length=max_length),
        # Bidirectional LSTM processes sequences forward and backward
        Bidirectional(LSTM(64, return_sequences=False)),
        # Dropout for regularization
        Dropout(0.5),
        # Dense layer for learning complex patterns
        Dense(64, activation='relu'),
        Dropout(0.5),
        # Output layer with sigmoid for binary classification
        Dense(1, activation='sigmoid')
    ])
    return model

model = build_model(vocab_size=vocab_size)
model.summary()
The embedding layer is crucial—it learns dense vector representations where semantically similar words have similar vectors. Starting with 128 dimensions balances expressiveness and computational cost.
Bidirectional LSTM processes text in both directions, capturing context from both past and future words. The return_sequences=False parameter means we only keep the final output, which encodes the entire sequence.
Dropout layers (50% rate) prevent overfitting by randomly disabling neurons during training. The sigmoid activation in the output layer produces probabilities between 0 and 1 for binary classification.
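To see the tensor shapes these layers produce, here is a quick sketch with toy dimensions (a 100-word vocabulary, 8-dimensional embeddings, and 16 LSTM units are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy batch: 2 sequences of 4 word indices each (values are arbitrary)
batch = np.array([[5, 9, 2, 0], [7, 3, 0, 0]])

embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
vectors = embedding(batch)   # one 8-dimensional vector per token
print(vectors.shape)         # (2, 4, 8)

bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))
summary = bilstm(vectors)    # final forward and backward states, concatenated
print(summary.shape)         # (2, 32)
```

The bidirectional wrapper doubles the output width (16 forward units plus 16 backward), which is why the dense layers that follow see a single fixed-size vector per review.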
Training and Evaluation
Proper training configuration and monitoring are essential for good results.
```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt

# Compile model with appropriate loss and optimizer
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Configure callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)
checkpoint = ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True
)

# Train the model
history = model.fit(
    x_train, y_train,
    epochs=20,
    batch_size=128,
    validation_data=(x_val, y_val),
    callbacks=[early_stopping, checkpoint],
    verbose=1
)

# Evaluate on test set
test_loss, test_accuracy = model.evaluate(test_data, test_labels)
print(f"\nTest accuracy: {test_accuracy:.4f}")

# Plot training history
def plot_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    # Accuracy plot
    ax1.plot(history.history['accuracy'], label='Training')
    ax1.plot(history.history['val_accuracy'], label='Validation')
    ax1.set_title('Model Accuracy')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Accuracy')
    ax1.legend()
    ax1.grid(True)
    # Loss plot
    ax2.plot(history.history['loss'], label='Training')
    ax2.plot(history.history['val_loss'], label='Validation')
    ax2.set_title('Model Loss')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Loss')
    ax2.legend()
    ax2.grid(True)
    plt.tight_layout()
    plt.savefig('training_history.png')
    plt.show()

plot_history(history)
```
Binary cross-entropy is the standard loss function for binary classification. Adam optimizer adapts learning rates automatically, requiring minimal tuning.
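For intuition, binary cross-entropy can be computed by hand. This numpy-only sketch mirrors what the Keras loss does; the epsilon clip guards against log(0):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Batch average of -[y*log(p) + (1-y)*log(1-p)]
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])

print(binary_crossentropy(y_true, y_pred))      # ≈ 0.1446: predictions match labels
print(binary_crossentropy(y_true, 1 - y_pred))  # ≈ 2.0715: predictions inverted
```

Confident wrong predictions are penalized far more heavily than confident right ones, which is exactly the gradient signal a sigmoid-output classifier needs.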
Early stopping monitors validation loss and stops training when it stops improving for 3 consecutive epochs, preventing overfitting. The restore_best_weights parameter ensures we use the best model, not the final one.
Expect around 85-88% test accuracy with this architecture. If training and validation curves diverge significantly, increase dropout or reduce model capacity.
Making Predictions on New Text
The trained model needs the same preprocessing pipeline applied to new text.
```python
import re

def predict_sentiment(text, model, vocab_size=10000, max_length=256):
    # In production, save and load the tokenizer used during training;
    # here we reuse the IMDB word index the model was trained on
    word_index = tf.keras.datasets.imdb.get_word_index()
    # Lowercase and strip punctuation so words match the IMDB vocabulary
    words = re.findall(r"[a-z']+", text.lower())
    # Shift indices by the 3 reserved values; words that are missing
    # or outside the vocabulary map to index 2 (unknown)
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is not None and index + 3 < vocab_size:
            sequence.append(index + 3)
        else:
            sequence.append(2)
    # Pad sequence
    padded = pad_sequences([sequence], maxlen=max_length, padding='post', truncating='post')
    # Get prediction
    prediction = model.predict(padded, verbose=0)[0][0]
    sentiment = "Positive" if prediction > 0.5 else "Negative"
    confidence = prediction if prediction > 0.5 else 1 - prediction
    return sentiment, confidence

# Test with custom reviews
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged.",
    "Terrible waste of time. Poor acting and boring storyline.",
    "It was okay, nothing special but not terrible either."
]

for review in test_reviews:
    sentiment, confidence = predict_sentiment(review, model)
    print(f"Review: {review[:60]}...")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.2%})\n")
```
In production, save the tokenizer with pickle or joblib so inference uses exactly the preprocessing seen during training. The offset of 3 accounts for the IMDB dataset's reserved indices (0 for padding, 1 for start-of-sequence, 2 for unknown); words outside the 10,000-word vocabulary also map to the unknown index.
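Persisting the fitted tokenizer is a few lines with pickle; in this sketch the corpus and the tokenizer.pkl filename are illustrative:

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(["fit on your real training texts", "not this toy corpus"])

# Save alongside the model weights
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# At inference time, load the exact tokenizer used during training
with open('tokenizer.pkl', 'rb') as f:
    loaded_tokenizer = pickle.load(f)
```

The loaded object reproduces the original vocabulary, so new text is encoded with the same word-to-index mapping the model learned against.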
Optimization Techniques
Pre-trained embeddings leverage knowledge from massive text corpora, often improving performance significantly.
```python
import numpy as np

def load_glove_embeddings(glove_file, word_index, vocab_size, embedding_dim=100):
    # Parse the GloVe file into a word -> vector lookup
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    # Build an embedding matrix sized to the model's vocabulary;
    # IMDB word indices are shifted by the 3 reserved values
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        if i + 3 >= vocab_size:
            continue  # word falls outside the 10,000-word vocabulary
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i + 3] = embedding_vector
    return embedding_matrix

# Use pre-trained embeddings in the model
embedding_matrix = load_glove_embeddings('glove.6B.100d.txt', word_index, vocab_size, 100)

model_with_glove = Sequential([
    Embedding(vocab_size, 100,
              weights=[embedding_matrix],
              input_length=max_length,
              trainable=False),  # Freeze embeddings initially
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
```
Setting trainable=False freezes the embedding weights, which works well with limited data. With larger datasets, allow fine-tuning by setting it to True.
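A common two-phase recipe: train with frozen embeddings first, then unfreeze, recompile with a smaller learning rate, and continue. The sketch below uses random stand-in data and a pooling head so it runs standalone; in practice you would apply the same steps to model_with_glove and the real training split:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real embedding matrix and training data
vocab_size, embedding_dim, max_length = 50, 8, 10
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype('float32')
x = np.random.randint(0, vocab_size, size=(32, max_length))
y = np.random.randint(0, 2, size=(32,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.build(input_shape=(None, max_length))

# Phase 1: embeddings frozen, only the head trains
model.compile(optimizer='adam', loss='binary_crossentropy')
emb_before = model.layers[0].get_weights()[0].copy()
model.fit(x, y, epochs=1, verbose=0)
emb_after = model.layers[0].get_weights()[0]  # unchanged while frozen

# Phase 2: unfreeze, recompile with a smaller learning rate, keep training
model.layers[0].trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy')
model.fit(x, y, epochs=1, verbose=0)
```

Recompiling after flipping trainable is required for the change to take effect; the reduced learning rate keeps the fine-tuning step from overwriting the pre-trained structure.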
Alternative architectures worth exploring: 1D CNNs are faster and work well for shorter texts, while Transformer-based models (BERT, DistilBERT) achieve state-of-the-art results but require more computational resources.
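As a sketch, a 1D-CNN variant of the classifier above might look like this (the filter count and kernel size are typical starting points, not tuned values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, GlobalMaxPooling1D,
                                     Dense, Dropout)

cnn_model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    Conv1D(filters=128, kernel_size=5, activation='relu'),  # learns 5-gram features
    GlobalMaxPooling1D(),  # keeps the strongest response per filter
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

Because convolutions run in parallel across the sequence rather than step by step, this model trains noticeably faster than the BiLSTM at some cost in long-range context.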
Conclusion and Next Steps
You now have a complete text classification pipeline: preprocessing text into padded sequences, building neural architectures with embeddings and recurrent layers, training with proper validation, and making predictions on new data.
To improve performance, experiment with data augmentation (synonym replacement, back-translation), ensemble multiple models, or fine-tune pre-trained language models. For production deployment, consider TensorFlow Serving or converting models to TensorFlow Lite for mobile devices.
The techniques here generalize to multi-class classification (change output layer to Dense(num_classes, activation='softmax') and use categorical cross-entropy) and other NLP tasks. Master these fundamentals, and you’ll be equipped to tackle complex text classification challenges in production systems.
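A sketch of that multi-class variant (the four classes and layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

num_classes = 4  # e.g. sports, politics, technology, business

multi_model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(num_classes, activation='softmax')  # one probability per class, summing to 1
])
# sparse_categorical_crossentropy expects integer labels (0..num_classes-1);
# use categorical_crossentropy if your labels are one-hot encoded
multi_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
```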