How to Use Word Embeddings in TensorFlow
Key Insights
- Word embeddings transform discrete words into dense numerical vectors that capture semantic relationships, making them essential for modern NLP tasks where similar words cluster together in vector space
- TensorFlow’s Embedding layer trains task-specific representations from scratch during model training, while pre-trained embeddings like GloVe provide transfer learning benefits for smaller datasets
- Proper embedding dimension selection (typically 50-300) and vocabulary management directly impact model performance—larger dimensions capture more nuance but require more training data to avoid overfitting
Introduction to Word Embeddings
Word embeddings solve a fundamental problem in natural language processing: computers don’t understand words, they understand numbers. Traditional one-hot encoding creates sparse vectors where each word is a single dimension in a massive vocabulary-sized vector. This approach fails to capture semantic relationships—“king” and “queen” are just as different as “king” and “potato.”
Word embeddings represent words as dense vectors in continuous space, typically 50-300 dimensions. Words with similar meanings cluster together in this space. The famous example: vector(“king”) - vector(“man”) + vector(“woman”) ≈ vector(“queen”). This mathematical property enables models to understand semantic relationships and generalize better.
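The analogy arithmetic can be sketched with plain NumPy. The 4-dimensional vectors below are hand-picked toy values purely for illustration (real embeddings are learned, and use far more dimensions), but they show the mechanics: subtract, add, then find the nearest remaining vector by cosine similarity.

```python
import numpy as np

# Toy hand-crafted vectors for illustration only; real embeddings are learned.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1, 0.3]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.3]),
    "man":    np.array([0.1, 0.9, 0.1, 0.2]),
    "woman":  np.array([0.1, 0.1, 0.9, 0.2]),
    "potato": np.array([0.0, 0.2, 0.2, 0.9]),
}

def nearest(query, vectors, exclude):
    """Return the word whose vector is most cosine-similar to query."""
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

query = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(query, vectors, exclude={"king", "man", "woman"}))  # queen
```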
Popular pre-trained embeddings include Word2Vec (trained on Google News), GloVe (trained on Wikipedia and web text), and FastText (handles out-of-vocabulary words through subword information). However, training custom embeddings on your specific domain often yields better results for specialized tasks.
Setting Up TensorFlow and Data Preparation
Start by installing TensorFlow and preparing your text data. Text preprocessing involves tokenization, vocabulary building, and sequence conversion.
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample dataset - replace with your actual data
texts = [
    "I love machine learning and deep learning",
    "Natural language processing is fascinating",
    "TensorFlow makes building models easier",
    "Word embeddings capture semantic meaning",
    "Neural networks learn from data"
]
labels = [1, 1, 1, 1, 1]  # Toy positive-only labels; real training needs both classes
# Create and fit tokenizer
vocab_size = 1000
max_length = 20
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
print(f"Vocabulary size: {len(tokenizer.word_index)}")
print(f"Sequence shape: {padded_sequences.shape}")
The oov_token parameter maps out-of-vocabulary words encountered during inference to a dedicated index instead of silently dropping them. Padding ensures all sequences have uniform length, which TensorFlow requires for batch processing. Note that Tokenizer and pad_sequences are legacy utilities; recent TensorFlow releases recommend the tf.keras.layers.TextVectorization layer, which covers the same steps in one component.
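To see OOV handling and padding concretely, here is a small sketch using tf.keras.layers.TextVectorization, the modern counterpart of the legacy Tokenizer; the two-sentence corpus is invented for illustration. By default the layer reserves index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import tensorflow as tf

# Fit a vectorizer on a tiny made-up corpus
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_sequence_length=20)
vectorizer.adapt(["i love machine learning", "tensorflow makes models easier"])

# "quantum" never appeared during adapt(), so it maps to the OOV index 1;
# positions past the last token are padded with 0.
ids = vectorizer(["i love quantum learning"]).numpy()[0]
print(ids[:4])
```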
Creating Embeddings with tf.keras.layers.Embedding
The Embedding layer is the core component for working with word embeddings in TensorFlow. It learns a dense representation for each word during training.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
# Define model architecture
embedding_dim = 64
model = Sequential([
    Embedding(input_dim=vocab_size,
              output_dim=embedding_dim,
              input_length=max_length),
    GlobalAveragePooling1D(),
    Dense(24, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
The Embedding layer parameters:
- input_dim: Vocabulary size (number of unique words the table can index)
- output_dim: Embedding dimension (vector size for each word)
- input_length: Length of input sequences (optional; newer Keras releases no longer use it)
The layer creates a lookup table with shape (vocab_size, embedding_dim). During training, it retrieves embeddings for input word indices and updates them via backpropagation.
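The lookup behavior is easy to verify with a standalone layer. The shapes below follow directly from the description above: a table of shape (input_dim, output_dim), and an output that replaces each integer index with its table row.

```python
import numpy as np
import tensorflow as tf

# An Embedding layer is a trainable lookup table:
# each integer index selects one row of its (input_dim, output_dim) matrix.
layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)
batch = tf.constant([[1, 2, 2, 0]])  # one sequence of 4 word indices

out = layer(batch)
print(out.shape)                     # (1, 4, 4): batch, sequence, embedding_dim

table = layer.get_weights()[0]
print(table.shape)                   # (10, 4): one row per vocabulary index

# Positions 1 and 2 of the sequence both hold index 2, so their vectors match.
print(np.allclose(out[0, 1].numpy(), out[0, 2].numpy()))  # True
```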
Training Custom Embeddings
Training embeddings from scratch lets them specialize for your specific task and domain. Here’s a complete training example:
from sklearn.model_selection import train_test_split
# Expanded dataset for demonstration
# In practice, use thousands of examples
train_texts = texts * 100 # Repeat for demonstration
train_labels = labels * 100
# Prepare data
sequences = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
X_train, X_val, y_train, y_val = train_test_split(
    padded, train_labels, test_size=0.2, random_state=42
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=20,
    validation_data=(X_val, y_val),
    batch_size=32,
    verbose=1
)
# Plot training history
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
The embeddings learn task-specific representations. For sentiment analysis, words like “excellent” and “terrible” will be far apart in embedding space because they predict opposite labels.
Using Pre-trained Embeddings
Pre-trained embeddings provide strong baselines, especially with limited training data. Here’s how to load GloVe embeddings:
import os
def load_glove_embeddings(glove_file, word_index, embedding_dim=100):
    """Load GloVe embeddings and create embedding matrix"""
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print(f'Loaded {len(embeddings_index)} word vectors')
    # Create embedding matrix
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
# Load pre-trained embeddings
# Download GloVe from: https://nlp.stanford.edu/projects/glove/
embedding_matrix = load_glove_embeddings(
    'glove.6B.100d.txt',
    tokenizer.word_index,
    embedding_dim=100
)
# Create model with pre-trained embeddings.
# input_dim must match the number of rows in embedding_matrix
# (len(tokenizer.word_index) + 1), not the num_words cap, or Keras
# will reject the weights for having the wrong shape.
model_pretrained = Sequential([
    Embedding(input_dim=embedding_matrix.shape[0],
              output_dim=100,
              input_length=max_length,
              weights=[embedding_matrix],
              trainable=False),  # Freeze embeddings
    GlobalAveragePooling1D(),
    Dense(24, activation='relu'),
    Dense(1, activation='sigmoid')
])
model_pretrained.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['accuracy'])
Set trainable=False to freeze embeddings and use them as fixed features. Set trainable=True to fine-tune them on your task, which often improves performance when you have sufficient training data.
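A common recipe combines both settings: train the classifier head with frozen embeddings first, then unfreeze and fine-tune everything with a lower learning rate. The sketch below uses a random placeholder matrix and random data (real code would use the GloVe matrix and your dataset), and loads the pre-trained weights via set_weights, which works across Keras versions.

```python
import numpy as np
import tensorflow as tf

# Placeholder stand-ins for a real pre-trained matrix and real data
num_tokens, dim, seq_len = 50, 8, 10
pretrained_matrix = np.random.rand(num_tokens, dim).astype("float32")

emb = tf.keras.layers.Embedding(num_tokens, dim, trainable=False)
model = tf.keras.Sequential([
    emb,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, seq_len))
emb.set_weights([pretrained_matrix])   # load the "pre-trained" vectors
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.randint(1, num_tokens, size=(32, seq_len))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)   # phase 1: embeddings stay frozen
frozen_after = emb.get_weights()[0].copy()

emb.trainable = True                   # phase 2: unfreeze...
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")  # ...and recompile with a lower LR
model.fit(X, y, epochs=1, verbose=0)
```

Recompiling after flipping trainable is required: Keras fixes the set of trainable variables at compile time.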
Visualizing and Analyzing Embeddings
Understanding what your embeddings learn is crucial. Extract and analyze them using dimensionality reduction and similarity metrics:
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
# Extract embedding weights
embedding_layer = model.layers[0]
embedding_weights = embedding_layer.get_weights()[0]
print(f"Embedding matrix shape: {embedding_weights.shape}")
# Find similar words using cosine similarity
def find_similar_words(word, tokenizer, embeddings, top_n=5):
    """Find most similar words to given word"""
    if word not in tokenizer.word_index:
        return []
    word_idx = tokenizer.word_index[word]
    word_vec = embeddings[word_idx].reshape(1, -1)
    # Compute similarities
    similarities = cosine_similarity(word_vec, embeddings)[0]
    # Get top N similar word indices
    similar_indices = similarities.argsort()[-top_n-1:-1][::-1]
    # Convert indices back to words
    idx_to_word = {idx: word for word, idx in tokenizer.word_index.items()}
    similar_words = [(idx_to_word.get(idx, '<UNK>'), similarities[idx])
                     for idx in similar_indices]
    return similar_words
# Example usage
similar = find_similar_words('learning', tokenizer, embedding_weights)
print(f"Words similar to 'learning': {similar}")
# Visualize with t-SNE
def visualize_embeddings(embeddings, word_index, num_words=100):
    """Visualize embeddings using t-SNE"""
    # Take subset of words (skip the padding row at index 0)
    embeddings_subset = embeddings[1:num_words+1]
    # Reduce to 2D; perplexity must stay below the number of points,
    # which matters for small demo vocabularies
    tsne = TSNE(n_components=2, random_state=42,
                perplexity=min(30, len(embeddings_subset) - 1))
    embeddings_2d = tsne.fit_transform(embeddings_subset)
    # Plot
    plt.figure(figsize=(12, 8))
    idx_to_word = {idx: word for word, idx in word_index.items()}
    for i, (x, y) in enumerate(embeddings_2d):
        plt.scatter(x, y)
        plt.annotate(idx_to_word.get(i+1, ''), (x, y), fontsize=8)
    plt.title('Word Embeddings Visualization (t-SNE)')
    plt.show()
For production applications, use TensorBoard’s Embedding Projector for interactive visualization with built-in search and 3D exploration capabilities.
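The standalone Embedding Projector at projector.tensorflow.org accepts a simple upload format: one TSV file of vectors and one of labels. The sketch below dumps both files; the small random matrix and word_index dict are placeholders for the embedding_weights and tokenizer.word_index built earlier.

```python
import numpy as np

# Placeholders standing in for the trained matrix and tokenizer vocabulary
embedding_weights = np.random.rand(6, 4)
word_index = {"love": 1, "learning": 2, "models": 3, "data": 4, "nlp": 5}

# Write one tab-separated vector per line, with the matching word on the
# same line of the metadata file (row 0, the padding vector, is skipped).
with open("vectors.tsv", "w", encoding="utf-8") as vec_f, \
     open("metadata.tsv", "w", encoding="utf-8") as meta_f:
    for word, i in sorted(word_index.items(), key=lambda kv: kv[1]):
        vec_f.write("\t".join(str(x) for x in embedding_weights[i]) + "\n")
        meta_f.write(word + "\n")
```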
Best Practices and Common Pitfalls
Choosing Embedding Dimensions: Start with 50-100 dimensions for small vocabularies (<10K words), 100-300 for medium vocabularies. Larger dimensions capture more nuance but require more training data. If your model overfits, reduce embedding dimensions before adding regularization.
Handling Out-of-Vocabulary Words: Always use an oov_token in your tokenizer. For production systems, consider subword tokenization (BPE, WordPiece) or character-level models to handle unseen words gracefully.
Pre-trained vs. Custom Embeddings: Use pre-trained embeddings when you have <10K training examples or your domain overlaps with the pre-training corpus. Train custom embeddings when you have abundant data or highly specialized vocabulary (medical, legal, technical domains).
Performance Optimization: Embedding lookups are memory-bound operations. Reduce vocabulary size by removing rare words (appearing <5 times). Use mixed precision training for faster computation. Consider embedding quantization for deployment.
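The rare-word pruning step above can be sketched in a few lines. The counts dict is a made-up placeholder for tokenizer.word_counts, which the Keras Tokenizer populates during fit_on_texts; after filtering, the surviving words are re-indexed from 1 so index 0 stays reserved for padding.

```python
# Placeholder for tokenizer.word_counts (word -> occurrence count)
word_counts = {"learning": 12, "model": 9, "the": 40, "zeugma": 1, "ossify": 2}
min_count = 5

# Keep only words seen at least min_count times, then re-index from 1
kept = {w for w, c in word_counts.items() if c >= min_count}
word_index = {w: i for i, w in enumerate(sorted(kept), start=1)}
print(len(word_index))  # 3 words survive the cutoff
```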
Common Mistake: Setting input_dim to an arbitrary fixed number instead of the actual vocabulary size. If token indices can exceed the embedding table size, lookups fail with index errors. Set input_dim=len(tokenizer.word_index) + 1 (the +1 covers the reserved padding index), capped at num_words if you passed that limit to the Tokenizer.
Word embeddings remain foundational in NLP despite the rise of transformers. Understanding how to train, use, and analyze them in TensorFlow gives you powerful tools for building effective text classification, sentiment analysis, and recommendation systems.