Vector Databases: Embeddings and Similarity Search

Key Insights

  • Vector embeddings transform unstructured data into numerical representations that capture semantic meaning, enabling machines to understand similarity between text, images, and other content types beyond exact keyword matching.
  • Traditional databases fail at high-dimensional vector search because they can’t efficiently index hundreds of dimensions—vector databases use specialized algorithms like HNSW and IVF to find approximate nearest neighbors in milliseconds.
  • The real power emerges when combining embeddings with metadata filtering, enabling hybrid search that finds semantically similar content while respecting business logic constraints like date ranges, categories, or user permissions.

What Are Vector Embeddings?

Vector embeddings are numerical representations of data that capture semantic meaning in high-dimensional space. Instead of storing text as strings or images as pixels, embeddings convert this data into arrays of floating-point numbers where similar concepts cluster together mathematically.

Modern embedding models typically output vectors with 384 to 1536 dimensions. OpenAI’s text-embedding-3-small produces 1536-dimensional vectors, while open-source models like all-MiniLM-L6-v2 generate 384 dimensions. Each dimension represents learned features that the model associates with semantic concepts.

Here’s how to generate embeddings using OpenAI’s API:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Generate embeddings for sample texts
texts = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Python is a programming language"
]

embeddings = [get_embedding(text) for text in texts]
print(f"Embedding dimensions: {len(embeddings[0])}")
# Output: Embedding dimensions: 1536

The first two sentences will have similar embeddings despite different wording because they describe the same concept. The third sentence will be mathematically distant from the others.

How Similarity Search Works

Similarity search finds vectors that are “close” to a query vector in high-dimensional space. The three primary distance metrics are:

Cosine similarity measures the angle between vectors, ranging from -1 (opposite) to 1 (identical). It’s ideal for text embeddings because it ignores magnitude and focuses on direction.

Euclidean distance calculates straight-line distance between points. It works well when magnitude matters, like in image embeddings.

Dot product combines both angle and magnitude. It’s computationally efficient and often used in recommendation systems.
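Each metric is a line or two of NumPy; a quick sketch on toy 3-dimensional vectors makes the differences concrete:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, double the magnitude

# Cosine similarity: direction only, so parallel vectors score 1.0
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: nonzero even for parallel vectors, because magnitude matters
euclidean = np.linalg.norm(a - b)

# Dot product: angle and magnitude combined
dot = np.dot(a, b)

print(f"cosine: {cosine:.2f}, euclidean: {euclidean:.2f}, dot: {dot:.2f}")
```

Note how b is just a scaled copy of a: cosine similarity reports them as identical, while the other two metrics register the difference in length.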

Traditional databases use B-trees and hash indexes optimized for exact matches and range queries. They break down with high-dimensional vectors because of the “curse of dimensionality”—as dimensions increase, all points become approximately equidistant, making traditional indexing useless.

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms that trade perfect accuracy for speed. Instead of checking every vector, they use graph-based or clustering approaches to narrow the search space.

Here’s cosine similarity calculation:

import numpy as np

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Compare our example embeddings
emb1 = np.array(embeddings[0])  # "The cat sat on the mat"
emb2 = np.array(embeddings[1])  # "A feline rested on the rug"
emb3 = np.array(embeddings[2])  # "Python is a programming language"

print(f"Cat vs Feline: {cosine_similarity(emb1, emb2):.4f}")
# High score: the sentences describe the same scene
print(f"Cat vs Python: {cosine_similarity(emb1, emb3):.4f}")
# Much lower score: unrelated topics
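At small scale, search is just a scan with this metric. A self-contained sketch, using random vectors as stand-ins for real embeddings, shows the brute-force baseline whose O(n·d) per-query cost is what forces the specialized indexes described next:

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(10_000, 384))                  # stand-in embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize once

def exact_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force top-k by cosine similarity: scores every vector, O(n * d)."""
    query = query / np.linalg.norm(query)
    scores = corpus @ query               # dot product == cosine on unit vectors
    return np.argsort(scores)[::-1][:k]   # indices of the k best matches

query = rng.normal(size=384)
print(exact_search(query))
```

This is exact and simple, but every query touches every vector; at millions of vectors and hundreds of dimensions the scan becomes the bottleneck.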

Vector Database Architecture

Vector databases use specialized indexes optimized for ANN search. The two most common algorithms are:

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node represents a vector. Search starts at the top layer with large jumps, then descends through layers making progressively smaller jumps until reaching the nearest neighbors. It offers excellent recall (accuracy) and query speed but requires more memory.

IVF (Inverted File Index) clusters vectors into buckets using k-means or similar algorithms. During search, it identifies the most relevant clusters and only searches within those. It’s more memory-efficient but may miss results if the query falls between cluster boundaries.

Here’s a conceptual representation:

# Simplified HNSW pseudocode
class HNSWIndex:
    def __init__(self, M=16, ef_construction=200):
        self.layers = []                  # one proximity graph per layer
        self.M = M                        # max connections per node
        self.ef_construction = ef_construction

    def insert(self, vector):
        # Pick the node's top layer (probability decays exponentially)
        layer = self.random_layer()

        # Start from the global entry point in the top layer
        entry = self.get_entry_point()

        # Navigate down the layers, linking to nearby neighbors at each level
        for level in range(layer, -1, -1):
            neighbors = self.search_layer(vector, entry, level)
            self.connect_neighbors(vector, neighbors, level)
            entry = neighbors[0]          # best match seeds the next level down

    def search(self, query, k=10, ef=50):
        # Greedily descend from the top layer to layer 0,
        # then widen the beam to ef candidates for the final pass
        entry = self.get_entry_point()
        for level in range(self.top_layer(), 0, -1):
            entry = self.search_layer(query, entry, level)[0]
        candidates = self.search_layer(query, entry, level=0, ef=ef)
        return self.select_neighbors(candidates, k)

The key trade-off is the ef parameter during search: higher values check more candidates for better accuracy but slower queries. Production systems typically use ef=50-200 depending on requirements.
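IVF's cluster-then-probe idea can be sketched the same way. In this toy NumPy version, `nprobe` (the number of clusters actually searched) is the standard FAISS-style knob, and the centroids come from random sampling rather than full k-means for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2_000, 64))

# "Train": pick centroids, then assign every vector to its nearest bucket
n_clusters = 20
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def ivf_search(query: np.ndarray, k: int = 5, nprobe: int = 3) -> np.ndarray:
    """Search only the nprobe buckets whose centroids are closest to the query."""
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assignments, probe))[0]
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

query = rng.normal(size=64)
print(ivf_search(query))
```

Raising nprobe recovers results that fall near cluster boundaries at the cost of scanning more buckets, the same accuracy-versus-speed dial that ef provides for HNSW.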

Practical Implementation

Let’s build a complete workflow using Qdrant, an open-source vector database:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Initialize client (local or cloud)
client = QdrantClient(url="http://localhost:6333")

# Create collection
collection_name = "documentation"
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    )
)

# Prepare documents with metadata
documents = [
    {"text": "Vector databases store embeddings for similarity search", "category": "databases"},
    {"text": "HNSW provides fast approximate nearest neighbor search", "category": "algorithms"},
    {"text": "Embeddings capture semantic meaning in numerical form", "category": "ml"},
]

# Insert vectors with metadata
points = []
for doc in documents:
    embedding = get_embedding(doc["text"])
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload=doc
    ))

client.upsert(collection_name=collection_name, points=points)

# Perform similarity search
query = "How do I search for similar vectors?"
query_embedding = get_embedding(query)

results = client.search(
    collection_name=collection_name,
    query_vector=query_embedding,
    limit=3
)

for result in results:
    print(f"Score: {result.score:.4f}")
    print(f"Text: {result.payload['text']}")
    print(f"Category: {result.payload['category']}\n")

The payload field is crucial—it allows filtering results by metadata while performing vector search. This enables queries like “find similar products in the electronics category under $100.”

Real-World Use Cases

Semantic search powers modern documentation and knowledge bases. Unlike keyword search, it understands intent. Searching “how to reset password” matches documents about “account recovery” and “credential management.”

def semantic_search(query: str, filters: dict = None):
    query_embedding = get_embedding(query)
    
    search_params = {
        "collection_name": "documentation",
        "query_vector": query_embedding,
        "limit": 5
    }
    
    # Add metadata filters
    if filters:
        from qdrant_client.models import Filter, FieldCondition, MatchValue
        search_params["query_filter"] = Filter(
            must=[
                FieldCondition(
                    key=k,
                    match=MatchValue(value=v)
                ) for k, v in filters.items()
            ]
        )
    
    return client.search(**search_params)

# Search with filters
results = semantic_search(
    query="database performance optimization",
    filters={"category": "databases"}
)

Retrieval-Augmented Generation (RAG) combines vector search with LLMs. Instead of relying solely on training data, the LLM retrieves relevant context from your documents before generating responses. This reduces hallucinations and grounds answers in your actual data.
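The retrieval half of RAG is exactly the search shown above; the remaining step is packing the hits into the LLM's prompt. A minimal sketch (build_rag_prompt is an illustrative helper, not a library function):

```python
def build_rag_prompt(query: str, retrieved_texts: list[str]) -> str:
    """Assemble an LLM prompt that grounds the answer in retrieved context."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(retrieved_texts))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# In a full pipeline, retrieved_texts would come from the vector search above:
#   hits = client.search(collection_name=collection_name,
#                        query_vector=get_embedding(query), limit=3)
#   retrieved_texts = [hit.payload["text"] for hit in hits]
prompt = build_rag_prompt(
    "What does HNSW provide?",
    ["HNSW provides fast approximate nearest neighbor search"],
)
print(prompt)
```

The "only the context below" instruction is what does the grounding: the model is steered away from inventing answers its retrieved documents don't support.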

Recommendation systems use embeddings to find similar products, articles, or users. Netflix and Spotify embed viewing/listening history into vectors and find similar profiles for personalized recommendations.
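One common pattern is to average the embeddings of a user's history into a single taste vector and rank the catalog against it; a toy sketch with random item embeddings:

```python
import numpy as np

rng = np.random.default_rng(7)
catalog = rng.normal(size=(1_000, 64))                   # toy item embeddings
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

def recommend(history_ids: list[int], k: int = 5) -> np.ndarray:
    """Average a user's item embeddings into a profile, then rank by cosine."""
    profile = catalog[history_ids].mean(axis=0)
    profile /= np.linalg.norm(profile)
    scores = catalog @ profile
    scores[history_ids] = -np.inf                        # never re-recommend seen items
    return np.argsort(scores)[::-1][:k]

print(recommend([3, 17, 42]))
```

Production systems learn the profile vector rather than averaging, but the retrieval step, nearest neighbors to a user vector, is the same vector search as everywhere else in this post.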

Performance Optimization & Best Practices

Choose dimensionality wisely. Higher dimensions capture more nuance but increase storage and query latency. For most text applications, 384-768 dimensions suffice. Only use 1536+ dimensions when you need maximum semantic precision.
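OpenAI's text-embedding-3 models expose a dimensions parameter that truncates server-side; the same effect, keeping a prefix of the vector and re-normalizing, can be sketched locally. Note this is only safe for models trained to support it (Matryoshka-style embeddings); truncate_embedding here is an illustrative helper:

```python
import numpy as np

def truncate_embedding(vector: list[float], dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    v = np.asarray(vector[:dims])
    return v / np.linalg.norm(v)

full = np.random.default_rng(1).normal(size=1536)
short = truncate_embedding(full.tolist(), 512)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

A 512-dimensional prefix cuts storage and query cost by two-thirds relative to the full 1536 dimensions, which is often worth a small recall loss.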

Batch operations dramatically improve throughput:

# Bad: Individual inserts
for doc in documents:
    embedding = get_embedding(doc["text"])
    client.upsert(collection_name, points=[create_point(embedding, doc)])

# Good: Batch processing
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    # The embeddings endpoint accepts a list of inputs in one call
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    texts = [doc["text"] for doc in batch]
    embeddings = get_embeddings_batch(texts)  # Single API call
    points = [create_point(emb, doc) for emb, doc in zip(embeddings, batch)]
    client.upsert(collection_name, points=points)

Monitor query performance by tracking p95 and p99 latencies, not just averages. A few slow queries can destroy user experience:

import time

def benchmark_query(query: str, iterations: int = 100):
    latencies = []
    query_embedding = get_embedding(query)
    
    for _ in range(iterations):
        start = time.perf_counter()
        client.search(
            collection_name=collection_name,
            query_vector=query_embedding,
            limit=10
        )
        latencies.append((time.perf_counter() - start) * 1000)
    
    latencies.sort()
    for pct in (50, 95, 99):
        # Derive the index from iterations so percentiles hold for any count
        idx = min(round(iterations * pct / 100), iterations - 1)
        print(f"p{pct}: {latencies[idx]:.2f}ms")

Cost optimization matters at scale. Embedding API costs add up—cache embeddings for static content. Storage costs scale with dimensions × vector count. A million 1536-dimensional vectors at 4 bytes per float requires 6GB just for vectors, before indexes.

Use quantization for production systems. Qdrant and Pinecone support scalar quantization that reduces memory by 4x with minimal accuracy loss. This trades CPU for memory—vectors are decompressed during comparison.
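The mechanics of scalar quantization are simple enough to sketch in NumPy: map each float32 value onto 256 levels between the observed min and max. Real engines typically quantize per-segment and can keep the originals around for rescoring; quantize and dequantize here are illustrative helpers:

```python
import numpy as np

def quantize(vectors: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map float32 values to uint8: 4 bytes per value becomes 1."""
    lo, hi = vectors.min(), vectors.max()
    q = np.round((vectors - lo) / (hi - lo) * 255).astype(np.uint8)
    return q, lo, hi

def dequantize(q: np.ndarray, lo: float, hi: float) -> np.ndarray:
    return q.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(3)
v = rng.normal(size=(100, 64)).astype(np.float32)
q, lo, hi = quantize(v)
restored = dequantize(q, lo, hi)
print(q.nbytes, v.nbytes)                 # 4x smaller
print(float(np.abs(v - restored).max())) # small reconstruction error
```

The reconstruction error is bounded by half a quantization step, which is why recall barely moves for most workloads while memory drops fourfold.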

Vector databases have moved from research novelty to production necessity. They enable applications that understand meaning, not just match keywords. Start with a managed service like Pinecone or Qdrant Cloud, measure your actual query patterns, then optimize based on data.
