Design a Typeahead Suggestion System: Autocomplete
Key Insights
- Tries (prefix trees) are the foundational data structure for typeahead systems, but production implementations require compression and careful metadata storage to balance memory usage with query speed.
- Latency is the critical metric—users expect suggestions within 100ms, which means aggressive caching at multiple layers and geographic distribution are non-negotiable at scale.
- The ranking algorithm matters more than the data structure; a fast but poorly-ranked suggestion system will frustrate users more than a slightly slower but relevant one.
Introduction & Problem Definition
Typeahead suggestion systems are everywhere. When you start typing in Google Search, your IDE, or an e-commerce search bar, you expect instant, relevant suggestions. These systems seem simple on the surface—match prefixes and return results—but building one that handles millions of queries per second while maintaining sub-100ms latency is a genuine engineering challenge.
The core requirements are straightforward: given a prefix like “how to”, return the top 5-10 most relevant completions. But “relevant” is where things get complicated. Should you prioritize global popularity? User history? Trending topics? Geographic context? The answer is usually “all of the above,” weighted appropriately.
At scale, you’re looking at constraints like:
- Latency: Under 100ms end-to-end, ideally under 50ms
- Throughput: Potentially millions of queries per second
- Freshness: New trending terms should appear within minutes
- Quality: No offensive suggestions, high relevance
High-Level Architecture Overview
Let’s establish the component architecture before diving into implementation details.
┌─────────────┐ ┌─────────────┐ ┌──────────────────┐
│ Client │────▶│ CDN/Edge │────▶│ API Gateway │
│ (Browser) │ │ Cache │ │ (Rate Limiting) │
└─────────────┘ └─────────────┘ └────────┬─────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Suggestion Service │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │ │
│ │ │ Trie Store │ │ Ranker │ │ Cache │ │ │
│ │ └─────────────┘ └─────────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Data Pipeline │ │
│ │ (Log Aggregation, Frequency Updates, Filtering) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The API contract is minimal:
# OpenAPI specification
paths:
/suggestions:
get:
parameters:
- name: q
in: query
required: true
schema:
type: string
description: The prefix to match
- name: limit
in: query
schema:
type: integer
default: 10
maximum: 20
- name: context
in: query
schema:
type: string
description: Optional context (e.g., "shopping", "images")
responses:
        '200':
          description: Ranked suggestions for the matched prefix
content:
application/json:
schema:
type: object
properties:
suggestions:
type: array
items:
type: object
properties:
text: { type: string }
score: { type: number }
Data Structures: Trie & Optimizations
The trie (prefix tree) is the natural choice for typeahead. Each path from the root to a node represents a prefix, and finding all completions for a prefix costs O(p + s), where p is the prefix length and s is the size of the subtree under the prefix node (the DFS visits every node in that subtree, not just the terminal ones).
Here’s a practical implementation with frequency tracking:
from typing import Optional
from dataclasses import dataclass, field
from heapq import nlargest
@dataclass
class TrieNode:
children: dict = field(default_factory=dict)
is_terminal: bool = False
frequency: int = 0
term: Optional[str] = None # Store full term at terminal nodes
class SuggestionTrie:
def __init__(self):
self.root = TrieNode()
def insert(self, term: str, frequency: int = 1) -> None:
node = self.root
for char in term.lower():
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_terminal = True
node.frequency += frequency
node.term = term
def _find_prefix_node(self, prefix: str) -> Optional[TrieNode]:
node = self.root
for char in prefix.lower():
if char not in node.children:
return None
node = node.children[char]
return node
def _collect_terms(self, node: TrieNode, results: list) -> None:
"""DFS to collect all terminal nodes under this subtree."""
if node.is_terminal:
results.append((node.term, node.frequency))
for child in node.children.values():
self._collect_terms(child, results)
def search(self, prefix: str, limit: int = 10) -> list[tuple[str, int]]:
node = self._find_prefix_node(prefix)
if not node:
return []
results = []
self._collect_terms(node, results)
# Return top-k by frequency
return nlargest(limit, results, key=lambda x: x[1])
# Usage
trie = SuggestionTrie()
trie.insert("how to cook rice", frequency=50000)
trie.insert("how to tie a tie", frequency=45000)
trie.insert("how to lose weight", frequency=80000)
print(trie.search("how to"))
# [('how to lose weight', 80000), ('how to cook rice', 50000), ('how to tie a tie', 45000)]
For production systems, consider a radix tree (compressed trie) where chains of single-child nodes are collapsed. This reduces memory usage significantly—often by 50-80% for natural language data.
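To make the compression concrete, here is a minimal sketch (stripped-down node types, not the full SuggestionTrie above) that collapses single-child chains of a plain trie into labeled edges and counts nodes before and after:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # char -> TrieNode
    is_terminal: bool = False

@dataclass
class RadixNode:
    children: dict = field(default_factory=dict)  # edge label (str) -> RadixNode
    is_terminal: bool = False

def insert(root: TrieNode, term: str) -> None:
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True

def compress(node: TrieNode) -> RadixNode:
    """Collapse chains of single-child, non-terminal nodes into one edge."""
    radix = RadixNode(is_terminal=node.is_terminal)
    for ch, child in node.children.items():
        label = ch
        while len(child.children) == 1 and not child.is_terminal:
            (next_ch, next_child), = child.children.items()
            label += next_ch
            child = next_child
        radix.children[label] = compress(child)
    return radix

def count_nodes(node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = TrieNode()
for term in ["how to cook rice", "how to tie a tie", "how to lose weight"]:
    insert(root, term)
radix = compress(root)
print(count_nodes(root), count_nodes(radix))  # → 37 5
```

The shared prefix "how to " becomes a single edge, and each of the three completions collapses to one leaf, which is where the large memory savings on natural-language data come from.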
Ranking & Relevance Algorithms
Raw frequency isn’t enough. A good ranking function combines multiple signals:
import math
from dataclasses import dataclass
from time import time
@dataclass
class SuggestionCandidate:
term: str
global_frequency: int
user_frequency: int # How often this user searched this
last_searched: float # Unix timestamp
category_match: bool # Does it match current context?
class SuggestionRanker:
def __init__(
self,
global_weight: float = 0.4,
personal_weight: float = 0.3,
recency_weight: float = 0.2,
context_weight: float = 0.1,
recency_halflife_days: float = 7.0
):
self.global_weight = global_weight
self.personal_weight = personal_weight
self.recency_weight = recency_weight
self.context_weight = context_weight
self.recency_halflife = recency_halflife_days * 86400 # Convert to seconds
def _normalize_frequency(self, freq: int, max_freq: int) -> float:
"""Log-normalize frequency to prevent popular terms from dominating."""
if freq == 0 or max_freq == 0:
return 0.0
return math.log(1 + freq) / math.log(1 + max_freq)
def _recency_decay(self, timestamp: float) -> float:
"""Exponential decay based on time since last search."""
if timestamp == 0:
return 0.0
age = time() - timestamp
return math.exp(-age / self.recency_halflife)
def score(
self,
candidate: SuggestionCandidate,
max_global_freq: int,
max_user_freq: int
) -> float:
global_score = self._normalize_frequency(
candidate.global_frequency, max_global_freq
)
personal_score = self._normalize_frequency(
candidate.user_frequency, max_user_freq
)
recency_score = self._recency_decay(candidate.last_searched)
context_score = 1.0 if candidate.category_match else 0.0
return (
self.global_weight * global_score +
self.personal_weight * personal_score +
self.recency_weight * recency_score +
self.context_weight * context_score
)
def rank(
self,
candidates: list[SuggestionCandidate],
limit: int = 10
) -> list[tuple[str, float]]:
if not candidates:
return []
max_global = max(c.global_frequency for c in candidates)
max_user = max(c.user_frequency for c in candidates)
scored = [
(c.term, self.score(c, max_global, max_user))
for c in candidates
]
scored.sort(key=lambda x: x[1], reverse=True)
return scored[:limit]
The weights are tunable per product. An e-commerce site might weight context heavily, while a general search engine might favor global popularity.
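To see what the weights actually do, here is the blend from SuggestionRanker.score reduced to plain arithmetic, with made-up component scores for illustration:

```python
# Default weights from SuggestionRanker above.
W_GLOBAL, W_PERSONAL, W_RECENCY, W_CONTEXT = 0.4, 0.3, 0.2, 0.1

def blend(global_s: float, personal_s: float,
          recency_s: float, context_s: float) -> float:
    return (W_GLOBAL * global_s + W_PERSONAL * personal_s
            + W_RECENCY * recency_s + W_CONTEXT * context_s)

# A globally dominant term the user has never searched...
popular = blend(1.0, 0.0, 0.0, 0.0)
# ...is outranked by a moderately popular term the user searched
# recently, in the current context.
personal = blend(0.5, 1.0, 0.9, 1.0)
print(popular, personal)  # 0.4 vs ~0.78
```

Even with global popularity weighted highest, a term that scores on three signals beats a term that maxes out only one, which is usually the behavior you want.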
Scaling & Performance Optimization
At scale, you can’t query a single trie. Sharding strategies include:
- Prefix-based sharding: Terms starting with ‘a-f’ go to shard 1, ‘g-m’ to shard 2, etc. Simple but can create hotspots.
- Consistent hashing: Better distribution but requires querying multiple shards for short prefixes.
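A prefix-based router is only a few lines. The boundaries below are illustrative; a real deployment sizes the ranges by measured traffic, since first letters are far from uniformly distributed in English:

```python
import bisect

# Hypothetical shard boundaries: shard 0 owns a-f, shard 1 g-l,
# shard 2 m-r, shard 3 s-z. Tune these to balance observed load.
SHARD_BOUNDARIES = ["g", "m", "s"]

def shard_for_prefix(prefix: str) -> int:
    """Route a query prefix to the shard owning its first character."""
    first = prefix.lower()[:1] or "a"  # empty prefix defaults to shard 0
    return bisect.bisect_right(SHARD_BOUNDARIES, first)

print(shard_for_prefix("how to"))   # 'h' falls in g-l → shard 1
print(shard_for_prefix("weather"))  # 'w' falls in s-z → shard 3
```

The routing stays a pure function of the prefix, so any frontend can compute the target shard without a lookup service; the cost is the rebalancing work when boundaries change.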
Caching is where you win the latency battle:
import redis
import json
from typing import Optional
class SuggestionCache:
def __init__(
self,
redis_client: redis.Redis,
default_ttl: int = 300, # 5 minutes
popular_ttl: int = 3600 # 1 hour for popular prefixes
):
self.redis = redis_client
self.default_ttl = default_ttl
self.popular_ttl = popular_ttl
self.popular_prefixes = self._load_popular_prefixes()
def _load_popular_prefixes(self) -> set:
"""Load precomputed popular prefixes (top 10k or so)."""
# In practice, load from config or precompute
return {"how to", "what is", "where", "why", "best"}
def _cache_key(self, prefix: str, context: Optional[str] = None) -> str:
base = f"suggest:{prefix.lower()}"
return f"{base}:{context}" if context else base
def get(
self,
prefix: str,
context: Optional[str] = None
) -> Optional[list[dict]]:
key = self._cache_key(prefix, context)
cached = self.redis.get(key)
if cached:
return json.loads(cached)
return None
def set(
self,
prefix: str,
suggestions: list[dict],
context: Optional[str] = None
) -> None:
key = self._cache_key(prefix, context)
ttl = (
self.popular_ttl
if prefix.lower() in self.popular_prefixes
else self.default_ttl
)
self.redis.setex(key, ttl, json.dumps(suggestions))
def warm_cache(self, popular_suggestions: dict[str, list[dict]]) -> None:
"""Precompute and cache top suggestions for popular prefixes."""
pipe = self.redis.pipeline()
for prefix, suggestions in popular_suggestions.items():
key = self._cache_key(prefix)
pipe.setex(key, self.popular_ttl, json.dumps(suggestions))
pipe.execute()
For the top 1000 prefixes, precompute suggestions and cache them at the CDN edge. This handles a disproportionate amount of traffic.
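The service-level cache sits in front of the trie as a read-through layer. A minimal sketch with stand-in classes (DictCache and CountingBackend replace Redis and the real trie, purely for illustration):

```python
# Read-through: serve from cache when possible, otherwise query the
# suggestion backend and populate the cache on the way out.
def get_suggestions(prefix: str, cache, backend, limit: int = 10):
    hit = cache.get(prefix)
    if hit is not None:
        return hit
    results = backend.search(prefix, limit)
    cache.set(prefix, results)
    return results

class DictCache:
    """Stand-in for SuggestionCache (no TTLs, plain dict)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value

class CountingBackend:
    """Stand-in for SuggestionTrie that counts how often it is hit."""
    def __init__(self):
        self.calls = 0
    def search(self, prefix, limit):
        self.calls += 1
        return [("how to lose weight", 80000)]

cache, backend = DictCache(), CountingBackend()
get_suggestions("how to", cache, backend)
get_suggestions("how to", cache, backend)
print(backend.calls)  # → 1: the second lookup never reaches the backend
```

With real traffic, prefix queries are heavily skewed toward a small set of popular prefixes, so even a modest cache absorbs most of the backend load.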
Data Pipeline & Updates
Suggestions need to stay fresh. Here’s a simplified streaming aggregation approach:
from collections import defaultdict
from dataclasses import dataclass
from time import time
import threading
@dataclass
class SearchEvent:
term: str
timestamp: float
user_id: str
class FrequencyAggregator:
def __init__(self, flush_interval: int = 60, min_count: int = 5):
self.counts = defaultdict(int)
        self.flush_interval = flush_interval  # advisory: the caller schedules flush() at this cadence
self.min_count = min_count
self.lock = threading.Lock()
self.blocked_terms = self._load_blocklist()
def _load_blocklist(self) -> set:
"""Load offensive/inappropriate terms to filter."""
# In practice, load from a maintained blocklist
return set()
def _is_valid_term(self, term: str) -> bool:
term_lower = term.lower().strip()
if len(term_lower) < 2 or len(term_lower) > 100:
return False
if term_lower in self.blocked_terms:
return False
return True
def process_event(self, event: SearchEvent) -> None:
if not self._is_valid_term(event.term):
return
with self.lock:
self.counts[event.term.lower()] += 1
def flush(self) -> dict[str, int]:
"""Return counts and reset. Called periodically."""
with self.lock:
# Filter out low-count terms (likely typos or spam)
results = {
term: count
for term, count in self.counts.items()
if count >= self.min_count
}
self.counts.clear()
return results
In production, you’d use Kafka or Kinesis for event streaming and Flink or Spark Streaming for aggregation.
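Between the aggregator and the serving trie sits a scheduler that periodically drains the counts. PeriodicFlusher below is a hypothetical single-process sketch of that glue (in production this role belongs to a Flink window or a batch job, as noted above):

```python
import threading

class PeriodicFlusher:
    """Every `interval` seconds, drain counts from `flush_fn` (e.g. the
    FrequencyAggregator.flush above) and apply each one via `apply_fn`
    (e.g. SuggestionTrie.insert)."""
    def __init__(self, flush_fn, apply_fn, interval: float = 60.0):
        self.flush_fn = flush_fn
        self.apply_fn = apply_fn
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep.
        while not self._stop.wait(self.interval):
            for term, count in self.flush_fn().items():
                self.apply_fn(term, count)

# Demo with a tiny interval (production would use ~60s).
applied = []
flusher = PeriodicFlusher(lambda: {"how to": 3},
                          lambda t, c: applied.append((t, c)),
                          interval=0.01)
flusher.start()
threading.Event().wait(0.1)  # let several flushes fire
flusher.stop()
```

The daemon thread and Event-based wait keep shutdown prompt: stop() wakes the loop immediately instead of waiting out a full interval.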
Client-Side Considerations & Wrap-Up
The client side matters as much as the backend. Debouncing prevents flooding your servers:
class TypeaheadClient {
  constructor(endpoint, debounceMs = 150) {
    this.endpoint = endpoint;
    this.debounceMs = debounceMs;
    this.controller = null;
    this.timeoutId = null;
    this.pendingResolve = null; // resolver for a debounce-cancelled call
  }

  async getSuggestions(prefix, context = null) {
    // Cancel any in-flight request
    if (this.controller) {
      this.controller.abort();
    }
    // Cancel any pending debounce timer, settling its promise so earlier
    // callers are never left awaiting forever
    if (this.timeoutId) {
      clearTimeout(this.timeoutId);
      if (this.pendingResolve) {
        this.pendingResolve([]);
        this.pendingResolve = null;
      }
    }
    // Don't query for very short prefixes
    if (prefix.length < 2) {
      return [];
    }
    return new Promise((resolve) => {
      this.pendingResolve = resolve;
      this.timeoutId = setTimeout(async () => {
        this.pendingResolve = null;
        this.controller = new AbortController();
        try {
          const params = new URLSearchParams({ q: prefix });
          if (context) params.append('context', context);
          const response = await fetch(
            `${this.endpoint}?${params}`,
            { signal: this.controller.signal }
          );
          if (!response.ok) {
            // Graceful degradation: return empty on error
            resolve([]);
            return;
          }
          const data = await response.json();
          resolve(data.suggestions || []);
        } catch (error) {
          if (error.name === 'AbortError') {
            // Request was cancelled, not an error
            resolve([]);
          } else {
            console.warn('Suggestion fetch failed:', error);
            resolve([]); // Fail silently
          }
        }
      }, this.debounceMs);
    });
  }
}
Trade-offs to consider:
- Memory vs. latency: Larger caches mean faster responses but higher costs
- Freshness vs. stability: More frequent updates can cause suggestion flickering
- Personalization vs. privacy: User history improves relevance but raises privacy concerns
Extensions worth exploring:
- Fuzzy matching for typo tolerance (Levenshtein distance, BK-trees)
- Query understanding with NLP (intent classification)
- A/B testing framework for ranking algorithm tuning
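As a taste of the first extension, here is a bounded Levenshtein check with a row-minimum early exit — a standalone helper for illustration, not wired into the trie above:

```python
def within_distance(a: str, b: str, max_dist: int = 1) -> bool:
    """Classic dynamic-programming edit distance, bailing out as soon as
    no cell in the current row can stay within `max_dist`."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # length gap alone already exceeds the budget
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        if min(curr) > max_dist:
            return False  # every path from here only gets more expensive
        prev = curr
    return prev[-1] <= max_dist

# A transposition costs 2 plain edits, so it needs max_dist=2.
print(within_distance("wieght", "weight", max_dist=2))  # True
```

For typo tolerance at query time, a check like this would run against a candidate set (e.g. siblings of the deepest matched trie node) rather than the whole vocabulary; BK-trees make that candidate pruning systematic.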
A well-designed typeahead system is invisible when it works—users just expect instant, relevant suggestions. The complexity lies in making that simplicity possible at scale.