Design a Typeahead Suggestion System: Autocomplete
Key Insights
- Tries (prefix trees) are the foundational data structure for typeahead systems, but production implementations require compression and careful metadata storage to balance memory usage with query speed.
- Latency is the critical metric—users expect suggestions within 100ms, which means aggressive caching at multiple layers and geographic distribution are non-negotiable at scale.
- The ranking algorithm matters more than the data structure; a fast but poorly-ranked suggestion system will frustrate users more than a slightly slower but relevant one.
Introduction & Problem Definition
Typeahead suggestion systems are everywhere. When you start typing in Google Search, your IDE, or an e-commerce search bar, you expect instant, relevant suggestions. These systems seem simple on the surface—match prefixes and return results—but building one that handles millions of queries per second while maintaining sub-100ms latency is a genuine engineering challenge.
The core requirements are straightforward: given a prefix like “how to”, return the top 5-10 most relevant completions. But “relevant” is where things get complicated. Should you prioritize global popularity? User history? Trending topics? Geographic context? The answer is usually “all of the above,” weighted appropriately.
At scale, you’re looking at constraints like:
- Latency: Under 100ms end-to-end, ideally under 50ms
- Throughput: Potentially millions of queries per second
- Freshness: New trending terms should appear within minutes
- Quality: No offensive suggestions, high relevance
High-Level Architecture Overview
Let’s establish the component architecture before diving into implementation details.
┌─────────────┐ ┌─────────────┐ ┌──────────────────┐
│ Client │────▶│ CDN/Edge │────▶│ API Gateway │
│ (Browser) │ │ Cache │ │ (Rate Limiting) │
└─────────────┘ └─────────────┘ └────────┬─────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Suggestion Service │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │ │
│ │ │ Trie Store │ │ Ranker │ │ Cache │ │ │
│ │ └─────────────┘ └─────────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Data Pipeline │ │
│ │ (Log Aggregation, Frequency Updates, Filtering) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The API contract is minimal:
# OpenAPI specification
paths:
/suggestions:
get:
parameters:
- name: q
in: query
required: true
schema:
type: string
description: The prefix to match
- name: limit
in: query
schema:
type: integer
default: 10
maximum: 20
- name: context
in: query
schema:
type: string
description: Optional context (e.g., "shopping", "images")
responses:
        '200':
          description: Ranked suggestions for the matched prefix
content:
application/json:
schema:
type: object
properties:
suggestions:
type: array
items:
type: object
properties:
text: { type: string }
score: { type: number }
Data Structures: Trie & Optimizations
The trie (prefix tree) is the natural choice for typeahead. Each path from the root to a node represents a prefix, and finding all completions for a prefix costs O(p + s), where p is the prefix length and s is the size of the subtree under the prefix node (the DFS visits every node in that subtree, not just the terminal ones).
Here’s a practical implementation with frequency tracking:
from typing import Optional
from dataclasses import dataclass, field
from heapq import nlargest
@dataclass
class TrieNode:
children: dict = field(default_factory=dict)
is_terminal: bool = False
frequency: int = 0
term: Optional[str] = None # Store full term at terminal nodes
class SuggestionTrie:
def __init__(self):
self.root = TrieNode()
def insert(self, term: str, frequency: int = 1) -> None:
node = self.root
for char in term.lower():
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_terminal = True
node.frequency += frequency
node.term = term
def _find_prefix_node(self, prefix: str) -> Optional[TrieNode]:
node = self.root
for char in prefix.lower():
if char not in node.children:
return None
node = node.children[char]
return node
def _collect_terms(self, node: TrieNode, results: list) -> None:
"""DFS to collect all terminal nodes under this subtree."""
if node.is_terminal:
results.append((node.term, node.frequency))
for child in node.children.values():
self._collect_terms(child, results)
def search(self, prefix: str, limit: int = 10) -> list[tuple[str, int]]:
node = self._find_prefix_node(prefix)
if not node:
return []
results = []
self._collect_terms(node, results)
# Return top-k by frequency
return nlargest(limit, results, key=lambda x: x[1])
# Usage
trie = SuggestionTrie()
trie.insert("how to cook rice", frequency=50000)
trie.insert("how to tie a tie", frequency=45000)
trie.insert("how to lose weight", frequency=80000)
print(trie.search("how to"))
# [('how to lose weight', 80000), ('how to cook rice', 50000), ('how to tie a tie', 45000)]
For production systems, consider a radix tree (compressed trie) where chains of single-child nodes are collapsed. This reduces memory usage significantly—often by 50-80% for natural language data.
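To make the compression concrete, here is a minimal sketch (stripped-down node types, not the full SuggestionTrie above) that collapses single-child chains of a plain trie into labeled edges and counts nodes before and after:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # char -> TrieNode
    is_terminal: bool = False

@dataclass
class RadixNode:
    children: dict = field(default_factory=dict)  # edge label (str) -> RadixNode
    is_terminal: bool = False

def insert(root: TrieNode, term: str) -> None:
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True

def compress(node: TrieNode) -> RadixNode:
    """Collapse chains of single-child, non-terminal nodes into one edge."""
    radix = RadixNode(is_terminal=node.is_terminal)
    for ch, child in node.children.items():
        label = ch
        while len(child.children) == 1 and not child.is_terminal:
            (next_ch, next_child), = child.children.items()
            label += next_ch
            child = next_child
        radix.children[label] = compress(child)
    return radix

def count_nodes(node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = TrieNode()
for term in ["how to cook rice", "how to tie a tie", "how to lose weight"]:
    insert(root, term)
radix = compress(root)
print(count_nodes(root), count_nodes(radix))  # → 37 5
```

The shared prefix "how to " becomes a single edge, and each of the three completions collapses to one leaf, which is where the large memory savings on natural-language data come from.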
Ranking & Relevance Algorithms
Raw frequency isn’t enough. A good ranking function combines multiple signals:
import math
from dataclasses import dataclass
from time import time
@dataclass
class SuggestionCandidate:
term: str
global_frequency: int
user_frequency: int # How often this user searched this
last_searched: float # Unix timestamp
category_match: bool # Does it match current context?
class SuggestionRanker:
def __init__(
self,
global_weight: float = 0.4,
personal_weight: float = 0.3,
recency_weight: float = 0.2,
context_weight: float = 0.1,
recency_halflife_days: float = 7.0
):
self.global_weight = global_weight
self.personal_weight = personal_weight
self.recency_weight = recency_weight
self.context_weight = context_weight
self.recency_halflife = recency_halflife_days * 86400 # Convert to seconds
def _normalize_frequency(self, freq: int, max_freq: int) -> float:
"""Log-normalize frequency to prevent popular terms from dominating."""
if freq == 0 or max_freq == 0:
return 0.0
return math.log(1 + freq) / math.log(1 + max_freq)
def _recency_decay(self, timestamp: float) -> float:
"""Exponential decay based on time since last search."""
if timestamp == 0:
return 0.0
age = time() - timestamp
return math.exp(-age / self.recency_halflife)
def score(
self,
candidate: SuggestionCandidate,
max_global_freq: int,
max_user_freq: int
) -> float:
global_score = self._normalize_frequency(
candidate.global_frequency, max_global_freq
)
personal_score = self._normalize_frequency(
candidate.user_frequency, max_user_freq
)
recency_score = self._recency_decay(candidate.last_searched)
context_score = 1.0 if candidate.category_match else 0.0
return (
self.global_weight * global_score +
self.personal_weight * personal_score +
self.recency_weight * recency_score +
self.context_weight * context_score
)
def rank(
self,
candidates: list[SuggestionCandidate],
limit: int = 10
) -> list[tuple[str, float]]:
if not candidates:
return []
max_global = max(c.global_frequency for c in candidates)
max_user = max(c.user_frequency for c in candidates)
scored = [
(c.term, self.score(c, max_global, max_user))
for c in candidates
]
scored.sort(key=lambda x: x[1], reverse=True)
return scored[:limit]
The weights are tunable per product. An e-commerce site might weight context heavily, while a general search engine might favor global popularity.
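To see what the weights actually do, here is the blend from SuggestionRanker.score reduced to plain arithmetic, with made-up component scores for illustration:

```python
# Default weights from SuggestionRanker above.
W_GLOBAL, W_PERSONAL, W_RECENCY, W_CONTEXT = 0.4, 0.3, 0.2, 0.1

def blend(global_s: float, personal_s: float,
          recency_s: float, context_s: float) -> float:
    return (W_GLOBAL * global_s + W_PERSONAL * personal_s
            + W_RECENCY * recency_s + W_CONTEXT * context_s)

# A globally dominant term the user has never searched...
popular = blend(1.0, 0.0, 0.0, 0.0)
# ...is outranked by a moderately popular term the user searched
# recently, in the current context.
personal = blend(0.5, 1.0, 0.9, 1.0)
print(popular, personal)  # 0.4 vs ~0.78
```

Even with global popularity weighted highest, a term that scores on three signals beats a term that maxes out only one, which is usually the behavior you want.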
Scaling & Performance Optimization
At scale, you can’t query a single trie. Sharding strategies include:
- Prefix-based sharding: Terms starting with ‘a-f’ go to shard 1, ‘g-m’ to shard 2, etc. Simple but can create hotspots.
- Consistent hashing: Better distribution but requires querying multiple shards for short prefixes.
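A prefix-based router is only a few lines. The boundaries below are illustrative; a real deployment sizes the ranges by measured traffic, since first letters are far from uniformly distributed in English:

```python
import bisect

# Hypothetical shard boundaries: shard 0 owns a-f, shard 1 g-l,
# shard 2 m-r, shard 3 s-z. Tune these to balance observed load.
SHARD_BOUNDARIES = ["g", "m", "s"]

def shard_for_prefix(prefix: str) -> int:
    """Route a query prefix to the shard owning its first character."""
    first = prefix.lower()[:1] or "a"  # empty prefix defaults to shard 0
    return bisect.bisect_right(SHARD_BOUNDARIES, first)

print(shard_for_prefix("how to"))   # 'h' falls in g-l → shard 1
print(shard_for_prefix("weather"))  # 'w' falls in s-z → shard 3
```

The routing stays a pure function of the prefix, so any frontend can compute the target shard without a lookup service; the cost is the rebalancing work when boundaries change.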
Caching is where you win the latency battle:
import redis
import json
from typing import Optional
class SuggestionCache:
def __init__(
self,
redis_client: redis.Redis,
default_ttl: int = 300, # 5 minutes
popular_ttl: int = 3600 # 1 hour for popular prefixes
):
self.redis = redis_client
self.default_ttl = default_ttl
self.popular_ttl = popular_ttl
self.popular_prefixes = self._load_popular_prefixes()
def _load_popular_prefixes(self) -> set:
"""Load precomputed popular prefixes (top 10k or so)."""
# In practice, load from config or precompute
return {"how to", "what is", "where", "why", "best"}
def _cache_key(self, prefix: str, context: Optional[str] = None) -> str:
base = f"suggest:{prefix.lower()}"
return f"{base}:{context}" if context else base
def get(
self,
prefix: str,
context: Optional[str] = None
) -> Optional[list[dict]]:
key = self._cache_key(prefix, context)
cached = self.redis.get(key)
if cached:
return json.loads(cached)
return None
def set(
self,
prefix: str,
suggestions: list[dict],
context: Optional[str] = None
) -> None:
key = self._cache_key(prefix, context)
ttl = (
self.popular_ttl
if prefix.lower() in self.popular_prefixes
else self.default_ttl
)
self.redis.setex(key, ttl, json.dumps(suggestions))
def warm_cache(self, popular_suggestions: dict[str, list[dict]]) -> None:
"""Precompute and cache top suggestions for popular prefixes."""
pipe = self.redis.pipeline()
for prefix, suggestions in popular_suggestions.items():
key = self._cache_key(prefix)
pipe.setex(key, self.popular_ttl, json.dumps(suggestions))
pipe.execute()
For the top 1000 prefixes, precompute suggestions and cache them at the CDN edge. This handles a disproportionate amount of traffic.
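The service-level cache sits in front of the trie as a read-through layer. A minimal sketch with stand-in classes (DictCache and CountingBackend replace Redis and the real trie, purely for illustration):

```python
# Read-through: serve from cache when possible, otherwise query the
# suggestion backend and populate the cache on the way out.
def get_suggestions(prefix: str, cache, backend, limit: int = 10):
    hit = cache.get(prefix)
    if hit is not None:
        return hit
    results = backend.search(prefix, limit)
    cache.set(prefix, results)
    return results

class DictCache:
    """Stand-in for SuggestionCache (no TTLs, plain dict)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value

class CountingBackend:
    """Stand-in for SuggestionTrie that counts how often it is hit."""
    def __init__(self):
        self.calls = 0
    def search(self, prefix, limit):
        self.calls += 1
        return [("how to lose weight", 80000)]

cache, backend = DictCache(), CountingBackend()
get_suggestions("how to", cache, backend)
get_suggestions("how to", cache, backend)
print(backend.calls)  # → 1: the second lookup never reaches the backend
```

With real traffic, prefix queries are heavily skewed toward a small set of popular prefixes, so even a modest cache absorbs most of the backend load.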
Data Pipeline & Updates
Suggestions need to stay fresh. Here’s a simplified streaming aggregation approach:
from collections import defaultdict
from dataclasses import dataclass
from time import time
import threading
@dataclass
class SearchEvent:
term: str
timestamp: float
user_id: str
class FrequencyAggregator:
def __init__(self, flush_interval: int = 60, min_count: int = 5):
self.counts = defaultdict(int)
        self.flush_interval = flush_interval  # advisory: the caller schedules flush() at this cadence
self.min_count = min_count
self.lock = threading.Lock()
self.blocked_terms = self._load_blocklist()
def _load_blocklist(self) -> set:
"""Load offensive/inappropriate terms to filter."""
# In practice, load from a maintained blocklist
return set()
def _is_valid_term(self, term: str) -> bool:
term_lower = term.lower().strip()
if len(term_lower) < 2 or len(term_lower) > 100:
return False
if term_lower in self.blocked_terms:
return False
return True
def process_event(self, event: SearchEvent) -> None:
if not self._is_valid_term(event.term):
return
with self.lock:
self.counts[event.term.lower()] += 1
def flush(self) -> dict[str, int]:
"""Return counts and reset. Called periodically."""
with self.lock:
# Filter out low-count terms (likely typos or spam)
results = {
term: count
for term, count in self.counts.items()
if count >= self.min_count
}
self.counts.clear()
return results
In production, you’d use Kafka or Kinesis for event streaming and Flink or Spark Streaming for aggregation.
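Between the aggregator and the serving trie sits a scheduler that periodically drains the counts. PeriodicFlusher below is a hypothetical single-process sketch of that glue (in production this role belongs to a Flink window or a batch job, as noted above):

```python
import threading

class PeriodicFlusher:
    """Every `interval` seconds, drain counts from `flush_fn` (e.g. the
    FrequencyAggregator.flush above) and apply each one via `apply_fn`
    (e.g. SuggestionTrie.insert)."""
    def __init__(self, flush_fn, apply_fn, interval: float = 60.0):
        self.flush_fn = flush_fn
        self.apply_fn = apply_fn
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep.
        while not self._stop.wait(self.interval):
            for term, count in self.flush_fn().items():
                self.apply_fn(term, count)

# Demo with a tiny interval (production would use ~60s).
applied = []
flusher = PeriodicFlusher(lambda: {"how to": 3},
                          lambda t, c: applied.append((t, c)),
                          interval=0.01)
flusher.start()
threading.Event().wait(0.1)  # let several flushes fire
flusher.stop()
```

The daemon thread and Event-based wait keep shutdown prompt: stop() wakes the loop immediately instead of waiting out a full interval.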
Client-Side Considerations & Wrap-Up
The client side matters as much as the backend. Debouncing prevents flooding your servers:
class TypeaheadClient {
  constructor(endpoint, debounceMs = 150) {
    this.endpoint = endpoint;
    this.debounceMs = debounceMs;
    this.controller = null;
    this.timeoutId = null;
    this.pendingResolve = null; // resolver for a debounce-cancelled call
  }

  async getSuggestions(prefix, context = null) {
    // Cancel any in-flight request
    if (this.controller) {
      this.controller.abort();
    }
    // Cancel any pending debounce timer, settling its promise so earlier
    // callers are never left awaiting forever
    if (this.timeoutId) {
      clearTimeout(this.timeoutId);
      if (this.pendingResolve) {
        this.pendingResolve([]);
        this.pendingResolve = null;
      }
    }
    // Don't query for very short prefixes
    if (prefix.length < 2) {
      return [];
    }
    return new Promise((resolve) => {
      this.pendingResolve = resolve;
      this.timeoutId = setTimeout(async () => {
        this.pendingResolve = null;
        this.controller = new AbortController();
        try {
          const params = new URLSearchParams({ q: prefix });
          if (context) params.append('context', context);
          const response = await fetch(
            `${this.endpoint}?${params}`,
            { signal: this.controller.signal }
          );
          if (!response.ok) {
            // Graceful degradation: return empty on error
            resolve([]);
            return;
          }
          const data = await response.json();
          resolve(data.suggestions || []);
        } catch (error) {
          if (error.name === 'AbortError') {
            // Request was cancelled, not an error
            resolve([]);
          } else {
            console.warn('Suggestion fetch failed:', error);
            resolve([]); // Fail silently
          }
        }
      }, this.debounceMs);
    });
  }
}
Trade-offs to consider:
- Memory vs. latency: Larger caches mean faster responses but higher costs
- Freshness vs. stability: More frequent updates can cause suggestion flickering
- Personalization vs. privacy: User history improves relevance but raises privacy concerns
Extensions worth exploring:
- Fuzzy matching for typo tolerance (Levenshtein distance, BK-trees)
- Query understanding with NLP (intent classification)
- A/B testing framework for ranking algorithm tuning
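As a taste of the first extension, here is a bounded Levenshtein check with a row-minimum early exit — a standalone helper for illustration, not wired into the trie above:

```python
def within_distance(a: str, b: str, max_dist: int = 1) -> bool:
    """Classic dynamic-programming edit distance, bailing out as soon as
    no cell in the current row can stay within `max_dist`."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # length gap alone already exceeds the budget
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        if min(curr) > max_dist:
            return False  # every path from here only gets more expensive
        prev = curr
    return prev[-1] <= max_dist

# A transposition costs 2 plain edits, so it needs max_dist=2.
print(within_distance("wieght", "weight", max_dist=2))  # True
```

For typo tolerance at query time, a check like this would run against a candidate set (e.g. siblings of the deepest matched trie node) rather than the whole vocabulary; BK-trees make that candidate pruning systematic.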
A well-designed typeahead system is invisible when it works—users just expect instant, relevant suggestions. The complexity lies in making that simplicity possible at scale.