Design Search Autocomplete (Typeahead) That Feels Instant

You type one letter into a search box and a list appears under your finger before you have decided what the second letter is. That feeling, the suggestions racing ahead of your intent, is the entire product. Lose it and the box becomes a form field you have to finish and submit. Keep it and the box becomes a conversation.

The number that decides which one you have built is 100 milliseconds.

Facebook measured it in 2010 and put it plainly: spending more than 100ms to retrieve a result will cause the typeahead to "stutter," and a stutter reads as broken. The Prefixy team, building an open-source prefix-search service years later, anchored to the identical number. This is not a target you negotiate. It is the constant the whole design gets built backward from, the way a bridge gets designed backward from the load it must carry. Everything that follows, every structure, every cache, every shard, exists to defend that budget on the keystroke that matters least to you and most to the slow path: the 99th-percentile one.

This is a system design problem with a particular shape. It looks like an algorithms question (which data structure holds the suggestions?) and it is really a latency question (where does the 100ms go?), and the senior move is refusing to answer the first without the second. If you want the broader interview scaffolding, the system design interview framework lays out how to drive any of these from requirements to bottlenecks. Here we go deep on one.

Pin the requirements before you pick a structure

The fastest way to design the wrong typeahead is to start drawing tries before you know what the box is for. Three requirements change the answer underneath you, so surface them first.

Does it need to match mid-string, or only prefixes? "pizza" matching "cheese pizza" is infix matching, and it breaks every pure prefix structure on this page. If the product needs it, you are in n-gram indexing territory before you have written a line, and the clean trie answer is dead on arrival. Ask early, because the answer reshapes everything.

Does it need typo tolerance? "Did you mean" inside the suggestions is a structure decision, not a feature you bolt on later. A finite state transducer paired with a Levenshtein automaton gives edit-distance matching as a native capability. A trie or a sorted-n-gram scheme cannot add it without a combinatorial explosion of stored variants. If fuzzy matching is a requirement, it pushes you off the trie before you start.

How personalized is it? Google's autocomplete blends global popularity with your language, your location, and, when you are signed in, your own history. Personalization is the requirement that quietly destroys your nicest optimization, the shared cache, because a result computed for you cannot be reused for me. Decide how much of it you need, because it determines whether one cache tier can serve everyone or whether you need two retrieval paths merged late.

Then size it, because the numbers set the architecture. Take 10 million daily users issuing 10 searches a day. That is roughly 24,000 queries per second on average and call it 48,000 at peak. Each keystroke is potentially its own query, so the read volume is brutal and the write volume, new distinct queries to learn, is a comparative trickle, on the order of 0.4GB of new data a day at 20% novelty. That asymmetry, oceans of reads against a stream of writes, is the single most important fact about the workload, and it is the justification for every read-optimized choice below. If back-of-envelope sizing is rusty, capacity estimation walks the method; the point here is that the read/write ratio is lopsided enough to design around.

Where you pay for top-k is the whole decision

Strip the problem to its core and it is this: for a prefix, return the few best completions, ranked. The only real question is when you compute that ranked list, at write time or at read time, and everything else is a consequence of that choice.

Start with the structure everyone names first, because understanding why it struggles is the actual lesson. A trie is a tree where each node is a character and the path from the root spells a prefix. To find completions for "tr" you walk t then r, and now you are standing at the prefix node. The naive move is to compute the answer here, at read time: visit every descendant of this node to collect all completions, then sort them by popularity and take the top few. Prefixy named the killer precisely. After traversing to the prefix node, "we still have to visit every single one of its children to find all the completions. In the worst-case scenario, this is an O(N) operation." Related Works quantifies the same subtree cost as O(MN).

Here is the part juniors miss: the pain is not evenly distributed. It concentrates on short prefixes. The node under "s" has a colossal subtree; the node under "strawb" has almost none. So the cheapest input for the user, one or two letters, is the most expensive query for you, and it is also the most common. The expensive, contended unit of work in a typeahead is not the long completion. It is the short prefix with enormous fan-out, and most engineering effort on these systems targets short prefixes specifically.

The standard fix flips the timing: precompute top-k at write time and store it at each prefix node. The "tr" node carries a ready list like ["true", "try", "trip", ...] already ranked and capped. Now a read walks to the node and reads the list off it. No subtree scan. Prefixy: store the prefix "along with all of its completions together in the same node," giving constant-bounded reads. This is the design most production systems converge toward, and it is the right default when reads dominate writes the way they do here.

But say "O(1)" out loud in an interview and a good interviewer leans in, because it is the tell of a memorized answer. The read is O(prefix length), because you still walk from the root to the node. It becomes constant only because you cap the maximum prefix length, around 20 characters in Prefixy's real deployment, so the walk has a ceiling. And the speed is bought, not free. You pay in two currencies. Storage blows up, because the completion "strawberry" is duplicated into the stored list of "s", "st", "str", and every prefix along the way. And writes amplify: one query that suddenly trends must climb its entire prefix path and update the top-k list at every ancestor node. That write amplification is the real cost of the read-optimized design, and it is the reason you rebuild in batches instead of updating per event. We will come back to the batch boundary, because it is where freshness is won and lost.

There is a third option, and knowing it is what separates a staff-grade answer from a strong one. Precompute-at-node is not the only way to get a fast read. Wolf Garbe's PruningRadixTrie keeps the trie and computes top-k at read time, but makes that read cheap by storing, at each node, the maximum rank found anywhere in its subtree. During the read you compare that bound against the worst score already in your result set, and the moment a branch cannot possibly beat what you hold, you prune it and stop descending. Garbe's benchmark: roughly 1000 times faster than an ordinary radix trie, turning a tens-of-milliseconds lookup into a sub-millisecond one. You recover most of the precompute win without duplicating completions across every prefix. The senior framing is a tradeoff axis, not a winner: a naive trie is cheap to write and slow to read; precompute-at-node is fast to read and expensive to write; the pruning trie sits in between; and the FST, next, holds its own corner.

The trie is often the wrong production answer

Here is the uncomfortable thing about leading with a trie: real search teams frequently do not ship one. Three structural weaknesses push them elsewhere, and naming them is how you show you have built this rather than read about it.

A trie has no native concept of ranking. Weights are bolted on, which is exactly why the naive read has to scan and sort a subtree. It handles fuzzy matching badly, because edit-distance over a character tree means exploring a blizzard of near-miss paths. And it is memory-hungry, with a node per character and heavy pointer overhead. Related Works lays the comparison out directly, and the structure that wins on production systems is the finite state transducer.

An FST is, in one line, a Map<String, Number>: a compressed automaton that stores each suggestion with its weight and shares common suffixes so that "hotel" and "motel" and "munchen" reuse the same tail states instead of duplicating them. That sharing is where the memory win comes from. Lucene and Elasticsearch use a weighted FST for their completion suggester, and the numbers are concrete. Benchmarked on 2.1 million Wikipedia titles, an FST completion lookup runs about 0.72ms, fuzzy matching at edit-distance one about 1.14ms, and the older prefix-query approach it replaced sat at 4 to 6ms. Native ranking falls out of the structure: because weights live on the transitions, a weighted depth-first traversal finds the top suggestions and terminates early, the same early-termination idea as the pruning trie, expressed in automaton form. And fuzzy matching is native through a Levenshtein automaton, the capability a trie cannot add without exploding.

The tradeoff is real, not free. FSTs are immutable once built, so you do not mutate them; you rebuild. Elastic's answer to "how do you stay fresh, then?" is to build an FST per index segment at write time, so new data lands in new segments and a small query-time merge across segments gives near-real-time updates without rebuilding one global structure. And richer suggestions cost memory directly: Elastic measured that attaching small JSON payloads to suggestions doubled FST memory, from 82MB to 166MB. "Just attach metadata to each suggestion" is not free, and a staff engineer prices it.

Etsy went a different way entirely, which is worth holding onto as proof there is no single right answer. It used neither a trie nor an FST but binary search over a sorted list of edge-n-grammed prefixes, where "car" is indexed as "c", "ca", and "car", and each prefix stores only its top suggestions to enable early termination. Plain, debuggable, and shaped to its workload. The structure follows the requirements, not fashion. How a senior decides: prefixes only with hard latency and modest memory, reach for precompute-at-node or a pruning trie; fuzzy matching or tight memory required, reach for an FST; an existing Lucene stack, use its completion suggester and stop reinventing it.

Ranking is more than a frequency count

"Rank by popularity" is the toy version, and Google says so without meaning to. Its autocomplete starts from common and trending queries, but the override is the interesting part: "if our automated systems detect there's rising interest in a topic, they might show a trending prediction even if it isn't typically the most common." Freshness can beat raw popularity. A query spiking this hour should surface above an evergreen query with a higher all-time count, because the user's intent right now is shaped by what is happening right now.

So the production score is a blend: frequency, plus recency and velocity, plus geography, plus language, plus personalization. Each signal has architectural weight. Frequency comes from the batch pipeline. Velocity comes from a separate real-time stream, deliberately kept off the main structure. Geo and language let you shard regionally and serve from a closer edge. And personalization is the signal that breaks the shared cache, which is why it gets its own path. The shallow answer stops at frequency. The senior answer treats ranking as a late-merge of signals that each live in a different part of the system, because each refreshes on a different clock.

The offline pipeline is what makes the read cheap

The reason the online read can be a single list lookup is that an offline pipeline did the expensive thinking in advance, away from the hot path. The shape, verified against Alex Xu's treatment, is a conveyor belt. Append-only analytics logs capture raw queries cheaply, unindexed. Aggregators compress those logs into query-to-frequency counts. Workers consume the aggregated counts on a schedule and build the structure, the trie or FST. A structure store, often a document or key-value store where every prefix maps to a key, holds the built result durably. And a distributed in-memory cache serves it, taking periodic snapshots.

The load-bearing detail is the boundary in the middle. Frequencies update on a batch cadence, daily or weekly, not per keystroke, and that is deliberate, not lazy. The read path is serving tens of thousands of lookups a second and cannot tolerate write contention from a live counter on every node. So writes are drained off into an offline lane and applied in bulk during a rebuild. This is the same instinct behind decoupling producers from consumers with a log; Kafka vs queues covers when a replayable log earns its place over a work queue, and a query-log pipeline that you may reprocess to retune ranking is squarely log-shaped. Sampling makes the lane cheaper still: log one query in N, because the popularity distribution is heavy-tailed, so a sample is statistically faithful for top-k. You do not need every event to know "facebook" beats a typo of it.

Two correctness details earn their place here. When a prefix already holds its cap of K completions and a new one arrives, naively dropping it loses every emerging term forever; the system can never learn a new word. Prefixy's fix is exact: evict the lowest-scored completion, then insert the newcomer with the evicted score plus one, so the new entrant is not instantly re-evicted by the next arrival, and churn at the bottom of the list settles. And the cap itself is tuned, not guessed: store about K=50 per prefix even though you display 5 or 10, because you need enough depth to rank accurately as counts shift, and no more, since every extra slot multiplies across millions of prefixes.

Spend the 100ms before you optimize anything

Now the budget, because this is the diagram the whole design serves. Lay 100ms on a bar and walk a keystroke across it: the user presses a key, a debounce timer waits, a request crosses the network, a load balancer routes it, a cache answers (or misses and computes), the response serializes, and the browser renders. The instinct is to pour effort into the cache lookup. The budget says you are looking in the wrong place, and that redirection is the senior insight.

Price the server terms against real numbers. A memory reference is about 100 nanoseconds; a same-datacenter round trip is about 500 microseconds, half a millisecond, from the canonical latency table every engineer should keep in their head. An FST completion lookup is 0.72ms. So the in-datacenter work, a couple of hops plus a sub-millisecond structure read, totals a few milliseconds at most. That is not where 100ms goes.

It goes to two places, and they are the two the data structure cannot touch. The first is debounce, the deliberate wait after the last keystroke before firing a request, and it is the largest single term you control. Tuned well it lands around 150 to 250ms of user-perceived settle time. Tuned badly, say a flat 300ms, it blows the entire perceived budget on its own before a single byte leaves the laptop. The second is the wide-area round trip, the user-to-datacenter network, which dwarfs the intra-datacenter 0.5ms. The math is blunt: shave a millisecond off your FST and the user feels nothing, because a 60ms WAN hop and a 200ms debounce sit on top of it. Optimize those two and you have moved the number that matters. This is why latency and the tail insists you decompose a budget before tuning any single stage. The structure is necessary and it is not where the budget lives.

Which reframes the client as half the system. Three techniques carry it. Use AJAX so suggestions update without a page reload, the price of admission. Cache repeated prefixes in the browser, so backspacing from "stra" to "str" hits a list you already fetched and the WAN term goes to zero. And debounce keystrokes so "react" fires one request after you pause, not five as you type, which simultaneously cuts your backend QPS and stops a slow earlier response from flickering in over a newer one, paired with request cancellation so an in-flight reply for "rea" can never overwrite the one for "react." Meta pushed the cache idea furthest: the moment the input gains focus, before a single keystroke, it fires off a request for the user's own friends, pages, and groups and loads them into the browser cache, so the most common personalized hits are answered with zero backend round trips. The fastest request is the one you already have the answer to.

Cache tiers, hot shards, and the personalized path

The serving side is a stack of caches, each catching what the one below would have had to compute. Browser cache first, then an edge or CDN tier, then the load balancer, then an in-memory structure cache, with the durable structure store as the cold fallback. The win compounds because typeahead traffic is heavy-tailed: the same short, popular prefixes get hammered, so a modest hot-prefix cache absorbs a wildly disproportionate share of reads. The mechanics of sizing, evicting, and replicating that tier are their own subject; the distributed cache covers eviction policy and replication, and a typeahead cache is a textbook read-through with a hot-key problem.

That hot-key problem is also a sharding trap. "Shard by first letter" sounds clean and guarantees pain, because traffic by initial letter is wildly skewed: "s" dwarfs "z", and the short-prefix hot path lands hardest on whichever shard owns the popular letters. You get a melting shard next to idle ones. The fix is to partition by prefix range and rebalance so a busy region like "s" gets its own shard while quiet neighbors like "u" through "z" share one, and to keep the hottest short prefixes in a dedicated cache tier in front of the shards so they barely touch the partitions at all. When you outgrow static ranges, consistent hashing is the standard way to spread keys and move the fewest of them when the fleet changes. The same skew that makes short prefixes expensive to compute makes them dangerous to place; both pressures point at the short prefix.

Personalization gets its own lane because it cannot share the global one. A cache keyed on the prefix alone assumes everyone typing "ma" wants the same answer, and personalization violates that by construction, so you cannot precompute one cached entry that is correct for every user. Meta's pattern is the clean resolution: split a global retrieval path that is cacheable and shared, served from a query cache, from a personalized path scoped to the individual, then merge and rank the two late in the request. The global half stays fast and shared; the personalized half is small, bounded to the user's own neighborhood, and often pre-warmed into the browser on focus. Two paths, merged at the end, instead of one path that is wrong for everyone.

The tail is the product

One last reframing, and it is the one most candidates skip. You do not optimize typeahead for the average keystroke. You optimize it for the slow one, because that is the one the user feels.

A median of 20ms with a p99 of 400ms is not a fast system with a small blemish. Across a query, a user fires dozens of keystrokes, so a one-in-a-hundred stall surfaces several times per search, and the stutters are what they remember. DoorDash lived this exactly: the alarm that started their search-latency investigation was a p99.9 monitor, not the average. The root cause was not the algorithm at all; it was garbage-collection pauses in the JVM. Switching collectors, Shenandoah to G1GC, plus a Lucene upgrade, cut p99.9 latency by roughly half and hardware cost by about three-quarters. The structure was never the problem. The runtime was.

Internalize the two implications, because they are what staff-grade looks like. Your SLO is a tail percentile, p99 or p99.9, not a mean, because the mean hides exactly the keystrokes that break the feeling. And the tail is often not an algorithm problem at all; it is GC pauses, cold caches after a deploy, connection setup on a fresh socket. You can pick the perfect structure and still ship a stuttering typeahead if a stop-the-world pause lands on the 99th-percentile keystroke. The data structure gets you a fast median. Defending the tail is a different and harder job, and it is the one that decides whether the box feels instant.

Where this sits among the canon

Typeahead is the friendliest doorway into a family of patterns. The read-vs-write placement of precomputed results is the same instinct as fanning out a timeline on write versus assembling it on read in Design Twitter. The heavy-tailed traffic, hot keys, and cache-tier reasoning recur in Design Instagram. The offline-pipeline-feeds-online-read split shows up the moment you separate slow analytics from fast serving, including a sibling in this batch, the notification system. And the discipline of designing backward from a hard SLO is the through-line from the rate limiter to idempotency and the exactly-once lie to here. If you want a small, self-contained build to practice the read-heavy, cache-forward shape first, the URL shortener is the cleanest one.

I have felt the difference these choices make in real products. Powering live location updates in NomadCrew taught me that "feels instant" is a tail-latency promise you keep on the worst frame, not the average one, which is exactly the typeahead lesson wearing different clothes; building search-shaped retrieval into Audex and IntelliFill drove home that the structure is the easy half and the budget is the hard half. The box that races ahead of your intent is not magic. It is a precomputed list, a defended budget, and a stubborn refusal to let the slow keystroke through.

FAQ

Is caching the top-k completions at each trie node really O(1)?

No, and calling it O(1) is the tell of a memorized answer. A read walks from the root to the prefix node, so it is O(prefix length). It becomes a constant only because you cap the maximum prefix length, typically around 20 characters, so the walk can never exceed that bound. The honest phrasing is constant-bounded, not O(1). The cost you trade for that fast read is storage blow-up, since the same completion is duplicated into the top-k list of every prefix that leads to it, plus write amplification, since one newly trending query must update the top-k list of every ancestor prefix along its path.

Why is a plain trie often the wrong production answer?

A naive trie has no native ranking, so to return the best suggestions for a prefix you walk the entire subtree below the prefix node and sort it, which is an O(subtree) scan that explodes on short popular prefixes. It is also memory-heavy. Production systems frequently reach for a finite state transducer instead, which stores weights natively, supports fuzzy matching through a Levenshtein automaton, and uses an order of magnitude less memory than a trie. Etsy did not ship a trie at all; it used binary search over a sorted list of edge-n-grammed prefixes. The trie is a useful starting point in an interview, not the destination.

Why optimize p99 latency instead of the average for typeahead?

Because users feel the slow keystrokes, not the average one. A median of 20ms with a p99 of 400ms means one in a hundred keystrokes stutters, and across a session of dozens of keystrokes per query that slow tail is what the user remembers. DoorDash discovered this in production when their search latency alarm fired on a p99.9 monitor, not the average; the root cause was garbage-collection pauses, and switching collectors cut p99.9 latency by roughly half. The lesson is that tail latency is the product, and it is often a runtime problem rather than an algorithm problem.

How do you keep suggestion frequencies fresh without rebuilding constantly?

You separate the slow path from the fast path. Popularity counts update on a batch schedule, often daily or weekly, through an offline pipeline that aggregates sampled query logs into frequencies and rebuilds the served structure, because the read path cannot afford write contention on every keystroke. Real-time trending is handled as a separate, bounded stream that is merged late, not pushed into the main structure per event. Lucene and Elasticsearch take a middle road by building a finite state transducer per index segment, so new segments add fresh suggestions and a tiny query-time merge gives near-real-time updates without a global rebuild.

Why does personalization break the shared autocomplete cache?

Because a cache entry keyed only on the prefix assumes everyone typing that prefix should see the same suggestions, and personalized results violate that by definition. You cannot precompute one cached answer that is correct for every user. The production pattern, which Meta used, is to split retrieval into a global path that is cacheable and shared, and a personalized path scoped to the individual user, then merge and rank the two late in the request. A nice trick for the personalized half is to bootstrap the user's own entities into the browser cache the moment the input gains focus, which removes a backend round trip for the most common personalized hits.