I once wrote a smaller version of this and learned exactly where the walls are.
NomadCrew is a real-time group-travel backend I built in Go. The chat, presence, and live-location features all ride one WebSocket hub, and the hub is honest about its job: each connection runs three goroutines (read, write, ping) coordinated through context cancellation, a sync.RWMutex guards the map of active connections, and outbound events go through a buffered channel of 256 that drops events on overflow instead of blocking the hub. Messages persist and replay on reconnection. It works. People plan trips on it.
Then you ask one question that the whole design quietly assumed away, and the floor disappears: what happens when the recipient is offline, and on a different box?
Inside one process, that RWMutex-guarded map is the entire routing table. You look up the recipient, you have their connection, you write to it. Done. The trouble is that one process holds a few hundred thousand sockets at most before the operating system and the runtime start fighting you, and a planet has billions of people. The moment you need a second box, the recipient you want is no longer in your map. WhatsApp's entire architecture is the answer to that one sentence, scaled until it covered roughly half the planet on a few hundred machines. This is a Design X interview where the obvious answer ("use WebSockets, they're real-time") is the first sentence and everything that earns the offer comes after it. So let me retrace it the way I'd retrace my own hub if someone told me it now had to serve two billion users. At each step the pattern is the same. NomadCrew does X. At WhatsApp's scale that exact line forces Y.
The connection is the product
Start with the thing that is actually expensive, because it is not the thing beginners point at. Messaging volume sounds enormous and is, in fact, easy. WhatsApp moved around 50 billion messages a day, and on one record day pushed 18 billion (about 7 billion in, 11 billion out, the gap being group fan-out). Spread over a day, that is a few hundred thousand a second, and a single well-tuned machine can route it. Raw message throughput is not where this system lives or dies.
What is genuinely hard is that every online device holds one persistent connection open for its entire session, and almost all of those connections are doing nothing at any given instant. You, right now, probably have WhatsApp connected and silent. The connection server's job is to own tens of millions of mostly-idle sockets cheaply, and "cheaply" is the load-bearing word. A socket that sends one message an hour still costs a file descriptor, kernel buffer, a scheduler slot, and memory, all day. Multiply by tens of millions and the cost of doing nothing becomes the whole problem.
The transport was a customized version of XMPP, the Extensible Messaging and Presence Protocol↗ standardized in RFC 6120, running over a long-lived TCP socket with TLS, trimmed to cut handshake bytes on bad mobile networks. A common confusion treats XMPP and WebSocket as competing choices, but they sit on different layers: XMPP is the grammar of what you say, the socket is the pipe you say it through. My NomadCrew hub runs a thinner grammar, a JSON event envelope, over the same kind of pipe.
A persistent connection also needs a heartbeat, which is why NomadCrew's third goroutine per connection is ping. Mobile networks and NAT timeouts drop idle TCP connections silently; without a periodic ping-pong both ends believe they are connected to a corpse, and the server keeps trying to deliver to a socket that is already gone.
Where is the recipient? The registry is the real data structure
When NomadCrew routes a message, the recipient's connection is a value in a map I already hold. Split across hundreds of boxes, that connection lives in some other process's map, on some other machine, and your local map cannot see it. You need a shared answer to one question asked millions of times a second: which server currently holds the socket for user B?
That mapping, user to server, is the connection registry, and it is the actual core data structure of a planet-scale chat system. Not the message store. The registry. When A sends to B, A's chat server consults the registry, finds B is held by chat server 47, and forwards the message there over the internal network, where server 47 writes it down B's live socket. The full path: client A, sticky load balancer, chat server A, registry lookup for B, chat server B, client B.
Two properties make the registry hard, and both are about change. First, it churns constantly. Every reconnect, every network flap, every switch between cellular and wifi moves a user to a possibly-different box, and a stale entry misroutes the message to a server that no longer holds that socket. You keep it honest with presence heartbeats and TTL expiry: an entry not refreshed within its window is presumed dead and evicted. This is exactly what a fast shared store is for, and it is why NomadCrew leans on Redis (Upstash) for real-time event distribution between service boundaries. Redis presence keys with short TTLs are how you stop the registry from lying, with the tradeoffs of that shared cache layer covered in the distributed cache.
Second, the registry is why sticky load balancing earns its keep. People call it a hack; it is load-bearing. Once a device's socket lands on server A, the in-memory session and the registry entry both point at A. If the next packet from that device were free to land on server B, B would have no session for it and would have to either proxy back to A or rebuild state from scratch. Session affinity keeps the device pinned to A for the life of the connection so the cheap, correct path stays valid. This is the same instinct as routing a partition's traffic through a consistent owner, the logic behind consistent hashing: you want a stable mapping from key to node so the in-memory state has a home.
The Erlang lesson, told correctly
Now the famous part, which almost everyone quotes and most people quote wrong. The shallow version is "WhatsApp ran two million connections on one server, Erlang is fast." Rick Reed did demonstrate two million concurrent connections on a single box at Erlang Factory in 2012, with a peak benchmark around 2.8 million pushing 571,000 packets a second. The first real bottleneck only surfaced at 425,000 connections, up from a baseline near 200 before tuning. Those numbers are real and staggering.
The staff-grade twist is that production later ran closer to one million connections per server on purpose. By the 2014 follow-up, the fleet was around 550 machines holding 147 million concurrent connections, taking 230,000 logins a second, with BEAM, the Erlang virtual machine↗ processing 70 million messages a second across roughly 11,000 cores. They could have packed in twice as many per box and deliberately did not. Two million connections on one machine means that when it dies, and machines die, two million clients reconnect at once and slam whatever box they land on next. Halving the density halves that blast radius. This is the reasoning a senior engineer uses everywhere. The question is rarely "how dense can I go." It is "what happens when I lose one," and the right number is the one that keeps a single failure boring.
And the reason any of this is even thinkable is the runtime. Erlang's BEAM gives you cheap, isolated, lightweight processes, one per connection, each with its own private mailbox, all watched over by supervision trees that restart what crashes. My NomadCrew per-connection model is the same idea in different clothes: instead of one BEAM process per socket I run three goroutines per socket, and instead of supervision trees I use context cancellation to tear all three down together when one fails. The shapes rhyme because the problem is identical: you want each connection independently schedulable and independently killable, so one slow or dead client cannot stall the others.
The deepest part of the story is the part nobody puts on a slide. The 2012 fight was not C10K, the old problem of ten thousand concurrent connections. It was C10M, ten million, and at that scale the application code was almost the easy part. The enemy was the kernel and the runtime underneath it, and their single biggest bottleneck across the entire effort, in their own words, was lock contention. The fixes were surgical: per-scheduler memory allocators so threads stopped fighting over one global pool, spreading a single contended time-of-day lock across schedulers, FreeBSD socket tuning, a patched BEAM. When you hold tens of millions of anything, the bottleneck migrates out of your code into the shared resources it sits on, and "make it faster" becomes "stop these threads contending on a lock you did not know existed." The tail-latency version of this same fight is its own essay, latency and the tail.
Store-and-forward, and the discipline of the mailbox
The connection model holds the socket. It says nothing about the case that is most of the time for most users: there is no socket to hold. That is store-and-forward, the mechanism that delivers "exactly once to an offline recipient."
The flow is a small state machine. A sends a message; it reaches A's chat server, which acknowledges receipt, the single check, state Sent. The server routes toward B. If B is online, the message goes straight down B's socket. If B is offline, it is queued in a server-side table (WhatsApp used Mnesia, the Erlang database, kept resident in RAM) and the state is Queued. When B reconnects the queue drains to the device, the device acknowledges, and the state is Delivered, the double check. B opens the chat, the device sends a read acknowledgment, and the state is Read, the blue ticks. Then the part that surprises people: once delivery is confirmed, the server deletes its copy. State Purged.
That deletion is the whole philosophy. WhatsApp is a relay, not an archive. The common misconception is that your message history lives on WhatsApp's servers; it does not. Delivered means deleted server-side, and your history lives in a SQLite database on your phone. The server holds a message only for the window between "sent" and "delivered," which for half of all messages was under sixty seconds, with a 98% cache hit rate on the offline store because most queued messages drain almost immediately. The store is a short bounded buffer, not a mailbox you accumulate forever.
The acknowledgment ladder, sent then delivered then read, is also where you must be honest about the guarantee. Each tick is a separate control message that can lie if you are sloppy: "delivered" has to mean the recipient's device acknowledged, not that your server sent it. And you cannot promise exactly-once delivery, for the same reason no networked system can. The sender cannot distinguish a lost message from a lost acknowledgment, so it must retry, and retries mean a message can arrive twice. What you build instead is at-least-once transport plus idempotent dedup on the message id at the recipient, which together behave like effectively-once: the device sees an id it already has and shows it once. I wrote a whole piece on why this is the only honest framing, in idempotency and the exactly-once lie. Exactly-once is a property you manufacture at the receiver, never one the wire hands you.
NomadCrew's "messages persist and replay on reconnection" is the toy version of exactly this, with one honest difference: it replays from Postgres and keeps the row, because at my scale durable history is a feature and storage is free. WhatsApp's scale inverts that, so the relay deletes on ack. The gap is not a flaw in my design; it is the point where scale and a privacy promise force a different decision.
One more detail separates a real answer from a textbook one: the message queue is the master health metric. In an actor system a slow consumer means a mailbox that grows without bound, and an unbounded mailbox is death, eating memory until the scheduler drowns. WhatsApp watched per-process queue length as its primary gauge of health and even paused garbage collection during backlogs so the schedulers would not thrash. That 256-buffer channel in NomadCrew that drops events on overflow is the same instinct at toy scale: a bounded mailbox that refuses to back up, because a dropped event you recover with a replay, while a hub frozen behind one slow client takes everyone down with it. Backpressure beats blocking.
There is also a seam where "real-time" quietly becomes "eventually." A persistent socket cannot reach an app the operating system has killed to save battery, which is most apps most of the time, so store-and-forward hands the message off to Apple's APNs or Google's FCM; the push wakes the app, it reconnects, and only then drains its queue. The socket is the fast path; the push is the fallback that makes it optional.
Order: per-conversation, never global
A junior answer to ordering is "add a timestamp." A senior answer refuses the premise that you need global order at all. You do not need every message across all of WhatsApp to agree on a single total order. Nobody can observe that order, and enforcing it would mean funneling the planet's traffic through one sequencer, which is both needless and ruinously expensive. What you actually need is that messages within one conversation arrive in the order they were sent. Per-conversation FIFO. That is cheap: route a conversation's traffic along a consistent path and stamp each message with a monotonic per-conversation sequence number, and order falls out without any global clock. Wall-clock timestamps from different senders are useless here anyway, because their clocks skew by seconds and a naive timestamp sort will happily put a reply before the message it answers.
There is a real tradeoff hiding underneath: order versus availability. Strict per-conversation FIFO can stall. If message 5 is stuck on a slow link, do you hold 6, 7, and 8 behind it to preserve order, or deliver them and reconcile by sequence number on the client? That is head-of-line blocking, and how you answer it is a product decision as much as a technical one. WhatsApp leans toward delivering and letting the client reorder by sequence, because a stalled chat feels broken in a way a momentary gap does not. The language for reasoning about these consistency-versus-availability choices is CAP and PACELC. The point to internalize: "ordering" is a knob, not a single problem, and the senior move is to scope it to the conversation and then decide, deliberately, how strict to be when the network misbehaves.
Group fan-out, and the encryption that moves it to the client
Send a message to a group of fifty, and that one inbound message has to become fifty outbound copies. Fan-out. The obvious place to do it is the server: one message in, N out, the server explodes it. This is server-side fan-out, and the Design Twitter and Design Instagram builds lean on exactly this pattern for timelines. But end-to-end encryption breaks the obvious answer. If the server cannot read the content, it cannot re-encrypt it for each recipient, because each has different keys. So fan-out moves to the client: the sender encrypts and the server becomes a blind relay that copies opaque bytes to N destinations. Done naively that is brutal, one encryption per member, fifty for a single group post.
The Sender Keys protocol fixes the steady state. The first time you post to a group, your client generates a sender key, a chain key plus a signature key, and encrypts it once to each member over their individual secure channel. Every message after that is a single symmetric encryption with your sender key that all members can already decrypt. A normal group post is one encryption and one relay-to-N. The amplification still exists, fifty members still receive fifty copies over the wire, but the expensive cryptographic work collapses from per-recipient to per-sender. Large groups still carry a tax in key distribution when membership changes, but the per-message cost is bounded. This is the architecture in WhatsApp's own encryption whitepaper↗.
What the server gives up, and what it keeps
End-to-end encryption is usually explained as "your messages are secure," which is true and almost entirely beside the engineering point. The engineering point is that encryption relocates where the server is allowed to act. With the Signal Protocol, the message body is AES-256 ciphertext with a signature, and the server relays bytes it cannot read. Every server-side feature that needed plaintext, search, smart replies, spam scanning, has to move to the client, which is why search runs on your device against your local SQLite. The relocation, not the secrecy, is what reshapes the system.
What the server keeps is most of the job. It still routes, still queues for offline recipients, still drives group fan-out as a blind pipe, still tracks the sent, delivered, and read acknowledgments (those are control messages, not content), still manages presence and serves encrypted media blobs. The honest answer to "what can the server do under end-to-end encryption" is: everything except read the content. And here is the limit a staff engineer names without being asked: metadata. Encryption hides what you said, not that you said it, to whom, and when. The social graph and the timing stay visible server-side because routing requires them. The privacy story is "they cannot read your messages," not "they know nothing about you," and conflating the two is the most common overclaim in the whole topic.
Multi-device makes all of this harder in a way the 2014 talks predate. Each device is a separate cryptographic identity, so the sender must fan out to every device of every recipient, and a key change, a new phone or a reinstall, triggers re-encryption and the "security code changed" warning. That frontier connects to distributed state-reconciliation problems, the kind explored in CRDTs and, publishing alongside this piece, the Design Google Docs build, where convergence under concurrent edits is the entire problem.
The thread that ties it together
The reason WhatsApp is a great interview and a greater system is that none of its hard parts are the part that sounds hard. The message firehose is easy. The hard problems are holding tens of millions of idle sockets cheaply, finding which box holds a given recipient and keeping that registry from lying, relaying a message to someone offline and on a different machine and guaranteeing they get it once and in conversation order, fanning out to a group without a global sequencer, and then surrendering server-side plaintext, which shoves a class of features down onto the client.
When I built NomadCrew, I solved the small version of every one of those honestly: a hub that holds connections, a registry that is just a map, replay-on-reconnect, receipts, bounded mailboxes that drop instead of block. The value of the small version is that you feel, in your hands, the exact line where each decision was sized for thousands and would have to change for billions. The RWMutex map becomes a Redis-backed registry. The Postgres replay becomes a delete-on-ack relay. The dropped-event channel becomes a fleet-wide war on lock contention. Same problems, asked at a scale that turns every comfortable assumption into a load-bearing decision. That is what "at planet scale" actually means. It is the point where the easy answer stops being an answer, not the point where the numbers simply get bigger. The queue that backs up here is the one underneath the whole event-driven spine, the territory of Kafka vs queues, and the hub I keep coming back to is NomadCrew.
FAQ
How many connections did WhatsApp really hold per server?
The famous number is two million concurrent connections on a single box, demonstrated by Rick Reed at Erlang Factory in 2012, with a peak benchmark near 2.8 million. The number people miss is that production later ran closer to one million per server on purpose. Two million made a single failure catastrophic: lose that box and a reconnect storm of two million clients lands somewhere else at once. They traded raw density for failover headroom. Density was never the goal; surviving the loss of a machine was.
How does WhatsApp deliver a message to someone who is offline?
Store-and-forward. The sender's chat server routes the message toward the recipient's chat server. If the recipient has no live connection, the message is queued in a server-side table (WhatsApp used Mnesia) rather than discarded. When the device reconnects, the queue is drained to it, and once the device acknowledges delivery the server deletes its copy. WhatsApp is a relay, not an archive. The history you see lives on your phone, not on their servers.
Does WhatsApp guarantee exactly-once delivery?
No system can guarantee exactly-once delivery as a network primitive, because the sender can never tell a lost message apart from a lost acknowledgment. What you actually get is at-least-once transport plus idempotent deduplication on a message id at the recipient, which composes into effectively-once. The retries make sure nothing is dropped; the dedup makes sure a redelivered message is shown once. Anyone who promises exactly-once without naming the dedup step is selling you the lie.
If WhatsApp is end-to-end encrypted, what can the server still do?
End-to-end encryption changes where the server is allowed to act, not whether a server is needed. It still routes messages, queues them for offline recipients, fans them out to group members, tracks the sent and delivered and read acknowledgments, manages presence, and stores media blobs. It just cannot read the content, which is opaque ciphertext. And it still sees metadata: who is talking to whom, and when. That metadata is the honest limit of the privacy story.
Why does a group message become N encryptions, and how do Sender Keys fix it?
Under end-to-end encryption the server cannot re-encrypt one inbound message for each recipient, so fan-out moves to the client. Naively that means encrypting the message separately for every member, which gets expensive in large groups. The Sender Keys protocol fixes the steady state: the first message to a group distributes a per-sender chain key, encrypted once to each member, and every message after that is a single symmetric encryption that all members can decrypt. A group post becomes one encryption plus a relay, not N.