← Back to Portfolio

Leader Election: How Distributed Systems Pick a Boss and Survive Losing One

Picking a leader is the easy half; the murder is in agreeing the previous one is actually dead.

· 15 min read· leader-election / distributed-systems / consensus / raft / zookeeper / system-design

A leader is the answer to a question most engineers never ask out loud: when two machines disagree about what just happened, who wins?

You can build a long way without asking it. One database, one process, one source of truth, and the question never comes up, because there is only ever one opinion. Then you add a second node for availability, and a third, and one night a link hiccups and two of those nodes are each certain they are in charge. Both accept a write. The histories fork, and no merge puts them back together, because both branches are equally real and mutually exclusive. That failure has a name, and the entire discipline of leader election exists to prevent it.

What a leader is actually for

The instinct is to say a leader coordinates the others, like a manager handing out tasks. That is the shallow reading, and it points at the wrong goal. The real job of a leader is to serialize writes into a single total order. Every change goes through one node, that node decides the sequence, and because there is exactly one sequencer, no two changes can ever be ordered inconsistently across the cluster. The leader is a chokepoint on purpose; throughput is not the goal and is often slightly worse with a leader than without. The goal is the invariant that no two nodes make conflicting decisions. Raft, the consensus algorithm most people learn first, states this almost verbatim: at most one leader can be elected in a given term. Consensus and Raft covers how that algorithm holds the line; this piece is the narrower question of how the boss gets picked and what happens the moment the boss disappears.

Take the single-leader invariant away and you get split-brain. Cut a five-node cluster cleanly into three and two: the majority side elects a new leader, the minority side still has the old one who never stopped believing it was in charge, and now there are two leaders taking two write streams into two divergent histories. When the partition heals you do not get to choose which one was correct, because both followed the rules they could see. That is the failure every mechanism below exists to prevent.

Picking a leader is the easy half

Here is the thing nobody tells you in the tutorial: electing a leader is not the hard part. If you could know a leader had died the instant it died, election would be almost mechanical: run a quick vote, the winner takes over, done. The hard part is the murder, agreeing that the previous leader is actually gone.

Over an asynchronous network, a leader that is slow is indistinguishable from a leader that is dead. The follower's view of a healthy leader is a steady drip of heartbeats. When they stop, the follower knows exactly one thing: it stopped hearing from the leader. It does not know whether the leader crashed, hit a four-second garbage-collection pause, sits behind a link dropping packets, or froze while its VM was live-migrated to another host. All of those produce the identical trace: silence.

This is the long shadow of the FLP impossibility result, which proved that in a fully asynchronous system with even one possible crash, no deterministic protocol can guarantee consensus. The practical consequence is blunt: there is no timeout that is always right, so every system makes a bet. Too aggressive and you get needless failovers, a leader that paused for a beat declared dead and the cluster churning out a replacement you did not need. Too lenient and real failures take forever to notice, so recovery time balloons while clients sit on a corpse. Cloud environments push the dial toward longer and jittered, because noisy-neighbor CPU steal, stop-the-world pauses, and migration freezes manufacture multi-second stalls that look exactly like death. Tight detection in a noisy place is a machine for false alarms.

Three families, and where the consensus lives

There are essentially three architectural answers to leader election, and the cleanest way to tell them apart is to ask one question of each: where does the consensus actually live?

The first family bakes the election into the replication protocol itself. Raft and Zab (the protocol inside ZooKeeper) belong here. Election is not a separate subsystem; it is a special case of the same machinery that replicates the log, hung on a logical clock that labels each reign, a term in Raft and an epoch in Zab. A candidate wins votes from a majority, and a stale leader is retired the moment anyone sees a higher number.

The second family outsources the problem. Your application does not run consensus; it asks a coordination service that already does, and leadership becomes nothing more than holding a key. ZooKeeper, etcd, and Google's Chubby are the canonical services. While you hold an ephemeral lock you are the leader; if your session dies, the lock evaporates and someone else takes it. This is the pragmatic default for most teams, because running consensus correctly is a tax you should not pay if you can avoid it. It is the same machinery as distributed locks, and it inherits the same sharp edge.

The third family is the textbook one: ID-based election, where the bully algorithm (Garcia-Molina, 1982) crowns the highest-ID node that answers and the ring algorithm circulates a message collecting IDs until the highest wins. Treat these as the pedagogical baseline, not a production choice: they assume reliable failure detection and synchronous messaging, the exact assumptions that shatter in real networks. Modern production cores run Raft and Zab instead.

Quorum is the anti-split-brain primitive

The consensus-backed families share one load-bearing idea, the single mechanism that makes split-brain structurally impossible rather than merely unlikely. A quorum is a majority: the ceiling of (N plus 1) over 2, so three of five, two of three. The magic is geometric. Any two majorities of the same set must overlap in at least one node, because two disjoint majorities would add up to more than N. That overlap is the whole game: two partitions can never both assemble a quorum, because the node in the overlap can only be on one side of the split.

So in our five-node cluster, the side with three forms a quorum and elects a leader. The side with two cannot, no matter how convinced it is. The minority fails closed: no quorum, no leader, no writes, no forked history. This is why these systems use odd cluster sizes and why the tolerance formula is 2f + 1 nodes survive f failures. You do not add nodes for throughput here; you add them to raise the number of simultaneous deaths the cluster absorbs while still assembling a majority. This is the reasoning underneath CAP and PACELC: when a partition forces the choice, the minority gives up availability to keep consistency.

Raft's election, with the actual numbers

Here is how Raft runs an election. Each follower holds an election timeout, randomized per node from a range the paper suggests as 150 to 300 milliseconds. A leader sends heartbeats (empty append-entries messages) to reset everyone's timer; when a follower's timer fires without one, it increments the term, votes for itself, and asks everyone else for a vote. The randomization is the whole mechanism: a fixed timeout would have every follower fire together, all vote for themselves, and none reach a majority, a split vote that repeats forever because the cluster keeps tying.

Walk one through a three-node cluster anyway. Leader L on term 4 crashes. Followers A and B both fire near 150 milliseconds, both become candidates for term 5, both vote for themselves, neither reaches the majority of two. Each picks a fresh random timeout, A on 160 and B on 290, so A fires first, votes for itself, and B (still waiting) grants its vote. A leads term 6. Total disruption: a couple hundred milliseconds, and the randomness guaranteed the tie broke instead of looping.

The term also retires zombies for free: any server that sees a higher term reverts to follower, so a partitioned leader returning on an old term steps down the instant it talks to anyone current. The paper insists on one timing inequality that captures the whole tuning philosophy: broadcast time (half a millisecond to twenty) much less than the election timeout much less than the mean time between failures (months), with the timeout in the wide gap between. Production Raft also adds a pre-vote phase, probing for votes before incrementing the term, so a rejoining partitioned node cannot inflate the term and depose a healthy leader just by coming back.

The lock-service shortcut, and its numbers

If you do not want to run consensus yourself, you hold a lock, and etcd's election package shows the shape cleanly. You call Campaign, which writes a key bound to a lease and blocks until you are elected; the leader is the candidate whose key has the lowest creation revision, which is just first-one-in-wins. Here is the nice touch: each waiter watches only the key immediately ahead of it, not the leader itself, so a departure wakes exactly one node instead of stampeding the field. ZooKeeper's recipe is the same idea in sequential ephemeral nodes, smallest number leads, each watching the next-smaller one.

Liveness rides on the lease. etcd's NewSession defaults to a 60-second TTL, and as long as the leader's process is alive it keeps renewing; if the process dies the lease expires, etcd auto-deletes the key, and the next waiter unblocks. There is no separate failure detector to build: the lease expiry is one, tied to whether the leader is around to renew it.

Winning the election does not mean you have the latest data

ZooKeeper's docs raise a caveat that separates people who read the manual from people who skim it: being the smallest sequence number does not, by itself, prove that node knows it is the leader, so they recommend a separate acknowledgment node after running the leader procedure. That generalizes into one of the least-internalized truths in the whole area. The election picks a winner; it does not, on its own, guarantee that winner holds the most recent committed state.

Consensus-backed systems close the gap deliberately, and the mechanism is elegant. Raft refuses to grant a vote to a candidate whose log is less up-to-date than the voter's own. The result is a property called leader completeness: because a candidate needs a majority, and every node in that majority confirmed the candidate's log is at least as current as its own, the winner's log already contains every committed entry. Zab does it differently with the same intent, spending an explicit synchronization phase to bring the new primary up to date before it broadcasts. The lock-service family gets none of this for free, which is what the ZooKeeper caveat warns about: if you elect by who-grabbed-the-lock-first, nothing checked whether the winner caught up on the state that matters. The election and the data-freshness guarantee are two separate things, and if your election does not force freshness, you have to.

The deposed leader is a loaded gun

Here is the part that separates real understanding from a Wikipedia summary. Suppose your election is flawless and quorum guarantees only one leader is ever elected. You are still not safe, because when a new leader takes over, the old one does not necessarily know it has been deposed. It might have been paused or partitioned in the way that triggered the election, and while it was frozen the cluster moved on. When it wakes, from its own point of view nothing happened: it was leader a moment ago, it is leader now, and it issues a write. Timeouts cannot prevent this, because the problem is not detection. By the time the old leader acts it has already been replaced, and the damage lives in the gap between losing leadership and finding out.

Martin Kleppmann's telling is the canonical one. Client 1 acquires the lock with fencing token 33 and prepares to write to storage, then hits a stop-the-world garbage-collection pause long enough that its lease quietly expires while it is frozen. The lock service, seeing the lapse, hands the lock to Client 2 with token 34, and Client 2 writes. Then Client 1 wakes up with no idea any time passed. It still believes it holds the lock, and it issues its write carrying the now-stale token 33.

Without a fence, both writes land and storage is corrupted by a zombie that did not know it had died. The mechanism that saves you is small and lives in a specific place: the storage server remembers the highest token it has accepted (34), and when Client 1's write arrives carrying 33, it rejects it because the token went backwards. The zombie's write bounces off a wall.

The non-obvious lesson is where the fence has to live. It is not enough for the lock service to issue the token; the resource itself has to check it. As Kleppmann puts it, this requires the storage server to take an active role in checking tokens and rejecting any write on which the token has gone backwards. Fencing is the resource's job. If your database, file server, or queue does not validate the token, the token does nothing and you have two leaders writing again. This is also why a random UUID is not a fencing token: a fence requires monotonicity so the resource can reject the older of two values, and a random nonce has no order to compare. The token must go up, every time.

Sequencers, lock-delay, and the patterns that ship

Google's Chubby, the lock service that quietly underpins much of Google's infrastructure, made all of this concrete years before fencing token was common vocabulary. A Chubby cell is typically five replicas that use Paxos to elect a master, which holds a renewable lease and needs a majority to act. The master answers each KeepAlive just before the lease expires, keeping the client's view of the lease conservatively shorter than its own as a margin against clock drift; lose the master and the client gets a 45-second grace period to fail over before declaring its session dead. Chubby's fencing token is called a sequencer: an opaque byte-string carrying the lock's name, mode, and a generation number, which a lock holder hands to a downstream server that validates it against Chubby before honoring the request. That is fencing, named and shipped in 2006.

Chubby also has a fallback for downstream servers that cannot check sequencers, because not everything is Chubby-aware. It is called lock-delay: after a holder vanishes, the lock cannot be re-acquired until a configurable delay passes (commonly cited up to about a minute), which bounds the window a zombie could do damage by refusing to hand the lock out until the old holder has surely noticed or died. It is the poor engineer's fence, weaker than a real token but enough to cap the blast radius. Fence when you can, fall back to lock-delay when you cannot.

One extension a staff engineer keeps in mind: fencing tokens compose with idempotency. A monotonic token at the resource doubles as a dedup and write-ordering key, the property idempotency and the exactly-once lie reaches from a different door. The same instinct extends to reads: a leader serving local reads without a lease can return stale data after silently losing leadership, which is why Raft needs a read-lease or a ReadIndex round-trip and Chubby leans on its master lease. A leader cannot assume it is still the leader.

How a senior actually decides

Put the pieces in the order an engineer reaches for them.

DecisionDefault moveWhy
Run consensus or outsource itOutsource to ZooKeeper, etcd, or Chubby unless you have a strong reasonRunning consensus correctly is a tax a coordination service already paid
Cluster sizeOdd, usually 3 or 52f + 1 tolerates f failures while still forming a majority; odd avoids ties
Detection timeoutLonger and jittered in the cloudSlow equals dead over async links; noisy neighbors and GC pauses manufacture false deaths
Two leaders elected at onceQuorum, alwaysOnly one partition holds a majority, so the minority fails closed
A deposed leader writingFencing tokens enforced at the resourceQuorum does not stop a zombie that has not noticed; the resource must reject stale tokens
Trusting the new leader's dataA protocol that forces log freshness, or enforce it yourselfWinning the election is not the same as holding the latest committed state

The honest landing

You will never get a perfect failure detector, because slow and dead are the same thing to anyone watching from across a network, and that single ambiguity is the source of nearly everything hard here. Quorum makes split-brain structurally impossible by ensuring only one side of a partition can assemble a majority. Fencing tokens make a deposed leader harmless by teaching the resource to reject any write whose token has gone backwards. And forcing log freshness, in the election or by hand, makes the leader you crowned worth trusting.

Get those three right and a frozen leader thawing at the worst possible moment writes its stale token into a wall. Get them wrong and the same event forks your history in two, and you find out which half you lost only after the partition heals and it is far too late to choose. The byte-deterministic discipline behind a system like Audex, where a golden hash refuses to accept output that drifted even one byte, is the same instinct wearing different clothes: decide what correct means, then build the wall that rejects everything else.

FAQ

Why does a distributed system need a leader at all?

A leader exists to impose a single total order on writes so that no two nodes ever make conflicting decisions. Raft states the invariant as at most one leader can be elected in a given term. Without that single serialization point you get split-brain: two nodes each believe they are in charge, both accept writes, and the data diverges in ways no merge can reconcile. Throughput is a side benefit; agreement is the real reason.

What is the difference between a quorum and a fencing token?

They defend against two different failures. A quorum (a majority of nodes) prevents two leaders from being elected at the same time, because only one side of any network partition can hold a majority. A fencing token prevents a leader that was already elected, then paused or partitioned, from doing damage when it wakes up still thinking it is in charge. The quorum lives in the election; the token lives at the resource, which rejects any write carrying a token lower than the highest it has seen. You need both.

Why is failure detection the hard part of leader election?

Over an asynchronous network, a leader that is slow is indistinguishable from a leader that is dead. A long garbage-collection pause, a VM live-migration freeze, or a congested link all produce the same silence as a crash. This is the shadow of the FLP impossibility result: with no bound on message delay you cannot perfectly tell crashed from slow. Every real system therefore picks a timeout, which is a deliberate bet that trades faster recovery against the risk of declaring a healthy-but-slow leader dead.

Does winning the election mean the new leader has the latest data?

Only because the protocol forces it. The election mechanism by itself does not guarantee it. Raft refuses to vote for a candidate whose log is less up-to-date than the voter's, which guarantees the winner already holds every committed entry. ZooKeeper's own recipe warns that being the node with the smallest sequence number does not prove that node knows it is the leader, and recommends a separate acknowledgment step. Zab spends a whole synchronization phase catching the new primary up before it serves traffic.

Is the bully algorithm used in real production systems?

Rarely in the core of anything serious. The bully algorithm (highest ID wins) and the ring algorithm are the textbook baseline, and they assume reliable failure detection and synchronous messaging, which is exactly what breaks in production. Modern systems run consensus-backed election instead: etcd uses Raft, ZooKeeper uses Zab, and applications that do not want to run consensus themselves outsource the whole problem to a lock service like ZooKeeper, etcd, or Chubby.