Leader Election in Distributed Systems
I have always been fascinated by distributed systems and the challenges this architecture brings to the table. In this article, I would like to talk about one of the classic problems: leader election. Understanding how a leader is elected, and what the leader's responsibilities are, is key to understanding a distributed system.
# What is Leader Election
Leader election is an important problem in distributed systems, where data is spread across different nodes that may be geographically separated. Designating a single node as the organizer is a challenging issue that calls for suitable election algorithms.
Before delving deeper into this topic, let me introduce some of the use cases.
# Why (Use Cases)
It's a very common problem as we scale applications across a wide variety of domains:
- Transaction management (e.g. banking systems)
- Autoscaling of an application and membership management for new nodes (cloud platforms)
- Distributed data structures (using shards etc.)
- Cluster management
# Tryst with Java Developers
If you consider a thread as a node, it's easy to imagine the same problem in a distributed system, but there is no synchronized keyword to the rescue. The reason is that synchronized is implemented using semaphores, which in turn are backed by the underlying operating system and the hardware it's running on. But if we look at what synchronized does, we might adapt the concept to a distributed setting. When a block of code is declared synchronized in Java, it is always synchronized on the monitor of a specific object. The size of this monitor is one, meaning that only one thread may be inside the monitor at any given time. When a thread requests the monitor and the monitor is available, it is allowed to enter immediately. If the monitor is occupied, the thread is put in the waiting pool and suspended until the monitor is released.
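To make the analogy concrete, here is a minimal Java sketch of the monitor behaviour described above: the resource object acts as the monitor and only one thread can hold it at a time. The class and method names are purely illustrative.

```java
public class ResourceGuard {

    // The object whose monitor threads compete for; its "size" is one.
    private final Object monitor = new Object();

    public void useResource(Runnable criticalSection) {
        // A thread entering this block acquires the monitor if it is free;
        // otherwise it is suspended until the monitor is released.
        synchronized (monitor) {
            criticalSection.run();
        }
    }
}
```

In a distributed setting there is no shared memory to hang such a monitor on, which is exactly why a leader (or an equivalent coordination mechanism) is needed.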
# How
Although it may seem that leader election doesn't follow any pattern, it is subject to the classic CAP theorem (Consistency, Availability and Partition tolerance), and we have to consider the different approaches and their trade-offs:
- Hazelcast (AP system; recent versions can be configured as CP)
- ZooKeeper (CP system**)
- etcd, using Raft (CP system)
In general, Hazelcast will be much more performant and able to tolerate more failures than a Raft-based system such as etcd, but at the cost of potential data loss or consistency issues. Hazelcast works by partitioning data and storing the pieces on different nodes. So, in a 5-node cluster, the key "foo" may be stored on nodes 1 and 2, while "bar" may be stored on nodes 3 and 4. You can control the number of nodes to which Hazelcast replicates data via the Hazelcast and map configuration. However, during a network failure or other fault, there is some risk that you will see stale data or even lose data in Hazelcast.
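As a rough illustration, the replication factor can be set per map through Hazelcast's configuration API. The map name "foo" and the backup count of 2 are just example values; check the documentation of your Hazelcast version for the exact options.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HazelcastBackupExample {
    public static void main(String[] args) {
        Config config = new Config();
        // Keep 2 backups of each partition of the "foo" map, so an entry
        // lives on the owning node plus two other nodes in the cluster.
        config.getMapConfig("foo").setBackupCount(2);

        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
        instance.getMap("foo").put("key", "value");
    }
}
```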
# Are you still interested? (TL;DR)
If you have a peer-to-peer background like mine, you would probably suggest using a distributed hash table (DHT) rather than a leader to decide which node implements the monitor for each resource. There are indeed some really good DHT implementations out there that deal with thousands of nodes leaving and joining every hour. Given that they also work in heterogeneous networks with no prior knowledge of the underlying topology, query response times of four to ten message hops are pretty good, but in a grid setting where the nodes are homogeneous and the network is fairly stable, we can do better.
# General Approach
Take Elasticsearch as an example of the general approach: the leader regularly pushes the current version of the global cluster state to each node. What happens if the leader goes down? Just because our nodes are fairly reliable does not mean we can accept a single point of failure. Provided that only the leader node has crashed and all other nodes can still communicate, Elasticsearch will handle this gracefully: the remaining nodes will detect the failure of the leader, or rather the absence of messages from the leader, and initiate a leader election. If you are using ZenDiscovery (the default), the process is as follows:
- Each node calculates the lowest known node id and sends a vote for leadership to this node
- If a node receives sufficiently many votes and the node also voted for itself, then it takes on the role of leader and starts publishing cluster state.
The definition of "sufficiently many votes" to win an election is known as a quorum. In Elasticsearch the quorum size is a configurable parameter. A node in a distributed system cannot determine whether another node is dead or whether the network is simply unable to deliver messages to it; it is therefore common to rely on a previously agreed-upon threshold, the quorum size, rather than on hearing from every single node.
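As a rough illustration of the two voting steps above, here is a minimal Java sketch: each node votes for the lowest node id it knows about, and a node that collects a quorum of votes for its own id takes the leader role. The class and method names and the vote transport are invented for illustration; real ZenDiscovery does considerably more (ping rounds, fault detection, cluster state publishing).

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleElection {

    private final int myId;
    private final int quorumSize;
    private final Map<Integer, Integer> votesReceived = new ConcurrentHashMap<>();

    public SimpleElection(int myId, int quorumSize) {
        this.myId = myId;
        this.quorumSize = quorumSize;
    }

    /** Step 1: pick the lowest known node id; the vote would be sent to that node. */
    public int chooseCandidate(Collection<Integer> knownNodeIds) {
        return Collections.min(knownNodeIds);
    }

    /** Step 2: called when a vote arrives at this node; true once a quorum voted for us. */
    public boolean onVote(int votedFor) {
        int votes = votesReceived.merge(votedFor, 1, Integer::sum);
        return votedFor == myId && votes >= quorumSize;
    }
}
```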
Let's illustrate with the following scenario: a cluster has nodes N1, N2, N3, N4 and N5, and a quorum size of 3. N1 is the leader. It so happens that N1 and N2 are on one network and N3, N4 and N5 on another, and the two networks are connected through a link.
When this link fails, N1 and N2 are still able to communicate with each other, but not with N3, N4 and N5. Similarly, on the other network, N3, N4 and N5 can communicate with each other, but not with N1 and N2.
What happens next is this: nodes N3, N4 and N5 detect that they no longer have contact with the leader N1 and initiate a leader election by sending their votes to N3. Once N3 has received three votes, it takes on the role of leader and starts publishing cluster state to N3, N4 and N5. On the other network, the leader N1 detects that it no longer has contact with nodes N3, N4 and N5. N1 calculates that the new cluster size is below the quorum size, gives up the leader role, and effectively stops N1 and N2 from responding to queries until the link is restored.
In real life it's unlikely that someone trips over a crucial network cable, but networks are more complex than the example above, and network partitions are actually not that uncommon. It is not hard to imagine a network split when a system relies on multiple data centers; another likely culprit of tricky network errors is a wrongly configured router or switch. As described in "The Network is Reliable", it does happen that a network card drops all inbound packets while still delivering outbound packets, with the result that a server still sends heartbeats but is unable to service any requests.
## Avoid Split Brain
The concept of a quorum size has two immediate advantages.
- Firstly, it simplifies the election process, as votes only need to be delivered to the node being voted for.
- Secondly, and more importantly, it is the only way to avoid a split-brain. Imagine the same cluster and the same network split, but with a quorum size of two. N3 would still be elected leader, but N1 would not give up its leader role. This would result in two leaders, with clusters unknown to each other. This is what is known as a split-brain. Both clusters would accept read and write operations and, as expected, they would drift out of sync. Without human intervention they would probably never recover, and depending on the data model it might be impossible to unify the two versions of the data, forcing one to simply discard all data on one of the clusters.
## Don’t Reinvent the Wheel
As you might expect, leader election has been an intriguing topic in application architecture circles for many years, and quite a few smart people have given it a great deal of thought. If you are keener on getting your distributed system working than on putting a massive effort into research and development, you are probably better off implementing a well-known algorithm. Some day, when you finally figure out that crazy bug in your system, it is more likely to be a bug in the implementation than a major design flaw in the algorithm. Of course, the same argument applies to adapting or integrating an existing implementation rather than implementing from scratch, especially if you want one of the more advanced algorithms.
### The Bully Algorithm
The bully algorithm is one of the basic algorithms for leader election. It assumes that every node has a unique id and that these ids impose a total ordering on the nodes. The current leader at any time is the participating node with the highest id. The advantage of this algorithm is that it is easy to implement, but it does not cope well with a scenario in which the node with the largest id is flaky, especially when that node tends to be overloaded by the chores of the leader role. It will crash as the leader; the node with the second largest id will be elected; the largest-id node will recover (it is no longer overloaded) and initiate a new election, only to be elected and crash yet again. Nevertheless, the bully algorithm is quite common in low-level hardware interfaces. It might be tempting to avoid this thrashing by postponing the election until the current leader fails, but that easily leads to a split-brain.
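A bare-bones sketch of the bully rule, assuming a node can ask its peers for a response and leaving out message passing, timeouts and failure detection (the class and method names are illustrative):

```java
import java.util.Set;

// Illustrative bully-style election: the node with the highest id
// among the reachable nodes becomes the leader.
public class BullyElection {

    private final int myId;

    public BullyElection(int myId) {
        this.myId = myId;
    }

    /**
     * @param respondingIds ids of nodes that answered our ELECTION message
     * @return true if this node should take the leader role
     */
    public boolean amILeader(Set<Integer> respondingIds) {
        // If any node with a higher id responds, it takes over the election;
        // otherwise we declare ourselves leader and announce it to the others.
        return respondingIds.stream().noneMatch(id -> id > myId);
    }
}
```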
### Paxos
Paxos is actually much more than a leader election algorithm: it is a family of protocols for maintaining consensus in a distributed system. I will not go into detail about the varieties of Paxos, but rather discuss its general concepts and qualities. The data model used in Paxos is a growing list of statements, where each statement has a unique number. The mathematically proven guarantees of Paxos are that two nodes will never hold different statements for the same number, and that as long as a quorum of nodes participates, the algorithm will make progress. This translates to the nodes never being inconsistent and the algorithm never deadlocking, although it may halt if not enough nodes are online. These guarantees are in fact quite similar to what we want from a leader election algorithm: we want all nodes in the cluster to agree on which node is the leader, and we do not want any other nodes to run off and create their own cluster, resulting in a split-brain.
Performing leader election with Paxos then becomes as simple as proposing the statement "I am the leader". If it is accepted, you are the leader until another node proposes to be the leader. However, this alone is not sufficient to avoid a split-brain. In the partitioned-network example above, the existing leader N1 will not be notified of the election of N3 on the other side of the partition. There needs to be another criterion that ends the leadership. One option is that the statement used when initiating leader election takes the form "I am the leader until …", but a better option is for the leader to detect that it no longer has contact with a quorum and then demote itself.
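To give a flavour of the "one accepted statement per number" idea, here is a deliberately simplified sketch of an acceptor that records at most one "I am the leader" statement per term. It is not a full Paxos implementation; the prepare/promise phase and quorum bookkeeping are omitted, and all names are invented for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Simplified acceptor state: at most one accepted leader per term (statement number).
public class LeadershipAcceptor {

    private final Map<Long, Integer> acceptedLeaderByTerm = new ConcurrentHashMap<>();

    /**
     * Try to record "node proposerId is the leader for this term".
     * Returns the leader actually on record for that term, which is how a late
     * proposer learns that someone else already won the term.
     */
    public int propose(long term, int proposerId) {
        return acceptedLeaderByTerm.merge(term, proposerId, (existing, proposed) -> existing);
    }

    public Optional<Integer> leaderOf(long term) {
        return Optional.ofNullable(acceptedLeaderByTerm.get(term));
    }
}
```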
Given the previous example of a network partition, let's imagine N3 was disconnected from its network and connected to the other one. The interesting part here is that the side with a quorum now consists of nodes N1, N2 and N3 instead of the previous N3, N4 and N5. If the nodes were running Paxos, their data could look something like the table below.
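One illustrative possibility, assuming round 1 elected N1 (known to every node) and the partitioned majority elected N3 in round 2 (known only to N3, N4 and N5), consistent with the discussion that follows:

| Node | Round 1      | Round 2      |
| ---- | ------------ | ------------ |
| N1   | N1 is leader |              |
| N2   | N1 is leader |              |
| N3   | N1 is leader | N3 is leader |
| N4   | N1 is leader | N3 is leader |
| N5   | N1 is leader | N3 is leader |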
Depending on the exact implementation of leader election through Paxos, a number of outcomes are possible, but they all have two properties in common. Nodes N4 and N5 will not be able to elect a new leader, as they do not have a majority. Nodes N1 and N2 will not elect a new leader before learning that node N3 was elected in round 2.

Assuming that the end criterion of each leader term is a timeout, there are two possible outcomes. If node N3 establishes contact with nodes N1 and N2 before its term ends, it will re-establish a quorum, resume its leader role, and no leader election is necessary. If its term ends, then either node N1, N2 or N3 may be elected. Assuming node N1 is the first to discover node N3 on the network, it will try to get re-elected as leader for term 2, but node N3 will reject this, as it already has a leader for term 2. Node N1 then has the option of attempting to be elected for term 3. If it does, it will also inform node N2 about node N3's time as leader in term 2.

Now, what if nodes N3 and N1 try to get elected at the same time? The full answer requires delving into the specifics of Paxos, which is out of scope for this article, but the short answer is that the election will fail. In fact, the original Paxos paper suggests using a leader election algorithm to decide which node should initiate new proposals, but since the Paxos protocol guarantees there will be no inconsistencies even if multiple nodes propose conflicting statements, this leader election algorithm can be a simple heuristic to avoid consecutive failed rounds. For instance, a node might wait a randomized amount of time before re-proposing a failed statement. This flexibility in when and how to do leader election can be a great advantage of Paxos over the simple bully algorithm, bearing in mind that in real life there are far more failure modes than a dead network link.
# Conclusion
If you have reached this point, you have made good progress and probably want to apply this in your own architecture. I would classify leader election as one of the more complex problems to solve, and the following should be the primary considerations before picking any solution:
- Quorum size is really important if you care about your data.
- Paxos is really robust, but not as easy to implement.
- It’s better to implement a well-known algorithm where the strengths and flaws are known rather than getting a big surprise further down the road. There are certainly plenty of options out there.
# References
** ZooKeeper provides sequential consistency: updates from a client are applied in the order in which they were sent.