Flexible Paxos

Today we’re going to talk about a new development in the world of consensus protocols called Flexible Paxos from Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. Flexible Paxos builds off of Paxos, which is the foundational consensus protocol invented by Leslie Lamport and has been written about extensively. Paxos is notoriously difficult to understand and reason about, so much so that there is a follow up paper by Leslie Lamport called Paxos Made Simple which, while still being easier to understand than the original paper, is still difficult to follow. There has been other work on consensus protocols like Zab which powers ZooKeeper, and more recently Raft which powers Consul, etcd, and others. That said, Paxos is still the grandaddy of them all, and powers lots of important infrastructure.

Before we jump in, we should understand what consensus protocols like Paxos aim to solve. The idea is that in a distributed network of computers, where messages can be dropped or delayed between hosts, and individual hosts can fail completely, we want to be able to have every computer agree on some value. Once we’re able to have that building block of agreeing on one value, then we can keep repeating the process, and have a stream of values that are all agreed upon. This stream of agreed upon values is called a “Replicated Log” in distributed systems research.

The reason we’d want a replicated log is that we can build a replicated state machine off of it by having each value, or log entry, represent a state machine transition. Once you have a replicated state machine, you can encode basically anything you’re interested in, like a Key-Value Store, SQL Database, Programming Language Runtime, etc. And since it’s replicated, the system is more resilient to individual host failures.

Tradeoffs

When we evaluate consensus protocols we care about two main properties.

First, How many nodes can completely fail before we cannot serve reads or writes? In general, the ability to make updates are the first thing to go, given enough failures, and the number of tolerable failures is never a constant, but a function of the number of nodes in the system. Ideally, if we had N hosts, we would be able to tolerate N-1 failures and still be able to handle both reads and writes.

Two, what’s the protocol’s throughput and latency? How many messages do I need to send in order to agree on a single value? The fewer messages I need to send/receive the better. Doing strictly less work will improve performance, and the fewer messages going over the network will improve my tail latency.

As in most of engineering, there are tradeoffs. If we force every single host to respond back in order to do an update, then we have no write resiliency, i.e., as soon as one host dies we can’t make any updates. But we do have amazing read resiliency, since every host has all the history, we can support N-1 failures until we’re on our last host, and we’d still be able to serve reads.

Paxos is pretty rigid in its tradeoffs, requiring a strict majority of hosts to be operable in order to support writes. Flexible Paxos on the other hand gives the designer, well, more flexibility in choosing constraints.

Paxos - An Overview

I mentioned that Paxos guaranteed continued operation as long as a strict majority of nodes still remain operable. It’s important to understand why that is, when we consider the advancements that Flexible Paxos makes, and in order to do that we’ll have to have a basic understanding of Paxos.

Paxos has three roles: (1) Proposer, (2) Acceptor, (3) Learner, and two phases of operation: (1) Prepare & Promise and (2) Propose & Accept. The three roles are pretty confusing, because in practice every agent plays all three “roles”, so we’ll largely ignore that. The two phases are important to understand, and we’ll try to provide an overview of the process, but not a detailed explanation.

An agent who wants a certain value chosen (the proposer) asks all of the agents to promise to accept whatever value the proposer chooses later, and waits for a majority of the agents to respond back, promising to accept the proposed value. This is the Prepare & Promise phase. If a majority of agents respond with promises, then the proposer sends out their proposed value to all the agents. If a majority of the agents accept the value, then the proposer “learns” of this value (as well as all the agents who accepted it as well). That’s the Propose & Accept phase, and after that Paxos is finished, resulting in a majority of nodes agreeing on some value.

This majority, is fundamental to the failure guarantee of Paxos. If a majority of nodes have accepted and learned some value, then as long as there is a majority of nodes still up, there exists at least one node who has recorded that accepted value. This is because any two subsets of nodes containing a majority of nodes, must intersect. So if another agent proposes a different value, the node that still remembers¹ the previously accepted value, can refuse to accept the new proposal. In addition, the node can also inform the new proposer about the accepted value, disseminating the information further.

One last thing to note about Paxos is that in its basic form, it only allows us to decide on one value. What we really want is a replicated log, and in order to do that, we need to agree on a series of values. We could run an instance of Paxos for each value, but there’s a better way, namely Multi-Paxos.

Multi-Paxos is an optimization of generic Paxos that allows a group of nodes to decide on a sequence of values. The idea is that the first phase of Paxos doesn’t depend on the value being proposed. So instead of having agents promise to accept just one value, they promise to accept some set, arbitrary number of values coming from the original proposer. That proposer is referred to as the “leader”, and now we only need to run phase 1 once for a series of values, following up with a phase 2 for each individual value. This amortizes the cost of phase 1 over a series of values, improving our throughput.

Flexible Paxos - Quorum Intersection

Now that we’re through all the background knowledge, let’s get to the meat of the idea in this paper. Namely that we don’t have to have both a majority of nodes promise and a majority of nodes accept. Instead we can use arbitrary quorums as long as they intersect. As the authors put it:

We observe that it is only necessary for phase 1 quorums (Q1) and phase 2 quorums (Q2) to intersect. There is no need to require that Q1 ’s intersect with each other nor Q2 ’s intersect with each other. We refer to this as Flexible Paxos (FPaxos) and it generalizes the Paxos algorithm.

So with Paxos, we required majority quorums for phase 1 and phase 2 so that they would intersect, and would guarantee both continued operability and prevent us from incorrectly deciding on another value. What Flexible Paxos is saying, is that it doesn’t matter that one instance of Paxos has its phase 1 quorum (Q1) intersect with another instance of Paxos’ Q1. If that’s the case we’re not strictly limited to using just majority quorums.

Now that’s a neat intellectual thing to discover in an algorithm that is over three decades old, but what’s cooler is that it unlocks completely new tradeoffs in the design space. If only Q1 and Q2 have to intersect, we can start considering things like, what if we make Q2 smaller, or Q1 bigger? They have to intersect, sure, but in a system of 100 nodes, what if only the first 5 were required for Q2, but all 100 were required for Q1?

Well we know from Multi-Paxos that we don’t have to do phase 1 as often, since we can just do it once and elect a leader for a large series of values. But we always have to do phase 2 for each new value. By making Q2 very small, we’ll improve tail latency since we don’t have to send a message to all 100 agents, and wait for a response from at least 51 of them. Now we’re sending a message to 5 agents, and waiting for a response from 3 of them! In effect, we’ve traded off resiliency of phase 2 for increased throughput and reduced latency.

Quorum Systems

The authors, by realizing we merely require overlapping Q1 and Q2, opened the door to leveraging existing research on quorum systems. I love when people discover equivalences that allow you to apply knowledge from a seemingly unrelated domain to another. It’s like finding free money in an old coat. Somebody worked hard to earn that knowledge, but now you get to use it for free! That’s awesome!

The authors talk about a number of different techniques, but my favorite by far is the Grid Quorum.

Grid quorum schemes arrange the N nodes into a matrix of N1 columns by N2 rows, where N1 × N2 = N and quorums are composed of rows and columns. As with many other quorum systems, grid quorums restrict which combinations of acceptors can form valid quorums. This restriction allows us to reduce the size of quorums whilst still ensuring that they intersect.

For example, a valid grid quorum might be to have phase 1 quorums be each distinct row, i.e., in order for phase 1 to pass, all the nodes in at least one row must promise. Then you would have the phase 2 quorums be each column, i.e., for phase 2 to pass you’d need at least all the nodes in one column to accept. That way Q1 and Q2 would always intersect (always at one node that was in both the row and column chosen) but both quorums would be reduced in size. This would be one way to trade off resiliency for both phase 1 and phase 2 to improve throughput for both phases.

Seriously, Grid Quorums are so cool!

Conclusion

I’ll leave the proof of validity to the paper, but there is some intuition that it should work out. Overall, the paper is really well written and the research is a great re-examination of a staple of distributed systems that lead to an awesome new discovery. I love when new points in the design space are discovered or made viable, because it creates new opportunities and hopefully new technolgies.

Do you remember when we accepted proposals with lower proposal numbers? Pepperidge Farm remembers. ↩︎