Raft

http://nil.csail.mit.edu/6.824/2020/papers/raft-extended.pdf

Raft is a consensus algorithm for managing a replicated log.

  • Strong leader. Raft uses a stronger form of leader-ship than other consensus algorithms. For example,log entries only flow from the leader to other servers.This simplifies the management of the replicated log and makes Raft easier to understand.

  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety(never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

  • They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

  • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

Sub-problems:

  • Leader election: a new leader must be chosen when an existing leader fails.

  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.

  • Safety: the key safety property for Raft is the State Machine Safety Property: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.

  • Any two majorities always overlap.

  • The leader can commit after hearing back from the half of other servers.

  • Then the leader tells the replicas to commit too. It's piggy-backed inside the next AppendEntries.

  • Why is the system focused on logs in the first place? Because it needs to apply operations in the same order. And it can reply the log after the crash.

Leader election:

  • Possible to build a system like this without a leader. But it's more efficient (while servers don't fail).

  • Each term has at most one leader.

  • There's an election timer on each server. If it expires, it assumes the current leader is dead, becomes a candidate, tries to become a leader.

    • Increment term.

    • Vote for itself.

    • Request votes.

  • Election timer is randomized, to reduce the probability of split votes. Choose random timeout every time, not just once!

  • New leader sorts out possibly divergent replicas.

Compaction:

  • Must avoid indefinitely growing logs.

  • Take state machine snapshots.

    • With "last included index" and "last included term". These are needed for checks in AppendEntry.

  • Generally servers take snapshots independently. Server sends snapshots to peers falling behind.

  • Use copy-on-write.

Fast Backup (when leader needs to find out where to append to the follower): on AppendEntries rejection, the follower also sends the term and the index of the conflicting entry, and the length of the log. Then the leader will know how to update the nextIndex.

  • They say I have to do this to pass the lab, but I didn't and it still passed :-|

    • Maybe I'll need it after I add persistence.

Persistent properties vs volatile. It only matters if it restarts. It is likely to be the performance bottleneck. Persist:

  • Log.

  • CurrentTerm.

  • VotedFor. This and CurrentTerm are needed to ensure there's only one leader.

Last updated