Primary-Backup Replication

The Design of a Practical System for Fault-Tolerant Virtual Machines http://nil.csail.mit.edu/6.824/2020/papers/vm-ft.pdf

Very cool stuff overall. And I bet the implementation is super challenging.

  • "typically reduces performance of real applications by less than 10%"

    • Doesn't sound like nothing to me!

  • State-machine approach.

    • Crash-consistent or app-consistent? App consistent I guess? Or neither?

    • Do they assume no network calls, etc? Or is it another thing that the hypervisor intercepts and adjusts?

    • "Deterministic replay".

  • Fail-stop failures - detected before incorrect externally-visible action.

  • What sort of VMs are usually replicated like this? Must be something stateful. DB servers?

  • Ok, network and all inputs go only to the primary.

  • Secondary outputs are dropped.

  • Split brain - must ensure that only one VM takes over the execution.

  • Challenges:

    • Capture all inputs and non-determinism.

    • Apply them to the backup.

    • Do it without degrading performance.

    • Also instructions with undefined behavior.

  • Inputs and non-determinism are written to a log, then read.

  • Log is not written to disk - sent straight to the backup.

  • They can have any number of secondaries this way I suppose? If they fan out the log stream.

  • If the backup takes over, must function consistently with outputs of the primary.

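  • Output Rule: delay external outputs until the backup has the log entry for the operation that produces the output.

    • Received and acked is enough; the backup does not need to have replayed it yet.

    • This keeps the backup consistent with every output the primary has already emitted. It does not rule out duplicate outputs at failover.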
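A minimal sketch of the rule above: hold each external output until the backup has acked the log entry of the operation that produced it. The `Primary` class and its method names are hypothetical, made up for illustration; the real logic lives inside the hypervisor. Only the output is held back; the primary itself keeps executing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Hypothetical primary that holds outputs until the backup acks the log."""
    next_seq: int = 0                                # sequence number of the next log entry
    acked_seq: int = -1                              # highest entry the backup has acked
    pending: deque = field(default_factory=deque)    # (required_seq, output) pairs
    released: list = field(default_factory=list)     # outputs actually sent out

    def log_event(self, event: str) -> int:
        """Send one entry down the logging channel; return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def emit_output(self, output: str) -> None:
        """Log the output operation and queue the output until that entry is acked."""
        seq = self.log_event(f"output:{output}")
        self.pending.append((seq, output))
        self._flush()

    def on_ack(self, seq: int) -> None:
        """The backup acknowledged every entry up to and including seq."""
        self.acked_seq = max(self.acked_seq, seq)
        self._flush()

    def _flush(self) -> None:
        """Release queued outputs whose log entries have been acked."""
        while self.pending and self.pending[0][0] <= self.acked_seq:
            self.released.append(self.pending.popleft()[1])
```

If the primary dies before the ack arrives, the output was never sent, so the backup is free to reach that point on its own and send it then.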

  • We assume some failure on the hardware/infra level? Otherwise, if the primary failed, why wouldn't the secondary fail too?

  • Can't guarantee that all outputs are produced exactly once!

  • The logic is baked into the hypervisor?

  • Split-brain is avoided - before going live, a VM acquires a lock on the shared storage. If it can't - it kills itself.
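The paper describes this lock as an atomic test-and-set on the shared storage. A minimal sketch, assuming an in-process flag guarded by a mutex stands in for the shared disk (`SharedStorage` and `try_go_live` are hypothetical names, not anything from VMware):

```python
import threading


class SharedStorage:
    """Stand-in for shared storage that offers an atomic test-and-set flag."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._flag = False

    def test_and_set(self) -> bool:
        """Atomically set the flag; return True only for the first caller."""
        with self._mutex:
            if self._flag:
                return False
            self._flag = True
            return True


def try_go_live(name: str, storage: SharedStorage) -> str:
    """A VM goes live only if it wins the test-and-set; otherwise it halts."""
    if storage.test_and_set():
        return f"{name}: live"
    return f"{name}: halt"
```

Whichever VM reaches the storage first wins; the other one sees the flag already set and stops, so at most one VM goes live even when the two cannot talk to each other.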

  • To start replication (not necessarily when the VM is starting), they clone a running VM using a modified form of VMware VMotion.

  • If the primary produces logs faster than the secondary can consume, it will result in slowing down the primary.
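The slowdown above falls out of any bounded log channel. A minimal sketch, assuming a small in-memory queue stands in for the logging channel (`primary_step` and `backup_step` are hypothetical names for illustration):

```python
import queue


def primary_step(channel: queue.Queue, entry) -> bool:
    """Try to append one log entry; False means the primary has to stall."""
    try:
        channel.put_nowait(entry)
        return True
    except queue.Full:
        return False


def backup_step(channel: queue.Queue):
    """Consume one log entry if available, freeing room for the primary."""
    try:
        return channel.get_nowait()
    except queue.Empty:
        return None


channel = queue.Queue(maxsize=2)  # tiny buffer so the effect shows quickly
progress = [primary_step(channel, i) for i in range(3)]  # third append fails
```

Once the backup drains an entry with `backup_step(channel)`, the primary can make progress again.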

  • Operations like shutting down should be directed only to the primary machine.

  • Parallel non-blocking disk operations that touch the same disk location can lead to non-determinism.

    • Solution: detect such races, force to execute sequentially.

  • Alternative design: no shared storage.

    • Secondary writes to its own disk.

    • Can do long distance.

    • Need to sync disks when it's failing over.

    • Harder to deal with split-brain situations.

  • Alternative design: instead of sending disk inputs through the logging channel, the secondary could actually read from the disk.

    • It would reduce the logging traffic significantly.

    • But there are too many subtleties. Like dealing with failed disk operations.

  • Only implemented for uni-processor VMs. Mkay.

    • Is there a way to extend it I wonder?

  • How do the clients cut-over? How do they know to communicate with the new machine? I suppose it has different IP? Or the same? I'm missing something major here.

    • They say the backup takes over the primary's MAC address and advertises it on the network, so the switches learn which server the VM now runs on.

Lecture (https://www.youtube.com/watch?v=M_teob23ZzY) notes:

  • Replication: only deal with fail-stop faults. Not bugs, etc.

  • Assume failures are independent. If dependent - replication won't help.

  • Is it worth paying for 2 machines? It depends. An economic question, not a technical one.

  • State transfer vs Replicated State Machine.

  • Operations are usually much smaller than the entire state.

  • But the replicated state machine approach is more complex than state transfer.

  • To implement replicated state machine we need to decide:

    • What state?

    • How to synchronize? How close?

    • Scheme for switching over.

    • Anomalies in cut-over. How to cope with them.

    • New replicas.

  • What state? All state. Memory, etc.

    • It's rare. Usually replication only does what's important for the application. But here it's very general-purpose.

    • Application-level replication requires the application to participate in it.

  • The disk server is little different, if at all, from any other network dependency.

  • Non-determinism:

    • Inputs - data + interrupt.

    • Weird instructions like random numbers, time, ...

    • Multi-core. Not dealing with it here.

  • Log entry:

    • Instruction number/index.

    • Type (network input, weird inst...)

    • Data
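A toy sketch of how a log entry like the one above drives deterministic replay. Everything here is an assumption for illustration: the "machine" is a loop over made-up instructions, `LogEntry` mirrors the three fields listed above, and the primary's non-determinism is simulated with a seeded random generator. The point is only that the backup reaches the same state if each event is delivered at the same instruction index.

```python
import random
from dataclasses import dataclass

MOD = 1_000_003


@dataclass(frozen=True)
class LogEntry:
    instr_index: int   # instruction count at which the event was delivered
    kind: str          # e.g. "net_input" or "weird_instr"
    data: int          # payload: packet contents, RNG result, timestamp, ...


def execute(n, next_event):
    """Toy machine: state depends only on the events and where they land."""
    state = 0
    for i in range(n):
        state = (state * 31 + i) % MOD           # a deterministic instruction
        entry = next_event(i)                    # event delivered at index i, if any
        if entry is not None:
            state = (state + entry.data) % MOD
    return state


def run_primary(n, rng):
    """Run with live (random) events, recording each one in the log."""
    log = []

    def live_event(i):
        if rng.random() < 0.2:
            entry = LogEntry(i, "net_input", rng.randrange(256))
            log.append(entry)
            return entry
        return None

    return execute(n, live_event), log


def run_backup(n, log):
    """Replay: deliver each logged event at its recorded instruction index."""
    by_index = {e.instr_index: e for e in log}
    return execute(n, by_index.get)
```

Usage: `state, log = run_primary(1000, random.Random(7))` followed by `run_backup(1000, log)` gives the same state, even though the backup never sees the random source.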

  • Any replication system has the possibility of either duplicate outputs or missing outputs during cutover. Duplicate output is usually better: if it's TCP, it will be deduped at the TCP level, transparently to client apps.
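A rough sketch of why a duplicate is harmless over TCP: the receiver tracks the next byte it expects and discards anything it already has. This assumes the new primary resends a segment with the same sequence number, which should hold since it replays the same TCP state. `Receiver` is a hypothetical toy, far simpler than a real TCP stack (no reordering buffer, no wraparound).

```python
class Receiver:
    """Toy in-order byte-stream receiver keyed on sequence numbers."""

    def __init__(self):
        self.next_expected = 0   # sequence number of the next new byte
        self.delivered = []      # chunks handed to the application

    def on_segment(self, seq: int, payload: bytes) -> None:
        """Deliver only bytes not seen before; drop duplicates and gaps."""
        if seq + len(payload) <= self.next_expected:
            return               # pure duplicate, already delivered
        if seq > self.next_expected:
            return               # gap; real TCP would buffer, the sketch drops it
        fresh = payload[self.next_expected - seq:]
        self.delivered.append(fresh)
        self.next_expected += len(fresh)
```

If the old primary sent `on_segment(0, b"hello")` and the new primary repeats it after failover, the application still sees `hello` only once.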
