Paper: http://nil.csail.mit.edu/6.824/2020/papers/zookeeper.pdf

To summarize, in this paper our main contributions are:

  • Coordination kernel: We propose a wait-free coordination service with relaxed consistency guarantees for use in distributed systems. In particular, we describe our design and implementation of a coordination kernel, which we have used in many critical applications to implement various coordination techniques.

  • Coordination recipes: We show how ZooKeeper can be used to build higher level coordination primitives, even blocking and strongly consistent primitives, that are often used in distributed applications.

  • Experience with Coordination: We share some of the ways that we use ZooKeeper and evaluate its performance.

Clients submit requests to ZooKeeper through a client API using a ZooKeeper client library. In addition to exposing the ZooKeeper service interface through the client API, the client library also manages the network connections between the client and ZooKeeper servers.

In this paper, we use client to denote a user of the ZooKeeper service, server to denote a process providing the ZooKeeper service, and znode to denote an in-memory data node in the ZooKeeper data, which is organized in a hierarchical namespace referred to as the data tree. We also use the terms update and write to refer to any operation that modifies the state of the data tree. Clients establish a session when they connect to ZooKeeper and obtain a session handle through which they issue requests.

ZooKeeper provides to its clients the abstraction of a set of data nodes (znodes), organized according to a hierarchical name space. The znodes in this hierarchy are data objects that clients manipulate through the ZooKeeper API.

There are two types of znodes that a client can create:

  • Regular: Clients manipulate regular znodes by creating and deleting them explicitly

  • Ephemeral: Clients create such znodes, and they either delete them explicitly, or let the system remove them automatically when the session that creates them terminates (deliberately or due to a failure).

Data model.The data model of ZooKeeper is essentially a file system with a simplified API and only full data reads and writes, or a key/value table with hierarchical keys.

Client API:

  • create(path, data, flags): Creates a znode with path namepath, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode: regular, ephemeral, and set the sequential flag;

  • delete(path, version): Deletes the znode path if that znode is at the expected version;

  • exists(path, watch): Returns true if the znodewith path name path exists, and returns false otherwise. The watch flag enables a client to set a watch on the znode;

  • getData(path, watch): Returns the data and metadata, such as version information, associated with the znode. The watchflag works in the same way as it does for exists(), except that Zoo-Keeper does not set the watch if the znode does not exist;

  • setData(path, data, version):Writes data[] to znode path if the version number is the current version of the znode;

  • getChildren(path, watch): Returns the set of names of the children of a znode;

  • sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to. The path is currently ignored.


  • Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence;

  • FIFO client order: all requests from a given client are executed in the order that they were sent by the client.

Can easily build primitives on top of it:

  • Locks.

  • Dynamic configuration.

  • Group membership.

Linearizability. A history is linearizable, if exists a total order of ops, matches real time, reads see preceping writes in order.

Why ZooKeeper? (MIT 6.824)

  • API for general-purpose coordination service.

  • Nx many servers - can it yield Nx performance?

Takeway: in consesus-based systems like Raft, you can't scale write performance by adding more replicas - it can only get slower. You can scale read throughput by adding replicas and serving reads by replicas, but you have to relax some consistency guarantees.

Scalable lock - implement locking without herd effect with seq noded.

Last updated