Amazing stuff overall. And the paper is written so well. I found reading it very inspiring.

  • "most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially"

  • "We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them."

    • Multiple clients writing the same file concurrently seems strange.

    • Ah, I see. "Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential".
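A toy sketch (not the paper's code) of why record append needs no client-side coordination: the server chooses the offset, so hundreds of producers can append without agreeing on positions first. The lock stands in for the primary replica serializing mutations.

```python
import threading

class ChunkSketch:
    """Toy model of a chunk whose primary serializes record appends."""
    def __init__(self):
        self.data = bytearray()
        self.lock = threading.Lock()  # stands in for the primary's serialization

    def record_append(self, record: bytes) -> int:
        with self.lock:
            offset = len(self.data)  # the server picks the offset, not the client
            self.data += record      # applied as one atomic unit
            return offset

chunk = ChunkSketch()
offsets = []

def producer(payload: bytes):
    offsets.append(chunk.record_append(payload))

threads = [threading.Thread(target=producer, args=(b"rec%d;" % i,))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
# Every record lands whole at some server-chosen offset; interleaving order
# between producers is unspecified, which is exactly the paper's contract.
```

Contrast with a positional write: there, two producers would have to agree on offsets out of band, which is the "extra synchronization" the quote is avoiding.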

  • "Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them."

  • "High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write."

  • How do you build PD on top of this then?

  • Single master (SPOF?), for metadata and management. Clients read/write directly to chunkservers.

  • Reading/writing is done in the client lib, interesting.

  • If you decide to change the chunk size, then what happens? It seems like the client lib and the master must use the same "version".

  • Chunk size is 64 MB. That's a lot! Especially in 2003.

  • "Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size"
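A minimal sketch of the client-library arithmetic the large chunk size implies: the client translates a byte offset into a chunk index and only sends (filename, chunk index) to the master, never raw offsets. Function names are mine, not GFS's API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def chunk_index(offset: int) -> int:
    """Which chunk of the file a byte offset falls in."""
    return offset // CHUNK_SIZE

def chunk_range(offset: int, length: int) -> range:
    """All chunk indices a read of `length` bytes at `offset` touches."""
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)
```

With chunks this big, most client reads touch a single chunk, so one metadata round-trip to the master covers a lot of I/O — one of the paper's arguments for the size.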

  • Do they do something for data integrity?

    • 32-bit checksums.

    • Nice that it's so short. Why bother with something cryptographically strong?

    • But for 64 KB blocks, not the entire chunk.
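A sketch of per-block checksumming, assuming CRC32 (the paper says 32-bit checksums over 64 KB blocks but doesn't name the function). Small blocks mean a read only verifies the blocks it touches, and corruption invalidates one block, not the whole 64 MB chunk.

```python
import zlib

BLOCK = 64 * 1024  # checksums cover 64 KB blocks, not the whole chunk

def block_checksums(chunk: bytes) -> list:
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums: list) -> bool:
    """Recompute and compare on read, as a chunkserver would."""
    return block_checksums(chunk) == sums
```

This also answers the "why not cryptographic" note: the threat is disk corruption, not an adversary, so a cheap CRC is enough.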

  • Master keeps operation log, builds checkpoints in the background.

  • They are minimizing master's involvement in anything.

  • Leases: the master grants a chunk lease to one replica (the primary), which picks a serial order for mutations; extensions piggyback on heartbeats.
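A toy lease, using the paper's 60-second initial timeout; the server name is illustrative. The point is that the master can let a dead primary's authority lapse simply by waiting out the clock.

```python
LEASE_SECONDS = 60  # initial lease timeout from the paper

class Lease:
    """Toy chunk lease: granted by the master to one replica, the primary."""
    def __init__(self, primary: str, now: float):
        self.primary = primary
        self.expires = now + LEASE_SECONDS

    def extend(self, now: float):
        """Renewal would ride on a HeartBeat reply in GFS."""
        self.expires = now + LEASE_SECONDS

    def valid(self, now: float) -> bool:
        return now < self.expires

lease = Lease("chunkserver-17", now=0.0)  # hypothetical server name
lease.extend(now=59.0)                    # renewed just before expiry
```

This is also part of "minimizing master's involvement": the primary orders mutations on its own for the lease duration instead of asking the master per write.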

  • There seems to be a lot of trust into the client code.

  • "We decouple the flow of data from the flow of control to use the network efficiently."

  • "Our network topology is simple enough that “distances” can be accurately estimated from IP addresses." Lol

  • "Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately." Sounds familiar (e.g. CDNs and random trees).
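The paper gives the ideal elapsed time for pushing B bytes through a chain of R replicas as B/T + RL (T = per-link throughput, L = per-hop latency). A quick sketch of that formula next to the naive store-and-forward alternative:

```python
def pipelined_transfer_time(B, R, T, L):
    """Ideal pipelined time from the paper: B/T + R*L.
    Each server forwards bytes as soon as they arrive, so the
    per-hop latency L is paid R times but the bandwidth cost only once."""
    return B / T + R * L

def store_and_forward_time(B, R, T, L):
    """For contrast: waiting for the full payload at every hop."""
    return R * (B / T + L)

# Paper's example: 1 MB over 100 Mbps links (12.5 MB/s), 1 ms hops, 3 replicas
t = pipelined_transfer_time(1e6, 3, 12.5e6, 1e-3)  # ~0.083 s, matching the
                                                   # paper's "about 80 ms"
```

With pipelining, replication chain length barely matters for large transfers; without it, total time grows linearly with R.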

  • Fancy control flow to guarantee atomicity.

  • "GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit" Hmm
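The paper resolves the "Hmm" on the application side: readers skip padding and drop duplicates using checksums and record-level unique IDs. A sketch of what such a self-validating record format could look like — the wire layout here (length + uid + CRC) is my own assumption, not the paper's:

```python
import struct
import zlib

def pack_record(uid: int, payload: bytes) -> bytes:
    """Hypothetical self-validating record: length, unique id, CRC, payload."""
    return struct.pack(">IQI", len(payload), uid, zlib.crc32(payload)) + payload

def read_records(buf: bytes):
    """Yield each valid payload once: skip padding (checksum fails) and
    drop duplicates (repeated uid) left by retried record appends."""
    seen, i = set(), 0
    while i + 16 <= len(buf):
        length, uid, crc = struct.unpack_from(">IQI", buf, i)
        payload = buf[i + 16:i + 16 + length]
        if 0 < length == len(payload) and zlib.crc32(payload) == crc:
            if uid not in seen:
                seen.add(uid)
                yield payload
            i += 16 + length
        else:
            i += 1  # no valid record at this byte: padding or torn data, resync
```

So "at least once, atomically" is enough: the reader's checksum rejects garbage and the uid collapses retries back to exactly-once at the application level.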

  • "we use standard copy-on-write techniques to implement snapshots" Very standard indeed :-)
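"Standard" copy-on-write, sketched at the master's metadata level (all names illustrative): a snapshot only copies metadata and bumps chunk reference counts; chunk data is duplicated lazily, on the first write after the snapshot, when the master sees a refcount above one.

```python
class MasterCOW:
    """Toy copy-on-write snapshot over file -> chunk-handle metadata."""
    def __init__(self):
        self.files = {}       # file name -> list of chunk handles
        self.refcount = {}    # chunk handle -> references from files
        self.next_handle = 0
        self.copies_made = 0  # stands in for chunkservers cloning data locally

    def create(self, name, nchunks):
        handles = list(range(self.next_handle, self.next_handle + nchunks))
        self.next_handle += nchunks
        self.files[name] = handles
        for h in handles:
            self.refcount[h] = 1

    def snapshot(self, src, dst):
        self.files[dst] = list(self.files[src])  # metadata copy only, no data
        for h in self.files[src]:
            self.refcount[h] += 1

    def write(self, name, idx):
        h = self.files[name][idx]
        if self.refcount[h] > 1:       # shared with a snapshot: copy first
            self.refcount[h] -= 1
            new_h, self.next_handle = self.next_handle, self.next_handle + 1
            self.refcount[new_h] = 1
            self.files[name][idx] = new_h
            self.copies_made += 1
        return self.files[name][idx]
```

In real GFS the master also revokes outstanding leases first, so no client can mutate a shared chunk behind the snapshot's back.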

  • "Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms)."

    • No POSIX for you.

  • Master - "When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine."

  • "Initially, GFS was conceived as the backend file system for our production systems. Over time, the usage evolved to include research and development tasks. It started with little support for things like permissions and quotas but now includes rudimentary forms of these. While production systems are well disciplined and controlled, users sometimes are not. More infrastructure is required to keep users from interfering with one another"

    • Clients read directly from chunkservers - no surprise permissions are hard to manage.

Overall structure
  clients (library, RPC -- but not visible as a UNIX FS)
  each file split into independent 64 MB chunks
  chunk servers, each chunk replicated on 3
  every file's chunks are spread over the chunk servers
    for parallel read/write (e.g. MapReduce), and to allow huge files
  single master (!), and master replicas
  division of work: master deals w/ naming, chunkservers w/ data
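The division of work above, as a toy read path: one small metadata RPC to the master (chunk handle + replica locations), then data flows straight from a chunkserver. All names and state here are illustrative, not GFS's actual API.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class Master:
    """Naming only: maps file names to chunk handles and handles to servers."""
    def __init__(self):
        self.files = {"/logs/web.0": [100, 101]}       # hypothetical file
        self.locations = {100: ["cs1", "cs2", "cs3"],  # 3-way replication
                          101: ["cs2", "cs4", "cs5"]}

    def lookup(self, name, offset):
        handle = self.files[name][offset // CHUNK_SIZE]
        return handle, self.locations[handle]          # metadata only, no data

class Chunkserver:
    """Data only: serves byte ranges of the chunks it stores."""
    def __init__(self):
        self.chunks = {}  # handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]

def client_read(master, servers, name, offset, length):
    handle, locs = master.lookup(name, offset)  # tiny RPC to the master
    replica = servers[locs[0]]                  # then fetch from any replica
    return replica.read(handle, offset % CHUNK_SIZE, length)
```

Because the master never touches file data, a single master stops being a throughput bottleneck — only its metadata capacity and availability matter (hence the SPOF worry above).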

Master state
  in RAM (for speed, must be smallish):
    file name -> array of chunk handles (nv)
    chunk handle -> version # (nv)
                    list of chunkservers (v)
                    primary (v)
                    lease time (v)
  on disk:
    checkpoint
    operation log
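The tables above as a sketch in code: (nv) fields are the non-volatile ones that go into the operation log and checkpoints, while (v) fields are rebuilt from chunkserver reports after a master restart. Field and method names are mine.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int                                  # (nv) persisted
    servers: list = field(default_factory=list)   # (v) rebuilt from heartbeats
    primary: str = ""                             # (v)
    lease_expires: float = 0.0                    # (v)

@dataclass
class MasterState:
    files: dict = field(default_factory=dict)   # file name -> chunk handles (nv)
    chunks: dict = field(default_factory=dict)  # chunk handle -> ChunkInfo

    def persistent_view(self):
        """What a checkpoint would contain: only the non-volatile fields."""
        return {"files": self.files,
                "versions": {h: c.version for h, c in self.chunks.items()}}
```

Keeping the volatile fields out of the log is what makes "restart almost instantly" work: chunk locations don't need recovery, just fresh reports from the chunkservers.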

