MapReduce
The original article: http://nil.csail.mit.edu/6.824/2020/papers/mapreduce.pdf
MIT 6.824: Distributed Systems Lecture 1
Somehow I used to think that functional languages borrowed the terms ("map" and "reduce") from MapReduce, not the other way around.
The brilliance is hiding the complexity of parallelism, massive scale, fault tolerance, etc. behind a very simple interface.
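The interface really is just two functions. A minimal word-count sketch of what the user writes, in Go rather than the paper's C++ (the names and types are mine, following the shape of the paper's pseudocode):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is an intermediate key/value pair emitted by Map.
type KeyValue struct {
	Key   string
	Value string
}

// Map takes one input split (here, a file's contents) and emits a "1" for every word.
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool { return !unicode.IsLetter(r) })
	kva := []KeyValue{}
	for _, w := range words {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// Reduce takes a key and every intermediate value emitted for that key, and sums them up.
func Reduce(key string, values []string) string {
	sum := 0
	for _, v := range values {
		n, _ := strconv.Atoi(v)
		sum += n
	}
	return strconv.Itoa(sum)
}

// main fakes the framework in-process: group Map's output by key, then reduce each group.
func main() {
	grouped := map[string][]string{}
	for _, kv := range Map("doc.txt", "the quick brown fox jumps over the lazy dog the end") {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	for k, vs := range grouped {
		fmt.Printf("%s %s\n", k, Reduce(k, vs))
	}
}
```

Everything else (splitting the input, scheduling, shuffling, retries) is the framework's problem.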
"After successful completion, the output of the map-reduce execution is available in the output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files."
Huh. Somehow I'm used to a single output file.
Or at least that one key belongs to only one output file.
"Completed map tasks are re-executed on a failure be-cause their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system."
How is work distributed across Reduce tasks? With Map it's clear.
Is there a "global" master, or per-job masters? I guess the latter.
I like the discussion of the failure semantics.
"We conserve network band-width by taking advantage of the fact that the input data(managed by GFS [8]) is stored on the local disks of the machines that make up our cluster."
It's interesting just how everything runs on the same machines... Needs some thinking.
It allows tricks like input locality then.
"One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a ma-chine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation."
Oh I've seen that at Microsoft.
"We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks."
Clever.
Combiner - does it really need framework support? The user can just use the reducer code at the end of their Map.
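Roughly what I mean, reusing the word-count Map/Reduce sketch from above (the helper name is mine): group this task's output by key and pre-reduce each group before anything leaves the machine. The paper does note that this kind of partial combining only works when Reduce is commutative and associative, which summing is.

```go
// mapWithLocalCombine is a hand-rolled "combiner": run the Reduce function over
// this map task's locally grouped output, so at most one pair per key leaves the task.
func mapWithLocalCombine(filename, contents string) []KeyValue {
	grouped := map[string][]string{}
	for _, kv := range Map(filename, contents) {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	combined := []KeyValue{}
	for k, vs := range grouped {
		combined = append(combined, KeyValue{Key: k, Value: Reduce(k, vs)})
	}
	return combined
}
```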
So, does each Reduce read all Map outputs and just filter for the keys that belong to it? Isn't that too many reads? And compared to that, does read locality for the inputs even make much of a difference?
I suppose if the intermediate output is sorted on the Map worker, it's not that bad. And then the filtering is done on the Map side too.
Also, this part is network-constrained. If the sorting and partitioning are done on the Map side, then the network transfer does not explode.
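And the paper's answer is partitioning rather than filtering: each Map task hashes every intermediate key into one of R buckets, and Reduce task r fetches only bucket r from each Map worker. A sketch of that default partitioning function (the fnv hash is my choice; the paper only says “hash(key) mod R”):

```go
package mr // standalone sketch of the partitioning step

import "hash/fnv"

// partition implements the paper's default partitioning, hash(key) mod R.
// Each map task writes its intermediate pairs into R buckets using this, and
// reduce task r later reads only bucket r from every map worker.
func partition(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // fnv's Write never returns an error
	return int(h.Sum32()&0x7fffffff) % nReduce
}
```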
"This includes about a minute of startup over-head. The overhead is due to the propagation of the pro-gram to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization."
The whole crawled web was 20TB around that time! It's just nothing!