MapReduce

The original article (read for MIT 6.824: Distributed Systems, Lecture 1): http://nil.csail.mit.edu/6.824/2020/papers/mapreduce.pdf

  • Somehow I used to think that functional languages borrowed the terms ("map" and "reduce") from MapReduce, not the other way around.

  • The brilliance is in hiding the complexity of parallelism, massive scale, fault tolerance, etc. behind a very simple interface.
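
    • A minimal sketch of that interface, using the paper's word-count example (Go; the signatures follow the style of the 6.824 labs, not the paper's actual C++ API):

```go
package main

import (
	"fmt"
	"strings"
)

// KeyValue is one intermediate (key, value) pair.
type KeyValue struct {
	Key   string
	Value string
}

// Map takes one input split and emits ("word", "1") for every
// occurrence of a word. The framework handles everything else:
// splitting the input, scheduling, shuffling, retries.
func Map(filename string, contents string) []KeyValue {
	var out []KeyValue
	for _, w := range strings.Fields(contents) {
		out = append(out, KeyValue{Key: w, Value: "1"})
	}
	return out
}

// Reduce is called once per key with all of that key's values,
// and for word count it just counts them.
func Reduce(key string, values []string) string {
	return fmt.Sprint(len(values))
}

func main() {
	fmt.Println(Map("doc1", "the quick the lazy the"))
	fmt.Println(Reduce("the", []string{"1", "1", "1"})) // "3"
}
```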

  • "After successful completion, the output of the map-reduce execution is available in the output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files."

    • Huh. Somehow I'm used to a single output file.

    • Or at least to one key belonging to only one output file.

  • "Completed map tasks are re-executed on a failure be-cause their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system."

  • How is work distributed across Reduce tasks? With Map it's clear.
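
    • The paper's answer: a partitioning function over the intermediate keys, hash(key) mod R by default, so every Map worker routes a given key to the same reduce task. A sketch (the function name is mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// reduceTaskFor picks which of the nReduce reduce tasks owns an
// intermediate key. The paper's default partitioning function is
// hash(key) mod R; masking to a non-negative int mirrors the
// 6.824 labs' ihash helper.
func reduceTaskFor(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()&0x7fffffff) % nReduce
}

func main() {
	fmt.Println(reduceTaskFor("apple", 10))
}
```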

  • Is there a "global" master, or per-job masters? I guess the latter.

  • I like the discussion of the failure semantics.

  • "We conserve network band-width by taking advantage of the fact that the input data(managed by GFS [8]) is stored on the local disks of the machines that make up our cluster."

    • It's interesting just how everything runs on the same machines... Needs some thinking.

    • It allows tricks like input locality, then (rough sketch below).
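
    • A rough sketch of the locality-aware scheduling this enables: the master prefers handing a worker a map task whose input replica is already on that worker's disk. The data structures are mine, not the paper's:

```go
package main

import "fmt"

// mapTask records which workers hold a GFS replica of its input split.
type mapTask struct {
	id       int
	replicas map[string]bool
}

// pickTaskFor prefers a task whose input is already on the requesting
// worker's disk, falling back to any idle task. This is roughly the
// locality heuristic the paper describes for the master's scheduler
// (the paper also prefers "near" replicas, e.g. same network switch).
func pickTaskFor(worker string, idle []mapTask) (mapTask, bool) {
	for _, t := range idle {
		if t.replicas[worker] {
			return t, true // local read: no network transfer for the input
		}
	}
	if len(idle) > 0 {
		return idle[0], true // remote read as a fallback
	}
	return mapTask{}, false
}

func main() {
	idle := []mapTask{
		{id: 1, replicas: map[string]bool{"workerA": true}},
		{id: 2, replicas: map[string]bool{"workerB": true}},
	}
	t, _ := pickTaskFor("workerB", idle)
	fmt.Println("workerB gets task", t.id)
}
```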

  • "One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a ma-chine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation."

    • Oh I've seen that at Microsoft.

  • "We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks."

    • Clever.
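
    • A toy sketch of the backup-execution idea: run a second copy of an in-progress task and take whichever finishes first (the simulated task and timings are made up):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runTask simulates a task that may or may not be a straggler.
func runTask(id int) int {
	time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	return id
}

// withBackup launches the primary and a backup copy concurrently
// and returns whichever finishes first. Tasks are deterministic,
// so either result is acceptable; the loser is simply discarded.
func withBackup() int {
	done := make(chan int, 2) // buffered so the slower copy never blocks
	go func() { done <- runTask(1) }()
	go func() { done <- runTask(2) }()
	return <-done
}

func main() {
	fmt.Println("first to finish:", withBackup())
}
```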

  • Combiner - does it really need framework support? The user can just use the reducer code at the end of their Map.
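
    • What that user-side combining could look like: aggregate counts inside Map and emit one pair per distinct word instead of one per occurrence. (Reduce then has to sum the values rather than count them.)

```go
package main

import (
	"fmt"
	"strings"
)

type KeyValue struct {
	Key   string
	Value string
}

// combiningMap does the combiner's job inside Map itself: it
// pre-aggregates counts locally, shrinking the intermediate data
// before it is ever written to disk or shipped over the network.
func combiningMap(contents string) []KeyValue {
	counts := map[string]int{}
	for _, w := range strings.Fields(contents) {
		counts[w]++
	}
	out := make([]KeyValue, 0, len(counts))
	for w, n := range counts {
		out = append(out, KeyValue{Key: w, Value: fmt.Sprint(n)})
	}
	return out
}

func main() {
	fmt.Println(combiningMap("the quick the lazy the")) // e.g. [{the 3} {quick 1} {lazy 1}]
}
```

    • (One plausible argument for framework support: the framework can run the combiner over a whole map task's buffered output, across many calls to Map, not just within one.)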

  • So, each Reduce reads all Map outputs and just filters by the keys that belong to it? Isn't that too many reads? And compared to that, does read locality for inputs even make much difference? (The paper's actual design is sketched after this list.)

    • I suppose if the intermediate output is sorted on the Map worker, it's not that bad. And then the filtering is done on the Map side too.

    • Also, the whole pipeline is network-constrained: if the sorting is done on the Map side, the network transfer doesn't explode.
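
  • For reference, the paper's design: each Map worker partitions its output into R regions using the partitioning function, and each Reduce worker fetches only its own region from every Map worker - no scanning and filtering. (In the paper the sort happens on the Reduce worker after fetching; the sketch below sorts map-side, as speculated above.)

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type KeyValue struct {
	Key   string
	Value string
}

// partitionFor is the default hash(key) mod R partitioning function.
func partitionFor(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()&0x7fffffff) % nReduce
}

// partitionAndSort is what a Map worker would do with its output:
// split the pairs into R buckets, one per reduce task, and sort each
// bucket by key. A Reduce worker then reads only its own bucket from
// every Map worker.
func partitionAndSort(kvs []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kvs {
		r := partitionFor(kv.Key, nReduce)
		buckets[r] = append(buckets[r], kv)
	}
	for _, b := range buckets {
		sort.Slice(b, func(i, j int) bool { return b[i].Key < b[j].Key })
	}
	return buckets
}

func main() {
	kvs := []KeyValue{{"b", "1"}, {"a", "1"}, {"c", "1"}, {"a", "1"}}
	for r, bucket := range partitionAndSort(kvs, 2) {
		fmt.Println("reduce task", r, "gets", bucket)
	}
}
```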

  • "This includes about a minute of startup over-head. The overhead is due to the propagation of the pro-gram to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization."

  • The whole crawled web was 20TB around that time! It's just nothing!
