2010-05-15

My understanding of MapReduce

I've recently read up on MapReduce and this is my understanding

MapReduce is a framework for distributing parallel computation over large dataset on a computer cluster.
It takes care of the low-level tasks like splitting & scheduling jobs, disk I/O, bandwidth management, error detection and recovery.
It is suitable for simple computation on large dataset, where the computation on one part of the data does not affect the computation on another part (linearity), thus trivially parallelizable.
One way to understand it is that "Map" abstracts the transformation loops, while "Reduce" abstracts the aggregation loops.

There is a loose analogy with SQL, though the dataset here is not normalized (no JOIN). Also, the result set is not necessarily the subset of the (aggregation of) input (as in SQL). And "Map" is actually more generic than the SQL analogy counterpart. Some non-relational databases expose MapReduce interface for querying data.

 A visual explanation:


Steps:
  • map (split) on the dataset space
  • (map) filter/collect on the new key space
  • sort the new key space
  • reduce the value collection of each key

SQL analogy:
  • Map:
FROM the dataset
SELECT some traits WHERE ...
ORDER BY some traits
  • Reduce:
GROUP BY some traits
HAVING ...
(and aggregation functions in SELECT)

Example:
Map:
        (doc_id, doc_content) => pairs of (word, position)
Reduce:
        (word, list of positions) => (word, count)
SELECT word, count(word_position)
FROM doc
GROUP BY word

MapReduce is not a silver bullet. I is suitable for simple linear calculation over large dataset, and unsuitable for:
  • Recursive/non-linear problems which cannot be trivially parallelized like matrix inversion, fluid dynamics equation solving (non-linear equation)...
  • The dataset is small, in which case MapReduce overhead is too high.

References:
http://en.wikipedia.org/wiki/MapReduce
http://labs.google.com/papers/mapreduce-osdi04.pdf
http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx
http://www.mongodb.org/display/DOCS/MapReduce

No comments:

Post a Comment