2010-05-15

My understanding of MapReduce

I've recently read up on MapReduce, and this is my understanding.

MapReduce is a framework for distributing parallel computation over large datasets across a computer cluster.
It takes care of low-level tasks such as splitting and scheduling jobs, disk I/O, bandwidth management, and error detection and recovery.
It is suitable for simple computations on large datasets, where the computation on one part of the data does not affect the computation on another part (independence, or loosely "linearity"), which makes the work trivially parallelizable.
One way to understand it is that "Map" abstracts the transformation loops, while "Reduce" abstracts the aggregation loops.
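
To make the "Map abstracts the transformation loop, Reduce abstracts the aggregation loop" idea concrete, here is a minimal single-process sketch in Python. It is not how any real MapReduce framework is implemented: there is no cluster, scheduling, or fault tolerance, and the names map_phase, shuffle, reduce_phase, word_mapper and count_reducer are all made up for illustration. The grouping step in the middle (often called "shuffle") is included on the assumption that values are grouped by key before being reduced.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """The transformation loop: call mapper on each record independently.

    Each call may yield any number of (key, value) pairs.
    """
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (assumed step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """The aggregation loop: collapse each key's values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical user-supplied functions for a word count.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    return sum(counts)


def word_count(lines):
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(lines)
print(counts)
```

Because word_mapper looks at one line at a time and count_reducer looks at one key at a time, either loop could in principle be split across machines without changing the result, which is the property the framework relies on.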

There is a loose analogy with SQL, though the dataset here is not normalized (no JOIN). Also, the result set is not necessarily a subset of (an aggregation of) the input, as it is in SQL. And "Map" is actually more general than its counterpart in the SQL analogy. Some non-relational databases expose a MapReduce interface for querying data.
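
The SQL analogy can be sketched in Python as well. The orders data, the run helper, and the field names below are invented for illustration, and the SQL in the comments is only a rough equivalent, not something a real database would derive from this code. The first query mirrors a plain GROUP BY; the second shows one way "Map" is more general, since a single denormalized record can emit several key/value pairs.

```python
from collections import defaultdict

# Hypothetical denormalized records: the item list is nested inside each order.
orders = [
    {"customer": "alice", "items": ["pen", "ink"], "total": 12.0},
    {"customer": "bob", "items": ["pen"], "total": 3.0},
    {"customer": "alice", "items": ["paper"], "total": 5.0},
]


def run(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # The mapper may emit zero, one, or many (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


# Roughly: SELECT customer, SUM(total) FROM orders GROUP BY customer
spend = run(orders, lambda o: [(o["customer"], o["total"])], sum)

# Harder to express as a plain GROUP BY over one flat table:
# each order emits one pair per nested item.
item_counts = run(orders, lambda o: [(item, 1) for item in o["items"]], sum)

print(spend)
print(item_counts)
```

Note that the keys of item_counts (item names) are not rows or columns of the input, which is what is meant by the result not having to be a subset of the input.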