2.8. Combining Map and Reduce

2.8.1. The MapReduce Paradigm

In 2004, Jeffrey Dean and Sanjay Ghemawat of Google published a paper describing MapReduce, a paradigm for distributed computation. The paper illustrated the influence of functional programming on how Google organized computational work that could be parallelized across distributed clusters of computers.

The essence of Dean and Ghemawat’s idea was to define a mapping function that would perform a specified task in parallel on multiple data sets distributed across many computers. The result of each mapping computation was then passed to a reducing function that accumulated these results into the “answer” being sought.
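The shape of this idea can be sketched in a few lines of plain JavaScript. This is only a single-machine illustration, not Google's implementation: the helper name mapReduce, the sample partitions, and the two small functions are all hypothetical, and the calls that would run in parallel on separate computers simply run one after another here.

```javascript
// Hypothetical helper: apply `mapper` to every partition, then fold the
// per-partition results into one answer with `reducer`, starting at `initial`.
function mapReduce(mapper, reducer, initial, partitions) {
  // On a real cluster, each of these calls would run on its own machine.
  const mapped = partitions.map((partition) => mapper(partition));
  // The reducing function accumulates the partial results into the answer.
  return mapped.reduce((acc, partial) => reducer(acc, partial), initial);
}

// Illustrative data: three partitions, as if stored on three computers.
const partitions = [[1, 2, 3], [4, 5], [6]];

const sumOfPartition = (nums) => nums.reduce((a, b) => a + b, 0);
const addPartials = (acc, partial) => acc + partial;

const grandTotal = mapReduce(sumOfPartition, addPartials, 0, partitions);
console.log(grandTotal); // 21
```

The mapper only ever sees one partition, which is what makes it safe to run the mapping calls independently of one another.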

To illustrate, suppose we had a distributed database, called db2, of salesperson records with the sales records of “Smith” on one computer, the sales records of “Jones” on a second computer, and the sales records of “Green” on a third computer.

var db2 = [ ["Jones", 9, 2, 8, 6, 4], ["Smith", 4, 1, 8, 32, 45],
            ["Green", 4, 4, 6, 1, 12, 8] ];

Given this database, we want a computation (the mapping function) done on each computer that returns the name of the salesperson along with the sum of all the sales records for that person. The results of those three computations are then returned to a reducing function that picks out the salesperson who sold the most.

> bestSalesPerson(db2)
[ 'Smith', 90 ]
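Before looking at the course's version, it may help to see the two steps separately. The sketch below uses JavaScript's built-in array methods rather than the fp helpers; the names totals and best are introduced here purely for illustration.

```javascript
const db2 = [
  ["Jones", 9, 2, 8, 6, 4],
  ["Smith", 4, 1, 8, 32, 45],
  ["Green", 4, 4, 6, 1, 12, 8],
];

// Map step: turn each record into a [name, totalSales] pair.
const totals = db2.map(([name, ...sales]) => [
  name,
  sales.reduce((sum, s) => sum + s, 0),
]);
console.log(totals); // value: [["Jones", 29], ["Smith", 90], ["Green", 35]]

// Reduce step: keep whichever pair has the larger total, starting from a
// placeholder pair that any real salesperson will beat.
const best = totals.reduce(
  (p1, p2) => (p1[1] > p2[1] ? p1 : p2),
  ["dummy", -1]
);
console.log(best); // value: ["Smith", 90]
```

The intermediate list totals is exactly what the reducing function receives: one small pair per computer instead of the full sales records.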

The following bestSalesPerson function performs this computation by defining two functions (the mapper and the reducer) and then passing them to fp.map and fp.reduce. Read through the following slide show for more details and then attempt the review problem that follows.


Our database is a list of records, where each record r is a list whose head is the name of a salesperson and whose tail is the list of their sales. The sample database below has three such records. Ultimately, the answer is obtained by applying reduce to a second list of records, which is produced by applying the map operation to each record in the database. The function given to the map operation is called mapper. How should mapper determine the [name, totalSales] pair it must return?

Lists in DB:

Jones: 9, 2, 8, 6, 4
Smith: 4, 1, 8, 32, 45
Green: 4, 4, 6, 1, 12, 8

After map: [Jones, 29], [Smith, 90], [Green, 35]

After reduce: [Smith, 90]

var bestSalesPerson = function (db) {
    // returns the pair [name, totalSales] for a given record in db
    var mapper = function (r) {
        return [
            fp.hd(r),                        // the salesperson's name
            fp.reduce(fp.add, fp.tl(r), 0)   // the sum of their sales
        ];
    };
    // Given two input pairs of the form [name, totalSales], return
    // the one with the largest totalSales
    var reducer = function (p1, p2) {
        return (fp.isGT(fp.hd(fp.tl(p1)), fp.hd(fp.tl(p2)))) ? p1 : p2;
    };
    // returns [salesPerson, totalSales] with the largest totalSales
    // in the DB
    return fp.reduce(
        reducer,
        fp.map(mapper, db),
        ['dummy', -1]   // starting pair that any real total will beat
    );
};
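To experiment with bestSalesPerson outside the course environment, the fp helpers it calls can be approximated with a few one-liners. The object below is a hypothetical stand-in inferred only from how the function uses each helper. The course's actual fp library may behave differently, for instance in the order in which reduce hands its two arguments to the supplied function, which would matter only when two totals tie.

```javascript
// Hypothetical stand-ins for the course's fp helpers (assumptions, not the
// real library): each mirrors the way bestSalesPerson calls it.
const fp = {
  hd: (list) => list[0],          // head: the first element
  tl: (list) => list.slice(1),    // tail: everything after the first element
  add: (x, y) => x + y,
  isGT: (x, y) => x > y,
  map: (f, list) => list.map((x) => f(x)),
  // Assumed here: f receives (accumulator, element), folding left to right.
  reduce: (f, list, init) => list.reduce((acc, x) => f(acc, x), init),
};

// The two expressions the mapper relies on, tried on the Jones record:
const jones = ["Jones", 9, 2, 8, 6, 4];
console.log(fp.hd(jones));                        // Jones
console.log(fp.reduce(fp.add, fp.tl(jones), 0));  // 29
```

With these definitions in scope, a completed bestSalesPerson applied to db2 should produce the pair for Smith with a total of 90, matching the session shown earlier, provided the stand-ins behave like the real helpers.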


The following randomized problem is about the MapReduce model. You must solve it correctly three times in a row to earn credit for it.

