Difference between revisions of "MapReduce"

Revision as of 12:02, 22 July 2015

#ENCODING
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
#MONGO
from pymongo import MongoClient
client = MongoClient()
client.progzoo.authenticate('scott','tiger')
db = client['progzoo']
#PRETTY
import pprint
pp = pprint.PrettyPrinter(indent=4)

Page Under Construction

Introducing the MapReduce function

The MapReduce function is an aggregate function that consists of two functions: Map and Reduce. As the name would suggest, the map is always performed before the reduce.

The map function takes data and breaks it down into tuples (key/value pairs) for each element in the dataset
The reduce function then takes the result of the map function and simply reduces it in to a smaller set of tuples by merging all values with the same key.

Map is used to deal with "embarassingly parallel problems" where a task can be broken down into subtasks that can then be ran simultaneously without affecting each other. Instead of just processing elements one by one, all elements can all be dealt with at the same time in parallel. This allows for massively reduced processing times as well as large scalability across multiple servers, making it an attractive solution to handling Big Data.

We will be using Javascript here through the use of Code. Also note that the Mongo shell uses the camelCase 'mapReduce()' whereas PyMongo uses the Python naming scheme: map_reduce()

For this example MapReduce takes the form:

db.<collection>.map_reduce(
    map=<function>,
    reduce=<function>,
    out=<collection>
)

This example returns the population of all the continents.

In the map emit(k,v) selects the fields to to turn into tuples, where k is the key and v is the value. The key will be the continents and the values will be the populations
The reduce will sum all the values associated with each key. Finally the "out" part of the MapReduce specifies that the output is to be inline rather than a collection, allowing it to printed it to screen.

from bson.code import Code
pp.pprint(
    db.world.map_reduce(
        map=Code("function(){emit(this.continent, this.population)}"), 
        reduce=Code("""function(key, values){
                           var total = 0;
                           for (var i = 0; i < values.length; i++){
                               total += values[i];
                           }
                           return total;
                       }
                    """),
        out={"inline":1},
    )
)

query can be used in a mapreduce to filter the input documents to map.

Find the GDP for each continent, but only include data from countries that start with the letter A or B.

from bson.code import Code
temp = db.world.map_reduce(
        query={"name": {"$regex":"^(A|B)"}},
        map=Code("function(){emit(this.continent, this.gdp)}"), 
        reduce=Code("""function(key, values){
                          var total = 0;
                          for (var i = 0; i < values.length; i++){
                              total += values[i];
                          }
                          return total;
                      }
                   """),
        out={"inline":1},
)

pp.pprint(
   temp["results"]
)

from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total;}"),out={"inline":1});import operator;pp.pprint(temp["results"])

scope takes in a document:{} and lets you create global variables.

It's syntax is: scope={}.

Using scope, list all the countries with a higher population than Mexico.

mexico_data = db.world.find_one({"name":"Mexico"})
pp.pprint(mexico_data)

from bson.code import Code
temp = db.world.map_reduce(
        scope = {"MEXICO":mexico_data},
        map = Code("""function(){
                         if (this.population > MEXICO.population) emit(this.name, this.population)
                      }
                   """), 
        reduce=Code("function(key, values){return values}"),
        out={"inline":1},
)
pp.pprint(
   temp["results"]
)

mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data); from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1});pp.pprint(temp['results'])

sort and limit

Sort allows us to sort the input documents that are passed to map, limit is self explanatory and also applies to the input documents and not the output documents, like db.<collection>.find().limit(n) does.

Get the five countries with the highest GDPs

from bson.code import Code temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, ) pp.pprint( temp["results"] )

from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(temp["results"])

@@ Line 20: / Line 20: @@
 Map is used to deal with [https://en.wikipedia.org/wiki/Embarrassingly_parallel "embarassingly parallel problems"] where a task can be broken down into subtasks that can then be ran simultaneously without affecting each other. Instead of just processing elements one by one, all elements can all be dealt with at the same time in parallel. This allows for massively reduced processing times as well as large scalability across multiple servers, making it an attractive solution to handling [https://en.wikipedia.org/wiki/Big_data Big Data].<br/><br/>
 We will be using Javascript here through the use of <code>Code</code>. Also note that the Mongo shell uses the camelCase 'mapReduce()' whereas PyMongo uses the Python naming scheme: map_reduce()<br/><br/>
-For this example our MapReduce takes the form:<br/>
+For this example MapReduce takes the form:<br/>
 <pre>db.<collection>.map_reduce(
      map=<function>,
@@ Line 28: / Line 28: @@
 </pre>
-<div class=q data-lang="py3">In this example we will be returning the population of all the continents.<br/>
+<div class=q data-lang="py3">This example returns the population of all the continents.<br/>
-<code>emit(k,v)</code> lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the population<br/>
+In the map <code>emit(k,v)</code> selects the fields to to turn into tuples, where k is the key and v is the value. The key will be the continents and the values will be the populations<br/>
-In our reduce we sum all the values associated with each key.
+The reduce will sum all the values associated with each key.
-Finally we specify that we want the "out" part of the MapReduce to be inline rather than a collection, allowing us to print it to screen.
+Finally the "out" part of the MapReduce specifies that the output is to be inline rather than a collection, allowing it to printed it to screen.
 <pre class=def>
 from bson.code import Code
@@ Line 51: / Line 51: @@
 </div>
-<div class=q data-lang="py3">This is all very good, but what if we only want the results of a MapReduce? Whilst we would normally write our results to another collection, for these examples we will store our results as a dict, then access the results through Python rather than PyMongo.<br/><br/>
+<div class=q data-lang="py3"><code>query</code> can be used in a mapreduce to filter the <b>input</b> documents to map.<br/>
-Alternatively, if you want more information about the timing you can use <code>verbose=1</code>
-<p class="strong">Find the average population of each country by continent</p>
-<pre class=def>
-from bson.code import Code
-temp = db.world.map_reduce(
-        map=Code("function(){emit(this.continent, this.population)}"),
-        reduce=Code("""function(key, values){
-                           var total = 0;
-                           for (var i = 0; i < values.length; i++){
-                               total += values[i];
-                           }
-                           return total/values.length;
-                       }
-                    """),
-        out={"inline":1},
-)
-pp.pprint(list(
-   temp["results"]
-))
-</pre>
-<div class="ans">
-from bson.code import Code; temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){ var total = 0; for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});pp.pprint(list(temp["results"]))
-</div>
-</div>
-<div class=q data-lang="py3">This also allows us to do things like sorting, again for this we use Python, and not PyMongo.<br/>
-<p class="strong">Find the average population of each country by continent then sort by value in descending order.</p>
-<pre class=def>
-from bson.code import Code
-temp = db.world.map_reduce(
-        map=Code("function(){emit(this.continent, this.population)}"),
-        reduce=Code("""function(key, values){
-                           var total = 0;
-                           for (var i = 0; i < values.length; i++){
-                               total += values[i];
-                           }
-                           return total/values.length;
-                       }
-                    """),
-        out={"inline":1},
-)
-import operator
-pp.pprint(list(
-   sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)
-))
-</pre>
-<div class="ans">
-from bson.code import Code
-temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});import operator;pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)))
-</div>
-</div>
-<div class=q data-lang="py3">Until now we've been working on all the data, let's apply a mapreduce to a more specific set of elements by using <code>query</code><br/>
 <p class="strong">Find the GDP for each continent, but only include data from countries that start with the letter A or B.</p>
 <pre class=def>
@@ Line 124: / Line 69: @@
 )
-pp.pprint(list(
+pp.pprint(
     temp["results"]
-))
+)
 </pre>
 <div class="ans">
-from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total;}"),out={"inline":1});import operator;pp.pprint(list(temp["results"]))
+from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total;}"),out={"inline":1});import operator;pp.pprint(temp["results"])
 </div>
 </div>
-<div class=q data-lang="py3">Another MapReduce feature is <code>scope</code>, which takes in a <b>document</b>:<code>{}</code> and lets you create global variables.<br/>
+<div class=q data-lang="py3"><code>scope</code> takes in a <b>document</b>:<code>{}</code> and lets you create global variables.<br/>
 It's syntax is: <code>scope={}</code>.<br/>
 <p class="strong">Using <code>scope</code>, list all the countries with a higher population than Mexico.</p>
@@ Line 151: / Line 96: @@
          out={"inline":1},
 )
+pp.pprint(
-import operator
+    temp["results"]
-pp.pprint(list(
+)
-    sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)
-))
 </pre>
 <div class="ans">
-mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data);  from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, ); import operator; pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)))
+mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data);  from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1});pp.pprint(temp['results'])
 </div>
 </div>
 <div class=q data-lang="py3"><code>sort</code> and <code>limit</code><br/>
-Sort allows us to sort the <b>input</b> documents that are passed to <b>map</b>, limit is self explanatory and also applies to the input documents.
+Sort allows us to sort the <b>input</b> documents that are passed to <b>map</b>, limit is self explanatory and also applies to the input documents and <b>not the output documents<b>, like <code>db.<collection>.find().limit(n)</code> does.
 <p class="strong">Get the five countries with the highest GDPs</p>
 <pre class=def>
@@ Line 176: / Line 119: @@
 )
-pp.pprint(list(
+pp.pprint(
     temp["results"]
-))
+)
 </pre>
 <div class="ans">
-from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(list(temp["results"]))
+from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(temp["results"])
 </div>
 </div>