Cookies help us deliver our services. By using our services, you agree to our use of cookies. More information

Difference between revisions of "MapReduce"

From NoSQLZoo
Jump to: navigation, search
Line 16: Line 16:
 
<div style="min-height:25em">
 
<div style="min-height:25em">
 
==Introducing the MapReduce function==
 
==Introducing the MapReduce function==
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce. As the name would suggest, the map is always performed before the reduce.<br/><br/>
+
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.
The map function takes data and breaks it down into tuples (key/value pairs) for each element in the dataset<br/>
+
 
The reduce function then takes the result of the map function and simply reduces it in to a smaller set of tuples by merging all values with the same key.<br/><br/>
+
The map is always performed before the reduce.
Map is used to deal with [https://en.wikipedia.org/wiki/Embarrassingly_parallel "embarassingly parallel problems"] where a task can be broken down into subtasks that can then be ran simultaneously without affecting each other. Instead of just processing elements one by one, all elements can all be dealt with at the same time in parallel. This allows for massively reduced processing times as well as large scalability across multiple servers, making it an attractive solution to handling [https://en.wikipedia.org/wiki/Big_data Big Data].<br/><br/>
+
 
We will be using Javascript here through the use of <code>Code</code>. Also note that the Mongo shell uses the camelCase <code>mapReduce()</code> whereas PyMongo uses the Python naming scheme <code>map_reduce()</code><br/><br/>
+
The map function examines every document in the collection and emits '''(key,value)''' pairs.
For this example MapReduce takes the form:<br/>
+
 
<pre>db.<collection>.map_reduce(
+
The map function takes no input however the current document can be accessed as '''this'''
    map=<function>,
+
 
    reduce=<function>,
+
The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.
    out=<collection>
+
 
)
+
==How many countries in each continent==
</pre>
+
<div class=q data-lang="mongo">
</div>
+
This example returns the number of countries in each continent.
<div class=q data-lang="py3">This example returns the population of all the continents.<br/>
 
In the <code>map</code> stage, <code>emit(k,v)</code> selects the fields to to turn into tuples, where <b>k</b> is the key and <b>v</b> is the value. The key will be the continents and the values will be the populations.<br/>
 
The <code>reduce</code> will sum all the values associated with each key.
 
Finally the <code>out</code> specifies that the output is to be <code>inline</code> rather than a collection, allowing it to printed it to screen.  
 
 
<pre class=def>
 
<pre class=def>
from bson.code import Code
+
db.world.mapReduce(
pp.pprint(
+
  function(){emit(this.continent, 1);},  
    db.world.map_reduce(
+
  function(k, v){ return v.length; }
        map=Code("function(){emit(this.continent, this.population)}"),  
+
  out={"inline":1}
        reduce=Code("""function(key, values){
 
                          return Array.sum(values)
 
                      }
 
                    """),
 
        out={"inline":1},
 
    )
 
 
)
 
)
 
</pre>
 
</pre>

Revision as of 19:40, 2 August 2016

#ENCODING
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
#MONGO
from pymongo import MongoClient
client = MongoClient()
client.progzoo.authenticate('scott','tiger')
db = client['progzoo']
#PRETTY
import pprint
pp = pprint.PrettyPrinter(indent=4)

Introducing the MapReduce function

The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.

The map is always performed before the reduce.

The map function examines every document in the collection and emits (key,value) pairs.

The map function takes no input however the current document can be accessed as this

The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.

How many countries in each continent

This example returns the number of countries in each continent.

db.world.mapReduce(
  function(){emit(this.continent, 1);}, 
  function(k, v){ return v.length; }
  out={"inline":1}
)
query can be used to filter the input documents to map.

Find the GDP for each continent, but only include data from countries that start with the letter A or B.

from bson.code import Code
temp = db.world.map_reduce(
        query={"name": {"$regex":"^(A|B)"}},
        map=Code("function(){emit(this.continent, this.gdp)}"), 
        reduce=Code("""function(key, values){
                           return Array.sum(values)
                      }
                   """),
        out={"inline":1},
)

pp.pprint(
   temp["results"]
)

from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){return Array.sum(values)}"),out={"inline":1});import operator;pp.pprint(temp["results"])


scope takes in a document:{} and lets you create global variables.

It's syntax is: scope={}.

Using scope, list all the countries with a higher population than Mexico.

mexico_data = db.world.find_one({"name":"Mexico"})
pp.pprint(mexico_data)

from bson.code import Code
temp = db.world.map_reduce(
        scope = {"MEXICO":mexico_data},
        map = Code("""function(){
                         if (this.population > MEXICO.population) emit(this.name, this.population)
                      }
                   """), 
        reduce=Code("function(key, values){return values}"),
        out={"inline":1},
)
pp.pprint(
   temp["results"]
)

mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data); from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1});pp.pprint(temp['results'])

sort and limit

Sort allows us to sort the input documents that are passed to map
Limit is self explanatory and also applies to the input documents that are passed to map

Get the five countries with the highest GDPs

from bson.code import Code
temp = db.world.map_reduce(
        query={"gdp":{"$ne":None}},
        sort={"gdp":-1},
        limit=5,
        map=Code("function(){emit(this.name, this.gdp)}"), 
        reduce=Code("function(key, values){return values}"),
        out={"inline":1},
)

pp.pprint(
   temp["results"]
)

from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(temp["results"])

finalize is an optional additional step that allows you to modify the data produce by reduce

Show the top 15 countries by population, then show their population as a percentage of Mexico's population.

mexico_data = db.world.find_one({"name":"Mexico"})

from bson.code import Code
temp = db.world.map_reduce(
        scope = {"MEXICO":mexico_data},
        query={"population":{"$ne":None}},
        sort={"population":-1},
        limit=15,
        map=Code("function(){emit(this.name, this.population)}"), 
        reduce=Code("function(key, values){return values}"),
        out={"inline":1},
        finalize=Code("""function(key, values){
                             return 100*(values/MEXICO.population)+"%"
                         }
                      """)
)

pp.pprint(
   temp["results"]
)

mexico_data = db.world.find_one({"name":"Mexico"});from bson.code import Code; temp = db.world.map_reduce( scope = {"MEXICO":mexico_data}, query={"population":{"$ne":None}}, sort={"population":-1}, limit=15, map=Code("function(){emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, finalize=Code("""function(key, values){return 100*(values/MEXICO.population)+"%"} """) );pp.pprint(temp["results"] );

Rounding can also be performed by using JavaScript.

Show the top 15 countries by population, then show their population as a whole number percentage of Mexico's population.

mexico_data = db.world.find_one({"name":"Mexico"})

from bson.code import Code
temp = db.world.map_reduce(
        scope = {"MEXICO":mexico_data},
        query={"population":{"$ne":None}},
        sort={"population":-1},
        limit=15,
        map=Code("function(){emit(this.name, this.population)}"), 
        reduce=Code("function(key, values){return values}"),
        out={"inline":1},
        finalize=Code("""function(key, values){
                             return Math.round(100*(values/MEXICO.population))+"%"
                         }
                      """)
)

pp.pprint(
   temp["results"]
)

mexico_data = db.world.find_one({"name":"Mexico"});from bson.code import Code;temp=db.world.map_reduce(scope ={"MEXICO":mexico_data},query={"population":{"$ne":None}},sort={"population":-1},limit=15,map=Code("function(){emit(this.name,this.population)}"),reduce=Code("function(key, values){return values}"), out={"inline":1},finalize=Code("function(key,values){return Math.round(100*(values/MEXICO.population))+'%'}"));pp.pprint(temp["results"])