Difference between revisions of "MapReduce"
Line 31: | Line 31: | ||
<code>emit(k,v)</code> lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the population<br/> | <code>emit(k,v)</code> lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the population<br/> | ||
In our reduce we sum all the values associated with each key. | In our reduce we sum all the values associated with each key. | ||
− | Finally we specify that we want the "out" part of the | + | Finally we specify that we want the "out" part of the MapReduce to be inline rather than a collection, allowing us to print it to screen. |
<pre class=def> | <pre class=def> | ||
from bson.code import Code | from bson.code import Code | ||
Line 50: | Line 50: | ||
</div> | </div> | ||
− | <div class=q data-lang="py3">This is all very good, but what if we only want the results of a MapReduce? Whilst we would normally write our results to another collection, for these examples we | + | <div class=q data-lang="py3">This is all very good, but what if we only want the results of a MapReduce? Whilst we would normally write our results to another collection, for these examples we will store our results as a dict, then access the results through Python rather than PyMongo.<br/><br/> |
Alternatively, if you want more information about the timing you can use <code>verbose=1</code> | Alternatively, if you want more information about the timing you can use <code>verbose=1</code> | ||
<p class="strong">Find the average population of each country by continent</p> | <p class="strong">Find the average population of each country by continent</p> |
Revision as of 12:07, 21 July 2015
#ENCODING import io import sys sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16') #MONGO from pymongo import MongoClient client = MongoClient() client.progzoo.authenticate('scott','tiger') db = client['progzoo'] #PRETTY import pprint pp = pprint.PrettyPrinter(indent=4)
Page Under Construction
Introducing the MapReduce function
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce. As the name would suggest, the map is always performed before the reduce.
The map function takes data and breaks it down into tuples (key/value pairs) for each element in the dataset
The reduce function then takes the result of the map function and simply reduces it in to a smaller set of tuples by merging all values with the same key.
Map is used to deal with "embarassingly parallel problems" where a task can be broken down into subtasks that can then be ran simultaneously without affecting each other. Instead of just processing elements one by one, all elements can all be dealt with at the same time in parallel. This allows for massively reduced processing times as well as large scalability across multiple servers, making it an attractive solution to handling Big Data.
This is a feature more suited for the shell or a Node.JS implementation, as here we will need to use JavaScript code inside Pymongo. Also note that the Mongo shell uses the camelCase 'mapReduce()', whereas PyMongo uses the Python naming scheme: map_reduce()
For this example our MapReduce takes the form:
db.<collection>.map_reduce( map=<function>, reduce=<function>, out=<collection> )
emit(k,v)
lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the population
In our reduce we sum all the values associated with each key.
Finally we specify that we want the "out" part of the MapReduce to be inline rather than a collection, allowing us to print it to screen.
from bson.code import Code pp.pprint( db.world.map_reduce( map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){" " var total = 0;" " for (var i = 0; i < values.length; i++){" " total += values[i];" " }" " return total;" "}"), out={"inline":1}, ) )
Alternatively, if you want more information about the timing you can use verbose=1
Find the average population of each country by continent
from bson.code import Code temp = db.world.map_reduce( map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){" " var total = 0;" " for (var i = 0; i < values.length; i++){" " total += values[i];" " }" " return total/values.length;" "}"), out={"inline":1}, ) pp.pprint(list( temp["results"] ))
from bson.code import Code; temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){ var total = 0; for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});pp.pprint(list(temp["results"]))
Find the average population of each country by continent then sort by value in descending order.
from bson.code import Code temp = db.world.map_reduce( map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){" " var total = 0;" " for (var i = 0; i < values.length; i++){" " total += values[i];" " }" " return total/values.length;" "}"), out={"inline":1}, ) import operator pp.pprint(list( sorted(temp["results"],key=operator.itemgetter('value'), reverse=True) ))
from bson.code import Code temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});import operator;pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)))
query
Find the GDP for each continent, but only include data from countries that start with the letter A or B.
from bson.code import Code temp = db.world.map_reduce( query={"name": {"$regex":"^(A|B)"}}, map=Code("function(){emit(this.continent, this.gdp)}"), reduce=Code("function(key, values){" " var total = 0;" " for (var i = 0; i < values.length; i++){" " total += values[i];" " }" " return total/values.length;" "}"), out={"inline":1}, ) import operator pp.pprint(list( temp["results"] ))
from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});import operator;pp.pprint(list(temp["results"]))
scope
, which takes in a document:{}
and lets you create global variables.It's syntax is: scope={}
.
Using scope
, list all the countries with a higher population than Mexico.
mexico_data = db.world.find_one({"name":"Mexico"}) pp.pprint(mexico_data) from bson.code import Code temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){" " if (this.population > MEXICO.population) emit(this.name, this.population)" "}"), reduce=Code("function(key, values){" " var total = 0;" " for (var i = 0; i < values.length; i++){" " total += values[i];" " }" " return total/values.length;" "}"), out={"inline":1}, ) import operator pp.pprint(list( sorted(temp["results"],key=operator.itemgetter('value'), reverse=True) ))
mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data);from bson.code import Code; temp = db.world.map_reduce( map=Code("function(){if(this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"), out={"inline":1}, scope={"MEXICO":mexico_data} ); import operator; pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True) ));