Difference between revisions of "MapReduce"
Revision as of 19:40, 2 August 2016
#ENCODING
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
#MONGO
from pymongo import MongoClient
client = MongoClient()
client.progzoo.authenticate('scott','tiger')
db = client['progzoo']
#PRETTY
import pprint
pp = pprint.PrettyPrinter(indent=4)
Introducing the MapReduce function
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.
The map is always performed before the reduce.
The map function examines every document in the collection and emits (key,value) pairs.
The map function takes no arguments; the current document is available as this.
The reduce function takes two arguments: a key and a list of values. For every distinct key emitted by map, reduce is called with the list of corresponding values.
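The flow described above can be sketched in plain Python, without a MongoDB server. This is only an in-memory illustration of the semantics, not how MongoDB implements it; the sample documents and the helper name map_reduce_sketch are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample documents standing in for the world collection.
WORLD = [
    {"name": "France", "continent": "Europe"},
    {"name": "Spain", "continent": "Europe"},
    {"name": "Kenya", "continent": "Africa"},
]

def map_reduce_sketch(docs, map_fn, reduce_fn):
    """Run map over every document, group emitted values by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        # map_fn yields (key, value) pairs, mirroring emit(key, value).
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # reduce_fn receives one key and the list of values emitted for it.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

def map_continent(doc):
    # Emit a 1 for each country, keyed by its continent.
    yield doc["continent"], 1

def reduce_sum(key, values):
    # Add up the emitted 1s to count the countries.
    return sum(values)

counts = map_reduce_sketch(WORLD, map_continent, reduce_sum)
print(counts)
```

Summing the emitted 1s, rather than taking the length of the list, keeps the count correct even if reduce is applied again to partial results, which MongoDB is allowed to do.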
How many countries in each continent
This example returns the number of countries in each continent.
db.world.mapReduce(
  function(){ emit(this.continent, 1); },
  function(k, v){ return Array.sum(v); },
  {out: {inline: 1}}
)
query can be used to filter the input documents that are passed to map.
Find the total GDP for each continent, but only include countries whose names start with the letter A or B.
from bson.code import Code
temp = db.world.map_reduce(
    query={"name": {"$regex":"^(A|B)"}},
    map=Code("function(){emit(this.continent, this.gdp)}"),
    reduce=Code("""function(key, values){
        return Array.sum(values)
    }"""),
    out={"inline":1},
)
pp.pprint(temp["results"])
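The effect of query can be illustrated in plain Python: the filter runs first, and only the surviving documents reach the map step. This is a sketch of the semantics only; the sample documents and their GDP figures are made up for the example.

```python
import re

# Hypothetical sample documents; the GDP values are invented.
WORLD = [
    {"name": "Albania", "continent": "Europe", "gdp": 10},
    {"name": "Belgium", "continent": "Europe", "gdp": 30},
    {"name": "Chad", "continent": "Africa", "gdp": 5},
    {"name": "Angola", "continent": "Africa", "gdp": 7},
]

# Equivalent of query={"name": {"$regex": "^(A|B)"}}: keep matching names only.
pattern = re.compile(r"^(A|B)")
selected = [doc for doc in WORLD if pattern.search(doc["name"])]

# Map emits (continent, gdp); reduce sums the values per continent.
totals = {}
for doc in selected:
    totals[doc["continent"]] = totals.get(doc["continent"], 0) + doc["gdp"]

print(totals)
```

Chad is dropped by the filter, so it never contributes to the total for Africa.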
scope takes a document ({}) and lets you define global variables that are visible to the map, reduce and finalize functions. Its syntax is: scope={}.
Using scope, list all the countries with a higher population than Mexico.
mexico_data = db.world.find_one({"name":"Mexico"})
pp.pprint(mexico_data)
from bson.code import Code
temp = db.world.map_reduce(
    scope={"MEXICO":mexico_data},
    map=Code("""function(){
        if (this.population > MEXICO.population) emit(this.name, this.population)
    }"""),
    reduce=Code("function(key, values){return values}"),
    out={"inline":1},
)
pp.pprint(temp["results"])
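One way to picture scope is as a set of named values handed to the map function alongside each document. The sketch below imitates that in plain Python; the helper run_map, the sample documents and the population figures are all assumptions for illustration, not part of the MongoDB API.

```python
# Hypothetical sample documents; populations are invented (in millions).
WORLD = [
    {"name": "Mexico", "population": 120},
    {"name": "Brazil", "population": 200},
    {"name": "Chile", "population": 18},
]

def run_map(docs, map_fn, scope):
    """Call map_fn on each document, exposing the scope entries as named arguments."""
    results = {}
    for doc in docs:
        for key, value in map_fn(doc, **scope):
            results[key] = value
    return results

# Look up Mexico once, as find_one does in the example above.
mexico = next(doc for doc in WORLD if doc["name"] == "Mexico")

def map_bigger(doc, MEXICO):
    # Emit only the countries whose population exceeds Mexico's.
    if doc["population"] > MEXICO["population"]:
        yield doc["name"], doc["population"]

bigger = run_map(WORLD, map_bigger, {"MEXICO": mexico})
print(bigger)
```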
sort and limit
sort orders the input documents before they are passed to map.
limit caps the number of input documents that are passed to map.
Get the five countries with the highest GDPs.
from bson.code import Code
temp = db.world.map_reduce(
    query={"gdp":{"$ne":None}},
    sort={"gdp":-1},
    limit=5,
    map=Code("function(){emit(this.name, this.gdp)}"),
    reduce=Code("function(key, values){return values}"),
    out={"inline":1},
)
pp.pprint(temp["results"])
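The order in which query, sort and limit act on the input can be sketched in plain Python. This is an in-memory illustration under made-up data (the country names and GDP values are hypothetical), using a limit of 2 instead of 5 to keep the sample short.

```python
# Hypothetical sample documents; one has a null GDP and one has no GDP field.
WORLD = [
    {"name": "A-land", "gdp": 50},
    {"name": "B-land", "gdp": None},
    {"name": "C-land", "gdp": 300},
    {"name": "D-land", "gdp": 120},
    {"name": "E-land"},
]

# query={"gdp": {"$ne": None}}: drop documents whose gdp is null or missing.
with_gdp = [doc for doc in WORLD if doc.get("gdp") is not None]

# sort={"gdp": -1}: order the remaining documents by gdp, highest first.
ranked = sorted(with_gdp, key=lambda doc: doc["gdp"], reverse=True)

# limit=2: only the first two documents are passed on to map.
top = ranked[:2]

# map emits (name, gdp) for each document that survived.
emitted = [(doc["name"], doc["gdp"]) for doc in top]
print(emitted)
```

Because every emitted key is a distinct country name, the reduce step has nothing to combine in this case.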
finalize is an optional additional step that lets you modify the data produced by reduce.
Show the top 15 countries by population, then show their population as a percentage of Mexico's population.
mexico_data = db.world.find_one({"name":"Mexico"})
from bson.code import Code
temp = db.world.map_reduce(
    scope={"MEXICO":mexico_data},
    query={"population":{"$ne":None}},
    sort={"population":-1},
    limit=15,
    map=Code("function(){emit(this.name, this.population)}"),
    reduce=Code("function(key, values){return values}"),
    out={"inline":1},
    finalize=Code("""function(key, values){
        return 100*(values/MEXICO.population)+"%"
    }"""),
)
pp.pprint(temp["results"])
Show the top 15 countries by population, then show their population as a whole number percentage of Mexico's population.
mexico_data = db.world.find_one({"name":"Mexico"})
from bson.code import Code
temp = db.world.map_reduce(
    scope={"MEXICO":mexico_data},
    query={"population":{"$ne":None}},
    sort={"population":-1},
    limit=15,
    map=Code("function(){emit(this.name, this.population)}"),
    reduce=Code("function(key, values){return values}"),
    out={"inline":1},
    finalize=Code("""function(key, values){
        return Math.round(100*(values/MEXICO.population))+"%"
    }"""),
)
pp.pprint(temp["results"])
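The finalize step can be pictured as one more function applied to each (key, reduced value) pair. The plain-Python sketch below mirrors the whole-number percentage example; the reduced values and Mexico's population are invented for illustration. Note that Python's round and JavaScript's Math.round treat exact .5 cases differently, so the two may disagree on such inputs.

```python
# Hypothetical reduced output: country name -> population (in millions).
reduced = {"China": 1350, "Brazil": 200, "Mexico": 120}

# Stand-in for the MEXICO variable supplied through scope.
MEXICO = {"name": "Mexico", "population": 120}

def finalize(key, value):
    """Turn a reduced population into a whole-number percentage of Mexico's."""
    return f"{round(100 * value / MEXICO['population'])}%"

# Apply finalize once per key, after reduce has produced its value.
final = {key: finalize(key, value) for key, value in reduced.items()}
print(final)
```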