Difference between revisions of "MapReduce"

From NoSQLZoo
 
(26 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<pre class=setup>
 
#ENCODING
 
import io
 
import sys
 
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
 
#MONGO
 
from pymongo import MongoClient
 
client = MongoClient()
 
client.progzoo.authenticate('scott','tiger')
 
db = client['progzoo']
 
#PRETTY
 
import pprint
 
pp = pprint.PrettyPrinter(indent=4)
 
</pre>
 
 
{{TopTenTips}}
 
{{TopTenTips}}
 
<div style="min-height:25em">
 
<div style="min-height:25em">
Line 26: Line 12:
 
The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.
 
The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.
  
==How many countries in each continent==
+
==Population of each continent==
 
<div class=q data-lang="mongo">
 
<div class=q data-lang="mongo">
This example returns the number of countries in each continent.
+
Here the map function emits the continent and the population for each country.
<pre class=def>
+
 
 +
The reduce function uses the JavaScript function <code>Array.sum</code> to add the populations.
 +
<pre class="def"><nowiki>
 
db.world.mapReduce(
 
db.world.mapReduce(
   function(){emit(this.continent, 1);},  
+
   function () {emit(this.continent, this.population);},  
   function(k, v){ return v.length; }
+
   function (k, v) { return Array.sum(v); },
   out={"inline":1}
+
   {out: {inline: 1}}
)
+
);</nowiki></pre>
</pre>
 
 
</div>
 
</div>
  
<div class=q data-lang="py3"><code>query</code> can be used to filter the <b>input</b> documents to map.<br/>
+
==Number of countries in each continent==
<p class="strong">Find the GDP for each continent, but only include data from countries that start with the letter A or B.</p>
+
<div class=q data-lang="mongo">
<pre class=def>
+
Instead of sending populations you can send a list of 1s to the reduce function.
from bson.code import Code
 
temp = db.world.map_reduce(
 
        query={"name": {"$regex":"^(A|B)"}},
 
        map=Code("function(){emit(this.continent, this.gdp)}"),
 
        reduce=Code("""function(key, values){
 
                          return Array.sum(values)
 
                      }
 
                  """),
 
        out={"inline":1},
 
)
 
  
pp.pprint(
+
The reduce function will now create a count of the number of countries in each continent.
  temp["results"]
+
<pre class="def"><nowiki>
)
+
db.world.mapReduce(
</pre>
+
  function () {emit(this.continent, 1);},  
<div class="ans">
+
  function (k, v) { return Array.sum(v); },
from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){return Array.sum(values)}"),out={"inline":1});import operator;pp.pprint(temp["results"])
+
  {out: {inline: 1}}
</div>
+
);</nowiki></pre>
 
</div>
 
</div>
  
 +
==Count only some countries==
 +
<div class=q data-lang="mongo">
 +
The map function does not need to emit once for every entry.
  
<div class=q data-lang="py3"><code>scope</code> takes in a <b>document</b>:<code>{}</code> and lets you create global variables.<br/>
+
In this example we are only counting the countries that have a large population.
Its syntax is: <code>scope={}</code>.<br/>
+
<pre class="def"><nowiki>
<p class="strong">Using <code>scope</code>, list all the countries with a higher population than Mexico.</p>
+
db.world.mapReduce(
<pre class=def>
+
  function () {
mexico_data = db.world.find_one({"name":"Mexico"})
+
    if (this.population > 100000000)
pp.pprint(mexico_data)
+
    {
 
+
      emit(this.continent, 1);
from bson.code import Code
+
    }
temp = db.world.map_reduce(
+
  },
        scope = {"MEXICO":mexico_data},
+
  function (k, v) { return Array.sum(v); },
        map = Code("""function(){
+
  {out: {"inline": 1}}
                        if (this.population > MEXICO.population) emit(this.name, this.population)
+
);</nowiki></pre>
                      }
 
                  """),  
 
        reduce=Code("function(key, values){return values}"),
 
        out={"inline":1},
 
)
 
pp.pprint(
 
  temp["results"]
 
)
 
</pre>
 
<div class="ans">
 
mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data); from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1});pp.pprint(temp['results'])
 
</div>
 
 
</div>
 
</div>
  
<div class=q data-lang="py3"><code>sort</code> and <code>limit</code><br/>
+
==Examine the reduce function==
Sort allows us to sort the <b>input</b> documents that are passed to <b>map</b><br/>Limit is self explanatory and also applies to the <b>input</b> documents that are passed to <b>map</b>
+
<div class=q data-lang="mongo">
<p class="strong">Get the five countries with the highest GDPs</p>
+
<p class="strong">Examine the reduce function.</p>
<pre class=def>
 
from bson.code import Code
 
temp = db.world.map_reduce(
 
        query={"gdp":{"$ne":None}},
 
        sort={"gdp":-1},
 
        limit=5,
 
        map=Code("function(){emit(this.name, this.gdp)}"),
 
        reduce=Code("function(key, values){return values}"),
 
        out={"inline":1},
 
)
 
  
pp.pprint(
+
Here we emit the continent and the name, and in the reduce function we <code>return v.join(',')</code> to see a comma separated list of the values in the list.
  temp["results"]
+
<pre class="def"><nowiki>
)
+
db.world.mapReduce(
</pre>
+
  function () {
<div class="ans">
+
    if (this.population > 100000000) {
from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(temp["results"])
+
      emit(this.continent, this.name);
</div>
+
    }
 +
  },
 +
  function (k, v) { return v.join(','); },
 +
  {out: {"inline": 1}}
 +
);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3"><code>finalize</code> is an optional additional step that allows you to modify the data produced by <code>reduce</code><br/>
+
==Reduce to a single value==
<p class="strong">Show the top 15 countries by population, then show their population as a percentage of Mexico's population.</p>
+
<div class=q data-lang="mongo">
<pre class=def>
+
If you emit the same key every time you will get exactly one result from your query.
mexico_data = db.world.find_one({"name":"Mexico"})
 
  
from bson.code import Code
+
Here we emit the value 1 as the key and 1 as the value. The reduce function sums those 1s to get a count of the total number of countries.
temp = db.world.map_reduce(
+
<pre class="def"><nowiki>
        scope = {"MEXICO":mexico_data},
+
db.world.mapReduce(
        query={"population":{"$ne":None}},
+
  function () {
        sort={"population":-1},
+
    emit(1, 1);
        limit=15,
+
  },
        map=Code("function(){emit(this.name, this.population)}"),
+
  function (k, v) { return Array.sum(v); },
        reduce=Code("function(key, values){return values}"),
+
  {out: {"inline": 1}}
        out={"inline":1},
+
);</nowiki></pre>
        finalize=Code("""function(key, values){
 
                            return 100*(values/MEXICO.population)+"%"
 
                        }
 
                      """)
 
)
 
 
 
pp.pprint(
 
  temp["results"]
 
)
 
</pre>
 
<div class="ans">
 
mexico_data = db.world.find_one({"name":"Mexico"});from bson.code import Code; temp = db.world.map_reduce( scope = {"MEXICO":mexico_data}, query={"population":{"$ne":None}}, sort={"population":-1}, limit=15, map=Code("function(){emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, finalize=Code("""function(key, values){return 100*(values/MEXICO.population)+"%"} """) );pp.pprint(temp["results"] );
 
</div>
 
 
</div>
 
</div>
  
<div class=q data-lang="py3">Rounding can also be performed by using JavaScript.<br/>
+
==Emit a name==
<p class="strong">Show the top 15 countries by population, then show their population as a whole number percentage of Mexico's population.</p>
+
<div class=q data-lang="mongo">
<pre class=def>
+
You can use the list given in the reduce function.
mexico_data = db.world.find_one({"name":"Mexico"})
 
  
from bson.code import Code
+
Here we emit the key '''this.continent''' and the value '''this.name'''.
temp = db.world.map_reduce(
+
The reduce function returns the first element of the collected list.
        scope = {"MEXICO":mexico_data},
+
<pre class="def"><nowiki>
        query={"population":{"$ne":None}},
+
db.world.mapReduce(
        sort={"population":-1},
+
  function () {
        limit=15,
+
    emit(this.continent, this.name);
        map=Code("function(){emit(this.name, this.population)}"),
+
  },
        reduce=Code("function(key, values){return values}"),
+
  function (k, v) { return v[0]; },
        out={"inline":1},
+
  {out: {"inline": 1}}
        finalize=Code("""function(key, values){
+
);</nowiki></pre>
                            return Math.round(100*(values/MEXICO.population))+"%"
 
                        }
 
                      """)
 
)
 
 
 
pp.pprint(
 
  temp["results"]
 
)
 
</pre>
 
<div class="ans">
 
mexico_data = db.world.find_one({"name":"Mexico"});from bson.code import Code;temp=db.world.map_reduce(scope ={"MEXICO":mexico_data},query={"population":{"$ne":None}},sort={"population":-1},limit=15,map=Code("function(){emit(this.name,this.population)}"),reduce=Code("function(key, values){return values}"), out={"inline":1},finalize=Code("function(key,values){return Math.round(100*(values/MEXICO.population))+'%'}"));pp.pprint(temp["results"])
 
</div>
 
 
</div>
 
</div>

Latest revision as of 08:47, 26 June 2018

Introducing the MapReduce function

The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.

The map is always performed before the reduce.

The map function examines every document in the collection and emits (key,value) pairs.

The map function takes no arguments; however, the current document can be accessed as this.

The reduce function has two inputs: a key and a list of values. For every distinct key emitted by map, the reduce function is called with that key and a list of the corresponding values.
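The flow described above can be sketched in plain JavaScript, without a database. This is only an illustration of the semantics, not how MongoDB implements mapReduce: the helper simulateMapReduce, the sample documents and their populations are all invented for the example, and emit is passed to map as an argument because plain JavaScript has no global emit.

```javascript
// In-memory sketch of the map/reduce flow. simulateMapReduce is a
// hypothetical helper, not part of MongoDB.
function simulateMapReduce(docs, map, reduce) {
  const groups = new Map(); // key -> list of emitted values

  // Collects every (key, value) pair emitted by map.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: the current document is available as `this`.
  for (const doc of docs) {
    map.call(doc, emit);
  }

  // Reduce phase: one call per distinct key. As in MongoDB, a key with
  // a single value is passed through without calling reduce.
  const results = [];
  for (const [key, values] of groups) {
    const value = values.length === 1 ? values[0] : reduce(key, values);
    results.push({ _id: key, value: value });
  }
  return results;
}

// Invented sample documents; the populations are not real data.
const sampleDocs = [
  { name: "A", continent: "X", population: 10 },
  { name: "B", continent: "X", population: 5 },
  { name: "C", continent: "Y", population: 7 },
];

const sampleResults = simulateMapReduce(
  sampleDocs,
  function (emit) { emit(this.continent, this.population); },
  function (k, v) { return v.reduce((a, b) => a + b, 0); }
);

console.log(sampleResults);
// [ { _id: 'X', value: 15 }, { _id: 'Y', value: 7 } ]
```

Continent Y has only one document, so its value reaches the result without going through reduce.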

Population of each continent

Here the map function emits the continent and the population for each country.

The reduce function uses Array.sum to add the populations. Array.sum is a helper provided by MongoDB's JavaScript environment; it is not part of standard JavaScript.

db.world.mapReduce(
  function () {emit(this.continent, this.population);}, 
  function (k, v) { return Array.sum(v); },
  {out: {inline: 1}}
);

Number of countries in each continent

Instead of sending populations you can send a list of 1s to the reduce function.

The reduce function will now create a count of the number of countries in each continent.

db.world.mapReduce(
  function () {emit(this.continent, 1);}, 
  function (k, v) { return Array.sum(v); },
  {out: {inline: 1}}
);
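The count here comes from summing the 1s rather than from the length of the list. One reason to prefer the sum is that MongoDB may call reduce more than once for the same key, feeding earlier outputs back in as values. The plain-JavaScript sketch below imitates that with a hypothetical two-stage reduction; the helper names and the way the list is split are assumptions made for illustration, and sum stands in for Array.sum.

```javascript
// Stand-in for MongoDB's Array.sum helper.
const sum = (values) => values.reduce((a, b) => a + b, 0);

const reduceBySum = (key, values) => sum(values);
const reduceByLength = (key, values) => values.length;

// Hypothetical two-stage reduction: reduce two partial lists first,
// then reduce the two partial results.
function reduceInTwoStages(reduce, key, firstPart, secondPart) {
  const partials = [reduce(key, firstPart), reduce(key, secondPart)];
  return reduce(key, partials);
}

const ones = [1, 1, 1, 1, 1]; // five countries, each emitted as 1

const viaSum = reduceInTwoStages(
  reduceBySum, "Asia", ones.slice(0, 3), ones.slice(3));
const viaLength = reduceInTwoStages(
  reduceByLength, "Asia", ones.slice(0, 3), ones.slice(3));

console.log(viaSum);    // 5
console.log(viaLength); // 2
```

Summing gives the same answer however the list is split, while counting the length of the list counts the partial results instead of the countries.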

Count only some countries

The map function does not need to emit once for every entry.

In this example we are only counting the countries that have a large population.

db.world.mapReduce(
  function () {
    if (this.population > 100000000) {
      emit(this.continent, 1);
    }
  },
  function (k, v) { return Array.sum(v); },
  {out: {"inline": 1}}
);
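A consequence of emitting conditionally is that a key which is never emitted does not appear in the result at all. The sketch below shows this in plain JavaScript, without a database; the sample countries, continent names and populations are invented for the example.

```javascript
// Invented sample data; these are not real countries or populations.
const sampleCountries = [
  { name: "P", continent: "North", population: 150000000 },
  { name: "Q", continent: "North", population: 20000000 },
  { name: "R", continent: "South", population: 300000000 },
  { name: "S", continent: "South", population: 120000000 },
  { name: "T", continent: "East", population: 5000000 },
];

// Map step: a document emits once if it is large, otherwise not at all.
const emittedPairs = [];
for (const country of sampleCountries) {
  if (country.population > 100000000) {
    emittedPairs.push([country.continent, 1]);
  }
}

// Reduce step: group by key and add the 1s.
const largeCountryCounts = {};
for (const [continent, one] of emittedPairs) {
  largeCountryCounts[continent] = (largeCountryCounts[continent] || 0) + one;
}

console.log(largeCountryCounts); // { North: 1, South: 2 }
```

The continent East has no country above the threshold, so it is missing from the output rather than listed with a count of 0.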

Examine the reduce function

Here we emit the continent and the name, and the reduce function returns v.join(',') so that we can see the values in the list as a comma-separated string.

db.world.mapReduce(
  function () {
    if (this.population > 100000000) {
      emit(this.continent, this.name);
    }
  },
  function (k, v) { return v.join(','); },
  {out: {"inline": 1}}
);

Reduce to a single value

If you emit the same key every time you will get exactly one result from your query.

Here we emit 1 as both the key and the value. The reduce function sums those 1s to get a count of the total number of countries.

db.world.mapReduce(
  function () {
    emit(1, 1);
  },
  function (k, v) { return Array.sum(v); },
  {out: {"inline": 1}}
);

Emit a name

The reduce function can pick individual elements out of the list it is given.

Here we emit the key this.continent and the value this.name. The reduce function returns the first element of the collected list.

db.world.mapReduce(
  function () {
    emit(this.continent, this.name);
  },
  function (k, v) { return v[0]; },
  {out: {"inline": 1}}
);