Cookies help us deliver our services. By using our services, you agree to our use of cookies. More information

Difference between revisions of "MapReduce"

From NoSQLZoo
Jump to: navigation, search
 
(48 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<pre class=setup>
+
{{TopTenTips}}
#ENCODING
+
<div style="min-height:25em">
import io
 
import sys
 
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
 
#MONGO
 
from pymongo import MongoClient
 
client = MongoClient()
 
client.progzoo.authenticate('scott','tiger')
 
db = client['progzoo']
 
#PRETTY
 
import pprint
 
pp = pprint.PrettyPrinter(indent=4)
 
</pre>
 
==Page Under Construction==
 
 
==Introducing the MapReduce function==
 
==Introducing the MapReduce function==
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce. As the name would suggest, the map is always performed before the reduce.<br/><br/>
+
The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.
The map function takes data and breaks it down into tuples (key/value pairs) for each element in the dataset<br/>
 
The reduce function then takes the result of the map function and simply reduces it in to a smaller set of tuples by merging all values with the same key.<br/><br/>
 
Map is used to deal with [https://en.wikipedia.org/wiki/Embarrassingly_parallel "embarassingly parallel problems"] where a task can be broken down into subtasks that can then be ran simultaneously without affecting each other. Instead of just processing elements one by one, all elements can all be dealt with at the same time in parallel. This allows for massively reduced processing times as well as large scalability across multiple servers, making it an attractive solution to handling [https://en.wikipedia.org/wiki/Big_data Big Data].<br/><br/>
 
We will be using Javascript here through the use of <code>Code</code>. Also note that the Mongo shell uses the camelCase 'mapReduce()' whereas PyMongo uses the Python naming scheme: map_reduce()<br/><br/>
 
For this example our MapReduce takes the form:<br/>
 
<pre>db.<collection>.map_reduce(
 
    map=<function>,
 
    reduce=<function>,
 
    out=<collection>
 
)
 
</pre>
 
  
<div class=q data-lang="py3">In this example we will be returning the population of all the continents.<br/>
+
The map is always performed before the reduce.
<code>emit(k,v)</code> lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the population<br/>
+
 
In our reduce we sum all the values associated with each key.
+
The map function examines every document in the collection and emits '''(key,value)''' pairs.
Finally we specify that we want the "out" part of the MapReduce to be inline rather than a collection, allowing us to print it to screen.  
+
 
<pre class=def>
+
The map function takes no input however the current document can be accessed as '''this'''
from bson.code import Code
+
 
pp.pprint(
+
The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.
    db.world.map_reduce(
+
 
        map=Code("function(){emit(this.continent, this.population)}"),  
+
==Population of each continent==
        reduce=Code("""function(key, values){
+
<div class=q data-lang="mongo">
                          var total = 0;
+
Here the map function emits the continent and the population for each country.
                          for (var i = 0; i < values.length; i++){
+
 
                              total += values[i];
+
The reduce function uses the JavaScript function <code>Array.sum</code> to add the populations.
                          }
+
<pre class="def"><nowiki>
                          return total;
+
db.world.mapReduce(
                      }
+
  function () {emit(this.continent, this.population);},  
                    """),
+
  function (k, v) { return Array.sum(v); },
        out={"inline":1},
+
  {out: {inline: 1}}
    )
+
);</nowiki></pre>
)
 
</pre>
 
 
</div>
 
</div>
  
<div class=q data-lang="py3">This is all very good, but what if we only want the results of a MapReduce? Whilst we would normally write our results to another collection, for these examples we will store our results as a dict, then access the results through Python rather than PyMongo.<br/><br/>
+
==Number of countries in each continent==
Alternatively, if you want more information about the timing you can use <code>verbose=1</code>
+
<div class=q data-lang="mongo">
<p class="strong">Find the average population of each country by continent</p>
+
Instead of sending populations you can send a list one 1s to the reduce function.
<pre class=def>
 
from bson.code import Code
 
temp = db.world.map_reduce(
 
        map=Code("function(){emit(this.continent, this.population)}"),
 
        reduce=Code("""function(key, values){
 
                          var total = 0;
 
                          for (var i = 0; i < values.length; i++){
 
                              total += values[i];
 
                          }
 
                          return total/values.length;
 
                      }
 
                    """),
 
        out={"inline":1},
 
)
 
  
pp.pprint(list(
+
The reduce function will now create a count of the number of countries in each continent.
  temp["results"]
+
<pre class="def"><nowiki>
))
+
db.world.mapReduce(
</pre>
+
  function () {emit(this.continent, 1);},  
<div class="ans">
+
  function (k, v) { return Array.sum(v); },
from bson.code import Code; temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){ var total = 0; for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});pp.pprint(list(temp["results"]))
+
  {out: {inline: 1}}
</div>
+
);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3">This also allows us to do things like sorting, again for this we use Python, and not PyMongo.<br/>
+
==Count only some countries==
<p class="strong">Find the average population of each country by continent then sort by value in descending order.</p>
+
<div class=q data-lang="mongo">
<pre class=def>
+
The map function does not need to emit once for every entry.
from bson.code import Code
 
temp = db.world.map_reduce(
 
        map=Code("function(){emit(this.continent, this.population)}"),
 
        reduce=Code("""function(key, values){
 
                          var total = 0;
 
                          for (var i = 0; i < values.length; i++){
 
                              total += values[i];
 
                          }
 
                          return total/values.length;
 
                      }
 
                    """),
 
        out={"inline":1},
 
)
 
  
import operator
+
In this example we are only counting the countries that have a large population.
pp.pprint(list(
+
<pre class="def"><nowiki>
  sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)
+
db.world.mapReduce(
))
+
  function () {
</pre>
+
    if (this.population > 100000000)
<div class="ans">
+
    {
from bson.code import Code
+
      emit(this.continent, 1);
temp = db.world.map_reduce(map=Code("function(){emit(this.continent, this.population)}"), reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total/values.length;}"),out={"inline":1});import operator;pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)))
+
    }
</div>
+
  },
 +
  function (k, v) { return Array.sum(v); },
 +
  {out: {"inline": 1}}
 +
);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3">Until now we've been working on all the data, let's apply a mapreduce to a more specific set of elements by using <code>query</code><br/>
+
==Examine the reduce function==
<p class="strong">Find the GDP for each continent, but only include data from countries that start with the letter A or B.</p>
+
<div class=q data-lang="mongo">
<pre class=def>
+
<p class="strong">Examine the reduce function.</p>
from bson.code import Code
 
temp = db.world.map_reduce(
 
        query={"name": {"$regex":"^(A|B)"}},
 
        map=Code("function(){emit(this.continent, this.gdp)}"),
 
        reduce=Code("""function(key, values){
 
                          var total = 0;
 
                          for (var i = 0; i < values.length; i++){
 
                              total += values[i];
 
                          }
 
                          return total;
 
                      }
 
                  """),
 
        out={"inline":1},
 
)
 
  
pp.pprint(list(
+
Here we emit the continent and the name, and in the reduce function we <code>return v.join(',')</code> to see a comma separated list of the values in the list.
  temp["results"]
+
<pre class="def"><nowiki>
))
+
db.world.mapReduce(
</pre>
+
  function () {
<div class="ans">
+
    if (this.population > 100000000) {
from bson.code import Code;temp = db.world.map_reduce(query={"name": {"$regex":"^(A|B)"}},map=Code("function(){emit(this.continent, this.gdp)}"),reduce=Code("function(key, values){var total = 0;for (var i = 0; i < values.length; i++){total += values[i];}return total;}"),out={"inline":1});import operator;pp.pprint(list(temp["results"]))
+
      emit(this.continent, this.name);
</div>
+
    }
 +
  },
 +
  function (k, v) { return v.join(','); },
 +
  {out: {"inline": 1}}
 +
);</nowiki></pre>
 
</div>
 
</div>
  
 +
==Reduce to a single value==
 +
<div class=q data-lang="mongo">
 +
If you emit the same key every time you will get exactly one result from your query.
  
<div class=q data-lang="py3">Another MapReduce feature is <code>scope</code>, which takes in a <b>document</b>:<code>{}</code> and lets you create global variables.<br/>
+
Here we emit the value 1 as the key and 1 as the value. The reduce function sums those 1s to get a count of the total number of countries.
It's syntax is: <code>scope={}</code>.<br/>
+
<pre class="def"><nowiki>
<p class="strong">Using <code>scope</code>, list all the countries with a higher population than Mexico.</p>
+
db.world.mapReduce(
<pre class=def>
+
  function () {
mexico_data = db.world.find_one({"name":"Mexico"})
+
    emit(1, 1);
pp.pprint(mexico_data)
+
  },
 
+
  function (k, v) { return Array.sum(v); },
from bson.code import Code
+
  {out: {"inline": 1}}
temp = db.world.map_reduce(
+
);</nowiki></pre>
        scope = {"MEXICO":mexico_data},
 
        map = Code("""function(){
 
                        if (this.population > MEXICO.population) emit(this.name, this.population)
 
                      }
 
                  """),  
 
        reduce=Code("function(key, values){return values}"),
 
        out={"inline":1},
 
)
 
import operator
 
pp.pprint(list(
 
  sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)
 
))
 
</pre>
 
<div class="ans">
 
mexico_data = db.world.find_one({"name":"Mexico"}); pp.pprint(mexico_data);  from bson.code import Code; temp = db.world.map_reduce( scope={"MEXICO":mexico_data}, map=Code("function(){if (this.population > MEXICO.population) emit(this.name, this.population)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, ); import operator; pp.pprint(list(sorted(temp["results"],key=operator.itemgetter('value'), reverse=True)))
 
</div>
 
 
</div>
 
</div>
  
<div class=q data-lang="py3"><code>sort</code> and <code>limit</code><br/>
+
==Emit a name==
Sort allows us to sort the <b>input</b> documents that are passed to <b>map</b>, limit is self explanatory and also applies to the input documents.
+
<div class=q data-lang="mongo">
<p class="strong">Get the five countries with the highest GDPs</p>
+
You can use the list given in the reduce function.
<pre class=def>
 
from bson.code import Code
 
temp = db.world.map_reduce(
 
        query={"gdp":{"$ne":None}},
 
        sort={"gdp":-1},
 
        limit=5,
 
        map=Code("function(){emit(this.name, this.gdp)}"),
 
        reduce=Code("function(key, values){return values}"),
 
        out={"inline":1},
 
)
 
  
pp.pprint(list(
+
Here we emit the key '''this.continent''' and the value '''this.name'''.
  temp["results"]
+
The reduce function returns the first element of the collected list.
))
+
<pre class="def"><nowiki>
</pre>
+
db.world.mapReduce(
<div class="ans">
+
  function () {
from bson.code import Code; temp = db.world.map_reduce( query={"gdp":{"$ne":None}}, sort={"gdp":-1}, limit=5, map=Code("function(){emit(this.name, this.gdp)}"), reduce=Code("function(key, values){return values}"), out={"inline":1}, );pp.pprint(list(temp["results"]))
+
    emit(this.continent, this.name);
</div>
+
  },
 +
  function (k, v) { return v[0]; },
 +
  {out: {"inline": 1}}
 +
);</nowiki></pre>
 
</div>
 
</div>

Latest revision as of 08:47, 26 June 2018

Introducing the MapReduce function

The MapReduce function is an aggregate function that consists of two functions: Map and Reduce.

The map is always performed before the reduce.

The map function examines every document in the collection and emits (key,value) pairs.

The map function takes no input however the current document can be accessed as this

The reduce function has two inputs, for every distinct key emitted by map the reduce function is called with a list of the corresponding values.

Population of each continent

Here the map function emits the continent and the population for each country.

The reduce function uses the JavaScript function Array.sum to add the populations.

db.world.mapReduce(
  function () {emit(this.continent, this.population);}, 
  function (k, v) { return Array.sum(v); },
  {out: {inline: 1}}
);

Number of countries in each continent

Instead of sending populations you can send a list one 1s to the reduce function.

The reduce function will now create a count of the number of countries in each continent.

db.world.mapReduce(
  function () {emit(this.continent, 1);}, 
  function (k, v) { return Array.sum(v); },
  {out: {inline: 1}}
);

Count only some countries

The map function does not need to emit once for every entry.

In this example we are only counting the countries that have a large population.

db.world.mapReduce(
  function () {
    if (this.population > 100000000)
    {
      emit(this.continent, 1);
    }
  },
  function (k, v) { return Array.sum(v); },
  {out: {"inline": 1}}
);

Examine the reduce function

Examine the reduce function.

Here we emit the continent and the name, and in the reduce function we return v.join(',') to see a comma separated list of the values in the list.

db.world.mapReduce(
  function () {
    if (this.population > 100000000) {
      emit(this.continent, this.name);
    }
  },
  function (k, v) { return v.join(','); },
  {out: {"inline": 1}}
);

Reduce to a single value

If you emit the same key every time you will get exactly one result from your query.

Here we emit the value 1 as the key and 1 as the value. The reduce function sums those 1s to get a count of the total number of countries.

db.world.mapReduce(
  function () {
    emit(1, 1);
  },
  function (k, v) { return Array.sum(v); },
  {out: {"inline": 1}}
);

Emit a name

You can use the list given in the reduce function.

Here we emit the key this.continent and the value this.name. The reduce function returns the first element of the collected list.

db.world.mapReduce(
  function () {
    emit(this.continent, this.name);
  },
  function (k, v) { return v[0]; },
  {out: {"inline": 1}}
);