
Difference between revisions of "MapReduce"

From NoSQLZoo

Revision as of 08:44, 21 July 2015

#ENCODING
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
#MONGO
from pymongo import MongoClient
client = MongoClient()
client.progzoo.authenticate('scott','tiger')
db = client['progzoo']
#PRETTY
import pprint
pp = pprint.PrettyPrinter(indent=4)

Page Under Construction

Introducing the MapReduce function

The MapReduce function is an aggregate function that consists of two functions: Map and Reduce. As the name would suggest, the map is always performed before the reduce.

The map function takes the data and breaks it down into tuples (key/value pairs), one for each element in the dataset.
The reduce function then takes the output of the map function and reduces it to a smaller set of tuples by merging all values that share the same key.
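
The two phases can be imitated in plain Python, with no MongoDB involved. This is only a sketch: map_phase, reduce_phase and the sample records are hypothetical names and made-up data, and summing is just one possible way of merging values.

```python
from collections import defaultdict

def map_phase(records, key_field, value_field):
    """Break each record down into a (key, value) tuple."""
    return [(r[key_field], r[value_field]) for r in records]

def reduce_phase(pairs, merge):
    """Merge all values that share a key into a single value per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: merge(values) for key, values in grouped.items()}

# Made-up sample data, for illustration only
records = [
    {"group": "a", "n": 1},
    {"group": "b", "n": 2},
    {"group": "a", "n": 3},
]
pairs = map_phase(records, "group", "n")
totals = reduce_phase(pairs, sum)
print(pairs)
print(totals)
```

Three records go in, three tuples come out of the map, and the reduce leaves one entry per distinct key.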

Map is used to deal with "embarrassingly parallel problems", where a task can be broken down into subtasks that can then be run simultaneously without affecting each other. Instead of processing elements one by one, all elements can be dealt with at the same time in parallel. This can greatly reduce processing times and lets the work scale across multiple servers, making it an attractive solution for handling Big Data.
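
To illustrate why the map step is so easy to run in parallel, here is a minimal sketch using Python's standard concurrent.futures module. The to_pair function and the sample records are hypothetical. Each call reads nothing but its own record, so the records can be handed to several workers at once. The sketch only demonstrates that independence; it does not show how MongoDB itself distributes work.

```python
from concurrent.futures import ThreadPoolExecutor

def to_pair(record):
    """Map step for one record; it depends only on its own input."""
    return (record["group"], record["n"])

# Made-up sample data, for illustration only
records = [
    {"group": "a", "n": 1},
    {"group": "b", "n": 2},
    {"group": "a", "n": 3},
]

# Hand the records to a pool of workers; pool.map keeps the input order
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_pairs = list(pool.map(to_pair, records))

# The same work done one element at a time gives the same tuples
sequential_pairs = [to_pair(r) for r in records]
print(parallel_pairs == sequential_pairs)
```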

This feature is better suited to the shell or a Node.js implementation, as here we need to embed JavaScript code inside PyMongo. Also note that the Mongo shell version of this is mapReduce, whereas PyMongo uses map_reduce(). The map_reduce() method is available in PyMongo 3.x and was removed in PyMongo 4.0.

For this example our MapReduce takes the form:

db.<collection>.map_reduce(
    <map function>,
    <reduce function>,
    <out collection>
)
In this example we will return the total population of each continent.

emit(k,v) lets us pick the fields we want to turn into tuples, where k is the key and v is the value. Our keys will be the continents and our values will be the populations.
In our reduce we sum all the values associated with each key. Finally we specify that we want the "out" part of the MapReduce to be inline rather than a collection, allowing us to print it to the screen.

from bson.code import Code
pp.pprint(
    db.world.map_reduce(
        Code("function(){emit(this.continent, this.population)}"), 
        Code("function(key, values){"
                  "    var total = 0;"
                  "    for (var i = 0; i < values.length; i++){"
                  "        total += values[i];"
                  "    }"
                  "    return total;"
                  "}"),
        {"inline":1},
    )
)
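
For readers less familiar with JavaScript, the same map and reduce can be ported to plain Python. This is a rough sketch only: the documents are made up (they merely borrow the continent and population field names used above), and the emit stand-in and the grouping loop are simplified assumptions about what MongoDB does between the two phases. Inline results come back as documents with an _id and a value field, which the last line imitates.

```python
from collections import defaultdict

# Made-up documents using the same field names as the example above
world = [
    {"continent": "Continent X", "population": 100},
    {"continent": "Continent Y", "population": 250},
    {"continent": "Continent X", "population": 50},
]

emitted = defaultdict(list)

def emit(k, v):
    """Stand-in for MongoDB's emit(): collect v under key k."""
    emitted[k].append(v)

def map_function(doc):
    # Python port of: function(){emit(this.continent, this.population)}
    emit(doc["continent"], doc["population"])

def reduce_function(key, values):
    # Python port of the JavaScript reduce: add up every value for this key
    total = 0
    for value in values:
        total += value
    return total

for doc in world:
    map_function(doc)

results = [{"_id": k, "value": reduce_function(k, vs)} for k, vs in emitted.items()]
print(results)
```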
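By making a small change to our JavaScript we can create things like averages. MongoDB may call reduce more than once for the same key, passing earlier results back in, so reduce only accumulates a sum and a count, and the division is done once at the end in a finalize function.

Find the average population of each continent

from bson.code import Code
pp.pprint(
    db.world.map_reduce(
        # Emit a partial sum and a count so that reduce can safely be re-applied
        Code("function(){emit(this.continent, {sum: this.population, count: 1})}"), 
        Code("function(key, values){"
                  "    var total = {sum: 0, count: 0};"
                  "    for (var i = 0; i < values.length; i++){"
                  "        total.sum += values[i].sum;"
                  "        total.count += values[i].count;"
                  "    }"
                  "    return total;"
                  "}"),
        {"inline":1},
        # finalize runs once per key, after all reducing is done
        finalize=Code("function(key, value){return value.sum/value.count;}"),
    )
)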
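
When averaging, it matters that MongoDB may call reduce more than once for the same key, feeding the result of an earlier call back in as one of the values. The pure-Python sketch below (no MongoDB involved; the function names and numbers are made up) shows how a mean computed directly inside reduce can drift when the values arrive in batches, while carrying (sum, count) pairs and dividing once at the end does not.

```python
def reduce_mean(values):
    """Naive reduce: returns the mean of whatever it is given."""
    return sum(values) / len(values)

def reduce_sum_count(values):
    """Re-reducible reduce: merges (sum, count) pairs into one pair."""
    return (sum(s for s, _ in values), sum(c for _, c in values))

populations = [10, 20, 60]  # made-up values for a single key

# Naive mean: reduce a partial batch, then re-reduce with the remaining value
partial = reduce_mean(populations[:2])
naive = reduce_mean([partial, populations[2]])

# (sum, count) pairs go through the same batching unharmed
pairs = [(p, 1) for p in populations]
part = reduce_sum_count(pairs[:2])
total, count = reduce_sum_count([part, pairs[2]])
correct = total / count  # the single division, done after all reducing

print(naive, correct)
```

With these numbers the naive version gives 37.5, whereas the true mean of 10, 20 and 60 is 30.0.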