Difference between revisions of "AGGREGATE examples"
Line 55: | Line 55: | ||
<div class=ans> | <div class=ans> | ||
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}]))) | pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}]))) | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class=q data-lang="py3"> | ||
+ | Because aggregate is a pipeline we can repeat a stage, and we aren't forced to use stages in a specific order. | ||
+ | <p class=strong>Show the name of Asian countries with a density that's over 500 people per km<sup>2</sup>. (population/area)</p> | ||
+ | <pre class=def> | ||
+ | pp.pprint(list( | ||
+ | db.world.aggregate([ | ||
+ | {"$match":{"area":{"$ne":0},"continent":"Asia"}}, | ||
+ | {"$project":{ | ||
+ | "_id":0, | ||
+ | "name":1, | ||
+ | "density": {"$divide": ["$population","$area"]} | ||
+ | }}, | ||
+ | {"$match":{"density":{"$gt":500}}} | ||
+ | ]) | ||
+ | )) | ||
+ | </pre> | ||
+ | <div class=ans> | ||
+ | pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}]))) | ||
</div> | </div> | ||
</div> | </div> | ||
Line 82: | Line 103: | ||
==Grouping== | ==Grouping== | ||
Grouping allows us to use accumulator operations sum as <code>$sum</code><br /> | Grouping allows us to use accumulator operations sum as <code>$sum</code><br /> | ||
− | All groups must have an <code>_id</code>. To | + | All groups must have an <code>_id</code>. To see the why this is useful imagine the following:<br/><br/> |
− | < | + | |
+ | So far you've been using the collection <code>world</code><br/>As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g: | ||
+ | |||
+ | <pre> | ||
+ | [ | ||
+ | {"name":"Africa", | ||
+ | "countries":[ | ||
+ | {"name":"Algeria", "capital":"Algiers", ...}, | ||
+ | {"name":"Angola", "capital":"Luanda", ...}, | ||
+ | {"name":"Benin", "capital":"Porto-Novo",...}. | ||
+ | {...}, | ||
+ | ... | ||
+ | ]}, | ||
+ | {"name":"Asia", | ||
+ | "countries":[ | ||
+ | {"name":"Afghanistan","capital":"Kabul", ...}, | ||
+ | {"name":"Azerbaijan", "capital":"Baku", ...}, | ||
+ | {"name":"Bahrain", "capital":"Manama",...}, | ||
+ | {...}, | ||
+ | ... | ||
+ | ]}, | ||
+ | {...}, | ||
+ | ... | ||
+ | ] | ||
+ | </pre> | ||
+ | |||
+ | The <code>world</code> collection isn't like this however. We'll be using the following structure, which has a redundancy where <code>continent</code> is repeated for each country. Don't worry, you'll learn how to query nested documents later. | ||
+ | |||
+ | <pre> | ||
+ | [ | ||
+ | {"name":"Afghanistan","capital":"Kabul", "continent":"Asia", ...}, | ||
+ | {"name":"Albania", "capital":"Tirana", "continent":"Europe, ...}, | ||
+ | {"name":"Algeria", "capital":"Algiers","contiennt":"Africa",...}, | ||
+ | {...}, | ||
+ | ... | ||
+ | ] | ||
+ | </pre> | ||
+ | What if we wanted to know what the largest and smallest GDPs in each continent are? We can use the _id to group by continent: <code> "_id":"continent"</code> If we wanted to find these values for any country regardless of continent, we could just group by <code>"_id":"null"</code> or <code>"_id":None</code> | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
<code>$max</code> and <code>$min</code> can be used to get the largest and smallest values in a group. | <code>$max</code> and <code>$min</code> can be used to get the largest and smallest values in a group. | ||
− | <p class=strong>Get the smallest and largest GDPs.</p> | + | <p class=strong>Get the smallest and largest GDPs of each continent.</p> |
<pre class=def> | <pre class=def> | ||
pp.pprint(list( | pp.pprint(list( | ||
− | + | db.world.aggregate([ | |
{"$group":{ | {"$group":{ | ||
− | '_id':' | + | '_id':'$continent', |
'min':{"$min":"$gdp"}, | 'min':{"$min":"$gdp"}, | ||
− | 'max':{"$max":"$gdp"} | + | 'max':{"$max":"$gdp"} |
}}, | }}, | ||
{"$project":{ | {"$project":{ | ||
− | + | "_id":1, | |
− | + | "min":1, | |
− | + | "max":1 | |
}}, | }}, | ||
]) | ]) | ||
Line 104: | Line 162: | ||
</pre> | </pre> | ||
<div class=ans> | <div class=ans> | ||
− | pp.pprint(list(db.world.aggregate([{"$group":{'_id':' | + | pp.pprint(list(db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}},]))) |
</div> | </div> | ||
</div> | </div> | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | + | A downside of grouping is that we lose the ability to find the <code>$name</code> of the country with the min/max value.<br/><br/> | |
− | If we want to do this, a simple way is to | + | If we want to do this, a simple way is to combine a sort <code>sort=[(<fieldname>, <direction>)]</code> and a <code>find_one()</code>. Performance will be improved if these fields have previously been indexed. |
<p class=strong>Get the names and GDPs of the two countries with the smallest and largest GDPs.</p> | <p class=strong>Get the names and GDPs of the two countries with the smallest and largest GDPs.</p> | ||
<div class="hint" title="Avoiding null results"> | <div class="hint" title="Avoiding null results"> | ||
− | It is possible that we will occasionally encounter <code>null</code> values in a data collection. This could be either where the field is simply not included on a certain document | + | It is possible that we will occasionally encounter <code>null</code> values in a data collection. This could be either where the field is simply not included on a certain document - MongoDB will simply ignore this document and not include it in the results set - or a document has a redundancy where the field is specified as null - which MongoDB <b>will</b> include. e.g: <code>{"name": "Bob", "age": None}</code> etc.<br/><br/>To deal with this we can use <code>{<field>: {"$ne": None}}</code> to prevent any null values from being included. |
</div> | </div> | ||
<pre class=def> | <pre class=def> | ||
Line 124: | Line 182: | ||
</pre> | </pre> | ||
<div class=ans> | <div class=ans> | ||
− | pp.pprint(db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)]) | + | pp.pprint(db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)])) |
pp.pprint(db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)])) | pp.pprint(db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)])) | ||
</div> | </div> | ||
</div> | </div> |
Revision as of 11:25, 17 July 2015
#ENCODING import io import sys sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16') #MONGO from pymongo import MongoClient client = MongoClient() client.progzoo.authenticate('scott','tiger') db = client['progzoo'] #PRETTY import pprint pp = pprint.PrettyPrinter(indent=4)
Introducing the aggregation framework
These examples introduce the aggregation framework and its operators. Again we will be using the collection world
$match
Allows us to perform queries in a similar way to find()
Show all the details for France
pp.pprint(list( db.world.aggregate([ {"$match":{"name":"France"}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"name":"France"}}])))
$project
Allows us to select what fields to display.
It can also has the ability to insert new fields and allows you to compare fields against each other without using $where
Show the name and population density of all Asian countries. (population/area)
Note that "density" is a new field, made from the result of dividing two existing fields, and that $divide
is an aggregate function.
To avoid diving by 0 we do a $match
to remove any countries with 0 area (Vatican City), then pipe these results through to $project
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}])))
Because aggregate is a pipeline we can repeat a stage, and we aren't forced to use stages in a specific order.
Show the name of Asian countries with a density that's over 500 people per km2. (population/area)
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }}, {"$match":{"density":{"$gt":500}}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}])))
$sort
Allows us to choose how the results are displayed, where 1 is ascending and -1 is descending.
Note that excluding $match
is the same as {"$match":{}}
Show the name of all countries in descending order.
pp.pprint(list( db.world.aggregate([ {"$project":{ "_id":0, "name":1, }}, {"$sort":{ "name":-1 }} ]) ))
pp.pprint(list(db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])))
Grouping
Grouping allows us to use accumulator operations sum as $sum
All groups must have an _id
. To see the why this is useful imagine the following:
So far you've been using the collection world
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:
[ {"name":"Africa", "countries":[ {"name":"Algeria", "capital":"Algiers", ...}, {"name":"Angola", "capital":"Luanda", ...}, {"name":"Benin", "capital":"Porto-Novo",...}. {...}, ... ]}, {"name":"Asia", "countries":[ {"name":"Afghanistan","capital":"Kabul", ...}, {"name":"Azerbaijan", "capital":"Baku", ...}, {"name":"Bahrain", "capital":"Manama",...}, {...}, ... ]}, {...}, ... ]
The world
collection isn't like this however. We'll be using the following structure, which has a redundancy where continent
is repeated for each country. Don't worry, you'll learn how to query nested documents later.
[ {"name":"Afghanistan","capital":"Kabul", "continent":"Asia", ...}, {"name":"Albania", "capital":"Tirana", "continent":"Europe, ...}, {"name":"Algeria", "capital":"Algiers","contiennt":"Africa",...}, {...}, ... ]
What if we wanted to know what the largest and smallest GDPs in each continent are? We can use the _id to group by continent: "_id":"continent"
If we wanted to find these values for any country regardless of continent, we could just group by "_id":"null"
or "_id":None
$max
and $min
can be used to get the largest and smallest values in a group.
Get the smallest and largest GDPs of each continent.
pp.pprint(list( db.world.aggregate([ {"$group":{ '_id':'$continent', 'min':{"$min":"$gdp"}, 'max':{"$max":"$gdp"} }}, {"$project":{ "_id":1, "min":1, "max":1 }}, ]) ))
pp.pprint(list(db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}},])))
A downside of grouping is that we lose the ability to find the $name
of the country with the min/max value.
If we want to do this, a simple way is to combine a sort sort=[(<fieldname>, <direction>)]
and a find_one()
. Performance will be improved if these fields have previously been indexed.
Get the names and GDPs of the two countries with the smallest and largest GDPs.
It is possible that we will occasionally encounter null
values in a data collection. This could be either where the field is simply not included on a certain document - MongoDB will simply ignore this document and not include it in the results set - or a document has a redundancy where the field is specified as null - which MongoDB will include. e.g: {"name": "Bob", "age": None}
etc.
To deal with this we can use {<field>: {"$ne": None}}
to prevent any null values from being included.
pp.pprint( db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)]) ) pp.pprint( db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)]) )
pp.pprint(db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)])) pp.pprint(db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)]))