Difference between revisions of "AGGREGATE examples"
m (40166222 moved page AGGREGATE basics to AGGREGATE basics (PyMongo)) |
|||
Line 19: | Line 19: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | <code>$match</code> | + | <code>$match</code> performs queries in a similar way to <code>find()</code> |
<p class=strong>Show all the details for France</p> | <p class=strong>Show all the details for France</p> | ||
<pre class=def> | <pre class=def> | ||
Line 34: | Line 34: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | <code>$project</code> | + | <code>$project</code> selects what fields to display.<br/> |
− | It can also has the ability to | + | It can also has the ability to create new fields and to compare fields against each other without using <code>$where</code> |
<p class=strong>Show the name and population density of all Asian countries. (population/area)</p> | <p class=strong>Show the name and population density of all Asian countries. (population/area)</p> | ||
Note that "density" is a new field, made from the result of dividing two existing fields, and that <code>$divide</code> is an aggregate function. | Note that "density" is a new field, made from the result of dividing two existing fields, and that <code>$divide</code> is an aggregate function. | ||
<div class=hint title="Dealing with division by 0"> | <div class=hint title="Dealing with division by 0"> | ||
− | To avoid diving by 0 | + | To avoid diving by 0 insert a <code>$match</code> to remove any countries with 0 area (Vatican City), then pipe these results through to <code>$project</code> |
</div> | </div> | ||
<pre class=def> | <pre class=def> | ||
Line 59: | Line 59: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | Because aggregate is a pipeline | + | Because aggregate is a pipeline stages may be repeated, and stages don't have to be used in a specific order. |
<p class=strong>Show the name of Asian countries with a density that's over 500 people per km<sup>2</sup>. (population/area)</p> | <p class=strong>Show the name of Asian countries with a density that's over 500 people per km<sup>2</sup>. (population/area)</p> | ||
<pre class=def> | <pre class=def> | ||
Line 80: | Line 80: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | <code>$sort</code> | + | <code>$sort</code> allows ordering of the results set, where 1 is ascending and -1 is descending.<br/> |
− | Note that | + | Note that not including <code>$match</code> is the same as <code>{"$match":{}}</code> |
<p class=strong>Show the name of all countries in descending order.</p> | <p class=strong>Show the name of all countries in descending order.</p> | ||
<pre class=def> | <pre class=def> | ||
Line 102: | Line 102: | ||
==Grouping== | ==Grouping== | ||
− | Grouping | + | Grouping provides accumulator operations such as <code>$sum</code><br /> |
All groups must have an <code>_id</code>. To see why this is useful imagine the following:<br/><br/> | All groups must have an <code>_id</code>. To see why this is useful imagine the following:<br/><br/> | ||
Line 130: | Line 130: | ||
</pre> | </pre> | ||
− | The <code>world</code> collection isn't like this however. | + | The <code>world</code> collection isn't like this however. It uses the following structure, which has a redundancy where <code>continent</code> is repeated for each country. |
<pre> | <pre> | ||
Line 141: | Line 141: | ||
] | ] | ||
</pre> | </pre> | ||
− | + | To code to group by continent is <code> "_id":"$continent"</code> If instead the question was to group by country the code would be <code>"_id":"$name"</code>. To operate over the whole document (which would have the same effect as <code>"_id":"$name"</code>) <code>"_id":"null"</code> or <code>"_id":None</code> can be used. | |
− | <div class=q data-lang="py3"> | + | <div class=q data-lang="py3"> |
<code>$max</code> and <code>$min</code> can be used to get the largest and smallest values in a group. | <code>$max</code> and <code>$min</code> can be used to get the largest and smallest values in a group. | ||
<p class=strong>Get the smallest and largest GDPs of each continent.</p> | <p class=strong>Get the smallest and largest GDPs of each continent.</p> | ||
Line 167: | Line 167: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
− | A downside of grouping is that | + | A downside of grouping is that it is not possible to find the <code>$name</code> of the country with the min/max value.<br/><br/> |
− | + | A simple way is to do this combine a sort <code>sort=[(<fieldname>, <direction>)]</code> and a <code>find_one()</code>. Performance will be improved if these fields have previously been indexed. | |
<p class=strong>Get the names and GDPs of the two countries with the smallest and largest GDPs.</p> | <p class=strong>Get the names and GDPs of the two countries with the smallest and largest GDPs.</p> | ||
<div class="hint" title="Avoiding null results"> | <div class="hint" title="Avoiding null results"> | ||
Line 189: | Line 189: | ||
<div class=q data-lang="py3"> | <div class=q data-lang="py3"> | ||
Some other useful aggregate functions to know are <code>$sum</code> and average: <code>$avg</code><br/> | Some other useful aggregate functions to know are <code>$sum</code> and average: <code>$avg</code><br/> | ||
− | + | This example combines all the material in these examples. | |
<p class=strong>Order the continents in descending order by total GDP, Include the average GDP for each country.</p> | <p class=strong>Order the continents in descending order by total GDP, Include the average GDP for each country.</p> | ||
<pre class=def> | <pre class=def> | ||
pp.pprint(list( | pp.pprint(list( | ||
db.world.aggregate([ | db.world.aggregate([ | ||
+ | {"$match":{}}, | ||
{"$group":{ | {"$group":{ | ||
"_id":"$continent", | "_id":"$continent", |
Revision as of 12:34, 22 July 2015
#ENCODING import io import sys sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16') #MONGO from pymongo import MongoClient client = MongoClient() client.progzoo.authenticate('scott','tiger') db = client['progzoo'] #PRETTY import pprint pp = pprint.PrettyPrinter(indent=4)
Introducing the aggregation framework
These examples introduce the aggregation framework and its operators. Again we will be using the collection world
$match
performs queries in a similar way to find()
Show all the details for France
pp.pprint(list( db.world.aggregate([ {"$match":{"name":"France"}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"name":"France"}}])))
$project
selects what fields to display.
It can also has the ability to create new fields and to compare fields against each other without using $where
Show the name and population density of all Asian countries. (population/area)
Note that "density" is a new field, made from the result of dividing two existing fields, and that $divide
is an aggregate function.
To avoid diving by 0 insert a $match
to remove any countries with 0 area (Vatican City), then pipe these results through to $project
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}])))
Because aggregate is a pipeline stages may be repeated, and stages don't have to be used in a specific order.
Show the name of Asian countries with a density that's over 500 people per km2. (population/area)
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }}, {"$match":{"density":{"$gt":500}}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}])))
$sort
allows ordering of the results set, where 1 is ascending and -1 is descending.
Note that not including $match
is the same as {"$match":{}}
Show the name of all countries in descending order.
pp.pprint(list( db.world.aggregate([ {"$project":{ "_id":0, "name":1, }}, {"$sort":{ "name":-1 }} ]) ))
pp.pprint(list(db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])))
Grouping
Grouping provides accumulator operations such as $sum
All groups must have an _id
. To see why this is useful imagine the following:
So far you've been using the collection world
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:
[ {"name":"Africa", "countries":[ {"name":"Algeria", "capital":"Algiers", ...}, {"name":"Angola", "capital":"Luanda", ...}, {"name":"Benin", "capital":"Porto-Novo",...}. {...}, ... ]}, {"name":"Asia", "countries":[ {"name":"Afghanistan","capital":"Kabul", ...}, {"name":"Azerbaijan", "capital":"Baku", ...}, {"name":"Bahrain", "capital":"Manama",...}, {...}, ... ]}, {...}, ... ]
The world
collection isn't like this however. It uses the following structure, which has a redundancy where continent
is repeated for each country.
[ {"name":"Afghanistan","capital":"Kabul", "continent":"Asia", ...}, {"name":"Albania", "capital":"Tirana", "continent":"Europe, ...}, {"name":"Algeria", "capital":"Algiers","contiennt":"Africa",...}, {...}, ... ]
To code to group by continent is "_id":"$continent"
If instead the question was to group by country the code would be "_id":"$name"
. To operate over the whole document (which would have the same effect as "_id":"$name"
) "_id":"null"
or "_id":None
can be used.
$max
and $min
can be used to get the largest and smallest values in a group.
Get the smallest and largest GDPs of each continent.
pp.pprint(list( db.world.aggregate([ {"$group":{ '_id':'$continent', 'min':{"$min":"$gdp"}, 'max':{"$max":"$gdp"} }}, {"$project":{ "_id":1, "min":1, "max":1 }}, ]) ))
pp.pprint(list(db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}},])))
A downside of grouping is that it is not possible to find the $name
of the country with the min/max value.
A simple way is to do this combine a sort sort=[(<fieldname>, <direction>)]
and a find_one()
. Performance will be improved if these fields have previously been indexed.
Get the names and GDPs of the two countries with the smallest and largest GDPs.
It is possible that we will occasionally encounter null
values in a data collection. This could be either where the field is simply not present in a certain document - MongoDB will simply ignore this document and not include it in the results set - or a document has a redundancy where the field is specified as null - which MongoDB will include. e.g: {"name": "Bob", "age": None}
etc.
To deal with this we can use {<field>: {"$ne": None}}
to prevent any null values from being included.
pp.pprint( db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)]) ) pp.pprint( db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)]) )
pp.pprint(db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)])) pp.pprint(db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)]))
Some other useful aggregate functions to know are $sum
and average: $avg
This example combines all the material in these examples.
Order the continents in descending order by total GDP, Include the average GDP for each country.
pp.pprint(list( db.world.aggregate([ {"$match":{}}, {"$group":{ "_id":"$continent", "Total GDP": {"$sum": "$gdp"}, "Average GDP": {"$avg": "$gdp"} }}, {"$sort":{ "Total GDP":-1 }}, {"$project":{ "Area":"$_id", "Total GDP": 1, "Average GDP":1, "_id":0 }} ]) ))
pp.pprint(list(db.world.aggregate([{"$group":{"_id":"$continent","Total GDP":{"$sum":"$gdp"},"Average GDP":{"$avg":"$gdp"}}},{"$sort":{"Total GDP":-1}},{"$project":{"Area":"$_id","Total GDP":1,"Average GDP":1,"_id":0}}])))