Difference between revisions of "AGGREGATE examples"
Line 11: | Line 11: | ||
#PRETTY | #PRETTY | ||
import pprint | import pprint | ||
− | pp = pprint.PrettyPrinter(indent=4) | + | pp = pprint.PrettyPrinter(indent=4, width=160) |
</pre> | </pre> | ||
Revision as of 14:57, 30 July 2015
#ENCODING import io import sys sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16') #MONGO from pymongo import MongoClient client = MongoClient() client.progzoo.authenticate('scott','tiger') db = client['progzoo'] #PRETTY import pprint pp = pprint.PrettyPrinter(indent=4, width=160)
Introducing the aggregation framework
These examples introduce the aggregation framework and its operators. Again we will be using the collection world
$match
performs queries in a similar way to find()
Show all the details for France
pp.pprint(list( db.world.aggregate([ {"$match":{"name":"France"}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"name":"France"}}])))
limit
sets the amount of documents to be handed to the next stage in the pipeline.
Return the first document
pp.pprint(list( db.world.aggregate([ {"$limit":1} ]) ))
pp.pprint(list(db.world.aggregate([{"$limit":1}])))
$project
selects what fields to display.
It can also has the ability to create new fields and to compare fields against each other without using $where
Show the name and population density of all Asian countries. (population/area)
Note that "density" is a new field, made from the result of dividing two existing fields, and that $divide
is an aggregate function.
To avoid diving by 0 insert a $match
to remove any countries with 0 area (Vatican City), then pipe these results through to $project
There is no need to check if values are null
, MongoDB will ignore these documents.
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}])))
Because aggregate is a pipeline stages may be repeated, and stages don't have to be used in a specific order.
Show the name of Asian countries with a density that's over 500 people per km2. (population/area)
pp.pprint(list( db.world.aggregate([ {"$match":{"area":{"$ne":0},"continent":"Asia"}}, {"$project":{ "_id":0, "name":1, "density": {"$divide": ["$population","$area"]} }}, {"$match":{"density":{"$gt":500}}} ]) ))
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}])))
$sort
allows ordering of the results set, where 1 is ascending and -1 is descending.
Note that not including $match
is the same as {"$match":{}}
Show the name of all countries in descending order.
pp.pprint(list( db.world.aggregate([ {"$project":{ "_id":0, "name":1, }}, {"$sort":{ "name":-1 }} ]) ))
pp.pprint(list(db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])))
Grouping
Grouping provides accumulator operations such as $sum
All groups must have an _id
. To see why this is useful imagine the following:
So far you've been using the collection world
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:
[ {"name":"Africa", "countries":[ {"name":"Algeria", "capital":"Algiers", ...}, {"name":"Angola", "capital":"Luanda", ...}, {"name":"Benin", "capital":"Porto-Novo",...}. {...}, ... ]}, {"name":"Asia", "countries":[ {"name":"Afghanistan","capital":"Kabul", ...}, {"name":"Azerbaijan", "capital":"Baku", ...}, {"name":"Bahrain", "capital":"Manama",...}, {...}, ... ]}, {...}, ... ]
The world
collection isn't like this however. It uses the following structure, which has a redundancy where continent
is repeated for each country.
[ {"name":"Afghanistan","capital":"Kabul", "continent":"Asia", ...}, {"name":"Albania", "capital":"Tirana", "continent":"Europe, ...}, {"name":"Algeria", "capital":"Algiers","contiennt":"Africa",...}, {...}, ... ]
The code to group by continent is "_id":"$continent"
If instead the question was to group by country the code would be "_id":"$name"
.
To operate over the whole document (which would have the same effect as "_id":"$name"
) "_id":"null"
or "_id":None
can be used.
$max
and $min
can be used to get the largest and smallest values in a group.
Get the smallest and largest GDPs of each continent.
pp.pprint(list( db.world.aggregate([ {"$group":{ '_id':'$continent', 'min':{"$min":"$gdp"}, 'max':{"$max":"$gdp"} }}, {"$project":{ "_id":1, "min":1, "max":1 }}, ]) ))
pp.pprint(list(db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}},])))
Some other useful aggregate functions to know are $sum
and average: $avg
This example combines all the material in these examples.
Order the continents in descending order by total GDP, Include the average GDP for each country.
pp.pprint(list( db.world.aggregate([ {"$match":{}}, {"$group":{ "_id":"$continent", "Total GDP": {"$sum": "$gdp"}, "Average GDP": {"$avg": "$gdp"} }}, {"$sort":{ "Total GDP":-1 }}, {"$project":{ "Area":"$_id", "Total GDP": 1, "Average GDP":1, "_id":0 }} ]) ))
pp.pprint(list(db.world.aggregate([{"$group":{"_id":"$continent","Total GDP":{"$sum":"$gdp"},"Average GDP":{"$avg":"$gdp"}}},{"$sort":{"Total GDP":-1}},{"$project":{"Area":"$_id","Total GDP":1,"Average GDP":1,"_id":0}}])))
Using Conditions
$cond
is similar to a CASE
statement in other languages.
It has the form "$cond": [{<comparison> :[<field or value>,<field or value>]},<true case>,<false case>]
pp.pprint(list( db.world.aggregate([ {"$group":{ "_id":{ "$cond": [{"$eq":["$continent","Eurasia"]},"Europe","$continent"] }, "area":{"$sum": "$area"} }}, {"$sort":{ "area": -1 }}, {"$project":{ "_id":1, "area":1 }} ]) ))