Difference between revisions of "AGGREGATE examples"

Latest revision as of 20:22, 1 April 2022

aggregate

The aggregate function takes a list of operations - this is the pipeline.

The data passes through each stage of the pipeline in turn.

Pipeline stages can include:

$group This is the aggregate special sauce; you specify the _id value, the output from this stage includes one entry for each distinct _id value. In SQL you would use a GROUP BY clause
$match - this acts as a filter, some data items pass through this stage, some do not. Similar to the WHERE clause in SQL
$project - this can be used to transform each element. Rather like the values on the SELECT line of an SQL query
$limit
$sort
$skip

$group

$group allows you to collect group items that share common features

_id - this determines the values to be grouped
$continent and $population refers to keys 'continent' and 'population' available in each item
$sum is an aggregating function, it takes many values in and returns a single value. Other examples of aggregating functions are $min $max, $avg

List the continents

db.world.aggregate(
    {$group: {_id: "$continent"}, pop:{$sum:'$population'}}
);

$match

$match performs queries in a similar way to find()

Show all the details for France

db.world.aggregate([
    {$match: {name: "France"}}
]);

db.world.aggregate([{$match:{name:"France"}}]);

$limit

$limit sets the amount of documents to be handed to the next stage in the pipeline.

Return the first two document

db.world.aggregate([
    {$limit: 2}
]);

db.world.aggregate([{"$limit":2}]);

$project

$project selects what fields to display.
It can also has the ability to create new fields and to compare fields against each other without using $where

Show the name and population density of all Asian countries. (population/area)

Note that "density" is a new field made from the result of dividing two existing fields, and that $divide is an aggregate function.

To avoid diving by 0 insert a $match to remove any countries with no area (Vatican City), then pipe these results through to $project
There is no need to check if values are null, MongoDB will ignore these documents.

db.world.aggregate([
    {$match: {area: {$ne: 0}, continent: "Asia"}},
    {$project: {
        _id: 0,
        name: 1,
        density: {$divide: ["$population", "$area"]}
    }}
]);

db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}]);

aggregate composition

You can have several pipeline stages, the data flows through each one in turn.

Show the name of Asian countries with a density that's over 500 people per km². (population/area)

db.world.aggregate([
   {$match: {area: {$ne: 0}, continent: "Asia"}},
   {$project: {
       _id: 0,
       name: 1,
       density: {$divide: ["$population", "$area"]}
   }},
   {$match: {density: {$gt: 500}}}
]);

db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}]);

$sort

$sort allows ordering of the results set, where 1 is ascending and -1 is descending.
Note that not including $match is the same as {"$match":{}}

Show the name of all countries in descending order.

db.world.aggregate([
    {"$project":{
        "_id":0,
        "name":1,
    }},
    {"$sort":{
        "name":-1
    }}  
]);

db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])

Grouping

Grouping provides accumulator operations such as $sum
All groups must have an _id. To see why this is useful imagine the following:

So far you've been using the world collection
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:

[    
    {"name": "Africa",
     "countries": [
         {"name": "Algeria",     "capital": "Algiers",    ...},
         {"name": "Angola",      "capital": "Luanda",     ...},
         {"name": "Benin",       "capital": "Porto-Novo", ...}.
         {...},
         ...
    ]},
    {"name": "Asia",
     "countries": [
         {"name": "Afghanistan", "capital": "Kabul",  ...},
         {"name": "Azerbaijan",  "capital": "Baku",   ...},
         {"name": "Bahrain",     "capital": "Manama", ...},
         {...},
         ...
    ]},
    {...},
    ...
]

The world collection isn't like this however. It uses the following structure, which has a redundancy where continent is repeated for each country.

[ 
    {"name": "Afghanistan", "capital": "Kabul",   "continent": "Asia",   ...},
    {"name": "Albania",     "capital": "Tirana",  "continent": "Europe",  ...},
    {"name": "Algeria",     "capital": "Algiers", "contiennt": "Africa", ...},
    {...},
    ...
]

The code to group by continent is "_id":"$continent"
If instead the question was to group by country the code would be "_id": "$name".
To operate over the whole document (which would have the same effect as "_id": "$name") "_id": "null" or "_id": None can be used.

group operators

$max and $min can be used to get the largest and smallest values in a group.

Get the smallest and largest GDPs of each continent.

db.world.aggregate([
    {$group: {
        _id: '$continent',
        min: {$min: "$gdp"},
        max: {$max: "$gdp"}
    }},
    {$project: {
        _id: 1,
        min: 1,
        max: 1
    }}
]);

db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}}]);

Some other useful aggregate functions to know are $sum and average: $avg
The example below combines previous example material.

Order the continents in descending order by total GDP, Include the average GDP for each country.

db.world.aggregate([
    {$match: {}},
    {$group: {
        _id:"$continent",
        "Total GDP": {"$sum": "$gdp"},
        "Average GDP": {"$avg": "$gdp"}
    }},
    {$sort: {
        "Total GDP":-1
    }},
    {$project:{
        "Area": "$_id",
        "Total GDP": 1,
        "Average GDP": 1,
        _id: 0
    }}
]);

db.world.aggregate([{"$group":{"_id":"$continent","Total GDP":{"$sum":"$gdp"},"Average GDP":{"$avg":"$gdp"}}},{"$sort":{"Total GDP":-1}},{"$project":{"Area":"$_id","Total GDP":1,"Average GDP":1,"_id":0}}]);

Using Conditions

$cond is similar to a CASE statement in other languages.
It has the form "$cond": [{<comparison>: [<field or value>, <field or value>]}, <true case>, <false case>]

db.world.aggregate([
    {$group: {
        _id: {
            $cond: [{"$eq": ["$continent", "Eurasia"]}, "Europe", "$continent"]
        },
        area: {$sum: "$area"}
    }},
    {$sort: {
        area: -1
    }},
    {$project: {
        _id: 1,
        area: 1
    }}
]);