Cookies help us deliver our services. By using our services, you agree to our use of cookies. More information

Difference between revisions of "AGGREGATE examples"

From NoSQLZoo
Jump to: navigation, search
 
(45 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<pre class=setup>
+
==aggregate==
#ENCODING
+
The aggregate function takes a list of operations - this is the pipeline.
import io
 
import sys
 
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16')
 
#MONGO
 
from pymongo import MongoClient
 
client = MongoClient()
 
client.progzoo.authenticate('scott','tiger')
 
db = client['progzoo']
 
#PRETTY
 
import pprint
 
pp = pprint.PrettyPrinter(indent=4)
 
</pre>
 
  
==Introducing the aggregation framework==
+
The data passes through each stage of the pipeline in turn.
These examples introduce the aggregation framework and its operators. Again we will be using the collection <code>world</code>
+
 
 +
Pipeline stages can include:
 +
* '''$group''' This is the aggregate special sauce; you specify the _id value, the output from this stage includes one entry for each distinct _id value. In SQL you would use a GROUP BY clause
 +
* '''$match''' - this acts as a filter, some data items pass through this stage, some do not. Similar to the WHERE clause in SQL
 +
* '''$project''' - this can be used to transform each element. Rather like the values on the SELECT line of an SQL query
 +
* '''$limit'''
 +
* '''$sort'''
 +
* '''$skip'''
 
<div class='extra_space' style='width:1em; height:6em;'></div>
 
<div class='extra_space' style='width:1em; height:6em;'></div>
 +
==$group==
 +
<div class="q" data-lang="mongo">
 +
'''$group''' allows you to collect group items that share common features
 +
* _id - this determines the values to be grouped
 +
* $continent and $population refers to keys 'continent' and 'population' available in each item
 +
* $sum is an aggregating function, it takes many values in and returns a single value. Other examples of aggregating functions are $min $max, $avg
 +
<p class="strong">List the continents</p>
 +
<pre class="def"><nowiki>
 +
db.world.aggregate(
 +
    {$group: {_id: "$continent"}, pop:{$sum:'$population'}}
 +
);</nowiki></pre>
 +
</div>
  
<div class=q data-lang="py3">
+
==$match==
<code>$match</code> Allows us to perform queries in a similar way to find()
+
<div class="q" data-lang="mongo">
<p class=strong>Show all the details for France</p>
+
'''$match''' performs queries in a similar way to <syntaxhighlight lang="JavaScript" inline>find()</syntaxhighlight>
<pre class=def>
+
<p class="strong">Show all the details for France</p>
pp.pprint(list(
+
<pre class="def"><nowiki>
    db.world.aggregate([
+
db.world.aggregate([
        {"$match":{"name":"France"}}
+
    {$match: {name: "France"}}
    ])
+
]);</nowiki></pre>
))
+
<pre class="ans"><nowiki>db.world.aggregate([{$match:{name:"France"}}]);</nowiki></pre>
</pre>
 
<div class=ans>
 
pp.pprint(list(db.world.aggregate([{"$match":{"name":"France"}}])))
 
 
</div>
 
</div>
 +
 +
==$limit==
 +
<div class="q" data-lang="mongo">'''$limit''' sets the amount of documents to be handed to the next stage in the pipeline.
 +
<p class="strong">Return the first two document</p>
 +
<pre class="def"><nowiki>
 +
db.world.aggregate([
 +
    {$limit: 2}
 +
]);</nowiki></pre>
 +
<pre class="ans"><nowiki>db.world.aggregate([{"$limit":2}]);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3">
+
==$project==
<code>$project</code> Allows us to select what fields to display.<br/>
+
<div class="q" data-lang="mongo">
It can also has the ability to insert new fields and allows you to compare fields against each other without using <code>$where</code>
+
'''$project''' selects what fields to display.<br/>
<p class=strong>Show the name and population density of all Asian countries. (population/area)</p>
+
It can also has the ability to create new fields and to compare fields against each other without using '''$where'''
Note that "density" is a new field, made from the result of dividing two existing fields, and that <code>$divide</code> is an aggregate function.
+
<p class="strong">Show the name and population density of all Asian countries. (population/area)</p>
<div class=hint title="Dealing with division by 0">
+
Note that "density" is a new field made from the result of dividing two existing fields, and that '''$divide''' is an aggregate function.
To avoid diving by 0 we do a <code>$match</code> to remove any countries with 0 area (Vatican City), then pipe these results through to <code>$project</code>  
+
<div class="hint" title="Dealing with division by 0">
 +
To avoid diving by 0 insert a '''$match''' to remove any countries with no area (Vatican City), then pipe these results through to '''$project'''<br/>
 +
There is no need to check if values are <syntaxhighlight lang="JavaScript" inline>null</syntaxhighlight>, MongoDB will ignore these documents.
 
</div>
 
</div>
<pre class=def>
+
<pre class="def"><nowiki>
pp.pprint(list(
+
db.world.aggregate([
    db.world.aggregate([
+
    {$match: {area: {$ne: 0}, continent: "Asia"}},
        {"$match":{"area":{"$ne":0},"continent":"Asia"}},
+
    {$project: {
        {"$project":{
+
        _id: 0,
            "_id":0,
+
        name: 1,
            "name":1,
+
        density: {$divide: ["$population", "$area"]}
            "density": {"$divide": ["$population","$area"]}
+
    }}
        }}
+
]);</nowiki></pre>
    ])
+
<pre class="ans"><nowiki>db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}]);</nowiki></pre>
))
 
</pre>
 
<div class=ans>
 
pp.pprint(list(db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}])))
 
 
</div>
 
</div>
 +
 +
==aggregate composition==
 +
<div class="q" data-lang="mongo">
 +
You can have several pipeline stages, the data flows through each one in turn.
 +
<p class="strong">Show the name of Asian countries with a density that's over 500 people per km<sup>2</sup>. (population/area)</p>
 +
<pre class="def"><nowiki>
 +
db.world.aggregate([
 +
  {$match: {area: {$ne: 0}, continent: "Asia"}},
 +
  {$project: {
 +
      _id: 0,
 +
      name: 1,
 +
      density: {$divide: ["$population", "$area"]}
 +
  }},
 +
  {$match: {density: {$gt: 500}}}
 +
]);</nowiki></pre>
 +
<pre class="ans"><nowiki>db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}]);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3">
+
==$sort==
<code>$sort</code> Allows us to choose how the results are displayed, where 1 is ascending and -1 is descending.<br/>
+
<div class="q" data-lang="mongo">
Note that excluding <code>$match</code> is the same as <code>{"$match":{}}</code>
+
'''$sort''' allows ordering of the results set, where 1 is ascending and -1 is descending.<br/>
<p class=strong>Show the name of all countries in descending order.</p>
+
Note that not including '''$match''' is the same as <syntaxhighlight lang="JavaScript" inline>{"$match":{}}</syntaxhighlight>
<pre class=def>
+
<p class="strong">Show the name of all countries in descending order.</p>
pp.pprint(list(
+
<pre class="def"><nowiki>
    db.world.aggregate([
+
db.world.aggregate([
        {"$project":{
+
    {"$project":{
            "_id":0,
+
        "_id":0,
            "name":1,
+
        "name":1,
        }},
+
    }},
        {"$sort":{
+
    {"$sort":{
            "name":-1
+
        "name":-1
        }}   
+
    }}   
    ])
+
]);</nowiki></pre>
))
+
<pre class="ans"><nowiki>db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])</nowiki></pre>
</pre>
 
<div class=ans>
 
pp.pprint(list(db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])))
 
</div>
 
 
</div>
 
</div>
  
 
==Grouping==
 
==Grouping==
Grouping allows us to use accumulator operations sum as <code>$sum</code><br />
+
Grouping provides accumulator operations such as '''$sum'''<br />
All groups must have an <code>_id</code>. To accumulate over all the results you can just use <code>null</code>
+
All groups must have an '''_id'''. To see why this is useful imagine the following:<br/><br/>
<div class='extra_space' style='width:1em; height:6em;'></div>
+
 
<div class=q data-lang="py3">
+
So far you've been using the '''world''' collection</code><br/>
<code>$max</code> and <code>$min</code> can be used to get the largest and smallest values in a group.
+
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:
<p class=strong>Get the smallest and largest GDPs.</p>
+
<syntaxhighlight lang="JavaScript">
<pre class=def>
+
[   
pp.pprint(list(
+
    {"name": "Africa",
    db.world.aggregate([
+
    "countries": [
        {"$group":{
+
        {"name": "Algeria",    "capital": "Algiers",    ...},
            '_id':'null',
+
        {"name": "Angola",      "capital": "Luanda",    ...},
            'min':{"$min":"$gdp"},
+
        {"name": "Benin",      "capital": "Porto-Novo", ...}.
            'max':{"$max":"$gdp"},
+
        {...},
        }},
+
        ...
        {"$project":{
+
    ]},
          "_id":0,
+
    {"name": "Asia",
          "min":1,
+
    "countries": [
          "max":1
+
        {"name": "Afghanistan", "capital": "Kabul",  ...},
        }},
+
        {"name": "Azerbaijan",  "capital": "Baku",  ...},
    ])
+
        {"name": "Bahrain",    "capital": "Manama", ...},
))
+
        {...},
</pre>
+
        ...
<div class=ans>
+
    ]},
pp.pprint(list(db.world.aggregate([{"$group":{'_id':'null','min':{"$min":"$gdp"},'max':{"$max":"$gdp"},}},{"$project":{"_id":0,"min":1,"max":1}},])))
+
    {...},
 +
    ...
 +
]
 +
</syntaxhighlight>
 +
 
 +
The '''world''' collection isn't like this however. It uses the following structure, which has a redundancy where '''continent''' is repeated for each country.
 +
 
 +
<syntaxhighlight lang="JavaScript">
 +
[
 +
    {"name": "Afghanistan", "capital": "Kabul",  "continent": "Asia",  ...},
 +
    {"name": "Albania",    "capital": "Tirana",  "continent": "Europe",  ...},
 +
    {"name": "Algeria",    "capital": "Algiers", "contiennt": "Africa", ...},
 +
    {...},
 +
    ...
 +
]
 +
</syntaxhighlight>
 +
The code to group by continent is <syntaxhighlight lang="JavaScript" inline>"_id":"$continent"</syntaxhighlight><br/>
 +
If instead the question was to group by country the code would be <syntaxhighlight lang="JavaScript" inline>"_id": "$name"</syntaxhighlight>.<br/>
 +
To operate over the whole document (which would have the same effect as <syntaxhighlight lang="JavaScript" inline>"_id": "$name"</syntaxhighlight>) <syntaxhighlight lang="JavaScript" inline>"_id": "null"</syntaxhighlight> or <syntaxhighlight lang="JavaScript" inline>"_id": None</syntaxhighlight> can be used.
 +
==group operators==
 +
<div class="q" data-lang="mongo">  
 +
'''$max''' and '''$min''' can be used to get the largest and smallest values in a group.
 +
<p class="strong">Get the smallest and largest GDPs of each continent.</p>
 +
<pre class="def"><nowiki>
 +
db.world.aggregate([
 +
    {$group: {
 +
        _id: '$continent',
 +
        min: {$min: "$gdp"},
 +
        max: {$max: "$gdp"}
 +
    }},
 +
    {$project: {
 +
        _id: 1,
 +
        min: 1,
 +
        max: 1
 +
    }}
 +
]);</nowiki></pre>
 +
<pre class="ans"><nowiki>db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}}]);</nowiki></pre>
 
</div>
 
</div>
 +
 +
<div class="q" data-lang="mongo">
 +
Some other useful aggregate functions to know are '''$sum''' and average: '''$avg'''<br/>
 +
The example below  combines previous example material.
 +
<p class="strong">Order the continents in descending order by total GDP, Include the average GDP for each country.</p>
 +
<pre class="def"><nowiki>
 +
db.world.aggregate([
 +
    {$match: {}},
 +
    {$group: {
 +
        _id:"$continent",
 +
        "Total GDP": {"$sum": "$gdp"},
 +
        "Average GDP": {"$avg": "$gdp"}
 +
    }},
 +
    {$sort: {
 +
        "Total GDP":-1
 +
    }},
 +
    {$project:{
 +
        "Area": "$_id",
 +
        "Total GDP": 1,
 +
        "Average GDP": 1,
 +
        _id: 0
 +
    }}
 +
]);</nowiki></pre>
 +
<pre class="ans"><nowiki>db.world.aggregate([{"$group":{"_id":"$continent","Total GDP":{"$sum":"$gdp"},"Average GDP":{"$avg":"$gdp"}}},{"$sort":{"Total GDP":-1}},{"$project":{"Area":"$_id","Total GDP":1,"Average GDP":1,"_id":0}}]);</nowiki></pre>
 
</div>
 
</div>
  
<div class=q data-lang="py3">
+
<div class="q" data-lang="mongo">
In the previous example we can't use <code>$name<code> to get the names of the countries with the smallest and largest values as we lost the ability to associate documents when we set the <code>_id</code> to <code>null</code> to perform a grouping<br/>
+
Using Conditions<br/><br/>
If we want to do this, a simple way is to use <code>sort=[("uid", -1)]</code> inside a <code>find_one()</code> statement. Performance will be improved if these fields have previously been indexed.
+
'''$cond''' is similar to a '''CASE''' statement in other languages.<br/>
<p class=strong>Get the names and GDPs of the two countries with the smallest and largest GDPs.</p>
+
It has the form <syntaxhighlight lang="JavaScript" inline>"$cond": [{<comparison>: [<field or value>, <field or value>]}, <true case>, <false case>]</syntaxhighlight><br/><br/>
<div class="hint" title="Avoiding null results">
+
<pre class="def"><nowiki>
It is possible that we will occasionally encounter <code>null</code> values in a data collection. This could be either where the field is simply not included on a certain document, or a document has a redundancy where the field is spesified as null, none, etc.<br/>To deal with this we can use <code>{<field>: {"$ne": None}}</code> to prevent any null values from being included.
+
db.world.aggregate([
</div>
+
    {$group: {
<pre class=def>
+
        _id: {
pp.pprint(
+
            $cond: [{"$eq": ["$continent", "Eurasia"]}, "Europe", "$continent"]
    db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)])
+
        },
)
+
        area: {$sum: "$area"}
pp.pprint(
+
    }},
    db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)])
+
    {$sort: {
)
+
        area: -1
</pre>
+
    }},
<div class=ans>
+
    {$project: {
pp.pprint(db.world.find_one({"gdp":{"$ne":None}},{"name":1,"gdp":1,"_id":0},sort=[("gdp", 1)])
+
        _id: 1,
pp.pprint(db.world.find_one({},{"name":1,"gdp":1,"_id":0},sort=[("gdp", -1)]))
+
        area: 1
</div>
+
    }}
 +
]);</nowiki></pre>
 
</div>
 
</div>

Latest revision as of 20:22, 1 April 2022

aggregate

The aggregate function takes a list of operations - this is the pipeline.

The data passes through each stage of the pipeline in turn.

Pipeline stages can include:

  • $group This is the aggregate special sauce; you specify the _id value, the output from this stage includes one entry for each distinct _id value. In SQL you would use a GROUP BY clause
  • $match - this acts as a filter, some data items pass through this stage, some do not. Similar to the WHERE clause in SQL
  • $project - this can be used to transform each element. Rather like the values on the SELECT line of an SQL query
  • $limit
  • $sort
  • $skip

$group

$group allows you to collect group items that share common features

  • _id - this determines the values to be grouped
  • $continent and $population refers to keys 'continent' and 'population' available in each item
  • $sum is an aggregating function, it takes many values in and returns a single value. Other examples of aggregating functions are $min $max, $avg

List the continents

db.world.aggregate(
    {$group: {_id: "$continent"}, pop:{$sum:'$population'}}
);

$match

$match performs queries in a similar way to find()

Show all the details for France

db.world.aggregate([
    {$match: {name: "France"}}
]);
db.world.aggregate([{$match:{name:"France"}}]);

$limit

$limit sets the amount of documents to be handed to the next stage in the pipeline.

Return the first two document

db.world.aggregate([
    {$limit: 2}
]);
db.world.aggregate([{"$limit":2}]);

$project

$project selects what fields to display.
It can also has the ability to create new fields and to compare fields against each other without using $where

Show the name and population density of all Asian countries. (population/area)

Note that "density" is a new field made from the result of dividing two existing fields, and that $divide is an aggregate function.

To avoid diving by 0 insert a $match to remove any countries with no area (Vatican City), then pipe these results through to $project
There is no need to check if values are null, MongoDB will ignore these documents.

db.world.aggregate([
    {$match: {area: {$ne: 0}, continent: "Asia"}},
    {$project: {
        _id: 0,
        name: 1,
        density: {$divide: ["$population", "$area"]}
    }}
]);
db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}}]);

aggregate composition

You can have several pipeline stages, the data flows through each one in turn.

Show the name of Asian countries with a density that's over 500 people per km2. (population/area)

db.world.aggregate([
   {$match: {area: {$ne: 0}, continent: "Asia"}},
   {$project: {
       _id: 0,
       name: 1,
       density: {$divide: ["$population", "$area"]}
   }},
   {$match: {density: {$gt: 500}}}
]);
db.world.aggregate([{"$match":{"area":{"$ne":0},"continent":"Asia"}},{"$project":{"_id":0,"name":1,"density":{"$divide":["$population","$area"]}}},{"$match":{"density":{"$gt":500}}}]);

$sort

$sort allows ordering of the results set, where 1 is ascending and -1 is descending.
Note that not including $match is the same as {"$match":{}}

Show the name of all countries in descending order.

db.world.aggregate([
    {"$project":{
        "_id":0,
        "name":1,
    }},
    {"$sort":{
        "name":-1
    }}  
]);
db.world.aggregate([{"$project":{"_id":0,"name":1,}},{"$sort":{"name":-1}}])

Grouping

Grouping provides accumulator operations such as $sum
All groups must have an _id. To see why this is useful imagine the following:

So far you've been using the world collection
As every country has a continent, it would make sense to have countries as a nested document inside continents: e.g:

[    
    {"name": "Africa",
     "countries": [
         {"name": "Algeria",     "capital": "Algiers",    ...},
         {"name": "Angola",      "capital": "Luanda",     ...},
         {"name": "Benin",       "capital": "Porto-Novo", ...}.
         {...},
         ...
    ]},
    {"name": "Asia",
     "countries": [
         {"name": "Afghanistan", "capital": "Kabul",  ...},
         {"name": "Azerbaijan",  "capital": "Baku",   ...},
         {"name": "Bahrain",     "capital": "Manama", ...},
         {...},
         ...
    ]},
    {...},
    ...
]

The world collection isn't like this however. It uses the following structure, which has a redundancy where continent is repeated for each country.

[ 
    {"name": "Afghanistan", "capital": "Kabul",   "continent": "Asia",   ...},
    {"name": "Albania",     "capital": "Tirana",  "continent": "Europe",  ...},
    {"name": "Algeria",     "capital": "Algiers", "contiennt": "Africa", ...},
    {...},
    ...
]

The code to group by continent is "_id":"$continent"
If instead the question was to group by country the code would be "_id": "$name".
To operate over the whole document (which would have the same effect as "_id": "$name") "_id": "null" or "_id": None can be used.

group operators

$max and $min can be used to get the largest and smallest values in a group.

Get the smallest and largest GDPs of each continent.

db.world.aggregate([
    {$group: {
        _id: '$continent',
        min: {$min: "$gdp"},
        max: {$max: "$gdp"}
    }},
    {$project: {
        _id: 1,
        min: 1,
        max: 1
    }}
]);
db.world.aggregate([{"$group":{'_id':'$continent','min':{"$min":"$gdp"},'max':{"$max":"$gdp"}}},{"$project":{"_id":1,"min":1,"max":1}}]);

Some other useful aggregate functions to know are $sum and average: $avg
The example below combines previous example material.

Order the continents in descending order by total GDP, Include the average GDP for each country.

db.world.aggregate([
    {$match: {}},
    {$group: {
        _id:"$continent",
        "Total GDP": {"$sum": "$gdp"},
        "Average GDP": {"$avg": "$gdp"}
    }},
    {$sort: {
        "Total GDP":-1
    }},
    {$project:{
        "Area": "$_id",
        "Total GDP": 1,
        "Average GDP": 1,
        _id: 0
    }}
]);
db.world.aggregate([{"$group":{"_id":"$continent","Total GDP":{"$sum":"$gdp"},"Average GDP":{"$avg":"$gdp"}}},{"$sort":{"Total GDP":-1}},{"$project":{"Area":"$_id","Total GDP":1,"Average GDP":1,"_id":0}}]);

Using Conditions

$cond is similar to a CASE statement in other languages.
It has the form "$cond": [{<comparison>: [<field or value>, <field or value>]}, <true case>, <false case>]

db.world.aggregate([
    {$group: {
        _id: {
            $cond: [{"$eq": ["$continent", "Eurasia"]}, "Europe", "$continent"]
        },
        area: {$sum: "$area"}
    }},
    {$sort: {
        area: -1
    }},
    {$project: {
        _id: 1,
        area: 1
    }}
]);