SQL Aggregate Functions are like juicers. They take in records(fruits), process(blend) them and return a single value(fruit juice). SQL aggregate functions simply help us to summarize data.
What are Aggregate Functions?
They perform some calculations on a group of rows and return a single value. SQLite supports the 5 ISO standard aggregate functions namely:
- AVG
- COUNT
- MAX
- MIN
- SUM
These functions are mostly used with the GROUP BY and HAVING commands. We'll look into them in this article.
AVG Function
This calculates the mean of a numerical column. Let's calculate the average duration of tracks in our 'tracks' table.
Open the 'chinook' database and run the code below:
SELECT AVG(Milliseconds) FROM tracks;
We can do the same for the 'Bytes' column including the 'Milliseconds' column.
SELECT AVG(Milliseconds), AVG(Bytes) FROM tracks;
As you can see, we can perform operations on multiple columns at a time. It is similar to selecting few columns from a table but this time, we're performing operations on them by passing them as arguments into functions.
COUNT Function
This counts the number of rows in a column. It ignores NULL values when a column is specified.
The 'Composer' column contains NULL values. Let's see how the COUNT function ignores them.
Please note that the 'tracks' table contains 3503 records(It can be seen at the bottom of the table when you browse its data).
SELECT COUNT(Composer) FROM tracks;
It can be seen that the NULL values have been ignored as our COUNT function returns 2525 records.
We can also use our DISTINCT command to count the number of unique composers in the 'tracks' table. This way, we avoid counting a value more than once.
SELECT COUNT(DISTINCT Composer) FROM tracks;
The DISTINCT function avoids repitition. We have 852 different composers.
There's another way in which the COUNT function can be used. To count the number of rows in a table. This doesn't ignore null values.
SELECT COUNT(*) FROM tracks;
It returns 3503 records as seen in the table.
MAX Function
You already guessed what this function does. It gives us the maximum value in a column.
We calculated the duration of the longest track in the last article by using the ORDER BY command. This can be done with a one-line statement.
SELECT MAX(Milliseconds) FROM tracks;
MIN Function
We also calculated the duration of the shortest track using multiple lines of statement. This can be done with just a line.
SELECT MIN(Milliseconds) FROM tracks;
SUM Function
This sums up the values in a numerical column. Let's do something a bit different for this function.
In our last article, a customer ordered for an album with an AlbumID of 1. Let's find the duration of this album.
SELECT SUM(Milliseconds) FROM tracks WHERE AlbumId = 1;
SQL aggregate functions can be used in various creative ways.
Let's look into the AS keyword.
AS Keyword
This helps us to rename a column in our result set. It aliases the column without altering it in the database. In our aggregate functions above, the column names in our results look like MAX(Milliseconds), COUNT(*). We can format it nicely using the AS keyword.
SELECT MAX(Milliseconds) AS "Max_duration", MIN(Milliseconds) AS "Min_duration" FROM tracks;
This result set is descriptive. We'll be using aliases a lot in our future episodes.
GROUP BY Command
It provides a summary of our data by grouping rows with similar values. It is often used with aggregate functions.
One way I like to visualize the GROUP BY function is:
- It groups similar rows by specified column(s)
- Performs aggregates on the remaining column(s) specified by the SELECT command
- Returns a single result for each group
The image above groups by the genre, counts the number of rows in each group and returns a single result for each group.
When you group by a column, it's advisable to perform aggregates on the other columns. If we are to group the tracks by their albums, we might want to count the number of tracks in each album.
SELECT AlbumId, COUNT(*) AS "Number_of_tracks" FROM tracks GROUP BY AlbumId;
The above code:
- Selects the 'AlbumId' column and other columns (*)
- Groups the result set by the 'AlbumId' column
- Performs the COUNT function on the remaining columns of each group and aliases the column as 'Number_of_tracks'
Please note that COUNT(*) counts the number of rows in each group without ignoring null values.
We can use the ORDER BY command to see the albums that have the most tracks.
SELECT AlbumId, COUNT(*) AS "Number_of_tracks" FROM tracks GROUP BY AlbumId ORDER BY Number_of_tracks DESC;
There are various ways by which we can use the GROUP BY command. You'll see more of it as we progress with the SQL Series of this Data Science path.
HAVING Clause
This is similar to the WHERE clause. It is used with groups and not individual records.
The WHERE clause returns individual rows that satisfy certain condition(s) but the HAVING clause returns groups that satisfy certain condition(s).
Let's apply it to our previous example. We want albums that have more than 20 tracks.
SELECT AlbumId, COUNT(*) AS "Number_of_tracks" FROM tracks GROUP BY AlbumId HAVING Number_of_tracks > 20;
Scroll through the results, you'll find only albums with more than 20 tracks. We do not need the ORDER BY command for this part unless we want to sort result set.
CASE Statement
Let's say we are not interested in the duration of a track. We just want to know if it's long or short. The Case statement can help us do this.
We'll classify tracks less than 300,000 milliseconds as short and long otherwise. Let's select just the 'Name' and 'Milliseconds' columns for this part.
SELECT Name, CASE WHEN Milliseconds>300000 THEN "long" ELSE "short" END AS "Length" FROM tracks;
The above statement, creates a temporary column 'Length' and fills it with either 'long' or 'short' depending on which condition is satisfied in the WHEN clause. This concept might seem a bit stiff but with practice you'll understand it better
Let's try something. We want to know how many long and short tracks we have in the 'tracks' table.
We just have to add a GROUP BY command to count the number of rows in our previous query but this time, we'll select just the 'Length' column and perform aggregates on the remaining.
SELECT CASE WHEN Milliseconds>300000 THEN "long" ELSE "short" END AS "Length", COUNT(*) FROM tracks GROUP BY Length;
The above code selects the temporary column, 'Length', we created using the CASE statement, groups by that column and counts the number of rows in each group.
Let's formulate some questions and apply what we have covered so far to provide answers.
Which Artists have more than 5 Albums?
The GROUP BY and HAVING commands come in handy here.
SELECT ArtistId, COUNT(*) AS "Number_of_albums" FROM albums GROUP BY ArtistId HAVING Number_of_albums > 5;
The results contains only artists that have more than 5 albums. We only have the ArtistId in our result set. We'll incoporate artist names into our results when we start working with multiple tables.
How many customers come from each country?
The GROUP BY command and COUNT function can help us with this.
SELECT Country, COUNT(*) AS "No_of_customers" FROM customers GROUP BY Country;
We can go further by ordering from the countries with the most customers to the least. Try that.
How much was generated in each year?
We'll use the GROUP BY command, CASE statement and LIKE operator for this problem.
We'll follow these steps to get our final result:
- Create a temporary column, 'Year' with the CASE statement and perform the SUM aggregate function on the 'Total' column
- GROUP BY the 'Year' Column
SELECT CASE WHEN InvoiceDate LIKE "2009%" THEN 2009 WHEN InvoiceDate LIKE "2010%" THEN 2010 WHEN InvoiceDate LIKE "2011%" THEN 2011 WHEN InvoiceDate LIKE "2012%" THEN 2012 WHEN InvoiceDate LIKE "2013%" THEN 2013 END AS "Year", SUM(Total) FROM invoices GROUP BY Year;
We used the LIKE operator to find matches for each year, AS keyword to alias the column as 'Year' and grouped by the aliased column. 2010 is the year with the highest revenue.
Please note that the END keyword is necessary for the CASE statement. The ELSE clause is not compulsory. The statement fills NULL values if no condition is satisfied.
Takeaway
It can seem repititive to type in the column names whenever we want to group and order our result set. We can use numbers to represent columns in these commands where each number represents the position of the column in the SELECT command.
SELECT AlbumId, SUM(Milliseconds) AS "Duration" FROM tracks GROUP BY AlbumId ORDER BY Duration;
The above code can be written as:
SELECT AlbumId, SUM(Milliseconds) AS "Duration" FROM tracks GROUP BY 1 ORDER BY 2;
1 represents the first selected column and 2 represents the second column.
Try these concepts with different questions and share your insights in the comments section below.
No comments: