Learn How to Use SQL Aggregate Functions

SQL Aggregate Functions are like juicers. They take in records(fruits), process(blend) them and return a single value(fruit juice). SQL aggregate functions simply help us to summarize data.

What are Aggregate Functions?

They perform some calculations on a group of rows and return a single value. SQLite supports the 5 ISO standard aggregate functions namely:

AVG
COUNT
MAX
MIN
SUM

These functions are mostly used with the GROUP BY and HAVING commands. We'll look into them in this article.

AVG Function

This calculates the mean of a numerical column. Let's calculate the average duration of tracks in our 'tracks' table.

Open the 'chinook' database and run the code below:

SELECT AVG(Milliseconds) FROM tracks;

We can do the same for the 'Bytes' column including the 'Milliseconds' column.

SELECT AVG(Milliseconds), AVG(Bytes) FROM tracks;

As you can see, we can perform operations on multiple columns at a time. It is similar to selecting few columns from a table but this time, we're performing operations on them by passing them as arguments into functions.

COUNT Function

This counts the number of rows in a column. It ignores NULL values when a column is specified.

The 'Composer' column contains NULL values. Let's see how the COUNT function ignores them.

Please note that the 'tracks' table contains 3503 records(It can be seen at the bottom of the table when you browse its data).

SELECT COUNT(Composer) FROM tracks;

It can be seen that the NULL values have been ignored as our COUNT function returns 2525 records.

We can also use our DISTINCT command to count the number of unique composers in the 'tracks' table. This way, we avoid counting a value more than once.

SELECT COUNT(DISTINCT Composer) FROM tracks;

The DISTINCT function avoids repitition. We have 852 different composers.

There's another way in which the COUNT function can be used. To count the number of rows in a table. This doesn't ignore null values.

SELECT COUNT(*) FROM tracks;

It returns 3503 records as seen in the table.

MAX Function

You already guessed what this function does. It gives us the maximum value in a column.

We calculated the duration of the longest track in the last article by using the ORDER BY command. This can be done with a one-line statement.

SELECT MAX(Milliseconds) FROM tracks;

MIN Function

We also calculated the duration of the shortest track using multiple lines of statement. This can be done with just a line.

SELECT MIN(Milliseconds) FROM tracks;

SUM Function

This sums up the values in a numerical column. Let's do something a bit different for this function.

In our last article, a customer ordered for an album with an AlbumID of 1. Let's find the duration of this album.

SELECT SUM(Milliseconds) FROM tracks
WHERE AlbumId = 1;

SQL aggregate functions can be used in various creative ways.

Let's look into the AS keyword.

AS Keyword

This helps us to rename a column in our result set. It aliases the column without altering it in the database. In our aggregate functions above, the column names in our results look like MAX(Milliseconds), COUNT(*). We can format it nicely using the AS keyword.

SELECT MAX(Milliseconds) AS "Max_duration",
MIN(Milliseconds) AS "Min_duration" 
FROM tracks;

This result set is descriptive. We'll be using aliases a lot in our future episodes.

GROUP BY Command

It provides a summary of our data by grouping rows with similar values. It is often used with aggregate functions.

One way I like to visualize the GROUP BY function is:

It groups similar rows by specified column(s)
Performs aggregates on the remaining column(s) specified by the SELECT command
Returns a single result for each group

The image above groups by the genre, counts the number of rows in each group and returns a single result for each group.

When you group by a column, it's advisable to perform aggregates on the other columns. If we are to group the tracks by their albums, we might want to count the number of tracks in each album.

SELECT AlbumId, COUNT(*) AS "Number_of_tracks"
FROM tracks
GROUP BY AlbumId;

The above code:

Selects the 'AlbumId' column and other columns (*)
Groups the result set by the 'AlbumId' column
Performs the COUNT function on the remaining columns of each group and aliases the column as 'Number_of_tracks'

Please note that COUNT(*) counts the number of rows in each group without ignoring null values.

We can use the ORDER BY command to see the albums that have the most tracks.

SELECT AlbumId, COUNT(*) AS "Number_of_tracks"
FROM tracks
GROUP BY AlbumId
ORDER BY Number_of_tracks DESC;

There are various ways by which we can use the GROUP BY command. You'll see more of it as we progress with the SQL Series of this Data Science path.

HAVING Clause

This is similar to the WHERE clause. It is used with groups and not individual records.

The WHERE clause returns individual rows that satisfy certain condition(s) but the HAVING clause returns groups that satisfy certain condition(s).

Let's apply it to our previous example. We want albums that have more than 20 tracks.

SELECT AlbumId, COUNT(*) AS "Number_of_tracks"
FROM tracks
GROUP BY AlbumId
HAVING Number_of_tracks > 20;

Scroll through the results, you'll find only albums with more than 20 tracks. We do not need the ORDER BY command for this part unless we want to sort result set.

CASE Statement

Let's say we are not interested in the duration of a track. We just want to know if it's long or short. The Case statement can help us do this.

We'll classify tracks less than 300,000 milliseconds as short and long otherwise. Let's select just the 'Name' and 'Milliseconds' columns for this part.

SELECT Name,
CASE
WHEN Milliseconds>300000 THEN "long"
ELSE "short"
END AS "Length"
FROM tracks;

The above statement, creates a temporary column 'Length' and fills it with either 'long' or 'short' depending on which condition is satisfied in the WHEN clause. This concept might seem a bit stiff but with practice you'll understand it better

Let's try something. We want to know how many long and short tracks we have in the 'tracks' table.

We just have to add a GROUP BY command to count the number of rows in our previous query but this time, we'll select just the 'Length' column and perform aggregates on the remaining.

SELECT
CASE
WHEN Milliseconds>300000 THEN "long"
ELSE "short"
END AS "Length",
COUNT(*) FROM tracks
GROUP BY Length;

The above code selects the temporary column, 'Length', we created using the CASE statement, groups by that column and counts the number of rows in each group.

Let's formulate some questions and apply what we have covered so far to provide answers.

Which Artists have more than 5 Albums?

The GROUP BY and HAVING commands come in handy here.

SELECT ArtistId, COUNT(*) AS "Number_of_albums"
FROM albums
GROUP BY ArtistId
HAVING Number_of_albums > 5;

The results contains only artists that have more than 5 albums. We only have the ArtistId in our result set. We'll incoporate artist names into our results when we start working with multiple tables.

How many customers come from each country?

The GROUP BY command and COUNT function can help us with this.

SELECT  Country, COUNT(*) AS "No_of_customers"
FROM customers
GROUP BY Country;

We can go further by ordering from the countries with the most customers to the least. Try that.

How much was generated in each year?

We'll use the GROUP BY command, CASE statement and LIKE operator for this problem.

We'll follow these steps to get our final result:

Create a temporary column, 'Year' with the CASE statement and perform the SUM aggregate function on the 'Total' column
GROUP BY the 'Year' Column

SELECT
CASE
WHEN InvoiceDate LIKE "2009%" THEN 2009
WHEN InvoiceDate LIKE "2010%" THEN 2010
WHEN InvoiceDate LIKE "2011%" THEN 2011
WHEN InvoiceDate LIKE "2012%" THEN 2012
WHEN InvoiceDate LIKE "2013%" THEN 2013 
END AS "Year", SUM(Total)
FROM invoices
GROUP BY Year;

We used the LIKE operator to find matches for each year, AS keyword to alias the column as 'Year' and grouped by the aliased column. 2010 is the year with the highest revenue.

Please note that the END keyword is necessary for the CASE statement. The ELSE clause is not compulsory. The statement fills NULL values if no condition is satisfied.

Takeaway

It can seem repititive to type in the column names whenever we want to group and order our result set. We can use numbers to represent columns in these commands where each number represents the position of the column in the SELECT command.

SELECT AlbumId, SUM(Milliseconds) AS "Duration"
FROM tracks
GROUP BY AlbumId
ORDER BY Duration;

The above code can be written as:

SELECT AlbumId, SUM(Milliseconds) AS "Duration"
FROM tracks
GROUP BY 1
ORDER BY 2;

1 represents the first selected column and 2 represents the second column.

Try these concepts with different questions and share your insights in the comments section below.

Learn How to Use SQL Aggregate Functions

What are Aggregate Functions?

AVG Function

COUNT Function

MAX Function

MIN Function

SUM Function

AS Keyword

GROUP BY Command

HAVING Clause

CASE Statement

Which Artists have more than 5 Albums?

How many customers come from each country?

How much was generated in each year?

Takeaway

You May Also Like

No comments:

Popular Posts

Blog Archive

Learn How to Use SQL Aggregate Functions

What are Aggregate Functions?

AVG Function

COUNT Function

MAX Function

MIN Function

SUM Function

AS Keyword

GROUP BY Command

HAVING Clause

CASE Statement

Which Artists have more than 5 Albums?

How many customers come from each country?

How much was generated in each year?

Takeaway

You May Also Like

SQL Churn Rate and Funnel Analysis

SQL JOIN Commands: Working with Multiple Tables