SQL Queries are like magic wand for Data Scientists. You should learn how to write SQL statements to query databases for information that can provide answers or insights.
But what are SQL queries and how are they different from SQL statements?
As mentioned in this article, a statement is a valid command in SQL. Every piece of code that SQL understands is a statement. A statement ends with semi-colon(;).
What is an SQL Query?
This is a statement that returns a record set(table). SQL Statements that display results are called queries. The SELECT statement is a query.
It's safe to say all queries are statements but not all statements are queries.
In this episode, You'll learn some new commands.
You'll also write simple queries using some of these commands including the ones you learned previously. In case you missed them:
- SQL for Data Science: Create Your First Database and Table
- Learn Basic SQL Commands for Data Science
Open your DB Browser and open the 'chinook' database. Click on the 'Browse Data' button at the top bar and inspect the columns of the tables.
You will notice the database has been nicely organized into tables that relate to each other.
Open the 'tracks' table and inspect the columns.
You should see the 'TrackId', 'AlbumId', 'MediaTypeId', and 'GenreId' columns. What are these columns and why are they filled with integer values?
'TrackId' is a primary key column in the 'tracks' table while 'AlbumId', 'MediaTypeId', and 'GenreId' are foreign key columns.
What is a Primary Key?
A primary key in a table is a unique identifier of a row in the table. Each row in a table can have Id's which are unique to them. There can only be one primary key column in a table.
A Primary Key constraint can be specified on a column in the creation of a table.
CREATE TABLE songs( id INTEGER PRIMARY KEY, title TEXT, artiste TEXT );
The 'id' column in the 'songs' table has been chosen as the primary key column. Multiple rows in this table cannot have the same primary key value(same value for the 'id' column).
INSERT INTO songs VALUES(1, "Ye", "Burna Boy"); INSERT INTO songs VALUES(1, "1 Milli", "Davido");
The first statement will run sucessfully.
But the second statement will throw an error as it tries to insert a record with an 'id' value(i.e 1) that already exists in the table.
In a nutshell, a primary key is a unique identifier for a row in a table.
What is a Foreign Key?
A foreign key in a table is the primary key in another table. Click on the 'Browse Data' button and check the 'albums' table.
You can see that the first column in the table is 'AlbumId'. The 'AlbumId' column is a primary key in the 'albums' table but a foreign key in other tables.
Same thing goes for the 'MediaTypeId', and 'GenreId' columns in the 'tracks' table. They are primary keys in their respective tables.
You'll see more of the Primary Key and Foreign Key when we start working with multiple tables in future sessions.
Now, open the 'tracks' table.
There is a table for artists where you can see all the possible names of artists for every track in the 'tracks' table. But we don't know all the possible composers? Do we have to scroll down the 'tracks' table to pick their names?
No. This is what the DISTINCT command can do seamlessly.
DISTINCT Command
Run the below code in your code editor.
SELECT DISTINCT Composer FROM tracks;
This code selects all the distinct values from the 'Composer' column in the tracks table. Scroll down and you'll see that there are no repititions.
LIMIT Command
You might want to limit your record set to a specified number of rows.
SELECT * FROM tracks LIMIT 5;
As you can see, There are only 5 rows as specified in the LIMIT clause
ORDER BY Command
You can sort your record set in ascending or descending order.
Let's sort by the track's name in ascending order
SELECT * FROM tracks ORDER BY Name ASC;
As you can see, it has been sorted. Symbols and numbers have precedence over alphabets.
Let's sort in descending order too
SELECT * FROM tracks ORDER BY Name DESC;
Please note, we sorted these records by the 'Name' column. We can use any column to sort, maybe the duration of tracks.
LIKE Operator
Sometimes, you only know what a particular column value in a row looks like. The LIKE operator can help you find matches for a pattern.
The LIKE operator is used with wildcard values '_' and '%'.
Wildcard '_' represents any single character. 'Data' and 'Date' will both match 'Dat_'. 'Seven' and '7even' will both match '_even'.
Wildcard '%' represents zero or more unspecified character(s). 'Data', 'Data Science', and 'Data Analysis' will all match 'Data%'. The pattern 'Data%' simply means 'Data + any length of character(s)'.
Let's use the wildcard '%' to filter track names that start with 'Love' from our 'tracks' table
SELECT * FROM tracks WHERE Name LIKE "Love%";
As you can see, we have track names starting with 'Love'
What if we want track names that contain the word 'Love'. We might want track names that end with 'Love' too.
SELECT * FROM tracks WHERE Name LIKE "%Love%";
Scroll through the result set to see the matches
The wildcard '%' will match zero or more character(s) wherever it is placed.
BETWEEN Operator
This can be used to filter records by a range of values. These values can be text, numbers or date.
Let's filter the 'tracks' table to include only tracks names starting with letters between 'A' and 'C'
SELECT * FROM tracks WHERE Name BETWEEN "A" AND "C";
You would notice there are no track names starting with letter 'C'. This is how the BETWEEN operator works with text values. It finds records from the start up to the end but not including the end parameter.
In this case, our BETWEEN operator finds track names starting with 'A' and 'B' but not 'C'. If we would like to include track names starting with 'C' we have to change our clause to "BETWEEN 'A' AND 'D'"
This behavior is different for number and date values. It includes the end parameter in the record set.
Let's find the records with 'TrackId' between 20 and 100.
SELECT * FROM tracks WHERE TrackId BETWEEN 20 AND 100;
Scroll down the table and you'll see the track with an ID of 100
NULL Values
If you check through the 'Composer' column in the 'tracks' table, you'll find Null values which represent missing values.
You might want to view the tracks that have missing values for the 'Composer' column.
Let's select just the 'Name' and 'Composer' columns.
SELECT Name, Composer FROM tracks WHERE Composer IS NULL;
You might also want to view the tracks that have non-missing values for the 'Composer' column.
SELECT Name, Composer FROM tracks WHERE Composer IS NOT NULL;
So far, you learned some new SQL commands.
Now, Let's formulate some questions and run queries to find answers to them.
- What countries are the customers from?
- What transactions were billed in the month of February?
- 'For Those About To Rock We Salute You' has been ordered by a customer. What tracks are in this album?
- What is the name, length and size of the shortest track?
- What is the name, length and size of the longest track?
What Countries are the Customers from?
To answer this question, you simply need to run the distinct command on the 'Country' column in the 'customers' table.
SELECT DISTINCT Country FROM customers;
Scroll down the table and you'll discover that the customers are from 24 different countries.
What Transactions were billed in February?
To answer this you would need to see the format of the 'InvoiceDate' column in the 'invoices' table.
It is clear that the date column is in the format 'YYYY-MM-DD HH-MM-SS' where 'MM' represents the month.
To find transactions from a particular month, you'll use the LIKE operator with the wildcard '%'(yes, you guessed right).
SELECT * FROM invoices WHERE InvoiceDate LIKE "%-02-%";
As you can see, all the displayed records take the format 'YYYY-02-DD HH-MM-SS'.
What tracks belong to 'For Those About To Rock We Salute You'
By inspecting the 'albums' table, you can see that this album has an ID of 1.
A way to answer this question is to use the WHERE clause as shown below.
SELECT * FROM tracks WHERE AlbumId = 1;
The record set only contains tracks that belong to the album.
What is the Name, Length and Size of the Shortest Track?
To answer this, you'll use the ORDER BY command to sort the tracks in ascending order by the 'Milliseconds' column in the 'tracks' table.
You'll select only the 'Name', 'Milliseconds' and 'Bytes' columns.
SELECT Name, Milliseconds, Bytes FROM tracks ORDER BY Milliseconds ASC;
The shortest track has a duration of 1071 milliseconds.
What is the Name, Length and Size of the Longest Track?
Just like the previous question but you'll order in descending order this time.
SELECT Name, Milliseconds, Bytes FROM tracks ORDER BY Milliseconds DESC;
The Longest track has a duration of 5286953 milliseconds.
What other insight can you discover from our result set?
The duration of a track is not a factor of its size. As seen in the last record set, the second largest track has a larger size than the largest track.
These are just few things you can discover with SQL queries. We'll look into deeper insights as we proceed in this Data Science Path.
You might want to save your SQL queries. Simply click on the 'Save SQL File' button at the top task bar and save in the same directory as the 'chinook' database.
Run these queries and try out some new ones using what you've learnt so far.
Summary
- All SQL queries are statements but not all statements are queries
- A primary key in a table is a unique identifier of a row in the table
- A foreign key in a table is the primary key in another table
- The LIKE operator can help you find matches for a pattern
- Null values represent missing values
No comments: