Why do you need SQL for Data Science?
According to forbes, 2.5 quintillion bytes of data is created everyday at our current pace. Most of these data are stored in databases.
As a Data Scientist, you need to be well versed in SQL to retrieve and communicate with these data.
What is SQL
SQL (Structured Query Language) is the language for databases. It allows you to manage and retrieve the data held in relational database management systems(RDBMS). The syntax is easy to write, read and interpret. It is pronounced "ess-que-ell"
What is a Database?
A database is a structure that organizes data in such a way that it can be electronically accessed from anywhere.
What is a Relational Database?
This is a type of database based on the relation between the data stored in it. It organizes data into one or more tables.
What is a Table?
It is a collection of data organized into rows and columns. It is also called a relation.
What is a Row?
It is an horizontal group of cells in a table. It is also called a record
What is a Column
It is a vertical group of cells in a table. The values are of the same data type. These cells are also called fields
Source: ThoughtCo.
What is a Relational Database System(RDBMS)?
This is a program/software that uses SQL to manage relational databases. Some popular relational database management systems are:
- MySQL
- PostgreSQL
- SQL Server
- Microsoft SQL
- IBM DB2
- SQLite(This is what we'll be using)
What is SQLite?
SQLite is a popular open source RDBMS program that interacts with relational databases. One of its major advantages is that data is saved locally in a single file without connecting the database to a server.
It is the most widely deployed database engine used in embedded systems like cellphones and Mp3 players. You can read more about it here
How to Setup SQLite on Your Computer
We'll be using DB Browser for SQLite as it makes working with SQLite databases easier than creating and managing databases from the command prompt.
DB Browser for SQLite is an intuitive, visual, open source tool for managing SQLite databases.
Download DB browser for SQLite and install the suitable 'Standard installer' for your OS(windows in my case).
Download DB Broswer for SQLite
Make sure you check the following boxes as shown below during your installation.
After installation, your initial interface should look like this.
Let's do a quick tweak. Click on the view button on the uppermost taskbar and unselect 'Edit Database Cell'. Do the same for 'SQL Log', 'Remote', and 'Plot'. Your interface should now look like this.
Now that we are all set up, let's delve into SQL.
SQL Statement: This is a valid text in SQL. It is the kind of text which SQL understands. It ends with a semi-colon (;) and can be of any length.
CREATE TABLE songs( title TEXT, artiste TEXT, year INTEGER, genre TEXT );
The statement above creates a table in a database with 4 columns(title, artiste, year, genre). Let's break down the statement
CREATE TABLE is a clause. Clauses perform specific actions in SQL. They are written in capital letters by convention(they are not case sensitive)
songs is a table name. We defined a table earlier.
(title TEXT, artiste TEXT, year INTEGER, genre TEXT) is a parameter. A parameter is an argument passed to a clause. It contains column names, their data types and other values.
Data stored in a relational database are of specific data types. We used the TEXT and INTEGER data types above to specify the kind of data that should be stored in the fields. Even if SQLite does not enforce these rules(One of the drawbacks of SQLite), we'll stick to them.
These are the most common data types:
- INTEGER, a positive or negative whole number
- TEXT, a text string
- DATE, the date formatted as 'YYYY-MM-DD'
- REAL, a decimal value
BASIC SQL COMMANDS
CREATE: This is used to create a new table
INSERT: This is used to insert records into an existing table
SELECT: This extracts data from the database. This is the most commonly used command in SQL.
The first stage of the data science process is to extract data as mentioned in my previous post.
Read:Learn Data Science by Yourself
We'll use SELECT statements to extract data from databases, but it won't hurt to see how tables are created.
Firstly, let's create a database.
Open the DB Browser for SQLite we installed earlier by typing it in your search bar or selecting from your desktop home screen
Select 'New Database' at the top bar
Select a directory to save it. I'll save mine in 'documents'
Type a preferred name for your database and save ('mydatabase' in my own case). When the 'Edit table definition' window pops up, click on cancel.
SQL CREATE Command
Now that we have created a database, let's create a table, 'songs', using the CREATE command. Click on the 'Execute SQL' button on the top taskbar and enter the code as shown below. Press 'Ctrl' + 'Enter' to run.
CREATE TABLE songs( title TEXT, artiste TEXT, genre TEXT, year INTEGER );
The code above creates a table,songs, with the following columns:
title is of the text data type. It contains the title of the songs
artiste is of the text data type. It contains names of the artistes
genre is of the text data type. It contains category of the songs
year is of the integer data type. It contains the year each song was released
We have created our first table. As you can see on the right pane, DB schema has been updated with the songs table
SQL INSERT Command
Let's proceed to populating the table using the INSERT command. Enter the code as shown below, highlight the INSERT statement and press 'Ctrl' + 'Enter' to run. Simply pressing 'Ctrl' + 'Enter' will run the whole code and throw an error as it attempts to create the songs table again.
INSERT INTO songs(title, artiste, genre, year) VALUES("Joro", "Wizkid", "Afro", 2019);
The code above inserts a record into the table, 'songs', with the following parameters:
title ='Joro'
artiste = 'Wizkid'
genre = 'Afro'
year = 2019
As shown in the code, we specified the order in which we'll be inserting into the fields. I recommend this method because it prevents errors whenever you change the order of the table. As seen above, text values are enclosed in quotes.
SQL SELECT Command
We have created a table and inserted a record. Let's view the table using the SELECT command. Enter the code as shown below, highlight and press 'Ctrl' + 'Enter' to run:
SELECT * FROM songs;
As seen above, the songs table contains the inserted record.
Let's populate the songs table with more records, you can choose your preferred values following the steps we used for inserting a record earlier.
INSERT INTO songs(title, artiste, genre, year) VALUES("Dear Mama", "Tupac", "Hip hop", 1994); INSERT INTO songs(title, artiste, genre, year) VALUES("Despacito", "Luis Fonsi", "Reggaeton", 2017); INSERT INTO songs(title, artiste, genre, year) VALUES("Toosie Slide", "Drake", "Trap", 2020); INSERT INTO songs(title, artiste, genre, year) VALUES("Try me", "Tems", "RnB", 2019);
Here's my table after populating.
You'll learn about three other basic commands in the next article.
Create your own table and populate with as much records as you can and view your table.
Click on the 'Write Changes' button on the taskbar to commit/save your table to the database
If you have any problem setting up or creating as shown above, kindly drop it in the comment section below.
Summary
- Data scientists use SQL to retrieve relevant data for their work
- SQL is a language for interacting with databases
- A database is a structure that organizes data in such a way that it can be electronically accessed from anywhere
- Relational Databases organize data into tables. RDBMS are programs used to create and manage relational databases. MySQL, Oracle, SQLite are some of the most popular RDBMS
- A table organizes data into rows and columns
- CREATE statements are used to create tables in database
- INSERT statements are used to insert records into tables
- SELECT statements are used to retrieve data from databases
No comments: