Chapter 3
Part 1
Data Extraction and Profiling
ETL Processes
The data is transformed
according to specific
The data is business rules, cleaning The data is
extracted up the information and loaded into
from a source structuring it in a way the target
or sources that matches the location
schema of the target
location
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 2
Extract Data from SQL Database
SQL vs NOSQL Databases
vs
Source
SQL DB NoSQL DB
[MySQL, ORACLE, Microsoft SQL Server] [mongoDB, Cassandra, Google Cloud
Firestore]
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 4
SQL Databases
Place
Data Consistency
Hard to Scale
CUSTOMER ORDER PRODUCT
Resource Intensive
Contain
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 5
NOSQL Databases
Key Value
{
Name: “MackBook Pro 13”, Scalable
5215563225
Price: “1033$”
Data quality
}
896695565 USB Microphone
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 6
SQL and Relational Databases
• Structured Query Language (SQL) → means to communicate with
structured data stored in tables
• SQL code is easy to read and it's quickly understood what the expected results
are.
• Relational databases solve the problem of storing data together in
multiple tables while keeping consistency across them using the
concept of a primary and foreign key
• A primary key is the unique value (typically an integer) used to represent a
single distinct record or tuple in each table
• A foreign key would be a field in one table that references the primary key
from another
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 7
SQL and Relational Databases
• was created to communicate with data stored in database tables that
have a defined schema.
• Database schema → blueprint that defines a structure for storing data before the
data is loaded including rules, conditions, and specific data types for each field in a
table
• The key concepts to focus on behind working with any Relational Database
Management System (RDBMS) begin with how to communicate with the
system or servers that host the database.
• Most of them support using an Open Database Connectivity (ODBC) driver which
handles authentication and communication over a network
• ODBC is a standard application programming interface (API) for accessing DBMS
• SQLite → open source and cross-platform that is the most widely deployed and used
database engine in the world.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 8
SQL Commands
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi Source: https://www.w3schools.in/mysql/ddl-dml-dcl/ 9
SQL Commands
• Data Definition Language (DDL)
• Commands defining a database include creating, altering, and dropping tables
and establishing constraints.
• Data Manipulation Language (DML)
• Commands that maintain and query a database.
• Data Control Language (DCL)
• Commands that control a database, including administering privileges and
committing data.
• Transaction Control Language (TCL)
• Commands that control database transactions.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 10
Basic Retrieval Queries in SQL
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 11
Basic Retrieval Queries in SQL - Example
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 12
Basic Retrieval Queries in SQL - Example
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 13
Basic Retrieval Queries in SQL - Example
What if I need an annual salary (Salary*12)?
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 14
Basic Retrieval Queries in SQL - Example
What if I need an annual salary (Salary*12)?
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 15
Duplicate Tuples
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 16
Eliminate duplicate tuples (DISTINCT)
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 17
Retrieval Queries in SQL - Example
Retrieve the highest closing stock price of Apple stock in all of
Ex. 2018 from tbl_stock_price table.
SELECT max(closing_price) FROM tbl_stock_price WHERE year = 2018
The SELECT command max() function retrieve Lets the SQL interpreter Restricts the data by
instructs the code to the maximum value know a table or object reducing the number of
retrieve data in the from the closing_price is being referenced rows to a specific
form of rows and field immediately afterward condition, which is defined
columns from a table by a specific field of year
defined after the and value.
FROM statement
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 18
SQL Statement
• SELECT is the most common SQL command and has many different
use cases and levels of complexity.
• SQL is not case sensitive but the tables and fields referenced might
be, depending on which RDBMS is being used.
• Spaces are important between reserve words but you typically won’t
find spaces in the table or field names. Rather, underscores or dashes
are common.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 19
SQL Statement
SQL Command Role More Details
SELECT Extracts data from a database https://www.w3schools.com/sql/sql_select.asp
UPDATE Updates data in a database https://www.w3schools.com/sql/sql_update.a
sp
DELETE Deletes data from a database https://www.w3schools.com/sql/sql_insert.as
p
INSERT INTO Inserts new data into a https://www.w3schools.com/sql/sql_delete.a
database sp
Interactive SQL: https://www.w3schools.com/sql/
See Practices (1,2, and 3) on e-learning.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 20
Data Profiling and Exploration
Data Profiling
• Data Profiling → analyze data to identify errors, missing information,
and any anomalies that may affect the quality of information.
• The process of reviewing source data, understanding structure, content and
interrelationships.
• The process of examining, analyzing, and creating useful summaries of data.
• The process yields a high-level overview which aids in the discovery of data
quality issues, risks, and overall trends.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 22
Data Profiling Tasks
• Data profiling involves:
• Collecting descriptive statistics like min, max, count and sum. apply the average on a class and check if
it is logical for there salaries
• Collecting data types.
• Performing data quality assessment, risk of performing joins on the
data.
• Identifying inconsistencies in the format.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 23
Data Profiling in Knime
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 24
Data Profiling in Knime
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 25
Data Profiling in Knime
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 26
Descriptive Statistics
• Descriptive statistics are used to describe and/or summarize the
data we are working with.
• Measures of central tendency describe the center of our
distribution of data.
• Mean → calculated by summing all the values
and dividing by the count of values.
• Median → represents the 50th percentile of the
data; this means that 50% of the values are
greater than the median and 50% are less than
the median (2nd quartile (Q2)).
• Mode → the most common value in the data.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 27
Descriptive Statistics
• Measures of spread tell us how the data is dispersed
• Range → the distance between the smallest value (minimum) and the
largest value (maximum).
• Variance → describes how far apart observations are spread out from their
average value (the mean) (the average squared distance from the mean).
• Standard deviation → the square root of the
variance.
• Interquartile range (IQR) → distance between the
3rd and 1st quartiles where Q1 is 25th percentile
and Q3 is 75th percentile. The IQR measures the
spread of data around the median and quantifies
how much dispersion we have in the middle 50%
of our distribution.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 28
Summarizing Data
• The 5-number summary → provides five descriptive statistics that
summarize our data:
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 29
Summarizing Data - Example
Min =
Q1 =
0, 3, 4, 4, 5, 6, 9, 12 Q2 =
Q3 =
Max =
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 30
Summarizing Data - Example
Min = 0
Q1 = 3
0, 3, 4, 4, 5, 6, 9, 12 Q2 = 4.5
Q3 = 9
Max = 12
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 31
www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Email:
[email protected]Princess Sumaya University for Technology
Amman 11941 Jordan
P.o.Box 1438 Al-Jubaiha