0% found this document useful (0 votes)
44 views32 pages

CH3-Data Extraction and Profiling With Notes2 Part1

The document discusses data extraction, transformation, and loading (ETL) processes. It describes how data is extracted from source locations, transformed according to business rules to clean and structure the data, and loaded into the target location. SQL and NoSQL databases are compared, with SQL databases discussed in more detail regarding concepts like schemas, primary keys, and relationships between tables. Basic SQL commands for querying and manipulating data are also summarized.

Uploaded by

Qais Ahmad Amin
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views32 pages

CH3-Data Extraction and Profiling With Notes2 Part1

The document discusses data extraction, transformation, and loading (ETL) processes. It describes how data is extracted from source locations, transformed according to business rules to clean and structure the data, and loaded into the target location. SQL and NoSQL databases are compared, with SQL databases discussed in more detail regarding concepts like schemas, primary keys, and relationships between tables. Basic SQL commands for querying and manipulating data are also summarized.

Uploaded by

Qais Ahmad Amin
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 3

Part 1

Data Extraction and Profiling


ETL Processes

The data is transformed


according to specific
The data is business rules, cleaning The data is
extracted up the information and loaded into
from a source structuring it in a way the target
or sources that matches the location
schema of the target
location

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 2


Extract Data from SQL Database
SQL vs NOSQL Databases

vs
Source

SQL DB NoSQL DB
[MySQL, ORACLE, Microsoft SQL Server] [mongoDB, Cassandra, Google Cloud
Firestore]
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 4
SQL Databases

Place

Data Consistency
Hard to Scale
CUSTOMER ORDER PRODUCT
Resource Intensive
Contain

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 5


NOSQL Databases

Key Value
{
Name: “MackBook Pro 13”, Scalable
5215563225
Price: “1033$”
Data quality
}
896695565 USB Microphone

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 6


SQL and Relational Databases
• Structured Query Language (SQL) → means to communicate with
structured data stored in tables
• SQL code is easy to read and it's quickly understood what the expected results
are.
• Relational databases solve the problem of storing data together in
multiple tables while keeping consistency across them using the
concept of a primary and foreign key
• A primary key is the unique value (typically an integer) used to represent a
single distinct record or tuple in each table

• A foreign key would be a field in one table that references the primary key
from another

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 7


SQL and Relational Databases
• was created to communicate with data stored in database tables that
have a defined schema.
• Database schema → blueprint that defines a structure for storing data before the
data is loaded including rules, conditions, and specific data types for each field in a
table
• The key concepts to focus on behind working with any Relational Database
Management System (RDBMS) begin with how to communicate with the
system or servers that host the database.
• Most of them support using an Open Database Connectivity (ODBC) driver which
handles authentication and communication over a network
• ODBC is a standard application programming interface (API) for accessing DBMS
• SQLite → open source and cross-platform that is the most widely deployed and used
database engine in the world.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 8
SQL Commands

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi Source: https://www.w3schools.in/mysql/ddl-dml-dcl/ 9
SQL Commands
• Data Definition Language (DDL)
• Commands defining a database include creating, altering, and dropping tables
and establishing constraints.
• Data Manipulation Language (DML)
• Commands that maintain and query a database.
• Data Control Language (DCL)
• Commands that control a database, including administering privileges and
committing data.
• Transaction Control Language (TCL)
• Commands that control database transactions.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 10


Basic Retrieval Queries in SQL

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 11


Basic Retrieval Queries in SQL - Example

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 12


Basic Retrieval Queries in SQL - Example

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 13


Basic Retrieval Queries in SQL - Example

What if I need an annual salary (Salary*12)?

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 14


Basic Retrieval Queries in SQL - Example

What if I need an annual salary (Salary*12)?

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 15


Duplicate Tuples

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 16


Eliminate duplicate tuples (DISTINCT)

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 17


Retrieval Queries in SQL - Example
Retrieve the highest closing stock price of Apple stock in all of
Ex. 2018 from tbl_stock_price table.

SELECT max(closing_price) FROM tbl_stock_price WHERE year = 2018

The SELECT command max() function retrieve Lets the SQL interpreter Restricts the data by
instructs the code to the maximum value know a table or object reducing the number of
retrieve data in the from the closing_price is being referenced rows to a specific
form of rows and field immediately afterward condition, which is defined
columns from a table by a specific field of year
defined after the and value.
FROM statement

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 18


SQL Statement
• SELECT is the most common SQL command and has many different
use cases and levels of complexity.

• SQL is not case sensitive but the tables and fields referenced might
be, depending on which RDBMS is being used.

• Spaces are important between reserve words but you typically won’t
find spaces in the table or field names. Rather, underscores or dashes
are common.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 19


SQL Statement

SQL Command Role More Details


SELECT Extracts data from a database https://www.w3schools.com/sql/sql_select.asp

UPDATE Updates data in a database https://www.w3schools.com/sql/sql_update.a


sp
DELETE Deletes data from a database https://www.w3schools.com/sql/sql_insert.as
p
INSERT INTO Inserts new data into a https://www.w3schools.com/sql/sql_delete.a
database sp

Interactive SQL: https://www.w3schools.com/sql/


See Practices (1,2, and 3) on e-learning.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 20


Data Profiling and Exploration
Data Profiling
• Data Profiling → analyze data to identify errors, missing information,
and any anomalies that may affect the quality of information.

• The process of reviewing source data, understanding structure, content and


interrelationships.
• The process of examining, analyzing, and creating useful summaries of data.
• The process yields a high-level overview which aids in the discovery of data
quality issues, risks, and overall trends.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 22


Data Profiling Tasks
• Data profiling involves:
• Collecting descriptive statistics like min, max, count and sum. apply the average on a class and check if
it is logical for there salaries

• Collecting data types.


• Performing data quality assessment, risk of performing joins on the
data.
• Identifying inconsistencies in the format.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 23


Data Profiling in Knime

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 24


Data Profiling in Knime

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 25


Data Profiling in Knime

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 26


Descriptive Statistics
• Descriptive statistics are used to describe and/or summarize the
data we are working with.
• Measures of central tendency describe the center of our
distribution of data.
• Mean → calculated by summing all the values
and dividing by the count of values.
• Median → represents the 50th percentile of the
data; this means that 50% of the values are
greater than the median and 50% are less than
the median (2nd quartile (Q2)).
• Mode → the most common value in the data.

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 27


Descriptive Statistics
• Measures of spread tell us how the data is dispersed
• Range → the distance between the smallest value (minimum) and the
largest value (maximum).
• Variance → describes how far apart observations are spread out from their
average value (the mean) (the average squared distance from the mean).
• Standard deviation → the square root of the
variance.
• Interquartile range (IQR) → distance between the
3rd and 1st quartiles where Q1 is 25th percentile
and Q3 is 75th percentile. The IQR measures the
spread of data around the median and quantifies
how much dispersion we have in the middle 50%
of our distribution.
© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 28
Summarizing Data
• The 5-number summary → provides five descriptive statistics that
summarize our data:

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 29


Summarizing Data - Example

Min =

Q1 =

0, 3, 4, 4, 5, 6, 9, 12 Q2 =

Q3 =

Max =

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 30


Summarizing Data - Example

Min = 0

Q1 = 3

0, 3, 4, 4, 5, 6, 9, 12 Q2 = 4.5

Q3 = 9

Max = 12

© Ibrahim Abu alhul , Bushra Alhijawi , Ahmad Alzghoul, Wael Etaiwi 31


www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Email: [email protected]
Princess Sumaya University for Technology
Amman 11941 Jordan
P.o.Box 1438 Al-Jubaiha

You might also like