Unit III DWM

The document discusses ETL (extraction, transformation, loading) processes in data warehousing. ETL involves extracting data from source systems, transforming it to fit the data warehouse needs, and loading it. Challenges include diverse source systems with different structures, platforms, and data quality. Data extraction identifies sources and defines how data will be extracted. Transformation tasks reformat, decode, split, merge and convert data to prepare it for the warehouse. This prepares the raw source data for effective use in the data warehouse.


ETL

3.1 Introduction
ETL functions (extraction, transformation, and loading) that take place in the data staging area reshape the relevant data from the source systems into useful information to be stored in the data warehouse.

3.1.1 Challenges in ETL Process


• Source systems are diverse and disparate.
• They run on different operating systems/platforms.
• They may not preserve historical data.
• The quality of data may not be guaranteed in the older operational source systems.
• Source structures keep changing with time.
• Data inconsistencies occur in the source systems.
• Data may be stored in cryptic form.
• Data types, formats, and naming conventions may differ from one source to another.

3.2 Data Extraction
In this stage data flows from the data sources and pauses at the staging area.

Many of the operational systems are still legacy systems, while some run on client/server architecture and some have ERP data sources, so extracting data from such disparate systems is not a trivial task. Apart from production system data, data stored in temporary files is also important.

Data Extraction Issues


• Source Identification—identify source applications and source structures.
• Method of extraction—for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency—for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
• Time window—for each data source, denote the time window for the extraction process.
• Job sequencing—determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling—determine how to handle input records that cannot be extracted (see the sketch after this list).
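
These decisions are often recorded as extraction metadata, one entry per data source. Below is a minimal Python sketch of such metadata; the source names, field values, and the runnable() helper are illustrative assumptions, not part of any particular ETL tool.

# Illustrative extraction metadata: one entry per data source.
# Source names and field values are assumptions made for this sketch.
EXTRACTION_METADATA = [
    {
        "source": "orders_db",              # source identification
        "method": "tool-based",             # manual or tool-based extraction
        "frequency": "daily",               # daily, weekly, quarterly, ...
        "time_window": ("01:00", "03:00"),  # allowed extraction window
        "depends_on": None,                 # job sequencing: no predecessor job
        "on_bad_record": "reject-file",     # exception handling policy
    },
    {
        "source": "legacy_billing_files",
        "method": "manual",
        "frequency": "weekly",
        "time_window": ("22:00", "23:30"),
        "depends_on": "orders_db",          # wait until the previous job has finished
        "on_bad_record": "log-and-skip",
    },
]

def runnable(entry, finished_jobs):
    """A job may start only after the job it depends on has finished successfully."""
    return entry["depends_on"] is None or entry["depends_on"] in finished_jobs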

3.2.1 Identification of Data Sources


The following sequence of steps is performed during source identification:
• List every fact needed for analysis in the fact tables.
• For every dimension table, list each and every attribute.
• For each target data item, find the source system and the appropriate source data item.
• If there are multiple sources for the same data item, choose the preferred source.
• Formulate a consolidation rule for every data item that has multiple sources.
• Formulate splitting rules for every source field that will be distributed to multiple fields.
• Determine the default values.
• Search the source data for the missing values.

3.2.2 Data Extraction for Refreshing


Data Extraction Techniques
• Immediate Data Extraction (real time)
a) Capture through transaction logs
b) Capture through database triggers
c) Capture through source applications
• Deferred Data Extraction (capture happens later)
a) Capture based on date and timestamp
b) Capture by comparing files

Immediate Data Extraction (real time)

a) Capture through transaction logs


● It reads the transaction log and selects all the committed transactions.
● The log should be extracted before the log file gets refreshed.
● Does not provide much flexibility for capturing specifications.
● Does not affect the performance of the source systems.
● Does not require any revisions to the existing source applications.
● Cannot be used on file-oriented systems.

b) Capture through database triggers


● Triggers can be created for all the events for which data needs to be captured.
● Does not provide much flexibility for capturing specifications.
● Puts some additional processing overhead on the source systems when the triggers execute.
● Does not require any revisions to the existing source applications.
● Cannot be used on file-oriented systems.
● Cannot be used on legacy systems.

c) Capture through source applications


● Provides flexibility for capturing specifications.
● Degrades the performance of the source systems.
● Requires the existing source systems to be revised.
● Can be used on file-oriented systems.
● Can be used on legacy systems (a sketch of this application-assisted capture follows).
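
A minimal sketch of capture through source applications: the application's own update routine is revised to also write a change record that the extract job can read later. The in-memory tables and field names are assumptions made only for illustration.

from datetime import datetime, timezone

# Illustrative in-memory stand-ins for a source table and a change/capture store.
customer_table = {}      # customer id -> record
change_log = []          # change records written by the application for later extraction

def update_customer(cust_id, fields):
    """Source application update routine, revised to also capture the change."""
    record = customer_table.setdefault(cust_id, {"cust_id": cust_id})
    record.update(fields)
    # Extra logic added to the source application: record the change
    # so the ETL extract job can pick it up later.
    change_log.append({
        "cust_id": cust_id,
        "changed_fields": dict(fields),
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })

update_customer(101, {"name": "A. Kumar", "city": "Pune"})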

Deferred Data Extraction (capture happens later)

a) Capture based on date and timestamp


• Every time a record in the source system is created or updated, it is marked with a timestamp that will be used for selecting the records for data extraction. The timestamp shows the date and time at which the source record was created or updated.
• Provides flexibility for capturing specifications.
• Does not affect the performance of the source systems.
• Requires the existing source systems to be revised.
• Can be used on any system, i.e. file-oriented as well as legacy systems.
• Deletion of source records presents a special problem: if a source record gets deleted between two extract runs, the information about the delete is not detected.
• You can get around this by marking the source record for deletion first, doing the extraction run, and then physically deleting the record. This adds more logic to the source applications (see the sketch after this list).
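
A minimal sketch of capture based on date and timestamp, assuming each source row carries a last_updated column and that the time of the previous successful run is remembered between extractions; the table name, column names, and the use of SQLite are illustrative assumptions.

import sqlite3
from datetime import datetime, timezone

def extract_changed_rows(conn, last_run_time):
    """Select only the rows created or updated since the previous extract run."""
    cur = conn.execute(
        "SELECT * FROM sales WHERE last_updated > ?",  # assumed table/column names
        (last_run_time,),
    )
    return cur.fetchall()

# Usage sketch: an in-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, last_updated TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 250.0, '2024-01-02T10:00:00')")
print(extract_changed_rows(conn, "2024-01-01T00:00:00"))
# Remember this run's time so the next extraction selects only newer changes.
last_run_time = datetime.now(timezone.utc).isoformat()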

b) Capture by comparing files


● It compares two snapshots of the source data.
● For example, to apply this technique to capture changes in the sales data, while performing today's data extraction a full file comparison between today's copy and the previous day's copy of the sales data is done to capture any changes between the two copies.
● Provides flexibility for capturing specifications.
● Does not affect the performance of the source systems.
● Does not require the existing source systems to be revised.
● May be used on file-oriented and legacy systems (a sketch of the comparison follows).
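
A minimal sketch of capture by comparing files: today's snapshot of the sales data is compared with the previous day's copy, keyed on a record identifier. The dictionary-based snapshots and field names are assumptions for illustration.

def compare_snapshots(previous, today):
    """Full comparison of two snapshots (dicts keyed by record id)."""
    inserts = {k: v for k, v in today.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in today}
    updates = {k: v for k, v in today.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Usage sketch with assumed sales records keyed by sale id.
yesterday = {1: {"store": "S01", "amount": 100.0}, 2: {"store": "S02", "amount": 75.0}}
today     = {1: {"store": "S01", "amount": 120.0}, 3: {"store": "S03", "amount": 40.0}}
print(compare_snapshots(yesterday, today))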

3.3 Data Transformation
The data extracted from the source systems cannot be stored directly in the data warehouse, mainly for two reasons:
• It is raw data that must be processed to be made usable in the data warehouse.
• As operational data is extracted from many old legacy systems, the quality of data in those systems may not be good enough for the data warehouse.

3.3.1 Tasks involved in Data Transformation


Format revision: Relates to changes in the data types and lengths of individual data fields. For example, in the source systems, the customer's income level may be identified by codes and ranges, and the fields may be text or numeric. The length of the customer's name field may vary from one source system to another.

Decoding of fields: When the data comes from multiple source systems, the same data item may have been described by different field values. For example, coding for gender, with one system using 0 and 1 for male and female, another using M and F, and another using male and female. Data with cryptic codes must also be decoded before being moved into the data warehouse.

Splitting of fields: The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store the individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing the individual components.

Merging of information: This does not indicate the merging of several fields to create a single field. For example, the details of a customer could be collected from a number of data sources: the customer name and code may be fetched from one table, the income level and age from another table, and the address and lifestyle from yet another table, and then combined into a single entity.

Character set conversion: The main aim is to convert the character sets to an agreed standard character set. Some of the legacy systems store text in the EBCDIC character set and others in ASCII, so conversion to the standard character set is needed.

Conversion of units: A multinational company has branches in a number of countries, so amounts may be represented in different currencies. These figures need to be converted into a common unit of measurement.

Date & time conversion: American and British date formats are different, so a common format needs to be decided.

Summarization: Used to derive summarized data from the most granular data. The summarized data is then loaded into the data warehouse instead of loading the most granular level of data. For example, instead of keeping the details of each and every sales transaction in individual stores, we can summarize this data and keep the summary data storing the total sales in each store on every individual date.

Key restructuring: When choosing keys for our database tables, we have to avoid the ones with built-in meanings. If we use the product code as the primary key, a problem occurs: if the product is moved to another warehouse, the warehouse part of the product key has to be changed. Key restructuring in ETL is the transformation of such keys into generic keys generated by the system itself.

De-duplication: In a customer database, some customers may be represented by several records for various reasons: incorrect data values because of data entry errors, incomplete information, change of address, etc. It makes sense to keep a single record for one customer and link all the duplicates to this single record. This process in ETL is called de-duplication of the customer file.
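
A minimal sketch of a few of the transformation tasks described above (decoding of fields, splitting of fields, conversion of units, and de-duplication) applied to sample records; the field names, code mappings, and exchange rates are assumptions made only for illustration.

# Assumed decoding table: different source systems use different gender codes.
GENDER_DECODE = {"0": "male", "1": "female", "M": "male", "F": "female",
                 "male": "male", "female": "female"}
USD_PER_UNIT = {"USD": 1.0, "GBP": 1.27, "INR": 0.012}   # assumed conversion rates

def transform(record):
    """Apply a few standard transformation tasks to one source record."""
    out = {}
    # Decoding of fields: map source-specific codes to one agreed standard.
    out["gender"] = GENDER_DECODE.get(str(record["gender"]))
    # Splitting of fields: break a single name field into components.
    parts = record["name"].split()
    out["first_name"], out["last_name"] = parts[0], parts[-1]
    # Conversion of units: bring amounts into a common currency.
    out["amount_usd"] = round(record["amount"] * USD_PER_UNIT[record["currency"]], 2)
    return out

def deduplicate(records, key):
    """De-duplication: keep a single record per key value."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)   # first occurrence wins
    return list(seen.values())

print(transform({"gender": "F", "name": "Asha R Patil", "amount": 1500, "currency": "INR"}))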

3.3.2 Role of Data Transformation Process


The data transformation process takes the following course:
● Map the input data from the source systems to the data warehouse repository.
● Clean the data and fill in missing values with suitable default values.
● Remove duplicate records so that each record is stored only once in the data warehouse.
● Perform splitting and merging of fields.
● Sort the records.
● De-normalize the extracted data according to the dimensional model of the data warehouse.
● Convert to the appropriate data types.
● Perform aggregations and summarizations.
● Inspect the data for referential integrity.
● Consolidate and integrate data from multiple source systems (these steps are typically chained in order, as sketched below).
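
A minimal sketch of how these steps can be chained in order, assuming each step is a small function that takes and returns a list of records; the step functions and field names are placeholders, not a real tool's API.

def clean(records):
    # Fill missing values with an assumed default (placeholder cleaning step).
    return [{**r, "city": r.get("city") or "UNKNOWN"} for r in records]

def remove_duplicates(records):
    # Keep one record per customer id.
    return list({r["cust_id"]: r for r in records}.values())

def sort_records(records):
    return sorted(records, key=lambda r: r["cust_id"])

PIPELINE = [clean, remove_duplicates, sort_records]   # ordered transformation course

def run_pipeline(records):
    for step in PIPELINE:
        records = step(records)
    return records

print(run_pipeline([{"cust_id": 2, "city": None}, {"cust_id": 1, "city": "Pune"},
                    {"cust_id": 2, "city": "Mumbai"}]))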

3.4 Data Loading
• Data loading takes the prepared data, applies it to the data warehouse, and stores it in the
database
• Terminology:
– Initial Load — populating all the data warehouse tables for the very first time
– Incremental Load — applying ongoing changes as necessary in a periodic manner
– Full Refresh — completely erasing the contents of one or more tables and reloading
with fresh data (initial load is a refresh of all the tables)

Before loading data into the data warehouse, indexes are usually dropped from the tables and recreated after loading, as sketched below.
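
A minimal sketch of this drop-then-recreate pattern, using SQLite as a stand-in for the warehouse database; the table and index names are illustrative, and real warehouse DBMSs provide their own bulk-load utilities.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, store TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_store ON sales_fact (store)")

rows = [(1, "S01", 120.0), (2, "S02", 75.0)]     # prepared data from the staging area

conn.execute("DROP INDEX idx_sales_store")        # drop indexes before the load
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_sales_store ON sales_fact (store)")   # recreate afterwards
conn.commit()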

3.4.1 Techniques of Data Loading


1. Load
2. Append
3. Destructive merge
4. Constructive merge
5. Initial load
1. Load:
• If the target table to be loaded already exists and data exists in the table, the load process wipes
out the existing data and applies the data from the incoming file.
• If the table is already empty before loading, the load process simply applies the data from the
incoming file.

2. Append
• Extension of the load.
• If data already exists in the table, the append process unconditionally adds the incoming data,
preserving the existing data in the target table.
• When an incoming record is a duplicate of an already existing record, you may define how to
handle an incoming duplicate:
– The incoming record may be allowed to be added as a duplicate.
– In the other option, the incoming duplicate record may be rejected during the append
process.
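
The load and append modes just described can be sketched as follows, using an in-memory list as a stand-in for the target table; showing the duplicate-handling choice for append as a flag is an illustrative assumption.

def load(target, incoming):
    """Load mode: wipe out any existing data and apply the incoming file."""
    target.clear()
    target.extend(incoming)

def append(target, incoming, allow_duplicates=True):
    """Append mode: keep existing data; duplicates are added or rejected per policy."""
    for rec in incoming:
        if allow_duplicates or rec not in target:
            target.append(rec)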

3. Destructive merge:
• Applies incoming data to the target data.
• If the primary key of an incoming record matches with the key of an existing record, update the
matching target record.
• If the incoming record is a new record without a match with any existing record, add the
incoming record to the target table.

4. Constructive merge
• Slightly different from the destructive merge.
• If the primary key of an incoming record matches the key of an existing record, leave the existing record, add the incoming record, and mark the added record as superseding the old record (both merge modes are sketched below).
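
A minimal sketch contrasting the destructive and constructive merge modes on in-memory target structures keyed by primary key; the record layout and the supersedes_previous marker are assumptions for illustration.

def destructive_merge(target, incoming):
    # Matching key: overwrite the existing target record. New key: add the record.
    for rec in incoming:
        target[rec["key"]] = rec
    return target

def constructive_merge(target, incoming):
    # Matching key: keep the old record, add the new one, and mark it as superseding.
    for rec in incoming:
        versions = target.setdefault(rec["key"], [])
        if versions:
            rec = {**rec, "supersedes_previous": True}
        versions.append(rec)
    return target

# Usage sketch with assumed records keyed by a primary key field named "key".
print(destructive_merge({10: {"key": 10, "amount": 100}},
                        [{"key": 10, "amount": 150}, {"key": 11, "amount": 60}]))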

3.4.2 When to Go for Data Update Rather than Data Refresh


After the initial load, the data warehouse is updated using two methods:
Update: application of incremental changes from the data sources.
Refresh: complete reload at specified intervals.

Update versus Refresh:
• Update is more complex than refresh; refresh is much simpler than update.
• For an update, we need to extract the changes from each data source and then apply the extracted records to the data warehouse; for a refresh, a complete replacement of the data warehouse tables takes place.
• The time taken by update jobs depends on the number of updates and, unlike a refresh, the warehouse need not be kept offline for long; refresh jobs take a long time to complete, the warehouse has to be kept offline for a long time, and the situation worsens if the database has large tables.
• The cost of an update varies with the number of changes in the source systems; the cost of a refresh remains the same irrespective of the number of changes in the source systems.

3.4.3 Loading the Fact Tables and Dimension Tables


The keys of records in the source systems are different from the keys in the data warehouse. Therefore, before source data can be applied to the dimension tables, whether during the initial load or during updating, the production keys must be converted to system-generated keys in the data warehouse. These key conversions must be done as a part of the transformation process.

The following diagram explains how Type 1, Type 2, and Type 3 changes are handled.

The key of a fact table is a concatenation of the keys of the dimension tables. That is why the dimension tables are loaded before the fact tables.
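
A minimal sketch of this key conversion while loading fact rows: the production keys arriving from the source are looked up in (or added to) surrogate-key maps for the dimensions before the fact row is written. The names and structures are assumptions for illustration.

import itertools

surrogate_seq = itertools.count(1)    # system-generated key source
product_keys = {}                     # production product code -> surrogate key
store_keys = {}                       # production store code   -> surrogate key

def surrogate_key(key_map, production_key):
    """Return the surrogate key for a production key, assigning a new one if unseen."""
    if production_key not in key_map:
        key_map[production_key] = next(surrogate_seq)
    return key_map[production_key]

def load_fact_row(source_row, fact_table):
    """Dimension keys must be resolved before the fact row is loaded."""
    fact_table.append({
        "product_key": surrogate_key(product_keys, source_row["product_code"]),
        "store_key": surrogate_key(store_keys, source_row["store_code"]),
        "sales_amount": source_row["amount"],
    })

fact_table = []
load_fact_row({"product_code": "P-1001", "store_code": "S01", "amount": 250.0}, fact_table)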
3.5 Data Quality
Poor data quality results in poor decisions. Dirty data is one of the common reasons for the failure of a data warehouse.

What is Data Quality?


A data item is of good quality when it is exactly fit for the purpose for which the business users have defined it. Data quality is a wider concept grounded in the specific business of the company. It relates not just to single data elements but to the system as a whole, and requires the form and content of data elements to be consistent across the whole system. It is essentially needed on a corporate-wide basis.

3.5.1 Need for Data Quality


• Boosts confidence in decision making
• Enables better customer service
• Increases opportunity to add better value to the services
• Reduces risk from disastrous decisions
• Reduces costs, especially of marketing campaigns
• Enhances strategic decision making
• Improves productivity by streamlining processes
• Avoids compounding effects of data contamination

The cost of not having good quality data


• Bad decisions
• Lost business opportunities
• Wastage of resources
• Inconsistent data reports
• Time and effort needed to correct data

3.5.2 Categories of Errors


The errors are divided into four categories:
1. Incomplete Errors
a) Missing records
b) Missing fields
c) Records or fields that, by design, are not being recorded
2. Incorrect Errors
a) Wrong codes
b) Wrong calculations, aggregations
c) Duplicate records
d) Wrong information entered into the system
3. Incomprehensibility Errors
a) Multiple fields within one field
b) Unknown codes
4. Inconsistency Errors
a) Inconsistent use of different codes
b) Inconsistent meaning of codes
c) Inconsistent aggregation
d) Lack of referential integrity

1. Incomplete Errors
a) Missing records: a record is present in the source system but not in the warehouse.
b) Missing fields: fields are empty in the source system tables.
c) Records or fields, by design, not being recorded: there are two cases. First, a new field designed by the developer is not present in the source system. Second, a field is present in the source system but is empty, yet is required by the warehouse.

2. Incorrect Errors
a) Wrong codes: different codes are used in different versions of the source system.
b) Wrong calculations, aggregations: wrong calculations are performed on the detailed granular fields.
c) Duplicate records: duplicate records are present in the source systems.
d) Wrong information entered into the system: for example, the date format is American but the date is entered in British format.

3. Incomprehensibility Errors
a) Multiple fields within one field: for example, the name field in the source system contains the first, middle, and last name, but in the warehouse the developer wants to keep them in three separate fields.
b) Unknown codes: due to a lack of documentation in the source system, some fields are not understandable.

4. Inconsistency Errors
a) Inconsistent use of different codes: M and F in one system and 0 and 1 in another system.
b) Inconsistent meaning of codes: occurs because the definition of an organizational entity changes over time.
c) Inconsistent aggregation: due to inconsistent business rules, different aggregations are present in the source systems.
d) Lack of referential integrity: occurs when the source system is built without basic checks. (Simple checks for some of these error categories are sketched below.)
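
A minimal sketch of simple checks for a few of these error categories (missing fields, unknown codes, duplicate records, and lack of referential integrity); the field names and the list of valid codes are assumptions made for illustration.

VALID_GENDER_CODES = {"M", "F"}   # assumed agreed codes for the warehouse

def quality_report(records, dimension_keys):
    """Count a few categories of data quality errors in a batch of records."""
    report = {"missing_field": 0, "unknown_code": 0,
              "duplicate_record": 0, "referential_integrity": 0}
    seen_ids = set()
    for rec in records:
        if not rec.get("name"):                            # incomplete: missing field
            report["missing_field"] += 1
        if rec.get("gender") not in VALID_GENDER_CODES:    # incomprehensible: unknown code
            report["unknown_code"] += 1
        if rec["cust_id"] in seen_ids:                     # incorrect: duplicate record
            report["duplicate_record"] += 1
        seen_ids.add(rec["cust_id"])
        if rec.get("region_key") not in dimension_keys:    # inconsistency: no matching parent
            report["referential_integrity"] += 1
    return report

print(quality_report([{"cust_id": 1, "name": "", "gender": "X", "region_key": 99}],
                     dimension_keys={1, 2, 3}))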
3.5.3 Issues in Data Cleansing
Which data to cleanse
Primarily, it must be the users' decision, since it is the users who know best what type of data they need from the data warehouse.
For the best quality, however, both the users and the project team should work jointly and take the decision together.
The team must determine how the data has to be cleansed and weigh the benefits of data cleansing against the consequences of leaving the dirty data in place, studying how it would affect any analysis made by the users of the data warehouse.
Where to cleanse
• Cleansing data in the staging area is simple and practically possible. In the data staging area, the developer team has already solved the extraction problems and is aware of the structure, content, and nature of the data.
• If the cleansing operation takes place in the source systems, it becomes a complex, difficult, and expensive task.

How to cleanse in the source systems


Appropriate tools can be used; this works for newer source systems. For old source systems, the developer team has to write in-house programs.

