Unit III DWM

The document discusses ETL (extraction, transformation, loading) processes in data warehousing. ETL involves extracting data from source systems, transforming it to fit the data warehouse needs, and loading it. Challenges include diverse source systems with different structures, platforms, and data quality. Data extraction identifies sources and defines how data will be extracted. Transformation tasks reformat, decode, split, merge and convert data to prepare it for the warehouse. This prepares the raw source data for effective use in the data warehouse.


ETL

3.1 Introduction
ETL functions (extraction, transformation, and loading) that take place in the data staging area reshape the relevant data from the source systems into useful information to be stored in the data warehouse.

3.1.1 Challenges in ETL Process


• Source systems are diverse and disparate.
• They run on different operating systems/platforms.
• They may not preserve historical data.
• The quality of data may not be guaranteed in the older operational source systems.
• Source structures keep changing with time.
• Data inconsistencies occur in the source systems.
• Data may be stored in cryptic form.
• Data types, formats, and naming conventions may differ from one source to another.

3.2 Data Extraction
In this stage data flows from the data sources and pauses at the staging area.

Many of the operational systems are still legacy systems, while some run on client/server architecture and some have ERP data sources, so extracting data from such disparate systems is not a trivial task. Apart from production system data, data stored in temporary files is also important.

Data Extraction Issues


• Source Identification—identify source applications and source structures.
• Method of extraction—for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency—for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
• Time window—for each data source, denote the time window for the extraction process.
• Job sequencing—determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling—determine how to handle input records that cannot be extracted (see the sketch after this list).
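
These decisions are often recorded as extraction metadata, one entry per data source. Below is a minimal Python sketch of such metadata; the source names, field values, and the runnable() helper are illustrative assumptions, not part of any particular ETL tool.

# Illustrative extraction metadata: one entry per data source.
# Source names and field values are assumptions made for this sketch.
EXTRACTION_METADATA = [
    {
        "source": "orders_db",              # source identification
        "method": "tool-based",             # manual or tool-based extraction
        "frequency": "daily",               # daily, weekly, quarterly, ...
        "time_window": ("01:00", "03:00"),  # allowed extraction window
        "depends_on": None,                 # job sequencing: no predecessor job
        "on_bad_record": "reject-file",     # exception handling policy
    },
    {
        "source": "legacy_billing_files",
        "method": "manual",
        "frequency": "weekly",
        "time_window": ("22:00", "23:30"),
        "depends_on": "orders_db",          # wait until the previous job has finished
        "on_bad_record": "log-and-skip",
    },
]

def runnable(entry, finished_jobs):
    """A job may start only after the job it depends on has finished successfully."""
    return entry["depends_on"] is None or entry["depends_on"] in finished_jobs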

3.2.1 Identification of Data Sources


The following sequence of steps is performed during source identification:
• List every fact needed for analysis in the fact tables.
• For every dimension table, list each and every attribute.
• For each target data item, find the source system and the appropriate source data item.
• If there are multiple sources for the same data item, choose the preferred source.
• Formulate a consolidation rule for every data item that has multiple sources.
• Formulate splitting rules for every source field that will be distributed to multiple fields.
• Determine the default values.
• Search the source data for the missing values.

3.2.2 Data Extraction for Refreshing


Data Extraction Techniques
• Immediate Data Extraction (real time)
a) Capture through transaction logs
b) Capture through database triggers
c) Capture through source applications
• Deferred Data Extraction (capture happens later)
a) Capture based on date and timestamp
b) Capture by comparing files

Immediate Data Extraction (real time)

a) Capture through transaction logs


● It reads the transaction log and selects all the committed transactions.
● The log should be extracted before the log file gets refreshed.
● Does not provide much flexibility for capturing specifications.
● Does not affect the performance of the source systems.
● Does not require any revisions to the existing source applications.
● Cannot be used on file-oriented systems.

b) Capture through database triggers


● Triggers can be created for all the events for which data needs to be captured.
● Does not provide much flexibility for capturing specifications.
● Puts some additional processing overhead on the source systems when the triggers execute.
● Does not require any revisions to the existing source applications.
● Cannot be used on file-oriented systems.
● Cannot be used on legacy systems.

c) Capture through source applications


● Provides flexibility for capturing specifications.
● Degrades the performance of the source systems.
● Requires the existing source systems to be revised.
● Can be used on file-oriented systems.
● Can be used on legacy systems (a sketch of this application-assisted capture follows).
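
A minimal sketch of capture through source applications: the application's own update routine is revised to also write a change record that the extract job can read later. The in-memory tables and field names are assumptions made only for illustration.

from datetime import datetime, timezone

# Illustrative in-memory stand-ins for a source table and a change/capture store.
customer_table = {}      # customer id -> record
change_log = []          # change records written by the application for later extraction

def update_customer(cust_id, fields):
    """Source application update routine, revised to also capture the change."""
    record = customer_table.setdefault(cust_id, {"cust_id": cust_id})
    record.update(fields)
    # Extra logic added to the source application: record the change
    # so the ETL extract job can pick it up later.
    change_log.append({
        "cust_id": cust_id,
        "changed_fields": dict(fields),
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })

update_customer(101, {"name": "A. Kumar", "city": "Pune"})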

Deferred Data Extraction (capture happens later)

a) Capture based on date and timestamp


• Every time a record in the source system is created or updated, it is marked with a timestamp that will be used for selecting the records for data extraction. The timestamp shows the date and time at which the source record was created or updated.
• Provides flexibility for capturing specifications.
• Does not affect the performance of the source systems.
• Requires the existing source systems to be revised.
• Can be used on any system, i.e. file-oriented as well as legacy systems.
• Deletion of source records presents a special problem: if a source record gets deleted between two extract runs, the information about the delete is not detected.
• You can get around this by marking the source record for deletion first, doing the extraction run, and then physically deleting the record. This adds more logic to the source applications (see the sketch after this list).
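
A minimal sketch of capture based on date and timestamp, assuming each source row carries a last_updated column and that the time of the previous successful run is remembered between extractions; the table name, column names, and the use of SQLite are illustrative assumptions.

import sqlite3
from datetime import datetime, timezone

def extract_changed_rows(conn, last_run_time):
    """Select only the rows created or updated since the previous extract run."""
    cur = conn.execute(
        "SELECT * FROM sales WHERE last_updated > ?",  # assumed table/column names
        (last_run_time,),
    )
    return cur.fetchall()

# Usage sketch: an in-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, last_updated TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 250.0, '2024-01-02T10:00:00')")
print(extract_changed_rows(conn, "2024-01-01T00:00:00"))
# Remember this run's time so the next extraction selects only newer changes.
last_run_time = datetime.now(timezone.utc).isoformat()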

b) Capture by comparing files


● It compares two snapshots of the source data.
● For example, to apply this technique to capture changes in the sales data, while performing today's data extraction a full file comparison between today's copy and the previous day's copy of the sales data is done to capture any changes between the two copies.
● Provides flexibility for capturing specifications.
● Does not affect the performance of the source systems.
● Does not require the existing source systems to be revised.
● May be used on file-oriented and legacy systems (a sketch of the comparison follows).
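
A minimal sketch of capture by comparing files: today's snapshot of the sales data is compared with the previous day's copy, keyed on a record identifier. The dictionary-based snapshots and field names are assumptions for illustration.

def compare_snapshots(previous, today):
    """Full comparison of two snapshots (dicts keyed by record id)."""
    inserts = {k: v for k, v in today.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in today}
    updates = {k: v for k, v in today.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Usage sketch with assumed sales records keyed by sale id.
yesterday = {1: {"store": "S01", "amount": 100.0}, 2: {"store": "S02", "amount": 75.0}}
today     = {1: {"store": "S01", "amount": 120.0}, 3: {"store": "S03", "amount": 40.0}}
print(compare_snapshots(yesterday, today))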

3.3 Data Transformation
The data extracted from the source systems cannot be stored directly in the data warehouse, mainly for two reasons:
• It is raw data that must be processed to be made usable in the data warehouse.
• As operational data is extracted from many old legacy systems, the quality of data in those systems may not be good enough for the data warehouse.

3.3.1 Tasks involved in Data Transformation


Format revision: Relates to changes in the data types and lengths of individual data fields. For example, in the source systems, the customer's income level may be identified by codes and ranges, and the fields may be text or numeric. The length of the customer's name field may vary from one source system to another.

Decoding of fields: When the data comes from multiple source systems, the same data item may have been described by different field values. For example, coding for gender, with one system using 0 and 1 for male and female, another using M and F, and another using male and female. Data with cryptic codes must also be decoded before being moved into the data warehouse.

Splitting of fields: The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store the individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing the individual components.

Merging of information: This does not indicate the merging of several fields to create a single field. For example, the details of a customer could be collected from a number of data sources: the customer name and code may be fetched from one table, the income level and age from another table, and the address and lifestyle from yet another table, and then combined into a single entity.

Character set conversion: The main aim is to convert the character sets to an agreed standard character set. Some of the legacy systems store text in the EBCDIC character set and others in ASCII, so conversion to the standard character set is needed.

Conversion of units: A multinational company has branches in a number of countries, so amounts may be represented in different currencies. These figures need to be converted into a common unit of measurement.

Date & time conversion: American and British date formats are different, so a common format needs to be decided.

Summarization: Used to derive summarized data from the most granular data. The summarized data is then loaded into the data warehouse instead of loading the most granular level of data. For example, instead of keeping the details of each and every sales transaction in individual stores, we can summarize this data and keep the summary data storing the total sales in each store on every individual date.

Key restructuring: When choosing keys for our database tables, we have to avoid the ones with built-in meanings. If we use the product code as the primary key, a problem occurs: if the product is moved to another warehouse, the warehouse part of the product key has to be changed. Key restructuring in ETL is the transformation of such keys into generic keys generated by the system itself.

De-duplication: In a customer database, some customers may be represented by several records for various reasons: incorrect data values because of data entry errors, incomplete information, change of address, etc. It makes sense to keep a single record for one customer and link all the duplicates to this single record. This process in ETL is called de-duplication of the customer file.
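
A minimal sketch of a few of the transformation tasks described above (decoding of fields, splitting of fields, conversion of units, and de-duplication) applied to sample records; the field names, code mappings, and exchange rates are assumptions made only for illustration.

# Assumed decoding table: different source systems use different gender codes.
GENDER_DECODE = {"0": "male", "1": "female", "M": "male", "F": "female",
                 "male": "male", "female": "female"}
USD_PER_UNIT = {"USD": 1.0, "GBP": 1.27, "INR": 0.012}   # assumed conversion rates

def transform(record):
    """Apply a few standard transformation tasks to one source record."""
    out = {}
    # Decoding of fields: map source-specific codes to one agreed standard.
    out["gender"] = GENDER_DECODE.get(str(record["gender"]))
    # Splitting of fields: break a single name field into components.
    parts = record["name"].split()
    out["first_name"], out["last_name"] = parts[0], parts[-1]
    # Conversion of units: bring amounts into a common currency.
    out["amount_usd"] = round(record["amount"] * USD_PER_UNIT[record["currency"]], 2)
    return out

def deduplicate(records, key):
    """De-duplication: keep a single record per key value."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)   # first occurrence wins
    return list(seen.values())

print(transform({"gender": "F", "name": "Asha R Patil", "amount": 1500, "currency": "INR"}))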

3.3.2 Role of Data Transformation Process


The data transformation process takes the following course:
● Map the input data from the source systems to the data warehouse repository.
● Clean the data and fill in missing values with suitable default values.
● Remove duplicate records so that each record is stored only once in the data warehouse.
● Perform splitting and merging of fields.
● Sort the records.
● De-normalize the extracted data according to the dimensional model of the data warehouse.
● Convert to the appropriate data types.
● Perform aggregations and summarizations.
● Inspect the data for referential integrity.
● Consolidate and integrate data from multiple source systems (these steps are typically chained in order, as sketched below).
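
A minimal sketch of how these steps can be chained in order, assuming each step is a small function that takes and returns a list of records; the step functions and field names are placeholders, not a real tool's API.

def clean(records):
    # Fill missing values with an assumed default (placeholder cleaning step).
    return [{**r, "city": r.get("city") or "UNKNOWN"} for r in records]

def remove_duplicates(records):
    # Keep one record per customer id.
    return list({r["cust_id"]: r for r in records}.values())

def sort_records(records):
    return sorted(records, key=lambda r: r["cust_id"])

PIPELINE = [clean, remove_duplicates, sort_records]   # ordered transformation course

def run_pipeline(records):
    for step in PIPELINE:
        records = step(records)
    return records

print(run_pipeline([{"cust_id": 2, "city": None}, {"cust_id": 1, "city": "Pune"},
                    {"cust_id": 2, "city": "Mumbai"}]))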

3.4 Data Loading
• Data loading takes the prepared data, applies it to the data warehouse, and stores it in the
database
• Terminology:
– Initial Load — populating all the data warehouse tables for the very first time
– Incremental Load — applying ongoing changes as necessary in a periodic manner
– Full Refresh — completely erasing the contents of one or more tables and reloading
with fresh data (initial load is a refresh of all the tables)

Before loading data into the data warehouse, indexes are usually dropped from the tables and recreated after loading, as sketched below.
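
A minimal sketch of this drop-then-recreate pattern, using SQLite as a stand-in for the warehouse database; the table and index names are illustrative, and real warehouse DBMSs provide their own bulk-load utilities.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, store TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_store ON sales_fact (store)")

rows = [(1, "S01", 120.0), (2, "S02", 75.0)]     # prepared data from the staging area

conn.execute("DROP INDEX idx_sales_store")        # drop indexes before the load
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_sales_store ON sales_fact (store)")   # recreate afterwards
conn.commit()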

3.4.1 Techniques of Data Loading


1. Load
2. Append
3. Destructive merge
4. Constructive merge
5. Initial load
1. Load:
• If the target table to be loaded already exists and data exists in the table, the load process wipes
out the existing data and applies the data from the incoming file.
• If the table is already empty before loading, the load process simply applies the data from the
incoming file.

2. Append
• Extension of the load.
• If data already exists in the table, the append process unconditionally adds the incoming data,
preserving the existing data in the target table.
• When an incoming record is a duplicate of an already existing record, you may define how to
handle an incoming duplicate:
– The incoming record may be allowed to be added as a duplicate.
– In the other option, the incoming duplicate record may be rejected during the append
process.
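
The load and append modes just described can be sketched as follows, using an in-memory list as a stand-in for the target table; showing the duplicate-handling choice for append as a flag is an illustrative assumption.

def load(target, incoming):
    """Load mode: wipe out any existing data and apply the incoming file."""
    target.clear()
    target.extend(incoming)

def append(target, incoming, allow_duplicates=True):
    """Append mode: keep existing data; duplicates are added or rejected per policy."""
    for rec in incoming:
        if allow_duplicates or rec not in target:
            target.append(rec)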

3. Destructive merge:
• Applies incoming data to the target data.
• If the primary key of an incoming record matches with the key of an existing record, update the
matching target record.
• If the incoming record is a new record without a match with any existing record, add the
incoming record to the target table.

4. Constructive merge
• Slightly different from the destructive merge.
• If the primary key of an incoming record matches the key of an existing record, leave the existing record, add the incoming record, and mark the added record as superseding the old record (both merge modes are sketched below).
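
A minimal sketch contrasting the destructive and constructive merge modes on in-memory target structures keyed by primary key; the record layout and the supersedes_previous marker are assumptions for illustration.

def destructive_merge(target, incoming):
    # Matching key: overwrite the existing target record. New key: add the record.
    for rec in incoming:
        target[rec["key"]] = rec
    return target

def constructive_merge(target, incoming):
    # Matching key: keep the old record, add the new one, and mark it as superseding.
    for rec in incoming:
        versions = target.setdefault(rec["key"], [])
        if versions:
            rec = {**rec, "supersedes_previous": True}
        versions.append(rec)
    return target

# Usage sketch with assumed records keyed by a primary key field named "key".
print(destructive_merge({10: {"key": 10, "amount": 100}},
                        [{"key": 10, "amount": 150}, {"key": 11, "amount": 60}]))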

3.4.2 When to Go for Data Update Rather than Data Refresh


After the initial load, the data warehouse is updated using two methods:
Update: application of incremental changes from the data sources.
Refresh: complete reload at specified intervals.

Update versus Refresh:
• Update is more complex than refresh; refresh is much simpler than update.
• For an update, we need to extract the changes from each data source and then apply the extracted records to the data warehouse; for a refresh, a complete replacement of the data warehouse tables takes place.
• The time taken by update jobs depends on the number of updates and, unlike a refresh, the warehouse need not be kept offline for long; refresh jobs take a long time to complete, the warehouse has to be kept offline for a long time, and the situation worsens if the database has large tables.
• The cost of an update varies with the number of changes in the source systems; the cost of a refresh remains the same irrespective of the number of changes in the source systems.

3.4.3 Loading the Fact Tables and Dimension Tables


The keys of records in the source systems are different from the keys in the data warehouse. Therefore, before source data can be applied to the dimension tables, whether during the initial load or during updating, the production keys must be converted to system-generated keys in the data warehouse. These key conversions must be done as a part of the transformation process.

The following diagram explains how Type 1, Type 2, and Type 3 changes are handled.

The key of a fact table is a concatenation of the keys of the dimension tables. That is why the dimension tables are loaded before the fact tables.
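
A minimal sketch of this key conversion while loading fact rows: the production keys arriving from the source are looked up in (or added to) surrogate-key maps for the dimensions before the fact row is written. The names and structures are assumptions for illustration.

import itertools

surrogate_seq = itertools.count(1)    # system-generated key source
product_keys = {}                     # production product code -> surrogate key
store_keys = {}                       # production store code   -> surrogate key

def surrogate_key(key_map, production_key):
    """Return the surrogate key for a production key, assigning a new one if unseen."""
    if production_key not in key_map:
        key_map[production_key] = next(surrogate_seq)
    return key_map[production_key]

def load_fact_row(source_row, fact_table):
    """Dimension keys must be resolved before the fact row is loaded."""
    fact_table.append({
        "product_key": surrogate_key(product_keys, source_row["product_code"]),
        "store_key": surrogate_key(store_keys, source_row["store_code"]),
        "sales_amount": source_row["amount"],
    })

fact_table = []
load_fact_row({"product_code": "P-1001", "store_code": "S01", "amount": 250.0}, fact_table)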
3.5 Data Quality
Poor data quality results in poor decisions. Dirty data is one of the common reasons for the failure of a data warehouse.

What is Data Quality?


A data item is of good quality when it is exactly fit for the purpose for which the business users have defined it. Data quality is a wider concept grounded in the specific business of the company. It relates not just to single data elements but to the system as a whole, and requires the form and content of data elements to be consistent across the whole system. It is essentially needed on a corporate-wide basis.

3.5.1 Need for Data Quality


• Boosts confidence in decision making
• Enables better customer service
• Increases opportunity to add better value to the services
• Reduces risk from disastrous decisions
• Reduces costs, especially of marketing campaigns
• Enhances strategic decision making
• Improves productivity by streamlining processes
• Avoids compounding effects of data contamination

The cost of not having good quality data


• Bad decisions
• Lost business opportunities
• Wastage of resources
• Inconsistent data reports
• Time and effort needed to correct data

3.5.2 Categories of Errors


The errors are divided into four categories:
1. Incomplete Errors
a) Missing records
b) Missing fields
c) Records or fields that, by design, are not being recorded
2. Incorrect Errors
a) Wrong codes
b) Wrong calculations, aggregations
c) Duplicate records
d) Wrong information entered into the system
3. Incomprehensibility Errors
a) Multiple fields within one field
b) Unknown codes
4. Inconsistency Errors
a) Inconsistent use of different codes
b) Inconsistent meaning of codes
c) Inconsistent aggregation
d) Lack of referential integrity

1. Incomplete Errors
a) Missing records: a record is present in the source system but not in the warehouse.
b) Missing fields: fields are empty in the source system tables.
c) Records or fields, by design, not being recorded: there are two cases. First, a new field designed by the developer is not present in the source system. Second, a field is present in the source system but is empty, yet is required by the warehouse.

2. Incorrect Errors
a) Wrong codes: different codes are used in different versions of the source system.
b) Wrong calculations, aggregations: wrong calculations are performed on the detailed granular fields.
c) Duplicate records: duplicate records are present in the source systems.
d) Wrong information entered into the system: for example, the date format is American but the date is entered in British format.

3. Incomprehensibility Errors
a) Multiple fields within one field: for example, the name field in the source system contains the first, middle, and last name, but in the warehouse the developer wants to keep them in three separate fields.
b) Unknown codes: due to a lack of documentation in the source system, some fields are not understandable.

4. Inconsistency Errors
a) Inconsistent use of different codes: M and F in one system and 0 and 1 in another system.
b) Inconsistent meaning of codes: occurs because the definition of an organizational entity changes over time.
c) Inconsistent aggregation: due to inconsistent business rules, different aggregations are present in the source systems.
d) Lack of referential integrity: occurs when the source system is built without basic checks. (Simple checks for some of these error categories are sketched below.)
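
A minimal sketch of simple checks for a few of these error categories (missing fields, unknown codes, duplicate records, and lack of referential integrity); the field names and the list of valid codes are assumptions made for illustration.

VALID_GENDER_CODES = {"M", "F"}   # assumed agreed codes for the warehouse

def quality_report(records, dimension_keys):
    """Count a few categories of data quality errors in a batch of records."""
    report = {"missing_field": 0, "unknown_code": 0,
              "duplicate_record": 0, "referential_integrity": 0}
    seen_ids = set()
    for rec in records:
        if not rec.get("name"):                            # incomplete: missing field
            report["missing_field"] += 1
        if rec.get("gender") not in VALID_GENDER_CODES:    # incomprehensible: unknown code
            report["unknown_code"] += 1
        if rec["cust_id"] in seen_ids:                     # incorrect: duplicate record
            report["duplicate_record"] += 1
        seen_ids.add(rec["cust_id"])
        if rec.get("region_key") not in dimension_keys:    # inconsistency: no matching parent
            report["referential_integrity"] += 1
    return report

print(quality_report([{"cust_id": 1, "name": "", "gender": "X", "region_key": 99}],
                     dimension_keys={1, 2, 3}))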
3.5.3 Issues in Data Cleansing
Which data to cleanse
Primarily, it must be the users' decision, since it is the users who know best what type of data they need from the data warehouse.
For the best quality, however, both the users and the project team should work jointly and take the decision together.
The team must determine how the data has to be cleansed and weigh the benefits of data cleansing against the consequences of leaving the dirty data in place, studying how it would affect any analysis made by the users of the data warehouse.
Where to cleanse
• Cleansing data in the staging area is simple and practically possible. In the data staging area, the developer team has already solved the extraction problems and is aware of the structure, content, and nature of the data.
• If the cleansing operation takes place in the source systems, it becomes a complex, difficult, and expensive task.

How to cleanse in the source systems


Appropriate tools can be used; this works for newer source systems. For old source systems, the developer team has to write in-house programs.

