ETL Basics
Lesson 2: ETL Process
Lesson Objectives
On completion of this lesson on the ETL Process, you
will be able to understand:
The ETL process
The steps in Data Cleansing
Copyright Capgemini 2015. All Rights Reserved 2
The ETL Process
Source Systems -> [Extract] -> Staging Area -> [Transform] -> [Load] -> Presentation System
The ETL Process
Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.
EXTRACTION PHASE
ETL DATA CAPTURE
Capture (= extract): obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing only the changes that have occurred since the last static extract.
Change Data Capture
Data warehousing involves the extraction and transportation of data
from one or more databases into a target system or systems for
analysis.
But this involves the extraction and transportation of huge volumes
of data and is very expensive in both resources and time.
The ability to capture only the changed source data and to move it
from a source to a target system(s) in real time is known as Change
Data Capture (CDC).
Change Data Capture
CDC helps identify the data in the source system that has changed
since the last extraction.
CDC is a set of software design patterns used to determine which data has
changed in a database.
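As a minimal sketch of one such pattern, timestamp-based CDC keeps only rows modified since the previous extraction. This assumes each source row carries a `modified_at` timestamp column; real CDC products often read the database transaction log instead.

```python
from datetime import datetime

def capture_changes(rows, last_extract_time):
    # Timestamp-based CDC: keep only rows whose 'modified_at' column
    # is newer than the previous extraction time.
    return [r for r in rows if r["modified_at"] > last_extract_time]

source_rows = [
    {"id": 1, "modified_at": datetime(2015, 1, 10)},
    {"id": 2, "modified_at": datetime(2015, 3, 5)},
]
changed = capture_changes(source_rows, datetime(2015, 2, 1))  # only id 2
```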
Change Data Capture
Based on the Publisher/Subscriber model.
Publisher
Identifies the source tables from which the change data needs to be captured
Captures the change data and stores it in specially created change tables
Allows the subscribers controlled access to the change data
Subscriber
Needs to know what change data it is interested in
It creates a subscriber view to access the change data to which it has been
granted access by the publisher
Data Staging
Often used as an interim step between data extraction and later
steps
Accumulates data from asynchronous sources using native
interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed
and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
Reasons for Dirty Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems
ETL DATA Extraction
The extraction process can be done either by hand-coded methods or by
using tools. Both custom-programmed/hand-coded extraction (e.g. PL/SQL
scripts) and tool-based extraction have advantages and disadvantages:
Tools enforce a well-defined, disciplined approach and provide documentation.
Tools make extraction easier by providing click, drag and drop features.
Hand-coded extraction is cost effective, since the PL/SQL constructs are
available with the RDBMS.
Hand-coded extraction is used when the programmer clearly knows the data
structures involved.
ETL - Extraction Techniques
Bulk Extraction
The entire data warehouse is refreshed periodically by extractions from the
source systems.
All applicable data are extracted from the source systems for loading into the
warehouse.
This approach makes heavy use of the network connection when loading data from
source to target databases, but it is easy to set up and maintain.
Data Extraction
Capture of data from Source Systems
Important to decide the frequency of Extraction
Sometimes source data is copied to the target database using the
replication capabilities of standard RDBMS (not recommended
because of dirty data in the source systems)
Data Transformation
Transforms the data in accordance with the business rules and
standards that have been established
Examples include: format changes, de-duplication, splitting up fields,
replacement of codes, derived values, and aggregates
Data Transformation
Validating
Process of ensuring that the data captured is accurate and transformation process
is correct
E.g. the Date of Birth of a Customer should not be later than today's date
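The date-of-birth rule above can be sketched as a simple validation check (a minimal illustration; real validation frameworks apply many such rules per record):

```python
from datetime import date

def dob_is_valid(dob):
    # A record fails validation if the customer's date of birth
    # lies in the future.
    return dob <= date.today()

dob_is_valid(date(1980, 5, 1))   # a plausible date of birth passes
dob_is_valid(date(2999, 1, 1))   # a future date fails
```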
Data Transformation
Data Cleansing
Source systems contain dirty data that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used.
Important for performing name and address correction and house holding
functions
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium),
and Firstlogic
Data Transformation
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Conditioning
Enrichment
Data Transformation
Parsing
Parsing locates and identifies individual data elements in the source files and then
isolates these data elements in the target files
Examples include:
parsing the first, middle, and last name;
street number and street name; and city and state
Data Transformation
Parsing
Input Data from Source File:
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File:
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
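The name line of this example can be parsed with a few string operations. This is a simplified sketch: production parsers handle far more name formats than a strict 'First Middle Last, TITLE' layout.

```python
def parse_name_line(line):
    # Split a 'First Middle Last, TITLE' line into individual
    # data elements, isolating each in its own target field.
    name_part, _, title = line.partition(",")
    first, middle, last = name_part.split()
    return {"first": first, "middle": middle,
            "last": last, "title": title.strip()}

parsed = parse_name_line("Beth Christine Parker, SLS MGR")
```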
Data Transformation
Correcting
Corrects parsed individual data components using sophisticated data algorithms
and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
Data Transformation
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and
consistent) format using both standard and custom business rules.
Examples include adding a pre-name, replacing a nickname, and using a
preferred street name.
Data Transformation
Standardizing
Corrected Data:
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardized Data:
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Data Transformation
Matching
Searching and matching records within and across the parsed, corrected and
standardized data based on predefined business rules to eliminate duplications.
Examples include identifying similar names and addresses.
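A minimal sketch of name matching, using character-level similarity from Python's standard library. The 0.8 threshold is an illustrative choice, not a standard; commercial matching engines use far richer rules (phonetic codes, address comparison, match-standard lists).

```python
from difflib import SequenceMatcher

def is_probable_match(a, b, threshold=0.8):
    # Flag two name strings as a likely duplicate when their
    # character-level similarity ratio exceeds the threshold.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

is_probable_match("Beth Parker", "Elizabeth Parker")  # likely the same person
is_probable_match("Beth Parker", "John Smith")        # clearly different
```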
Data Transformation
Matching
Corrected Data (Data Source #1):
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2):
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
Data Transformation
Consolidating
Analyzing and identifying relationships between matched records
and consolidating/merging them into ONE representation.
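Consolidation can be sketched with a simple survivorship rule (an illustrative one: walk the matched records in priority order and let later sources fill in fields the earlier ones left empty):

```python
def consolidate(records):
    # Merge matched records into one representation: the first
    # non-empty value seen for each field survives.
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

source1 = {"name": "Beth Parker", "title": "Sales Mgr.", "phone": ""}
source2 = {"name": "Elizabeth Parker-Lewis", "title": "", "phone": "708-555-1234"}
golden = consolidate([source1, source2])
```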
Data Transformation
Consolidating
Corrected Data (Data Source #1) + Corrected Data (Data Source #2)

Consolidated Data:
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
Data Transformation
Conditioning
The conversion of data types from the source to the target data store (warehouse)
-- almost always a relational database
E.g. an OLTP date stored as text (DDMMYY); the DW format is an Oracle Date type
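The DDMMYY example above can be sketched as follows (century handling follows Python's `%y` rule and is a simplification of what a real warehouse load would do):

```python
from datetime import datetime

def condition_date(ddmmyy):
    # Convert an OLTP text date in DDMMYY form into a real date
    # value, as the warehouse date column expects.
    return datetime.strptime(ddmmyy, "%d%m%y").date()

condition_date("251215")  # 25 Dec 2015 as a date type
```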
Data Transformation
Enrichment
Adding/combining external data values or rules to enrich the information already
existing in the data
E.g. if we have a reference list relating Zip Code, City and State, an address
record with Zip code 06905 can be enriched by a lookup on this table:
Zip Code 06905 > City Stamford > State CT
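The zip-code lookup above can be sketched as follows (the lookup table here holds one hypothetical sample entry standing in for a full external reference file):

```python
# A tiny sample of an external Zip-to-City/State reference table.
ZIP_LOOKUP = {"06905": ("Stamford", "CT")}

def enrich_address(record):
    # Fill in missing city/state by looking up the record's
    # zip code in the external reference table.
    city, state = ZIP_LOOKUP.get(record.get("zip"), (None, None))
    return {**record,
            "city": record.get("city") or city,
            "state": record.get("state") or state}

enriched = enrich_address({"zip": "06905"})
```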
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a load window
Loading the Extracted and Transformed data into the Staging Area
or Data Warehouse.
Data Loading
First time bulk load to get the historical data into the Data
Warehouse
Periodic Incremental loads to bring in modified data
Design the load strategy using the appropriate Slowly Changing
Dimension type
The Loading window should be as small as possible
Should be coupled with a strong Error Management process to
capture failures or rejections in the Loading process
Slowly Changing Dimension Types
Three types of slowly changing dimensions
Type 1
Updates the existing record with modifications
Does not maintain history
Type 2
Adds a new record
Maintains history
Retains the old record
Type 3
Keeps old and new values in the existing row
Requires a design change
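A Type 2 change can be sketched as below: the current dimension row is closed out and a new row is appended, so history is preserved. Using `end_date = None` to mark the current version is one common convention, assumed here for illustration.

```python
from datetime import date

def scd_type2_update(dim_rows, key, new_attrs, change_date):
    # Expire the current row for this key, then append a new
    # row holding the changed attributes.
    for row in dim_rows:
        if row["key"] == key and row["end_date"] is None:
            row["end_date"] = change_date   # close out the old version
    dim_rows.append({"key": key, **new_attrs, "end_date": None})

dim = [{"key": 1, "city": "Chicago", "end_date": None}]
scd_type2_update(dim, 1, {"city": "Stamford"}, date(2015, 6, 1))
# dim now holds both the expired Chicago row and the current Stamford row
```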
Metadata
Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database,
table and column names; refresh schedules; data usage measures;
etc.
Users need to know entity/attribute definitions; reports/query tools
available; report distribution information; help desk contact
information, etc.
Metadata
Metadata is more comprehensive and transcends the data.
Metadata provides the format and name of data items.
It actually provides the context in which a data element exists:
the domain of possible values;
the relation that the data element has to others;
the data's business rules;
and even the origin of the data.
Importance of Metadata
Metadata establish the context of the Warehouse data
Metadata facilitate the Analysis Process
Metadata are a form of Audit Trail for Data Transformation
Metadata Improve or Maintain Data Quality
Features of ETL Tools
Support data extraction, cleansing, aggregation, reorganization,
transformation, and load operations
Generate and maintain centralized metadata
Filter data, convert codes, calculate derived values, map source
data fields to target data fields
Automatic generation of ETL programs
Closely integrated with RDBMS
High-speed loading of target data warehouses using engine-driven
ETL tools
Advantages of using ETL Tools
GUI-based design of jobs eases development and maintenance
Generation of directly executable code
Engine driven technology is fast, efficient and multithreaded
In-memory data streaming for high-speed data processing
Products are easy to learn and require less training
Advantages of using ETL Tools
Automatic generation and maintenance of open, extensible
metadata
Support for multiple data formats and platforms
Large number of vendor supplied data transformation objects
Example of ETL requirements
Integration of masters across different systems
E.g. State code AP could mean Andhra Pradesh in one system while it could
mean Arunachal Pradesh in another
De-duplication of data from different systems
E.g. State Karnataka could be represented as KA in one system and KN in
another system
Mapping of old codes to Data Warehouse codes
Data Cleansing - Changing to upper case, assigning defaults to
unavailable data elements
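The master-integration and code-mapping requirements above can be sketched with a translation table. The system names and map contents here are hypothetical illustrations:

```python
# Hypothetical master-code map: each (source system, code) pair
# is translated to the single warehouse code during transformation.
CODE_MAP = {
    ("system_a", "KA"): "KA",  # Karnataka as coded in system A
    ("system_b", "KN"): "KA",  # the same state coded differently in system B
}

def map_state_code(system, code):
    # Unknown codes pass through unchanged, so they can be
    # flagged later by the error management process.
    return CODE_MAP.get((system, code), code)

map_state_code("system_b", "KN")  # both systems now agree on "KA"
```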
Summary
In this module, you learned about the following:
ETL process
Cleansing steps