Unit II
Data Warehousing and Data Mining
a) Data Warehousing (DWH):
Definition, Characteristics, Types,
Data warehousing framework,
DWH three-tier architecture,
Alternative architectures,
Data warehousing integration,
Data warehousing development approaches,
Real-time data warehousing.
b) Data Mining:
Definition, Characteristics, Benefits, Data Mining Functions, Data
Mining Applications, Data Mining Techniques and Tools.
Text Mining, Web Mining.
Data Warehousing
Data Warehouse:
It is an optimized form of an operational database that contains
only relevant information and provides fast access to data.
Its key characteristics are:
Subject-oriented: e.g. data related to all the departments of an
organization.
Integrated: different views of data (from sources A, B, C) are
combined into a single unified warehouse view.
Time-variant.
Nonvolatile.
Introduction
A warehouse is a repository for data
imported from other databases. Attached
to the front end of the warehouse is a set of
analytical procedures for making sense out
of the data. Retailers, home shopping
companies and banks have been early
adopters of data warehouses.
Different people have different definitions
for a data warehouse.
What is Data Warehousing?
Data warehousing is a technique for collecting and
managing data from varied sources to provide meaningful
business insights. It is a blend of technologies and
components which allows the strategic use of data.
It is electronic storage of a large amount of information by
a business which is designed for query and analysis
instead of transaction processing. It is a process of
transforming data into information and making it
available to users in a timely manner to make a difference.
The data warehouse is the core of the BI system which is
built for data analysis and reporting.
A data warehouse system is also known by other names,
such as a Decision Support System (DSS) or a Business Intelligence (BI) solution.
History of Data warehouse
A data warehouse helps users to understand and
enhance their organization's performance. The need to
warehouse data evolved as computer systems became
more complex and needed to handle increasing
amounts of information. However, data warehousing
is not a new idea.
Here are some key events in evolution of Data
Warehouse-
1960: Dartmouth and General Mills, in a joint research
project, develop the terms "dimensions" and "facts".
1970: ACNielsen and IRI introduce dimensional data
marts for retail sales.
1983: Teradata Corporation introduces a database
management system specifically designed for
decision support.
Data warehousing as such started in the late 1980s, when the IBM
researchers Barry Devlin and Paul Murphy developed the
"Business Data Warehouse".
However, the real concept was given by Bill Inmon, who is
considered the father of the data warehouse. He has written
about a variety of topics on the building, usage, and
maintenance of the warehouse and the Corporate Information
Factory.
Inmon
Father of the data warehouse
Co-creator of the Corporate
Information Factory.
He has 35 years of
experience in database
technology management
and data warehouse design.
Inmon-Cont’d
Bill has written about a variety of topics on the building,
usage, & maintenance of the data warehouse & the
Corporate Information Factory.
He has written more than 650 articles (Datamation,
ComputerWorld, and Byte Magazine).
Inmon has published 45 books.
Many of his books have been translated into Chinese, Dutch,
French, German, Japanese, Korean, Portuguese, Russian,
and Spanish.
Definition
What is Data Warehouse?
A data warehouse is a collection of integrated
databases designed to support a DSS.
According to the definition of Inmon, the father of data
warehousing (Inmon, 1992a, p. 5):
It is a collection of integrated, subject-oriented
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
Cont’d.
Where is it used?
It is used for evaluating future strategy.
It needs a successful technician who is:
Flexible.
A team player.
Equipped with a good balance of business and technical understanding.
Introduction-Cont’d.
The ultimate use of data warehouse is Mass Customization.
For example, it helped grow Capital One's customer base from
1 million to approximately 9 million in 8 years.
Just like a muscle, a DW increases in strength with active use.
With each new test and product, valuable information is
added to the DW, allowing the analyst to learn from the
success and failure of the past.
The key to survival:
Is the ability to analyze, plan, and react to changing
business conditions in a much more rapid fashion.
Data Warehouse
In order for its data to be effective, a DW must be:
Consistent.
Well integrated.
Well defined.
Time stamped.
The DW environment comprises:
the data store, the data marts, and the metadata.
Bill Inmon provided the following
definition:
“A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of
management’s decision making process.”
Subject-Oriented:
A data warehouse can be used to analyze a
particular subject area. For example, “sales” can
be a particular subject.
Integrated:
A data warehouse integrates data from multiple
data sources.
For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse.
For example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from a data
warehouse.
This contrasts with a transaction system, where
often only the most recent data is kept.
For example, a transaction system may hold only the
most recent address of a customer, whereas a data
warehouse can hold all addresses associated with a
customer.
Non-volatile:
Once data is in the data warehouse, it will not
change. So, historical data in a data warehouse
should never be altered.
Ralph Kimball provided a more concise
definition of a data warehouse:
“A data warehouse is a copy of transaction data
specifically structured for query and analysis.”
This is a functional view of a data warehouse.
Kimball did not address how the data warehouse is
built like Inmon did, rather he focused on the
functionality of a data warehouse.
A data warehouse (DW) is a database used for
reporting and analysis. The data stored in the
warehouse is uploaded from the operational
systems. The data may pass through an
operational data store for additional operations
before it is used in the DW for reporting.
A data warehouse maintains its
functions in three layers:
Layer 1: Staging - used to store raw data for
use by developers.
Layer 2: Integration - used to integrate data
and to have a level of abstraction from users.
Layer 3: Access - used to access data.
One thing to mention about data warehouses is
that they can be subdivided into data marts.
Types of Data Warehouse
Three main types of Data Warehouses are:
1. Enterprise Data Warehouse:
Enterprise Data Warehouse is a centralized warehouse. It provides decision
support service across the enterprise. It offers a unified approach for
organizing and representing data. It also provides the ability to classify data
according to the subject and gives access according to those divisions.
2. Operational Data Store:
An Operational Data Store, also called an ODS, is a data store
required when neither the data warehouse nor OLTP systems support an
organization's reporting needs. In an ODS, the data is refreshed in real
time. Hence, it is widely preferred for routine activities such as storing records of
the employees.
3. Data Mart:
A data mart stores a subset of the data from a
warehouse, focused on a specific aspect of a
company such as sales or a marketing process.
This definition of the data warehouse focuses on
data storage.
The main source of the data is cleaned,
transformed, catalogued and made available for
use by managers and other business professionals
for data mining, online analytical processing,
market research and decision support.
Characteristics of Data Warehouse:
i. Subject-oriented :
The warehouse organizes data around the essential subjects of
the business (customers and products) rather than around
applications such as inventory management or order
processing.
ii. Integrated:
It is consistent in the way that data from several sources is
extracted and transformed. For example, coding conventions
are standardized: M = male, F = female.
iii. Time-variant:
Data are organized by various time periods (e.g. months).
iv. Non-volatile:
The warehouse’s database is not updated
in real time. There is periodic bulk
uploading of transactional and other data.
This makes the data less subject to
momentary change. There are a number of
steps and processes in building a
warehouse.
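The "integrated" characteristic can be made concrete with a small sketch. The Python fragment below is a minimal, illustrative example of standardizing coding conventions during integration; the source layouts and gender codes are assumptions, not taken from any real system:

```python
# Sketch: standardizing coding conventions while integrating sources.
# Source A encodes gender as "Male"/"Female"; source B uses 1/2. Both
# are mapped to the warehouse-wide convention: M = male, F = female.

def standardize_gender(value):
    """Map any known source encoding to the warehouse code M or F."""
    mapping = {"Male": "M", "Female": "F", 1: "M", 2: "F"}
    return mapping.get(value, "U")  # U = unknown / unmapped code

source_a = [{"cust": "C1", "gender": "Male"},
            {"cust": "C2", "gender": "Female"}]
source_b = [{"cust": "C3", "gender": 1}]

# The integrated view uses one consistent code for every row.
integrated = [{**row, "gender": standardize_gender(row["gender"])}
              for row in source_a + source_b]
print(integrated)
```

In a real warehouse this mapping would run inside the ETL transformation step rather than in application code.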
What Is a Data Warehouse Used For?
Here, are most common sectors where Data warehouse is used:
Airline:
In the airline system, it is used for operational purposes such as crew
assignment, analysis of route profitability, frequent-flyer
program promotions, etc.
Banking:
It is widely used in the banking sector to manage the resources
available on desk effectively. Some banks also use it for
market research and for performance analysis of products and
operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients'
treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
Public sector:
In the public sector, the data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyze tax records and health policy records for every individual.
Investment and Insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns, customer trends, and to
track market movements.
Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to
track items, customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and to make
distribution decisions.
Hospitality Industry:
This Industry utilizes warehouse services to design as well as estimate their advertising and
promotion campaigns where they want to target clients based on their feedback and travel
patterns.
Data Warehouse Tools
There are many data warehousing tools available in the market. Here are some of the most
prominent ones:
1. MarkLogic:
MarkLogic is a useful data warehousing solution that makes data integration easier and faster using
an array of enterprise features. This tool helps to perform very complex search operations. It can
query different types of data like documents, relationships, and metadata.
http://developer.marklogic.com/products
2. Oracle:
Oracle is the industry-leading database. It offers a wide range of choice of data warehouse solutions
for both on-premises and in the cloud. It helps to optimize customer experiences by increasing
operational efficiency.
https://www.oracle.com/index.html
3. Amazon RedShift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool to analyze all types of
data using standard SQL and existing BI tools. It also allows running complex queries against
petabytes of structured data, using the technique of query optimization.
https://aws.amazon.com/redshift/?nc2=h_m1
Benefits of a Data Warehouse:
This architectural complexity provides the opportunity to:
a. Maintain data history, even if the source transaction systems do not.
b. Integrate data from multiple source systems, enabling a central view
across the enterprise. This benefit is always valuable, but particularly so
when the organization has grown by merger.
c. Improve data, by providing consistent codes and descriptions,
flagging or even fixing bad data.
d. Present the organization’s information consistently.
e. Provide a single common data model for all data of interest
regardless of the data’s source.
f. Restructure the data so that it makes sense to the business users.
g. Restructure the data so that it delivers excellent query performance,
even for complex analytic queries, without impacting the operational
systems.
h. Add value to operational business applications, notably customer
relationship management (CRM) systems.
Dimensions of Data Warehouse:
A dimension is a data element that categorizes each item in a data set into
non-overlapping regions. A data warehouse dimension provides the means
to “slice and dice” data in a data warehouse. Dimensions provide structured
labeling information to otherwise unordered numeric measures. For
example, “Customer”, “Date”, and “Product” are all dimensions that could
be applied meaningfully to a sales receipt. A dimensional data element is
similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering,
grouping and labeling. For example, in a data warehouse where each person
is categorized as having a gender of male, female or unknown, a user of the
data warehouse would then be able to filter or categorize each presentation
or report by either filtering based on the gender dimension or displaying
results broken out by the gender.
Each dimension in a data warehouse may have one or more hierarchies
applied to it. For the “Date” dimension, there are several possible
hierarchies: “Day > Month > Year”, “Day > Week > Year”, “Day > Month >
Quarter > Year”, etc.
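The filter/group/hierarchy roles of dimensions can be sketched with plain Python; the sales rows and dimension values below are illustrative assumptions:

```python
from collections import defaultdict
from datetime import date

# Each fact row is a sales receipt with Customer, Date, and Product
# dimensions attached to one numeric measure, "amount".
sales = [
    {"customer": "C1", "date": date(2023, 1, 5),  "product": "P1", "amount": 100},
    {"customer": "C2", "date": date(2023, 1, 20), "product": "P2", "amount": 250},
    {"customer": "C1", "date": date(2023, 2, 3),  "product": "P1", "amount": 75},
]

# Filtering on the Product dimension:
p1_sales = [row for row in sales if row["product"] == "P1"]

# Grouping along the Day > Month > Year hierarchy of the Date
# dimension: roll the daily facts up to the month level.
by_month = defaultdict(int)
for row in sales:
    by_month[(row["date"].year, row["date"].month)] += row["amount"]

print(dict(by_month))
```

Labeling is the third role: each report line above is labeled by the month it summarizes.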
Business Intelligence
Business Intelligence is a term commonly associated with
data warehousing.
In fact, many of the tool vendors position their products
as business intelligence software rather than data
warehousing software.
There are other occasions where the two terms are used
interchangeably.
Business intelligence usually refers to the information that is
available for the enterprise to make decisions on.
A data warehousing (or data mart) system is the backend, or the
infrastructural, component for achieving business intelligence.
Business intelligence also includes the insight gained from doing
data mining analysis, as well as unstructured data (thus the
need for content management systems).
Components of Data Warehouse:
The data warehouse is based on an RDBMS server which is a central
information repository that is surrounded by some key components to
make the entire environment functional, manageable and accessible.
There are mainly five components of a Data Warehouse:
1. Data Warehouse Database
2. Sourcing, Acquisition, Clean-up and Transformation Tools
(ETL)
3. Metadata
4. Query Tools
5. Data Marts
Three parts of the data
warehouse
The data warehouse that contains the data and
associated software
Data acquisition (back-end) software that
extracts data from legacy systems and external
sources, consolidates and summarizes them, and
loads them into the data warehouse
Client (front-end) software that allows users to
access and analyze data from the warehouse
Data Warehouse Architectures
There are mainly three types of Data warehouse
Architectures: -
Single-tier architecture
The objective of a single layer is to minimize the
amount of data stored; it achieves this goal by
removing data redundancy.
This architecture is not frequently used in practice.
Two-tier architecture
Two-layer architecture separates physically
available sources and data warehouse.
This architecture is not expandable and does not
support a large number of end-users.
It also has connectivity problems because of
network limitations.
Architecture of a two-tier data warehouse
3-Tier Data Warehouse Architecture
Architecture of a three-tier data warehouse
Data warehouses adopt a three-tier architecture.
These 3 tiers are:
Bottom Tier
Middle Tier
Top Tier
Three-tier architecture
This is the most widely used architecture. It consists of the Top, Middle and Bottom
Tier.
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually
a relational database system. Data is cleansed, transformed, and loaded into this layer
using back-end tools.
Middle Tier: The middle tier in Data warehouse is an OLAP server which is
implemented using either ROLAP or MOLAP model. For a user, this application tier
presents an abstracted view of the database. This layer also acts as a mediator between
the end-user and the database.
In the middle tier, we have the OLAP Server that can be implemented in either of the
following ways.
By Relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.
By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top Tier: The top tier is a front-end client layer. It consists of the tools and APIs that you
use to connect to and get data out of the data warehouse. These could be query tools, reporting
tools, managed query tools, analysis tools, and data mining tools.
Data Sources:
All the data related to a business organization is
stored in operational databases, external files and flat
files.
These sources are application-oriented.
Eg: complete data of an organization, such as training details,
customer details, sales, departments, transactions, employee details,
etc.
Data is present here in different formats or in host format.
It contains data that is not well documented.
Bottom Tier:
Data warehouse server
The data warehouse server fetches only the relevant
information based on a data mining (mining
knowledge from large amounts of data) request.
Eg: customer profile information provided by external
consultants.
Data is fed into the bottom tier by back-end tools
and utilities.
Bottom tier
The bottom tier is a warehouse database server that is almost always a relational
database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources.
These tools and utilities perform data extraction, cleaning, and transformation, as
well as load and refresh functions to update the data warehouse.
The data are extracted using application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLEDB (Object
Linking and Embedding Database) by Microsoft, and JDBC (Java Database
Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
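As a rough sketch of the gateway idea (a client program generating SQL that the server executes), the fragment below uses Python's built-in sqlite3 module as a stand-in for an ODBC/JDBC connection; the table layout is an illustrative assumption:

```python
import sqlite3

# The operational source database (in-memory for this sketch).
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
operational.executemany("INSERT INTO orders VALUES (?, ?)",
                        [(1, "C1"), (2, "C2")])

# The client program generates SQL; the server executes it and
# returns the result set through the gateway connection.
extracted = operational.execute("SELECT id, customer FROM orders").fetchall()
print(extracted)  # [(1, 'C1'), (2, 'C2')]
operational.close()
```

With a real ODBC or JDBC gateway only the connection call changes; the pattern of sending SQL and fetching rows stays the same.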
Backend Tools & Utilities:
Functions performed by backend tools and
utilities are:
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
Bottom Tier Contains:
Data warehouse
Metadata Repository
Data Marts
Monitoring and
Administration
Metadata repository:
It describes what is available in the data warehouse.
It contains:
Structure of data warehouse
Data names and definitions
Source of extracted data
Algorithms used for data cleaning
Sequence of transformations applied on data
Data related to system performance
Data Marts:
A subset of the data warehouse containing only small slices
of the data warehouse.
Eg: data pertaining to a single department.
Two types of data marts:
Dependent: sourced directly from the data warehouse.
Independent: sourced from one or more data sources.
Monitoring & Administration:
Data Refreshment
Data source synchronization
Disaster recovery
Managing access control and security
Manage data growth, database performance
Controlling the number & range of queries
Limiting the size of data warehouse
Bottom Tier: Data Warehouse Server
Data from sources A, B and C feeds the data warehouse, the data
marts and the metadata repository, under monitoring and
administration.
Middle Tier: OLAP
Server
It presents to the users multidimensional data from the
data warehouse or data marts.
Typically implemented using two models:
ROLAP Model: presents data in relational tables.
MOLAP Model: presents data in array-based structures that
map directly to the data cube's array structure.
Middle tier
The middle tier is an OLAP server that is typically implemented
using either:
A relational OLAP (ROLAP) model, that is, an extended
relational DBMS that maps operations on multidimensional data
to standard relational operations, or
A multidimensional OLAP (MOLAP) model, that is, a
special-purpose server that directly implements
multidimensional data and operations.
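The ROLAP mapping can be sketched concretely: a multidimensional roll-up becomes an ordinary relational GROUP BY. Here sqlite3 again stands in for the relational back end, and the star-schema names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (product TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("P1", "2023-01", 100.0),
    ("P2", "2023-01", 250.0),
    ("P1", "2023-02", 75.0),
])

# A roll-up from the (product, month) level to the month level is
# expressed as a standard relational GROUP BY on the fact table.
rollup = conn.execute(
    "SELECT month, SUM(amount) FROM fact_sales GROUP BY month ORDER BY month"
).fetchall()
print(rollup)  # [('2023-01', 350.0), ('2023-02', 75.0)]
conn.close()
```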
OLAP Operations
Since OLAP servers are based on a multidimensional view of
data, they support the following operations on
multidimensional data:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
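These operations can be sketched on a tiny in-memory "cube"; the data and dimension names below are illustrative assumptions:

```python
from collections import defaultdict

# The cube maps (product, region, month) -> amount.
cube = {
    ("P1", "North", "2023-01"): 100,
    ("P1", "South", "2023-01"): 40,
    ("P2", "North", "2023-02"): 250,
}

# Roll-up: aggregate away the region dimension (drill-down is the
# inverse: returning from this summary to the detailed cube).
rolled_up = defaultdict(int)
for (product, region, month), amount in cube.items():
    rolled_up[(product, month)] += amount

# Slice: fix a single dimension value (region = "North").
north_slice = {k: v for k, v in cube.items() if k[1] == "North"}

# Dice: fix values on two or more dimensions.
diced = {k: v for k, v in cube.items()
         if k[0] == "P1" and k[2] == "2023-01"}

# Pivot (rotate): reorder the axes, e.g. view month before product.
pivoted = {(month, product, region): v
           for (product, region, month), v in cube.items()}

print(dict(rolled_up), len(north_slice), len(diced))
```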
Four types of OLAP servers:
Relational OLAP
ROLAP servers are placed between the relational back-end server
and client front-end tools.
To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMSs.
ROLAP includes the following:
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Many MOLAP servers use two levels of data storage representation to handle
dense and sparse data sets.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP.
It offers the higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers allow storing large volumes of detailed information;
the aggregations are stored separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
Top Tier: Front end tools
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools.
Query and reporting tools
Reporting Tools: Production reporting tools
Report writers
Managed query tools: Point and click creation of
SQL used in customer mailing list.
Analysis tools : Prepare charts based on analysis
Data mining tools: mine knowledge, discover
hidden pieces of information, new correlations, and useful
patterns.
OLAP vs OLTP
Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | Used by knowledge workers such as executives, managers and analysts. | Used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on star schema, snowflake schema and fact constellation schema. | Based on the entity-relationship model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
From the architecture point of view, there are three data warehouse models:
Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems or
external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data.
It can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data mart
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users.
Virtual warehouse
A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may
be materialized.
Architecture of web-based data warehousing
Alternative Data Warehouse
Architectures:
•EDW Architecture
Centralized enterprise data warehouse: similar to hub-and-spoke, but
there are no dependent data marts; instead, a single large enterprise DW
serves all the needs of all organizational units. It provides users access
to all data in the DW instead of limiting them to just the data marts.
Alternative Data Warehouse
Architectures:
•Data Mart Architecture
Independent data marts: arguably the simplest and the least costly
architecture alternative; data marts are developed to operate
independently of each other to serve the needs of individual
organizational units.
Data mart bus: individual marts are linked to each other via some
kind of middleware.
Alternative Data Warehouse
Architectures:
•Hub-and-Spoke Data Mart Architecture
Hub-and-spoke: attention is focused on building a scalable and
maintainable infrastructure; this allows for easy customization of
user interfaces and reports.
Alternative Data Warehouse Architectures:
•EDW and ODS (real-time access support)
Alternative Data Warehouse
Architectures:
•Distributed Data Warehouse Architecture
Federated: a concession to the natural forces that undermine the
best plans for developing a perfect system. It uses all possible
means to integrate analytical resources from multiple sources to
meet changing needs or business conditions. Essentially, the
federated approach involves integrating disparate systems. It is
good for supplementing data warehouses, but not for replacing them.
Alternative Architectures for Data Warehouse
Efforts
Teradata Corp.’s
EDW
The alternative data warehouse architectures and their basic descriptions:
Independent data marts: arguably the simplest and the least costly
architecture alternative; data marts are developed to operate
independently of each other to serve the needs of individual
organizational units.
Data mart bus: individual marts are linked to each other via some kind
of middleware.
Hub-and-spoke: attention is focused on building a scalable and
maintainable infrastructure; this allows for easy customization of user
interfaces and reports.
Centralized enterprise data warehouse: similar to hub-and-spoke, but
there are no dependent data marts; instead, a single large enterprise
DW serves all the needs of all organizational units. It provides users access
to all data in the DW instead of limiting them to just the data marts.
Ten factors that potentially affect the architecture selection
decision:
1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Data Warehousing & Application
Integration
Data integration
Data integration involves combining data from
several disparate sources, which are stored using
various technologies and provide a unified view of
the data.
Data integration comprises three major processes that, when
correctly implemented, permit data to be accessed by and made
available to an array of ETL and analysis tools and the data
warehousing environment: data access,
data federation (integration of business views
across multiple data stores), and change capture.
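Data federation in particular can be sketched as presenting one unified view over disparate sources without physically consolidating them first; the source shapes below are illustrative assumptions:

```python
# Two hypothetical sources with different schemas.
crm_source = [{"id": 1, "name": "Acme"}]            # e.g. a CRM database
erp_source = [{"cust_no": 2, "cust_name": "Beta"}]  # e.g. an ERP system

def unified_customers():
    """Yield every customer, mapped into one common (id, name) schema."""
    for row in crm_source:
        yield {"id": row["id"], "name": row["name"]}
    for row in erp_source:
        yield {"id": row["cust_no"], "name": row["cust_name"]}

view = list(unified_customers())
print(view)  # [{'id': 1, 'name': 'Acme'}, {'id': 2, 'name': 'Beta'}]
```

The data stays in its sources; only the query-time view is unified, which is what distinguishes federation from a physical warehouse load.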
Various Integration technologies that enable data
and metadata integration are:
1. Enterprise Application integration (EAI)
2. Service Oriented architecture (SOA)
3. Enterprise Information Integration (EII)
4. Extraction Transformation and Load (ETL)
Enterprise Application integration
Enterprise Application Integration involves integrating
application functionality and is focused on sharing
functionality (rather than data) across systems, thereby
enabling flexibility and reuse.
EAI is accomplished by using SOA coarse-grained
services.
Enterprise Information integration
Enterprise information integration (EII) is software that
combines enterprise data and information into a unique
data monitoring interface where data is expressed via
uniform representation.
EII is an evolving tool space that promises real-time data
integration from a variety of sources, such as relational
databases, web services, and multidimensional databases.
EII tools use predefined metadata to make the
integrated data appear relational to the end user.
XML is the most important aspect of EII because it allows
data to be tagged at creation time or later.
ETL - Introduction
ETL stands for Extract, Transform and Load.
An ETL tool extracts the data from different
RDBMS source systems, transforms the data
(applying calculations, concatenations, etc.),
and then loads the data into the Data Warehouse
system.
The data is loaded in the DW system in the
form of dimension and fact tables.
What are ETL Tools?
ETL is the process of transferring data from
source database to the destination
data warehouse.
In the process, there are 3 different sub-
processes like E for Extract, T for Transform
and L for Load.
The data is extracted from the source database
in the extraction process which is then
transformed into the required format and then
loaded to the destination data warehouse.
For performing all these functions there are
certain tools which are called the ETL tools.
The lists of the ETL tools are given below:
• Pervasive Data Integrator
• Pentaho Data Integration
• Open Text Integration Center
• Relational Junction ETL Manager (Sesame Software)
• CloverETL
• PowerCenter Informatica
• Talend Studio for Data Integration
• Oracle Warehouse Builder (OWB)
• Oracle Data Integrator (ODI)
• Data Migrator (IBI)
• Cognos Data Manager
• IBM Infosphere Information Server
• IBM Infosphere Warehouse Edition
• Adeptia Integration Server
• SAS Data Management
• Centerprise Data Integrator
• Syncsort DMX
• Sagent Data Flow
• QlikView Expressor
• SAP Data Services
• Elixir Repertoire for Data ETL
• SQL Server Integration Services (SSIS)
Selection of the ETL Tools for your Data
warehouse Tasks and its Importance
To experience successful ETL functioning, proper
selection of the ETL tools is essential.
The proper transfer of data between databases is
possible only when we have proper tools for the
process.
If the procedure is done with less appropriate
tools, then there will be a problem in the
functioning of the complete transfer process.
For proper functioning, you will have to
take care of the sub-processes and each tool used
in the sub-processes. Hence, care has to be
taken in the proper selection of the ETL tools.
ETL process
ETL (Extract, Transform and Load) is a process in data warehousing
responsible for pulling data out of the source systems and placing it into a data
warehouse. ETL involves the following tasks:
- extracting the data from source systems (SAP, ERP, other operational
systems); data from different source systems is converted into one consolidated
data warehouse format which is ready for transformation processing.
- transforming the data may involve the following tasks: applying business
rules (so-called derivations, e.g., calculating new measures and dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
filtering (e.g., selecting only certain columns to load), splitting a column into
multiple columns and vice versa, joining together data from multiple sources
(e.g., lookup, merge), transposing rows and columns, applying any kind of
simple or complex data validation (e.g., if the first 3 columns in a row are empty
then reject the row from processing)
- loading the data into a data warehouse or data repository other reporting
applications
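The three tasks above can be pictured as a tiny in-memory pipeline. This is only a sketch: the row layout, column names and mapping rules are illustrative assumptions, not a prescribed design.

```python
# Minimal ETL sketch: extract, transform, load over in-memory rows.

def extract():
    # Stand-in for pulling rows from operational source systems.
    return [
        {"id": 1, "sex": "M", "amount": None},
        {"id": 2, "sex": "Female", "amount": 250},
    ]

def transform(rows):
    # Business rules: unify sex codes, map NULL amounts to 0.
    sex_map = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}
    out = []
    for row in rows:
        out.append({
            "id": row["id"],
            "sex": sex_map.get(row["sex"], "Unknown"),
            "amount": row["amount"] if row["amount"] is not None else 0,
        })
    return out

def load(rows, warehouse):
    # Stand-in for a bulk INSERT into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

In a real system each step would read from and write to databases or staging files, but the control flow is the same.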
Extract
The main objective of the extract step is to retrieve all the required data from the
source system using as few resources as possible. The extract step should be designed
so that it does not negatively affect the source system in terms of performance,
response time or any kind of locking.
There are several ways to perform the extract:
Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.
Incremental extract - some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of such records. During further ETL steps, the system needs to
identify changes and propagate them down. Note that with a daily extract, we may not
be able to handle deleted records properly.
Full extract - some systems are not able to identify which data has been changed at all,
so a full extract is the only way one can get the data out of the system. The full extract
requires keeping a copy of the last extract in the same format in order to be able to
identify changes. Full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely
important, particularly for full extracts, where the data volumes can run to tens of gigabytes.
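One way to implement the full-extract strategy described above is to diff the current extract against the saved copy of the previous one, keyed by primary key. The record layout and the key name `id` are assumptions for illustration.

```python
# Change detection for a full extract: compare snapshots keyed by primary key.

def diff_extracts(previous, current):
    prev = {r["id"]: r for r in previous}
    curr = {r["id"]: r for r in current}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]   # new records
    deletes = [prev[k] for k in prev.keys() - curr.keys()]   # vanished records
    updates = [curr[k] for k in curr.keys() & prev.keys()    # changed records
               if curr[k] != prev[k]]
    return inserts, updates, deletes

previous = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
current = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
ins, upd, dele = diff_extracts(previous, current)
```

Note that, as the text says, this approach handles deletions, which a pure incremental extract may miss.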
Extraction
A staging area is required during ETL load. There are various reasons why
staging area is required.
The source systems may be available only for a specific period of time to extract data,
and this period may be shorter than the total data-load time. The staging area therefore
allows you to extract the data from the source system and keep it there until the
time slot ends.
A staging area is also required when you want to bring data from multiple data sources
together or join two or more systems. For example, you will not be able to perform a
SQL query joining two tables from two physically different databases.
Time slots for data extraction from different systems vary according to time zones and
operational hours.
Data extracted from source systems can be used in multiple data warehouse
systems, operational data stores, etc.
ETL allows you to perform complex transformations, which requires extra space to
store the data.
Clean
The cleaning step is one of the most important as it
ensures the quality of the data in the data warehouse.
Cleaning should perform basic data unification rules, such
as:
Making identifiers unique (sex categories
Male/Female/Unknown, M/F/null, Man/Woman/Not
Available are translated to standard
Male/Female/Unknown)
Converting null values into a standardized Not Available/Not
Provided value
Converting phone numbers and ZIP codes to a standardized form
Validating address fields and converting them into proper naming,
e.g. Street/St/St./Str./Str
Validating address fields against each other (State/Country,
City/State, City/ZIP code, City/Street).
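The unification rules above might be sketched as small cleaning functions. The category spellings and the 5-digit ZIP format are assumptions for the example.

```python
# Cleaning sketch: unify sex categories and standardize ZIP codes.

SEX_MAP = {
    "m": "Male", "male": "Male", "man": "Male",
    "f": "Female", "female": "Female", "woman": "Female",
}

def clean_sex(value):
    # Map any recognized spelling to Male/Female; everything else to Unknown.
    if value is None:
        return "Unknown"
    return SEX_MAP.get(value.strip().lower(), "Unknown")

def clean_zip(value):
    # Keep digits only, left-pad to 5 digits (US-style, an assumption).
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits.zfill(5)[:5]
```

In practice these rules live in a lookup table or a tool's transformation layer rather than hard-coded dictionaries, but the logic is the same.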
Transform
The transform step applies a set of rules to transform the data from the
source to the target. This includes converting any measured data to the
same dimension (i.e. conformed dimension) using the same units so
that they can later be joined. The transformation step also requires
joining data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying
advanced validation rules.
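Two of the transform tasks named above, generating surrogate keys and deriving new calculated values, can be sketched as follows. The column names (`customer_id`, `revenue`, `cost`) are invented for the example.

```python
import itertools

# Sketch: surrogate key generation plus a derived measure during transform.

_key_seq = itertools.count(1)
_key_cache = {}

def surrogate_key(natural_key):
    # The same natural key always maps to the same warehouse surrogate key.
    if natural_key not in _key_cache:
        _key_cache[natural_key] = next(_key_seq)
    return _key_cache[natural_key]

def derive(row):
    row = dict(row)
    row["customer_sk"] = surrogate_key(row.pop("customer_id"))
    row["margin"] = row["revenue"] - row["cost"]   # derived calculated value
    return row

first = derive({"customer_id": "C42", "revenue": 100, "cost": 60})
second = derive({"customer_id": "C42", "revenue": 10, "cost": 5})
```

Real implementations keep the natural-to-surrogate key mapping in a database table so it survives between ETL runs; the in-memory cache here only illustrates the idea.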
Load
During the load step, it is necessary to ensure that the load is
performed correctly and with as few resources as possible. The target
of the load process is often a database. To make the load
process efficient, it is helpful to disable any constraints and indexes
before the load and re-enable them only after the load completes.
Referential integrity needs to be maintained by the ETL tool to ensure
consistency.
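The drop-index-then-bulk-insert pattern can be demonstrated with Python's built-in sqlite3 module. The table and index names are invented for the example; the same idea applies to any relational target.

```python
import sqlite3

# Efficient load sketch: drop the index, bulk-insert, then rebuild it once.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
con.execute("CREATE INDEX ix_sales_id ON fact_sales (id)")

rows = [(i, i * 1.5) for i in range(1000)]

con.execute("DROP INDEX ix_sales_id")                  # cheaper inserts
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
con.execute("CREATE INDEX ix_sales_id ON fact_sales (id)")  # rebuild once
con.commit()

count = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Rebuilding the index once after the load is usually cheaper than maintaining it row by row during the insert.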
Managing ETL Process
The ETL process seems quite straightforward. As with every application, there is a
possibility that the ETL process will fail. This can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process with fail-recovery
in mind.
Staging
It should be possible to restart, at least, some of the phases independently from the
others and this can be ensured by implementing proper staging.
Staging means that the data is simply dumped to the location (called the Staging Area) so
that it can then be read by the next processing phase.
The staging area is also used during the ETL process to store intermediate results of
processing. However, the staging area should be accessed by the ETL load process
only.
It should never be available to anyone else, particularly not to end users, as it is not
intended for data presentation and may contain incomplete or in-process data.
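The restartability idea above can be sketched with file-based staging: each phase dumps its output to the staging area so the next phase can be rerun independently. The file names and JSON format are assumptions for the example.

```python
import json
import os
import tempfile

# Staging sketch: each phase dumps its result so later phases can restart.

staging_dir = tempfile.mkdtemp()

def stage_write(phase, rows):
    # Dump a phase's output to the staging area.
    with open(os.path.join(staging_dir, f"{phase}.json"), "w") as f:
        json.dump(rows, f)

def stage_read(phase):
    # A later phase (or a restarted run) picks the data up from here.
    with open(os.path.join(staging_dir, f"{phase}.json")) as f:
        return json.load(f)

stage_write("extract", [{"id": 1, "amount": 10}])
rows = stage_read("extract")   # transform phase resumes from the dump
```

If the transform phase fails, it can be restarted from the extract dump without re-hitting the source systems.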
ETL Tool Implementation
When you are about to use an ETL tool, there is a fundamental decision to
be made:
Will the company build its own data transformation tool or will it use an
existing tool?
Building your own data transformation tool (usually a set of shell scripts) is
the preferred approach for a small number of data sources which reside in
storage of the same type. The reason is that the effort to implement the
necessary transformations is small, thanks to the similar data structures and
common system architecture.
There are many ready-to-use ETL tools on the market. The main benefit of
using off-the-shelf ETL tools is the fact that they are optimized for the ETL
process by providing connectors to common data sources like databases, flat
files, mainframe systems, XML, etc. They provide a means to implement data
transformations easily and consistently across various data sources.
The tools also support transformation scheduling, version control,
monitoring and unified metadata management. Some of the ETL tools are
even integrated with BI tools.
The most well known commercial tools are
Ab Initio,
IBM InfoSphere DataStage,
Informatica,
Oracle Data Integrator and
SAP Data Integrator.
There are several open source ETL tools, among others
Apatar, CloverETL, Pentaho and Talend.
Popular Business Intelligence Tools
Oracle Enterprise BI Server
SAP Business Objects Enterprise
SAP NetWeaver BI
SAS Enterprise BI Server
Microsoft BI platform
IBM Cognos Series 8
Board Management Intelligence Toolkit
BizzScore Suite
WebFocus
QlikView
MicroStrategy
Oracle Hyperion System
Actuate
Data Warehouse Architecture
Different data warehousing systems have
different structures. Some may have an
ODS (operational data store), while some
may have multiple data marts. Some may
have a small number of data sources, while
some may have dozens of data sources.
In view of this, it is far more reasonable to
present the different layers of a data
warehouse architecture rather than
discussing the specifics of any one system.
Data Warehouse Architecture
In general, all data warehouse systems have the following
layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer
Data Source Layer
This represents the different data sources that feed data into the
data warehouse. A data source can be of any format: plain text
files, relational databases, other types of databases, Excel files,
etc. can all act as data sources.
Many different types of data can be a data source:
Operations -- such as sales data, HR data, product data,
inventory data, marketing data, systems data.
ERP (SAP) and CRM (SalesForce.com) Systems
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data,
or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the
data warehouse system. There is likely some
minimal data cleansing, but there is unlikely
any major data transformation.
Staging Area
This is where data sits prior to being
scrubbed and transformed into a data
warehouse / data mart. Having one common
area makes it easier for subsequent data
processing / integration.
ETL Layer
This is where data gains its "intelligence", as
logic is applied to transform the data from a
transactional nature to an analytical nature.
This layer is also where data cleansing
happens.
The ETL design phase is often the most time-
consuming phase in a data warehousing
project, and an ETL tool is often used in this
layer.
Popular ETL Tools
IBM WebSphere Information Integration (Ascential
DataStage)
Ab Initio
Informatica
Data Storage Layer
This is where the transformed and cleansed
data sit. Based on scope and functionality, 3
types of entities can be found here: the data
warehouse, the data mart, and the operational
data store (ODS).
In any given system, you may have just one of
the three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored.
Business rules stored here do not affect the underlying data
transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users.
This can be in a form of a tabular / graphical report
in a browser, an emailed report that gets
automatically generated and sent everyday, or an
alert that warns users of exceptions, among others.
Usually an OLAP tool and/or a reporting tool is
used in this layer.
OLAP
OLAP stands for On-Line Analytical Processing. The first
attempt to provide a definition of OLAP was by Dr. Codd, who
proposed 12 rules for OLAP.
Online analytical processing, or OLAP, is an approach to
answering multi-dimensional analytical queries quickly.
OLAP tools such as Cubes and the Oracle BI suite help users
analyze multidimensional data effectively from multiple
perspectives.
OLAP consists of three basic operations: roll-up, drill-down,
and slicing and dicing.
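The three OLAP operations can be illustrated on a toy cube held as a mapping from (region, year, product) to sales; the dimensions and figures are invented for the example.

```python
from collections import defaultdict

# Toy cube: (region, year, product) -> sales value.
cube = {
    ("East", 2023, "A"): 10, ("East", 2023, "B"): 5,
    ("West", 2023, "A"): 7,  ("West", 2024, "A"): 9,
}

def roll_up(cube, axis):
    # Aggregate away one dimension (0=region, 1=year, 2=product).
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

def slice_cube(cube, axis, member):
    # Fix one dimension to a single member (a "slice").
    return {k: v for k, v in cube.items() if k[axis] == member}

by_region_product = roll_up(cube, 1)       # sum sales over all years
east_only = slice_cube(cube, 0, "East")    # keep only the East region
```

Drill-down is simply the inverse of roll-up: returning to the finer-grained cube from which the aggregate was computed.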
Metadata Layer
This is where information about the data stored in the data warehouse
system is stored.
A logical data model would be an example of something that's in the
metadata layer.
A metadata tool is often used to manage metadata (typically stored as XML).
System Operations Layer
This layer includes information on how the data warehouse system
operates, such as ETL job status, system performance, and user
access history.
Data Warehousing - Various Data Warehouse
Design Approaches: Top-Down and Bottom-Up
Data warehouse design approaches are a very important
aspect of building a data warehouse.
Selecting the right design approach can save a lot of
project time and cost.
There are two different Data Warehouse Design Approaches
normally followed when designing a Data Warehouse
solution and based on the requirements of your project you
can choose which one suits your particular scenario.
These methodologies are a result of research from Bill
Inmon and Ralph Kimball.
Data Warehousing - Various Data Warehouse
Design Approaches: Top-Down and Bottom-Up
In a nutshell, here are the two approaches: in Bill Inmon’s
enterprise data warehouse approach (the top-down
design), a normalised data model is designed first, then
the dimensional data marts, which contain data required
for specific business processes or specific departments, are
created from the data warehouse. In Ralph Kimball’s
dimensional design approach (the bottom-up design), the
data marts facilitating reports and analysis are created
first; these are then combined together to create a broad
data warehouse.
Inmon’s top-down approach
Inmon defines a data warehouse as a centralised
repository for the entire enterprise. A data
warehouse stores the “atomic” data at the lowest
level of detail. Dimensional data marts are
created only after the complete data warehouse
has been created. Thus, the data warehouse is at
the centre of the corporate information factory
(CIF), which provides a logical framework
for delivering business intelligence.
Kimball’s bottom-up approach
Keeping in mind the most important business aspects or
departments, data marts are created first. These provide a
thin view into the organisational data and, as and when
required, these can be combined into a larger data
warehouse. Kimball defines data warehouse as “a copy of
transaction data specifically structured for query and
analysis”.
Kimball’s data warehousing architecture is also known as
data warehouse bus (BUS). Dimensional modelling
focuses on ease of end-user accessibility and provides a
high level of performance to the data warehouse.
Ralph Kimball –
Bottom-up Data Warehouse Design Approach
Pros and cons of both approaches
Real-Time Data Warehousing
Real-time (active) data warehousing
The process of loading and
providing data via a data warehouse
as they become available
Active data warehousing is the technical
ability to capture transactions when they
change and to integrate them into the
warehouse, along with maintaining batch
or scheduled cycle refreshes. Real-time
data warehousing describes a system that
reflects the state of the warehouse in
real time.
What is Real Time data-warehousing?
The combination of real-time activity and data
warehousing is called real-time data warehousing. Activity
that happens at the current time is known as real-time
activity, and its data is available as soon as the activity
completes. In real-time data warehousing, business activity
data is captured as it occurs and flows into the data
warehouse immediately after the activity finishes, where it
is available instantly. Real-time data warehousing can thus
be viewed and utilized as a framework for retrieving
information from data as soon as the data becomes
available.
Definition
Active Data Warehousing is the technical ability to capture
transactions when they change, and integrate them into the
warehouse, along with maintaining batch or scheduled cycle
refreshes. An active data warehouse offers the possibility of
automating routine tasks and decisions. The active data warehouse
exports decisions automatically to the On-Line Transaction
Processing (OLTP) systems.
Real-time Data Warehousing describes a system that reflects the
state of the warehouse in real time.
Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities
4. Operationalization
5. Becomes capable of making events
happen