Lecture # 1-2
Dataware House book
Book:
The Data Warehouse Toolkit,
by
Ralph Kimball,
2013
Book:
Building the Data Warehouse
W. H. Inmon
Fourth Edition
John Wiley & Sons.
2005.
A producer wants to know….
Which are our
lowest/highest margin
customers ?
Who are my customers
What is the most and what products
effective distribution are they buying?
channel?
What product prom- Which customers
-otions have the biggest are most likely to go
impact on revenue? to the competition ?
What impact will
new products/services
have on revenue
and margins?
Introduction – Decision Support System (DSS)
A Decision Support System (DSS) is an interactive computer-
based system or subsystem intended to help decision makers use
communications technologies, data, documents, knowledge
and/or models to identify and solve problems, complete decision
process tasks, and make decisions.
It is clear that DSS belong to an environment with
multidisciplinary foundations, including (but not exclusively):
– Database research,
– Artificial intelligence,
– Human-computer interaction,
– Simulation methods,
– Software engineering, and
– Telecommunications.
Introduction – Decision Support System (DSS)
• DSSs serve the management, operations, and planning
levels of an organization (usually mid and higher
management) and help to make decisions, which may
be rapidly changing and not easily specified in advance
(Unstructured and Semi-Structured decision problems).
• Decision support systems can be either fully
computerized, human or a combination of both.
Typical DSS Architecture
TPS: transaction
processing system
MODEL: representation DSS DATA EXTERNAL
of a problem TPS DATA
BASE
OLAP: on-line analytical
processing
USER INTERFACE:
how user enters problem DSS SOFTWARE
& receives answers
SYSTEM
USER
MODELS
DSS DATABASE: INTERFACE
current data from OLAP TOOLS
applications or groups
DATA MININGTOOLS
DATA MINING:
technology for finding
relationships in large data USER
bases for prediction
6
Why we uses DSS?
Increasing complexity of decisions
– Technology
– Information:
“Data, data everywhere, and not the time to think!”
– Number and complexity of options
– Pace of change
Increasing availability of computerized support
– Inexpensive high-powered computing
– Better software
– More efficient software development process
Increasing usability of computers
7
Operational Database
Operational database management systems are used to
manage dynamic data in real-time.
These types of databases allow you to do more than simply
view archived data. Operational databases allows to modify
that data (add, change or delete data), doing it in real-time.
Operational databases are increasingly supporting
distributed database architecture that provides high
availability and fault tolerance through replication and scale
out ability.
Data Warehousing brief History
• The concept of data warehousing dates back to the late 1980s
when IBM researchers Barry Devlin and Paul Murphy
developed the "business data warehouse".
• 1960s - General Mills and Dartmouth College, in a joint
research project, develop the terms dimensions and facts.
• 1970s - ACNielsen and IRI provide dimensional data marts for
retail sales.
• 1983 – Tera data introduces a database management system
specifically designed for decision support.
• 1988 - Barry Devlin and Paul Murphy publish the article An
architecture for a business and information systems in IBM
Systems Journal where they introduce the term "business data
warehouse".
Data warehouse Introduction
10
Data warehouse Introduction
Subject
“Data Warehouse is a Oriented
subject oriented,
integrated, time-
variant and non-
Non- Data
volatile collection of volatile Warehouse
Integrated
data in support of
management’s decision
making process.” – W.
H. Inmon Time
Variant
Data warehouse Usage
Three kinds of data warehouse applications
– Information processing
supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
– Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools.
Differences among the three tasks
12
Data warehouse: Subject Oriented
Organized around major subjects, such as customer, product,
sales.
Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
13
Data warehouse: Subject Oriented
Data
Operational
Warehouse
Leads Prospects Customers Products
Quotes Regions Time
Orders
14
Data warehouse: Integrated
Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
15
Data warehouse: Time Varying
The time horizon for the data warehouse is significantly longer
than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”.
16
Data warehouse: Time Varying
Data
Operational
Warehouse
Current Value data Snapshot data
• time horizon : 60-90 days • time horizon : 5-10 years
•data warehouse stores historical
data
17
Data warehouse: Non-Volatile
A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
initial loading of data and access of data.
18
Data warehouse: Non-Volatile
insert change
Operational Data
Warehouse
insert
delete
load
read only
access
replace
change
19
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from
one form to other
What Is a Data Warehouse?
“A subject-oriented, integrated, time-variant, non-
volatile collection of data in support of
management’s decision making process.”
W.H. Inmon
“The data warehouse is where we publish used
data.”
Ralph Kimball
6
Difference between Database and data warehouse
FEATURES DATABASE DATAWAREHOUSE
Characteristic It is based on Operational Processing. It is based on Informational Processing.
Data It mainly stores the Current data which It usually stores the Historical data whose
always guaranteed to be up-to-date. accuracy is maintained over time.
Function It is used for day-to-day operations. It is used for long-term informational
requirements and decision support.
User The common users are clerk, DBA, The common users are knowledge worker
database professional. (e.g., manager, executive, analyst)
Unit of work Its work consists of short and simple The operations on it consists of complex
transaction. queries..
Focus The focus is on “Data IN” The focus is on “Information OUT”
Orientation The orientation is on Transaction. The orientation is on Analysis.
DB design The designing of database is ER based The designing is done using star/snowflake
and application-oriented. schema and its subject-oriented.
Summarization The data is primitive and highly The data is summarized and in consolidated
detailed. form.
View The view of the data is flat relational. The view of the data is multidimensional.
22
Difference between Database and data warehouse
FEATURES DATABASE DATAWAREHOUSE
Function It is used for day-to-day operations. It is used for long-term informational
requirements and decision support.
User The common users are clerk, DBA, The common users are knowledge worker
database professional. (e.g., manager, executive, analyst)
Access The most frequent type of access type is It mostly use the read access for the
read/write. stored data.
Operations The main operation is index/hash on For any operation it needs a lot of scans.
primary key.
Number of A few tens of records. A bunch of millions of records.
records accessed
Number of users In order of thousands. In the order of hundreds only.
DB size 100 MB to GB. 100 GB to TB.
Priority High performance, high availability High flexibility, end-user autonomy
Metric To measure the efficiency, transaction To measure the efficiency, query
throughput is measured. throughput and response time is
measured.
23
Evolution in organizational use of data warehouses
Organizations generally start off with relatively simple use of data
warehousing. Over time, more sophisticated use of data warehousing evolves.
The following general stages of use of the data warehouse can be
distinguished:
• Off line Operational Database
–Data warehouses in this initial stage are developed by simply copying the
data off an operational system to another server where the processing load
of reporting against the copied data does not impact the operational
system's performance.
• Off line Data Warehouse
–Data warehouses at this stage are updated from data in the operational
systems on a regular basis and the data warehouse data is stored in a data
structure designed to facilitate reporting.
• Real Time Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction (e.g. an order or a delivery or a booking.)
• Integrated Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction. The data warehouses then generate
transactions that are passed back into the operational systems.
Data Warehouse Architecture
Client Client
Query & Analysis
Metadata Warehouse
Integration
Source Source
Source
Why a Warehouse?
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
Source Source
The Traditional Research Approach
Query-driven (lazy, on-demand)
Clients
Integration System Metadata
...
Wrapper Wrapper Wrapper
...
Source Source Source
Disadvantages of Query-Driven
Approach
Delay in query processing
Slow or unavailable information sources
Complex filtering and integration
Inefficient and potentially expensive for frequent
queries
Competes with local processing at sources
Hasn’t caught on in industry
The Warehousing Approach
Information Clients
integrated in
advance Data
Warehouse
Stored in wh
for direct
Integration System Metadata
querying and
analysis ...
Extractor/ Extractor/ Extractor/
Monitor Monitor Monitor
...
Source Source Source
Advantages of Warehousing Approach
High query performance
But not necessarily most current information
Doesn’t interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing
Has caught on in industry
Not Either-Or Decision
Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large numbers of
sources
Clients with unpredictable needs
Data Warehouse? A Practitioners Viewpoint
“A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of
sources and made available to end users in a way they
can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Data Warehouse Architectures: Conceptual View
Operational Informational
Single-layer systems systems
Every data element is stored once only
“Real-time data”
Virtual warehouse
Two-layer
Real-time + derived data Operational Informational
systems systems
Most commonly used approach in
industry today
Derived Data
Real-time data
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data
really requires two steps
Operational Informational
systems systems
View level
“Particular informational
Derived Data
needs”
Physical Implementation
Reconciled Data
of the Data Warehouse
Real-time data
Data Warehousing: Two Distinct
Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
Both rich research areas
Industry has focused on (2)
Issues in Data Warehousing
Warehouse Design
Extraction
Wrappers, monitors (change detectors)
Integration
Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites
OLAP: On Line Analytical Processing
Describes processing at warehouse
Warehouse is a Specialized DB
Standard DB (OLTP) Warehouse (OLAP)
Mostly updates Mostly reads
Many small transactions Queries are long and
Mb - Gb of data complex
Current snapshot Gb - Tb of data
Index/hash on p.k. History
Raw data Lots of scans
Thousands of users (e.g., Summarized, reconciled
clerical users) data
Hundreds of users (e.g.,
decision-makers, analysts)
DATA WAREHOUSES
Data Warehouses:
Data spread in several databases –
physically located at numerous sites
Data warehouse – repository of multiple
DBs in single schema; resides at single site.
Data warehousing processes
1.Data Cleaning 2. Data Integration 3. Data
Transformation
4. Data Loading 5. Periodic data refreshing
Data warehouse diagram
Data cleaning:-Data Cleaning includes,
filling in missing values, smoothing noisy
data, identifying or removing outliers, and
resolving inconsistencies.
Data integration:-Data Integration includes
integration of multiple databases, data
cubes, or files.
Data transformation:-Convert data from
legacy or host format to warehouse format.
Load :-sort;summarize,consolidate;compute
views; check integrity.Build indices and
partitions.
Refresh:-Propagates the update from data
sources to the warehouse.
Data in a data warehouse are organized
around major subjects.
Data provide information on historical
perspective – summarized on periodic
dimension.
Eg. Sales of an item for a region in a period
Data warehouse model – multidimensional
database structure / data cube
Dimensions – Attributes / set of attributes
Facts – Aggregated measures (Count /
Sales amount)
Facts about data warehousing:-
Issues involved in warehousing include
techniques for dealing with errors and
techniques for efficient storage and
indexing of large volumes of data.
This system is used for reporting and data
analysis.
It usually contains historical data derived
from transaction data.
Data warehousing is not meant for current
“live”data.
Components of a data warehouse
• Sources –Data source interaction
• Data Transformation
• Data warehouse (data storage )
• Reporting (Data presentation )
• Metadata
Tiers of data warehouse architecture
Top tier:
Middle tier
Bottom tier
Data
warehouse
server
Backend tools
fig:- A three tier data warehousing
1)Bottom tier:-The bottom tier is a warehouse database
server that is always a relational database system.
Back-end tools and utilities are used to feed data into the
bottom tier from operational databases or other external
sources. These tools and utilities perform data
extraction,cleaning and transformation as well as load and
refresh functions to update the data warehouse.
The date extracted using application program
interfaces known as gateways.
Example of gateways are ODBC(open database
connection)and OLEDB(Open Linking and embedding for
database) by microsoft and jdbc(java database
connecton).
This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2.)Middle tier:- The middle tier is an OLAP server
that is typically implemented using either:-
a) A relational OLAP (ROLAP) model that is,an
extended relation DBMS that maps
operations.Intermediate server b/w relational
back-end server and client front end tools.
b) A multidimentional OLAP (MOLAP) model that is,
a special purpose server that directly
implements multidimentional data and
operations. Supports multidimention views.
3.)Top tier:-The top tier is a front –end client layer
,which contains query and reporting tools ,analysis
tools,and or data mining tools.
Note:-
OLAP – Online Analytical Processing:
This is the major task of Data Warehousing
System.
Useful for complex data analysis and
decision making.
Market oriented –used by
managers,executives and data analyst.
Needs for Data Warehousing
Better business intelligence for end-users
Reduction in time to locate, access, and analyze information
Consolidation of disparate information sources
Strategic advantage over competitors
Faster time-to-market for products and services
Replacement of older, less-responsive decision support systems
Reduction in demand on IS to generate reports
51
Why separate Data Warehouse?
High performance for both systems
– DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
– Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
– missing data: Decision support requires historical data
which operational DBs do not typically maintain
– data consolidation: DS requires consolidation
(aggregation, summarization) of data from heterogeneous
sources
– data quality: different sources typically use inconsistent
data representations, codes and formats which have to be
reconciled
Applications of Data Warehouse
Industry Application
Finance Credit card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record Analysis
Transport Logistics management
Consumer goods Promotion Analysis
Popular Tools of Data Warehouse
Tools Company
Infomix IBM
Oracle Warehouse Builder ORACLE
Microsoft SQL Server Microsoft
Integration
for your attention!