DataWare House
Introduction
Introduction
Lecturer: Saba Ghani
Since July, 2019
Department of Computer Science
MSCS-FAST NUCES
BSCS-LCWU Lahore
Email: [Link]@[Link]
Class honor code
Google classroom code
otnelmt
PREREQUISITES
DATA BASE
This course requires that you should be proficient with
Data Bases concepts
5
Books
Reference books
W. H. Inmon, Building the Data Warehouse
(Third Edition), John Wiley & Sons Inc., NY.
R. Kimball, The Data Warehouse Toolkit
(Second Edition), John Wiley & Sons Inc., NY.
Paulraj Ponniah, Data Warehousing Fundamentals,
John Wiley & Sons Inc., NY.
6
Objective
Understand the desperate need for strategic
information
Recognize the information crisis at every enterprise
Distinguish between operational and informational
systems
“Drowning in data and starving for
information”
Knowledge is power, Intelligence is
absolute power!
8
.
9
NEEDS
Operational databases (online transaction
processing systems or OLTP), are not
suitable for data analysis
Contain current and detailed data
Do not include historical data
Perform poorly for complex queries due to
normalization
Problems with the
Naturally Evolving
Architecture
The naturally evolving architecture
presents many challenges, such as:
• Data credibility
• Productivity
• Inability to transform data into information
Data credibility
1. No time basis of data
2. The algorithmic
differential of data
3. The levels of extraction
4. The problem of external data
5. No common source of data from the
beginning
Productivity
To produce a corporate report, many files
and layouts of data must be analyzed
Report-generation program should be
simple to write, retrieving the data for the
report is tedious
cross every technology that the company uses
Data to Information
Data Integration
Historical data
The systems found in the naturally evolving architecture
are simply inadequate for supporting information needs.
NEEDS
Organization stored Data in database
Never use of this data for business
improvement
Manager need information to
Formulate the business strategies
Establish goals
Set objectives
Monitor results
Historical overview: Crisis of
Credibility
Businesses demand Intelligence (BI).
Complex questions from integrated
data.
“Intelligent Enterprise
16
Historical overview: Crisis of
Credibility
List of all items that were sold last
month?
List of all items purchased by David?
The total sales of the last month of
each branch?
How many sales transactions
occurred during the month of
January?
17
Intelligent Enterprise
1. Which items sell together? Which items to
stock?
2. Retain the present customer base
3. Improve product quality levels in top five
product groups.
4. Gain market share by 10% in the next 3 years
5. Bring three new products to market in 2 years
6. Increase sales by 15% in the North East
Division
Needs
For business growth needs
In Depth information knowledge of their company’s
operations
Review and monitor key performance indicators
Factor affect one another
How business factors change over time
compare their company’s performance
Strategic information
Strategic information - Types of information needed
to make decisions in the formulation and execution of
business strategies
Organization needs to extract these information for
business growth
Its important for
continued health and business survival
Critical business decisions depend on the availability of
proper strategic information in an enterprise.
Characteristics
Characteristics of strategic information are
INTEGRATED:
Single, enterprise-wide view
DATA INTEGRITY
Information must be accurate and must conform to business rules.
ACCESSIBLE
Easily accessible, and responsive for analysis.
CREDIBLE
Business factor must have one and only one value
TIMELY
Information must be available within the stipulated time frame.
The Information Crisis
Two facts are
[Link] have lots of data
[Link] technology resources and systems are not effective at
turning all that data into useful strategic information
Information double every 18 month
Information crisis
the available data is not readily usable for strategic decision making
An enterprise can have many types of incompatible structures and systems
All information needed to be integrate from all systems for proper decision
making
Data needed for strategic decision making must be in a format suitable for
easy analysis
OPERATIONAL VERSUS DECISION-SUPPORT
SYSTEMS
Making the Wheels of Business Turn
• Operational systems are online transaction
processing (OLTP) systems used to run the
day-to-day business
• bread-and-butter systems.
Watching the Wheels of Business Turn
• watch how the business runs, and make
strategic decisions to improve the business
Different Scope, Different Purposes
• strategic information need to build informational systems that are
different from the operational systems
OPERATIONAL VERSUS DECISION-SUPPORT
SYSTEMS
OPERATIONAL INFORMATIONAL
Data Content Current values Archived, derived,
summarized
Data Structure Optimized for Optimized for complex
transactions queries
Access Frequency High Medium to low
Access Type Read, update, delete Read
Usage Predictable, repetitive Ad hoc, random,
heuristic
Response Time Sub-seconds Several seconds to min
Users Large number Relatively small number
OPERATIONAL VERSUS DECISION-SUPPORT
SYSTEMS
A complete repository of historical corporate data extracted from
transaction systems that is available for ad-hoc access by
knowledge workers
What is a Data Warehouse?
Transaction System
Management Information System (MIS)
Could be typed sheets (NOT transaction system)
Ad-Hoc access
Dose not have a certain access pattern.
Queries not known in advance.
Difficult to write SQL in advance.
Knowledge workers
Typically NOT IT literate (Executives, Analysts, Managers).
NOT clerical workers.
Decision makers
26
What is a Data Warehouse?
An Alternative Viewpoint
“A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.”
27
What is a Data Warehouse ?
It is a blend of many technologies, the basic concept being:
Take all data from different operational systems.
If necessary, add relevant data from industry.
Transform all data and bring into a uniform format.
Integrate all data as a single entity.
Store data in a format supporting easy access for decision support.
Create performance enhancing indices.
Implement performance enhancement joins.
Run ad-hoc queries with low selectivity.
28
What is a Data Warehouse ?
29
Types of data warehouse
Financial
Telecommunication
Insurance
Human Resource
Global
Exploratory
30
How is it Different?
Decision making is Ad-Hoc
31
How is it Different?
Combines operational and historical data.
Don’t do data entry into a DWH
OLTP or ERP are the source systems.
OLTP systems don’t keep history, cant get balance statement more than a year
old.
DWH keep historical data, even of bygone customers. Why?
In the context of bank, want to know why the customer left?
What were the events that led to his/her leaving? Why?
Customer retention.
32
How is it Different?
33
How much history?
Depends on:
Industry.
Cost of storing historical data.
Economic value of historical data.
Industries and history
Telecomm calls are much more as compared to bank transactions- 18 months.
Retailers interested in analyzing yearly seasonal patterns- 65 weeks.
Insurance companies want to do actuary analysis, use the historical data in order to
predict risk- 7 years.
Hence, NOT a complete repository of data
34
How much history?
Economic value of data
Vs.
Storage cost
Data Warehouse a
complete repository of data?
35
How is it Different?
Usually (but not always) periodic or batch updates rather than real-time.
The boundary is blurring for active data warehousing.
For an ATM, if update not in real-time, then lot of real trouble.
DWH is for strategic decision making based on historical
data. Wont hurt if transactions of last one hour/day are absent.
Rate of update depends on:
volume of data,
nature of business,
cost of keeping historical data,
benefit of keeping historical data.
36
How is it Different?
Does not follows the traditional development model
37
Data Warehouse Vs. OLTP
38
Data Warehouse Vs. OLTP
39
Comparison of Response Times
On-line analytical processing (OLAP) queries must be executed in a
small number of seconds.
Often requires denormalization and/or sampling.
Complex query scripts and large list selections can generally be
executed in a small number of minutes.
Sophisticated clustering algorithms (e.g., data mining) can
generally be executed in a small number of hours (even for
hundreds of thousands of customers).
40
Putting the pieces together
41
Why is this hard?
Data sources are unstructured & heterogeneous.
Requirements are always changing.
Most computer scientist trained on OTLP systems, those
concepts not valid for DSS.
Performance impacts are often non-linear O(n) Vs. O(nlog_n) e.g.
scanning vs. indexing.
Complex computer/database architectures.
Rapidly changing product characteristics.
42
High-level Implementation Steps
Phase-I
1. Determine Users' Needs
2. Determine DBMS Server Platform
3. Determine Hardware Platform
4. Information & Data Modeling
5. Construct Metadata Repository
Phase-II
6. Data Acquisition & Cleansing
7. Data Transform, Transport & Populate
8. Determine Middleware Connectivity
9. Prototyping, Querying & Reporting
10. Data Mining
11. On Line Analytical Processing (OLAP)
Phase-III
12. Deployment & System Management
43