Lecture4 - DATA WAREHOUSING PDF
Lecture4 - DATA WAREHOUSING PDF
Chapter 8
• Data Origin
– Internal
Traditional Data Management – External
– Personal
1
Data for Decision Support
• Databases
– Relational
– Hierarchical & Network OLAP & Data Warehousing
– Object-Oriented
– Multimedia & Document Systems
• Data Problems
– Incorrect data
– Data not timely
Databases for Decision Support
– Data not measured or indexed properly
– Needed data do not exist
2
Data Analysis Limitations The Problem: OLTP
• Consider a typical business analysis problem: • OLTP = On-Line Transaction Processing
– Find the share of total sales represented by each
product in different markets, categories, and periods, – OLTP is the basic architecture underlying
compared with the same period a year ago. relational database-based transaction processing
• To do so, you would calculate the percentage systems.
each number is of the total of its column, a – Data is broken up into relational tables, with
simple and common concept. relationships described using foreign keys.
• However, in a classic relational database these – OLTP is optimized for the entry and retrieval of
calculations and display would require definition individual pieces of fact.
of a separate view, requiring complex SQL – OLTP was not designed to support large-scale
commands. data analysis.
3
The Data Warehouse The Data Warehouse
• A managed database in which the data is: – Time-variant:
– Subject-oriented: • Contains informational data (historic data along a
time-axis) rather than operational data
• Designed for decision-support rather than for TPS.
(momentary snapshot).
– Integrated: – Aggregated:
• Contains data from different legacy & DB-systems.
• Data exists at different levels of granularity
– Nonvolatile: • to support management needs
• New data is integrated with existing data, rather
than replacing it.
4
Why do we need a Data Warehouse? Data Warehousing & Legacy Systems
5
Operational Units
• Operational data stores (ODS):
– A type of database often used as an interim
area for a data warehouse. It includes short- Data Flows in the Data Warehouse
term data from multiple sources in a near
real-time manner to support short-term
decision making
• Oper marts: Putting the Data Warehouse together
– An operational data mart, which allows for
multi-dimensional data analysis of ODS data.
6
Inflow isn’t easy... Inflow isn’t easy...
• Because of incorrect, contaminated, or just plain
missing data, the migration phase to a data
warehouse can dominate the entire project! Source 1: 9-12-2009
Source 2: Sep-12-09 09/12/2009
• Integrating data about a single entity from Source 3: September 12, 2009
multiple sources is often difficult.
• Automated tools are available to help out:
Source 1: IBM
– Data Migration tools for data extraction Source 2: I.B.M. ?
– Data Scrubbing tools for cleaning up data. Source 3: IBM Corporation
– Data Auditing tools for patterns and structures
contained within the data.
7
Inflow By Any Other Name… Data Flows in a Data Warehouse
• Enterprise application integration (EAI) • Upflow
– A technology that provides a vehicle for – Highly detailed data Heavily Summarized
pushing data from source systems into a data is aggregated,
warehouse summarized, and
otherwise processed Lightly
• Enterprise information integration (EII) in the warehouse to Summarized
– A tool space that provides real-time data make it easier for
integration from various sources, such as users to get quick
relational databases, Web services, and responses to their
queries. Current
multidimensional databases
Detail
8
Data Flows in a Data Warehouse Data Flows in a Data Warehouse
• The reason for downflow: data loses business Monthly sales by product
Heavily Summarized
value over time! line 2001-2011
Legacy
metadata - data about the data. Select Metadata Data
Extract Reports Mart
Data Sources Data Transform Query Tools
OLTP
Storage Users Integrate Enterprise
Risk
Mgmt
Maintain
Data Warehouse
Preparation
System Regulate & Business External Data
OLAP Tools
Modeling Synthesize Modeling Mart
Engineering Web Browsers
METADATA
9
Data Warehouse Architectures Data Warehouse Architectures
10
The Star Schema
• A currently popular data architecture for data
warehouses is the Star Schema.
Data in the Data Warehouse – Simple structure with relatively few tables.
– Data is not fully normalized as in OLTP.
– Star design is easily understood by analysts
and end users, especially those without much
The Star Schema database familiarity.
– The star design eases data analysis: more
analysis possibilities and simpler queries.
11
Nine Decisions in Data Design Estimating Data Warehouse Size
• Choose the subject matter Grocery Store Star Schema
• Decide what a fact table represents
Time Dimension Product Dimension
Fact Table
• Identify and conform the dimensions
Time-ID Product-ID
• Choose the facts Day ID
Time-ID SKU ID
• Store pre-calculations in the fact table Week ID
Month ID
Store-ID Package ID
Brand ID
• Round out the dimension tables Quarter ID Product-ID
Category ID
Year ID Promotion-ID
• Choose the duration of the database Sales Facts
Subcategory ID
12
Data Warehouse Pros & Cons Data Warehouse Pros & Cons
• Indirect benefits result from end users • Data warehousing implementation issues
using these direct benefits – Implementing a data warehouse is generally
– Enhance business knowledge a massive effort that must be planned and
– Present competitive advantage executed according to established methods
– Enhance customer service and satisfaction – Data warehouses are expensive and time-
consuming to develop
– Facilitate decision making
– Benefits might not be immediately obvious
– Help in reforming business processes
– Management support and championing is
vital to success
13