Module I
Data Warehouse (DWH)
Fundamentals
By : Aruna Khubalkar
Outline
Need for Data Warehousing
Benefits and features of Data Warehouse
Data warehouse Characteristics
Data warehouse Architecture
Dimensional Modeling : Star & Snowflake
OLAP operations
OLTP v/s OLAP
By : Ms. Aruna Khubalkar 2
Operational Sources (OLTP’s)
Operational computer systems did provide information to run
day-to-day operations, and answer’s daily questions, but…
Also called Online Transactional Processing System (OLTP)
Data is read or manipulated with each transaction
Transactions/queries are simple, and easy to write
Usually for middle management
Examples
Sales systems
Hotel reservation systems
HRM Applications
Etc.
By : Ms. Aruna Khubalkar 3
Typical decision queries
Data set are mounting everywhere, but not useful for
decision support
Decision-making require complex questions from
integrated data.
Enterprise wide data is desired
Decision makers want to know:
Where to build new oil warehouse?
Which market they should strengthen?
Which customer groups are most profitable?
How much is the total sale by month/ year/ quarter for each
offices?
Is there any relation between promotion campaigns and sales
growth?
Can OLTP answer all such questions, efficiently?
By : Ms. Aruna Khubalkar 4
*
Information crisis
Integrated
Must have a single, enterprise-wide view
Data Integrity
Information must be accurate and must conform to
business rules
Accessible
Easily accessible with intuitive access paths and
responsive for analysis
Credible
Every business factor must have one and only one
value
Timely
Information must be available within the stipulated
time frame
* Paulraj 2001.
By : Ms. Aruna Khubalkar 5
Expectations of new soln.
DB designed for analytical tasks
Data from multiple applications
Easy to use
Ability of what-if analysis
Read-intensive data usage
Direct interaction with system, without IT assistance
Periodical updating contents & stable
Current & historical data
Ability for users to initiate reports
By : Ms. Aruna Khubalkar 7
DW meets expectations
Provides enterprise view
Current & historical data available
Decision-transaction possible without affecting
operational source
Reliable source of information
Ability for users to initiate reports
Acts as a data source for all analytical applications
By : Ms. Aruna Khubalkar 8
Definition of DW
Bill Inmon (Father of Data Warehousing) defined
“A DW is a subject-oriented, integrated, non-volatile,
time-variant collection of data in favor of decision-making”
Kelly said
“Separate available, integrated, time-stamped, subject-
oriented, non-volatile, accessible”
Four properties/characteristics of DW
subject-oriented, integrated,
non-volatile, time-variant
By : Ms. Aruna Khubalkar 9
Subject-oriented
In operational sources data is organized by
applications, or business processes.
In DW subject is the organization method
Subjects vary with enterprise
These are critical factors, that affect performance
Example of Manufacturing Company
Sales
Shipment
Inventory etc
By : Ms. Aruna Khubalkar 10
Integrated Data
Data comes from several applications
Problems of integration comes into play
File layout, encoding, field names, systems, schema,
data heterogeneity are the issues
Bank example, variance: naming convention,
attributes for data item, account no, account type,
size, currency
In addition to internal, external data sources
External companies data sharing
Websites
Others
Removal of inconsistency
So process of extraction, transformation & loading
By : Ms. Aruna Khubalkar 11
Time variant
Operational data has current values
Comparative analysis is one of the best techniques for
business performance evaluation
Time is critical factor for comparative analysis
Every data structure in DW contains time element
In order to promote product in certain, analyst has to
know about current and historical values
The advantages are
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
By : Ms. Aruna Khubalkar 12
Non-volatile
Data from operational systems are moved into DW
after specific intervals
Data is persistent/ not removed i.e. non volatile
Every business transaction don’t update in DW
Data from DW is not deleted
Data is neither changed by individual transactions
Properties summary
Subject Oriented Time-Variant Non-Volatile
Organized along the lines of Every record in the data Refers to the inability of data
the subjects of the to be updated. Every record
warehouse has some
corporation. Typical subjects in the data warehouse is time
are customer, product, form of time variancy stamped in one form or
vendor and transaction. attached to it. another.
By : Ms. Aruna Khubalkar 13
Generic two-level architecture
L
One,
T company-
wide
warehouse
E
Periodic extraction data is not completely current in warehouse
By : Ms. Aruna Khubalkar 15
DW Architecture
By : Ms. Aruna Khubalkar 17
Components of DW
Source Data Component
Data Staging Component
Data Storage Component
Information Delivery Component
Metadata Component
Management and Control Component
By : Ms. Aruna Khubalkar 18
Source Data Component
Source Data coming into the data warehouse
may be grouped into 4 broad categories :
1. Production data
2. Internal data
3. Archived data
4. External data
By : Ms. Aruna Khubalkar 19
Data Staging Component
Data staging provides a place and an area with
a set of functions to clean, change, combine,
convert, deduplicate and prepare source data for
storage and use in the data warehouse.
3 major functions of Extraction, Transformation
and preparation for Loading takes place in
staging area.
By : Ms. Aruna Khubalkar 20
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
By : Ms. Aruna Khubalkar 21
Data Transformation
Data transformation is the component of
data reconcilation that converts data from
the format of the source operational systems
to the format of enterprise data warehouse.
Data transformation consists of a variety of
different functions:
– record-level functions,
– field-level functions and
– more complex transformation.
By : Ms. Aruna Khubalkar 22
Role of Metadata (data catalog)
Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise data
warehouses, including derivation rules
Indicate how data is derived from operational
data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g. drill-down)
Identify responsible people
By : Ms. Aruna Khubalkar 23
Derived Data
Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
Characteristics
– Detailed (mostly periodic) data
– Aggregate (for summary)
– Distributed (to departmental servers)
Most common data model = star schema
(also called “dimensional model”)
By : Ms. Aruna Khubalkar 24
Data Warehouse Modeling :
Data Cube and OLAP
Design Decisions
Choosing the Process. Selecting the subjects from the
information packages for the first set of logical structures
to be designed.
Choosing the Grain. Determining the level of detail for
the data in the data structures.
Identifying and Conforming the Dimensions.
Choosing the business dimensions (such as product,
market, time, etc.)
Choosing the Facts. Selecting the metrics or units of
measurements (such as product sale units, dollar sales,
dollar revenue, etc.)
Choosing the Duration of the Database. Determining
how far back in time you should go for historical data. 26
By : Ms. Aruna Khubalkar
Dimension Table
Dimensions are perspectives or entities with
respect to which an organization wants to keep
records.
Time, item, branch, location
Each dimension have a table associated with it,
Dimension table, which further describes the
dimension.
Dimension tables, such as
item (item_name, brand, type), or
time(day, week, month, quarter, year)
By : Ms. Aruna Khubalkar 27
Fact Table
A multidimensional data model is typically
organized around a central theme, like sales,
represented by Fact Table
Fact table contains measures (such as
dollars_sold, units_sold) and keys to each
of the related dimension tables
By : Ms. Aruna Khubalkar 28
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
Example of Snowflake Schema
time
item
time_key
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
branch location_key location to_location
branch_key location_key dollars_cost
branch_name units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
shipper_type
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a
set of dimension tables
• Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
• Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Defining Star Schema in DMQL
define cube sales_star [time, item, branch,
location]:
l
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars), units_sold
= count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key,
item_name, brand, type, supplier_type)
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key,
street, city, province_or_state, country)
Defining Snowflake Schema in
DMQL
define cube sales_snowflake [time, item, branch,
location]:
l
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key, item_name,
brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
l
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
define cube shipping [time, item, shipper, from_location,
to_location]:
l
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location
as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
A Concept Hierarchy: Dimension
(location)
all
region Europe ... North_America
Germany ... Spain Canada ... Mexico
city Frankfurt ... Vancouver ... Toronto
office L. Chan ... M. Wind
From Tables and Spreadsheets to
Data Cubes
A data warehouse is based on a
multidimensional data model which views
data in the form of a data cube
A data cube, such as sales, allows data to
be modeled and viewed in multiple
dimensions
o Defined by dimensions and facts
Multidimensional Data
Sales volume as a function of product,
month, and region
Region
Sales data warehouse
Dimensions: Product, Location, Time
Product
Month
In data warehousing literature, a data
cube (such as above) is referred as
cuboid.
Given a set of dimensions, we can
generate a cuboid for each of the
possible subsets of given dimensions.
The result is lattice of cuboids, each
showing the data at different level of
summarization.
The lattice of cuboids is referred as a
Data Cube
A Sample Data Cube
Date
Total annual sales
2Qtr of TVs in U.S.A.
1Qtr 3Qtr 4Qtr sum
TV
Product PC U.S.A
VCR
sum
Canada
Country
Mexico
sum
All, All, All
Cuboids Corresponding to
the Cube
all
0-D(apex) cuboid
product date country
1-D cuboids
product,date product,country date, country
2-D cuboids
product, date, country 3-D(base) cuboid
An n-D base cube is called a base
cuboid.
The top most 0-D cuboid, which
holds the highest-level of
summarization, is called the apex
cuboid.
The lattice of cuboids forms a data
cube.
Cube: A Lattice of Cuboids
all
0-D (apex) cuboid
time item location supplier
1-D cuboids
time,location item,location location,supplier
time,item 2-D cuboids
time,supplier item,supplier
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier
4-D (base) cuboid
time, item, location, supplier
On-Line Analytical Processing (OLAP)
OLAP is the use of a set of graphical tools that
provides users with multidimensional views of their
data and allows them to analyze the data using
simple windowing techniques
Relational OLAP (ROLAP)
– OLAP tools that view the database as a traditional
relational database in either a star schema or other
normalized or denormalized set of tables.
Multidimensional OLAP (MOLAP)
– OLAP tools that load data into an intermediate
structure, usually a three or higher dimensional
array. (Cube structure)
By : Ms. Aruna Khubalkar 45
Typical OLAP Operations
Roll up (drill-up): summarize data
◦ by climbing up hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up
◦ from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
Slice and Dice: project and select
Pivot (rotate):
◦ reorient the cube, visualization, 3D to series of
2D planes
47
Fig. 3.10 Typical
OLAP Operations
48
Figure 11-22: Slicing a data cube
By : Ms. Aruna Khubalkar 50
Summary report
Example of drill-down
Drill-down with
color added
By : Ms. Aruna Khubalkar 51
Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)
◦ Major task of traditional relational DBMS
◦ Day-to-day operations: purchasing,
inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
◦ Major task of data warehouse system
◦ Data analysis and decision making
Data Mining: Concepts and
17/01/23 Techniques 52
OLTP vs. OLAP Systems
Major Distinguishing features
◦ User and system orientation:
customer vs. market
◦ Data contents: current, detailed vs.
historical, consolidated
◦ Database design: application-oriented
+ ER vs. subject-oriented +
star/snowflake model
◦ View: current, local vs. evolutionary,
integrated
◦ Access patterns: short, atomic
transaction vs. read-only but complex
OLTP vs. OLAP Systems
Feature OLTP OLAP
Orientation transaction analysis
Function day to day operations decision support
Users clerk, DBA, IT professional knowledge worker(manager,analyst)
Data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
DB design application-oriented subject-oriented
Operations index/hash on primary key lots of scans
Unit of work short, simple transaction complex query
# records accessed tens millions
# users thousands hundreds
DB size 100MB-GB 100GB-TB
Metric transaction throughput query throughput, response
Priority High performance, high High flexibility
Data Mining: Concepts and
availability 17/01/23 Techniques 54