DATA WAREHOUSING
AND
DATA MINING
What is a Data Warehouse?
A single, complete
and consistent store of
data obtained from a
variety of different
sources made available
to end users in a what
they can understand
and use in a business
context.
[Barry Devlin] 2
What are the users saying...
❚ Data should be integrated
across the enterprise
❚ Summary data has a real
value to the
organization
❚ Historical data holds the
key to understanding
data over time
❚ What-if capabilities are
required
3
What is Data Warehousing?
A process of
Information
transforming data into
information and
making it available to
users in a timely
enough manner to
make a difference
[Forrester Research, April
Data 1996]
4
Data Warehousing --
It is a process
❚ Technique for assembling and
managing data from
various sources for the
purpose of answering
business questions. Thus
making decisions that were
not previous possible
❚ A decision support database
maintained separately from
the organization’s
operational database 5
Data Warehouse
❚ A data warehouse is a
❙ subject-oriented
❙ integrated
❙ time-varying
❙ non-volatile
collection of data that is used primarily
in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
6
Data Warehouse Architecture
Relationa
l
Database
s
Optimized Loader
Extraction
ERP
Systems Cleansing
Data Warehouse
Engine Analyze
Purchase Query
d
Data
Legacy
Data Metadata Repository
7
Decision Support
❚ Used to manage and control business
❚ Data is historical or point-in-time
❚ Optimized for inquiry rather than update
❚ Use of the system is loosely defined and
can be ad-hoc
❚ Used by managers and end-users to
understand the business and make
judgements
8
Data Mining works with Warehouse
Data
❚ Data Warehousing
provides the Enterprise
with a memory
❚
❚Data Mining provides the
Enterprise with
intelligence
❚ 9
Application Areas
Industry Application
Finance Credit Card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record analysis
Transport Logistics management
Consumer goods promotion analysis
Data Service providers Value added data
Utilities Power usage analysis
10
Data Mining in Use
❚ The US Government uses Data Mining to
track fraud
❚ A Supermarket becomes an information
broker
❚ Basketball teams use it to track game
strategy
❚ Cross Selling
❚ Warranty Claims Routing
❚ Holding on to Good Customers
❚ Weeding out Bad Customers 11
What makes data mining possible?
❚ Advances in the following areas are
making data mining deployable:
❙ data warehousing
❙ better and more data (i.e.,
operational, behavioral, and
demographic)
❙ the emergence of easily deployed
data mining tools and
❙ the advent of new data mining
techniques. 12
• -- Gartner Group
OLTP vs. Data Warehouse
❚ OLTP systems are tuned for known
transactions and workloads while
workload is not known a priori in a data
warehouse
❚ Special data organization, access methods
and implementation methods are
needed to support data warehouse
queries (typically multidimensional
queries)
❙ e.g., average amount spent on phone calls
between 9AM-5PM in Pune during the
13
month of December
OLTP vs Data Warehouse
❚ OLTP ❚ Warehouse (DSS)
❙ Application ❙ Subject Oriented
Oriented ❙ Used to analyze
❙ Used to run business
business ❙ Summarized and
❙ Detailed data refined
❙ Current up to ❙ Snapshot data
date ❙ Integrated Data
❙ Isolated Data ❙ Ad-hoc access
❙ Repetitive ❙ Knowledge User
access (Manager)
14
❙ Clerical User
OLTP vs Data Warehouse
❚ OLTP ❚ Data Warehouse
❙ Performance ❙ Performance relaxed
Sensitive ❙ Large volumes
❙ Few Records accessed at a
accessed at a time time(millions)
(tens) ❙ Mostly Read (Batch
Update)
❙ Read/Update Access ❙ Redundancy present
❙ Database Size
100 GB - few
❙ No data redundancy
terabytes
❙ Database Size
100MB -100 GB
15
OLTP vs Data Warehouse
❚ OLTP ❚ Data Warehouse
❙ Transaction ❙ Query
throughput is throughput is
the the
performance performance
metric metric
❙ Thousands of ❙ Hundreds of
users users
❙ Managed in ❙ Managed by
entirety subsets
❙
16
To summarize ...
❚ OLTP Systems are
used to “run” a
business
❚
❚
❚
❚ The Data
❚ Warehouse helps
to “optimize” the
business
17
Components of the Warehouse
❚ Data Extraction and Loading
❚ The Warehouse
❚ Analyze and Query -- OLAP Tools
❚ Metadata
❚
❚ Data Mining tools
18
Loading the Warehouse
Cleaning the data
before it is loaded
Data -- Heart of the Data
Warehouse
❚ Heart of the data warehouse is the
data itself!
❚ Single version of the truth
❚ Corporate memory
❚ Data is organized in a way that
represents business -- subject
orientation
20
From the Data Warehouse to Data
Marts
Information
Individually Less
Structured
Departmentally History
Structured Normalized
Detailed
Organizationally More
Structured Data Warehouse
Data
21
Data Warehouse and Data Marts
OLAP
Data Mart
Lightly summarized
Departmentally structured
Organizationally structured
Atomic
Detailed Data Warehouse Data
22
II. On-Line Analytical Processing (OLAP)
Making Decision
Support Possible
Limitations of SQL
“A Freshman
in Business
needs a Ph.D.
in SQL”
-- Ralph Kimball
24
Typical OLAP Queries
❚ Write a multi-table join to compare sales for each
product line YTD this year vs. last year.
❚ Repeat the above process to find the top 5
product contributors to margin.
❚ Repeat the above process to find the sales of a
product line to new vs. existing customers.
❚ Repeat the above process to find the customers
that have had negative sales growth.
25
What Is OLAP?
❚ Online Analytical Processing - coined by
EF Codd in 1994 paper contracted by
Arbor Software*
❚ Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence,
Executive Information System
❚ OLAP = Multidimensional Database
❚ MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)
❚ ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
26
Strengths of OLAP
❚ It is a powerful visualization paradigm
❚ It provides fast, interactive response
times
❚ It is good for analyzing time series
❚ It can be useful to find some clusters and
outliers
❚ Many vendors offer OLAP tools
27
OLAP Is FASMI
❚ Fast
❚ Analysis
❚ Shared
❚ Multidimensional
❚ Information
Nigel Pendse, Richard Creath - The OLAP Report
28
Multi-dimensional Data
❚ “Hey…I sold $100M worth of goods”
Dimensions: Product, Region, Time
Hierarchical summarization paths
on
gi
W
S
Re
N Product Region Time
Industry Country Year
Product
Juice
Cola
Milk Category Region Quarter
Cream
Toothpaste
Soap Product City Month Week
1 2 34 5 6 7
Month Office Day
29
A Visual Operation: Pivot (Rotate)
NY
LA
th
SF
n
Mo
Juice 10
Cola 47
Region
Milk 30
Cream 12 Product
3/1 3/2 3/3 3/4
Date 30
“Slicing and Dicing”
The Telecomm Slice
Product
Household
Telecomm n s
i o
eg
Video R Europe
Far East
Audio India
Retail Direct Special Sales Channel
31
Roll-up and Drill Down
Higher Level of
Aggregation ❚
❚ Sales Channel
Drill-Down
❚ Region
Roll Up
❚ Country
❚ State
❚ Location Address
❚ Sales
Representative
Low-level
Details
32
Nature of OLAP Analysis
❚ Aggregation -- (total sales,
percent-to-total)
❚ Comparison -- Budget vs.
Expenses
❚ Ranking -- Top 10, quartile
analysis
❚ Access to detailed and
aggregate data
❚ Complex criteria
specification
❚ Visualization 33
Organizationally Structured Data
❚ Different Departments look at the same
detailed data in different ways. Without
the detailed, organizationally structured
data as a foundation, there is no
reconcilability of data
marketing
sales
finance
manufacturing
34
OLAP - Data Cube
❚ Idea: analysts need to group data in many
different ways
❙ eg. Sales(region, product, prodtype,
prodstyle, date, saleamount)
❙ saleamount is a measure attribute, rest are
dimension attributes
❙ groupby every subset of the other
attributes
❘ materialize (precompute and store)
groupbys to give online response
❙ Also: hierarchies on attributes: date ->
weekday, 35
date -> month -> quarter -> year
Relational OLAP: 3 Tier DSS
Data Warehouse ROLAP Engine Decision Support Client
Database Layer Application Logic Layer Presentation Layer
Store atomic Generate SQL Obtain multi-
data in industry execution plans in dimensional
standard the ROLAP engine reports from the
RDBMS. to obtain OLAP DSS Client.
functionality.
36
MD-OLAP: 2 Tier DSS
MDDB Engine MDDB Engine Decision Support Client
Database Layer Application Logic Layer Presentation Layer
Store atomic data in a proprietary data Obtain multi-
structure (MDDB), pre-calculate as dimensional
many outcomes as possible, obtain reports from the
OLAP functionality via proprietary DSS Client.
algorithms running against this data.
37