Data Warehousing & Mining
B. Tech and MBA. Tech. (Information Technology)
Semester-V
Chapter 2 Architecture and Infrastructure & Data
Representation
Chapter 3 Information access and delivery
By. Prof. Bhushan Inje
Outline
• Architectural components
• Infrastructure and metadata
• Principles of dimensional modeling
• Dimensional modeling advance topics
• Data Extraction, Transformation and Loading
• Data quality
By. Prof. Bhushan Inje
Introduction
• A data warehouse is a collection of corporate information,
derived directly from operational systems and some external data sources.
• Its specific purpose is to support business decisions, not business
operations.
• This is what a data warehouse is all about, helping your business ask
“What if?” questions.
• The answers to these questions will ensure your business is proactive,
instead of reactive, a necessity in today’s information age.
What is a Data Warehouse?
• Data warehouse provides architectures and tools for business executives to
systematically organise, understand, and use their data to make strategic
decisions.
• In simple terms, a data warehouse refers to a database that is maintained
separately from an organization’s operational databases.
• According to W. H. Inman, a leading architect in the construction of data
warehouse systems,
“a data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process.”
Cont.…
• Let us understand the four key words in more detail as
follows:
– Subject-oriented
– Integrated
– Time-variant
– Non-volatile
Use of Data Warehouses in organizations
• Many organizations are creating data warehouse to support
business decision-making activities for the following reasons:
– To increasing customer focus.
– To reposition products and managing product portfolios.
– To analyzing operations and looking for sources of profit.
– To managing the customer relationships
– Data warehousing is also very useful from the point of view of
heterogeneous database integration.
Differences between operational Database systems
and Data Warehouses
• On Line Transactional Processing (OLTP)
• On Line Analytical Processing (OLAP).
Characteristics of Data Warehouse
• Subject oriented
• Integrated
• Time variant
• Non volatile
Data Warehouse components
Data Warehouse components
• Data sources
• Data Warehouse
• Reporting
– Business intelligence tools
– Executive information systems (known more widely as Dashboard (business)
– OLAP Tools
– Data Mining
• Metadata
• Operations
•
Data Warehouse components Cont..
• Optional components
– Dependent Data Marts
– Logical Data Marts
– Operational Data Store
Designing the Data Warehouse
Data Warehouse Architecture
• Why do Business analysts need Data Warehouse?
Process of Data Warehouse Design
A data warehouse can be built using three approaches:
1. A top-down approach
2. A bottom-up approach
3. A combination of both approaches
• In general, the warehouse design process consists of the
following steps:
A three-tier Data
Warehouse
architecture
OLAP server architectures
There are three different possible designs:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
Getting Multidimensional Data out of the Warehouse
A Multidimensional Data Model
• From Tables and Spreadsheets to Data Cubes
• What is a data cube?
– A data cube allows data to be modeled and viewed in
multiple dimensions. It is defined by dimensions and
facts
– Dimensions are the perspectives or entities with respect
to which an organization wants to keep records.
Cont.…
Cont.…
Schemas for Multidimensional Databases
• Stars, Snowflakes, and Fact Constellations:
• Star schema:
– (1) a large central table (fact table) containing the bulk of the data,
with no redundancy, and
– (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
• Data warehouses and data marts can be defined using two language primitives,
one for cube definition and one for dimension definition. The cube definition
statement has the following syntax:
Measures: Their Categorization and Computation
• Note that a multidimensional point in the data cube space can be defined
by a set of dimension-value pairs, for example, (time = “Q1”, location =
“Vancouver”, item = “computer”).
• A data cube measure is a numerical function that can be evaluated
at each point in the data cube space.
• Measures can be organized into three categories (i.e., distributive,
algebraic, holistic), based on the kind of aggregate functions used.
• Distributive: An aggregate function is distributive if it can be
computed in a distributed manner
• Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments
• Holistic: An aggregate function is holistic if there is no constant
bound on the storage size needed to describe a sub aggregate.
Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-
level concepts to higher-level, more general concepts.
OLAP Operations in the Multidimensional Data Model
• “How are concept hierarchies useful in OLAP?”
OLAP Operations
• Roll-up- The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for
a dimension or by dimension reduction.
• Drill-down- Drill-down is the reverse of roll-up. It navigates from less detailed data
to more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
• Slice and dice- The slice operation performs a selection on one dimension of the
given cube, resulting in a sub-cube
• Pivot (rotate)- Pivot (also called rotate) is a visualization operation that rotates the
data axes in view in order to provide an alternative presentation of the data.
• Other OLAP operations
– Drill-across
– Drill-through
OLAP Systems versus Statistical Databases
• Many of the characteristics of OLAP systems, such as the use
of a multidimensional data model and concept hierarchies, the
association of measures with dimensions, and the notions of
roll-up and drill-down, also exist in earlier work on statistical
databases (SDBs).
• A statistical database is a database system that is designed to
support statistical applications. Similarities between the two
types of systems are rarely discussed, mainly due to
differences in terminology and application domains.
A Starnet Query Model for Querying Multidimensional
Databases
A Starnet Query Model for Querying Multidimensional
Databases
Data Warehouse Back-End Tools and Utilities
• Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources
• Data cleaning, which detects errors in the data and rectifies them when possible
• Data transformation, which converts data from legacy or host format to
warehouse format
• Load, which sorts, summarizes, consolidates, computes views, checks integrity,
and builds indices and partitions
• Refresh, which propagates the updates from the data sources to the warehouse
Types of OLAP Servers: ROLAP versus MOLAP
versus HOLAP
• Relational OLAP (ROLAP) servers:
• Multidimensional OLAP (MOLAP) servers:
• Hybrid OLAP (HOLAP) servers:
• Specialized SQL servers:
“How are data actually stored in ROLAP and MOLAP
architectures?”
Data Warehouse Implementation
• Efficient Computation
of Data Cubes
– The compute cube
Operator and the
Curse of
Dimensionality
“How many cuboids are there in an n-dimensional data cube?”
• If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen above,
is 2n.
• such as in the hierarchy “day < month < quarter < year”.
where Li is the number of levels associated with dimension i.
Partial Materialization: Selected Computation of Cuboids
• There are three choices for data cube materialization given a base cuboid:
1. No materialization:
2. Full materialization:
3. Partial materialization:
• The partial materialization of cuboids or subcubes should consider three factors:
(1) identify the subset of cuboids or subcubes to materialize;
(2) exploit the materialized cuboids or subcubes during query processing; and
(3) efficiently update the materialized cuboids or subcubes during load and refresh.
Indexing OLAP Data
• How to index OLAP data by bitmap indexing and join indexing.
Efficient Processing of OLAP Queries
1. Determine which operations should be performed on the
available cuboids:
2.Determine to which materialized cuboid(s) the relevant
operations should be applied:
• “Which of the above four cuboids should be
selected to process the query?”
• “How would the costs of each cuboid
compare if used to process the query?”
From Data Warehousing to Data Mining
• “How do data warehousing and OLAP relate to data mining?”
• Data Warehouse Usage
– There are three kinds of data warehouse applications:
• Information processing
• Analytical processing
• Data mining
• “How does data mining relate to information processing and
on-line analytical processing?”
• “Do OLAP systems perform data mining?
• Are OLAP systems actually data mining systems?”
From On-Line Analytical Processing to
On-Line Analytical Mining
• On-line analytical mining (OLAM) (also called OLAP mining)
• OLAM is particularly important for the following reasons:
– High quality of data in data warehouses
– Available information processing infrastructure surrounding data
warehouses
– OLAP-based exploratory data analysis
– On-line selection of data mining functions
Architecture for
On-Line
Analytical Mining
Data
Warehouse
Deployment
Data Warehouse Deployment
• Lifecycle for data warehouse deployment project:
– 0. Project Scoping and Planning
– 1. Requirement
– 2. Front-End Design
– 3. Warehouse Schema Design
– 4. OLTP to data warehouse mapping
– 5. Implementation
– 6. Deployment
– 7. Management and Maintenance of the system
Growth and maintenance of Data warehouse
• Monitoring The Data Warehouse
– Collection of Statistics
– The following is a random list that includes statistics for different uses. You will find most
of these applicable to your environment.
• Physical disk storage space utilization
• Number of times the DBMS is looking for space in blocks or causes fragmentation
• Memory buffer activity
Collection of Statistics Cont..
• Buffer cache usage
• Input–output performance
• Memory management Profile of the warehouse content, giving number of distinct
entity occurrences (example: number of customers, products, etc.)
• Size of each database table Accesses to fact table records
• Usage statistics relating to subject areas
• Numbers of completed queries by time slots during the day
• Time each user stays online with the data warehouse
• Total number of distinct users per day
• Maximum number of users during time slots daily
• Duration of daily incremental loads
• Count of valid users
• Query response times
• Number of reports run each day
• Number of active tables in the database
Using Statistics for Growth Planning
• We indicate below the types of action that are prompted by the monitoring
statistics:
– Allocate more disk space to existing database tables Plan for new disk
space for additional tables
– Modify file block management parameters to minimize fragmentation
– Create more summary tables to handle large number of queries looking
for summary information
– Reorganize the staging area files to handle more data volume
– Add more memory buffers and enhance buffer management Upgrade
database servers
– Offload report generation to another middle tier
– Smooth out peak usage during the 24-hour cycle
– Partition tables to run loads in parallel and to manage backups
MANAGING THE DATA WAREHOUSE
Platform
Upgrades
Ongoing Managing
Fine- Data
Tuning Growth
Information
Delivery Storage
Enhanceme Management
nts
Data
ETL
Model Management
Revisions
Thank You