Exercises:
2.1 – Determining the type of a Measure (distributive, algebraic, holistic)
2.2 – Implementation of DW schemas and OLAP operations
2.3 – Implementation of queries using OLAP operations
3.2 – SelfExp, Standardized Residuals
Data Warehouse – a decision support database that is maintained separately from the operational
database(s).
It supports information processing by providing a solid platform of consolidated, historical data for
analysis.
The data warehouse is the primary source for data mining.
Data Warehouses have 4 main properties:
1. Subject-oriented:
DWs are organized around major subjects (like customers, products, sales) and provide a concise
view of particular subject issues by excluding data that is not useful in the decision support process.
2. Integrated:
DWs are constructed by integrating multiple, heterogeneous data sources. When data is moved into
the warehouse, it is cleaned and converted so that naming conventions, encodings, and units are consistent across sources.
3. Time-variant:
DW data provides information from a long-term, historical perspective (e.g. the past 5 years). As such, the time
horizon of the data warehouse is significantly longer than that of operational systems.
4. Nonvolatile:
Only two kinds of data access are required: initial loading of the data and read access to it.
Operational updates of the data do not occur in the DW environment.
The differences between OLTP and OLAP are as follows:
OLTP – serves day-to-day operations and clerical users, works on current, detailed, up-to-date data, runs short and simple read/write transactions, and accesses only a few records per operation.
OLAP – serves decision support and knowledge workers, works on historical, consolidated, multidimensional data, runs complex, mostly read-only queries, and scans large numbers of records.
There are 3 types of measures in data warehouses:
1. Distributive:
If the result derived by applying the function to n aggregate values is the same as the result derived by
applying the function to all of the data without partitioning.
In other words: we can compute the result by computing sub-results on subsets of a larger set
and then applying the same function to these sub-results.
Examples:
count (D1 U D2) = count (D1) + count (D2)
sum (D1 U D2) = sum (D1) + sum (D2)
min (D1 U D2) = min (min (D1), min (D2))
max (D1 U D2) = max (max (D1), max (D2))
sum^2 (D1 U D2) = ∑ x² = sum^2 (D1) + sum^2 (D2)
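A minimal Python sketch (with made-up numbers) of the distributive property: combining the sub-results of the partitions D1 and D2 gives the same value as aggregating the unpartitioned data set.

```python
# Distributive measures: combining sub-results of partitions D1 and D2
# yields the same value as aggregating the full (unpartitioned) data set.
D1 = [4, 8, 15, 16]
D2 = [23, 42]
D = D1 + D2  # D1 U D2 (disjoint partitions)

assert len(D) == len(D1) + len(D2)              # count
assert sum(D) == sum(D1) + sum(D2)              # sum
assert min(D) == min(min(D1), min(D2))          # min
assert max(D) == max(max(D1), max(D2))          # max
assert sum(x**2 for x in D) == sum(x**2 for x in D1) + sum(x**2 for x in D2)  # sum of squares
```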
2. Algebraic:
A measure/function is algebraic if it can be computed by an algebraic function with a
constant number m of arguments, each of which is obtained by applying a distributive
aggregate function.
For example, avg() uses m = 2 arguments: sum() and count().
Examples:
avg() = sum() / count()
standard_deviation()
n_smallest()
n_largest()
variance()
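A sketch of the algebraic case with the same made-up data: avg() is assembled from a constant number m = 2 of distributive sub-aggregates per partition; variance() works analogously with m = 3.

```python
# Algebraic measure: avg() is not distributive itself, but it can be derived
# from a constant number (m = 2) of distributive sub-aggregates: sum and count.
D1 = [4, 8, 15, 16]
D2 = [23, 42]

partials = [(sum(p), len(p)) for p in (D1, D2)]   # one (sum, count) pair per partition
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
avg = total_sum / total_count
assert avg == sum(D1 + D2) / len(D1 + D2)

# variance() is algebraic as well: m = 3 distributive arguments (sum, sum of squares, count).
sum_sq = sum(x**2 for x in D1 + D2)
variance = sum_sq / total_count - avg**2
```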
3. Holistic:
A measure is holistic if there is no constant bound on the storage size (i.e. m) needed to describe a
sub-aggregate.
That is, if there is no algebraic function with m (= constant) distributively computable arguments
that characterizes the calculation.
More often than not, calculating the result requires going through the entire data set,
sometimes more than once.
Examples:
median()
most_frequent()
rank()
mode()
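And a sketch of the holistic case (same made-up partitions): there is no constant-size sub-aggregate from which the median of the union could be assembled, so the full data set has to be scanned.

```python
import statistics

# Holistic measure: the median of the union cannot be derived from the
# medians (or any other constant-size summary) of the partitions.
D1 = [4, 8, 15, 16]
D2 = [23, 42]

median_of_medians = statistics.median([statistics.median(D1), statistics.median(D2)])
true_median = statistics.median(D1 + D2)   # requires access to the entire data set
print(median_of_medians, true_median)      # 22.0 vs. 15.5 -> combining partition medians fails
```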
And there are 3 conceptual models (schemas) of data warehouses:
1. Star Schema – a fact table in the middle (containing the measures and a foreign key to each
dimension table), connected to a set of dimension tables.
2. Snowflake Schema – a refinement of the star schema in which some dimensional hierarchies are
normalized into sets of smaller dimension tables, forming a shape similar to that of a snowflake.
3. Fact Constellation – multiple fact tables share dimension tables, i.e. a collection of stars; also called
a galaxy schema.
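As an illustration, a small pandas sketch of a hypothetical star schema (all table and column names below are made up): one fact table with measures and foreign keys, and two dimension tables joined to it in a "star join" before aggregation.

```python
import pandas as pd

# Hypothetical star schema: one fact table ("sales") holding measures plus
# foreign keys, and two dimension tables ("time", "item").
time_dim = pd.DataFrame({"time_key": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "brand": ["A", "B"], "type": ["food", "drink"]})
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "units_sold": [5, 3, 7, 2],        # measure
    "dollars_sold": [50, 30, 70, 20],  # measure
})

# "Star join": the fact table is joined with its dimension tables before aggregation.
cube_base = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
print(cube_base.groupby(["month", "brand"])["dollars_sold"].sum())
```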
A data warehouse is based on a multidimensional data model which views data in the form of a data
cube. A data cube allows data to be modeled and viewed in multiple dimensions.
Typical OLAP operations include:
1. Roll-up / Drill-up – summarize data by climbing up along the hierarchy or by dimension reduction.
Examples:
city -> country
day -> month
grouping by “doctor”, thus rolling-up on “patient” (because we omit “patient”)
2. Drill-down / Roll-down – move from a higher-level summary to lower-level, more detailed data,
or introduce new dimensions.
3. Slice-and-dice – selection on one (slice) or more (dice) dimensions.
4. Pivot / Rotate – reorient the cube for visualization, e.g. turning a 3D cube into a series of 2D planes.
5. Drill-across – “drill” involving (across) more than one fact table.
6. Drill-through – “drill” through the bottom level of the cube to its back-end relational tables (using
SQL).
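The first four operations can be mimicked on a small, hypothetical base table with pandas (the column and value names below are made up):

```python
import pandas as pd

# Hypothetical base-level data: (city, month, product) -> sales.
df = pd.DataFrame({
    "country": ["DE", "DE", "DE", "FR"],
    "city":    ["Berlin", "Berlin", "Munich", "Paris"],
    "month":   ["Jan", "Feb", "Jan", "Jan"],
    "product": ["chai", "chai", "cola", "chai"],
    "sales":   [100, 120, 80, 90],
})

# Roll-up: climb the location hierarchy city -> country (the finer level is dropped).
rollup = df.groupby(["country", "month"])["sales"].sum()

# Slice: selection on one dimension.
jan_slice = df[df["month"] == "Jan"]

# Dice: selection on two (or more) dimensions.
dice = df[(df["month"] == "Jan") & (df["country"] == "DE")]

# Pivot: reorient into a 2-D view (months as rows, products as columns).
pivot = df.pivot_table(index="month", columns="product", values="sales", aggfunc="sum")
```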
Curse of Dimensionality – Given n dimensions (without hierarchies), one has 2^n cuboids.
As such, pre-computation might be useful. There are 3 options:
1. No materialization – do not pre-compute any non-base cuboid, i.e. no possible aggregate is
pre-computed.
→ expensive multi-dimensional aggregates (on-the-fly computation), requires computational power
2. Full materialization – pre-compute all of the cuboids (i.e. all of the possible combinations of
aggregates).
The resulting lattice of computed cuboids is referred to as the full cube.
→ requires huge amounts of memory (which grows exponentially with the number of dimensions)
3. Partial materialization – selectively compute a proper subset of all possible cuboids/aggregates,
e.g. those that satisfy some user-specified criterion, are frequently accessed by users, or are chosen
according to some other criterion.
→ trade-off between storage space and response time
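A small sketch (made-up data) that enumerates the full cuboid lattice, illustrating why full materialization explodes with the number of dimensions:

```python
from itertools import combinations

import pandas as pd

# With n dimensions (and no hierarchies) there are 2^n cuboids. Full
# materialization of a tiny, made-up fact table: pre-compute every cuboid.
df = pd.DataFrame({
    "city":    ["Berlin", "Berlin", "Paris"],
    "month":   ["Jan", "Feb", "Jan"],
    "product": ["chai", "cola", "chai"],
    "sales":   [100, 80, 90],
})
dims = ["city", "month", "product"]

cuboids = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        if group:
            cuboids[group] = df.groupby(list(group))["sales"].sum()
        else:
            cuboids[group] = df["sales"].sum()  # apex cuboid: everything aggregated

print(len(cuboids))  # 2^3 = 8 cuboids for n = 3 dimensions
```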
Another easy way to reduce query time is bitmap indexing: each distinct value of an attribute gets its
own bit vector, so selections and joins can be answered with fast bitwise operations.
However, this is not suitable for high-cardinality domains! For those, other indexing structures
(e.g. join indices) are used for efficient query processing on large databases.
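A toy sketch of the bitmap-index idea (made-up column values): one bit vector per distinct value, so selections turn into bitwise operations; this is also why it degrades when the number of distinct values gets large.

```python
# Minimal bitmap-index sketch: one bit vector per distinct attribute value.
rows = ["DE", "FR", "DE", "US", "FR"]          # hypothetical "country" column

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << i                   # set bit i in this value's vector

# Query: which rows have country = 'DE' OR country = 'FR'?
mask = bitmaps["DE"] | bitmaps["FR"]
matching_rows = [i for i in range(len(rows)) if mask & (1 << i)]
print(matching_rows)                           # [0, 1, 2, 4]
```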
Data Mining can be used as a hypothesis generator:
1. By using data mining algorithms on complex (e.g. multivariate) data
2. By utilizing descriptive patterns and models that are easy for decision makers to verify/check
There are 2 main types of exploration in OLAP:
1. Hypothesis-driven:
All hypotheses are formulated in advance by people (e.g. the marketing department)
Afterwards, the hypotheses are checked by manual analysis of the data warehouse
2. Discovery-driven:
Developed in 1998 at IBM Research
The software itself pre-computes measures that indicate exceptions and guides the user's
data analysis at all levels of aggregation towards the discovery of outliers by highlighting
cells that should be drilled down into
Visual cues such as the background color are used to reflect the degree of exception of each cell
Example:
Measure expectation: the expected value of each cell is pre-calculated and then compared
to its actual content; if the deviation is large enough, the cell is highlighted by the software for further exploration
Measures for exceptions:
i. SelfExp:
Surprise of a cell relative to other cells at the same level of aggregation/granularity.
The bigger the SelfExp value, the further the cell is from its expected value.
Computes the standardized residual of a cell [i, j] as:
s = (y − ŷ) / σ
(where: σ – standard deviation, ŷ – expected value of the cell)
With the following pre-calculations:
σ = sqrt( (1/n) · Σ (y − ŷ)² )
ŷ[i, j] = ( y[i, ·] · y[·, j] ) / y[·, ·]
(i.e. row total times column total, divided by the grand total)
Assuming a normal distribution of data: a value with 𝑠 > 𝜏 = 2.5 is considered to be
exceptional.
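A simplified numpy sketch of this computation on a single 2-D slice, using the formulas above (the concrete values are made up; the actual method fits coefficients across all levels of aggregation, this only illustrates the 2-D case):

```python
import numpy as np

# Expected cell value: row total * column total / grand total; cells whose
# standardized residual exceeds tau = 2.5 are flagged as exceptions.
y = np.full((4, 5), 10.0)   # hypothetical measure values (e.g. sales)
y[1, 2] = 60.0              # one injected outlier cell

row_tot = y.sum(axis=1, keepdims=True)
col_tot = y.sum(axis=0, keepdims=True)
y_hat = row_tot * col_tot / y.sum()            # expected cell values

sigma = np.sqrt(((y - y_hat) ** 2).mean())     # std. deviation of the residuals
s = np.abs(y - y_hat) / sigma                  # standardized residuals (SelfExp)

print(np.argwhere(s > 2.5))                    # only the injected cell [1, 2] is flagged
```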
ii. InExp:
Surprise (directly) beneath the cell. Beneath means: if you drill down into the cell (along any of its
branches), how much surprise will you find at the more granular level?
Computed by determining the maximum SelfExp value over all cells underneath this cell (i.e.
that can be reached by drill-down operation).
iii. PathExp:
Surprise beneath cell for each drill-down path. Same as InExp, but for a very specific drill-
down path.
Computed as the maximum of SelfExp over all cells reachable by drilling down along that
path.
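Both measures reduce to taking maxima over the SelfExp values of lower-level cells; a tiny sketch with made-up numbers:

```python
# InExp and PathExp for one aggregated cell, given hypothetical SelfExp values
# of the detail cells reachable along each drill-down path.
selfexp_below = {
    "drill down by month":   [0.4, 1.1, 3.0],   # made-up SelfExp values
    "drill down by product": [0.2, 0.9, 1.3],
}

# PathExp: maximum SelfExp over all cells reachable along one specific path.
pathexp = {path: max(values) for path, values in selfexp_below.items()}

# InExp: maximum SelfExp over all cells beneath the cell, i.e. over all paths.
inexp = max(pathexp.values())

print(pathexp)  # the 'month' path promises the biggest surprise (3.0)
print(inexp)    # 3.0
```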
Exceptions themselves can be stored, indexed, retrieved and searched like pre-computed
aggregates.
Computation consists of three phases:
i. Computation of aggregate values (bottom-up)
ii. Model fitting, i.e. computation of γ coefficients and computation of standardized residuals
iii. Computation of SelfExp, InExp, and PathExp
Advantages of Self-/In-/PathExp:
i. Easily scalable
ii. Simple and fast to calculate
iii. Descriptive detection of abnormal values through InExp and PathExp
Disadvantages of Self-/In-/PathExp:
i. We assume a normal distribution of the data, which might not hold
ii. Missing trend analysis over time: a downward or upward trend over time is not
considered in the analysis
iii. Model fitting is affected by deviations (outliers)
iv. Missing conditions on subparts of the cube: we globally assume that everything is
independent of everything else, even if it is not. For example: BBQ sauce <-> sales in the
Southern US
v. Assumes no correlation and considers dimensions separately, even though they may be
correlated globally