Exercises:
2.1 – Determining the type of a Measure (distributive, algebraic, holistic)
2.2 – Implementation of DW schemas and OLAP operations
2.3 – Implementation of queries using OLAP operations
3.2 – SelfExp, Standardized Residuals
Data Warehouse – a decision support database that is maintained separately from the operational
database(s).
It supports information processing by providing a solid platform of consolidated, historical data for
analysis.
The data warehouse is the primary source for data mining.
Data Warehouses have 4 main properties:
1. Subject-oriented:
DWs are organized around major subjects (like customers, products, sales) and provide a concise
view of particular subject issues by excluding data that is not useful in the decision support process.
2. Integrated:
DWs are constructed by integrating multiple, heterogeneous data sources. When data is moved into
the warehouse, it is cleaned and converted so that naming conventions, encodings, and units are consistent across sources.
3. Time-variant:
DW data provides information from a long-term, historical perspective (e.g. the past 5 years). As such, the time
horizon of the data warehouse is significantly longer than that of operational systems.
4. Nonvolatile:
Only two kinds of data access are required: initial loading of the data and read access to it.
Operational updates of the data do not occur in the DW environment.
The differences between OLTP and OLAP are as follows:
OLTP – serves day-to-day operations and clerical users, works on current, detailed, up-to-date data, runs short and simple read/write transactions, and accesses only a few records per operation.
OLAP – serves decision support and knowledge workers, works on historical, consolidated, multidimensional data, runs complex, mostly read-only queries, and scans large numbers of records.
There are 3 types of measures in data warehouses:
1. Distributive:
If the result derived by applying the function to n aggregate values is the same as the result derived by
applying the function to all of the data without partitioning.
In other words: we can compute the result by computing sub-results on subsets of a larger set
and then applying the same function to these sub-results.
Examples:
count (D1 U D2) = count (D1) + count (D2)
sum (D1 U D2) = sum (D1) + sum (D2)
min (D1 U D2) = min (min (D1), min (D2))
max (D1 U D2) = max (max (D1), max (D2))
sum^2 (D1 U D2) = ∑ x² = sum^2 (D1) + sum^2 (D2)
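A minimal Python sketch (with made-up numbers) of the distributive property: combining the sub-results of the partitions D1 and D2 gives the same value as aggregating the unpartitioned data set.

```python
# Distributive measures: combining sub-results of partitions D1 and D2
# yields the same value as aggregating the full (unpartitioned) data set.
D1 = [4, 8, 15, 16]
D2 = [23, 42]
D = D1 + D2  # D1 U D2 (disjoint partitions)

assert len(D) == len(D1) + len(D2)              # count
assert sum(D) == sum(D1) + sum(D2)              # sum
assert min(D) == min(min(D1), min(D2))          # min
assert max(D) == max(max(D1), max(D2))          # max
assert sum(x**2 for x in D) == sum(x**2 for x in D1) + sum(x**2 for x in D2)  # sum of squares
```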
2. Algebraic:
A measure/function is algebraic if it can be computed by an algebraic function with a
constant number m of arguments, each of which is obtained by applying a distributive
aggregate function.
For example, avg() uses m = 2 arguments: sum() and count().
Examples:
avg() = sum() / count()
standard_deviation()
n_smallest()
n_largest()
variance()
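A sketch of the algebraic case with the same made-up data: avg() is assembled from a constant number m = 2 of distributive sub-aggregates per partition; variance() works analogously with m = 3.

```python
# Algebraic measure: avg() is not distributive itself, but it can be derived
# from a constant number (m = 2) of distributive sub-aggregates: sum and count.
D1 = [4, 8, 15, 16]
D2 = [23, 42]

partials = [(sum(p), len(p)) for p in (D1, D2)]   # one (sum, count) pair per partition
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
avg = total_sum / total_count
assert avg == sum(D1 + D2) / len(D1 + D2)

# variance() is algebraic as well: m = 3 distributive arguments (sum, sum of squares, count).
sum_sq = sum(x**2 for x in D1 + D2)
variance = sum_sq / total_count - avg**2
```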
3. Holistic:
A measure is holistic if there is no constant bound on the storage size (i.e. m) needed to describe a
sub-aggregate.
That is, if there is no algebraic function with m (= constant) distributively computable arguments
that characterizes the calculation.
More often than not, calculating the result requires going through the entire data set,
sometimes more than once.
Examples:
median()
most_frequent()
rank()
mode()
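And a sketch of the holistic case (same made-up partitions): there is no constant-size sub-aggregate from which the median of the union could be assembled, so the full data set has to be scanned.

```python
import statistics

# Holistic measure: the median of the union cannot be derived from the
# medians (or any other constant-size summary) of the partitions.
D1 = [4, 8, 15, 16]
D2 = [23, 42]

median_of_medians = statistics.median([statistics.median(D1), statistics.median(D2)])
true_median = statistics.median(D1 + D2)   # requires access to the entire data set
print(median_of_medians, true_median)      # 22.0 vs. 15.5 -> combining partition medians fails
```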
And there are 3 conceptual models (schemas) of data warehouses:
1. Star Schema – a fact table in the middle (containing the measures and a foreign key to each
dimension table), connected to a set of dimension tables.
2. Snowflake Schema – a refinement of the star schema in which some dimensional hierarchies are
normalized into sets of smaller dimension tables, forming a shape similar to that of a snowflake.
3. Fact Constellation – multiple fact tables share dimension tables, i.e. a collection of stars; also called
a galaxy schema.
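As an illustration, a small pandas sketch of a hypothetical star schema (all table and column names below are made up): one fact table with measures and foreign keys, and two dimension tables joined to it in a "star join" before aggregation.

```python
import pandas as pd

# Hypothetical star schema: one fact table ("sales") holding measures plus
# foreign keys, and two dimension tables ("time", "item").
time_dim = pd.DataFrame({"time_key": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "brand": ["A", "B"], "type": ["food", "drink"]})
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "units_sold": [5, 3, 7, 2],        # measure
    "dollars_sold": [50, 30, 70, 20],  # measure
})

# "Star join": the fact table is joined with its dimension tables before aggregation.
cube_base = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
print(cube_base.groupby(["month", "brand"])["dollars_sold"].sum())
```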
A data warehouse is based on a multidimensional data model which views data in the form of a data
cube. A data cube allows data to be modeled and viewed in multiple dimensions.
Typical OLAP operations include:
1. Roll-up / Drill-up – summarize data by climbing up along the hierarchy or by dimension reduction.
Examples:
city -> country
day -> month
grouping by “doctor”, thus rolling-up on “patient” (because we omit “patient”)
2. Drill-down / Roll-down – move from a higher-level summary to lower-level, more detailed data,
or introduce new dimensions.
3. Slice-and-dice – selection on one (slice) or more (dice) dimensions.
4. Pivot / Rotate – reorient the cube for visualization, e.g. turning a 3D cube into a series of 2D planes.
5. Drill-across – “drill” involving (across) more than one fact table.
6. Drill-through – “drill” through the bottom level of the cube to its back-end relational tables (using
SQL).
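The first four operations can be mimicked on a small, hypothetical base table with pandas (the column and value names below are made up):

```python
import pandas as pd

# Hypothetical base-level data: (city, month, product) -> sales.
df = pd.DataFrame({
    "country": ["DE", "DE", "DE", "FR"],
    "city":    ["Berlin", "Berlin", "Munich", "Paris"],
    "month":   ["Jan", "Feb", "Jan", "Jan"],
    "product": ["chai", "chai", "cola", "chai"],
    "sales":   [100, 120, 80, 90],
})

# Roll-up: climb the location hierarchy city -> country (the finer level is dropped).
rollup = df.groupby(["country", "month"])["sales"].sum()

# Slice: selection on one dimension.
jan_slice = df[df["month"] == "Jan"]

# Dice: selection on two (or more) dimensions.
dice = df[(df["month"] == "Jan") & (df["country"] == "DE")]

# Pivot: reorient into a 2-D view (months as rows, products as columns).
pivot = df.pivot_table(index="month", columns="product", values="sales", aggfunc="sum")
```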
Curse of Dimensionality – Given n dimensions (without hierarchies), one has 2^n cuboids.
As such, pre-computation might be useful. There are 3 options:
1. No materialization – do not pre-compute any non-base cuboid, i.e. no possible aggregate is
pre-computed.
→ expensive multi-dimensional aggregates (on-the-fly computation), requires computational power
2. Full materialization – pre-compute all of the cuboids (i.e. all of the possible combinations of
aggregates).
The resulting lattice of computed cuboids is referred to as the full cube.
→ requires huge amounts of memory (which grows exponentially with the number of dimensions)
3. Partial materialization – selectively compute a proper subset of all possible cuboids/aggregates,
e.g. those that satisfy some user-specified criterion, are frequently accessed by users, or are chosen
according to some other criterion.
→ trade-off between storage space and response time
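A small sketch (made-up data) that enumerates the full cuboid lattice, illustrating why full materialization explodes with the number of dimensions:

```python
from itertools import combinations

import pandas as pd

# With n dimensions (and no hierarchies) there are 2^n cuboids. Full
# materialization of a tiny, made-up fact table: pre-compute every cuboid.
df = pd.DataFrame({
    "city":    ["Berlin", "Berlin", "Paris"],
    "month":   ["Jan", "Feb", "Jan"],
    "product": ["chai", "cola", "chai"],
    "sales":   [100, 80, 90],
})
dims = ["city", "month", "product"]

cuboids = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        if group:
            cuboids[group] = df.groupby(list(group))["sales"].sum()
        else:
            cuboids[group] = df["sales"].sum()  # apex cuboid: everything aggregated

print(len(cuboids))  # 2^3 = 8 cuboids for n = 3 dimensions
```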
Another easy way to reduce query time is bitmap indexing: each distinct value of an attribute gets its
own bit vector, so selections and joins can be answered with fast bitwise operations.
However, this is not suitable for high-cardinality domains! For those, other indexing structures
(e.g. join indices) are used for efficient query processing on large databases.
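A toy sketch of the bitmap-index idea (made-up column values): one bit vector per distinct value, so selections turn into bitwise operations; this is also why it degrades when the number of distinct values gets large.

```python
# Minimal bitmap-index sketch: one bit vector per distinct attribute value.
rows = ["DE", "FR", "DE", "US", "FR"]          # hypothetical "country" column

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << i                   # set bit i in this value's vector

# Query: which rows have country = 'DE' OR country = 'FR'?
mask = bitmaps["DE"] | bitmaps["FR"]
matching_rows = [i for i in range(len(rows)) if mask & (1 << i)]
print(matching_rows)                           # [0, 1, 2, 4]
```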
Data Mining can be used as a hypothesis generator:
1. By using data mining algorithms on complex (e.g. multivariate) data
2. By utilizing descriptive patterns and models that are easy for decision makers to verify/check
There are 2 main types of exploration in OLAP:
1. Hypothesis-driven:
All hypotheses are formulated in advance by people (e.g. the marketing department)
Afterwards, the hypotheses are checked by manual analysis of the data warehouse
2. Discovery-driven:
Developed in 1998 at IBM Research
The software itself pre-computes measures that indicate exceptions and guides the user's
data analysis at all levels of aggregation towards the discovery of outliers by highlighting
cells that should be drilled down into
Visual cues such as the background color are used to reflect the degree of exception of each cell
Example:
Measure expectation: the expected value of each cell is pre-calculated and then compared
to its actual content; if the deviation is large enough, the cell is highlighted by the software for further exploration
Measures for exceptions:
i. SelfExp:
Surprise of a cell relative to other cells at the same level of aggregation/granularity.
The bigger the SelfExp value, the further the cell is from its expected value.
Computes the standardized residual of a cell [i, j] as:
s = (y − ŷ) / σ
(where: σ – standard deviation, ŷ – expected value of the cell)
With the following pre-calculations:
σ = sqrt( (1/n) · Σ (y − ŷ)² )
ŷ[i, j] = ( y[i, ·] · y[·, j] ) / y[·, ·]
(i.e. row total times column total, divided by the grand total)
Assuming a normal distribution of data: a value with 𝑠 > 𝜏 = 2.5 is considered to be
exceptional.
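A simplified numpy sketch of this computation on a single 2-D slice, using the formulas above (the concrete values are made up; the actual method fits coefficients across all levels of aggregation, this only illustrates the 2-D case):

```python
import numpy as np

# Expected cell value: row total * column total / grand total; cells whose
# standardized residual exceeds tau = 2.5 are flagged as exceptions.
y = np.full((4, 5), 10.0)   # hypothetical measure values (e.g. sales)
y[1, 2] = 60.0              # one injected outlier cell

row_tot = y.sum(axis=1, keepdims=True)
col_tot = y.sum(axis=0, keepdims=True)
y_hat = row_tot * col_tot / y.sum()            # expected cell values

sigma = np.sqrt(((y - y_hat) ** 2).mean())     # std. deviation of the residuals
s = np.abs(y - y_hat) / sigma                  # standardized residuals (SelfExp)

print(np.argwhere(s > 2.5))                    # only the injected cell [1, 2] is flagged
```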
ii. InExp:
Surprise (directly) beneath the cell. Beneath means: if you drill down into the cell (along any of its
branches), how much surprise will you find at the more granular level?
Computed by determining the maximum SelfExp value over all cells underneath this cell (i.e.
that can be reached by drill-down operation).
iii. PathExp:
Surprise beneath cell for each drill-down path. Same as InExp, but for a very specific drill-
down path.
Computed as the maximum of SelfExp over all cells reachable by drilling down along that
path.
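Both measures reduce to taking maxima over the SelfExp values of lower-level cells; a tiny sketch with made-up numbers:

```python
# InExp and PathExp for one aggregated cell, given hypothetical SelfExp values
# of the detail cells reachable along each drill-down path.
selfexp_below = {
    "drill down by month":   [0.4, 1.1, 3.0],   # made-up SelfExp values
    "drill down by product": [0.2, 0.9, 1.3],
}

# PathExp: maximum SelfExp over all cells reachable along one specific path.
pathexp = {path: max(values) for path, values in selfexp_below.items()}

# InExp: maximum SelfExp over all cells beneath the cell, i.e. over all paths.
inexp = max(pathexp.values())

print(pathexp)  # the 'month' path promises the biggest surprise (3.0)
print(inexp)    # 3.0
```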
Exceptions themselves can be stored, indexed, retrieved and searched like pre-computed
aggregates.
Computation consists of three phases:
i. Computation of aggregate values (bottom-up)
ii. Model fitting, i.e. computation of γ coefficients and computation of standardized residuals
iii. Computation of SelfExp, InExp, and PathExp
Advantages of Self-/In-/PathExp:
i. Easily scalable
ii. Simple and fast to calculate
iii. Descriptive detection of abnormal values through InExp and PathExp
Disadvantages of Self-/In-/PathExp:
i. We assume a normal distribution of the data, which might not hold
ii. Missing trend analysis over time: a downward or upward trend over time is not
considered in the analysis
iii. Model fitting is affected by deviations (outliers)
iv. Missing conditions on subparts of the cube: we globally assume that everything is
independent of everything else, even if it is not. For example: BBQ sauce <-> sales in the
Southern US
v. Assumes no correlation and considers dimensions separately, even though they may be
correlated globally