
8. Data Cube Computation and Data Generalization

Lesson Introduction

Data warehousing and OLAP perform data generalization by summarizing data at varying levels of
abstraction. In this lesson, we will cover efficient methods of data cube computation, including
multiway array cube computation and the bottom-up computation (BUC) approach. Finally, we will cover
another method of data generalization, known as attribute-oriented induction.

Learning Outcomes:
After successful completion of this lesson, students will be able to understand the different ways of
computing data cubes efficiently, and the methods that can be used for attribute reduction and data
generalization.

Lesson Outline:
• Data cube computation.
• Attribute reduction.
• Data generalization.

8.1 Efficient Methods for Data Cube Computation

Data cube computation is an essential task in data warehouse implementation. The precomputation of all
or part of a data cube can greatly reduce the response time and enhance the performance of on-line
analytical processing. However, such computation is challenging because it may require substantial
computational time and storage space.

8.1.1 Cube Materialization

Figure 8-1 shows a 3-D data cube for the dimensions A, B, and C, and an aggregate measure, M. A data
cube is a lattice of cuboids. Each cuboid represents a group-by. ABC is the base cuboid, containing all three
of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the
three dimensions. The base cuboid is the least generalized of all the cuboids in the data cube. The most
generalized cuboid is the apex cuboid, commonly represented as all. It contains one value. It aggregates
measure M for all the tuples stored in the base cuboid. To drill down in the data cube, we move from the
apex cuboid, downward in the lattice. To roll up, we move from the base cuboid, upward. For the purposes
of our discussion in this lesson, we will always use the term data cube to refer to a lattice of cuboids rather
than an individual cuboid.
A cell in the base cuboid is a base cell. A cell from a nonbase cuboid is an aggregate cell. An aggregate cell
aggregates over one or more dimensions, where each aggregated dimension is indicated by a “*” in the
cell notation. Suppose we have an n-dimensional data cube. Let a = (a1, a2, ..., an, measures) be a cell
from one of the cuboids making up the data cube. We say that a is an m-dimensional cell (that is, from an
m-dimensional cuboid) if exactly m (m ≤ n) values among {a1, a2, ..., an} are not "*". If m = n, then a is a
base cell; otherwise (m < n), it is an aggregate cell.
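
To make this notation concrete, here is a minimal Python sketch (illustrative only, not part of the lesson) that counts the non-"*" values of a cell to classify it:

# A cell is modeled as a tuple of dimension values; "*" marks an
# aggregated dimension (the measure is kept separately).

def cell_dimensionality(cell):
    """Return m, the number of dimension values in the cell that are not '*'."""
    return sum(1 for v in cell if v != "*")

def is_base_cell(cell):
    """A cell of an n-D cube is a base cell when all n dimensions are set."""
    return cell_dimensionality(cell) == len(cell)

print(cell_dimensionality(("Jan", "*", "*")))        # 1 -> a 1-D aggregate cell
print(is_base_cell(("Jan", "Toronto", "Business")))  # True -> a base cell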

Figure 8-1: A 3-D data cube with the dimensions A, B, and C for some aggregate measure, M.

Base and aggregate cells: Consider a data cube with the dimensions month, city, and customer group, and
the measure price. (Jan, *, *, 2800) and (*, Toronto, *, 1200) are 1-D cells, (Jan, *, Business, 150) is a 2-
D cell, and (Jan, Toronto, Business, 45) is a 3-D cell. Here, all base cells are 3-D, whereas 1-D and 2-D cells
are aggregate cells. An ancestor-descendant relationship may exist between cells. In an n-dimensional
data cube, an i-D cell a = (a1, a2, ..., an, measures_a) is an ancestor of a j-D cell b = (b1, b2, ..., bn,
measures_b), and b is a descendant of a, if and only if (1) i < j, and (2) for 1 ≤ m ≤ n, am = bm whenever am ≠
"*". In particular, cell a is called a parent of cell b, and b a child of a, if and only if j = i + 1 and b is a
descendant of a.
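
The ancestor test can be written down directly from conditions (1) and (2); the following sketch (again illustrative only) checks both:

def is_ancestor(a, b):
    """a is an ancestor of b iff a is strictly more aggregated (condition 1)
    and agrees with b on every dimension a instantiates (condition 2)."""
    i = sum(v != "*" for v in a)   # dimensionality of a
    j = sum(v != "*" for v in b)   # dimensionality of b
    return i < j and all(av == bv for av, bv in zip(a, b) if av != "*")

def is_parent(a, b):
    """a is a parent of b iff b is a descendant of a exactly one level down."""
    return is_ancestor(a, b) and \
        sum(v != "*" for v in b) == sum(v != "*" for v in a) + 1

print(is_ancestor(("Jan", "*", "*"), ("Jan", "Toronto", "Business")))  # True
print(is_parent(("Jan", "*", "*"), ("Jan", "Toronto", "*")))           # True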

In order to ensure fast on-line analytical processing, it is sometimes desirable to precompute the full cube
(i.e., all the cells of all the cuboids for a given data cube). The number of cuboids, however, is exponential
in the number of dimensions: a data cube of n dimensions contains 2^n cuboids. There are even more cuboids if
we consider concept hierarchies for each dimension. In addition, the size of each cuboid depends on the
cardinality of its dimensions. Thus, precomputation of the full cube can require huge and often excessive
amounts of memory.
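
These counts are easy to verify with the standard formula from the data-cube literature: if dimension i has L_i hierarchy levels (excluding the virtual top level all), a full cube contains the product of (L_i + 1) over all dimensions; with no hierarchies each L_i = 1, giving 2^n. A quick Python check:

from math import prod

def total_cuboids(levels):
    """levels[i] = number of hierarchy levels of dimension i,
    excluding the virtual top level 'all'."""
    return prod(L + 1 for L in levels)

print(total_cuboids([1, 1, 1]))   # 8 = 2**3, the lattice of Figure 8-1
print(total_cuboids([4, 4, 4]))   # 125 cuboids once hierarchies are considered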

In many cases, a substantial amount of the cube's space could be taken up by a large number of cells with
very low measure values, because cube cells are often sparsely distributed within a multidimensional
space. For example, a customer may only buy a few items in a store at a time. Such
an event will generate only a few non-empty cells, leaving most other cube cells empty. In such situations,
it is useful to materialize only those cells in a cuboid (group-by) whose measure value is above some
minimum threshold. In a data cube for sales, say, we may wish to materialize only those cells for which
count ≥ 10 (i.e., where at least 10 tuples exist for the cell's combination of dimension values), or only
those cells representing sales ≥ $100. This not only saves processing time and disk space but also leads
to a more focused analysis: the cells that cannot pass the threshold are likely to be too trivial to warrant
further analysis. Such partially materialized cubes are known as iceberg cubes.
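
As a toy illustration, using the cells from the earlier example and the condition sales ≥ $100, an iceberg cube simply drops the cells that fail the threshold:

# Hypothetical precomputed cells, (month, city, customer_group) -> sales:
cube = {
    ("Jan", "*", "*"): 2800,
    ("*", "Toronto", "*"): 1200,
    ("Jan", "Toronto", "Business"): 45,
}

# Materialize only the cells satisfying the iceberg condition sales >= 100.
iceberg = {cell: sales for cell, sales in cube.items() if sales >= 100}
print(iceberg)   # the cell with sales 45 is pruned as too trivial to keep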
8.2 Strategies for Cube Computation

With the different kinds of cubes described above, we can expect that there are a good number of methods
for efficient computation. In general, there are two basic data structures used for storing cuboids.
Relational tables are used as the basic data structure for the implementation of relational OLAP (ROLAP),
while multidimensional arrays are used as the basic data structure in multidimensional OLAP (MOLAP). In
this subsection, we introduce two methods for efficient cube computation: the multiway array
aggregation (MultiWay) method and bottom-up computation (BUC).

8.2.1 Multiway Array Cube Computation

Consider a 3-D data array containing the three dimensions A, B, and C. The 3-D array is partitioned into
small, memory-based chunks. In this example, the array is partitioned into 64 chunks as shown in Figure
8-2. Dimension A is organized into four equal-sized partitions, a0, a1, a2, and a3. Dimensions B and C are
similarly organized into four partitions each. Chunks 1, 2, . . . , 64 correspond to the subcubes a0b0c0,
a1b0c0, . . . , a3b3c3, respectively. Suppose that the cardinality of the dimensions A, B, and C is 40, 400, and
4000, respectively. Thus, the size of the array for each dimension, A, B, and C, is also 40, 400, and 4000,
respectively. The size of each partition in A, B, and C is therefore 10, 100, and 1000, respectively. Full
materialization of the corresponding data cube involves the computation of all the cuboids defining this
cube.

Figure 8-2: A 3-D array for the dimensions A, B, and C, organized into 64 chunks.
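
As a quick check of this arithmetic, the following sketch reproduces the chunk layout just described:

cardinality = {"A": 40, "B": 400, "C": 4000}   # dimension sizes from the text
partitions = 4                                  # equal partitions per dimension

chunk_size = {d: c // partitions for d, c in cardinality.items()}
print(chunk_size)                        # {'A': 10, 'B': 100, 'C': 1000}
print(partitions ** len(cardinality))    # 64 memory-based chunks in total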
The resulting full cube consists of the following cuboids:

• The base cuboid, denoted by ABC (from which all of the other cuboids are directly or indirectly
computed). This cuboid is already computed and corresponds to the given 3-D array.
• The 2-D cuboids, AB, AC, and BC, which respectively correspond to the group-by's AB, AC, and BC.
These cuboids must be computed.
• The 1-D cuboids, A, B, and C, which respectively correspond to the group-by's A, B, and C. These
cuboids must be computed.
• The 0-D (apex) cuboid, denoted by all, which corresponds to the group-by (); that is, there is no
group-by here. This cuboid must be computed. It consists of one value. If, say, the data cube
measure is count, then the value to be computed is simply the total count of all of the tuples in
ABC.

Consider the following example of multiway array aggregation.

Table 8-1: Ordered Set Representation of a Data Cube

Consider the first base cell {P1, Calgary, Vance} in Table 8-1: the corresponding cube cell gets the
value 2, as indicated in Figure 8-3. When the {P1, Calgary, Vance} cell is filled, the following cells are
simultaneously aggregated with value 2.

2-D cuboids: {P1, Calgary, *},{P1, *, Vance},{*, Calgary, Vance}

1-D cuboids: {*, *, Vance}, {*, Calgary, *},{P1, *, *}

Apex cuboid: {*, *, *}

The same computation is applied to all the rows given in Table 8-1, and the final data cube is shown in
Figure 8-3.
Figure 8-3: Data Cube computation
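
This propagation can be sketched in a few lines of Python. The sketch below illustrates only the simultaneous-aggregation idea, not the actual MultiWay chunk-scanning algorithm: each base cell updates itself and every ancestor cell obtained by starring out a subset of its dimensions.

from itertools import combinations

def add_base_cell(cube, base_cell, value):
    """Add one base cell and propagate its value to every ancestor cell,
    obtained by starring out each subset of the dimensions."""
    n = len(base_cell)
    for k in range(n + 1):
        for starred in combinations(range(n), k):
            cell = tuple("*" if i in starred else base_cell[i]
                         for i in range(n))
            cube[cell] = cube.get(cell, 0) + value

cube = {}
add_base_cell(cube, ("P1", "Calgary", "Vance"), 2)
print(cube[("P1", "*", "Vance")])   # 2 -- one of the 2-D cells listed above
print(cube[("*", "*", "*")])        # 2 -- the apex cell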

8.2.2 Bottom-Up Computation (BUC)

Unlike multiway array aggregation, BUC constructs the cube from the apex cuboid toward the base cuboid.
Figure 8-1 shows a lattice of cuboids making up a 3-D data cube with the dimensions A, B, and C. The apex
(0-D) cuboid, representing the concept all (that is, (*, *, *)), is at the top of the lattice. This is the most
aggregated or generalized level. The 3-D base cuboid, ABC, is at the bottom of the lattice. It is the least
aggregated (most detailed or specialized) level. This representation of a lattice of cuboids, with the apex
at the top and the base at the bottom, is commonly accepted in data warehousing. It consolidates the
notions of drill-down (where we can move from a highly aggregated cell to lower, more detailed cells) and
roll-up (where we can move from detailed, low-level cells to higher-level, more aggregated cells). BUC
stands for "Bottom-Up Construction": its authors view the lattice in the reverse order, with the apex
cuboid at the bottom and the base cuboid at the top. According to the lattice convention described above
and used throughout this course, however, the order of processing of BUC is actually top-down; because we
adopt the application worldview, where drill-down refers to drilling from the apex cuboid down toward the
base cuboid, the exploration process of BUC is regarded as top-down. BUC's exploration for the computation
of a 3-D data cube is shown in Figure 8-4.
Figure 8-4: BUC exploration for the computation of a 3-D data cube
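
A much-simplified Python sketch of the BUC idea follows, using count as the measure (the real algorithm also orders dimensions by cardinality and uses counting sort, both omitted here). Starting from the apex, it recursively partitions the input on each remaining dimension and prunes any partition that fails the iceberg condition, which safely skips all descendants of the failing cell:

def buc(rows, dims, min_sup, cell, out):
    """rows: list of tuples; dims: indices of dimensions still to expand;
    cell: the current (partially starred) cell; out: maps cell -> count."""
    out[tuple(cell)] = len(rows)                    # emit the current cell
    for j, d in enumerate(dims):
        groups = {}
        for row in rows:                            # partition on dimension d
            groups.setdefault(row[d], []).append(row)
        for val, part in groups.items():
            if len(part) >= min_sup:                # iceberg (Apriori) pruning
                child = cell[:d] + [val] + cell[d + 1:]
                buc(part, dims[j + 1:], min_sup, child, out)

rows = [("P1", "Calgary", "Vance"), ("P1", "Calgary", "Vance"),
        ("P2", "Toronto", "Bob")]
out = {}
buc(rows, dims=[0, 1, 2], min_sup=2, cell=["*", "*", "*"], out=out)
print(out)   # only cells with support >= 2, plus the apex cell (count 3)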

8.2.3 Star Attributes and Star Nodes

If the single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is
useless to distinguish such values in the iceberg cube computation. Thus, the node p can be replaced by
"*" so that the cuboid tree can be further compressed. We say that a node p in an attribute A is a star-
node if the single-dimensional aggregate on p does not satisfy the iceberg condition; otherwise, p is a non-
star-node. A cuboid tree that is compressed using star-nodes is called a star-tree.

Consider a base cuboid table with 5 tuples and 4 dimensions, where the cardinalities of
dimensions A, B, C, and D are 2, 4, 4, and 4, respectively. The one-dimensional aggregates for all attributes are
shown in Table 8-2.

Table 8-2: Base Cuboids before star reduction.


Suppose min_sup = 2 in the iceberg condition. Only the attribute values a1, a2, b1, c3, and d4 satisfy the
condition. All other values are below the threshold and thus become star-nodes. Each attribute
value that does not satisfy the minimum support is replaced by "*", as shown in Table 8-3. By collapsing
star-nodes, we obtain the reduced base table in Table 8-4. Notice that it contains two fewer rows and also
fewer distinct values than Table 8-2. We use the reduced base table to construct the cuboid tree because
it is smaller. The resulting star-tree is shown in Figure 8-5. To help identify which nodes are star-nodes, a
star-table is constructed for each star-tree. Figure 8-5 also shows the corresponding star-table for the star-
tree (where only the star-nodes are shown in the star-table).

Table 8-3: Base cuboids after replacement by "*".

Table 8-4: Base Cuboids after star reduction.

Figure 8-5: Star-tree and star-table.
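
The reduction can be sketched as follows. Since the base table itself is not reproduced in this document, the rows below are hypothetical but consistent with the stated cardinalities and with the surviving values a1, a2, b1, c3, and d4:

from collections import Counter

def star_reduce(rows, min_sup):
    """Replace every attribute value whose one-dimensional aggregate fails
    min_sup with '*' (a star-node), then collapse identical rows."""
    n = len(rows[0])
    keep = [
        {v for v, c in Counter(row[d] for row in rows).items() if c >= min_sup}
        for d in range(n)
    ]
    return Counter(
        tuple(v if v in keep[d] else "*" for d, v in enumerate(row))
        for row in rows
    )

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
print(star_reduce(rows, min_sup=2))
# 3 collapsed rows, e.g. (a1, b1, *, *) with count 2 -- two fewer than before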


8.3 Data Generalization: Attribute-Oriented Induction

Data generalization summarizes data by replacing relatively low-level values (such as numeric values for
an attribute age) with higher-level concepts (such as young, middle-aged, and senior). Given a large
amount of data stored in databases, it is useful to be able to describe concepts in concise and succinct
terms at generalized (rather than low) levels of abstraction. Allowing datasets to be generalized at
multiple levels of abstraction facilitates users in examining the general behavior of the data.

The attribute-oriented induction (AOI) approach to concept description was first proposed in 1989, a few
years before the introduction of the data cube approach. The data cube approach is essentially based on
materialized views of the data, which typically have been precomputed in a data warehouse. In general,
it performs off-line aggregation before an OLAP or data mining query is submitted for processing. On the
other hand, the attribute-oriented induction approach is a query-oriented, generalization-based, on-line
data analysis technique.

The general idea of attribute-oriented induction is to first collect the task-relevant data using a database
query and then perform generalization based on the examination of the number of distinct values of each
attribute in the relevant set of data. The generalization is performed by either attribute removal or
attribute generalization. Aggregation is performed by merging identical generalized tuples and
accumulating their respective counts. This reduces the size of the generalized dataset. The resulting
generalized relation can be mapped into different forms for presentation to the user, such as charts or
rules.
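
A compact sketch of this core loop is given below. It performs only a single-step hierarchy climb (real AOI keeps climbing until the attribute generalization threshold is met), and all names are illustrative:

from collections import Counter

def aoi(rows, attrs, hierarchies, threshold):
    """rows: list of dicts attr -> value; hierarchies: attr -> dict mapping
    a value to its higher-level concept. Attributes with too many distinct
    values are removed (no hierarchy) or generalized (hierarchy available);
    identical generalized tuples are then merged with accumulated counts."""
    plan = []
    for a in attrs:
        distinct = {row[a] for row in rows}
        if len(distinct) <= threshold:
            plan.append((a, None))              # retain the attribute as-is
        elif a in hierarchies:
            plan.append((a, hierarchies[a]))    # attribute generalization
        # else: attribute removal -- a is dropped from the working relation
    return Counter(
        tuple(h[row[a]] if h else row[a] for a, h in plan)
        for row in rows
    )

rows = [{"gender": "M", "major": "CS"}, {"gender": "F", "major": "Physics"},
        {"gender": "M", "major": "Math"}]
hier = {"major": {"CS": "science", "Physics": "science", "Math": "science"}}
print(aoi(rows, ["gender", "major"], hier, threshold=2))
# Counter({('M', 'science'): 2, ('F', 'science'): 1})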

Consider the following example of attribute-oriented induction.

Suppose that a user would like to describe the general characteristics of graduate students in the Big
University database, given the attributes name, gender, major, birthplace, birth date, residence, phone#
(telephone number), and GPA (grade point average). A data mining query for this characterization can be
expressed in the data mining query language, DMQL, as follows:

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, GPA
from student
where status in "graduate"


The first step is to obtain the initial relation. Data focusing should be performed before attribute-oriented
induction; this step corresponds to the specification of the task-relevant data (i.e., the data for analysis).
The data are collected based on the information provided in the data mining query, and the resulting initial
relation is given in Table 8-5.

Table 8-5: Initial relation
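
In relational terms, data focusing is just a selection and projection. A Python sketch over hypothetical student records (stored as dicts) might look like this:

# Hypothetical student records; only the task-relevant attributes are kept
# for the tuples satisfying the query's where-clause.
relevant = ["name", "gender", "major", "birth_place", "birth_date",
            "residence", "phone#", "GPA"]

def initial_relation(students):
    """Data focusing: select graduate students, project relevant attributes."""
    return [{a: s[a] for a in relevant}
            for s in students if s["status"] == "graduate"]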

The second step is to prepare for generalization. Attribute removal is based on the following rule: if there
is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no
generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or
(2) its higher-level concepts are expressed in terms of other attributes, then the attribute should be
removed from the working relation.

Attribute generalization is based on the following rule: if there is a large set of distinct values for an
attribute in the initial working relation, and there exists a set of generalization operators on the attribute,
then a generalization operator should be selected and applied to the attribute. This rule is based on the
following reasoning: generalizing an attribute value makes the resulting tuple cover more of the original
data tuples, thus generalizing the concept it represents.

For each attribute of the relation, the generalization proceeds as follows:

• name: Since there are a large number of distinct values for name and there is no generalization
operation defined on it, this attribute is removed.
• gender: Since there are only two distinct values for gender, this attribute is retained, and no
generalization is performed on it.
• major: Suppose that a concept hierarchy has been defined that allows the attribute major to be
generalized to the values {arts & science, engineering, business}. Suppose also that the attribute
generalization threshold is set to 5, and that there are more than 20 distinct values for major in
the initial working relation. By attribute generalization and attribute generalization control, major
is therefore generalized by climbing the given concept hierarchy.
• birth_place: This attribute has a large number of distinct values; therefore, we would like to
generalize it. Suppose that a concept hierarchy exists for birth_place, defined as "city < province
or state < country". If the number of distinct values for country in the initial working relation
is greater than the attribute generalization threshold, then birth_place should be removed,
because even though a generalization operator exists for it, the generalization threshold would
not be satisfied. If, instead, the number of distinct values for country is less than the attribute
generalization threshold, then birth_place should be generalized to birth_country.
• birth_date: Suppose that a hierarchy exists that can generalize birth_date to age, and age to
age_range, and that the number of age ranges (or intervals) is small with respect to the attribute
generalization threshold. Generalization of birth_date should therefore take place.
• residence: Suppose that residence is defined by the attributes number, street, residence_city,
residence_province_or_state, and residence_country. The number of distinct values for number
and street will likely be very high, since these are quite low-level concepts. The attributes number
and street should therefore be removed, so that residence is generalized to residence_city,
which contains fewer distinct values.
• phone#: As with the attribute name above, this attribute contains too many distinct values and
should therefore be removed in generalization.
• GPA: Suppose that a concept hierarchy exists for GPA that groups grade point average values
into numerical intervals like {3.75–4.0, 3.5–3.75, ...}, which in turn are grouped into descriptive
values, such as {excellent, very good, ...}. The attribute can therefore be generalized (a sketch
of this two-step climb follows the list).
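
As promised above, here is a sketch of the two-step GPA climb. The interval boundaries and descriptive labels are hypothetical, since the full hierarchy is not given in the lesson:

def gpa_to_interval(gpa):
    """First climb: numeric GPA -> interval (hypothetical boundaries)."""
    if gpa >= 3.75:
        return "3.75-4.0"
    if gpa >= 3.5:
        return "3.5-3.75"
    return "below 3.5"

# Second climb: interval -> descriptive value (hypothetical labels).
interval_to_desc = {"3.75-4.0": "excellent", "3.5-3.75": "very good",
                    "below 3.5": "good"}

print(interval_to_desc[gpa_to_interval(3.8)])   # excellent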

Table 8-6 shows how attribute-oriented induction is performed on the initial working relation of
Table 8-5.

Table 8-6: Selecting attribute removal and attribute generalization

Step 3 derives the prime relation, P. This is performed by inserting the generalized tuples into P; the result
is given in Table 8-7.

Table 8-7: Prime relation

Finally, the presentation step: the user interacts with the output, adjusting the levels of abstraction by
drilling and pivoting, and mapping the generalized relation into forms such as rules or charts. The end user
controls how the result is visualized; Table 8-8 shows one such output.
Table 8-8: Visualization and Presentation
