Concept Hierarchy in Data Mining
Yijun Lu
M.Sc., Mathematics and Statistics, Simon Fraser University, Canada, 1995
B.Sc., Huazhong University of Science and Technology, China, 1985
© Yijun Lu 1997
SIMON FRASER UNIVERSITY
December 1997
Name: Yijun Lu
Degree: Master of Science
Title of thesis: Concept Hierarchy in Data Mining: Specification, Generation and Implementation
Date Approved:
Abstract
Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. As one of the most important kinds of background knowledge, concept hierarchies play a fundamentally important role in data mining. The purpose of this thesis is to study several aspects of concept hierarchies, such as automatic generation and encoding techniques, in the context of data mining.
After a discussion of the basic terminology and categorization, the automatic generation of concept hierarchies is studied for both nominal and numerical hierarchies. One algorithm is designed for determining the partial order on a given set of nominal attributes; the resulting partial order is a useful guide for users to finalize the concept hierarchy for their particular data mining tasks. Based on hierarchical and partitioning clustering methods, two algorithms are proposed for the automatic generation of numerical hierarchies. Quality and performance comparisons indicate that the proposed algorithms correctly capture the distributional characteristics of the numerical data concerned and generate reasonable concept hierarchies. The applicability of the algorithms is also discussed, and some useful guidelines are given for selecting among them. As an important technique for efficient implementation, the encoding of concept hierarchies is investigated: an encoding method is presented and its properties are studied. The advantages of this method are shown by comparing its storage requirements and performance with those of other techniques. Finally, the applications of concept hierarchies in processing typical data mining tasks are discussed.
Acknowledgments
I would like to express my deepest gratitude to my senior supervisor, Dr. Jiawei Han. He has provided me with inspiration both professionally and personally during the course of my degree. The completion of this thesis would not have been possible without his encouragement, patient guidance and constant support.
I am very grateful to Dr. Veronica Dahl for serving on my supervisory committee and to Dr. Qiang Yang for being my external examiner. They were generous with their time in reading this thesis carefully and making thoughtful suggestions.
My thanks also go to Dr. Yongjian Fu for his valuable suggestions and comments, and to my fellow students and colleagues in the Database Systems Laboratory, Jenny Chiang, Sonny Chee, Micheline Kamber, Betty Xia, Cheng Shan, Wan Gong, Nebojsa Stefanovic, and Kris Koperski, for their assistance and friendship.
Financial support from the research grants of Dr. Jiawei Han and from the School of Computing Science at Simon Fraser University is much appreciated.
Finally, my special thanks are due to my wife, Ying Zhang, for her love and care throughout these years.
Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
   1.1 Data Mining and Knowledge Discovery
   1.2 The Role of Concept Hierarchy in Data Mining
   1.3 Motivation
   1.4 Outline of the Thesis

2 Related Work
   2.1 Concept Hierarchy in Data Warehousing
   2.2 Concept Hierarchy in Data Mining
   2.3 Concept Hierarchy in Other Areas
   2.4 Summary

3 Specification of Concept Hierarchies
   3.1 Preliminaries
   3.2 A Portion of DMQL for Specifying Concept Hierarchies
   3.3 Types of Concept Hierarchies
      3.3.1 Schema hierarchy
      3.3.2 Set-grouping hierarchy
      3.3.3 Operation-derived hierarchy
      3.3.4 Rule-based hierarchy
   3.4 Summary

4 Automatic Generation of Concept Hierarchies
   4.1 Automatic Generation of Nominal Hierarchies
      4.1.1 Algorithm
      4.1.2 On date/time Hierarchies
   4.2 Automatic Generation of Numerical Hierarchies
      4.2.1 Basic Algorithm
      4.2.2 An Algorithm Using Hierarchical Clustering
      4.2.3 An Algorithm Using Partitioning Clustering
      4.2.4 Quality and Performance Comparison
   4.3 Discussion and Summary

5 Techniques of Implementation
   5.1 Relational Table Approach
   5.2 Encoding of Concept Hierarchy
      5.2.1 Algorithm
      5.2.2 Properties
      5.2.3 Remarks
   5.3 Performance Analysis and Comparison
      5.3.1 Storage Requirement
      5.3.2 Disk Access Time
   5.4 Discussion and Summary

6 Data Mining Using Concept Hierarchies
   6.1 DBMiner System
   6.2 DMQL Query Expansion
   6.3 Concept Generalization
   6.4 On the Utilization of Rule-based Concept Hierarchies
   6.5 Concept Lookup for Displaying Results of Data Mining
   6.6 Summary

7 Conclusions and Future Work
   7.1 Summary
   7.2 Future Work

Bibliography
List of Tables

4.1 Optimal combination of fan-out and number of bins
5.1 Hierarchy table for location
5.2 A date/time hierarchy table
5.3 An encoded hierarchy table
5.4 Hierarchy tables for approach A
5.5 Hierarchy tables for approach B
List of Figures

3.1 Four sample concept hierarchies
3.2 A concept hierarchy location for the provinces in Canada
3.3 A lattice-like concept hierarchy science
3.4 Top-level DMQL syntax for defining concept hierarchies
3.5 A set-grouping hierarchy statusHier for attribute status
3.6 A rule-based concept hierarchy gpaHier for attribute GPA
3.7 Generalization rules for concept hierarchy gpaHier
3.8 A variant of the concept hierarchy gpaHier
4.1 A histogram for attribute A
4.2 A concept hierarchy for attribute A generated by algorithm AGHF
4.3 A concept hierarchy for attribute A generated by algorithm AGHC
4.4 A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.3)
4.5 A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.4)
4.6 A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.5)
4.7 Another histogram for attribute A
4.8 A concept hierarchy for attribute A generated by algorithm AGHC with input histogram given in Figure 4.7
4.9 A concept hierarchy for attribute A generated by algorithm AGPC with input histogram given in Figure 4.7
4.10 Comparison of execution time when the fan-out is 3
4.11 Comparison of execution time when the fan-out is 5
5.1 Post-order traversal encoding of a small hierarchy
5.2 An encoded concept hierarchy
5.3 Storage comparison for different numbers of dimensions
5.4 Storage comparison by varying the number of levels
5.5 Storage comparison for different fan-outs in hierarchies
5.6 Storage comparison for different concept lengths
5.7 Storage comparison when the number of leaf nodes in hierarchies is fixed
5.8 Comparison of disk access time for generalizing a concept
6.1 Architecture of the DBMiner system
6.2 A sample procedure of code chopping-off
7.1 A concept hierarchy for attribute age
7.2 Another concept hierarchy for attribute age
7.3 A histogram for attribute age
Chapter 1
Introduction
With the rapid growth in the size and number of available databases in commercial, industrial, administrative and other applications, it is both necessary and interesting to examine how to extract knowledge automatically from huge amounts of data.
Knowledge discovery in databases (KDD), or data mining, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [17]. Through the extraction of knowledge, large databases serve as rich, reliable sources for knowledge retrieval and verification, and the discovered knowledge can be applied to information management, decision making, process control and many other applications. Therefore, data mining has come to be considered one of the most important and challenging research areas. Researchers in many different fields, including database systems, knowledge-based systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases and data visualization, have shown great interest in data mining. Many companies are also moving into this important area, realizing that data mining offers an opportunity for major revenue.
A popular myth about data mining is the expectation that a data mining engine (often called a data miner) will dig out all kinds of knowledge from a database autonomously and present it to users without human instruction or intervention. This sounds appealing. However, as one may be aware, an overwhelmingly large set of knowledge, deep or shallow, from one perspective or another, could be generated from the many different combinations of the sets of data in the database. The whole set of knowledge generated from a database, if measured in bytes, could be far larger than the database itself. Thus it is neither realistic nor desirable to generate, store, or present the entire set of knowledge discoverable from a database.
A more realistic goal is for a user or an expert to communicate with a data miner using a set of data mining primitives for effective and fruitful data mining. Such primitives include the specification of the portion of the database in which one is interested, the kind of knowledge or rules to be mined, the background knowledge that the mining process should use, the desired forms in which to present the discovered knowledge, etc.
As one useful kind of background knowledge, concept hierarchies organize data or concepts in hierarchical forms or as certain partial orders, which are used for expressing knowledge in concise, high-level terms and for facilitating mining at multiple levels of abstraction. Concept hierarchies are also utilized to form dimensions in multidimensional databases and thus are essential components of data warehousing as well [29].
In this chapter, the tasks of data mining are described in section 1.1, where different kinds of rules are introduced. In section 1.2, the role of concept hierarchies in basic attribute-oriented induction (AOI) and multiple-level rule mining is discussed. The motivation of this thesis is addressed in section 1.3. Section 1.4 gives an overview of the thesis.
one may discover that a set of symptoms often occurs together with another set
of symptoms.
4. Classification Rule Mining, the categorization of data into a set of known classes. For example, a set of cars associated with many features may be classified based on their gas mileage.
5. Clustering, the identification of clusters (classes or groups) for a set of objects based on their attributes. The objects are clustered so that the within-group similarity is maximized and the between-group similarity is minimized, based on some criteria. For example, a set of diseases can be clustered into several clusters based on the similarities of their symptoms.
6. Prediction, the forecast of possible values of some missing data, or of the distribution of certain attribute(s) in a set of data. For example, an employee's salary can be predicted based on the salary distribution of similar employees in the company.
7. Evolution Rule Mining, the discovery of a set of rules which reflect the general evolution behavior of a set of data. For example, one may discover the major factors which influence the fluctuations of certain stock prices.
The data mining tasks described above are among the most widely recognized ones. Other data mining tasks, yielding different kinds of knowledge rules, have also been studied, and even for the rules stated above there exist special forms or variants for different cases. For example, quantitative association rule mining is a recent development of general association rule mining.
Mining knowledge at higher abstraction levels has clear advantages over data mining at the primitive level. For example, suppose we have discovered a rule at the primitive level such as the following.

Rule 1: If a person's title is professor, senior engineer, doctor or lawyer, then his or her salary is between $60,000 and $100,000.

After abstracting the data to certain higher levels, we may have the following rule.

Rule 2: Well educated people are well paid.

Obviously, Rule 2 is much more concise than Rule 1 and, to a certain extent, conveys more information. What we have done here is abstract people titled professor, senior engineer, doctor and lawyer to a higher conceptual level, i.e., well educated people, and generalize salaries between $60,000 and $100,000 to the higher level concept well paid.
Different sets of data may admit different abstractions, which can then be organized to form different concept hierarchies. A formal definition of concept hierarchy will be given in §3.1.
Concept hierarchies can be used in the processing of all the tasks stated in the last section. For a typical data mining task, the following basic steps are executed, and concept hierarchies play a key role in these steps.
Before proceeding to the next section, it is worth pointing out that concept hierarchies are also of fundamental importance in data warehousing. In a typical data warehousing system, dimensions are organized in the form of concept hierarchies; therefore, the OLAP operations roll-up and drill-down can be performed by concept (or data) generalization and specialization.
1.3 Motivation
The incorporation of concept hierarchies into data mining and data warehousing techniques has produced many important research results as well as useful systems. However, most of the effort in research and industry has been put into the utilization of concept hierarchies. This is, of course, the ultimate goal of all studies on concept hierarchies, but their efficient use should be based on a thorough understanding of the different aspects of, and techniques concerning, concept hierarchies. Some of the problems related to concept hierarchies are listed as follows.
These and other problems let us recognize the fundamental importance of concept hierarchies and motivate us to conduct an in-depth study of them. Concept hierarchies may be applied in other areas and may raise other problems there, but we confine our study to the context of data mining and data warehousing.
In Chapter 5, techniques for implementing concept hierarchies are presented. The relational table approach for storing hierarchies is addressed, with a comparison against the traditional file operating approach. The encoding technique for concept hierarchies and its application substantially improve the performance of our data mining system. An algorithm is developed for the purpose of hierarchy encoding, and the performance comparison of encoded hierarchies against non-encoded ones conducted there shows the superiority of our encoding technique.
Chapter 6 considers the application of concept hierarchies in a typical data mining system, DBMiner, where we discuss how to utilize concept hierarchies in DMQL query processing and concept generalization, how to handle the information loss problems arising in the use of rule-based hierarchies, and how to display final mining results.
Finally, we summarize the thesis in Chapter 7, where some interesting problems are addressed for future study.
Chapter 2
Related Work
In early studies, and in areas other than data mining, a concept hierarchy is commonly called a taxonomy. We adopt the term concept hierarchy because of its popularity in the data mining and knowledge discovery community.
In this chapter, we briefly review the previous work related to concept hierarchies in the context of data warehousing, data mining and some other areas.
The use of concept hierarchies in a data warehousing system provides the foundation of the operations roll-up and drill-down. Harinarayan, Rajaraman and Ullman [29] studied the view materialization problem when hierarchical dimensions are involved in the construction of data cubes. To improve the performance of OLAP operations, a lattice framework is used to express dependencies among views; these dependencies are actually introduced by the use of concept hierarchies. More recent research by Wang and Iyer [49] proposed an encoding method for concept hierarchies benefiting the roll-up and drill-down queries of OLAP. The post-order labeling method used in [49] demonstrates better performance than the traditional join method in the DB2 V2 system. Unlike other research, this work focuses on how to use concept hierarchies efficiently to improve the performance of OLAP queries.
Many commercial OLAP products are available; Cognos PowerPlay [42], Oracle Express [8] and MicroStrategy DSS [11] are among the most popular ones.
Since the analysis of historical information for decision support is the ultimate goal of any data warehousing system, at least one time dimension should be involved in the construction of data cubes. Once the time period is specified, a time dimension is reasonably stable. The flexibility of time schemas has led PowerPlay, Express and DSS to put a great deal of effort into handling different time dimensions. One interesting observation is that numerical attributes are usually taken as measurements and thus assigned as a measure, or fact, in the fact tables. Of course, one can take the attribute age as a measurement and obtain aggregates such as avg(age) over a set of data. However, comparing the attributes account balance and age, we find that account balance has more of the character of a measurement; it could be more useful to build a concept hierarchy for age and place the attribute age in a dimension table. The lack of support for generating concept hierarchies for numerical attributes is a common shortcoming of current commercial OLAP products.
2.4 Summary
Related work on concept hierarchies in the context of data warehousing, data mining and some other areas, such as machine learning, statistics, planning and image processing, has been summarized. A great deal of that research concerns the utilization of concept hierarchies in different algorithms; relatively little work addresses the generation of concept hierarchies and the techniques for implementing them efficiently. These are the major topics of this thesis and will be studied in the remaining chapters.
Chapter 3
Specification of Concept Hierarchies
The importance of concept hierarchies stimulates us to conduct a systematic study of them. In this chapter, we give a formal definition of concept hierarchy and study its properties in section 3.1. Some basic terms, such as nearest ancestor, level name and schema level partial order, are introduced. In section 3.2, the portion of DMQL for specifying concept hierarchies is described. In section 3.3, concept hierarchies are categorized into four types based on the methods of specifying them. Finally, we summarize this chapter in section 3.4.
3.1 Preliminaries
The definition of concept hierarchy is introduced in this section. Some basic terms are also discussed.
In traditional philosophy, a concept is determined by its extent and intent, where
the extent consists of all objects belonging to the concept while the intent is the multitude of all attributes valid for all those objects. A formal definition of concept can be found in [50]. For the purpose of data mining and knowledge discovery, we simply take a concept as a unit of thought, expressed as a linguistic term. For example, "human being" is a concept, and "computing science" is a concept, too. Here we do not explicitly describe the extent and intent of a concept, and assume that they can be reasonably interpreted in the context of a particular data mining task.
Definition 3.1 A concept hierarchy is a poset (partially ordered set) (H, ≺), where H is a finite set of concepts and ≺ is a partial order¹ on H.
There are other names for concept hierarchy in the literature, for example taxonomy or is-a hierarchy [46, 48], structured attribute [33], etc.
[Figure 3.1: Four sample concept hierarchies, labeled (1)-(4).]
Example 3.1 Since posets can be visually sketched using Hasse diagrams [20], we can also use such diagrams to express concept hierarchies. Figure 3.1 illustrates four different concept hierarchies.

¹A partial order on set H is an irreflexive and transitive relation [20].
Example 3.2 Following Definition 3.3, we find that concept hierarchies (2) and (3) in Figure 3.1 are regular concept hierarchies. For concept hierarchy (3), the greatest element is N and we have H_0 = {N}, H_1 = {L, M}, H_2 = {H, I, J, K} and H_3 = {A, B, C, D, E, F, G}.
From now on, we will focus our discussion on regular concept hierarchies and refer to a regular concept hierarchy simply as a concept hierarchy or, simply, a hierarchy.
Usually, the partial order ≺ in a concept hierarchy reflects the special-general relationship between concepts, which is also called the subconcept-superconcept relation (see [50, 47]). Another important term for describing the degree of generality of concepts is the level number. We assign zero as the level number of the greatest element (called the most general concept) of H, and the level number of each other concept is one plus its nearest ancestor's level number. A concept with level number l is also called a concept at level l.
Due to the layered structure of a hierarchy as described in Definition 3.3, all the concepts with the same level number must be in the set H_l for one and only one l, l = 0, 1, ..., (n - 1). We thus simply call H_l level l of the concept hierarchy.
The nearest-ancestor relationship can be captured by the functions g_l: for x ∈ H_l, g_l(x) = y if y is a nearest ancestor of x.
If we impose the constraint that each function g_l is single-valued, that is, for any x, y ∈ H_l, if g_l(x) ≠ g_l(y) then x ≠ y, then the Hasse diagram of a concept hierarchy is actually a tree. Therefore, all the terminology for a tree, such as node, root, path, leaf, parent, child and sibling, is applicable to concept hierarchies as well. It is not difficult to see that g(H_l) ⊆ H_{l-1} for each l = 1, 2, ..., (n - 1). In the case that g(H_l) = H_{l-1} for each l = 1, 2, ..., (n - 1), every node except the ones in H_{n-1} has at least one child.
If level numbers are already assigned to the levels of a hierarchy, a simple way to give each level a level name is to combine the word level with the level number; for example, we assign level2 as the level name of the level with level number 2. Based on the above discussion, when we talk about a level in a concept hierarchy, we may refer to it either by its set of concepts or by the level name assigned to it, interchangeably.
Example 3.3 A concept hierarchy location for the provinces in Canada is shown in Figure 3.2. It consists of three levels (n = 3) with level names country, region and province, respectively. We have H_0 = {Canada}, H_1 = {Western, Central, Maritime}, H_2 = {BC, AB, MB, SK, ON, QC, NS, NB, NF, PE}, and the relation ≺ is defined as

    BC ≺ Western ≺ Canada;
    AB ≺ Western ≺ Canada;
    MB ≺ Western ≺ Canada;
    SK ≺ Western ≺ Canada;
    ON ≺ Central ≺ Canada;
    QC ≺ Central ≺ Canada;
    NS ≺ Maritime ≺ Canada;
    NB ≺ Maritime ≺ Canada;
    NF ≺ Maritime ≺ Canada;
    PE ≺ Maritime ≺ Canada.

[Figure 3.2: A concept hierarchy location for the provinces in Canada, with level names country, region and province.]

All the other expressions for this relation, such as BC ≺ Canada and ON ≺ Canada, can be derived from the above expressions using the transitivity of the relation. □
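The partial order of this example can be represented directly in code. The following minimal sketch (the names are illustrative, not from the thesis) stores each concept's nearest ancestor and derives the full relation ≺ by transitivity; it also computes level numbers as defined above:

    # A concept hierarchy as a parent map: child -> nearest ancestor.
    parent = {
        "BC": "Western", "AB": "Western", "MB": "Western", "SK": "Western",
        "ON": "Central", "QC": "Central",
        "NS": "Maritime", "NB": "Maritime", "NF": "Maritime", "PE": "Maritime",
        "Western": "Canada", "Central": "Canada", "Maritime": "Canada",
    }

    def precedes(x, y):
        """True if x < y, i.e. y is an ancestor of x (derived by transitivity)."""
        while x in parent:
            x = parent[x]
            if x == y:
                return True
        return False

    def level(x):
        """Level number: 0 for the most general concept, otherwise parent's level + 1."""
        return 0 if x not in parent else level(parent[x]) + 1

    assert precedes("BC", "Canada")   # derived: BC < Western < Canada
    assert level("BC") == 2           # the province level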
Definition 3.5 (Schema level partial order) A schema level partial order of a concept hierarchy H is a partial order on S, the set of level names of concept hierarchy H.
Theorem 3.1 The derived relation ≺ is not only a partial order on S, but a total order as well.

For example, the derived schema level partial order on the set of level names in Example 3.3 is

    province ≺ region ≺ country.
On the other hand, if a partial order is given on the set {H_l}_{l=0}^{n-1}, we can define a partial order on the set ∪_{l=0}^{n-1} H_l. This is especially convenient when we are concerned with the instances at each level. For example, the following expression demonstrates the application of such a partial order to instances of date values:

    Jan. 12, 1996 ≺ January 1996 ≺ Q1 1996 ≺ 1996.
In general, the functions g_l, l = 1, 2, ..., (n - 1), can also be multi-valued. In this case the concept hierarchy cannot be illustrated as a tree, and a lattice-like graph is employed to describe it visually. More detailed discussion can be found in §3.3.4 and §6.4.
Example 3.5 A lattice-like hierarchy science is shown in Figure 3.3, where the discipline data mining has three parents: AI, database and statistics. □

[Figure 3.3: A lattice-like concept hierarchy science, rooted at all science.]
The syntax of DMQL is defined in an extended BNF grammar, where "[ ]" represents zero or one occurrence, "{ }" represents zero or more occurrences, and words in sans serif font represent keywords.
Example 3.6 A hierarchy on the home address attributes of a relation employee in a company database is defined in DMQL as follows.

This statement defines the partial order among a sequence of attributes: house number is at one level lower than street, which is in turn at one level lower than city, and so on. Notice that multiple hierarchies can be formed on a data relation, based on different combinations and orderings of the attributes.
Similarly, a concept hierarchy for date(day, month, quarter, year) is usually predefined by a data mining system, which can be done using the following DMQL statement.
A concept hierarchy definition may cross several relations. For example, a hierarchy productHier may involve two relations, product and company, and be defined by the following schema.

In this definition, an attribute name which is shared by the two relations has the corresponding relation name specified in front of it using the dot notation, as in SQL, and the join condition of the two relations is specified by a where clause. □
Here we use the notation that {A_1, A_2, ..., A_k} ≺ B is equivalent to A_i ≺ B for each i = 1, 2, ..., k. This hierarchy can also be expressed visually, as in Figure 3.5.

[Figure 3.5: A set-grouping hierarchy statusHier for attribute status, with allStatus above graduate and undergraduate.]
The following statement gives the specification of this hierarchy in DMQL.

    define hierarchy statusHier as
        level2: {freshman, sophomore, junior, senior} < level1: undergraduate;
        level2: {M.Sc, Ph.D} < level1: graduate;
        level1: {graduate, undergraduate} < level0: allStatus    □
These definitions add a refined layer to the existing definition in the schema hierarchy location shown in Figure 3.2.
Example 3.8 The GPA values of students are real numbers ranging from 0 to 4. However, the GPA values are usually not uniformly distributed, so it is preferable to define a hierarchy gpaHier by an automatic generation algorithm.

This statement says that the default algorithm AGHC, which will be discussed in the next chapter, is applied to all the GPA values of the relation student, with a fan-out value of 4. □
[Figure 3.6: A rule-based concept hierarchy gpaHier for attribute GPA: GPA ranges generalize via rules R1-R8 to poor, average, good and excellent, which generalize via rules R9-R13 to weak and strong, under ANY.]
The generalization rules for concept hierarchy gpaHier (Figure 3.7) are:

    R1:  {0.0~2.0} → poor;
    R2:  {2.0~2.5} ∧ {graduate} → poor;
    R3:  {2.0~2.5} ∧ {undergraduate} → average;
    R4:  {2.5~3.0} → average;
    R5:  {3.0~3.5} → good;
    R6:  {3.5~3.8} ∧ {graduate} → good;
    R7:  {3.5~3.8} ∧ {undergraduate} → excellent;
    R8:  {3.8~4.0} → excellent;
    R9:  {poor} → weak;
    R10: {average} ∧ {senior, graduate} → weak;
    R11: {average} ∧ {freshman, sophomore, junior} → strong;
    R12: {good} → strong;
    R13: {excellent} → strong.
For the sake of simplicity, we have adopted the following convention in this thesis for numerical ranges: a value x of an attribute A is in the range "a~b" if a ≤ x < b. The only exception is when b is the maximum value of the attribute, in which case we have a ≤ x ≤ b.
Sometimes it is possible to convert the lattice-like structure of a rule-based hierarchy to a tree-like correspondence. Assume that each of the generalization rules is in the form

    A(x) ∧ B(x) → C(x)

that is, for a tuple x, concept A can be generalized to concept C (a higher level attribute value) if condition B is satisfied by x. If B is also a value of a certain attribute, we can take A ∧ B as a new concept, and the above rule then expresses a subconcept-superconcept relationship. Therefore, a tree-structured concept hierarchy can be derived from the given generalization rules.
Considering again the hierarchy gpaHier, we can see that, besides gpa, one more attribute, status, is involved in the generalization rules. With the assistance of the hierarchy shown in Figure 3.5, we can replace the higher level concepts of status with their corresponding leaf level concepts and thus transform one generalization rule into several. For instance, rules R10 and R11 can be split into

    {average} ∧ {senior} → weak;    {average} ∧ {M.Sc} → weak;    {average} ∧ {Ph.D} → weak;
    {average} ∧ {freshman} → strong;    {average} ∧ {sophomore} → strong;    {average} ∧ {junior} → strong.

The other rules can be dealt with similarly. In total, there are 30 detailed generalization rules.
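This splitting can be mechanized. The following sketch (illustrative names; the thesis does not prescribe an implementation) replaces each higher level status concept in a rule's condition by the leaf concepts it covers in the hierarchy of Figure 3.5:

    # Leaf concepts covered by each status concept, from hierarchy statusHier.
    status_leaves = {
        "freshman": ["freshman"], "sophomore": ["sophomore"],
        "junior": ["junior"], "senior": ["senior"],
        "M.Sc": ["M.Sc"], "Ph.D": ["Ph.D"],
        "undergraduate": ["freshman", "sophomore", "junior", "senior"],
        "graduate": ["M.Sc", "Ph.D"],
    }

    def split_rule(gpa_concept, status_concepts, target):
        """Expand one generalization rule into leaf-level rules."""
        return [(gpa_concept, leaf, target)
                for c in status_concepts
                for leaf in status_leaves[c]]

    # R10: {average} AND {senior, graduate} -> weak splits into three rules:
    print(split_rule("average", ["senior", "graduate"], "weak"))
    # [('average', 'senior', 'weak'), ('average', 'M.Sc', 'weak'),
    #  ('average', 'Ph.D', 'weak')]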
[Figure 3.8: A variant of the concept hierarchy gpaHier, in which every node pairs a gpa range with a status value, grouped under weak and strong beneath ANY.]
Figure 3.8 shows the hierarchy derived from those rules, where we use fr, so, ju and se to represent freshman, sophomore, junior and senior, respectively, and every concept (node) is a pair of concepts for the attributes gpa and status. The sign "-" means any value of an attribute. This hierarchy is equivalent to the one shown in Figure 3.6 in the sense that we obtain the same result whether we generalize a tuple using one hierarchy or the other. This kind of transformation, from a rule-based hierarchy to an equivalent but non-rule-based one, is important in order to apply our encoding algorithm, which will be addressed in Chapter 5. Another advantage of the splitting is that we can avoid the information loss problems (see [7]) encountered during attribute-oriented induction. We will return to this issue in Chapter 6.
Finally, it is worth noticing that, in practical applications, a concept hierarchy can be a mixed type of hierarchy, formed by merging several of the different types of concept hierarchies described in the above subsections.
3.4 Summary
As the basis of our study of concept hierarchies, we first defined and discussed some terminology and characteristics of concept hierarchies. The top-level portion of the data mining query language (DMQL) for specifying concept hierarchies was stated and illustrated by examples defining different hierarchies. Concept hierarchies were classified into four types, i.e., schema, set-grouping, operation-derived and rule-based, and the characteristics and specification of each type were discussed.
Chapter 4
Automatic Generation of Concept Hierarchies
As we mentioned in earlier chapters, concept hierarchies can be provided by knowledge engineers, domain experts or users. The effort of constructing a concept hierarchy depends mostly on the size of the hierarchy. It is feasible to construct a hierarchy of small size manually; however, it could be too much work for a user or an expert to specify every concept hierarchy, especially large ones. Moreover, some specified hierarchies may not be desirable for a particular data mining task. Therefore, mechanisms should be introduced for the automatic generation and/or adjustment of concept hierarchies based on the data distributions in a data set. The data set could be the whole database or a portion of it, or the whole set or a portion of the set of the data relevant to a particular mining task. The former is independent of any particular mining task and is thus called a static data set or a database data set, whereas the latter is generated dynamically (after the mining task is submitted to the mining system) and is thus called a dynamic data set or a query-relevant data set. In this context, the generation of a concept hierarchy can be based on either a static or a dynamic data set.
4.1.1 Algorithm
Intuitively, based on the structure of a concept hierarchy, we may say that a hierarchy is reasonable if every level H_l has fewer nodes (or concepts) than each of the levels below it. This consideration leads to the following algorithm for finding the hidden partial order on a set of nominal attributes.
1. Set Ω := S and k := 1; count the number of distinct values of each attribute in Ω, and assign the attribute with the smallest count to B_1.

2. while (k < m) {
       Ω := Ω - {B_k};
       minNum := ∞;
       for (each attribute A_i in Ω) {
           count the number of distinct tuples with respect to the attribute
           list B_1, B_2, ..., B_k, A_i; denote this number by myNum;
           if (myNum < minNum) {
               minNum := myNum;
               B_{k+1} := A_i;
           } // end of if
       } // end of for loop
       k := k + 1;
   } // end of while loop

3. Assign the only attribute in Ω to B_m. □
The major operations in the above algorithm can be implemented with SQL functionality. For example, the operation count the number of distinct tuples in R with respect to attribute list B_1, B_2, ..., B_k, A may be fulfilled by counting the rows returned by the following SQL query:

    SELECT DISTINCT B_1, B_2, ..., B_k, A
    FROM R
Theorem 4.1 A partial order on a set S of attributes can be worked out by Algorithm 4.1 in O(m^3 n \log n) time, where m = |S| and n is the total number of tuples.

Proof Assume that there are no indices on the target database table. It is easy to see that the time for retrieving the distinct tuples with respect to a set of t attributes is O(tn \log n). Since, for each t = 1, 2, ..., m, we need to execute this kind of retrieval (m - t) times, the total time is

    \sum_{t=1}^{m-1} [ (m - t) O(t n \log n) + (m - t) ] = O(m^3 n \log n).

Thus the theorem follows. □
It is important to point out that users have the freedom to adjust the partial order obtained from the algorithm, because they may have a better understanding of the database schema. A partial order worked out from the semantics of the attributes may result in a better interpretation of the final mining results. For the same reason, it may sometimes be unnecessary to apply the above algorithm at all, and the initially assigned order on a set of attributes can be used as the partial order.
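For concreteness, here is a minimal in-memory sketch of Algorithm 4.1, with Python standing in for the SQL-based implementation; the function names and sample data are illustrative:

    def nominal_partial_order(rows, attrs):
        """Order attrs from most general to most specific (Algorithm 4.1 sketch).
        rows: list of dicts, one per tuple of relation R."""
        def distinct(cols):
            return len({tuple(r[c] for c in cols) for r in rows})

        remaining = list(attrs)
        # Step 1: start with the attribute having the fewest distinct values.
        order = [min(remaining, key=lambda a: distinct([a]))]
        remaining.remove(order[0])
        # Step 2: repeatedly append the attribute that yields the fewest
        # distinct tuples together with the attributes chosen so far.
        while len(remaining) > 1:
            best = min(remaining, key=lambda a: distinct(order + [a]))
            order.append(best)
            remaining.remove(best)
        order.extend(remaining)   # Step 3: the last attribute goes to the bottom level.
        return order              # order[0] is the most general level.

    rows = [
        {"state": "WA", "county": "King",    "areaname": "Seattle"},
        {"state": "WA", "county": "King",    "areaname": "Bellevue"},
        {"state": "WA", "county": "Pierce",  "areaname": "Tacoma"},
        {"state": "CA", "county": "Alameda", "areaname": "Oakland"},
    ]
    print(nominal_partial_order(rows, ["areaname", "county", "state"]))
    # ['state', 'county', 'areaname'], i.e. areaname < county < state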
One may find that a schema hierarchy can be formed using the attributes state, areaname and county of the relation cif_pop. By applying Algorithm 4.1, we obtain the partial order areaname ≺ county ≺ state, which is consistent with the actual geography of the United States.
A date/time hierarchy generator should also handle irregular time periods: fiscal years, semester years, etc. are commonly employed in different companies and institutions, and the resulting hierarchies should be able to characterize those cases.
1. How to form the leaf level nodes? This is equivalent to the problem of discretizing the numerical attribute into a number of subintervals. One method, called equal-width-interval, is to partition the whole interval of the attribute into subintervals of equal width; the width, or the number of subintervals, can be adjusted in order to obtain a reasonable granularity of partition. Because the leaf level can be replaced with any higher level, a finer partition of the whole interval gives a better picture of the raw data distribution, although more computational time is needed in the finer partition case. An alternative, called equal-frequency-interval, is to choose the interval boundaries so that each subinterval contains approximately the same number of values of the attribute. (Both methods are sketched in code after this list.)
2. How to merge the leaf level nodes to form higher level nodes? Any higher level node is obtained by merging some leaf nodes, with the constraint that only contiguous nodes may be merged. The equal-width-interval and equal-frequency-interval methods can also be used to produce higher level nodes, and other methods can be designed based on the different purposes the numerical hierarchies serve. In §4.2.1, a basic algorithm for generating numerical hierarchies is described. An algorithm based on hierarchical clustering with an order constraint is proposed in §4.2.2, and another algorithm based on partitioning clustering is developed in §4.2.3. Performance analysis and quality comparisons are presented in §4.2.4.
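Both discretization methods from item 1 can be stated compactly. The following is a minimal sketch, assuming NumPy is available and the raw attribute values fit in memory; the function names are illustrative:

    import numpy as np

    def equal_width_bins(values, k):
        """Partition [min, max] into k subintervals of equal width."""
        return np.linspace(min(values), max(values), k + 1)   # k+1 boundaries

    def equal_frequency_bins(values, k):
        """Choose boundaries so that each subinterval holds roughly the
        same number of attribute values."""
        return np.quantile(values, np.linspace(0.0, 1.0, k + 1))

    values = [1, 2, 2, 3, 10, 11, 12, 50, 51, 52, 53]
    print(equal_width_bins(values, 4))       # [ 1. 14. 27. 40. 53.]
    print(equal_frequency_bins(values, 4))   # boundaries follow the data density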
Example 4.2 Suppose a histogram has been produced, as shown in Figure 4.1, for attribute A. Applying the algorithm AGHF, we generate the concept hierarchy shown in Figure 4.2. If we look at the count for each node at level 1, we observe that the count is 14 for node "0~50", 19 for node "50~90", and 17 for node "90~120". This is an approximately even distribution of counts. □

[Figure 4.1: A histogram for attribute A over the range 0-120.]

[Figure 4.2: A concept hierarchy for attribute A generated by algorithm AGHF, rooted at 0~120.]
The problem of clustering involves partitioning a set of objects into groups, or clusters, so as to maximize the homogeneity within each group and the discrimination between groups; see [19], [43] and [13] for detailed discussions of clustering algorithms and their applications. By obtaining clusters we expect to discover the hidden structures of the data. Two types of clustering approaches are available in the literature: hierarchical and partitioning. The algorithms proposed in this section and the next are based on these two approaches, respectively.
As addressed before, we can only merge contiguous intervals (or nodes) to form a higher level node in a numerical hierarchy. If we take each interval as an object then, in the terminology of clustering, we confront the problem of clustering a set of objects with an order constraint (see [19] and [37]). For example, given the nonoverlapping intervals O_1 = [0, 2), O_2 = [2, 3), O_3 = [4, 7) and O_4 = [7, 9], we are actually given an order "<" defined as O_1 < O_2 < O_3 < O_4. During the clustering, object O_1 can be merged only with O_2; O_1 cannot be merged with O_3 without also involving O_2.
Some algorithms for clustering with an order constraint are developed in [19] and [37]. The algorithm we utilize in the automatic generation of concept hierarchies is outlined below; refer to Lebbe and Vignes [37] for a detailed discussion of the algorithm.
Assume that there is a set of N objects on which an order between the objects is also given. Thus the N objects are denoted by O = {o_i}_{i=1}^N, where the indices of the objects represent the order. A hierarchical clustering H̃ on the set O of N objects is defined as a set of clusters, that is, H̃ = {c_j}_{j=1}^M, where M is a positive integer. The core of Algorithm 4.2 is a dynamic program over the quality matrix Q = (q_{ij}) and the split matrix P = (p_{ij}), where c_{i,j} denotes the cluster of objects o_i, ..., o_j and j is determined by i and the group size k:

    for (each group size k) {
        for (each start position i, with j determined by i and k) {
            for (each candidate split position l between i and j) {
                if (q_{i,l} + q_{l+1,j} is the best value so far) {
                    q_{ij} := q_{i,l} + q_{l+1,j};
                    p_{ij} := l;
                } // end if
            } // end for l
            q_{ij} := q_{ij} + q′(c_{i,j});
        } // end for i
    } // end for k    □
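A minimal runnable sketch of this dynamic program follows. It is a simplified reading of the scheme above, assuming that the per-cluster quality q′ is the within-group sum of squared deviations of the bin frequencies (the thesis leaves the exact measure to the application); all names are illustrative:

    def dp_cluster(freq):
        """Binary hierarchical clustering of ordered bins (sketch of Algorithm 4.2)."""
        n = len(freq)

        def q0(i, j):
            # Cost of treating bins i..j as one cluster: sum of squared
            # deviations of the frequencies (an assumed quality measure).
            group = freq[i:j + 1]
            mean = sum(group) / len(group)
            return sum((f - mean) ** 2 for f in group)

        q = [[0.0] * n for _ in range(n)]   # q[i][j]: quality over bins i..j
        p = [[0] * n for _ in range(n)]     # p[i][j]: best split (after bin l)
        for span in range(1, n):            # increasing group sizes
            for i in range(n - span):
                j = i + span
                best = min(range(i, j), key=lambda l: q[i][l] + q[l + 1][j])
                p[i][j] = best
                q[i][j] = q[i][best] + q[best + 1][j] + q0(i, j)
        return q, p

    def tree(p, i, j):
        """Track back in P to recover the binary merge tree as nested tuples."""
        if i == j:
            return i
        return (tree(p, i, p[i][j]), tree(p, p[i][j] + 1, j))

    q, p = dp_cluster([5, 6, 1, 1, 7, 8])
    print(tree(p, 0, 5))    # nested merge structure of the six bins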
Generation Algorithm
After applying Algorithm 4.2 to a set of objects with an order constraint, we obtain two matrices, P and Q. The resultant clustering is formed by tracking back in matrix P. Because only two clusters are involved in each merge, the final clustering of the algorithm is a binary tree. This tree might be used directly as our concept hierarchy in data mining; however, such a concept hierarchy may have a large number of levels, and thus cannot make drill-down or roll-up focus on interesting results quickly. Moreover, such a concept hierarchy may need much more storage.
Usually, a parameter called the fan-out [32] is specified for a tree, and this parameter can be used in the generation of a desirable concept hierarchy. The algorithm presented below is based on the data clustering Algorithm 4.2 and a reconstruction of the clustering result such that the fan-out condition is satisfied for each node except the leaf nodes at the bottom level.
... of node A_{ki} in H_k.
4. Select a nonleaf node, say A′, from H_{k+1} which has the greatest quality among all the nodes in H_{k+1}; expand H_{k+1} by replacing A′ with its two children in H̃, and make the parent of A′ the parent of these two children; m_{k+1} := m_{k+1} + 1.

5. Repeat the above step until the fan-out condition is satisfied for each nonleaf node in H_k except those whose children are all leaf nodes of H̃.

6. If each node in H_{k+1} is a leaf node of H̃, stop; otherwise k := k + 1 and go to step 3. □
Theorem 4.2 The computational complexity of algorithm AGHC is O(n^3), where n is the number of bins of the given histogram.

Proof First of all, consider Algorithm 4.2. Since for each group of consecutive bins from bin i to bin j, where i = 1, 2, ..., n and j = i + 1, i + 2, ..., n, we have to examine (j - i) positions and perform a comparison, the time for detecting the best position is 2(j - i). The computation of the quality for this group takes time 4(j - i). Thus the time for processing this group is 6(j - i), and the total time for executing Algorithm 4.2 is

    \sum_{i=1}^{n} \sum_{j=i+1}^{n} 6(j - i) = 3 \sum_{i=1}^{n-1} (i^2 + i) = O(n^3).    (4.1)
Now, let us examine steps 2 through 6 of Algorithm 4.3. Since we need to perform F^j operations to form the nodes at level j, the total time for these steps is proportional to

    \sum_{j=0}^{\log_F n - 1} F^j = O(n).    (4.2)

Summing up (4.1) and (4.2), we conclude that the time complexity of algorithm AGHC is O(n^3). □
Example 4.3 Consider the histogram shown in Figure 4.1 for attribute A. Applying algorithm AGHC, we produce the concept hierarchy illustrated in Figure 4.3. □

[Figure 4.3: A concept hierarchy for attribute A generated by algorithm AGHC, rooted at 0~120.]
Partitioning clustering methods search for an optimized grouping of the data for a given number of groups, and may be a better way of finding structures in the data.
In this subsection, we present an algorithm for generating numerical hierarchies using partitioning clustering methods. A new quality measure is provided, based on the characteristics of our clustering problem, and examples are given to demonstrate the selection of quality measures.
As in the previous subsection, we assume that a histogram of a numerical attribute is given. The collection of the bins of the histogram is the set of objects on which the clustering algorithms are performed. Denote by f_i the frequency of bin o_i. Exploiting the order constraint on the set of objects, for a given number of clusters, say k, we try to find (k - 1) partition points such that the resultant k clusters have optimal quality. In the second round, the clustering method is applied to each of these k clusters. This procedure is repeated until each cluster has no more than a certain number of objects. Clearly, a hierarchical structure is built up during this iterative application of the partitioning clustering method.
Assume that the (k - 1) partition points are {o_{k_j}}_{j=1}^{k-1}. Define k_0 = 0 and k_k = n. The quality measures, or within-group similarities (WGS), for the jth group, which runs from point o_{k_{j-1}+1} to point o_{k_j}, that are widely employed in the literature (see for example [19]) include the following:

2. Information content:

    WGS_j = \sum_{i=k_{j-1}+1}^{k_j} f_i \log(f_i / m_j)    (4.4)
Algorithm 4.4 (Partition Clustering) Partition a set of n objects with an order constraint into k groups such that a certain quality measure is optimized.
Before presenting the clustering results obtained using different quality measures, we describe our algorithm for generating a numerical hierarchy for a numerical attribute for which a histogram, based on the distribution of this attribute in a database, is given.

1. Let S be the set of bins of the given histogram; S is associated with the top level node [min, max] of hierarchy H;

2. If |S| ≤ F, return; else apply Algorithm 4.4 to S to get F groups, denoted by S_t, t = 1, 2, ..., F. These F groups are used to form F nodes in the hierarchy H and are the children of the node associated with S;

3. Apply step 2 recursively to each of the groups S_t.
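A compact sketch of Algorithms 4.4 and 4.5 combined is given below. It assumes a variance-style within-group cost (the sum of squared deviations stands in for measure (4.5), whose exact form is not reproduced here), equal-width input bins, and an exhaustive search over the (F - 1) cut points; all names are illustrative:

    from itertools import combinations

    def wgs(freq):
        """Within-group cost: sum of squared deviations (assumed variance measure)."""
        mean = sum(freq) / len(freq)
        return sum((f - mean) ** 2 for f in freq)

    def best_partition(freq, k):
        """Algorithm 4.4 (sketch): split ordered bins into k contiguous groups
        minimizing total within-group cost, by exhaustive search over cuts."""
        n = len(freq)
        def cost(cuts):
            bounds = [0, *cuts, n]
            return sum(wgs(freq[a:b]) for a, b in zip(bounds, bounds[1:]))
        cuts = min(combinations(range(1, n), k - 1), key=cost)
        bounds = [0, *cuts, n]
        return list(zip(bounds, bounds[1:]))          # (start, end) index pairs

    def agpc(freq, lo, hi, fanout):
        """Algorithm 4.5 (sketch): recursively partition the bins into F groups
        per node until a node holds no more than F bins."""
        node = {"range": (lo, hi), "children": []}
        if len(freq) <= fanout:                       # step 2: stop condition
            return node
        width = (hi - lo) / len(freq)                 # assumes equal-width bins
        for a, b in best_partition(freq, fanout):     # step 2: form F children
            node["children"].append(
                agpc(freq[a:b], lo + a * width, lo + b * width, fanout))
        return node                                   # step 3 via the recursion

    hist = [2, 3, 9, 8, 1, 1, 7, 6, 2]                # a small multi-modal histogram
    print(agpc(hist, 0, 90, 3))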
As we pointed out before, different quality measures may produce different results. The following example illustrates the effect of the selection of different within-group similarity measures when applying Algorithm 4.5 to a particular data distribution.

Example 4.4 Again, let us consider the attribute A with the histogram shown in Figure 4.1. Applying Algorithm 4.5 to this data distribution using the within-group similarity measures given in (4.3), (4.4) and (4.5), we obtain three concept hierarchies with F = 3, shown in Figures 4.4, 4.5 and 4.6.
[Figure 4.4: A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.3).]
It is easy to see from the histogram (Figure 4.1) that there are three modes in the data distribution. The boundary bins are (30~40) and (70~80). Clearly, the hierarchy shown in Figure 4.6, which is generated by Algorithm 4.5 using within-group similarity measure (4.5), captures the structure of the data. However, the hierarchies shown in Figures 4.4 and 4.5, which are produced by the same algorithm using within-group similarity measures (4.3) and (4.4) respectively, distort the structure. Indeed, looking at level 1 of these two hierarchies, the hierarchy displayed in Figure 4.4 merges the first two modes together, whereas the hierarchy shown in Figure 4.5 fails to separate the last two modes. The effect illustrated here leads us to choose the variance quality as the within-group similarity measure in the generation of numerical concept hierarchies using Algorithm 4.5. □

[Figure 4.5: A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.4).]

[Figure 4.6: A concept hierarchy for attribute A generated by Algorithm 4.5 using WGS (4.5).]
Using the variance quality as the measure in algorithm AGPC, we have the following complexity result.

Theorem 4.3 The worst case computational complexity of algorithm AGPC is O(n^3) and its best case computational complexity is O(n^2), where n is the number of bins of the given histogram for attribute A.
Proof Assume that there are m nodes at level j of the hierarchy. These m nodes correspond to m groups of bins in the given histogram. Denote by n_i the number of bins in the ith group, i = 1, 2, ..., m. It is not difficult to see that to split the ith group into F subgroups, we have to perform kF(n_i - F)(5n_i + 1) operations, where k is the number of iterations needed to find the best boundary bins. Thus to construct the nodes at level j + 1 we have to take time

    \sum_{i=1}^{m} kF(n_i - F)(5 n_i + 1) = a \sum_{i=1}^{m} n_i^2 + b \sum_{i=1}^{m} n_i + cm

for some constants a, b and c. Since \sum_i n_i = n, this is O(n^2) per level; in the worst case the hierarchy has O(n) levels, which gives O(n^3) in total.

In the best case, there are (\log_F n) levels in the hierarchy, and the time needed for generating level (j + 1) from level j is

    k F^{j+1} (n / F^j - F)(5n / F^j + 1).

By summation, we conclude that the total time for the best case is O(n^2). □
Comparison of Quality
Notice that the concept hierarchies shown in Figures 4.2, 4.3 and 4.6 are generated by applying algorithms AGHF, AGHC and AGPC, respectively, to the same input histogram, given in Figure 4.1. Obviously, the hierarchy generated by AGHF (Figure 4.2) does not capture the structure of the data: the effort of balancing the counts, or frequencies, of the nodes at each level makes the algorithm ignore the modes in the distribution entirely. The simplicity and efficiency of the algorithm are nevertheless attractive in certain situations, such as when the data distribution is approximately uniform or when there are too many modes in the distribution.
The concept hierarchy generated by algorithm AGHC is good in the sense that the hidden structure of the data is reasonably represented by the hierarchy (Figure 4.3). Comparing Figure 4.3 with Figure 4.6, which is produced by algorithm AGPC, we notice that the difference occurs only at the boundary bins. In most applications it does not make much difference whether a boundary bin is included in its left-hand group or its right-hand group; thus we consider the two hierarchies shown in Figures 4.3 and 4.6 to have the same quality.
Now, we consider another input histogram, shown in Figure 4.7, which is an extension of that shown in Figure 4.1: some perturbations have been added in the third mode. Executing algorithms AGHC and AGPC, we obtain the two concept hierarchies shown in Figures 4.8 and 4.9, respectively.

[Figure 4.7: Another histogram for attribute A, over the range 0-140.]

As we can see from level 2 of the two hierarchies, both algorithms successfully detect the three modes in the histogram (Figure 4.7), even though noisy data have been added to the third mode, and all the branches in the hierarchies correspond to reasonable boundary bins. Both algorithms are robust, because the perturbations in the third mode do not prevent them from capturing the overall structure of the data.
[Figure 4.8: A concept hierarchy for attribute A generated by algorithm AGHC with input histogram given in Figure 4.7.]

[Figure 4.9: A concept hierarchy for attribute A generated by algorithm AGPC with input histogram given in Figure 4.7.]
Based on our testing of the two algorithms AGHC and AGPC, we conclude that both are robust and that in most cases they produce very similar concept hierarchies.
[Figure 4.10: Comparison of execution time (AGHF, AGHC, AGPC) against number of bins when the fan-out is 3.]

[Figure 4.11: Comparison of execution time against number of bins when the fan-out is 5.]
The execution times of algorithms AGHC and AGPC are, in most cases, very close. Therefore the number of bins of the input histogram and the fan-out can be used to determine which algorithm to use. Based on our experiments, Table 4.1 was obtained and may be used as guidance for the selection of the algorithms.
To utilize Table 4.1, we check the input fan-out, say 8, and look up its corresponding number of bins in Table 4.1, in this case 170. If the number of bins of the input histogram is less than 170, then we choose algorithm AGHC to perform the automatic generation of a concept hierarchy; otherwise, we select algorithm AGPC.
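This lookup amounts to a simple threshold test. In the sketch below, only the entry for fan-out 8 (break-even at 170 bins) comes from the text above; the other entries are placeholders that would be read off Table 4.1:

    # Fan-out -> break-even number of bins. Only the pair (8, 170) is quoted
    # in the text; fill in the remaining values from Table 4.1.
    BREAK_EVEN = {3: None, 5: None, 8: 170}

    def choose_algorithm(fanout, num_bins):
        threshold = BREAK_EVEN.get(fanout)
        if threshold is None:
            raise ValueError("consult Table 4.1 for this fan-out")
        return "AGHC" if num_bins < threshold else "AGPC"

    print(choose_algorithm(8, 120))   # AGHC: small histogram
    print(choose_algorithm(8, 400))   # AGPC: large histogram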
Ideally, all the nodes corresponding to the modes of a distribution should be presented at the same level of the generated hierarchy. However, the number of modes is not known a priori. Users can get some idea of this number by visually observing the given histogram, but the histogram may include some noise or distortion. In addition, even if the number of modes is known, there is no simple way to guarantee that all the nodes corresponding to those modes are produced at the same level of the hierarchy. These and other unclear features of the generation algorithms make it difficult to judge the qualities of the generated hierarchies.
Chapter 5
Techniques of Implementation
In this chapter, we discuss the implementation of concept hierarchies. A relational table strategy for storing concept hierarchies is employed in section 5.1, in consideration of the fact that we are discovering knowledge from relational databases and that, as an important kind of background knowledge, concept hierarchies should be a natural component of the data sources. To incorporate concept hierarchies into a data mining system, encoding plays a key role; a generic encoding algorithm is developed in section 5.2. By "generic" we mean that the encoded hierarchies can be used by any data mining module in which concept generalization is involved. A performance comparison between encoded and non-encoded hierarchies is conducted with respect to storage requirements and disk access time, and the superior performance of our encoding approach on both factors is demonstrated in section 5.3. Finally, we summarize the chapter in section 5.4.
One straightforward method is to store all concept hierarchies in a single relational table, in which there is a concept id (cid for short) for each node in a hierarchy. In a typical tuple of the table, we record a node's cid, its name, its parent's cid and other useful information. This approach is also widely used in other OLAP systems [49].
The advantages of such methods are that each child-parent relationship can be directly represented by a tuple of a relational table, and that the contents of all the hierarchies can be put in one table, so all the hierarchies can be handled uniformly. However, once we need several dimensions organized as hierarchies to generate a data cube for executing data mining tasks, the disk space consumed may be very large and the disk access time may be very long, because a disk access is required for each concept generalization.
In order to handle large databases and large numbers of dimensions, and to manipulate data cubes efficiently, we adopt the following approach: each hierarchy has its own table (called its hierarchy table) for storing its contents, and each tuple of the table records a path of the hierarchy from the root to a leaf node. The reason for using different tables for different hierarchies is that different hierarchies may have different numbers of levels, so the lengths of root-leaf paths may differ between hierarchies. The advantages of this method will be addressed in the next two sections.

Example 5.1 For the concept hierarchy shown in Example 3.3 (Figure 3.2), the hierarchy table is shown in Table 5.1. Notice that a default top level allLocation, which has the single node ANY, is added to the hierarchy in order to guarantee that the hierarchy is regular. It is not really necessary for this particular hierarchy, because at the country level there is only one node, Canada; however, adding a default top level is a uniform method that can handle any hierarchy. □
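In code, a hierarchy table of this kind is simply one row per root-leaf path. A minimal sketch, with column and concept names following Example 5.1 (only a few provinces shown, and the lookup helper is illustrative):

    # Hierarchy table for location: one tuple per root-leaf path, with columns
    # ordered from the default top level down to the leaf level.
    location = [
        # (allLocation, country, region, province)
        ("ANY", "Canada", "Western",  "BC"),
        ("ANY", "Canada", "Western",  "AB"),
        ("ANY", "Canada", "Central",  "ON"),
        ("ANY", "Canada", "Maritime", "NS"),
        # ... remaining provinces
    ]

    def generalize(leaf, to_level):
        """Generalize a leaf concept by following its path in the table."""
        for path in location:
            if path[-1] == leaf:
                return path[to_level]

    print(generalize("ON", 2))   # 'Central', the region-level correspondence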
This relational table strategy for storing concept hierarchies can also be used to solve the concept duplication problem frequently encountered in date/time hierarchies.

Remark 5.1 (On date/time hierarchies) In §4.1.2, we mentioned the problem that arises when the attribute week is included in a date/time concept hierarchy. For example, week 27 may cross June and July: once we need to generalize the concept week 27 to the month level, which month should we take as its higher level correspondence, June or July? This problem is naturally solved by our relational table approach, as the following example explains.
Example 5.2 Table 5.2 gives a hierarchy table which is instantiated using the schema
hierarchy allDate → year → month → week → day defined on relation title in database
pubs, a sample database in MS SQL Server. It can be seen that W27 1991
crosses two months, i.e., Jun 1991 and Jul 1991. During the concept generalization of
W27 1991, we only need to follow the paths specified by the two different tuples, in
this case the second and third tuples, and find its higher level correspondences. So
Jun 1991 is the parent of the first W27 1991 and Jul 1991 is the parent of the second
W27 1991. In this way, confusion never occurs, because each raw data value has
its own root-leaf path to follow. □
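To make the path-per-tuple design concrete, the following sketch in Python (with
invented sample values modeled on Table 5.2) shows how a week that spans two months
appears in two tuples, each carrying its own parent chain, so that generalization is
always unambiguous:

# Each tuple records one root-leaf path (year, month, week, day);
# the rows are invented.  W27_1991 spans two months, so it appears
# in two tuples with different month ancestors.
hier_table = [
    ("1991", "Jun_1991", "W26_1991", "Jun_25_1991"),
    ("1991", "Jun_1991", "W27_1991", "Jun_30_1991"),
    ("1991", "Jul_1991", "W27_1991", "Jul_01_1991"),
    ("1991", "Jul_1991", "W28_1991", "Jul_05_1991"),
]

def generalize(day, to_level):
    """Generalize a raw day value to a higher level (0 = year ... 3 = day)
    by following the unique tuple that contains it."""
    for path in hier_table:
        if path[3] == day:
            return path[to_level]
    raise KeyError(day)

print(generalize("Jun_30_1991", 1))  # -> Jun_1991
print(generalize("Jul_01_1991", 1))  # -> Jul_1991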
Finally, it is worth noticing that, to achieve the goal of efficient access, appropriate
indices are created on the tables described above. The full advantage of our relational
table approach for storing hierarchies is gained when it is combined with the hierarchy
encoding strategy presented in the next section.
5.2 Hierarchy Encoding
If concepts are carried through the data cubes in their raw string form, the cubes become
large, and in this case we need to spend a lot of time processing mining tasks
because of memory page swapping. The hierarchy encoding strategy is introduced to
tackle this problem. We attempt to encode a concept hierarchy in such a way that
the partial order of the hierarchy is exactly represented by the codes, so that we only
need to manipulate the codes when we process mining tasks. Access to the stored
concept hierarchy is only needed when we want to create a data cube and to display
a mining result once a mining task is fulfilled.
Figure 5.1: A sample concept hierarchy with levels store, department, category and item.
5.2.1 Algorithm
A new hierarchy encoding algorithm, which can be treated as a general-purpose encoding
strategy suitable for any data mining functionality, is proposed in this subsection.
The main idea is to assign each node (or concept) of a hierarchy a unique
binary code which consists of j fields, where j is the level number of the node in
the hierarchy. Once a hierarchy is encoded, we only need to retrieve the codes of
the hierarchy into memory and realize generalization and specialization by manipulating
the codes. The performance analysis discussed in the next section will clearly
demonstrate the advantage of our hierarchy encoding scheme.
To describe our encoding algorithm, let us first introduce some notation.
Denote by $\{P_i\}_{i=1}^{m}$ the set of all the distinct root-leaf paths in the hierarchy $H$,
and let
$$P_i = (a_{i,0}, a_{i,1}, \ldots, a_{i,l-1}), \qquad i = 1, 2, \ldots, m,$$
where $a_{i,j}$ is the $j$th node, corresponding to the $j$th level of the hierarchy, on the
path $P_i$. The encoding algorithm is described as follows.
Input: A concept hierarchy $H$ whose set of root-leaf paths $\{P_i\}_{i=1}^{m}$ is sorted
in ascending order, together with the maximum number of siblings, denoted by $s_j$,
for each level $j = 0, 1, \ldots, (l-1)$.
Output: A set of binary codes which are assigned to the root-leaf paths of hierarchy
$H$.
Method: Set $i := 1$, let the dummy path $P_0$ differ from $P_1$ at every level, and
initialize the code fields $c_j$ for $j = 0, 1, \ldots, (l-1)$, where each binary number $c_j$
has $\lceil \log_2(s_j + 1) \rceil$ bits. Then execute:
while (i ≤ m) {
    for (j = 0; j < l; j++) {
        if (a_{i,j} ≠ a_{i−1,j}) {
            c_j := c_j + 1;
            assign code c = c_0 c_1 ... c_j to node a_{i,j};
            for (k = j + 1; k < l; k++) {
                c_k := 1;
                assign code c = c c_k to node a_{i,k}; }
            j := l − 1; } }
    i := i + 1; } □
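The algorithm can be realized directly; the sketch below (in Python, following the
reconstruction of the pseudocode above, with the hierarchy of Figure 5.1 given invented
node names) computes the field widths from the maximum sibling counts and assigns
the codes:

from collections import defaultdict
from math import ceil, log2

def encode_hierarchy(paths):
    """A sketch of encoding algorithm 5.1: paths is the set of root-leaf
    paths, each a tuple (a_0, ..., a_{l-1}); returns a dictionary that
    maps every node, identified by its path prefix, to its code string."""
    paths = sorted(paths)
    l = len(paths[0])
    # s_j = maximum number of siblings at level j (s_0 counts the roots)
    s = []
    for j in range(l):
        children = defaultdict(set)
        for p in paths:
            children[p[:j]].add(p[j])
        s.append(max(len(c) for c in children.values()))
    bits = [ceil(log2(sj + 1)) for sj in s]        # field width per level

    codes, counters, prev = {}, [0] * l, None
    for p in paths:
        # first level at which this path leaves the previous one
        j = 0 if prev is None else next(
            (t for t in range(l) if p[t] != prev[t]), l)
        if j < l:
            counters[j] += 1                       # new node at level j
            for k in range(j + 1, l):
                counters[k] = 1                    # restart deeper counters
            for k in range(j, l):                  # concatenate the fields
                codes[p[:k + 1]] = "".join(
                    format(counters[t], "0%db" % bits[t]) for t in range(k + 1))
        prev = p
    return codes

# The hierarchy of Figure 5.1, with invented node names:
paths = [("store", "D1", "C1", i) for i in ("i01", "i02", "i03")] \
      + [("store", "D1", "C2", i) for i in ("i04", "i05")] \
      + [("store", "D2", "C3", i) for i in ("i06", "i07", "i08")] \
      + [("store", "D2", "C4", i) for i in ("i09", "i10")]
codes = encode_hierarchy(paths)
print(codes[("store", "D1")])               # -> 101, as in Figure 5.2
print(codes[("store", "D1", "C1", "i01")])  # -> 1010101
print(codes[("store", "D2", "C4", "i10")])  # -> 1101010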
Example 5.3 Applying the above algorithm to the concept hierarchy shown in Figure
5.1, we get the encoded hierarchy demonstrated in Figure 5.2. □
5.2.2 Properties
For the computational complexity of encoding algorithm 5.1, note that the while loop
scans the $m$ sorted root-leaf paths once and the inner loops touch each of the $l$ fields
at most a constant number of times per path, so the algorithm runs in $O(ml)$ time.
The following lemma relates the codes to the child-parent relationships of the hierarchy.

Figure 5.2: The encoded hierarchy of Figure 5.1 (department codes 101 and 110; item
codes 1010101, 1010110, 1010111, 1011001, 1011010, 1100101, 1100110, 1100111,
1101001 and 1101010).
Lemma 5.1 For any two nodes $A$ and $B$ with codes $c_{A,0} c_{A,1} \cdots c_{A,i}$ and $c_{B,0} c_{B,1} \cdots c_{B,j}$,
respectively, $A$ is a child of $B$ if and only if $i = j + 1$ and $c_{A,k} = c_{B,k}$ for $k = 0, 1, \ldots, j$.
Proof If $A$ is a child of $B$, then the code for $A$ has one more field than that of $B$, and the
code of $A$ is formed by concatenating a binary number to that of $B$; thus $i = j + 1$
and $c_{A,k} = c_{B,k}$ for $k = 0, 1, \ldots, j$.
On the other hand, suppose $i = j + 1$ and $c_{A,k} = c_{B,k}$ for $k = 0, 1, \ldots, j$, but $A$ is not a child
of $B$; we derive a contradiction. We only need to consider the situation
where $A$ and $B$ are not at the same level, since otherwise we would have $i = j$, which is
an obvious contradiction to $i = j + 1$. Since $A$ is not a child of $B$, either they are on
the same root-leaf path or they are on two different root-leaf paths. Let
us first consider the same-path case. Since $A$ is not a child of $B$, according to the
algorithm, the code for $A$ has either at least two more fields or fewer fields
than that of $B$; in other words, $i \geq j + 2$ or $i < j$. This contradicts $i = j + 1$.
Now let us consider the case where $A$ and $B$ are on two different root-leaf paths and
$i = j + 1$. In this case, $A$'s parent $P$, with code, say, $c_{P,0} c_{P,1} \cdots c_{P,j}$, is at the same level
as $B$. Now $P \neq B$ since $A$ and $B$ are on different paths, so there must be at least one
$t$ with $0 \leq t \leq j$ such that $c_{B,t} \neq c_{P,t}$. Since the code for $A$ is formed by concatenating
$c_{P,0} c_{P,1} \cdots c_{P,j}$ with another binary number, we have $c_{A,k} = c_{P,k}$ for $k = 0, 1, \ldots, j$, and
thus $c_{B,t} \neq c_{A,t}$, which contradicts the assumption that $c_{A,k} = c_{B,k}$ for $k = 0, 1, \ldots, j$. □
From this lemma, it is easy to see that the code for the parent of a node at level
$j$ with code $c_0 c_1 \cdots c_j$ can be formed by dropping the partial code $c_j$ corresponding to
level $j$, giving $c_0 c_1 \cdots c_{j-1}$. Based on this property, we only need to store the codes of
the leaf nodes. The code of any other node can be obtained by simply chopping off the
code of one of its descendant leaf nodes at the appropriate level.
The following theorem reveals the relationship between the partial order of a hi-
erarchy and its codes.
Theorem 5.2 Given the partial order $\prec$ of hierarchy $H$ and the codes obtained by
applying algorithm 5.1, we have, for any pair of nodes $A$ and $B$ with codes $c_{A,0} c_{A,1} \cdots c_{A,i}$
and $c_{B,0} c_{B,1} \cdots c_{B,j}$, respectively, $A \prec B$ if and only if $i > j$ and $c_{A,k} = c_{B,k}$ for
$k = 0, 1, \ldots, j$.
Proof $A \prec B$ if and only if $B$ is an ancestor of $A$. The theorem follows by repeatedly
applying Lemma 5.1. □
According to this theorem, we can realize the manipulations of a concept hierarchy
using only its codes. This is the basis of executing concept generalization and
specialization in our data mining system.
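Given Lemma 5.1 and Theorem 5.2, generalization and the ancestor test reduce to
simple operations on code strings; a minimal sketch (field widths as in the example
above, and assuming every code ends on a field boundary):

bits = [1, 2, 2, 2]   # field widths of the example hierarchy

def generalize_code(code, bits, to_level):
    """Drop the fields below to_level (by Lemma 5.1, dropping the last
    field yields the parent's code)."""
    return code[:sum(bits[:to_level + 1])]

def is_ancestor(code_b, code_a):
    """Theorem 5.2: B is an ancestor of A iff B's code is a proper
    prefix of A's code (both codes ending on field boundaries)."""
    return len(code_b) < len(code_a) and code_a.startswith(code_b)

item = "1011010"                        # an item-level code of Figure 5.2
print(generalize_code(item, bits, 1))   # department level -> 101
print(is_ancestor("101", item))         # -> True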
5.2.3 Remarks
In the input statement of the encoding algorithm 5.1, we posed the requirements that
the set of root-leaf paths is sorted and that the maximum number of siblings at each
level is available. These requirements can be achieved using the methods discussed in
the remarks below.
Remark 5.2 As discussed in the previous section, the content of a hierarchy is stored
in a relational table, so the requirement that the set of root-leaf paths be sorted in
the input of the above algorithm is easy to implement with a SQL query that specifies
an attribute list in an ORDER BY clause. This use of SQL allows us to avoid coding
sorting algorithms and, together with the indices created on the hierarchy table, to
obtain efficient execution. For example, assuming that we have level names a0, a1, ...,
and a(l−1), and the hierarchy table is hierTable, then a SQL query of the following
form realizes the sorting task and stores the result in table tempHierTable.

SELECT *
INTO tempHierTable
FROM hierTable
ORDER BY a0, a1, ..., a(l-1)
Remark 5.3 To satisfy the second requirement, that the maximum number of siblings
s_j for each level j = 0, 1, ..., (l−1) be calculated, we need to execute several SQL queries
and introduce a couple of auxiliary tables. The solution is detailed in the following
algorithm.
Input. A concept hierarchy H whose hierarchy table is hierTable and whose level names
are a0, a1, ..., and a(l−1).
Output. The maximum number of siblings s_j for each level j.
Method. For each level j = 0, 1, ..., (l−1), execute the following queries, the last of
which returns s_j (the grouping degenerates to a plain count when j = 0):

1. SELECT DISTINCT a0, ..., aj
   INTO tempStats1
   FROM hierTable

2. SELECT COUNT(*) AS theCount
   INTO tempStats2
   FROM tempStats1
   GROUP BY a0, ..., a(j-1)

3. SELECT MAX(theCount)
   FROM tempStats2                                                      □
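These steps can be checked with any SQL engine; the following self-contained sketch
uses Python's built-in sqlite3 module (SQLite's CREATE TABLE ... AS plays the role
of SELECT ... INTO, and the sample rows are invented):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hierTable (a0 TEXT, a1 TEXT, a2 TEXT)")
con.executemany("INSERT INTO hierTable VALUES (?, ?, ?)",
                [("ANY", "Western", "BC"), ("ANY", "Western", "AB"),
                 ("ANY", "Central", "ON"), ("ANY", "Central", "QC"),
                 ("ANY", "Maritime", "NS")])

def max_siblings(level_cols):
    """s_j for the last level in level_cols: count level-j nodes per
    level-(j-1) parent and take the maximum (a plain count when j = 0)."""
    con.execute("DROP TABLE IF EXISTS tempStats1")
    con.execute("CREATE TABLE tempStats1 AS SELECT DISTINCT %s FROM hierTable"
                % ", ".join(level_cols))
    group = " GROUP BY " + ", ".join(level_cols[:-1]) if len(level_cols) > 1 else ""
    con.execute("DROP TABLE IF EXISTS tempStats2")
    con.execute("CREATE TABLE tempStats2 AS SELECT COUNT(*) AS theCount "
                "FROM tempStats1" + group)
    return con.execute("SELECT MAX(theCount) FROM tempStats2").fetchone()[0]

print(max_siblings(["a0"]))              # s_0 = 1
print(max_siblings(["a0", "a1"]))        # s_1 = 3
print(max_siblings(["a0", "a1", "a2"]))  # s_2 = 2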
If we replace the maximum numbers of siblings with the numbers of nodes, still
denoted by s_j, j = 0, 1, ..., (l−1), the method of algorithm 5.1 can be executed
for hierarchy encoding without any modification. Although s_j, j = 0, 1, ..., (l−1),
might be larger in the case of numbers of nodes, the field widths $\lceil \log_2(s_j + 1) \rceil$,
j = 0, 1, ..., (l−1), are expected to be not much larger than the corresponding widths
in the case of maximum numbers of siblings. Hence, it is feasible to employ this simple
approach in Step 2 of algorithm 5.1.
Remark 5.4 Because each tuple in the hierarchy table represents a root-leaf path in
H, and the codes generated for the leaf nodes are associated with exactly these paths,
we can record the codes by adding one more attribute (column), say code, to the
hierarchy table. An index can also be created on this attribute in order to access the
codes of the hierarchy efficiently.
Example 5.4 After applying encoding algorithm 5.1, hierarchy table 5.1 becomes
Table 5.3, where the data type of attribute code is binary. Since one byte of binary
data is displayed as a group of two characters, the values of code look like hexadecimal
data, but they are in fact bit patterns. For example, 6A actually stands for 01101010.
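This byte-level display can be reproduced with a few lines; a small sketch (right-padding
the code to whole bytes is an assumption of this illustration):

def code_to_hex(bits):
    """Pack a binary code string into whole bytes (right-padded with
    zeros) and display it two hex characters per byte."""
    padded = bits.ljust(-(-len(bits) // 8) * 8, "0")
    return bytes(int(padded[i:i + 8], 2)
                 for i in range(0, len(padded), 8)).hex().upper()

print(code_to_hex("01101010"))  # -> 6A, as in the example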
5.3 Performance Comparison
To evaluate the encoding strategy, we compare the following three approaches for
storing concept hierarchies.
cName  pName
BC     Western
AB     Western
MB     Western
SK     Western
ON     Central
QC     Central
NS     Maritime
NB     Maritime
NF     Maritime
PE     Maritime

cName     pName
Western   Canada
Central   Canada
Maritime  Canada

Table 5.4: Hierarchy tables for approach A
Approach A: without encoding. Use a collection of several tables for storing one
concept hierarchy, in which the real concepts are used as the join key.
A concept hierarchy consists of several relations, each of which is a map table
from a lower level to its next higher level. For example, the hierarchy location
shown in Figure 3.1 is stored using the two tables shown in Table 5.4.
Approach B: without encoding. Use a collection of several map tables for storing
a hierarchy, in which a concept identifier is used as the join key.
Adopted by typical OLAP systems (see [49]), this approach is similar to approach
A, but, instead of using the real concept name as the join key between tables, a
unique integer identifier is assigned to each node for the purpose of table joins.
Again we use the hierarchy location to illustrate the idea. The collection of the
three tables in Table 5.5, each with schema (cID, cName, pID), gives the whole
hierarchy.
Approach C: with encoding. Use one relational table for each concept hierarchy.
This is the approach we employed in our implementation. An example is given
in Table 5.3.
Before proceeding to the comparison of storage requirement and disk access time,
we state the assumptions and notation used in the analysis below. First, for a typical
concept hierarchy, say hierarchy $H_i$, we denote by $l_i$ its number of levels, $F_i$ its fan-out,
$n_{ij}$ its number of nodes at level $j$ and $s_{ij}$ its maximum number of siblings at level $j$, for
$j = 0, 1, \ldots, (l_i - 1)$. We assume that each concept in this hierarchy is represented by a
character string of length $R_i$ bytes. Second, we assume that B$^{+}$-tree indices have
been built on the related attributes of the relational tables storing the concept hierarchies,
and that a node of a B$^{+}$-tree just fits one page of size $B$ bytes in disk storage.
Hence, if the sizes of a search key value and a pointer in a B$^{+}$-tree are $L$ bytes and
$P$ bytes, respectively, the number of search key values in a node of the tree is $(k - 1)$
and the number of pointers in a node of the tree is $k$, where $k = \lfloor (B + L)/(P + L) \rfloor$.
Therefore, the number of levels of a B$^{+}$-tree with $N$ search keys is at most
$1 + \lceil \log_{k-1} N \rceil$.
There are $n_{i,(l_i-1)}$ tuples in the encoded hierarchy table and each tuple is of size
$(l_i R_i + L_i)$ bytes, where $L_i$ denotes the length of a code.
Now, let us consider the storage requirement for a typical least generalized data
cube. Suppose there are $d$ dimensions in the data cube, each of which is organized
as a hierarchy. We also assume that the measurements of the data cube require $m$
bytes of storage in each cube cell.
Since there are in total $\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)$ cube cells in the least generalized cube,
we conclude that the storage requirements for approaches A, B and C are, respectively,
$$SC_A = \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} R_i\Big) \qquad (5.5)$$
$$SC_B = \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\left(m + dI\right) \qquad (5.6)$$
and
$$SC_C = \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} L_i\Big) \qquad (5.7)$$
bytes, where $I$ denotes the size in bytes of an integer identifier and $L_i$ is given by (5.3).
Summing up the above analysis, especially equations (5.1)-(5.7), we have the
following
Theorem 5.3 The storage requirements of both the $d$ concept hierarchies and the
corresponding least generalized data cube consisting of $d$ dimensions organized as the
hierarchies, for approaches A, B and C, denoted by $S_A$, $S_B$ and $S_C$, are respectively
$$S_A = \sum_{i=1}^{d}\sum_{j=1}^{l_i-1} 2\, n_{ij} R_i + \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} R_i\Big) \qquad (5.8)$$
$$S_B = \sum_{i=1}^{d}\sum_{j=0}^{l_i-1} (R_i + 2I)\, n_{ij} + \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\left(m + dI\right) \qquad (5.9)$$
$$S_C = \sum_{i=1}^{d} n_{i,(l_i-1)} (l_i R_i + L_i) + \Big[\prod_{i=1}^{d}\left(n_{i,(l_i-1)} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} L_i\Big) \qquad (5.10)$$
□
In the special case of $n_{ij} = F_i^{\,j}$ for $i = 1, 2, \ldots, d$ and $j = 0, 1, \ldots, (l_i - 1)$, we can
simplify the above formulas to
$$S_A = \sum_{i=1}^{d} \frac{2 R_i (F_i^{\,l_i} - 1)}{F_i - 1} + \Big[\prod_{i=1}^{d}\left(F_i^{\,l_i-1} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} R_i\Big) \qquad (5.11)$$
$$S_B = \sum_{i=1}^{d} \frac{(R_i + 2I)(F_i^{\,l_i} - 1)}{F_i - 1} + \Big[\prod_{i=1}^{d}\left(F_i^{\,l_i-1} + 1\right)\Big]\left(m + dI\right) \qquad (5.12)$$
$$S_C = \sum_{i=1}^{d} F_i^{\,l_i-1} (l_i R_i + L_i) + \Big[\prod_{i=1}^{d}\left(F_i^{\,l_i-1} + 1\right)\Big]\Big(m + \sum_{i=1}^{d} L_i\Big) \qquad (5.13)$$
where $L_i = \left[1 + (l_i - 1)\lceil \log_2 (F_i + 1) \rceil\right]/8$.
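For reproducibility, equations (5.11)-(5.13) can be evaluated directly; in the sketch
below the identifier size I and the measurement size m, which are not preserved in
this copy, are set to illustrative values of 4 bytes each:

from math import ceil, log2

def storage(d, R, l, F, I=4, m=4):
    """Evaluate S_A, S_B, S_C of equations (5.11)-(5.13) for d identical
    hierarchies of l levels, fan-out F and concept length R bytes."""
    L = (1 + (l - 1) * ceil(log2(F + 1))) / 8  # code length L_i in bytes
    cube = (F ** (l - 1) + 1) ** d             # least generalized cube cells
    SA = d * 2 * R * (F ** l - 1) / (F - 1) + cube * (m + d * R)
    SB = d * (R + 2 * I) * (F ** l - 1) / (F - 1) + cube * (m + d * I)
    SC = d * F ** (l - 1) * (l * R + L) + cube * (m + d * L)
    return SA, SB, SC

for d in (2, 3, 4, 5):                         # cf. the first setting below
    SA, SB, SC = storage(d, R=20, l=4, F=6)
    print(d, round(1 - SC / SA, 2), round(1 - SC / SB, 2))  # C's savings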
We now compare the storage requirements of the three approaches numerically in
several settings.
1. We vary the number of dimensions from which a data cube is built, with the
other parameters fixed at R = 20, l = 4, F = 6. Figure 5.3 is plotted using a
linear scale for the x-axis and a logarithmic scale for the y-axis. As shown in
the figure, the required disk space increases exponentially with the number of
dimensions for each of the approaches, and approach C needs less space than
the other two. For each number of dimensions, the encoding approach saves
more than 80% and 36% of the space required by approaches A and B,
respectively.

Figure 5.3: Total disk space (logarithmic scale) versus the number of dimensions for
approaches A, B and C.
2. We change the number of levels of the concept hierarchies, with the other
parameters fixed at d = 3, R = 20, F = 6. From the semilog plot of the total
disk space in Figure 5.4, we can see that the required space again increases
exponentially with the number of levels for each approach, and approach C
needs the least space among the three methods: it saves more than 84% and
37% of the space needed by approach A and approach B, respectively.

Figure 5.4: Total disk space (logarithmic scale) versus the number of levels for
approaches A, B and C.
3. We change the fan-out of the concept hierarchies from 2 to 10 and fix the other
parameters at d = 3, R = 20, l = 5. Figure 5.5 shows, again, that the encoding
approach is the best among the three; it saves about 84% and 38% of the space
needed by approach A and approach B, respectively, when the fan-out is no
larger than 8. The degree of space saving decreases as the fan-out increases.
Notice that the number of leaf nodes of each hierarchy also increases in this
case.
Figure 5.5: Total disk space versus fan-out for approaches A, B and C.
4. We vary the average length of the concept strings from 5 to 30 with the other
parameters fixed. As Figure 5.6 shows, the spaces needed for approaches B and
C increase so slowly that the changes can hardly be detected in the figure,
while the linear growth of approach A is obvious. The conclusion we can draw
from this observation is that the concept length has little effect on the length
of the codes in approach C, and that the space required for storing the data
cube dominates the total space in all three approaches. Again, approach C is
the best: it saves from 70% to 89% of the space relative to approach A as the
concept length varies from 10 to 30, while approach B is 37% better than
approach A in every case.

Figure 5.6: Total disk space versus the average length of concepts for approaches A,
B and C.
Figure 5.7: Storage comparison when the number of leaf nodes in the hierarchies is
fixed.
5. Finally, we fix the number of leaf nodes in each hierarchy and vary the fan-out
(and hence the number of levels). Figure 5.7 shows that the total storage is
not sensitive to changes in the fan-out and the number of levels while the
number of leaf nodes is fixed, which indicates that the overall storage is
dominated by the number of leaf nodes. Again, the encoding approach
(approach C) is the best among the three methods; it is about 61% and 19%
better than approaches A and B, respectively, when the fan-out is greater
than 2. The curve for approach C also indicates how to choose a reasonable
fan-out: too large a fan-out makes the number of levels too small, and too
small a fan-out yields too many levels. In the current case, a fan-out of
around 6 gives the best saving of storage.
We now turn to the comparison of disk access time. It suffices to consider
generalizing a single concept in one hierarchy, because the total disk access time of
generalizing the cube to a certain higher level is the sum of the times of the
individual concept generalizations when more concepts and more than one hierarchy
are involved. We retain the assumptions made before subsection 5.3.1 and, for the
sake of simplicity, we only consider the case $n_{ij} = F_i^{\,j}$, i.e., the number of nodes at
each level is a power of the fan-out.
Let us start by analyzing approach A. If $l' = l - 1$ we do not need any disk
access, because the concept is already in its raw form. When $l' < l - 1$, we need
to access $(l - l' - 1)$ hierarchy tables. Since the table associated with level $i$ has $F^i$
tuples, we need $1 + \lceil \log_{k_A - 1} F^i \rceil$ disk accesses to read a tuple from that table,
where $k_A = \lfloor (B + R)/(P + R) \rfloor$ and $R$ is the length of attribute cName. Therefore
the total disk access time for generalizing a concept at level $(l - 1)$ to level $l'$ is
$$\Big[\sum_{i=l'+1}^{l-1}\left(1 + \lceil \log_{k_A - 1} F^i \rceil\right)\Big] t_b \qquad (5.14)$$
where $t_b$ is the time of one page disk access (see Elmasri and Navathe [12] for typical
disk access parameters), and we adopt the convention that $\sum_{i=s}^{t} = 0$ if $s > t$.
For approach C, to generalize a concept at level $(l - 1)$ to level $l'$ we only need to
chop off $(l - l' - 1)$ fields of its code. Disk access is needed only to look up the real
concept name corresponding to the generalized code in order to display the mining
result. The method of chopping off codes and looking up real concept names will be
addressed in the next chapter. Again, since a B$^{+}$-tree index is created on the
attribute code of the hierarchy table, we need to access the disk
$1 + \lceil \log_{k_C - 1} F^{\,l-1} \rceil$ times, where $k_C = \lfloor (B + L)/(P + L) \rfloor$ and $L$ is the length of
a code. Thus, the disk access time for approach C is
$$\left(1 + \lceil \log_{k_C - 1} F^{\,l-1} \rceil\right) t_b. \qquad (5.16)$$
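Equations (5.14) and (5.16) can likewise be evaluated numerically; the page size B,
pointer size P and page access time t_b below are illustrative values rather than the
settings used for the figure that follows:

from math import ceil, log

def btree_reads(N, key_len, B=4096, P=4):
    """1 + ceil(log_{k-1} N) page reads to fetch a tuple through a
    B+-tree index, where k = floor((B + key_len) / (P + key_len))."""
    k = (B + key_len) // (P + key_len)
    return 1 + ceil(log(N, k - 1))

def time_A(l, lp, F, R, tb=10):
    """Equation (5.14): generalizing a level-(l-1) concept to level lp
    walks (l - lp - 1) hierarchy tables keyed on cName (length R)."""
    return sum(btree_reads(F ** i, R) for i in range(lp + 1, l)) * tb

def time_C(l, F, L, tb=10):
    """Equation (5.16): a single indexed lookup on the code attribute."""
    return btree_reads(F ** (l - 1), L) * tb

print(time_A(l=5, lp=1, F=6, R=20))  # grows with the number of levels crossed
print(time_C(l=5, F=6, L=2))         # constant, whatever the target level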
As shown in Figure 5.8, the disk access time using the encoded hierarchy (approach C)
is constant, which can also be detected from equation (5.19).
Figure 5.8: Disk access time (msec) versus the number of generalized levels for
approaches A, B and C.
It is more important to notice that the disk access time of approach C is much less
than that of approaches A and B, except when no generalization is performed at all
and only the result in the least generalized cube is displayed. As the number of
generalized levels increases, the performance advantage of the encoding method grows
as well. For example, the encoding approach is 4 times faster than approach A or
approach B when a concept is generalized from level 4 to level 1. Finally, we point out
that approach B is slower than approach A because, compared with approach A, we
need to access one more hierarchy table when using approach B to generalize a
concept to a certain level. □
5.4 Summary
Based upon the comparisons of storage and disk access time for the three approaches,
we can conclude that the encoding approach outperforms the two without-encoding
approaches: the encoding method requires less storage and enables more efficient
processing of data mining tasks.
Chapter 6
Data Mining Using Concept Hierarchies
Example 6.1 Suppose that a database UNIVERSITY has the following schema:
student(name, sno, status, major, gpa, birth date, birth city, birth province)
course(cno, name, department)
grading(sno, cno, instructor, semester, grade)
In order to discover some hidden regularities in this database, we specify the following
DMQL query:
USE database UNIVERSITY
MINE CHARACTERISTIC RULE
FROM student
WHERE major="cs" and gpa="3.5~4.0" and birth_place="Canada"
IN RELEVANCE TO gpa, birth_place
ANALYZE count
One may immediately notice from this query that birth_place is not an attribute in
table student, and that "3.5~4.0" is not a value of attribute gpa. Actually, the two
dimensions gpa and birth_place appearing in the IN RELEVANCE TO statement are
associated with the concept hierarchies gpa and birth_place, and "3.5~4.0" and Canada
are concepts in those hierarchies, respectively.
To transform this query into a SQL query to retrieve task-relevant data and to
complete the mining task, we need to get the following two things done.
Expand dimensions. The dimensions involved in the IN RELEVANCE TO clause should
be expanded in order to form a SQL SELECT statement, because the attributes in the
SELECT statement must be available in the database tables. In the above DMQL query,
dimension gpa is an attribute in table student, but birth_place is not. Assume that
hierarchy birth_place has level names all_place(C), country(C), birth_province(S)
and birth_city(S), where the letter C or S in parentheses indicates the type of the
level. The dimension birth_place is then replaced by birth_province and birth_city,
which are of type S (schema). Now the SQL SELECT statement is
SELECT gpa, birth_province, birth_city
Expand WHERE clause. The higher level concepts in the WHERE clause of the DMQL
query have to be expanded so that only raw data values are involved in the resulting
SQL WHERE clause. For example, "Canada" is not a value in table student. We use
concept hierarchy birth_place, which is identical to hierarchy location shown in
Figure 3.2, to find the nearest descendants of schema type, which have level name
birth_province and the nine provinces as values. Thus, after expanding, the condition
birth_place = "Canada" is replaced with
birth_province = "BC" OR birth_province = "AB" OR ... OR birth_province = "PE"
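The expansion can be sketched as a lookup in the stored hierarchy table; the helper
below and its sample rows are hypothetical, but the steps mirror the procedure just
described:

# One tuple per root-leaf path of hierarchy birth_place (sample rows);
# level names: all_place(C), country(C), birth_province(S), birth_city(S)
hier = [
    ("ANY", "Canada", "BC", "Vancouver"),
    ("ANY", "Canada", "BC", "Victoria"),
    ("ANY", "Canada", "AB", "Calgary"),
    ("ANY", "Canada", "ON", "Toronto"),
]
level_types = ["C", "C", "S", "S"]
attr_names = ["all_place", "country", "birth_province", "birth_city"]

def expand_condition(concept):
    """Replace dimension = concept by a disjunction over the values at
    the nearest descendant level of schema (S) type."""
    level = next(j for j in range(len(level_types))
                 for p in hier if p[j] == concept)
    s_level = next(j for j in range(level + 1, len(level_types))
                   if level_types[j] == "S")
    values = sorted({p[s_level] for p in hier if p[level] == concept})
    return " OR ".join('%s = "%s"' % (attr_names[s_level], v) for v in values)

print(expand_condition("Canada"))
# -> birth_province = "AB" OR birth_province = "BC" OR birth_province = "ON"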
Example 6.2 Figure 6.2 illustrates the procedure of concept generalization, where
the related concept hierarchy is assumed to have four levels and we want to generalize
the cid to level one. The last two fields, or levels, are chopped off, and the code x9257
is changed to x9000. □
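A sketch of this chopping step, treating keep_fields as the number of leading fields
kept and assuming, purely for illustration, that every field is four bits wide so that
fields line up with hexadecimal digits:

def generalize_cid(cid_hex, bits, keep_fields):
    """Zero out all code fields after the first keep_fields ones;
    bits[j] is the (assumed) width of the level-j field."""
    total = sum(bits)
    b = bin(int(cid_hex, 16))[2:].zfill(total)
    keep = sum(bits[:keep_fields])
    return format(int(b[:keep].ljust(total, "0"), 2), "0%dX" % ((total + 3) // 4))

print(generalize_cid("9257", [4, 4, 4, 4], 1))  # keep the top field -> 9000
print(generalize_cid("9257", [4, 4, 4, 4], 2))  # keep two fields    -> 9200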
applicable to a usual hierarchy, such as storing it in relational tables and encoding it.
To create a data cube, one needs to relate the attributes that appear in the rule-based
hierarchy and pick up the corresponding codes from the hierarchy table. Once the
data cube has been created, we no longer need to access the raw data, and all the
other data mining functionalities can be executed normally.
Example 6.3 Continuing Example 6.2, we consider the concept look-up for cid
1 0010 00000 000000
The generalized code is matched, as a prefix, against the code attribute of the hierarchy
table to retrieve the values of a1 and a2, where a1 and a2 are the first two level names
of the concerned concept hierarchy. Finally we can use the retrieved values of a1 or
(a1, a2) for displaying our mining results. □
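Displaying a mining result thus amounts to mapping a generalized code back to
concept names; a sketch (hypothetical helper, relying on the prefix property of
Lemma 5.1 and reusing the codes computed for Figure 5.1 above):

def lookup(rows, code_prefix, n_levels):
    """Return the first n_levels level names of any tuple whose stored
    leaf code starts with the generalized code (by Lemma 5.1's prefix
    property this identifies the right subtree)."""
    for names, leaf_code in rows:
        if leaf_code.startswith(code_prefix):
            return names[:n_levels]
    return None

rows = [(("store", "D1", "C1", "i01"), "1010101"),
        (("store", "D2", "C4", "i10"), "1101010")]
print(lookup(rows, "110", 2))  # -> ('store', 'D2')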
6.6 Summary
The architecture of the DBMiner system is briefly introduced. Concept hierarchies
are used in the data cube construction and in all the other functional modules. The
major applications of concept hierarchies, including DMQL query expansion, concept
generalization, the use of rule-based hierarchies and the display of mining results, are
discussed using examples. Many other applications, including the retrieval and search
of hierarchy-related information and the special treatment of date/time hierarchies,
are also implemented in the DBMiner system.
Chapter 7
Conclusions and Future Work
Data mining and knowledge discovery in databases have been attracting a significant
amount of research, industry and media attention. As an important form of
background knowledge for data mining, concept hierarchies provide data mining
methods with the ability to generalize raw data to higher abstraction levels, making it
possible to express knowledge in concise and simple terms; they also make it possible
to mine knowledge at multiple levels. This thesis has focused on the study of concept
hierarchies concerning their specification, generation, implementation and application.
In this last chapter of the thesis, we give a brief summary of the work we have done
and discuss some related topics which are important and interesting for future
research.
7.1 Summary
The efficient use of concept hierarchies in data mining is the ultimate goal of this
study. Different aspects of concept hierarchies are investigated in the thesis, including
their specification, generation, implementation and application.
Figure 7.1: A concept hierarchy on attribute age, rooted at 0~112.

Figure 7.2: A second concept hierarchy on age, rooted at 0~112 with children 0~80
and 80~112, shown together with the histogram of age (count versus age, 0 to 112).
If we use the measure defined in [21], we find that the second concept hierarchy
(Figure 7.2) has a higher quality than the first hierarchy (Figure 7.1). Nevertheless,
only the first hierarchy correctly describes the hidden structure of the attribute on
which the histogram is produced. Therefore, one can be sure that knowledge rules
discovered using the second hierarchy are definitely worse than those discovered using
the first hierarchy. How to measure the quality of a concept hierarchy is still an open
problem.
(3) How to handle complex rule-based concept hierarchies?
A deductive generalization rule has the form $A(x) \wedge B(x) \rightarrow C(x)$, which means
that, for a tuple $x$, concept $A$ can be generalized to concept $C$ if condition $B$ is satisfied
by $x$. The condition $B(x)$ can be a simple predicate or a very complex logic formula
involving different attributes and relations. The technique used in Chapter 3 can
only deal with the simple predicate case. Further research is needed on implementing
complex rule-based concept hierarchies.
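For the simple-predicate case that the existing technique handles, such a rule can be
sketched directly; the rule set below is invented for illustration:

# A(x) and B(x) -> C(x): generalize concept A to C when condition B holds.
# Hypothetical rules on a student-like tuple: how "3.5~4.0" generalizes
# depends on the student's status.
rules = [
    ("3.5~4.0", lambda t: t["status"] == "graduate", "excellent"),
    ("3.5~4.0", lambda t: t["status"] != "graduate", "good"),
]

def generalize(row, concept):
    for a, cond, c in rules:
        if concept == a and cond(row):
            return c
    return concept  # no applicable rule: keep the concept unchanged

print(generalize({"status": "graduate"}, "3.5~4.0"))   # -> excellent
print(generalize({"status": "undergrad"}, "3.5~4.0"))  # -> good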
Bibliography
[1] A. A. Afifi and V. Clark. Computer-aided multivariate analysis. 3rd edition,
Chapman and Hall, NY, 1996.
[2] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets
of items in large databases. In Proc. of the ACM SIGMOD conf. on Management
of Data, Washington, D.C., 207-216, 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc.
1994 Int. Conf. Very Large Data Bases, Santiago, Chile, 487-499, 1994
[4] H. Aït-Kaci, R. Boyer, P. Lincoln and R. Nasr. Efficient implementation of lattice
operations. ACM Transactions on Programming Languages and Systems, 11(1):115-146,
1989.
[5] C. Brew. Systemic classification and its efficiency. Computational Linguistics,
17(4):375-408, 1991.
[6] D. Chamberlin. Using the new DB2: IBM's object-relational database system.
Morgan Kaufmann, 1996.
[7] D. W. Cheung, A. W. Fu and J. Han. Knowledge discovery in databases: a rule-
based attribute-oriented approach. In Proc. 1994 Int. Symp. on Methodologies
for Intelligent Systems (ISMIS'94), Charlotte, NC, 164-173, 1994.
[19] A. D. Gordon. Classification: Methods for the Exploratory Analysis of Multivariate
Data. Chapman and Hall, 1981.
[20] R. P. Grimaldi. Discrete and combinatorial mathematics: An applied introduc-
tion. Addison-Wesley Publishing Company, 1994.
[21] H. J. Hamilton and D. R. Fudger. Estimating DBLearn's potential for knowledge
discovery in databases. Computational Intelligence, 11(2), 280-296, 1995.
[22] J. Han. Mining knowledge at multiple concept levels. In Proc. 4th Int. Conf.
on Information and Knowledge Management (CIKM'95), Baltimore, Maryland,
19-24, 1995.
[23] J. Han. Conference tutorial notes: Integration of data mining and data warehousing
technologies. 1997 Int. Conf. on Data Engineering (ICDE'97), Birmingham,
England, 1997.
[24] J. Han, Y. Cai and N. Cercone. Data-driven discovery of quantitative rules in
relational databases. IEEE Trans. on Knowledge and Data Engineering, 5(1), 29-40,
1993.
[25] J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies for
knowledge discovery in databases. In Proc. AAAI'94 Workshop on Knowledge
Discovery in Databases (KDD'94), Seattle, WA, 157-168, 1994.
[26] J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. In Proc. 1995 Int. Conf. Very Large Data Bases (VLDB'95), Zurich,
Switzerland, 420-431, 1995.
[27] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in
data mining. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy,
editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[35] D. Keim, H. Kriegel and T. Seidl. Supporting data mining of large databases by
visual feedback queries. In Proc. 10th Int. Conf. on Data Engineering, 302-313,
Houston, TX, Feb. 1994.
[36] R. Kerber. ChiMerge: Discretization of numeric attributes. In Proc. Tenth
National Conf. on Artificial Intelligence (AAAI-92), San Jose, CA, 123-127, 1992.
[37] J. Lebbe and R. Vignes. Optimal hierarchical clustering with order constraint. In
Ordinal and Symbolic Data Analysis, E. Diday, Y. Lechevallier and O. Opitz, eds.,
Springer-Verlag, 265-276, 1996.
[38] C. Mellish. The description identification problem. Artificial Intelligence,
52(2):151-167, 1991.
[39] R. S. Michalski. Inductive learning as rule-guided generalization and conceptual
simplication of symbolic description: unifying principles and a methodology.
Workshop on Current Developments in Machine Learning, Carnegie Mellon Uni-
versity, Pittsburgh, PA, 1980.
[40] R. S. Michalski and R. Stepp. Automated construction of classications: Con-
ceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Analysis and
Machine Intelligence, 5:396-410, 1983.
[41] R. Missaoui and R. Godin. An incremental concept formation approach for
learning from databases. In V. S. Alagar, L. V. S. Lakshmanan and F. Sadri, editors,
Formal Methods in Databases and Software Engineering, Springer-Verlag, 39-53,
1993.
[42] PowerPlay: Packaging information with transformer. Cognos Incorporated, 1996.
[43] H. C. Romesburg. Cluster analysis for researchers. Krieger Publishing Company,
Malabar, Florida, 1990.
[44] S. J. Russell. Tree-structured bias. In Proc. 1988 AAAI Conf., Minneapolis, MN,
641-645, 1988.
[45] R. R. Sokal and P. H. A. Sneath. Principles of numerical taxonomy. W. H.
Freeman and Co., London, 1963.
[46] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 1995
Int. Conf. Very Large Data Bases, Zurich, Switzerland, 407-419, 1995.
[47] G. Stumme. Exploration tools in formal concept analysis. In Ordinal and symbolic
data analysis, E. Diday, Y. Lechevallier and O. Opitz (Eds.), 31-44, 1995.
[48] P. Valtchev and J. Euzenat. Classification of concepts through products of
concepts and abstract data types. In Ordinal and Symbolic Data Analysis, E. Diday,
Y. Lechevallier and O. Opitz (Eds.), 3-12, 1995.
[49] M. Wang and B. Iyer. Efficient roll-up and drill-down analysis in relational
databases. In 1997 SIGMOD Workshop on Research Issues on Data Mining and
Knowledge Discovery, 39-43, 1997.
[50] R. Wille. Concept lattices and conceptual knowledge systems. Computer & Math-
ematics with Applications, 23, 493-515, 1992.