
182 Taeho Jo

9 Text Clustering: Conceptual View

This chapter is concerned with the conceptual view of text clustering tasks.
It is the process of segmenting a text group into subgroups each of which
contains similar texts. Text clustering is divided into hard text clustering
and soft text clustering, depending on whether each text is allowed to belong
to more than one cluster, or not. Depending on whether nested clustering
is allowed or not, it is divided into flat text clustering and hierarchical text
clustering. In this chapter, we describe text clustering with its functional view.
Keywords: Text Clustering, Hard Text Clustering, Soft Text Clustering,
Hierarchical Text Clustering
We make general discussions concerned with text clustering in Section
9.1 and compare the task of data clustering with other data mining tasks in
Section 9.2. In Section 9.3, we explore the mutually opposite types of text
clustering depending on dichotomies. We mention the real tasks which are
derived from text clustering in Section 9.4, and make the summary and
further discussions on what we study in this chapter in Section 9.5.

9.1 Definition of Text Clustering

Text clustering refers to the process of segmenting a text group into subgroups
of texts with similar contents. It is possible to cluster texts by their character
combinations, which is called lexical clustering, but that kind of clustering is
not covered in this chapter. Clustering texts based on their contents or
meanings is called semantic clustering, and it is the kind covered in this
study. It is assumed that texts are not labeled at all, and unsupervised
learning algorithms, which are mentioned in Chapter 10, are adopted for this
task. In this section, we describe the text clustering briefly before discussing
it in detail, together with a review of the text categorization which was
covered in the previous part.
Let us review briefly the text categorization which was covered in the pre-
vious part, in order to help characterize the text clustering. The text cate-
gorization was mentioned as the process of classifying texts into one or some
of the predefined categories. The supervised machine learning algorithms,
such as K Nearest Neighbor, Naive Bayes, or Support Vector Machine, are
used as the main approaches. The performances of text categorization sys-
tems and approaches are evaluated by the accuracy or the F1 measure, which
were mentioned in Chapter 8. Both the text categorization and the text clus-
tering may be integrated into a particular task, rather than remaining two
independent tasks, as mentioned in Chapter 16.
The text clustering is characterized by several agenda differently from
other text mining tasks, such as the text categorization and the text associa-
tion. A group of texts is given as the input, and anonymous groups of them
are the results from clustering the texts.

Text Mining: Concepts, Implementation, and Big Data Challenge 183

The text clustering proceeds depending strongly on the similarities among
the texts; how to define the similarity metrics is a very important issue.
Cluster prototypes are initialized randomly or arbitrarily, and optimized to
maximize item similarities within each cluster and minimize similarities
between clusters. The set of texts is never partitioned into the training set
and the test set in evaluating text clustering systems.
The text clustering is coupled with the text categorization for automating
the preliminary tasks. Because manual labeling is a very tedious job, it is not
easy to gather labeled examples, but it is easy to gather unlabeled ones.
A labeled text collection can be constructed by clustering unlabeled texts
and naming the clusters, yielding the sample texts for implementing text
categorization systems. The list of cluster names becomes that of predefined
categories, and the texts within each cluster become the sample texts which
are labeled with its cluster name. The coupling of both tasks was proposed
and implemented by Jo in 2006 [25].
The text clustering is compared with the two tasks which were covered in
the previous parts: the text categorization and the text association. In com-
paring the text association and the text clustering with each other, the former
defines unidirectional relations among texts, while the latter defines bidirec-
tional relations among them. The preliminary tasks, such as category prede-
finition and sample text preparation, are required for the text categorization.
The text clustering tends to be coupled with the text categorization into a
single task, as mentioned above. Association rules among texts which are
generated from the text association become useful contexts for doing the
text clustering and the text categorization.

9.2 Data Clustering

This section is concerned with the generic data clustering in its functional
and conceptual view, and consists of the four sections. The clustering itself
is described in its functional view in Section 9.2.1. In Section 9.2.2, the clus-
tering is compared with the association, in order distinguish them from each
other. In Section 9.2.3, the clustering and the classification are compared
with each other. In Section 9.2.4, the constraint clustering is mentioned as
the special clustering where some labeled examples are provided.

9.2.1 Functional View of Data Clustering

The data clustering is illustrated in Figure 113. A single group of various
data items is given as the initial clustering status. The single group is seg-
mented into several subgroups, which are called clusters, depending on
the similarities among data items. The data items within their own subgroup
should be similar to each other, and ones in different subgroups should be
discriminated from each other. In this section, we describe the data clustering
in its functional view.

Fig. 113 Data Clustering
Computing the similarities among examples and clusters is a very important
task for clustering data items. The similarity metrics between two individual
items which are represented as numerical vectors were previously expressed
in Equations (6.1)-(6.4). We set the similarity between an individual item
and a cluster as the similarity between the item's vector and the mean
vector of the cluster. The similarity between two clusters is computed as the
similarity between the mean vectors which correspond to them. We may also
use the maximum or the minimum of the similarities over all possible pairs
of items as the similarity between the two clusters.
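To make the metrics above concrete, the following is a minimal sketch in Python, not taken from this book; the use of cosine similarity over numerical vectors and the mean-vector cluster representation are assumptions made for illustration:

```python
import math

def cosine(u, v):
    # Cosine similarity between two numerical vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(cluster):
    # Component-wise mean of the vectors belonging to a cluster.
    n = len(cluster)
    return [sum(vals) / n for vals in zip(*cluster)]

def item_cluster_similarity(item, cluster):
    # Similarity between an item and a cluster: the item's vector
    # against the cluster's mean vector.
    return cosine(item, mean_vector(cluster))

def cluster_cluster_similarity(c1, c2):
    # Similarity between two clusters: the similarity of their mean vectors.
    return cosine(mean_vector(c1), mean_vector(c2))
```

As the text notes, the maximum or minimum over all item pairs could be substituted for the mean-vector definition of the cluster-to-cluster similarity.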
Even if the clustering proceeds differently depending on the clustering algo-
rithm, a general frame for doing it exists. We mentioned above the compu-
tation of similarities among items or clusters as the core operation which is
necessary for clustering data items. The clusters are characterized by their
prototypes or their representative examples, and the data items are arranged
according to their likelihoods with the clusters. The two steps, cluster charac-
terization and data item arrangement, are iterated until the cluster charac-
teristics become stable; that is, until they change very little. The stable cluster
characteristics and the clusters of data items are the final results from the
data clustering.
Let us mention some clustering algorithms which are approaches to the
text clustering. The AHC algorithm is mentioned as a typical clustering algo-
rithm; it is the simplest, but has high time complexity. We may consider the
k means algorithm as the most popular clustering algorithm. The k means
algorithm may be extended into the fuzzy version where memberships of data
items in clusters are computed. It belongs to the EM (Expectation and Maxi-
mization) family of algorithms, where data items are clustered with the two
steps: the expectation step and the maximization step.
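As an illustration of the general frame above, a minimal k means sketch in Python follows. It is a hypothetical implementation, not the book's; the Euclidean distance, the random prototype initialization, and the fixed iteration cap are assumptions. It alternates cluster characterization (mean prototypes) and item arrangement (nearest prototype) until the prototypes become stable:

```python
import math
import random

def kmeans(items, k, iterations=20, seed=0):
    # Initialize prototypes by sampling k items at random (an assumption).
    rng = random.Random(seed)
    prototypes = [list(p) for p in rng.sample(items, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Arrangement step: each item goes to its nearest prototype.
        clusters = [[] for _ in range(k)]
        for item in items:
            distances = [math.dist(item, p) for p in prototypes]
            clusters[distances.index(min(distances))].append(item)
        # Characterization step: recompute each prototype as the cluster mean.
        new_prototypes = [
            [sum(vals) / len(c) for vals in zip(*c)] if c else prototypes[i]
            for i, c in enumerate(clusters)
        ]
        if new_prototypes == prototypes:
            break  # cluster characteristics are stable; stop iterating
        prototypes = new_prototypes
    return clusters, prototypes
```

The stable prototypes and the final arrangement of items are returned together, matching the two outputs described in the text.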
Let us compare the clustering task with other data mining tasks: associ-
ation, classification, regression, and outlier detection. The data association
refers to the process of extracting association rules, which are given as
if-then ones, from a collection of item sets. The data classification is defined
as the process of assigning one or some of the predefined categories to each
data item by analyzing the labeled sample ones. The regression is the process
of estimating a continuous output value or values by analyzing input factors
and previous output values. The outlier detection, which is an additional data
mining task, is the process of detecting and removing alien data items and
noisy ones for cleaning data items.

9.2.2 Association vs Clustering

This section is concerned with the comparison of the text clustering with
the text association. The text association means the process of extracting
association rules in if-then forms between texts; it was studied in Chapter
4. As the distinguishing characteristics of the two tasks, the text association
pursues causal relations among texts, whereas the text clustering pursues
similarities among them. They are distinguished from each other, but
they may be combined with each other. In this section, we investigate the
differences between the tasks and connect them with each other.
We described the process of associating data items in Chapter 4. In the
association, item sets are given as the input, initially. All association rules,
each of which consists of a single item as the conditional part and items
as its causal parts, are extracted. For each association rule, we compute
its support and confidence, and select the ones whose supports and confi-
dences are higher than the given thresholds. The difference between
the two tasks is that in the association, the item sets are given as the input,
whereas in the clustering, a single group of items is given as the input.
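The support and confidence computation described above can be sketched as follows. This is an illustrative fragment, not from the book; representing each item set as a Python set is an assumption:

```python
def support(item_sets, rule_items):
    # Support: fraction of item sets that contain every item of the rule.
    hits = sum(1 for s in item_sets if rule_items <= s)
    return hits / len(item_sets)

def confidence(item_sets, condition, consequent):
    # Confidence: among sets containing the conditional part, the
    # fraction that also contain the causal (consequent) part.
    cond = support(item_sets, condition)
    return support(item_sets, condition | consequent) / cond if cond else 0.0
```

A rule would then be kept only when both values exceed the given thresholds, as the text describes.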
Table 12 illustrates the comparison of the two tasks: clustering and asso-
ciation. In the association, item sets are given as the input, whereas a single
group of items is given as the input in the clustering. In the association,
if-then rules are generated as the output, whereas in the clustering, clusters
of similar items are generated. In the association, the results indicate causal
relations among items, whereas in the clustering, the results indicate similar-
ities. The causal relations are uni-directional, which does not guarantee their
reverse equivalence, but the similarities are bi-directional, which does.

Table 12 Association vs Clustering


            Association     Clustering
Input       Item Sets       Item Group
Output      If-then Rules   Clusters
Relation    Causal          Similar
Direction   One Direction   Both Directions

The results from clustering data items may become the input for doing the
data association. The clusters which are generated as the output are viewed
as item sets. The association rules are extracted from the clusters, using
an association algorithm such as the Apriori algorithm. It becomes possible
to find the association rules from individual items by means of data clustering.
However, note that this type of data association is sensitive to the results
from the data clustering.
The word association rules are generated from texts by the above process.
The list of words which are given as the conditional parts of association rules
becomes the list of taxonomies. For each association rule, the texts which are
relevant to the words in its causal part are retrieved. In generating taxonomies
from the association rules, each association rule is transformed into a topic
and the texts which belong to it. Texts should be encoded into lists of words
through text indexing for doing the word association.

9.2.3 Classification vs Clustering

The classification was already covered in the previous part, by studying the
text categorization. We mention the classification in this section for distin-
guishing the clustering from the classification. The classification task is re-
ferred to the process of assigning one or some of the predefined categorization
to each data item. The clustering task looks similar as the classification in
that each item is arranged into one or some clusters. In this section, we
explain briefly the classification and compare it the clustering task.
Let us review the process of classifying data items which was covered in
Part 3. We predefine categories as a list or a tree, and gather training exam-
ples which are labeled with one or some of the predefined ones. The classifi-
cation capacity is constructed by the process called supervised learning with
the training examples. Novel examples are classified with the constructed
classification capacity. The supervised learning is performed by minimizing
the differences between the true labels and the classified ones of the training
examples.
The comparisons between the classification and the clustering are illus-
trated in Table 13. The classification categories are defined automatically
through the clustering, whereas they are defined manually in the classification.
The supervised learning algorithms are applied to the classification, whereas
the unsupervised learning algorithms are applied to the clustering. In the clus-
tering, a group of data items is given as the input, whereas in the classifi-
cation, a single data item is given as the input. In the clustering, subgroups
each of which contains similar data items are generated as the output, whereas
one or some among the predefined categories are assigned to each item in the
classification.

Table 13 Clustering vs Classification


                          Clustering                       Classification
Classification Categories Automatic Definition             Manual Definition
Machine Learning Type     Unsupervised Learning            Supervised Learning
Input                     Group of Data Items              Single Data Item
Output                    Subgroups of Similar Data Items  Category or Some Categories

The results from clustering the data items are significant for doing the data
classification. The clustering results consist of the predefined categories and
the sample texts which belong to each of them. A list of clusters or a cluster
hierarchy becomes the predefined category structure, and the texts which are
contained in each cluster are sample ones which are labeled with its corre-
sponding cluster identifier. The data clustering automates the preliminary
tasks for the text categorization, here. The results from clustering data items
are used as sample texts for training a supervised learning algorithm for
performing the text categorization.
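The conversion of clustering results into a labeled training collection described above can be sketched as follows. The helper and the cluster names in the usage note are hypothetical, invented purely for illustration:

```python
def clusters_to_labeled_samples(named_clusters):
    # named_clusters maps each cluster name to the list of texts it contains.
    # The cluster names become the predefined categories, and every text
    # in a cluster is labeled with its own cluster's name.
    categories = list(named_clusters)
    samples = [(text, name)
               for name, texts in named_clusters.items()
               for text in texts]
    return categories, samples
```

For example, clusters named "sports" and "finance" would yield the category list ["sports", "finance"] together with (text, label) pairs ready for training a supervised learner.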
The event detection and tracking was mentioned as a typical case of com-
bining the text categorization with the text clustering [2]. The event detection
means the process of deciding whether the current news article deals with a
particular event, or not, as a special instance of text clustering [8]. The
event tracking is the process of tracing events in other news articles using the
detected ones, as a special case of text categorization. News articles about a
particular event are grabbed by the event detection, and related news articles
are retrieved by the event tracking. Detection and tracking of topics, as well
as events, also exist.

9.2.4 Constraint Clustering

This section is concerned with the constraint clustering, where some labeled
examples are provided. Figure 114 illustrates the constraint clustering as a
diagram. Both labeled examples and unlabeled ones are used for clustering
both kinds of them. Only a small number of labeled examples is usually given
as references for clustering unlabeled ones. In this section, we explain the
constraint clustering in detail, and compare it with other tasks.
Let us compare the constraint clustering with the clustering which was
mainly covered in this part. In the clustering, it is assumed that all exam-
ples are unlabeled, whereas in the constraint clustering, some examples may
be labeled by the prior knowledge or accidentally. The clustering proceeds
depending purely on similarities or distances among examples, whereas the
constraint clustering proceeds depending on similarities or distances together
with the labels of some examples. In both the clustering and the constraint
clustering, the ultimate goal is to segment a group of items into subgroups of
similar ones. In the clustering, the cluster prototypes are initialized at random,
whereas in the constraint clustering, they are initialized based on some
labeled examples.
The constraint clustering is similar to the pure clustering except for the fact
that some labeled examples are given additionally. In the constraint clus-
tering, a small number of labeled examples and a large number of unlabeled
ones are given as the input; unlabeled examples are the majority of the data
items. The labeled examples are arranged in advance into the clusters which
correspond to their labels, and the cluster prototypes are initialized based
on the labeled ones. The unlabeled examples are clustered by a clustering
algorithm afterward. Both kinds of examples are used for learning, as in the
semi-supervised learning.

Fig. 114 Constraint Clustering
Let us compare the constraint clustering with the semi-supervised learning.
In both tasks, both labeled examples and unlabeled ones are used, as men-
tioned above. However, they have different goals; the semi-supervised
learning is intended for the classification and the regression, whereas the
constraint clustering is intended for the clustering. The semi-supervised
learning involves both the supervised learning algorithms and the unsuper-
vised ones, while the constraint clustering involves only unsupervised ones
[91]. In the semi-supervised learning, unlabeled examples are used as addi-
tional ones, whereas in the constraint clustering, the labeled ones are used
as constraints for clustering data items.
Let us mention the alien cluster, which consists of a very small number of
data items. When plotting data points in the two-dimensional space, it may
be discovered that some points are far from the main clusters. The groups of
such points are called alien clusters or outliers [110]. The sparse clusters are
generated as some among a large number of small-sized clusters, or they
may be generated as noises in the case of a small number of large clusters.
This kind of cluster should be removed by the process which is called outlier
detection [110].
Text Mining: Concepts, Implementation, and Big Data Challenge 189

9.3 Clustering Types

This section is concerned with the clustering types by various views. In Sec-
tion 9.3.1, we mention the static vs the dynamic clustering. In Section 9.3.2, we
explain the two clustering types, the crisp clustering and the fuzzy clustering.
In Section 9.3.3, we describe the flat clustering and the hierarchical cluster-
ing, depending on whether any nested cluster is allowed, or not. In Section
9.3.4, we introduce the single viewed and the multiple viewed clustering.

9.3.1 Static vs Dynamic Clustering

This section is concerned with the two types of data clustering: static cluster-
ing and dynamic clustering. The dichotomy for dividing the clustering task
into the two types is whether data clusters as the clustering results are fixed
or continually variable. In the static clustering, results from clustering data
items are fixed, whereas in the dynamic clustering, results from doing so are
updated, continually. In the previous works on data clustering, the static
clustering has been assumed for explaining the clustering algorithm. In this
section, we explain the two types of clustering, individually, and compare
them with each other.
The static clustering is one where the results are fixed. Until now, the
clustering algorithms have been explained under the assumption of the static
clustering. A group of various data items is given as the input, and the
clusters of similar ones which are the results are fixed. However, there is the
possibility of adding more data items and deleting some from the results, so
the data items should be reorganized accordingly. In the static clustering,
this real situation is overlooked.
Let us consider the dynamic clustering, which is opposite to the static clus-
tering. The dynamic clustering is one where, as more data items are added or
some items are deleted continually, the data items are reorganized accord-
ingly. There are two kinds of reorganization: the hard organization, which
organizes all of the data items again by a clustering algorithm, and the soft
organization, which merges some clusters into one or divides one into several
clusters for reorganizing data items. We need to measure the current orga-
nization quality, in order to decide whether the items should be reorganized,
or not. The dynamic clustering will be explained in detail in Chapter 16.
The static clustering and the dynamic clustering are compared with each
other. The assumption in the static clustering is that all of the data items are
given at a time, whereas that in the dynamic clustering is that data items
are added and deleted continually. In the static clustering, a single group
of data items is clustered into subgroups only one time, whereas in the dy-
namic clustering, data items are reclustered continually after the initial clus-
tering. In the static clustering, the organization of data items is fixed, whereas
in the dynamic clustering, it is variable. In the static clustering, the operations
such as division and merging of existing clusters are not considered, whereas
in the dynamic clustering, they are considered as the soft organization.
The organization management in the dynamic clustering is to decide on one
of the three actions: the maintenance, the soft organization, and the hard
organization. The two metrics which indicate the quality of clustering results
are computed: the intra-cluster similarity and the inter-cluster similarity.
The direction of clustering data items is to maximize the intra-cluster simi-
larity and minimize the inter-cluster similarity at the same time. The decision
among the three actions is made by comparing the values of the two metrics
before adding or deleting items with those after doing so. In Chapter 12, we
mention in detail the two measures for evaluating the results from clustering
data items.
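The two quality metrics can be sketched as follows. This is an illustrative version, assuming cosine similarity and simple averages over item pairs; the exact definitions given in Chapter 12 may differ:

```python
import math

def cosine(u, v):
    # Cosine similarity between two numerical vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(cluster):
    # Average pairwise similarity of the items inside one cluster.
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    if not pairs:
        return 1.0  # a singleton cluster is trivially coherent
    return sum(cosine(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def inter_cluster_similarity(c1, c2):
    # Average similarity over all item pairs drawn from the two clusters.
    sims = [cosine(u, v) for u in c1 for v in c2]
    return sum(sims) / len(sims)
```

A reorganization would then be triggered when the intra-cluster similarity drops or the inter-cluster similarity rises after items are added or deleted.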

9.3.2 Crisp vs Fuzzy Clustering

This section is concerned with the two types of clustering: crisp and fuzzy
clustering. The dichotomy criteria is whether each item is allowed to belong
to more than one cluster, or not. The crisp clustering is one where every item
belongs to only one cluster, whereas the fuzzy clustering is one where at least
one item belongs to more than one cluster. In the fuzzy clustering, rather
than belonging to more than one cluster, to each item, membership values of
clusters are assigned. In this section, we explain the two types of clustering
and their differences.
We present an example of the crisp clustering in Figure 115. The table in
Figure 115 is given as a matrix of the eight items and the four clusters. All of
the eight items belong to only one of the four clusters; no item belongs to more
than one cluster. As the results from this exclusive clustering, cluster 1 contains
items 1 and 6, cluster 2 contains items 2, 4, 5, and 8, cluster 3 contains only
item 7, and cluster 4 contains item 3. No overlapping between clusters exists
in the crisp clustering.
A simple example of the fuzzy clustering is illustrated in Figure 116. The
table in Figure 116 has the same frame as that in Figure 115. Five of the
eight items are assigned to more than one cluster. Item 2 is assigned to three
clusters, and four items are assigned to two clusters. If at least one item
belongs to more than one cluster, the results are those of the fuzzy clustering.
Table 14 illustrates the differences between the crisp clustering and the
fuzzy clustering which are covered in this section. The crisp clustering does
not allow any overlapping between clusters, whereas the fuzzy one does. The
membership values are given as binary values in the crisp clustering, whereas
they are given as continuous values between zero and one in the fuzzy clus-
tering. In the crisp clustering, each data item is arranged into its most
similar cluster, whereas in the fuzzy clustering, it is arranged into every
cluster whose similarity is higher than the given threshold. The crisp clus-
tering is usually applied to the opinion clustering where each data item is
arranged into one of the positive cluster, the neutral one, and the negative
one, whereas the fuzzy clustering is applied to the topic based text clustering
where each data item is arranged into its relevant topic clusters.

Fig. 115 Example of Crisp Clustering

Fig. 116 Example of Fuzzy Clustering

Table 14 Crisp vs Fuzzy Clustering


             Crisp Clustering    Fuzzy Clustering
Overlapping  No                  Yes
Membership   0 or 1              Between 0 and 1
Arrangement  Highest Similarity  Higher than Threshold
Case         Opinion Mining      Topic based Clustering

Fig. 117 Item-Cluster Matrix

In Figure 117, the item-cluster matrix is illustrated as the results from the
fuzzy clustering. There are two types of fuzzy clustering results: one is the
clusters of items, where some items belong to more than one cluster, and the
other is the item-cluster matrix which is shown in Figure 117. In the matrix
frame, each column corresponds to a cluster, and each row corresponds to an
item. The entry at the crossing of a column and a row is the membership
value of the item in the cluster. The item-cluster matrix will be mentioned
again for explaining the fuzzy k means algorithm in Chapter 10.
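One row of the item-cluster matrix can be sketched as follows. The normalization scheme here is a hypothetical illustration; the fuzzy k means algorithm in Chapter 10 computes the memberships differently:

```python
def membership_row(similarities):
    # Normalize an item's similarities to the clusters into membership
    # values between zero and one that sum to one; this is one row
    # of the item-cluster matrix.
    total = sum(similarities)
    return [s / total for s in similarities] if total else similarities

def crisp_row(row):
    # Collapsing a fuzzy row into a crisp one: the item is arranged
    # into the single cluster with the highest membership value.
    best = row.index(max(row))
    return [1 if i == best else 0 for i in range(len(row))]
```

The contrast between the two functions mirrors Table 14: continuous memberships for the fuzzy clustering versus a 0-or-1 assignment for the crisp clustering.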

9.3.3 Flat vs Hierarchical Clustering

This section is concerned with the two opposite types of clustering: flat clus-
tering and hierarchical clustering. The dichotomy criterion for the two types is
whether to allow nested clusters in a particular cluster. The flat clustering
is one where no nested cluster is available, while the hierarchical one is one
where at least one nested cluster is available. In the flat clustering, a list of
clusters is given as the results, while in the hierarchical clustering, a tree of
clusters is given as the results. In this section, we explain the two types of
clustering and their comparisons.
Figure 118 illustrates the flat clustering and the hierarchical clustering.
The flat clustering is shown in the left side of Figure 118. In the flat clustering,
no cluster is nested in any cluster, and a list of clusters is given as the results.
The k means algorithm which was mentioned in Section 10.2.1 is usually
used as the approach to this kind of clustering. The flat clustering is used
for dividing a group of data items into several subgroups based on their
similarities.
Fig. 118 Flat vs Hierarchical Clustering

The hierarchical clustering is one where at least one nested cluster is allowed
in a particular cluster, as shown in the right side of Figure 118. The three
clusters are given in both sides of Figure 118; there is no nested cluster in
the left side, while there are nested clusters in the right side. A tree of clusters,
where the root node is the group of entire items, the leaf nodes are individual
data items, and the intermediate nodes are subgroups, is generated as the
results. The AHC algorithm which was covered in Section 10.2.1, and the
divisive algorithm which was covered in Section 10.2.2, are typical approaches
to this type of clustering. The hierarchical clustering is used for constructing
a hierarchical organization of data items for browsing them.
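The bottom-up merging behind the AHC algorithm can be sketched as follows. This is an illustrative version only; measuring cluster distance between mean vectors is an assumption, since single or complete linkage over item pairs may be used instead:

```python
import math

def ahc(items, target_clusters):
    # Agglomerative hierarchical clustering sketch: start from singleton
    # clusters and repeatedly merge the closest pair of clusters,
    # until the target number of clusters remains.
    clusters = [[item] for item in items]
    while len(clusters) > target_clusters:
        # Characterize each cluster by its mean vector.
        means = [[sum(vals) / len(c) for vals in zip(*c)] for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(means[i], means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Recording each merge along the way would yield the cluster tree described above; the pairwise search over all cluster pairs at every step is also where the high time complexity of the AHC algorithm comes from.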
Table 15 illustrates the differences between the two types of clustering. As
mentioned above, in the flat clustering, no nested cluster is allowed, whereas
in the hierarchical clustering, nested clusters are allowed. Even if various
clustering algorithms are used for both types of clustering, in the flat
clustering, the k means algorithm is usually used, whereas in the hierarchical
one, the AHC algorithm is usually used. The flat clustering is characterized as
the process of arranging data items into one or some among the clusters, whereas
the hierarchical one is characterized as the process of dividing in the top down
direction or merging in the bottom up direction. The flat clustering results
are given as a list of clusters, whereas the hierarchical clustering results are
given as a tree of clusters.

Table 15 Flat vs Hierarchical Clustering


                    Flat Clustering    Hierarchical Clustering
Nested Cluster      No                 Yes
Typical Algorithm   K Means Algorithm  AHC Algorithm
Clustering Proceeds Arrangement        Merging or Division
Results             List               Tree

Fig. 119 Single Viewed Clustering

Let us consider the process of evaluating the hierarchical clustering results.
By computing the intra-cluster similarity and the inter-cluster similarity, the
flat clustering results are evaluated easily. The results from the hierarchical
clustering tend to be underestimated, due to the lower intra-cluster similarities
of higher clusters and the higher inter-cluster similarities among nested ones.
Even if the adopted clustering algorithm is really intended for the hierarchical
clustering task, it tends to be evaluated in the flat clustering tasks before
applying it. In this study, the clustering index is proposed as the metric for
evaluating the clustering results, and the evaluation process will be mentioned
in Section 12.3.4.

9.3.4 Single vs Multiple Viewed Clustering

This section is concerned with the two types of clustering: single viewed
clustering and multiple viewed clustering. The dichotomy for dividing the
clustering task into the two types is whether multiple results are allowed to
be accommodated, or not. Even by the same clustering algorithm, different
results may be produced depending on its external parameter values. As the
difference between the hierarchical clustering and the multiple viewed clus-
tering, a single tree of clusters is generated from the former, whereas forests
which consist of multiple trees are produced from the latter. In this section,
we explain the two types of clustering tasks and their comparisons.
The single viewed clustering is conceptually illustrated in Figure 119. From
a group of items, several clusters are generated by a clustering algorithm. The
clusters are regarded as the entire results, whether they are overlapping or
exclusive. In the area of machine learning and data mining, until now, it has
been assumed that the results from clustering data items by a single algorithm
are everything. However, we need to accommodate other clustering results
which are generated in other views.
As shown in Figure 120, in multiple viewed clustering, various results
from clustering the data items are accommodated. The motivation for
considering this type of data clustering is the fact that data items are
organized differently by different subjects, even in manual clustering. The
results of clustering data items differ depending on the values of the external
Text Mining: Concepts, Implementation, and Big Data Challenge 195

Fig. 120 Multiple Viewed Clustering

parameters, even with the same algorithm. For example, in the case of the k
means algorithm, different results are caused by changing the number of
clusters. In this type of clustering, multiple different results from clustering
the data items are allowed as their organizations.
We need to integrate the multiple results from clustering data items into a
single one. The multiple different results, called multiple organizations,
were mentioned above, and one of them may be selected by a user. Alternatively,
by merging similar clusters, they are integrated into one organization as the
union of the multiple organizations. The scheme of integrating multiple
organizations, which are clustering results, is described in detail in [112].
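A minimal sketch of such an integration, assuming clusters are given as sets of item identifiers and using Jaccard overlap as the merging criterion; both are assumptions of this sketch, not the scheme of [112].

```python
def jaccard(a, b):
    # overlap between two clusters given as sets of item identifiers
    return len(a & b) / len(a | b)

def merge_organizations(organizations, threshold=0.5):
    # pool the clusters of every organization, then greedily merge similar ones
    pool = [set(c) for org in organizations for c in org]
    merged = []
    for cluster in pool:
        for existing in merged:
            if jaccard(cluster, existing) >= threshold:
                existing |= cluster   # union of the two similar clusters
                break
        else:
            merged.append(set(cluster))
    return merged

# two clusterings of the same six items, produced under different parameters
org_a = [{"d1", "d2", "d3"}, {"d4", "d5", "d6"}]
org_b = [{"d1", "d2"}, {"d3", "d4", "d5", "d6"}]
union = merge_organizations([org_a, org_b])
```

Note that the merged organization may be overlapping even when each input organization was exclusive, since an item can fall into differently drawn clusters across views.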
Hierarchical clustering and multiple viewed clustering are illustrated
in Figure 121. More than one list of clusters exists in both clustering
types, so the two may be confused with each other. However, a single
organization of the data items is given as the result in hierarchical
clustering, whereas multiple organizations are given independently of
each other in multiple viewed clustering. In other words, a single tree
of clusters is given as the result in hierarchical clustering, whereas a
forest of trees is given as the result in multiple viewed clustering.

9.4 Derived Tasks from Text Clustering

This section is concerned with specific tasks which derive from text
clustering, and it consists of four subsections. In Section 9.4.1, we
mention cluster naming, which is the task subsequent to text clustering.
In Section 9.4.2, we describe subtext clustering as the process of
generating virtual texts. In Section 9.4.3, we explain the process

Fig. 121 Hierarchical Clustering and Multiple Viewed Clustering

Fig. 122 Cluster Naming

of using the text clustering results as sample texts for performing text
categorization. In Section 9.4.4, we cover the process of applying text
clustering to detecting redundant national projects.

9.4.1 Cluster Naming

Figure 122 illustrates the process of assigning names to the clusters which
result from clustering data items. Cluster naming is the process of
identifying each cluster with a symbolic name relevant to its contents. We
will mention some principles for naming clusters; identifying each cluster
with a primary key or a numerical value, irrelevant to its content, falls
outside cluster naming. Reflecting the contents of the texts in each cluster
in its name is the requirement of cluster naming. In this section, we describe
cluster naming for enabling the browsing of text clusters.
The principles of naming clusters symbolically were defined by Jo in his
2006 PhD dissertation [25]. The first principle of naming clusters is to
reflect the cluster contents in the symbolic names. The second principle is
that each cluster name should not be too long; a cluster name should consist
of only a couple of words. The third principle is that each cluster name
should be unique; there should be no redundant names among the clusters.
Cluster names are helpful for accessing texts by browsing.
In 2006, Jo described the process of naming clusters based on the above
principles in his PhD dissertation [25]. In each cluster, the texts are
indexed into a list of words with the three basic steps which were mentioned
in Chapter 2. For each cluster, the weights of the indexed words are computed,
and the words with the highest weights are selected as the cluster name
candidates. If two or more clusters have the same name, action is taken
against it by adopting one of the schemes which will be mentioned in the next
paragraph. Other approaches are available besides this one, and cluster naming
may be regarded as a text mining task separate from text clustering.
Two or more clusters with the same name violate the uniqueness principle, so
let us mention the schemes for avoiding the redundancy. In the first scheme,
the word with the highest weight is replaced by the one with the second
highest weight in naming the cluster. In the alternative scheme, the clusters
with the redundant name are merged into one cluster. We may also use multiple
word weighting schemes and replace the word by one which is weighted by
another scheme. Even when redundant cluster names are avoided, it is possible
that clusters named with different words have identical meanings.
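The naming procedure above can be sketched as follows; whitespace tokenization and raw frequency weighting stand in for the indexing and weighting steps of Chapter 2, and a name collision is resolved by falling back to the next-highest-weighted word, the first avoidance scheme above.

```python
from collections import Counter

def rank_words(cluster_texts):
    # index the cluster's texts into words and weight them by frequency
    counts = Counter()
    for text in cluster_texts:
        counts.update(text.lower().split())
    return [w for w, _ in counts.most_common()]

def name_clusters(clusters):
    # name each cluster with its highest-weighted word; on a collision,
    # fall back to the next-highest-weighted word (first avoidance scheme)
    names, used = [], set()
    for cluster in clusters:
        for word in rank_words(cluster):
            if word not in used:
                names.append(word)
                used.add(word)
                break
    return names

clusters = [
    ["stock market rises", "market prices fall", "market report"],
    ["soccer market report", "market soccer fans", "market tickets"],
]
names = name_clusters(clusters)
```

In this toy corpus "market" is the top-weighted word of both clusters, so the second cluster falls back to its next-highest-weighted word, "soccer".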
Taxonomy generation is achieved by combining text clustering and cluster
naming with each other. Taxonomy generation refers to the process of
generating topics, their relations, and their associated texts from a corpus.
The taxonomy generation process consists of clustering texts by a clustering
algorithm and naming the clusters by the above process. Taxonomy generation
may be expanded into the process of constructing ontologies manually or
semi-automatically. Taxonomy generation will be mentioned in detail in
Chapter 15.

9.4.2 Subtext Clustering

This section is concerned with subtext clustering, which is illustrated
in Figure 123. A subtext means, in the broad sense, a part of a full text,
such as a paragraph, a sentence, or a word. The texts in the corpus are
partitioned into paragraphs, and the paragraphs are clustered into subgroups
of content-based similar ones. The results from clustering subtexts become
the source for synthesizing subtexts into an artificial text. In this section,
we describe the process of clustering subtexts.
A text may be viewed as a hierarchy of its components. The texts in the
corpus are partitioned into paragraphs by the carriage return. Each paragraph
is partitioned into sentences by the punctuation marks. Each sentence is
partitioned into words; paragraphs, sentences, and words all belong to
subtexts. Each full text is thus expressed in a hierarchical form whose root
node is the full text, whose intermediate nodes are paragraphs and sentences,
and whose terminal nodes are words.
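A minimal sketch of this decomposition, under the assumptions that paragraphs are separated by blank lines (standing in for the carriage return) and that sentences end with '.', '!', or '?':

```python
import re

def decompose(full_text):
    # build the hierarchy: full text -> paragraphs -> sentences -> words
    tree = {"text": full_text, "paragraphs": []}
    for para in [p.strip() for p in full_text.split("\n\n") if p.strip()]:
        sentences = [s.strip() for s in re.split(r"[.!?]", para) if s.strip()]
        tree["paragraphs"].append({
            "paragraph": para,
            "sentences": [{"sentence": s, "words": s.split()} for s in sentences],
        })
    return tree

doc = "Text clustering groups texts. It needs no labels.\n\nSubtexts are parts of texts."
tree = decompose(doc)
```

Any level of this tree below the root can then be flattened into a collection of subtexts and handed to a clustering algorithm.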

Fig. 123 Subtext Clustering

Even if a subtext is only a part of a given full text, it is textual data;
it can become a text clustering target. Subtexts are extracted from the texts
in the given collection, and they are encoded into numerical vectors. They are
clustered into subgroups based on their similarities by a clustering
algorithm. There is the possibility that subtexts within a given text are
scattered over different clusters. Note that the numerical vectors which
represent subtexts may be sparser than those which represent full texts.
Let us mention some tasks similar to the subtext clustering which is
covered in this section. Summary based clustering, which is mentioned in
Chapter 13, is the process of clustering texts by encoding their summaries
instead of their full texts. If the paragraphs in a long text are clustered
into subgroups, the paragraph subgroups are also called subtexts. A full text
is segmented into subtexts based on their contents through text segmentation,
and each such subtext is called a topic based subtext.
Let us point out some issues arising from subtext clustering. The numerical
vectors which represent subtexts tend to be sparser than the ones representing
full texts. Summary based clustering, which is mentioned in Chapter 13, is a
kind of subtext clustering for clustering texts fast. Clustering paragraphs
within a text is called unordered text segmentation, which is mentioned in
Chapter 14. How to select the subtexts from a full text is a critical factor
in this task.

9.4.3 Automatic Sampling for Text Categorization

This section is concerned with the process of generating sample texts
automatically for text categorization. The preliminary tasks for text
categorization are to predefine the categories as a list or a tree and to
allocate texts to each category as samples. These preliminary tasks can be
automated by two tasks: text clustering and cluster naming. A list or a tree
of cluster names is given as the predefined categories, and the texts which
belong to the clusters become the labeled sample texts. In this section, we
describe text clustering viewed as the process of automating these
preliminary tasks.
The preliminary tasks for text categorization, such as the category
predefinition and the sample text allocation, are tedious jobs. Topics or
categories are predefined as a list or a tree. Texts are collected from
external sources and labeled manually. The sample texts are encoded into
numerical vectors by the process which was described in Chapter 3. However,
a more reliable quality of sample texts can be expected for training the
text classifiers.
Let us explain the process of automating the above preliminary tasks for
text categorization. In this case, unlabeled texts are gathered from external
sources. The collection of the gathered texts is segmented into subgroups of
similar ones, and each cluster is named symbolically. The cluster names are
given as the categories, and the texts in each cluster are given as the
labeled ones. However, as the price of automating the preliminary tasks, we
must expect a less reliable quality of the sample texts which are generated
automatically by this process.
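The automation above reduces to a short pipeline; `cluster_texts` below is a hypothetical stand-in for a real clustering algorithm (it merely groups texts by their first word), and `name_cluster` names each cluster by its most frequent word, in the spirit of Section 9.4.1.

```python
def cluster_texts(texts):
    # hypothetical stand-in for a clustering algorithm:
    # group texts which share their first word
    groups = {}
    for t in texts:
        groups.setdefault(t.split()[0].lower(), []).append(t)
    return list(groups.values())

def name_cluster(cluster):
    # name a cluster by its most frequent word (cluster naming, Section 9.4.1)
    counts = {}
    for text in cluster:
        for w in text.lower().split():
            counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

def auto_sample(unlabeled_texts):
    # cluster names become the categories; cluster members become labeled samples
    samples = []
    for cluster in cluster_texts(unlabeled_texts):
        category = name_cluster(cluster)
        samples.extend((text, category) for text in cluster)
    return samples

texts = ["sports scores sports news", "sports injury", "finance report finance"]
labeled = auto_sample(texts)
```

The output pairs of text and category can be fed directly to a supervised text classifier, with the reliability caveat stated above.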
We need to customize the system when using text clustering for automating
the preliminary tasks for text categorization. We decide the classification
type, such as crisp vs fuzzy classification and flat vs hierarchical
classification. The clustering should be executed following the classification
type which we decide. Identifiers are assigned to the clusters, and the
clusters are named symbolically through the cluster naming which was mentioned
in Section 9.4.1. Optionally, we may add the process of redistributing the
cluster examples as sample examples for the binary classifications which are
decomposed from the multiple classification.
We should keep in mind the poor reliability of sample examples which are
generated automatically by the above process. We therefore need to consider
the learning type where both labeled and unlabeled examples are used. Learning
which combines supervised learning with unsupervised learning is called
semi-supervised learning. In Chapter 10, we mention the LVQ (Learning Vector
Quantization) as a semi-supervised learning algorithm. In clustering items
by the k means algorithm or the Kohonen Networks, it is required to decide
the number of clusters in advance.

9.4.4 Redundant Project Detection

This section is concerned with the process of detecting redundant projects
as a task derived from text clustering. The task which is covered in this
section is defined as the process of building groups of research project
proposals with very similar contents. The research project proposals, which
are given as texts, are entered, and project clusters are constructed by text
clustering. In 2003, Jo implemented a system for doing so, using the single
pass algorithm with its similarity threshold set close to 1.0 [24]. In this
section, we describe the process of detecting the redundant research projects
by clustering them.
A research project proposal consists of three parts: the project scope,
the project contents, and the project goals. The first part, the project
scope, is concerned with the area or coverage of the project which the
researcher tries to propose. The second part, the project contents, describes
what is proposed, implemented, or researched in the project [24]. The last
part, the project goals, indicates what the project is intended for. It is
assumed that each project proposal is given as an XML document in which the
three components are marked by tags.
Let us mention the process of detecting the redundancy of research project
proposals through text clustering. The research project proposals, each of
which consists of the three parts, are collected in the given system. The
collection is clustered into subgroups by a clustering algorithm; the single
pass algorithm was used as the clustering algorithm in [24]. The research
project proposals within each cluster are regarded as potential candidates
for redundant or associated proposals. One of two actions should be taken on
each cluster: selecting only one proposal, or integrating the projects into
a single project.
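A sketch of the single pass algorithm with a high similarity threshold; the cosine measure and the toy proposal vectors are illustrative assumptions of this sketch, not the data or system of [24].

```python
import math

def cosine(u, v):
    # cosine similarity between two proposal vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_pass(vectors, threshold=0.95):
    # assign each item to the first cluster whose mean vector is similar enough;
    # otherwise open a new cluster (a single pass over the data)
    clusters = []  # each cluster is a list of member indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            mean = [sum(vectors[j][d] for j in cluster) / len(cluster)
                    for d in range(len(vec))]
            if cosine(vec, mean) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# near-duplicate proposals 0 and 1; proposal 2 is distinct
proposals = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.0], [0.0, 0.2, 0.9]]
clusters = single_pass(proposals, threshold=0.95)
```

With the threshold set close to 1.0, only near-duplicate proposals fall into the same cluster, while distinct proposals end up as singleton "skeleton" clusters.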
We need to customize the text clustering to be more suitable for detecting
redundant project proposals. The direction is to generate a large number of
clusters, each of which contains a very small number of items. In this case,
the similarity threshold is set close to one in using the single pass
algorithm. Discriminating weights are assigned to the three components of a
proposal; higher weights are assigned to the research contents and the
research goals, and a lower weight is assigned to the research scope. More
than half of the clusters in the results are skeletons, each of which has
only one item.
Text association is an alternative way to text clustering for detecting
redundant research projects. Words are represented as text sets from the
corpus. Text association rules are extracted from the text sets by the
Apriori algorithm. Associated texts become candidates for redundant or
associated projects. In this case, the support threshold and the confidence
threshold are set close to 1.0.
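A pair-level sketch of this alternative; mapping each word to the set of texts containing it and testing word pairs against support and confidence thresholds set near 1.0 illustrates only the first level of the Apriori algorithm, not the full procedure, and the toy corpus is an assumption of this sketch.

```python
from itertools import combinations

def text_sets(corpus):
    # map each word to the set of texts (by index) that contain it
    sets = {}
    for i, text in enumerate(corpus):
        for word in set(text.lower().split()):
            sets.setdefault(word, set()).add(i)
    return sets

def associate_words(sets, n_texts, min_support=0.9, min_confidence=0.9):
    # first Apriori level only: keep word pairs whose co-occurring text set
    # satisfies both the support and the confidence thresholds
    rules = []
    for a, b in combinations(sorted(sets), 2):
        both = sets[a] & sets[b]
        support = len(both) / n_texts
        confidence = len(both) / len(sets[a])
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, both))
    return rules

corpus = ["deep learning survey", "deep learning overview", "deep learning review"]
rules = associate_words(text_sets(corpus), len(corpus))
```

The texts collected in the surviving rules' text sets are the redundancy candidates; with the thresholds close to 1.0, only words that co-occur in almost every text produce a rule.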

9.5 Summary and Further Discussions

In this chapter, we described text clustering in the functional view. We
defined clustering and compared it with other tasks. We surveyed the
dichotomies of opposite text clustering types. We mentioned the tasks which
are similar to text clustering and those which derive from it. In this
section, we make some further discussions on what we studied in this chapter.
Let us refer to association mining as the process of extracting implicit
knowledge from a collection of association rules. The rules are extracted
from a collection or from clusters of sets, and they become a kind of data
in this case. When they are classified or clustered, they become instances
of association mining. We need to define operations on association rules for
implementing an association mining system. Association rules which are
concerned with the relations among association rules may be extracted from
them; such rules are called meta association rules.
Let us consider online clustering, which applies clustering to data items
that arrive continually. The assumption underlying offline clustering, which
is the opposite of online clustering, is that all clustering targets are
given at one time. In online clustering, no data is available initially, and
whenever data items arrive, the clustering proceeds by updating the current
clusters. The traditional clustering algorithms which are mentioned in
Chapter 10 were inherently designed for offline clustering, so we need to
modify them into online clustering versions. Online clustering will be
mentioned in Chapter 10.
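An online update step can be sketched as follows; each arriving item either joins the nearest existing cluster, whose centroid is updated incrementally, or opens a new cluster. The Euclidean distance and the threshold value are illustrative assumptions, not a particular algorithm from Chapter 10.

```python
import math

class OnlineClusterer:
    # maintains centroids and updates them as items arrive one at a time
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.centroids = []  # running mean vector of each cluster
        self.counts = []     # number of items absorbed by each cluster

    def add(self, item):
        # find the nearest existing centroid
        best, best_dist = None, None
        for k, c in enumerate(self.centroids):
            dist = math.dist(item, c)
            if best is None or dist < best_dist:
                best, best_dist = k, dist
        if best is not None and best_dist <= self.threshold:
            # incremental mean update: c <- c + (x - c) / n
            self.counts[best] += 1
            n = self.counts[best]
            self.centroids[best] = [c + (x - c) / n
                                    for c, x in zip(self.centroids[best], item)]
            return best
        # no cluster is close enough: open a new one
        self.centroids.append(list(item))
        self.counts.append(1)
        return len(self.centroids) - 1

oc = OnlineClusterer(threshold=1.0)
labels = [oc.add(p) for p in [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]]
```

Unlike an offline algorithm, this sketch never revisits past items; only the running centroids summarize what has arrived so far.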
Taxonomy generation is for creating a frame for organizing the texts of a
corpus. It is defined as the process of generating a list, a tree, or a
network of topics, which are called taxonomies. Word categorization, word
clustering, and keyword extraction are involved in executing taxonomy
generation. Each taxonomy is linked to its own relevant texts and is
associated with its scripts. It will be described in detail in Chapter 15.
Organization evolution may be considered as the process of gradually
changing the text organization into a better one. Reorganizing the entire
texts is a hard action, whereas modifying some parts of the organization is
a soft action. Typical operations for maintaining the organization are
merging some clusters into a single cluster and dividing a cluster into
several ones. Outlier detection is considered as the process of detecting
whether a text is an alien or not, for creating one more cluster. The schemes
of evolving the organization will be described in detail in Chapter 16.
