06 Text Clustering
This chapter is concerned with the conceptual view of the text clustering task,
the process of segmenting a group of texts into subgroups, each of which
contains similar texts. Text clustering is divided into hard text clustering
and soft text clustering, depending on whether each text is allowed to belong
to more than one cluster or not. Depending on whether nested clusters are
allowed or not, it is divided into flat text clustering and hierarchical text
clustering. In this chapter, we describe text clustering from its functional view.
Keywords: Text Clustering, Hard Text Clustering, Soft Text Clustering,
Hierarchical Text Clustering
We present general discussions of text clustering in Section
9.1 and compare the data clustering task with other data mining tasks in
Section 9.2. In Section 9.3, we explore the mutually opposite types of text
clustering according to several dichotomies. We mention the real tasks which are
derived from text clustering in Section 9.4, and provide the summary and
further discussions on what we study in this chapter in Section 9.5.
Text clustering refers to the process of segmenting a group of texts into subgroups
of similar content. It is possible to cluster texts by their character
combinations, which is called lexical clustering, but that kind of clustering is
not covered in this chapter. Clustering texts based on their contents or
meanings is called semantic clustering, and it is what this study covers. It is
assumed that the texts are not labeled at all, so the unsupervised learning
algorithms which are mentioned in Chapter 10 are adopted for this task. In this
section, we describe text clustering briefly before discussing it in detail,
together with a review of the text categorization which was covered in the
previous part.
Let us briefly review the text categorization which was covered in the previous
part, in order to help characterize the text clustering. The text categorization
was described as the process of classifying texts into one or some
of the predefined categories. Supervised machine learning algorithms,
such as the K Nearest Neighbor, the Naïve Bayes, or the Support Vector Machine,
are used as the main approaches. The performance of text categorization systems
and approaches is evaluated by the accuracy or the F1 measure, which
were mentioned in Chapter 8. The text categorization and the text clustering
may also be integrated into a single task, rather than treated as two
independent tasks, as mentioned in Chapter 16.
The text clustering is characterized by several agenda which distinguish it from
other text mining tasks, such as the text categorization and the text
association. A group of texts is given as the input, and anonymous subgroups of
them are the results from clustering the texts. The text clustering depends
strongly on the similarities among the texts, so how to define the similarity
metric is a very important issue. Cluster prototypes are initialized randomly
or arbitrarily, and optimized to maximize the item similarities within each
cluster and to minimize the similarities between clusters. The set of texts is
never partitioned into a training set and a test set in evaluating text
clustering systems.
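As a minimal sketch of such a similarity metric, not the book's own implementation, the cosine similarity between TF-IDF vectors may be computed as follows; the scikit-learn library is assumed, and the two example texts are hypothetical.

    # A minimal sketch of a text similarity metric, assuming scikit-learn.
    # The two example texts are hypothetical illustrations.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    texts = ["the cat sat on the mat", "a cat lay on a rug"]

    # Encode the texts into TF-IDF vectors.
    vectors = TfidfVectorizer().fit_transform(texts)

    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones.
    similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
    print(similarity)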
The text clustering is coupled with the text categorization for automating
the preliminary tasks. Because manual labeling is a very tedious job, it is not
easy to gather labeled examples, whereas it is easy to gather unlabeled ones.
A labeled text collection can be constructed by clustering the unlabeled
texts and naming the clusters, and used as the sample texts for implementing
text categorization systems. The list of cluster names becomes the list of
predefined categories, and the texts within each cluster become the sample
texts which are labeled with its cluster name. The coupling of the two tasks
was proposed and implemented by Jo in 2006 [25].
The text clustering is compared with the two tasks which were covered in the
previous parts: the text categorization and the text association. In comparing
the text association and the text clustering with each other, the former defines
unidirectional relations among texts, while the latter defines bidirectional
relations among them. The preliminary tasks, such as the category predefinition
and the sample text preparation, are required for the text categorization. The
text clustering tends to be coupled with the text categorization into a single
task, as mentioned above. The association rules among texts which are generated
from the text association become useful context for doing the text clustering
and the text categorization.
This section is concerned with the generic data clustering in its functional
and conceptual view, and it consists of four subsections. The clustering itself
is described in its functional view in Section 9.2.1. In Section 9.2.2, the
clustering is compared with the association, in order to distinguish them from
each other. In Section 9.2.3, the clustering and the classification are compared
with each other. In Section 9.2.4, the constraint clustering is mentioned as
the special clustering where some labeled examples are provided.
9.2.1 Clustering
The outlier detection, as a related data
mining task, is the process of detecting and removing alien data items and
noisy ones for cleaning data items.
This section is concerned with the comparison of the text clustering with
the text association. The text association means the process of extracting
association rules in if-then forms between texts; it was studied in Chapter
4. As the distinguishing characteristic of the two tasks, the text association
pursues causal relations among texts, whereas the text clustering pursues
similarities among them. They are distinguished from each other, but
they may be combined with each other. In this section, we investigate the
differences between the tasks and connect them with each other.
We described the process of associating data items in Chapter 4. In the
association, item sets are initially given as the input. All candidate
association rules, each of which consists of a single item as the conditional
part and the items which are its causal parts, are extracted. For each
association rule, we compute its support and confidence, and select the rules
whose supports and confidences are higher than the given thresholds. One
difference between the two tasks is that in the association, item sets are
given as the input, whereas in the clustering, a single group of items is
given as the input.
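The following sketch illustrates how the support and confidence of a candidate rule may be computed over item sets; the transactions and the thresholds are hypothetical illustrations, not values from the book.

    # A minimal sketch of computing support and confidence for a rule X -> Y.
    # The transactions and thresholds are hypothetical illustrations.
    transactions = [
        {"text1", "text2"},
        {"text1", "text2", "text3"},
        {"text2", "text3"},
        {"text1"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(condition, consequence, transactions):
        # P(consequence | condition), estimated from the transactions.
        return (support(condition | consequence, transactions)
                / support(condition, transactions))

    rule = ({"text1"}, {"text2"})
    if support(rule[0] | rule[1], transactions) >= 0.5 and \
            confidence(rule[0], rule[1], transactions) >= 0.6:
        print("rule accepted:", rule)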
Table 12 illustrates the comparison of the two tasks, clustering and
association. In the association, item sets are given as the input, whereas a
single group of items is given as the input in the clustering. In the
association, if-then rules are generated as the output, whereas in the
clustering, clusters of similar items are generated. The results of the
association indicate causal relations among items, whereas the results of the
clustering indicate similarities. The causal relations are unidirectional,
which does not guarantee that the reverse holds, whereas the similarities are
bidirectional, which does.
The results from clustering data items may become the input for doing the
data association. The clusters which are generated as the output are viewed
as item sets. The association rules are extracted from the clusters, using
an association algorithm such as the apriori algorithm. It thus becomes
possible to find association rules from individual items by means of data
clustering. However, note that this type of data association is sensitive to
the results from the data clustering.
The word association rules are generated from texts by the above process.
The list of words which are given as the conditional parts of the association
rules becomes the list of taxonomies. For each association rule, the texts
which are relevant to the words in its causal part are retrieved.
In generating taxonomies from the association rules, each association rule is
transformed into a topic and the texts which belong to it. Texts should be
encoded into lists of words through text indexing before doing the word
association.
The classification was already covered in the previous part through the study
of the text categorization. We mention the classification in this section for
distinguishing the clustering from it. The classification task refers to the
process of assigning one or some of the predefined categories to each data
item. The clustering task looks similar to the classification in that each
item is arranged into one or some clusters. In this section, we explain the
classification briefly and compare it with the clustering task.
Let us review the process of classifying data items, which was covered in
Part 3. We predefine the categories as a list or a tree, and gather training
examples which are labeled with one or some of the predefined ones. The
classification capacity is constructed by the process called supervised
learning with the training examples. Novel examples are classified with the
constructed classification capacity. The supervised learning is performed by
minimizing the differences between the true labels and the classified ones of
the training examples.
The comparisons between the classification and the clustering are illustrated
in Table 13. The categories are defined automatically as the results in the
clustering, whereas they are defined manually in advance in the classification.
The supervised learning algorithms are applied to the classification, whereas
the unsupervised learning algorithms are applied to the clustering. In the
clustering, a group of data items is given as the input, whereas in the
classification, a single data item is given as the input. In the clustering,
subgroups, each of which contains similar data items, are generated as the
output, whereas in the classification, one or some among the predefined
categories are assigned to each item.
The results from clustering the data items are significant for doing the data
classification. The clustering results provide the predefined categories and
the sample texts which belong to each of them. A list of clusters or a cluster
hierarchy becomes the predefined category structure, and the texts which are
contained in each cluster become the sample ones which are labeled with its
corresponding cluster identifier. Here, the data clustering automates the
preliminary tasks for the text categorization. The results from clustering
data items are used as sample texts for training a supervised learning
algorithm for performing the text categorization.
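A minimal sketch of this coupling, assuming scikit-learn and a hypothetical corpus, is given below: the cluster assignments serve as labels for training a classifier.

    # A minimal sketch of coupling clustering with classification,
    # assuming scikit-learn; the corpus is a hypothetical illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import MultinomialNB

    corpus = ["stock market rises", "team wins the match",
              "shares fall sharply", "player scores a goal"]

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(corpus)

    # Cluster the unlabeled texts; cluster identifiers act as category labels.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    # Train a supervised classifier on the automatically labeled samples.
    classifier = MultinomialNB().fit(vectors, labels)
    print(classifier.predict(vectorizer.transform(["market crashes"])))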
The event detection and tracking was mentioned as a typical case of combining
the text categorization with the text clustering [2]. The event detection
means the process of deciding whether the current news article deals with a
particular event or not, as a special instance of text clustering [8]. The
event tracking is the process of tracing the events of other news articles
using the detected ones, as a special case of text categorization. News
articles about a particular event are grabbed by the event detection, and
related news articles are retrieved by the event tracking. Detection and
tracking of topics, as well as events, also exist.
This section is concerned with the constraint clustering, where some labeled
examples are provided. Figure 114 illustrates the constraint clustering as a
diagram. Both labeled examples and unlabeled ones are used for clustering
both kinds of them. Usually, only a small number of labeled examples is given
as references for clustering the unlabeled ones. In this section, we explain
the constraint clustering in detail and compare it with other tasks.
Let us compare the constraint clustering with the pure clustering which is
mainly covered in this part. In the pure clustering, it is assumed that all
examples are unlabeled, whereas in the constraint clustering, some examples
may be labeled by prior knowledge or accidentally. The pure clustering
proceeds depending purely on the similarities or distances among examples,
whereas the constraint clustering proceeds depending on the similarities or
distances together with the labels of some examples. In both the pure
clustering and the constraint clustering, the ultimate goal is to segment a
group of items into subgroups of similar ones. In the pure clustering, the
cluster prototypes are initialized at random, whereas in the constraint
clustering, they are initialized based on some labeled examples.
The constraint clustering is similar to the pure clustering except that some
labeled examples are given additionally. In the constraint clustering, a small
number of labeled examples and a large number of unlabeled ones are given as
the input; the unlabeled examples are the majority of the data items. The
labeled examples are arranged in advance into the clusters which correspond to
their labels, and the cluster prototypes are initialized based on the labeled
ones. The unlabeled examples are clustered by a clustering algorithm
afterward. Both kinds of examples are used for learning, as in the
semi-supervised learning.
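A minimal sketch of this procedure, assuming scikit-learn and hypothetical data, initializes the cluster prototypes from the few labeled examples before clustering the unlabeled ones; this seeded initialization is one possible reading of the procedure, not the book's own implementation.

    # A minimal sketch of constraint (seeded) clustering with hypothetical data:
    # cluster prototypes are initialized from the few labeled examples.
    import numpy as np
    from sklearn.cluster import KMeans

    labeled = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    labels = np.array([0, 0, 1, 1])
    unlabeled = np.array([[0.15, 0.1], [0.85, 0.95], [0.2, 0.25]])

    # Initialize each prototype as the mean of its labeled examples.
    seeds = np.array([labeled[labels == c].mean(axis=0) for c in (0, 1)])

    kmeans = KMeans(n_clusters=2, init=seeds, n_init=1).fit(
        np.vstack([labeled, unlabeled]))
    print(kmeans.labels_)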
Let us compare the constraint clustering with the semi-supervised learning.
In both tasks, both labeled examples and unlabeled ones are used, as mentioned
above. However, they have different goals; the semi-supervised learning is
intended for the classification and the regression, whereas the constraint
clustering is intended for the clustering. The semi-supervised learning
involves both supervised learning algorithms and unsupervised ones, while the
constraint clustering involves only unsupervised ones [91]. In the
semi-supervised learning, the unlabeled examples are used as additional ones,
whereas in the constraint clustering, the labeled ones are used as constraints
for clustering the data items.
Let us mention the alien cluster, which consists of a very small number of
data items. When plotting data points in a two dimensional space, it may be
discovered that some points are far from the main clusters. Such groups of
points are called alien clusters or outliers [110]. The sparse clusters are
generated as some among a large number of small sized clusters, or they may
be generated as noise in the case of a small number of large clusters. This
kind of cluster should be removed by the process which is called outlier
detection [110].
This section is concerned with the clustering types from various views. In
Section 9.3.1, we mention the static clustering versus the dynamic clustering.
In Section 9.3.2, we explain the two clustering types, the crisp clustering
and the fuzzy clustering. In Section 9.3.3, we describe the flat clustering
and the hierarchical clustering, depending on whether any nested cluster is
allowed or not. In Section 9.3.4, we introduce the single viewed and the
multiple viewed clustering.
This section is concerned with the two types of data clustering: the static
clustering and the dynamic clustering. The dichotomy for dividing the
clustering task into the two types is whether the data clusters as the
clustering results are fixed or continually variable. In the static
clustering, the results from clustering data items are fixed, whereas in the
dynamic clustering, the results are updated continually. In previous works on
data clustering, the static clustering has been assumed for explaining the
clustering algorithms. In this section, we explain the two types of clustering
individually and compare them with each other.
The static clustering is one where the results are fixed. Until now, the
clustering algorithms have been explained under the assumption of the static
clustering. A group of various data items is given as the input, and the
clusters of similar ones which are the results are fixed. However, there is
the possibility of adding more data items to the results or deleting some
from them, so the data items should be reorganized accordingly. This real
situation is overlooked in the static clustering.
Let us consider the dynamic clustering, which is opposite to the static
clustering. The dynamic clustering is one where, as more data items are added
or some items are deleted continually, the data items are reorganized
accordingly. There are two kinds of reorganization: the hard reorganization,
which organizes all of the data items again by a clustering algorithm, and
the soft reorganization, which merges some clusters into one or divides one
into several clusters for reorganizing the data items. We need to measure the
quality of the current organization in order to decide whether the items
should be reorganized or not. The dynamic clustering will be explained in
detail in Chapter 16.
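As a minimal sketch of this decision loop, not the book's own procedure, the silhouette score may serve as a stand-in for the organization quality; the data, the threshold of 0.5, and the choice of score are hypothetical assumptions, and scikit-learn is assumed.

    # A minimal sketch of the dynamic clustering loop with hypothetical data:
    # when a newly added item degrades the organization quality, re-cluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    data = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9]])
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

    for new_item in ([0.15, 0.2], [0.5, 0.5]):
        data = np.vstack([data, new_item])
        labels = model.predict(data)
        # Silhouette score as a (hypothetical) organization quality measure.
        if silhouette_score(data, labels) < 0.5:
            # Hard reorganization: cluster all of the data items again.
            model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print(model.labels_)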
The static clustering and the dynamic clustering are compared with each
other. The assumption in the static clustering is that all of the data items
are given at a time, whereas that in the dynamic clustering is that data items
are added and deleted continually. In the static clustering, a single group
of data items is clustered into subgroups only one time, whereas in the
dynamic clustering, the data items are clustered continually after the initial
clustering. In the static clustering, the organization of data items is fixed,
whereas in the dynamic clustering, it is variable. In the static clustering,
the operations such as the addition and deletion of data items after
clustering are not considered, whereas they are supported in the dynamic
clustering.
This section is concerned with the two types of clustering: the crisp
clustering and the fuzzy clustering. The dichotomy criterion is whether each
item is allowed to belong to more than one cluster or not. The crisp
clustering is one where every item belongs to only one cluster, whereas the
fuzzy clustering is one where at least one item belongs to more than one
cluster. In the fuzzy clustering, rather than simply belonging to more than
one cluster, each item is assigned membership values of the clusters. In this
section, we explain the two types of clustering and their differences.
We present an example of the crisp clustering in Figure 115. The table in
Figure 115 is given as a matrix of the eight items and the four clusters. All
of the eight items belong to only one of the four clusters; no item belongs to
more than one cluster. As the results from this exclusive clustering, cluster
1 contains items 1 and 6, cluster 2 contains items 2, 4, 5, and 8, cluster 3
contains only item 7, and cluster 4 contains item 3. No overlapping between
clusters exists in the crisp clustering.
A simple example of the fuzzy clustering is illustrated in Figure 116. The
table in Figure 116 has the same frame as that in Figure 115. Five of the
eight items are labeled with more than one category. Item 2 is labeled with
three categories, and four items are labeled with two categories. If at least
one item is labeled with more than one category, the result is one of fuzzy
clustering.
Table 14 illustrates the differences between the crisp clustering and the
fuzzy clustering which are covered in this section. The crisp clustering does
not allow any overlapping between clusters, whereas the fuzzy one allows it.
The membership values are given as binary values in the crisp clustering,
whereas they are given as continuous values between zero and one in the fuzzy
clustering. In the crisp clustering, each data item is arranged into its most
similar cluster, whereas in the fuzzy clustering, it is arranged into every
cluster whose similarity is more than the given threshold. The crisp
clustering is usually applied to the opinion clustering, where each data item
is arranged into one of the positive cluster, the neutral one, and the
negative one, whereas the fuzzy clustering is applied to the topic based text
clustering, where each data item is arranged into its relevant topic clusters.
In Figure 117, the item-cluster matrix is illustrated as the results from the
fuzzy clustering. There are two types of fuzzy clustering results: one is the
clusters of items, where some items belong to more than one cluster, and the
other is the item-cluster matrix which is shown in Figure 117. In the matrix
frame, each column corresponds to a cluster, and each row corresponds to an
item. The entry at the intersection of a column and a row is the membership
value of the item to the cluster. The item-cluster matrix will be mentioned
again for explaining the fuzzy k means algorithm in Chapter 10.
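A minimal sketch of building such an item-cluster matrix with hypothetical items and prototypes is shown below; the memberships are computed from inverse distances to the cluster prototypes, in the spirit of, but not identical to, the fuzzy k means algorithm.

    # A minimal sketch of an item-cluster membership matrix, with hypothetical
    # items and prototypes; memberships are inverse-distance based, and each
    # row sums to one.
    import numpy as np

    items = np.array([[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]])
    prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])

    # Distances of every item to every cluster prototype.
    distances = np.linalg.norm(items[:, None, :] - prototypes[None, :, :], axis=2)

    # Inverse-distance memberships, normalized so that each row sums to one.
    inverse = 1.0 / (distances + 1e-9)
    memberships = inverse / inverse.sum(axis=1, keepdims=True)
    print(memberships)  # rows: items, columns: clusters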
9.3.3 Flat Clustering and Hierarchical Clustering
This section is concerned with the two opposite types of clustering: the flat
clustering and the hierarchical clustering. The dichotomy criterion of the two
types is whether nested clusters are allowed in a particular cluster. The flat
clustering is one where no nested cluster is available, while the hierarchical
one is one where at least one nested cluster is available. In the flat
clustering, a list of clusters is given as the results, while in the
hierarchical clustering, a tree of clusters is given as the results. In this
section, we explain the two types of clustering and their comparisons.
Figure 118 illustrates the flat clustering and the hierarchical clustering.
The flat clustering is shown on the left side of Figure 118. In the flat
clustering, no cluster is nested in any other cluster, and a list of clusters
is given as the results. The k means algorithm, which is mentioned in Section
10.2.1, is usually used as the approach to this kind of clustering. The flat
clustering is used for dividing a group of data items into several subgroups
based on their similarities.
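As a minimal sketch of the two types with hypothetical data, assuming scikit-learn, the flat clustering may be performed by the k means algorithm and the hierarchical one by agglomerative clustering; these are illustrative choices, not the only approaches.

    # A minimal sketch of flat versus hierarchical clustering on hypothetical
    # data, assuming scikit-learn.
    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    data = np.array([[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.5]])

    # Flat clustering: a list of clusters.
    flat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

    # Hierarchical clustering: the merge tree is recorded in children_.
    agglo = AgglomerativeClustering(n_clusters=2).fit(data)
    print(flat, agglo.labels_, agglo.children_)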
This section is concerned with the two types of clustering: the single viewed
clustering and the multiple viewed one. The dichotomy for dividing the
clustering task into the two types is whether multiple results are allowed to
be accommodated or not. Even with the same clustering algorithm, different
results may be produced depending on its external parameter values. As the
difference between the hierarchical clustering and the multiple viewed
clustering, a single tree of clusters is generated from the former, whereas a
forest which consists of multiple trees is produced from the latter. In this
section, we explain the two types of clustering tasks and their comparisons.
The single viewed clustering is conceptually illustrated in Figure 119. From
a group of items, several clusters are generated by a clustering algorithm.
The clusters are regarded as the entire results, whether they are overlapping
or exclusive. In the area of machine learning and data mining, it has until
now been assumed that the results from clustering data items by a single
algorithm are everything. However, we may need to accommodate other clustering
results which are generated from other views.
As shown in Figure 120, in the multiple viewed clustering, various results
from clustering data items are accommodated. The fact that data items are
organized differently by different subjects, even in manual clustering, is the
motivation for considering this type of data clustering. The results from
clustering data items differ depending on the values of the external
parameters, even with the same algorithm. For example, in the case of the k
means algorithm, different results are caused by changing the number of
clusters. In this type of clustering, multiple different results from
clustering data items are allowed as their organizations.
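A minimal sketch of such multiple organizations, assuming scikit-learn and hypothetical data, is to run the k means algorithm with different numbers of clusters and keep every resulting partition.

    # A minimal sketch of multiple viewed clustering on hypothetical data:
    # each value of k yields its own organization, and all of them are kept.
    import numpy as np
    from sklearn.cluster import KMeans

    data = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9], [0.5, 0.4]])

    organizations = {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        for k in (2, 3, 4)
    }
    for k, labels in organizations.items():
        print(k, labels)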
We may need to integrate multiple results from clustering data items into a
single one. Multiple different results, which are called multiple
organizations, were mentioned above, and one of them may be selected by a
user. Alternatively, by merging similar clusters, they are integrated into one
as the union of the multiple organizations. The scheme of integrating multiple
organizations which are clustering results is described in detail in [112].
The hierarchical clustering and the multiple viewed clustering are illustrated
in Figure 121. More than one list of clusters exists in both clustering types,
so they may be confused with each other. A single organization of data items
is given as the results in the hierarchical clustering, whereas the multiple
organizations are given independently of each other in the multiple viewed
clustering. In other words, a single tree of clusters is given as the results
in the hierarchical clustering, whereas a forest of trees is given as the
results in the multiple viewed clustering.
This section is concerned with the specific tasks of data clustering and other
tasks which derive from it, and it consists of four subsections. In Section
9.4.1, we mention the cluster naming, which is the task subsequent to the text
clustering. In Section 9.4.2, we describe the subtext clustering as the
process of clustering virtual texts generated from full texts. In Section
9.4.3, we explain the process of using the text clustering results as sample
texts for performing the text categorization. In Section 9.4.4, we cover the
process of applying the text clustering for detecting redundant national
projects.
Figure 122 illustrates the process of assigning names to the clusters which
are the results from clustering data items. The cluster naming is the process
of identifying each cluster with a symbolic name which is relevant to its
contents. We will mention some principles of naming clusters; identifying each
cluster with a primary key value or a numerical value, irrelevantly to its
content, is out of the scope of cluster naming. Reflecting the contents of the
texts in each cluster in its name is the requirement of cluster naming. In
this section, we describe the cluster naming for enabling the browsing of text
clusters.
The principles of naming clusters symbolically were defined by Jo in his 2006
PhD dissertation [25]. The first principle of naming clusters is to reflect
the cluster contents in their symbolic names. The second principle is that
each cluster name should not be too long; a cluster name consists of only a
couple of words. The third principle is that each cluster name should be
unique; there is no redundant name among the clusters. The cluster names are
helpful for accessing texts by browsing.
In 2006, Jo described the process of naming clusters based on the above
principles in his PhD dissertation [25]. In each cluster, its texts are
indexed into a list of words with the basic three steps which were mentioned
in Chapter 2. For each cluster, the weights of the indexed words are computed,
and the words with the highest weights are selected as the cluster name
candidates. If two or more clusters have the same name, an action is taken by
adopting one among the schemes which will be mentioned in the next paragraph.
Other approaches are also available, and the cluster naming may be regarded as
a text mining task which is separate from the text clustering.
Two or more clusters with the same name violate the third principle, so let us
mention the schemes of avoiding the redundancy; the first of them is
illustrated in the sketch below. In one scheme, the word with the highest
weight is replaced by the one with the second highest weight in naming the
cluster. As an alternative scheme, the clusters with the redundant name are
merged into one cluster. We may also use multiple schemes of weighting words,
and the word is replaced by one which is weighted by another scheme. In spite
of avoiding the redundant cluster names, it is possible that clusters which
are named with different words have identical meanings.
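A minimal sketch of this naming process, with hypothetical clusters and assuming scikit-learn for the word weighting, picks the highest weighted word per cluster and falls back to the next candidate word on a name collision; this is an illustrative reading of the process, not the dissertation's implementation.

    # A minimal sketch of cluster naming with hypothetical clusters: the
    # highest weighted word names each cluster, and the next candidate word
    # is used on a collision.
    from sklearn.feature_extraction.text import TfidfVectorizer

    clusters = [["stock market rises", "shares and stock fall"],
                ["team wins the match", "the team scores"]]

    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit([text for cluster in clusters for text in cluster])
    words = vectorizer.get_feature_names_out()

    names, used = [], set()
    for cluster in clusters:
        weights = vectorizer.transform(cluster).sum(axis=0).A1
        # Candidate words, from the highest weight downward.
        for index in weights.argsort()[::-1]:
            if words[index] not in used:
                names.append(words[index])
                used.add(words[index])
                break
    print(names)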
The taxonomy generation is achieved by combining the text clustering and the
cluster naming with each other. The taxonomy generation refers to the process
of generating the topics, their relations, and their associated texts from a
corpus. The taxonomy generation process consists of clustering the texts by a
clustering algorithm and naming the clusters by the above process. The
taxonomy generation may be expanded to the process of constructing ontologies
manually or semi-automatically. The taxonomy generation will be mentioned in
detail in Chapter 15.
Even if a subtext is only a part of a given full text, it still belongs to
textual data, so it can become a text clustering target. Subtexts are
extracted from the texts in the given collection, and they are encoded into
numerical vectors. They are clustered into subgroups based on their
similarities by a clustering algorithm. There is the possibility that the
subtexts within a given text are scattered over different clusters. Note that
the numerical vectors which represent subtexts may be sparser than those which
represent full texts.
Let us mention some tasks similar to the subtext clustering which is covered
in this section. The summary based clustering, which is mentioned in Chapter
13, is the process of clustering texts by encoding their summaries instead of
their full texts. When clustering the paragraphs in a long text into
subgroups, the paragraph subgroups are also called subtexts. A full text is
segmented into subtexts based on their contents through text segmentation,
and each such subtext is called a topic based subtext.
Let us point out some issues of the subtext clustering. The numerical vectors
which represent subtexts tend to be sparser than the ones representing full
texts. The summary based clustering, which is mentioned in Chapter 13, belongs
to this kind of subtext clustering for clustering texts fast. Clustering the
paragraphs within a text is called unordered text segmentation, which is
mentioned in Chapter 14. How to select the subtexts from a full text is a
critical factor in this task.
This section is concerned with the process of generating sample texts
automatically for the text categorization. The preliminary task for the text
categorization is to predefine the categories as a list or a tree and to
allocate texts to each category as samples. The preliminary tasks are
automated by the two tasks: the text clustering and the cluster naming. A list
or a tree of cluster names is given as the predefined categories, and the
texts which belong to the clusters become the sample labeled texts. In this
section, we describe the process of automating the preliminary tasks for the
text categorization.