2008
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions is typical. Practical approaches to clustering such document vectors use an iterative procedure (e.g., k-means, EM) that is known to be especially sensitive to initial starting conditions (k and the initial centroids). In this paper, we introduce a hybrid clustering algorithm that determines these initial conditions automatically, depending on the required quality of the obtained clusters. The hybrid algorithm combines the agglomerative hierarchical approach with the k-means approach to provide k disjoint clusters. However, the textual, unstructured nature of documents makes the task considerably more difficult than for other data sets. We present the results of an experimental study of the introduced algorithm.
2005
We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than either of its individual components.
Everyday vast amounts of documents, e-mails, and web pages are generated. In order to handle these data, automatic techniques such as document clustering are needed.
Encyclopedia of Data Warehousing and Mining, Second Edition, 2009
Lecture Notes in Computer Science, 2005
Document Clustering Methods

Hierarchical Clustering Methods

One popular approach in document clustering is agglomerative hierarchical clustering (Kaufman and Rousseeuw, 1990). Algorithms in this family build the hierarchy bottom-up by iteratively computing the similarity between all pairs of clusters and then merging the most similar pair. Different variations may employ different similarity measuring schemes (Zhao and Karypis, 2001; Karypis, 2003). Steinbach (2000) shows that the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Kaufman and Rousseeuw, 1990) is the most accurate one in its category. The hierarchy can also be built top-down, which is known as the divisive approach. It starts with all the data objects in the same cluster and iteratively splits a cluster into smaller clusters until a certain termination condition is fulfilled. Methods in this category usually suffer from their inability to perform adjustment once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, UPGMA is not scalable for handling large data sets in document clustering, as experimentally demonstrated in (Fung, Wang, and Ester, 2003).

Partitioning Clustering Methods

K-means and its variants (Larsen and Aone, 1999; Kaufman and Rousseeuw, 1990; Cutting, Karger, Pedersen, and Tukey, 1992) represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The k-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. The bisecting k-means algorithm first selects a cluster to split, and then employs basic k-means to create two sub-clusters, repeating these two steps until the desired number k of clusters is reached.
Steinbach (2000) shows that the bisecting k-means algorithm outperforms basic k-means as well as agglomerative hierarchical clustering in terms of accuracy and efficiency (Zhao and Karypis, 2002). Both the basic and the bisecting k-means algorithms are relatively efficient and scalable, and their complexity is linear in the number of documents. As they are easy to implement, they are widely used in different clustering applications. A major disadvantage of k-means, however, is that an incorrect estimation of the input parameter, the number of clusters, may lead to poor clustering accuracy. Also, the k-means algorithm is not suitable for discovering clusters of largely varying sizes, a common scenario in document clustering. Furthermore, it is sensitive to noise that may have a significant influence on the cluster centroid, which in turn lowers the clustering accuracy. The k-medoids algorithm (Kaufman and Rousseeuw, 1990; Krishnapuram, Joshi, and Yi, 1999) was proposed to address the noise problem, but this algorithm is computationally much more expensive and does not scale well to large document sets.
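The basic and bisecting k-means procedures discussed above can be sketched as follows. This is a minimal illustration in Python/NumPy; the split-selection rule (largest cluster first), the fixed iteration count, and the fallback for a degenerate split are assumptions made for the sketch, not details taken from the cited papers:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Basic k-means: iteratively refine k randomly chosen centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def bisecting_kmeans(X, k):
    """Repeatedly split a cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # pick the largest cluster to split (one common selection rule)
        i = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        idx = clusters.pop(i)
        labels, _ = kmeans(X[idx], 2)
        left, right = idx[labels == 0], idx[labels == 1]
        if len(left) == 0 or len(right) == 0:
            left, right = idx[:1], idx[1:]  # fallback for a degenerate split
        clusters.append(left)
        clusters.append(right)
    assignment = np.empty(len(X), dtype=int)
    for cid, idx in enumerate(clusters):
        assignment[idx] = cid
    return assignment
```

The linear-in-documents complexity noted above comes from the assignment step: each of the n documents is compared against only k (or, for bisecting, 2) centroids per iteration.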
The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), 2005
Text clustering is a difficult task in text processing, since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. The task can be difficult even for humans, because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications, in which huge collections or quite long query result lists must be automatically organized.
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is widely studied because of its broad application in several areas, such as web mining, search engines, and information extraction. It groups documents based on various similarity measures. The existing K-means document clustering algorithm was based on random center generation, and the clusters it generated differed on every run. In this paper, an Improved Document Clustering algorithm is presented that generates a number of clusters for any text documents based on fixed center generation, collects only exclusive words from the different documents in the dataset, and uses cosine similarity measures to place similar documents in the proper clusters. Experimental results showed that the accuracy of the proposed algorithm is higher than that of the existing algorithm in terms of F-measure, recall, precision, and time complexity.
Data Mining and Knowledge Discovery, 2005
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for their interactive visualization and exploration, as they provide data-views that are consistent, predictable, and at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clustering algorithms called constrained agglomerative algorithms, which combine features from both partitional and agglomerative approaches, allowing them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of clustering solutions. The experimental evaluation shows that, contrary to common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections due not only to their relatively low computational requirements but also to their higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone, and in many cases they outperform partitional methods as well.
2017 13th International Computer Engineering Conference (ICENCO), 2017
The k-means algorithm is a well-known clustering algorithm due to its simplicity. Unfortunately, the output of k-means depends on the initialization of the cluster centroids. In this paper, we propose a new hybrid approach for document clustering which uses the output of single pass clustering (SPC) as an initialization for the k-means algorithm. We aim to get the advantages of careful seeding with single pass clustering and the benefits of the k-means algorithm. The experimental results show that the proposed approach outperforms the traditional k-means algorithm in both unsupervised and supervised evaluation measures, especially when the number of required clusters increases.
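The seeding idea described above, running a single pass over the data to produce cluster centers before k-means starts, can be sketched roughly as below. The Euclidean distance and the fixed threshold parameter are assumptions for illustration, not details from the paper (a text system would more likely use cosine similarity over term vectors):

```python
import numpy as np

def single_pass_centroids(X, threshold):
    """Single-pass clustering: one scan over the data. Each point joins the
    nearest existing cluster if it is within `threshold`, otherwise it
    starts a new cluster. The resulting centroids can seed k-means."""
    centroids, counts = [], []
    for x in X:
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] <= threshold:
                # incremental (running-mean) centroid update
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]
                continue
        centroids.append(x.astype(float).copy())
        counts.append(1)
    return np.array(centroids)
```

Note that the number of clusters falls out of the threshold rather than being fixed in advance, which is one way such a first pass can also estimate k for the refinement stage.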
… conference on Knowledge discovery and data mining
Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
2006
The steady increase of information on the WWW, digital libraries, portals, databases, and local intranets gave rise to the development of several methods to help users in information retrieval, information organization, and browsing. Clustering algorithms are of crucial importance when there are no labels associated with textual information or documents. The aim of clustering algorithms, in the text mining domain, is to group documents concerning the same topic into the same cluster, producing a flat or hierarchical structure of clusters. In this paper we present a Knowledge Discovery System for document processing and clustering. The clustering algorithm implemented in this system, called Induced Bisecting k-Means, outperforms the Standard Bisecting k-Means and is particularly suitable for online applications where computational efficiency is a crucial aspect.
2014
Clustering is an automatic learning technique aimed at grouping a set of objects into subsets or clusters. Objects in the same cluster should be as similar as possible, whereas objects in one cluster should be as dissimilar as possible from objects in the other clusters. Document clustering has become an increasingly important task in analysing huge collections of documents. The challenging aspect of analysing such enormous collections is to organise them in a way that facilitates better search and knowledge extraction without introducing extra cost and complexity. Document clustering has played an important role in many fields, such as information retrieval and data mining. In this paper, document clustering is first proposed using Hierarchical Agglomerative Clustering and the K-Means Clustering Algorithm. Here, the approach is purely based on the frequency count of the terms present in the documents, where the context of the documents is totally ignored. Therefore, the method is modified by incorporati...
2016
As the number of electronic documents generated from worldwide sources increases, it is hard to manually organize, analyze, and present these documents efficiently. Document clustering is one of the traditional data mining techniques and an unsupervised learning paradigm. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. The K-Means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time. However, the major problem with this algorithm is that it is sensitive to the selection of the initial centroids and may converge to local optima. The algorithm takes the initial cluster centers arbitrarily, so it does not always guarantee good clustering resu...
Automatic document clustering has played an important role in the field of information retrieval. The aim of the developed system is to store documents in clusters and to improve retrieval efficiency. Clustering is a technique aimed at grouping a set of objects into clusters. Document clustering is the task of combining a set of documents into clusters so that similar documents are stored in one cluster. We applied a non-overlapping method to store documents into clusters. In this project, we write an algorithm which calculates the similarity of a document's keywords; according to its similarity score, the document is either put into an existing cluster, or a new cluster is created and the document is stored in it. To find keywords in a document, various techniques are used, such as tokenization, stop-word removal, stemming, and TF*IDF calculation.
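The keyword-weighting pipeline mentioned above (tokenization, stop-word removal, TF*IDF) can be sketched as follows. The tiny stop-word list, the regex tokenizer, and the particular idf formula (log N/df) are illustrative assumptions, not details from this paper; stemming is omitted to keep the sketch self-contained:

```python
import math
import re

# tiny illustrative stop-word list; real systems use much larger ones
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: tf*idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    weights = []
    for toks in tokenized:
        w = {}
        for t in toks:
            w[t] = w.get(t, 0) + 1        # raw term frequency
        for t in w:
            w[t] *= math.log(n / df[t])   # idf = log(N / df)
        weights.append(w)
    return weights
```

Terms that appear in every document get an idf of log(1) = 0, so they carry no weight when documents are later compared, which is exactly why such weighting helps the similarity step.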
2016
In today's era of the World Wide Web, there is a tremendous proliferation in the amount of digitized text documents. As there is a huge collection of documents on the web, there is a need to group the set of documents into clusters. Document clustering plays an important role in effectively navigating and organizing the documents. The K-Means clustering algorithm is the most commonly used document clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time. The major problem with this algorithm is that it is quite sensitive to the selection of the initial cluster centroids. The algorithm takes the initial cluster centers arbitrarily, so it does not always promise good clustering results. If the initial centroids are incorrectly determined, the remaining data points with the same similarity scores may fall into the dif...
Data & Knowledge Engineering, 2010
The use of centroids as prototypes for clustering text documents with the k-means family of methods is not always the best choice for representing text clusters, due to the high dimensionality, sparsity, and low quality of text data. Especially in cases where we seek clusters with a small number of objects, the use of centroids may lead to poor solutions under bad initial conditions. To overcome this problem, we propose the idea of a synthetic cluster prototype, computed by first selecting a subset of cluster objects (instances), then computing the representative of these objects, and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype that favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach, especially for small datasets and clusters overlapping in many dimensions, and its superior performance against traditional and subspace clustering methods.
Document clustering is the collection of similar documents into classes, where the similarity is some function on the documents. Document clustering does not require a separate training process or manual tagging of groups in advance. Documents in the same cluster are more similar, while documents in different clusters are more dissimilar. It is one of the familiar techniques used in data analysis and is applied in many areas, including data mining, statistics, and image analysis. Traditional clustering approaches lose effectiveness when handling high-dimensional data. For this reason, a new K-Means clustering technique is proposed in this work. Here, cosine similarity in the vector space model is used for clustering against the centroid. Using this approach, documents can be clustered efficiently even when the dimension is high, because the vector space representation of documents is suitable for high dimensions.
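Cosine similarity over sparse term-weight vectors, as used in the vector space model mentioned above, can be sketched as below. The dict-of-weights representation is an illustrative choice for sparse vectors, not a detail from the paper:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors,
    each given as a {term: weight} dict. Only shared terms contribute
    to the dot product, which is why this works well in high dimensions."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the measure is normalized by vector length, it compares the direction of two documents rather than their magnitude, so long and short documents about the same topic still score as similar.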
Over the past few decades, the volume of existing text data has increased exponentially. Automatic tools to organize these huge collections of documents are becoming unprecedentedly important. Document clustering is important for automatically organizing documents into clusters. Most clustering algorithms process document collections as a whole; however, it is important to process these documents dynamically. This research aims to develop an incremental algorithm for hierarchical document clustering in which each document is processed as soon as it is available. The algorithm is based on two well-known data clustering algorithms (COBWEB and CLASSIT), which create hierarchies of probabilistic concepts and have seldom been applied to text data. The main contribution of this research is a new framework for incremental document clustering, based on extended versions of these algorithms in conjunction with a set of traditional techniques modified to work in incremental environments.
Künstliche Intelligenz (KI), 2001
Text clustering typically involves clustering in a high-dimensional space, which is difficult in virtually all practical settings. In addition, given a particular clustering result, it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. In this paper, we propose a new approach for applying background knowledge during preprocessing in order to improve clustering results and allow for selection between results. We preprocess our input data applying an ontology-based heuristic for feature selection and feature aggregation. Thus, we construct a number of alternative text representations. Based on these representations, we compute multiple clustering results using K-Means. The results may be distinguished and explained by the corresponding selection of concepts in the ontology. Our results compare favourably with a sophisticated baseline preprocessing strategy.
KDD workshop on text mining, 2000
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.

1 Background and Motivation

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbors of a document [BL85]. More recently, clustering has been proposed for use in browsing a collection of documents [CKPT92] or in organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97]. (The automatic generation of a taxonomy of Web documents like that provided by Yahoo!