Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
Lecture Notes in Computer Science
In this paper we present a new algorithm for document clustering called Condensed Star (ACONS). This algorithm is a natural evolution of the Star algorithm proposed by Aslam et al., and improved by them and other researchers. In this method, we introduced a new concept of star allowing a different star-shaped form; in this way we retain the strengths of previous algorithms as well as address previous shortcomings. The evaluation experiments on standard document collections show that the proposed algorithm outperforms previously defined methods and obtains a smaller number of clusters. Since the ACONS algorithm is relatively simple to implement and is also efficient, we advocate its use for tasks that require clustering, such as information organization, browsing, topic tracking, and new topic detection.
2008 IEEE International Conference on Data Mining Workshops, 2008
In this paper a new algorithm, called CStar, for document clustering is presented. This algorithm improves recently developed algorithms like Generalized Star (GStar) and ACONS algorithms, originally proposed for reducing some drawbacks presented in previous Star-like algorithms. The CStar algorithm uses the Condensed Star-shaped Subgraph concept defined by ACONS, but defines a new heuristic that allows to construct a new cover of the thresholded similarity graph and to reduce the drawbacks presented in GStar and ACONS algorithms. The experimentation over standard document collections shows that our proposal outperforms previously defined algorithms and other related algorithms used to document clustering.
2008
In this paper, we introduce a new clustering algorithm for obtaining labeled document clusters that accurately identify the topics of a text collection. In order to determine the topics, our approach relies on both probable term pairs generated from the collection and the estimation of the topic homogeneity associated to term pair clusters. Experimental results obtained over two benchmark text collections demonstrate the utility of this new approach.
International Journal of Applied Information Systems, 2012
Document clustering is automatic organization of documents into clusters so that documents within a cluster have high similarity in comparison to documents in other clusters. It has been studied intensively because of its wide applicability in various areas such as web mining, search engines, and information retrieval. It is measuring similarity between documents and grouping similar documents together. It provides efficient representation and visualization of the documents; thus helps in easy navigation also. In this paper, we have given overview of various document clustering methods studied and researched since last few years, starting from basic traditional methods to fuzzy based, genetic, coclustering, heuristic oriented etc. Also, the document clustering procedure with feature selection process, applications, challenges in document clustering, similarity measures and evaluation of document clustering algorithm is explained.
Automatic document clustering has played an important role in the field of information retrieval. The aim of the developed this system is to store documents in clusters and to improve its retrieval efficiently. Clustering is a technique aimed at grouping a set of objects into clusters. Document clustering is the task of combining a set of documents into clusters so that similar type of documents will be store in one cluster. We applied non overlapping method to store document into cluster. In this project, we write an algorithm which will calculate similarity of document’s keywords and according to its similarity points it will either put into existing cluster or new cluster is created and stored into that cluster. To find keywords from document various techniques are used like tokenization, stop word removal, stemmer, TF*IDF calculation
2012
There are two important problems worth conducting research in the fields of personalized information services based on user model. One is how to get and describe user personal information, i.e. building user model, the other is how to organize the information resources, i.e. document clustering. It is difficult to find out the desired information without a proper clustering algorithm. Several new ideas have been proposed in recent years. But most of them only took into account the text information, but some other useful information may have more contributions for documents clustering, such as the text size, font and other appearance characteristics, so called visual features.In this paper we introduce a new technique called Closed Document Clustering Method (CDCM) by using advanced clustering metrics. This method enhances the previous method of cluster the scientific documents based on visual features, so called VF-Clustering algorithm. Five kinds of visual features of documents are...
research.ijcaonline.org
As text documents are largely increasing in the internet, the process of grouping similar documents for versatile applications have put the eye of researchers in this area. However most clustering methods suffer from challenges in dealing with problems of high dimensionality, scalability, accuracy and meaningful cluster labels. This paper presents a review on all these well known methods of document clustering. Hierarchical document clustering method is explained in detail. Study shows that hierarchical document clustering performs well but still there is a scope to improve above mentioned problems.
Encyclopedia of Data Warehousing and Mining, Second Edition, 2009
The measure of information recorded and the content data accessible on the web has been surprisingly expanding, gathering and enlarging with every day. Such information and data which is accessible in high voluminous structure is really not accessible in a structure which is reasonable for content handling as the information accessible is generally unclear, amorphous or unstructured. Content mining is a sub field of information mining which goes for investigating the valuable data from the recorded assets. Content mining has three significant difficulties. They are high dimensionality, embraced remove measures, accomplishing quality bunches and improved classifier exactnesses. Grouping of archive is significant with the end goal of report association, rundown, subject extraction and data recovery in a proficient manner. At first, grouping is connected for upgrading the data recovery procedures. Recently, bunching strategies have been connected in the territories which include perusing the assembled information or in ordering the result given by the web indexes to the answer to the question raised by the clients. In this paper, we are giving an exhaustive review over the archive bunching.
KDD workshop on text mining, 2000
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data. 1 Background and Motivation Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbors of a document [BL85]. More recently, clustering has been proposed for use in browsing a collection of documents [CKPT92] or in organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97]. (The automatic generation of a taxonomy of Web documents like that provided by Yahoo!
2015
Data mining , knowledge discovery is the process of analyzing data from different perspectives and summarizing it into useful information information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit
2014
Clustering is an automatic learning technique aimed at grouping a set of objects into subsets or clusters. Objects in the same cluster should be as similar as possible, whereas objects in one cluster should be as dissimilar as possible from objects in the other clusters. Document clustering has become an increasingly important task in analysing huge documents. The challenging aspect to analyse the enormous documents is to organise them in such a way that facilitates better search and knowledge extraction without introducing extra cost and complexity. Document clustering has played an important role in many fields like information retrieval and data mining. In this paper, first Document Clustering has been proposed using Hierarchical Agglomerative Clustering and K-Means Clustering Algorithm.Here, the approach is purely based on the frequency count of the terms present in the documents where context of the documents are totally ignored. Therefore, the method is modified by incorporati...
2017 13th International Computer Engineering Conference (ICENCO), 2017
K-means algorithm is a well-known clustering algorithm due to its simplicity. Unfortunately, the output of k-means depends on the initialization of cluster centroids. In this paper, we propose a new hybrid approach for document clustering which uses the outputs of single pass clustering (SPC) as an initialization for k-means algorithm. We aim to get the advantages of careful seeding with single pass clustering and the benefits of k-means algorithm. The experimental results state that the proposed approach outperforms traditional k-means algorithm in both unsupervised and supervised evaluation measures especially when the number of required clusters is increased.
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.
International journal of research in social sciences, 2016
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is studied by the researchers at broad level because of its broad application in several areas such as web mining, search engines, and information extraction. It clusters the documents based on various similarity measures. The existing K-means (document clustering algorithm) was based on random center generation and every time the clusters generated was different In this paper, an Improved Document Clustering algorithm is given which generates number of clusters for any text documents based on fixed center generation, collect only exclusive words from different documents in dataset and uses cosine similarity measures to place similar documents in proper clusters. Experimental results showed that accuracy of proposed algorithm is high com...
2010
Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures semantics of the text. This may also help to reduce dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have higher semantic relationship. Feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, it may contain a large set of classindependent general-words, and a handful class-specific corewords. With these features in mind, traditional agglomerative clustering algorithms, which are based on either Document Vector model (DVM) or Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. The document is being transformed into a compact form. A similarity measure is proposed based upon the inferred information through topic maps data and structures. The suggested method is implemented using agglomerative hierarchal clustering and tested on standard Information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving the cluster quality.
2015
Data mining is the process of non-trivial discovery from implied, previously unknown, and potentially useful information from data in large databases. Hence it is a core element in knowledge discovery, often used synonymously. Clustering, one of technique for data mining used for grouping similar terms together. Earlier statistical analysis used in text mining depends on term frequency. Then, new concept based text mining model was introduced which analyses terms. Clustering of document is useful for the purpose of document organization, summarization, and information retrieval in an efficient way. Initially, clustering is applied for enhancing the information retrieval techniques. Of late, clustering techniques have been applied in the areas which involve browsing the gathered data or in categorizing the outcome provided by the search engines for the reply to the query raised by the users. In this paper, we are providing a comprehensive survey over the document clustering.
CSE573 :: Group13 :: Term Project :: Final Report, 2022
The main idea of this project is to explore various document clustering, summarization, and visualization techniques such that they can be used in two ways - summarization of the cluster topic and clustering of summaries and evaluate their performances. The 20 Newsgroups dataset by Ken Leng has around 20000 documents that are partitioned (nearly) evenly across 20 different categories. The first step in the project was to pre-process the documents, which included stop words removal, data cleanup, and lemmatization. Next, document clustering consists of two stages - creating embedded vectors from text and clustering those vectors. Vector creation is done using a topic modeling technique called Top2Vec and clustering using Bisecting K-means. Cluster summarization is implemented by a trained transformer named Bart-Large-CNN, an abstractive summarization. Later, we used techniques like t-SNE, Plotly, and DASH to visualize clustering results.
2014 International Conference on High Performance Computing and Applications (ICHPCA), 2014
With the rising quantity of textual data available in electronic format, the need to organize it become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical clustering algorithm to generate an index over a set of documents. The framework has been designed to be scalable and accurate even with large corpora. The advantage of the proposed algorithm lies in the need for minimal inputs, with much of the hierarchy attributes being decided in an automated manner using statistical methods. The use of topic modeling in a pre-processing stage ensures robustness to a range of variations in the input data. For experimental work 20-Newsgroups dataset has been used. The Fmeasure of the proposed approach has been compared with the traditional K-Means and K-Medoids clustering algorithms. Test results demonstrate the applicability, efficiency and effectiveness of our proposed approach. After extensive experimentation, we conclude that the framework shows promise for further research and specialized commercial applications.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.