2008
Many document clustering techniques rely on the number of word occurrences in documents, treating words as the dimensions of the clustering space. Since each document contains a huge number of words, studies have sought to reduce this high dimensionality for better performance, i.e., word pruning. Sampling has been used to choose random representative documents to which clustering techniques are applied instead of the whole data set, but it had not previously been applied to words. In this paper, we study the effect of word sampling on document clustering as a method of high-dimensionality reduction, presenting a random word sampling technique. The Euclidean and Manhattan distance functions are both used as similarity measures. A hybrid clustering algorithm is modified to include word sampling, and the results are compared with non-word-sampling clustering in terms of the accuracy of the resultant clusters. Key-Words: Data min...
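The word-sampling idea described in this abstract can be sketched as follows. This is a minimal illustration under assumed inputs (a toy bag-of-words matrix and a sample size of 3), not the authors' implementation: a random subset of the word dimensions is kept, and the Euclidean and Manhattan distances are then computed in the reduced space.

```python
import random

# Toy bag-of-words matrix: each row is a document, each column a word count.
docs = [
    [2, 0, 1, 3, 0, 1],
    [0, 1, 0, 2, 1, 0],
    [1, 1, 2, 0, 0, 2],
]
vocab_size = len(docs[0])

# Random word sampling: keep a random subset of the word dimensions.
random.seed(42)
sample_size = 3  # illustrative choice
kept = sorted(random.sample(range(vocab_size), sample_size))

def project(vec, dims):
    """Keep only the sampled word dimensions."""
    return [vec[d] for d in dims]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

reduced = [project(d, kept) for d in docs]
# Distances between documents are now computed in the sampled space.
d_euc = euclidean(reduced[0], reduced[1])
d_man = manhattan(reduced[0], reduced[1])
```

Any clustering algorithm that consumes pairwise distances can then run on the reduced vectors unchanged, which is what makes the sampling step a drop-in dimensionality reduction.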
2008
In this paper a clustering algorithm for documents is proposed that adapts a sampling-based pruning strategy to simplify hierarchical clustering. The algorithm can be applied to any text document data set whose entries can be embedded in a high-dimensional Euclidean space in which every document is a vector of real numbers. This paper presents the results of an experimental study of the proposed document clustering technique. The performance of the method is illustrated in terms of the quality of the clusters.
Arxiv preprint cond-mat/0109006, 2001
We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning and (for the first time) superparamagnetic clustering with several distance measures. The algorithms have been applied to test databases extracted from the Reuters-21578 text categorization test database. We find that simple application of the different clustering algorithms yields clustering solutions of comparable quality. In order to achieve considerable improvements of the clustering results it is crucial to reduce the dictionary of words considered in the representation of the documents. Significant improvements of the quality of the clustering can be obtained by identifying discriminative words and filtering out indiscriminative words from the dictionary. We present two methods, each based on a resampling scheme, for selecting discriminative words in an unsupervised way.
2007 IEEE 23rd International Conference on Data Engineering Workshop, 2007
Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation methods for text are used, together with three Dimension Reduction Techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on Document Frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representations, ICA generally gives better results compared with LSI. Experiments also show that the word representation gives better clustering results compared to term and N-gram representations. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than a 3-gram representation.
Encyclopedia of Data Warehousing and Mining, Second Edition, 2009
Lecture Notes in Computer Science, 2005
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods, Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF), and Random Projection (RP), ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the data sets. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
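Of the four techniques above, Document Frequency (DF) pruning is the simplest to state in code. The following is a minimal sketch with a made-up toy corpus and an illustrative threshold, not the study's actual setup: terms appearing in fewer than `min_df` documents are dropped, and documents are re-represented over the pruned vocabulary.

```python
# Toy corpus: each document is a list of tokens (hypothetical data).
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "date"],
]

min_df = 2  # illustrative threshold

# Document frequency: in how many documents does each term occur?
df = {}
for doc in corpus:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# DF-based feature selection: keep only terms with df >= min_df.
kept_terms = sorted(t for t, c in df.items() if c >= min_df)

# Re-represent each document as a count vector over the pruned vocabulary.
vectors = [[doc.count(t) for t in kept_terms] for doc in corpus]
```

Here "date" occurs in only one document and is pruned, so each document vector has one fewer dimension; with real vocabularies the reduction is far larger.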
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.
Data accumulate as time passes and there is a growing need for automated systems for partitioning data into groups, in order to describe, organise and retrieve information. With document databases, one of the main aims is text categorisation, consisting of identifying documents with similar topics. In the usual vector space model, documents are represented as points in the high-dimensional space spanned by words. One obstacle to the efficient performance of algorithms is the curse of dimensionality: as dimensions increase, the data points become increasingly sparse in the representation space. Classical statistical methods lose their properties, and researchers' interest turns towards finding dense areas in lower dimensional spaces. The aim of this paper is to review the basic literature on the topic, focusing on dimensionality reduction and double clustering.
Singapore Management University, 2009
Eighth Sense Research Group
ABSTRACT The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools, such as text mining, to acquire useful information. Document clustering is one of its powerful methods, through which document retrieval, organization, and summarization can be achieved. Text documents are unstructured databases that contain raw data collections. Clustering techniques group text documents according to their similarity. Because there is a huge amount of unstructured data, and semantic correlations exist between its features, it is difficult to handle. A large number of feature selection methods are used to improve the efficiency and accuracy of the clustering process. Feature selection is performed by eliminating redundant and irrelevant items from the text document contents. Statistical methods have been used in text clustering and feature selection algorithms. A semantic clustering and feature selection method is proposed to improve the clustering and feature selection mechanism with the semantic relations of the text documents. Keywords: Clustering, CHIR, CHIRSIM, K-means algorithm
2015
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), 2020
The TF-IDF model is the most common way of representing documents in the vector space. However, its results are high-dimensional, posing problems to classic clustering algorithms due to the curse of dimensionality. Recent word-embedding-based techniques can reduce the dimensionality of document representations while also preserving the semantic relationships between words. In this paper, we analyze the accuracy of four different classical clustering algorithms (K-Means, Spherical K-Means, LDA, and DBSCAN) in combination with the Document to Vector model.
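The TF-IDF weighting mentioned above can be computed directly. This is a minimal sketch on a made-up three-document corpus, using raw term frequency and the plain idf variant log(N/df); real implementations differ in smoothing and normalization choices.

```python
import math

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
N = len(corpus)

# Document frequency of each term.
df = {}
for doc in corpus:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(doc):
    """TF-IDF weights for one document: raw tf times log(N / df)."""
    return {t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}

weights = tfidf(corpus[0])
# "the" occurs in every document, so log(N/df) = log(1) = 0 and its
# weight vanishes; rarer terms like "sat" receive the highest weight.
```

The example makes the dimensionality problem concrete: every distinct term in the corpus contributes one dimension, which is why embedding-based models with a few hundred fixed dimensions are attractive for clustering.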
2004
Abstract Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed.
2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), 2022
Nowadays, huge amounts of text are being generated on the Web by a vast number of applications. Examples of such applications include instant messengers, social networks, e-mail clients, news portals, blog communities, commercial platforms, and so forth. The requirement for effectively identifying documents of similar content in these services rendered text clustering one of the most emerging problems of the machine learning discipline. Nevertheless, the high dimensionality and the natural sparseness of text introduce significant challenges that threat the feasibility of even the most successful algorithms. Consequently, the role of dimensionality reduction techniques becomes crucial for this particular problem. Motivated by these challenges, in this article we investigate the impact of dimensionality reduction on the performance of text clustering algorithms. More specifically, we experimentally analyze its effects in the effectiveness and running times of eight clustering algorithms by employing six high-dimensional text datasets. The results indicate that, in most cases, dimensionality reduction may significantly improve the algorithm execution times, by sacrificing only small amounts of clustering quality.
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable as opposed to the hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard K-means algorithm and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.
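The practical difference between two of the measures compared above, squared Euclidean distance and cosine similarity, can be shown on a toy pair of term-count vectors (hypothetical data, not the paper's datasets): cosine similarity depends only on word proportions, while Euclidean distance is sensitive to document length.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two documents with identical word proportions but different lengths:
short_doc = [1, 2, 0]
long_doc = [2, 4, 0]  # the same content, repeated twice

sim = cosine_similarity(short_doc, long_doc)   # identical direction -> 1.0
dist = euclidean(short_doc, long_doc)          # nonzero: lengths differ
```

This is one reason cosine similarity is a common default for text: two documents about the same topic should not be pulled apart merely because one is longer.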
KDD workshop on text mining, 2000
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.

1 Background and Motivation

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbors of a document [BL85]. More recently, clustering has been proposed for use in browsing a collection of documents [CKPT92] or in organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97]. (The automatic generation of a taxonomy of Web documents like that provided by Yahoo!
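The bisecting K-means variant discussed in this abstract can be sketched compactly: start with one cluster holding everything, then repeatedly split the largest cluster in two with standard K-means until k clusters remain. This is an illustrative toy version on made-up 2-D points; real implementations typically try several trial splits and keep the one with the best overall similarity.

```python
import numpy as np

def kmeans2(points, iters=20, seed=0):
    """Standard K-means with k=2; returns a boolean assignment mask."""
    rng = np.random.default_rng(seed)
    # Initialize the two centers at two distinct data points.
    centers = points[rng.choice(len(points), 2, replace=False)]
    mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        d0 = np.linalg.norm(points - centers[0], axis=1)
        d1 = np.linalg.norm(points - centers[1], axis=1)
        mask = d1 < d0
        if mask.all() or (~mask).all():
            break  # degenerate split; stop iterating
        centers = np.array([points[~mask].mean(axis=0),
                            points[mask].mean(axis=0)])
    return mask

def bisecting_kmeans(points, k):
    """Split the largest cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(points))]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        big = clusters.pop(idx)
        mask = kmeans2(points[big])
        clusters.append(big[mask])
        clusters.append(big[~mask])
    return clusters

# Three well-separated toy groups of two points each.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0], [10.1, 0.0]])
clusters = bisecting_kmeans(pts, 3)
```

Each bisection runs 2-means on only one cluster's points, which is why the overall cost stays close to linear in the number of documents, as the abstract notes.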
Precision = No. of correctly retrieved documents / No. of retrieved documents
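As a quick worked instance of this precision formula (with hypothetical document IDs):

```python
# Hypothetical retrieval result and ground-truth relevant set.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}

# Correctly retrieved = retrieved documents that are actually relevant.
correctly_retrieved = retrieved & relevant
precision = len(correctly_retrieved) / len(retrieved)  # 2 / 4 = 0.5
```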
2012
There are two important problems worth researching in the field of personalized information services based on user models. One is how to obtain and describe personal user information, i.e., building the user model; the other is how to organize the information resources, i.e., document clustering. It is difficult to find the desired information without a proper clustering algorithm. Several new ideas have been proposed in recent years, but most of them take only the text information into account, while other useful information, such as text size, font, and other appearance characteristics (so-called visual features), may contribute more to document clustering. In this paper we introduce a new technique called the Closed Document Clustering Method (CDCM) that uses advanced clustering metrics. This method enhances a previous method for clustering scientific documents based on visual features, the VF-Clustering algorithm. Five kinds of visual features of documents are...
The rapid growth of databases in almost every area of human activity has created the need for new, powerful tools to turn data into useful knowledge. To satisfy this need, researchers in fields such as machine learning, pattern recognition, statistical data analysis, data visualization, neural networks, econometrics, information retrieval, and information extraction have explored a range of methods and ideas. Text mining studies unstructured textual information in order to discover its structure and the hidden meanings in the text. Document clustering via unsupervised machine learning methods has wide application in areas of natural language processing such as automatic multi-document summarization and information retrieval. The current paper introduces some useful techniques in this area, clustering documents with the aim of reducing noise, redundancy, and unrelated data. Dimension reduction is a method of removing such features.
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is studied broadly because of its wide application in areas such as web mining, search engines, and information extraction, and it groups documents based on various similarity measures. The existing K-means document clustering algorithm is based on random center generation, so the clusters generated differ on every run. In this paper, an improved document clustering algorithm is given that generates clusters for text documents based on fixed center generation, collects only exclusive words from the different documents in the dataset, and uses cosine similarity to place similar documents in proper clusters. Experimental results show that the accuracy of the proposed algorithm is higher than that of the existing algorithm in terms of F-measure, recall, precision, and time complexity.