Rapid advances in smart technologies allow individuals and organizations to store large numbers of documents in repositories, but retrieving the relevant documents from these massive collections is difficult. Document clustering is the process of organizing such collections into meaningful clusters. Finding relevant documents is simpler and less tedious when documents are clustered by topic or category. Various document clustering algorithms are available for organizing documents so that each document is close to its related documents. This paper presents various clustering techniques that are used in text mining.
International Journal of Electrical and Computer Engineering (IJECE), 2021
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. As a result, researchers spend a lot of time finding research papers that are close to their field of specialization. In this paper we propose a document classification approach that clusters the text of research papers into meaningful categories, each covering a similar scientific field. The approach is based on the focus and scope of the target categories, where each category includes many topics. We extract word tokens from the topics of each category separately. The frequency of word tokens in a document determines the document's weight, which is calculated using the term frequency-inverse document frequency (TF-IDF) statistic. The proposed approach uses the title, abstract, and keywords of each paper, in addition to the category topics, to perform the classification. Documents are then classified and clustered into the primary categories based on the highest cosine similarity between the category weights and the document weights.
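The pipeline this abstract describes (TF-IDF weights, then assignment to the category with the highest cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, and the toy categories are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, which is an assumption of this sketch.
    """
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        total = len(tokens) or 1
        tf = Counter(tokens)
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(documents, category_topics):
    """Assign each document to the category whose topic text is most similar."""
    names = list(category_topics)
    texts = [category_topics[name] for name in names] + list(documents)
    vectors = tfidf_vectors(texts)
    cat_vecs, doc_vecs = vectors[:len(names)], vectors[len(names):]
    return [
        max(zip(names, cat_vecs), key=lambda nc: cosine(d, nc[1]))[0]
        for d in doc_vecs
    ]


# Hypothetical categories and documents, for illustration only.
categories = {
    "networks": "wireless network routing protocol sensor",
    "ml": "neural network learning classification clustering",
}
docs = [
    "routing protocol for wireless sensor nodes",
    "deep learning for text classification",
]
labels = classify(docs, categories)
```

In the abstract, the text of each document would be the paper's title, abstract, and keywords combined, and each category's text would be built from its topic list.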
International Journal of Computer Applications, 2014
With the growth of the Internet, a large and increasing amount of text data is being created by sources such as social networking sites, the web, and other information sources. This data is unstructured, which makes it tedious to analyze, so we need methods and algorithms that work with various text formats. Clustering is an important part of data mining. Clustering is the process of dividing a large collection of text so that similar texts fall into the same class. Clustering is widely used in applications such as medicine, biology, and signal processing. This paper briefly covers the various kinds of text clustering algorithms, the present state of text clustering, and an analysis and comparison of aspects including sensitivity and stability. The algorithms covered include traditional approaches such as hierarchical clustering, density-based clustering, and self-organizing map clustering.
$$\text{Precision} = \frac{\text{No. of correctly retrieved documents}}{\text{No. of retrieved documents}}$$
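The precision ratio above (correctly retrieved documents divided by all retrieved documents) is simple to compute once the retrieved set and the relevant set are known. The sketch below assumes documents are identified by hashable IDs; the function name and the sample IDs are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # convention chosen for this sketch when nothing is retrieved
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical document IDs: 2 of the 4 retrieved documents are relevant.
score = precision(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
```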
Document clustering is becoming more and more important with the abundance of text documents available through the World Wide Web and corporate document management systems. Document clustering is the process of categorizing text documents into systematic clusters or groups, such that the documents in the same cluster are similar whereas the documents in other clusters are dissimilar. This survey covers data mining clustering techniques for unstructured data.
International Journal of Engineering Research and Advanced Technology, 2020
Clustering algorithms have attracted attention recently, owing to the huge volume of available datasets and the growth of parallelized computing architectures. The goal of clustering algorithms is to divide the dataset into clusters such that objects within the same cluster are similar to each other and differ from objects in other clusters. Clustering algorithms play an important role in information retrieval, indexing, and text summarization. This paper gives a brief overview of several clustering algorithms.
2000
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
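The "bisecting" K-means variant mentioned in this abstract starts with all documents in one cluster and repeatedly splits one cluster in two with a 2-means step until the desired number of clusters is reached. The sketch below is a simplified illustration under several assumptions that are not taken from the paper: points are small numeric tuples rather than document vectors, distance is squared Euclidean rather than cosine, the cluster to split is the one with the largest sum of squared errors, and seeding is deterministic.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(dist2(p, center) for p in points)


def two_means(points, iterations=20):
    """Split points into two groups with basic K-means (k = 2).

    Seeding is deterministic (an assumption of this sketch): the first
    point and the point farthest from it.
    """
    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    groups = None
    for _ in range(iterations):
        new_groups = [[], []]
        for p in points:
            nearest = 0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
            new_groups[nearest].append(p)
        # Stop on convergence, or keep the last valid split if a group empties.
        if not new_groups[0] or not new_groups[1] or new_groups == groups:
            break
        groups = new_groups
        centers = [mean(g) for g in groups]
    return groups


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if sse(worst) == 0:
            break  # every remaining cluster holds identical points
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters


# Toy data with three visually separate groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
result = bisecting_kmeans(data, 3)
```

Each bisection only touches the points of one cluster, which is part of why this family of methods stays cheap compared with agglomerative hierarchical clustering.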
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is widely studied because of its broad application in areas such as web mining, search engines, and information extraction. It clusters documents based on various similarity measures. The existing K-means document clustering algorithm is based on random center generation, so the clusters it generates differ on every run. In this paper, an improved document clustering algorithm is presented that generates a number of clusters for any set of text documents based on fixed center generation, collects only exclusive words from the different documents in the dataset, and uses the cosine similarity measure to place similar documents in the proper clusters. Experimental results showed that the proposed algorithm compares favorably with the existing algorithm in terms of F-measure, recall, precision, and time complexity.
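The effect of fixed (non-random) initial centers combined with cosine similarity can be illustrated with a small sketch. The paper's actual center-generation rule is not given in the abstract, so the version below simply takes caller-supplied seed indices as the fixed centers; that choice, the function names, and the toy term-count vectors are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    # Inputs are unit vectors, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(vectors, seed_indices, iterations=50):
    """K-means with cosine similarity and fixed initial centers.

    seed_indices selects which documents act as the starting centers;
    because nothing is random, repeated runs give identical labels.
    """
    docs = [normalize(v) for v in vectors]
    centers = [docs[i] for i in seed_indices]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(len(centers)), key=lambda j: cosine(d, centers[j]))
            for d in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centers)):
            members = [d for d, label in zip(docs, labels) if label == j]
            if members:
                centers[j] = normalize([sum(c) for c in zip(*members)])
    return labels


# Hypothetical term-count vectors over a three-word vocabulary.
term_counts = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 0], [1, 0, 0]]
labels = cosine_kmeans(term_counts, seed_indices=[0, 2])
```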
International Journal of Electrical and Computer Engineering (IJECE), 2019
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across a wide range of sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task must be studied before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been made to achieve efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering along with the pros and cons of each model. In addition, it focuses on recent developments in the clustering task in social network and associated environments.
This paper deals with multilayered clustering of e-government documents based on a hybrid approach that combines the fuzzy C-means algorithm with cosine similarity and semantic similarity measures. The system described here is intended to reduce the response time between citizens' questions and government answers, and to eliminate or minimize the role of subject matter experts. Layers of documents are defined by key terms that are discovered by a clustering engine that we named ADVANSE. After a short overview of clustering algorithms, the paper walks step by step through the functionality of ADVANSE. Finally, concluding remarks emphasize some important features of this approach and give future research directions.
The amount of digital data utilized in daily life has increased owing to the high dependence on such data. Most data can be stored in textual documents. With the rapid increase in the number of textual documents, users face problems in obtaining useful information. Thus, a method by which to manage data is required to give users an idea about content. In addition, techniques to increase the ratio of precision in information retrieval results are also needed. Therefore, the textual document clustering area is developed to represent the data in meaningful clusters. The two main factors encountered in the process of textual document clustering are efficiency and goodness or quality of data clusters. Efforts have been exerted to deal with these factors. These attempts can be categorized into either traditional or modern approaches. However, these attempts also face numerous issues. In this paper, we present the previous and current issues faced by textual document clustering algorithms to help text domain researchers understand these issues. This study provides researchers and students an overview about textual document clustering algorithms. Furthermore, this study can encourage researchers to find solutions to these issues.
2015
Data mining is the non-trivial discovery of implicit, previously unknown, and potentially useful information from data in large databases. It is a core element of knowledge discovery, and the two terms are often used synonymously. Clustering is a data mining technique used for grouping similar terms together. Earlier statistical analysis in text mining depended on term frequency. Later, a new concept-based text mining model was introduced that analyzes terms. Document clustering is useful for document organization, summarization, and efficient information retrieval. Initially, clustering was applied to enhance information retrieval techniques. More recently, clustering techniques have been applied to browsing gathered data and to categorizing the results returned by search engines in response to user queries. In this paper, we provide a comprehensive survey of document clustering.
2015
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit
2014
Knowledge discovery is the process of discovering useful knowledge from a collection of data. This widely used data mining technique includes data preparation and selection, data cleansing, incorporating prior knowledge about data sets, and interpreting accurate solutions from the observed results. Text mining is a subdomain of knowledge discovery that works on text data. The presented study provides a broad understanding of text mining and its applications in different real-world domains. Text mining includes text classification and text clustering. Cluster analysis is performed on unlabelled and unstructured data. In this paper, I present a study of various research papers that explore text clustering approaches in various genres.
International Journal of Engineering Research and, 2016
With the huge upsurge in information, it has become difficult to gather relevant information within a limited time. Clustering methods were introduced to ease the task of gathering relevant information into clusters, so efficiency has become one of the crucial requirements for clustering methods. Several methods and algorithms have been introduced. Hierarchical clustering is often portrayed as the better quality clustering approach, but it is limited by its time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents. A clustering method based on the hidden semantics within the documents is proposed here for better results. The proposed method extracts features from web documents using conditional random fields and builds a linguistic topological space based on the associations between features. The features used in this method are TF (term frequency) and IDF (inverse document frequency). Both TF and IDF values are effective at reflecting the importance of terms within a document in the given context. After the topics in the documents are found using these features, the documents are clustered with K-means. The advantage of the K-means method is that it produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
IJRCAR, 2014
Text mining is a technique for finding meaningful patterns in text documents. Pattern discovery from text and the organization of documents are well-known problems in data mining. Analyzing text content and categorizing documents is a complex data mining task. In the search for an efficient and effective technique for text categorization, various categorization and classification techniques have recently been developed. Some arrange documents in a supervised manner and some in an unsupervised manner. This paper discusses different methods of text categorization and cluster analysis of text documents. In addition, a new text mining technique is proposed for future implementation.
Clustering is a widely studied data mining problem in the text domains. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social network and linked data.
Text clustering is a text mining technique used to group text documents into groups (or clusters) based on similarity of content. This organization (i.e., clustering) makes documents more understandable, makes relevant information easier to search for and process, and even makes more efficient use of communication bandwidth and storage space. An example is clustering the results of a web search engine operation into groups of similar documents. Many text clustering algorithms have been developed using different approaches, but none can be said to be the best. The choice of a particular algorithm is a big issue for text clustering system developers. K Means is arguably the most popular text clustering algorithm. However, like the others, it has its own weaknesses. In this paper, we explore the K Means algorithm as well as its variants and discuss their appropriateness in text clustering. We describe the characteristics of the algorithms, accompanied by some examples and illustrations, in an attempt to discover their strengths and weaknesses. The paper thus gives an in-depth view of the K Means algorithms, discusses the appropriateness of the algorithms, and gives guidance to text mining researchers concerning the choice of K Means for text clustering.
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021
Clustering is a widely used unsupervised data mining technique. In clustering, the main aim is to put similar data objects in one cluster and dissimilar objects in other clusters. K-means is the most famous clustering algorithm because of its simplicity. However, the performance of the k-means clustering algorithm depends on parameter selection. Parameters such as the number of clusters and the initial cluster centers are key to the k-means algorithm. Distance augmentation methods, density methods, and quadratic clustering methods are used for initial cluster selection. This paper examines five methods, such as the improved k-means text clustering algorithm, revisiting k-means, the LMMK algorithm, the SELF-DATA architecture, and Clustering Approach for Relation. These techniques have some limitations. To improve on these approaches, this paper proposes the development of a text clustering method with k-means for the analysis of text data.
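Because k-means is sensitive to its initial centers, distance-based seeding is one common way to pick them. The sketch below shows farthest-first (maximin) seeding as an assumed example of that family; it is not one of the specific methods named in the abstract, and the function names and sample points are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Pick k initial centers by farthest-first traversal.

    Starts from the first point, then repeatedly adds the point whose
    nearest already-chosen center is farthest away.
    """
    centers = [points[0]]
    while len(centers) < k:
        next_center = max(
            points, key=lambda p: min(dist2(p, c) for c in centers)
        )
        centers.append(next_center)
    return centers


# Toy 2-D points: two tight pairs on the x-axis and one outlying point.
pts = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
seeds = farthest_first_centers(pts, 3)
```

The chosen seeds would then be passed to a standard k-means loop in place of random initial centers.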