Document Clustering using Improved K- meansAlgorithm

Publisher ijmra.us UGC Approved

Document Clustering using Improved K- meansAlgorithm

Publisher ijmra.us UGC Approved

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is studied by the researchers at broad level because of its broad application in several areas such as web mining, search engines, and information extraction. It clusters the documents based on various similarity measures. The existing K-means (document clustering algorithm) was based on random center generation and every time the clusters generated was different In this paper, an Improved Document Clustering algorithm is given which generates number of clusters for any text documents based on fixed center generation, collect only exclusive words from different documents in dataset and uses cosine similarity measures to place similar documents in proper clusters. Experimental results showed that accuracy of proposed algorithm is high compare to existing algorithm in terms of F-Measure, Recall, Precision and time complexity.

Anina John

International Journal of Engineering Research and, 2016

With the huge upsurge in information, it has become difficult to gather relevant information within the limited time. Hence clustering methods are introduced to ease the task of gathering the relevant information in a cluster. Efficiency of clustering therefore becomes one of the crucial requirements to be met by the clustering methods. There are several methods and algorithms have been introduced. Hierarchical clustering is often portrayed as the better quality clustering approach, but it is limited because of its time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents. A clustering method based on the hidden semantics within the documents is proposed here for better results. The proposed method extracts features from the web documents using conditional random fields and builds a linguistic topological space based on the associations of features. The features that are used this method are TF (Term Frequency) and IDF (Inverse Document Frequency). Both TF and IDF values are best in reflecting the importance of the document in the given context. Then the documents are clustered based on the K-means clustering after finding the topics in the documents using these features. The advantage of K-means method is that it produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

Log In

Document Clustering using Improved K- meansAlgorithm

Sign up for access to the world's latest research

Abstract

Related papers

Related topics