2000, IEEE Transactions on Knowledge and Data Engineering
Phrases have been considered more informative feature terms for improving the effectiveness of document clustering. In this paper, we propose a phrase-based document similarity that computes the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf term weighting scheme when computing document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses that of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially on large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
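The idea of treating phrases as tf-idf weighted feature terms can be illustrated with a minimal sketch. It does not build a suffix tree: as a simplifying assumption, contiguous word sequences of up to `max_len` words stand in for suffix tree nodes, and the particular tf-idf formula is one common variant chosen for illustration, not necessarily the one used in the paper. The function names are hypothetical.

```python
import math
from collections import Counter

def phrases(text, max_len=3):
    """Contiguous word sequences of length 1..max_len.
    Assumption: these approximate the phrases labelling suffix tree nodes."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: weight} vector."""
    counts = [Counter(phrases(d, max_len)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each phrase.
    df = Counter(p for c in counts for p in c)
    # Weight = (1 + log tf) * log(1 + N / df), an assumed tf-idf variant.
    return [{p: (1 + math.log(tf)) * math.log(1 + n_docs / df[p])
             for p, tf in c.items()}
            for c in counts]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["cat ate cheese", "mouse ate cheese too", "dog chased cat"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

Because single words are phrases of length one, the phrase vector contains the single-word vector as a subset, which matches the "expanded feature vector" reading given in the abstract.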
Proceedings of the 16th international conference on World Wide Web - WWW '07, 2007
In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new algorithm is a very effective document clustering algorithm. Compared with the results of the traditional word-term tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% in average F-measure score. Furthermore, we apply the new clustering algorithm to the analysis of Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community.
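Group-average agglomerative clustering over a precomputed similarity matrix can be sketched as follows. This is a naive illustration under stated assumptions, not the NSTC implementation: it uses the UPGMA-style definition (average similarity over all cross-cluster document pairs), stops at a requested number of clusters, and the function name `group_average_hac` is hypothetical.

```python
def group_average_hac(sim, n_clusters):
    """Merge clusters greedily by highest average pairwise similarity.

    sim: symmetric matrix (list of lists) of document similarities.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Average similarity over all pairs with one document in each cluster.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters

sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(group_average_hac(sim, 2))
```

The similarity matrix is the only input, so any pairwise measure (single-word tf-idf cosine or a suffix tree based one) can be plugged in without changing the clustering step.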
2010
Document clustering is an unsupervised approach used extensively to navigate, filter, summarize and manage large document repositories such as the World Wide Web (WWW). Recently, the focus in this domain has shifted from traditional vector-based document similarity to suffix-tree-based document similarity, as the latter offers a more semantic representation of the text in a document. In this paper, we compare and contrast two recently introduced approaches to document clustering based on the suffix tree data model. The first is an efficient phrase-based document clustering, which extracts phrases from documents to form a compact document representation and uses a similarity measure based on a common suffix tree to cluster the documents. The second is a frequent word/word meaning sequence based document clustering; it similarly extracts the common word sequences from the documents, uses the common word sequences or common word meaning sequences to build the compact representation, and finally clusters the compact documents. Both algorithms use agglomerative hierarchical clustering for the actual clustering step; they differ mainly in how phrases are extracted, how the compact document model is represented, and which similarity measure is used for clustering. This paper investigates the computational aspects of the two algorithms and the quality of the results they produce.
In text categorization, the most widely used method for document representation is based on word frequency vectors and is called the Vector Space Model (VSM). This representation relies only on the words in the documents and therefore loses any “word context” information found in the document. In this article we compare the classical method of document representation with the Suffix Tree Document Model (STDM), which represents documents in suffix tree format. For the STDM we propose a new approach to document representation and a new formula for computing the similarity between two documents. Specifically, we propose building the suffix tree for only two documents at a time. This approach is faster, consumes less memory, and uses the entire document representation without resorting to methods for discarding nodes. We also propose a formula for computing the similarity between documents that substantially improves clustering quality. The representation method was validated using Hierarchical Agglomerative Clustering (HAC). In this context we also examine the influence of stemming in the document preprocessing step and highlight the difference between similarity and dissimilarity measures for finding “closer” documents.
International Journal of Computer Applications
ABSTRACT Document clustering is one of the difficult and recent research fields in search engine research. Most existing document clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from the information retrieval domain: it finds a grouping for a set of documents such that documents belonging to the same cluster are similar and documents belonging to different clusters are dissimilar. Information retrieval plays an important role in data mining by extracting the information relevant to a user request. It examines file contents, identifies their similarity, and measures retrieval performance using precision and recall. In this paper we propose a phrase-based clustering scheme based on the Suffix Tree Document Clustering (STDC) model. The proposed algorithm is designed to use the STDC model for an accurate equivalent representation of documents and for measuring the similarity of similar documents. This method of clustering reduces the grouping time and improves the similarity accuracy compared to other existing methods.
Precision = (No. of correctly retrieved documents) / (No. of retrieved documents)
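The precision ratio can be computed directly from sets of document identifiers. The sketch below also computes recall and the F-measure using their standard textbook definitions, which are an addition here and not taken from the text; the function name and the sample identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)  # correctly retrieved documents
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)
```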
The most popular way of representing documents is the vector space model, because of its speed and versatility. The vector space model has some drawbacks. To overcome the bag-of-words problems, text documents are treated as sequences of words, and documents are retrieved from text databases based on the frequent word sequences they share. The sequential relationship between the words and documents is preserved using a suffix tree data structure. Syntax-based disambiguation is attempted by enriching the text document representations with background knowledge provided in a core ontology. WordNet is used for this purpose in our model. This work aims to extend a document representation model that is elegant in combining the versatility of the vector space model with the increased relevance of the suffix tree document model, while also retaining relationships between words such as synonymy. The effectiveness and relevance of this concept-based model compared to the existing models is evaluated with a partitioning clustering technique, followed by a systematic comparative study of the impact of similarity measures, in conjunction with different types of vector space representation, on cluster quality. This document model will be called the Concept Based Vector Suffix Tree Document Model (CBVSTDM).
This paper presents a new document clustering method based on frequent co-occurring words. We first employ the Singular Value Decomposition and then group the words into clusters called word representatives, which substitute for the corresponding words in the original documents. Next, we extract the frequent word representative sets with Apriori. Subsequently, each document is assigned to a basic unit described by a frequent word representative set, from which we obtain the final clusters by hierarchical clustering. The major advantage of our method is that it produces the cluster description, first as frequent word representatives and then as the corresponding words, during the clustering process without any extra work. Compared with the state-of-the-art UPGMA method on benchmark datasets, our method performs better in terms of entropy and cluster purity.
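The frequent-set extraction step can be illustrated with a small Apriori-style search. This is a sketch under simplifying assumptions: the SVD grouping into word representatives is omitted, so the items are plain tokens, support is an absolute document count, and the function name is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search.
    Returns {frozenset(items): number of transactions containing them}."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    result = {}
    k = 1
    while current:
        # Count support of each candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        # Join frequent sets into size-k candidates and prune any candidate
        # that has an infrequent (k-1)-subset (the Apriori property).
        current = set()
        for a, b in combinations(list(frequent), 2):
            u = a | b
            if len(u) == k and all(
                frozenset(s) in frequent for s in combinations(u, k - 1)
            ):
                current.add(u)
    return result

# Each "transaction" is the set of items (here, tokens) in one document.
docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
print(frequent_itemsets(docs, min_support=2))
```

Each frequent set returned here could then describe a basic unit to which documents containing that set are assigned before the hierarchical clustering step.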
Document clustering is used to organize the documents into groups. VSM (Vector Space Model) is a technique used to represent the document as a vector. Working with VSM to cluster the documents is easier.
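A minimal sketch of the vector representation, assuming whitespace tokenization and raw term frequencies as weights (both are simplifications chosen for illustration; the function name is hypothetical):

```python
from collections import Counter

def vsm_vectors(docs):
    """Return (vocabulary, term-frequency vectors) for a list of texts."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # One dimension per vocabulary word; missing words count as 0.
        vectors.append([tf[w] for w in vocab])
    return vocab, vectors

vocab, vectors = vsm_vectors(["dog bites man", "man bites dog", "dog sleeps"])
print(vocab)
print(vectors)
```

The first two documents receive identical vectors even though their word order differs, which shows why the representation is easy to work with but discards word-order information.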
2013
With the rapid change in technology, data mining and warehousing are gaining prominence in the field of computing. Retrieving information within large organizations is becoming a tedious task. Data mining now offers many powerful and innovative techniques for solving the problem of information retrieval. This paper introduces a novel approach for clustering text documents based on frequent subtrees. Document trees are constructed by extracting noun hypernym relationships for every word in the text document using the WordNet 2.1 lexical reference. This technique departs from traditional text mining approaches, which are based on frequent keyword occurrences. Its aim is to cluster documents even if the documents have no words in common. The key idea behind this paper is to automate the clustering mechanism by discovering frequent subtrees in the various document trees. To identify the frequent sub trees occurrences in the constructed doc...
Romanian Journal of Information Science and Technology
International Journal of Advanced Computer Science and Applications, 2016
Proceedings of the 7th International Conference on Advanced Data Mining and Applications Volume Part Ii, 2011
Data & Knowledge Engineering, 2008
Knowledge and Information Systems, 2011