Efficient Phrase-Based Document Similarity for Clustering

Hung Chim; Xiaotie Deng

Efficient Phrase-Based Document Similarity for Clustering

Xiaotie Deng

2000, IEEE Transactions on Knowledge and Data Engineering

visibility

…

description

13 pages

link

1 file

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, we propose a phrase-based document similarity to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of STD model into a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective on clustering the documents of two standard document benchmark corpora OHSUMED and RCV1. The quality of the clustering results significantly surpasses the results of traditional single-word tf-idf similarity measure in the same HAC algorithm, especially in large document data sets. Furthermore, by studying the property of STD model, we conclude that the feature vector of phrase terms in the STD model can be considered as an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.

Xiaotie Deng

Proceedings of the 16th international conference on World Wide Web - WWW '07, 2007

In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on suffix tree document model. By applying the new suffix tree similarity measure in Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpus OHSUMED and RCV1 indicate that the new clustering algorithm is a very effective document clustering algorithm. Comparing with the results of traditional word term weight tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% on the average of F-measure score. Furthermore, we apply the new clustering algorithm in analyzing the Web documents in online forum communities. A topic oriented clustering algorithm is developed to help people in assessing, classifying and searching the the Web documents in a large forum community.

Log In

Efficient Phrase-Based Document Similarity for Clustering

Sign up for access to the world's latest research

Abstract

Related papers

Related papers

Related topics