2008
Many document clustering techniques rely on the number of word occurrences in documents, treating words as the dimensions of the clustering space. Since each document contains a huge number of words, studies have sought to reduce this high dimensionality for better performance, i.e., word pruning. Sampling has been used to choose random representative documents to which clustering techniques are applied instead of the whole data set, but it had not previously been applied to words. In this paper, we study the effect of word sampling on document clustering as a method of high-dimensionality reduction, presenting a random word sampling technique. The Euclidean and Manhattan distance functions are both used as similarity measures. A hybrid clustering algorithm is modified to include word sampling, and the results are compared with non-word-sampling clustering in terms of the accuracy of the resultant clusters. Key-Words: Data min...
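The word-sampling idea described in this abstract can be sketched as follows. This is a minimal illustration under assumed inputs (a toy bag-of-words matrix and a sample size of 3), not the authors' implementation: a random subset of the word dimensions is kept, and the Euclidean and Manhattan distances are then computed in the reduced space.

```python
import random

# Toy bag-of-words matrix: each row is a document, each column a word count.
docs = [
    [2, 0, 1, 3, 0, 1],
    [0, 1, 0, 2, 1, 0],
    [1, 1, 2, 0, 0, 2],
]
vocab_size = len(docs[0])

# Random word sampling: keep a random subset of the word dimensions.
random.seed(42)
sample_size = 3  # illustrative choice
kept = sorted(random.sample(range(vocab_size), sample_size))

def project(vec, dims):
    """Keep only the sampled word dimensions."""
    return [vec[d] for d in dims]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

reduced = [project(d, kept) for d in docs]
# Distances between documents are now computed in the sampled space.
d_euc = euclidean(reduced[0], reduced[1])
d_man = manhattan(reduced[0], reduced[1])
```

Any clustering algorithm that consumes pairwise distances can then run on the reduced vectors unchanged, which is what makes the sampling step a drop-in dimensionality reduction.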
2008
In this paper a clustering algorithm for documents is proposed that adapts a sampling-based pruning strategy to simplify hierarchical clustering. The algorithm can be applied to any text document data set whose entries can be embedded in a high-dimensional Euclidean space in which every document is a vector of real numbers. This paper presents the results of an experimental study of the proposed document clustering technique. The performance of the method is illustrated in terms of the quality of the clusters.
Arxiv preprint cond-mat/0109006, 2001
We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning and (for the first time) superparamagnetic clustering with several distance measures. The algorithms have been applied to test databases extracted from the Reuters-21578 text categorization test database. We find that simple application of the different clustering algorithms yields clustering solutions of comparable quality. In order to achieve considerable improvements of the clustering results it is crucial to reduce the dictionary of words considered in the representation of the documents. Significant improvements of the quality of the clustering can be obtained by identifying discriminative words and filtering out indiscriminative words from the dictionary. We present two methods, each based on a resampling scheme, for selecting discriminative words in an unsupervised way.
2007 IEEE 23rd International Conference on Data Engineering Workshop, 2007
Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation methods for text are used, together with three Dimension Reduction Techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on Document Frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representations, ICA generally gives better results compared with LSI. Experiments also show that the word representation gives better clustering results compared to term and N-gram representations. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than a 3-gram representation.
Encyclopedia of Data Warehousing and Mining, Second Edition, 2009
Lecture Notes in Computer Science, 2005
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods, Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF), and Random Projection (RP), ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the data sets. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
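Of the four techniques above, Document Frequency (DF) pruning is the simplest to state in code. The following is a minimal sketch with a made-up toy corpus and an illustrative threshold, not the study's actual setup: terms appearing in fewer than `min_df` documents are dropped, and documents are re-represented over the pruned vocabulary.

```python
# Toy corpus: each document is a list of tokens (hypothetical data).
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "date"],
]

min_df = 2  # illustrative threshold

# Document frequency: in how many documents does each term occur?
df = {}
for doc in corpus:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# DF-based feature selection: keep only terms with df >= min_df.
kept_terms = sorted(t for t, c in df.items() if c >= min_df)

# Re-represent each document as a count vector over the pruned vocabulary.
vectors = [[doc.count(t) for t in kept_terms] for doc in corpus]
```

Here "date" occurs in only one document and is pruned, so each document vector has one fewer dimension; with real vocabularies the reduction is far larger.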
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.
Data accumulate as time passes and there is a growing need for automated systems for partitioning data into groups, in order to describe, organise and retrieve information. With document databases, one of the main aims is text categorisation, consisting of identifying documents with similar topics. In the usual vector space model, documents are represented as points in the high-dimensional space spanned by words. One obstacle to the efficient performance of algorithms is the curse of dimensionality: as dimensions increase, the data points become increasingly sparse in the representation space. Classical statistical methods lose their properties, and researchers' interest turns towards finding dense areas in lower dimensional spaces. The aim of this paper is to review the basic literature on the topic, focusing on dimensionality reduction and double clustering.
Singapore Management University, 2009
Eighth Sense Research Group
ABSTRACT The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools, such as text mining, to acquire useful information. Document clustering is one of its powerful methods, through which document retrieval, organization, and summarization can be achieved. Text documents are unstructured databases that contain raw data collections. Clustering techniques group text documents according to their similarity. Because there is a huge amount of unstructured data, and semantic correlations exist between its features, it is difficult to handle. A large number of feature selection methods are used to improve the efficiency and accuracy of the clustering process. Feature selection is performed by eliminating redundant and irrelevant items from the text document contents. Statistical methods have been used in text clustering and feature selection algorithms. A semantic clustering and feature selection method is proposed to improve the clustering and feature selection mechanism with the semantic relations of the text documents. Keywords: Clustering, CHIR, CHIRSIM, K-means algorithm
2015
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), 2020
The TF-IDF model is the most common way of representing documents in the vector space. However, its results are high-dimensional, posing problems to classic clustering algorithms due to the curse of dimensionality. Recent word-embedding-based techniques can reduce the dimensionality of document representations while also preserving the semantic relationships between words. In this paper, we analyze the accuracy of four different classical clustering algorithms (K-Means, Spherical K-Means, LDA, and DBSCAN) in combination with the Document to Vector model.
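The TF-IDF weighting mentioned above can be computed directly. This is a minimal sketch on a made-up three-document corpus, using raw term frequency and the plain idf variant log(N/df); real implementations differ in smoothing and normalization choices.

```python
import math

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
N = len(corpus)

# Document frequency of each term.
df = {}
for doc in corpus:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(doc):
    """TF-IDF weights for one document: raw tf times log(N / df)."""
    return {t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}

weights = tfidf(corpus[0])
# "the" occurs in every document, so log(N/df) = log(1) = 0 and its
# weight vanishes; rarer terms like "sat" receive the highest weight.
```

The example makes the dimensionality problem concrete: every distinct term in the corpus contributes one dimension, which is why embedding-based models with a few hundred fixed dimensions are attractive for clustering.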
2004
Abstract Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed.
2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), 2022
Nowadays, huge amounts of text are being generated on the Web by a vast number of applications. Examples of such applications include instant messengers, social networks, e-mail clients, news portals, blog communities, commercial platforms, and so forth. The requirement for effectively identifying documents of similar content in these services rendered text clustering one of the most emerging problems of the machine learning discipline. Nevertheless, the high dimensionality and the natural sparseness of text introduce significant challenges that threat the feasibility of even the most successful algorithms. Consequently, the role of dimensionality reduction techniques becomes crucial for this particular problem. Motivated by these challenges, in this article we investigate the impact of dimensionality reduction on the performance of text clustering algorithms. More specifically, we experimentally analyze its effects in the effectiveness and running times of eight clustering algorithms by employing six high-dimensional text datasets. The results indicate that, in most cases, dimensionality reduction may significantly improve the algorithm execution times, by sacrificing only small amounts of clustering quality.
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable as opposed to the hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard K-means algorithm and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.
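The practical difference between two of the measures compared above, squared Euclidean distance and cosine similarity, can be shown on a toy pair of term-count vectors (hypothetical data, not the paper's datasets): cosine similarity depends only on word proportions, while Euclidean distance is sensitive to document length.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two documents with identical word proportions but different lengths:
short_doc = [1, 2, 0]
long_doc = [2, 4, 0]  # the same content, repeated twice

sim = cosine_similarity(short_doc, long_doc)   # identical direction -> 1.0
dist = euclidean(short_doc, long_doc)          # nonzero: lengths differ
```

This is one reason cosine similarity is a common default for text: two documents about the same topic should not be pulled apart merely because one is longer.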
KDD workshop on text mining, 2000
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.

1 Background and Motivation

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbors of a document [BL85]. More recently, clustering has been proposed for use in browsing a collection of documents [CKPT92] or in organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97]. (The automatic generation of a taxonomy of Web documents like that provided by Yahoo!
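The bisecting K-means variant discussed in this abstract can be sketched compactly: start with one cluster holding everything, then repeatedly split the largest cluster in two with standard K-means until k clusters remain. This is an illustrative toy version on made-up 2-D points; real implementations typically try several trial splits and keep the one with the best overall similarity.

```python
import numpy as np

def kmeans2(points, iters=20, seed=0):
    """Standard K-means with k=2; returns a boolean assignment mask."""
    rng = np.random.default_rng(seed)
    # Initialize the two centers at two distinct data points.
    centers = points[rng.choice(len(points), 2, replace=False)]
    mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        d0 = np.linalg.norm(points - centers[0], axis=1)
        d1 = np.linalg.norm(points - centers[1], axis=1)
        mask = d1 < d0
        if mask.all() or (~mask).all():
            break  # degenerate split; stop iterating
        centers = np.array([points[~mask].mean(axis=0),
                            points[mask].mean(axis=0)])
    return mask

def bisecting_kmeans(points, k):
    """Split the largest cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(points))]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        big = clusters.pop(idx)
        mask = kmeans2(points[big])
        clusters.append(big[mask])
        clusters.append(big[~mask])
    return clusters

# Three well-separated toy groups of two points each.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0], [10.1, 0.0]])
clusters = bisecting_kmeans(pts, 3)
```

Each bisection runs 2-means on only one cluster's points, which is why the overall cost stays close to linear in the number of documents, as the abstract notes.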
Precision = No. of correctly retrieved documents / No. of retrieved documents
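As a quick worked instance of this precision formula (with hypothetical document IDs):

```python
# Hypothetical retrieval result and ground-truth relevant set.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}

# Correctly retrieved = retrieved documents that are actually relevant.
correctly_retrieved = retrieved & relevant
precision = len(correctly_retrieved) / len(retrieved)  # 2 / 4 = 0.5
```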
2012
There are two important problems worth researching in the field of personalized information services based on user models. One is how to obtain and describe personal user information, i.e., building the user model; the other is how to organize the information resources, i.e., document clustering. It is difficult to find the desired information without a proper clustering algorithm. Several new ideas have been proposed in recent years, but most of them take only the text information into account, while other useful information, such as text size, font, and other appearance characteristics (so-called visual features), may contribute more to document clustering. In this paper we introduce a new technique called the Closed Document Clustering Method (CDCM) that uses advanced clustering metrics. This method enhances a previous method for clustering scientific documents based on visual features, the VF-Clustering algorithm. Five kinds of visual features of documents are...
The rapid growth of databases in almost every area of human activity has created the need for new, powerful tools to turn data into useful knowledge. To satisfy this need, researchers in fields such as machine learning, pattern recognition, statistical data analysis, data visualization, neural networks, econometrics, information retrieval, and information extraction have explored a range of methods and ideas. Text mining studies unstructured textual information in order to discover its structure and the hidden meanings in the text. Document clustering via unsupervised machine learning methods has wide application in areas of natural language processing such as automatic multi-document summarization and information retrieval. The current paper introduces some useful techniques in this area, clustering documents with the aim of reducing noise, redundancy, and unrelated data. Dimension reduction is a method of removing such features.
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is studied broadly because of its wide application in areas such as web mining, search engines, and information extraction, and it groups documents based on various similarity measures. The existing K-means document clustering algorithm is based on random center generation, so the clusters generated differ on every run. In this paper, an improved document clustering algorithm is given that generates clusters for text documents based on fixed center generation, collects only exclusive words from the different documents in the dataset, and uses cosine similarity to place similar documents in proper clusters. Experimental results show that the accuracy of the proposed algorithm is higher than that of the existing algorithm in terms of F-measure, recall, precision, and time complexity.