The Effect of Word Sampling on Document Clustering

2008

Abstract

Many document clustering techniques depend on the number of word occurrences in documents. In these techniques, each word is treated as a dimension of the clustering space. Because a huge number of words is found in each document, earlier studies sought to reduce this high dimensionality for better performance, e.g., through word pruning. Sampling has been used to choose random representative documents to which clustering techniques are applied instead of the whole data set, but it has not previously been applied to words. In this paper, we study the effect of word sampling on document clustering as a method of reducing high dimensionality, and we present a random word sampling technique. Both the Euclidean and Manhattan distance functions were used as the similarity measure. A hybrid clustering algorithm is modified to include word sampling. The results are compared with those obtained without word sampling, in terms of the clustering accuracy of the resulting clusters.

Key-Words: Data min...
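As a rough illustration of the idea described in the abstract, the following Python sketch draws a random subset of the vocabulary, represents each document by its occurrence counts over only the sampled words, and compares documents with the Euclidean and Manhattan distances. It is a minimal sketch under stated assumptions, not the paper's algorithm: the abstract does not say how words are drawn or how many are kept, so uniform sampling without replacement and the `fraction` parameter are assumptions, and all function names (`word_counts`, `sample_vocabulary`, `to_vector`) are hypothetical. The hybrid clustering algorithm itself is not reproduced here.

```python
import math
import random
from collections import Counter


def word_counts(text):
    """Count word occurrences in one document (whitespace tokenization, an assumption)."""
    return Counter(text.lower().split())


def sample_vocabulary(docs, fraction, seed=0):
    """Pick a uniform random subset of the words seen across all documents.

    `fraction` (share of the vocabulary to keep) is a hypothetical parameter;
    the abstract does not state the sampling rate.
    """
    vocab = sorted(set().union(*docs))  # every distinct word = one dimension
    k = min(len(vocab), max(1, int(fraction * len(vocab))))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return sorted(rng.sample(vocab, k))


def to_vector(counts, words):
    """Project a document onto the sampled words only."""
    return [counts.get(w, 0) for w in words]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    docs = [
        word_counts("clustering groups similar documents"),
        word_counts("sampling words reduces the clustering dimensions"),
    ]
    words = sample_vocabulary(docs, fraction=0.5, seed=42)
    v0, v1 = (to_vector(d, words) for d in docs)
    print("sampled words:", words)
    print("euclidean:", euclidean(v0, v1))
    print("manhattan:", manhattan(v0, v1))
```

Any clustering algorithm that works on distance computations could, in principle, consume the reduced vectors in place of the full word-count vectors; the distances are then computed over fewer dimensions, which is the source of the expected performance gain.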