
Determining Initial Starting Conditions for Documents Clustering

2008

Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional, sparse vectors; a few thousand dimensions is typical. Practical approaches to clustering such document vectors use an iterative procedure (e.g., k-means, EM) that is known to be especially sensitive to initial starting conditions (k and the initial centroids). In this paper, we introduce a hybrid clustering algorithm that determines these initial conditions automatically, depending on the required quality of the obtained clusters. The hybrid algorithm combines the agglomerative hierarchical approach with the k-means approach to provide k disjoint clusters. However, the textual, unstructured nature of documents makes the task considerably more difficult than it is for other kinds of data sets. We present the results of an experimental study of the introduced algorithm.
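The hybrid idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes that the "required quality" is expressed as a minimum cosine similarity between cluster centroids (the hypothetical parameter `min_similarity`), that the agglomerative phase uses centroid linkage, and that documents are small dense term-count vectors. All function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agglomerative_seeds(docs, min_similarity):
    """Assumed quality criterion: keep merging the two most similar clusters
    (centroid linkage) until no pair reaches min_similarity.
    The surviving clusters fix both k and the initial centroids."""
    clusters = [[d] for d in docs]
    while len(clusters) > 1:
        best, pair = -2.0, None  # cosine is never below -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(mean(clusters[i]), mean(clusters[j]))
                if s > best:
                    best, pair = s, (i, j)
        if best < min_similarity:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]

def kmeans(docs, centroids, max_iter=20):
    """k-means refinement: assign each document to its most similar
    centroid, recompute centroids, stop when assignments are stable."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(d, centroids[c]))
            for d in docs
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = mean(members)
    return labels, centroids

def hybrid_cluster(docs, min_similarity=0.5):
    """Agglomerative phase picks k and the seeds; k-means refines them."""
    seeds = agglomerative_seeds(docs, min_similarity)
    return kmeans(docs, seeds)

# Toy term-count vectors: two documents per topic.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels, centroids = hybrid_cluster(docs, min_similarity=0.5)
print(len(centroids), labels)
```

In this sketch a stricter threshold stops the merging earlier and therefore yields more clusters, which is one way the number of clusters could follow from a quality requirement rather than being fixed in advance. The pairwise search recomputes centroids on every pass, which is acceptable only for a toy example. A realistic collection with thousands of sparse dimensions would need sparse vectors and cached similarities.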