
Discriminative features for text document classification

2004, Formal Pattern Analysis & Applications

The bag-of-words approach to text document representation typically yields document vectors with on the order of 5,000 to 20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in the class discrimination of two popular dimensionality reduction methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion.
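As a rough illustration of why the representation grows so large, the following minimal sketch builds bag-of-words count vectors for a toy corpus. The corpus, the whitespace tokenizer, and the function names `build_vocabulary` and `bag_of_words` are assumptions made for illustration and are not taken from the paper. Each document vector has one component per distinct vocabulary word, so a realistic corpus readily reaches thousands of components.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a fixed vector index."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a term-count vector with one component per vocabulary word."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # ignore words outside the vocabulary
            vector[vocabulary[token]] = count
    return vector


# Toy corpus (hypothetical; chosen only to keep the example small).
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

# The vector dimension equals the vocabulary size, not the document length.
print(len(vocabulary), [len(v) for v in vectors])
```

Even in this tiny example most components of each vector are zero. That sparsity and size are what motivate dimensionality reduction before a statistical classifier is trained.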