2001, Journal of Documentation
Automatic categorisation can be understood as a learning process during which a programme recognises the characteristics that distinguish each category or class from others, i.e. those characteristics which the documents should have in order to belong to that category. As yet few experiments have been carried out with documents in Spanish.
ACM Computing Surveys, 2002
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
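The three problems named in this abstract (document representation, classifier construction, classifier evaluation) can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a multinomial Naive Bayes classifier with add-one smoothing. Both are choices made here for illustration, not methods the abstract singles out, and the tiny corpus and category names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a lowercase bag of words (illustrative choice)."""
    return text.lower().split()


def train(labelled_docs):
    """Classifier construction: collect the counts multinomial Naive Bayes needs."""
    class_docs = Counter()              # number of training documents per category
    word_counts = defaultdict(Counter)  # term frequencies per category
    vocab = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore terms never seen in training
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_docs):
    """Classifier evaluation: fraction of held-out documents labelled correctly."""
    hits = sum(classify(model, text) == label for text, label in test_docs)
    return hits / len(test_docs)


# Hypothetical preclassified documents, invented for this sketch.
TRAIN = [
    ("the team won the match", "sport"),
    ("a late goal decided the match", "sport"),
    ("the bank raised interest rates", "economy"),
    ("markets fell as rates rose", "economy"),
]
TEST = [
    ("the team scored a goal", "sport"),
    ("interest rates and markets", "economy"),
]

model = train(TRAIN)
print(classify(model, "the team scored a goal"))
print(accuracy(model, TEST))
```

Accuracy is used here only because it is the simplest measure to compute; any other evaluation measure could replace the `accuracy` function without touching the other two steps.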
Automatic text categorisation systems are a type of software that is receiving more interest every day, due not only to their use in documentary environments but also to their possible application in properly tagging documents on the Web. Many options have been proposed to address this problem using statistical approaches, natural language processing tools, ontologies and lexical databases. Nevertheless, there have not been many empirical evaluations comparing the influence of the different tools used to solve these problems, particularly in a multilingual environment. In this paper we propose a multi-language rule-based pipeline system for automatic document categorisation, and we empirically compare, using several corpora of documents, the results of applying techniques that rely on statistics and supervised learning with the results of applying the same techniques supported by smarter tools based on language semantics and ontologies. GENIE is being applied in real environments, which shows the potential of the proposal.
IEEE Transactions on Knowledge and Data Engineering, 1999
Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves the retrieval performance compared with no categorization. We also demonstrate that the retrieval performance using automatic categorization achieves the same retrieval quality as the performance using manual categorization. Furthermore, detailed analysis of the retrieval performance on each individual test query is provided.
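The instance-based learning half of this approach can be sketched as a nearest-neighbour categoriser: the training documents are stored as they are, and a new document takes the category that dominates among its most similar stored instances. The sketch below assumes raw term-frequency vectors, cosine similarity and similarity-weighted voting. These are illustrative choices, not the authors' method, and the retrieval feedback component is not modelled at all. The example documents and category names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as raw term frequencies (illustrative choice)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_categorize(stored, text, k=3):
    """Assign the category with the largest similarity-weighted vote among the k nearest instances."""
    query = vectorize(text)
    ranked = sorted(stored, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter()
    for vec, label in ranked[:k]:
        votes[label] += cosine(query, vec)
    return votes.most_common(1)[0][0]


# Hypothetical stored instances, invented for this sketch.
STORED = [
    (vectorize("insulin therapy for diabetes patients"), "endocrine"),
    (vectorize("glucose control in diabetes"), "endocrine"),
    (vectorize("heart failure and blood pressure"), "cardiovascular"),
    (vectorize("blood pressure drugs after heart attack"), "cardiovascular"),
]

print(knn_categorize(STORED, "diabetes and insulin dosage"))
print(knn_categorize(STORED, "high blood pressure", k=1))
```

No model is built ahead of time: all the work happens at query time, which is what distinguishes instance-based learning from approaches that train a classifier up front.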
… Journal for the …, 2005
In this article we examine three different approaches to categorising documents from multilingual corpora using machine learning algorithms. These approaches satisfy two main conditions: there may be an unlimited number of different languages in the corpus and it is unnecessary to previously identify each document's language. The approaches differ in two main aspects: how documents are pre-processed (using either language-neutral or language-specific techniques) and how many classifiers are employed (either one global or one for each existing language). These approaches were tested on a bilingual corpus provided by a Spanish newspaper that contains articles written in Spanish and Basque. The empirical findings were studied from the point of view of classification accuracy and system performance including execution time and memory usage.
2008
This paper analyzes the impact that dimensionality reduction techniques have on the categorization of documents written in Basque. The classification techniques Naïve Bayes, Winnow, SVMs and k-NN have been selected. The Singular Value Decomposition (SVD) dimensionality reduction technique, together with lemmatization and noun selection, has been used in our experiments. The results obtained show that the approach which combines SVD and k-NN for a lemmatized corpus gives the best accuracy of all, by a remarkable margin.
2009
In the process of preparing learning material for Computer Supported Learning Systems (CSLSs), one of the first steps involves finding documents relevant to the topics and to the students. This requires documents to be categorized according to some criteria. In this paper we analyze the behaviour of classification techniques such as Naïve Bayes, Winnow, SVMs and k-NN, together with lemmatization and noun selection, in the categorization of documents written in Basque. In a second experiment, we study the effect of applying the Singular Value Decomposition (SVD) dimensionality reduction technique before using these classification techniques. The results obtained show that the approach which combines SVD and k-NN for a lemmatized corpus gives the best categorization of all, by a remarkable margin. The final aim of this project is to facilitate the semiautomatic construction of the domain module of a CSLS.
Machine Learning for Text Categorization, 2006
A long-standing goal of the field of Artificial Intelligence (AI) is to enable computer understanding of human languages. Much progress has been made towards this goal, but much also remains to be done. Concept maps are considered by some educational psychologists to be a very important tool for improving learning. Moreover, with the rapid spread of the internet and the increase in online information, technology for automatically classifying huge amounts of diverse text information has come to play a very important role. This has led to the use of the machine learning approach, a method of creating classifiers automatically from text data that has been given a category label. This dissertation presents research in the field of AI by studying machine learning for natural language understanding. One important part of the process of understanding a text consists in apprehending its underlying interrelations of concepts. The proposed system aims to extract concepts from natural-language text by constructing two models in the following steps: 1. Extract concepts and the relations between them. 2. Classify sets of documents, some of which are written in different domains. 3. Despite the complexity of natural languages, offer, with the assistance of the user, an interactive interface for structured queries and for completing the concept relations before extracting the desired information from one or many documents in a specific domain using Inductive Logic Programming (ILP). Our examples focus on text written in English. The extracted data are particularly useful for obtaining a structured database from unstructured documents and for preparing information for database entries. This dissertation discusses an efficient algorithm for building a classifier from preclassified documents, which searches for the characteristics of terms and categories in order to classify a corpus or a set of documents.
AAAI-98 Workshop on Learning for Text Classification, 1998
… Journal of Computer Science & Information Technology …, 2010
International Journal of …
Language Resources and Evaluation, 2008