2008, Information Systems
The paper discusses the increasing need for automatic text categorization methods in the context of large-scale digital document collections. It elaborates on the limitations of human categorization and highlights the evolution of text categorization methods, particularly the naïve Bayes approach and its extensions. The research presents various statistical techniques, including SVM and LR models, to evaluate performance across different domains, providing empirical results that demonstrate improvements in accuracy and reductions in false positive rates.
ACM Computing Surveys, 2002
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Machine Learning for Text Categorization, 2006
A long-standing goal of the field of Artificial Intelligence (AI) is to enable computers to understand human languages. Much progress has been made toward this goal, but much remains to be done. Concept maps are considered by some educational psychologists to be a very important tool for improving learning. Moreover, with the rapid spread of the internet and the growth of online information, technology for automatically classifying huge amounts of diverse text has come to play a very important role. This has led to the use of the machine learning approach, which creates classifiers automatically from text data labeled with categories. This dissertation presents research in AI by studying machine learning for natural language understanding. One important part of understanding a text is apprehending the underlying interrelations of its concepts. The proposed system aims to extract concepts from natural language text by constructing two models in the following steps: 1. Extract concepts and the relations between them. 2. Classify sets of documents, some of which are written in different domains. 3. Despite the complexity of natural languages, the proposed system, with the assistance of the user, offers an interactive interface for structured queries and completes the concept relations before extracting the desired information from one or many documents in a specific domain using Inductive Logic Programming (ILP). Our examples focus on text written in English. The extracted data are particularly useful for obtaining a structured database from unstructured documents and for preparing information for database entries. This dissertation discusses an efficient algorithm for constructing a model that builds a classifier from preclassified documents, searching for the characteristics of terms and categories in order to classify a corpus or a set of documents.
2018
As the number of documents available in digital form becomes enormous, the need to access them in a more flexible way is becoming extremely important. In this context, content-based document management is known as Information Retrieval (IR). It has achieved a prominent position in the area of information systems. For faster IR response times, it is essential to organize, categorize and classify texts and digital documents according to the definitions proposed by text mining experts and computer scientists. Automatic text categorization, or topic spotting, is the process of automatically sorting a document set into categories from a predefined set. According to researchers, the dominant approach to this problem relies on machine learning methods, in which a general inductive process automatically builds a classifier by learning, from a set of pre-classified documents, the characteristics of the categories. Automatic text categorization has been accepted because it is fre...
International Journal of Computer Applications, 2013
The automated categorization (classification) of texts into predefined categories is one of the most widely explored research fields in text mining. Nowadays the volume of available digital data is very high, and organizing it into predefined categories has become a challenging task. Machine learning is an approach by which an automated classifier can be trained to classify documents with minimal human assistance. This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighbor and Support Vector Machine methods within the machine learning paradigm for the automated categorization of given documents into predefined categories.
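To make the idea of training a classifier from preclassified documents concrete, here is a minimal sketch of one of the methods named above, a multinomial naïve Bayes classifier with add-one smoothing. It is not taken from the paper: the whitespace tokenizer, the function names `train_naive_bayes` and `classify`, and the four toy training documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberately crude assumption."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Collect class counts and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        # Log prior: fraction of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
TRAINING_DOCS = [
    ("stocks fell as markets closed lower", "business"),
    ("the bank raised interest rates", "business"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]
MODEL = train_naive_bayes(TRAINING_DOCS)
```

Calling `classify("markets and interest rates", MODEL)` scores each label by its log prior plus the smoothed log likelihood of the known words. A real system would add proper tokenization, feature selection and a held-out evaluation set.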
Computer Science & Engineering: An International Journal, 2015
Over time, the digitization of text has increased enormously, and the need to organize, categorize and classify text has become indispensable. Disorganized and poorly categorized text may result in slower response times in text or information retrieval. It is therefore essential to organize, categorize and classify texts and digitized documents according to definitions proposed by text mining experts and computer scientists. Computer and information scientists have done work on Text Mining, Text Categorization and Automatic Text Classification, but there is clearly much room for novel research in this domain. In this paper we propose mathematical notation and graphical models for Text Mining, Text Categorization and Automatic Text Classification to gain an in-depth understanding of these techniques and concepts. Introducing these mathematical and graphical models will shorten the response time of text and information retrieval. The performance of web search engines can also be greatly improved by employing these models.
The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three different combination approaches, our adaptive classifier combination method introduced here performed the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
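Of the three combination approaches mentioned, simple voting is the easiest to illustrate. The sketch below is an assumption about how such a vote could be implemented, not the authors' procedure: the function name `simple_vote` is hypothetical, and the tie-breaking rule (prefer the label from the earliest classifier in the list) is a choice made here for determinism.

```python
from collections import Counter


def simple_vote(predictions):
    """Return the most frequent label among the classifiers' predictions.

    Ties are broken in favor of the earliest classifier in the list,
    which is an assumption of this sketch.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
```

For example, `simple_vote(["sports", "politics", "sports", "health"])` selects the label chosen by two of the four classifiers. Ordering the list by each classifier's individual accuracy would make the tie-break favor the strongest classifier.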
Information Systems, 2008
Text categorization is the task of assigning predefined categories to natural language text. With the widely used "bag-of-words" representation, previous research usually assigns each word a value that expresses whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, and they include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tf-idf-style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
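The abstract does not give formulas for the distributional features, so the following is only a rough sketch of the kind of values it describes. Both definitions are assumptions: the first appearance is taken as the index of the first occurrence divided by the document length, and compactness is approximated by a hypothetical `spread` value, the distance between the first and last occurrence divided by the document length (smaller means more compact).

```python
def distributional_features(tokens, word):
    """Compute illustrative distributional values for `word` in `tokens`.

    Returns None when the word does not occur. The definitions of
    "first_appearance" and "spread" are assumptions of this sketch,
    not the formulas used in the paper.
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    if not positions:
        return None
    n = len(tokens)
    return {
        "count": len(positions),  # plain term frequency
        "first_appearance": positions[0] / n,  # 0.0 means the very start
        "spread": (positions[-1] - positions[0]) / n,  # 0.0 means one spot
    }
```

Two words with the same term frequency can then differ in `first_appearance` or `spread`, which is the extra information a frequency-only representation discards.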
Webology, 2015
Automated classification of text into predefined categories has long been considered a vital method for managing and processing the vast number of documents in digital form, which are widespread and continuously increasing. This kind of web information, popularly known as digital or electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail etc. People largely access information from these online sources rather than being limited to archaic paper sources like books, magazines, newspapers etc. But the main problem is that this enormous body of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques for organizing this kind of digital data. In this paper we study the existing work in the area of text classification, which allows a fair evaluation of the progress made in this field to date. We have investigated the papers to the ...
Lecture Notes in Computer Science, 2009
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.
Studies in Computational Intelligence, 2009
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
International Journal of Advanced Networking and Applications, 2020
Huge amounts of unstructured text are obtained daily from sources such as emails, tweets, social media posts, customer comments, reviews, and reports in many different fields. Unstructured text data can be analyzed to obtain useful information, which is then used according to the purpose of the analysis and the domain from which the data was obtained. Because of the huge amount of data, manual analysis of these texts by humans is not possible, so automatic analysis is required. Topic analysis is a Natural Language Processing (NLP) technology that organizes and makes sense of large collections of text data by identifying topics and finding patterns and semantics. There are two common approaches to topic analysis, topic modeling and topic classification; each approach has different algorithms, which are discussed.
2008
Text Categorization of Commercial Web Pages. E. Binaghi, M. Carullo, I. Gallo and M. Madaio, Università degli Studi dell'Insubria, Via Mazzini 5, 21100 Varese, Italy. Abstract: In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both representation and classification tasks.
International Journal on Natural Language Computing, 2016
In this paper I first compare Single Label Text Categorization with Multi Label Text Categorization in detail, and then compare Document Pivoted Categorization with Category Pivoted Categorization in detail. For this purpose I give the general definition of Text Categorization with its mathematical notation, for the sake of frugality and cost effectiveness. Then, with the help of mathematical notation and set theory, I have converted the general definitions of
IEEE Transactions on Neural Networks, 1999
Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves the retrieval performance compared with no categorization. We also demonstrate that the retrieval performance using automatic categorization achieves the same retrieval quality as the performance using manual categorization. Furthermore, detailed analysis of the retrieval performance on each individual test query is provided.
IEEE Transactions on Knowledge and Data Engineering, 1999
Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves the retrieval performance compared with no categorization. We also demonstrate that the retrieval performance using automatic categorization achieves the same retrieval quality as the performance using manual categorization. Furthermore, detailed analysis of the retrieval performance on each individual test query is provided.