As the number of research papers increases, the need for an academic categorizer system becomes crucial: such a system helps academics organize their research papers into predefined categories based on the similarity of the documents' content. This paper presents a Document Categorizer Agent based on the ACM CCS (Association for Computing Machinery Computing Classification System). First, we studied the ACM category hierarchy. Next, based on these categories, we retrieved a corpus from the ACM DL (ACM Digital Library) to train the Categorizer Agent using a popular machine learning technique, the Naïve Bayes classifier. We used two types of training data in the corpus, namely negative training data and positive training data. Papers are then categorized according to their content using the same training data. We tested the Document Categorizer Agent on a number of academic papers to measure its accuracy, and the results obtained were promising.
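The positive/negative training setup described above can be sketched with a per-category Naïve Bayes classifier. The category, documents, and bag-of-words features below are illustrative assumptions, not the paper's actual ACM DL corpus:

```python
# A minimal sketch of per-category Naive Bayes training with positive and
# negative examples. Documents are toy stand-ins for the paper's corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

positive = ["neural network training deep learning",
            "gradient descent learning optimization"]
negative = ["database query transaction index",
            "network routing packet protocol"]

vec = CountVectorizer()
X = vec.fit_transform(positive + negative)
y = [1] * len(positive) + [0] * len(negative)   # 1 = belongs to category

clf = MultinomialNB().fit(X, y)
print(clf.predict(vec.transform(["deep learning optimization"]))[0])  # → 1
```

One such binary model per ACM category lets a new paper be tested against each category independently.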
2010
This paper presents a Document Categorizer Agent that categorizes computer science academic papers in PDF format, such as journals and proceedings. In this paper, we propose the use of a set of terms stored in a database to categorize computer science papers. Few methods and ...
American Journal of Economics and Business Administration, 2011
Problem statement: With the rapid development of the World Wide Web (WWW), a huge amount of information is now accessible to web users. This phenomenon has attracted academic users to publish their research papers online and, at the same time, to download and share academic papers among themselves through the WWW. Categorizing a document manually can take up a considerable amount of a user's time, as the user has to read each document to decide which category suits it. Approach: Our study proposes the use of a set of terms stored in a database to categorize computer science papers. The categorizer agent focuses on categorizing text documents into predetermined categories based on the extracted keywords. Results: We evaluated our document categorizer agent on a number of computer science papers. The categorization process is done by parsing the document, calculating the frequency of each term and matching the terms found in the database. Conclusion: The Categorizer Agent proposed in this paper is evaluated as a good approach to categorizing electronic papers. Moreover, the results indicate that the use of this term database is a sustainable way to categorize computer science electronic documents.
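The parse-count-match pipeline described in the Results section can be sketched as follows; the term database contents and category names are hypothetical, not the paper's actual database:

```python
from collections import Counter
import re

# Hypothetical term database: category -> set of characteristic terms.
TERM_DB = {
    "databases": {"sql", "query", "transaction", "index"},
    "networking": {"packet", "routing", "tcp", "bandwidth"},
}

def categorize(text):
    # Parse the document into lowercase terms and count their frequencies.
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each category by the total frequency of matched terms.
    scores = {cat: sum(freq[t] for t in terms) for cat, terms in TERM_DB.items()}
    return max(scores, key=scores.get)

print(categorize("The query optimizer rewrites each SQL query "
                 "before the transaction commits."))  # → databases
```

The same structure extends to any number of categories by adding rows to the term database.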
International Journal of Computing and Digital Systems, 2020
In this modern era of bleeding-edge technologies, information creation, sharing and consumption are rising at an exponential rate. In the same vein, there has been a continued increase in the amount of research being published worldwide, and a large proportion of it is in the computer science field. There is an urgent need to provide some level of order in this huge jungle of data. Thus, in this article, we have used eight supervised machine learning techniques to classify computer science research papers. Machine learning techniques such as logistic regression, multinomial naive Bayes, Gaussian naive Bayes, support vector machines, k-nearest neighbours, decision trees, random forests and deep neural networks were trained to classify research papers into appropriate categories. For this purpose, a labelled dataset of 69,776 papers was downloaded from arXiv, and these were classified into 35 categories. The best F1-score of 0.60 was obtained by the logistic regression classifier, which was also the fastest machine learning classifier. The best F1-score from the deep learning network was 0.59. Using only the list of references for classification produced an F1-score of 0.57, but the training and testing time was significantly lower. This shows that it is possible to use references alone to classify computer science research papers. The F1-score for abstracts only was 0.52. Computer science papers often do not fall into neat categories; they are often multi-topical. Thus, in the future, we intend to perform multi-label classification on the same dataset.
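The best-performing setup reported above (logistic regression scored by F1) can be sketched as a standard TF-IDF pipeline. The four toy documents and two labels below are illustrative stand-ins for the 69,776-paper, 35-category arXiv dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; the actual study used 69,776 arXiv papers.
docs = ["convolutional neural network image classification",
        "recurrent network sequence learning",
        "sql database index transaction",
        "relational database query optimization"]
labels = ["ml", "ml", "db", "db"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

pred = model.predict(["deep neural network learning", "database transaction query"])
print(pred)                                            # expect ['ml', 'db'] here
print(f1_score(["ml", "db"], pred, average="weighted"))
```

Swapping `LogisticRegression` for any of the other seven estimators keeps the rest of the pipeline unchanged, which is what makes this kind of comparison study straightforward.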
Journal of the National Science Foundation of Sri Lanka, 2016
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using categorization techniques. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. This paper studies the advantages of a number of classification algorithms for classifying documents, such as the Naive Bayes algorithm, K-Nearest Neighbour and decision trees, and presents a comparative study of the advantages and disadvantages of these classification algorithms.
2018
This research applies text mining technology, which extracts text data to find information in a set of documents, to abstracts from English-language journals. The abstract contents of 120 records were downloaded from www.computer.org. The data are grouped into three categories: DM (Data Mining), ITS (Intelligent Transport System) and MM (Multimedia). The system is built using the naive Bayes algorithm to classify abstracts, with a feature selection process using term weighting to give a weight to each word. Dimensionality reduction is used to prune words that rarely appear in each document, with reduction parameters tested from 10% to 90% of the 5,344 words. The performance of the classification system is tested using a confusion matrix on held-out test data. The results show that the best classification is obtained with a 75%/25% split of training and test data: accuracy rates for the DM, ITS and MM categories were 100%, 100% and 86% respectively, with a dimension reduction parameter of 30% and a learning rate between 0.1 and 0.5.
2015
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Clustering is one of its applications: grouping data items together into classes according to their similarity. Data items within a class have high similarity to one another but are very dissimilar to data items in other classes. The K-means algorithm is one of the most widely used clustering algorithms; it is easy to implement and can be applied to a wide variety of problems, but its use has been restricted to small datasets. Hadoop supports MapReduce, which divides the input into small chunks and executes operations in a distributed manner. It is inexpensive, scalable, free and open source, which makes it a promising technology for data-intensive problems. The aim of this work is to run the K-means algorithm on large datasets on Hadoop and then to check its performance as the data ...
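K-means maps naturally onto MapReduce because each iteration splits into an assignment step (map) and a centroid-recomputation step (reduce). A minimal single-machine sketch, with the two steps marked, might look like this (the sample points are arbitrary):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; in a MapReduce deployment the assignment loop is the
    'map' phase and the centroid recomputation is the 'reduce' phase."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Map: assign each point to its nearest centroid (squared distance).
        clusters = {i: [] for i in range(k)}
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Reduce: recompute each centroid as the mean of its cluster.
        for i, members in clusters.items():
            if members:
                centroids[i] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))   # two centroids, one per cluster of points
```

On Hadoop the point set is partitioned across mappers and the per-centroid sums are combined in the reducers, so each iteration becomes one MapReduce job.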
Text classification is used to classify documents of similar types. Text classification can also be performed under supervision, i.e. as a supervised learning technique: documents are sorted automatically into different classes using a predefined set of categories. The main issue is that large-scale information lacks organization, which makes it difficult to manage. Text classification is identified as one of the key methods for recognizing such digital information. It has various applications, such as information retrieval, natural language processing, automatic indexing, text filtering and image processing. Text classification is also used to process big data and to predict class labels for newly added data, and it is used in academia and industry to classify unstructured data. There are various text classification approaches, such as decision trees, SVM, Naïve Bayes, etc. In this survey paper, we analyse these text classification techniques; each has its individual advantages which make it suitable for particular classification jobs. We also analyse evaluation parameters such as F-measure, G-measure and accuracy used in various research works.
Nowadays, information retrieval is a challenging task for search engines. In this paper we discuss text categorization: the process of classifying documents according to some predefined knowledge. Documents with the same concept are grouped together, and documents with different concepts form other groups according to the similarity of their context; this grouping technique is called document categorization, so related documents end up in the same category and unrelated documents in other categories. We concentrate on the document context and categorize documents accordingly. We propose link-based document categorization according to the document context of a particular concept. In this way we can retrieve the proper information about a document and also identify its main concept and sub-concepts according to the percentage weights of the domains of a document. Using the percentages of different concepts from different domains and the indexing of documents, categorization can be improved for the information retrieval process of a search engine.
International Journal of Advanced Computer Science and Applications
Natural Language Processing, specifically text classification or text categorization, has become a trend in computer science. Commonly, text classification is used to categorize large amounts of data so that less time is needed to retrieve information. Students, as well as research advisers and panelists, take extra effort and time in classifying research documents. To solve this problem, the researchers used state-of-the-art supervised term weighting schemes, namely TF-MONO and SQRTF-MONO, and applied them to machine learning algorithms (K-Nearest Neighbor, Linear Support Vector and Naive Bayes classifiers), creating a total of six classifier models to ascertain which performs optimally in classifying research documents, while utilizing Optical Character Recognition for text extraction. The results showed that among all classification models trained, SQRTF-MONO with Linear SVC outperformed all other models with an F1 score of 0.94 on both the abstract and the background-of-the-study datasets. In conclusion, the developed classification model and application prototype can be a tool to help researchers, advisers, and panelists lessen the time spent in classifying research documents.
2020
Categorizing text documents is the method of arranging different types of documents into labelled data. The aim of this paper is to combine data mining technology, data extraction and artificial intelligence for text categorization, and it showcases the features of the technologies involved. Three machine learning algorithms (SVM, Multinomial Naive Bayes and Logistic Regression) are used for text categorization, i.e. arranging documents into the categories of the 20 Newsgroups dataset. In the evaluation of these classification techniques, the SVM classifier outperforms the other classifiers.
2018
As the availability of documents in digital form becomes enormous, the need to access them in a more flexible way becomes extremely important. In this context, document management based on content is called IR, or Information Retrieval, and has achieved a noticeable position in the area of information systems. For faster IR response times, it is very important to organize, categorize and classify texts and digital documents according to the definitions proposed by text mining experts and computer scientists. Automatic text categorization, or topic spotting, is a process for sorting a document set automatically into categories from a predefined set. According to researchers, the superior approach to this problem relies on machine learning methods, in which a general a posteriori process builds a classifier automatically by learning from given pre-classified documents and the categories' characteristics. Automatic text categorization is accepted because it is fre...
Computing Research Repository, 2010
Text document classification aims at associating one or more predefined categories with a document, based on the likelihood suggested by a training set of labeled documents. Many machine learning algorithms play a vital role in training the system with predefined categories, among which Naïve Bayes has some intriguing properties: it is simple, easy to implement, and achieves good accuracy on large datasets.
2014
With the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data or even classify it into categories. With this growth of information and the simultaneous growth of available computing power, automatic classification of data, particularly textual data, gains increasingly high importance. Text classification is the task of automatically sorting a set of documents into categories from a predefined set, and is one of the important research issues in the field of text mining. This paper provides a review of the generic text classification process, the phases of that process and the methods being ...
AIP Conference Proceedings, 2017
Advances in technology have made categorizing documents important, owing to the increasing number of documents themselves. Managing documents by categorizing them is one application of Information Retrieval, because it involves text mining in its process. Categorization can be done with both the Fuzzy C-Means (FCM) and K-Nearest Neighbors (KNN) methods, and this experiment consolidates the two. The aim of the experiment is to increase the performance of document categorization. First, FCM clusters the training documents. Second, KNN categorizes each testing document and the output of the categorization is shown. In the experiment, 14 testing documents were retrieved as relevant to their category, while 6 of the 20 testing documents were retrieved as irrelevant to their category. The system evaluation shows that both precision and recall are 0.7.
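The two-step pipeline above (cluster the training set, then let KNN assign test documents to a cluster) can be sketched as follows. Hard k-means stands in here for Fuzzy C-Means, and the 2-D vectors are synthetic stand-ins for document feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Synthetic "document vectors": two well-separated groups of 10 points each.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Step 1: cluster the training vectors (k-means as a stand-in for FCM).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# Step 2: KNN assigns each test vector to one of the learned clusters.
knn = KNeighborsClassifier(n_neighbors=3).fit(train, km.labels_)

test = np.array([[0.1, -0.2], [3.1, 2.9]])
print(knn.predict(test))   # the two test points land in different clusters
```

A genuine FCM step would additionally give each training document a membership degree in every cluster rather than a single hard label.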
2015
Due to the continuous growth of web databases, automatic identification of the category of newly published web documents is very important nowadays. Accordingly, a variety of algorithms have been developed in the literature for automatic categorization of web documents to ease their retrieval. In this paper, a document-document similarity matrix and Naive Bayes classification are combined for web information retrieval. At first, web documents are pre-processed to extract features, which are then used to build a document-document (D-D) similarity matrix, where every element is the similarity between two web documents computed with a semantic entropy measure. Subsequently, the D-D matrix is used to create a training table which contains the frequency of every attribute and its probability. In the testing phase, the relevant category for an input web document is found using the trained classification model, and this category is used to retrieve the relevant categorized documents already stored semantically in the web database. The experimentation is performed using 100 web documents of two different categories, and the evaluation is done using sensitivity, specificity and accuracy.
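The document-document similarity matrix at the heart of this approach can be sketched as below. Cosine similarity over TF-IDF vectors stands in for the paper's semantic-entropy measure, and the three documents are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["web page ranking algorithm",
        "search engine page ranking",
        "protein sequence alignment"]

X = TfidfVectorizer().fit_transform(docs)
S = cosine_similarity(X)    # S[i, j] = similarity between documents i and j
print(np.round(S, 2))       # 3x3 symmetric matrix with ones on the diagonal
```

Documents 0 and 1 share the terms "page" and "ranking", so `S[0, 1]` is positive, while document 2 shares nothing with them and its off-diagonal entries are zero; a training table built from such a matrix captures exactly these pairwise relationships.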
2017
Automatic categorization of computer science research papers using just the abstracts is a hard problem, due to the short text length of abstracts. Also, abstracts are a general discussion of the topic with few domain-specific terms. These factors make it hard to generate good representations of abstracts, which in turn leads to poor categorization performance. To address this challenge, external Knowledge Bases (KBs) like Wikipedia, Freebase, etc. can be used to enrich the representations of abstracts, which can aid the categorization task. In this work, we propose a novel method for enhancing the classification of research papers into ACM computer science categories using knowledge extracted from related Wikipedia articles and Freebase entities. We use state-of-the-art representation learning methods for feature representation of documents, followed by a learning-to-rank method for classification. Given the abstracts of research papers from the Citatio...
2018
As most information on the web is stored as text, text document classification is considered to have high commercial value. Text classification is classifying documents according to predefined categories. The complexity of natural languages and the very high dimensionality of the feature space of documents have made this classification problem difficult. In this paper we introduce text classification and its process, give an overview of the classifiers, and compare some existing classifiers on the basis of criteria such as time complexity, merits and demerits.
ACM Computing Surveys, 2002
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Proceedings of the 28th Annual ACM Symposium on Applied Computing, 2013
The automatic classification of research articles into one or more fields of science is of primary importance for scientific databases and digital libraries. A sophisticated classification strategy renders searching more effective and assists users in locating similar relevant items. Although most publishing services require authors to categorize their articles themselves, there are still cases where older documents remain unclassified, or the taxonomy changes over time. In this work we attempt to address this interesting problem by introducing a machine learning algorithm which combines several parameters and meta-data of a research article. In particular, our model exploits the training set to correlate keywords, authors, co-authorship, and publishing journals with a number of labels of the taxonomy; it then applies this information to classify the remaining documents. The experiments we have conducted with a large dataset of about 1.5 million articles demonstrate that, in this specific application, our model outperforms the AdaBoost.MH and SVM methods.
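Correlating categorical metadata (keywords, authors, journals) with taxonomy labels can be sketched with a dictionary vectorizer feeding a linear model. The feature names, papers, and labels below are invented for illustration and are not the paper's model or data:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each paper is a bag of metadata features (keywords, author, journal).
papers = [
    {"kw=sorting": 1, "author=knuth": 1, "journal=JACM": 1},
    {"kw=complexity": 1, "author=knuth": 1, "journal=JACM": 1},
    {"kw=protein": 1, "author=altschul": 1, "journal=JMB": 1},
    {"kw=genome": 1, "author=altschul": 1, "journal=JMB": 1},
]
labels = ["cs", "cs", "bio", "bio"]

vec = DictVectorizer()
X = vec.fit_transform(papers)
clf = LogisticRegression().fit(X, labels)

# An unclassified article: only a keyword and a journal are known.
new = {"kw=sorting": 1, "journal=JACM": 1}
print(clf.predict(vec.transform([new]))[0])  # → cs
```

Because every metadata field becomes an independent feature, articles with missing fields (e.g. an unknown author) are still classifiable from whatever metadata remains, which matches the paper's motivation of handling older, unclassified documents.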