2015
The large amount of textual information digitally available today gives rise to the need for effective means of indexing, searching and retrieving this information. Keywords describe briefly and precisely the contents of a textual document. In this paper we present an algorithm for keyword extraction from documents written in Spanish. This algorithm combines autoencoders, which are well suited to highly unbalanced classification problems, with the discriminative power of conventional binary classifiers. To improve its performance on larger and more diverse datasets, our algorithm trains several models of each kind through bagging.
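A minimal sketch of the ensemble idea described above, assuming scikit-learn: an autoencoder trained only on the minority (keyword) class supplies a reconstruction-based score, a conventional classifier supplies a discriminative score, and bagging averages several bootstrapped models of each kind. The feature matrix `X` and labels `y` are placeholders, not the paper's actual features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def train_bagged_models(X, y, n_models=5, seed=0):
    rng = np.random.RandomState(seed)
    autoencoders, classifiers = [], []
    for _ in range(n_models):
        Xb, yb = resample(X, y, random_state=rng)  # bootstrap sample (bagging)
        # Autoencoder trained only on the minority (keyword) class:
        # candidates it reconstructs well are likely keywords.
        ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000,
                          random_state=rng.randint(10**6))
        ae.fit(Xb[yb == 1], Xb[yb == 1])
        autoencoders.append(ae)
        # Conventional discriminative classifier on the same bootstrap sample.
        classifiers.append(LogisticRegression(max_iter=1000).fit(Xb, yb))
    return autoencoders, classifiers

def score(X, autoencoders, classifiers):
    # Low reconstruction error -> high keyword score, mapped into (0, 1].
    ae_scores = np.mean(
        [1.0 / (1.0 + ((ae.predict(X) - X) ** 2).mean(axis=1))
         for ae in autoencoders], axis=0)
    clf_scores = np.mean([c.predict_proba(X)[:, 1] for c in classifiers], axis=0)
    return (ae_scores + clf_scores) / 2.0  # simple average of both views
```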
2013
Obtaining the most representative set of words in a document is a very significant task, since it characterizes the document and simplifies search and classification activities. This paper presents a novel method, called LIKE, that automatically extracts keywords from a document regardless of the language in which it is written. To do so, it uses a three-stage process: the first stage identifies the most representative terms, the second stage builds a numeric representation that is appropriate for those terms, and the third one uses a feed-forward neural network to obtain a predictive model. To measure the efficacy of the LIKE method, the articles published by the Workshop of Computer Science Researchers (WICC) over the last 14 years (1999-2012) were used. The results obtained show that LIKE outperforms the KEA method, one of the most widely cited solutions in the literature on this topic.
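A hedged sketch of a LIKE-style three-stage pipeline, assuming scikit-learn; the candidate filter and the numeric features below (relative frequency, first position, term length) are illustrative placeholders, not the representation used in the paper.

```python
from collections import Counter
from sklearn.neural_network import MLPClassifier

def candidate_terms(tokens, min_freq=2):
    counts = Counter(tokens)                  # stage 1: representative terms
    return [t for t, c in counts.items() if c >= min_freq]

def features(term, tokens):
    first = tokens.index(term) / len(tokens)  # relative first position
    freq = tokens.count(term) / len(tokens)   # relative frequency
    return [freq, first, len(term)]           # stage 2: numeric representation

# Toy example; real training data would come from documents with known keywords.
tokens = "keyword extraction helps search keyword indexing search".split()
cands = candidate_terms(tokens)               # -> ['keyword', 'search']
X = [features(t, tokens) for t in cands]
y = [1, 0]                                    # toy labels for illustration

# stage 3: a feed-forward neural network as the predictive model
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)
```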
Language Resources and Evaluation, 2008
A frequent problem in automatic categorization applications involving the Portuguese language is the absence of large corpora of previously classified documents, which permit the validation of experiments carried out. Generally, the available corpora are not classified or, when they are, they contain a very reduced number of documents. The general goal of this study is to
… Linguistics and Intelligent …, 2009
The C-value/NC-value algorithm, a hybrid approach to automatic term recognition, was originally developed to extract multiword term candidates from specialised documents written in English. Here, we present three main modifications to this algorithm that affect how the obtained output is refined. The first modification aims to maximise the number of real terms in the list of candidates with a new approach to the stop-list application process. The second modification adapts the C-value calculation formula in order to consider single-word terms. The third modification changes how the term candidates are grouped, exploiting a lemmatised version of the input corpus. Additionally, the size of the candidate's context window is made variable. We also show the necessary linguistic modifications to apply this algorithm to the recognition of term candidates in Spanish.
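For reference, a sketch of the original C-value score (Frantzi et al.) together with a common single-word adaptation of the kind described above: adding 1 inside the logarithm so that a length-1 candidate no longer receives log2(1) = 0. Whether this matches the paper's exact reformulation is an assumption; candidates are modelled here as token tuples, and `freq` maps each candidate to its corpus frequency.

```python
import math

def c_value(cand, freq, single_words=True):
    length = len(cand) + 1 if single_words else len(cand)
    # Longer candidates that contain `cand` as a contiguous subsequence.
    nesting = [b for b in freq
               if len(b) > len(cand)
               and any(b[i:i + len(cand)] == cand
                       for i in range(len(b) - len(cand) + 1))]
    if not nesting:
        return math.log2(length) * freq[cand]
    # Discount by the average frequency of the terms that nest `cand`.
    nested_freq = sum(freq[b] for b in nesting) / len(nesting)
    return math.log2(length) * (freq[cand] - nested_freq)

freq = {("neural", "network"): 8, ("artificial", "neural", "network"): 5}
print(c_value(("neural", "network"), freq))  # discounted by its nesting term
```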
2008
This paper presents a corpus-based approach for extracting keywords from a text written in a language that has no word boundaries. Based on the concept of the Thai character cluster, a Thai running text is preliminarily segmented into a sequence of inseparable units, called TCCs. To enable the handling of a large-scale text, a sorted sistring (or suffix array) is applied to calculate a number of statistics for each TCC. Using these statistics, we applied three alternative supervised machine learning techniques, naïve Bayes, centroid-based and k-NN, to learn classifiers for keyword identification. Our method is evaluated on a medical text extracted from the Web. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
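A minimal sketch of the sorted-sistring (suffix array) idea: once all suffixes are sorted, every occurrence of a substring forms a contiguous block, so its frequency is found with two binary searches. The segmentation of Thai text into TCCs is language-specific and is not reproduced here; `bisect`'s `key=` argument needs Python 3.10 or later.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # O(n^2 log n) toy construction; real systems use linear-time algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count(text, sa, pattern):
    # Occurrences of `pattern` occupy one contiguous block of the array.
    lo = bisect_left(sa, pattern, key=lambda i: text[i:i + len(pattern)])
    hi = bisect_right(sa, pattern, key=lambda i: text[i:i + len(pattern)])
    return hi - lo

text = "banana"
sa = suffix_array(text)
print(count(text, sa, "an"))  # -> 2
```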
2010
In this paper a machine learning approach is applied to Automatic Term Recognition (ATR). Similar approaches have been successfully used in Automatic Keyword Extraction (AKE). Using a dataset consisting of Swedish patent texts and validated terms belonging to these texts, unigrams and bigrams are extracted and annotated with linguistic and statistical feature values. Experiments using a varying ratio between positive and negative examples in the training data are conducted using the annotated n-grams. The results indicate that a machine learning approach is viable for ATR. Furthermore, a machine learning approach for bilingual ATR is discussed. Preliminary analysis, however, indicates that some modifications have to be made to apply the monolingual machine learning approach to a bilingual context.
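A small sketch of the kind of ratio experiment described above, assuming NumPy arrays of annotated n-gram features: training sets with a controlled positive-to-negative ratio are built by undersampling the negatives. The feature matrix is a placeholder.

```python
import numpy as np

def sample_with_ratio(X, y, neg_per_pos, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    # Undersample negatives to reach the requested ratio (capped by the pool).
    neg = rng.choice(np.flatnonzero(y == 0),
                     size=min(len(pos) * neg_per_pos, int((y == 0).sum())),
                     replace=False)
    idx = rng.permutation(np.concatenate([pos, neg]))
    return X[idx], y[idx]

X = np.arange(12).reshape(6, 2)          # toy feature matrix
y = np.array([1, 0, 0, 1, 0, 0])
Xr, yr = sample_with_ratio(X, y, neg_per_pos=1)  # 1:1 training set
```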
Proceedings of the 21st International Conference …, 2006
This paper presents a study of whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that a higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full-texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, represented either as unigrams or intact. Of these two experiments, the unigrams have the better performance, although neither performs as well as headlines only.
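A sketch of the weighting scheme described above, assuming a TF-IDF full-text representation: words that were also extracted as keywords get their weights multiplied by a boost factor before categorization. The boost value of 2.0 is a placeholder, not the paper's setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def boost_keywords(docs, keywords_per_doc, boost=2.0):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs).tolil()   # LIL format allows cheap cell edits
    vocab = vec.vocabulary_
    for row, kws in enumerate(keywords_per_doc):
        for kw in kws:
            col = vocab.get(kw)
            if col is not None:           # keyword present in the vocabulary
                X[row, col] *= boost
    return X.tocsr(), vec

docs = ["neural keyword extraction", "rule based text categorization"]
X, vec = boost_keywords(docs, [["keyword"], ["categorization"]])
```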
Lecture Notes in Computer Science, 2018
In this paper, we present YAKE!, a novel feature-based system for multilingual keyword extraction from single documents, which supports texts of different sizes, domains or languages. Unlike most systems, YAKE! does not rely on dictionaries or thesauri, nor is it trained on any corpus. Instead, we follow an unsupervised approach which builds upon features extracted from the text, making it applicable to documents written in many different languages without the need for external knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted. In this demo, we offer an easy-to-use, interactive session where users from both academia and industry can try our system, either by using a sample document or by introducing their own text. As an add-on, we compare our extracted keywords against the output produced by IBM Natural Language Understanding (IBM NLU) and the RAKE system.
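A hedged usage sketch, assuming the `yake` package published on PyPI (`pip install yake`) exposes the demo's extractor; in that package lower scores indicate better candidates.

```python
import yake

text = ("YAKE! is a light-weight unsupervised automatic keyword extraction "
        "method which rests on statistical text features extracted from "
        "single documents.")

# lan: language code, n: maximum n-gram size, top: number of keywords
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)
for keyword, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {keyword}")   # lower score = more relevant
```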
2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)
This paper presents a survey-cum-evaluation of methods for the comprehensive comparison of the task of keyword extraction using datasets of various sizes, forms, and genres. We use four different datasets: Amazon product data (Automotive), SemEval 2010, TMDB and Stack Exchange. Moreover, a subset of 100 Amazon product reviews is annotated and utilized for evaluation in this paper, to our knowledge, for the first time. The datasets are evaluated with five Natural Language Processing approaches (three unsupervised and two supervised): TF-IDF, RAKE, TextRank, LDA and a shallow neural network. We use a tenfold cross-validation scheme and evaluate the performance of these approaches using recall, precision and F-score. Our analysis and results provide guidelines on the proper approaches to use for different types of datasets. Furthermore, our results indicate that certain approaches achieve improved performance on certain datasets due to inherent characteristics of the data.
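The metrics named above are typically computed set-wise over exact matches between extracted and gold keywords; a minimal self-contained sketch:

```python
def prf(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf(["tf-idf", "rake", "textrank"], ["rake", "lda", "textrank"]))
# -> (0.666..., 0.666..., 0.666...)
```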
Journal of Documentation, 2001
Automatic categorisation can be understood as a learning process during which a programme recognises the characteristics that distinguish each category or class from others, i.e. those characteristics which the documents should have in order to belong to that category. As yet few experiments have been carried out with documents in Spanish.
International Journal of Electrical and Computer Engineering (IJECE), 2025
This article presents a comprehensive comparative analysis of two advanced hybrid machine learning approaches for keyword extraction: bidirectional encoder representations from transformers (BERT) combined with autoencoder (AE) and term frequency-inverse document frequency (TF-IDF) combined with autoencoder. The research targets the task of semantic analysis in text data to evaluate the effectiveness of these methods in ensuring adequate keyword coverage across diverse text corpora. The study delves into the architecture and operational principles of each method, with a particular focus on the integration with autoencoders to enhance the semantic integrity and relevance of the extracted keywords. The experimental section provides a detailed performance analysis of both methods on various text datasets, highlighting how the structure and semantic richness of the source data influence the outcomes. The evaluation methodology includes precision, recall, and F1-score metrics. The paper discusses the advantages and disadvantages of each approach and their suitability for specific keyword extraction tasks. The findings offer valuable insights for the scientific community, aiding in the selection of the most appropriate text processing method for applications requiring deep semantic understanding and high accuracy in information extraction.
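An illustrative sketch of the TF-IDF-plus-autoencoder pairing, assuming scikit-learn in place of the paper's (unspecified) implementation stack: TF-IDF vectors are reconstructed through a bottleneck, and the terms whose weights best survive reconstruction are read off as the most salient.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor

docs = ["keyword extraction with transformers",
        "autoencoders compress tf idf document vectors",
        "semantic analysis of text corpora with embeddings"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Reconstruct TF-IDF vectors through a small bottleneck layer.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
ae.fit(X, X)
recon = ae.predict(X)

terms = vec.get_feature_names_out()
for row in recon:
    top = np.argsort(row)[::-1][:3]   # highest reconstructed weights per doc
    print([terms[i] for i in top])
```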
2012
strong links among the linguistic resources developed for each language of the region is essential. Additionally, there is a lack of published resources in some of these languages. This lack fosters a strong inter-relation between them and better-resourced languages, such as English and Spanish. In order to favour the intra-relation among the peninsular languages as well as the inter-relation between them and foreign languages, multilingual NLP tools for different purposes need to be developed. Interesting topics to be researched include, among others, the analysis of parallel and comparable corpora, the development of multilingual resources, and language analysis in bilingual environments and within dialectal variations. With the aim of solving these tasks, statistical, linguistic and hybrid approaches are proposed. Therefore, the workshop addresses researchers from different fields of natural language processing/computational linguistics: text mining, machine learning, pattern recognition, i...
SN Computer Science
The goal of keyword extraction is to extract from a text the words or phrases indicative of what it is about. In this work, we look at keyword extraction from a number of different perspectives: Statistics, Automatic Term Indexing, Information Retrieval (IR), Natural Language Processing (NLP), and the emerging Neural paradigm. The 1990s saw some early attempts to tackle the issue, primarily based on text statistics [13, 17]. Meanwhile, in IR, efforts were largely led by DARPA's Topic Detection and Tracking (TDT) project [2]. In this contribution, we discuss how past innovations paved the way for more recent developments, such as LDA, PageRank, and Neural Networks. We walk through the history of keyword extraction over the last 50 years, noting differences and similarities among methods that emerged during that time. We conduct a large meta-analysis of the past literature using datasets from news media, science, and medicine to business and bureaucracy, to draw a general pict...
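A minimal TextRank-style sketch of the PageRank line of work mentioned above, assuming networkx: words co-occurring within a sliding window become graph edges, and PageRank scores rank the keywords. Window size and top-k are illustrative defaults.

```python
import networkx as nx

def textrank_keywords(tokens, window=2, top=5):
    g = nx.Graph()
    # Connect each word to its neighbours inside the sliding window.
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + 1 + window]:
            if u != w:
                g.add_edge(w, u)
    ranks = nx.pagerank(g)
    return sorted(ranks, key=ranks.get, reverse=True)[:top]

tokens = ("keyword extraction methods rank keyword candidates "
          "by graph centrality").split()
print(textrank_keywords(tokens))
```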
RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning
This paper evaluates different techniques for building a supervised, multi-language keyphrase extraction pipeline for languages which lack a gold standard. Starting from an unsupervised English keyphrase extraction pipeline, we implement pipelines for Arabic, Italian, Portuguese, and Romanian, and we build test collections for the languages which lack one. Then, we add a Machine Learning module trained on a well-known English-language corpus and we evaluate the performance not only on English but on the other languages as well. Finally, we repeat the same evaluation after training the pipeline on an Arabic-language corpus to check whether using a language-specific corpus brings a further improvement in performance. On the five languages we analyzed, the results show an improvement in performance when using a machine learning algorithm, even when that algorithm is not trained and tested on the same language.
2008
Automatic keyword extraction is now a mature language technology. It enables the annotation of large amounts of documents for content gathering, indexing, searching, and identification in general. The reliability of results when processing documents in a multilingual environment, however, is still a challenge, particularly when documents are not limited to one specific semantic domain. The use of multi-term descriptors seems to be a good means of identifying the content. According to our previous evaluations (Panunzi et al. 2006a, 2006b), the availability of multi-term keywords increases performance by a relative factor of 100% with respect to mono-term keywords. The LABLITA tool presented in this demo now also works in a multilingual environment. The demo calculates on the fly the number of mono-term and multiword keywords of parallel documents in English, Italian, German, French and Spanish, and will allow the audience to judge: a) the enhancement brought by multiword keywords...
2005
Support Vector Machines have been applied to text classification with great success. In this paper, we apply and evaluate the impact of using part-of-speech tags (nouns, proper nouns, adjectives and verbs) as a feature selection procedure in a European Portuguese written dataset—the Portuguese Attorney General's Office documents.
2010
The subject of how to identify keywords in arbitrary texts lies at the heart of many important applications such as document retrieval, bibliographic databases, and search engines. It has received fair interest in the research community since the late 1960s, and most approaches to date have relied mainly on frequency analysis and word association. In this project, we present a different approach to this objective using machine learning algorithms, namely support vector machines (SVM) and GDA, both with uncertain-label classification. While this approach uses minimal linguistic tools and relies instead on purely numeric features, which might at first seem at odds with the original objective, the motivation behind adopting these machine learning algorithms is driven by similar successful applications in various other disciplines, such as optical character recognition. Both SVM and GDA achieve very good performance in keyword identification, with GDA usually performing much better than SVM.
2021
Currently, millions of data items are generated daily, and their exploitation and interpretation have become essential in every domain. However, most of this information is in textual format, lacking the structure and organisation of traditional databases, which represents an enormous challenge to overcome. Over the course of time, different approaches to text representation have been proposed, attempting to better capture the semantics of documents. These range from classic information retrieval approaches (like Bag of Words) to newer approaches based on neural networks, such as basic word embeddings, deep learning architectures (LSTMs and CNNs), and contextualized embeddings based on attention mechanisms (Transformers). Unfortunately, most of the available resources supporting those technologies are English-centered. In this work, using an e-mail-based case study, we measure the performance of the three most important machine learning approaches applied to text classification, in order to verif...
Computación y Sistemas
We construct an ensemble method for automatic keyword extraction from single documents. We utilize three different unsupervised automatic keyword extractors in building our ensemble method. These three approaches provide candidate keywords for the ensemble method without using their respective threshold functions. The ensemble method combines these candidate keywords and recomputes their scores after applying pruning heuristics. It then extracts keywords by employing dynamic threshold functions. We analyze the performance of our ensemble method using all parts of the Inspec dataset. Our ensemble method achieved a better overall performance when compared to the automatic keyword extractors that were used in its development, as well as to some recent automatic keyword extraction methods.
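A sketch of the combination step, under the assumption that each base extractor returns a `{candidate: score}` mapping with higher meaning better: scores are min-max normalized, averaged, and cut with a dynamic threshold. The `mean + 0.5 * std` cutoff is a placeholder for the paper's threshold functions.

```python
import statistics

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def ensemble(extractor_outputs):
    combined = {}
    for scores in map(normalize, extractor_outputs):
        for cand, s in scores.items():
            combined[cand] = combined.get(cand, 0.0) + s / len(extractor_outputs)
    mean = statistics.mean(combined.values())
    std = statistics.pstdev(combined.values())
    cut = mean + 0.5 * std                    # dynamic threshold
    return {k: v for k, v in combined.items() if v >= cut}

out = ensemble([{"graph": 3.0, "rank": 1.0}, {"graph": 0.9, "noise": 0.1}])
print(out)  # only candidates above the dynamic cutoff survive
```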
2000
Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with the identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and the categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is language-specific (and possibly domain-specific) and that may also be computationally costly, especially when dealing with large datasets. Existing approaches to this stage are therefore not generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant-word-list construction mechanisms and two final significant-word selection mechanisms, to identify words and/or phrases in a given textual dataset that can be expected to distinguish between classes using simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (and also domain-independent) manner.
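One language-independent statistical route to the significant-word selection discussed above, assuming scikit-learn rather than the thesis's own mechanisms: words are ranked purely by a chi-squared test of association with the class labels, with no linguistic processing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["goal match striker league", "election vote senate policy",
        "coach stadium league goal", "minister policy vote debate"]
labels = ["sport", "politics", "sport", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # raw counts, no linguistics
selector = SelectKBest(chi2, k=4).fit(X, labels)
# Words most strongly associated with the class labels:
print(vec.get_feature_names_out()[selector.get_support()])
```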