2023, arXiv (Cornell University)
This article presents a sentence-level sentiment dataset for the Croatian news domain. In addition to the 3K annotated texts already present, our dataset contains 14.5K annotated sentence occurrences that have been tagged with 5 classes. We provide baseline scores in addition to the annotation process and inter-annotator agreement.
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper introduces Cro-FiReDa, a sentiment-annotated dataset for Croatian in the domain of movie reviews. The dataset, which contains over 10,000 sentences, has been annotated at the sentence level. In addition to presenting the overall annotation process, we also present benchmark results based on the transformer-based fine-tuning approach.
Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 2023
2016
In this paper we present a Hungarian sentiment corpus manually annotated at aspect level. Our corpus consists of Hungarian opinion texts written about different types of products. The main aim of creating the corpus was to produce an appropriate database providing possibilities for developing text mining software tools. The corpus is a unique Hungarian database: to the best of our knowledge, no digitized Hungarian sentiment corpus that is annotated on the level of fragments and targets has been made so far. In addition, many language elements of the corpus that are relevant from the point of view of sentiment analysis received distinct types of tags in the annotation. In this paper, on the one hand, we present the method of annotation and discuss the difficulties concerning the text annotation process. On the other hand, we provide some quantitative and qualitative data on the corpus. We conclude with a description of the applicability of the corpus.
Language Resources and Evaluation, 2024
Automated sentiment analysis of textual data is one of the central and most challenging tasks in political communication studies. However, the toolkits available are primarily for English texts and require contextual adaptation to produce valid results, especially concerning morphologically rich languages such as Hungarian. This study introduces (1) a new sentiment and emotion annotation framework that uses inductive approaches to identify emotions in the corpus and aggregate these emotions into positive, negative, and mixed sentiment categories, (2) a manually annotated sentiment data set with 5700 political news sentences, and (3) a new Hungarian sentiment dictionary for political text analysis created via word embeddings, whose performance was compared with other available sentiment dictionaries. (4) Because of the limitations of dictionary-based sentiment analysis, we have also applied various machine learning algorithms to analyze our dataset. (5) Last but not least, to move towards state-of-the-art approaches, we have fine-tuned the Hungarian BERT-base model for sentiment analysis. Meanwhile, we have also tested how different pre-processing steps could affect the performance of machine-learning algorithms in the case of Hungarian texts.
2015
This article describes a corpus of news texts in Brazilian Portuguese. News items were collected from four big newswire outlets, segmented into paragraphs, and marked up by a group of four annotators, who had to classify each paragraph according to two dimensions: target entity (that is, the person who is the main subject of the news contained in the paragraph), and the paragraph’s polarity with respect to the target entity. The corpus comprises 131 news items, segmented into 1,447 paragraphs, with 65,675 words in total. Along with the corpus, we have also built a gold standard, where paragraphs are classified according to the opinion of the majority of annotators. This gold standard and annotated corpus are available to the community under a Creative Commons licence.
PLoS ONE, 2020
Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017
Sentiment Analysis is a broad task that involves the analysis of various aspects of natural language text. However, most state-of-the-art approaches investigate each aspect independently, i.e. Subjectivity Classification, Sentiment Polarity Classification, Emotion Recognition, Irony Detection. In this paper we present a Multi-View Sentiment Corpus (MVSC), which comprises 3000 English microblog posts related to the movie domain. Three independent annotators manually labelled MVSC, following a broad annotation schema about different aspects that can be grasped from natural language text coming from social networks. The contribution is therefore a corpus that comprises five different views for each message, i.e. subjective/objective, sentiment polarity, implicit/explicit, irony, emotion. In order to allow a more detailed investigation of human labelling behaviour, we provide the annotations of each human annotator involved.
2013
The availability of annotated data is an important prerequisite for the development of machine learning algorithms for sentiment analysis. However, as manually labeling large datasets is time-consuming and expensive, few datasets are available and most of them represent a small sample of a very narrow domain, e.g. movie reviews or reviews of a certain product type. Additionally, many annotated datasets are available for English texts only. However, the influence of different characteristics of the input dataset on the performance of algorithms for sentiment analysis remains unclear if only training data from one specific domain is available or if specific domains are mixed in the test corpus. We therefore introduce a new dataset for German product reviews of various product types and investigate whether even small variances in this specific domain (different product types) already exhibit different characteristics, e.g. with regard to the difficulty of sentiment annotation. The anno...
2010
Recent years have brought a significant growth in the volume of research in sentiment analysis, mostly on highly subjective text types (movie or product reviews). The main difference these texts have with news articles is that their target is clearly defined and unique across the text. Following different annotation efforts and the analysis of the issues encountered, we realised that news opinion mining is different from that of other text types. We identified three subtasks that need to be addressed: definition of the target; separation of the good and bad news content from the good and bad sentiment expressed on the target; and analysis of clearly marked opinion that is expressed explicitly, not needing interpretation or the use of world knowledge. Furthermore, we distinguish three different possible views on newspaper articles: author, reader and text, which have to be addressed differently at the time of analysing sentiment. Given these definitions, we present work on mining opinions about entities in English language news, in which (a) we test the relative suitability of various sentiment dictionaries and (b) we attempt to separate positive or negative opinion from good or bad news. In the experiments described here, we tested whether or not subject domain-defining vocabulary should be ignored. Results showed that this idea is more appropriate in the context of news opinion mining and that the approaches taking this into consideration produce a better performance.
2015
The applications of plWordNet, a very large wordnet for Polish, do not yet include work on sentiment and emotions. We present a pilot project to annotate plWordNet manually with sentiment polarity values and basic emotion values. We work with lexical units (LUs), plWordNet’s basic building blocks. So far, we have annotated about 30,000 nominal and adjectival LUs. The resulting lexicon is already one of the largest sentiment and emotion resources, in particular among those based on wordnets. We opted for manual annotation to ensure high accuracy, and to provide a reliable starting point for future semi-automated expansion. The paper lists the principal assumptions, outlines the annotation process, and introduces the resulting resource, plWordNetemo. We discuss the selection of the material for the pilot study, show the distribution of annotations across the wordnet, and consider the statistics, including interannotator agreement and the resolution of disagreement.