Papers by Marilena Di Bari
This paper describes a methodology for supporting the task of annotating sentiment in natural
language by detecting borderline cases and inconsistencies. Inspired by the co-training strategy,
a number of machine learning models are trained on different views of the same data. The predictions
obtained by these models are then automatically compared in order to bring to light highly
uncertain annotations and systematic mistakes. We tested the methodology against an English
corpus annotated according to a fine-grained sentiment analysis annotation schema (SentiML).
We detected that 153 instances (35%) classified differently from the gold standard were acceptable
and further 69 instances (16%) suggested that the gold standard should have been improved.
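
The comparison step described above lends itself to a brief illustration. The sketch below is only a minimal illustration of the general idea, not the paper's implementation: it trains two classifiers on different feature views of the same toy annotated data (the views, models, texts, and labels are assumptions chosen for the example) and flags items where the views disagree (uncertain annotations) or where they agree on a label that differs from the gold standard (possible gold-standard errors).

```python
# Minimal sketch (not the paper's implementation): train classifiers on two
# different feature "views" of the same annotated data, then compare their
# predictions with each other and with the gold labels to flag (a) uncertain
# items where the views disagree and (b) possible gold-standard errors where
# both views agree on a different label.
# The feature views, models, and toy data below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy annotated data: (text span, gold sentiment label)
data = [
    ("a truly wonderful result", "positive"),
    ("an utterly disappointing outcome", "negative"),
    ("not a bad effort at all", "positive"),
    ("hardly a success", "negative"),
]
texts = [t for t, _ in data]
gold = [y for _, y in data]

# Two views of the same data: word unigrams vs. character n-grams.
views = [
    CountVectorizer(ngram_range=(1, 1)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
]

predictions = []
for vectorizer in views:
    X = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000).fit(X, gold)
    predictions.append(model.predict(X))

# Compare predictions across views and against the gold standard.
for i, text in enumerate(texts):
    labels = {p[i] for p in predictions}
    if len(labels) > 1:
        print(f"UNCERTAIN  : {text!r} -> views disagree: {labels}")
    elif labels != {gold[i]}:
        print(f"CHECK GOLD : {text!r} -> views agree on {labels}, gold is {gold[i]!r}")
```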

Sentiment Analysis is the task of automatically identifying whether a text or a single sentence is intended to carry a positive or negative connotation. The commonly used Bag-of-Words approach, which relies on counting positive and negative words whose connotation is indicated by specially crafted sentiment dictionaries, is not ideal because it does not take into account the relations between words and how the connotation of single words changes according to the context. This paper proposes a way of identifying and analysing the targets of the opinions and their modifiers, along with their linkage (appraisal group), through an annotation schema called SentiML. This schema has been developed in order to facilitate the identification of these elements and the annotation of their sentiment, along with advanced linguistic features such as their appraisal type according to the Appraisal Framework. The schema is XML-based and has also been designed to be language-independent. Preliminary re...
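
To make the idea of an appraisal group concrete, the snippet below shows what a SentiML-style annotation might look like and how it could be read programmatically. This is a hypothetical sketch: the element and attribute names are assumptions made for illustration and are not taken from the actual SentiML specification.

```python
# Hypothetical illustration only: the abstract describes SentiML as an XML-based
# schema linking a modifier to its target inside an "appraisal group", with an
# orientation and an Appraisal Framework attitude type. The element and
# attribute names below are assumptions for illustration, not the actual
# SentiML specification.
import xml.etree.ElementTree as ET

sample = """
<sentence id="s1" text="The committee warmly welcomed the proposal.">
  <target id="t1" span="the proposal"/>
  <modifier id="m1" span="warmly welcomed" orientation="positive"/>
  <appraisalGroup id="g1" modifier="m1" target="t1"
                  attitude="affect" orientation="positive"/>
</sentence>
"""

root = ET.fromstring(sample)
for group in root.iter("appraisalGroup"):
    modifier = root.find(f".//modifier[@id='{group.get('modifier')}']")
    target = root.find(f".//target[@id='{group.get('target')}']")
    print(f"{modifier.get('span')!r} -> {target.get('span')!r} "
          f"({group.get('attitude')}, {group.get('orientation')})")
```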
This thesis employs the methods of corpus linguistics to test Anna Wierzbicka's theories about cross-cultural communication, in particular about the importance of some keywords in understanding and interpreting a given reference culture. Specifically, it seeks to explore Wierzbicka's hypothesis that the English word humility and the Russian word smirenie embody significantly different attitudes to life, reflecting the extent to which cultural elements are essential in the process of lexicalization of ethical concepts. This work is based on Stubbs's approach, which uses corpus analysis to find empirical evidence of the importance of some cultural words in English. In particular, I used available web corpora for English and Russian, namely UkWaC and the Russian Web Corpus.
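
A minimal sketch of the kind of quantitative comparison such a corpus study relies on is shown below: normalising raw keyword counts to frequencies per million words so that corpora of very different sizes can be compared. The counts and corpus sizes are placeholders chosen for illustration, not figures from the thesis.

```python
# Minimal sketch of a standard corpus-linguistics comparison: relative keyword
# frequency per million tokens, so that counts from corpora of different sizes
# (e.g. UkWaC vs. the Russian Web Corpus) are comparable.
# The counts and corpus sizes below are placeholders, not the thesis's figures.
def per_million(raw_count: int, corpus_size: int) -> float:
    """Relative frequency of a keyword, per million corpus tokens."""
    return raw_count / corpus_size * 1_000_000

# Placeholder numbers for illustration only.
corpora = {
    "humility (UkWaC)": {"count": 4_200, "size": 2_000_000_000},
    "smirenie (Russian Web Corpus)": {"count": 6_500, "size": 1_500_000_000},
}

for keyword, stats in corpora.items():
    freq = per_million(stats["count"], stats["size"])
    print(f"{keyword}: {freq:.2f} occurrences per million words")
```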