Papers by Mijail Kabadjov
What kind of problems do protein interactions raise for anaphora resolution? A preliminary analysis
Biocomputing 2008 - Proceedings of the Pacific Symposium, 2008
Although text mining shows considerable promise as a tool for supporting the curation of biomedical text, there is little concrete evidence as to its effectiveness. We report on three experiments measuring the extent to which curation can be sped up with assistance from Natural Language Processing (NLP), together with subjective feedback from curators on the usability of a curation tool that integrates NLP hypotheses for protein-protein interactions (PPIs). In our curation scenario, we found that a maximum speed-up of 1/3 in curation time can be expected if NLP output is perfectly accurate. The preference of one curator for consistent NLP output and output with high recall needs to be confirmed in a larger study with several curators.
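To make the reported ceiling concrete, here is a back-of-envelope sketch of assisted curation time as a linear function of NLP accuracy. The model, function name and timings are illustrative assumptions, not figures from the paper.

```python
def assisted_time(t_unassisted: float, f_automatable: float, p_accuracy: float) -> float:
    """Expected curation time under a simple linear assistance model:
    a fraction f_automatable of the work is replaced by NLP hypotheses
    that are correct with probability p_accuracy."""
    return t_unassisted * (1.0 - f_automatable * p_accuracy)

# With perfect NLP (p = 1.0) and a third of the work automatable,
# curation takes 2/3 of the baseline time -- the reported maximum speed-up.
print(assisted_time(60.0, 1 / 3, 1.0))  # 40.0 minutes instead of 60
print(assisted_time(60.0, 1 / 3, 0.8))  # 44.0 minutes at 80% accuracy
```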

Genome Biology, 2008
The tasks in BioCreative II were designed to approximate some of the laborious work involved in curating biomedical research papers. The approach to these tasks taken by the University of Edinburgh team was to adapt and extend the existing natural language processing (NLP) system that we have developed as part of a commercial curation assistant. Although this paper concentrates on using NLP to assist with curation, the system can equally be employed to extract types of information from the literature that are immediately relevant to biologists in general. Our system was among the highest performing on the interaction subtasks, and competitive performance on the gene mention task was achieved with minimal development effort. For the gene normalization task, a string matching technique that can be quickly applied to new domains was shown to perform close to average. The technologies being developed were shown to be readily adapted to the BioCreative II tasks. Although high performance may be obtained on individual tasks such as gene mention recognition and normalization, and document classification, tasks in which a number of components must be combined, such as detection and normalization of interacting protein pairs, are still challenging for NLP systems.
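The gene normalization approach above is described only as a quickly portable string matching technique; the sketch below shows one plausible reading of it, using fuzzy matching over a synonym lexicon. The lexicon entries, identifiers and cutoff are invented for illustration, not the team's actual resources.

```python
import difflib

# Toy lexicon mapping gene names/synonyms to identifiers; the real task
# uses EntrezGene-scale resources. Entries here are illustrative only.
GENE_LEXICON = {
    "brca1": "672",
    "breast cancer 1": "672",
    "tp53": "7157",
    "tumor protein p53": "7157",
}

def normalize_gene(mention: str, cutoff: float = 0.8) -> str | None:
    """Map a textual gene mention to an identifier by string matching."""
    key = mention.strip().lower()
    if key in GENE_LEXICON:  # exact match first
        return GENE_LEXICON[key]
    close = difflib.get_close_matches(key, GENE_LEXICON.keys(), n=1, cutoff=cutoff)
    return GENE_LEXICON[close[0]] if close else None

print(normalize_gene("BRCA1"))               # "672" via exact lookup
print(normalize_gene("tumour protein p53"))  # "7157" via fuzzy match
```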
Proceedings of LREC, 2008
We report on two large corpora of semantically annotated full-text biomedical research papers created in order to develop information extraction (IE) tools for the TXM project. Both corpora have been annotated with a range of entities (CellLine, Complex, DevelopmentalStage, Disease, DrugCompound, ExperimentalMethod, Fragment, Fusion, GOMOP, Gene, Modification, mRNAcDNA, Mutant, Protein, Tissue), normalisations of selected entities to the NCBI Taxonomy, RefSeq, EntrezGene, ChEBI and MeSH, and enriched relations (protein-protein interactions, tissue expressions and fragment- or mutant-protein relations). While one corpus targets protein-protein interactions (PPIs), the focus of the other is on tissue expressions (TEs). This paper describes the selected markables and the annotation process of the ITI TXM corpora, and provides a detailed breakdown of the inter-annotator agreement (IAA).
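As a hedged illustration of one common way to score span-level agreement (where chance-corrected coefficients are awkward to define over entity spans), the sketch below computes IAA as F1 between two annotators, treating one as reference. The data and the scoring choice are assumptions for illustration, not the paper's reported methodology.

```python
def iaa_f1(spans_a: set[tuple[int, int, str]],
           spans_b: set[tuple[int, int, str]]) -> float:
    """F1 agreement between two annotators' (start, end, type) entity spans,
    treating annotator A as the reference."""
    if not spans_a and not spans_b:
        return 1.0
    overlap = len(spans_a & spans_b)
    precision = overlap / len(spans_b) if spans_b else 0.0
    recall = overlap / len(spans_a) if spans_a else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

ann1 = {(0, 5, "Protein"), (10, 18, "Tissue")}
ann2 = {(0, 5, "Protein"), (10, 18, "Disease")}  # type disagreement on one span
print(round(iaa_f1(ann1, ann2), 2))  # 0.5
```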
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, Jun 24, 2011
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, ACL-HLT 2011, pages 28–36, 24 June 2011, Portland, Oregon, USA. ©2011 Association for Computational Linguistics. Creating Sentiment Dictionaries via Triangulation. Josef ...
In this paper we present an approach to large-scale coreference resolution for an ample set of human languages, with a particular emphasis on time performance and precision. One of the distinctive features of our approach is the use of a mature multilingual named entity repository (persons and organizations) gradually compiled over the past few years. Our experiments show promising results: an overall precision of 94% tested on seven different languages. We also present an extrinsic evaluation on seven languages in the context of summarization, where we gauge the contribution of the coreference resolver to the final summarization performance.
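As a minimal sketch of the repository-driven idea, the snippet below clusters mentions by looking each one up in a table of known name variants; unknown mentions are left unresolved, which mirrors the precision-first design. The repository contents are invented stand-ins, not the actual resource described above.

```python
from collections import defaultdict

# Hypothetical miniature of a multilingual name repository: each entity id
# lists known surface variants across languages.
NAME_REPOSITORY = {
    "e1": {"angela merkel", "merkel", "ангела меркель"},
    "e2": {"european commission", "comisión europea"},
}

def resolve_mentions(mentions: list[str]) -> dict[str, list[str]]:
    """Group mentions into coreference chains via repository lookup."""
    variant_to_id = {v: eid for eid, vs in NAME_REPOSITORY.items() for v in vs}
    chains: dict[str, list[str]] = defaultdict(list)
    for m in mentions:
        eid = variant_to_id.get(m.lower())
        if eid is not None:        # unknown mentions are skipped, which
            chains[eid].append(m)  # favours precision over recall
    return dict(chains)

print(resolve_mentions(["Angela Merkel", "Merkel", "European Commission"]))
# {'e1': ['Angela Merkel', 'Merkel'], 'e2': ['European Commission']}
```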

Lrec, 2010
Recent years have brought a significant growth in the volume of research in sentiment analysis, mostly on highly subjective text types (movie or product reviews). The main difference between these texts and news articles is that their target is clearly defined and unique across the text. Following different annotation efforts and the analysis of the issues encountered, we realised that news opinion mining is different from that of other text types. We identified three subtasks that need to be addressed: definition of the target; separation of the good and bad news content from the good and bad sentiment expressed on the target; and analysis of clearly marked opinion that is expressed explicitly, not needing interpretation or the use of world knowledge. Furthermore, we distinguish three different possible views on newspaper articles (author, reader and text), which have to be addressed differently at the time of analysing sentiment. Given these definitions, we present work on mining opinions about entities in English-language news, in which (a) we test the relative suitability of various sentiment dictionaries and (b) we attempt to separate positive or negative opinion from good or bad news. In the experiments described here, we tested whether or not subject domain-defining vocabulary should be ignored. Results showed that ignoring such vocabulary is appropriate in the context of news opinion mining, and that the approaches taking this into consideration produce better performance.
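A minimal sketch of the domain-vocabulary idea tested in (b): score a sentence with a polarity lexicon but skip words that define the subject domain, so that bad news content does not register as negative opinion about the target. Both word lists are invented stand-ins for the dictionaries evaluated in the paper.

```python
# Illustrative lexicons only -- not the dictionaries evaluated in the paper.
SENTIMENT = {"excellent": 1, "praised": 1, "failed": -1, "crisis": -1}
DOMAIN_VOCAB = {"war", "crisis", "earthquake", "casualties"}

def opinion_score(tokens: list[str]) -> int:
    """Lexicon-based polarity score, ignoring domain-defining vocabulary."""
    return sum(SENTIMENT.get(t, 0) for t in tokens if t not in DOMAIN_VOCAB)

sent = "critics praised the response to the earthquake crisis".split()
print(opinion_score(sent))  # 1, not 0: "crisis" is in the sentiment lexicon
                            # but is filtered out as domain vocabulary
```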

Language Resources and Evaluation, 2011
The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, allowing plugging in a new language by providing the language-specific resources for that language. We thus describe the type of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications pursued in our efforts include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, as well as the identification of reported speech quotations by and about people.
We describe a multilingual methodology for adapting an event extraction system to new languages. The methodology is based on highly multilingual domain-specific grammars and exploits weakly supervised machine learning algorithms for lexical acquisition. We adapted an already existing event extraction system for the domain of conflicts and crises to Portuguese and Spanish. The results are encouraging and demonstrate the effectiveness of our approach.
The main focus of this work is to investigate robust ways of generating summaries from summary representations without resorting to simple sentence extraction, aiming at more human-like summaries. This is motivated by empirical evidence from TAC 2009 data showing that human summaries contain on average more, and shorter, sentences than the system summaries. We report encouraging preliminary results comparable to those attained by participating systems at TAC 2009.
Lecture Notes in Computer Science, 2007
Lecture Notes in Computer Science, 2011
In this paper we address the question of whether "very positive" or "very negative" sentences from the perspective of sentiment analysis are "good" summary sentences from the perspective of text summarisation. We operationalise the concepts of very positive and very negative sentences by using the output of a sentiment analyser, and evaluate how good a sentence is for summarisation by making use of standard text summarisation metrics and a corpus annotated for both salience and sentiment. In addition, we design and execute a statistical test to evaluate the aforementioned hypothesis. We conclude that the hypothesis does not hold, at least not based on our corpus data, and argue that summarising sentiment and summarising text are two different tasks which should be treated separately.
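The abstract does not spell out the test; as one hedged reconstruction, the sketch below compares the sentiment extremity of salient and non-salient sentences with a Mann-Whitney U test. The scores are invented, and the paper's actual test and figures may differ.

```python
from scipy.stats import mannwhitneyu

# Invented |polarity| scores for sentences annotated as summary-worthy
# (salient) versus the rest of the corpus.
extremity_salient = [0.9, 0.1, 0.4, 0.2, 0.7]
extremity_other = [0.8, 0.9, 0.3, 0.6, 0.95]

# One-sided test: are salient sentences more sentiment-extreme?
stat, p = mannwhitneyu(extremity_salient, extremity_other, alternative="greater")
print(f"U={stat}, p={p:.3f}")  # a large p would be consistent with the
                               # conclusion that the hypothesis does not hold
```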
2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009
Opinion mining is the task of extracting from a set of documents opinions expressed by a source on a specified target. This article presents a comparative study on the methods and resources that can be employed for mining opinions from quotations (reported speech) in newspaper articles. We show the difficulty of this task, motivated by the presence of different possible targets and the large variety of affect phenomena that quotes contain. We evaluate our approaches using annotated quotations extracted from news provided by the EMM news gathering engine. We conclude that a generic opinion mining system requires both large lexicons and specialised training and testing data.
2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009
In this paper we present a generic approach for summarising multilingual news clusters such as the ones produced by the Europe Media Monitor (EMM) system. It is generic because it uses robust statistical techniques to perform the summarisation step, and its multilinguality is inherited from the multilingual entity disambiguation system used to build the source representation. We ran preliminary experiments with the TAC 2008 data, an English corpus for summarisation research, and we obtained promising improvements over a summarisation system ranked in the top 20% at the TAC 2008 competition.
Lecture Notes in Computer Science, 2010
Lecture Notes in Computer Science, 2010
... for comparability reasons with other systems to produce summaries of a certain length, it is possible to first select sentences and to fill the remaining summary space with a relatively high-ranking summary sentence. ... Multilingual (Multi-document) Summarisation Evaluation ...
Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing - HLT '05, 2005
We propose an approach to summarization exploiting both lexical information and the output of an automatic anaphoric resolver, and using Singular Value Decomposition (SVD) to identify the main terms. We demonstrate that adding anaphoric information results in significant performance improvements over a previously developed system, in which only lexical terms are used as the input to SVD. However, we also show that how anaphoric information is used is crucial: whereas using this information to add new terms does result in improved performance, simple substitution makes the performance worse.
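As a minimal sketch of the "addition" strategy that worked, the snippet below runs an LSA-style selection (pick the strongest sentence per top singular vector) over a term-by-sentence matrix in which each anaphoric chain contributes one extra term row. The matrix values and the selection heuristic are simplified illustrations of this family of methods, not the paper's exact system.

```python
import numpy as np

def select_sentences(term_sentence: np.ndarray, k: int, n_summary: int) -> list[int]:
    """Pick the highest-weighted sentence in each of the top-k SVD topics."""
    _, _, vt = np.linalg.svd(term_sentence, full_matrices=False)
    picked: list[int] = []
    for topic in range(min(k, vt.shape[0])):
        for idx in np.argsort(-np.abs(vt[topic])):  # sentences by topic weight
            if int(idx) not in picked:
                picked.append(int(idx))
                break
        if len(picked) >= n_summary:
            break
    return picked

# Rows: lexical terms plus one extra row per anaphoric chain ("addition");
# columns: sentences. Substituting anaphors instead would merge rows.
matrix = np.array([[2.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 2.0],
                   [1.0, 1.0, 0.0, 0.0],   # lexical terms
                   [1.0, 0.0, 1.0, 1.0]])  # an anaphoric chain as a new term
print(select_sentences(matrix, k=2, n_summary=2))
```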

Journal of Intelligent Information Systems, 2012
The present is marked by the influence of the Social Web on societies and people worldwide. In this context, users generate large amounts of data, much of it containing opinion, which has been proven useful for many real-world applications. In order to extract knowledge from user-generated content, automatic methods must be developed. In this paper, we present different approaches to multi-document summarization of opinion from blogs and reviews. We apply these approaches to: (a) identify positive and negative opinions in blog threads in order to produce a list of arguments in favor of and against a given topic; and (b) summarize the opinion expressed in reviews. Subsequently, we evaluate the proposed methods on two distinct datasets and analyze the quality of the obtained results, as well as discuss the errors produced. Although much remains to be done, the approaches we propose obtain encouraging results and point to clear directions in which further improvements can be made.
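For part (a), a minimal sketch of turning scored blog posts into a for/against list is given below; the lexicon and the scoring function are invented placeholders for the paper's opinion classifiers.

```python
# Illustrative polarity lexicon -- a stand-in for the paper's classifiers.
LEXICON = {"great": 1, "love": 1, "terrible": -1, "waste": -1}

def score(text: str) -> int:
    """Toy lexicon-based opinion score for a post."""
    return sum(LEXICON.get(t, 0) for t in text.lower().split())

def pros_and_cons(posts: list[str]) -> tuple[list[str], list[str]]:
    """Split posts into arguments for and against, strongest first."""
    scored = [(score(p), p) for p in posts]
    pros = [p for s, p in sorted(scored, key=lambda x: -x[0]) if s > 0]
    cons = [p for s, p in sorted(scored, key=lambda x: x[0]) if s < 0]
    return pros, cons

posts = ["I love this idea", "A terrible waste of money", "Great initiative"]
pros, cons = pros_and_cons(posts)
print(pros)  # ['I love this idea', 'Great initiative']
print(cons)  # ['A terrible waste of money']
```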