Papers by Simon Clematide
The detection of mentions of protein-protein interactions in the scientific literature has recently emerged as a core task in biomedical text mining. We present effective techniques for this task, which have been developed using the IntAct database as a gold standard, and have been evaluated in two text mining competitions.
For our participation in the CDR task of BioCreative 5, we have adapted the OntoGene system and optimized it for disease recognition (DNER Task) and identification of chemical-disease relationships (CID Task). For the DNER Task we have experimented with different changes to the term matching system. We describe the effects of an abbreviation detection tool as well as a selection of rules for term normalization.
We describe a biological event detection method implemented for the Genia Event Extraction task of BioNLP 2013. The method relies on syntactic dependency relations provided by a general NLP pipeline, supported by statistics derived from Maximum Entropy models for candidate trigger words, for potential arguments, and for argument frames.

Most automated procedures used for the analysis of textual data do not apply natural language processing techniques. While these applications usually allow for efficient data collection, most have difficulty achieving sufficient accuracy because of the high complexity and interdependence of semantic concepts used in the social sciences. Manual content analysis approaches sometimes lack accuracy too, but, more importantly, human coding entails a heavy workload for the researcher. To address this high cost problem without running the risk of oversimplification, we suggest a semi-automatic approach. Our application implements an innovative coding method based on computational linguistic techniques, mainly named entity recognition and concept identification. In order to show the potential of this new method, we apply it to an analysis of electoral campaigns. In the first stage of this contribution, we describe how relations between political parties and issues can be reco...

Automatic extraction of biological network information is one of the most desired and most complex tasks in biological text mining. The BioCreative track 4 provides training data and an evaluation environment for the extraction of causal relationships in Biological Expression Language (BEL). BEL is a modeling language that is easily editable by humans or by automatic systems and can express causal relationships of different levels of granularity. Protein-protein relations can be expressed in BEL as well as relations between biological processes and disease stages. To extract BEL information automatically, named entity recognition and normalization to defined name spaces are necessary. Furthermore, relations extracted from text have to be transformed into correct BEL syntax. The track provided training and evaluation for two complementary tasks: given a sentence, extract all BEL statements; and given a BEL statement, propose up to 10 evidence sentences from the literature.

We show how to use large biomedical databases in order to obtain a gold standard for training a machine learning system over a corpus of biomedical text. As an example we use the Comparative Toxicogenomics Database (CTD) and describe by means of a short case study how the obtained data can be applied. We explain how we exploit the structure of the database for compiling training material and a test set. Using a Naive Bayes document classification approach based on words, stem bigrams and MeSH descriptors we achieve a macro-average F-score of 61% on a subset of 8 action terms. This outperforms a baseline system based on a lookup of stemmed keywords by more than 20%. Furthermore, we present directions of future work, taking the described system as a vantage point. Future work will aim towards a weakly supervised system capable of discovering complete biomedical interactions and events.

We describe a recent development of the OntoGene Relation Miner system (OG-RM), aimed at the automatic extraction of drug/gene/disease relationships from the biomedical literature. The OG-RM system was developed originally for the extraction of protein-protein interactions (PPI) from the biomedical literature [1]. The system has been tested in a controlled setting by participation in the PPI tasks of the BioCreative II and BioCreative II.5 competitive evaluations of text mining systems [2,3]. Recently, within the context of the SASEBio project (a collaboration between the OntoGene group at the University of Zurich and the NITAS/TMS group of Novartis Pharma AG), the OntoGene Relation Miner has been extended in order to deal with larger classes of entities and relationships. In particular, terminology for drugs and diseases has been derived from the Pharmacogenomics Knowledge Base (PharmGKB) [4].

We present an approach to the extraction of relations between pharmacogenomics entities like drugs, genes and diseases which is based on syntax and on discourse. Particularly, discourse has not been studied widely for improving Text Mining. We learn syntactic features semi-automatically from lean document-level annotation. We show how a simple Maximum Entropy based machine learning approach helps to estimate the relevance of candidate relations based on dependency-based features found in the syntactic path connecting the involved entities. Maximum Entropy based relevance estimation of candidate pairs conditioned on syntactic features improves relation ranking by 68% relative increase measured by AUCiP/R and by 60% for TAP-k (k=10). We also show that automatically recognizing document-level discourse characteristics to expand and filter acronyms improves term recognition and interaction detection by 12% relative, measured by AUCiP/R and by TAP-k (k=10). Our pilot study uses PharmGKB ...
ODIN is a lightweight graphical interface for literature curation that can be run within a web browser. ODIN has been developed by the OntoGene group (http://www.ontogene.org/) at the University of Zurich, which specializes in biomedical text mining, in particular extraction of domain entities and their relationships from the scientific literature. The quality of their text mining technologies has been evaluated several times through participation in community-organized competitive evaluation challenges, where OntoGene frequently obtained top-ranked results [2].
We describe the automatic harmonization method used for building the English Silver Standard annotation supplied as a data source for the multilingual CLEF-ER named entity recognition challenge. The use of an automatic Silver Standard is designed to remove the need for a costly and time-consuming expert annotation. The final voting threshold of 3 for the harmonization of 6 different annotations from the project partners kept 45% of all available concept centroids. On average, 19% (SD 14%) of the original annotations are removed. 97.8% of the partner annotations that go into the Silver Standard Corpus have exactly the same boundaries as their harmonized representations.
In this paper we describe the architecture of the OntoGene Relation mining pipeline and some of its recent applications. With this research overview paper we intend to contribute to the recently started discussion on standards for information extraction architectures in the biomedical domain. Our approach delivers domain entities mentioned in each input document, as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation.
We give an overview of our approach to the extraction of interactions between pharmacogenomic entities like drugs, genes and diseases and suggest classes of interaction types driven by data from PharmGKB and partly following the top level ontology WordNet and biomedical types from BioNLP. Our text mining approach to the extraction of interactions is based on syntactic analysis. We use syntactic analyses to explore domain events and to suggest a set of interaction labels for the pharmacogenomics domain.

Proceedings of the Workshop on Language Technology for Digital Historical Archives - with a Special Focus on Central-, (South-)Eastern Europe, Middle East and North Africa, 2019
Geotagging historic and cultural texts provides valuable access to heritage data, enabling location-based searching and new geographically related discoveries. In this paper, we describe two distinct approaches to geotagging a variety of fine-grained toponyms in a diachronic corpus of alpine texts. By applying a traditional gazetteer-based approach, aided by a few simple heuristics, we attain strong high-precision annotations. Using the output of this earlier system, we adopt a state-of-the-art neural approach in order to facilitate the detection of new toponyms on the basis of context. Additionally, we present the results of preliminary experiments on integrating a small amount of crowdsourced annotations to improve overall performance of toponym recognition in our heritage corpus.

Database, 2016
Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.
An Incremental Entity-Mention Model for Coreference Resolution with Restrictive Antecedent Accessibility
We introduce an incremental model for coreference resolution that competed in the CoNLL 2011 shared task (open regular). We decided to participate with our baseline model, since it worked well with two other datasets. The benefits of an incremental over a mention-pair architecture are: a drastic reduction of the number of candidate pairs, a means to overcome the problem of underspecified items in pairwise classification and the natural integration of global constraints such as transitivity. We do not apply machine learning; instead, the system uses an empirically derived salience measure based on the dependency labels of the true mentions. Our experiments seem to indicate that such a system already is on par with machine learning approaches.

Journal of the American Medical Informatics Association : JAMIA, Jan 6, 2015
To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best ann...
Journal of Biomedical Semantics, 2012
Background: One of the key pieces of information which biomedical text mining systems are expected to extract from the literature are interactions among different types of biomedical entities (proteins, genes, diseases, drugs, etc.). Several large resources of curated relations between biomedical entities are currently available, such as the Pharmacogenomics Knowledge Base (PharmGKB) or the Comparative Toxicogenomics Database (CTD). Biomedical text mining systems, and in particular those which deal with the extraction of relationships among entities, could make better use of the wealth of already curated material.

Collection-Wide Extraction of Protein-Protein Interactions
Evidence in support of relationships among biomedical entities, such as protein-protein interactions, can be gathered from a multiplicity of sources. The larger the pool of evidence, the more likely a given interaction can be considered to be. In the context of biomedical text mining, this elementary observation can be translated into an approach that seeks to find in the literature all available evidence for a given interaction, and thus provides a reliable means to assign it a likelihood score before delivering the results to an end user. In this paper we present the initial results of an on-going collaborative project between a major pharmaceutical company and an academic group with extensive expertise in biomedical text mining, with the goal of extracting protein-protein interactions from a large pool of supporting papers.

The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages, an observation which holds for seemingly well-resourced, yet still dramatically low-resourced ones such as Spanish, French or German but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corpora and simultaneously acquiring new biomedical terminology for these under-resourced non-English languages on the basis of two types of language resources, namely parallel corpora (i.e. full translation equivalents at the document unit level) and (admittedly deficient) multilingual biomedical terminologies, with English as their anchor language. We automatically annotate these parallel corpora with biomedical named entities by an ensemble of named entity taggers and harmonize non-identical annotations, the outcome of which is a so-called silver standard corpus. We conclude with an empirical assessment of this approach to automatically identify both known and new terms in multilingual corpora.