Papers by George Demetriou

European Conference on Principles of Data Mining and Knowledge Discovery, 2011
The results of an experiment are often described in a series of textual statements, the most concise of which is the title of the article. Here we implemented a novel approach, using standard data mining techniques, to collect a set of concise 'factual' statements about a research area. We compare two standard text classification approaches to identifying 'factual' and 'non-factual' sentences in article titles; the first uses a statistical language-modelling approach, and the second a more sophisticated semantic and grammatical approach. We find that the simple approach classifies titles more accurately, achieving 92% overall accuracy compared to 90% for the complex approach. We also implement a strategy to convert the phrasal dependencies in a 'factual' title into subject-predicate-object structures (triples). These triples can then be organised according to a schema provided by domain ontologies, by mapping URIs to entities found in the textual labels.
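A minimal sketch of the general idea of turning a declarative title into a subject-predicate-object triple via a dependency parse; it is not the authors' pipeline, and it assumes spaCy with the 'en_core_web_sm' model installed. The example title is hypothetical.

```python
# Illustrative sketch only: derive a simple subject-predicate-object triple
# from a 'factual' article title using a dependency parse.
# Assumes the spaCy model 'en_core_web_sm' is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def title_to_triple(title: str):
    """Return (subject, predicate, object) from a declarative title, or None."""
    doc = nlp(title)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "pobj")]
            if subjects and objects:
                # Use the full noun-phrase subtrees as the subject and object spans.
                subj = " ".join(t.text for t in subjects[0].subtree)
                obj = " ".join(t.text for t in objects[0].subtree)
                return (subj, token.lemma_, obj)
    return None

# Hypothetical title; the exact spans returned depend on the parser.
print(title_to_triple("Angiotensin II induces fibrosis in the kidney"))
```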

CLEF (Working Notes), 2002
Information Extraction (IE), defined as the activity of extracting structured knowledge from unstructured text sources, offers new opportunities for the exploitation of biological information contained in the vast amounts of scientific literature. But while IE technology has received increasing attention in the area of molecular biology, there have not been many examples of IE systems successfully deployed in end-user applications. We describe the development of PASTAWeb, a WWW-based interface to the extraction output of PASTA, an IE system that extracts protein structure information from MEDLINE abstracts. Key characteristics of PASTAWeb are the seamless integration of the PASTA extraction results (templates) with WWW-based technology, the dynamic generation of WWW content from 'static' data, and the fusion of information extracted from multiple documents.
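A toy sketch of the template-fusion and dynamic page-generation idea described above; the template format, field names and PMIDs are hypothetical placeholders, not PASTA's actual output.

```python
# Illustrative only: merge extraction templates for the same protein produced
# from different abstracts, then render a minimal HTML view of the fused record.
from collections import defaultdict
from html import escape

# Hypothetical per-document extraction templates.
templates = [
    {"protein": "lysozyme", "species": "hen", "doc": "PMID:1"},
    {"protein": "lysozyme", "resolution": "1.5 A", "doc": "PMID:2"},
]

# Fuse per-document templates keyed on the protein name.
fused = defaultdict(dict)
for t in templates:
    fused[t["protein"]].setdefault("docs", []).append(t["doc"])
    for key, value in t.items():
        if key not in ("protein", "doc"):
            fused[t["protein"]][key] = value

# Dynamically generate an HTML fragment from the fused ('static') data.
for protein, slots in fused.items():
    rows = "".join(f"<tr><td>{escape(k)}</td><td>{escape(str(v))}</td></tr>"
                   for k, v in slots.items())
    print(f"<h2>{escape(protein)}</h2><table>{rows}</table>")
```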

Background: Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. This remains a challenging information retrieval task in the realm of Big Data Analytics. Objectives: To investigate the feasibility of using word embeddings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology (CVDO) as a step towards identifying omics information encoded in the biomedical literature. Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. The abstracts of selected papers from the sysVASC systematic review were then manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second enriched/expanded the annotated terms with terms from the human-readable labels of key classes (genes/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase (UniProtKB) protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that enriching the original annotated terms provides better context and therefore makes it easier to obtain suitable (full and/or partial) term variants for gene/protein names from the word embeddings. Results: From the manually annotated papers, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from the word embeddings for each query term present in the lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as TV (term variant), PTV (partial term variant), or NTV (non-term variant, meaning neither of the previous two categories). In the first experiment, Skip-gram found twice as many (full and/or partial) term variants for gene/protein names as CBOW. Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for two domain experts was 0.80 for the first experiment and 0.74 for the second. In the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms; in the second, the number increased to 100. This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus of more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of methods that scale when dealing with big data and enable the automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.
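A minimal sketch of the core step described above: training Skip-gram embeddings and retrieving the 12 top-ranked terms by cosine similarity for a query term. It is not the study's actual setup; the corpus, query term and hyperparameters here are placeholders, and the real input would be millions of tokenised PubMed titles and abstracts.

```python
# Illustrative sketch using gensim; sg=1 selects Skip-gram, sg=0 would select CBOW.
from gensim.models import Word2Vec

corpus = [
    ["angiotensinogen", "is", "cleaved", "by", "renin"],
    ["agt", "variants", "are", "associated", "with", "hypertension"],
    # ... in the real setting, tokenised sentences from >14M PubMed citations
]

model = Word2Vec(corpus, vector_size=200, window=5, sg=1, min_count=1, epochs=5)

# Top-12 candidate term variants for a query term, ranked by cosine similarity.
query = "agt"  # hypothetical query term
if query in model.wv:
    for term, score in model.wv.most_similar(query, topn=12):
        print(f"{term}\t{score:.3f}")
```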

5th European Conference on Speech Communication and Technology (Eurospeech 1997)
This paper presents a study on the use of wide-coverage semantic knowledge for large vocabulary (theoretically unrestricted) domain-independent speech recognition. A machine-readable dictionary was used to provide the semantic information about the words, and a semantic model was developed based on the conceptual association between words as computed directly from the textual representations of their meanings. The findings of our research suggest that the model is capable of capturing phenomena of semantic associativity or connectivity between words in texts and of considerably reducing the semantic ambiguity in natural language. The model can cover both short- and long-distance semantic relationships between words and has shown signs of robustness across various text genres. Experiments with simulated speech recognition hypotheses indicate that the model can be used efficiently to reduce word error rates when applied to word lattices or N-best sentence hypotheses.
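A toy sketch of the general rescoring idea, not the paper's exact model: each N-best hypothesis is scored by the semantic association of its words, estimated here from overlap between dictionary definitions, and the list is reranked. The mini-dictionary, scores and combination weight are illustrative placeholders.

```python
# Illustrative N-best rescoring with a definition-overlap association measure.
definitions = {
    "bank":  {"institution", "money", "deposit", "loan"},
    "loan":  {"money", "borrow", "bank", "interest"},
    "river": {"water", "stream", "flow", "land"},
}

def association(w1: str, w2: str) -> float:
    """Definition-overlap relatedness in [0, 1] (Jaccard on definition words)."""
    d1, d2 = definitions.get(w1, set()), definitions.get(w2, set())
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / len(d1 | d2)

def semantic_score(words) -> float:
    """Average pairwise association over all word pairs in a hypothesis."""
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    return sum(association(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Hypothetical N-best list: (hypothesis words, recogniser log-score).
nbest = [(["bank", "loan"], -12.0), (["bank", "river"], -11.5)]
weight = 5.0  # tuning parameter balancing recogniser and semantic scores
reranked = sorted(nbest, key=lambda h: h[1] + weight * semantic_score(h[0]), reverse=True)
print(reranked[0][0])  # hypothesis preferred after semantic rescoring
```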

Department of Computer Science
A comparison of semantic tagging with syntactic Part-of-Speech tagging leads us to propose that a domain-independent semantic tagger for English corpora should not aim to annotate each word with an atomic 'sem-tag', but instead that a semantic tagging should attach to each word a set of semantic primitive attributes or features. These features should include: lemma or root, grouping together inflected and derived forms of the same lexical item; broad subject categories where applicable;
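A minimal sketch of the idea of attaching a feature set (lemma plus broad categories) to each word rather than a single atomic tag. It uses WordNet as a stand-in lexical resource, not the tagset proposed above, and assumes nltk with the 'wordnet' data downloaded.

```python
# Illustrative word-level feature bundle: lemma + coarse WordNet categories.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def semantic_features(word: str) -> dict:
    """Attach a lemma and broad subject categories to a word."""
    lemma = lemmatizer.lemmatize(word.lower())
    synsets = wn.synsets(lemma)
    # lexname() gives a coarse category such as 'noun.body' or 'verb.motion'.
    categories = sorted({s.lexname() for s in synsets})
    return {"word": word, "lemma": lemma, "categories": categories}

print(semantic_features("kidneys"))
# e.g. {'word': 'kidneys', 'lemma': 'kidney', 'categories': ['noun.body']}
```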

Traditional distributional semantic models (DSMs) like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) derive representations for words on the assumption that words occurring in similar contexts will have similar representations. Deep Learning has made feasible the derivation of word embeddings (i.e. distributed word representations) from corpora of billions of words by applying neural language models like CBOW and Skip-gram. The application of Deep Learning to aid ontology development remains largely unexplored. This study investigates the performance of LSA, LDA, CBOW and Skip-gram for ontology learning tasks. We conducted six experiments, first using 300K and later 14M PubMed titles and abstracts, to obtain top-ranked candidate terms related to the patient safety domain. Based on the evaluation performed, we conclude that Deep Learning can contribute to ontology engineering from the biomedical literature.
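An illustrative sketch, not the experiments above: deriving top-ranked candidate terms from a toy corpus with LSA and LDA via gensim. The neural models (CBOW/Skip-gram) would be trained as in the Word2Vec sketch earlier; the documents and topic counts here are placeholders.

```python
# Illustrative LSA/LDA term ranking on a toy bag-of-words corpus.
from gensim import corpora
from gensim.models import LsiModel, LdaModel

docs = [
    ["sepsis", "patient", "infection", "antibiotic"],
    ["patient", "safety", "incident", "report"],
    ["infection", "sepsis", "mortality", "icu"],
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lsa = LsiModel(bow, id2word=dictionary, num_topics=2)
lda = LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

# Top-ranked terms per topic serve as candidate domain terms.
print("LSA:", lsa.show_topic(0, topn=5))
print("LDA:", lda.show_topic(0, topn=5))
```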
In calculating both EMPathIE and PASTA terminology results we have used a weak criterion of correctness, whereby a response is correct if its type matches the type of the answer key and its text extent matches a substring of the key's extent. Insisting on the stronger matching criterion of strict string identity lowers recall and precision scores by approximately 4% overall.
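A small sketch of the two scoring criteria described above (weak substring match versus strict string identity); the example answer key and response are hypothetical.

```python
# Illustrative weak vs. strict matching of a system response against an answer key.
def weak_match(response: dict, key: dict) -> bool:
    """Correct if types match and the response text is a substring of the key's extent."""
    return response["type"] == key["type"] and response["text"] in key["text"]

def strict_match(response: dict, key: dict) -> bool:
    """Correct only if types match and the strings are identical."""
    return response["type"] == key["type"] and response["text"] == key["text"]

key = {"type": "PROTEIN", "text": "human serum albumin"}
response = {"type": "PROTEIN", "text": "serum albumin"}
print(weak_match(response, key), strict_match(response, key))  # True False
```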

Studies in health technology and informatics, 2017
We investigate the application of distributional semantics models for facilitating the unsupervised extraction of biomedical terms from unannotated corpora. Term extraction is used as the first step of an ontology learning process that aims at the (semi-)automatic annotation of biomedical concepts and relations from more than 300K PubMed titles and abstracts. We experimented with traditional distributional semantics methods, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), as well as the neural language models CBOW and Skip-gram from Deep Learning. The evaluation conducted concentrates on sepsis, a major life-threatening condition, and shows that the Deep Learning models outperform LSA and LDA with much higher precision.
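A sketch of the kind of precision comparison described above, with entirely hypothetical data: the precision of each model's top-ranked candidate terms is computed against a gold list of sepsis-related terms judged relevant by domain experts.

```python
# Illustrative precision@k evaluation of candidate term lists (hypothetical data).
gold = {"sepsis", "septic shock", "bacteremia", "infection", "lactate"}

candidates = {
    "Skip-gram": ["sepsis", "septic shock", "lactate", "fever", "icu"],
    "LSA":       ["sepsis", "patient", "hospital", "infection", "study"],
}

for model, terms in candidates.items():
    precision = sum(t in gold for t in terms) / len(terms)
    print(f"{model}: precision@{len(terms)} = {precision:.2f}")
```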

Motivation: There are many existing resources that integrate data between databases; they do this either semantically, by the use of RDF and triplestores (e.g. Bio2RDF), or with web links and ID mapping services (e.g. PICR, eUtils). Results declared in the literature are, however, only rarely interlinked with existing databases and even more rarely interlinked with each other. We describe a method to take factual statements reported in the literature and turn them into semantic networks of RDF triples. We use a method based on finding titles of papers that contain positive, direct statements about the outcome of a biomedical investigation. We then use dependency parsing and an ontological perspective to create and combine graphs of knowledge about a domain. Our aim in this work is to collect knowledge from the literature for inclusion in the Kidney and Urinary Pathway Knowledge Base (KUPKB), which will be used in the e-LICO project to illustrate the utility of data-mining methods for...
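A minimal sketch of the final step described above: turning extracted subject-predicate-object statements into RDF triples with rdflib, mapping textual labels to ontology URIs where a mapping exists. The namespace, label-to-URI table and example statement are illustrative placeholders, not the KUPKB schema.

```python
# Illustrative RDF triple construction from extracted statements.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/kup/")  # placeholder namespace
g = Graph()

# Hypothetical label-to-URI mapping provided by a domain ontology.
uri_for = {
    "aquaporin-2": EX["AQP2"],
    "collecting duct": EX["CollectingDuct"],
}

def add_statement(subj: str, pred: str, obj: str) -> None:
    """Add one subject-predicate-object statement to the graph, keeping text labels."""
    s = uri_for.get(subj, EX[subj.replace(" ", "_")])
    o = uri_for.get(obj, EX[obj.replace(" ", "_")])
    p = EX[pred.replace(" ", "_")]
    g.add((s, p, o))
    g.add((s, RDFS.label, Literal(subj)))
    g.add((o, RDFS.label, Literal(obj)))

add_statement("aquaporin-2", "is expressed in", "collecting duct")
print(g.serialize(format="turtle"))
```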