Papers by Christopher Rawlings

High throughput genomic studies can identify large numbers of potential candidate genes, which must be interpreted and filtered by investigators to select the best ones for further analysis. Prioritization is generally based on evidence that supports the role of a gene product in the biological process being investigated. The two most important bodies of information providing such evidence are bioinformatics databases and the scientific literature. In this paper we present an extension to the Ondex data integration framework that uses text mining techniques over Medline abstracts as a method for accessing both these bodies of evidence in a consistent way. In an example use case, we apply our method to create a knowledge base of Arabidopsis proteins implicated in plant stress response and use various scoring metrics to identify key protein-stress associations. In conclusion, we show that the additional text mining features are able to highlight proteins using the scientific literature that would not have been seen using data integration alone.

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1993
In this paper, two programs are described (CBS1e and CBS2e). These are implemented in the parallel constraint logic programming language ElipSys. These predict protein alpha/beta-sheet and beta-sheet topologies from secondary structure assignments and topological folding rules (constraints). These programs illustrate how recent developments in logic programming environments can be applied to solve large-scale combinatorial problems in molecular biology. We demonstrate that parallel constraint logic programming is able to overcome some of the important limitations of more established logic programming languages i.e. Prolog. This is particularly the case in providing features that enhance the declarative nature of the program and also in addressing directly the problems of scaling-up logic programs to solve scientifically realistic problems. Moreover, we show that for large topological problems CBS1e was approximately 60 times faster than an equivalent Prolog implementation (CBS1) on ...
Lecture Notes in Computer Science, 1986
The logic programming language PROLOG was used to represent and reason about the topology of protein structures. PROLOG descriptions of the relative positions of protein secondary structural features (protein topology) were generated from information in the Brookhaven databank. β-structural motifs (hairpin, meander, Greek key and jelly roll) were then defined using PROLOG rules. The PROLOG program was able to infer the presence of these structures in the PROLOG representation of the protein.
Bioinformatics/computer Applications in The Biosciences, 2003
Summary: Certain types of genetic analysis are simplified by assembling a collection of unrelated individuals, e.g. case-control experiments. If a family study is being performed then it will be necessary to extract subsets of unrelated, available individuals from pedigrees. Our program provides an optimal method for performing this task. Availability: The software is available, free of charge, on
Intelligent Systems in Molecular Biology, 1995
International Conference on Logic Programming/Joint International Conference and Symposium on Logic Programming, 1993
The APPLAUSE (Application and Assessment of Parallel Programming Using Logic) Project is building major applications using the ElipSys parallel constraint logic programming system developed at ECRC (European Computer-Industry Research Centre). APPLAUSE ...
The Computer Journal, 1993
Applications of ElipSys in Molecular Biology. ... biotechnology industry is dependent on a detailed understanding of the structure of proteins and, in particular, how changes in structure influence the function and biological role of the protein in the cell. Many aspects of cancer research ...
Lecture Notes in Computer Science, 1993
... to generate training material to introduce applications developers to ElipSys-like languages; - to assess the advantages (and disadvantages) of ElipSys-like ... International (ESI) and the University of Athens as a demonstrator for the Greek National Tourist Organization. ...
Grid Economics and Business Models, 2007
edutain@grid is an exciting and ground breaking new project making use of Grid technology. The project will identify and define a new class of applications that are highly significant for Grid computing but have not been studied in the past, which we characterise as Real-Time Online Interactive Applications (ROIA). The distinctive features that make ROIA unique include large user concurrency

The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database AraCyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

Network inference utilizes experimental high-throughput data for the reconstruction of molecular interaction networks where new relationships between the network entities can be predicted. Despite the increasing amount of experimental data, the parameters of each modeling technique cannot be optimized based on the experimental data alone, but need to be assessed qualitatively by checking whether the components of the resulting network describe the experimental setting. Candidate list prioritization and validation builds upon data integration and data visualization. The application of tools supporting this procedure is limited to the exploration of smaller information networks because the display and interpretation of large amounts of information is challenging regarding the computational effort and the users' experience. The Ondex software framework was extended with customizable context-sensitive menus which allow additional integration and data analysis options for a selected set of candidates during interactive data exploration. We provide new functionalities for on-the-fly data integration using InterProScan, PubMed Central literature search, and sequence-based homology search. We applied the Ondex system to the integration of publicly available data for Aspergillus nidulans and analyzed transcriptome data. We demonstrate the advantages of our approach by proposing new hypotheses for the functional annotation of specific genes of differentially expressed fungal gene clusters. Our extension of the Ondex framework makes it possible to overcome the separation between data integration and interactive analysis. More specifically, computationally demanding calculations can be performed on selected sub-networks without losing any information from the whole network. Furthermore, our extensions allow for direct access to online biological databases which helps to keep the integrated information up-to-date.

Journal of Molecular Biology, 1995
Gene families are often recognised by sequence homology using similarity searching to find relationships; however, genomic sequence data provides gene architectural information not used by conventional search methods. In particular, intron positions and phases are expected to be relatively conserved features, because mis-splicing and reading frame shifts should be selected against. A fast search technique capable of detecting possible weak sequence homologies apparent at the intron/exon level of gene organization is presented for comparing spliceosomal genes and gene fragments. FINEX compares strings of exons delimited by intron/exon boundary positions and intron phases (exon fingerprint) using a global dynamic programming algorithm with a combined intron phase identity and exon size dissimilarity score. Exon fingerprints are typically two orders of magnitude smaller than their nucleic acid sequence counterparts giving rise to fast search times: a ranked search against a library of 6755 fingerprints for a typical three exon fingerprint completes in under 30 seconds on an ordinary workstation, while a worst case largest fingerprint of 52 exons completes in just over one minute. The short “sequence” length of exon fingerprints in comparisons is compensated for by the large exon alphabet compounded of intron phase types and a wide range of exon sizes, the latter contributing the most information to alignments. FINEX performs better in some searches than conventional methods, finding matches with similar exon organization, but low sequence homology. A search using a human serum albumin finds all members of the multigene family in the FINEX database at the top of the search ranking, despite very low amino acid percentage identities between family members. The method should complement conventional sequence searching and alignment techniques, offering a means of identifying otherwise hard to detect homologies where genomic data are available.
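The comparison described in the FINEX abstract above can be illustrated with a minimal sketch. This is not the published FINEX scoring scheme: the `(length, phase)` tuple representation, the `pair_score` formula, the `phase_bonus` and `gap` values, and both function names are assumptions chosen only to show a global dynamic programming alignment that combines intron phase identity with exon size dissimilarity.

```python
from typing import List, Tuple

# Assumed representation: (exon length in nucleotides, phase of the flanking intron: 0, 1 or 2)
Exon = Tuple[int, int]


def pair_score(a: Exon, b: Exon, phase_bonus: float = 1.0) -> float:
    """Hypothetical combined score: reward identical intron phase,
    penalise relative difference in exon size (exon lengths must be positive)."""
    len_a, phase_a = a
    len_b, phase_b = b
    size_dissimilarity = abs(len_a - len_b) / max(len_a, len_b)
    return (phase_bonus if phase_a == phase_b else 0.0) - size_dissimilarity


def align_fingerprints(x: List[Exon], y: List[Exon], gap: float = -1.0) -> float:
    """Global (Needleman-Wunsch style) alignment score of two exon fingerprints,
    using an assumed linear gap penalty."""
    rows, cols = len(x) + 1, len(y) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per exon.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),  # align two exons
                dp[i - 1][j] + gap,                                  # gap in y
                dp[i][j - 1] + gap,                                  # gap in x
            )
    return dp[-1][-1]
```

A ranked search of the kind the abstract mentions would then sort a library of fingerprints by `align_fingerprints(query, entry)` in descending order. Because each fingerprint has only a handful of elements, the quadratic table stays small, which is consistent with the fast search times reported above.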
Lecture Notes in Computer Science, 2008
Grid infrastructures are maturing to a point where they are attracting the interest of businesses in many application domains. While many large-scale on-line gaming platforms exist, they fail to take into consideration the potential business to business relationships when it comes to dynamic on-line game hosting. This work presents an initial implementation of the edutain@grid architecture to support business value chains identified for on-line gaming and elearning application hosting. An analysis of business actors and value chains is presented briefly before a detailed description of the edutain@grid implementation. We also consider first results concerning how best to construct appropriate value chains using bipartite and bi-directional Service Level Agreements.
Lecture Notes in Computer Science, 2007
edutain@grid is an exciting and ground breaking new project making use of Grid technology. The project will identify and define a new class of applications that are highly significant for Grid computing but have not been studied in the past, which we characterise ...

Nucleic Acids Research, 2006
To utilize effectively the growing number of verified genes that mediate an organism's ability to cause disease and/or to trigger host responses, we have developed PHI-base. This is a web-accessible database that currently catalogs 405 experimentally verified pathogenicity, virulence and effector genes from 54 fungal and Oomycete pathogens, of which 176 are from animal pathogens, 227 from plant pathogens and 3 from pathogens with a fungal host. PHI-base is the first on-line resource devoted to the identification and presentation of information on fungal and Oomycete pathogenicity genes and their host interactions. As such, PHI-base is a valuable resource for the discovery of candidate targets in medically and agronomically important fungal and Oomycete pathogens for intervention with synthetic chemistries and natural products. Each entry in PHI-base is curated by domain experts and supported by strong experimental evidence (gene/transcript disruption experiments) as well as literature references in which the experiments are described. Each gene in PHI-base is presented with its nucleotide and deduced amino acid sequence as well as a detailed description of the predicted protein's function during the host infection process. To facilitate data interoperability, we have annotated genes using controlled vocabularies (Gene Ontology terms, Enzyme Commission Numbers and so on), and provide links to other external data sources (e.g. NCBI taxonomy and EMBL). We welcome new data for inclusion in PHI-base, which is freely accessed at www4.rothamsted.bbsrc.ac.uk/phibase/.

Nucleic Acids Research, 2007
The pathogen-host interaction database (PHI-base) is a web-accessible database that catalogues experimentally verified pathogenicity, virulence and effector genes from bacterial, fungal and Oomycete pathogens, which infect human, animal, plant, insect, fish and fungal hosts. Plant endophytes are also included. PHI-base is therefore an invaluable resource for the discovery of genes in medically and agronomically important pathogens, which may be potential targets for chemical intervention. The database is freely accessible to both academic and non-academic users. This publication describes recent additions to the database and both current and future applications. The number of fields that characterize PHI-base entries has almost doubled. Important additional fields deal with new experimental methods, strain information, pathogenicity islands and external references that link the database to external resources, for example, gene ontology terms and Locus IDs. Another important addition is the inclusion of anti-infectives and their target genes that makes it possible to predict the compounds that may interact with newly identified virulence factors. In parallel, the curation process has been improved and now involves several external experts. On the technical side, several new search tools have been provided and the database is also now distributed in XML format. PHI-base is available at: http://www.phi-base.org/.

Nucleic Acids Research, 2007
Wheat biologists face particular problems because of the lack of genomic sequence and the three homoeologous genomes which give rise to three very similar forms for many transcripts. However, over 1.3 million available public-domain Triticeae ESTs (of which ~850 000 are wheat) and the full rice genomic sequence can be used to estimate likely transcript sequences present in any wheat cDNA sample to which PCR primers may then be designed. Wheat Estimated Transcript Server (WhETS) is designed to do this in a convenient form, and to provide information on the number of matching EST and high quality cDNA (hq-cDNA) sequences, tissue distribution and likely intron position inferred from rice. Triticeae EST and hq-cDNA sequences are mapped onto rice loci and stored in a database. The user selects a rice locus (directly or via Arabidopsis) and the matching Triticeae sequences are assembled according to user-defined filter and stringency settings. Assembly is achieved initially with the CAP3 program and then with a single nucleotide polymorphism (SNP)-analysis algorithm designed to separate homoeologues. Alignment of the resulting contigs and singlets against the rice template sequence is then displayed. Sequences and assembly details are available for download in fasta and ace formats, respectively. WhETS is accessible at http://www4.rothamsted.bbsrc.ac.uk/whets.

Journal of Computational Biology, 2001
Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating a NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples. A group of features is derived from this grammar. Other groups of features of NPPs are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules and its performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives. ... materials, methods, and results. Section 6 is the discussion. Appendix A describes the new cost function, relative advantage (RA). Appendix B includes the production rules generated by CProgol. Appendix C includes our best multistrategy predictor of NPPs.