Christopher Rawlings

Papers by Christopher Rawlings

Enhancing data integration with text analysis to find proteins implicated in plant stress response

by Roxane Legaie and Christopher Rawlings

High throughput genomic studies can identify large numbers of potential candidate genes, which must be interpreted and filtered by investigators to select the best ones for further analysis. Prioritization is generally based on evidence that supports the role of a gene product in the biological process being investigated. The two most important bodies of information providing such evidence are bioinformatics databases and the scientific literature. In this paper we present an extension to the Ondex data integration framework that uses text mining techniques over Medline abstracts as a method for accessing both these bodies of evidence in a consistent way. In an example use case, we apply our method to create a knowledge base of Arabidopsis proteins implicated in plant stress response and use various scoring metrics to identify key protein-stress associations. In conclusion, we show that the additional text mining features are able to highlight proteins using the scientific literature that would not have been seen using data integration alone.

Protein topology prediction through parallel constraint logic programming

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1993

In this paper, two programs are described (CBS1e and CBS2e). These are implemented in the parallel constraint logic programming language ElipSys. These predict protein alpha/beta-sheet and beta-sheet topologies from secondary structure assignments and topological folding rules (constraints). These programs illustrate how recent developments in logic programming environments can be applied to solve large-scale combinatorial problems in molecular biology. We demonstrate that parallel constraint logic programming is able to overcome some of the important limitations of more established logic programming languages i.e. Prolog. This is particularly the case in providing features that enhance the declarative nature of the program and also in addressing directly the problems of scaling-up logic programs to solve scientifically realistic problems. Moreover, we show that for large topological problems CBS1e was approximately 60 times faster than an equivalent Prolog implementation (CBS1) on ...

Using Prolog to represent and reason about protein structure

Lecture Notes in Computer Science, 1986

The logic programming language PROLOG was used to represent and reason about the topology of protein structures. PROLOG descriptions of the relative positions of protein secondary structural features (protein topology) were generated from information in the Brookhaven databank. β-structural motifs (hairpin, meander, Greek key and jelly roll) were then defined using PROLOG rules. The PROLOG program was able to infer the presence of these structures in the PROLOG representation of the protein.

Pedstrip: extracting a maximal subset of available, unrelated individuals from a pedigree

Bioinformatics/computer Applications in The Biosciences, 2003

Summary: Certain types of genetic analysis are simplified by assembling a collection of unrelated individuals, e.g. case-control experiments. If a family study is being performed then it will be necessary to extract subsets of unrelated, available individuals from pedigrees. Our program provides an optimal method for performing this task. Availability: The software is available, free of charge, on

Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, United Kingdom, July 16-19, 1995

Intelligent Systems in Molecular Biology, 1995

APPLAUSE: Applications Using the ElipSys Parallel CLP System

International Conference on Logic Programming/Joint International Conference and Symposium on Logic Programming, 1993

The APPLAUSE (Application and Assessment of Parallel Programming Using Logic) Project is building...

Solving Large Combinatorial Problems in Molecular Biology Using the ElipSys Parallel Constraint Logic Programming System

The Computer Journal, 1993

... biotechnology industry is dependent on a detaile...

APPLAUSE: Application & assessment of parallel programming using logic

Lecture Notes in Computer Science, 1993

... to generate training material to introduce applications developers to ElipSys-like languages; ...

The edutain@grid Project

Grid Economics and Business Models, 2007

edutain@grid is an exciting and ground breaking new project making use of Grid technology. The project will identify and define a new class of applications that are highly significant for Grid computing but have not been studied in the past, which we characterise as Real-Time Online Interactive Applications (ROIA). The distinctive features that make ROIA unique include large user concurrency

Graph-based sequence annotation using a data integration approach

The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database AraCyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

Analysis and visualisation of RDF resources in Ondex

Nature Precedings, 2010

Interactive exploration of integrated biological datasets using context-sensitive workflows

Network inference utilizes experimental high-throughput data for the reconstruction of molecular interaction networks where new relationships between the network entities can be predicted. Despite the increasing amount of experimental data, the parameters of each modeling technique cannot be optimized based on the experimental data alone, but needs to be qualitatively assessed if the components of the resulting network describe the experimental setting. Candidate list prioritization and validation builds upon data integration and data visualization. The application of tools supporting this procedure is limited to the exploration of smaller information networks because the display and interpretation of large amounts of information is challenging regarding the computational effort and the users' experience. The Ondex software framework was extended with customizable context-sensitive menus which allow additional integration and data analysis options for a selected set of candidates during interactive data exploration. We provide new functionalities for on-the-fly data integration using InterProScan, PubMed Central literature search, and sequence-based homology search. We applied the Ondex system to the integration of publicly available data for Aspergillus nidulans and analyzed transcriptome data. We demonstrate the advantages of our approach by proposing new hypotheses for the functional annotation of specific genes of differentially expressed fungal gene clusters. Our extension of the Ondex framework makes it possible to overcome the separation between data integration and interactive analysis. More specifically, computationally demanding calculations can be performed on selected sub-networks without losing any information from the whole network. Furthermore, our extensions allow for direct access to online biological databases which helps to keep the integrated information up-to-date.

Identification and analysis of multigene families by comparison of exon fingerprints

Journal of Molecular Biology, 1995

Gene families are often recognised by sequence homology using similarity searching to find relationships, however, genomic sequence data provides gene architectural information not used by conventional search methods. In particular, intron positions and phases are expected to be relatively conserved features, because mis-splicing and reading frame shifts should be selected against. A fast search technique capable of detecting possible weak sequence homologies apparent at the intron/exon level of gene organization is presented for comparing spliceosomal genes and gene fragments. FINEX compares strings of exons delimited by intron/exon boundary positions and intron phases (exon fingerprint) using a global dynamic programming algorithm with a combined intron phase identity and exon size dissimilarity score. Exon fingerprints are typically two orders of magnitude smaller than their nucleic acid sequence counterparts giving rise to fast search times: a ranked search against a library of 6755 fingerprints for a typical three exon fingerprint completes in under 30 seconds on an ordinary workstation, while a worst case largest fingerprint of 52 exons completes in just over one minute. The short “sequence” length of exon fingerprints in comparisons is compensated for by the large exon alphabet compounded of intron phase types and a wide range of exon sizes, the latter contributing the most information to alignments. FINEX performs better in some searches than conventional methods, finding matches with similar exon organization, but low sequence homology. A search using a human serum albumin finds all members of the multigene family in the FINEX database at the top of the search ranking, despite very low amino acid percentage identities between family members. The method should complement conventional sequence searching and alignment techniques, offering a means of identifying otherwise hard to detect homologies where genomic data are available.
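
The idea of aligning exon fingerprints with a global dynamic programming algorithm can be illustrated with a minimal sketch. This is not the FINEX implementation: the fingerprint representation as (intron phase, exon size) pairs, the function names `exon_score` and `global_align_score`, the exact form of the score, and the values of `phase_bonus`, `size_weight` and `gap` are all assumptions chosen for illustration. The abstract only states that the score combines intron phase identity with exon size dissimilarity.

```python
from typing import List, Tuple

# Hypothetical representation: one exon = (intron phase 0/1/2, exon size in bases).
Exon = Tuple[int, int]


def exon_score(a: Exon, b: Exon,
               phase_bonus: float = 2.0,
               size_weight: float = 1.0) -> float:
    """Assumed score: reward equal intron phase, penalise relative size difference."""
    phase_a, size_a = a
    phase_b, size_b = b
    identity = phase_bonus if phase_a == phase_b else 0.0
    # Relative size difference lies in [0, 1); exon sizes are assumed positive.
    dissimilarity = size_weight * abs(size_a - size_b) / max(size_a, size_b)
    return identity - dissimilarity


def global_align_score(x: List[Exon], y: List[Exon], gap: float = -1.0) -> float:
    """Needleman-Wunsch style global alignment score of two exon fingerprints."""
    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per exon.
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + exon_score(x[i - 1], y[j - 1]),  # match exons
                dp[i - 1][j] + gap,                                  # gap in y
                dp[i][j - 1] + gap,                                  # gap in x
            )
    return dp[n][m]


if __name__ == "__main__":
    query = [(0, 120), (1, 85), (2, 210)]
    print(global_align_score(query, query))
```

Because a fingerprint holds one entry per exon rather than one per nucleotide, the quadratic table stays small, which is consistent with the fast search times the abstract reports.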

Edutain@Grid: A Business Grid Infrastructure for Real-Time On-Line Interactive Applications

Lecture Notes in Computer Science, 2008

Grid infrastructures are maturing to a point where they are attracting the interest of businesses in many application domains. While many large-scale on-line gaming platforms exist, they fail to take into consideration the potential business to business relationships when it comes to dynamic on-line game hosting. This work presents an initial implementation of the edutain@grid architecture to support business value chains identified for on-line gaming and elearning application hosting. An analysis of business actors and value chains is presented briefly before a detailed description of the edutain@grid implementation. We also consider first results concerning how best to construct appropriate value chains using bipartite and bi-directional Service Level Agreements.

The edutain@grid Project

Lecture Notes in Computer Science, 2007

edutain@grid is an exciting and ground breaking new project making use of Grid technolo...

PHI-base: a new database for pathogen host interactions

Nucleic Acids Research, 2006

To utilize effectively the growing number of verified genes that mediate an organism's ability to cause disease and/or to trigger host responses, we have developed PHI-base. This is a web-accessible database that currently catalogs 405 experimentally verified pathogenicity, virulence and effector genes from 54 fungal and Oomycete pathogens, of which 176 are from animal pathogens, 227 from plant pathogens and 3 from pathogens with a fungal host. PHI-base is the first on-line resource devoted to the identification and presentation of information on fungal and Oomycete pathogenicity genes and their host interactions. As such, PHI-base is a valuable resource for the discovery of candidate targets in medically and agronomically important fungal and Oomycete pathogens for intervention with synthetic chemistries and natural products. Each entry in PHI-base is curated by domain experts and supported by strong experimental evidence (gene/transcript disruption experiments) as well as literature references in which the experiments are described. Each gene in PHI-base is presented with its nucleotide and deduced amino acid sequence as well as a detailed description of the predicted protein's function during the host infection process. To facilitate data interoperability, we have annotated genes using controlled vocabularies (Gene Ontology terms, Enzyme Commission Numbers and so on), and provide links to other external data sources (e.g. NCBI taxonomy and EMBL). We welcome new data for inclusion in PHI-base, which is freely accessed at www4.rothamsted.bbsrc.ac.uk/phibase/.

PHI-base update: additions to the pathogen host interaction database

Nucleic Acids Research, 2007

The pathogen-host interaction database (PHI-base) is a web-accessible database that catalogues experimentally verified pathogenicity, virulence and effector genes from bacterial, fungal and Oomycete pathogens, which infect human, animal, plant, insect, fish and fungal hosts. Plant endophytes are also included. PHI-base is therefore an invaluable resource for the discovery of genes in medically and agronomically important pathogens, which may be potential targets for chemical intervention. The database is freely accessible to both academic and non-academic users. This publication describes recent additions to the database and both current and future applications. The number of fields that characterize PHI-base entries has almost doubled. Important additional fields deal with new experimental methods, strain information, pathogenicity islands and external references that link the database to external resources, for example, gene ontology terms and Locus IDs. Another important addition is the inclusion of anti-infectives and their target genes that makes it possible to predict the compounds, that may interact with newly identified virulence factors. In parallel, the curation process has been improved and now involves several external experts. On the technical side, several new search tools have been provided and the database is also now distributed in XML format. PHI-base is available at: http://www.phi-base.org/.

Wheat Estimated Transcript Server (WhETS): a tool to provide best estimate of hexaploid wheat transcript sequence

Nucleic Acids Research, 2007

Wheat biologists face particular problems because of the lack of genomic sequence and the three homoeologous genomes which give rise to three very similar forms for many transcripts. However, over 1.3 million available public-domain Triticeae ESTs (of which ~850 000 are wheat) and the full rice genomic sequence can be used to estimate likely transcript sequences present in any wheat cDNA sample to which PCR primers may then be designed. Wheat Estimated Transcript Server (WhETS) is designed to do this in a convenient form, and to provide information on the number of matching EST and high quality cDNA (hq-cDNA) sequences, tissue distribution and likely intron position inferred from rice. Triticeae EST and hq-cDNA sequences are mapped onto rice loci and stored in a database. The user selects a rice locus (directly or via Arabidopsis) and the matching Triticeae sequences are assembled according to user-defined filter and stringency settings. Assembly is achieved initially with the CAP3 program and then with a single nucleotide polymorphism (SNP)-analysis algorithm designed to separate homoeologues. Alignment of the resulting contigs and singlets against the rice template sequence is then displayed. Sequences and assembly details are available for download in fasta and ace formats, respectively. WhETS is accessible at http://www4.rothamsted.bbsrc.ac.uk/whets.

Prepublication data sharing

Nature, 2009

Are Grammatical Representations Useful for Learning from Biological Sequence Data?— A Case Study

Journal of Computational Biology, 2001

Collectively, five of the co-authors of this paper, have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating a NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples. A group of features is derived from this grammar. Other groups of features of NPPs are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules and its performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives. ... materials, methods, and results. Section 6 is the discussion. Appendix A describes the new cost function, relative advantage (RA). Appendix B includes the production rules generated by CProgol. Appendix C includes our best multistrategy predictor of NPPs.

Enhancing data integration with text analysis to find proteins implicated in plant stress response

by Roxane Legaie and Christopher Rawlings

High throughput genomic studies can identify large numbers of potential candidate genes, which mu... more High throughput genomic studies can identify large numbers of potential candidate genes, which must be interpreted and filtered by investigators to select the best ones for further analysis. Prioritization is generally based on evidence that supports the role of a gene product in the biological process being investigated. The two most important bodies of information providing such evidence are bioinformatics databases and the scientific literature. In this paper we present an extension to the Ondex data integration framework that uses text mining techniques over Medline abstracts as a method for accessing both these bodies of evidence in a consistent way. In an example use case, we apply our method to create a knowledge base of Arabidopsis proteins implicated in plant stress response and use various scoring metrics to identify key protein-stress associations. In conclusion, we show that the additional text mining features are able to highlight proteins using the scientific literature that would not have been seen using data integration alone.

Protein topology prediction through parallel constraint logic programming

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1993

In this paper, two programs are described (CBS1e and CBS2e). These are implemented in the paralle... more In this paper, two programs are described (CBS1e and CBS2e). These are implemented in the parallel constraint logic programming language ElipSys. These predict protein alpha/beta-sheet and beta-sheet topologies from secondary structure assignments and topological folding rules (constraints). These programs illustrate how recent developments in logic programming environments can be applied to solve large-scale combinatorial problems in molecular biology. We demonstrate that parallel constraint logic programming is able to overcome some of the important limitations of more established logic programming languages i.e. Prolog. This is particularly the case in providing features that enhance the declarative nature of the program and also in addressing directly the problems of scaling-up logic programs to solve scientifically realistic problems. Moreover, we show that for large topological problems CBS1e was approximately 60 times faster than an equivalent Prolog implementation (CBS1) on ...

Using Prolog to represent and reason about protein structure

Lecture Notes in Computer Science, 1986

The logic programming language PROLOG was used to represent and reason about the topology of prot... more The logic programming language PROLOG was used to represent and reason about the topology of protein structures. PROLOG descriptions of the relative positions of protein secondary structural features (protein topology) were generatedfrom information in the Brookhaven databank. P-structural motif (hairpin, meander, Greek key andjelly roll) were then defined using PROLOG rules. The PROLOG program was able to infer the presence of these structures in the PROLOG representation of the protein.

Pedstrip: extracting a maximal subset of available, unrelated individuals from a pedigree

Bioinformatics/computer Applications in The Biosciences, 2003

Summary: Certain types of genetic analysis are simplified by assembling a collection of unrelated... more Summary: Certain types of genetic analysis are simplified by assembling a collection of unrelated individuals, e.g. case-control experiments. If a family study is being per- formed then it will be necessary to extract subsets of un- related, available individuals from pedigress. Our program provides an optimal method for performing this task. Availability: The software is available, free of charge, on

Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, United Kingdom, July 16-19, 1995

Intelligent Systems in Molecular Biology, 1995

APPLAUSE: Applications Using the ElipSys Parallel CLP System

International Conference on Logic Programming/Joint International Conference and Symposium on Logic Programming, 1993

The APPLAUSE (Application and Assessment of Parallel Programming Using Logic) Project is building... more

Solving Large Combinatorial Problems in Molecular Biology Using the ElipSys Parallel Constraint Logic Programming System

The Computer Journal, 1993

Applications of ElipSys in Molecular Biology 691 biotechnology industry is dependent on a detaile... more

APPLAUSE: Application & assessment of parallel programming using logic

Lecture Notes in Computer Science, 1993

... to generate training material to introduce applications developers to ElipSys-like languages;... more

The edutain@grid Project

Grid Economics and Business Models, 2007

edutain@grid is an exciting and ground breaking new project making use of Grid technology. The pr... more edutain@grid is an exciting and ground breaking new project making use of Grid technology. The project will identify and define a new class of applications that are highly significant for Grid computing but have not been studied in the past, which we characterise as Real-Time Online Interactive Applications (ROIA). The distinctive features that make ROIA unique include large user concurrency

Graph-based sequence annotation using a data integration approach

The automated annotation of data from high throughput sequencing and genomics experiments is a si... more The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ON-DEX system which is freely available from http://ondex.sf.net/.

Analysis and visualisation of RDF resources in Ondex

Nature Precedings, 2010

Interactive exploration of integrated biological datasets using context-sensitive workflows

Network inference utilizes experimental high-throughput data for the reconstruction of molecular ... more Network inference utilizes experimental high-throughput data for the reconstruction of molecular interaction networks where new relationships between the network entities can be predicted. Despite the increasing amount of experimental data, the parameters of each modeling technique cannot be optimized based on the experimental data alone, but needs to be qualitatively assessed if the components of the resulting network describe the experimental setting. Candidate list prioritization and validation builds upon data integration and data visualization. The application of tools supporting this procedure is limited to the exploration of smaller information networks because the display and interpretation of large amounts of information is challenging regarding the computational effort and the users' experience. The Ondex software framework was extended with customizable context-sensitive menus which allow additional integration and data analysis options for a selected set of candidates during interactive data exploration. We provide new functionalities for on-the-fly data integration using InterProScan, PubMed Central literature search, and sequence-based homology search. We applied the Ondex system to the integration of publicly available data for Aspergillus nidulans and analyzed transcriptome data. We demonstrate the advantages of our approach by proposing new hypotheses for the functional annotation of specific genes of differentially expressed fungal gene clusters. Our extension of the Ondex framework makes it possible to overcome the separation between data integration and interactive analysis. More specifically, computationally demanding calculations can be performed on selected sub-networks without losing any information from the whole network. 
Furthermore, our extensions allow for direct access to online biological databases which helps to keep the integrated information up-to-date.

Identification and analysis of multigene families by comparison of exon fingerprints

Journal of Molecular Biology, 1995

Gene families are often recognised by sequence homology using similarity searching to find relati... more Gene families are often recognised by sequence homology using similarity searching to find relationships, however, genomic sequence data provides gene architectural information not used by conventional search methods. In particular, intron positions and phases are expected to be relatively conserved features, because mis-splicing and reading frame shifts should be selected against. A fast search technique capable of detecting possible weak sequence homologies apparent at the intron/exon level of gene organization is presented for comparing spliceosomal genes and gene fragments. FINEX compares strings of exons delimited by intron/exon boundary positions and intron phases (exon fingerprint) using a global dynamic programming algorithm with a combined intron phase identity and exon size dissimilarity score. Exon fingerprints are typically two orders of magnitude smaller than their nucleic acid sequence counterparts giving rise to fast search times: a ranked search against a library of 6755 fingerprints for a typical three exon fingerprint completes in under 30 seconds on an ordinary workstation, while a worst case largest fingerprint of 52 exons completes in just over one minute. The short “sequence” length of exon fingerprints in comparisons is compensated for by the large exon alphabet compounded of intron phase types and a wide range of exon sizes, the latter contributing the most information to alignments. FINEX performs better in some searches than conventional methods, finding matches with similar exon organization, but low sequence homology. A search using a human serum albumin finds all members of the multigene family in the FINEX database at the top of the search ranking, despite very low amino acid percentage identities between family members. 
The method should complement conventional sequence searching and alignment techniques, offering a means of identifying otherwise hard to detect homologies where genomic data are available.
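The comparison described in this abstract can be illustrated with a minimal sketch. It assumes an exon fingerprint is a list of (exon length, intron phase) pairs and uses a standard Needleman-Wunsch style global alignment. The scoring function, the gap penalty and all names below are hypothetical choices made for illustration; they are not the published FINEX scoring scheme.

```python
from typing import Dict, List, Tuple

# Assumed representation: (exon length in nucleotides, phase of the adjoining intron: 0, 1 or 2)
Exon = Tuple[int, int]


def pair_score(a: Exon, b: Exon, phase_bonus: float = 1.0) -> float:
    """Hypothetical combined score: reward identical intron phase,
    penalise relative difference in exon size."""
    size_dissimilarity = abs(a[0] - b[0]) / max(a[0], b[0])  # 0.0 when sizes are equal
    phase_term = phase_bonus if a[1] == b[1] else 0.0
    return phase_term - size_dissimilarity


def global_align_score(x: List[Exon], y: List[Exon], gap: float = -0.5) -> float:
    """Global dynamic programming alignment score of two exon fingerprints.
    The linear gap penalty is an assumption of this sketch."""
    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),  # align two exons
                dp[i - 1][j] + gap,                                 # exon in x unmatched
                dp[i][j - 1] + gap,                                 # exon in y unmatched
            )
    return dp[n][m]


def rank_library(query: List[Exon], library: Dict[str, List[Exon]]) -> List[Tuple[str, float]]:
    """Score the query against every fingerprint and return the best matches first."""
    scored = [(name, global_align_score(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up fingerprints, for illustration only.
query = [(120, 0), (85, 1), (200, 2)]
library = {
    "same_architecture": [(120, 0), (85, 1), (200, 2)],
    "different_architecture": [(300, 1), (40, 2)],
}
ranking = rank_library(query, library)
```

Because each fingerprint holds only a handful of exons, the quadratic alignment table stays tiny, which is consistent with the fast search times the abstract reports.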

Edutain@Grid: A Business Grid Infrastructure for Real-Time On-Line Interactive Applications

Lecture Notes in Computer Science, 2008

Grid infrastructures are maturing to a point where they are attracting the interest of businesses in many application domains. While many large-scale on-line gaming platforms exist, they fail to take into consideration the potential business-to-business relationships when it comes to dynamic on-line game hosting. This work presents an initial implementation of the edutain@grid architecture to support business value chains identified for on-line gaming and e-learning application hosting. An analysis of business actors and value chains is presented briefly before a detailed description of the edutain@grid implementation. We also consider first results concerning how best to construct appropriate value chains using bipartite and bi-directional Service Level Agreements.

The edutain@grid Project

Lecture Notes in Computer Science, 2007

edutain@grid is an exciting and ground breaking new project making use of Grid technolo...

PHI-base: a new database for pathogen host interactions

Nucleic Acids Research, 2006

To utilize effectively the growing number of verified genes that mediate an organism's ability to cause disease and/or to trigger host responses, we have developed PHI-base. This is a web-accessible database that currently catalogs 405 experimentally verified pathogenicity, virulence and effector genes from 54 fungal and Oomycete pathogens, of which 176 are from animal pathogens, 227 from plant pathogens and 3 from pathogens with a fungal host. PHI-base is the first on-line resource devoted to the identification and presentation of information on fungal and Oomycete pathogenicity genes and their host interactions. As such, PHI-base is a valuable resource for the discovery of candidate targets in medically and agronomically important fungal and Oomycete pathogens for intervention with synthetic chemistries and natural products. Each entry in PHI-base is curated by domain experts and supported by strong experimental evidence (gene/transcript disruption experiments) as well as literature references in which the experiments are described. Each gene in PHI-base is presented with its nucleotide and deduced amino acid sequence as well as a detailed description of the predicted protein's function during the host infection process. To facilitate data interoperability, we have annotated genes using controlled vocabularies (Gene Ontology terms, Enzyme Commission Numbers and so on), and provide links to other external data sources (e.g. NCBI taxonomy and EMBL). We welcome new data for inclusion in PHI-base, which is freely accessed at www4.rothamsted.bbsrc.ac.uk/phibase/.

PHI-base update: additions to the pathogen host interaction database

Nucleic Acids Research, 2007

The pathogen-host interaction database (PHI-base) is a web-accessible database that catalogues experimentally verified pathogenicity, virulence and effector genes from bacterial, fungal and Oomycete pathogens, which infect human, animal, plant, insect, fish and fungal hosts. Plant endophytes are also included. PHI-base is therefore an invaluable resource for the discovery of genes in medically and agronomically important pathogens, which may be potential targets for chemical intervention. The database is freely accessible to both academic and non-academic users. This publication describes recent additions to the database and both current and future applications. The number of fields that characterize PHI-base entries has almost doubled. Important additional fields deal with new experimental methods, strain information, pathogenicity islands and external references that link the database to external resources, for example, gene ontology terms and Locus IDs. Another important addition is the inclusion of anti-infectives and their target genes, which makes it possible to predict the compounds that may interact with newly identified virulence factors. In parallel, the curation process has been improved and now involves several external experts. On the technical side, several new search tools have been provided and the database is also now distributed in XML format. PHI-base is available at: http://www.phi-base.org/.

Wheat Estimated Transcript Server (WhETS): a tool to provide best estimate of hexaploid wheat transcript sequence

Nucleic Acids Research, 2007

Wheat biologists face particular problems because of the lack of genomic sequence and the three homoeologous genomes which give rise to three very similar forms for many transcripts. However, over 1.3 million available public-domain Triticeae ESTs (of which ~850 000 are wheat) and the full rice genomic sequence can be used to estimate likely transcript sequences present in any wheat cDNA sample, to which PCR primers may then be designed. Wheat Estimated Transcript Server (WhETS) is designed to do this in a convenient form, and to provide information on the number of matching EST and high quality cDNA (hq-cDNA) sequences, tissue distribution and likely intron position inferred from rice. Triticeae EST and hq-cDNA sequences are mapped onto rice loci and stored in a database. The user selects a rice locus (directly or via Arabidopsis) and the matching Triticeae sequences are assembled according to user-defined filter and stringency settings. Assembly is achieved initially with the CAP3 program and then with a single nucleotide polymorphism (SNP)-analysis algorithm designed to separate homoeologues. Alignment of the resulting contigs and singlets against the rice template sequence is then displayed. Sequences and assembly details are available for download in fasta and ace formats, respectively. WhETS is accessible at http://www4.rothamsted.bbsrc.ac.uk/whets.

Prepublication data sharing

Nature, 2009

Are Grammatical Representations Useful for Learning from Biological Sequence Data?— A Case Study

Journal of Computational Biology, 2001

Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating an NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost-savings during the search for new NPPs. Prior to this project, experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples. A group of features is derived from this grammar. Other groups of features of NPPs are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules and its performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives. [...] materials, methods, and results. Section 6 is the discussion. Appendix A describes the new cost function, relative advantage (RA).
Appendix B includes the production rules generated by CProgol. Appendix C includes our best multistrategy predictor of NPPs.
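The abstract's remark that predictive accuracy discriminates poorly when positives are rare can be shown with a small arithmetic sketch. The counts below are invented for illustration and are not taken from the paper. The Relative Advantage measure itself is defined in the paper's Appendix A and is not reproduced here; recall is used only as a simple stand-in for "coverage of the positives".

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of the true positives that the model recovers."""
    return tp / (tp + fn)


# Hypothetical data set: 10 positives among 1000 proteins (made-up numbers).
# Model A recovers 2 of the 10 positives, model B recovers 8 of them;
# both wrongly flag 5 of the 990 negatives.
model_a = dict(tp=2, fn=8, fp=5, tn=985)
model_b = dict(tp=8, fn=2, fp=5, tn=985)

acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
rec_a = recall(model_a["tp"], model_a["fn"])
rec_b = recall(model_b["tp"], model_b["fn"])

print(f"accuracy: A={acc_a:.3f}  B={acc_b:.3f}")
print(f"recall:   A={rec_a:.1f}  B={rec_b:.1f}")
```

In this invented example the two accuracies differ by less than one percentage point even though model B finds four times as many positives, which is the kind of insensitivity the abstract describes.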