Tutorial R
Tutorial R
Published by InTech
Janeza Trdine 9, 51000 Rijeka, Croatia
Statements and opinions expressed in the chapters are these of the individual contributors
and not necessarily those of the editors or publisher. No responsibility is accepted
for the accuracy of information contained in the published articles. The publisher
assumes no responsibility for any damage or injury to persons or property arising out
of the use of any materials, instructions, methods or ideas contained in the book.
Preface IX
Part 1 Reviews 1
Bioinformatics is a cross-disciplinary field and its birth in the sixties and seventies
depended on discoveries and developments in different fields, such as: the proposed
double helix model of DNA by Watson and Crick from X-ray data obtained by
Franklin and Wilkins in 1953; the development of a method to solve the phase problem
in protein crystallography by Perutz's group in 1954; the sequencing of the first protein
by Sanger in 1955; the creation of the ARPANET in 1969 at Stanford UCLA; the
publishing of the Needleman-Wunsch algorithm for sequence comparison in 1970; the
first recombinant DNA molecule created by Paul Berg and his group in 1972; the
announcement of the Brookhaven Protein DataBank in 1973; the establishment of the
Ethernet by Robert Metcalfe in the same year; the concept of computers network and
the development of the Transmission Control Protocol (TCP) by Vint Cerf and Robert
Khan in 1974, just to cite some of the landmarks that allowed the rise of bioinformatics.
Later, the Human Genome Project (HGP), started in 1990, was also very important for
pushing the development of bioinformatics and related methods of analysis of large
amount of data.
This book presents some theoretical issues, reviews, and a variety of bioinformatics
applications. For better understanding, the chapters were grouped in two parts. It was
not an easy task to select chapters for these parts, since most chapters provide a mix of
review and case study. From another point of view, all chapters also have extensive
X Preface
biological and computational information. Therefore, the book is divided into two
parts. In Part I, the chapters are more oriented towards literature review and
theoretical issues. Part II consists of application-oriented chapters that report case
studies in which a specific biological problem is treated with bioinformatics tools.
Molecular phylogeny analysis has become a routine technique not only to understand
the sequence-structure-function relationship of biomolecules but also to assist in their
classification. The first chapter of Part I, by Kolekar et al., presents the theoretical basis,
discusses the fundamental of phylogenetic analysis, and a particular view of steps and
methods used in the analysis.
Methods for protein function and gene expression are briefly reviewed in Hupert-
Kocurek and Kaguni’s chapter, and contrasted with the traditional approach of
mapping a gene via the phenotype of a mutation and deducing the function of the
gene product, based on its biochemical analysis in concert with physiological studies.
An example of experimental approach is provided that expands the current
understanding of the role of ATP binding and its hydrolysis by DnaC during the
initiation of DNA replication. This is contrasted with approaches that yield large sets
of data, providing a different perspective on understanding the functions of sets of
genes or proteins and how they act in a network of biochemical pathways of the cell.
Due to the importance of transcriptional regulation, one of the main goals in the post-
genomic era is to predict how the expression of a given gene is regulated based on the
presence of transcription factor binding sites in the adjacent genomic regions. Nain et
al. review different computational approaches for modeling and identification of
regulatory elements, as well as recent advances and the current challenges.
Microarray technology has become one of the most important technologies for
unveiling gene expression profiles, thus fostering the development of new
bioinformatics methods and tools. In the chapter by Lowery et al. a thorough review of
microarray technology is provided, with special focus on MRNA and microRNA
profiling of breast cancer.
high-resolution methods are especially used for characterizing critical regions of the
systems under investigation. Vitale et al. present a thorough review of mass
spectrometry and the related computational methods for studying the three-
dimensional structure of proteins.
Rodrigues and Kluskens review synthetic biology approaches for the development of
alternatives for cancer diagnosis and drug development, providing several application
examples and pointing challenging directions of research.
Exciting and updated issues are presented in Part II, where theoretical bases are
complemented with case studies, showing how bioinformatics analysis pipelines were
applied to answer a variety of biological issues.
During the last years we have witnessed an exponential growth of the biological data
and scientific articles. Consequently, retrieving and categorizing documents has
become a challenging task. The second part of the book starts with the chapter by
Addis et al. that propose a multiagent system for retrieving and categorizing
bioinformatics publications, with special focus on the information extraction task and
adopted hierarchical text categorization technique.
Cancer is one of the most important public health problems worldwide. Breast and
cervical cancer are the most frequent in female population. Salcedo et al. present a
study about the functional analysis of the cervical carcinoma transcriptome, with focus
on the methods for unveiling networks and finding new genes associated to cervical
cancer.
In Chu et al., an alignment approach using the pure best hit strategy is proposed to
classify TIM barrel protein domain structures in terms of the superfamily and family
categories with high accuracy.
Jamil and Sabeena use classic bioinformatic tools, such as ClustalW for Multiple
Sequence Alignment, SCI-PHY server for superfamily determination, ExPASy tools for
pattern matching, and visualization softwares for residue recognition and functional
elucidation to determine the functional diversity of the enolase enzyme superfamily.
In Fu, a new method is presented that explores potential genes in intergenic regions of
an annotated genome on the basis of their gene expression activity. The method was
applied to the M. tuberculosis genome where potential protein-coding genes were
found, based on bioinformatics analysis in conjunction with transcriptional evidence
obtained using the Affymetrix GeneChip. The study revealed potential genes in the
intergenic regions, such as DNA-binding protein in the CopG family and a nickel
binding GTPase, as well as hypothetical proteins.
Cai et al. present a new method for developmental studies. It combines experimental
studies and computational analysis to predict the trans-acting factors and
transcriptional regulatory networks for mouse embryonic retinal development.
The chapter by Roberts shows how advances in bioinformatics can be applied to the
development of improved therapeutic strategies. The chapter describes how functional
genomics experimentation and bioinformatics tools could be applied to the design of
synthetic promoters for therapeutic and diagnostic applications or adapted across the
biotech industry. Designed synthetic gene promoters can than be incorporated in
novel gene transfer vectors to promote safer and more efficient expression of
therapeutic genes for the treatment of various pathological conditions. Tools used to
Preface XIII
analyze data obtained from large-scale gene expression analyses, which are
subsequently used in the smart design of synthetic promoters are also presented.
Bruskin et al. describe how candidate genes commonly involved in psoriasis and
Crohn's disease were detected using lists of differentially expressed genes from
microarrays experiments with different numbers of probes. These gene codes for
proteins are particular targets for elaborating new approaches to treating these
pathologies. A comprehensive meta-analysis of proteomics and transcriptomics of
psoriatic lesions from independent studies is performed. Network-based analysis
revealed similarities in regulation at both proteomics and transcriptomics level.
Some eukaryotic mRNAs have multiple ORFs, which are recognized as polycistronic
mRNAs. One of the well-known extra ORFs is the upstream ORF (uORF), that
functions as a regulator of mRNA translation. In Ao-Kondo et al., this issue is
addressed and an introduction to the mechanism of translation initiation and
functional roles of uORF in translational regulation is given, followed by a review of
how the authors identified novel small proteins with Mass Spectrometry and a
discussion on the progress of bioinformatics analyses for elucidating the
diversification of short coding regions defined by the transcriptome.
Reviews
1
1. Introduction
The endeavour for the classification and study of evolution of organisms, pioneered by
Linneaus and Darwin on the basis of morphological and behavioural features of organisms,
is now being propelled by the availability of molecular data. The field of evolutionary
biology has experienced a paradigm shift with the advent of sequencing technologies and
availability of molecular sequence data in the public domain databases. The post-genomic
era provides unprecedented opportunities to study the process of molecular evolution,
which is marked with the changes organisms acquire and inherit. The species are
continuously subjected to evolutionary pressures and evolve suitably. These changes are
observed in terms of variations in the sequence data that are collected over a period of time.
Thus, the molecular sequence data archived in various databases are the snapshots of the
evolutionary process and help to decipher the evolutionary relationships of genes/proteins
and genomes/proteomes for a group of organisms. It is known that the individual genes
may evolve with varying rates and the evolutionary history of a gene may or may not
coincide with the evolution of the species as a whole. One should always refrain from
discussing the evolutionary relationship between organisms when analyses are performed
using limited/partial data. Thorough understanding of the principles and methods of
phylogeny help the users not only to use the available software packages in an efficient
manner, but also to make appropriate choices of methods of analysis and parameters so that
attempts can be made to maximize the gain on huge amount of available sequence data.
As compared to classical phylogeny based on morphological data, molecular phylogeny has
distinct advantages, for instance, it is based on sequences (as descrete characters) unlike the
morphological data, which is qualitative in nature. While the tree of life is depicted to have
three major branches as bacteria, archaea and eukaryotes (it excludes viruses), the trees
based on molecular data accounts for the process of evolution of bio-macromolecules (DNA,
RNA and protein). The trees generated using molecular data are thus referred to as ‘inferred
trees’, which present a hypothesized version of what might have happened in the process of
evolution using the available data and a model. Therefore, many trees can be generated
using a dataset and each tree conveys a story of evolution. The two main types of
information inherent in any phylogenetic tree are the topology (branching pattern) and the
branch lengths.
4 Computational Biology and Applied Bioinformatics
Before getting into the actual process of molecular phylogeny analysis (MPA), it will be
helpful to get familiar with the concepts and terminologies frequently used in MPA.
Phylogenetic tree: A two-dimensional graph depicting nodes and branches that illustrates
evolutionary relationships between molecules or organisms.
Nodes: The points that connect branches and usually represent the taxonomic units.
Branches: A branch (also called an edge) connects any two nodes. It is an evolutionary
lineage between or at the end of nodes. Branch length represents the number of
evolutionary changes that have occurred in between or at the end of nodes. Trees with
uniform branch length (cladograms), branch lengths proportional to the changes or distance
(phylograms) are derived based on the purpose of analysis.
Operational taxonomic units (OTUs): The known external/terminal nodes in the
phylogenetic tree are termed as OTU.
Hypothetical taxonomic units (HTUs): The internal nodes in the phylogenetic tree that are
treated as common ancestors to OTUs. An internal node is said to be bifurcating if it has
only two immediate descendant lineages or branches. Such trees are also called binary or
dichotomous as any dividing branch splits into two daughter branches. A tree is called a
‘multifurcating’ or ‘polytomous’ if any of its nodes splits into more than two immediate
descendants.
Monophyletic: A group of OTUs that are derived from a single common ancestor
containing all the descendents of single common ancestor.
Polyphyletic: A group of OTUs that are derived from more than one common ancestor.
Paraphyletic: A group of OTUs that are derived from a common ancestor but the group
doesn’t include all the descendents of the most recent common ancestor.
Clade: A monophyletic group of related OTUs containing all the descendants of the
common ancestor along with the ancestor itself.
Ingroup: A monophyletic group of all the OTUs that are of primary interest in the
phylogenetic study.
Outgroup: One or more OTUs that are phylogenetically outside the ingroup and known to
have branched off prior to the taxa included in a study.
Cladogram: The phylogenetic tree with branches having uniform lengths. It only depicts the
relationship between OTUs and does not help estimate the extent of divergence.
Phylogram: The phylogenetic tree with branches having variable lengths that are
proportional to evolutionary changes.
Species tree: The phylogenetic tree representing the evolutionary pathways of species.
Gene tree: The phylogenetic tree reconstructed using a single gene from each species. The
topology of the gene tree may differ from ‘species tree’ and it may be difficult to reconstruct
a species tree from a gene tree.
Unrooted tree: It illustrates the network of relationship of OTUs without the assumption of
common ancestry. Most trees generated using molecular data are unrooted and they can be
rooted subsequently by identifying an outgroup. Total number of bifurcating unrooted trees
can be derived using the equation: Nu= (2n-5)!/2n-3 (n-3)!
Rooted tree: An unrooted phylogenetic tree can be rooted with outgroup species, as a
common ancestor of all ingroup species. It has a defined origin with a unique path to each
ingroup species from the root. The total number of bifurcating rooted trees can be calculated
using the formula, Nr= (2n-3)!/2n-2 (n-2)! (Cavalli-Sforza & Edwards, 1967). Concept of
unrooted and rooted trees is illustrated in Fig. 1.
Molecular Evolution & Phylogeny: What, When, Why & How? 5
Fig. 1. Sample rooted and unrooted phylogenetic trees drawn using 5 OTUs . The external
and internal nodes are labelled with alphabets and Arabic numbers respectively. Note that
the rooted and unrooted trees shown here are one of the many possible trees (105 rooted
and 15 unrooted) that can be obtained for 5 OTUs.
The MPA typically involves following steps
• Definition of problem and motivation to carry out MPA
• Compilation and curation of homologous sequences of nucleic acids or proteins
• Multiple sequence alignments (MSA)
• Selection of suitable model(s) of evolution
• Reconstruction of phylogenetic tree(s)
• Evaluation of tree topology
A brief account of each of these steps is provided below.
At this stage, it is necessary to collate the dataset consisting of homologous sequences with the
appropriate coverage of OTUs and outgroup sequences, if needed. Care should be taken to
select the equivalent regions of sequences having comparable lengths (± 30 bases or amino
acids) to avoid the subsequent errors associated with incorrect alignments leading to incorrect
sampling of dataset, which may result in erroneous tree topology. Length differences of >30
might result in insertion of gaps by the alignment programs, unless the gap opening penalty is
suitably modified. Many comprehensive primary and derived databases of nucleic acid and
protein sequences are available in public domain, some of which are listed in Table 1. The
database issue published by the journal ‘Nucleic Acids research’ (NAR) in the month of
January every year is a useful resource for existing as well as upcoming databases. These
databases can be queried using the ‘text-based’ or ‘sequence-based’ database searches.
Table 1. List of some of the commonly used nucleotide, protein and molecule-/species-
specific databases.
Text-based queries are supported using search engines viz., Entrez and SRS, which are
available at NCBI and EBI respectively. The list of hits returned after the searches needs to
be curated very carefully to ensure that the data corresponds to the gene/protein of interest
and is devoid of partial sequences. It is advisable to refer to the feature-table section of every
entry to ensure that the data is extracted correctly and corresponds to the region of interest.
The sequence-based searches involve querying the databases using sequence as a probe and
are routinely used to compile a set of homologous sequences. Once the sequences are
compiled in FASTA or another format, as per the input requirements of MPA software, the
sequences are usually assigned with unique identifiers to facilitate their identification and
comparison in the phylogenetic trees. If the sequences posses any ambiguous characters or
low complexity regions, they could be carefully removed from sequences as they don’t
contribute to evolutionary analysis. The presence of such regions might create problems in
alignment, as it could lead to equiprobable alternate solutions to ‘local alignment’ as part of
Molecular Evolution & Phylogeny: What, When, Why & How? 7
a global alignment. Such regions possess ‘low’ information content to favour a tree topology
over the other. The inferiority of input dataset interferes with the analysis and interpretation
of the MPA. Thus, compilation of well-curated sequences, for the problem at hand, plays a
crucial role in MPA.
The concept of homology is central to MPA. Sequences are said to be homologous if they share
a common ancestor and are evolutionarily related. Thus, homology is a qualitative description
of the relationship and the term %homology has no meaning. However, supporting data for
deducing homology comes from the extent of sequence identity and similarity, both of which
are quantitative terms and are expressed in terms of percentage.
The homologous sequences are grouped into three types, viz., orthologs (same gene in
different species), paralogs (the genes that originated from duplication of an ancestral gene
within a species) and xenologs (the genes that have horizontally transferred between the
species). The orthologous protein sequences are known to fold into similar three-dimensional
shapes and are known to carry out similar functions. For example, haemoglobin alpha in horse
and human. The paralogous sequences are copies of the ancestral genes evolving within the
species such that nature can implement a modified function. For example haemoglobin alpha
and beta in horse. The xenologs and horizontal transfer events are extremely difficult to be
proved only on the basis of sequence comparison and additional experimental evidence to
support and validate the hypothesis is needed. The concepts of sequence alignments,
similarity and homology are extensively reviewed by Phillips (2006).
is employed to derive the optimum MSA (Nicholas et al., 2002; Batzoglou, 2005). Most of the
MSA packages use Needleman and Wunsch (1970) algorithm to compute pair wise sequence
similarity. The ClustalW is the widely used MSA package (Thompson et al., 1994). Recently
many alternative MSA algorithms are also being developed, which are enlisted in Table 2.
The standard benchmark datasets are used for comparative assessment of the alternative
approaches (Aniba et al., 2010; Thompson et al., 2011). Irrespective of the proven
performance of MSA methods for individual genes and proteins, some of the challenges and
issues regarding computational aspects involved in handling genomic data are still the
causes of concern (Kemena & Notredame, 2009).
Alignment
Algorithm description Available at / Reference
programs
http://www.ebi.ac.uk/Tools/msa/clustalw2/ ;
ClustalW Progressive
Thompson et al., 1994
http://www.ebi.ac.uk/Tools/msa/muscle/ ;
MUSCLE Progressive/iterative
Edgar, 2004
http://www.ebi.ac.uk/Tools/msa/tcoffee/ ;
T-COFFEE Progressive
Notredame et al., 2000
http://bibiserv.techfak.uni-bielefeld.de/dialign/ ;
DIALIGN2 Segment-based
Morgenstern et al., 1998
http://mafft.cbrc.jp/alignment/software/ ;
MAFFT Progressive/iterative
Katoh et al., 2005
Alignment visualization programs
http://www.mbio.ncsu.edu/bioedit/bioedit.html
*BioEdit
; Hall, 1999
http://www.megasoftware.net/ ;
MEGA5
Kumar et al., 2008
http://dambe.bio.uottawa.ca/dambe.asp ;
DAMBE
Xia & Xie, 2001
http://aig.cs.man.ac.uk/research/utopia/cinema ;
CINEMA5
Parry-Smith et al., 1998
*: Not updated since 2008, but the last version is available for use.
Table 2. List of commonly used multiple sequence alignment programs and visualization
tools.
The MSA output can also be visualized and edited, if required, with the software like
BioEdit, DAMBE etc. Multiple alignment output shows the conserved and variable sites,
usually residues are colour coded for the ease of visualisation, identification and analysis.
The character sites in MSA can be divided as conserved (all the sequences have same
residue or base), variable-non-informative (singleton site) and variable-informative sites.
The sites containing gaps in all or majority of the species are of no importance from the
evolutionary point of view and are usually removed from MSA while converting MSA data
to input data for MPA. A sample MSA is shown in Fig. 2. The sequences of surface
hydrophobic (SH) protein from various genotypes (A to M) of Mumps virus, are aligned. A
careful visual inspection of MSA allows us to locate the patterns and motifs (LLLXIL) in a
given set of sequences. Apart from MPA, the MSA data in turn can be used for the
construction of position specific scoring matrix (PSSM), generation of consensus sequence,
Molecular Evolution & Phylogeny: What, When, Why & How? 9
sequence logos, identification and prioritisation of potential B- and T-cell epitopes etc.
Nowadays the databases of curated, pre-computed alignments of reference species are also
being made available, which can be used for the benchmark comparison, evaluation purpose
(Thompson et al., 2011) and it also helps to keep the track of changes that get accumulated in
the species over a period of time. For example, in case of viruses, observed changes are
correlated with emergence of new genotypes (Kulkarni-Kale et al., 2004; Kuiken et al., 2005).
Fig. 2. The complete multiple sequence alignment of the surface hydrophobic (SH) proteins
of Mumps virus genotypes (A to M) carried out using ClustalW. The MSA is viewed using
BioEdit. The species labels in the leftmost column begin with genotype letter (A-M) followed
by GenBank accession numbers. The scale for the position in alignment is given at the top of
the alignment. The columns with conserved residues are marked with an “*” in the last row.
sequence evolution becomes important as a part of effective MPA. Two types of approaches
are adapted for the building of models, first one is empirical i.e. using the properties
revealed through comparative studies of large datasets of observed sequences, and the other
is parametrical, which uses biological and biochemical knowledge about the nucleic acid
and protein sequences, for example the favoured substitution patterns of residues.
Parametric models obtain the parameters from the MSA dataset under study. Both types of
approaches result in the models based on the Markov process, in the form of matrix
representing the rate of all possible transitions between the types of residues (4 nucleotides
in nucleic acids and 20 amino acids in proteins). According to the type of sequence (nucleic
acid or protein), two categories of models have been developed.
appropriate for the phylogenetic study of membrane proteins. On the other hand, Adachi and
Hasegawa (1996) obtained a replacement matrix using mitochondrial proteins across 20
vertebrate species and can be effectively used for mitochondrial protein phylogeny. Henikoff
and Henikoff (1992) derived the series of BLOSUM matrices using local, ungapped alignments
of distantly related sequences. The BLOSUM matrices are widely used in similarity searches
against databases than for phylogenetic analyses.
Fig. 3. The types of substitutions in nucleotides. α denotes the rate of transitions and β
denotes the rate of transversions. For example, in the case of JC model α=β while in the case
of K2P model α>β.
Recently, structural constraints of the nucleic acids and proteins are also being incorporated in
the building of models of evolution. For example, Rzhetsky (1995) contributed a model to
estimate the substitution patterns in ribosomal RNA genes with the account of secondary
structure elements like stem-loops in ribosomal RNAs. Another approach introduced a model
with the combination of protein secondary structures and amino acid replacement (Lio &
Goldman, 1998; Thorne et al., 1996). The overview of different models of evolution and the
criteria for the selection of models is also provided by Lio & Goldman (1998); Luo et al. (2010).
evolutionary pattern of the OTUs under study. The exhaustive search method examines
theoretically all possible tree topologies for a chosen number of species and derives the best
tree topology using a set of certain criteria. Table 3 shows the possible number of rooted and
unrooted trees for n number of species/OTUs.
Fig. 4. The distance matrix obtained for a sample nucleotide sequence dataset using Jukes-
Cantor model. Dataset contains 5 OTUs (A-E) and 6 sites shown in Phylip format. Dnadist
program in PHYLIP package is used to compute distance matrix.
Iteration 1: OTU A is minimally equidistant from OTUs B and C. Randomly we select the
OTUs A and B to form one composite OTU (AB). A and B are clustered together. Compute
new distances of OTUs C, D and E from composite OTU (AB). The distances between
unclustered OTUs will be retained. See Fig. 4 for initial distance matrix and Fig. 5 for
updated matrix after first iteration of UPGMA.
d(AB,C) = [d(A,C) + d(B,C)]/2 = [0.188486 + 0.440840]/2 = 0.314633
d(AB,D) = [d(A,D) + d(B,D)]/2 = [0.823959 + 0.440840]/2 = 0.632399
d(AB,E) = [d(A,E) + d(B,E)]/2 = [1.647918 + 0.823959]/2 = 1.235938
Fig. 5. The updated distance matrix and clustering of A and B after the 1st iteration of
UPGMA.
Iteration 2: OTUs (AB) and C are minimally distant. We select these OTUs to form one
composite OTU (ABC). AB and C are clustered together. We then compute new distances of
OTUs D and E from composite OTU (ABC). See Fig. 5 for distance matrix obtained in
iteration 1 and Fig. 6 for updated matrix after the second iteration of UPGMA.
d(ABC,D) = [d(AB,D) + d(C,D)]/2 = [0.632399 + 1.647918]/2 = 1.140158
d(ABC,E) = [d(AB,E) + d(C,E)]/2 = [1.235938 + 0.823959]/2 = 1.029948
Fig. 6. The updated distance matrix and clustering of A, B and C after the 2nd iteration of
UPGMA.
Iteration 3: OTUs D and E are minimally distant. We select these OTUs to form one
composite OTU (DE). D and E are clustered together. Compute new distances of OTUs
(ABC) and (DE) from each other. Finally the remaining two OTUs are clustered together. See
Fig. 6 for distance matrix obtained in iteration 2 and Fig. 7 for updated matrix after third
iteration of UPGMA.
d(ABC,DE) = [d(ABC,D) + d(ABC,E)]/2 = [1.140158 + 1.029948]/2 = 1.085053
Molecular Evolution & Phylogeny: What, When, Why & How? 15
Fig. 7. The updated distance matrix and clustering of OTUs after the 3rd iteration of
UPGMA. Numbers on the branches indicate branch lengths, which are additive.
Fig. 8. The modified distance matrix Md and clustering for iteration 1 of N-J.
As can be seen from Md in Fig. 8, OTUs A and C are minimally distant. We select the OTUs
A and C to form one composite OTU (AC). A and C are clustered together.
Iteration 2: Compute new distances of OTUs B, D and E from composite OTU (AC).
Distances between unclustered OTUs will be retained from the previous step.
d(AC,B) = [d(A,B) + d(C,B)-d(A,C)]/2 = 0.22042
d(AC,D) = [d(A,D) + d(C,D) -d(A,C)]/2 = 1.141695
d(AC,E) = [d(A,E) + d(C,E) -d(A,C)]/2 = 1.141695
Compute r as in the previous step with N=4. See Fig. 9 for new distance matrix and r vector.
Fig. 9. The new distance matrix D and vector r obtained for NJ algorithm iteration 2.
Now, we compute the modified distance matrix, Md as in the previous step and cluster the
minimally distant OTUs. See Fig. 10
Fig. 10. The modified distance matrix Md, obtained during N-J algorithm iteration 2.
In this step, AC & B and D & E are minimally distant, so we cluster AC with B and D with E.
Repeating the above steps we will finally get the following phylogenetic tree, Fig. 11.
Both the distance-based methods, UPGMA and N-J, are computationally faster and hence
suited for the phylogeny of large datasets. N-J is the most widely used distance-based
method for phylogenetic analysis. The results of these methods are highly dependent on the
model of evolution selected a priori.
Molecular Evolution & Phylogeny: What, When, Why & How? 17
Fig. 11. The phylogenetic tree obtained using N-J algorithm for distance matrix in Fig 4.
Numbers on the branches indicate branch length.
aligned nucleotides. Since there are four taxa (A, B, C & D), three possible unrooted trees
can be obtained for each site. Out of 5 character sites, only two sites, viz., 4 & 5 are
informative i.e. sites having at least two different types of characters (nucleotides/amino
acids) with a minimum frequency 2. In the Maximum parsimony method, only informative
sites are analysed. Fig. 12 shows the Maximum parsimony phylogenetic analysis of site 5
shown in Table 5. Three possible unrooted trees are shown for site 5 and the tree length is
calculated in terms of number of substitutions. Tree II is favoured over trees I and III as it
can explain the observed changes in the sequences just with a single substitution. In the
same way unrooted trees can be obtained for other informative sites such as site 4. The most
parsimonious tree among them will be selected as the final phylogenetic tree. If two or more
trees are found and no unique tree can be inferred, trees are said to be equally
parsimonious.
Table 5. Example of phylogenetic analysis from 5 aligned character sites in 4 OTUs using
Maximum parsimony method.
Fig. 12. Example showing various tree topologies based on site 5 in Table 5 using the
Maximum parsimony method.
This method is suitable for a small number of sequences with higher similarity and was
originally developed for protein sequences. Since this method examines the number of
evolutionary changes in all possible trees it is computationally intensive and time
consuming. Thus, it is not the method of choice for large sized genome sequences with high
variation. The unequal rates of variation in different sites can lead to erroneous parsimony
tree with some branches having longer lengths than others as parsimony method assumes
the rate of change across all sites to be equal.
Molecular Evolution & Phylogeny: What, When, Why & How? 19
Fig. 13. Flowchart showing the analysis steps involved in phylogenetic reconstruction.
Fig. 14. The procedure to generate pseudo-replicate datasets of original dataset using
bootstrap is shown above. The character sites are shown in colour codes at the bottom of
datasets to visualize “sampling with replacement protocol”.
randomly from original dataset with a possibility of sampling the same site repeatedly, in
the process of regular bootstrap. This leads to generation of population of datasets, which
are given as an input to tree building methods thus giving rise to population of phylogenetic
Molecular Evolution & Phylogeny: What, When, Why & How? 21
trees. The consensus phylogenetic tree is then inferred by the majority rule that groups those
OTUs, which are found to cluster most of the times in the population of trees. The branches
in consensus phylogenetic tree are labelled with bootstrap support values enabling the
significance of the relationship of OTUs as depicted using a branching pattern. The
procedure for regular bootstrap is illustrated in the Fig. 14. It shows the original dataset
along with four pseudo-replicate datasets.
The sites in the original dataset are colour coded to visualize the “sampling with
replacement protocol” used in generation of pseudo-replicate datasets 1-4. Seqboot program
in PHYLIP package was used for this purpose with choice of regular bootstrap. For
example, pseudo-replicate dataset 1 contains the site 1 (red) from original dataset sampled 3
times. In the general practice, usually 100 to 1000 datasets are generated and for each of the
datasets phylogenetic tree is obtained. The consensus phylogenetic tree is then obtained by
majority rule. The reliability of the consensus tree is assessed from the “branch times” value
displayed along the branches of tree.
In Jackknife procedure, the pseudo-datasets are generated by “sampling without replacement”
protocol. In this process, sampling (<n) character sites randomly from original dataset
generates each pseudo dataset. This leads to generation of population of datasets, which are
given as an input to tree building methods thus giving rise to population of phylogenetic trees.
The consensus phylogenetic tree is inferred by the majority rule that groups those OTUs,
which are found to be clustered most of the times in the population of trees.
Fig. 15. The unrooted consensus phylogenetic tree obtained for Mumps virus genotypes
using Neighbor-Joining method. The first letter in OTU labels indicates the genotype (A-M),
which is followed by the GenBank accession numbers for the sequences. The OTUs are also
colour coded according to genotypes as following, A: red; B: light blue; C: yellow; D: light
green; E: dark blue; F: magenta; G: cyan; H: brick; I: pink; J: orange; K: black; L: dark green;
M: purple. All of the genotypes have formed monophyletic clades with high bootstrap
support values shown along the branches. The monophyletic clade of M genotypes (with 98
bootstrap support at its base) separated from the individual monophyletic clades of other
genotypes (A-L) re-confirms the detection of new genotype M.
e. Building phylogenetic tree: The distance matrices obtained in the previous step were
given as an input to N-J method to build phylogenetic trees. The ‘outfile’ generated by
Protdist program containing distance matrices was given as an input to Neighbor
program in PHYLIP package.
f. The consensus phylogenetic tree was then obtained using Consense program. For this
purpose the ‘outtree’ file (in Newick format) generated by Neighbor program was given
as an input to Consense program.
g. The consensus phylogenetic tree was visualized using FigTree software (available from
http://tree.bio.ed.ac.uk/software/figtree/). The consensus unrooted phylogenetic tree
is shown in Fig. 15.
The phylogenetic tree for the same dataset was also obtained by using Maximum parsimony
method, implemented as the Protpars program in PHYLIP by carrying out MSA and
bootstrap as detailed above. The consensus phylogenetic tree is shown in Fig. 16.
Comparison of the trees shown in Fig. 15 & Fig. 16 with that of the published tree re-
confirms the emergence of new MuV genotype M during the epidemic in São Paulo, Brazil
(Santos et al., 2008), as the members of genotype M have formed a distinct monophyletic
clade similar to the known genotypes (A-L). But, a keen observer would note the differences
in ordering of clades in the two phylograms obtained using two different methods viz., N-J
Molecular Evolution & Phylogeny: What, When, Why & How? 23
and MP. For example, the clade of genotype J is close to the clade of genotype I in the N-J
phylogram (see Fig. 15) whereas in the MP phylogram (Fig. 16) the clade of genotype J is
shown to cluster with the clade of genotype F. Such differences in the ordering of clades are
observed some times as these methods (N-J & MP) employ different assumptions and
models of evolution. The user can interpret the results with reasonable confidence where the
similar clustering pattern of clades is observed in trees drawn using multiple methods. The
user, on the other hand, should refrain from over interpretation of sub-tree topolologies,
where branching order doesn’t match in the trees drawn using different methods. Similarly,
a lot of case studies pertaining to the emergence of new species as well as evolution of
individual genes/proteins have been published. It is advisable to re-run through a few case
studies, which are published, to understand the way in which the respective authors have
interpreted the results on the basis of phylogenetic analyses.
Fig. 16. The unrooted consensus phylogenetic tree obtained for Mumps virus genotypes
using Maximum parsimony method. The labelling of OTUs and colour coding is same as in
Fig. 15.
Tiedje, 2007). But whole genome based phylogeny poses many challenges to the traditional
methods of MPA, major concerns of them being the size, memory and computational
complexity involved in alignment of genomes (Liu et al., 2010).
The methods of MSA developed so far are adequate to handle the requirements of limited
amount of data viz. individual gene or protein sequences from various organisms. The
increased size of data in terms of the whole genome sequences, however, poses constrains
on use and applicability of currently available methods of MSA as they become
computationally intensive with requirement of higher memory. The uncertainty associated
with alignment procedures, which leads to variations in the inferred phylogeny, has also
been pointed out to be the cause of concern (Wong et al., 2008). The benchmark datasets are
made available to validate performance of multiple sequence alignment methods (Kemena
& Notredame, 2009). These challenges have opened up opportunities for development of
alternative approaches for MPA with emergence of alignment-free methods for the same
(Kolekar et al., 2010; Sims et al., 2009; Vinga & Almeida, 2003). The field of MPA is also
evolving with attempts to develop novel methods based on various data mining techniques
viz. Hidden Markov Model (HMM) (Snir & Tuller, 2009), Chaos game theory (Deschavanne
et al., 1999), Return Time Distributions (Kolekar et al., 2010) etc. The recent approaches are
undergoing refinement and will have to be evaluated with the benchmark datasets before
they are routinely used. However, sheer dimensionality of genomic data demands their
application. These approaches along with the conventional approaches are extensively
reviewed elsewhere (Blair & Murphy, 2010; Wong & Nielsen, 2007).
10. Conclusion
The chapter provides excursion of molecular phylogeny analyses for potential users. It gives
an account of available resources and tools. The fundamental principles and salient features
of various methods viz. distance-based and character-based are explained with worked out
examples. The purpose of the chapter will be served if it enables the reader to develop
overall understanding, which is critical to perform such analyses involving real data.
11. Acknowledgment
PSK acknowledges the DBT-BINC junior research fellowship awarded by Department of
Biotechnology (DBT), Government of India. UKK acknowledges infrastructural facilities and
financial support under the Centre of Excellence (CoE) grant of DBT, Government of India.
12. References
Adachi, J. & Hasegawa, M. (1996) Model of amino acid substitution in proteins encoded by
mitochondrial DNA. Journal of Molecular Evolution 42(4):459-468.
Aniba, M.; Poch, O. & Thompson J. (2010) Issues in bioinformatics benchmarking: the case
study of multiple sequence alignment. Nucleic Acids Research 38(21):7353-7363.
Batzoglou, S. (2005) The many faces of sequence alignment. Briefings in Bioinformatics 6(1):6-
22.
Benson, D.; Karsch-Mizrachi, I.; Lipman, D.; Ostell, J. & Sayers, E. (2011) GenBank. Nucleic
Acids Research 39(suppl 1):D32-D37.
Molecular Evolution & Phylogeny: What, When, Why & How? 25
Blair, C. & Murphy, R. (2010) Recent Trends in Molecular Phylogenetic Analysis: Where to
Next? Journal of Heredity 102(1):130.
Bruno, W.; Socci, N. & Halpern A. (2000) Weighted neighbor joining: a likelihood-based
approach to distance-based phylogeny reconstruction. Molecular Biology and
Evolution 17(1):189.
Cavalli-Sforza, L. & Edwards, A. (1967) Phylogenetic analysis. Models and estimation
procedures. American Journal of Human Genetics 19(3 Pt 1):233.
Cole, J.; Wang, Q.; Cardenas, E.; Fish, J.; Chai, B.; Farris, R.; Kulam-Syed-Mohideen, A.;
McGarrell, D.; Marsh, T.; Garrity, G. & others. (2009) The Ribosomal Database
Project: improved alignments and new tools for rRNA analysis. Nucleic Acids
Research 37(suppl 1):D141-D145.
Deschavanne, P.; Giron, A.; Vilain, J.; Fagot, G. & Fertil, B. (1999) Genomic signature:
characterization and classification of species assessed by chaos game representation
of sequences. Mol Biol Evol 16(10):1391-9.
Edgar, R.(2004). MUSCLE: a multiple sequence alignment method with reduced time and
space complexity. BMC Bioinformatics 5:113
Efron, B. (1979) Bootstrap Methods: Another Look at the Jackknife. Ann. Statist. 7:1-26.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood
approach. J. Mol Evol 17:368-376.
Felsenstein, J. (1985) Confidence Limits on Phylogenies: An Approach Using the Bootstrap.
Evolution 39 783-791.
Felsenstein, J.(1989). PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164-166
Felsenstein, J. (1996) Inferring phylogenies from protein sequences by parsimony, distance,
and likelihood methods. Methods Enzymol 266:418-27.
Fitch, W. (1971) Toward Defining the Course of Evolution: Minimum Change for a Specific
Tree Topology. Systematic Zoology 20(4):406-416.
Gascuel, O. (1997) BIONJ: an improved version of the NJ algorithm based on a simple model
of sequence data. Molecular Biology and Evolution 14(7):685-695.
Hall, T. (1999). BioEdit: A user-friendly biological sequence alignment editor and analysis
program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95-98.
Hendy, M. & Penny, D. (1982) Branch and bound algorithms to determine minimal
evolutionary trees. Mathematical Biosciences 59(2):277-290.
Henikoff, S. & Henikoff, J. (1992) Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America
89(22):10915.
Jones, D.; Taylor, W. & Thornton, J. (1994) A mutation data matrix for transmembrane
proteins. FEBS Letters 339(3):269-275.
Jukes, T. & Cantor, C. (1969) Evolution of protein molecules. In “Mammalian Protein
Metabolism”(HN Munro, Ed.). Academic Press, New York.
Kaminuma, E.; Kosuge, T.; Kodama, Y.; Aono, H.; Mashima, J.; Gojobori, T.; Sugawara, H.;
Ogasawara, O; Takagi, T.; Okubo, K. & others. (2011). DDBJ progress report.
Nucleic Acids Research 39(suppl 1):D22-D27
Katoh, K.; Kuma, K.; Toh, H. & Miyata T (2005) MAFFT version 5: improvement in accuracy
of multiple sequence alignment. Nucleic Acids Res 33:511 - 518
Kemena, C. & Notredame, C. (2009) Upcoming challenges for multiple sequence alignment
methods in the high-throughput era. Bioinformatics 25(19):2455-2465.
26 Computational Biology and Applied Bioinformatics
Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences. Journal of Molecular Evolution
16(2):111-120.
Kolekar, P.; Kale, M. & Kulkarni-Kale, U. (2010) `Inter-Arrival Time' Inspired Algorithm and
its Application in Clustering and Molecular Phylogeny. AIP Conference Proceedings
1298(1):307-312.
Konstantinidis, K. & Tiedje, J. (2007) Prokaryotic taxonomy and phylogeny in the genomic
era: advancements and challenges ahead. Current Opinion in Microbiology 10(5):504-
509.
Kuiken, C.; Leitner, T.; Foley, B.; Hahn, B.; Marx, P.; McCutchan, F.; Wolinsky, S.; Korber, B.;
Bansal, G. & Abfalterer, W. (2009) HIV sequence compendium 2009. Document LA-
UR:06-0680
Kuiken, C.; Yusim, K.; Boykin, L. & Richardson, R. (2005) The Los Alamos hepatitis C
sequence database. Bioinformatics 21(3):379.
Kulkarni-Kale, U.; Bhosle, S.; Manjari, G. & Kolaskar, A. (2004) VirGen: a comprehensive
viral genome resource. Nucleic Acids Research 32(suppl 1):D289.
Kumar, S.; Nei, M.; Dudley, J. & Tamura, K. (2008) MEGA: a biologist-centric software for
evolutionary analysis of DNA and protein sequences. Brief Bioinform, 9(4):299-306.
Leinonen, R.; Akhtar, R.; Birney, E.; Bower, L.; Cerdeno-Tárraga, A.; Cheng, Y.; Cleland, I.;
Faruque, N.; Goodgame, N.; Gibson, R. & others. (2011) The European Nucleotide
Archive. Nucleic Acids Research 39(suppl 1):D28-D31.
Lio, P. & Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Res
8(12):1233-44.
Liu, K.; Linder, C. & Warnow, T. (2010) Multiple sequence alignment: a major challenge to
large-scale phylogenetics. PLoS Currents 2.
Luo, A.; Qiao, H.; Zhang, Y.; Shi, W.; Ho, S.; Xu, W.; Zhang, A. & Zhu, C. (2010) Performance
of criteria for selecting evolutionary models in phylogenetics: a comprehensive
study based on simulated datasets. BMC Evolutionary Biology 10(1):242.
Morgenstern, B.; French, K.; Dress, A. & Werner, T. (1998). DIALIGN: finding local
similarities by multiple sequence alignment. Bionformatics 14:290 - 294
Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of Molecular Biology
48(3):443-453.
Nicholas, H.; Ropelewski, A. & Deerfield DW. (2002) Strategies for multiple sequence
alignment. Biotechniques 32(3):572-591.
Notredame, C.; Higgins, D. & Heringa, J. (2000) T-Coffee: A novel method for fast and
accurate multiple sequence alignment. Journal of Molecular Biology 302:205 - 217
Parry-Smith, D.; Payne, A.; Michie, A.& Attwood, T. (1998). CINEMA--a novel colour
INteractive editor for multiple alignments. Gene 221(1):GC57-GC63
Pearson, W.; Robins, G. & Zhang, T. (1999) Generalized neighbor-joining: more reliable
phylogenetic tree reconstruction. Molecular Biology and Evolution 16(6):806.
Phillips, A. (2006) Homology assessment and molecular sequence alignment. Journal of
Biomedical Informatics 39(1):18-33.
Ronquist, F. & Huelsenbeck, J. (2003). MrBayes 3: Bayesian phylogenetic inference under
mixed models. Bioinformatics 19(12):1572-1574
Molecular Evolution & Phylogeny: What, When, Why & How? 27
1. Introduction
Bioinformatics has its origins in the development of DNA sequencing methods by Alan
Maxam and Walter Gilbert (Maxam and Gilbert, 1977), and by Frederick Sanger and
coworkers (Sanger et al., 1977). By entirely different approaches, the first genomes
determined at the nucleotide sequence level were that of bacteriophage φX174, and the
recombinant plasmid named pBR322 composed of about 5,400 (Sanger et al., 1977), or 4,400
base pairs (Sutcliffe, 1979), respectively. In contrast, two articles that appeared in February
2001 reported on the preliminary DNA sequence of the human genome, which corresponds
to 3 billion nucleotides of DNA sequence information (Lander et al., 2001; Venter et al.,
2001). Only two years later, the GenBank sequence database contained more than 29.3
billion nucleotide bases in greater than 23 million sequences. With the development of new
technologies, experts predict that the cost to sequence an individual’s DNA will be about
$1000. This reduction in cost suggests that efforts in the area of comparative genomics will
increase substantially, leading to an enormous database that vastly exceeds the existing one.
By way of comparative genomics approaches, computational methods have led to the
identification of homologous genes shared among species, and their classification into
superfamilies based on amino acid sequence similarity. In combination with their
evolutionary relatedness, superfamily members have been clustered into clades. In addition,
high throughput sequencing of small RNAs and bioinformatics analyses have contributed to
the identification of regions between genes that can code small RNAs (siRNA, microRNA,
and long noncoding RNA), which act during the development of an organism to modulate
gene expression at the post-transcriptional level (Fire et al., 1998; Hamilton and Baulcombe,
1999) reviewed in Elbashir et al., 2001; Ghildiyal and Zamore, 2009; Christensen et al., 2010).
An emerging area is functional genomics whereby gene function is deduced using large-
scale methods by identifying the involvement of specific genes in metabolic pathways. More
recently, phenotype microarray methods have been used to correlate the functions of genes
of microbes with cell phenotypes under a variety of growth conditions (Bochner, 2009).
These methods contrast with the traditional approach of mapping a gene via the phenotype
of a mutation, and deducing the function of the gene product based on its biochemical
analysis in concert with physiological studies. Such studies have been performed to confirm
the functional importance of conserved residues shared by superfamily members, and also
30 Computational Biology and Applied Bioinformatics
to determine the role of specific residues for a given protein. In comparison, comparative
genomics methods are unable to distinguish if a nonconserved amino acid among
superfamily members is functionally important, or simply reflects sequence divergence due
to the absence of selection during evolution. Without functional information, it is not
possible to determine if a nonconserved amino acid is important.
the acidic residue in the Walker B motif interacts with and aligns the activated water
molecule during nucleotide hydrolysis. The Box VII motif, which is also called the SRH
(Second Region of Homology) motif, contains an arginine named the arginine finger by its
analogous function with the corresponding arginine of GTPase activator proteins that
interacts with GTP bound to a small G protein partner to promote GTP hydrolysis. The
crystal structures of several AAA+ proteins have shown that the Box VII motif in an
individual molecule is located some distance away from the nucleotide binding pocket. In
AAA+ proteins that assemble into ring-shaped or helical oligomers, the Box VII motif of one
protomer directs an arginine residue responsible for interaction with the γ phosphate of
ATP toward the ATP binding pocket of the neighboring molecule. It is proposed that this
interaction or lack thereof coordinates ATP hydrolysis with a conformational change. The
Sensor 2 motif, which resides in one of the α helices that follow the Rossmann fold, also
contains a conserved arginine. For proteins whose structures contain the bound nucleoside
triphosphate or a nucleotide analogue, this amino acid interacts with the γ phosphate of the
nucleotide. As reviewed by Ogura (Ogura et al., 2004), this residue is involved in ATP
binding or its hydrolysis in some but not all AAA+ proteins. Like the arginine finger
residue, this arginine is thought to coordinate a change in protein conformation with
nucleotide hydrolysis.
Fig. 1. Structural organization of the AAA+ domain, and the locations of the Walker A/P-
loop, Walker B, Sensor 1, Box VII and Sensor 2 motifs are shown (adapted from ref.
(Erzberger and Berger, 2006)).
Because this chapter focuses on members of the DnaA/CDC6/ORC or initiator clade, the
following summarizes properties of this clade and not others. Like the clamp loader clade,
proteins in the initiator clade as represented by DnaA and DnaC have a structure
resembling an open spiral on the basis of X-ray crystallography (Erzberger et al., 2006; Mott
et al., 2008). In comparison, oligomeric proteins in the remaining clades form closed rings. A
characteristic feature of proteins in the initiator clade is the presence of two α helices
between the β2 and β3 strands (Figure 1). Compared with the function of DnaA in the
initiation of E. coli DNA replication, DnaC plays a separate role. Their functions are
32 Computational Biology and Applied Bioinformatics
described in more detail below. The ORC/CDC6 group of eukaryotic proteins in the
initiator clade, like DnaA and DnaC, act to recruit the replicative helicase to replication
origins at the stage of initiation of DNA replication (Lee and Bell, 2000; Liu et al., 2000). The
origin recognition complex (ORC) is composed of six related proteins named Orc1p through
Orc6p, and likely originated along with Cdc6p from a common ancestral gene.
Bioinformatics analysis of DnaC suggests that this protein is a paralog of DnaA, arising by
gene duplication and then diverging with time to perform a separate role from DnaA during
the initiation of DNA replication (Koonin, 1992). This notion leads to the question of what
specific amino acids are responsible for the different functions of DnaA and DnaC despite
the shared presence of the AAA+ amino acid sequence motifs. Presumably, specific amino
acids that are not conserved between these two proteins have critical roles in determining
their different functions, but how are these residues identified and distinguished from those
that are not functionally important? In addition, some amino acids that are conserved
among homologous DnaC proteins, which were identified by multiple sequence alignment
of twenty-eight homologues (Figure 2), are presumable responsible for the unique activities
of DnaC, but what are these unique activities? These issues underscore the limitation of
deducing the biological function of protein by relying only on bioinformatics analysis.
replacement of the chromosomal copy of the gene with the drug resistance cassette, after
which the excised copy of the chromosomal gene is lost. In both E. coli and S. cerevisiae, this
approach has been used in seeking to correlate a phenotype with genes of unknown
function, and to identify those that are essential for viability (Winzeler et al., 1999; Baba et
al., 2006). By either gene disruption or transposon mutagenesis, genetic mapping of the
mutation can be performed by inverse PCR where primers complementary to a sequence
near the ends of the drug resistance cassette or the transposon are used. This approach first
involves partially digesting the chromosomal DNA with a restriction enzyme followed by
ligation of the resulting fragments to form a collection of circular DNAs. DNA sequence
analysis of the amplified DNA with the primers described above followed by comparison of
the nucleotide sequence with the genomic DNA sequence can identify the site of the
disrupted gene, or the site of insertion of the transposon.
Fig. 3. DNA recombination between the chromosomal gene and homologous DNA at ends
of the drug resistance gene leads to replacement of the chromosomal copy of the gene with
the drug resistance cassette. The chromosomal gene, and homologous DNA at the ends of
the drug resistance cassette are indicated by the lighter shaded rectangles. The drug
resistance gene is indicated by the darker rectangles. The thin line represents chromosomal
DNA flanking the gene.
With a multicellular organism, a similar strategy that relies on homologous recombination is
used to delete a gene. The type of cell to introduce the deletion is usually an embryonic stem
cell so that the effect of deletion can be measured in the whole organism. Many eukaryotic
organisms have two complete sets of chromosomes. Because the process of homologous
recombination introduces the deletion mutation in one of the two pairs of chromosomes,
yielding a heterozygote, the presence of the wild type copy on the sister chromosome may
conceal the biological effect of the deletion. Thus, the ideal objective is to delete both copies
of a gene in order to measure the associated phenotype. To attempt to obtain an organism in
which both copies of a gene have been “knocked out,” the standard strategy is to mate
heterozygous individuals. By Mendelian genetics, one-fourth of the progeny should carry
the deletion in both copies of the gene. The drawback with the approach of deleting a gene
Understanding Protein Function -
The Disparity Between Bioinformatics and Molecular Methods 35
Fig. 4. Organization of DNA sequence motifs in the E. coli replication origin (oriC). Near the
left border are 13-mer motifs that are unwound by DnaA complexed to ATP. Sites
recognized by DnaA are the DnaA boxes (arrow), I-sites (warped oval), and τ-sites (warped
circle). Sites recognized by IHF (shaded rectangle), Fis (filled rectangle), and DNA adenine
methyltransferase (shaded circle) are also indicated.
Recent reviews describe the independent mechanisms that control the frequency of
initiation from this site (Nielsen and Lobner-Olesen, 2008; Katayama et al., 2010). In
Escherichia coli, the minimal oriC sequence of 245 base pairs contains DNA-binding sites for
many different proteins that either act directly in DNA replication, or modulate the
frequency of this process (reviewed in Leonard and Grimwade, 2009). One of them is DnaA,
which is the initiator of DNA replication, and has been placed in one of the clades of the
AAA+ superfamily via bioinformatics analysis (Koonin, 1992; Erzberger and Berger, 2006).
DnaA binds to a consensus 9 base pair sequence known as the DnaA box. There are five
DnaA boxes individually named R1 through R5 within oriC that are similar in sequence and
are recognized by DnaA (Leonard and Grimwade, 2009). In addition to these sites, DnaA
complexed to ATP specifically recognizes three I- sites and τ-sites in oriC, which leads to the
unwinding of three AT-rich 13-mer repeats located in the left half of oriC. Binding sites are
also present for IHF protein (integration host factor) and FIS protein (factor for inversion
stimulation). As these proteins induce bends in DNA, their apparent ability to modulate the
binding of DnaA to the respective sites in oriC may involve DNA bending. Additionally,
oriC carries 11 GATC sequences recognized by DNA adenine methyltransferase, and sites
Understanding Protein Function -
The Disparity Between Bioinformatics and Molecular Methods 37
recognized by IciA, Rob, and SeqA. The influence of IHF, FIS, IciA, Rob and SeqA proteins
on the initiation process is described in more detail in a review (Leonard and Grimwade,
2005).
At the initiation stage of DNA replication, the first step requires the binding of DnaA
molecules, each complexed to ATP, to the five DnaA boxes, I- and τ- sites of oriC. After
binding, DnaA unwinds the duplex DNA in the AT-rich region to form an intermediate
named the open complex. HU or IHF stimulates the formation of the open complex. In the
next step, the replicative helicase named DnaB becomes stably bound to the separated
strands of the open complex to form an intermediate named the prepriming complex. At
this juncture, DnaC must be complexed to DnaB for a single DnaB hexamer to load onto
each of the separated strands. DnaC protein must then dissociate from the complex in order
for DnaB to be active as a helicase. Following the loading of DnaB, this helicase enlarges the
unwound region of oriC, and then interacts with DnaG primase (Tougu and Marians, 1996).
This interaction between DnaB and DnaG primase, which synthesize primer RNAs that are
extended by DNA polymerase III holoenzyme during semi-conservative DNA replication,
marks the transition between the process of initiation and elongation stage of DNA
replication (McHenry, 2003; Corn and Berger, 2006; Langston et al., 2009). Replication fork
movement that is supported by DnaB helicase and assisted by a second helicase named Rep
proceeds bidirectionally around the chromosome until it reaches the terminus region (Guy
et al., 2009). The two progeny DNAs then segregate near opposite poles of the cell before
septum formation and cell division.
DnaC protein (27 kDa) is essential for cell viability because it is required during the
initiation stage of DNA replication (reviewed in Kornberg and Baker, 1992; Davey and
O'Donnell, 2003). DnaC is also required for DNA replication of the single stranded DNA of
phage φX174, and for many plasmids (e.g. pSC101, P1, R1). DnaC additionally acts to
resurrect collapsed replication forks that appear when a replication fork encounters a nick,
gap, double-stranded break, or modified bases in the parental DNA (Sandler et al., 1996).
This process of restarting a replication fork involves assembly of the replication restart
primosome that contains PriA, PriB, PriC, DnaT, DnaB, DnaC, and Rep protein (Sandler,
2005; Gabbai and Marians, 2010). The major roles of DnaC at oriC, at the replication origins
of the plasmids and bacteriophage described above, or in restarting collapsed replication
forks is to form a complex with DnaB, which is required to escort the helicase onto the DNA,
and then to depart. Since the discovery of the dnaC gene over 40 years ago (Carl, 1970), its
ongoing study by various laboratories using a variety of approaches continue to reveal new
aspects of the molecular mechanisms of DnaC in DNA replication.
Biochemical analysis combined with the X-ray crystallographic structure of the majority of
Aquifex aeolicus DnaC (residues 43 to the C-terminal residue at position 235) reveals that
DnaC protein consists of a smaller N-terminal domain that is responsible for binding to the
C- terminal face of DnaB helicase, and larger ATP-binding region of 190 amino acids (Figure
2; (Ludlam et al., 2001; Galletto et al., 2003; Mott et al., 2008)). Sequence comparison of
homologues of the dnaC gene classifies DnaC as a member of the AAA+ family of ATPases
(Koonin, 1992; Davey et al., 2002; Mott et al., 2008). However unlike other AAA+ proteins,
DnaC contains two additional α helices named the ISM motif (Initiator/loader–Specific
Motif) that directs the oligomerization of this protein into a right-handed helical filament
(Mott et al., 2008). In contrast, the majority AAA+ proteins lacking these α helices assemble
into a closed-ring. Phylogenetic analysis of the AAA+ domain reveals that DnaC is most
38 Computational Biology and Applied Bioinformatics
closely related to DnaA, suggesting that both proteins arose from a common ancestor
(Koonin, 1992). In support, the X-ray crystallographic structures of the ATPase region of
DnaA and DnaC are very similar (Erzberger et al., 2006; Mott et al., 2008).
For DnaC, ATP increases its affinity for single-stranded DNA, which stimulates its ATPase
activity (Davey et al., 2002; Biswas et al., 2004). Other results suggest that ATP stabilizes the
interaction of DnaC with DnaB in the DnaB-DnaC complex (Wahle et al., 1989; Allen and
Kornberg, 1991), which contradicts studies that support the contrary conclusion that ATP is
not necessary for DnaC to form a stable complex with DnaB (Davey et al., 2002; Galletto et
al., 2003; Biswas and Biswas-Fiss, 2006). As mutant DnaC proteins bearing amino acid
substitutions in the Walker A box are both defective in ATP binding and apparently fail to
interact with DnaB, the consequence is that these mutants cannot escort DnaB to oriC
(Ludlam et al., 2001; Davey et al., 2002). Hence, despite the ability of DnaB by itself to bind
to single-stranded DNA in vitro (LeBowitz and McMacken, 1986), DnaC is essential for
DnaB to become stably bound to the unwound region of oriC (Kobori and Kornberg, 1982;
Ludlam et al., 2001). The observation that DnaC complexed to ATP interacts with DnaA
raises the possibility that both proteins act jointly in helicase loading (Mott et al., 2008).
Together, these observations indicate that the ability of DnaC to bind to ATP is essential for
its function in DNA replication, but the paradox about the role of ATP binding and its
hydrolysis on the activity of DnaC and about the mechanism that leads to the dissociation of
DnaC from DnaB have been long-standing issues.
Fig. 5. Plasmid shuffle method. With an E. coli strain lacking the chromosomal copy of dnaC,
a plasmid carrying the wild type dnaC gene residing in the bacterial cell can provide for
dnaC function. Using a plasmid that depends on IPTG for its maintenance, the strain will not
survive in the absence of IPTG. An introduced plasmid that does not depend on IPTG for its
propagation can sustain viability in the absence of IPTG only if it encodes a functional dnaC
allele.
Understanding Protein Function -
The Disparity Between Bioinformatics and Molecular Methods 39
6. Conclusions
In summary, the combination of various experimental approaches on the study of DnaC
have led to insightful experiments that expand our understanding of the role of ATP
binding and its hydrolysis by DnaC during the initiation of DNA replication. Evidence
suggests that ATP hydrolysis by DnaC that leads to the dissociation of DnaC from DnaB
helicase is coupled with primer formation that requires an interaction between DnaG
primase and DnaB. Hence, these critical steps are involved in the transition from the process
of initiation to the elongation phase of DNA replication in E. coli.
This example on the molecular mechanism of DnaC protein is a focused study of one
protein and its interaction with other required proteins during the process of initiation of
DNA replication. One may consider this a form of vertical thinking. It contrasts with
bioinformatics approaches that yield large sets of data for proteins based on the DNA
sequences of genomes, and with microarray approaches that, for example, survey the
expression of genes and their regulation at the genome level under different conditions, or
identify interacting partners for a specific protein. The vast wealth of data from these global
approaches provide a different perspective on understanding the functions of sets of genes
or proteins and how they act in a network of biochemical pathways of the cell.
7. Acknowledgements
We thank members of our labs for discussions on the content and organization of this
chapter. This work was supported by a grant GM090063 from the National Institutes of
Health, and the Michigan Agricultural Station to JMK.
40 Computational Biology and Applied Bioinformatics
8. References
Allen, G. J. and A. Kornberg (1991). Fine balance in the regulation of DnaB helicase by DnaC
protein in replication in Escherichia coli. J Biol Chem 266(33): 22096-22101
Baba, T., T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K. A. Datsenko, M. Tomita,
B. L. Wanner and H. Mori (2006). Construction of Escherichia coli K-12 in-frame,
single-gene knockout mutants: the Keio collection. Mol Syst Biol 2: 1-11
Beyer, A. (1997). Sequence analysis of the AAA protein family. Protein Sci 6(10): 2043-58,
0961-8368 (Print), 0961-8368 (Linking)
Biswas, S. B. and E. E. Biswas-Fiss (2006). Quantitative analysis of binding of single-stranded
DNA by Escherichia coli DnaB helicase and the DnaB-DnaC complex. Biochemistry
45(38): 11505-11513
Biswas, S. B., S. Flowers and E. E. Biswas-Fiss (2004). Quantitative analysis of nucleotide
modulation of DNA binding by the DnaC protein of Escherichia coli. Biochem J 379:
553-562
Bochner, B. R. (2009). Global phenotypic characterization of bacteria. FEMS Microbiol Rev
33(1): 191-205, 0168-6445 (Print), 0168-6445 (Linking)
Carl, P. L. (1970). Escherichia coli mutants with temperature-sensitive synthesis of DNA. Mol
Gen Genet 109(2): 107-122
Carthew, R. W. and E. J. Sontheimer (2009). Origins and Mechanisms of miRNAs and
siRNAs. Cell 136(4): 642-55, 1097-4172 (Electronic), 0092-8674 (Linking)
Christensen, N. M., K. J. Oparka and J. Tilsner (2010). Advances in imaging RNA in plants.
Trends Plant Sci 15(4): 196-203, 1878-4372 (Electronic), 1360-1385 (Linking)
Corn, J. E. and J. M. Berger (2006). Regulation of bacterial priming and daughter strand
synthesis through helicase-primase interactions. Nucleic Acids Res 34(15): 4082-
4088
Davey, M. J., L. Fang, P. McInerney, R. E. Georgescu and M. O'Donnell (2002). The DnaC
helicase loader is a dual ATP/ADP switch protein. EMBO J 21(12): 3148-3159
Davey, M. J., D. Jeruzalmi, J. Kuriyan and M. O'Donnell (2002). Motors and switches: AAA+
machines within the replisome. Nat Rev Mol Cell Biol 3(11): 826-835
Davey, M. J. and M. O'Donnell (2003). Replicative helicase loaders: ring breakers and ring
makers. Curr Biol 13(15): R594-596
DePamphilis, M. L. and S. D. Bell (2010). Genome duplication, Garland Science/Taylor &
Francis Group, 9780415442060, London
Elbashir, S. M., J. Harborth, W. Lendeckel, A. Yalcin, K. Weber and T. Tuschl (2001).
Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian
cells. Nature 411(6836): 494-8, 0028-0836 (Print), 0028-0836 (Linking)
Erzberger, J. P. and J. M. Berger (2006). Evolutionary relationships and
structural mechanisms of AAA+ proteins. Annu Rev Biophys Biomol Struct 35: 93-
114
Erzberger, J. P., M. L. Mott and J. M. Berger (2006). Structural basis for ATP-dependent
DnaA assembly and replication-origin remodeling. Nat Struct Mol Biol 13(8): 676-
683
Understanding Protein Function -
The Disparity Between Bioinformatics and Molecular Methods 41
Fire, A., S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver and C. C. Mello (1998). Potent
and specific genetic interference by double-stranded RNA in Caenorhabditis elegans.
Nature 391(6669): 806-11, 0028-0836 (Print), 0028-0836 (Linking)
Fischer, S. E. (2010). Small RNA-mediated gene silencing pathways in C. elegans. Int J
Biochem Cell Biol 42(8): 1306-15, 1878-5875 (Electronic), 1357-2725 (Linking)
Gabbai, C. B. and K. J. Marians (2010). Recruitment to stalled replication forks of the PriA
DNA helicase and replisome-loading activities is essential for survival. DNA
Repair (Amst) 9(3): 202-209, 1568-7856 (Electronic), 1568-7856 (Linking)
Galletto, R., M. J. Jezewska and W. Bujalowski (2003). Interactions of the Escherichia coli
DnaB Helicase Hexamer with the Replication Factor the DnaC Protein. Effect of
Nucleotide Cofactors and the ssDNA on Protein-Protein Interactions and the
Topology of the Complex. J Mol Biol 329(3): 441-465
Ghildiyal, M. and P. D. Zamore (2009). Small silencing RNAs: an expanding universe. Nat
Rev Genet 10(2): 94-108, 1471-0064 (Electronic), 1471-0056 (Linking)
Guy, C. P., J. Atkinson, M. K. Gupta, A. A. Mahdi, E. J. Gwynn, C. J. Rudolph, P. B. Moon, I.
C. van Knippenberg, C. J. Cadman, M. S. Dillingham, R. G. Lloyd and P. McGlynn
(2009). Rep provides a second motor at the replisome to promote duplication of
protein-bound DNA. Mol Cell 36(4): 654-666, 1097-4164 (Electronic), 1097-2765
(Linking)
Hamilton, A. J. and D. C. Baulcombe (1999). A species of small antisense RNA in
posttranscriptional gene silencing in plants. Science 286(5441): 950-952, 0036-8075
(Print), 0036-8075 (Linking)
Hanson, P. I. and S. W. Whiteheart (2005). AAA+ proteins: have engine, will work. Nat Rev
Mol Cell Biol 6(7): 519-529
Hupert-Kocurek, K., J. M. Sage, M. Makowska-Grzyska and J. M. Kaguni (2007). Genetic
method to analyze essential genes of Escherichia coli. Appl Environ Microbiol 73(21):
7075-7082
Iyer, L. M., D. D. Leipe, E. V. Koonin and L. Aravind (2004). Evolutionary history and higher
order classification of AAA+ ATPases. J Struct Biol 146(1-2): 11-31, 1047-8477
(Print), 1047-8477 (Linking)
Katayama, T., S. Ozaki, K. Keyamura and K. Fujimitsu (2010). Regulation of the replication
cycle: conserved and diverse regulatory systems for DnaA and oriC. Nat Rev
Microbiol 8(3): 163-170, 1740-1534 (Electronic), 1740-1526 (Linking)
Ketting, R. F., T. H. Haverkamp, H. G. van Luenen and R. H. Plasterk (1999). Mut-7 of C.
elegans, required for transposon silencing and RNA interference, is a homolog of
Werner syndrome helicase and RNaseD. Cell 99(2): 133-141, 0092-8674 (Print), 0092-
8674 (Linking)
Kobori, J. A. and A. Kornberg (1982). The Escherichia coli dnaC gene product. II.
Purification, physical properties, and role in replication. J Biol Chem 257(22):
13763-13769
Koonin, E. V. (1992). DnaC protein contains a modified ATP-binding motif and belongs
to a novel family of ATPases including also DnaA. Nucleic Acids Res 20(8):
1997
42 Computational Biology and Applied Bioinformatics
Kornberg, A. and T. A. Baker (1992). DNA Replication Second Edition, W.H. Freeman and
Company, 9781891389443, New York
Krol, J., I. Loedige and W. Filipowicz (2010). The widespread regulation of microRNA
biogenesis, function and decay. Nat Rev Genet 11(9): 597-610, 1471-0064
(Electronic), 1471-0056 (Linking)
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K.
Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J.
Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J.
P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A.
Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D.
Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee,
N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R.
Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M.
Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A.
Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson,
L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T.
Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty,
T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J.
Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S.
Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E.
Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J.
Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R.
S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori,
T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J.
Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier,
C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K.
Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S.
Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A.
Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M.
Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, N.
Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A.
Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de
la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L.
Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C.
B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S.
R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y.
Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A.
Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf,
D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N.
Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E.
Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R.
Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S.
Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, P.
de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi and Y. J. Chen (2001).
Understanding Protein Function -
The Disparity Between Bioinformatics and Molecular Methods 43
Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921,
0028-0836 (Print), 0028-0836 (Linking)
Langston, L. D., C. Indiani and M. O'Donnell (2009). Whither the replisome: emerging
perspectives on the dynamic nature of the DNA replication machinery. Cell Cycle
8(17): 2686-2691, 1551-4005 (Electronic), 1551-4005 (Linking)
LeBowitz, J. H. and R. McMacken (1986). The Escherichia coli dnaB replication protein is a
DNA helicase. J Biol Chem 261(10): 4738-48,
Lee, D. G. and S. P. Bell (2000). ATPase switches controlling DNA replication initiation. Curr
Opin Cell Biol 12(3): 280-285, 0955-0674 (Print), 0955-0674 (Linking)
Leonard, A. C. and J. E. Grimwade (2005). Building a bacterial orisome: emergence of
new regulatory features for replication origin unwinding. Mol Microbiol 55(4): 978-
985
Leonard, A. C. and J. E. Grimwade (2009). Initiating chromosome replication in E. coli: it
makes sense to recycle. Genes Dev 23(10): 1145-1150,
Liu, J., C. L. Smith, D. DeRyckere, K. DeAngelis, G. S. Martin and J. M. Berger (2000).
Structure and function of Cdc6/Cdc18: implications for origin recognition and
checkpoint control. Mol Cell 6(3): 637-648
Liu, Q., T. A. Rand, S. Kalidas, F. Du, H. E. Kim, D. P. Smith and X. Wang (2003). R2D2, a
bridge between the initiation and effector steps of the Drosophila RNAi pathway.
Science 301(5641): 1921-1925, 1095-9203 (Electronic), 0036-8075 (Linking)
Ludlam, A. V., M. W. McNatt, K. M. Carr and J. M. Kaguni (2001). Essential amino acids of
Escherichia coli DnaC protein in an N-terminal domain interact with DnaB helicase. J
Biol Chem 276(29): 27345-27353
Lupas, A. N. and J. Martin (2002). AAA proteins. Curr Opin Struct Biol 12(6): 746-53, 0959-
440X (Print), 0959-440X (Linking)
Lupski, J. R., J. G. Reid, C. Gonzaga-Jauregui, D. Rio Deiros, D. C. Chen, L. Nazareth, M.
Bainbridge, H. Dinh, C. Jing, D. A. Wheeler, A. L. McGuire, F. Zhang, P.
Stankiewicz, J. J. Halperin, C. Yang, C. Gehman, D. Guo, R. K. Irikat, W. Tom, N. J.
Fantin, D. M. Muzny and R. A. Gibbs (2010). Whole-genome sequencing in a
patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 362(13): 1181-91, 1533-
4406 (Electronic), 0028-4793 (Linking)
Makowska-Grzyska, M. and J. M. Kaguni (2010). Primase Directs the Release of DnaC from
DnaB. Mol Cell 37(1): 90-101
Maxam, A. M. and W. Gilbert (1977). A new method for sequencing DNA. Proc Natl Acad
Sci U S A 74(2): 560-4, 0027-8424 (Print), 0027-8424 (Linking)
McHenry, C. S. (2003). Chromosomal replicases as asymmetric dimers: studies of subunit
arrangement and functional consequences. Mol Microbiol 49(5): 1157-1165
Moerman, D. G. and R. J. Barstead (2008). Towards a mutation in every gene in
Caenorhabditis elegans. Brief Funct Genomic Proteomic 7(3): 195-204, 1477-4062
(Electronic), 1473-9550 (Linking)
Mott, M. L., J. P. Erzberger, M. M. Coons and J. M. Berger (2008). Structural synergy and
molecular crosstalk between bacterial helicase loaders and replication initiators.
Cell 135(4): 623-634
44 Computational Biology and Applied Bioinformatics
1. Introduction
In multi-cellular organisms development from zygote to adult and adaptation to different
environmental stresses occur as cells acquire specialized roles by synthesizing proteins
necessary for each task. In eukaryotes the most commonly used mechanism for maintaining
cellular protein environment is transcriptional regulation of gene expression, by recruiting
required transcription factors at promoter regions. Owing to the importance of
transcriptional regulation, one of the main goals in the post-genomic era is to predict gene
expression regulation on the basis of presence of transcription factor (TF) binding sites in the
promoter regions. Genome wide knowledge of TF binding sites would be useful to build
transcriptional regulatory networks model that result in cell specific differentiation. In
eukaryotic genomes only a fraction (< 5%) of total genome codes for functional proteins or
RNA, while remaining DNA sequences consist of non-coding regulatory sequences, other
regions and sequences still with unknown functions.
Since the discovery of trans-acting factors in gene regulation by Jacob and Monads in lac
operon of E. coli, scientists had an interest in finding new transcription factors, their specific
recognition and binding sequences. In DNAse footprinting (or DNase protection assay);
transcription factor bound regions are protected from DNAse digestion, creating a
"footprint" in a sequencing gel. This methodology has resulted in identification of hundreds
of regulatory sequences. However, limitation of this methodology is that it requires the TF
and promoter sequence (100-300 bp) in purified form. Our knowledge of known
transcription factors is limited and recognition and binding sites are scattered over the
complete genome. Therefore, in spite of high degree of accuracy in prediction of TF binding
site, this methodology is not suitable for genome wide or across the genomes scanning.
Detection of TF binding sites through phylogenetic footprinting is gradually becoming
popular. It is based on the fact that random mutations are not easily accepted in functional
sequences, while they continuously keep on tinkering non functional sequences. Many
comparative genomics studies have revealed that during course of evolution regulatory
elements remain conserved while the non-coding DNA sequences keep on mutating. With
an ever increasing number of complete genome sequence from multiple organisms and
mRNA profiling through microarray and deep sequencing technologies, wealth of gene
expression data is being generated. This data can be used for identification of regulatory
48 Computational Biology and Applied Bioinformatics
elements through intra and inter species comparative genomics. However, the identification
of TF binding sites in promoters still remains one of the major challenges in bioinformatics
due to following reasons:
1. Very short (5-15 nt) size of regulatory motifs that also differ in their number of
occurrence and position on DNA strands with respect to transcription start site. This
wide distribution of short TF binding sites makes their identification with commonly
used sequence alignment programmes challenging.
2. A very high degree of conservation between two closely related species generally
shows no clear signature of highly conserved motifs.
3. Absence of significant similarities between highly diverse species hinders the alignment
of functional sequences.
4. Sometimes, functional conservation of gene expression is not sufficient to assure the
evolutionary preservation of corresponding cis-regulatory elements (Pennacchio and
Rubin, 2001).
5. Transcription factors binding sites are often degenerate.
In order to overcome these challenges, in the last few years novel approaches have been
developed that integrate comparative, structural, and functional genomics with the
computational algorithms. Such interdisciplinary efforts have increased the sensitivity of
computational programs to find composite regulatory elements.
Here, we review different computational approaches for identification of regulatory
elements in promoter region with seed specific legumin gene promoter analysis as an
example. Based on the type of DNA sequence information the motif finding algorithms are
classified into three major classes: (1) methods that use promoter sequences from co
regulated genes from a single genome, (2) methods that use orthologous promoter
sequences of a single gene from multiple species, also known as phylogenetic footprinting
and (3) methods that use promoter sequences of co regulated genes as well as phylogenetic
footprinting (Das and Dai, 2007).
where j represents position in the substring, sj is the symbol at position j in the substring,
and mα,j is the score in row α, column j of the matrix.
Although matrix representation appears superior, the solution space for PWMs and PSSMs,
which consists of 4l real numbers is infinite in size, and there are many local optimal
matrices, thus, algorithms generally either produce a suboptimal motif matrix or take too
long to run when the motif is longer than 10 bp (Francis and Henry, 2008).
3.1.1 TRANSFAC
TRANSFAC is the largest repository of transcription factors binding sites. TRANSFAC
(TRANSFAC 7.0, 2005) web accessible database consists of 6,133 factors with 7,915 sites,
while professional version (TRANSFAC 2008.3) consists of 11,683 factors with 30,227 sites.
TRANSFAC database is composed of six tables SITE, GENE, FACTOR, CELL, CLASS and
50 Computational Biology and Applied Bioinformatics
MATRIX. GENE table gives a short explanation of the gene where a site (or group of sites)
belongs to; FACTOR table describes the proteins binding to these sites. CELL gives brief
information about the cellular source of proteins that have been shown to interact with the
sites. CLASS contains some background information about the transcription factor classes,
while the MATRIX table gives nucleotide distribution matrices for the binding sites of
transcription factors. This database is most frequently used as reference for TFB sites as well
as for development of new algorithms. However, new users find it difficult to access the
database because it requires search terms to be entered manually. There is no criterion to
select the organism, desired gene or TF from a list, so web interface is not user friendly.
Other web tools such as TF search and Signal Scan overcome this limitation to certain extent.
3.1.3 TRRD
The transcription regulatory region database (TRRD) is a database of transcription
regulatory regions of the eukaryotic genome. The TRRD database contains three
interconnected tables: TRRDGENES (description of the genes as a whole), TRRDSITES
(description of the sites), and TRRDBIB (references). The current version, TRRD 3.5,
comprises of the description of 427 genes, 607 regulatory units (promoters, enhancers, and
silencers), and 2147 transcription factor binding sites. The TRRDGENES database compiles
the data on human (185 entries), mouse (126), rat (69), chicken (29), and other genes.
Transcription
Developmental/Environmental
factor binding Position
stimulus Sequence
site
TATA Box -33 tcccTATAaataa
Core promoter
Cat Box -49 gCCAAc
G Box -66 tgACGgtgt
Stress responsive ABRE -76 acaccttctttgACTGtccatccttc
ABI4 -245 CACCg
W Box -72 cttctTTGAcgtgtcca
Pathogen defense
TCA gAGAAgagaa
Light Response I box -302 gATATga
WUN -348 tAATTacac
Wound specific
TCA -646 gAGAAgagaa
Legumin -118 tccatacCCATgcaagctgaagaatgtc
Opaque-2 -348 TAATtacacatatttta
Seed Specific
Prolamine box -385 TTaaaTGTAAAAgtAa
AAGAA-motif -294 agaAAGAa
Table 1. In silico analysis of pigeonpea legumin gene promoter for identification of
regulatory elements. Database search reveals that it consist of regulatory elements that can
direct its activation under different envirnmental conditions and developmental stages.
In Silico Identification of Regulatory Elements in Promoters 51
3.1.4 PlantCARE
PlantCARE is database of plant specific cis-Acting regulatory elements in the promoter
regions (Lescot et al., 2002). It generates a sequence analysis output on a dynamic webpage,
on which TF binding sites are highlighted in the input sequence. The database can be
queried on names of transcription factor (TF) sites, motif sequence, function, species, cell
type, gene, TF and literature references. Information regarding TF site, organism, motif
position, strand, core similarity, matrix similarity, motif sequence and function are listed
whereas the potential sites are mapped on the query sequence.
3.1.5 PLACE
PLACE is another database of plant cis-acting regulatory elements extracted from published
reports (Higo et al., 1999). It also includes variations in the motifs in different genes or plant
species. PLACE also includes non-plant cis-elements data that may have homologues with
plant. PLACE database also provides brief description of each motif and links to
publications.
3.1.6 RegulonDB
RegulonDB is a comprehensive database of gene regulation and interaction in E. coli. It
consists of data on almost every aspect of gene regulation such as terminators, promoters,
TF binding sites, active and inactive transcription factor conformations, matrices alignments,
transcription units, operons, regulatory network interactions, ribosome binding sites (rbs),
growth conditions, gene product and small RNAs.
3.1.7 ABS
ABS is a database of known TF binding sites identified in promoters of orthologous
vertebrate genes. It has 650 annotated and experimental validated binding sites from 68
transcription factors and 100 orthologous target genes in human, mouse, rat and chicken
genome sequences. Although it’s a simple and easy-to-use web interface for data retrieval
but it does not facilitate either analysis of new promoter sequence or mapping user defined
motif in the promoter.
3.1.8 MatInspector
MatInspector identifies cis-acting regulatory elements in nucleotide sequences using
library of weight matrices (Cartharius et al., 2005). It is based on novel matrix family
concept, optimized thresholds, and comparative analysis that overcome the major limitation
of large number of redundant binding sites predicted by other programs. Thus it increases
the sensitivity of reducing false positive predictions. MatInspector also allows integration of
output with other sequence analysis programs e.g. DiAlignTF, FrameWorker,
SequenceShaper, for an in-depth promoter analysis and designing regulatory sequences.
MatInspector library contains 634 matrices representing one of the largest libraries available
for public searches.
3.1.9 JASPAR
JASPAR is the another open access database that compete with the commercial TF binding
site databases such as TRANSFAC (Portales-Casamar et al., 2009). The latest release has a
52 Computational Biology and Applied Bioinformatics
3.1.11 MAPPER
MAPPER stands for Multi-genome Analysis of Positions and Patterns of Elements of
Regulation It is a platform for the computational identification of TF binding sites in
multiple genomes (Marinescu et al., 2005). The MAPPER consists of three modules, the
MAPPER database, the Search Engine, and rSNPs and combines TRANSFAC and JASPAR
data. However, MAPPER database is limited to TFBSs found only in the promoter of genes
from the human, mouse and D.melanogaster genomes.
3.1.12 Stubb
Like Cister, Stubb also uses hidden Markov models (HMM) to obtain a statistically
significant score for modules (Sinha et al., 2006). STUBB is more suitable for finding
modules over genomic scales with small set of transcription factors whose binding sites are
known. Stubb differs from MAPPER in that the application of latter is limited to binding
sites of a single given motif in an input sequence.
3.1.13 Clover
Clover is another program for identifying functional sites in DNA sequences. It take a set of
DNA sequences that share a common function, compares them to a library of sequence
motifs (e.g. transcription factor binding patterns), and identifies which, if any, of the motifs
are statistically overrepresented in the sequence set (Frith et al., 2004). It requires two input
files one for sequences in fasta format and another for sequence motif. Clover provides
JASPAR core collection of TF binding sites that can be converted to clover format. Clover is
also available as standalone application for windows, Linux as well as Mac operating
systems.
3.1.14 RegSite
Regsite consists of plant specific largest repository of transcription factor binding sites.
Current RegSite release contains 1816 entries. It is used by transcription start site prediction
programs (Sinha et al., 2006).
In Silico Identification of Regulatory Elements in Promoters 53
3.1.15 JPREdictor
JPREdictor is a JAVA based cis-regulatory TF binding site prediction program (Fiedler and
Rehmsmeier, 2006). The JPREdictor can use different types of motifs: Sequence Motifs,
Regular Expression Motifs, PSPMs as well as PSSMs and the complex motif type
(MultiMotifs). This tool can be used for the prediction of cis-regulatory elements on a
genome-wide scale.
for comparison are too closely related. If the species are too distantly related, it is difficult to
find an accurate alignment. It requires computational tools that bypass the requirement of
sequence alignment completely and have the capabilities to identify short and scattered
conserved regions.
3.2.1.2 MEME, Consensus, Gibbs sampler, AlignAce
In cases where multiple alignment algorithms fails, motif finding algorithms such as MEME,
Consensus and Gibbs sampler have been used (Fig. 2). The feasibility of using comparative
DNA sequence analysis to identify functional sequences in the genome of S. cerevisiae, with
the goal of identifying regulatory sequences and sequences specifying nonprotein coding
RNAs was investigated (Cliften et al., 2001). It was found that most of the DNA sequences of
the closely related Saccharomyces species aligned to S.cerevisiae sequences and known
promoter regions were conserved in the alignments. Pattern search algorithms like
CONSENSUS (Hertz et al., 1990), Gibbs sampling (Lawrence et al., 1993) and AlignAce
(Roth et al., 1998) were useful for identifying known regulatory sequence elements in the
promoters, where they are conserved through the most diverged Saccharomyces species.
Gibbs sampler was used for motif finding using phylogenetic footprinting in proteobacterial
genomes (McCue et al., 2001). These programs employ two approaches for motif finding.
One approach is to employ a training set of transcription factor binding sites and a scoring
scheme to evaluate predictions. The scoring scheme is often based on information theory
and the training set is used to empirically determine a score threshold for reporting of the
predicted transcription factor binding sites. The second method relies on a rigorous
statistical analysis of the predictions, based upon modeled assumptions. The statistical
significance of a sequence match to a motif can be accessed through the determination of p-
value. P-value is the probability of observing a match with a score as good or better in a
randomly generated search space of identical size and nucleotide composition. The smaller
the p-value, the lesser the probability that the match is due to chance alone. Since the motif
finding algorithms assume the input sequences to be independent, therefore, they are
limited by the fact that the data sets containing a mixture of some closely related species will
have an unduly high weight in the results of motifs reported.
Multiple genome sequences were compared that are as optimally diverged as possible in
Saccharomyces genomes. Phylogenetic footprints were searched among the genome
sequences of six Saccharomyces species using the sequence alignment tool CLUSTAL W and
many statistically significant conserved sequence motifs (Cliften et al., 2003) were found.
3.2.1.3 Footprinter
This promising novel algorithm was developed to overcome the limitations imposed by
motif finding algorithms. This algorithm identifies the most conserved motifs among the
input sequences as measured by a parsimony score on the underlying phylogenetic tree
(Blanchette and Tompa, 2002). It uses dynamic programming to find most parsimonious k-
mer from each of the input sequences where k is the motif length. In general, the algorithm
selects motifs that are characterized by a minimal number of mismatches and are conserved
over long evolutionary distances. Furthermore, the motifs should not have undergone
independent losses in multiple branches. In other words, the motif should be present in the
sequences of subsequent taxa along a branch. The algorithm, based on dynamic
programming, proceeds from the leaves of the phylogenetic tree to its root and seeks for
motifs of a user-defined length with a minimum number of mismatches. Moreover, the
algorithm allows a higher number of mismatches for those sequences that span a greater
evolutionary distance. Motifs that are lost along a branch of the tree are assigned an
additional cost because it is assumed that multiple independent losses are unlikely in
evolution. To compensate for spurious hits, statistical significance is calculated based on a
random set of sequences in which no motifs occur.
3.2.1.4 CONREAL
CONREAL (Conserved Regulatory Elements Anchored Alignment Algorithm) is another
motif finding algorithm based on phylogenetic footprinting (Berezikov et al., 2005). This
algorithm uses potential motifs as represented by positional weight matrices (81 vertebrate
matrices form JASPAR database and 546 matrices from TRANSFAC database) to establish
anchors between orthologous sequences and to guide promoter sequence alignment.
Comparison of the performance of CONREAL with the global alignment programs LAGAN
and AVID using a reference data set, shows that CONREAL performs equally well for
closely related species like rodents and human, and has a clear added value for aligning
promoter elements of more divergent species like human and fish, as it identifies conserved
transcription-factor binding sites that are not found by other methods.
3.2.1.5 PHYLONET
The PHYLONET computational approach identifies conserved regulatory motifs directly
from whole genome sequences of related species without reliance on additional information
was developed by (Wang and Stormo, 2005). The major steps involved are: i) construction of
phylogenetic profiles for each promoter , ii) searching through the entire profile space of all
the promoters in the genome to identify conserved motifs and the promoters that contain
them using algorithm like BLAST, iii) determination of statistical significance of motifs
(Karlin and Altschul, 1990). By comparing promoters using phylogenetic profiles (multiple
sequence alignments of orthologous promoters) rather than individual sequences, together
with the application of modified Karlin– Altschul statistics, they readily distinguished
biologically relevant motifs from background noise. When applied to 3524 Saccharomyces
cerevisiae promoters with Saccharomyces mikatae, Saccharomyces kudriavzevii, and Saccharomyces
bayanus sequences as references PHYLONET identified 296 statistically significant motifs
with a sensitivity of >90% for known transcription factor binding sites. The specificity of the
predictions appears very high because most predicted gene clusters have additional
supporting evidence, such as enrichment for a specific function, in vivo binding by a known
TF, or similar expression patterns.
56 Computational Biology and Applied Bioinformatics
3.3.2 Seqmotifs
Seqmotifs is a suite of web based programs to find regulatory motifs in coregulated genes of
both prokaryotes and eukaryotes. In this suite BioProspector (Liu et al., 2001) is used for
finding regulatory motifs in prokaryote or lower eukaryote sequences while
CompareProspector(Liu et al., 2002) is used for higher eukaryotes. Another program Mdscan
(Liu et al., 2002) is used for finding protein-DNA interaction sites from ChIP-on-chip targets.
These programs analyze a group of sequences of coregulated genes so they may share
common regulatory motifs and output a list of putative motifs as position-specific probability
matrices, the individual sites used to construct the motifs, and the location of each site on the
input sequences. CompareProspector has been used for identification of transcription factors
Mef2, My, Srf, and Sp1 motifs from a human-muscle-specific co regulated genes. Additionally
in a C. elegans–C briggsae comparison, CompareProspector found the PHA-4 motif and the
UNC-86 motif.(Liu et al., 2004) Another C. Elegans CompareProspector analysis showed that
intestine genes have GATA transcription factor binding motif that was latter experimentally
validated (Pauli et al., 2006).
3.3.3 RSAT
The Regulatory Sequence Analysis Tools (RSAT) is an integrated online tool to analyze
regulatory sequences in co regulated genes (Thomas-Chollier et al., 2008). The only input
In Silico Identification of Regulatory Elements in Promoters 57
3.3.4 TFM
TFM (Transcription Factor Matrices) is a software suite for identifying and analyzing
transcription factor binding sites in DNA sequences. It consists of TFM-Explorer, TFM-Scan,
TFM-Pvalue and TFM-Cluster.
TFM-Explorer (Transcription Factor Matrix Explorer) proceeds in two steps: scans sequences
for detecting all potential TF binding sites, using JASPAR or TRANSFAC and extracts
significant transcription factor.
TFM-Scan is a program dedicated to the location of large sets of putative transcription factor
binding sites on a DNA sequence. It uses Position Weight Matrices such as those available in
the Transfac or JASPAR databases. The program takes as input a set of matrices and a DNA
sequence. It computes all occurrences of matrices on the sequence for a given P-value
threshold. The algorithm is very fast and allows for large-scale analysis. TFM-Scan is also
able to cluster similar matrices and similar occurrences
TFM-Pvalue is a software suite providing tools for computing the score threshold associated
to a given P-value and the P-value associated to a given score threshold. It uses Position
Weight Matrices such as those available in the Transfac or JASPAR databases. The program
takes as input a matrix, the background probabilities for the letters of the DNA alphabet and
a score or a P-value.
3.3.5 rVISTA
rVISTA (regulatory VISTA) combines searching the major transcription binding site
database TRANSFAC Professional from Biobase with a comparative sequence analysis and
this procedure reduced the number of predicted transcription factor binding sites by several
orders of magnitude (Loots and Ovcharenko, 2004). It can be used directly or through links
in mVISTA, Genome VISTA, or VISTA Browser. Human and mouse sequences are aligned
using the global alignment program AVID (Bray et al., 2003).
3.3.6 Mulan
Mulan brings together several novel algorithms: the TBA multi-aligner program for rapid
identification of local sequence conservation and the multiTF program for detecting
evolutionarily conserved transcription factor binding sites in multiple alignments. In
addition, Mulan supports two-way communication with the GALA database; alignments of
multiple species dynamically generated in GALA can be viewed in Mulan, and conserved
transcription factor binding sites identified with Mulan/multiTF can be integrated and
overlaid with extensive genome annotation data using GALA. Local multiple alignments
58 Computational Biology and Applied Bioinformatics
3.3.7 MotifVoter
MotifVoter is a variance based ensemble method for discovery of binding sites. It uses 10
most commonly used individual basic motif finders as its component (Wijaya et al., 2008).
AlignACE (Hughes et al., 2000), MEME (Bailey and Elkan, 1994; Bailey et al., 2006),
ANNSpec, Mitra, BioProspector, MotifSampler, Improbizer, SPACE, MDScan and
Weeder. All programs can be selected individually or collectively. Though the existing
ensemble methods overall perform better than stand-alone motif finders, the improvement
gained is not substantial. These methods do not fully exploit the information obtained from
the results of individual finders, resulting in minor improvement in sensitivity and poor
precision.
3.3.8 ConSite
ConSite is a, web-based tool for finding cis-regulatory elements in genomic sequences
(Sandelin et al., 2004). Two genomic sequences submitted for analysis are aligned by ORCA
method. Alternatively, prealigned sequences can be submitted in ClustalW, MSF (GCG),
Fasta or Pair wise BLAST format. For analysis Transcription factors can be selected on the
basis of species, name, domain or user defined matrix (raw counts matrix or position weight
matrix). Predictions are based on the integration of binding site prediction generated with
high-quality transcription factor models and cross-species comparison filtering
(phylogenetic footprinting). ConSite (Sandelin et al., 2004) is based on the JASPAR database
(Portales-Casamar et al., 2009). By incorporating evolutionary constraints, selectivity is
increased by an order of magnitude as compared to single sequence analysis. ConSite offers
several unique features, including an interactive expert system for retrieving orthologous
regulatory sequences.
3.3.9 OPOSSUM
OPOSSUM identifies statistically over-represented, conserved TFBSs in the promoters of co-
expressed genes (Ho Sui et al., 2005). OPOSSUM integrates a precomputed database of
predicted, conserved TFBSs, derived from phylogenetic footprinting and TFBS detection
algorithms, with statistical methods for calculating overrepresentation. The background
data set was compiled by identifying all strict one-to-one human/mouse orthologues from
the Ensemble database. These orthologues were then aligned using ORCA, a pair-wise DNA
alignment program. The conserved non-coding regions were identified. The conserved
regions which fell within 5000 nucleotides upstream and downstream of the transcription
start site (TSS) were then scanned for TF sites using the position weight matrices (PWMs)
from the JASPAR database (Portales-Casamar et al., 2009). These TF sites were stored in the
OPOSSUM database and comprise the background set.
In Silico Identification of Regulatory Elements in Promoters 59
3.3.10 TOUCAN2
TOUCAN 2 is an operating system independent, open source, JAVA based workbench for
regulatory sequence analysis (Aerts et al., 2004). It can be used for detection of significant
transcription factor binding sites from comparative genomics data or for detection of
combinations of binding sites in sets of co expressed/co regulated genes. It tightly integrates
with Ensemble and EMBL for retrieval of sequences data. TOUCAN provides options to
align sequences with different specialized algorithms viz AVID (Bray et al., 2003), LAGAN
(Brudno et al., 2003), or BLASTZ. MotifScanner algorithm is used to search occurrence of
sites of transcription factors by using libraries of position weight matrices from TRANSFAC
6 (Matys et al., 2003), JASPAR, PLANTCARE (Lescot et al., 2002), SCPD and others. Motif
Sampler can be used for detection of over-represented motifs. More significantly TOUCAN
provides an option to select cis-regulatory modules using the ModuleSearch In essence
TOUCAN 2 provides one of the best integration of different algorithms for identification of
cis regulatory elements.
3.3.11 WebMOTIFS
WebMOTIFS web server combines TAMO and THEME tools for identification of conserved
motifs in co regulated genes (Romer et al., 2007). TAMO combines results from four motif
discovery programs viz AlignACE, MDscan, MEME, and Weeder, followed by clustering of
results (Gordon et al., 2005). Subsequently Bayesian motif analysis of known motifs is done
by THEME. Thus it integrates de novo motif discovery programs with Bayesian approaches
to identify the most significant motifs. However, current version of Web MOTIFS supports
motif discovery only for S. cerevisiae, M. musculus, and H. sapiens genomes.
3.3.12 Pscan
Pscan is a software tool that scans a set of sequences (e.g. promoters) from co-regulated or
co-expressed genes with motifs describing the binding specificity of known transcription
factors (Zambelli et al, 2009). It assesses which motifs are significantly over or
underrepresented, providing thus hints on which transcription factors could be common
regulators of the genes studied, together with the location of their candidate binding sites in
the sequences. Pscan does not resort to comparisons with orthologous sequences and
experimental results show that it compares favorably to other tools for the same task in
terms of false positive predictions and computation time.
confirmed that there promoters direct the seed specific expression and subsequently these
seed storage gene promoters were used for developing transgenic plants for expressing
different genes of interest. Earlier we isolated legumin gene promoter from pigeon pea
(Cajanus cajan) and identified regulatory elements in its promoter region(Jaiswal et al., 2007).
Sequence analysis with PLACE, PLANTCARE, MATINSPECTOR and TRANSFC shows
that legumin promoter not only consist of regulatory elements for seed specific expression
but also have elements that are present in promoters of other genes involved in pathogen
defense, abiotic stress, light response and wounding (Table 1). Our pervious study also
confirmed that legumin promoter in expressed in the developing seedling as well (Jaiswal et
al., 2007). Recent studies have shown that these promoters are expressed in non seed tissues
as well (Zakharov et al., 2004) and play a role in bruchid (insect) resistance (Sales et al.,
2000).
In such a scenario where seed storage protein performs an additional task of pathogen
defense its prompter must have responsive elements to such stresses. In fact legumin
promoter consists of transcription factor binding site for wounding, a signal of insect attack
and pathogen defense (Table 1). Since promoter sequences are available for legumin
promoter from different species it becomes a good system for identification novel regulatory
elements in these promoter. Phylogenetic footprinting analysis reveals presentence of
another conserved motif 19 base pair downstream to legumin box (Fig. 1), named L-19 box
(Jaiswal et al., 2007). Further MEME analysis shows that in addition to four conserved
blocks present for TATA box, G box, Legumin box and L-19 box, there are other conserved,
non overlapping sequence blocks are present that were not revealed by multiple sequence
alignment based phylogenetic footprinting (Fig. 2).
5. Conclusion
With the critical role of cis-regulatory elements in differentiation of specific cells leading to
growth, development and survival of an organism, scientists have a great interest in their
identification and characterization. However, due to the limited availability of known
transcription factors identification of the corresponding regulatory elements through
conventional DNA-protein interaction techniques is challenging. Rapid increase in number
of complete genome sequences, identification of co-regulated genes through microarray
technology with available electronic storage and computational power has put before the
scientific community a challenge to integrate these advancing technology and develop
computational program to identify these short but functionally important regulatory
sequences. Although some progress has been made in this direction and databases like
TRANSFC and others store libraries of transcription factor binding sites. However, there are
limitations primarily because publicly available libraries are very small and complete
datasets are not freely available. Secondly, because of their very small size there is certain
degree of redundancy in binding and therefore the chances of false prediction are very high.
These limitations have been overcome to some extent by databases like JASPAR that are
freely available and have a collection of regulatory elements large enough to compete with
the commercially available datasets. Another concern with cis-acting regulatory elements is
that the data pool of these functional non coding transcription factor binding sites is very
small (a few thousands), compared with the fact that thousand of genes are expressed in a
cell at any point of time and every gene is transcribed by a combination of minimum 5-10
transcription factors. Phylogenetic footprinting, has enormous potential in identifying new
In Silico Identification of Regulatory Elements in Promoters 61
regulatory elements and completing the gene network map. Although routine sequence
alignment programs such as clustalW fail to align short conserved sequences in a
background of hyper variable sequences, more sophisticated sequence alignment programs
have been specially developed for identification of conserved regulatory elements. These
programs such as CONREAL uses available transcription factor binding site data to align
the two sequences that decreases the chances of missing a regulatory site considerably.
Moreover, other approaches such as MEME altogether abandons the presence of sequence
alignment and directly identifies the rare conserved blocks even if they have jumbled up to
the complementary strand. With the increasing sophistication and accuracy of motif finding
programs and available genome sequences it can be assumed that knowledge of these
regulatory sequences will definitely increase (Table 2). Once we have sufficient data it can
be used for development of synthetic promoters with desired expression patterns (Fig. 3).
Fruit
Specific
Wound
Specific
Seed Specific
Construction of Synthetic
Promotor
Gene
Gene
6. References
Aerts, S., Van Loo, P., Thijs, G., Mayer, H., de Martin, R., Moreau, Y., and De Moor, B.
(2004). TOUCAN 2: the all-inclusive open source workbench for regulatory
sequence analysis. Nucleic Acids Research 33, W393-W396.
In Silico Identification of Regulatory Elements in Promoters 63
Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to
discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36.
Bailey, T.L., Williams, N., Misleh, C., and Li, W.W. (2006). MEME: discovering and
analyzing DNA and protein sequence motifs. Nucleic Acids Research 34, W369-
W373.
Berezikov, E., Guryev, V., and Cuppen, E. (2005). CONREAL web server: identification and
visualization of conserved transcription factor binding sites. Nucleic Acids
Research 33, W447-W450.
Blanchette, M., and Tompa, M. (2002). Discovery of Regulatory Elements by a
Computational Method for Phylogenetic Footprinting. Genome Research 12, 739-
748.
Bray, N., Dubchak, I., and Pachter, L. (2003). AVID: A Global Alignment Program. Genome
Research 13, 97-102.
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Program, N.C.S., Green, E.D.,
Sidow, A., and Batzoglou, S. (2003). LAGAN and Multi-LAGAN: Efficient Tools for
Large-Scale Multiple Alignment of Genomic DNA. Genome Research 13, 721-731.
Buhler, J., and Tompa, M. (2002). Finding motifs using random projections. J Comput Biol 9,
225-242.
Carmack, C.S., McCue, L., Newberg, L., and Lawrence, C. (2007). PhyloScan: identification
of transcription factor binding sites using cross-species evidence. Algorithms for
Molecular Biology 2, 1.
Cartharius, K., Frech, K., Grote, K., Klocke, B., Haltmeier, M., Klingenhoff, A., Frisch, M.,
Bayerlein, M., and Werner, T. (2005). MatInspector and beyond: promoter analysis
based on transcription factor binding sites. Bioinformatics 21, 2933-2942.
Che, D., Jensen, S., Cai, L., and Liu, J.S. (2005). BEST: Binding-site Estimation Suite of Tools.
Bioinformatics 21, 2909-2911.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R.,
Cohen, B.A., and Johnston, M. (2003). Finding Functional Features in
Saccharomyces Genomes by Phylogenetic Footprinting. Science.
Cliften, P.F., Hillier, L.W., Fulton, L., Graves, T., Miner, T., Gish, W.R., Waterston, R.H., and
Johnston, M. (2001). Surveying Saccharomyces Genomes to Identify Functional
Elements by Comparative DNA Sequence Analysis. Genome Research 11, 1175-
1186.
Das, M., and Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC
Bioinformatics 8, S21.
Fiedler, T., and Rehmsmeier, M. (2006). jPREdictor: a versatile tool for the prediction of cis-
regulatory elements. Nucleic Acids Research 34, W546-W550.
Francis, C., and Henry, C.M.L. (2008). DNA Motif Representation with Nucleotide
Dependency. IEEE/ACM Trans. Comput. Biol. Bioinformatics 5, 110-119.
Frith, M.C., Hansen, U., and Weng, Z. (2001). Detection of cis -element clusters in higher
eukaryotic DNA. Bioinformatics 17, 878-889.
Frith, M.C., Fu, Y., Yu, L., Chen, J.F., Hansen, U., and Weng, Z. (2004). Detection of
functional DNA motifs via statistical over†representation. Nucleic Acids
Research 32, 1372-1381.
64 Computational Biology and Applied Bioinformatics
Gordon, D.B., Nekludova, L., McCallum, S., and Fraenkel, E. (2005). TAMO: a flexible,
object-oriented framework for analyzing transcriptional regulation using DNA-
sequence motifs. Bioinformatics 21, 3164-3165.
Henry, C.M.L., and Fracis, Y.L.C. (2006). Discovering DNA Motifs with Nucleotide
Dependency, Y.L.C. Francis, ed, pp. 70-80.
Hertz, G.Z., and Stormo, G.D. (1999). Identifying DNA and protein patterns with
statistically significant alignments of multiple sequences. Bioinformatics 15, 563-
577.
Hertz, G.Z., Hartzell, G.W., and Stormo, G.D. (1990). Identification of Consensus Patterns in
Unaligned DNA Sequences Known to be Functionally Related. Comput Appl Biosci
6, 81 - 92.
Higo, K., Ugawa, Y., Iwamoto, M., and Korenaga, T. (1999). Plant cis-acting regulatory DNA
elements (PLACE) database: 1999. Nucleic Acids Research 27, 297-300.
Ho Sui, S.J., Mortimer, J.R., Arenillas, D.J., Brumm, J., Walsh, C.J., Kennedy, B.P., and
Wasserman, W.W. (2005). oPOSSUM: identification of over-represented
transcription factor binding sites in co-expressed genes. Nucleic Acids Research 33,
3154-3164.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. (2000). Computational
identification of Cis-regulatory elements associated with groups of functionally
related genes in Saccharomyces cerevisiae. Journal of Molecular Biology 296, 1205-
1214.
Jaiswal, R., Nain, V., Abdin, M.Z., and Kumar, P.A. (2007). Isolation of pigeon pea (Cajanus
cajan L.) legumin gene promoter and identification of conserved regulatory
elements using tools of bioinformatics. Indian Journal of experimental Biology 6,
495-503.
Jensen, S.T., and Liu, J.S. (2004). BioOptimizer: a Bayesian scoring function approach to
motif discovery. Bioinformatics 20, 1557-1564.
Karlin, S., and Altschul, S.F. (1990). Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proceedings of the
National Academy of Sciences 87, 2264-2268.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C.
(1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple
alignment. Science 262, 208-214.
Lescot, M., Déhais, P., Thijs, G., Marchal, K., Moreau, Y., Van de Peer, Y., RouzÃ, P., and
Rombauts, S. (2002). PlantCARE, a database of plant cis-acting regulatory elements
and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids
Research 30, 325-327.
Liu, X., Brutlag, D.L., and Liu, J.S. (2001). BioProspector: discovering conserved DNA motifs
in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127-
138.
Liu, X.S., Brutlag, D.L., and Liu, J.S. (2002). An algorithm for finding protein-DNA binding
sites with applications to chromatin-immunoprecipitation microarray experiments.
Nat Biotech 20, 835-839.
Liu, Y., Liu, X.S., Wei, L., Altman, R.B., and Batzoglou, S. (2004). Eukaryotic Regulatory
Element Conservation Analysis and Identification Using Comparative Genomics.
Genome Research 14, 451-458.
In Silico Identification of Regulatory Elements in Promoters 65
Loots, G.G., and Ovcharenko, I. (2004). rVISTA 2.0: evolutionary analysis of transcription
factor binding sites. Nucleic Acids Research 32, W217-W221.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M., and
Frazer, K.A. (2000). Identification of a Coordinate Regulator of Interleukins 4, 13,
and 5 by Cross-Species Sequence Comparisons. Science 288, 136-140.
Marinescu, V., Kohane, I., and Riva, A. (2005). MAPPER: a search engine for the
computational identification of putative transcription factor binding sites in
multiple genomes. BMC Bioinformatics 6, 79.
Matys, V., Fricke, E., Geffers, R., Gößling, E., Haubrock, M., Hehl, R., Hornischer, K.,
Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov,
B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S.,
and Wingender, E. (2003). TRANSFAC®: transcriptional regulation, from patterns
to profiles. Nucleic Acids Research 31, 374-378.
Pauli, F., Liu, Y., Kim, Y.A., Chen, P.-J., and Kim, S.K. (2006). Chromosomal clustering and
GATA transcriptional regulation of intestine-expressed genes in C. elegans.
Development 133, 287-295.
Pennacchio, L.A., and Rubin, E.M. (2001). Genomic strategies to identify mammalian
regulatory sequences. Nat Rev Genet 2, 100-109.
Portales-Casamar, E., Thongjuea, S., Kwon, A.T., Arenillas, D., Zhao, X., Valen, E., Yusuf, D.,
Lenhard, B., Wasserman, W.W., and Sandelin, A. (2009). JASPAR 2010: the greatly
expanded open-access database of transcription factor binding profiles. Nucleic
Acids Research.
Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., and Van de Peer, Y. (2003).
Computational Approaches to Identify Promoters and cis-Regulatory Elements in
Plant Genomes. Plant Physiology 132, 1162-1176.
Romer, K.A., Kayombya, G.-R., and Fraenkel, E. (2007). WebMOTIFS: automated discovery,
filtering and scoring of DNA sequence motifs using multiple programs and
Bayesian approaches. Nucleic Acids Research 35, W217-W220.
Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. (1998). Finding DNA regulatory
motifs within unaligned noncoding sequences clustered by whole-genome mRNA
quantitation. Nat Biotech 16, 939-945.
Sales, M.Ì.c.P., Gerhardt, I.R., Grossi-de-Sá, M.F.t., and Xavier-Filho, J. (2000). Do Legume
Storage Proteins Play a Role in Defending Seeds against Bruchids? Plant
Physiology 124, 515-522.
Sandelin, A., Wasserman, W.W., and Lenhard, B. (2004). ConSite: web-based prediction of
regulatory elements using cross-species comparison. Nucleic Acids Research 32,
W249-W252.
Sinha, S., Liang, Y., and Siggia, E. (2006). Stubb: a program for discovery and analysis of cis-
regulatory modules. Nucleic Acids Research 34, W555-W559.
Stormo, G.D. (2000). DNA Binding Sites: Representation and Discovery. Bioinformatics 16,
16 - 23.
Thomas-Chollier, M., Sand, O., Turatsinze, J.-V., Janky, R.s., Defrance, M., Vervisch, E.,
Brohee, S., and van Helden, J. (2008). RSAT: regulatory sequence analysis tools.
Nucleic Acids Research 36, W119-W127.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
66 Computational Biology and Applied Bioinformatics
weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids
Research 22, 4673-4680.
Wang, T., and Stormo, G.D. (2005). Identifying the conserved network of cis-regulatory sites
of a eukaryotic genome. Proceedings of the National Academy of Sciences of the
United States of America 102, 17400-17405.
Wijaya, E., Yiu, S.-M., Son, N.T., Kanagasabai, R., and Sung, W.-K. (2008). MotifVoter: a
novel ensemble method for fine-grained integration of generic motif finders.
Bioinformatics 24, 2288-2295.
Zakharov, A., Giersberg, M., Hosein, F., Melzer, M., Müntz, K., and Saalbach, I. (2004).
Seed-specific promoters direct gene expression in non-seed tissue. Journal of
Experimental Botany 55, 1463-1471.
Zhu, J., Liu, J.S., and Lawrence, C.E. (1998). Bayesian adaptive sequence alignment
algorithms. Bioinformatics 14, 25-39.
4
1. Introduction
Glycosylation is one of the major post-translational modification processes essential for
expression and function of many proteins. It has been estimated that 1% of the open reading
frames of a genome is dedicated to glycosylation. Many different enzymes are involved in
glycosylation, such as glycosyltransferases and glycosidases.
Traditionally, glycosyltransferases are classified based on their enzymatic activities by
Enzyme Commission (http://www.chem.qmul.ac.uk/iubmb/enzyme/). Based on the
activated donor type, glycosyltransferases are named, for example glucosyltransferase,
mannosyltransferase and N-acetylglucosaminyltransferases. However, classification of
glycosyltransferases based on the biochemical evidence is a difficult task since most of the
enzymes are membrane proteins. Reconstruction of enzymatic assay for the membrane
proteins are intrinsically more difficult than soluble proteins. Thus the purification of
membrane-bound glycosyltransferase is a difficult task. On the other hand, with the recent
advancement of genome projects, DNA sequences of an organism are readily available.
Furthermore, bioinformatics annotation tools are now commonly used by life science
researchers to identify the putative function of a gene. Hence, new approaches based on in
silico analysis for classifying glycosyltransferase have been used successfully. The best
known database for classification of glycosyltransferase by in silico approach is the CAZy
(Carbohydrate- Active enZymes) database (http://afmb.cnrs-mrs.fr/CAZy/) (Cantarel et al.,
2009).
Glycosyltransferases are enzymes involved in synthesizing sugar moieties by transferring
activated saccharide donors into various macro-molecules such as DNA, proteins, lipids and
glycans. More than 100 glycosyltransferases are localized in the endoplasmic reticulum (ER)
and Golgi apparatus and are involved in the glycan synthesis (Narimatsu, H., 2006). The
structural studies on the ER and golgi glycosyltransferases has revealed several common
domains and motifs present between them. The glycosyltransferases are grouped into
functional subfamilies based on similarities of sequence, their enzyme characteristics, donor
specificity, acceptor specificity and the specific donor and acceptor linkages (Ishida et al.,
2005). The glycosyltransferase sequences comprise of 330-560 amino acids long and share
the same type II transmembrane protein structure with four functional domains: a short
68 Computational Biology and Applied Bioinformatics
2. LARGE
LARGE is one of the largest genes present in the human genome and it is comprised of 660
kb of genomic DNA and contains 16 exons encoding a 756-amino-acid protein. It showed
98% amino acid identity to the mouse homologue and similar genomic organization. The
expression of LARGE is ubiquitous but the highest levels of LARGE mRNA are present in
heart, brain and skeletal muscle (Peyrard et al., 1999).
LARGE encodes a protein which has an N-terminal transmembrane anchor, coiled coil motif
and two putative catalytic domains with a conserved DXD (Asp-any-Asp) motif typical of
many glycosyltransferases that uses nucleoside diphosphate sugars as donors (Longman et
al., 2003 & Peyrard et al., 1999). The proximal catalytic domain in the LARGE was most
homologous to the bacterial glycosyltransferase family 8 (GT8 in CAZy database) members
(Coutinho et al., 2003). The members of this family are mainly involved in the synthesis of
bacterial outer membrane lipopolysaccharide. The distal domain resembled the human β1,3-
N-acetytglucosaminyltransferase (iGnT), a member of GT49 family. The iGnT enzyme is
required for the synthesis of the poly-N-acetyllactosamine backbone which is part of the
erythrocyte i antigen (Sasaki et al., 1997). The presence of two catalytic domains in the
LARGE is extremely unusual among the glycosyltransferase enzymes.
The closely related homologues of LARGE are found in the human genome,
(glycosyltransferase-like 1B; GYLTL1B), mouse genome (Glylt1b; also called LARGE-Like or
LargeL) and in some other vertebrate species (Grewal & Hewitt, 2002). The homologue gene
is positioned on the chromosome 11p11.2 of the human genome and it encodes 721 amino
acid protein which has 67% identity with LARGE, suggests that the two genes may have
risen by gene duplication. Like LARGE, it is also predicted to have two catalytic domains,
though it lacks the coiled-coiled motif present in the former protein. The hyperglycosylation
of α-dystroglycan by the overexpression of GYLTL1B increased its ability to bind laminin
and both the genes showed the same level of increase in laminin binding ability
(Brockington, et al., 2005).
the following tools. To begin with, the sequences are searched for the aspartate-any residue-
aspartate (DXD) motif. The DXD motifs present in some glycosyltransferase families are
essential for its enzyme activity.
All the information related to the LARGE-like protein family was retrieved from the
different biological databases. In order to confirm that the information obtained was
reliable, the data was scrutinized at two levels. First the information was selected from the
above mentioned biological databases with customized programs (using the perl compatible
regular expressions). Then the obtained information was annotated and validated by experts
in glycobiology and bioinformatics.
The annotated data in the LGTBase database was divided into nine categories (Figure 2).
The first category is related to genomic location, displays the chromosome, the cytogenetic
band and the map location of the gene. The second is related to aliases and descriptions,
displays synonyms and aliases for the relevant gene, and descriptions of its function,
cellular localization and effect on phenotype. The third category on proteins provides
annotated information about the proteins encoded by the relevant genes. The fourth is about
protein domains and families, provides annotated information about protein domains and
families and the fifth on protein function which provides annotated information about gene
function. The sixth category is related to pathways and interactions, provides links to
pathways and interactions followed by the seventh on disorders and mutations which
draws its information from OMIM and UniProt. The eighth category is on expression in
specific tissues, shows that the tissue expression values are available for a given gene. The
last category is about research articles, lists the references related to the proteins which are
studied. In addition, the investigator can also use DNA or protein sequences to assemble the
dataset for the analysis using this workflow.
LARGE like GlcNAc Transferases and built a protein database of ‘LARGE-like protein’. This
database would assist in search for more reference sequences of LARGE-like protein.
Fig. 5. The BlastP tool of the LGTBase platform to find similar sequences to LARGE
78 Computational Biology and Applied Bioinformatics
Fig. 6. DXD motif search tool of the LGTBase platform for DXD motif prediction
Fig. 7. TMHMM analysis tool of the LGTBase platform for Transmembrane domain
prediction
In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family 79
Fig. 8. MEME analysis tool of the LGTBase platform to predict the sequence motifs
Fig. 9. Pfam analysis tool of the LGTBase platform to identify the known protein family of
the target protein which is studied
80 Computational Biology and Applied Bioinformatics
Fig. 10. Phylogenetic analysis tool of the LGTBase platform to study the evolutionary
relationship of the target protein
6. Future direction
We have described how to construct a computational platform to analyze the LARGE
protein family. Since the platform was built based on several commonly shared protein
domains and motifs, it can also be modified for analyzing other golgi glycosyltransferases.
Furthermore, the phylogenetic analysis (Figure 3) revealed that LARGE protein family is
related to β-1,3-N-acetylglucosaminyltransferase 1 (β3GnT). β3GnT (EC 2.4.1.149) is a group
of enzymes belong to glycosyltransferases family. Some β3GnT enzymes catalyze the
transfer of GlcNAc from UDP-GlcNAc to Gal in the Galβ1-4 Glc(NAc) structure with β-1,3
linkage. These enzymes were grouped into GT family 31, 49 in the CAZy database. The
enzyme uses 2 substrates namely, UDP-N-acetyl-D-glucosamine and D-galactosyl-β-1,4-N-
acetyl-D-glucosaminyl-R and the products are formed as UDP, N-acetyl-β-D- glucosaminyl-
In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family 81
7. References
Bao, X., Kobayashi, M., Hatakeyama, S., Angata, K., Gullberg, D., Nakayama, J., Fukuda,
M.N. & Fukuda, M. (2009). Tumor suppressor function of laminin-binding
α-dystroglycan requires a distinct β-3-N-acetylglucosaminyltransferase. Proceedings
of the National Academy of Sciences USA, Vol.106, No.29, (July 2009),
pp. 12109-12114
Barresi, R., Michele, D.E., Kanagawa, M., Harper, H.A., Dovico, S.A., Satz, J.S., Moore, S.A.,
Zhang, W., Schachter, H., Dumanski, J.P., Cohn, R.D., Nishino, I. & Campbell, K.P.
(2004). LARGE can functionally bypass alpha-dystroglycan glycosylation defects in
distinct congenital muscular dystrophies. Nature Medicine, Vol.10, No.7, (July 2004),
pp. 696-703.
Braun, S. (2004). Naked plasmid DNA for the treatment of muscular dystrophy. Current
Opinion in Molecular Therapeutics, Vol.6, (October 2004), pp. 499-505.
Brockington, M., Torelli, S., Prandini, P., Boito, C., Dolatshad, N.F., Longman, C., Brown,
S.C., Muntoni, F. (2005). Localization and functional analysis of the LARGE family
of glycosyltransferases: significance for muscular dystrophy. Human Molecular
Genetics, Vol.14, No.5, (March 2005), pp. 657-665.
Busch, C., Hofmann, F., Selzer, J., Munro, S., Jeckel, D. & Aktories, K. (1998). A common
motif of eukaryotic glycosyltransferases is essential for the enzyme activity of large
clostridial cytotoxins. Journal of Biological Chemistry, Vol.273, No.31, (July 1998),
pp.19566-19572.
Cantarel, B.L., Coutinho, P.M., Rancurel, C., Bernard, T., Lombard, V. & Henrissat, B.
(2009) The Carbohydrate-Active EnZymes database (CAZy): an expert
resource for Glycogenomics. Nucleic Acids Research, Vol. 37, (January 2009), pp.
D233-238
In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family 83
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J. & Higgins, D.G. Thompson JD
(2003). Multiple sequence alignment with the Clustal series of programs. Nucleic
Acids Res. Vol.31, No.13, (July 2003), pp. 3497-3500.
Coutinho, P.M., Deleury, E., Davies, G.J. & Henrissat, B. (2003). An evolving hierarchical
family classification for glycosyltransferases. Journal of Molecular Biology, Vol.328,
(April 2003), pp. 307-317.
Ding, J., Berleant, D., Nettleton, D. & Wurtele, E. (2002). Mining MEDLINE: abstracts,
sentences, or phrases? Pacific Symposium on Biocomputing, Vol.7, pp. 326-337.
Dumanski, J.P., Carlbom, E., Collins, V.P., Nordenskjold, M. (1987). Deletion mapping of a
locus on human chromosome 22 involved in the oncogenesis of meningioma.
Proceedings of the National Academy of Sciences USA, Vol.84, (December 1987), pp.
9275-9279.
Durbeej, M., Henry, M.D. & Campbell, K.P. (1998). Dystroglycan in development and
disease. Current Opinions in Cell Biology Vol. 10, (October 1998), pp. 594-601.
Fujimura, K., Sawaki, H., Sakai, T., Hiruma, T., Nakanishi, N., Sato, T., Ohkura, T.,
Narimatsu, H. (2005). LARGE2 facilitates the maturation of a-dystroglycan more
effectively than LARGE. Biochemical and Biophysical Research Communications,
Vol.329, No.3, (April 2005), pp. 1162-1171
Fukuda, M., Hindsgaul, O., Hames, B.D. & Glover, D.M. (1994). In Molecular Glycobiology,
Oxford Univ. Press, Oxford.
Fukuda, M., & Hindsgaul, O. (2000). Molecular and Cellular Glycobiology (2nd ed.), Oxford
Univ. Press, Oxford.
Gee, S.H., Montanaro, F., Lindenbaum, M.H., and Carbonetto, S. (1994). Dystroglycan-α, a
dystrophin-associated glycoprotein, is a functional agrin receptor. Cell, Vol.77,
(June 1994), pp. 675-686.
Gerstein, M. (2000). Integrative database analysis in structural genomics. Nature Structural
Biology, Vol.7, (November 2000), Suppl: 960-963.
Grewal, K., Holzfeind, P.J., Bittner, R.E. & Hewitt, J.E., (2001). Mutant glycosyltransferase
and altered glycosylation of alpha-dystroglycan in the myodystrophy mouse.
Nature Genetics, Vol.28, (June 2001), pp.151-154.
Grewal, P.K. & Hewitt, J.E. (2002). Mutation of Large, which encodes a putative
glycosyltransferase, in an animal model of muscular dystrophy. Biochimica et
Biophysica Acta, Vol.1573, (December 2002), pp. 216-224.
Gromova, I., Gromov, P. & Celis J.E. (2001). A Novel Member of the Glycosyltransferase
Family, β3GnT2, highly down regulated in invasive human bladder Transitional
Cell Carcinomas. Molecular Carcinogenesis, Vol. 32, No. 2, (October 2001), pp.
61-72
Hayatsu, N., Ogasawara, S., Kaneko, M.K., Kato, Y. & Narimatsu, H. (2008). Expression of
highly sulfated keratan sulfate synthesized in human glioblastoma cells. Biochemical
and Biophysical Research Communications, Vol. 368, No. 2, (April 2008), pp. 217-222
Hwa, K.Y., Pang, T.L. & Chen, M.Y. (2007). Classification of LARGE-like GlcNAc-
Transferases of Dictyostelium discoideum by Phylogenetic Analysis. Frontiers in the
Convergence of Bioscience and Information Technologies, pp. 289-293.
84 Computational Biology and Applied Bioinformatics
Ishida, H., Togayachi, A., Sakai, T., Iwai, T., Hiruma, T., Sato, T., Okubo, R., Inaba, N., Kudo,
T., Gotoh, M., Shoda, J., Tanaka, N., & Narimatsu, H. A novel beta1,3-N-
acetylglucosaminyltransferase (beta3Gn-T8), which synthesizes poly-N-
acetyllactosamine, is dramatically upregulated in colon cancer. FEBS Letters.
(January 2005), Vol. 579, No.1, pp. 71-78.
Ishida, H., Togayachi, A., Sakai, T., Iwai, T., Hiruma, T., Sato, T., Okubo, R., Inaba, N., Kudo,
T., Gotoh, M., Shoda, J., Tanaka, N. & Narimatsu, H. (2005). A novel beta1,3-N-
acetylglucosaminyltransferase (beta3Gn-T8), which synthesizes poly-N-
acetyllactosamine, is dramatically upregulated in colon cancer. FEBS Letters,
Vol.579, No.1, (January 2005), pp. 71-8.
Iwai, T., Kudo, T., Kawamoto, R., Kubota, T., Togayachi, A., Hiruma, T., Okada, T.,
Kawamoto, T., Morozumi, K. & Narimatsu, H. (2005). Core 3 synthase is down-
regulated in colon carcinoma and profoundly suppresses the metastatic potential of
carcinoma cells. Proceedings of the National Academy of Sciences USA, Vol.102, No.12,
(March 2005), pp. 4572-4577
Kanagawa, M., Michele, D.E., Satz, J.S., Barresi, R., Kusano, H., Sasaki, T., Timpl, R., Henry,
M. D., and Campbell, K.P. (2005). Disruption of Perlecan Binding and Matrix
Assembly by Post-Translational or Genetic Disruption of Dystroglycan Function.
FEBS Letters, Vol.579, No.21, (August 2005), pp. 4792-4796.
Kataoka, K. & Huh, N.H. (2002). A novel β1,3-N-acetylglucosaminyltransferase involved in
invasion of cancer cells as assayed in vitro. Biochemical and Biophysical Research
Communications, Vol. 294, No.4, (June 2002), pp. 843-848
Lane, P.W., Beamer, T.C. & Myers, D.D. (1976). Myodystrophy, a new myopathy on
chromosome 8 of the mouse. Journal of Heredity, Vol. 67, No.3 (May-June 1976), pp.
135-138.
Longman, C., Brockington, M., Torelli, S., Jimenez-Mallebrera, C., Kennedy, C., Khalil, N.,
Feng, L., Saran, R.K., Voit, T., Merlini, L., Sewry, C.A., Brown, S.C. & Muntoni F.
(2003). Mutations in the human LARGE gene cause MDC1D, a novel form of
congenital muscular dystrophy with severe mental retardation and abnormal
glycosylation of alpha dystroglycan. Human Molecular Genetics, Vol.12, No.21,
(November 2003), pp. 2853-2861.
Marcos, N.T., Magalhães, A., Ferreira, B., Oliveira, M.J., Carvalho, A.S., Mendes, N.,
Gilmartin, T., Head, S.R., Figueiredo, C., David, L., Santos-Silva, F. & Reis, C.A.
(2008). Helicobacter pylori induces β3GnT5 in human gastric cell lines, modulating
expression of the SabA ligand Sialyl-Lewis X. Journal of Clinical Investigation, Vol.
118, No. 6, (June 2008), pp.2325-2336
Narimatsu, H. (2006). Human glycogene cloning: focus on beta 3-glycosyltransferase and
beta 4-glycosyltransferase families. Current Opinions in Structural Biology. Vol.16,
No.5, (October 2006), pp. 567-575.
Newman, E.A. & Frishman, L.J. (1991). The b-wave. In Arden, G.B. (ed.), Principles and
Practice of Clinical Electrophysiology of Vision, Mosby-Year Book, St Louis, MO.
Peng, H.B., Ali, A.A., Daggett, D.F., Rauvala, H., Hassell, J.R., & Smalheiser, N.R. (1998). The
relationship between perlecan and dystroglycan and its implication in the
In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family 85
Zhou, D., Dinter, A., Gutiérrez Gallego, R., Kamerling, J.P., Vliegenthart, J.F., Berger, E.G. &
Hennet, T. (1999). A β-1,3-N-acetylglucosaminyltransferase with poly-N-
acetyllactosamine synthase activity is structurally related to β-1,3-
galactosyltransferases. Proceedings of the National Academy of Sciences USA, Vol. 96,
No. 2, (January 1999), pp. 406-411
5
1. Introduction
Breast cancer is the most common form of cancer among women. In 2009, an estimated
194,280 new cases of breast cancer were diagnosed in the United States; breast cancer was
estimated to account for 27% of all new cancer cases and 15% of cancer-related mortality in
women (Jemal et al, 2009). Similarly, in Europe in 2008, the disease accounted for some 28%
and 17% of new cancer cases and cancer-related mortality in women respectively (Ferlay et
al, 2008). The increasing incidence of breast cancer worldwide will result in an increased
social and economic burden; for this reason there is a pressing need from a health and
economics perspective to develop and provide appropriate, patient specific treatment to
reduce the morbidity and mortality of the disease. Understanding the aetiology, biology and
pathology of breast cancer is hugely important in diagnosis, prognostication and selection of
primary and adjuvant therapy. Breast tumour behaviour and outcome can vary
considerably according to factors such as age of onset, clinical features, histological
characteristics, stage of disease, degree of differentiation, genetic content and molecular
aberrations. It is increasingly recognised that breast cancer is not a single disease but a
continuum of several biologically distinct diseases that differ in their prognosis and
response to therapy (Marchionni et al, 2008; Sorlie et al, 2001). The past twenty years has
seen significant advances in breast cancer management. Targeted therapies such as
hormonal therapy for estrogen receptor (ER) positive breast tumours and trastuzumab for
inhibition of HER2/neu signalling have become an important component of adjuvant
therapy and contributed to improved outcomes (Fisher et al, 2004; Goldhirsch et al, 2007;
Smith et al, 2007). However, our understanding of the molecular basis underlying breast
cancer heterogeneity remains incomplete. It is likely that there are significant differences
between breast cancers that reach far beyond the presence or absence of ER or HER2/neu
amplification. Patients with similar morphology and molecular phenotype based on ER, PR
and HER2/neu receptor status can have different clinical courses and responses to therapy.
There are small ER positive tumours that behave aggressively while some large high grade
ER negative, HER2/neu receptor positive tumours have an indolent course. ER-positive
tumours are typically associated with better clinical outcomes and a good response to
88 Computational Biology and Applied Bioinformatics
hormonal therapies such as tamoxifen (Osborne et al, 1998). However, a subset of these
patients recur and up to 40% develop resistance to hormonal therapy (Clarke et al, 2003).
Furthermore, clinical studies have shown that adding adjuvant chemotherapy to tamoxifen
in the treatment of node negative, ER positive breast cancer improves disease outcome
(Fisher et al, 2004). Indeed, treatment with tamoxifen alone is only associated with a 15%
risk of distant recurrence, indicating that 85% of these patients would do well without, and
could be spared the cytotoxic side-effects of adjuvant chemotherapy.
The heterogeneity of outcome and response to adjuvant therapy has driven the discovery of
further molecular predictors. Particular attention has focused on those with prognostic
significance which may help target cancer treatment to the group of patients who are likely
to derive benefit from a particular therapy. There has been a huge interest in defining the
gene expression profiles of breast tumours to further understand the aetiology and progression
of the disease in order to identify novel prognostic and therapeutic markers. The sequencing
of the human genome and the advent of high throughput molecular profiling has facilitated
comprehensive analysis of transcriptional variation at the genomic level. This has resulted in
an exponential increase in our understanding of breast cancer molecular biology. Gene
expression profiling using microarray technology was first introduced in 1995 (Schena et al,
1995). This technology enables the measurement of expression of tens of thousands of
mRNA sequences simultaneously and can be used to compare gene expression within a
sample or across a number of samples. Microarray technology has been productively
applied to breast cancer research, contributing enormously to our understanding of the
molecular basis of breast cancer and helping to achieve the goal of individualised breast
cancer treatment. However as the use of this technology becomes more widespread, our
understanding of the inherent limitations and sources of error increases. The large amount
of data produced from such high throughput systems has necessitated the use of complex
computational tools for management and analysis of this data; leading to rapid
developments in bioinformatics.
This chapter provides an overview of current gene expression profiling techniques, their
application to breast cancer prognostics and the bioinformatic challenges that must be
overcome to generate meaningful results that will be translatable to the clinical setting. A
literature search was performed using the PubMed database to identify publications
relevant to this review. Citations from these articles were also examined to yield further
relevant publications.
• CpG arrays (Yan et al , 2000) can be used to determine whether patterns of specific
epigenetic alterations correlate with pathological parameters.
• Protein microarrays (Stoll et al, 2005) consisting of antibodies, proteins, protein
fragments, peptides or carbohydrate elements, are used to detect patterns of protein
expression in diseased states.
• ChIP-on-chip (Oberley et al, 2004) combines chromatin immunoprecipitation (ChIP)
with glass slide microarrays (chip) to detect how regulatory proteins interact with the
genome.
All of these approaches offer unique insights into the genetic and molecular basis of disease
development and progression.
This chapter focuses primarily on gene expression profiling and cDNA microarrays,
however many of the issues raised, particularly in relation to bioinformatics are also
applicable to the other “-omic” technologies.
Gene expression which is a measurement of gene “activity” can be determined by the
abundance of its messenger RNA (mRNA) transcripts or by the expression of the protein
which it encodes. ER, PR and HER2/neu receptor status are determined in clinical practice
using immunohistochemistry (IHC) to quantitate protein expression or fluorescence in situ
hybridisation (FISH) to determine copy number. These techniques are semi-quantitative and
are optimal when determining the expression of individual or a small number of genes.
Microarray technology is capable of simultaneously measuring the expression levels of
thousands of genes in a biological sample at the mRNA level. The abundance of individual
mRNA transcripts in a sample is a reflection of the expression levels of corresponding genes.
When a complementary DNA (cDNA) mixture reverse transcribed from the mRNA is
labelled and hybridised to a microarray, the strength of the signal produced at each address
shows the relative expression levels of the corresponding gene.
cDNA microarrays are miniature platforms containing thousands of DNA sequences which
act as gene specific probes, immobilised on a solid support (nylon, glass, silicon) in a parallel
format. They are reliant on the complementarity of the DNA duplex i.e. reassembly of
strands with base pairing A to T and C to G which occurs with high specificity. There are
microarray platforms available containing bound librarys of oligonucleotides representing
literally all known human genes e.g. Affymetrix GeneChip (Santa Clara, CA), Agilent array
(Santa Clara, CA), Illumina bead array (San Diego, CA). When fluorescence-labelled cDNA
is hybridised to these arrays, expression levels of each gene in the human genome can be
quantified using laser scanning microscopes. These microscopes measure the intensity of the
signal generated by each bound probe; abundant sequences generate strong signals and rare
sequences generate weaker signals. Despite differences in microarray construction and
hybridization methodologies according to manufacturing, microarray-based measurements
of gene expression appear to be reproducible across a range of different platforms when the
same starting material is used, as demonstrated by the MicroArray Quality Control project
(Shi et al, 2006).
cluster with fibroadenoma and normal breast tissue samples (Peppercorn et al, 2008). It is
important at this point to acknowledge the limitations of this molecular taxonomy;
intrasubtype heterogeneity has been noted despite the broad similarities defined by these
large subtypes (Parker et al, 2009). In particular the basal-like subgroup can be divided into
multiple additional subgroups (Kreike et al, 2007; Nielsen et al, 2004). Additionally,
although the luminal tumours have been separated into subgroups of prognostic
significance, meta-analysis of published expression data has suggested that these luminal
tumours actually form a continuum and their separation based on expression of
proliferation genes may be subjective (Shak et al, 2006; Wirapati et al, 2008). Furthermore,
the clinical significance of the normal-like subtype is yet to be determined; it has been
proposed that this subgroup may in fact represent an artefact of sample contamination with
a high content of normal breast tissue (Parker et al, 2009; Peppercorn et al, 2008). Due to
these limitations and the subjective nature of how the molecular subtypes were identified,
the translation of this taxonomy to the clinical setting as a definitive classification has been
difficult (Pustzai et al, 2006). The development of a prognostic test based on the intrinsic
subtypes has not been feasible to date. However, the seminal work by Sorlie and Perou
(Perou et al, 2000; Sorlie et al, 2001) recognized for the first time the scale of biological
heterogeneity within breast cancer and led to a paradigm shift in the way breast cancer is
perceived.
1 http://www.genomichealth.com/OncotypeDX
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 95
range being changed from RS 18-30 to RS 11-25 to avoid exluding patients who may derive a
small benefit from chemotherapy (Sparano et al, 2006). Patients in the intermediate RS
group are randomly assigned to receive either adjuvant chemotherapy and hormonal
therapy, or hormonal therapy alone. The primary aim of the trial is to determine if ER
positive patients with an intermediate RS benefit from adjuvant chemotherapy or not.
The MammaPrint and Oncotype Dx gene signatures both predict breast cancer behaviour,
however there are fundamental differences between them (outlined in table 1). This chapter
has focused on these signatures as they were the first to be developed, have been extensively
validated, and are commercially available. However it is important to note that there are
other multi-gene based assays that have been developed and commercialized but are not
discussed in detail as they are not yet as widely utilized (Loi et al, 2007; Ma et al, 2008; Ross
et al, 2008; Wang et al, 2005 ).
al, 2006; Michiels et al, 2005). This lack of concordance has called into question the
applicability of microarray analysis across the entire breast cancer population. In order to
facilitate external validation of signatures and meta-analysis in an attempt to devise more
robust signatures, it is important that published microarray data be publicly accessible to
the scientific community. In 2001 the Microarray Gene Expression Data Society proposed
experimental annotation standards known as minimum information about a microarray
experiment (MIAME), stating that raw data supporting published studies should be made
publicly available in one of a number of online repositories (table 2), these standards are
now upheld by leading scientific journals and facilitating in depth interrogation of multiple
datasets simultaneously.
Public Database
for Microarray URL Organization Description
Data
Array Express http://www.ebi.ac.uk/arrayexpress/ European Public data
Bioinformatics deposition and
Institute (EBI) queries
GEO Gene http://www.ncbi.nlm.nih.gov/geo/ National Centre for Public data
Expression Biotechnology deposition and
Omnibus Information (NCBI) queries
CIBEX Center http://cibex.nig.ac.jp/index.jsp National Institute Public data
for Information of Genetics deposition and
Biology Gene queries
Expression
Database
ONCOMINE http://www.oncomine.org/main/ University of Public queries
Cancer Profiling index.jsp Michigan
Database
PUMAdb http://puma.princeton.edu/ Princeton Public queries
Princeton University
University
MicroArray
database
SMD Stanford http://genome- Stanford Univeristy Public queries
Microarray www5.stanford.edu/
Database
UNC Chapel https://genome.unc.edu/ University of North Public queries
Hill Microarray Carolina at Chapel
database Hill
Table 2. List of Databases with Publicly Available Microarray Data
Leukaemia (Calin et al, 2002). Loss of miR-15a and miR-16-1 from this locus results in
increased expression of the anti-apoptotic gene BCL2. Intensifying research in this field,
using a range of techniques including miRNA cloning, quantitative PCR, microarrays and
bead-based flow cytometric miRNA expression profiling has resulted in the identification
and confirmation of abnormal miRNA expression in a number of human malignancies
including breast cancer (Heneghan et al, 2010; Lowery et al, 2007). MiRNA expression has
been observed to be upregulated or downregulated in tumours compared with normal
tissue, supporting their dual role in carcinogenesis as either oncogenic miRNAs or tumour
suppressors respectively (Lu et al, 2005). The ability to profile miRNA expression in human
tumours has led to remarkable insight and knowledge regarding the developmental lineage
and differentiation states of tumours. It has been shown that distinct patterns of miRNA
expression are observed within a single developmental lineage, which reflect mechanisms of
transformation, and support the idea that miRNA expression patterns encode the
developmental history of human cancers. In contrast to mRNA profiles it is possible also to
successfully classify poorly differentiated tumours using miRNA expression profiles
(Volinia et al, 2006). In this manner, miRNA expression could potentially be used to
accurately diagnose poorly differentiated tissue samples of uncertain histological origin, e.g.
metastasis with an unknown primary tumour, thus facilitating treatment planning.
MicroRNAs exhibit unique, inherent characteristics which make them particularly attractive
for biomarker development. They are known to be dysregulated in cancer, with
pathognomonic or tissue specific expression profiles and even a modest number of miRNAs
is sufficient to classify human tumours, which is in contrast to the relatively large mRNA
signatures generated by microarray studies (Lu et al, 2005). Importantly, miRNA are
remarkably stable molecules. They undergo very little degradation even after processing
such as formalin fixation and remain largely intact in FFPE clinical tissues, lending
themselves well to the study of large archival cohorts with appropriate follow-up data (Li et
al, 2007; Xi et al, 2007). The exceptional stability of miRNAs in visceral tissue has stimulated
investigation into their possible preservation in the circulation and other bodily fluids
(urine, saliva etc.). The hypothesis is that circulating miRNAs, if detectable and quantifiable
would be the ideal biomarker accessible by minimally invasive approaches such as simple
phlebotomy (Cortez et al, 2009; Gilad et al, 2008; Mitchell et al, 2008).
et al, 2010; Tessel et al, 2010; Wang et al, 2010), prompting analysis of tumour response in
clinical samples. Rodriguez-Gonzalez et al attempted to identify miRNAs related to
response to tamoxifen therapy by exploiting the Foekens dataset (Foekens, 2008) which
comprised miRNA expression levels of 249 miRNAs in 38 ER positive breast cancer patients.
Fifteen of these patients were hormone naive and experienced relapse, which was treated
with tamoxifen. Ten patients responded and five did not, progressing within 6 months. Five
miRNAs (miR-4221, miR-30a-3p, miR-187, miR-30c and miR-182) were the most
differentially expressed between patients who benefitted from tamoxifen and those who
failed therapy. The predictive value for these miRNAs was further assessed in 246 ER
positive primary tumours of hormone naive breast cancer patients who received tamoxifen
as monotherapy for metastatic disease. MiR-30a-3p, miR-30c and miR-182 were significantly
associated with response to tamoxifen, but only miR-30c remained an independent predictor
on multivariate analysis (Rodriguez-Gonzalez, 2010).
Microarray-based expression profiling has also been used to identify circulating miRNAs
which are differentially expressed in breast cancer patients and matched healthy controls.
Zhao et al profiled 1145 miRNAs in the plasma of 20 breast cancer patients and 20 controls,
identifying 26 miRNAs with at least two-fold differential expression which reasonably
separated the 20 cases from the 20 controls (Zhao et al, 2010). This is the first example of
genome-wide miRNA expression profiling in the circulation of breast cancer patients and
indicates potential for development of a signature of circulating miRNAs that may function
as a diagnostic biomarker of breast cancer.
At present diagnostic, prognostic and predictive miRNA signatures and markers remain
hypothesis generating. They require validation in larger, independent clinical cohorts prior
to any consideration for clinical application. Furthermore as additional short non-coding
RNAs are continuously identified through biomarker discovery programmes, the available
profiling technologies must adapt their platforms to incorporate newer potentially relevant
targets. MicroRNAs possess the additional attraction of potential for development as
therapeutic targets due to their ability to regulate gene expression. It is likely that future
microarray studies will adopt and integrated approach of miRNA and mRNA expression
analysis in an attempt to decipher regulatory pathways in addition to expression patterns.
over 1.4 million probesets (Lancashire et al, 2009). As a result, it appears clearly that
extracting any relevant key component from such datasets requires robust mathematical
and/or statistical models running on efficient hardware to perform the appropriate
analyses.
With this in mind, it is clear that the identification of new biomarkers still requires a
concerted, multidisciplinary effort. It requires the expertise of the biologist or pathologist, to
extract the samples, the scientist to perform the analysis on the platform and then the
bioinformatician/biostatistician to analyse and interpret the output. The data-mining
required to cope with these types of data needs careful consideration and specific
computational tools, and as such remains a major challenge in bioinformatics.
6.1.4 Reproducibility
Reproducibility has a marked effect on the accuracy of any analysis conducted. Furthermore
reproducibility has a profound effect on the impact of other issues such as dimensionality
and false detection. Robust scientific procedures requires that the results have to be
reproducible in order to reduce the within sample variability, the variability between
sample runs and the variability across multiple reading instruments. Aspects of variability
can be addressed using technical and experimental replicates. The averaging of samples
profiles can be used to increase the confidence in the profiles for comparison (Lancashire et
al., 2009). Technical replicates provide information on the variability associated with
instrumental variability whilst experimental (or biological) replicates give a measure of the
natural sample to sample variation. Problems in data analysis occur when the technical
variability is high. In this situation the problem in part can be resolved by increasing the
number of replicates. If however the technical variation is higher than the biological
variation then the sample cannot be analysed.
Auto correlation becomes a problem when using linear based regression approaches. This is
because one of the assumptions of regression using multiple components is that the
components are not auto correlated. If intensity for multiple miRNA probes are to be added
into a regression to develop a classifier these components should not be auto correlated.
Auto correlation can be tested for using the Durbin Watson test.
6.1.6 Generality
The whole purpose of biomarker (or set of biomarkers) identification, using high-
throughput technologies or any other, is to provide the clinicians with an accurate model in
order to assess a particular aspect. However, a model is only as good as its ability to
generalize to unseen real world data. A model only able to explain the population on which
it was developed would be purely useless for any application.
As a result, if one is to develop classifiers from mRNA or miRNA array data the features
identified should be generalised. That is they will predict for new cases in the general
population of cases. When analysing high dimensional data there is an increased risk of over
fitting, particularly when the analysis methods imply supervised training on a subset of the
population. So for example, when a large number of mRNA or miRNA are analysed there is
the potential for false detection to arise. If a random element identified through false
detection is included as a component of a classifier (model) then the generality of that
classifier will be reduce; i.e. it is not a feature that relates to the broader population but is a
feature specific to the primary set of data used to develop the classifier. Standards of
validation required to determine generality have been defined by Michiels et al, 2007.
Generality of classifiers can be increased by the application of bootstrapping or cross
validation approaches.
Some algorithms and approaches, that usually involve supervised training, suffer from
over-fitting (sometimes called memorisation). This is a process where a classifier is
developed for a primary dataset but models the noise within the data as well as the relevant
features. This means that the classifier will not accurately classify for new cases i.e. it does
not represent a general solution to the problem which is applicable to all cases. This is
analogous, for example, to one developing a classifier that predicts well the risk of
metastasis for breast cancer patients from Nottingham but will not predict well for a set of
cases from Denmark. Over fitted classifiers seldom represent the biology of the system being
investigated and the features identified are often falsely detected.
One of the most common solutions to avoid over-fitting is to apply a Cross Validation
technique in combination with the supervised training. Random sample cross validation is a
process of mixing data. Firstly the data are divided into two or three parts (figure 2); the first
part is used to develop the classifier and the second or second and third parts are used to
test the classifier. These parts are sometimes termed training, test and validation data sets
respectively. In certain classifiers such as Artificial Neural Network based classifiers the
second blind set is used for optimisation and to prevent over fitting. In random sample cross
validation the random selection and training process is repeated a number of times to create
a number of models each looking at the global dataset in a number of different ways (figure
2). Often the mean performance of these models is considered.
Leave one out cross validation is an approach also used to validate findings. In this case one
sample is left out of the analysis. Once training is complete the sample left out is tested. This
process is repeated a number of times to determine the ability of a classifier to predict
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 105
unseen cases. This approach of random sample cross validation drives the classifier solution
to a generalised one by stopping the classifier from training too much on a seen dataset and
stopping the training earlier based on a blind dataset.
Fig. 2. Illustration of Cross Validation technique, here with three subsets: the training subset
used to train the classifier, the test subset used to stop the training when it has reached an
optimal performance on this subset, and a validation subset to evaluate the performance
(generalization ability) of the trained classifier.
106 Computational Biology and Applied Bioinformatics
trees have been used in the analysis of miRNA data derived to classify cancer patients (Xu,
et al, 2009).
Fig. 4. Example of a hierarchical clustering analysis result aiming to find clusters of similar
cases.
Boosted decision trees take the primary decision tree algorithm and boost it. Boosting is a
process where classifiers are derived to allow prediction of those not correctly predicted by
earlier steps. This means that a supervised classification is run where the actual class is
known. A decision tree is created that classifies correctly as many cases as possible. Those
cases that are incorrectly classified are given more weighting. A new tree is then created
with these boosted weights. This process is similar to the iterative leaning that is conducted
with the Artificial Neural Network back propagation algorithm.
Random forest approaches take the basic decision tree algorithm and couple it with random
sample cross validation. In this way a forest of trees is created. Integration of a number of
decision trees identifies a combined decision tree which, as it is developed on blind cases,
represents what approaches a generalised solution for the problem being modelled
(Breiman et al, 2001). This approach has been shown to be very good at making generalised
classifications. The approach essentially derives each tree from a random vector with
equivalent distribution from within the data set, essentially an extensive form of cross
validation. Yousef et al, (2010) have used random forest as one method for the identification
of gene targets for miRNAs. Segura et al (2010) have used random forests as a part of an
analysis to define post recurrence survival in melanoma patients.
LDA can outperform other linear classification methods as LDA tries to consider the
variation within the sample population. Nevertheless, LDA still suffers from its linear
characteristic, and often fails to accurately classify non-linear problems, which is mostly the
case in biomedical sciences (Stekel et al, 2003). This is the reason why non-linear classifiers
are recommended.
Fig. 6. Example of a classical MLP ANN topology with the details of a node (or neurone)
able to accurately discriminate the classes (Dreiseitlet al, 2001), similarly to what does Linear
Discriminant Analysis. If no such linear hyperplane can be found, usually due to the
inherent non-linearity of the dataset, the data are mapped into a high-dimensional feature
space using a kernel function (for example polynomial or radial basis functions) in which
the two classes can now be separated by a hyperplane which corresponds to a non-linear
classifier (Furey et al, 2000). The class of the unknown sample is then determined by the side
of the “maximal marginal hyper plane” on which it lies. SVMs have been used to analyse
miRNA data by Xue et al, 2005.
Fig. 7. Schematic representation of the principle of SVM. SVM tries to maximise the margin
from the hyperplane in order to best separate the two classes (red positives from blue
negatives).
8. Conclusion
The capability of microarray to simultaneously analyse expression patterns of thousands of
DNA sequences, mRNA or miRNA transcripts has the potential to provide a unique insight
into the molecular biology of malignancy. However, the clinical relevance and value of
microarray data is highly dependent on a number of crucial factors including appropriate
experimental design and suitable bioinformatic analysis. Breast cancer is a heterogeneous
disease with many biological variables which need to be considered to generate meaningful
results. Cohort selection is critical and sufficient biological and technical replicates must be
included as part of microarray study design. Experimental protocols should be appropriate
to the research question. The research community have enthusiastically applied high
throughput technologies to the study of breast cancer. Class prediction, class comparison
and class discovery studies have been undertaken in an attempt to unlock the heterogeneity
of breast cancer and identify novel biomarkers. Molecular signatures have been generated
which attempt to outperform current histopathological parameters at prognostication and
112 Computational Biology and Applied Bioinformatics
prediction of response to therapy. Two clinical tests based on gene expression profiling
(Oncotype DX and Mammaprint) are already in clinical use and being evaluated in
multicentre international trials. It is essential that the potential of microarray signatures is
carefully validated before they are adopted as prognostic tools in the clinical setting.
Standards have been set for the reporting of microarray data (MIAME) and such data is
publically available to facilitate external validation and meta-analysis. It is imperative that
the data is integrated with knowledge normally processed in the clinical setting if we are to
overcome the difficulties in reproducibility, standardization and lack of proof of significance
beyond traditional clinicopathological tools that are limiting the incorporation of microarray
based tools into today’s standard of care.
Deriving biologically and clinically relevant results from microarray data is highly
dependent on bioinformatic analysis. Microarray data is limited by inherent characteristics
that render traditional statistical approaches less effective. These include high
dimensionality, false discovery rates, noise, complexity, non-normality and limited
reproducibility. High dimensionality remains one of the most critical challenges in the
analysis of microarray data. Hierarchical clustering approaches, which have been widely
used in the analysis of breast cancer microarray data, do not cope well with dimensionality.
In overcoming this challenge supervised machine learning techniques have been adapted to
the clinical setting to complement the existing statistical methods. The majority of machine
learning techniques originated in weak-theory domains such as business and marketing.
However, these approaches including Artificial Neural Networks and Support Vector
Machines have been successfully applied to the analysis of miRNA microarray data in the
context of clinical prognostication and prediction.
It is clear that the goal of translating microarray technology to the clinical setting requires
close collaboration between the involved scientific disciplines.If the current momentum in
microarray-based miRNA and mRNA translational research can be maintained this will add
an exciting new dimension to the field of diagnostics and prognostics and will bring us
closer to the ideal of individualized care for breast cancer patients.
9. References
Abbott AL, Alvarez-Saavedra E, Miska EA et al( 2005) The let-7 MiRNA family members
mir-48, mir-84, and mir-241 function together to regulate developmental timing in
Caenorhabditis elegans. Dev Cell. 9(3):403-14.
Adam BL, Qu Y, Davis JW, Ward MD et al (2002) Serum protein fingerprinting coupled with
a pattern-matching algorithm distinguishes prostate cancer from benign prostate
hyperplasia and healthy men. Cancer Research 62:3609-3614.
Ahmed AA, Brenton JD (2005) Microarrays and breast cancer clinical studies: forgetting
what we have not yet learnt. Breast Cancer Res 7:96–99.
Arciero C, Somiari SB, Shriver CD, et al. (2003). Functional relationship and gene ontology
classification of breast cancer biomarkers. Int. J. Biol. Markers 18: 241-272.
Ashburner M, Ball CA, Blake JA et al (2000). Gene Ontology: tool for the unification of
biology. Nat Genet. 25(1): 25–29.
Baffa R, Fassan M, Volinia S et al.(2009) MicroRNA expression profiling of human metastatic
cancers identifies cancer gene targets. J Pathol, 219(2), 214‐221
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 113
Ball CA, Dolinski K, Dwight SS, et al (2000). Integrating functional genomic information into
the Saccharomyces Genome Database. Nucleic Acids Res;28:77–80
Ball G, Mian S, Holding F, et al (2002) An integrated approach utilizing artificial neural
networks and SELDI mass spectrometry for the classification of human tumours
and rapid identification of potential biomarkers. Bioinformatics 18:3395-3404.
Bartel DP. (2004) MiRNAs: genomics, biogenesis, mechanism and function. Cell; 116:281-97.
Bellman RE (1961) Adaptive Control Processes. Princeton University Press, Princeton, NJ
Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society.
Series B (Methodological);57:289-300.
Berns EM, van Staveren IL, Verhoog L et al (2001) Molecular profiles of BRCA1-mutated and
matched sporadic breast tumours: relation with clinico-pathological features.British
journal of cancer;85(4):538-45.
Bishop C (1995) Neural networks for pattern recognition. Oxford University Press.
Blake JA, Eppig JT, Richardson JE, Davisson MT (2000). The Mouse Genome Database
(MGD): expanding genetic and genomic resources for the laboratory mouse.
Nucleic Acids Res.;28:108–111
Blake JA, Harris MA (2002) The Gene Ontology (GO) project: structured vocabularies for
molecular biology and their application to genome and expression analysis. Curr
Protoc Bioinformatics. Chapter 7:Unit 7.2.
Blenkiron C, Goldstein LD, Thorne NP, et al (2007) MicroRNA expression profiling of
human breast cancer identifies new markers of tumor subtype. Genome Biol
8(10):R214
Brenton JD, Carey LA, Ahmed AA, Caldas C (2005). Molecular classification and molecular
forecasting of breast cancer: ready for clinical application? J Clin Oncol 23:7350–
7360.
Breiman L. Random Forests (2001) Machine Learning 45:5-32.
Buyse M, Loi S, van’t Veer L, et al. (2006) Validation and clinical utility of a 70-gene
prognostic signature for women with node-negative breast cancer. J Natl Cancer
Inst. 98(17):1183-1192.
Calin GA, Dumitru CD, Shimizu M, et al (2002) Frequent deletions and down-regulation of
micro- RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia
PNAS.;99(24):15524-9.
Cardoso F, Van’t Veer L, Rutgers E, et al. (2008) Clinical application of the 70-gene profile:
the MINDACT trial. J Clin Oncol; 26:729–735.
Carey LA, Dees EC, Sawyer L et al (2007). The triple negative paradox: Primary tumor
chemosensitivity of breast cancer subtypes. Clin Cancer Res 2007; 13:2329 –2334.
Castoldi M, Schmidt S, Benes V, et al (2006) A sensitive array for MiRNA expression
profiling (miChip) based on locked nucleic acids (LNA). RNA;12(5):913-20.
Clarke R, Liu MC, Bouker KB, et al (2003). Antiestrogen resistance in breast cancer and the
role of estrogen receptor signaling. Oncogene;22(47):7316-39.
Cortez MA, Calin GA (2009). MiRNA identification in plasma and serum: a new tool to
diagnose and monitor diseases. Expert Opin Biol Ther;9(6):703-711.
114 Computational Biology and Applied Bioinformatics
Li J, Smyth P, Flavin R, et al. (2007) Comparison of miRNA expression patterns using total
RNA extracted from matched samples of formalin- fixed paraffin-embedded (FFPE)
cells and snap frozen cells. BMC biotechnology;7:36
Lisboa PJ, Taktak AF(2006). The use of artificial neural networks in decision support in
cancer: A systematic review. Neural Networks;19:408-415.
Lo SS, Norton J, Mumby PB et al(2007). Prospective multicenter study of the impact of the
21-gene recurrence score (RS) assay on medical oncologist (MO) and patient (pt)
adjuvant breast cancer (BC) treatment selection. J Clin Oncol;25(18 suppl):577
Loi S, Haibe-Kains B, Desmedt C, et al (2007). Definition of clinically distinct molecular
subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J.
Clin. Oncol. 25, 1239–1246
Lowery AJ, Miller N, McNeill RE, Kerin MJ (2008). MicroRNAs as prognostic indicators and
therapeutic targets: potential effect on breast cancer management. Clin Cancer Res.
;14(2):360-5.
Lowery AJ, Miller N, Devaney A, et al (2009) . MicroRNA signatures predict estrogen
receptor, progesterone receptor and Her2/neu receptor status in breast cancer.
Breast Cancer Res.;11(3):R27.
Lu J, Getz G, Miska EA, et al.(2005) MiRNA expression profiles classify human cancers.
Nature. 2005;435(7043):834-8
Ma XJ, Hilsenbeck SG, Wang W et al (2006). The HOXB13:IL17BR expression index is a
prognostic factor in early-stage breast cancer. J Clin Oncol; 24:4611– 4619.
Ma J, Dong C, Ji C (2010). MicroRNA and drug resistance. Cancer Gene Ther, 17(8), 523‐531
Manning AT, Garvin JT, Shahbazi RI, et al (2007). Molecular profiling techniques and
bioinformatics in cancer research Eur J Surg Oncol;33(3):255-65.
Marchionni L, Wilson RF, Wolff AC, et al (2008). Systematic review: gene expression
profiling assays in early-stage breast cancer. Ann Intern Med.;148(5):358-369.
Marengo E, Robotti E, Righetti PG, et al (2004). Study of proteomic changes associated with
healthy and tumoral murine samples in neuroblastoma by principal component
analysis and classification methods. Clinica Chimica Acta;345:55-67.
Masuda N, Ohnishi T, KawamotoS , et al (1999) Analysis of chemical modification of RNA
from formalin-fixed samples and optimization of molecular biology applications
for such samples. Nucleic Acids Res. 27, 4436–4443
Matharoo-Ball B, Ratcliffe L, Lancashire L, et al (2007). Diagnostic biomarkers differentiating
metastatic melanoma patients from healthy controls identified by an integrated
MALDI-TOF mass spectrometry/bioinformatic approach. Proteomics Clinical
Applications; 1:605-620
Mattie MD, Benz CC, Bowers J, et al (2006). Optimized high-throughput MiRNA expression
profiling provides novel biomarker assessment of clinical prostate and breast
cancer biopsies. Molecular cancer;5:24
Michiels S, Koscielny S, Hill C (2005). Prediction of cancer outcome with microarrays: a
multiple random validation strategy. Lancet;365: 488-92.
Michiels S, Koscielny S, Hill C (2007). Interpretation of microarray data in cancer. British
Journal of Cancer;96:1155–1158.
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 117
Mina L, Soule SE, Badve S, et al. (2007) Predicting response to primary chemotherapy: gene
expression profiling of paraffin-embedded core biopsy tissue. Breast Cancer Res
Treat ;103:197–208.
Mitchell PS, Parkin RK, Kroh EM, et al (2008).Circulating MiRNAs as stable blood- based
markers for cancer detection.PNAS;105(30):10513-8
Mook S, Schmidt MK, Viale G, et al (2009). The 70-gene prognosis signature predicts disease
outcome in breast cancer patients with 1–3 positive lymph nodes in an
independent validation study. Breast Cancer Res Treat;116:295–302.
Mootha VK, Lindgren CM, Eriksson KF, et al (2003). PGC-1alpha Responsive Genes
Involved in Oxidative Phosphorylation are Coordinately Downregulated in Human
Diabetes, Nature Genetics 34(3):267-73
Nielsen TO, Hsu FD, Jensen K et al (2004). Immunohistochemical and clinical
characterization of the basal-like subtype of invasive breast carcinoma. Clin Cancer
Res ;10:5367-74.
Oberley MJ, Tsao J, Yau P, Farnham PJ (2004). Highthroughput screening of chromatin
immunoprecipitates using CpG-island microarrays. Methods Enzymol;376: 315-
34.
Oostlander AE, Meijer GA, Ylstra B (2004). Microarraybased comparative genomic
hybridization and its applications in human genetics. Clin Genet, 66: 488-495.
Osborne CK(1998) Tamoxifen in the treatment of breast cancer. N Engl J Med;339(22):1609-
18.
Paik S, Shak S, Tang G, et al (2004). A multigene assay to predict recurrence of tamoxifen-
treated, node-negative breast cancer. N Engl J Med;351(27):2817-26.
Paik, S. Kim, C. Y, Song, Y. K. & Kim, W. S. (2005) Technology insight: application of
molecular techniques to formalin-fixed paraffin-embedded tissues from breast
cancer. Nat. Clin. Pract. Oncol;2:246–254
Paik S, Tang G, Shak S, , et al(2006). Gene expression and benefit of chemotherapy in women
with node-negative, estrogen receptor-positive breastcancer. J Clin Oncol; 24 (23) :
3726-34.
Parker JS, Mullins M, Cheang MC, et al (2009). Supervised risk predictor of breast cancer
based on intrinsic subtypes. J Clin Oncol;27:1160–1167.
Pedraza V, Gomez-Capilla JA, Escaramis G, et al (2010) Gene expression signatures in breast
cancer distinguish phenotype characteristics, histologic subtypes, and tumor
invasiveness. Cancer.;116(2):486-96.
Peppercorn J, Perou CM, Carey LA. (2008) Molecular subtypes in breast cancer evaluation
and management: divide and conquer. Cancer Invest;26:1–10.
Perou CM, Sorlie T, Eisen MB, et al (2000). Molecular portraits of human breast tumours.
Nature;406: 747-52.
Pusztai L, Mazouni C, Anderson K, et al (2006). Molecular classification of breast cancer:
limitations and potential. Oncologist;11:868–877.
Quackenbush J (2001). Computational analysis of microarray data. Nature Reviews Genetics
;2:418-27.
118 Computational Biology and Applied Bioinformatics
Tessel MA, Krett NL, Rosen ST (2010).Steroid receptor and microRNA regulation in cancer.
Curr Opin Oncol;22(6):592‐597
The FlyBase Consortium (1999). The FlyBase database of the Drosophila Genome Projects
and community literature. Nucleic Acids Res;27:85–88.
van de Vijver M, He Y, van’t Veer L, et al (2002). A gene-expression signature as a predictor
of survival in breast cancer. N Engl J Med;347:1999–2009.
van’t Veer L, Dai H, van de Vijver M, et al (2002). Gene expression profiling predicts clinical
outcome of breast cancer. Nature;415:530–6.
Vapnik V, Lerner A (1963). Pattern recognition using generalized portrait method.
Automation and Remote Control 1963;24:774-780.
Volinia S, Calin GA, Liu CG, et al (2006). A MiRNA expression signature of human solid
tumors defines cancer gene targets. PNAS.;103(7):2257-61.
Wadsworth JT, Somers KD, Cazares LH, et al (2004) Serum protein profiles to identify head
and neck cancer. Clinical Cancer Research;10:1625-1632.
Wang Y, Klijn JG, Zhang Y et al (2005). Gene-expression profiles to predict
distant metastasis of lymph-node-negative primary breast cancer. Lancet;365:671–
679.
Warnat P, Eils R, Brors B (2005). Cross-platform analysis of cancer microarray data
improves gene expression based classification of phenotypes. BMC
bioinformatics 2;6: 265.
Weigelt B, Geyer FC, Natrajan R, et al (2010) The molecular underpinning of lobular
histological growth pattern: a genome-wide transcriptomic analysis of invasive
lobular carcinomas and grade- and molecular subtype-matched invasive ductal
carcinomas of no special type. J Pathol;220(1):45-57
Wirapati P, Sotiriou C, Kunkel S, et al (2008). Meta-analysis of gene expression profiles in
breast cancer: toward a unified understanding of breast cancer subtyping and
prognosis signatures. Breast Cancer Res;10:R65.
Wong JWH, Cagney G, Cartwright HM (2005). SpecAlign—processing and alignment of
mass spectra datasets. Bioinformatics;21:2088-2090
Xi Y, Nakajima G, Gavin E, et al (2007). Systematic analysis of MiRNA expression of RNA
extracted from fresh frozen and formalin-fixed paraffin-embedded samples.
RNA;13(10):1668-74.
Xue C, Li F, He T, Liu GP, Li Y, Xuegong Z (2005). Classification of real and pseudo
microRNA precursors using local structure-sequence features and support vector
machine. BMC Bioinformatics;6:310.
Xu R, Xu J, Wunsch DC (2009). Using default ARTMAP for cancer classification with
MicroRNA expression signatures, International Joint Conference on Neural Networks,
pp.3398-3404,
Yan PS, Perry MR, Laux DE, et al (2000). CpG island arrays: an application toward
deciphering epigenetic signatures of breast cancer. Clinical Cancer Research; 6:
1432-38.
Yousef M, Najami N, Khalifa W (2010). A comparison study between one-class and two-
class machine learning for MicroRNA target detection. Journal of Biomedical
Science and Engineering ;3:247-252.
120 Computational Biology and Applied Bioinformatics
1. Introduction
MicroRNAs (miRNAs) are a class of small RNAs of approximately 22 nucleotides in length
that regulate eukaryotic gene expression at the post-transcriptional level (Ambros 2004;
Bartel 2004; Filipowicz et al. 2008). They are transcribed as long precursor RNA molecules
(pri-miRNAs) and are successively processed by two key RNAses, namely Drosha and
Dicer, into their mature forms of ~22 nucleotides (Kim 2005; Kim et al. 2009). These small
RNAs regulate gene expression by binding to target sites in the 3’ untranslated region of
mRNAs (3’UTR). Recognition of the 3’UTR by miRNAs is mediated through complementary
hybridization at least between nucleotides 2-8, numbered from the 5’ end (seed sequences)
of the small RNAs, and complementary sequences present in the 3’UTRs of mRNAs
(Ambros 2004; Bartel 2004; Zamore and Haley 2005). Perfect or nearly perfect
complementarities between miRNAs and their 3’UTRs induce mRNA cleavage by the RNA-
induced silencing complex (RISC), whereas imperfect base pair matching may induce
translational silencing through various molecular mechanisms, namely inhibition of
translation initiation and activation of mRNA storage in P-bodies and/or stress granules
(Pillai et al. 2007).
This class of small RNAs is well conserved between eukaryotic organisms, suggesting that
they appeared early in eukaryotic evolution and play fundamental roles in gene expression
control. Each miRNA may repress hundreds of mRNAs and regulate a wide variety of
biological processes, namely developmental timing (Feinbaum and Ambros 1999; Lau et al.
2001), cell differentiation (Tay et al. 2008), immune response (Ceppi et al. 2009) and infection
(Chang et al. 2008). For this reason, their identification is essential to understand eukaryotic
biology. Their small size, low abundance and high instability complicated early
identification, but these obstacles have been overcome by next generation sequencing
approaches, namely the Genome SequencerTM FLX from Roche, the Solexa/Illumina
Genome Analyzer and the Applied Biosystems SOLiDTM Sequencer which are currently
being routinely used for rapid miRNA identification and quantification in many eukaryotes
(Burnside et al. 2008; Morin et al. 2008; Schulte et al. 2010).
As in other vertebrates, miRNAs control gene expression in zebrafish, since defective
miRNA processing arrest development (Wienholds et al. 2003). Also, a specific subset of
miRNAs is required for brain morphogenesis in zebrafish embryos, but not for cell fate
determination or axis formation (Giraldez et al. 2005). In other words, miRNAs play an
122 Computational Biology and Applied Bioinformatics
important role in zebrafish organogenesis and their expression at specific time points is
relevant to organ formation and differentiation. Since identification of the complete set of
miRNAs is fundamental to fully understand biological processes, we have used high
throughput 454 DNA pyrosequencing technologies to fully characterize the zebrafish
miRNA population (Soares et al. 2009). For this, a series of cDNA libraries were prepared
from miRNAs isolated at different embryonic time points and from fully developed organs
sequenced using the Genome SequencerTM FLX. This platform yields reads of up to 200
bases each and can generate up to 1 million high quality reads per run, which provides
sufficient sequencing coverage for miRNA identification and quantification in most
organisms. However, deep sequencing of small RNAs may pose some problems that need to
be taken into consideration to avoid sequencing biases. For example, library preparation and
computational methodologies for miRNA identification from large pool of reads need to be
optimized. There are many variables to consider, namely biases in handling large sets of
data, sequencing errors and RNA editing or splicing. If used properly, deep sequencing
technologies have enormous analytical power and have been proven to be very robust in
retrieving novel small RNA molecules. One of the major challenges when analyzing deep
sequencing data is to differentiate miRNAs from other small RNAs and RNA degradation
products.
Different research groups are developing dedicated computational methods for the
identification of miRNAs from large sets of sequencing data generated by next
generation sequencing experiments. miRDeep (http://www.mdc-
berlin.de/en/research/research_teams/systems_biology_of_gene_regulatory_elements/pr
ojects/miRDeep/index.html) (Friedlander et al. 2008) and miRanalizer
(http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php) (Hackenberg et al.
2009) can both detect known miRNAs annotated in miRBase and predict new miRNAs
(although using different prediction algorithms) from small RNA datasets generated by
deep sequencing. Although these online algorithms are extremely useful for miRNA
identification, custom-made pipeline analysis of deep sequencing data may be performed in
parallel to uncover the maximum number of small non-coding RNA molecules present in
the RNA datasets.
In this chapter, we discuss the tools and computational pipelines used for miRNA
identification, discovery and expression from sequencing data, based on our own experience
of deep sequencing of zebrafish miRNAs, using the Genome SequencerTM FLX from Roche.
We show how a combination of a public available, user-friendly algorithm, such as
miRDeep, with custom-built analysis pipelines can be used to identify non-coding RNAs
and uncover novel miRNAs. We also demonstrate that population statistics can be applied
to statistical analysis of miRNA populations identified during sequencing and we
demonstrate that robust computational analysis of the data is crucial for extracting the
maximum information from sequencing datasets.
We have used the Genome SequencerTM FLX system (454 sequencing) to identify zebrafish
miRNAs from different developmental stages and from different tissues. For this, cDNA
libraries are prepared following commonly used protocols (Droege M and Hill B. 2008;
Soares et al. 2009). These libraries contain specific adaptors for the small RNA molecules
containing specific priming sites for sequencing. After sequencing, raw data filtration and
extraction is performed using specialist software incorporated into the Genome SequencerTM
FLX system (Droege M and Hill B. 2008). Raw images are processed to remove background
noise and the data is normalized. Quality of raw sequencing reads is based on complete read
through of the adaptors incorporated into the cDNA libraries. The 200 base pair of 454
sequencing reads provide enough sequencing data for complete read through of the
adaptors and miRNAs. During quality control, the adaptors are trimmed and the resulting
sequences are used for further analysis. Sequences ≥ 15 nucleotides are kept for miRNA
identification, and constitute the small RNA sequencing data.
Other sequencing platforms, such as Illumina/Solexa and SOLiDTM, also have specialist
software for raw data filtration. DSAP, for example, is an automated multiple-task web
service designed to analyze small RNA datasets generated by the Solexa platform (Huang et
al. 2010). This software filters raw data by removing sequencing adaptors and poly-
A/T/C/G/N nucleotides. In addition, it performs non-coding RNA matching by sequence
homology mapping against the non-coding RNA database Rfam (rfam.sanger.ac.uk/) and
detects known miRNAs in miRBase (Griffiths-Jones et al. 2008), based on sequence
homology.
The SOLiDTM platform has its own SOLiD™ System Small RNA Analysis Pipeline Tool
(RNA2MAP), which is available online (http://solidsoftwaretools.com/gf/project/
rna2map). This software is similar to DSAP, as it filters raw data and identifies known
miRNAs in the sequencing dataset by matching reads against miRBase sequences and
against a reference genome. Although these specialist software packages are oriented for
miRNA identification in sequencing datasets they are not able to identify novel miRNAs.
For this, datasets generated from any of the sequencing platforms available have to be
analyzed using tools that include algorithms to identify novel miRNAs.
sequences deposited in miRBase. The sequence with the highest expression is always
considered as the mature miRNA sequence by the miRDeep algorithm. All hairpins that are
not processed by DICER will not match a typical secondary miRNA structure and are
filtered out.
After aligning the sequences against the desired genome using megaBlast, the blast output is
parsed for miRDeep uploading. As sequencing errors, RNA editing and RNA splicing may
alter the original miRNA sequence, one can re-align reads that do not match the genome
using SHRiMP (http://compbio.cs.toronto.edu/shrimp/). The retrieved alignments are also
parsed for miRDeep for miRNA prediction. miRDeep itself allows up to 2 mismatches in the
3’ end of each sequence, which already accounts with some degree of sequencing errors that
might have occurred.
Reads matching more than 10 different genome loci are generally discarded, as they likely
constitute false positives. The remaining alignments are used as guidelines for excision of
the potential precursors from the genome. After secondary structure prediction of putative
precursors, signatures are created by retaining reads that align perfectly with those putative
precursors to generate the signature format. miRNAs are predicted by discarding non-
plausible DICER products and scoring plausible ones. The latter are blasted against mature
miRNAs deposited in miRBase, to extract known and conserved miRNAs. The remaining
reads are considered novel miRNAs.
In order to evaluate the sensitivity of the prediction and data quality, miRDeep calculates
the false positive rate, which should be below 10%. For this, the signature and the structure-
pairings in the input dataset are randomly permutated, to test the hypothesis that the
structure (hairpin) of true miRNAs is recognized by DICER and causes the signature.
miRanalizer (Hackenberg et al. 2009) is a recently developed web server tool that detects
both known miRNAs annotated in miRBase and other non-coding RNAs by mapping
sequences to non-coding RNA libraries, such as Rfam. This feature is important, as more
classes of small non coding RNAs are being unravelled and their identification can provide
clues about their functions. At the same time, by removing reads that match other non
coding RNA classes, it reduces the false positive rate in the prediction of novel miRNAs, as
these small non coding RNAs can be confused with miRNAs. For novel miRNA prediction,
miRanalizer implements a machine learning approach based on the random forest method,
with the number of trees set to 100 (Breiman 2001). miRanalyzer can be applied to miRNA
discovery in different models, namely human, mouse, rat, fruit-fly, round-worm, zebrafish
and dog, and uses datasets from different models to build the final prediction model. In
comparison to miRDeep, this is disadvantageous as the latter can predict novel miRNAs
from any model. All pre-miRNAs candidates that match known miRNAs are extracted from
the experimental dataset and labelled as positive instances. Next, an equal amount of pre-
miRNA candidates from the same dataset are selected by random selection with the known
miRNAs removed and labelled as negative. Pre-processing of reads corresponding to
putative new miRNAs includes clustering of all reads that overlap with the genome, testing
whether the start of the current read overlaps less than 3 nucleotides with the end position
of previous reads. This avoids DICER products grouping together and be considered non-
miRNAs products, which would increase false negatives. Besides, clusters of more than 25
base pairs in length are discarded and the secondary structure of the miRNA is predicted
via RNAfold (Hofacker 2003). Structures where the cluster sequence is not fully included
and where part of the stem cannot be identified as a DICER product are discarded.
Computational Tools for Identification of microRNAs in Deep Sequencing Data Sets 125
miRTools is a comprehensive web server that can be used for characterization of the small
RNA transcriptome (Zhu et al. 2010). It offers some advantages relative to miRDeep and
miRanalyzer, since it integrates multiple computational approaches including tools for raw
data filtration, identification of novel miRNAs and miRNA expression profile generation. In
order to detect novel miRNAs, miRTools analyze all sequences that are not annotated to
known miRNAs, other small non-coding RNAs and genomic repeats or mRNA that match the
reference genome. These sequences are extracted and their RNA secondary structures are
predicted using RNAfold (Hofacker 2003) and novel miRNAs are identified using miRDeep.
2.3 Analysis of discarded reads by miRNA identification algorithms can identify new
miRNAs
Since miRDeep and miRanalyzer are highly stringent algorithms, some miRNAs may escape
detection. The false negative discovery rate can, however be calculated by simply
performing a megaBlast search of the sequencing data against the miRNAs deposited in
miRBase. Perfect alignments are considered true positives. The list of known miRNAs
identified by this method is compared to the list of known miRNAs identified by miRDeep
or miRanalyzer. False negatives are those miRNAs present in the blast analysis, but which
were missed by the miRNA prediction algorithms. This is, in our opinion, an essential
control, as it gives information about the percentage of miRNAs that may have escaped
miRDeep or miRanalyzer analysis. We have detected ~19% of false negatives, which
prompted us to develop a parallel pipeline to analyze reads that may have been incorrectly
discarded by the original algorithm (Figure 2). This analysis can and should be performed
independently of the algorithm used to retrieve miRNAs from deep sequencing data.
To overcome the lack of sensitivity of miRDeep, our parallel bioinformatics pipeline
includes a megaBlast alignment between the dataset of discarded reads by miRDeep and
mature sequences deposited in miRBase. Besides, novel transcripts encoding miRNAs
predicted by computational tools can be retrieved from the latest Ensembl version using
BioMart and also from literature predictions. These sequences are then used to perform a
megaBlast search against the sequencing data. The transcripts with perfect matches and
alignment length > 18 nucleotides are kept for further processing. These transcripts are then
compared with the mature miRNAs deposited in miRBase and those that produce imperfect
alignments or do not produce alignments are considered novel miRNAs. Imperfect alignments
may identify conserved miRNAs if there is a perfect alignment in the seed region.
Complementary alignments of our dataset reads against the zebrafish genome with SHRiMP
alignments and complementary miRDeep analysis with an analysis of the reads discarded
by this algorithm, allowed us to identify 90% of the 192 zebrafish miRNAs previously
identified, plus 107 miRNA star sequences and 25 novel miRNAs.
comparison of read numbers (Chen et al. 2005). Normalization assumes that the small RNA
population is constant and is represented by an arbitrary value (e.g. 1000), and can be
calculated as indicated below:
Fig. 2. Bioinformatics pipeline of reads discarded by miRDeep (-i and –d stand for query file
and database respectively).
128 Computational Biology and Applied Bioinformatics
Using this formula it is possible to generate miRNA profiles for each sample sequenced.
These profiles provide valuable information about relative miRNA expression, which is
essential to understand miRNA function in different tissues. In order to compare miRNA
profiles of two deep sequencing samples (e.g. condition vs control), a two-side t-test can be
applied to determine miRNA levels. Sequence count values should be log-transformed to
stabilize variance (Creighton et al. 2009). miRTools already include a computational
approach to identify significantly differentially expressed miRNAs (Zhu et al. 2010). It
compares differentially expressed miRNAs in multiple samples after normalization of the
read count of each miRNA with the total number of miRNA read counts which are matched
to the reference genome. The algorithm calculates statistical significance (P-value) based on
a Bayesian method (Audic and Claverie 1997), which accounts for sampling variability of
tags with low counts. Significantly differentially expressed miRNAs are those that show P-
values <0.01 and at least 2-fold change in normalized sequence counts.
Fig. 3. Statistical analysis of miRNA population. A) A rarefaction curve of the total number
of reads generated by deep sequencing versus the total number of miRNA species identified
is shown. The steep curve levels off towards an asymptote, which indicates the point where
additional sampling will not yield new miRNAs. B) Homogeneity of the miRNA population
was assessed using population statistics and by determining the Chao1 diversity estimator.
The Chao1 reached a mean stable value of 207, with lower and upper limits of 200.37 to
229.66, respectively, for a level of confidence of 95%.
3. Conclusion
Small non-coding RNAs are a class of molecules that regulate several biological processes.
Identification of such molecules is crucial to understand the molecular mechanisms that
they regulate. There are already several deep sequencing approaches to identify these
molecules. However, correct interpretation of sequencing data depends largely on the
bioinformatics and statistical tools available. There are online algorithms that facilitate
identification of miRNAs and other small non-coding RNAs from large datasets. However,
there are no tools to predict novel small non-coding RNAs beyond miRNAs. As those
additional RNA classes, namely piRNAs, snRNAs and snoRNAs are processed differently,
the development of algorithms based solely on their biogenesis is challenging. Moreover,
the available algorithms have some limitations and additional data analysis should be
performed with the discarded reads that can potentially hold non-conventional miRNA
molecules. Analysis of deep sequencing data is a powerful methodology to identify novel
miRNAs in any organism and determine their expression profiles. The challenge is to deal
with increasing dataset size and to integrate the information generated by small RNA
sequencing experiments. This will be essential to understand how different RNA classes are
related. Computational tools to integrate small non-coding RNA data with gene expression
data and target predictions are pivotal to understand the biological processes regulated by
miRNAs and other small non-coding RNA classes.
130 Computational Biology and Applied Bioinformatics
4. References
Ambros, V. (2004). The functions of animal microRNAs. Nature 431(7006), 350-355.
Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles.
Genome Res. 7(10), 986-995.
Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2),
281-297.
Breiman, L. Random forests. Machine Learning (2001). 45:28.
Burnside, J., Ouyang, M., Anderson, A., Bernberg, E., Lu, C., Meyers, B. C., Green, P. J.,
Markis, M., Isaacs, G., Huang, E., and Morgan, R. W. (2008). Deep sequencing of
chicken microRNAs. Bmc Genomics 9.
Ceppi, M., Pereira, P. M., Dunand-Sauthier, I., Barras, E., Reith, W., Santos, M. A., and
Pierre, P. (2009). MicroRNA-155 modulates the interleukin-1 signaling pathway in
activated human monocyte-derived dendritic cells. Proc. Natl. Acad. Sci. U. S. A
106(8), 2735-2740.
Chang, J. H., Cruo, J. T., Jiang, D., Guo, H. T., Taylor, J. M., and Block, T. M. (2008). Liver-
specific MicroRNA miR-122 enhances the replication of hepatitis C virus in
nonhepatic cells. Journal of Virology 82(16), 8215-8223.
Chao, A. (1987). Estimating the Population-Size for Capture Recapture Data with Unequal
Catchability. Biometrics 43(4), 783-791.
Chen, P. Y., Manninga, H., Slanchev, K., Chien, M. C., Russo, J. J., Ju, J. Y., Sheridan, R., John,
B., Marks, D. S., Gaidatzis, D., Sander, C., Zavolan, M., and Tuschl, T. (2005). The
developmental miRNA profiles of zebrafish as determined by small RNA cloning.
Genes & Development 19(11), 1288-1293.
Colwell, R. K., and Coddington, J. A. (1994). Estimating Terrestrial Biodiversity Through
Extrapolation. Philosophical Transactions of the Royal Society of London Series B-
Biological Sciences 345(1311), 101-118.
Creighton, C. J., Reid, J. G., and Gunaratne, P. H. (2009). Expression profiling of microRNAs
by deep sequencing. Brief. Bioinform. 10(5), 490-497.
Droege M, and Hill B. The Genome Sequencer FLX trade mark System-Longer reads, more
applications, straight forward bioinformatics and more complete data sets.
J.Biotechnol. 136 (1-2): 3-10. 2008.
Feinbaum, R., and Ambros, V. (1999). The timing of lin-4 RNA accumulation controls the
timing of postembryonic developmental events in Caenorhabditis elegans. Dev.
Biol. 210(1), 87-95.
Filipowicz, W., Bhattacharyya, S. N., and Sonenberg, N. (2008). Mechanisms of post-
transcriptional regulation by microRNAs: are the answers in sight? Nat. Rev. Genet.
9(2), 102-114.
Friedlander, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., Knespel, S., and
Rajewsky, N. (2008). Discovering microRNAs from deep sequencing data using
miRDeep. Nature Biotechnology 26(4), 407-415.
Giraldez, A. J., Cinalli, R. M., Glasner, M. E., Enright, A. J., Thomson, J. M., Baskerville, S.,
Hammond, S. M., Bartel, D. P., and Schier, A. F. (2005). MicroRNAs regulate brain
morphogenesis in zebrafish. Science 308(5723), 833-838.
Computational Tools for Identification of microRNAs in Deep Sequencing Data Sets 131
Griffiths-Jones, S., Saini, H. K., van, D. S., and Enright, A. J. (2008). miRBase: tools for
microRNA genomics. Nucleic Acids Res. 36(Database issue), D154-D158.
Hackenberg, M., Sturm, M., Langenberger, D., Falcon-Perez, J. M., and Aransay, A. M.
(2009). miRanalyzer: a microRNA detection and analysis tool for next-
generation sequencing experiments. Nucleic Acids Res. 37(Web Server issue),
W68-W76.
Hofacker, I. L. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31(13),
3429-3431.
Huang, P. J., Liu, Y. C., Lee, C. C., Lin, W. C., Gan, R. R., Lyu, P. C., and Tang, P. (2010).
DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res. 38(Web
Server issue), W385-W391.
Kim, V. N. (2005). MicroRNA biogenesis: coordinated cropping and dicing. Nat. Rev. Mol.
Cell Biol. 6(5), 376-385.
Kim, V. N., Han, J., and Siomi, M. C. (2009). Biogenesis of small RNAs in animals. Nat. Rev.
Mol. Cell Biol. 10(2), 126-139.
Lau, N. C., Lim, L. P., Weinstein, E. G., and Bartel, D. P. (2001). An abundant class of tiny
RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294(5543),
858-862.
Morin, R. D., O'Connor, M. D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A. L.,
Zhao, Y., McDonald, H., Zeng, T., Hirst, M., Eaves, C. J., and Marra, M. A. (2008).
Application of massively parallel sequencing to microRNA profiling and discovery
in human embryonic stem cells. Genome Research 18(4), 610-621.
Pillai, R. S., Bhattacharyya, S. N., and Filipowicz, W. (2007). Repression of protein synthesis
by miRNAs: how many mechanisms? Trends Cell Biol. 17(3), 118-126.
Schulte, J. H., Marschall, T., Martin, M., Rosenstiel, P., Mestdagh, P., Schlierf, S., Thor,
T., Vandesompele, J., Eggert, A., Schreiber, S., Rahmann, S., and Schramm, A.
(2010). Deep sequencing reveals differential expression of microRNAs in
favorable versus unfavorable neuroblastoma. Nucleic Acids Res. 38(17), 5919-
5928.
Soares, A. R., Pereira, P. M., Santos, B., Egas, C., Gomes, A. C., Arrais, J., Oliveira, J. L.,
Moura, G. R., and Santos, M. A. S. (2009). Parallel DNA pyrosequencing unveils
new zebrafish microRNAs. Bmc Genomics 10.
Tay, Y. M. S., Tam, W. L., Ang, Y. S., Gaughwin, P. M., Yang, H., Wang, W. J., Liu, R. B.,
George, J., Ng, H. H., Perera, R. J., Lufkin, T., Rigoutsos, I., Thomson, A. M., and
Lim, B. (2008). MicroRNA-134 modulates the differentiation of mouse embryonic
stem cells, where it causes post-transcriptional attenuation of Nanog and LRH1.
Stem Cells 26(1), 17-29.
Wienholds, E., Koudijs, M. J., van Eeden, F. J., Cuppen, E., and Plasterk, R. H. (2003). The
microRNA-producing enzyme Dicer1 is essential for zebrafish development. Nat.
Genet. 35(3), 217-218.
Zamore, P. D., and Haley, B. (2005). Ribo-gnome: the big world of small RNAs. Science
309(5740), 1519-1524.
132 Computational Biology and Applied Bioinformatics
Zhu, E., Zhao, F., Xu, G., Hou, H., Zhou, L., Li, X., Sun, Z., and Wu, J. (2010). mirTools:
microRNA profiling and discovery based on high-throughput sequencing. Nucleic
Acids Res. 38(Web Server issue), W392-W397.
7
1. Introduction
Mass Spectrometry (MS)-based strategies featuring chemical or biochemical probing
represent powerful and versatile tools for studying structural and dynamic features of
proteins and their complexes. In fact, they can be used both as an alternative for systems
intractable by other established high-resolution techniques, and as a complementary
approach to these latter, providing different information on poorly characterized or very
critical regions of the systems under investigation (Russell et al., 2004). The versatility of
these MS-based methods depends on the wide range of usable probing techniques and
reagents, which makes them suitable for virtually any class of biomolecules and complexes
(Aebersold et al., 2003). Furthermore, versatility is still increased by the possibility of
operating at very different levels of accuracy, ranging from qualitative high-throughput fold
recognition or complex identification (Young et al., 2000), to the fine detail of structural
rearrangements in biomolecules after environmental changes, point mutations or complex
formations (Nikolova et al.,1998; Millevoi et al., 2001; Zheng et al., 2007). However, these
techniques heavily rely upon the availability of powerful computational approaches to
achieve a full exploitation of the information content associated with the experimental data.
The determination of three-dimensional (3D) structures or models by MS-based techniques
(MS3D) involves four main activity areas: 1) preparation of the sample and its derivatives
labelled with chemical probes; 2) generation of derivatives/fragments of these molecules for
further MS analysis; 3) interpretation of MS data to identify those residues that have reacted
with probes; 4) derivation of 3D structures consistent with information from previous steps.
Ideally, this procedure should be considered the core of an iterative process, where the final
model possibly prompts for new validating experiments or helps the assignment of
ambiguous information from the mass spectra interpretation step.
Both the overall MS3D procedure and its different steps have been the subject of several
accurate review and perspective articles (Sinz, 2006; Back et al., 2003; Young et al., 2000;
Friedhoff, 2005, Renzone, et al., 2007a). However, with the partial exception of a few recent
papers (Van Dijk et al., 2005; Fabris et al., 2010; Leitner et al., 2010), the full computational
detail behind 3D model building (step 4) has generally received less attention than the
former three steps. Structural derivation in MS3D, in fact, is considered a special case of
structural determination from sparse/indirect constraints (SD-SIC). Nevertheless,
information for modelling derivable from MS-based experiments exhibits some peculiar
134 Computational Biology and Applied Bioinformatics
features that differentiate it from the data types associated with other experimental
techniques involved in SD-SIC procedures, such as nuclear magnetic resonance (NMR),
electron microscopy, small-angle X-ray scattering (SAXS), Förster resonance energy transfer
(FRET) and other fluorescence spectroscopy techniques, for which most of the currently
available SD-SIC methods have been developed and tailored (Förster et al., 2008; Lin et al.,
2008; Nilges et al., 1988a; Aszodi et al., 1995).
In this view, this study will illustrate possible approaches to model building in MS3D,
underlining the main issues related to this specific field and outlining some of the possible
solutions to these problems. Whenever possible, alternative methods employing either
different programs selected among most popular applications in homology modelling,
threading, docking and molecular dynamics (MD), or different strategies to exploit the
information contained in MS data will be described. Discussion will be limited to packages
either freely available, or costing less than 1,000 US$ for academic users. For programs, the
home web address has been reported, rather than references that are very often partial
and/or outdated. Some examples, derived from the literature available in this field, or
developed ad hoc to illustrate some critical features of the computational methods in MS3D,
should clarify potentiality and current limitations of this approach.
Fig. 1. Flowchart of a generic MS3D modelling approach. Magenta, violet and pink represent
steps in which MS-based information is applied. Triangular arrows indicate use of MS-based
data. Dotted lines and borders are used for optional refinement stages. Blue codes in white
circles/ellipses label the corresponding stages within the text.
136 Computational Biology and Applied Bioinformatics
Accordingly, both the protocol used to implement MS-based information into modelling
procedures and the MS-based data themselves generally represent very critical features,
which require the maximum attention during computational setup and final analyses. In
addition, implementation of restraints in the sampling procedure either requires some
purposely programming activity, or severely limits the choice of modelling tools to
programs already including suitable user-defined restraints.
Use of MS-based information in post-sampling analyses (path S2-F2a-F2b-F2c) to help
classifying and selecting the final models exhibits a mostly complementary profile of
advantages-disadvantages. In fact, it decreases the sampling efficiency of the modelling
methods (S2), by leading to a potentially very large number of models to be subsequently
discarded on the mere basis of their violations of MS-derived restraints (F2a), and by
providing no ab initio limitations to the available conformational/configurational space of
the system. Furthermore, it may still require programming activity if available restraint
analysis tools (F2a) are lacking or inefficient in the case of the implemented information.
However, this approach warrants the maximum freedom to the user in the choice of the
sampling program; this may result very useful in those cases where the peculiar features of
a specific program are strongly required to model the investigated system. In addition, a
compared analysis of both structural features and scoring function values between models
accepted and rejected on the basis of MS-based data may allow the identification of potential
issues in the selected models and the corresponding data sets (steps F2c-X).
2.2.2 Crosslinks
Cross-linking information often directly contribute to the model building procedure (under
the form of distance restraints or direct linker addition to the simulated systems) (stage S1 in
Fig.1), in addition to their model validation/interpretation role (stages F1, F2a, FF).
Computational Methods in Mass Spectrometry-Based Protein 3D Studies 137
A further group of approaches, presently under active development and already exhibiting
good performances in CASP and other benchmark and testing experiments, is formed by the
“integrative” or “hybrid” methods. They combine information from a varied set of
computational and experimental sources, often acting as/based on “metaservers”, i.e.
servers that submit a prediction request to several other servers, then averaging their results
to provide a consensus that in many cases is more reliable than the single predictions from
which it originated. Some metaservers use the consensus as input to their own prediction
algorithms to further elaborate the models.
In order to provide some guidelines for structural prediction/refinement tasks in the
presence of MS-based data, a general procedure will be outlined for protein fold/structure
modelling. The starting step in protein modelling is usually represented by a search for
already structurally-characterized similar sequences. Sensitive methods for sequence
homology detection and alignment have been developed, based on iterative profile searches,
e.g. PSI-Blast (Altschul et al., 1997), Hidden Markov Models, e.g. SAM (K. Karplus et al.
1998), HMMER (Eddy, 1998), or profile-profile alignment such as FFAS03 (Jaroszewski et al.,
2005), profile.scan (Marti-Renom et al., 2004), and HHsearch (Soding, 2005).
When homology with known templates is over 40%, HM programs can be used rather
confidently. In this case, especially when alignments to be used in modelling have already
been obtained, local programs represent a more viable alternative to web-based methods
than in TFM processes. If analysis is limited to most popular programs and web services
capable of implementing user MS-based restraints (strategy S1 in Fig. 1), the number of
possible candidates considerably decreases. Among web servers, on the basis of identified
homologies with templates, Robetta is automatically capable of switching from ab initio to
comparative modelling, while I-TASSER requires user-provided alignment or templates to
activate comparative modelling mode. A very powerful, versatile and popular HM
program, available both as a standalone application, and as a web service, and embedded in
many modelling servers, is MODELLER (http://www.salilab.org/modeller/). It include
routines for template search, sequence and structural alignments, determination of
homology-derived restraints, model building, loop modelling, model refinement and
validation. MS-based distance restraints can be added to those produced from target-
template alignments, as well as to other restraints enforcing secondary structures, symmetry
or part of the structure that must not be allowed to change upon modelling. However, some
scripting ability is required to fully exploit MODELLER versatility.
The overall accuracy of HM models calculated from alignments with sequence identities of
40% or higher is almost always good (typical root mean square deviations (RMSDs) from
corresponding experimental structures less than 2Å). The frequency of models deviating by
more than 2Å RMSD from experimental structures rapidly increases when target–template
sequence identity falls significantly below 30–40%, the so-called “twilight zone” of HM (Blake
& Cohen, 2001; Melo & Sali, 2007). In such cases, the quality of resulting modelled structures
significantly increases by combining additional information, both of statistical origin, such as
SS prediction profiles, and from sparse experimental data (low resolution NMR or chemical
crosslinking, limited proteolysis, chemical/isotopical labelling coupled with MS).
If the search does not produce templates with sufficient homology and/or covering of the
target sequence, TFM or mixed TFM/TBM methods must be used. Many programs based on
ab initio, fold recognition and threading methods are presently offered as web services; this
is because very often they use a metaserver approach for some steps, need extensive
140 Computational Biology and Applied Bioinformatics
Fig. 2. Comparison between the MS3D model of Gadd45β (light green) and the
crystallographic structure of its homolog Gadd45γ (light blue). Sequences with different SS
profiles are painted green in Gadd45β and magenta in Gadd45γ.
Computational Methods in Mass Spectrometry-Based Protein 3D Studies 141
3.2 Docking
Usually, methods for protein docking involve a six-dimensional search of the rotational and
translational space of one protein with respect to the other where the molecules are treated
as rigid or semirigid-bodies. However, during protein-protein association, the interface
residues of both molecules may undergo conformational changes that sometimes involve
not only side-chains, but also large backbone rearrangements. To manage at least in part
these conformational changes, protein docking protocols have introduced some degree of
protein flexibility by either use of "soft" scoring functions allowing some steric clash, or
explicit inclusion of domain movement/side chain flexibility. Biological information from
experimental data on regions or residues involved in complexation can guide the search of
complex configurations or filter out wrong solutions. Among the programs most frequently
used for protein-protein docking, recently reviewed by Moreira and colleagues (Moreira et
al., 2010), some of them can manage biological information and will be discussed in this
context.
In the Attract program (http://www.t38.physik.tu-muenchen.de/08475.htm ), proteins are
represented with a reduced model (up to 3 pseudoatoms per amino acid) to allow the
systematic docking minimization of many thousand starting structures. During the docking,
both partner proteins are treated as rigid-body and the protocol is based on energy
minimization in translational and rotational degrees of freedom of one protein with respect
to the other. Flexibility of critical surface side-chains as well as large loop movements are
introduced in the calculation by using a multiple conformational copy approach (Bastard et
al., 2006). Experimental data can be taken into account at various stages of the docking
procedure.
The 3D-Dock algorithm (http://www.sbg.bio.ic.ac.uk/docking/) performs a global scan of
translational and rotational space of the two interacting proteins, with a scoring function
based on shape complementarity and electrostatic interaction. The protein is described at
atomic level, while the side-chain conformations are modelled by multiple copy
representation using a rotamer library. Biological information can be used as distance
restraints to filter final complexes.
HADDOCK (http://www.nmr.chem.uu.nl/haddock/) makes use of biochemical or
biophysical interaction data, introduced as ambiguous intermolecular distance restraints
between all residues potentially involved in the interaction. Docking protocol consists of
four steps: 1) topology and structure generation; 2) randomization of orientations and rigid
body energy minimization; 3) semi-flexible simulated annealing (SA) in torsion angle space;
4) flexible refinement in Cartesian space with explicit solvent (water or DMSO). The final
structures are clustered using interface backbone RMSD and scored by their average
interaction energy and buried interface area. Recently, also explicit inclusion of water
molecules at the interface was incorporated in the protocol.
Molfit (http://www.weizmann.ac.il/Chemical_Services/molfit/) represents each molecule
involved in docking process by a 3-dimensional grid of complex numbers and estimates the
extent of geometric and chemical surface complementarity by correlating the grids using
Fast Fourier Transforms (FFT). During the search, contacts involving specified surface
regions of either one or both molecules are up- or down-weighted, depending on available
structural and biochemical data or sequence analysis (Ben-Zeev et al., 2003). The solutions
are sorted by their complementarity scores and the top ranking solutions are further refined
by small rigid body rotations around the starting position.
142 Computational Biology and Applied Bioinformatics
In general, MS-based data can be used to limit the protein region to be sampled (Kessl et al.,
2009) or can be explicitly considered in the docking procedure, as in the case of the mapping
of Sso7d ATPase site (Renzone et al., 2007b). In this case, three independent approaches for
molecular docking/MD studies were followed, considering both FSBA-derivatives and the
ATP-Sso7d non-covalent complex: i) unrestrained MD, starting from a full-extended,
external conformation for Y7-FSBA and K39-FSBA residue sidechains, and from several
random orientations for ATP, with an initial distance of 20 Å from Sso7d surface, in regions
not involved in protein binding; ii) restrained MD, by gradually imposing distance restraints
corresponding to a H-bond between adenine NH2 group and each accessible (i.e., within a
distance lower or equal to the maximum length of the corresponding FSBA-derivative)
donor sidechain; iii) rigid ligand docking, by calculating 2000 ZDOCK models of the non-
covalent complex of Sso7d with an adenosine molecule. The rigid ligand docking
reproduced only in part features from other approaches, as rigid docking correctly predicted
the anchoring point for adenosine ring, but failed to achieve a correct position for the ribose
moiety, due to the required concerted rearrangement of two Sso7d loops involved in the
binding. This latter feature represents one of the main advantages of modelling strategies
involving MD (in particular, in cartesian coordinates) because MD-based simulation
techniques are the best or the only approaches that reproduce medium-to-large scale
concerted rearrangements of non-contiguous regions.
3.3.1 Computational techniques and programs for model simulation and refinement
Model refinement, when not implemented in the modelling procedure, can be performed by
energy minimization (EM) or, better, by different molecular simulation methods, mostly
based on variants of molecular dynamics (MD) or Monte Carlo (MC) techniques. They are
also commonly used to characterize dynamic properties and structural changes upon local
or environmental perturbations.
Structures deriving from folding or docking procedures need, in general, at least a structural
regularization by EM before final validation steps, to avoid meaningless results from many
methods. Scoring functions of the latter evaluate the probity of parameters, such as dihedral
angle distributions, presence and distribution of steric bumps, voids in the molecular core,
specific nonbonded interactions (H-bonds, hydrophobic clusters). Representing a
mandatory step in most MC/MD protocols, EM programs are included in all the molecular
simulation packages, and they share with MC/MD most input files and part of the setup
parameters. Thus, unless they are be explicitly discussed, all system- and restraint-related
features or issues illustrated for simulation methods also implicitly held for EM.
144 Computational Biology and Applied Bioinformatics
This latter result can be reached by reducing the overall number of degrees of freedom to be
explicitly sampled and/or by reducing the number of possible values per variable to a
small, finite number (discretization, like in grid-based methods), and/or by restraining
acceptable variable ranges. Reduction of the total number of degrees of freedom can be
accomplished by switching to coarse-grained representations of the system, where a number
of explicit atoms, ranging from connected triples, to amino acid sidechains, to whole
residues, up to full protein subdomains, are replaced by a single particle. This method is
frequently used in initial stages of ab initio folding modelling, or in the simulation of very
large systems, such as giant structural proteins of huge protein aggregates.
Another possible way to reduce the number of degrees of freedom is the aforementioned TA
approach, requiring for a N atom system only N/3 torsional angles compared with 3N
coordinates in atomic cartesian space (Schwieters & Clore, 2001). Moreover, as the high
frequency motions of bending and stretching are removed, TAMD can use longer time steps
in the numerical integration of equations of motion than that required for a classical
molecular dynamics in cartesian space. Its main limitation may derive from neglecting
covalent geometry variations (in particular, bending centred on protein Cα atoms) that are
known to be associated with conformational variations (Berkholz et al., 2009), for instance
from α-helix to β-strand, and that can be important in concerted transitions or in large
structures with extensive and oriented SS regions. Discretization is mostly employed in the
initial screening of computationally intensive problems, such as ab initio modelling.
Restraining variable value ranges in MS3D is usually associated with either predictive
methods (SS, H-bond pattern, residue exposure), or to homology analysis, or to
experimentally-derived information. Origin, nature and form of these restraints have
already been discussed in previous sections, while some more detail on the implementation
of distance-related information into simulation programs will be given at the end of this
section.
While the implementation of restraints can be very variable in methods where the scoring
function does not intend to mimic or replicate a physical interaction between involved
entities, in methods based on physically-sounding molecular potential functions
(forcefields) have DRs implemented by a more limited number of approaches. At its
simplest, a DR will be represented as a harmonic restraint, for which only the target distance
and the force constant need to be specified in input. This functional form is present in
practically all most common programs, but either requires a precise knowledge of the target
distance, or it will result in a very loose restraint if the force constant is lowered too much to
account for low-precision target values, the usual case in MS-based data. In a more complex
and useful form, implemented with slight variations in several programs (AMBER,
CHARMM, GROMACS, XPLOR/CNS, DESMOND, TINKER), the restraint is a well with a
square bottom with parabolic sides out to a defined distance, and then linear beyond that on
both (AMBER) or just the upper limit side (CHARMM, GROMACS, XPLOR/CNS,
DESMOND). In some programs (CHARMM, AMBER, XPLOR/CNS), it is possible to select
an alternative behaviour when a distance restraint gets very large (Nilges et al,1988b) by
“flattening out” the potential, thus leading to no force for large violations; this allows for
errors in constraint lists, but might tend to ignore constraints that should be included to pull
a bad initial structure towards a more correct one.
Other forms for less-common applications can also be available in the programs or be
implemented by an user. However, the most interesting additional features of versatile DR
146 Computational Biology and Applied Bioinformatics
implementations are the different averages that can be used to describe DRs: i) complex
restraints can involve atom groups rather than single atoms at either or both restraint sides;
ii) time-averaged DRs, where target values are satisfied on average within a given time lapse
rather than instantaneously; iii) ambiguous DRs, averaged on different distance pairs. The
latter two cases are very useful when the overall DRs are not fully consistent each other,
because they are observed in the presence of conformational equilibria and, as such, they are
associated with different microstates of the system. In addition, complex and versatile
protocols can be simply developed in those programs where different parameters can be
smoothly varied during the simulation (AMBER).
computational approach in MS3D, to illustrate some of its peculiar features and issues, its
present potentialities and the variety of possible combinations of data and protocols that can
be devised to optimally handle different types of structural problems. Depending on nature
and quantity of available experimental information and on previous knowledge of the
investigated system, different combinations of the methods mentioned in previous sections
can be optimally employed. We will start illustrating examples of methods for de novo
protein folding, a frontier application of modelling with sparse restraints, because it is based
on minimal additional information on the system under investigation.
The MONSSTER program (Skolnick et al. 1997) only makes use of SS profiles and a limited
number of long-distance restraints. By employing system discretization and coarse-graining
to reduce required sampling, a protein is represented by a lattice-based Cα-backbone trace
with single interaction center rotamers for the side-chains. By using N/7 (N is the protein
length) long-range restraints, this method is able to produce folds of moderate resolution,
falling in the range from 4 to 5.5 Å of RMSD for Cα-traces with respect to the native
conformation for all α and α/β proteins, whereas β-proteins require, for the same resolution,
N/4 restraints. A more recent method for de novo protein modelling (Latek et al., 2007)
adopts restrained folding simulations supported by SS predictions, reinforced with sparse
experimental data. Authors focused on NMR chemical-shift-based restraints, but also sparse
restraints from different sources can be employed. A significant improvement of model
quality was already obtained by using a number of DRs equal to N/12.
As already stated by Latek and colleagues, the introduction of DRs in protein folding
protocol represents a critical step that in principle could negatively affect the sampling of
conformational space. In fact, restraint application at too early stages of calculations can trap
the protein into local minima, where restraints are satisfied, but the native conformation is
not reached. In addition to the number, even the specific distribution of long-range
restraints along the sequence can affect the sampling efficiency. To test the influence of data
sets in folding problem, we applied a well-tested protocol of SA, developed for AMBER
program and mainly oriented to NMR structure determination, to the folding simulation of
bovine pancreatic trypsin inhibitor (BPTI), by using different sets of ten long-distance
restraints, randomly selected from available NMR data (Berndt et al., 1992), with optional
inclusion of a SS profile. Fig. 3 shows representative structures for each restraint set.
The four native BPTI disulphide bridges were taken into account by additional distance and
angle restraints. BPTI represents a typical benchmark for this kind of studies, due to its
peculiar topology (an α/β fold with long connection loops, stabilized by disulphide bonds)
still associated with a limited size (58 residues), and to the availability of both X-ray and
NMR accurate structures. SA cycles of 50 structures each were obtained and compared for
four combinations of three sets (S1-3) of ten long distance restraints, totally non-redundant
among different sets and SS profiles: a) S1+SS profile; b) S1 alone; c) S2+SS profile; d) S3+SS
profile. S1 set performed definitely better than the other two, its best model exhibiting a
RMSD value of 2.4 Å on protein backbone of residues 3-53 from the representative NMR
structure.
This set was also able to provide a reasonable low-resolution fold even in the absence of SS
restraints (b). S3 resulted in almost correctly folded models, but with significantly worse
RMSD values than S1 (c). In S3 pseudomirror images (d) of the BPTI fold occurred several
times and only one model out of 50 was correctly folded (not shown).
148 Computational Biology and Applied Bioinformatics
Fig. 3. Ab initio modelling from sparse restraints of BPTI. Representative models from
different restraint sets (S1, S2, S3), with optional SS dihedral restraints are shown in ribbon
representation, coloured in red/yellow (helices), cyan (strands) and grey (loops). Models are
best-fitted on Cα atoms of residues 3-53 to the representative conformation of the NMR
high-resolution structure (PDB code: 1pit) (green), except for the S3+SS set, where
superposition with β-sheet only is shown, to better illustrate pseudo-mirroring effect,
although RMSD values are calculated on the same 3-53 residue range as other models.
Computational Methods in Mass Spectrometry-Based Protein 3D Studies 149
These results suggest a strong dependency of results upon both the exact nature of
experimental data used in structure determination, and the protocol followed for model
building. Thus, the number of restraints estimated in the aforementioned studies as
necessary/sufficient for a reliable structural prediction should be prudently interpreted for
practical purposes. If a proper protocol is adopted, increasing quantity, quality and
distribution homogeneity of data should decrease this dependency, but the problem still
remains severe when using very sparse restraints, such as those associated with many MS3D
applications. A careful validation of the models and, possibly, execution of more modelling
cycles with variations in different protocol parameters, can help to identify and solve these
kinds of problems.
However, in spite of these potential issues, ab initio MS3D can provide a precious insight
into systems that are impossible to study with other structural methods. In addition to
increases in the number of experimental data, also homology-based information and other
statistically-derived constraints can substantially increase the reliability of MS3D
predictions. Thus, suitable combinations of experimental data, predictive methods and
computational approaches have allowed the modelling of many different proteins and
protein complexes spanning a wide range of sizes and complexity. The illustrative examples
shown in Table 1 represents just a sample of systems affordable with current computational
MS3D techniques and a guideline to select possible approaches for different problem
classes. Heterogeneity of reported systems, data and methods, while suggesting the
enormous potentialities of MS3D approaches, practically prevents any really meaningful
critical comparison among methods, whose description in applicative papers is often
incomplete. A standardization of MS3D computational methods is still far from being
achieved, since it requires considerable computational effort to tackle with the considerable
number of strategies and parameters that should be tested in a truly exhaustive analysis.
Furthermore, the extreme sensitivity of modelling with sparse data to constraint
distribution, as seen in the example shown in Fig. 3, either introduces some degree of
arbitrariness in comparative analyses, or make them even more computationally-intensive,
by requiring the use of more subsets for each system setup to be sampled.
Advancements in MS3D experimental approaches continuously change the scenarios for
computational procedures, by substantially increasing the number of data, as well as the
types of crosslinking or labelling agents and proteolytic enzymes. The large number of
crosslinks obtained for apolipoproteins (Silva et al., 2005; Tubb et al., 2008) or CopA copper
ATPase (Lübben et al., 2009) represent good examples of these trends (Table 1).
5. Conclusion
As already stated in the preceding section, the compared analysis of computational
approaches involved in MS3D is still considerably limited, because of the complexity both of
the systems to be investigated, and of the methods themselves, especially when they are
used in combination with restraints as sparse as those usually available in MS3D studies.
The continuous development in all involved experimental and computational techniques
considerably accelerates the obsolescence of the results provided by any accurate
methodological analysis, thus representing a further disincentive to these usually very time
consuming studies. In this view, rather than strict prescriptions, detailed recipes or sharp
critical compared analysis of available approaches, this study was meant to provide an
150 Computational Biology and Applied Bioinformatics
PSF/NA
Gadd45β-MKK7
PSF/MD,EM
Computational Methods in Mass Spectrometry-Based Protein 3D Studies
Table 1. Some examples of MS3D studies from literature. The following abbreviations have been used (in bold, non standard
abbreviations): aXL: crosslinking, CL: chemical labelling, PL: photoaffinity labeling, LP: limited proteolysis; AA: alkylation
analyses, bsee a, PSF: post-sampling filtering with experimental data, IIS: esperimental data integrated in sampling; cXR: x-ray
crystallography; dSSp: secondary structure prediction, esee a,b, HM: Homology modeling, aiM: ab-initio modeling, FP: fold
prediction, MD: molecular dynamics, SA: Simulate Annealing, EM: energy minimization, PPD: protein-protein docking, PLD:
protein-small ligand docking; DR: distance restraints, TM: trans-membrane; NA: not available.
151
152 Computational Biology and Applied Bioinformatics
different sets could play an important role in the final result; thus, their weight should be
carefully evaluated by performing more modelling runs with different setups.
When evaluating the overall modelling procedures, their corresponding caveats and
performance issues, the importance of many details in setup and validation of MS3D
computational procedures fully emerges, thus suggesting that they still requires a
considerable human skill, although many full automated programs and servers allow in
principle the use of MS3D protocols even to inexperienced users. This is also demonstrated
for the pure ab initio modelling stage by the still superior performances obtained by human-
guided predictions in CASP rounds, when compared to fully automated servers.
Future improvements in MS3D are expected as a natural consequence of continuous
development in biochemical/MS techniques for experimental data, and in hardware/
software for molecular simulations and predictive methods. However, some specific, less
expensive and, possibly, quicker evolution in MS3D could be propelled by targeted
development of computational approaches more directly related to the real nature of the
experimental information on which MS3D is based, notably algorithms implementing
surface-dependent contributions and more faithful representations of crosslinkers than
straight distance restraints.
6. References
Aebersold, R. & Mann, M. (2003). Mass spectrometry-based proteomics. Nature, Vol.422, pp.
198–207.
Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. & Lipman, D.J.
(1997). Gapped BLAST and PSI-BLAST: A new generation of protein database
search programs. Nucleic Acids Research, Vol.25, pp.3389–3402.
Aszodi, A.; Gradwell, M.J. & Taylor, W.R.(1995). Global fold determination from a small
number of distance restraints. Journal of Molecular Biology Vol.251, pp.308–326.
Aszodi, A.; Munro, R.E. & Taylor, W.R.(1997). Protein modeling by multiple sequence
threading and distance geometry. Proteins, Vol. 29, pp.38–42.
Back, J.W.; de Jong, L.; Muijsers, A.O. & de Koster, C.G. (2003). Chemical crosslinking and
mass spectrometry for protein structural modeling. Journal of Molecular Biology,
Vol.331,pp.303–313.
Back, J.W.; Sanz, M.A.; De Jong, L.; De Koning, L.J.; Nijtmans, L.G.; De Koster, C.G.; Grivell,
L.A.; Van Der Spek, H. & Muijsers, A.O.(2002). A structure for the yeast prohibitin
complex: Structure prediction and evidence from chemical crosslinking and mass
spectrometry. Protein Science, Vol. 11, pp.2471–2478.
Balasu, M.C.; Spiridon, L.N.; Miron, S. ; Craescu, C.T.; Scheidig, A.J., Petrescu, A.J. &
Szedlacsek, S.E. (2009). Interface Analysis of the Complex between ERK2 and PTP-
SL. Plos one, Vol. 4, pp. e5432.
Bastard, K.; Prévost, C. & Zacharias, M. (2006). Accounting for loop flexibility during
protein-protein docking. Proteins, Vol.62, pp. 956-969.
Ben-Zeev, E. & Eisenstein, M. (2003). Weighted geometric docking: incorporating external
information in the rotation-translation scan. Proteins, Vol.52, pp. 24-27.
Berndt, K.D.; Güntert, P.; Orbons, L.P. & Wüthrich, K. (1992). Determination of a high-
quality nuclear magnetic resonance solution structure of the bovine pancreatic
154 Computational Biology and Applied Bioinformatics
trypsin inhibitor and comparison with three crystal structures. Journal of Molecular
Biology, Vol.227, pp.757-775.
Blake, J.D. & Cohen, F.E. (2001). Pairwise sequence alignment below the twilight zone.
Journal of Molecular Biology, Vol. 307, pp. 721-735.
Bowers, P.M.; Strauss, C.E.M. & Baker, D. (2000). De novo protein structure determination
using sparse NMR data. Journal of Biomolecular NMR, Vol.18, pp.311–318.
Brazas, M.D.; Yamada, J.T. & Ouellette, B.F.F. (2010). Providing web servers and training in
Bioinformatics: 2010 update on the Bioinformatics Links Directory . Nucleic Acids
Research, Vol. 38, pp.W3–W6.
Brooks, B.R.; Bruccoleri, R.E.; Olafson, B.D.; States, D.J.; Swaminathan, S. & Karplus
M.(2003). CHARMM: A Program for Macromolecular Energy, Minimization, and
Dynamics Calculations. Journal of Computational Chemistry, Vol.4, pp.187-217.
Camacho, C. J. & Vajda, S. (2001). Protein docking along smooth association pathways.
PNAS USA, Vol.98, pp.10636–10641.
Carlsohn, E.; Ångström, J. ; Emmett, M.R.; Marshall, A.G. & Nilsson, C.L. (2004). Chemical
cross-linking of the urease complex from Helicobacter pylori and analysis by
Fourier transform ion cyclotron resonance mass spectrometry and molecular
modeling International Journal of Mass Spectrometry, Vol.234, pp. 137–144.
Chu, F.; Shan, S.; Moustakas, D.T.; Alber, F.; Egea, P.F.; Stroud, R.M.; Walter, P. &
Burlingame A.L. (2004). Unraveling the interface of signal recognition particle and
its receptor by using chemical cross-linking and tandem mass spectrometry. PNAS,
Vol.101, pp. 16454-16459.
D’Ambrosio, C.; Talamo, F.; Vitale, R.M.; Amodeo, P.; Tell, G.; Ferrara, L. & Scaloni, A.
(2003). Probing the Dimeric Structure of Porcine Aminoacylase 1 by Mass
Spectrometric and Modeling Procedures. Biochemistry, Vol. 42, pp. 4430-4443.
de Bakker, P.I.; Furnham, N.; Blundell, T.L. & DePristo, M.A. (2006). Conformer generation
under restraints. Current Opinion in Structural Biology, Vol. 16, pp.160–165.
Dimova, K; Kalkhof, S.; Pottratz, I.; Ihling, C.; Rodriguez-Castaneda, F.; Liepold, T.;
Griesinger, C.; Brose, N.; Sinz, A. & Jahn, O. (2009). Structural Insights into the
Calmodulin-Munc13 Interaction Obtained by Cross-Linking and Mass
Spectrometry. Biochemistry, Vol.48, pp. 5908-5921.
Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, Vol.14, pp.755–763.
Fabris, D. & Yu, E.T. (2010). The collaboratory for MS3D:a new cyberinfrastructure for the
structural elucidation of biological macromolecules and their assemblies using
mass spectrometry-based approaches. Journal Proteome Research, Vol.7, pp. 4848-
4857.
Fiser, A. & Sali, A. (2003). Modeller: generation and refinement of homology base protein
structure models. Methods in Enzymology, Vol. 374, pp.461–491.
Förster, F.; Webb, B.; Krukenberg, K.A.; Tsuruta, H.; Agard, D.A. & Sali A.(2008).
Integration of Small-Angle X-Ray Scattering Data into Structural Modeling of
Proteins and Their Assemblies. Journal of Molecular Biology, Vol.382, pp.1089–
1106.
Friedhoff, P. (2005). Mapping protein–protein interactions by bioinformatics and
crosslinking.. Analitycal & Bioanalitycal Chemistry, Vol.381,pp.78–80.
Computational Methods in Mass Spectrometry-Based Protein 3D Studies 155
Giron-Monzon, L.; Manelyte, L.; Ahrends, R.; Kirsch, D.; Spengler, B. & Friedhoff, P. (2004).
Mapping Protein-Protein Interactions between MutL and MutH by Cross-linking.
The Journal of Biochemical Chemistry, Vol.279, pp. 49338–49345.
Gray, J.J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C.A. & Baker, D.
(2003). Protein-protein docking with simultaneous optimization of rigid-body
displacement and side-chain conformations. Journal of Molecular Biology, Vol.331,
pp.281-299.
Green, N.S.; Reisler, E. & Houk, K.N. (2001). Quantitative evaluation of the lengths of
homobifunctional protein cross-linking reagents used as molecular rulers. Protein
Science, Vol.10, pp.1293-1304.
Grintsevich, E.E.; Benchaar, S.A.; Warshaviak, D.; Boontheung, P.; Halgand, F.; Whitelegge,
J.P.; Faull, K.F.; Ogorzalek Loo, R.R; Sept, D.; Loo, J.A. & Reisler, E. (2008). Mapping
the Cofilin Binding Site on Yeast G-Actin by Chemical Cross-Linking. Journal of
Molecular Biology, Vol.377, pp. 395-409.
Güntert, P.; Mumenthaler, C. & Wüthrich, K. (1997). Torsion angle dynamics for NMR
structure calculation with the new program Dyana. Journal of Molecular Biology, Vol.
273, pp. 283–298.
Havel, T.F.; Kuntz, I.D. & Crippen, G.M.(1983). The combinatorial distance geometry
method for the calculation of molecular conformation. I. A new approach to an old
problem. Journal of Theoretical Biology, Vol. 310, pp.638–642.
Jaroszewski, L.; Rychlewski, L.; Li, Z.; Li, W. & Godzik, A. (2005). FFAS03: a server for
profile– profile sequence alignments. Nucleic Acids Research, Vol.33, pp.W284–288.
Karplus, K.; Barrett, C. & Hughey R. (1998). Hidden Markov models for detecting remote
protein homologies. Bioinformatics, Vol.14, pp.846–856.
Kessl, J.J.; Eidahl, J.O.; Shkriabai, N.; Zhao, Z.; McKee, C.J.; Hess, S.; Burke, T.R. Jr &
Kvaratskhelia, M. (2009). An allosteric mechanism for inhibiting HIV-1 integrase
with a small molecule.Molecular Pharmacology, Vol. 76, pp.824–832.
Kirkpatrick, S.; Gelatt, C.D. Jr. & Vecchi, M.P. (1983). Optimization by Simulated Annealing.
Science, Vol 220,pp. 671-680.
Latek, D.; Ekonomiuk, D. & Kolinski , A.(2007). Protein structure prediction: combining de
novo modeling with sparse experimental data. Journal of Computational Chemistry,
Vol. 28, pp.1668–1676.
Leitner, A.; Walzthoeni, T.; Kahraman, A.; Herzog, F.; Rinner, O.; Beck, M. & Aebersolda, R.
(2010). Probing Native Protein Structures by Chemical Cross-linking, Mass
Spectrometry, and Bioinformatics. Molecular & Cellular Proteomics, Vol.24, pp. 1634-
1649.
Lin, M.; Lu, H.M.; Rong Chen,R. & Liang, J.(2008). Generating properly weighted ensemble
of conformations of proteins from sparse or indirect distance constraints. The
Journal of Chemical Physics, Vol.129, pp.094101–094114.
Marti-Renom, M.A.; Madhusudhan, M.S. & Sali, A. (2004). Alignment of protein sequences
by their profiles. Protein Science, Vol.13, pp.1071–1087.
Lübben, M.; Portmann, R.; Kock, G.; Stoll, R.; Young, M.M. & Solioz, M. (2009). Structural
model of the CopA copper ATPase of Enterococcus hirae based on chemical cross-
linking. Biometals, Vol.22, pp. 363-375.
156 Computational Biology and Applied Bioinformatics
Mathiowetz, A.M.; Jain, A.; Karasawa, N. & Goddard, W.A. III. (1994). Protein simulation
using techniques suitable for very large systems: The cell multipole method for
nonbond interactions and the Newton–Euler inverse mass operator method for
internal coordinate dynamics. Proteins, Vol. 20, pp. 227–247.
Melo, F. & Sali, A. (2007). Fold assessment for comparative protein structure modeling.
Protein Science, Vol. 16, pp. 2412–2426.
Millevoi, S.; Thion, L.; Joseph, G.; Vossen, C.; Ghisolfi-Nieto, L. & Erard, M. (2001). Atypical
binding of the neuronal POU protein N-Oct3 to noncanonical DNA targets.
Implications for heterodimerization with HNF-3b. European Journal Biochemistry,
Vol.268, pp. 781-791.
Moreira, I.S.; Fernandes, P.A. & Ramos, M.J. (2010). Protein-protein docking dealing with
the unknown. Journal of Computational Chemistry, Vol. 31, pp.317–342.
Mouradov, D.; Craven, A.; Forwood, J.K.; Flanagan, J.U.; García-Castellanos, R.; Gomis-
Rüth, F.X.; Hume, D.A.; Martin, J.L.; Kobe, B. & Huber, T. (2006). Modelling the
structure of latexin–carboxypeptidase. A complex based on chemical cross-
linking and molecular docking. Protein Engineering, Design & Selection, Vol.19, pp.
9-16.
Nikolova, L.; Soman, K. ; Nichols, J.C.; Daniel, D.S., Dickey, B.F. & Hoffenberg, S.
(1998). Conformationally variable Rab protein surface regions mapped by
limited proteolysis and homology modelling. Biochemical Journal, Vol.336, pp.
461–469.
Nilges, M.; Clore, G.M. & Gronenborn, A.M.(1988a). Determination of three dimensional
structures of proteins from interproton distance data by hybrid distance
geometry-dynamical simulated annealing calculations. FEBS Letters, Vol.229,
pp.317–324.
Nilges, M.; Gronenborn, A.M.; Brünger, A.T. & Clore, G.M. (1988b). Determination of
three- dimensional structures of proteins by simulated annealing with
interproton distance restraints: application to crambin, potato carboxypeptidase
inhibitor and barley serine proteinase inhibitor 2. Protein Engineering, Vol.2,
pp.27-38.
Nymeyer, H.; Gnanakaran, S. and García, A.E. (2004). Atomic simulations of protein
folding using the replica exchange algorithm. Methods in Enzymology, Vol.383,
pp.111-149.
Papa, S.; Monti, S.M.; Vitale, R.M.; Bubici, C.; Jayawardena, S.; Alvarez, K.; De Smaele, E.;
Dathan, N.; Pedone, C.; Ruvo M. & Franzoso, G. (2007). Insights into the structural
basis of the GADD45beta-mediated inactivation of the JNK kinase, MKK7/JNKK2..
Journal of Biological Chemistry, Vol. 282, pp. 19029-19041.
Potluri, S.; Khan, A.A.; Kuzminykh, A.; Bujnicki, J.M., Friedman, A.M. & Bailey-Kellogg, C.
(2004). Geometric Analysis of Cross-Linkability for Protein Fold Discrimination.
Pacific Symposium on Biocomputing, Vol.9, pp.447-458.
Renzone, G.; Salzano, A.M.; Arena, S.; D’Ambrosio, C. & Scaloni, A.(2007a). Mass
Spectrometry-Based Approaches for Structural Studies on Protein Complexes at
Low-Resolution. Current Proteomics, Vol. 4, pp. 1-16.
Computational Methods in Mass Spectrometry-Based Protein 3D Studies 157
Renzone, G.; Vitale, R.M.; Scaloni, A.; Rossi, M., Amodeo, P. & Guagliardi A. (2007b).
Structural Characterization of the Functional Regions in the Archaeal Protein
Sso7d. Proteins: Structure, Function, and Bioinformatics, Vol. 67, pp. 189-197.
Rice, L.M & Brünger, A.T. (1994). Torsion angle dynamics: Reduced variable conformational
sampling enhances crystallographic structure refinement. Proteins, Vol. 19, pp. 277–
290.
Russell, R.B.; Alber, F.; Aloy, P.; Davis, F.P.; Korkin, D.;Pichaud, M; Topf, M. & Sali, A.
(2004). A structural perspective on protein-protein interactions. Current Opinion in
Structural Biology, Vol.14, pp. 313-324.
Scaloni, A; Miraglia, N.; Orrù, S.; Amodeo, P.; Motta, A.; Marino, G. & Pucci, P.(1998).
Topology of the calmodulin-melittin complex. Journal of Molecular Biology, Vol. 277,
pp.945–958.
Schrag, J.D.; Jiralerspong, S.; Banville, M; Jaramillo, M.L. & O'Connor-McCourt, M.D. (2007).
The crystal structure and dimerization interface of GADD45gamma. PNAS, Vol.
105, pp. 6566-6571.
Schueler-Furman, O.; Wang, C.; Bradley, P.; Misura, K. & Baker, D. (2005). Progress in
modeling of protein structures and interactions. Science, Vol. 310, pp.638–642.
Schulz,D.M.; Kalkhof, S.; Schmidt, A.; Ihling, C.; Stingl, C.; Mechtler, K.; Zschörnig, O &
Sinz, A. (2007). Annexin A2/P11 interaction: New insights into annexin A2
tetramer structure by chemical crosslinking, high-resolution mass spectrometry,
and computational modeling. Proteins: Structure Function & Bioinformatics, Vol.69,
pp. 254-269.
Schwieters, C.D. & Clore, G.M. (2001). Internal Coordinates for Molecular Dynamics and
Minimization in Structure Determination and Refinement. Journal of Magnetic
Resonance, Vol. 152, pp.288–302.
Silva, R.A.G.D.; Hilliard, G.M.; Fang, J.; Macha, S. & Davidson, W.S. (2005). A Three-
Dimensional Molecular Model of Lipid-Free Apolipoprotein A-I Determined by
Cross-Linking/Mass Spectrometry and Sequence Threading. Biochemistry, Vol.44,
pp. 2759-2769.
Singh, P.; Panchaud, A. & Goodlett, D.R. (2010) Chemical Cross-Linking and Mass
Spectrometry As a Low-Resolution Protein Structure Determination Technique.
Analytical Chemistry, Vol. 82, pp. 2636–2642
Sinz, A. (2006). Chemical cross-linking and mass spectrometry to map three dimensional
protein structures and protein-protein interactions. Mass Spectrometry Reviews,
Vol.25, pp. 663-682.
Skolnick,J.; Kolinski, A. & Ortiz, A.R.(1997). MONSSTER: A Method for Folding Globular
Proteins with a Small Number of Distance Restraints. Journal of Molecular Biology,
Vol. 265 pp.217-241.
Soding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics,
Vol.21, pp.951–960.
Stein, E.G.; Rice, L.M & Brünger, A.T. (1997). Torsion-angle molecular dynamics as a new
efficient tool for NMR structure calculation. Journal of Magnetic Resonance, Vol. 124,
pp. 154–164.
158 Computational Biology and Applied Bioinformatics
Tubb, M.R.; Silva, R.A.G.D.; Fang, J.; Tso, P. & Davidson, W.S. (2008). A Three-dimensional
Homology Model of Lipid-free Apolipoprotein A-IV Using Cross-linking and Mass
Spectrometry. The Journal of Biochemical Chemistry, Vol.283, pp. 17314--17323.
Vaidehi, N., Jain, A. & Goddard, W.A. III (1996). Constant temperature constrained
molecular dynamics: The Newton–Euler inverse mass operator method. Journal of
Physical Chemistry, Vol. 100, pp.10 508–10517.
Van Dijk, A.D.J.; Boelens, R. & Bonvin, A.M.J.J. (2005). Data-driven docking for the study of
biomolecular complexes. FEBS Journal, Vol.272, pp.293–312.
Young, M.M.; Tang, N.; Hempel, J.C.; Oshiro, C.M.; Taylor, E.W.; Kuntz, I.D.; Gibson, B.W.
& Dollinger G. (2000). High throughput protein fold identification by using
experimental constraints derived from intramolecular cross-links and mass
spectrometry. PNAS, Vol.97, pp. 5802-2806.
Zheng, X.; Wintrode, P.L. & Chance M.R. (2007). Complementary Structural Mass
Spectrometry Techniques Reveal Local Dynamics in Functionally Important
Regions of a Metastable Serpin. Structure, Vol.16, pp. 38-51.
8
1. Introduction
Cancer is the second leading cause of mortality worldwide, with an expected 1.5-3.0 million
new cases and 0.5-2.0 million deaths in 2011 for the US and Europe, respectively (Jemal et
al., 2011). Hence, this is an enormously important health risk, and progress leading to
enhanced survival is a global priority. Strategies that have been pursued over the years
include the search for new biomarkers, drugs or treatments (Rodrigues et al., 2007).
Synthetic biology together with bioinformatics represents a powerful tool towards the
discovery of novel biomarkers and the design of new biosensors.
Traditionally, the majority of new drugs has been generated from compounds derived from
natural products (Neumann & Neumann-Staubitz, 2010). However, advances in genome
sequencing together with possible manipulation of biosynthetic pathways, constitute
important resources for screening and designing new drugs (Carothers et al. 2009).
Furthermore, the development of rational approaches through the use of bioinformatics for
data integration will enable the understanding of mechanisms underlying the anti-cancer
effect of such drugs (Leonard et al., 2008; Rocha et al., 2010).
Besides in biomarker development and the production of novel drugs, synthetic biology can
also play a crucial role in the level of specific drug targeting. Cells can be engineered to
recognize specific targets or conditions in our bodies that are not naturally recognized by
the immune system (Forbes, 2010).
Synthetic biology is the use of engineering principles to create, in a rational and systematic
way, functional systems based on the molecular machines and regulatory circuits of living
organisms or to re-design and fabricate existing biological systems (Benner & Sismour,
2005). The focus is often on ways of taking parts of natural biological systems, characterizing
and simplifying them, and using them as a component of a highly unnatural, engineered,
biological system (Endy, 2005). Virtually, through synthetic biology, solutions for the unmet
needs of humankind can be achieved, namely in the field of drug discovery. Indeed,
synthetic biology tools enable the elucidation of disease mechanisms, identification of
potential targets, discovery of new chemotherapeutics or design of novel drugs, as well as
the design of biological elements that recognize and target cancer cells. Furthermore,
through synthetic biology it is possible to develop economically attractive microbial
production processes for complex natural products.
160 Computational Biology and Applied Bioinformatics
Bioinformatics is used in drug target identification and validation, and in the development
of biomarkers and tools to maximize the therapeutic benefit of drugs. Now that data on
cellular signalling pathways are available, integrated computational and experimental
projects are being developed, with the goal of enabling in silico pharmacology by linking the
genome, transcriptome and proteome to cellular pathophysiology. Furthermore,
sophisticated computational tools are being developed that enable the modelling and design
of new biological systems. A key component of any synthetic biology effort is the use of
quantitative models (Arkin, 2001). These models and their corresponding simulations enable
optimization of a system design, as well as guiding their subsequent analysis. Dynamic
models of gene regulatory and reaction networks are essential for the characterization of
artificial and synthetic systems (Rocha et al., 2008). Several software tools and standards
have been developed in order to facilitate model exchange and reuse (Rocha et al., 2010).
In this chapter, synthetic biology approaches for cancer diagnosis and drug development
will be reviewed. Specifically, examples on the design of RNA-based biosensors, bacteria
and virus as anti-cancer agents, and engineered microbial cell factories for the production of
drugs, will be presented.
However, it is well-known that noisy or leaky promoters can complicate the system design.
In these cases a finer control over expression can be established by weakening the binding
strength of the downstream gene (Ham et al., 2006), or by using two promoter inputs to
drive transcription of an output via a modular AND gate (Anderson et al., 2006).
Additionally, modular and scalable RNA-based devices (aptamers, ribozymes, and
transmitter sequences) can be engineered to regulate gene transcription or translation (Win
& Smolke, 2007).
Design at the pathway level is not only concerned with including the necessary parts, but
also with controlling the expressed functionality of those parts. Parts-based synthetic
metabolic pathways will require tunable control, just as their natural counterparts which
often employ feedback and feed-forward motifs to achieve complex regulation (Purnick &
Weiss, 2009). Using a synthetic biology approach, the design of DNA sequences encoding
metabolic pathways (e.g. operons) should be relatively straightforward. Synthetic scaffolds
and well-characterized families of regulatory parts have emerged as powerful tools for
engineering metabolism by providing rational methodologies for coordinating control of
multigene expression, as well as decoupling pathway design from construction (Ellis et al.,
2009). Pathway design should not overlook the fact that exogenous pathways interact with
native cellular components and have their own specific energy requirements. Therefore,
modifying endogenous gene expression may be necessary in addition to balancing cofactor
fluxes and installing membrane transporters (Park et al., 2008).
After designing parts, circuits or pathways, the genomic constructs ought to be
manufactured through DNA synthesis. Nucleotide’s sequence information can be
outsourced to synthesis companies (e.g. DNA2.0, GENEART or Genscript, among others).
The convenience of this approach over traditional cloning allows for the systematic
generation of genetic part variants such as promoter libraries. Also, it provides a way to
eliminate restriction sites or undesirable RNA secondary structures, and to perform codon
optimization. The ability to make large changes to DNA molecules has resulted in
standardized methods for assembling basic genetic parts into larger composite devices,
which facilitate part-sharing and faster system-level construction, as demonstrated by the
BioBrick methodology (Shetty et al., 2008) and the Gateway cloning system (Hartley, 2003).
Other approaches based on type II restriction enzymes, such as Golden Gate Shuffling,
provide ways to assemble many more components together in one step (Engler et al., 2009).
A similar one-step assembly approach, circular polymerase extension cloning (CPEC),
avoids the need for restriction-ligation, or single-stranded homologous recombination
altogether (Quan & Tian, 2009). Not only is this useful for cloning single genes, but also for
assembling parts into larger sequences encoding entire metabolic pathways and for
generating combinatorial part libraries. On a chromosomal level, disruption of genes in
Escherichia coli and other microorganisms has become much faster with the development of
RecBCD and lambda RED-assisted recombination systems (Datsenko & Wanner, 2000),
allowing the insertion, deletion or modification by simply using linear gene fragments.
Additionally, multiplex automated genome engineering (MAGE) has been introduced as
another scalable, combinatorial method for producing large-scale genomic diversity (Wang
et al., 2009). This approach makes chromosomal modification easier by simultaneously
mutating target sites across the chromosome. Plasmid-based expression and chromosomal
integration are the two common vehicles for implementing synthetic metabolic pathways.
Recently, the chemically inducible chromosomal evolution (CIChE) was proposed as a long-
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 163
term expression alternative method (Tyo et al., 2009). This new method avoids
complications associated with plasmid replication and segregation, and can be used to
integrate multiple copies of genes into the genome. All these techniques will provide
technical platforms for the rapid synthesis of parts and subsequent pathways.
The majority of the synthetic biology advances has been achieved purely in vitro (Isalan et
al., 2008), or in microorganisms involving the design of small gene circuits without a direct
practical application, although scientifically very exciting. These studies have offered
fundamental insight into biological processes, like the role and sources of biological noise;
the existence of biological modules with defined properties; the dynamics of oscillatory
behavior; gene transcription and translation; or cell communication (Alon, 2003; Kobayashi
et al., 2004). An interesting example of a larger system that has been redesigned is the
refactoring of the T7 bacteriophage (Chan et al., 2005). Another successful example has been
the production of terpenoid compounds in E. coli (Martin et al., 2003) and Saccharomyces
cerevisiae (Ro et al., 2006) that can be used for the synthesis of artemisinin. Bacteria and fungi
have long been used in numerous industrial microbiology applications, synthesizing
important metabolites in large amounts. The production of amino acids, citric acid and
enzymes are examples of other products of interest, overproduced by microorganisms.
Genetic engineering of strains can contribute to the improvement of these production levels.
Altogether, the ability to engineer biological systems will enable vast progress in existing
applications and the development of several new possibilities. Furthermore, novel
applications can be developed by coupling gene regulatory networks with biosensor
modules and biological response systems. An extensive RNA-based framework has been
developed for engineering ligand-controlled gene regulatory systems, called ribozyme
switches. These switches exhibit tunable regulation, design modularity, and target
specificity and could be used, for example, to regulate cell growth (Win & Smolke, 2007).
Engineering interactions between programmed bacteria and mammalian cells will lead to
exciting medical applications (Anderson et al., 2006). Synthetic biology will change the
paradigm of the traditional approaches used to treat diseases by developing “smart”
therapies where the therapeutic agent can perform computation and logic operations and
make complex decisions (Andrianantoandro et al., 2006). There are also promising
applications in the field of living vectors for gene therapy and chemical factories (Forbes,
2010; Leonard et al., 2008).
al., 2007). Due to the large number of participating species and the complexity of their
interactions, it becomes difficult to intuitively predict a design behavior. Therefore, only
detailed modeling can allow the investigation of dynamic gene expression in a way fit for
analysis and design (Di Ventura et al., 2006). Modeling a cellular process can highlight
which experiments are likely to be the most informative in testing model hypothesis, and for
example allow testing for the effect of drugs (Di Bernardo et al., 2005) or mutant
phenotypes (Segre et al., 2002) on cellular processes, thus paving the way for individualized
medicine.
Data are the precursor to any model, and the need to organize as much experimental data as
possible in a systematic manner has led to several excellent databases as summarized in
Table 2. The term “model” can be used for verbal or graphical descriptions of a mechanism
underlying a cellular process, or refer to a set of equations expressing in a formal and exact
manner the relationships among variables that characterize the state of a biological system
(Di Ventura et al., 2006). The importance of mathematical modeling has been extensively
demonstrated in systems biology (You, 2004), although its utility in synthetic biology seems
even more dominant (Kaznessis, 2009).
Name Website
BIND (Biomolecular Interaction Network Database) http://www.bind.ca/
Brenda (a comprehensive enzyme information http://www.brenda.uni-koeln.de/
system)
CSNDB (Cell Signaling Networks Database) http://geo.nihs.go.jp/csndb/
DIP (Database of Interacting Proteins) http://dip.doe-mbi.ucla.edu/
EcoCyc/Metacyc/BioCyc (Encyclopedia of E. coli http://ecocyc.org/
genes and metabolism)
EMP (Enzymes and Metabolic Pathways Database) http://www.empproject.com/
GeneNet (information on gene networks) http://wwwmgs.bionet.nsc.ru/mgs/s
ystems/genenet/
Kegg (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.ad.jp/kegg/kegg
.html
SPAD (Signaling Pathway Database) http://www.grt.kyushu-u.ac.jp/eny-
doc/
RegulonDB (E. coli K12 transcriptional network) http://regulondb.ccg.unam.mx/
ExPASy-beta (Bioinformatics Resource Portal) http://beta.expasy.org/
Table 2. Databases of molecular properties, interactions and pathways (adapted from Arkin,
2001).
Model-driven rational engineering of synthetic gene networks is possible at the level of
topologies or at the level of molecular components. In the first one, it is considered that
molecules control the concentration of other molecules, e.g. DNA-binding proteins regulate
the expression of specific genes by either activation or repression. By combining simple
regulatory interactions, such as negative and positive feedback and feed-forward loops, one
may create more complex networks that precisely control the production of protein
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 165
molecules (e.g. bistable switches, oscillators, and filters). Experimentally, these networks can
be created using existing libraries of regulatory proteins and their corresponding operator
sites. Examples of these models are the oscillator described by Gardner et al (2000) and
repressilator by Elowitz and Leibler (2000). In the second level, the kinetics and strengths of
molecular interactions within the system are described. By altering the characteristics of the
components, such as DNA-binding proteins and their corresponding DNA sites, one can
modify the system dynamics without modifying the network topology. Experimentally, the
DNA sequences that yield the desired characteristics of each component can be engineered
to achieve the desired protein-protein, protein-RNA, or protein-DNA binding constants and
enzymatic activities. For example, Alon and co-workers (2003) showed how simple
mutations on the DNA sequence of the lactose operon can result in widely different
phenotypic behavior.
Various mathematical formulations can be used to model gene circuits. At the population
level, gene circuits can be modeled using ordinary differential equations (ODEs). In an ODE
formulation, the dynamics of the interactions within the circuit are deterministic. That is, the
ODE formulation ignores the randomness intrinsic to cellular processes, and is convenient
for circuit designs that are thought to be less affected by noise or when the impact of noise is
irrelevant (Marguet et al., 2007). An ODE model facilitates further sophisticated analyses,
such as sensitivity analysis and bifurcation analysis. Such analyses are useful to determine
how quantitative or qualitative circuit behavior will be impacted by changes in circuit
parameters. For instance, in designing a bistable toggle switch, bifurcation analysis was
used to explore how qualitative features of the circuit may depend on reaction parameters
(Gardner et al., 2000). Results of the analysis were used to guide the choice of genetic
components (genes, promoters and RBS) and growth conditions to favor a successful
implementation of designed circuit function. However, in a single cell, the gene circuit’s
dynamics often involve small numbers of interacting molecules that will result in highly
noisy dynamics even for expression of a single gene. For many gene circuits, the impact of
such cellular noise may be critical and needs to be considered (Di Ventura et al., 2006). This
can be done using stochastic models (Arkin, 2001). Different rounds of simulation using a
stochastic model will lead to different results each time, which presumably reflect aspects of
noisy dynamics inside a cell. For synthetic biology applications, the key of such analysis is
not necessarily to accurately predict the exact noise level at each time point. This is not
possible even for the simplest circuits due to the “extrinsic” noise component for each circuit
(Elowitz et al., 2002). Rather, it is a way to determine to what extent the designed function
can be maintained and, given a certain level of uncertainty or randomness, to what extent
additional layers of control can minimize or exploit such variations. Independently of the
model that is used, these can be evolved in silico to optimize designs towards a given
function. As an example, genetic algorithms were used by Francois and Hakim (2004) to
design gene regulatory networks exhibiting oscillations.
In most attempts to engineer gene circuits, mathematical models are often purposefully
simplified to accommodate available computational power and to capture the qualitative
behavior of the underlying systems. Simplification is beneficial partially due to the limited
quantitative characterization of circuit elements, and partially because simpler models may
better reveal key design constraints. The limitation, however, is that a simplified model may
fail to capture richer dynamics intrinsic to a circuit. Synthetic models combine features of
mathematical models and model organisms. In the engineering of genetic networks,
166 Computational Biology and Applied Bioinformatics
synthetic biologists start from mathematical models, which are used as the blueprints to
engineer a model out of biological components that has the same materiality as model
organism but is much less complex. The specific characteristics of synthetic models allow
one to use them as tools in distinguishing between different mathematical models and
evaluating results gained in performing experiments with model organisms (Loettgers,
2007).
each part is modeled independently following the ODE formalism. This results in a set of
composable parts that communicate by fluxes of signal carriers, whose overall amount is
constantly updated inside their corresponding pools. The model also considers transcription
factors, chemicals and small RNAs as signal carriers. Pools are placed among parts and
devices: they store free signal carriers and distribute them to the whole circuit. Hence,
polymerases and ribosomes have a finite amount; this permits to estimate circuit scalability
with respect to the number of parts. Mass action kinetics is fully employed and no
approximations are required to depict the interactions of signal carriers with DNA and
mRNA. The authors implemented the corresponding models into ProMoT (Process
Modeling Tool), software for the object-oriented and modular composition of models for
dynamic processes (Mirschel et al., 2009). GenoCAD (Czar et al., 2009) and GEC (Pedersen &
Phillips, 2009) introduce the notions of a grammar and of a programming language for
genetic circuit design, respectively. These tools use a set of rules to check the correct
composition of standard parts. Relying on libraries of standard parts that are not necessarily
taken from the Registry of Standard Biological Parts, these programs can translate a circuit
design into a complete DNA sequence. The two tools differ in capabilities and possible
connectivity to other tools.
The ultimate goal of designing a genetic circuit is that it works, i.e. that it performs a given
function. For that purpose, optimization cycles to establish an appropriate structure and a
good set of kinetic parameters values are required. These optimization problems are
extremely complex since they involve the selection of adequate parts and appropriate
continuous parameter values. Stochastic optimization methods (e.g. evolutionary
algorithms) attempt to find good solutions by biased random search. They have the
potential for finding globally optimal solutions, but optimization is computationally
expensive. On the other hand, deterministic methods (e.g. gradient descent) are local search
methods, with less computational cost, but at the expense of missing good solutions.
The optimization problem can be tackled by tools such as Genetdes (Rodrigo et al., 2007b)
and OptCircuit (Dasika & Maranas, 2008). They rely on different parts characterizations and
optimization algorithms. Genetdes uses a stochastic method termed “Simulated Annealing”
(Kirkpatrick et al., 1983), which produces a single solution starting from a random circuit
configuration. As a drawback, the algorithm is more likely to get stuck in a local minimum
than an evolutionary algorithm. OptCircuit, on the contrary, treats the circuit design
problem with a deterministic method (Bansal et al., 2003), implementing a procedure
towards a “local’ optimal solution. Each of these optimization algorithms requires a very
simplified model for gene dynamics where, for instance, transcription and translation are
treated as a single step process. Moreover, the current methods can cope only with rather
small circuits. Another tool that has been described by Batt and co-workers (2007),
RoVerGeNe, addresses the problem of parameter estimation more specifically. This tool
permits to tune the performance and to estimate the robustness of a synthetic network with
a known behavior and for which the topology does not require further improvement.
Detailed design of synthetic parts that reproduce the estimated circuit kinetics and
dynamics is a complex task. It requires computational tools in order to achieve error free
solutions in a reasonable amount of time. Other than the placement/removal of restriction
sites and the insertion/deletion of longer motifs, mutations of single nucleotides may be
necessary to tune part characteristics (e.g. promoter strength and affinity toward regulatory
factors). Gene Designer (Villalobos et al., 2006) is a complete tool for building artificial DNA
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 169
segments and codon usage optimization. GeneDesign (Richardson et al., 2006) is another
tool to design long synthetic DNA sequences. Many other tools are available for specific
analysis of the DNA and RNA circuit components. The package UNAFold (Markham &
Zuker, 2008) predicts the secondary structure of nucleic acid sequences to simulate their
hybridizations and to estimate their melting temperature according to physical
considerations. A more accurate analysis of the secondary structure of ribonucleic acids can
be performed through the Vienna RNA package (Hofacker, 2003). Binding sites along a
DNA chain can be located using Zinc Finger Tools (Mandell & Barbas, 2006). These tools
allows one to search DNA sequences for target sites of particular zinc finger proteins
(Kaiser, 2005), whose structure and composition can also be arranged. Thus, gene control by
a class of proteins with either regulation or nuclease activity can be improved. Furthermore,
tools that enable promoter predictions and primers design are available, such as BDGP and
Primer3. Another relevant task in synthetic biology is the design and engineering of new
proteins. Many tools have been proposed for structure prediction, homology modeling,
function prediction, docking simulations and DNA-protein interactions evaluation.
Examples include the Rosetta package (Simons et al., 1999); RAPTOR (Xu et al., 2003); PFP
(Hawkins et al., 2006); Autodock 4.2 (Morris et al., 2009) and Hex 5.1 (Ritchie, 2008).
Further advance in computational synthetic biology will result from tools that combine and
integrate most of the tasks discussed, starting with the choice and assembly of biological
parts to the compilation and modification of the corresponding DNA sequences. Examples
of such tools comprise SynBioSS (Hill et al., 2008); Clotho and Biskit (Grunberg et al., 2007).
Critical elements are still lacking, such as tools for automatic information integration
(literature and databases), and tools that re-use standardized model entities for optimal
circuit design. Overall, providing an extended and integrated information technology
infrastructure will be crucial for the development of the synthetic biology field.
synthetic biology to metabolic engineering can potentially create a paradigm shift. Rather
than starting with the full complement of components in a wild-type organism and
piecewise modifying and streamlining its function, metabolic engineering can be attempted
from a bottom-up, parts-based approach to design by carefully and rationally specifying the
inclusion of each necessary component (McArthur IV & Fong, 2010). The importance of
rationally designing improved or new microbial cell factories for the production of drugs
has grown substantially since there is an increasing need for new or existing drugs at prices
that can be affordable for low-income countries. Large-scale re-engineering of a biological
circuit will require systems-level optimization that will come from a deep understanding of
operational relationships among all the constituent parts of a cell. The integrated framework
necessary for conducting such complex bioengineering requires the convergence of systems
and synthetic biology (Koide et al., 2009). In recent years, with advances in systems biology
(Kitano, 2002), there has been an increasing trend toward using mathematical and
computational tools for the in silico design of enhanced microbial strains (Rocha et al., 2010).
In silico testing
(parameter space; dynamics)
SIMULATION
DYNAMIC PROPERTIES
EXPLOITATION
re-engineer
Model organism (chassis)
CIRCUIT CONSTRUCTION •Platform (microscopy; flow cytometry; microfluidics)
•Read-out (phenotype; morphology; expression)
EXPERIMENTAL •Analysis (equilibrium; cellular context; genetic
VALIDATION background; environment)
•Circuit construction
CIRCUITS DEFINITION OR
IMPROVEMENT
CHARACTERIZATION
Fig. 1. The iterative synthetic biology approach to design a given biological circuit/system.
Current models in both synthetic and systems biology emphasize the relationship between
environmental influences and the responses of biological networks. Nevertheless, these
models operate at different scales, and to understand the new paradigm of rational systems
re-engineering, synthetic and systems biology fields must join forces (Koide et al., 2009).
Synthetic biology and bottom-up systems biology methods extract discrete, accurate,
quantitative, kinetic and mechanistic details of regulatory sub-circuits. The models
generated from these approaches provide an explicit mathematical foundation that can
ultimately be used in systems redesign and re-engineering. However, these approaches are
confounded by high dimensionality, non-linearity and poor prior knowledge of key
dynamic parameters (Fisher & Henzinger, 2007) when scaled to large systems.
172 Computational Biology and Applied Bioinformatics
5.1.1 Riboswitches
One of the most promising elements are the riboswitches, genetic control elements that
allow small molecules to regulate gene expression. They are structured elements typically
found in the 5’-untranslated regions of mRNA that recognize small molecules and respond
by altering their three-dimensional structure. This, in turn, affects transcription elongation,
translation initiation, or other steps of the process that lead to protein production (Beisel &
Smolke, 2009; Winkler & Breaker, 2005). Biological cells can modulate gene expression in
response to physical and chemical variations in the environment allowing them to control
their metabolism and preventing the waste of energy expenditure or inappropriate
physiological responses (Garst & Batey, 2009). There are currently at least twenty classes of
riboswitches that recognize a wide range of ligands, including purine nucleobases (purine
riboswitch), amino acids (lysine riboswitch), vitamin cofactors (cobalamin riboswitch),
amino sugars, metal ions (mgtA riboswitch) and second messenger molecules (cyclic di-
GMP riboswitch) (Beisel & Smolke, 2009). Riboswitches are typically composed of two
distinct domains: a metabolite receptor known as the aptamer domain, and an expression
platform whose secondary structure signals the regulatory response. Embedded within the
aptamer domain is the switching sequence, a sequence shared between the aptamer domain
and the expression platform (Garst & Batey, 2009). The aptamer domain is part of the RNA
174 Computational Biology and Applied Bioinformatics
cells opens unforeseen progresses in the medical field. Emerging applications include the
design of bacteria to produce therapeutic agents (in vitro or in vivo) and the use of live
bacteria as targeted delivery systems (Forbes, 2010; Pawelek et al., 2003). An impressive
example of these applications is described by Anderson and co-workers (2006). The authors
have successfully engineered E. coli harboring designed plasmids to invade cancer cells in
an environmentally controlled way, namely in a density-dependent manner under
anaerobic growth conditions and arabinose induction. Plasmids were built containing the
inv gene from Yersinia pseudotuberculosis under control of the Lux promoter, the hypoxia-
responsive fdhF promoter, and the arabinose-inducible araBAD promoter. This is significant
because the tumor environment is often hypoxic and allows for high bacterial cell densities
due to depressed immune function in the tumor. Therefore, this work demonstrated, as a
“proof of concept”, that one can potentially use engineered bacteria to target diseased cells
without significantly impacting healthy cells.
Ideally, an engineered bacterium for cancer therapy would specifically target tumors
enabling the use of more toxic molecules without systemic effects; be self-propelled enabling
its penetration into tumor regions that are inaccessible to passive therapies; be responsive to
external signals enabling the precise control of location and timing of cytotoxicity; be able to
sense the local environment allowing the development of responsive therapies that can
make decisions about where and when drugs are administered; and be externally detectable,
thus providing information about the state of the tumor, the success of localization and the
efficacy of treatment (Forbes, 2010). Indeed some of these features naturally exist in some
bacteria, e.g. many genera of bacteria have been shown to preferentially accumulate in
tumors, including Salmonella, Escherichia, Clostridium and Bifidobacterium. Moreover, bacteria
have motility (flagella) that enable tissue penetration and chemotactic receptors that direct
chemotaxis towards molecular signals in the tumor microenvironment. Selective
cytotoxicity can be engineered by transfection with genes for therapeutic molecules,
including toxins, cytokines, tumor antigens and apoptosis-inducing factors. External control
can be achieved using gene promoter strategies that respond to small molecules, heat or
radiation. Bacteria can be detected using light, magnetic resonance imaging and positron
emission tomography. At last, genetic manipulation of bacteria is easy, thus enabling the
development of treatment strategies, such as expression of anti-tumor proteins and
including vectors to infect cancer cells (Pawelek et al., 2003). To date, many different
bacterial strategies have been implemented in animal models (e.g. Salmonella has been tested
for breast, colon, hepatocellular, melanoma, neuroblastoma, pancreatic and prostate cancer),
and also some human trials (e.g. C. butyricum M-55 has been tested for squamous cell
carcinoma, metastic, malignant neuroma and melanoma) have been carried out (Forbes,
2010).
Ultrasound is one of the techniques often used to treat solid tumors (e.g. breast cancer);
however, this technique is not always successful, as sometimes it just heats the tumor
without destroying it. Therefore, we are currently engineering the heat shock response
machinery from E. coli to trigger the release of a therapeutic agent in situ concurrent with
ultrasound treatment. For that purpose, several modeling and engineering steps are being
implemented. The strategy being pursued is particularly useful for drugs that require in situ
synthesis because of a poor bioavailability, thereby avoiding repetitive oral doses to achieve
sufficient concentration inside the cells. The use of live bacteria for therapeutic purposes
naturally poses some issues (Pawelek et al., 2003), but currently the goal is to achieve the
176 Computational Biology and Applied Bioinformatics
The use of synthetic biology approaches in the field of metabolic engineering opens
enormous possibilities, especially toward the production of new drugs for cancer treatment.
Our goal is to design and model a new biosynthetic pathway for the production of natural
drugs in E. coli. Key to this is the specification of gene sequences encoding enzymes that
catalyze each reaction in the pathway, and whose DNA sequences can be incorporated into
devices that lead to functional expression of the molecules of interest (Prather & Martin,
2008). Partial pathways can be recruited from independent sources and co-localized in a
single host (Kobayashi et al., 2004). Alternatively, pathways can be constructed for the
production of new, non-natural products by engineering existing routes (Martin et al., 2003).
6. Conclusion
Despite all the scientific advances that humankind has seen over the last centuries, there are
still no clear and defined solutions to diagnose and treat cancer. In this sense, the search for
innovative and efficient solutions continues to drive research and investment in this field.
Synthetic biology uses engineering principles to create, in a rational and systematic way,
functional systems based on the molecular machines and regulatory circuits of living
organisms, or to re-design and fabricate existing biological systems. Bioinformatics and
newly developed computational tools play a key role in the improvement of such systems.
Elucidation of disease mechanisms, identification of potential targets and biomarkers,
design of biological elements for recognition and targeting of cancer cells, discovery of new
chemotherapeutics or design of novel drugs and catalysts, are some of the promises of
synthetic biology. Recent achievements are thrilling and promising; yet some of such
innovative solutions are still far from a real application due to technical challenges and
ethical issues. Nevertheless, many scientific efforts are being conducted to overcome these
limitations, and undoubtedly it is expected that synthetic biology together with
sophisticated computational tools, will pave the way to revolutionize the cancer field.
7. References
Alon, U. (2003). Biological networks: the tinkerer as an engineer. Science, Vol.301, No.5641,
(September 2003), pp. 1866-1867, ISSN 0036-8075
Anderson, J.C.; Clarke, E.J.; Arkin, A.P. & Voigt, C.A. (2006). Environmentally controlled
invasion of cancer cells by engineered bacteria. Journal of Molecular Biology, Vol.355,
No.4, (January 2006), pp. 619–627, ISSN 00222836
Andrianantoandro, E.; Basu, S.; Karig, D.K. & Weiss, R. (2006). Synthetic biology: new
engineering rules for an emerging discipline. Molecular Systems Biology Vol.2, No.
2006.0028, (May 2006), pp. 1-14, ISSN 1744-4292
Arkin, A.P. (2001). Synthetic cell biology. Current Opinion in Biotechnology Vol.12, No.6,
(December 2001), pp. 638-644, ISSN 0958-1669
Bansal, V.; Sakizlis, V.; Ross, R.; Perkins, J.D. & Pistikopoulos, E.N. (2003). New algorithms
for mixed-integer dynamic optimization. Computers and Chemical Engineering
Vol.27, No.5, (May 2003), pp. 647-668, ISSN 0098-1354
Bar, H.; Yacoby, I. & Benhar, I. (2008). Killing cancer cells by targeted drug-carrying phage
nanomedicines. BMC Biotechnology Vol.8, No.37, (April 2008), pp. 1-14, ISSN 1472-
6750
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 179
Batt, G., Yordanov, B., Weiss, R. & Belta, C. (2007). Robustness analysis and tuning of
synthetic gene networks. Bioinformatics Vol.23, No.18, (July 2007), pp. 2415-2422,
ISSN 1367-4803
Bedi, D.; Musacchio, T.; Fagbohun, O.A.; Gillespie, J.W.; Deinnocentes, P.; Bird, R.C.;
Bookbinder, L.; Torchilin, V.P. & Petrenko, V.A. (2011). Delivery of siRNA into
breast cancer cells via phage fusion protein-targeted liposomes. Nanomedicine:
Nanotechnology, Biology, and Medicine, doi:10.1016/j.nano.2010.10.004, ISSN 1549-
9634
Beisel, C.L. & Smolke, C.D. (2009). Design principles for riboswitch function. PLoS
Computational Biology Vol.5, No.4, (April 2009), e1000363, pp. 1-14, ISSN 1553-734X
Benner, S.A. & Sismour, A.M. (2005). Synthetic biology. Nature Reviews Genetics Vol.6, No.7,
(July 2005), pp. 533–543, ISSN 1471-0056
Brown, K.C. (2010). Peptidic tumor targeting agents: the road from phage display peptide
selections to clinical applications. Current Pharmaceutical Design Vol.16, No.9,
(March 2010), pp. 1040-1054, ISSN 1381-6128
Brustad, E.M. & Arnold, F.H. (2010). Optimizing non-natural protein function with directed
evolution. Current Opinion in Chemical Biology Vol.15, No.2, (April 2010), pp. 1-10,
ISSN 1367-5931
Bulter, T.; Lee, S.G.; Wong, W.W.; Fung, E.; Connor, M.R. & Liao, J.C. (2004). Design of
artificial cell–cell communication using gene and metabolic networks. PNAS
Vol.101, No.8 (February 2004), pp. 2299–2304, ISSN 0027-8424
Bumpus, S.; Evans, B.; Thomas, P.; Ntai, I. & Kelleher, N. (2009). A proteomics approach to
discovering natural products and their biosynthetic pathways. Nature Biotechnology
Vol.27, No.10, (September 2009), pp. 951-956, ISSN 1087-0156
Canton, B.; Labno, A. & Endy, D. (2008). Refinement and standardization of synthetic
biological parts and devices. Nature Biotechnology Vol.26, No. 7, (July 2008), pp. 787-
793, ISSN 1087-0156
Carothers, J.M.; Goler, J.A. & Keasling, J.D. (2009). Chemical synthesis using synthetic
biology. Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 498-503,
ISSN 0958-1669
Chan, L.Y.; Kosuri, S. & Endy, D. (2005). Refactoring bacteriophage T7. Molecular Systems
Biology Vol.1, No. 2005.0018, (September 2005), pp. 1-10, ISSN 1744-4292
Cooling, M.T.; Hunter, P. & Crampin, E.J. (2008). Modelling biological modularity with
CellML. IET Systems Biology Vol.2, No.2, (March 2008), pp. 73-79, ISSN 1751-8849
Crameri, A.; Raillard, S.A.; Bermudez, E. & Stemmer, W.P. (1998). DNA shuffling of a family
of genes from diverse species accelerates directed evolution. Nature Vol.391,
No.6664, (January 1998), pp. 288-291, ISSN 0028-0836
Czar, M.J.; Cai, Y. & Peccoud, J. (2009). Writing DNA with GenoCAD. Nucleic Acids Research
Vol.37, No.2, (May 2009), pp. W40-W47, ISSN 0305-1048
Dasika, M.S. & Maranas, C.D. (2008). OptCircuit: an optimization based method for
computational design of genetic circuits. BMC Systems Biology Vol.2, No.24, pp. 1-
19, ISSN 1752-0509
Datsenko, K.A. & Wanner, B.L. (2000). One-step inactivation of chromosomal genes in
Escherichia coli K-12 using PCR products. PNAS Vol.97, No. 12, (June 2000), pp.
6640-6645, ISSN 0027-8424
180 Computational Biology and Applied Bioinformatics
Dawid, A.; Cayrol, B. & Isambert, H. (2009). RNA synthetic biology inspired from bacteria:
construction of transcription attenuators under antisense regulation. Physical
Biology Vol.6, No. 025007, (July 2009), pp. 1-10, ISSN 1478-3975
Di Bernardo, D.; Thompson, M.J.; Gardner, T.S.; Chobot, S.E.; Eastwood, E.L.; Wojtovich,
A.P.; Elliott, S.E.; Schaus, S.E. & Collins, J.J. (2005). Chemogenomic profiling on a
genome-wide scale using reverse-engineered gene networks. Nature Biotechnology
Vol.23, No.3, (March 2005), pp. 377-383, ISSN 1087-0156
Di Ventura, B.; Lemerle, C.; Michalodimitrakis, K. & Serrano, L. (2006). From in vivo to in
silico biology and back. Nature Vol.443, No.7111, (October 2006), pp. 527-533, ISSN
0028-0836
Dougherty, M.J. & Arnold, F.H. (2009). Directed evolution: new parts and optimized
function. Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 486-491,
ISSN 0958-1669
Du B, Han H, Wang Z, Kuang L, Wang L, Yu L, Wu M, Zhou Z, Qian M. (2010). Targeted
drug delivery to hepatocarcinoma in vivo by phage-displayed specific binding
peptide. Molecular Cancer Research Vol.8, No.2, (February 2010), pp.135-144, ISSN
1541-7786
Ellis, T.; Wang, X. & Collins, J.J. (2009). Diversity-based, model guided construction of
synthetic gene networks with predicted functions. Nature Biotechnology Vol.27,
No.5, (May 2009), pp. 465– 471, ISSN 1087-0156
Elowitz, M.B. & Leibler, S. (2000). A synthetic oscillatory network of transcriptional
regulators. Nature Vol.403, No.6767, (January 2000), pp. 335-338, ISSN 0028-0836
Elowitz, M.B.; Levine, A.J.; Siggia, E.D. & Swain, P.S. (2002). Stochastic gene expression in a
single cell. Science Vol.297, No.5584, (August 2002), pp. 1183–1186, ISSN 0036-8075
Endy, D. (2005). Foundations for engineering biology. Nature Vol.438, No. 7067, (November
2005), pp. 449-453, ISSN 0028-0836
Engler, C.; Gruetzner, R.; Kandzia, R. & Marillonnet, S. (2009). Golden gate shuffling: a one
pot DNA shuffling method based on type ils restriction enzymes. PLoS ONE Vol.4,
No.5, (May 2009), e5553, pp. 1-9, ISSN 1932-6203
Eriksson, F.; Tsagozis, P.; Lundberg, K.; Parsa, R.; Mangsbo, S.M.; Persson, M.A.; Harris,
R.A. & Pisa, P.J. (2009). Tumor-specific bacteriophages induce tumor destruction
through activation of tumor-associated macrophages. The Journal of Immunology
Vol.182, No.5, (March 2009), pp. 3105-3111, ISSN 0022-1767
Ferreira, C.S.; Matthews, C.S. & Missailidis, S. (2006). DNA aptamers that bind to MUC1
tumour marker: design and characterization of MUC1-binding single-stranded
DNA aptamers. Tumour Biology Vol.27, No.6, (October 2006), pp. 289-301, ISSN
1010-4283
Fisher, J. & Henzinger, T.A. (2007). Executable cell biology. Nature Biotechnology Vol.25,
No.11, (November 2007), pp. 1239–1249, ISSN 1087-0156
Forbes, N.S. (2010). Engineering the perfect (bacterial) cancer therapy. Nature Reviews Cancer
Vol.10, No.11, (October 2010), pp. 785-794, ISSN 1474-175X
Francois, P. & Hakim, V. (2004). Design of genetic networks with specified functions by
evolution in silico. PNAS Vol.101, No.2, (January 2004), pp. 580–585, ISSN 0027-
8424
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 181
Funahashi, A.; Morohashi, M. & Kitano, H. (2003). CellDesigner: a process diagram editor
for gene-regulatory and biochemical networks. BIOSILICO Vol.1, No.5, (November
2003),pp. 159–162, ISSN 1478-5382
Fung, E.; Wong, W.W.; Suen, J.K.; Bulter, T.; Lee, S.G. & Liao, J.C. (2005). A synthetic gene
metabolic oscillator. Nature Vol.435, No.7038, (May 2005), pp. 118–122, ISSN 0028-
0836
Gao, C.; Mao, S.; Ronca, F.; Zhuang, S.; Quaranta, V.; Wirsching, P. & Janda, K.D. (2003). De
novo identification of tumor-specific internalizing human antibody-receptor pairs
by phage-display methods. Journal of Immunological Methods. Vol.274, No.1-2,
(March 2003), pp. 185-197, ISSN 0022-1759
Gardner, T.S.; Cantor, C.R. & Collins, J.J. (2000). Construction of a genetic toggle switch in
Escherichia coli. Nature Vol.403, No.6767, (January 2000), pp. 339-342, ISSN 0028-
0836
Garst, A.D. & Batey, R.T. (2009). A switch in time: detailing the life of a riboswitch.
Biochimica Biophysica Acta Vol.1789, No.9-10, (September-October 2009), pp. 584-
591, ISSN 0006-3002
Goler, J.A. (2004). A design and simulation tool for synthetic biological systems. Cambridge,
MA: MIT
Golynskiy, M.V. & Seelig, B. (2010). De novo enzymes: from computational design to mRNA
display. Trends in Biotechnology Vol.28, No.7, (July 2010), pp. 340-345, ISSN 0167-
7799
Grunberg, R.; Nilges, M. & Leckner, J. (2007). Biskit —a software platform for structural
bioinformatics. Bioinformatics Vol.23, No.6, (March 2007), pp. 769-770, ISSN 1367-
4803
Ham, T.S.; Lee, S.K.; Keasling, J.D. & Arkin, A.P. (2006). A tightly regulated inducible
expression system utilizing the fim inversion recombination switch. Biotechnology
and Bioengineering Vol.94, No.1, (May 2006), pp. 1–4, ISSN 0006-3592
Hartley, J.L. (2003). Use of the gateway system for protein expression in multiple hosts.
Current Protocols in Protein Science Chap5: Unit 5.17
Hawkins, T.; Luban, S. & Kihara, D. (2006). Enhanced automated function prediction using
distantly related sequences and contextual association by PFP. Protein Science
Vol.15, No.6, (June 2006), pp. 1550-1556, ISSN 1469-896X
Hianik, T. & Wang, J. (2009). Electrochemical aptasensors – recent achievements and
perspectives. Electroanalysis Vol.21, No.11, (June 2009), pp. 1223-1235, ISSN 1521-
4109
Hill, A.D.; Tomshine, J.R.; Weeding, E.M.; Sotiropoulos, V. & Kaznessis, Y.N. (2008).
SynBioSS: the synthetic biology modeling suite. Bioinformatics Vol.24, No.21,
(November 2008),pp. 2551-2553, ISSN 1367-4803
Hofacker, I.L. (2003). Vienna RNA secondary structure server. Nucleic Acids Research Vol.31,
No.13, (July 2003), pp. 3429-3431, ISSN 0305-1048
Isaacs, F.J.; Dwyer, D.J. & Collins, J.J. (2006). RNA synthetic biology. Nature Biotechnology
Vol.24, No.5, (May 2006), pp. 545-554, ISSN 1087-0156
Isalan, M.; Lemerle, C.; Michalodimitrakis, K.; Horn, C.; Beltrao, P.; Raineri, E.; Garriga-
Canut, M.; Serrano, L. (2008). Evolvability and hierarchy in rewired bacterial gene
networks. Nature Vol.452, No.7189, (April 2008), pp. 840–845, ISSN 0028-0836
182 Computational Biology and Applied Bioinformatics
Jemal, A.; Bray, F.; Center, M.M.; Ferlay, J.; Ward, E. & Forman, D. (2011). Global cancer
statistics. CA Cancer Journal for Clinicians Vol.61, No.2, (March-April 2011),pp. 69-
90, ISSN 0007-9235
Kaiser, J. (2005). Gene therapy. Putting the fingers on gene repair. Science Vol.310, No.5756,
(December 2005), pp.1894-1896, ISSN 0036-8075
Kaznessis, Y.N. (2009). Computational methods in synthetic biology. Biotechnology Journal
Vol.4, No.10, (October 2009), pp.1392-1405, ISSN 1860-7314
Kelly, J.; Rubin, A.J.; Davis, J. II; Ajo-Franklin, C.M.; Cumbers, J.; Czar, M.J.; de Mora, K.;
Glieberman, A.I.; Monie, D.D. & Endy, D. (2009). Measuring the activity of BioBrick
promoters using an in vivo reference standard. Journal of Biological Engineering
Vol.3, No.4, (March 2009), pp. 1-13, ISSN 1754-1611
Kirkpatrick, S.; Gelatt, C.D. Jr. & Vecchi, M.P. (1983). Optimization by Simulated Annealing.
Science Vol.220, No.4598, (May 1983), pp. 671-680, ISSN 0036-8075
Kitano, H. (2002). Systems biology: a brief overview. Science Vol.295, No.5560, (March 2002),
pp. 1662-1664, ISSN 0036-8075
Kizer, L.; Pitera, D.J.; Pfleger, B.F. & Keasling, J.D. (2008). Application of functional
genomics to pathway optimization for increased isoprenoid production. Applied
and Environmental Microbiology Vol.74, No.10, (May 2008), pp. 3229–3241, ISSN
0099-2240
Kobayashi, H.; Kaern, M.; Araki, M.; Chung, K.; Gardner, T.S.; Cantor, C.R. & Collins, J.J.
(2004). Programmable cells: interfacing natural and engineered gene networks.
PNAS Vol.101, No.22, (June 2004), pp. 8414–8419, ISSN 0027-8424
Koide, T.; Pang, W.L. & Baliga, N.S. (2009). The role of predictive modeling in rationally
reengineering biological systems. Nature Reviews Microbiology Vol.7, No.4, (April
2009), pp. 297-305, ISSN 1740-1526
Kosuri, S.; Kelly, J.R. & Endy, D. (2007). TABASCO: a single molecule, base pair resolved
gene expression simulator. BMC Bioinformatics Vol.8, No.480, (December 2007), pp.
1-15, ISSN 1471-2105
Lartigue, C.; Glass, J.I.; Alperovich, N.; Pieper, R.; Parmar, P.P.; Hutchison III, C.A.; Smith,
H.O. & Venter, J.C. (2007). Genome transplantation in bacteria: changing one
species to another. Science Vol.317, No.5838, (August 2007), pp. 632-638, ISSN 0036-
8075
Leonard, E.; Nielsen, D.; Solomon, K. & Prather, K.J. (2008). Engineering microbes with
synthetic biology frameworks. Trends in Biotechnology Vol.26, No.12, (December
2008), pp. 674-681, ISSN 0167-7799
Li K, Chen Y, Li S, Nguyen HG, Niu Z, You S, Mello CM, Lu X, Wang Q. (2010). Chemical
modification of M13 bacteriophage and its application in cancer cell imaging.
Bioconjugate Chemistry Vol.21, No.7, (December 2010), pp. 1369-1377, ISSN 1043-
1802
Loettgers, A. (2007). Model organisms and mathematical and synthetic models to explore
regulation mechanisms. Biological Theory Vol.2, No.2, (December 2007), pp. 134-142,
ISSN 1555-5542
Mandell, J.G. & Barbas, C.F. III (2006). Zinc Finger Tools: custom DNA binding domains for
transcription factors and nucleases. Nucleic Acids Research Vol.34, (July 2006), pp.
W516-523, ISSN 0305-1048
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 183
Marchisio, M.A. & Stelling, J. (2008). Computational design of synthetic gene circuits with
composable parts. Bioinformatics Vol.24, No.17, (September 2008), pp.1903–1910,
ISSN 1367-4803
Marchisio, M.A. & Stelling, J. (2009). Computational design tools for synthetic biology.
Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 479-485, ISSN
0958-1669
Marguet, P.; Balagadde, F.; Tan, C. & You, L. (2007). Biology by design: reduction and
synthesis of cellular components and behavior. Journal of the Royal Society Interface
Vol.4, No.15, (August 2007), pp. 607-623, ISSN 1742-5689
Markham, N.R. & Zuker, M. (2008). UNAFold: software for nucleic acid folding and
hybridization. Methods Molecular Biology Vol.453, No.I, pp. 3-31, ISSN 1064-3745
Martin, V.J.; Pitera, D.J.; Withers, S.T.; Newman, J.D. & Keasling, J.D. (2003). Engineering a
mevalonate pathway in Escherichia coli for production of terpenoids. Nature
Biotechnology Vol.21, No.7, (July 2003), pp.796–802, ISSN 1087-0156
Matsuoka, Y.; Ghosh, S. & Kitano, H. (2009). Consistent design schematics for
biological systems: standardization of representation in biological engineering.
Journal of the Royal Society Interface Vol.6, No.4, (August 2009), pp. S393-S404, ISSN
1742-5689
McArthur IV, G.H. & Fong, S.S. (2010). Toward engineering synthetic microbial metabolism.
Journal of Biomedicine and Biotechnology doi:10.1155/2010/459760, ISSN 1110-7243
Mirschel, S.; Steinmetz, K.; Rempel, M.; Ginkel, M. & Gilles, E.D. (2009). PROMOT: modular
modeling for systems biology. Bioinformatics Vol.25, No.5, (March 2009), pp. 687-
689, ISSN 1367-4803
Morris, G.M.; Huey, R.; Lindstrom, W.; Sanner, M.F.; Belew, R.K.; Goodsell, D.S. & Olson,
A.J. (2009). AutoDock4 and AutoDockTools4: automated docking with selective
receptor flexibility. Journal of Computational Chemistry Vol.30, No.16, (December
2009), pp. 2785-2791, ISSN 0192-8651
Mueller, S.; Coleman, J.R. & Wimmer, E. (2009). Putting synthesis into biology: a viral view
of genetic engineering through de novo gene and genome synthesis. Chemistry &
Biology Vol.16, No.3, (March 2009), pp. 337-347, ISSN 1074-5521
Neumann, H. & Neumann-Staubitz, P. (2010). Synthetic biology approaches in drug
discovery and pharmaceutical biotechnology. Applied Microbiology and Biotechnology
Vol.87, No.1, (June 2010), pp. 75-86, ISSN 0175-7598
Nielsen, J. (2001). Metabolic engineering. Applied Microbiology and Biotechnology Vol.55, No.3,
(April 2001), pp. 263-283, ISSN 0175-7598
Park, J.H.; Lee, S.Y.; Kim, T.Y. & Kim, H.U. (2008). Application of systems biology for
bioprocess development. Trends in Biotechnology Vol.26, No.8, (August 2008), pp.
404–412, ISSN 0167-7799
Pawelek, J.; Low, K. & Bermudes, D. (2003). Bacteria as tumour-targeting vectors. Lancet
Oncology Vol.4, No.9, (September 2003), pp. 548–556, ISSN 1470-2045
Pedersen, M. & Phillips, A. (2009). Towards programming languages for genetic engineering
of living cells. Journal of the Royal Society Interface Vol.6, No.4, (August 2009), pp.
S437-S450, ISSN 1742-5689
184 Computational Biology and Applied Bioinformatics
Prather, K. & Martin, C.H. (2008). De novo biosynthetic pathways: rational design of
microbial chemical factories. Current Opinion in Biotechnology Vol.19, No.5, (October
2008), pp. 468–474, ISSN 0958-1669
Price, N.D. & Shmulevich, I. (2007). Biochemical and statistical network models for systems
biology. Current Opinion in Biotechnology Vol.18, No.4, (August 2007), pp. 365–370,
ISSN 0958-1669
Purnick, P.E. & Weiss, R. (2009). The second wave of synthetic biology: from modules to
systems. Nature Reviews Molecular Cell Biology Vol.10, No.6, (June 2009), pp. 410-422,
ISSN 1471-0072
Quan, J. & Tian, J. (2009). Circular polymerase extension cloning of complex 1 gene libraries
and pathways. PLoS ONE Vol.4, No.7, (July 2009), e6441, pp.1-6, ISSN 1932-6203
Richardson, S.M.; Wheelan, S.J.; Yarrington, R.M. & Boeke, J.D. (2006). GeneDesign: rapid,
automated design of multikilobase synthetic genes. Genome Research Vol.16, No.4,
(April 2006), pp. 550-556, ISSN 1088-9051
Ritchie, D.W. (2008). Recent progress and future directions in protein–protein
docking.Current Protein & Peptide Science Vol.9, No.1, (February 2008), pp. 1-15,
ISSN 1389-2037
Ro, D.K.; Paradise, E.M.; Ouellet, M.; Fisher, K.J.; Newman, K.L.; Ndungu, J.M.; Ho, K.A.;
Eachus, R.A.; Ham, T.S.; Kirby, J.; Chang, M.C.; Withers, S.T.; Shiba, Y.; Sarpong, R.
& Keasling, J.D. (2006). Production of the antimalarial drug precursor artemisinic
acid in engineered yeast. Nature Vol.440, No.7086, (April 2006), pp. 940–943, ISSN
0028-0836
Rocha, I.; Forster, J. & Nielsen, J. (2008). Design and application of genome-scale
reconstructed metabolic models. Methods in Molecular Biology Vol.416, No.IIIB, pp.
409-431, ISSN 1064-3745
Rocha, I.; Maia, P.; Evangelista, P.; Vilaça, P.; Soares, S.; Pinto, J.P.; Nielsen, J.; Patil, K.R.;
Ferreira, E.C. & Rocha, M. (2010). OptFlux: an open-source software platform for in
silico metabolic engineering. BMC Systems Biology Vol.4, No. 45, pp. 1-12, ISSN
1752-0509
Rodrigo, G.; Carrera, J. & Jaramillo, A. (2007a). Asmparts: assembly of biological model
parts. Systems and Synthetic Biology Vol.1, No.4, (December 2007), pp. 167-170, ISSN
1872-5325
Rodrigo, G.; Carrera, J. & Jaramillo, A. (2007b). Genetdes: automatic design of
transcriptional networks. Bioinformatics Vol.23, No.14, (July 2007), pp.1857-1858,
ISSN 1367-4803
Rodrigues, L.R.; Teixeira, J.A.; Schmitt, F.; Paulsson, M. & Lindmark Måsson, H. (2007). The
role of osteopontin in tumour progression and metastasis in breast cancer. Cancer
Epidemology Biomarkers & Prevention Vol.16, No.6, (June 2007), pp. 1087–1097, ISSN
1055-9965
Saito, H. & Inoue, T. (2009). Synthetic biology with RNA motifs. The International Journal of
Biochemistry & Cell Biology Vol.41, No.2, (February 2009), pp. 398-404, ISSN 1357-
2725
Salis, H.M., Mirsky, E.A. & Voigt, C.A. (2009). Automated design of synthetic ribosome
binding sites to control protein expression. Nature Biotechnology Vol.27, No.10,
(October 2009), pp. 946–950, ISSN 1087-0156
Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 185
Segre, D.; Vitkup, D. & Church, G.M. (2002). Analysis of optimality in natural and perturbed
metabolic networks. PNAS Vol.99, No.23, (November 2001), pp. 15112-15117, ISSN
0027-8424
Shetty, R.P.; Endy, D. & Knight, T.F. Jr. (2008). Engineering BioBrick vectors from BioBrick
parts. Journal of Biological Engineering Vol.2, No.1, (April 2008), pp.5-17 ISSN 1754-
1611
Simons, K.T.; Bonneau, R.; Ruczinski, I. & Baker, D. (1999). Ab initio protein structure
prediction of CASP III targets using ROSETTA. Proteins Vol.3, pp. 171-176, ISSN
0887-3585
Steinmetz, N.F. (2010).Viral nanoparticles as platforms for next-generation therapeutics and
imaging devices. Nanomedicine: Nanotechnology, Biology, and Medicine Vol.6, No.5,
(October 2010), pp. 634-641, ISSN 1549-9634
Stoltenburg, R.; Reinemann, C. & Strehlitz, B. (2007). SELEX—A (r)evolutionary method to
generate high-affinity nucleic acid ligands. Biomolecular Engineering Vol.24, No.4,
(October 2007), pp. 381–403, ISSN 1389-0344
Topp, S. & Gallivan, J.P. (2007). Guiding bacteria with small molecules and RNA. Journal of
the American Chemical Society Vol.129, No.21, (May 2007), pp. 6807-6811, ISSN 0002-
7863
Tyo, K.E.; Alper, H.S. & Stephanopoulos, G. (2007). Expanding the metabolic engineering
toolbox: more options to engineer cells. Trends in Biotechnology Vol.25, No.3, (March
2007), pp. 132-137, ISSN 0167-7799
Tyo, K.E.J.; Ajikumar, P.K. & Stephanopoulos, G. (2009). Stabilized gene duplication enables
long-term selection-free heterologous pathway expression. Nature Biotechnology
Vol.27, No.8, (August 2009), pp. 760–765, ISSN 1087-0156
Villalobos, A.; Ness, J.E.; Gustafsson, C.; Minshull, J. & Govindarajan, S. (2006). Gene
Designer: a synthetic biology tool for constructing artificial DNA segments. BMC
Bioinformatics Vol.7, No.285, (June 2006), pp.1-8, ISSN 1471-2105
Voloshchuk, N. & Montclare, J.K. (2010). Incorporation of unnatural amino acids for
synthetic biology. Molecular Biosystems Vol.6, No.1, (January 2010), pp. 65-80, ISSN
1742-2051
Wang, H.H.; Isaacs, F.J.; Carr, P.A.; Sun, Z.Z.; Xu, G.; Forest, C.R. & Church, G.M. (2009).
Programming cells by multiplex genome engineering and accelerated evolution.
Nature Vol.460, No.7257, (August 2009), pp. 894–898, ISSN 0028-0836
Wierling, C.; Herwig, R. & Lehrach, H. (2007). Resources, standards and tools for systems
biology. Briefings in Functional Genomics and Proteomics Vol.6, No.3, (September
2007), pp. 240-251, ISSN 1473-9550
Win, M.N. & Smolke, C.D. (2007). From the cover: a modular and extensible RNA-based
gene-regulatory platform for engineering cellular function. PNAS Vol.104, No.36,
(September 2007), pp. 14283–14288, ISSN 0027-8424
Winkler, W.C. & Breaker, R.R. (2005). Regulation of bacterial gene expression by
riboswitches. Annual Review of Microbiology Vol.59, pp. 487-517, ISSN 0066-4227
Xu, J.; Li, M.; Kim, D. & Xu, Y. (2003). RAPTOR: optimal protein threading by linear
programming. Journal of Bioinformatics and Computational Biology Vol.1, No.1, (April
2003), pp. 95-117, ISSN 0219-7200
186 Computational Biology and Applied Bioinformatics
Yokobayashi, Y.; Weiss, R. & Arnold, F.H. (2002). Directed evolution of a genetic circuit.
PNAS Vol.99, No.26, (September 2002), pp. 16587–16591, ISSN 0027-8424
You, L. (2004). Toward computational systems biology. Cell Biochemistry and Biophysics
Vol.40, No.2, pp. 167–184, ISSN 1085-9195
9
1. Introduction
Efficient biological sequence (proteins or DNA) alignment is an important and challenging
task in bioinformatics. It is similar to string matching in the context of biological data and
is used to infer the evolutionary relationship between a set of protein or DNA sequences.
An accurate alignment can provide valuable information for experimentation on the newly
found sequences. It is indispensable in basic research as well as in practical applications such
as pharmaceutical development, drug discovery, disease prevention and criminal forensics.
Many algorithms and methods, such as, dot plot (Gibbs & McIntyre, 1970), Needleman-Wunsch
(N-W) (Needleman & Wunsch, 1970), Smith-Waterman (S-W) (Smith & Waterman, 1981),
FASTA (Pearson & Lipman, 1985), BLAST (Altschul et al., 1990), HMMER (Eddy, 1998) and
ClustalW (Thompson et al., 1994) have been proposed to perform and accelerate sequence
alignment activities. An overview of these methods is given in (Hasan et al., 2007). Out
of these, S-W algorithm is an optimal sequence alignment method, but its computational
cost makes it inappropriate for practical purposes. To develop efficient and optimal
sequence alignment solutions, the S-W algorithm has recently been implemented on emerging
accelerator platforms such as Field Programmable Gate Arrays (FPGAs), Cell Broadband Engine
(Cell/B.E.) and Graphics Processing Units (GPUs) (Buyukkur & Najjar, 2008; Hasan et al., 2010;
Liu et al., 2009; 2010; Lu et al., 2008). This chapter aims at providing a broad overview of
sequence alignment in general with particular emphasis on the classification and discussion
of available methods and their comparison. Further, it reviews in detail the acceleration
approaches based on implementations on different platforms and provides a comparison
considering different parameters. This chapter is organized as follows:
The remainder of this section gives a classification, discussion and comparison of the available
methods and their hardware acceleration. Section 2 introduces the S-W algorithm which is
the focus of discussion in the succeeding sections. Section 3 reviews CPU-based acceleration.
Section 4 provides a review of FPGA-based acceleration. Section 5 overviews GPU-based
acceleration. Section 6 presents a comparison of accelerations on different platforms, whereas
Section 7 concludes the chapter.
1.1 Classification
Sequence alignment aims at identifying regions of similarity between two DNA or protein
sequences (the query sequence and the subject or database sequence). Traditionally, the
methods of pairwise sequence alignment are classified as either global or local, where pairwise
means considering only two sequences at a time. Global methods attempt to match as many
188
2 Computational Biology and AppliedWill-be-set-by-IN-TECH
Bioinformatics
characters as possible, from end to end, whereas local methods aim at identifying short
stretches of similarity between two sequences. However, in some cases, it might also be
needed to investigate the similarities between a group of sequences, hence multiple sequence
alignment methods are introduced. Multiple sequence alignment is an extension of pairwise
alignment to incorporate more than two sequences at a time. Such methods try to align all of
the sequences in a given query set simultaneously. Figure 1 gives a classification of various
available sequence alignment methods.
Sequence Alignment
Methods
N-W S-W
Dot plot FASTA BLAST HMMER ClustalW
algorithm algorithm
Global methods
Global methods aim at matching as many characters as possible, from end to end between
two sequences i.e. the query sequence (Q) and the database sequence (D). Methods carrying out
global alignment include dot plot and N-W algorithm. Both are categorized as exact methods.
The difference is that dot plot is based on a basic search method, whereas N-W on dynamic
programming (DP) (Giegerich, 2000).
Local methods
In contrast to global methods, local methods attempt to identify short stretches of similarity
between two sequences i.e. Q and D. These include exact method like S-W and heuristics
based approximate methods like FASTA and BLAST.
1.3 Comparison
The alignment methods can be compared on the basis of their temporal and spatial
complexities and parameters like alignment type and search procedure. A summary of the
comparison is shown in Table 1. It is interesting to note that all the global and local sequence
alignment methods essentially have the same computational complexity of O( L Q L D ), where
L Q and L D are the lengths of the query and database sequences, respectively. Yet despite this,
each of the algorithms has very different running times, with BLAST being the fastest and
dynamic programming algorithms being the slowest. In case of multiple sequence alignment
methods, ClustalW has the worst time complexity of O( L2Q L2D ), whereas HMMER has a
time complexity of O( L Q L2D ). The space complexities of all the alignment methods are also
essentially identical, around O( L Q L D ) space, except BLAST, the space complexity of which
is O(20w + L Q L D ). In the exact methods, dot plot uses a basic search method, whereas N-W
and S-W use DP. On the other hand, all the approximate methods are heuristic based. It
is also worthy to note that FASTA and BLAST have to make sacrifices on sensitivity to be
able to achieve higher speeds. Thus, a trade off exists between speed and sensitivity and we
must come to a compromise to be able to efficiently align sequences in a biologically relevant
manner in a reasonable amount of time.
Time Space
Method Type Accuracy Search
complexity complexity
Dot plot Global Exact Basic O( L Q L D ) O( L Q L D )
N-W Global Exact DP O( L Q L D ) O( L Q L D )
S-W Local Exact DP O( L Q L D ) O( L Q L D )
FASTA Local Approximate Heuristic O( L Q L D ) O( L Q L D )
BLAST Local Approximate Heuristic O( L Q L D ) O(20w + L Q L D )
HMMER Multiple Approximate Heuristic O( L Q L2D ) O( L Q L D )
ClustalW Multiple Approximate Heuristic O( L2Q L2D ) O( L Q L D )
CPUs
CPUs are well known, flexible and scalable architectures. By exploiting the Streaming
SIMD Extension (SSE) instruction set on modern CPUs, the running time of the analyses is
decreased significantly, thereby making analyses of data intensive problems like sequence
alignment feasible. Also emerging CPU technologies like multi-core combines two or more
independent processors into a single package. The Single Instruction Multiple Data-stream
(SIMD) paradigm is heavily utilized in this class of processors, making it appropriate for data
parallel applications like sequence alignment. SIMD describes CPUs with multiple processing
elements that perform the same operation on multiple data simultaneously. Thus, such
machines exploit data level parallelism. The SSE instruction set extension in modern CPUs
contains 70 new SIMD instructions. This extension greatly increases the performance when
exactly the same operations are to be performed on multiple data objects, making sequence
alignment a typical application.
190
4 Computational Biology and AppliedWill-be-set-by-IN-TECH
Bioinformatics
FPGAs
FPGAs are reconfigurable data processing devices on which an algorithm is directly mapped
to basic processing logic elements. To take advantage of using an FPGA, one has to implement
massively parallel algorithms on this reconfigurable device. They are thus well suited for
certain classes of bioinformatics applications, such as sequence alignment. Methods like the
ones based on systolic arrays are used to accelerate such applications.
GPUs
Initially stimulated by the need for real time graphics in video gaming, GPUs have evolved
into powerful and flexible vector processors, ideal for accelerating a variety of data parallel
applications. GPUs have in the last couple of years developed themselves from a fixed
function graphics processing unit into a flexible platform that can be used for high performance
computing (HPC). Applications like bioinformatics sequence alignment can run very efficiently
on these architectures.
2. Smith-Waterman algorithm
In 1981, Smith and Waterman described a method, commonly known as the Smith-Waterman
(S-W) algorithm (Smith & Waterman, 1981), for finding common regions of local similarity.
S-W method has been used as the basis for many subsequent algorithms, and is often quoted
as a benchmark when comparing different alignment techniques. When obtaining the local
S-W alignment, a matrix H is constructed using the following equation.
⎧
⎪
⎪ 0
⎨H
i −1,j −1 + Si,j
Hi,j = max (1)
⎪
⎪ H −d
⎩ i−1,j
Hi,j−1 − d
Where Si,j is the similarity score and d is the penalty for a mismatch. The algorithm can be
implemented using the following pseudo code.
Initialization:
H(0,j) = 0
H(i,0) = 0
Matrix Fill:
Traceback:
H(opt) = max(H(i,j))
traceback(H(opt))
An Overview
An Overview of Hardware-Based
of Hardware-based Acceleration
Acceleration of Biological of Biological Sequence Alignment
Sequence Alignment 1915
The H matrix is constructed with one sequence lined up against the rows of a matrix, and
another against the columns, with the first row and column initialized with a predefined value
(usually zero) i.e. if the sequences are of length M and N respectively, then the matrix for the
alignment algorithm will have ( M + 1) × ( N + 1) dimensions. The matrix fill stage scores
each cell in the matrix. This score is based on whether the two intersecting elements of each
sequence are a match, and also on the score of the cell’s neighbors to the left, above, and
diagonally upper left. Three separate scores are calculated based on all three neighbors, and
the maximum of these three scores (or a zero if a negative value would result) is assigned
to the cell. This is done for each cell in the matrix resulting in O( MN ) complexity for the
matrix fill stage. Even though the computation for each cell usually only consists of additions,
subtractions, and comparisons of integers, the algorithm would nevertheless perform very
poorly if the lengths of the query sequences become large. The traceback step starts at the cell
with the highest score in the matrix and ends at a cell when the similarity score drops below a
certain predefined threshold. For doing this, the algorithm requires to find the maximum cell
which is done by traversing the entire matrix, making the time complexity for the traceback
O( MN ). It is also possible to keep track of the cell with the maximum score, during the
matrix filling segment of the algorithm, although this will not change the overall complexity.
Thus, the total time complexity of the S-W algorithm is O( MN ). The space complexity is also
O( MN ).
In order to reduce the O( MN ) complexity of the matrix fill stage, multiple entries of the
H matrix can be calculated in parallel. This is however complicated by data dependencies,
whereby each Hi,j entry depends on the values of three neighboring entries Hi,j−1, Hi−1,j and
Hi−1,j−1 , with each of those entries in turn depending on the values of three neighboring
entries, which effectively means that this dependency extends to every other entry in the
region Hx,y : x ≤ i, y ≤ j. This implies that it is possible to simultaneously compute all the
elements in each anti-diagonal, since they fall outside each other’s data dependency regions.
Figure 2 shows a sample H matrix for two sequences, with the bounding boxes indicating the
elements that can be computed in parallel. The bottom-right cell is highlighted to show that
its data dependency region is the entire remaining matrix. The dark diagonal arrow indicates
the direction in which the computation progresses. At least 9 cycles are required for this
computation, as there are 9 bounding boxes representing 9 anti-diagonals and a maximum of
5 cells may be computed in parallel.
The degree of parallelism is constrained to the number of elements in the anti-diagonal and
the maximum number of elements that can be computed in parallel are equal to the number
of elements in the longest anti-diagonal (l d ), where,
ld = min( M, N ) (2)
Theoretically, the lower bound to the number of steps required to calculate the entries of
the H matrix in a parallel implementation of the S-W algorithm is equal to the number of
anti-diagonals required to reach the bottom-right element, i.e. M + N − 1 (Liao et al., 2004).
Figure 3 shows the logic circuit to compute an element of the H matrix. The logic contains
three adders, a sequence comparator circuit (SeqCmp) and three max operators (MAX). The
sequence comparator compares the corresponding characters of two input sequences and
outputs a match/mismatch score, depending on whether the two characters are equal or not.
Each max operator finds the maximum of its two inputs. The time to compute an element is
4 cycles, assuming that the time for each cycle is equal to the latency of one add or compare
operation.
192
6 Computational Biology and AppliedWill-be-set-by-IN-TECH
Bioinformatics
G A T T A
0 0 0 0 0 0
G 0 1 0 0 0 0
A 0 0 2 0 0 1
j C 0 0 0 1 0 0
T 0 0 0 1 2 0
C 0 0 0 0 0 1
Fig. 2. Sample H matrix, where the dotted rectangles show the elements that can be
computed in parallel
Hi,j
Cycle 4
MAX
Cycle 3
MAX
Cycle 2
MAX +
Cycle 1
Si,j
Seq
Cmp
+ +
Fig. 3. Logic circuit to compute cells in the H matrix, where + is an adder, MAX is a max
operator and SeqCmp is the sequence comparator that generates match/mismatch scores
An Overview
An Overview of Hardware-Based
of Hardware-based Acceleration
Acceleration of Biological of Biological Sequence Alignment
Sequence Alignment 1937
3. CPU-based acceleration
In this section CPU-based acceleration of the S-W algorithm is reviewed. Furthermore, an
estimation of the performance for top-end and future systems is made.
4. FPGA-based acceleration
FPGAs are programmable logic devices. To map an application on flexible FPGA platforms,
a program is written in a hardware description language like VHDL. The flexibility, difficulty
194
8 Computational Biology and AppliedWill-be-set-by-IN-TECH
Bioinformatics
N1 N2 N3 N4
M4M3M2M1
M1 U11 U12 U13 U14
U11
N1
M2 U21 U22 U23 U24
U12
N2
M3 U31 U32 U33 U34
U13
N3
M4 U41 U42 U43 U44
U14
N4
(a) Rectangular (2D) systolic array
Performance Performance
Reference FPGA Frequency PEs
(per FPGA) (per system)
Virtex2 180 1260
(Puttegowda et al., 2003) 7000 —
XC2V6000 MHz GCUPS
Virtex2 742
(Yu et al., 2003) — 4032 —
XCV1000-6 GCUPS
Virtex2 55 13.9
(Oliver et al., 2005) 252 —
XC2V6000 MHz GCUPS
Virtex2 112 54
(Gok & Yilmaz, 2006) 482 —
XC2V6000 MHz GCUPS
Stratix2 66.7 25.6
(Altera, 2007) 384 —
EP2S180 MHz GCUPS
200 24.1
(Cray, 2010) Virtex4 120 —
MHz GCUPS