
COMPUTATIONAL BIOLOGY AND APPLIED BIOINFORMATICS
Edited by Heitor Silvério Lopes and Leonardo Magalhães Cruz

Published by InTech
Janeza Trdine 9, 51000 Rijeka, Croatia

Copyright © 2011 InTech


All chapters are Open Access articles distributed under the Creative Commons
Non Commercial Share Alike Attribution 3.0 license, which permits to copy,
distribute, transmit, and adapt the work in any medium, so long as the original
work is properly cited. After this work has been published by InTech, authors
have the right to republish it, in whole or part, in any publication of which they
are the author, and to make other personal use of the work. Any republication,
referencing or personal use of the work must explicitly identify the original source.

Statements and opinions expressed in the chapters are those of the individual contributors
and not necessarily those of the editors or publisher. No responsibility is accepted
for the accuracy of information contained in the published articles. The publisher
assumes no responsibility for any damage or injury to persons or property arising out
of the use of any materials, instructions, methods or ideas contained in the book.

Publishing Process Manager Davor Vidic


Technical Editor Teodora Smiljanic
Cover Designer Jan Hyrat
Image Copyright Gencay M. Emin, 2010. Used under license from Shutterstock.com

First published August, 2011


Printed in Croatia

A free online edition of this book is available at www.intechopen.com


Additional hard copies can be obtained from [email protected]

Computational Biology and Applied Bioinformatics,
Edited by Heitor Silvério Lopes and Leonardo Magalhães Cruz
p. cm.
ISBN 978-953-307-629-4

Free online editions of InTech Books and Journals can be found at www.intechopen.com
Contents

Preface IX

Part 1 Reviews 1

Chapter 1 Molecular Evolution & Phylogeny: What, When, Why & How? 3
Pandurang Kolekar, Mohan Kale and Urmila Kulkarni-Kale

Chapter 2 Understanding Protein Function - The Disparity Between Bioinformatics and Molecular Methods 29
Katarzyna Hupert-Kocurek and Jon M. Kaguni

Chapter 3 In Silico Identification of Regulatory Elements in Promoters 47
Vikrant Nain, Shakti Sahi and Polumetla Ananda Kumar

Chapter 4 In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family 67
Kuo-Yuan Hwa, Wan-Man Lin and Boopathi Subramani

Chapter 5 MicroArray Technology - Expression Profiling of mRNA and MicroRNA in Breast Cancer 87
Aoife Lowery, Christophe Lemetre, Graham Ball and Michael Kerin

Chapter 6 Computational Tools for Identification of microRNAs in Deep Sequencing Data Sets 121
Manuel A. S. Santos and Ana Raquel Soares

Chapter 7 Computational Methods in Mass Spectrometry-Based Protein 3D Studies 133
Rosa M. Vitale, Giovanni Renzone, Andrea Scaloni and Pietro Amodeo

Chapter 8 Synthetic Biology & Bioinformatics Prospects in the Cancer Arena 159
Lígia R. Rodrigues and Leon D. Kluskens

Chapter 9 An Overview of Hardware-Based Acceleration of Biological Sequence Alignment 187
Laiq Hasan and Zaid Al-Ars

Part 2 Case Studies 203

Chapter 10 Retrieving and Categorizing Bioinformatics Publications through a MultiAgent System 205
Andrea Addis, Giuliano Armano, Eloisa Vargiu and Andrea Manconi

Chapter 11 GRID Computing and Computational Immunology 223
Ferdinando Chiacchio and Francesco Pappalardo

Chapter 12 A Comparative Study of Machine Learning and Evolutionary Computation Approaches for Protein Secondary Structure Classification 239
César Manuel Vargas Benítez, Chidambaram Chidambaram, Fernanda Hembecker and Heitor Silvério Lopes

Chapter 13 Functional Analysis of the Cervical Carcinoma Transcriptome: Networks and New Genes Associated to Cancer 259
Mauricio Salcedo, Sergio Juarez-Mendez, Vanessa Villegas-Ruiz, Hugo Arreola, Oscar Perez, Guillermo Gómez, Edgar Roman-Bassaure, Pablo Romero and Raúl Peralta

Chapter 14 Number Distribution of Transmembrane Helices in Prokaryote Genomes 279
Ryusuke Sawada and Shigeki Mitaku

Chapter 15 Classifying TIM Barrel Protein Domain Structure by an Alignment Approach Using Best Hit Strategy and PSI-BLAST 287
Chia-Han Chu, Chun Yuan Lin, Cheng-Wen Chang, Chihan Lee and Chuan Yi Tang

Chapter 16 Identification of Functional Diversity in the Enolase Superfamily Proteins 311
Kaiser Jamil and M. Sabeena

Chapter 17 Contributions of Structure Comparison Methods to the Protein Structure Prediction Field 329
David Piedra, Marco d'Abramo and Xavier de la Cruz

Chapter 18 Functional Analysis of Intergenic Regions for Gene Discovery 345
Li M. Fu

Chapter 19 Prediction of Transcriptional Regulatory Networks for Retinal Development 357
Ying Li, Haiyan Huang and Li Cai

Chapter 20 The Use of Functional Genomics in Synthetic Promoter Design 375
Michael L. Roberts

Chapter 21 Analysis of Transcriptomic and Proteomic Data in Immune-Mediated Diseases 397
Sergey Bruskin, Alex Ishkin, Yuri Nikolsky, Tatiana Nikolskaya and Eleonora Piruzian

Chapter 22 Emergence of the Diversified Short ORFeome by Mass Spectrometry-Based Proteomics 417
Hiroko Ao-Kondo, Hiroko Kozuka-Hata and Masaaki Oyama

Chapter 23 Acrylamide Binding to Its Cellular Targets: Insights from Computational Studies 431
Emmanuela Ferreira de Lima and Paolo Carloni
Preface

Nowadays it is difficult to imagine an area of knowledge that can continue developing without the use of computers and informatics. Biology is no different: it has seen unprecedented growth in recent decades, with the rise of a new discipline, bioinformatics, bringing together molecular biology, biotechnology and information technology. More recently, the development of high-throughput techniques, such as microarrays, mass spectrometry and DNA sequencing, has increased the need for computational support to collect, store, retrieve, analyze, and correlate huge data sets of complex information. At the same time, the growth of computational power for processing and storage has also increased the need for deeper knowledge in the field. The development of bioinformatics has now allowed the emergence of systems biology, the study of the interactions between the components of a biological system and of how these interactions give rise to the function and behavior of a living being.

Bioinformatics is a cross-disciplinary field, and its birth in the sixties and seventies depended on discoveries and developments in different fields, such as: the double-helix model of DNA proposed by Watson and Crick in 1953 from X-ray data obtained by Franklin and Wilkins; the development of a method to solve the phase problem in protein crystallography by Perutz's group in 1954; the sequencing of the first protein by Sanger in 1955; the creation of the ARPANET, linking UCLA and the Stanford Research Institute, in 1969; the publication of the Needleman-Wunsch algorithm for sequence comparison in 1970; the first recombinant DNA molecule, created by Paul Berg and his group in 1972; the announcement of the Brookhaven Protein DataBank in 1973; the establishment of Ethernet by Robert Metcalfe in the same year; and the concept of computer networks and the development of the Transmission Control Protocol (TCP) by Vint Cerf and Robert Kahn in 1974, to cite just some of the landmarks that allowed the rise of bioinformatics. Later, the Human Genome Project (HGP), started in 1990, was also very important in pushing the development of bioinformatics and of related methods for the analysis of large amounts of data.

This book presents some theoretical issues, reviews, and a variety of bioinformatics applications. For better understanding, the chapters were grouped into two parts. It was not an easy task to assign chapters to these parts, since most chapters provide a mix of review and case study, and all chapters also contain extensive biological and computational information. In Part I, the chapters are more oriented towards literature review and theoretical issues. Part II consists of application-oriented chapters that report case studies in which a specific biological problem is treated with bioinformatics tools.

Molecular phylogeny analysis has become a routine technique not only to understand
the sequence-structure-function relationship of biomolecules but also to assist in their
classification. The first chapter of Part I, by Kolekar et al., presents the theoretical basis and fundamentals of phylogenetic analysis and describes the steps and methods used in such analyses.

Methods for studying protein function and gene expression are briefly reviewed in Hupert-Kocurek and Kaguni's chapter, and contrasted with the traditional approach of mapping a gene via the phenotype of a mutation and deducing the function of the gene product from its biochemical analysis in concert with physiological studies. An example of an experimental approach is provided that expands the current understanding of the role of ATP binding and its hydrolysis by DnaC during the initiation of DNA replication. This is contrasted with approaches that yield large sets of data, providing a different perspective on understanding the functions of sets of genes or proteins and how they act in a network of biochemical pathways of the cell.

Due to the importance of transcriptional regulation, one of the main goals in the post-
genomic era is to predict how the expression of a given gene is regulated based on the
presence of transcription factor binding sites in the adjacent genomic regions. Nain et
al. review different computational approaches for modeling and identification of
regulatory elements, as well as recent advances and the current challenges.

In Hwa et al., an approach is proposed to group proteins into putative functional


groups by designing a workflow with appropriate bioinformatics analysis tools, to
search for sequences with biological characteristics belonging to the selected protein
family. To illustrate the approach, the workflow was applied to the LARGE-like protein family.

Microarray technology has become one of the most important technologies for
unveiling gene expression profiles, thus fostering the development of new
bioinformatics methods and tools. In the chapter by Lowery et al. a thorough review of
microarray technology is provided, with special focus on mRNA and microRNA
profiling of breast cancer.

MicroRNAs are a class of small RNAs of approximately 22 nucleotides in length that


regulate eukaryotic gene expression at the post-transcriptional level. Santos and Soares
present several tools and computational pipelines for miRNA identification, discovery
and expression from sequencing data.

Currently, mass spectrometry-based methods represent very important and flexible tools for studying the dynamic features of proteins and their complexes. Such high-resolution methods are especially used for characterizing critical regions of the systems under investigation. Vitale et al. present a thorough review of mass spectrometry and the related computational methods for studying the three-dimensional structure of proteins.

Rodrigues and Kluskens review synthetic biology approaches for the development of
alternatives for cancer diagnosis and drug development, providing several application examples and pointing to challenging research directions.

Biological sequence alignment is an important and widely used task in bioinformatics. It provides valuable and accurate information both in basic research and in the daily work of the molecular biologist. The well-known Smith-Waterman algorithm is an optimal sequence alignment method, but it is computationally expensive for large instances. This has fostered the research and development of specialized hardware platforms that accelerate biological data analyses based on that algorithm. Hasan and Al-Ars provide a thorough discussion and comparison of the available methods and hardware implementations for sequence alignment on different platforms.

Exciting and updated issues are presented in Part II, where theoretical bases are
complemented with case studies, showing how bioinformatics analysis pipelines were
applied to answer a variety of biological issues.

In recent years we have witnessed an exponential growth of biological data and scientific articles. Consequently, retrieving and categorizing documents has become a challenging task. The second part of the book starts with the chapter by Addis et al., who propose a multiagent system for retrieving and categorizing bioinformatics publications, with special focus on the information extraction task and the adopted hierarchical text categorization technique.

Computational immunology is a field of science that encompasses high-throughput


genomic and bioinformatic approaches to immunology. On the other hand, grid
computing is a powerful alternative for solving problems that are computationally
intensive. Pappalardo and Chiacchio present two studies in which computational immunology approaches are implemented on a grid infrastructure: modeling atherosclerosis and searching for an optimal vaccination protocol against mammary carcinoma.

Despite the growing number of proteins discovered as a by-product of the many genome sequencing projects, only a small fraction of them have a known three-dimensional structure. A possible way to infer the full structure of an unknown protein is to identify potential secondary structures in it. Chidambaram et al. compare the performance of several machine learning and evolutionary computation methods for the classification of the secondary structure of proteins, starting from their primary structure.

Cancer is one of the most important public health problems worldwide, and breast and cervical cancer are the most frequent in the female population. Salcedo et al. present a study on the functional analysis of the cervical carcinoma transcriptome, with a focus on methods for unveiling networks and finding new genes associated with cervical cancer.

In Sawada and Mitaku, the number distribution of transmembrane helices is investigated to show that it is a feature under natural selection in prokaryotes; simulation data show how membrane proteins with a high number of transmembrane helices would disappear under random mutation.

In Chu et al., an alignment approach using the pure best hit strategy is proposed to
classify TIM barrel protein domain structures in terms of the superfamily and family
categories with high accuracy.

Jamil and Sabeena use classic bioinformatics tools, such as ClustalW for multiple sequence alignment, the SCI-PHY server for superfamily determination, ExPASy tools for pattern matching, and visualization software for residue recognition and functional elucidation, to determine the functional diversity of the enolase enzyme superfamily.

Quality assessment of structure predictions is an important problem in bioinformatics, because quality determines the application range of predictions. Piedra et al. briefly review some applications of structure comparison methods in the protein structure prediction field, where they are used to evaluate overall prediction quality, and show how these methods can also be used to identify the more reliable parts of "de novo" models and how this information can help to refine and improve these models.

In Fu, a new method is presented that explores potential genes in intergenic regions of
an annotated genome on the basis of their gene expression activity. The method was
applied to the M. tuberculosis genome where potential protein-coding genes were
found, based on bioinformatics analysis in conjunction with transcriptional evidence
obtained using the Affymetrix GeneChip. The study revealed potential genes in the
intergenic regions, such as a DNA-binding protein of the CopG family and a nickel-binding GTPase, as well as hypothetical proteins.

Cai et al. present a new method for developmental studies. It combines experimental
studies and computational analysis to predict the trans-acting factors and
transcriptional regulatory networks for mouse embryonic retinal development.

The chapter by Roberts shows how advances in bioinformatics can be applied to the development of improved therapeutic strategies. The chapter describes how functional genomics experimentation and bioinformatics tools could be applied to the design of synthetic promoters for therapeutic and diagnostic applications or adapted across the biotech industry. Designed synthetic gene promoters can then be incorporated in novel gene transfer vectors to promote safer and more efficient expression of therapeutic genes for the treatment of various pathological conditions. Tools used to analyze data obtained from large-scale gene expression analyses, which are subsequently used in the smart design of synthetic promoters, are also presented.

Bruskin et al. describe how candidate genes commonly involved in psoriasis and Crohn's disease were detected using lists of differentially expressed genes from microarray experiments with different numbers of probes. These genes code for proteins that are particular targets for elaborating new approaches to treating these pathologies. A comprehensive meta-analysis of proteomics and transcriptomics data of psoriatic lesions from independent studies is performed. Network-based analysis revealed similarities in regulation at both the proteomic and transcriptomic levels.

Some eukaryotic mRNAs have multiple ORFs and are recognized as polycistronic mRNAs. One of the well-known extra ORFs is the upstream ORF (uORF), which functions as a regulator of mRNA translation. In Ao-Kondo et al., this issue is addressed: an introduction to the mechanism of translation initiation and to the functional roles of uORFs in translational regulation is given, followed by a review of how the authors identified novel small proteins with mass spectrometry and a discussion of the progress of bioinformatics analyses for elucidating the diversification of short coding regions defined by the transcriptome.

Acrylamide might feature toxic properties, including neurotoxicity and


carcinogenicity in both mice and rats, but no consistent effect on cancer incidence in
humans could be identified. In the chapter written by Lima and Carloni, the authors
report the use of bioinformatics tools, by means of molecular docking and molecular
simulation procedures, to predict and explore the structural determinants of
acrylamide and its derivative in complex with all of their known cellular target
proteins in humans and mice.

Professor Heitor Silvério Lopes


Bioinformatics Laboratory, Federal University of Technology – Paraná,
Brazil

Professor Leonardo Magalhães Cruz


Biochemistry Department, Federal University of Paraná,
Brazil
Part 1
Reviews

Chapter 1
Molecular Evolution & Phylogeny: What, When, Why & How?

Pandurang Kolekar1, Mohan Kale2 and Urmila Kulkarni-Kale1
1Bioinformatics Centre, University of Pune
2Department of Statistics, University of Pune
India

1. Introduction
The endeavour to classify organisms and study their evolution, pioneered by Linnaeus and Darwin on the basis of morphological and behavioural features,
is now being propelled by the availability of molecular data. The field of evolutionary
biology has experienced a paradigm shift with the advent of sequencing technologies and
availability of molecular sequence data in the public domain databases. The post-genomic
era provides unprecedented opportunities to study the process of molecular evolution,
which is marked with the changes organisms acquire and inherit. The species are
continuously subjected to evolutionary pressures and evolve suitably. These changes are
observed in terms of variations in the sequence data that are collected over a period of time.
Thus, the molecular sequence data archived in various databases are the snapshots of the
evolutionary process and help to decipher the evolutionary relationships of genes/proteins
and genomes/proteomes for a group of organisms. It is known that the individual genes
may evolve with varying rates and the evolutionary history of a gene may or may not
coincide with the evolution of the species as a whole. One should always refrain from
discussing the evolutionary relationship between organisms when analyses are performed
using limited/partial data. A thorough understanding of the principles and methods of phylogeny helps users not only to use the available software packages efficiently, but also to make appropriate choices of analysis methods and parameters, so that the gain from the huge amount of available sequence data can be maximized.
Compared to classical phylogeny based on morphological data, molecular phylogeny has distinct advantages; for instance, it is based on sequences (as discrete characters), unlike morphological data, which are qualitative in nature. While the tree of life is depicted as having three major branches, bacteria, archaea and eukaryotes (it excludes viruses), trees based on molecular data account for the process of evolution of bio-macromolecules (DNA,
trees’, which present a hypothesized version of what might have happened in the process of
evolution using the available data and a model. Therefore, many trees can be generated
using a dataset and each tree conveys a story of evolution. The two main types of
information inherent in any phylogenetic tree are the topology (branching pattern) and the
branch lengths.

Before getting into the actual process of molecular phylogeny analysis (MPA), it will be
helpful to get familiar with the concepts and terminologies frequently used in MPA.
Phylogenetic tree: A two-dimensional graph depicting nodes and branches that illustrates
evolutionary relationships between molecules or organisms.
Nodes: The points that connect branches and usually represent the taxonomic units.
Branches: A branch (also called an edge) connects any two nodes. It is an evolutionary
lineage between or at the end of nodes. Branch length represents the number of
evolutionary changes that have occurred in between or at the end of nodes. Trees with
uniform branch length (cladograms), branch lengths proportional to the changes or distance
(phylograms) are derived based on the purpose of analysis.
Operational taxonomic units (OTUs): The known external/terminal nodes in the phylogenetic tree are termed OTUs.
Hypothetical taxonomic units (HTUs): The internal nodes in the phylogenetic tree that are
treated as common ancestors to OTUs. An internal node is said to be bifurcating if it has
only two immediate descendant lineages or branches. Such trees are also called binary or
dichotomous as any dividing branch splits into two daughter branches. A tree is called a
‘multifurcating’ or ‘polytomous’ if any of its nodes splits into more than two immediate
descendants.
Monophyletic: A group of OTUs derived from a single common ancestor and containing all the descendants of that ancestor.
Polyphyletic: A group of OTUs that are derived from more than one common ancestor.
Paraphyletic: A group of OTUs derived from a common ancestor but not including all the descendants of the most recent common ancestor.
Clade: A monophyletic group of related OTUs containing all the descendants of the
common ancestor along with the ancestor itself.
Ingroup: A monophyletic group of all the OTUs that are of primary interest in the
phylogenetic study.
Outgroup: One or more OTUs that are phylogenetically outside the ingroup and known to
have branched off prior to the taxa included in a study.
Cladogram: The phylogenetic tree with branches having uniform lengths. It only depicts the
relationship between OTUs and does not help estimate the extent of divergence.
Phylogram: The phylogenetic tree with branches having variable lengths that are
proportional to evolutionary changes.
Species tree: The phylogenetic tree representing the evolutionary pathways of species.
Gene tree: The phylogenetic tree reconstructed using a single gene from each species. The
topology of the gene tree may differ from ‘species tree’ and it may be difficult to reconstruct
a species tree from a gene tree.
Unrooted tree: It illustrates the network of relationship of OTUs without the assumption of
common ancestry. Most trees generated using molecular data are unrooted and they can be
rooted subsequently by identifying an outgroup. The total number of bifurcating unrooted trees for n OTUs can be derived using the equation Nu = (2n-5)!/[2^(n-3) (n-3)!].
Rooted tree: An unrooted phylogenetic tree can be rooted with outgroup species, as a
common ancestor of all ingroup species. It has a defined origin with a unique path to each
ingroup species from the root. The total number of bifurcating rooted trees can be calculated using the formula Nr = (2n-3)!/[2^(n-2) (n-2)!] (Cavalli-Sforza & Edwards, 1967). The concept of unrooted and rooted trees is illustrated in Fig. 1.

Fig. 1. Sample rooted and unrooted phylogenetic trees drawn using 5 OTUs. The external and internal nodes are labelled with letters and Arabic numerals respectively. Note that the rooted and unrooted trees shown here are only one of the many possible trees (105 rooted and 15 unrooted) that can be obtained for 5 OTUs.
The MPA typically involves the following steps:
• Definition of problem and motivation to carry out MPA
• Compilation and curation of homologous sequences of nucleic acids or proteins
• Multiple sequence alignments (MSA)
• Selection of suitable model(s) of evolution
• Reconstruction of phylogenetic tree(s)
• Evaluation of tree topology
A brief account of each of these steps is provided below.

2. Definition of problem and motivation to carry out MPA


Just like any scientific experiment, it is necessary to define the objective of MPA to be carried
out using a set of molecular sequences. MPA has found diverse applications, which include the classification of organisms, DNA barcoding, subtyping of viruses, the study of the co-evolution of genes and proteins, estimation of the divergence times of species, the study of the development of pandemics and of patterns of disease transmission, parasite-vector-host relationships, etc. Typical biological investigations in which MPA constitutes a major part of the analyses are listed here. A virus is isolated during an epidemic: is it a new virus or an isolate of a known one? Can a genotype/serotype be assigned to this isolate just by using the molecular sequence data? A few strains of a bacterium are resistant to a drug and a few are sensitive: what and where are the changes that are responsible for such a property? How does one choose, amongst those available, the attenuated strains such that protection will be offered against most of the wild-type strains of a given virus? Thus, in short, the objective of the MPA plays a vital role in deciding the strategy for the selection of candidate sequences and the adoption of the appropriate phylogenetic methods.

3. Compilation and curation of homologous sequences


The next step in MPA is the compilation, from the available sequence resources, of the nucleic acid or protein sequences appropriate for validating the hypothesis using MPA.
At this stage, it is necessary to collate the dataset consisting of homologous sequences with the
appropriate coverage of OTUs and outgroup sequences, if needed. Care should be taken to
select the equivalent regions of sequences having comparable lengths (± 30 bases or amino
acids) to avoid the subsequent errors associated with incorrect alignments leading to incorrect
sampling of dataset, which may result in erroneous tree topology. Length differences of >30
might result in insertion of gaps by the alignment programs, unless the gap opening penalty is
suitably modified. Many comprehensive primary and derived databases of nucleic acid and protein sequences are available in the public domain, some of which are listed in Table 1. The database issue published by the journal 'Nucleic Acids Research' (NAR) in January every year is a useful resource for existing as well as upcoming databases. These databases can be queried using 'text-based' or 'sequence-based' database searches.

Nucleotide databases:
GenBank: http://www.ncbi.nlm.nih.gov/genbank/ (Benson et al., 2011)
EMBL: http://www.ebi.ac.uk/embl/ (Leinonen et al., 2011)
DDBJ: http://www.ddbj.nig.ac.jp/ (Kaminuma et al., 2011)

Protein databases:
GenPept: http://www.ncbi.nlm.nih.gov/protein (Sayers et al., 2011)
Swiss-Prot: http://expasy.org/sprot/ (The UniProt Consortium, 2011)
UniProt: http://www.uniprot.org/ (The UniProt Consortium, 2011)

Derived databases:
RDP: http://rdp.cme.msu.edu/ (Cole et al., 2009)
HIV: http://www.hiv.lanl.gov/content/index (Kuiken et al., 2009)
HCV: http://www.hcvdb.org/

Table 1. List of some of the commonly used nucleotide, protein and molecule-/species-specific databases.
Text-based queries are supported by search engines, viz. Entrez and SRS, which are available at NCBI and EBI respectively. The list of hits returned by such searches needs to be curated very carefully to ensure that the data correspond to the gene/protein of interest and are devoid of partial sequences. It is advisable to refer to the feature-table section of every entry to ensure that the data are extracted correctly and correspond to the region of interest. Sequence-based searches involve querying the databases using a sequence as a probe and are routinely used to compile a set of homologous sequences. Once the sequences are compiled in FASTA or another format, as per the input requirements of the MPA software, the sequences are usually assigned unique identifiers to facilitate their identification and comparison in the phylogenetic trees. If the sequences possess any ambiguous characters or low-complexity regions, these could be carefully removed, as they do not contribute to the evolutionary analysis. The presence of such regions might create problems in alignment, as it could lead to equiprobable alternate solutions to a 'local alignment' as part of a global alignment. Such regions possess too 'low' an information content to favour one tree topology over another. A poor input dataset interferes with the analysis and interpretation of the MPA. Thus, compilation of well-curated sequences for the problem at hand plays a crucial role in MPA.
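As an illustration of this compilation step, the sketch below uses Biopython's Entrez utilities to fetch a set of records in FASTA format and relabel them with short unique identifiers. It is an added example, not part of the original chapter, and the accession numbers and e-mail address are placeholders.

```python
from Bio import Entrez, SeqIO

Entrez.email = "[email protected]"           # NCBI requires a contact address (placeholder)
accessions = ["ACC00001", "ACC00002"]          # hypothetical accession numbers

# fetch the nucleotide records in FASTA format from GenBank via Entrez
handle = Entrez.efetch(db="nucleotide", id=",".join(accessions),
                       rettype="fasta", retmode="text")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()

# assign short unique identifiers, as suggested above, before writing the MPA input file
for i, rec in enumerate(records, start=1):
    rec.id = "OTU%02d" % i
SeqIO.write(records, "dataset.fasta", "fasta")
```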
The concept of homology is central to MPA. Sequences are said to be homologous if they share
a common ancestor and are evolutionarily related. Thus, homology is a qualitative description
of the relationship and the term %homology has no meaning. However, supporting data for
deducing homology comes from the extent of sequence identity and similarity, both of which
are quantitative terms and are expressed in terms of percentage.
The homologous sequences are grouped into three types, viz. orthologs (the same gene in different species), paralogs (genes that originated from duplication of an ancestral gene within a species) and xenologs (genes that have been horizontally transferred between species). Orthologous protein sequences are known to fold into similar three-dimensional shapes and to carry out similar functions, for example, haemoglobin alpha in horse and human. Paralogous sequences are copies of ancestral genes evolving within the species such that nature can implement a modified function, for example, haemoglobin alpha and beta in horse. Xenologs and horizontal transfer events are extremely difficult to prove on the basis of sequence comparison alone, and additional experimental evidence is needed to support and validate the hypothesis. The concepts of sequence alignments, similarity and homology are extensively reviewed by Phillips (2006).

4. Multiple sequence alignments (MSA)


MSA is one of the most common and critical steps of classical MPA. The objective of MSA is to juxtapose the nucleotide or amino acid residues in the selected dataset of homologous sequences such that the residues in a column of the MSA can be used to derive the sequence of the common ancestor. MSA algorithms try to maximize the matching residues in the given set of sequences under a pre-defined scoring scheme. The MSA produces a matrix of characters with species in the rows and character sites in the columns. It also introduces gaps, simulating the events of insertions and deletions (also called indels). Insertion of gaps also makes the lengths of all sequences the same for the sake of comparison. MSA algorithms generally produce reliable alignments above a threshold value of detectable sequence similarity. The alignment accuracy is observed to decrease when sequence similarity drops below 35%, towards the twilight (<35% but >25%) and moonlight (<25%) zones of similarity.
The character matrix obtained in MSA reveals the pattern of conservation and variability
across the species, which in turn reveals the motifs and the signature sequences shared by
species to retain the fold and function. The analysis of variations can be gainfully used to
identify the changes that explain functional and phenotypic variability, if any, across OTUs.
Many algorithms have been specially developed for MSA and subsequently improved to achieve higher accuracy. One popular heuristics-based MSA approach follows the progressive alignment procedure, in which sequences are first compared in a pairwise fashion to build a distance matrix containing percent identity values. A clustering algorithm is then applied to the distance matrix to generate a guide tree. The algorithm then follows the guide tree to add the pairwise alignments together, starting from the leaves towards the root. This ensures that the sequences with higher similarity are aligned first and that distantly related sequences are progressively added to the alignment of already aligned sequences. Thus, the gaps inserted early on are always retained. A suitable scoring function (sum-of-pairs, consensus, consistency-based, etc.) is employed to derive the optimum MSA (Nicholas et al., 2002; Batzoglou, 2005). Most MSA packages use the Needleman and Wunsch (1970) algorithm to compute pairwise sequence similarity. ClustalW is the most widely used MSA package (Thompson et al., 1994). Recently, many alternative MSA algorithms have also been developed; these are listed in Table 2. Standard benchmark datasets are used for the comparative assessment of the alternative approaches (Aniba et al., 2010; Thompson et al., 2011). Irrespective of the proven performance of MSA methods for individual genes and proteins, some challenges and issues regarding the computational aspects of handling genomic data are still causes of concern (Kemena & Notredame, 2009).
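A minimal sketch of the first step of this progressive procedure, building a percent-identity distance matrix, is given below (Python; added illustration, not from the original chapter). For simplicity the toy sequences are assumed to be pre-aligned and of equal length, whereas real progressive aligners compute the pairwise alignments themselves.

```python
seqs = {"seq1": "ATGCTAGCTA",    # toy sequences, already aligned
        "seq2": "ATGCTAGCTT",
        "seq3": "ATGGTAGCTA"}

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences carry the same residue."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

names = list(seqs)
dist = {}
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        # a simple distance (100 - percent identity); this matrix feeds the guide-tree step
        dist[(n1, n2)] = 100.0 - percent_identity(seqs[n1], seqs[n2])
print(dist)
```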

Alignment programs:
ClustalW (progressive): http://www.ebi.ac.uk/Tools/msa/clustalw2/ (Thompson et al., 1994)
MUSCLE (progressive/iterative): http://www.ebi.ac.uk/Tools/msa/muscle/ (Edgar, 2004)
T-COFFEE (progressive): http://www.ebi.ac.uk/Tools/msa/tcoffee/ (Notredame et al., 2000)
DIALIGN2 (segment-based): http://bibiserv.techfak.uni-bielefeld.de/dialign/ (Morgenstern et al., 1998)
MAFFT (progressive/iterative): http://mafft.cbrc.jp/alignment/software/ (Katoh et al., 2005)

Alignment visualization programs:
BioEdit*: http://www.mbio.ncsu.edu/bioedit/bioedit.html (Hall, 1999)
MEGA5: http://www.megasoftware.net/ (Kumar et al., 2008)
DAMBE: http://dambe.bio.uottawa.ca/dambe.asp (Xia & Xie, 2001)
CINEMA5: http://aig.cs.man.ac.uk/research/utopia/cinema (Parry-Smith et al., 1998)
*: Not updated since 2008, but the last version is available for use.

Table 2. List of commonly used multiple sequence alignment programs and visualization tools.
The MSA output can also be visualized and edited, if required, with software such as BioEdit, DAMBE, etc. The multiple alignment output shows the conserved and variable sites; residues are usually colour coded for ease of visualisation, identification and analysis. The character sites in an MSA can be divided into conserved sites (all sequences have the same residue or base), variable-non-informative (singleton) sites and variable-informative sites. Sites containing gaps in all or the majority of the species are of no importance from the evolutionary point of view and are usually removed from the MSA while converting the MSA data into input data for MPA. A sample MSA is shown in Fig. 2, in which the sequences of the surface hydrophobic (SH) protein from various genotypes (A to M) of Mumps virus are aligned. A careful visual inspection of an MSA allows us to locate patterns and motifs (LLLXIL in this case) in a given set of sequences. Apart from MPA, the MSA data can in turn be used for the construction of a position-specific scoring matrix (PSSM), the generation of a consensus sequence, sequence logos, the identification and prioritisation of potential B- and T-cell epitopes, etc. Nowadays, databases of curated, pre-computed alignments of reference species are also being made available; these can be used for benchmarking and evaluation purposes (Thompson et al., 2011) and also help to keep track of the changes that accumulate in species over a period of time. For example, in the case of viruses, observed changes are correlated with the emergence of new genotypes (Kulkarni-Kale et al., 2004; Kuiken et al., 2005).

Fig. 2. The complete multiple sequence alignment of the surface hydrophobic (SH) proteins
of Mumps virus genotypes (A to M) carried out using ClustalW. The MSA is viewed using
BioEdit. The species labels in the leftmost column begin with genotype letter (A-M) followed
by GenBank accession numbers. The scale for the position in alignment is given at the top of
the alignment. The columns with conserved residues are marked with an “*” in the last row.

5. Selection of a suitable model of evolution


The existing MPA methods utilize mathematical models to describe sequence evolution by incorporating biological, biochemical and evolutionary considerations. These mathematical models are used to compute genetic distances between sequences. The use of an appropriate model of evolution and of statistical tests helps us to infer the maximum evolutionary information from the sequence data. Thus, the selection of the right model of sequence evolution is an important part of effective MPA. Two types of approaches are adopted for building models: the first is empirical, i.e. using the properties revealed through comparative studies of large datasets of observed sequences, and the other is parametric, which uses biological and biochemical knowledge about the nucleic acid and protein sequences, for example the favoured substitution patterns of residues. Parametric models obtain their parameters from the MSA dataset under study. Both types of approaches result in models based on a Markov process, in the form of a matrix representing the rates of all possible transitions between the types of residues (4 nucleotides in nucleic acids and 20 amino acids in proteins). According to the type of sequence (nucleic acid or protein), two categories of models have been developed.

5.1 Models of nucleotide substitution


The nucleotide substitution models are based on the parametric approach and use mainly three parameters: i) nucleotide frequencies, ii) rates of nucleotide substitution and iii) rate heterogeneity. Nucleotide frequencies account for compositional sequence constraints such as GC content. These are subsequently used in a model to allow substitutions of a certain type to be more likely than others. The nucleotide substitution parameter is used to represent a measure of biochemical similarity: the higher the similarity between the nucleotide bases, the higher the rate of substitution between them; for example, transitions are more frequent than transversions. A rate heterogeneity parameter accounts for the unequal rates of substitution across the variable sites, which can be correlated with the constraints of the genetic code, selection for gene function, etc. The site variability is modelled by a gamma distribution of rates across sites. The shape parameter of the gamma distribution determines the amount of heterogeneity among sites: larger values of the shape parameter give a bell-shaped distribution, suggesting little or no rate variation across sites, whereas small values give a J-shaped distribution, indicating high rate variation among sites along with low rates of evolution at many sites.
A variety of nucleotide substitution models have been developed with sets of assumptions and parameters as described above. Some of the well-known models of nucleotide substitution include the Jukes-Cantor (JC) one-parameter model (Jukes & Cantor, 1969), the Kimura two-parameter model (K2P) (Kimura, 1980), Tamura's model (Tamura, 1992) and the Tamura and Nei model (Tamura & Nei, 1993). These models make use of different biological properties, such as transitions, transversions and G+C content, to compute distances between nucleotide sequences. The substitution patterns of nucleotides for some of these models are shown in Fig. 3.
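To illustrate how such models convert observed differences into evolutionary distances, the sketch below (Python; added example, not part of the original chapter) implements the standard Jukes-Cantor and Kimura two-parameter corrections for a pair of aligned sequences; the toy sequences are invented.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k2p_distance(P, Q):
    """Kimura two-parameter distance; P and Q are the transition and transversion proportions."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)

def observed_proportions(seq1, seq2):
    """Proportions of transition (A<->G, C<->T) and transversion differences between aligned sequences."""
    transitions = {frozenset("AG"), frozenset("CT")}
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a != b:
            if frozenset((a, b)) in transitions:
                ts += 1
            else:
                tv += 1
    return ts / len(seq1), tv / len(seq1)

P, Q = observed_proportions("ACGTACGTAC", "ACATACGCAC")   # toy aligned pair
print(jc_distance(P + Q), k2p_distance(P, Q))
```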

5.2 Models of amino acid replacement


In contrast to nucleotide substitution models, amino acid replacement models are developed using an empirical approach. Schwarz and Dayhoff (1979) developed the most widely used model of protein evolution, in which the replacement matrix was obtained from the alignment of globular protein sequences with 15% divergence. The Dayhoff matrices, known as PAM matrices, are also used by database searching methods. A similar methodology was adopted by other model developers, but with specialized databases. Jones et al. (1994) derived a replacement matrix specifically for membrane proteins, which has values significantly different from the Dayhoff matrix, reflecting the remarkably different pattern of amino acid replacements observed in membrane proteins. Thus, such a matrix will be more appropriate for the phylogenetic study of membrane proteins. On the other hand, Adachi and Hasegawa (1996) obtained a replacement matrix using mitochondrial proteins from 20 vertebrate species, which can be effectively used for mitochondrial protein phylogeny. Henikoff and Henikoff (1992) derived the series of BLOSUM matrices using local, ungapped alignments of distantly related sequences. The BLOSUM matrices are more widely used in similarity searches against databases than in phylogenetic analyses.

Fig. 3. The types of substitutions in nucleotides. α denotes the rate of transitions and β
denotes the rate of transversions. For example, in the case of JC model α=β while in the case
of K2P model α>β.
Recently, structural constraints of nucleic acids and proteins have also been incorporated into models of evolution. For example, Rzhetsky (1995) contributed a model to estimate substitution patterns in ribosomal RNA genes that takes into account secondary structure elements such as the stem-loops in ribosomal RNAs. Another approach introduced a model combining protein secondary structure and amino acid replacement (Lio & Goldman, 1998; Thorne et al., 1996). An overview of different models of evolution and of the criteria for model selection is provided by Lio & Goldman (1998) and Luo et al. (2010).

6. Reconstruction of a phylogenetic tree


The phylogeny reconstruction methods result in a phylogenetic tree, which may or may not correspond to the true phylogenetic tree. The various methods of phylogeny reconstruction are divided into two major groups, viz. character-based and distance-based. Character-based methods use a set of discrete characters; for example, in the case of MSA data of nucleotide sequences, each position in the alignment is referred to as a "character" and the nucleotide (A, T, G or C) present at that position is called the "state" of that "character". All such characters are assumed to evolve independently of each other and are analysed separately. Distance-based methods, on the other hand, use some form of distance measure to compute the dissimilarity between pairs of OTUs, which results in the derivation of a distance matrix that is given as input to clustering methods like Neighbor-Joining (N-J) and the Unweighted Pair Group Method with Arithmetic mean (UPGMA) to infer the phylogenetic tree. The character-based and distance-based methods follow an exhaustive search and/or a stepwise clustering approach to arrive at an optimum phylogenetic tree, which explains the evolutionary pattern of the OTUs under study. The exhaustive search method examines, in theory, all possible tree topologies for a chosen number of species and derives the best tree topology using certain criteria. Table 3 shows the possible number of rooted and unrooted trees for n species/OTUs.

Number of OTUs    Number of unrooted trees    Number of rooted trees
2                 1                           1
3                 1                           3
4                 3                           15
5                 15                          105
6                 105                         945
10                2027025                     34459425

Table 3. The number of possible rooted and unrooted trees for a given number of OTUs. The number of possible unrooted trees for n OTUs is given by (2n-5)!/[2^(n-3) (n-3)!], and the number of rooted trees by (2n-3)!/[2^(n-2) (n-2)!].
Stepwise clustering methods, on the other hand, employ an algorithm that begins with the clustering of the most similar OTUs. The clustered OTUs are then combined so that they can be treated as a single OTU representing the ancestor of the combined OTUs. This step reduces the complexity of the data by one OTU. The process is repeated, adding the remaining OTUs in a stepwise manner until all OTUs are clustered together. The stepwise clustering approach is faster and computationally less intensive than the exhaustive search method. The most widely used distance-based methods include N-J and UPGMA, and character-based methods include the Maximum Parsimony (MP) and Maximum Likelihood (ML) methods (Felsenstein, 1996). All of these methods make particular assumptions regarding the evolutionary process, which may or may not be applicable to the actual data. Thus, before selecting a phylogeny reconstruction method, it is recommended to take into account the assumptions made by the method in order to infer the best phylogenetic tree. A list of widely used phylogeny inference packages is given in Table 4.

PHYLIP: http://evolution.genetics.washington.edu/phylip.html (Felsenstein, 1989)
PAUP: http://paup.csit.fsu.edu/ (Wilgenbusch & Swofford, 2003)
MEGA5: http://www.megasoftware.net/ (Kumar et al., 2008)
MrBayes: http://mrbayes.csit.fsu.edu/ (Ronquist & Huelsenbeck, 2003)
TREE-PUZZLE: http://www.tree-puzzle.de/ (Schmidt et al., 2002)

Table 4. The list of widely used packages for molecular phylogeny.

6.1 Distance-based methods of phylogeny reconstruction


The distance-based phylogeny reconstruction begins with the computation of pairwise genetic distances between molecular sequences using an appropriate substitution model, built on the basis of the evolutionary assumptions discussed in section 5. This step results in the derivation of a distance matrix, which is subsequently used to infer a tree topology using a clustering method. Fig. 4 shows the distance matrix computed for a sample sequence dataset of 5 OTUs with 6 sites using the Jukes-Cantor distance measure. A distance measure possesses three properties: (a) the distance of an OTU from itself is zero, D(i, i) = 0; (b) the distance of OTU i from another OTU j must be equal to the distance of OTU j from OTU i, D(i, j) = D(j, i); and (c) the distance measure should follow the triangle inequality rule, i.e. D(i, j) ≤ D(i, k) + D(k, j). The accurate estimation of genetic distances is a crucial requirement for the inference of a correct phylogenetic tree; thus, the choice of the right model of evolution is as important as the choice of the clustering method. The popular methods used for clustering are UPGMA and N-J.

Fig. 4. The distance matrix obtained for a sample nucleotide sequence dataset using the Jukes-Cantor model. The dataset contains 5 OTUs (A-E) and 6 sites, shown in PHYLIP format. The dnadist program in the PHYLIP package was used to compute the distance matrix.
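The three properties listed above can be checked programmatically; a minimal sketch (Python; added example, not from the original chapter) over a toy matrix is shown below.

```python
def is_valid_distance_matrix(d, tol=1e-9):
    """Check identity, symmetry and the triangle inequality for a nested-dict distance matrix."""
    otus = list(d)
    for i in otus:
        if abs(d[i][i]) > tol:                       # (a) D(i, i) = 0
            return False
        for j in otus:
            if abs(d[i][j] - d[j][i]) > tol:         # (b) D(i, j) = D(j, i)
                return False
            for k in otus:                           # (c) D(i, j) <= D(i, k) + D(k, j)
                if d[i][j] > d[i][k] + d[k][j] + tol:
                    return False
    return True

d = {"A": {"A": 0.0, "B": 0.2, "C": 0.5},            # toy 3 x 3 matrix, not the Fig. 4 data
     "B": {"A": 0.2, "B": 0.0, "C": 0.4},
     "C": {"A": 0.5, "B": 0.4, "C": 0.0}}
print(is_valid_distance_matrix(d))                   # True
```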

6.1.1 UPGMA method for tree building


The UPGMA method was developed by Sokal and Michener (1958) and is the most widely used clustering methodology. The method is based on the assumptions that the rate of substitution is constant for all branches in the tree (which may not hold true for all data) and that branch lengths are additive. It employs a hierarchical agglomerative clustering algorithm, which produces an ultrametric tree in which every OTU is equidistant from the root. The clustering process begins with the identification of the most similar pair of OTUs (i & j), as decided from the distance value D(i, j) in the distance matrix. The OTUs i and j are clustered together and combined to form a composite OTU ij. This gives rise to a new distance matrix, shorter by one row and one column than the initial distance matrix. The distances between un-clustered OTUs remain unchanged. The distance of each remaining OTU (e.g. k) from the composite OTU is computed as the average of the initial distances of that OTU from the individual members of the composite OTU (i.e. D(ij, k) = [D(i, k) + D(j, k)]/2). In this way a new distance matrix is calculated, and in the next round the OTUs with the least dissimilarity are clustered together to form another composite OTU. The remaining steps are the same as in the first round. This process of clustering is repeated until all the OTUs are clustered.

Iteration 1: OTU A is minimally and equally distant from OTUs B and C. We randomly select OTUs A and B to form one composite OTU (AB); A and B are clustered together. We then compute the new distances of OTUs C, D and E from the composite OTU (AB). The distances between unclustered OTUs are retained. See Fig. 4 for the initial distance matrix and Fig. 5 for the updated matrix after the first iteration of UPGMA.
d(AB,C) = [d(A,C) + d(B,C)]/2 = [0.188486 + 0.440840]/2 = 0.314633
d(AB,D) = [d(A,D) + d(B,D)]/2 = [0.823959 + 0.440840]/2 = 0.632399
d(AB,E) = [d(A,E) + d(B,E)]/2 = [1.647918 + 0.823959]/2 = 1.235938

Fig. 5. The updated distance matrix and clustering of A and B after the 1st iteration of
UPGMA.
Iteration 2: OTUs (AB) and C are minimally distant. We select these OTUs to form one
composite OTU (ABC). AB and C are clustered together. We then compute new distances of
OTUs D and E from composite OTU (ABC). See Fig. 5 for distance matrix obtained in
iteration 1 and Fig. 6 for updated matrix after the second iteration of UPGMA.
d(ABC,D) = [d(AB,D) + d(C,D)]/2 = [0.632399 + 1.647918]/2 = 1.140158
d(ABC,E) = [d(AB,E) + d(C,E)]/2 = [1.235938 + 0.823959]/2 = 1.029948

Fig. 6. The updated distance matrix and clustering of A, B and C after the 2nd iteration of
UPGMA.
Iteration 3: OTUs D and E are minimally distant. We select these OTUs to form one
composite OTU (DE). D and E are clustered together. Compute new distances of OTUs
(ABC) and (DE) from each other. Finally the remaining two OTUs are clustered together. See
Fig. 6 for distance matrix obtained in iteration 2 and Fig. 7 for updated matrix after third
iteration of UPGMA.
d(ABC,DE) = [d(ABC,D) + d(ABC,E)]/2 = [1.140158 + 1.029948]/2 = 1.085053

Fig. 7. The updated distance matrix and clustering of OTUs after the 3rd iteration of
UPGMA. Numbers on the branches indicate branch lengths, which are additive.
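The iterations above can be summarized in a short clustering routine. The sketch below (Python; added illustration, not from the original chapter) follows exactly the simple averaging rule d(ij, k) = [d(i, k) + d(j, k)]/2 used in the worked example; the input distances are reconstructed from the values quoted in the iteration-by-iteration calculations, and ties are broken by insertion order, mirroring the arbitrary choice of A and B in Iteration 1.

```python
def upgma(dist):
    """dist: dict {frozenset({a, b}): distance}; returns a nested-tuple tree (left, right, height)."""
    labels = set()
    for pair in dist:
        labels |= set(pair)
    clusters = {c: c for c in labels}                 # cluster label -> subtree built so far
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)                # closest pair of current clusters
        new = i + j                                   # composite OTU label, e.g. "AB"
        joined = (clusters.pop(i), clusters.pop(j), dist.pop(frozenset((i, j))))
        for k in list(clusters):                      # distances to the new composite OTU
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((new, k))] = (d_ik + d_jk) / 2.0
        clusters[new] = joined
    return next(iter(clusters.values()))

pairs = {("A", "B"): 0.188486, ("A", "C"): 0.188486, ("A", "D"): 0.823959,
         ("A", "E"): 1.647918, ("B", "C"): 0.440840, ("B", "D"): 0.440840,
         ("B", "E"): 0.823959, ("C", "D"): 1.647918, ("C", "E"): 0.823959,
         ("D", "E"): 0.823959}
tree = upgma({frozenset(p): v for p, v in pairs.items()})
print(tree)   # A joins B, then C is added; D joins E; the two groups merge last (cf. Figs 5-7)
```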

6.1.2 N-J method for tree building


The N-J method for clustering was developed by Saitou and Nei (1987). It reconstructs the
unrooted phylogenetic tree with branch lengths using minimum evolution criterion that
minimizes the lengths of tree. It does not assume the constancy of substitution rates across
sites and does not require the data to be ultrametric, unlike UPGMA. Hence, this method is
more appropriate for the sites with variable rates of evolution.
N-J method is known to be a special case of the star decomposition method. The initial tree
topology is a star. The input distance matrix is modified such that the distance between
every pair of OTUs is adjusted using their average divergence from all remaining OTUs. The
least dissimilar pair of OTUs is identified from the modified distance matrix and is
combined together to form single composite OTU. The branch lengths of individual
members, clustered in composite OTU, are computed from internal node of composite OTU.
Now the distances of remaining OTUs from composite OTU are redefined to give a new
distance matrix shorter by one OTU than the initial matrix. This process is repeated, keeping
track of the nodes, until all the OTUs are grouped together, which results in a final unrooted
tree topology with minimized branch lengths. The unrooted phylogenetic tree thus obtained can
be rooted using an outgroup species. The BIONJ (Gascuel 1997),
generalized N-J (Pearson et al., 1999) and Weighbor (Bruno et al., 2000) are some of the
recently proposed alternative versions of N-J algorithm. The sample calculation and steps
involved in N-J clustering algorithm, using distance matrix shown in Fig. 4, are given below.
Iteration 1: Before starting the actual process of clustering, the vector r is calculated as
follows, with N=5; refer to the initial distance matrix given in Fig. 4 for the reference values.
r(A) = [d(A,B)+ d(A,C)+ d(A,D)+ d(A,E)]/(N-2) = 0.949616
r(B) = [d(B,A)+ d(B,C)+ d(B,D)+ d(B,E)]/(N-2) = 0.631375
r(C) = [d(C,A)+ d(C,B)+ d(C,D)+ d(C,E)]/(N-2) = 1.033755
r(D) = [d(D,A)+ d(D,B)+ d(D,C)+ d(D,E)]/(N-2) = 1.245558
r(E) = [d(E,A)+ d(E,B)+ d(E,C)+ d(E,D)]/(N-2) = 1.373265
Using these r values, we construct a modified distance matrix, Md, such that
Md(i,j) = d(i,j) – (r(i) + r(j)).
See Fig. 8 for Md.

Fig. 8. The modified distance matrix Md and clustering for iteration 1 of N-J.
As can be seen from Md in Fig. 8, OTUs A and C are minimally distant. We select the OTUs
A and C to form one composite OTU (AC). A and C are clustered together.
Iteration 2: Compute new distances of OTUs B, D and E from composite OTU (AC).
Distances between unclustered OTUs will be retained from the previous step.
d(AC,B) = [d(A,B) + d(C,B)-d(A,C)]/2 = 0.22042
d(AC,D) = [d(A,D) + d(C,D) -d(A,C)]/2 = 1.141695
d(AC,E) = [d(A,E) + d(C,E) -d(A,C)]/2 = 1.141695
Compute r as in the previous step with N=4. See Fig. 9 for new distance matrix and r vector.

Fig. 9. The new distance matrix D and vector r obtained for NJ algorithm iteration 2.
Now, we compute the modified distance matrix Md as in the previous step and cluster the
minimally distant OTUs. See Fig. 10.

Fig. 10. The modified distance matrix Md, obtained during N-J algorithm iteration 2.
In this step, AC & B and D & E are minimally distant, so we cluster AC with B, and D with E.
Repeating the above steps, we finally obtain the phylogenetic tree shown in Fig. 11.
Both distance-based methods, UPGMA and N-J, are computationally fast compared with
character-based methods and hence are suited to the phylogeny of large datasets. N-J is the most
widely used distance-based method for phylogenetic analysis. The results of both methods are
highly dependent on the model of evolution selected a priori.

Fig. 11. The phylogenetic tree obtained using the N-J algorithm for the distance matrix in
Fig. 4. Numbers on the branches indicate branch lengths.
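The r vector and the modified matrix Md of the first N-J iteration can be checked in the same way. The sketch below is again our own illustration and covers only the pair-selection step (branch-length estimation and the matrix-reduction step are omitted); because two of the pairwise distances are inferred rather than quoted, the computed values may differ from Fig. 8 in the last decimal places.

```python
from itertools import combinations

# The same (partly inferred) distances used in the UPGMA sketch above.
otus = "ABCDE"
d = {frozenset(p): v for p, v in {
    ('A', 'B'): 0.188486, ('A', 'C'): 0.188486, ('A', 'D'): 0.823959,
    ('A', 'E'): 1.647918, ('B', 'C'): 0.440840, ('B', 'D'): 0.440840,
    ('B', 'E'): 0.823959, ('C', 'D'): 1.647918, ('C', 'E'): 0.823959,
    ('D', 'E'): 0.823959}.items()}

N = len(otus)
# Net divergences r(i) and modified distances Md(i,j) = d(i,j) - (r(i) + r(j)).
r = {i: sum(d[frozenset((i, j))] for j in otus if j != i) / (N - 2) for i in otus}
md = {frozenset(p): d[frozenset(p)] - r[p[0]] - r[p[1]] for p in combinations(otus, 2)}

print({i: round(v, 6) for i, v in r.items()})     # compare with r(A) ... r(E) above
for pair in sorted(md, key=md.get):               # smallest Md listed first
    print(sorted(pair), round(md[pair], 6))
# The pair with the least Md (A and C in Fig. 8) is joined first; with the rounded
# distances used here, the A-C and D-E entries come out nearly tied.
```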

6.2 Character-based methods of phylogeny reconstruction


The most commonly used character-based methods in molecular phylogenetics are
Maximum parsimony and Maximum likelihood. Unlike distance-based MPA, character-based
methods use the character information in the alignment data directly as the input for tree
building. The aligned data take the form of a character-state matrix, where the nucleotide or
amino acid symbols represent the states of the characters. Character-based methods employ an
optimality criterion, with an explicitly defined objective function to score each tree topology,
in order to infer the optimum tree. Hence, these methods are comparatively slower than
distance-based clustering algorithms, which are simply based on a set of rules and operations
for clustering. On the other hand, character-based methods are advantageous in that they
provide a precise mathematical basis for preferring one tree over another, unlike
distance-based clustering algorithms.

6.2.1 Maximum parsimony


The Maximum parsimony (MP) method is based on the simple principle of searching for the
tree, or collection of trees, that requires the minimum number of evolutionary changes (changes
of one character state into another) to explain the observed differences at the informative
sites of the OTUs. There are two problems under the parsimony criterion: a) determining the
length of a tree, i.e. estimating the number of changes in character states it requires, and
b) searching over all possible tree topologies to find the tree that involves the minimum
number of changes. Finally, all the trees with the minimum number of changes are identified for
each of the informative sites. Fitch's algorithm is used to calculate the number of changes for
a fixed tree topology (Fitch, 1971). If the number of OTUs, N, is moderate, this algorithm can
be applied to all possible tree topologies and the most parsimonious rooted tree, with the
minimum number of changes, is then inferred. However, if N is very large, it becomes
computationally expensive to calculate the changes for the large number of possible rooted
trees. In such cases, a branch and bound algorithm is used to restrict the search space of tree
topologies while still guaranteeing, in combination with Fitch's algorithm, that the most
parsimonious trees are found (Hendy & Penny, 1982). For even larger datasets, heuristic
searches are used, which may miss some maximally parsimonious topologies in order to reduce
the search space.
An illustrative example of phylogeny analysis using Maximum parsimony is shown in Table
5 and Fig. 12. Table 5 shows a snapshot of MSA of 4 sequences where 5 columns show the
aligned nucleotides. Since there are four taxa (A, B, C & D), three possible unrooted trees
can be obtained for each site. Out of the 5 character sites, only two sites, viz., 4 & 5, are
informative, i.e. sites having at least two different types of characters (nucleotides/amino
acids), each present with a minimum frequency of 2. In the Maximum parsimony method, only informative
sites are analysed. Fig. 12 shows the Maximum parsimony phylogenetic analysis of site 5
shown in Table 5. Three possible unrooted trees are shown for site 5 and the tree length is
calculated in terms of number of substitutions. Tree II is favoured over trees I and III as it
can explain the observed changes in the sequences just with a single substitution. In the
same way unrooted trees can be obtained for other informative sites such as site 4. The most
parsimonious tree among them will be selected as the final phylogenetic tree. If two or more
trees are found and no unique tree can be inferred, trees are said to be equally
parsimonious.

Table 5. Example of phylogenetic analysis from 5 aligned character sites in 4 OTUs using
Maximum parsimony method.

Fig. 12. Example showing various tree topologies based on site 5 in Table 5 using the
Maximum parsimony method.
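Fitch's counting rule for a fixed topology is straightforward to express in code. The sketch below is our own illustration: since Table 5 is not reproduced in this text, the site pattern is a hypothetical informative site (two character states, each observed in two OTUs), and the three topologies are labelled by their groupings rather than by the numbering used in Fig. 12.

```python
# Minimal sketch of Fitch's (1971) small-parsimony count for one character site on a
# fixed topology.  Trees are nested tuples; leaves are OTU names.  The site pattern
# below is hypothetical, standing in for an informative column of Table 5.
site = {'A': 'T', 'B': 'T', 'C': 'C', 'D': 'C'}

def fitch(tree, site):
    """Return (candidate state set, minimum number of changes) for a nested-tuple tree."""
    if isinstance(tree, str):                          # leaf: its observed state, 0 changes
        return {site[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], site), fitch(tree[1], site)
    if lset & rset:                                    # intersection: no extra substitution
        return lset & rset, lcost + rcost
    return lset | rset, lcost + rcost + 1              # union: count one more substitution

# The three possible unrooted 4-taxon topologies, each rooted on its internal edge
# (rooting does not change the parsimony length):
topologies = {
    '((A,B),(C,D))': (('A', 'B'), ('C', 'D')),
    '((A,C),(B,D))': (('A', 'C'), ('B', 'D')),
    '((A,D),(B,C))': (('A', 'D'), ('B', 'C')),
}
for label, tree in topologies.items():
    print(label, '->', fitch(tree, site)[1], 'substitution(s)')
# For this pattern the grouping (A,B)|(C,D) needs a single change; the other two need two.
```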
This method is suitable for a small number of sequences with high similarity and was
originally developed for protein sequences. Since the method examines the number of
evolutionary changes in all possible trees, it is computationally intensive and time
consuming. Thus, it is not the method of choice for large genome-sized datasets with high
variation. Unequal rates of variation at different sites can lead to an erroneous parsimony
tree, with some branches having longer lengths than others, because the parsimony method
assumes that the rate of change is equal across all sites.

6.2.2 Maximum likelihood


As mentioned at the beginning, the other character-based method for MPA is the
Maximum likelihood method, which is based on a probabilistic approach to phylogeny.
This approach differs from the methods discussed earlier. Probabilistic models of evolution
are specified, and the tree is reconstructed for the given set of sequences either by
maximizing the likelihood or by a sampling method. The main difference between this method
and the methods discussed before is that it ranks the various possible tree topologies
according to their likelihood. The likelihood can be obtained either by the frequentist
approach (using the probability P(data|tree)) or by the Bayesian approach (based on the
posterior probability, i.e. P(tree|data)). This method also facilitates computing the
likelihood of a sub-tree topology along a branch.
To make the method operative, one must know how to compute P(x*|T, t*), the probability of the
data x* given the tree topology T and the set of branch lengths t*. The tree having the maximum
probability, i.e. the one that maximizes the likelihood, is chosen as the best tree. The
maximization can also be based on the posterior probability P(tree|data), which can be obtained
from P(x*|T, t*) = P(data|tree) by applying Bayes' theorem.
The exercise of maximization involves two steps:
a. A search over all possible tree topologies with order of assignment of sequences at the
leaves specified.
b. For each topology, a search over all possible lengths of edges in t*
As mentioned earlier in the chapter, the number of rooted trees for a given number of
sequences (N) grows very rapidly, even as N increases to just 10. An efficient procedure for
these computations is therefore required; one was proposed by Felsenstein (1981) and is
extensively used in MPA. The maximization of the likelihood over the edge lengths can be
carried out using various optimization techniques.
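As a concrete, deliberately tiny illustration of what P(x*|T, t*) means, the sketch below computes the likelihood of a single site on a two-leaf tree under the Jukes-Cantor model by summing over the unknown root state; the branch lengths and leaf states are hypothetical. Real programs evaluate such terms for every site with Felsenstein's pruning algorithm, multiply them (or sum their logarithms) across sites, and maximize the result over branch lengths and topologies.

```python
from math import exp

# Minimal sketch of the single-site likelihood P(x | T, t) under the Jukes-Cantor
# model for a root with two leaves attached by branches of lengths t1 and t2.
# Branch lengths are in expected substitutions per site.
BASES = 'ACGT'

def p_jc(i, j, t):
    """Jukes-Cantor transition probability from base i to base j over branch length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(x1, x2, t1, t2):
    """Sum over the unobserved root state, with stationary prior 1/4 for each base."""
    return sum(0.25 * p_jc(root, x1, t1) * p_jc(root, x2, t2) for root in BASES)

print(site_likelihood('A', 'A', 0.1, 0.1))   # identical leaf states: higher likelihood
print(site_likelihood('A', 'G', 0.1, 0.1))   # different leaf states: lower likelihood
```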
An alternative is to search stochastically over trees by sampling from the posterior
distribution P(T, t*|x*). This approach uses techniques such as Markov chain Monte Carlo and
Gibbs sampling. Its results are very promising and it is often recommended.
Having briefly reviewed the principles, merits and limitations of various methods available
for reconstruction of phylogenetic trees using molecular data, it becomes evident that the
choice of method for MPA is very crucial. The flowchart shown in Fig. 13 is intended to
serve as a guideline to choose a method based on extent of similarity between the sequences.
However, it is recommended that one uses multiple methods (at least two) to derive the
trees. A few programs have also been developed to superimpose trees to find out
similarities in the branching pattern and tree topologies.

7. Assessing the reliability of phylogenetic tree


The assessment of the reliability of a phylogenetic tree is an important part of MPA, as it
helps to decide the relationships of OTUs with a degree of confidence assigned by statistical
measures. Bootstrap and Jackknife analyses are the major statistical procedures used to
evaluate the topology of a phylogenetic tree (Efron, 1979; Felsenstein, 1985).
In the bootstrap technique, the original aligned dataset of sequences is used to generate a
finite population of pseudo-datasets by a “sampling with replacement” protocol. Each
pseudo-dataset is generated by sampling n character sites (columns in the alignment) randomly
from the original dataset, with the possibility of sampling the same site repeatedly, in the
process of regular bootstrap.

Fig. 13. Flowchart showing the analysis steps involved in phylogenetic reconstruction.

Fig. 14. The procedure to generate pseudo-replicate datasets from the original dataset using
bootstrap. The character sites are shown in colour codes at the bottom of the datasets to
visualize the “sampling with replacement” protocol.

This leads to the generation of a population of datasets, which are given as an input to
tree-building methods, thus giving rise to a population of phylogenetic
trees. The consensus phylogenetic tree is then inferred by the majority rule, which groups
those OTUs that are found to cluster together most often in the population of trees. The
branches in the consensus phylogenetic tree are labelled with bootstrap support values,
indicating the confidence in the relationships of OTUs depicted by the branching pattern. The
procedure for regular bootstrap is illustrated in Fig. 14, which shows the original dataset
along with four pseudo-replicate datasets.
The sites in the original dataset are colour coded to visualize the “sampling with
replacement” protocol used in the generation of pseudo-replicate datasets 1-4. The Seqboot
program in the PHYLIP package was used for this purpose, with the regular bootstrap option. For
example, pseudo-replicate dataset 1 contains site 1 (red) from the original dataset sampled 3
times. In general practice, usually 100 to 1000 datasets are generated and a phylogenetic tree
is obtained for each of them. The consensus phylogenetic tree is then obtained by the majority
rule. The reliability of the consensus tree is assessed from the number of times each branch is
recovered (the bootstrap support value), displayed along the branches of the tree.
In the Jackknife procedure, the pseudo-datasets are generated by a “sampling without
replacement” protocol. In this process, each pseudo-dataset is generated by randomly sampling
fewer than n character sites from the original dataset. This again leads to a population of
datasets, which are given as an input to tree-building methods, giving rise to a population of
phylogenetic trees. The consensus phylogenetic tree is inferred by the majority rule, which
groups those OTUs that are found to cluster together most often in the population of trees.
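Both resampling schemes amount to resampling alignment columns, as the short sketch below illustrates. The toy alignment is hypothetical, and the code mimics only the column sampling performed by a program such as Seqboot, not its input/output formats or options.

```python
import random

# Minimal sketch of pseudo-replicate generation.  Bootstrap draws n columns with
# replacement; the jackknife shown here draws a subset of columns (half, a common
# choice) without replacement.  The toy alignment below is hypothetical.
alignment = {'A': 'ATGCTACG', 'B': 'ATGCTGCG', 'C': 'ACGTTACG', 'D': 'ACGTAACG'}
n = len(next(iter(alignment.values())))                  # number of character sites

def resample(alignment, columns):
    """Rebuild every sequence from the chosen column indices."""
    return {name: ''.join(seq[i] for i in columns) for name, seq in alignment.items()}

def bootstrap(alignment):
    return resample(alignment, [random.randrange(n) for _ in range(n)])

def jackknife(alignment, keep=n // 2):
    return resample(alignment, sorted(random.sample(range(n), keep)))

random.seed(1)
replicates = [bootstrap(alignment) for _ in range(100)]  # e.g. 100 pseudo-datasets
print(replicates[0])                                     # one bootstrap pseudo-replicate
print(jackknife(alignment))                              # one jackknife pseudo-replicate
```

Each pseudo-replicate generated in this way would then be passed to a tree-building method, and the resulting population of trees summarized by the majority rule, exactly as described above.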

8. The case study of Mumps virus phylogeny


We have chosen a case study of Mumps virus (MuV) phylogeny using the amino acid
sequences of the small hydrophobic (SH) protein. There are 12 known genotypes of MuV,
designated A through L, based on the similarity of their SH gene sequences. Recently, a new
genotype of MuV, designated M, was identified during the parotitis epidemic of 2006-2007 in the
State of São Paulo, Brazil (Santos et al., 2008). Extensive phylogenetic analysis of the newly
discovered genotype together with the reference strains of the existing genotypes (A-L), using
the character-based Maximum likelihood method, was used to confirm the new genotype (Santos et
al., 2008). In the case study presented here, we have used the distance-based Neighbor-Joining
method with the objective of re-confirming the presence of the new MuV genotype M. The dataset
reported in Santos et al. (2008) is used for the re-confirmation analysis. The steps followed
in the MPA are listed below.
a. Compilation and curation of sequences: The sequences of the SH protein of the strains of the
reference genotypes (A to L), as well as of the newly discovered genotype (M) of MuV, were
retrieved using the GenBank accession numbers given in Santos et al. (2008). Sequences
were saved in FASTA format.
b. Multiple sequence alignment (MSA): SH proteins were aligned using ClustalW (see Fig.
2). The MSA was saved in PHYLIP interleaved (.phy) format.
c. Bootstrap analysis: 100 pseudo-replicate datasets of the original MSA data (obtained in
step b) were generated using the regular bootstrap method in the Seqboot program of the
PHYLIP package.
d. Derivation of distances: The distances between sequences in each dataset were calculated
using the Dayhoff PAM model, assuming a uniform rate of variation at all sites. The ‘outfile’
generated by the Seqboot program was used as the input to the Protdist program in the PHYLIP
package.

Fig. 15. The unrooted consensus phylogenetic tree obtained for Mumps virus genotypes
using Neighbor-Joining method. The first letter in OTU labels indicates the genotype (A-M),
which is followed by the GenBank accession numbers for the sequences. The OTUs are also
colour coded according to genotype as follows: A: red; B: light blue; C: yellow; D: light
green; E: dark blue; F: magenta; G: cyan; H: brick; I: pink; J: orange; K: black; L: dark green;
M: purple. All of the genotypes have formed monophyletic clades with high bootstrap
support values shown along the branches. The monophyletic clade of M genotypes (with 98
bootstrap support at its base) separated from the individual monophyletic clades of other
genotypes (A-L) re-confirms the detection of new genotype M.
e. Building phylogenetic tree: The distance matrices obtained in the previous step were
given as an input to N-J method to build phylogenetic trees. The ‘outfile’ generated by
Protdist program containing distance matrices was given as an input to Neighbor
program in PHYLIP package.
f. The consensus phylogenetic tree was then obtained using Consense program. For this
purpose the ‘outtree’ file (in Newick format) generated by Neighbor program was given
as an input to Consense program.
g. The consensus phylogenetic tree was visualized using FigTree software (available from
http://tree.bio.ed.ac.uk/software/figtree/). The consensus unrooted phylogenetic tree
is shown in Fig. 15.
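The PHYLIP-based workflow of steps a-g can also be approximated in a single script using Biopython's Bio.Phylo module. The sketch below is one possible equivalent rather than the exact procedure followed here: the input file name is hypothetical, the simple 'identity' distance model stands in for the Dayhoff PAM distances computed with Protdist, and an ASCII plot stands in for FigTree.

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus_tree, majority_consensus

# Step b: read the ClustalW alignment saved in PHYLIP format (hypothetical file name).
msa = AlignIO.read("muv_sh_aligned.phy", "phylip")

# Steps c-f: 100 bootstrap pseudo-replicates, an N-J tree for each replicate, and a
# majority-rule consensus tree carrying support values on its branches.
# ('identity' is a simplification; the chapter used Dayhoff PAM distances via Protdist.)
constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
consensus = bootstrap_consensus_tree(msa, 100, constructor, majority_consensus)

Phylo.write(consensus, "muv_consensus.nwk", "newick")   # Newick output, as from Consense
Phylo.draw_ascii(consensus)                             # step g: quick text visualization
```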
The phylogenetic tree for the same dataset was also obtained using the Maximum parsimony
method, implemented as the Protpars program in PHYLIP, after carrying out MSA and
bootstrap as detailed above. The consensus phylogenetic tree is shown in Fig. 16.
Comparison of the trees shown in Fig. 15 & Fig. 16 with the published tree re-confirms
the emergence of the new MuV genotype M during the epidemic in São Paulo, Brazil
(Santos et al., 2008), as the members of genotype M form a distinct monophyletic clade, as do
the known genotypes (A-L). However, a keen observer will note differences in the ordering of
clades in the two phylograms obtained using the two different methods, viz., N-J
and MP. For example, the clade of genotype J is close to the clade of genotype I in the N-J
phylogram (see Fig. 15) whereas in the MP phylogram (Fig. 16) the clade of genotype J is
shown to cluster with the clade of genotype F. Such differences in the ordering of clades are
sometimes observed because these methods (N-J & MP) employ different assumptions and
models of evolution. The user can interpret the results with reasonable confidence where a
similar clustering pattern of clades is observed in trees drawn using multiple methods. On the
other hand, the user should refrain from over-interpretation of sub-tree topologies where the
branching order does not match between trees drawn using different methods. Many case studies
pertaining to the emergence of new species, as well as to the evolution of individual
genes/proteins, have been published. It is advisable to work through a few of these published
case studies to understand the way in which the respective authors have interpreted their
results on the basis of phylogenetic analyses.

Fig. 16. The unrooted consensus phylogenetic tree obtained for Mumps virus genotypes
using Maximum parsimony method. The labelling of OTUs and colour coding is same as in
Fig. 15.

9. Challenges and opportunities in phylogenomics


The introduction of next-generation sequencing technology has dramatically accelerated the pace
of genome sequencing. It has inevitably posed challenges to the traditional ways of molecular
phylogeny analysis based on a single gene, a set of genes or markers. Currently, phylogenies
based on molecular markers such as 16S rRNA, mitochondrial and nuclear genes provide the
taxonomic backbone for the Tree of Life (http://tolweb.org/tree/phylogeny.html). However,
single-gene phylogenies do not necessarily reflect the phylogenetic history of the genomes of
the organisms from which these genes are derived. Moreover, evolutionary events such as lateral
gene transfer and recombination may not be revealed through the phylogeny of a single gene.
Thus, whole-genome phylogenetic analyses become important for a deeper understanding of
evolutionary patterns among organisms (Konstantinidis &
Tiedje, 2007). Whole-genome phylogeny, however, poses many challenges to the traditional
methods of MPA, the major concerns being the data size, memory and computational complexity
involved in the alignment of genomes (Liu et al., 2010).
The methods of MSA developed so far are adequate to handle the requirements of limited
amounts of data, viz. individual gene or protein sequences from various organisms. The
increased size of the data in the form of whole-genome sequences, however, constrains the use
and applicability of the currently available MSA methods, as they become computationally
intensive and require large amounts of memory. The uncertainty associated with alignment
procedures, which leads to variation in the inferred phylogeny, has also been pointed out as a
cause for concern (Wong et al., 2008). Benchmark datasets have been made available to validate
the performance of multiple sequence alignment methods (Kemena & Notredame, 2009). These
challenges have opened up opportunities for the development of alternative approaches to MPA,
with the emergence of alignment-free methods (Kolekar et al., 2010; Sims et al., 2009; Vinga &
Almeida, 2003). The field of MPA is also evolving with attempts to develop novel methods based
on various data mining techniques, viz. Hidden Markov Models (HMM) (Snir & Tuller, 2009), chaos
game representation (Deschavanne et al., 1999), Return Time Distributions (Kolekar et al.,
2010), etc. These recent approaches are undergoing refinement and will have to be evaluated
against the benchmark datasets before they can be routinely used. However, the sheer
dimensionality of genomic data demands their application. These approaches, along with the
conventional approaches, are extensively reviewed elsewhere (Blair & Murphy, 2010; Wong &
Nielsen, 2007).

10. Conclusion
The chapter provides an excursion through molecular phylogeny analysis for potential users. It
gives an account of the available resources and tools. The fundamental principles and salient
features of the various methods, viz. distance-based and character-based, are explained with
worked-out examples. The purpose of the chapter will be served if it enables the reader to
develop an overall understanding, which is critical when performing such analyses with real data.

11. Acknowledgment
PSK acknowledges the DBT-BINC junior research fellowship awarded by Department of
Biotechnology (DBT), Government of India. UKK acknowledges infrastructural facilities and
financial support under the Centre of Excellence (CoE) grant of DBT, Government of India.

12. References
Adachi, J. & Hasegawa, M. (1996) Model of amino acid substitution in proteins encoded by
mitochondrial DNA. Journal of Molecular Evolution 42(4):459-468.
Aniba, M.; Poch, O. & Thompson J. (2010) Issues in bioinformatics benchmarking: the case
study of multiple sequence alignment. Nucleic Acids Research 38(21):7353-7363.
Batzoglou, S. (2005) The many faces of sequence alignment. Briefings in Bioinformatics 6(1):6-
22.
Benson, D.; Karsch-Mizrachi, I.; Lipman, D.; Ostell, J. & Sayers, E. (2011) GenBank. Nucleic
Acids Research 39(suppl 1):D32-D37.

Blair, C. & Murphy, R. (2010) Recent Trends in Molecular Phylogenetic Analysis: Where to
Next? Journal of Heredity 102(1):130.
Bruno, W.; Socci, N. & Halpern A. (2000) Weighted neighbor joining: a likelihood-based
approach to distance-based phylogeny reconstruction. Molecular Biology and
Evolution 17(1):189.
Cavalli-Sforza, L. & Edwards, A. (1967) Phylogenetic analysis. Models and estimation
procedures. American Journal of Human Genetics 19(3 Pt 1):233.
Cole, J.; Wang, Q.; Cardenas, E.; Fish, J.; Chai, B.; Farris, R.; Kulam-Syed-Mohideen, A.;
McGarrell, D.; Marsh, T.; Garrity, G. & others. (2009) The Ribosomal Database
Project: improved alignments and new tools for rRNA analysis. Nucleic Acids
Research 37(suppl 1):D141-D145.
Deschavanne, P.; Giron, A.; Vilain, J.; Fagot, G. & Fertil, B. (1999) Genomic signature:
characterization and classification of species assessed by chaos game representation
of sequences. Mol Biol Evol 16(10):1391-9.
Edgar, R.(2004). MUSCLE: a multiple sequence alignment method with reduced time and
space complexity. BMC Bioinformatics 5:113
Efron, B. (1979) Bootstrap Methods: Another Look at the Jackknife. Ann. Statist. 7:1-26.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood
approach. J. Mol Evol 17:368-376.
Felsenstein, J. (1985) Confidence Limits on Phylogenies: An Approach Using the Bootstrap.
Evolution 39 783-791.
Felsenstein, J.(1989). PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164-166
Felsenstein, J. (1996) Inferring phylogenies from protein sequences by parsimony, distance,
and likelihood methods. Methods Enzymol 266:418-27.
Fitch, W. (1971) Toward Defining the Course of Evolution: Minimum Change for a Specific
Tree Topology. Systematic Zoology 20(4):406-416.
Gascuel, O. (1997) BIONJ: an improved version of the NJ algorithm based on a simple model
of sequence data. Molecular Biology and Evolution 14(7):685-695.
Hall, T. (1999). BioEdit: A user-friendly biological sequence alignment editor and analysis
program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95-98.
Hendy, M. & Penny, D. (1982) Branch and bound algorithms to determine minimal
evolutionary trees. Mathematical Biosciences 59(2):277-290.
Henikoff, S. & Henikoff, J. (1992) Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America
89(22):10915.
Jones, D.; Taylor, W. & Thornton, J. (1994) A mutation data matrix for transmembrane
proteins. FEBS Letters 339(3):269-275.
Jukes, T. & Cantor, C. (1969) Evolution of protein molecules. In “Mammalian Protein
Metabolism”(HN Munro, Ed.). Academic Press, New York.
Kaminuma, E.; Kosuge, T.; Kodama, Y.; Aono, H.; Mashima, J.; Gojobori, T.; Sugawara, H.;
Ogasawara, O; Takagi, T.; Okubo, K. & others. (2011). DDBJ progress report.
Nucleic Acids Research 39(suppl 1):D22-D27
Katoh, K.; Kuma, K.; Toh, H. & Miyata T (2005) MAFFT version 5: improvement in accuracy
of multiple sequence alignment. Nucleic Acids Res 33:511 - 518
Kemena, C. & Notredame, C. (2009) Upcoming challenges for multiple sequence alignment
methods in the high-throughput era. Bioinformatics 25(19):2455-2465.

Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences. Journal of Molecular Evolution
16(2):111-120.
Kolekar, P.; Kale, M. & Kulkarni-Kale, U. (2010) `Inter-Arrival Time' Inspired Algorithm and
its Application in Clustering and Molecular Phylogeny. AIP Conference Proceedings
1298(1):307-312.
Konstantinidis, K. & Tiedje, J. (2007) Prokaryotic taxonomy and phylogeny in the genomic
era: advancements and challenges ahead. Current Opinion in Microbiology 10(5):504-
509.
Kuiken, C.; Leitner, T.; Foley, B.; Hahn, B.; Marx, P.; McCutchan, F.; Wolinsky, S.; Korber, B.;
Bansal, G. & Abfalterer, W. (2009) HIV sequence compendium 2009. Document LA-
UR:06-0680
Kuiken, C.; Yusim, K.; Boykin, L. & Richardson, R. (2005) The Los Alamos hepatitis C
sequence database. Bioinformatics 21(3):379.
Kulkarni-Kale, U.; Bhosle, S.; Manjari, G. & Kolaskar, A. (2004) VirGen: a comprehensive
viral genome resource. Nucleic Acids Research 32(suppl 1):D289.
Kumar, S.; Nei, M.; Dudley, J. & Tamura, K. (2008) MEGA: a biologist-centric software for
evolutionary analysis of DNA and protein sequences. Brief Bioinform, 9(4):299-306.
Leinonen, R.; Akhtar, R.; Birney, E.; Bower, L.; Cerdeno-Tárraga, A.; Cheng, Y.; Cleland, I.;
Faruque, N.; Goodgame, N.; Gibson, R. & others. (2011) The European Nucleotide
Archive. Nucleic Acids Research 39(suppl 1):D28-D31.
Lio, P. & Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Res
8(12):1233-44.
Liu, K.; Linder, C. & Warnow, T. (2010) Multiple sequence alignment: a major challenge to
large-scale phylogenetics. PLoS Currents 2.
Luo, A.; Qiao, H.; Zhang, Y.; Shi, W.; Ho, S.; Xu, W.; Zhang, A. & Zhu, C. (2010) Performance
of criteria for selecting evolutionary models in phylogenetics: a comprehensive
study based on simulated datasets. BMC Evolutionary Biology 10(1):242.
Morgenstern, B.; French, K.; Dress, A. & Werner, T. (1998). DIALIGN: finding local
similarities by multiple sequence alignment. Bionformatics 14:290 - 294
Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of Molecular Biology
48(3):443-453.
Nicholas, H.; Ropelewski, A. & Deerfield DW. (2002) Strategies for multiple sequence
alignment. Biotechniques 32(3):572-591.
Notredame, C.; Higgins, D. & Heringa, J. (2000) T-Coffee: A novel method for fast and
accurate multiple sequence alignment. Journal of Molecular Biology 302:205 - 217
Parry-Smith, D.; Payne, A.; Michie, A.& Attwood, T. (1998). CINEMA--a novel colour
INteractive editor for multiple alignments. Gene 221(1):GC57-GC63
Pearson, W.; Robins, G. & Zhang, T. (1999) Generalized neighbor-joining: more reliable
phylogenetic tree reconstruction. Molecular Biology and Evolution 16(6):806.
Phillips, A. (2006) Homology assessment and molecular sequence alignment. Journal of
Biomedical Informatics 39(1):18-33.
Ronquist, F. & Huelsenbeck, J. (2003). MrBayes 3: Bayesian phylogenetic inference under
mixed models. Bioinformatics 19(12):1572-1574

Rzhetsky, A. (1995) Estimating substitution rates in ribosomal RNA genes. Genetics 141(2):771.
Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol Biol Evol 4(4):406-25.
Santos, C.; Ishida, M.; Foster, P.; Sallum, M.; Benega, M.; Borges, D.; Corrêa, K.; Constantino,
C.; Afzal, M. & Paiva, T. (2008) Detection of a new mumps virus genotype during
parotitis epidemic of 2006–2007 in the State of São Paulo, Brazil. Journal of Medical
Virology 80(2):323-329.
Sayers, E.; Barrett, T.; Benson, D.; Bolton, E.; Bryant, S.; Canese, K.; Chetvernin, V.; Church,
D.; DiCuccio, M.; Federhen, S. & others. (2011). Database resources of the National
Center for Biotechnology Information. Nucleic Acids Research 39(suppl 1):D38-D51.
Schwartz, R. & Dayhoff, M. (1979) Matrices for detecting distant relationships. In M. O. Dayhoff
(ed.), Atlas of Protein Sequence and Structure 5:353-358.
Schmidt, H.; Strimmer, K.; Vingron, M. & Von Haeseler A. (2002). TREE-PUZZLE:
maximum likelihood phylogenetic analysis using quartets and parallel computing.
Bioinformatics 18(3):502.
Sims, G.; Jun, S.; Wu, G. & Kim, S. (2009) Alignment-free genome comparison with feature
frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A
106(8):2677-82.
Snir, S. & Tuller, T. (2009) The net-hmm approach: phylogenetic network inference by
combining maximum likelihood and hidden Markov models. Journal of
bioinformatics and computational biology 7(4):625-644.
Sokal, R. & Michener, C. (1958) A statistical method for evaluating systematic relationships.
Univ. Kans. Sci. Bull. 38:1409-1438.
Tamura, K. (1992) Estimation of the number of nucleotide substitutions when there are
strong transition-transversion and G+C-content biases. Molecular Biology and
Evolution 9(4):678-687.
Tamura, K. & Nei, M. (1993) Estimation of the number of nucleotide substitutions in the
control region of mitochondrial DNA in humans and chimpanzees. Molecular
Biology and Evolution 10(3):512-526.
The UniProt Consortium. (2011). Ongoing and future developments at the Universal Protein
Resource. Nucleic Acids Research 39(suppl 1):D214-D219.
Thompson, J.; Higgins, D. & Gibson, T. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-
specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673-80.
Thompson, J.; Linard, B.; Lecompte, O. & Poch, O. (2011) A Comprehensive Benchmark
Study of Multiple Sequence Alignment Methods: Current Challenges and Future
Perspectives. PLoS ONE 6(3):e18093.
Thorne, J.; Goldman, N. & Jones, D. (1996) Combining protein evolution and secondary
structure. Molecular Biology and Evolution 13(5):666-673.
Vinga, S. & Almeida, J. (2003) Alignment-free sequence comparison-a review. Bioinformatics
19(4):513-23.
Wilgenbusch, J. & Swofford, D. (2003). Inferring Evolutionary Trees with PAUP*. Current
Protocols in Bioinformatics. 6.4.1–6.4.28
Wong, K.; Suchard, M. & Huelsenbeck, J. (2008) Alignment Uncertainty and Genomic
Analysis. Science 319(5862):473-476.

Wong, W. & Nielsen, R. (2007) Finding cis-regulatory modules in Drosophila using phylogenetic
hidden Markov models. Bioinformatics 23(16):2031-2037.
Xia, X. & Xie, Z. (2001). DAMBE: software package for data analysis in molecular biology
and evolution. Journal of Heredity 92(4):371
2

Understanding Protein Function - The Disparity Between Bioinformatics and Molecular Methods

Katarzyna Hupert-Kocurek1 and Jon M. Kaguni2
1University of Silesia, Poland
2Michigan State University, United States of America

1. Introduction
Bioinformatics has its origins in the development of DNA sequencing methods by Alan
Maxam and Walter Gilbert (Maxam and Gilbert, 1977), and by Frederick Sanger and
coworkers (Sanger et al., 1977). By entirely different approaches, the first genomes
determined at the nucleotide sequence level were that of bacteriophage φX174, and the
recombinant plasmid named pBR322 composed of about 5,400 (Sanger et al., 1977), or 4,400
base pairs (Sutcliffe, 1979), respectively. In contrast, two articles that appeared in February
2001 reported on the preliminary DNA sequence of the human genome, which corresponds
to 3 billion nucleotides of DNA sequence information (Lander et al., 2001; Venter et al.,
2001). Only two years later, the GenBank sequence database contained more than 29.3
billion nucleotide bases in greater than 23 million sequences. With the development of new
technologies, experts predict that the cost to sequence an individual’s DNA will be about
$1000. This reduction in cost suggests that efforts in the area of comparative genomics will
increase substantially, leading to an enormous database that vastly exceeds the existing one.
By way of comparative genomics approaches, computational methods have led to the
identification of homologous genes shared among species, and their classification into
superfamilies based on amino acid sequence similarity. In combination with their
evolutionary relatedness, superfamily members have been clustered into clades. In addition,
high throughput sequencing of small RNAs and bioinformatics analyses have contributed to
the identification of regions between genes that can code small RNAs (siRNA, microRNA,
and long noncoding RNA), which act during the development of an organism to modulate
gene expression at the post-transcriptional level (Fire et al., 1998; Hamilton and Baulcombe,
1999; reviewed in Elbashir et al., 2001; Ghildiyal and Zamore, 2009; Christensen et al., 2010).
An emerging area is functional genomics whereby gene function is deduced using large-
scale methods by identifying the involvement of specific genes in metabolic pathways. More
recently, phenotype microarray methods have been used to correlate the functions of genes
of microbes with cell phenotypes under a variety of growth conditions (Bochner, 2009).
These methods contrast with the traditional approach of mapping a gene via the phenotype
of a mutation, and deducing the function of the gene product based on its biochemical
analysis in concert with physiological studies. Such studies have been performed to confirm
the functional importance of conserved residues shared by superfamily members, and also
to determine the role of specific residues for a given protein. In comparison, comparative
genomics methods are unable to distinguish if a nonconserved amino acid among
superfamily members is functionally important, or simply reflects sequence divergence due
to the absence of selection during evolution. Without functional information, it is not
possible to determine if a nonconserved amino acid is important.

2. Bioinformatics analysis of AAA+ proteins


On the basis of bioinformatics analysis, P-loop nucleotide hydrolases compose a very large
group of proteins that use an amino acid motif named the phosphate binding loop (P-loop)
to hydrolyze the phosphate ester bonds of nucleotides. A positively charged group in the
side chain of an amino acid (often lysine) in the P-loop promotes nucleotide hydrolysis by
interacting with the phosphate of the bound nucleotide. Additional bioinformatics analysis
of this group of proteins led to a category of nucleotidases containing the Walker A and B
motifs, as well as additional motifs shared by the AAA (ATPases Associated with diverse
cellular Activities) superfamily (Beyer, 1997; Swaffield and Purugganan, 1997). These
diverse activities include protein unfolding and degradation, vesicle transport and
membrane fusion, transcription and DNA replication. The additional motifs of the AAA
superfamily differentiate its members from the larger set of P-loop nucleotidases. Neuwald
et al., and Iyer et al. then integrated structural information with bioinformatics analysis to
classify members of the AAA+ superfamily into clades (Neuwald et al., 1999; Iyer et al.,
2004). These clades are the clamp loader clade, the DnaA/CDC6/ORC clade, the classical
AAA clade, the HslU/ClpX/Lon/ClpAB-C clade, and the Helix-2 insert clade. The last two
clades have been organized into the Pre-sensor 1 hairpin superclade.
Members of the superfamily of AAA+ ATPases carry a nucleotide-binding pocket called the
AAA+ domain that ranges from 200 to 250 amino acids, which is formed by an αβα-type
Rossmann fold followed by several α helices (Figure 1) (Lupas and Martin, 2002; Iyer et al.,
2004; Hanson and Whiteheart, 2005). Such proteins often assemble into ring-shaped or
helical oligomers (Davey et al., 2002; Iyer et al., 2004; Erzberger and Berger, 2006). Using the
nomenclature of Iyer et al., the Rossmann fold is formed by a β sheet of parallel strands
arranged in a β5-β1-β4-β3-β2 series. Its structure resembles a wedge. An α helix preceding
the β1 strand and a loop that is situated across the face of the β sheet is a distinguishing
feature of the AAA+ superfamily. Another characteristic is the position of several α helices
positioned above the wide end of the wedge. The P-loop or the Walker A motif (GX4GKT/S
where X is any amino acid) is located between the β1 strand and the following α helix. The
Walker B motif (φφφφDE where φ is a hydrophobic amino acid) coordinates a magnesium
ion complexed with the nucleoside triphosphate via the conserved aspartate residue. The
conserved glutamate is thought to interact with a water molecule to make it a better
nucleophile for nucleotide hydrolysis.
AAA+ proteins also share conserved motifs named the Sensor 1, Box VII, and Sensor 2
motifs that coordinate ATP hydrolysis with a change in conformation (Figure 1) (Lupas and
Martin, 2002; Iyer et al., 2004; Hanson and Whiteheart, 2005). Relative to the primary amino
acid sequence, these motifs are on the C-terminal side of the Walker B motif. The Sensor 1
motif contains a polar amino acid at the end of the β4 strand. On the basis of the X-ray
crystal structure of N-ethylmaleimide-sensitive factor, an ATPase involved in intracellular
vesicle fusion (Beyer, 1997; Swaffield and Purugganan, 1997), this amino acid together with
the acidic residue in the Walker B motif interacts with and aligns the activated water
molecule during nucleotide hydrolysis. The Box VII motif, which is also called the SRH
(Second Region of Homology) motif, contains an arginine named the arginine finger because of
its function analogous to that of the corresponding arginine of GTPase activator proteins,
which interacts with the GTP bound to a small G protein partner to promote GTP hydrolysis. The
crystal structures of several AAA+ proteins have shown that the Box VII motif in an
individual molecule is located some distance away from the nucleotide binding pocket. In
AAA+ proteins that assemble into ring-shaped or helical oligomers, the Box VII motif of one
protomer directs an arginine residue responsible for interaction with the γ phosphate of
ATP toward the ATP binding pocket of the neighboring molecule. It is proposed that this
interaction or lack thereof coordinates ATP hydrolysis with a conformational change. The
Sensor 2 motif, which resides in one of the α helices that follow the Rossmann fold, also
contains a conserved arginine. For proteins whose structures contain the bound nucleoside
triphosphate or a nucleotide analogue, this amino acid interacts with the γ phosphate of the
nucleotide. As reviewed by Ogura (Ogura et al., 2004), this residue is involved in ATP
binding or its hydrolysis in some but not all AAA+ proteins. Like the arginine finger
residue, this arginine is thought to coordinate a change in protein conformation with
nucleotide hydrolysis.

Fig. 1. Structural organization of the AAA+ domain, and the locations of the Walker A/P-
loop, Walker B, Sensor 1, Box VII and Sensor 2 motifs are shown (adapted from ref.
(Erzberger and Berger, 2006)).
Because this chapter focuses on members of the DnaA/CDC6/ORC or initiator clade, the
following summarizes properties of this clade and not others. Like the clamp loader clade,
proteins in the initiator clade as represented by DnaA and DnaC have a structure
resembling an open spiral on the basis of X-ray crystallography (Erzberger et al., 2006; Mott
et al., 2008). In comparison, oligomeric proteins in the remaining clades form closed rings. A
characteristic feature of proteins in the initiator clade is the presence of two α helices
between the β2 and β3 strands (Figure 1). Compared with the function of DnaA in the
initiation of E. coli DNA replication, DnaC plays a separate role. Their functions are
described in more detail below. The ORC/CDC6 group of eukaryotic proteins in the
initiator clade, like DnaA and DnaC, act to recruit the replicative helicase to replication
origins at the stage of initiation of DNA replication (Lee and Bell, 2000; Liu et al., 2000). The
origin recognition complex (ORC) is composed of six related proteins named Orc1p through
Orc6p, and likely originated along with Cdc6p from a common ancestral gene.
Bioinformatics analysis of DnaC suggests that this protein is a paralog of DnaA, arising by
gene duplication and then diverging with time to perform a separate role from DnaA during
the initiation of DNA replication (Koonin, 1992). This notion leads to the question of what
specific amino acids are responsible for the different functions of DnaA and DnaC despite
the shared presence of the AAA+ amino acid sequence motifs. Presumably, specific amino
acids that are not conserved between these two proteins have critical roles in determining
their different functions, but how are these residues identified and distinguished from those
that are not functionally important? In addition, some amino acids that are conserved
among homologous DnaC proteins, which were identified by multiple sequence alignment
of twenty-eight homologues (Figure 2), are presumably responsible for the unique activities
of DnaC, but what are these unique activities? These issues underscore the limitation of
deducing the biological function of a protein by relying only on bioinformatics analysis.

3. Reverse genetics as an approach to identify the function of an unknown gene
If the various amino acid sequence alignment methods applied to a particular gene fail to
reveal homology to a gene of known function, the function of that gene remains unknown and
cannot be postulated. In such cases, the general approach is to employ
reverse genetics to attempt to correlate a phenotype with a mutation in the gene. By way of
comparison, forward genetics begins with a phenotype caused by a specific mutation at an
unknown site in the genome. The approximate position of the gene can be determined by
classical genetic methods that involve its linkage to another mutation that gives rise to a
separate phenotype. Refined linkage mapping can localize the gene of interest, followed by
PCR (polymerase chain reaction) amplification of the region and DNA sequence analysis to
determine the nature of the mutation. As a recent development, whole genome sequencing
has been performed to map mutations, dispensing with the classical method of genetic
linkage mapping (Lupski et al., 2010; Ng and Kirkness, 2010). The DNA sequence obtained
may reveal that the gene and the corresponding gene product have been characterized in the
same or different organism, and disclose its physiological function.
In a reverse genetics approach with a haploid organism, the standard strategy is to
inactivate the gene with the hope that a phenotype can be measured. Inactivation can be
achieved either by deleting the gene or by insertional mutagenesis, usually with a
transposon. As examples, transposon mutagenesis has been performed with numerous
microbial species, and with Caenorhabditis elegans (Vidan and Snyder, 2001; Moerman and
Barstead, 2008; Reznikoff and Winterberg, 2008). Using E. coli or S. cerevisiae as model
organisms for gene disruption, one method relies on replacing most of the gene with a drug
resistance cassette, or a gene that causes a detectable phenotype. The technique of gene
disruption relies on homologous recombination in which the drug resistance gene, for
example, has been joined to DNA sequences that are homologous to the ends of the target
gene (Figure 3). After introduction of this DNA into the cell, recombination between the
ends of the transfected DNA and the homologous regions in the chromosome leads to
Fig. 2. Multiple sequence alignment of dnaC homologues using the Constraint-based


Multiple Alignment Tool reveals amino acids that are highly conserved. The Walker A,
Walker B, Sensor I, Sensor II and Box VII motifs shared among the AAA+ family of ATPases
are shown below the alignment. These motifs are involved in ATP binding, ATP hydrolysis,
and coordinating a conformational change with ATP hydrolysis. The figure also shows the
region of DnaC near the N-terminus that is involved in interaction with DnaB helicase
(Ludlam et al., 2001; Galletto et al., 2003; Mott et al., 2008), and the ISM (Initiator/loader-
Specific Motif), which corresponds to two α helices between the β2 and β3 strands, that
distinguishes members of the DnaA/CDC6/ORC clade from other AAA+ clades. The ISM
causes DnaC assembled as oligomers to form a spiral structure (Mott et al., 2008).
replacement of the chromosomal copy of the gene with the drug resistance cassette, after
which the excised copy of the chromosomal gene is lost. In both E. coli and S. cerevisiae, this
approach has been used in seeking to correlate a phenotype with genes of unknown
function, and to identify those that are essential for viability (Winzeler et al., 1999; Baba et
al., 2006). By either gene disruption or transposon mutagenesis, genetic mapping of the
mutation can be performed by inverse PCR where primers complementary to a sequence
near the ends of the drug resistance cassette or the transposon are used. This approach first
involves partially digesting the chromosomal DNA with a restriction enzyme followed by
ligation of the resulting fragments to form a collection of circular DNAs. DNA sequence
analysis of the amplified DNA with the primers described above followed by comparison of
the nucleotide sequence with the genomic DNA sequence can identify the site of the
disrupted gene, or the site of insertion of the transposon.

Fig. 3. DNA recombination between the chromosomal gene and homologous DNA at ends
of the drug resistance gene leads to replacement of the chromosomal copy of the gene with
the drug resistance cassette. The chromosomal gene, and homologous DNA at the ends of
the drug resistance cassette are indicated by the lighter shaded rectangles. The drug
resistance gene is indicated by the darker rectangles. The thin line represents chromosomal
DNA flanking the gene.
With a multicellular organism, a similar strategy that relies on homologous recombination is
used to delete a gene. The type of cell to introduce the deletion is usually an embryonic stem
cell so that the effect of deletion can be measured in the whole organism. Many eukaryotic
organisms have two complete sets of chromosomes. Because the process of homologous
recombination introduces the deletion mutation in one of the two pairs of chromosomes,
yielding a heterozygote, the presence of the wild type copy on the sister chromosome may
conceal the biological effect of the deletion. Thus, the ideal objective is to delete both copies
of a gene in order to measure the associated phenotype. To attempt to obtain an organism in
which both copies of a gene have been “knocked out,” the standard strategy is to mate
heterozygous individuals. By Mendelian genetics, one-fourth of the progeny should carry
the deletion in both copies of the gene. The drawback with the approach of deleting a gene
is that it may be essential for viability, as suggested if a homozygous knockout organism
cannot be obtained. Another pitfall is that it may not be possible to construct a heterozygous
knockout because the single wild type copy is insufficient to maintain viability. In either
case, no other hint of gene function is obtained except for the requirement for life.
Another complication with attempting to determine the role of a eukaryotic gene by
deleting it is the existence of gene families where a specific biochemical function is provided
by allelic variants. Hence, to correlate a phenotype by introducing a mutation into a specific
allelic variant requires inactivation of all other members of the family. A further
complication with eukaryotic organisms is that a product formed by an enzyme of interest
in one biochemical pathway may be synthesized via an alternate pathway that involves a
different set of proteins. In these circumstances, deletion of the gene does not yield a
measurable phenotype.
In the event that deletion of a gene is not possible, an alternate approach to characterize the
function of an unknown gene is by RNA interference (reviewed in Carthew and Sontheimer,
2009; Fischer, 2010; Krol et al., 2010). This strategy exploits a natural process that acts to
repress the expression of genes during development, or as cells progress through the cell
cycle (Fire et al., 1998; Ketting et al., 1999; Tabara et al., 1999). Small RNA molecules named
microRNA (miRNA) and small interfering RNA (siRNA) become incorporated into a large
complex called the RNA-induced silencing complex (RISC), which reduces the expression
of target genes by facilitating the annealing of the RNA with the complementary sequence in
a messenger RNA (Liu et al., 2003). The duplex RNA is recognized by a component of the
RISC complex, followed by degradation of the messenger RNA to block its expression. The
RNA interference pathway has been adapted as a method to reduce or “knockdown” the
expression of a specific gene in order to explore its physiological function. Compared with
other genetic methods that examine the effect of a specific amino acid substitution on a
particular activity of a multifunctional protein, the knockout and knockdown approaches
are not as refined in that they measure the physiological effect of either the reduced
function, or the loss of function of the entire protein.

4. E. coli as a model organism for structure-function studies


Escherichia coli is a rod-shaped bacterium (0.5 micron x 2 microns in the nongrowing state)
that harbors a 4.4 × 10⁶ base pair genome encoding more than 4,000 genes. By transposon-
based insertional mutagenesis and independently by systematic deletion of each open
reading frame, these genes have been separated into those that are essential for viability,
and those that are considered nonessential (Baba et al., 2006). Of the total, about 300 genes
are of undetermined function, including 37 genes that are essential. BLAST analysis
indicates that some of the genes of unknown function are conserved among bacteria,
suggesting their functional importance.
In comparison, many of the genes of known function have been studied extensively. Among
these are the genes required for duplication of the bacterial chromosome, including a subset
that acts at the initiation stage of DNA replication. The following section describes a specific
example that focuses on DnaC protein. Studies on this protein take advantage of
bioinformatics in combination with its X-ray crystallographic structure, molecular genetic
analysis, and the biochemical characterization of specific mutant DnaC proteins to obtain
new insight into its role in DNA replication.

5. Molecular analysis of E. coli DnaC, an essential protein involved in the initiation of DNA replication, and replication fork restart
DNA replication is the basis for life. Occurring only once per cell cycle, DNA replication
must be tightly coordinated with other major cellular processes required for cell growth so
that each progeny cell receives an accurate copy of the genome at cell division (reviewed in
DePamphilis and Bell, 2010). Improper coordination of DNA replication with cell growth
leads to aberrant cell division that causes cell death in severe cases. In addition, the failure to
control the initiation process leads to excessive initiations, followed by the production of
double strand breaks that apparently arise due to head-to-tail fork collisions. In eukaryotes,
aneuploidy and chromosome fusions appear if the broken DNA is not repaired, which can lead to
malignant growth.
In bacteria, chromosomal DNA replication starts at a specific locus called oriC (Figure 4).

Fig. 4. Organization of DNA sequence motifs in the E. coli replication origin (oriC). Near the
left border are 13-mer motifs that are unwound by DnaA complexed to ATP. Sites
recognized by DnaA are the DnaA boxes (arrow), I-sites (warped oval), and τ-sites (warped
circle). Sites recognized by IHF (shaded rectangle), Fis (filled rectangle), and DNA adenine
methyltransferase (shaded circle) are also indicated.
Recent reviews describe the independent mechanisms that control the frequency of
initiation from this site (Nielsen and Lobner-Olesen, 2008; Katayama et al., 2010). In
Escherichia coli, the minimal oriC sequence of 245 base pairs contains DNA-binding sites for
many different proteins that either act directly in DNA replication, or modulate the
frequency of this process (reviewed in Leonard and Grimwade, 2009). One of them is DnaA,
which is the initiator of DNA replication, and has been placed in one of the clades of the
AAA+ superfamily via bioinformatics analysis (Koonin, 1992; Erzberger and Berger, 2006).
DnaA binds to a consensus 9 base pair sequence known as the DnaA box. There are five
DnaA boxes individually named R1 through R5 within oriC that are similar in sequence and
are recognized by DnaA (Leonard and Grimwade, 2009). In addition to these sites, DnaA
complexed to ATP specifically recognizes three I- sites and τ-sites in oriC, which leads to the
unwinding of three AT-rich 13-mer repeats located in the left half of oriC. Binding sites are
also present for IHF protein (integration host factor) and FIS protein (factor for inversion
stimulation). As these proteins induce bends in DNA, their apparent ability to modulate the
binding of DnaA to the respective sites in oriC may involve DNA bending. Additionally,
oriC carries 11 GATC sequences recognized by DNA adenine methyltransferase, and sites
recognized by IciA, Rob, and SeqA. The influence of IHF, FIS, IciA, Rob and SeqA proteins
on the initiation process is described in more detail in a review (Leonard and Grimwade,
2005).
At the initiation stage of DNA replication, the first step requires the binding of DnaA
molecules, each complexed to ATP, to the five DnaA boxes, I-sites and τ-sites of oriC. After
binding, DnaA unwinds the duplex DNA in the AT-rich region to form an intermediate
named the open complex. HU or IHF stimulates the formation of the open complex. In the
next step, the replicative helicase named DnaB becomes stably bound to the separated
strands of the open complex to form an intermediate named the prepriming complex. At
this juncture, DnaC must be complexed to DnaB for a single DnaB hexamer to load onto
each of the separated strands. DnaC protein must then dissociate from the complex in order
for DnaB to be active as a helicase. Following the loading of DnaB, this helicase enlarges the
unwound region of oriC, and then interacts with DnaG primase (Tougu and Marians, 1996).
This interaction between DnaB and DnaG primase, which synthesizes primer RNAs that are extended by DNA polymerase III holoenzyme during semi-conservative DNA replication, marks the transition from the initiation stage to the elongation stage of DNA replication (McHenry, 2003; Corn and Berger, 2006; Langston et al., 2009). Replication fork
movement that is supported by DnaB helicase and assisted by a second helicase named Rep
proceeds bidirectionally around the chromosome until it reaches the terminus region (Guy
et al., 2009). The two progeny DNAs then segregate near opposite poles of the cell before
septum formation and cell division.
DnaC protein (27 kDa) is essential for cell viability because it is required during the
initiation stage of DNA replication (reviewed in Kornberg and Baker, 1992; Davey and
O'Donnell, 2003). DnaC is also required for DNA replication of the single stranded DNA of
phage φX174, and for many plasmids (e.g. pSC101, P1, R1). DnaC additionally acts to
resurrect collapsed replication forks that appear when a replication fork encounters a nick,
gap, double-stranded break, or modified bases in the parental DNA (Sandler et al., 1996).
This process of restarting a replication fork involves assembly of the replication restart
primosome that contains PriA, PriB, PriC, DnaT, DnaB, DnaC, and Rep protein (Sandler,
2005; Gabbai and Marians, 2010). The major role of DnaC at oriC, at the replication origins of the plasmids and bacteriophage described above, or in restarting collapsed replication forks is to form a complex with DnaB, which is required to escort the helicase onto the DNA, and then to depart. Since the discovery of the dnaC gene over 40 years ago (Carl, 1970), its ongoing study by various laboratories using a variety of approaches continues to reveal new aspects of the molecular mechanisms of DnaC in DNA replication.
Biochemical analysis combined with the X-ray crystallographic structure of the majority of
Aquifex aeolicus DnaC (residues 43 to the C-terminal residue at position 235) reveals that
DnaC protein consists of a smaller N-terminal domain that is responsible for binding to the C-terminal face of DnaB helicase, and a larger ATP-binding region of 190 amino acids (Figure 2; Ludlam et al., 2001; Galletto et al., 2003; Mott et al., 2008). Sequence comparison of
homologues of the dnaC gene classifies DnaC as a member of the AAA+ family of ATPases
(Koonin, 1992; Davey et al., 2002; Mott et al., 2008). However, unlike other AAA+ proteins, DnaC contains two additional α helices named the ISM motif (Initiator/loader–Specific Motif) that directs the oligomerization of this protein into a right-handed helical filament (Mott et al., 2008). In contrast, the majority of AAA+ proteins, which lack these α helices, assemble
into a closed-ring. Phylogenetic analysis of the AAA+ domain reveals that DnaC is most
closely related to DnaA, suggesting that both proteins arose from a common ancestor
(Koonin, 1992). In support, the X-ray crystallographic structures of the ATPase region of
DnaA and DnaC are very similar (Erzberger et al., 2006; Mott et al., 2008).
For DnaC, ATP increases its affinity for single-stranded DNA, which stimulates its ATPase
activity (Davey et al., 2002; Biswas et al., 2004). Other results suggest that ATP stabilizes the
interaction of DnaC with DnaB in the DnaB-DnaC complex (Wahle et al., 1989; Allen and
Kornberg, 1991), which contradicts studies supporting the conclusion that ATP is
not necessary for DnaC to form a stable complex with DnaB (Davey et al., 2002; Galletto et
al., 2003; Biswas and Biswas-Fiss, 2006). As mutant DnaC proteins bearing amino acid
substitutions in the Walker A box are both defective in ATP binding and apparently fail to
interact with DnaB, the consequence is that these mutants cannot escort DnaB to oriC
(Ludlam et al., 2001; Davey et al., 2002). Hence, despite the ability of DnaB by itself to bind
to single-stranded DNA in vitro (LeBowitz and McMacken, 1986), DnaC is essential for
DnaB to become stably bound to the unwound region of oriC (Kobori and Kornberg, 1982;
Ludlam et al., 2001). The observation that DnaC complexed to ATP interacts with DnaA
raises the possibility that both proteins act jointly in helicase loading (Mott et al., 2008).
Together, these observations indicate that the ability of DnaC to bind to ATP is essential for
its function in DNA replication; however, the roles of ATP binding and hydrolysis in DnaC activity, and the mechanism that leads to the dissociation of DnaC from DnaB, have been long-standing issues.

Fig. 5. Plasmid shuffle method. With an E. coli strain lacking the chromosomal copy of dnaC,
a plasmid carrying the wild type dnaC gene residing in the bacterial cell can provide for
dnaC function. Using a plasmid that depends on IPTG for its maintenance, the strain will not
survive in the absence of IPTG. An introduced plasmid that does not depend on IPTG for its
propagation can sustain viability in the absence of IPTG only if it encodes a functional dnaC
allele.
As described above, one of the characteristics of AAA+ proteins is the presence of a conserved motif named Box VII, which carries a conserved arginine called the “arginine
finger”. Structural studies of other AAA+ proteins have led to the proposal that this arginine
interacts with the γ phosphate of ATP to promote and coordinate ATP hydrolysis with a
conformational change. Recent experiments were performed to examine the role of the
arginine finger of DnaC and to attempt to clarify how ATP binding and its hydrolysis by
DnaC are involved in the process of initiation of DNA replication (Makowska-Grzyska and
Kaguni, 2010). Part of this study relied on an E. coli mutant lacking the chromosomal copy of
the dnaC gene (Hupert-Kocurek et al., 2007). Because the dnaC gene is essential, this
deficiency of the host strain can be complemented by a plasmid encoding the dnaC gene that
depends on IPTG (isopropyl β-D-1-thiogalactopyranoside) for plasmid maintenance (Figure
5). If another plasmid is introduced into the null dnaC mutant, it maintains viability of the
strain in the absence of IPTG only if it carries a functional dnaC allele. In contrast, if the
second plasmid carries an inactivating mutation in dnaC, the host strain cannot survive
without IPTG. This plasmid exchange method showed that an alanine substitution for the
arginine finger residue inactivated DnaC (Makowska-Grzyska and Kaguni, 2010).
Biochemical experiments performed in parallel showed that this conserved arginine plays a
role in the signal transduction process in which ATP hydrolysis by DnaC then leads
to the release of DnaC from DnaB. Finally, the interaction of primase with DnaB that is
coupled with primer formation is also apparently necessary for DnaC to dissociate from
DnaB.

6. Conclusions
In summary, the combination of various experimental approaches in the study of DnaC has led to insightful experiments that expand our understanding of the role of ATP
binding and its hydrolysis by DnaC during the initiation of DNA replication. Evidence
suggests that ATP hydrolysis by DnaC that leads to the dissociation of DnaC from DnaB
helicase is coupled with primer formation that requires an interaction between DnaG
primase and DnaB. Hence, these critical steps are involved in the transition from the process
of initiation to the elongation phase of DNA replication in E. coli.
This example on the molecular mechanism of DnaC protein is a focused study of one
protein and its interaction with other required proteins during the process of initiation of
DNA replication. One may consider this a form of vertical thinking. It contrasts with
bioinformatics approaches that yield large sets of data for proteins based on the DNA
sequences of genomes, and with microarray approaches that, for example, survey the
expression of genes and their regulation at the genome level under different conditions, or
identify interacting partners for a specific protein. The vast wealth of data from these global
approaches provides a different perspective on understanding the functions of sets of genes
or proteins and how they act in a network of biochemical pathways of the cell.

7. Acknowledgements
We thank members of our labs for discussions on the content and organization of this
chapter. This work was supported by grant GM090063 from the National Institutes of
Health, and the Michigan Agricultural Station to JMK.
8. References
Allen, G. J. and A. Kornberg (1991). Fine balance in the regulation of DnaB helicase by DnaC
protein in replication in Escherichia coli. J Biol Chem 266(33): 22096-22101
Baba, T., T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K. A. Datsenko, M. Tomita,
B. L. Wanner and H. Mori (2006). Construction of Escherichia coli K-12 in-frame,
single-gene knockout mutants: the Keio collection. Mol Syst Biol 2: 1-11
Beyer, A. (1997). Sequence analysis of the AAA protein family. Protein Sci 6(10): 2043-58,
0961-8368 (Print), 0961-8368 (Linking)
Biswas, S. B. and E. E. Biswas-Fiss (2006). Quantitative analysis of binding of single-stranded
DNA by Escherichia coli DnaB helicase and the DnaB-DnaC complex. Biochemistry
45(38): 11505-11513
Biswas, S. B., S. Flowers and E. E. Biswas-Fiss (2004). Quantitative analysis of nucleotide
modulation of DNA binding by the DnaC protein of Escherichia coli. Biochem J 379:
553-562
Bochner, B. R. (2009). Global phenotypic characterization of bacteria. FEMS Microbiol Rev
33(1): 191-205, 0168-6445 (Print), 0168-6445 (Linking)
Carl, P. L. (1970). Escherichia coli mutants with temperature-sensitive synthesis of DNA. Mol
Gen Genet 109(2): 107-122
Carthew, R. W. and E. J. Sontheimer (2009). Origins and Mechanisms of miRNAs and
siRNAs. Cell 136(4): 642-55, 1097-4172 (Electronic), 0092-8674 (Linking)
Christensen, N. M., K. J. Oparka and J. Tilsner (2010). Advances in imaging RNA in plants.
Trends Plant Sci 15(4): 196-203, 1878-4372 (Electronic), 1360-1385 (Linking)
Corn, J. E. and J. M. Berger (2006). Regulation of bacterial priming and daughter strand
synthesis through helicase-primase interactions. Nucleic Acids Res 34(15): 4082-
4088
Davey, M. J., L. Fang, P. McInerney, R. E. Georgescu and M. O'Donnell (2002). The DnaC
helicase loader is a dual ATP/ADP switch protein. EMBO J 21(12): 3148-3159
Davey, M. J., D. Jeruzalmi, J. Kuriyan and M. O'Donnell (2002). Motors and switches: AAA+
machines within the replisome. Nat Rev Mol Cell Biol 3(11): 826-835
Davey, M. J. and M. O'Donnell (2003). Replicative helicase loaders: ring breakers and ring
makers. Curr Biol 13(15): R594-596
DePamphilis, M. L. and S. D. Bell (2010). Genome duplication, Garland Science/Taylor &
Francis Group, 9780415442060, London
Elbashir, S. M., J. Harborth, W. Lendeckel, A. Yalcin, K. Weber and T. Tuschl (2001).
Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian
cells. Nature 411(6836): 494-8, 0028-0836 (Print), 0028-0836 (Linking)
Erzberger, J. P. and J. M. Berger (2006). Evolutionary relationships and
structural mechanisms of AAA+ proteins. Annu Rev Biophys Biomol Struct 35: 93-
114
Erzberger, J. P., M. L. Mott and J. M. Berger (2006). Structural basis for ATP-dependent
DnaA assembly and replication-origin remodeling. Nat Struct Mol Biol 13(8): 676-
683
Fire, A., S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver and C. C. Mello (1998). Potent
and specific genetic interference by double-stranded RNA in Caenorhabditis elegans.
Nature 391(6669): 806-11, 0028-0836 (Print), 0028-0836 (Linking)
Fischer, S. E. (2010). Small RNA-mediated gene silencing pathways in C. elegans. Int J
Biochem Cell Biol 42(8): 1306-15, 1878-5875 (Electronic), 1357-2725 (Linking)
Gabbai, C. B. and K. J. Marians (2010). Recruitment to stalled replication forks of the PriA
DNA helicase and replisome-loading activities is essential for survival. DNA
Repair (Amst) 9(3): 202-209, 1568-7856 (Electronic), 1568-7856 (Linking)
Galletto, R., M. J. Jezewska and W. Bujalowski (2003). Interactions of the Escherichia coli
DnaB Helicase Hexamer with the Replication Factor the DnaC Protein. Effect of
Nucleotide Cofactors and the ssDNA on Protein-Protein Interactions and the
Topology of the Complex. J Mol Biol 329(3): 441-465
Ghildiyal, M. and P. D. Zamore (2009). Small silencing RNAs: an expanding universe. Nat
Rev Genet 10(2): 94-108, 1471-0064 (Electronic), 1471-0056 (Linking)
Guy, C. P., J. Atkinson, M. K. Gupta, A. A. Mahdi, E. J. Gwynn, C. J. Rudolph, P. B. Moon, I.
C. van Knippenberg, C. J. Cadman, M. S. Dillingham, R. G. Lloyd and P. McGlynn
(2009). Rep provides a second motor at the replisome to promote duplication of
protein-bound DNA. Mol Cell 36(4): 654-666, 1097-4164 (Electronic), 1097-2765
(Linking)
Hamilton, A. J. and D. C. Baulcombe (1999). A species of small antisense RNA in
posttranscriptional gene silencing in plants. Science 286(5441): 950-952, 0036-8075
(Print), 0036-8075 (Linking)
Hanson, P. I. and S. W. Whiteheart (2005). AAA+ proteins: have engine, will work. Nat Rev
Mol Cell Biol 6(7): 519-529
Hupert-Kocurek, K., J. M. Sage, M. Makowska-Grzyska and J. M. Kaguni (2007). Genetic
method to analyze essential genes of Escherichia coli. Appl Environ Microbiol 73(21):
7075-7082
Iyer, L. M., D. D. Leipe, E. V. Koonin and L. Aravind (2004). Evolutionary history and higher
order classification of AAA+ ATPases. J Struct Biol 146(1-2): 11-31, 1047-8477
(Print), 1047-8477 (Linking)
Katayama, T., S. Ozaki, K. Keyamura and K. Fujimitsu (2010). Regulation of the replication
cycle: conserved and diverse regulatory systems for DnaA and oriC. Nat Rev
Microbiol 8(3): 163-170, 1740-1534 (Electronic), 1740-1526 (Linking)
Ketting, R. F., T. H. Haverkamp, H. G. van Luenen and R. H. Plasterk (1999). Mut-7 of C.
elegans, required for transposon silencing and RNA interference, is a homolog of
Werner syndrome helicase and RNaseD. Cell 99(2): 133-141, 0092-8674 (Print), 0092-
8674 (Linking)
Kobori, J. A. and A. Kornberg (1982). The Escherichia coli dnaC gene product. II.
Purification, physical properties, and role in replication. J Biol Chem 257(22):
13763-13769
Koonin, E. V. (1992). DnaC protein contains a modified ATP-binding motif and belongs
to a novel family of ATPases including also DnaA. Nucleic Acids Res 20(8):
1997
Kornberg, A. and T. A. Baker (1992). DNA Replication Second Edition, W.H. Freeman and
Company, 9781891389443, New York
Krol, J., I. Loedige and W. Filipowicz (2010). The widespread regulation of microRNA
biogenesis, function and decay. Nat Rev Genet 11(9): 597-610, 1471-0064
(Electronic), 1471-0056 (Linking)
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K.
Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J.
Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J.
P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A.
Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D.
Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee,
N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R.
Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M.
Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A.
Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson,
L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T.
Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty,
T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J.
Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S.
Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E.
Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J.
Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R.
S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori,
T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J.
Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier,
C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K.
Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S.
Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A.
Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M.
Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, N.
Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A.
Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de
la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L.
Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C.
B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S.
R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y.
Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A.
Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf,
D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N.
Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E.
Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R.
Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S.
Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, P.
de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi and Y. J. Chen (2001).
Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921,
0028-0836 (Print), 0028-0836 (Linking)
Langston, L. D., C. Indiani and M. O'Donnell (2009). Whither the replisome: emerging
perspectives on the dynamic nature of the DNA replication machinery. Cell Cycle
8(17): 2686-2691, 1551-4005 (Electronic), 1551-4005 (Linking)
LeBowitz, J. H. and R. McMacken (1986). The Escherichia coli dnaB replication protein is a
DNA helicase. J Biol Chem 261(10): 4738-48
Lee, D. G. and S. P. Bell (2000). ATPase switches controlling DNA replication initiation. Curr
Opin Cell Biol 12(3): 280-285, 0955-0674 (Print), 0955-0674 (Linking)
Leonard, A. C. and J. E. Grimwade (2005). Building a bacterial orisome: emergence of
new regulatory features for replication origin unwinding. Mol Microbiol 55(4): 978-
985
Leonard, A. C. and J. E. Grimwade (2009). Initiating chromosome replication in E. coli: it
makes sense to recycle. Genes Dev 23(10): 1145-1150
Liu, J., C. L. Smith, D. DeRyckere, K. DeAngelis, G. S. Martin and J. M. Berger (2000).
Structure and function of Cdc6/Cdc18: implications for origin recognition and
checkpoint control. Mol Cell 6(3): 637-648
Liu, Q., T. A. Rand, S. Kalidas, F. Du, H. E. Kim, D. P. Smith and X. Wang (2003). R2D2, a
bridge between the initiation and effector steps of the Drosophila RNAi pathway.
Science 301(5641): 1921-1925, 1095-9203 (Electronic), 0036-8075 (Linking)
Ludlam, A. V., M. W. McNatt, K. M. Carr and J. M. Kaguni (2001). Essential amino acids of
Escherichia coli DnaC protein in an N-terminal domain interact with DnaB helicase. J
Biol Chem 276(29): 27345-27353
Lupas, A. N. and J. Martin (2002). AAA proteins. Curr Opin Struct Biol 12(6): 746-53, 0959-
440X (Print), 0959-440X (Linking)
Lupski, J. R., J. G. Reid, C. Gonzaga-Jauregui, D. Rio Deiros, D. C. Chen, L. Nazareth, M.
Bainbridge, H. Dinh, C. Jing, D. A. Wheeler, A. L. McGuire, F. Zhang, P.
Stankiewicz, J. J. Halperin, C. Yang, C. Gehman, D. Guo, R. K. Irikat, W. Tom, N. J.
Fantin, D. M. Muzny and R. A. Gibbs (2010). Whole-genome sequencing in a
patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 362(13): 1181-91, 1533-
4406 (Electronic), 0028-4793 (Linking)
Makowska-Grzyska, M. and J. M. Kaguni (2010). Primase Directs the Release of DnaC from
DnaB. Mol Cell 37(1): 90-101
Maxam, A. M. and W. Gilbert (1977). A new method for sequencing DNA. Proc Natl Acad
Sci U S A 74(2): 560-4, 0027-8424 (Print), 0027-8424 (Linking)
McHenry, C. S. (2003). Chromosomal replicases as asymmetric dimers: studies of subunit
arrangement and functional consequences. Mol Microbiol 49(5): 1157-1165
Moerman, D. G. and R. J. Barstead (2008). Towards a mutation in every gene in
Caenorhabditis elegans. Brief Funct Genomic Proteomic 7(3): 195-204, 1477-4062
(Electronic), 1473-9550 (Linking)
Mott, M. L., J. P. Erzberger, M. M. Coons and J. M. Berger (2008). Structural synergy and
molecular crosstalk between bacterial helicase loaders and replication initiators.
Cell 135(4): 623-634
Neuwald, A. F., L. Aravind, J. L. Spouge and E. V. Koonin (1999). AAA+: A class of
chaperone-like ATPases associated with the assembly, operation, and disassembly
of protein complexes. Genome Res 9(1): 27-43
Ng, P. C. and E. F. Kirkness (2010). Whole genome sequencing. Methods Mol Biol 628: 215-
226, 1940-6029 (Electronic), 1064-3745 (Linking)
Nielsen, O. and A. Lobner-Olesen (2008). Once in a lifetime: strategies for preventing re-
replication in prokaryotic and eukaryotic cells. EMBO Rep 9(2): 151-156
Ogura, T., S. W. Whiteheart and A. J. Wilkinson (2004). Conserved arginine residues
implicated in ATP hydrolysis, nucleotide-sensing, and inter-subunit interactions in
AAA and AAA+ ATPases. J Struct Biol 146(1-2): 106-112
Reznikoff, W. S. and K. M. Winterberg (2008). Transposon-based strategies for the
identification of essential bacterial genes. Methods Mol Biol 416: 13-26, 1064-3745
(Print), 1064-3745 (Linking)
Sandler, S. J. (2005). Requirements for replication restart proteins during constitutive stable
DNA replication in Escherichia coli K-12. Genetics 169(4): 1799-1806
Sandler, S. J., H. S. Samra and A. J. Clark (1996). Differential suppression of priA2::kan
phenotypes in Escherichia coli K- 12 by mutations in priA, lexA, and dnaC. Genetics
143(1): 5-13
Sanger, F., G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, C. A. Fiddes, C. A.
Hutchison, P. M. Slocombe and M. Smith (1977). Nucleotide sequence of
bacteriophage phi X174 DNA. Nature 265(5596): 687-95, 0028-0836 (Print), 0028-
0836 (Linking)
Sanger, F., S. Nicklen and A. R. Coulson (1977). DNA sequencing with chain-terminating
inhibitors. Proc Natl Acad Sci U S A 74(12): 5463-7, 0027-8424 (Print), 0027-8424
(Linking)
Sutcliffe, J. G. (1979). Complete nucleotide sequence of the Escherichia coli plasmid pBR322.
Cold Spring Harb Symp Quant Biol 43 Pt 1: 77-90, 0091-7451 (Print), 0091-7451
(Linking)
Swaffield, J. C. and M. D. Purugganan (1997). The evolution of the conserved ATPase
domain (CAD): reconstructing the history of an ancient protein module. J Mol Evol
45(5): 549-563, 0022-2844 (Print), 0022-2844 (Linking)
Tabara, H., M. Sarkissian, W. G. Kelly, J. Fleenor, A. Grishok, L. Timmons, A. Fire and C. C.
Mello (1999). The rde-1 gene, RNA interference, and transposon silencing in C.
elegans. Cell 99(2): 123-32, 0092-8674 (Print), 0092-8674 (Linking)
Tougu, K. and K. J. Marians (1996). The interaction between helicase and primase sets the
replication fork clock. J Biol Chem 271(35): 21398-405
Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M.
Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H.
Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski,
G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder,
A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M.
Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M.
Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K.
Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R.
Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng,
V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W.
Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A.
Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N.
Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B.
Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang,
M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang, Q.
Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert, S. Baumhueter, G.
Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe, D. Baldwin, H.
Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L.
Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L.
Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C.
Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F.
Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh,
I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V.
Puri, H. Qureshi, M. Reardon, R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel,
R. Scott, C. Sitter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N.
Tint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E.
Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigo, M. J. Campbell, K.
V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania,
K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R.
Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M.
Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays,
M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski,
K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil, S.
Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft,
A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Murphy,
M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson, W.
Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell, R.
Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh and X.
Zhu (2001). The sequence of the human genome. Science 291(5507): 1304-1351, 0036-
8075 (Print), 0036-8075 (Linking)
Vidan, S. and M. Snyder (2001). Large-scale mutagenesis: yeast genetics in the genome era.
Curr Opin Biotechnol 12(1): 28-34, 0958-1669 (Print), 0958-1669 (Linking)
Wahle, E., R. S. Lasken and A. Kornberg (1989). The dnaB-dnaC replication protein
complex of Escherichia coli. I. Formation and properties. J Biol Chem 264(5): 2463-
2468
Winzeler, E. A., D. D. Shoemaker, A. Astromoff, H. Liang, K. Anderson, B. Andre, R.
Bangham, R. Benito, J. D. Boeke, H. Bussey, A. M. Chu, C. Connelly, K. Davis, F.
Dietrich, S. W. Dow, M. El Bakkoury, F. Foury, S. H. Friend, E. Gentalen, G.
Giaever, J. H. Hegemann, T. Jones, M. Laub, H. Liao, N. Liebundguth, D. J.
Lockhart, A. Lucau-Danila, M. Lussier, N. M'Rabet, P. Menard, M. Mittmann, C.
Pai, C. Rebischung, J. L. Revuelta, L. Riles, C. J. Roberts, P. Ross-MacDonald, B.
Scherens, M. Snyder, S. Sookhai-Mahadeo, R. K. Storms, S. Veronneau, M. Voet, G.
Volckaert, T. R. Ward, R. Wysocki, G. S. Yen, K. Yu, K. Zimmermann, P. Philippsen,
M. Johnston and R. W. Davis (1999). Functional characterization of the S. cerevisiae
genome by gene deletion and parallel analysis. Science 285(5429): 901-906, 0036-
8075 (Print), 0036-8075 (Linking)
3

In Silico Identification of Regulatory Elements in Promoters
Vikrant Nain1, Shakti Sahi1 and Polumetla Ananda Kumar2
1Gautam Buddha University, Greater Noida
2National Research Centre on Plant Biotechnology, New Delhi
India

1. Introduction
In multi-cellular organisms, development from zygote to adult and adaptation to different environmental stresses occur as cells acquire specialized roles by synthesizing the proteins necessary for each task. In eukaryotes, the most commonly used mechanism for maintaining the cellular protein environment is transcriptional regulation of gene expression, achieved by recruiting the required transcription factors to promoter regions. Owing to the importance of
transcriptional regulation, one of the main goals in the post-genomic era is to predict gene
expression regulation on the basis of the presence of transcription factor (TF) binding sites in promoter regions. Genome-wide knowledge of TF binding sites would be useful for building models of the transcriptional regulatory networks that result in cell-specific differentiation. In eukaryotic genomes, only a fraction (< 5%) of the total genome codes for functional proteins or RNA, while the remaining DNA consists of non-coding regulatory sequences, other regions, and sequences whose functions are still unknown.
Since the discovery of trans-acting factors in gene regulation by Jacob and Monod in the lac operon of E. coli, scientists have had an interest in finding new transcription factors and their specific recognition and binding sequences. In DNase footprinting (or the DNase protection assay), transcription factor-bound regions are protected from DNase digestion, creating a "footprint" in a sequencing gel. This methodology has resulted in the identification of hundreds of regulatory sequences. However, a limitation of this methodology is that it requires the TF and the promoter sequence (100-300 bp) in purified form. Our knowledge of transcription factors is limited, and their recognition and binding sites are scattered over the complete genome. Therefore, in spite of its high degree of accuracy in identifying TF binding sites, this methodology is not suitable for genome-wide or cross-genome scanning.
Detection of TF binding sites through phylogenetic footprinting is gradually becoming popular. It is based on the fact that random mutations are not easily tolerated in functional sequences, while they accumulate continuously in non-functional sequences. Many comparative genomics studies have revealed that, during the course of evolution, regulatory elements remain conserved while the surrounding non-coding DNA sequences keep mutating. With an ever increasing number of complete genome sequences from multiple organisms, and mRNA profiling through microarray and deep sequencing technologies, a wealth of gene expression data is being generated. These data can be used for the identification of regulatory
elements through intra- and inter-species comparative genomics. However, the identification of TF binding sites in promoters still remains one of the major challenges in bioinformatics, for the following reasons:
1. Regulatory motifs are very short (5-15 nt) and differ in their number of occurrences and position on the DNA strands with respect to the transcription start site. This wide distribution of short TF binding sites makes their identification with commonly used sequence alignment programmes challenging.
2. A very high degree of conservation between two closely related species generally reveals no clear signature of highly conserved motifs.
3. The absence of significant similarity between highly diverged species hinders the alignment of functional sequences.
4. Sometimes, functional conservation of gene expression is not sufficient to assure the evolutionary preservation of the corresponding cis-regulatory elements (Pennacchio and Rubin, 2001).
5. Transcription factor binding sites are often degenerate.
In order to overcome these challenges, novel approaches have been developed in the last few years that integrate comparative, structural, and functional genomics with computational algorithms. Such interdisciplinary efforts have increased the sensitivity of computational programs in finding composite regulatory elements.
Here, we review different computational approaches for the identification of regulatory elements in the promoter region, using the analysis of a seed-specific legumin gene promoter as an example. Based on the type of DNA sequence information used, motif finding algorithms are classified into three major classes: (1) methods that use promoter sequences of co-regulated genes from a single genome, (2) methods that use orthologous promoter sequences of a single gene from multiple species, also known as phylogenetic footprinting, and (3) methods that use promoter sequences of co-regulated genes as well as phylogenetic footprinting (Das and Dai, 2007).

2. Representation of DNA motifs


In order to discover motifs of unknown transcription factors, models to represent motifs are
essential (Stormo, 2000). There are three models that are generally used to describe a motif
and its binding sites:
1. string representation (Buhler and Tompa, 2002)
2. matrix representation (Bailey and Elkan, 1994) and
3. representation with nucleotide dependency (Chin and Leung, 2008)

2.1 String representation


String representation is the most basic representation, using a string of nucleotide symbols A, C, G and T of length l to describe a motif. Wildcard symbols are introduced into the string to represent a choice from a subset of symbols at a particular position. The International Union of Pure and Applied Chemistry (IUPAC) nucleic acid codes (Thakurta and Stormo, 2007) are used to represent this degeneracy, for example: W = A or T (‘Weak’ base
pairing); S= C or G (‘Strong’ base pairing); R= A or G (Purine); Y= C or T (Pyrimidine); K= G
or T (Keto group on base); M= A or C (Amino group on base); B= C, G, or T; D= A, G, or T ;
H= A, C, or T ; V= A, C, or G; N= A, C, G, or T.
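To make the string representation concrete, the short Python sketch below converts an IUPAC-coded motif into a regular expression and scans a toy promoter fragment for matches. The motif, the sequence and the function names are purely illustrative and are not taken from any of the tools discussed in this chapter.

import re

# IUPAC nucleotide codes (degenerate bases) mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Convert a string motif written with IUPAC codes into a regular expression."""
    return "".join("[%s]" % IUPAC[base] for base in motif.upper())

def scan(sequence, motif):
    """Return (position, matched substring) for every occurrence of the motif."""
    pattern = re.compile(iupac_to_regex(motif))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence.upper())]

if __name__ == "__main__":
    promoter = "ttgacgtgtccatacccatgcaagcTATAAATAaggcc"   # toy promoter fragment
    print(scan(promoter, "TATAWAWA"))                    # TATA-box-like degenerate motif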
2.2 Matrix representation


In matrix representation, motifs of length l are represented by position weight matrices (PWMs) or position-specific scoring matrices (PSSMs) of size 4 × l. Such a matrix gives the occurrence probability of each of the four nucleotides at each position j. The score of any specific sequence is the sum of the position scores from the weight matrix corresponding to that sequence. Using this representation, an entire genome can be scanned with a matrix and the score at every position obtained (Stormo, 2000). Any sequence with a score higher than a predefined cut-off is a potential new binding site. A consensus sequence is deduced from a
multiple alignment of input sequences and then converted into a position weight matrix.
A PWM score is the sum of position-specific scores for each symbol in the substring. The matrix has one row for each symbol of the alphabet, and one column for each position in the pattern. The score assigned by a PWM to a substring S = s_1 s_2 ... s_N is defined as ∑_{j=1}^{N} m(s_j, j), where j represents the position in the substring, s_j is the symbol at position j in the substring, and m(α, j) is the score in row α, column j of the matrix.
Although matrix representation appears superior, the solution space for PWMs and PSSMs, which consists of 4 × l real numbers, is infinite in size, and there are many locally optimal matrices; thus, algorithms generally either produce a suboptimal motif matrix or take too long to run when the motif is longer than 10 bp (Francis and Henry, 2008).
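As an illustration of the matrix representation, the following Python sketch builds a log-odds PWM from a handful of hypothetical aligned binding sites, scores every window of a query sequence as the sum of position-specific scores m(s_j, j), and reports windows above a user-chosen cut-off. The pseudocount, the uniform background frequency of 0.25 and the threshold are arbitrary choices made for the example, not values prescribed by any particular program.

import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds position weight matrix from aligned binding sites."""
    length = len(sites[0])
    pwm = []
    for j in range(length):
        column = [site[j] for site in sites]
        scores = {}
        for b in BASES:
            freq = (column.count(b) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[b] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def score(pwm, window):
    """Sum of position-specific scores m(s_j, j) over the window."""
    return sum(pwm[j][base] for j, base in enumerate(window))

def scan(pwm, sequence, threshold):
    """Report every window whose PWM score exceeds the threshold."""
    l = len(pwm)
    hits = []
    for i in range(len(sequence) - l + 1):
        s = score(pwm, sequence[i:i + l])
        if s >= threshold:
            hits.append((i, sequence[i:i + l], round(s, 2)))
    return hits

if __name__ == "__main__":
    # Hypothetical aligned binding sites for a TATA-box-like element.
    sites = ["TATAAATA", "TATATATA", "TATAAACA", "CATAAATA"]
    pwm = build_pwm(sites)
    print(scan(pwm, "GGCTTATAAATAGGC", threshold=6.0))

In practice the cut-off would be calibrated against known sites, which is one of the roles played by the curated databases described in the following section.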

2.3 Representation with nucleotide dependency


The Scored Position Specific Pattern (SPSP) representation describes the interdependence between neighboring nucleotides with a number of parameters similar to that of the string and matrix representations. A set of length-l binding site patterns can be described by an SPSP representation P, which contains c (c ≤ l) sets of patterns Pi, 1 ≤ i ≤ c, where each set Pi contains length-li patterns Pi,j over the symbols A, C, G, and T, and ∑i li = l. Each length-li pattern Pi,j is associated with a score si,j that represents the “closeness” of the pattern to a binding site. The lower the score, the more likely the pattern is a binding site (Henry and Francis, 2006).
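The sketch below gives a rough illustration of how a candidate site could be scored under an SPSP-style representation: the candidate is split into consecutive segments, each segment must match one of its allowed patterns, and the segment scores si,j are summed, with lower totals indicating closer resemblance to a binding site. The segments, patterns and scores are invented for illustration and do not reproduce the published SPSP algorithm.

# Each element is one segment: a dict mapping an allowed pattern to its score s_ij.
# Segments are concatenated, so their lengths sum to the motif length l.
spsp = [
    {"TGA": 0.1, "TGT": 0.8},        # segment P1 (length 3)
    {"CGTG": 0.2, "CATG": 0.5},      # segment P2 (length 4)
    {"TC": 0.0, "TT": 0.9},          # segment P3 (length 2)
]

def spsp_score(site, segments):
    """Return the total score of a candidate site, or None if any segment
    does not match one of its allowed patterns.  Lower scores indicate a
    closer resemblance to a binding site."""
    total, pos = 0.0, 0
    for patterns in segments:
        seg_len = len(next(iter(patterns)))
        piece = site[pos:pos + seg_len]
        if piece not in patterns:
            return None           # not a putative binding site at all
        total += patterns[piece]
        pos += seg_len
    return total

print(round(spsp_score("TGACGTGTC", spsp), 2))   # best pattern in every segment
print(round(spsp_score("TGTCATGTT", spsp), 2))   # weaker patterns, higher score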

3. Methods of finding TF binding sites in a DNA sequence


3.1 Searching known motifs
Development of databases holding complete information on experimentally validated TF binding sites is indispensable for promoter sequence analysis. Information about TF binding sites remains scattered in the literature. In the last decade and a half, a phenomenal increase in computational power, together with cheaper electronic storage and faster communication technologies, has resulted in the development of a range of web-accessible databases of experimentally validated TF binding sites. These databases are not only highly useful for the identification of putative TF binding sites in new promoter sequences (Table 1), but are also valuable as the positive datasets required for the development and validation of new TF binding site prediction algorithms.

3.1.1 TRANSFAC
TRANSFAC is the largest repository of transcription factor binding sites. The web-accessible version (TRANSFAC 7.0, 2005) consists of 6,133 factors with 7,915 sites, while the professional version (TRANSFAC 2008.3) consists of 11,683 factors with 30,227 sites. The TRANSFAC database is composed of six tables: SITE, GENE, FACTOR, CELL, CLASS and
MATRIX. The GENE table gives a short explanation of the gene to which a site (or group of sites) belongs; the FACTOR table describes the proteins binding to these sites. CELL gives brief information about the cellular source of the proteins that have been shown to interact with the sites. CLASS contains background information about the transcription factor classes, while the MATRIX table gives nucleotide distribution matrices for the binding sites of transcription factors. This database is the one most frequently used as a reference for TF binding sites as well as for the development of new algorithms. However, new users find it difficult to access the database because it requires search terms to be entered manually. There is no option to select the organism, desired gene or TF from a list, so the web interface is not user-friendly. Other web tools such as TF search and Signal Scan overcome this limitation to a certain extent.

3.1.2 Signal Scan


Signal Scan finds and lists homologies of published TF binding site signal sequences in the
input DNA sequence using the TRANSFAC, TFD and IMD databases. It also allows selection from different classes, viz. mammal, bird, amphibian, insect, plant, other eukaryotes, prokaryote, virus (TRANSFAC only), insect and yeast (TFD only).

3.1.3 TRRD
The transcription regulatory region database (TRRD) is a database of transcription
regulatory regions of the eukaryotic genome. The TRRD database contains three
interconnected tables: TRRDGENES (description of the genes as a whole), TRRDSITES
(description of the sites), and TRRDBIB (references). The current version, TRRD 3.5,
comprises of the description of 427 genes, 607 regulatory units (promoters, enhancers, and
silencers), and 2147 transcription factor binding sites. The TRRDGENES database compiles
the data on human (185 entries), mouse (126), rat (69), chicken (29), and other genes.

Developmental/Environmental stimulus   Transcription factor binding site   Position   Sequence
Core promoter                          TATA Box                            -33        tcccTATAaataa
                                       Cat Box                             -49        gCCAAc
Stress responsive                      G Box                               -66        tgACGgtgt
                                       ABRE                                -76        acaccttctttgACTGtccatccttc
                                       ABI4                                -245       CACCg
Pathogen defense                       W Box                               -72        cttctTTGAcgtgtcca
                                       TCA                                            gAGAAgagaa
Light Response                         I box                               -302       gATATga
Wound specific                         WUN                                 -348       tAATTacac
                                       TCA                                 -646       gAGAAgagaa
Seed Specific                          Legumin                             -118       tccatacCCATgcaagctgaagaatgtc
                                       Opaque-2                            -348       TAATtacacatatttta
                                       Prolamine box                       -385       TTaaaTGTAAAAgtAa
                                       AAGAA-motif                         -294       agaAAGAa
Table 1. In silico analysis of the pigeonpea legumin gene promoter for identification of regulatory elements. A database search reveals that it consists of regulatory elements that can direct its activation under different environmental conditions and developmental stages.
3.1.4 PlantCARE
PlantCARE is a database of plant-specific cis-acting regulatory elements in promoter
regions (Lescot et al., 2002). It generates a sequence analysis output on a dynamic webpage,
on which TF binding sites are highlighted in the input sequence. The database can be
queried on names of transcription factor (TF) sites, motif sequence, function, species, cell
type, gene, TF and literature references. Information regarding TF site, organism, motif
position, strand, core similarity, matrix similarity, motif sequence and function is listed, while the potential sites are mapped onto the query sequence.

3.1.5 PLACE
PLACE is another database of plant cis-acting regulatory elements extracted from published
reports (Higo et al., 1999). It also includes variations in the motifs in different genes or plant
species. PLACE also includes data on non-plant cis-elements that may have plant homologues. The PLACE database also provides a brief description of each motif and links to publications.

3.1.6 RegulonDB
RegulonDB is a comprehensive database of gene regulation and interaction in E. coli. It
consists of data on almost every aspect of gene regulation such as terminators, promoters,
TF binding sites, active and inactive transcription factor conformations, matrix alignments, transcription units, operons, regulatory network interactions, ribosome binding sites (RBS), growth conditions, gene products and small RNAs.

3.1.7 ABS
ABS is a database of known TF binding sites identified in promoters of orthologous
vertebrate genes. It has 650 annotated and experimentally validated binding sites from 68 transcription factors and 100 orthologous target genes in the human, mouse, rat and chicken genome sequences. Although it provides a simple and easy-to-use web interface for data retrieval, it does not facilitate either the analysis of new promoter sequences or the mapping of user-defined motifs in a promoter.

3.1.8 MatInspector
MatInspector identifies cis-acting regulatory elements in nucleotide sequences using a library of weight matrices (Cartharius et al., 2005). It is based on a novel matrix family concept, optimized thresholds, and comparative analysis, which overcome the major limitation of the large number of redundant binding sites predicted by other programs, thus increasing sensitivity while reducing false positive predictions. MatInspector also allows integration of its output with other sequence analysis programs, e.g. DiAlignTF, FrameWorker and SequenceShaper, for in-depth promoter analysis and the design of regulatory sequences. The MatInspector library contains 634 matrices, one of the largest libraries available for public searches.

3.1.9 JASPAR
JASPAR is another open-access database that competes with commercial TF binding site databases such as TRANSFAC (Portales-Casamar et al., 2009). The latest release has a
collection of 457 non-redundant, curated profiles. It is a collection of smaller databases, viz. JASPAR CORE, JASPAR FAM, JASPAR PHYLOFACTS, JASPAR POLII and others, among
which JASPAR CORE is most commonly used. The JASPAR CORE database contains a
curated, non-redundant set of profiles, derived from published collections of experimentally
determined transcription factor binding sites for multicellular eukaryotes (Portales-Casamar
et al., 2009). The JASPAR database can also be accessed remotely through an external application programming interface (API).

3.1.10 Cister: cis-element cluster finder


Cister is based on the technique of posterior decoding with a hidden Markov model and predicts regulatory regions in DNA sequences by searching for clusters of cis-elements (Frith et al., 2001). The Cister input page offers 16 common TF sites with which to define a cluster, and additional user-defined PWMs or TRANSFAC entries can also be entered. For web-based analysis the maximum input sequence length is 100 kb; however, the program is downloadable for standalone applications and the analysis of longer sequences.

3.1.11 MAPPER
MAPPER stands for Multi-genome Analysis of Positions and Patterns of Elements of Regulation. It is a platform for the computational identification of TF binding sites in multiple genomes (Marinescu et al., 2005). MAPPER consists of three modules (the MAPPER database, the Search Engine, and rSNPs) and combines TRANSFAC and JASPAR data. However, the MAPPER database is limited to TFBSs found only in the promoters of genes from the human, mouse and D. melanogaster genomes.

3.1.12 Stubb
Like Cister, Stubb also uses hidden Markov models (HMM) to obtain a statistically
significant score for modules (Sinha et al., 2006). Stubb is more suitable for finding modules over genomic scales with a small set of transcription factors whose binding sites are known. Stubb differs from MAPPER in that the application of the latter is limited to binding sites of a single given motif in an input sequence.

3.1.13 Clover
Clover is another program for identifying functional sites in DNA sequences. It takes a set of DNA sequences that share a common function, compares them to a library of sequence motifs (e.g. transcription factor binding patterns), and identifies which, if any, of the motifs are statistically overrepresented in the sequence set (Frith et al., 2004). It requires two input files, one for sequences in FASTA format and another for sequence motifs. Clover provides the JASPAR CORE collection of TF binding sites, which can be converted to the Clover format. Clover is also available as a standalone application for Windows, Linux and Mac operating systems.

3.1.14 RegSite
RegSite is the largest plant-specific repository of transcription factor binding sites. The current RegSite release contains 1816 entries. It is used by transcription start site prediction programs (Sinha et al., 2006).
3.1.15 JPREdictor
JPREdictor is a Java-based cis-regulatory TF binding site prediction program (Fiedler and
Rehmsmeier, 2006). The JPREdictor can use different types of motifs: Sequence Motifs,
Regular Expression Motifs, PSPMs as well as PSSMs and the complex motif type
(MultiMotifs). This tool can be used for the prediction of cis-regulatory elements on a
genome-wide scale.

3.2 Motif finding programs


3.2.1 Phylogenetic footprinting
Comparative DNA sequence analysis shows local differences in mutation rates and reveals a functional site by virtue of its conservation in a background of non-functional sequences. In the phylogenetic equivalent, regulatory elements are protected from random drift across evolutionary time by selection. Orthologous non-coding DNA sequences from multiple species therefore provide a strong basis for the identification of regulatory elements by phylogenetic footprinting (Fig. 1) (Rombauts et al., 2003).
The major advantage of phylogenetic footprinting over single-genome approaches is that the multigene approach requires data from co-regulated genes, whereas phylogenetic footprinting can identify regulatory elements present in a single gene that have remained conserved during the divergence of the species under investigation. With the steep increase in available complete genome sequences, across-species comparisons for a wide variety of organisms have become possible (Blanchette and Tompa, 2002; Das and Dai, 2007). A multiple sequence alignment algorithm suited to phylogenetic footprinting should be able to identify small (5-15 bp) sequences in a background of highly diverse sequences.

Fig. 1. Identification of new regulatory elements (L-19) in legumin gene promoters by phylogenetic footprinting.
3.2.1.1 Clustal W, LAGAN, AVID
In phylogenetic footprinting, the primary aim is to construct a global multiple alignment of the orthologous promoter sequences and then identify regions conserved across the orthologous sequences. Alignment algorithms such as ClustalW (Thompson et al., 1994), LAGAN (Brudno et al., 2003), AVID (Bray et al., 2003) and the Bayes Block Aligner (Zhu et al., 1998) have proven useful for phylogenetic footprinting, but the short length of the conserved motifs compared to the length of the non-conserved background sequence, and their variable position in a promoter, hamper the alignment of conserved motifs. Moreover, multiple sequence alignment does not reveal meaningful biological information if the species used
for comparison are too closely related. If the species are too distantly related, it is difficult to
find an accurate alignment. This calls for computational tools that bypass the requirement for sequence alignment completely and have the capability to identify short, scattered conserved regions.
3.2.1.2 MEME, Consensus, Gibbs sampler, AlignAce
In cases where multiple alignment algorithms fail, motif finding algorithms such as MEME, Consensus and the Gibbs sampler have been used (Fig. 2). The feasibility of using comparative DNA sequence analysis to identify functional sequences in the genome of S. cerevisiae, with the goal of identifying regulatory sequences and sequences specifying non-protein-coding RNAs, was investigated by Cliften et al. (2001). It was found that most of the DNA sequences of the closely related Saccharomyces species aligned to S. cerevisiae sequences, and known promoter regions were conserved in the alignments. Pattern search algorithms like
CONSENSUS (Hertz et al., 1990), Gibbs sampling (Lawrence et al., 1993) and AlignAce
(Roth et al., 1998) were useful for identifying known regulatory sequence elements in the
promoters, where they are conserved through the most diverged Saccharomyces species.
Gibbs sampler was used for motif finding using phylogenetic footprinting in proteobacterial
genomes (McCue et al., 2001). These programs employ two approaches for motif finding.
One approach is to employ a training set of transcription factor binding sites and a scoring
scheme to evaluate predictions. The scoring scheme is often based on information theory
and the training set is used to empirically determine a score threshold for reporting of the
predicted transcription factor binding sites. The second method relies on a rigorous
statistical analysis of the predictions, based upon modeled assumptions. The statistical significance of a sequence match to a motif can be assessed through the determination of a p-value: the probability of observing a match with a score as good or better in a randomly generated search space of identical size and nucleotide composition. The smaller the p-value, the lower the probability that the match is due to chance alone. Since motif finding algorithms assume the input sequences to be independent, they are limited by the fact that data sets containing a mixture of closely related species will give those species an unduly high weight in the reported motifs.
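The empirical flavour of such a p-value can be illustrated with the short Python sketch below, which shuffles a query sequence (preserving its length and nucleotide composition) and asks how often a random sequence matches a consensus motif at least as well as the original does. The consensus, the toy sequence and the use of a simple match count as the score are illustrative assumptions, not the scoring scheme of any specific program.

import random

def best_match_score(sequence, consensus):
    """Best (maximum) number of positions matching the consensus over all windows."""
    l = len(consensus)
    return max(
        sum(a == b for a, b in zip(sequence[i:i + l], consensus))
        for i in range(len(sequence) - l + 1)
    )

def empirical_p_value(sequence, consensus, trials=1000, seed=1):
    """Fraction of shuffled sequences (identical length and nucleotide composition)
    whose best match score is at least as good as the observed one."""
    random.seed(seed)
    observed = best_match_score(sequence, consensus)
    composition = list(sequence)            # shuffling preserves composition
    hits = 0
    for _ in range(trials):
        random.shuffle(composition)
        if best_match_score("".join(composition), consensus) >= observed:
            hits += 1
    return observed, hits / trials

if __name__ == "__main__":
    promoter = "GCGGCCATACCCATGCAAGCTGAAGAATGTCGCGC"   # toy sequence
    print(empirical_p_value(promoter, "CCATGCAAGCTG"))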
Genome sequences that are as optimally diverged as possible were also compared within the Saccharomyces genus. Phylogenetic footprints were searched for among the genome sequences of six Saccharomyces species using the sequence alignment tool CLUSTAL W, and many statistically significant conserved sequence motifs were found (Cliften et al., 2003).

Fig. 2. Combined Block diagram of an MEME output highlighting conserved motifs in


promoter regions of legumin seed storage protein genes of four different species.
3.2.1.3 Footprinter
This promising novel algorithm was developed to overcome the limitations imposed by
motif finding algorithms. This algorithm identifies the most conserved motifs among the
input sequences as measured by a parsimony score on the underlying phylogenetic tree
(Blanchette and Tompa, 2002). It uses dynamic programming to find the most parsimonious k-mers from the input sequences, where k is the motif length. In general, the algorithm
selects motifs that are characterized by a minimal number of mismatches and are conserved
over long evolutionary distances. Furthermore, the motifs should not have undergone
independent losses in multiple branches. In other words, the motif should be present in the
sequences of subsequent taxa along a branch. The algorithm, based on dynamic
programming, proceeds from the leaves of the phylogenetic tree to its root and seeks
motifs of a user-defined length with a minimum number of mismatches. Moreover, the
algorithm allows a higher number of mismatches for those sequences that span a greater
evolutionary distance. Motifs that are lost along a branch of the tree are assigned an
additional cost because it is assumed that multiple independent losses are unlikely in
evolution. To compensate for spurious hits, statistical significance is calculated based on a
random set of sequences in which no motifs occur.
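FootPrinter's dynamic program is considerably more elaborate, but the notion of a parsimony score on a phylogenetic tree can be illustrated with the sketch below: given one candidate k-mer per species and a hypothetical tree topology, Fitch's small-parsimony algorithm, applied column by column, counts the minimal number of substitutions needed to explain the observed motifs. The species names, the tree and the k-mers are invented for the example and are not data from FootPrinter.

def fitch_column(tree, states):
    """Fitch small-parsimony for a single alignment column.
    `tree` is a nested tuple of leaf names; `states` maps leaf name -> base.
    Returns (set of candidate ancestral bases, substitution count)."""
    if isinstance(tree, str):                       # leaf node
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_column(left, states)
    rset, rcost = fitch_column(right, states)
    if lset & rset:
        return lset & rset, lcost + rcost
    return lset | rset, lcost + rcost + 1           # one inferred substitution

def parsimony_score(tree, kmers):
    """Total substitutions needed to explain one k-mer per species on the tree."""
    k = len(next(iter(kmers.values())))
    return sum(
        fitch_column(tree, {sp: kmers[sp][j] for sp in kmers})[1]
        for j in range(k)
    )

if __name__ == "__main__":
    # Hypothetical species tree and candidate motif occurrences (one per species).
    tree = (("cajanus", "glycine"), ("pisum", "vicia"))
    candidates = {
        "cajanus": "CCATGCAA",
        "glycine": "CCATGCAA",
        "pisum":   "CCATGTAA",
        "vicia":   "CCACGCAA",
    }
    print(parsimony_score(tree, candidates))   # 2 substitutions in this toy example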
3.2.1.4 CONREAL
CONREAL (Conserved Regulatory Elements Anchored Alignment Algorithm) is another
motif finding algorithm based on phylogenetic footprinting (Berezikov et al., 2005). This
algorithm uses potential motifs as represented by positional weight matrices (81 vertebrate
matrices from the JASPAR database and 546 matrices from the TRANSFAC database) to establish
anchors between orthologous sequences and to guide promoter sequence alignment.
Comparison of the performance of CONREAL with the global alignment programs LAGAN
and AVID using a reference data set shows that CONREAL performs equally well for
closely related species like rodents and human, and has a clear added value for aligning
promoter elements of more divergent species like human and fish, as it identifies conserved
transcription-factor binding sites that are not found by other methods.
3.2.1.5 PHYLONET
The PHYLONET computational approach, which identifies conserved regulatory motifs directly from whole genome sequences of related species without reliance on additional information, was developed by Wang and Stormo (2005). The major steps involved are: i) construction of phylogenetic profiles for each promoter, ii) searching through the entire profile space of all the promoters in the genome to identify conserved motifs and the promoters that contain them using an algorithm like BLAST, and iii) determination of the statistical significance of the motifs (Karlin and Altschul, 1990). By comparing promoters using phylogenetic profiles (multiple sequence alignments of orthologous promoters) rather than individual sequences, together with the application of modified Karlin–Altschul statistics, they readily distinguished biologically relevant motifs from background noise. When applied to 3524 Saccharomyces cerevisiae promoters with Saccharomyces mikatae, Saccharomyces kudriavzevii, and Saccharomyces bayanus sequences as references, PHYLONET identified 296 statistically significant motifs
with a sensitivity of >90% for known transcription factor binding sites. The specificity of the
predictions appears very high because most predicted gene clusters have additional
supporting evidence, such as enrichment for a specific function, in vivo binding by a known
TF, or similar expression patterns.
However, the prediction of additional transcription factor binding sites by comparison of a motif to the promoter regions of an entire genome has its own problems due to the large
database size and the relatively small width of a typical transcription factor binding site.
There is an increased chance of identification of many sites that match the motif and the
variability among the transcription factor binding sites permits differences in the level of
regulation, due to the altered intrinsic affinities for the transcription factor (Carmack et al.,
2007).
3.2.1.6 Phyloscan
PhyloScan combines evidence from matching sites found in orthologous data from several related species with evidence from multiple sites within an intergenic region. The orthologous sequence data may be multiply aligned, unaligned, or a combination of aligned and unaligned. In aligned data, PhyloScan statistically accounts for the phylogenetic dependence of the species contributing data to the alignment, and in unaligned data the evidence for sites is combined assuming phylogenetic independence of the species. The
statistical significance of the gene predictions is calculated directly, without employing
training sets (Carmack et al., 2007). The application of the algorithm to real sequence data
from seven Enterobacteriales species identifies novel Crp and PurR transcription factor
binding sites, thus providing several new potential sites for these transcription factors.
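The flavour of combining evidence from several putative sites can be shown with a generic statistic. The sketch below uses Fisher's method to combine independent per-site P-values into a single P-value for an intergenic region; it is only an illustration of evidence combination under independence, not PhyloScan's actual statistic, and the P-values are invented.

import math
from scipy.stats import chi2

def fisher_combined_pvalue(pvalues):
    # Fisher's method: -2 * sum(ln p_i) follows a chi-square distribution
    # with 2k degrees of freedom when all k null hypotheses are true
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2.sf(statistic, df=2 * len(pvalues))

# invented P-values for three putative sites within one intergenic region
print(fisher_combined_pvalue([0.04, 0.10, 0.02]))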

3.3 Software suites for motif discovery
3.3.1 BEST
BEST is a suite that bundles four motif-finding programs, AlignACE (Roth et al., 1998), BioProspector (Liu et al., 2001), Consensus (Hertz and Stormo, 1999) and MEME (Bailey et al., 2006), together with the optimization program BioOptimizer (Jensen and
Liu, 2004). BEST was compiled on Linux, and thus it can only be run on Linux machines
(Che et al., 2005).

3.3.2 Seqmotifs
Seqmotifs is a suite of web-based programs to find regulatory motifs in co-regulated genes of both prokaryotes and eukaryotes. In this suite, BioProspector (Liu et al., 2001) is used for finding regulatory motifs in prokaryote or lower eukaryote sequences, while CompareProspector (Liu et al., 2002) is used for higher eukaryotes. Another program, MDscan (Liu et al., 2002), is used for finding protein-DNA interaction sites from ChIP-on-chip targets. These programs analyze a group of sequences from co-regulated genes that may share common regulatory motifs and output a list of putative motifs as position-specific probability matrices, the individual sites used to construct the motifs, and the location of each site on the input sequences. CompareProspector has been used to identify binding motifs for the transcription factors Mef2, Myf, Srf, and Sp1 from human muscle-specific co-regulated genes. Additionally, in a C. elegans–C. briggsae comparison, CompareProspector found the PHA-4 motif and the UNC-86 motif (Liu et al., 2004). Another C. elegans CompareProspector analysis showed that intestine-expressed genes have a GATA transcription factor binding motif that was later experimentally validated (Pauli et al., 2006).

3.3.3 RSAT
The Regulatory Sequence Analysis Tools (RSAT) is an integrated online tool to analyze
regulatory sequences in co-regulated genes (Thomas-Chollier et al., 2008). The only input required is a list of genes of interest; upstream sequences of the desired length can then be retrieved for that list. Subsequent tasks, such as the detection of putative regulatory signals and the matching of positions for the detected signals in the original dataset, can then be performed. The suite includes programs for sequence retrieval, pattern discovery, phylogenetic footprint detection, pattern matching, genome scanning and feature map drawing. Random controls can be performed with random gene selections or by generating random sequences according to a variety of background models (Bernoulli, Markov). The results can be displayed graphically, highlighting the desired features. As the RSAT web services are implemented using SOAP and WSDL, Perl, Python or Java scripts can be used to develop custom workflows combining the different tools.
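Because the services are exposed through a standard WSDL, any SOAP-capable client can drive them. The fragment below sketches this idea with the Python zeep library; the WSDL URL, the retrieve_seq operation name and the request fields shown are assumptions about the RSAT interface rather than verified details, so they should be checked against the current service description (for example, python -m zeep <wsdl-url> lists the operations a WSDL actually offers).

import zeep

# Assumed location of the RSAT web-services description; check the RSAT
# site for the current WSDL URL before running this sketch.
WSDL_URL = "http://rsat.scmbb.ulb.ac.be/rsat/web_services/RSATWS.wsdl"

client = zeep.Client(wsdl=WSDL_URL)

# Hypothetical call: retrieve upstream sequences for two yeast genes.
# The operation name and parameter keys are assumptions and may differ.
request = {
    "organism": "Saccharomyces_cerevisiae",
    "query": ["PHO5", "PHO8"],
    "from": -800,   # 800 bp upstream
    "to": -1,
}
print(client.service.retrieve_seq(request))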

3.3.4 TFM
TFM (Transcription Factor Matrices) is a software suite for identifying and analyzing
transcription factor binding sites in DNA sequences. It consists of TFM-Explorer, TFM-Scan,
TFM-Pvalue and TFM-Cluster.
TFM-Explorer (Transcription Factor Matrix Explorer) proceeds in two steps: it scans sequences to detect all potential TF binding sites, using JASPAR or TRANSFAC matrices, and then extracts the significant transcription factors.
TFM-Scan is a program dedicated to the location of large sets of putative transcription factor
binding sites on a DNA sequence. It uses Position Weight Matrices such as those available in
the Transfac or JASPAR databases. The program takes as input a set of matrices and a DNA
sequence. It computes all occurrences of matrices on the sequence for a given P-value
threshold. The algorithm is very fast and allows for large-scale analysis. TFM-Scan is also able to cluster similar matrices and similar occurrences.
TFM-Pvalue provides tools for computing the score threshold associated with a given P-value and the P-value associated with a given score threshold. It uses Position
Weight Matrices such as those available in the Transfac or JASPAR databases. The program
takes as input a matrix, the background probabilities for the letters of the DNA alphabet and
a score or a P-value.
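The relationship between a score threshold and its P-value can be made concrete with a short sketch. The code below is not the TFM-Pvalue algorithm itself, but a simplified illustration of the same idea: the distribution of total PWM scores under a background model is built up position by position, and the P-value of a threshold is the summed probability of all scores at or above it. The example matrix (with rounded integer scores) and background frequencies are invented.

from collections import defaultdict

def score_distribution(pwm, background):
    # pwm: list of dicts, one per position, mapping base -> (rounded) score
    # background: dict mapping base -> probability
    # returns a dict mapping total score -> probability under the background
    dist = {0: 1.0}
    for column in pwm:
        new = defaultdict(float)
        for total, p in dist.items():
            for base, s in column.items():
                new[total + s] += p * background[base]
        dist = dict(new)
    return dist

def pvalue(pwm, background, threshold):
    # P(score >= threshold) for a random word drawn from the background model
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)

# toy 4-column matrix with integer log-odds-like scores
pwm = [{"A": 2, "C": -1, "G": -1, "T": 0},
       {"A": -2, "C": 3, "G": -1, "T": -2},
       {"A": -1, "C": -1, "G": 3, "T": -2},
       {"A": 1, "C": -1, "G": -1, "T": 2}]
bg = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(pvalue(pwm, bg, threshold=6))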

3.3.5 rVISTA
rVISTA (regulatory VISTA) combines searches of the major transcription factor binding site database, TRANSFAC Professional from Biobase, with comparative sequence analysis; this procedure reduces the number of predicted transcription factor binding sites by several
orders of magnitude (Loots and Ovcharenko, 2004). It can be used directly or through links
in mVISTA, Genome VISTA, or VISTA Browser. Human and mouse sequences are aligned
using the global alignment program AVID (Bray et al., 2003).

3.3.6 Mulan
Mulan brings together several novel algorithms: the TBA multi-aligner program for rapid
identification of local sequence conservation and the multiTF program for detecting
evolutionarily conserved transcription factor binding sites in multiple alignments. In
addition, Mulan supports two-way communication with the GALA database; alignments of
multiple species dynamically generated in GALA can be viewed in Mulan, and conserved
transcription factor binding sites identified with Mulan/multiTF can be integrated and
overlaid with extensive genome annotation data using GALA. Local multiple alignments
computed by Mulan ensure reliable representation of short- and large-scale genomic
rearrangements in distant organisms. Mulan allows for interactive modification of critical
conservation parameters to differentially predict conserved regions in comparisons of both
closely and distantly related species. The uses and applications of the Mulan tool have been demonstrated through multispecies comparisons of the GATA3 gene locus and the identification of elements that are conserved differently in avian genomes than in others, allowing speculation on the evolution of birds.

3.3.7 MotifVoter
MotifVoter is a variance-based ensemble method for the discovery of binding sites. It uses ten of the most commonly used individual motif finders as its components (Wijaya et al., 2008):
AlignACE (Hughes et al., 2000), MEME (Bailey and Elkan, 1994; Bailey et al., 2006),
ANNSpec, Mitra, BioProspector, MotifSampler, Improbizer, SPACE, MDScan and
Weeder. All programs can be selected individually or collectively. Though the existing
ensemble methods overall perform better than stand-alone motif finders, the improvement
gained is not substantial. These methods do not fully exploit the information obtained from
the results of individual finders, resulting in minor improvement in sensitivity and poor
precision.

3.3.8 ConSite
ConSite is a web-based tool for finding cis-regulatory elements in genomic sequences (Sandelin et al., 2004). Two genomic sequences submitted for analysis are aligned by the ORCA method. Alternatively, pre-aligned sequences can be submitted in ClustalW, MSF (GCG), Fasta or pairwise BLAST format. For the analysis, transcription factors can be selected on the basis of species, name, domain or a user-defined matrix (raw counts matrix or position weight
matrix). Predictions are based on the integration of binding site prediction generated with
high-quality transcription factor models and cross-species comparison filtering
(phylogenetic footprinting). ConSite (Sandelin et al., 2004) is based on the JASPAR database
(Portales-Casamar et al., 2009). By incorporating evolutionary constraints, selectivity is
increased by an order of magnitude as compared to single sequence analysis. ConSite offers
several unique features, including an interactive expert system for retrieving orthologous
regulatory sequences.

3.3.9 OPOSSUM
OPOSSUM identifies statistically over-represented, conserved TFBSs in the promoters of co-
expressed genes (Ho Sui et al., 2005). OPOSSUM integrates a precomputed database of
predicted, conserved TFBSs, derived from phylogenetic footprinting and TFBS detection
algorithms, with statistical methods for calculating overrepresentation. The background
data set was compiled by identifying all strict one-to-one human/mouse orthologues from the Ensembl database. These orthologues were then aligned using ORCA, a pairwise DNA
alignment program. The conserved non-coding regions were identified. The conserved
regions which fell within 5000 nucleotides upstream and downstream of the transcription
start site (TSS) were then scanned for TF sites using the position weight matrices (PWMs)
from the JASPAR database (Portales-Casamar et al., 2009). These TF sites were stored in the
OPOSSUM database and comprise the background set.
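The flavour of such an over-representation test can be shown with a small sketch: the number of co-expressed genes whose conserved regions contain at least one predicted site for a given TF is compared with the corresponding count in the background set using a one-tailed Fisher exact test. This illustrates the general approach rather than OPOSSUM's exact statistics, and the counts are invented.

from scipy.stats import fisher_exact

def tfbs_overrepresentation(hits_fg, total_fg, hits_bg, total_bg):
    # hits_fg / total_fg: genes with the site / all genes in the co-expressed set
    # hits_bg / total_bg: the same counts in the background gene set
    table = [[hits_fg, total_fg - hits_fg],
             [hits_bg, total_bg - hits_bg]]
    _, p = fisher_exact(table, alternative="greater")
    return p

# invented counts: 18 of 40 co-expressed genes vs. 900 of 15000 background genes
print(tfbs_overrepresentation(18, 40, 900, 15000))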
3.3.10 TOUCAN2
TOUCAN 2 is an operating-system-independent, open source, Java-based workbench for regulatory sequence analysis (Aerts et al., 2004). It can be used for the detection of significant transcription factor binding sites from comparative genomics data or for the detection of combinations of binding sites in sets of co-expressed/co-regulated genes. It tightly integrates with Ensembl and EMBL for the retrieval of sequence data. TOUCAN provides options to align sequences with different specialized algorithms, viz. AVID (Bray et al., 2003), LAGAN (Brudno et al., 2003), or BLASTZ. The MotifScanner algorithm is used to search for occurrences of transcription factor binding sites using libraries of position weight matrices from TRANSFAC 6 (Matys et al., 2003), JASPAR, PLANTCARE (Lescot et al., 2002), SCPD and others. MotifSampler can be used for the detection of over-represented motifs. More significantly, TOUCAN provides an option to select cis-regulatory modules using ModuleSearch. In essence, TOUCAN 2 provides one of the best integrations of different algorithms for the identification of cis-regulatory elements.

3.3.11 WebMOTIFS
The WebMOTIFS web server combines the TAMO and THEME tools for the identification of conserved motifs in co-regulated genes (Romer et al., 2007). TAMO combines results from four motif discovery programs, viz. AlignACE, MDscan, MEME, and Weeder, followed by clustering of the results (Gordon et al., 2005). Subsequently, Bayesian analysis of known motifs is done by THEME. Thus it integrates de novo motif discovery programs with Bayesian approaches to identify the most significant motifs. However, the current version of WebMOTIFS supports motif discovery only for the S. cerevisiae, M. musculus, and H. sapiens genomes.

3.3.12 Pscan
Pscan is a software tool that scans a set of sequences (e.g. promoters) from co-regulated or
co-expressed genes with motifs describing the binding specificity of known transcription
factors (Zambelli et al., 2009). It assesses which motifs are significantly over- or under-represented, thus providing hints on which transcription factors could be common
regulators of the genes studied, together with the location of their candidate binding sites in
the sequences. Pscan does not resort to comparisons with orthologous sequences and
experimental results show that it compares favorably to other tools for the same task in
terms of false positive predictions and computation time.
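The kind of test Pscan performs can be illustrated with a small sketch: each promoter in the gene set is assigned the best match score of a given weight matrix, and the mean of these scores is compared with the mean over a background set of promoters. The code below uses a simple z-statistic for this comparison; it is an illustration of the principle, not Pscan's actual scoring or statistics, and the matrix and sequences are invented.

import math

def best_pwm_score(seq, pwm):
    # best (highest) sum-of-weights match of the PWM anywhere on one strand
    width = len(pwm)
    return max(sum(pwm[i][seq[j + i]] for i in range(width))
               for j in range(len(seq) - width + 1))

def overrepresentation_z(target_seqs, background_seqs, pwm):
    # z-statistic comparing mean best scores in the target vs. background set
    t = [best_pwm_score(s, pwm) for s in target_seqs]
    b = [best_pwm_score(s, pwm) for s in background_seqs]
    mean_b = sum(b) / len(b)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    mean_t = sum(t) / len(t)
    return (mean_t - mean_b) / math.sqrt(var_b / len(t))

# toy log-odds-like matrix for a 4-bp site and invented promoter sequences
pwm = [{"A": 1.2, "C": -0.8, "G": -0.8, "T": 0.1},
       {"A": -1.0, "C": 1.5, "G": -0.9, "T": -1.0},
       {"A": -0.9, "C": -0.9, "G": 1.5, "T": -1.0},
       {"A": 0.2, "C": -0.8, "G": -0.8, "T": 1.1}]
coexpressed = ["TTACGTAGGA", "GGACGTTTCA", "CAACGTAGTC"]
background = ["TTTTTTGGGG", "CACACACACA", "GGGGGAAAAA", "TGCATGCATG"]
print(overrepresentation_z(coexpressed, background, pwm))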

4. Identification of regulatory elements in legumin promoter


Plant seeds are a rich source of almost all essential dietary components, viz. proteins, carbohydrates, and lipids. Seeds not only provide nutrients to the germinating seedling but are also a major source of energy and other cellular building blocks for humans and other heterotrophs. Consequently, there is a plethora of pathogens attacking plant seeds. Certain pests, such as coleopteran insects of the family Bruchidae, have co-evolved with leguminous plants (Sales et al., 2000). It is believed that seed storage proteins, most of which are not essential for the establishment of the new plant following germination, contribute to the protection and defense of seeds against pathogens and predators.
Genes encoding seed storage proteins, like zein, phaseolin and legumin, were among the first plant genes studied at the gene expression level. Heterologous expression of reporter genes confirmed that their promoters direct seed-specific expression, and subsequently these seed storage gene promoters were used for developing transgenic plants expressing different genes of interest. Earlier, we isolated the legumin gene promoter from pigeon pea (Cajanus cajan) and identified regulatory elements in its promoter region (Jaiswal et al., 2007).
Sequence analysis with PLACE, PLANTCARE, MATINSPECTOR and TRANSFAC shows that the legumin promoter not only contains regulatory elements for seed-specific expression but also has elements that are present in the promoters of other genes involved in pathogen defense, abiotic stress, light response and wounding (Table 1). Our previous study also confirmed that the legumin promoter is expressed in the developing seedling as well (Jaiswal et al., 2007). Recent studies have shown that these promoters are expressed in non-seed tissues (Zakharov et al., 2004) and play a role in bruchid (insect) resistance (Sales et al., 2000).
In such a scenario, where a seed storage protein performs the additional task of pathogen defense, its promoter must contain elements responsive to such stresses. In fact, the legumin promoter contains transcription factor binding sites for wounding, a signal of insect attack and pathogen defense (Table 1). Since legumin promoter sequences are available from different species, legumin is a good system for the identification of novel regulatory elements in these promoters. Phylogenetic footprinting analysis reveals the presence of another conserved motif 19 base pairs downstream of the legumin box (Fig. 1), named the L-19 box (Jaiswal et al., 2007). Further, MEME analysis shows that, in addition to the four conserved blocks corresponding to the TATA box, G box, legumin box and L-19 box, there are other conserved, non-overlapping sequence blocks that were not revealed by multiple sequence alignment-based phylogenetic footprinting (Fig. 2).

5. Conclusion
With the critical role of cis-regulatory elements in the differentiation of specific cells leading to the growth, development and survival of an organism, scientists have a great interest in their identification and characterization. However, due to the limited availability of known transcription factors, identification of the corresponding regulatory elements through conventional DNA-protein interaction techniques is challenging. The rapid increase in the number of complete genome sequences and the identification of co-regulated genes through microarray technology, together with the available electronic storage and computational power, have challenged the scientific community to integrate these advancing technologies and develop computational programs to identify these short but functionally important regulatory sequences. Some progress has been made in this direction, and databases like TRANSFAC and others store libraries of transcription factor binding sites. However, there are limitations, primarily because the publicly available libraries are very small and complete datasets are not freely available. Secondly, because of the very small size of the binding sites there is a certain degree of redundancy in binding, and therefore the chances of false prediction are high. These limitations have been overcome to some extent by databases like JASPAR that are freely available and have a collection of regulatory elements large enough to compete with the commercially available datasets. Another concern with cis-acting regulatory elements is that the data pool of these functional non-coding transcription factor binding sites is very small (a few thousand), compared with the fact that thousands of genes are expressed in a cell at any point of time and every gene is transcribed by a combination of at least 5-10 transcription factors. Phylogenetic footprinting has enormous potential for identifying new
regulatory elements and completing the gene network map. Although routine sequence alignment programs such as ClustalW fail to align short conserved sequences in a background of hypervariable sequences, more sophisticated sequence alignment programs have been developed specifically for the identification of conserved regulatory elements. Programs such as CONREAL use available transcription factor binding site data to anchor the alignment of two sequences, which considerably decreases the chances of missing a regulatory site. Moreover, other approaches such as MEME abandon sequence alignment altogether and directly identify rare conserved blocks, even if they have moved to the complementary strand. With the increasing sophistication and accuracy of motif-finding programs and the available genome sequences, it can be assumed that knowledge of these regulatory sequences will steadily increase (Table 2). Once sufficient data are available, they can be used for the development of synthetic promoters with desired expression patterns (Fig. 3).

Fig. 3. Future prospects: development of synthetic promoters for expression of a gene of interest in a desired tissue at a defined time and developmental stage. Example: regulatory elements from wound-, fruit- and seed-specific promoters can be assembled and combined with the strong CaMV 35S promoter for high-level expression of the desired gene in all these tissues.
Tool: Web site

Sequence Alignment
Blast-Z: http://www.bx.psu.edu/miller_lab/
Dialign: http://dialign.gobics.de/chaos-dialign-submission
AVID (Mvista): http://genome.lbl.gov/vista/mvista/submit.shtml
Lagan: http://lagan.stanford.edu/lagan_web/index.shtml
Clustal W: http://www.ebi.ac.uk/Tools/msa/clustalw2/

TF Binding Site search
Consite: http://www.phylofoot.org/consite
CONREAL: http://conreal.niob.knaw.nl/
PromH: http://www.softberry.com/berry.phtml?topic=promhg&group=programs&subgroup=promoter
Trafac: http://trafac.cchmc.org/trafac/index.jsp
Footprinter: http://wingless.cs.washington.edu/htbin-post/unrestricted/FootPrinterWeb/FootPrinterInput2.pl
rVISTA: http://rvista.dcode.org/
TFBIND: http://tfbind.hgc.jp/
TESS: http://www.cbil.upenn.edu/cgi-bin/tess/tess
TFSearch: http://www.cbrc.jp/research/db/TFSEARCH.html
Toucan: http://homes.esat.kuleuven.be/~saerts/software/toucan.php
Phyloscan: http://bayesweb.wadsworth.org/cgi-bin/phylo_web.pl
OTFBS: http://www.bioinfo.tsinghua.edu.cn/~zhengjsh/OTFBS/
PROMO: http://alggen.lsi.upc.es/cgi-bin/promo_v3/promo/promoinit.cgi?dirDB=TF_8.3
R-Motif: http://bioportal.weizmann.ac.il/~lapidotm/rMotif/html/

Motif Finding
MEME: http://meme.sdsc.edu/meme4_6_1/intro.html
AlignAce: http://atlas.med.harvard.edu/cgi-bin/alignace.pl
MotifVoter: http://compbio.ddns.comp.nus.edu.sg/~edward/MotifVoter2/
RSAT: http://rsat.scmbb.ulb.ac.be/rsat/
Gibbs Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
BioProspector: http://ai.stanford.edu/~xsliu/BioProspector/
MatInspector: http://www.genomatix.de/
Improbizer: http://users.soe.ucsc.edu/~kent/improbizer/improbizer.html
WebMOTIFS: http://fraenkel.mit.edu/webmotifs-tryit.html
Pscan: http://159.149.109.9/pscan/
FootPrinter: http://wingless.cs.washington.edu/htbin-post/unrestricted/FootPrinterWeb/FootPrinterInput2.pl

Table 2. Regulatory sequences identification programs.

6. References
Aerts, S., Van Loo, P., Thijs, G., Mayer, H., de Martin, R., Moreau, Y., and De Moor, B.
(2004). TOUCAN 2: the all-inclusive open source workbench for regulatory
sequence analysis. Nucleic Acids Research 33, W393-W396.
Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to
discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36.
Bailey, T.L., Williams, N., Misleh, C., and Li, W.W. (2006). MEME: discovering and
analyzing DNA and protein sequence motifs. Nucleic Acids Research 34, W369-
W373.
Berezikov, E., Guryev, V., and Cuppen, E. (2005). CONREAL web server: identification and
visualization of conserved transcription factor binding sites. Nucleic Acids
Research 33, W447-W450.
Blanchette, M., and Tompa, M. (2002). Discovery of Regulatory Elements by a
Computational Method for Phylogenetic Footprinting. Genome Research 12, 739-
748.
Bray, N., Dubchak, I., and Pachter, L. (2003). AVID: A Global Alignment Program. Genome
Research 13, 97-102.
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Program, N.C.S., Green, E.D.,
Sidow, A., and Batzoglou, S. (2003). LAGAN and Multi-LAGAN: Efficient Tools for
Large-Scale Multiple Alignment of Genomic DNA. Genome Research 13, 721-731.
Buhler, J., and Tompa, M. (2002). Finding motifs using random projections. J Comput Biol 9,
225-242.
Carmack, C.S., McCue, L., Newberg, L., and Lawrence, C. (2007). PhyloScan: identification
of transcription factor binding sites using cross-species evidence. Algorithms for
Molecular Biology 2, 1.
Cartharius, K., Frech, K., Grote, K., Klocke, B., Haltmeier, M., Klingenhoff, A., Frisch, M.,
Bayerlein, M., and Werner, T. (2005). MatInspector and beyond: promoter analysis
based on transcription factor binding sites. Bioinformatics 21, 2933-2942.
Che, D., Jensen, S., Cai, L., and Liu, J.S. (2005). BEST: Binding-site Estimation Suite of Tools.
Bioinformatics 21, 2909-2911.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R.,
Cohen, B.A., and Johnston, M. (2003). Finding Functional Features in
Saccharomyces Genomes by Phylogenetic Footprinting. Science.
Cliften, P.F., Hillier, L.W., Fulton, L., Graves, T., Miner, T., Gish, W.R., Waterston, R.H., and
Johnston, M. (2001). Surveying Saccharomyces Genomes to Identify Functional
Elements by Comparative DNA Sequence Analysis. Genome Research 11, 1175-
1186.
Das, M., and Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC
Bioinformatics 8, S21.
Fiedler, T., and Rehmsmeier, M. (2006). jPREdictor: a versatile tool for the prediction of cis-
regulatory elements. Nucleic Acids Research 34, W546-W550.
Francis, C., and Henry, C.M.L. (2008). DNA Motif Representation with Nucleotide
Dependency. IEEE/ACM Trans. Comput. Biol. Bioinformatics 5, 110-119.
Frith, M.C., Hansen, U., and Weng, Z. (2001). Detection of cis -element clusters in higher
eukaryotic DNA. Bioinformatics 17, 878-889.
Frith, M.C., Fu, Y., Yu, L., Chen, J.F., Hansen, U., and Weng, Z. (2004). Detection of
functional DNA motifs via statistical over-representation. Nucleic Acids
Research 32, 1372-1381.
Gordon, D.B., Nekludova, L., McCallum, S., and Fraenkel, E. (2005). TAMO: a flexible,
object-oriented framework for analyzing transcriptional regulation using DNA-
sequence motifs. Bioinformatics 21, 3164-3165.
Henry, C.M.L., and Francis, Y.L.C. (2006). Discovering DNA Motifs with Nucleotide
Dependency, Y.L.C. Francis, ed, pp. 70-80.
Hertz, G.Z., and Stormo, G.D. (1999). Identifying DNA and protein patterns with
statistically significant alignments of multiple sequences. Bioinformatics 15, 563-
577.
Hertz, G.Z., Hartzell, G.W., and Stormo, G.D. (1990). Identification of Consensus Patterns in
Unaligned DNA Sequences Known to be Functionally Related. Comput Appl Biosci
6, 81 - 92.
Higo, K., Ugawa, Y., Iwamoto, M., and Korenaga, T. (1999). Plant cis-acting regulatory DNA
elements (PLACE) database: 1999. Nucleic Acids Research 27, 297-300.
Ho Sui, S.J., Mortimer, J.R., Arenillas, D.J., Brumm, J., Walsh, C.J., Kennedy, B.P., and
Wasserman, W.W. (2005). oPOSSUM: identification of over-represented
transcription factor binding sites in co-expressed genes. Nucleic Acids Research 33,
3154-3164.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. (2000). Computational
identification of Cis-regulatory elements associated with groups of functionally
related genes in Saccharomyces cerevisiae. Journal of Molecular Biology 296, 1205-
1214.
Jaiswal, R., Nain, V., Abdin, M.Z., and Kumar, P.A. (2007). Isolation of pigeon pea (Cajanus
cajan L.) legumin gene promoter and identification of conserved regulatory
elements using tools of bioinformatics. Indian Journal of experimental Biology 6,
495-503.
Jensen, S.T., and Liu, J.S. (2004). BioOptimizer: a Bayesian scoring function approach to
motif discovery. Bioinformatics 20, 1557-1564.
Karlin, S., and Altschul, S.F. (1990). Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proceedings of the
National Academy of Sciences 87, 2264-2268.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C.
(1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple
alignment. Science 262, 208-214.
Lescot, M., Déhais, P., Thijs, G., Marchal, K., Moreau, Y., Van de Peer, Y., Rouzé, P., and
Rombauts, S. (2002). PlantCARE, a database of plant cis-acting regulatory elements
and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids
Research 30, 325-327.
Liu, X., Brutlag, D.L., and Liu, J.S. (2001). BioProspector: discovering conserved DNA motifs
in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127-
138.
Liu, X.S., Brutlag, D.L., and Liu, J.S. (2002). An algorithm for finding protein-DNA binding
sites with applications to chromatin-immunoprecipitation microarray experiments.
Nat Biotech 20, 835-839.
Liu, Y., Liu, X.S., Wei, L., Altman, R.B., and Batzoglou, S. (2004). Eukaryotic Regulatory
Element Conservation Analysis and Identification Using Comparative Genomics.
Genome Research 14, 451-458.
Loots, G.G., and Ovcharenko, I. (2004). rVISTA 2.0: evolutionary analysis of transcription
factor binding sites. Nucleic Acids Research 32, W217-W221.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M., and
Frazer, K.A. (2000). Identification of a Coordinate Regulator of Interleukins 4, 13,
and 5 by Cross-Species Sequence Comparisons. Science 288, 136-140.
Marinescu, V., Kohane, I., and Riva, A. (2005). MAPPER: a search engine for the
computational identification of putative transcription factor binding sites in
multiple genomes. BMC Bioinformatics 6, 79.
Matys, V., Fricke, E., Geffers, R., Gößling, E., Haubrock, M., Hehl, R., Hornischer, K.,
Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov,
B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S.,
and Wingender, E. (2003). TRANSFAC®: transcriptional regulation, from patterns
to profiles. Nucleic Acids Research 31, 374-378.
Pauli, F., Liu, Y., Kim, Y.A., Chen, P.-J., and Kim, S.K. (2006). Chromosomal clustering and
GATA transcriptional regulation of intestine-expressed genes in C. elegans.
Development 133, 287-295.
Pennacchio, L.A., and Rubin, E.M. (2001). Genomic strategies to identify mammalian
regulatory sequences. Nat Rev Genet 2, 100-109.
Portales-Casamar, E., Thongjuea, S., Kwon, A.T., Arenillas, D., Zhao, X., Valen, E., Yusuf, D.,
Lenhard, B., Wasserman, W.W., and Sandelin, A. (2009). JASPAR 2010: the greatly
expanded open-access database of transcription factor binding profiles. Nucleic
Acids Research.
Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., and Van de Peer, Y. (2003).
Computational Approaches to Identify Promoters and cis-Regulatory Elements in
Plant Genomes. Plant Physiology 132, 1162-1176.
Romer, K.A., Kayombya, G.-R., and Fraenkel, E. (2007). WebMOTIFS: automated discovery,
filtering and scoring of DNA sequence motifs using multiple programs and
Bayesian approaches. Nucleic Acids Research 35, W217-W220.
Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. (1998). Finding DNA regulatory
motifs within unaligned noncoding sequences clustered by whole-genome mRNA
quantitation. Nat Biotech 16, 939-945.
Sales, M.P., Gerhardt, I.R., Grossi-de-Sá, M.F., and Xavier-Filho, J. (2000). Do Legume
Storage Proteins Play a Role in Defending Seeds against Bruchids? Plant
Physiology 124, 515-522.
Sandelin, A., Wasserman, W.W., and Lenhard, B. (2004). ConSite: web-based prediction of
regulatory elements using cross-species comparison. Nucleic Acids Research 32,
W249-W252.
Sinha, S., Liang, Y., and Siggia, E. (2006). Stubb: a program for discovery and analysis of cis-
regulatory modules. Nucleic Acids Research 34, W555-W559.
Stormo, G.D. (2000). DNA Binding Sites: Representation and Discovery. Bioinformatics 16,
16 - 23.
Thomas-Chollier, M., Sand, O., Turatsinze, J.-V., Janky, R.s., Defrance, M., Vervisch, E.,
Brohee, S., and van Helden, J. (2008). RSAT: regulatory sequence analysis tools.
Nucleic Acids Research 36, W119-W127.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids
Research 22, 4673-4680.
Wang, T., and Stormo, G.D. (2005). Identifying the conserved network of cis-regulatory sites
of a eukaryotic genome. Proceedings of the National Academy of Sciences of the
United States of America 102, 17400-17405.
Wijaya, E., Yiu, S.-M., Son, N.T., Kanagasabai, R., and Sung, W.-K. (2008). MotifVoter: a
novel ensemble method for fine-grained integration of generic motif finders.
Bioinformatics 24, 2288-2295.
Zakharov, A., Giersberg, M., Hosein, F., Melzer, M., Müntz, K., and Saalbach, I. (2004).
Seed-specific promoters direct gene expression in non-seed tissue. Journal of
Experimental Botany 55, 1463-1471.
Zhu, J., Liu, J.S., and Lawrence, C.E. (1998). Bayesian adaptive sequence alignment
algorithms. Bioinformatics 14, 25-39.
4

In Silico Analysis of Golgi Glycosyltransferases: A Case Study on the LARGE-Like Protein Family

Kuo-Yuan Hwa1,2, Wan-Man Lin and Boopathi Subramani
Institute of Organic & Polymeric Materials
1Department of Molecular Science and Engineering
2Centre for Biomedical Industries
National Taipei University of Technology, Taipei, Taiwan, ROC

1. Introduction
Glycosylation is one of the major post-translational modification processes essential for
expression and function of many proteins. It has been estimated that 1% of the open reading
frames of a genome is dedicated to glycosylation. Many different enzymes are involved in
glycosylation, such as glycosyltransferases and glycosidases.
Traditionally, glycosyltransferases are classified based on their enzymatic activities by
Enzyme Commission (http://www.chem.qmul.ac.uk/iubmb/enzyme/). Based on the
activated donor type, glycosyltransferases are named, for example glucosyltransferase,
mannosyltransferase and N-acetylglucosaminyltransferases. However, classification of
glycosyltransferases based on biochemical evidence is a difficult task since most of the enzymes are membrane proteins. Reconstituting enzymatic assays for membrane proteins is intrinsically more difficult than for soluble proteins, and the purification of membrane-bound glycosyltransferases is likewise difficult.
advancement of genome projects, DNA sequences of an organism are readily available.
Furthermore, bioinformatics annotation tools are now commonly used by life science
researchers to identify the putative function of a gene. Hence, new approaches based on in
silico analysis for classifying glycosyltransferase have been used successfully. The best
known database for classification of glycosyltransferase by in silico approach is the CAZy
(Carbohydrate- Active enZymes) database (http://afmb.cnrs-mrs.fr/CAZy/) (Cantarel et al.,
2009).
Glycosyltransferases are enzymes involved in synthesizing sugar moieties by transferring
activated saccharide donors into various macro-molecules such as DNA, proteins, lipids and
glycans. More than 100 glycosyltransferases are localized in the endoplasmic reticulum (ER)
and Golgi apparatus and are involved in the glycan synthesis (Narimatsu, H., 2006). The
structural studies on the ER and Golgi glycosyltransferases have revealed several common domains and motifs shared between them. The glycosyltransferases are grouped into functional subfamilies based on sequence similarity, their enzyme characteristics, donor specificity, acceptor specificity and the specific donor and acceptor linkages (Ishida et al., 2005). The glycosyltransferase sequences are 330-560 amino acids long and share the same type II transmembrane protein structure with four functional domains: a short
cytoplasmic domain, a targeting / membrane anchoring domain, a stem region and a
catalytic domain (Fukuda et al., 1994). Mammals utilize only 9 sugar nucleotide donors for
glycosyltransferases such as UDP-glucose, UDP-galactose, UDP-GlcNAc, UDP-GalNAc,
UDP-xylose, UDP-glucuronic acid, GDP-mannose, GDP-fucose, and CMP-sialic acid. Other
organisms have an extensive range of nucleotide sugar donors (Varki et al., 2008). Based on
the structural studies, we have designed an intelligent platform for the LARGE protein, a Golgi glycosyltransferase. LARGE is a glycosyltransferase that has been studied in the context of protein glycosylation (Fukuda & Hindsgaul, 2000). It was originally isolated from a region of chromosome 22 of the human genome which was frequently deleted in human meningiomas with altered glycosphingolipid composition. This led to the suggestion that LARGE may have a possible role in complex lipid glycosylation (Dumanski et al., 1987;
Peyrard et al., 1999).

2. LARGE
LARGE is one of the largest genes present in the human genome and it is comprised of 660
kb of genomic DNA and contains 16 exons encoding a 756-amino-acid protein. It showed
98% amino acid identity to the mouse homologue and similar genomic organization. The
expression of LARGE is ubiquitous but the highest levels of LARGE mRNA are present in
heart, brain and skeletal muscle (Peyrard et al., 1999).
LARGE encodes a protein which has an N-terminal transmembrane anchor, a coiled-coil motif and two putative catalytic domains with a conserved DXD (Asp-any-Asp) motif typical of many glycosyltransferases that use nucleoside diphosphate sugars as donors (Longman et al., 2003; Peyrard et al., 1999). The proximal catalytic domain in LARGE was most
homologous to the bacterial glycosyltransferase family 8 (GT8 in CAZy database) members
(Coutinho et al., 2003). The members of this family are mainly involved in the synthesis of
bacterial outer membrane lipopolysaccharide. The distal domain resembled the human β1,3-
N-acetylglucosaminyltransferase (iGnT), a member of the GT49 family. The iGnT enzyme is
required for the synthesis of the poly-N-acetyllactosamine backbone which is part of the
erythrocyte i antigen (Sasaki et al., 1997). The presence of two catalytic domains in the
LARGE is extremely unusual among the glycosyltransferase enzymes.

2.1 Functions of LARGE


2.1.1 Dystroglycan glycosylation
Dystroglycan (DG) is an important constituent of the dystrophin-glycoprotein complex (DGC). This complex plays an essential role in maintaining the stability of the muscle membrane, and glycosylation of some of its components is required for their correct localization and/or ligand-binding activity (Durbeej et al., 1998). DG comprises two subunits, the extracellular α-DG and the transmembrane β-DG (Barresi, 2004). Various components of the extracellular matrix, including laminin (Smalheiser & Schwartz 1987), agrin (Gee et al., 1994), neurexin (Sugita et al., 2001), and perlecan (Peng et al., 1998), interact with α-DG. The carbohydrate moieties present on α-DG are essential for binding laminin and other ligands. α-DG is modified by three different types of glycans
such as: mucin type O-glycosylation, O-mannosylation, and N-glycosylation. The
glycosylated α-DG is essential for the protein’s ability to bind the laminin globular domain-
containing proteins of the Extracellular Matrix (Kanagawa, 2005). LARGE is required for the
generation of functional, properly glycosylated forms of α-DG (Barresi, 2004).
2.1.2 Human LARGE and α-Dystroglycan


The α-DG functional glycosylation by LARGE is likely to be involved in the generation of a
glycan polymer which gives rise to the broad molecular weight range observed for α-DG
detected by the VIA4-1 and IIH6 antibodies. Since both the human and mouse LARGE C-terminal glycosyltransferase domains are similar to β3GnT6, which adds GlcNAc to Gal to generate linear polylactosamine chains (Sasaki et al., 1997), the chain formed by LARGE might also be composed of GlcNAc and Glc.
Myodystrophy (myd), first described in 1963 as a recessive myopathy mapping to chromosome (Chr) 8 (Lane et al., 1976), was later shown to be caused by an intragenic deletion within the glycosyltransferase gene LARGE. In Largemyd and enr mice, the hypoglycosylation of α-DG in the DGC was due to the mutation in LARGE (Grewal et al., 2001). α-DG function was restored and muscular dystrophy was ameliorated in Largemyd skeletal muscle when the LARGE gene was transferred, which indicated that adjusting the glycosylation status of α-DG can improve the muscle phenotype.
Patients with a clinical spectrum ranging from severe congenital muscular dystrophy (CMD) with structural brain and eye abnormalities [Walker-Warburg syndrome (WWS), MIM 236670] to a relatively mild form of limb-girdle muscular dystrophy (LGMD2I, MIM 607155) are linked to abnormal O-linked glycosylation of α-DG (van Reeuwijk et al., 2005). A study by Barresi et al. (2004) revealed the existence of dual, concentration-dependent functions of LARGE. At physiological concentrations, LARGE may be involved in regulating the α-DG O-mannosylation pathway. However, when LARGE is forcibly overexpressed, it may trigger alternative pathways for the O-glycosylation of α-DG which can generate a type of repeating polymer of variable length, such as glycosaminoglycan-like or core 1 or core 2 structures. This alternative glycan mimics the O-mannose glycan in its ability to bind α-DG ligands and can compensate for the defective tetrasaccharide. The functional LARGE protein is also required for neuronal migration during CNS development, and it rescues α-DG in MEB fibroblasts and WWS cells (Barresi et al., 2004).

2.1.3 LARGE in visual signal processing


The role of LARGE in proper visual signal processing was studied from the retinal pathology of Largemyd mice. The functional abnormalities of the retina were investigated with a sensitive tool, the electroretinogram (ERG). In Largemyd mice, the normal a-wave indicated that the mutant glycosyltransferase does not have any effect on photoreceptor function, but the alteration in the b-wave suggests altered signal processing in the downstream retinal circuitry (Newman & Frishman, 1991). The DGC may also have a possible role in this aspect of the phenotype. An abnormal b-wave is likewise associated with the loss of retinal isoforms of dystrophin in humans and mice, similar to the Largemyd mice.

2.2 LARGE homologues


A gene homologous to LARGE was identified and named LARGE2. Like LARGE, it is involved in α-DG maturation, according to Fujimura et al. (2005). It was not well understood whether these two proteins are compensatory or cooperative. The co-expression of LARGE and LARGE2 did not increase the maturation of α-DG in comparison with either one of them alone, which indicated that, for the maturation of α-DG, the function of LARGE2 is compensatory and not cooperative. Gene therapy for muscular dystrophy using the LARGE gene is a current topic of research (Barresi et al., 2004; Braun, 2004). Compared to LARGE, the LARGE2 gene may be more effective because it can glycosylate more heavily than LARGE, and it also prevents the production of harmful, immature α-DG.
Closely related homologues of LARGE are found in the human genome (glycosyltransferase-like 1B; GYLTL1B), the mouse genome (Glylt1b; also called LARGE-Like or LargeL) and in some other vertebrate species (Grewal & Hewitt, 2002). The homologous gene is positioned on chromosome 11p11.2 of the human genome and encodes a 721-amino-acid protein which has 67% identity with LARGE, suggesting that the two genes may have arisen by gene duplication. Like LARGE, it is also predicted to have two catalytic domains, though it lacks the coiled-coil motif present in the former protein. The hyperglycosylation of α-dystroglycan by the overexpression of GYLTL1B increased its ability to bind laminin, and both genes showed the same level of increase in laminin binding ability (Brockington, et al., 2005).

3. Bioinformatics workflow and platform design


Many public databases and bioinformatics tools have been developed and are currently available for use (Ding & Berleant, 2002). The primary goal of bioinformaticians is to develop reliable databases and effective analysis tools capable of handling bulk amounts of biological data. The objective of laboratory researchers, however, is to study specific areas within the life sciences, which requires only a limited set of databases and analysis tools. The existing free bioinformatics tools are thus sometimes too complicated for biologists to choose from. One solution is to have an expert team that is familiar both with bioinformatics databases and with the needs of a research group in a particular field. Such a team can recommend a workflow using selected bioinformatics tools and databanks and help scientists with the complicated choice of tools and databases. Moreover, such a team could organize a large number of heterogeneous sources of biological information into a specific, expertly annotated databank. The team can also regularly and systematically update the information, which is essential to help biologists overcome the problems of integrating and keeping up to date with heterogeneous biological information (Gerstein, 2000).
We have built a novel information management platform, LGTBase. This composite knowledge management platform includes the “LARGE-like GlcNAc Transferase Database”, built by integrating specific public databases such as the CAZy database, and a workflow analysis that combines the usage of specific public and purpose-designed bioinformatics tools to identify the members of the LARGE-like protein family.

4. Tools and database selection


To analyze a novel protein family, biologists need to understand many different types of information. Moreover, the speed of discovery in biology has been expanding exponentially in recent years, so biologists have to pick the right information from the vast resources available. To overcome these obstacles, a bioinformatics workflow can be designed for analysing a specific protein family. In our study, a workflow was designed based on the structure and characteristics of the LARGE protein, as shown in Figure 1 (Hwa et al., 2007). Unknown DNA/protein sequences are first identified as members of known gene families by using the Basic Local Alignment Search Tool (BLAST). The blastp search tool is used to look for new LARGE-like proteins present in different organisms. Researchers who wish to use our platform can obtain the protein sequences either from experimental data or through the blastp results. The search results are then analyzed with
the following tools. To begin with, the sequences are searched for the aspartate-any residue-
aspartate (DXD) motif. The DXD motifs present in some glycosyltransferase families are
essential for their enzyme activity.

Fig. 1. Bioinformatics workflow of LGTBase.


The DXD motif prediction was then followed by the transmembrane domain prediction by
using the TMHMM program (version 2.0; Center for Biological Sequence Analysis, Technical
University of Denmark [http://www.cbs.dtu.dk/services/TMHMM-2.0/]). The
transmembrane domain is a characteristic feature of the Golgi enzymes.
The sequence motifs are then identified by MEME (Multiple Expectation-maximization for
Motif Elicitation) program (version 3.5.4; San Diego Supercomputer Center, UCSD
[http://meme.sdsc.edu/meme/]).
This program finds the motif-homology between the target sequence and other known
glycosyltransferases. In addition to all the above mentioned tools, the Pfam search (Sanger
Institute [http://www.sanger.ac.uk/Software/Pfam/search.shtml]) can also be used to find
the multiple sequence alignments and hidden Markov models in many existing protein
domains and families. The Pfam results will indicate what kind of protein family the peptide
belongs to. If it is a desired protein, investigators can then identify the evolutionary
relationships by using phylogenetic analysis.

4.1 LARGE-like GlcNAc transferase database


The annotation entries used in LGTBase are assembled from information retrieved from several databases. In the CAZy (Carbohydrate-Active enZymes) database ([http://afmb.cnrs-mrs.fr/CAZY/]), the glycosyltransferases are classified into families, clans, and folds based on their structural and sequence similarities, and also on mechanistic investigation. The other databases used in this platform are listed in Table 1.
Database: Description (Website)

EntrezGene: NCBI's repository for gene-specific information (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene)
GenBank: NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
Dictybase: database for the model organism Dictyostelium discoideum (http://dictybase.org/)
UniProtKB/Swiss-Prot: high-quality, manually annotated, non-redundant protein sequence database (http://www.uniprot.org/)
InterPro: database of protein families, domains and functional sites (http://www.ebi.ac.uk/interpro/)
MGI: database providing integrated genetic, genomic, and biological data of the laboratory mouse (http://www.informatics.jax.org/)
Ensembl: provides genome-annotation information (http://www.ensembl.org/index.html)
HGMD: the Human Gene Mutation Database provides comprehensive data on human inherited disease mutations (http://www.hgmd.cf.ac.uk/ac/index.php)
UniGene: NCBI database of the transcriptome (http://www.ncbi.nlm.nih.gov/unigene)
GeneWiki: transfers information on human genes to Wikipedia articles (http://en.wikipedia.org/wiki/Gene_Wiki)
TGDB: database with information about the genes involved in cancers (http://www.tumor-gene.org/TGDB/tgdb.html)
HUGE: provides the results of the human cDNA project at the Kazusa DNA Research Institute (http://zearth.kazusa.or.jp/huge/)
RGD: collection of genetic and genomic information on the rat (http://rgd.mcw.edu/)
OMIM: provides information on human genes and genetic disorders (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
CGAP: information on gene expression profiles of normal, precancer, and cancer cells (http://cgap.nci.nih.gov/)
PubMed: database with 20 million citations for biomedical literature from medical journals, life science journals, and related books (http://www.ncbi.nlm.nih.gov/PubMed/)
GO: representation of gene and gene product attributes across all species (http://www.geneontology.org/)

Table 1. The information sources of the LARGE-like GlcNAc Transferase Database
All the information related to the LARGE-like protein family was retrieved from the
different biological databases. In order to confirm that the information obtained was
reliable, the data were scrutinized at two levels. First, the information was selected from the above-mentioned biological databases with customized programs (using Perl-compatible regular expressions). Then the obtained information was annotated and validated by experts in glycobiology and bioinformatics.
The annotated data in the LGTBase database are divided into nine categories (Figure 2). The first category, genomic location, displays the chromosome, the cytogenetic band and the map location of the gene. The second, aliases and descriptions, displays synonyms and aliases for the relevant gene and descriptions of its function, cellular localization and effect on phenotype. The third category, proteins, provides annotated information about the proteins encoded by the relevant genes. The fourth, protein domains and families, provides annotated information about protein domains and families, and the fifth, protein function, provides annotated information about gene function. The sixth category, pathways and interactions, provides links to pathways and interactions, followed by the seventh, disorders and mutations, which draws its information from OMIM and UniProt. The eighth category, expression in specific tissues, shows the tissue expression values available for a given gene. The last category, research articles, lists the references related to the proteins under study. In addition, the investigator can also use DNA or protein sequences to assemble the dataset for analysis using this workflow.

Fig. 2. The contents of LGTBase database

4.2 LARGE-like GlcNAc transferase workflow


4.2.1 Reference sequences search
Unknown DNA or protein sequences can be identified as members of known gene families using the Basic Local Alignment Search Tool (BLAST). BlastP, one of the BLAST programs, searches protein databases using a protein query. We used BlastP to look for new LARGE-like proteins from different species, gathered the protein sequences of LARGE-like GlcNAc transferases and built a 'LARGE-like protein' database. This database assists in the search for further reference sequences of LARGE-like proteins.
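For readers who want to reproduce this step locally, the sketch below runs the search with the NCBI BLAST+ command-line tools wrapped in Python; the file names, E-value cut-off and database name are assumptions for the example rather than the parameters used to build LGTBase.

```python
import subprocess

# Assumed inputs: large_like_proteins.fasta (curated reference set) and query.fasta
# (candidate sequence). Requires the NCBI BLAST+ binaries (makeblastdb, blastp) on the PATH.

# Build a protein BLAST database from the curated LARGE-like sequences.
subprocess.run(
    ["makeblastdb", "-in", "large_like_proteins.fasta",
     "-dbtype", "prot", "-out", "large_like_db"],
    check=True,
)

# Search the query against it; tabular output (-outfmt 6) is easy to post-process.
subprocess.run(
    ["blastp", "-query", "query.fasta", "-db", "large_like_db",
     "-evalue", "1e-5", "-outfmt", "6", "-out", "blastp_hits.tsv"],
    check=True,
)
```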

4.2.2 DXD motif search


In several glycosyltransferase families, the DXD motif is essential for enzymatic activity (Busch et al., 1998). We therefore first searched for the aspartate-any residue-aspartate (DXD) motif commonly found in glycosyltransferases, and designed the 'DXD Motif Search' tool for this purpose. Input protein sequences are loaded or pasted into the tool, and the results indicate the presence or absence of a DXD motif.
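Because the DXD motif is simply aspartate, any residue, aspartate, the search reduces to a short regular expression; a minimal sketch of such a check is shown below (the toy sequence is invented for the example and is not a LARGE fragment).

```python
import re

# DXD motif: aspartate, any residue, aspartate ("D.D" as a regular expression).
DXD = re.compile(r"(?=(D[A-Z]D))")   # lookahead so overlapping motifs are also reported

def find_dxd(sequence):
    """Return a list of (1-based position, motif) for every DXD occurrence."""
    return [(m.start() + 1, m.group(1)) for m in DXD.finditer(sequence.upper())]

if __name__ == "__main__":
    toy = "MKTLLDVDASRLLDGDKL"      # invented toy sequence
    print(find_dxd(toy))            # [(6, 'DVD'), (14, 'DGD')]
    print(bool(find_dxd(toy)))      # True -> DXD motif present
```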

4.2.3 Transmembrane helices search


The LARGE protein is a member of the N-acetylglucosaminyltransferase family, and the presence of a transmembrane domain is a characteristic feature of this family. The TMHMM program is used to predict transmembrane helices based on a hidden Markov model. The prediction gives the most probable location and orientation of transmembrane alpha helices in the sequence and indicates whether the intervening loop regions lie inside or outside the cell or organelle. The underlying model is built around the roughly 20-residue stretches of hydrophobic amino acids that an alpha helix needs in order to span a cell membrane.
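The idea behind this prediction can be illustrated with a crude hydrophobicity sliding window; the sketch below scores 20-residue windows with the Kyte-Doolittle hydropathy scale and flags candidate membrane-spanning stretches. It is only a heuristic illustration, not a substitute for TMHMM, and the 1.6 threshold is an assumed cut-off.

```python
# Kyte-Doolittle hydropathy values per residue.
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
      "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
      "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5}

def candidate_tm_windows(sequence, window=20, threshold=1.6):
    """Return 1-based start positions of windows whose mean hydropathy exceeds the threshold."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        score = sum(KD.get(residue, 0.0) for residue in segment) / window
        if score >= threshold:
            hits.append(start + 1)
    return hits
```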

4.2.4 MEME analysis


A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME (Multiple Expectation-maximization for Motif Elicitation) represents motifs as position-dependent letter-probability matrices, which describe the probability of each possible letter at each position in the pattern. The program discovers such shared motifs among the input protein sequences.
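To make the notion of a position-dependent letter-probability matrix concrete, the sketch below estimates such a matrix from a handful of aligned motif occurrences; MEME itself fits these matrices by expectation-maximization over unaligned sequences, so this only illustrates the representation, and the motif occurrences are invented toy data.

```python
from collections import Counter

def letter_probability_matrix(occurrences):
    """Estimate per-position residue probabilities from equal-length motif occurrences."""
    length = len(occurrences[0])
    matrix = []
    for position in range(length):
        counts = Counter(occurrence[position] for occurrence in occurrences)
        total = sum(counts.values())
        matrix.append({residue: count / total for residue, count in counts.items()})
    return matrix

if __name__ == "__main__":
    motif_hits = ["DVDGR", "DIDGR", "DVDSR", "DLDGK"]   # toy motif occurrences
    for position, probabilities in enumerate(letter_probability_matrix(motif_hits), start=1):
        print(position, probabilities)
```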

4.2.5 Protein families search


The Pfam HMM search was used to identify the protein families to which the input protein sequences belong. The Pfam database contains information about most known protein domains and families, and the results of the HMM search relate the input protein sequences to these existing families and domains.
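A local equivalent of this step is an hmmscan search of the query sequences against the Pfam HMM library with HMMER 3, as sketched below; the file names and the E-value cut-off are assumptions for the example.

```python
import subprocess

# Assumed inputs: Pfam-A.hmm (the Pfam HMM library, prepared once with `hmmpress`)
# and query.fasta containing the candidate LARGE-like sequences. Requires HMMER 3 on the PATH.
subprocess.run(
    ["hmmscan",
     "--domtblout", "pfam_hits.domtblout",   # per-domain tabular hits
     "-E", "1e-3",                           # assumed E-value threshold for reporting
     "Pfam-A.hmm", "query.fasta"],
    check=True,
)
# The domain table can then be filtered for glycosyltransferase-related Pfam families.
```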

4.2.6 Phylogenetic analysis


Phylogenetic analysis was performed to detect significant evolutionary relationships between the new protein sequences and the LARGE protein family and to support our previous findings. ClustalW, a multiple alignment program, aligns two or more sequences to determine significant consensus regions between them (Thompson et al., 1994); this approach can also be used to search for patterns in the sequences. The phylogenetic tree was constructed with the PHYLIP package (v.3.6.9) and viewed with the TreeView software (v.1.6.6). In the GlcNAc-transferase phylogenetic analysis, once the multiple alignment of all GlcNAc-transferases has been produced, it is used to construct the tree. About 25 protein sequences were identified as belonging to the LARGE-like protein family and, using the neighbor joining distance method, the resulting tree divided these proteins into six groups (Figure 3). The evolutionary history inferred from phylogenetic analysis is usually depicted as a branching, tree-like diagram that represents an estimated pedigree of the inherited relationships among the protein sequences from different species. These evolutionary relationships can be viewed either as cladograms (Chenna et al., 2003) or phylograms (Radomski & Slonimski, 2001).

Fig. 3. Phylogenetic tree of LARGE-like Protein Family

4.3 Organization of the LGTBase platform


The data obtained from the analyses were stored in a MySQL relational database, and the web interface was built using PHP and CGI/Java scripts. Based on the characteristics of LARGE-like GlcNAc transferase proteins, the workflow was designed and developed in the Java language together with several open-source bioinformatics programs. Tools written in different languages (C, Perl, Java) were integrated through Java (Figure 4). Adjustable parameters of the tools were retained so that future needs can be accommodated.

5. Application with LARGE protein family


A protein sequence (FASTA format) can be entered into the BlastP assistant interface, enabling other known proteins with similar sequences to be identified (Figure 5). The investigator can select all of the resulting sequences or use only some of them, and the data can then be transferred to the DXD analysis page (Figure 6). DXD analysis was chosen because the motif is represented in many families of glycosyltransferases, which makes it easy to narrow the analysis of putative protein sequences down to particular protein families or domains. Many online tools are available for the identification and characterization of unknown protein sequences, so the tools can be chosen according to the target protein under study.

Fig. 4. Database selected for construction of the knowledge management platform


The sequences are analyzed with the DXD motif search tool (Figure 6), which selects the sequences containing a DXD motif for TMHMM analysis. Transmembrane helices can then be predicted with TMHMM (Figure 7); transmembrane domains are predicted from the hydrophobic character of the protein and are mainly used to infer the cellular location of the protein. In a similar way, several other domains can be predicted from protein properties such as hydrophobicity and hydrophilicity. The dataset of sequences containing DXD motifs and transmembrane helices is then passed to MEME (Figure 8) and Pfam analysis (Figure 9). MEME predicts sequence motifs that occur repeatedly in the dataset and are conjectured to have biological significance; this step plays a significant role in characterizing putative protein sequences after the initial studies with the DXD motif, transmembrane domain and other tools, and it can be applied to any kind of protein sequence because its prediction is based only on the patterns present in the input. Pfam analysis assigns the protein sequences in the dataset to known protein families; the Pfam classification can be applied to almost all putative protein sequences because of its large collection of protein domain families represented by multiple sequence alignments and hidden Markov models. After the MEME and Pfam analyses, the ClustalW and PHYLIP programs were used for phylogenetic analysis to examine the evolutionary relationships within the dataset (Figure 10). Finally, these results can be used to design experiments to be performed in the laboratory.

Fig. 5. The BlastP tool of the LGTBase platform to find similar sequences to LARGE

Fig. 6. DXD motif search tool of the LGTBase platform for DXD motif prediction

Fig. 7. TMHMM analysis tool of the LGTBase platform for Transmembrane domain
prediction

Fig. 8. MEME analysis tool of the LGTBase platform to predict the sequence motifs

Fig. 9. Pfam analysis tool of the LGTBase platform to identify the known protein family of
the target protein which is studied

Fig. 10. Phylogenetic analysis tool of the LGTBase platform to study the evolutionary
relationship of the target protein

6. Future direction
We have described how to construct a computational platform to analyze the LARGE protein family. Since the platform was built around several commonly shared protein domains and motifs, it can also be modified to analyze other Golgi glycosyltransferases. Furthermore, the phylogenetic analysis (Figure 3) revealed that the LARGE protein family is related to β-1,3-N-acetylglucosaminyltransferase 1 (β3GnT). β3GnT (EC 2.4.1.149) is a group of enzymes belonging to the glycosyltransferase family. Some β3GnT enzymes catalyze the transfer of GlcNAc from UDP-GlcNAc to Gal in the Galβ1-4Glc(NAc) structure with a β-1,3 linkage. These enzymes are grouped into GT families 31 and 49 in the CAZy database. The enzymes use two substrates, UDP-N-acetyl-D-glucosamine and D-galactosyl-β-1,4-N-acetyl-D-glucosaminyl-R, and the products are UDP and N-acetyl-β-D-glucosaminyl-β-1,3-D-galactosamine. These enzymes participate in the formation of keratan sulfate, neo-lacto series glycosphingolipids and N-linked glycans. Nine members of the β3GnT family are currently known.
β3GnT1 (iGnT) was the first enzyme of the family to be isolated, when the cDNA of a human β-1,3-N-acetylglucosaminyltransferase essential for poly-N-acetyllactosamine synthesis was studied (Zhou et al., 1999). The poly-N-acetyllactosamine synthesized by iGnT provides a critical backbone structure for the addition of functional oligosaccharides such as sialyl Lewis X. It has recently been reported that β3GnT1 is involved in attenuating prostate cancer cell locomotion by regulating the synthesis of laminin-binding glycans on α-DG (Bao et al., 2009). Since several shared domains are similar to those of the LARGE protein, a new platform for the β3GnT protein family can be constructed based on the original platform.
Apart from β3GnT1, the β3GnT2 enzyme is responsible for elongation of poly-lactosamine chains; this enzyme was isolated on the basis of structural similarity with the β3GalT family. A study of a panel of invasive and noninvasive fresh transitional cell carcinomas (TCCs) showed strong down-regulation of β3GnT2 in the invasive lesions, suggesting a decline in the expression levels of some members of this glycosyltransferase family during invasion (Gromova et al., 2001).
The β3GnT3 and β3GnT4 enzymes were subsequently isolated on the basis of structural similarity with the β3GalT family. β3GnT3 is a type II transmembrane protein and contains a signal anchor that is not cleaved. It prefers lacto-N-tetraose and lacto-N-neotetraose as substrates, and it is also involved in the biosynthesis of poly-N-acetyllactosamine chains and of the backbone structure of dimeric sialyl Lewis A. It plays a dominant role in L-selectin ligand biosynthesis, lymphocyte homing and lymphocyte trafficking, and is highly expressed in non-invasive colon cancer cells. β3GnT4 is involved in the biosynthesis of poly-N-acetyllactosamine chains and prefers lacto-N-neotetraose as a substrate; it is a type II transmembrane protein and is more highly expressed in bladder cancer cells (Shiraishi et al., 2001). β3GnT5 is responsible for lactosyltriaosylceramide synthesis, an essential component of lacto/neolacto series glycolipids (Togayachi et al., 2001). The expression of the HNK-1 and Lewis x antigens on the lacto/neo-lacto series of glycolipids is developmentally and tissue-specifically regulated by β3GnT5, and overexpression of β3GnT5 in human gastric carcinoma cell lines led to increased sialyl-Lewis X expression and increased H. pylori adhesion (Marcos et al., 2008).
β3GnT6 synthesizes the core 3 O-glycan structure, and it has been suggested that this enzyme plays an important role in the synthesis and function of mucin O-glycans in the digestive organs. In addition, the expression of β3GnT6 was markedly down-regulated in gastric and colorectal carcinomas (Iwai et al., 2005). Expression of β3GnT7 has been reported to be down-regulated upon malignant transformation (Kataoka et al., 2002). Elongation of the carbohydrate backbone of keratan sulfate proteoglycan is catalyzed by β3GnT7 and β1,4-galactosyltransferase 4 (Hayatsu et al., 2008). β3GnT7 can transfer GlcNAc to Gal to synthesize a polylactosamine chain, with each enzyme differing in its acceptor molecule preference. Polylactosamine and related structures play crucial roles in cell-cell interaction, cell-extracellular matrix interaction, the immune response and the determination of metastatic capacity. The β3GnT8 enzyme extends a polylactosamine chain specifically on tetraantennary N-glycans; β3GnT8 transfers GlcNAc to the non-reducing terminus of the Galβ1-4GlcNAc of tetraantennary N-glycans in vitro. Intriguingly, β3GnT8 is significantly upregulated in colon cancer tissues compared with normal tissue (Ishida et al., 2005). Co-transfection of β3GnT8 and β3GnT2 resulted in a synergistic enhancement of polylactosamine synthesis activity, indicating that these two enzymes interact and complement each other's function in the cell. In summary, the members of the β3GnT protein family are important in human cancer biology.
Our initial motif analysis predicted three important functional domains that are commonly found among the β3GnT enzymes. The first motif is a structural motif necessary for maintaining the protein fold. The second, the DXD motif represented in many glycosyltransferases, is involved in the binding of the nucleotide-sugar donor substrate, both directly and indirectly, through coordination of metal ions such as magnesium or manganese in the active site. The third motif is a glycine-rich loop found at the bottom of the active site cleft; this loop is likely to play a role in the recognition of both the GlcNAc portion of the donor and the substrate. Since the three common domains of β3GnT are similar to those of the LARGE protein family, it is feasible to modify the current LARGE platform to analyze other Golgi glycosyltransferases such as β3GnT.

7. References
Bao, X., Kobayashi, M., Hatakeyama, S., Angata, K., Gullberg, D., Nakayama, J., Fukuda,
M.N. & Fukuda, M. (2009). Tumor suppressor function of laminin-binding
α-dystroglycan requires a distinct β-3-N-acetylglucosaminyltransferase. Proceedings
of the National Academy of Sciences USA, Vol.106, No.29, (July 2009),
pp. 12109-12114
Barresi, R., Michele, D.E., Kanagawa, M., Harper, H.A., Dovico, S.A., Satz, J.S., Moore, S.A.,
Zhang, W., Schachter, H., Dumanski, J.P., Cohn, R.D., Nishino, I. & Campbell, K.P.
(2004). LARGE can functionally bypass alpha-dystroglycan glycosylation defects in
distinct congenital muscular dystrophies. Nature Medicine, Vol.10, No.7, (July 2004),
pp. 696-703.
Braun, S. (2004). Naked plasmid DNA for the treatment of muscular dystrophy. Current
Opinion in Molecular Therapeutics, Vol.6, (October 2004), pp. 499-505.
Brockington, M., Torelli, S., Prandini, P., Boito, C., Dolatshad, N.F., Longman, C., Brown,
S.C., Muntoni, F. (2005). Localization and functional analysis of the LARGE family
of glycosyltransferases: significance for muscular dystrophy. Human Molecular
Genetics, Vol.14, No.5, (March 2005), pp. 657-665.
Busch, C., Hofmann, F., Selzer, J., Munro, S., Jeckel, D. & Aktories, K. (1998). A common
motif of eukaryotic glycosyltransferases is essential for the enzyme activity of large
clostridial cytotoxins. Journal of Biological Chemistry, Vol.273, No.31, (July 1998),
pp.19566-19572.
Cantarel, B.L., Coutinho, P.M., Rancurel, C., Bernard, T., Lombard, V. & Henrissat, B.
(2009) The Carbohydrate-Active EnZymes database (CAZy): an expert
resource for Glycogenomics. Nucleic Acids Research, Vol. 37, (January 2009), pp.
D233-238

Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G. & Thompson, J.D. (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research, Vol.31, No.13, (July 2003), pp. 3497-3500.
Coutinho, P.M., Deleury, E., Davies, G.J. & Henrissat, B. (2003). An evolving hierarchical
family classification for glycosyltransferases. Journal of Molecular Biology, Vol.328,
(April 2003), pp. 307-317.
Ding, J., Berleant, D., Nettleton, D. & Wurtele, E. (2002). Mining MEDLINE: abstracts,
sentences, or phrases? Pacific Symposium on Biocomputing, Vol.7, pp. 326-337.
Dumanski, J.P., Carlbom, E., Collins, V.P., Nordenskjold, M. (1987). Deletion mapping of a
locus on human chromosome 22 involved in the oncogenesis of meningioma.
Proceedings of the National Academy of Sciences USA, Vol.84, (December 1987), pp.
9275-9279.
Durbeej, M., Henry, M.D. & Campbell, K.P. (1998). Dystroglycan in development and
disease. Current Opinions in Cell Biology Vol. 10, (October 1998), pp. 594-601.
Fujimura, K., Sawaki, H., Sakai, T., Hiruma, T., Nakanishi, N., Sato, T., Ohkura, T.,
Narimatsu, H. (2005). LARGE2 facilitates the maturation of a-dystroglycan more
effectively than LARGE. Biochemical and Biophysical Research Communications,
Vol.329, No.3, (April 2005), pp. 1162-1171
Fukuda, M., Hindsgaul, O., Hames, B.D. & Glover, D.M. (1994). In Molecular Glycobiology,
Oxford Univ. Press, Oxford.
Fukuda, M., & Hindsgaul, O. (2000). Molecular and Cellular Glycobiology (2nd ed.), Oxford
Univ. Press, Oxford.
Gee, S.H., Montanaro, F., Lindenbaum, M.H., and Carbonetto, S. (1994). Dystroglycan-α, a
dystrophin-associated glycoprotein, is a functional agrin receptor. Cell, Vol.77,
(June 1994), pp. 675-686.
Gerstein, M. (2000). Integrative database analysis in structural genomics. Nature Structural
Biology, Vol.7, (November 2000), Suppl: 960-963.
Grewal, K., Holzfeind, P.J., Bittner, R.E. & Hewitt, J.E., (2001). Mutant glycosyltransferase
and altered glycosylation of alpha-dystroglycan in the myodystrophy mouse.
Nature Genetics, Vol.28, (June 2001), pp.151-154.
Grewal, P.K. & Hewitt, J.E. (2002). Mutation of Large, which encodes a putative
glycosyltransferase, in an animal model of muscular dystrophy. Biochimica et
Biophysica Acta, Vol.1573, (December 2002), pp. 216-224.
Gromova, I., Gromov, P. & Celis J.E. (2001). A Novel Member of the Glycosyltransferase
Family, β3GnT2, highly down regulated in invasive human bladder Transitional
Cell Carcinomas. Molecular Carcinogenesis, Vol. 32, No. 2, (October 2001), pp.
61-72
Hayatsu, N., Ogasawara, S., Kaneko, M.K., Kato, Y. & Narimatsu, H. (2008). Expression of
highly sulfated keratan sulfate synthesized in human glioblastoma cells. Biochemical
and Biophysical Research Communications, Vol. 368, No. 2, (April 2008), pp. 217-222
Hwa, K.Y., Pang, T.L. & Chen, M.Y. (2007). Classification of LARGE-like GlcNAc-
Transferases of Dictyostelium discoideum by Phylogenetic Analysis. Frontiers in the
Convergence of Bioscience and Information Technologies, pp. 289-293.

Ishida, H., Togayachi, A., Sakai, T., Iwai, T., Hiruma, T., Sato, T., Okubo, R., Inaba, N., Kudo,
T., Gotoh, M., Shoda, J., Tanaka, N. & Narimatsu, H. (2005). A novel beta1,3-N-
acetylglucosaminyltransferase (beta3Gn-T8), which synthesizes poly-N-
acetyllactosamine, is dramatically upregulated in colon cancer. FEBS Letters,
Vol.579, No.1, (January 2005), pp. 71-8.
Iwai, T., Kudo, T., Kawamoto, R., Kubota, T., Togayachi, A., Hiruma, T., Okada, T.,
Kawamoto, T., Morozumi, K. & Narimatsu, H. (2005). Core 3 synthase is down-
regulated in colon carcinoma and profoundly suppresses the metastatic potential of
carcinoma cells. Proceedings of the National Academy of Sciences USA, Vol.102, No.12,
(March 2005), pp. 4572-4577
Kanagawa, M., Michele, D.E., Satz, J.S., Barresi, R., Kusano, H., Sasaki, T., Timpl, R., Henry,
M. D., and Campbell, K.P. (2005). Disruption of Perlecan Binding and Matrix
Assembly by Post-Translational or Genetic Disruption of Dystroglycan Function.
FEBS Letters, Vol.579, No.21, (August 2005), pp. 4792-4796.
Kataoka, K. & Huh, N.H. (2002). A novel β1,3-N-acetylglucosaminyltransferase involved in
invasion of cancer cells as assayed in vitro. Biochemical and Biophysical Research
Communications, Vol. 294, No.4, (June 2002), pp. 843-848
Lane, P.W., Beamer, T.C. & Myers, D.D. (1976). Myodystrophy, a new myopathy on
chromosome 8 of the mouse. Journal of Heredity, Vol. 67, No.3 (May-June 1976), pp.
135-138.
Longman, C., Brockington, M., Torelli, S., Jimenez-Mallebrera, C., Kennedy, C., Khalil, N.,
Feng, L., Saran, R.K., Voit, T., Merlini, L., Sewry, C.A., Brown, S.C. & Muntoni F.
(2003). Mutations in the human LARGE gene cause MDC1D, a novel form of
congenital muscular dystrophy with severe mental retardation and abnormal
glycosylation of alpha dystroglycan. Human Molecular Genetics, Vol.12, No.21,
(November 2003), pp. 2853-2861.
Marcos, N.T., Magalhães, A., Ferreira, B., Oliveira, M.J., Carvalho, A.S., Mendes, N.,
Gilmartin, T., Head, S.R., Figueiredo, C., David, L., Santos-Silva, F. & Reis, C.A.
(2008). Helicobacter pylori induces β3GnT5 in human gastric cell lines, modulating
expression of the SabA ligand Sialyl-Lewis X. Journal of Clinical Investigation, Vol.
118, No. 6, (June 2008), pp.2325-2336
Narimatsu, H. (2006). Human glycogene cloning: focus on beta 3-glycosyltransferase and
beta 4-glycosyltransferase families. Current Opinions in Structural Biology. Vol.16,
No.5, (October 2006), pp. 567-575.
Newman, E.A. & Frishman, L.J. (1991). The b-wave. In Arden, G.B. (ed.), Principles and
Practice of Clinical Electrophysiology of Vision, Mosby-Year Book, St Louis, MO.
Peng, H.B., Ali, A.A., Daggett, D.F., Rauvala, H., Hassell, J.R., & Smalheiser, N.R. (1998). The
relationship between perlecan and dystroglycan and its implication in the
formation of the neuromuscular junction. Cell Adhesion and Communication, Vol.5,
No.6, (September 1998), pp. 475-489
Peyrard, M., Seroussi, E., Sandberg-Nordqvist, A.C., Xie, Y.G., Han, F.Y.,Fransson, I.,
Collins, J., Dunham, I., Kost-Alimova, M., Imreh, S.,Dumanski, J.P., (1999). The
human LARGE gene from 22q12.3-q13.1 is a new, distinct member of the
glycosyltransferase gene family. Proceedings of the National Academy of Sciences USA,
Vol.96, No.2, (January 1999), pp. 589-603.
Radomski, J.P. & Slonimski, P.P. (2001). Genomic style of proteins: concepts, methods and
analyses of ribosomal proteins from 16 microbial species. FEMS Microbiol Reviews,
Vol.25, No.4, (August 2001), pp. 425-435.
Sasaki, K., Kurata-Miura, K., Ujita, M., Angata, K., Nakagawa, S., Sekine, S., Nishi, T. &
Fukuda, M. (1997). Expression cloning of cDNA encoding a human beta-1,3-N-
acetylglucosaminyl transferase that is essential for poly-N-acetyllactosamine
synthesis. Proceedings of the National Academy of Sciences USA, Vol.94, No.26,
(December 1997), pp. 14294-14299.
Shiraishi, N., Natsume, A., Togayachi, A., Endo, T., Akashima, T., Yamada, Y., Imai, N.,
Nakagawa, S., Koizumi, S., Sekine, S., Narimatsu, H. & Sasaki K. (2001).
Identification and characterization of 3 novel β1,3-N-
Acetylglucosaminyltransferases. Structurally Related to the β1,3-
Galactosyltransferase family. The Journal of Biological Chemistry, Vol. 276, No.5,
(February 2001), pp. 3498-3507
Smalheiser, N. R., and Schwartz, N. B. (1987) Cranin: a laminin-binding protein of cell
membranes. Proceedings of the National Academy of Sciences USA, Vol.84, No.18,
(September 1987), pp. 6457-6461.
Sugita, S., Saito, F., Tang, J., Satz, J., Campbell, K., & Sudhof, T.C. (2001). A stoichiometric
complex of neurexins and dystroglycan in brain. Journal of Cell Biology, Vol.154,
No.2, (July 2001), pp. 435-445
Thompson, J.D., Higgins, D.G. & Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position-
specific gap penalties and weight matrix choice. Nucleic Acids Research, Vol.22,
No.22, (November 1994), pp. 4673-4680.
Togayachi, A., Akashima, T., Ookubo, R., Kudo, T., Nishihara, S., Iwasaki, H., Natsume, A., Mio, H., Inokuchi, J., Irimura, T. et al. (2001). Molecular cloning and characterization of UDP-GlcNAc: lactosylceramide β1,3-N-acetylglucosaminyltransferase (β3Gn-T5), an essential enzyme for the expression of HNK-1 and Lewis X epitopes on glycolipids. Journal of Biological Chemistry, Vol.276, No.5, (March 2001), pp. 22032-22040.
van Reeuwijk, J., Brunner, H.G., van Bokhoven, H. (2005). Glyc-O-genetics of
Walker-Warburg syndrome. Clinical Genetics, Vol. 67, No.4, (April 2005), pp. 281-
289.
Varki, A., Cummings, R.D., Esko, J.D., Freeze, H.H., Stanley, P., Bertozzi, C.R., Hart, G.W. &
Etzler, M.E. (2008). Essentials of Glycobiology, (2nd ed.) Plainview (NY): Cold Spring
Harbor Laboratory Press

Zhou, D., Dinter, A., Gutiérrez Gallego, R., Kamerling, J.P., Vliegenthart, J.F., Berger, E.G. &
Hennet, T. (1999). A β-1,3-N-acetylglucosaminyltransferase with poly-N-
acetyllactosamine synthase activity is structurally related to β-1,3-
galactosyltransferases. Proceedings of the National Academy of Sciences USA, Vol. 96,
No. 2, (January 1999), pp. 406-411
5

MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer

Aoife Lowery, Christophe Lemetre, Graham Ball and Michael Kerin
1 Department of Surgery, National University of Ireland Galway, Ireland
2 John Van Geest Cancer Research Centre, School of Science & Technology, Nottingham Trent University, Nottingham, UK

1. Introduction
Breast cancer is the most common form of cancer among women. In 2009, an estimated
194,280 new cases of breast cancer were diagnosed in the United States; breast cancer was
estimated to account for 27% of all new cancer cases and 15% of cancer-related mortality in
women (Jemal et al, 2009). Similarly, in Europe in 2008, the disease accounted for some 28%
and 17% of new cancer cases and cancer-related mortality in women respectively (Ferlay et
al, 2008). The increasing incidence of breast cancer worldwide will result in an increased
social and economic burden; for this reason there is a pressing need from a health and
economics perspective to develop and provide appropriate, patient specific treatment to
reduce the morbidity and mortality of the disease. Understanding the aetiology, biology and
pathology of breast cancer is hugely important in diagnosis, prognostication and selection of
primary and adjuvant therapy. Breast tumour behaviour and outcome can vary
considerably according to factors such as age of onset, clinical features, histological
characteristics, stage of disease, degree of differentiation, genetic content and molecular
aberrations. It is increasingly recognised that breast cancer is not a single disease but a
continuum of several biologically distinct diseases that differ in their prognosis and
response to therapy (Marchionni et al, 2008; Sorlie et al, 2001). The past twenty years have
seen significant advances in breast cancer management. Targeted therapies such as
hormonal therapy for estrogen receptor (ER) positive breast tumours and trastuzumab for
inhibition of HER2/neu signalling have become an important component of adjuvant
therapy and contributed to improved outcomes (Fisher et al, 2004; Goldhirsch et al, 2007;
Smith et al, 2007). However, our understanding of the molecular basis underlying breast
cancer heterogeneity remains incomplete. It is likely that there are significant differences
between breast cancers that reach far beyond the presence or absence of ER or HER2/neu
amplification. Patients with similar morphology and molecular phenotype based on ER, PR
and HER2/neu receptor status can have different clinical courses and responses to therapy.
There are small ER positive tumours that behave aggressively while some large high grade
ER negative, HER2/neu receptor positive tumours have an indolent course. ER-positive
tumours are typically associated with better clinical outcomes and a good response to
hormonal therapies such as tamoxifen (Osborne et al, 1998). However, a subset of these
patients recur and up to 40% develop resistance to hormonal therapy (Clarke et al, 2003).
Furthermore, clinical studies have shown that adding adjuvant chemotherapy to tamoxifen
in the treatment of node negative, ER positive breast cancer improves disease outcome
(Fisher et al, 2004). Indeed, treatment with tamoxifen alone is only associated with a 15%
risk of distant recurrence, indicating that 85% of these patients would do well without, and
could be spared the cytotoxic side-effects of adjuvant chemotherapy.
The heterogeneity of outcome and response to adjuvant therapy has driven the discovery of
further molecular predictors. Particular attention has focused on those with prognostic
significance which may help target cancer treatment to the group of patients who are likely
to derive benefit from a particular therapy. There has been a huge interest in defining the
gene expression profiles of breast tumours to further understand the aetiology and progression
of the disease in order to identify novel prognostic and therapeutic markers. The sequencing
of the human genome and the advent of high throughput molecular profiling has facilitated
comprehensive analysis of transcriptional variation at the genomic level. This has resulted in
an exponential increase in our understanding of breast cancer molecular biology. Gene
expression profiling using microarray technology was first introduced in 1995 (Schena et al,
1995). This technology enables the measurement of expression of tens of thousands of
mRNA sequences simultaneously and can be used to compare gene expression within a
sample or across a number of samples. Microarray technology has been productively
applied to breast cancer research, contributing enormously to our understanding of the
molecular basis of breast cancer and helping to achieve the goal of individualised breast
cancer treatment. However as the use of this technology becomes more widespread, our
understanding of the inherent limitations and sources of error increases. The large amount
of data produced from such high throughput systems has necessitated the use of complex
computational tools for the management and analysis of these data, leading to rapid
developments in bioinformatics.
This chapter provides an overview of current gene expression profiling techniques, their
application to breast cancer prognostics and the bioinformatic challenges that must be
overcome to generate meaningful results that will be translatable to the clinical setting. A
literature search was performed using the PubMed database to identify publications
relevant to this review. Citations from these articles were also examined to yield further
relevant publications.

2. Microarray technology – principles & technical considerations


2.1 High throughput genomic technology
There are a multitude of high throughput genomic approaches which have been developed
to simultaneously measure variation in thousands of DNA sequences, mRNA transcripts,
peptides or metabolites:
• DNA microarray measures gene expression
• Microarray comparative genomic hybridisation (CGH) measures genomic gains and
losses or identifies differences in copy number for genes involved in pathological states
(Oosterlander et al, 2004)
• Single nucleotide polymorphism (SNP) microarray technology (Huang et al, 2001) has
been developed to test for genetic aberrations that may predispose an individual to
disease development.
• CpG arrays (Yan et al, 2000) can be used to determine whether patterns of specific
epigenetic alterations correlate with pathological parameters.
• Protein microarrays (Stoll et al, 2005) consisting of antibodies, proteins, protein
fragments, peptides or carbohydrate elements, are used to detect patterns of protein
expression in diseased states.
• ChIP-on-chip (Oberley et al, 2004) combines chromatin immunoprecipitation (ChIP)
with glass slide microarrays (chip) to detect how regulatory proteins interact with the
genome.
All of these approaches offer unique insights into the genetic and molecular basis of disease
development and progression.
This chapter focuses primarily on gene expression profiling and cDNA microarrays; however, many of the issues raised, particularly in relation to bioinformatics, are also applicable to the other "-omic" technologies.
Gene expression which is a measurement of gene “activity” can be determined by the
abundance of its messenger RNA (mRNA) transcripts or by the expression of the protein
which it encodes. ER, PR and HER2/neu receptor status are determined in clinical practice
using immunohistochemistry (IHC) to quantitate protein expression or fluorescence in situ
hybridisation (FISH) to determine copy number. These techniques are semi-quantitative and
are optimal when determining the expression of individual or a small number of genes.
Microarray technology is capable of simultaneously measuring the expression levels of
thousands of genes in a biological sample at the mRNA level. The abundance of individual
mRNA transcripts in a sample is a reflection of the expression levels of corresponding genes.
When a complementary DNA (cDNA) mixture reverse transcribed from the mRNA is
labelled and hybridised to a microarray, the strength of the signal produced at each address
shows the relative expression levels of the corresponding gene.
cDNA microarrays are miniature platforms containing thousands of DNA sequences which
act as gene specific probes, immobilised on a solid support (nylon, glass, silicon) in a parallel
format. They rely on the complementarity of the DNA duplex, i.e. the reassembly of strands by base pairing of A to T and C to G, which occurs with high specificity. Microarray platforms are available containing bound libraries of oligonucleotides representing essentially all known human genes, e.g. the Affymetrix GeneChip (Santa Clara, CA), Agilent array
(Santa Clara, CA), Illumina bead array (San Diego, CA). When fluorescence-labelled cDNA
is hybridised to these arrays, expression levels of each gene in the human genome can be
quantified using laser scanning microscopes. These microscopes measure the intensity of the
signal generated by each bound probe; abundant sequences generate strong signals and rare
sequences generate weaker signals. Despite differences in microarray construction and
hybridization methodologies according to manufacturing, microarray-based measurements
of gene expression appear to be reproducible across a range of different platforms when the
same starting material is used, as demonstrated by the MicroArray Quality Control project
(Shi et al, 2006).

2.2 Experimental approach


There are experimental design and quality control issues that must be considered when
undertaking a microarray experiment. The experiment should be designed appropriately to
answer a specific question and samples must be acquired from either patients or cultured
cells which are appropriate to the experimental setup. If the aim of a microarray experiment
is to identify differentially expressed genes between two groups of samples i.e.
“experiment” and “control”, it is critical that the largest source of variation results from the
phenotype under investigation (e.g. patient characteristic or treatment). The risk of
confounding factors influencing the results can be minimised by ensuring that the groups of
samples being compared are matched in every respect other than the phenotype under
investigation. Alternatively, large sample numbers can be used to increase the likelihood
that the experimental variable is the only consistent difference between the groups.
For a microarray experiment, fresh frozen tissue samples are required which have been
snap-frozen in liquid nitrogen or collected in an RNARetain™ or RNAlater™ solution to
preserve the quality of the RNA. Formalin-fixed and paraffin embedded tissue samples are
generally unsuitable for microarray studies as the RNA in the sample suffers degradation
during tissue processing (Cronin et al, 2004; Masuda et al, 1999, Paik et al, 2005).
Due to the omnipresence of ribonucleases and the inherent instability of RNA, it is essential to
measure the integrity of RNA after extraction. Only samples of the highest integrity should be
considered for reverse transcription to cDNA and hybridisation to the microarray platform
(Figure 1). Once obtained, intensity readings must be background adjusted and transformed; the data are then normalised and analysed, and results are generally interpreted according to
biological knowledge. The success of microarray experiments is highly dependent on
replication. Technical replication refers to the repeated assaying of the same biological sample
to facilitate quality assessment. Even more important is biological replication on larger sample
sets. The accuracy of microarray expression measurements must be confirmed using a reliable
independent technology, such as real-time quantitative PCR, and validated on a larger set of
independent biological samples. It is independent validation studies that determine the
strength or clinical relevance of a gene expression profile.
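For readers who want to see what the background adjustment and normalisation step looks like in practice, the sketch below applies a log2 transformation and quantile normalisation to a toy intensity matrix; real analyses would usually rely on established packages (for example the Bioconductor affy or limma pipelines), and the values here are invented.

```python
import numpy as np

def quantile_normalise(matrix):
    """Quantile-normalise a genes x samples matrix so all columns share one distribution."""
    ranks = np.argsort(np.argsort(matrix, axis=0), axis=0)    # rank of each value within its column
    reference = np.mean(np.sort(matrix, axis=0), axis=1)      # mean of each rank across arrays
    return reference[ranks]

if __name__ == "__main__":
    intensities = np.array([[120.0, 300.0, 90.0],
                            [2400.0, 1800.0, 2100.0],
                            [15.0, 40.0, 22.0],
                            [560.0, 470.0, 610.0]])            # toy background-adjusted intensities
    logged = np.log2(intensities + 1.0)                        # +1 guards against log of zero
    print(quantile_normalise(logged))
```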

Fig. 1. The steps involved in a cDNA microarray experiment



3. Molecular profiling – unlocking the heterogeneity of breast cancer


Breast cancer researchers were quick to adopt high throughput microarray technology,
which is unsurprising considering the opportunity it provides to analyse thousands of
genes simultaneously.

3.1 Class discovery


Microarray studies can be used in three different ways:
• class comparison
• class prediction
• class discovery (Simon et al, 2003)
All of these approaches have been applied to the study of breast cancer.
Class discovery involves analyzing a given set of gene expression profiles with the goal of
discovering subgroups that share common features. The early gene expression profiling
studies of breast cancer (Perou et al, 2000; Sorlie et al, 2001) were class discovery studies.
Researchers used an unsupervised method of analysis, in which tumours were clustered
into subgroups by a 496-gene “intrinsic” gene set that reflects differences in gene expression
between tumours without using selection criteria. The tumour subtype groupings consist of
luminal like subtypes which are predominantly ER and PR positive, basal-like subtypes
which are predominantly triple negative for ER, PR and HER2/neu, HER2/neu-like
subtypes which have increased expression of the HER2/neu amplicon and a normal-like
subtype (Perou et al, 2000). Subsequent studies from the same authors, on a larger cohort of
patients with follow-up data showed that the luminal subgroup could be further subdivided
into at least two groups, and that these molecular subtypes were actually associated with
distinct clinical outcomes (Sorlie et al 2001). These molecular subtypes of breast cancer have
been confirmed and added to in subsequent microarray datasets (Hu et al, 2006; Sorlie et al,
2003; Sotiriou et al, 2003). Given the importance of the ER in breast cancer biology, it is not
surprising that the most striking molecular differences were identified between the ER-
positive (luminal) and ER-negative subtypes. These differences have been repeatedly
identified and validated with different technologies and across different platforms (Fan et al,
2006; Farmer et al, 2005; Sorlie et al, 2006). The luminal subgroup has been subdivided into
two subgroups of prognostic significance:
• luminal A tumours which have high expression of ER –activated genes, and low
expression of proliferation related genes
• luminal B tumours which have higher expression of proliferation related genes and a
poorer prognosis than luminal A tumours (Geyer et al, 2009; Paik et al, 2000; Parker et
al, 2009; Sorlie et al, 2001, 2003).
The ER negative tumours are even more heterogeneous and comprise the:
• basal-like subgroup which lack ER and HER2/neu expression and feature more frequent
overexpression of basal cytokeratins, epidermal growth factor receptor and c-Kit
(Nielsen et al, 2004)
• HER2/neu subgroup which overexpress HER2/neu and genes associated with the
HER2/neu pathway and/or the HER2/neu amplicon on chromosome 17.
The HER2/neu and basal-like subtypes have in common an aggressive clinical behaviour
but appear to be more responsive to neoadjuvant chemotherapy than the luminal subtypes
(Carey et al, 2007; Rouzier et al, 2005). Also clustering with the ER negative tumours are the
normal-like breast cancers; these are as yet poorly characterised and have been shown to
cluster with fibroadenoma and normal breast tissue samples (Peppercorn et al, 2008). It is
important at this point to acknowledge the limitations of this molecular taxonomy;
intrasubtype heterogeneity has been noted despite the broad similarities defined by these
large subtypes (Parker et al, 2009). In particular the basal-like subgroup can be divided into
multiple additional subgroups (Kreike et al, 2007; Nielsen et al, 2004). Additionally,
although the luminal tumours have been separated into subgroups of prognostic
significance, meta-analysis of published expression data has suggested that these luminal
tumours actually form a continuum and their separation based on expression of
proliferation genes may be subjective (Shak et al, 2006; Wirapati et al, 2008). Furthermore,
the clinical significance of the normal-like subtype is yet to be determined; it has been
proposed that this subgroup may in fact represent an artefact of sample contamination with
a high content of normal breast tissue (Parker et al, 2009; Peppercorn et al, 2008). Due to
these limitations and the subjective nature of how the molecular subtypes were identified,
the translation of this taxonomy to the clinical setting as a definitive classification has been
difficult (Pustzai et al, 2006). The development of a prognostic test based on the intrinsic
subtypes has not been feasible to date. However, the seminal work by Sorlie and Perou
(Perou et al, 2000; Sorlie et al, 2001) recognized for the first time the scale of biological
heterogeneity within breast cancer and led to a paradigm shift in the way breast cancer is
perceived.
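The clustering machinery behind such class discovery can be sketched in a few lines: samples are grouped by the similarity of their expression over an intrinsic gene set, without reference to outcome or receptor status. The example below uses correlation distance and average-linkage hierarchical clustering on a random stand-in matrix; it illustrates the general unsupervised strategy rather than the exact parameters of the original studies.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 496))      # 50 tumour samples x 496 intrinsic genes (toy data)

distances = pdist(expression, metric="correlation")    # 1 - Pearson correlation between samples
tree = linkage(distances, method="average")            # average-linkage hierarchical clustering
subtypes = fcluster(tree, t=5, criterion="maxclust")   # cut the dendrogram into 5 candidate groups
print(subtypes)
```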

3.2 Class comparison


A number of investigators undertaking microarray expression profiling studies in breast
cancer have since adopted class comparison studies. These studies employ supervised
analysis approaches to determine gene expression differences between samples which
already have a predefined classification. The “null hypothesis” is that a given gene on the
array is not differentially expressed between the two conditions or classes under study. The
alternative hypothesis is that the expression level of that gene is different between the two
conditions. Examples of this approach are the microarray experiments that have been undertaken to define differences between invasive ductal and invasive lobular carcinomas
(Korkola, 2003; Weigelt, 2009; Zhao, 2004), between hereditary and sporadic breast cancer
(Berns, 2001; Hedenfalk, 2001) and between different disease stages of breast cancer
(Pedraza, 2010).
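In practice, such a class comparison reduces to a hypothesis test per gene followed by a correction for the thousands of tests an array implies. The sketch below runs a two-sample t-test for each gene and applies a Benjamini-Hochberg adjustment; it uses simulated data and simple statistics rather than the moderated approaches (such as limma) typically preferred for small sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(size=(1000, 10))        # 1000 genes x 10 "control" samples (toy data)
experiment = rng.normal(size=(1000, 10))     # 1000 genes x 10 "experiment" samples
experiment[:50] += 1.5                       # simulate 50 truly differentially expressed genes

t_stat, p_values = stats.ttest_ind(control, experiment, axis=1)

# Benjamini-Hochberg adjusted p-values (false discovery rate control).
order = np.argsort(p_values)
scaled = p_values[order] * len(p_values) / (np.arange(len(p_values)) + 1)
adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
q_values = np.empty_like(adjusted)
q_values[order] = np.clip(adjusted, 0, 1)

print("genes called significant at FDR 0.05:", int((q_values < 0.05).sum()))
```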

3.3 Class prediction


Perhaps the most clinically relevant use of this technology, however, are the microarray
class prediction studies which have been designed to answer specific questions regarding
gene expression in relation to clinical outcome and response to treatment. The latter
approach attempts to identify predictive markers, as opposed to the prognostic markers
which were identified in the “intrinsic gene-set”. There is frequently some degree of
confusion regarding the terms of “prognostic” and “predictive biomarkers”. This is partially
due to the fact that many prognostic markers also predict response to adjuvant therapy. This
is particularly true in breast cancer where, for example, the ER is prognostic, and predictive
of response to hormonal therapy, but also predictive of a poorer response to chemotherapy
(Carey, 2007; Kim, 2009; Rouzier, 2005).
One of the first microarray studies designed to identify a gene-set predictive of prognosis in
breast cancer was that undertaken by van't Veer and colleagues (van't Veer et al, 2002). They developed a 70-gene set capable of predicting the development of metastatic disease in a group of 98 patients made up of 34 who had developed metastasis within 5 years of follow-up, 44 patients who remained disease-free at 5 years, 18 patients with a BRCA-1 mutation, and 2 patients with a BRCA-2 mutation. The 70-gene signature was subsequently validated in a set of 295 breast cancers, including the group used to train the model, and shown to be more accurate than standard histopathological parameters at predicting outcome in these
breast cancer patients (van de Vijver et al, 2002). The signature includes many genes
involved in proliferation, and genes associated with invasion, metastasis, stromal integrity
and angiogenesis are also represented. This 70-gene prognostic signature classifies patients
based on correlation with a “good-prognosis” gene expression profile; a coefficient of
greater than 0.4 is classified as good prognosis. The signature was initially criticised for the
inclusion of some patients in both the discovery and validation stages (van de Vijver et al,
2002). However, it has been subsequently validated in multiple cohorts of node-positive and
node-negative patients and has been shown to outperform traditional clinical and
histological parameters at predicting prognosis (Buyse et al, 2006; Mook et al, 2009).
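The decision rule described above, correlation of a patient's 70-gene profile with the good-prognosis template and a 0.4 cut-off, can be written schematically as follows; the template and patient vectors here are random placeholders standing in for the published 70-gene profile.

```python
import numpy as np

rng = np.random.default_rng(42)
good_prognosis_template = rng.normal(size=70)                       # placeholder template profile
patient_profile = good_prognosis_template + rng.normal(scale=0.8, size=70)

coefficient = np.corrcoef(patient_profile, good_prognosis_template)[0, 1]
label = "good prognosis" if coefficient > 0.4 else "poor prognosis"
print(f"correlation = {coefficient:.2f} -> {label}")
```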

3.3.1 Mammaprint assay


The 70-gene signature was approved by the FDA to become the MammaPrint Assay
(Agendia BV, Amsterdam, The Netherlands); the first fully commercialized microarray
based multigene assay for breast cancer. This prognostic tool is now available and can be
offered to women under the age of 61 years with lymph node negative breast cancer. The
MammaPrint test results are dichotomous, indicating either a high or low risk of disease
recurrence, and the test performs best at the extremes of the spectrum of disease outcome
i.e. identifying patients with a very good or a very poor prognosis.
The MammaPrint signature is a purely prognostic tool, and its role as a predictive marker
for response to therapy was not examined at the time it was developed. Its clinical utility is currently being assessed, however, in a prospective clinical trial, the Microarray In Node negative and 1 to 3 positive lymph node Disease may Avoid ChemoTherapy (MINDACT) trial (Cardoso et al, 2008). The trial aims to recruit 6000 patients, all of whom will be assessed by
standard clinicopathologic prognostic factors and by the MammaPrint assay. In cases where
there is concordance between the standard prognostic factors and the molecular assay,
patients will be treated accordingly with adjuvant chemotherapy with or without endocrine
therapy for poor prognosis patients. If both assays predict a good prognosis, no adjuvant
chemotherapy is given, and adjuvant hormonal therapy is given alone where indicated. In
cases where there is discordance between the standard clinicopathological prognostic factors and the MammaPrint assay's prediction of prognosis, the patients are randomised to
receive adjuvant systemic therapy based on either the clinicopathological or the
MammaPrint prognostic prediction results. The expected outcome is that there will be a
reduction of 10-15% in the number of patients requiring adjuvant chemotherapy based on
the MammaPrint assay prediction. It is envisaged that this trial will answer the questions of
what patients can be spared chemotherapy and still have a good prognosis, thus
accelerating progress towards the goal of more tailored therapy for breast cancer patients.

3.3.2 Oncotype Dx assay


While MammaPrint was developed as a prognostic assay, the other most widely established
commercialized multigene assay Oncotype Dx was developed in a more context specific
manner as a prognostic and predictive test to determine the benefit of chemotherapy in
women with node-negative, ER-positive breast cancer treated with tamoxifen (Paik et al,
2004). The authors used published microarray datasets, including those that identified the
intrinsic breast cancer subtypes and the 70-gene prognostic signature identified by the
Netherlands group to develop real time quantitative polymerase chain reaction (RQ-PCR)
tests for 250 genes. Research undertaken by the National Surgical Adjuvant Breast and
Bowel Project (NSABP) B14 protocol using three independent clinical series, resulted in the
development of an optimised 21-gene predictive assay (Paik et al, 2004). The assay has been
commercialised as Oncotype® DX by Genomic Health Inc (http://www.genomichealth.com/OncotypeDX) and consists of a panel of 16
discriminator genes and 5 endogenous control genes which are detected by RQ-PCR using
formalin-fixed paraffin embedded (FFPE) sections from standard histopathology blocks. The
ability to use FFPE tissue facilitates clinical translation and has allowed retrospective
analysis of archived tissue in large cohorts with appropriate follow up data. The assay has
been used to generate Recurrence Scores (RS) by differentially weighting the constituent
genes which are involved in:
• proliferation (MKI67, STK15, BIRC5/Survivin, CCNB1, MYBL2)
• estrogen response (ER, PGR, SCUBE2)
• HER2/neu amplicon (HER2/neu/ERBB2, GRB7),
• invasion (MMP11, CTSL2)
• apoptosis (BCL2, BAG1)
• drug metabolism (GSTM1)
• macrophage response (CD68).
The assay was evaluated in 651 ER positive lymph node negative breast cancer patients who
were treated with either tamoxifen or tamoxifen and chemotherapy as part of the NSABP
B20 protocol (Paik et al, 2006). It was found that patients with high recurrence scores had a
large benefit from chemotherapy, with a 27.6% mean decrease in 10-year distant recurrence rates, while those with a low recurrence score derived virtually no benefit from
chemotherapy. The RS generated by the expression of the 21 genes is a continuous variable
ranging from 1-100, but has been divided into three groups for clinical decision making; low
(<18), intermediate (18-31) and high (>31). It has been shown in a number of independent
datasets that ER positive breast cancer patients with a low RS have a low risk of recurrence
and derive little benefit from chemotherapy. Conversely, ER positive patients with high RS
have a high risk of recurrence but do benefit from chemotherapy (Goldstein, 2006; Habel,
2006; Mina, 2007; Paik, 2006). The ability of the 21-gene signature to so accurately predict
prognosis has led to the inclusion of the Oncotype Dx assay in American Society of Clinical
Oncology (ASCO) guidelines on the use of tumour markers in breast cancer as a predictor of
recurrence in ER-positive, node-negative patients. However, despite the accurate
performance of the assay for high and low risk patients, there remains uncertainty regarding
the management of patients with intermediate RS (18-31). This issue is being addressed in a
prospective randomized trial assigning individual options for treatment (TAILORx)
sponsored by the National Cancer Institute (Lo et al, 2007). This multicentre trial aims to
recruit 10,000 patients with ER-positive, lymph node negative breast cancer who are
assigned to one of three groups based on their RS; low<11, intermediate 11-25 and high >25.
Notably, the RS criteria have been changed for the TAILORx trial, with the intermediate
range being changed from RS 18-30 to RS 11-25 to avoid excluding patients who may derive a
small benefit from chemotherapy (Sparano et al, 2006). Patients in the intermediate RS
group are randomly assigned to receive either adjuvant chemotherapy and hormonal
therapy, or hormonal therapy alone. The primary aim of the trial is to determine if ER
positive patients with an intermediate RS benefit from adjuvant chemotherapy or not.
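The overall shape of a recurrence-score-style calculation discussed above, group averages combined with differential weights and mapped onto a bounded 0-100 scale, can be sketched as follows. The group memberships follow the list given earlier, but the weights and rescaling are illustrative placeholders, not the proprietary Oncotype DX coefficients.

```python
GROUPS = {
    "proliferation": ["MKI67", "STK15", "BIRC5", "CCNB1", "MYBL2"],
    "estrogen": ["ER", "PGR", "SCUBE2"],
    "her2": ["ERBB2", "GRB7"],
    "invasion": ["MMP11", "CTSL2"],
    "apoptosis": ["BCL2", "BAG1"],
    "drug_metabolism": ["GSTM1"],
    "macrophage": ["CD68"],
}
# Placeholder weights for illustration only; the real coefficients differ.
WEIGHTS = {"proliferation": 1.0, "estrogen": -0.8, "her2": 0.5, "invasion": 0.3,
           "apoptosis": -0.2, "drug_metabolism": -0.1, "macrophage": 0.1}

def recurrence_style_score(expression):
    """expression: dict mapping gene symbol -> normalised, reference-scaled expression value."""
    group_scores = {name: sum(expression[gene] for gene in genes) / len(genes)
                    for name, genes in GROUPS.items()}
    raw = sum(WEIGHTS[name] * score for name, score in group_scores.items())
    return max(0.0, min(100.0, 10.0 * raw))        # arbitrary rescaling onto a 0-100 range

if __name__ == "__main__":
    toy = {gene: 5.0 for genes in GROUPS.values() for gene in genes}
    toy.update({"MKI67": 9.0, "BIRC5": 8.5, "ER": 3.0, "PGR": 2.5})   # proliferative, low-ER tumour
    print(round(recurrence_style_score(toy), 1))
```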
The MammaPrint and Oncotype Dx gene signatures both predict breast cancer behaviour; however, there are fundamental differences between them (outlined in Table 1). This chapter has focused on these signatures because they were the first to be developed, have been extensively validated, and are commercially available. However, it is important to note that other multi-gene assays have been developed and commercialized; they are not discussed in detail here as they are not yet as widely utilized (Loi et al, 2007; Ma et al, 2008; Ross et al, 2008; Wang et al, 2005).

Assay | MammaPrint | Oncotype Dx
Manufacturer | Agendia BV | Genomic Health, Inc.
Development of signature | From candidate set of 25,000 genes in 98 patients | From candidate set of 250 genes in 447 patients
Gene signature | 70 genes | 21 genes
Patient cohort | Stage I & II breast cancer; lymph node negative; <55 yrs | Stage I & II breast cancer; lymph node negative; ER positive; receiving tamoxifen
Platform | cDNA microarray | RQ-PCR
Sample requirements | Fresh frozen tissue or tissue collected in RNA preservative | FFPE tissue
Outcome | 5-year distant relapse free survival | 10-year distant relapse free survival
Test results | Dichotomous correlation coefficient: >0.4 = good prognosis, <0.4 = poor prognosis | Continuous recurrence score: <18 = low risk, 18-31 = intermediate risk, >31 = high risk
Predictive | No; purely prognostic | Yes
Prospective trial | MINDACT | TAILORx
FDA approved | Yes | No
ASCO guidelines | No | Yes
Table 1. Comparison of commercially available prognostic assays MammaPrint and Oncotype Dx

4. Microarray data integration


4.1 Setting standards for microarray experiments
It must be acknowledged that despite the multitude of breast cancer prognostic signatures
available, the overlap between the gene lists is minimal (Ahmed, 2005; Brenton, 2005; Fan et
al, 2006; Michiels et al, 2005). This lack of concordance has called into question the
applicability of microarray analysis across the entire breast cancer population. In order to
facilitate external validation of signatures and meta-analysis in an attempt to devise more
robust signatures, it is important that published microarray data be publicly accessible to
the scientific community. In 2001 the Microarray Gene Expression Data Society proposed
experimental annotation standards known as minimum information about a microarray
experiment (MIAME), stating that raw data supporting published studies should be made
publicly available in one of a number of online repositories (Table 2). These standards are now upheld by leading scientific journals, facilitating in-depth interrogation of multiple datasets simultaneously.

Public Database for Microarray Data | URL | Organization | Description
Array Express | http://www.ebi.ac.uk/arrayexpress/ | European Bioinformatics Institute (EBI) | Public data deposition and queries
GEO (Gene Expression Omnibus) | http://www.ncbi.nlm.nih.gov/geo/ | National Centre for Biotechnology Information (NCBI) | Public data deposition and queries
CIBEX (Center for Information Biology Gene Expression Database) | http://cibex.nig.ac.jp/index.jsp | National Institute of Genetics | Public data deposition and queries
ONCOMINE Cancer Profiling Database | http://www.oncomine.org/main/index.jsp | University of Michigan | Public queries
PUMAdb (Princeton University MicroArray database) | http://puma.princeton.edu/ | Princeton University | Public queries
SMD (Stanford Microarray Database) | http://genome-www5.stanford.edu/ | Stanford University | Public queries
UNC Chapel Hill Microarray database | https://genome.unc.edu/ | University of North Carolina at Chapel Hill | Public queries
Table 2. List of Databases with Publicly Available Microarray Data

4.2 Gene ontology


The volume of data generated by high throughput techniques such as microarray poses the
challenge of how to integrate the genetic information obtained from large scale experiments
with information about specific biological processes, and how genetic profiles relate to
functional pathways. The development of the Gene Ontology (GO) as a resource for
experimentalists and bioinformaticians has contributed significantly to overcoming this
challenge (Ashburner et al, 2000). The GO Consortium was established with the aim of
producing a structured, precisely defined, common, controlled vocabulary for describing
the roles of genes and gene products in any organism. Initially a collaboration between three
organism databases: Flybase (The Flybase Consortium, 1999), Mouse Genome Informatics
(Blake et al, 2000) and the Saccharomyces Genome Database (Ball et al, 2000), the GO
Consortium has grown to include several of the world’s major repositories for plant, animal
and microbial genomes.
The Gene Ontology provides a structure that organizes genes into biologically related
groups according to three criteria. Genes and gene products are classified according to:
• Molecular Function: biochemical activity of gene products at the molecular level
• Biological Process: biological function of a gene product
• Cellular Component: location in the cell or extracellular environment where molecular
events occur
Every gene is described by a finite, uniform vocabulary. Each GO entry is defined by a
numeric identifier in the format GO:#######. These GO identifiers are fixed to the textual
definition of the term, which remains constant. A GO annotation is the specific association
between a GO identifier and a gene or protein and has a distinct evidence source that
supports the association. A gene product can take part in one or more biological processes and
perform one or more molecular functions. Thus, a well characterized gene product can be
annotated to multiple GO terms in the three GO categories outlined above. GO terms are
related to each other such that each term is placed in the context of all of the other terms in a
directed acyclic graph (DAG). The relationships used by the GO are: "is_a", "part_of",
"regulates", "positively_regulates", "negatively_regulates" and "disjoint_from". Each term
in the DAG may have one or more parent terms and possibly one or more child nodes, and
the DAG gives a graphical representation of how GO terms relate to each other in a
hierarchical manner.
The development of Gene Ontology has facilitated analysis of microarray gene sets in the
context of the molecular functions and pathways in which they are involved (Blake &
Harris, 2002). GO-term analysis can be used to determine whether genetic “hits” show
enrichment for a particular group of biological processes, functions or cellular
compartments. One approach uses statistical analysis to determine whether a particular GO term
is over- or under-represented in the list of differentially expressed genes from a microarray
experiment. The statistical tests used for such analysis include hypergeometric, binomial or
Chi-square tests (Khatri et al, 2005).
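
As a minimal illustration of the hypergeometric approach, the Python sketch below computes an over-representation p-value for a hypothetical GO term among a list of differentially expressed genes. All counts are invented for the example and SciPy is assumed to be available; this is not a substitute for the dedicated GO tools listed later.

```python
from scipy.stats import hypergeom

# Hypothetical counts (for illustration only):
# N genes on the array, K of them annotated to the GO term of interest,
# n differentially expressed genes, k of which carry the annotation.
N, K, n, k = 20000, 300, 500, 25

# Probability of observing k or more annotated genes among the n
# differentially expressed genes purely by chance (over-representation).
p_over = hypergeom.sf(k - 1, N, K, n)
print(f"Over-representation p-value: {p_over:.3e}")
```

In practice one such test is run per GO term, so the resulting p-values themselves require multiple-testing correction.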
An alternative approach known as “gene-set testing” has been described which involves
beginning with a known set of genes and testing whether this set as a whole is differentially
expressed in a microarray experiment (Lamb et al, 2003; Mootha et al, 2003). The results of
such analyses inform hypotheses regarding the biological significance of microarray
analyses.
Several tools have been developed to facilitate analysis of microarray data using GO, and a
list of these can be found at: http://www.geneontology.org/GO.tools.microarray.shtml
Analysing microarray datasets in combination with biological knowledge provided by GO
makes microarray data more accessible to the molecular biologist and can be a valuable
strategy for the selection of biomarkers and the determination of drug treatment effect in
breast cancer (Arciero et al, 2003; Cunliffe et al, 2003).

4.3 Microarray meta-analysis – combining datasets


Meta-analyses have confirmed that different prognostic signatures identify similar
biological subgroups of breast cancer patients (Fan et al, 2006) and have also shown that the
designation of tumours to a “good prognosis”/”low risk” group or a “poor
prognosis”/”high risk” group is largely dependent on the expression patterns of
proliferative genes. In fact, some of these signatures have been shown to have improved
performance when only the proliferative genes are used (Wirapati, 2008). Meta-analyses of the
signatures have also proposed that the prognostic ability of the signatures is optimal in the
ER positive and HER2-negative subset of breast tumours (Desmedt, 2008; Wirapati, 2008),
the prognosis of this group of tumours being governed by proliferative activity.
Despite obvious clinical application, none of these prognostic assays are perfect, and they all
carry a false classification rate. The precise clinical value for these gene expression profiles
remains to be established by the MINDACT and TAILORx trials. In the interim the
performance of these assays is likely to be optimised by combining them with data from
traditional clinicopathological features, an approach which has been shown to increase
prognostic power (Sun et al, 2007).
Microarray technology has undoubtedly enhanced our understanding of the molecular
mechanisms underlying breast carcinogenesis; profiling studies have provided a myriad of
candidate genes that may be implicated in the cancer process and are potentially useful as
prognostic and predictive biomarkers or as therapeutic targets. However, as yet there is
little knowledge regarding the precise regulation of these genes and receptors, and further
molecular categories are likely to exist in addition to and within the molecular subtypes
already delineated. Accumulating data reveal the incredible and somewhat foreboding
complexity and variety of breast cancers and while mRNA expression profiling studies are
ongoing, a new player in breast cancer biology has come to the fore in recent years: a
recently discovered RNA species termed microRNA (miRNA), which many scientists believe
may represent a crucial link in the cancer biology picture.

5. MicroRNA - a recently discovered layer of molecular complexity


It has been proposed that the discovery of miRNAs as regulators of gene expression represents
a paradigm changing event in biology and medicine. This discovery was made in 1993 by
researchers at the Ambros laboratory in Dartmouth Medical School, USA at which time it was
thought to be a biological entity specific to the nematode C. elegans (Lee et al, 1993). In the
years following this discovery, hundreds of miRNAs were identified in animals and plants.
However it is only in the past 5 years that the field of miRNA research has really exploded
with the realisation that miRNAs are critical to the development of multicellular organisms
and the basic functions of cells (Bartel, 2004). MiRNAs are fundamental to genetic regulation,
and their aberrant expression and function have been linked to numerous diseases and
disorders (Bartel, 2004; Esquela-Kerscher & Slack, 2006). Importantly, miRNA have been
critically implicated in the pathogenesis of most human cancers, thus uncovering an entirely
new repertoire of molecular factors upstream of gene expression.

5.1 MicroRNA - novel cancer biomarkers


The first discovery of a link between miRNAs and malignancy was the identification of a
translocation-induced deletion at chromosome 13q14.3 in B-cell Chronic Lymphocytic
Leukaemia (Calin et al, 2002). Loss of miR-15a and miR-16-1 from this locus results in
increased expression of the anti-apoptotic gene BCL2. Intensifying research in this field,
using a range of techniques including miRNA cloning, quantitative PCR, microarrays and
bead-based flow cytometric miRNA expression profiling has resulted in the identification
and confirmation of abnormal miRNA expression in a number of human malignancies
including breast cancer (Heneghan et al, 2010; Lowery et al, 2007). MiRNA expression has
been observed to be upregulated or downregulated in tumours compared with normal
tissue, supporting their dual role in carcinogenesis as either oncogenic miRNAs or tumour
suppressors respectively (Lu et al, 2005). The ability to profile miRNA expression in human
tumours has led to remarkable insight and knowledge regarding the developmental lineage
and differentiation states of tumours. It has been shown that distinct patterns of miRNA
expression are observed within a single developmental lineage, which reflect mechanisms of
transformation, and support the idea that miRNA expression patterns encode the
developmental history of human cancers. In contrast to mRNA profiles, it is also possible to
successfully classify poorly differentiated tumours using miRNA expression profiles
(Volinia et al, 2006). In this manner, miRNA expression could potentially be used to
accurately diagnose poorly differentiated tissue samples of uncertain histological origin, e.g.
metastasis with an unknown primary tumour, thus facilitating treatment planning.
MicroRNAs exhibit unique, inherent characteristics which make them particularly attractive
for biomarker development. They are known to be dysregulated in cancer, with
pathognomonic or tissue specific expression profiles and even a modest number of miRNAs
is sufficient to classify human tumours, which is in contrast to the relatively large mRNA
signatures generated by microarray studies (Lu et al, 2005). Importantly, miRNA are
remarkably stable molecules. They undergo very little degradation even after processing
such as formalin fixation and remain largely intact in FFPE clinical tissues, lending
themselves well to the study of large archival cohorts with appropriate follow-up data (Li et
al, 2007; Xi et al, 2007). The exceptional stability of miRNAs in visceral tissue has stimulated
investigation into their possible preservation in the circulation and other bodily fluids
(urine, saliva etc.). The hypothesis is that circulating miRNAs, if detectable and quantifiable
would be the ideal biomarker accessible by minimally invasive approaches such as simple
phlebotomy (Cortez et al, 2009; Gilad et al, 2008; Mitchell et al, 2008).

5.2 MicroRNA microarray


The unique size and structure of miRNAs have necessitated the modification of existing
laboratory techniques, to facilitate their analysis. Due to the requirement for high quality
large RNA molecules, primarily for gene expression profiling, many laboratories adopted
column-based approaches to selectively isolate large RNA molecules, discarding small RNA
fractions which were believed to contain degradation products. Modifications to capture
miRNA have been made to existing protocols to facilitate analysis of the miRNA fraction.
Microarray technology has also been modified to facilitate miRNA expression profiling.
Labelling and probe design were initially problematic due to the small size of miRNA
molecules. Reduced specificity was also an issue due to the potential of pre-miRNA and pri-
miRNAs to produce signals in addition to active mature miRNA. Castoldi et al described a
novel miRNA microarray platform using locked nucleic acid (LNA)-modified capture
probes (Castoldi et al, 2006). LNA modification improved probe thermostability and
increased specificity, enabling miRNAs with single nucleotide differences to be
discriminated - an important consideration as sequence-related family members may be
involved in different physiological functions (Abbott et al, 2005). An alternative high
throughput miRNA profiling technique is the bead-based flow cytometric approach
developed by Lu et al.; individual polystyrene beads coupled to miRNA complementary
probes are marked with fluorescent tags (Lu et al, 2005). After hybridization with size-
fractioned RNAs and streptavidin-phycoerythrin staining, the beads are analysed using a
flow-cytometer to measure bead colour and phycoerythrin intensity, denoting miRNA identity and
abundance respectively. This method offered high specificity for closely related miRNAs
because hybridization occurs in solution. The high-throughput capability of array-based
platforms makes them an attractive option for miRNA studies compared to lower-
throughput techniques such as northern blotting and cloning, which remain essential for the
validation of microarray data.

5.2.1 MicroRNA microarray - application to breast cancer


Microarray analysis of miRNA expression in breast cancer is in its infancy relative to
expression profiling of mRNA. However, there is increasing evidence to support the
potential for miRNAs as class predictors in breast cancer. The seminal report of aberrant
miRNA expression in breast cancer by Iorio et al. in 2005 identified 29 miRNAs that were
differentially expressed in breast cancer tissue compared to normal, a subset of which could
correctly discriminate between tumour and normal with 100% accuracy (Iorio et al, 2005).
Among the leading differentially expressed miRNAs, miR-10b, miR-125b and miR-145 were
downregulated whilst miR-21 and miR-155 were consistently over-expressed in breast
tumours. In addition, miRNA expression correlated with biopathological features such as
ER and PR expression (miR-30) and tumour stage (miR-213 and miR-203). Mattie et al.
subsequently identified unique sets of miRNAs associated with breast tumors defined by
their HER2/neu or ER/PR status (Mattie et al, 2006). We have described 3 miRNA
signatures predictive of ER, PR and Her2/neu receptor status, respectively, which were
identified by applying artificial neural network analysis to miRNA microarray expression
data (Lowery et al, 2009). Blenkiron et al used an integrated approach of both miRNA and
mRNA microarray expression profiling to classify tumours according to "intrinsic subtype"
(Blenkiron et al, 2007). This approach identified a number of miRNAs that are differentially expressed according to
intrinsic breast cancer subtype and associated with clinicopathological factors including ER
status and tumour grade. Importantly, there was overlap between the differentially
expressed miRNAs identified in these studies.
There has been interest in assessing the prognostic value of miRNAs, and expression studies
in this regard have focused on detecting differences in miRNA expression between primary
breast tumours and metastatic lymph nodes. This approach has identified numerous
miRNA that are dysregulated in primary breast tumours compared to metastatic lymph
nodes (Baffa et al 2009; Huang et al, 2008). MiRNA have also been identified that are
differentially expressed in patients who had a “poor prognosis” or a short time to
development of distant metastasis (Foekens et al, 2008); miR-516-3p, miR-128a, miR-210, and
miR-7 were linked to aggressiveness of lymph node-negative, ER-positive human breast
cancer.
The potential predictive value of miRNA is also under investigation. Preclinical studies have
reported associations between miRNA expression and sensitivity to adjuvant breast cancer
therapy including chemotherapy, hormonal therapy and HER2/neu targeted therapy (Ma
et al, 2010; Tessel et al, 2010; Wang et al, 2010), prompting analysis of tumour response in
clinical samples. Rodriguez-Gonzalez et al attempted to identify miRNAs related to
response to tamoxifen therapy by exploiting the Foekens dataset (Foekens, 2008) which
comprised miRNA expression levels of 249 miRNAs in 38 ER positive breast cancer patients.
Fifteen of these patients were hormone naive and experienced relapse, which was treated
with tamoxifen. Ten patients responded and five did not, progressing within 6 months. Five
miRNAs (miR-4221, miR-30a-3p, miR-187, miR-30c and miR-182) were the most
differentially expressed between patients who benefitted from tamoxifen and those who
failed therapy. The predictive value for these miRNAs was further assessed in 246 ER
positive primary tumours of hormone naive breast cancer patients who received tamoxifen
as monotherapy for metastatic disease. MiR-30a-3p, miR-30c and miR-182 were significantly
associated with response to tamoxifen, but only miR-30c remained an independent predictor
on multivariate analysis (Rodriguez-Gonzalez, 2010).
Microarray-based expression profiling has also been used to identify circulating miRNAs
which are differentially expressed in breast cancer patients and matched healthy controls.
Zhao et al profiled 1145 miRNAs in the plasma of 20 breast cancer patients and 20 controls,
identifying 26 miRNAs with at least two-fold differential expression which reasonably
separated the 20 cases from the 20 controls (Zhao et al, 2010). This is the first example of
genome-wide miRNA expression profiling in the circulation of breast cancer patients and
indicates potential for development of a signature of circulating miRNAs that may function
as a diagnostic biomarker of breast cancer.
At present diagnostic, prognostic and predictive miRNA signatures and markers remain
hypothesis generating. They require validation in larger, independent clinical cohorts prior
to any consideration for clinical application. Furthermore as additional short non-coding
RNAs are continuously identified through biomarker discovery programmes, the available
profiling technologies must adapt their platforms to incorporate newer potentially relevant
targets. MicroRNAs possess the additional attraction of potential for development as
therapeutic targets due to their ability to regulate gene expression. It is likely that future
microarray studies will adopt and integrated approach of miRNA and mRNA expression
analysis in an attempt to decipher regulatory pathways in addition to expression patterns.

6. Limitations of microarray technology & bioinformatic challenges


In addition to the great promises and opportunities held by microarray technologies, several
issues need to be borne in mind and appropriately addressed in order to perform reliable
and robust experiments. As a result, several steps need to be addressed in order
to identify and validate reliable biomarkers in the scope of potential future clinical
application. This is one of the reasons why, despite the promises of using powerful high-
throughput technologies such as microarrays, only very few useful biomarkers have been
identified so far and/or have been translated into useful clinical assays or companion
diagnostics (Mammaprint®, Oncotype DX®). There still remains a lack of clinically relevant
biomarkers (Rifai et al, 2006). Amongst the limitations and pitfalls around the technology
and the use of microarrays, some of the most important are the reported lack of
reproducibility, as well as the massive amount of data generated, often extremely noisy and
of increasing complexity. For example, the recent Affymetrix GeneChip 1.0 ST
microarray platform (designed to target all known and predicted exons in the human, mouse
and rat genomes) contains approximately 1.2 million exon clusters corresponding to
over 1.4 million probesets (Lancashire et al, 2009). As a result, it is clear that
extracting any relevant key component from such datasets requires robust mathematical
and/or statistical models running on efficient hardware to perform the appropriate
analyses.
With this in mind, it is clear that the identification of new biomarkers still requires a
concerted, multidisciplinary effort. It requires the expertise of the biologist or pathologist to
extract the samples, the scientist to perform the analysis on the platform, and then the
bioinformatician/biostatistician to analyse and interpret the output. The data-mining
required to cope with these types of data needs careful consideration and specific
computational tools, and as such remains a major challenge in bioinformatics.

6.1 Problems with the analysis of microarray data


6.1.1 Dimensionality and false discovery
The statistical analysis of mRNA or miRNA array data poses a number of challenges. This
type of data is of extremely high dimensionality i.e. has a large number of variables. Each of
these variables represents the relative expression of a mRNA or miRNA in a sample. Each of
these components contains noise, is non-linear, may not follow a normal distribution
through a population, and may be strongly correlated with other probes in the profile. These
characteristics mean that the data may violate many of the assumptions of conventional
statistical techniques, particularly with parametric tests.
The dimensionality of the data poses a significant problem, and remains as one of the most
critical when analysing microarray data. When one analyses this type of data, one has to
consider what is referred to as the curse of dimensionality, first described by Bellman in 1961
as the “exponential growth of the search space as a function of dimensionality” (Bellman, 1961;
Bishop, 1995). This occurs in highly dimensional systems where the number of dimensions
masks the true importance of an individual single dimension (variable). It is particularly
true in a microarray experiment when the number of probes representing the number of
miRNA/mRNA studied far exceeds the number of available samples. So there is the
potential for a probe that is in reality of high importance to be missed when considered with
a large number of other probes. This problem is overcome by breaking down the analysis
into single or small groups of variables and repeating the analysis rather than considering
the whole profile in one single analysis. Other methods consist of using pre-processing
methods and feature extraction algorithms in order to only analyse a subset of the data
supposed to hold the most relevant features (Bishop, 1995), as determined by the pre-
processing steps.
High dimensionality also creates problems due to false discovery. The false discovery rate
(FDR) introduced by Benjamini and Hochberg (Benjamini and Hochberg, 1995) is a measure
of the number of features incorrectly identified as “differential” and various approaches
have been suggested to accurately control the FDR. If one has a high number of
dimensions and analyses each singly (as above), a proportion of probes can appear to be of high
importance purely by random chance, even when they are not. To
overcome this, one has to examine a rank order of importance and, when testing for
significance, correct the significance threshold by dividing it by the number of
dimensions (the Bonferroni correction). For example, when analysing the significance of single
probes from a profile containing 4,000 probes, the threshold becomes P < 0.05 divided by 4,000, i.e. P < 0.0000125.
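
The Python sketch below illustrates both the Bonferroni-adjusted threshold described above and the Benjamini-Hochberg step-up procedure on simulated data; the probe and sample numbers are arbitrary, and NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated expression matrix: 40 samples (20 per class) x 4,000 probes.
n_per_class, n_probes = 20, 4000
group_a = rng.normal(0.0, 1.0, size=(n_per_class, n_probes))
group_b = rng.normal(0.0, 1.0, size=(n_per_class, n_probes))

# Probe-by-probe t-tests (one p-value per probe).
_, pvals = stats.ttest_ind(group_a, group_b, axis=0)

# Bonferroni: divide the significance threshold by the number of probes.
bonferroni_threshold = 0.05 / n_probes          # 0.0000125 for 4,000 probes
n_bonferroni_hits = np.sum(pvals < bonferroni_threshold)

# Benjamini-Hochberg FDR: compare ranked p-values against (rank / m) * q.
q = 0.05
ranked = np.sort(pvals)
m = len(pvals)
below = ranked <= (np.arange(1, m + 1) / m) * q
n_bh_hits = below.nonzero()[0].max() + 1 if below.any() else 0

print(f"Probes passing Bonferroni: {n_bonferroni_hits}")
print(f"Probes declared significant by Benjamini-Hochberg: {n_bh_hits}")
```

Because the simulated data contain no true differences, both procedures should report few or no significant probes, which is exactly the behaviour one wants from a multiple-testing correction.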

6.1.2 Quality and noise


Noise also poses a problem in the analysis of mRNA or miRNA data. The inherent technical
and biological variability necessarily induces noise within the data, eventually leading to
biased results. The noise may lead to misinterpretation of sample groups that may actually
have no biological relevance. As a consequence extreme care needs to be taken to address
the problem of noise.
Noise may be random, where it affects all parts of the profile equally, or systematic,
where particular probes inherently have more noise than others because of the nature of the
component miRNA or genomic code that they represent.
It is now widely acknowledged that the reported high level of noise found in microarray
data is the most critical drawback of microarray-based studies, as pointed out by the MAQC
Consortium (Shi et al, 2006; Klebanov and Yakovlev, 2007).

6.1.3 Complexity and non-normality


Because of the complex nature of the profile a particular mRNA or miRNA may be non-
normally distributed through a population. Such non-normality will immediately invalidate
any statistical test that uses parametric statistics i.e. depends on the assumption of a normal
distribution. Invalidated tests would include ANOVA and t-test. To overcome this, the data
would have to be transformed mathematically to follow a normal distribution or an
alternative non parametric test would have to be employed. Examples of non-parametric
tests include Kruskal-Wallis and Mann-Whitney U, which are ANOVA and unpaired t-test
alternatives respectively. Generally non-parametric tests lack power compared to their
parametric alternatives and this may prove to be a problem in high dimensional space due
to the reasons described previously.
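
The SciPy sketch below contrasts a parametric t-test with the non-parametric alternatives named above, using simulated skewed intensity values; the distributions and sample sizes are purely illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated, non-normally distributed expression values for one probe
# in two patient groups (log-normal to mimic skewed intensities).
tumour = rng.lognormal(mean=1.2, sigma=0.6, size=30)
normal = rng.lognormal(mean=1.0, sigma=0.6, size=30)

# Parametric test (assumes normality).
t_stat, t_p = stats.ttest_ind(tumour, normal)

# Non-parametric alternative (no normality assumption).
u_stat, u_p = stats.mannwhitneyu(tumour, normal, alternative="two-sided")

# Kruskal-Wallis generalises the comparison to three or more groups.
third_group = rng.lognormal(mean=1.1, sigma=0.6, size=30)
h_stat, kw_p = stats.kruskal(tumour, normal, third_group)

print(f"t-test p={t_p:.3f}, Mann-Whitney p={u_p:.3f}, Kruskal-Wallis p={kw_p:.3f}")
```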

6.1.4 Reproducibility
Reproducibility has a marked effect on the accuracy of any analysis conducted. Furthermore
reproducibility has a profound effect on the impact of other issues such as dimensionality
and false detection. Robust scientific procedure requires that results be
reproducible in order to reduce the within-sample variability, the variability between
sample runs and the variability across multiple reading instruments. Aspects of variability
can be addressed using technical and experimental replicates. The averaging of sample
profiles can be used to increase the confidence in the profiles for comparison (Lancashire et
al., 2009). Technical replicates provide information on the variability associated with
instrumental variability whilst experimental (or biological) replicates give a measure of the
natural sample to sample variation. Problems in data analysis occur when the technical
variability is high. In this situation the problem in part can be resolved by increasing the
number of replicates. If however the technical variation is higher than the biological
variation then the sample cannot be analysed.

6.1.5 Auto-correlation or co-correlation


Auto correlation exists when two components within a system are strongly linearly
correlated with one another. In any complicated system there are likely to be a number of
components that are auto correlated. This is especially true in array profiling of biological
samples, where, due to biological processes, one gene product in a set of samples is likely to
interact or correlate with another through a population.

Auto correlation becomes a problem when using linear based regression approaches. This is
because one of the assumptions of regression using multiple components is that the
components are not auto correlated. If the intensities of multiple miRNA probes are to be added
into a regression to develop a classifier, these components should not be auto correlated.
Auto correlation can be tested for using the Durbin-Watson test.
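
As an illustrative sketch (using NumPy and statsmodels on simulated probe intensities), the code below measures the linear correlation between two probes and computes the Durbin-Watson statistic on the residuals of a simple regression of one probe on the other. Note that the Durbin-Watson statistic strictly tests for serial correlation in residuals ordered along the sample axis, so the Pearson correlation is often the more direct check for co-correlated probes.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)

# Two simulated probe intensity vectors across 60 samples; probe_b is
# deliberately constructed to be strongly correlated with probe_a.
probe_a = rng.normal(8.0, 1.0, size=60)
probe_b = 0.9 * probe_a + rng.normal(0.0, 0.3, size=60)

# Simple linear correlation between the two probes.
r = np.corrcoef(probe_a, probe_b)[0, 1]

# Regress one probe on the other and compute the Durbin-Watson statistic
# on the residuals; values near 2 suggest uncorrelated residuals.
model = sm.OLS(probe_b, sm.add_constant(probe_a)).fit()
dw = durbin_watson(model.resid)

print(f"Pearson r = {r:.2f}, Durbin-Watson statistic = {dw:.2f}")
```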

6.1.6 Generality
The whole purpose of biomarker (or set of biomarkers) identification, using high-
throughput technologies or any other, is to provide the clinicians with an accurate model in
order to assess a particular aspect. However, a model is only as good as its ability to
generalize to unseen real-world data. A model only able to explain the population on which
it was developed would be of little use in any real application.
As a result, if one is to develop classifiers from mRNA or miRNA array data, the features
identified should be generalisable; that is, they should predict for new cases in the general
population of cases. When analysing high dimensional data there is an increased risk of over
fitting, particularly when the analysis methods imply supervised training on a subset of the
population. So for example, when a large number of mRNA or miRNA are analysed there is
the potential for false detection to arise. If a random element identified through false
detection is included as a component of a classifier (model) then the generality of that
classifier will be reduced; i.e. it is not a feature that relates to the broader population but is a
feature specific to the primary set of data used to develop the classifier. Standards of
validation required to determine generality have been defined by Michiels et al, 2007.
Generality of classifiers can be increased by the application of bootstrapping or cross
validation approaches.
Some algorithms and approaches, that usually involve supervised training, suffer from
over-fitting (sometimes called memorisation). This is a process where a classifier is
developed for a primary dataset but models the noise within the data as well as the relevant
features. This means that the classifier will not accurately classify for new cases i.e. it does
not represent a general solution to the problem which is applicable to all cases. This is
analogous, for example, to one developing a classifier that predicts well the risk of
metastasis for breast cancer patients from Nottingham but will not predict well for a set of
cases from Denmark. Over fitted classifiers seldom represent the biology of the system being
investigated and the features identified are often falsely detected.
One of the most common solutions to avoid over-fitting is to apply a Cross Validation
technique in combination with the supervised training. Random sample cross validation is a
process of randomly partitioning the data. Firstly, the data are divided into two or three parts (figure 2); the first
part is used to develop the classifier and the second or second and third parts are used to
test the classifier. These parts are sometimes termed training, test and validation data sets
respectively. In certain classifiers such as Artificial Neural Network based classifiers the
second blind set is used for optimisation and to prevent over fitting. In random sample cross
validation the random selection and training process is repeated a number of times to create
a number of models each looking at the global dataset in a number of different ways (figure
2). Often the mean performance of these models is considered.
Leave one out cross validation is an approach also used to validate findings. In this case one
sample is left out of the analysis. Once training is complete the sample left out is tested. This
process is repeated a number of times to determine the ability of a classifier to predict
unseen cases. Such cross validation approaches drive the classifier solution
to a generalised one by stopping the classifier from training too much on a seen dataset and
stopping the training earlier based on a blind dataset.
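
The scikit-learn sketch below illustrates both strategies, random sample cross validation and leave-one-out, on a simulated two-class dataset; the classifier, sample size and probe count are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, ShuffleSplit, cross_val_score

# Simulated dataset: 60 samples, 500 probe intensities, 2 classes.
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

clf = LogisticRegression(max_iter=1000)

# Random sample cross validation: repeated random splits into training
# and held-out test sets, averaging performance across the repeats.
shuffle_cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
shuffle_scores = cross_val_score(clf, X, y, cv=shuffle_cv)

# Leave-one-out cross validation: each sample is held out exactly once.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(f"Random-split accuracy: {shuffle_scores.mean():.2f}")
print(f"Leave-one-out accuracy: {loo_scores.mean():.2f}")
```

Averaging the performance across the repeated random splits, as described in the text, gives a more honest estimate of how the classifier will behave on unseen cases than any single split.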

7. Methods used to analyse microarray data and their limitations


With the advent of cutting edge new technologies such as microarrays, the analysis tools for
the data produced need to be appropriately applied. Although expression arrays have
brought high hopes and expectations, they have brought tremendous challenges with them.
They have been proven to suffer from different limitations as previously discussed.
However, innovative computational analysis solutions have been developed and have been
proven efficient and successful at identifying markers of interest regarding particular
questions. This section presents some of the most common methods employed to overcome
the limitations discuss above, and to analyse expression array data.

7.1 Application of ordination techniques


If we are to utilise the mRNA or miRNA profile, we have to identify, despite its high
dimensionality, robust features that are statistically valid for the general population, not just for a
subset. Ordination techniques are used to map the variation in data. They are not directly
predictive and cannot classify directly unless combined with another classification
technique.

Fig. 2. Illustration of Cross Validation technique, here with three subsets: the training subset
used to train the classifier, the test subset used to stop the training when it has reached an
optimal performance on this subset, and a validation subset to evaluate the performance
(generalization ability) of the trained classifier.

7.1.1 Principal components analysis


PCA is usually a method of choice for dimensionality reduction. It is a multivariate
exploratory technique used to simplify complex data space (Raychaudhuri et al, 2000) by
translating the data space into a new space defined by the principal components. It works
by identifying the main (principal) components that best explain the shape (variance) of a
data set. Each principal component is a vector (line) through the data set that explains a
proportion of the variance; it is the expression of a linear combination of the data. In PCA
the first component added is the one that explains the most variance; the second
component added is then orthogonal to the first. Subsequent orthogonal components are
added until all of the variation is explained. The addition of vectors through a
multidimensional data set is difficult to visualise in print; we have tried to illustrate it with 3
dimensions in figure 3. In mRNA/miRNA profile data where thousands of dimensions
exist, PCA is a useful technique as it reduces the dimensionality to a manageable number of
principal components. If the majority of the variance is explained in 2 or 3 principal
components these can be used to visualise the structure of the population using 2 or 3
dimensional plots. A limited parameterisation can also be conducted to determine the
contribution of each parameter (miRNA) to each of the principal components. This however
suffers from the curse of dimensionality in high dimensional systems. Thus the main
limitation of using PCA for gene expression data is the inability to verify the association of a
principal component vector with the known experimental variables (Marengo et al, 2004).
This often makes it difficult to accurately identify the importance of the mRNA or miRNA in
the system, and makes it a valuable tool only for data reduction.

Fig. 3. Example of a 3 dimension PCA with the 3 orthogonal PCs.
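
A minimal scikit-learn sketch of PCA applied to a simulated expression matrix is given below; the matrix dimensions are arbitrary, and the loadings extracted at the end correspond to the limited parameterisation mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Simulated expression matrix: 40 samples x 2,000 probes.
X = rng.normal(size=(40, 2000))

# Standardise the probes, then project the samples onto the first
# three principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)          # 40 x 3 matrix for 2D/3D plots

# Proportion of the total variance explained by each component, and the
# loadings (contribution of each probe to each component).
print(pca.explained_variance_ratio_)
loadings = pca.components_                    # 3 x 2,000 array
```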



7.1.2 Hierarchical clustering


Although several clustering techniques exist, the most widely used in the context of microarray
data analysis is hierarchical clustering. Hierarchical clustering is used to identify the
structure of a given population of cases or a given set of markers such as proteins. Every
case is considered to have a given position in multidimensional space. Hierarchical
clustering determines the similarity of cases in this space based on the distance between
points. There are various linkage methods used for calculating distance, such as single
linkage, complete linkage and average linkage. Single linkage computes the distance as the
distance between the two nearest points in the clusters being compared. Complete linkage
computes the distance between the two farthest points, whilst average linkage averages all
distances across all the points in the clusters being compared. One commonly used distance
measure is Euclidean distance, which is the straight-line distance between two points. In
fact it considers the distance in multidimensional space between each point and every other
point. In this way a hierarchy of distances is determined. This hierarchy is plotted in the
form of a dendrogram (figure 4). From this dendrogram we can identify clusters of cases or
markers that are similar at a given distance.
The one major problem concerning clustering is that it suffers from the curse of
dimensionality when analysing complex datasets. In a high dimensional space, it is likely
that for any given pair of points within a cluster there will exist dimensions on which these
points are far apart from one another. Therefore distance functions using all input features
equally may not be truly effective (Domeniconi et al, 2004). Furthermore, clustering methods
will often fail to identify coherent clusters due to the presence of many irrelevant and
redundant features (Greene et al, 2005). Additionally, the large number of available
distance measures may introduce additional bias: it has been reported that the choice of a
distance measure can greatly affect the results and produce different outcomes after the
analysis (Quackenbush, 2001). Dimensionality is also of importance when one is examining
the structure of a population through ordination techniques. This is particularly the case
when utilising hierarchical cluster analysis. This approach is of limited suitability for high
dimensional data as in a high dimensional space the distance between individual cases
reaches convergence making all cases appear the same (Domeniconi et al, 2004). This makes
it difficult to identify the real structure in the data or clusters of similar cases.
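
The SciPy sketch below performs average-linkage hierarchical clustering of simulated samples using Euclidean distances; the data and the choice of three clusters are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)

# Simulated expression matrix: 30 samples x 200 probes.
X = rng.normal(size=(30, 200))

# Pairwise Euclidean distances between samples, then average-linkage
# agglomerative clustering.
distances = pdist(X, metric="euclidean")
tree = linkage(distances, method="average")

# Cut the dendrogram to obtain, for example, three clusters of samples.
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(tree) can be used with matplotlib
# to draw the tree itself, as in figure 4.
```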

7.2 Application of modelling techniques


This second part of the section, focusing on analysis tools, considers more advanced techniques
from the field known as machine learning. There are, however, a number of other techniques
that can be employed in a predictive or classification capacity. Others include hidden
Markov and Bayesian methods. These are widely described in the literature.

7.2.1 Decision tree based methodologies


Decision tree methodologies include boosted decision trees, classification and regression
trees, and random forest methodologies. This approach is based on splitting a population into
groups based on a hierarchy of rules (figure 5). Thus a given case is split into a given class
based on a series of rules. This approach has been modified in a number of ways. Generally,
a decision is made based on a feature that separates classes (one branch of the cluster
dendrogram from another) within the population. This decision is based on a logical or
numerical rule. Although their use in the analysis of miRNA data has been limited, decision
trees have been used in the analysis of miRNA data to classify cancer patients (Xu
et al, 2009).

Fig. 4. Example of a hierarchical clustering analysis result aiming to find clusters of similar
cases.

Fig. 5. Schematic example of the basic principle of Decision Trees



Boosted decision trees take the primary decision tree algorithm and boost it. Boosting is a
process where classifiers are derived to allow prediction of those not correctly predicted by
earlier steps. This means that a supervised classification is run where the actual class is
known. A decision tree is created that classifies correctly as many cases as possible. Those
cases that are incorrectly classified are given more weighting. A new tree is then created
with these boosted weights. This process is similar to the iterative learning that is conducted
with the Artificial Neural Network back propagation algorithm.
Random forest approaches take the basic decision tree algorithm and couple it with random
sample cross validation. In this way a forest of trees is created. Integration of a number of
decision trees identifies a combined decision tree which, as it is developed on blind cases,
represents something approaching a generalised solution for the problem being modelled
(Breiman et al, 2001). This approach has been shown to be very good at making generalised
classifications. The approach essentially derives each tree from a random vector with
equivalent distribution from within the data set, essentially an extensive form of cross
validation. Yousef et al, (2010) have used random forest as one method for the identification
of gene targets for miRNAs. Segura et al (2010) have used random forests as a part of an
analysis to define post recurrence survival in melanoma patients.
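
A minimal scikit-learn sketch of a random forest classifier applied to a simulated two-class dataset is shown below; the sample and probe numbers are arbitrary, and the feature importances are only a crude guide to which probes drive the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Simulated two-class miRNA-like dataset: 80 samples x 300 probes.
X, y = make_classification(n_samples=80, n_features=300, n_informative=15,
                           random_state=0)

# A random forest combines many decision trees, each grown on a bootstrap
# sample of cases with a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")

# Feature importances give a rough ranking of the most informative probes.
forest.fit(X, y)
top_probes = np.argsort(forest.feature_importances_)[::-1][:10]
print("Indices of the ten most important probes:", top_probes)
```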

7.2.2 Artificial Neural Networks


Artificial Neural Networks are non-linear predictive systems that may be used as
classifiers. A popular form of ANN is the multi-layer perceptron (MLP), which is used to solve
many types of problems such as pattern recognition and classification, function
approximation, and prediction. The approach is a form of artificial intelligence in that it
“learns” a solution to a problem from a preliminary set of samples. This is achieved by
comparing predicted versus actual values for a seen data set (the training data set described
earlier) and using the error of the predicted values from the ANN to iteratively develop a
solution that is better able to classify. In MLP ANNs, learning is achieved by updating the
weights that exist between the processing elements that constitute the network topology
(figure 6). The algorithm fits multiple activation functions to the data to define a given class
in an iterative fashion, essentially an extension of logistic regression. Once trained, ANNs
can be used to predict the class of an unknown sample of interest. Additionally, the
variables of the trained ANN model may be extracted to assess their importance in the system
of interest. ANNs can be coupled with Random sample cross validation or any other cross
validation method (LOO or MCCV) in order to ensure that the model developed is not over
fitted. One of the advantages of ANNs is that the process generates a mathematical model that
can be interrogated and explored in order to elucidate further biological details and validate
the model developed on a wide range of cases. A review of their use in a clinical setting is
presented in Lisboa and Taktak (2006). Back propagation MLP ANNs have been proposed for
use in the identification of biomarkers from miRNA data by Lowery et al, 2009.
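
The scikit-learn sketch below trains a small multi-layer perceptron on simulated data, using an internal validation split to stop training early in the spirit of the blind-set stopping described above; the network size and data dimensions are illustrative assumptions rather than a reproduction of any published model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Simulated two-class expression dataset: 100 samples x 200 probes.
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Scale using statistics learned from the training data only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Multi-layer perceptron trained by back-propagation; early_stopping holds
# out a fraction of the training data as a blind set and halts training
# when performance on it stops improving, limiting over-fitting.
mlp = MLPClassifier(hidden_layer_sizes=(8,), early_stopping=True,
                    validation_fraction=0.2, max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(f"Held-out test accuracy: {mlp.score(X_test, y_test):.2f}")
```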

7.2.3 Linear Discriminant Analysis (LDA)


Linear discriminant analysis attempts to separate the data into two subgroups by calculating
the straight line (or hyperplane) that best splits the population. Calculation of this discriminating line
is conducted by minimizing the sample variation within classes while maximizing the separation
between classes. As a result, any additional sample has its class determined by the side of
the discriminating line on which it falls.

LDA can outperform other linear classification methods as LDA tries to consider the
variation within the sample population. Nevertheless, LDA still suffers from its linear
characteristic, and often fails to accurately classify non-linear problems, which is mostly the
case in biomedical sciences (Stekel et al, 2003). This is the reason why non-linear classifiers
are recommended.
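
A brief scikit-learn sketch of LDA on simulated data is given below; the feature count is deliberately kept small, since LDA behaves poorly when variables far outnumber samples, and all numbers are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Simulated two-class dataset: 60 samples x 30 features. In a real
# microarray setting the probe set would normally be reduced first.
X, y = make_classification(n_samples=60, n_features=30, n_informative=8,
                           random_state=0)

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")

# New samples are assigned to the class on whose side of the learned
# discriminant they fall.
lda.fit(X, y)
predicted_class = lda.predict(X[:1])
print("Predicted class of first sample:", predicted_class)
```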

Fig. 6. Example of a classical MLP ANN topology with the details of a node (or neurone)

7.2.4 Support Vector Machines


Support Vector Machines (SVMs) are another popular form of machine learning algorithm
used in the analysis of microarray data for non-linear modelling (Vapnik and Lerner, 1963). They
are an evolution of LDA in the sense that they separate the data into two sub-
groups, constructing a straight line or
hyperplane that best separates the classes (figure 7). In the common example of a two-
class classification problem, SVMs attempt to find a linear "maximal margin hyperplane"
able to accurately discriminate the classes (Dreiseitl et al, 2001), similarly to Linear
Discriminant Analysis. If no such linear hyperplane can be found, usually due to the
inherent non-linearity of the dataset, the data are mapped into a high-dimensional feature
space using a kernel function (for example polynomial or radial basis functions) in which
the two classes can now be separated by a hyperplane which corresponds to a non-linear
classifier (Furey et al, 2000). The class of the unknown sample is then determined by the side
of the “maximal marginal hyper plane” on which it lies. SVMs have been used to analyse
miRNA data by Xue et al, 2005.
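
The scikit-learn sketch below contrasts a linear SVM with a radial basis function kernel SVM on a simulated two-class dataset; the kernel and regularisation settings are illustrative defaults rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simulated two-class dataset: 80 samples x 300 probes.
X, y = make_classification(n_samples=80, n_features=300, n_informative=12,
                           random_state=0)

# Linear SVM: maximal margin hyperplane in the original feature space.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Non-linear SVM: the radial basis function kernel implicitly maps the data
# into a higher-dimensional space where a separating hyperplane may exist.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

for name, model in [("linear", linear_svm), ("RBF", rbf_svm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name} SVM mean accuracy: {scores.mean():.2f}")
```

Placing the scaler inside the pipeline ensures that, during cross validation, scaling is learned only from the training folds and never from the held-out samples.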

Fig. 7. Schematic representation of the principle of SVM. SVM tries to maximise the margin
from the hyperplane in order to best separate the two classes (red positives from blue
negatives).

8. Conclusion
The capability of microarrays to simultaneously analyse expression patterns of thousands of
DNA sequences, mRNA or miRNA transcripts has the potential to provide a unique insight
into the molecular biology of malignancy. However, the clinical relevance and value of
microarray data is highly dependent on a number of crucial factors including appropriate
experimental design and suitable bioinformatic analysis. Breast cancer is a heterogeneous
disease with many biological variables which need to be considered to generate meaningful
results. Cohort selection is critical and sufficient biological and technical replicates must be
included as part of microarray study design. Experimental protocols should be appropriate
to the research question. The research community have enthusiastically applied high
throughput technologies to the study of breast cancer. Class prediction, class comparison
and class discovery studies have been undertaken in an attempt to unlock the heterogeneity
of breast cancer and identify novel biomarkers. Molecular signatures have been generated
which attempt to outperform current histopathological parameters at prognostication and
prediction of response to therapy. Two clinical tests based on gene expression profiling
(Oncotype DX and Mammaprint) are already in clinical use and being evaluated in
multicentre international trials. It is essential that the potential of microarray signatures is
carefully validated before they are adopted as prognostic tools in the clinical setting.
Standards have been set for the reporting of microarray data (MIAME) and such data is
publicly available to facilitate external validation and meta-analysis. It is imperative that
the data is integrated with knowledge normally processed in the clinical setting if we are to
overcome the difficulties in reproducibility, standardization and lack of proof of significance
beyond traditional clinicopathological tools that are limiting the incorporation of microarray
based tools into today’s standard of care.
Deriving biologically and clinically relevant results from microarray data is highly
dependent on bioinformatic analysis. Microarray data is limited by inherent characteristics
that render traditional statistical approaches less effective. These include high
dimensionality, false discovery rates, noise, complexity, non-normality and limited
reproducibility. High dimensionality remains one of the most critical challenges in the
analysis of microarray data. Hierarchical clustering approaches, which have been widely
used in the analysis of breast cancer microarray data, do not cope well with dimensionality.
In overcoming this challenge supervised machine learning techniques have been adapted to
the clinical setting to complement the existing statistical methods. The majority of machine
learning techniques originated in weak-theory domains such as business and marketing.
However, these approaches including Artificial Neural Networks and Support Vector
Machines have been successfully applied to the analysis of miRNA microarray data in the
context of clinical prognostication and prediction.
It is clear that the goal of translating microarray technology to the clinical setting requires
close collaboration between the involved scientific disciplines. If the current momentum in
microarray-based miRNA and mRNA translational research can be maintained this will add
an exciting new dimension to the field of diagnostics and prognostics and will bring us
closer to the ideal of individualized care for breast cancer patients.

9. References
Abbott AL, Alvarez-Saavedra E, Miska EA et al (2005) The let-7 MiRNA family members
mir-48, mir-84, and mir-241 function together to regulate developmental timing in
Caenorhabditis elegans. Dev Cell. 9(3):403-14.
Adam BL, Qu Y, Davis JW, Ward MD et al (2002) Serum protein fingerprinting coupled with
a pattern-matching algorithm distinguishes prostate cancer from benign prostate
hyperplasia and healthy men. Cancer Research 62:3609-3614.
Ahmed AA, Brenton JD (2005) Microarrays and breast cancer clinical studies: forgetting
what we have not yet learnt. Breast Cancer Res 7:96–99.
Arciero C, Somiari SB, Shriver CD, et al. (2003). Functional relationship and gene ontology
classification of breast cancer biomarkers. Int. J. Biol. Markers 18: 241-272.
Ashburner M, Ball CA, Blake JA et al (2000). Gene Ontology: tool for the unification of
biology. Nat Genet. 25(1): 25–29.
Baffa R, Fassan M, Volinia S et al.(2009) MicroRNA expression profiling of human metastatic
cancers identifies cancer gene targets. J Pathol, 219(2), 214‐221
Ball CA, Dolinski K, Dwight SS, et al (2000). Integrating functional genomic information into
the Saccharomyces Genome Database. Nucleic Acids Res;28:77–80
Ball G, Mian S, Holding F, et al (2002) An integrated approach utilizing artificial neural
networks and SELDI mass spectrometry for the classification of human tumours
and rapid identification of potential biomarkers. Bioinformatics 18:3395-3404.
Bartel DP. (2004) MiRNAs: genomics, biogenesis, mechanism and function. Cell; 116:281-97.
Bellman RE (1961) Adaptive Control Processes. Princeton University Press, Princeton, NJ
Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society.
Series B (Methodological);57:289-300.
Berns EM, van Staveren IL, Verhoog L et al (2001) Molecular profiles of BRCA1-mutated and
matched sporadic breast tumours: relation with clinico-pathological features.British
journal of cancer;85(4):538-45.
Bishop C (1995) Neural networks for pattern recognition. Oxford University Press.
Blake JA, Eppig JT, Richardson JE, Davisson MT (2000). The Mouse Genome Database
(MGD): expanding genetic and genomic resources for the laboratory mouse.
Nucleic Acids Res.;28:108–111
Blake JA, Harris MA (2002) The Gene Ontology (GO) project: structured vocabularies for
molecular biology and their application to genome and expression analysis. Curr
Protoc Bioinformatics. Chapter 7:Unit 7.2.
Blenkiron C, Goldstein LD, Thorne NP, et al (2007) MicroRNA expression profiling of
human breast cancer identifies new markers of tumor subtype. Genome Biol
8(10):R214
Brenton JD, Carey LA, Ahmed AA, Caldas C (2005). Molecular classification and molecular
forecasting of breast cancer: ready for clinical application? J Clin Oncol 23:7350–
7360.
Breiman L. Random Forests (2001) Machine Learning 45:5-32.
Buyse M, Loi S, van’t Veer L, et al. (2006) Validation and clinical utility of a 70-gene
prognostic signature for women with node-negative breast cancer. J Natl Cancer
Inst. 98(17):1183-1192.
Calin GA, Dumitru CD, Shimizu M, et al (2002) Frequent deletions and down-regulation of
micro- RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia
PNAS.;99(24):15524-9.
Cardoso F, Van’t Veer L, Rutgers E, et al. (2008) Clinical application of the 70-gene profile:
the MINDACT trial. J Clin Oncol; 26:729–735.
Carey LA, Dees EC, Sawyer L et al (2007). The triple negative paradox: Primary tumor
chemosensitivity of breast cancer subtypes. Clin Cancer Res 2007; 13:2329 –2334.
Castoldi M, Schmidt S, Benes V, et al (2006) A sensitive array for MiRNA expression
profiling (miChip) based on locked nucleic acids (LNA). RNA;12(5):913-20.
Clarke R, Liu MC, Bouker KB, et al (2003). Antiestrogen resistance in breast cancer and the
role of estrogen receptor signaling. Oncogene;22(47):7316-39.
Cortez MA, Calin GA (2009). MiRNA identification in plasma and serum: a new tool to
diagnose and monitor diseases. Expert Opin Biol Ther;9(6):703-711.
Cronin M, Pho M, Dutta D et al (2004). Measurement of gene expression in archival
paraffin-embedded tissues: development and performance of a 92-gene reverse
transcriptase-polymerase chain reaction assay. Am J Pathol. 164:35–42
Cunliffe HE, Ringner M, Bilke S, et al. (2003). The gene expression response of breast cancer
to growth regulators: patterns and correlation with tumor expression profiles.
Cancer Res. 63:7158-7166.
Desmedt C, Haibe-Kains B, Wirapati P, et al (2008). Biological processes associated with
breast cancer clinical outcome depend on the molecular subtypes. Clin Cancer Res
;14:5158-65.
Domeniconi C, Papadopoulos D, Gunopulos D, et al (2004). Subspace clustering of high
dimensional data. Proceedings 4th SIAM International Conference on Data
Mining, pp. 517-521. Lake Buena Vista, FL, SIAM, 3600 UNIV CITY SCIENCE
CENTER, PHILADELPHIA, PA 19104-2688 USA.
Dreiseitl S, Ohno-Machado L, Kittler H, et al (2001). A Comparison of Machine Learning
Methods for the Diagnosis of Pigmented Skin Lesions. Journal of Biomedical
Informatics;34:28-36.
Esquela-Kerscher A, Slack FJ.(2006) Oncomirs - MiRNAs with a role in cancer. Nature
reviews;6(4):259-69.
Fan C, Oh DS, Wessels L, et al (2006) Concordance among gene-expression-based predictors
for breast cancer. N Engl J Med;355:560–9.
Farmer P, Bonnefoi H, Becette V, et al (2005) Identification of molecular apocrine breast
tumours by microarray analysis.Oncogene;24:4660–71.
Ferlay J, Parkin DM, Steliarova-Foucher E (2010) Estimates of cancer incidence and mortality
in Europe in 2008. Eur J Cancer; 46:765-781.
Fisher B, Jeong JH, Bryant J, et al (2004). Treatment of lymph-node-negative, oestrogen
receptor-positive breast cancer: Long-term findings from National Surgical
Adjuvant Breast and Bowel Project randomised clinical trials. Lancet;364:858–868
Foekens JA, Sieuwerts AM, Smid M et al (2008) Four miRNAs associated with
aggressiveness of lymph node negative, estrogen receptor-positive human breast
cancer.PNAS;105(35):13021-6.
Furey T S, Cristianini N, Duffy N, et al (2000). Support vector machine classification and
validation of cancer tissue samples using microarray expression data.
Bioinformatics16:906-914.
Geyer FC, Lopez-Garcia MA, Lambros MB, Reis-Filho JS (2009) Genetic Characterisation of
Breast Cancer and Implications for Clinical Management. J Cell Mol Med (10):4090-103.
Gilad S, Meiri E, Yogev Y, et al (2008). Serum MiRNAs are promising novel biomarkers.
PLoS ONE. ;3(9):e3148.
Goldhirsch A, Wood WC, Gelber RD,et al (2007). Progress and promise: highlights of the
international expert consensus on the primary therapy of early breast cancer 2007.
Ann Oncol;18(7):1133-44.
Goldstein LJ, Gray R, Badve S, et al (2008) Prognostic utility of the 21-gene assay in hormone
receptor-positive operable breast cancer compared with classical clinicopathologic
features. J Clin Oncol;26:4063–4071
Greene D, Cunningham P, Jorge A, et al. (2005). Producing accurate interpretable clusters
from high-dimensional data, Proceedings 9th European Conference on Principles
and Practice of Knowledge Discovery in Databases (PKDD), pp. 486-494, Porto
Portugal.
Habel LA, Shak S, Jacobs MK, et al (2006). A population-based study of tumor gene
expression and risk of breast cancer death among lymph node-negative patients.
Breast Cancer Res;8:R25.
Hedenfalk I, Duggan D, Chen Y, et al (2001) Gene expression profiles in hereditary breast
cancer. N Engl J Med.;344(8):539-48.
Heneghan HM, Miller N, Kerin MJ. (2010)MiRNAs as biomarkers and therapeutic targets in
cancer. Curr Opin Pharmacol;10(5):543-50.
Hu Z, Fan C, Oh DS,, et al. (2006) The molecularportraits of breast tumors are conserved
across microarray platforms. BMC Genom ;7:96.
Huang JX, Mehrens D, Wiese R,, et al. 2001. High-throughput genomic and Proteomic
analysis using microarray technology. Clinical Chem, 47: 1912-16.
Huang Q, Gumireddy K, Schrier M et al.(2008) The microRNAs miR‐373 and miR‐520c
promote tumour invasion and metastasis. Nat Cell Biol;10(2):202‐210
Izmirlian G (2004) Application of the random forest classification algorithm to a SELDI-TOF
proteomics study in the setting of a cancer prevention trial. Annals of the New
York Academy of Sciences; 1020:154-174
Iorio MV, Ferracin M, Liu CG, et al (2005). MicroRNA gene expression deregulation in
human breast cancer. Cancer research;65: 7065-70.
Jemal A, Siegel R, Ward E, et al. (2009) Cancer statistics, 2009. CA Cancer J Clin;59:225-249.
Khatri, P., Draghici, S. (2005), Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics; 21: 3587–3595.
Kim C, Taniyama Y, Paik S (2009). Gene-expression-based prognostic and predictive
markers for breast cancer- A primer for practicing pathologists Crit Rev Oncol
Hematol.;70(1):1-11.
Klebanov L, Yakovlev A (2007) How high is the level of technical noise in microarray data?
Biology Direct;2:9.
Kreike B, van Kouwenhove M, Horlings H et al (2007). Gene expression profiling and
histopathological characterization of triple-negative/basal-like breast carcinomas.
Breast Cancer Res;9:R65.
Korkola JE, DeVries S, Fridlyand J, et al (2003). Differentiation of lobular versus ductal
breast carcinomas by expression microarray analysis. Cancer Res;63:7167–7175.
Lamb J, Ramaswamy S, Ford HL, et al (2003). A mechanism of cyclin D1 action encoded in
the patterns of gene expression in human cancer. Cell; 114(3):323-34.
Lancashire LJ, Lemetre C, Ball GR (2009). An introduction to artificial neural networks in
bioinformatics--application to complex microarray and mass spectrometry datasets
in cancer studies. Briefings in Bioinformatics;10:315-329.
Lee RC, Feinbaum RL, Ambros V. (1993) The C. elegans heterochronic gene lin-4 encodes
small RNAs with antisense complementarity to lin-14. Cell.;75(5):843-54.
Leopold E, Kindermann J (2006). Content Classification of Multimedia Documents using
Partitions of Low-Level Features. Journal of Virtual Reality and Broadcasting 3(6).
116 Computational Biology and Applied Bioinformatics

Li J, Smyth P, Flavin R, et al. (2007) Comparison of miRNA expression patterns using total
RNA extracted from matched samples of formalin- fixed paraffin-embedded (FFPE)
cells and snap frozen cells. BMC biotechnology;7:36
Lisboa PJ, Taktak AF(2006). The use of artificial neural networks in decision support in
cancer: A systematic review. Neural Networks;19:408-415.
Lo SS, Norton J, Mumby PB et al(2007). Prospective multicenter study of the impact of the
21-gene recurrence score (RS) assay on medical oncologist (MO) and patient (pt)
adjuvant breast cancer (BC) treatment selection. J Clin Oncol;25(18 suppl):577
Loi S, Haibe-Kains B, Desmedt C, et al (2007). Definition of clinically distinct molecular
subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J.
Clin. Oncol. 25, 1239–1246
Lowery AJ, Miller N, McNeill RE, Kerin MJ (2008). MicroRNAs as prognostic indicators and
therapeutic targets: potential effect on breast cancer management. Clin Cancer Res.
;14(2):360-5.
Lowery AJ, Miller N, Devaney A, et al (2009) . MicroRNA signatures predict estrogen
receptor, progesterone receptor and Her2/neu receptor status in breast cancer.
Breast Cancer Res.;11(3):R27.
Lu J, Getz G, Miska EA, et al.(2005) MiRNA expression profiles classify human cancers.
Nature. 2005;435(7043):834-8
Ma XJ, Hilsenbeck SG, Wang W et al (2006). The HOXB13:IL17BR expression index is a
prognostic factor in early-stage breast cancer. J Clin Oncol; 24:4611– 4619.
Ma J, Dong C, Ji C (2010). MicroRNA and drug resistance. Cancer Gene Ther, 17(8), 523‐531
Manning AT, Garvin JT, Shahbazi RI, et al (2007). Molecular profiling techniques and
bioinformatics in cancer research Eur J Surg Oncol;33(3):255-65.
Marchionni L, Wilson RF, Wolff AC, et al (2008). Systematic review: gene expression
profiling assays in early-stage breast cancer. Ann Intern Med.;148(5):358-369.
Marengo E, Robotti E, Righetti PG, et al (2004). Study of proteomic changes associated with
healthy and tumoral murine samples in neuroblastoma by principal component
analysis and classification methods. Clinica Chimica Acta;345:55-67.
Masuda N, Ohnishi T, KawamotoS , et al (1999) Analysis of chemical modification of RNA
from formalin-fixed samples and optimization of molecular biology applications
for such samples. Nucleic Acids Res. 27, 4436–4443
Matharoo-Ball B, Ratcliffe L, Lancashire L, et al (2007). Diagnostic biomarkers differentiating
metastatic melanoma patients from healthy controls identified by an integrated
MALDI-TOF mass spectrometry/bioinformatic approach. Proteomics Clinical
Applications; 1:605-620
Mattie MD, Benz CC, Bowers J, et al (2006). Optimized high-throughput MiRNA expression
profiling provides novel biomarker assessment of clinical prostate and breast
cancer biopsies. Molecular cancer;5:24
Michiels S, Koscielny S, Hill C (2005). Prediction of cancer outcome with microarrays: a
multiple random validation strategy. Lancet;365: 488-92.
Michiels S, Koscielny S, Hill C (2007). Interpretation of microarray data in cancer. British
Journal of Cancer;96:1155–1158.
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 117

Mina L, Soule SE, Badve S, et al. (2007) Predicting response to primary chemotherapy: gene
expression profiling of paraffin-embedded core biopsy tissue. Breast Cancer Res
Treat ;103:197–208.
Mitchell PS, Parkin RK, Kroh EM, et al (2008).Circulating MiRNAs as stable blood- based
markers for cancer detection.PNAS;105(30):10513-8
Mook S, Schmidt MK, Viale G, et al (2009). The 70-gene prognosis signature predicts disease
outcome in breast cancer patients with 1–3 positive lymph nodes in an
independent validation study. Breast Cancer Res Treat;116:295–302.
Mootha VK, Lindgren CM, Eriksson KF, et al (2003). PGC-1alpha Responsive Genes
Involved in Oxidative Phosphorylation are Coordinately Downregulated in Human
Diabetes, Nature Genetics 34(3):267-73
Nielsen TO, Hsu FD, Jensen K et al (2004). Immunohistochemical and clinical
characterization of the basal-like subtype of invasive breast carcinoma. Clin Cancer
Res ;10:5367-74.
Oberley MJ, Tsao J, Yau P, Farnham PJ (2004). Highthroughput screening of chromatin
immunoprecipitates using CpG-island microarrays. Methods Enzymol;376: 315-
34.
Oostlander AE, Meijer GA, Ylstra B (2004). Microarraybased comparative genomic
hybridization and its applications in human genetics. Clin Genet, 66: 488-495.
Osborne CK(1998) Tamoxifen in the treatment of breast cancer. N Engl J Med;339(22):1609-
18.
Paik S, Shak S, Tang G, et al (2004). A multigene assay to predict recurrence of tamoxifen-
treated, node-negative breast cancer. N Engl J Med;351(27):2817-26.
Paik, S. Kim, C. Y, Song, Y. K. & Kim, W. S. (2005) Technology insight: application of
molecular techniques to formalin-fixed paraffin-embedded tissues from breast
cancer. Nat. Clin. Pract. Oncol;2:246–254
Paik S, Tang G, Shak S, , et al(2006). Gene expression and benefit of chemotherapy in women
with node-negative, estrogen receptor-positive breastcancer. J Clin Oncol; 24 (23) :
3726-34.
Parker JS, Mullins M, Cheang MC, et al (2009). Supervised risk predictor of breast cancer
based on intrinsic subtypes. J Clin Oncol;27:1160–1167.
Pedraza V, Gomez-Capilla JA, Escaramis G, et al (2010) Gene expression signatures in breast
cancer distinguish phenotype characteristics, histologic subtypes, and tumor
invasiveness. Cancer.;116(2):486-96.
Peppercorn J, Perou CM, Carey LA. (2008) Molecular subtypes in breast cancer evaluation
and management: divide and conquer. Cancer Invest;26:1–10.
Perou CM, Sorlie T, Eisen MB, et al (2000). Molecular portraits of human breast tumours.
Nature;406: 747-52.
Pusztai L, Mazouni C, Anderson K, et al (2006). Molecular classification of breast cancer:
limitations and potential. Oncologist;11:868–877.
Quackenbush J (2001). Computational analysis of microarray data. Nature Reviews Genetics
;2:418-27.
118 Computational Biology and Applied Bioinformatics

Raychaudhuri S, Stuart JM, Altman RB (2000). Principal components analysis to summarize


microarray experiments: application to sporulation time series. In Pacific
Symposium on Biocomputing, pp. 455–466.
Rifai N, Gillette MA, Carr SA (2006). Protein biomarker discovery and validation: the long
and uncertain path to clinical utility. Nature biotechnology;24:971-983
Rouzier R, Perou CM, Symmans WF et al (2005). Breast cancer molecular subtypes
respond differently to preoperative chemotherapy. Clin Cancer Res;11:5678 –
5685.
Schena M, Shalon D, Davis RW, Brown PO (1995). Quantitative monitoring of
gene expression patterns with a complementary DNA microarray. Science;270:467-
70.
Segura MF, Belitskaya-Lévy I, Rose A, et al (2010) Melanoma MicroRNA Signature Predicts
Post-Recurrence Survival. Clinical Cancer Research;16:1577.
Shak S, Baehner FL, Palmer G, et al (2006) Subtypes of breast cancer defined by
standardized quantitative RT–PCR analysis of 10 618 tumors. Breast Cancer Res
Treat 2006;100:S295–295.
Shi L, Reid LH, Jones WD, et al (2006)The microarray quality control (MAQC) project shows
inter- and intraplatform reproducibility of gene expression measurements. Nature
Biotechnologie;24:1151–1161.
Simon RM, Korn EL, McShane LM, et al (2003). Design and analysis of DNA microarray
investigations. Springer New York
Smith I, Procter M, Gelber RD, et al (2007). 2-year follow up of trastuzumab after adjuvant
chemotherapy in HER2-positive breast cancer: a randomised controlled trial.
Lancet. ;369(9555):29-36.
Sorlie T, Perou CM, Tibshirani R, et al (2001) Gene expression patterns of breast
carcinomas distinguish tumor subclasses with clinical implications.PNAS;98: 10869-
74.
Sorlie T, Tibshirani R, Parker J, et al (2003). Repeated observation of breast tumor subtypes
in independent gene expression data sets. PNAS;100:8418–8423.
Sorlie T, Perou CM, Fan C, et al (2006) Gene expression profiles do not consistently predict
the clinical treatment response in locally advanced breast cancer. Mol Cancer
Ther;5:2914–8.
Sotiriou C, Neo SY, McShane LM, et al (2003) Breast cancer classification and prognosis
based on gene expression profiles from a population based study.PNAS;100:10393–
10398
Sparano JA. (2006). TAILORx: Trial assigning individualized options for treatment (Rx). Clin
Breast Cancer;7:347–350.
Stekel D. (2003). Microarray bioinformatics. Cambrigde University Press,
Stoll D, Templin MF, Bachmann J, Joos TO (2005). Protein microarrays: applications and
future challenges. Curr Opin Drug Discov Devel, 8: 239-252.
Sun Y, Goodison S. Li J, Liu L., Farmerie W (2007). Improved breast cancer prognosis
through the combination of clinical and genetic markers, Bioinformatics;23:30–
37
MicroArray Technology - Expression Profiling of MRNA and MicroRNA in Breast Cancer 119

Tessel MA, Krett NL, Rosen ST (2010).Steroid receptor and microRNA regulation in cancer.
Curr Opin Oncol;22(6):592‐597
The FlyBase Consortium (1999). The FlyBase database of the Drosophila Genome Projects
and community literature. Nucleic Acids Res;27:85–88.
van de Vijver M, He Y, van’t Veer L, et al (2002). A gene-expression signature as a predictor
of survival in breast cancer. N Engl J Med;347:1999–2009.
van’t Veer L, Dai H, van de Vijver M, et al (2002). Gene expression profiling predicts clinical
outcome of breast cancer. Nature;415:530–6.
Vapnik V, Lerner A (1963). Pattern recognition using generalized portrait method.
Automation and Remote Control 1963;24:774-780.
Volinia S, Calin GA, Liu CG, et al (2006). A MiRNA expression signature of human solid
tumors defines cancer gene targets. PNAS.;103(7):2257-61.
Wadsworth JT, Somers KD, Cazares LH, et al (2004) Serum protein profiles to identify head
and neck cancer. Clinical Cancer Research;10:1625-1632.
Wang Y, Klijn JG, Zhang Y et al (2005). Gene-expression profiles to predict
distant metastasis of lymph-node-negative primary breast cancer. Lancet;365:671–
679.
Warnat P, Eils R, Brors B (2005). Cross-platform analysis of cancer microarray data
improves gene expression based classification of phenotypes. BMC
bioinformatics 2;6: 265.
Weigelt B, Geyer FC, Natrajan R, et al (2010) The molecular underpinning of lobular
histological growth pattern: a genome-wide transcriptomic analysis of invasive
lobular carcinomas and grade- and molecular subtype-matched invasive ductal
carcinomas of no special type. J Pathol;220(1):45-57
Wirapati P, Sotiriou C, Kunkel S, et al (2008). Meta-analysis of gene expression profiles in
breast cancer: toward a unified understanding of breast cancer subtyping and
prognosis signatures. Breast Cancer Res;10:R65.
Wong JWH, Cagney G, Cartwright HM (2005). SpecAlign—processing and alignment of
mass spectra datasets. Bioinformatics;21:2088-2090
Xi Y, Nakajima G, Gavin E, et al (2007). Systematic analysis of MiRNA expression of RNA
extracted from fresh frozen and formalin-fixed paraffin-embedded samples.
RNA;13(10):1668-74.
Xue C, Li F, He T, Liu GP, Li Y, Xuegong Z (2005). Classification of real and pseudo
microRNA precursors using local structure-sequence features and support vector
machine. BMC Bioinformatics;6:310.
Xu R, Xu J, Wunsch DC (2009). Using default ARTMAP for cancer classification with
MicroRNA expression signatures, International Joint Conference on Neural Networks,
pp.3398-3404,
Yan PS, Perry MR, Laux DE, et al (2000). CpG island arrays: an application toward
deciphering epigenetic signatures of breast cancer. Clinical Cancer Research; 6:
1432-38.
Yousef M, Najami N, Khalifa W (2010). A comparison study between one-class and two-
class machine learning for MicroRNA target detection. Journal of Biomedical
Science and Engineering ;3:247-252.
120 Computational Biology and Applied Bioinformatics

Zhao H, Langerod A, Ji Y, et al (2004) Different gene expression patterns in invasive lobular


and ductal carcinomas of the breast. Mol Biol Cell;15:2523–2536.
Zhao H, Shen J, Medico L, et al (2010). A Pilot Study of Circulating miRNAs as Potential
Biomarkers of Early Stage Breast Cancer. PLoS One;5(10),e137:5 , 2010
Zheng T, Wang J, Chen X, Liu L (2010) Role of microRNA in anticancer drug resistance. Int J
Cancer;126(1):2-10.
6

Computational Tools for Identification of microRNAs in Deep Sequencing Data Sets

Manuel A. S. Santos and Ana Raquel Soares
University of Aveiro
Portugal

1. Introduction
MicroRNAs (miRNAs) are a class of small RNAs of approximately 22 nucleotides in length
that regulate eukaryotic gene expression at the post-transcriptional level (Ambros 2004;
Bartel 2004; Filipowicz et al. 2008). They are transcribed as long precursor RNA molecules
(pri-miRNAs) and are successively processed by two key RNases, namely Drosha and Dicer, into their mature forms of ~22 nucleotides (Kim 2005; Kim et al. 2009). These small
RNAs regulate gene expression by binding to target sites in the 3’ untranslated region of
mRNAs (3'UTR). Recognition of the 3'UTR by a miRNA is mediated through hybridization between at least nucleotides 2-8 of the small RNA (the seed sequence, numbered from the 5' end) and complementary sequences present in the 3'UTR of the mRNA (Ambros 2004; Bartel 2004; Zamore and Haley 2005). Perfect or nearly perfect complementarity between a miRNA and its target 3'UTR induces mRNA cleavage by the RNA-induced silencing complex (RISC), whereas imperfect base pairing may induce
translational silencing through various molecular mechanisms, namely inhibition of
translation initiation and activation of mRNA storage in P-bodies and/or stress granules
(Pillai et al. 2007).
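As a concrete illustration of the seed-pairing rule described above, the following Python sketch extracts the seed (nucleotides 2-8) of a mature miRNA, reverse-complements it and scans a 3'UTR sequence for perfect seed matches. The miRNA and UTR sequences below are invented placeholders, not data from this chapter.

```python
def reverse_complement(seq):
    # DNA-style complement of the seed; the UTR is given as DNA (T instead of U)
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def seed_matches(mirna, utr):
    """Return start positions in the 3'UTR that perfectly match the miRNA seed (nt 2-8)."""
    seed = mirna.upper()[1:8]              # nucleotides 2-8, counted from the 5' end
    target = reverse_complement(seed)      # sequence the UTR must contain
    return [i for i in range(len(utr) - len(target) + 1)
            if utr.upper()[i:i + len(target)] == target]

# Hypothetical example sequences (not real data)
print(seed_matches("UGAGGUAGUAGGUUGUAUAGUU", "AAGCTACCTCAAGCT"))
```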
This class of small RNAs is well conserved between eukaryotic organisms, suggesting that
they appeared early in eukaryotic evolution and play fundamental roles in gene expression
control. Each miRNA may repress hundreds of mRNAs and regulate a wide variety of
biological processes, namely developmental timing (Feinbaum and Ambros 1999; Lau et al.
2001), cell differentiation (Tay et al. 2008), immune response (Ceppi et al. 2009) and infection
(Chang et al. 2008). For this reason, their identification is essential to understand eukaryotic
biology. Their small size, low abundance and high instability complicated early
identification, but these obstacles have been overcome by next generation sequencing
approaches, namely the Genome SequencerTM FLX from Roche, the Solexa/Illumina
Genome Analyzer and the Applied Biosystems SOLiDTM Sequencer, which are currently
being routinely used for rapid miRNA identification and quantification in many eukaryotes
(Burnside et al. 2008; Morin et al. 2008; Schulte et al. 2010).
As in other vertebrates, miRNAs control gene expression in zebrafish, since defective miRNA processing arrests development (Wienholds et al. 2003). Also, a specific subset of
miRNAs is required for brain morphogenesis in zebrafish embryos, but not for cell fate
determination or axis formation (Giraldez et al. 2005). In other words, miRNAs play an
important role in zebrafish organogenesis and their expression at specific time points is
relevant to organ formation and differentiation. Since identification of the complete set of
miRNAs is fundamental to fully understand biological processes, we have used high
throughput 454 DNA pyrosequencing technologies to fully characterize the zebrafish
miRNA population (Soares et al. 2009). For this, a series of cDNA libraries was prepared from miRNAs isolated at different embryonic time points and from fully developed organs, and sequenced using the Genome SequencerTM FLX. This platform yields reads of up to 200
bases each and can generate up to 1 million high quality reads per run, which provides
sufficient sequencing coverage for miRNA identification and quantification in most
organisms. However, deep sequencing of small RNAs may pose some problems that need to
be taken into consideration to avoid sequencing biases. For example, library preparation and
computational methodologies for miRNA identification from large pools of reads need to be optimized. There are many variables to consider, namely biases in handling large sets of
data, sequencing errors and RNA editing or splicing. If used properly, deep sequencing
technologies have enormous analytical power and have been proven to be very robust in
retrieving novel small RNA molecules. One of the major challenges when analyzing deep
sequencing data is to differentiate miRNAs from other small RNAs and RNA degradation
products.
Different research groups are developing dedicated computational methods for the
identification of miRNAs from large sets of sequencing data generated by next
generation sequencing experiments. miRDeep (http://www.mdc-
berlin.de/en/research/research_teams/systems_biology_of_gene_regulatory_elements/pr
ojects/miRDeep/index.html) (Friedlander et al. 2008) and miRanalyzer
(http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php) (Hackenberg et al.
2009) can both detect known miRNAs annotated in miRBase and predict new miRNAs
(although using different prediction algorithms) from small RNA datasets generated by
deep sequencing. Although these online algorithms are extremely useful for miRNA
identification, custom-made pipeline analysis of deep sequencing data may be performed in
parallel to uncover the maximum number of small non-coding RNA molecules present in
the RNA datasets.
In this chapter, we discuss the tools and computational pipelines used for miRNA
identification, discovery and expression from sequencing data, based on our own experience
of deep sequencing of zebrafish miRNAs, using the Genome SequencerTM FLX from Roche.
We show how a combination of a publicly available, user-friendly algorithm, such as miRDeep, with custom-built analysis pipelines can be used to identify non-coding RNAs and uncover novel miRNAs. We also demonstrate that population statistics can be applied to the miRNA populations identified during sequencing, and that robust computational analysis of the data is crucial for extracting the
maximum information from sequencing datasets.

2. miRNA identification by next-generation sequencing


2.1 Extraction of next-generation sequencing data
Next-generation sequencing methods have been successfully applied in recent years to miRNA identification in a variety of organisms. However, the enormous amount of data generated poses bioinformatics challenges that researchers have to overcome in order to extract relevant information from the datasets.
We have used the Genome SequencerTM FLX system (454 sequencing) to identify zebrafish
miRNAs from different developmental stages and from different tissues. For this, cDNA
libraries are prepared following commonly used protocols (Droege M and Hill B. 2008;
(Soares et al. 2009). These libraries contain specific adaptors, attached to the small RNA molecules, that provide the priming sites for sequencing. After sequencing, raw data filtration and
extraction is performed using specialist software incorporated into the Genome SequencerTM
FLX system (Droege M and Hill B. 2008). Raw images are processed to remove background
noise and the data is normalized. Quality of raw sequencing reads is based on complete read
through of the adaptors incorporated into the cDNA libraries. The ~200-base length of 454 sequencing reads provides enough sequence for complete read-through of the adaptors and miRNAs. During quality control, the adaptors are trimmed and the resulting
sequences are used for further analysis. Sequences ≥ 15 nucleotides are kept for miRNA
identification, and constitute the small RNA sequencing data.
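The adaptor-trimming and length-filtering step described above can also be reproduced outside the vendor software with a few lines of code. The sketch below is a minimal illustration assuming reads are already available as plain strings and that a single 3' adaptor sequence is known; the adaptor shown is an invented placeholder, not the actual 454 adaptor.

```python
def trim_and_filter(reads, adaptor, min_len=15):
    """Trim a known 3' adaptor from each read and keep sequences >= min_len nucleotides."""
    kept = []
    for read in reads:
        pos = read.find(adaptor)
        insert = read[:pos] if pos != -1 else read   # drop the adaptor and anything after it
        if len(insert) >= min_len:
            kept.append(insert)
    return kept

# Hypothetical reads and adaptor sequence, for illustration only
reads = ["TGAGGTAGTAGGTTGTATAGTTCTGTAGGCACC", "ACGTCTGTAGGCACC"]
print(trim_and_filter(reads, adaptor="CTGTAGGCACC"))
```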
Other sequencing platforms, such as Illumina/Solexa and SOLiDTM, also have specialist
software for raw data filtration. DSAP, for example, is an automated multiple-task web
service designed to analyze small RNA datasets generated by the Solexa platform (Huang et
al. 2010). This software filters raw data by removing sequencing adaptors and poly-
A/T/C/G/N nucleotides. In addition, it performs non-coding RNA matching by sequence
homology mapping against the non-coding RNA database Rfam (rfam.sanger.ac.uk/) and
detects known miRNAs in miRBase (Griffiths-Jones et al. 2008), based on sequence
homology.
The SOLiDTM platform has its own SOLiD™ System Small RNA Analysis Pipeline Tool
(RNA2MAP), which is available online (http://solidsoftwaretools.com/gf/project/
rna2map). This software is similar to DSAP, as it filters raw data and identifies known
miRNAs in the sequencing dataset by matching reads against miRBase sequences and
against a reference genome. Although these specialist software packages are oriented towards miRNA identification in sequencing datasets, they are not able to identify novel miRNAs.
For this, datasets generated from any of the sequencing platforms available have to be
analyzed using tools that include algorithms to identify novel miRNAs.

2.2 miRNA identification from next generation sequencing databases


miRNA identification (of both known and novel molecules) from datasets generated by
deep-sequencing has been facilitated by the development of public user friendly algorithms,
such as miRDeep (Friedlander et al. 2008), miRanalyzer (Hackenberg et al. 2009) and
miRTools (Zhu et al. 2010).
We used miRDeep to identify miRNAs in our sequencing datasets (Figure 1). miRDeep was
the first public tool available for the analysis of deep-sequencing miRNA data. This software
was developed to extract putative precursor structures and predict secondary structures
using RNAfold (Hofacker 2003) after genome alignment of the sequences retrieved by next-
generation sequencing. This algorithm relies on the miRNA biogenesis model. Pre-miRNAs
are processed by DICER, which generates three different fragments, namely the mature miRNA, the star sequence and the hairpin loop (Kim et al. 2009). miRDeep scores the
compatibility of the position and frequency of the sequenced RNA with the secondary
structures of the miRNA precursors and identifies new, conserved and non-conserved
miRNAs with high confidence. It distinguishes between novel and known miRNAs, by
evaluating the presence or absence of alignments of a given sequence with the stem loop
sequences deposited in miRBase. The sequence with the highest expression is always
considered as the mature miRNA sequence by the miRDeep algorithm. All hairpins that are
not processed by DICER will not match a typical secondary miRNA structure and are
filtered out.
After aligning the sequences against the desired genome using megaBlast, the blast output is
parsed for miRDeep uploading. As sequencing errors, RNA editing and RNA splicing may
alter the original miRNA sequence, one can re-align reads that do not match the genome
using SHRiMP (http://compbio.cs.toronto.edu/shrimp/). The retrieved alignments are also parsed for miRDeep for miRNA prediction. miRDeep itself allows up to 2 mismatches at the 3' end of each sequence, which already accounts for some degree of sequencing error.
Reads matching more than 10 different genome loci are generally discarded, as they likely
constitute false positives. The remaining alignments are used as guidelines for excision of
the potential precursors from the genome. After secondary structure prediction of putative
precursors, signatures are created by retaining reads that align perfectly with those putative
precursors to generate the signature format. miRNAs are predicted by discarding non-
plausible DICER products and scoring plausible ones. The latter are blasted against mature
miRNAs deposited in miRBase, to extract known and conserved miRNAs. The remaining
reads are considered novel miRNAs.
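The multi-mapping filter mentioned above (discarding reads that hit more than 10 genomic loci) is straightforward to apply to a tabular alignment report. The sketch below assumes a BLAST-style tab-delimited file in which the first column is the read identifier; the file name and threshold are illustrative.

```python
from collections import Counter

def multi_mapping_reads(blast_tab_path, max_loci=10):
    """Return the set of read IDs hitting more than max_loci genomic locations."""
    hits = Counter()
    with open(blast_tab_path) as handle:
        for line in handle:
            if line.strip():
                read_id = line.split("\t")[0]   # query ID is the first column of tabular BLAST output
                hits[read_id] += 1
    return {read_id for read_id, n in hits.items() if n > max_loci}

# Usage sketch: drop likely false positives before precursor excision
# discarded = multi_mapping_reads("reads_vs_genome.blast.tab")
```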
In order to evaluate the sensitivity of the prediction and data quality, miRDeep calculates
the false positive rate, which should be below 10%. For this, the signature and the structure-pairings in the input dataset are randomly permuted, to test the hypothesis that the structure (hairpin) of true miRNAs is recognized by DICER and causes the signature.
miRanalyzer (Hackenberg et al. 2009) is a recently developed web server tool that detects
both known miRNAs annotated in miRBase and other non-coding RNAs by mapping
sequences to non-coding RNA libraries, such as Rfam. This feature is important, as more
classes of small non coding RNAs are being unravelled and their identification can provide
clues about their functions. At the same time, by removing reads that match other non
coding RNA classes, it reduces the false positive rate in the prediction of novel miRNAs, as
these small non coding RNAs can be confused with miRNAs. For novel miRNA prediction,
miRanalyzer implements a machine learning approach based on the random forest method,
with the number of trees set to 100 (Breiman 2001). miRanalyzer can be applied to miRNA
discovery in different models, namely human, mouse, rat, fruit-fly, round-worm, zebrafish
and dog, and uses datasets from different models to build the final prediction model. In
comparison to miRDeep, this is disadvantageous as the latter can predict novel miRNAs
from any model. All pre-miRNAs candidates that match known miRNAs are extracted from
the experimental dataset and labelled as positive instances. Next, an equal amount of pre-
miRNA candidates from the same dataset are selected by random selection with the known
miRNAs removed and labelled as negative. Pre-processing of reads corresponding to putative new miRNAs includes clustering of all reads that overlap on the genome, testing whether the start of the current read overlaps by less than 3 nucleotides with the end position of previous reads. This avoids distinct DICER products being grouped together and considered non-miRNA products, which would increase false negatives. In addition, clusters of more than 25 base pairs in length are discarded and the secondary structure of the miRNA is predicted
via RNAfold (Hofacker 2003). Structures where the cluster sequence is not fully included
and where part of the stem cannot be identified as a DICER product are discarded.
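As an illustration of the read-clustering criterion just described, the following sketch groups genome-mapped reads (given as start/end coordinates on one strand of one chromosome) into clusters, starting a new cluster whenever the current read overlaps the previous read end by fewer than 3 nucleotides, and discarding clusters spanning more than 25 bases. The coordinates are invented and the simplified logic is only meant to mirror the published criterion, not to reproduce miRanalyzer's implementation.

```python
def cluster_reads(intervals, min_overlap=3, max_cluster_len=25):
    """Group (start, end) read coordinates into clusters; drop clusters longer than max_cluster_len."""
    clusters, current = [], []
    for start, end in sorted(intervals):
        if current and (current[-1][1] - start + 1) >= min_overlap:
            current.append((start, end))          # sufficient overlap: extend the current cluster
        else:
            if current:
                clusters.append(current)
            current = [(start, end)]              # insufficient overlap: start a new cluster
    if current:
        clusters.append(current)
    # keep clusters whose genomic span does not exceed max_cluster_len
    return [c for c in clusters
            if (max(e for _, e in c) - min(s for s, _ in c) + 1) <= max_cluster_len]

# Hypothetical read coordinates (start, end), for illustration only
print(cluster_reads([(100, 121), (102, 123), (130, 151), (400, 421)]))
```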
Fig. 1. Data pipeline analysis using miRDeep.


miRTools is a comprehensive web server that can be used for characterization of the small
RNA transcriptome (Zhu et al. 2010). It offers some advantages relative to miRDeep and
miRanalyzer, since it integrates multiple computational approaches including tools for raw
data filtration, identification of novel miRNAs and miRNA expression profile generation. In
order to detect novel miRNAs, miRTools analyzes all genome-matching sequences that are not annotated as known miRNAs, other small non-coding RNAs, genomic repeats or mRNAs. These sequences are extracted and their RNA secondary structures are
predicted using RNAfold (Hofacker 2003) and novel miRNAs are identified using miRDeep.

2.3 Analysis of discarded reads by miRNA identification algorithms can identify new
miRNAs
Since miRDeep and miRanalyzer are highly stringent algorithms, some miRNAs may escape
detection. The false negative rate can, however, be calculated by simply performing a megaBlast search of the sequencing data against the miRNAs deposited in
miRBase. Perfect alignments are considered true positives. The list of known miRNAs
identified by this method is compared to the list of known miRNAs identified by miRDeep
or miRanalyzer. False negatives are those miRNAs present in the blast analysis, but which
were missed by the miRNA prediction algorithms. This is, in our opinion, an essential
control, as it gives information about the percentage of miRNAs that may have escaped
miRDeep or miRanalyzer analysis. We detected a false negative rate of ~19%, which prompted us to develop a parallel pipeline to analyze reads that may have been incorrectly
discarded by the original algorithm (Figure 2). This analysis can and should be performed
independently of the algorithm used to retrieve miRNAs from deep sequencing data.
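The false negative estimate described above reduces to a set comparison between the known miRNAs recovered by a plain megaBlast search against miRBase and those reported by the prediction algorithm. A minimal sketch follows, assuming both results are already available as lists of miRBase identifiers; the identifiers shown are placeholders.

```python
def false_negative_rate(blast_hits, predicted):
    """Known miRNAs found by direct BLAST against miRBase but missed by the prediction tool."""
    blast_set, predicted_set = set(blast_hits), set(predicted)
    missed = blast_set - predicted_set
    rate = len(missed) / len(blast_set) if blast_set else 0.0
    return missed, rate

# Hypothetical identifier lists, for illustration only
missed, rate = false_negative_rate(
    ["dre-miR-1", "dre-miR-21", "dre-miR-122", "dre-let-7a"],
    ["dre-miR-1", "dre-miR-21", "dre-let-7a"],
)
print(missed, f"{rate:.0%}")
```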
To overcome the lack of sensitivity of miRDeep, our parallel bioinformatics pipeline
includes a megaBlast alignment between the dataset of reads discarded by miRDeep and mature sequences deposited in miRBase. In addition, novel transcripts encoding miRNAs
predicted by computational tools can be retrieved from the latest Ensembl version using
BioMart and also from literature predictions. These sequences are then used to perform a
megaBlast search against the sequencing data. The transcripts with perfect matches and
alignment length > 18 nucleotides are kept for further processing. These transcripts are then
compared with the mature miRNAs deposited in miRBase and those that produce imperfect
alignments or do not produce alignments are considered novel miRNAs. Imperfect alignments
may identify conserved miRNAs if there is a perfect alignment in the seed region.
Complementing the alignment of our dataset reads against the zebrafish genome with SHRiMP alignments, and complementing the miRDeep analysis with an analysis of the reads discarded by this algorithm, allowed us to identify 90% of the 192 previously identified zebrafish miRNAs, plus 107 miRNA star sequences and 25 novel miRNAs.

2.4 Generation of miRNA profiles from deep sequencing data


Deep sequencing of miRNAs can also be used to generate miRNA expression profiles as the
absolute number of sequencing reads of each miRNA is directly proportional to their
relative abundance. miRNA profiles can be generated based on the number of reads of each
particular miRNA. However, a normalization step is essential to compare miRNA
expression levels between different samples. The variation in the total number of reads
between samples leads to erroneous interpretation of miRNA expression patterns by direct
comparison of read numbers (Chen et al. 2005). Normalization assumes that the small RNA
population is constant and is represented by an arbitrary value (e.g. 1000), and can be
calculated as indicated below:

miRNA relative expression = 1000 × (NRmiRNAXY / TNRmiRNAsY)

where NRmiRNAXY is the number of reads of miRNA X (X = any miRNA) in sample Y, and TNRmiRNAsY is the total number of miRNA reads in sample Y. The factor 1000 is an arbitrary number of reads that allows for data normalization across different samples. This calculates the relative expression of a specific miRNA in a given sample, relative to all miRNAs expressed.
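The normalization formula above translates directly into code. The sketch below computes relative expression values for each miRNA in a sample from raw read counts; the counts are invented placeholders.

```python
def normalize_counts(read_counts, scale=1000):
    """Scale raw miRNA read counts to reads per 'scale' of the sample's total miRNA reads."""
    total = sum(read_counts.values())
    return {name: scale * count / total for name, count in read_counts.items()}

# Hypothetical raw counts for one sample, for illustration only
sample = {"dre-miR-1": 1200, "dre-miR-21": 300, "dre-miR-122": 4500}
print(normalize_counts(sample))
```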

Fig. 2. Bioinformatics pipeline of reads discarded by miRDeep (-i and -d stand for query file and database, respectively).
Using this formula it is possible to generate miRNA profiles for each sample sequenced.
These profiles provide valuable information about relative miRNA expression, which is
essential to understand miRNA function in different tissues. In order to compare the miRNA profiles of two deep sequencing samples (e.g. condition vs control), a two-sided t-test can be applied to compare miRNA levels. Sequence count values should be log-transformed to stabilize variance (Creighton et al. 2009). miRTools already includes a computational approach to identify significantly differentially expressed miRNAs (Zhu et al. 2010). It
compares differentially expressed miRNAs in multiple samples after normalization of the
read count of each miRNA with the total number of miRNA read counts which are matched
to the reference genome. The algorithm calculates statistical significance (P-value) based on
a Bayesian method (Audic and Claverie 1997), which accounts for sampling variability of
tags with low counts. Significantly differentially expressed miRNAs are those that show P-
values <0.01 and at least 2-fold change in normalized sequence counts.
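The simple two-sided t-test comparison suggested above can be sketched with SciPy. The example below log-transforms normalized counts for one miRNA measured in replicate condition and control samples, tests the difference and computes the fold change. The 2-fold and P < 0.01 cut-offs echo the miRTools criteria quoted in the text, although miRTools itself relies on a Bayesian test rather than the t-test shown here; the input values are invented placeholders.

```python
import math
from scipy import stats

def compare_mirna(condition, control, p_cutoff=0.01, min_fold=2.0):
    """Two-sided t-test on log2-transformed normalized counts plus a fold-change check."""
    log_cond = [math.log2(x + 1) for x in condition]   # +1 avoids taking the log of zero
    log_ctrl = [math.log2(x + 1) for x in control]
    result = stats.ttest_ind(log_cond, log_ctrl)
    p_value = result.pvalue
    fold = (sum(condition) / len(condition)) / max(sum(control) / len(control), 1e-9)
    significant = p_value < p_cutoff and (fold >= min_fold or fold <= 1.0 / min_fold)
    return p_value, fold, significant

# Hypothetical normalized counts for one miRNA in replicate samples
print(compare_mirna([820, 790, 860], [190, 220, 205]))
```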

2.5 Statistical analysis of miRNA population


The platforms available for miRNA sequencing offer different sequencing coverage, ranging
from thousands to millions of reads. In principle, higher sequencing coverage will enable
discovery of more miRNA molecules in a sequencing run. However, technical problems
during sample preparation can interfere with good quality sequencing of small RNAs. One
of the most common problems is the generation of primer dimers during PCR amplification of cDNA libraries. This may indicate an excess of primers during amplification relative to the miRNA levels in a given cDNA library, or a low annealing temperature. This problem is often only detected after sequencing. When this
happens, a large number of reads do not pass quality control filters and the number of reads
corresponding to small RNAs is considerably lower than the initial sequencing coverage.
Besides this, quality control filters do not consider reads with sequencing errors in the
adaptors or without recognizable adaptors. For these reasons, a tool that verifies if the
sequencing coverage is sufficient to retrieve most miRNAs in a given sample is important.
A useful approach to assess the representativeness of miRNA reads in a sequencing
experiment is to apply population statistics to the overall miRNA population. We have
developed a statistical tool to calculate how many miRNAs are expected in a given
sequencing experiment and how many reads are needed to identify them. Rarefaction
curves of the total number of reads obtained versus the total number of miRNA species
identified are plotted and the total richness of the miRNA population is determined. Chao1,
a non-parametric richness estimator (Chao 1987), can be used to determine the total richness
of the miRNA population, as a function of the observed richness (Sobs), and the number of
total sequences obtained by sequencing. The value obtained represents the number of
different miRNAs that can be identified in a specific sequencing experiment. The rarefaction
curve estimates the number of reads needed to identify the different miRNAs that may be
present in a sequencing run. For example, 206 miRNAs are expected to be present in a
sequencing experiment that retrieves approximately 40000 reads (Figure 3). The steep curve
levels off towards an asymptote, indicating the point (~20000 reads) where additional
sampling will not yield extra miRNAs. As that critical point is below the total number of
reads obtained, we can conclude that the sequencing coverage is sufficient to identify all
miRNAs predicted in the particular sample. Rarefaction curves and the Chao1 statistical
estimator are computed using EstimateS 8.0 (Colwell and Coddington 1994).
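For readers who prefer to compute the richness estimate directly from read counts, the classical Chao1 formula can be applied without dedicated software: Chao1 = Sobs + F1^2 / (2*F2), where F1 and F2 are the numbers of miRNAs observed exactly once and exactly twice. The sketch below is a minimal implementation of this standard formula (it does not reproduce EstimateS, which also provides confidence intervals and rarefaction curves); the counts are invented placeholders.

```python
def chao1(read_counts):
    """Classical Chao1 richness estimate from a dict of miRNA -> read count."""
    observed = [c for c in read_counts.values() if c > 0]
    s_obs = len(observed)
    f1 = sum(1 for c in observed if c == 1)   # singletons
    f2 = sum(1 for c in observed if c == 2)   # doubletons
    if f2 == 0:                               # bias-corrected fallback when no doubletons exist
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + (f1 * f1) / (2.0 * f2)

# Hypothetical miRNA read counts, for illustration only
counts = {"miR-A": 1, "miR-B": 1, "miR-C": 2, "miR-D": 15, "miR-E": 40}
print(chao1(counts))
```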
Fig. 3. Statistical analysis of miRNA population. A) A rarefaction curve of the total number
of reads generated by deep sequencing versus the total number of miRNA species identified
is shown. The steep curve levels off towards an asymptote, which indicates the point where
additional sampling will not yield new miRNAs. B) Homogeneity of the miRNA population
was assessed using population statistics and by determining the Chao1 diversity estimator.
The Chao1 estimator reached a stable mean value of 207, with lower and upper limits of 200.37 and 229.66, respectively, at a 95% confidence level.

3. Conclusion
Small non-coding RNAs are a class of molecules that regulate several biological processes.
Identification of such molecules is crucial to understand the molecular mechanisms that
they regulate. There are already several deep sequencing approaches to identify these
molecules. However, correct interpretation of sequencing data depends largely on the
bioinformatics and statistical tools available. There are online algorithms that facilitate
identification of miRNAs and other small non-coding RNAs from large datasets. However,
there are no tools to predict novel small non-coding RNAs beyond miRNAs. As those
additional RNA classes, namely piRNAs, snRNAs and snoRNAs are processed differently,
the development of algorithms based solely on their biogenesis is challenging. Moreover,
the available algorithms have some limitations and additional data analysis should be
performed with the discarded reads that can potentially hold non-conventional miRNA
molecules. Analysis of deep sequencing data is a powerful methodology to identify novel
miRNAs in any organism and determine their expression profiles. The challenge is to deal
with increasing dataset size and to integrate the information generated by small RNA
sequencing experiments. This will be essential to understand how different RNA classes are
related. Computational tools to integrate small non-coding RNA data with gene expression
data and target predictions are pivotal to understand the biological processes regulated by
miRNAs and other small non-coding RNA classes.
4. References
Ambros, V. (2004). The functions of animal microRNAs. Nature 431(7006), 350-355.
Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles.
Genome Res. 7(10), 986-995.
Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2),
281-297.
Breiman, L. (2001). Random forests. Machine Learning 45(1), 5-32.
Burnside, J., Ouyang, M., Anderson, A., Bernberg, E., Lu, C., Meyers, B. C., Green, P. J.,
Markis, M., Isaacs, G., Huang, E., and Morgan, R. W. (2008). Deep sequencing of
chicken microRNAs. BMC Genomics 9.
Ceppi, M., Pereira, P. M., Dunand-Sauthier, I., Barras, E., Reith, W., Santos, M. A., and
Pierre, P. (2009). MicroRNA-155 modulates the interleukin-1 signaling pathway in
activated human monocyte-derived dendritic cells. Proc. Natl. Acad. Sci. U. S. A
106(8), 2735-2740.
Chang, J. H., Guo, J. T., Jiang, D., Guo, H. T., Taylor, J. M., and Block, T. M. (2008). Liver-
specific MicroRNA miR-122 enhances the replication of hepatitis C virus in
nonhepatic cells. Journal of Virology 82(16), 8215-8223.
Chao, A. (1987). Estimating the Population-Size for Capture Recapture Data with Unequal
Catchability. Biometrics 43(4), 783-791.
Chen, P. Y., Manninga, H., Slanchev, K., Chien, M. C., Russo, J. J., Ju, J. Y., Sheridan, R., John,
B., Marks, D. S., Gaidatzis, D., Sander, C., Zavolan, M., and Tuschl, T. (2005). The
developmental miRNA profiles of zebrafish as determined by small RNA cloning.
Genes & Development 19(11), 1288-1293.
Colwell, R. K., and Coddington, J. A. (1994). Estimating Terrestrial Biodiversity Through
Extrapolation. Philosophical Transactions of the Royal Society of London Series B-
Biological Sciences 345(1311), 101-118.
Creighton, C. J., Reid, J. G., and Gunaratne, P. H. (2009). Expression profiling of microRNAs
by deep sequencing. Brief. Bioinform. 10(5), 490-497.
Droege, M., and Hill, B. (2008). The Genome Sequencer FLX System - Longer reads, more applications, straight forward bioinformatics and more complete data sets. J. Biotechnol. 136(1-2), 3-10.
Feinbaum, R., and Ambros, V. (1999). The timing of lin-4 RNA accumulation controls the
timing of postembryonic developmental events in Caenorhabditis elegans. Dev.
Biol. 210(1), 87-95.
Filipowicz, W., Bhattacharyya, S. N., and Sonenberg, N. (2008). Mechanisms of post-
transcriptional regulation by microRNAs: are the answers in sight? Nat. Rev. Genet.
9(2), 102-114.
Friedlander, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., Knespel, S., and
Rajewsky, N. (2008). Discovering microRNAs from deep sequencing data using
miRDeep. Nature Biotechnology 26(4), 407-415.
Giraldez, A. J., Cinalli, R. M., Glasner, M. E., Enright, A. J., Thomson, J. M., Baskerville, S.,
Hammond, S. M., Bartel, D. P., and Schier, A. F. (2005). MicroRNAs regulate brain
morphogenesis in zebrafish. Science 308(5723), 833-838.
Griffiths-Jones, S., Saini, H. K., van Dongen, S., and Enright, A. J. (2008). miRBase: tools for
microRNA genomics. Nucleic Acids Res. 36(Database issue), D154-D158.
Hackenberg, M., Sturm, M., Langenberger, D., Falcon-Perez, J. M., and Aransay, A. M.
(2009). miRanalyzer: a microRNA detection and analysis tool for next-
generation sequencing experiments. Nucleic Acids Res. 37(Web Server issue),
W68-W76.
Hofacker, I. L. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31(13),
3429-3431.
Huang, P. J., Liu, Y. C., Lee, C. C., Lin, W. C., Gan, R. R., Lyu, P. C., and Tang, P. (2010).
DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res. 38(Web
Server issue), W385-W391.
Kim, V. N. (2005). MicroRNA biogenesis: coordinated cropping and dicing. Nat. Rev. Mol.
Cell Biol. 6(5), 376-385.
Kim, V. N., Han, J., and Siomi, M. C. (2009). Biogenesis of small RNAs in animals. Nat. Rev.
Mol. Cell Biol. 10(2), 126-139.
Lau, N. C., Lim, L. P., Weinstein, E. G., and Bartel, D. P. (2001). An abundant class of tiny
RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294(5543),
858-862.
Morin, R. D., O'Connor, M. D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A. L.,
Zhao, Y., McDonald, H., Zeng, T., Hirst, M., Eaves, C. J., and Marra, M. A. (2008).
Application of massively parallel sequencing to microRNA profiling and discovery
in human embryonic stem cells. Genome Research 18(4), 610-621.
Pillai, R. S., Bhattacharyya, S. N., and Filipowicz, W. (2007). Repression of protein synthesis
by miRNAs: how many mechanisms? Trends Cell Biol. 17(3), 118-126.
Schulte, J. H., Marschall, T., Martin, M., Rosenstiel, P., Mestdagh, P., Schlierf, S., Thor,
T., Vandesompele, J., Eggert, A., Schreiber, S., Rahmann, S., and Schramm, A.
(2010). Deep sequencing reveals differential expression of microRNAs in
favorable versus unfavorable neuroblastoma. Nucleic Acids Res. 38(17), 5919-
5928.
Soares, A. R., Pereira, P. M., Santos, B., Egas, C., Gomes, A. C., Arrais, J., Oliveira, J. L.,
Moura, G. R., and Santos, M. A. S. (2009). Parallel DNA pyrosequencing unveils
new zebrafish microRNAs. BMC Genomics 10.
Tay, Y. M. S., Tam, W. L., Ang, Y. S., Gaughwin, P. M., Yang, H., Wang, W. J., Liu, R. B.,
George, J., Ng, H. H., Perera, R. J., Lufkin, T., Rigoutsos, I., Thomson, A. M., and
Lim, B. (2008). MicroRNA-134 modulates the differentiation of mouse embryonic
stem cells, where it causes post-transcriptional attenuation of Nanog and LRH1.
Stem Cells 26(1), 17-29.
Wienholds, E., Koudijs, M. J., van Eeden, F. J., Cuppen, E., and Plasterk, R. H. (2003). The
microRNA-producing enzyme Dicer1 is essential for zebrafish development. Nat.
Genet. 35(3), 217-218.
Zamore, P. D., and Haley, B. (2005). Ribo-gnome: the big world of small RNAs. Science
309(5740), 1519-1524.
Zhu, E., Zhao, F., Xu, G., Hou, H., Zhou, L., Li, X., Sun, Z., and Wu, J. (2010). mirTools:
microRNA profiling and discovery based on high-throughput sequencing. Nucleic
Acids Res. 38(Web Server issue), W392-W397.
7

Computational Methods in Mass Spectrometry-Based Protein 3D Studies

Rosa M. Vitale1, Giovanni Renzone2, Andrea Scaloni2 and Pietro Amodeo1
1Istituto di Chimica Biomolecolare, CNR, Pozzuoli
2Laboratorio di Proteomica e Spettrometria di Massa, ISPAAM, CNR, Naples
Italy

1. Introduction
Mass Spectrometry (MS)-based strategies featuring chemical or biochemical probing
represent powerful and versatile tools for studying structural and dynamic features of
proteins and their complexes. In fact, they can be used both as an alternative for systems
intractable by other established high-resolution techniques, and as a complementary
approach to these latter, providing different information on poorly characterized or very
critical regions of the systems under investigation (Russell et al., 2004). The versatility of
these MS-based methods depends on the wide range of usable probing techniques and
reagents, which makes them suitable for virtually any class of biomolecules and complexes
(Aebersold et al., 2003). Versatility is further increased by the possibility of operating at very different levels of accuracy, ranging from qualitative high-throughput fold
recognition or complex identification (Young et al., 2000), to the fine detail of structural
rearrangements in biomolecules after environmental changes, point mutations or complex
formations (Nikolova et al.,1998; Millevoi et al., 2001; Zheng et al., 2007). However, these
techniques heavily rely upon the availability of powerful computational approaches to
achieve a full exploitation of the information content associated with the experimental data.
The determination of three-dimensional (3D) structures or models by MS-based techniques
(MS3D) involves four main activity areas: 1) preparation of the sample and its derivatives
labelled with chemical probes; 2) generation of derivatives/fragments of these molecules for
further MS analysis; 3) interpretation of MS data to identify those residues that have reacted
with probes; 4) derivation of 3D structures consistent with information from previous steps.
Ideally, this procedure should be considered the core of an iterative process, where the final
model possibly prompts for new validating experiments or helps the assignment of
ambiguous information from the mass spectra interpretation step.
Both the overall MS3D procedure and its different steps have been the subject of several
accurate review and perspective articles (Sinz, 2006; Back et al., 2003; Young et al., 2000; Friedhoff, 2005; Renzone et al., 2007a). However, with the partial exception of a few recent
papers (Van Dijk et al., 2005; Fabris et al., 2010; Leitner et al., 2010), the full computational
detail behind 3D model building (step 4) has generally received less attention than the
former three steps. Structural derivation in MS3D, in fact, is considered a special case of
structural determination from sparse/indirect constraints (SD-SIC). Nevertheless,
information for modelling derivable from MS-based experiments exhibits some peculiar
features that differentiate it from the data types associated with other experimental
techniques involved in SD-SIC procedures, such as nuclear magnetic resonance (NMR),
electron microscopy, small-angle X-ray scattering (SAXS), Förster resonance energy transfer
(FRET) and other fluorescence spectroscopy techniques, for which most of the currently
available SD-SIC methods have been developed and tailored (Förster et al., 2008; Lin et al.,
2008; Nilges et al., 1988a; Aszodi et al., 1995).
In this view, this study will illustrate possible approaches to model building in MS3D,
underlining the main issues related to this specific field and outlining some of the possible
solutions to these problems. Whenever possible, alternative methods employing either
different programs selected among most popular applications in homology modelling,
threading, docking and molecular dynamics (MD), or different strategies to exploit the
information contained in MS data will be described. Discussion will be limited to packages
either freely available, or costing less than 1,000 US$ for academic users. For programs, the
home web address has been reported, rather than references that are very often partial
and/or outdated. Some examples, derived from the literature available in this field, or
developed ad hoc to illustrate some critical features of the computational methods in MS3D, should clarify the potential and current limitations of this approach.

2. General MS3D modelling procedures


2.1 Possible computational protocols for MS3D approaches
MS3D can be fruitfully applied to many structure-related problems; thus, it requires the
(possibly combined) use of different modelling procedures. However, a very general scheme
for a MS3D approach can still be sketched (Fig. 1). It includes:
• an initial generation of possible structures for the investigated system by some
sampling algorithms (S1 or S2 stages);
• followed by classification, clustering and selection steps of the best sampled structures
based on one or more criteria (F1 or F2a-F2b-F2c);
• an optional narrowing of the ensemble by a refinement of the selected models (R);
• followed by new classification, clustering and selection stages for the identification of
the most representative models (FF).
Selection criteria are very often represented by more or less sophisticated combinations of
different scoring (i.e. the higher, the better), penalty (i.e. the lower, the better) or target (i.e.
the closer to its reference value, the better) functions. For the sake of brevity, from here
onwards the term “scoring” will be indiscriminately used for either true scoring, or penalty,
or target function, when their discrimination is not necessary.
The features characterizing a specific approach are: a) combination of sampling (and
optimization) algorithms, b) scoring functions in sampling/optimization and classification/
clustering/selection stages, c) strategies to introduce MS-based experimental information.
A first major branching in this scheme already occurs in the earliest modelling stages (box
A), depending on whether MS-based information is, at least in part, integrated in the structure generation stage (path S1-F1), or rather deferred to a subsequent model classification/
selection step (path S2-F2a-F2b-F2c).
Depending on information types, programs and strategies used in modelling (see next
sections for theory and examples), MS-based data can be either all introduced during
sampling (S1), or all used in the filtering stage (F2a), or subdivided between the two steps
(S1+F1). The main advantage of the inclusion of MS-based information into sampling (path
S1-F1) is an increase in model generation efficiency by limitation of the conformational or
configurational subspace to be explored. In several potentially problematic cases, i.e. large
molecules with very limited additional information available, this reduction can transform a
potentially insoluble problem into a reliable model generation, capable of correlating
structural and functional features of the investigated system. However, for the very same
reason, if information is introduced too abruptly or tightly during structural sampling, it can
artificially freeze the models into a wrong, or at least incomplete, set of solutions (Latek et
al., 2007; Bowers et al., 2000). The weight of erroneous restraints will also be considerably amplified by the impossibility of comparison with solutions characterized by some restraint violations but considerably more favourable scoring function values, which are
often diagnostic of inadequate sampling and/or errors in the experimental restraint set.

Fig. 1. Flowchart of a generic MS3D modelling approach. Magenta, violet and pink represent
steps in which MS-based information is applied. Triangular arrows indicate use of MS-based
data. Dotted lines and borders are used for optional refinement stages. Blue codes in white
circles/ellipses label the corresponding stages within the text.
Accordingly, both the protocol used to implement MS-based information into modelling
procedures and the MS-based data themselves generally represent very critical features,
which require the maximum attention during computational setup and final analyses. In
addition, implementation of restraints in the sampling procedure either requires some
purposely programming activity, or severely limits the choice of modelling tools to
programs already including suitable user-defined restraints.
Use of MS-based information in post-sampling analyses (path S2-F2a-F2b-F2c) to help classify and select the final models exhibits a largely complementary profile of advantages and disadvantages. In fact, it decreases the sampling efficiency of the modelling
methods (S2), by leading to a potentially very large number of models to be subsequently
discarded on the mere basis of their violations of MS-derived restraints (F2a), and by
providing no ab initio limitations to the available conformational/configurational space of
the system. Furthermore, it may still require programming activity if available restraint
analysis tools (F2a) are lacking or inefficient in the case of the implemented information.
However, this approach warrants the maximum freedom to the user in the choice of the
sampling program; this may prove very useful in those cases where the peculiar features of a specific program are strongly required to model the investigated system. In addition, a
compared analysis of both structural features and scoring function values between models
accepted and rejected on the basis of MS-based data may allow the identification of potential
issues in the selected models and the corresponding data sets (steps F2c-X).

2.2 Integration of MS-based data into modelling procedures


Although an ever-increasing number of MS-based strategies has been developed, they provide essentially two classes of information for model building: i) surface-accessible residues, from chemical/isotopic labelling or limited proteolysis experiments (Renzone et al., 2007a); ii) pairs of residues whose relative distances fall within a predefined range, from crosslinking experiments (Sinz, 2006; Renzone et al., 2007a). Details on the nature of the combined biochemical and MS approaches used to generate these data and on the experimental procedures adopted in these cases are provided in the exhaustive reviews cited above.

2.2.1 Surface-related information (selective proteolysis and chemical labelling)


Although many structure generation approaches include surface-dependent terms, usually they are not exposed to the user; thus, implementation of accessibility information during sampling is at best indirect, and ranges from very difficult to impossible. In some docking programs,
surface residue patches can be excluded from the exploration, thus restricting the region of
space to be sampled (Section 3.2). This information is generally exploited through programs
that build and evaluate different kinds of molecular surfaces, applied during the model
validation stages. In this view, the main available programs and their usage will be
described in the section dedicated to model validation (Section 3.3.2).
In the case of modelling procedures based on sequence alignment with templates of known
3D structure, surface-dependent data can be employed both to validate alignments before
modelling (early steps in S1 stage), and to filter the structures resulting from the different
steps of a traditional model building procedure (stages F1 or F2a, and FF).

2.2.2 Crosslinks
Cross-linking information often contributes directly to the model building procedure (in the form of distance restraints or direct linker addition to the simulated systems) (stage S1 in Fig. 1), in addition to its model validation/interpretation role (stages F1, F2a, FF).

Whenever information from crosslinking experiments is integrated within the modelling procedure, the most common approach in the literature is its translation into distance constraints (i.e. "hard", fixed distances) or restraints (i.e. variable within an interval and/or around a fixed distance with a given tolerance) involving atoms, in a full-atomistic representation, or higher-order units, such as residues, secondary structure (SS) elements, or domains, in coarse-grained models. A less common approach consists of the explicit inclusion of the crosslinker atoms in the simulation.
2.2.2.1 Distance restraints
Distance restraints (DRs) are usually implemented by adding a penalty term to the scoring
function used to generate, classify or select the models, whenever the distance between
specified atom pairs exceeds a threshold value. In this way, associated experimental
information can be introduced rather easily and with moderate computational overheads in
all the molecular modelling and simulation approaches based on scoring functions.
However, crosslinking agents are molecules endowed with well-defined and specific conformational and interaction properties, both internal and towards the crosslinked molecules. As a consequence, even accurate theoretical and experimental estimates of the distance ranges spanned by a given crosslinker correspond only qualitatively to the experimentally detected distances between pairs of crosslinked residues (Green et al., 2001; Leitner et al., 2010). Steric bumps, specific favourable or unfavourable electrostatic interactions, the presence of functional groups capable of promoting/hampering the crosslinking reaction, and changes in the crosslinker conformational population under the influence of the macromolecule are all possible causes of the observed discrepancies.
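The following minimal sketch shows one possible form of such a penalty term: flat within a tolerance window around the target distance and harmonic outside it. The target distance, tolerance and force constant are arbitrary illustrative values that would have to be calibrated for the actual crosslinking agent.

```python
# Illustrative distance-restraint penalty of the kind added to a scoring function
# during sampling; all numerical values are placeholders.
def dr_penalty(d, target=24.0, tol=2.0, k=10.0):
    """Zero inside [target - tol, target + tol], harmonic outside."""
    if d < target - tol:
        dev = (target - tol) - d
    elif d > target + tol:
        dev = d - (target + tol)
    else:
        return 0.0
    return k * dev ** 2

# Example: three crosslink-derived CA-CA distances measured on a candidate model
distances = [21.5, 26.8, 30.2]
print(sum(dr_penalty(d) for d in distances))  # total penalty added to the score
```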
2.2.2.2 Explicit linkers
Explicit inclusion of crosslinkers in the simulated systems, although it could potentially overcome the limits of DRs, presently suffers from several drawbacks that restrict its usage to either final selection/validation stages, or to cases where a limited number of totally independent and simultaneously satisfied crosslinks are observed. In fact, when many crosslinks are detected in a system by MS analysis, they very often correspond to mixtures of different patterns, because crosslinks can interfere with each other, either by direct steric hindrance, by competition for one of the reacting groups of the macromolecule, or by inducing deformations in the linked system that prevent further reactions. However, the added information from
explicit crosslinkers may: i) allow disambiguation between alternative predicted binding
modes, ii) provide more realistic and strict estimates of the linker length to be used in
further stages of DR-based calculations, iii) help modelling convergence, iv) substantially
contribute to model validation.
An attempt to reproduce by an implicit approach at least the geometrical constraints
associated with a physical linker has been performed by developing algorithms to identify
minimum-length paths on protein surfaces (Potluri et al., 2004). This approach provides
upper/lower bounds to possible crosslinking distances, but it has only been applied to static structures as a post-modelling validation tool, and no further applications
have been reported so far.

3. Available computational approaches in MS3D


MS-based data can be used to obtain structural information on different classes of problems:
a. single conformational states (e.g. the overall fold);
b. conformational changes upon mutations/environmental modifications;
c. macromolecular aggregation (multimerization);
d. binding of small ligands to macromolecules.
Sampling efficiency and the physical soundness of the scoring functions used during sampling (stages S1/S2 of Fig. 1) and for selecting computed structures (stages F1/F2b and FF) generally
represent the main current limitations of 3D structure prediction and simulation methods. In
this view, introduction of experimental data represents a powerful approach to reduce the
geometrical space to be explored during sampling, and also an independent criterion to
evaluate the quality of selected models.
From a computational point of view, structural problems a)-d) translate into appropriate, system-dependent combinations of:
A. fold identification and characterization;
B. docking;
C. structural refinement and characterization of dynamic properties and of changes under
the effects of local or environmental perturbations.
Since the optimal combination of methods for a given problem depends upon a large
number of system- and data-dependent parameters, and the number of programs developed
for biomolecular simulations is huge, an exhaustive description and comparative analysis of methods for biomolecular structure generation/refinement is practically impossible. However, we will try to offer a general overview of the main approaches to generate, refine and select 3D structures in MS3D applications, with special attention to possible ways of
introducing MS-based data and exploiting their full information content.

3.1 Fold identification and characterization


The last CASP (Critical Assessment of techniques for protein Structure Prediction)
experiment call (CASP9, 2010) classified modelling methods into two main categories: "Template Based Modelling" (TBM) and "Template Free Modelling" (TFM), depending on whether meaningful homology can be identified before modelling between the target sequence and those of proteins/domains whose 3D structures are known (templates).
TFM represents the most challenging task because it requires the exploration of the widest
conformational space and heavily relies on scoring methods inspired by those principles of
physics governing protein folding (de novo or ab initio methods), possibly supplemented by statistical predictions, such as probabilities of interresidue contacts, surface accessibility of single residues or local patches, and SS occurrence. When the number and quality of such data increase, together with the extent of the target sequence for which they are available, "fold recognition" and "threading" techniques can be used, including a broad
range of methods at the interface between TFM and TBM. In these approaches, several
partial 3D structure “seeds” are generated by statistical prediction or distant homology
relationships, and their relative arrangements are subsequently optimized by strategies
deriving from de novo methods.
The most typical TBM approach, “comparative” or “homology” modelling (HM), uses
experimentally elucidated structures of related protein family members as “templates” to
model the structure of the protein under investigation (the "target"). The target sequence can
either be fully covered by one or more templates, exhibiting good homology over most of
the target sequence, or can require a “patchwork” of different templates, each best covering
a different region of the target.

A further group of approaches, presently under active development and already exhibiting
good performance in CASP and other benchmark and testing experiments, is formed by the "integrative" or "hybrid" methods. They combine information from a varied set of computational and experimental sources, often acting as, or relying on, "metaservers", i.e. servers that submit a prediction request to several other servers and then average their results
to provide a consensus that in many cases is more reliable than the single predictions from
which it originated. Some metaservers use the consensus as input to their own prediction
algorithms to further elaborate the models.
In order to provide some guidelines for structural prediction/refinement tasks in the
presence of MS-based data, a general procedure will be outlined for protein fold/structure
modelling. The starting step in protein modelling is usually represented by a search for
already structurally-characterized similar sequences. Sensitive methods for sequence
homology detection and alignment have been developed, based on iterative profile searches,
e.g. PSI-Blast (Altschul et al., 1997), Hidden Markov Models, e.g. SAM (K. Karplus et al.
1998), HMMER (Eddy, 1998), or profile-profile alignment such as FFAS03 (Jaroszewski et al.,
2005), profile.scan (Marti-Renom et al., 2004), and HHsearch (Soding, 2005).
When homology with known templates is over 40%, HM programs can be used rather
confidently. In this case, especially when alignments to be used in modelling have already
been obtained, local programs represent a more viable alternative to web-based methods
than in TFM processes. If the analysis is limited to the most popular programs and web services capable of implementing user-supplied MS-based restraints (strategy S1 in Fig. 1), the number of
possible candidates considerably decreases. Among web servers, on the basis of identified
homologies with templates, Robetta is automatically capable of switching from ab initio to
comparative modelling, while I-TASSER requires user-provided alignment or templates to
activate comparative modelling mode. A very powerful, versatile and popular HM
program, available both as a standalone application and as a web service, and embedded in many modelling servers, is MODELLER (http://www.salilab.org/modeller/). It includes
routines for template search, sequence and structural alignments, determination of
homology-derived restraints, model building, loop modelling, model refinement and
validation. MS-based distance restraints can be added to those produced from target-
template alignments, as well as to other restraints enforcing secondary structures, symmetry, or parts of the structure that must not be allowed to change during modelling. However, some scripting ability is required to fully exploit MODELLER's versatility.
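As an example of the scripting involved, the fragment below sketches how a single MS-derived crosslink could be encoded as an upper-bound Cα-Cα restraint within a MODELLER automodel run; the alignment file name, template code and residue numbers are hypothetical, and the restraint form and parameters shown are only one of several possibilities offered by the program.

```python
# Hedged sketch of homology modelling with an additional MS-derived restraint in
# MODELLER; file names, codes and residue numbers are placeholders.
from modeller import *
from modeller.automodel import *

class MS3DModel(automodel):
    def special_restraints(self, aln):
        rsr = self.restraints
        at = self.atoms
        # Crosslink between residues 12 and 85: keep the CA-CA distance below ~24 A
        rsr.add(forms.upper_bound(group=physical.xy_distance,
                                  feature=features.distance(at['CA:12'], at['CA:85']),
                                  mean=24.0, stdev=1.0))

env = environ()
a = MS3DModel(env, alnfile='target-template.ali',
              knowns='templateA', sequence='target')
a.starting_model, a.ending_model = 1, 10
a.make()
```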
The overall accuracy of HM models calculated from alignments with sequence identities of
40% or higher is almost always good (typical root mean square deviations (RMSDs) from
corresponding experimental structures less than 2Å). The frequency of models deviating by
more than 2Å RMSD from experimental structures rapidly increases when target–template
sequence identity falls significantly below 30–40%, the so-called “twilight zone” of HM (Blake
& Cohen, 2001; Melo & Sali, 2007). In such cases, the quality of the resulting modelled structures can be significantly improved by incorporating additional information, both of statistical origin, such as SS prediction profiles, and from sparse experimental data (low-resolution NMR, or chemical crosslinking, limited proteolysis and chemical/isotopic labelling coupled with MS).
If the search does not produce templates with sufficient homology and/or covering of the
target sequence, TFM or mixed TFM/TBM methods must be used. Many programs based on
ab initio, fold recognition and threading methods are presently offered as web services; this
is because very often they use a metaserver approach for some steps, need extensive searches in large databases, require huge computational resources, or aim to better protect the underlying programs and algorithms, which are under very active development. Although
this may offer some advantages, especially to users less-experienced in biocomputing or
endowed with limited computing facilities, it may also imply strong limitations in the full
exploitation of the features implemented in the different methods, with particularly serious
implications in MS3D. Only a few servers either include an NMR structure determination module (not always suitable for MS-based data), or explicitly allow the optional usage of
user-provided distance restraints in the main input form. Fortunately, two of the most used
and versatile servers, Robetta (http://robetta.bakerlab.org/) and I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/), good performers at the last CASP rounds (http://predictioncenter.org/), allow input of distance restraints in the modelling procedure, via an NMR-dedicated service for Robetta (Rosetta-NMR, suitable for working with sparse restraints) (Bowers et al., 2000), or directly in the main prediction submission
page (I-TASSER). Other servers can still allow the implementation of MS-based information
in the model generation step if they can save intermediate results, such as sequence
alignments, SS or fold predictions. The latter, after addition of MS-based restraints, can then be fed into suitable modelling programs, to be run either locally or on web servers.
A successful example of modelling with MS-based information in a low-homology case is Gadd45β. A model was built, despite the low sequence identity (<20%) with the template identified by fold recognition programs, through the introduction of additional SS restraints,
which were based on SS profiles and experimental data from limited proteolysis and
alkylation reactions combined with MS analysis (Papa et al., 2007). Model robustness was
confirmed by comparison with the structure of the homolog Gadd45γ solved later (Schrag et al., 2008), where the only divergence in SS profiles was the occurrence of two short 3₁₀ helices (each three residues long) and an additional two-residue β-strand in predicted loop regions (Fig. 2). Furthermore, this latter β-strand is so distorted that only a few SS assignment
programs could identify it, and the corresponding sequence in Gadd45β, predicted
unstructured and outside the template alignment, was not modelled at all.

Fig. 2. Comparison between the MS3D model of Gadd45β (light green) and the
crystallographic structure of its homolog Gadd45γ (light blue). Sequences with different SS
profiles are painted green in Gadd45β and magenta in Gadd45γ.

3.2 Docking
Usually, methods for protein docking involve a six-dimensional search of the rotational and translational space of one protein with respect to the other, where the molecules are treated as rigid or semi-rigid bodies. However, during protein-protein association, the interface residues of both molecules may undergo conformational changes that sometimes involve not only side-chains, but also large backbone rearrangements. To account, at least in part, for these conformational changes, protein docking protocols have introduced some degree of protein flexibility, either by using "soft" scoring functions that tolerate some steric clashes, or by explicitly including domain movements and side-chain flexibility. Biological information from experimental data on regions or residues involved in complexation can guide the search of complex configurations or filter out wrong solutions. Among the programs most frequently used for protein-protein docking, recently reviewed by Moreira and colleagues (Moreira et al., 2010), those that can manage biological information will be discussed in this context.
In the Attract program (http://www.t38.physik.tu-muenchen.de/08475.htm), proteins are represented with a reduced model (up to 3 pseudoatoms per amino acid) to allow the systematic docking minimization of many thousands of starting structures. During the docking, both partner proteins are treated as rigid bodies and the protocol is based on energy minimization in translational and rotational degrees of freedom of one protein with respect
to the other. Flexibility of critical surface side-chains as well as large loop movements are
introduced in the calculation by using a multiple conformational copy approach (Bastard et
al., 2006). Experimental data can be taken into account at various stages of the docking
procedure.
The 3D-Dock algorithm (http://www.sbg.bio.ic.ac.uk/docking/) performs a global scan of
translational and rotational space of the two interacting proteins, with a scoring function
based on shape complementarity and electrostatic interaction. The protein is described at
atomic level, while the side-chain conformations are modelled by multiple copy
representation using a rotamer library. Biological information can be used as distance
restraints to filter final complexes.
HADDOCK (http://www.nmr.chem.uu.nl/haddock/) makes use of biochemical or
biophysical interaction data, introduced as ambiguous intermolecular distance restraints
between all residues potentially involved in the interaction. Docking protocol consists of
four steps: 1) topology and structure generation; 2) randomization of orientations and rigid
body energy minimization; 3) semi-flexible simulated annealing (SA) in torsion angle space;
4) flexible refinement in Cartesian space with explicit solvent (water or DMSO). The final
structures are clustered using interface backbone RMSD and scored by their average
interaction energy and buried interface area. Recently, explicit inclusion of water molecules at the interface has also been incorporated into the protocol.
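As an example of how MS-based data can feed this protocol, the sketch below writes CNS-style ambiguous restraints of the kind read by HADDOCK, starting from two lists of interface residues mapped, for instance, by differential labelling; the residue numbers and segment identifiers are purely illustrative, and the exact restraint syntax and effective distance should be checked against the documentation of the HADDOCK version in use.

```python
# Hypothetical sketch: writing HADDOCK-style ambiguous interaction restraints
# (CNS "assign" statements) from MS-mapped interface residues.
active_A = [15, 47, 52]   # interface residues mapped on protein A (placeholders)
active_B = [8, 91]        # interface residues mapped on protein B (placeholders)

with open("ambig.tbl", "w") as out:
    for ra in active_A:
        sel_b = " or ".join(f"(resid {rb} and segid B)" for rb in active_B)
        # each mapped residue of A must end up close to at least one mapped residue of B
        out.write(f"assign (resid {ra} and segid A) ({sel_b}) 2.0 2.0 0.0\n")
```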
Molfit (http://www.weizmann.ac.il/Chemical_Services/molfit/) represents each molecule involved in the docking process by a 3-dimensional grid of complex numbers and estimates the
extent of geometric and chemical surface complementarity by correlating the grids using
Fast Fourier Transforms (FFT). During the search, contacts involving specified surface
regions of either one or both molecules are up- or down-weighted, depending on available
structural and biochemical data or sequence analysis (Ben-Zeev et al., 2003). The solutions
are sorted by their complementarity scores and the top ranking solutions are further refined
by small rigid body rotations around the starting position.

PatchDock (http://bioinfo3d.cs.tau.ac.il/PatchDock/) is based on shape complementarity. First, the surfaces of the interacting molecules are divided according to their shape into concave, convex and flat patches; then, complementarity among patches is identified by shape-matching techniques. The algorithm performs rigid-body docking, but some flexibility is
indirectly considered by allowing some steric clashes. The resulting complexes are ranked
on the basis of the shape complementarity score. PatchDock allows integration of external
information by a list of binding site residues, thus restricting the matching stage to their
corresponding patches.
RosettaDock (http://rosettadock.graylab.jhu.edu/) tries to mimic the two stages of a docking process, recognition and binding, as hypothesized by Camacho & Vajda (2001).
Recognition is simulated by a low resolution phase in which a coarse-grained representation
of proteins, with side chains replaced by single pseudoatoms, undergoes a rigid body Monte
Carlo (MC) search on translations and rotations. Binding is emulated by a high-resolution
refinement phase where explicit sidechains are added by using a backbone-dependent
rotamer packing algorithm. The sampling problem is handled by supercomputing clusters
to ensure a very large number of decoys that are discriminated by scoring functions at the
end of both stages of docking. The docking search problem can be simplified when
biological information is available on the binding region of one or both interacting proteins.
The reduction of the conformational space to be sampled can be pursued by: i) suitably pre-orienting the partner proteins, or ii) restricting docking sampling to the high-affinity domain, in the case of multidomain proteins, or iii) using loose distance constraints.
ZDOCK (http://zdock.bu.edu/) is a rigid body docking program based on an FFT algorithm and an energy function that combines shape complementarity, electrostatics and desolvation terms. RDOCK (http://zdock.bu.edu/) is a refinement program to minimize and rerank the
solutions found by ZDOCK. The complexes are minimized by CHARMm (Brooks et al.,
1983) to remove clashes and improve energies, then electrostatic and desolvation terms are
recalculated in a more accurate fashion with respect to ZDOCK. Biological information can
be used either to avoid undesirable contacts between certain residues during ZDOCK
calculations or to filter solutions after RDOCK.
As in protein folding, the use of MS-based information in docking has allowed the modelling of several complexes even in the absence of suitable high-homology templates. The fold of the prohibitin proteins PHB1 and PHB2 was predicted (Back et al., 2002) by SS and fold recognition algorithms, while crosslinking data allowed modelling of the relative spatial arrangement of the two proteins in their 1:1 complex. Another example of combined use of
SS information, chemical crosslinking, limited proteolysis and MS analysis results with a
low sequence identity (~ 20%) template is the modelling of porcine aminoacylase 1 dimer; in
this case, standard modelling procedures based on automatic alignment had failed to
produce a dimeric model consistent with experimental data (D'Ambrosio et al., 2003).
In the case of protein-small ligand docking, the conformational space to be explored is
reduced by the small size of the ligand, whose full flexibility can usually be allowed, and by
the limited fraction of protein surface to be sampled, corresponding to the binding site, often
already known. Among the programs for ligand-flexible docking that also allow protein side-chain flexibility, AutoDock is one of the most popular (http://autodock.scripps.edu/).
AutoDock combines a grid-based method with a Lamarckian Genetic Algorithm to allow a
rapid evaluation of the binding energy. A simulated annealing method and a traditional
genetic algorithm are also available in Autodock4.

In general, MS-based data can be used to limit the protein region to be sampled (Kessl et al.,
2009) or can be explicitly considered in the docking procedure, as in the case of the mapping of the Sso7d ATPase site (Renzone et al., 2007b). In this case, three independent approaches for molecular docking/MD studies were followed, considering both FSBA-derivatives and the ATP-Sso7d non-covalent complex: i) unrestrained MD, starting from a fully-extended,
external conformation for Y7-FSBA and K39-FSBA residue sidechains, and from several
random orientations for ATP, with an initial distance of 20 Å from Sso7d surface, in regions
not involved in protein binding; ii) restrained MD, by gradually imposing distance restraints
corresponding to an H-bond between the adenine NH2 group and each accessible (i.e., within a distance lower than or equal to the maximum length of the corresponding FSBA-derivative)
donor sidechain; iii) rigid ligand docking, by calculating 2000 ZDOCK models of the non-
covalent complex of Sso7d with an adenosine molecule. The rigid ligand docking
reproduced only in part the features found with the other approaches: it correctly predicted the anchoring point for the adenosine ring, but failed to achieve a correct position for the ribose
moiety, due to the required concerted rearrangement of two Sso7d loops involved in the
binding. This latter feature represents one of the main advantages of modelling strategies
involving MD (in particular, in cartesian coordinates) because MD-based simulation
techniques are the best or the only approaches that reproduce medium-to-large scale
concerted rearrangements of non-contiguous regions.
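Returning to the simplest use of MS-based data in docking, i.e. limiting the region to be sampled, the sketch below derives the centre and size of a search box from the Cα coordinates of residues implicated by labelling or protection data; the structure file, chain identifier, residue numbers and padding are hypothetical, and the resulting box would then be passed to the grid setup of the chosen docking program.

```python
# Sketch of deriving a docking search box from MS-identified binding-site residues;
# file name, chain and residue numbers are placeholders.
import numpy as np
from Bio.PDB import PDBParser

site_residues = [7, 39, 41]   # residues implicated by labelling/protection data
model = PDBParser(QUIET=True).get_structure("prot", "protein.pdb")[0]

coords = np.array([model["A"][r]["CA"].coord for r in site_residues])
center = coords.mean(axis=0)
size = coords.max(axis=0) - coords.min(axis=0) + 10.0   # pad by ~10 A along each axis

print("search box centre (x, y, z):", np.round(center, 2))
print("search box size   (x, y, z):", np.round(size, 2))
```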

3.3 Model simulation, refinement and validation


Refinement (R stage in Fig.1) and validation of final models (FF stage) represent very
important steps, especially in cases of low homologies with known templates and when fine
details of the models are used to predict or explain functional properties of the investigated
system. In addition, very often the modelled structures are aimed at understanding the
structural effects of point mutations or other local sequence alterations (sequence
deletions/insertions, addition or deletion of disulphide bridges, formation of covalent
constructs between two molecules and post-translational modifications), or of changes in
environmental parameters (temperature, pressure, salt concentration and pH). In these
cases, techniques are required to simulate the static or dynamic behaviour of the
investigated system in its perturbed and unperturbed states.

3.3.1 Computational techniques and programs for model simulation and refinement
Model refinement, when not implemented in the modelling procedure, can be performed by
energy minimization (EM) or, better, by different molecular simulation methods, mostly
based on variants of molecular dynamics (MD) or Monte Carlo (MC) techniques. They are
also commonly used to characterize dynamic properties and structural changes upon local
or environmental perturbations.
Structures deriving from folding or docking procedures need, in general, at least a structural
regularization by EM before the final validation steps, to avoid meaningless results from many validation methods, whose scoring functions evaluate the plausibility of parameters such as dihedral angle distributions, the presence and distribution of steric bumps, voids in the molecular core, and specific nonbonded interactions (H-bonds, hydrophobic clusters). Representing a mandatory step in most MC/MD protocols, EM programs are included in all the molecular simulation packages, and they share with MC/MD most input files and part of the setup parameters. Thus, unless explicitly stated otherwise, all system- and restraint-related features or issues illustrated for simulation methods also implicitly hold for EM.

As we are mostly interested in techniques implementing experimentally-derived constraints or restraints, some of the most popular methods for constraint-based modelling will be
briefly described. These methods have been developed and optimized mainly to identify
and refine 3D structures consistent with spatial constraints from diffraction and resonance
experiments (de Bakker et al., 2006). They have also been extensively applied to both TBM
(Fiser & Sali, 2003) and free modelling prediction and simulation (Bradley et al., 2005;
Schueler-Furman et al., 2005), and are often used to refine/validate models produced in
TFM and TBM approaches described in sections 3.1 and 3.2. There are two main categories
of constraint-based modelling algorithms: i) distance geometry embedding, which uses a
metric matrix of distances from atomic coordinates to their collective centroid, to project
distance space to 3D space (Havel et al. 1983; Aszodi et al. 1995, 1997); ii) minimization,
which incorporates distance constraints in variable energy optimization procedures, such as
molecular dynamics (MD) and Monte Carlo (MC). For both MD and MC, it is possible to work either in full Cartesian coordinates or in the restricted torsion angle (TA) space, with
covalent structure parameters kept fixed at their reference values, thus originating the
Torsional Angle MD (TAMD) and Torsional Angle MC (TAMC) approaches. They are
currently implemented in several modelling and refinement packages, developed for
structural refinement of X-ray or NMR structures (Rice & Brünger, 1994; Stein et al. 1997;
Güntert et al., 1997), folding prediction (Gray et al., 2003), or more general packages
(Mathiowetz et al., 1994; Vaidehi et al., 1997). Standard MC/MD methods are only useful for
structural refinement, local exploration and to characterize limited global rearrangements.
However, they are also widely used as sampling techniques in folding/docking approaches,
although in those cases enhanced sampling extensions of both methods are employed.
Simulated annealing (SA) (Kirkpatrick et al., 1983) and replica exchange (RE) approaches
(Nymeyer et al., 2004) are the most common examples of these MC/MD enhancements, both
potentially overcoming the large energy barriers required for sampling the wide
conformational and configurational spaces to be explored in folding and docking
applications, respectively.
A non-exhaustive list of the most widely used simulation packages that include a more-than-basic treatment of distance-related restraints and also exhibit good versatility (i.e. implementation of different algorithms, approaches, force fields and solvent representations) includes at least: AMBER (http://ambermd.org/), CHARMM (http://www.charmm.org/), DESMOND (http://deshawresearch.com/resources.html), GROMACS (http://www.gromacs.org/) and TINKER (http://dasher.wustl.edu/tinker). CYANA (http://www.cyana.org) and XPLOR/CNS (http://cns-online.org/v1.3/), although
originally more specialized for structural determination and refinement from NMR and
NMR/X-ray data, respectively, have been recently included in several TFM and TBM
protocols, thanks to their efficient implementations of TAMD and distance or torsional angle
restraints. The choice of a simulation program should ideally take into account several criteria, ranging from computational efficiency, to support of sampling or refinement
algorithms, to integration with other tools for TFM or TBM applications.
The main problems associated with simulation methods that have relevant potential implications for MS3D are: i) insufficient sampling; ii) inaccuracy in the potential energy functionals driving the simulations; iii) influence of the approach used to implement experimentally-derived information on the final structure sets.
The sampling problem can be addressed both by increasing the sampling efficiency with MC/MD variations like SA and RE, and by decreasing the size of the space to be explored.

This latter result can be reached by reducing the overall number of degrees of freedom to be
explicitly sampled and/or by reducing the number of possible values per variable to a
small, finite number (discretization, like in grid-based methods), and/or by restraining
acceptable variable ranges. Reduction of the total number of degrees of freedom can be
accomplished by switching to coarse-grained representations of the system, where a number
of explicit atoms, ranging from connected triples, to amino acid sidechains, to whole
residues, up to full protein subdomains, are replaced by a single particle. This method is
frequently used in initial stages of ab initio folding modelling, or in the simulation of very
large systems, such as giant structural proteins or huge protein aggregates.
Another possible way to reduce the number of degrees of freedom is the aforementioned TA approach, which requires for an N-atom system only N/3 torsional angles, compared with 3N coordinates in atomic Cartesian space (Schwieters & Clore, 2001). Moreover, as the high-frequency bending and stretching motions are removed, TAMD can use longer time steps in the numerical integration of the equations of motion than those required for classical molecular dynamics in Cartesian space. Its main limitation may derive from neglecting
covalent geometry variations (in particular, bending centred on protein Cα atoms) that are
known to be associated with conformational variations (Berkholz et al., 2009), for instance
from α-helix to β-strand, and that can be important in concerted transitions or in large
structures with extensive and oriented SS regions. Discretization is mostly employed in the
initial screening of computationally intensive problems, such as ab initio modelling.
Restraining variable value ranges in MS3D is usually associated either with predictive methods (SS, H-bond pattern, residue exposure), or with homology analysis, or with experimentally-derived information. Origin, nature and form of these restraints have
already been discussed in previous sections, while some more detail on the implementation
of distance-related information into simulation programs will be given at the end of this
section.
While the implementation of restraints can be very variable in methods where the scoring function is not intended to mimic or replicate a physical interaction between the involved entities, methods based on physically sound molecular potential functions (force fields) implement DRs through a more limited number of approaches. At its
simplest, a DR will be represented as a harmonic restraint, for which only the target distance
and the force constant need to be specified in input. This functional form is present in
practically all of the most common programs, but it either requires precise knowledge of the target distance, or it will result in a very loose restraint if the force constant is lowered too much to account for low-precision target values, the usual case for MS-based data. In a more complex
and useful form, implemented with slight variations in several programs (AMBER,
CHARMM, GROMACS, XPLOR/CNS, DESMOND, TINKER), the restraint is a flat-bottomed well with parabolic sides out to a defined distance, and linear beyond that on both sides (AMBER) or just on the upper-limit side (CHARMM, GROMACS, XPLOR/CNS, DESMOND). In some programs (CHARMM, AMBER, XPLOR/CNS), it is possible to select an alternative behaviour when a distance restraint violation gets very large (Nilges et al., 1988b) by "flattening out" the potential, thus leading to no force for large violations; this allows for
errors in constraint lists, but might tend to ignore constraints that should be included to pull
a bad initial structure towards a more correct one.
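To make the shape of such a restraint concrete, the sketch below implements one possible flat-bottomed well of this kind, with a zero-penalty region between r2 and r3, parabolic sides out to r1 and r4, and linear continuations beyond them with matching slope; the functional form and numerical values are illustrative and do not reproduce the exact expression used by any particular package.

```python
# Illustrative flat-bottomed distance restraint: flat between r2 and r3, parabolic
# sides out to r1 and r4, then linear continuations with matching slope.
# Distances in Angstrom, k in arbitrary energy units; all values are placeholders.
def flat_bottom_energy(r, r1=2.0, r2=4.0, r3=24.0, r4=28.0, k=20.0):
    if r2 <= r <= r3:                       # square bottom: no penalty
        return 0.0
    if r1 <= r < r2:                        # parabolic lower side
        return k * (r - r2) ** 2
    if r3 < r <= r4:                        # parabolic upper side
        return k * (r - r3) ** 2
    if r < r1:                              # linear below r1, continuous in value and slope
        return k * (r1 - r2) ** 2 + 2 * k * (r2 - r1) * (r1 - r)
    return k * (r4 - r3) ** 2 + 2 * k * (r4 - r3) * (r - r4)   # linear above r4

for r in (3.0, 10.0, 26.0, 35.0):
    print(r, flat_bottom_energy(r))
```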
Other forms for less common applications may also be available in the programs or can be implemented by the user. However, the most interesting additional features of versatile DR
implementations are the different averages that can be used to describe DRs: i) complex
restraints can involve atom groups rather than single atoms at either or both restraint sides;
ii) time-averaged DRs, where target values are satisfied on average within a given time lapse
rather than instantaneously; iii) ambiguous DRs, averaged on different distance pairs. The
latter two cases are very useful when the overall DRs are not fully consistent with each other, because they are observed in the presence of conformational equilibria and, as such, they are associated with different microstates of the system. In addition, complex and versatile protocols can easily be developed in those programs where different parameters can be
smoothly varied during the simulation (AMBER).
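For ambiguous DRs, one widely used convention (not necessarily the one adopted by every package listed above) replaces the individual distances with a single effective distance computed from an inverse sixth power sum, so that the restraint is dominated by the closest contributing pair, as in the following sketch.

```python
# One common effective-distance form for ambiguous restraints (r^-6 summation);
# whether this exact expression is used depends on the simulation package.
def effective_distance(distances):
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)

# The shortest contributing pair dominates: the result here is ~10 A even though
# the second candidate pair is 25 A apart.
print(round(effective_distance([10.0, 25.0]), 2))
```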

3.3.2 Programs for model validation


A validation of the final models, very often included in part in the available automated
modelling protocols, represents a mandatory step, especially for more complex (low-
homology, few experimental data) modelling tasks. A huge number of protein and nucleic
acid structural analysis and validation tools exists, based on many different criteria, and subject to continuous development and testing; thus, even a CASP section is dedicated to structural assessment tools (http://www.predictioncenter.org/), and the "Bioinformatics Links Directory" site alone currently reports 76 results matching "3-D structural analysis" (Brazas et al., 2010). Since a detailed survey is outside the scope of the present report, information on 3D structural validation tools can be found on specialized sites such as http://bioinformatics.ca/links_directory/. However, similarly to what was stated about prediction metaservers, a general principle for validation is to use several tools, based on different criteria, looking for emergent properties and consensus among the results.
Specific parameters associated with MS-based data can be usually analysed with available
tools. Distance restraints and their violations can be analysed both on single structures and
on ensembles (sets of possible solutions of prediction methods, frames from molecular
dynamics trajectories) with several graphic or textual programs, the most specialized
obviously being those tools developed for the analysis of NMR-derived structures.
Surface information can be analysed by programs like
DSSP (http://swift.cmbi.ru.nl/gv/dssp/),
NACCESS (http://www.bioinf.manchester.ac.uk/naccess/),
GETAREA (http://curie.utmb.edu/getarea.html/), and
ASA-VIEW (http://gibk26.bse.kyutech.ac.jp/jouhou/shandar/netasa/asaview/),
which calculate different kinds of molecular surfaces, such as van der Waals, accessible, or solvent-excluded surfaces for whole systems, and contact surfaces for complexes.
However, differently from the case of distance restraints, the available programs usually work on a single
input structure at a time, thus making structure filtering and analysis on the large ensembles
of models potentially produced by conformational prediction, molecular simulation or
docking calculations, a painful or impossible task. In these cases, scripts or programs to
automate the surface calculations and to average or filter the results must be developed.
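The fragment below is one possible starting point for such a script: it loops over an ensemble of models, computes per-residue relative accessibilities through Biopython's DSSP wrapper, and flags models in which residues found accessible by chemical labelling end up buried; the file names, chain identifier, residue numbers and the 20% relative-accessibility cutoff are all placeholder assumptions.

```python
# Sketch of automating surface-accessibility checks over an ensemble of models;
# requires the dssp/mkdssp executable and Biopython. Names and cutoffs are placeholders.
import glob
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

labelled = [23, 57, 102]          # residues found accessible by chemical labelling
parser = PDBParser(QUIET=True)

for pdb_file in sorted(glob.glob("models/*.pdb")):
    model = parser.get_structure("m", pdb_file)[0]
    dssp = DSSP(model, pdb_file)
    # relative solvent accessibility is the fourth field of each DSSP record
    rsa = {res: dssp[("A", (" ", res, " "))][3] for res in labelled}
    buried = [res for res, value in rsa.items() if value < 0.2]
    print(pdb_file, "consistent" if not buried else f"buried labelled residues: {buried}")
```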

4. Modelling with sparse experimental restraints


In the previous section, many of the computational methods that can contribute to producing structural models in MS3D applications have been outlined, together with different ways to integrate MS-based experimental information into them. Here we will refocus on the overall
computational approach in MS3D, to illustrate some of its peculiar features and issues, its
present potentialities and the variety of possible combinations of data and protocols that can
be devised to optimally handle different types of structural problems. Depending on nature
and quantity of available experimental information and on previous knowledge of the
investigated system, different combinations of the methods mentioned in previous sections
can be optimally employed. We will start by illustrating examples of methods for de novo protein folding, a frontier application of modelling with sparse restraints, because it is based on minimal additional information about the system under investigation.
The MONSSTER program (Skolnick et al. 1997) only makes use of SS profiles and a limited
number of long-distance restraints. By employing system discretization and coarse-graining
to reduce required sampling, a protein is represented by a lattice-based Cα-backbone trace
with single interaction center rotamers for the side-chains. By using N/7 (N is the protein
length) long-range restraints, this method is able to produce folds of moderate resolution,
falling in the range from 4 to 5.5 Å RMSD for Cα-traces with respect to the native conformation for all-α and α/β proteins, whereas β-proteins require, for the same resolution,
N/4 restraints. A more recent method for de novo protein modelling (Latek et al., 2007)
adopts restrained folding simulations supported by SS predictions, reinforced with sparse
experimental data. The authors focused on NMR chemical-shift-based restraints, but sparse restraints from different sources can also be employed. A significant improvement of model
quality was already obtained by using a number of DRs equal to N/12.
As already stated by Latek and colleagues, the introduction of DRs in a protein folding protocol represents a critical step that in principle could negatively affect the sampling of conformational space. In fact, applying restraints at too early a stage of the calculations can trap
the protein into local minima, where restraints are satisfied, but the native conformation is
not reached. In addition to the number, even the specific distribution of long-range
restraints along the sequence can affect the sampling efficiency. To test the influence of data sets in the folding problem, we applied a well-tested SA protocol, developed for the AMBER program and mainly oriented towards NMR structure determination, to the folding simulation of bovine pancreatic trypsin inhibitor (BPTI), using different sets of ten long-distance restraints randomly selected from available NMR data (Berndt et al., 1992), with optional inclusion of an SS profile. Fig. 3 shows representative structures for each restraint set.
The four native BPTI disulphide bridges were taken into account by additional distance and
angle restraints. BPTI represents a typical benchmark for this kind of study, due to its peculiar topology (an α/β fold with long connection loops, stabilized by disulphide bonds)
still associated with a limited size (58 residues), and to the availability of both X-ray and
NMR accurate structures. SA cycles of 50 structures each were obtained and compared for
four combinations of three sets (S1-3) of ten long distance restraints, totally non-redundant
among different sets, and SS profiles: a) S1+SS profile; b) S1 alone; c) S2+SS profile; d) S3+SS profile. The S1 set performed definitely better than the other two, its best model exhibiting an RMSD value of 2.4 Å on the protein backbone of residues 3-53 from the representative NMR structure.
This set was also able to provide a reasonable low-resolution fold even in the absence of SS
restraints (b). S2 resulted in almost correctly folded models, but with significantly worse RMSD values than S1 (c). With S3, pseudomirror images (d) of the BPTI fold occurred several
times and only one model out of 50 was correctly folded (not shown).

Fig. 3. Ab initio modelling of BPTI from sparse restraints. Representative models from different restraint sets (S1, S2, S3), with optional SS dihedral restraints, are shown in ribbon representation, coloured in red/yellow (helices), cyan (strands) and grey (loops). Models are best-fitted on the Cα atoms of residues 3-53 to the representative conformation of the NMR high-resolution structure (PDB code: 1pit) (green), except for the S3+SS set, where the superposition with the β-sheet only is shown, to better illustrate the pseudo-mirroring effect, although RMSD values are calculated on the same 3-53 residue range as for the other models.

These results suggest a strong dependence of the outcome upon both the exact nature of the experimental data used in structure determination, and the protocol followed for model building. Thus, the number of restraints estimated in the aforementioned studies as necessary/sufficient for a reliable structural prediction should be prudently interpreted for practical purposes. If a proper protocol is adopted, increasing the quantity, quality and distribution homogeneity of the data should decrease this dependence, but the problem still remains severe when using very sparse restraints, such as those associated with many MS3D applications. A careful validation of the models and, possibly, the execution of additional modelling
cycles with variations in different protocol parameters, can help to identify and solve these
kinds of problems.
However, in spite of these potential issues, ab initio MS3D can provide precious insight into systems that are impossible to study with other structural methods. In addition to increases in the amount of experimental data, homology-based information and other statistically-derived constraints can also substantially increase the reliability of MS3D
predictions. Thus, suitable combinations of experimental data, predictive methods and
computational approaches have allowed the modelling of many different proteins and
protein complexes spanning a wide range of sizes and complexity. The illustrative examples shown in Table 1 represent just a sample of the systems affordable with current computational MS3D techniques and a guideline to select possible approaches for different problem classes. The heterogeneity of the reported systems, data and methods, while suggesting the enormous potentialities of MS3D approaches, practically prevents any really meaningful critical comparison among methods, whose description in applicative papers is often incomplete. A standardization of MS3D computational methods is still far from being achieved, since it requires considerable computational effort to tackle the large number of strategies and parameters that should be tested in a truly exhaustive analysis. Furthermore, the extreme sensitivity of modelling with sparse data to constraint distribution, as seen in the example shown in Fig. 3, either introduces some degree of arbitrariness in comparative analyses, or makes them even more computationally intensive, by requiring the use of more subsets for each system setup to be sampled.
Advancements in MS3D experimental approaches continuously change the scenario for computational procedures, by substantially increasing the amount of data, as well as the types of crosslinking or labelling agents and proteolytic enzymes. The large numbers of crosslinks obtained for apolipoproteins (Silva et al., 2005; Tubb et al., 2008) or for the CopA copper ATPase (Lübben et al., 2009) represent good examples of these trends (Table 1).

5. Conclusion
As already stated in the preceding section, the comparative analysis of computational approaches involved in MS3D is still considerably limited, because of the complexity both of the systems to be investigated and of the methods themselves, especially when they are used in combination with restraints as sparse as those usually available in MS3D studies. The continuous development of all the experimental and computational techniques involved considerably accelerates the obsolescence of the results provided by any accurate methodological analysis, thus representing a further disincentive to these usually very time-consuming studies. In this view, rather than strict prescriptions, detailed recipes or sharp critical comparative analyses of available approaches, this study was meant to provide an
overall and as wide as possible picture of state-of-the-art approaches in MS3D computational techniques and their potential application fields. However, in spite of these limitations, some general conclusions can still be drawn.

Table 1. Some examples of MS3D studies from the literature. The following abbreviations have been used (in bold, non-standard abbreviations): (a) XL: crosslinking, CL: chemical labelling, PL: photoaffinity labelling, LP: limited proteolysis, AA: alkylation analyses; (b) see (a), plus PSF: post-sampling filtering with experimental data, IIS: experimental data integrated in sampling; (c) XR: X-ray crystallography; (d) SSp: secondary structure prediction; (e) see (a) and (b), plus HM: homology modelling, aiM: ab initio modelling, FP: fold prediction, MD: molecular dynamics, SA: simulated annealing, EM: energy minimization, PPD: protein-protein docking, PLD: protein-small ligand docking; DR: distance restraints, TM: trans-membrane; NA: not available.
For the predictive methods that lie behind the most ambitious MS3D applications (ab initio folding, fold prediction, threading), at least when used in the absence of experimental data, metaservers exhibit on average better performance than the individual servers employed, as also shown by the results of the last CASP rounds on automatic servers (http://predictioncenter.org/). This suggests two distinct considerations: 1) the accuracy of sampling and scoring exhibited by each single method, as well as the rationale behind them, are still too limited to allow reliable predictions of the best performing method in any given case; 2) nevertheless, most methods tend to locate correct solutions, or, in general, groups of
solutions including the correct one or a close analogue. Therefore, a consensus among the
predictions from different servers generally improves the final solutions, by smoothing
down both extreme results and random fluctuations associated with each single approach.
Well consolidated metaservers, such as Robetta or I-TASSER, can be regarded as reasonable
starting guesses for general folding problems, also considering that they both include
distance-related restraints in their available options. However, special classes of systems
(e.g. transmembrane proteins or several enzyme families) can instead benefit from
employing specifically-devised approaches.
In comparing server-based applications to standalone programs (often available as alternatives for a given approach), potential users should also consider that the former
require less computational skill and resources, but are intrinsically less flexible than the
latter, and that legal and secrecy issues may arise, because several servers consider
submitted prediction requests and the corresponding results as public data, as usually
clearly stated in submission pages. In addition to possible information “leakage” in projects,
the public status of the models would prevent their use in patents.
When considering more specifically MS3D procedures, it has been shown that even a small
number of MS-based restraints can significantly help in restricting the overall space to be
explored and in identifying the correct fold/complexation mode, especially if they are
introduced in early modelling stages of a computational procedure optimized to deal with
both the investigated system and the available data. Thus, experimental restraints can allow
the use of a single model generation procedure, rather than a multiple/metaserver
approach, at least in non-critical cases. In fact, they should filter out all wrong solutions
deriving from the biases of the modelling method, leaving only those close to the “real” one,
if it is included in the sampled set. In particular, since the lowest energy structure should
ideally also be associated with a minimum violation of experimentally-derived restraints,
the coincidence of minimum energy structures with least violated restraints should be
suggestive of correct modelling convergence and evaluation of experimental data. However,
particular care must be taken not only in the choice of the overall computational procedure, but especially of the protocol used to introduce experimental information, because a too abrupt build-up of the restraints can easily lead to local minima far from the correct solution. Comparison of proper scoring functions other than energy between experimentally-restrained and unrestrained solutions may provide significant help in identifying potential issues in data or protocols. Estimates of the sensitivity of solutions to changes in protocols may also reinforce the reliability of the best converged cases. In particular,
when other restraints are also present, the relative strength and/or introduction order of the
different sets could play an important role in the final result; thus, their weight should be
carefully evaluated by performing more modelling runs with different setups.
When evaluating the overall modelling procedures, their corresponding caveats and
performance issues, the importance of many details in setup and validation of MS3D
computational procedures fully emerges, thus suggesting that they still require considerable human skill, although many fully automated programs and servers in principle make MS3D protocols accessible even to inexperienced users. This is also demonstrated
for the pure ab initio modelling stage by the still superior performances obtained by human-
guided predictions in CASP rounds, when compared to fully automated servers.
Future improvements in MS3D are expected as a natural consequence of continuous
development in biochemical/MS techniques for experimental data, and in hardware/
software for molecular simulations and predictive methods. However, more specific, less
expensive and possibly quicker advances in MS3D could be propelled by the targeted
development of computational approaches more directly related to the real nature of the
experimental information on which MS3D is based, notably algorithms implementing
surface-dependent contributions and more faithful representations of crosslinkers than
straight distance restraints.

6. References
Aebersold, R. & Mann, M. (2003). Mass spectrometry-based proteomics. Nature, Vol.422, pp.
198–207.
Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. & Lipman, D.J.
(1997). Gapped BLAST and PSI-BLAST: A new generation of protein database
search programs. Nucleic Acids Research, Vol.25, pp.3389–3402.
Aszodi, A.; Gradwell, M.J. & Taylor, W.R.(1995). Global fold determination from a small
number of distance restraints. Journal of Molecular Biology Vol.251, pp.308–326.
Aszodi, A.; Munro, R.E. & Taylor, W.R.(1997). Protein modeling by multiple sequence
threading and distance geometry. Proteins, Vol. 29, pp.38–42.
Back, J.W.; de Jong, L.; Muijsers, A.O. & de Koster, C.G. (2003). Chemical crosslinking and
mass spectrometry for protein structural modeling. Journal of Molecular Biology,
Vol.331,pp.303–313.
Back, J.W.; Sanz, M.A.; De Jong, L.; De Koning, L.J.; Nijtmans, L.G.; De Koster, C.G.; Grivell,
L.A.; Van Der Spek, H. & Muijsers, A.O.(2002). A structure for the yeast prohibitin
complex: Structure prediction and evidence from chemical crosslinking and mass
spectrometry. Protein Science, Vol. 11, pp.2471–2478.
Balasu, M.C.; Spiridon, L.N.; Miron, S. ; Craescu, C.T.; Scheidig, A.J., Petrescu, A.J. &
Szedlacsek, S.E. (2009). Interface Analysis of the Complex between ERK2 and PTP-
SL. Plos one, Vol. 4, pp. e5432.
Bastard, K.; Prévost, C. & Zacharias, M. (2006). Accounting for loop flexibility during
protein-protein docking. Proteins, Vol.62, pp. 956-969.
Ben-Zeev, E. & Eisenstein, M. (2003). Weighted geometric docking: incorporating external
information in the rotation-translation scan. Proteins, Vol.52, pp. 24-27.
Berndt, K.D.; Güntert, P.; Orbons, L.P. & Wüthrich, K. (1992). Determination of a high-
quality nuclear magnetic resonance solution structure of the bovine pancreatic
trypsin inhibitor and comparison with three crystal structures. Journal of Molecular
Biology, Vol.227, pp.757-775.
Blake, J.D. & Cohen, F.E. (2001). Pairwise sequence alignment below the twilight zone.
Journal of Molecular Biology, Vol. 307, pp. 721-735.
Bowers, P.M.; Strauss, C.E.M. & Baker, D. (2000). De novo protein structure determination
using sparse NMR data. Journal of Biomolecular NMR, Vol.18, pp.311–318.
Brazas, M.D.; Yamada, J.T. & Ouellette, B.F.F. (2010). Providing web servers and training in
Bioinformatics: 2010 update on the Bioinformatics Links Directory. Nucleic Acids
Research, Vol. 38, pp.W3–W6.
Brooks, B.R.; Bruccoleri, R.E.; Olafson, B.D.; States, D.J.; Swaminathan, S. & Karplus,
M. (1983). CHARMM: A Program for Macromolecular Energy, Minimization, and
Dynamics Calculations. Journal of Computational Chemistry, Vol.4, pp.187-217.
Camacho, C. J. & Vajda, S. (2001). Protein docking along smooth association pathways.
PNAS USA, Vol.98, pp.10636–10641.
Carlsohn, E.; Ångström, J. ; Emmett, M.R.; Marshall, A.G. & Nilsson, C.L. (2004). Chemical
cross-linking of the urease complex from Helicobacter pylori and analysis by
Fourier transform ion cyclotron resonance mass spectrometry and molecular
modeling. International Journal of Mass Spectrometry, Vol.234, pp. 137–144.
Chu, F.; Shan, S.; Moustakas, D.T.; Alber, F.; Egea, P.F.; Stroud, R.M.; Walter, P. &
Burlingame A.L. (2004). Unraveling the interface of signal recognition particle and
its receptor by using chemical cross-linking and tandem mass spectrometry. PNAS,
Vol.101, pp. 16454-16459.
D’Ambrosio, C.; Talamo, F.; Vitale, R.M.; Amodeo, P.; Tell, G.; Ferrara, L. & Scaloni, A.
(2003). Probing the Dimeric Structure of Porcine Aminoacylase 1 by Mass
Spectrometric and Modeling Procedures. Biochemistry, Vol. 42, pp. 4430-4443.
de Bakker, P.I.; Furnham, N.; Blundell, T.L. & DePristo, M.A. (2006). Conformer generation
under restraints. Current Opinion in Structural Biology, Vol. 16, pp.160–165.
Dimova, K; Kalkhof, S.; Pottratz, I.; Ihling, C.; Rodriguez-Castaneda, F.; Liepold, T.;
Griesinger, C.; Brose, N.; Sinz, A. & Jahn, O. (2009). Structural Insights into the
Calmodulin-Munc13 Interaction Obtained by Cross-Linking and Mass
Spectrometry. Biochemistry, Vol.48, pp. 5908-5921.
Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, Vol.14, pp.755–763.
Fabris, D. & Yu, E.T. (2010). The collaboratory for MS3D: a new cyberinfrastructure for the
structural elucidation of biological macromolecules and their assemblies using
mass spectrometry-based approaches. Journal of Proteome Research, Vol.7, pp. 4848-
4857.
Fiser, A. & Sali, A. (2003). Modeller: generation and refinement of homology base protein
structure models. Methods in Enzymology, Vol. 374, pp.461–491.
Förster, F.; Webb, B.; Krukenberg, K.A.; Tsuruta, H.; Agard, D.A. & Sali A.(2008).
Integration of Small-Angle X-Ray Scattering Data into Structural Modeling of
Proteins and Their Assemblies. Journal of Molecular Biology, Vol.382, pp.1089–
1106.
Friedhoff, P. (2005). Mapping protein–protein interactions by bioinformatics and
crosslinking. Analytical & Bioanalytical Chemistry, Vol.381, pp.78–80.
Giron-Monzon, L.; Manelyte, L.; Ahrends, R.; Kirsch, D.; Spengler, B. & Friedhoff, P. (2004).
Mapping Protein-Protein Interactions between MutL and MutH by Cross-linking.
The Journal of Biological Chemistry, Vol.279, pp. 49338–49345.
Gray, J.J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C.A. & Baker, D.
(2003). Protein-protein docking with simultaneous optimization of rigid-body
displacement and side-chain conformations. Journal of Molecular Biology, Vol.331,
pp.281-299.
Green, N.S.; Reisler, E. & Houk, K.N. (2001). Quantitative evaluation of the lengths of
homobifunctional protein cross-linking reagents used as molecular rulers. Protein
Science, Vol.10, pp.1293-1304.
Grintsevich, E.E.; Benchaar, S.A.; Warshaviak, D.; Boontheung, P.; Halgand, F.; Whitelegge,
J.P.; Faull, K.F.; Ogorzalek Loo, R.R; Sept, D.; Loo, J.A. & Reisler, E. (2008). Mapping
the Cofilin Binding Site on Yeast G-Actin by Chemical Cross-Linking. Journal of
Molecular Biology, Vol.377, pp. 395-409.
Güntert, P.; Mumenthaler, C. & Wüthrich, K. (1997). Torsion angle dynamics for NMR
structure calculation with the new program Dyana. Journal of Molecular Biology, Vol.
273, pp. 283–298.
Havel, T.F.; Kuntz, I.D. & Crippen, G.M.(1983). The combinatorial distance geometry
method for the calculation of molecular conformation. I. A new approach to an old
problem. Journal of Theoretical Biology, Vol. 310, pp.638–642.
Jaroszewski, L.; Rychlewski, L.; Li, Z.; Li, W. & Godzik, A. (2005). FFAS03: a server for
profile– profile sequence alignments. Nucleic Acids Research, Vol.33, pp.W284–288.
Karplus, K.; Barrett, C. & Hughey R. (1998). Hidden Markov models for detecting remote
protein homologies. Bioinformatics, Vol.14, pp.846–856.
Kessl, J.J.; Eidahl, J.O.; Shkriabai, N.; Zhao, Z.; McKee, C.J.; Hess, S.; Burke, T.R. Jr &
Kvaratskhelia, M. (2009). An allosteric mechanism for inhibiting HIV-1 integrase
with a small molecule. Molecular Pharmacology, Vol. 76, pp.824–832.
Kirkpatrick, S.; Gelatt, C.D. Jr. & Vecchi, M.P. (1983). Optimization by Simulated Annealing.
Science, Vol. 220, pp. 671-680.
Latek, D.; Ekonomiuk, D. & Kolinski , A.(2007). Protein structure prediction: combining de
novo modeling with sparse experimental data. Journal of Computational Chemistry,
Vol. 28, pp.1668–1676.
Leitner, A.; Walzthoeni, T.; Kahraman, A.; Herzog, F.; Rinner, O.; Beck, M. & Aebersolda, R.
(2010). Probing Native Protein Structures by Chemical Cross-linking, Mass
Spectrometry, and Bioinformatics. Molecular & Cellular Proteomics, Vol.24, pp. 1634-
1649.
Lin, M.; Lu, H.M.; Chen, R. & Liang, J. (2008). Generating properly weighted ensemble
of conformations of proteins from sparse or indirect distance constraints. The
Journal of Chemical Physics, Vol.129, pp.094101–094114.
Lübben, M.; Portmann, R.; Kock, G.; Stoll, R.; Young, M.M. & Solioz, M. (2009). Structural
model of the CopA copper ATPase of Enterococcus hirae based on chemical cross-
linking. Biometals, Vol.22, pp. 363-375.
Marti-Renom, M.A.; Madhusudhan, M.S. & Sali, A. (2004). Alignment of protein sequences
by their profiles. Protein Science, Vol.13, pp.1071–1087.
Mathiowetz, A.M.; Jain, A.; Karasawa, N. & Goddard, W.A. III. (1994). Protein simulation
using techniques suitable for very large systems: The cell multipole method for
nonbond interactions and the Newton–Euler inverse mass operator method for
internal coordinate dynamics. Proteins, Vol. 20, pp. 227–247.
Melo, F. & Sali, A. (2007). Fold assessment for comparative protein structure modeling.
Protein Science, Vol. 16, pp. 2412–2426.
Millevoi, S.; Thion, L.; Joseph, G.; Vossen, C.; Ghisolfi-Nieto, L. & Erard, M. (2001). Atypical
binding of the neuronal POU protein N-Oct3 to noncanonical DNA targets.
Implications for heterodimerization with HNF-3b. European Journal of Biochemistry,
Vol.268, pp. 781-791.
Moreira, I.S.; Fernandes, P.A. & Ramos, M.J. (2010). Protein-protein docking dealing with
the unknown. Journal of Computational Chemistry, Vol. 31, pp.317–342.
Mouradov, D.; Craven, A.; Forwood, J.K.; Flanagan, J.U.; García-Castellanos, R.; Gomis-
Rüth, F.X.; Hume, D.A.; Martin, J.L.; Kobe, B. & Huber, T. (2006). Modelling the
structure of the latexin–carboxypeptidase A complex based on chemical cross-
linking and molecular docking. Protein Engineering, Design & Selection, Vol.19, pp.
9-16.
Nikolova, L.; Soman, K. ; Nichols, J.C.; Daniel, D.S., Dickey, B.F. & Hoffenberg, S.
(1998). Conformationally variable Rab protein surface regions mapped by
limited proteolysis and homology modelling. Biochemical Journal, Vol.336, pp.
461–469.
Nilges, M.; Clore, G.M. & Gronenborn, A.M.(1988a). Determination of three dimensional
structures of proteins from interproton distance data by hybrid distance
geometry-dynamical simulated annealing calculations. FEBS Letters, Vol.229,
pp.317–324.
Nilges, M.; Gronenborn, A.M.; Brünger, A.T. & Clore, G.M. (1988b). Determination of
three- dimensional structures of proteins by simulated annealing with
interproton distance restraints: application to crambin, potato carboxypeptidase
inhibitor and barley serine proteinase inhibitor 2. Protein Engineering, Vol.2,
pp.27-38.
Nymeyer, H.; Gnanakaran, S. & García, A.E. (2004). Atomic simulations of protein
folding using the replica exchange algorithm. Methods in Enzymology, Vol.383,
pp.111-149.
Papa, S.; Monti, S.M.; Vitale, R.M.; Bubici, C.; Jayawardena, S.; Alvarez, K.; De Smaele, E.;
Dathan, N.; Pedone, C.; Ruvo M. & Franzoso, G. (2007). Insights into the structural
basis of the GADD45beta-mediated inactivation of the JNK kinase, MKK7/JNKK2.
Journal of Biological Chemistry, Vol. 282, pp. 19029-19041.
Potluri, S.; Khan, A.A.; Kuzminykh, A.; Bujnicki, J.M., Friedman, A.M. & Bailey-Kellogg, C.
(2004). Geometric Analysis of Cross-Linkability for Protein Fold Discrimination.
Pacific Symposium on Biocomputing, Vol.9, pp.447-458.
Renzone, G.; Salzano, A.M.; Arena, S.; D’Ambrosio, C. & Scaloni, A.(2007a). Mass
Spectrometry-Based Approaches for Structural Studies on Protein Complexes at
Low-Resolution. Current Proteomics, Vol. 4, pp. 1-16.
Renzone, G.; Vitale, R.M.; Scaloni, A.; Rossi, M., Amodeo, P. & Guagliardi A. (2007b).
Structural Characterization of the Functional Regions in the Archaeal Protein
Sso7d. Proteins: Structure, Function, and Bioinformatics, Vol. 67, pp. 189-197.
Rice, L.M & Brünger, A.T. (1994). Torsion angle dynamics: Reduced variable conformational
sampling enhances crystallographic structure refinement. Proteins, Vol. 19, pp. 277–
290.
Russell, R.B.; Alber, F.; Aloy, P.; Davis, F.P.; Korkin, D.;Pichaud, M; Topf, M. & Sali, A.
(2004). A structural perspective on protein-protein interactions. Current Opinion in
Structural Biology, Vol.14, pp. 313-324.
Scaloni, A; Miraglia, N.; Orrù, S.; Amodeo, P.; Motta, A.; Marino, G. & Pucci, P.(1998).
Topology of the calmodulin-melittin complex. Journal of Molecular Biology, Vol. 277,
pp.945–958.
Schrag, J.D.; Jiralerspong, S.; Banville, M; Jaramillo, M.L. & O'Connor-McCourt, M.D. (2007).
The crystal structure and dimerization interface of GADD45gamma. PNAS, Vol.
105, pp. 6566-6571.
Schueler-Furman, O.; Wang, C.; Bradley, P.; Misura, K. & Baker, D. (2005). Progress in
modeling of protein structures and interactions. Science, Vol. 310, pp.638–642.
Schulz,D.M.; Kalkhof, S.; Schmidt, A.; Ihling, C.; Stingl, C.; Mechtler, K.; Zschörnig, O &
Sinz, A. (2007). Annexin A2/P11 interaction: New insights into annexin A2
tetramer structure by chemical crosslinking, high-resolution mass spectrometry,
and computational modeling. Proteins: Structure Function & Bioinformatics, Vol.69,
pp. 254-269.
Schwieters, C.D. & Clore, G.M. (2001). Internal Coordinates for Molecular Dynamics and
Minimization in Structure Determination and Refinement. Journal of Magnetic
Resonance, Vol. 152, pp.288–302.
Silva, R.A.G.D.; Hilliard, G.M.; Fang, J.; Macha, S. & Davidson, W.S. (2005). A Three-
Dimensional Molecular Model of Lipid-Free Apolipoprotein A-I Determined by
Cross-Linking/Mass Spectrometry and Sequence Threading. Biochemistry, Vol.44,
pp. 2759-2769.
Singh, P.; Panchaud, A. & Goodlett, D.R. (2010) Chemical Cross-Linking and Mass
Spectrometry As a Low-Resolution Protein Structure Determination Technique.
Analytical Chemistry, Vol. 82, pp. 2636–2642
Sinz, A. (2006). Chemical cross-linking and mass spectrometry to map three dimensional
protein structures and protein-protein interactions. Mass Spectrometry Reviews,
Vol.25, pp. 663-682.
Skolnick, J.; Kolinski, A. & Ortiz, A.R. (1997). MONSSTER: A Method for Folding Globular
Proteins with a Small Number of Distance Restraints. Journal of Molecular Biology,
Vol. 265, pp. 217-241.
Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics,
Vol.21, pp.951–960.
Stein, E.G.; Rice, L.M & Brünger, A.T. (1997). Torsion-angle molecular dynamics as a new
efficient tool for NMR structure calculation. Journal of Magnetic Resonance, Vol. 124,
pp. 154–164.
Tubb, M.R.; Silva, R.A.G.D.; Fang, J.; Tso, P. & Davidson, W.S. (2008). A Three-dimensional
Homology Model of Lipid-free Apolipoprotein A-IV Using Cross-linking and Mass
Spectrometry. The Journal of Biological Chemistry, Vol.283, pp. 17314-17323.
Vaidehi, N., Jain, A. & Goddard, W.A. III (1996). Constant temperature constrained
molecular dynamics: The Newton–Euler inverse mass operator method. Journal of
Physical Chemistry, Vol. 100, pp. 10508–10517.
Van Dijk, A.D.J.; Boelens, R. & Bonvin, A.M.J.J. (2005). Data-driven docking for the study of
biomolecular complexes. FEBS Journal, Vol.272, pp.293–312.
Young, M.M.; Tang, N.; Hempel, J.C.; Oshiro, C.M.; Taylor, E.W.; Kuntz, I.D.; Gibson, B.W.
& Dollinger, G. (2000). High throughput protein fold identification by using
experimental constraints derived from intramolecular cross-links and mass
spectrometry. PNAS, Vol.97, pp. 5802-5806.
Zheng, X.; Wintrode, P.L. & Chance M.R. (2007). Complementary Structural Mass
Spectrometry Techniques Reveal Local Dynamics in Functionally Important
Regions of a Metastable Serpin. Structure, Vol.16, pp. 38-51.
8

Synthetic Biology & Bioinformatics Prospects in the Cancer Arena

Lígia R. Rodrigues and Leon D. Kluskens
IBB – Institute for Biotechnology and Bioengineering, Centre of Biological Engineering,
University of Minho, Campus de Gualtar, Braga, Portugal

1. Introduction
Cancer is the second leading cause of mortality worldwide, with an expected 1.5-3.0 million
new cases and 0.5-2.0 million deaths in 2011 for the US and Europe, respectively (Jemal et
al., 2011). Hence, this is an enormously important health risk, and progress leading to
enhanced survival is a global priority. Strategies that have been pursued over the years
include the search for new biomarkers, drugs or treatments (Rodrigues et al., 2007).
Synthetic biology together with bioinformatics represents a powerful tool towards the
discovery of novel biomarkers and the design of new biosensors.
Traditionally, the majority of new drugs have been generated from compounds derived from
natural products (Neumann & Neumann-Staubitz, 2010). However, advances in genome
sequencing, together with the possibility of manipulating biosynthetic pathways, constitute
important resources for screening and designing new drugs (Carothers et al., 2009).
Furthermore, the development of rational approaches through the use of bioinformatics for
data integration will enable the understanding of mechanisms underlying the anti-cancer
effect of such drugs (Leonard et al., 2008; Rocha et al., 2010).
Besides biomarker development and the production of novel drugs, synthetic biology can
also play a crucial role at the level of specific drug targeting. Cells can be engineered to
recognize specific targets or conditions in our bodies that are not naturally recognized by
the immune system (Forbes, 2010).
Synthetic biology is the use of engineering principles to create, in a rational and systematic
way, functional systems based on the molecular machines and regulatory circuits of living
organisms or to re-design and fabricate existing biological systems (Benner & Sismour,
2005). The focus is often on ways of taking parts of natural biological systems, characterizing
and simplifying them, and using them as a component of a highly unnatural, engineered,
biological system (Endy, 2005). In principle, synthetic biology can provide solutions for unmet
needs of humankind, namely in the field of drug discovery. Indeed,
synthetic biology tools enable the elucidation of disease mechanisms, identification of
potential targets, discovery of new chemotherapeutics or design of novel drugs, as well as
the design of biological elements that recognize and target cancer cells. Furthermore,
through synthetic biology it is possible to develop economically attractive microbial
production processes for complex natural products.
Bioinformatics is used in drug target identification and validation, and in the development
of biomarkers and tools to maximize the therapeutic benefit of drugs. Now that data on
cellular signalling pathways are available, integrated computational and experimental
projects are being developed, with the goal of enabling in silico pharmacology by linking the
genome, transcriptome and proteome to cellular pathophysiology. Furthermore,
sophisticated computational tools are being developed that enable the modelling and design
of new biological systems. A key component of any synthetic biology effort is the use of
quantitative models (Arkin, 2001). These models and their corresponding simulations enable
optimization of a system design, as well as guiding their subsequent analysis. Dynamic
models of gene regulatory and reaction networks are essential for the characterization of
artificial and synthetic systems (Rocha et al., 2008). Several software tools and standards
have been developed in order to facilitate model exchange and reuse (Rocha et al., 2010).
In this chapter, synthetic biology approaches for cancer diagnosis and drug development
will be reviewed. Specifically, examples on the design of RNA-based biosensors, bacteria
and virus as anti-cancer agents, and engineered microbial cell factories for the production of
drugs, will be presented.

2. Synthetic biology: tools to design, build and optimize biological processes


Synthetic biology uses biological insights combined with engineering principles to design
and build new biological functions and complex artificial systems that do not occur in
Nature (Andrianantoandro et al., 2006). The building blocks used in synthetic biology are
the components of molecular biology processes: promoter sequences, operator sequences,
ribosome binding sites (RBS), termination sites, reporter proteins, and transcription factors.
Examples of such building blocks are given in Table 1.
Major developments in DNA synthesis technologies have opened new perspectives for the
design of very large and complex circuits (Purnick & Weiss, 2009), making it now affordable
to synthesize a given gene instead of cloning it. It is possible to synthesize de novo a small
virus (Mueller et al., 2009), to replace the genome of one bacterium by another (Lartigue et
al., 2007) and to make large chunks of DNA coding for elaborate genetic circuits. Software
tools to simulate large networks and the entire panel of omics technologies to analyze the
engineered microorganism are available (for details see section 3). Finally, repositories of
biological parts (e.g. the Registry of Standard Biological Parts (http://partsregistry.org/)) will
increase the complexity, number and reliability of the circuits available for different species.
Currently, the design and synthesis of biological systems are not decoupled. For example,
the construction of metabolic pathways or any circuit from genetic parts first requires a
collection of well characterized parts, which do not yet fully exist. Nevertheless, this
limitation is being addressed through the development and compilation of standard
biological parts (Kelly et al., 2009). When designing individual biological parts, the base-by-
base content of that part (promoter, RBS, protein coding region, terminator, among others) is
explicitly dictated (McArthur IV & Fong, 2010). Rules and guidelines for designing genetic
parts at this level are being established (Canton et al., 2008). Particularly, an important issue
when designing protein-coding parts is codon optimization, encoding the same amino acid
sequence with an alternative, preferred nucleotide sequence. Although a particular
sequence, when expressed, may be theoretically functional, its expression may be far from
optimal or even completely suppressed due to codon usage bias in the heterologous host.
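As a toy illustration of the codon-optimization step just mentioned, the following Python sketch back-translates a protein sequence using a single "preferred" codon per amino acid; the codon table is a simplified, hypothetical stand-in for the measured usage frequencies and additional constraints handled by the dedicated design tools discussed below.

    # Hypothetical, much-simplified table: one preferred codon per amino acid
    # for an E. coli-like host (real tools weight codons by usage frequencies).
    PREFERRED_CODON = {
        "M": "ATG", "W": "TGG", "F": "TTT", "L": "CTG", "I": "ATT",
        "V": "GTG", "S": "AGC", "P": "CCG", "T": "ACC", "A": "GCG",
        "Y": "TAT", "H": "CAT", "Q": "CAG", "N": "AAC", "K": "AAA",
        "D": "GAT", "E": "GAA", "C": "TGC", "R": "CGT", "G": "GGC",
        "*": "TAA",
    }

    def back_translate(protein_seq):
        """Return a coding sequence built from the host-preferred codons."""
        return "".join(PREFERRED_CODON[aa] for aa in protein_seq.upper())

    print(back_translate("MKLV*"))  # -> ATGAAACTGGTGTAA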
Genetic part | Examples | Rationale

Transcriptional control
Constitutive promoters | lacIq, SV40, T7, sp6 | “Always on” transcription
Regulatory regions | tetO, lacO, ara, gal4, rhl box | Repressor and activator sites
Inducible promoters | ara, ethanol, lac, gal, rhl, lux, fdhH, sal, glnK, cyc1 | Control of the promoter by induction or by cell state
Cell fate regulators | GATA factors | Control cell differentiation

Translational control
RNA interference (RNAi) | Logic functions, RNAi repressor | Genetic switch, logic evaluation and gene silencing
Riboregulators | Ligand-controlled ribozymes | Switches for detection and actuation
Ribosome binding site mutants | Kozak consensus sequence | Control the level of translation

Post-translational control
Phosphorylation cascades | Yeast phosphorylation pathway | Modulate genetic circuit behavior
Protein receptor design | TNT, ACT and EST receptors | Control detection thresholds and combinatorial protein function
Protein degradation | Ssra tags, peptides rich in Pro, Glu, Ser and Thr | Protein degradation at varying rates
Localization signals | Nuclear localization, nuclear export and mitochondrial localization signals | Import or export from nucleus and mitochondria

Others
Reporter genes | GFP, YFP, CFP, LacZ | Detection of expression
Antibiotic resistance | ampicillin, chloramphenicol | Selection of cells
Table 1. Genetic elements used as components of synthetic regulatory networks (adapted
from McArthur IV & Fong, 2010 and Purnick & Weiss, 2009). Legend: CFP, cyan fluorescent
protein; GFP, green fluorescent protein; YFP, yellow fluorescent protein.
Codon optimization of coding sequences can be achieved using freely available algorithms
such as Gene Designer (see section 3). Besides codon optimization, compliance with
standard assembly requirements and part-specific objectives including activity or specificity
modifications should be considered. For example, the BioBrick methodology requires that
parts exclude four standard restriction enzyme sites, which are reserved for use in assembly
(Shetty et al., 2008). Extensive collections of parts can be generated by using a naturally
occurring part as a template and rationally modifying it to create a library of that particular
genetic part. Significant progress in this area has been recently demonstrated for promoters
and RBS (Ellis et al., 2009; Salis et al., 2009). Ellis and co-workers (2009) constructed two
promoter libraries that can be used to tune network behavior a priori by fitting mathematical
promoter models with measured parameters. By using this model-guided design approach
the authors were able to limit the variability of the system and increase predictability.
However, it is well-known that noisy or leaky promoters can complicate the system design.
In these cases a finer control over expression can be established by weakening the binding
strength of the downstream gene (Ham et al., 2006), or by using two promoter inputs to
drive transcription of an output via a modular AND gate (Anderson et al., 2006).
Additionally, modular and scalable RNA-based devices (aptamers, ribozymes, and
transmitter sequences) can be engineered to regulate gene transcription or translation (Win
& Smolke, 2007).
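The logic of such a two-input AND gate can be sketched with a simple phenomenological model in Python; the Hill-type activation terms and all parameter values below are hypothetical and serve only to show that appreciable output is produced when, and only when, both inducer inputs are present.

    def hill_activation(x, k=1.0, n=2.0):
        """Simple Hill activation term, ranging from 0 to 1."""
        return x ** n / (k ** n + x ** n)

    def and_gate_output(inducer_a, inducer_b, max_rate=100.0):
        """Toy transcriptional AND gate: output requires both inducers (arbitrary units)."""
        return max_rate * hill_activation(inducer_a) * hill_activation(inducer_b)

    for a in (0.0, 10.0):
        for b in (0.0, 10.0):
            print(f"A={a:4.1f}  B={b:4.1f}  output={and_gate_output(a, b):6.1f}")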
Design at the pathway level is not only concerned with including the necessary parts, but
also with controlling the expressed functionality of those parts. Parts-based synthetic
metabolic pathways will require tunable control, just as their natural counterparts which
often employ feedback and feed-forward motifs to achieve complex regulation (Purnick &
Weiss, 2009). Using a synthetic biology approach, the design of DNA sequences encoding
metabolic pathways (e.g. operons) should be relatively straightforward. Synthetic scaffolds
and well-characterized families of regulatory parts have emerged as powerful tools for
engineering metabolism by providing rational methodologies for coordinating control of
multigene expression, as well as decoupling pathway design from construction (Ellis et al.,
2009). Pathway design should not overlook the fact that exogenous pathways interact with
native cellular components and have their own specific energy requirements. Therefore,
modifying endogenous gene expression may be necessary in addition to balancing cofactor
fluxes and installing membrane transporters (Park et al., 2008).
After designing parts, circuits or pathways, the genomic constructs ought to be
manufactured through DNA synthesis. Nucleotide’s sequence information can be
outsourced to synthesis companies (e.g. DNA2.0, GENEART or Genscript, among others).
The convenience of this approach over traditional cloning allows for the systematic
generation of genetic part variants such as promoter libraries. Also, it provides a way to
eliminate restriction sites or undesirable RNA secondary structures, and to perform codon
optimization. The ability to make large changes to DNA molecules has resulted in
standardized methods for assembling basic genetic parts into larger composite devices,
which facilitate part-sharing and faster system-level construction, as demonstrated by the
BioBrick methodology (Shetty et al., 2008) and the Gateway cloning system (Hartley, 2003).
Other approaches based on type II restriction enzymes, such as Golden Gate Shuffling,
provide ways to assemble many more components together in one step (Engler et al., 2009).
A similar one-step assembly approach, circular polymerase extension cloning (CPEC),
avoids the need for restriction-ligation, or single-stranded homologous recombination
altogether (Quan & Tian, 2009). Not only is this useful for cloning single genes, but also for
assembling parts into larger sequences encoding entire metabolic pathways and for
generating combinatorial part libraries. On a chromosomal level, disruption of genes in
Escherichia coli and other microorganisms has become much faster with the development of
RecBCD and lambda RED-assisted recombination systems (Datsenko & Wanner, 2000),
allowing the insertion, deletion or modification by simply using linear gene fragments.
Additionally, multiplex automated genome engineering (MAGE) has been introduced as
another scalable, combinatorial method for producing large-scale genomic diversity (Wang
et al., 2009). This approach makes chromosomal modification easier by simultaneously
mutating target sites across the chromosome. Plasmid-based expression and chromosomal
integration are the two common vehicles for implementing synthetic metabolic pathways.
Recently, the chemically inducible chromosomal evolution (CIChE) was proposed as a long-
term expression alternative method (Tyo et al., 2009). This new method avoids
complications associated with plasmid replication and segregation, and can be used to
integrate multiple copies of genes into the genome. All these techniques will provide
technical platforms for the rapid synthesis of parts and subsequent pathways.
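One simple, automatable check in this context is the screening of candidate parts for the restriction sites reserved by the BioBrick assembly standard mentioned above; the short Python sketch below (assuming the four commonly cited sites EcoRI, XbaI, SpeI and PstI) reports any internal occurrence that would have to be removed, for instance by synonymous codon changes, before assembly.

    # Recognition sequences of the four enzymes reserved by the BioBrick standard;
    # a compliant part must not contain them internally (assumed site list).
    FORBIDDEN_SITES = {
        "EcoRI": "GAATTC",
        "XbaI":  "TCTAGA",
        "SpeI":  "ACTAGT",
        "PstI":  "CTGCAG",
    }

    def biobrick_conflicts(part_seq):
        """Return a list of (enzyme, position) hits for forbidden sites."""
        seq = part_seq.upper()
        hits = []
        for enzyme, site in FORBIDDEN_SITES.items():
            start = seq.find(site)
            while start != -1:
                hits.append((enzyme, start))
                start = seq.find(site, start + 1)
        return hits

    print(biobrick_conflicts("ATGGAATTCAAACTGCAGTT"))  # -> [('EcoRI', 3), ('PstI', 12)]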
The majority of synthetic biology advances have been achieved purely in vitro (Isalan et
al., 2008), or in microorganisms, involving the design of small gene circuits that, although
scientifically very exciting, have no direct practical application. These studies have offered
fundamental insight into biological processes, like the role and sources of biological noise;
the existence of biological modules with defined properties; the dynamics of oscillatory
behavior; gene transcription and translation; or cell communication (Alon, 2003; Kobayashi
et al., 2004). An interesting example of a larger system that has been redesigned is the
refactoring of the T7 bacteriophage (Chan et al., 2005). Another successful example has been
the production of terpenoid compounds in E. coli (Martin et al., 2003) and Saccharomyces
cerevisiae (Ro et al., 2006) that can be used for the synthesis of artemisinin. Bacteria and fungi
have long been used in numerous industrial microbiology applications, synthesizing
important metabolites in large amounts. The production of amino acids, citric acid and
enzymes are examples of other products of interest, overproduced by microorganisms.
Genetic engineering of strains can contribute to the improvement of these production levels.
Altogether, the ability to engineer biological systems will enable vast progress in existing
applications and the development of several new possibilities. Furthermore, novel
applications can be developed by coupling gene regulatory networks with biosensor
modules and biological response systems. An extensive RNA-based framework has been
developed for engineering ligand-controlled gene regulatory systems, called ribozyme
switches. These switches exhibit tunable regulation, design modularity, and target
specificity and could be used, for example, to regulate cell growth (Win & Smolke, 2007).
Engineering interactions between programmed bacteria and mammalian cells will lead to
exciting medical applications (Anderson et al., 2006). Synthetic biology will change the
paradigm of the traditional approaches used to treat diseases by developing “smart”
therapies where the therapeutic agent can perform computation and logic operations and
make complex decisions (Andrianantoandro et al., 2006). There are also promising
applications in the field of living vectors for gene therapy and chemical factories (Forbes,
2010; Leonard et al., 2008).

3. Bioinformatics: a rational path towards biological behavior predictability


In order to evolve as an engineering discipline, synthetic biology cannot rely on endless trial
and error methods driven by verbal description of biomolecular interaction networks.
Genome projects identify the components of gene networks in biological organisms, gene
after gene, and DNA microarray experiments discover the network connections (Arkin,
2001). However, these data cannot adequately explain biomolecular phenomena or enable
rational engineering of dynamic gene expression regulation. The challenge is then to reduce
the amount and complexity of biological data into concise theoretical formulations with
predictive ability, ultimately associating synthetic DNA sequences to dynamic phenotypes.

3.1 Models for synthetic biology


The engineering process usually involves multiple cycles of design, optimization and
revision. This is particularly evident in the process of constructing gene circuits (Marguet et
al., 2007). Due to the large number of participating species and the complexity of their
interactions, it becomes difficult to intuitively predict a design behavior. Therefore, only
detailed modeling can allow the investigation of dynamic gene expression in a way fit for
analysis and design (Di Ventura et al., 2006). Modeling a cellular process can highlight
which experiments are likely to be the most informative in testing model hypotheses, and for
example allow testing for the effect of drugs (Di Bernardo et al., 2005) or mutant
phenotypes (Segre et al., 2002) on cellular processes, thus paving the way for individualized
medicine.
Data are the precursor to any model, and the need to organize as much experimental data as
possible in a systematic manner has led to several excellent databases as summarized in
Table 2. The term “model” can be used for verbal or graphical descriptions of a mechanism
underlying a cellular process, or refer to a set of equations expressing in a formal and exact
manner the relationships among variables that characterize the state of a biological system
(Di Ventura et al., 2006). The importance of mathematical modeling has been extensively
demonstrated in systems biology (You, 2004), although its utility in synthetic biology seems
even more dominant (Kaznessis, 2009).

Name | Website
BIND (Biomolecular Interaction Network Database) | http://www.bind.ca/
Brenda (a comprehensive enzyme information system) | http://www.brenda.uni-koeln.de/
CSNDB (Cell Signaling Networks Database) | http://geo.nihs.go.jp/csndb/
DIP (Database of Interacting Proteins) | http://dip.doe-mbi.ucla.edu/
EcoCyc/Metacyc/BioCyc (Encyclopedia of E. coli genes and metabolism) | http://ecocyc.org/
EMP (Enzymes and Metabolic Pathways Database) | http://www.empproject.com/
GeneNet (information on gene networks) | http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
Kegg (Kyoto Encyclopedia of Genes and Genomes) | http://www.genome.ad.jp/kegg/kegg.html
SPAD (Signaling Pathway Database) | http://www.grt.kyushu-u.ac.jp/eny-doc/
RegulonDB (E. coli K12 transcriptional network) | http://regulondb.ccg.unam.mx/
ExPASy-beta (Bioinformatics Resource Portal) | http://beta.expasy.org/
Table 2. Databases of molecular properties, interactions and pathways (adapted from Arkin,
2001).
Model-driven rational engineering of synthetic gene networks is possible at the level of
topologies or at the level of molecular components. In the first one, it is considered that
molecules control the concentration of other molecules, e.g. DNA-binding proteins regulate
the expression of specific genes by either activation or repression. By combining simple
regulatory interactions, such as negative and positive feedback and feed-forward loops, one
may create more complex networks that precisely control the production of protein
molecules (e.g. bistable switches, oscillators, and filters). Experimentally, these networks can
be created using existing libraries of regulatory proteins and their corresponding operator
sites. Examples of these models are the genetic toggle switch described by Gardner et al. (2000)
and the repressilator by Elowitz and Leibler (2000). In the second level, the kinetics and strengths of
molecular interactions within the system are described. By altering the characteristics of the
components, such as DNA-binding proteins and their corresponding DNA sites, one can
modify the system dynamics without modifying the network topology. Experimentally, the
DNA sequences that yield the desired characteristics of each component can be engineered
to achieve the desired protein-protein, protein-RNA, or protein-DNA binding constants and
enzymatic activities. For example, Alon and co-workers (2003) showed how simple
mutations in the DNA sequence of the lactose operon can result in widely different
phenotypic behavior.
Various mathematical formulations can be used to model gene circuits. At the population
level, gene circuits can be modeled using ordinary differential equations (ODEs). In an ODE
formulation, the dynamics of the interactions within the circuit are deterministic. That is, the
ODE formulation ignores the randomness intrinsic to cellular processes, and is convenient
for circuit designs that are thought to be less affected by noise or when the impact of noise is
irrelevant (Marguet et al., 2007). An ODE model facilitates further sophisticated analyses,
such as sensitivity analysis and bifurcation analysis. Such analyses are useful to determine
how quantitative or qualitative circuit behavior will be impacted by changes in circuit
parameters. For instance, in designing a bistable toggle switch, bifurcation analysis was
used to explore how qualitative features of the circuit may depend on reaction parameters
(Gardner et al., 2000). Results of the analysis were used to guide the choice of genetic
components (genes, promoters and RBS) and growth conditions to favor a successful
implementation of designed circuit function. However, in a single cell, the gene circuit’s
dynamics often involve small numbers of interacting molecules that will result in highly
noisy dynamics even for expression of a single gene. For many gene circuits, the impact of
such cellular noise may be critical and needs to be considered (Di Ventura et al., 2006). This
can be done using stochastic models (Arkin, 2001). Different rounds of simulation using a
stochastic model will lead to different results each time, which presumably reflect aspects of
noisy dynamics inside a cell. For synthetic biology applications, the key of such analysis is
not necessarily to accurately predict the exact noise level at each time point. This is not
possible even for the simplest circuits due to the “extrinsic” noise component for each circuit
(Elowitz et al., 2002). Rather, it is a way to determine to what extent the designed function
can be maintained and, given a certain level of uncertainty or randomness, to what extent
additional layers of control can minimize or exploit such variations. Independently of the
model that is used, these can be evolved in silico to optimize designs towards a given
function. As an example, genetic algorithms were used by Francois and Hakim (2004) to
design gene regulatory networks exhibiting oscillations.
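A minimal deterministic example of the ODE formulation discussed above is the dimensionless mutual-repression (toggle-switch-like) model sketched below in Python; the parameter values and the simple forward-Euler integration are illustrative only, but they reproduce the qualitative bistability exploited in the Gardner et al. (2000) design.

    def toggle_derivatives(u, v, alpha1=10.0, alpha2=10.0, beta=2.0, gamma=2.0):
        """Dimensionless mutual-repression model: each repressor inhibits the other."""
        du = alpha1 / (1.0 + v ** beta) - u
        dv = alpha2 / (1.0 + u ** gamma) - v
        return du, dv

    def simulate(u0, v0, dt=0.01, steps=5000):
        """Forward-Euler integration of the two ODEs (toy settings)."""
        u, v = u0, v0
        for _ in range(steps):
            du, dv = toggle_derivatives(u, v)
            u, v = u + dt * du, v + dt * dv
        return u, v

    # Two different initial conditions relax to the two opposite stable states.
    print(simulate(5.0, 0.1))   # high-u / low-v branch
    print(simulate(0.1, 5.0))   # low-u / high-v branch

A stochastic treatment of the same circuit would replace the continuous integration with discrete reaction events (e.g. a Gillespie-type simulation), so that repeated runs differ from one another, in line with the single-cell noise discussed above.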
In most attempts to engineer gene circuits, mathematical models are often purposefully
simplified to accommodate available computational power and to capture the qualitative
behavior of the underlying systems. Simplification is beneficial partially due to the limited
quantitative characterization of circuit elements, and partially because simpler models may
better reveal key design constraints. The limitation, however, is that a simplified model may
fail to capture richer dynamics intrinsic to a circuit. Synthetic models combine features of
mathematical models and model organisms. In the engineering of genetic networks,
synthetic biologists start from mathematical models, which are used as the blueprints to
engineer a model out of biological components that has the same materiality as model
organism but is much less complex. The specific characteristics of synthetic models allow
one to use them as tools in distinguishing between different mathematical models and
evaluating results gained in performing experiments with model organisms (Loettgers,
2007).

3.2 Computational tools for synthetic biology


Computational tools are essential for synthetic biology to support the design procedure at
different levels. Due to the lack of quantitative characterizations of biological parts, most
design procedures are iterative requiring experimental validation to enable subsequent
refinements (Canton et al., 2008). Furthermore, stochastic noise, uncertainty about the
cellular environment of an engineered system, and little insulation of components
complicate the design process and require corresponding models and analysis methods (Di
Ventura et al., 2006). Many computational standards and tools developed in the field of
systems biology (Wierling et al., 2007) are applicable for synthetic biology as well.
As previously discussed, synthetic gene circuits can be constructed from a handful of basic
parts that can be described independently and assembled into interoperating modules of
different complexity. For this purpose, standardization and modularity of parts at different
levels is required (Canton et al., 2008). The Registry of Standard Biological Parts constitutes
a reference point for current research in synthetic biology and it provides relevant
information on several DNA-based synthetic or natural building blocks. Most
computational tools that specifically support the design of artificial gene circuits use
information from the abovementioned registry. Moreover, many of these tools share
standardized formats for the input/output files. The System Biology Markup Language
(SBML) (http://sbml.org) defines a widely accepted, XML-based format for the exchange of
mathematical models in biology. It provides a concise representation of the chemical
reactions embraced by a biological system. These can be translated into systems of ODEs or
into reaction systems amenable to stochastic simulations (Alon, 2003). Despite its large
applicability to simulations, SBML currently lacks modularity, which is not well aligned
with parts registries in synthetic biology. Alternatively, synthetic gene systems can be
described in the CellML language, which is more modular (Cooling et al., 2008).
One important feature to enable the assembly of standard biological parts into gene
circuits is that they share common inputs and outputs. Endy (2005) proposed RNA
polymerases and ribosomes as the molecules that physically exchange information
between parts. Their fluxes, measured in PoPS (Polymerase Per Second) and in RiPS
(Ribosomes Per Second) represent biological currents (Canton et al., 2008). This picture,
however, does not seem sufficient to describe all information exchanges even in simple
engineered gene circuits, since other signal carriers like transcription factors and
environmental “messages” should be explicitly introduced and not indirectly estimated
by means of PoPS and RiPS (Marchisio & Stelling, 2008). Based on the assumption that
parts share common input/output signals, several computational tools have been
proposed for gene circuit design, as presented in Table 3. Comparing these circuit design
tools it is obvious that we are still far from an ideal solution. The software tools differ in
many aspects such as scope of parts and circuit descriptions, the mode of user interaction,
and the integration with databases or other tools.
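The idea of parts exchanging common signal carriers can be caricatured in a few lines of Python: a promoter part converts a transcription-factor input into a PoPS output, which a downstream expression device converts into a RiPS flux and finally into a protein synthesis rate. All functional forms and numbers are hypothetical and serve only to illustrate composition through shared input/output signals.

    def promoter_part(tf_concentration, max_pops=0.30, k=2.0, n=2.0):
        """Promoter part: transcription-factor input -> PoPS output (toy Hill response)."""
        return max_pops * tf_concentration ** n / (k ** n + tf_concentration ** n)

    def expression_device(pops_in, rips_per_pops=5.0, protein_per_rips=1.0):
        """Downstream device: PoPS input -> RiPS flux -> protein synthesis rate."""
        rips = rips_per_pops * pops_in
        return protein_per_rips * rips

    # Composition: the output signal of one part is the input of the next.
    for tf in (0.0, 1.0, 4.0):
        pops = promoter_part(tf)
        print(f"TF={tf:4.1f}  PoPS={pops:5.3f}  protein synthesis rate={expression_device(pops):5.2f}")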
Circuit design and simulation
Biojade | http://web.mit.edu/jagoler/www/biojade/
Tinkercell | http://www.tinkercell.com/Home
Asmparts | http://soft.synth-bio.org/asmparts.html
ProMoT | http://www.mpimagdeburg.mpg.de/projects/promot
GenoCAD | http://www.genocad.org/genocad/
GEC | http://research.microsoft.com/gec
TABASCO | http://openwetware.org/wiki/TABASCO

Circuit optimization
Genetdes | http://soft.synth-bio.org/genetdes.html
RoVerGeNe | http://iasi.bu.edu/~batt/rovergene/rovergene.htm

DNA and RNA design
Gene Designer | https://www.dna20.com/index.php?pageID=220
GeneDesign | http://www.genedesign.org/
UNAFold | http://www.bioinfo.rpi.edu/applications/hybrid/download.php
Vienna RNA package | http://www.tbi.univie.ac.at/~ivo/RNA/
Zinc Finger Tools | http://www.scripps.edu/mb/barbas/zfdesign/zfdesignhome.php

Protein Design
Rosetta | http://www.rosettacommons.org/main.html
RAPTOR | http://www.bioinformaticssolutions.com/products/raptor/index.php
PFP | http://dragon.bio.purdue.edu/pfp/
Autodock 4.2 | http://autodock.scripps.edu/
HEX 5.1 | http://webloria.loria.fr/~ritchied/hex/

Integrated workflows
SynBioSS | http://synbioss.sourceforge.net/
Clotho | http://biocad-server.eecs.berkeley.edu/wiki/index.php/Tools
Biskit | http://biskit.sf.net/
Table 3. Computational design tools for synthetic biology (adapted from Marchisio &
Stelling, 2009; Matsuoka et al., 2009; and Purnick & Weiss, 2009)
Biojade was one of the first tools reported for circuit design (Goler, 2004). It provides
connections to both parts databases and simulation environments, but it considers only one
kind of signal carrier (RNA polymerases). It can invoke the simulator TABASCO (Kosuri et
al., 2007), thus enabling genome scale simulations at single base-pair resolution.
CellDesigner (Funahashi et al., 2003) has similar capabilities for graphical circuit
composition. However, parts modularity and consequently circuit representation do not
appear detailed enough. Another tool in which parts communicate only by means of PoPS,
but which is not restricted to a single mathematical framework, is Tinkercell. By contrast, in
Asmparts (Rodrigo et al., 2007a) the circuit design is less straightforward and intuitive
because the tool lacks a Graphical User Interface. Nevertheless, each part exists as an
independent SBML module, and the model kinetics for transcription and translation make it
possible to limit the number of parameters necessary for a qualitative system description. Marchisio
and Stelling (2008) developed a new framework for the design of synthetic circuits where
each part is modeled independently following the ODE formalism. This results in a set of
composable parts that communicate by fluxes of signal carriers, whose overall amount is
constantly updated inside their corresponding pools. The model also considers transcription
factors, chemicals and small RNAs as signal carriers. Pools are placed among parts and
devices: they store free signal carriers and distribute them to the whole circuit. Hence,
polymerases and ribosomes are available in finite amounts; this makes it possible to estimate circuit scalability
with respect to the number of parts. Mass action kinetics is fully employed and no
approximations are required to depict the interactions of signal carriers with DNA and
mRNA. The authors implemented the corresponding models into ProMoT (Process
Modeling Tool), software for the object-oriented and modular composition of models for
dynamic processes (Mirschel et al., 2009). GenoCAD (Czar et al., 2009) and GEC (Pedersen &
Phillips, 2009) introduce the notions of a grammar and of a programming language for
genetic circuit design, respectively. These tools use a set of rules to check the correct
composition of standard parts. Relying on libraries of standard parts that are not necessarily
taken from the Registry of Standard Biological Parts, these programs can translate a circuit
design into a complete DNA sequence. The two tools differ in capabilities and possible
connectivity to other tools.
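The grammar-based approach can be illustrated with a deliberately simplified Python sketch in which a rule table specifies which category of part may follow which; real grammars, such as those used by GenoCAD, are of course far richer, and the categories and rules below are hypothetical.

    # Toy composition grammar: which part category may directly follow which.
    ALLOWED_NEXT = {
        "promoter":   {"rbs"},
        "rbs":        {"cds"},
        "cds":        {"terminator", "rbs"},   # operons: a further rbs/cds pair may follow
        "terminator": {"promoter"},            # a new transcription unit may start
    }

    def is_well_formed(design):
        """Check that a linear list of part categories obeys the toy grammar."""
        if not design or design[0] != "promoter" or design[-1] != "terminator":
            return False
        return all(b in ALLOWED_NEXT[a] for a, b in zip(design, design[1:]))

    print(is_well_formed(["promoter", "rbs", "cds", "terminator"]))   # True
    print(is_well_formed(["promoter", "cds", "rbs", "terminator"]))   # False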
The ultimate goal of designing a genetic circuit is that it works, i.e. that it performs a given
function. For that purpose, optimization cycles to establish an appropriate structure and a
good set of kinetic parameters values are required. These optimization problems are
extremely complex since they involve the selection of adequate parts and appropriate
continuous parameter values. Stochastic optimization methods (e.g. evolutionary
algorithms) attempt to find good solutions by biased random search. They have the
potential for finding globally optimal solutions, but optimization is computationally
expensive. On the other hand, deterministic methods (e.g. gradient descent) are local search
methods, with less computational cost, but at the expense of missing good solutions.
The optimization problem can be tackled by tools such as Genetdes (Rodrigo et al., 2007b)
and OptCircuit (Dasika & Maranas, 2008). They rely on different parts characterizations and
optimization algorithms. Genetdes uses a stochastic method termed “Simulated Annealing”
(Kirkpatrick et al., 1983), which produces a single solution starting from a random circuit
configuration. As a drawback, the algorithm is more likely to get stuck in a local minimum
than an evolutionary algorithm. OptCircuit, on the contrary, treats the circuit design
problem with a deterministic method (Bansal et al., 2003), implementing a procedure
towards a “local” optimal solution. Each of these optimization algorithms requires a very
simplified model for gene dynamics where, for instance, transcription and translation are
treated as a single step process. Moreover, the current methods can cope only with rather
small circuits. Another tool that has been described by Batt and co-workers (2007),
RoVerGeNe, addresses the problem of parameter estimation more specifically. This tool
makes it possible to tune the performance and to estimate the robustness of a synthetic network with
a known behavior whose topology does not require further improvement.
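The flavour of such stochastic optimization can be conveyed by the following toy simulated-annealing loop in Python, which tunes a single hypothetical circuit parameter (a promoter strength) so that a trivial steady-state model matches a target expression level; it is a sketch of the general strategy used by tools such as Genetdes, not of their actual implementation.

    import math
    import random

    def steady_state_output(promoter_strength, degradation=0.5):
        """Toy model: steady-state protein level of a constitutive expression unit."""
        return promoter_strength / degradation

    def cost(promoter_strength, target=40.0):
        return (steady_state_output(promoter_strength) - target) ** 2

    def simulated_annealing(start=1.0, t0=10.0, cooling=0.995, steps=2000):
        """Minimal simulated-annealing search over one circuit parameter."""
        current, temp, best = start, t0, start
        for _ in range(steps):
            candidate = max(0.0, current + random.gauss(0.0, 1.0))
            delta = cost(candidate) - cost(current)
            # Accept improvements always, worse moves with a temperature-dependent probability.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                current = candidate
                if cost(current) < cost(best):
                    best = current
            temp *= cooling
        return best

    random.seed(0)
    print(round(simulated_annealing(), 2))   # should approach ~20, i.e. an output near 40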
Detailed design of synthetic parts that reproduce the estimated circuit kinetics and
dynamics is a complex task. It requires computational tools in order to achieve error free
solutions in a reasonable amount of time. Other than the placement/removal of restriction
sites and the insertion/deletion of longer motifs, mutations of single nucleotides may be
necessary to tune part characteristics (e.g. promoter strength and affinity toward regulatory
factors). Gene Designer (Villalobos et al., 2006) is a complete tool for building artificial DNA
segments and codon usage optimization. GeneDesign (Richardson et al., 2006) is another
tool to design long synthetic DNA sequences. Many other tools are available for specific
analysis of the DNA and RNA circuit components. The package UNAFold (Markham &
Zuker, 2008) predicts the secondary structure of nucleic acid sequences to simulate their
hybridizations and to estimate their melting temperature according to physical
considerations. A more accurate analysis of the secondary structure of ribonucleic acids can
be performed through the Vienna RNA package (Hofacker, 2003). Binding sites along a
DNA chain can be located using Zinc Finger Tools (Mandell & Barbas, 2006). This tool
allows one to search DNA sequences for target sites of particular zinc finger proteins
(Kaiser, 2005), whose structure and composition can also be tailored. Thus, gene control by
a class of proteins with either regulatory or nuclease activity can be improved. Furthermore,
tools that enable promoter predictions and primers design are available, such as BDGP and
Primer3. Another relevant task in synthetic biology is the design and engineering of new
proteins. Many tools have been proposed for structure prediction, homology modeling,
function prediction, docking simulations and DNA-protein interactions evaluation.
Examples include the Rosetta package (Simons et al., 1999); RAPTOR (Xu et al., 2003); PFP
(Hawkins et al., 2006); Autodock 4.2 (Morris et al., 2009) and Hex 5.1 (Ritchie, 2008).
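As a very rough illustration of the kind of physical estimate such nucleic-acid design tools automate, the Python sketch below computes the GC content of a short oligonucleotide and a Wallace-rule melting-temperature estimate (2 °C per A/T and 4 °C per G/C); dedicated packages such as UNAFold instead rely on full nearest-neighbour thermodynamics and secondary-structure prediction, and the primer sequence shown is hypothetical.

    def gc_content(seq):
        """Fraction of G/C bases in a DNA sequence."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)

    def wallace_tm(seq):
        """Very rough melting-temperature estimate for short oligos (Wallace rule)."""
        seq = seq.upper()
        at = seq.count("A") + seq.count("T")
        gc = seq.count("G") + seq.count("C")
        return 2 * at + 4 * gc

    primer = "ATGCGTACCGGTAT"   # hypothetical primer sequence
    print(f"GC content: {gc_content(primer):.2f}   Tm (Wallace rule): {wallace_tm(primer)} degC")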
Further advance in computational synthetic biology will result from tools that combine and
integrate most of the tasks discussed, starting with the choice and assembly of biological
parts to the compilation and modification of the corresponding DNA sequences. Examples
of such tools comprise SynBioSS (Hill et al., 2008); Clotho and Biskit (Grunberg et al., 2007).
Critical elements are still lacking, such as tools for automatic information integration
(literature and databases), and tools that re-use standardized model entities for optimal
circuit design. Overall, providing an extended and integrated information technology
infrastructure will be crucial for the development of the synthetic biology field.

4. A roadmap from design to production of new drugs


Biological systems are dynamic, that is, they mutate, evolve and are subject to noise.
Currently, our knowledge of how these systems work is still limited. As previously
discussed, synthetic biology approaches involve breaking down organisms into a hierarchy
of composable parts, which is useful for conceptualization purposes. Reprogramming a cell
involves the creation of synthetic biological components by adding, removing, or changing
genes and proteins. Nevertheless, it is important to notice that assembly of parts largely
depends on the cellular context (the so-called chassis), thus restraining the abstraction of
biological components into devices and modules, and their use in design and engineering of
new organisms or functions.
One level of abstraction up from DNA synthesis and manipulation is parts production,
whose optimization can be accomplished through either rational design or directed
evolution. Applying rational design to parts alteration or creation is advantageous, in that it
can not only generate products with a defined function, but also produce biological
insights into how the designed function comes about. However, it requires prior structural
knowledge of the part, which is frequently unavailable. Directed evolution is an alternative
method that can effectively address this limitation. Many synthetic biology applications will
require parts for genetic circuits, cell–cell communication systems, and non-natural
metabolic pathways that cannot be found in Nature, simply because Nature is not in need of
them (Dougherty & Arnold, 2009). In essence, directed evolution begins with the generation
of a library containing many different DNA molecules, often by error-prone DNA
replication, DNA shuffling or combinatorial synthesis (Crameri et al., 1998). The library is
next subjected to high-throughput screening or selection methods that maintain a link
between genotype and phenotype in order to enrich the molecules that produce the desired
function. Directed evolution can also be applied at other levels of biological hierarchy, for
example to evolve entire gene circuits (Yokobayashi et al., 2002). Rational design and
directed evolution should not be viewed as opposing methods, but as alternate ways to
produce and optimize parts, each with their own unique strengths and weaknesses.
Directed evolution can complement rational design, by using mutagenesis and subsequent
screening for improved synthetic properties (Brustad & Arnold, 2010). In addition, methods
have been developed to incorporate unnatural amino acids in peptides and proteins
(Voloshchuk & Montclare, 2010). This will expand the toolbox of protein parts, and add
beneficial effects, such as increased in vivo stability, when incorporated in proteinaceous
therapeutics. Also, the development of de novo enzymes has increased significantly in recent
years. Computational design starts from a model capable of stabilizing the transition state of
the target reaction; individual amino acids are then positioned around this model to create a
catalytic site that stabilizes the transition state. The mRNA display technique, which
resembles phage display, is used for the in vitro selection and
evolution of proteins. Translated proteins are associated with their mRNA via a puromycin
linkage. Selection occurs by binding to an immobilized substrate, after which a reverse
transcriptase step will reveal the cDNA and thus the nucleotide sequence (Golynskiy &
Seelig, 2010). If the selection step includes measurement of product formation from the
substrate, novel peptides with catalytic properties can be selected.
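To make the logic of such selection schemes concrete, the toy Python sketch below mimics a directed evolution cycle in silico: a random library is generated, each variant is scored with a stand-in fitness function (here, simply the number of positions matching an arbitrary target sequence), the best variants are retained, and the library is regenerated by error-prone copying. The target, fitness function and parameter values are illustrative assumptions only, not a model of any particular experiment.

    # Toy in silico mimic of a directed evolution cycle: diversify, screen, amplify.
    # "Fitness" is simply similarity to an arbitrary target sequence.
    import random

    random.seed(1)
    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"       # amino acid one-letter codes
    TARGET = "MKTAYIAKQR"                   # hypothetical optimal sequence

    def fitness(seq):
        return sum(a == b for a, b in zip(seq, TARGET))

    def mutate(seq, rate=0.1):
        return "".join(random.choice(ALPHABET) if random.random() < rate else c
                       for c in seq)

    # Initial random library (error-prone synthesis / combinatorial diversity)
    library = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]

    for generation in range(15):
        library.sort(key=fitness, reverse=True)      # high-throughput "screen"
        survivors = library[:20]                     # keep the top 10%
        library = [mutate(random.choice(survivors))  # error-prone "amplification"
                   for _ in range(200)]
        print("generation %2d: best fitness %d of %d"
              % (generation, fitness(survivors[0]), len(TARGET)))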
For the design, engineering, integration and testing of new synthetic gene networks, tools
and methods derived from experimental molecular biology must be used (for details see
section 2). Nevertheless, progress on these tools and methods is still not enough to
guarantee the complete success of the experiment. As a result, design of synthetic biological
systems has become an iterative process of modeling, construction, and experimental testing
that continues until a system achieves the desired behavior (Purnick & Weiss, 2009). The
process begins with the abstract design of devices, modules, or organisms, and is often
guided by mathematical models (Koide et al., 2009). Afterwards, the newly constructed
systems are tested experimentally. However, such initial attempts rarely yield fully
functional implementations due to incomplete biological information. Rational redesign
based on mathematical models improves system behavior in such situations (Koide et al.,
2009; Prather & Martin, 2008). Directed evolution is a complementary approach, which can
yield novel and unexpected beneficial changes to the system (Yokobayashi et al., 2002).
These retooled systems are once again tested experimentally and the process is repeated as
needed. Many synthetic biological systems have been engineered successfully in this fashion
because the methodology is highly tolerant to uncertainty (Matsuoka et al., 2009). Figure 1
illustrates the above mentioned iterative approach used in synthetic biology.
Since its inception, metabolic engineering has aimed to optimize cellular metabolism for a
particular industrial process application through the use of directed genetic modifications
(Tyo et al., 2007). Metabolic engineering is often seen as a cyclic process (Nielsen, 2001),
where the cell factory is analyzed and an appropriate target is identified. This target is then
experimentally implemented and the resulting strain is characterized experimentally and, if
necessary, further analyses are conducted to identify novel targets. The application of
synthetic biology to metabolic engineering can potentially create a paradigm shift. Rather
than starting with the full complement of components in a wild-type organism and
piecewise modifying and streamlining its function, metabolic engineering can be attempted
from a bottom-up, parts-based approach to design by carefully and rationally specifying the
inclusion of each necessary component (McArthur IV & Fong, 2010). The importance of
rationally designing improved or new microbial cell factories for the production of drugs
has grown substantially, given the increasing need for new or existing drugs at prices
affordable for low-income countries. Large-scale re-engineering of a biological
circuit will require systems-level optimization that will come from a deep understanding of
operational relationships among all the constituent parts of a cell. The integrated framework
necessary for conducting such complex bioengineering requires the convergence of systems
and synthetic biology (Koide et al., 2009). In recent years, with advances in systems biology
(Kitano, 2002), there has been an increasing trend toward using mathematical and
computational tools for the in silico design of enhanced microbial strains (Rocha et al., 2010).

[Figure 1 comprises four linked stages: (i) model formulation from mass action kinetics, yielding a kinetic model (ODE/SDE) built from experimental data and existing model parameters; (ii) simulation and in silico testing (parameter space; dynamics) with exploitation of the dynamic properties; (iii) circuit construction and experimental validation in a model organism (chassis), specifying the platform (microscopy; flow cytometry; microfluidics), the read-out (phenotype; morphology; expression), the analysis (equilibrium; cellular context; genetic background; environment) and the circuit construction itself; and (iv) circuit definition or improvement and characterization, which feeds back ("re-engineer") into the model.]
Fig. 1. The iterative synthetic biology approach to design a given biological circuit/system.
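A minimal illustration of what such an in silico strain-design calculation can look like is the toy flux balance analysis below, solved as a linear program with SciPy: fluxes through a three-reaction network are constrained to steady state and bounded, and the product-forming flux is maximized. The network, its stoichiometry and all bounds are invented for illustration and do not correspond to any real genome-scale model.

    # Toy flux balance analysis: maximize product flux subject to steady state
    # (S v = 0) and flux bounds, using linear programming (SciPy).
    import numpy as np
    from scipy.optimize import linprog

    # Columns: v1 substrate uptake, v2 conversion A -> B, v3 product secretion
    # Rows: internal metabolites A and B (balanced at steady state)
    S = np.array([[1, -1,  0],    # metabolite A
                  [0,  1, -1]])   # metabolite B

    bounds = [(0, 10),    # substrate uptake capped at an assumed value
              (0, None),  # internal conversion unconstrained
              (0, None)]  # product secretion unconstrained

    # Maximize v3 (product secretion) by minimizing -v3
    c = np.array([0, 0, -1])
    result = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")

    print("Optimal flux distribution:", result.x)   # expected: [10, 10, 10]
    print("Maximal product flux     :", -result.fun)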
Current models in both synthetic and systems biology emphasize the relationship between
environmental influences and the responses of biological networks. Nevertheless, these
models operate at different scales, and to understand the new paradigm of rational systems
re-engineering, synthetic and systems biology fields must join forces (Koide et al., 2009).
Synthetic biology and bottom-up systems biology methods extract discrete, accurate,
quantitative, kinetic and mechanistic details of regulatory sub-circuits. The models
generated from these approaches provide an explicit mathematical foundation that can
ultimately be used in systems redesign and re-engineering. However, these approaches are
confounded by high dimensionality, non-linearity and poor prior knowledge of key
dynamic parameters (Fisher & Henzinger, 2007) when scaled to large systems.
Consequently, modular sub-network characterization is performed assuming that the
network is isolated from the rest of the host system. The top-down systems biology
approach is based on data from high-throughput experiments that list the complete set of
components within a system in a qualitative or semi-quantitative manner. Models of overall
systems are similarly qualitative, tending toward algorithmic descriptions of component
interactions. Such models are amenable to the experimental data used to develop them, but
usually sacrifice the finer kinetic and mechanistic details of the molecular components
involved (Price & Shmulevich, 2007). Bridging systems and synthetic biology approaches is
being actively discussed and several solutions have been suggested (Koide et al., 2009).
A typical synthetic biology project is the design and engineering of a new biosynthetic
pathway in a model organism (chassis). Generally, E. coli is the preferred chassis since it is
well-studied, easy to manipulate, and readily grown in laboratory cultures.
Initially, relevant databases like KEGG and BioCyc (Table 2) can be consulted for identifying
all the possible metabolic routes that allow the production of a given drug from metabolites
that exist in native E. coli. Then, for each reaction, the species that are known to possess the
corresponding enzymes/genes must be identified. This step is relevant, since most often the
same enzyme exhibits different kinetic behavior among different species. Information on
sequences and kinetic parameters can be extracted from the above-mentioned sources,
relevant literature and also from the BRENDA and ExPASy databases. Afterwards, the information
collected is used to build a family of dynamic models (Rocha et al., 2008, 2010) that enable
the simulation of possible combinations regarding pathway configuration and the origin of
the enzymes (associated with varying kinetic parameters). The OptFlux tool can be used for
simulations and metabolic engineering purposes (http://www.optflux.org/). Using the
same input (a fixed amount of precursor), it is possible to select the configuration that gives
the highest drug yields. Furthermore, through the use of genome-scale stoichiometric
models coupled with the dynamic model it is possible to understand the likely limitations
regarding the availability of the possible precursors. In fact, if the precursor for the given
drug biosynthesis is a metabolic intermediate, possible limitations in its availability need to
be addressed in order to devise strategies to cope with it. Based on this information, the next
step involves the construction of the enzymatic reactions that will lead to the production of
the drug from a metabolic precursor in E. coli. The required enzymes are then synthesized
based on the gene sequences previously selected from the databases. The cloning strategy
may include using a single plasmid with two different promoters; using two different
plasmids, with different copy numbers and/or origins of replication; or ultimately
integrating it into the genome, in order to allow fine tuning of the expression of the various
enzymes necessary. Finally, a set of experiments using the engineered bacterium needs to be
performed to evaluate its functionality, side-product formation and/or accumulation,
production of intermediate metabolites and final product (desired drug). In the case of the
previously mentioned artemisinin production, DNA microarray analysis and targeted
metabolic profiling were used to optimize the synthetic pathway, reducing the accumulation
of toxic intermediates (Kizer et al., 2008). These types of methodologies enable the validation
of the drug production model and the design of strategies to further improve its production.
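As a simplified, hypothetical illustration of the dynamic-model step in this workflow, the Python sketch below simulates a two-step pathway (precursor to intermediate to drug) with Michaelis-Menten kinetics and compares two candidate variants of the second enzyme that differ only in their kinetic parameters, reporting final drug titer and peak intermediate accumulation; all parameter values are invented and are not taken from any database.

    # Toy dynamic model of a two-step heterologous pathway:
    #   precursor P --(enzyme 1)--> intermediate I --(enzyme 2)--> drug D
    # Two hypothetical variants of enzyme 2 are compared.
    import numpy as np
    from scipy.integrate import odeint

    def pathway(y, t, vmax1, km1, vmax2, km2):
        P, I, D = y
        r1 = vmax1 * P / (km1 + P)      # Michaelis-Menten rate, step 1
        r2 = vmax2 * I / (km2 + I)      # Michaelis-Menten rate, step 2
        return [-r1, r1 - r2, r2]

    t = np.linspace(0, 50, 500)         # arbitrary time units
    y0 = [10.0, 0.0, 0.0]               # fixed amount of precursor, no I or D yet

    # Hypothetical kinetic parameters (vmax1, km1, vmax2, km2) per enzyme source
    variants = {"enzyme 2 from species X": (1.0, 0.5, 0.4, 2.0),
                "enzyme 2 from species Y": (1.0, 0.5, 1.2, 5.0)}

    for name, params in variants.items():
        P, I, D = odeint(pathway, y0, t, args=params).T
        print("%s: final drug titer %.2f, peak intermediate %.2f"
              % (name, D[-1], I.max()))

Comparing such trajectories for each candidate configuration mirrors, on a very small scale, the selection of the pathway variant that maximizes yield while limiting the accumulation of intermediates.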

5. Novel strategies for cancer diagnosis and drug development


Cancer is a major issue for modern society and, according to the World Health
Organization, it is among the top 10 leading causes of death in middle- and high-income
countries. Several possibilities to further improve existing therapies and diagnostics, or to
develop novel alternatives that have not yet been foreseen, can be pursued using synthetic
biology approaches. Promising future applications include the development of RNA-based
biosensors to produce a desired response in vivo or to be integrated in a cancer diagnosis
device; the design and engineering of bacteria that can be programmed to target a tumor
and release a therapeutic agent in situ; the use of viruses as tools for recognizing tumors or for
gene therapy; and the large scale production of complex chemotherapeutic agents, among
others.

5.1 RNA-based biosensors


Synthetic biology seeks new biological devices and systems that regulate gene
expression and metabolic pathways. Many components of a living cell, such as DNA, RNA and
proteins, possess the ability to carry genetic information. RNA has a critical role in several
functions (genetic translation, protein synthesis, signal recognition particles) owing to its
functional versatility: it acts as a genetic blueprint (e.g. mRNA, RNA virus genomes), as a
catalyst (e.g. ribozymes, rRNA) and as a regulator of gene expression (e.g. miRNA, siRNA),
which makes it stand out among other biopolymers with a more specialized scope (e.g. DNA,
proteins) (Dawid et al., 2009). Therefore, non-coding RNA
molecules enable the formation of complex structures that can interact with DNA, other
RNA molecules, proteins and other small molecules (Isaacs et al., 2006).
Natural biological systems contain transcription factors and regulators, as well as several
RNA-based mechanisms for regulating gene expression (Saito & Inoue, 2009). A number of
studies have been conducted on the use of RNA components in the construction of synthetic
biological devices (Topp & Gallivan, 2007; Win & Smolke, 2007). The interaction of RNA with
proteins, metabolites and other nucleic acids is affected by the relationship between
sequence, structure and function. This is what makes the RNA molecule so attractive and
malleable to engineering complex and programmable functions.

5.1.1 Riboswitches
Among the most promising elements are riboswitches, genetic control elements that
allow small molecules to regulate gene expression. They are structured elements typically
found in the 5’-untranslated regions of mRNA that recognize small molecules and respond
by altering their three-dimensional structure. This, in turn, affects transcription elongation,
translation initiation, or other steps of the process that lead to protein production (Beisel &
Smolke, 2009; Winkler & Breaker, 2005). Biological cells can modulate gene expression in
response to physical and chemical variations in the environment, allowing them to control
their metabolism and to avoid wasteful energy expenditure or inappropriate
physiological responses (Garst & Batey, 2009). There are currently at least twenty classes of
riboswitches that recognize a wide range of ligands, including purine nucleobases (purine
riboswitch), amino acids (lysine riboswitch), vitamin cofactors (cobalamin riboswitch),
amino sugars, metal ions (mgtA riboswitch) and second messenger molecules (cyclic di-
GMP riboswitch) (Beisel & Smolke, 2009). Riboswitches are typically composed of two
distinct domains: a metabolite receptor known as the aptamer domain, and an expression
platform whose secondary structure signals the regulatory response. Embedded within the
aptamer domain is the switching sequence, a sequence shared between the aptamer domain
and the expression platform (Garst & Batey, 2009). The aptamer domain is part of the RNA
and forms precise three-dimensional structures. It is considered a structured nucleotide
pocket belonging to the riboswitch, in the 5'-UTR, which, when bound by its ligand, regulates
downstream gene expression (Isaacs et al., 2006). Aptamers specifically recognize their
corresponding target molecule, the ligand, with the appropriate affinity even within a complex
mixture of other metabolites; such ligands include dyes, biomarkers, proteins, peptides,
aromatic small molecules, antibiotics and other biomolecules. Both the nucleotide sequence and the
secondary structure of each aptamer remain highly conserved (Winkler & Breaker, 2005).
Therefore, aptamer domains are the operators of the riboswitches.
A strategy for finding new aptamer sequences is SELEX (Systematic Evolution of
Ligands by EXponential enrichment). SELEX is a combinatorial chemistry technique
for producing oligonucleotides of either single-stranded DNA or RNA that specifically bind
to one or more target ligands (Stoltenburg et al., 2007). The process begins with the synthesis
of a very large oligonucleotide library consisting of randomly generated sequences of fixed
length flanked by constant 5' and 3' ends that serve as primers. The sequences in the library
are exposed to the target ligand and those that do not bind the target are removed, usually
by affinity chromatography. The bound sequences are eluted and amplified by PCR to
prepare for subsequent rounds of selection in which the stringency of the elution conditions
is increased to identify the tightest-binding sequences (Stoltenburg et al., 2007). SELEX has
been used to evolve aptamers of extremely high binding affinity to a variety of target
ligands. Clinical uses of the technique are suggested by aptamers that bind tumor markers
(Ferreira et al., 2006). The aptamer sequence must then be placed near the RBS of the
reporter gene and inserted into E. coli (the chassis) using a DNA carrier (e.g. a plasmid), in order
to exert its regulatory function.
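The enrichment logic of SELEX can be illustrated with a small simulation. In the Python sketch below, each sequence in a random pool receives a surrogate binding score (its best match to an arbitrary "binding motif"), only sequences above a stringency threshold are retained, the threshold is raised each round, and the retained fraction is amplified back to the original pool size. The motif, scoring rule and round parameters are hypothetical and serve only to show how rare, tight-binding sequences become enriched.

    # Toy SELEX simulation: partition a random pool by a binding-score threshold,
    # raise the stringency each round, and re-amplify the bound fraction.
    import random

    random.seed(7)
    MOTIF = "GGAUCC"                          # hypothetical binding motif

    def binding_score(seq):
        # Surrogate affinity: best number of motif matches in any window
        best = 0
        for start in range(len(seq) - len(MOTIF) + 1):
            window = seq[start:start + len(MOTIF)]
            best = max(best, sum(a == b for a, b in zip(window, MOTIF)))
        return best

    def random_pool(n, length=30):
        return ["".join(random.choice("ACGU") for _ in range(length))
                for _ in range(n)]

    pool = random_pool(5000)
    for rnd, threshold in enumerate([3, 4, 5, 6], start=1):
        bound = [s for s in pool if binding_score(s) >= threshold]
        if bound:                             # "PCR amplification" of binders
            pool = [random.choice(bound) for _ in range(5000)]
        mean_score = sum(binding_score(s) for s in pool) / len(pool)
        print("round %d: %4d sequences bound, mean pool score %.2f"
              % (rnd, len(bound), mean_score))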
Synthetic riboswitches represent a powerful tool for the design of biological sensors that
can, for example, detect cancer cells, or the microenvironment of a tumor, and in the
presence of a given molecule perform a desired function, like the expression in situ of a
therapeutic agent. Several cancer biomarkers have been identified in the last decade;
therefore there are many opportunities to use these compounds as templates to design
adequate riboswitches for their recognition. Alternatively, the engineering goal might be the
detection of some of these biomarkers in biological samples using biosensors with aptamers
as the biological recognition element, hence making it a less invasive approach. The
development of aptamer-based electrochemical biosensors has made the detection of small
and macromolecular analytes easier, faster, and more suited for early detection of protein
biomarkers (Hianik & Wang, 2009). Multi-sensor arrays that provide global information on
complex samples (e.g. biological samples) have attracted much interest recently. Coupling
an aptamer to these devices will increase their specificity and selectivity towards the selected
target(s). The selected target may be any serum biomarker that when detected in high
amounts in biological samples can be suggestive of tumor activity.

5.2 Bacteria as anti-cancer agents


Bacteria possess unique features that make them powerful candidates for treating cancer in
ways that are unattainable by conventional methods. The moderate success of conventional
methods, such as chemotherapy and radiation, is related to their toxicity to normal tissue and their
inability to destroy all cancer cells. Many bacteria have been reported to specifically target
tumors, actively penetrate tissue, be easily detected and/or induce a controlled cytotoxicity.
The possibility of engineering interactions between programmed bacteria and mammalian
cells opens up unprecedented opportunities for progress in the medical field. Emerging applications include the
design of bacteria to produce therapeutic agents (in vitro or in vivo) and the use of live
bacteria as targeted delivery systems (Forbes, 2010; Pawelek et al., 2003). An impressive
example of these applications is described by Anderson and co-workers (2006). The authors
have successfully engineered E. coli harboring designed plasmids to invade cancer cells in
an environmentally controlled way, namely in a density-dependent manner under
anaerobic growth conditions and arabinose induction. Plasmids were built containing the
inv gene from Yersinia pseudotuberculosis under control of the Lux promoter, the hypoxia-
responsive fdhF promoter, and the arabinose-inducible araBAD promoter. This is significant
because the tumor environment is often hypoxic and allows for high bacterial cell densities
due to depressed immune function in the tumor. Therefore, this work demonstrated, as a
“proof of concept”, that one can potentially use engineered bacteria to target diseased cells
without significantly impacting healthy cells.
Ideally, an engineered bacterium for cancer therapy would specifically target tumors
enabling the use of more toxic molecules without systemic effects; be self-propelled enabling
its penetration into tumor regions that are inaccessible to passive therapies; be responsive to
external signals enabling the precise control of location and timing of cytotoxicity; be able to
sense the local environment allowing the development of responsive therapies that can
make decisions about where and when drugs are administered; and be externally detectable,
thus providing information about the state of the tumor, the success of localization and the
efficacy of treatment (Forbes, 2010). Indeed some of these features naturally exist in some
bacteria, e.g. many genera of bacteria have been shown to preferentially accumulate in
tumors, including Salmonella, Escherichia, Clostridium and Bifidobacterium. Moreover, bacteria
have motility (flagella) that enable tissue penetration and chemotactic receptors that direct
chemotaxis towards molecular signals in the tumor microenvironment. Selective
cytotoxicity can be engineered by transfection with genes for therapeutic molecules,
including toxins, cytokines, tumor antigens and apoptosis-inducing factors. External control
can be achieved using gene promoter strategies that respond to small molecules, heat or
radiation. Bacteria can be detected using light, magnetic resonance imaging and positron
emission tomography. Finally, genetic manipulation of bacteria is straightforward, thus enabling the
development of treatment strategies, such as expression of anti-tumor proteins and
including vectors to infect cancer cells (Pawelek et al., 2003). To date, many different
bacterial strategies have been implemented in animal models (e.g. Salmonella has been tested
for breast, colon, hepatocellular, melanoma, neuroblastoma, pancreatic and prostate cancer),
and also some human trials (e.g. C. butyricum M-55 has been tested for squamous cell
carcinoma, metastatic, malignant neuroma and melanoma) have been carried out (Forbes,
2010).
Ultrasound is one of the techniques often used to treat solid tumors (e.g. breast cancer);
however, this technique is not always successful, as sometimes it just heats the tumor
without destroying it. Therefore, we are currently engineering the heat shock response
machinery from E. coli to trigger the release of a therapeutic agent in situ concurrent with
ultrasound treatment. For that purpose, several modeling and engineering steps are being
implemented. The strategy being pursued is particularly useful for drugs that require in situ
synthesis because of a poor bioavailability, thereby avoiding repetitive oral doses to achieve
sufficient concentration inside the cells. The use of live bacteria for therapeutic purposes
naturally poses some issues (Pawelek et al., 2003), but currently the goal is to achieve the
proof-of-concept that an engineered system will enable the production of a cancer-fighting
drug triggered by a temperature increase.

5.3 Alternative nanosized drug carriers


The design of novel tumor-targeted multifunctional particles is another extremely
interesting and innovative approach that makes use of the synthetic biology principles. The
modest success of the traditional strategies for cancer treatment has driven research towards
the development of new approaches underpinned by mechanistic understanding of cancer
progression and targeted delivery of rational combination therapy.

5.3.1 Viral drug delivery systems


The use of viruses, in the form of vaccines, has been common practice ever since their first use
to combat smallpox. Recently, genetic engineering has enlarged the applications of viruses,
since it allows the removal of pathogen genes encoding virulence factors that are present in
the virus coat. As a result, a virus can elicit immunity without causing serious health effects in
humans. In the light of gene therapy, the use of virus-based entities holds a promising future,
since, by nature, they are delivered to human target cells and can be easily
manipulated genetically. As such, they may be applied to target and lyse specific cancer
cells, delivering therapeutics in situ. Bacteriophages are viruses that specifically and only
infect bacteria. They have gained increasing attention over the last decades, mainly through phage display
technology. In anti-cancer therapy, this technique has contributed enormously to the
identification of new tumor-targeting molecules (Brown, 2010). In vivo phage display
technology identified a peptide exhibiting high affinity to hepatocellular carcinoma cells
(Du et al., 2010). In a different approach, a phage display-selected ligand targeting breast
cancer cells was incorporated in liposomes containing siRNA. The delivered liposomes were
shown to significantly downregulate the PRDM14 gene in the MCF7 target cells (Bedi et al.,
2011). In addition, the option to directly use bacteriophages as drug-delivery platforms has
been explored. A recent study described the use of genetically modified phages able to
target tumor cell receptors via specific antibodies resulting in endocytosis, intracellular
degradation, and drug release (Bar et al., 2008). Using phage display, a variety of cancer cell-
binding and internalizing ligands have been selected (Gao et al., 2003). Bacteriophages can
also be applied to establish an immune response. Eriksson and co-workers (2009) showed
that a tumor-specific M13 bacteriophage induced regression of melanoma target cells,
involving tumor-associated macrophages and being Toll-like receptor-dependent. Finally,
marker molecules or drugs can be chemically conjugated onto the phage surface, making it a
versatile imaging or therapy vehicle that may reduce costs and improve life quality
(Steinmetz, 2010). An M13 phage containing cancer cell-targeting motifs on the surface was
chemically modified to conjugate with fluorescent molecules, resulting in both binding and
imaging of human KB cancer cells (Li et al., 2010). Besides being genetically part of the virus,
anti-tumor compounds can also be covalently linked to it. We are currently using phage
display to select phages that adhere to and penetrate tumor cells. Following this selection, we
will chemically conjugate anti-cancer compounds (e.g. doxorubicin) to bacteriophages,
equipped with the cancer cell-recognizing peptides on the phage surface. We anticipate that
such a multifunctional nanoparticle, targeted to the tumor using a tumor “homing” peptide,
will enable a significant improvement over existing anti-cancer approaches.
5.4 Microbial cell factories for the production of drugs


From a different perspective, as exemplified in section 4, synthetic biology approaches can be
used for the large scale production of compounds with pharmaceutical applications. One of
the easily employable approaches to develop synthetic pathways is to combine genes from
different organisms, and design a new set of metabolic pathways to produce various natural
and unnatural products. The host organism provides precursors from its own metabolism,
which are subsequently converted to the desired product through the expression of the
heterologous genes (see section 4). Existing examples of synthetic metabolic networks make
use of transcriptional and translational control elements to regulate the expression of
enzymes that synthesize and breakdown metabolites. In these systems, metabolite
concentration acts as an input for other control elements (Andrianantoandro et al., 2006). An
entire metabolic pathway from S. cerevisiae, the mevalonate isoprenoid pathway for
synthesizing isopentenyl pyrophosphate, was successfully transplanted into E. coli. In
combination with an inserted synthetic amorpha-4,11-diene synthase, this pathway
produced large amounts of a precursor to the anti-malarial drug artemisinin. This new
producing strain is very useful since a significant decrease in the drug production time and
costs could be achieved (Martin et al., 2003). In addition to engineering pathways that
produce synthetic metabolites, artificial circuits can be engineered using metabolic
pathways connected to regulatory proteins and transcriptional control elements
(Andrianantoandro et al., 2006). One study describes such a circuit based on controlling
gene expression through acetate metabolism for cell–cell communication (Bulter et al., 2004).
Metabolic networks may embody more complex motifs, such as an oscillatory network. A
recently constructed metabolic network used glycolytic flux to generate oscillations through
the signaling metabolite acetyl phosphate (Fung et al., 2005). The system integrates
transcriptional regulation with metabolism to produce oscillations that are not correlated
with the cell division cycle. The general concerns of constructing transcriptional and protein
interaction-based modules, such as kinetic matching and optimization of reactions for a new
environment, apply for metabolic networks as well. In addition, the appropriate metabolic
precursors must be present. For this purpose, it may be necessary to include other enzymes
or metabolic pathways that synthesize precursors for the metabolite required in a synthetic
network (Leonard et al., 2008; McArthur IV & Fong, 2010).
Many polyketides and nonribosomal peptides are being used as antibiotic, anti-tumor and
immunosuppressant drugs (Neumann & Neumann-Staubitz, 2010). In order to produce
them in heterologous hosts, assembly of all the necessary genes that make up the synthetic
pathways is essential. The metabolic systems for the synthesis of polyketides are composed
of multiple modules, in which an individual module consists of either a polyketide synthase
or a nonribosomal peptide synthetase. Each module has a specific set of catalytic domains,
which ultimately determine the structure of the metabolic product and thus its function.
Recently, Bumpus et al. (2009) presented a proteomic strategy to identify new gene clusters
for the production of polyketides and nonribosomal peptides, and their biosynthetic
pathways, by adapting mass-spectrometry-based proteomics. This approach allowed
identification of genes that are used in the production of the target product in a species, for
which a complete genome sequence is not available. Such newly identified pathways can
then be copied into a new host strain that is more suitable for producing polyketides and
nonribosomal peptides at an industrial scale. This exemplifies that the sources of new
pathways are not limited to species with fully sequenced genomes.
The use of synthetic biology approaches in the field of metabolic engineering opens
enormous possibilities, especially toward the production of new drugs for cancer treatment.
Our goal is to design and model a new biosynthetic pathway for the production of natural
drugs in E. coli. Key to this is the specification of gene sequences encoding enzymes that
catalyze each reaction in the pathway, and whose DNA sequences can be incorporated into
devices that lead to functional expression of the molecules of interest (Prather & Martin,
2008). Partial pathways can be recruited from independent sources and co-localized in a
single host (Kobayashi et al., 2004). Alternatively, pathways can be constructed for the
production of new, non-natural products by engineering existing routes (Martin et al., 2003).

6. Conclusion
Despite all the scientific advances that humankind has seen over the last centuries, there are
still no clear and defined solutions to diagnose and treat cancer. In this sense, the search for
innovative and efficient solutions continues to drive research and investment in this field.
Synthetic biology uses engineering principles to create, in a rational and systematic way,
functional systems based on the molecular machines and regulatory circuits of living
organisms, or to re-design and fabricate existing biological systems. Bioinformatics and
newly developed computational tools play a key role in the improvement of such systems.
Elucidation of disease mechanisms, identification of potential targets and biomarkers,
design of biological elements for recognition and targeting of cancer cells, discovery of new
chemotherapeutics or design of novel drugs and catalysts, are some of the promises of
synthetic biology. Recent achievements are thrilling and promising; yet some of these
innovative solutions are still far from real application due to technical challenges and
ethical issues. Nevertheless, many scientific efforts are being made to overcome these
limitations, and it is expected that synthetic biology, together with
sophisticated computational tools, will pave the way to revolutionizing the cancer field.

7. References
Alon, U. (2003). Biological networks: the tinkerer as an engineer. Science, Vol.301, No.5641,
(September 2003), pp. 1866-1867, ISSN 0036-8075
Anderson, J.C.; Clarke, E.J.; Arkin, A.P. & Voigt, C.A. (2006). Environmentally controlled
invasion of cancer cells by engineered bacteria. Journal of Molecular Biology, Vol.355,
No.4, (January 2006), pp. 619–627, ISSN 00222836
Andrianantoandro, E.; Basu, S.; Karig, D.K. & Weiss, R. (2006). Synthetic biology: new
engineering rules for an emerging discipline. Molecular Systems Biology Vol.2, No.
2006.0028, (May 2006), pp. 1-14, ISSN 1744-4292
Arkin, A.P. (2001). Synthetic cell biology. Current Opinion in Biotechnology Vol.12, No.6,
(December 2001), pp. 638-644, ISSN 0958-1669
Bansal, V.; Sakizlis, V.; Ross, R.; Perkins, J.D. & Pistikopoulos, E.N. (2003). New algorithms
for mixed-integer dynamic optimization. Computers and Chemical Engineering
Vol.27, No.5, (May 2003), pp. 647-668, ISSN 0098-1354
Bar, H.; Yacoby, I. & Benhar, I. (2008). Killing cancer cells by targeted drug-carrying phage
nanomedicines. BMC Biotechnology Vol.8, No.37, (April 2008), pp. 1-14, ISSN 1472-
6750
Batt, G.; Yordanov, B.; Weiss, R. & Belta, C. (2007). Robustness analysis and tuning of
synthetic gene networks. Bioinformatics Vol.23, No.18, (July 2007), pp. 2415-2422,
ISSN 1367-4803
Bedi, D.; Musacchio, T.; Fagbohun, O.A.; Gillespie, J.W.; Deinnocentes, P.; Bird, R.C.;
Bookbinder, L.; Torchilin, V.P. & Petrenko, V.A. (2011). Delivery of siRNA into
breast cancer cells via phage fusion protein-targeted liposomes. Nanomedicine:
Nanotechnology, Biology, and Medicine, doi:10.1016/j.nano.2010.10.004, ISSN 1549-
9634
Beisel, C.L. & Smolke, C.D. (2009). Design principles for riboswitch function. PLoS
Computational Biology Vol.5, No.4, (April 2009), e1000363, pp. 1-14, ISSN 1553-734X
Benner, S.A. & Sismour, A.M. (2005). Synthetic biology. Nature Reviews Genetics Vol.6, No.7,
(July 2005), pp. 533–543, ISSN 1471-0056
Brown, K.C. (2010). Peptidic tumor targeting agents: the road from phage display peptide
selections to clinical applications. Current Pharmaceutical Design Vol.16, No.9,
(March 2010), pp. 1040-1054, ISSN 1381-6128
Brustad, E.M. & Arnold, F.H. (2010). Optimizing non-natural protein function with directed
evolution. Current Opinion in Chemical Biology Vol.15, No.2, (April 2010), pp. 1-10,
ISSN 1367-5931
Bulter, T.; Lee, S.G.; Wong, W.W.; Fung, E.; Connor, M.R. & Liao, J.C. (2004). Design of
artificial cell–cell communication using gene and metabolic networks. PNAS
Vol.101, No.8 (February 2004), pp. 2299–2304, ISSN 0027-8424
Bumpus, S.; Evans, B.; Thomas, P.; Ntai, I. & Kelleher, N. (2009). A proteomics approach to
discovering natural products and their biosynthetic pathways. Nature Biotechnology
Vol.27, No.10, (September 2009), pp. 951-956, ISSN 1087-0156
Canton, B.; Labno, A. & Endy, D. (2008). Refinement and standardization of synthetic
biological parts and devices. Nature Biotechnology Vol.26, No. 7, (July 2008), pp. 787-
793, ISSN 1087-0156
Carothers, J.M.; Goler, J.A. & Keasling, J.D. (2009). Chemical synthesis using synthetic
biology. Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 498-503,
ISSN 0958-1669
Chan, L.Y.; Kosuri, S. & Endy, D. (2005). Refactoring bacteriophage T7. Molecular Systems
Biology Vol.1, No. 2005.0018, (September 2005), pp. 1-10, ISSN 1744-4292
Cooling, M.T.; Hunter, P. & Crampin, E.J. (2008). Modelling biological modularity with
CellML. IET Systems Biology Vol.2, No.2, (March 2008), pp. 73-79, ISSN 1751-8849
Crameri, A.; Raillard, S.A.; Bermudez, E. & Stemmer, W.P. (1998). DNA shuffling of a family
of genes from diverse species accelerates directed evolution. Nature Vol.391,
No.6664, (January 1998), pp. 288-291, ISSN 0028-0836
Czar, M.J.; Cai, Y. & Peccoud, J. (2009). Writing DNA with GenoCAD. Nucleic Acids Research
Vol.37, No.2, (May 2009), pp. W40-W47, ISSN 0305-1048
Dasika, M.S. & Maranas, C.D. (2008). OptCircuit: an optimization based method for
computational design of genetic circuits. BMC Systems Biology Vol.2, No.24, pp. 1-
19, ISSN 1752-0509
Datsenko, K.A. & Wanner, B.L. (2000). One-step inactivation of chromosomal genes in
Escherichia coli K-12 using PCR products. PNAS Vol.97, No. 12, (June 2000), pp.
6640-6645, ISSN 0027-8424
Dawid, A.; Cayrol, B. & Isambert, H. (2009). RNA synthetic biology inspired from bacteria:
construction of transcription attenuators under antisense regulation. Physical
Biology Vol.6, No. 025007, (July 2009), pp. 1-10, ISSN 1478-3975
Di Bernardo, D.; Thompson, M.J.; Gardner, T.S.; Chobot, S.E.; Eastwood, E.L.; Wojtovich,
A.P.; Elliott, S.E.; Schaus, S.E. & Collins, J.J. (2005). Chemogenomic profiling on a
genome-wide scale using reverse-engineered gene networks. Nature Biotechnology
Vol.23, No.3, (March 2005), pp. 377-383, ISSN 1087-0156
Di Ventura, B.; Lemerle, C.; Michalodimitrakis, K. & Serrano, L. (2006). From in vivo to in
silico biology and back. Nature Vol.443, No.7111, (October 2006), pp. 527-533, ISSN
0028-0836
Dougherty, M.J. & Arnold, F.H. (2009). Directed evolution: new parts and optimized
function. Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 486-491,
ISSN 0958-1669
Du, B.; Han, H.; Wang, Z.; Kuang, L.; Wang, L.; Yu, L.; Wu, M.; Zhou, Z. & Qian, M. (2010).
Targeted drug delivery to hepatocarcinoma in vivo by phage-displayed specific binding
peptide. Molecular Cancer Research Vol.8, No.2, (February 2010), pp. 135-144, ISSN
1541-7786
Ellis, T.; Wang, X. & Collins, J.J. (2009). Diversity-based, model guided construction of
synthetic gene networks with predicted functions. Nature Biotechnology Vol.27,
No.5, (May 2009), pp. 465– 471, ISSN 1087-0156
Elowitz, M.B. & Leibler, S. (2000). A synthetic oscillatory network of transcriptional
regulators. Nature Vol.403, No.6767, (January 2000), pp. 335-338, ISSN 0028-0836
Elowitz, M.B.; Levine, A.J.; Siggia, E.D. & Swain, P.S. (2002). Stochastic gene expression in a
single cell. Science Vol.297, No.5584, (August 2002), pp. 1183–1186, ISSN 0036-8075
Endy, D. (2005). Foundations for engineering biology. Nature Vol.438, No. 7067, (November
2005), pp. 449-453, ISSN 0028-0836
Engler, C.; Gruetzner, R.; Kandzia, R. & Marillonnet, S. (2009). Golden gate shuffling: a one
pot DNA shuffling method based on type ils restriction enzymes. PLoS ONE Vol.4,
No.5, (May 2009), e5553, pp. 1-9, ISSN 1932-6203
Eriksson, F.; Tsagozis, P.; Lundberg, K.; Parsa, R.; Mangsbo, S.M.; Persson, M.A.; Harris,
R.A. & Pisa, P.J. (2009). Tumor-specific bacteriophages induce tumor destruction
through activation of tumor-associated macrophages. The Journal of Immunology
Vol.182, No.5, (March 2009), pp. 3105-3111, ISSN 0022-1767
Ferreira, C.S.; Matthews, C.S. & Missailidis, S. (2006). DNA aptamers that bind to MUC1
tumour marker: design and characterization of MUC1-binding single-stranded
DNA aptamers. Tumour Biology Vol.27, No.6, (October 2006), pp. 289-301, ISSN
1010-4283
Fisher, J. & Henzinger, T.A. (2007). Executable cell biology. Nature Biotechnology Vol.25,
No.11, (November 2007), pp. 1239–1249, ISSN 1087-0156
Forbes, N.S. (2010). Engineering the perfect (bacterial) cancer therapy. Nature Reviews Cancer
Vol.10, No.11, (October 2010), pp. 785-794, ISSN 1474-175X
Francois, P. & Hakim, V. (2004). Design of genetic networks with specified functions by
evolution in silico. PNAS Vol.101, No.2, (January 2004), pp. 580–585, ISSN 0027-
8424
Funahashi, A.; Morohashi, M. & Kitano, H. (2003). CellDesigner: a process diagram editor
for gene-regulatory and biochemical networks. BIOSILICO Vol.1, No.5, (November
2003), pp. 159–162, ISSN 1478-5382
Fung, E.; Wong, W.W.; Suen, J.K.; Bulter, T.; Lee, S.G. & Liao, J.C. (2005). A synthetic gene
metabolic oscillator. Nature Vol.435, No.7038, (May 2005), pp. 118–122, ISSN 0028-
0836
Gao, C.; Mao, S.; Ronca, F.; Zhuang, S.; Quaranta, V.; Wirsching, P. & Janda, K.D. (2003). De
novo identification of tumor-specific internalizing human antibody-receptor pairs
by phage-display methods. Journal of Immunological Methods. Vol.274, No.1-2,
(March 2003), pp. 185-197, ISSN 0022-1759
Gardner, T.S.; Cantor, C.R. & Collins, J.J. (2000). Construction of a genetic toggle switch in
Escherichia coli. Nature Vol.403, No.6767, (January 2000), pp. 339-342, ISSN 0028-
0836
Garst, A.D. & Batey, R.T. (2009). A switch in time: detailing the life of a riboswitch.
Biochimica Biophysica Acta Vol.1789, No.9-10, (September-October 2009), pp. 584-
591, ISSN 0006-3002
Goler, J.A. (2004). A design and simulation tool for synthetic biological systems. Cambridge,
MA: MIT
Golynskiy, M.V. & Seelig, B. (2010). De novo enzymes: from computational design to mRNA
display. Trends in Biotechnology Vol.28, No.7, (July 2010), pp. 340-345, ISSN 0167-
7799
Grunberg, R.; Nilges, M. & Leckner, J. (2007). Biskit —a software platform for structural
bioinformatics. Bioinformatics Vol.23, No.6, (March 2007), pp. 769-770, ISSN 1367-
4803
Ham, T.S.; Lee, S.K.; Keasling, J.D. & Arkin, A.P. (2006). A tightly regulated inducible
expression system utilizing the fim inversion recombination switch. Biotechnology
and Bioengineering Vol.94, No.1, (May 2006), pp. 1–4, ISSN 0006-3592
Hartley, J.L. (2003). Use of the gateway system for protein expression in multiple hosts.
Current Protocols in Protein Science, Chapter 5: Unit 5.17
Hawkins, T.; Luban, S. & Kihara, D. (2006). Enhanced automated function prediction using
distantly related sequences and contextual association by PFP. Protein Science
Vol.15, No.6, (June 2006), pp. 1550-1556, ISSN 1469-896X
Hianik, T. & Wang, J. (2009). Electrochemical aptasensors – recent achievements and
perspectives. Electroanalysis Vol.21, No.11, (June 2009), pp. 1223-1235, ISSN 1521-
4109
Hill, A.D.; Tomshine, J.R.; Weeding, E.M.; Sotiropoulos, V. & Kaznessis, Y.N. (2008).
SynBioSS: the synthetic biology modeling suite. Bioinformatics Vol.24, No.21,
(November 2008), pp. 2551-2553, ISSN 1367-4803
Hofacker, I.L. (2003). Vienna RNA secondary structure server. Nucleic Acids Research Vol.31,
No.13, (July 2003), pp. 3429-3431, ISSN 0305-1048
Isaacs, F.J.; Dwyer, D.J. & Collins, J.J. (2006). RNA synthetic biology. Nature Biotechnology
Vol.24, No.5, (May 2006), pp. 545-554, ISSN 1087-0156
Isalan, M.; Lemerle, C.; Michalodimitrakis, K.; Horn, C.; Beltrao, P.; Raineri, E.; Garriga-
Canut, M.; Serrano, L. (2008). Evolvability and hierarchy in rewired bacterial gene
networks. Nature Vol.452, No.7189, (April 2008), pp. 840–845, ISSN 0028-0836
Jemal, A.; Bray, F.; Center, M.M.; Ferlay, J.; Ward, E. & Forman, D. (2011). Global cancer
statistics. CA Cancer Journal for Clinicians Vol.61, No.2, (March-April 2011), pp. 69-
90, ISSN 0007-9235
Kaiser, J. (2005). Gene therapy. Putting the fingers on gene repair. Science Vol.310, No.5756,
(December 2005), pp.1894-1896, ISSN 0036-8075
Kaznessis, Y.N. (2009). Computational methods in synthetic biology. Biotechnology Journal
Vol.4, No.10, (October 2009), pp.1392-1405, ISSN 1860-7314
Kelly, J.; Rubin, A.J.; Davis, J. II; Ajo-Franklin, C.M.; Cumbers, J.; Czar, M.J.; de Mora, K.;
Glieberman, A.I.; Monie, D.D. & Endy, D. (2009). Measuring the activity of BioBrick
promoters using an in vivo reference standard. Journal of Biological Engineering
Vol.3, No.4, (March 2009), pp. 1-13, ISSN 1754-1611
Kirkpatrick, S.; Gelatt, C.D. Jr. & Vecchi, M.P. (1983). Optimization by Simulated Annealing.
Science Vol.220, No.4598, (May 1983), pp. 671-680, ISSN 0036-8075
Kitano, H. (2002). Systems biology: a brief overview. Science Vol.295, No.5560, (March 2002),
pp. 1662-1664, ISSN 0036-8075
Kizer, L.; Pitera, D.J.; Pfleger, B.F. & Keasling, J.D. (2008). Application of functional
genomics to pathway optimization for increased isoprenoid production. Applied
and Environmental Microbiology Vol.74, No.10, (May 2008), pp. 3229–3241, ISSN
0099-2240
Kobayashi, H.; Kaern, M.; Araki, M.; Chung, K.; Gardner, T.S.; Cantor, C.R. & Collins, J.J.
(2004). Programmable cells: interfacing natural and engineered gene networks.
PNAS Vol.101, No.22, (June 2004), pp. 8414–8419, ISSN 0027-8424
Koide, T.; Pang, W.L. & Baliga, N.S. (2009). The role of predictive modeling in rationally
reengineering biological systems. Nature Reviews Microbiology Vol.7, No.4, (April
2009), pp. 297-305, ISSN 1740-1526
Kosuri, S.; Kelly, J.R. & Endy, D. (2007). TABASCO: a single molecule, base pair resolved
gene expression simulator. BMC Bioinformatics Vol.8, No.480, (December 2007), pp.
1-15, ISSN 1471-2105
Lartigue, C.; Glass, J.I.; Alperovich, N.; Pieper, R.; Parmar, P.P.; Hutchison III, C.A.; Smith,
H.O. & Venter, J.C. (2007). Genome transplantation in bacteria: changing one
species to another. Science Vol.317, No.5838, (August 2007), pp. 632-638, ISSN 0036-
8075
Leonard, E.; Nielsen, D.; Solomon, K. & Prather, K.J. (2008). Engineering microbes with
synthetic biology frameworks. Trends in Biotechnology Vol.26, No.12, (December
2008), pp. 674-681, ISSN 0167-7799
Li, K.; Chen, Y.; Li, S.; Nguyen, H.G.; Niu, Z.; You, S.; Mello, C.M.; Lu, X. & Wang, Q. (2010).
Chemical modification of M13 bacteriophage and its application in cancer cell imaging.
Bioconjugate Chemistry Vol.21, No.7, (December 2010), pp. 1369-1377, ISSN 1043-
1802
Loettgers, A. (2007). Model organisms and mathematical and synthetic models to explore
regulation mechanisms. Biological Theory Vol.2, No.2, (December 2007), pp. 134-142,
ISSN 1555-5542
Mandell, J.G. & Barbas, C.F. III (2006). Zinc Finger Tools: custom DNA binding domains for
transcription factors and nucleases. Nucleic Acids Research Vol.34, (July 2006), pp.
W516-523, ISSN 0305-1048
Marchisio, M.A. & Stelling, J. (2008). Computational design of synthetic gene circuits with
composable parts. Bioinformatics Vol.24, No.17, (September 2008), pp.1903–1910,
ISSN 1367-4803
Marchisio, M.A. & Stelling, J. (2009). Computational design tools for synthetic biology.
Current Opinion in Biotechnology Vol.20, No.4, (August 2009), pp. 479-485, ISSN
0958-1669
Marguet, P.; Balagadde, F.; Tan, C. & You, L. (2007). Biology by design: reduction and
synthesis of cellular components and behavior. Journal of the Royal Society Interface
Vol.4, No.15, (August 2007), pp. 607-623, ISSN 1742-5689
Markham, N.R. & Zuker, M. (2008). UNAFold: software for nucleic acid folding and
hybridization. Methods Molecular Biology Vol.453, No.I, pp. 3-31, ISSN 1064-3745
Martin, V.J.; Pitera, D.J.; Withers, S.T.; Newman, J.D. & Keasling, J.D. (2003). Engineering a
mevalonate pathway in Escherichia coli for production of terpenoids. Nature
Biotechnology Vol.21, No.7, (July 2003), pp.796–802, ISSN 1087-0156
Matsuoka, Y.; Ghosh, S. & Kitano, H. (2009). Consistent design schematics for
biological systems: standardization of representation in biological engineering.
Journal of the Royal Society Interface Vol.6, No.4, (August 2009), pp. S393-S404, ISSN
1742-5689
McArthur IV, G.H. & Fong, S.S. (2010). Toward engineering synthetic microbial metabolism.
Journal of Biomedicine and Biotechnology doi:10.1155/2010/459760, ISSN 1110-7243
Mirschel, S.; Steinmetz, K.; Rempel, M.; Ginkel, M. & Gilles, E.D. (2009). PROMOT: modular
modeling for systems biology. Bioinformatics Vol.25, No.5, (March 2009), pp. 687-
689, ISSN 1367-4803
Morris, G.M.; Huey, R.; Lindstrom, W.; Sanner, M.F.; Belew, R.K.; Goodsell, D.S. & Olson,
A.J. (2009). AutoDock4 and AutoDockTools4: automated docking with selective
receptor flexibility. Journal of Computational Chemistry Vol.30, No.16, (December
2009), pp. 2785-2791, ISSN 0192-8651
Mueller, S.; Coleman, J.R. & Wimmer, E. (2009). Putting synthesis into biology: a viral view
of genetic engineering through de novo gene and genome synthesis. Chemistry &
Biology Vol.16, No.3, (March 2009), pp. 337-347, ISSN 1074-5521
Neumann, H. & Neumann-Staubitz, P. (2010). Synthetic biology approaches in drug
discovery and pharmaceutical biotechnology. Applied Microbiology and Biotechnology
Vol.87, No.1, (June 2010), pp. 75-86, ISSN 0175-7598
Nielsen, J. (2001). Metabolic engineering. Applied Microbiology and Biotechnology Vol.55, No.3,
(April 2001), pp. 263-283, ISSN 0175-7598
Park, J.H.; Lee, S.Y.; Kim, T.Y. & Kim, H.U. (2008). Application of systems biology for
bioprocess development. Trends in Biotechnology Vol.26, No.8, (August 2008), pp.
404–412, ISSN 0167-7799
Pawelek, J.; Low, K. & Bermudes, D. (2003). Bacteria as tumour-targeting vectors. Lancet
Oncology Vol.4, No.9, (September 2003), pp. 548–556, ISSN 1470-2045
Pedersen, M. & Phillips, A. (2009). Towards programming languages for genetic engineering
of living cells. Journal of the Royal Society Interface Vol.6, No.4, (August 2009), pp.
S437-S450, ISSN 1742-5689
Prather, K. & Martin, C.H. (2008). De novo biosynthetic pathways: rational design of
microbial chemical factories. Current Opinion in Biotechnology Vol.19, No.5, (October
2008), pp. 468–474, ISSN 0958-1669
Price, N.D. & Shmulevich, I. (2007). Biochemical and statistical network models for systems
biology. Current Opinion in Biotechnology Vol.18, No.4, (August 2007), pp. 365–370,
ISSN 0958-1669
Purnick, P.E. & Weiss, R. (2009). The second wave of synthetic biology: from modules to
systems. Nature Reviews Molecular Cell Biology Vol.10, No.6, (June 2009), pp. 410-422,
ISSN 1471-0072
Quan, J. & Tian, J. (2009). Circular polymerase extension cloning of complex 1 gene libraries
and pathways. PLoS ONE Vol.4, No.7, (July 2009), e6441, pp.1-6, ISSN 1932-6203
Richardson, S.M.; Wheelan, S.J.; Yarrington, R.M. & Boeke, J.D. (2006). GeneDesign: rapid,
automated design of multikilobase synthetic genes. Genome Research Vol.16, No.4,
(April 2006), pp. 550-556, ISSN 1088-9051
Ritchie, D.W. (2008). Recent progress and future directions in protein–protein
docking.Current Protein & Peptide Science Vol.9, No.1, (February 2008), pp. 1-15,
ISSN 1389-2037
Ro, D.K.; Paradise, E.M.; Ouellet, M.; Fisher, K.J.; Newman, K.L.; Ndungu, J.M.; Ho, K.A.;
Eachus, R.A.; Ham, T.S.; Kirby, J.; Chang, M.C.; Withers, S.T.; Shiba, Y.; Sarpong, R.
& Keasling, J.D. (2006). Production of the antimalarial drug precursor artemisinic
acid in engineered yeast. Nature Vol.440, No.7086, (April 2006), pp. 940–943, ISSN
0028-0836
Rocha, I.; Forster, J. & Nielsen, J. (2008). Design and application of genome-scale
reconstructed metabolic models. Methods in Molecular Biology Vol.416, No.IIIB, pp.
409-431, ISSN 1064-3745
Rocha, I.; Maia, P.; Evangelista, P.; Vilaça, P.; Soares, S.; Pinto, J.P.; Nielsen, J.; Patil, K.R.;
Ferreira, E.C. & Rocha, M. (2010). OptFlux: an open-source software platform for in
silico metabolic engineering. BMC Systems Biology Vol.4, No. 45, pp. 1-12, ISSN
1752-0509
Rodrigo, G.; Carrera, J. & Jaramillo, A. (2007a). Asmparts: assembly of biological model
parts. Systems and Synthetic Biology Vol.1, No.4, (December 2007), pp. 167-170, ISSN
1872-5325
Rodrigo, G.; Carrera, J. & Jaramillo, A. (2007b). Genetdes: automatic design of
transcriptional networks. Bioinformatics Vol.23, No.14, (July 2007), pp.1857-1858,
ISSN 1367-4803
Rodrigues, L.R.; Teixeira, J.A.; Schmitt, F.; Paulsson, M. & Lindmark Måsson, H. (2007). The
role of osteopontin in tumour progression and metastasis in breast cancer. Cancer
Epidemology Biomarkers & Prevention Vol.16, No.6, (June 2007), pp. 1087–1097, ISSN
1055-9965
Saito, H. & Inoue, T. (2009). Synthetic biology with RNA motifs. The International Journal of
Biochemistry & Cell Biology Vol.41, No.2, (February 2009), pp. 398-404, ISSN 1357-
2725
Salis, H.M.; Mirsky, E.A. & Voigt, C.A. (2009). Automated design of synthetic ribosome
binding sites to control protein expression. Nature Biotechnology Vol.27, No.10,
(October 2009), pp. 946–950, ISSN 1087-0156
Segre, D.; Vitkup, D. & Church, G.M. (2002). Analysis of optimality in natural and perturbed
metabolic networks. PNAS Vol.99, No.23, (November 2001), pp. 15112-15117, ISSN
0027-8424
Shetty, R.P.; Endy, D. & Knight, T.F. Jr. (2008). Engineering BioBrick vectors from BioBrick
parts. Journal of Biological Engineering Vol.2, No.1, (April 2008), pp. 5-17, ISSN 1754-
1611
Simons, K.T.; Bonneau, R.; Ruczinski, I. & Baker, D. (1999). Ab initio protein structure
prediction of CASP III targets using ROSETTA. Proteins Vol.3, pp. 171-176, ISSN
0887-3585
Steinmetz, N.F. (2010).Viral nanoparticles as platforms for next-generation therapeutics and
imaging devices. Nanomedicine: Nanotechnology, Biology, and Medicine Vol.6, No.5,
(October 2010), pp. 634-641, ISSN 1549-9634
Stoltenburg, R.; Reinemann, C. & Strehlitz, B. (2007). SELEX—A (r)evolutionary method to
generate high-affinity nucleic acid ligands. Biomolecular Engineering Vol.24, No.4,
(October 2007), pp. 381–403, ISSN 1389-0344
Topp, S. & Gallivan, J.P. (2007). Guiding bacteria with small molecules and RNA. Journal of
the American Chemical Society Vol.129, No.21, (May 2007), pp. 6807-6811, ISSN 0002-
7863
Tyo, K.E.; Alper, H.S. & Stephanopoulos, G. (2007). Expanding the metabolic engineering
toolbox: more options to engineer cells. Trends in Biotechnology Vol.25, No.3, (March
2007), pp. 132-137, ISSN 0167-7799
Tyo, K.E.J.; Ajikumar, P.K. & Stephanopoulos, G. (2009). Stabilized gene duplication enables
long-term selection-free heterologous pathway expression. Nature Biotechnology
Vol.27, No.8, (August 2009), pp. 760–765, ISSN 1087-0156
Villalobos, A.; Ness, J.E.; Gustafsson, C.; Minshull, J. & Govindarajan, S. (2006). Gene
Designer: a synthetic biology tool for constructing artificial DNA segments. BMC
Bioinformatics Vol.7, No.285, (June 2006), pp.1-8, ISSN 1471-2105
Voloshchuk, N. & Montclare, J.K. (2010). Incorporation of unnatural amino acids for
synthetic biology. Molecular Biosystems Vol.6, No.1, (January 2010), pp. 65-80, ISSN
1742-2051
Wang, H.H.; Isaacs, F.J.; Carr, P.A.; Sun, Z.Z.; Xu, G.; Forest, C.R. & Church, G.M. (2009).
Programming cells by multiplex genome engineering and accelerated evolution.
Nature Vol.460, No.7257, (August 2009), pp. 894–898, ISSN 0028-0836
Wierling, C.; Herwig, R. & Lehrach, H. (2007). Resources, standards and tools for systems
biology. Briefings in Functional Genomics and Proteomics Vol.6, No.3, (September
2007), pp. 240-251, ISSN 1473-9550
Win, M.N. & Smolke, C.D. (2007). From the cover: a modular and extensible RNA-based
gene-regulatory platform for engineering cellular function. PNAS Vol.104, No.36,
(September 2007), pp. 14283–14288, ISSN 0027-8424
Winkler, W.C. & Breaker, R.R. (2005). Regulation of bacterial gene expression by
riboswitches. Annual Review of Microbiology Vol.59, pp. 487-517, ISSN 0066-4227
Xu, J.; Li, M.; Kim, D. & Xu, Y. (2003). RAPTOR: optimal protein threading by linear
programming. Journal of Bioinformatics and Computational Biology Vol.1, No.1, (April
2003), pp. 95-117, ISSN 0219-7200
186 Computational Biology and Applied Bioinformatics

Yokobayashi, Y.; Weiss, R. & Arnold, F.H. (2002). Directed evolution of a genetic circuit.
PNAS Vol.99, No.26, (September 2002), pp. 16587–16591, ISSN 0027-8424
You, L. (2004). Toward computational systems biology. Cell Biochemistry and Biophysics
Vol.40, No.2, pp. 167–184, ISSN 1085-9195
9

An Overview of Hardware-Based Acceleration of Biological Sequence Alignment

Laiq Hasan and Zaid Al-Ars
TU Delft
The Netherlands

1. Introduction
Efficient biological sequence (proteins or DNA) alignment is an important and challenging
task in bioinformatics. It is similar to string matching in the context of biological data and
is used to infer the evolutionary relationship between a set of protein or DNA sequences.
An accurate alignment can provide valuable information for experimentation on the newly
found sequences. It is indispensable in basic research as well as in practical applications such
as pharmaceutical development, drug discovery, disease prevention and criminal forensics.
Many algorithms and methods, such as dot plot (Gibbs & McIntyre, 1970), Needleman-Wunsch
(N-W) (Needleman & Wunsch, 1970), Smith-Waterman (S-W) (Smith & Waterman, 1981),
FASTA (Pearson & Lipman, 1985), BLAST (Altschul et al., 1990), HMMER (Eddy, 1998) and
ClustalW (Thompson et al., 1994), have been proposed to perform and accelerate sequence
alignment. An overview of these methods is given in (Hasan et al., 2007). Of these, the
S-W algorithm is an optimal sequence alignment method, but its computational cost makes
it impractical for many purposes. To develop efficient yet optimal sequence alignment
solutions, the S-W algorithm has recently been implemented on emerging accelerator
platforms such as Field Programmable Gate Arrays (FPGAs), the Cell Broadband Engine
(Cell/B.E.) and Graphics Processing Units (GPUs) (Buyukkur & Najjar, 2008; Hasan et al., 2010;
Liu et al., 2009; 2010; Lu et al., 2008). This chapter aims at providing a broad overview of
sequence alignment in general, with particular emphasis on the classification and discussion
of the available methods and their comparison. Further, it reviews in detail the acceleration
approaches based on implementations on different platforms and compares them with
respect to different parameters. This chapter is organized as follows:
The remainder of this section gives a classification, discussion and comparison of the available
methods and their hardware acceleration. Section 2 introduces the S-W algorithm which is
the focus of discussion in the succeeding sections. Section 3 reviews CPU-based acceleration.
Section 4 provides a review of FPGA-based acceleration. Section 5 overviews GPU-based
acceleration. Section 6 presents a comparison of accelerations on different platforms, whereas
Section 7 concludes the chapter.

1.1 Classification
Sequence alignment aims at identifying regions of similarity between two DNA or protein
sequences (the query sequence and the subject or database sequence). Traditionally, the
methods of pairwise sequence alignment are classified as either global or local, where pairwise
means considering only two sequences at a time. Global methods attempt to match as many

characters as possible, from end to end, whereas local methods aim at identifying short
stretches of similarity between two sequences. However, in some cases it may also be
necessary to investigate the similarities within a group of sequences; hence, multiple sequence
alignment methods have been introduced. Multiple sequence alignment is an extension of pairwise
alignment to incorporate more than two sequences at a time. Such methods try to align all of
the sequences in a given query set simultaneously. Figure 1 gives a classification of various
available sequence alignment methods.

Fig. 1. Various methods for sequence alignment (global: dot plot, N-W; local: S-W, FASTA, BLAST; multiple: HMMER, ClustalW)
These methods are categorized into three types, i.e. global, local and multiple, as shown in the
figure. Further, the figure also identifies the exact methods and approximate methods. The
methods shown in Figure 1 are discussed briefly in the following subsection.

1.2 Discussion of available methods


Following is a brief description of the available methods for sequence alignment.

Global methods
Global methods aim at matching as many characters as possible, from end to end, between
two sequences, i.e. the query sequence (Q) and the database sequence (D). Methods carrying out
global alignment include dot plot and the N-W algorithm. Both are categorized as exact methods;
the difference is that dot plot is based on a basic search method, whereas N-W is based on
dynamic programming (DP) (Giegerich, 2000).

Local methods
In contrast to global methods, local methods attempt to identify short stretches of similarity
between two sequences, i.e. Q and D. These include an exact method, S-W, and heuristic-based
approximate methods such as FASTA and BLAST.

Multiple alignment methods


It might be of interest in some cases to consider the similarities between a group of sequences.
Multiple sequence alignment methods like HMMER and ClustalW are introduced to handle
such cases.

1.3 Comparison
The alignment methods can be compared on the basis of their temporal and spatial
complexities and on parameters such as alignment type and search procedure. A summary of the
comparison is shown in Table 1. It is interesting to note that all the global and local sequence
alignment methods essentially have the same computational complexity of O(L_Q L_D), where
L_Q and L_D are the lengths of the query and database sequences, respectively. Despite this,
the algorithms have very different running times, with BLAST being the fastest and the
dynamic programming algorithms being the slowest. Among the multiple sequence alignment
methods, ClustalW has the worst time complexity of O(L_Q^2 L_D^2), whereas HMMER has a
time complexity of O(L_Q L_D^2). The space complexities of all the alignment methods are also
essentially identical, around O(L_Q L_D), except for BLAST, whose space complexity is
O(20^w + L_Q L_D), with w the word length used to seed the search. Among the exact methods,
dot plot uses a basic search method, whereas N-W and S-W use DP. All the approximate
methods, on the other hand, are heuristic based. It is also worth noting that FASTA and BLAST
sacrifice sensitivity to achieve higher speed. Thus, a trade-off exists between speed and
sensitivity, and a compromise must be made to align sequences in a biologically relevant
manner within a reasonable amount of time.
Method     Type       Accuracy      Search      Time complexity    Space complexity
Dot plot   Global     Exact         Basic       O(L_Q L_D)         O(L_Q L_D)
N-W        Global     Exact         DP          O(L_Q L_D)         O(L_Q L_D)
S-W        Local      Exact         DP          O(L_Q L_D)         O(L_Q L_D)
FASTA      Local      Approximate   Heuristic   O(L_Q L_D)         O(L_Q L_D)
BLAST      Local      Approximate   Heuristic   O(L_Q L_D)         O(20^w + L_Q L_D)
HMMER      Multiple   Approximate   Heuristic   O(L_Q L_D^2)       O(L_Q L_D)
ClustalW   Multiple   Approximate   Heuristic   O(L_Q^2 L_D^2)     O(L_Q L_D)

Table 1. Comparison of various sequence alignment methods

1.4 Hardware platforms


Work has been done on accelerating sequence alignment methods, by implementing them on
various available hardware platforms. Following is a brief discussion about such platforms.

CPUs
CPUs are well known, flexible and scalable architectures. By exploiting the Streaming
SIMD Extension (SSE) instruction set on modern CPUs, the running time of the analyses is
decreased significantly, thereby making analyses of data intensive problems like sequence
alignment feasible. Emerging CPU technologies like multi-core also combine two or more
independent processors into a single package. The Single Instruction Multiple Data-stream
(SIMD) paradigm is heavily utilized in this class of processors, making them appropriate for
data-parallel applications like sequence alignment. SIMD describes CPUs with multiple processing
elements that perform the same operation on multiple data simultaneously. Thus, such
machines exploit data level parallelism. The SSE instruction set extension in modern CPUs
contains 70 new SIMD instructions. This extension greatly increases the performance when
exactly the same operations are to be performed on multiple data objects, making sequence
alignment a typical application.

FPGAs
FPGAs are reconfigurable data processing devices on which an algorithm is directly mapped
to basic processing logic elements. To take advantage of using an FPGA, one has to implement
massively parallel algorithms on this reconfigurable device. They are thus well suited for
certain classes of bioinformatics applications, such as sequence alignment. Methods like the
ones based on systolic arrays are used to accelerate such applications.

GPUs
Initially driven by the need for real-time graphics in video gaming, GPUs have evolved
into powerful and flexible vector processors, ideal for accelerating a variety of data-parallel
applications. Over the last couple of years, GPUs have developed from fixed-function
graphics processing units into flexible platforms that can be used for high performance
computing (HPC). Applications like bioinformatics sequence alignment can run very efficiently
on these architectures.

2. Smith-Waterman algorithm
In 1981, Smith and Waterman described a method, commonly known as the Smith-Waterman
(S-W) algorithm (Smith & Waterman, 1981), for finding common regions of local similarity.
The S-W method has been used as the basis for many subsequent algorithms and is often quoted
as a benchmark when comparing different alignment techniques. To obtain the local
S-W alignment, a matrix H is constructed using the following equation:


H_{i,j} = \max \left\{ 0,\; H_{i-1,j-1} + S_{i,j},\; H_{i-1,j} - d,\; H_{i,j-1} - d \right\}        (1)

where S_{i,j} is the match/mismatch score obtained by comparing character i of one sequence
with character j of the other, and d is the gap penalty. The algorithm can be
implemented using the following pseudo code.

Initialization:

  H(0,j) = 0   for j = 0 to N
  H(i,0) = 0   for i = 0 to M

Matrix Fill:

  for i = 1 to M
    for j = 1 to N
      H(i,j) = max(0,
                   H(i-1,j-1) + S(i,j),
                   H(i-1,j) - d,
                   H(i,j-1) - d)

Traceback:

  H(opt) = max over all (i,j) of H(i,j)
  traceback(H(opt))

The H matrix is constructed with one sequence lined up against the rows of the matrix and
the other against the columns, with the first row and column initialized to a predefined value
(usually zero), i.e. if the sequences are of length M and N respectively, then the matrix for the
alignment algorithm has (M + 1) × (N + 1) dimensions. The matrix fill stage scores each cell
in the matrix. This score is based on whether the two intersecting elements of the sequences
are a match, and also on the scores of the cell's neighbors to the left, above, and diagonally
to the upper left. Three separate scores are calculated based on these three neighbors, and
the maximum of them (or zero if a negative value would result) is assigned to the cell. This
is done for each cell in the matrix, resulting in O(MN) complexity for the matrix fill stage.
Even though the computation for each cell usually only consists of additions, subtractions
and comparisons of integers, the algorithm nevertheless performs very poorly when the
sequence lengths become large. The traceback step starts at the cell with the highest score
in the matrix and ends at a cell where the similarity score drops below a certain predefined
threshold. To do this, the algorithm must find the maximum-scoring cell, which is done by
traversing the entire matrix, making the time complexity of the traceback O(MN). It is also
possible to keep track of the cell with the maximum score during the matrix fill stage, although
this does not change the overall complexity. Thus, the total time complexity of the S-W
algorithm is O(MN). The space complexity is also O(MN).
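As a concrete illustration of the fill and traceback stages just described, the following minimal Python sketch implements the recurrence of Equation (1); the function name, the default scores and the linear gap penalty d are illustrative choices added here, not part of the original text.

    def smith_waterman(query, database, match=1, mismatch=-1, d=2):
        """Plain O(MN) S-W: fill H, then trace back from the highest-scoring cell."""
        M, N = len(query), len(database)
        H = [[0] * (N + 1) for _ in range(M + 1)]   # first row and column stay zero
        best, best_pos = 0, (0, 0)

        # Matrix fill: each cell takes the maximum of zero and its three neighbours.
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                s = match if query[i - 1] == database[j - 1] else mismatch
                H[i][j] = max(0,
                              H[i - 1][j - 1] + s,   # diagonal: match/mismatch
                              H[i - 1][j] - d,       # gap in the database sequence
                              H[i][j - 1] - d)       # gap in the query sequence
                if H[i][j] > best:                   # track the maximum during the fill
                    best, best_pos = H[i][j], (i, j)

        # Traceback: follow the move that produced each score until it drops to zero.
        i, j = best_pos
        aligned_q, aligned_d = [], []
        while i > 0 and j > 0 and H[i][j] > 0:
            s = match if query[i - 1] == database[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                aligned_q.append(query[i - 1]); aligned_d.append(database[j - 1])
                i, j = i - 1, j - 1
            elif H[i][j] == H[i - 1][j] - d:
                aligned_q.append(query[i - 1]); aligned_d.append('-')
                i -= 1
            else:
                aligned_q.append('-'); aligned_d.append(database[j - 1])
                j -= 1
        return best, ''.join(reversed(aligned_q)), ''.join(reversed(aligned_d))

With these default parameters (match = +1, mismatch = -1, d = 2), calling smith_waterman("GACTC", "GATTA") reproduces the H values shown in Figure 2 and returns the local alignment GA / GA with score 2.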
In order to reduce the O(MN) complexity of the matrix fill stage, multiple entries of the
H matrix can be calculated in parallel. This is, however, complicated by data dependencies,
whereby each H_{i,j} entry depends on the values of the three neighboring entries H_{i,j-1},
H_{i-1,j} and H_{i-1,j-1}, with each of those entries in turn depending on the values of its
three neighboring entries, which effectively means that this dependency extends to every
other entry in the region {H_{x,y} : x ≤ i, y ≤ j}. This implies that it is possible to
simultaneously compute all the elements in each anti-diagonal, since they fall outside each
other's data dependency regions. Figure 2 shows a sample H matrix for two sequences, with
the bounding boxes indicating the elements that can be computed in parallel. The bottom-right
cell is highlighted to show that its data dependency region is the entire remaining matrix.
The dark diagonal arrow indicates the direction in which the computation progresses. At
least 9 cycles are required for this computation, as there are 9 bounding boxes representing
9 anti-diagonals, and a maximum of 5 cells may be computed in parallel.
The degree of parallelism is constrained by the number of elements in the anti-diagonal, and
the maximum number of elements that can be computed in parallel is equal to the number of
elements in the longest anti-diagonal (l_d), where

l_d = \min(M, N)        (2)

Theoretically, the lower bound on the number of steps required to calculate the entries of
the H matrix in a parallel implementation of the S-W algorithm is equal to the number of
anti-diagonals required to reach the bottom-right element, i.e. M + N - 1 (Liao et al., 2004).
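The anti-diagonal scheme itself is compact: the Python sketch below (an illustration added here, not code from the chapter) fills H one anti-diagonal at a time, and since every cell of a given anti-diagonal reads only values from the two preceding anti-diagonals, all of its cells could be computed in parallel.

    def sw_fill_wavefront(query, database, match=1, mismatch=-1, d=2):
        """Fill the S-W matrix H in anti-diagonal (wavefront) order."""
        M, N = len(query), len(database)
        H = [[0] * (N + 1) for _ in range(M + 1)]
        # k = i + j identifies an anti-diagonal; there are M + N - 1 of them.
        for k in range(2, M + N + 1):
            # Cells on this anti-diagonal; at most min(M, N) of them (= l_d).
            cells = [(i, k - i) for i in range(max(1, k - N), min(M, k - 1) + 1)]
            for i, j in cells:   # independent of each other: one parallel step in hardware
                s = match if query[i - 1] == database[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - d, H[i][j - 1] - d)
        return H

For the 5 x 5 example of Figure 2 the outer loop runs 9 times, matching the 9 anti-diagonals mentioned above, and at most min(M, N) = 5 cells are handled per step.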
Figure 3 shows the logic circuit to compute an element of the H matrix. The logic contains
three adders, a sequence comparator circuit (SeqCmp) and three max operators (MAX). The
sequence comparator compares the corresponding characters of two input sequences and
outputs a match/mismatch score, depending on whether the two characters are equal or not.
Each max operator finds the maximum of its two inputs. The time to compute an element is
4 cycles, assuming that the time for each cycle is equal to the latency of one add or compare
operation.

        G   A   T   T   A
    0   0   0   0   0   0
G   0   1   0   0   0   0
A   0   0   2   0   0   1
C   0   0   0   1   0   0
T   0   0   0   1   2   0
C   0   0   0   0   0   1

Fig. 2. Sample H matrix; in the original figure, dotted rectangles mark the anti-diagonals whose elements can be computed in parallel

Fig. 3. Logic circuit to compute cells in the H matrix, where + is an adder, MAX is a max operator and SeqCmp is the sequence comparator that generates match/mismatch scores

3. CPU-based acceleration
In this section CPU-based acceleration of the S-W algorithm is reviewed. Furthermore, an
estimation of the performance for top-end and future systems is made.

3.1 Recent implementations


The first CPU implementations used a sequential way of calculating all the matrix values.
These implementations were slow and therefore hardly used. In 2007, Farrar introduced
an SSE implementation of S-W (Farrar, 2007). His work used SSE2 instructions on an Intel
processor and was up to six times faster than existing S-W implementations. Two years later,
a Smith-Waterman implementation on the PlayStation 3 (SWPS3) was introduced (Szalkowski et al.,
2009), based on a minor adjustment to Farrar's implementation. SWPS3 is a
vectorized implementation of the Smith-Waterman local alignment algorithm optimized for
both the IBM Cell/B.E. and Intel x86 architectures. A SWPS3 version optimized for
multi-threading has been released recently (Aldinucci et al., 2010). The SSE implementations
can be viewed as semi-parallel, as they constantly calculate sixteen, eight or fewer values at
the same time (disregarding start-up and finish time). Table 2 presents the performance
achieved by these implementations on various CPU platforms.
Implementation              Peak performance   Benchmark hardware                            Peak performance (per thread)
(Farrar, 2007)              2.9 GCUPS          2.0 GHz Xeon Core 2 Duo, single thread        3.75 GCUPS
(Szalkowski et al., 2009)   15.7 GCUPS         2.4 GHz Core 2 Quad Q6600, 4 threads          4.08 GCUPS
(Aldinucci et al., 2010)    35 GCUPS           2.5 GHz 2x Xeon Core Quad E5420, 8 threads    4.38 GCUPS

Table 2. Performance achieved by various S-W CPU implementations (Vermij, 2011)

3.2 Performance estimations for top-end and future CPUs


With the data from Table 2, we make an estimate of the performance of current top-end
CPUs and take a look into the future. Table 3 gives the estimated peak performance based
on the SIMD register width, the number of cores, the clock speed and the known speed per
core. We assume linear scaling in the number of cores, as suggested by Table 2, so the given
figures may not be reliable: non-ideal inter-core communication, memory bandwidth
limitations and shared caches could all lead to a lower peak performance.
Furthermore, no distinction in performance is made between Intel and AMD processors.
Hence, Table 3 should be taken only as an indication of where S-W performance could go
on current and future CPUs (Vermij, 2011).
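One plausible way to reproduce such estimates (the helper below is an illustration added here, not code from the cited work) is to normalize a measured per-thread throughput by its clock frequency and then scale it linearly with core count and clock speed.

    def estimate_peak_gcups(per_thread_gcups, measured_clock_ghz, cores, clock_ghz):
        """Linear-scaling estimate: GCUPS per GHz per core, scaled to another CPU."""
        gcups_per_ghz_per_core = per_thread_gcups / measured_clock_ghz
        return gcups_per_ghz_per_core * clock_ghz * cores

    # Starting from the 4.38 GCUPS per thread at 2.5 GHz reported in Table 2:
    # estimate_peak_gcups(4.38, 2.5, 8, 2.26)   # ~31.7 GCUPS, close to the 32 GCUPS of Table 3
    # estimate_peak_gcups(4.38, 2.5, 12, 2.3)   # ~48.4 GCUPS, close to the 48 GCUPS of Table 3
    # estimate_peak_gcups(4.38, 2.5, 16, 2.3)   # ~64.5 GCUPS, close to the 64 GCUPS of Table 3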

4. FPGA-based acceleration
FPGAs are programmable logic devices. To map an application onto these flexible platforms,
a program is written in a hardware description language such as VHDL.

System                Released   SIMD register width (bits)   Cores (threads)   Clock speed   Peak performance (estimated)
Xeon Beckton          2010       128                          8 (16)            2.26 GHz      32 GCUPS
Opteron Magny-Cours   2010       128                          12 (12)           2.3 GHz       48 GCUPS
Opteron Interlagos    2011       128                          16 (16)           2.3 GHz       64 GCUPS

Table 3. Estimated peak performance for current top-end and future CPUs (Vermij, 2011)

The flexibility and difficulty of design of FPGA implementations, as well as their performance,
typically fall somewhere between pure software running on a CPU and an Application Specific
Integrated Circuit (ASIC). FPGAs are widely used to accelerate applications like S-W based
sequence alignment. Implementations rely on the ability to create building blocks called
processing elements (PEs) that can update one matrix cell every clock cycle. Furthermore,
multiple PEs can be linked together in two-dimensional or linear systolic arrays to process
large amounts of data in parallel. This section provides a brief description of traditional
systolic arrays, followed by a discussion of existing and future FPGA-based S-W implementations.

4.1 Systolic arrays


A systolic array is an arrangement of processors in which data flows synchronously across
the array between neighbors, usually in a specific direction (Kung & Leiserson, 1979;
Quinton & Robert, 1991). At each step, each processor takes in data from one or more
neighbors (e.g. north and west), processes it and, in the next step, outputs the results to the
opposite neighbors (south and east). Systolic arrays can be implemented in rectangular or
two-dimensional (2D) form and in linear or one-dimensional (1D) form. Figure 4 gives a
pictorial view of both implementation types.
Systolic arrays are well suited to compute-intensive applications like biological sequence
alignment. Their disadvantage is that, being a highly specialized processor type, they are
difficult to implement and build.
In (Pfeiffer et al., 2005), a concept for accelerating the S-W algorithm on the basis of a linear
systolic array is demonstrated. The choice of this architecture is motivated by its efficiency
and simplicity in combination with the algorithm. Two key techniques are used to speed up
this massively parallel system. The first is switching the processing from bit-parallel to
bit-serial; on its own this change is performance neutral, but in combination with early
maximum detection a considerable speedup becomes possible. A side effect of this change is
a data-dependent execution time of the processing elements. The second technique therefore
prevents idle times, so that the hardware is fully exploited, by using globally asynchronous
timing, resulting in a self-timed linear systolic array. The authors provided no performance
estimates because their work was at an initial stage, so it cannot be compared with other
related work.
In (Vermij, 2011), the working of a linear systolic array (LSA) is explained. Such an array
works like the SSE unit in a modern CPU, but instead of having a fixed width of, say, 16
elements, the FPGA-based array can have any length.
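To make the behaviour of such an array concrete, the following software model (an illustrative sketch added here, not code from (Vermij, 2011)) mimics a linear systolic array performing the S-W matrix fill: PE i holds query character i, the database characters stream past one PE per step, and each PE needs only its own previous output plus two values handed down by its upstream neighbour.

    def sw_linear_systolic(query, database, match=1, mismatch=-1, d=2):
        """Step-by-step model of a linear systolic array computing the S-W fill.

        At step t, PE i (i = 1..M) works on database column j = t - i + 1, so the
        active PEs form exactly one anti-diagonal. Only the best score is kept
        (no traceback), as is common in FPGA designs."""
        M, N = len(query), len(database)
        own_prev = [0] * (M + 1)  # own_prev[i]: last score produced by PE i   (H[i][j-1])
        diag = [0] * (M + 1)      # diag[i]: upstream output two steps ago     (H[i-1][j-1])
        best = 0
        for t in range(1, M + N):                 # one clock step per anti-diagonal
            new_own = own_prev[:]                 # outputs produced during this step
            for i in range(1, M + 1):             # in hardware all PEs work concurrently
                j = t - i + 1                     # database character seen by PE i at step t
                if 1 <= j <= N:
                    s = match if query[i - 1] == database[j - 1] else mismatch
                    h = max(0,
                            diag[i] + s,            # H[i-1][j-1] + S(i,j)
                            own_prev[i - 1] - d,    # H[i-1][j], handed down by PE i-1
                            own_prev[i] - d)        # H[i][j-1], the PE's own previous output
                    new_own[i] = h
                    best = max(best, h)
            for i in range(1, M + 1):             # shift registers: the upstream PE's old
                diag[i] = own_prev[i - 1]         # output becomes the next diagonal input
            own_prev = new_own
        return best

In an actual FPGA the inner loop over PEs disappears, since all active PEs update in the same clock cycle; the model returns only the maximum score, which for the sequences of Figure 2 is again 2.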

Fig. 4. Pictorial view of systolic array architectures: (a) rectangular (2D) systolic array; (b) linear (1D) systolic array

4.2 Existing FPGA implementations


In Section 3, we discussed some existing S-W implementations running on a CPU. A
comparable analysis for FPGAs is rather hard. There are very few real, complete
implementations that give usable results. Most research implementations only discuss
synthetic tests, giving very optimistic numbers for implementations that are hardly used in
practice. Furthermore, there is a great variety in the types of FPGAs used. Since every FPGA
series has a different way of implementing circuitry, it is hard to make a fair comparison.
In addition, the performance of the implementations relies heavily on the data widths used.
Smaller data widths lead to smaller PEs, which lead to faster implementations. These numbers
are not usually published. The first, third and fourth implementations shown in Table 4 make
this clear, where the performance is given in terms of Giga Cell Updates Per Second (GCUPS).
Using the same FPGA device, these three implementations differ significantly in performance.
The most reliable numbers are from Convey and SciEngines, as shown in the last two entries
of Table 4. These implementations work the same way in practice on real cases and are built
for maximal performance (Vermij, 2011).
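Since each PE updates one matrix cell per clock cycle, the peak throughput of such designs follows directly from the PE count and the clock frequency; the small helper below (an illustration added here) reproduces several of the per-FPGA figures listed in Table 4.

    def fpga_peak_gcups(num_pes, clock_mhz):
        """Peak cell updates per second for a systolic design: one update per PE per cycle."""
        return num_pes * clock_mhz / 1000.0   # MCUPS converted to GCUPS

    # fpga_peak_gcups(252, 55)     # ~13.9 GCUPS  (Oliver et al., 2005)
    # fpga_peak_gcups(384, 66.7)   # ~25.6 GCUPS  (Altera, 2007)
    # fpga_peak_gcups(120, 200)    #  24.0 GCUPS  (Cray, 2010 lists 24.1)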

Reference                   FPGA                Frequency   PEs    Performance (per FPGA)   Performance (per system)
(Puttegowda et al., 2003)   Virtex2 XC2V6000    180 MHz     7000   1260 GCUPS               —
(Yu et al., 2003)           Virtex2 XCV1000-6   —           4032   742 GCUPS                —
(Oliver et al., 2005)       Virtex2 XC2V6000    55 MHz      252    13.9 GCUPS               —
(Gok & Yilmaz, 2006)        Virtex2 XC2V6000    112 MHz     482    54 GCUPS                 —
(Altera, 2007)              Stratix2 EP2S180    66.7 MHz    384    25.6 GCUPS               —
(Cray, 2010)                Virtex4             200 MHz     120    24.1 GCUPS               —