Distributed and Sequential Algorithms for Bioinformatics
K. Erciyes
Distributed and Sequential Algorithms for Bioinformatics
Computational Biology
Volume 23
Editors-in-Chief
Andreas Dress
CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial
Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya
Princeton University, Princeton, NJ, USA
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden,
Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA
Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Texas, Austin, TX, USA
Lonnie Welch, Ohio University, Athens, OH, USA
The Computational Biology series publishes the very latest, high-quality research
devoted to specific issues in computer-assisted analysis of biological data. The main
emphasis is on current scientific developments and innovative techniques in
computational biology (bioinformatics), bringing to light methods from mathemat-
ics, statistics and computer science that directly address biological problems
currently under investigation.
The series offers publications that present the state-of-the-art regarding the
problems in question; show computational biology/bioinformatics methods at work;
and finally discuss anticipated demands regarding developments in future
methodology. Titles can range from focused monographs, to undergraduate and
graduate textbooks, and professional text/reference works.
K. Erciyes
Distributed and Sequential Algorithms for Bioinformatics
K. Erciyes
Computer Engineering Department
Izmir University
Uckuyular, Izmir
Turkey
ISSN 1568-2684
Computational Biology
ISBN 978-3-319-24964-3 ISBN 978-3-319-24966-7 (eBook)
DOI 10.1007/978-3-319-24966-7
To the memory of Atilla Ozerdim and all
disabled people in his name.
Atilla was a former student and then a
colleague. He was highly intelligent but
severely paralyzed below the neck. He used
a computer by typing with a stick held in his
mouth, and wrote thousands of lines of code
and carried out his research this way.
Preface
Technological advancements in the last few decades have provided vast and
unprecedented amounts of biological data, including DNA and cell data and data
on biological networks. This data comes in two basic formats: DNA nucleotide and
protein amino acid sequences and, more recently, topological data of biological
networks. Analysis of this huge volume of data is a task on its own, and the problems
encountered are NP-hard most of the time, defying solutions in polynomial time.
Such analysis is required as it provides a fundamental understanding of the func-
tioning of a cell, which can in turn aid the understanding of human health and
disease states and the diagnosis of diseases, and can further aid the development
of biotechnological processes for medical purposes such as the treatment of diseases.
Instead of searching for optimal solutions to these difficult problems, approxi-
mation algorithms that provide sub-optimal solutions are usually preferred. An
approximation algorithm should guarantee a solution within an approximation
factor for all input combinations. In many cases, even approximation algorithms are
not known to date, and heuristics that are experimentally shown to work for most
input cases are accepted as solutions.
Under these circumstances, there is increasing demand and interest in the
research community for parallel/distributed algorithms to solve these problems
efficiently using a number of processing elements. This book is about both
sequential and distributed algorithms for the analysis and modeling of biological
data and as such, it is one of the first on this topic to the best of our knowledge.
In the context of this book, we take a distributed algorithm to mean a parallel
algorithm executed on a distributed-memory processing system using
message passing, rather than on special-purpose parallel architectures. For the cases of
shared memory parallel computing, we will use the term parallel algorithm
explicitly. We also cover algorithms for biological sequences (DNA and protein)
and biological network (protein interaction networks, gene regulation, etc.) data in
the same volume. Although algorithms for DNA sequences have a longer history of
study, even sequential algorithms for biological networks such as protein
interaction networks are rare and at an early stage of development. We aim to give
a unified view of algorithms for sequences and networks of
biological systems where possible. These two views are not totally unrelated; for
example, the function of a protein is influenced by both its position in a network
and its amino acid sequence, and also by its 3-D structure. It can also be seen that
the problems in the sequence and network domains are analogous to some extent;
for example, sequence alignment and network alignment, sequence motifs and
network motifs, sequence clustering and network clustering are analogous problems
in these two related worlds. It is not difficult to predict that the analysis of biological
networks will have a profound effect on our understanding of the origins of life and
of health and disease states, just as the analysis of DNA/RNA and protein sequences has.
Parallel and distributed algorithms are needed to solve bioinformatics
problems simply for the speedup they provide. Even linear-time algorithms may
be impractical in bioinformatics due to the size of the data involved. For
example, suffix trees are fundamental data structures in bioinformatics, and
constructing them takes O(n) time using relatively new algorithms such as Ukkonen's or
Farach's. Considering human DNA, which consists of over 3 billion base pairs, even
these linear-time algorithms are time-consuming. However, by using distributed
suffix trees, the time can be reduced to O(n/k), where k is the number of processors.
One wonders then about the scarcity of the research efforts in the design and
implementation of distributed algorithms for these time-consuming difficult tasks.
A possible reason would be that a number of problems have been introduced
recently and the general approach in the research community has been to search for
sequential algorithmic solutions first and then investigate ways of parallelizing
these algorithms or design totally new parallel/distributed algorithms. Moreover,
parallel and distributed computing is a discipline of its own, and researchers in
this field may not be familiar with bioinformatics problems in general; the
multidisciplinary efforts between this discipline and bioinformatics seem to be just
starting. This book is an effort to contribute to filling the aforementioned gap
between distributed computing and bioinformatics. Our main goal is to first
introduce the fundamental sequential algorithms to understand the problem and
then describe distributed algorithms that can be used for fundamental bioinfor-
matics problems such as sequence and network alignment, and finding sequence
and network motifs, and clustering. We review the most fundamental sequential
algorithms, which aid understanding of the problem and lend themselves to
parallel/distributed versions. In other words, we try to be as comprehensive as
possible in the coverage of parallel/distributed algorithms for the fundamental
bioinformatics problems with an in-depth analysis of sequential algorithms.
Writing a book on bioinformatics is a challenging task for a number of reasons.
First of all, there are so many diverse topics to consider, from clustering to genome
rearrangement, from network motif search to phylogeny, and one has to be selective
not to divert greatly from the initially set objectives. We had to carefully choose a
subset of topics for this book, in line with the book's title and aim: tasks
that require substantial computational power due to their data sizes and are therefore
good candidates for parallelization. Second, bioinformatics is a very dynamic area
of study with frequent new technological advances and results which requires a
thorough survey of contemporary literature on presented topics. The two worlds of
bioinformatics, namely biological sequences and biological networks, both have
similar challenging problems. A closer look reveals that these two worlds in fact have
many analogous problems.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Biological Sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Biological Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 The Need for Distributed Algorithms. . . . . . . . . . . . . . . . . . 6
1.5 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Part I Background
2 Introduction to Molecular Biology . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Central Dogma of Life. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 The Genetic Code . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Mutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Biotechnological Methods . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Polymerase Chain Reaction . . . . . . . . . . . . . . . . . . 20
2.4.3 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Nucleotide Databases. . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Protein Sequence Databases . . . . . . . . . . . . . . . . . . 22
2.6 Human Genome Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
1 Introduction
1.1 Introduction
Biology is the science of life and living organisms. An organism is a living entity that
may consist of organs that are made of tissues. Cells are the building blocks of organ-
isms and form tissues of organs. Cells consist of molecules and molecular biology
is the science of studying the cell at molecular level. The nucleus of a cell contains
deoxyribonucleic acid (DNA) which stores all of the genetic information. DNA is
a double-helix structure consisting of four types of molecules called nucleotides; it
is a long sequence of about 3 billion nucleotide pairs. From the viewpoint
of computing, DNA is simply a string that has a four-letter alphabet. The ribonucleic
acid (RNA) has one strand and also consists of four types of nucleotides like DNA,
with one different nucleotide. Proteins are large molecules outside the nucleus of the
cell and perform vital life functions. A protein is basically a linear chain of molecules
called amino acids. Molecules of the cell interact to perform all necessary functions
for living.
Recent technological advancements have provided vast amounts of biological
data at the molecular level. Analyzing and extracting meaningful information from this
data requires new methods and approaches. This data comes in two basic forms:
sequence and network data. On one hand, we are provided with sequence data of
DNA/RNA and proteins, and on the other hand, we have topological information
about the connectivity of various networks within the cell. Analysis of this data is a
task on its own due to its huge size.
In this chapter, we first describe the problems encountered in the analysis of biological
sequences and networks, and then explain why distributed algorithms are imperative
as computational methods. The design and implementation of distributed
algorithms seem inevitable for these time-consuming, difficult problems, and
their scarcity can be attributed to the relatively recent availability of biological data
and the multidisciplinary nature of the field. We conclude by providing the outline
of the book.
1.2 Biological Sequences
The biological sequences in the cell we are referring to are the nucleotide sequences in
DNA, RNA, and amino acid sequences in proteins. DNA contains four nucleotides in
a double-helix-shaped two-strand structure: Adenine (A), Cytosine (C), Guanine (G),
and Thymine (T). Adenine always pairs with thymine, and guanine with cytosine.
The primary structure of a protein consists of a linear chain of amino acids and the
order of these affects its 3D shape. Figure 1.1 shows a simplified diagram of DNA
and a protein.
Analysis of these biological sequences involves the following tasks:
Fig. 1.1 (a) The DNA double helix structure. (b) A protein structure with a linear sequence of amino acids
• Finding repeats: The number and locations of repeats are unique to individuals and
can be used in forensics to identify individuals.
• Gene finding: A gene codes for a polypeptide which can form, or be part of, a
protein. There are over 20,000 genes in human DNA, occupying only about
3% of the human genome. Finding genes is a fundamental step in their analysis. A
mutated gene may cause the formation of a wrong protein, disturbing the healthy
state of an organism, although mutations are harmless in many cases.
• Genome rearrangements: Mutations of DNA at a coarser level than point mutations
of nucleotides involve certain alterations of segments or genes in DNA. These
changes include reversals, duplications, and transfer to different locations in DNA.
Genome rearrangements may result in the production of new species, but in many
cases they are considered causes of complex diseases.
• Haplotype inference: DNA sequencing methods for humans provide the order
of DNA nucleotides from two chromosomes together, as this approach is more
cost-effective and practical. However, the sequence information from a single
chromosome, called a haplotype, is needed for disease analysis and also to discover
ancestral relationships. Discovering single-chromosome data from the two
chromosomes is called haplotype inference and is needed for the data to be meaningful.
All of the above-described tasks are fundamental areas of research in the analysis
of biological sequences. Comparison of sequences, clustering, and finding repeats
apply both to DNA/RNA and protein sequences. Protein sequences may also be
employed to discover genes as they are the product of genes; however, genome
rearrangements and haplotype inference problems are commonly associated with
the DNA/RNA sequences. Except for the sequence alignment problem, there are
hardly any polynomial-time algorithms for these problems. Even when there is a solution
in polynomial time, the size of the data necessitates the use of approximation algorithms
where they are available. As we will see, heuristic algorithms that can be shown
experimentally to work for a wide range of input combinations are the only choice in
many cases.
1.3 Biological Networks
Biological networks consist of biological entities which interact in some form. The
modeling and analysis of biological networks are fundamental areas of research
in bioinformatics. The number of nodes in a biological network is large and these
nodes have complex relations among them. We can represent a biological network
by a graph where an edge between two entities indicates an interaction between
them. This way, many results in graph theory and also various graph algorithms
become available for immediate use to help solve a number of problems in biological
networks.
We can make a coarse distinction between the networks in the cell and other bio-
logical networks. The cell contains DNA, RNA, proteins, and metabolites at the
molecular level; the networks at this level include gene regulation networks, signal
transduction networks, protein–protein interaction (PPI) networks, and metabolic
networks. DNA is static, containing the genetic code, and proteins take part in various
vital functions in the cell. Genes in DNA code for proteins in a process called gene
expression.
DNA, RNA, proteins, and metabolites all have their own networks. Intracellular
biological networks have molecules within the cell as their nodes and their interac-
tions as links. PPI networks have proteins which interact with each other to cooperate
to accomplish various vital functions for life. PPI networks are not constructed
randomly; they show specific structures. They have a small number of highly connected
nodes called hubs, while the rest of the nodes in the network have very few connec-
tions to other proteins. The distance between the two farthest proteins in such a network
is very small compared to the size of the network. Such a network is called a
small-world network, as it is relatively easy to reach a node from any other node
in the network. Hubs in PPI networks form dense regions of the network and these
clusters of nodes called protein complexes usually have some fundamental functions
attributed to them. It is therefore desirable to detect these clusters in PPI networks
and we will investigate algorithms for this purpose in Chap. 12. A PPI network has
very few hubs and many low-degree nodes. Such networks are commonly called
scale-free networks and a hub in a PPI network is presumed to act as a gateway
between a large number of nodes. Figure 1.2 shows a subnetwork of a PPI network
of the immune system where small-world and scale-free properties can be visually
detected along with a number of hubs.
The main areas of investigation in biological networks are the following:
Fig. 1.2 A subnetwork of the SpA PPI network involved in innate immune response (taken from
[1]). Triangles are the genes
• Network motif search: There are only 13 possible motifs with 4 nodes, while the
number of all subgraphs of size 5 is 51. Comparison of networks, and hence the deduction
of ancestral relationships, is also possible by searching for common motifs in them.
• Network alignment: In many cases, two or more biological networks may be similar
and finding the similar subgraphs may show the conserved and therefore important
topological structures in them. A fundamental problem in biological networks is the
assessment of this similarity between two or more biological networks. Analogous
to sequence alignment of DNA/RNA and proteins, the network alignment process
aims to align two or more biological networks. This time, we basically search for
topological similarities between the networks. Node similarity is also considered
together with topological similarity to have a better view of the affinity of two
or more networks. Our aim in alignment is very similar to sequence alignment
in a different domain, and we try to compare two or more networks, find their
conserved modules, and infer ancestral relationships between them.
• Phylogeny: Phylogeny is the study and analysis of evolutionary relationships
among organisms. Phylogenetics aims to discover these relationships using mole-
cular sequence and morphological data to reconstruct evolutionary dependencies
among the organisms. Phylogenetics is one of the most studied and researched
topics in bioinformatics and its implications are numerous. It can help disease
analysis and design of drug therapies by analyzing phylogenetic dependencies of
pathogens. Transmission characteristics of diseases can also be predicted using
phylogenetics.
Most of the problems outlined above do not have solutions in polynomial time,
and we are left with approximation algorithms in rare cases but mostly with heuristic
algorithms. Since the size of the networks under consideration is huge, sampling
the network and running the algorithm on the sample is the approach typically followed
by many methods. This approach provides a solution in reasonable time at the expense
of decreased accuracy.
1.4 The Need for Distributed Algorithms
There are many challenges in bioinformatics, and we can first state that obtaining
meaningful data is difficult in general as it is noisy most of the time. The size of data
is large, making it difficult to find time-efficient algorithms. The complexity class P
contains problems that can be solved in polynomial time, and some problems we need
to solve in bioinformatics belong to this class. Many problems, however, can only
be solved in exponential time, and in some cases no efficient solutions to the problem
at hand are known. These problems belong to the class of nondeterministic polynomial-time
hard (NP-hard) problems, and for such hard problems we need to rely on approximation
algorithms, which find suboptimal solutions with bounded approximation ratios. Most of the
time, there are no known approximation algorithms, and we need to rely on heuristic
methods that are shown experimentally to work for most input
combinations.
Parallel algorithms and distributed algorithms provide faster execution as a num-
ber of processing elements are used in parallel. Here, we make a distinction between
shared memory parallel processing where processing elements communicate using a
shared and protected memory and distributed processing where computational nodes
of a communication network communicate by the transfer of messages without any
shared memory. Our main focus in this book is on distributed, that is, distributed-memory,
message-passing algorithms, although we will also cover a number of
shared memory parallel algorithms. The efficiency of a parallel/distributed algorithm
is basically measured by the speedup it provides, which is expressed as the ratio of
the running time T(s) of the sequential algorithm to the running time T(p) of
the parallel algorithm. If k processing elements are used, the speedup should
ideally approach k. However, the synchronization delays in shared
memory parallel processing and the message delays of the interconnection network
in distributed processing mean that the speedup value can be far from ideal.
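In symbols, with S denoting the speedup and E the parallel efficiency (a standard companion measure, added here for reference):

\[
S = \frac{T(s)}{T(p)}, \qquad E = \frac{S}{k}, \qquad 0 < E \le 1,
\]

so an ideal algorithm on k processing elements achieves S = k and E = 1.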
Whether an algorithm finds an exact solution or an approximate solution in
polynomial time, bioinformatics problems have an additional property: the size of
the input data is huge. We may therefore attempt to provide a parallel or distributed
version of an algorithm even if it does solve the problem in polynomial time. For example,
suffix trees are versatile data structures that find various implementations in algorithmic
solutions to bioinformatics problems. There exist algorithms that construct suffix trees
in a maximum of n steps, where n is the length of the sequence from which the suffix tree
is built. Although this time is linear, it can still be prohibitive at genome scale.
1.5 Outline of the Book
Our description of algorithms in each chapter follows the same model, where the
problem is first introduced informally and the related concepts are described. We
then present the problem formally with the notation used and review the fundamen-
tal sequential algorithms with emphasis on the ones that have potential to be modified
and executed in parallel/distributed environments. After this step, we usually pro-
vide a template of a generic algorithm that summarizes the general approaches to
have a distributed algorithm for the problem at hand. This task can be accomplished
basically by partitioning data, partitioning code or both among the processing ele-
ments. We also propose new algorithms for specific problems under consideration
that can be implemented conveniently in this step. We then review published work
about parallel and distributed algorithms for the problem described, and describe
fundamental research studies that have received attention in more detail at algorith-
mic level. Finally, the main points of the chapter are emphasized with pointers for
future work in the Chapter Notes section.
The book has three parts. Part I provides basic background for bioinformatics,
with three chapters on molecular biology; graphs and algorithms; and
parallel and distributed processing. This part is not a comprehensive treatment of
these three main areas of research, but rather a dense coverage of these major topics
with pointers for further study. It is self-contained, however, covering most of the
fundamental concepts needed for parallel/distributed processing in bioinformatics.
Part II describes the problems in the sequence analysis of biological data. We
start this part with string algorithms and describe algorithms for sequence alignment
which is one of the most researched topics in the sequence world. We then investigate
sequence motif search algorithms and describe genome analysis problems such as
gene finding, genome rearrangement, and haplotype inference at a coarser level.
In Part III, our main focus is on parallel and distributed algorithms for biological
networks, the PPI networks being the center of focus. We will see that the main
problems in biological networks can be classified into cluster detection, network
motif discovery, network alignment, and phylogenetics. Cluster detection algorithms
aim to find clusters of high inter-node connections and these dense regions may
implicate a specific function or health and disease states of an organism. Network
motif algorithms attempt to discover repeating patterns of subgraphs and they may
indicate again some specific function attributed to these patterns. Network alignment
algorithms investigate the similarities between two or more graphs and show whether
these organisms have common ancestors and how well they are related. Analysis of
phylogenetic trees and networks which construct ancestor–descendant relationships
among existing individuals and organisms is also reviewed in this part which is
concluded by discussing the main challenges and future prospects of bioinformatics
in the Epilogue chapter.
Reference
1. Zhao J, Chen J, Yang T-H, Holme P (2012) Insights into the pathogenesis of axial
spondyloarthropathy from network and pathway analysis. BMC Syst Biol. 6(Suppl 1):S4.
doi:10.1186/1752-0509-6-S1-S4
Part I
Background
2 Introduction to Molecular Biology
2.1 Introduction
Modern biology has its roots in the work of Gregor Mendel, who identified the fun-
damental rules of heredity in 1865. The discovery of chromosomes and genes
followed later, and in 1953 Watson and Crick disclosed the double helix structure of
DNA. All living organisms have common characteristics such as replication, nutri-
tion, growth, and interaction with their environment. An organism is composed
of organs which perform specific functions. Organs are made of tissues which are
composed of aggregation of cells that have similar functions. The cell is the basic
unit of life in all living organisms and it has molecules that have fundamental func-
tions for life. Molecular biology is the study of these molecules in the cell. Two of
these molecules called proteins and nucleotides have fundamental roles to sustain
life. Proteins are the key components in everything related to life. DNA is made of
nucleotides and parts of DNA called genes code for proteins which perform all the
fundamental processes for living using biochemical reactions.
Cells synthesize new molecules and break large molecules into smaller ones using
complex networks of chemical reactions called pathways. The genome is the complete set
of DNA of an organism; the human genome consists of chromosomes which contain
many genes. A gene is the basic physical and functional unit of heredity that codes
for a protein which is a large molecule made from a sequence of amino acids. Three
critical molecules of life are deoxyribonucleic acid (DNA), ribonucleic acid (RNA),
and proteins. A central paradigm in molecular biology states that biological function
is heavily dependent on the biological structure.
In this chapter, we first review the functions performed by the cell and its ingre-
dients. The DNA contained in the nucleus, the proteins, and various other molecules
all have important functionalities and we describe these in detail. The central dogma
of life is the process of building up a protein from the code in the genes as we will
outline. We will also briefly describe biotechnological methods and introduce some
of the commonly used databases that store information about DNA, proteins, and
other molecules in the cell.
2.2 The Cell
Cells are the fundamental building blocks of all living things. The cell serves as a
structural building block to form tissues and organs. Each cell is independent and
can live on its own. All cells have a metabolism to take in nutrients and convert
them into molecules and energy to be used. Another important property of cells is
replication in which a cell produces another cell that has the same properties as
itself. Cells are composed of approximately 70% water, 7% small molecules such as
amino acids, nucleotides, salts, and lipids, and 23% macromolecules such as proteins
and polysaccharides. A cell consists of molecules in a dense liquid surrounded by a
membrane as shown in Fig. 2.1.
Eukaryotic cells have nuclei containing the genetic material, which is sepa-
rated from the rest of the cell by a membrane; prokaryotic cells do not have
nuclei. Prokaryotes include bacteria and archaea, while plants, animals, and fungi are
examples of eukaryotes. The tasks performed by the cells include taking nutrients
from food, converting these to energy, and performing various special missions. A
cell is composed of many parts, each with a different purpose. The following are the
important parts of a eukaryotic cell, with their functions:
The nucleus is at the center of the cell and is responsible for vital functions such
as cell growth, maturation, division, or death. Cytoplasm consists of jellylike fluid
which surrounds the nucleus and it contains various other structures. Endoplasmic
reticulum enwraps the nucleus, and processes molecules made by the cell and trans-
ports them to their destinations. Conversion of energy from food to a form that can be
used by the cell is performed by mitochondria which have their own genetic material.
These components of the cell are shown in Fig. 2.1. The cell contains various other
structures than the ones we have outlined here.
Chemically, the cell is composed of only a few elements. Carbon (C), hydrogen (H),
oxygen (O), and nitrogen (N) are the dominant ones, with phosphorus (P) and sulfur
(S) appearing in smaller proportions. These elements combine to form molecules in
the cell, using covalent bonds in which electrons in their outer orbits are shared
between the atoms. A nucleotide is one such molecule in the cell which is a chain
of three components: a base B, a sugar S, and a phosphoric acid P. The three basic
macromolecules in the cell that are essential for life are the DNA, RNA, and proteins.
2.2.1 DNA
James Watson and Francis Crick discovered the Deoxyribonucleic Acid (DNA) struc-
ture in the cell in 1953 using X-ray diffraction patterns which showed that the DNA
molecule is long, thin, and has a spiral-like shape [5]. The DNA is contained in the
nuclei of eukaryotic cells and is composed of small molecules called nucleotides.
Each nucleotide consists of a five-carbon sugar, a phosphate group, and a base. The
carbon atoms in a sugar molecule are labeled 1′ to 5′ and, using this notation, DNA
molecules start at the 5′ end and finish at the 3′ end as shown in Fig. 2.2. There are four
nucleotides in the DNA which are distinguished by the bases they have: Adenine
(A), Cytosine (C), Guanine (G), and Thymine (T). We can therefore think of DNA
as a string over the four-letter alphabet Σ = {A, C, G, T}. Human DNA consists of
approximately three billion bases. Since nucleotide A pairs only with T, and C pairs only
with G, we can say A and T are complementary, and so are G and C, as shown in Fig. 2.2.
Given the sequence S of a DNA strand, we can construct the other strand S′ by
taking the complement of the bases in this strand. If we take the complement of the
resulting strand, we will obtain the original strand. This process is essential
for protein production. Physically, DNA consists of two strands held together by
hydrogen bonds, arranged in a double helix as shown in Fig. 2.3. The complement
of a DNA sequence consists of the complements of its bases. The DNA therefore consists
of two complementary strands which bind to each other tightly, providing a stable
structure. This structure also provides the means to replicate: the double
DNA helix is separated into two strands, and each of these strands is then
used as a template to synthesize its complement.
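As a small illustration of strand complementarity, the following Python sketch (the function name and test sequence are our own) builds the complement of a strand and checks that complementing twice recovers the original:

# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    # Return the base-wise complement of a DNA strand.
    return "".join(COMPLEMENT[base] for base in strand)

s = "AACGTAGGCTAAT"
s_prime = complement(s)           # the complementary strand S'
assert complement(s_prime) == s   # complementing twice gives S back
print(s_prime)                    # prints TTGCATCCGATTA

Note that the physical complementary strand runs in the opposite (antiparallel) direction, so sequence tools usually compute the reverse complement, complement(s)[::-1].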
The DNA molecule is wrapped around proteins called histones into complex-
walled structures called chromosomes in the nucleus of each cell. The number of
chromosomes depends on the type of eukaryote species. Each chromosome consists
of two chromatids, which are coil-shaped structures connected near the middle,
forming an x-like structure. Chromosomes are kept in the nucleus of a cell in a
highly packed and hierarchically organized form. An organism with a single set of
chromosomes is called haploid, one with two sets diploid, and one with more
than two sets polyploid. Humans are diploid: one chromosome of each pair is
inherited from each parent, giving two chromosomes for each of the 23 chromosome pairs.
Chromosome pair 23 is the sex chromosome pair: two X chromosomes result in a female,
while an X and a Y result in a male. The type of sex chromosome inherited from the
father therefore determines the sex of the child.
2.2.2 RNA
The ribonucleic acid (RNA) is an important molecule that is used to transfer genetic
information. It has a similar structure to DNA but consists of only one strand and
does not form a helix structure like DNA. It also has nucleotides which consist of a
sugar, phosphate, and a base. The sugar however is a ribose instead of deoxyribose
and hence the name RNA. Also, DNA base thymine (T) is replaced with uracil (U)
in RNA. The fundamental kinds of RNA are the messenger RNA (mRNA), transfer
RNA (tRNA), and ribosomal RNA (rRNA) which perform different functions in the
cell. RNA provides a flow of information in the cell. First, DNA is copied to mRNA
in the nucleus and the mRNA is then translated to protein in the cytoplasm. During
translation, tRNA and rRNA have important functions. The tRNA is responsible for
delivering the amino acids which make up the protein, as prescribed by the mRNA, and
the rRNA molecules are the fundamental building blocks of the ribosomes, which
carry out the translation of mRNA to protein.
2.2.3 Genes
A gene is the basic unit of heredity in a living organism, determining its character
as a whole. Physically, a gene is a sequence of DNA that codes for an RNA (mRNA,
tRNA, or rRNA), and the mRNA codes for a protein. The study of genes is called
genetics. Gregor Mendel, in the 1860s, was the first to experiment with and set the
principles of passing hereditary information to offspring.
There are 23 pairs of chromosomes in humans, and between 20,000 and 25,000 genes
are located in these chromosomes. The starting and stopping locations of a gene are
identified by specific sequences. The protein coding parts of a gene are called exons
and the regions between exons with no specific function are called introns. Genes
have varying lengths and also, exons and introns within a gene have varying lengths.
A gene can combine with other genes or can be nested within another gene to yield
some functionality, and can be mutated which may change its functionality at varying
degrees, in some cases leading to diseases. The complete set of genes of an organism
is called its genotype. Each gene has a specific function in the physiology and mor-
phology of an organism. The physical manifestation or expression of the genotype
is the phenotype, the observable physiology and morphology of an organism. A
gene may have different varieties called alleles, resulting in different phenotypic
characteristics. Humans are diploid, meaning we inherit a chromosome from each
parent; therefore, we have two alleles of each gene. The genes that code for proteins
constitute about 1.5% of the total DNA; the rest contains RNA-encoding genes and
sequences that are not known to have any function. This part of DNA is called junk
DNA. There is no direct relationship between genome size, the number of genes, and
organism complexity. In fact, some single-cell organisms have a larger genome than
humans.
2.2.4 Proteins
Proteins are large molecules of the cell and they carry out many important functions.
For example, they form the antibodies which bind to foreign particles such as viruses
and bacteria. As enzymes, they work as catalysts for various chemical reactions;
as messenger proteins, they transmit signals to coordinate biological processes between
different cells, tissues, and organs; and they transport small molecules within the cell
and the body. Proteins are made from the information contained in genes. A protein
consists of a chain of amino acids connected by peptide bonds. Since such a bond
releases a water molecule, what we have inside a protein is a chain of amino acid
residues. Typically, a protein has about 300 amino acid residues, which can reach
5,000 in large proteins. The 20 essential amino acids that make up proteins are
shown in Table 2.1 with their abbreviations, codes, and polarities.
Proteins have highly complex structures which can be analyzed at four hierarchical
levels. The primary structure of a protein is specified by the sequence of amino
acids that are linked in a chain, and the secondary structure is formed by linear
regions of amino acids. A protein domain is a segment of the amino acid sequence of
a protein which has functions independent of the rest of the protein. The protein
also has a 3D structure, called the tertiary structure, which affects its functionality, and
several protein molecules arranged together form the quaternary structure. The function of a
protein is determined by this four-level structure. A protein has the ability to fold
in 3D and its shape formed as such affects its function. Using its 3D shape, it can
bind to certain molecules and interact. For example, mad cow disease is believed to
be caused by the wrong folding of a protein. For this reason, predicting the folding
structure of a protein from its primary sequence and finding the relationship between
its 3D structure and its functionality has become one of the main research areas in
bioinformatics.
2.3 Central Dogma of Life
The central dogma of molecular biology, and hence of life, was formulated by F. Crick
in 1958 and it describes the flow of information between DNA, RNA, and proteins.
This flow can be specified as DNA → mRNA → protein as shown in Fig. 2.4. The
forming of mRNA from a DNA strand is called transcription and the production
of a protein based on the nucleotide sequence of the mRNA is called translation as
described next.
2.3.1 Transcription
In the transcription phase of protein coding, a single stranded RNA molecule called
mRNA is produced which is complementary to the DNA strand it is transcribed. The
transcription process in eukaryotes takes place in the nucleus. The enzyme called
RNA polymerase starts transcription by first detecting and binding a promoter region
of a gene. This special pattern of DNA, shown in Fig. 2.5, is used by RNA polymerase
to find where to begin transcription. The reverse copy of the gene is then synthesized
by this enzyme, and a terminating signal sequence in DNA ends
this process, after which the pre-mRNA, which contains exons and introns, is released.
A post-processing called splicing involves removing the introns received from the
gene and reconnecting the exons to form the mature and much shorter mRNA which
is transferred to cytoplasm for the second phase called translation. The complete
gene contained in the chromosome is called genomic DNA, and the sequence with
exons only is called complementary DNA or cDNA [25].
2.3.2 The Genetic Code
The genetic code provides the mapping between the sequence of nucleotides and the
type of amino acids in proteins. This code is in triplets of nucleotide bases called
codons, where each codon encodes one amino acid. Since there are four nucleotide
bases, the total number of possible codons is 4³ = 64. However, proteins are made of
only 20 amino acids, which means many amino acids are specified by more than one
codon. Table 2.2 displays the genetic code.
Such redundancy provides fault tolerance in case of mutations in the nucleotide
sequences in DNA or mRNA. For example, a change in the codon UUA may result in
UUG in mRNA but the amino acid leucine corresponding to each of these sequences
is formed in both cases. Similarly, all of the three codons UAA, UAG, and UGA cause
termination of the polypeptide sequence and hence a single mutation from A to G or
from G to A still causes termination preventing unwanted growth due to mutations.
Watson et al. showed that the sequence order of codons in DNA corresponds directly
to the sequence order of amino acids in proteins [28]. The codon AUG specifies the
beginning of a protein's amino acid sequence; therefore, the amino acid methionine is
found as the first amino acid in all proteins.
2.3.3 Translation
The translation phase is the process where a mature mRNA is used as a template
to form proteins. It is carried out by the large molecules called ribosomes which
consist of proteins and the ribosomal RNA (rRNA) [5]. A ribosome uses tRNA to
first detect the start codon in the mRNA which is the nucleotide base sequence AUG.
The tRNA has three bases called anticodons which are complementary to the codons
it reads. The amino acids as prescribed by the mRNA are then formed and added to
the linear protein structure according to the genetic code. Translation to the protein
is concluded by detecting one of the three stop codons. Once a protein is formed, it
may be transferred to the needed location, guided by signals in its amino acid
sequence. The new protein must fold into a 3D structure before it can function [27].
Figure 2.6 displays the transcription and translation phases of a hypothetical protein
made of six amino acids, as prescribed by the mRNA.
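To make the DNA → mRNA → protein flow concrete, here is a minimal Python sketch of our own; the codon table below is only a small excerpt of Table 2.2, and the input sequence is an artificial example:

# Small excerpt of the genetic code; Table 2.2 lists all 64 codons.
CODON_TABLE = {
    "AUG": "Met",                # start codon, methionine
    "UUA": "Leu", "UUG": "Leu",  # redundancy: both code for leucine
    "GCU": "Ala", "AAU": "Asn",
    "UAA": None, "UAG": None, "UGA": None,  # stop codons
}

def transcribe(coding_strand):
    # Form the mRNA from the coding DNA strand: T is replaced by U.
    return coding_strand.replace("T", "U")

def translate(mrna):
    # Read codons in triplets from the start codon until a stop codon.
    start = mrna.find("AUG")
    if start < 0:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid is None:   # stop codon (or codon missing from excerpt)
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("GGATGGCTAATTTATAA")
print(translate(mrna))           # prints ['Met', 'Ala', 'Asn', 'Leu']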
2.3.4 Mutations
2.4.2 Polymerase Chain Reaction
The polymerase chain reaction (PCR), developed by Kary Mullis in 1983 [3], is a
biomedical technology used to amplify a selected DNA segment over several orders
of magnitude. The amplification of DNA is needed for a number of applications
including analysis of genes, discovery of DNA motifs, and diagnosis of hereditary
diseases. PCR uses thermal cycling in which two phases are employed. In the first
phase, the DNA is separated into two strands by heat; then, each single strand is
extended to a double strand by the inclusion of a primer and polymerase processing.
DNA polymerase is a type of enzyme that synthesizes new strands of DNA comple-
mentary to the target sequence. These two steps are repeated many times, resulting
in an exponential growth of the initial DNA segment. There are some limitations
of PCR processing, such as the accumulation of pyrophosphate molecules and the
existence of inhibitors of the polymerase in the DNA sample, which result in the
amplification stopping.
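Under ideal conditions, each thermal cycle doubles the number of copies, so after c cycles an initial amount N_0 of template grows to roughly

\[
N(c) \approx N_0 \cdot 2^{c},
\]

that is, 30 cycles amplify a single segment by a factor of about 2^30 ≈ 10^9; the limitations above explain why the later cycles fall short of this ideal doubling.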
2.4.3 DNA Sequencing
The sequence order of bases in DNA is needed to find the genetic information. DNA
sequencing is the process of obtaining the order of nucleotides in DNA. The obtained
sequence data can then be used to analyze DNA for various tasks such as finding
evolutionary relationships between organisms and treatment of diseases. The exons
are the parts of DNA that contain the genes coding for proteins, and all the exons in a
genome are collectively called the exome; sequencing them is known as whole exome
sequencing. However, research reveals that DNA sequences external to the exons also
affect protein coding and the health state of an individual. In whole genome sequencing,
the whole genome of an individual is sequenced. New-generation technologies have been
developed for both of these processes. A number of methods exist for DNA sequencing,
and we will briefly describe only a few fundamental ones.
The sequencing technology called Sanger sequencing, named after Frederick
Sanger who developed it [23,24], used deoxynucleotide triphosphates (dNTPs) and
di-deoxynucleotide triphosphates (ddNTPs), which are essentially nucleotides with
minor modifications. The DNA strand is copied using these altered bases, and when
one of them enters the sequence, it stops the copying process, which results in
short DNA segments of different lengths. These segments are ordered by size, and
the nucleotides are read from the shortest to the longest segment. The Sanger method
is slow, and newer technologies have been developed. The shotgun method of sequencing
was used to sequence larger DNA segments. The DNA segment is broken into many over-
lapping short segments, and these segments are then cloned. The short segments
are selected at random and sequenced in the next step. The final step of this method
involves assembling the short segments in the most likely order to determine the
sequence of the long segment, using the overlap data of the short segments, as
sketched below.
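The following Python sketch (a toy example of our own) illustrates the assembly idea by greedily merging the pair of reads with the longest suffix–prefix overlap; real assemblers use far more sophisticated, typically graph-based, methods:

def overlap(a, b, min_len=3):
    # Length of the longest suffix of a that matches a prefix of b.
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    # Repeatedly merge the pair of reads with the largest overlap.
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:        # no overlaps left; stop merging
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ACGTAC", "GTACGG", "CGGATT"]))  # ['ACGTACGGATT']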
Next generation DNA sequencing methods employ massively parallel processing
to overcome the problems of the previous sequencing methods. Three platforms are
widely used for this purpose: the Roche/454 FLX [21], the Illumina/Solexa Genome
Analyzer [4], and the Ion Torrent: Proton/PGM Sequencing [12]. The Roche/454
FLX uses the pyrosequencing method in which the input DNA strand is divided
into shorter segments which are amplified by the PCR method. Afterward, multiple
reads are sequenced in parallel by detecting optical signals as bases are added. The
Illumina sequencing uses a similar method: the input sample fragment is cleaved
into short segments, and each short segment is amplified by PCR. The fragments are
placed on a slide, which is flooded with color-labeled nucleotides and
DNA polymerase. By taking images of the slide, adding bases, and repeating this
process, the bases at each site can be detected to construct the sequence. Ion Proton
sequencing makes use of the fact that the addition of a dNTP to a DNA polymer releases
an H+ ion. The preparation of the slide is similar to the other two methods, and the slide
is flooded with dNTPs. Since each H+ ion decreases the pH, the changes in pH level are
used to detect nucleotides [8].
2.5 Databases
2.5.1 Nucleotide Databases
The major databases for nucleotides are the GenBank [10], the European Molec-
ular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) database
[19], and the DNA Databank of Japan (DDJB) [26]. GenBank is maintained by the
National Center for Biotechnology Information (NCBI), U.S. and contains sequences
for various organisms including primates, plants, mammals, and bacteria. It is a fun-
damental nucleic acid database and genomic data is submitted to GenBank from
research projects and laboratories. Searches in this database can be performed by
keywords or by sequences. The EMBL-EBI database is based on the EMBL Nucleotide
Sequence Data Library, which was the first nucleotide database in the world, and
receives contributions from projects and authors. EMBL supports text-based retrieval
tools, including SRS, and BLAST and FASTA for sequence-based retrieval [9].
2.5.2 Protein Sequence Databases
Protein sequence databases provide storage of protein amino acid sequence informa-
tion. Two commonly used protein databases are the Protein Identification Resource
(PIR) [16,31] and the UniProt [15] containing SwissProt [2]. The PIR contains
protein amino acid sequences and structures of proteins to support genomic and pro-
teomic research. It was founded by the National Biomedical Research Foundation
(NBRF) for the identification and interpretation of protein sequence information,
and the Munich Information Center for Protein Sequences (MIPS) [22] in Germany,
and the Japan International Protein Information Database later joined this database.
The SwissProt protein sequence database was established in 1986 and provides protein
functions, their hierarchical structures, and diseases related to proteins. The Uni-
versal Protein Resource (UniProt) was formed by the collaboration of EMBL-EBI, the
Swiss Institute of Bioinformatics (SIB), and PIR in 2003, and SwissProt was incor-
porated into UniProt. PDBj (Protein Data Bank Japan) is a protein database in Japan
providing an archive of macromolecular structures and integrated tools [17].
Exercises
1. For the DNA base sequence S = AACGTAGGCTAAT, work out the complemen-
tary sequence S′ and then the complement of the sequence S′.
2. A hypothetical gene has the sequence CCGTATCAATTGGCATC. Assuming this
gene has exons only, work out the amino acid sequence of the protein to be formed.
3. Discuss the functions of the three RNA molecules named tRNA, rRNA, and mRNA.
4. A protein consists of the amino acid sequence A-B-N-V. Find three gene
sequences that could have resulted in this protein.
5. Why is DNA amplification needed? Compare the cloning and PCR methods of
amplifying DNA in terms of the technology used and their performance.
References
1. Alberts B, Bray D, Lewis J, Raff M, Roberts K (1994) Molecular biology of the cell. Garland
Publishing, New York
2. Bairoch A, Apweiler R (1999) The SWISS-PROT protein sequence data bank and its supple-
ment TrEMBL in 1999. Nucleic Acids Res 27(1):49–54
3. Bartlett JMS, Stirling D (2003) A short history of the polymerase chain reaction. PCR Protoc
226:36
4. Bentley DR (2006) Whole-genome resequencing. Curr Opin Genet Dev 16:545–552
5. Brandenberg O, Dhlamini Z, Sensi A, Ghosh K, Sonnino A (2011) Introduction to molecular
biology and genetic engineering. Biosafety Resource Book, Food and Agriculture Organization
of the United Nations
6. Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM, Kothari A, Krummenacker
M, Latendresse M, Mueller LA, Ong Q, Paley S, Pujar A, Shearer AG, Travers M, Weerasinghe
D, Zhang P, Karp PD (2011) The MetaCyc database of metabolic pathways and enzymes
and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 40(Database
issue):D742–D753
7. CBD (Convention on Biological Diversity). 5 June 1992. Rio de Janeiro. United Nations
8. http://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course
9. http://www.ebi.ac.uk/embl/
10. http://www.ncbi.nlm.nih.gov/genbank
11. http://www.ncbi.nlm.nih.gov/geo/
12. http://www.iontorrent.com/. Ion Torrent official page
13. http://www.genome.jp/kegg/pathway.html
14. http://www.ebi.ac.uk/arrayexpress/. Website of ArrayExpress
15. http://www.uniprot.org/. Official website of PIR at Georgetown University
16. http://pir.georgetown.edu/. Official website of PIR at Georgetown University
17. http://www.pdbj.org/. Official website of Protein Databank Japan
18. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids
Res 28(1):27–30
19. Kneale G, Kennard O (1984) The EMBL nucleotide sequence data library. Biochem Soc Trans
12(6):1011–1014
20. Lewin B (1994) Genes V. Oxford University Press, Oxford
21. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS et al (2005) Genome sequencing in
microfabricated high-density picolitre reactors. Nature 437:376–380
22. Mewes HW, Andreas R, Fabian T, Thomas R, Mathias W, Dmitrij F, Karsten S, Manuel S,
Mayer KFX, Stümpflen V, Antonov A (2011) MIPS: curated databases and comprehensive
secondary data resources in 2010. Nucleic Acids Res 39
23. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed
synthesis with DNA polymerase. J Mol Biol 94(3):441–448
24. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors.
Proc Natl Acad Sci USA 74(12):5463–5467
25. Setubal JC, Meidanis J (1997) Introduction to computational molecular biology. PWS Publish-
ing Company, Boston
26. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, Sugawara H et al (2002)
DNA data bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res
30(1):27–30
27. Voet D, Voet JG (2004) Biochemistry, 3rd edn. Wiley
28. Watson J, Baker T, Bell S, Gann A, Levine M, Losick R (2008) Molecular biology of the gene.
Addison-Wesley Longman, Amsterdam
29. Watson JD (1986) Molecular biology of the gene, vol 1. Benjamin/Cummings, Redwood City
30. Watson JD, Hopkins NH, Roberts JW, Steitz JA, Weiner AM (1987) Molecular biology of the
gene, vol 2. Benjamin/Cummings, Redwood City
31. Wu C, Nebert DW (2004) Update on genome completion and annotations: protein information
resource. Hum Genomics 1(3):229–233
3 Graphs, Algorithms, and Complexity
3.1 Introduction
Graphs are discrete structures which are frequently used to model biological net-
works. We start this chapter with a brief review of basics of graph theory concepts
and then describe concepts such as connectivity and distance that are used for the
analysis of biological networks. Trees are graphs with no cycles and are used for var-
ious representations of biological networks. Spectral properties of graphs provide
means to explore structures such as clusters in biological networks and are briefly
described.
Algorithms are central to solving bioinformatics problems, and we provide a short review of fundamental algorithmic methods, starting with the complexity of algorithms. Our emphasis in this section is again on methods, such as dynamic programming and graph algorithms, that find many applications in bioinformatics. Basic graph traversal procedures such as breadth-first search and depth-first search are frequently employed for various problems in biological networks and are described in detail. We then introduce the complexity classes NP, NP-hard, and NP-complete for problems that cannot be solved in polynomial time. Approximation algorithms with known approximation ratios provide suboptimal solutions to difficult problems, and sometimes our only choice in tackling a bioinformatics problem is the use of heuristics, which are commonsense rules that work for most cases of the input. We present a rather dense review of all of these key concepts in relation to bioinformatics problems in this chapter.
3.2 Graphs
Graphs are frequently used to model networks of any kind. A graph G(V, E) has a nonempty vertex set V and a possibly empty edge set E. The number of vertices of a graph is called its order and the number of its edges is called its size. An edge
e of a graph is incident to two vertices called its endpoints. A loop is an edge that starts and ends at the same vertex. Multiple edges are edges that are incident to the same pair of endpoint vertices. A simple graph does not have any loops or multiple edges. In the context of this book, we will consider only simple graphs. The complement of a graph G(V, E) is the graph G(V, E′) which has the same vertex set as G and an edge (u, v) ∈ E′ if and only if (u, v) ∉ E. A clique is a graph with the maximum number of connections between its vertices: each vertex in a clique has n − 1 connections to the other vertices, where n is the order of the clique. A maximal clique is a clique which is not a proper subset of another clique. A maximum clique of a graph G is a clique of maximum size in G. Figure 3.1 displays example graphs.
The edges of a directed graph have an orientation between their endpoints. An oriented edge (u, v) of a directed graph, shown by an arrow from u to v, starts from vertex u and ends at vertex v. The degree of a vertex v of a graph G is the number
of edges that are incident to v. The maximum degree of a graph is shown by Δ(G).
The in-degree of a vertex v in a directed graph is the number of edges that end at v, and the out-degree of a vertex v in such a graph is the number of edges that start at v. A directed graph is depicted in Fig. 3.2a; the in-degrees and out-degrees of its vertices can be read from the orientations of its edges.
Fig. 3.2 Directed graph examples
Fig. 3.3 A graph with its adjacency matrix and adjacency list representations
A graph may be stored as an adjacency matrix or as an adjacency list, as shown in Fig. 3.3. The adjacency matrix of a graph of order n occupies n² memory locations, and examining all neighbors of a given vertex requires n steps. For sparse graphs, where the density of the graph is low with far fewer edges than in dense graphs (m ≪ n²), this structure has the disadvantage of occupying unnecessary memory space. The adjacency list takes m + n memory locations and examining all neighbors of a given node requires m/n time on average. However, checking the existence of an edge requires m/n steps, whereas we can achieve this in one step using an adjacency matrix. In general, adjacency lists should be preferred for sparse graphs and adjacency matrices for dense graphs.
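To make the comparison concrete, the following is a minimal C sketch (ours, not the book's) of both representations for the triangle graph on three vertices:

#include <stdio.h>
#define N 3
int adj_matrix[N][N] = {        /* n*n cells; edge test in one step */
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 0}
};
int deg[N] = {2, 2, 2};         /* lengths of the neighbor lists */
int adj_list[N][N] = {          /* n + m cells in a compact layout */
    {1, 2},                     /* neighbors of vertex 0 */
    {0, 2},                     /* neighbors of vertex 1 */
    {0, 1}                      /* neighbors of vertex 2 */
};
int main(void)
{   /* edge test: O(1) with the matrix, O(deg(u)) with the list */
    printf("edge (0,2)? %d\n", adj_matrix[0][2]);
    for (int i = 0; i < deg[1]; i++)   /* scan neighbors of vertex 1 */
        printf("neighbor of 1: %d\n", adj_list[1][i]);
    return 0;
}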
Fig. 3.4 a The bridge (g, f) of a graph. The endpoints of this bridge are also cut vertices of the graph. b The Hamiltonian cycle is {a, b, c, d, e, i, h, g, f, a} and the sequence of the Eulerian trail starting from vertex a is shown by the numbers on the edges
Fig. 3.5 a A forest consisting of unrooted trees. b A rooted minimum spanning tree of a graph
3.2.4 Trees
A tree is a graph with no cycles, and a forest is a graph that consists of two or more disconnected trees. A spanning tree T(V, E′) of a graph G(V, E) is a tree that covers all vertices of G, where E′ ⊆ E. A tree where a special vertex is designated as the root, with all vertices having an orientation toward it, is called a rooted tree; otherwise the tree is unrooted. Any vertex u except the root in a rooted tree is connected to a vertex v on its path to the root, called its parent, and the vertex u is called the child of vertex v. For an edge-weighted graph G(V, E, w), the minimum spanning tree (MST) of G has the minimum total sum of weights among all possible spanning trees of G. The MST of a graph is unique if all of its edges have distinct weights. Figure 3.5 shows a forest consisting of unrooted trees and a rooted MST of a graph.
3.2.5 Spectral Properties of Graphs
Given a square matrix A[n × n], an eigenvalue λ and the corresponding eigenvector x of A satisfy the following equation:
Ax = λx (3.1)
which can be written as
Ax − λx = 0 (3.2)
(A − λI)x = 0 (3.3)
The necessary condition for Eq. 3.3 to have a nonzero solution x is that det(A − λI) = 0. In order to find the eigenvalues of a matrix A, we can do the following [3]:
1. Form the matrix (A − λI).
2. Solve the equation det(A − λI) = 0 for the λ values.
3. Substitute each eigenvalue in Eq. 3.1 to find the x vectors corresponding to the λ values.
Eigenvalues are useful in solving many problems in engineering and basic sci-
ences. Our main interest in eigenvalues and eigenvectors will be their efficient use
for clustering in biological networks. The Laplacian matrix of a graph is obtained by subtracting its adjacency matrix A from its degree matrix D, which has d_ii = deg(i) at the diagonal elements and d_ij = 0 otherwise; that is, L = D − A. An element of the Laplacian matrix is therefore given by
L_ij = deg(i)  if i = j
L_ij = −1      if i and j are adjacent      (3.4)
L_ij = 0       otherwise
The Laplacian matrix L provides useful information about the connectivity of a graph [4]. It is positive semidefinite: all of its eigenvalues are nonnegative, and the smallest eigenvalue is 0. The number of components of a graph G is given by the number of eigenvalues of its Laplacian that are 0. The second smallest eigenvalue of L(G) shows whether graph G is connected or not, a positive value indicating connectedness; a larger second smallest eigenvalue indicates a better connected graph.
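As a small worked example (ours, not from the book): for the triangle graph K3, D = diag(2, 2, 2) and every off-diagonal entry of A is 1, so
L = D − A = [[2, −1, −1], [−1, 2, −1], [−1, −1, 2]]
with eigenvalues {0, 3, 3}. There is exactly one zero eigenvalue, so the graph has one connected component, and the second smallest eigenvalue 3 > 0 confirms that K3 is connected.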
3.3 Algorithms
3.3.1 Time Complexity
The time complexity of an algorithm is the number of steps needed to obtain the output and is commonly expressed in terms of the size of the input. In Algorithm 3.1, we would need at least n steps since the for loop is executed n times. Also, the assignment outside the loop needs 1 step. However, we are mostly interested in the part of the algorithm whose running time depends on the input size, as the execution times of the other parts of an algorithm become insignificant when the input size gets larger. In the above example, the loop has to be executed exactly n times. However, if we change the requirement to find only the first occurrence of the input key, then the execution time would be at most n steps.
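As a concrete stand-in for Algorithm 3.1 (our sketch, not the book's listing), consider scanning an array for a key, with one assignment outside a loop that executes exactly n times:

/* Count occurrences of key in A[0..n-1]; identifiers are ours. */
int count_key(const int A[], int n, int key)
{
    int count = 0;               /* the single step outside the loop */
    for (int i = 0; i < n; i++)  /* executed exactly n times */
        if (A[i] == key)
            count++;
    return count;
}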
3.3.2 Recurrences
A recursive algorithm calls itself with smaller values until a base case is encountered.
A check for the base case is typically made at the start of the algorithm and if this
condition is not met, the algorithm calls itself with possibly a smaller value of the
input. Let us consider finding the nth power of an integer x using a recursive algorithm
called Power as shown in Algorithm 3.2. The base case which is the start of the
returning point from the recursive calls is when n equals 0, and each returned value
is multiplied by the value of x when called which results in the value x n .
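A C rendering of this recursive scheme (a sketch of what was just described; ours, not the book's listing) is:

/* Recursive power: returns x^n for n >= 0. */
long power(long x, unsigned int n)
{
    if (n == 0)
        return 1;                    /* base case */
    return x * power(x, n - 1);      /* one multiplication per return */
}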
Recursive algorithms result in simpler code in general but their analysis may not
be simple. They are typically analyzed using recurrence relations where time to
solve the algorithm for an input size is expressed in terms of the time for smaller
input sizes. The following recurrence equation for the Power algorithm defines the
relation between the time taken for the algorithm for various input sizes:
T(n) = T(n − 1) + 1 (3.5)
= T(n − 2) + 2
= T(n − 3) + 3
Proceeding in this manner, we observe that T(n) = T(n − k) + k, and when k = n, T(n) = T(0) + n = n steps are required, assuming that the base case does not take any time. We could also have noticed that there are n recursive calls until the base case of n = 0 is reached, and each return results in one multiplication, for a total of n steps.
The analysis of recurrences may be more complicated than our simple example and
the use of more advanced techniques such as the Master theorem may be needed [7].
The fundamental classes of algorithms are greedy algorithms, divide and conquer algorithms, dynamic programming, and graph algorithms. A greedy algorithm makes the locally optimal choice at each step; such algorithms do not provide an optimal solution in general but may find suboptimal solutions in reasonable time. We will see a greedy algorithm that finds the MST of a graph in reasonable time among the graph algorithms.
Divide and conquer algorithms, on the other hand, divide the problem into a number of subproblems; these subproblems are solved recursively, and the solutions of the smaller subproblems are combined to find the solution to the original problem. As an example of this method, we will describe the mergesort algorithm, which sorts the elements of an array recursively. The array is divided into two parts, each part is sorted recursively, and the sorted smaller arrays are merged into a larger sorted array. Merging is the key operation in this algorithm: when merging two smaller sorted arrays, the larger array is formed by iteratively taking the smallest value remaining in the two smaller arrays. For example, merging the two sorted arrays {2, 8, 10, 12} and {1, 4, 6, 9} results in {1, 2, 4, 6, 8, 9, 10, 12}. The dynamic programming method and the graph algorithms have important applications in bioinformatics and are described next.
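The merge step itself can be sketched in C as follows (a minimal sketch; identifiers are ours); merging {2, 8, 10, 12} and {1, 4, 6, 9} with this routine produces the result above:

/* Merge sorted arrays a[0..na-1] and b[0..nb-1] into out[]. */
void merge(const int a[], int na, const int b[], int nb, int out[])
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)               /* take the smaller head */
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];      /* copy any leftovers */
    while (j < nb) out[k++] = b[j++];
}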
Algorithm 3.3 shows how the Fibonacci sequence can be computed using dynamic programming. The array F is filled with the first n members of this sequence at the end of the algorithm, which requires Θ(n) steps. Dynamic programming is frequently used in bioinformatics problems such as sequence alignment and DNA motif search, as we will see in Chaps. 6 and 8.
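A C version of the same bottom-up scheme (our sketch of what Algorithm 3.3 describes) is:

/* Fibonacci numbers by dynamic programming: fills F[0..n-1] bottom-up,
   reusing the stored subresults, in Theta(n) steps (n >= 2 assumed). */
void fibonacci(long F[], int n)
{
    F[0] = 0;                        /* base cases */
    F[1] = 1;
    for (int i = 2; i < n; i++)
        F[i] = F[i - 1] + F[i - 2];  /* each value computed once */
}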
Theorem 3.1 The time complexity of BFS algorithm is Θ(n + m) for a graph of
order n and size m.
Proof The initialization between lines 3 and 6 takes Θ(n) time. The while loop is executed at most n times, and the for loop between lines 12 and 18 is run at most deg(u) + 1 times, considering the vertices with no neighbors. The total running time in this case is therefore Θ(n) + Σ_{u∈V} (deg(u) + 1) = Θ(n + 2m + n) = Θ(n + m). □
Fig. 3.6 a A BFS tree from vertex a. b A DFS tree from vertex a in the same graph. The first and last visit times of a vertex are shown to the left and right of the vertex, respectively
There are two loops in the algorithm; the first loop considers each vertex, for a total of O(n) time, and the second loop considers all neighbors of each vertex, for a total of Σ_{u∈V} |N(u)| = 2m time. The time complexity of the DFS algorithm is therefore Θ(n + m).
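To illustrate the Θ(n + m) bound of Theorem 3.1, a BFS over adjacency lists can be sketched in C as follows (a sketch of the standard technique; all identifiers are ours):

#include <stdlib.h>
#include <stdbool.h>

/* BFS from source s; adj[u] is an array of deg[u] neighbors of u.
   Each vertex is enqueued once and each adjacency list is scanned
   once, giving Theta(n + m) total time. */
void bfs(int n, int *adj[], const int deg[], int s, int parent[])
{
    int  *queue   = malloc(n * sizeof(int));
    bool *visited = calloc(n, sizeof(bool));
    int head = 0, tail = 0;
    visited[s] = true;
    parent[s]  = -1;
    queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        for (int i = 0; i < deg[u]; i++) {
            int v = adj[u][i];
            if (!visited[v]) {
                visited[v] = true;
                parent[v]  = u;        /* (u, v) is a BFS tree edge */
                queue[tail++] = v;
            }
        }
    }
    free(queue);
    free(visited);
}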
Figure 3.7 shows the iterations of Prim's algorithm on a sample graph. The complexity of this algorithm is O(n²), as the while loop is executed for all vertices and the search for the minimum weight edge for each vertex takes O(n) time. This complexity can be reduced to O(m log n) using binary heaps and adjacency lists.
Fig. 3.7 The iterations of Prim's algorithm on a sample graph
The execution of this algorithm is shown in Fig. 3.8, where the shortest path tree T is formed after four steps. Since there are n vertices, n − 1 edges are added to the initial tree T consisting of the source vertex s only. The time complexity of the algorithm depends on the data structures used. As with Prim's algorithm, there are two nested loops, and an implementation using an adjacency matrix requires O(n²) time. Using adjacency lists and Fibonacci heaps, this complexity may be reduced to O(m + n log n).
This algorithm only finds the shortest path tree from a single source vertex; it can be executed for all vertices to find all shortest paths. It can also be modified so that the predecessors of vertices are stored (see Exercise 3.7).
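A compact O(n²) rendering in C, with predecessors stored as in the modification just mentioned (a sketch; all identifiers are ours), is:

#include <limits.h>
#include <stdbool.h>
#define INF (INT_MAX/2)

/* Dijkstra over an adjacency matrix w (w[u][v] = INF if no edge).
   dist[] receives the shortest distances from s, pred[] the
   predecessors on the shortest path tree. */
void dijkstra(int n, int w[n][n], int s, int dist[], int pred[])
{
    bool in_tree[n];
    for (int v = 0; v < n; v++) {
        dist[v] = INF; pred[v] = -1; in_tree[v] = false;
    }
    dist[s] = 0;
    for (int step = 0; step < n; step++) {
        int u = -1;
        for (int v = 0; v < n; v++)     /* closest non-tree vertex */
            if (!in_tree[v] && (u == -1 || dist[v] < dist[u]))
                u = v;
        in_tree[u] = true;
        for (int v = 0; v < n; v++)     /* relax the edges out of u */
            if (!in_tree[v] && dist[u] + w[u][v] < dist[v]) {
                dist[v] = dist[u] + w[u][v];
                pred[v] = u;
            }
    }
}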
Fig. 3.8 The execution of Dijkstra's shortest path algorithm on a sample graph
3.3.6 Special Subgraphs
Given a simple and undirected graph G(V, E), a special subgraph G′ ⊆ G has some property which may be exploited to discover structure in biological networks. The special subgraphs we will consider are independent sets, dominating sets, matchings, and vertex covers.
Fig. 3.9 a A maximum independent set of a graph. b The minimum dominating set of the same graph
3.3.6.3 Matching
A matching in a graph G(V, E) is a subset E′ of its edges such that the edges in E′ do not share any endpoints. Matching finds various applications, including routing in computer networks. A maximum matching of a graph G has the maximum size among all matchings of G. A maximal matching of G is a matching in G which cannot be enlarged. Finding a maximum matching in a graph can be performed in polynomial time [2] and hence this problem is in P. A maximum matching of size 4 is depicted in Fig. 3.10a.
3.4 NP-Completeness
All of the algorithms we have considered up to this point have polynomial execution times, which can be expressed as O(n^k) where n is the input size and k ≥ 0. These problems are in the complexity class P, which contains the problems solvable in polynomial time. Many problems, however, do not belong to P, as either they have exponential running times or the problem at hand does not even have a known solution in polynomial or exponential time. Problems with solutions in P are called tractable and any problem outside P is called intractable.
We need to define some concepts before investigating complexity classes other than P. Problems we want to solve are either optimization problems, where we try to obtain the best solution to the problem at hand, or decision problems, where we try to find an answer in the form of a yes or no to a given instance of the problem, as described before. A certificate is an input combination for an algorithm, and a verifier (or certifier) is an algorithm that checks a certificate for a problem and provides an answer in the form of a yes or no. We will exemplify these concepts with the subset sum problem. Given a set of n numbers S = {i_1, . . . , i_n}, this problem asks to find a subset of S the sum of which is equal to a given value. For example, given S = {3, 2, −5, −2, 9, 6, 1, 12} and the value 4, the subset {3, 1} provides a yes answer. The input {3, 1} is the certificate, and the verifier algorithm simply adds the values of the integers in the certificate in k − 1 steps, where k is the size of the certificate.
Clearly, we need to check each subset of S to find a solution if it exists, and this can be accomplished in exponential time of 2^n, which is the number of subsets of a set with n elements. Now, if we are given an input value m and a specific subset R of S as a certificate, we can easily check in polynomial time, using Θ(k) additions where k is the size of R, whether R is a solution. The certificate is R, and the verifier is the algorithm that sums the elements of R and checks whether this sum equals m. Problems that can be verified with a nondeterministically chosen input in polynomial time are said to be in the complexity class Nondeterministic Polynomial (NP). Clearly, all of the problems in P have verifiers that run in polynomial time and hence P ⊆ NP. Whether P = NP is not known, but it is widely believed by computer scientists that P ≠ NP.
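A verifier for subset sum is straightforward to express in C (our sketch):

#include <stdbool.h>

/* Given a certificate R of k integers, checks with Theta(k) additions
   whether its elements sum to m. */
bool verify_subset_sum(const int R[], int k, int m)
{
    int sum = 0;
    for (int i = 0; i < k; i++)
        sum += R[i];          /* k - 1 additions for k elements */
    return sum == m;          /* the yes/no answer */
}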
NP-complete problems constitute a subset of problems in NP that are as hard as any problem in NP. Formally, a decision problem C is NP-complete if it is in NP and every problem in NP can be reduced to C in polynomial time.
Fig. 3.11 The relations between the complexity classes P, NP, and NP-complete
NP-hard problems are problems for which no polynomial time algorithms are known; solving one NP-hard problem in polynomial time would imply that all problems in NP can be solved in polynomial time. In other words, NP-hard is the class of problems that are at least as hard as any problem in NP. For example, finding the least cost cycle in a weighted graph is an optimization problem which is NP-hard. Figure 3.11 displays the relations between these complexity classes.
3.4.1 Reductions
Fig. 3.12 Reduction from independent set to clique. a An independent set of a graph G. b The clique of G using the same vertices
[5], our aim is to search for approximation algorithms for vertex cover. Algorithm 3.8 displays our approximation algorithm for the vertex cover problem. At each iteration, a random edge (u, v) is picked, its endpoint vertices are included in the cover, and all edges incident to these vertices are deleted from G. This process continues until there are no edges left; since all edges are covered, the output is a vertex cover. We need to know how close the size of the output is to the optimum vertex cover.
This algorithm in fact finds a maximal matching of the graph G and includes the endpoints of the matching edges in the vertex cover set. Let M be a maximal matching of G. The minimum vertex cover must include at least one endpoint of each edge (u, v) ∈ M; hence, the size of the minimum vertex cover is at least |M|. The size of the cover returned by the algorithm is 2|M|, as both endpoints of each matching edge are included in the cover. Therefore,
|VC| = 2|M| ≤ 2 × |MinVC| (3.8)
where VC is the vertex cover returned by the algorithm and MinVC is the minimum vertex cover. Therefore, the approximation ratio of this algorithm is 2. Running this algorithm on the sample graph of Fig. 3.13 results in a vertex cover of size 6, which is 1.5 times the optimum size of 4 shown in Fig. 3.13b.
Fig. 3.13 Vertex cover examples. a The output of the approximation algorithm. b The optimal vertex cover
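A C sketch of this 2-approximation over an edge list (ours; the book presents the method as pseudocode) is:

#include <stdbool.h>

/* 2-approximation for minimum vertex cover, following the maximal
   matching idea above. edges[i][0] and edges[i][1] are the endpoints
   of edge i; in_cover[] must start all false. Returns the cover size,
   at most twice the optimum. */
int approx_vertex_cover(int m, const int edges[][2], bool in_cover[])
{
    int size = 0;
    for (int i = 0; i < m; i++) {
        int u = edges[i][0], v = edges[i][1];
        if (!in_cover[u] && !in_cover[v]) {   /* edge still uncovered */
            in_cover[u] = true;               /* take both endpoints, */
            in_cover[v] = true;               /* as for a matching edge */
            size += 2;
        }
    }
    return size;
}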
We have reviewed the basic background to analyze biological sequences and bio-
logical networks in this chapter. Graph theory with its rich background provides a
convenient model for efficient analysis of biological networks. Our review of graphs
emphasized concepts that are relevant to biological networks rather than being com-
prehensive. We then provided a survey of algorithmic methods again with emphasis
on the ones used in bioinformatics. The dynamic programming and the graph algo-
rithms are frequently used in bioinformatics as we will see. Finally, we reviewed
the complexity classes P, NP, NP-hard, and NP-complete. Most of the problems we encounter in bioinformatics are NP-hard, which often requires the use of approximation algorithms or heuristics. Since the size of the data is huge in these applications, we may use approximation algorithms or heuristics even if an algorithm in P exists for the problem at hand. For example, if we have an exact algorithm that solves a problem in O(n²) time, we may still prefer an approximation algorithm with a reasonable approximation ratio that finds a solution in O(n) time for n ≫ 1.
Similarly, the size of data being large necessitates the use of parallel and distrib-
uted algorithms which provide faster solutions than the sequential ones even if the
problem is in P, as we will see in the next chapter.
A comprehensive review of graph theory is provided by West [9] and Harary [6].
Cormen et al. [1], Skiena [8] and Levitin [7] all provide detailed descriptions of key
algorithm design concepts.
Exercises
1. Find the adjacency matrix and the adjacency list representation of the graph
shown in Fig. 3.14.
2. Prove that each vertex of a graph G has an even degree if G has an Eulerian trail.
3. The exchange sort algorithm sorts the numbers in an array of size n by first finding the maximum value of the array, swapping the maximum value with the value in the first place, then finding the maximum value of the remaining (n − 1) elements, and continuing until two elements remain. Write the pseudocode of this algorithm; show its iteration steps on the array A = {3, 2, 5, 4, 6} and work out its time complexity.
4. Modify the algorithm that finds the Fibonacci numbers (Algorithm 3.3) such
that only two memory locations which show the last two values of the sequence
are used.
5. Work out the BFS and DFS trees rooted at vertex b in the graph of Fig. 3.15 by
showing the parents of each vertex except the root.
6. Sort the weights of the graph of Fig. 3.16 in ascending order. Then, starting
from the lightest weight edge, include edges in the MST as long as they do
not form cycles with the existing edges in the MST fragment obtained so far.
This procedure is known as Kruskal’s algorithm. Write the pseudocode for this
algorithm and work out its time complexity.
Fig. 3.14 The graph of Exercise 1
Fig. 3.15 The graph of Exercise 5
Fig. 3.16 The weighted graph of Exercise 6
Fig. 3.17 The graph of Exercise 8
7. Modify Dijkstra's shortest path algorithm (Algorithm 3.7) so that the shortest path tree information, in terms of the predecessors of vertices, is stored.
8. Find the minimum dominating set and the minimum vertex cover of the graph
in Fig. 3.17.
9. Show that the independent set problem can be reduced to the vertex cover problem in polynomial time. (Hint: If V′ is an independent set of G(V, E), then V \ V′ is a vertex cover.)
References
1. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The
MIT Press, Cambridge
2. Edmonds J (1965) Paths, trees, and flowers. Can J Math 17:449–467
3. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Francis
4. Fiedler M (1989) Laplacian of graphs and algebraic connectivity. Comb Graph Theory 25:57–70
5. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-
completeness. W. H. Freeman, New York
6. Harary F (1979) Graph theory. Addison-Wesley, Reading
7. Levitin A (2011) Introduction to the design and analysis of algorithms, 3rd edn. Pearson Inter-
national Edn. ISBN: 0-321-36413-9
8. Skiena S (2008) The algorithm design manual. Springer, ISBN-10: 1849967202
9. West DB (2001) Introduction to graph theory, 2nd edn. Prentice-Hall, ISBN 0-13-014400-2
4 Parallel and Distributed Computing
4.1 Introduction
The terms computing, algorithm and programming are related to each other but
have conceptually different meanings. An algorithm in general is a set of instruc-
tions described frequently using pseudocode independent of the hardware, the oper-
ating system and the programming language used. Programming involves use of
implementation details such as operating system constructs and the programming
language. Computing is more general and includes methodologies, algorithms, pro-
gramming and architecture.
A sequential algorithm consists of a number of instructions executed consecu-
tively. This algorithm is executed on a central processing unit (CPU) of a computer
which also has memory and input/output units. A program is compiled and stored in
memory and each instruction of the program is fetched from the memory to the CPU
which decodes and executes it. In this so called Von Neumann model, both program
code and data are stored in external memory and need to be fetched to the CPU.
Parallel computing is the use of parallel computers to solve computational prob-
lems faster than a single computer. It is widely used for computationally difficult
and time consuming problems such as climate estimation, navigation and scientific
computing. A parallel algorithm executes simultaneously on a number of processing
elements with the processing elements being configured into various architectures.
A fundamental issue in parallel computing architecture is the organization of the
interconnection network which provides communication among tasks running on
different processing elements. The tasks of the parallel algorithm may or may not
share a commonly accessible global memory. The term parallel computing, in general, implies tight coupling between the tasks, as we will describe.
Distributed computing, on the other hand, assumes that the tasks running on different nodes of the network communicate using messages only. Distributed algorithms are sometimes called message-passing algorithms for
this reason. A network algorithm is a type of distributed algorithm where each node
is typically aware of its position in the network topology; cooperates with neighbor
nodes to solve a network problem. For example, a node would search for shortest
paths from itself to all other nodes by cooperating with its neighbors in a network
routing algorithm. In a distributed memory routing algorithm on the other hand,
we would attempt to solve shortest paths between all nodes of a possibly large net-
work graph using few processing elements in parallel. Both can be called distributed
algorithms in the general sense.
We start this chapter with a review of fundamental parallel and distributed computing architectures. We then describe the system support needed for shared-memory parallel computing and review multi-threaded programming, providing examples with POSIX threads. Parallel algorithm design techniques are also outlined; the distributed computing section covers the message passing paradigm and describes examples of commonly used message passing software. We conclude with the UNIX operating system and its network support for distributed processing over a computer network.
Fig. 4.1 Sample interconnection networks with each processor having switching capabilities. a A shared medium. b A 2-D mesh. Sample concurrent paths are shown in bold in (b)
Fig. 4.2 Sample interconnection networks. a A 3-dimensional hypercube. b A binary tree network. Circles are the switching elements, squares are the processors. Three sample concurrent paths are shown in bold in both figures
Fig. 4.3 a A shared-memory multiprocessor. b A multicomputer
A multiprocessor system is generally more tightly coupled and more homogeneous than a distributed system, although technically a distributed system is a multicomputer system, as it does not have shared memory.
Parallelism has long been employed within the CPU using instruction level parallelism (ILP), by executing independent instructions at the same time; however, the physical limits of this method have been reached recently, and parallel processing at a coarser level has regained its popularity among researchers.
A parallel algorithm is analyzed in terms of its time, processor and work complexities.
The time complexity T (n) of a parallel algorithm is the number of time steps required
to finish it. The processor complexity P(n) specifies the number of processors needed
and the work complexity W (n) is the total work done by all processors where W =
P × T . A parallel algorithm for a problem A is more efficient than another parallel
algorithm B for the same problem if WA < WB .
Let us consider a sequential algorithm that solves a problem A with a worst case
running time of Ts (n) which is also an upper bound for A. Furthermore, let us assume
a parallel algorithm does W (n) work to solve the same problem in Tp (n) time. The
parallel algorithm is work-optimal if W(n) = O(Ts(n)). The speedup Sp obtained by a parallel algorithm is the ratio of the sequential time to the parallel time:
Sp = Ts(n) / Tp(n) (4.1)
and the efficiency Ep is
Ep = Sp / p (4.2)
which is a value between zero and one, with p being the number of processors. Ideally
Sp should be equal to the number of processors but this will not be possible due to
the overheads involved in communication among parallel processes. Therefore, we
need to minimize the interprocess communication costs to improve speedup. Placing
many processes on the same processor decreases interprocess communication costs
as local communication costs are negligible, however, the parallelism achieved will
be greatly reduced. Load balancing of parallel processes involves distributing the
processes to the processors such that the computational load is balanced across the
processors and the inter-processor communications are minimized.
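As a hypothetical numeric illustration (ours): if a sequential algorithm takes Ts(n) = 120 s and its parallel version completes in Tp(n) = 20 s on p = 8 processors, then Sp = 120/20 = 6 and Ep = 6/8 = 0.75; the missing 25 % of efficiency reflects the communication and synchronization overheads discussed above.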
The parallel random access memory (PRAM) model is commonly used for parallel algorithm design, as it abstracts away various implementation details such as communication and synchronization. It can further be classified into the following subgroups:
• Exclusive read exclusive write (EREW): Every memory cell can be read or written
by only one processor at a time.
• Concurrent read exclusive write (CREW): Multiple processors can read a memory
cell but only one can write at a time.
• Concurrent read concurrent write (CRCW): Multiple processors can read and write
to the same memory location concurrently. A CRCW PRAM is sometimes called
a concurrent random-access machine.
Exclusive read concurrent write (ERCW) is not considered as it does not make
sense. Most algorithms for PRAM model use SIMD model of parallel computation
and the complexity of a PRAM algorithm is the number of synchronous steps of the
algorithm.
Fig. 4.4 Parallel summation of the array A[0 : 15]; the number of active processors is halved at each step
Algorithm 4.1 shows one way of implementing this algorithm using the EREW notation, where the number of active processors is halved in each step, for a total of log n steps. We could modify this algorithm so that the contents of the array A are not altered, by starting with 2p processors which copy A to another array B (see Exercise 4.2).
The time complexity of this algorithm is log n. In the first step, each of the n/2 processors performs one addition, the number of active processors is halved in each following step, and the total number of additions is n/2 + n/4 + · · · + 1 = n − 1. The work complexity, however, is W = P × T = (n log n)/2, since n/2 processors are allocated for log n steps. A sequential algorithm finds the result in only n − 1 steps; hence the work done by this parallel algorithm is not optimal.
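The halving scheme can be simulated sequentially in C as follows (our sketch; each iteration of the outer loop corresponds to one synchronous EREW step):

/* After step s, A[i] holds the sum of the block of length 2^s starting
   at i, for every i that is a multiple of 2^s; the total ends in A[0].
   n is assumed to be a power of two. */
void tree_sum(int A[], int n)
{
    for (int step = 1; step < n; step *= 2)
        for (int i = 0; i + step < n; i += 2 * step)
            A[i] += A[i + step];    /* one addition per active processor */
}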
We will assume that a parallel program consists of tasks that can run in parallel.
A task in this sense has code, local memory and a number of input/output ports.
Tasks communicate by sending data to their output ports and receive data from their
input ports [12]. The message queue that connects an output port of a task to an
input port of another task is called a channel. A task that wants to receive data
from one of its input channels is blocked if there is no data available in that port.
However, a task sending data to an output channel does not usually need to wait for
the reception of this data by the receiving task. This model is commonly adopted
in parallel algorithm design where reception of data is synchronous and the sending
of it is asynchronous. Full synchronous communication between two tasks requires
the sending task to be blocked also until the receiving task has received data. A
four-step design process for parallel computing was proposed by Foster consisting
of partitioning, communication, agglomeration and mapping phases [5].
Partitioning step can be performed on data, computation or both. We may divide
data into a number of partitions that may be processed by a number of processing
elements which is called domain decomposition [12]. For example, if the sum of an
array of size n is needed using p processors, we can allocate n/p data items to each processor, have each processor add its elements, and then transfer all of the partial sums to a single processor which computes the total sum for output. In
partitioning computation, we divide the computation into a number of modules and
then associate data with each module to be executed on each processor. This strategy
is known as functional decomposition and sometimes we may employ both domain
and functional decompositions for the same problem that needs to be parallelized.
The communication step involves the deciding process of the communication
among the tasks such as which task sends or receives data from which other tasks.
The communication costs among tasks that reside on the same processor can be
ignored, however, the interprocess communication costs using the interconnection
network is not trivial and needs to be considered.
The agglomeration step deals with grouping tasks that were designed in the first
two steps so that the interprocess communication is reduced. The final step is the
allocation of the task groups to processors which is commonly known as the mapping
problem. There is a trade-off, however, as grouping tasks onto the same processor decreases communication among them but results in less parallelism. One way of dealing with this problem is to use a task dependency graph, which is a directed graph representing the precedences and communications of the tasks. The problem is then reduced to the graph partitioning problem, on which a vast literature exists [1]. Our aim in general is to find a minimum weight cut set of the task graph such that communication between tasks on different processors is minimized and the load among processors is balanced.
Figure 4.5 shows such a task dependency graph where each task has a known
computation time and the costs of communications are also shown.
Fig. 4.5 A task dependency graph with task identifiers, execution times, and communication costs, and its two partitionings PA1 and PA2
Fig. 4.6 Gantt charts of the schedules for the partitions PA1 and PA2 on processors P1 and P2
We assumed that the computation times of the tasks are known beforehand, which is not realistic in a general
computing system but is valid for many real-time computing systems. As a general
rule, a task Ti that receives some data from its predecessors cannot start before its predecessors finish, simply because we need to make sure that all of its data has arrived before it can start its execution. Otherwise, we cannot claim that Ti will finish within its declared execution time.
We can partition these tasks onto two processors P1 and P2 with partition PA1, where we only consider the number of tasks in each partition, resulting in a total IPC cost of 21; or with PA2, in which we take the IPC costs into account, for a total IPC cost of 8. These schedules are displayed by the Gantt charts [6], which show the start and finishing times of the tasks, in Fig. 4.6. The resulting total execution times are 41 and 32 for partitions PA1 and PA2, from which we can deduce that PA2, which has a lower IPC cost and a shorter overall execution time, is more favorable.
Fig. 4.7 Process states and their transitions
A process is in one of the ready, running, or blocked states during its lifetime. A running process becomes blocked when it requests an event, such as the completion of an input/output operation, for which it must wait. When that event occurs, it is made ready again to be scheduled by the operating system. A running process may also have its time slice expire, in which case it is preempted and put in the ready state to be scheduled again when its turn comes. The data
about a process is kept in the structure called process control block (PCB) managed
by the operating system.
Having processes in an operating system provides modularity and a level of parallelism, as the CPU is not kept idle: a ready process is scheduled whenever the current process is blocked. However, we need to provide mechanisms for these processes to synchronize and communicate so that they can cooperate. There are two types of synchronization among processes, mutual exclusion and conditional synchronization, as described next.
Consider two processes P1 and P2 that both increment a shared variable x, which initially holds the value 12. At machine level, each increment takes three steps:
P1                       P2
1. LOAD R2,m[x]          1. LOAD R5,m[x]
2. INC R2                2. INC R5
3. STO m[x],R2           3. STO m[x],R5
Now, let us assume P1 executes first, and just before it can execute line 3, its time slice expires and the operating system performs a context switch by storing its registers, including R2 which holds 13, in its PCB. P2 is executed afterwards; it reads the value 12 into register R5, increments R5, and stores its value 13 in x. The operating system now schedules P1 by loading the CPU registers from its PCB, and P1 writes 13 to location x, which already contains 13. We have incremented x once instead of twice as required.
As this example shows, some parts of the code of a process have to be executed exclusively. In the above example, if we had provided some means so that P1 and P2 executed lines 1–3 without any interruption, the value of x would remain consistent. Such a segment of the code of a process, which has to be executed exclusively, is called a critical section. A simple solution to this problem would be disabling interrupts before line 1 and enabling them after line 3 in each process. In user C code, for example, we would need to enclose the statement x = x + 1 between the assembly language instructions disable interrupt (DI) and enable interrupt (EI). However, allowing machine control at user level has obvious shortcomings; for example, if the user forgets to enable interrupts, the computation would stop completely.
There are various methods to provide mutual exclusion while a critical section
is executed by a process and the reader is referred to [16] for a comprehensive
review. Basically, we can classify the methods that provide mutual exclusion at
hardware, operating system or application (algorithmic) level. At hardware level,
special instructions that provide mutual exclusion to the memory locations are used
whereas algorithms provide mutual exclusion at application level. However, it is not
practical to expect a user to implement algorithms when executing a critical section,
instead, operating system primitives for this task can be appropriately used. Modern
operating systems provide a mutual exclusion data structure (mutex) for each critical
section and two operations called lock and unlock on the mutex structure. The critical
section can then be implemented by each process on a shared variable x as follows:
mutex m1;
int x /* shared integer */
process P1{
... /* non-critical section */
lock(&m1);
x++; /* critical section */
unlock(&m1);
... /* non-critical section */
}
The lock and unlock operations on the mutex variables are atomic, that is, they
cannot be interrupted, as provided by the operating system.
A semaphore can be used for both mutual exclusion and conditional synchroniza-
tion. The following example demonstrates these two usages of semaphores where two
processes producer and consumer synchronize using semaphores. The semaphore
s1 is used for mutual exclusion and the semaphores s2 and s3 are used for waiting
and signalling events.
sem_t s1,s2,s3;  /* s1 and s2 initialized to 1, s3 to 0 */
int shared;

producer(){                    consumer(){
  while(true){                   while(true){
    read in_data;                  wait(s3);
    wait(s2);                      wait(s1);
    wait(s1);                      out_data=shared;
    shared=in_data;                signal(s1);
    signal(s1);                    signal(s2);
    signal(s3);                    print out_data;
  }                              }
}                              }
The shared memory location which should be mutually accessed is shared and
is protected by the semaphore s1. The producer continuously reads data from the
keyboard, and first waits on s2 to be signalled by the consumer indicating it has read
the previous data. It then writes the input data to the shared location and signals
the consumer by the s3 semaphore so that it can read shared and display the data.
Without this synchronization, the producer may overwrite previous data before it is
read by the consumer; or the consumer may read the same data more than once.
A thread is a lightweight process with its own program counter, stack, and register values. It does not have the memory page references, file pointers, and other data
that an ordinary process has, and each thread belongs to one process. Two different
types of threads are the user level threads and the kernel level threads. User level
threads are managed by the run-time system at application level and the kernel which
is the core of an operating system that handles basic functions such as interprocess
communication and synchronization, is not aware of the existence of these threads.
Creation and other management functions of a kernel thread is performed by the
kernel.
A user level thread should be non-blocking, as a single thread of a process that makes a blocking system call will block the whole process, since the kernel sees it as one main thread. Kernel level threads, however, can be blocking; even if one thread of a process is blocked, the kernel may schedule another one, as it manages each thread separately using thread control blocks which store all thread related data. Kernel threads are an order of magnitude or more slower than user level threads due to their management overhead in the kernel. The general rule is to use user level threads if they are known to be non-blocking and kernel level threads if they are blocking, at the expense of slower execution. Figure 4.8 displays the user level
and kernel level threads in relation to the kernel.
Using threads provides the means to run them on multiprocessors or multi-core CPUs with suitable scheduling policies. Also, threads can share data, which means they do not need interprocess communication. However, the shared data must be protected by issuing operating system calls, as we have seen.
Fig. 4.8 a User level threads. b Kernel level threads, which are attached to thread control blocks in the kernel
where thread_id is the variable where the created thread identifier will be stored after
this system call, start_function is the address of the thread code and the arguments
are the variables passed to the created thread.
The following example illustrates the use of threads for parallel summing of an
integer array A with n elements. Each thread sums the portion of the array defined by
its identity passed to it during its creation. Note that we are invoking the same thread
code with different parameter (i) each time resulting in summing a different part of
the array A, for example, thread 8 sums A[80 : 89]. For this example, we could have
used user threads as they work independently without getting blocked, however, we
used kernel threads as POSIX threads API allows the usage of kernel threads only.
The Solaris operating system allows both user and kernel threads and provides flexibility, but its API is not a standard [15].
#include <stdio.h>
#include <pthread.h>
#define n 100
#define n_threads 10
int A[n]={2,1,...,8}; /* initialize */
pthread_mutex_t m1;
int total_sum;

/* The worker body below is a reconstruction of the behavior described
   in the text: thread id sums the block A[10*id : 10*id+9] and adds
   its partial sum to the shared total under the mutex m1. */
void *worker(void *arg)
{ int id=(int)(long)arg, i, my_sum=0;
  for(i=id*(n/n_threads); i<(id+1)*(n/n_threads); i++)
    my_sum += A[i];
  pthread_mutex_lock(&m1);      /* total_sum is shared */
  total_sum += my_sum;
  pthread_mutex_unlock(&m1);
  return NULL;
}

main()
{ pthread_t threads[n_threads];
  int i;
  pthread_mutex_init(&m1,NULL);
  for(i=0; i<n_threads; i++)
    pthread_create(&threads[i],NULL,worker,(void *)(long)i);
  for(i=0; i<n_threads; i++)
    pthread_join(threads[i],NULL);
  printf("Total sum is = %d\n", total_sum);
}
We need to compile this program (sum.c) with the POSIX thread library, for example:

gcc -o sum sum.c -lpthread

POSIX semaphores are declared and initialized as follows:
#include <semaphore.h>
sem_t s1;
sem_init(&s1, 1, 1);
The wait and signal operations on semaphores are sem_wait and sem_post, respectively. In the following example, we implement the producer/consumer example of Sect. 4.3.4 using two threads and two semaphores, full and empty.
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

sem_t full, empty;
int data;                      /* one-item shared buffer */

/* The thread bodies are reconstructed from the fragments and the
   surrounding description. */
void *producer(void *arg)
{ int i=0;
  while(i<10) {
    sem_wait(&empty);          /* wait until the buffer is free */
    scanf("%d", &data);
    sem_post(&full);           /* signal that data is available */
    i++;
  }
  return NULL;
}

void *consumer(void *arg)
{ int i=0;
  while(i<10) {
    sem_wait(&full);           /* wait for data */
    printf("%d\n", data);
    sem_post(&empty);          /* free the buffer */
    i++;
  }
  return NULL;
}

main()
{ pthread_t prod, cons;
  sem_init(&full,1,0);         /* no data initially */
  sem_init(&empty,1,1);        /* the buffer starts empty */
  pthread_create(&prod,NULL,producer,NULL);
  pthread_create(&cons,NULL,consumer,NULL);
  pthread_join(prod,NULL);
  pthread_join(cons,NULL);
}
UNIX is a multitasking operating system developed first at Bell Labs and later extended with a network interface at the University of California at Berkeley as the Berkeley Software Distribution (BSD). It is written in C for the most part, has a relatively smaller size than other operating systems, is modular, and is distributed in source code, all of which make UNIX one of the most widely used operating systems. The user interface in UNIX is called the shell, and the kernel, which performs the core operating system functions such as process management and scheduling, resides between the shell and the hardware, as shown in Fig. 4.9.
Fig. 4.9 The UNIX shell and kernel layers above the hardware
UNIX is based on processes, and a number of system calls provide process management. A new process is created with the fork system call. The caller becomes the parent of the newly created process, which becomes the child of the parent. The fork system call returns the process identifier of the child to the parent, and 0 to the child. We can create a number of parallel processes using this
structure and provide concurrency where each child process is created with a copy
of the data area of the parent and it runs the same code as the parent. UNIX provides
various interprocess communication primitives and pipes are one of the simplest. A
pipe is created by the pipe system call and two identifiers, one for reading from a
pipe and one for writing to the pipe are returned. Reading and writing to a pipe are
performed by read and write system calls, similar to file read and write operations.
The following program illustrates process creation and interprocess communication using pipes in UNIX. Our aim in this program is to add the elements of an array in parallel using two processes. The parent creates two pipes, one for each direction, and then forks a child; it sends the second half of the array to the child through the pipe p1 to have it calculate a partial sum, and calculates the sum of the lower half of the array itself. It then reads the sum of the child from pipe p2, finds the total sum, and displays it.
#include <stdio.h>
#include <unistd.h>
#define n 8
int i, c, p1[2], p2[2], A[n], sum=0, my_sum=0;
main()
{ pipe(p1);            /* create two pipes, one for each direction */
  pipe(p2);            /* p1 is from parent to child, */
  c=fork();            /* p2 is from child to parent */
  if(c!=0) {           /* this is the parent */
    close(p1[0]);      /* close the read end of p1 */
    close(p2[1]);      /* close the write end of p2 */
    for(i=0;i<n;i++)   /* initialize the array */
      A[i]=i+1;
    write(p1[1],&A[n/2],(n/2)*sizeof(int)); /* send the second half */
    for(i=0;i<n/2;i++) /* sum the first half */
      my_sum=my_sum+A[i];
    read(p2[0],&sum,sizeof(int)); /* receive the child's partial sum */
    printf("Total sum is = %d\n", my_sum+sum);
  }
  else {               /* this is the child */
    close(p2[0]);      /* close the read end of p2 */
    close(p1[1]);      /* close the write end of p1 */
    read(p1[0],A,(n/2)*sizeof(int)); /* receive the second half */
    for(i=0;i<n/2;i++)
      my_sum=my_sum+A[i];
    write(p2[1],&my_sum,sizeof(int)); /* send the partial sum */
  }
}
Although threads are mostly used for shared memory parallel processing, we can still
have message-passing functionality using threads by additional routines at user level.
One such library is presented in [2] where threads communicate by message passing
primitives write_fifo and read_fifo which correspond to send and receive routines
of a message passing system. The communication channels between processes are
simulated by first-in-first-out (FIFO) data structures as shown below.
typedef struct {
sem_t send_sem;
sem_t receive_sem;
msgptr_t message_que[N_msgs];
} fifo_t;
A sending thread deposits its message in the message queue of the destination FIFO, signalling the receiving semaphore so that the receiver is activated. Using such a simulator provides a simple testbed to verify distributed algorithms; when the algorithm works correctly, performance tests can be performed using commonly used message-passing tools, as described next.
The Message Passing Interface Standard (MPI) provides a library of message passing
primitives in C or Fortran programming languages for distributed processing over a
number of computational nodes connected by a network [8,10]. Its aim is to provide
a portable, efficient and flexible standard of message passing. MPI code can be eas-
ily transferred to any parallel/distributed architecture that implements MPI. It can
be used in parallel computers, clusters, and heterogeneous networks. It is basically an application programming interface (API) to handle communication and synchronization among processes residing at various nodes of a parallel/distributed computing system. MPI provides point-to-point, blocking and non-blocking, and collective communication modes. A previously popular tool for this purpose was the Parallel Virtual Machine (PVM) [7]. The basic MPI instructions to run an MPI program are as follows.
#include <stdio.h>
#include <mpi.h>
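/* The remainder of this program is a minimal sketch (ours), not the
   book's exact listing: */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);      /* enroll this process in MPI */
    printf("Hello world\n");
    MPI_Finalize();              /* leave the MPI environment */
    return 0;
}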
When run by the mpirun command, the program takes the number of processes as an input, which is 8 in this case; the MPI runtime then starts 8 instances of the program, each of which calls MPI_Init and executes the program separately. We will have 8 "Hello world" outputs when this program is executed. Clearly, it would make more sense to have each process execute on different data to perform parallel processing. In order to achieve this SPMD style of parallel programming, each process has a unique identifier which can then be mapped to the portion of data it should process. A process finds out its identifier by the MPI_Comm_rank command and, based on this value, it can execute the same code on different data, achieving data partitioning. Two basic communication routines in MPI are MPI_Send and MPI_Recv, with the following simplified syntax:

MPI_Send(void *data, int n_data, MPI_Datatype type, int receiver,
         int tag, MPI_Comm comm);
MPI_Recv(void *data, int n_data, MPI_Datatype type, int sender,
         int tag, MPI_Comm comm, MPI_Status *status);
where data is the memory location for data storage, n_data is the number of data
items to be transferred with the given type, receiver and the sender are the identifiers
of the receiving and the sending processes respectively, and tag is the type of message.
The next example shows how to find the sum of an array in parallel using a number of processes. In this case, the root performs data partitioning by sending a different part of the array to each child process; the children calculate the partial sums and return them to the root, which finds the total sum. The root first initializes the array and sends the size of the portion along with the array data to each process. Message types for each communication direction are also defined.
#include <stdio.h>
#include <mpi.h>
#define n_data 64
#define tag1 1               /* root to workers */
#define tag2 2               /* workers to root */
/* Only fragments of this program survive in the source; the missing
   parts are reconstructed below from the description above. */
int main(int argc, char *argv[])
{ int i, rank, n_procs, n_portion, sum=0, part, partial_sum, root=0;
  int array[n_data];
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &n_procs);
  n_portion=n_data/n_procs;
  if(rank==root) {
    for(i=0;i<n_data;i++)    /* initialize the array */
      array[i]=i;
    for(i=1;i<n_procs;i++)   /* send each child its portion */
      MPI_Send(&array[i*n_portion], n_portion, MPI_INT, i, tag1,
               MPI_COMM_WORLD);
    for(i=0;i<n_portion;i++) /* sum the root's own portion */
      sum += array[i];
    for(i=1;i<n_procs;i++) { /* collect the partial sums */
      MPI_Recv(&part, 1, MPI_INT, MPI_ANY_SOURCE, tag2,
               MPI_COMM_WORLD, &status);
      sum += part;
    }
    printf("Total sum is = %d\n", sum);
  }
  else {
    MPI_Recv(array, n_portion, MPI_INT, root, tag1,
             MPI_COMM_WORLD, &status);
    partial_sum = 0;
    for(i = 0; i < n_portion; i++)
      partial_sum += array[i];
    MPI_Send(&partial_sum, 1, MPI_INT, root, tag2, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
BSD UNIX provides socket-based communication over the Internet. In network communication using sockets, a server is an endpoint of the communication between two hosts which performs some function as required by a client process. Communication over the Internet can be connection-oriented, where a connection between a server and a client is established and maintained during data transfer. Connection-oriented delivery also guarantees reliable delivery of messages by preserving the order of messages. The Transmission Control Protocol (TCP) is a connection-oriented protocol provided at layer 4 of the International Standards Organization (ISO) Open Systems Interconnection (OSI) 7-layer model. In the connectionless communication mode, there is no established or maintained connection, and the delivery of messages is not guaranteed, requiring further processing at higher levels of communication. The User Datagram Protocol (UDP) is the standard connectionless protocol of this model. Sockets used for connection-oriented communication are called stream sockets, and sockets used for connectionless communication are called datagram sockets. The basic socket system calls provided by BSD UNIX are as follows:
• socket: This call creates a socket of the required type, whether connection-oriented or connectionless, and in the required domain.
• bind: A socket is bound to a local address and port by this call, typically by a server, as shown in Fig. 4.10.
• listen: A server specifies the maximum number of concurrent requests it can handle on a socket by this call.
• connect: A client initiates a connection to a server by this procedure.
• accept: A server accepts client requests by this call. It blocks the caller until a
request is made. This call is used by a server for connection oriented communica-
tion.
• read and write: Data is read from and written to a socket using these calls.
• close: A connection is closed by this call on a socket.
Figure 4.10 displays a typical communication scenario of a connection-oriented
server and a client. The server and the client are synchronized by the blocking system
calls connect and accept after which a full connection is established using the TCP
protocol. The server then reads the client request by the read and responds by the
write calls. A server typically runs in an endless loop; in order to respond to further incoming requests, it spawns a child for each request so that it can return to waiting on the accept call. Threads may also be used for this purpose, resulting in less management overhead than processes.
We can use socket-based communication over the Internet to perform distributed
computations. Returning to the sum of an array example we have been elaborating,
a server process can send portions of an array to a number of clients using stream or
datagram sockets upon their requests and then can combine all of the partial results
from the clients to find the final sum (see Exercise 4.9).
Fig. 4.10 A typical communication scenario between a connection-oriented server and a client using sockets
The socket interface provides routines for creation of, reading from, and writing to data structures called sockets. However, MPI provides a neater interface to the application, with the addition of routines for multicast and broadcast communication, and is frequently employed for parallel and distributed processing. Threads are frequently used for shared memory parallel processing, and MPI can be employed for both parallel and distributed applications.
Exercises
1. Provide a partitioning of the task dependency graph of Fig. 4.5 onto three processors which considers evenly balancing the load among the processors and minimizing the IPC costs at the same time. Draw the Gantt chart for this partitioning and work out the total IPC cost and the execution time of this partitioning.
2. Modify Algorithm 4.1 such that array A is first copied to an array B and all of the summations are done on array B to keep the contents of A intact. Copying A to B should also be done in parallel using n processors.
3. The prefix sum of an array A is an array B with the running sums of the elements
of A such that b0 = a0 ; b1 = a0 + a1 ; b2 = a0 + a1 + a2 ; . . . For example if
A = {2, 1, −4, 6, 3, 0, 5} then B = {2, 3, −1, 5, 8, 8, 13}. Write the pseudocode
of an algorithm using EREW PRAM model to find the prefix sum of an input
array.
4. Provide an N-buffer version of the producer/consumer algorithm where we have N empty buffers initially and the producer process fills these buffers with input data as long as there are empty buffers. The consumer reads the filled buffer locations and consumes the data by printing it. Write the pseudocode of this algorithm with short comments.
5. A multi-threaded file server is to be designed which receives a message by a front-end thread; this thread invokes one of the open, read, and write threads depending on the action required in the incoming message. The read thread reads the specified number of bytes from the file and sends them to the requester; the write thread writes the specified number of bytes contained in the message to the specified file. Write this program in C with POSIX threads, with brief comments.
6. The number π can be approximated by the integral ∫_0^1 4/(1 + x²) dx. Provide a multi-threaded C program using POSIX threads, with 10 threads each working with 20 slices, to find π by calculating the area under this curve.
7. Modify the program of Sect. 4.3.6 such that there are a total of 8 processes
spawned by the parent to find the sum of the array in parallel.
8. Provide a distributed algorithm for 8 processes with process 0 as the root. The
root process sends a request to learn the identifiers of the processes in the first
round and then finds the maximum identifier among them and notifies each
process of the maximum identifier in the second round.
9. Write an MPI program in C that sums elements of an integer array in a pipelined
topology of four processes as depicted in Fig. 4.11. The root process P0 computes
the sum of the first 16 elements and passes this sum along with the remaining
48 elements to P1 which sums the second 16 elements of the array and pass
76 4 Parallel and Distributed Computing
16 ... 64 32 ... 64
P1
P0 P2
48 ... 64
P3
sum 0−3
the accumulated sum to P2. This process is repeated until message reaches P0
which receives the total sum and displays it in the final step. Note that the first
extra element of the message transferred can be used to store the partial sum
between processes.
10. Provide the pseudocode of a program using UNIX BSD sockets where we have
a server and n clients. Each client makes a connection-oriented call to the server,
obtains its portion of an integer array A and returns the partial sum to the server.
The server displays the total sum of array A when it receives all of the partial
sums from the clients.
Part II
Biological Sequences
5 String Algorithms
5.1 Introduction
A string S is an ordered sequence of characters over a finite set of symbols called an alphabet Σ. The size of the alphabet, |Σ|, is the number of distinct characters in it, and the size of a string, |S|, is the number of characters it contains, also
called the length of the string. For example, the DNA structure can be represented by
a string over the alphabet of four nucleotides: Adenine (A), Cytosine (C), Guanine
(G) and Thymine (T), hence Σ = {A, C, G, T } for DNA; the RNA structure has
Σ = {A, C, G, U} replacing Thymine with Uracil (U), and a protein is a linear chain
of amino acids over an alphabet of 20 amino acids Σ = {A, R, N, . . . , V }.
String algorithms have numerous applications in molecular biology. Biologists
often need to compare two DNA/RNA or protein sequences to find the similarity
between them. This process is called sequence alignment and finding the relatedness
of two or more biological sequences helps to deduce ancestral relationships among
organisms as well as finding conserved segments which may have fundamental roles
for the functioning of an organism. We may also need to search for a specific string
pattern in a biological sequence as this pattern may indicate a location of a structure
such as a gene in DNA. In some other cases, biologists need to find repeating substring patterns in DNA or protein sequences, as these may serve as signals in the genome, and their over-representation may indicate complex diseases. Analysis of the genome in terms of substring rearrangements helps to find evolutionary relationships between organisms
as well as to understand the mechanisms of diseases. In summary, DNA/RNA and
protein functionalities are highly dependent on their sequence structures and we
need efficient string manipulation algorithms to understand the underlying complex
biological processes.
In this chapter, we start the analysis of strings by the string matching algorithms
where we search the occurrences of a small string inside a larger string. We will
see that there are a number of algorithms with varying time complexities. The next
problem we will investigate is the longest common subsequence problem where our
aim is to find the longest common subsequence of two strings. The longest increasing subsequence problem and suffix trees, which are versatile data structures for string processing, are the other main topics of this chapter.
5.2 Exact String Matching
We will now review fundamental sequential algorithms for string matching, starting with the naive algorithm, a brute-force method that checks every possible alignment of the pattern against the text. As this is a time-consuming approach, various algorithms with lower time complexities have been proposed.
In the naive algorithm, we start from the first locations in T and P and compare their elements. As long as we find
a match, the indices are incremented and if the size of the pattern m is reached, a
match is found and output as shown in Algorithm 5.1.
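Since the listing of Algorithm 5.1 is not reproduced here, the following minimal C version of the naive matcher may serve as a reference; it reports the 1-based locations of all occurrences, not only the first.

#include <stdio.h>
#include <string.h>

void naive_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), i, j;
    for (i = 0; i <= n - m; i++) {       /* n - m + 1 candidate locations */
        for (j = 0; j < m && T[i + j] == P[j]; j++)
            ;                            /* extend the match while chars agree */
        if (j == m)                      /* all m characters matched */
            printf("match found at location %d\n", i + 1);
    }
}

int main(void) {
    naive_match("abccbabaccbaab", "ccba");   /* the example text and pattern */
    return 0;
}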
Let us consider the running of this algorithm until the first match for the following example:
T = a b c c b a b a c c b a a b
P = c c b a
c c b a
c c b a (a match found at location 3)
The time taken by this algorithm is O(nm) since checking each candidate location of P takes up to m comparisons and we need to check the first n − m + 1 locations of T. For very long strings such as biological sequences, the total time will be significantly high. Baeza-Yates showed that the expected number of comparisons made by this algorithm for an alphabet of size c is [1]:

C = (c/(c − 1)) (1 − 1/c^m) (n − m + 1) + O(1)    (5.1)
An alternative approach is to build a finite state machine (FSM) from the pattern, defined by the tuple (Σ, S, s0, δ) where:
• Σ is an input alphabet
• S is a finite nonempty set of states
• s0 ∈ S is the initial state
• δ : S × Σ → S is the state transition function
Each transition can be defined as a function whose address is placed in a table. The algorithm then simply directs the flow to the function specified by the current state and the current input in the table, as shown below. The states are now labeled 0, 1, 2, 3 and 4, and
the inputs are 0, 1, and 2 corresponding to a, b, and c. The time to construct the FSM
is Θ(m|Σ|) which is the size of the table, assuming there are no overlapping states.
The time to search the input string is Θ(n), resulting in a total time complexity of
Θ(m|Σ| + n) for this algorithm.
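The listing with function addresses is not reproduced here; the sketch below uses a plain transition table instead of a table of function pointers, which preserves the control flow described above. The table is built naively for clarity (slower than the Θ(m|Σ|) construction mentioned in the text), and the MAXM bound and the alphabet mapping are assumptions of the sketch.

#include <stdio.h>
#include <string.h>

#define SIGMA 3                 /* inputs 0, 1, 2 for a, b, c */
#define MAXM 15

/* delta[q][c] = length of the longest prefix of P that is a suffix
   of P[0..q-1] followed by the character c */
static void build_fsm(const char *P, int m, int delta[][SIGMA]) {
    int q, c, k, i, ok;
    for (q = 0; q <= m; q++)
        for (c = 0; c < SIGMA; c++) {
            k = (q + 1 < m) ? q + 1 : m;           /* k = min(m, q + 1) */
            for (; k > 0; k--) {
                if (P[k - 1] != 'a' + c) continue; /* last char must be c */
                ok = 1;
                for (i = 0; i + 1 < k; i++)        /* match the remaining prefix */
                    if (P[i] != P[q - k + 1 + i]) { ok = 0; break; }
                if (ok) break;
            }
            delta[q][c] = k;
        }
}

static void fsm_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), q = 0, i;
    int delta[MAXM + 1][SIGMA];
    build_fsm(P, m, delta);
    for (i = 0; i < n; i++) {             /* Θ(n): one lookup per character */
        q = delta[q][T[i] - 'a'];
        if (q == m)
            printf("match ending at location %d\n", i + 1);
    }
}

int main(void) {
    fsm_match("abccbabaccbaab", "ccba");  /* 5 states 0..4, inputs a, b, c */
    return 0;
}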
Fig. 5.2 The computation of the prefix array Π for the pattern P = {acacaba}
The prefix table Π is formed at initialization using the pattern P only. The idea is to check incrementally whether any proper prefix of the pattern is also a suffix of the part matched so far, and to record the longest such prefix. The entry Π[i] is the largest integer k smaller than i such that the prefix p1, ..., pk is also a suffix of p1, ..., pi. As an example, let us assume the pattern P = {acacaba} and
work out the values of Π as shown in Fig. 5.2.
The pseudocode of the prefix function to fill Π is shown in Algorithm 5.2.
Once this array is formed, we can use it when there is a mismatch to shift the
indices in the text T . The operation of the KMP algorithm is described in pseudocode
in Algorithm 5.3.
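A compact C rendering of Algorithms 5.2 and 5.3 in their standard form (0-based arrays, with a fixed-size Π buffer as a sketch-level assumption) is given below; match positions are reported 1-based to match the traces in this chapter.

#include <stdio.h>
#include <string.h>

#define MAXM 64

/* Algorithm 5.2: pi[i] is the length of the longest proper prefix of
   P[0..i] that is also a suffix of P[0..i] */
static void prefix_table(const char *P, int m, int *pi) {
    int k = 0, i;
    pi[0] = 0;
    for (i = 1; i < m; i++) {
        while (k > 0 && P[k] != P[i])
            k = pi[k - 1];          /* fall back after a mismatch */
        if (P[k] == P[i])
            k++;                    /* extend the current prefix */
        pi[i] = k;
    }
}

/* Algorithm 5.3: scan T once, never moving backward in T */
static void kmp_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), pi[MAXM], q = 0, i;
    prefix_table(P, m, pi);
    for (i = 0; i < n; i++) {
        while (q > 0 && P[q] != T[i])
            q = pi[q - 1];          /* reuse the partial match */
        if (P[q] == T[i])
            q++;
        if (q == m) {
            printf("full match at location %d\n", i - m + 2);
            q = pi[q - 1];          /* continue for further matches */
        }
    }
}

int main(void) {
    kmp_match("ccacacacabacab", "acacaba");  /* the trace example below */
    return 0;
}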
Prefix : 0 0 1 2 3 0 1
S = c c a c a c a c a b a c a b
P = a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (mismatch at 6, shift 1)
a c a c a b a (mismatches at 1,2 and 3, shift 1 at 3)
a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (sixth match)
a c a c a b a (FULL MATCH)
This algorithm takes O(m) time to compute the prefix values and O(n) to compare
the pattern to the text, resulting in a total of O(m + n) time. The running time of
this algorithm is optimal and can process large texts as it never needs to move
backward. However, it is sensitive to the size of the alphabet |Σ| as there will be
more mismatches when this increases.
T= a b c c b a a b b a c
P= a c b c b a mismatch at position 3
i=3, k=3, T[3]=c, i-R[c]=-1, shift 1
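The shift above follows the bad character rule: scanning the pattern right to left, a mismatch on text character c at pattern position i shifts the window by max(1, i − R[c]), where R[c] is the rightmost (1-based) position of c in P, or 0 if c does not occur in P. A sketch of a matcher driven by this rule alone (a simplification; the full algorithm combines it with further heuristics) is:

#include <stdio.h>
#include <string.h>

static void bad_char_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), R[256] = {0}, s = 0, i, shift;
    for (i = 0; i < m; i++)
        R[(unsigned char)P[i]] = i + 1;   /* rightmost position of each char */
    while (s <= n - m) {
        for (i = m; i >= 1 && P[i - 1] == T[s + i - 1]; i--)
            ;                             /* compare right to left */
        if (i == 0) {                     /* the whole pattern matched */
            printf("match at location %d\n", s + 1);
            s += 1;                       /* conservative shift after a match */
        } else {
            shift = i - R[(unsigned char)T[s + i - 1]];
            s += (shift > 1) ? shift : 1; /* shift = max(1, i - R[c]) */
        }
    }
}

int main(void) {
    bad_char_match("abccbaabbac", "acbcba");  /* the example above */
    return 0;
}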
b c c b c | b a a c b | b c a b b
p1 p2 p3
Neither process p2 nor p1 will be aware of the match because neither holds the whole of the pattern cba. A possible remedy is to provide the preceding process, p1 in this case, with the first m − 1 characters of its succeeding process. Searching for matches in these bordering regions then becomes the responsibility of the preceding process with the lower identifier. Assuming the communication costs are negligible, this approach, sometimes called embarrassingly parallel, will provide an almost linear speedup.
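A sketch of this scheme with POSIX threads is shown below; the thread function name, the naive inner matcher, and the global variables are illustrative choices. Each thread scans its own block plus the first m − 1 characters of the following block, so a boundary match is found exactly once, by the preceding thread.

#include <stdio.h>
#include <string.h>
#include <pthread.h>

#define NTHREADS 3

static const char *T, *P;
static int n, m;

static void *search_block(void *arg) {
    long id = (long)arg;
    int len = n / NTHREADS;
    int start = id * len;
    int end = (id == NTHREADS - 1) ? n : start + len + m - 1; /* overlap */
    int i, j;
    if (end > n) end = n;
    for (i = start; i <= end - m; i++) {   /* naive search in this block */
        for (j = 0; j < m && T[i + j] == P[j]; j++)
            ;
        if (j == m)
            printf("thread %ld: match at location %d\n", id, i + 1);
    }
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    long id;
    T = "bccbcbaacbbcabb";                /* the partitioned text above */
    P = "cba";
    n = strlen(T); m = strlen(P);
    for (id = 0; id < NTHREADS; id++)
        pthread_create(&th[id], NULL, search_block, (void *)id);
    for (id = 0; id < NTHREADS; id++)
        pthread_join(th[id], NULL);
    return 0;
}

Running this, the boundary occurrence of cba spanning p1 and p2 is reported by thread 0, which owns the preceding block.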
5.3 Approximate String Matching
The pseudocode for the naive approximate matching algorithm is shown in Algorithm 5.6. As in the naive exact algorithm, it has a time complexity of O(nm).
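Algorithm 5.6 itself is not reproduced here. One common naive formulation scores every alignment by counting character mismatches (Hamming distance) and reports alignments within a tolerance k; the O(nm) sketch below follows that assumption and does not handle the insertions and deletions appearing in the example that follows.

#include <stdio.h>
#include <string.h>

/* Report all alignments of P in T with at most k mismatches. */
static void approx_match(const char *T, const char *P, int k) {
    int n = strlen(T), m = strlen(P), i, j, d;
    for (i = 0; i <= n - m; i++) {          /* n - m + 1 alignments */
        for (d = 0, j = 0; j < m && d <= k; j++)
            if (T[i + j] != P[j])
                d++;                        /* count mismatched characters */
        if (d <= k)
            printf("approximate match at location %d (%d mismatches)\n",
                   i + 1, d);
    }
}

int main(void) {
    approx_match("abbabcbaabcca", "babc", 1);   /* S and A from the example */
    return 0;
}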
S = a b b a b c b a a b c c a
A = b a b c
B = a - c b a - - c
C = b a b - b
5.4 Longest Subsequence Problems
The following example shows the LCS P of two strings A and B. In general, we
are more interested in finding the length of LCS rather than its contents.
A = a b b a b c b a b a
B = b b a c c c a a c
P = b b a c a a
1. Case 1: Considering any two elements A[i] and B[j], the first case is that they are equal. When A[i] = B[j], the length of the current LCS should be incremented. This relation can be stated recursively as follows:
LCS[i, j] = 1 + LCS[i − 1, j − 1] (5.2)
2. Case 2: When A[i] ≠ B[j], either A[i] or B[j] has to be discarded, therefore:
LCS[i, j] = max(LCS[i − 1, j], LCS[i, j − 1]) (5.3)
Using these two cases, we can form the dynamic programming solution to this
problem using two nested loops as shown in Algorithm 5.7. The array C holds the
partial and final solutions to this problem.
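A direct C rendering of Algorithm 5.7 follows; the fixed MAXN bound is a sketch-level assumption, and the function returns only the LCS length, leaving the backtracking step to the discussion below.

#include <stdio.h>
#include <string.h>

#define MAXN 100

/* |LCS(A, B)| via the recurrences (5.2) and (5.3) */
static int lcs_length(const char *A, const char *B) {
    int n = strlen(A), m = strlen(B), i, j;
    static int C[MAXN + 1][MAXN + 1];
    for (i = 0; i <= n; i++) C[i][0] = 0;    /* empty prefix of B */
    for (j = 0; j <= m; j++) C[0][j] = 0;    /* empty prefix of A */
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++)
            if (A[i - 1] == B[j - 1])
                C[i][j] = C[i - 1][j - 1] + 1;        /* Case 1 */
            else
                C[i][j] = (C[i - 1][j] > C[i][j - 1]) ?
                           C[i - 1][j] : C[i][j - 1]; /* Case 2 */
    return C[n][m];
}

int main(void) {
    printf("|LCS| = %d\n", lcs_length("abacc", "babcc"));  /* prints 4 */
    return 0;
}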
For two strings A = {abacc} and B = {babcc}, let us form the output matrix C as below. The time to fill this matrix is Θ(nm) with a total space of Θ(nm), and the final length of the LCS is in the last element of C, that is, C[n, m] = |LCS(A, B)|.

        b  a  b  c  c
    a   0  1  1  1  1
    b   1  1  2  2  2
    a   1  2  2  2  2
    c   1  2  2  3  3
    c   1  2  2  3  4

In order to find the actual LCS sequence, we start from the bottom-right element C[n, m] of the matrix and backtrack the path we followed while filling it. The final LCSes recovered this way are {bacc} and {abcc}, as there are two such paths.
A bit-parallel algorithm for the LCS problem was proposed in [6] with O(nm/w) time and O(m/w) space complexities, where w is the size of the machine word and n and m are the lengths of the two strings.
Dominant points are the minimal points of search and using them narrows the
search space efficiently. Studies using dominant points to solve multiple LCS problem
in parallel for more than two input strings have been reported in [7,8]. A parallel
algorithm to solve LCS problem on graphics processing units (GPUs) was proposed
by Yang et al. [9].
As another example, let us consider the longest increasing subsequence (LIS) prob-
lem. Given a sequence S = {a1, a2, ..., an} of numbers, L is the longest subsequence of S where ∀ai, aj ∈ L, ai ≤ aj if i ≤ j. For example, given S = {7, 2, 4, 1, 6, 8, 5, 9},
L = {2, 4, 6, 8, 9}. A dynamic programming solution to the problem would again
start from the subproblems and store the results to be used in future. Algorithm 5.8
shows how to find the LIS using dynamic programming in O(n²) time [10]. The procedure Max_Val is used to find the maximum of the numbers in an array, here giving the length of the longest increasing subsequence stored in the array L.
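A C version of this O(n²) scheme, with the Max_Val step folded into the main loop, might read as follows (the fixed array bound is an assumption of the sketch):

#include <stdio.h>

/* Length of the longest (non-decreasing) increasing subsequence of S */
static int lis_length(const int *S, int n) {
    int L[64], i, j, best = 0;          /* assumes n <= 64 */
    for (i = 0; i < n; i++) {
        L[i] = 1;                       /* S[i] alone is a valid subsequence */
        for (j = 0; j < i; j++)
            if (S[j] <= S[i] && L[j] + 1 > L[i])
                L[i] = L[j] + 1;        /* extend the best one ending at j */
        if (L[i] > best)
            best = L[i];                /* the Max_Val step */
    }
    return best;
}

int main(void) {
    int S[] = {7, 2, 4, 1, 6, 8, 5, 9};
    printf("|LIS| = %d\n", lis_length(S, 8));   /* prints 5 */
    return 0;
}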
The longest common increasing subsequence (LCIS) problem is more general than LIS: here we search for an increasing subsequence common to two or more sequences. Formally, given two strings A = (a1, ..., am) and B = (b1, ..., bn) over an ordered alphabet Σ = {x1 < x2 < x3 < ...}, the LCIS of A and B is the increasing common subsequence of A and B that has the longest length. An example LCIS L of two strings A and B is shown below:
A = 3 6 1 4 2 5 7 8 2
B = 2 3 8 4 1 9 5 1 8
L = 3 4 5 8
5.5 Suffix Trees
A suffix tree, introduced by Weiner [11], is a data structure that represents the suffixes of a string. Suffix trees have numerous applications in bioinformatics problems including exact string matching and pattern matching. Formally, a suffix tree for a string S of n characters can be defined as a rooted directed tree with exactly n leaves numbered 1 to n, in which each internal node other than the root has at least two children, each edge is labeled with a nonempty substring of S, no two edges out of a node have labels beginning with the same character, and the concatenation of the edge labels on the path from the root to leaf i spells out the suffix S[i, n].
This representation, however, does not guarantee a valid suffix tree for every string S. If a suffix of S is also a prefix of another suffix, the path for that suffix ends in the middle of an edge rather than at a leaf. It is therefore general practice to place a special symbol such as $ at the end of S, so that every suffix of S ends with $, as shown in Fig. 5.3, which represents a suffix tree for the string S = abbcbabc.
A generalized suffix tree contains suffix trees for two or more strings. In this
case, each leaf number of such a tree is represented by two integers that identify the
string and the starting position of the suffix represented by the leaf in that string as
shown in Fig. 5.4, where two strings S1 = acbab and S2 = bacb over the same alphabet are stored in a single tree.
There are various algorithms to construct a suffix tree T of a string S. Weiner was first
to provide a linear time algorithm to construct a suffix tree [11], which was later simplified by McCreight with reduced space requirements [14]. Ukkonen proposed an
online algorithm which provided more space saving over McCreight’s algorithm
[15] and Farach provided an algorithm to construct suffix trees using unbounded
alphabet sizes [16]. We will describe sequential and parallel algorithms to construct
suffix trees starting with a naive algorithm and then the algorithm due to Ukkonen.
We will then investigate methods for parallel construction of suffix trees.
The naive algorithm first connects the longest suffix S[1, n]$ to the root, so that T1 consists of this suffix only. At each step i + 1, the tree Ti is traversed starting
at root r and a prefix of S[i + 1, n] that matches a path of Ti starting from the root
node is searched. If such a prefix is not found, a new leaf numbered i + 1 with edge
label S[i + 1, n$] is formed as an edge coming out of the root. If such a prefix of
S[i + 1, n$] exists as a prefix of a branch of Ti , we check symbols of the path in that
branch until a mismatch occurs. If this mismatch is in the midst of an edge (u, v),
we split (u, v) to form a new branch, otherwise if the mismatch occurs after a vertex
w, the new branch is joined to w as shown in Algorithm 5.9.
The execution of Algorithm 5.9 is shown in Fig. 5.5 for the string S = babca.
We start with the longest suffix S[1..n]$ in (a) and connect it to the root r to form T1 .
The second suffix starting at location 2 does not have a matching path and therefore
a new branch to T1 is added to form T2 as shown in (b). The suffix S[3..n] has the
first character in common with the first suffix and therefore, we can split the first
suffix branch as shown in (c). Continuing in this manner, the whole suffix tree is
constructed.
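A compacted suffix tree as in Algorithm 5.9 requires careful edge-label bookkeeping, so the sketch below builds the simpler uncompressed suffix trie instead: one character per edge, inserted suffix by suffix exactly as described above, at the cost of O(n²) nodes. The alphabet mapping is an assumption of the sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 27                 /* 'a'..'z' plus the terminator '$' */

struct node { struct node *child[SIGMA]; };

static int idx(char c) { return (c == '$') ? 26 : c - 'a'; }

static struct node *new_node(void) {
    return calloc(1, sizeof(struct node));   /* children start as NULL */
}

/* Insert every suffix S[i..n-1]$ into the trie, as the naive algorithm
   inserts every suffix into the tree. */
static struct node *build_suffix_trie(const char *S) {
    struct node *root = new_node();
    int n = strlen(S), i, j;
    for (i = 0; i < n; i++) {
        struct node *v = root;
        for (j = i; j <= n; j++) {           /* position n holds '$' */
            char c = (j == n) ? '$' : S[j];
            if (v->child[idx(c)] == NULL)
                v->child[idx(c)] = new_node();   /* open a new branch */
            v = v->child[idx(c)];            /* or follow the existing path */
        }
    }
    return root;
}

int main(void) {
    struct node *root = build_suffix_trie("babca");  /* string of Fig. 5.5 */
    (void)root;   /* a pattern query would simply walk down from the root */
    return 0;
}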
The naive algorithm has a time complexity of O(n2 ) for a string S of length n
and requires O(n) space. The time needed for the ith suffix in the ith iteration of this algorithm is O(n − i + 1). Therefore, the total time is:

∑_{i=1}^{n} O(n − i + 1) = ∑_{i=1}^{n} O(i) = O(n²)    (5.4)
Fig. 5.5 Construction of a suffix tree for the string S = {babca} by the naive algorithm