
Computational Biology

K. Erciyes

Distributed
and Sequential
Algorithms for
Bioinformatics

Computational Biology

Volume 23

Editors-in-Chief
Andreas Dress
CAS-MPG Partner Institute for Computational Biology, Shanghai, China

Michal Linial
Hebrew University of Jerusalem, Jerusalem, Israel

Olga Troyanskaya
Princeton University, Princeton, NJ, USA

Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany

Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden,
Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA

Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Texas, Austin, TX, USA
Lonnie Welch, Ohio University, Athens, OH, USA

The Computational Biology series publishes the very latest, high-quality research
devoted to specific issues in computer-assisted analysis of biological data. The main
emphasis is on current scientific developments and innovative techniques in
computational biology (bioinformatics), bringing to light methods from mathemat-
ics, statistics and computer science that directly address biological problems
currently under investigation.
The series offers publications that present the state-of-the-art regarding the
problems in question; show computational biology/bioinformatics methods at work;
and finally discuss anticipated demands regarding developments in future
methodology. Titles can range from focused monographs, to undergraduate and
graduate textbooks, and professional text/reference works.

More information about this series at http://www.springer.com/series/5769

K. Erciyes

Distributed
and Sequential Algorithms
for Bioinformatics


K. Erciyes
Computer Engineering Department
Izmir University
Uckuyular, Izmir
Turkey

ISSN 1568-2684
Computational Biology
ISBN 978-3-319-24964-3 ISBN 978-3-319-24966-7 (eBook)
DOI 10.1007/978-3-319-24966-7

Library of Congress Control Number: 2015950452

Springer Cham Heidelberg New York Dordrecht London


© Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media


(www.springer.com)

To the memory of Atilla Ozerdim and all
disabled people in his name.
Atilla was a former student and later a
colleague. He was highly intelligent but
severely disabled below the neck. He used a
computer by typing with a stick held in his
mouth, and in this way wrote thousands of
lines of code and carried out research.

Preface

Technological advancements in the last few decades have provided vast and
unprecedented amounts of biological data, including DNA, cell, and biological
network data. This data comes in two basic formats: DNA nucleotide and protein
amino acid sequences and, more recently, topological data of biological networks.
Analysis of this huge data is a task on its own, and the problems encountered are
NP-hard most of the time, defying solutions in polynomial time. Such analysis is
required as it provides a fundamental understanding of the functioning of a cell,
which can help us understand human health and disease states and the diagnosis
of diseases, and can further aid the development of biotechnological processes for
medical purposes such as the treatment of diseases.
Instead of searching for optimal solutions to these difficult problems, approxi-
mation algorithms that provide sub-optimal solutions are usually preferred. An
approximation algorithm should guarantee a solution within an approximation
factor for all input combinations. In many cases, even approximation algorithms are
not known to date, and heuristics that are shown experimentally to work for most
input cases are accepted as solutions.
Under these circumstances, there is an increasing demand and interest in the
research community for parallel/distributed algorithms that solve these problems
efficiently using a number of processing elements. This book is about both
sequential and distributed algorithms for the analysis and modeling of biological
data and, as such, it is one of the first on this topic to the best of our knowledge.
In the context of this book, we will assume a distributed algorithm as a parallel
algorithm executed on a distributed memory processing system using
message-passing rather than special purpose parallel architectures. For the cases of
shared memory parallel computing, we will use the term parallel algorithm
explicitly. We also cover algorithms for biological sequences (DNA and protein)
and biological network (protein interaction networks, gene regulation, etc.) data in
the same volume. Although algorithms for DNA sequences have a longer history of
study, even the sequential algorithms for biological networks such as the protein
interaction networks are rare and are at an early stage of development in research
studies. We aim to give a unified view of algorithms for sequences and networks of
biological systems where possible. These two views are not totally unrelated; for
example, the function of a protein is influenced by both its position in a network
and its amino acid sequence, and also by its 3-D structure. It can also be seen that
the problems in the sequence and network domains are analogous to some extent;
for example, sequence alignment and network alignment, sequence motifs and
network motifs, sequence clustering and network clustering are analogous problems
in these two related worlds. It is not difficult to predict that the analysis of biological
networks will have a profound effect on our understanding of the origins of life, health,
and disease states, just as the analysis of DNA/RNA and protein sequences has.
Parallel and distributed algorithms are needed to solve bioinformatics
problems simply for the speedup they provide. Even linear-time algorithms may
be difficult to realize in bioinformatics due to the size of the data involved. For
example, suffix trees are fundamental data structures in bioinformatics, and con-
structing them takes O(n) time with relatively recent algorithms such as Ukkonen's or
Farach's. Considering that human DNA consists of over 3 billion base pairs, even
these linear-time algorithms are time-consuming. However, by using distributed
suffix trees, the time can be reduced to O(n/k), where k is the number of processors.
One wonders then about the scarcity of the research efforts in the design and
implementation of distributed algorithms for these time-consuming difficult tasks.
A possible reason would be that a number of problems have been introduced
recently and the general approach in the research community has been to search for
sequential algorithmic solutions first and then investigate ways of parallelizing
these algorithms or design totally new parallel/distributed algorithms. Moreover,
parallel and distributed computing is a discipline of its own, and researchers in
this field may not be familiar with bioinformatics problems in general; the
multidisciplinary efforts between this discipline and bioinformatics seem to be just
starting. This book is an effort to help fill the aforementioned gap
between distributed computing and bioinformatics. Our main goal is to first
introduce the fundamental sequential algorithms to understand the problem and
then describe distributed algorithms that can be used for fundamental bioinfor-
matics problems such as sequence and network alignment, and finding sequence
and network motifs, and clustering. We review the most fundamental sequential
algorithms which aid the understanding of the problem better and yield
parallel/distributed versions. In other words, we try to be as comprehensive as
possible in the coverage of parallel/distributed algorithms for the fundamental
bioinformatics problems with an in-depth analysis of sequential algorithms.
Writing a book on bioinformatics is a challenging task for a number of reasons.
First of all, there are so many diverse topics to consider, from clustering to genome
rearrangement, from network motif search to phylogeny, and one has to be selective
not to diverge greatly from the initially set objectives. We had to carefully choose a
subset of topics to be included in this book in line with the book title and aim: tasks
that require substantial computational power due to their data sizes and are therefore
good candidates for parallelization. Second, bioinformatics is a very dynamic area
of study with frequent new technological advances and results, which requires a
thorough survey of the contemporary literature on the presented topics. The two worlds of
bioinformatics, namely biological sequences and biological networks, both have
similar challenging problems. A closer look reveals that these two worlds in fact have
analogous problems, as mentioned: sequence alignment and network alignment;
sequence motif search and network motif search; sequence clustering and network
clustering, which may be comforting at first. However, these problems are not closely
related in general, other than the clustering problem, which can be converted from
the sequence domain to the network domain with some effort.
We have a uniform chapter layout by first starting with an informal description
of the problem at hand. We then define it formally and review the significant
sequential algorithms in this topic briefly. We describe parallel/distributed algo-
rithms for the same problem, which is the main theme of the book, and briefly
discuss software packages if any are available. When distributed algorithms for
the topic are scarce, we provide clues and possible approaches for distributed
algorithms, as possible extensions to sequential ones or as totally new algorithms, to
aid starting researchers in the topic. There are several coarsely designed and
unpublished distributed algorithms, approximately 2–3 algorithms in some chapters,
for the stated problems in biological sequences and networks, which can be taken as
starting points for potential researchers in this field. In the final part of each chapter,
we emphasize main points, compare the described algorithms, give a contemporary
review of the related literature, and show possible open research areas in the chapter
notes section.
The intended audience for this book is graduate students and researchers in
computer science, biology, genetics, mathematics, and engineering, or any person
with a basic background in discrete mathematics and algorithms. The Web page for
the book is at: http://eng.izmir.edu.tr/~kerciyes/DSAB.
I would like to thank graduate students at Izmir University who have taken
complex networks and distributed algorithms courses for their valuable feedback
when parts of the material covered in the book were presented during lectures.
I would like to thank Esra Rüzgar for her review and comments on parallel network
motif search algorithms of Chap. 14. I would also like to thank Springer senior
editor Wayne Wheeler and associate editor Simon Rees for their continuous support
and patience during the course of this project.

Uckuyular, Izmir, Turkey K. Erciyes

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Biological Sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Biological Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 The Need for Distributed Algorithms. . . . . . . . . . . . . . . . . . 6
1.5 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Part I Background
2 Introduction to Molecular Biology . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Central Dogma of Life. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 The Genetic Code . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Mutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Biotechnological Methods . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Polymerase Chain Reaction . . . . . . . . . . . . . . . . . . 20
2.4.3 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Nucleotide Databases. . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Protein Sequence Databases . . . . . . . . . . . . . . . . . . 22
2.6 Human Genome Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


3 Graphs, Algorithms, and Complexity . . . . . . . . . . . . . . . . . . . . . . 27


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Types of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Graph Representations . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Paths, Cycles, and Connectivity . . . . . . . . . . . . . . . 30
3.2.4 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.5 Spectral Properties of Graphs . . . . . . . . . . . . . . . . . 32
3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Time and Space Complexities. . . . . . . . . . . . . . . . . 33
3.3.2 Recurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Fundamental Approaches . . . . . . . . . . . . . . . . . . . . 35
3.3.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 35
3.3.5 Graph Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.6 Special Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 NP-Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Coping with NP-Completeness . . . . . . . . . . . . . . . . 45
3.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Parallel and Distributed Computing. . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Architectures for Parallel and Distributed Computing. . . . . . . 52
4.2.1 Interconnection Networks. . . . . . . . . . . . . . . . . . . . 52
4.2.2 Multiprocessors and Multicomputers . . . . . . . . . . . . 53
4.2.3 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Complexity of Parallel Algorithms . . . . . . . . . . . . . 55
4.3.2 Parallel Random Access Memory Model . . . . . . . . . 55
4.3.3 Parallel Algorithm Design Methods . . . . . . . . . . . . . 57
4.3.4 Shared Memory Programming . . . . . . . . . . . . . . . . 59
4.3.5 Multi-threaded Programming . . . . . . . . . . . . . . . . . 63
4.3.6 Parallel Processing in UNIX . . . . . . . . . . . . . . . . . . 66
4.4 Distributed Computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 Distributed Algorithm Design . . . . . . . . . . . . . . . . . 69
4.4.2 Threads Re-visited . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.3 Message Passing Interface . . . . . . . . . . . . . . . . . . . 70
4.4.4 Distributed Processing in UNIX . . . . . . . . . . . . . . . 73
4.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Part II Biological Sequences


5 String Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Exact String Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.1 Sequential Algorithms . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Distributed String Matching . . . . . . . . . . . . . . . . . . 90
5.3 Approximate String Matching . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Longest Subsequence Problems. . . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Longest Common Subsequence. . . . . . . . . . . . . . . . 92
5.4.2 Longest Increasing Subsequence . . . . . . . . . . . . . . . 95
5.5 Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Construction of Suffix Trees. . . . . . . . . . . . . . . . . . 97
5.5.2 Applications of Suffix Trees . . . . . . . . . . . . . . . . . . 102
5.5.3 Suffix Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2.1 The Objective Function . . . . . . . . . . . . . . . . . . . . . 112
6.2.2 Scoring Matrices for Proteins . . . . . . . . . . . . . . . . . 114
6.3 Pairwise Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3.1 Global Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3.2 Local Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.4 Multiple Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . 120
6.4.1 Center Star Method . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4.2 Progressive Alignment . . . . . . . . . . . . . . . . . . . . . . 122
6.5 Alignment with Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . 123
6.6 Database Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.6.1 FASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.6.2 BLAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.7 Parallel and Distributed Sequence Alignment . . . . . . . . . . . . 126
6.7.1 Parallel and Distributed SW Algorithm . . . . . . . . . . 126
6.7.2 Distributed BLAST . . . . . . . . . . . . . . . . . . . . . . . . 127
6.7.3 Parallel/Distributed CLUSTALW . . . . . . . . . . . . . . 129
6.8 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7 Clustering of Biological Sequences . . . . . . . . . . . . . . . . . . . . . . . . 135
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.2.1 Distance and Similarity Measures . . . . . . . . . . . . . . 136
7.2.2 Validation of Cluster Quality . . . . . . . . . . . . . . . . . 137

7.3 Classical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138


7.3.1 Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . . 138
7.3.2 Partitional Algorithms . . . . . . . . . . . . . . . . . . . . . . 140
7.3.3 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4 Clustering Algorithms Targeting Biological Sequences. . . . . . 144
7.4.1 Alignment-Based Clustering . . . . . . . . . . . . . . . . . . 144
7.4.2 Other Similarity-Based Methods . . . . . . . . . . . . . . . 144
7.4.3 Graph-Based Clustering . . . . . . . . . . . . . . . . . . . . . 145
7.5 Distributed Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.5.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . 146
7.5.2 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . 152
7.5.3 Graph-Based Clustering . . . . . . . . . . . . . . . . . . . . . 154
7.5.4 Review of Existing Algorithms . . . . . . . . . . . . . . . . 155
7.6 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8 Sequence Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.2 Tandem Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.2.1 Stoye and Gusfield Algorithm. . . . . . . . . . . . . . . . . 164
8.2.2 Distributed Tandem Repeat Search . . . . . . . . . . . . . 166
8.3 Sequence Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.1 Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . 169
8.3.2 Combinatorial Methods . . . . . . . . . . . . . . . . . . . . . 171
8.3.3 Parallel and Distributed Motif Search . . . . . . . . . . . 174
8.3.4 A Survey of Recent Distributed Algorithms . . . . . . . 178
8.4 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9 Genome Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.2 Gene Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.2.1 Fundamental Methods . . . . . . . . . . . . . . . . . . . . . . 185
9.2.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . 186
9.2.3 Nature Inspired Methods . . . . . . . . . . . . . . . . . . . . 187
9.2.4 Distributed Gene Finding . . . . . . . . . . . . . . . . . . . . 189
9.3 Genome Rearrangement . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.3.1 Sorting by Reversals . . . . . . . . . . . . . . . . . . . . . . . 191
9.3.2 Unsigned Reversals . . . . . . . . . . . . . . . . . . . . . . . . 193
9.3.3 Signed Reversals. . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.3.4 Distributed Genome Rearrangement Algorithms . . . . 199
9.4 Haplotype Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9.4.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 202
9.4.2 Clark’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 202

9.4.3 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 203


9.4.4 Distributed Haplotype Inference Algorithms . . . . . . . 204
9.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

Part III Biological Networks


10 Analysis of Biological Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.2 Networks in the Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
10.2.1 Metabolic Networks . . . . . . . . . . . . . . . . . . . . . . . 214
10.2.2 Gene Regulation Networks . . . . . . . . . . . . . . . . . . . 215
10.2.3 Protein Interaction Networks. . . . . . . . . . . . . . . . . . 216
10.3 Networks Outside the Cell . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.3.1 Networks of the Brain . . . . . . . . . . . . . . . . . . . . . . 217
10.3.2 Phylogenetic Networks . . . . . . . . . . . . . . . . . . . . . 219
10.3.3 The Food Web . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.4 Properties of Biological Networks . . . . . . . . . . . . . . . . . . . . 221
10.4.1 Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.4.2 Vertex Degrees . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.4.3 Clustering Coefficient . . . . . . . . . . . . . . . . . . . . . . 223
10.4.4 Matching Index. . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.5 Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
10.5.1 Degree Centrality . . . . . . . . . . . . . . . . . . . . . . . . . 224
10.5.2 Closeness Centrality . . . . . . . . . . . . . . . . . . . . . . . 225
10.5.3 Betweenness Centrality . . . . . . . . . . . . . . . . . . . . . 225
10.5.4 Eigenvalue Centrality . . . . . . . . . . . . . . . . . . . . . . 229
10.6 Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.6.1 Random Networks. . . . . . . . . . . . . . . . . . . . . . . . . 230
10.6.2 Small World Networks . . . . . . . . . . . . . . . . . . . . . 231
10.6.3 Scale-Free Networks . . . . . . . . . . . . . . . . . . . . . . . 232
10.6.4 Hierarchical Networks . . . . . . . . . . . . . . . . . . . . . . 233
10.7 Module Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.8 Network Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.9 Network Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.10 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11 Cluster Discovery in Biological Networks . . . . . . . . . . . . . . . . . . . 241
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
11.2 Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
11.2.1 Quality Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 242
11.2.2 Classification of Clustering Algorithms . . . . . . . . . . 245

11.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 246


11.3.1 MST-Based Clustering. . . . . . . . . . . . . . . . . . . . . . 247
11.3.2 Edge-Betweenness-Based Clustering . . . . . . . . . . . . 250
11.4 Density-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 252
11.4.1 Clique Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . 252
11.4.2 k-core Decomposition . . . . . . . . . . . . . . . . . . . . . . 254
11.4.3 Highly Connected Subgraphs Algorithm . . . . . . . . . 258
11.4.4 Modularity-Based Clustering . . . . . . . . . . . . . . . . . 259
11.5 Flow Simulation-Based Approaches. . . . . . . . . . . . . . . . . . . 263
11.5.1 Markov Clustering Algorithm . . . . . . . . . . . . . . . . . 263
11.5.2 Distributed Markov Clustering
Algorithm Proposal . . . . . . . . . . . . . . . . . . . . . . . . 265
11.6 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.7 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
12 Network Motif Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
12.2.1 Methods of Motif Discovery. . . . . . . . . . . . . . . . . . 277
12.2.2 Relation to Graph Isomorphism . . . . . . . . . . . . . . . 278
12.2.3 Frequency Concepts . . . . . . . . . . . . . . . . . . . . . . . 279
12.2.4 Random Graph Generation . . . . . . . . . . . . . . . . . . . 280
12.2.5 Statistical Significance . . . . . . . . . . . . . . . . . . . . . . 280
12.3 A Review of Sequential Motif Searching Algorithms. . . . . . . 281
12.3.1 Network Centric Algorithms. . . . . . . . . . . . . . . . . . 282
12.3.2 Motif Centric Algorithms . . . . . . . . . . . . . . . . . . . . 286
12.4 Distributed Motif Discovery . . . . . . . . . . . . . . . . . . . . . . . . 291
12.4.1 A General Framework . . . . . . . . . . . . . . . . . . . . . . 291
12.4.2 Review of Distributed Motif
Searching Algorithms . . . . . . . . . . . . . . . . . . . . . . 292
12.4.3 Wang et al.’s Algorithm. . . . . . . . . . . . . . . . . . . . . 292
12.4.4 Schatz et al.’s Algorithm . . . . . . . . . . . . . . . . . . . . 294
12.4.5 Riberio et al.’s Algorithms . . . . . . . . . . . . . . . . . . . 294
12.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13 Network Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.2.1 Relation to Graph Isomorphism . . . . . . . . . . . . . . . 304
13.2.2 Relation to Bipartite Graph Matching . . . . . . . . . . . 305
13.2.3 Evaluation of Alignment Quality. . . . . . . . . . . . . . . 305
13.2.4 Network Alignment Methods . . . . . . . . . . . . . . . . . 307

13.3 Review of Sequential Network Alignment Algorithms . . . . . . 308


13.3.1 PathBlast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
13.3.2 IsoRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.3.3 MaWIsh. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.3.4 GRAAL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.3.5 Recent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 310
13.4 Distributed Network Alignment . . . . . . . . . . . . . . . . . . . . . 311
13.4.1 A Distributed Greedy Approximation
Algorithm Proposal . . . . . . . . . . . . . . . . . . . . . . . . 311
13.4.2 Distributed Hoepman’s Algorithm . . . . . . . . . . . . . . 314
13.4.3 Distributed Auction Algorithms . . . . . . . . . . . . . . . 316
13.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
14 Phylogenetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
14.2 Terminology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
14.3 Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
14.3.1 Distance-Based Algorithms. . . . . . . . . . . . . . . . . . . 326
14.3.2 Maximum Parsimony. . . . . . . . . . . . . . . . . . . . . . . 335
14.3.3 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . 342
14.4 Phylogenetic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.4.1 Reconstruction Methods. . . . . . . . . . . . . . . . . . . . . 344
14.5 Chapter Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
15 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
15.2 Current Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.2.1 Big Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.2.2 Disease Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 353
15.2.3 Bioinformatics Education . . . . . . . . . . . . . . . . . . . . 355
15.3 Specific Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
15.3.1 Sequence Analysis . . . . . . . . . . . . . . . . . . . . . . . . 356
15.3.2 Network Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 357
15.4 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.4.1 Big Data Gets Bigger . . . . . . . . . . . . . . . . . . . . . . 360
15.4.2 New Paradigms on Disease Analysis . . . . . . . . . . . . 360
15.4.3 Personalized Medicine . . . . . . . . . . . . . . . . . . . . . . 361
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
1 Introduction

1.1 Introduction

Biology is the science of life and living organisms. An organism is a living entity that
may consist of organs that are made of tissues. Cells are the building blocks of organ-
isms and form tissues of organs. Cells consist of molecules and molecular biology
is the science of studying the cell at molecular level. The nucleus of a cell contains
deoxyribonucleic acid (DNA) which stores all of the genetic information. DNA is
a double helix structure consisting of four types of molecules called nucleotides. It
consists of a long sequence of about 3 billion nucleotide pairs. From the viewpoint
of computing, DNA is simply a string over a four-letter alphabet. Ribonucleic
acid (RNA) has a single strand and also consists of four types of nucleotides like DNA,
with one nucleotide differing. Proteins are large molecules outside the nucleus of the
cell and perform vital life functions. A protein is basically a linear chain of molecules
called amino acids. Molecules of the cell interact to perform all necessary functions
for living.
Recent technological advancements have provided vast amounts of biological
data at molecular level. Analysis and extracting meaningful information from this
data requires new methods and approaches. This data comes in two basic forms as
sequence and network data. On one hand, we are provided with sequence data of
DNA/RNA and proteins, and on the other hand, we have topological information
about the connectivity of various networks within the cell. Analysis of this data is a
task on its own due to its huge size.
We first describe the problems encountered in the analysis of biological sequences
and networks and we then describe why distributed algorithms are imperative as
computational methods in this chapter. It seems design and implementation of dis-
tributed algorithms are inevitable for these time-consuming difficult problems and
their scarcity can be attributed to the relatively recent provision of the biological data
and the field being multidisciplinary in nature. We conclude by providing the outline
of the book.


1.2 Biological Sequences

The biological sequences in the cell we are referring are the nucleotide sequences in
DNA, RNA, and amino acid sequences in proteins. DNA contains four nucleotides in
a double-helix-shaped two-strand structure: Adenine (A), Cytosine (C), Guanine (G),
and Thymine (T). Adenine always pairs with thymine, and guanine with cytosine.
The primary structure of a protein consists of a linear chain of amino acids and the
order of these affects its 3D shape. Figure 1.1 shows a simplified diagram of DNA
and a protein.
Analysis of these biological sequences involves the following tasks:

• Comparison of sequences: A basic requirement to analyze a newly discovered
sequence is to compare it with known sequences. The basic assumption here is
that similar structures may indicate similar functions. In the most basic form,
we attempt to find the approximately matching nucleotides in two or more sequences.
This process is commonly called sequence alignment and can be solved in polynomial
time by dynamic programming algorithms (a minimal sketch of such an algorithm is
given after this list). A sequence alignment algorithm provides the similarities and
distances between a number of input sequences, which can be used for further
processing.
• Clustering: The grouping of similar sequences is called clustering and this process
is at a higher level than sequence alignment as it needs the distances between the
sequences as computed by the alignment. Clustering of sequences aims to find their
affinities and infer ancestral relationships based on the groupings. The functions
of sequences in clusters can be analyzed more easily and also this information can
be used for disease analysis of organisms.
• Sequence patterns: DNA and proteins contain repeating subsequences which are
called sequence repeats. These repeats may be consecutive or dispersed, and the
former is commonly referred to as tandem repeats and the latter as sequence motifs.
In various diseases, the number of the repeats is more than expected, and hence
discovering them helps to understand the disease mechanism in an organism. They
also reside near genes and may be used to find the location of genes. The number

[Fig. 1.1 (a) DNA double helix structure. (b) A protein structure having a linear sequence of amino acids]

and locations of repeats are unique for individuals and can be used in forensics to
identify individuals.
• Gene finding: A gene codes for a polypeptide which can form or be a part of a
protein. There are over 20,000 genes in human DNA which occupy only about
3 % of human genome. Finding genes is a fundamental step in their analysis. A
mutated gene may cause the formation of a wrong protein, disturbing the healthy
state of an organism, but mutations are harmless in many cases.
• Genome rearrangements: Mutations of DNA at a coarser level than point mutations
of nucleotides involve certain alterations of segments or genes in DNA. These
changes include reversals, duplications, and transfer to different locations in DNA.
Genome rearrangements may result in the production of new species but in many
cases, they are considered as the causes of complex diseases.
• Haplotype inference: The DNA sequencing methods of humans provide the order
of DNA nucleotides from two chromosomes as this approach is more cost-effective
and practical. However, the sequence information from a single chromosome called
a haplotype is needed for disease analysis and also to discover ancestral relation-
ships. Discovering single chromosome data from the two chromosomes is called
haplotype inference and is needed for the data to be meaningful.
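
To make the dynamic programming idea mentioned in the first item concrete, the short sketch below computes a global alignment score for two toy DNA strings. It is only an illustrative sketch, not code from this book; the scoring values (match +1, mismatch -1, gap -2) and the function name are arbitrary choices made for the example.

def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    # Needleman-Wunsch style dynamic programming:
    # dp[i][j] = best score of aligning s[:i] with t[:j].
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                 # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap                 # gaps only against t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            dp[i][j] = max(diag,                    # match or mismatch
                           dp[i - 1][j] + gap,      # gap in t
                           dp[i][j - 1] + gap)      # gap in s
    return dp[n][m]

print(global_alignment_score("ACGTTA", "ACTTA"))   # prints 3 with these example scores

The quadratic table built here is exactly what makes alignment of very long sequences expensive and, as discussed later in the book, a natural target for parallelization.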

All of the above-described tasks are fundamental areas of research in the analysis
of biological sequences. Comparison of sequences, clustering, and finding repeats
apply both to DNA/RNA and protein sequences. Protein sequences may also be
employed to discover genes as they are the product of genes; however, genome
rearrangements and haplotype inference problems are commonly associated with
the DNA/RNA sequences. Except for the sequence alignment problem, there are
hardly any polynomial-time algorithms for these problems. Even when there is a solution
in polynomial time, the size of the data necessitates the use of approximation algorithms
if they are available. As we will see, heuristic algorithms that can be shown experimentally
to work for a wide range of input combinations are the only choice in
many cases.

1.3 Biological Networks

Biological networks consist of biological entities which interact in some form. The
modeling and analysis of biological networks are fundamental areas of research
in bioinformatics. The number of nodes in a biological network is large and these
nodes have complex relations among them. We can represent a biological network
by a graph where an edge between two entities indicates an interaction between
them. This way, many results in graph theory and also various graph algorithms
become available for immediate use to help solve a number of problems in biological
networks.

We can make a coarse distinction between the networks in the cell and other bio-
logical networks. The cell contains DNA, RNA, proteins, and metabolites at the
molecular level. Networks at this level include gene regulation networks, signal
transduction networks, protein–protein interaction (PPI) networks, and metabolic
networks. DNA is static, containing the genetic code, and proteins take part in various
vital functions in the cell. Genes in DNA code for proteins in a process called gene
expression.
DNA, RNA, proteins, and metabolites all have their own networks. Intracellular
biological networks have molecules within the cell as their nodes and their interac-
tions as links. PPI networks have proteins which interact with each other to cooperate
to accomplish various vital functions for life. PPI networks are not constructed
randomly; they show specific structures. They have a small number of highly connected
nodes called hubs, and the rest of the nodes in the network have very few connec-
tions to other proteins. The distance between the two farthest proteins in such a network
is very small compared to the size of the network. This type of network is called a
small-world network, as it is relatively easy to reach a node from any other node
in the network. Hubs in PPI networks form dense regions of the network, and these
clusters of nodes, called protein complexes, usually have some fundamental functions
attributed to them. It is therefore desirable to detect these clusters in PPI networks,
and we will investigate algorithms for this purpose in Chap. 11. A PPI network has
very few hubs and many low-degree nodes. Such networks are commonly called
scale-free networks, and a hub in a PPI network is presumed to act as a gateway
between a large number of nodes. Figure 1.2 shows a subnetwork of a PPI network
of the immune system where small-world and scale-free properties can be visually
detected along with a number of hubs.
The main areas of investigation in biological networks are the following:

• Analysis of networks: Given a graph G(V, E) representing a biological network,
our interest is to analyze the topological properties of this graph. Most biological
networks turn out to be small-world and scale-free networks. Another point of
interest is to find the importance of nodes and links in these networks. The
centrality parameters provide an evaluation of the importance of vertices and edges
in the network graph (a small illustrative sketch of such an analysis is given after
this list).
• Clustering: Clustering in this case involves finding dense regions of the network
graph. The nodes in these regions have more connections between them than
the rest of the network, presumably due to a significant function performed in
these subgraphs. A disease state of an organism may also manifest as densely
connected subgraphs of a PPI network, for example. Detecting these clusters is an
important research area in biological networks.
• Network motifs: Network motifs are frequently occurring subgraph patterns
in biological networks. They are the analogous structures to DNA and protein
sequence motifs. A network motif is commonly perceived as the building block
of a biological network and therefore it is assumed to have some basic functional
significance. Discovering these motifs is a computationally difficult task since the
number of possible motifs grows exponentially with increased size. For example,

[Fig. 1.2 A subnetwork of the SpA PPI network involved in innate immune response (taken from [1]). Triangles are the genes]

there are 13 possible connected directed subgraphs with only three nodes, and the
count grows to 199 for size four. Comparison of networks and hence deduction of ancestral relationships
are also possible by searching for common motifs in them.
• Network alignment: In many cases, two or more biological networks may be similar
and finding the similar subgraphs may show the conserved and therefore important
topological structures in them. A fundamental problem in biological networks is the
assessment of this similarity between two or more biological networks. Analogous
to sequence alignment of DNA/RNA and proteins, the network alignment process
aims to align two or more biological networks. This time, we basically search for
topological similarities between the networks. Node similarity is also considered
together with topological similarity to have a better view of the affinity of two
or more networks. Our aim in alignment is very similar to sequence alignment
in a different domain, and we try to compare two or more networks, find their
conserved modules, and infer ancestral relationships between them.
• Phylogeny: Phylogeny is the study and analysis of evolutionary relationships
among organisms. Phylogenetics aims to discover these relationships using mole-
cular sequence and morphological data to reconstruct evolutionary dependencies
among the organisms. Phylogenetics is one of the most studied and researched
topics in bioinformatics and its implications are numerous. It can help disease
analysis and design of drug therapies by analyzing phylogenetic dependencies of
pathogens. Transmission characteristics of diseases can also be predicted using phy-
logenetics.
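
As a small, hedged illustration of the kind of topological analysis referred to above, the sketch below stores a toy interaction network as an adjacency structure and reports each vertex's degree and local clustering coefficient. The network, node names, and values are invented purely for this example and are not taken from the book.

from itertools import combinations

# A toy undirected PPI-like network given as an edge list (illustrative only).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "C"), ("D", "E"), ("E", "F")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def clustering_coefficient(node):
    # Fraction of neighbour pairs of `node` that are themselves connected.
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for x, y in combinations(nbrs, 2) if y in adj[x])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

for node in sorted(adj):
    print(node, "degree:", len(adj[node]),
          "clustering:", round(clustering_coefficient(node), 2))

A high-degree vertex such as A in this toy graph plays the role of a hub; on real PPI networks, the same degree and clustering computations reveal the scale-free and small-world properties discussed in this section.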

Most of the problems outlined above do not have solutions in polynomial time
and we are left with approximation algorithms in rare cases but mostly with heuristic
algorithms. Since the size of the networks under consideration is huge, sampling
the network and running the algorithm on the sample is the approach typically followed
in many methods. This approach provides a solution in reasonable time at the expense
of decreased accuracy.

1.4 The Need for Distributed Algorithms

There are many challenges in bioinformatics, and we can first state that obtaining
meaningful data is difficult in general as it is noisy most of the time. The size of the data
is large, making it difficult to find time-efficient algorithms. Complexity class P
contains problems that can be solved in polynomial time, and some problems we need
to solve in bioinformatics belong to this class. Many problems, however, can only
be solved in exponential time with known methods, and in some cases no efficient exact
solution to the problem at hand is known at all. These problems belong to the class of
nondeterministic polynomial hard (NP-hard) problems, and we need to rely on
approximation algorithms, which find suboptimal solutions with bounded approximation
ratios, for such hard problems. Most of the time, there are no known approximation
algorithms and we need to rely on heuristic methods, which show experimentally that
the algorithm works for most input combinations.
Parallel algorithms and distributed algorithms provide faster execution as a num-
ber of processing elements are used in parallel. Here, we make a distinction between
shared memory parallel processing where processing elements communicate using a
shared and protected memory and distributed processing where computational nodes
of a communication network communicate by the transfer of messages without any
shared memory. Our main focus in this book is about distributed, that is, distrib-
uted memory, message-passing algorithms although we will also cover a number of
shared memory parallel algorithms. The efficiency of a parallel/distributed algorithm
is basically measured by the speedup it provides, which is expressed as the ratio of
the number of time steps of the sequential algorithm, T_s, to the number of time
steps of the parallel algorithm, T_p. If k processing elements are used, the speedup
should ideally approach k. However, there are synchronization delays in shared
memory parallel processing and message delays of the interconnection network
in distributed processing, which means the speedup value can be far from ideal.
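
To state this measure explicitly (the numbers below are only an illustrative assumption, not taken from the text), the speedup S and the related efficiency E on k processing elements are

    S = T_s / T_p,    E = S / k.

For instance, if a task takes T_s = 240 s sequentially and T_p = 40 s on k = 8 processors, then S = 6 and E = 0.75; the gap from the ideal S = 8 is typically due to the synchronization and message delays mentioned above.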
Whether an algorithm finds an exact solution or an approximate solution in
polynomial time, bioinformatics problems have an additional property: the size of
the input data is huge. We may therefore attempt to provide a parallel or distributed
version of an algorithm even if it does solve the problem in polynomial time. For
example, suffix trees are versatile data structures that find various implementations in
algorithmic solutions to bioinformatics problems. There exist algorithms that construct
suffix trees in O(n) steps, where n is the length of the sequence from which the suffix
tree is built. Although this time complexity is acceptable for a majority of computational
tasks, we may need to search for faster solutions when a suffix tree of a DNA sequence
of about 3 billion base pairs is to be constructed. Theoretically, it is difficult to find an
algorithm that has better than linear-time execution; however, we can distribute this
task to a number of processing elements and achieve significant gains in time. There is
also the possibility of a double gain in time by first designing an approximation
algorithm, if this is possible, with better execution time and then providing a distributed
version of this algorithm to obtain further speedup.
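
As a rough sketch of how such a distribution could look (an assumption-laden illustration, not an algorithm from this book), the suffixes of a sequence can be bucketed by their leading character, and each bucket handed to a different worker that builds its own index independently; with balanced buckets over k workers, each one handles roughly n/k suffixes. Real distributed suffix tree constructions are considerably more involved, but the data-partitioning idea is the same.

from collections import defaultdict

def partition_suffixes(seq, prefix_len=1):
    # Assign each suffix of `seq` to a bucket keyed by its first prefix_len symbols.
    # Each bucket could be handed to a separate worker that builds a suffix tree
    # (or suffix array) only for the suffixes it owns.
    buckets = defaultdict(list)
    for i in range(len(seq)):
        buckets[seq[i:i + prefix_len]].append(i)   # store suffix start positions
    return buckets

seq = "GATTACAGATTACA"
for prefix, starts in sorted(partition_suffixes(seq).items()):
    print(prefix, "->", starts)
# The worker responsible for prefix 'G', for example, would index only the
# suffixes beginning at the positions listed under 'G'.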

1.5 Outline of the Book

Our description of algorithms in each chapter follows the same model where the
problem is first introduced informally and the related concepts are described. We
then present the problem formally with the notation used and review the fundamen-
tal sequential algorithms with emphasis on the ones that have potential to be modified
and executed in parallel/distributed environments. After this step, we usually pro-
vide a template of a generic algorithm that summarizes the general approaches to
have a distributed algorithm for the problem at hand. This task can be accomplished
basically by partitioning data, partitioning code or both among the processing ele-
ments. We also propose new algorithms for specific problems under consideration
that can be implemented conveniently in this step. We then review published work
about parallel and distributed algorithms for the problem described, and describe
fundamental research studies that have received attention in more detail at algorith-
mic level. Finally, the main points of the chapter are emphasized with pointers for
future work in the Chapter Notes section.
We have three parts, and start with Part I, which provides the basic background for
bioinformatics with three chapters on molecular biology; graphs and algorithms; and
parallel and distributed processing. This part is not a comprehensive treatment of these
three main areas of research, but rather a dense coverage of the three major topics
with pointers for further study. It is self-contained, however, covering most of the
fundamental concepts needed for parallel/distributed processing in bioinformatics.
Part II describes the problems in the sequence analysis of biological data. We
start this part with string algorithms and describe algorithms for sequence alignment
which is one of the most researched topics in the sequence world. We then investigate
sequence motif search algorithms and describe genome analysis problems such as
gene finding, genome rearrangement, and haplotype inference at a coarser level.
In Part III, our main focus is on parallel and distributed algorithms for biological
networks, the PPI networks being the center of focus. We will see that the main
problems in biological networks can be classified into cluster detection, network
motif discovery, network alignment, and phylogenetics. Cluster detection algorithms
aim to find clusters of high inter-node connections and these dense regions may
implicate a specific function or health and disease states of an organism. Network
motif algorithms attempt to discover repeating patterns of subgraphs and they may
indicate again some specific function attributed to these patterns. Network alignment
algorithms investigate the similarities between two or more graphs and show whether
these organisms have common ancestors and how closely they are related. Analysis of
phylogenetic trees and networks, which capture ancestor–descendant relationships
among existing individuals and organisms, is also reviewed in this part, which is
concluded by discussing the main challenges and future prospects of bioinformatics
in the Epilogue chapter.

Part I
Background

2  Introduction to Molecular Biology

2.1 Introduction

Modern biology has its roots in the work of Gregor Mendel, who identified the fundamental rules of heredity in 1865. The discovery of chromosomes and genes followed later, and in 1953 Watson and Crick disclosed the double helix structure of DNA. All living organisms share common characteristics such as replication, nutrition, growth, and interaction with their environment. An organism is composed
of organs which perform specific functions. Organs are made of tissues which are
composed of aggregation of cells that have similar functions. The cell is the basic
unit of life in all living organisms and it has molecules that have fundamental func-
tions for life. Molecular biology is the study of these molecules in the cell. Two of
these molecules called proteins and nucleotides have fundamental roles to sustain
life. Proteins are the key components in everything related to life. DNA is made of
nucleotides and parts of DNA called genes code for proteins which perform all the
fundamental processes for living using biochemical reactions.
Cells synthesize new molecules and break large molecules into smaller ones using
complex networks of chemical reactions called pathways. Genome is the complete set
of DNA of an organism and human genome consists of chromosomes which contain
many genes. A gene is the basic physical and functional unit of hereditary that codes
for a protein which is a large molecule made from a sequence of amino acids. Three
critical molecules of life are deoxyribonucleic acid (DNA), ribonucleic acid (RNA),
and proteins. A central paradigm in molecular biology states that biological function
is heavily dependent on the biological structure.
In this chapter, we first review the functions performed by the cell and its ingre-
dients. The DNA contained in the nucleus, the proteins, and various other molecules
all have important functionalities and we describe these in detail. The central dogma
of life is the process of building up a protein from the code in the genes as we will
outline. We will also briefly describe biotechnological methods and introduce some
of the commonly used databases that store information about DNA, proteins, and
other molecules in the cell.

2.2 The Cell

Cells are the fundamental building blocks of all living things. The cell serves as a
structural building block to form tissues and organs. Each cell is independent and
can live on its own. All cells have a metabolism to take in nutrients and convert
them into molecules and energy to be used. Another important property of cells is
replication in which a cell produces another cell that has the same properties as
itself. Cells are composed of approximately 70 % water; 7 % small molecules such as amino acids, nucleotides, salts, and lipids; and 23 % macromolecules such as proteins and polysaccharides. A cell consists of molecules in a dense liquid surrounded by a
membrane as shown in Fig. 2.1.
The eukaryotic cells have nuclei containing the genetic material which is sepa-
rated from the rest of the cell by a membrane and the prokaryotic cells do not have
nuclei. Prokaryotes include bacteria and archaea; and plants, animals, and fungi are
examples of eukaryotes. The tasks performed by the cells include taking nutrients
from food, converting these to energy, and performing various special missions. A
cell is composed of many parts each with a different purpose. The following are the
important parts of a eukaryotic cell with their functions:

• Nucleus: Storage of DNA molecules, and RNA and ribosome synthesis.


• Endoplasmic reticulum: Synthesis of lipids and proteins
• Golgi apparatus: Distribution of proteins and lipids and posttranslational process-
ing of proteins.
• Mitochondria: Generation of energy by oxidizing nutrients.
• Vesicles: Transport vesicles move molecules such as proteins from the endoplasmic reticulum to the Golgi apparatus; secretory vesicles carry material to be excreted from the cell; and lysosomes provide cellular digestion.

The nucleus is at the center of the cell and is responsible for vital functions such
as cell growth, maturation, division, or death. Cytoplasm consists of jellylike fluid
which surrounds the nucleus and it contains various other structures. Endoplasmic

Fig. 2.1 Parts of a cell



reticulum enwraps the nucleus, and processes molecules made by the cell and trans-
ports them to their destinations. Conversion of energy from food to a form that can be
used by the cell is performed by mitochondria which have their own genetic material.
These components of the cell are shown in Fig. 2.1. The cell contains various other
structures than the ones we have outlined here.
Chemically, the cell is composed of only a few elements. Carbon (C), hydrogen (H),
oxygen (O), and nitrogen (N) are the dominant ones with phosphorus (P) and sulfur
(S) appearing in less proportions. These elements combine to form molecules in
the cell, using covalent bonds in which electrons in their outer orbits are shared
between the atoms. A nucleotide is one such molecule in the cell which is a chain
of three components: a base B, a sugar S, and a phosphoric acid P. The three basic
macromolecules in the cell that are essential for life are the DNA, RNA, and proteins.

2.2.1 DNA

James Watson and Francis Crick discovered the Deoxyribonucleic Acid (DNA) struc-
ture in the cell in 1953 using X-ray diffraction patterns which showed that the DNA
molecule is long, thin, and has a spiral-like shape [5]. The DNA is contained in the
nuclei of eukaryotic cells and is composed of small molecules called nucleotides.
Each nucleotide consists of a five-carbon sugar, a phosphate group, and a base. The
carbon atoms in a sugar molecule are labeled 1′ to 5′ and, using this notation, DNA molecules start at the 5′ end and finish at the 3′ end as shown in Fig. 2.2. There are four nucleotides in the DNA which are distinguished by the bases they carry: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). We can therefore think of DNA as a string over the four-letter alphabet Σ = {A, C, G, T}. Human DNA consists of approximately three billion bases. Since nucleotide A pairs only with T, and C pairs only with G, we can say A and T are complementary and so are G and C, as shown in Fig. 2.2.
Given the sequence S of a DNA strand, we can construct the other strand S′ by taking the complement of the bases in this strand. If we take the complement of the

Fig. 2.2 DNA structure




Fig. 2.3 DNA double helix structure

resulting strand, we will obtain the original strand. This process is essential
for protein production. Physically, DNA consists of two strands held together by
hydrogen bonds, arranged in a double helix as shown in Fig. 2.3. The complement
of a DNA sequence consists of complements of its bases. The DNA therefore consists
of two complementary strands which bind to each other tightly providing a stable
structure. This structure also provides the means to replicate in which the double
DNA helix structure is separated into two strands and each of these strands are then
used as templates to synthesize their complements.
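Complementation is easy to express computationally. The following short Python sketch is our own illustration (the function names are hypothetical, not from any standard library); it computes the complement of a strand and checks that complementing twice restores the original sequence.

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement(strand):
    # complement each base of the given sequence
    return ''.join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    # the opposite strand read in its own 5'-to-3' direction
    return complement(strand)[::-1]

s = "AACGTAGGCTAAT"
print(complement(s))                      # TTGCATCCGATTA
print(complement(complement(s)) == s)     # True: complementing twice gives the original strand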
The DNA molecule is wrapped around proteins called histones and packed into complex structures called chromosomes in the nucleus of each cell. The number of
chromosomes depends on the type of eukaryote species. Each chromosome consists
of two chromatides which are coil-shaped structures connected near the middle
forming an x-like structure. Chromosomes are kept in the nucleus of a cell in a
highly packed and hierarchically organized form. A single set of chromosomes in an
organism is called haploid, two sets of chromosomes are called diploid, and more than two sets are called polyploid. Humans are diploid: one chromosome of each pair is inherited from each parent, giving two chromosomes for each of the 23 chromosome pairs. The sex chromosomes form pair number 23, which consists of either two X chromosomes, resulting in a female, or an X and a Y chromosome, resulting in a male. The type of chromosome inherited from the father therefore determines the sex of the child.

2.2.2 RNA

The ribonucleic acid (RNA) is an important molecule that is used to transfer genetic
information. It has a similar structure to DNA but consists of only one strand and
does not form a helix structure like DNA. It also has nucleotides which consist of a
sugar, phosphate, and a base. The sugar however is a ribose instead of deoxyribose
and hence the name RNA. Also, DNA base thymine (T) is replaced with uracil (U)
in RNA. The fundamental kinds of RNA are the messenger RNA (mRNA), transfer
RNA (tRNA), and ribosomal RNA (rRNA) which perform different functions in the
cell. RNA provides a flow of information in the cell. First, DNA is copied to mRNA

in the nucleus and the mRNA is then translated to protein in the cytoplasm. During
translation, tRNA and rRNA have important functions. The tRNA is responsible for
forming the amino acids which make up the protein, as prescribed in the mRNA; and
the rRNA molecules are the fundamental building blocks of the ribosomes which
carry out translation of mRNA to protein.

2.2.3 Genes

A gene is the basic unit of heredity in a living organism, determining its character as a whole. Physically, a gene is a sequence of DNA that codes for an RNA (mRNA, tRNA, or rRNA), and the mRNA in turn codes for a protein. The study of genes is called genetics. Gregor Mendel, in the 1860s, was the first to experiment on and set the principles of passing hereditary information to offspring.
There are 23 pairs of chromosomes in humans, and between 20,000 and 25,000 genes
are located in these chromosomes. The starting and stopping locations of a gene are
identified by specific sequences. The protein coding parts of a gene are called exons
and the regions between exons with no specific function are called introns. Genes
have varying lengths and also, exons and introns within a gene have varying lengths.
A gene can combine with other genes or can be nested within another gene to yield
some functionality, and can be mutated which may change its functionality at varying
degrees in some cases leading to diseases. The complete set of genes of an organism
is called its genotype. Each gene has a specific function in the physiology and morphology of an organism. The physical manifestation or expression of the genotype is the phenotype, which is the physiology and morphology of an organism. A gene may have different varieties called alleles, resulting in different phenotypic characteristics. Humans are diploid, meaning we inherit a chromosome from each parent; therefore, we have two alleles of each gene. The genes that code for proteins constitute about 1.5 % of total DNA, and the rest contains RNA-encoding genes and sequences that are not known to have any function. This part of DNA is called junk DNA. There is no direct relationship between the size of the genome, the number of genes, and organism complexity. In fact, some single-cell organisms have a larger genome than
humans.

2.2.4 Proteins

Proteins are large molecules of the cell and they carry out many important functions.
For example, they form the antibodies which bind to foreign particles such as viruses
and bacteria. As enzymes, they work as catalysts for various chemical reactions;
the messenger proteins transmit signals to coordinate biological processes between
different cells, tissues, and organs, also they transport small molecules within the cell
and the body. Proteins are made from the information contained in genes. A protein
consists of a chain of amino acids connected by peptide bonds. Since such a bond
releases a water molecule, what we have inside a protein is a chain of amino acid


Table 2.1 Amino acids


Name Abbrv. Code Pol. Name Abbrv. Code Pol.
Alanine Ala A H Methionine Met M H
Cysteine Cys C P Asparagine Asn N P
Aspartic acid Asp D P Proline Pro P H
Glutamic acid Glu E P Glutamine Gln Q P
Phenylalanine Phe F H Arginine Arg R P
Glycine Gly G P Serine Ser S P
Histidine His H P Threonine Thr T P
Isoleucine Ile I H Valine Val V H
Lysine Lys K P Tryptophan Trp W H
Leucine Leu L H Tyrosine Tyr Y P

residues. Typically, a protein has about 300 amino acid residues, a number which can reach 5000 in large proteins. The essential 20 amino acids that make up proteins are shown in Table 2.1 with their abbreviations, codes, and polarities.
Proteins have highly complex structures which can be analyzed at four hierarchical levels. The primary structure of a protein is specified by the sequence of amino acids that are linked in a chain, and the secondary structure is formed by linear regions of amino acids. A protein domain is a segment of the amino acid sequence of a protein which has functions independent of the rest of the protein. The protein also has a 3D structure, called the tertiary structure, which affects its functionality, and the arrangement of several protein molecules forms the quaternary structure. The function of a protein is determined by its structure at these four levels. A protein has the ability to fold in 3D, and the shape formed in this way affects its function. Using its 3D shape, it can
bind to certain molecules and interact. For example, mad cow disease is believed to
be caused by the wrong folding of a protein. For this reason, predicting the folding
structure of a protein from its primary sequence and finding the relationship between
its 3D structure and its functionality has become one of the main research areas in
bioinformatics.

2.3 Central Dogma of Life

The central dogma of molecular biology and hence life was formulated by F. Crick
in 1958 and it describes the flow of information between DNA, RNA, and proteins.
This flow can be specified as DNA → mRNA → protein as shown in Fig. 2.4. The
forming of mRNA from a DNA strand is called transcription and the production
of a protein based on the nucleotide sequence of the mRNA is called translation as
described next.

Fig. 2.4 Central dogma of life

2.3.1 Transcription

In the transcription phase of protein coding, a single-stranded RNA molecule called mRNA is produced which is complementary to the DNA strand it is transcribed from. The transcription process in eukaryotes takes place in the nucleus. The enzyme called RNA polymerase starts transcription by first detecting and binding to a promoter region
of a gene. This special pattern of DNA shown in Fig. 2.5 is used by RNA polymerase
to find where to begin transcription. The reverse copy of the gene is then synthesized
by this enzyme and a terminating signal sequence in DNA results in the ending of
this process after which pre-mRNA which contains exons and introns is released.
A post-processing called splicing involves removing the introns received from the
gene and reconnecting the exons to form the mature and much shorter mRNA which
is transferred to cytoplasm for the second phase called translation. The complete
gene contained in the chromosome is called genomic DNA and the sequence with
exons only is called complementary DNA or cDNA [25].

Fig. 2.5 Structure of a gene



Table 2.2 The genetic code

                          2nd letter
1st letter   U           C           A            G            3rd letter
U            UUU Phe     UCU Ser     UAU Tyr      UGU Cys      U
             UUC Phe     UCC Ser     UAC Tyr      UGC Cys      C
             UUA Leu     UCA Ser     UAA Stop     UGA Stop     A
             UUG Leu     UCG Ser     UAG Stop     UGG Trp      G
C            CUU Leu     CCU Pro     CAU His      CGU Arg      U
             CUC Leu     CCC Pro     CAC His      CGC Arg      C
             CUA Leu     CCA Pro     CAA Gln      CGA Arg      A
             CUG Leu     CCG Pro     CAG Gln      CGG Arg      G
A            AUU Ile     ACU Thr     AAU Asn      AGU Ser      U
             AUC Ile     ACC Thr     AAC Asn      AGC Ser      C
             AUA Ile     ACA Thr     AAA Lys      AGA Arg      A
             AUG Met     ACG Thr     AAG Lys      AGG Arg      G
G            GUU Val     GCU Ala     GAU Asp      GGU Gly      U
             GUC Val     GCC Ala     GAC Asp      GGC Gly      C
             GUA Val     GCA Ala     GAA Glu      GGA Gly      A
             GUG Val     GCG Ala     GAG Glu      GGG Gly      G

2.3.2 The Genetic Code

The genetic code provides the mapping between the sequence of nucleotides and the
type of amino acids in proteins. This code is in triplets of nucleotide bases called
codons where each codon encodes one amino acid. Since there are four nucleotide
bases, the total number of possible codons is 4³ = 64. However, proteins are made of only 20 amino acids, which means many amino acids are specified by more than one
codon. Table 2.2 displays the genetic code.
Such redundancy provides fault tolerance in case of mutations in the nucleotide
sequences in DNA or mRNA. For example, a change in the codon UUA may result in
UUG in mRNA but the amino acid leucine corresponding to each of these sequences
is formed in both cases. Similarly, all of the three codons UAA, UAG, and UGA cause
termination of the polypeptide sequence and hence a single mutation from A to G or
from G to A still causes termination preventing unwanted growth due to mutations.
Watson et al. showed that the sequence order of codons in DNA corresponds directly
to the sequence order of amino acids in proteins [28]. The codon AUG specifies the
beginning of a protein amino acid sequence, therefore, the amino acid methionine is
found as the first amino acid in all proteins.

2.3.3 Translation

The translation phase is the process where a mature mRNA is used as a template
to form proteins. It is carried out by the large molecules called ribosomes which
consist of proteins and the ribosomal RNA (rRNA) [5]. A ribosome uses tRNA to

Fig. 2.6 Construction of a protein

first detect the start codon in the mRNA which is the nucleotide base sequence AUG.
The tRNA has a three-base sequence called the anticodon, which is complementary to the codon it reads. The amino acids prescribed by the mRNA are then formed and added to
the linear protein structure according to the genetic code. Translation to the protein
is concluded by detecting one of the three stop codons. Once a protein is formed, it may be transferred to the location where it is needed by signals in its amino acid sequence. The new protein must fold into a 3D structure before it can function [27].
Figure 2.6 displays the transcription and translation phases of a superficial protein
made of six amino acids as prescribed by the mRNA.
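The DNA → mRNA → protein flow can be mimicked in a few lines of code. The Python sketch below is our own illustration: it transcribes a hypothetical coding-strand fragment into mRNA and translates it codon by codon with a deliberately partial codon table; a complete implementation would contain all 64 entries of Table 2.2.

# Only a handful of codons are listed here for brevity (see Table 2.2 for the full code).
CODON_TABLE = {
    'AUG': 'Met', 'UUU': 'Phe', 'UUA': 'Leu', 'UCA': 'Ser',
    'CAU': 'His', 'GGC': 'Gly', 'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop',
}

def transcribe(coding_strand):
    # mRNA has the same sequence as the coding strand, with U in place of T
    return coding_strand.replace('T', 'U')

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):          # read codon by codon
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], '?')
        if amino_acid == 'Stop':                  # a stop codon terminates translation
            break
        protein.append(amino_acid)
    return '-'.join(protein)

mrna = transcribe("ATGTTTCATGGCTAA")              # hypothetical gene fragment
print(mrna)                                       # AUGUUUCAUGGCUAA
print(translate(mrna))                            # Met-Phe-His-Gly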

2.3.4 Mutations

Mutations are changes in the genetic code due to a variety of reasons. As an example, a stop codon UAA may be formed instead of an amino acid coding codon UCA (Serine)
by a single point mutation of C to A, which will result in a shorter protein that will likely be nonfunctional. An amino acid may be replaced by another one, again by a point mutation, such as the forming of CCU (Proline) instead of CAU (Histidine) by the mutation of A to C. These types of mutations may have important or trivial
effects depending on the functioning of the mutated amino acid in the protein [5].
Furthermore, a nucleotide may be added to or deleted from the sequence, resulting in the shifting of all subsequent codons by one nucleotide, which will yield a very different sequence.
Mutations can be caused by radiation from various sources such as solar radiation, radioactive materials, X-rays, and UV light. Chemical pollution is also responsible for many cases of mutations, and viruses which insert themselves into DNA also cause mutations.
The inherited mutations are responsible for genetic diseases such as multiple scle-
rosis and Alzheimer disease. In many cases though, mutations result in better and
improved characteristics in an organism such as better eyesight.

2.4 Biotechnological Methods

Biotechnology is defined as any technological application that uses biological systems, living organisms, or derivatives thereof, to make or modify products or
processes for specific use, as defined by the Convention on Biological Diversity
(CBD) [7]. The main biotechnological methods are cloning and the polymerase chain reaction, which are used to amplify DNA, and sequencing, which determines the nucleotide sequence of a DNA segment.

2.4.1 Cloning

DNA needs to be available in sufficient quantities for experiments. DNA cloning is a method to amplify a DNA segment which could be very small. In this method, the DNA
to be amplified, called the insert, is inserted into the genome of an organism which is called the host or the vector. The host is then allowed to multiply, during which
the DNA inserted to it also multiplies. The host can then be disposed of, leaving
only the amplified DNA segment. The commonly used organisms for cloning DNA
are plasmids, cosmids, phages, and yeast artificial chromosomes (YACs) [25].
A plasmid is a circular DNA in bacteria and is used for cloning DNA of sizes
up to 15 kbp. Phages are viruses, and a DNA segment inserted in them gets replicated
when the virus infects an organism and multiplies itself. In YAC-based cloning, an
artificial chromosome of yeast is constructed by the DNA insert sequence and the
yeast chromosome control sections. The yeast multiplies its chromosomes including
the YAC and hence multiplying the insert. YAC-based cloning can be used for very
large segments of a million base pairs [25].

2.4.2 Polymerase Chain Reaction

The polymerase chain reaction (PCR), developed by Kary Mullis [3] in 1983, is a biomedical technology used to amplify a selected DNA segment over several orders
of magnitude. The amplification of DNA is needed for a number of applications
including analysis of genes, discovery of DNA motifs, and diagnosis of hereditary
diseases. PCR uses thermal cycling in which two phases are employed. In the first phase, the DNA is separated into two strands by heat; then, each single strand is extended into a double strand with the aid of a primer and polymerase processing.
DNA polymerase is a type of enzyme that synthesizes new strands of DNA comple-
mentary to the target sequence. These two steps are repeated many times resulting
in an exponential growth of the initial DNA segment. There are some limitations
of PCR processing such as the accumulation of pyrophosphate molecules and the
existence of inhibitors of the polymerase in the DNA sample, which result in the stopping of the amplification.

2.4.3 DNA Sequencing

The sequence order of bases in DNA is needed to find the genetic information. DNA
sequencing is the process of obtaining the order of nucleotides in DNA. The obtained
sequence data can then be used to analyze DNA for various tasks such as finding
evolutionary relationships between organisms and treatment of diseases. The exons
are the parts of DNA that contain genes coding for proteins, and all the exons in a genome are called the exome. Sequencing the exome is known as whole exome sequencing. However, research reveals that DNA sequences external to the exons also affect protein coding and the health state of an individual. In whole genome sequencing, the whole genome of an individual is sequenced. New-generation technologies have been developed for both of these processes. A number of methods exist for DNA sequencing and we will briefly
describe only the few fundamental ones.
The sequencing technology called Sanger sequencing, named after Frederick Sanger who developed it [23,24], uses deoxynucleotide triphosphates (dNTPs) and di-deoxynucleotide triphosphates (ddNTPs), which are essentially nucleotides with minor modifications. The DNA strand is copied using these altered bases, and when they are incorporated into a sequence, they stop the copying process, which results in short DNA segments of different lengths. These segments are ordered by size and the nucleotides are read from the shortest to the longest segment. The Sanger method is slow, and newer technologies have been developed. The shotgun method of sequencing was
used to sequence larger DNA segments. The DNA segment is broken into many over-
lapping short segments and these segments are then cloned. These short segments
are selected at random and sequenced in the next step. The final step of this method
involves assembling the short segments in the most likely order to determine the
sequence of the long segment, using the overlapping data of the short segments.
Next generation DNA sequencing methods employ massively parallel processing
to overcome the problems of the previous sequencing methods. Three platforms are
widely used for this purpose: the Roche/454 FLX [21], the Illumina/Solexa Genome
Analyzer [4], and the Ion Torrent: Proton/PGM Sequencing [12]. The Roche/454
FLX uses the pyrosequencing method in which the input DNA strand is divided
into shorter segments which are amplified by the PCR method. Afterward, multiple
reads are sequenced in parallel by detecting optical signals as bases are added. The
Illumina sequencing uses a similar method, the input sample fragment is cleaved
into short segments and each short segment is amplified by PCR. The fragments are
located in a slide which is flooded with nucleotides that are labeled with colors and
DNA polymerase. By taking images of the slide and adding bases, and repeating this
process, bases at each site can be detected to construct the sequence. The Ion proton
sequencing makes use of the fact that addition of a dNTP to a DNA polymer releases
an H+ ion. The preparation of the slide is similar to other two methods and the slide
is flooded with dNTPs. Since each H+ ion decreases the pH, the changes in pH level are used to detect nucleotides [8].

2.5 Databases

A database is a collection of structured data, and there are hundreds of databases in bioinformatics. Many of these databases are generated by filtering and transform-
ing data from other databases which contain raw data. Some of these databases
are privately owned by companies and access is provided with a charge. In most
cases however, bioinformatics databases are publicly accessible by anyone. We can
classify bioinformatics databases broadly as nucleotide databases which contain
DNA/RNA sequences; protein sequence databases with amino acid sequences of proteins; microarray databases storing gene expression data; and pathway databases
which provide access to metabolic pathway data.

2.5.1 Nucleotide Databases

The major databases for nucleotides are the GenBank [10], the European Molec-
ular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) database
[19], and the DNA Databank of Japan (DDJB) [26]. GenBank is maintained by the
National Center for Biotechnology Information (NCBI), U.S. and contains sequences
for various organisms including primates, plants, mammals, and bacteria. It is a fun-
damental nucleic acid database and genomic data is submitted to GenBank from
research projects and laboratories. Searches in this database can be performed by
keywords or by sequences. The EMBL-EBI database is based on EMBL Nucleotide
Sequence Data Library which was the first nucleotide database in the world and
receives contributions from projects and authors. EMBL supports text-based retrieval
tools such as SRS, and BLAST and FASTA for sequence-based retrieval [9].

2.5.2 Protein Sequence Databases

Protein sequence databases provide storage of protein amino acid sequence informa-
tion. Two commonly used protein databases are the Protein Identification Resource
(PIR) [16,31] and the UniProt [15] containing SwissProt [2]. The PIR contains
protein amino acid sequences and structures of proteins to support genomic and pro-
teomic research. It was founded by the National Biomedical Research Foundation
(NBRF) for the identification and interpretation of protein sequence information,
and the Munich Information Center for Protein Sequences (MIPS) [22] in Germany,
and the Japan International Protein Information Database later joined this database.
The SwissProt protein sequence database was established in 1986 and provides protein functions, their hierarchical structures, and diseases related to proteins. The Universal Protein Resource (UniProt) was formed by the collaboration of EMBL-EBI, the Swiss Institute of Bioinformatics (SIB), and PIR in 2003, and SwissProt was incor-
porated into UniProt. PDBj (Protein Data Bank Japan) is a protein database in Japan
providing an archive of macromolecular structures and integrated tools [17].

2.6 Human Genome Project

The human genome project (HGP) is an international scientific research project to produce a complete human DNA sequence and to identify the genes of the human genome, as well as those of other organisms such as mice, bacteria, and flies. This project was planned
in 1984, started in 1990 and was finished in 2003. About 20 universities and research
centers in United States, Japan, China, France, Germany, and the United Kingdom
participated in this project. It aimed to sequence the three billion base pairs of the human genome in order to search for the genetic causes of diseases and find cures for them, along with the analysis of various other problems in molecular biology.
The results of this project show that there are between 20,000 and 25,000 genes in humans, and that the human genome has more repeated DNA segments than other mammalian genomes. Work on the results is ongoing, but results started to appear even before the completion of the project. Many companies are offering genetic
tests which can show the tendencies of an individual to various illnesses. Comparing the human genome with the genomes of other organisms will improve our understanding of evolution. Some ethical, legal, and social issues have been raised as a result of this project. Possible discrimination by employers based on the genetic structure of an individual is one such concern and may result in imbalances in societies.
However, this project has provided data that can be used to find molecular roots of
diseases and search for cures.

2.7 Chapter Notes

We have reviewed the basic concepts of molecular biology at an introductory level. The processes are evidently much more complex than outlined here. More detailed treatments of this topic can be found in [1,20,29,30]. The cell is the basic unit of life in all organisms. The two types of cell are the eukaryotes, which are cells with nuclei, and the prokaryotes, which do not have nuclei. Both of these life forms have
the genetic material embedded in their DNA. Human DNA consists of a sequence
of smaller molecules called nucleotides which are placed in 23 pairs of structures
called chromosomes. A sequence in a chromosome that codes for a protein is called
a gene. Genes identify amino acid sequences which form the proteins. The central
dogma of life consists of two fundamental steps called transcription and translation.
During transcription, a complementary copy of a DNA strand is formed and then the
introns are extracted to form the mRNA, which is carried out of the nucleus, and the ribosomes form the amino acids prescribed using rRNA and tRNA. The three nucleotides that prescribe an amino acid are called a codon, and the genetic code provides the mapping from a codon to an amino acid. Proteins also interact with other proteins, forming
protein–protein interaction (PPI) networks and their function is very much related
to their hierarchical structure and also their position in the PPI networks.
We also briefly reviewed the biotechnologies for DNA multiplying, namely
cloning and PCR technologies. These techniques are needed to provide sufficient

amount of DNA to experiment with in the laboratory. DNA sequencing is the process of obtaining the nucleotide sequence of DNA. The databases for DNA and protein
sequences contain data obtained by various bioinformatics projects and are pre-
sented for public use. DNA microarrays provide snapshots of the expression levels of a vast number of genes simultaneously, and the Gene Expression Omnibus (GEO) [11]
from NCBI and ArrayExpress [14] from EBI are the two databases for microarray-
based gene expression data. There are also pathway databases which provide data
for biochemical pathways, reactions, and enzymes. Kyoto Encyclopedia of Genes
and Genomes (KEGG) [13,18] and BioCyc [6] are two such databases.
The computer science point of view can be confined to analysis of two levels
of data in bioinformatics: the DNA/RNA and protein sequence data and the data
of biological networks such as the PPI networks. Our main focus in this book will
be the sequential and distributed algorithms for the analysis of these sequence and
network data.
Exercises

1. For the DNA base sequence S = AACGTAGGCTAAT, work out the complementary sequence S′ and then the complement of the sequence S′.
2. A superficial gene has the sequence CCGTATCAATTGGCATC. Assuming this
gene has exons only, work out the amino acid sequence of the protein to be formed.
3. Discuss the functions of the three RNA molecules named tRNA, rRNA, and mRNA.
4. A protein consists of the amino acid sequence A-B-N-V. Find three gene
sequences that could have resulted in this protein.
5. Why is DNA multiplying needed? Compare the cloning and PCR methods of
multiplying DNA in terms of technology used and their performances.

References
1. Alberts B, Bray D, Lewis J, Raff M, Roberts K (1994) Molecular biology of the cell. Garland
Publishing, New York
2. Bairoch A, Apweiler R (1999) The SWISS-PROT protein sequence data bank and its supple-
ment TrEMBL in 1999. Nucleic Acids Res 27(1):49–54
3. Bartlett JMS, Stirling D (2003) A short history of the polymerase chain reaction. PCR Protoc
226:36
4. Bentley DR (2006) Whole-genome resequencing. Curr Opin Genet Dev 16:545–552
5. Brandenberg O, Dhlamini Z, Sensi A, Ghosh K, Sonnino A (2011) Introduction to molecular
biology and genetic engineering. Biosafety Resource Book, Food and Agriculture Organization
of the United Nations
6. Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM, Kothari A, Krummenacker
M, Latendresse M, Mueller LA, Ong Q, Paley S, Pujar A, Shearer AG, Travers M, Weerasinghe
D, Zhang P, Karp PD (2011) The MetaCyc database of metabolic pathways and enzymes
and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 40(Database
issue):D742–D753

7. CBD (Convention on Biological Diversity). 5 June 1992. Rio de Janeiro. United Nations
8. http://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-
course
9. http://www.ebi.ac.uk/embl/
10. http://www.ncbi.nlm.nih.gov/genbank
11. http://www.ncbi.nlm.nih.gov/geo/
12. http://www.iontorrent.com/. Ion Torrent official page
13. http://www.genome.jp/kegg/pathway.html
14. http://www.ebi.ac.uk/arrayexpress/. Website of ArrayExpress
15. http://www.uniprot.org/. Official website of UniProt
16. http://pir.georgetown.edu/. Official website of PIR at Georgetown University
17. http://www.pdbj.org/. Official website of Protein Databank Japan
18. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids
Res 28(1):27–30
19. Kneale G, Kennard O (1984) The EMBL nucleotide sequence data library. Biochem Soc Trans
12(6):1011–1014
20. Lewin B (1994) Genes V. Oxford University Press, Oxford
21. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS et al (2005) Genome sequencing in
microfabricated high-density picolitre reactors. Nature 437:376–380
22. Mewes HW, Andreas R, Fabian T, Thomas R, Mathias W, Dmitrij F, Karsten S, Manuel S,
Mayer KFX, Stmpflen V, Antonov A (2011) MIPS: curated databases and comprehensive
secondary data resources in 2010. Nucleic Acids Res (England) 39
23. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed
synthesis with DNA polymerase. J Mol Biol 94(3):441–448
24. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors.
Proc Natl Acad Sci USA 74(12):5463–5467
25. Setubal JC, Meidanis J (1997) Introduction to computational molecular biology. PWS Publish-
ing Company, Boston
26. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, Sugawara H et al (2002)
DNA data bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res
30(1):27–30
27. Voet D, Voet JG (2004) Biochemistry, 3rd edn. Wiley
28. Watson J, Baker T, Bell S, Gann A, Levine M, Losick R (2008) Molecular biology of the gene.
Addison-Wesley Longman, Amsterdam
29. Watson JD (1986) Molecular biology of the gene, vol 1. Benjamin/Cummings, Redwood City
30. Watson JD, Hopkins NH, Roberts JW, Steitz JA, Weiner AM (1987) Molecular biology of the
gene, vol 2. Benjamin/Cummings, Redwood City
31. Wu C, Nebert DW (2004) Update on genome completion and annotations: protein information
resource. Hum Genomics 1(3):229–233

3  Graphs, Algorithms, and Complexity

3.1 Introduction

Graphs are discrete structures which are frequently used to model biological net-
works. We start this chapter with a brief review of basics of graph theory concepts
and then describe concepts such as connectivity and distance that are used for the
analysis of biological networks. Trees are graphs with no cycles and are used for var-
ious representations of biological networks. Spectral properties of graphs provide
means to explore structures such as clusters of biological networks and are briefly
described.
Algorithms are key to solving bioinformatics problems, and we provide a short review of fundamental algorithmic methods, starting by describing the complexity of algorithms. Our emphasis in this section is again on methods
such as dynamic programming and graph algorithms that find many applications
in bioinformatics. Basic graph traversal procedures such as the breadth-first search
and depth-first search are frequently employed for various problems in biological
networks and are described in detail. We then introduce complexity classes NP,
NP-hard, and NP-complete for problems that cannot be solved in polynomial time.
Using approximation algorithms with known approximation ratios provides subop-
timal solutions to difficult problems and sometimes our only choice in tackling a
bioinformatics problem is the use of heuristics which are commonsense rules that
work for most cases of the input. We present a rather dense review of all of these key
concepts in relation to bioinformatics problems in this chapter.

3.2 Graphs

Graphs are frequently used to model networks of any kind. A graph G(V, E) has a nonempty vertex set V and a possibly empty edge set E. The number of vertices of a graph is called its order and the number of its edges is called its size. An edge

Fig. 3.1 a A graph G. b Complement of G. c Clique of order 5

e of a graph is incident to two vertices called its endpoints. A loop is an edge that
starts and ends at the same vertex. Multiple edges e1 and e2 are incident to the same
endpoint vertices. A simple graph does not have any loops or multiple edges. In
the context of this book, we will consider only simple graphs. The complement of a
graph G(V, E) is the graph G(V, E′) which has the same vertex set as G and an edge (u, v) ∈ E′ if and only if (u, v) ∉ E. A clique is a graph with the maximum number
of connections between its vertices. Each vertex in a clique has n − 1 connections
to all other vertices where n is the order of the clique. A maximal clique is a clique
which is not a proper subset of another clique. A maximum clique of a graph G is a
clique of maximum size in G. Figure 3.1 displays example graphs.
The edges of a directed graph have orientation between their endpoints. An ori-
ented edge (u, v) of a directed graph shown by an arrow from u to v, starts from
vertex u and ends at vertex v. The degree of a vertex v of a graph G is the number
of edges that are incident to v. The maximum degree of a graph is shown by Δ(G).
The in-degree of a vertex v in a directed graph is the number of edges that end at v
and the out-degree of a vertex v in such a graph is the number of edges that start at
v. A directed graph is depicted in Fig. 3.2a. The in-degrees of vertices a, b, c, d, e in

(a) (b) f

a c b
a
b d

e e
d
g

Fig. 3.2 a A directed graph. b A bipartite graph



this graph are 1, 1, 2, 2, 3 and the out-degrees are 1, 3, 3, 2, 0, respectively. Given a graph G(V, E), if edge (u, v) ∈ E, then vertices u and v are called neighbors. The open neighborhood N(v) (or Γ(v)) of a vertex v includes all neighbors of v, whereas its closed neighborhood N[v] = N(v) ∪ {v} is the union of its open neighborhood and the vertex itself.
Given two graphs G1(V1, E1) and G2(V2, E2), G1 is called a subgraph of G2 if V1 ⊆ V2 and E1 ⊆ E2. An induced subgraph G1(V1, E1) of a graph G2(V2, E2) preserves in G1 all of the edges between the vertices of V1 that exist in G2.

3.2.1 Types of Graphs

Certain classes of graphs have important applications in biological networks. We will


now describe weighted graphs and bipartite graphs which may be used to represent
biological networks. An edge-weighted graph G(V, E, w) has weights associated
with its edges such that w : E → R. Similarly, a vertex-weighted graph has weights
associated with its vertices such that w : V → R.
A bipartite graph G(V1 , V2 , E) has V1 and V2 as subsets of its vertices such
that no edges exist between any two vertices of V1 , and similarly there are no edges
between any two vertices of V2 . In other words, for any edge (u, v) ∈ E of a bipartite
graph, if u ∈ V1, then v ∈ V2, and vice versa. A bipartite graph G(V1, V2, E) is
shown in Fig. 3.2b with V1 = { f, a, e, c} and V2 = {b, d, g}.

3.2.2 Graph Representations

A graph G(V, E) is commonly represented using an adjacency matrix A. An element aij of A is equal to 1 if there is an edge joining vertices vi and vj, otherwise aij is 0. For a directed graph, aij is 1 if there is an edge that starts at vertex i and ends at vertex j. Another way to represent a graph G is using linked lists. Each vertex is connected to its neighbors in a linked list and the graph is represented by n such lists. The incidence matrix I can also be used to display the connections of the vertices of a graph G. In this case, an element iij of this matrix is equal to unity if vertex vi is incident to edge (vi, vj). Figure 3.3 displays a graph and its adjacency list.
The adjacency matrix A and the incidence matrix I of this graph are shown below.
      | 0 1 0 1 1 1 |        | 1 1 0 0 0 0 1 1 0 |
      | 1 0 1 1 0 0 |        | 0 1 1 0 0 1 0 0 1 |
  A = | 0 1 0 1 0 0 |    I = | 0 0 1 1 0 0 0 0 0 |
      | 1 1 1 0 1 0 |        | 0 0 0 1 1 0 0 1 1 |
      | 1 0 0 1 0 1 |        | 0 0 0 0 1 1 1 0 0 |
      | 1 0 0 0 1 0 |        | 1 0 0 0 1 0 0 0 0 |
Using an adjacency matrix, we can check in constant time whether an edge (i, j)
exists between vertices i and j in a graph. However, this structure occupies a max-
imum of n² memory locations for the case of directed graphs, and searching for a


Fig. 3.3 a A graph G. b Its adjacency list

neighbor requires n steps. For sparse graphs, where the density of the graph is low with relatively far fewer edges than dense graphs (m ≪ n²), this structure
has the disadvantage of occupying unnecessary memory space. The adjacency list
takes m + n memory locations and examining all neighbors of a given node requires
m/n time on average. However, checking the existence of an edge requires m/n
steps whereas we can achieve this in one step using an adjacency matrix. In general,
adjacency lists should be preferred for sparse graphs and adjacency matrices for
dense graphs.
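As an illustration, the following Python sketch (our own, with variable names chosen freely) builds both representations for the undirected graph of Fig. 3.3 from its edge list and contrasts the two kinds of lookup discussed above.

# Edge list of the graph in Fig. 3.3, with vertices labeled 1..6.
edges = [(1, 2), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
n = 6

adj_matrix = [[0] * n for _ in range(n)]
adj_list = {v: [] for v in range(1, n + 1)}

for u, v in edges:
    adj_matrix[u - 1][v - 1] = adj_matrix[v - 1][u - 1] = 1   # symmetric entries for an undirected graph
    adj_list[u].append(v)
    adj_list[v].append(u)

print(adj_matrix[0])            # row of vertex 1: [0, 1, 0, 1, 1, 1]
print(adj_list[4])              # neighbors of vertex 4: [1, 2, 3, 5]
print(adj_matrix[0][2] == 1)    # is (1,3) an edge? constant-time answer: False
print(3 in adj_list[1])         # same query on the list requires scanning the neighbors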

3.2.3 Paths, Cycles, and Connectivity

A walk of a graph G(V, E) is a finite sequence of edges (v0, v1), (v1, v2), ..., (vm−1, vm) where any two consecutive edges are adjacent or identical. The ver-
tex v0 is the initial vertex and vm is the final vertex of the walk. The length of the
walk is the number of edges in the walk. A trail is a walk in which all edges are
distinct, and a path is a trail with all distinct vertices. A cycle is a trail or a path
with the same initial and final vertices containing at least one edge. The girth of a
graph is the length of the shortest cycle it has. In a connected graph G, the distance
d(u, v) between vertices u and v is the length of the shortest path from u to v. The
diameter of a graph is the greatest distance between any two of its vertices.
A graph is connected if and only if there is a path between each vertex pair. A
cutset of a graph is the set of edges of minimum size, removal of which results in a
disconnected graph. In other words, a cutset of a graph G does not contain another
cutset of G. The edge connectivity λ(G) is the size of the smallest cutset of the graph


Fig. 3.4 a The bridge (g, f ) of a graph. The endpoints of this bridge are also cut vertices of the
graph. b The Hamiltonian cycle is {a, b, c, d, e, i, h, g, f, a} and the sequence of Eularian trail
starting from vertex a is shown by the numbers on edges

Fig. 3.5 a A forest. b An MST of a graph rooted at vertex 4

G. A graph G is k-edge connected if λ(G) ≥ k. If the cutset of a graph has only one edge, this edge is called a bridge. The vertex connectivity κ(G) of G, on the other hand, is the minimum number of vertices that need to be deleted from G to leave it disconnected. All edges incident to a vertex should be removed when removing it from the graph. The graph is called k-connected if κ(G) ≥ k. A cut vertex of a graph G is a vertex whose removal disconnects G. Figure 3.4a displays a bridge and cut vertices of a graph.
A trail containing every edge of a graph G exactly once is called an Eulerian trail and G is called Eulerian. It can be shown that a connected graph G is Eulerian if and only if each vertex of G has even degree (see Exercise 3.2). An Eulerian cycle is an Eulerian trail that starts and ends at the same vertex. A cycle of a graph G that passes through each vertex of G exactly once is called a Hamiltonian cycle. An Eulerian trail and a Hamiltonian cycle are shown in Fig. 3.4b.
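The even-degree condition is straightforward to test programmatically. The short Python sketch below is our own illustration; it checks the condition for a graph stored as an adjacency list and assumes the graph is already known to be connected.

def is_eulerian(adj_list):
    # a connected graph is Eulerian iff every vertex has even degree
    return all(len(neighbors) % 2 == 0 for neighbors in adj_list.values())

# A cycle on four vertices: every vertex has degree 2, so the graph is Eulerian.
cycle = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
print(is_eulerian(cycle))       # True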

3.2.4 Trees

A tree is a graph with no cycles and a forest is a graph that has two or more disconnected trees. A spanning tree T(V, E′) of a graph G(V, E) is a tree that covers all vertices of G, where |E′| ≤ |E|. A tree where a special vertex is designated as the root, with all vertices having an orientation toward it, is called a rooted tree; otherwise the tree is unrooted. Any vertex u except the root in a rooted tree is connected to a vertex v on its path to the root called its parent, and the vertex u is called the
child of vertex v. For an edge-weighted graph G(V, E, w), the minimum spanning
tree (MST) of G has the minimum total sum of weights among all possible spanning
trees of G. The MST of a graph is unique if all of its edges have distinct weights.
Figure 3.5 shows a forest consisting of unrooted trees and a rooted MST of a graph.
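MST algorithms are not detailed in this chapter, but a standard greedy method such as Prim's algorithm can be sketched in a few lines. The Python code below is our own illustration of that classical approach (not an algorithm given in the text); it grows the tree from a root vertex by repeatedly adding the minimum-weight edge leaving the current tree, using a priority queue.

import heapq

def prim_mst(graph, root):
    # graph: {vertex: [(neighbor, weight), ...]} for an edge-weighted undirected graph
    visited = {root}
    heap = [(w, root, v) for v, w in graph[root]]   # candidate edges (weight, u, v)
    heapq.heapify(heap)
    mst = []
    while heap and len(visited) < len(graph):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        mst.append((u, v, w))
        for x, wx in graph[v]:
            if x not in visited:
                heapq.heappush(heap, (wx, v, x))
    return mst

# Small example graph with illustrative weights.
g = {
    'a': [('b', 2), ('c', 3)],
    'b': [('a', 2), ('c', 1), ('d', 4)],
    'c': [('a', 3), ('b', 1), ('d', 5)],
    'd': [('b', 4), ('c', 5)],
}
print(prim_mst(g, 'a'))   # [('a', 'b', 2), ('b', 'c', 1), ('b', 'd', 4)], total weight 7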

3.2.5 Spectral Properties of Graphs

Given a square matrix A[n ×n], an eigenvalue λ and the corresponding eigenvector
x of A satisfy the following equation:
Ax = λx (3.1)
which can be written as
Ax − λx = 0 (3.2)

(A − λI )x = 0 (3.3)
The necessary condition for Eq. 3.3 to have a nonzero solution x is that det(A − λI) = 0. In order to
find the eigenvalues of a matrix A, we can do the following [3]:
1. Form the matrix (A − λI).
2. Solve the equation det(A − λI) = 0 for the λ values.
3. Substitute each eigenvalue in Eq. 3.1 to find x vectors corresponding to λ values.
Eigenvalues are useful in solving many problems in engineering and basic sci-
ences. Our main interest in eigenvalues and eigenvectors will be their efficient use
for clustering in biological networks. The Laplacian matrix of a graph is obtained by
subtracting its adjacency matrix A from its degree matrix D, which has dii = deg(i) at the diagonal elements and dij = 0 otherwise, as L = D − A. An element of the
Laplacian matrix therefore is given by

        ⎧ deg(i)   if i = j
  Lij = ⎨ −1       if i ≠ j and vertices i and j are adjacent        (3.4)
        ⎩ 0        otherwise
The Laplacian matrix L provides useful information about the connectivity of
a graph [4]. It is positive semidefinite; all of its eigenvalues are nonnegative and the smallest eigenvalue is 0. The number of connected components of a graph G is given by the number of eigenvalues of its Laplacian that are 0. The second smallest eigenvalue of L(G) shows whether graph G is connected or not, a positive value showing connectedness. A greater second smallest eigenvalue indicates a better connected graph.
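Assuming the NumPy library is available, these spectral facts can be verified directly. The sketch below (our own illustration) forms L = D − A for a small graph with two components and counts the eigenvalues that are numerically zero, which equals the number of connected components.

import numpy as np

# Adjacency matrix of a graph with two components: a triangle {0,1,2} and an edge {3,4}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))                     # degree matrix
L = D - A                                      # Laplacian matrix

eigenvalues = np.sort(np.linalg.eigvalsh(L))   # L is symmetric, so eigvalsh is appropriate
components = int(np.sum(np.isclose(eigenvalues, 0.0)))
print(components)                              # 2: two zero eigenvalues, hence two components
# The second smallest eigenvalue is 0 here, signalling that the graph is disconnected.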

3.3 Algorithms

An algorithm consists of a number of instructions to solve a given problem. It has inputs and executes the instructions on the input to provide an output. An algorithm has assignment, decision, and repetition as its three main subtasks. Assignment is
attributing values to variables, decision is the act of changing the flow of the algorithm
based on some test condition and repetition is performed by executing the same code
on possibly different data. An obvious requirement from any algorithm is that it
should find the results correctly. Also, given two algorithms that provide the same
result to a problem, we would prefer the one that solves the problem in shorter time.
Algorithm 3.1 finds the number of occurrences of a given key value in an integer
array and we can observe all of the described main subtasks in this algorithm.

Algorithm 3.1 Alg_Example


1: Input : Array A[1..n]={...}
2: key = 5
3: Output : count
4: count ← 0 assignment
5: for i = 1 to n do repetition
6: if A[i] = key then decision
7: count ← count + 1
8: end if
9: end for

3.3.1 Time and Space Complexities

The time complexity of an algorithm is the number of steps needed to obtain the
output and is commonly expressed in terms of the size of the input. In Algorithm
3.1, we would need at least n steps since the for loop is executed n times. Also, the
assignment outside the loop needs 1 step. However, we would be more interested in
the part of the algorithm that has a running time dependent on the input size, as the
execution times of the other parts of an algorithm would diminish when the input
size gets larger. In the above example, the algorithm has to be executed exactly n
times. However, if we change the requirement to find the first occurrence of the input
key, then the execution time would be at most n steps.

We can state the time complexity of an algorithm formally as follows. If f and g


are two functions from N to R+ , then
• f (n) = O(g(n)), if there is a constant c > 0 such that f (n) ≤ c·g(n) ∀n ≥ n0;
• f (n) = Ω(g(n)), if there is a constant c > 0 such that f (n) ≥ c·g(n) ∀n ≥ n0;
• f (n) = Θ(g(n)), if f (n) = O(g(n)) and f (n) = Ω(g(n)).
The O(g(n)) notation states that when the input size is equal to or greater than a threshold n0, the running time is always less than c·g(n). In other words, the function g(n) provides an upper bound on the running time. The Ω(g(n)) notation, however, provides a lower bound on the running time, ensuring that the algorithm requires at least c·g(n) steps after a threshold n0. If there are two constants c1, c2 > 0 such that the running time of the algorithm is greater than or equal to c1·g(n) and less than or equal to c2·g(n) after a threshold n0, then the running time of the algorithm is Θ(g(n)). In most
cases, we are interested in the worst case running time of an algorithm. The space
complexity of an algorithm specifies the size of the memory required as a function
of the input size.

3.3.2 Recurrences

A recursive algorithm calls itself with smaller values until a base case is encountered.
A check for the base case is typically made at the start of the algorithm and if this
condition is not met, the algorithm calls itself with possibly a smaller value of the
input. Let us consider finding the nth power of an integer x using a recursive algorithm
called Power as shown in Algorithm 3.2. The base case which is the start of the
returning point from the recursive calls is when n equals 0, and each returned value
is multiplied by the value of x when called which results in the value x n .

Algorithm 3.2 Recursive_Ex


1: procedure Power (x, n)
2: if n = 0 then
3: return 1
4: else
5: return x × Power (x, n − 1)
6: end if
7: end procedure

Recursive algorithms result in simpler code in general but their analysis may not
be simple. They are typically analyzed using recurrence relations, where the time to solve the problem for an input size is expressed in terms of the time for smaller
input sizes. The following recurrence equation for the Power algorithm defines the
relation between the time taken for the algorithm for various input sizes:

T(n) = T(n − 1) + 1                                        (3.5)
     = T(n − 2) + 2
     = T(n − 3) + 3
Proceeding in this manner, we observe that T(n) = T(n − k) + k, and when k = n,
T (n) = T (0) + n = n steps are required, assuming that the base case does not take
any time. We could have noticed that there are n recursive calls until the base case
of n = 0 and each return results in one multiplication resulting in a total of n steps.
The analysis of recurrences may be more complicated than our simple example and
the use of more advanced techniques such as the Master theorem may be needed [7].
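As a quick sanity check of this recurrence, the following Python sketch (our own illustration, with a call counter added purely for demonstration) runs a direct implementation of the Power procedure and confirms that computing x to the power n makes n recursive steps plus the base case.

calls = 0

def power(x, n):
    # direct transcription of the recursive Power procedure, counting calls
    global calls
    calls += 1
    if n == 0:
        return 1
    return x * power(x, n - 1)

print(power(2, 10))   # 1024
print(calls)          # 11 calls in total: n recursive steps plus one base case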

3.3.3 Fundamental Approaches

The fundamental classes of algorithms are the greedy algorithms, divide and conquer
algorithms, dynamic programming, and the graph algorithms. A greedy algorithm
makes the locally optimal choice at each step and these algorithms do not provide an
optimal solution in general but may find suboptimal solutions in reasonable times.
We will see a greedy algorithm that finds MST of a graph in reasonable time in graph
algorithms.
Divide and conquer algorithms on the other hand, divide the problem into a number
of subproblems and these subproblems are solved recursively, and the solutions of the
smaller subproblems may be combined to find the solution to the original problem.
As an example of this method, we will describe the mergesort algorithm which
sorts elements of an array recursively. The array is divided into two parts, each part
is sorted recursively and the sorted smaller arrays are merged into a larger sorted
array. Merging is the key operation in this algorithm and when merging two smaller
sorted arrays, the larger array is formed by finding the smallest value found in both
smaller arrays iteratively. For example, merging the two sorted arrays {2, 8, 10, 12}
and {1, 4, 6, 9} results in {1, 2, 4, 6, 8, 9, 10, 12}. The dynamic programming method
and the graph algorithms have important applications in bioinformatics and they are
described next.
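The merge step can be written compactly. The Python sketch below (our own illustration) merges two sorted lists by repeatedly taking the smaller front element, and then uses this operation inside a recursive mergesort.

def merge(left, right):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:                   # take the smaller front element
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    return result + left[i:] + right[j:]          # append whatever remains

def mergesort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))

print(merge([2, 8, 10, 12], [1, 4, 6, 9]))   # [1, 2, 4, 6, 8, 9, 10, 12]
print(mergesort([5, 3, 8, 1, 9, 2]))         # [1, 2, 3, 5, 8, 9]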

3.3.4 Dynamic Programming

Dynamic programming is a powerful algorithmic method that divides the problem into smaller subproblems as in the divide and conquer strategy. However, the solutions
to the subproblems during this process are stored to be used in the latter stages of
the algorithm. We will show an example of this approach using Fibonacci sequence
which starts with 0 and 1 and each number in this sequence is the sum of the two
previous numbers. This sequence can be stated as the reccurrence:
F(i) = F(i − 1) + F(i − 2) f or i ≥2 (3.6)

Algorithm 3.3 shows how Fibonacci sequence can be computed using dynamic
programming. The array F is filled with the n members of this sequence at the end
of the algorithm which requires Θ(n) steps. Dynamic programming is frequently
used in bioinformatics problems such as sequence alignment and DNA motif search
as we will see in Chaps. 6 and 8.

Algorithm 3.3 Fibo_dyn


1: Input : n ≥ 2
2: int F[n]
3: F[0] ← 0; F[1] ← 1
4: for i ← 2 to n do
5: F[i] ← F[i − 1] + F[i − 2]
6: end for
7: return F[n]
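
A C sketch of Algorithm 3.3 may be written as follows; storing the whole table in an array of size n + 1 and the function name fibo_dyn are assumptions of this sketch.

#include <stdio.h>

/* dynamic programming computation of the nth Fibonacci number, n >= 2 */
long fibo_dyn(int n)
{  long F[n + 1];
   int i;
   F[0] = 0; F[1] = 1;
   for (i = 2; i <= n; i++)
      F[i] = F[i - 1] + F[i - 2];    /* reuse the stored solutions of the subproblems */
   return F[n];
}

int main()
{  printf("%ld\n", fibo_dyn(10));    /* prints 55 */
   return 0;
}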

3.3.5 Graph Algorithms

Graph algorithms are implemented on graph structures which may be represented
by adjacency matrices or lists. Two commonly used algorithms executed on graphs
are the breadth-first search and the depth-first search algorithms which both build
spanning trees that can be used for a variety of applications in bioinformatics.

3.3.5.1 Breadth-First Search


The breadth-first search (BFS) algorithm calculates distances from a source vertex s
to all other vertices in a graph. It has many applications including finding distances
and testing bipartiteness of a graph. Intuitively, any vertex u that is a neighbor of s
will have a distance of 1 to s and the neighbors of u that are two hops away will have a
distance of 2 to s. Algorithm 3.4 shows one way of implementing the BFS algorithm
[3]. The distance of the source vertex s is initialized to zero and all other vertices are labeled with ∞
distances. The source vertex is inserted in the queue que, and during each iteration of
the algorithm a vertex v is dequeued from que and its neighbors are labeled with a distance
of one more than the distance of v if they are not already labeled. The algorithm also labels the parents
of vertices on their BFS paths to the source vertex s. A different order of queueing may
result in different parents for vertices but their distances to s will not change.
A BFS tree in a sample graph is depicted in Fig. 3.6a.

Theorem 3.1 The time complexity of BFS algorithm is Θ(n + m) for a graph of
order n and size m.


Algorithm 3.4 BFS


1: Input : G(V, E) undirected and connected graph
2: Output : dv and prev[v], ∀v ∈ V    distances and predecessors of vertices in BFS tree
3: for all v ∈ V \ {s} do initialize all vertices except source s
4: dv ← ∞
5: prev[v] ← ⊥
6: end for
7: ds ← 0    initialize source s
8: prev[s] ← s
9: que ← s
10: while que ≠ Ø do    repeat until que is empty
11: v ← deque(que)    dequeue the first element v
12: for all (u, v) ∈ E do    process all neighbors of v
13: if du = ∞ then
14: du ← dv + 1
15: prev[u] ← v
16: enque(que, u)
17: end if
18: end for
19: end while

Proof The initialization between lines 3 and 6 takes Θ(n) time. The while loop is
executed at most n times and the for loop between lines 12 and 18 is run at most
deg(u) + 1 times, which also accounts for the vertices with no neighbors. The total running time
in this case is

n + Σ_{u∈V} (deg(u) + 1) = n + Σ_{u∈V} deg(u) + n = 2n + 2m ∈ Θ(n + m)        (3.7)

For general dense graphs, excluding the sparse graphs, m ≫ n, and the time
complexity can be considered to be Θ(m).

Fig. 3.6 a A BFS tree from vertex a. b A DFS tree from vertex a in the same graph. The first and
last visit times of a vertex are shown on the left and right of the vertex, respectively
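
A possible C sketch of Algorithm 3.4 over an adjacency matrix is given below; the fixed graph size N, the simple array-based queue, and the INF marker for unvisited vertices are assumptions of this sketch, not part of the original pseudocode.

#define N 7
#define INF (-1)

/* BFS distances and predecessors from source s over the adjacency matrix adj */
void bfs(int adj[N][N], int s, int dist[N], int prev[N])
{  int que[N], head = 0, tail = 0, v, u;
   for (v = 0; v < N; v++) { dist[v] = INF; prev[v] = -1; }
   dist[s] = 0; prev[s] = s;
   que[tail++] = s;
   while (head < tail) {                      /* until the queue is empty */
      v = que[head++];                        /* dequeue the first element */
      for (u = 0; u < N; u++)
         if (adj[v][u] && dist[u] == INF) {   /* unvisited neighbor of v */
            dist[u] = dist[v] + 1;
            prev[u] = v;
            que[tail++] = u;
         }
   }
}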

3.3.5.2 Depth-First Search


The depth-first search (DFS) algorithm visits vertices of a graph recursively by
visiting any unvisited neighbor of the vertex it currently is visiting and it goes as
deep as it can rather than widely as in the BFS algorithm. When there are no unvisited
neighbors of a vertex, the algorithm returns to the vertex where it came from and
the algorithm stops when it returns to the starting vertex. Algorithm 3.5 shows one
way of implementing DFS algorithm for a forest as in [1]. The first loop is executed
once for each connected component of the graph and the second loop forms DFS
trees recursively for each of these components. A color is associated with each
vertex; white marks an unvisited vertex, gray a visited but not yet finished vertex,
and black a vertex whose visit is complete and from which the algorithm has returned. There are
also two times associated with each vertex; d[u] records the global time when the
vertex u is first visited and f[u] shows when its visit is finished. As the output is a
tree rooted at the starting vertex, the array pred stores the predecessor of each vertex
in this tree. A DFS tree in an example graph is shown in Fig. 3.6b.

Algorithm 3.5 DFS_forest


1: Input : G(V, E), directed or undirected graph
2: Output : pred[n]; d[n], f[n]    place of a vertex in DFS tree and its visit times
3: int time ← 0; color [n]
4: for all u ∈ V do    initialize all vertices
5: color[u] ← white
6: pred[u] ← ⊥
7: end for
8: for all u ∈ V do
9: if color [u] = white then
10: D F S(u) call for each connected component
11: end if
12: end for
13:
14: procedure D F S(u)
15: color [u] ← gray
16: time ← time + 1; d[u] ← time first visit
17: for all (u, v) ∈ E do    process all neighbors of u
18: if color [v] = white then
19: pred[v] ← u
20: D F S(v)
21: end if
22: end for
23: color [u] ← black
24: time ← time + 1 ; f [u] ← time return visit
25: end procedure

There are two loops in the algorithm; the first loop considers each vertex in a
total of O(n) time and the second loop considers all neighbors of each vertex in
a total of Σ_{u∈V} |N(u)| = 2m time. The time complexity of the DFS algorithm is, therefore,
Θ(n + m).
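
The recursive part of Algorithm 3.5 may be sketched in C as follows; the global arrays, the fixed graph size and the example edge in the driver are assumptions made for brevity.

#define N 7
enum { WHITE, GRAY, BLACK };

int adj[N][N], color[N], pred[N], d[N], f[N], time_cnt = 0;

/* recursive visit of Algorithm 3.5 starting at vertex u */
void dfs_visit(int u)
{  int v;
   color[u] = GRAY;
   d[u] = ++time_cnt;                /* first visit time */
   for (v = 0; v < N; v++)
      if (adj[u][v] && color[v] == WHITE) {
         pred[v] = u;
         dfs_visit(v);
      }
   color[u] = BLACK;
   f[u] = ++time_cnt;                /* return (finish) time */
}

int main()
{  int u;
   adj[0][1] = adj[1][0] = 1;        /* a small example edge */
   for (u = 0; u < N; u++) { color[u] = WHITE; pred[u] = -1; }
   for (u = 0; u < N; u++)           /* one call per connected component */
      if (color[u] == WHITE) dfs_visit(u);
   return 0;
}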

3.3.5.3 Prim’s Minimum Spanning Tree Algorithm


This algorithm was initially proposed by Jarnik and then by Prim to find MST of
a connected, weighted, directed, or undirected graph G(V, E, w). It starts from an
arbitrary vertex s and includes it in the MST. Then, at each step of the algorithm,
the lowest weight outgoing edge (u, v) from the current tree fragment T_f such that
u ∈ T_f and v ∈ G \ T_f is found and added to the tree T_f. If T_f is an MST fragment
of G, it can be shown that T_f ∪ {(u, v)} is also an MST fragment of G. Proceeding in
this manner, the algorithm finishes when all vertices are included in the final MST
as shown in Algorithm 3.6.

Algorithm 3.6 Prim_MST


1: Input : G(V, E, w)
2: Output : MST T of G
3: V′ ← {s}
4: T ← Ø
5: while V′ ≠ V do    continue until all vertices are visited
6: select the edge (u, v) with minimal weight such that u ∈ V′ and v ∈ V \ V′
7: V′ ← V′ ∪ {v}
8: T ← T ∪ {(u, v)}
9: end while

Figure 3.7 shows the iterations of Prim’s algorithm in a sample graph. The com-
plexity of this algorithm is O(n^2) as the while loop is executed for all vertices and
the search for the minimum weight edge for each vertex will take O(n) time. This
complexity can be reduced to O(m log n) time using binary heaps and adjacency
lists.
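
An O(n^2) C sketch of Prim's algorithm over a weight matrix might look as follows; the INF constant standing for a missing edge and the parent array are assumptions of this sketch.

#define N 5
#define INF 1000000

/* Prim's MST grown from vertex s; w[u][v] is the edge weight, INF if there is no edge */
void prim(int w[N][N], int s, int parent[N])
{  int in_tree[N], key[N], u, v, k;
   for (v = 0; v < N; v++) { in_tree[v] = 0; key[v] = INF; parent[v] = -1; }
   key[s] = 0;
   for (k = 0; k < N; k++) {
      u = -1;                                  /* cheapest vertex outside the tree */
      for (v = 0; v < N; v++)
         if (!in_tree[v] && (u == -1 || key[v] < key[u])) u = v;
      in_tree[u] = 1;
      for (v = 0; v < N; v++)                  /* update the best edges into the fragment */
         if (!in_tree[v] && w[u][v] < key[v]) {
            key[v] = w[u][v];
            parent[v] = u;
         }
   }
}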

3.3.5.4 Dijkstra’s Shortest Path Algorithm


BFS algorithm finds shortest paths from a source vertex s to all other vertices in an
unweighted graph. In a weighted graph, Dijkstra’s shortest path algorithm uses the
greedy approach similar to Prim’s algorithm. It checks all of the outgoing edges from
the current tree fragment as in Prim’s MST algorithm. However, when selecting an
edge (u, v) to be included in the shortest path tree T , total distance of its endpoint
vertex v from the source vertex s is considered. When the minimum distance vertex
is found, it is included in the partial tree and the distance of all affected vertices are
updated. Algorithm 3.7 shows the pseudocode of Dijkstra’s shortest path algorithm.

Fig. 3.7 Running of Prim’s MST algorithm for vertex a

Algorithm 3.7 Dijkstra’s


1: Input : G(V, E, w)
2: Output : T the shortest path tree rooted at s
3: D[1 : n] ← ∞ array showing distance to s
4: T ← {s} T starts with the source vertex
5: D[s] ← 0
6: while cut(T) ≠ Ø do
7: select (u, v) ∈ cut(T) such that D[u] + w_uv is minimum
8: D[v] ← D[u] + w_uv
9: T ← T ∪ (u, v)
10: end while

The execution of this algorithm is shown in Fig. 3.8 where shortest path tree T
is formed after four steps. Since there are n vertices, n − 1 edges are added to
the initial tree T consisting of the source vertex s only. The time complexity of
the algorithm depends on the data structures used. As with Prim’s algorithm,
there are two nested loops and a simple implementation requires O(n^2)
time. Using adjacency lists and Fibonacci heaps, this complexity may be reduced to
O(m + n log n).
This algorithm only finds the shortest path tree from a source vertex and can be
executed for all vertices to find all shortest paths. It can be modified so that the
predecessors of vertices are stored (See Exercise 3.6).
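
A C sketch of Dijkstra's algorithm, with the predecessor array of Exercise 3.6 already added, could be written as follows; the weight matrix representation and the INF constant are assumptions of this sketch.

#define N 6
#define INF 1000000

/* single-source shortest paths from s; w[u][v] = INF if there is no edge (u,v) */
void dijkstra(int w[N][N], int s, int D[N], int prev[N])
{  int done[N], u, v, k;
   for (v = 0; v < N; v++) { D[v] = INF; done[v] = 0; prev[v] = -1; }
   D[s] = 0;
   for (k = 0; k < N; k++) {
      u = -1;                                  /* closest vertex not yet in the tree */
      for (v = 0; v < N; v++)
         if (!done[v] && (u == -1 || D[v] < D[u])) u = v;
      if (D[u] == INF) break;                  /* remaining vertices are unreachable */
      done[u] = 1;
      for (v = 0; v < N; v++)                  /* relax the outgoing edges of u */
         if (!done[v] && w[u][v] < INF && D[u] + w[u][v] < D[v]) {
            D[v] = D[u] + w[u][v];
            prev[v] = u;
         }
   }
}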

Fig. 3.8 Running of Dijkstra’s shortest path algorithm

3.3.6 Special Subgraphs

Given a simple and undirected graph G(V, E), a special subgraph G′ ⊆ G has some
property which may be exploited to discover structure in biological networks.
The special subgraphs we will consider are the independent sets, dominating sets,
matching and the vertex cover.

3.3.6.1 Independent Sets


An independent set I of a simple and undirected graph G(V, E) is a subset of the
vertices of G such that there is no edge joining any two vertices in I. Formally,
∀u, v ∈ I, (u, v) ∉ E. The independent set problem (IND) asks to find I with
the maximum number of vertices as an optimization problem where we try to find
the best solution among a number of solutions. The decision version of this problem
requires an answer as yes or no to the question: Is there an independent set of size at
least k in a graph G where k < n with n = |V|? Figure 3.9a shows an independent
set which is maximum.


Fig. 3.9 a A maximum independent set of a graph. b The minimum dominating set of the same
graph

3.3.6.2 Dominating Sets


A dominating set of a graph G(V, E) is a subset D of its vertices such that any
vertex of G is either in this set or a neighbor of a vertex in this set. The dominating
set problem (DOM) asks to find D with the minimum number of vertices as an
optimization problem. The decision version of this problem requires an answer as
yes or no to the question: Is there a dominating set of size at most k in a
graph G where k < n? Dominating sets can be used to find clusters in networks. A
minimum dominating set is shown in Fig. 3.9b.

3.3.6.3 Matching
A matching in a graph G(V, E) is a subset E′ of its edges such that the edges in E′
do not share any endpoints. Matching finds various applications including routing
in computer networks. A maximum matching of a graph G has the maximum size
among all matchings of G. A maximal matching of G is a matching in G which
cannot be enlarged. Finding maximum matching in a graph can be performed in
polynomial time [2] and hence is in P. A maximum matching of size 4 is depicted in
Fig. 3.10a.

3.3.6.4 Vertex Cover


A vertex cover of a graph G(V, E) is a subset V′ of the vertices of G such that every
edge in G is incident to at least one vertex in V′. Formally, for V′ ⊆ V to be a
vertex cover; ∀(u, v) ∈ E, either u or v or both should be an element of V′. The
optimization version of this problem asks to find a vertex cover of minimum size


Fig. 3.10 a A maximum matching of a graph. b A minimum vertex cover of a graph



in a graph G. The decision version of this problem requires an answer as yes or
no to the question: Is there a vertex cover of size at most k in a graph G where
k < n? A minimal vertex cover V′ of a graph G(V, E) is a vertex cover such that
the removal of a vertex from V′ leaves at least one edge of G uncovered. Vertex cover
has many applications including facility location in networks. Figure 3.10b displays
a minimum vertex cover in a graph.

3.4 NP-Completeness

All of the algorithms we have considered up to this point have polynomial execu-
tion times which can be expressed as O(n^k) where n is the input size and k ≥ 0.
These algorithms are in complexity class P which contains algorithms with polyno-
mial execution times. Many algorithms, however, do not belong to P as either they
have exponential running times or sometimes the problem at hand does not even
have a known solution with either polynomial time or exponential time. Problems
with solutions in P are called tractable and any other problem outside P is called
intractable.
We need to define some concepts before investigating complexity classes other
than P. Problems we want to solve are either optimization problems where we try
to obtain the best solution to the problem at hand, or decision problems where we
try to find an answer in the form of a yes or no to a given instance of the problem
as described before. A certificate is an input combination for an algorithm and a
verifier (or a certifier) is an algorithm that checks a certificate for a problem and
provides an answer in the form of a yes or no. We will exemplify these concepts by the
use of the subset sum problem. Given a set of n numbers S = {i_1, . . . , i_n}, this problem
asks to find a subset of S whose sum is equal to a given value. For example, given
S = {3, 2, −5, −2, 9, 6, 1, 12} and the value 4, the subset {3, 1} provides a yes
answer. The input {3, 1} is the certificate and the verifier algorithm simply adds
the values of the integers in the certificate in k − 1 steps where k is the size of the certificate.
Clearly, we need to check each subset of S to find a solution if it exists and this can
be accomplished in exponential time of 2^n which is the number of subsets of a set
having n elements. Now, if we are given an input value m and a specific subset R of
S as a certificate, we can easily check in polynomial time using Θ(k) additions, where
k is the size of R, whether R is a solution. The certificate is R and the verifier is the
algorithm that sums the elements of R and checks whether this sum equals m. Such
problems that can be verified with a nondeterministic random input in polynomial
time are said to be in complexity class Nondeterministic Polynomial (NP). Clearly,
all of the problems in P have verifiers that have polynomial running time and hence
P ⊆ N P. Whether P = NP is not known but the opposite is widely believed by
computer scientists.
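
A polynomial-time verifier for the subset sum example above may be sketched as follows; the function name verify_subset_sum and the representation of the certificate as a plain integer array are assumptions of this sketch.

#include <stdio.h>

/* verifier: does the certificate R, an array of k integers taken from S, sum to m? */
int verify_subset_sum(int R[], int k, int m)
{  int i, sum = 0;
   for (i = 0; i < k; i++)          /* Theta(k) additions */
      sum += R[i];
   return sum == m;
}

int main()
{  int R[] = {3, 1};                /* a certificate for the target value 4 */
   printf("%s\n", verify_subset_sum(R, 2, 4) ? "yes" : "no");
   return 0;
}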
NP-Complete problems constitute a subset of problems in NP and they are as
hard as any problem in NP. Formally, a decision problem C is NP-Complete if
it is in NP and every problem in NP can be reduced to C in polynomial time.

Fig. 3.11 Relations between the complexity classes

NP-hard problems are the problems which do not have any known polynomial time
algorithms, and solving one NP-hard problem in polynomial time would imply that all of the
NP-hard problems can be solved in polynomial time. In other words, NP-hard
is a class of problems that are as hard as any problem in NP. For example, finding the
least cost cycle in a weighted graph is an optimization problem which is NP-hard.
Figure 3.11 displays the relations between these complexity classes.

3.4.1 Reductions

If a problem A can be reduced to another problem B in polynomial time, we can
assume they are at the same level of difficulty. Formally, a problem A is polynomially
reducible to another problem B if any input I A of A can be transformed to an input
I B of B in polynomial time, and the answer to input I B for problem B is true if and
only if the answer is true for problem A with the input I A [3].
We will illustrate this concept by reduction of the independent set problem to the
clique problem. We have seen that cliques are complete graphs with edges between
all pairs of vertices. The clique optimization problem (CLIQUE) asks to find a clique
with the maximum number of vertices in a given simple and undirected graph. The
decision version of this problem searches an answer to the question: Is there a clique
of size at least k in a graph G where k < n?
We will now prove that IND problem can be reduced to CLIQUE problem in
polynomial time and hence these problems are equivalent. Given a graph G(V, E)
with an independent set I ⊂ V, we form Ḡ(V, E′), the complement graph of
G with the same vertex set but the complementary edge set E′. Our claim is, if I is an
independent set in G, then I is a clique in Ḡ. From the definition of the independent
set, we know that for any two vertices u and v ∈ I, the edge (u, v) ∉ E, which means
(u, v) ∈ E′. Checking whether a certificate C is an independent
set in G is therefore equivalent to checking whether C is a clique in Ḡ. Clearly, the
transformation from G to Ḡ can be performed in polynomial time and therefore IND
is reducible to CLIQUE which is stated as IND ≤ CLIQUE. Figure 3.12 illustrates
the reduction from an independent set to a clique.
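
The polynomial-time transformation used in this reduction, building the complement graph, can be sketched in C as follows; the adjacency matrix representation and the fixed size N are assumptions of this sketch.

#define N 6

/* build the complement graph: (u,v) is an edge of gc iff it is not an edge of g */
void complement(int g[N][N], int gc[N][N])
{  int u, v;
   for (u = 0; u < N; u++)
      for (v = 0; v < N; v++)
         gc[u][v] = (u != v) && !g[u][v];   /* no self loops in a simple graph */
}

The two nested loops clearly take O(n^2) time, so the transformation is polynomial.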


Fig. 3.12 Reduction from independent set to clique. a An independent set of a graph G. b The
clique of G using the same vertices

3.4.2 Coping with NP-Completeness

Determining that a problem is NP-Complete has certain implications. First, there
is no need to search for an algorithm in P to solve the problem; however, we can
employ one of the following methods to find a solution:
• We can still use an exponential algorithm, if it exists, for a small input size. For
bioinformatics problems, the size of the input in general is huge and this option
would not be viable.
• We can use clever algorithms that prune the search space. Backtracking and branch
and bound algorithms are two such types of algorithms. The backtracking algo-
rithms construct a space tree of possible solutions and evaluate cases along the
branches of this tree. When a case that does not meet the requirements is encoun-
tered at a vertex of this tree, all of its subtree is discarded from the search space
resulting in a smaller size of search space. A branch and bound algorithm works
similar to the backtracking algorithm, but additionally, it records the best solution
obtained and whenever the branch of a tree does not provide a better solution, the
subtree is discarded.
• Approximation algorithms which find suboptimal solutions with proven approxi-
mation ratios to the optimal value can be used.
• Heuristics which provide suboptimal solutions in reasonable time may be used.
Heuristics do not guarantee to find a solution to the problem and typically are
tested against a variety of inputs to show they work most of the time.

3.4.2.1 Approximation Algorithms


An approximation algorithm finds a suboptimal solution to a given difficult prob-
lem. The approximation ratio of an approximation algorithm shows how close the
algorithm comes to the optimal case and this ratio should be proven. For example,
if the approximation ratio of an algorithm that finds the minimum dominating set in a graph G is
1.6, we know that the output from this algorithm will be at most 1.6 times the size
of the minimum dominating set of G. We will exemplify the use of approximation
algorithms by the vertex cover problem. As this is a known NP-complete problem

[5], our aim is to search for approximation algorithms for vertex cover. Algorithm
3.8 displays our approximation algorithm for the vertex cover problem. At each iter-
ation, a random edge (u, v) is picked, its endpoint vertices are included in the cover
and all edges incident to these vertices are deleted from G. This process continues until
there are no more edges left and since all edges are covered, the output is a vertex
cover. We need to know how close the size of the output is to the optimum VC.

Algorithm 3.8 Appr ox_M V C


1: Input G(V, E)
2: S ← E, MVC ← Ø
3: while S ≠ ∅ do
4: pick any (u, v) ∈ S
5: MVC ← MVC ∪ {u, v}
6: delete all edges incident to either u or v from S
7: end while

This algorithm in fact finds a maximal matching of a graph G and includes the
endpoints of edges found in the matching to the vertex cover set. Let M be a maximal
matching of G. The minimum vertex cover must include at least one endpoint of each
edge (u, v) ∈ M. Hence, the size of the minimum vertex cover is at least as large as the size
of M. The size of the cover returned by the algorithm is 2|M| as both ends of each
matching edge are included in the cover. Therefore,
|VC| = 2|M| ≤ 2 × |MinVC|                             (3.8)
where VC is the vertex cover returned by the algorithm and MinVC is the minimum
vertex cover. Therefore, the approximation ratio of this algorithm is 2. The running
of this algorithm in a sample graph as shown in Fig. 3.13 results in a vertex cover
that has a size of 6 which is 1.5 times the optimum size of 4 shown in Fig. 3.13b.
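
A C sketch of Algorithm 3.8 over an adjacency matrix could be the following; modeling edge deletion by clearing matrix entries, and thus modifying the input graph, is an implementation choice of this sketch.

#define N 8

/* 2-approximation for vertex cover; in_cover[v] is set to 1 for the chosen vertices */
void approx_mvc(int adj[N][N], int in_cover[N])
{  int u, v, w;
   for (v = 0; v < N; v++) in_cover[v] = 0;
   for (u = 0; u < N; u++)
      for (v = u + 1; v < N; v++)
         if (adj[u][v]) {                     /* pick a remaining edge (u,v) */
            in_cover[u] = 1; in_cover[v] = 1;
            for (w = 0; w < N; w++) {         /* delete all edges incident to u or v */
               adj[u][w] = adj[w][u] = 0;
               adj[v][w] = adj[w][v] = 0;
            }
         }
}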


Fig. 3.13 Vertex Cover Examples. a The output of the approximation algorithm. b The optimal
vertex cover


3.4.2.2 Using Heuristics


A heuristic is a commonsense observation that may be used as a rule in an algorithm.
It is not proven mathematically and therefore extensive tests are required to claim
that a certain heuristic works well for most of the input combinations. As an example,
we may choose to select the vertex with the highest degree in the current working
graph to find the vertex cover. This seems a good choice as we would be covering
the highest number of edges at each iteration, which should result in fewer
vertices in the cover. Unfortunately, this heuristic is not better than the random
choice; it can be shown that its approximation ratio is not constant but grows with
the number of vertices as Θ(log n).
Approximation algorithms often employ heuristics also, but they have a proven
approximation ratio. In many types of bioinformatics problems, the use of heuristics
is sometimes the only choice as the problems are NP-hard most of the time and
approximation algorithms are not known to date. Extensive testing and validation
are needed in this case to show the proposed heuristics perform well for most of the
input combinations.

3.5 Chapter Notes

We have reviewed the basic background to analyze biological sequences and bio-
logical networks in this chapter. Graph theory with its rich background provides a
convenient model for efficient analysis of biological networks. Our review of graphs
emphasized concepts that are relevant to biological networks rather than being com-
prehensive. We then provided a survey of algorithmic methods again with emphasis
on the ones used in bioinformatics. The dynamic programming and the graph algo-
rithms are frequently used in bioinformatics as we will see. Finally, we reviewed
the complexity classes P, NP, NP-hard, and NP-complete. Most of the problems we
encounter in bioinformatics are NP-hard which often require the use of approxima-
tion algorithms or heuristics. Since the size of data is huge in these applications,
we may use approximation algorithms or heuristics even if an algorithm in P exists
for the problem at hand. For example, if we have an exact algorithm that solves a
problem in O(n^2) time, we may be interested in an approximation algorithm with
a reasonable approximation ratio that finds the solution in O(n) time for n ≫ 1.
Similarly, the size of data being large necessitates the use of parallel and distrib-
uted algorithms which provide faster solutions than the sequential ones even if the
problem is in P, as we will see in the next chapter.
A comprehensive review of graph theory is provided by West [9] and Harary [6].
Cormen et al. [1], Skiena [8] and Levitin [7] all provide detailed descriptions of key
algorithm design concepts.

Exercises

1. Find the adjacency matrix and the adjacency list representation of the graph
shown in Fig. 3.14.
2. Prove that each vertex of a graph G has an even degree if G has an Eulerian
trail.
3. The exchange sort algorithm sorts the numbers in an array of size n by first
finding the maximum value of the array, swapping the maximum value with the
value in the first place, and then finding the maximum value of the remaining
(n −1) elements and continue until there are two elements. Write the pseudocode
of this algorithm; show its iteration steps in the array A = {3, 2, 5, 4, 6} and work
out its time complexity.
4. Modify the algorithm that finds the Fibonacci numbers (Algorithm 3.3) such
that only two memory locations which show the last two values of the sequence
are used.
5. Work out the BFS and DFS trees rooted at vertex b in the graph of Fig. 3.15 by
showing the parents of each vertex except the root.
6. Sort the weights of the graph of Fig. 3.16 in ascending order. Then, starting
from the lightest weight edge, include edges in the MST as long as they do
not form cycles with the existing edges in the MST fragment obtained so far.
This procedure is known as Kruskal’s algorithm. Write the pseudocode for this
algorithm and work out its time complexity.

Fig. 3.14 Example graph for Exercise 1

Fig. 3.15 Example graph for Exercise 5



Fig. 3.16 Example graph for Exercise 6

Fig. 3.17 Example graph for Exercise 7

7. Modify Dijkstra’s shortest path algorithm (Algorithm 3.7) so that the shortest
path tree information in terms of predecessors of vertices are stored.
8. Find the minimum dominating set and the minimum vertex cover of the graph
in Fig. 3.17.
9. Show that the independent set problem can be reduced to vertex cover problem
in polynomial time. (Hint: If V  is an independent set of G(V, E), then V \ V 
is a vertex cover).

References
1. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The
MIT Press, Cambridge
2. Edmonds J (1965) Paths, trees, and flowers. Can J Math 17:449–467
3. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Francis
4. Fiedler M (1989) Laplacian of graphs and algebraic connectivity. Comb Graph Theory 25:57–70
5. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-
completeness. W. H. Freeman, New York
6. Harary F (1979) Graph theory. Addison-Wesley, Reading

7. Levitin A (2011) Introduction to the design and analysis of algorithms, 3rd edn. Pearson Inter-
national Edn. ISBN: 0-321-36413-9
8. Skiena S (2008) The algorithm design manual. Springer, ISBN-10: 1849967202
9. West DB (2001) Introduction to graph theory, 2nd edn. Prentice-Hall, ISBN 0-13-014400-2
4 Parallel and Distributed Computing

4.1 Introduction

The terms computing, algorithm and programming are related to each other but
have conceptually different meanings. An algorithm in general is a set of instruc-
tions described frequently using pseudocode independent of the hardware, the oper-
ating system and the programming language used. Programming involves use of
implementation details such as operating system constructs and the programming
language. Computing is more general and includes methodologies, algorithms, pro-
gramming and architecture.
A sequential algorithm consists of a number of instructions executed consecu-
tively. This algorithm is executed on a central processing unit (CPU) of a computer
which also has memory and input/output units. A program is compiled and stored in
memory and each instruction of the program is fetched from the memory to the CPU
which decodes and executes it. In this so called Von Neumann model, both program
code and data are stored in external memory and need to be fetched to the CPU.
Parallel computing is the use of parallel computers to solve computational prob-
lems faster than a single computer. It is widely used for computationally difficult
and time consuming problems such as climate estimation, navigation and scientific
computing. A parallel algorithm executes simultaneously on a number of processing
elements with the processing elements being configured into various architectures.
A fundamental issue in parallel computing architecture is the organization of the
interconnection network which provides communication among tasks running on
different processing elements. The tasks of the parallel algorithm may or may not
share a commonly accessible global memory. The term parallel computing in general,
implies tightly coupling between the tasks as we will describe.
Distributed computing, on the other hand, assumes that the tasks running on
different nodes of the network communicate using messages
only. Distributed algorithms are sometimes called message-passing algorithms for
this reason. A network algorithm is a type of distributed algorithm where each node
is typically aware of its position in the network topology; cooperates with neighbor

nodes to solve a network problem. For example, a node would search for shortest
paths from itself to all other nodes by cooperating with its neighbors in a network
routing algorithm. In a distributed memory routing algorithm on the other hand,
we would attempt to solve shortest paths between all nodes of a possibly large net-
work graph using few processing elements in parallel. Both can be called distributed
algorithms in the general sense.
We start this chapter by the review of fundamental parallel and distributed com-
puting architectures. We then describe the needed system support for shared-memory
parallel computing and review multi-threaded programming by providing examples
on POSIX threads. The parallel algorithm design techniques are also outlined and the
distributed computing section is about message passing paradigm and examples of
commonly used message passing software are described. We conclude the analysis
by the UNIX operating system and its network support for distributed processing
over a computer network.

4.2 Architectures for Parallel and Distributed Computing

Parallel computers may be constructed using special custom developed processors


or off-the-shelf general purpose processors used in personal computers and worksta-
tions. The interconnection network connecting the processors may be custom built or
a general one. Contemporary view of parallel processing hardware uses general pur-
pose processors and general purpose interconnection networks to provide scalability
and access to a wider population of users.

4.2.1 Interconnection Networks

An interconnection network connects the processing elements of the parallel or the


distributed computing system. The communication medium may be shared among
processors where only one message at a time can be transferred. In switched com-
munication, each processor has its own connection to the switch. Switches provide
parallel transmission of messages at the cost of more sophisticated and expensive
hardware than the shared medium. Figure 4.1 displays a shared medium of computers
connected and a switched medium of 2D mesh.
A hypercube is another example of a switched network consisting of processors
placed at the vertices of a cubic structure. A d-dimensional hypercube has 2^d processors
labeled 0, . . . , 2^d − 1 as shown in Fig. 4.2a. Each node of a hypercube differs by
one bit in its binary label representation from each of its neighbors. The farthest two
nodes in a hypercube of dimension d are log n = d hops apart. A binary tree
network has processors as the leaves, and switches as the other nodes of a tree where
each switching node has exactly two children as shown in Fig. 4.2b.

Fig. 4.1 Sample interconnection networks with each processor having switching capabilities. a A
shared medium. b A 2-D mesh. Sample concurrent paths are shown in bold in (b)

Fig. 4.2 Sample interconnection networks. a A 3-dimension hypercube. b A binary tree network.
Circles are the switching elements, squares are the processors. Three sample concurrent paths are
shown in bold in both figures

4.2.2 Multiprocessors and Multicomputers

A multiprocessor consists of a number of processors that communicate using a shared


memory. The memory can be a single unit or it can be distributed over the network.
A single shared memory multiprocessor is called uniform memory access (UMA)
or a symmetric multiprocessor (SMP). A UMA architecture is shown in Fig. 4.3a
where each processor has its own cache and communicates using the global memory.
Commonly, a processor has its own local memory and uses the global shared memory
for inter-processor communications in a multiprocessor.
A multicomputer, however, commonly does not have a shared memory
and each processor typically communicates by message passing where messages are
the basic units of communication. Each processing element in a multicomputer has
its own local memory and input/output unit and hence is a general purpose computing
element. A multicomputer architecture is displayed in Fig. 4.3b. A multicomputer

Fig. 4.3 a A multiprocessor. b A multicomputer, both with N processors

is generally more tightly coupled and more homogeneous than a distributed system
although technically, a distributed system is a multicomputer system as it does not
have shared memory.

4.2.3 Flynn’s Taxonomy

Flynn provided a taxonomy of parallel computers based on their ability to execute


instructions on data as follows [4]:
• Single-instruction Single-data (SISD) computers: The single processing unit com-
puters are in this category.
• Single-instruction Multiple-data (SIMD) computers: These are the computers that
operate single instructions on multiple data. They have a single control unit and a
number of processors. A processor array is an SIMD computer commonly used
for matrix operations.
• Multiple-instruction Single-data (MISD) computers: These computers perform a
number of instructions on a single data stream. They have a number of pipelining
stages that process a data stream independently.
• Multiple-instruction Multiple-data (MIMD) computers: This category is the most
versatile category of parallel computing where computers execute independently
on different data streams. Multiprocessors and multicomputers are in this class.
We will be considering MIMD computers which do not require special hardware
for most of the problems considered in this book.

4.3 Parallel Computing

Parallel computing employs a number of processing elements which cooperate to


solve a problem faster than a sequential computer. It involves decomposition of
the computation and/or data into parts, assigning these to processors and providing
the necessary interprocess communication mechanisms. Parallel processing is also

employed within the CPU using instruction level parallelism (ILP) by executing
independent instructions at the same time, however, the physical limits of this method
have been reached recently and parallel processing at a coarser level has re-gained
its popularity among researchers.

4.3.1 Complexity of Parallel Algorithms

A parallel algorithm is analyzed in terms of its time, processor and work complexities.
The time complexity T (n) of a parallel algorithm is the number of time steps required
to finish it. The processor complexity P(n) specifies the number of processors needed
and the work complexity W (n) is the total work done by all processors where W =
P × T. A parallel algorithm A for a problem is more efficient than another parallel
algorithm B for the same problem if W_A < W_B.
Let us consider a sequential algorithm that solves a problem A with a worst case
running time of Ts (n) which is also an upper bound for A. Furthermore, let us assume
a parallel algorithm does W (n) work to solve the same problem in Tp (n) time. The
parallel algorithm is work-optimal if W(n) = O(Ts(n)). The speedup Sp obtained by
a parallel algorithm is the ratio of the sequential time to the parallel time as follows:

Sp = Ts(n) / Tp(n)                                    (4.1)

and the efficiency Ep is

Ep = Sp / p                                           (4.2)
which is a value between zero and one, with p being the number of processors. Ideally
Sp should be equal to the number of processors but this will not be possible due to
the overheads involved in communication among parallel processes. Therefore, we
need to minimize the interprocess communication costs to improve speedup. Placing
many processes on the same processor decreases interprocess communication costs
as local communication costs are negligible, however, the parallelism achieved will
be greatly reduced. Load balancing of parallel processes involves distributing the
processes to the processors such that the computational load among the processors
is balanced and the inter-processor communications are minimized.

4.3.2 Parallel Random Access Memory Model

Parallel Random Access Memory Model (PRAM) is an idealized shared memory


model of parallel computation. This model consists of p identical processors each
with a local memory and they share a single memory of size m. Each processor works
synchronously at each step by reading one local or global memory location, execute
a single global RAM operation and write to a local or global memory location.

This model is commonly used for parallel algorithm design as it discards various
implementation details such as communication and synchronization. It can further
be classified into the following subgroups:
• Exclusive read exclusive write (EREW): Every memory cell can be read or written
by only one processor at a time.
• Concurrent read exclusive write (CREW): Multiple processors can read a memory
cell but only one can write at a time.
• Concurrent read concurrent write (CRCW): Multiple processors can read and write
to the same memory location concurrently. A CRCW PRAM is sometimes called
a concurrent random-access machine.
Exclusive read concurrent write (ERCW) is not considered as it does not make
sense. Most algorithms for PRAM model use SIMD model of parallel computation
and the complexity of a PRAM algorithm is the number of synchronous steps of the
algorithm.

4.3.2.1 Summing an Array


We will show an example of an EREW PRAM algorithm that adds the elements of
an array A of size 2p where p is the number of parallel processors. The idea of this
algorithm is that each processor adds two consecutive array locations starting with
the even numbered location and stores the result in the even location as shown in
Fig. 4.4 for p = 8 and array size 16. We then need only half of the processors to add
the resulting partial sums pairwise in the second step. Proceeding in this manner, the final
sum is stored in the first location A[0] of array A.

Fig. 4.4 Parallel summing of an array A of size 16 by 8 processors


Algorithm 4.1 shows one way of implementing this algorithm using EREW nota-
tion where the number of active processors is halved in each step for a total of log n steps. We
could modify this algorithm so that the contents of the array A are not altered by starting
with 2p processors which copy A to another array B (see Exercise 4.2).

Algorithm 4.1 EREW Sum of an array


1: Input : A[0, ..., 2p − 1], p = 2^m
2: Output : sum S = Σ_{j=0}^{n−1} A[j]
3: for k = 1 to log n do
4: for i = 0 to n/2k − 1 in parallel do
5: A[i] ← A[2i] + A[2i + 1]
6: end for
7: end for
8: if i = 0 then
9: S ← A[i]
10: end if

The time complexity of this algorithm is log n, and in the first step each active processor
p_i performs one addition for a total of n/2 additions. In the second step, only
n/4 processors are active and, in general, the number of active processors is halved at
each step. Since p = n/2 processors are allocated for log n steps, the total work done is
W(n) = (n/2) log n ∈ Θ(n log n). A sequential algorithm finds the result in only n − 1 steps. In this
case, we find the work done by this parallel algorithm is not optimal.
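
A sequential C simulation of the synchronous steps of Algorithm 4.1 may be sketched as follows; no PRAM hardware is assumed, and the inner loop simply plays the role of the processors acting in parallel in one step.

#include <stdio.h>
#define n 16

/* simulate the log n synchronous steps of the EREW sum; the result ends up in A[0] */
int pram_sum(int A[n])
{  int i, active;
   for (active = n / 2; active >= 1; active /= 2)
      for (i = 0; i < active; i++)            /* one "processor" per value of i */
         A[i] = A[2 * i] + A[2 * i + 1];
   return A[0];
}

int main()
{  int A[n], i;
   for (i = 0; i < n; i++) A[i] = i + 1;      /* 1 + 2 + ... + 16 = 136 */
   printf("%d\n", pram_sum(A));
   return 0;
}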

4.3.3 Parallel Algorithm Design Methods

We will assume that a parallel program consists of tasks that can run in parallel.
A task in this sense has code, local memory and a number of input/output ports.
Tasks communicate by sending data to their output ports and receive data from their
input ports [12]. The message queue that connects an output port of a task to an
input port of another task is called a channel. A task that wants to receive data
from one of its input channels is blocked if there is no data available in that port.
However, a task sending data to an output channel does not usually need to wait for
the reception of this data by the receiving task. This model is commonly adopted
in parallel algorithm design where reception of data is synchronous and the sending
of it is asynchronous. Full synchronous communication between two tasks requires
the sending task to be blocked also until the receiving task has received data. A
four-step design process for parallel computing was proposed by Foster consisting
of partitioning, communication, agglomeration and mapping phases [5].
Partitioning step can be performed on data, computation or both. We may divide
data into a number of partitions that may be processed by a number of processing
elements which is called domain decomposition [12]. For example, if the sum of an
array of size n using p processors is needed, we can allocate each processor n/p data

items which can perform addition of these elements and we can then transfer all of
the partial sums to a single processor which computes the total sum for output. In
partitioning computation, we divide the computation into a number of modules and
then associate data with each module to be executed on each processor. This strategy
is known as functional decomposition and sometimes we may employ both domain
and functional decompositions for the same problem that needs to be parallelized.
The communication step involves deciding the pattern of communication
among the tasks such as which task sends or receives data from which other tasks.
The communication costs among tasks that reside on the same processor can be
ignored, however, the interprocess communication costs using the interconnection
network is not trivial and needs to be considered.
The agglomeration step deals with grouping tasks that were designed in the first
two steps so that the interprocess communication is reduced. The final step is the
allocation of the task groups to processors which is commonly known as the mapping
problem. There is a trade off however, as grouping tasks onto the same processor
decreases communication among them but results in less parallelism. One way of
dealing with this problem is to use a task dependency graph which is a directed graph
representing the precedences and communications of the tasks. The problem is then
reduced to the graph partitioning problem in which a vast amount of literature exists
[1]. Our aim in general is to find a minimum weight cut set of the task graph such
that communication between tasks on different processors is minimized and the load
among processors is balanced.
Figure 4.5 shows such a task dependency graph where each task has a known
computation time and the costs of communications are also shown. We assumed the

Fig. 4.5 A task dependency graph with 8 tasks



Fig. 4.6 Mapping of tasks of Fig. 4.5

computation times of tasks are known beforehand which is not realistic in a general
computing system but is valid for many real-time computing systems. As a general
rule, a task Ti that receives some data from its predecessors cannot start before
its predecessors finish, simply because we need to make sure that all of its data has
arrived before it can start its execution. Otherwise, we cannot claim that Ti will finish
in its declared execution time.
We can partition these tasks to two processors P1 and P2 with partition PA1
where we only consider the number of tasks in each partition resulting in a total
IPC cost of 21; or PA2 in which we consider IPC costs with a total cost of 8 for
IPC. These schedules are displayed by the Gantt charts [6] which show the start and
finishing times of tasks in Fig. 4.6. The resulting total execution times are 41 and 32
for partitions PA1 and PA2, from which we can deduce that PA2, which has a lower IPC
cost and a shorter overall execution time, is relatively more favorable.

4.3.4 Shared Memory Programming

Modern operating systems are based on the concept of a process. A process is an


instance of a program in execution, and it has various dynamic data structures such
as program counter, stack, open file pointers etc. A fundamental task performed by
an operating system is the management of processes. Processes can be in one of the
basic states as new, ready, blocked, running and terminated as shown in Fig. 4.7.
When a new process enters the system, it is assigned to ready state so that it can run
when CPU is allocated to it. A process waiting on an event such as a disk operation
is blocked and enters blocked state, to enable another process to have the CPU while

Fig. 4.7 Basic process states



it is waiting. When that event occurs, it is made ready again to be scheduled by the
operating system. A running process may have its time slice expired in which case it
is preempted and put in the ready state to be scheduled when its turn comes. The data
about a process is kept in the structure called process control block (PCB) managed
by the operating system.
Having processes in an operating system provides modularity and a level of par-
allelism where CPU is not kept idle by scheduling a ready process when the cur-
rent process is blocked. However, we need to provide some mechanisms for these
processes to synchronize and communicate so that they can cooperate. There are
two types of synchronization among processes; mutual exclusion and conditional
synchronization as described next.

4.3.4.1 The Critical Section Problem


Processes need to access some parts of their code exclusively if this segment of code
manipulates data shared with other processes. This segment of code is called a
critical section and we need mechanisms to provide exclusive operations in these
sections. Let us demonstrate the critical section problem by a solid example. Assume
two concurrent processes P1 and P2 increment a global integer variable x which has a
value of 12. The increment operation for each process will be implemented typically
by loading a CPU register with the value of x, incrementing the register value and
then storing the value of this register in the location of x as shown below.

P1 P2
1. LOAD R2,m[x] 1. LOAD R5,m[x]
2. INC R2 2. INC R5
3. STO m[x],R2 3. STO m[x],R5

Now, let us assume P1 executes first, and just before it can execute line 3, its
time slice expires and the operating system performs a context switch by storing its
variables including R2 which has 13 in its PCB. P2 is executed afterwards which

executes by reading the value of 12 into register R5, incrementing R5 and storing
its value 13 in x. The operating system now schedules P1 by loading its CPU registers
from its PCB, and P1 writes 13 into location x, which already contains 13. We have
incremented x once instead of twice as required.
As this example shows, some parts of the code of a process has to be executed
exclusively. In the above example, if we had provided some means so that P1 and P2
executed lines 1–3 without any interruption, the value of x would be consistent. Such
a segment of the code of a process which has to be executed exclusively is called a
critical section. A simple solution to this problem would be disabling of interrupts
before line 1 and then enabling them after line 3 for each process. In user C code
for example, we would need to enclose the statement x = x + 1 by the assembly
language codes disable interrupt (DI) and enable interrupt (EI). However, allowing
machine control at user level certainly has shortcomings; for example, if the user
forgets to enable interrupts, then the computation would stop completely.
There are various methods to provide mutual exclusion while a critical section
is executed by a process and the reader is referred to [16] for a comprehensive
review. Basically, we can classify the methods that provide mutual exclusion at
hardware, operating system or application (algorithmic) level. At hardware level,
special instructions that provide mutual exclusion to the memory locations are used
whereas algorithms provide mutual exclusion at application level. However, it is not
practical to expect a user to implement algorithms when executing a critical section,
instead, operating system primitives for this task can be appropriately used. Modern
operating systems provide a mutual exclusion data structure (mutex) for each critical
section and two operations called lock and unlock on the mutex structure. The critical
section can then be implemented by each process on a shared variable x as follows:

mutex m1;
int x /* shared integer */

process P1{
... /* non-critical section */

lock(&m1);
x++; /* critical section */
unlock(&m1);
... /* non-critical section */
}

The lock and unlock operations on the mutex variables are atomic, that is, they
cannot be interrupted, as provided by the operating system.

4.3.4.2 Process Synchronization


Conditional synchronization allows processes to synchronize on events. For example,
let us assume a process P1 is waiting for a resource r that another process P2 is
using. P1 needs to be notified when P2 is finished with r as it has no other

means of knowing the availability of the resource r. Operating systems provide


various facilities and system calls to wait on an event and to signal an event for
synchronization.
A semaphore is a special structure that contains at least an integer. Two atomic
operations; wait and signal can be performed on a semaphore. The wait system call
is used to wait for the event that the semaphore represents and the signal primitive
signals the occurrence of that event. In a practical implementation of a semaphore,
a process queue may be associated with a semaphore. We can define an array of
semaphores with each semaphore having an integer value and a process queue as
shown below. The wait system call then decrements the value of the semaphore,
and queues the calling process in the semaphore process queue by changing its state
to BLOCKED if this value becomes negative, meaning the event waited has not
occurred yet. The signal call does the reverse operation by first incrementing the
value of the semaphore and if this value is negative or zero, which means there was
at least one process waiting in the semaphore queue for the event, it dequeues the
first waiting process, changes its state to READY and calls the scheduler so that the
freed process can execute.

typedef struct { int value;
                 que pr_que;
} sem_t;
sem_t sem_tab[N_sems], *sp;
pcbptr currptr, procptr;

void wait(int s)
{  sem_t *sp=&(sem_tab[s]);
   sp->value--;
   if ( sp->value < 0 )
   {  enque(currptr, sp->pr_que);
      currptr->state=BLOCKED;
      Schedule();
   }
}

void signal(int s)
{  sem_t *sp=&(sem_tab[s]);
   sp->value++;
   if ( sp->value <= 0 )
   {  procptr=deque(sp->pr_que);
      procptr->state=READY;
      Schedule();
   }
}

A semaphore can be used for both mutual exclusion and conditional synchroniza-
tion. The following example demonstrates these two usages of semaphores where two
processes producer and consumer synchronize using semaphores. The semaphore
s1 is used for mutual exclusion and the semaphores s2 and s3 are used for waiting
and signalling events.

sem_t s1,s2,s3;
int shared;

void producer()
{  while(true) {
      input in_data;
      wait(s2);          /* wait until the previous data has been consumed */
      wait(s1);
      shared = in_data;
      signal(s1);
      signal(s3);        /* notify the consumer that new data is available */
   }
}

void consumer()
{  while(true) {
      wait(s3);          /* wait until the producer has written new data */
      wait(s1);
      out_data = shared;
      signal(s1);
      signal(s2);        /* notify the producer that the data has been read */
      print out_data;
   }
}

The shared memory location which should be mutually accessed is shared and
is protected by the semaphore s1. The producer continuously reads data from the
keyboard, and first waits on s2 to be signalled by the consumer indicating it has read
the previous data. It then writes the input data to the shared location and signals
the consumer by the s3 semaphore so that it can read shared and display the data.
Without this synchronization, the producer may overwrite previous data before it is
read by the consumer; or the consumer may read the same data more than once.

4.3.5 Multi-threaded Programming

A thread is a lightweight process with an own program counter, stack and register
values. It does not have the memory page references, file pointers and other data
that an ordinary process has, and each thread belongs to one process. Two different
types of threads are the user level threads and the kernel level threads. User level
threads are managed by the run-time system at application level and the kernel which
is the core of an operating system that handles basic functions such as interprocess
communication and synchronization, is not aware of the existence of these threads.
Creation and other management functions of a kernel thread is performed by the
kernel.
A user level thread should be non-blocking, as a single thread of a process that
makes a blocking system call will block the whole process since the kernel sees it as
one main thread. Kernel level threads, however, can be blocking and even if one
thread of a process is blocked, the kernel may schedule another one, as it manages each
thread separately by using thread control blocks which store all thread related data.
Kernel threads are an order of magnitude or more slower than user level threads due
to their management overhead in the kernel.
threads if they are known to be non-blocking and use kernel level threads if they are
blocking, at the expense of slowed down execution. Figure 4.8 displays the user level
and kernel level threads in relation to the kernel.
Using threads provide the means to run them on multiprocessors or multi-core
CPUs by suitable scheduling policies. Also, they can share data which means they do
not need interprocess communication. However, the shared data must be protected
by issuing operating system calls as we have seen.


Fig. 4.8 a User level threads. b Kernel level threads which are attached to thread control blocks in
the kernel

4.3.5.1 POSIX Threads


Portable operating system interface (POSIX) is a group of standards specified by
IEEE to provide compatibility between different operating systems [11] and POSIX
threads is a POSIX standard specifying an application programming interface (API)
for thread management. We will briefly review only basic POSIX threads API rou-
tines which provide a rich variety of procedures for thread management. One such
fundamental routine is pthread_create function used to create a thread is as follows:
pthread_create(&thread_id,&attributes,start_function,arguments);

where thread_id is the variable where the created thread identifier will be stored after
this system call, start_function is the address of the thread code and the arguments
are the variables passed to the created thread.
The following example illustrates the use of threads for parallel summing of an
integer array A with n elements. Each thread sums the portion of the array defined by
its identity passed to it during its creation. Note that we invoke the same thread
code with a different parameter (the thread identifier) each time, resulting in summing a
different part of the array A; for example, thread 8 sums A[80 : 89]. For this example, we could have
used user-level threads as they work independently without blocking; however, we
used kernel-level threads as the POSIX threads API allows the usage of kernel threads only.
The Solaris operating system allows both user and kernel threads and provides flexibility,
but its API is not a standard [15].

#include <stdio.h>
#include <pthread.h>
#define n 100
#define n_threads 10
int A[n];                        /* assume initialized with the input data, e.g., 2,1,...,8 */
pthread_mutex_t m1;
int total_sum = 0;

void *worker(void *id)
{  int me = *((int *)id);        /* identifier of this thread */
   int i, n_slice = n / n_threads, my_sum = 0;
   for (i = me * n_slice; i < me * n_slice + n_slice; i++)
      my_sum = my_sum + A[i];    /* sum my portion of the array */
   pthread_mutex_lock(&m1);      /* protect the shared total */
   total_sum = total_sum + my_sum;
   pthread_mutex_unlock(&m1);
   return NULL;
}

main()
{  pthread_t threads[n_threads];
   int i, ids[n_threads];
   pthread_mutex_init(&m1, NULL);
   for (i = 0; i < n_threads; i++) {
      ids[i] = i;                /* pass each thread its own identifier */
      pthread_create(&threads[i], NULL, worker, &ids[i]);
   }
   for (i = 0; i < n_threads; i++)
      pthread_join(threads[i], NULL);
   printf("Total sum is = %d\n", total_sum);
}

We need to compile this program (sum.c) with the POSIX thread library as follows:

cc -o sum sum.c -lpthread

Threads can synchronize using locks, condition variables, or semaphores, which
are data structures defined globally. A semaphore to be shared among the threads of a
process can be declared and initialized with the value of 1 as follows:

#include <semaphore.h>
sem_t s1;
sem_init(&s1, 0, 1);   /* 0: shared among the threads of this process; initial value 1 */

The wait and signal operations on semaphores are sem_wait and sem_post
respectively. In the following example, we will implement the producer/consumer
example of Sect. 4.3.4 by using two threads and two semaphores, full and empty.

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

sem_t full, empty;
int data;

void *producer(void *arg)
{  int i = 1;
   while (i < 10) {
      sem_wait(&empty);           /* wait until the previous data has been consumed */
      scanf("%d", &data);
      sem_post(&full);            /* signal that new data is available */
      i++;
   }
   return NULL;
}

void *consumer(void *arg)
{  int i = 1;
   while (i < 10) {
      sem_wait(&full);            /* wait for data from the producer */
      printf("%d\n", data);
      sem_post(&empty);           /* signal that the data has been consumed */
      i++;
   }
   return NULL;
}

main()
{  pthread_t prod, cons;
   sem_init(&full, 0, 0);         /* no data available initially */
   sem_init(&empty, 0, 1);        /* the shared location is empty initially */
   pthread_create(&prod, NULL, producer, NULL);
   pthread_create(&cons, NULL, consumer, NULL);
   pthread_join(prod, NULL);
   pthread_join(cons, NULL);
}

The producer executes a wait on the empty semaphore, which is initialized to 1, so
it does not actually wait the first time. It then writes the input data from the keyboard
to the global data location and signals the full semaphore on which the consumer is
waiting. The consumer is then activated and consumes the data by printing it. It has to
signal the waiting producer so that the previous data is not overwritten. Note
that two semaphores are needed since the two processes wait on two different conditions
to be signalled; the consumer waits for the availability of the next data item and the producer
waits for notification that the written data has been consumed.

4.3.6 Parallel Processing in UNIX

UNIX is a multitasking operating system developed first at Bell Labs and later extended
with a network interface at the University of California at Berkeley as the Berkeley
Software Distribution (BSD). It is written in C for the most part, has a relatively
smaller size than other operating systems, is modular, and is distributed in source code,
which make UNIX one of the most widely used operating systems. The user
interface in UNIX is called the shell, and the kernel, which performs the core operating
system functions such as process management and scheduling, resides between the
shell and the hardware as shown in Fig. 4.9.

Fig. 4.9 UNIX structure: the layers from top to bottom are Application, Shell, Kernel, and Hardware

UNIX is based on processes and a number of system calls provide process man-
agement. A process can be created by a process using the system call fork. The caller
becomes the parent of the newly created process which becomes the child of the
parent. The fork system call returns the process identifier as an unsigned integer to
the parent, and 0 to the child. We can create a number of parallel processes using this
structure and provide concurrency where each child process is created with a copy
of the data area of the parent and it runs the same code as the parent. UNIX provides
various interprocess communication primitives and pipes are one of the simplest. A
pipe is created by the pipe system call, and two file descriptors, one for reading from
the pipe and one for writing to it, are returned. Reading and writing to a pipe are
performed by read and write system calls, similar to file read and write operations.
The following describes the process creation and interprocess communication using
pipes in UNIX. Our aim in this program is to add the elements of an array in parallel
using two processes. The parent creates two pipes one for each direction and then
forks a child, sends the second half of the array to it by the pipe p1 to have it calculate
the partial sum, and calculates the sum of the lower portion of the array itself. It then
reads the sum of the child from pipe p2 and finds the total sum and displays it.

#include <stdio.h>
#include <unistd.h>
#define n 8
int c, i, p1[2], p2[2], A[n], B[n/2], child_sum=0, my_sum=0;

main()
{  pipe(p1);              /* create two pipes, one for each direction: */
   pipe(p2);              /* p1 is from parent to child, p2 is from child to parent */
   c=fork();
   if(c!=0) {             /* this is the parent */
      close(p1[0]);       /* close the read end of p1 */
      close(p2[1]);       /* close the write end of p2 */
      for(i=0;i<n;i++)    /* initialize array */
         A[i]=i+1;
      write(p1[1],&A[n/2],(n/2)*sizeof(int)); /* send the second half of the array */
      for(i=0;i<n/2;i++)  /* sum the first half of the array */
         my_sum=my_sum+A[i];
      read(p2[0],&child_sum,sizeof(int));     /* receive the child's partial sum */
      printf("Total sum is = %d\n", my_sum+child_sum);
   }
   else {                 /* this is the child */
      close(p2[0]);       /* close the read end of p2 */
      close(p1[1]);       /* close the write end of p1 */
      read(p1[0],B,(n/2)*sizeof(int));        /* receive the second half of the array */
      for(i=0;i<n/2;i++)
         my_sum=my_sum+B[i];
      write(p2[1],&my_sum,sizeof(int));       /* send my partial sum to the parent */
   }
}

4.4 Distributed Computing

A distributed system is made of computational nodes that are connected over a
communication network. The Internet, the grid, clusters of workstations, and airline
reservation systems are examples of distributed systems. Two important benefits
to be gained from distributed systems other than parallel processing are resource
sharing and fault tolerance. Our aim in the context of this book will be to use a
distributed system for parallel processing in order to solve computationally time
consuming bioinformatics problems.
Nodes of a distributed system use message-passing as the main method for com-
munication. Each process may send a message to a single process (unicast), to a
group of processes (multicast) or to all processes in the system (broadcast). Mes-
sage passing can be performed synchronously or asynchronously. Two fundamental
primitives of message-passing are the send and receive primitives. A synchronous
send primitive will wait until the receiver has received its message and the syn-
chronous receive will block the caller until a message arrives, on the other hand,
asynchronous send and receive procedures do not block the caller. As the further
actions of a receiver are frequently determined by the contents of the received mes-
sage, the receive primitive is commonly used in blocking mode. However, the sender
may assume its message has been delivered to the receiver over a reliable network
and typically does not need to block. For this reason, the asynchronous send and
synchronous receive combination which we will call semi-synchronous interprocess
communication is mostly used in message-passing distributed computing systems.

4.4.1 Distributed Algorithm Design

A distributed algorithm runs at the computational nodes of a distributed system.


Typically, the same algorithm code will run on different nodes with different data
which is called the single program multiple data (SPMD) paradigm. At a higher
level of operation than message passing, distributed algorithms can be specified as
synchronous where a number of rounds are executed synchronously under the control
of a central process; or asynchronous with no central control. The type of action to
be performed in a distributed algorithm depends largely on the type and contents of
the message received from neighbors. Algorithm 4.2 displays a commonly employed
structure of a distributed algorithm where process i receives a message from a process
j as in [3], and based on the type and contents of this message, it executes a specific
procedure.

Algorithm 4.2 General Distributed Algorithm Structure


1: int i, j  i is this node; j is the sender of the current message
2: while ¬finished do  all nodes execute the same code
3: receive msg(j)
4: case msg(j).type of
5: type_1 : Procedure_1
6: ... : ...
7: type_n : Procedure_n
8: end while

4.4.2 Threads Re-visited

Although threads are mostly used for shared-memory parallel processing, we can still
obtain message-passing functionality with threads through additional routines at the user level.
One such library is presented in [2], where threads communicate by the message-passing
primitives write_fifo and read_fifo, which correspond to the send and receive routines
of a message-passing system. The communication channels between processes are
simulated by first-in-first-out (FIFO) data structures as shown below.

typedef struct {
sem_t send_sem;
sem_t receive_sem;
msgptr_t message_que[N_msgs];
} fifo_t;

Sending a message to a process is performed by first checking the availability
of message space in the FIFO by executing a wait on the sending semaphore of
the FIFO, then writing the message address to the FIFO of the receiving process
and signalling its receiving semaphore so that it is activated. Using such a simulator
provides a simple testbed to verify distributed algorithms; when the algorithm
works correctly, performance tests can be performed using commonly used message-
passing tools as described next.
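The following is a minimal sketch of how the write_fifo and read_fifo primitives could be built on
top of the fifo_t structure above. The queue indices, the mutex protecting them, the per-process
fifos array, and the initial semaphore values are assumptions added here for illustration; they are
not part of the library in [2].

#include <pthread.h>
#include <semaphore.h>

#define N_msgs 16
#define N_procs 8

typedef struct { int type; int data; } msg_t, *msgptr_t;

typedef struct {
   sem_t send_sem;                  /* counts empty slots, initialized to N_msgs */
   sem_t receive_sem;               /* counts filled slots, initialized to 0 */
   pthread_mutex_t lock;            /* protects the indices below (added for this sketch) */
   int head, tail;                  /* assumed circular queue indices */
   msgptr_t message_que[N_msgs];
} fifo_t;

fifo_t fifos[N_procs];              /* one FIFO per receiving process (thread) */

void write_fifo(int dest, msgptr_t m)   /* send: corresponds to the send routine */
{  fifo_t *f = &fifos[dest];
   sem_wait(&f->send_sem);              /* wait until there is space in the FIFO */
   pthread_mutex_lock(&f->lock);
   f->message_que[f->tail] = m;         /* write the message address */
   f->tail = (f->tail + 1) % N_msgs;
   pthread_mutex_unlock(&f->lock);
   sem_post(&f->receive_sem);           /* activate the receiver */
}

msgptr_t read_fifo(int me)              /* receive: corresponds to the receive routine */
{  fifo_t *f = &fifos[me];
   msgptr_t m;
   sem_wait(&f->receive_sem);           /* wait until a message arrives */
   pthread_mutex_lock(&f->lock);
   m = f->message_que[f->head];
   f->head = (f->head + 1) % N_msgs;
   pthread_mutex_unlock(&f->lock);
   sem_post(&f->send_sem);              /* a slot is free again */
   return m;
}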

4.4.3 Message Passing Interface

The Message Passing Interface Standard (MPI) provides a library of message passing
primitives in C or Fortran programming languages for distributed processing over a
number of computational nodes connected by a network [8,10]. Its aim is to provide
a portable, efficient and flexible standard of message passing. MPI code can be eas-
ily transferred to any parallel/distributed architecture that implements MPI. It can
be used in parallel computers, clusters and heterogeneous networks. It is basically
an application programming interface (API) to handle communication and synchro-
nization among processes residing at various nodes of a parallel/distributed comput-
ing system. MPI provides point-to-point, blocking and non-blocking and collective
communication modes. A previous commonly used tool for this purpose was Parallel
Virtual Machine (PVM) [7]. The basic MPI instructions to run an MPI program are
as follows.

• MPI_Init: Initializes the MPI library.
• MPI_Comm_size: Obtains the number of processes that execute the code.
• MPI_Comm_rank: Obtains the rank of the calling process, which varies from 0 to
size − 1.
• MPI_Finalize: Cleans up the library and terminates the MPI environment of the calling process.

Here is a simple program called hello.c in C showing how MPI works.

#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)


{ MPI_Init(&argc, &argv);
printf("Hello world");
MPI_Finalize();
}

The program is compiled and then run as follows:

mpicc -o hello hello.c


mpirun -np 8 hello

The mpirun command takes the number of processes as an input, which is 8 in this
case, and launches that many copies of the program. Each copy initializes the MPI
environment by the MPI_Init call, and each of these processes then executes its

program separately. We will have 8 “Hello world” outputs when this program is
executed. Clearly, it would make more sense to have each child process execute on
different data to perform parallel processing. In order to achieve this SPMD style
of parallel programming, each process has a unique identifier which can then be
mapped to the portion of data it should process. A process finds out its identifier by
the MPI_Comm_rank command and based on this value, it can execute the same code
on different data, achieving data partitioning. Two basic communication routines in
MPI are MPI_Send and MPI_Recv, with the following syntax:

int MPI_Send(void *data, int n_data, MPI_Datatype type,


int receiver, int tag, MPI_Comm comm);

int MPI_Recv(void *data, int n_data, MPI_Datatype type,


int sender, int tag, MPI_Comm comm, MPI_Status *status);

where data is the memory location for data storage, n_data is the number of data
items to be transferred with the given type, receiver and the sender are the identifiers
of the receiving and the sending processes respectively, and tag is the type of message.
The next example shows how to find the sum of an array in parallel using a number
of processes. In this case, the root performs data partitioning by sending a different
part of an array to each worker process; the workers calculate their partial sums and return them
to the root, which finds the total sum. The root first initializes the array and sends
each worker its portion of the array data. Message types (tags) for each
communication direction are also defined.

#include <stdio.h>
#include <mpi.h>

#define n_data 1000
#define tag1 1              /* from root to workers */
#define tag2 2              /* from workers to root */
#define root 0
int array[n_data];

main(int argc, char **argv)
{  long int total_sum = 0, partial_sum;
   MPI_Status status;
   int my_id, i, n_procs, n_portion;

   MPI_Init(&argc, &argv);

   /* find my id and the number of processes */
   MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
   MPI_Comm_size(MPI_COMM_WORLD, &n_procs);
   n_portion = n_data / n_procs;

   if (my_id == root) {

      for (i = 0; i < n_data; i++)     /* initialize array */
         array[i] = i;

      /* send a portion of the array to each worker */
      for (i = 1; i < n_procs; i++)
         MPI_Send(&array[i*n_portion], n_portion, MPI_INT,
                  i, tag1, MPI_COMM_WORLD);

      /* calculate the sum of my portion */
      for (i = 0; i < n_portion; i++)
         total_sum += array[i];

      /* collect the partial sums from the workers */
      for (i = 1; i < n_procs; i++) {
         MPI_Recv(&partial_sum, 1, MPI_LONG, MPI_ANY_SOURCE,
                  tag2, MPI_COMM_WORLD, &status);
         total_sum += partial_sum;
      }
      printf("The total sum is: %ld\n", total_sum);
   }
   else {   /* I am a worker, receive my portion from the root */

      MPI_Recv(array, n_portion, MPI_INT, root, tag1,
               MPI_COMM_WORLD, &status);

      /* calculate the sum of my portion of the array */
      partial_sum = 0;
      for (i = 0; i < n_portion; i++)
         partial_sum += array[i];

      /* send my partial sum to the root */
      MPI_Send(&partial_sum, 1, MPI_LONG, root,
               tag2, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}

MPI is a specification standard that can be implemented differently on various plat-
forms. It is hardware and language independent, which makes code easy to port, and it
is by far the most commonly used parallel/distributed computing standard for developing
applications. The Open MPI project provides an open-source MPI implementation which can
also be used for shared-memory parallel programming [9].

4.4.4 Distributed Processing in UNIX

BSD UNIX provides socket-based communication over the Internet. A socket
is an endpoint of communication between two hosts over the Internet; a server is a process
that performs some function as required by a client process in network communication
using sockets. Communication over the Internet can be performed either in
connection-oriented mode, where a connection between a server and a client is established
and maintained during data transfer, or in connectionless mode. Connection-oriented delivery
guarantees that messages are delivered reliably and in order.
The Transmission Control Protocol (TCP) is a connection-oriented protocol provided at
layer 4 of the International Organization for Standardization (ISO) Open Systems Interconnection
(OSI) 7-layer model. In connectionless mode, there is no established
or maintained connection and the delivery of messages is not guaranteed, requiring
further processing at higher levels of communication. The User Datagram Protocol
(UDP) is the standard connectionless protocol of this model. Sockets used
for connection-oriented communication are called stream sockets, and sockets used for
connectionless communication are called datagram sockets. The basic system calls provided by BSD UNIX are as follows:
• socket: This call creates a socket of the required type, whether connection-oriented or
connectionless, in the required domain.
• bind: A server associates a socket with a local address and port number by this call.
• listen: A server specifies the maximum number of pending connection requests it can queue
on a socket by this call.
• connect: A client initiates a connection to a server by this procedure.
• accept: A server accepts client requests by this call. It blocks the caller until a
request is made. This call is used by a server for connection oriented communica-
tion.
• read and write: Data is read from and written to a socket using these calls.
• close: A connection is closed by this call on a socket.
Figure 4.10 displays a typical communication scenario of a connection-oriented
server and a client. The server and the client are synchronized by the blocking system
calls connect and accept after which a full connection is established using the TCP
protocol. The server then reads the client request by the read and responds by the
write calls. A server typically runs in an endless loop and in order to respond to
further incoming requests, it spawns a child for each request so that it can return to
waiting for further client requests on the accept call. Threads may also be used for
this purpose, resulting in less management overhead than processes.
We can use socket-based communication over the Internet to perform distributed
computations. Returning to the sum of an array example we have been elaborating,
a server process can send portions of an array to a number of clients using stream or
datagram sockets upon their requests and then can combine all of the partial results
from the clients to find the final sum (see Exercise 10).
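The call sequence of Fig. 4.10 can be sketched in C as follows. This is a minimal, illustrative
fragment of the server side only, assuming IPv4 stream sockets, an arbitrarily chosen port number,
and a single request per connection; error checking is omitted. The client side mirrors it with the
socket, connect, write, read, and close calls.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define PORT 5000                          /* assumed port number */

main()
{  int sd, cd, nbytes;
   char buf[256];
   struct sockaddr_in addr;

   sd = socket(AF_INET, SOCK_STREAM, 0);   /* create a stream (TCP) socket */
   memset(&addr, 0, sizeof(addr));
   addr.sin_family = AF_INET;
   addr.sin_addr.s_addr = htonl(INADDR_ANY);
   addr.sin_port = htons(PORT);
   bind(sd, (struct sockaddr *)&addr, sizeof(addr));  /* bind to a local port */
   listen(sd, 5);                          /* queue at most 5 pending requests */
   for (;;) {                              /* server loop */
      cd = accept(sd, NULL, NULL);         /* block until a client connects */
      nbytes = read(cd, buf, sizeof(buf)); /* read the client request */
      write(cd, buf, nbytes);              /* respond; here the request is simply echoed back */
      close(cd);
   }
}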

Fig. 4.10 A connection-oriented server and a client using socket-based communication. The server
side issues the socket, bind, listen, accept, read, and write calls; the client side issues the socket,
connect, write, read, and close calls

4.5 Chapter Notes

We have reviewed basic principles of parallel and distributed computing in this


chapter. Parallel computing as we consider in the context of this book uses shared
memory for data transfer among concurrent processes. Synchronization and commu-
nication are the two fundamental requirements for parallel and distributed processing.
We have seen that access to shared memory should be controlled by mutual exclu-
sion structures or semaphores, and that processes synchronize on events using condition
variables to notify each other. Operating systems provide primitives for both mutual
exclusion and conditional synchronization. Modern operating systems are based
on processes which are instances of executing programs. Threads are lightweight
processes with only a small amount of private data such as the program counter, stack, and registers. Threads of
a process share data within the process, and this shared data must be protected just like data shared
between processes. POSIX threads are widely used for multi-threaded parallel
processing applications.
There is no shared memory in distributed computing and message passing is the
main method to provide synchronization and data transfer among processes. MPI,
which consists of various primitives for interprocess communication, is a commonly
used standard for parallel and distributed processing. We can also employ
basic network communication using the UNIX BSD socket interface [13,14]. This inter-

face provides routines for creation, reading from and writing to data structures called
sockets. However, MPI provides a neater interface to the application with the addition
of routines for multicast and broadcast communication and is frequently employed
for parallel and distributed processing. Threads are frequently used for shared mem-
ory parallel processing and MPI can be employed for both parallel and distributed
applications.
Exercises

1. Provide a partitioning of the task dependency graph of Fig. 4.5 to three processors
which considers evenly balancing the load among the processors and minimizing
the IPC costs at the same time. Draw the Gantt chart for this partitioning and work
out the total IPC cost and the execution time of this partitioning.
2. Modify Algorithm 1.1 such that array A is first copied to an array B and all of
the summations are done on array B to keep contents of A intact. Copying of A
to B should also be done in parallel using n processors.
3. The prefix sum of an array A is an array B with the running sums of the elements
of A such that b0 = a0 ; b1 = a0 + a1 ; b2 = a0 + a1 + a2 ; . . . For example if
A = {2, 1, −4, 6, 3, 0, 5} then B = {2, 3, −1, 5, 8, 8, 13}. Write the pseudocode
of an algorithm using EREW PRAM model to find the prefix sum of an input
array.
4. Provide an N-buffer version of the producer/consumer algorithm where we have
N empty buffers initially and the producer process fills these buffers with input
data as long as there are empty buffers. The consumer process reads filled buffer loca-
tions and consumes these data by printing them. Write the pseudocode of this
algorithm with short comments.
5. A multi-threaded file server is to be designed which receives a message by the
front end thread, and this thread invokes one of the open, read and write threads
depending on the action required in the incoming message. The read thread reads
the specified number of bytes from the file, which are sent back to the requester, and the write thread
writes the bytes contained in the message to the specified
file. Write this program in C with POSIX threads with brief comments.
6. The number π can be approximated by the integral of 4/(1 + x^2) from 0 to 1. Provide a multi-threaded C
program using POSIX threads, with 10 threads each working with 20 slices to
find PI by calculating the area under this curve.
7. Modify the program of Sect. 4.3.6 such that there are a total of 8 processes
spawned by the parent to find the sum of the array in parallel.
8. Provide a distributed algorithm for 8 processes with process 0 as the root. The
root process sends a request to learn the identifiers of the processes in the first
round and then finds the maximum identifier among them and notifies each
process of the maximum identifier in the second round.
9. Write an MPI program in C that sums elements of an integer array in a pipelined
topology of four processes as depicted in Fig. 4.11. The root process P0 computes
the sum of the first 16 elements and passes this sum along with the remaining
48 elements to P1, which sums the second 16 elements of the array and passes

Fig. 4.11 MPI tasks for Exercise 9: the processes P0, P1, P2, and P3 are connected in a pipeline (ring)

the accumulated sum to P2. This process is repeated until the message reaches P0,
which receives the total sum and displays it in the final step. Note that the first
extra element of the message transferred can be used to store the partial sum
between processes.
10. Provide the pseudocode of a program using UNIX BSD sockets where we have
a server and n clients. Each client makes a connection-oriented call to the server,
obtains its portion of an integer array A and returns the partial sum to the server.
The server displays the total sum of array A when it receives all of the partial
sums from the clients.

References
1. Elsner U (1997) Graph partitioning, a survey. Technical Report, Technische Universitat Chem-
nitz
2. Erciyes K (2013) Distributed graph algorithms for computer networks. Computer communi-
cations and networks series. Chap. 18, Springer. ISBN 978-1-4471-5172-2
3. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Fran-
cis, pp 57–59. ISBN 978-1-4471-5172-2
4. Flynn MJ (1972) Some computer organizations and their effectiveness. IEEE Trans. Comput.
C 21(9):948–960
5. Foster I (1995) Designing and building parallel programs: concepts and tools for parallel
software engineering. Addison-Wesley
6. Gantt HL (1910) Work, wages and profit. The Engineering Magazine, New York
7. http://www.csm.ornl.gov/pvm/
8. http://www.mcs.anl.gov/research/projects/mpi/
9. http://www.open-mpi.org/
10. https://computing.llnl.gov/tutorials/mpi/
11. POSIX.1 FAQ. The open group. 5 Oct 2011

12. Quinn M (2003) Parallel programming in C with MPI and OpenMP. McGraw-Hill Sci-
ence/Engineering/Math
13. Stevens WR (1998) UNIX network programming. In: Networking APIs: sockets and XTI, vol
1, 2nd edn. Prentice Hall
14. Stevens WR (1999) UNIX network programming. In: Interprocess communications, vol 2, 2nd
edn. Prentice Hall
15. Sun Microsystems Inc. (1991) SunSoft introduces first shrink-wrapped distributed computing
solution: Solaris. (Press release)
16. Tanenbaum A (2014) Modern operating systems 4th edn. Prentice-Hall
Part II
Biological Sequences
5 String Algorithms

5.1 Introduction

A string S consists of an ordered set of characters over a finite set of symbols called
an alphabet Σ. The size of the alphabet, |Σ|, is the number of distinct characters
in it and the size of a string, |S|, is the number of characters contained in it; also
called the length of the string. For example, the DNA structure can be represented by
a string over the alphabet of four nucleotides: Adenine (A), Cytosine (C), Guanine
(G) and Thymine (T), hence Σ = {A, C, G, T } for DNA; the RNA structure has
Σ = {A, C, G, U} replacing Thymine with Uracil (U), and a protein is a linear chain
of amino acids over an alphabet of 20 amino acids Σ = {A, R, N, . . . , V }.
String algorithms have numerous applications in molecular biology. Biologists
often need to compare two DNA/RNA or protein sequences to find the similarity
between them. This process is called sequence alignment and finding the relatedness
of two or more biological sequences helps to deduce ancestral relationships among
organisms as well as finding conserved segments which may have fundamental roles
for the functioning of an organism. We may also need to search for a specific string
pattern in a biological sequence as this pattern may indicate a location of a structure
such as a gene in DNA. In some other cases, biologists need to find repeating substring
patterns in DNA or protein sequences, as these serve as signals in the genome, and their
over-representation may indicate complex diseases. Analysis of the genome as
substring rearrangements helps to find evolutionary relationships between organisms
as well as to understand the mechanisms of diseases. In summary, DNA/RNA and
protein functionalities are highly dependent on their sequence structures and we
need efficient string manipulation algorithms to understand the underlying complex
biological processes.
In this chapter, we start the analysis of strings by the string matching algorithms
where we search the occurrences of a small string inside a larger string. We will
see that there are a number of algorithms with varying time complexities. The next
problem we will investigate is the longest common subsequence problem where our
aim is to find the longest common subsequence in two strings. Longest increasing

subsequence problem searches the longest increasing subsequence in a string and


the longest common increasing subsequence is the search of such sequences in two
strings. All of these problems are closely related to DNA sequence analysis. A suf-
fix tree is a data structure that represents suffixes of a string. Suffix trees have many
applications in bioinformatic sequence problems including exact string matching and
other string algorithms such as the substring problems. We will investigate sequen-
tial and distributed construction of suffix trees and their applications in biological
sequence problems. In all the topics described, we will first present sequential algo-
rithms for these problems and then provide a review of contemporary parallel and
distributed algorithms in these topics.

5.2 Exact String Matching

Given a string S = s1 , . . . , sn with n elements, the string S[i, j] = si , . . . , sj is
called a substring of S which starts at position i and ends at position j. A prefix
S[1, i] = s1 , . . . , si of a string S starts from the beginning of the string and ends at
location i. A suffix S[i, n] = si , . . . , sn of a string S, on the other hand, starts at location
i and ends at the end of the string. As an example, let S = madagaskar; mada is a
prefix and skar is a suffix of S.
In the string matching problem, we want to find the first but most commonly, all
occurrences of a small string called pattern in a large string called text where symbols
in the pattern and text are chosen from the alphabet Σ. Formally, we are given a text
T = t1 , . . . , tn of length n and a pattern P = p1 , . . . , pm of length m and require to
find the location i in T such that ti , . . . , ti+m−1 = P. For example, if P = ccba and
T = abccbabaccbaab using the alphabet Σ = {a, b, c}, we need to find the locations
{3,9} as t3 , . . . , t6 = t9 , . . . , t12 = ccba. We will mostly use this notation, however,
we will occasionally adopt the array notation as S[i] to denote the ith element of a
string S, for example, for convenience in representation in algorithms.

5.2.1 Sequential Algorithms

We will now review fundamental sequential algorithms for string matching starting
with the naive algorithm which is a brute force algorithm that checks every pos-
sible combination. As this is a time consuming approach, various algorithms were
proposed that have lower time complexities.

5.2.1.1 The Naive Algorithm


The brute force or the naive algorithm is the simplest method to search for a pattern
in a text. The occurrences of the pattern can only start in the first n − m + 1 locations
of the string. For the text T = abccbabaccbaab and the pattern P = ccba example,
n is 14 and m is 4; we need to check the first 11 positions in T . Therefore, we start

from the first locations in T and P and compare their elements. As long as we find
a match, the indices are incremented and if the size of the pattern m is reached, a
match is found and output as shown in Algorithm 5.1.

Algorithm 5.1 Naive algorithm


1: Input: text T = t1 ...tn , pattern P = p1 ...pm
2: int i, j, k
3: for i ← 1 to (n − m + 1) do
4: k ← i; j ← 1
5: while j ≤ m and tk = pj do
6: k ← k + 1; j ← j + 1
7: end while
8: if j > m then
9: output match at position i
10: end if
11: end for

Let us consider the running of this algorithm until the first match, for the example
above:

T = a b c c b a b a c c b a a b
P = c c b a
c c b a
c c b a (a match found at location 3)

The time taken by this algorithm is O(nm) since checking each occurrence of P
is done in m comparisons and we need to check the first n − m + 1 locations of T .
For very long strings such as the biological sequences, the total time taken will be
significantly high. Baeza-Yates showed that the expected time taken in this algorithm
for an alphabet of size c is [1]:
C = (c/(c − 1)) (1 − 1/c^m ) (n − m + 1) + O(1) (5.1)
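Algorithm 5.1 can be coded directly in C as in the following sketch; the function name and the way
matches are reported (as 1-based positions) are illustrative choices.

#include <stdio.h>
#include <string.h>

/* report every position (1-based) where pattern P occurs in text T */
void naive_match(const char *T, const char *P)
{  int n = strlen(T), m = strlen(P), i, j;
   for (i = 0; i <= n - m; i++) {
      for (j = 0; j < m && T[i+j] == P[j]; j++)
         ;                               /* extend the match while the symbols agree */
      if (j == m)
         printf("match at position %d\n", i + 1);
   }
}

main()
{  naive_match("abccbabaccbaab", "ccba");   /* reports positions 3 and 9 */
}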

5.2.1.2 Finite State Machine-Based Algorithm


A finite automaton or a finite state machine (FSM) (or simply a state machine) consists
of a finite number of states, a finite number of transitions between these states, a start
state and a number of states called accepting (final) states. An FSM can be in only
one state at any time which is its current state and its next state is determined by this
current state, and the input it receives. FSMs are used to model diverse problems
such as communication protocols, language grammars, and sequence analysis. In a
deterministic FSM, the next state is a single determined state and there may be a

number of possible next states in the nondeterministic FSM. Formally, an FSM is a


quintuple (Σ, S, s0 , δ, O) where:

• Σ is an input alphabet
• S is a finite nonempty set of states
• s0 ∈ S is the initial state
• δ : S × Σ → S is the state transition function
• O ⊆ S is the set of accepting (final) states

Let us illustrate these concepts by a string matching algorithm example. Given an


input text T of length n over an alphabet Σ = {a, b, c}, we want to
find all occurrences of the pattern P = acab of length m in T . The FSM diagram, which consists
of circles representing states with arcs showing the transitions between the states can
be drawn as depicted in Fig. 5.1. Reaching the final state in the FSM means we have
detected the pattern in the text.
An FSM table is an alternative way of showing the operation of the FSM. Typically,
the rows of the table are indexed by the states and the inputs are the columns. In the
example above, we have 5 states as 0 (the initial state), a, ac, aca, and acab. The
possible inputs as read from the string T are a, b, and c. We can simply fill the entries
of this table with the next state of the FSM based on the input and the current state
as shown in Table 5.1.
For example, receiving input c at state aca will cause a return to the initial state
of 0. This way of representing an FSM provides a simple way of programming
the algorithm. We will show how to implement this FSM using such a table in
the programming language C. We first define the FSM table (fsm_tab) as a two-
dimensional array of function pointers. Then each state changing action can be

Fig. 5.1 FSM to find pattern acab

Table 5.1 Finite state machine table

States    Inputs
          a       b       c
0         a       0       0
a         a       0       ac
ac        aca     0       0
aca       a       acab    0
acab      a       0       0

defined as a function address of which is placed in this table. The algorithm then
simply directs the flow to the function specified by the current state and the current
input in the table as shown below. The states are now labeled as 0, 1, 2, 3 and 4, and
the inputs are 0, 1, and 2 corresponding to a, b, and c. The time to construct the FSM
is Θ(m|Σ|) which is the size of the table, assuming there are no overlapping states.
The time to search the input string is Θ(n), resulting in a total time complexity of
Θ(m|Σ| + n) for this algorithm.

void (*fsm_tab[n_states][n_inputs])(); // fsm table declared
char S[n];                    // input string S of length n
unsigned int curr_state=0;    // initialize current state
unsigned int input;           // current input symbol mapped to 0, 1 or 2
void act_00(){curr_state = 1; // action 00 changes current state to 1
}                             // when 'a' is received at state 0
...                           // all other action functions are declared similarly
/* initialize */
fsm_tab[0][0]=act_00;         // place function addresses in table
...                           // do it for n_states * n_inputs entries
main() { unsigned int count=0;
while(count < n) {                      // scan the whole text
   input=S[count]-'a';                  // map characters a, b, c to inputs 0, 1, 2
   (*fsm_tab[curr_state][input])();     // call the action for this state and input
   if (curr_state == 4)                 // state 4 means the pattern has been found
      printf("pattern found at location %d", count);
   count++;
}
}

5.2.1.3 Knuth–Morris–Pratt Algorithm


Knuth and Morris, and around the same time Pratt discovered the first linear time
string matching algorithm by analyzing the naive algorithm [2]. The general obser-
vation of the naive algorithm showed that there are unneeded searches in it and some
of the comparisons may be skipped when previous comparisons are used. The main
idea of the Knuth–Morris–Pratt (KMP) algorithm is to use a sliding window of size
m and compare the characters in this window of text with the pattern. The window
is slid from left to right until a mismatch is found. When this happens, the prefix
table Π , which is an array of integers constructed prior to the execution of the KMP
algorithm, is searched to determine where to move next in the text. This table basi-
cally shows how many characters to skip if there is a mismatch while seeking the
pattern in the text, and this is the fundamental gain of this algorithm. In other words,
when there is a mismatch, the window is shifted as many characters as specified in
Π , saving redundant searches.

Fig. 5.2 The computation of the prefix array Π for the pattern P = {acacaba}

The formation of the prefix table Π is performed at initialization using the pattern
P only. The idea here is to check incrementally whether any proper prefix of a pattern
is also a proper suffix of it. If this is the case, the integer value is incremented. The
entry Π [i] is the largest integer smaller than i such that prefix p1 , . . . , pπi is also a
suffix of p1 , . . . , pi . As an example, let us assume the pattern P = {acacaba} and
work out the values of Π as shown in Fig. 5.2.
The pseudocode of the prefix function to fill Π is shown in Algorithm 5.2.

Algorithm 5.2 Prefix Function


1: procedure Prefix(P[m])
2: pattern P[1..m]  pattern of size m
3: Π [1] ← 0  output array holding prefix values
4: j←0
5: for i ← 2 to m do
6: while j > 0 and P[j + 1] ≠ P[i] do
7: j ← Π [j]
8: end while
9: if P[j + 1] = P[i] then
10: j ←j+1
11: end if
12: Π [i] ← j
13: end for
14: return Π
15: end procedure

Once this array is formed, we can use it when there is a mismatch to shift the
indices in the text T . The operation of the KMP algorithm is described in pseudocode
in Algorithm 5.3.

Algorithm 5.3 KMP_Alg


1: Input : text T [1..n]
2: pattern P[1..m]
3: Π ← Prefix(P)  compute prefix function
4: j ← 0
5: for i ← 1 to n do
6: while j > 0 and P[j + 1] ≠ T [i] do
7: j ← Π [j]
8: end while
9: if P[j + 1] = T [i] then
10: j ←j+1
11: end if
12: if j = m then
13: output match at i − m
14: j ← Π [j]
15: end if
16: end for

Let us consider an example where T = {ccacacacabacab} and the pattern P =
{acacaba} described above, so we can use the prefix table Π computed earlier. As long
as the characters in T match the characters in P, we increment the pointers and
whenever there is a mismatch, the working window is shifted as many times as in
the corresponding entry of Π shown below.

Prefix : 0 0 1 2 3 0 1
T = c c a c a c a c a b a c a b
P = a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (mismatch at 6, shift 1)
a c a c a b a (mismatches at 1,2 and 3, shift 1 at 3)
a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (sixth match)
a c a c a b a (FULL MATCH)

This algorithm takes O(m) time to compute the prefix values and O(n) to compare
the pattern to the text, resulting in a total of O(m + n) time. The running time of
this algorithm is optimal and can process large texts as it never needs to move
backward. However, it is sensitive to the size of the alphabet |Σ| as there will be
more mismatches when this increases.
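A direct C translation of Algorithms 5.2 and 5.3, using 0-based indexing, is sketched below; the
function names, the fixed maximum pattern length, and the way matches are reported are
illustrative assumptions.

#include <stdio.h>
#include <string.h>

#define MAX_M 64                     /* assumed maximum pattern length */

/* prefix (failure) function: pi[i] is the length of the longest proper prefix
   of P[0..i] that is also a suffix of it */
void prefix(const char *P, int m, int *pi)
{  int i, j = 0;
   pi[0] = 0;
   for (i = 1; i < m; i++) {
      while (j > 0 && P[j] != P[i]) j = pi[j-1];
      if (P[j] == P[i]) j++;
      pi[i] = j;
   }
}

void kmp_match(const char *T, const char *P)
{  int n = strlen(T), m = strlen(P), i, j = 0, pi[MAX_M];
   prefix(P, m, pi);
   for (i = 0; i < n; i++) {
      while (j > 0 && P[j] != T[i]) j = pi[j-1];   /* follow the prefix table on a mismatch */
      if (P[j] == T[i]) j++;
      if (j == m) {                                /* a full match ends at position i */
         printf("match at position %d\n", i - m + 2);  /* 1-based start position */
         j = pi[j-1];
      }
   }
}

main()
{  kmp_match("abccbabaccbaab", "ccba");    /* reports positions 3 and 9 */
}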

5.2.1.4 Boyer–Moore Algorithm


The Boyer–Moore (BM) algorithm is an efficient pattern matching algorithm that
has usually a sub-linear computation time [3]. We have an input string T and a
pattern P as before. The main idea of this algorithm is to search a pattern without
looking at all characters of the text. Also, the matching process of the pattern with
the text is performed from right to left since more information is gained this way. The
processing of the whole pattern is still from left to right though, pattern p1 , . . . , pm
is compared with tj+1 , . . . , tj+m where 1 ≤ j ≤ n − m is the current location in the
text T . It implements two clever rules called bad character rule and good suffix rule
to compute the number of right shifts whenever there is a mismatch, as described
below.
Bad character rule
The pattern P is shifted right until the rightmost occurrence of the text character
T [j + i] in the pattern. Let us assume there is a mismatch at position P[i] = a and
T [i + j] = b. This implies P[i + 1..m] = T [i + j + 1..j + m] = w since we are doing
a right scan. If the pattern P does not contain b, then we can shift P until T [i + j].
Otherwise, we need to shift P until the rightmost incidence of b. We therefore need
to do preprocessing for the bad character rule. For each character x ∈ Σ:

R(x) = 0 if x does not occur in P; otherwise R(x) = max{i ≤ m | P[i] = x}
Informally, R(x) is the position of the rightmost occurrence of x in P, and it is
zero if x does not exist in P. The array R can be computed in O(m) time for a
pattern P of length m. Bad character rule is as follows. When a mismatch is found
at position i with T [k] being the mismatch character in T , shift the pattern right by
max(1, i − R[T [k]]). For example, given the pattern P = a c b c b a, R[a] = 6, R[b] =
5, and R[c] = 4, let us see how this rule works for the text T = a b c c b a a b b a c

T= a b c c b a a b b a c
P= a c b c b a mismatch at position 3
i=3, k=3, T[3]=c, i-R[c]=-1, shift 1
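The array R can be filled with a single left-to-right scan of the pattern, as in the following sketch;
the table size of 256 (one entry per possible character) and the function name are assumptions for
illustration, and positions are 1-based as in the text.

#define SIGMA 256                     /* assumed table size: one entry per character */

/* R[x] = position of the rightmost occurrence of character x in P[1..m], 0 if absent */
void bad_character(const char *P, int m, int R[SIGMA])
{  int i;
   for (i = 0; i < SIGMA; i++)
      R[i] = 0;
   for (i = 1; i <= m; i++)           /* later occurrences overwrite earlier ones */
      R[(unsigned char)P[i-1]] = i;
}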

Good suffix rule


The idea of the good suffix rule is to shift the pattern to the position of the next
occurrence of the suffix P[i + 1], . . . , P[m] in the pattern. It basically means we have
matched a suffix of P and we know the following characters in T . The number of
steps to shift when there is a mismatch is calculated according to both rules and the
shift that has the greatest value is implemented. We need to preprocess the pattern
to form arrays for both cases in order to implement these rules.
The Boyer–Moore algorithm shown in Algorithm 5.4 forms one array for the bad
character rule and two arrays for the good suffix rule before the execution of the
algorithm and uses them to shift the pattern when there is a mismatch. Algorithm 5.4
displays the simplified pseudocode of the algorithm where preprocessing forms the
arrays.

Algorithm 5.4 BM_Alg


1: Input : text T [1..n]  text of size n
2: pattern P[1..m]  pattern of size m
3: Output : Index of first substring of T matching P
4: preprocess to obtain arrays for bad character and good suffix rules
5: i ← m − 1
6: j ← m − 1
7: repeat
8: if P[j] = T [i] then
9: if j = 0 then
10: return i
11: else
12: i ←i−1
13: j ←j−1
14: end if
15: else
16: shift P by the maximum of the shifts obtained
17: from the bad character rule and the good suffix rule
18: end if
19: until i > n − 1

Its worst-case performance is O(n+rm), where r is the number of occurrences of P


in T . The average performance of BM algorithm is O(n/m) meaning its performance
gets better for longer patterns. A parallel implementation of BM algorithm is reported
in [4].

5.2.2 Distributed String Matching

A general approach to provide distributed string matching would be to partition the


string S into approximately equal segments and provide each processor with one
segment. Each process pi residing on a processor then would implement the string
matching algorithm, whether KMP or BM algorithm, in its own segment. The parti-
tioning of data is handled by a special process called the supervisor and the processes
in the distributed system are called workers. This model is also called master–slave
method of parallel processing and we will frequently adopt this approach for distrib-
uted processing to solve bioinformatics problems. We will generally assume there
are k processors and k processes, each of which runs on a processor. Another distinc-
tion is whether the supervisor is involved in the actual algorithm to find results, for
example, whether it also searches for a pattern as the workers do. We will show examples
of both approaches, however, if there is high volume of data to be communicated by
the supervisor or if the amount of work done by the workers can vary and cannot
be determined beforehand, the supervisor may be assigned to perform management
only.
Let us attempt to sketch a distributed algorithm to perform KMP algorithm in par-
allel over k processors. We have k processes pr0 , . . . , prk−1 and pr0 is the supervisor
(master or the root process). As shown in Algorithm 5.5, pr0 first works out the pre-
fixes using the prefix function. As this initialization part of the algorithm takes O(m)
time and since the pattern is significantly shorter than the text T , this task can be
handled by pr0 . It then partitions the input text T into approximately k − 1 partitions
T1 , . . . , Tk−1 and sends each worker a unique segment along with the prefix array
values. Each worker then implements KMP algorithm in its segment and returns the
results to the supervisor pr0 which combines them and outputs the final result.
There is one issue to be handled though, the characters in the bordering regions
of segments should be carefully considered. For example, let us assume we have
the following text which is partitioned among three processes p1 , p2 , and p3 and we
search for the pattern cba in this text:

b c c b c | b a a c b | b c a b b
p1 p2 p3

Neither the process p2 nor p1 will be aware of the match because they do not
have the whole of the pattern cba. A possible remedy for this situation would be to
provide the preceding process, which is p1 in this case, with the first m − 1 characters
of its succeeding process. Searching for matches in these bordering regions would
then be the responsibility of the preceding process with a lower identifier. Assuming
the communication costs are negligible, this approach which is sometimes called
embarrassingly parallel will provide almost linear speedup.
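One possible way for the supervisor to form such segments with an overlap of m − 1 characters is
sketched below; the segment arithmetic and the function name are illustrative assumptions, with
the last worker simply absorbing any remainder of the text.

/* compute the segment [start, end] (0-based, inclusive) of worker i out of k workers
   for a text of length n and a pattern of length m; every segment except the last is
   extended by m-1 characters so that matches crossing a border are found by the
   preceding worker */
void segment_bounds(int i, int k, int n, int m, int *start, int *end)
{  int base = n / k;                  /* basic segment length */
   *start = i * base;
   *end = (i == k - 1) ? n - 1 : (i + 1) * base - 1 + (m - 1);
   if (*end > n - 1)
      *end = n - 1;                   /* stay inside the text */
}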

Algorithm 5.5 Dist_KMP


1: Input : Text T [1, n], pattern P[1, m]
2: all_procs = {pr1 , ..., prk−1 }  process set
3: Output : matched  set of matched P values in T
4: received ← Ø
5: if pri = pr0 then  if I am the supervisor process
6: Π ← Prefix(P)  compute prefix values of the pattern
7: for i = 1 to k − 1 do
8: send (T [((i − 1)n/k) + 1] - T [in/k]) ∪Π to pri  send a segment of T
9: end for
10: while received = all_procs do  get matched minimum values from processes
11: receive match(valsi ) from pri
12: matched ← matched ∪ valsi
13: received ← received ∪ pri
14: end while
15: output matched
16: else  I am a worker process pi
17: receive my segment Mi from pr0
18: valsi ← KMP(Mi )  run KMP on my segment
19: send match(valsi ) to pr0  send matches to supervisor
20: end if

5.3 Approximate String Matching

Given a text T = t1 , . . . , tn of length n and a pattern P = p1 , . . . , pm , our aim in


approximate string matching is to search for approximate matches of P in T . That
is, a prespecified upperbound k on mismatches is allowed. When k = 1, this problem
is commonly referred to as the 1-mismatch problem. The required output is the list of
positions of approximate matches as in the exact matching problem. Let us illustrate
this concept by an example; we have a text T = acabcbbbcbc and the pattern is
P = cbb with the maximum allowed mismatch number k = 1. Intuitively, we can
implement the naive algorithm that we used in exact matching by scanning the text
T from left to right for the pattern P. This time, however, we allow for a maximum
of one mismatch. The example below shows the operation of this algorithm for these
two example sets.
T = a c a b c b b b c b c
P = c b b
c b b ( approximate match at position 2 )
c b b
c b b
c b b ( exact match at position 5)
c b b
c b b
c b b
c b b ( approximate match at 9)

The pseudocode for this algorithm is shown in Algorithm 5.6. As in the naive
exact algorithm, it has a time complexity of O(nm).

Algorithm 5.6 Naive Approximate Matching


1: Input: text T = t1 , ..., tn , pattern P = p1 , ..., pm , upperbound k
2: Output: positions of approximate matches
3: int i,j
4: for i ← 1 to (n − m + 1) do
5: d←0
6: for j ← 1 to m do
7: if T [i + j − 1] ≠ P[j] then
8: d ←d+1
9: end if
10: end for
11: if d ≤ k then
12: output i
13: end if
14: end for

5.4 Longest Subsequence Problems

The discovery of longest subsequences of a number of strings has many practical


applications in bioinformatics such as finding the conserved regions of similarity and
associating functionality. There are a few variants of this problem, as described next.

5.4.1 Longest Common Subsequence

Let us consider a string S which consists of symbols from a finite alphabet Σ. A


substring P ∈ S is a fragment of S which includes some contiguous symbols of S.
A subsequence Q ∈ S is found by deleting zero or more symbols from S. In other
words, the symbols in Q need not be contiguous in S. An example of a substring A
and subsequences B and C of a string S with alphabet Σ = {a, b, c} is as follows:

S = a b b a b c b a a b c c a
A = b a b c
B = a - c b a - - c
C = b a b - b

The longest common subsequence problem can be stated as follows :

Definition 5.1 The Longest Common Subsequence (LCS) of two strings A =


(a1 , . . . , an ) and B = (b1 , . . . , bm ) is the longest sequence that is a subsequence of
both A and B. LCS can also be searched among a set of k strings S = S1 , . . . , Sk .

The following example shows the LCS P of two strings A and B. In general, we
are more interested in finding the length of LCS rather than its contents.

A = a b b a b c b a b a
B = b b a c c c a a c
P = b b a c a a

Comparing two biological sequences is frequently needed to infer their relation-


ship and understand their functionalities. The LCS between two such sequences is
one step toward this goal and we will see how this problem can be solved for the more
general case in the next chapter. The brute force algorithm to find LCS of strings A
of size m and B of size n will consider all 2^m subsequences of A and will check each of
them against B, resulting in O(n 2^m ) computation time. Therefore, there is a
need for algorithms with better time complexities.

5.4.1.1 Dynamic Programming Algorithm


The dynamic programming solution to this problem involves breaking it into smaller
problems and using the solutions of the smaller problems to build larger solutions,
as in the general approach of dynamic programming. In the search of LCS between
two strings A = a1 , . . . , an and B = b1 , . . . , bm , we observe that for two prefixes
a1 , . . . , ai and b1 , . . . , bj of these two strings with ai = bj , their LCS has a length of one more
than the length of the LCS of a1 , . . . , ai−1 and b1 , . . . , bj−1 . This observation shows
LCS problem can be solved using dynamic programming. Let us aim at finding the
length of LCS between two strings A and B and consider two cases of this problem:

1. Case 1: Considering any two elements A[i] and B[j], the first case is they may be
equal. In this case of A[i] = B[j], the length of current LCS should be incremented.
This relation can be stated recursively as follows:
LCS[i, j] = 1 + LCS[i − 1, j − 1] (5.2)
2. Case 2: When A[i] = B[j], either A[i] or B[j] has to be discarded, therefore:
LCS[i, j] = max(LCS[i − 1, j], LCS[i, j − 1]) (5.3)

Using these two cases, we can form the dynamic programming solution to this
problem using two nested loops as shown in Algorithm 5.7. The array C holds the
partial and final solutions to this problem.

Algorithm 5.7 LCS Dynamic Algorithm


1: Input : A = a1 , ..., am , B = b1 , ..., bn
2: Output C[n, m]
3: int i,j
4: for i ← 1 to n do
5: for j ← 1 to m do
6: if ai = bj then
7: C[i, j] ← C[i − 1, j − 1] + 1
8: else
9: C[i, j] ← max(C[i, j − 1], C[i − 1, j])
10: end if
11: end for
12: end for

For two strings A = {abacc} and B = {babcc}, we can form the output matrix C in
this way. The time to fill this matrix is Θ(nm) with a total space of Θ(nm), and the final
length of LCS is in the last element of C, that is, C[n, m] = |LCS(A, B)|. In order
to find the actual LCS sequence, we start moving up from the right corner of the
matrix which holds the element C[n, m] and backtrack the path we have followed
while filling the matrix. Following the sequence elements discovered in this way,
the final LCSes are found to be {bacc} and {abcc}, as there are two backtracking paths.
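Algorithm 5.7 can be coded in C as in the following sketch, which returns only the length of the
LCS; the boundary initialization of row 0 and column 0 and the fixed maximum string length are
assumptions added to make the example self-contained.

#include <stdio.h>
#include <string.h>

#define MAX_LEN 100                  /* assumed maximum string length */

int max(int a, int b) { return a > b ? a : b; }

/* returns |LCS(A,B)| using the recurrences (5.2) and (5.3) */
int lcs_length(const char *A, const char *B)
{  int n = strlen(A), m = strlen(B), i, j;
   static int C[MAX_LEN+1][MAX_LEN+1];
   for (i = 0; i <= n; i++) C[i][0] = 0;      /* an empty prefix has LCS length 0 */
   for (j = 0; j <= m; j++) C[0][j] = 0;
   for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++)
         if (A[i-1] == B[j-1])
            C[i][j] = C[i-1][j-1] + 1;        /* case 1: matching symbols */
         else
            C[i][j] = max(C[i][j-1], C[i-1][j]);  /* case 2: drop one symbol */
   return C[n][m];
}

main()
{  /* the strings A and B of the example in Sect. 5.4.1; prints 6 */
   printf("%d\n", lcs_length("abbabcbaba", "bbacccaac"));
}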

5.4.1.2 Parallel and Distributed LCS Search


Several ways of parallelizing LCS using dynamic programming are described by
Ukiyama and Imai [5]. The use of parallelization in diagonal direction of the matrix
is implemented efficiently on the CM5 parallel computer which can provide parallel
operations up to 16000 nodes. A bit-parallel algorithm for the LCS problem was

proposed in [6] with O(nm/w) time and O(m/w) space complexities where w is the
size of the machine word and n and m are the lengths of the two strings.
Dominant points are the minimal points of search and using them narrows the
search space efficiently. Studies using dominant points to solve multiple LCS problem
in parallel for more than two input strings have been reported in [7,8]. A parallel
algorithm to solve LCS problem on graphics processing units (GPUs) was proposed
by Yang et al. [9].

5.4.2 Longest Increasing Subsequence

As another example, let us consider the longest increasing subsequence (LIS) prob-
lem. Given a sequence S = {a1 , a2 , . . . , an } of numbers, L is the longest subsequence
of S where ∀ai , aj ∈ L, ai ≤ aj if i ≤ j. For example, given S = {7, 2, 4, 1, 6, 8, 5, 9},
L = {2, 4, 6, 8, 9}. A dynamic programming solution to the problem would again
start from the subproblems and store the results to be used in future. Algorithm 5.8
shows how to find the LIS using dynamic programming in O(n^2 ) time [10]. The proce-
dure Max_Val is used to find the maximum of the numbers in an array in order to find
the length of the longest path stored in the array L.

Algorithm 5.8 LIS_dyn


1: Input : S[n]  array of numbers
2: Output : length  length of LIS
3: for i ← 1 to n do
4: L[i] ← 1
5: for j ← 1 to i do
6: if S[i] ≥ S[j] ∧ L[i] ≤ L[j] then
7: L[i] ← L[j] + 1
8: end if
9: end for
10: end for
11: return Max_Val(L)
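A direct C version of Algorithm 5.8 is sketched below; it returns only the length of the LIS, and
the fixed maximum sequence length is an assumption of the sketch.

#include <stdio.h>

#define MAX_N 64                      /* assumed maximum sequence length */

/* length of the longest increasing subsequence of S[0..n-1] in O(n^2) time */
int lis_length(const int *S, int n)
{  int L[MAX_N], i, j, best = 0;
   for (i = 0; i < n; i++) {
      L[i] = 1;                       /* a single element is an increasing subsequence */
      for (j = 0; j < i; j++)
         if (S[j] <= S[i] && L[j] + 1 > L[i])
            L[i] = L[j] + 1;          /* extend the best subsequence ending at j */
      if (L[i] > best) best = L[i];
   }
   return best;
}

main()
{  int S[] = {7, 2, 4, 1, 6, 8, 5, 9};
   printf("%d\n", lis_length(S, 8));  /* prints 5, e.g., for {2, 4, 6, 8, 9} */
}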

The longest common increasing subsequence (LCIS) is more general than LIS in
which we search for an LIS of two or more sequences. Formally, given two strings
A = (a1 , . . . , am ) and B = (b1 , . . . , bn ) over an ordered alphabet Σ = {x1 < x2 <
x3 < . . .}, the LCIS of A and B is the increasing common subsequence of A and B of the longest
length. An example LCIS L of two strings A and B is shown below:

A = 3 6 1 4 2 5 7 8 2
B = 2 3 8 4 1 9 5 1 8
P = 3 4 5 8

A variation of this problem is to consider equality such that each element xi of


the LCIS L is less or equal to the next element. The sequence formed in this manner
where each xi ∈ L ≤ xj ∈ L, ∀i < j is called the longest common weakly increasing
subsequence (LCWIS).

5.5 Suffix Trees

A suffix tree introduced by Weiner [11] is a data structure that represents suffixes
of a string. Suffix trees have numerous applications in bioinformatics problems
including exact string matching and pattern matching. Formally, a suffix tree can be
defined as follows:

Definition 5.2 [12] A suffix tree T to represent an n character string S is a rooted


directed tree with n leaves numbered 1 to n. Any intermediate node of this tree has
at least two children and each edge has a label with a nonempty substring of S. For
any leaf i of T , the concatenation of characters from the root to the leaf i shows a
suffix S[i, . . . , n] of S. Edges that connect children of a node in T start with different
characters.

This representation of a suffix tree, however, does not guarantee a valid suffix tree
for any string S. If a suffix of S is also a prefix of another suffix, the path for
that suffix will not end at a leaf. It is therefore general practice to place a
special symbol such as $ at the end of S and hence every suffix of S ends with $ as
shown in Fig. 5.3 which represents a suffix tree for the string S = abbcbabc.
A generalized suffix tree contains suffix trees for two or more strings. In this
case, each leaf number of such a tree is represented by two integers that identify the
string and the starting position of the suffix represented by the leaf in that string as
shown in Fig. 5.4 where two strings S1 = acbab and S2 = bacb over an alphabet

Fig. 5.3 Suffix tree example for the string S = abbcbabc. The leaves are numbered by the indexes
of the suffixes in S

Fig. 5.4 A generalized suffix tree example of two strings, S1 = acbab and S2 = bacb. The leaves are labeled by the string number and the position; the terminating character is "$" for S1 and "#" for S2

The terminating symbols of the two strings must be distinct; here "$" is used for S1 and
"#" for S2 (Fig. 5.4).
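
As a small illustration of this leaf labeling, the sketch below enumerates the suffixes of the two terminated strings together with their (string number, position) pairs; it is our own illustrative code and does not build the tree itself.

    def labeled_suffixes(strings, terminators=("$", "#")):
        """Yield each suffix of each terminated string, labeled by the
        (string number, starting position) pair, both 1-based."""
        for k, s in enumerate(strings):
            t = s + terminators[k]
            for i in range(len(t)):
                yield (k + 1, i + 1), t[i:]

    # The labeled suffixes inserted into the generalized suffix tree of Fig. 5.4
    for label, suffix in labeled_suffixes(["acbab", "bacb"]):
        print(label, suffix)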
For very long strings, the storage requirements of the suffix tree need to be
considered. For a string of length n over an alphabet Σ, the suffix tree T has n leaves
to represent the n suffixes and at most 2n edges. Each character of the alphabet Σ can
be represented by log |Σ| bits, so storing every edge label explicitly would require
O(n² log |Σ|) bits in the worst case, since the total length of the edge labels may be
quadratic in n. Rather than storing each edge e as a sequence of characters, we may
represent it by two integers i and j such that S[i, . . . , j] equals the label of e. This way,
the total storage is reduced to O(n log n) bits, since each integer can be represented by
log n bits [13].
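
A minimal sketch of this index-pair edge representation is given below; the Edge and Node classes are illustrative assumptions, not a data structure prescribed in the text.

    class Edge:
        """An edge whose label is stored implicitly as the substring S[i..j]
        (1-based, inclusive) rather than as an explicit string."""
        def __init__(self, i, j, child):
            self.i, self.j, self.child = i, j, child

        def label(self, S):
            # Materialize the label only when needed, from the two stored indexes
            return S[self.i - 1:self.j]

    class Node:
        """A suffix tree node; outgoing edges are indexed by the first
        character of their label, and leaf_id is set only for leaves."""
        def __init__(self, leaf_id=None):
            self.edges = {}
            self.leaf_id = leaf_id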
In this section, we first introduce the suffix tree data structure and provide algorithms
for its construction. We then review parallel and distributed algorithms for
suffix tree construction. At the end of the section, suffix arrays, which provide a more
compact form of storage, are described.

5.5.1 Construction of Suffix Trees

There are various algorithms to construct the suffix tree T of a string S. Weiner was the first
to provide a linear-time construction algorithm [11], which was later simplified
by McCreight with reduced space requirements [14]. Ukkonen proposed an
online algorithm that provides further space savings over McCreight's algorithm
[15], and Farach gave an algorithm to construct suffix trees for strings over large integer
alphabets [16]. We will describe sequential and parallel algorithms to construct
suffix trees, starting with a naive algorithm and then the algorithm due to Ukkonen.
We will then investigate methods for parallel construction of suffix trees.

5.5.1.1 The Naive Algorithm


The naive method to build a suffix tree of a string S of length n is to first form the
branch for S[1, . . . , n] in the tree and then insert the suffixes S[i, . . . , n] in sequence for
2 ≤ i ≤ n, from the longest to the shortest. Initially, the suffix S[1, . . . , n]$ is inserted and
T1 consists of this single branch. At step i + 1, the tree Ti is traversed starting
at the root r, searching for the longest prefix of S[i + 1, . . . , n]$ that matches a path of Ti
starting from the root. If no such prefix is found, a new leaf numbered i + 1 with edge
label S[i + 1, . . . , n]$ is attached directly to the root. Otherwise, we follow the symbols
along the matching path until a mismatch occurs. If this mismatch falls in the middle of an
edge (u, v), we split (u, v) at that point to form a new branch; if the mismatch occurs right
after a vertex w, the new branch is joined to w, as shown in Algorithm 5.9.

Algorithm 5.9 Naive_ST


1: Input : String S[1..n]
2: Output : Suffix tree TS representing S
3: insert S[1..n]$ in T
4: for i = 2 to n do
5: traverse Ti starting at the root
6: if there is a longest prefix S[i..k] that matches a path in Ti then
7: let S[k + 1..n] be the non-matching remainder
8: (u, v) is the edge on which the match ends
9: if the match ends exactly at a vertex w of Ti then
10: form a new leaf numbered i and an edge joining w to it
11: label edge (w, i) with S[k + 1..n]$
12: else
13: form a new internal node w and a leaf numbered i
14: divide edge (u, v) into edges (u, w) and (w, v) with w placed at the end of the match
15: label edge (w, i) with S[k + 1..n]$
16: end if
17: else
18: form a new leaf i with edge label S[i..n]$ and join it to the root
19: end if
20: end for
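
A compact Python sketch of this naive construction is given below. It follows the spirit of Algorithm 5.9 but is our own illustrative implementation: the names Node, naive_suffix_tree, and leaves, and the dictionary-based edge representation, are assumptions rather than part of the book's pseudocode.

    class Node:
        """Suffix tree node; edges map the first character of an edge label
        to a [label, child] pair."""
        def __init__(self, leaf_id=None):
            self.edges = {}
            self.leaf_id = leaf_id

    def naive_suffix_tree(s, terminator="$"):
        """Build a suffix tree of s by inserting its suffixes one by one,
        longest first, in O(n^2) time; leaf_id is the 1-based suffix start."""
        s = s + terminator
        root = Node()
        for i in range(len(s)):                 # insert the suffix s[i:]
            node, suf = root, s[i:]
            while True:
                c = suf[0]
                if c not in node.edges:         # no matching path: new leaf at node
                    node.edges[c] = [suf, Node(leaf_id=i + 1)]
                    break
                label, child = node.edges[c]
                k = 0                           # length of the common prefix
                while k < len(label) and k < len(suf) and label[k] == suf[k]:
                    k += 1
                if k == len(label):             # edge fully matched: descend
                    node, suf = child, suf[k:]
                else:                           # mismatch inside the edge: split it
                    mid = Node()
                    mid.edges[label[k]] = [label[k:], child]
                    mid.edges[suf[k]] = [suf[k:], Node(leaf_id=i + 1)]
                    node.edges[c] = [label[:k], mid]
                    break
        return root

    def leaves(node, path=""):
        """Yield (leaf number, suffix spelled on the root-to-leaf path)."""
        if node.leaf_id is not None:
            yield node.leaf_id, path
        for label, child in node.edges.values():
            yield from leaves(child, path + label)

    # The tree of Fig. 5.5: each leaf i spells the suffix starting at position i
    print(sorted(leaves(naive_suffix_tree("babcab"))))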

The execution of Algorithm 5.9 is shown in Fig. 5.5 for the string S = babcab.
We start with the longest suffix S[1..n]$ in (a) and connect it to the root r to form T1.
The second suffix, starting at position 2, has no matching path, so a new branch is added
to T1 to form T2, as shown in (b). The suffix S[3..n]$ has its first character in common
with the first suffix, so the first branch is split, as shown in (c). Continuing in this
manner, the whole suffix tree is constructed.
The naive algorithm has a time complexity of O(n²) for a string S of length n
and requires O(n) space. The time needed to insert the ith suffix in the ith iteration of the
algorithm is O(n − i + 1). Therefore, the total time is:

∑_{i=1}^{n} O(n − i + 1) = ∑_{i=1}^{n} O(i) = O(n²)    (5.4)

Fig. 5.5 Construction of a suffix tree for the string S = babcab by the naive algorithm, panels (a)–(f)

5.5.1.2 Ukkonen’s Algorithm


Ukkonen provided a suffix tree construction method that works online and has linear
time complexity [15,17]. The online property means that the algorithm can process
a string from left to right and provides a suffix tree for the