Preface

We are pleased to present the proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI 2002), which took place on September 17-21, 2002 in Rome, Italy. The WABI workshop was part of a three-conference meeting which, in addition to WABI, included ESA 2002 and APPROX 2002. The three conferences are jointly called ALGO 2002, and were hosted by the Faculty of Engineering, University of Rome La Sapienza. See http://www.dis.uniroma1.it/algo02 for more details.
The Workshop on Algorithms in Bioinformatics covers research in all areas of algorithmic work in bioinformatics and computational biology. The emphasis is on discrete algorithms that address important problems in molecular biology, genomics, and genetics, that are founded on sound models, that are computationally efficient, and that have been implemented and tested in simulations and on real datasets. The goal is to present recent research results, including significant work in progress, and to identify and explore directions of future research.
Original research papers (including significant work in progress) or state-of-the-art surveys were solicited on all aspects of algorithms in bioinformatics, including, but not limited to: exact and approximate algorithms for genomics, genetics, sequence analysis, gene and signal recognition, alignment, molecular evolution, phylogenetics, structure determination or prediction, gene expression and gene networks, proteomics, functional genomics, and drug design.
We received 83 submissions in response to our call for papers, and were able to accept about half of them. In addition, WABI hosted two invited, distinguished lectures, given to the entire ALGO 2002 conference, by Dr. Ehud Shapiro of the Weizmann Institute and Dr. Gene Myers of Celera Genomics. An abstract of Dr. Shapiro's lecture, and a full paper detailing Dr. Myers' lecture, are included in these proceedings.
We would like to sincerely thank all the authors of submitted papers, and
the participants of the workshop. We also thank the program committee for
their hard work in reviewing and selecting the papers for the workshop. We were
fortunate to have on the program committee the following distinguished group
of researchers:
Pankaj Agarwal (GlaxoSmithKline Pharmaceuticals, King of Prussia)
Alberto Apostolico (Università di Padova and Purdue University, Lafayette)
Craig Benham (University of California, Davis)
Jean-Michel Claverie (CNRS-AVENTIS, Marseille)
Nir Friedman (Hebrew University, Jerusalem)
Olivier Gascuel (Université de Montpellier II and CNRS, Montpellier)
Misha Gelfand (Integrated Genomics, Moscow)
Raffaele Giancarlo (Università di Palermo)
David Gilbert (University of Glasgow)
Roderic Guigó (Institut Municipal d'Investigacions Mèdiques,
Barcelona, co-chair)
Dan Gusfield (University of California, Davis, co-chair)
Jotun Hein (University of Oxford)
Inge Jonassen (Universitetet i Bergen)
Giuseppe Lancia (Università di Padova)
Bernard M.E. Moret (University of New Mexico, Albuquerque)
Gene Myers (Celera Genomics, Rockville)
Christos Ouzounis (European Bioinformatics Institute, Hinxton Hall)
Lior Pachter (University of California, Berkeley)
Knut Reinert (Celera Genomics, Rockville)
Marie-France Sagot (Université Claude Bernard, Lyon)
David Sankoff (Université de Montréal)
Steve Skiena (State University of New York, Stony Brook)
Gary Stormo (Washington University, St. Louis)
Jens Stoye (Universität Bielefeld)
Martin Tompa (University of Washington, Seattle)
Alfonso Valencia (Centro Nacional de Biotecnología, Madrid)
Martin Vingron (Max-Planck-Institut für Molekulare Genetik, Berlin)
Lusheng Wang (City University of Hong Kong)
Tandy Warnow (University of Texas, Austin)
We also would like to thank the WABI steering committee, Olivier Gascuel, Jotun Hein, Raffaele Giancarlo, Erik Meineche-Schmidt, and Bernard Moret, for inviting us to co-chair this program committee, and for their help in carrying out that task.
We are particularly indebted to Terri Knight of the University of California, Davis, Robert Castelo of the Universitat Pompeu Fabra, Barcelona, and Bernard Moret of the University of New Mexico, Albuquerque, for the extensive technical and advisory help they gave us. We could not have managed the reviewing process and the preparation of the proceedings without their help and advice.
Thanks again to everyone who helped to make WABI 2002 a success. We hope to see everyone again at WABI 2003.

July 2002
Roderic Guigó and Dan Gusfield
Table of Contents
Simultaneous Relevant Feature Identification
and Classification in High-Dimensional Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 1
L.R. Grate (Lawrence Berkeley National Laboratory), C. Bhattacharyya,
M.I. Jordan, and I.S. Mian (University of California Berkeley)
Pooled Genomic Indexing (PGI): Mathematical Analysis
and Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
M. Csűrös (Université de Montréal) and A. Milosavljevic (Human
Genome Sequencing Center)
Practical Algorithms and Fixed-Parameter Tractability
for the Single Individual SNP Haplotyping Problem . . . . . . . . . . . . . . . . . . . . 29
R. Rizzi (Università di Trento), V. Bafna, S. Istrail (Celera Genomics),
and G. Lancia (Università di Padova)
Methods for Inferring Block-Wise Ancestral History
from Haploid Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
R. Schwartz, A.G. Clark (Celera Genomics), and S. Istrail (Celera
Genomics)
Finding Signal Peptides in Human Protein Sequences
Using Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
M. Reczko (Synaptic Ltd.), P. Fiziev, E. Staub (metaGen Pharmaceu-
ticals GmbH), and A. Hatzigeorgiou (University of Pennsylvania)
Generating Peptide Candidates from Amino-Acid Sequence Databases
for Protein Identification via Mass Spectrometry . . . . . . . . . . . . . . . . . . . . . . . 68
N. Edwards and R. Lippert (Celera Genomics)
Improved Approximation Algorithms for NMR Spectral Peak Assignment . 82
Z.-Z. Chen (Tokyo Denki University), T. Jiang (University of Califor-
nia, Riverside), G. Lin (University of Alberta), J. Wen (University of
California, Riverside), D. Xu, and Y. Xu (Oak Ridge National Labo-
ratory)
Efficient Methods for Inferring Tandem Duplication History . . . . . . . . . . . . . 97
L. Zhang (Nat. University of Singapore), B. Ma (University of Western
Ontario), and L. Wang (City University of Hong Kong)
Genome Rearrangement Phylogeny Using Weighbor . . . . . . . . . . . . . . . . . . . . 112
L.-S. Wang (University of Texas at Austin)
Segment Match Refinement and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.L. Halpern (Celera Genomics), D.H. Huson (Tübingen University),
and K. Reinert (Celera Genomics)
Extracting Common Motifs under the Levenshtein Measure:
Theory and Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
E.F. Adebiyi and M. Kaufmann (Universität Tübingen)
Fast Algorithms for Finding Maximum-Density Segments
of a Sequence with Applications to Bioinformatics . . . . . . . . . . . . . . . . . . . . . . 157
M.H. Goldwasser (Loyola University Chicago), M.-Y. Kao (Northwest-
ern University), and H.-I. Lu (Academia Sinica)
FAUST: An Algorithm for Extracting Functionally Relevant Templates
from Protein Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
M. Milik, S. Szalma, and K.A. Olszewski (Accelrys)
Efficient Unbound Docking of Rigid Molecules . . . . . . . . . . . . . . . . . . . . . . . . . 185
D. Duhovny, R. Nussinov, and H.J. Wolfson (Tel Aviv University)
A Method of Consolidating and Combining EST and mRNA Alignments
to a Genome to Enumerate Supported Splice Variants . . . . . . . . . . . . . . . . . . 201
R. Wheeler (Affymetrix)
A Method to Improve the Performance of Translation Start Site Detection
and Its Application for Gene Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
M. Pertea and S.L. Salzberg (The Institute for Genomic Research)
Comparative Methods for Gene Structure Prediction
in Homologous Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
C.N.S. Pedersen and T. Scharling (University of Aarhus)
MultiProt – A Multiple Protein Structural Alignment Algorithm . . . . . . . . . 235
M. Shatsky, R. Nussinov, and H.J. Wolfson (Tel Aviv University)
A Hybrid Scoring Function for Protein Multiple Alignment . . . . . . . . . . . . . . 251
E. Rocke (University of Washington)
Functional Consequences in Metabolic Pathways
from Phylogenetic Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Y. Bilu and M. Linial (Hebrew University)
Finding Founder Sequences from a Set of Recombinants . . . . . . . . . . . . . . . . 277
E. Ukkonen (University of Helsinki)
Estimating the Deviation from a Molecular Clock . . . . . . . . . . . . . . . . . . . . . . 287
L. Nakhleh, U. Roshan (University of Texas at Austin), L. Vawter
(Aventis Pharmaceuticals), and T. Warnow (University of Texas at
Austin)
Exploring the Set of All Minimal Sequences of Reversals – An Application
to Test the Replication-Directed Reversal Hypothesis . . . . . . . . . . . . . . . . . . . 300
Y. Ajana, J.-F. Lefebvre (Université de Montréal), E.R.M. Tillier (Uni-
versity Health Network), and N. El-Mabrouk (Université de Montréal)
Approximating the Expected Number of Inversions
Given the Number of Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
N. Eriksen (Royal Institute of Technology)
Invited Lecture – Accelerating Smith-Waterman Searches . . . . . . . . . . . . . . . 331
G. Myers (Celera Genomics) and R. Durbin (Sanger Centre)
Sequence-Length Requirements for Phylogenetic Methods . . . . . . . . . . . . . . . 343
B.M.E. Moret (University of New Mexico), U. Roshan, and T. Warnow
(University of Texas at Austin)
Fast and Accurate Phylogeny Reconstruction Algorithms
Based on the Minimum-Evolution Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
R. Desper (National Library of Medicine, NIH) and O. Gascuel (LIRMM)
NeighborNet: An Agglomerative Method for the Construction
of Planar Phylogenetic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
D. Bryant (McGill University) and V. Moulton (Uppsala University)
On the Control of Hybridization Noise
in DNA Sequencing-by-Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
H.-W. Leong (National University of Singapore), F.P. Preparata (Brown
University), W.-K. Sung, and H. Willy (National University of Singa-
pore)
Restricting SBH Ambiguity via Restriction Enzymes . . . . . . . . . . . . . . . . . . . 404
S. Skiena (SUNY Stony Brook) and S. Snir (Technion)
Invited Lecture – Molecule as Computation:
Towards an Abstraction of Biomolecular Systems . . . . . . . . . . . . . . . . . . . . . . 418
E. Shapiro (Weizmann Institute)
Fast Optimal Genome Tiling with Applications to Microarray Design
and Homology Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
P. Berman (Pennsylvania State University), P. Bertone (Yale Univer-
sity), B. DasGupta (University of Illinois at Chicago), M. Gerstein
(Yale University), M.-Y. Kao (Northwestern University), and M. Sny-
der (Yale University)
Rapid Large-Scale Oligonucleotide Selection for Microarrays . . . . . . . . . . . . . 434
S. Rahmann (Max-Planck-Institute for Molecular Genetics)
Border Length Minimization in DNA Array Design . . . . . . . . . . . . . . . . . . . . . 435
A.B. Kahng, I.I. Măndoiu, P.A. Pevzner, S. Reda (University of Cali-
fornia at San Diego), and A.Z. Zelikovsky (Georgia State University)
The Enhanced Suffix Array and Its Applications to Genome Analysis . . . . . 449
M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch (University of Bielefeld)
The Algorithmic of Gene Teams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
A. Bergeron (Université du Québec à Montréal), S. Corteel (CNRS -
Université de Versailles), and M. Raffinot (CNRS - Laboratoire Génome
et Informatique)
Combinatorial Use of Short Probes
for Differential Gene Expression Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
L.L. Warren and B.H. Liu (North Carolina State University)
Designing Specific Oligonucleotide Probes
for the Entire S. cerevisiae Transcriptome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
D. Lipson (Technion), P. Webb, and Z. Yakhini (Agilent Laboratories)
K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data . . 506
Z. Bar-Joseph, E.D. Demaine, D.K. Gifford (MIT LCS), A.M. Hamel
(Wilfrid Laurier University), T.S. Jaakkola (MIT AI Lab), and N. Sre-
bro (MIT LCS)
Inversion Medians Outperform Breakpoint Medians
in Phylogeny Reconstruction from Gene-Order Data . . . . . . . . . . . . . . . . . . . . 521
B.M.E. Moret (University of New Mexico), A.C. Siepel (University of
California at Santa Cruz), J. Tang, and T. Liu (University of New
Mexico)
Modified Mincut Supertrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
R.D.M. Page (University of Glasgow)
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Simultaneous Relevant Feature Identification
and Classification in High-Dimensional Spaces

L.R. Grate¹, C. Bhattacharyya²,³, M.I. Jordan²,³, and I.S. Mian¹

¹ Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley CA 94720
² Department of EECS, University of California Berkeley, Berkeley CA 94720
³ Department of Statistics, University of California Berkeley, Berkeley CA 94720
Abstract. Molecular profiling technologies monitor thousands of transcripts, proteins, metabolites or other species concurrently in biological samples of interest. Given two-class, high-dimensional profiling data, nominal Liknon [4] is a specific implementation of a methodology for performing simultaneous relevant feature identification and classification. It exploits the well-known property that minimizing an l1 norm (via linear programming) yields a sparse hyperplane [15,26,2,8,17]. This work (i) examines computational, software and practical issues required to realize nominal Liknon, (ii) summarizes results from its application to five real-world data sets, (iii) outlines heuristic solutions to problems posed by domain experts when interpreting the results, and (iv) defines some future directions of the research.
1 Introduction

Biologists and clinicians are adopting high-throughput genomics, proteomics and related technologies to assist in interrogating normal and perturbed systems such as unaffected and tumor tissue specimens. Such investigations can generate data having the form T = {(x_n, y_n)}, n ∈ {1, ..., N}, where x_n ∈ R^P and, for two-class data, y_n ∈ {+1, −1}. Each element of a data point x_n is the absolute or relative abundance of a molecular species monitored. In transcript profiling, a data point represents transcript (gene) levels measured in a sample using cDNA, oligonucleotide or similar microarray technology. A data point from protein profiling can represent Mass/Charge (M/Z) values for low-molecular-weight molecules (proteins) measured in a sample using mass spectroscopy.
In cancer biology, profiling studies of different types of (tissue) specimens are motivated largely by a desire to create clinical decision support systems for accurate tumor classification and to identify robust and reliable targets, biomarkers, for imaging, diagnosis, prognosis and therapeutic intervention [14,3,13,27,18,23,9,25,28,19,21,24]. Meeting these biological challenges includes addressing the general statistical problems of classification and prediction, and relevant feature identification.
Support Vector Machines (SVMs) [30,8] have been employed successfully for cancer classification based on transcript profiles [5,22,25,28]. Although mechanisms for reducing the number of features to more manageable numbers include discarding those below a user-defined threshold, relevant feature identification is usually addressed via a filter-wrapper strategy [12,22,32]. The filter generates candidate feature subsets whilst the wrapper runs an induction algorithm to determine the discriminative ability of a subset. Although SVMs and the newly formulated Minimax Probability Machine (MPM) [20] are good wrappers [4], the choice of filtering statistic remains an open question.
Nominal Liknon is a specific implementation of a strategy for performing simultaneous relevant feature identification and classification [4]. It exploits the well-known property that minimizing an l1 norm (via linear programming) yields a sparse hyperplane [15,26,2,8,17]. The hyperplane constitutes the classifier whilst its sparsity, a weight vector with few non-zero elements, defines a small number of relevant features. Nominal Liknon is computationally less demanding than the prevailing filter(SVM/MPM) wrapper strategy, which treats the problems of feature selection and classification as two independent tasks [4,16]. Biologically, nominal Liknon performs well when applied to real-world data generated not only by the ubiquitous transcript profiling technology, but also by the emergent protein profiling technology.
2 Simultaneous Relevant Feature Identification and Classification

Consider a data set T = {(x_n, y_n)}, n ∈ {1, ..., N}. Each of the N data points (profiling experiments) is a P-dimensional vector of features (gene or protein abundances), x_n ∈ R^P (usually N ~ 10^1–10^2; P ~ 10^3–10^4). A data point n is assigned to one of two classes y_n ∈ {+1, −1}, such as a normal or a tumor tissue sample. Given such two-class high-dimensional data, the analytical goal is to estimate a sparse classifier, a model which distinguishes the two classes of data points (classification) and specifies a small subset of discriminatory features (relevant feature identification). Assume that the data T can be separated by a linear hyperplane in the P-dimensional input feature space. The learning task can be formulated as an attempt to estimate a hyperplane, parameterized in terms of a weight vector w and bias b, via a solution to the following N inequalities [30]:

    y_n z_n = y_n (w^T x_n − b) ≥ 0 ,   n = 1, ..., N .   (1)
The hyperplane satisfying w^T x − b = 0 is termed a classifier. A new data point x (abundances of P features in a new sample) is classified by computing z = w^T x − b. If z > 0, the data point is assigned to one class; otherwise it belongs to the other class.
Enumerating relevant features at the same time as discovering a classifier can be addressed by finding a sparse hyperplane, a weight vector w in which most components are equal to zero. The rationale is that zero elements do not contribute to determining the value of z:

    z = Σ_{p=1}^{P} w_p x_p − b .

If w_p = 0, feature p is irrelevant with regard to deciding the class. Since only non-zero elements w_p ≠ 0 influence the value of z, they can be regarded as relevant features.
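As a minimal aside, the decision rule just described is a one-liner; this sketch assumes NumPy, and the helper name liknon_predict is ours, not from the paper.

    import numpy as np

    def liknon_predict(X, w, b):
        """Assign each row x of X to a class by the sign of z = w^T x - b."""
        return np.where(X @ w - b > 0, 1, -1)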
The task of defining a small number of relevant features can be equated with that of finding a small set of non-zero elements. This can be formulated as an optimization problem, namely that of minimizing the l0 norm ||w||_0, where ||w||_0 = |{p : w_p ≠ 0}|, the number of non-zero elements of w. Thus we obtain:

    min_{w,b} ||w||_0   subject to   y_n (w^T x_n − b) ≥ 0 ,   n = 1, ..., N .   (2)

Unfortunately, problem (2) is NP-hard [10]. A tractable, convex approximation to this problem can be obtained by replacing the l0 norm with the l1 norm ||w||_1, where ||w||_1 = Σ_{p=1}^{P} |w_p|, the sum of the absolute magnitudes of the elements of a vector [10]:

    min_{w,b} ||w||_1 = Σ_{p=1}^{P} |w_p|   subject to   y_n (w^T x_n − b) ≥ 0 ,   n = 1, ..., N .   (3)
A solution to (3) yields the desired sparse weight vector w.
Optimization problem (3) can be solved via linear programming [11]. The ensuing formulation requires the imposition of constraints on the allowed ranges of variables. The introduction of new variables u, v ∈ R^P such that |w_p| = u_p + v_p and w_p = u_p − v_p ensures non-negativity. The range of w_p = u_p − v_p is unconstrained (positive or negative) whilst u_p and v_p remain non-negative; u_p and v_p are designated the positive and negative parts respectively. Similarly, the bias b is split into positive and negative components, b = b+ − b−. Given a solution to problem (3), either u_p or v_p will be non-zero for feature p [11]:

    min_{u,v,b+,b−} Σ_{p=1}^{P} (u_p + v_p)
    subject to   y_n ((u − v)^T x_n − (b+ − b−)) ≥ 1 ,
                 u_p ≥ 0;  v_p ≥ 0;  b+ ≥ 0;  b− ≥ 0 ,
                 n = 1, ..., N;  p = 1, ..., P .   (4)

A detailed description of the origins of the "≥ 1" constraint can be found elsewhere [30].
If the data T are not linearly separable, misclassifications (errors in the class labels y_n) can be accounted for by the introduction of slack variables ξ_n. Problem (4) can be recast, yielding the final optimization problem:
    min_{u,v,b+,b−,ξ} Σ_{p=1}^{P} (u_p + v_p) + C Σ_{n=1}^{N} ξ_n
    subject to   y_n ((u − v)^T x_n − (b+ − b−)) ≥ 1 − ξ_n ,
                 u_p ≥ 0;  v_p ≥ 0;  b+ ≥ 0;  b− ≥ 0;  ξ_n ≥ 0 ,
                 n = 1, ..., N;  p = 1, ..., P .   (5)

C is an adjustable parameter weighing the contribution of misclassified data points. Larger values lead to fewer misclassifications being ignored: C = 0 corresponds to all misclassifications being ignored, whereas C → ∞ leads to the hard-margin limit.
3 Computational, Software and Practical Issues

Learning the sparse classifier defined by optimization problem (5) involves minimizing a linear function subject to linear constraints. Efficient algorithms for solving such linear programming problems involving 10,000 variables (N) and 10,000 constraints (P) are well known. Standalone open-source codes include lp_solve (http://www.netlib.org/ampl/solvers/lpsolve/) and PCx (http://www-fp.mcs.anl.gov/otc/Tools/PCx/).
Nominal Liknon is an implementation of the sparse classifier (5). It incorporates routines written in Matlab (http://www.mathworks.com) and a system utilizing perl (http://www.perl.org/) and lp_solve.
The code is available from the authors upon request. The input consists of a file containing an N × (P + 1) data matrix in which each row represents a single profiling experiment. The first P columns are the feature values, abundances of molecular species, whilst column P + 1 is the class label y_n ∈ {+1, −1}. The output comprises the non-zero values of the weight vector w (relevant features), the bias b and the number of non-zero slack variables ξ_n.
The adjustable parameter C in problem (5) can be set using cross validation
techniques. The results described here were obtained by choosing C = 0.5 or
C = 1.
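As a minimal sketch of how problem (5) maps onto an off-the-shelf LP solver, the snippet below uses Python with NumPy and SciPy's linprog rather than the authors' Matlab/perl/lp_solve system; the function name liknon_fit and the variable layout are our own choices.

    import numpy as np
    from scipy.optimize import linprog

    def liknon_fit(X, y, C=1.0):
        """Sparse linear classifier by solving the LP of problem (5).

        X: (N, P) data matrix; y: length-N array of +/-1 labels.
        Returns the (mostly zero) weight vector w and the bias b.
        """
        N, P = X.shape
        # Variable layout: [u (P), v (P), b_plus, b_minus, xi (N)], all >= 0.
        c = np.concatenate([np.ones(2 * P), [0.0, 0.0], C * np.ones(N)])
        # Each row encodes -y_n((u - v)^T x_n - (b+ - b-)) - xi_n <= -1,
        # i.e. the margin constraint of (5) rearranged into A_ub z <= b_ub.
        Yx = y[:, None] * X
        A_ub = np.hstack([-Yx, Yx, y[:, None], -y[:, None], -np.eye(N)])
        b_ub = -np.ones(N)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        u, v = res.x[:P], res.x[P:2 * P]
        return u - v, res.x[2 * P] - res.x[2 * P + 1]

The constraint matrix here is dense; for profiling-sized problems a serious implementation would exploit sparsity, as lp_solve and PCx do.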
4 Application of Nominal Liknon to Real World Data

Nominal Liknon was applied to five data sets in the size range (N = 19, P = 1,987) to (N = 200, P = 15,154). A data set T yielded a sparse classifier, w and b, and a specification of the l relevant features (P ≫ l). Since the profiling studies produced only a small number of data points (N ≪ P), the generalization error of a nominal Liknon classifier was determined by computing the leave-one-out error for l-dimensional data points. A classifier trained using N − 1 data points was used to predict the class of the withheld data point; the procedure was repeated N times. The results are shown in Table 1.
Table 1. Summary of published and unpublished investigations using nominal Liknon [4,16].

Transcript profiles: Sporadic breast carcinoma tissue samples [29];
    inkjet microarrays; relative transcript levels;
    http://www.rii.com/publications/vantveer.htm
  Two-class data: 46 patients with distant metastases < 5 years;
    51 patients with no distant metastases ≥ 5 years
  Relevant features: 72 out of P = 5,192
  Leave-one-out error: 1 out of N = 97

Transcript profiles: Tumor tissue samples [1];
    custom cDNA microarrays; relative transcript levels;
    http://www.nhgri.nih.gov/DIR/Microarray/selected_publications.html
  Two-class data: 13 KIT-mutation-positive gastrointestinal stromal tumors;
    6 spindle cell tumors from locations outside the gastrointestinal tract
  Relevant features: 6 out of P = 1,987
  Leave-one-out error: 0 out of N = 19

Transcript profiles: Small round blue cell tumor samples (EWS, RMS, NHL, NB) [19];
    custom cDNA microarrays; relative transcript levels;
    http://www.nhgri.nih.gov/DIR/Microarray/Supplement
  Two-class data: 46 EWS/RMS tumor biopsies;
    38 EWS/RMS/NHL/NB cell lines
  Relevant features: 23 out of P = 2,308
  Leave-one-out error: 0 out of N = 84

Transcript profiles: Prostate tissue samples [31];
    Affymetrix arrays; absolute transcript levels;
    http://carrier.gnf.org/welsh/prostate
  Two-class data: 9 normal;
    25 malignant
  Relevant features: 7 out of P = 12,626
  Leave-one-out error: 0 out of N = 34

Protein profiles: Serum samples [24];
    SELDI-TOF mass spectrometry; M/Z values (spectral amplitudes);
    http://clinicalproteomics.steem.com
  Two-class data: 100 unaffected;
    100 ovarian cancer
  Relevant features: 51 out of P = 15,154
  Leave-one-out error: 3 out of N = 200
Nominal Liknon performs well in terms of simultaneous relevant feature identification and classification. In all five transcript and protein profiling data sets a hyperplane was found, the weight vector was sparse (< 100, or < 2%, non-zero components), and the relevant features were of interest to domain experts (they generated novel biological hypotheses amenable to subsequent experimental or clinical validation). For the protein profiles, better results were obtained using normalized as opposed to raw values: when employed to predict the class of 16 independent non-cancer samples, the 51 relevant features had a test error of 0 out of 16.
On a powerful desktop computer (a > 1 GHz Intel-like machine), the time required to create a sparse classifier varied from 2 seconds to 20 minutes. For the larger problems, the main memory (RAM) requirement exceeded 500 MBytes.
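A minimal sketch of this leave-one-out protocol, reusing the hypothetical liknon_fit helper sketched at the end of Section 3 (as in the text, the l relevant features are selected once on the full data set, then N refits score the withheld points):

    import numpy as np

    def leave_one_out_error(X, y, C=0.5):
        """Leave-one-out error over the l-dimensional relevant features."""
        w, _ = liknon_fit(X, y, C)            # hypothetical helper, see above
        keep = np.flatnonzero(w)              # indices p with w_p != 0
        errors = 0
        for n in range(len(y)):
            mask = np.arange(len(y)) != n     # withhold data point n
            w_n, b_n = liknon_fit(X[np.ix_(mask, keep)], y[mask], C)
            z = X[n, keep] @ w_n - b_n
            errors += int((1 if z > 0 else -1) != y[n])
        return errors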
5 Heuristic Solutions to Problems Posed by Domain Experts

Domain experts wish to postprocess nominal Liknon results to assist in the design of subsequent experiments aimed at validating, verifying and extending any biological predictions. In lieu of a theoretically sound statistical framework, heuristics have been developed to prioritize, reduce or increase the number of relevant features.
In order to prioritize features, assume that all P features are on the same scale. The l relevant features can then be ranked according to the magnitude and/or sign of the non-zero elements of the weight vector w (w_p ≠ 0). To reduce the number of relevant features to a smaller, most interesting set, a histogram of the w_p ≠ 0 values can be used to determine a threshold for pruning the set. To increase the number of features to a larger, more interesting set, nominal Liknon can be run in an iterative manner. The l relevant features identified in one pass through the data are removed from the data points used as input for the next pass. Each successive round generates a new set of relevant features. The procedure is terminated either by the domain expert or by monitoring the leave-one-out error of the classifier associated with each set of relevant features.
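A sketch of this iterative variant, again built on the hypothetical liknon_fit helper (the stopping rule via leave-one-out monitoring is left to the caller):

    import numpy as np

    def iterative_liknon(X, y, C=0.5, max_rounds=5):
        """Repeatedly fit, record the relevant features, and remove them."""
        remaining = np.arange(X.shape[1])      # feature indices still in play
        rounds = []
        for _ in range(max_rounds):
            w, _ = liknon_fit(X[:, remaining], y, C)   # hypothetical helper
            relevant = remaining[np.flatnonzero(w)]
            if relevant.size == 0:
                break
            rounds.append(relevant)
            remaining = np.setdiff1d(remaining, relevant)
        return rounds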
Preliminary results from analysis of the gastrointestinal stromal tumor/spindle cell tumor transcript profiling data set indicate that these extensions are likely to be of utility to domain experts. The leave-one-out error of the relevant features identified by five iterations of nominal Liknon was at most one. The details are: iteration 0 (number of relevant features = 6, leave-one-out error = 0), iteration 1 (5, 0), iteration 2 (5, 1), iteration 3 (9, 0), iteration 4 (13, 1), iteration 5 (11, 1).
Iterative Liknon may prove useful during explorations of the (qualitative) association between relevant features and their behavior in the N data points. The gastrointestinal stromal tumor/spindle cell tumor transcript profiling data set has been the subject of probabilistic clustering [16]. A finite Gaussian mixture model, as implemented by the program AutoClass [6], was estimated from the P = 1,987 unlabeled data points of dimension N = 19. The trained model was used to assign each feature (gene) to one of the resultant clusters. Five iterations of nominal Liknon identified the majority of genes assigned to a small number of discriminative clusters. Furthermore, these genes constituted most of the important distinguishing genes defined by the original authors [1].
6 Discussion

Nominal Liknon implements a mathematical technique for finding a sparse hyperplane. When applied to two-class high-dimensional real-world molecular profiling data, it identifies a small number of relevant features and creates a classifier that generalizes well. As discussed elsewhere [4,7], many subsets of relevant features are likely to exist. Although nominal Liknon specifies but one set of discriminatory features, this "low-hanging fruit" approach does suggest genes of interest to experimentalists. Iterating the procedure provides a rapid mechanism for highlighting additional sets of relevant features that yield good classifiers. Since nominal Liknon is a single-pass method, one disadvantage is that the learned parameters cannot be adjusted (improved) as would be possible with a more typical train/test methodology.
7 Future Directions

Computational biology and chemistry are generating high-dimensional data, so sparse solutions for classification and regression problems are of widespread importance. A general-purpose toolbox containing specific implementations of particular statistical techniques would be of considerable practical utility. Future plans include developing a suite of software modules to aid in performing tasks such as the following. A. Create high-dimensional input data: (i) direct generation by high-throughput experimental technologies; (ii) systematic formulation and extraction of large numbers of features from data that may be in the form of strings, images, and so on (a priori, features relevant for one problem may be irrelevant for another). B. Enunciate sparse solutions for classification and regression problems in high dimensions. C. Construct and assess models: (i) learn a variety of models by a grid search through the space of adjustable parameters; (ii) evaluate the generalization error of each model. D. Combine the best models to create a final decision function. E. Propose hypotheses for domain experts.
Acknowledgements

This work was supported by NSF grant IIS-9988642, the Director, Office of Energy Research, Office of Health and Environmental Research, Division of the U.S. Department of Energy under Contract No. DE-AC03-76F00098, and an LBNL/LDRD through U.S. Department of Energy Contract No. DE-AC03-76SF00098.
References

1. S.V. Allander, N.N. Nupponen, M. Ringner, G. Hostetter, G.W. Maher, N. Goldberger, Y. Chen, J. Carpten, A.G. Elkahloun, and P.S. Meltzer. Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Research, 61:8624–8628, 2001.
2. K. Bennett and A. Demiriz. Semi-supervised support vector machines. In Neural and Information Processing Systems, volume 11. MIT Press, Cambridge MA, 1999.
3. A. Bhattacharjee, W.G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E.J. Mark, E.S. Lander, W. Wong, B.E. Johnson, T.R. Golub, D.J. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci., 98:13790–13795, 2001.
4. C. Bhattacharyya, L.R. Grate, A. Rizki, D.C. Radisky, F.J. Molina, M.I. Jordan, M.J. Bissell, and I.S. Mian. Simultaneous relevant feature identification and classification in high-dimensional spaces: application to molecular profiling data. Submitted, Signal Processing, 2002.
5. M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, Jr, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci., 97:262–267, 2000.
6. P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI Press/MIT Press, 1995. The software is available at the URL http://www.gnu.org/directory/autoclass.html.
7. M.L. Chow, E.J. Moler, and I.S. Mian. Identifying marker genes in transcription profile data using a mixture of feature relevance experts. Physiological Genomics, 5:99–111, 2001.
8. N. Cristianini and J. Shawe-Taylor. Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge, England, 2000.
9. S.M. Dhanasekaran, T.R. Barrette, R. Ghosh, D. Shah, S. Varambally, K. Kurachi, K.J. Pienta, M.J. Rubin, and A.M. Chinnaiyan. Delineation of prognostic biomarkers in prostate cancer. Nature, 432, 2001.
10. D.L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. Technical Report, Statistics Department, Stanford University, 1999.
11. R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2000.
12. T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–914, 2000.
13. M.E. Garber, O.G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. Pacyana-Gengelbach, M. van de Rijn, G.D. Rosen, C.M. Perou, R.I. Whyte, R.B. Altman, P.O. Brown, D. Botstein, and I. Petersen. Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl. Acad. Sci., 98:13784–13789, 2001.
14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999. The data are available at the URL waldo.wi.mit.edu/MPR/data_sets.html.
15. T. Graepel, R. Herbrich, B. Schölkopf, A.J. Smola, P. Bartlett, K. Müller, K. Obermayer, and R.C. Williamson. Classification on proximity data with lp-machines. In Ninth International Conference on Artificial Neural Networks, volume 470, pages 304–309. IEE, London, 1999.
16. L.R. Grate, C. Bhattacharyya, M.I. Jordan, and I.S. Mian. Integrated analysis of transcript profiling and protein sequence data. In press, Mechanisms of Ageing and Development, 2002.
17. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2000.
18. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O.-P. Kallioniemi, A. Borg, and J. Trent. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344:539–548, 2001.
19. J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and P.S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7:673–679, 2001.
20. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M.I. Jordan. Minimax probability machine. Advances in Neural Information Processing Systems, 14, 2001.
21. L.A. Liotta, E.C. Kohn, and E.F. Petricoin. Clinical proteomics: personalized molecular medicine. JAMA, 14:2211–2214, 2001.
22. E.J. Moler, M.L. Chow, and I.S. Mian. Analysis of molecular profile data using generative and discriminative methods. Physiological Genomics, 4:109–126, 2000.
23. D.A. Notterman, U. Alon, A.J. Sierk, and A.J. Levine. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research, 61:3124–3130, 2001.
24. E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, and L.A. Liotta. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359:572–577, 2002.
25. S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98:15149–15154, 2001. The data are available from http://www-genome.wi.mit.edu/mpr/GCM.html.
26. A. Smola, T.T. Friess, and B. Schölkopf. Semiparametric support vector and linear programming machines. In Neural and Information Processing Systems, volume 11. MIT Press, Cambridge MA, 1999.
27. T. Sorlie, C.M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, T. Thorsen, H. Quist, J.C. Matese, P.O. Brown, D. Botstein, P.E. Lonning, and A.-L. Borresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci., 98:10869–10874, 2001.
28. A.I. Su, J.B. Welsh, L.M. Sapinoso, S.G. Kern, P. Dimitrov, H. Lapp, P.G. Schultz, S.M. Powell, C.A. Moskaluk, H.F. Frierson Jr, and G.M. Hampton. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61:7388–7393, 2001.
29. L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
30. V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
31. J.B. Welsh, L.M. Sapinoso, A.I. Su, S.G. Kern, J. Wang-Rodriguez, C.A. Moskaluk, J.F. Frierson Jr, and G.M. Hampton. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research, 61:5974–5978, 2001.
32. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs. In Advances in Neural Information Processing Systems, volume 13, 2000.
Pooled Genomic Indexing (PGI):
Mathematical Analysis and Experiment Design

Miklós Csűrös¹,²,³ and Aleksandar Milosavljevic²,³

¹ Département d'informatique et de recherche opérationnelle, Université de Montréal,
  CP 6128 succ. Centre-Ville, Montréal, Québec H3C 3J7, Canada
  [email protected]
² Human Genome Sequencing Center, Department of Molecular and Human Genetics,
  Baylor College of Medicine
³ Bioinformatics Research Laboratory, Department of Molecular and Human Genetics,
  Baylor College of Medicine, Houston, Texas 77030, USA
  [email protected]
Abstract. Pooled Genomic Indexing (PGI) is a novel method for physical mapping of clones onto known macromolecular sequences. PGI is carried out by pooling arrayed clones, generating shotgun sequence reads from pools, and comparing the reads against a reference sequence. If two reads from two different pools match the reference sequence at a close distance, they are both assigned (deconvoluted) to the clone at the intersection of the two pools and the clone is mapped onto the region of the reference sequence between the two matches. A probabilistic model for PGI is developed, and several pooling schemes are designed and analyzed. The probabilistic model and the pooling schemes are validated in simulated experiments where 625 rat BAC clones and 207 mouse BAC clones are mapped onto homologous human sequence.
1 Introduction

Pooled Genomic Indexing (PGI) is a novel method for physical mapping of clones onto known macromolecular sequences. PGI enables targeted comparative sequencing of homologous regions for the purpose of discovery of genes, gene structure, and conserved regulatory regions through comparative sequence analysis. An application of the basic PGI method to BAC¹ clone mapping is illustrated in Figure 1. PGI first pools arrayed BAC clones, then shotgun sequences the pools at an appropriate coverage, and uses this information to map individual BACs onto homologous sequences of a related organism. Specifically, shotgun reads from the pools provide a set of short (ca. 500 base pairs long) random subsequences of the unknown clone sequences (100–200 thousand base pairs long). The reads are then individually compared to reference sequences, using standard sequence alignment techniques [1], to find homologies. In a clone-by-clone sequencing strategy [2], the shotgun reads are collected for each clone separately. Because of the pooling in PGI, the individual shotgun reads are not associated with the clones, but detected homologies can be in certain cases. If two reads from two different pools match the reference sequence at a close distance, they are both assigned (deconvoluted) to the clone at the intersection of the two pools. Simultaneously, the clone is mapped onto the region of the reference sequence between the two matches. Subsequently, known genomic or transcribed reference sequences are turned into an index into the yet-to-be-sequenced homologous clones across species. As we will see below, this basic pooling scheme is somewhat modified in practice in order to achieve correct and unambiguous mapping.

¹ Bacterial Artificial Chromosome
PGI constructs comparative BAC-based physical maps at a fraction (on the order of 1%) of the cost of full genome sequencing. PGI requires only minor changes in the BAC-based sequencing pipeline already established in sequencing laboratories, and thus it takes full advantage of existing economies of scale. The key to the economy of PGI is BAC pooling, which reduces the number of BAC and shotgun library preparations down to the order of the square root of the number of BAC clones. The depth of shotgun sequencing of the pools is adjusted to fit the evolutionary distance of the comparatively mapped organisms. Shotgun sequencing, which represents the bulk of the effort involved in a PGI project, provides useful information irrespective of the pooling scheme. In other words, pooling by itself does not represent a significant overhead, and yet produces a comprehensive and accurate comparative physical map.
Our proposal of PGI is motivated by recent advances in sequencing technology [3] that allow shotgun sequencing of BAC pools. The Clone-Array Pooled Shotgun Sequencing (CAPSS) method, described by [3], relies on clone-array pooling and shotgun sequencing of the pools. CAPSS detects overlaps between shotgun sequence reads and assembles the overlapping reads into sequence contigs. PGI offers a different use for the shotgun read information obtained in the same laboratory process. PGI compares the reads against another sequence, typically the genomic sequence of a related species, for the purpose of comparative physical mapping. CAPSS does not use a reference sequence to deconvolute the pools. Instead, CAPSS deconvolutes by detecting overlaps between reads: a column-pool read and a row-pool read that significantly overlap are deconvoluted to the BAC at the intersection of the row and the column. Despite the clear distinction between PGI and CAPSS, the methods are compatible and, in fact, can be used simultaneously on the same data set. Moreover, the advanced pooling schemes that we present here in the context of PGI are also applicable to and increase the performance of CAPSS, indicating that improvements of one method are potentially applicable to the other.
In what follows, we propose a probabilistic model for the PGI method. We then discuss and analyze different pooling schemes, and propose algorithms for experiment design. Finally, we validate the method in two simulated PGI experiments, involving 207 mouse and 625 rat BACs.
[Figure 1: a clone array with pooled rows (1) and columns (2), a row fragment (3) and a column fragment (4) matching the reference sequence (5) within a distance (6) of 200 kbp.]
Fig. 1. The Pooled Genomic Indexing method maps arrayed clones of one species onto
genomic sequence of another (5). Rows (1) and columns (2) are pooled and shotgun
sequenced. If one row and one column fragment match the reference sequence (3 and 4
respectively) within a short distance (6), the two fragments are assigned (deconvoluted)
to the clone at the intersection of the row and the column. The clone is simultaneously
mapped onto the region between the matches, and the reference sequence is said to
index the clone.
2 Probability of Successful Indexing

In order to study the efficiency of the PGI strategy formally, define the following values. Let N be the total number of clones on the array, and let m be the number of clones within a pool. For simplicity's sake, assume that every pool has the same number of clones, that clones within a pool are represented uniformly, and that every clone has the same length L. Let F be the total number of random shotgun reads, and let ℓ be the expected length of a read. The shotgun coverage c is defined by c = Fℓ/(NL).
Since reads are randomly distributed along the clones, it is not certain that homologies between reference sequences and clones are detected. However, with larger shotgun coverage, this probability increases rapidly. Consider the particular case of detecting homology between a given reference sequence and a clone. A random fragment of length ℓ from this clone is aligned locally to the reference sequence, and if a significant alignment is found, the homology is detected. Such an alignment is called a hit. Let M(ℓ) be the number of positions at which a fragment of length ℓ can begin and produce a significant alignment. The probability of a hit for a fixed length ℓ equals M(ℓ) divided by the total number of possible start positions for the fragment, (L − ℓ + 1).
When L ≫ ℓ, the expected probability of a random read aligning to the reference sequence equals

    p_hit = E[ M(ℓ) / (L − ℓ + 1) ] ≈ M / L ,   (1)
where M is a shorthand notation for EM(ℓ). The value M measures the homology between the clone and the reference sequence. We call this value the effective length of the (possibly undetected) index between the clone and the reference sequence in question. For typical definitions of significant alignment, such as an identical region of a certain minimal length, M(ℓ) is a linear function of ℓ, and thus EM(ℓ) = M(Eℓ). Example 1: let the reference sequence be a subsequence of length h of the clone, and define a hit as identity of at least o base pairs. Then M = h + ℓ − 2o. Example 2: let the reference sequence be the transcribed sequence of total length g of a gene on the genomic clone, consisting of e exons separated by long (≫ ℓ) introns, and define a hit as identity of at least o base pairs. Then M = g + e(ℓ − 2o).
Assuming uniform coverage and an m × m array, the number of reads coming from a fixed pool equals cmL/(2ℓ). If there is an index between a reference sequence and a clone in the pool, then the number of hits in the pool is distributed binomially with expected value (cmL/(2ℓ)) (p_hit/m) = cM/(2ℓ). Propositions 1, 2, and 3 rely on the properties of this distribution, using approximation techniques pioneered by [4] in the context of physical mapping.
Proposition 1. Consider an index with effective length M between a clone and a reference sequence. The probability that the index is detected equals approximately

    p_M ≈ ( 1 − e^{−cM/(2ℓ)} )² .

(Proof in Appendix.)
Thus, by Proposition 1, the probability of false negatives decreases exponentially with the shotgun coverage level. The expected number of hits for a detected index can be calculated similarly.

Proposition 2. The number of hits for a detected index of effective length M equals approximately

    E_M ≈ (cM/ℓ) ( 1 − e^{−cM/(2ℓ)} )^{−1} .

(Proof in Appendix.)
3 Pooling Designs

3.1 Ambiguous Indexes

The success of indexing in the PGI method depends on the possibility of deconvoluting the local alignments. In the simplest case, homology between a clone and a reference sequence is recognized by finding alignments with fragments from one row and one column pool. It may happen, however, that more than one clone is homologous to the same region in a reference sequence (and therefore the clones are homologous to each other). This is the case if the clones overlap, contain similar genes, or contain similar repeat sequences. Consequently, close alignments may be found between a reference sequence and fragments from more than two pools. If, for example, two rows and one column align to the same reference sequence, an index can be created to the two clones at the intersections simultaneously. However, alignments from two rows and two columns cannot be deconvoluted conclusively, as illustrated by Figure 2.

[Figure 2: a 2 x 2 sub-array with rows R1, R2, columns C1, C2, and clones B11, B12, B21, B22 at the intersections.]

Fig. 2. Ambiguity caused by overlap or homology between clones. If clones B11, B12, B21, and B22 are at the intersections of rows R1, R2 and columns C1, C2 as shown, then alignments from the pools for R1, R2, C1, C2 may originate from homologies in B11 and B22, or in B12 and B21, or even in B11, B12, B21, etc.

A simple clone layout on an array cannot remedy this problem, thus calling for more sophisticated pooling designs. The problem can be alleviated, for instance, by arranging the clones on more than one array, thereby reducing the chance of assigning overlapping clones to the same array. We propose other alternatives. One of them is based on construction of extremal graphs, while another uses reshuffling of the clones.
In addition to the problem of deconvolution, multiple homologies may also lead to incorrect indexing at low coverage levels. Referring again to the example of Figure 2, assume that the clones B11 and B22 contain a particular homology. If the shotgun coverage level is low, it may happen that the only homologies found are from row R1 and column C2. In that case, the clone B12 gets indexed erroneously. Such indexes are false positives. The probability of false positive indexing decreases rapidly with the coverage level, as shown by the following result.
Proposition 3. Consider an index between a reference sequence and two clones with the same effective length M. If the two clones are not in the same row or same column, the probability of false positive indexing equals approximately

    p_FP ≈ 2 e^{−cM/ℓ} ( 1 − e^{−cM/(2ℓ)} )² .   (2)

(Proof in Appendix.)
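For intuition, the three approximations are easy to evaluate numerically; the sketch below (helper name ours) takes the coverage c, effective length M and mean read length ℓ as defined above.

    import math

    def pgi_probabilities(c, M, ell):
        """Approximations of Propositions 1-3 for a single index."""
        q = 1.0 - math.exp(-c * M / (2.0 * ell))       # P(>= 1 hit in one pool)
        p_detect = q * q                               # Proposition 1
        e_hits = (c * M / ell) / q                     # Proposition 2
        p_false = 2.0 * math.exp(-c * M / ell) * q * q # Proposition 3
        return p_detect, e_hits, p_false

For example, with c = 1, M = 1,000 and ℓ = 500, one pool yields a hit with probability q ≈ 0.63, so the index is detected with probability p_M ≈ 0.40, produces about 3.2 hits when detected, and gives a false positive with probability about 0.11 in the two-clone scenario above.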
3.2 Sparse Arrays

The array layout of clones can be represented by an edge-labeled graph G in the following manner. Edges in G are bijectively labeled with the clones. We call such graphs clone-labeled. The pooling is defined by incidence, so that each vertex corresponds to a pool, and the incident edges define the clones in that pool. If G is bipartite, then it represents arrayed pooling with rows and columns corresponding to the two sets of vertices, and cells corresponding to edges. Notice that every clone-labeled graph defines a pooling design, even if it is not bipartite. For instance, a clone-labeled full graph with N = K(K − 1)/2 edges defines a pooling that minimizes the number K of pools, and thus the number of shotgun libraries, at the expense of increasing ambiguities.

Fig. 3. Sparse array used in conjunction with the mouse experiments. This 39 × 39 array contains 207 clones, placed at the darkened cells. The array was obtained by randomly adding clones while preserving the sparse property, i.e., the property that for all choices of two rows and two columns, at most three out of the four cells at the intersections have clones assigned to them.

Ambiguities originating from the existence of homologies and overlaps between exactly two clones correspond to cycles of length four in G. If G is bipartite and contains no cycle of length four, then deconvolution is always possible for two clones. Such a graph represents an array in which cells are left empty systematically, so that for all choices of two rows and two columns, at most three out of the four cells at the intersections have clones assigned to them. An array with that property is called a sparse array. Figure 3 shows a sparse array. A sparse array is represented by a bipartite graph G′ with a given number of edges N and a small number K of vertices, which has no cycle of length four. This is a specific case of a well-studied problem in extremal graph theory [5], known as the problem of Zarankiewicz. It is known that if N > 180 K^{3/2}, then G′ does contain a cycle of length four; hence K > N^{2/3}/32.
Let m be a prime power. We design an m² × m² sparse array for placing N = m³ clones, achieving K = O(N^{2/3}) density, by using an idea of Reiman [6]. The number m is the size of a pool, i.e., the number of clones in a row or a column. Number the rows as R_{a,b} with a, b ∈ {0, 1, ..., m − 1}. Similarly, number the columns as C_{x,y} with x, y ∈ {0, 1, ..., m − 1}. Place a clone in each cell (R_{a,b}, C_{x,y}) for which ax + b = y, where the arithmetic is carried out over the finite field F_m. This design results in a sparse array by the following reasoning. Considering the affine plane of order m, rows correspond to lines, and columns correspond to points. A cell contains a clone only if the column's point lies on the row's line. Since there are no two distinct lines going through the same two points, for all choices of two rows and two columns, at least one of the cells at the intersections is empty.
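A sketch of this construction, assuming m is prime so that arithmetic modulo m realizes F_m (a genuine prime power would require finite-field arithmetic); the flat row/column indexing is our own convention.

    import itertools

    def sparse_array(m):
        """Occupied cells of the m^2 x m^2 sparse array for m^3 clones.

        Row (a, b) and column (x, y) hold a clone iff a*x + b == y (mod m).
        Two distinct lines of the affine plane share at most one point, so
        no two rows are both occupied in two common columns.
        """
        cells = set()
        for a, b, x in itertools.product(range(m), repeat=3):
            y = (a * x + b) % m
            cells.add((a * m + b, x * m + y))   # (row index, column index)
        return cells

Each of the 2m² pools then contains exactly m clones, and the number of occupied cells is m³, as required.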
3.3 Double Shuing
An alternative to using sparse arrays is to repeat the pooling with the clones
reshuffled on an array of the same size. Taking the example of Figure 2, it
is unlikely that the clones B_{11}, B_{12}, B_{21}, and B_{22} end up again in the same
configuration after the shuffling. Deconvolution is possible if, after the shuffling,
clone B_{11} is in cell (R'_1, C'_1) and B_{22} is in cell (R'_2, C'_2), and alignments with
fragments from the pools for R_i, C_i, R'_i, and C'_i are found, except if the clones B_{12}
and B_{21} got assigned to cells (R'_1, C'_2) and (R'_2, C'_1). Figure 4 shows the possible
placements of clones in which the ambiguity is not resolved despite the shuffling.
We prove that such situations are avoided with high probability for random
shufflings.
Define the following notions. A rectangle is formed by the four clones at the
intersections of two arbitrary rows and two arbitrary columns. A rectangle is
preserved after a shuffling if the same four clones are at the intersections of
exactly two rows and two columns on the reshuffled array, and the diagonals
contain the same two clone pairs as before (see Figure 4).
Fig. 4. Possible placements of four clones (1-4) within two rows and two columns
forming a rectangle, which give rise to the same ambiguity.
Theorem 1. Let R(m) be the number of preserved rectangles on an m × m array
after a random shuffling. Then, for the expected value E R(m),

$$\mathrm{E}R(m) = \frac{1}{2} - \frac{2}{m}\bigl(1 + o(1)\bigr).$$

Moreover, for all m > 2, E R(m+1) > E R(m).

(Proof in Appendix.)

Consequently, the expected number of preserved rectangles equals 1/2 asymp-
totically. Theorem 1 also implies that with significant probability, a random
shuffling produces an array with at most one preserved rectangle. Specifically,
by Markov's inequality, $\mathrm{P}\{R(m) \geq 1\} \leq \mathrm{E}R(m)$, and thus

$$\mathrm{P}\{R(m) = 0\} \geq \frac{1}{2} + \frac{2}{m}\bigl(1 + o(1)\bigr).$$
In particular, P{R(m) = 0} > 1/2 holds for all m > 2. Therefore, a random
shuffling will preserve no rectangles with probability at least 1/2. This allows us
to use a randomized algorithm: pick a random shuffling, and count the number of
preserved rectangles. If that number is greater than zero, repeat the step. The
algorithm finishes in at most two steps on average, and takes more than one
hundred steps with less than 10^{-33} probability.
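The following compact sketch (not the authors' code, and purely illustrative) implements this randomized algorithm: it counts the rectangles of the original layout that stay preserved, in the sense defined above, and redraws the shuffling until none survives.

```python
import random
from itertools import combinations

def preserved_rectangles(m, perm):
    """Count rectangles of the original m x m layout that are preserved.
    Clone k sits at cell (k // m, k % m) before the shuffling and at
    (perm[k] // m, perm[k] % m) after it."""
    pos = [divmod(perm[k], m) for k in range(m * m)]
    count = 0
    for r1, r2 in combinations(range(m), 2):
        for c1, c2 in combinations(range(m), 2):
            quad = (r1 * m + c1, r1 * m + c2, r2 * m + c1, r2 * m + c2)
            cell = {pos[k]: k for k in quad}          # new cell -> clone
            rows = {rc[0] for rc in cell}
            cols = {rc[1] for rc in cell}
            if len(rows) != 2 or len(cols) != 2:
                continue                              # not a rectangle anymore
            (ra, rb), (ca, cb) = sorted(rows), sorted(cols)
            new_diagonals = {frozenset((cell[(ra, ca)], cell[(rb, cb)])),
                             frozenset((cell[(ra, cb)], cell[(rb, ca)]))}
            # preserved iff the diagonal clone pairs are unchanged
            if new_diagonals == {frozenset((quad[0], quad[3])),
                                 frozenset((quad[1], quad[2]))}:
                count += 1
    return count

def shuffle_without_preserved_rectangles(m, rng=random.Random(0)):
    """Redraw until no rectangle is preserved; by Theorem 1 this takes
    about two attempts on average."""
    while True:
        perm = list(range(m * m))
        rng.shuffle(perm)
        if preserved_rectangles(m, perm) == 0:
            return perm

perm = shuffle_without_preserved_rectangles(8)
```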
Remark. Theorem 1 can be extended to non-square arrays without much diffi-
culty. Let R(m_r, m_c) be the number of preserved rectangles on an m_r × m_c array
after a random shuffling. Then, for the expected value E R(m_r, m_c),

$$\mathrm{E}R(m_r, m_c) = \frac{1}{2} - \frac{m_r + m_c}{m_r m_c}\bigl(1 + o(1)\bigr).$$

Consequently, random shuffling can be used to produce arrays with no preserved
rectangles even in the case of non-square arrays (such as 12 × 8, for example).
3.4 Pooling Designs in General
Let B = {B_1, B_2, ..., B_N} be the set of clones, and P = {P_1, P_2, ..., P_K} be the
set of pools. A general pooling design is described by an incidence structure that
is represented by an N × K 0-1 matrix M. The entry M[i, j] equals one if clone B_i
is included in P_j; otherwise it is 0. The signature c(B_i) of clone B_i is the i-th row
vector, a binary vector of length K. In general, the signature of a subset S ⊆ B
of clones is the binary vector of length K defined by c(S) = ⋁_{B∈S} c(B), where
⋁ denotes the bitwise OR operation. In order to assign an index to a set of clones,
one first calculates the signature x of the index, defined as a binary vector of
length K in which the j-th bit is 1 if and only if there is a hit coming from
pool P_j. For all binary vectors x and c of length K, define

$$\delta(x, c) = \begin{cases}\sum_{j=1}^{K}(c_j - x_j) & \text{if } x_j \leq c_j \text{ for all } j = 1, \dots, K;\\ \infty & \text{otherwise;}\end{cases} \tag{3a}$$

and let

$$\delta(x, B) = \min_{S \subseteq B} \delta\bigl(x, c(S)\bigr). \tag{3b}$$

An index with signature x can be deconvoluted unambiguously if and only if the
minimum in Equation (3b) is unique. The weight w(x) of a binary vector x is
the number of coordinates that equal one, i.e., w(x) = Σ_{j=1}^{K} x_j. Equation (3a)
implies that if δ(x, c) < ∞, then δ(x, c) = w(c) - w(x).
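As an illustration of Equation (3), the sketch below (not from the paper; signatures are stored as Python integers used as bit vectors, and ∞ is represented by None) computes δ and searches the minimum of Equation (3b) by brute force over clone sets of bounded size, mirroring the three-clone limit used in the experiments of Section 4.

```python
from itertools import combinations

def signature(sigs):
    """Bitwise OR of clone signatures (binary vectors as Python ints)."""
    out = 0
    for s in sigs:
        out |= s
    return out

def delta(x, c):
    """Equation (3a): w(c) - w(x) when x is covered by c, else None
    (None plays the role of infinity)."""
    if x & ~c:                  # some pool hit lies outside c
        return None
    return bin(c).count("1") - bin(x).count("1")

def deconvolute(x, clone_sigs, max_set=3):
    """Search the minimum of Equation (3b) over clone sets of size
    <= max_set; the index is unambiguous iff one set attains it."""
    best, argmin = None, []
    for size in range(1, max_set + 1):
        for S in combinations(range(len(clone_sigs)), size):
            d = delta(x, signature(clone_sigs[i] for i in S))
            if d is None:
                continue
            if best is None or d < best:
                best, argmin = d, [S]
            elif d == best:
                argmin.append(S)
    return best, argmin
```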
Similar problems to PGI pool design have been considered in other applica-
tions of combinatorial group testing [7], and pooling designs have often been used
for clone library screening [8]. The design of sparse arrays based on combinato-
rial geometries is a basic design method in combinatorics (e.g., [9]). Reshuffled
array designs are sometimes called transversal designs. Instead of preserved rect-
angles, [10] consider collinear clones, i.e., clone pairs in the same row or column,
and propose designs with the unique collinearity condition, in which clone pairs
are collinear at most once on the reshuffled arrays. Such a condition is more
restrictive than ours and leads to incidence structures obtained by more com-
plicated combinatorial methods than our random algorithm. We describe here
a construction of arrays satisfying the unique collinearity condition. Let q be a
prime power. Based on results of design theory [9], the following method can be
used for producing up to q/2 reshuffled arrays, each of size q × q, for pooling q^2
clones. Let F_q be a finite field of size q. Pools are indexed with elements of F_q^2,
as {P_{x,y} : x, y ∈ F_q}. Define the pool set P_i = {P_{i,y} : y ∈ F_q} for all i, and let
P = ⋃_i P_i. Index the clones as {B_{a,b} : a, b ∈ F_q}. Pool P_{x,y} contains clone B_{a,b}
if and only if y = a + bx holds.
Proposition 4. We claim that the pooling design described above has the fol-
lowing properties.
1. each pool belongs to exactly one pool set;
2. each clone is included in exactly one pool from each pool set;
3. for every pair of pools that are not in the same pool set there is exactly one
clone that is included in both pools.
(Proof in Appendix.)
Let d ≤ q/2. This pooling design can be used for arranging the clones on d
reshuffled arrays, each one of size q × q. Select 2d pool sets, and pair them
arbitrarily. By Properties 1-3 of the design, every pool set pair can define an
array layout, by setting one pool set as the row pools and the other pool set as the
column pools. Moreover, this set of reshuffled arrays gives a pooling satisfying
the unique collinearity condition by Property 3, since two clones are in the same
row or column on at most one array.
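For a prime q, where arithmetic over F_q is plain modular arithmetic, Properties 2 and 3 of Proposition 4 can be checked directly; the following is a small verification sketch (not the authors' code) for q = 5.

```python
q = 5
clones = [(a, b) for a in range(q) for b in range(q)]

def pools_of(a, b):
    """Pools P[x,y] containing clone B[a,b]: those with y = a + b*x."""
    return {(x, (a + b * x) % q) for x in range(q)}

# Property 2: each clone meets every pool set {P[i,y] : y} exactly once,
# i.e., its q pools have q distinct first coordinates.
for a, b in clones:
    assert len({x for (x, y) in pools_of(a, b)}) == q

# Property 3: two pools from different pool sets share exactly one clone.
for x1 in range(q):
    for x2 in range(x1 + 1, q):
        for y1 in range(q):
            for y2 in range(q):
                shared = [c for c in clones
                          if (x1, y1) in pools_of(*c)
                          and (x2, y2) in pools_of(*c)]
                assert len(shared) == 1
```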
Similar questions also arise in coding theory, in the context of superimposed
codes [11,12]. Based on the idea of [11], consider the following pooling design
method using error-correcting block codes. Let C be a code of block length n
over the finite field F_q; in other words, let C be a set of length-n vectors over F_q.
A corresponding binary code is constructed by replacing the elements of F_q in
the codewords with binary vectors of length q. The substitution uses binary
vectors of weight one, i.e., vectors in which exactly one coordinate is 1, using
the simple rule that the z-th element of F_q is replaced by a binary vector in
which the z-th coordinate is 1. The resulting binary code C' has length qn,
and each binary codeword has weight n. Using the binary code vectors as clone
signatures, the binary code defines a pooling design with K = qn pools and
N = |C| clones, for which each clone is included in n pools. If d is the minimum
distance of the original code C, i.e., if two codewords differ in at least d
coordinates, then C' has minimum distance 2d. In order to formalize this
procedure, define φ as the operator of binary vector substitution, mapping
elements of F_q onto column vectors of length q: φ(0) = [1, 0, ..., 0], φ(1) =
[0, 1, 0, ..., 0], ..., φ(q-1) = [0, ..., 0, 1]. Furthermore, for every codeword c,
represented as a row vector of length n, let φ(c) denote the q × n array in
which the j-th column vector equals φ(c_j) for all j. Enumerating the entries
of φ(c) in any fixed order gives a binary vector, giving the signature for the
clone corresponding to c. Let f : F_q^n → F_2^{qn} denote the mapping of the original
codewords onto binary vectors defined by φ and the enumeration of the matrix
entries.
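A minimal sketch (not from the paper) of this substitution follows; it uses one fixed enumeration of the entries of φ(c), column by column.

```python
def binary_signature(codeword, q):
    """Map each symbol z in F_q to the weight-one binary vector with a 1
    in position z, and concatenate the columns of phi(c)."""
    sig = []
    for z in codeword:
        col = [0] * q
        col[z] = 1
        sig.extend(col)
    return sig

sig = binary_signature([2, 0, 1], q=3)          # n = 3 symbols over F_3
assert sig == [0, 0, 1, 1, 0, 0, 0, 1, 0]       # length qn = 9, weight n = 3
```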
Designs from Linear Codes. A linear code of dimension k is defined by a k ×
n generator matrix G with entries over F_q in the following manner. For each
message u that is a row vector of length k over F_q, a codeword c is generated
by calculating c = uG. The code C_G is the set of all codewords obtained in this
way. Such a linear code with minimum distance d is called an [n, k, d] code. It is
assumed that the rows of G are linearly independent, and thus the number of
codewords equals q^k. Linear codes can lead to designs with balanced pool sizes,
as shown by the next lemma.
Lemma 1. Let C be an [n, k, d] code over F_q with generator matrix G, and
let f : F_q^n → F_2^{qn} denote the mapping of codewords onto binary vectors as de-
fined above. Let u^{(1)}, u^{(2)}, ..., u^{(q^k)} be the lexicographic enumeration of length-k
vectors over F_q. For an arbitrary 1 ≤ N ≤ q^k, let the incidence matrix M of the
pooling be defined by the mapping f of the first N codewords {c^{(i)} = u^{(i)} G : i =
1, ..., N} onto binary vectors. If the last row of G has no zero entries, then ev-
ery column of M has ⌊N/q⌋ or ⌈N/q⌉ ones, i.e., every pool contains about m =
N/q clones.

(Proof in Appendix.)
Designs from MDS Codes. An [n, k, d] code is maximum distance separable (MDS)
if it has minimum distance d = n-k+1. (The inequality d ≤ n-k+1 holds for all
linear codes, so MDS codes achieve maximum distance for fixed code length and
dimension; hence the name.) The Reed-Solomon codes are MDS codes over F_q,
and are defined as follows. Let n = q-1, and let α_0, α_1, ..., α_{n-1} be different
non-zero elements of F_q. The generator matrix G of the RS(n, k) code is

$$G = \begin{pmatrix} 1 & 1 & \dots & 1\\ \alpha_0 & \alpha_1 & \dots & \alpha_{n-1}\\ \alpha_0^2 & \alpha_1^2 & \dots & \alpha_{n-1}^2\\ \vdots & \vdots & & \vdots\\ \alpha_0^{k-1} & \alpha_1^{k-1} & \dots & \alpha_{n-1}^{k-1} \end{pmatrix}.$$

Using a mapping f : F_q^n → F_2^{qn} as before, for the first N ≤ q^k codewords, a
pooling design is obtained with N clones and K pools. This design has many
advantageous properties. Kautz and Singleton [11] prove that if t = ⌊(n-1)/(k-1)⌋, then
the signature of any t-set of clones is unique. Since α_i^{k-1} ≠ 0 in G, Lemma 1
applies, and thus each pool has about the same size m = N/q.
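The balanced pool sizes promised by Lemma 1 can be checked directly for the RS(6, 3) design used below: 343 clones in 42 pools, with 49 clones per pool. The following sketch (prime q = 7, plain modular arithmetic; not the authors' code) evaluates the message polynomials at the α_j, which is the same as multiplying by the Vandermonde generator matrix above.

```python
from itertools import product

q, n, k = 7, 6, 3
alphas = list(range(1, q))      # n = q-1 distinct non-zero elements of F_7

def encode(u):
    """Codeword c = uG for the Vandermonde generator: c_j = sum_i u_i * a_j^i."""
    return [sum(u[i] * pow(a, i, q) for i in range(k)) % q for a in alphas]

pools = {}                      # pool (position j, symbol value) -> clone set
for clone, u in enumerate(product(range(q), repeat=k)):
    for j, cj in enumerate(encode(u)):
        pools.setdefault((j, cj), set()).add(clone)

assert len(pools) == q * n                                    # 42 pools
assert all(len(s) == q ** (k - 1) for s in pools.values())    # 49 clones each
```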
Proposition 5. Suppose that the pooling design with N = q^k is based on an
RS(n, k) code and x is a binary vector of positive weight and length K. If there
is a clone B such that δ(x, c(B)) < ∞, then δ(x, B) = n - w(x), and the
following holds. If w(x) ≥ k, then the minimum in Equation (3b) is unique,
and is attained for the singleton set containing B. Conversely, if w(x) < k, then
the minimum in Equation (3b) is attained for q^{k-w(x)} choices of singleton clone
sets.

(Proof in Appendix.)
For instance, a design based on the RS(6, 3) code has the following properties:
- 343 clones are pooled in 42 pools;
- each clone is included in 6 pools; if at least 3 of those are included in an
  index to the clone, the index can be deconvoluted unambiguously;
- each pool contains 49 clones;
- signatures of 2-sets of clones are unique and have weights 10-12;
- signatures of 3-sets of clones have weights between 12 and 18; if the weight of
  a 3-set's signature is less than 14, then the signature is unique (determined
  by a computer program).
The success of indexing in the general case is shown by the following claim.

Proposition 6. Consider an index with effective length M between a clone and
a reference sequence. If the clone appears in n pools, and a set of at least n_min
of these pools uniquely determines the clone, then the probability that the index
is detected equals

$$p_M = \sum_{t=n_{\min}}^{n} \binom{n}{t} (1 - p_0)^{t} p_0^{\,n-t},$$

where $p_0 \approx e^{-cM/(\ell n)}$.
(Proof in Appendix.)
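A small numeric sketch (not from the paper) makes the trade-off concrete by evaluating p_M from Proposition 6 for the simple array (n = n_min = 2) and for the RS(6, 3) design (n = 6, n_min = 3) at a few effective lengths.

```python
from math import comb, exp

def p_detect(c, M, ell, n, n_min):
    """Probability p_M that an index of effective length M is detected."""
    p0 = exp(-c * M / (ell * n))
    return sum(comb(n, t) * (1 - p0) ** t * p0 ** (n - t)
               for t in range(n_min, n + 1))

for M in (1000, 5000, 10000):
    print(M,
          round(p_detect(1.0, M, 500.0, n=2, n_min=2), 4),   # simple array
          round(p_detect(1.0, M, 500.0, n=6, n_min=3), 4))   # RS(6,3)
```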
In simple arrayed pooling, n_min = n = 2. In the pooling based on the RS(6, 3)
code, n_min = 3 and n = 6. When is the probability of success larger with the latter
indexing method, at a fixed coverage? The probabilities can be compared using
the following analogy. Let 6 balls be colored with red and green independently,
each ball colored green with probability p. Let X denote the event
that at least one of the first three balls, and at least one of the last three balls,
are red: PX = (1 - p^3)^2. Let Y denote the event that at least three balls are red:
$\mathrm{P}Y = \sum_{t=3}^{6}\binom{6}{t}(1-p)^{t}p^{6-t}$. PX equals the probability of successful indexing in
simple arrayed pooling (with p_0 = p^3). PY equals the probability of successful
indexing in RS(6, 3) pooling (with p_0 = p). We claim that PY > PX if p < 2/11.
Consider the event X \ Y: exactly one of the first three balls is red and
exactly one of the last three balls is red. P(X \ Y) = (3(1-p)p^2)^2. Consider the
event Y \ X: the first three or the last three balls are red, and the others
green. P(Y \ X) = 2(1-p)^3 p^3. Now, PY > PX if and only if P(Y \ X) > P(X \ Y),
i.e., when p < 2/11. The inequality holds if $M > -\frac{6\ell}{c}\ln(2/11) \approx 10\ell/c$.
Thus, for longer homologous regions (ca. 5000 bp effective length if c = 1), the
RS(6,3) pooling is predicted to have better success. As c grows, simple arrayed
pooling becomes better for the same fixed index. Furthermore, at p < 2/11,
even simple arrayed pooling gives around 99% or higher success probability, thus
the improvements are marginal, while the difference between the probabilities
is significantly larger when p is large. On the other hand, a similar argument
shows that double shuffling with unique collinearity (n = 4, n_min = 2) always has a
higher success probability than simple arrayed pooling. Figure 5 compares the
probabilities of successful indexing for some pooling designs.

[Figure 5: two panels, "Indexing failure probability of pooling designs"
(failure probability 0.001-0.1 vs. effective length) and "Number of pools in
pooling designs" (pools vs. clones) for the simple, sparse, shuffled, RS(6,3),
and RS(42,3) designs.]

Fig. 5. The graphs on the left-hand side show the probabilities of failing to find an
index as a function of the effective length. The graphs are plotted for coverage c = 1 and
expected shotgun read length ℓ = 500. The values are calculated from Proposition 6
for simple arraying and double shuffling with unique collinearity, as well as for designs
based on the RS(6, 3) and RS(42, 3) codes. Notice that in the case of a simple index, the
failure probabilities for the sparse array design are the same as for the simple array.
The graphs on the right-hand side compare the costs of different pooling designs by
plotting how the number of pools depends on the total number of clones in different
designs.
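The crossover at p = 2/11 derived above can be confirmed numerically; a short check (not from the paper) follows.

```python
from math import comb

def PX(p):      # simple arrayed pooling, p0 = p**3
    return (1 - p ** 3) ** 2

def PY(p):      # RS(6,3) pooling, p0 = p
    return sum(comb(6, t) * (1 - p) ** t * p ** (6 - t) for t in range(3, 7))

assert PY(2 / 11 - 0.01) > PX(2 / 11 - 0.01)    # RS(6,3) wins below 2/11
assert PY(2 / 11 + 0.01) < PX(2 / 11 + 0.01)    # simple arraying wins above
assert abs(PY(2 / 11) - PX(2 / 11)) < 1e-12     # equality at the crossover
```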
4 Experiments
We tested the efficiency of the PGI method for indexing mouse and rat clones by
human reference sequences in simulated experiments. The reference databases
included the public human genome draft sequence [2], the Human Transcript
Database (HTDB) [13], and the Unigene database of human transcripts [14]. Lo-
cal alignments were computed using BLASTN [1] with default search parameters
(word size = 11, gap open cost = 5, gap extension cost = 2, mismatch penalty = 2).
A hit was defined as a local alignment with an E-value less than 10^{-5}, a length
of at least 40 bases, and a score of at least 60.

In the case of transcribed reference sequences, hits on the same human refer-
ence sequence were grouped together to form the indexes. In the case of genomic
sequences, we grouped close hits together within the same reference sequence
to form the indexes. In particular, indexes were defined as maximal sets of hits
on the same reference sequence, with a certain threshold on the maximum dis-
tance between consecutive hits, called the resolution. After experimenting with
resolution values between 1 kbp and 200 kbp, we decided to use 2 kbp resolution
throughout the experiments. A particular difficulty we encountered in the exper-
iments was the abundance of repetitive elements in eukaryotic DNA. If part of a
shotgun read has many homologies in the human sequences, as is the case with
common repeat sequences, the read generates many hits. Conversely, the same
human sequence may be homologous to many reads. Accordingly, repeats in the
arrayed clones correspond to highly ambiguous indexes, and human-specific re-
peats may produce a large number of indexes to the same clone. Whereas in the
cases of rat and mouse it is possible to use a database of repeat sequences such
as Repbase [15], such information is not available for many other species. We
thus resorted to a different technique for filtering out repeats: we simply dis-
carded shotgun reads that generated more than twelve hits, thereby eliminating
almost all repeats without using a database of repeated elements.

For the deconvolution, we set the maximum number of clones to three; that is,
each index was assigned to one, two, or three clones, or was declared ambiguous.

Finally, because pooling was simulated and the original clone for each read
was known in all experiments, a straightforward test of the accuracy of indexing
was possible: an index was considered accurate if both reads deconvoluted to
the correct original clone.
Table 1. Experimental results for simulated indexing of 207 mouse clones with coverage
level 2. The table gives the results for four pooling designs (simple, shuffled, sparse
array, and RS(6, 3)), and three databases of reference sequences: Unigene (UG), HTDB,
and human genome draft (HS). The RS(6, 3) design was used with the UG database
only.

 Design    Pools | Correctly indexed clones        | Correct indexes / false positives
                 | UG         HTDB       HS        | UG         HTDB       HS
 simple    29    | 159 (77%)  139 (67%)  172 (83%) | 723 / 248  488 / 108  1472 / 69
 RS(6, 3)  42    | 172 (83%)  -          -         | 611 / 150  -          -
 shuffled  58    | 172 (83%)  150 (72%)  180 (87%) | 756 / 76   514 / 18   1549 / 238
 sparse    78    | 175 (85%)  152 (74%)  185 (88%) | 823 / 22   569 / 11   1634 / 69
4.1 Mouse Experiments
In one set of experiments we studied the efficiency of indexing mouse clones with
human sequences. We selected 207 phase 3 sequences in the mouse sequencing
project with lengths greater than 50k bases. We used four pooling designs: two
14 × 15 arrays for double shuffling, a 39 × 39 sparse array, shown in Figure 3,
and a design based on the RS(6, 3) code with 42 pools.

Random shotgun reads were produced by simulation. Each random read from
a fixed pool was obtained in the following manner. A clone was selected randomly
with probabilities proportional to clone lengths. A read length was picked using a
Poisson distribution with mean ℓ = 550, truncated at 1000. The random position
of the read was picked uniformly from the possible places for the read length on
the selected clone. Each position of the read was corrupted independently with
probability 0.01. The procedure was repeated to attain the desired coverage
within each pool. Fragments were generated for the different pooling designs
with c = 2.
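A compact sketch of this read simulator (not the authors' code; it assumes clones given as DNA strings and that a corrupted position is flipped to a random other base) could look as follows.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_read(clones):
    """clones: list of DNA strings; returns one simulated shotgun read."""
    lengths = np.array([len(s) for s in clones], dtype=float)
    clone = clones[rng.choice(len(clones), p=lengths / lengths.sum())]
    read_len = min(int(rng.poisson(550)), 1000, len(clone))
    start = int(rng.integers(0, len(clone) - read_len + 1))
    read = list(clone[start:start + read_len])
    for i in range(read_len):
        if rng.random() < 0.01:          # 1% independent per-base errors
            read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
    return "".join(read)

pool = ["".join(rng.choice(list("ACGT"), size=60000)) for _ in range(5)]
print(simulate_read(pool)[:50])
```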
The results of the experiments are summarized in Table 1. The main finding is
that 67%-88% of the clones have at least one correct index, and 500-1600 correct
indexes are created, depending on the array design and the reference database.
Note that the double shuffling and sparse array methods significantly reduce the
number of false positives, as expected based on the discussion above. One of
the reasons for the excellent performance of the sparse array method is the fact
that BAC redundancy in the data set does not exceed 2X. In the course of the
experiments, between 7% and 10% of a total of approximately 121,400 simulated
reads gave alignments against human sequences that were informative for the
deconvolution, a percentage that is consistent with the expected overall level of
genomic sequence conservation between mouse and human.
[Figure 6: "Arrayed pooling against Unigene: number of indexed clones";
indexed clones (50-150) plotted against shotgun coverage per clone (0.5-1.5)
for the smpl, shfl, sprs, and rs63 designs.]

Fig. 6. The number of correctly mapped clones is indicated as a function of shotgun cov-
erage for four different pooling schemes. PGI was tested in a simulated experiment
involving 207 publicly available mouse genomic sequences of length between 50 kbp and
300 kbp. Pooling and shotgun sequencing were then simulated. For each coverage level,
reads for c = 2 were resampled ten times. Graphs go through the median values for
four pooling designs: simple (smpl), shuffled (shfl), sparse (sprs), and Reed-Solomon
(rs63). Deconvolution was performed using the human Unigene database. Notice that the
curves level off at approximately c = 1.0, indicating limited benefit of much greater
shotgun coverages.
In order to explore the effect of shotgun coverage on the success of indexing,
we repeated the indexing with lower coverage levels. We resampled the simulated
reads by selecting appropriate portions randomly from those produced with cov-
erage 2. Figure 6 plots the number of indexed clones as a function of coverage.
It is worth noting that even for c = 0.5, about 2/3 of the clones get indexed by
at least one human sequence. In addition, the curves level off at about c = 1.0,
and higher coverage levels yield limited benefits.
4.2 Rat Experiments
In contrast to the mouse experiments, the PGI simulation on rat was performed using
publicly available real shotgun reads from individual BACs being sequenced as
part of the rat genome sequencing project. The only simulated aspect was BAC
pooling, for which we pooled reads from individual BACs computationally. We
selected a total of 625 rat BACs, each with more than 570 publicly available reads
from the rat sequencing project. An average of 285 random reads per clone were
used in each pool, corresponding to an approximate c = 1.5 coverage. The
results are summarized in Table 2.
Table 2. Experimental results on simulated indexing of 625 rat clones by Unigene
sequences with coverage level 1.5.

 Design    Correct indexes  False positives  Indexed clones
 simple    1418             236              384 (61%)
 shuffled  1383             30               409 (65%)
 sparse    1574             17               451 (72%)
The lower percentage of correctly mapped clones only partially reflects the
lower coverage (1.5 for rat vs. 2 for mouse). A much more important factor
contributing to the difference is the fact that the mouse sequences used in our
experiments are apparently richer in genes, and thus contain more conserved
regions that contribute to the larger number of BACs correctly mapped onto
human sequences.
5 Discussion
PGI is a novel method for physical mapping of clones onto known macromolecu-
lar sequences. It employs available sequences of humans and model organisms to
index genomes of new organisms at a fraction of full genome sequencing's cost.
The key idea of PGI is the pooled shotgun library construction, which reduces
the number of library preparations down to the order of the square root of the
number of BAC clones. In addition to setting priorities for targeted sequencing,
PGI has the advantage that the libraries and reads it needs can be reused in the
sequencing phase. Consequently, it is ideally suited for a two-staged approach to
comparative genome exploration, yielding maximum biological information for
a given amount of sequencing effort.

We presented a probabilistic analysis of indexing success, and described pool-
ing designs that increase the efficiency of in silico deconvolution of pooled shot-
gun reads. Using publicly available mouse and rat sequences, we demonstrated
the power of the PGI method in simulated experiments. In particular, we showed
that using relatively few shotgun reads, corresponding to 0.5-2.0 coverage of the
clones, 60-90% of the clones can be indexed with human genomic or transcribed
sequences.
Due to the low level of chromosomal rearrangements across mammals, the
order of BACs in a comparative physical map should provide an almost correct
ordering of BACs along the genome of a newly indexed mammal. Such infor-
mation should be very useful for whole-genome sequencing of such organisms.
Moreover, the already assembled reference sequences of model organisms onto
which the BACs are mapped may guide the sequence assembly of the homologous
sequence of a newly sequenced organism^2 [16].

Comparative physical maps will allow efficient, targeted, cross-species se-
quencing for the purpose of comparative annotation of genomic regions in model
organisms that are of particular biomedical importance. PGI is not limited to the
mapping of BAC clones. Other applications currently include the mapping of ar-
rayed cDNA clones onto genomic or known full or partial cDNA sequences within
and across species, and the mapping of bacterial genomic clones across different
bacterial strains. Sampling efficiency of PGI is in practice increased by an order
of magnitude by sequencing short sequence fragments. This is accomplished by
breaking the pooled DNA into short fragments, selecting them by size, forming
concatenamers by ligation, and then sequencing the concatenamers [17,18,19].
Assuming a sequence read of 600 bp and a tag size of 20-200 bp, a total of 3-30 dif-
ferent clones may be sampled in a single sequencing reaction. This technique is
particularly useful when the clone sequences and reference sequences are highly
similar, e.g., cDNA mapping against the genome of the same species, bacterial
genome mapping across similar strains, and mapping of primate genomic BACs
against human sequence.
Acknowledgements
The authors are grateful to Richard Gibbs and George Weinstock for sharing
pre-publication information on CAPSS and for useful comments, and to Paul
Havlak and David Wheeler for contributing database access and computational
resources at HGSC.
^2 Claim 1 of US Patent 6,001,562 [16] reads as follows: "A method for detecting se-
quence similarity between at least two nucleic acids, comprising the steps of:
(a) identifying a plurality of putative subsequences from a first nucleic acid;
(b) comparing said subsequences with at least a second nucleic acid sequence; and
(c) aligning said subsequences using said second nucleic acid sequence in order to
simultaneously maximize
(i) matching between said subsequences and said second nucleic acid sequence, and
(ii) mutual overlap between said subsequences,
whereby said aligning predicts a subsequence that occurs within both said first and
said second nucleic acids."
References
1. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W.,
   Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein
   database search programs. Nucleic Acids Res. 25 (1997) 3389-3402
2. IHGSC: Initial sequencing and analysis of the human genome. Nature 409 (2001)
   860-921
3. Cai, W.W., Chen, R., Gibbs, R.A., Bradley, A.: A clone-array pooled shotgun
   strategy for sequencing large genomes. Genome Res. 11 (2001) 1619-1623
4. Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones:
   a mathematical analysis. Genomics 2 (1988) 231-239
5. Bollobás, B.: Extremal graph theory. In Graham, R.L., Grötschel, M., Lovász, L.,
   eds.: Handbook of Combinatorics. Volume II. Elsevier, Amsterdam (1995) 1231-1292
6. Reiman, I.: Über ein Problem von K. Zarankiewicz. Acta Math. Sci. Hung. 9
   (1958) 269-279
7. Du, D.Z., Hwang, F.K.: Combinatorial Group Testing and Its Applications. 2nd
   edn. World Scientific, Singapore (2000)
8. Bruno, W.J., Knill, E., Balding, D.J., Bruce, D.C., Doggett, N.A., Sawhill, W.W.,
   Stallings, R.L., Whittaker, C.C., Torney, D.C.: Efficient pooling designs for library
   screening. Genomics 26 (1995) 21-30
9. Beth, T., Jungnickel, D., Lenz, H.: Design Theory. 2nd edn. Cambridge University
   Press, UK (1999)
10. Barillot, E., Lacroix, B., Cohen, D.: Theoretical analysis of library screening using
    an n-dimensional pooling strategy. Nucleic Acids Res. 19 (1991) 6241-6247
11. Kautz, W.H., Singleton, R.C.: Nonrandom binary superimposed codes. IEEE
    Trans. Inform. Theory IT-10 (1964) 363-377
12. Dyachkov, A.G., Macula, Jr., A.J., Rykov, V.V.: New constructions of superim-
    posed codes. IEEE Trans. Inform. Theory IT-46 (2000) 284-290
13. Bouck, J., McLeod, M.P., Worley, K., Gibbs, R.A.: The Human Transcript
    Database: a catalogue of full length cDNA inserts. Bioinformatics 16 (2000)
    176-177. http://www.hgsc.bcm.tmc.edu/HTDB/
14. Schuler, G.D.: Pieces of the puzzle: expressed sequence tags and the catalog of
    human genes. J. Mol. Med. 75 (1997) 694-698. http://www.ncbi.nlm.nih.gov/Unigene/
15. Jurka, J.: Repbase update: a database and an electronic journal of repetitive
    elements. Trends Genet. 16 (2000) 418-420. http://www.girinst.org/
16. Milosavljevic, A.: DNA sequence similarity recognition by hybridization to short
    oligomers (1999). U.S. patent 6,001,562
17. Andersson, B., Lu, J., Shen, Y., Wentland, M.A., Gibbs, R.A.: Simultaneous shot-
    gun sequencing of multiple cDNA clones. DNA Seq. 7 (1997) 63-70
18. Yu, W., Andersson, B., Worley, K.C., Muzny, D.M., Ding, Y., Liu, W., Ricafrente,
    J.Y., Wentland, M.A., Lennon, G., Gibbs, R.A.: Large-scale concatenation cDNA
    sequencing. Genome Res. 7 (1997) 353-358
19. Velculescu, V.E., Vogelstein, B., Kinzler, K.W.: Analysing uncharted transcrip-
    tomes with SAGE. Trends Genet. 16 (2000) 423-425
Appendix
Proof (Proposition 1). The number of random shotgun reads from the row pool
associated with the clone equals cmL/(2ℓ). By Equation (1), the probability that at
least one of them aligns with the reference sequence equals

$$p_1 = 1 - \Bigl(1 - \frac{p_{\mathrm{hit}}}{m}\Bigr)^{cmL/(2\ell)} = 1 - \Bigl(1 - \frac{M}{mL}\Bigr)^{cmL/(2\ell)} \approx 1 - e^{-cM/(2\ell)}. \tag{4}$$

The probability that the row and column pools both generate reads aligning to
the reference sequence equals p_1^2, as claimed.
Proof (Proposition 2). The number of hits for an index coming from a fixed row
or column pool is distributed binomially with parameters n = cmL/(2ℓ) and p = M/(mL).
Let ξ_r denote the number of hits coming from the clone's row pool, and let ξ_c
denote the number of hits coming from the clone's column pool. Then

$$\mathrm{E}\,\xi_M = \mathrm{E}\bigl[\xi_r + \xi_c \mid \xi_r > 0, \xi_c > 0\bigr] = \mathrm{E}\bigl[\xi_r \mid \xi_r > 0\bigr] + \mathrm{E}\bigl[\xi_c \mid \xi_c > 0\bigr].$$

In order to calculate the conditional expectations on the right-hand side, notice
that if ξ is a non-negative random variable, then $\mathrm{E}\,\xi = \mathrm{E}[\xi \mid \xi > 0]\,\mathrm{P}\{\xi > 0\}$.
Since $\mathrm{P}\{\xi_c > 0\} = \mathrm{P}\{\xi_r > 0\} = p_1$,

$$\mathrm{E}\,\xi_M = 2\,\mathrm{E}\bigl[\xi_r \mid \xi_r > 0\bigr] = \frac{2np}{p_1}.$$

Resubstituting p and n, the proposition follows from Equation (4).
Proof (Proposition 3). Let n = cmL/(2ℓ) be the number of reads from a pool and p =
M/(mL) the probability of a hit within a pool containing one of the clones. Then a
false positive occurs if there are no hits in the row pool for one clone and the
column pool for the other, and there is at least one hit in each of the other two
pools containing the clones. The probability of that event equals

$$2\bigl((1-p)^{n}\bigr)^{2}\bigl(1 - (1-p)^{n}\bigr)^{2}.$$

Using the same approximation technique as in Proposition 1 leads to Equa-
tion (2).
Proof (Theorem 1). Let R be an arbitrary rectangle in the array. We calculate
the probability that R is preserved after a random shuffling by counting the
number of shufflings in which it is preserved. There are $\binom{m}{2}$ choices for selecting
two rows for the position of the preserved rectangle. Similarly, there are $\binom{m}{2}$ ways
to select two columns. There are 8 ways of placing the clones of the rectangle in
the cells at the four intersections in such a way that the diagonals are preserved
(see Figure 4). There are $(m^2-4)!$ ways of placing the remaining clones on the
array. Thus, the total number of shufflings in which R is preserved equals

$$8\binom{m}{2}^{2}(m^2-4)! = 8\Bigl(\frac{m(m-1)}{2}\Bigr)^{2}(m^2-4)!.$$

Dividing this number by the number of possible shufflings gives the probability
of preserving R:

$$p = \frac{8\binom{m}{2}^{2}}{m^2(m^2-1)(m^2-2)(m^2-3)}. \tag{5}$$

For every rectangle R, define the indicator variable I(R) for the event that
it is preserved. Obviously, E I(R) = p. Using the linearity of expectations and
Equation (5),

$$\mathrm{E}R(m) = \sum_{R}\mathrm{E}I(R) = \binom{m}{2}^{2}p = \frac{1}{2}\cdot\frac{m^4(m-1)^4}{m^2(m^2-1)(m^2-2)(m^2-3)}. \tag{6}$$

Thus,

$$\mathrm{E}R(m) = \frac{1}{2}\Bigl(1 - \frac{4}{m}\bigl(1+o(1)\bigr)\Bigr),$$

proving the theorem. Equation (6) also implies that $\mathrm{E}R(m+1)/\mathrm{E}R(m) > 1$ for m > 2,
and thus E R(m) is monotonically increasing.
Proof (Proposition 4). Property 1 is trivial. For Property 2, notice that
clone B_{a,b} is included in pool P_{i,a+bi} in every pool set P_i, and in no other pools. For
Property 3, let P_{x_1,y_1} and P_{x_2,y_2} be two arbitrary pools. A clone B_{a,b} is
included in both pools if and only if

$$y_1 = a + b x_1 \quad\text{and}\quad y_2 = a + b x_2.$$

If x_1 ≠ x_2, then there is exactly one solution for (a, b) that satisfies both equal-
ities.
Proof (Lemma 1). Fix the first (k-1) coordinates of u and let the last one vary
from 0 to (q-1). The i-th coordinate of the corresponding codeword c = uG
takes all values of F_q if the entry G[k, i] is not 0.
Proof (Proposition 5). The first part of the claim, for the case w(x) ≥ k, is a
consequence of the error-correcting properties of the code. The second part of
the claim follows from the MDS property.
Proof (Proposition 6). This proof generalizes that of Proposition 1. The number
of random shotgun reads from one pool associated with the clone equals cmL/(nℓ). By
Equation (1), the probability that at least one of them aligns with the reference
sequence equals

$$p_1 = 1 - \Bigl(1 - \frac{p_{\mathrm{hit}}}{m}\Bigr)^{cmL/(n\ell)} \approx 1 - e^{-cM/(n\ell)}.$$

The probability that at least n_min pools generate reads aligning to the reference
sequence equals

$$\sum_{t=n_{\min}}^{n}\binom{n}{t}p_1^{\,t}(1-p_1)^{n-t},$$

proving the claim with p_0 = 1 - p_1.
Practical Algorithms and Fixed-Parameter
Tractability for the Single Individual SNP
Haplotyping Problem

Romeo Rizzi^{1,*}, Vineet Bafna^2, Sorin Istrail^2, and Giuseppe Lancia^3

1 Math. Dept., Università di Trento, 38050 Povo (Tn), Italy
  [email protected]
2 Celera Genomics, Rockville MD, USA
  {Vineet.Bafna,Sorin.Istrail}@celera.com
3 D.E.I., Università di Padova, 35100 Padova, Italy
  [email protected]
Abstract. Single nucleotide polymorphisms (SNPs) are the most fre-
quent form of human genetic variation, of foremost importance for a va-
riety of applications including medical diagnostics, phylogenies, and drug
design.
The complete SNP sequence information from each of the two copies
of a given chromosome in a diploid genome is called a haplotype. The
Haplotyping Problem for a single individual is as follows: given a set of
fragments from one individual's DNA, find a maximally consistent pair of
SNP haplotypes (one per chromosome copy) by removing data errors
related to sequencing errors, repeats, and paralogous recruitment. Two
versions of the problem, i.e., the Minimum Fragment Removal (MFR)
and the Minimum SNP Removal (MSR), are considered.
The Haplotyping Problem was introduced in [8], where it was proved
that both MSR and MFR are polynomially solvable when each fragment
covers a set of consecutive SNPs (i.e., it is a gapless fragment), and NP-
hard in general. The original algorithms of [8] are of theoretical interest,
but by no means practical. In fact, one relies on finding the maximum
stable set in a perfect graph, and the other is a reduction to a network
flow problem. Furthermore, the reduction does not work when there are
fragments completely included in others, and neither algorithm can be
generalized to deal with a bounded total number of holes in the data.
In this paper, we give the first practical algorithms for the Haplotyping
Problem, based on Dynamic Programming. Our algorithms do not re-
quire the fragments to not include each other, and are polynomial for
each constant k bounding the total number of holes in the data.
For m SNPs and n fragments, we give an O(mn^{2k+2}) algorithm for the
MSR problem, and an O(2^{2k} m^2 n + 2^{3k} m^3) algorithm for the MFR prob-
lem, when each fragment has at most k holes. In particular, we obtain
an O(mn^2) algorithm for MSR and an O(m^2 n + m^3) algorithm for MFR
on gapless fragments.
Finally, we prove that both MFR and MSR are APX-hard in general.
* Research partially done while enjoying hospitality at BRICS, Department of Com-
puter Science, University of Aarhus, Denmark.
1 Introduction
With the sequencing of the human genome [12,7] has come the confirmation that
all humans are almost identical at the DNA level (99% and greater identity). Hence,
small regions of differences must be responsible for the observed diversities at
the phenotype level. The smallest possible variation is at a single nucleotide, and is
called a Single Nucleotide Polymorphism, or SNP (pronounced "snip"). Broadly
speaking, a polymorphism is a trait, common to everybody, whose value can be
different but is drawn from a limited range of possibilities, called alleles. A SNP is
a specific nucleotide, placed in the middle of a DNA region which is otherwise
identical for all of us, whose value varies within a population. In particular, each
SNP shows a variability of only two alleles. These alleles can be different for
different SNPs.

Recent studies have shown that SNPs are the predominant form of human
variation [2], occurring, on average, every thousand bases. Their importance can-
not be overestimated for therapeutic, diagnostic, and forensic applications. Nowa-
days, there is a large amount of research going on in determining SNP sites in
humans as well as other species, with a SNP consortium founded with the aim
of designing a detailed SNP map for the human genome [11,6].

Since the DNA of diploid organisms is organized in pairs of chromosomes, for
each SNP one can either be homozygous (same allele on both chromosomes)
or heterozygous (different alleles). The values of a set of SNPs on a particular
chromosome copy define a haplotype. Haplotyping an individual consists in de-
termining a pair of haplotypes, one for each copy of a given chromosome. The
pair provides full information on the SNP fingerprint for that individual at the
specific chromosome.

There exist different combinatorial versions of haplotyping problems. In par-
ticular, the problem of haplotyping a population (i.e., a set of individuals) has
been extensively studied, under many objective functions [4,5,3], while haplo-
typing for a single individual has been studied in [8] and in [9].

Given the complete DNA sequence, haplotyping an individual would consist of
a trivial check of the value of some nucleotides. However, the complete DNA
sequence is obtained by the assembly of smaller fragments, each of which can
contain errors, and is very sensitive to repeats. Therefore, it is better to define
the haplotyping problem considering as input data the fragments instead of the
fully assembled sequence.

Computationally, the haplotyping problem calls for determining the best
pair of haplotypes, which can be inferred from data that is possibly inconsistent
and contradictory. The problem was formally defined in [8], where conditions are
derived under which it turns out to be solvable in polynomial time, and others for
which it is NP-hard. We remark that both situations are likely to occur in real-life
contexts, depending on the type of data available and the methodology used
for sequencing. In this paper we improve on both the polynomial and hardness
results of [8]. In particular, we describe practical effective algorithms based on
Dynamic Programming, which are low-degree polynomial in the number of SNPs
and fragments, and remain polynomial even if the fragments are allowed to skip
some SNPs (up to a fixed maximum). As for the complexity results, we show
that the problems are not just NP-hard, but in fact APX-hard.

Table 1. A chromosome and the two haplotypes

  Chrom. c, paternal:  ataggtccCtatttccaggcgcCgtatacttcgacgggActata
  Chrom. c, maternal:  ataggtccGtatttccaggcgcCgtatacttcgacgggTctata
  Haplotype 1:         C C A
  Haplotype 2:         G C T
Despite the biological origin of the problem, the model turns out to be purely
combinatorial, and has many nice mathematical properties. The basic biological
motivations behind the problem are provided in Section 2. The mathematical
model of the problems, together with the notation and two useful reductions,
are described in Section 3. In Section 4, we introduce a suitable bipartite labeled
graph, used to characterize the problems and to give an APX-hardness result
for their general versions. In Section 5 we describe Dynamic Programming-based
polynomial algorithms for the gapless versions of the problem, while in Section 6
we extend these results to bounded-length gaps.
2 SNPs and Haplotypes
The process of passing from the sequence of nucleotides in a DNA molecule to
a string over the DNA alphabet is called sequencing. A sequencer is a machine
that is fed some DNA and whose output is a string of A's, T's, C's and G's. To
each letter, the sequencer attaches a value (confidence level) which represents,
essentially, the probability that the letter has been correctly read.

The main problem with sequencing is that the technology is not powerful
enough to sequence a long DNA molecule, which must therefore first be cloned
into many copies, and then be broken, at random, into several pieces (called
fragments), of a few hundred nucleotides each, that are individually fed to a se-
quencer. The cloning phase is necessary so that the fragments can have nonempty
overlap. From the overlap of two fragments one infers a longer fragment, and so
on, until the original DNA sequence has been reconstructed. This is, in essence,
the principle of Shotgun Sequencing [13], in which the fragments are assembled
back into the original sequence by using sophisticated algorithms. The assembly
phase is complicated by the fact that in a genome there exist many regions
with identical content (repeats) scattered all around, which may fool the as-
sembler into thinking that they are all copies of a same, unique, region. The
situation is complicated further by the fact that diploid genomes are orga-
nized into pairs of chromosomes (a paternal and a maternal copy) which may
have identical or nearly identical content, a situation that makes the assembly
process even harder.

To partly overcome these difficulties, the fragments used in shotgun sequenc-
ing sometimes have some extra information attached. In fact, they are obtained
via a process that generates pairs (called mate pairs) of fragments instead of
individual ones, with a fairly precise estimate of the distance between them.
These pairs are guaranteed to come from the same copy of a chromosome, and
there is a good chance that, even if one of them comes from a repeat region, the
other does not.

A Single Nucleotide Polymorphism, or SNP, is a position in a genome at
which, within a population, some individuals have a certain base while the others
have a different one. In this sense, that nucleotide is polymorphic, from which
the name. For each SNP, an individual is homozygous if the SNP has the same
allele on both chromosome copies, and otherwise the individual is heterozygous.
The values of a set of SNPs on a particular chromosome copy define a haplotype.
In Table 1 we give a simplistic example of a chromosome with three SNP sites.
The individual is heterozygous at SNPs 1 and 3 and homozygous at SNP 2. The
haplotypes are CCA and GCT.

The Haplotyping Problem consists in determining a pair of haplotypes, one
for each copy of a given chromosome, from some input genomic data. Given the
assembly output (i.e., a fully sequenced genome), haplotyping would simply con-
sist in checking the value of some specific sites. However, there are unavoidable
errors, some due to the assembler, some to the sequencer, that complicate the
problem and make it necessary to proceed in a different way. One problem is
due to repeat regions and paralogous recruitment [9]. In practice, fragments
with high similarity are merged together, even if they really come from different
chromosome copies, and the assembler tends to reconstruct a single copy of each
chromosome. Note that in these cases, heterozygous SNP sites could be used
to correct the assembly and segregate two distinct copies of the similar regions.
Another problem is related to the quality of the reads. For these reasons, the
haplotyping problem has been recently formalized as a combinatorial problem,
defined not over the assembly output, but over the original set of fragments.

The framework for this problem was introduced in [8]. The data consists of
small, overlapping fragments, which can come from either one of two chromo-
some copies. Further, e.g., in shotgun sequencing, there may be pairs of fragments
known to come from the same chromosome copy and to have a given distance be-
tween them. Because of unavoidable errors, under a general parsimony principle,
the basic problem is the following:

Given a set of fragments obtained by DNA sequencing from the two copies
of a chromosome, find the smallest number of errors so that there exist two
haplotypes compatible with all the (corrected) fragments observed.

Depending on the errors considered, different combinatorial problems have
been defined in the literature. Bad fragments can be due either to contami-
nants (i.e., DNA coming from a different organism than the actual target) or to
read errors. An alternative point of view assigns the errors to the SNPs, i.e., a
bad SNP is a SNP for which some fragments contain read errors. Correspond-
ingly, we have the following optimization problems: find the minimum number
of fragments to ignore, or find the minimum number of SNPs to ignore, such
that the (corrected) data is consistent with the existence of two haplotypes. Find
such haplotypes.
[Figure 1: part (a), the SNP matrix below; the graphs of parts (b)-(d) are
not reproduced.]

      1 2 3 4 5 6
  1   A B - A A B
  2   B A - - B -
  3   - A B A B A
  4   - A B - B A
  5   B - A B A -

Fig. 1. (a) A SNP matrix. (b) The fragment conflict graph G_F(M). (c) The SNP
conflict graph G_S(M). (d) The labeled bipartite graph (G, ℓ)
3 Terminology and Notation
Let S = {1, ..., n} be a set of SNPs and F = {1, ..., m} be a set of fragments
(where, for each fragment, only the nucleotides at positions corresponding to
some SNP are considered). Each SNP is covered by some of the fragments, and
can take only two values. The actual values (nucleotides) are irrelevant to the
combinatorics of the problem, and hence we will denote, for each SNP, by A and
B the two values it can take. Given any ordering of the SNPs (e.g., the natural
one, induced by their physical location on the chromosome), the data can also
be represented by an m × n matrix over the alphabet {A, B, -}, which we call
the SNP matrix (read "snip matrix"), defined in the obvious way. The symbol -
appears in all cells M[f, s] for which a fragment f does not cover a SNP s,
and it is called a hole.

For a SNP s, two fragments f and g are said to conflict on s if M[f, s] = A
and M[g, s] = B, or vice versa. Two fragments f and g are said to conflict if there
exists a SNP s such that they conflict on s; otherwise f and g are said to agree.
A SNP matrix M is called error-free if we can partition the rows (fragments)
into two classes of non-conflicting fragments.

Given a SNP matrix M, the fragment conflict graph is the graph G_F(M) =
(F, E_F), with an edge for each pair of conflicting fragments (see Figure 1(a)
and (b)). Note that if M is error-free, G_F(M) is a bipartite graph, since each
haplotype defines a shore of G_F(M), made of all the fragments coming from
that haplotype. Conversely, if G_F(M) is bipartite, with shores H_1 and H_2, all
the fragments in H_1 can be merged into one haplotype, and similarly for H_2.
Hence, M is error-free if and only if G_F(M) is bipartite.
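This equivalence immediately suggests an error-freeness test: build G_F(M) and 2-color it with a breadth-first search. The sketch below (not from the paper) does exactly that on the matrix of Figure 1(a), which contains the odd cycle formed by fragments 1, 2 and 5 and is therefore not error-free.

```python
from collections import deque

def conflicts(f, g):
    """Fragments as strings over 'A', 'B', '-': conflict on a covered SNP."""
    return any(a != '-' and b != '-' and a != b for a, b in zip(f, g))

def is_error_free(M):
    """2-color the fragment conflict graph G_F(M); returns (True, sides)
    if M is error-free, (False, None) if an odd cycle exists."""
    n = len(M)
    adj = [[g for g in range(n) if g != f and conflicts(M[f], M[g])]
           for f in range(n)]
    side = [None] * n
    for start in range(n):          # each component gets its own BFS
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            f = queue.popleft()
            for g in adj[f]:
                if side[g] is None:
                    side[g] = 1 - side[f]
                    queue.append(g)
                elif side[g] == side[f]:
                    return False, None
    return True, side

M = ["AB-AAB", "BA--B-", "-ABABA", "-AB-BA", "B-ABA-"]
print(is_error_free(M))             # -> (False, None)
```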
The fundamental underlying problem in SNP haplotyping is determining
an optimal set of changes to M (e.g., row and/or column deletion) so that M
becomes error-free. Given a matrix M and a set X of rows or columns
of M, we denote by M \ X the matrix obtained from M by dropping the rows
or columns in X. In this work, we will consider the following problems.
MSR (Minimum SNP Removal): find a minimum number of columns
(SNPs) whose removal makes M error-free;

MFR (Minimum Fragment Removal): find a minimum number of rows
(fragments) whose removal makes M error-free.

For better readability, from now on we will refer to a matrix M instead
of a SNP matrix M, unless a possible confusion arises with another type of
matrix. For a problem Π ∈ {MSR, MFR} on input a matrix M, we will denote
by Π(M) the value of an optimal solution.
The following two reductions can be used to remove redundant data from
the input, and hence to clean up the structure of the problems.

We start by considering the minimum number of columns (SNPs) whose
removal makes M error-free. We have the following proposition:

Proposition 1 (S-reduction) Let M' be the matrix obtained from M by drop-
ping those columns where no A's or no B's occur. Clearly, MSR(M') ≤ MSR(M).
Let X be any set of SNPs such that M' \ X is error-free. Then also M \ X is
error-free.
Essentially, Proposition 1 says that when solving the problem we can simply
concentrate our attention on M', the other columns being inessential. The matrix
M' so obtained is called S-reduced. When M is error-free, we say that two
fragments f and g are allies (enemies) when they must be in the same class
(in separate classes) for every partition of the rows of M into two classes of
non-conflicting fragments.

Now, on an S-reduced M', we have the following structure for solutions.

Lemma 2 (o con noi o contro di noi)^1 Let M be an S-reduced matrix.
Let X be any set of SNPs whose removal makes M error-free. Let f, g ∈ F be any
two fragments. Consider any SNP s ∈ S \ X such that {M[f, s], M[g, s]} ⊆ {A, B}.
Then, if M[f, s] ≠ M[g, s] then f and g are enemies; otherwise, they are allies.

Proof: The first part is obvious. As for the second, assume, e.g., M[f, s] =
M[g, s] = A. Then, since M is S-reduced, there exists a third fragment h such
that M[h, s] = B.
We also have a similar reduction which applies to rows.

Proposition 3 (F-reduction) Let M' be the matrix obtained from M by drop-
ping those rows which conflict with at most one other row. Clearly, MSR(M') ≤
MSR(M). Let X be any set of SNPs whose removal makes M' error-free. Then
the removal of X makes also M error-free.

^1 Literal translation: "either with us or against us". Meaning: you have to choose,
either be friend or enemy.
We can hence assume that every row conflicts with at least two other rows,
simply by dropping those rows which conflict with at most one row. The matrix M'
so obtained is called F-reduced.

We now proceed to check that the two reductions just introduced for MSR are
also valid for MFR (they are valid for MER as well). So, consider the minimum
number of rows (fragments) whose removal makes M error-free. The following
propositions are easy to prove.
Proposition 4 (S-reduction, again) Let M' be the matrix obtained from M
by dropping those columns where only A's or only B's occur. Clearly, MFR(M') ≤
MFR(M). Let X be any set of rows whose removal makes M' error-free. Then
also M \ X is error-free.

Proposition 5 (F-reduction, again) Let M' be the matrix obtained from M
by dropping those rows which conflict with at most one single other row. Clearly,
MFR(M') ≤ MFR(M). Let X be any set of rows whose removal makes M'
error-free. Then also M \ X is error-free.
3.1 Old and New Results
We define a gapless fragment i as one for which the A's and B's appear con-
secutively, with no -'s between them, in row i of M. In general, a gap is a
maximal run of consecutive holes between two non-hole symbols. As an example,
--ABABBBA---- is a gapless fragment, while there are 2 gaps in --AB---B--AB-.
The length of a gap is the number of holes it contains (so the second example
above has a total gap length of 5 = 3 + 2). The body of a fragment extends from the
leftmost non-hole to the rightmost non-hole (e.g., the body of ---ABB-B--AB-
is ABB-B--AB).

In haplotyping, there can be gaps for mainly two reasons:
1. thresholding of low-quality reads. This happens if the sequencer cannot call
a SNP A or B with enough confidence; then, no call is made, and the position
is marked with -.
2. mate-pairing in shotgun sequencing. One pair of mates are two fragments
coming from the same chromosome copy, with a given distance between
them. Hence, they are logically equivalent to a single fragment with one gap.

We call a matrix gapless if all fragments are gapless. We say that a matrix
has the consecutive-1 property (is C1P) if the columns can be rearranged so
as to obtain a gapless matrix. Note that determining whether such a rearrangement
exists is a well-known polynomial problem [1]. As a consequence, all polynomial
results for gapless matrices can be readily extended to polynomial results for
C1P matrices. We note also that the consecutive-1 property is conserved under
both S-reduction and F-reduction.

We now briefly recall the main results that were obtained in [8] for gapless
matrices, and in Sections 5.1 and 5.2 we go on to improve them to more practical
and effective algorithms. In a nutshell, in [8] it is shown that for gapless matrices
the problems are polynomial, while, in general, they are NP-hard. Note that
these cases are both likely to occur in real-life applications. For instance, an
EST (Expressed Sequence Tag) is a short DNA fragment with no gaps, fully
sequenced. When the input consists only of ESTs, the matrix M is C1P (i.e.,
has the consecutive-1 property). On the other hand, when also mate pairs are
used, the matrix is not necessarily C1P.
Let M be an S-reduced matrix. Two SNPs s and t are said to be in conflict
if there exist fragments f and g such that M[f, s], M[g, s], M[f, t], M[g, t] ≠ -
and the boolean value (M[f, s] = M[g, s]) is the negation of (M[f, t] = M[g, t]).
In other words, the 2 × 2 submatrix of M defined by rows f and g and columns
s and t has 3 symbols of one type (A or B) and one of the opposite type (B or A,
respectively). Given a matrix M, the SNP conflict graph is the graph G_S(M) =
(S, E_S), with an edge for each pair of SNPs in conflict (see Figure 1(c)).
The following theorems are proved in [8].

Theorem 6 Let M be a gapless matrix. Then M is error-free if and only if
G_S(M) is a stable set.

Theorem 7 If M is a gapless matrix, then G_S(M) is a weakly triangulated
graph.

From Theorems 6 and 7 it followed that MSR is polynomial for a gapless
matrix, since it amounts to finding the largest stable set in a perfect graph. The
main result in [8] was to show that the problem is polynomial, and little attention
was paid to the fact that the running times of algorithms based on the perfectness
of graphs are bounded by too high-degree polynomials for practical applications
(let apart the problem of actually coding these algorithms). Similarly, in [8] a
point was made that, for gapless data, MFR is also polynomial, which was shown
via an expensive reduction to a network flow problem. Furthermore, the result
is only valid when no fragment is contained in another. In the next two sections,
we set up quicker and much simpler algorithms for both MFR and MSR.
We also manage to get rid of the assumption that no fragment is contained in
another. The main results we establish in this paper are the following:
Theorem 8 There is an O(mn^{2k+2}) polynomial time algorithm for the MSR prob-
lem on matrices in which each fragment has total gap length at most k.

Corollary 9 There is an O(mn^2) polynomial time algorithm for the MSR problem
on matrices which have the consecutive-1 property.

Theorem 10 There is an O(2^{2k} m^2 n + 2^{3k} m^3) polynomial time algorithm for
the MFR problem on matrices in which each fragment has total gap length at
most k.

Corollary 11 There is an O(m^2 n + m^3) polynomial time algorithm for the MFR
problem on matrices which have the consecutive-1 property.

Theorem 12 The problems MFR and MSR are, in general, APX-hard.
4 The Data as Labeled Bipartite Graph
We assume that we are working on an S-reduced matrix M. To M we associate a
labeled graph as follows. Let G = (F, S; E) be a bipartite graph on color classes
F and S, with edge set E := {sf : s ∈ S, f ∈ F, M[f, s] ≠ -}. When
M[f, s] = A we label sf even, and set ℓ(sf) = 0. When M[f, s] = B we label sf
odd, and set ℓ(sf) = 1 (see Figure 1(d), where even edges are dashed and odd
edges are thick). An edge set F' ⊆ E is called ℓ-odd if it contains an odd number
of odd edges, and ℓ-even otherwise. Indeed, we can extend our labeling to edge
sets as $\ell(F') := \bigl(\sum_{e \in F'} \ell(e)\bigr) \bmod 2$. We say that the pair (G, ℓ) is ℓ-bipartite if
and only if it contains no ℓ-odd cycle (regarding a cycle as an edge set). By the "o
con noi o contro di noi" Lemma 2, we have the following consequence:

Proposition 13 The matrix M is error-free if and only if (G, ℓ) is ℓ-bipartite.
We now give a formal proof of the above proposition in a more general form.

Lemma 14 Let X be a set of columns. If M is S-reduced and M \ X is error-
free, then (G \ X, ℓ) is ℓ-bipartite.

Proof: Indeed, let C = f_1, s_1, ..., f_k, s_k be any cycle in G \ X. Define ℓ_C(s_i) =
ℓ(f_i s_i) + ℓ(s_i f_{i+1}), so that $\ell(C) = \bigl(\sum_{i=1}^{k} \ell_C(s_i)\bigr) \bmod 2$. Consider a partition of F into
two haplotypes. By Lemma 2, whenever ℓ_C(s_i) is even, f_i and f_{i+1} are in
the same haplotype, while when it is odd, they are in different haplotypes. But,
along the cycle f_1, f_2, ..., f_k, the number of haplotype switches must be even
(like jumping back and forth between the two sides of a river but eventually
returning to the starting side).
Now we go for the converse.
Lemma 15. If (G, λ) is λ-bipartite, then M (not necessarily reduced) is error-free.
Proof: By definition, the fragment conflict graph G_F(M) = (F, E_F) is such that uv ∈ E_F if there is an s such that us and sv are in (G, λ) and λ(us) + λ(sv) = 1. Hence, to each cycle C in G_F(M) there corresponds a cycle C' in (G, λ), and C has an even number of edges if and only if C' is λ-even. So, G_F(M) is bipartite and hence M is error-free.
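For concreteness, the following Python sketch (ours, not from [8] or the present paper; the data layout is an assumption) decides λ-bipartiteness, and hence error-freeness by Proposition 13, by propagating parities along a breadth-first search in the spirit of the two lemmas above. The matrix is given as a dictionary mapping (fragment, SNP) pairs to 'A' or 'B', with holes simply absent.

from collections import deque

def is_lambda_bipartite(M):
    # Build the labeled bipartite graph of Section 4: one edge per non-hole
    # entry, with parity 0 for 'A' (even) and 1 for 'B' (odd).
    adj = {}
    for (f, s), val in M.items():
        parity = 0 if val == 'A' else 1
        adj.setdefault(('F', f), []).append((('S', s), parity))
        adj.setdefault(('S', s), []).append((('F', f), parity))
    # Assign each node a side; a clash witnesses a lambda-odd cycle.
    side = {}
    for start in adj:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, parity in adj[u]:
                if v not in side:
                    side[v] = side[u] ^ parity
                    queue.append(v)
                elif side[v] != side[u] ^ parity:
                    return False   # lambda-odd cycle: M is not error-free
    return True

When the test succeeds, the computed sides of the fragment nodes directly give a bipartition of F into the two haplotypes.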
We finally prove Theorem 12 of Section 3.1.
Theorem 16. The problems MFR and MSR are APX-hard.
Proof: Given a graph G, the problem (minEdgeBipartizer) of finding a minimum-cardinality set of edges F such that G \ F is bipartite, and the problem (minNodeBipartizer) of finding a minimum-cardinality set of nodes Z such that G \ Z is bipartite, are both known to be APX-hard [10]. Moreover, the same problems are not known to be in APX. We give simple L-reductions from these two problems to MFR and MSR, showing not only the APX-hardness of MFR and MSR, but also that finding a constant-ratio approximation algorithm for any of these problems in general is somewhat of a challenge.
Given an input graph G as an instance of either minEdgeBipartizer or minNodeBipartizer, we subdivide each edge into two edges in series. (The resulting graph G' is clearly bipartite; we call the new nodes SNPs and the old ones fragments.) Now, of the two edges incident with every SNP, we declare one to be odd and one to be even. Clearly, solving or approximating MSR on (G', λ) amounts to solving or approximating (within exactly the same approximation ratio) minEdgeBipartizer on G. Moreover, solving or approximating MFR on (G', λ) amounts to solving or approximating (within exactly the same approximation ratio) minNodeBipartizer on G.
5 Polynomial Algorithms for the Gapless Case
In this section we prove Corollaries 9 and 11 of Section 3.1. Although they would follow from the more general theorems for matrices with bounded-length gaps (described in Section 6), we prove them here directly because it is didactically better to do so. In fact, the results of Section 6 will be obtained as generalizations of the ideas described in this section.
5.1 MSR: A Dynamic Programming O(mn^2) Algorithm
In this section we propose a dynamic programming approach for the solution of MSR. The resulting algorithm can be coded so as to take O(mn^2) time. In the following, we assume M to be S-reduced.
It is easier to understand our dynamic program if we state it for the complementary problem of MSR, i.e., find the maximum number of SNPs that can be kept so that M becomes error-free. Clearly, if k is the largest number of SNPs that we can keep, then n − k is the smallest number of SNPs to remove.
For j ≤ n (with j ≥ 0), we define K[j] as the maximum number of columns that can be kept to make M error-free, under the condition that j is the rightmost column kept (if all columns are removed then j = 0, and K[0] = 0).
Once all the K[j] are known, the solution to the problem is given by

    max_{j ∈ {0,...,n}} K[j].

For every j, we define OK(j) as the set of those i with i < j such that columns i and j do not conflict. We assume that 0 belongs to OK(j) for every j. Now, for every j,

    K[j] := 1 + max_{i ∈ OK(j)} K[i]     (1)

where Equation (1) is correct by the following easily proven fact.
Lemma 17. Let M be a gapless S-reduced matrix. Consider columns a < b < c ∈ S. If a is not in conflict with b and b is not in conflict with c, then a is not in conflict with c.
Proof: Assume SNPs a and c to be in conflict, that is, there exist fragments f and g such that M[f, a], M[g, a], M[f, c], M[g, c] ≠ − and the boolean value (M[f, a] = M[g, a]) is the negation of (M[f, c] = M[g, c]). Since a < b < c and M is gapless, M[f, b], M[g, b] ≠ −. Therefore, (M[f, b] = M[g, b]) is either the negation of (M[f, a] = M[g, a]) or the negation of (M[f, c] = M[g, c]). That is, b is in conflict with either a or c.
Note that for computing the entries K[j] we only need to know the sets OK(j). Note that OK(j) is precisely the set of those i < j which are not neighbors of j in the SNP conflict graph G_S(M). Since determining whether two SNPs are in conflict can be done in time O(m), the cost of creating the sets OK(j) is O(mn^2). This dominates the cost O(n^2) of solving all equations (1).
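For illustration only (this is our sketch, not the authors' code), the conflict test and Equation (1) can be realized in Python as follows; rows of M are dictionaries from column indices 1..n to 'A'/'B', holes omitted.

def in_conflict(M, s, t):
    # Columns s and t conflict iff two fragments covering both columns agree
    # on one of them and disagree on the other (cost O(m) per pair).
    pats = [(row[s], row[t]) for row in M if s in row and t in row]
    return any((p[0] == q[0]) != (p[1] == q[1]) for p in pats for q in pats)

def msr_max_keep(M, n):
    # K[j]: maximum number of columns kept with j the rightmost kept column.
    K = [0] * (n + 1)                  # K[0] = 0 is the all-removed base case
    for j in range(1, n + 1):
        # by Lemma 17 it suffices to check j against one previous kept column
        K[j] = 1 + max(K[i] for i in range(j)
                       if i == 0 or not in_conflict(M, i, j))
    return max(K)                      # n minus this = minimum SNPs to remove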
5.2 MFR: A Dynamic Programming O(m^2 n + m^3) Algorithm
In this section we propose a dynamic programming approach for the solution of MFR. We remark that, contrary to the approach suggested in [8], nested fragments will not be a problem. The resulting algorithm can be coded so as to take O(m^2 n + m^3) time.
Given a row f of M, we denote by l(f) the index of the leftmost SNP s such that M[f, s] ≠ − and by r(f) the index of the rightmost SNP s such that M[f, s] ≠ −. In other words, the body of the fragment f is entirely contained between the SNPs l(f) and r(f). We assume that the rows of M are ordered so that l(i) ≤ l(j) whenever i < j. For every i ∈ {1, ..., m}, let M_i be the matrix made up of the first i rows of M. For h, k ≤ i (with h, k ≥ −1) such that r(h) ≤ r(k), we define D[h, k; i] as the minimum number of rows to remove to make M_i error-free, under the condition that
- row k is not removed and, among the non-removed rows, maximizes r(k);
- row h is not removed, goes into the opposite haplotype from k and, among such rows, maximizes r(h).
(If all rows are removed then h = −1, k = 0, and D[−1, 0; i] := i. Rows −1 and 0 are all −, that is, empty.)
Once all the D[h, k; i] are known, the solution to the MFR problem is given by

    min_{h,k : r(h) ≤ r(k)} D[h, k; m].
Clearly, for every i, and for every h, k < i with r(h) ≤ r(k),

    D[h, k; i] := | D[h, k; i−1]        if r(i) ≤ r(k) and rows i and k agree,
                  | D[h, k; i−1]        if r(i) ≤ r(h) and rows i and h agree,
                  | D[h, k; i−1] + 1    otherwise,                                (2)

where Equation (2), as well as Equations (3) and (4), finds its explanation in the following easily proven fact.
Lemma 18. Consider rows a, b, c ∈ F. Assume a, b < c and r(a) ≤ r(b). If a agrees with b and b agrees with c, then a agrees with c.
For every i, we define OK(i) as the set of those j with j < i such that rows i and j agree. We assume that 0 and −1 belong to OK(i) for every i. (Just think of appending a pair of all-− rows at the top of M.)
Now, for every i, and for every h < i with r(h) ≤ r(i),

    D[h, i; i] := min_{j ∈ OK(i), j ≠ h, r(j) ≤ r(i)}  | D[j, h; i−1]   if r(j) ≤ r(h),
                                                       | D[h, j; i−1]   if r(h) < r(j).     (3)

Finally, for every i, and for every k < i with r(k) ≥ r(i),

    D[i, k; i] := min_{j ∈ OK(i), j ≠ k, r(j) ≤ r(i)} D[j, k; i−1].     (4)
Note that for computing the entries D[h, k; i] we only need to know the sets OK(i). The cost of creating the OK(i) data structure (done in a first phase) is O(m^2 n). The cost of computing the entries D[h, k; i] (done in a second phase) is O(m^3), since it can be seen as the cost of computing the O(m^3) entries D[h, k; i] by using Equation (2) (cost O(1) each), plus the cost of computing the O(m^2) entries D[h, i; i] and D[i, k; i] by using Equations (3) and (4) (cost O(m) each).
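The first phase is equally simple to realize; the Python sketch below (ours; same row representation as before, rows assumed pre-sorted by l()) computes l(f), r(f) and the sets OK(i) in O(m^2 n) time.

def mfr_first_phase(M):
    l = [min(row) for row in M]        # leftmost non-hole column of each row
    r = [max(row) for row in M]        # rightmost non-hole column of each row

    def agree(f, g):
        # rows agree iff they coincide wherever both are non-hole
        return all(M[g][s] == v for s, v in M[f].items() if s in M[g])

    # OK[i]: rows j < i agreeing with row i; the dummy all-hole rows -1 and 0
    # of the text agree with everything and are handled by the DP base case.
    OK = [[j for j in range(i) if agree(j, i)] for i in range(len(M))]
    return l, r, OK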
6 Dealing with Gaps
In this section we propose a practical approach to deal with any number of gaps,
when the body of each fragment does not contain many holes. For the remainder
of this section, let k be a constant such that the body of each fragment in the
input instance contains at most k holes. We will derive dynamic programming-
based polynomial (for a constant k) algorithms for both MFR and MSR, proving
Theorems 8 and 10 of Section 3.
6.1 MSR: An O(mn^{2k+2}) Algorithm
We modify the basic dynamic programming approach for MSR introduced in Section 5.1. More precisely, the new dynamic programming algorithm for MSR will now consider 2k columns of history. We remind the reader that we show how to determine the largest number of SNPs that can be kept to make M error-free, which is equivalent to solving MSR. The resulting algorithm can be coded so as to take O(mn^{2k+2}) time. In the following, we assume M to be S-reduced.
We say that SNPs s_1, ..., s_t are consistent when no two of them are in conflict. For consistent j_1 < ... < j_{2k+1} ≤ n (with j_1 ≥ −2k), we define K[j_1, ..., j_{2k+1}] as the maximum number of columns that can be kept to make M error-free, under the condition that j_1, ..., j_{2k+1} are the 2k+1 rightmost columns kept (if all columns are removed then j_{2k+1} = 0, and K[−2k, ..., −1, 0] = 0).
Once all the K[...] are known, the solution to the problem is given by

    max_{−2k ≤ j_1 < ... < j_{2k+1} ≤ n} K[j_1, ..., j_{2k+1}].
For i ≤ n and consistent j_1 < ... < j_{2k} < i, we define OK(j_1, ..., j_{2k}, i) as the set of those j < j_1 such that columns j, j_1, ..., j_{2k}, i are consistent. Now, for every consistent j_1 < ... < j_{2k} < i,

    K[j_1, ..., j_{2k}, i] := 1 + max_{j ∈ OK(j_1, ..., j_{2k}, i)} K[j, j_1, ..., j_{2k}]     (5)

where Equation (5) is correct by the following easily proven fact.
Lemma 19. Let M be an S-reduced matrix where each fragment contains at most k holes. Consider columns a < j_1 < ... < j_{2k} < c ∈ S. Assume columns a, j_1, ..., j_{2k} are consistent. Assume further that columns j_1, ..., j_{2k}, c are consistent. Then columns a, j_1, ..., j_{2k}, c are consistent.
Proof: Assume, on the contrary, that a and c are in conflict. Let f and g be fragments such that M[f, a], M[g, a], M[f, c], M[g, c] ≠ − and the boolean value (M[f, a] = M[g, a]) is the negation of (M[f, c] = M[g, c]).
Since each row has at most k holes, on any 2k+1 of the considered columns two rows must have a common non-hole, unless one of the considered columns is outside the body of one of the two rows. Since a and c are both in the body of f, the columns j_1, ..., j_{2k} are all in the body of f; similarly, j_1, ..., j_{2k} are all in the body of g. Hence there is a b ∈ {j_1, ..., j_{2k}} such that M[f, b], M[g, b] ≠ −. Therefore, (M[f, b] = M[g, b]) is either the negation of (M[f, a] = M[g, a]) or the negation of (M[f, c] = M[g, c]). That is, b is in conflict with either a or c, contradicting the assumed consistencies.
Note that for computing the entries K[j_1, ..., j_{2k}, i] we only need to know the sets OK(j_1, ..., j_{2k}, i). The cost of creating the OK(j_1, ..., j_{2k}, i) (done in a first phase) is O(mn^{2k+2}), which dominates the cost O(n^{2k+2}) of solving all equations (5).
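To make the windowed recurrence concrete, here is a Python transliteration of Equation (5) (ours; dictionary-keyed rather than tuned, and relying on the in_conflict test of Section 5.1). Dummy columns −2k, ..., 0 pad the initial window; for k = 0 the sketch degenerates to the DP of Section 5.1.

def msr_max_keep_with_gaps(M, n, k):
    # K maps a window (j_1, ..., j_{2k+1}) of the rightmost kept columns to
    # the maximum number of kept columns so far.
    w = 2 * k
    K = {tuple(range(-w, 1)): 0}       # K[-2k, ..., -1, 0] = 0
    for i in range(1, n + 1):          # consider keeping column i
        for window in list(K):
            # extend windows ending before i whose real columns are all
            # consistent with i (pairs inside a window are consistent already)
            if window[-1] < i and all(c <= 0 or not in_conflict(M, c, i)
                                      for c in window):
                new = window[1:] + (i,)
                K[new] = max(K.get(new, 0), K[window] + 1)
    return max(K.values())             # n minus this = minimum SNPs to remove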
6.2 MFR: An O(2^{2k} nm^2 + 2^{3k} m^3) Algorithm
We show how to extend the dynamic programming approach given in Section 5.2 to solve problem instances with holes in O(2^{2k} nm^2 + 2^{3k} m^3) time. We point out that the form of the necessarily exponential dependence on k is a very good one, i.e., it shows the fixed-parameter tractability of the problem. The memory requirement is O(2^{2k} m^3).
Let f be a fragment and let x ∈ {A, B}^k. We denote by f[x] the fragment obtained from f by filling in the holes one by one, using the first characters of x. Since we assumed that the body of each fragment in our input instance contains at most k holes, the characters in x always suffice to fill in all the holes of f. Given x_1, x_2 ∈ {A, B}^k and two rows f_1, f_2 of matrix M, we denote by M[f_1[x_1], f_2[x_2]] the matrix obtained from M by substituting f_1 with f_1[x_1] and f_2 with f_2[x_2]. The following consideration suggests how to extend the dynamic programming algorithm given in Section 5.2 to the case with holes:
Proposition 20. Let F_1 = {f_1^1, ..., f_1^p} and F_2 = {f_2^1, ..., f_2^q} be sets of fragments in M such that any two fragments in F_i (i = 1, 2) agree. Then, for every i ≤ p and j ≤ q, we can give x_1^i, x_2^j ∈ {A, B}^k such that F_1' = {f_1^1, ..., f_1^i[x_1^i], ..., f_1^p} and F_2' = {f_2^1, ..., f_2^j[x_2^j], ..., f_2^q} would still both be without conflicts.
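In code, the fill-in operation f[x] is immediate; the sketch below (ours) represents the body of a fragment as a list with None marking holes.

def fill_holes(body, x):
    # f[x]: replace the holes of f, left to right, by the first characters of x
    it = iter(x)
    return [next(it) if v is None else v for v in body]

# e.g., fill_holes(['A', None, 'B', None], 'BA') == ['A', 'B', 'B', 'A']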
We assume that the rows of M are ordered so that l(i) ≤ l(j) whenever i < j. For every i ∈ {1, ..., m}, let M_i be the matrix made up of the first i rows of M. For h, k ≤ i (with h, k ≥ −1) such that r(h) ≤ r(k), and for x, y ∈ {A, B}^k, we define D[h, x; k, y; i] as the minimum number of rows to remove to make M_i[h[x], k[y]] error-free, under the condition that
- row k[y] is not removed and, among the non-removed rows, maximizes r(k);
- row h[x] is not removed, goes into the opposite haplotype from k[y] and, among such rows, maximizes r(h).
(If all rows are removed then h = −1, k = 0, and D[−1, x; 0, y; i] = i for all x, y ∈ {A, B}^k.)
Once all the D[h, x; k, y; i] are known, the solution to the problem is given by

    min_{x,y ∈ {A,B}^k; h,k : r(h) ≤ r(k)} D[h, x; k, y; m].
Clearly, for every i, for every h, k < i with r(h) ≤ r(k), and for every x, y ∈ {A, B}^k,

    D[h, x; k, y; i] := | D[h, x; k, y; i−1]        if r(i) ≤ r(k) and rows i and k[y] agree,
                        | D[h, x; k, y; i−1]        if r(i) ≤ r(h) and rows i and h[x] agree,
                        | D[h, x; k, y; i−1] + 1    otherwise.                                 (6)
For every fragment i and for every x ∈ {A, B}^k, we define OK(i, x) as the set of those pairs (j, y) such that j is a fragment with j < i, y ∈ {A, B}^k, and rows i[x] and j[y] agree. Now, for every i, for every h < i with r(h) ≤ r(i), and for every x_i, x_h ∈ {A, B}^k,

    D[h, x_h; i, x_i; i] := min_{(j,x_j) ∈ OK(i,x_i), j ≠ h, r(j) ≤ r(i)}  | D[j, x_j; h, x_h; i−1]   if r(j) ≤ r(h),
                                                                           | D[h, x_h; j, x_j; i−1]   if r(h) < r(j).     (7)
Finally, for every i, for every k < i with r(k) ≥ r(i), and for every x_i, x_k ∈ {A, B}^k,

    D[i, x_i; k, x_k; i] := min_{(j,x_j) ∈ OK(i,x_i), j ≠ k, r(j) ≤ r(i)} D[j, x_j; k, x_k; i−1].     (8)
Note that for computing the entries D[h, x; k, y; i] we only need to know the sets OK(i, x). The cost of creating the OK(i, x) data structure (done in a first phase) is O(2^{2k} nm^2). The cost of computing the entries D[h, x; k, y; i] (done in a second phase) is O(2^{3k} m^3), since it can be seen as the cost of computing the O(2^{2k} m^3) entries D[h, x; k, y; i] with h < k < i by using Equation (6) (cost O(1) each), plus the cost of computing the O(2^{2k} m^2) entries of the form D[h, x; i, y; i] and D[i, x; k, y; i] by using Equations (7) and (8) (cost O(2^k m) each).
References
1. K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs and graph planarity using PQ-tree algorithms. J. Comput. System Sci., 13:335-379, 1976.
2. A. Chakravarti. It's raining SNPs, hallelujah? Nature Genetics, 19:216-217, 1998.
3. A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7:111-122, 1990.
4. D. Gusfield. A practical algorithm for optimal inference of haplotypes from diploid populations. In R. Altman, T.L. Bailey, P. Bourne, M. Gribskov, T. Lengauer, I.N. Shindyalov, L.F. Ten Eyck, and H. Weissig, editors, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 183-189, Menlo Park, CA, 2000. AAAI Press.
5. D. Gusfield. Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In G. Myers, S. Hannenhalli, S. Istrail, P. Pevzner, and M. Waterman, editors, Proceedings of the Sixth Annual International Conference on Computational Biology, pages 166-175, New York, NY, 2002. ACM Press.
6. L. Helmuth. Genome research: Map of the human genome 3.0. Science, 293(5530):583-585, 2001.
7. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001.
8. G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. SNPs problems, complexity and algorithms. In Proceedings of the Annual European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science, pages 182-193. Springer, 2001.
9. R. Lippert, R. Schwartz, G. Lancia, and S. Istrail. Algorithmic strategies for the SNPs haplotype assembly problem. Briefings in Bioinformatics, 3(1):23-31, 2002.
10. C. Lund and M. Yannakakis. The approximation of maximum subgraph problems. In Proceedings of the 20th Int. Colloquium on Automata, Languages and Programming, pages 40-51. Springer-Verlag, 1994.
11. E. Marshall. Drug firms to create public database of genetic mutations. Science Magazine, 284(5413):406-407, 1999.
12. J.C. Venter et al. The sequence of the human genome. Science, 291:1304-1351, 2001.
13. J. Weber and E. Myers. Human whole-genome shotgun sequencing. Genome Research, 7:401-409, 1997.
Methods for Inferring Block-Wise Ancestral
History from Haploid Sequences
The Haplotype Coloring Problem
Russell Schwartz^1, Andrew G. Clark^{1,2}, and Sorin Istrail^{1,*}

1 Celera Genomics, Inc., 45 West Gude Dr., Rockville, MD 20850 USA
Phone: (240) 453-3668, Fax: (240) 453-3324
{Russell.Schwartz,Sorin.Istrail}@celera.com
2 Department of Biology, Pennsylvania State University, University Park, PA 16802 USA
[email protected]

* To whom correspondence should be addressed
Abstract. Recent evidence for a blocky haplotype structure to the human genome and for its importance to disease inference studies has created a pressing need for tools that identify patterns of past recombination in sequences of samples of human genes and gene regions. We present two new approaches to the reconstruction of likely recombination patterns from a set of haploid sequences, each of which combines combinatorial optimization techniques with statistically motivated recombination models. The first breaks the problem into two discrete steps: finding recombination sites, then coloring sequences to signify the likely ancestry of each segment. The second poses the problem as optimizing a single probability function for parsing a sequence in terms of ancestral haplotypes. We explain the motivation for each method, present algorithms, show their correctness, and analyze their complexity. We illustrate and analyze the methods with results on real, contrived, and simulated datasets.
1 Introduction
The sequencing of the human genome [12,25] has created a tremendous opportunity for medical advances through the discovery of genetic predictors of disease. So far, though, catalogs of the genetic differences between individuals have proven difficult to apply. Examined in isolation, these differences - which occur predominantly in the form of isolated changes called single nucleotide polymorphisms (SNPs) - may fail to distinguish real relationships from the background noise millions of SNPs produce. Recent analysis of the structure of the human genome [14] has given hope that greater success will be achieved through studies of haplotypes, sets of alleles from all SNPs in a region that tend to travel together through evolutionary history.
[Figure 1: three panels, A-C, of aligned example sequences.]
Fig. 1. An illustration of the value of haplotype blocks. A: A hypothetical population sample of a set of polymorphic sites. B: A pattern of haplotype blocks inferred from the population sample. C: The results of a hypothetical assay conducted on additional individuals based on the block patterns. If the original sample adequately captured the full population variability, then typing four sites per individual would be sufficient to determine their block patterns, allowing inference of their untyped sites.
Several recent studies [3,13,20] have suggested that the human genome consists largely of blocks of common SNPs organized in haplotypes separated by recombination sites, such that most human chromosome segments have one of a few possible sets of variations. It may therefore be possible to classify most human genetic variation in terms of a small number of SNPs identifying the common haplotype blocks. If so, then determining the genome's haplotype structure and defining reduced SNP sets characterizing common haplotype variants could greatly reduce the time and cost of performing disease association studies without significantly reducing their power to find disease-related genes and genetic predictors of disease phenotypes. Figure 1 illustrates haplotype blocks and their potential value with a contrived example.
There is considerable prior work on detecting ancestral recombination events from a sample of gene sequences of known haplotype phase (i.e. haploid sequences). Some methods, such as those of Sawyer [22], Maynard Smith [17], and Maynard Smith and Smith [18], detect whether any recombination has occurred in a set of sequences. Others, such as the methods of Hudson and Kaplan [11] and Weiler [27], further attempt to find the locations of the recombination events. More difficult is assigning haplotypes and recombination patterns to individual sequences in a sample, as was done for example by Daly et al. [3] and Zhang et al. [29], which provides the information that would be necessary for haplotype-based LD mapping and associated inference studies. The ultimate goal of such methods would be reconstruction of the ancestral recombination graph from a set of sequences, a problem addressed by the methods of Hein [9], Kececioglu and Gusfield [15], and Wang et al. [26]. Simulation studies of recombination detection methods [28,21,6] suggest some room for improvement. One suggestion of these studies is that more adaptable methods and methods suited to special cases of recombination might have greater power in detecting recombination.
There is also a need for better integration of the statistical theory of recombination with the theory of algorithmic optimization methods. With the notable exception of the work of Kececioglu and Gusfield, there has been little interaction between these fields; it seems likely that progress will be best achieved at the intersection of the best models and the best methods for solving for them.
Our hope is to suggest methods that may help in better combining statistically motivated recombination models with combinatorial optimization methods.
We specifically address the problem of inferring, from a set of haploid sequences, the recombination patterns and regions of shared recent ancestry between sequences in the sample, similar to the computational problem approached by Daly et al. [3]. We formulate the task as two variants of what we call the haplotype coloring problem, the goal of which is to assign colors to regions of haploid sequences to indicate common ancestry. Thus, shared colors between sequences at a given site would indicate descent from a common haplotype at that site. Shared colors between sites in a single sequence would indicate descent of those sites from a common ancestral sequence undisrupted by recombination. The first method treats the problem as two discrete steps: locating haplotype blocks and coloring sequences to indicate likely ancestry of haplotypes. The second method performs all aspects of haplotype coloring as the optimization of a unified objective function. We also describe an iterative expectation maximization (EM) algorithm based on the second method to allow simultaneous inference of population haplotype frequencies and assignment of haplotypes to individual haploid sequences within a sample. In the remainder of this paper, we formalize the methods as computational problems, describe algorithms, prove their correctness, and analyze their efficiency. We also describe applications of the methods to contrived, real, and simulated data. Finally, we discuss implications of the methods and prospects for future work.
2 The Block-Color Method
Inferring ancestry in the presence of recombination can be decomposed into two distinct stages: infer the block pattern of the haplotypes, and color haplotypes in individual sequences in the way that is most likely to reflect the ancestry of the observed sequences. Related problems are commonly understood in terms of the neutral infinite sites model [16], which assumes that any site mutates at most once in the ancestral history of a genomic region. Under the assumptions of the infinite sites model, the only way that gametic types AB, Ab, aB, and ab can all be present in a sample is if recombination has generated the fourth gametic type from the others. Recurrent or back mutations could also produce the fourth gamete in reality, but are assumed not to occur under the infinite sites model. Thus, any pair of sites for which all four gametes are found can be inferred to have incurred a recombination between them at some time in the past [11]. Tallying all pairs of sites having four gametes allows one to infer a minimum number of recombination events necessary, giving rise to a two-step method for the haplotype coloring problem. First, identify maximal blocks of sites for which no pair has all four gametes and separate sequences within blocks into distinct haplotypes. Second, color sequences within each block to minimize the number of color changes across blocks over all sequences. When the infinite sites model is not obeyed, as is often the case with real data, recurrent mutation at a site also generates site-pairs with all four gametes, but the pattern of haplotypes is quite different. We would therefore like the method to be insensitive to minor changes between haplotypes generated by such recurrent mutation. Very rare variants and mistyping errors are also difficult to distinguish, and past recombination events may be obscured by subsequent mutation events. We thus perform a pre-filtering step that removes from consideration all polymorphic loci for which the minor variant occurs in less than a user-specified fraction of the total population.
The two-step method is meant to apply the well-understood four-gamete test to the broader problem of inferring the ancestral history of the set of sequences. The additional coloring stage provides a method for inferring which sequences were likely to have been formed through recombinations at identified recombination sites, allowing us to infer a history of the sequence sample and reduce the amount of information needed to specify an individual's haplotypes.
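In Python, the test at the heart of the first step reads as follows (a minimal sketch of the classical four-gamete test, ours, assuming phased sequences given as strings over the two alleles per site):

def four_gametes(seqs, i, j):
    # all four gametic types at sites i and j imply, under the infinite sites
    # model, a recombination between the two sites [11]
    return len({(s[i], s[j]) for s in seqs}) == 4

def incompatible_pairs(seqs):
    # every such pair must straddle a block boundary
    m = len(seqs[0])
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if four_gametes(seqs, i, j)]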
2.1 Identifying Blocks
Hudson and Kaplan [11] developed a method for finding the minimum number of blocks in a set of phased sequences such that the four-gamete constraint is satisfied within each block. Their method finds a minimum-size set of blocks for any set of phased sequences such that all blocks satisfy the four-gamete constraint. Gusfield [8] developed a method for testing for the equivalent perfect phylogeny constraint in unphased sequences. Gusfield's method can be applied straightforwardly to simultaneously infer haplotypes and a block structure by locating the minimal-size set of blocks consistent with a perfect phylogeny for each block. Within each block, haplotypes can be inferred efficiently using an earlier algorithm of Gusfield [7] for inferring phylogenetic trees. Either the Gusfield method for unphased data or the Hudson and Kaplan method for phased data can therefore be used to derive sets of blocks associated with haploid sequences, which provides the input necessary for the coloring stage of our block-color method.
Jeffreys et al. [13] suggested that within the hotspots of high recombination, the recombination events might localize to a few nearby markers rather than occurring at a single point. It may therefore be valuable to develop methods for enumerating or sampling optimal or near-optimal solutions, or finding common substructures among them, rather than finding a single optimum. We might also want to consider other tests for block viability than the four-gamete constraint that are more robust to the number of sequences, or more liberal or conservative in selecting which regions may be considered blocks. We have therefore also developed a slower but more general dynamic programming algorithm for the block-identification problem, similar to that used by Zhang et al. [29], which takes an arbitrary set of pair-wise constraints and constructs bottom-up a minimum-size partition into blocks such that every constraint spans a block boundary. Due to space limitations, we omit a detailed description of this algorithm.
2.2 Coloring
Our goal in block coloring is to assign colors to blocks such that each distinct haplotype on a given sequence region is assigned a different color, so as to minimize the total number of color changes between haplotypes in our sequence set.
[Figure 2: five panels, A-E, showing two blocks of sequences, the bipartite graph, the matching, and the resulting coloring.]
Fig. 2. An illustration of one step in the coloring algorithm. A: Two consecutive blocks in an example set of sequences. The sequences in the first block are assumed to have already been colored. B: The graph construction, with each node on the left representing one of the three haplotypes in the first block and each node on the right representing one haplotype in the second block. The uppermost edge (between the red and gray nodes) is thicker to represent the fact that two examples of that haplotype pair occur and the edge therefore has double weight. C: The solution to the maximum matching. D: The coloring of the right nodes implied by the matching. E: The translation of the coloring back into the sequences, yielding a coloring of the haplotypes in the second block that minimizes color changes between the blocks.
Intuitively, this procedure is meant to explain the data with as few recombinations as possible, providing a maximally parsimonious solution to the coloring problem given the assumption that recombination events are relatively rare. For the purposes of this analysis, we will decompose each sequence into a string of haplotypes arranged in blocks, with all sequences sharing the same block structure. We informally define a block to be an interval of polymorphic sites and the set of haplotypes that occur on that interval. Within each block, distinct haplotypes are assigned distinct colors and identical haplotypes are assigned identical colors. Where two consecutive haplotypes in a sequence are given the same color, it is implied that they are likely to have come from a common ancestral sequence. Where colors change between two haplotypes in a sequence, the implication is that a recombination event was likely involved in forming that sequence from two different ancestral sequences that were sources of the two haplotypes. In expressing the coloring stage as a computational problem, we more formally define the input as the following:
- B, a sequence b_1, ..., b_k of blocks, where each b_i is a set of haplotypes in a given interval of polymorphic sites. Element x of b_i is denoted b_{i,x}.
- S, a set of block-decomposed sequences s_1, ..., s_n, where each sequence s_i ∈ b_1 × ... × b_k has a multiplicity n_i. Element j of s_i is denoted s_{i,j}.
Let C be a set of positive integer colors. Then our output is a set of assignments X = χ_1, ..., χ_k, where each χ_i is a function from b_i to C such that b_{i,x} ≠ b_{i,y} implies χ_i(b_{i,x}) ≠ χ_i(b_{i,y}), for which we minimize the following objective function (expressing the sum of color changes):

    G = Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i · ID(χ_j(s_{i,j}) ≠ χ_{j+1}(s_{i,j+1})),   where ID(b) = 1 if b is true and 0 if b is false.
The function ID tests whether for a given i and j there is a color change in sequence i between blocks j and j+1. G thus counts the total number of color changes between consecutive blocks over all sequences. We will show two key properties of this problem that allow us to solve it efficiently. First, greedily coloring each block optimally in terms of the previous block (minimizing the number of color changes given the previous block's coloring) yields a globally optimal solution. Second, coloring each block is an instance of the weighted bipartite maximum matching problem, for which there exist efficient algorithms [5]. Figure 2 illustrates the resulting method, which we now examine.
Lemma 1. Any set X such that χ_{j+1} minimizes G_j(X) = Σ_{i=1}^{n} n_i · ID(χ_j(s_{i,j}) ≠ χ_{j+1}(s_{i,j+1})) given χ_j, for all j ∈ [1, k−1], will minimize the objective function G.
Proof. Assume we have a mapping X such that χ_{j+1} minimizes G_j(X) given χ_j for all j in [1, k−1]. Assume further, for the purposes of contradiction, that there exists another solution X' such that Σ_{j=1}^{k−1} G_j(X') < Σ_{j=1}^{k−1} G_j(X). Then there must be some smallest j such that G_j(X') < G_j(X). We can then create a new X'' such that X'' is identical to X up to position j and χ''_{j+1}(c) = χ''_j(c) if and only if χ'_{j+1}(c) = χ'_j(c). Then X'' must have the same cost as X' for the transition from region j to j+1, which is strictly lower than that for X on that transition. Thus, X could have chosen a better solution for the transition from j to j+1 given its solutions for all previous transitions, contradicting the assumption that X minimizes G_j(X) given χ_j for all j. Thus, X' cannot exist and X must minimize G.
Lemma 2. Finding the optimal χ_{j+1} given χ_j can be expressed as weighted maximum matching.
Proof. We prove the lemma by construction of the instance of weighted maximum matching. We first rewrite G as

    (k−1) Σ_{i=1}^{n} n_i − Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i · ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})).

Since (k−1) Σ_{i=1}^{n} n_i does not depend on X, minimizing our original objective function is equivalent to maximizing Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i · ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})). We create a bipartite graph B in which each node u_i in the first part corresponds to a haplotype c_{j,i} in block j and each node v_{i'} in the second part corresponds to a haplotype c_{j+1,i'} in block j+1. If any sequence has haplotypes c_{j,i} and c_{j+1,i'}, then we create an edge (u_i, v_{i'}) with weight

    Σ_{q=1}^{n} n_q · HAS_{j,i,i'}(s_q),   where HAS_{j,i,i'}(s) = 1 if s_j = c_{j,i} and s_{j+1} = c_{j+1,i'}, and 0 otherwise.

A matching in B corresponds to a set of edges pairing haplotypes in block j with haplotypes in block j+1. We construct X so that it assigns the same color to c_{j+1,i'} as was assigned to c_{j,i} if and only if the matching of B selects the edge between the nodes corresponding to those two haplotypes. Any given matching will have a weight equal to the sum of the frequencies of the sequences sharing both haplotypes of each pair whose corresponding nodes are connected by the matching. Thus the coloring corresponding to a maximum matching will maximize Σ_{i=1}^{n} n_i · ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})), which yields an optimal assignment of χ_{j+1} given χ_j.
Lemmas 1 and 2 imply the following algorithm for optimal coloring, which we refer to as Algorithm 1:

    create an arbitrary assignment χ_1 of haplotypes to distinct colors in block 1
    for i = 2 to m
        construct an instance of weighted maximum matching as described above
        solve the instance with the algorithm of Edmonds [5]
        for each pair (c_{i−1,j}, c_{i,k}) joined in the matching, assign χ_i(c_{i,k}) = χ_{i−1}(c_{i−1,j})
        for each c_{i,k} that is unmatched, assign an arbitrary unused color to χ_i(c_{i,k})
Theorem 1. Algorithm 1 produces a coloring of haplotype blocks minimizing the number of color changes across sequences in time O(mn^2 log n).
Proof. The proof of correctness follows directly from Lemmas 1 and 2. Creating an instance of weighted maximum matching and assigning colors requires O(n) time. The run time for each iteration of i is therefore dominated by the O(n^2 log n) run time of the maximum matching algorithm for this type of dataset (where the number of non-zero edges of the graph is bounded by n). There are O(m) rounds of computation, yielding a total run time of O(mn^2 log n).
Using the Hudson and Kaplan algorithm for block assignment gives the block-color method an overall complexity of O(nm^2 + mn^2 log n) for n sequences and m polymorphic sites. In practice, the input data would typically come from resequencing some number of individuals in one sequenced gene - with a bound on m and n typically on the order of 100 - yielding run times well within what can be handled by standard desktop computers.
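As a concrete illustration, the following Python sketch (ours) implements Algorithm 1, substituting SciPy's linear_sum_assignment for Edmonds' algorithm as the weighted bipartite matching solver; with nonnegative weights and implicit zero-weight edges, a maximum-weight assignment coincides with a maximum-weight matching.

import numpy as np
from scipy.optimize import linear_sum_assignment

def color_blocks(seqs, counts):
    # seqs: block-decomposed sequences (lists of per-block haplotype labels);
    # counts: the multiplicity n_i of each sequence; colors are integers.
    k = len(seqs[0])
    haps = [sorted({s[j] for s in seqs}) for j in range(k)]
    chi = [{h: c for c, h in enumerate(haps[0])}]   # arbitrary colors, block 1
    fresh = len(haps[0])
    for j in range(1, k):
        prev, cur = haps[j - 1], haps[j]
        W = np.zeros((len(prev), len(cur)))
        for s, n in zip(seqs, counts):              # weighted co-occurrence
            W[prev.index(s[j - 1]), cur.index(s[j])] += n
        rows, cols = linear_sum_assignment(W, maximize=True)
        cmap = {}
        for a, b in zip(rows, cols):
            if W[a, b] > 0:                         # carry matched colors over
                cmap[cur[b]] = chi[j - 1][prev[a]]
        for h in cur:                               # fresh colors for the rest
            if h not in cmap:
                cmap[h], fresh = fresh, fresh + 1
        chi.append(cmap)
    return chi      # chi[j][h]: color of haplotype h in block j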
3 The Alignment Method
Although solving the problem in well-defined stages has advantages, it may also be fruitful to find a unified global solution to all aspects of the problem. Our hope is that such an approach will help in finding more biologically meaningful probabilistic models that capture the essential features of the system but are computationally tractable. We therefore developed a second approach based on techniques from sequence alignment. Sequence alignment can be viewed as assigning to each position of the target sequence a frame of a reference sequence, with changes of frame representing gaps. We can analogously align a single target to multiple references, but instead of shifting frames to model gaps, shift reference sequences to model recombination. Figure 3 illustrates this analogy.
[Figure 3: two panels, A and B, each showing a query sequence aligned to three reference sequences.]
Fig. 3. An illustration of the analogy between sequence alignment and haplotype coloring as variants of the problem of parsing a query sequence in terms of a set of reference sequences. Each sub-figure shows a query sequence (bottom) aligned to three reference sequences (top). A: sequence alignment as parsing of a sequence in terms of a set of identical reference sequences in distinct frames; B: haplotype coloring as parsing a sequence in terms of a set of distinct reference sequences in identical frames.
This approach is similar to the Recombination Cost problem of Kececioglu and Gusfield [15] and the jumping alignments of Spang et al. [23]. We, however, assume we have a population sequenced at specific polymorphic sites and therefore need not consider insertion/deletion costs. Deletion polymorphisms can be treated simply as additional allowed symbols. Dealing with missing data is more complicated, but can be handled through the use of a special symbol representing an undefined site, which can match any other symbol. Bayesian methods, similar to that of Stephens et al. [24], might also be used to impute missing data.
A major advantage of this technique over our other method is that it does not assume that there are recombination hotspots or a block structure. It may therefore be better suited to testing that assumption or examining genomes or genome regions in which it proves to be inapplicable. It should also allow us to distinguish recent recombination sites affecting a small fraction of the sequenced population from more ancient or frequently recombining sites, and it is easily parameterized to be more or less sensitive to recent mutations in identifying haplotypes. Among its disadvantages are that it requires an additive objective function and therefore uses a probability model different from those traditionally used in LD studies, and that it is parameterized by values that may be unknown and difficult to estimate. In addition, the function being optimized is harder to understand intuitively than those used in the block-color method, making the alignment method harder to judge and improve upon.
Our probability model parses a sequence as a string of haplotype identifiers describing the ancestral source of each of its polymorphic sites. A given sequence chooses its first value from a distribution of haplotypes at the first polymorphic position. Each subsequent polymorphic site may follow a recombination event with some probability. If there is no recombination event, then the sequence continues with the same haplotype as it had at the prior site. Otherwise, the sequence samples among all available haplotypes according to a site-specific distribution for the new site. There is also a mutation probability that any given site will be mutated from that of its ancestral haplotype. This model leads to the following formalization of the inputs:
- m, the log probability of a mutation event at any one site of a sequence
- r, the log probability of a recombination event between any two sites
- S = s_1, ..., s_n, a set of n reference sequences. Each s_i is a sequence of l polymorphic sites s_{i1}, ..., s_{il}
- F, an n × l matrix of log frequencies, in which f_{ij} specifies the log probability of choosing haplotype i at site j following a recombination immediately prior to that site
- Θ = θ_1, ..., θ_t, a set of t target sequences. Each θ_i is a sequence of l polymorphic values θ_{i1}, ..., θ_{il}
Our goal is to produce a t × l matrix H, where h_{ij} specifies which haplotype from the set [1, n] has been assigned to position j of target sequence θ_i, maximizing the following objective function:
    G(H) = Σ_{i=1}^{t} f_{h_{i1},1}
         + Σ_{i=1}^{t} Σ_{j=2}^{l} (f_{h_{ij},j} + r) · D(h_{i,j}, h_{i,j−1})
         + Σ_{i=1}^{t} Σ_{j=2}^{l} log((1 − e^r) + e^{r + f_{h_{ij},j}}) · (1 − D(h_{i,j}, h_{i,j−1}))
         + Σ_{i=1}^{t} Σ_{j=1}^{l} M(s_{h_{i,j},j}, θ_{i,j}),

    where D(a, b) = 0 if a = b and 1 if a ≠ b, and M(a, b) = 0 if a = b and m if a ≠ b.

G(H) gives a normalized log probability of the assignment H given the sequences and haplotype frequencies. The first sum reflects the probabilities of choosing different starting haplotypes. The next two sums reflect the contributions of, respectively, choosing to recombine at each site or choosing not to recombine. The final sum gives the contribution to the probability of mismatches.
The function implies some assumptions about the independence of different events whose validity must be considered. The assumptions that probabilities of recombination and mutation are independent and identical at all sites are imperfect but may be reasonable a priori in the absence of additional information. It is less clearly reasonable to assume that the selection of recombination positions and the choices of haplotypes between them can be considered independent, although this too may be reasonable absent additional information.
It is important to note that there is no unique correct pair of r and m parameters for a given genome or gene region. The right parameters depend on how much tolerance is desired in allowing slightly different sequences to be considered identical haplotypes, analogous to the need for different sensitivities in sequence similarity searches conducted for different purposes. We can thus propose that the value of m should be determined by the particular application of the method, with r then being an unknown that depends on both m and the nature of the genome region under examination.
3.1 The Alignment Algorithm
We maximize G(H) with a dynamic programming algorithm. The following formula describes the basic dynamic programming recursion for assigning haplotypes to a single sequence θ:

    C_θ(i, j) = max_k | C_θ(i, j−1) + log((1 − e^r) + e^{r + f_{ij}})   if k = i
                      | C_θ(k, j−1) + r + f_{ij}                        if k ≠ i
                + (0 if θ_j = s_{ij}, m if θ_j ≠ s_{ij}).
C_θ(i, j) represents the cost of the optimal parse of sequence θ up to position j, ending with an assignment to haplotype i at position j. The right-hand term accounts for the mismatch penalty, if any, between θ and haplotype i at position j. The full algorithm follows directly from the recurrence. The following pseudocode describes the complete algorithm, which we call Algorithm 2:
    for i = 1 to n: if (θ_1 = s_{i1}) then C_θ[i, 1] ← f_{i1} else C_θ[i, 1] ← f_{i1} + m
    for j = 2 to l: for i = 1 to n:
        best ← C_θ[i, j−1] + log((1 − e^r) + e^{r + f_{ij}}); argbest ← i
        for k = 1 to n, k ≠ i:
            if (C_θ[k, j−1] + r + f_{ij} > best)
                best ← C_θ[k, j−1] + r + f_{ij}; argbest ← k
        if (θ_j = s_{ij}) then C_θ[i, j] ← best else C_θ[i, j] ← best + m
        P_θ[i, j] ← argbest
    best ← −∞
    for i = 1 to n: if C_θ[i, l] > best then best ← C_θ[i, l]; H_θ[l] ← i
    for j = l−1 downto 1: H_θ[j] ← P_θ[H_θ[j+1], j+1]
Lemma 3. C_θ[i, j] is the optimal cost of any assignment of positions 1 through j of θ to haplotypes in S such that position j of θ is assigned to haplotype i.
Proof. We prove the statement by induction on j. For the base case of j = 1, each C_θ[i, 1] is uniquely determined by the log frequency f_{i1} plus a mutation penalty m if θ_1 and s_{i1} do not match. The first for loop sets each C_θ[i, 1] accordingly, satisfying the inductive hypothesis. Now assume the lemma is true for j−1. Writing h_1, ..., h_j for an assignment of positions 1 through j of θ with h_j = i, we can decompose C_θ[i, j] into two terms, A_θ[i, j] + B_θ[i, j], where

    A_θ[i, j] = f_{h_1,1} + Σ_{j'=2}^{j−1} (f_{h_{j'},j'} + r) · D(h_{j'}, h_{j'−1})
              + Σ_{j'=2}^{j−1} log((1 − e^r) + e^{r + f_{h_{j'},j'}}) · (1 − D(h_{j'}, h_{j'−1}))
              + Σ_{j'=1}^{j−1} M(s_{h_{j'},j'}, θ_{j'})

and

    B_θ[i, j] = (f_{h_j,j} + r) · D(h_j, h_{j−1})
              + log((1 − e^r) + e^{r + f_{h_j,j}}) · (1 − D(h_j, h_{j−1})) + M(s_{h_j,j}, θ_j).

A_θ is exactly the function optimized by C_θ[i, j−1]. B_θ depends only on the assignments at positions j and j−1; it is therefore optimized for a given assignment i at position j by maximizing it over all assignments at position j−1, which the algorithm does in deriving C_θ[i, j] in the second loop. C_θ[i, j] is thus the cost of the optimal assignment of positions 1 through j ending on haplotype i.
Theorem 2. Algorithm 2 will find an H maximizing G(H) for sets of n reference sequences S and t target sequences Θ of length l in O(n^2 lt) time.
Proof. It follows from Lemma 3 that C_θ[i, l] will be the cost of the optimal solution to the global problem terminating in haplotype i. Finding an i maximizing C_θ[i, l], as is done by the third outer loop, therefore yields the global optimum to the problem. The final loop performs backtracking, reconstructing the optimal solution H. Run time is dominated by an inner loop requiring constant time for each of O(n^2 l) iterations, for a total of O(n^2 lt) run time when run on t targets.
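For readers who prefer executable code, here is a direct Python transcription of Algorithm 2 (ours; zero-based indices, with r and m supplied as log probabilities):

import math

def align_one(theta, S, F, r, m):
    # theta: target of length l; S: n reference sequences; F[i][j]: log
    # frequency of haplotype i at site j. Returns H with H[j] the haplotype
    # assigned to position j of theta.
    n, l = len(S), len(theta)
    mis = lambda i, j: 0.0 if theta[j] == S[i][j] else m
    C = [[F[i][0] + mis(i, 0)] + [0.0] * (l - 1) for i in range(n)]
    P = [[0] * l for _ in range(n)]
    for j in range(1, l):
        for i in range(n):
            # stay on i: no recombination, or a recombination that re-picks i
            best = C[i][j - 1] + math.log(1 - math.exp(r) + math.exp(r + F[i][j]))
            arg = i
            for k in range(n):             # or switch from haplotype k to i
                if k != i and C[k][j - 1] + r + F[i][j] > best:
                    best, arg = C[k][j - 1] + r + F[i][j], k
            C[i][j], P[i][j] = best + mis(i, j), arg
    i = max(range(n), key=lambda a: C[a][l - 1])
    H = [0] * l
    H[l - 1] = i
    for j in range(l - 1, 0, -1):          # backtracking through P
        H[j - 1] = P[H[j]][j]
    return H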
The definition of the problem solved by the alignment method requires that we know in advance the frequencies from which haplotypes are sampled following a recombination event. As this information may not be available, we would like a way to estimate it from measured frequencies of full-length haploid sequences. We therefore developed an iterative method to optimize this probability by successively performing a discrete optimization of the coloring given the haplotype frequencies, followed by a continuous optimization of the frequencies given the coloring, which we perform through a steepest-descent search. The algorithm is a form of generalized expectation maximization (EM) algorithm [1,4] that treats the frequencies as hidden variables, repeatedly finding a maximum a posteriori probability (MAP) coloring H given the frequencies F, maximizing Pr(H|F) by Algorithm 2, then improving Pr(H|F) in terms of F by steepest descent. The number of rounds required for the resulting method to converge might theoretically be large, although we have found convergence in practice to occur reliably within ten iterations on real gene region data sets.
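A skeleton of the iteration, under our reading of the text (the re-estimation step below is a simple empirical-count stand-in for the steepest-descent search, which the paper does not spell out), reusing the align_one sketch above:

import math

def em_frequencies(targets, S, F, r, m, rounds=10, eps=1e-6):
    n, l = len(S), len(targets[0])
    for _ in range(rounds):                # ~10 rounds suffice in practice
        # MAP coloring given the frequencies (Algorithm 2, sketched above)
        H = [align_one(t, S, F, r, m) for t in targets]
        # stand-in M-step: re-fit site-specific log frequencies from H
        F = [[0.0] * l for _ in range(n)]
        for j in range(l):
            counts = [eps] * n
            for h in H:
                counts[h[j]] += 1
            total = sum(counts)
            for i in range(n):
                F[i][j] = math.log(counts[i] / total)
    return F, H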
4 Results
Both methods were implemented in C++. All tests were run on four-processor 500 MHz Compaq Alpha machines, although the code itself is serial.
We used a contrived data set, shown in Figure 4, to illustrate the strengths and weaknesses of the methods. We strongly penalized mutations in order to simplify the illustration. The block-color method correctly detects the recombination site, although the coloring appears suboptimal due to the constraint that identical haplotypes must have identical colors. The alignment method correctly identifies the recombinants, but only with an appropriate choice of parameters.
In order to demonstrate the methods and explore the parameter space, we further applied the methods to a real data set: the apolipoprotein E (APOE) gene region core sample of Nickerson et al. [19], a set of 72 individuals typed at 22 polymorphic sites, with full-length haplotypes determined by a combination of the Clark haplotype inference method [2] and allele-specific PCR to verify phases. Figure 5 demonstrates the effects on the block-color algorithm of varying tolerance to infrequent SNPs.
[Figure 4: five panels, A-E; panel B lists the extant sequences with their multiplicities.]
Fig. 4. A contrived sample problem. Different colors represent different predicted ancestral sequences. A: An ancestral recombination graph showing a proposed sequence ancestry; B: The extant sequences with frequencies chosen to make the recombinant and double recombinant relatively rare; C: The output of the block-color method screening out sites with minor allele frequencies below 0.1; D: The output of the alignment method with parameters r = −1 and m = −100; E: The output of the alignment method with parameters r = −2 and m = −100.
Moving from considering all SNPs in Figure 5A to considering only those with minor frequencies above 10% in Figure 5B, then 25% in Figure 5C, leads to progressively simpler block structures, with fewer blocks and less variability within them. Figure 6 illustrates the effects of varying recombination and mutation penalties for the alignment algorithm with the EM extension on the same dataset. Higher recombination penalties generally lead to greater numbers of haplotypes being assigned, while higher mutation penalties reveal more subtle regions of variation. We note that inferred haplotype regions do not necessarily line up at recombination hotspots, suggesting that the data might be more parsimoniously explained by not assuming the existence of discrete haplotype blocks with the same structure across all sequences.
While a real dataset can demonstrate the methods, it cannot rigorously validate them, as we do not definitively know the recombination history of any real gene region. We therefore resorted to simulated data to perform a partial test of the methods. Simulations were generated through Hudson's coalescent simulator [10], using populations of 50 individuals with 70 segregating sites and allowing recombination between any pair of sites. A simulated data set was generated for each recombination parameter ρ in the set {0, 1, 2, 5, 10, 20, 30}. We then calculated for the block-color method how many recombination sites were predicted, screening out sites with minor allele frequency below 0.1. We further calculated for the alignment method, with parameters m = −1.5 and r = −1.0, at how many sites at least one recombination event was predicted. Figure 7 shows the
[Figure 5: three panels, A-C.]
Fig. 5. Coloring of the APOE gene region [19] by the block-color method. A: coloring using all sites. B: coloring using sites with minor allele frequency above 10%. C: coloring using sites with minor allele frequency above 25%.
[Figure 6: nine panels, A-I.]
Fig. 6. Coloring of the APOE gene region [19] by the alignment-color method. Parameter values are A: r = −1.0, m = −2.0; B: r = −1.0, m = −4.0; C: r = −1.0, m = −6.0; D: r = −0.5, m = −2.0; E: r = −0.5, m = −4.0; F: r = −0.5, m = −6.0; G: r = 0.0, m = −2.0; H: r = 0.0, m = −4.0; I: r = 0.0, m = −6.0.
resulting plot. As the plot indicates, the number of detected recombination sites generally increases with increasing recombination rate, although the correlation is imperfect and grows more slowly than the increase in ρ. Of course, the process of simulating such data involves a high degree of stochasticity, so one does not expect a perfect correlation. For the block-color method, this result is consistent with the analogous experiments performed by Hudson and Kaplan [11].
5 Discussion
We have presented two new methods for detecting recombination patterns in sets of haploid sequences, phrased in terms of the problem of coloring sequences to reflect their ancestral histories. The first method uses parsimony principles to separately infer a minimum number of recombination sites capable of explaining the data, and then colors sequences between sites to denote likely ancestry. The second uses a discrete optimization method similar to those used in sequence alignment to find the most probable parse of a set of sequences in terms of haplotypes. We have also incorporated that technique into an iterative method using alternating discrete and continuous optimization to simultaneously infer haplotype frequencies and color sequences optimally given the inferred frequencies.
Each method has strengths that might make it more appropriate for certain cases. The block-color method creates a general algorithmic framework in which optimal solutions can be found efficiently for a range of possible tests of compatibility of pairs of sites with the assumption of no recombination.
[Figure 7: a plot of predicted recombination sites (0-10) against recombination rate (0-30), with one curve for the block-color method and one for the alignment method.]
Fig. 7. Predicted recombination sites versus coalescent parameter ρ for simulated data.
It thus might provide a useful general method for the evaluation and application of new statistical tests for recombination. The alignment method solves optimally for a single objective function, making it potentially more useful in testing and applying a unified probabilistic model of sequence evolution. It also does not rely on a prior assumption that haplotypes have a block structure and therefore might be more useful for testing hypotheses about the existence of such a block structure, finding instances in which it is not conserved, or processing data from organisms or gene regions that do not exhibit haplotype blocks. The EM algorithm may be independently useful in estimating haplotype block frequencies.
We can consider possible generalizations of the methods described above. It may be worthwhile to try other tests for deriving pair-wise constraints for the block-color method. As Hudson and Kaplan [11] note, the number of recombination events detected by the four-gamete constraint may be substantially smaller than the actual number of recombination sites, especially for small population sizes; their method finds a maximally parsimonious explanation for the data and will therefore miss instances in which recombination has occurred but has not yielded a violation of the four-gamete test, or in which the population examined does not contain sufficient examples to demonstrate such a violation. The Gusfield [8] method for diploid data can be expected to be similarly conservative, suggesting the value of pursuing more sensitive tests of recombination. The alignment model could be extended to handle position-specific mutation weights or recombination probabilities when an empirical basis is available for choosing them. It might also be possible to adapt the EM algorithm to infer mutation or recombination probabilities at the same time as it infers haplotype frequencies. In addition to providing greater versatility and accuracy, automating inference of the mutation and recombination rates might substantially improve ease-of-use.
Making the best use of sequencing data for understanding human diversity and applying that understanding to association studies will require, first, developing a more complete picture of the processes involved; second, building and validating statistical and probabilistic models that reliably capture that picture; and third, developing methods that can best interpret the available data given those models. While the preceding work is meant to suggest avenues and techniques for pursuing the overall goal of applying human genetic diversity data, all three of those steps remain far from resolved.
Acknowledgments
We thank Vineet Bafna, Clark Mobarry, and Nathan Edwards for their insights
into block identification and Hagit Shatkay and Ross Lippert for their advice and
guidance. We also thank Michael Waterman and various anonymous referees for
their advice on the preparation of this manuscript.
References
1. Baum, L.E., Petrie, T., Soules, G., and Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals Math. Stat., 41, 164-171, 1970.
2. Clark, A. G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7, 111-122, 1990.
3. Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. High-resolution haplotype structure in the human genome. Nature Gen., 29, 229-232, 2001.
4. Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B, 39, 1-38, 1977.
5. Edmonds, J. Paths, trees, and flowers. Canad. J. Math., 17, 449-467, 1965.
6. Fearnhead, P. and Donnelly, P. Estimating recombination rates from population genetic data. Genetics, 159, 1299-1318, 2001.
7. Gusfield, D. Efficient algorithms for inferring evolutionary history. Networks, 21, 19-28, 1991.
8. Gusfield, D. Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In Proc. 6th Intl. Conf. Comp. Biol., RECOMB'02, 166-175, 2002.
9. Hein, J. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol., 20, 402-411, 1993.
10. Hudson, R.R. Properties of the neutral allele model with intergenic recombination. Theoret. Pop. Biol., 23, 183-201, 1983.
11. Hudson, R.R. and Kaplan, N.L. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111, 147-164, 1985.
12. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409, 860-921, 2001.
13. Jeffreys, A.J., Kauppi, L., and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Gen., 29, 217-222, 2001.
14. Johnson, G.C.L., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F., Twells, R.C.J., Payne, F., Hughes, W., Nutland, S., Stevens, H., Carr, P., Tuomilehto-Wolf, E., Tuomilehto, J., Gough, S.C.L., Clayton, D.G., and Todd, J.A. Haplotype tagging for the identification of common disease genes. Nature Gen., 29, 233-237, 2001.
15. Kececioglu, J. and Gusfield, D. Reconstructing a history of recombinations from a set of sequences. Disc. Appl. Math., 88, 239-260, 1998.
16. Kimura, M. Theoretical foundations of population genetics at the molecular level. Theoret. Pop. Biol., 2, 174-208, 1971.
17. Maynard Smith, J. Analyzing the mosaic structure of genes. J. Mol. Evol., 34, 126-129, 1992.
18. Maynard Smith, J. and Smith, N.H. Detecting recombination from gene trees. Mol. Biol. Evol., 15, 590-599, 1998.
19. Nickerson, D. A., Taylor, S. L., Fullerton, S. M., Weiss, K. M., Clark, A. G., Stengård, J. H., Salomaa, V., Boerwinkle, E., and Sing, C. F. Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Gen. Res., 10, 1532-1545, 2000.
20. Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., Nguyen, B.T.N., Norris, M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P., Thomas, D.J., Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P.A., and Cox, D.R. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719-1723, 2001.
21. Posada, D. and Crandall, K.A. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl. Acad. Sci. USA, 98, 13757-13762, 2001.
22. Sawyer, S. Statistical tests for detecting gene conversion. Mol. Biol. Evol., 6, 526-536, 1989.
23. Spang, R., Rehmsmeier, M., and Stoye, J. Sequence database search using jumping alignments. In Proc. Intel. Sys. Mol. Biol., ISMB'00, 367-375, 2000.
24. Stephens, M., Smith, N.J., and Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Gen., 68, 978-989, 2001.
25. Venter, J.C., Adams, M.D., Myers, E.W., et al. The sequence of the human genome. Science, 291, 1304-1351, 2001.
26. Wang, L., Zhang, K., and Zhang, L. Perfect phylogenetic networks with recombination. J. Comp. Biol., 8, 69-78, 2001.
27. Weiler, G. F. Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Mol. Biol. Evol., 15, 326-335, 1998.
28. Wiuf, C., Christensen, T., and Hein, J. A simulation study of the reliability of recombination detection methods. Mol. Biol. Evol., 18, 1929-1939, 2001.
29. Zhang, K., Deng, M., Chen, T., Waterman, M. S., and Sun, F. A dynamic programming algorithm for haplotype block partition. Proc. Natl. Acad. Sci. USA, 99, 7335-7339, 2002.
Finding Signal Peptides in Human Protein
Sequences Using Recurrent Neural Networks
Martin Reczko^1, Petko Fiziev^2, Eike Staub^2, and Artemis Hatzigeorgiou^3
^1 Synaptic Ltd., Science and Technology Park of Crete, P.O. Box 1447, 711 10 Voutes Heraklion, Greece
^2 metaGen Pharmaceuticals GmbH, Oudenarder Str. 16, D-13347 Berlin, Germany
^3 Department of Genetics, University of Pennsylvania, School of Medicine, Philadelphia, PA 19104-6145, USA
Abstract. A new approach called Sigfind for the prediction of signal peptides in human protein sequences is introduced. The method is based on the bidirectional recurrent neural network architecture. The modifications to this architecture and a better learning algorithm result in a very accurate identification of signal peptides (99.5% correct in fivefold crossvalidation). The Sigfind system is available on the WWW for predictions (http://www.stepc.gr/~synaptic/sigfind.html).
1 Introduction
In the last few years the complete or nearly complete sequence of a large number of genomes, including the human genome, has been determined. One of the expected consequences of sequencing these genomes is the discovery of new drugs. The sorting of subcellular proteins is a fundamental aspect of finding such drugs, because such proteins have the potential to leave the cell and move through the organism.
In humans, secretory proteins account for about one-tenth of the proteome. As many functional characteristics of proteins can be correlated with more or less well-defined linear motifs in their amino-acid sequences, in this case too, sorting depends on signals that can already be identified in the primary structure of a protein.
Signal peptides are N-terminal extensions on the mature polypeptide chain. They are cleaved from the mature part of the protein at the cleavage site. The basic design of signal peptides is constituted by three distinct regions: a short positively charged N-terminal domain (N-region), a central hydrophobic stretch of 7-15 residues (H-region), and a more polar C-terminal part of 3-7 residues that defines a processing site for the signal peptidase enzyme (C-region) [18].
Prediction of signal peptides has a long history, starting back in the early eighties with simple algorithms based on hydrophobicity calculations [8] and the later use of a positional weight matrix for the cleavage site [17]. The application of more sophisticated methods such as rule-based systems [7], neural networks (NNs) [13], hidden Markov models (HMMs) [14], and linear programming [9] has increased the levels of accuracy considerably. A more detailed review can be found in [12].
In this paper a new method for identifying signal peptides is introduced. The method is based on a novel neural network architecture called bidirectional recurrent neural nets (BRNN) that was recently introduced by [2]. The BRNN architecture is used for sequential learning problems with sequences of finite lengths. Other recurrent NNs can only model causal dynamical systems, where the output of the system at a certain time does not depend on future inputs. The BRNN model has a symmetric memory for storing influences of past and future inputs on an output at time t.
This non-causal behavior is useful for the prediction of properties of biosequences, since most properties are influenced by elements of those sequences both upstream and downstream of that property. The concept of time does not apply to this class of sequential learning problems, so we know the future sequence following a sequence position t, and the system is not violating any physical causality.
We modify the architecture and introduce the combination with a more efficient learning algorithm that produces networks with better generalization performance in less computational time. In the results section we present a comparison with the methods available on the Internet and document the achieved improvements.
2 Material and Methods
The data used here is the same data as was used for the first SignalP prediction system [13], made available by the authors (the SignalP datasets are available at ftp://virus.cbs.dtu.dk/pub/signalp). That data was derived from SWISS-PROT version 29 [1]. For the removal of redundancy in their data set, Nielsen et al. [13] eliminated any pair of sequences with more than 17 identical residues in a local alignment.
From this non-redundant data set we use only the human sequences, which comprise 416 signal peptides and, as negative examples, the N-terminal parts of 97 cytoplasmic and 154 nuclear proteins.
2.1 Neural Network Architecture
The neural network architecture employed for the task is a modified version of the BRNN as described in [2]. The BRNN architecture is summarized here and the modifications are introduced.
The state vectors are defined as F_t ∈ ℝ^n and B_t ∈ ℝ^m and contain the context information while reading forwards and backwards, respectively. They are calculated as:

    F_t = φ(F_{t-1}, U_t)  and                                         (1)
    B_t = β(B_{t+1}, U_t)                                              (2)
where φ(·) and β(·) are nonlinear transition functions realized by the multilayer perceptrons (MLPs) N_φ and N_β, and the vector U_t ∈ ℝ^k is the input at time t ∈ [1, T]. The state vectors are initialized with F_0 = B_{T+1} = 0.
The output Y_t ∈ ℝ^s is calculated after the calculation of F_t and B_t has finished, using

    Y_t = η(F_t, B_t, U_t)                                             (3)
where η(·) is again realized by an MLP N_η. In the realization of [2], the MLP N_η contains one hidden layer that is connected to U_t, and the state units in F_t and B_t are connected directly to the output layer.
Fig. 1. The modified bidirectional recurrent neural network architecture. The input at time t is applied at U_t. Y_t contains the output vector. F_t is the forward state vector, and B_t the backward state vector. The shift operator q^{-1} copies the last F_t state vector while reading the input sequence forwards, while the operator q^{+1} copies the last B_t state vector while reading the input backwards.
In our modified BRNN, the state units are not connected directly to the output, but to the hidden layer. It is reasonable to assume that several activation patterns occurring on the combination of the forward and backward state units are not linearly separable and thus cannot be distinguished without the additional transformation performed by this hidden layer. There is an implicit asymmetry in the amount of context information accumulated in the state vectors. At the start of the sequence (t = 1), the forward state vector F_1 contains
information about only one input vector, U_1, whereas the backward state vector B_1 has accumulated all information of the complete sequence. At the end of the sequence (t = T), the opposite situation occurs. The processing of the state vectors with a hidden layer can help detect these asymmetric situations and process the more relevant activations.
The architecture is shown in Figure 1, where the shift operator q^{-1} to copy the forward state vector is defined as X_{t-1} = q^{-1} X_t, and the inverse shift operator q^{+1} to copy the backward state vector is defined as X_{t+1} = q^{+1} X_t.
Fig. 2. The unfolded feedforward structure of a BRNN processing an input sequence U_t with length T = 3. First, the forward states F_1, ..., F_3 are calculated sequentially from left to right, while the backward states B_3, ..., B_1 are calculated beginning with the end of the sequence and continuing from right to left. After all F_t and B_t are processed, the output sequence Y_t can be calculated.
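To make the data flow of equations (1)-(3) concrete, the following is a minimal NumPy sketch of the BRNN forward pass. The single-matrix parameterization of each MLP, the tanh and sigmoid nonlinearities, and all parameter names are illustrative assumptions rather than the authors' implementation; biases are omitted for brevity.

import numpy as np

def brnn_forward(p, U):
    # U: list of T input vectors (each in R^k, e.g. one-out-of-20 residue codes).
    # p: dict of weight matrices for the three MLPs N_phi, N_beta, N_eta, plus
    #    the state dimensions 'n' and 'm' (all names are our own choices).
    T = len(U)
    F = [np.zeros(p['n'])]                       # F_0 = 0
    B = [np.zeros(p['m'])]                       # B_{T+1} = 0
    for t in range(T):                           # eq. (1): forward states, left to right
        F.append(np.tanh(p['Wf'] @ np.concatenate([F[-1], U[t]])))
    for t in reversed(range(T)):                 # eq. (2): backward states, right to left
        B.append(np.tanh(p['Wb'] @ np.concatenate([B[-1], U[t]])))
    B = B[::-1]                                  # reverse so that B[t] pairs with U[t]
    Y = []
    for t in range(T):                           # eq. (3); in the modified BRNN the state
        h = np.tanh(p['Wh'] @ np.concatenate([F[t + 1], B[t], U[t]]))  # units feed the hidden layer
        Y.append(1.0 / (1.0 + np.exp(-p['Wy'] @ h)))                   # sigmoid output in [0, 1]
    return Y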
2.2 Learning Algorithm
To adapt the weight parameters in the MLPs N_φ, N_β, and N_η, the recurrent NN is unfolded in time and the backpropagation algorithm [16] is used to calculate the gradient for all weights. The unfolding is illustrated in Figure 2. This transformation of a recurrent NN into an equivalent feedforward NN is called backpropagation through time; it was first described in [11], and the application of backpropagation learning to these networks was introduced in [16].
The gradient for a weight is summed up for all training sequences and each
shared position within each sequence.
One well-known obstacle when training recurrent NNs is the problem of vanishing gradients [3]. In most cases, the gradient decays exponentially with
the number of layers in a NN. Since the unfolded network has at least as many
layers as the length of the input sequences, there is a limit for storing features
that might be relevant for the prediction of a property and are separated by
longer sequences from that property.
A learning algorithm that performs well without depending on the magnitude of the weight gradients is the resilient backpropagation (RPROP) algorithm [15]. Only the sign of the derivative is considered to indicate the direction of the weight update. The size of the weight change is determined by a weight-specific update-value Δ_ij^(epoch) and defined as

    Δw_ij^(epoch) =  -Δ_ij^(epoch),  if (∂E/∂w_ij)^(epoch) > 0
                     +Δ_ij^(epoch),  if (∂E/∂w_ij)^(epoch) < 0          (4)
                     0,              else
where (∂E/∂w_ij)^(epoch) denotes the gradient information summed over all patterns of the pattern set (one epoch in batch learning). After each batch update, the new update-values Δ_ij^(epoch) are determined using a sign-dependent adaptation process:

    Δ_ij^(epoch) =  η⁺ · Δ_ij^(epoch-1),  if (∂E/∂w_ij)^(epoch-1) · (∂E/∂w_ij)^(epoch) > 0
                    η⁻ · Δ_ij^(epoch-1),  if (∂E/∂w_ij)^(epoch-1) · (∂E/∂w_ij)^(epoch) < 0          (5)
                    Δ_ij^(epoch-1),       else

where η⁻ = 0.5 and η⁺ = 1.05 are fixed values.
Every time the partial derivative of the corresponding weight w_ij changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum, the update-value Δ_ij^(epoch) is decreased by the factor η⁻. If the derivative retains its sign, the update-value is slightly increased by the factor η⁺ in order to accelerate convergence in shallow regions.
Using this scheme, the magnitude of the update-values is completely decoupled from the magnitude of the gradients. A weight with a very small gradient can have a very large weight change and vice versa, if this is in accordance with the topology of the error landscape. This learning algorithm converges substantially faster than standard backpropagation and allows all BRNN simulations to be run on standard PC hardware.
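A compact sketch of one RPROP batch update implementing equations (4) and (5); the step-size bounds step_min and step_max are conventional safeguards from the RPROP literature, not parameters given in the text.

import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_minus=0.5, eta_plus=1.05, step_min=1e-6, step_max=50.0):
    # w, grad, prev_grad, step: arrays of identical shape (one entry per weight).
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)   # eq. (5), sign kept
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)  # eq. (5), sign flipped
    w = w - np.sign(grad) * step    # eq. (4): only the sign of the gradient is used
    return w, step, grad            # the returned grad becomes prev_grad next epoch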
3 Results
A sequence of the first 45 residues of the N-terminus is used for the input vectors U_t ∈ [0, 1]^20, where each vector codes the residue in the standard one-out-of-20 coding. Each element of the output sequence Y_t ∈ [0, 1] is set to 1 if the residue at position t is part of the signal peptide before the cleavage site, and to 0 otherwise.
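A sketch of this input coding (the alphabet ordering is an arbitrary choice of ours):

import numpy as np

def one_out_of_20(seq, alphabet='ACDEFGHIKLMNPQRSTVWY'):
    # Code the first 45 residues in the standard one-out-of-20 coding.
    X = np.zeros((45, 20))
    for t, residue in enumerate(seq[:45]):
        if residue in alphabet:                # non-standard residues stay all-zero
            X[t, alphabet.index(residue)] = 1.0
    return X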
The set of the signal peptide sequences and the two sets of non-secretory sequences are each split into 5 subsets of equal size. Fivefold crossvalidation is used by training 5 networks on 3/5 of each set, using 1/5 of each set as a validation set to stop the training, and 1/5 of each set as a test set to measure the performance. The sizes of the different layers in the BRNN are determined experimentally. The number of units in the forward and backward state vectors is 5 or 6, the number of hidden units feeding into the state vectors is between 10 and 13, and the number of hidden units feeding into the output layer is between 2 and 6.
The criterion to discriminate the signal peptides is optimized using the validation set. If a segment of more than 7 consecutive output activations meets the condition Y_t > 0.75, t ∈ [1, 45], the input sequence is classified as a signal peptide.
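The decision rule can be written down directly; in the sketch below, the run length of 8 encodes "more than 7 consecutive activations":

def is_signal_peptide(Y, threshold=0.75, min_run=8):
    # Y: output activations for the first 45 positions.
    run = 0
    for y in Y[:45]:
        run = run + 1 if y > threshold else 0
        if run >= min_run:
            return True
    return False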
The performance is measured by calculating the correlation coefficient [10] and the sensitivity sens and specificity spec, defined as

    sens = cp/(cp + fn)                                                (6)
    spec = cp/(cp + fp),                                               (7)

where cp is the number of correct positive examples, fn is the number of false negatives, and fp is the number of false positives. The results are given in Table 1. The average percentage of correctly predicted sequences is 99.55%.
Table 1. Performance on the test sets in fivefold crossvalidation.

    dataset   sensitivity   specificity   correlation
    1         100.0         100.0         1.0
    2         100.0          98.8         0.984
    3         100.0          98.8         0.984
    4         100.0          98.8         0.984
    5         100.0         100.0         1.0
    avg       100.0          99.3         0.99
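For reference, a sketch computing these measures together with the Matthews correlation coefficient [10] reported in the tables; the MCC formula is standard but not written out in the paper, and note that "specificity" as defined in equation (7) is the positive predictive value:

def performance(cp, fp, tn, fn):
    sens = cp / (cp + fn)                                   # eq. (6)
    spec = cp / (cp + fp)                                   # eq. (7)
    mcc = (cp * tn - fp * fn) / \
          ((cp + fp) * (cp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return sens, spec, mcc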
Another test set is created by extracting human protein sequences from release 39.25 of SWISS-PROT that have the feature keyword SIGNAL and that are not contained in the training set used. The non-signal-peptide proteins used in this set are full-length mRNAs with the N-terminus experimentally verified. Since the data set for training the SignalP2 system is not available, it is not guaranteed that all of these sequences are novel to that system. For the performance measurements shown in Table 2 and the general use of the Sigfind system, the predictions of the 5 networks are combined using a majority vote of a jury of these networks.
For scanning targets in complete genomes, the execution time of an algorithm should not be neglected. As shown in Table 3, the Sigfind system compares very favorably with SignalP2.
Table 2. Performances on an independent test set.

    system                      sensitivity   specificity   correlation
    signalp2 with NN and HMM    88.28         96.55         0.851
    signalp2 with NN            93.24         92.83         0.856
    signalp2 with HMM           90.99         94.83         0.857
    signalp2 with NN or HMM     95.94         91.42         0.867
    sigfind                     98.64         93.59         0.918
Table 3. Execution times on a Sun Sparc.

    system     222 signal peptides   210 non-signal peptides
    signalp2   300 seconds           251 seconds
    sigfind    1.98 seconds          1.94 seconds
4 Availability
The Sigfind system can be used for predictions using a WWW interface at the URL http://www.stepc.gr/~synaptic/sigfind.html.
5 Discussion
The approach introduced here is shown to be very accurate for the identification of signal peptides in human protein sequences. The identification of signal peptides has become of major importance in the analysis of sequenced data. Programs like SignalP2 and Psort [7] are widely used for this purpose and, in combination, produce quite reliable predictions. The algorithm we developed here can be used as a computationally efficient, additional tool to find signal peptides.
The method is designed to determine whether the N-terminal part of a protein is a signal peptide or not. However, in many cases the complete N-terminal part of the sequence is not experimentally verified but predicted in silico. Particularly in protein sequences derived from genomic data, it is possible that the real start codon is not predicted correctly. The same also occurs for protein sequences derived from expressed sequence tags (ESTs). Using Sigfind on the whole protein might produce false positive predictions, mostly at the location of transmembrane regions, which have a structure similar to that of signal peptides. For these reasons we are planning to combine the Sigfind predictions with reliable start-codon [4] and coding-region predictions. For ESTs it would also be useful to apply frame-shift correction software [6,5]. The modified BRNN algorithm described here has great potential for these tasks as well as for other biosequence analysis problems.
References
1. Bairoch, A., Boeckmann, B.: The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res. 22 (1994) 3578-3580
2. Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., Soda, G.: Bidirectional dynamics for protein secondary structure prediction. In: Sun, R., Giles, L. (eds.): Sequence Learning: Paradigms, Algorithms, and Applications. Springer Verlag (2000)
3. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks 5 (1994) 157-166
4. Hatzigeorgiou, A.: Translation initiation site prediction in human cDNAs with high accuracy. Bioinformatics 18 (2002) 343-350
5. Hatzigeorgiou, A., Fizief, P., Reczko, M.: DIANA-EST: a statistical analysis. Bioinformatics 17 (2001) 913-919
6. Hatzigeorgiou, A., Papanikolaou, H., Reczko, M.: Finding the reading frame in protein coding regions on DNA sequences: a combination of statistical and neural network methods. In: Computational Intelligence: Neural Networks & Advanced Control Strategies. IOS Press, Vienna (1999) 148-153
7. Horton, P., Nakai, K.: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In: ISMB (1997) 147-152
8. Kyte, J., Doolittle, R.: A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157 (1982) 105-132
9. Ladunga, I.: Large-scale predictions of secretory proteins from mammalian genomic and EST sequences. Curr. Opin. in Biotechnology 11 (2000) 13-18
10. Matthews, B. W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405 (1975) 442-451
11. Minsky, M., Papert, S.: Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, Massachusetts (1969) 145
12. Nielsen, H., Brunak, S., von Heijne, G.: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering 12 (1999) 3-9
13. Nielsen, H., Engelbrecht, J., Brunak, S., von Heijne, G.: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10 (1997) 1-6
14. Nielsen, H., Krogh, A.: Prediction of signal peptides and signal anchors by a hidden Markov model. In: ISMB (1998) 122-130
15. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Ruspini, H. (ed.): Proceedings of the IEEE International Conference on Neural Networks (ICNN 93). IEEE, San Francisco (1993) 586-591
16. Rumelhart, D. E., Hinton, G. E., Williams, R. J.: Learning internal representations by error propagation. In: Rumelhart, D. E., McClelland, J. L. (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Vol. 1: Foundations. The MIT Press, Cambridge, Massachusetts (1986)
17. von Heijne, G.: A new method for predicting signal sequence cleavage sites. Nucleic Acids Res. 14 (1986) 4683-4690
18. von Heijne, G.: Computer-assisted identification of protein sorting signals and prediction of membrane protein topology and structure. In: Advances in Computational Biology, Volume 2. JAI Press Inc. (1996) 1-14
Generating Peptide Candidates from
Amino-Acid Sequence Databases for Protein
Identification via Mass Spectrometry
Nathan Edwards and Ross Lippert
Celera Genomics, 45 West Gude Drive, Rockville, MD
{Nathan.Edwards,Ross.Lippert}@Celera.Com
Abstract. Protein identification via mass spectrometry forms the foundation of high-throughput proteomics. Tandem mass spectrometry, when applied to a complex mixture of peptides, selects and fragments each peptide to reveal its amino-acid sequence structure. The successful analysis of such an experiment typically relies on amino-acid sequence databases to provide a set of biologically relevant peptides to examine. A key subproblem, then, for amino-acid sequence database search engines that analyze tandem mass spectra is to efficiently generate all the peptide candidates from a sequence database with mass equal to one of a large set of observed peptide masses. We demonstrate that to solve the problem efficiently, we must deal with substring redundancy in the amino-acid sequence database and focus our attention on looking up the observed peptide masses quickly. We show that it is possible, with some preprocessing and memory overhead, to solve the peptide candidate generation problem in time asymptotically proportional to the size of the sequence database and the number of peptide candidates output.
1 Introduction
Reliable methods for identifying proteins form the foundation of proteomics, the large-scale study of expressed proteins. Of the available technologies for protein identification, mass spectrometry is the most suitable for high-throughput automation and analysis.
Tandem mass spectrometry, in which peptides are selected and fragmented, reveals peptides' amino-acid sequence structure. In a typical high-throughput setting, a complex mixture of unknown proteins is cut into peptides using a digestion enzyme such as trypsin; fractionated into reduced-complexity samples on the basis of some physical or chemical property such as hydrophobicity; and then a tandem mass spectrum is taken for all the observed peptides in each fraction. The end result of such an experiment is a set of a few hundred to a few thousand tandem mass spectra, each of which represents a peptide of about 6-20 amino acid residues. Typically, amino-acid sequences of 8-10 residues carry sufficient information content to determine the protein from which the peptide is derived, at least up to significant homology. This experimental protocol can
reliably identify hundreds of proteins from a complex mixture in a couple of hours of instrument time.
The traditional approach to automated protein identification using mass spectrometry is not nearly as suited to a high-throughput environment. Peptide mass fingerprinting, see Pappin, Hojrup, and Bleasby [11], James et al. [8], Cottrell and Sutton [4], and Pappin [10], considers the mass spectrum generated by the (typically tryptic) peptides of a single protein. The mass of each peptide, by itself, carries little information content about the original protein; it is only by the simultaneous observation of the masses of many peptides from a protein that reliable protein identification can be carried out. Furthermore, if peptides from many different proteins are observed in the spectrum, the task of reliably determining which peptide ions belong to the same protein becomes near impossible. This means that significant wet-lab work must be done in order to reduce the number of proteins in each fraction to as near to one as possible, which ultimately makes peptide mass fingerprinting unsuitable for high-throughput protein identification.
The analysis of each tandem mass spectrum can be done in a variety of ways. Given a good-quality spectrum and a peptide that fragments nicely, the amino-acid sequence of the peptide can often be determined de novo, merely by looking at the mass differences between peaks in the spectrum. Many algorithmic approaches have been proposed for de novo tandem mass spectrum interpretation; see Taylor and Johnson [14], Dancik et al. [5], Pevzner, Dancik and Tang [13], and Chen et al. [2] for some examples. In reality, however, in a high-throughput setting, the number of spectra that can be reliably analyzed de novo is small. The most reliable approach for analyzing high-throughput tandem mass spectra for protein identification uses amino-acid sequence databases to suggest peptide candidates, which are then ranked according to how well they explain the observed spectrum. A number of papers have been published describing tandem mass spectra database search engines and peptide candidate ranking schemes. Some of these approaches have even been commercialized. The first widely adopted program for tandem mass spectra identification via sequence database search was SEQUEST, based on work by Eng, McCormack and Yates [6]. Another successful commercial product, Mascot, is based on work by Perkins et al. [12]. Bafna and Edwards [1] published a probability-based model for scoring and ranking peptide candidates that could be tuned to the peptide fragmentation propensities of different mass spectrometry technologies. Unfortunately, all of this work focuses primarily on determining the correct peptide candidate from among the possibilities suggested by the sequence database. In this paper, we address the problem at the core of all tandem mass spectrum identification sequence search engines: enumerating all appropriate peptide candidates from an amino-acid sequence database efficiently. For each tandem mass spectrum, we require peptide candidates that match the mass of the peptide that was selected for fragmentation.
2 Generating Peptide Candidates
2.1 Problem Formulation and Notation
Peptide Candidate Generation with Integer Weights [PCG(Int)]. Given a string σ of length n over an alphabet A of size m; a positive integer mass μ(a) for each a ∈ A; and k integer mass queries M_1, ..., M_k, enumerate all (distinct) pairs (i, π), where 1 ≤ i ≤ k and π is a substring of σ, such that

    Σ_{j=1}^{|π|} μ(π_j) = M_i.

For convenience, we denote the maximum relevant substring mass by M_max = max_i M_i and the minimum relevant substring mass by M_min = min_i M_i. Further, since the set of query masses may contain repeated values, we define R_max to be the maximum number of repeats.
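For a small example, take nominal residue masses μ(G) = 57, μ(A) = 71, and μ(S) = 87, the string σ = GASA, and the queries M_1 = 128 and M_2 = 158. The substring GA matches M_1 (57 + 71 = 128), while both AS and SA match M_2 (71 + 87 = 87 + 71 = 158), so the output is (1, GA), (2, AS), and (2, SA).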
While we will concentrate primarily on the integer mass version of the peptide candidate generation problem, we must not lose sight of the real mass version, which we must eventually implement. The real mass version of the problem defines real masses for each alphabet symbol and supports real query masses with lower and upper mass tolerances.
Peptide Candidate Generation with Real Weights [PCG(Real)]. Given a string σ of length n over an alphabet A of size m; a positive real mass μ(a) for each a ∈ A; and k positive real mass queries with positive lower and upper tolerances (M_1, l_1, u_1), ..., (M_k, l_k, u_k), enumerate all (distinct) pairs (i, π), where 1 ≤ i ≤ k and π is a substring of σ, such that

    M_i - l_i ≤ Σ_{j=1}^{|π|} μ(π_j) ≤ M_i + u_i.
We define M_max and M_min to be the maximum and minimum relevant substring masses, M_max = max_i (M_i + u_i) and M_min = min_i (M_i - l_i), and define O_max to be the maximum number of [M_i - l_i, M_i + u_i] intervals to overlap any mass.
2.2 Application to Peptide Identification via Tandem Mass Spectrometry
In order to focus our analysis of different algorithms for generating peptide candidates, we first describe some of the issues that need to be resolved before a peptide candidate generation technique can be used as part of a peptide identification amino-acid sequence search engine.
First, most amino-acid sequence databases do not consist of a single sequence of amino acids; they instead contain many distinct amino-acid sequences, representing many proteins. Our peptide candidates should not straddle amino-acid
sequence boundaries. This is easily dealt with by introducing an additional symbol to the alphabet A with a mass guaranteed to be larger than the biggest query.
We usually require significantly more information about each peptide candidate than is represented by its sequence alone. A peptide candidate's protein context minimally consists of a reference, such as an accession number, to the sequence database entry containing it, and its position within the sequence. Often, the protein context also contains flanking residues, for checking enzymatic digest consistency quickly, and taxonomy information, for restricting candidates to a species of interest.
Since our experimental protocol cuts each protein with a digestion enzyme, we usually require our peptide candidates to have one or both ends consistent with the putative digest cut sites. Trypsin, very commonly used in mass spectrometry applications, cuts proteins immediately after a lysine (K) or arginine (R) unless the next residue is a proline (P). In order to determine whether a peptide candidate is consistent with trypsin, for example, we must use the protein context to examine the symbols before and after a peptide candidate.
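The trypsin rule is easy to encode; the following sketch splits a hypothetical sequence at tryptic cut sites with a zero-width regular expression:

import re

# Cut after K or R, except when the next residue is P.
TRYPTIC_SITE = re.compile(r'(?<=[KR])(?!P)')

peptides = TRYPTIC_SITE.split('MKPVTFISLLFLFSSAYSRGVFRRDAHS')  # hypothetical sequence
# -> ['MKPVTFISLLFLFSSAYSR', 'GVFR', 'R', 'DAHS']
# Note the KP at the start is not cut, while R-R yields the single-residue peptide 'R'.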
If we will always require that both ends of the peptide candidate be consistent with the enzymatic digest, we can improve the performance of any algorithm for the peptide candidate generation problem significantly. However, in practice, we have observed that a significant number of the tandem mass spectra generated in the high-throughput setting do not have enzymatic-digest-consistent endpoints, and requiring this property of our peptide candidates significantly affects our ability to successfully interpret many spectra. As such, we currently do not impose this constraint at the algorithmic level; instead we filter out candidates without digestion-consistent endpoints at candidate generation time. Section 6 deals with this issue further.
Peptide redundancy must be carefully considered too. It does us no good to score the same peptide candidate against a spectrum multiple times, but if a peptide scores well, we must be able to list all the proteins in which it occurs. If our algorithm does not explicitly eliminate redundant peptide candidates, we can choose to score them multiple times, or store all scored candidates in some data structure and only score candidates the first time they are observed. Section 4 deals with this issue further.
Post-translational modifications provide yet another consideration that affects algorithmic implementation decisions. If we permit the online specification of the amino-acid mass table in order to support the generation of peptide candidates with a particular post-translational modification, we must discard any algorithm which precomputes peptide candidate masses, unless we can somehow correct for amino acids with modified masses.
Modeling post-translational modifications becomes more difficult when we must generate peptide candidates in which some amino-acid symbols have multiple masses. In this setting, we can still use the peptide candidate generation formulation above, but we must use additional query masses and do additional checks on the generated candidates. For each tandem mass spectrum, we generate a set of mass queries that correct for the presence of a certain number of
residues being translated to a particular modified amino acid, and check each returned candidate-query mass pair to ensure that the required number of ambiguous amino-acid symbols is present.
2.3 Implications for Peptide Candidate Generation
We consider amino-acid sequence databases of a few megabytes to a few gigabytes of sequence, over an alphabet of size 20. The alphabet is small enough that we can look up the mass table for each symbol in the alphabet in constant time.
A typical set of tandem mass spectra from a single high-throughput run consists of hundreds to thousands of spectra, so we expect hundreds to tens of thousands of distinct query masses. Furthermore, these query masses are typically constrained between 600-3000 Daltons with a 2 Dalton tolerance on acceptable peptide candidates. A 2 Dalton tolerance is typically necessary since mass spectrometers select ions for further fragmentation with much less accuracy than they measure the mass of the same ions, and we must consider any peptide candidate that might have been selected, not only those that have been measured. For our peptide candidate generation problem, this implies that the query masses occupy much of the feasible mass range. The importance of this consideration will become clear in Section 5.
We consider the amino-acid sequence database to be provided offline, and that it may be pre-processed at will, the cost amortized over many identification searches. On the other hand, we assume that the alphabet mass function and the query masses are provided only at runtime, and any preprocessing based on weights must be accounted for in the algorithm run time.
3 Simple Algorithms
3.1 Linear Scan
Suppose initially that we have a single query mass, M. Finding all substrings of σ with weight M involves a simple linear scan. The algorithm maintains indices b and e for the beginning and end of the current substring and accumulates the current substring mass in M̃. If the current mass is less than M, e is incremented and M̃ increased. If the current mass is greater than M, b is incremented and M̃ decreased. If the current mass equals M, then σ_{b...e} is output. See Algorithm 1 for pseudo-code.
This algorithm outputs all peptide candidates that match the query mass in O(n) time. Linear Scan requires no preprocessing and requires only the string itself be stored in memory. The recent work of Cieliebak, Erlebach, Lipták, Stoye, and Welzl [3] demonstrates how to generalize Linear Scan by creating blocks of contiguous alphabet symbols and changing the pointers b and e in block-size increments.
3.2 Sequential Linear Scan
The Sequential Linear Scan algorithm for the peptide candidate generation problem merely calls Linear Scan for each query mass M_i, i = 1, ..., k. Sequential Linear Scan requires no preprocessing of the string and requires
Algorithm 1 Linear Scan
    b ← 1, e ← 0, M̃ ← 0.
    while e < n or M̃ ≥ M do
        if M̃ = M then
            Output σ_{b...e}.
        if M̃ < M and e < n then
            e ← e + 1, M̃ ← M̃ + μ(σ_e).
        else {M̃ ≥ M}
            M̃ ← M̃ - μ(σ_b), b ← b + 1.
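A direct Python transcription of Algorithm 1, using 0-based, half-open window indices (a sketch; the function and variable names are ours):

def linear_scan(sigma, mu, M):
    # Enumerate all substrings sigma[b:e] whose residue masses sum to M.
    # sigma: string over the amino-acid alphabet; mu: dict symbol -> positive integer mass.
    n, b, e, cur = len(sigma), 0, 0, 0
    matches = []
    while e < n or cur >= M:
        if cur == M:
            matches.append((b, e))          # candidate sigma[b:e]
        if cur < M and e < n:
            cur += mu[sigma[e]]; e += 1     # extend the window to the right
        else:
            cur -= mu[sigma[b]]; b += 1     # shrink the window from the left
    return matches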
memory only to store the sequence data and query masses. The algorithm takes O(nk) time to enumerate all peptide candidates that match the k query masses. The algorithm not only outputs redundant peptide candidates, it computes their mass from scratch every time it encounters them. On the other hand, the algorithm can easily keep track of protein context information as it scans the sequence database.
3.3 Simultaneous Linear Scan
Unlike Sequential Linear Scan, Simultaneous Linear Scan solves the peptide candidate generation problem with a single linear scan of the sequence database. For every peptide candidate start position, we generate all candidates with mass between the smallest and largest query masses, and check, for each one, whether or not there is a query mass that it matches. Assuming positive symbol masses and query masses bounded above by some constant, there are at most L candidates that must be considered at each start position. If we pre-sort the query masses, collapsing identical queries into a list associated with a single table entry, we can look up all queries corresponding to a particular substring mass in O(R_max log k) time. Therefore, we can crudely estimate the running time of Simultaneous Linear Scan as O(k log k + nLR_max log k).
Theoretically, while it is possible to construct an input for the peptide candidate generation problem that requires the consideration of this many candidates, such an input will never occur in practice. Assuming typical amino-acid frequencies and a 3000 Dalton maximum query, the average number of candidates that must be considered at each start position, or L_ave, is about 27, much less than the worst-case L of 54. Table 1(a) provides L_ave for a variety of publicly available sequence databases. For the case when enzymatic-digest-consistent endpoints are required, many fewer candidates need to be generated. Table 1(b) gives L_ave for tryptic peptide candidates.
In order to take advantage of this, Simultaneous Linear Scan must examine only the peptide candidates with masses up to the maximum query mass M_max. Algorithm 2 implements Simultaneous Linear Scan, achieving a running time bound of O(k log k + nL_ave R_max log k). Clearly, as k gets large, this approach will do much better than Sequential Linear Scan.
Table 1. Average (maximum) number of (a) candidates, (b) tryptic candidates per start position for various values of M_max.

        Sequence Database   500 Da      1000 Da      2000 Da       3000 Da
    (a) SwissPROT           4.491(8)    9.008(17)    18.007(32)    27.010(48)
        TrEMBL              4.492(8)    9.004(17)    18.006(33)    27.010(49)
        GenPept             4.491(8)    9.004(17)    18.005(33)    27.006(49)
    (b) SwissPROT           1.443(4)    2.046(7)     3.137(12)     4.204(18)
        TrEMBL              1.418(4)    2.022(7)     3.066(14)     4.189(19)
        GenPept             1.427(4)    2.036(7)     3.101(14)     4.195(22)
Algorithm 2 Simultaneous Linear Scan
    Sort the query masses M_1, ..., M_k.
    b ← 1, e ← 0, M̃ ← 0.
    while b ≤ n do
        while M̃ ≤ M_max and e < n do
            e ← e + 1, M̃ ← M̃ + μ(σ_e).
            Q ← QueryLookup(M̃).
            Output (i, σ_{b...e}) for each i ∈ Q.
        b ← b + 1, e ← b - 1, M̃ ← 0.
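A Python sketch of Algorithm 2, with sorting and binary search standing in for the QueryLookup routine (Section 5 replaces this with a constant-time table):

from bisect import bisect_left

def simultaneous_linear_scan(sigma, mu, queries):
    # queries: list of (mass, query_id); yields (query_id, b, e) for each match.
    qs = sorted(queries)
    masses = [mass for mass, _ in qs]
    m_max = masses[-1]
    n = len(sigma)
    for b in range(n):                       # restart the window at every start position
        cur, e = 0, b
        while e < n:
            cur += mu[sigma[e]]; e += 1
            if cur > m_max:
                break
            j = bisect_left(masses, cur)     # O(log k) lookup of all queries equal to cur
            while j < len(masses) and masses[j] == cur:
                yield (qs[j][1], b, e)
                j += 1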
Like Sequential Linear Scan, Simultaneous Linear Scan does no preprocessing of the amino-acid sequence database and requires no additional memory above that required for the sequence and the queries. Simultaneous Linear Scan generates redundant peptide candidates and can easily provide protein context for each peptide candidate.
4 Redundant Candidate Elimination
Typical amino-acid sequence databases, particularly those that combine sequences from different sources, contain some entries with identical sequences. Fortunately, these redundant entries can be efficiently eliminated by computing a suitable hash on each sequence. Substring redundancy, however, is not as easy to deal with. When all peptide candidates of mass between 600 and 4000 Daltons are enumerated from the publicly available GenPept amino-acid sequence database, over 60% of the peptide candidates output have the same sequence as a candidate already seen. We will denote the compression in the number of candidates that must be considered when this redundancy is eliminated by the substring density, δ. Figure 1 plots the substring density of various publicly available amino-acid sequence databases for candidates (and tryptic candidates) with mass between 600 and M_max, where M_max varies from 610 to 4000 Daltons. Notice that δ is quite flat for large M_max, suggesting that a substring density significantly less than 1 is not merely an artifact of the length of the substrings considered.
Notice also that the carefully curated SwissPROT sequence database exhibits much greater substring density than the others. This probably reflects
the curation policy of SwissPROT of collapsing polymorphic and variant forms of the same protein into a single entry with feature annotations. If each variant form of a protein were a distinct entry in the database, then its substring density would decrease significantly. On the other hand, GenPept, which contains protein sequence data from many sources and is not hand-curated, has much lower substring density. We claim that substring density is not merely inversely proportional to database size, but that it is increasingly difficult to find appropriate ways to collapse similar, but not identical, entries in ever larger sequence databases.

Fig. 1. Substring density δ of all candidates (solid line) and tryptic candidates (dashed line) with mass between 600 and M_max Daltons, for SwissPROT, TrEMBL, and GenPept.
4.1 Suffix Tree Traversal
Suffix trees provide a compact representation of all distinct substrings of a string, effectively eliminating redundant peptide candidates. We discuss the salient properties of suffix trees here; see Gusfield [7] for a comprehensive introduction. Suffix trees can be built in time and space linear in the string length. Once built, all distinct substrings are represented by some path from the root of the suffix tree. Having constructed a suffix tree representation of our sequence database, a depth-first traversal of the suffix tree that examines substrings on paths from the root with mass at most the maximum query mass generates all sequence-distinct peptide candidates of our sequence database.
As the depth-first traversal of the suffix tree proceeds, each candidate substring must be checked to see whether it corresponds to a query mass. As before, we bound the worst-case performance of query mass lookup by O(R_max log k). This results in a running time bound of O(k log k + nL_ave R_max log k).
The additional memory overhead of the suffix tree is the only downside of this approach. A naive implementation of the suffix tree data structure can require as much as 24n additional memory for a string of length n. However, as suffix trees have found application in almost every aspect of exact string matching, a great
deal of effort has been made to find more compact representations. For example, Kurtz [9] claims a memory overhead of approximately 8n for various biological databases. This degree of compression can also be achieved, in our experience, by using a suffix array data structure, which represents the same information as a suffix tree, but in a typically more compact form. Again, see Gusfield [7] for an introduction. In theory, a depth-first traversal of a suffix array should be quite a bit more expensive than for a suffix tree, but in our experience, the linear scan through the suffix array that is necessary to implement the depth-first traversal costs little more than pointer following in the suffix tree.
We note that using a suffix tree traversal for peptide candidate generation can make the protein context of peptide candidates more expensive to compute, particularly for context upstream of the candidate, necessary for determining if the peptide candidate is consistent with an enzymatic digest.
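How a suffix array eliminates redundant candidates can be sketched in a few lines: in suffix-array order, the new distinct substrings contributed by each suffix are exactly its prefixes longer than its common prefix with the previous suffix. The quadratic-time construction below is for illustration only; the linear-time constructions cited above are what one would use in practice.

def distinct_candidates(sigma, mu, m_max):
    # Yield every distinct substring of sigma with mass at most m_max exactly once.
    sa = sorted(range(len(sigma)), key=lambda i: sigma[i:])   # naive suffix array
    prev = ''
    for i in sa:
        suf = sigma[i:]
        lcp = 0                                # common prefix with the previous suffix
        while lcp < min(len(prev), len(suf)) and prev[lcp] == suf[lcp]:
            lcp += 1
        cur = sum(mu[c] for c in suf[:lcp])    # prefixes up to lcp were already emitted
        for j in range(lcp, len(suf)):
            cur += mu[suf[j]]
            if cur > m_max:                    # positive masses: longer prefixes only grow
                break
            yield suf[:j + 1]
        prev = suf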
5 Query Lookup
Whether we consider the Simultaneous Linear Scan or Suffix Tree Traversal algorithm for peptide candidate generation, it should be clear that as the number of distinct query masses k gets large, our running time depends critically on the query mass lookup time. Once k gets too big for the Sequential Linear Scan algorithm to be appropriate, effective algorithms for the peptide candidate generation problem must no longer focus on the string scanning subproblem, but instead on the query mass lookup subproblem. Figure 2 plots the degree of overlap of a typical set of query mass intervals derived from a Finnigan LCQ mass spectrometer run containing 1493 tandem mass spectra. Notice the wide range of candidate masses that must be considered by the candidate generation algorithm.

Fig. 2. Overlap plot (number of spectra against mass, 1000-4000 Daltons) for 1493 query masses from a Finnigan LCQ tandem mass spectrometry experiment.
We can obtain an O(1) query lookup time per output peptide candidate by constructing a lookup table, indexed by the set of all possible candidate masses, containing the corresponding query mass indices. This is feasible for
Algorithm 3 Query Mass Lookup Table: Construction: Real Case
    ε ← min_i (l_i + u_i).
    allocate and initialize a lookup table T of size N = ⌊M_max/ε⌋ + 1.
    for all query masses i do
        for all j = ⌊(M_i - l_i)/ε⌋, ..., ⌊(M_i + u_i)/ε⌋ do
            if εj ≤ M_i - l_i < ε(j + 1) then
                T[j].L ← i
            else if εj ≤ M_i + u_i < ε(j + 1) then
                T[j].U ← i
            else {M_i - l_i < εj and M_i + u_i ≥ ε(j + 1)}
                T[j].O ← i
    for all j = 0, ..., N - 1 do
        sort the elements i of T[j].L in increasing M_i - l_i order.
        sort the elements i of T[j].U in decreasing M_i + u_i order.
Algorithm 4 Query Mass Lookup Table: Lookup: Real Case
    input query mass M̃.
    j ← ⌊M̃/ε⌋.
    output i for each i ∈ T[j].O.
    output i in order from T[j].L until M_i - l_i > M̃.
    output i in order from T[j].U until M_i + u_i < M̃.
suitably small integer query masses, as the resulting data structure will have size O(M_max + k).
We can readily adapt this approach to the real mass case. Algorithm 3 describes the procedure for building the lookup table. Having selected the discretization factor ε for the real masses to match the smallest query mass tolerance and allocated a table of the appropriate size, we iterate through the query masses to populate the table. For each query mass, we determine the relevant table entries that represent intervals that intersect with the query mass tolerance interval. The query mass intervals that intersect the interval J = [εj, ε(j + 1)) represented by a table entry j do so in one of three ways: the lower endpoint falls inside J, in which case the query mass index is stored in T[j].L; the upper endpoint falls inside J, in which case the query mass index is stored in T[j].U; or the query mass interval completely overlaps J, in which case the query mass index is stored in T[j].O. The choice of ε ensures that no query mass interval is properly contained in J. Once all the table entries are populated, then for all table entries j, the query mass indices i ∈ T[j].L are sorted in increasing M_i - l_i order and, similarly, the query mass indices i ∈ T[j].U are sorted in decreasing M_i + u_i order. The resulting lookup table has size bounded above by O(M_max/ε + k max_i (l_i + u_i)/ε) and can be constructed in time bounded by O(M_max/ε + k max_i (l_i + u_i)/ε + (M_max/ε) O_max log O_max).
For typical values, building this table is quite feasible. M_max is usually between 2000 and 4000 Daltons, while ε is usually about 2 Daltons. Even tens of
thousands of query masses spread over 1000-2000 bins will not cause significant memory or construction time problems. Once built, the cost in memory and construction time is quickly amortized by the O(1) query lookup time.
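A Python transcription of Algorithms 3 and 4 (a sketch; the tuple layout, dictionary keys, and function names are our own choices):

def build_table(queries):
    # queries: list of (M, l, u). Returns (eps, table); table[j] holds the lists
    # L, U, O of Algorithm 3 for the bin J = [eps*j, eps*(j+1)).
    eps = min(l + u for _, l, u in queries)
    m_max = max(M + u for M, _, u in queries)
    table = [{'L': [], 'U': [], 'O': []} for _ in range(int(m_max / eps) + 1)]
    for i, (M, l, u) in enumerate(queries):
        for j in range(int((M - l) / eps), int((M + u) / eps) + 1):
            if eps * j <= M - l < eps * (j + 1):
                table[j]['L'].append(i)        # lower endpoint falls inside the bin
            elif eps * j <= M + u < eps * (j + 1):
                table[j]['U'].append(i)        # upper endpoint falls inside the bin
            else:
                table[j]['O'].append(i)        # interval completely overlaps the bin
    for entry in table:
        entry['L'].sort(key=lambda i: queries[i][0] - queries[i][1])     # increasing M_i - l_i
        entry['U'].sort(key=lambda i: -(queries[i][0] + queries[i][2]))  # decreasing M_i + u_i
    return eps, table

def lookup(eps, table, queries, mass):
    # Algorithm 4: report every query interval containing mass (0 <= mass <= M_max).
    entry = table[int(mass / eps)]
    hits = list(entry['O'])
    for i in entry['L']:
        if queries[i][0] - queries[i][1] > mass:
            break
        hits.append(i)
    for i in entry['U']:
        if queries[i][0] + queries[i][2] < mass:
            break
        hits.append(i)
    return hits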
The bottom line is that it is possible to solve the peptide candidate generation problem (with integer masses) in time O(M_max + k + nL_ave R_max), and that it is practical to implement this approach for real query masses. Note that the number of peptide candidates output by any correct algorithm is bounded above by nL_ave R_max. This bound is quite tight, particularly when R_max is close to its lower bound of k/M_max. Ultimately, this means that we cannot expect to do much better than this running time bound when building the query lookup table is feasible.
6 Enzymatic Digest Considerations
Enzymatic digest considerations offer both a challenge and an opportunity for peptide candidate generation. They present a challenge, depending on the underlying scan or traversal, because obtaining the necessary protein context in order to check whether a peptide candidate is consistent with a particular enzymatic digest may not be trivial, particularly during a suffix-tree-based traversal. On the other hand, they present an opportunity, because digest constraints significantly reduce the number of peptide candidates that need to be examined. With careful consideration for the underlying data structures, it is possible to enumerate peptide candidates that satisfy enzymatic digest constraints at no additional cost, which means that the reduction in the number of candidates output directly impacts the algorithm run-time.
First, we consider how to manage enzymatic digest constraints for the linear scan algorithms. Notice first that since the motifs for enzymatic digestion are typically simple and short, we can find all putative digestion sites in O(n) time. If both ends of the peptide candidates are constrained to be consistent with the enzymatic digest, then the begin and end pointers of Algorithms 1 and 2 can skip from one putative site to the next. Further, if we consider the case when only a small number of missed digestion sites are permitted, we can avoid the generation of candidates with moderate mass but many missed digestion sites. If only one end of the peptide candidate is required to be consistent with the digest, we must use two passes through the sequence. The first pass constrains the left pointer to putative digest sites, and the second pass constrains the right pointer to putative digest sites and the left pointer to non-digest sites.
Next, we consider how to manage enzymatic digest consistency for the suffix tree based algorithms. Unlike the linear scan algorithms, peptide candidate generation is asymmetric with respect to the ends of the candidate. In fact, there is little we can do within the suffix tree traversal to reduce the amount of work that needs to be done to generate a new peptide candidate in which the right endpoint is consistent with the enzymatic digest. We must still explore all the children of any node we visit, as long as the current peptide candidate mass is
Table 2. Total time (s) / number of calls (10^9) / time per call (10^-6 s) of the traversal subroutines.

                        generation        interval search    string scan
    Suffix tree         237/1.62/0.146    548/2.90/0.189     115/1.33/0.086
    Sim. linear scan    228/2.02/0.112    844/3.53/0.239     87/1.55/0.056
less than the maximum query mass. We can avoid query mass lookups, just as we did for the linear scans, by checking each subsequence with a valid mass for a digest-consistent right endpoint before we look up the query masses. On the other hand, constraining the left endpoint of all candidates to be consistent with the digest can restrict the suffix tree traversal algorithms to a subtree of the suffix tree. In order to accomplish this, we must consider a peptide candidate to consist of the protein context to the left of the candidate prepended to the peptide candidate proper. The non-candidate symbols prepended to the candidate do not contribute to the mass of the candidate, but permit us to look only at those subtrees of the root that begin with a digest-consistent motif.
7 Computational Observations
We implemented solutions with a suffix tree and a simultaneous linear scan as described in Algorithm 2. Profiles of these implementations give an estimate of the relative constant factors associated with string traversal versus interval search costs. We ran our implementations against the 1493 sample query masses on the SwissPROT database.
Although we have shown that O(1) lookup of intervals is possible, we have used a skip list to store the intervals in both examples. As the average interval coverage is not large in this example, we thought this would suffice.
Our implementations were compiled on DEC Alpha EV6.7 processors (500 MHz) with the native C++ compiler, CXX (with the option -O4). Timing was done with gprof. We have aggregated the timings for the interval traversal routines, the string traversal routines, and the candidate generation function which calls them alternately.
Table 2 demonstrates that the largest fraction of the cost (by number of calls or time) is spent in the interval lookup, examining about 2 intervals per string search, as Figure 2 suggests. The extra time cost from the redundant string traversals in the simultaneous linear scan implementation is compensated by the per-call speed coming from a simpler scanning routine. The extra interval search time for the simultaneous linear scan is likely due to the increased time spent re-traversing short intervals.
8 Conclusion
We have identified and formulated a key subproblem, the peptide candidate generation problem, that must be solved in order to identify proteins via tandem
mass spectrometry and amino-acid sequence database search. We have outlined the context in which any solution to the peptide candidate generation problem must operate and carefully examined how this context influences the classic algorithmic tradeoffs of time and space.
We have identified a key property of amino-acid sequence databases, substring density, that quantifies the unnecessary peptide candidates output by a linear scan of the sequence database. We have further proposed the use of the suffix array as a compact representation of all substrings of the sequence database that eliminates candidate redundancy.
We have also demonstrated that as the number of query masses increases, query mass lookup time becomes more significant than sequence database scan time. We proposed a constant-time query lookup algorithm based on preprocessing the query masses to form a lookup table, and showed that this technique is quite feasible in practice.
Our results for the peptide candidate generation problem depend heavily on our empirical observations about the nature of the problem instances we see every day. We need to obtain better worst-case bounds for this problem to make our performance less sensitive to less common extreme instances. We also need to do a comprehensive empirical study to more carefully determine the scenarios in which each of the algorithms identified is the method of choice.
References
1. V. Bafna and N. Edwards. SCOPE: A probabilistic model for scoring tandem mass
spectra against a peptide database. Bioinformatics, 17(Suppl. 1):S13-S21, 2001.
2. T. Chen, M. Kao, M. Tepel, J. Rush, and G. Church. A dynamic programming
approach to de novo peptide sequencing via tandem mass spectrometry. In ACM-
SIAM Symposium on Discrete Algorithms, 2000.
3. M. Cieliebak, T. Erlebach, S. Liptak, J. Stoye, and E. Welzl. Algorithmic
complexity of protein identification: Combinatorics of weighted strings. Submitted to
Discrete Applied Mathematics special issue on Combinatorics of Searching, Sorting,
and Coding, 2002.
4. J. Cottrell and C. Sutton. The identification of electrophoretically separated
proteins by peptide mass fingerprinting. Methods in Molecular Biology, 61:67-82,
1996.
5. V. Dancik, T. Addona, K. Clauser, J. Vath, and P. Pevzner. De novo peptide
sequencing via tandem mass spectrometry. Journal of Computational Biology,
6:327-342, 1999.
6. J. Eng, A. McCormack, and J. Yates. An approach to correlate tandem mass
spectral data of peptides with amino acid sequences in a protein database. Journal
of the American Society for Mass Spectrometry, 5:976-989, 1994.
7. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology. Cambridge University Press, 1997.
8. P. James, M. Quadroni, E. Carafoli, and G. Gonnet. Protein identification in DNA
databases by peptide mass fingerprinting. Protein Science, 3(8):1347-1350, 1994.
9. S. Kurtz. Reducing the space requirement of suffix trees. Software - Practice and
Experience, 29(13):1149-1171, 1999.
10. D. Pappin. Peptide mass fingerprinting using MALDI-TOF mass spectrometry.
Methods in Molecular Biology, 64:165-173, 1997.
11. D. Pappin, P. Hojrup, and A. Bleasby. Rapid identification of proteins by peptide-
mass fingerprinting. Current Biology, 3(6):327-332, 1993.
12. D. Perkins, D. Pappin, D. Creasy, and J. Cottrell. Probability-based protein
identification by searching sequence databases using mass spectrometry data.
Electrophoresis, 20(18):3551-3567, 1997.
13. P. Pevzner, V. Dancik, and C. Tang. Mutation-tolerant protein identification by
mass-spectrometry. In R. Shamir, S. Miyano, S. Istrail, P. Pevzner, and M. Waterman,
editors, International Conference on Computational Molecular Biology (RECOMB),
pages 231-236. ACM Press, 2000.
14. J. Taylor and R. Johnson. Sequence database searches via de novo peptide
sequencing by mass spectrometry. Rapid Communications in Mass Spectrometry,
11:1067-1075, 1997.
Improved Approximation Algorithms
for NMR Spectral Peak Assignment
Zhi-Zhong Chen^1, Tao Jiang^2, Guohui Lin^3, Jianjun Wen^2, Dong Xu^4, and Ying Xu^4

^1 Dept. of Math. Sci., Tokyo Denki Univ., Hatoyama, Saitama 350-0394, Japan
[email protected]
^2 Dept. of Comput. Sci., Univ. of California, Riverside, CA 92521
{jiang,wjianju}@cs.ucr.edu
^3 Dept. of Comput. Sci., Univ. of Alberta, Edmonton, Alberta T6G 2E8, Canada
[email protected]
^4 Protein Informatics Group, Life Sciences Division,
Oak Ridge National Lab., Oak Ridge, TN 37831
{xud,xyn}@ornl.gov
Abstract. We study a constrained bipartite matching problem where
the input is a weighted bipartite graph G = (U, V, E), U is a set of
vertices following a sequential order, V is another set of vertices partitioned
into a collection of disjoint subsets, each following a sequential order,
and E is a set of edges between U and V with non-negative weights.
The objective is to find a matching in G with the maximum weight that
satisfies the given sequential orders on both U and V, i.e., if u_{i+1} follows
u_i in U and v_{j+1} follows v_j in V, then u_i is matched with v_j if and
only if u_{i+1} is matched with v_{j+1}. The problem has recently been
formulated as a crucial step in an algorithmic approach for interpreting
NMR spectral data [15]. The interpretation of NMR spectral data
is known as a key problem in protein structure determination via NMR
spectroscopy. Unfortunately, the constrained bipartite matching problem
is NP-hard [15]. We first propose a 2-approximation algorithm for the
problem, which follows directly from the recent result of Bar-Noy et al.
[2] on interval scheduling. However, our extensive experimental results
on real NMR spectral data illustrate that the algorithm performs poorly
in terms of recovering the target-matching (i.e. correct) edges. We then
propose another approximation algorithm that tries to take advantage
of the density of the sequential order information in V. Although we
are only able to prove an approximation ratio of 3 log D for this algorithm,
where D is the length of a longest string in V, the experimental
results demonstrate that this new algorithm performs much better on
real data, i.e. it is able to recover a large fraction of the target-matching
edges and the weight of its output matching is often in fact close to the
maximum. We also prove that the problem is MAX SNP-hard, even if
the input bipartite graph is unweighted. We further present an approximation
algorithm for a nontrivial special case that breaks the ratio-2
barrier.

* Supported in part by the Grant-in-Aid for Scientific Research of the Ministry of
Education, Science, Sports and Culture of Japan, under Grant No. 12780241. Part
of the work done while visiting at UC Riverside.
** Supported in part by a UCR startup grant and NSF Grants CCR-9988353 and
ITR-0085910.
*** Supported in part by NSERC grants RGPIN249633 and A008599, and Startup
Grant REE-P5-01-02-Sci from the University of Alberta.
† Supported by NSF Grant CCR-9988353.
‡ Supported by the Office of Biological and Environmental Research, U.S. Department
of Energy, under Contract DE-AC05-00OR22725, managed by UT-Battelle, LLC.

R. Guigó and D. Gusfield (Eds.): WABI 2002, LNCS 2452, pp. 82-96, 2002.
© Springer-Verlag Berlin Heidelberg 2002
1 Introduction
The Human Genome Project [1] has led to the identification of a vast majority
of protein-encoding genes in the human genome. To facilitate a systematic
study of the biological functions of these proteins, the US National Institutes of
Health (NIH) has recently launched another ambitious project, the Structural
Genomics Project [9]. Its main goal is to solve about 100,000 protein structures
within the next ten years, through the development and application of significantly
improved experimental and computational technologies. Along with X-ray
crystallography, nuclear magnetic resonance (NMR) spectroscopy has been one
of the two main experimental methods for solving protein structures. Among the
seven pilot Structural Genomics Centers set up by NIH, one center is devoted
to protein structure determination via NMR.
Protein structure determination via NMR generally involves the following
three key steps:
- NMR spectral data generation, which produces
  - resonance peaks corresponding to amino acids in the target protein
    sequence (peaks corresponding to a common amino acid are grouped into
    a spin system);
  - certain geometric relationships (e.g. distances and angles) between the
    spin systems;
- NMR data interpretation, which involves relating the spin systems to the
  amino acids in the target protein sequence, providing both inter- and intra-
  amino acid distance and angle information;
- NMR structure calculation, which calculates the target protein structure
  through molecular dynamics (MD) and energy minimization (EM) under
  the constraints of the identified geometric relationships.
It typically takes several months to a year to solve a single protein structure by
NMR, and a major part of that time is used for NMR data interpretation. Until
very recently, NMR data interpretation has been done mainly using manual
procedures. Though a number of computer programs [4,6,7,14,16] have recently
been developed to assist the data interpretation, most NMR labs are still doing
the peak assignments manually or semi-manually for quality reasons. With the
recent progress in NMR technologies for speeding up the data production rate,
we expect that NMR data interpretation will soon become the sole bottleneck
in a high-throughput NMR structure determination process.
Two key pieces of information form the foundation of NMR peak assignment:
- Each amino acid has a somewhat unique spin system;^1
- The sequential adjacency information between spin systems in a protein
  sequence is often inferable from the spectral data. However, this type of
  information is generally incomplete, i.e. we may often be able to obtain the
  adjacency relationship between some of the spin systems but not all.
In a recently developed computational framework [15], the NMR peak assignment
problem has been formulated as a constrained bipartite matching problem.
In this framework, each amino acid (also called residue) is represented as a vertex
of U and each spin system is represented as a vertex of V (and thus generally
|U| = |V|). A pair (u_i, v_j) ∈ U × V of vertices that represents a potential assignment
has a non-negative weight w_{i,j} = w(u_i, v_j), which scores the preference of
assigning spin system v_j to amino acid u_i. Let E denote the set of all potential
assignments. Clearly G = (U, V, E ⊆ U × V) is a bipartite graph. In general,
the edges in E have different weights and G is said weighted. In the special case
that the edges have equal weight, G is said unweighted. For more detailed information
about the weighting scheme, we refer the reader to [15]. The Maximum
Weight Bipartite Matching problem [5] provides a natural framework for the study
of the NMR peak assignment problem. Nonetheless, some resonance peaks from
a single NMR experiment are known to belong to atoms from consecutive amino
acids and thus their host spin systems should be mapped to consecutive amino
acids. Such spin systems that should be mapped consecutively are said to be adjacent
and their corresponding vertices in V are required to follow a sequential
order. For convenience, we number the amino acids consecutively in the order
that they appear in the protein sequence, and number the spin systems in such
a way that adjacent spin systems have consecutive indices. In this formulation,
a feasible matching M in G is one such that if v_j and v_{j+1} are sequentially
adjacent, then edge (u_i, v_j) ∈ M if and only if edge (u_{i+1}, v_{j+1}) ∈ M. The Constrained
Bipartite Matching (CBM) problem is to find a feasible matching
in G achieving the maximum weight.

We call a maximal set of vertices in V that are consecutively adjacent a string.
Thus, the set V is partitioned into a collection of strings. The CBM problem in
which the maximum length of strings in V is D is called the D-String CBM
problem. Without loss of generality, assume D > 1. In the practice of NMR peak
assignment, D is usually between 4 and 10. One may notice that the standard
Maximum Weight Bipartite Matching problem is simply the 1-String
CBM problem, and it is known to be solvable in polynomial time [5]. Unfortunately,
the D-String CBM problem is intractable even when it is unweighted
and D = 2.
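To make the feasibility constraint concrete, the following Python sketch checks it for an explicit matching; the data layout (dicts and index lists) is our own illustrative choice, not part of the framework of [15].

    def is_feasible(matching, strings):
        # matching: dict from spin-system index j to amino-acid index i,
        #           i.e. the edge set {(u_i, v_j)}.
        # strings:  V as a list of strings, each a list of consecutive
        #           spin-system indices, e.g. [[1], [2, 3], [4, 5, 6]].
        for s in strings:
            matched = [j for j in s if j in matching]
            # adjacent spin systems force each other: all or none of a string
            if matched and matched != s:
                return False
            # ...and the string must map to consecutive amino acids
            for j, j_next in zip(s, s[1:]):
                if j in matching and matching[j] + 1 != matching[j_next]:
                    return False
        # an ordinary matching: no amino acid is used twice
        assigned = list(matching.values())
        return len(assigned) == len(set(assigned))

(The sketch assumes the edges used are present in E; that membership test is omitted for brevity.)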
^1 This information alone is not sufficient for a correct assignment since a protein
sequence typically contains multiple copies of the same amino acid. Additional
information is necessary in order to tell if a particular spin system corresponds to, for
example, an alanine at a particular sequence position.
Theorem 1. [15] The unweighted 2-String CBM is NP-hard.
A two-layer algorithm for D-String CBM has been proposed in [15] that
attempts to fix likely assignments and filter out unlikely assignments for long
strings (i.e. of at least 3 spin systems) in the first layer of computation. In the
second layer, it tries all possible combinations of assignments for long strings
and extends them to perfect matchings (recall that |U| = |V|) by exhaustive
enumeration. A perfect matching with the maximum weight generated in this
way is output as the result. The current implementation of the algorithm runs
efficiently for cases where the number of long strings is relatively small and most
of the long strings consist of at least 4 or 5 spin systems. Its running time goes
up quickly (i.e. exponentially) when the instance has many strings consisting of
2 or 3 spin systems.
In this paper, we first propose a simple 2-approximation algorithm for D-String
CBM that directly follows from the recent result of Bar-Noy et al. [2] on
interval scheduling. However, our experimental results on 126 instances of NMR
spectral data derived from 14 proteins illustrate that the algorithm performs
poorly in terms of recovering the target-matching edges (i.e. matching edges
that assign spin systems to their correct amino acids).^2 One explanation is that
the algorithm looks for matching edges by scanning U from left to right, hence
giving preference to edges close to the beginning of U. Consequently, it may miss
many target-matching edges. We thus propose a second approximation algorithm
that attempts to take advantage of the density of the spin system adjacency
information in V. Although we are only able to prove an approximation ratio
of 3 log_2 D for this algorithm, the experimental results demonstrate that this
new algorithm performs much better than the 2-approximation algorithm on
real data. In fact, it often recovers as many target-matching edges as the two-layer
algorithm [15] (of exhaustive search nature) and the weight of its output
matching is often close to the maximum. We then prove that unweighted 2-String
CBM is MAX SNP-hard, implying that the problem has no polynomial-time
approximation scheme (PTAS) unless P = NP. The proof extends to all
constants D ≥ 2. Although ratio 2 seems to be a barrier to polynomial-time
approximation algorithms for D-String CBM, we show that this barrier can
be broken for unweighted 2-String CBM, by presenting a 5/3-approximation
algorithm. We remark that unweighted D-String CBM could be interesting
because it is simpler and is useful in NMR peak assignment when the edge
weights fall into a small range. Moreover, since long strings in V are usually
associated with good quality spectral data, algorithms that attempt to solve
unweighted D-String CBM could yield reasonably good NMR peak assignments
since they tend to favor long strings. We expect that the techniques developed
in this work, in conjunction with the work of [15], will lead to a significantly
improved capability for NMR data interpretation, providing a highly effective
tool for high-throughput protein structure determination.
^2 Note that target-matching edges are known in the simulations from BioMagResBank
[13], and they were used to generate the adjacency data.
The paper is organized as follows. Section 2 describes the 2-approximation
and the 3 log_2 D-approximation algorithms for D-String CBM, and compares
their performances (as well as that of the two-layer algorithm) on 126 instances
of real NMR spectral data derived from 14 proteins. It also gives a proof of the
MAX SNP-hardness of unweighted 2-String CBM. Section 3 presents an improved
approximation algorithm for unweighted 2-String CBM. Section 4 concludes the
paper with some future research directions.
2 Weighted Constrained Bipartite Matching
We first present two approximation algorithms for D-String CBM. Consider
an instance of D-String CBM: G = (U, V, E), where U = {u_1, u_2, . . . , u_{n_1}},
V = {v_1 · · · v_{i_1}, v_{i_1+1} · · · v_{i_2}, . . . , v_{i_p+1} · · · v_{n_2}}, and E ⊆ U × V is the set of edges.
Here, v_{i_{j-1}+1} · · · v_{i_j} in V denotes a string of consecutively adjacent spin systems.
We may assume that for every substring v_j v_{j+1} of a string in V, (u_i, v_j) ∈ E
if and only if (u_{i+1}, v_{j+1}) ∈ E, because otherwise (u_i, v_j) cannot be in any
feasible matching and thus can be eliminated without further consideration.
Based on G = (U, V, E), we construct a new edge-weighted bipartite graph
G' = (U, V, E') as follows: For each u_i ∈ U and each string v_j v_{j+1} · · · v_k ∈ V
such that (u_i, v_j) ∈ E, let (u_i, v_j) be an edge in E' and its weight be the total
weight of the edges {(u_{i+x}, v_{j+x}) | 0 ≤ x ≤ k - j} in E. For convenience, we call
the subset {(u_{i+x}, v_{j+x}) | 0 ≤ x ≤ k - j} of E the expanded matching of edge
(u_i, v_j) of E'.
We say that two edges of E' are conflicting if the union of their expanded
matchings is not a feasible matching in G. Note that a set of non-conflicting
edges in E' is always a matching in G', but the reverse is not necessarily true.
A matching in G' is feasible if it consists of non-conflicting edges. There is an
obvious one-to-one correspondence between feasible matchings in G and feasible
matchings in G'. Namely, the feasible matching M in G corresponding to a
feasible matching M' in G' is the union of the expanded matchings of edges in
M'. Note that the weight of M in G is the same as that of M' in G'. Thus, it
remains to show how to compute a feasible approximate matching in G'.
Define an innermost edge of G' to be an edge (u_i, v_j) in G' satisfying the
following condition: G' has no edge (u_{i'}, v_{j'}) other than (u_i, v_j) such that
i ≤ i' ≤ i' + s' - 1 ≤ i + s - 1, where s (respectively, s') is the size of the expanded
matching of (u_i, v_j) (respectively, (u_{i'}, v_{j'})).
Note that for every u_i ∈ U, G' has at most one innermost edge incident to u_i
(i.e., there cannot exist v_{j_1} ∈ V and v_{j_2} ∈ V with j_1 ≠ j_2 such that both (u_i, v_{j_1})
and (u_i, v_{j_2}) are innermost edges of G'). Define a leading innermost edge of G' to
be an innermost edge (u_i, v_j) such that the index i is minimized. The crucial point is
that for every leading innermost edge (u_i, v_j) of G' and every feasible matching
M' in G', at most two edges of M' conflict with (u_i, v_j). To see this, let (u_{i'}, v_{j'})
be an edge in M' that conflicts with (u_i, v_j). Let s (respectively, s') be the size
of the expanded matching of (u_i, v_j) (respectively, (u_{i'}, v_{j'})). Since (u_i, v_j) is an
innermost edge of G', at least one of the following conditions holds:

1. j' = j.
2. i' ≤ i ≤ i + s - 1 ≤ i' + s' - 1.
3. i < i' ≤ i + s - 1 < i' + s' - 1.
4. i' < i ≤ i' + s' - 1 < i + s - 1.

For each of these conditions, M' contains at most one edge (u_{i'}, v_{j'}) satisfying
the condition because M' is a feasible matching in G'. Moreover, if M'
contains an edge (u_{i'}, v_{j'}) satisfying Condition 2, then it contains no edge satisfying
Condition 3 or 4. Furthermore, M' contains no edge (u_{i'}, v_{j'}) satisfying
Condition 4, or else there would be an innermost edge (u_{i''}, v_{j''}) in G' with
i' < i'' ≤ i'' + s'' - 1 ≤ i' + s' - 1 (where s'' is the size of the expanded
matching of (u_{i''}, v_{j''})), contradicting the assumption that (u_i, v_j) is a leading
innermost edge in G'. Thus, at most two edges of M' conflict with (u_i, v_j).

Using the above fact (that at most two edges of M' conflict with a leading
innermost edge) and the local ratio technique in [3], we can construct a recursive
algorithm to find a (heavy) feasible matching in G', as shown in Figure 1. The
algorithm in fact, as we were informed very recently, follows directly from the
recent result of Bar-Noy et al. [2] on interval scheduling.
2-Approximation on G':
1. if (E(G') = ∅)
   output the empty set and halt;
2. find a leading innermost edge e in G';
3. Γ = {e} ∪ {e' | e' ∈ E(G'), e' conflicts with e};
4. find the minimum weight c of an edge of Γ in G';
5. for (every edge f ∈ Γ)
   subtract c from the weight of f;
6. F = {e | e ∈ Γ, e has weight 0};
7. G' = G' - F;
8. recursively call 2-Approximation on G' and output M';
9. find a maximal M'' ⊆ M' ∪ F such that M'' is a feasible matching in G';
10. output M'' and halt.

Fig. 1. A recursive algorithm for finding a feasible matching in G'.
Theorem 2. [2] The algorithm described in Figure 1 outputs a feasible matching
of the graph G' = (U, V, E') with weight at least half of the optimum.
We have implemented the algorithm and tested it on a set of 14 proteins
from BioMagResBank [13]. For each protein, we randomly generated 9 instances
of spin-system adjacency by adding links (obtained from the correct adjacency
in BioMagResBank) between neighboring spin systems. If spin systems
are connected by links, they map to the sequence together as a string.
We increased the number of links from 10% of the sequence length to 90% of
the sequence length. In other words, the algorithm was tested on 126 bipartite
graphs with positive edge weights and adjacency constraints. The test results are
summarized in Table 1 (Columns 5-7). In the tests, the target assignments are
matchings consisting of edges of the form (u_i, v_i) that assign spin systems to correct
amino acids from BioMagResBank. Although these target assignments do not
always have the maximum weights, their weights are not far from the maxima.
As can be seen from the table, although the algorithm did very well in terms
of maximizing the weight of its output matching, it recovered very few target-matching
edges and is thus almost useless in practice. A possible explanation of
the poor performance of the algorithm in this experiment is that the algorithm
looks for edges by scanning amino acids in U from left to right, hence giving
preference to edges close to the beginning of U. As a consequence, it may miss
many target-matching edges. Another reason for the poor performance is probably
the scoring function that was used. The goal of the scoring function is
to force the correct assignment to have the maximum score. However, given the
statistical nature of the scoring function, this goal cannot currently be achieved
completely. That is why even the two-layer algorithm [15] (briefly described in
the Introduction; for a detailed description of the algorithm, please refer
to [15]) recovers small numbers of correct assignments (Table 1, Column 4)
in many cases, although as the number of links between adjacent spin systems
increases, the performance improves. The development of the scoring function,
which we are working on, will not be addressed in this paper. As the scoring
function improves, the correct assignment should get closer to the maximum
score, especially when the number of links between adjacent spin systems is
large.
In trying to improve the performance on recovering the target-matching
edges, we next present a second approximation algorithm that tries to take
advantage of the presence of many long strings in the instance, as described
in Figure 2. Basically, the algorithm partitions the strings in V into groups
of strings of approximately the same length, greedily finds a maximal feasible
matching in each group, and then greedily extends the matching to a maximal
feasible matching in G'. It outputs the heaviest one among the matchings found
for all groups.
Theorem 3. The algorithm described in Figure 2 outputs a feasible matching in
G' with weight at least 1/(3 max{1, log_2 r}) of the maximum weight, in Õ(|U||V|)
(i.e. quadratic up to a poly-logarithmic factor) time, where r is as defined in Figure 2.
It is thus an approximation algorithm for D-String CBM with performance
ratio 3 log_2 D.
Proof. For each i ∈ {1, 2, . . . , g}, consider the bipartite graph G'_i = (U, V_i, E_i).
Let M̂_i denote an optimal feasible matching for the graph G'_i. Right before the
execution of Step 3.4 of the algorithm, M'_i is clearly a feasible matching for
the graph G'_i, and its weight is at least 1/6 of that of M̂_i, because we can claim that
each execution of Step 3.3.2 only rules out at most 6 edges of M̂_i from further
Table 1. Summary on the performances of the 2-approximation and 3 log D-
approximation algorithms on 126 instances of NMR peak assignment. The number
after the underscore symbol in the name of each instance indicates the number of
adjacent pairs of spin systems in the instance (more precisely, _5 means that the number
of adjacent pairs of spin systems is 50% of the total number of residues). M* represents
the target assignment that we want to recover, and M^1 (M^2, or M^3) is the assignment
computed by the two-layer (2-approximation, or 3 log D-approximation, respectively)
algorithm. The parameters R^1 = |M* ∩ M^1|, R^2 = |M* ∩ M^2|, and R^3 = |M* ∩ M^3| show
how many target-matching edges were recovered by the two-layer, 2-approximation, and
3 log D-approximation algorithms, respectively, on each instance. Each N/A indicates
that the computation of the two-layer algorithm was taking too long (> 2 days on a
supercomputer) and had to be killed before any result could be obtained.
Instance |M*| w(M*) R^1 |M^2| w(M^2) R^2 |M^3| w(M^3) R^3 | Instance |M*| w(M*) R^1 |M^2| w(M^2) R^2 |M^3| w(M^3) R^3
bmr4027 1 158 1896284 11 158 1871519 28 158 1931099 4 bmr4144 1 78 949170 10 78 936144 10 78 845578 5
bmr4027 2 18 157 1849500 7 158 1927193 2 bmr4144 2 8 78 928175 4 78 869229 1
bmr4027 3 18 158 1841683 8 158 1930119 23 bmr4144 3 10 77 917197 5 78 881665 1
bmr4027 4 43 158 1829367 11 158 1925237 36 bmr4144 4 0 78 907130 10 78 886147 6
bmr4027 5 33 156 1827498 3 158 1923556 37 bmr4144 5 14 77 921816 17 78 914564 14
bmr4027 6 36 157 1818131 8 158 1916814 48 bmr4144 6 30 77 897500 11 76 876005 3
bmr4027 7 79 155 1784027 44 158 1885779 90 bmr4144 7 34 76 842073 2 78 888087 6
bmr4027 8 19 154 1671475 113 158 1875058 117 bmr4144 8 67 77 804531 5 78 896088 22
bmr4027 9 155 155 1652859 60 158 1896606 156 bmr4144 9 75 76 837519 35 78 949844 76
bmr4288 1 105 1249465 9 105 1238612 8 105 1208142 6 bmr4302 1 115 1298321 9 115 1305677 0 115 1316209 8
bmr4288 2 9 105 1220481 8 105 1194198 9 bmr4302 2 12 115 1273146 0 115 1324173 8
bmr4288 3 16 103 1206095 17 105 1199374 17 bmr4302 3 7 114 1276372 8 115 1313288 8
bmr4288 4 33 105 1185685 5 105 1214237 21 bmr4302 4 16 114 1246952 4 115 1307472 10
bmr4288 5 38 103 1169907 6 105 1211226 34 bmr4302 5 34 113 1219920 11 115 1295035 24
bmr4288 6 52 102 1179110 15 105 1217006 52 bmr4302 6 44 114 1174564 0 115 1255172 60
bmr4288 7 55 103 1112288 22 105 1230117 62 bmr4302 7 65 112 1181267 8 115 1294044 78
bmr4288 8 N/A 101 1133554 35 105 1232331 66 bmr4302 8 N/A 113 1152323 27 113 1283268 99
bmr4288 9 105 100 1051817 48 105 1249465 105 bmr4302 9 111 115 1293954 107 115 1298321 111
bmr4309 1 178 2048987 6 178 2066506 4 178 2118482 4 bmr4316 1 89 1029827 4 89 997300 13 89 1011408 7
bmr4309 2 10 178 2023648 9 178 2108291 4 bmr4316 2 15 89 976270 2 89 1019640 7
bmr4309 3 33 177 2013099 9 178 2115356 22 bmr4316 3 21 88 972224 0 89 1020190 9
bmr4309 4 34 176 2024268 14 178 2107417 18 bmr4316 4 20 87 936852 5 89 1028608 31
bmr4309 5 46 174 1954955 13 178 2090346 31 bmr4316 5 42 86 890944 2 89 1007619 43
bmr4309 6 59 177 1924727 12 178 2074540 55 bmr4316 6 60 84 863207 13 89 1012008 48
bmr4309 7 122 174 1885986 24 178 2078322 114 bmr4316 7 79 87 882818 9 87 1004449 67
bmr4309 8 106 173 1868338 55 178 2026479 112 bmr4316 8 87 87 957378 62 89 1029827 89
bmr4309 9 176 170 1796864 95 175 1999734 153 bmr4316 9 89 85 984774 85 89 1029827 89
bmr4318 1 215 2390881 8 215 2418440 17 215 2495022 2 bmr4353 1 126 1498891 6 126 1482821 20 126 1492927 7
bmr4318 2 5 215 2398412 0 215 2481997 6 bmr4353 2 8 126 1473982 9 126 1499720 7
bmr4318 3 N/A 214 2409316 17 215 2481867 10 bmr4353 3 4 125 1455084 6 126 1499983 8
bmr4318 4 23 213 2394682 3 215 2481099 12 bmr4353 4 20 126 1441162 9 126 1511112 14
bmr4318 5 38 215 2355926 2 215 2473707 27 bmr4353 5 17 125 1417351 8 126 1502628 21
bmr4318 6 38 214 2312260 13 215 2440684 31 bmr4353 6 35 125 1421633 18 126 1514294 11
bmr4318 7 87 210 2259377 52 215 2421426 70 bmr4353 7 29 125 1370235 14 126 1499010 58
bmr4318 8 113 212 2214174 63 209 2326045 91 bmr4353 8 N/A 123 1337329 9 122 1443144 81
bmr4318 9 N/A 207 2158223 122 215 2390651 197 bmr4353 9 126 122 1273988 15 126 1498891 126
bmr4391 1 66 710914 5 66 723525 8 66 750059 5 bmr4393 1 156 1850868 6 156 1826257 10 156 1876203 5
bmr4391 2 8 66 720589 6 66 755718 3 bmr4393 2 14 156 1805561 3 156 1873989 6
bmr4391 3 7 66 724102 8 66 749505 5 bmr4393 3 N/A 156 1782350 5 156 1859924 4
bmr4391 4 6 65 681286 9 66 745159 5 bmr4393 4 22 156 1778165 3 156 1868573 12
bmr4391 5 13 64 688400 5 66 741824 0 bmr4393 5 30 155 1742954 3 156 1862071 42
bmr4391 6 10 66 699066 8 66 739778 0 bmr4393 6 45 155 1772955 42 156 1857579 67
bmr4391 7 0 66 684953 37 66 717888 21 bmr4393 7 74 154 1722026 22 151 1794248 94
bmr4391 8 18 64 663147 30 66 705513 20 bmr4393 8 128 156 1640682 15 154 1830609 136
bmr4391 9 N/A 66 687290 45 61 652235 45 bmr4393 9 143 152 1527885 3 156 1851298 152
bmr4579 1 86 950173 7 86 931328 12 86 967574 5 bmr4670 1 120 1391055 8 120 1378876 27 120 1434117 5
bmr4579 2 12 86 933035 7 86 977013 9 bmr4670 2 10 120 1366541 14 120 1437469 5
bmr4579 3 11 85 923916 4 86 973431 14 bmr4670 3 20 120 1370848 6 120 1437484 16
bmr4579 4 16 86 935901 6 86 961214 11 bmr4670 4 32 119 1341300 6 120 1423323 28
bmr4579 5 13 85 894084 2 86 968378 21 bmr4670 5 35 117 1309727 11 120 1393428 28
bmr4579 6 15 86 911564 8 86 945148 21 bmr4670 6 48 118 1290812 13 120 1394903 40
bmr4579 7 42 86 873884 17 86 952794 45 bmr4670 7 45 118 1239001 6 120 1377578 45
bmr4579 8 49 83 877556 26 86 950136 78 bmr4670 8 N/A 120 1236726 19 118 1370011 101
bmr4579 9 86 83 760356 0 86 950173 86 bmr4670 9 N/A 113 1237614 60 114 1319698 94
bmr4752 1 68 882755 8 68 862523 20 68 889083 9 bmr4929 1 114 1477704 7 114 1432825 5 114 1502375 3
bmr4752 2 12 68 848225 16 68 886989 11 bmr4929 2 10 114 1424433 5 114 1500838 7
bmr4752 3 13 68 834299 2 68 886910 18 bmr4929 3 16 113 1417722 7 114 1499302 18
bmr4752 4 20 67 820207 2 68 892854 16 bmr4929 4 20 113 1411387 7 114 1497361 27
bmr4752 5 28 67 796019 8 68 878244 29 bmr4929 5 24 114 1408112 4 114 1487741 26
bmr4752 6 28 67 824289 6 68 879380 35 bmr4929 6 24 112 1385673 12 114 1480828 31
bmr4752 7 43 66 752633 3 68 868981 40 bmr4929 7 65 112 1378166 30 114 1449648 55
bmr4752 8 N/A 65 730276 17 68 860366 42 bmr4929 8 86 114 1424433 5 114 1471279 87
bmr4752 9 68 67 812950 44 68 882755 68 bmr4929 9 112 107 1178499 20 114 1477704 114
3 log D-Approximation on G':
1. compute the ratio r = ℓ_max/ℓ_min, where ℓ_max (respectively, ℓ_min) is the
   maximum (respectively, minimum) length of strings in V;
2. partition V into g = max{1, ⌈log_4 r⌉} subsets V_1, V_2, . . . , V_g such that
   a string s is included in subset V_i if and only if 4^{i-1} ≤ |s|/ℓ_min ≤ 4^i;
   (Note: V_i and V_{i+1} may not be disjoint.)
3. for (every i ∈ {1, 2, . . . , g})
   3.1 compute the set E_i of edges of G' incident to strings in V_i;
   3.2 initialize M'_i = ∅;
   3.3 while (E_i ≠ ∅)
       3.3.1 find an edge e ∈ E_i of maximum weight;
       3.3.2 add e to M'_i, and delete e and all edges conflicting with
             e from E_i;
   3.4 greedily extend M'_i to a maximal feasible matching of G';
4. output the heaviest one among M'_1, M'_2, . . . , M'_g and halt.

Fig. 2. A new algorithm for finding a feasible matching in G'.
consideration. To see the claim, consider an edge e = (u_x, v_y) added to M'_i in
Step 3.3.2. Let e' = (u_{x'}, v_{y'}) be an edge conflicting with e. Let s (respectively,
s') be the size of the expanded matching of e (respectively, e'). Then, at least
one of the following conditions 1 through 6 holds:

1. y' = y.
2. x' = x and s' = s.
3. x' < x ≤ x' + s' - 1 < x + s - 1.
4. x < x' ≤ x + s - 1 < x' + s' - 1.
5. x' < x ≤ x + s - 1 ≤ x' + s' - 1 or x' ≤ x ≤ x + s - 1 < x' + s' - 1.
6. x < x' ≤ x' + s' - 1 ≤ x + s - 1 or x ≤ x' ≤ x' + s' - 1 < x + s - 1.

Since M̂_i is a feasible matching of G'_i, M̂_i may contain at most one edge satisfying
Condition 1, at most one edge satisfying Condition 2, at most one
edge satisfying Condition 3, at most one edge satisfying Condition 4, at most one
edge satisfying Condition 5, and at most four edges satisfying Condition 6 (because
of the construction of V_i). Due to the same reason, if M̂_i contains an edge
satisfying Condition 2 (respectively, 5), then M̂_i contains no edge satisfying
Condition 6. Similarly, if M̂_i contains an edge satisfying Condition 3 or 4, then
M̂_i contains at most three edges satisfying Condition 6 (because of the construction
of V_i). So, in the worst case (where M̂_i contains the largest number of
edges conflicting with e), M̂_i may contain one edge satisfying Condition 1, one
edge satisfying Condition 3, one edge satisfying Condition 4, and three edges
satisfying Condition 6. This proves the claim.
Let M' denote the output matching of the algorithm. Let M̃ denote an
optimal feasible matching for the graph G', and let M̃_i be the sub-matching of M̃
in the edge set E_i. Suppose without loss of generality that M̃_j is the heaviest one
among M̃_1, M̃_2, . . . , M̃_g. Clearly, we have w(M̃_j) ≥ (1/g) w(M̃). Thus, w(M') ≥
(1/6) w(M̃_j) ≥ (1/(6g)) w(M̃). The time complexity analysis is straightforward. □
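A Python sketch of the algorithm in Figure 2, reusing the Edge records and the conflicts test from the sketch after Figure 1; the bucketing by powers of 4 follows the figure, while the names and data layout are our own illustrative choices.

    import math

    def three_log_d_approx(edges, string_len):
        # Bucket strings by length, run the greedy per bucket, extend each
        # result to a maximal feasible matching, and keep the heaviest.
        lmin = min(string_len.values())
        r = max(string_len.values()) / lmin
        g = max(1, math.ceil(math.log(r, 4)))
        best = []
        for i in range(1, g + 1):
            bucket = {s for s, L in string_len.items()
                      if 4 ** (i - 1) <= L / lmin <= 4 ** i}
            m = []
            for e in sorted(edges, key=lambda e: e.w, reverse=True):
                if e.string in bucket and all(not conflicts(e, f) for f in m):
                    m.append(e)              # greedy choice inside the bucket
            for e in sorted(edges, key=lambda e: e.w, reverse=True):
                if all(not conflicts(e, f) for f in m):
                    m.append(e)              # greedy maximal extension (Step 3.4)
            if sum(e.w for e in m) > sum(e.w for e in best):
                best = m
        return best

Processing edges in decreasing weight order and skipping any edge that conflicts with an accepted one is equivalent to the repeated maximum-weight selection of Step 3.3.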
The above 3 log_2 D-approximation algorithm has been implemented and tested on the
same set of 126 instances of NMR peak assignment. The test results are also
summarized in Table 1 (Columns 8-10). It is quite clear that this algorithm is
much superior to the 2-approximation algorithm, both in terms of maximizing
the weight of the output matching and in terms of maximizing the number
of target-matching edges recovered. In fact, on over half of the instances (more
precisely, 65 out of the 126 instances), the 3 log_2 D-approximation algorithm
recovered at least as many target-matching edges as the (exhaustive) two-layer
algorithm. Because the 3 log_2 D-approximation algorithm is much more efficient
than the two-layer algorithm, it will be very useful in NMR peak assignment.
Observe that the (feasible) matchings found by the approximation algorithms
have weights greater than that of the target assignments on quite a few instances,
especially when the adjacency information is sparse. This implies that
the weighting scheme as formulated in [15] may not work very well when the
adjacency information is sparse, and more work on the weighting scheme is needed
in the future.
A natural question is whether D-String CBM admits a ρ-approximation algorithm
for some constant ρ < 2. Our next theorem shows that there is a constant ρ > 1
such that, for every D ≥ 2, D-String CBM does not admit a ρ-approximation
algorithm unless P = NP, even if the input bipartite graph is unweighted.
Theorem 4. For all D ≥ 2, unweighted D-String CBM is MAX SNP-hard.

Proof. We prove the theorem for D = 2 by a simple L-reduction from Maximum
Bounded 3-Dimensional Matching, which is known to be MAX SNP-complete [8].
The proof can be easily extended to any constant D ≥ 2.
Maximum Bounded 3-Dimensional Matching (MB3DM): Given a
universal set 𝒰 = {1, 2, . . . , m} and a collection of subsets S_1, S_2, . . . , S_n,
where S_i ⊆ 𝒰, |S_i| = 3, and every element u ∈ 𝒰 is contained in at most 3
subsets, find a largest subcollection of pairwise disjoint subsets.
Given an instance of MB3DM, without loss of generality, suppose that m =
3q and n ≥ q. Observe that n ≤ m, because every element of 𝒰 appears in at
most 3 subsets. For each subset S_i, construct 7 vertices a_{i,1}, a_{i,2}, . . . , a_{i,7} in set
U and for each element i ∈ 𝒰 construct a 2-vertex string b_{i,1}b_{i,2} in set V. We will
also have in V q 1-vertex strings f_1, f_2, . . . , f_q and 3n 2-vertex strings c_{1,1}c_{1,2},
c_{1,3}c_{1,4}, c_{1,5}c_{1,6}, . . ., c_{n,1}c_{n,2}, c_{n,3}c_{n,4}, c_{n,5}c_{n,6}. Finally, for every i = 1, 2, . . . , m,
we connect string b_{i,1}b_{i,2} to a_{j,2k}a_{j,2k+1} (i.e. connect vertex b_{i,1} to vertex a_{j,2k}
and vertex b_{i,2} to vertex a_{j,2k+1}), for each 1 ≤ k ≤ 3, if i ∈ S_j; for every
i = 1, 2, . . . , q and every j = 1, 2, . . . , n, connect string f_i to a_{j,1}; and for every
i = 1, 2, . . . , n and every j = 1, 2, . . . , n, connect string c_{i,2k-1}c_{i,2k} to a_{j,2k-1}a_{j,2k},
for each 1 ≤ k ≤ 3. All the edges have unit weight. This forms an instance
of unweighted 2-String CBM: G = (U, V, E), where |U| = 7n, |V| = 7q + 6n.

We claim that the above construction is an L-reduction [10] from MB3DM to
unweighted 2-String CBM. It is straightforward to see that each subcollection
of p (where p ≤ q) disjoint subsets implies a constrained matching in G of weight
7p + 6(n - p) = 6n + p. To complete the proof of the claim, we only need to
observe that, for any given constrained matching in the above bipartite graph,
we can always rearrange it without decreasing the weight so that each group of
vertices a_{i,1}, a_{i,2}, . . . , a_{i,7} is matched either with three c-type strings or with
a combination of one f-type string and three b-type strings, due to the special
construction of the edges. This completes the L-reduction. □
3 Unweighted Constrained Bipartite Matching
As noted in the last section, a natural question is to ask whether D-String CBM
admits an approximation algorithm with ratio less than 2. In this section, we
answer the question affirmatively for a special case, namely, unweighted 2-String
CBM. More specifically, we will give a 5/3-approximation algorithm for
unweighted 2-String CBM.
Consider an instance of unweighted D-String CBM: G = (U, V, E), where
U = {u_1, u_2, . . . , u_{n_1}}, V = {v_1, v_2, . . . , v_k, v_{k+1}v_{k+2}, v_{k+3}v_{k+4}, . . . , v_{k+2ℓ-1}v_{k+2ℓ}},
and E ⊆ U × V is the set of edges in G. Here, v_j alone forms a 1-string in V for
each 1 ≤ j ≤ k, while v_{k+2j-1}v_{k+2j} is a 2-string in V for each 1 ≤ j ≤ ℓ. Note
that k ≥ 0 and ℓ ≥ 0. Let n_2 = k + 2ℓ and m = |E|. We may assume that for
every u_i ∈ U and every 2-string v_{k+2j-1}v_{k+2j} in V, (u_i, v_{k+2j-1}) ∈ E if and only
if (u_{i+1}, v_{k+2j}) ∈ E, because otherwise (u_i, v_{k+2j-1}) or (u_{i+1}, v_{k+2j}) cannot be in
any feasible matching and thus can be eliminated without further consideration.
Fix a maximum size feasible matching M* in G. Let m*_1 be the number of
edges (u_i, v_j) ∈ M* with 1 ≤ j ≤ k. Similarly, let m*_2 be the number of edges
(u_i, v_{k+2j-1}) ∈ M* with 1 ≤ j ≤ ℓ. Then, |M*| = m*_1 + 2m*_2.
Lemma 1. A feasible matching in G can be found in O(m√(n_1 n_2)) time, whose
size is at least m*_1 + m*_2.
Proof. Construct a new bipartite graph G'' = (U, V, E_1 ∪ E_2), where E_1 =
{(u_i, v_j) ∈ E | 1 ≤ j ≤ k} and E_2 = {(u_i, v_{k+2j-1}) ∈ E | 1 ≤ j ≤ ℓ}. Let
M'' be a maximum matching in G''. Obviously, we can obtain a matching in G''
from M* by deleting all edges (u_i, v_{k+2j}) ∈ M* with 1 ≤ i ≤ n_1 and 1 ≤ j ≤ ℓ.
So, |M''| ≥ m*_1 + m*_2. To obtain a feasible matching M of G from M'', we perform
the following steps:

1. Initialize M = ∅.
2. Construct an auxiliary graph H as follows. The vertex set of H is M''. The
edge set of H consists of all {e_1, e_2} such that e_1 ∈ M'', e_2 ∈ M'', and e_1
conflicts with e_2. [Comment: Each connected component of H is a path P
(possibly consisting of a single vertex); if P contains two or more vertices,
then there exist integers i, j_1, . . . , j_h (h ≥ 2) such that the vertices of P are
(u_i, v_{j_1}), (u_{i+1}, v_{j_2}), . . . , (u_{i+h-1}, v_{j_h}), and each of v_{j_1} through v_{j_{h-1}} is the
leading vertex of a 2-string in V.]
3. For each connected component of H formed by only one vertex (u_i, v_j) ∈ M'',
if v_j is a 1-string in V, then add edge (u_i, v_j) to M; otherwise, add edges
(u_i, v_j) and (u_{i+1}, v_{j+1}) to M.
4. For each connected component P of H formed by two or more vertices,
perform the following three substeps:
(a) Let the vertices of P be (u_i, v_{j_1}), (u_{i+1}, v_{j_2}), . . . , (u_{i+h-1}, v_{j_h}) ∈ M''.
(b) If h is even or v_{j_h} is the leading vertex of a 2-string in V, then for each
1 ≤ x ≤ ⌈h/2⌉, add edges (u_{i+2x-2}, v_{j_{2x-1}}) and (u_{i+2x-1}, v_{j_{2x-1}+1}) to M.
(c) If h is odd and v_{j_h} alone forms a 1-string in V, then for each 1 ≤ x ≤ (h-1)/2,
add edges (u_{i+2x-2}, v_{j_{2x-1}}) and (u_{i+2x-1}, v_{j_{2x-1}+1}) to M; further add
edge (u_{i+h-1}, v_{j_h}) to M.

It is clear that for each connected component P of H, we add at least as many
edges to M as the number of vertices in P. Thus, |M| ≥ |M''| ≥ m*_1 + m*_2. □
Lemma 2. A feasible matching in G can be found in O(m√(n_1 n_2)) time, whose
size is at least m*_1/3 + 4m*_2/3.
Proof. For each index i ∈ {0, 1, 2}, let G_i be the edge-weighted bipartite graph
obtained from G as follows:

1. For every 2-string v_j v_{j+1} in V, merge the two vertices in the string into a
single super-vertex s_{j,j+1} (with all resulting multiple edges deleted).
2. For all j such that i + 1 ≤ j ≤ n_1 - 2 and j - 1 ≡ i (mod 3), perform the
following three substeps:
(a) Merge u_j, u_{j+1}, and u_{j+2} into a single super-vertex t_{j,j+1,j+2} (with all
resulting multiple edges deleted).
(b) For every 1-string v_h that is a neighbor of t_{j,j+1,j+2}, if edge (u_{j+1}, v_h) is
not in the original input graph, then delete the edge between t_{j,j+1,j+2}
and v_h; otherwise, assign a weight of 1 to the edge between t_{j,j+1,j+2}
and v_h.
(c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{j,j+1,j+2}, if
neither {(u_j, v_h), (u_{j+1}, v_{h+1})} nor {(u_{j+1}, v_h), (u_{j+2}, v_{h+1})} is a matching
in the original input graph, then delete the edge between t_{j,j+1,j+2}
and s_{h,h+1}; otherwise, assign a weight of 2 to the edge between t_{j,j+1,j+2}
and s_{h,h+1}.
3. If neither u_1 nor u_2 was merged in Step 2a, then perform the following three
substeps:
(a) Merge u_1 and u_2 into a single super-vertex t_{1,2} (with all resulting multiple
edges deleted).
(b) For every 1-string v_h that is a neighbor of t_{1,2}, if edge (u_1, v_h) is not
in the original input graph, then delete the edge between t_{1,2} and v_h;
otherwise, assign a weight of 1 to the edge between t_{1,2} and v_h.
(c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{1,2}, if {(u_1, v_h),
(u_2, v_{h+1})} is not a matching in the original input graph, then delete the
edge between t_{1,2} and s_{h,h+1}; otherwise, assign a weight of 2 to the edge
between t_{1,2} and s_{h,h+1}.
4. If neither u_{n_1-1} nor u_{n_1} was merged in Step 2a, then perform the following
three substeps:
(a) Merge u_{n_1-1} and u_{n_1} into a single super-vertex t_{n_1-1,n_1} (with all resulting
multiple edges deleted).
(b) For every 1-string v_h that is a neighbor of t_{n_1-1,n_1}, if edge (u_{n_1}, v_h) is
not in the original input graph, then delete the edge between t_{n_1-1,n_1}
and v_h; otherwise, assign a weight of 1 to the edge between t_{n_1-1,n_1} and
v_h.
(c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{n_1-1,n_1}, if
{(u_{n_1-1}, v_h), (u_{n_1}, v_{h+1})} is not a matching in the original input graph,
then delete the edge between t_{n_1-1,n_1} and s_{h,h+1}; otherwise, assign a
weight of 2 to the edge between t_{n_1-1,n_1} and s_{h,h+1}.
For each i ∈ {0, 1, 2}, let M_i be a maximum-weighted matching in G_i. From
each M_i, we can obtain a feasible matching M̃_i in the original input graph by
performing the following steps in turn:

- Initialize M̃_i = ∅.
- For each edge (u_j, v_h) ∈ M_i, add (u_j, v_h) to M̃_i.
- For each edge (t_{j,j+1,j+2}, v_h) ∈ M_i, add (u_{j+1}, v_h) to M̃_i.
- For each edge (t_{1,2}, v_h) ∈ M_i, add (u_1, v_h) to M̃_i.
- For each edge (t_{n_1-1,n_1}, v_h) ∈ M_i, add (u_{n_1}, v_h) to M̃_i.
- For each edge (t_{j,j+1,j+2}, s_{h,h+1}) ∈ M_i, if {(u_j, v_h), (u_{j+1}, v_{h+1})} is a matching
  in the original input graph, then add edges (u_j, v_h) and (u_{j+1}, v_{h+1}) to
  M̃_i; otherwise, add edges (u_{j+1}, v_h) and (u_{j+2}, v_{h+1}) to M̃_i.
- For each edge (t_{1,2}, s_{h,h+1}) ∈ M_i, add edges (u_1, v_h) and (u_2, v_{h+1}) to M̃_i.
- For each edge (t_{n_1-1,n_1}, s_{h,h+1}) ∈ M_i, add (u_{n_1-1}, v_h) and (u_{n_1}, v_{h+1}) to M̃_i.
Note that the total weight of edges in M_i is exactly |M̃_i|. Let M̃ be the
maximum size one among M̃_0, M̃_1, M̃_2. We claim that |M̃| ≥ m*_1/3 + 4m*_2/3. To see
this, for each i ∈ {0, 1, 2}, let M*_i be the union of the set {(u_j, v_h) ∈ M* | j+1 ≡ i
(mod 3) and v_h is a 1-string in V} and the set {(u_j, v_h), (u_{j+1}, v_{h+1}) ∈ M* | u_j
and u_{j+1} belong to the same super-vertex in G_i and v_h v_{h+1} is a 2-string in V}.
It holds that Σ_{i=0}^{2} |M*_i| = m*_1 + 4m*_2, because each edge in M* incident to a 1-
string belongs to exactly one of M*_0, M*_1, M*_2 while each edge in M* incident to a
2-string belongs to exactly two of M*_0, M*_1, M*_2. This implies that the maximum
size one among M*_0, M*_1, M*_2 has size at least m*_1/3 + 4m*_2/3. On the other hand,
for each i ∈ {0, 1, 2}, we can obtain a matching M̂_i in G_i by modifying M*_i as
follows:

- For each edge (u_j, v_h) ∈ M*_i such that v_h is a 1-string in V, replace u_j by
  the super-vertex of G_i to which u_j belongs.
- For each pair of edges (u_j, v_h), (u_{j+1}, v_{h+1}) ∈ M*_i such that v_h v_{h+1} is a 2-
  string in V, replace the two edges by a single edge between s_{h,h+1} and the
  super-vertex in G_i to which both u_j and u_{j+1} belong.

The total weight of edges in M̂_i is exactly |M*_i|. On the other hand, the total
weight of edges in M_i is larger than or equal to that of edges in M̂_i, because M_i
is a maximum-weighted matching in G_i. Thus, the total weight of edges in M_i
is at least |M*_i|. In turn, |M̃_i| ≥ |M*_i|. Therefore,

|M̃| ≥ (1/3) Σ_{i=0}^{2} |M̃_i| ≥ (1/3) Σ_{i=0}^{2} |M*_i| = (1/3)(m*_1 + 4m*_2).

This completes the proof of the claim and hence that of the lemma. □
Combining Lemmas 1 and 2, we now have:

Theorem 5. We can compute a feasible matching in G whose size is at least
(3/5)|M*|, in O(m√(n_1 n_2)) time. Consequently, there is a 5/3-approximation algorithm
for the unweighted 2-String CBM problem; it runs in O(m√(n_1 n_2)) time.
4 Concluding Remarks
It would be interesting to test whether the 5/3-approximation algorithm works well in
practice. An obvious open question is whether D-String CBM admits a ρ-approximation
algorithm for some constant ρ < 2.
In real NMR spectral peak assignment, we want to assign every amino
acid to a spin system. Therefore, the desired output matchings are perfect matchings.
So far our theoretical analysis of the approximation algorithms does not
involve this requirement, although during the implementation we did put priority
on perfect matchings. Designing algorithms that are guaranteed to output perfect
matchings (given that the real data is a complete weighted bipartite graph) is
a practical consideration. On the other hand, not putting the perfectness requirement
on approximation algorithms could be one of the reasons that they produce
matchings with better weights than the correct assignments.
Our current framework provides a proof of principle for automated backbone
spectral assignment. In real NMR experiments, it can easily incorporate other
NMR data to further improve the assignment accuracy. For example, one can use
residual dipolar couplings, which are easy and fast to obtain. Residual dipolar
couplings have been used to help peak assignment and structure determination
[12,11]. Under our framework, we can use residual dipolar coupling data to provide
more constraint information for spin systems. We believe further studies
along this line could lead to a software package that moves closer to the goal of
fully automated NMR peak assignment.
References
1. International Human Genome Sequencing Consortium. Initial sequencing and
analysis of the human genome. Nature, 409:860-921, 2001.
2. A. Bar-Noy, R. Bar-Yehuda, A. Freund, J. Naor, and B. Schieber. A unified
approach to approximating resource allocation and scheduling. J. ACM, 48:1069-
1090, 2001.
3. R. Bar-Yehuda and S. Even. A local-ratio theorem for approximating the weighted
vertex cover problem. Annals of Discrete Mathematics, 25:27-46, 1985.
4. C. Bartels, P. Güntert, M. Billeter, and K. Wüthrich. GARANT - a general
algorithm for resonance assignment of multidimensional nuclear magnetic resonance
spectra. Journal of Computational Chemistry, 18:139-149, 1996.
5. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT
Press, 1990.
6. P. Güntert, M. Salzmann, D. Braun, and K. Wüthrich. Sequence-specific NMR
assignment of proteins by global fragment mapping with the program MAPPER.
Journal of Biomolecular NMR, 18:129-137, 2000.
7. K. Huang, M. Andrec, S. Heald, P. Blake, and J.H. Prestegard. Performance of a
neural-network-based determination of amino acid class and secondary structure
from 1H-15N NMR data. Journal of Biomolecular NMR, 10:45-52, 1997.
8. V. Kann. Maximum bounded 3-dimensional matching is MAX SNP-complete.
Information Processing Letters, 37:27-35, 1991.
9. National Institute of General Medical Sciences. Pilot projects for the protein
structure initiative (structural genomics). June 1999. Web page at
http://www.nih.gov/grants/guide/rfa-files/RFA-GM-99-009.html.
10. C.H. Papadimitriou and M. Yannakakis. Optimization, approximation, and
complexity classes. Journal of Computer and System Sciences, 43:425-440, 1991.
11. C. A. Rohl and D. Baker. De novo determination of protein backbone structure
from residual dipolar couplings using Rosetta. Journal of the American Chemical
Society, 124:2723-2729, 2002.
12. F. Tian, H. Valafar, and J. H. Prestegard. A dipolar coupling based strategy for
simultaneous resonance assignment and structure determination of protein backbones.
Journal of the American Chemical Society, 123:11791-11796, 2001.
13. University of Wisconsin. BioMagResBank. http://www.bmrb.wisc.edu. 2001.
14. J. Xu, S.K. Straus, B.C. Sanctuary, and L. Trimble. Use of fuzzy mathematics
for complete automated assignment of peptide 1H 2D NMR spectra. Journal of
Magnetic Resonance, 103:53-58, 1994.
15. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang. Automated
assignment of backbone NMR peaks using constrained bipartite matching. IEEE
Computing in Science & Engineering, 4:50-62, 2002.
16. D.E. Zimmerman, C.A. Kulikowski, Y. Huang, W. Feng, M. Tashiro, S. Shimotakahara,
C. Chien, R. Powers, and G.T. Montelione. Automated analysis of protein
NMR assignments using methods from artificial intelligence. Journal of Molecular
Biology, 269:592-610, 1997.
Efficient Methods for Inferring Tandem
Duplication History

Louxin Zhang^1, Bin Ma^2, and Lusheng Wang^3

^1 Labs for Infor. Tech. & Dept. of Mathematics,
Nat. University of Singapore, Singapore 117543
[email protected]
^2 Department of Computer Science,
University of Western Ontario, Canada N6A 5B8
[email protected]
^3 Department of Computer Science,
City University of Hong Kong, Hong Kong
[email protected]
Abstract. In this paper, we study the problem of inferring the duplication
history of a tandem gene family using the duplication model proposed by
Tang et al. We provide an efficient algorithm for inferring a duplication
model for a given set of (gene) sequences by combining a linear-time algorithm,
which determines whether a rooted tree is associated with
a duplication model, with the nearest neighbor interchange operation.
Finally, using our proposed method, we derive duplication hypotheses
for an exon of the mucin gene MUC5B, a ZNF gene family, and an OR gene
family.

Keywords: Algorithm, tandem duplication model, phylogeny, mucin
genes, ZNF genes, olfactory receptors.

R. Guigó and D. Gusfield (Eds.): WABI 2002, LNCS 2452, pp. 97-111, 2002.
© Springer-Verlag Berlin Heidelberg 2002
1 Introduction
Tandem duplication is a less well understood mutational process for DNA molecules
in which a short stretch of DNA is transformed into several adjacent copies.
For example, in the following tandem duplication,

TTCGGAGA → TT CGGA CGGA CGGA GA

the segment CGGA is transformed into three adjacent copies. Since individual
copies in a tandem repeat may undergo additional mutations later on, approximate
repeats are usually present in a genome.
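A toy Python function reproduces this operation (the function is only an illustration of the definition, not part of the method of this paper):

    def tandem_duplicate(seq, start, length, copies):
        # Replace seq[start:start+length] by `copies` adjacent copies of it.
        # E.g. tandem_duplicate("TTCGGAGA", 2, 4, 3) duplicates CGGA three
        # times and returns "TTCGGACGGACGGAGA", matching the example above.
        unit = seq[start:start + length]
        return seq[:start] + unit * copies + seq[start + length:]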
This process is a primary mechanism for generating gene family clusters on
a chromosome. Thirty-one percent of known and predicted genes (340 genes)
in human chromosome 19 (HSA19) are arranged in tandem arrays [11]. It was
reported that 262 distinct C2H2 zinc-finger (ZNF) loci are distributed in 11
different clusters in three cytogenetic bands 19p12, 19p13 and 19q13, and 49
olfactory receptor (OR) loci in 4 different clusters [3]. Gene families organized
in tandem arrays have also been described in C. elegans [18], Drosophila [1], and
Arabidopsis [17].
Genome analysis suggests tandem duplication is an important mode of evolutionary
novelty, permitting one copy of each gene to drift and potentially to
acquire a new function. Hence, evolutionary study of a tandem gene family together
with other analyses may yield valuable insights into predicting their functions
and resolve important species-specific biological questions. With these motivations
in mind, researchers have begun to study tandem gene families and their
duplication history (see for example [2,6,16]).

In this paper, we study the problem of systematically inferring the duplication
history of a tandem gene family. Previous and related works [2,6,10,16] are summarized
in the next section. Here, we attack this problem using a different approach.
A duplication model (DM) M for a given family of (gene) sequences is a
directed graph that contains nodes, edges and blocks, as shown in Figure 1. If only
the parent-child relations are considered in a duplication model M, then the
resulting structure is just a rooted binary tree T_M which is unique and called
the associated phylogeny for M. We present an optimal linear-time algorithm
which, given a phylogeny T for a set of (gene) sequences, reconstructs the unique
duplication model M such that T = T_M if it exists. Most importantly, we propose
an efficient algorithm for inferring a duplication model for a set of (gene)
sequences by combining this linear-time algorithm with the nearest neighbor
interchange (NNI) operation (see Section 4 for its definition). The algorithm
was used to infer the history of tandem repeats within a large exon of the mucin
gene MUC5B [6] and performed better than Benson and Dong's algorithm; it
was also used to derive the duplication history of the tandem gene families ZNF45
and OLF3 reported in [3]. Our algorithm outperformed the Window method on
the ZNF45 family according to the parsimony principle.
2 Tandem Duplication Models
Let F = {s_1, s_2, . . . , s_n} be a set of DNA (or protein) sequences of a fixed length
l over an alphabet A. Let A^l denote the set of all the sequences of length l over
A and let d(·, ·) be a pairwise distance function on sequences in A^l. Let T be a
phylogenetic tree (or phylogeny) on F. Then, it is a rooted binary tree in which
each leaf is labeled by a sequence in F and each internal node v by a sequence
s(v) ∈ A^l. We measure T using the following cost C(T):

C(T) = Σ_{(u,v)∈E(T)} d(s(u), s(v)),

where E(T) denotes the set of directed edges in T.
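In code, the cost is a single sum over the edges; a minimal Python sketch, assuming the tree is stored as a child map and the labels as a dict (an encoding of our own choosing):

    def tree_cost(children, label, d):
        # C(T) = sum of d(s(u), s(v)) over all directed edges (u, v) of T.
        # children: dict mapping each internal node to its (left, right) pair.
        # label:    dict mapping every node v to its sequence s(v).
        # d:        pairwise distance on sequences, e.g. Hamming distance.
        return sum(d(label[u], label[v])
                   for u, pair in children.items() for v in pair)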
An ordered phylogenetic tree on F is a rooted phylogenetic tree in which the
left-to-right order of the leaves is s_1, s_2, . . . , s_n. Obviously, ordered phylogenetic
trees form a proper subclass of phylogenetic trees on F. In the final version of
this paper, we prove that there are (1/n)·C(2(n-1), n-1) ordered phylogenetic trees
on F, where C(·, ·) denotes the binomial coefficient.
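This count is the (n-1)-st Catalan number, which is easy to check numerically:

    from math import comb

    def ordered_tree_count(n):
        # Number of ordered phylogenetic trees on n sequences:
        # C(2(n-1), n-1) / n, the (n-1)-st Catalan number.
        return comb(2 * (n - 1), n - 1) // n

    # ordered_tree_count(4) == 5: the five shapes of a rooted ordered
    # binary tree with four leaves.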
Benson and Dong described a greedy algorithm for the tandem repeat history
(TRHIST) problem, stated as follows [2]:

TRHIST problem
Input: A set F = {s_1, s_2, . . . , s_n} of gene sequences of length l and a distance
function d(·, ·) on sequences.
Output: An ordered phylogenetic tree T on F such that the cost of T is minimum.

Later, Jaitly et al. and Tang et al. proved that the TRHIST problem is NP-hard;
they also discovered a polynomial-time approximation scheme (PTAS) for
it [10,16]. Such a PTAS is obtained by combining the lifting technique [20,19]
with dynamic programming. A PTAS for a minimization problem is an algorithm
which, for every ε > 0, returns a solution whose score is at most (1 + ε) times
the optimal score, and runs in polynomial time (depending on ε and the input
size).
Furthermore, Tang et al. extended the study in this direction by proposing
the so-called duplication model [16] that captures the evolutionary history and the
observed left-to-right order of (gene) sequences appearing on a chromosome.
They then presented a heuristic method called the Window Method for inferring
the tandem duplication history.

A duplication model (DM) M for a given family of (gene) sequences is a
directed graph that contains nodes, edges and blocks [16], as shown in Figure 1.
A node in M represents a sequence. A directed edge (u, v) from u to v indicates
that v is a child of u. A node s is an ancestor of a node t if there is a directed
path from s to t. A node that has no outgoing edges is called a leaf; each leaf

[Figure 1 here: a duplication model with internal nodes r, v_1, . . . , v_8 and leaves
1, . . . , 10, together with its associated phylogeny.]

Fig. 1. A DM on a gene (sequence) family {1, 2, . . . , 10}, in which each edge is
directed from top to bottom, and its associated phylogeny. There are five blocks in
the DM — [r], one other singleton block, one block of three internal nodes, and two
blocks of two internal nodes; blocks with more than one node are represented by boxes.
is labeled with a unique sequence in the family. Each non-leaf node is called an internal node; it has a left and a right child. There is a special node r, called the root, that has only outgoing edges. The root represents the ancestral sequence of all the given sequences.

Each block in M represents a duplication event. Each internal node appears in a unique block, and no node in a block is an ancestor of another node in the same block. Assume that a block B contains k nodes v_1, v_2, ..., v_k in left-to-right order, and let lc(v_i) and rc(v_i) denote the left and right child of v_i, respectively, for each i from 1 to k. Then

lc(v_1), lc(v_2), ..., lc(v_k), rc(v_1), rc(v_2), ..., rc(v_k)

are placed in left-to-right order in the model. Therefore, for any i and j, 1 ≤ i < j ≤ k, the edges (v_i, rc(v_i)) and (v_j, lc(v_j)) cross each other; however, no other edges in the model cross. For simplicity, we only draw a box for a block with more than one internal node.

Each leaf is labeled by a unique sequence of the given family. The left-to-right order of the leaves in the duplication model is identical to the order in which the given tandem sequences appear on a chromosome.
A duplication model captures not only the evolutionary relationships, but also the observed order of the sequences appearing on a chromosome. If we consider only the evolutionary relationships defined in a duplication model M, the underlying structure T_M is just a rooted binary tree, called the associated phylogeny of M. Clearly, T_M is unique. Obviously, any ordered phylogenetic tree is the associated phylogeny of a duplication model in which every duplication block contains only one internal node. Note that Figure 1 gives an associated phylogeny that is not ordered. Counting the associated phylogenies on a set of n genes seems to be an interesting problem.

In the rest of this section, we list two basic properties of associated phylogenies. These properties will be used to reconstruct a DM from a phylogeny in the next section. We assume that the family F under consideration contains n sequences 1, 2, ..., n, which appear in increasing order on a chromosome. Thus, a DM is consistent with F if the left-to-right ordering of its leaves is identical to the increasing order.
Proposition 1. Let M be a DM consistent with F = {1, 2, ..., n} and let T_M be the phylogeny associated with M. Then, for any internal node u in T_M, the subtree T_M(u) rooted at u is also associated with a DM on the set of sequences appearing in T_M(u).

Proof. T_M(u) is consistent with the restriction of M to T_M(u), in which each block is composed of all of the internal nodes in T_M(u) that are contained in a block in M.
Recall that, for an internal node u in a DM M, we use lc(u) and rc(u) to denote its left and right child in M or T_M, respectively. Similarly, we use lc*(u) and rc*(u) to denote the sequence labels of the leftmost and rightmost leaves in the subtree T_M(u) rooted at u, respectively.
Proposition 2. Let M be a DM consistent with F = {1, 2, ..., n} and let T_M be the phylogeny associated with M. Then T_M satisfies the following two properties:
(i) For each internal node u in T_M, rc*(u) > rc*(lc(u)) and lc*(u) < lc*(rc(u)); equivalently,
(ii) For each internal node u in T_M, lc*(u) and rc*(u) are the smallest and largest sequence labels in the subtree T_M(u), respectively.

Proof. Both (i) and (ii) can be proved by induction. Their full proof is omitted here.
Proposition 2 can be illustrated by the model given in Figure 1: for v_2 in the model, lc(v_2) = v_3 and rc*(v_2) = 10 > rc*(v_3) = 9.
3 Reconstructing a Duplication Model from a Phylogeny

We have seen that each DM M has a unique associated phylogeny T_M. In this section, we present an optimal algorithm for the following reverse problem: given a phylogeny T, reconstruct a DM M such that T = T_M, if such a DM exists. Let M be a DM consistent with the sequence family F = {1, 2, ..., n}. A duplication block is single if it contains only one node; it is double if it contains two nodes. A DM is said to be double if every duplication block in it is either single or double. In Section 3.1, we present a linear-time algorithm for reconstructing double DMs from phylogenies. Then we generalize the algorithm to arbitrary DMs in Section 3.2. The best previously known algorithm for this problem takes quadratic time [16].

Note that, to specify a DM, we need only list all the non-single blocks on the associated phylogeny. Hence, the output of our algorithm is just a list of non-single duplication blocks.
3.1 Double Duplication Models

Let T be a phylogeny. We associate a pair (L_v, R_v) of leaf indices with each node v in T as follows [16]: the pair of a leaf labeled by gene i is (i, i), and (L_v, R_v) = (lc*(v), rc*(v)) for an internal node v. Finally, we define P(T) = {(L_v, R_v) | v ∈ T}.
Proposition 3. For a phylogeny T of n leaves, the set P(T) can be computed in O(n) time.

Proof. Note that (L_v, R_v) = (L_{lc(v)}, R_{rc(v)}), where lc(v) and rc(v) are the left and right child of v. The set P(T) can be computed using the dynamic programming approach in a bottom-up fashion: (L_v, R_v) = (i, i) for a leaf v labeled with sequence i; for a non-leaf node v, we calculate (L_v, R_v) once (L_u, R_u) is known for u = lc(v) and u = rc(v). Since T contains 2n − 1 nodes, the above procedure takes O(n) time.
The set P(T) plays a critical role in designing algorithms for reconstructing a DM from a phylogeny. To use P(T) efficiently, we store each pair (L_v, R_v) at the node v by introducing an extra pointer. The following fact is straightforward.

Proposition 4. The properties in Proposition 2 can be verified in O(n) time for a phylogeny T of n nodes using the set P(T).
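The bottom-up computation of Proposition 3 and the check of Proposition 4 are simple enough to sketch in code. The following Python fragment is an illustrative sketch only, not the authors' implementation; the nested-tuple tree encoding and the function name are our assumptions (a leaf is its integer gene label, and an internal node is a pair (left, right)).

```python
def pairs_and_check(tree):
    """Return (L_v, R_v) for the root of `tree`, verifying along the
    way property (i) of Proposition 2 at every internal node:
        rc*(u) > rc*(lc(u))  and  lc*(u) < lc*(rc(u)).
    Raises ValueError if T cannot be associated with a DM.
    """
    if isinstance(tree, int):            # a leaf labeled i has pair (i, i)
        return (tree, tree)
    left, right = tree
    l_lo, l_hi = pairs_and_check(left)
    r_lo, r_hi = pairs_and_check(right)
    lo, hi = l_lo, r_hi                  # (L_v, R_v) = (L_{lc(v)}, R_{rc(v)})
    if not (hi > l_hi and lo < r_lo):    # property (i) at this node
        raise ValueError("T cannot be associated with a DM")
    return (lo, hi)

# The phylogeny ((1, 2), (3, 4)) passes; ((2, 1), (3, 4)) does not.
print(pairs_and_check(((1, 2), (3, 4))))   # (1, 4)
```

Since each node is visited once and does constant work, this matches the O(n) bound of Propositions 3 and 4.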
By Proposition 4, we may assume that a given phylogeny T satisfies the properties stated in Proposition 2; otherwise, T cannot be associated with a DM. Note that the leftmost and rightmost leaves in T are labeled with 1 and n, respectively. Where is the leaf labeled with 2 located? Let

v_0 = r, v_1, v_2, ..., v_{p−1}, v_p = 1    (1)

be the path from the root r to the leftmost leaf 1. Such a path is said to be the leftmost path of T. Since lc*(v) < lc*(rc(v)) for every v in T (Proposition 2), 2 can only appear as the leftmost leaf of a subtree rooted at some rc(v_i), 0 ≤ i ≤ p (see Figure 2(a)). Let

u_1 = rc(v_i), u_2, ..., u_{q−1}, u_q = 2    (2)

be the path from rc(v_i) to the leaf 2, where q ≥ p − i, as we will see later.
Proposition 5. Suppose that T is associated with a double DM M that is consistent with F = {1, 2, ..., n}. Then:
(1) M contains a unique series of p − i − 1 double duplication blocks of the following form:

[v_{i+1}, u_{j_1}], [v_{i+2}, u_{j_2}], ..., [v_{p−1}, u_{j_{p−i−1}}],

where 1 ≤ j_1 < j_2 < ... < j_{p−i−1} ≤ q − 1. Hence, q ≥ p − i.
(2) Such a series of p − i − 1 blocks can be determined using at most 2q comparisons, where q is the length of the left path leading to label 2 given in (2).
Proof. (1) Let k be an integer between 1 and p − i − 1. If v_{i+k} does not belong to a double block in M, then 2 cannot be placed before the leftmost leaf in the subtree rooted at rc(v_{i+k}), contradicting the fact that 2 is right next to 1 in M. (A tedious full proof of this statement will appear in the final version of this paper.) Hence, v_{i+k} must appear in a double duplication block for each k, 1 ≤ k ≤ p − i − 1.

Since the subtree rooted at rc(v_{i+1}) is to the left of the subtree rooted at u_1 in T, v_{i+1} forms a block with some u_{j_1}, 1 ≤ j_1 ≤ q − 1. By Proposition 2, R_{u_1} > R_{u_2} > ... > R_{u_{q−1}} > R_{u_q}. Since the left-to-right ordering of the sequences is identical to the increasing order in M, v_{i+1} and u_{j_1} form a duplication block only if R_{u_{j_1}} > R_{v_{i+1}} > R_{u_{j_1+1}}. Hence, u_{j_1} is unique. Similarly, we can prove that u_{j_k} is unique for any k > 1.
Fig. 2. (a) An example illustrating the position of 2, where p = 5, i = 2, and q = 6. (b) The configuration after the double duplication blocks are placed. (c) The tree T′ derived after the duplication blocks are eliminated; the bad nodes are marked with circles.
(2) Notice that R_{u_1} > R_{u_2} > ... > R_{u_{q−1}} > R_{u_q} and R_{v_{i+1}} > R_{v_{i+2}} > ... > R_{v_{p−1}}. By merging these two sequences in p − i + q ≤ 2q comparisons, we are able to determine all the u_{j_k}'s, 1 ≤ k ≤ p − i − 1, by simply scanning the merged sequence. This is because R_{v_{i+k}} appears between R_{u_{j_k}} and R_{u_{j_k+1}} in the merged sequence. To implement this procedure efficiently, we need not only a pointer from each parent node to its children, but also a pointer from each child node to its parent.
Assume that all the u_{j_k}'s have been determined for k = 1, 2, ..., p − i − 1. After all the duplication blocks [v_{i+k}, u_{j_k}] are placed on T, the leaf 2 should be right next to the leaf 1 (see Figure 2(b)). We derive a rooted binary tree T″ from the subtree T(u_1) by inserting a new node v′_k in the edge (u_{j_k}, u_{j_k+1}) for each 1 ≤ k ≤ p − i − 1, and assigning the subtree T(rc(v_{i+k})) rooted at rc(v_{i+k}) as the right subtree of v′_k. Notice that the left child of v′_k is u_{j_k+1}. Then we form a new phylogeny T′ from T by replacing the subtree T(v_i) by T″, as illustrated in Figure 2(c). To distinguish the inserted nodes v′_k, 1 ≤ k ≤ p − i − 1, from the original ones in T, we mark the inserted ones as bad nodes.
Algorithm for Reconstructing a DDM
Input: A phylogeny T with root r on a sequence family {1, 2, ..., n}.
Output: The DDM derived from T, if it exists.

1. DS := ∅;  /* DS keeps the double duplication blocks */
2. Determine a series of duplication blocks, as in Proposition 5, that places 2 right next to 1; if such a series of blocks does not exist, then no DM can be derived from T, and we exit the procedure.
3. Add all the duplication blocks obtained into DS.
4. Construct T′ from T as described before Proposition 6; recursively determine whether T′ is associated with a DM or not, and update DS accordingly.
5. Output DS.

Fig. 3. Algorithm for reconstructing a DDM.
It is not difficult to see that the leftmost leaf of T′ is labeled by 2 and that T′ has one leaf fewer than T. Most importantly, we have the following straightforward fact.
Proposition 6. T is associated with a double DM M only if T′ is associated with the double DM M′ that is the restriction of M to the nodes of T′, in which the bad nodes cannot participate in double duplication blocks.
Using Proposition 6, we obtain the algorithm in Figure 3 for reconstructing a double DM (DDM) from a phylogeny. By Proposition 5, we can charge the number of comparisons taken in the different recursive steps to disjoint left paths in the input tree T. Since the total number of nodes in all the disjoint left paths is at most 2n, the whole algorithm takes at most 2 · 2n comparisons to determine all the duplication blocks. Therefore, we have:

Theorem 1. A double DM can be reconstructed from a given phylogenetic tree T of n leaves in O(n) time, if it exists.
3.2 Generalization to Arbitrary Duplication Models

In this subsection, we generalize the algorithm presented above to arbitrary DMs. The generalization is based on the following proposition. Again, we assume that the leftmost paths leading to sequence 1 and sequence 2 in T are given as (1) and (2) in the previous subsection.
Proposition 7. Assume a phylogeny T is associated with a DM M. Then there exists a unique series of p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] such that, after these duplications are placed in T, sequence 2 is right next to sequence 1. Notice that these double duplication blocks may not be in M.
Proof. Let k be an integer such that 1 ≤ k ≤ p − i − 1. If v_{i+k} and u_j are not involved in a duplication block for any j ≤ q − 1 in the model M, sequence 2 cannot be moved across the leftmost leaf lc*(rc(v_{i+k})) of the subtree rooted at the right child of v_{i+k}, contradicting the hypothesis that 2 is placed next to 1. Therefore, for any v_{i+k}, 1 ≤ k ≤ p − i − 1, there exists a duplication block in M that contains v_{i+k} and u_j for some j. Restricting all the duplications in M to the leftmost paths specified in (1) and (2) gives the desired series of p − i − 1 double duplication blocks.
We construct a new phylogeny T′ from T in the same way as described before Proposition 6 in the previous subsection. Recall that there are two types of nodes on the leftmost path of T′: some nodes are original ones from the input tree T, while others were inserted due to the duplication blocks we have already considered. To extend the existing duplication blocks into larger ones, we associate a flag with each original node on the leftmost path of T′, indicating whether the node is in an existing block or not. Let x be an original node on the leftmost path of T′ that is involved in a duplication block [x_1, x_2, ..., x_t, x] of size t + 1 so far; then there are t inserted nodes x′_i below x that correspond to the x_i, for i = 1, 2, ..., t. To decide whether x is in a double duplication block in T′, we need to consider x and all the x′_i's (1 ≤ i ≤ t) simultaneously. We call such a double block a hyper-double block. If x is involved in a hyper-double block [x, y] in T′, the block [x_1, x_2, ..., x_t, x] is extended into the block [x_1, x_2, ..., x_t, x, y] of size t + 2 in the original tree T. With these notations in mind, we have:
Proposition 8. A phylogeny T is associated with a DM if and only if
(1) there exists a unique series of p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] in T such that, after these duplication blocks are placed in T, sequence 2 is right next to sequence 1, and
(2) the tree T′ constructed above is associated with a DM.
Hence, the algorithm given in Section 3.1 can be used to reconstruct a DM from a phylogeny. To make the algorithm run in linear time, we assign a pair (R′_x, R″_x) of indices to each node x on the leftmost path of T′ in each recursive step: if x is in a duplication block [x_1, x_2, ..., x_t, x] at the current stage, we let R′_x = R_{x_1} and R″_x = R_x, the values defined at the beginning of Section 3.1. Since R″_x < R_{x_i} < R′_x for any 2 ≤ i ≤ t, only R′_x and R″_x need to be examined to decide whether x is in a hyper-double block in the next step. In this way, we can reconstruct a DM from a phylogeny in linear time. Take the tree in Figure 4 as an example. To arrange 2 next to 1 in step 1, we obtain 3 hyper-double duplications [v_1, v_2], [v_3, v_5], and [v_8, v_6]. After this step, sequences 1, 2, 3, 4 have been arranged in increasing order. To arrange 5 next to 4 in step 2, we obtain a hyper-double block [v_5, v_4] and hence extend the double duplication [v_3, v_5] into [v_3, v_5, v_4].
Fig. 4. A step-by-step demonstration of reconstructing a DM from a rooted tree. Here, a box drawn in broken lines denotes a hyper-double block; the inserted nodes are marked with circles.
Finally, to arrange 6 next to 5 in step 3, we derive a hyper-double block [v_4, v_7] and further extend the duplication [v_3, v_5, v_4] into [v_3, v_5, v_4, v_7]. After step 3, all the sequences have been arranged in increasing order and the algorithm terminates successfully, outputting the 3 duplications [v_1, v_2], [v_3, v_5, v_4, v_7], and [v_8, v_6].
4 Inferring Duplication Models from Sequence Data

Let F = {s_1, s_2, ..., s_n} be a set of tandem sequences of a fixed length l over an alphabet A, and let M be a DM consistent with F in which each node v is assigned a sequence s(v) ∈ A^l, where s(v) is in F if v is a leaf. We measure M using the following cost C(M):

C(M) = Σ_{(u,v) ∈ E(M)} dist(s(u), s(v)),

where E(M) denotes the set of directed edges in M.

By the parsimony principle, inferring a DM on F means finding a DM that is consistent with F and of minimum cost. This inference problem is NP-hard, as proved in [10]; therefore, a polynomial-time algorithm for it is unlikely. Here, we introduce an efficient heuristic for inferring a DM from sequence data, using the algorithm presented in Section 3 and the Nearest Neighbor Interchange (NNI) operation for tree transformation (see Figure 5).
Fig. 5. By an NNI operation on e, we can swap two subtrees at the different ends of e.
Every edge e in an unrooted tree partitions the set of leaves. We may consider alternative hypotheses associated with e by applying an NNI on e. This rearrangement generates an alternative bipartition for e while leaving all other parts of the tree unchanged (see [15] for more information).

Our method is analogous to the Feng-Doolittle algorithm for multiple alignment. First, a guided tree is constructed for the given set of sequences using the parsimony method or other efficient distance-based methods such as Neighbor-Joining (see [15]). Then, NNIs are applied to the guided tree T to search for a DM that is nearest to T in terms of NNI operations. During the search process, we use the algorithm in Section 3 to determine whether the resulting tree is associated with a DM or not. Since a phylogeny with n leaves can always be transformed into another by applying at most n·log_2(n) + O(n) NNIs [12], our algorithm terminates quickly. For even more efficiency, we can use the following greedy approach to obtain a DM.
Greedy NNI Search: We root the given guided tree at a fixed edge on the path from the leaf labeled with s_1 to the leaf labeled with s_n. For each k = 2, 3, ..., n − 1, we repeatedly arrange s_k next to s_{k−1} using duplications, as described in Section 3. If we cannot arrange s_k next to s_{k−1}, we use NNIs to move the leaf labeled with s_k to the branch having s_{k−1} as an end, so that these two leaves become siblings.
The Greedy NNI Search is based on the hypothesis that a tandem gene family has evolved mainly through single-gene duplication events [2,16]. Since we can determine whether a rooted tree is associated with a duplication model in linear time, the Greedy NNI Search takes only quadratic time.
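To make the NNI moves used by the heuristic concrete, here is a toy Python sketch, illustrative only and not the authors' code; the nested-tuple representation and function name are our assumptions. An NNI across the internal edge joining two internal nodes swaps one pair of the four hanging subtrees, exactly as in Figure 5.

```python
def nni(tree, which=0):
    """Apply one of the two NNIs across the edge joining the two
    children of the root: ((A,B),(C,D)) -> ((A,C),(B,D)) or ((A,D),(C,B))."""
    (a, b), (c, d) = tree
    return ((a, c), (b, d)) if which == 0 else ((a, d), (c, b))

print(nni((("A", "B"), ("C", "D"))))      # (('A', 'C'), ('B', 'D'))
print(nni((("A", "B"), ("C", "D")), 1))   # (('A', 'D'), ('C', 'B'))
```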
5 Analysis of Tandem Repeats

5.1 Mucin Gene MUC5B

Mucus consists mainly of mucin proteins, which are heterogeneous, highly glycosylated, and produced by epithelial cells [8]. Four mucin genes, MUC6, MUC2, MUC5AC and MUC5B, have been identified within a 400-kb genomic segment in human 11p15.5 [14]. All mucin genes contain a central region of tandem repeats. In the central part of MUC5B, a single large exon of about 11 kb [4], three Cys-subdomains are followed by four repeats. Each repeat contains an R-subdomain (11 or 17 tandem repeats of a motif of 29 amino acids), an R-End subdomain, and a Cys-subdomain. Another R-subdomain of 23 tandem repeats of 29 amino acids follows the fourth repeat. This suggests that the central part of MUC5B arose through tandem duplications.

The amino acid sequences for the five R-subdomains were taken from [6] and aligned using an alignment program at Michigan Tech [9]. We chose the following parameters - substitution matrix: Blosum62, mismatch score: -15, gap open penalty: 10, gap extension penalty: 5. The alignment was used to produce a guided tree using the parsimony method in [7,13], rooted on the edge that partitions the set of leaves into {RI, RII, RIV} and {RIII, RV}. This guided tree was then given as input to the Greedy NNI Search algorithm described in the last section. The resulting tandem DM for the central part of MUC5B agrees with the evolutionary history described verbally by Desseyn et al. in their original papers: the central repeat (Cys/R/R-End subdomains) of an ancestral gene duplicated into two repeats; this event was followed by a further duplication of the region containing the two repeats; after that, the first repeat duplicated.

We also derive a tandem duplication hypothesis for the fourth repeat segment, RIV. It is composed of 17 tandem repeats of an irregular motif of 29 amino acids rich in Ser and Thr. We obtained a guided parsimony tree from the multiple alignment of the 17 tandem repeats (RIV-1 to RIV-17) (see Figure 6(a)). Then, a duplication model was derived after 3 local NNI rearrangements on the guided tree (see Figure 6(b)). This model has substitution score 126, which is better than the score of 133 that we obtained using Benson's greedy algorithm [2]. This hypothesis is also consistent with the conclusion drawn in [6].
5.2 ZNF and OLF Gene Families

The human genome contains roughly 700 Krüppel-type (C2H2) zinc-finger (ZNF) genes, which encode putative transcription factors, and 900 olfactory receptor (OR) genes, which encode proteins that recognize distinct types of olfactants and that function in different regions of the olfactory epithelium. Because of their importance in species-specific aspects of biology, Dehal et al. analyzed the evolutionary relationships among the C2H2 ZNF genes and OR genes in human chromosome 19 (HSA19) [3]. Two hundred sixty-two C2H2 ZNF genes were identified in HSA19 and grouped into 11 different locations (Z1 to Z11); forty-nine OR genes were grouped into 4 location clusters (OLF1 to OLF4) (Tables C and G in the supplementary material of [3]). In situ tandem duplication events are likely to have given rise to these features.

Because of their locations, we chose the gene clusters ZNF45 and OLF3 and obtained their gene-duplication histories using the heuristic algorithm described in the last section. The gene family ZNF45 contains 16 genes, and OLF3 contains 14 genes, in HSA19q13.2. DNA or protein sequences for each gene family were aligned using the program at Michigan Tech [9]. The alignment of each gene family was used to produce a matrix of pairwise distances between the family members and then a guided tree, using the parsimony method [7,13]. This guided tree was given as input to the Greedy NNI Search algorithm described in the last section.
Fig. 6. The duplication history of the third repeat RIII in the MUC5B central part.
The resulting DMs for these two gene families are shown in Figure 7. Our algorithm performed better than the Window method on the gene family ZNF45 according to the parsimony principle: our model for ZNF45 has substitution score 4690, while the one in [16] has score 4800, both constructed from the same alignment data.

These hypotheses, in conjunction with other sequence analyses, could provide valuable clues to the functions of these families at various stages of development.
6 Conclusion

In this paper, we present an optimal linear-time algorithm for reconstructing a DM from a phylogeny. Using this algorithm, we propose an efficient heuristic algorithm for inferring a DM from sequence data. This algorithm performed well on three test data sets: MUC5B [6] and the gene families ZNF45 and OLF3 [3]. Currently, we are conducting simulation tests to evaluate the algorithm further; its performance on simulated data will be reported in the journal version of this paper.

Genome analysis suggests that sequence inversions and gene losses occur during the course of tandem duplications. As a future research topic, the authors will study how to incorporate gene sequence inversion and loss events into the tandem duplication model and its inference algorithms.
Fig. 7. The duplication models of (a) ZNF45 and (b) OLF3.
Acknowledgments

Ma Bin would like to thank Chengzhi Liang for providing the references on mucin genes and for valuable discussions; he would also like to thank Mengxiang Tang for sharing their data. L. Zhang is financially supported in part by NSTB grant LS/99/001. B. Ma is supported in part by NSERC grant RGP0238748. L. Wang is supported in part by RGC grant CityU/1130/99E.
References

1. M.D. Adams et al. The genome sequence of Drosophila melanogaster. Science 287 (2000), 2185-2195.
2. G. Benson and L. Dong. Reconstructing the duplication history of a tandem repeat. In Proceedings of the 7th ISMB, 44-53, 1999.
3. P. Dehal et al. Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science 293 (2001), 104-111.
4. J. L. Desseyn et al. Human mucin gene MUC5B, the 10.7-kb large central exon encodes various alternate subdomains resulting in a super-repeat. Structural evidence for a 11p15.5 gene family. J. Biol. Chem. 272 (1997), 3168-3178.
5. J. L. Desseyn et al. Evolutionary history of the 11p15 human mucin gene family. J. Mol. Evol. 46 (1998), 102-106.
6. J. L. Desseyn et al. Evolution of the large secreted gel-forming mucins. Mol. Biol. Evol. 17 (2000), 1175-1184.
7. J. Felsenstein. PHYLIP, version 3.57c. Dept. of Genetics, Univ. of Washington, Seattle.
8. S. Ho and Y. Kim. Carbohydrate antigens on cancer-associated mucin-like molecules. Semin. Cancer Biol. 2 (1991), 389-400.
9. X. Huang. Multiple Sequence Alignment, http://genome.cs.mtu.edu/map/map.html.
10. D. Jaitly et al. Methods for reconstructing the history of tandem repeats and their application to the human genome. To appear in J. Computer and System Sciences.
11. J. Kim et al. Homology-driven assembly of a sequence-ready mouse BAC contig map spanning regions related to the 46-Mb gene-rich euchromatic segments of human chromosome 19. Genomics 74 (2001), 129-141.
12. M. Li, J. Tromp, and L. Zhang. On the nearest neighbour interchange distance between evolutionary trees. J. Theoret. Biol. 182 (1996), 463-467.
13. A. Lim and L. Zhang. WebPHYLIP: a web interface to PHYLIP. Bioinformatics 12 (1999), 1068-1069.
14. P. Pigny et al. Genomics 8 (1996), 340-352.
15. D. L. Swofford et al. Phylogeny inference. In Molecular Systematics (edited by D.M. Hillis et al.), pages 407-514. Sinauer Associates Inc., MA, 1996.
16. M. Tang, M. Waterman, and S. Yooseph. Zinc finger gene clusters and tandem gene duplication. RECOMB 2001, 297-304, 2001.
17. The Arabidopsis Genome Initiative. Nature 408 (2000), 796-815.
18. The C. elegans Sequencing Consortium. Science 282 (1998), 2012-2018.
19. L. Wang and D. Gusfield. Improved approximation algorithms for tree alignment. J. of Algorithms 25 (1997), 255-273.
20. L. Wang, T. Jiang, and E. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica 16(3), 302-315, 1996.
Genome Rearrangement Phylogeny
Using Weighbor
Li-San Wang
Department of Computer Sciences, University of Texas
Austin, TX 78712 USA
[email protected]
Abstract. Evolution operates on whole genomes through operations that change the order and strandedness of genes within the genomes. This type of data presents new opportunities for discoveries about deep evolutionary rearrangement events. Several distance-based phylogenetic reconstruction methods have been proposed [12,21,19] that use neighbor joining (NJ) [16] with the expected breakpoint or inversion distances after k rearrangement events. In this paper we study the variance of the breakpoint and inversion distances. The results are combined with Weighbor [5], an improved version of NJ that uses the variance of true evolutionary distance estimators, to yield two new methods, Weighbor-IEBP and Weighbor-EDE. Experiments show that the new methods have better accuracy than all previous distance-based methods, and are robust against model parameter misspecifications.
1 Background

Distance-Based Phylogenetic Reconstruction. A (phylogenetic) tree T on a set of taxa S is a tree representation of the evolutionary history of S: T is a tree leaf-labeled by S, such that the internal nodes reflect past speciation events. Given a tree T on a set S of genomes and given any two leaves i, j in T, we denote by P(i, j) the path in T between i and j. We let λ_e denote the number of evolutionary events on the edge e during the evolution of the genomes in S within the tree T; we call it the true evolutionary distance between the two endpoints. We can then define the matrix of true evolutionary distances, d_ij = Σ_{e ∈ P(i,j)} λ_e, which is additive (a distance matrix D is additive if it satisfies the four-point condition: for every four distinct points {i, j, l, m}, D_ij + D_lm ≤ max{D_il + D_jm, D_im + D_jl}). Given an additive matrix, many distance-based methods are guaranteed to reconstruct the tree T and its edge weights.
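As an aside, the four-point condition is easy to test directly. The short Python sketch below is illustrative only (the function name and matrix encoding are our assumptions); it uses the equivalent formulation that, for each quadruple, the two largest of the three pairwise sums must coincide.

```python
from itertools import combinations

def is_additive(D, eps=1e-9):
    """Check the four-point condition on a square distance matrix D."""
    n = len(D)
    for i, j, l, m in combinations(range(n), 4):
        sums = sorted((D[i][j] + D[l][m],
                       D[i][l] + D[j][m],
                       D[i][m] + D[j][l]))
        # equivalently: the two largest of the three sums must be equal
        if sums[2] - sums[1] > eps:
            return False
    return True

# Path distances on the 4-leaf tree ((0,1),(2,3)) with unit edge lengths:
D = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(is_additive(D))   # True
```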
Neighbor Joining and Its Variants BioNJ and Weighbor. Neighbor joining (NJ) is the most popular distance-based tree inference method. The input to the method is a matrix of estimated leaf-to-leaf distances {D_ij | 1 ≤ i, j ≤ n} on n leaves. Starting with these leaves as one-node subtrees, the algorithm creates new subtrees iteratively by picking two subtrees using a least-squares criterion on the distances to the other subtree roots (the pairing step), and updates the distances from the root of the new subtree to the other subtree roots (the distance update step) according to some least-squares criterion [7].
In reality, we do not get exact distance estimates between leaves, due to the random nature of evolution. Atteson showed that NJ is guaranteed to reconstruct the true tree topology if the input distance matrix is sufficiently close to a distance matrix defining the same tree topology [1]. Consequently, techniques that yield a good estimate of the matrix {d_ij} are of significant interest.

In this paper we use two modified versions of neighbor joining, called BioNJ [7] and Weighbor [5]. Both methods use the variance of the true distance estimators in the pairing (Weighbor only) and distance update steps to improve the accuracy of the tree reconstruction.
Genome Rearrangement Evolution. Modern laboratory techniques yield the ordering and strandedness of genes on a chromosome, allowing us to represent each chromosome by an ordering of signed genes (where the sign indicates the strand). Evolutionary events can alter these orderings through rearrangements such as inversions and transpositions, collectively called genome rearrangements. Because these events are rare, they give us information about ancient events in the evolutionary history of a group of organisms. In consequence, many biologists have embraced this new source of data in their phylogenetic work [6,14,15]. Appropriate tools for analyzing such data remain primitive when compared to those developed for DNA sequence data; thus developing such tools is becoming an important area of research.

The genomes of some organisms have a single chromosome or contain single-chromosome organelles (such as mitochondria [4] or chloroplasts [14,15]) whose evolution is largely independent of the evolution of the nuclear genome of these organisms. When each genome has the same set of genes and each gene appears exactly once, a genome can be described by an ordering (circular or linear) of these genes, each gene given with an orientation that is either positive (g_i) or negative (−g_i). If the genome is circular, we always represent the genome linearly so that g_1 is positive and in the first position. A close inspection shows that a circular genome with n genes, when represented linearly, is identical to a linear genome with n − 1 genes.
Let G be the genome with signed ordering g_1, g_2, ..., g_k. An inversion between indices a and b, for a ≤ b, produces the genome with linear ordering

(g_1, g_2, ..., g_{a−1}, −g_b, −g_{b−1}, ..., −g_a, g_{b+1}, ..., g_k).

A transposition on the (linear or circular) ordering G acts on three indices a, b, c, with a ≤ b and c ∉ [a, b], picking up the interval g_a, g_{a+1}, ..., g_b and inserting it immediately after g_c. Thus the genome G above (under the assumption c > b) is replaced by

(g_1, ..., g_{a−1}, g_{b+1}, ..., g_c, g_a, g_{a+1}, ..., g_b, g_{c+1}, ..., g_k).

An inverted transposition is a transposition followed by an inversion of the transposed subsequence.
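The three operations are easy to state in code. The following Python sketch is illustrative only; the list-of-signed-integers encoding and the function names are our assumptions, and indices a, b, c are the 1-based positions of the text.

```python
def inversion(g, a, b):
    """Reverse g[a..b] and negate the reversed genes."""
    g = list(g)
    g[a-1:b] = [-x for x in reversed(g[a-1:b])]
    return g

def transposition(g, a, b, c):
    """Move the interval g[a..b] to just after position c (case c > b)."""
    block, rest = list(g[a-1:b]), list(g[:a-1]) + list(g[b:])
    insert_at = c - len(block)       # position c shifts left after removal
    return rest[:insert_at] + block + rest[insert_at:]

def inverted_transposition(g, a, b, c):
    """A transposition followed by an inversion of the moved interval."""
    moved = [-x for x in reversed(g[a-1:b])]
    rest = list(g[:a-1]) + list(g[b:])
    insert_at = c - (b - a + 1)
    return rest[:insert_at] + moved + rest[insert_at:]

G = [1, 2, 3, 4, 5, 6]
print(inversion(G, 2, 4))         # [1, -4, -3, -2, 5, 6]
print(transposition(G, 2, 3, 5))  # [1, 4, 5, 2, 3, 6]
```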
The Generalized Nadeau-Taylor Model. The Generalized Nadeau-Taylor (GNT) model [21] of genome evolution uses only genome rearrangement events that do not change the gene content. The model assumes that the number of each of the three types of events obeys a Poisson distribution on each edge, that the relative probabilities of each type of event are fixed across the tree, and that events of a given type are equiprobable. Thus we can represent a GNT model tree as a triplet (T, {λ_e}, (α, β)), where (α, β) defines the relative probabilities of transpositions and inverted transpositions (hence an event is an inversion with probability 1 − α − β). For instance, (1/3, 1/3) indicates that the three event classes are equiprobable, while the pair (0, 0) indicates that only inversions happen.
Estimating True Evolutionary Distances Using Genome Rearrangements. Let us be given a set of genome rearrangement events with a particular weighting scheme; the weight of a sequence of rearrangement events from this set is the sum of the weights of these events. The edit distance between two gene orders is the minimum of the weights over all sequences of events from the given set that transform one gene order into the other. For example, the inversion distance is the edit distance when only inversions are permitted and all inversions have weight 1. The inversion distance can be computed in linear time [2], while the computational complexity of the transposition distance is unknown [3].
Another type of genomic distance that has received much attention in the genomics community is the breakpoint distance. Given two genomes G and G′ on the same set of genes, a breakpoint in G is an ordered pair of genes (g_a, g_b) such that g_a and g_b appear consecutively in that order in G, but neither (g_a, g_b) nor (−g_b, −g_a) appears consecutively in that order in G′. The number of breakpoints in G relative to G′ is the breakpoint distance between G and G′. The breakpoint distance is easily calculated by inspection in linear time.
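A sketch of this computation for signed circular genomes (represented linearly as above) is given below; this is illustrative code only, with our own names, and an ordered pair survives if it appears either as (g_a, g_b) or as (−g_b, −g_a) in the other genome.

```python
def adjacencies(genome):
    """Ordered signed adjacencies of a circular genome, wrapping around."""
    k = len(genome)
    return {(genome[i], genome[(i + 1) % k]) for i in range(k)}

def breakpoint_distance(g1, g2):
    adj2 = adjacencies(g2)
    return sum(1 for (a, b) in adjacencies(g1)
               if (a, b) not in adj2 and (-b, -a) not in adj2)

# One inversion of the segment [2, 3] creates exactly two breakpoints:
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -3, -2, 4, 5]))   # 2
```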
Estimating the true evolutionary distance requires an assumption about the model; in the case of gene-order evolution, the assumption is that the genomes have evolved from a common ancestor under the Nadeau-Taylor model of evolution. The technique in [4], applicable only to inversions, calculates this value exactly, while Approx-IEBP [21] and EDE [12], applicable to very general models of evolution, obtain approximations of these values, and Exact-IEBP [19] calculates the value exactly for any combination of inversions, transpositions, and inverted transpositions. These estimates can all be computed in low polynomial time.
Variance of Genomic Distances. The following problem has been studied in [17,19]: given a genome G with n genes, what is the expected breakpoint distance between G and G′, when G′ is the genome obtained from G by applying k rearrangements according to the GNT model? Both papers approach the problem by computing the probability of having a breakpoint between every pair of genes; by linearity of expectation, the expected breakpoint distance is then n times the aforementioned probability. Each breakpoint can be characterized as a Markov process with 2(n − 1) states, but the probability of a breakpoint is a sum of O(n) terms that we do not yet know how to simplify further.

However, the variance cannot be obtained this way, since breakpoints are not independent (under any evolutionary model), by the following simple observation: the probability of having a breakpoint at each breakpoint position is nonzero, but the probability of the breakpoint distance being 1 is zero (the breakpoint distance is always 0 or at least 2). Thus, to compute the variance (or the second moment) of the breakpoint distance, we need to look at two breakpoints at the same time. This implies we have to study a Markov process with O(n²) states and a sum of O(n²) terms that is hard to simplify. As for the inversion distance, even the expectation is still an open problem.
Estimating the variance of the breakpoint and inversion distances is important for several reasons. Based on these estimates we can compute the variances of the Approx-IEBP and Exact-IEBP estimators (based on the breakpoint distance) and of the EDE estimator (based on the inversion distance). It is also informative when we compare estimators based on breakpoint distances to other estimators, e.g. the inversion distance and the EDE distance. Finally, variance estimation can be used in distance-based methods to improve the topological accuracy of tree reconstruction.

Outline of This Paper. We start in Section 2 by presenting a stochastic model approximating the breakpoint distance, and derive the analytical form of the variance of the approximation, as well as the variance of the IEBP estimators. In Section 3, the variances of the inversion and EDE distances are obtained through simulation. Based on these variance estimates we propose four new methods, called BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE. These methods are based on BioNJ and Weighbor, but the variances in those algorithms have been replaced with the variances of IEBP and EDE. In Section 4 we present a simulation study to verify the accuracy of these new methods.
2 Variance of the Breakpoint Distance

The Approximating Model. We first fix the following notation: C(a, b) denotes the number of ways of choosing b objects from a (the binomial coefficient) when a ≥ b ≥ 0; C(a, b) is set to 0 otherwise.
We motivate the approximating model by the case of inversion-only evolution on signed circular genomes. Let n be the number of genes, and let b be the number of breakpoints of the current genome G relative to the unrearranged reference genome G_0. When we apply a random inversion (out of C(n, 2) possible choices) to G, we have the following cases, according to the two endpoints of the inversion [10]:

1. Neither endpoint of the inversion is a breakpoint. The number of breakpoints increases by 2. There are C(n − b, 2) such inversions.
2. Exactly one of the two endpoints of the inversion is a breakpoint. The number of breakpoints increases by 1. There are b(n − b) such inversions.
3. The two endpoints of the inversion are two breakpoints. There are C(b, 2) such inversions. Let g_i and g_{i+1} be the left and right genes at the left breakpoint, and let g_j and g_{j+1} be the left and right genes at the right breakpoint. There are three subcases:
(a) Neither (g_i, −g_j) nor (−g_{i+1}, g_{j+1}) is an adjacency in G_0. The number of breakpoints is unchanged.
(b) Exactly one of (g_i, −g_j) and (−g_{i+1}, g_{j+1}) is an adjacency in G_0. The number of breakpoints decreases by 1.
(c) Both (g_i, −g_j) and (−g_{i+1}, g_{j+1}) are adjacencies in G_0. The number of breakpoints decreases by 2.
When b ≥ 3, of the C(b, 2) inversions from case 3, cases 3(b) and 3(c) account for at most b inversions; this means that, given that an inversion belongs to case 3, with probability at least 1 − b/C(b, 2) = (b − 3)/(b − 1) it does not change the breakpoint distance, and this probability is close to 1 when b is large. Furthermore, when b << n, almost all inversions belong to cases 1 and 2. Therefore, when n is large, we can drop cases 3(b) and 3(c) without drastically affecting the distribution of the breakpoint distance.
The approximating model we use is as follows. Assume first that the evolutionary model is such that each rearrangement creates r breakpoints in an unrearranged genome (for example, r = 2 for inversions and r = 3 for transpositions and inverted transpositions). Let us be given n boxes, initially empty. At each iteration, r boxes are chosen randomly (without replacement); we then place a ball into each of these r boxes that is empty. The number of nonempty boxes after k iterations, b_k, can be used to estimate the number of breakpoints after k rearrangement events are applied to an unrearranged genome. This model can also be extended to approximate the GNT model: at each iteration, with probability 1 − α − β we choose 2 boxes, and with probability α + β we choose 3 boxes.
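The approximating model is straightforward to simulate. The Monte Carlo sketch below is illustrative only (the function name is ours); it draws r = 2 or r = 3 boxes per iteration with the stated probabilities and reports b_k.

```python
import random

def simulate_bk(n, k, alpha=0.0, beta=0.0):
    """One draw of b_k: fill 2 boxes w.p. 1-alpha-beta, else 3 boxes."""
    full = [False] * n
    for _ in range(k):
        r = 3 if random.random() < alpha + beta else 2
        for i in random.sample(range(n), r):
            full[i] = True
    return sum(full)

# Averaging many runs approximates E b_k (compare Corollary 1 below):
runs = [simulate_bk(120, 60) for _ in range(2000)]
print(sum(runs) / len(runs))
```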
Mean and Variance of the Approximating Model. Fix n (the number of boxes) and k (the number of times we choose r boxes). Consider the expansion of the following expression:

S = ((x_1 x_2 + x_1 x_3 + ... + x_{n−1} x_n) / C(n, 2))^k.

Each term corresponds to a number of ways of choosing r = 2 boxes k times, where the total number of times box i is chosen is the power of x_i, and the coefficient of that term is the total probability of these ways. For example, the coefficient of x_1³ x_2 x_3² in S (when k = 6) is the probability of choosing box 1 three times, box 2 once, and box 3 twice. Let u_i be the coefficient of the terms with i distinct symbols; then C(n, i) u_i is the probability that i boxes are nonempty after k iterations. The identity of u_i across all terms with the same set of power indices holds as long as the probability of each box being chosen is identical; in other words, S is unchanged by permuting {x_1, x_2, ..., x_n} arbitrarily.
To solve for u_i exactly for all k is difficult and unnecessary. Instead, we can find the expectation and variance of b_k directly; in fact, the following results give all the moments of b_k. Let S(a_1, a_2, ..., a_n) be the value of S when we substitute a_i for x_i, 1 ≤ i ≤ n, and let S_j be the value of S when a_1 = a_2 = ... = a_j = 1 and a_{j+1} = a_{j+2} = ... = a_n = 0. For integers j, 0 ≤ j ≤ n, we have

Σ_{i=0}^{j} C(j, i) u_i = S(1, 1, ..., 1, 0, ..., 0) = S_j    (with j ones).

Let

Z_a = Σ_{i=0}^{n} i(i−1)···(i−a+1) C(n, i) u_i = Σ_{i=a}^{n} n(n−1)···(n−a+1) C(n−a, i−a) u_i

for all a, 1 ≤ a ≤ n. We want to express Z_a as a linear combination of the S_i, 0 ≤ i ≤ n. The following lemma, which is a special case of equation (5.24) in [9], gives the coefficients of this linear combination.

Lemma 1. Let a be a given integer such that 1 ≤ a ≤ n. Let us be given {u_i : 0 ≤ i ≤ n} that satisfy Σ_{j=0}^{i} C(i, j) u_j = Σ_{j=0}^{n} C(i, j) u_j = S_i, where 0 ≤ i ≤ n. Then

Σ_{i=n−a}^{n} (−1)^{n−i} C(a, n−i) S_i = Σ_{j=0}^{n} C(n−a, j−a) u_j.
The following results follow from Lemma 1; we state them without proof due to space limitations.

Theorem 1. For all a, 1 ≤ a ≤ n,

Z_a = n(n−1)···(n−a+1) Σ_{i=n−a}^{n} (−1)^{n−i} C(a, n−i) S_i.

Corollary 1. (a) E b_k = Z_1 = n(1 − S_{n−1}).
(b) Var b_k = n S_{n−1} − n² S²_{n−1} + n(n−1) S_{n−2}.
These results hold for all integers r, 1 ≤ r ≤ n. When there is more than one type of rearrangement event, with different values of r, we can change S accordingly. For example, let γ = α + β; for the GNT model we can set

S = [ ((1−γ)/C(n, 2)) Σ_{1 ≤ i_1 < i_2 ≤ n} x_{i_1} x_{i_2} + (γ/C(n, 3)) Σ_{1 ≤ i_1 < i_2 < i_3 ≤ n} x_{i_1} x_{i_2} x_{i_3} ]^k.    (1)
Mean and Variance of the Breakpoint Distance under the GNT Model. We begin this section by finding the mean and variance of b_k with respect to the GNT model. By substituting into equation (1):

S_{n−1} = (1 − (2 + α + β)/n)^k,    S_{n−2} = ( (n−3)(n − 2 − 2(α+β)) / (n(n−1)) )^k.
For the GNT model, we have the following results:

d/dk E b_k = n S_{n−1} · (−(1/k) ln S_{n−1}) = −n S_{n−1} ln(1 − (2 + α + β)/n),    (2)

Var b_k = n S_{n−1} − n² S²_{n−1} + n(n−1) S_{n−2},    (3)

with S_{n−1} and S_{n−2} as given above.
Using BioNJ and Weighbor. Both BioNJ and Weighbor are designed for DNA sequence phylogeny, using the variance of the true evolutionary distance estimator. In BioNJ, the distance update step of NJ is modified so that the variances of the new distances are minimized. In Weighbor, the pairing step is also modified to utilize the variance information. We use the variance for the GNT model derived in this section, and the expected breakpoint distance from [19], in the two methods; the new methods are called BioNJ-IEBP and Weighbor-IEBP.

Fig. 1. Accuracy of the estimator for the variance. Each panel consists of two sets of curves, corresponding to the values from simulation and from the theoretical estimation. The number of genes is 120. The number of rearrangement events, k, ranges from 1 to 220. The evolutionary model is inversion-only GNT; see Section 1 for details. For each k we generate 500 runs. We then compute the standard deviation of b_k for each k, and that of k̂(b_k) for each k, and compare them with the values of the theoretical estimation.

To estimate the
true evolutionary distance, we use Exact-IEBP, though we could also use Equation (2), which is less accurate. Let k̂(b) denote the Exact-IEBP distance given that the breakpoint distance is b; k̂(b) behaves as the inverse of E b_k, the expected breakpoint distance after k rearrangement events. The variance of k̂(b) can be approximated using a common statistical technique, the delta method [13], as follows:
Var k̂(b) ≈ (d/dk E b_k)^{−2} Var b_k = [ 1 − n S_{n−1} + (n−1)(S_{n−2}/S_{n−1}) ] / [ n S_{n−1} (ln(1 − (2 + α + β)/n))² ].
When the number of rearrangements is below the number of genes (120 in the simulation), these results are accurate approximations to the mean and variance of the breakpoint distance under the GNT model, as the simulation in Figure 1 shows. When the number of rearrangements k is so high that the breakpoint distance is close to its maximum (the resulting genome is random with respect to the genome before evolution), the simulation shows the variance to be much lower than the theoretical formula. This is due to the application of the delta method: while the method assumes the Exact-IEBP distance is continuous, in reality it is a discrete function. The effect becomes more pronounced as k grows: different values of k all give breakpoint distances close to the maximum, yet Exact-IEBP can return only one estimate for k, hence the very low variance. This problem becomes less serious as n increases.
3 Variance of the Inversion and EDE Distances

The EDE Distance. Given two genomes having the same set of n genes with inversion distance d between them, we define the EDE distance as n·f^{−1}(d/n); here n is the number of genes, and f, an approximation to the expected inversion distance normalized by the number of genes, is defined as (see [12]):

f(x) = min{ x, (ax² + bx) / (x² + cx + b) }.

We simulate the inversion-only GNT model to evaluate the relationship between the inversion distance and the actual number of inversions applied. Regression on the simulation results suggests a = 1, b = 0.5956, and c = 0.4577. When the rational function is inverted, we take the larger (and only positive) root:

x = [ −(b − cy) + sqrt((b − cy)² + 4(a − y)by) ] / (2(a − y)).

Let y = d/n. Thus

f^{−1}(y) = max{ y, [ −(b − cy) + sqrt((b − cy)² + 4(a − y)by) ] / (2(a − y)) }.

Here the coefficients do not depend on n, since for different values of n the curves of the normalized expected inversion distance are similar.
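A sketch of the EDE correction follows (illustrative only); the constant names are ours, the coefficients are those just quoted, and the guard against y approaching a is our assumption, not part of the published formula.

```python
import math

A, B, C = 1.0, 0.5956, 0.4577    # regression coefficients from above

def f_inv(y):
    if y >= A:                    # assumed guard: f saturates at a = 1
        y = A - 1e-9
    disc = (B - C * y) ** 2 + 4 * (A - y) * B * y
    return max(y, (-(B - C * y) + math.sqrt(disc)) / (2 * (A - y)))

def ede_distance(inv_dist, n):
    """EDE estimate of the true number of inversions: n * f^{-1}(d/n)."""
    return n * f_inv(inv_dist / n)

print(ede_distance(80, 120))     # roughly 88.7 inversions estimated
```

Note that for d/n below 0.5423 the min in f is attained by the identity branch, so EDE leaves small inversion distances unchanged.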
Regression for the Variance. Given the success of nonlinear regression in the derivation of EDE, we use the same technique again for the variance of the inversion distance (and that of EDE). However, for different numbers of genes, the curves of the variance are very different (see Figure 2). From the simulation it is obvious that the magnitudes of the curves are inversely proportional to the number of genes (or to some function of it).

We use the following regression formula for the standard deviation of the inversion distance, normalized by the number of genes, after nx inversions are applied:

g_n(x) = n^{−q} (ux² + vx) / (x² + wx + t).

The constant term in the numerator is zero because we know g(0) = 0. Let r be the value such that rn is the largest number of inversions applied; we use r = 2.5. Note that

ln( (1/(rn)) Σ_{j=0}^{rn} g_n(j/n) ) ≈ ln( (1/r) ∫_0^r g_n(x) dx ) = −q ln n + ln( (1/r) ∫_0^r (ux² + vx)/(x² + wx + t) dx )

is a linear function of ln n. Thus we can obtain q from the slope of a linear regression with ln n as the independent variable and ln( (1/(rn)) Σ_j g_n(j/n) ) as the dependent variable (see Figure 2, right; the simulation results suggest this empirical average indeed decays as a power of n). Once q is obtained, we apply nonlinear regression to obtain u, v, w, and t using the simulation data for 40, 80, 120, and 160 genes. The resulting functions are shown as the solid curves in Figure 2, with coefficients q = 0.6998, u = 0.1684, v = 0.1573, w = 1.3893, and t = 0.8224.
Fig. 2. Left: simulation (points) and regression (solid lines) of the standard deviation of the normalized inversion distance, for 20, 40, 80, 120, and 160 genes. Right: regression of the coefficient q (see Section 3); for every point corresponding to n genes, the y coordinate is the average of all data points in the simulation.
Variance of the EDE Distance. Let X_k and Y_k be the inversion and EDE distances after k inversions are applied to a genome of n genes, respectively. We again use the delta method. Let x = k/n. Since X_k = n f(Y_k/n), we have

(dY_k/dX_k)^{−1} = dX_k/dY_k = (1/n) · dX_k/d(Y_k/n) = f′(x) = d/dx [ min{ x, (x² + bx)/(x² + cx + b) } ].

The point where x = (x² + bx)/(x² + cx + b) is x = 0.5423. Therefore

f′(x) = 1,  if 0 ≤ x < 0.5423;
f′(x) = d/dx [ (x² + bx)/(x² + cx + b) ] = ( x²(c − b) + 2bx + b² ) / (x² + cx + b)²,  if x ≥ 0.5423,

and

Var(Y_k) ≈ (dY_k/dX_k)² Var(X_k) = (f′(x))^{−2} (n g_n(x))² = ( n g_n(k/n) / f′(k/n) )².
4 Simulation Study

We present the results of two simulation studies in this section. The first experiment compares the accuracy of our new methods with that of other distance-based methods for genome rearrangement phylogeny. In the second experiment we show the robustness of Weighbor-IEBP against errors in the parameters (α, β) of the GNT model.

Table 1. Settings for Experiments 1 and 2.

  Parameter                        | Value
  1. Number of genes               | 120 (plant chloroplast genome)
  2. Model tree generation         | uniformly random topology (see the Model Trees paragraph in Section 4.1 for details)
  4. GNT model parameters (α, β)   | (0, 0), (1/3, 1/3)
  5. Datasets for each setting     | 30

  The probabilities that a rearrangement is an inversion, a transposition, or an inverted transposition are 1 − α − β, α, and β, respectively.

4.1 Experiment 1: Accuracy of the New Methods

In this experiment we compare the trees reconstructed using BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE to neighbor joining trees using different distance estimators. The following four distance estimators are used
with neighbor joining: (1) BP, the breakpoint distance; (2) INV [2], the inversion distance; and (3) Exact-IEBP [19] and (4) EDE [12], true evolutionary distance estimators based on BP and INV, respectively. The procedure of neighbor joining combined with distance X is denoted NJ(X). According to past simulation studies [21,12,19], NJ(EDE) has the best accuracy, followed closely by NJ(Exact-IEBP). See Table 1 for the settings of the experiment.
Quantifying Error. Given an inferred tree, we compare its topological accuracy by computing "false negatives" with respect to the true tree [11,8]. During the evolutionary process, some edges of the model tree may have no changes (i.e. evolutionary events) on them. Since reconstructing such edges is at best guesswork, we are not interested in these edges. Hence, we define the true tree to be the result of contracting those edges of the model tree on which there are no changes.

We now define how we score an inferred tree by comparison to the true tree. For every tree there is a natural association between every edge and the bipartition of the leaf set induced by deleting the edge from the tree. Let T be the true tree and let T′ be the inferred tree. An edge e in T is missing in T′ if T′ does not contain an edge defining the same bipartition; such an edge is called a false negative. Note that the external edges (i.e. edges incident to a leaf) are trivial, in the sense that they are present in every tree with the same set of leaves. The false negative rate is the number of false negative edges in T′ with respect to T, divided by the number of internal edges in T.
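This error measure reduces to comparing bipartition sets, as the small sketch below shows (illustrative only; the representation and names are our assumptions). Each tree is given as the set of bipartitions of its internal edges, each encoded as a frozenset holding one side of the split, normalized to the side not containing a fixed reference taxon.

```python
def false_negative_rate(true_splits, inferred_splits):
    """Fraction of internal edges of the true tree whose bipartitions
    are missing from the inferred tree."""
    if not true_splits:
        return 0.0
    missing = sum(1 for s in true_splits if s not in inferred_splits)
    return missing / len(true_splits)

true_T  = {frozenset({1, 2}), frozenset({1, 2, 3})}
infer_T = {frozenset({1, 2}), frozenset({2, 3})}
print(false_negative_rate(true_T, infer_T))   # 0.5
```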
Software. We use PAUP* 4.0 [18] to compute the neighbor joining method and the false negative rates between two trees. We have implemented a simulator [12,21] for the GNT model. Its input is a rooted leaf-labeled model tree (T, {λ_e}) and the parameters (α, β). On each edge, the simulator applies random rearrangement events to the circular genome at the ancestral node, according to the model with the given parameters α and β. We use the original Weighbor and BioNJ implementations [5,7] (downloadable from the Internet), modified so that they use the new variance formulas.
Fig. 3. The topological accuracy of various distance-based tree reconstruction methods, for the two GNT parameter settings of Table 1: inversion-only, (α, β) = (0, 0) (top row), and the setting with transpositions and inverted transpositions (bottom row). Each panel plots the normalized false negative rate (%) against the normalized maximum pairwise inversion distance; the left panels compare NJ(BP), NJ(Exact-IEBP), BioNJ-IEBP, and Weighbor-IEBP, and the right panels compare NJ(INV), NJ(EDE), BioNJ-EDE, and Weighbor-EDE. The number of genomes is 160. See Table 1 for the settings of the experiment.
Model Trees. The model trees have topologies drawn from the uniform distribution¹, and edge lengths drawn from the discrete uniform distribution on intervals [1, b], where b is one of 3, 6, 12, and 18 (higher values of b make the variance of the edge lengths higher). The length of each edge is then scaled by a common factor so that the diameter of the tree (the maximum pairwise leaf-to-leaf distance on the tree) is 36, 72, 120, 180, or 360 (so that low to high evolutionary rates are covered).

¹ This is easily done, and well known in the community: add one leaf at a time to produce the whole tree. At each iteration, choose an edge from the current tree (each edge having the same probability of being chosen) and attach the new leaf to it.

Discussion. The results of the simulation are shown in Figure 3. Due to space limitations, and because the relative order of accuracy remains the same, we describe our results only for a subset of our experiments. For each simulation setting, we group methods by the genomic distance they are based on: breakpoint or inversion distance. In either group, the relative order of accuracy is roughly the same. Let Y be the true distance estimator based on a genomic distance X, e.g. Y = IEBP if X = BP, and Y = EDE if X = INV. The order of the methods, starting from the worst, is (1) NJ(X), (2) BioNJ-Y, (3) NJ(Y), and (4) Weighbor-Y (except at very low evolutionary rates, where Weighbor-IEBP is worst, but only by a few percent). The differences between NJ(Y) and BioNJ-Y are extremely small. Weighbor-EDE has the best accuracy over all methods.
When we compare methods based on breakpoint distance and methods based
on inversion distance, the latter are always better (or no worse) than the
former if we compare methods of the same complexity: NJ(INV) is better than
NJ(BP), NJ(EDE) is better than NJ(Exact-IEBP), BioNJ-EDE is better than
BioNJ-IEBP, and Weighbor-EDE is better than Weighbor-IEBP. This suggests INV
is a better statistic than BP for the true evolutionary distance under the
GNT model, even when transpositions and inverted transpositions are present.
This is not surprising, as INV, just like BP, increases by a small constant
when a rearrangement event from the GNT model is applied. Also, though their
maximum allowed values are the same (the number of genes for circular signed
genomes), the fact that the average increase in INV is smaller than the
average increase in BP gives INV a wider effective range. (An inversion
creates two breakpoints; a transposition and an inverted transposition can be
realized by three and two inversions, respectively, and they all create three
breakpoints each. Thus, under the GNT model with parameters (α, β), and
assuming genome G has only small breakpoint BP(G, G′) and inversion
INV(G, G′) distances from the reference (ancestral) genome G′, the average
increase in BP(G, G′) after a random rearrangement is applied to G is
2(1 − α − β) + 3α + 3β = 2 + α + β, and the average increase in INV(G, G′) is
(1 − α − β) + 3α + 2β = 1 + 2α + β. The latter is always smaller, and the two
quantities are equal only when α = 1, i.e. only transpositions occur.)
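These per-event expectations are easy to tabulate; a small Python sketch
(helper name ours):

def expected_increase(alpha, beta):
    # GNT event probabilities: inversion 1 - alpha - beta,
    # transposition alpha, inverted transposition beta.
    p_inv = 1.0 - alpha - beta
    bp = 2 * p_inv + 3 * alpha + 3 * beta    # simplifies to 2 + alpha + beta
    inv = 1 * p_inv + 3 * alpha + 2 * beta   # simplifies to 1 + 2*alpha + beta
    return bp, inv

# Inversions only: BP grows by 2 per event, INV by 1.
print(expected_increase(0.0, 0.0))    # (2.0, 1.0)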
Note that Weighbor-IEBP outperforms NJ(Exact-IEBP) when the normalized
maximum pairwise inversion distance, i.e. the diameter of the dataset, exceeds
0.6; Weighbor-IEBP (based on BP) is even better than NJ(EDE) (based on the
better INV) when the diameter of the dataset exceeds 0.9. This suggests the
Weighbor approach really shines under high amounts of evolution.
Running Time. NJ, BioNJ-IEBP, and BioNJ-EDE all finish within 1 second for all
settings on our Pentium workstations running Linux. However, Weighbor-IEBP
and Weighbor-EDE take considerably more time; both take about 10 minutes to
finish for 160 genomes. As a side note, the fast maximum parsimony methods for
genome rearrangement phylogeny, MPME/MPBE, almost always exceed the four-hour
running time limit in the experiment in [20]; Weighbor-EDE is almost as good
as these two methods except for datasets with very high amounts of evolution.
4.2 Experiment 2: Robustness of Weighbor-IEBP
In this section we demonstrate the robustness of the Weighbor-IEBP method
when the model parameters are unknown. The settings are the same as in Table
1. The experiment is similar to the previous one, except that here we use
both the correct and the incorrect values of (α, β) for the Exact-IEBP distance
and the variance estimation. Due to space limitations the results are not shown
here. The false negative curves are similar even when different values of α
and β are used. These results suggest that Weighbor-IEBP is robust against
errors in (α, β).
5 Conclusion and Future Work
In this paper we studied the variance of the breakpoint and inversion
distances under the Generalized Nadeau-Taylor model. We then used these
results to obtain four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE,
and Weighbor-EDE. Of these, Weighbor-IEBP and Weighbor-EDE yield very
accurate phylogenetic trees, and are robust against errors in the model
parameters. Future research includes analytical estimates of the expectation
and variance of the inversion distance, and of the difference between the
approximating model and the breakpoint distance under the true GNT model.
Acknowledgements
The author thanks Tandy Warnow for suggesting the direction of research which
led to this paper, and the two anonymous referees for their helpful comments.
References
1. K. Atteson. The performance of the neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2/3):251–278, 1999.
2. D. A. Bader, B. M. E. Moret, and M. Yan. A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. J. Comput. Biol., 8(5):483–491, 2001.
3. V. Bafna and P. Pevzner. Sorting permutations by transpositions. In Proc. 6th Annual ACM-SIAM Symp. on Disc. Alg. (SODA 1995), pages 614–623. ACM Press, 1995.
4. M. Blanchette, M. Kunisawa, and D. Sankoff. Gene order breakpoint evidence in animal mitochondrial phylogeny. J. Mol. Evol., 49:193–203, 1999.
5. W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000. http://www.t10.lanl.gov/billb/weighbor/.
6. S. R. Downie and J. D. Palmer. Use of chloroplast DNA rearrangements in reconstructing plant phylogeny. In P. Soltis, D. Soltis, and J. J. Doyle, editors, Molecular Systematics of Plants, volume 49, pages 14–35. Chapman & Hall, 1992.
7. O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol., 14:685–695, 1997. http://www.crt.umontreal.ca/olivierg/bionj.html.
8. O. Gascuel. Personal communication, April 2001.
9. R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 2nd ed., 1994.
10. S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial algorithm for genomic distance problems). In Proc. 27th Annual ACM Symp. on Theory of Comp. (STOC 1995), pages 178–189. ACM Press, NY, 1995.
11. S. Kumar. Minimum evolution trees. Mol. Biol. Evol., 15:584–593, 1996.
12. B. M. E. Moret, L.-S. Wang, T. Warnow, and S. Wyman. New approaches for reconstructing phylogenies based on gene order. In Proc. 9th Intl. Conf. on Intel. Sys. for Mol. Bio. (ISMB 2001), pages 165–173. AAAI Press, 2001.
13. G. W. Oehlert. A note on the delta method. Amer. Statist., 46:27–29, 1992.
14. R. G. Olmstead and J. D. Palmer. Chloroplast DNA systematics: a review of methods and data analysis. Amer. J. Bot., 81:1205–1224, 1994.
15. L. A. Raubeson and R. K. Jansen. Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science, 255:1697–1699, 1992.
16. N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406–425, 1987.
17. D. Sankoff and M. Blanchette. Probability models for genome rearrangements and linear invariants for phylogenetic inference. In Proc. 3rd Intl. Conf. on Comput. Mol. Bio. (RECOMB 1999), pages 302–309, 1999.
18. D. Swofford. PAUP* 4.0. Sinauer Associates Inc., 2001.
19. L.-S. Wang. Improving the accuracy of evolutionary distances between genomes. In Lec. Notes in Comp. Sci. No. 2149: Proc. 1st Workshop on Algorithms in Bioinformatics (WABI 2001), pages 175–188. Springer Verlag, 2001.
20. L.-S. Wang, R. K. Jansen, B. M. E. Moret, L. A. Raubeson, and T. Warnow. Fast phylogenetic methods for the analysis of genome rearrangement data: An empirical study. In Proc. 7th Pacific Symp. Biocomputing (PSB 2002), pages 524–535, 2002.
21. L.-S. Wang and T. Warnow. Estimating true evolutionary distances between genomes. In Proc. 33rd Annual ACM Symp. on Theory of Comp. (STOC 2001), pages 637–646. ACM Press, 2001.
Segment Match Refinement and Applications
Aaron L. Halpern, Daniel H. Huson*, and Knut Reinert
Informatics Research, Celera Genomics Corp.
45 W Gude Drive, Rockville, MD 20850, USA
Phone: +240-453-3000, Fax: +1-240-453-3324
{Aaron.Halpern,Knut.Reinert}@celera.com
Abstract. Comparison of large, unfinished genomic sequences requires
fast methods that are robust to misordering, misorientation, and duplications.
A number of fast methods exist that can compute local similarities
between such sequences, from which an optimal one-to-one correspondence
might be desired. However, existing methods for computing such
a correspondence are either too costly to run or are inappropriate for
unfinished sequence. We propose an efficient method for refining a set
of segment matches such that the resulting segments are of maximal
size without non-identity overlaps. This resolved set of segments can be
used in various ways to compute a similarity measure between any two
large sequences, and hence can be used in alignment, matching, or tree
construction algorithms for two or more sequences.
1 Introduction
Given two (possibly incomplete) genomic sequences of two related organisms, or
two separately derived sequences from the same organism, one may wish to know
how the sequences relate to one another. In this paper, we introduce techniques
that facilitate analyses for two particular questions. The first question is how
much of each sequence is not represented in the other, under the restriction that
positions in each sequence can be involved in at most one selected match. This
may be of interest, e.g., in determining the relative degrees of completion of two
sequencing projects. The second question is how the sequences may be most
economically derived from one another by various rearrangements, duplications,
insertions or deletions. This gives both a transformation distance [15] and an
edit script implying specific differences between the sequences, whether these
result from true evolutionary events or are artifacts of a sequence assembly
project.
We assume the existence of a pre-computed set of local segment matches
between two sequences. Such sets can be obtained efficiently by various
techniques, e.g. [2,4,5,9]. However, a set of local segment matches is not
disjoint: more than one match may cover a given interval, whereas the two
analyses described above require either a one-to-one matching or a many-to-one
matching.
* New address: WSI-AB, Tübingen University, Sand 14, 72076 Tübingen, Germany.
Phone: +49-7071-29-70450, email: [email protected]
Fig. 1. Two partly overlapping matches, S and S′, between sequences A and B.
Methods for rapidly constructing a nearly optimal global sequence alignment
from such a set of matches have been presented [4,3], but such a global
alignment ignores the treatment of transpositions, inversions, and duplications.
Another approach to selecting near-optimal many-to-one matches has been
to find the set of matches that gives maximum coverage of the primary axis of
interest without overlapping on that axis [14,15]. For our purposes, this
treatment is incomplete in two respects. Firstly, such an approach may block
the use of one match that only partially overlaps a larger match, where a
solution making use of parts of two matches would be better than a solution
making use of either one, as in Figure 1; although the treatment of matches
which overlap on the primary axis is briefly mentioned in [15], no details are
given. Secondly, such an approach does not readily extend to the problem of
determining a one-to-one match, i.e. to choosing an optimal set of matches
that don't overlap on either axis.
Finally, we note that the problem of determining an optimal one-to-one set
of matches is not directly a classical matching problem, nor is it a form of
generalized sequence alignment [13].
In the following, we introduce an efficient and provably optimal approach to
the problem of constructing one-to-one or many-to-one match sets by minimally
refining (subdividing) matches until all overlaps between the refined matches
are resolved: the projections of any two refined matches onto each sequence are
either disjoint or are identical. Given such a resolved set of matches, several
analyses that are otherwise difficult at best become quite straightforward,
including not only the one-to-one matching problem and the many-to-one matching
problems introduced above but also sequence alignment based on conserved
segments [8]. For instance, we note that the one-to-one correspondence on the
refined set is reduced to the classical matching problem. In addition, it is
straightforward to compute the transformation distance or an alignment between
the sequences if desired.
The paper is organized as follows. We first introduce segment matches in
Section 2 and then discuss refinements and resolved refinements in Section 3.
In Section 4 we then present our main result, an efficient algorithm for
computing a resolved refinement. Applications of the algorithm to the matched
sequence problem, in the context of transformation distances, and to the
maximal weight trace problem are discussed in Section 5. Finally, in Section 6
we illustrate the feasibility of the discussed methods on real data.
2 Segment Matches
Let A = a_1 a_2 ... a_p and B = b_1 b_2 ... b_q be two sequences. We call
A_ij = a_{i+1} a_{i+2} ... a_j an A-segment, with 0 ≤ i < j ≤ p. A B-segment
is defined similarly. In the following, we will assume without specific
mention that any definitions relative to A carry over to analogous definitions
relative to B.
A segment match S = (A_ij, B_kl) consists of an A-segment A_ij and a
B-segment B_kl. We call i, j, k and l the left-A, right-A, left-B and right-B
positions of S, respectively. Given a set M of segment matches, let supp_A(M)
denote the A-support of M, i.e. the set of left-A and right-A positions of all
matches S ∈ M.
We assume that every segment match S = (A_ij, B_kl) comes with two
projection maps π_S : [i, j] → [k, l] and σ_S : [k, l] → [i, j] that reflect
e.g. the alignment associated with S, as depicted in Figure 2.
In the case of a direct (i.e. orientation preserving) match, we require that
π_S and σ_S are non-crossing, i.e. that the two maps are monotonically
increasing and that we have π_S(h) ≤ h′ if and only if h ≤ σ_S(h′), for all
h ∈ [i, j] and h′ ∈ [k, l]. In the case of a transposition, we require that
the projections are (completely) crossing, i.e. that they are monotonically
decreasing and that we have π_S(h) ≤ h′ if and only if h ≥ σ_S(h′), for all
h ∈ [i, j] and h′ ∈ [k, l].
When the alignments in matches are gapless, one can use linear interpolation
in place of the π and σ projections, i.e. by setting
  π_S(h) := k + ((l − k)/(j − i)) (h − i), if S is a direct match, and
  π_S(h) := l − ((l − k)/(j − i)) (h − i), otherwise,
for every S = (A_ij, B_kl) and i ≤ h ≤ j.
A match S′ = (A_{i′j′}, B_{k′l′}) is called a submatch of S, if
i ≤ i′ ≤ j′ ≤ j, and we have π_S(i′) = k′ or σ_S(k′) = i′, and π_S(j′) = l′
or σ_S(l′) = j′. Here, we assume that π_{S′}, and σ_{S′}, equals the
restriction of π_S to [i′, j′], and of σ_S to [k′, l′], respectively.
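For gapless matches the interpolated projections can be written down directly.
The following Python sketch (class and method names are ours; π and σ are
rendered as pi and sigma) is reused by the later sketches in this section:

from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentMatch:
    # The match (A_ij, B_kl): A-positions i..j, B-positions k..l.
    i: int
    j: int
    k: int
    l: int
    direct: bool = True    # False for a reversed (crossing) match

    def pi(self, h):
        # Project A-position h in [i, j] onto [k, l] by linear interpolation.
        assert self.i <= h <= self.j
        t = (self.l - self.k) * (h - self.i) / (self.j - self.i)
        return self.k + t if self.direct else self.l - t

    def sigma(self, hp):
        # The inverse projection from [k, l] back onto [i, j].
        assert self.k <= hp <= self.l
        t = (self.j - self.i) * (hp - self.k) / (self.l - self.k)
        return self.i + t if self.direct else self.j - t

For a direct match both maps are monotonically increasing; for a reversed
match both are decreasing, matching the non-crossing/crossing requirement
above.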
3 Resolved Refinement
Consider a match S = (A_ij, B_kl) ∈ M. A set of matches
M′ = {S_1, S_2, ..., S_r} is called a refinement of S, if each S_t is a
submatch of S and M′ tiles S, i.e. we have the following two disjoint unions:
  (i, j] = ⋃ (i′, j′] and (k, l] = ⋃ (k′, l′],
taken over all (A_{i′j′}, B_{k′l′}) ∈ M′.
Definition 1. A set M′ of matches is called a refinement of
M = {S_1, S_2, ..., S_n}, if there exists a partitioning
M′ = M′_1 ∪ M′_2 ∪ ... ∪ M′_n into disjoint parts such that M′_i is a
refinement of S_i for every i = 1, 2, ..., n.
We are particularly interested in refinements of M that have the property
that the segments of any two matches are either disjoint or are identical, and
capture this property as follows:
[Figure 2: four panels (a)-(d) showing matches between sequences A and B,
with positions i, j on A and k, l on B.]
Fig. 2. Refining and resolving matches. (a) A valid match S = (A_ij, B_kl),
together with arrows depicting π_S (pointing down) and σ_S (pointing up).
(b) The set M = {S, S′} is unresolved. (c) The set {S_1, S_2, S_3, S′} is a
minimum resolved refinement of M. (d) This set is a non-minimum resolved
refinement of M.
Definition 2. A set M of matches is called resolved, if for any
S = (A_ij, B_kl) ∈ M we have [i, j] ∩ supp_A(M) = {i, j} and
[k, l] ∩ supp_B(M) = {k, l}. For technical reasons, we also require
{π_S(i), π_S(j)} ⊆ supp_B(M) and {σ_S(k), σ_S(l)} ⊆ supp_A(M).
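Definition 2 translates directly into a check. A sketch on top of the
SegmentMatch class above, assuming integer endpoints:

def is_resolved(matches):
    supp_A = {p for S in matches for p in (S.i, S.j)}
    supp_B = {p for S in matches for p in (S.k, S.l)}
    for S in matches:
        # No support point may fall strictly inside a segment ...
        if set(range(S.i + 1, S.j)) & supp_A or set(range(S.k + 1, S.l)) & supp_B:
            return False
        # ... and projected endpoints must themselves be support points.
        if not {round(S.pi(S.i)), round(S.pi(S.j))} <= supp_B:
            return False
        if not {round(S.sigma(S.k)), round(S.sigma(S.l))} <= supp_A:
            return False
    return True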
In Figure 2(a) we show a match S = (A_ij, B_kl). The indicated choice of
π_S and σ_S implies that the match has incorporated an insertion I_1. We show
an unresolved set of two matches M = {S, S′} in Figure 2(b) and a resolved
refinement of minimum cardinality in Figure 2(c). Finally, Figure 2(d) shows a
resolved refinement of M that does not have minimum cardinality.
Any set M of matches possesses a (trivial) resolved refinement obtained by
simply refining every S ∈ M into a set of single position matches. We are
interested in resolved refinements of minimum cardinality and claim:
Lemma 1. There exists a unique resolved refinement M* of M of minimum
cardinality.
Proof. Consider two different resolved refinements M_1 and M_2 of M, both of
minimum cardinality.
First, assume that supp_A(M_1) ≠ supp_A(M_2). Then, w.l.o.g., there exists
h ∈ supp_A(M_1) \ supp_A(M_2). We say that an arbitrary A-position p is
attracted to h, if there exists a sequence of matches S_1, S_2, ..., S_r ∈ M
such that alternating application of their π and σ projections to p yields h.
Let P_A and P_B denote the basins of attraction of A-positions and B-positions
for h, respectively. Note that P_A ≠ ∅ and P_B ≠ ∅. If
P_A ∩ supp_A(M) ≠ ∅ or P_B ∩ supp_B(M) ≠ ∅, then by Definition 2, h must
occur in every resolved refinement of M, hence in particular in M_2,
contradicting the choice of h. On the other hand, if both of these
intersections are empty, then one can simplify M_1 by merging each pair of
consecutive matches that is separated only by a pair of points in P_A and P_B.
Note that this simplification is possible due to the non-crossing/crossing
requirement for direct/reverse segment matches. We thus obtain a resolved
refinement of M of smaller cardinality, a contradiction. For example, in
Figure 2(d), P_A = {h} and P_B = {h′, h″}.
Now, assume supp_A(M_1) = supp_A(M_2) and supp_B(M_1) = supp_B(M_2), and
that there exists a match S = (A_ij, B_kl) ∈ M_1 \ M_2. By Definition 1, there
exists a match S′ ∈ M such that S is a submatch of S′. By assumption,
[i, j] ∩ supp_A(M_2) = [i, j] ∩ supp_A(M_1) = {i, j} and
[k, l] ∩ supp_B(M_2) = [k, l] ∩ supp_B(M_1) = {k, l}, and so S must occur in
the M_2-refinement of S′, a contradiction.
4 Computing a Minimum Resolved Refinement
The following algorithm computes the minimum resolved refinement:
Algorithm 1 Input: set M of segment matches and projections π_S, σ_S for all
S ∈ M. Output: resolved refinement M* of M of minimum cardinality. We
maintain a bipartite graph G = (V_A ∪ V_B, E ⊆ V_A × V_B) with node sets
V_A ⊆ {a_0, ..., a_p} and V_B ⊆ {b_0, ..., b_q} representing A- and
B-positions, respectively, and a set E of (labeled) edges that keep track of
pairs of positions projected onto each other by a given sequence match.
First, for a set W_A of A-positions, we define the A-refinement step as
follows:
  If this is not the first execution of this step: set W_B = ∅.
  For each h ∈ W_A \ V_A:
    For each match S = (A_ij, B_kl) such that i ≤ h ≤ j:
      With h′ := π_S(h), add the edge (S, h, h′) to G (creating nodes as
      necessary) and insert h′ into W_B.
Second, for a set W_B of B-positions, we define the B-refinement step as
follows:
  Set W_A = ∅.
  For each h′ ∈ W_B \ V_B:
    For each match S = (A_ij, B_kl) such that k ≤ h′ ≤ l:
      With h := σ_S(h′), add the edge (S, h, h′) to G (creating nodes as
      necessary) and insert h into W_A.
Third:
  Initialize V_A = ∅, W_A = supp_A(M), V_B = ∅ and W_B = supp_B(M).
  Repeat until no further refinement is necessary (i.e. until either
  W_A = ∅ or W_B = ∅):
    Apply the A-refinement step to W_A and then the B-refinement step to W_B.
Finally:
  Lexicographically order the set of edges E, using the key (S, i, k) or
  (S, i, −k), depending on whether π_S and σ_S are non-crossing or crossing,
  respectively. Then, set M* := {(A_ij, B_kl) | (S, i, k) and (S′, j, l) are
  consecutive and S = S′}.
Lemma 2. Algorithm 1 computes the minimum resolved refinement M* of M.
Proof. In the first iteration of the A-refinement step, all original
A-positions are projected onto B using all segment matches. Then, all original
and new B-positions are projected onto A. We then repeatedly project the newly
generated positions back and forth until no new positions in A or B are
reached. At this stage, V_A and V_B are the A- and B-supports, respectively,
of a refinement of M. Because every position was produced by a chain of
projections starting from a position coming from an original match in M, there
are no superfluous matches and the resulting set has minimum cardinality.
In the final extraction step, for each match S ∈ M, the refinement of S
based on the set of all position pairs recorded for S is computed.
Lemma 3. Given a set M of n segment matches between two sequences
A = a_1 a_2 ... a_p and B = b_1 b_2 ... b_q, with p ≥ q. The graph G in
Algorithm 1 has at most O(kp) edges and can be computed in at most
O(p log² n + kp log p) steps, where k is the maximal number of segment matches
containing any given position.
Proof. The maximum number of different A-positions is p. To insert an
A-position h into V_A, we first determine which segment matches contain it in
O(log² n + k) steps, using a range tree [10,17]. For each of the O(k) matches
S that contains h, we determine whether π_S(h) is contained in V_B in O(log p)
steps. Putting this together, we obtain a bound of
O(p(log² n + k + k log p)) = O(p log² n + kp log p). Note that each of the
O(p) positions gives rise to at most k edges in G.
Lemma 4. Given a set M of n segment matches between two sequences
A = a_1 a_2 ... a_p and B = b_1 b_2 ... b_q, with p ≥ q. The minimum resolved
refinement M* of M can be computed in at most O(p log² n + kp log(kp)) steps,
where k is the maximal number of segment matches containing any given
position.
Proof. The preceding result implies that the ordering step in Algorithm 1
will take at most O(kp log(kp)) computations.
Note that the size of a minimum resolved refinement depends on the length
of the input sequences. For example, a set of only four segment matches
between segments of length d, say, can give rise to 4d matches, as
demonstrated in Figure 3.
5 Applications
In this section we describe several comparison algorithms for large genomic
sequences that operate on sets of resolved matches and discuss their worst
case time bounds.
5.1 The Matched Sequence Problem
Given two sequences A and B and a set M of segment matches between them.
The Maximal Matched Sequence Problem (MMSP) is to compute a set of
non-intersecting matches M′ that are all submatches of matches in M, such
that the amount of sequence covered by matched segments is maximized.
Let M be a resolved set of segment matches between sequences A and B.
The associated match graph is an edge-weighted bipartite graph
G = (V, E, ω : E → N). For the node set V = V_A ∪ V_B, we define V_A, or V_B,
as the set of all intervals (i, j], or (k, l] respectively, for which there
exists a segment match (A_ij, B_kl) ∈ M. We connect any two such nodes
(i, j] ∈ V_A and (k, l] ∈ V_B by an edge e ∈ E, if S = (A_ij, B_kl) ∈ M, and
set ω(e) to the length of S.
Lemma 5. Given a resolved set M of n segment matches. The Maximal Matched
Sequence Problem can be solved in at most O(n² log n) steps.
Proof. Any maximal weight matching of the match graph G = (V, E) gives rise
to a solution. Such a matching can be computed in O(v(e + v log v)) time,
where v = |V| and e = |E|. Since v ≤ 2n and e = n, the claim follows.
The preceding results imply:
Theorem 2. Let M be a set of n matches between sequences A = a_1 a_2 ... a_p
and B = b_1 b_2 ... b_q, with p ≥ q. The Maximal Matched Sequence Problem can
be solved in at most O((kp)² log(kp)) steps, where k is the maximal number of
matches containing any given position.
[Figure 3: three panels (a)-(c) showing matches S, S_1, S_2, S′ between
sequences A and B.]
Fig. 3. (a) Matches S and S′ (shown in grey) map two A-segments of length d
onto two B-segments of length d. Diagonal match S_1 maps the B-segment of S
onto the A-segment of S′, whereas S_2 maps the A-segment of S′ onto the
B-segment of S with a shift of h base pairs. (b) The left-B position of S_1
forces a refinement of match S′ and then, subsequently, of match S. (c) This
leads to a chain of refinements, producing O(d/h) matches.
Proof. The minimum resolved refinement of M has size O(kp) and can be
computed in O(p log² n + kp log(kp)) steps. On the set of refined matches,
the MMSP can be solved in at most O((kp)² log(kp)) steps. Hence, the total
running time is at most
O((kp)² log(kp) + p log² n + kp log(kp)) = O((kp)² log(kp)).
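On a refined match set the reduction of Lemma 5 is immediate; here is a
sketch using the networkx library for the maximum-weight matching (the
library choice is ours, not the paper's):

import networkx as nx

def maximal_matched_sequence(refined):
    # Match graph: A-intervals and B-intervals as nodes, one edge per
    # refined match, weighted by its (A-)length.
    G = nx.Graph()
    for S in refined:
        G.add_edge(("A", S.i, S.j), ("B", S.k, S.l),
                   weight=S.j - S.i, match=S)
    # A maximum-weight matching selects a one-to-one set of refined
    # matches maximizing the total matched sequence length.
    mate = nx.max_weight_matching(G)
    return [G.edges[u, v]["match"] for u, v in mate]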
5.2 The Transformation Distance Problem
The transformation distance introduced in [15] provides a measure for
comparing two large sequences A and B based on computing an optimal script
that uses copy, reverse-copy and insertion operations to build B from segments
of A. The input considered by Varré et al. consists of a set M of segment
matches that are so-called factors, i.e. matches between segments of A and B
allowing at most one mismatch. They define an order <_0 on M by setting
(A_ij, B_kl) <_0 (A_{i′j′}, B_{k′l′}) if the first segment match strictly
precedes the second, i.e. if j < i′ and l < k′. A weighted script graph
G = (V, E) is constructed, the nodes of which represent the segment matches.
An edge is placed between two nodes, if the corresponding segment matches S
and S′ are adjacent in <_0, i.e. if S <_0 S′ and no segment match S″ exists
with S <_0 S″ <_0 S′, or vice versa. Each node is assigned a weight that
reflects the cost of the associated segment match, whereas the weight of an
edge equals the cost of inserting the sequence between the two segments
covered by the corresponding matches into the target sequence.
As noted by Varré et al., this graph can be very large, and so only maximal
matches are used. However, because the maximal matches will often overlap, one
needs access to submatches, and the authors insert submatches in the graph as
necessary, using a mechanism that is not described in their paper.
In the following we describe how one can easily compute the transformation
distance given a (minimum) resolved set of matches M. First, we construct the
script graph as follows:
1. Sort the refined matches according to <_0.
2. Determine all runs of refined matches that can be joined to create longer
matches, i.e. all S_1, S_2, ..., S_k such that the right A- and B-positions of
S_i coincide with the left A- and B-positions of S_{i+1}, respectively (see
the sketch after this list). Insert nodes for all original refined matches and
for all joinable pairs, triplets, etc. of matches and then weight them
appropriately [15].
3. Sort all nodes according to <_0.
4. For each node in the graph, determine all nodes that correspond to matches
that are adjacent in <_0 and insert the weighted edges into the graph.
Obviously this graph is minimal in the sense that our refinement is minimal
and we only add new nodes that correspond to runs of refined matches that are
joinable. In other words, we insert all original matches and all those
submatches that are necessary to compute the transformation distance. (We note
that in the case of an affine weight function for joinable matches one can
omit the insertion of new nodes in step 2 and instead adjust the edge weights
between joinable nodes, which gives rise to a problem that is computationally
much easier to solve.)
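Step 2 above amounts to grouping refined matches whose endpoints chain
together. A sketch for direct matches only (helper name ours; reversed matches
need the analogous test on swapped B-endpoints):

def joinable_runs(refined):
    if not refined:
        return []
    order = sorted(refined, key=lambda S: (S.i, S.k))
    runs, run = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        # Joinable: right A- and B-positions of one refined match coincide
        # with the left A- and B-positions of the next.
        if prev.j == cur.i and prev.l == cur.k:
            run.append(cur)
        else:
            runs.append(run)
            run = [cur]
    runs.append(run)
    return runs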
Theorem 3. Let M be a set of n matches between sequences A = a_1 a_2 ... a_p
and B = b_1 b_2 ... b_q, with p ≥ q. Let m be the length of the longest run of
joinable matches. Then the Transformation Distance Problem can be solved in at
most O(kpm² log(kpm)) steps.
Proof. As mentioned above, we can compute the minimum resolved refinement of
M in O(p log² n + kp log(kp)) steps. In step 1 we have to sort the O(kp)
refined matches. Determining the runs of joinable matches can be done in
linear time. There can be at most O(kp/m) such runs, where each run gives rise
to O(m²) new nodes, for a total of O(kpm) new nodes. In step 3 we have to sort
the nodes in time O(kpm log(kpm)). Finally, we insert edges between adjacent
nodes in step 4. Each node in a joinable run of length m is adjacent to O(m)
nodes, so in the worst case we have to insert O(kpm²) edges. Hence the edge
insertion has a dominating run time of O(kpm² log(kpm)).
5.3 The Trace Problem
If the sequences under consideration are small enough to assume a minor impact
of translocations and duplications, then the refined set of segments can be
used to compute a maximal weight trace.
The original formulation of the trace problem was introduced by Kececioglu
[7]. Two characters of distinct strings in S are said to be aligned if they
are placed into the same column of the alignment array. One can view the
character positions of the k input strings in S as the vertex set V of a
k-partite graph G = (V, E) called the input alignment graph. The edge set E
represents pairs of characters that one would like to have aligned in an
alignment of the input strings. We say that an edge is realized by an
alignment if the endpoints of the edge are placed into the same column of the
alignment array. The subset of E realized by an alignment is called the trace
of S, denoted by trace(S).
The original formulation of [7] assumes single character matches; Reinert
et al. [8] generalized it to allow for block matches. However, they used a
refinement into single character matches. This immediately affects the time
complexity of their separation method in the branch-and-cut algorithm
described in [8]. Similar to the transformation distance, this formulation can
also immediately benefit from a minimum fully refined set of matches, which
reduces the size of the input alignment graph.
If we apply a suitable scoring scheme to the refined matches (e.g.
[1,11,16]), we can construct the alignment graph analogously to the match
graph in Section 5.1 (note that it can easily be extended to the comparison of
multiple sequences). Then the following holds:
Theorem 4. Let M be a set of n matches between sequences A = a_1 a_2 ... a_p
and B = b_1 b_2 ... b_q, with p ≥ q. The Trace Problem can be solved in at
most O(p log² n + kp log(kp)) steps.
Proof. As mentioned above, we can compute the minimum resolved refinement of
M in O(p log² n + kp log(kp)) steps, resulting in a minimum resolved
refinement of size O(kp). As noted in the proof of Lemma 5, the number of
distinct nodes in the alignment graph is O(kp). The trace problem for two
sequences can be reduced to computing the heaviest common subsequence, which
is possible in time O(kp log(kp)) [6]. This time is dominated by the
construction of the refinement.
6 Illustration
To illustrate the strengths and weaknesses of the match refinement method
described above, we use it to evaluate an alternative, greedy approach to
finding a one-to-one matching of two large sequence sets.
[Figure 4 panels: input matches; matching; greedy algorithm; transformation
distance.]
Fig. 4. Illustration of the method on two Drosophila assemblies.
The greedy technique was developed earlier to evaluate progress in assembling
the human genome (Halpern, Huson, Delcher, Sutton, and Myers, unpublished),
for which we needed a pipeline that is extremely fast, reasonable in memory
usage, and capable of finding matches to most repeated elements while also
choosing the intuitively correct match between repeats. Such a pipeline might
be useful for assessing a modification of an assembly algorithm, for
determining the relative degrees of completion of two separate genome
sequencing projects [12], or for identifying corresponding pieces between
incremental assemblies for the purpose of annotation transfer.
In a nutshell, the method conducts a suffix-tree-based all-against-all
search of two assemblies [5], and then greedily determines a set of one-to-one
matches. The greedy phase first constructs a priority queue of all resulting
segment matches ordered by decreasing length and initializes the set of
selected (one-to-one) matches to the empty set. Then, the match at the head of
the priority queue is popped and compared against the set of selected matches.
If it does not overlap any selected match, then it is added to the selected
set; otherwise, the match is trimmed to eliminate all such overlaps and the
shortened match is reinserted into the queue. This is repeated until the match
at the head of the priority queue is shorter than some cutoff.
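A sketch of this greedy loop for gapless matches, built on heapq (names ours);
the trimming helper here keeps the longest sub-run of the match that is free
of overlaps on both axes, which is one reasonable reading of the trimming
described above:

import heapq

def longest_free_run(a, b, length, selected):
    # Offsets of the match at which neither its A- nor its B-position is
    # covered by an already selected (a_start, b_start, length) triple.
    def covered(pos, axis):
        return any(s[axis] <= pos < s[axis] + s[2] for s in selected)
    best_start, best_len, cur_start, cur_len = None, 0, None, 0
    for off in range(length):
        if covered(a + off, 0) or covered(b + off, 1):
            cur_start, cur_len = None, 0
        else:
            if cur_start is None:
                cur_start = off
            cur_len += 1
            if cur_len > best_len:
                best_start, best_len = cur_start, cur_len
    if best_start is None:
        return None
    return (a + best_start, b + best_start, best_len)

def greedy_one_to_one(matches, cutoff):
    # Max-heap of (a_start, b_start, length) matches, ordered by length.
    heap = [(-n, a, b, n) for a, b, n in matches if n >= cutoff]
    heapq.heapify(heap)
    selected = []
    while heap:
        _, a, b, n = heapq.heappop(heap)
        t = longest_free_run(a, b, n, selected)
        if t is None:
            continue
        if t[2] == n:
            selected.append(t)                    # untouched: accept
        elif t[2] >= cutoff:
            heapq.heappush(heap, (-t[2],) + t)    # retry shortened match
    return selected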
Although the results of the greedy method appeared quite good on informal
inspection, the question arose just how close to maximal the obtained results
were. The method described in this paper permitted us to answer this question
definitively for a given dataset.
For illustration here, we computed an initial match set of gapless alignments
between two (partial) versions of the Drosophila genome. The first was
obtained from the BDGP (http://www.fruitfly.org/sequence/dlMfasta.shtml),
corresponding to 50,107,720 bp of two finished chromosome arms (2L and 3R).
The second was an unpublished assembly of whole genome shotgun sequencing data
conducted at Celera in late 2001. The suffix-tree method yielded 292,769
alignments seeded with a perfect match of at least 50 bp. These matches
covered 49,953,627 bp of the finished sequence (i.e., they covered all but
154,093 bp). Precisely 43,740,070 bp were covered by a single match, but the
remainder was covered two or more times; since the total length of all matches
was 125,863,932 bp, a relatively small fraction of the sequence is clearly
involved in a large number of matches.
The refinement method described in this paper transformed the match set
into 17,167,891 distinct matches. This took a total of 880 seconds on a Unix
workstation (667 MHz Alpha processor) and shows that even for a large dataset
the refinement is computed quickly for a realistic set of input matches. Thus,
the initial match set, with mean length 429 bp, was cut up into much smaller
pieces (mean length 7 bp) by the refinement. However, since most of the
original sequence was covered by a single match, there should still be matches
of considerable size; indeed, the maximum length of a refined match was
466,877 bp, and of the finished sequence covered by matches, half is covered
by a refined match of length at least 97,008 bp (this is known as the N50
value). Again, the complexity in the match set is restricted to the repetitive
regions of the sequence.
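The N50 value used here is the standard one; for reference, a sketch (helper
name ours):

def n50(lengths):
    # Smallest length L such that pieces of length >= L cover at least
    # half of the total covered sequence.
    total = sum(lengths)
    running = 0
    for L in sorted(lengths, reverse=True):
        running += L
        if 2 * running >= total:
            return L
    return 0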
An optimal matching was determined on the refined match set, as described
above. The resulting set of matches involved 262,885 refined matches and
covered 49,875,599 bp, leaving 232,121 bp of the finished sequence uncovered.
Due to the refinement process, groups of selected matches may correspond to
(parts of) original matches, so it may be of greater interest to consider runs
of adjacent, consistent refined matches as a single matching block; 84,211
such blocks were present in the optimal matching, with a mean length of 592 bp
and an N50 value of 97,164 bp.
Given the availability of the optimal matching, why would one ever choose
the greedy method? The optimal matching makes no effort to favor long runs of
consistent matches (i.e. large blocks), and may choose more or less at random
among a set of refined matches covering a given copy of a repeated element. In
contrast, the greedy method attempts to maximize the lengths of the individual
matches, rather than the sum of their lengths, as can be seen by comparing
the number of selected matches