\[
\sum_{p=1}^{P} w_p x_p - b .
\]
If w_p = 0, feature p is irrelevant with regards to deciding the class. Since only non-zero elements w_p ≠ 0 influence the value of z, they can be regarded as relevant features.
The task of defining a small number of relevant features can be equated with that of finding a small set of non-zero elements. This can be formulated as an optimization problem, namely that of minimizing the ℓ_0 norm ||w||_0, where ||w||_0 = |{p : w_p ≠ 0}|, the number of non-zero elements of w. Thus we obtain:
\[
\min_{w,b} \; \|w\|_0 \quad \text{subject to} \quad y_n (w^T x_n - b) \ge 0, \qquad n = 1, \dots, N . \tag{2}
\]
Unfortunately, problem (2) is NP-hard [10]. A tractable, convex approximation to this problem can be obtained by replacing the ℓ_0 norm with the ℓ_1 norm ||w||_1, where ||w||_1 = Σ_{p=1}^{P} |w_p|, the sum of the absolute magnitudes of the elements of a vector [10]:
\[
\min_{w,b} \; \|w\|_1 = \sum_{p=1}^{P} |w_p| \quad \text{subject to} \quad y_n (w^T x_n - b) \ge 0, \qquad n = 1, \dots, N . \tag{3}
\]
A solution to (3) yields the desired sparse weight vector w.
Optimization problem (3) can be solved via linear programming [11]. The ensuing formulation requires the imposition of constraints on the allowed ranges of variables. The introduction of new variables u_p, v_p (the components of u, v ∈ ℝ^P) such that |w_p| = u_p + v_p and w_p = u_p − v_p ensures non-negativity. The range of w_p = u_p − v_p is unconstrained (positive or negative) whilst u_p and v_p remain non-negative; u_p and v_p are designated the positive and negative parts respectively. Similarly, the bias b is split into positive and negative components, b = b⁺ − b⁻. Given a solution to problem (3), either u_p or v_p will be non-zero for feature p [11]:
\[
\begin{aligned}
\min_{u,v,b^+,b^-} \quad & \sum_{p=1}^{P} (u_p + v_p) \\
\text{subject to} \quad & y_n\bigl((u-v)^T x_n - (b^+ - b^-)\bigr) \ge 1 \\
& u_p \ge 0; \quad v_p \ge 0; \quad b^+ \ge 0; \quad b^- \ge 0 \\
& n = 1, \dots, N; \quad p = 1, \dots, P .
\end{aligned}
\tag{4}
\]
A detailed description of the origins of the "≥ 1" constraint can be found elsewhere [30].
If the data T are not linearly separable, misclassifications (errors in the class labels y_n) can be accounted for by the introduction of slack variables ξ_n. Problem (4) can be recast, yielding the final optimization problem,
\[
\begin{aligned}
\min_{u,v,b^+,b^-} \quad & \sum_{p=1}^{P} (u_p + v_p) + C \sum_{n=1}^{N} \xi_n \\
\text{subject to} \quad & y_n\bigl((u-v)^T x_n - (b^+ - b^-)\bigr) \ge 1 - \xi_n \\
& u_p \ge 0; \quad v_p \ge 0; \quad b^+ \ge 0; \quad b^- \ge 0; \quad \xi_n \ge 0 \\
& n = 1, \dots, N; \quad p = 1, \dots, P .
\end{aligned}
\tag{5}
\]
C is an adjustable parameter weighing the contribution of misclassified data points. Larger values lead to fewer misclassifications being ignored: C = 0 corresponds to all outliers being ignored, whereas C → ∞ leads to the hard margin limit.
3 Computational, Software and Practical Issues
Learning the sparse classifier defined by optimization problem (5) involves minimizing a linear function subject to linear constraints. Efficient algorithms for solving such linear programming problems involving 10,000 variables (N) and 10,000 constraints (P) are well-known. Standalone open source codes include lp_solve (http://www.netlib.org/ampl/solvers/lpsolve/) and PCx (http://www-fp.mcs.anl.gov/otc/Tools/PCx/).
Nominal Liknon is an implementation of the sparse classifier (5). It incorporates routines written in Matlab (http://www.mathworks.com) and a system utilizing perl (http://www.perl.org/) and lp_solve. The code is available from the authors upon request. The input consists of a file containing an N × (P + 1) data matrix in which each row represents a single profiling experiment. The first P columns are the feature values, abundances of molecular species, whilst column P + 1 is the class label y_n ∈ {+1, −1}. The output comprises the non-zero values of the weight vector w (relevant features), the bias b, and the number of non-zero slack variables ξ_n.
The adjustable parameter C in problem (5) can be set using cross validation
techniques. The results described here were obtained by choosing C = 0.5 or
C = 1.
4 Application of Nominal Liknon to Real World Data
Nominal Liknon was applied to five data sets in the size range (N = 19, P = 1,987) to (N = 200, P = 15,154). A data set T yielded a sparse classifier, w and b, and a specification of the l relevant features (l ≪ P). Since the profiling studies produced only a small number of data points (N ≪ P), the generalization error of a nominal Liknon classifier was determined by computing the leave-one-out error for l-dimensional data points. A classifier trained using N − 1 data points was used to predict the class of the withheld data point; the procedure was repeated N times. The results are shown in Table 1.
Nominal Liknon performs well in terms of simultaneous relevant feature identification and classification. The results are summarized in Table 1.

Table 1. Summary of published and unpublished investigations using nominal Liknon [4,16].
Transcript profiles    Sporadic breast carcinoma tissue samples [29];
                       inkjet microarrays; relative transcript levels;
                       http://www.rii.com/publications/vantveer.htm
Two-class data         46 patients with distant metastases < 5 years;
                       51 patients with no distant metastases ≥ 5 years
Relevant features      72 out of P = 5,192
Leave-one-out error    1 out of N = 97

Transcript profiles    Tumor tissue samples [1];
                       custom cDNA microarrays; relative transcript levels;
                       http://www.nhgri.nih.gov/DIR/Microarray/selected_publications.html
Two-class data         13 KIT-mutation positive gastrointestinal stromal tumors;
                       6 spindle cell tumors from locations outside the gastrointestinal tract
Relevant features      6 out of P = 1,987
Leave-one-out error    0 out of N = 19

Transcript profiles    Small round blue cell tumor samples (EWS, RMS, NHL, NB) [19];
                       custom cDNA microarrays; relative transcript levels;
                       http://www.nhgri.nih.gov/DIR/Microarray/Supplement
Two-class data         46 EWS/RMS tumor biopsies;
                       38 EWS/RMS/NHL/NB cell lines
Relevant features      23 out of P = 2,308
Leave-one-out error    0 out of N = 84

Transcript profiles    Prostate tissue samples [31];
                       Affymetrix arrays; absolute transcript levels;
                       http://carrier.gnf.org/welsh/prostate
Two-class data         9 normal;
                       25 malignant
Relevant features      7 out of P = 12,626
Leave-one-out error    0 out of N = 34

Protein profiles       Serum samples [24];
                       SELDI-TOF mass spectrometry; M/Z values (spectral amplitudes);
                       http://clinicalproteomics.steem.com
Two-class data         100 unaffected;
                       100 ovarian cancer
Relevant features      51 out of P = 15,154
Leave-one-out error    3 out of N = 200
In all five transcript and protein profiling data sets a hyperplane was found, the weight vector was sparse (< 100, or < 2%, non-zero components) and the relevant features were of interest to domain experts (they generated novel biological hypotheses amenable to subsequent experimental or clinical validation). For the protein profiles, better results were obtained using normalized as opposed to raw values: when employed to predict the class of 16 independent non-cancer samples, the 51 relevant features had a test error of 0 out of 16.
On a powerful desktop computer (a > 1 GHz Intel-like machine), the time required to create a sparse classifier varied from 2 seconds to 20 minutes. For the larger problems, the main memory (RAM) requirement exceeded 500 MBytes.
5 Heuristic Solutions to Problems Posed
by Domain Experts
Domain experts wish to postprocess nominal Liknon results to assist in the
design of subsequent experiments aimed at validating, verifying and extending
any biological predictions. In lieu of a theoretically sound statistical framework,
heuristics have been developed to prioritize, reduce or increase the number of
relevant features.
In order to prioritize features, assume that all P features are on the same scale. The l relevant features can be ranked according to the magnitude and/or sign of the non-zero elements of the weight vector w (w_p ≠ 0). To reduce the number of relevant features to a smaller, most interesting set, a histogram of the w_p ≠ 0 values can be used to determine a threshold for pruning the set. In order to increase the number of features to a larger, more interesting set, nominal Liknon can be run in an iterative manner: the l relevant features identified in one pass through the data are removed from the data points used as input for the next pass. Each successive round generates a new set of relevant features. The procedure is terminated either by the domain expert or by monitoring the leave-one-out error of the classifier associated with each set of relevant features. A sketch of this iterative loop appears below.
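The iterative procedure can be sketched as follows, again using the hypothetical liknon_fit from above; the 1e-6 threshold for "non-zero" is an illustrative choice.

```python
import numpy as np

def iterative_liknon(X, y, n_iter=5, C=1.0, tol=1e-6):
    """Repeatedly fit, record relevant features, and remove them from X."""
    active = np.arange(X.shape[1])          # indices of remaining features
    feature_sets = []
    for _ in range(n_iter):
        w, b = liknon_fit(X[:, active], y, C=C)
        relevant = active[np.abs(w) > tol]  # this pass's relevant features
        feature_sets.append(relevant)
        active = np.setdiff1d(active, relevant)
    return feature_sets
```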
Preliminary results from analysis of the gastrointestinal stromal tumor/spindle cell tumor transcript profiling data set indicate that these extensions are likely to be of utility to domain experts. The leave-one-out error of the relevant features identified by five iterations of nominal Liknon was at most one. The details are: iteration 0 (number of relevant features = 6, leave-one-out error = 0), iteration 1 (5, 0), iteration 2 (5, 1), iteration 3 (9, 0), iteration 4 (13, 1), iteration 5 (11, 1).
Iterative Liknon may prove useful during explorations of the (qualitative) association between relevant features and their behavior in the N data points. The gastrointestinal stromal tumor/spindle cell tumor transcript profiling data set has been the subject of probabilistic clustering [16]. A finite Gaussian mixture model, as implemented by the program AutoClass [6], was estimated from the P = 1,987 unlabeled, N = 19-dimensional data points. The trained model was used to assign each feature (gene) to one of the resultant clusters. Five iterations of nominal Liknon identified the majority of genes assigned to a small number of discriminative clusters. Furthermore, these genes constituted most of the important distinguishing genes defined by the original authors [1].
6 Discussion
Nominal Liknon implements a mathematical technique for finding a sparse hyperplane. When applied to two-class high-dimensional real-world molecular profiling data, it identifies a small number of relevant features and creates a classifier that generalizes well. As discussed elsewhere [4,7], many subsets of relevant features are likely to exist. Although nominal Liknon specifies but one set of discriminatory features, this "low-hanging fruit" approach does suggest genes of interest to experimentalists. Iterating the procedure provides a rapid mechanism for highlighting additional sets of relevant features that yield good classifiers. Since nominal Liknon is a single-pass method, one disadvantage is that the learned parameters cannot be adjusted (improved) as would be possible with a more typical train/test methodology.
7 Future Directions
Computational biology and chemistry are generating high-dimensional data, so sparse solutions for classification and regression problems are of widespread importance. A general purpose toolbox containing specific implementations of particular statistical techniques would be of considerable practical utility. Future plans include developing a suite of software modules to aid in performing tasks such as the following.
A. Create high-dimensional input data. (i) Direct generation by high-throughput experimental technologies. (ii) Systematic formulation and extraction of large numbers of features from data that may be in the form of strings, images, and so on (a priori, features relevant for one problem may be irrelevant for another).
B. Enunciate sparse solutions for classification and regression problems in high dimensions.
C. Construct and assess models. (i) Learn a variety of models by a grid search through the space of adjustable parameters. (ii) Evaluate the generalization error of each model.
D. Combine best models to create a final decision function.
E. Propose hypotheses for domain experts.
Acknowledgements
This work was supported by NSF grant IIS-9988642, the Director, Office of Energy Research, Office of Health and Environmental Research, Division of the U.S. Department of Energy under Contract No. DE-AC03-76F00098, and an LBNL/LDRD through U.S. Department of Energy Contract No. DE-AC03-76SF00098.
References
1. S.V. Allander, N.N. Nupponen, M. Ringner, G. Hostetter, G.W. Maher, N. Goldberger, Y. Chen, J. Carpten, A.G. Elkahloun, and P.S. Meltzer. Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Research, 61:8624–8628, 2001.
2. K. Bennett and A. Demiriz. Semi-supervised support vector machines. In Neural and Information Processing Systems, volume 11. MIT Press, Cambridge, MA, 1999.
3. A. Bhattacharjee, W.G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E.J. Mark, E.S. Lander, W. Wong, B.E. Johnson, T.R. Golub, D.J. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci., 98:13790–13795, 2001.
4. C. Bhattacharyya, L.R. Grate, A. Rizki, D.C. Radisky, F.J. Molina, M.I. Jordan, M.J. Bissell, and I.S. Mian. Simultaneous relevant feature identification and classification in high-dimensional spaces: application to molecular profiling data. Submitted, Signal Processing, 2002.
5. M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, Jr, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci., 97:262–267, 2000.
6. P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI Press/MIT Press, 1995. The software is available at the URL http://www.gnu.org/directory/autoclass.html.
7. M.L. Chow, E.J. Moler, and I.S. Mian. Identifying marker genes in transcription profile data using a mixture of feature relevance experts. Physiological Genomics, 5:99–111, 2001.
8. N. Cristianini and J. Shawe-Taylor. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, England, 2000.
9. S.M. Dhanasekaran, T.R. Barrette, R. Ghosh, D. Shah, S. Varambally, K. Kurachi, K.J. Pienta, M.J. Rubin, and A.M. Chinnaiyan. Delineation of prognostic biomarkers in prostate cancer. Nature, 432, 2001.
10. D.L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. Technical Report, Statistics Department, Stanford University, 1999.
11. R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2000.
12. T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–914, 2000.
13. M.E. Garber, O.G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. Pacyna-Gengelbach, M. van de Rijn, G.D. Rosen, C.M. Perou, R.I. Whyte, R.B. Altman, P.O. Brown, D. Botstein, and I. Petersen. Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl. Acad. Sci., 98:13784–13789, 2001.
14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999. The data are available at the URL waldo.wi.mit.edu/MPR/data_sets.html.
15. T. Graepel, R. Herbrich, B. Schölkopf, A.J. Smola, P. Bartlett, K. Müller, K. Obermayer, and R.C. Williamson. Classification on proximity data with lp-machines. In Ninth International Conference on Artificial Neural Networks, volume 470, pages 304–309. IEE, London, 1999.
16. L.R. Grate, C. Bhattacharyya, M.I. Jordan, and I.S. Mian. Integrated analysis of transcript profiling and protein sequence data. In press, Mechanisms of Ageing and Development, 2002.
17. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2000.
18. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O.-P. Kallioniemi, A. Borg, and J. Trent. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344:539–548, 2001.
19. J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and P.S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7:673–679, 2001.
20. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M.I. Jordan. Minimax probability machine. Advances in Neural Information Processing Systems, 14, 2001.
21. L.A. Liotta, E.C. Kohn, and E.F. Petricoin. Clinical proteomics: personalized molecular medicine. JAMA, 14:2211–2214, 2001.
22. E.J. Moler, M.L. Chow, and I.S. Mian. Analysis of molecular profile data using generative and discriminative methods. Physiological Genomics, 4:109–126, 2000.
23. D.A. Notterman, U. Alon, A.J. Sierk, and A.J. Levine. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research, 61:3124–3130, 2001.
24. E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, and L.A. Liotta. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359:572–577, 2002.
25. S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98:15149–15154, 2001. The data are available from http://www-genome.wi.mit.edu/mpr/GCM.html.
26. A. Smola, T.T. Friess, and B. Schölkopf. Semiparametric support vector and linear programming machines. In Neural and Information Processing Systems, volume 11. MIT Press, Cambridge, MA, 1999.
27. T. Sorlie, C.M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, T. Thorsen, H. Quist, J.C. Matese, P.O. Brown, D. Botstein, P.E. Lonning, and A.-L. Borresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci., 98:10869–10874, 2001.
28. A.I. Su, J.B. Welsh, L.M. Sapinoso, S.G. Kern, P. Dimitrov, H. Lapp, P.G. Schultz, S.M. Powell, C.A. Moskaluk, H.F. Frierson Jr, and G.M. Hampton. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61:7388–7393, 2001.
29. L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
30. V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
31. J.B. Welsh, L.M. Sapinoso, A.I. Su, S.G. Kern, J. Wang-Rodriguez, C.A. Moskaluk, J.F. Frierson Jr, and G.M. Hampton. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research, 61:5974–5978, 2001.
32. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems, volume 13, 2000.
Pooled Genomic Indexing (PGI): Mathematical Analysis and Experiment Design

Miklós Csürös^{1,2,3} and Aleksandar Milosavljevic^{2,3}

1 Département d'informatique et de recherche opérationnelle, Université de Montréal, CP 6128 succ. Centre-Ville, Montréal, Québec H3C 3J7, Canada. [email protected]
2 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine
3 Bioinformatics Research Laboratory, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. [email protected]
Abstract. Pooled Genomic Indexing (PGI) is a novel method for physical mapping of clones onto known macromolecular sequences. PGI is carried out by pooling arrayed clones, generating shotgun sequence reads from pools and by comparing the reads against a reference sequence. If two reads from two different pools match the reference sequence at a close distance, they are both assigned (deconvoluted) to the clone at the intersection of the two pools and the clone is mapped onto the region of the reference sequence between the two matches. A probabilistic model for PGI is developed, and several pooling schemes are designed and analyzed. The probabilistic model and the pooling schemes are validated in simulated experiments where 625 rat BAC clones and 207 mouse BAC clones are mapped onto homologous human sequence.
1 Introduction
Pooled Genomic Indexing (PGI) is a novel method for physical mapping of clones onto known macromolecular sequences. PGI enables targeted comparative sequencing of homologous regions for the purpose of discovery of genes, gene structure, and conserved regulatory regions through comparative sequence analysis. An application of the basic PGI method to BAC (Bacterial Artificial Chromosome) clone mapping is illustrated in Figure 1. PGI first pools arrayed BAC clones, then shotgun sequences the pools at an appropriate coverage, and uses this information to map individual BACs onto homologous sequences of a related organism. Specifically, shotgun reads from the pools provide a set of short (ca. 500 base pair long) random subsequences of the unknown clone sequences (100–200 thousand base pairs long). The reads are then individually compared to reference sequences, using standard sequence alignment techniques [1] to find homologies. In a clone-by-clone sequencing strategy [2], the shotgun reads are collected for each clone
separately. Because of the pooling in PGI, the individual shotgun reads are not associated with individual clones, but detected homologies may be, in certain cases. If two reads from two different pools match the reference sequence at a close distance, they are both assigned (deconvoluted) to the clone at the intersection of the two pools. Simultaneously, the clone is mapped onto the region of the reference sequence between the two matches. Subsequently, known genomic or transcribed reference sequences are turned into an index into the yet-to-be-sequenced homologous clones across species. As we will see below, this basic pooling scheme is somewhat modified in practice in order to achieve correct and unambiguous mapping.
PGI constructs comparative BAC-based physical maps at a fraction (on the order of 1%) of the cost of full genome sequencing. PGI requires only minor changes in the BAC-based sequencing pipeline already established in sequencing laboratories, and thus it takes full advantage of existing economies of scale. The key to the economy of PGI is BAC pooling, which reduces the amount of BAC and shotgun library preparations down to the order of the square root of the number of BAC clones. The depth of shotgun sequencing of the pools is adjusted to fit the evolutionary distance of comparatively mapped organisms. Shotgun sequencing, which represents the bulk of the effort involved in a PGI project, provides useful information irrespective of the pooling scheme. In other words, pooling by itself does not represent a significant overhead, and yet produces a comprehensive and accurate comparative physical map.
Our proposal of PGI is motivated by recent advances in sequencing technology [3] that allow shotgun sequencing of BAC pools. The Clone-Array Pooled Shotgun Sequencing (CAPSS) method, described by [3], relies on clone-array pooling and shotgun sequencing of the pools. CAPSS detects overlaps between shotgun sequence reads and assembles the overlapping reads into sequence contigs. PGI offers a different use for the shotgun read information obtained in the same laboratory process: PGI compares the reads against another sequence, typically the genomic sequence of a related species, for the purpose of comparative physical mapping. CAPSS does not use a reference sequence to deconvolute the pools. Instead, CAPSS deconvolutes by detecting overlaps between reads: a column-pool read and a row-pool read that significantly overlap are deconvoluted to the BAC at the intersection of the row and the column. Despite the clear distinction between PGI and CAPSS, the methods are compatible and, in fact, can be used simultaneously on the same data set. Moreover, the advanced pooling schemes that we present here in the context of PGI are also applicable to, and increase the performance of, CAPSS, indicating that improvements of one method are potentially applicable to the other.
In what follows, we propose a probabilistic model for the PGI method. We then discuss and analyze different pooling schemes, and propose algorithms for experiment design. Finally, we validate the method in two simulated PGI experiments, involving 207 mouse and 625 rat BACs.
Fig. 1. The Pooled Genomic Indexing method maps arrayed clones of one species onto genomic sequence of another (5). Rows (1) and columns (2) are pooled and shotgun sequenced. If one row and one column fragment match the reference sequence (3 and 4 respectively) within a short distance (6), the two fragments are assigned (deconvoluted) to the clone at the intersection of the row and the column. The clone is simultaneously mapped onto the region between the matches, and the reference sequence is said to index the clone.
2 Probability of Successful Indexing
In order to study the efficiency of the PGI strategy formally, define the following values. Let N be the total number of clones on the array, and let m be the number of clones within a pool. For simplicity's sake, assume that every pool has the same number of clones, that clones within a pool are represented uniformly, and that every clone has the same length L. Let F be the total number of random shotgun reads, and let ℓ be the expected length of a read. The shotgun coverage c is defined by c = Fℓ/(NL).

Since reads are randomly distributed along the clones, it is not certain that homologies between reference sequences and clones are detected. However, with larger shotgun coverage, this probability increases rapidly. Consider the particular case of detecting homology between a given reference sequence and a clone. A random fragment of length ℓ from this clone is aligned locally to the reference sequence and if a significant alignment is found, the homology is detected. Such an alignment is called a hit. Let M(ℓ) be the number of positions at which a fragment of length ℓ can begin and produce a significant alignment. The probability of a hit for a fixed length ℓ equals M(ℓ) divided by the total number of possible start positions for the fragment, (L − ℓ + 1).

When L ≫ ℓ, the expected probability of a random read aligning to the reference sequence equals
\[
p_{\mathrm{hit}} = \mathbb{E}\,\frac{M(\ell)}{L - \ell + 1} \approx \frac{M}{L}, \tag{1}
\]
where M is a shorthand notation for EM(ℓ). The value M measures the homology between the clone and the reference sequence. We call this value the effective length of the (possibly undetected) index between the clone and the reference sequence in question. For typical definitions of significant alignment, such as an identical region of a certain minimal length, M(ℓ) is a linear function of ℓ, and thus EM(ℓ) = M(Eℓ). Example 1: let the reference sequence be a subsequence of length h of the clone, and define a hit as identity of at least o base pairs. Then M = h + ℓ − 2o. Example 2: let the reference sequence be the transcribed sequence of total length g of a gene on the genomic clone, consisting of e exons, separated by long (≫ ℓ) introns, and define a hit as identity of at least o base pairs. Then M = g + e(ℓ − 2o).

Assuming uniform coverage and an m × m array, the number of reads coming from a fixed pool equals cmL/(2ℓ). If there is an index between a reference sequence and a clone in the pool, then the number of hits in the pool is distributed binomially with expected value (cmL/(2ℓ)) · (p_hit/m) = cM/(2ℓ). Propositions 1, 2, and 3 rely on the properties of this distribution, using approximation techniques pioneered by [4] in the context of physical mapping.
Proposition 1. Consider an index with effective length M between a clone and a reference sequence. The probability that the index is detected equals approximately
\[
p_M \approx \Bigl(1 - e^{-\frac{cM}{2\ell}}\Bigr)^{2} .
\]
(Proof in Appendix.)

Thus, by Proposition 1, the probability of false negatives decreases exponentially with the shotgun coverage level. The expected number of hits for a detected index can be calculated similarly.

Proposition 2. The number of hits for a detected index of effective length M equals approximately
\[
E_M \approx \frac{cM}{\ell}\Bigl(1 - e^{-\frac{cM}{2\ell}}\Bigr)^{-1} .
\]
(Proof in Appendix.)
3 Pooling Designs
3.1 Ambiguous Indexes
The success of indexing in the PGI method depends on the possibility of deconvoluting the local alignments. In the simplest case, homology between a clone and a reference sequence is recognized by finding alignments with fragments from one row and one column pool. It may happen, however, that more than one clone is homologous to the same region in a reference sequence (and therefore to each other). This is the case if the clones overlap, contain similar genes, or contain similar repeat sequences. Subsequently, close alignments may be found between a reference sequence and fragments from more than two pools. If, for example, two rows and one column align to the same reference sequence, an index can be created to the two clones at the intersections simultaneously. However, alignments from two rows and two columns cannot be deconvoluted conclusively, as illustrated by Figure 2.

Fig. 2. Ambiguity caused by overlap or homology between clones. If clones B_{11}, B_{12}, B_{21}, and B_{22} are at the intersections of rows R_1, R_2 and columns C_1, C_2 as shown, then alignments from the pools for R_1, R_2, C_1, C_2 may originate from homologies in B_{11} and B_{22}, or B_{12} and B_{21}, or even B_{11}, B_{12}, B_{21}, etc.
A simple clone layout on an array cannot remedy this problem, thus calling for more sophisticated pooling designs. The problem can be alleviated, for instance, by arranging the clones on more than one array, thereby reducing the chance of assigning overlapping clones to the same array. We propose other alternatives. One of them is based on construction of extremal graphs, while another uses reshuffling of the clones.

In addition to the problem of deconvolution, multiple homologies may also lead to incorrect indexing at low coverage levels. Referring again to the example of Figure 2, assume that the clones B_{11} and B_{22} contain a particular homology. If the shotgun coverage level is low, it may happen that the only homologies found are from row R_1 and column C_2. In that case, the clone B_{12} gets indexed erroneously. Such indexes are false positives. The probability of false positive indexing decreases rapidly with the coverage level, as shown by the following result.
Proposition 3. Consider an index between a reference sequence and two clones with the same effective length M. If the two clones are not in the same row or same column, the probability of false positive indexing equals approximately
\[
p_{\mathrm{FP}} \approx 2 e^{-\frac{cM}{\ell}} \Bigl(1 - e^{-\frac{cM}{2\ell}}\Bigr)^{2} . \tag{2}
\]
(Proof in Appendix.)
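As a numeric illustration of Propositions 1-3 (a sketch, with the paper's example values M = 5000 and ℓ = 500 plugged in):

```python
import numpy as np

def index_probabilities(M, c, ell=500.0):
    """Detection probability, expected hits, and false-positive probability
    for an index of effective length M (Propositions 1, 2, and 3)."""
    q = np.exp(-c * M / (2 * ell))      # P(no hit in one pool)
    p_detect = (1 - q) ** 2             # Proposition 1
    e_hits = (c * M / ell) / (1 - q)    # Proposition 2
    p_fp = 2 * q**2 * (1 - q) ** 2      # Proposition 3: 2 e^{-cM/l}(1-q)^2
    return p_detect, e_hits, p_fp

for cov in (0.5, 1.0, 2.0):
    print(cov, index_probabilities(M=5000, c=cov))
```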
3.2 Sparse Arrays
The array layout of clones can be represented by an edge-labeled graph G in the following manner. Edges in G are bijectively labeled with the clones. We call such graphs clone-labeled. The pooling is defined by incidence, so that each vertex corresponds to a pool, and the incident edges define the clones in that pool. If G is bipartite, then it represents arrayed pooling, with rows and columns corresponding to the two sets of vertices, and cells corresponding to edges. Notice that every clone-labeled graph defines a pooling design, even if it is not bipartite. For instance, a clone-labeled full graph with N = K(K − 1)/2 edges defines a pooling that minimizes the number K of pools, and thus the number of shotgun libraries, at the expense of increasing ambiguities.

Ambiguities originating from the existence of homologies and overlaps between exactly two clones correspond to cycles of length four in G. If G is bipartite, and it contains no cycles of length four, then deconvolution is always possible for two clones. Such a graph represents an array in which cells are left empty systematically, so that for all choices of two rows and two columns, at most three out of the four cells at the intersections have clones assigned to them. An array with that property is called a sparse array. Figure 3 shows a sparse array. A sparse array is represented by a bipartite graph G that contains no cycle of length four; a classical bound from extremal graph theory [5] then implies K = Ω(N^{2/3}).

Fig. 3. Sparse array used in conjunction with the mouse experiments. This 39 × 39 array contains 207 clones, placed at the darkened cells. The array was obtained by randomly adding clones while preserving the sparse property, i.e., the property that for all choices of two rows and two columns, at most three out of the four cells at the intersections have clones assigned to them.
Let m be a prime power. We design an m² × m² sparse array for placing N = m³ clones, achieving the K = Θ(N^{2/3}) density, by using an idea of Reiman [6]. The number m is the size of a pool, i.e., the number of clones in a row or a column. Number the rows as R_{a,b} with a, b ∈ {0, 1, ..., m − 1}. Similarly, number the columns as C_{x,y} with x, y ∈ {0, 1, ..., m − 1}. Place a clone in each cell (R_{a,b}, C_{x,y}) for which ax + b = y, where the arithmetic is carried out over the finite field F_m. This design results in a sparse array by the following reasoning. Considering the affine plane of order m, rows correspond to lines, and columns correspond to points. A cell contains a clone only if the column's point lies on the row's line. Since there are no two distinct lines going through the same two points, for all choices of two rows and two columns, at least one of the cells at the intersections is empty. A small construction in this spirit is sketched below.
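A minimal sketch of this construction for prime m (arithmetic mod m stands in for F_m; prime powers would require full finite-field arithmetic):

```python
import numpy as np
from itertools import combinations

def sparse_array(m):
    """Sparse m^2 x m^2 clone array via the affine-plane construction.

    Rows are indexed by lines y = a*x + b over Z_m (m prime), columns by
    points (x, y); a clone sits at (row (a,b), column (x,y)) iff the point
    lies on the line. Any two rows then share at most one occupied column.
    """
    A = np.zeros((m * m, m * m), dtype=int)
    for a in range(m):
        for b in range(m):
            for x in range(m):
                y = (a * x + b) % m
                A[a * m + b, x * m + y] = 1
    return A

A = sparse_array(3)
# Check the sparse property: no two rows share two occupied columns.
for r1, r2 in combinations(range(A.shape[0]), 2):
    assert np.dot(A[r1], A[r2]) <= 1
print(A.sum(), "clones in", A.shape[0] + A.shape[1], "pools")  # 27 in 18
```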
3.3 Double Shuffling

An alternative to using sparse arrays is to repeat the pooling with the clones reshuffled on an array of the same size. Taking the example of Figure 2, it is unlikely that the clones B_{11}, B_{12}, B_{21}, and B_{22} end up again in the same configuration after the shuffling. Deconvolution is possible if, after the shuffling, clone B_{11} is in cell (R'_1, C'_1) and B_{22} is in cell (R'_2, C'_2), and alignments with fragments from the pools for R_i, C_i, R'_i, and C'_i are found, except if the clones B_{12} and B_{21} got assigned to cells (R'_1, C'_2) and (R'_2, C'_1). Figure 4 shows the possible placements of clones in which the ambiguity is not resolved despite the shuffling. We prove that such situations are avoided with high probability for random shufflings.

Define the following notions. A rectangle is formed by the four clones at the intersections of two arbitrary rows and two arbitrary columns. A rectangle is preserved after a shuffling if the same four clones are at the intersections of exactly two rows and two columns on the reshuffled array, and the diagonals contain the same two clone pairs as before (see Figure 4).
Fig. 4. Possible placements of four clones (1–4) within two rows and two columns forming a rectangle, which give rise to the same ambiguity.
Theorem 1. Let R(m) be the number of preserved rectangles on an m × m array after a random shuffling. Then, for the expected value ER(m),
\[
\mathbb{E}R(m) = \frac{1}{2} - \frac{2}{m}\bigl(1 + o(1)\bigr).
\]
Moreover, for all m > 2, ER(m + 1) > ER(m).

(Proof in Appendix.)

Consequently, the expected number of preserved rectangles equals 1/2 asymptotically. Theorem 1 also implies that with significant probability, a random shuffling produces an array with at most one preserved rectangle. Specifically, by Markov's inequality, P(R(m) ≥ 1) ≤ ER(m), and thus
\[
\mathbb{P}\bigl(R(m) = 0\bigr) \ge \frac{1}{2} + \frac{2}{m}\bigl(1 + o(1)\bigr).
\]
In particular, P(R(m) = 0) > 1/2 holds for all m > 2. Therefore, a random shuffling will preserve no rectangles with probability at least 1/2. This allows us to use a random algorithm: pick a random shuffling, and count the number of preserved rectangles. If that number is greater than zero, repeat the step. The algorithm finishes in at most two steps on average, and takes more than one hundred steps with less than 10^{−33} probability.
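The randomized algorithm can be sketched directly from the definitions; the rectangle test below checks that four clones again occupy the corners of a 2 × 2 minor with the original diagonal pair (cf. Figure 4). This is an O(m⁴) illustration, not an optimized implementation.

```python
import random
from itertools import combinations

def preserved_rectangles(m, perm):
    """Count rectangles preserved under a shuffling of an m x m array.

    Clones are numbered 0..m*m-1 in row-major order on the first array;
    perm[i] is clone i's cell on the reshuffled array. A rectangle is
    preserved if its four clones again form the corners of a rectangle
    with the same two diagonal clone pairs.
    """
    count = 0
    for r1, r2 in combinations(range(m), 2):
        for c1, c2 in combinations(range(m), 2):
            clones = (r1 * m + c1, r1 * m + c2, r2 * m + c1, r2 * m + c2)
            cells = [divmod(perm[i], m) for i in clones]
            rows = {r for r, _ in cells}
            cols = {c for _, c in cells}
            # four corners, with the original diagonal (clones 0 and 3) kept
            if (len(rows) == 2 and len(cols) == 2
                    and cells[0][0] != cells[3][0]
                    and cells[0][1] != cells[3][1]):
                count += 1
    return count

def random_shuffling(m):
    """Draw shufflings until none of the rectangles is preserved."""
    while True:
        perm = list(range(m * m))
        random.shuffle(perm)
        if preserved_rectangles(m, perm) == 0:
            return perm
```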
Remark. Theorem 1 can be extended to non-square arrays without much difficulty. Let R(m_r, m_c) be the number of preserved rectangles on an m_r × m_c array after a random shuffling. Then, for the expected value ER(m_r, m_c),
\[
\mathbb{E}R(m_r, m_c) = \frac{1}{2} - \frac{m_r + m_c}{m_r m_c}\bigl(1 + o(1)\bigr).
\]
Consequently, random shuffling can be used to produce arrays with no preserved rectangles even in the case of non-square arrays (such as 12 × 8, for example).
3.4 Pooling Designs in General
Let B = {B_1, B_2, ..., B_N} be the set of clones, and P = {P_1, P_2, ..., P_K} be the set of pools. A general pooling design is described by an incidence structure that is represented by an N × K 0-1 matrix M. The entry M[i, j] equals one if clone B_i is included in P_j; otherwise it is 0. The signature c(B_i) of clone B_i is the i-th row vector, a binary vector of length K. In general, the signature of a subset S ⊆ B of clones is the binary vector of length K defined by c(S) = ∨_{B∈S} c(B), where ∨ denotes the bitwise OR operation. In order to assign an index to a set of clones, one first calculates the signature x of the index, defined as a binary vector of length K in which the j-th bit is 1 if and only if there is a hit coming from pool P_j. For all binary vectors x and c of length K, define
\[
\delta(x, c) =
\begin{cases}
\sum_{j=1}^{K} (c_j - x_j) & \text{if } x_j \le c_j \text{ for all } j = 1, \dots, K; \\
\infty & \text{otherwise;}
\end{cases}
\tag{3a}
\]
and let
\[
\delta(x, \mathcal{B}) = \min_{S \subseteq \mathcal{B}} \delta\bigl(x, c(S)\bigr). \tag{3b}
\]
An index with signature x can be deconvoluted unambiguously if and only if the minimum in Equation (3b) is unique. The weight w(x) of a binary vector x is the number of coordinates that equal one, i.e., w(x) = Σ_{j=1}^{K} x_j. Equation (3a) implies that if δ(x, c) < ∞, then δ(x, c) = w(c) − w(x).
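Restricted to singleton clone sets, the deconvolution rule of Equations (3a)-(3b) takes only a few lines; this is a sketch (the full rule would also search larger subsets S).

```python
import numpy as np

def deconvolute_singletons(x, pools):
    """Singleton deconvolution by Equations (3a)-(3b).

    x : 0/1 signature of an index (length K); pools : N x K 0-1 pooling
    matrix whose i-th row is the clone signature c(B_i). Returns the
    indices of the singleton minimizers of delta(x, c(B)); deconvolution
    is unambiguous iff exactly one clone index is returned.
    """
    x = np.asarray(x)
    feasible = (pools >= x).all(axis=1)   # x_j <= c_j for all j
    dist = np.where(feasible, pools.sum(axis=1) - x.sum(), np.inf)
    best = dist.min()
    if not np.isfinite(best):
        return np.array([], dtype=int)
    return np.flatnonzero(dist == best)
```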
Similar problems to PGI pool design have been considered in other applications of combinatorial group testing [7], and pooling designs have often been used for clone library screening [8]. The design of sparse arrays based on combinatorial geometries is a basic design method in combinatorics (e.g., [9]). Reshuffled array designs are sometimes called transversal designs. Instead of preserved rectangles, [10] consider collinear clones, i.e., clone pairs in the same row or column, and propose designs with the unique collinearity condition, in which clone pairs are collinear at most once on the reshuffled arrays. Such a condition is more restrictive than ours and leads to incidence structures obtained by more complicated combinatorial methods than our random algorithm. We describe here a construction of arrays satisfying the unique collinearity condition. Let q be a prime power. Based on results of design theory [9], the following method can be used for producing up to q/2 reshuffled arrays, each of size q × q, for pooling q² clones. Let F_q be a finite field of size q. Pools are indexed with elements of F_q² as {P_{x,y} : x, y ∈ F_q}. Define the pool set P_i = {P_{i,y} : y ∈ F_q} for all i, and P = ∪_i P_i. Index the clones as {B_{a,b} : a, b ∈ F_q}. Pool P_{x,y} contains clone B_{a,b} if and only if y = a + bx holds.
Proposition 4. We claim that the pooling design described above has the following properties:
1. each pool belongs to exactly one pool set;
2. each clone is included in exactly one pool from each pool set;
3. for every pair of pools that are not in the same pool set, there is exactly one clone that is included in both pools.

(Proof in Appendix.)

Let d ≤ q/2. This pooling design can be used for arranging the clones on d reshuffled arrays, each one of size q × q. Select 2d pool sets, and pair them arbitrarily. By Properties 1–3 of the design, every pool set pair can define an array layout, by setting one pool set as the row pools and the other pool set as the column pools. Moreover, this set of reshuffled arrays gives a pooling satisfying the unique collinearity condition by Property 3, since two clones are in the same row or column on at most one array. A sketch of this layout construction follows.
Similar questions also arise in coding theory, in the context of superimposed codes [11,12]. Based on the idea of [11], consider the following pooling design method using error-correcting block codes. Let C be a code of block length n over the finite field F_q; in other words, let C be a set of length-n vectors over F_q. A corresponding binary code is constructed by replacing the elements of F_q in the codewords with binary vectors of length q. The substitution uses binary vectors of weight one, i.e., vectors in which exactly one coordinate is 1, using the simple rule that the z-th element of F_q is replaced by a binary vector in which the z-th coordinate is 1. The resulting binary code is obtained by applying this substitution to each codeword of C. For a Reed-Solomon code RS(n, k), the q-ary codewords are generated by the matrix
\[
G = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
\alpha_0 & \alpha_1 & \cdots & \alpha_{n-1} \\
\alpha_0^2 & \alpha_1^2 & \cdots & \alpha_{n-1}^2 \\
\vdots & \vdots & & \vdots \\
\alpha_0^{k-1} & \alpha_1^{k-1} & \cdots & \alpha_{n-1}^{k-1}
\end{pmatrix},
\]
where α_0, α_1, ..., α_{n−1} are distinct elements of F_q. Using a mapping f : F_q^n → F_2^{qn} as before, for the first N ≤ q^k codewords, a pooling design is obtained with N clones and K = qn pools. This design has many advantageous properties. Kautz and Singleton [11] prove that if t = ⌊(n − 1)/(k − 1)⌋, then the signature of any t-set of clones is unique. Since α_i^{k−1} ≠ 0 in G, Lemma 1 applies, and thus each pool has about the same size m = N/q.
Proposition 5. Suppose that the pooling design with N = q^k is based on an RS(n, k) code and x is a binary vector of positive weight and length K. If there is a clone B such that δ(x, c(B)) < ∞, then δ(x, B) = n − w(x), and the following holds. If w(x) ≥ k, then the minimum in Equation (3b) is unique, and is attained for the singleton set containing B. Conversely, if w(x) < k, then the minimum in Equation (3b) is attained for q^{k−w(x)} choices of singleton clone sets.

(Proof in Appendix.)
For instance, a design based on the RS(6, 3) code has the following properties:
– 343 clones are pooled in 42 pools;
– each clone is included in 6 pools; if at least 3 of those are included in an index to the clone, the index can be deconvoluted unambiguously;
– each pool contains 49 clones;
– signatures of 2-sets of clones are unique and have weights 10–12;
– signatures of 3-sets of clones have weights between 12 and 18; if the weight of a 3-set's signature is less than 14, then the signature is unique (determined by a computer program).
A small construction of this design is sketched below.
The success of indexing in the general case is shown by the following claim.

Proposition 6. Consider an index with effective length M between a clone and a reference sequence. If the clone appears in n pools, and a set of at least n_min of these pools uniquely determines the clone, then the probability that the index is detected equals
\[
p_M = \sum_{t=n_{\min}}^{n} \binom{n}{t} (1 - p_0)^{t} \, p_0^{\,n-t},
\qquad \text{where } p_0 \approx e^{-\frac{cM}{n\ell}} .
\]
(Proof in Appendix.)
In simple arrayed pooling, n_min = n = 2. In the pooling based on the RS(6, 3) code, n_min = 3, n = 6. When is the probability of success larger with the latter indexing method, at a fixed coverage? The probabilities can be compared using the following analogy. Let 6 balls be colored with red and green independently, each ball colored green with probability p. Let X denote the event that at least one of the first three balls, and at least one of the last three balls, are red: PX = (1 − p³)². Let Y denote the event that at least three balls are red: PY = Σ_{t=3}^{6} C(6, t)(1 − p)^t p^{6−t}. PX equals the probability of successful indexing in simple arrayed pooling (with p_0 = p³). PY equals the probability of successful indexing in RS(6, 3) pooling (with p_0 = p). We claim that PY > PX if p < 2/11. Consider the event X∖Y: exactly one of the first three balls is red and exactly one of the last three balls is red; P(X∖Y) = (3(1 − p)p²)². Consider the event Y∖X: the first three or the last three balls are all red, and the others green; P(Y∖X) = 2(1 − p)³p³. Now, PY > PX if and only if P(Y∖X) > P(X∖Y), i.e., when p < 2/11. The inequality holds if M > −(6ℓ/c) ln(2/11) ≈ 10ℓ/c. Thus, for longer homologous regions (ca. 5000 bp effective length if c = 1), the RS(6, 3) pooling is predicted to have better success. As c grows, simple arrayed pooling becomes better for the same fixed index. Furthermore, at p < 2/11, even simple arrayed pooling gives around 99% or higher success probability, so the improvements are marginal, while the difference between the probabilities is significantly larger when p is large. On the other hand, a similar argument shows that double shuffling with unique collinearity (n = 4, n_min = 2) always has a higher success probability than simple arrayed pooling. Figure 5 compares the probabilities of successful indexing for some pooling designs.
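The crossover at p = 2/11 can be checked numerically; this is a sketch of the ball-coloring computation above.

```python
from math import comb

def p_success(n, n_min, p0):
    """Proposition 6: detection probability with n pools per clone."""
    return sum(comb(n, t) * (1 - p0) ** t * p0 ** (n - t)
               for t in range(n_min, n + 1))

# With p = e^{-cM/(6l)}, RS(6,3) beats simple arrayed pooling exactly
# when p < 2/11 (simple has p0 = p^3, RS(6,3) has p0 = p).
for p in (0.15, 2 / 11, 0.25):
    print(f"p={p:.3f}  simple={p_success(2, 2, p**3):.5f}  "
          f"rs63={p_success(6, 3, p):.5f}")
```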
Fig. 5. The graphs on the left-hand side show the probabilities of failing to find an index as a function of the effective length. The graphs are plotted for coverage c = 1 and expected shotgun read length ℓ = 500. The values are calculated from Proposition 6 for simple arraying and double shuffling with unique collinearity, as well as for designs based on the RS(6, 3) and RS(42, 3) codes. Notice that in the case of a simple index, the failure probabilities for the sparse array design are the same as for the simple array. The graphs on the right-hand side compare the costs of different pooling designs by plotting how the number of pools depends on the total number of clones in different designs.
4 Experiments
We tested the efficiency of the PGI method for indexing mouse and rat clones by human reference sequences in simulated experiments. The reference databases included the public human genome draft sequence [2], the Human Transcript Database (HTDB) [13], and the Unigene database of human transcripts [14]. Local alignments were computed using BLASTN [1] with default search parameters (word size = 11, gap open cost = 5, gap extension cost = 2, mismatch penalty = 2). A hit was defined as a local alignment with an E-value less than 10^{−5}, a length of at least 40 bases, and a score of at least 60.
In the case of transcribed reference sequences, hits on the same human reference sequence were grouped together to form the indexes. In the case of genomic sequences, we grouped close hits together within the same reference sequence to form the indexes. In particular, indexes were defined as maximal sets of hits on the same reference sequence, with a certain threshold on the maximum distance between consecutive hits, called the resolution. After experimenting with resolution values between 1 kbp and 200 kbp, we decided to use 2 kbp resolution throughout the experiments. A particular difficulty we encountered in the experiments was the abundance of repetitive elements in eukaryotic DNA. If part of a shotgun read has many homologies in the human sequences, as is the case with common repeat sequences, the read generates many hits. Conversely, the same human sequence may be homologous to many reads. Accordingly, repeats in the arrayed clones correspond to highly ambiguous indexes, and human-specific repeats may produce a large number of indexes to the same clone. Whereas in the cases of rat and mouse it is possible to use a database of repeat sequences such as Repbase [15], such information is not available for many other species. We thus resorted to a different technique for filtering out repeats: we simply discarded shotgun reads that generated more than twelve hits, thereby eliminating almost all repeats without using a database of repeated elements.

For the deconvolution, we set the maximum number of clones to three; that is, each index was assigned to one, two, or three clones, or was declared ambiguous. Finally, the fact that pooling was simulated and that in all experiments the original clone for each read was known enabled a straightforward test of accuracy of indexing: an index was considered accurate if both reads deconvoluted to the correct original clone.
Table 1. Experimental results for simulated indexing of 207 mouse clones with coverage level 2. The table gives the results for four pooling designs (simple, shuffled, sparse array, and RS(6, 3)), and three databases of reference sequences: Unigene (UG), HTDB, and human genome draft (HS). The RS(6, 3) design was used with the UG database only.

          Pools  Correctly indexed clones           Correct indexes / false positives
                 UG         HTDB       HS           UG          HTDB       HS
simple    29     159 (77%)  139 (67%)  172 (83%)    723 / 248   488 / 108  1472 / 69
RS(6, 3)  42     172 (83%)  -          -            611 / 150   -          -
shuffled  58     172 (83%)  150 (72%)  180 (87%)    756 / 76    514 / 18   1549 / 238
sparse    78     175 (85%)  152 (74%)  185 (88%)    823 / 22    569 / 11   1634 / 69
4.1 Mouse Experiments
In one set of experiments we studied the efficiency of indexing mouse clones with human sequences. We selected 207 phase 3 sequences in the mouse sequencing project with lengths greater than 50k bases. We used four pooling designs: two 14 × 15 arrays for double shuffling, a 39 × 39 sparse array, shown in Figure 3, and a design based on the RS(6, 3) code with 42 pools.

Random shotgun reads were produced by simulation. Each random read from a fixed pool was obtained in the following manner. A clone was selected randomly with probabilities proportional to clone lengths. A read length was picked using a Poisson distribution with mean ℓ = 550, truncated at 1000. The random position of the read was picked uniformly from the possible places for the read length on the selected clone. Each position of the read was corrupted independently with probability 0.01. The procedure was repeated to attain the desired coverage within each pool. Fragments were generated for the different pooling designs with c = 2. A sketch of this read simulation appears below.
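The read simulator just described can be sketched as follows (the function name and return format are illustrative; base-call errors are indicated but actual sequence handling is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pool_reads(clone_lengths, coverage=2.0, mean_len=550):
    """Simulated shotgun reads for one pool (a sketch of Section 4.1).

    clone_lengths: lengths of the clones pooled together. Returns a list
    of (clone index, start, length) tuples; base-call errors would then
    be applied to each position independently with probability 0.01.
    """
    L = np.asarray(clone_lengths, dtype=float)
    n_reads = int(coverage * L.sum() / mean_len)
    reads = []
    for _ in range(n_reads):
        i = rng.choice(len(L), p=L / L.sum())    # length-proportional pick
        rlen = min(rng.poisson(mean_len), 1000)  # truncated at 1000
        start = rng.integers(0, max(int(L[i]) - rlen, 0) + 1)
        reads.append((i, int(start), int(rlen)))
    return reads
```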
The results of the experiments are summarized in Table 1. The main finding is that 67%–88% of the clones have at least one correct index, and 500–1,600 correct indexes are created, depending on the array design and the reference database. Note that the double shuffling and sparse array methods significantly reduce the number of false positives, as expected based on the discussion above. One of the reasons for the excellent performance of the sparse array method is the fact that BAC redundancy in the data set does not exceed 2X. In the course of the experiments, between 7% and 10% of a total of approximately 121,400 simulated reads gave alignments against human sequences that were informative for the deconvolution, a percentage that is consistent with the expected overall level of genomic sequence conservation between mouse and human.
Fig. 6. Number of correctly mapped clones, indicated as a function of shotgun coverage for different pooling schemes. PGI was tested in a simulated experiment involving 207 publicly available mouse genomic sequences of length between 50 kbp and 300 kbp. Pooling and shotgun sequencing were then simulated. For each coverage level, reads for c = 2 were resampled ten times. Graphs go through the median values for four pooling designs: simple (smpl), shuffled (shfl), sparse (sprs), and Reed-Solomon (rs63). Deconvolution was performed using the human Unigene database. Notice that the curves level off at approximately c = 1.0, indicating limited benefit of much greater shotgun coverages.
In order to explore the effect of shotgun coverage on the success of indexing, we repeated the indexing with lower coverage levels. We resampled the simulated reads by selecting appropriate portions randomly from those produced with coverage 2. Figure 6 plots the number of indexed clones as a function of coverage. It is worth noting that even for c = 0.5, about 2/3 of the clones get indexed by at least one human sequence. In addition, the curves level off at about c = 1.0, and higher coverage levels yield limited benefits.
4.2 Rat Experiments
In contrast to the mouse experiments, the PGI simulation on rat was performed using publicly available real shotgun reads from individual BACs being sequenced as part of the rat genome sequencing project. The only simulated aspect was BAC pooling, for which we pooled reads from individual BACs computationally. We selected a total of 625 rat BACs, each with more than 570 publicly available reads from the rat sequencing project. An average of 285 random reads per clone were used in each pool, corresponding to an approximate c = 1.5 coverage. The results are summarized in Table 2.

Table 2. Experimental results on simulated indexing of 625 rat clones by Unigene sequences with coverage level 1.5.

          Correct indexes  False positives  Indexed clones
simple    1418             236              384 (61%)
shuffled  1383             30               409 (65%)
sparse    1574             17               451 (72%)
The lower percentage of correctly mapped clones only partially reflects the lower coverage (1.5 for rat vs. 2 for mouse). A much more important factor contributing to the difference is the fact that the mouse sequences used in our experiments are apparently richer in genes and thus contain more conserved regions, which contribute to the larger number of BACs correctly mapped onto human sequences.
5 Discussion
PGI is a novel method for physical mapping of clones onto known macromolecular sequences. It employs available sequences of humans and model organisms to index genomes of new organisms at a fraction of full genome sequencing's cost. The key idea of PGI is the pooled shotgun library construction, which reduces the amount of library preparations down to the order of the square root of the number of BAC clones. In addition to setting priorities for targeted sequencing, PGI has the advantage that the libraries and reads it needs can be reused in the sequencing phase. Consequently, it is ideally suited for a two-staged approach to comparative genome explorations, yielding maximum biological information for given amounts of sequencing effort.

We presented a probabilistic analysis of indexing success, and described pooling designs that increase the efficiency of in silico deconvolution of pooled shotgun reads. Using publicly available mouse and rat sequences, we demonstrated the power of the PGI method in simulated experiments. In particular, we showed that using relatively few shotgun reads, corresponding to 0.5–2.0 coverage of the clones, 60–90% of the clones can be indexed with human genomic or transcribed sequences.
Due to the low level of chromosomal rearrangements across mammals, the
order of BACs in a comparative physical map should provide an almost correct
ordering of BACs along the genome of a newly indexed mammal. Such information
should be very useful for whole-genome sequencing of such organisms.
Moreover, the already assembled reference sequences of model organisms onto
which the BACs are mapped may guide the sequence assembly of the homologous
sequence of a newly sequenced organism² [16].
Comparative physical maps will allow efficient, targeted, cross-species
sequencing for the purpose of comparative annotation of genomic regions in model
organisms that are of particular biomedical importance. PGI is not limited to the
mapping of BAC clones. Other applications currently include the mapping of arrayed
cDNA clones onto genomic or known full or partial cDNA sequences within
and across species, and the mapping of bacterial genomic clones across different
bacterial strains. The sampling efficiency of PGI is in practice increased by an order
of magnitude by sequencing short sequence fragments. This is accomplished by
breaking the pooled DNA into short fragments, selecting them by size, forming
concatenamers by ligation, and then sequencing the concatenamers [17,18,19].
Assuming a sequence read of 600 bp and a tag size of 20-200 bp, a total of 3-30
different clones may be sampled in a single sequencing reaction. This technique is
particularly useful when the clone sequences and reference sequences are highly
similar, e.g., cDNA mapping against the genome of the same species, bacterial
genome mapping across similar strains, and mapping of primate genomic BACs
against human sequence.
Acknowledgements
The authors are grateful to Richard Gibbs and George Weinstock for sharing
pre-publication information on CAPSS and for useful comments, and to Paul
Havlak and David Wheeler for contributing database access and computational
resources at HGSC.
² Claim 1 of US Patent 6,001,562 [16] reads as follows: A method for detecting
sequence similarity between at least two nucleic acids, comprising the steps of:
(a) identifying a plurality of putative subsequences from a first nucleic acid;
(b) comparing said subsequences with at least a second nucleic acid sequence; and
(c) aligning said subsequences using said second nucleic acid sequence in order to
simultaneously maximize
  (i) matching between said subsequences and said second nucleic acid sequence and
  (ii) mutual overlap between said subsequences,
whereby said aligning predicts a subsequence that occurs within both said first and
said second nucleic acids.
References
1. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (1997) 3389-3402
2. IHGSC: Initial sequencing and analysis of the human genome. Nature 409 (2001) 860-921
3. Cai, W.W., Chen, R., Gibbs, R.A., Bradley, A.: A clone-array pooled shotgun strategy for sequencing large genomes. Genome Res. 11 (2001) 1619-1623
4. Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2 (1988) 231-239
5. Bollobás, B.: Extremal graph theory. In Graham, R.L., Grötschel, M., Lovász, L., eds.: Handbook of Combinatorics. Volume II. Elsevier, Amsterdam (1995) 1231-1292
6. Reiman, I.: Über ein Problem von K. Zarankiewicz. Acta Math. Sci. Hung. 9 (1958) 269-279
7. Du, D.Z., Hwang, F.K.: Combinatorial Group Testing and Its Applications. 2nd edn. World Scientific, Singapore (2000)
8. Bruno, W.J., Knill, E., Balding, D.J., Bruce, D.C., Doggett, N.A., Sawhill, W.W., Stallings, R.L., Whittaker, C.C., Torney, D.C.: Efficient pooling designs for library screening. Genomics 26 (1995) 21-30
9. Beth, T., Jungnickel, D., Lenz, H.: Design Theory. 2nd edn. Cambridge University Press, UK (1999)
10. Barillot, E., Lacroix, B., Cohen, D.: Theoretical analysis of library screening using an n-dimensional pooling strategy. Nucleic Acids Res. 19 (1991) 6241-6247
11. Kautz, W.H., Singleton, R.C.: Nonrandom binary superimposed codes. IEEE Trans. Inform. Theory IT-10 (1964) 363-377
12. Dyachkov, A.G., Macula, Jr., A.J., Rykov, V.V.: New constructions of superimposed codes. IEEE Trans. Inform. Theory IT-46 (2000) 284-290
13. Bouck, J., McLeod, M.P., Worley, K., Gibbs, R.A.: The Human Transcript Database: a catalogue of full length cDNA inserts. Bioinformatics 16 (2000) 176-177. http://www.hgsc.bcm.tmc.edu/HTDB/
14. Schuler, G.D.: Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 75 (1997) 694-698. http://www.ncbi.nlm.nih.gov/Unigene/
15. Jurka, J.: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16 (2000) 418-420. http://www.girinst.org/
16. Milosavljevic, A.: DNA sequence similarity recognition by hybridization to short oligomers (1999). U.S. Patent 6,001,562
17. Andersson, B., Lu, J., Shen, Y., Wentland, M.A., Gibbs, R.A.: Simultaneous shotgun sequencing of multiple cDNA clones. DNA Seq. 7 (1997) 63-70
18. Yu, W., Andersson, B., Worley, K.C., Muzny, D.M., Ding, Y., Liu, W., Ricafrente, J.Y., Wentland, M.A., Lennon, G., Gibbs, R.A.: Large-scale concatenation cDNA sequencing. Genome Res. 7 (1997) 353-358
19. Velculescu, V.E., Vogelstein, B., Kinzler, K.W.: Analysing uncharted transcriptomes with SAGE. Trends Genet. 16 (2000) 423-425
Appendix
Proof (Proposition 1). The number of random shotgun reads from the row pool
associated with the clone equals cmL/2. By Equation (1), the probability that at
least one of them aligns with the reference sequence equals

    p_1 = 1 − (1 − p_hit/m)^(cmL/2) = 1 − (1 − M/(mL))^(cmL/2) ≈ 1 − e^(−cM/2).   (4)

The probability that the row and column pools both generate reads aligning to
the reference sequence equals p_1², as claimed.
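The exponential approximation in Equation (4) is easy to verify numerically. Below is a quick check (my own, not from the paper, with purely illustrative values; M plays the abstract role it has in Equation (1), chosen so that cM/2 is of order one):

    import math

    # Illustrative values only: m x m clone array, clone length L,
    # shotgun coverage c, and the quantity M from Equation (1).
    m, L, M, c = 8, 2000, 2.0, 1.0

    reads_per_pool = c * m * L / 2     # reads generated from one row pool
    p_single = M / (m * L)             # per-read alignment probability p_hit/m

    p1_exact = 1 - (1 - p_single) ** reads_per_pool
    p1_approx = 1 - math.exp(-c * M / 2)

    print(p1_exact, p1_approx)         # 0.6321... vs 0.6321...: the limit is tight
    print(p1_exact ** 2)               # both pools hit: p_1^2, as in the claim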
Proof (Proposition 2). The number of hits for an index coming from a fixed row
or column pool is distributed binomially with parameters n = cmL/2 and p = M/(mL).
Let ξ_r denote the number of hits coming from the clone's row pool, and let ξ_c
denote the number of hits coming from the clone's column pool. Then

    E ξ_M = E[ξ_r + ξ_c | ξ_r > 0, ξ_c > 0] = E[ξ_r | ξ_r > 0] + E[ξ_c | ξ_c > 0].

In order to calculate the conditional expectations on the right-hand side, notice
that if ξ is a non-negative random variable, then E ξ = E[ξ | ξ > 0] · P(ξ > 0).
Since P(ξ_c > 0) = P(ξ_r > 0) = p_1,

    E ξ_M = 2 E[ξ_r | ξ_r > 0] = 2np/p_1.

Resubstituting p and n, the proposition follows from Equation (4).
Proof (Proposition 3). Let n = cmL/2 be the number of reads from a pool and
p = M/(mL) the probability of a hit within a pool containing one of the clones. Then a
false positive occurs if there are no hits in the row pool for one clone and the
column pool for the other, and there is at least one hit in each of the other two
pools containing the clones. The probability of that event equals

    2 ((1 − p)^n)² (1 − (1 − p)^n)².

Using the same approximation technique as in Proposition 1 leads to Equation (2).
Proof (Theorem 1). Let R be an arbitrary rectangle in the array. We calculate
the probability that R is preserved after a random shuffling by counting the
number of shufflings in which it is preserved. There are (m choose 2) choices for selecting
two rows for the position of the preserved rectangle. Similarly, there are (m choose 2) ways
to select two columns. There are 8 ways of placing the clones of the rectangle in
the cells at the four intersections in such a way that the diagonals are preserved
(see Figure 4). There are (m² − 4)! ways of placing the remaining clones on the
array. Thus, the total number of shufflings in which R is preserved equals

    8 (m choose 2)² (m² − 4)! = 8 (m(m−1)/2)² (m² − 4)!.
Dividing this number by the number of possible shufflings gives the probability
of preserving R:

    p = 8 (m choose 2)² / ( m²(m² − 1)(m² − 2)(m² − 3) ).   (5)
For every rectangle R, define the indicator variable I(R) for the event that
it is preserved. Obviously, E I(R) = p. Using the linearity of expectations and
Equation (5),

    ER(m) = Σ_R E I(R) = (m choose 2)² p = (1/2) · m⁴(m−1)⁴ / ( m²(m² − 1)(m² − 2)(m² − 3) ).   (6)

Thus,

    ER(m) = (1/2) (1 − (4/m)(1 + o(1))),

proving the theorem. Equation (6) also implies that ER(m+1)/ER(m) > 1 for m > 2, and
thus ER(m) is increasing monotonically.
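As a sanity check (mine, not part of the paper), Equation (6) can be tabulated against the asymptotic form (1/2)(1 − 4/m):

    def expected_preserved(m: int) -> float:
        # Equation (6): ER(m) = (1/2) m^4 (m-1)^4 / (m^2 (m^2-1)(m^2-2)(m^2-3))
        return 0.5 * m**4 * (m - 1) ** 4 / (m**2 * (m**2 - 1) * (m**2 - 2) * (m**2 - 3))

    for m in (3, 5, 10, 100):
        print(m, expected_preserved(m), 0.5 * (1 - 4 / m))
    # ER(m) increases monotonically toward 1/2, matching the theorem.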
Proof (Proposition 4). Property 1 is trivial. For Property 2, notice that
clone B_{a,b} is included in pool P_{i, a+ib} in every P_i, and in no other pools. For
Property 3, let P_{x_1, y_1} and P_{x_2, y_2} be two arbitrary pools. Each clone B_{a,b} is
included in both pools if and only if

    y_1 = a + b·x_1  and  y_2 = a + b·x_2.

If x_1 ≠ x_2, then there is exactly one solution for (a, b) that satisfies both
equalities.
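The construction is short enough to state in code. The sketch below is my own; it assumes the indices are taken modulo a prime q (consistent with the finite-field view behind the Reed-Solomon designs that follow), so that the two linear equations have a unique solution whenever x_1 ≠ x_2:

    from collections import Counter

    q = 7  # assumed prime, so a + b*x = y is uniquely solvable in (a, b) when x1 != x2

    def pools_of_clone(a, b):
        # Property 2: clone B_{a,b} lies in pool P_{x, a+bx} for every x, and no others.
        return [(x, (a + b * x) % q) for x in range(q)]

    # Property 3: two pools P_{x1,y1}, P_{x2,y2} with x1 != x2 share exactly one clone.
    shared = Counter()
    for a in range(q):
        for b in range(q):
            pools = pools_of_clone(a, b)
            for i in range(q):
                for j in range(i + 1, q):
                    shared[(pools[i], pools[j])] += 1
    print(set(shared.values()))  # {1}: every such pair of pools shares exactly one clone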
Proof (Lemma 1). Fix the first (k − 1) coordinates of u and let the last one vary
from 0 to (q − 1). The i-th coordinate of the corresponding codeword c = uG
takes all values of F_q if the entry G[k, i] is not 0.
Proof (Proposition 5). The first part of the claim, for the case w(x) ≥ k, is a
consequence of the error-correcting properties of the code. The second part of
the claim follows from the MDS property.
Proof (Proposition 6). This proof generalizes that of Proposition 1. The number
of random shotgun reads from one pool associated with the clone equals cmL/n. By
Equation (1), the probability that at least one of them aligns with the reference
sequence equals

    p_1 = 1 − (1 − p_hit/m)^(cmL/n) ≈ 1 − e^(−cM/n).

The probability that at least n_min pools generate reads aligning to the reference
sequence equals

    Σ_{t=n_min}^{n} (n choose t) p_1^t (1 − p_1)^(n−t),

proving the claim with p_0 = 1 − p_1.
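The tail sum above is a one-liner to evaluate; the following sketch (my own, with illustrative parameter values not taken from the paper) computes it:

    from math import comb, exp

    def p_indexed(n, n_min, c, M):
        # Probability that at least n_min of the n pools produce an aligning read,
        # using the per-pool approximation p1 = 1 - exp(-cM/n).
        p1 = 1 - exp(-c * M / n)
        return sum(comb(n, t) * p1**t * (1 - p1) ** (n - t) for t in range(n_min, n + 1))

    print(p_indexed(n=8, n_min=2, c=1.0, M=2.0))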
Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem

Romeo Rizzi¹, Vineet Bafna², Sorin Istrail², and Giuseppe Lancia³

¹ Math. Dept., Università di Trento, 38050 Povo (Tn), Italy. [email protected]
² Celera Genomics, Rockville MD, USA. {Vineet.Bafna,Sorin.Istrail}@celera.com
³ D.E.I., Università di Padova, 35100 Padova, Italy. [email protected]
Abstract. Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation, of foremost importance for a variety of applications including medical diagnostics, phylogenies and drug design.
The complete SNP sequence information from each of the two copies of a given chromosome in a diploid genome is called a haplotype. The Haplotyping Problem for a single individual is as follows: given a set of fragments from one individual's DNA, find a maximally consistent pair of SNP haplotypes (one per chromosome copy) by removing data errors related to sequencing errors, repeats, and paralogous recruitment. Two versions of the problem, i.e. the Minimum Fragment Removal (MFR) and the Minimum SNP Removal (MSR), are considered.
The Haplotyping Problem was introduced in [8], where it was proved that both MSR and MFR are polynomially solvable when each fragment covers a set of consecutive SNPs (i.e., it is a gapless fragment), and NP-hard in general. The original algorithms of [8] are of theoretical interest, but by no means practical. In fact, one relies on finding the maximum stable set in a perfect graph, and the other is a reduction to a network flow problem. Furthermore, the reduction does not work when there are fragments completely included in others, and neither algorithm can be generalized to deal with a bounded total number of holes in the data.
In this paper, we give the first practical algorithms for the Haplotyping Problem, based on dynamic programming. Our algorithms do not require that fragments not include one another, and are polynomial for each constant k bounding the total number of holes in the data.
For m SNPs and n fragments, we give an O(mn^{2k+2}) algorithm for the MSR problem, and an O(2^{2k} m² n + 2^{3k} m³) algorithm for the MFR problem, when each fragment has at most k holes. In particular, we obtain an O(mn²) algorithm for MSR and an O(m²n + m³) algorithm for MFR on gapless fragments.
Finally, we prove that both MFR and MSR are APX-hard in general.
Then MSR(M′) = MSR(M). Let X be any set of SNPs whose removal makes M′ error-free. Then
the removal of X also makes M error-free.
We can hence assume that every row conflicts with at least two other rows,
simply by dropping those rows which conflict with at most one row. The matrix M′
so obtained satisfies MFR(M′) = MFR(M): let X be any set of rows whose removal makes M′
error-free; then also M \ X is error-free.
Proposition 5 (F-reduction, again). Let M′ [...]

For an edge set C, let w(C) = Σ_{f∈C} w(f) mod 2. We say that the pair (G, w) is w-bipartite if
and only if it contains no w-odd cycle (regard a cycle as an edge set). By the "o
con noi o contro di noi"¹ Lemma 1, we have the following consequence:

¹ Literal translation: either with us or against us. Meaning: you have to choose, either friend or enemy.
Proposition 13. The matrix M is error-free if and only if (G, w) is w-bipartite.

We now give a formal proof of the above proposition in a more general form.

Lemma 14. Let X be a set of columns. If M is S-reduced and M \ X is error-free, then (G \ X, w) is w-bipartite.
Proof: Indeed, let C = f_1, s_1, . . . , f_k, s_k be any cycle in G \ X. Define
w_C(s_i) = w(f_i s_i) + w(s_i f_{i+1}), so that w(C) = Σ_{i=1}^{k} w_C(s_i). Consider a partition of F into
two haplotypes. By Lemma 1, whenever w_C(s_i) is even, f_i and f_{i+1} are in
the same haplotype, while when it is odd, they are in different haplotypes. But,
along the cycle f_1, f_2, . . . , f_k, the number of haplotype switches must be even
(like jumping forth and back between the two sides of a river but eventually
returning to the starting side).
Now we go for the converse.

Lemma 15. If (G, w) is w-bipartite, then M (not necessarily reduced) is error-free.

Proof: By definition, the fragment conflict graph G_F(M) = (F, E_F) is such that
uv ∈ E_F if there is an s such that the edges us and sv are in (G, w) and w(us) + w(sv) = 1.
Hence, to each cycle C in G_F(M) corresponds a cycle C′ in (G, w) of the same parity;
since (G, w) is w-bipartite, C′ is w-even. So, G_F(M) is bipartite and
hence M is error-free.
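The parity condition behind Lemmas 14 and 15 can be tested directly: assign each fragment a tentative haplotype side and propagate the even/odd edge constraints by breadth-first search. The sketch below is my own rendering (the adjacency-map encoding is an assumption, not the paper's data structure):

    from collections import deque

    def is_w_bipartite(adj):
        # adj[u] = list of (v, w) with w in {0, 1}: weight 1 forces u, v to opposite
        # haplotypes, weight 0 to the same one; returns False iff a w-odd cycle exists.
        side = {}
        for start in adj:
            if start in side:
                continue
            side[start] = 0
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v, w in adj[u]:
                    if v not in side:
                        side[v] = side[u] ^ w
                        queue.append(v)
                    elif side[v] != side[u] ^ w:
                        return False
        return True

    # Three fragments with one odd edge on an otherwise even cycle: a w-odd cycle.
    G = {1: [(2, 1), (3, 0)], 2: [(1, 1), (3, 0)], 3: [(1, 0), (2, 0)]}
    print(is_w_bipartite(G))  # False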
We finally prove Theorem 12 of Section 3.1.

Theorem 16. The problems MFR and MSR are APX-hard.

Proof: Given a graph G, the problem (minEdgeBipartizer) of finding a minimum
cardinality set of edges F such that G \ F is bipartite, and the problem
(minNodeBipartizer) of finding a minimum cardinality set of nodes Z such
that G \ Z is bipartite, are known to be APX-hard [10]. Moreover, the same
problems are not known to be in APX. We give simple L-reductions from these
two problems to MFR and MSR, hence showing not only the APX-hardness of MFR and
MSR, but also that finding a constant ratio approximation algorithm for any of
these problems in general is somewhat of a challenge.

Given an input graph G as instance for either minEdgeBipartizer or minNodeBipartizer,
we subdivide each edge into two edges in series. (The resulting
graph G′ is clearly bipartite; we call SNPs the new nodes and fragments the
old ones.) Now, of the two edges incident with every SNP, we declare one to
be odd and one to be even. Clearly, solving or approximating MSR on (G′, w)
amounts to solving or approximating (within exactly the same approximation)
minEdgeBipartizer on G. Moreover, solving or approximating MFR on (G′, w)
amounts to solving or approximating (within exactly the same approximation)
minNodeBipartizer on G.
5 Polynomial Algorithms for the Gapless Case

In this section we prove Corollaries 9 and 11 of Section 3.1. Although they would
follow from the more general theorems for matrices with bounded length gaps
(described in Section 6), we prove them here directly because it is didactically
better to do so. In fact, the results of Section 6 will be obtained as generalizations
of the ideas described in this section.
5.1 MSR: A Dynamic Programming O(mn²) Algorithm

In this section we propose a dynamic programming approach for the solution
of MSR. The resulting algorithm can be coded so as to take O(mn²) time. In the
following, we assume M to be S-reduced.
It is easier to understand our dynamic program if we state it for the complementary
problem of MSR, i.e., find the maximum number of SNPs that can be
kept so that M becomes error-free. Clearly, if k is the largest number of SNPs
that we can keep, then n − k is the smallest number of SNPs to remove.

For j ≤ n (with j ≥ 0), we define K[j] as the maximum number of columns
that can be kept to make M error-free, under the condition that j is the rightmost
column kept (if all columns are removed then j = 0, and K[0] = 0).
Once all the K[j] are known, the solution to the problem is given by

    max_{j ∈ {0,...,n}} K[j].

For every j, we define OK(j) as the set of those i with i < j such that
columns i and j do not conflict. We assume that 0 belongs to OK(j) for every
j. Now, for every j,

    K[j] := 1 + max_{i ∈ OK(j)} K[i]   (1)

where Equation (1) is correct by the following easily proven fact.
Lemma 17. Let M be a gapless S-reduced matrix. Consider columns a < b < c in S.
If a is not in conflict with b and b is not in conflict with c, then a is not
in conflict with c.

Proof: Assume SNPs a and c to be conflicting, that is, there exist fragments f
and g such that the entries M[f, a], M[g, a], M[f, c], M[g, c] are all non-holes and the boolean value
(M[f, a] = M[g, a]) is the negation of (M[f, c] = M[g, c]). Since a < b < c and M is gapless,
M[f, b] and M[g, b] are non-holes as well. Therefore, (M[f, b] = M[g, b]) is either
the negation of (M[f, a] = M[g, a]) or the negation of (M[f, c] = M[g, c]). That
is, b is either conflicting with a or with c.
Note that for computing the entries K[j] we only need to know the sets
OK(j). Note that OK(j) is the set of those i < j which are not neighbors of j in the
SNP-conflict graph G_S(M). Since determining whether two SNPs are in conflict can be
done in time O(m), the cost of creating the OK(j) is O(mn²). This dominates
the cost O(n²) of solving all equations (1).
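Recursion (1) translates directly into code. A minimal sketch (mine; it assumes a predicate conflict(i, j) answering in O(m) whether columns i and j conflict, with 1-based columns and the sentinel column 0):

    def msr_value(n, conflict):
        # K[j]: maximum number of columns keepable with j the rightmost kept column.
        K = [0] * (n + 1)                     # K[0] = 0 is the sentinel
        for j in range(1, n + 1):
            ok = [0] + [i for i in range(1, j) if not conflict(i, j)]
            K[j] = 1 + max(K[i] for i in ok)  # Equation (1)
        return n - max(K)                     # minimum number of SNPs to remove

    # Toy instance: only columns 1 and 2 conflict, so one of them must go.
    print(msr_value(3, lambda i, j: {i, j} == {1, 2}))  # 1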
5.2 MFR: A Dynamic Programming O(m²n + m³) Algorithm

In this section we propose a dynamic programming approach for the solution
of MFR. We remark that, contrary to the approach suggested in [8], nested
fragments will not be a problem. The resulting algorithm can be coded so as to
take O(m²n + m³) time.
Given a row f of M, we denote by l(f) the index of the leftmost SNP s
such that M[f, s] ≠ − and by r(f) the index of the rightmost SNP s such
that M[f, s] ≠ −. In other words, the body of the fragment f is entirely contained
between the SNPs l(f) and r(f). We assume that the rows of M are ordered
so that l(i) ≤ l(j) whenever i < j. For every i ∈ {1, . . . , m}, let M_i be the
matrix made up of the first i rows of M. For h, k ≤ i (with h, k ≥ −1) such that
r(h) ≤ r(k), we define D[h, k; i] as the minimum number of rows to remove to
make M_i error-free, under the condition that:
– row k is not removed, and among the non-removed rows it maximizes r(k);
– row h is not removed and goes into the opposite haplotype as k, and among
such rows it maximizes r(h).
(If all rows are removed then h = −1, k = 0, and D[−1, 0; i] := i. Rows −1
and 0 are all −, that is, empty.)

Once all the D[h, k; i] are known, the solution to the MFR problem is given
by

    min_{h,k : r(h) ≤ r(k)} D[h, k; m].
Clearly, for every i, and for every h, k < i with r(h) ≤ r(k), D[h, k; i] satisfies a
recursion of the same shape as Equations (7) and (8) below (their gapless specialization).
[...] F_1 = {f_1^1, . . . , f_i^1[x_i^1], . . . , f_p^1}
and F_2 = {f_1^2, . . . , f_j^2[x_j^2], . . . , f_q^2} would still be both without conflicts.
We assume that the rows of M are ordered so that l(i) ≤ l(j) whenever i < j.
For every i ∈ {1, . . . , m}, let M_i be the matrix made up of the first i rows of
M. For h, k ≤ i (with h, k ≥ −1) such that r(h) ≤ r(k), and for x, y ∈ {A, B}^k,
we define D[h, x; k, y; i] as the minimum number of rows to remove to make
M_i[h[x], k[y]] error-free, under the condition that:
– row k[y] is not removed, and among the non-removed rows it maximizes r(k);
– row h[x] is not removed and goes into the opposite haplotype as k[y], and
among such rows it maximizes r(h).
(If all rows are removed then h = −1, k = 0, and D[−1, x; 0, y; i] = i for all
x, y ∈ {A, B}^k.)

Once all the D[h, x; k, y; i] are known, the solution to the problem is given
by

    min_{x,y ∈ {A,B}^k; h,k : r(h) ≤ r(k)} D[h, x; k, y; m].
Clearly, for every i, and for every h, k < i with r(h) ≤ r(k), and for every
x, y ∈ {A, B}^k,

    D[h, x; k, y; i] := { D[j, x_j; h, x_h; i − 1]  if r(h) ≥ r(j);
                          D[h, x_h; j, x_j; i − 1]  if r(h) < r(j) }   (7)

Finally, for every i, and for every k < i with r(k) ≤ r(i), and for every
x_i, x_k ∈ {A, B}^k,

    D[i, x_i; k, x_k; i] := min_{(j,x_j) ∈ OK(i,x_i), j ≠ k, r(j) ≤ r(i)} D[j, x_j; k, x_k; i − 1].   (8)
Note that for computing the entries D[h, x; k, y; i] we only need to know
the sets OK(i). The cost of creating the OK(i, x) data structure (done in a first
phase) is O(2^{2k} n m²). The cost of computing the entries D[h, x; k, y; i] (done in
a second phase) is O(2^{3k} m³), since it can be seen as the cost of computing the
O(2^{2k} m³) entries D[h, x; k, y; i] with h < k < i by using Equation (6) (costs
O(1) each) plus the cost of computing the O(2^{2k} m²) entries D[h, x; i, y; i] by
using Equations (7) and (8) (costs O(2^k m) each).
References
1. K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs and graph planarity testing using PQ-tree algorithms. J. Comput. System Sci., 13:335-379, 1976.
2. A. Chakravarti. It's raining SNPs, hallelujah? Nature Genetics, 19:216-217, 1998.
3. A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7:111-122, 1990.
4. D. Gusfield. A practical algorithm for optimal inference of haplotypes from diploid populations. In R. Altman, T.L. Bailey, P. Bourne, M. Gribskov, T. Lengauer, I.N. Shindyalov, L.F. Ten Eyck, and H. Weissig, editors, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 183-189, Menlo Park, CA, 2000. AAAI Press.
5. D. Gusfield. Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In G. Myers, S. Hannenhalli, S. Istrail, P. Pevzner, and M. Waterman, editors, Proceedings of the Sixth Annual International Conference on Computational Biology, pages 166-175, New York, NY, 2002. ACM Press.
6. L. Helmuth. Genome research: Map of the human genome 3.0. Science, 293(5530):583-585, 2001.
7. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001.
8. G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. SNPs problems, complexity and algorithms. In Proceedings of the Annual European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science, pages 182-193. Springer, 2001.
9. R. Lippert, R. Schwartz, G. Lancia, and S. Istrail. Algorithmic strategies for the SNPs haplotype assembly problem. Briefings in Bioinformatics, 3(1):23-31, 2002.
10. C. Lund and M. Yannakakis. The approximation of maximum subgraph problems. In Proceedings of the 20th International Colloquium on Automata, Languages and Programming, pages 40-51. Springer-Verlag, 1994.
11. E. Marshall. Drug firms to create public database of genetic mutations. Science Magazine, 284(5413):406-407, 1999.
12. J.C. Venter et al. The sequence of the human genome. Science, 291:1304-1351, 2001.
13. J. Weber and E. Myers. Human whole genome shotgun sequencing. Genome Research, 7:401-409, 1997.
Methods for Inferring Block-Wise Ancestral History from Haploid Sequences: The Haplotype Coloring Problem

Russell Schwartz¹, Andrew G. Clark¹,², and Sorin Istrail¹

¹ Celera Genomics, Inc., 45 West Gude Dr., Rockville, MD 20850 USA. Phone: (240) 453-3668, Fax: (240) 453-3324. {Russell.Schwartz,Sorin.Istrail}@celera.com
² Department of Biology, Pennsylvania State University, University Park, PA 16802 USA. [email protected]

Abstract. Recent evidence for a "blocky" haplotype structure to the human genome, and for its importance to disease inference studies, has created a pressing need for tools that identify patterns of past recombination in sequences of samples of human genes and gene regions. We present two new approaches to the reconstruction of likely recombination patterns from a set of haploid sequences, each of which combines combinatorial optimization techniques with statistically motivated recombination models. The first breaks the problem into two discrete steps: finding recombination sites, then coloring sequences to signify the likely ancestry of each segment. The second poses the problem as optimizing a single probability function for parsing a sequence in terms of ancestral haplotypes. We explain the motivation for each method, present algorithms, show their correctness, and analyze their complexity. We illustrate and analyze the methods with results on real, contrived, and simulated datasets.
1 Introduction

The sequencing of the human genome [12,25] has created a tremendous opportunity
for medical advances through the discovery of genetic predictors of disease.
So far, though, catalogs of the genetic differences between individuals have
proven difficult to apply. Examined in isolation, these differences - which occur
predominantly in the form of isolated changes called single nucleotide polymorphisms
(SNPs) - may fail to distinguish real relationships from the background
noise millions of SNPs produce. Recent analysis of the structure of the human
genome [14] has given hope that greater success will be achieved through studies
of haplotypes, sets of alleles from all SNPs in a region that tend to travel together.
[...]

    G(X) = Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i ID(χ_j(s_{i,j}) ≠ χ_{j+1}(s_{i,j+1})),   where ID(b) = 1 if b is true, 0 if b is false.
The function ID tests whether, for a given i and j, there is a color change in
sequence i between blocks j and j + 1. G thus counts the total number of color
changes between consecutive blocks over all sequences. We will show two key
properties of this problem that allow us to solve it efficiently. First, greedily
coloring each block optimally in terms of the previous block (minimizing the
number of color changes given the previous block's coloring) yields a globally
optimal solution. Second, coloring each block is an instance of the weighted
bipartite maximum matching problem, for which there exist efficient algorithms
[5]. Figure 2 illustrates the resulting method, which we now examine.
Lemma 1. Any set X such that χ_{j+1} minimizes

    G_j(X) = Σ_{i=1}^{n} n_i ID(χ_j(s_{i,j}) ≠ χ_{j+1}(s_{i,j+1}))

given χ_j, for all j ∈ [1, k − 1], will minimize the objective function G.
Proof. Assume we have a mapping X such that χ_{j+1} minimizes G_j(X) given
χ_j for all j in [1, k − 1]. Assume further, for the purposes of contradiction, that
there exists another solution X′ such that Σ_{j=1}^{k−1} G_j(X′) < Σ_{j=1}^{k−1} G_j(X). Then
there must be some smallest j such that G_j(X′) < G_j(X). We can then create
a new X″ such that χ″_{j+1}(c) = χ_j(c) if and only if χ′_{j+1}(c) = χ′_j(c). Then X″
attains the cost G_j(X′) for the transition from region j to j + 1, which is strictly lower than that for X on
that transition. Thus, X could have chosen a better solution for the transition
from j to j + 1 given its solutions for all previous transitions, contradicting the
assumption that X minimizes G_j(X) given χ_j for all j. Thus, X′ cannot exist
and X must minimize G.
Lemma 2. Finding the optimal χ_{j+1} given χ_j can be expressed as weighted
maximum matching.
Proof. We prove the lemma by construction of the instance of
weighted maximum matching. We first rewrite G as

    (k − 1) Σ_{i=1}^{n} n_i − Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})).

Since (k − 1) Σ_{i=1}^{n} n_i does not depend on X, minimizing our original objective function is equivalent to
maximizing Σ_{j=1}^{k−1} Σ_{i=1}^{n} n_i ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})). We create a bipartite
graph B in which each node u_{i′} in the first part corresponds to a haplotype
c_{j,i′} in block j and each node v_{i″} in the second part corresponds to a haplotype
c_{j+1,i″} in block j + 1. If any sequence has haplotypes c_{j,i′} and c_{j+1,i″}, then we
create an edge (u_{i′}, v_{i″}) with weight Σ_{i=1}^{n} n_i HAS_{j,i′,i″}(s_i), where

    HAS_{j,i′,i″}(s) = 1 if s_j = c_{j,i′} and s_{j+1} = c_{j+1,i″}, and 0 otherwise.

A matching in B corresponds to a set of edges pairing haplotypes in block j with
haplotypes in block j + 1. We construct X so that it assigns the same color to
c_{j+1,i″} as was assigned to c_{j,i′} if and only if the matching of B selects the edge
between the nodes corresponding to those two haplotypes. Any given matching
will have a weight equal to the sum of the frequencies of sequences sharing
both of each pair of haplotypes whose corresponding nodes are connected by
the matching. Thus the coloring corresponding to a maximum matching will
maximize Σ_{i=1}^{n} n_i Σ_{j=1}^{k−1} ID(χ_j(s_{i,j}) = χ_{j+1}(s_{i,j+1})), which yields an optimal
assignment of χ_{j+1} given χ_j.
Lemmas 1 and 2 imply the following algorithm for optimal coloring, which
we refer to as Algorithm 1:

    create an arbitrary assignment χ_1 of haplotypes to distinct colors in block 1
    for i = 2 to m:
        construct an instance of weighted maximum matching as described above
        solve the instance with the algorithm of Edmonds [5]
        for each pair (c_{i−1,j}, c_{i,k}) joined in the matching, assign χ_i(c_{i,k}) = χ_{i−1}(c_{i−1,j})
        for each c_{i,k} that is unmatched, assign an arbitrary unused color to χ_i(c_{i,k})
Theorem 1. Algorithm 1 produces a coloring of haplotype blocks minimizing
the number of color changes across sequences in time O(mn² log n).

Proof. The proof of correctness follows directly from Lemmas 1 and 2. Creating
an instance of weighted maximum matching and assigning colors requires
O(n) time. The run time for each iteration of i is therefore dominated by the
O(n² log n) run time of the maximum matching algorithm for this type of dataset
(where the number of non-zero edges of the graph is bounded by n). There are
O(m) rounds of computation, yielding a total run time of O(mn² log n).

Using the Hudson and Kaplan algorithm for block assignment gives the block-color
method an overall complexity of O(nm² + mn² log n) for n sequences and
m polymorphic sites. In practice, the input data would typically come from
resequencing some number of individuals in one sequenced gene - with a bound
on m and n typically on the order of 100 - yielding run times well within what
can be handled by standard desktop computers.
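One round of Algorithm 1 can be sketched as follows. Since the instance built in Lemma 2 is bipartite, SciPy's Hungarian-method solver can stand in for the general matching algorithm of Edmonds [5]; the function names and the input encoding below are my own:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def color_next_block(colors_prev, hap_prev, hap_next, freqs):
        # colors_prev[h]: color of haplotype h in block j; hap_prev/hap_next give each
        # sequence's haplotype index in blocks j and j+1; freqs are the weights n_i.
        a, b = max(hap_prev) + 1, max(hap_next) + 1
        W = np.zeros((a, b))
        for hp, hn, f in zip(hap_prev, hap_next, freqs):
            W[hp, hn] += f                        # total frequency preserved by pairing
        rows, cols = linear_sum_assignment(-W)    # maximum-weight bipartite matching
        colors_next = [None] * b
        for rr, cc in zip(rows, cols):
            colors_next[cc] = colors_prev[rr]
        fresh = iter(c for c in range(a + b) if c not in set(colors_prev))
        return [c if c is not None else next(fresh) for c in colors_next]

    # Two haplotypes per block, three sequences: colors follow the sequences.
    print(color_next_block([0, 1], [0, 0, 1], [1, 1, 0], [1, 1, 1]))  # [1, 0]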
3 The Alignment Method

Although solving the problem in well-defined stages has advantages, it may also
be fruitful to find a unified global solution to all aspects of the problem. Our
hope is that such an approach will help in finding more biologically meaningful
probabilistic models that capture the essential features of the system but are
computationally tractable. We therefore developed a second approach based on
techniques from sequence alignment. Sequence alignment can be viewed as assigning
to each position of the target sequence a frame of a reference sequence,
with changes of frame representing gaps. We can analogously align a single
target to multiple references, but instead of shifting frames to model gaps, shift
reference sequences to model recombination. Figure 3 illustrates this analogy.
This approach is similar to the Recombination Cost problem of Kececioglu and
Gusfield [15] and the jumping alignments of Spang et al. [23].

Fig. 3. An illustration of the analogy between sequence alignment and haplotype coloring as variants of the problem of parsing a query sequence in terms of a set of reference sequences. Each sub-figure shows a query sequence (bottom) aligned to three reference sequences (top). A: sequence alignment as parsing of a sequence in terms of a set of identical reference sequences in distinct frames; B: haplotype coloring as parsing a sequence in terms of a set of distinct reference sequences in identical frames.

We, however, assume we have a population sequenced for specific polymorphic sites and therefore
need not consider insertion/deletion costs. Deletion polymorphisms can be
treated simply as additional allowed symbols. Dealing with missing data is more
complicated, but can be handled through the use of a special symbol representing
an undefined site, which can match any other symbol. Bayesian methods,
similar to that of Stephens et al. [24], might also be used to impute missing data.

A major advantage of this technique over our other method is that it does
not assume that there are recombination hotspots or a block structure. It may
therefore be better suited to testing that assumption or examining genomes or
genome regions in which it proves to be inapplicable. It should also allow us
to distinguish recent recombination sites affecting a small fraction of the sequenced
population from more ancient or frequently recombining sites, and it is
easily parameterized to be more or less sensitive to recent mutations in identifying
haplotypes. Among its disadvantages are that it requires an additive
objective function, and therefore uses a probability model different from those
traditionally used in LD studies, and that it is parameterized by values that may
be unknown and difficult to estimate. In addition, the function being optimized
is harder to understand intuitively than those used in the block-color method,
making the alignment method harder to judge and improve upon.
Our probability model parses a sequence as a string of haplotype identifiers
describing the ancestral source of each of its polymorphic sites. A given sequence
chooses its first value from a distribution of haplotypes at the first polymorphic
position. Each subsequent polymorphic site may follow a recombination event
with some probability. If there is no recombination event, then the sequence
continues with the same haplotype as it had at the prior site. Otherwise, the
sequence samples among all available haplotypes according to a site-specific
distribution for the new site. There is also a mutation probability that any given
site will be mutated from that of its ancestral haplotype. This model leads to
the following formalization of the inputs:

– m, the log probability of a mutation event at any one site of a sequence
– r, the log probability of a recombination event between any two sites
– S = s_1, . . . , s_n, a set of n reference sequences. Each s_i is a sequence of l polymorphic sites s_{i1}, . . . , s_{il}
– F, an n × l matrix of log frequencies, in which f_{ij} specifies the probability of choosing haplotype i at site j following a recombination immediately prior to that site
– Θ = θ_1, . . . , θ_t, a set of t target sequences. Each θ_i is a sequence of l polymorphic values θ_{i1}, . . . , θ_{il}

Our goal is to produce a t × l matrix H, where h_{ij} specifies which haplotype from
the set [1, n] has been assigned to position j of target sequence θ_i, maximizing
the following objective function:
    G(H) = Σ_{i=1}^{t} f_{h_{i1},1}
         + Σ_{i=1}^{t} Σ_{j=2}^{l} (f_{h_{ij},j} + r) D(h_{i,j}, h_{i,j−1})
         + Σ_{i=1}^{t} Σ_{j=2}^{l} log((1 − e^r) + e^{r + f_{h_{ij},j}}) (1 − D(h_{i,j}, h_{i,j−1}))
         + Σ_{i=1}^{t} Σ_{j=1}^{l} M(s_{h_{i,j},j}, θ_{i,j})

where D(a, b) = 0 if a = b and 1 if a ≠ b, and M(a, b) = 0 if a = b and m if a ≠ b.
G(H) gives a normalized log probability of the assignment H given the sequences
and haplotype frequencies. The first sum reflects the probabilities of choosing
different starting haplotypes. The next two sums reflect the contributions of,
respectively, choosing to recombine at each site or choosing not to recombine.
The final sum gives the contribution to the probability of mismatches.
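For concreteness, the objective translates line by line into code; the sketch below (my transcription, with 0-based indices and a hypothetical toy input) evaluates G(H) for a fixed assignment:

    import math

    def G(H, S, theta, F, r, m):
        # H[i][j]: haplotype assigned at site j of target i; S: reference sequences;
        # F[h][j]: log frequency f_{hj}; r, m: log recombination/mutation probabilities.
        total = 0.0
        for i in range(len(H)):
            total += F[H[i][0]][0]                         # starting haplotype term
            for j in range(1, len(H[i])):
                if H[i][j] != H[i][j - 1]:                 # recombination at site j
                    total += F[H[i][j]][j] + r
                else:                                      # no haplotype change
                    total += math.log((1 - math.exp(r)) + math.exp(r + F[H[i][j]][j]))
            for j in range(len(H[i])):                     # mismatch (mutation) terms
                if S[H[i][j]][j] != theta[i][j]:
                    total += m
        return total

    S = ["AT", "GC"]; F = [[-0.7, -0.7], [-0.7, -0.7]]
    print(G([[0, 1]], S, ["AC"], F, r=-1.0, m=-2.0))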
The function implies some assumptions about the independence of different
events whose validity must be considered. The assumptions that the probabilities of
recombination and mutation are independent and identical at all sites are imperfect,
but may be reasonable a priori in the absence of additional information. It
is less clearly reasonable to assume that the selection of recombination positions
and the choices of haplotypes between them can be considered independent,
although this too may be reasonable absent additional information.

It is important to note that there is no unique correct pair of r and m
parameters for a given genome or gene region. The right parameters depend
on how much tolerance is desired in allowing slightly different sequences to be
considered identical haplotypes, analogous to the need for different sensitivities
in sequence similarity searches conducted for different purposes. We can thus
propose that the value of m should be determined by the particular application
of the method, with r then being an unknown that depends on both m and the
nature of the genome region under examination.
3.1 The Alignment Algorithm

We maximize G(H) with a dynamic programming algorithm. The following formula
describes the basic dynamic programming recursion for assigning haplotypes
to a single sequence θ:

    C(i, j) = max{ C(i, j − 1) + log((1 − e^r) + e^{r + f_{ij}}),
                   max_{k≠i} C(k, j − 1) + r + f_{ij} }
              + { 0 if θ_j = s_{ij};  m if θ_j ≠ s_{ij} }.
    for i = 1 to n: if θ_1 = s_{i1} then C[i, 1] ← f_{i1} else C[i, 1] ← f_{i1} + m
    for j = 2 to l: for i = 1 to n:
        best ← C[i, j − 1] + log((1 − e^r) + e^{r + f_{ij}}); argbest ← i
        for k = 1 to n, k ≠ i:
            if (C[k, j − 1] + r + f_{ij} > best)
                best ← C[k, j − 1] + r + f_{ij}; argbest ← k
        if (θ_j = s_{ij}) then C[i, j] ← best else C[i, j] ← best + m
        P[i, j] ← argbest
    best ← −∞
    for i = 1 to n: if C[i, l] > best then best ← C[i, l]; H[l] ← i
    for j = l − 1 downto 1: H[j] ← P[H[j + 1], j + 1]
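An executable version of the recursion and backtracking above (my transcription, with 0-based indices; the toy example at the end is illustrative):

    import math

    def align_haplotypes(theta, S, F, r, m):
        # theta: target sequence; S: n references; F[i][j]: log frequency f_ij;
        # r, m: log recombination / mutation probabilities (negative numbers).
        n, l = len(S), len(theta)
        C = [[0.0] * l for _ in range(n)]
        P = [[0] * l for _ in range(n)]
        for i in range(n):
            C[i][0] = F[i][0] + (0.0 if theta[0] == S[i][0] else m)
        for j in range(1, l):
            for i in range(n):
                best = C[i][j - 1] + math.log((1 - math.exp(r)) + math.exp(r + F[i][j]))
                argbest = i                                   # continue same haplotype
                for k in range(n):
                    if k != i and C[k][j - 1] + r + F[i][j] > best:
                        best = C[k][j - 1] + r + F[i][j]      # recombine from k
                        argbest = k
                C[i][j] = best + (0.0 if theta[j] == S[i][j] else m)
                P[i][j] = argbest
        H = [max(range(n), key=lambda i: C[i][l - 1])]        # best final haplotype
        for j in range(l - 1, 0, -1):
            H.append(P[H[-1]][j])                             # follow back-pointers
        H.reverse()
        return H

    S = ["AAAA", "TTTT"]; F = [[math.log(0.5)] * 4 for _ in range(2)]
    print(align_haplotypes("AATT", S, F, r=-2.0, m=-3.0))     # [0, 0, 1, 1]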
Lemma 3. C[i, 1] [...] accordingly, satisfying the inductive hypothesis. Now assume the lemma is true for j − 1. We
can decompose

    C[i, j] + B[i, j] = Σ_{i=1}^{t} f_{h_{i1},1}
        + Σ_{i=1}^{t} Σ_{j′=2}^{j−1} (f_{h_{ij′},j′} + r) D(h_{i,j′}, h_{i,j′−1})
        + Σ_{i=1}^{t} Σ_{j′=2}^{j−1} log((1 − e^r) + e^{r + f_{h_{ij′},j′}}) (1 − D(h_{i,j′}, h_{i,j′−1}))
        + Σ_{i=1}^{t} Σ_{j′=1}^{j−1} M(s_{h_{i,j′},j′}, θ_{i,j′})

and

    B[i, j] = Σ_{i=1}^{t} (f_{h_{ij},j} + r) D(h_{i,j}, h_{i,j−1})
        + Σ_{i=1}^{t} log((1 − e^r) + e^{r + f_{h_{ij},j}}) (1 − D(h_{i,j}, h_{i,j−1}))
        + Σ_{i=1}^{t} M(s_{h_{i,j},j}, θ_{i,j}).

[...] A[i, j − 1]. [...] Taking the maximum of C[i, l] over i,
as is done by the third outer loop, therefore yields the global optimum to
the problem. The final loop performs backtracking, reconstructing the optimal
solution H. Run time is dominated by an inner loop requiring constant time for
each of O(n²l) iterations, for a total of O(n²lt) run time when run on t targets.
The definition of the problem solved by the alignment method requires that
we know in advance the frequencies from which haplotypes are sampled following
a recombination event. As this information may not be available, we would like
a way to estimate it from measured frequencies of full-length haploid sequences.
We therefore developed an iterative method to optimize this probability by successively
performing a discrete optimization of the coloring given the haplotype
frequencies, followed by a continuous optimization of the frequencies given the
coloring, which we perform through a steepest-descent search. The algorithm is
a form of generalized expectation maximization (EM) algorithm [1,4] that treats
the frequencies as hidden variables, repeatedly finding a maximum a posteriori
probability (MAP) coloring H given the frequencies F maximizing Pr(H|F) by
Algorithm 2, then improving Pr(H|F) in terms of F by steepest descent. The
number of rounds required for the resulting method to converge might theoretically
be large, although we have found convergence in practice to occur reliably
within ten iterations on real gene region data sets.
4 Results

Both methods were implemented in C++. All tests were run on four-processor
500 MHz Compaq Alpha machines, although the code itself is serial.

We used a contrived data set, shown in Figure 4, to illustrate the strengths
and weaknesses of the methods. We strongly penalized mutations in order to
simplify the illustration. The block-color method correctly detects the recombination
site, although the coloring appears suboptimal due to the constraint that
identical haplotypes must have identical colors. The alignment method correctly
identifies the recombinants, but only with an appropriate choice of parameters.

In order to demonstrate the methods and explore the parameter space, we
further applied the methods to a real data set: the apolipoprotein E (APOE)
gene region core sample of Nickerson et al. [19], a set of 72 individuals typed
on 22 polymorphic sites, with full-length haplotypes determined by a combination
of the Clark haplotype inference method [2] and allele-specific PCR to verify
phases. Figure 5 demonstrates the effects on the block-color algorithm of varying
tolerance to infrequent SNPs.
Fig. 4. A contrived sample problem. Different colors represent different predicted ancestral sequences. A: An ancestral recombination graph showing a proposed sequence ancestry; B: The extant sequences (AAAAAA ×16, AAAATA ×16, ATATAA ×4, ATTTAA ×1, ATTAAT ×16, ATAAAA ×16), with frequencies chosen to make the recombinant and double recombinant relatively rare; C: The output of the block-color method screening out sites with minor allele frequencies below 0.1; D: The output of the alignment method with parameters r = −1 and m = −100; E: The output of the alignment method with parameters r = −2 and m = −100.
Moving from considering all SNPs in Figure 5A
to considering only those with minor allele frequencies above 10% in Figure 5B, then
25% in Figure 5C, leads to progressively simpler block structures, with fewer
blocks and less variability within them. Figure 6 illustrates the effects of varying
recombination and mutation penalties for the alignment algorithm with the EM
extension on the same dataset. Higher recombination penalties generally lead
to greater numbers of haplotypes being assigned, while higher mutation penalties
reveal more subtle regions of variation. We note that inferred haplotype
regions do not necessarily line up at recombination hotspots, suggesting that
the data might be more parsimoniously explained by not assuming the existence
of discrete haplotype blocks with the same structure across all sequences.

While a real dataset can demonstrate the methods, it cannot rigorously validate
them, as we do not definitively know the recombination history of any real
gene region. We therefore resorted to simulated data to perform a partial test of
the methods. Simulations were generated through Hudson's coalescent simulator
[10], using populations of 50 individuals with 70 segregating sites and allowing
recombination between any pair of sites. A simulated data set was generated for
each recombination parameter ρ in the set {0, 1, 2, 5, 10, 20, 30}. We then calculated
for the block-color method how many recombination sites were predicted,
screening out sites with minor allele frequency below 0.1. We further calculated
for the alignment method, with parameters m = −1.5 and r = −1.0, at how
many sites at least one recombination event was predicted. Figure 7 shows the
resulting plot.
Fig. 5. Coloring of the APOE gene region [19] by the block-color method. A: coloring using all sites; B: coloring using sites with minor allele frequency above 10%; C: coloring using sites with minor allele frequency above 25%.
Fig. 6. Coloring of the APOE gene region [19] by the alignment-color method. Parameter values are A: r = −1.0, m = −2.0; B: r = −1.0, m = −4.0; C: r = −1.0, m = −6.0; D: r = −0.5, m = −2.0; E: r = −0.5, m = −4.0; F: r = −0.5, m = −6.0; G: r = 0.0, m = −2.0; H: r = 0.0, m = −4.0; I: r = 0.0, m = −6.0.
As the plot indicates, the number of detected recombination sites
generally increases with increasing recombination rate, although the correlation
is imperfect and grows more slowly than the increase in ρ. Of course the process
of simulating such data involves a high degree of stochasticity, so one does not
expect a perfect correlation. For the block-color method, this result is consistent
with the analogous experiments performed by Hudson and Kaplan [11].

Fig. 7. Predicted recombination sites versus coalescent recombination rate for simulated data, for the block-color and alignment methods.
5 Discussion

We have presented two new methods for detecting recombination patterns in sets
of haploid sequences, phrased in terms of the problem of coloring sequences
to reflect their ancestral histories. The first method uses parsimony principles to
separately infer a minimum number of recombination sites capable of explaining
the data and then color sequences between sites to denote likely ancestry. The
second uses a discrete optimization method similar to those used in sequence
alignment to find the most probable parse of a set of sequences in terms of haplotypes.
We have also incorporated that technique into an iterative method using
alternating discrete and continuous optimization to simultaneously infer haplotype
frequencies and color sequences optimally given the inferred frequencies.

Each method has strengths that might make it more appropriate for certain
cases. The block-color method creates a general algorithmic framework in
which optimal solutions can be found efficiently for a range of possible tests of
compatibility of pairs of sites with the assumption of no recombination. It thus
might provide a useful general method for the evaluation and application of new
statistical tests for recombination. The alignment method solves optimally for a
single objective function, making it potentially more useful in testing and applying
a unified probabilistic model of sequence evolution. It also does not rely on a
prior assumption that haplotypes have a block structure and therefore might be
more useful for testing hypotheses about the existence of such a block structure,
finding instances in which it is not conserved, or processing data from organisms
or gene regions that do not exhibit haplotype blocks. The EM algorithm may be
independently useful in estimating haplotype block frequencies.
We can consider possible generalizations of the methods described above. It
may be worthwhile to try other tests for deriving pair-wise constraints for the
block-color method. As Hudson and Kaplan [11] note, the number of recombination
events detected by the four-gamete constraint may be substantially smaller
than the actual number of recombination sites, especially for small population
sizes; their method finds a maximally parsimonious explanation for the data and
will therefore miss instances in which recombination has occurred but has not
yielded a violation of the four-gamete test, or in which the population examined
does not contain sufficient examples to demonstrate such a violation. The
Gusfield [8] method for diploid data can be expected to be similarly conservative,
suggesting the value of pursuing more sensitive tests of recombination. The
alignment model could be extended to handle position-specific mutation weights
or recombination probabilities when an empirical basis is available for choosing
them. It might also be possible to adapt the EM algorithm to infer mutation or
recombination probabilities at the same time as it infers haplotype frequencies.
In addition to providing greater versatility and accuracy, automating inference of
the mutation and recombination rates might substantially improve ease-of-use.
Making the best use of sequencing data for understanding human diversity
and applying that understanding to association studies will require first developing
a more complete picture of the processes involved; second, building and
validating statistical and probabilistic models that reliably capture that picture;
and third, developing methods that can best interpret the available data given
those models. While the preceding work is meant to suggest avenues and techniques
for pursuing the overall goal of applying human genetic diversity data,
all three of those steps remain far from resolved.
Acknowledgments
We thank Vineet Bafna, Clark Mobarry, and Nathan Edwards for their insights
into block identication and Hagit Shatkay and Ross Lippert for their advice and
guidance. We also thank Michael Waterman and various anonymous referees for
their advice on the preparation of this manuscript.
References
1. Baum, L.E., Petrie, T., Soules, G., and Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals Math. Stat., 41, 164-171, 1970.
2. Clark, A.G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7, 111-122, 1990.
3. Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. High-resolution haplotype structure in the human genome. Nature Gen., 29, 229-232, 2001.
4. Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B, 39, 1-38, 1977.
5. Edmonds, J. Paths, trees, and flowers. Canad. J. Math., 17, 449-467, 1965.
6. Fearnhead, P. and Donnelly, P. Estimating recombination rates from population genetic data. Genetics, 159, 1299-1318, 2001.
7. Gusfield, D. Efficient algorithms for inferring evolutionary history. Networks, 21, 19-28, 1991.
8. Gusfield, D. Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In Proc. 6th Intl. Conf. Comp. Biol., RECOMB'02, 166-175, 2002.
9. Hein, J. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol., 20, 402-411, 1993.
10. Hudson, R.R. Properties of the neutral allele model with intergenic recombination. Theoret. Pop. Biol., 23, 183-201, 1983.
11. Hudson, R.R. and Kaplan, N.L. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111, 147-164, 1985.
12. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409, 860-921, 2001.
13. Jeffreys, A.J., Kauppi, L., and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Gen., 29, 217-222, 2001.
14. Johnson, G.C.L., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F., Twells, R.C.J., Payne, F., Hughes, W., Nutland, S., Stevens, H., Carr, P., Tuomilehto-Wolf, E., Tuomilehto, J., Gough, S.C.L., Clayton, D.G., and Todd, J.A. Haplotype tagging for the identification of common disease genes. Nature Gen., 29, 233-237, 2001.
15. Kececioglu, J. and Gusfield, D. Reconstructing a history of recombinations from a set of sequences. Disc. Appl. Math., 88, 239-260, 1998.
16. Kimura, M. Theoretical foundations of population genetics at the molecular level. Theoret. Pop. Biol., 2, 174-208, 1971.
17. Maynard Smith, J. Analyzing the mosaic structure of genes. J. Mol. Evol., 34, 126-129, 1992.
18. Maynard Smith, J. and Smith, N.H. Detecting recombination from gene trees. Mol. Biol. Evol., 15, 590-599, 1998.
19. Nickerson, D.A., Taylor, S.L., Fullerton, S.M., Weiss, K.M., Clark, A.G., Stengård, J.H., Salomaa, V., Boerwinkle, E., and Sing, C.F. Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Gen. Res., 10, 1532-1545, 2000.
20. Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., Nguyen, B.T.N., Norris, M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P., Thomas, D.J., Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P.A., and Cox, D.R. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719-1723, 2001.
21. Posada, D. and Crandall, K.A. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl. Acad. Sci. USA, 98, 13757-13762, 2001.
22. Sawyer, S. Statistical tests for detecting gene conversion. Mol. Biol. Evol., 6, 526-536, 1989.
23. Spang, R., Rehmsmeier, M., and Stoye, J. Sequence database search using jumping alignments. In Proc. Intel. Sys. Mol. Biol., ISMB'00, 367-375, 2000.
24. Stephens, M., Smith, N.J., and Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Gen., 68, 978-989, 2001.
25. Venter, J.C., Adams, M.D., Myers, E.W., et al. The sequence of the human genome. Science, 291, 1304-1351, 2001.
26. Wang, L., Zhang, K., and Zhang, L. Perfect phylogenetic networks with recombination. J. Comp. Biol., 8, 69-78, 2001.
27. Weiller, G.F. Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Mol. Biol. Evol., 15, 326-335, 1998.
28. Wiuf, C., Christensen, T., and Hein, J. A simulation study of the reliability of recombination detection methods. Mol. Biol. Evol., 18, 1929-1939, 2001.
29. Zhang, K., Deng, M., Chen, T., Waterman, M.S., and Sun, F. A dynamic programming algorithm for haplotype block partition. Proc. Natl. Acad. Sci. USA, 99, 7335-7339, 2002.
Finding Signal Peptides in Human Protein Sequences Using Recurrent Neural Networks

Martin Reczko¹, Petko Fiziev², Eike Staub², and Artemis Hatzigeorgiou³

¹ Synaptic Ltd., Science and Technology Park of Crete, P.O. Box 1447, 711 10 Voutes Heraklion, Greece
² metaGen Pharmaceuticals GmbH, Oudenader Str. 16, D-13347 Berlin, Germany
³ Department of Genetics, University of Pennsylvania, School of Medicine, Philadelphia, PA 19104-6145, USA

Abstract. A new approach called Sigfind for the prediction of signal peptides in human protein sequences is introduced. The method is based on the bidirectional recurrent neural network architecture. The modifications to this architecture and a better learning algorithm result in a very accurate identification of signal peptides (99.5% correct in fivefold cross-validation). The Sigfind system is available on the WWW for predictions (http://www.stepc.gr/~synaptic/sigfind.html).
1 Introduction

In the last few years, the complete or nearly complete sequence of a large
number of genomes, including the human genome, has been determined. One
of the expected consequences of sequencing these genomes is the discovery of
new drugs. The sorting of subcellular proteins is a fundamental aspect of finding
such drugs, because such proteins have the potential to leave the cell and move
through the organism.

In humans, secretory proteins account for about one-tenth of the proteome.
As many functional characteristics of proteins can be correlated with
more or less well-defined linear motifs in their amino acid sequences, in this
case too sorting depends on signals that can already be identified in the primary
structure of a protein.

Signal peptides are N-terminal extensions on the mature polypeptide chain.
They are cleaved from the mature part of the protein at the cleavage site. The
basic design of signal peptides is constituted by three distinct regions: a short
positively charged N-terminal domain (N-region), a central hydrophobic stretch
of 7-15 residues (H-region), and a more polar C-terminal part of 3-7 residues that
defines a processing site for the signal peptidase enzyme (C-region) [18].

Prediction of signal peptides has a long history, starting back in the early eighties
with simple algorithms based on hydrophobicity calculations [8] and the later use
of a positional weight matrix for the cleavage site [17]. The application of more
sophisticated methods such as rule-based systems [7], neural networks (NNs) [13],
hidden Markov models (HMMs) [14] and linear programming [9] has increased
the levels of accuracy considerably. A more detailed review can be found in [12].
In this paper a new method for defining signal peptides is introduced. The
method is based on a novel neural network architecture called bidirectional recurrent
neural nets (BRNN) that was recently introduced by [2]. The BRNN
architecture is used for sequential learning problems with sequences of finite
lengths. Other recurrent NNs can only model causal dynamical systems, where
the output of the system at a certain time does not depend on future inputs.
The BRNN model has a symmetric memory for storing influences of past and
future inputs to an output at time t.

This non-causal behavior is useful for the prediction of properties of biosequences,
since most properties are influenced by elements of those sequences
both upstream and downstream of that property. The concept of time does not
apply to this class of sequential learning problems, so we know about the future
sequence following a sequence position t, and the system is not violating
any physical causality.

We modify the architecture and introduce the combination with a more efficient
learning algorithm that produces networks with better generalization performance
in less computational time. In the results section we present a comparison
with the methods available on the Internet and document the achieved improvements.
2 Material and Methods

The data used here is the same data as was used for the first SignalP prediction system [13] and was made available by the authors.^1 That data was derived from SWISS-PROT version 29 [1]. For the removal of redundancy in their data set, Nielsen et al. [13] eliminated any pair of sequences with more than 17 identical residues in a local alignment.
From this non-redundant data set we use only the human sequences, which contain 416 signal peptides, and the N-terminal parts of 97 cytoplasmic and 154 nuclear proteins as negative examples.

^1 The SignalP datasets are available at ftp://virus.cbs.dtu.dk/pub/signalp
2.1 Neural Network Architecture

The neural network architecture employed for the task is a modified version of the BRNN as described in [2]. The BRNN architecture is summarized here and the modifications are introduced.
The state vectors F_t ∈ R^n and B_t ∈ R^m contain the context information while reading forward and backwards, respectively. They are calculated as

    F_t = φ(F_{t-1}, U_t)    (1)
    B_t = β(B_{t+1}, U_t)    (2)
where φ(·) and β(·) are nonlinear transition functions realized by multilayer perceptrons (MLPs). For training, the recurrent NN is unfolded in time and the backpropagation algorithm [16] is used to calculate the gradient for all weights. The unfolding is illustrated in Figure 2. This transformation of a recurrent NN into an equivalent feedforward NN is called backpropagation through time; it was first described in [11], and the application of backpropagation learning to these networks was introduced in [16].
The gradient for a weight is summed up over all training sequences and each shared position within each sequence.
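Equations (1) and (2) translate directly into two sweeps over the input sequence. The following Python/NumPy sketch shows the state propagation only; the tanh transition, the weight shapes, and the zero boundary states are illustrative assumptions, not the exact parameterization used by the system described here.

    import numpy as np

    def brnn_states(U, Wf, Wb, n, m):
        """Propagate BRNN context states over an input sequence U of shape (T, d).

        Wf: forward transition weights of shape (n, n + d); Wb: backward
        transition weights of shape (m, m + d). Returns the forward states
        F (T, n) computed left to right and the backward states B (T, m)
        computed right to left, as in equations (1) and (2)."""
        T, d = U.shape
        F = np.zeros((T, n))
        B = np.zeros((T, m))
        f = np.zeros(n)                      # boundary state F_0
        for t in range(T):                   # F_t = phi(F_{t-1}, U_t)
            f = np.tanh(Wf @ np.concatenate([f, U[t]]))
            F[t] = f
        b = np.zeros(m)                      # boundary state B_{T+1}
        for t in reversed(range(T)):         # B_t = beta(B_{t+1}, U_t)
            b = np.tanh(Wb @ np.concatenate([b, U[t]]))
            B[t] = b
        return F, B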
One well-known obstacle when training recurrent NNs is the problem of vanishing gradients [3]. In most cases, the gradient decays exponentially with the number of layers in a NN. Since the unfolded network has at least as many layers as the length of the input sequences, there is a limit on storing features that might be relevant for the prediction of a property but are separated from that property by long stretches of sequence.
A learning algorithm that performs well without depending on the magnitude of the weight gradients is the resilient backpropagation (RPROP) algorithm [15]. Only the sign of the derivative is considered to indicate the direction of the weight update. The size of the weight change is determined by a weight-specific update-value Δ_ij^(epoch) and defined as

    Δw_ij^(epoch) =  -Δ_ij^(epoch),  if (∂E/∂w_ij)^(epoch) > 0
                     +Δ_ij^(epoch),  if (∂E/∂w_ij)^(epoch) < 0
                      0,             else                          (4)
where (∂E/∂w_ij)^(epoch) denotes the gradient information summed over all patterns of the pattern set (one epoch in batch learning). After each batch update, the new update-values Δ_ij^(epoch) are determined using a sign-dependent adaptation process:

    Δ_ij^(epoch) =  η+ · Δ_ij^(epoch-1),  if (∂E/∂w_ij)^(epoch-1) · (∂E/∂w_ij)^(epoch) > 0
                    η- · Δ_ij^(epoch-1),  if (∂E/∂w_ij)^(epoch-1) · (∂E/∂w_ij)^(epoch) < 0
                    Δ_ij^(epoch-1),       else                                              (5)

where η- = 0.5 and η+ = 1.05 are fixed values.
Every time the partial derivative of the corresponding weight w_ij changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum, the update-value Δ_ij^(epoch) is decreased by the factor η-.
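As a concrete illustration of update rules (4) and (5), here is a minimal RPROP batch step in Python/NumPy. The initial update-values and the clipping bounds delta_min/delta_max are common defaults and are assumptions of this sketch; the η values are the ones quoted above.

    import numpy as np

    def rprop_step(w, grad, prev_grad, delta,
                   eta_plus=1.05, eta_minus=0.5,
                   delta_min=1e-6, delta_max=50.0):
        """Apply one RPROP batch update following equations (4) and (5).

        w: weight array; grad, prev_grad: gradients dE/dw summed over the
        current and previous epoch; delta: per-weight update-values.
        Returns the updated weights and update-values."""
        s = grad * prev_grad                 # sign of the gradient product, Eq. (5)
        delta = np.where(s > 0, np.minimum(delta * eta_plus, delta_max),
                np.where(s < 0, np.maximum(delta * eta_minus, delta_min), delta))
        # Eq. (4): move each weight against the sign of its gradient by delta.
        w = w - np.sign(grad) * delta
        return w, delta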
[...] Σ_{j=1}^{|σ|} μ(σ_j) = M_i.

For convenience, we denote the maximum relevant substring mass by M_max = max_i M_i and the minimum relevant substring mass by M_min = min_i M_i. Further, since the set of query masses may contain repeated values, we define R_max to be the maximum number of repeats.
While we will concentrate primarily on the integer mass version of the peptide candidate generation problem, we must not lose sight of the real mass version, which we must eventually implement. The real mass version of the problem defines real masses for each alphabet symbol and supports real query masses with lower and upper mass tolerances.
Peptide Candidate Generation with Real Weights [PCG(Real)]. Given a string σ of length n over an alphabet A of size m; a positive real mass μ(a) for each a ∈ A; and k positive real mass queries with positive lower and upper tolerances (M_1, l_1, u_1), ..., (M_k, l_k, u_k), enumerate all (distinct) pairs (i, σ'), where 1 ≤ i ≤ k and σ' is a substring of σ, such that

    M_i - l_i ≤ Σ_{j=1}^{|σ'|} μ(σ'_j) ≤ M_i + u_i.
We define M_max and M_min to be the maximum and minimum relevant substring masses, M_max = max_i (M_i + u_i) and M_min = min_i (M_i - l_i), and define O_max to be the maximum number of [M_i - l_i, M_i + u_i] intervals that overlap any mass.
2.2 Application to Peptide Identification via Tandem Mass Spectrometry

In order to focus our analysis of different algorithms for generating peptide candidates, we first describe some of the issues that need to be resolved before a peptide candidate generation technique can be used as part of a peptide identification amino-acid sequence search engine.
First, most amino-acid sequence databases do not consist of a single sequence of amino-acids; they instead contain many distinct amino-acid sequences, representing many proteins. Our peptide candidates should not straddle amino-acid sequence boundaries. This is easily dealt with by introducing an additional symbol to the alphabet A with a mass guaranteed to be larger than the biggest query.
We usually require significantly more information about each peptide candidate than is represented by its sequence alone. A peptide candidate's protein context minimally consists of a reference, such as an accession number, to the sequence database entry containing it, and its position within the sequence. Often, the protein context also contains flanking residues, for checking enzymatic digest consistency quickly, and taxonomy information, for restricting candidates to a species of interest.
Since our experimental protocol cuts each protein with a digestion enzyme, we usually require our peptide candidates to have one or both ends consistent with the putative digest cut sites. Trypsin, very commonly used in mass spectrometry applications, cuts proteins immediately after a lysine (K) or arginine (R) unless the next residue is a proline (P). In order to determine whether a peptide candidate is consistent with trypsin, for example, we must use the protein context to examine the symbols before and after the peptide candidate.
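As an illustration of this check, the Python helpers below encode the trypsin rule using the flanking residues from the protein context; the convention that a missing flanking residue (protein terminus) is passed as None is an assumption of this sketch.

    def tryptic_consistent_end(last_residue, next_residue):
        """True if a cut after last_residue is consistent with trypsin:
        cut after K or R, unless the following residue is P."""
        if next_residue is None:          # protein C-terminus: always consistent
            return True
        return last_residue in ("K", "R") and next_residue != "P"

    def tryptic_consistent(candidate, prev_residue, next_residue):
        """Check both ends of a candidate against the trypsin digest rule.

        prev_residue: residue immediately before the candidate (None at the
        N-terminus); next_residue: residue immediately after it (None at the
        C-terminus)."""
        left_ok = (prev_residue is None
                   or tryptic_consistent_end(prev_residue, candidate[0]))
        right_ok = tryptic_consistent_end(candidate[-1], next_residue)
        return left_ok and right_ok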
If we always require that both ends of the peptide candidate be consistent with the enzymatic digest, we can improve the performance of any algorithm for the peptide candidate generation problem significantly. However, in practice, we have observed that a significant number of the tandem mass spectra generated in the high throughput setting do not have enzymatic-digest-consistent endpoints, and requiring this property of our peptide candidates significantly affects our ability to successfully interpret many spectra. As such, we currently do not impose this constraint at the algorithmic level; instead we filter out candidates without digestion-consistent endpoints at candidate generation time. Section 6 deals with this issue further.
Peptide redundancy must be carefully considered too. It does us no good to score the same peptide candidate against a spectrum multiple times, but if a peptide scores well, we must be able to list all the proteins it occurs in. If our algorithm does not explicitly eliminate redundant peptide candidates, we can choose to score them multiple times, or store all scored candidates in some data-structure and only score candidates the first time they are observed. Section 4 deals with this issue further.
Post-translational modifications provide yet another consideration that affects algorithmic implementation decisions. If we permit the online specification of the amino-acid mass table in order to support the generation of peptide candidates with a particular post-translational modification, we must discard any algorithm that precomputes peptide candidate masses, unless we can somehow correct for amino-acids with modified masses.
Modeling post-translational modifications becomes more difficult when we must generate peptide candidates in which some amino-acid symbols have multiple masses. In this setting, we can still use the peptide candidate generation formulation above, but we must use additional query masses and do additional checks on the generated candidates. For each tandem mass spectrum, we generate a set of mass queries that correct for the presence of a certain number of residues being translated to a particular modified amino-acid, and check each returned candidate-query mass pair to ensure that the required number of ambiguous amino-acid symbols is present.
2.3 Implications for Peptide Candidate Generation

We consider amino-acid sequence databases of a few megabytes to a few gigabytes of sequence, over an alphabet of size 20. The alphabet is small enough that we can look up the mass table for each symbol in the alphabet in constant time.
A typical set of tandem mass spectra from a single high throughput run consists of hundreds to thousands of spectra, so we expect hundreds to tens of thousands of distinct query masses. Furthermore, these query masses are typically constrained between 600 and 3000 Daltons, with a 2 Dalton tolerance on acceptable peptide candidates. A 2 Dalton tolerance is typically necessary since mass spectrometers select ions for further fragmentation with much less accuracy than they measure the mass of the same ions, and we must consider any peptide candidate that might have been selected, not only those that have been measured. For our peptide candidate generation problem, this implies that the query masses occupy much of the feasible mass range. The importance of this consideration will become clear in Section 5.
We consider the amino-acid sequence database to be provided offline, so that it may be pre-processed at will, the cost amortized over many identification searches. On the other hand, we assume that the alphabet mass function and the query masses are provided only at runtime, and any preprocessing based on weights must be accounted for in the algorithm run time.
3 Simple Algorithms

3.1 Linear Scan

Suppose initially that we have a single query mass, M. Finding all substrings of σ with weight M involves a simple linear scan. The algorithm maintains indices b and e for the beginning and end of the current substring and accumulates the current substring mass in M̂. If the current mass M̂ is less than M, e is incremented and M̂ increased. If the current mass is greater than M, b is incremented and M̂ decreased. If the current mass equals M, then σ_{b...e} is output. See Algorithm 1 for pseudo-code.
This algorithm outputs all peptide candidates that match the query mass in O(n) time. Linear Scan requires no preprocessing and requires only the string itself be stored in memory. The recent work of Cieliebak, Erlebach, Liptak, Stoye, and Welzl [3] demonstrates how to generalize Linear Scan by creating blocks of contiguous alphabet symbols and changing the pointers b and e in block size increments.
Algorithm 1 Linear Scan
    b ← 1, e ← 0, M̂ ← 0.
    while e < n or M̂ ≥ M do
        if M̂ = M then
            Output σ_{b...e}.
        if M̂ < M and e < n then
            e ← e + 1, M̂ ← M̂ + μ(σ_e).
        else  {M̂ ≥ M}
            M̂ ← M̂ - μ(σ_b), b ← b + 1.

3.2 Sequential Linear Scan

The Sequential Linear Scan algorithm for the peptide candidate generation problem merely calls Linear Scan for each query mass M_i, i = 1, ..., k. Sequential Linear Scan requires no preprocessing of the string and requires memory only to store the sequence data and query masses. The algorithm takes O(nk) time to enumerate all peptide candidates that match the k query masses. The algorithm not only outputs redundant peptide candidates, it computes their mass from scratch every time it encounters them. On the other hand, the algorithm can easily keep track of protein context information as it scans the sequence database.
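Rendered in Python, Algorithm 1 becomes the sliding-window generator below; passing the mass table as a dict and reporting 0-based half-open index pairs are conventions assumed for this sketch.

    def linear_scan(seq, mass, M):
        """Yield (b, e) index pairs (0-based, half-open) of substrings of seq
        whose total mass equals the single query mass M."""
        b, e = 0, 0          # current window is seq[b:e]
        m_hat = 0            # accumulated mass of the window
        while e < len(seq) or m_hat >= M:
            if m_hat == M:
                yield (b, e)
            if m_hat < M and e < len(seq):
                m_hat += mass[seq[e]]
                e += 1
            else:            # m_hat >= M: shrink the window from the left
                m_hat -= mass[seq[b]]
                b += 1

    # Example usage with toy integer masses:
    # masses = {"G": 57, "A": 71, "S": 87}
    # print(list(linear_scan("GAGAS", masses, 128)))   # GA and AG have mass 128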
3.3 Simultaneous Linear Scan

Unlike Sequential Linear Scan, Simultaneous Linear Scan solves the peptide candidate generation problem with a single linear scan of the sequence database. For every peptide candidate start position, we generate all candidates with mass between the smallest and largest query masses, and check, for each one, whether or not there is a query mass that it matches. Assuming positive symbol masses and query masses bounded above by some constant, there are at most L candidates that must be considered at each start position. If we pre-sort the query masses, collapsing identical queries into a list associated with a single table entry, we can look up all queries corresponding to a particular substring mass in O(R_max log k) time. Therefore, we can crudely estimate the running time of Simultaneous Linear Scan as O(k log k + n L R_max log k).
Theoretically, while it is possible to construct an input for the peptide candidate generation problem that requires the consideration of this many candidates, such an input will never occur in practice. Assuming typical amino-acid frequencies and a 3000 Dalton maximum query, the average number of candidates that must be considered at each start position, L_ave, is about 27, much less than the worst case L of 54. Table 1(a) provides L_ave for a variety of publicly available sequence databases. For the case when enzymatic-digest-consistent endpoints are required, many fewer candidates need to be generated. Table 1(b) gives L_ave for tryptic peptide candidates.
In order to take advantage of this, Simultaneous Linear Scan must examine only the peptide candidates with masses up to the maximum query mass M_max. Algorithm 2 implements Simultaneous Linear Scan, achieving a running time bound of O(k log k + n L_ave R_max log k). Clearly, as k gets large, this approach will do much better than Sequential Linear Scan.
Table 1. Average (maximum) number of (a) candidates and (b) tryptic candidates per start position for various values of M.

    Sequence Database    500 Da       1000 Da       2000 Da        3000 Da
(a) SwissPROT          4.491 (8)    9.008 (17)    18.007 (32)    27.010 (48)
    TrEMBL             4.492 (8)    9.004 (17)    18.006 (33)    27.010 (49)
    GenPept            4.491 (8)    9.004 (17)    18.005 (33)    27.006 (49)
(b) SwissPROT          1.443 (4)    2.046 (7)     3.137 (12)     4.204 (18)
    TrEMBL             1.418 (4)    2.022 (7)     3.066 (14)     4.189 (19)
    GenPept            1.427 (4)    2.036 (7)     3.101 (14)     4.195 (22)
Algorithm 2 Simultaneous Linear Scan
    Sort the query masses M_1, ..., M_k.
    b ← 1, e ← 0, M̂ ← 0.
    while b ≤ n do
        while M̂ ≤ M_max and e < n do
            e ← e + 1, M̂ ← M̂ + μ(σ_e).
            Q ← QueryLookup(M̂).
            Output (i, σ_{b...e}) for each i ∈ Q.
        b ← b + 1, e ← b - 1, M̂ ← 0.
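A Python sketch of Algorithm 2 follows. The query lookup is abstracted behind a callable so that any of the lookup structures of Section 5 can be plugged in; the dict-of-lists in the usage comment is a stand-in assumption for exact integer queries.

    def simultaneous_linear_scan(seq, mass, m_max, query_lookup):
        """Enumerate (query_index, b, e) for every substring seq[b:e] whose
        mass matches some query, scanning each start position b once and
        extending e while the accumulated mass stays at most m_max."""
        n = len(seq)
        for b in range(n):
            m_hat = 0
            e = b
            while e < n:
                m_hat += mass[seq[e]]
                e += 1
                if m_hat > m_max:
                    break
                for i in query_lookup(m_hat):   # all queries matching this mass
                    yield (i, b, e)

    # Usage sketch: exact integer queries collapsed into a dict of lists.
    # queries = {128: [0, 3], 215: [1]}    # mass -> indices of queries
    # hits = simultaneous_linear_scan("GAGAS", {"G": 57, "A": 71, "S": 87},
    #                                 300, lambda m: queries.get(m, []))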
Like Sequential Linear Scan, Simultaneous Linear Scan does no preprocessing of the amino-acid sequence database and requires no additional memory above that required for the sequence and the queries. Simultaneous Linear Scan generates redundant peptide candidates and can easily provide protein context for each peptide candidate.
4 Redundant Candidate Elimination

Typical amino-acid sequence databases, particularly those that combine sequences from different sources, contain some entries with identical sequences. Fortunately, these redundant entries can be efficiently eliminated by computing a suitable hash on each sequence, as sketched below.
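A minimal version of that duplicate-entry filter in Python; hashing the residue string with SHA-1 is an assumption of this sketch.

    import hashlib

    def unique_entries(entries):
        """Drop database entries whose residue sequence is identical to an
        earlier entry. entries: iterable of (accession, sequence) pairs."""
        seen = set()
        for accession, sequence in entries:
            digest = hashlib.sha1(sequence.encode("ascii")).digest()
            if digest not in seen:
                seen.add(digest)
                yield accession, sequence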
Substring redundancy, however, is not as easy to deal with. When all peptide candidates of mass between 600 and 4000 Daltons are enumerated from the publicly available GenPept amino-acid sequence database, over 60% of the peptide candidates output have the same sequence as a candidate already seen. We denote by the substring density the compression in the number of candidates that must be considered when this redundancy is eliminated. Figure 1 plots the substring density of various publicly available amino-acid sequence databases for candidates (and tryptic candidates) with mass between 600 and M_max, where M_max varies from 610 to 4000 Daltons. Notice that the substring density is quite flat for large M_max, suggesting that a substring density significantly less than 1 is not merely an artifact of the length of the substrings considered.
[Figure 1 omitted: substring density (y-axis, 0-1) versus M_max in Daltons (x-axis, 500-4000) for GenPept, TrEMBL, and SwissPROT.]
Fig. 1. Substring density of all candidates (solid line) and tryptic candidates (dashed line) with mass between 600 and M_max Daltons.

Notice also that the carefully curated SwissPROT sequence database exhibits much greater substring density than the others. This probably reflects the curation policy of SwissPROT of collapsing polymorphic and variant forms of the same protein into a single entry with feature annotations. If each variant form of a protein were a distinct entry in the database, then its substring density would decrease significantly. On the other hand, GenPept, which contains protein sequence data from many sources and is not hand curated, has much lower substring density. We claim that substring density is not merely inversely proportional to database size, but that it is increasingly difficult to find appropriate ways to collapse similar, but not identical, entries in ever larger sequence databases.
4.1 Suffix Tree Traversal

Suffix trees provide a compact representation of all distinct substrings of a string, effectively eliminating redundant peptide candidates. We discuss the salient properties of suffix trees here; see Gusfield [7] for a comprehensive introduction. Suffix trees can be built in time and space linear in the string length. Once built, all distinct substrings are represented by some path from the root of the suffix tree. Having constructed a suffix tree representation of our sequence database, a depth first traversal of the suffix tree that examines substrings on paths from the root with mass at most the maximum query mass generates all sequence-distinct peptide candidates of our sequence database.
As the depth first traversal of the suffix tree proceeds, each candidate substring must be checked to see whether it corresponds to a query mass. As before, we bound the worst case performance of query mass lookup by O(R_max log k). This results in a running time bound of O(k log k + n L_ave R_max log k).
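The sequence-distinct enumeration can also be driven by a suffix array with longest-common-prefix (LCP) information: each suffix contributes only the extensions beyond its common prefix with the previously listed suffix, so every distinct substring is visited exactly once. The Python sketch below uses a naive quadratic construction for clarity (practical implementations use linear-time suffix array construction) and is an illustration, not the authors' code.

    def distinct_candidates(seq, mass, m_max):
        """Enumerate every distinct substring of seq with mass <= m_max
        exactly once, via a suffix array and LCP values."""
        n = len(seq)
        sa = sorted(range(n), key=lambda i: seq[i:])      # naive suffix array
        def lcp(i, j):                                    # longest common prefix
            l = 0
            while i + l < n and j + l < n and seq[i + l] == seq[j + l]:
                l += 1
            return l
        prev = None
        for start in sa:
            skip = lcp(prev, start) if prev is not None else 0
            m_hat = sum(mass[c] for c in seq[start:start + skip])
            # extend past the shared prefix: each extension is a new substring
            for end in range(start + skip, n):
                m_hat += mass[seq[end]]
                if m_hat > m_max:
                    break
                yield seq[start:end + 1]
            prev = start

    # e.g. sorted(distinct_candidates("GAGA", {"G": 57, "A": 71}, 200))
    # yields each of "A", "AG", "AGA", "G", "GA", "GAG" exactly once.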
The additional memory overhead of the suffix tree is the only downside of this approach. A naive implementation of the suffix tree data-structure can require as much as 24n additional memory for a string of length n. However, as suffix trees have found application in almost every aspect of exact string matching, a great deal of effort has been made to find more compact representations. For example, Kurtz [9] claims a memory overhead of approximately 8n for various biological databases. This degree of compression can also be achieved, in our experience, by using a suffix array data-structure, which represents the same information as a suffix tree, but in a typically more compact form. Again, see Gusfield [7] for an introduction. In theory, a depth first traversal of a suffix array should be quite a bit more expensive than for a suffix tree, but in our experience, the linear scan through the suffix array that is necessary to implement the depth first traversal costs little more than pointer following in the suffix tree.
We note that using a suffix tree traversal for peptide candidate generation can make the protein context of peptide candidates more expensive to compute, particularly for context upstream of the candidate, which is necessary for determining if the peptide candidate is consistent with an enzymatic digest.

[Figure 2 omitted: number of spectra (y-axis, 0-20) versus mass in Daltons (x-axis, 1000-4000).]
Fig. 2. Overlap plot for 1493 query masses from a Finnigan LCQ tandem mass spectrometry experiment.
5 Query Lookup

Whether we consider the Simultaneous Linear Scan or Suffix Tree Traversal algorithm for peptide candidate generation, it should be clear that as the number of distinct query masses k gets large, our running time depends critically on the query mass lookup time. Once k gets too big for the Sequential Linear Scan algorithm to be appropriate, effective algorithms for the peptide candidate generation problem must no longer focus on the string scanning subproblem, but instead on the query mass lookup subproblem. Figure 2 plots the degree of overlap of a typical set of query mass intervals derived from a Finnigan LCQ mass spectrometer run containing 1493 tandem mass spectra. Notice the wide range of candidate masses that must be considered by the candidate generation algorithm.
We can obtain an O(1) query lookup time per output peptide candidate by constructing a lookup table, indexed by the set of all possible candidate masses, containing the corresponding query mass indices. This is feasible for suitably small integer query masses, as the resulting data-structure will have size O(M_max + k).
Algorithm 3 Query Mass Lookup Table: Construction: Real Case
    Δ ← min_i (l_i + u_i).
    allocate and initialize a lookup table T of size N = ⌊M_max/Δ⌋ + 1.
    for all query masses i do
        for all j = ⌊(M_i - l_i)/Δ⌋, ..., ⌊(M_i + u_i)/Δ⌋ do
            if jΔ ≤ M_i - l_i < (j + 1)Δ then
                T[j].L ← T[j].L ∪ {i}
            else if jΔ ≤ M_i + u_i < (j + 1)Δ then
                T[j].U ← T[j].U ∪ {i}
            else  {M_i - l_i < jΔ and M_i + u_i ≥ (j + 1)Δ}
                T[j].O ← T[j].O ∪ {i}
    for all j = 0, ..., N - 1 do
        sort the elements i of T[j].L in increasing M_i - l_i order.
        sort the elements i of T[j].U in decreasing M_i + u_i order.

Algorithm 4 Query Mass Lookup Table: Lookup: Real Case
    input query mass M̂.
    j ← ⌊M̂/Δ⌋.
    output i for each i ∈ T[j].O.
    output i in order from T[j].L until M_i - l_i > M̂.
    output i in order from T[j].U until M_i + u_i < M̂.
We can readily adapt this approach to the real mass case. Algorithm 3 describes the procedure for building the lookup table. Having selected the discretization factor Δ for the real masses to match the smallest query mass tolerance and allocated a table of the appropriate size, we iterate through the query masses to populate the table. For each query mass, we determine the relevant table entries that represent intervals intersecting the query mass tolerance interval. The query mass intervals that intersect the interval J = [jΔ, (j + 1)Δ) represented by a table entry j do so in one of three ways: the lower endpoint falls inside J, in which case the query mass index is stored in T[j].L; the upper endpoint falls inside J, in which case the query mass index is stored in T[j].U; or the query mass interval completely overlaps J, in which case the query mass index is stored in T[j].O. The choice of Δ ensures that no query mass interval is properly contained in J. Once all the table entries are populated, then for all table entries j, the query mass indices i ∈ T[j].L are sorted in increasing M_i - l_i order and, similarly, the query mass indices i ∈ T[j].U are sorted in decreasing M_i + u_i order. The resulting lookup table has size bounded above by O(M_max/Δ + k max_i(l_i + u_i)/Δ) and can be constructed in time bounded by O(M_max/Δ + k max_i(l_i + u_i)/Δ + (M_max/Δ) O_max log O_max).
For typical values, building this table is quite feasible. M_max is usually between 2000 and 4000 Daltons, while Δ is usually about 2 Daltons. Even tens of thousands of query masses spread over 1000-2000 bins will not cause significant memory or construction time problems. Once built, the cost in memory and construction time is quickly amortized by the O(1) query lookup time.
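A compact Python sketch of the lookup-table idea behind Algorithms 3 and 4. For brevity each bin stores a plain list of the query intervals overlapping it and the lookup filters exactly, dropping the L/U/O bucket refinement and its sorted orders, so this is illustrative rather than a faithful O(1)-per-output implementation.

    class QueryMassTable:
        """Bin query intervals [M_i - l_i, M_i + u_i] by a discretization
        factor delta; lookup scans one bin and filters exactly."""

        def __init__(self, queries, delta):
            # queries: list of (M, l, u) tuples, all masses positive reals.
            self.delta = delta
            m_max = max(M + u for M, l, u in queries)
            self.bins = [[] for _ in range(int(m_max / delta) + 1)]
            for i, (M, l, u) in enumerate(queries):
                lo_bin = int((M - l) / delta)
                hi_bin = int((M + u) / delta)
                for j in range(lo_bin, hi_bin + 1):
                    self.bins[j].append((i, M - l, M + u))

        def lookup(self, m_hat):
            j = int(m_hat / self.delta)
            if j >= len(self.bins):
                return []
            return [i for i, lo, hi in self.bins[j] if lo <= m_hat <= hi]

    # table = QueryMassTable([(1000.0, 2.0, 2.0), (1001.5, 2.0, 2.0)], delta=2.0)
    # table.lookup(1000.2)   # -> [0, 1]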
The bottom line is that it is possible to solve the peptide candidate generation problem (with integer masses) in time O(M_max + k + n L_ave R_max) and that it is practical to implement this approach for real query masses. Note that the number of peptide candidates output by any correct algorithm is bounded above by n L_ave R_max. This bound is quite tight, particularly when R_max is close to its lower bound of k/M_max. Ultimately, this means that we cannot expect to do much better than this running time bound when building the query lookup table is feasible.
6 Enzymatic Digest Considerations

Enzymatic digest considerations offer both a challenge and an opportunity for peptide candidate generation. They present a challenge because, depending on the underlying scan or traversal, obtaining the protein context needed to check whether a peptide candidate is consistent with a particular enzymatic digest may not be trivial, particularly during a suffix tree based traversal. On the other hand, they present an opportunity because digest constraints significantly reduce the number of peptide candidates that need to be examined. With careful consideration of the underlying data-structures, it is possible to enumerate peptide candidates that satisfy enzymatic digest constraints at no additional cost, which means that the reduction in the number of candidates output directly impacts the algorithm run-time.
First, we consider how to manage enzymatic digest constraints for the linear scan algorithms. Notice first that since the motifs for enzymatic digestion are typically simple and short, we can find all putative digestion sites in O(n) time. If both ends of the peptide candidates are constrained to be consistent with the enzymatic digest, then the begin and end pointers of Algorithms 1 and 2 can skip from one putative site to the next. Further, if we consider the case when only a small number of missed digestion sites are permitted, we can avoid the generation of candidates with moderate mass but many missed digestion sites. If only one end of the peptide candidate is required to be consistent with the digest, we must use two passes through the sequence. The first pass constrains the left pointer to putative digest sites, and the second pass constrains the right pointer to putative digest sites and the left pointer to non-digest sites.
Next, we consider how to manage enzymatic digest consistency for the suffix tree based algorithms. Unlike the linear scan algorithms, peptide candidate generation here is asymmetric with respect to the ends of the candidate. In fact, there is little we can do within the suffix tree traversal to reduce the amount of work that needs to be done to generate a new peptide candidate in which the right endpoint is consistent with the enzymatic digest. We must still explore all the children of any node we visit, as long as the current peptide candidate mass is less than the maximum query mass. We can avoid query mass lookups, just as we did for the linear scans, by checking each subsequence with a valid mass for a digest-consistent right endpoint before we look up the query masses. On the other hand, constraining the left endpoint of all candidates to be consistent with the digest can restrict the suffix tree traversal algorithms to a subtree of the suffix tree. In order to accomplish this, we must consider a peptide candidate to consist of the protein context to the left of the candidate prepended to the peptide candidate proper. The non-candidate symbols prepended to the candidate do not contribute to the mass of the candidate, but permit us to look only at those subtrees of the root that begin with a digest-consistent motif.

Table 2. Total time (s) / number of calls (10^9) / time per call (10^-6 s) of the traversal subroutines.

                        generation         interval search     string scan
    Suffix tree         237/1.62/0.146     548/2.90/0.189      115/1.33/0.086
    Sim. linear scan    228/2.02/0.112     844/3.53/0.239       87/1.55/0.056
7 Computational Observations

We implemented solutions with a suffix tree and a simultaneous linear scan as described in Algorithm 2. Profiles of these implementations give an estimate of the relative constant factors associated with string traversal versus interval search costs. We ran our implementations against the 1493 sample query masses on the SwissPROT database.
Although we have shown that O(1) lookup of intervals is possible, we used a skip list to store the intervals in both examples. As the average interval coverage is not large in this example, we thought this would suffice.
Our implementations were compiled on DEC Alpha EV6.7 processors (500 MHz) with the native C++ compiler, CXX (with the option -O4). Timing was done with gprof. We have aggregated the timings for the interval traversal routines, the string traversal routines, and the candidate generation function which calls them alternately.
Table 2 demonstrates that the largest fraction of the cost (by number of calls or time) is spent in the interval lookup, examining about 2 intervals per string search, as Figure 2 suggests. The extra time cost from the redundant string traversals in the simultaneous linear scan implementation is compensated by the per-call speed coming from a simpler scanning routine. The extra interval search time for the simultaneous linear scan is likely due to the increased time spent re-traversing short intervals.
8 Conclusion

We have identified and formulated a key subproblem, the peptide candidate generation problem, that must be solved in order to identify proteins via tandem mass spectrometry and amino-acid sequence database search. We have outlined the context in which any solution to the peptide candidate generation problem must operate and carefully examined how this context influences the classic algorithmic tradeoffs of time and space.
We have identified a key property of amino-acid sequence databases, substring density, that quantifies the unnecessary peptide candidates output by a linear scan of the sequence database. We have further proposed the use of the suffix array as a compact representation of all substrings of the sequence database that eliminates candidate redundancy.
We have also demonstrated that as the number of query masses increases, query mass lookup time becomes more significant than sequence database scan time. We proposed a constant time query lookup algorithm based on preprocessing the query masses to form a lookup table, and showed that this technique is quite feasible in practice.
Our results for the peptide candidate generation problem depend heavily on our empirical observations about the nature of the problem instances we see every day. We need to obtain better worst case bounds for this problem to make our performance less sensitive to less common extreme instances. We also need to do a comprehensive empirical study to more carefully determine the scenarios in which each of the algorithms identified is the method of choice.
References

1. V. Bafna and N. Edwards. SCOPE: A probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics, 17(Suppl. 1):S13-S21, 2001.
2. T. Chen, M. Kao, M. Tepel, J. Rush, and G. Church. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. In ACM-SIAM Symposium on Discrete Algorithms, 2000.
3. M. Cieliebak, T. Erlebach, S. Liptak, J. Stoye, and E. Welzl. Algorithmic complexity of protein identification: Combinatorics of weighted strings. Submitted to Discrete Applied Mathematics special issue on Combinatorics of Searching, Sorting, and Coding, 2002.
4. J. Cottrell and C. Sutton. The identification of electrophoretically separated proteins by peptide mass fingerprinting. Methods in Molecular Biology, 61:67-82, 1996.
5. V. Dancik, T. Addona, K. Clauser, J. Vath, and P. Pevzner. De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology, 6:327-342, 1999.
6. J. Eng, A. McCormack, and J. Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976-989, 1994.
7. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
8. P. James, M. Quadroni, E. Carafoli, and G. Gonnet. Protein identification in DNA databases by peptide mass fingerprinting. Protein Science, 3(8):1347-1350, 1994.
9. S. Kurtz. Reducing the space requirement of suffix trees. Software - Practice and Experience, 29(13):1149-1171, 1999.
10. D. Pappin. Peptide mass fingerprinting using MALDI-TOF mass spectrometry. Methods in Molecular Biology, 64:165-173, 1997.
11. D. Pappin, P. Hojrup, and A. Bleasby. Rapid identification of proteins by peptide-mass fingerprinting. Current Biology, 3(6):327-332, 1993.
12. D. Perkins, D. Pappin, D. Creasy, and J. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551-3567, 1999.
13. P. Pevzner, V. Dancik, and C. Tang. Mutation-tolerant protein identification by mass-spectrometry. In R. Shamir, S. Miyano, S. Istrail, P. Pevzner, and M. Waterman, editors, International Conference on Computational Molecular Biology (RECOMB), pages 231-236. ACM Press, 2000.
14. J. Taylor and R. Johnson. Sequence database searches via de novo peptide sequencing by mass spectrometry. Rapid Communications in Mass Spectrometry, 11:1067-1075, 1997.
Improved Approximation Algorithms for NMR Spectral Peak Assignment

Zhi-Zhong Chen^1, Tao Jiang^2, Guohui Lin^3, Jianjun Wen^2, Dong Xu^4, and Ying Xu^4

^1 Dept. of Math. Sci., Tokyo Denki Univ., Hatoyama, Saitama 350-0394, Japan. [email protected]
^2 Dept. of Comput. Sci., Univ. of California, Riverside, CA 92521. {jiang,wjianju}@cs.ucr.edu
^3 Dept. of Comput. Sci., Univ. of Alberta, Edmonton, Alberta T6G 2E8, Canada. [email protected]
^4 Protein Informatics Group, Life Sciences Division, Oak Ridge National Lab., Oak Ridge, TN 37831. {xud,xyn}@ornl.gov
Abstract. We study a constrained bipartite matching problem where the input is a weighted bipartite graph G = (U, V, E), U is a set of vertices following a sequential order, V is another set of vertices partitioned into a collection of disjoint subsets, each following a sequential order, and E is a set of edges between U and V with non-negative weights. The objective is to find a matching in G with the maximum weight that satisfies the given sequential orders on both U and V, i.e., if u_{i'} follows u_i in U and v_{j'} follows v_j in V, then u_i is matched with v_j if and only if u_{i'} is matched with v_{j'}. The problem has recently been formulated as a crucial step in an algorithmic approach for interpreting NMR spectral data [15]. The interpretation of NMR spectral data is known as a key problem in protein structure determination via NMR spectroscopy. Unfortunately, the constrained bipartite matching problem is NP-hard [15]. We first propose a 2-approximation algorithm for the problem, which follows directly from the recent result of Bar-Noy et al. [2] on interval scheduling. However, our extensive experimental results on real NMR spectral data illustrate that the algorithm performs poorly in terms of recovering the target-matching (i.e. correct) edges. We then propose another approximation algorithm that tries to take advantage [...]

(Supported in part by a UCR startup grant and NSF Grants CCR-9988353 and ITR-0085910.)
[...] = (U, V, E'). We say that two edges of E' conflict if they cannot appear together in a feasible matching; a set of pairwise non-conflicting edges of G' is always a matching in G'. Thus, it remains to show how to compute a feasible approximate matching in G'.
Define an innermost edge of G' to be an edge (u_i, v_j) in G' satisfying the following condition: G' has no edge (u_{i'}, v_{j'}) other than (u_i, v_j) such that i ≤ i' ≤ i' + s' - 1 ≤ i + s - 1, where s (respectively, s') is the size of the expanded matching of (u_i, v_j) (respectively, (u_{i'}, v_{j'})). Define a leading innermost edge to be an innermost edge (u_i, v_j) such that the index i is minimized. The crucial point is that for every leading innermost edge (u_i, v_j) of G', at most two edges of a feasible matching M' in G' conflict with (u_i, v_j). To see this, let (u_{i'}, v_{j'}) be an edge in M' that conflicts with (u_i, v_j), and let s (respectively, s') be the size of the expanded matching of (u_i, v_j) (respectively, (u_{i'}, v_{j'})). Since (u_i, v_j) is an innermost edge of G', at least one of the following conditions holds:

1. j' = j.
2. i' ≤ i ≤ i + s - 1 ≤ i' + s' - 1.
3. i < i' ≤ i + s - 1 < i' + s' - 1.
4. i' < i ≤ i' + s' - 1 < i + s - 1.
For each of these conditions, M' may contain at most one such edge, since M' is a feasible matching in G'. Moreover, if M' contains an edge (u_{i'}, v_{j'}) satisfying Condition 2, then it contains no edge satisfying Condition 3 or 4. Furthermore, M' contains no edge (u_{i'}, v_{j'}) satisfying Condition 4, or else there would be an innermost edge (u_{i''}, v_{j''}) in G' with i'' < i ≤ i'' + s'' - 1 ≤ i' + s' - 1 (where s'' is defined analogously) that would conflict with (u_i, v_j), contradicting the choice of (u_i, v_j) as a leading innermost edge.
Using the above fact (that at most two edges of M' conflict with a leading innermost edge), the 2-approximation algorithm proceeds as follows:

2-Approximation on G':
1. if (E(G') = ∅) output the empty set and halt;
2. find a leading innermost edge e in G';
3. Γ = {e} ∪ {e' | e' ∈ E(G'), e' conflicts with e};
4. let c be the minimum weight of an edge in Γ;
5. for (every edge f ∈ Γ) subtract c from the weight of f;
6. F = {e | e ∈ Γ, e has weight 0};
7. G'' = G' - F;
8. recursively call 2-Approximation on G'' and output M'';
9. find a maximal M ⊆ F such that M ∪ M'' is a feasible matching in G';
10. output M ∪ M'' and halt.

Fig. 1. A recursive algorithm for finding a feasible matching in G'.
Theorem 2. [2] The algorithm described in Figure 1 outputs a feasible matching of the graph G' = (U, V, E') whose weight is at least half the weight of an optimal feasible matching.

[...] G_i = (U, V_i, E_i).
Let M_i* denote an optimal feasible matching for the graph G_i. Right before the execution of Step 3.4 of the algorithm, M_i is clearly a feasible matching for the graph G_i, and its weight is at least 1/6 of that of M_i*, because we can claim that each execution of Step 3.3.2 rules out at most 6 edges of M_i* from further consideration.
Table 1. Summary of the performances of the 2-approximation and 3 log D-approximation algorithms on 126 instances of NMR peak assignment. The number after the underscore symbol in the name of each instance indicates the number of adjacent pairs of spin systems in the instance (more precisely, _5 means that the number of adjacent pairs of spin systems is 50% of the total number of residues). M* represents the target assignment that we want to recover, and M (M', or M'') is the assignment computed by the two-layer (2-approximation, or 3 log D-approximation, respectively) algorithm. The parameters R = |M* ∩ M|, R' = |M* ∩ M'|, and R'' = |M* ∩ M''| show how many target-matching edges were recovered by the two-layer, 2-approximation, and 3 log D-approximation algorithms, respectively, on each instance. Each N/A indicates that the computation of the two-layer algorithm was taking too long (> 2 days on a supercomputer) and had to be killed before any result could be obtained.

|M*| w(M*) R  |M'| w(M') R'  |M''| w(M'') R''  (two instances per row; |M*| and w(M*) are listed only on each protein's first row)
bmr4027 1 158 1896284 11 158 1871519 28 158 1931099 4 bmr4144 1 78 949170 10 78 936144 10 78 845578 5
bmr4027 2 18 157 1849500 7 158 1927193 2 bmr4144 2 8 78 928175 4 78 869229 1
bmr4027 3 18 158 1841683 8 158 1930119 23 bmr4144 3 10 77 917197 5 78 881665 1
bmr4027 4 43 158 1829367 11 158 1925237 36 bmr4144 4 0 78 907130 10 78 886147 6
bmr4027 5 33 156 1827498 3 158 1923556 37 bmr4144 5 14 77 921816 17 78 914564 14
bmr4027 6 36 157 1818131 8 158 1916814 48 bmr4144 6 30 77 897500 11 76 876005 3
bmr4027 7 79 155 1784027 44 158 1885779 90 bmr4144 7 34 76 842073 2 78 888087 6
bmr4027 8 19 154 1671475 113 158 1875058 117 bmr4144 8 67 77 804531 5 78 896088 22
bmr4027 9 155 155 1652859 60 158 1896606 156 bmr4144 9 75 76 837519 35 78 949844 76
bmr4288 1 105 1249465 9 105 1238612 8 105 1208142 6 bmr4302 1 115 1298321 9 115 1305677 0 115 1316209 8
bmr4288 2 9 105 1220481 8 105 1194198 9 bmr4302 2 12 115 1273146 0 115 1324173 8
bmr4288 3 16 103 1206095 17 105 1199374 17 bmr4302 3 7 114 1276372 8 115 1313288 8
bmr4288 4 33 105 1185685 5 105 1214237 21 bmr4302 4 16 114 1246952 4 115 1307472 10
bmr4288 5 38 103 1169907 6 105 1211226 34 bmr4302 5 34 113 1219920 11 115 1295035 24
bmr4288 6 52 102 1179110 15 105 1217006 52 bmr4302 6 44 114 1174564 0 115 1255172 60
bmr4288 7 55 103 1112288 22 105 1230117 62 bmr4302 7 65 112 1181267 8 115 1294044 78
bmr4288 8 N/A 101 1133554 35 105 1232331 66 bmr4302 8 N/A 113 1152323 27 113 1283268 99
bmr4288 9 105 100 1051817 48 105 1249465 105 bmr4302 9 111 115 1293954 107 115 1298321 111
bmr4309 1 178 2048987 6 178 2066506 4 178 2118482 4 bmr4316 1 89 1029827 4 89 997300 13 89 1011408 7
bmr4309 2 10 178 2023648 9 178 2108291 4 bmr4316 2 15 89 976270 2 89 1019640 7
bmr4309 3 33 177 2013099 9 178 2115356 22 bmr4316 3 21 88 972224 0 89 1020190 9
bmr4309 4 34 176 2024268 14 178 2107417 18 bmr4316 4 20 87 936852 5 89 1028608 31
bmr4309 5 46 174 1954955 13 178 2090346 31 bmr4316 5 42 86 890944 2 89 1007619 43
bmr4309 6 59 177 1924727 12 178 2074540 55 bmr4316 6 60 84 863207 13 89 1012008 48
bmr4309 7 122 174 1885986 24 178 2078322 114 bmr4316 7 79 87 882818 9 87 1004449 67
bmr4309 8 106 173 1868338 55 178 2026479 112 bmr4316 8 87 87 957378 62 89 1029827 89
bmr4309 9 176 170 1796864 95 175 1999734 153 bmr4316 9 89 85 984774 85 89 1029827 89
bmr4318 1 215 2390881 8 215 2418440 17 215 2495022 2 bmr4353 1 126 1498891 6 126 1482821 20 126 1492927 7
bmr4318 2 5 215 2398412 0 215 2481997 6 bmr4353 2 8 126 1473982 9 126 1499720 7
bmr4318 3 N/A 214 2409316 17 215 2481867 10 bmr4353 3 4 125 1455084 6 126 1499983 8
bmr4318 4 23 213 2394682 3 215 2481099 12 bmr4353 4 20 126 1441162 9 126 1511112 14
bmr4318 5 38 215 2355926 2 215 2473707 27 bmr4353 5 17 125 1417351 8 126 1502628 21
bmr4318 6 38 214 2312260 13 215 2440684 31 bmr4353 6 35 125 1421633 18 126 1514294 11
bmr4318 7 87 210 2259377 52 215 2421426 70 bmr4353 7 29 125 1370235 14 126 1499010 58
bmr4318 8 113 212 2214174 63 209 2326045 91 bmr4353 8 N/A 123 1337329 9 122 1443144 81
bmr4318 9 N/A 207 2158223 122 215 2390651 197 bmr4353 9 126 122 1273988 15 126 1498891 126
bmr4391 1 66 710914 5 66 723525 8 66 750059 5 bmr4393 1 156 1850868 6 156 1826257 10 156 1876203 5
bmr4391 2 8 66 720589 6 66 755718 3 bmr4393 2 14 156 1805561 3 156 1873989 6
bmr4391 3 7 66 724102 8 66 749505 5 bmr4393 3 N/A 156 1782350 5 156 1859924 4
bmr4391 4 6 65 681286 9 66 745159 5 bmr4393 4 22 156 1778165 3 156 1868573 12
bmr4391 5 13 64 688400 5 66 741824 0 bmr4393 5 30 155 1742954 3 156 1862071 42
bmr4391 6 10 66 699066 8 66 739778 0 bmr4393 6 45 155 1772955 42 156 1857579 67
bmr4391 7 0 66 684953 37 66 717888 21 bmr4393 7 74 154 1722026 22 151 1794248 94
bmr4391 8 18 64 663147 30 66 705513 20 bmr4393 8 128 156 1640682 15 154 1830609 136
bmr4391 9 N/A 66 687290 45 61 652235 45 bmr4393 9 143 152 1527885 3 156 1851298 152
bmr4579 1 86 950173 7 86 931328 12 86 967574 5 bmr4670 1 120 1391055 8 120 1378876 27 120 1434117 5
bmr4579 2 12 86 933035 7 86 977013 9 bmr4670 2 10 120 1366541 14 120 1437469 5
bmr4579 3 11 85 923916 4 86 973431 14 bmr4670 3 20 120 1370848 6 120 1437484 16
bmr4579 4 16 86 935901 6 86 961214 11 bmr4670 4 32 119 1341300 6 120 1423323 28
bmr4579 5 13 85 894084 2 86 968378 21 bmr4670 5 35 117 1309727 11 120 1393428 28
bmr4579 6 15 86 911564 8 86 945148 21 bmr4670 6 48 118 1290812 13 120 1394903 40
bmr4579 7 42 86 873884 17 86 952794 45 bmr4670 7 45 118 1239001 6 120 1377578 45
bmr4579 8 49 83 877556 26 86 950136 78 bmr4670 8 N/A 120 1236726 19 118 1370011 101
bmr4579 9 86 83 760356 0 86 950173 86 bmr4670 9 N/A 113 1237614 60 114 1319698 94
bmr4752 1 68 882755 8 68 862523 20 68 889083 9 bmr4929 1 114 1477704 7 114 1432825 5 114 1502375 3
bmr4752 2 12 68 848225 16 68 886989 11 bmr4929 2 10 114 1424433 5 114 1500838 7
bmr4752 3 13 68 834299 2 68 886910 18 bmr4929 3 16 113 1417722 7 114 1499302 18
bmr4752 4 20 67 820207 2 68 892854 16 bmr4929 4 20 113 1411387 7 114 1497361 27
bmr4752 5 28 67 796019 8 68 878244 29 bmr4929 5 24 114 1408112 4 114 1487741 26
bmr4752 6 28 67 824289 6 68 879380 35 bmr4929 6 24 112 1385673 12 114 1480828 31
bmr4752 7 43 66 752633 3 68 868981 40 bmr4929 7 65 112 1378166 30 114 1449648 55
bmr4752 8 N/A 65 730276 17 68 860366 42 bmr4929 8 86 114 1424433 5 114 1471279 87
bmr4752 9 68 67 812950 44 68 882755 68 bmr4929 9 112 107 1178499 20 114 1477704 114
3 log D-Approximation on G':
1. compute the ratio r = l_max/l_min, where l_max (respectively, l_min) is the maximum (respectively, minimum) length of strings in V;
2. partition V into g = max{1, ⌈log r⌉} subsets V_1, V_2, ..., V_g such that a string s is included in subset V_i if and only if 4^{i-1} ≤ |s|/l_min ≤ 4^i;
   (Note: V_i and V_{i+1} may not be disjoint.)
3. for (every i ∈ {1, 2, ..., g})
   3.1 compute the set E_i of edges of G' incident to strings in V_i;
   3.2 initialize M_i = ∅;
   3.3 while (E_i ≠ ∅)
       3.3.1 find an edge e ∈ E_i of maximum weight;
       3.3.2 add e to M_i, and delete e and all edges conflicting with e from E_i;
   3.4 greedily extend M_i to a maximal feasible matching of G';
4. output the heaviest one among M_1, M_2, ..., M_g and halt.

Fig. 2. A new algorithm for finding a feasible matching in G'.
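The bucket-and-greedy structure of this algorithm can be prototyped compactly. In the sketch below the conflict relation is left abstract as a caller-supplied predicate (it depends on the sequential-order constraints), the logarithm base 4 matches the factor-4 buckets but is an assumption, and the maximal-extension Step 3.4 is omitted; this illustrates the scheme rather than reproducing the authors' implementation.

    import math

    def bucketed_greedy_matching(edges, length_of, conflicts):
        """Prototype of the bucket-and-greedy scheme of Figure 2 (Step 3.4
        omitted). edges: list of (weight, u, v); length_of: dict v -> length
        of the string containing v; conflicts(e1, e2): True if the two edges
        cannot coexist in a feasible matching."""
        lengths = [length_of[v] for _, _, v in edges]
        l_min, l_max = min(lengths), max(lengths)
        r = l_max / l_min
        g = max(1, math.ceil(math.log(r, 4))) if r > 1 else 1   # base 4 assumed
        best, best_weight = [], -1.0
        for i in range(1, g + 1):
            # Step 2: strings with |s|/l_min in [4^(i-1), 4^i]; buckets overlap.
            bucket = [e for e in edges
                      if 4 ** (i - 1) <= length_of[e[2]] / l_min <= 4 ** i]
            bucket.sort(key=lambda e: -e[0])        # Step 3.3.1: heaviest first
            chosen = []
            for e in bucket:                        # Step 3.3.2: greedy selection
                if all(not conflicts(e, f) for f in chosen):
                    chosen.append(e)
            w = sum(e[0] for e in chosen)
            if w > best_weight:
                best_weight, best = w, chosen
        return best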
To see the claim, consider an edge e = (u_x, v_y) added to M_i in Step 3.3.2. Let e' = (u_{x'}, v_{y'}) be an edge conflicting with e. Let s (respectively, s') be the size of the expanded matching of e (respectively, e'). Then, at least one of the following conditions 1 through 6 holds:

1. y' = y.
2. x' = x and s' = s.
3. x' < x ≤ x' + s' - 1 < x + s - 1.
4. x < x' ≤ x + s - 1 < x' + s' - 1.
5. x' < x ≤ x + s - 1 ≤ x' + s' - 1 or x' ≤ x ≤ x + s - 1 < x' + s' - 1.
6. x < x' ≤ x' + s' - 1 ≤ x + s - 1 or x ≤ x' ≤ x' + s' - 1 < x + s - 1.

Since M_i* is a feasible matching of G_i, M_i* may contain at most one edge satisfying Condition 1, at most one edge satisfying Condition 2, at most one edge satisfying Condition 3, at most one edge satisfying Condition 4, at most one edge satisfying Condition 5, and at most four edges satisfying Condition 6 (because of the construction of V_i). For the same reason, if M_i* contains an edge satisfying Condition 2 (respectively, 5), then M_i* contains no edge satisfying Condition 6. Similarly, if M_i* contains an edge satisfying Condition 3 or 4, then M_i* contains at most three edges satisfying Condition 6 (because of the construction of V_i). So, in the worst case (where M_i* contains the largest number of edges conflicting with e), M_i* may contain one edge satisfying Condition 1, one edge satisfying Condition 3, one edge satisfying Condition 4, and three edges satisfying Condition 6. This proves the claim.
Let M* denote an optimal feasible matching for the graph G', and let M̂_i be the sub-matching of M* in the edge set E_i. Suppose without loss of generality that M̂_j is the heaviest one among M̂_1, M̂_2, ..., M̂_g. Clearly, we have w(M̂_j) ≥ (1/g) w(M*). Thus, w(M_j) ≥ (1/6) w(M̂_j) ≥ (1/(6g)) w(M*). [...]
[...] in G. Let m_1 be the number of edges (u_i, v_j) ∈ M* with 1 ≤ j ≤ k, and let m_2 be the number of edges (u_i, v_{k+2j-1}) ∈ M* with 1 ≤ j ≤ l. Then |M*| = m_1 + 2m_2.
Lemma 1. A feasible matching in G can be found in O(m √(n_1 n_2)) time, whose size is at least m_1 + m_2.
Proof. Construct a new bipartite graph G' = (U, V, E_1 ∪ E_2), where E_1 = {(u_i, v_j) ∈ E | 1 ≤ j ≤ k} and E_2 = {(u_i, v_{k+2j-1}) ∈ E | 1 ≤ j ≤ l}. Let M' be a maximum matching in G'. Note that deleting from M* the edges (u_i, v_{k+2j}) with 1 ≤ i ≤ n_1 and 1 ≤ j ≤ l yields a matching in G', so |M'| ≥ m_1 + m_2. To obtain a feasible matching M of G from M', we perform the following steps:
1. Initialize M = ∅.
2. Construct an auxiliary graph H as follows. The vertex set of H is M'. The edge set of H consists of all (e_1, e_2) such that e_1 ∈ M', e_2 ∈ M', and e_1 conflicts with e_2. [Comment: Each connected component of H is a path P (possibly consisting of a single vertex); if P contains two or more vertices, then there exist integers i, j_1, ..., j_h (h ≥ 2) such that the vertices of P are (u_i, v_{j_1}), (u_{i+1}, v_{j_2}), ..., (u_{i+h-1}, v_{j_h}), and each of v_{j_1} through v_{j_{h-1}} is the leading vertex of a 2-string in V.]
3. For each connected component of H formed by only one vertex (u_i, v_j) ∈ M', if v_j is a 1-string in V, then add edge (u_i, v_j) to M; otherwise, add edges (u_i, v_j) and (u_{i+1}, v_{j+1}) to M.
4. For each connected component P of H formed by two or more vertices, perform the following three substeps:
   (a) Let the vertices of P be (u_i, v_{j_1}), (u_{i+1}, v_{j_2}), ..., (u_{i+h-1}, v_{j_h}) ∈ M'.
   (b) If h is even or v_{j_h} is the leading vertex of a 2-string in V, then for each 1 ≤ x ≤ ⌊h/2⌋, add edges (u_{i+2x-2}, v_{j_{2x-1}}) and (u_{i+2x-1}, v_{j_{2x-1}+1}) to M.
   (c) If h is odd and v_{j_h} alone forms a 1-string in V, then for each 1 ≤ x ≤ (h-1)/2, add edges (u_{i+2x-2}, v_{j_{2x-1}}) and (u_{i+2x-1}, v_{j_{2x-1}+1}) to M; further add edge (u_{i+h-1}, v_{j_h}) to M.

It is clear that for each connected component P of H, we add at least as many edges to M as the number of vertices in P. Thus, |M| ≥ |M'| ≥ m_1 + m_2. □
Lemma 2. A feasible matching in G can be found in O(m √(n_1 n_2)) time, whose size is at least m_1/3 + 4m_2/3.
Proof. For each index i ∈ {0, 1, 2}, let G_i be the edge-weighted bipartite graph obtained from G as follows:

1. For every 2-string v_j v_{j+1} in V, merge the two vertices of the string into a single super-vertex s_{j,j+1} (with all resulting multiple edges deleted).
2. For all j such that i + 1 ≤ j ≤ n_1 - 2 and j - 1 ≡ i (mod 3), perform the following three substeps:
   (a) Merge u_j, u_{j+1}, and u_{j+2} into a single super-vertex t_{j,j+1,j+2} (with all resulting multiple edges deleted).
   (b) For every 1-string v_h that is a neighbor of t_{j,j+1,j+2}, if the edge (u_{j+1}, v_h) is not in the original input graph, then delete the edge between t_{j,j+1,j+2} and v_h; otherwise, assign a weight of 1 to the edge between t_{j,j+1,j+2} and v_h.
   (c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{j,j+1,j+2}, if neither {(u_j, v_h), (u_{j+1}, v_{h+1})} nor {(u_{j+1}, v_h), (u_{j+2}, v_{h+1})} is a matching in the original input graph, then delete the edge between t_{j,j+1,j+2} and s_{h,h+1}; otherwise, assign a weight of 2 to the edge between t_{j,j+1,j+2} and s_{h,h+1}.
3. If neither u_1 nor u_2 was merged in Step 2a, then perform the following three substeps:
   (a) Merge u_1 and u_2 into a single super-vertex t_{1,2} (with all resulting multiple edges deleted).
   (b) For every 1-string v_h that is a neighbor of t_{1,2}, if the edge (u_1, v_h) is not in the original input graph, then delete the edge between t_{1,2} and v_h; otherwise, assign a weight of 1 to the edge between t_{1,2} and v_h.
   (c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{1,2}, if {(u_1, v_h), (u_2, v_{h+1})} is not a matching in the original input graph, then delete the edge between t_{1,2} and s_{h,h+1}; otherwise, assign a weight of 2 to the edge between t_{1,2} and s_{h,h+1}.
4. If neither u_{n_1-1} nor u_{n_1} was merged in Step 2a, then perform the following three substeps:
   (a) Merge u_{n_1-1} and u_{n_1} into a single super-vertex t_{n_1-1,n_1} (with all resulting multiple edges deleted).
   (b) For every 1-string v_h that is a neighbor of t_{n_1-1,n_1}, if the edge (u_{n_1}, v_h) is not in the original input graph, then delete the edge between t_{n_1-1,n_1} and v_h; otherwise, assign a weight of 1 to the edge between t_{n_1-1,n_1} and v_h.
   (c) For every 2-string v_h v_{h+1} such that s_{h,h+1} is a neighbor of t_{n_1-1,n_1}, if {(u_{n_1-1}, v_h), (u_{n_1}, v_{h+1})} is not a matching in the original input graph, then delete the edge between t_{n_1-1,n_1} and s_{h,h+1}; otherwise, assign a weight of 2 to the edge between t_{n_1-1,n_1} and s_{h,h+1}.
For each i ∈ {0, 1, 2}, let M_i be a maximum-weighted matching in G_i. From each M_i, we can obtain a feasible matching M̃_i in the original input graph by performing the following steps in turn:

- Initialize M̃_i = ∅.
- For each edge (u_j, v_h) ∈ M_i, add (u_j, v_h) to M̃_i.
- For each edge (t_{j,j+1,j+2}, v_h) ∈ M_i, add (u_{j+1}, v_h) to M̃_i.
- For each edge (t_{1,2}, v_h) ∈ M_i, add (u_1, v_h) to M̃_i.
- For each edge (t_{n_1-1,n_1}, v_h) ∈ M_i, add (u_{n_1}, v_h) to M̃_i.
- For each edge (t_{j,j+1,j+2}, s_{h,h+1}) ∈ M_i, if {(u_j, v_h), (u_{j+1}, v_{h+1})} is a matching in the original input graph, then add edges (u_j, v_h) and (u_{j+1}, v_{h+1}) to M̃_i; otherwise, add edges (u_{j+1}, v_h) and (u_{j+2}, v_{h+1}) to M̃_i.
- For each edge (t_{1,2}, s_{h,h+1}) ∈ M_i, add edges (u_1, v_h) and (u_2, v_{h+1}) to M̃_i.
- For each edge (t_{n_1-1,n_1}, s_{h,h+1}) ∈ M_i, add edges (u_{n_1-1}, v_h) and (u_{n_1}, v_{h+1}) to M̃_i.
Note that the total weight of the edges in M_i is exactly |M̃_i|. Let M̃ be the maximum size one among M̃_0, M̃_1, M̃_2. We claim that |M̃| ≥ m_1/3 + 4m_2/3. To see this, for each i ∈ {0, 1, 2}, let M_i* be the union of the set {(u_j, v_h) ∈ M* | j + 1 ≡ i (mod 3) and v_h is a 1-string in V} and the set {(u_j, v_h), (u_{j+1}, v_{h+1}) ∈ M* | u_j and u_{j+1} belong to the same super-vertex in G_i and v_h v_{h+1} is a 2-string in V}. It holds that Σ_{i=0}^{2} |M_i*| = m_1 + 4m_2, because each edge in M* incident to a 1-string belongs to exactly one of M_0*, M_1*, M_2*, while each edge in M* incident to a 2-string belongs to exactly two of M_0*, M_1*, M_2*. This implies that the maximum size one among M_0*, M_1*, M_2* has size at least m_1/3 + 4m_2/3. On the other hand,
for each i ∈ {0, 1, 2}, we can obtain a matching M̂_i in G_i by modifying M_i* as follows:

- For each edge (u_j, v_h) ∈ M_i* such that v_h is a 1-string in V, replace u_j by the super-vertex of G_i to which u_j belongs.
- For each pair of edges (u_j, v_h), (u_{j+1}, v_{h+1}) ∈ M_i* such that v_h v_{h+1} is a 2-string in V, replace the two edges by a single edge between s_{h,h+1} and the super-vertex in G_i to which both u_j and u_{j+1} belong.

The total weight of the edges in M̂_i is exactly |M_i*|. On the other hand, the total weight of the edges in M_i is larger than or equal to that of the edges in M̂_i, because M_i is a maximum-weighted matching in G_i. Thus, the total weight of the edges in M_i is at least |M_i*|. In turn, |M̃_i| ≥ |M_i*|.
Therefore,

    |M̃| ≥ (1/3) Σ_{i=0}^{2} |M̃_i| ≥ (1/3) Σ_{i=0}^{2} |M_i*| = (1/3)(m_1 + 4m_2).

This completes the proof of the claim and hence that of the lemma. □
Combining Lemmas 1 and 2, we now have:

Theorem 5. We can compute a feasible matching in G whose size is at least (3/5)|M*|, in O(m √(n_1 n_2)) time. Consequently, there is a 5/3-approximation algorithm for the unweighted 2-String CBM problem; it runs in O(m √(n_1 n_2)) time.
4 Concluding Remarks

It would be interesting to test whether the 5/3-approximation algorithm works well in practice. An obvious open question is whether D-String CBM admits a c-approximation algorithm for some constant c < 2.
In real NMR spectral peak assignment, we want to assign every amino acid to a spin system. Therefore, the desired output matchings are perfect matchings. So far our theoretical analysis of the approximation algorithms does not involve this requirement, although during the implementation we did put priority on perfect matchings. Designing algorithms that are guaranteed to output perfect matchings (given that the real data is a complete weighted bipartite graph) is a practical consideration. On the other hand, not imposing the perfectness requirement on the approximation algorithms could be one of the reasons that they produce matchings with better weights than the correct assignments.
Our current framework provides a proof of principle for automated backbone spectral assignment. In real NMR experiments, it can easily incorporate other NMR data to further improve the assignment accuracy. For example, one can use residual dipolar couplings, which are easy and fast to obtain. Residual dipolar couplings have been used to help peak assignment and structure determination [12,11]. Under our framework, we can use residual dipolar coupling data to provide more constraint information for spin systems. We believe further studies along this line could lead to a software package that moves closer to the goal of fully automated NMR peak assignment.
References

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001.
2. A. Bar-Noy, R. Bar-Yehuda, A. Freund, J. Naor, and B. Schieber. A unified approach to approximating resource allocation and scheduling. J. ACM, 48:1069-1090, 2001.
3. R. Bar-Yehuda and S. Even. A local-ratio theorem for approximating the weighted vertex cover problem. Annals of Discrete Mathematics, 25:27-46, 1985.
4. C. Bartels, P. Güntert, M. Billeter, and K. Wüthrich. GARANT - a general algorithm for resonance assignment of multidimensional nuclear magnetic resonance spectra. Journal of Computational Chemistry, 18:139-149, 1996.
5. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1990.
6. P. Güntert, M. Salzmann, D. Braun, and K. Wüthrich. Sequence-specific NMR assignment of proteins by global fragment mapping with the program MAPPER. Journal of Biomolecular NMR, 18:129-137, 2000.
7. K. Huang, M. Andrec, S. Heald, P. Blake, and J.H. Prestegard. Performance of a neural-network-based determination of amino acid class and secondary structure from 1H-15N NMR data. Journal of Biomolecular NMR, 10:45-52, 1997.
8. V. Kann. Maximum bounded 3-dimensional matching is MAX SNP-complete. Information Processing Letters, 37:27-35, 1991.
9. National Institute of General Medical Sciences. Pilot projects for the protein structure initiative (structural genomics). June 1999. Web page at http://www.nih.gov/grants/guide/rfa-files/RFA-GM-99-009.html.
10. C.H. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43:425-440, 1991.
11. C. A. Rohl and D. Baker. De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. Journal of the American Chemical Society, 124:2723-2729, 2002.
12. F. Tian, H. Valafar, and J. H. Prestegard. A dipolar coupling based strategy for simultaneous resonance assignment and structure determination of protein backbones. Journal of the American Chemical Society, 123:11791-11796, 2001.
13. University of Wisconsin. BioMagResBank. http://www.bmrb.wisc.edu, 2001.
14. J. Xu, S.K. Straus, B.C. Sanctuary, and L. Trimble. Use of fuzzy mathematics for complete automated assignment of peptide 1H 2D NMR spectra. Journal of Magnetic Resonance, 103:53-58, 1994.
15. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang. Automated assignment of backbone NMR peaks using constrained bipartite matching. IEEE Computing in Science & Engineering, 4:50-62, 2002.
16. D.E. Zimmerman, C.A. Kulikowski, Y. Huang, W.F.M. Tashiro, S. Shimotakahara, C. Chien, R. Powers, and G.T. Montelione. Automated analysis of protein NMR assignments using methods from artificial intelligence. Journal of Molecular Biology, 269:592-610, 1997.
Efficient Methods for Inferring Tandem Duplication History

Louxin Zhang(1), Bin Ma(2), and Lusheng Wang(3)

(1) Labs for Infor. Tech. & Dept. of Mathematics, Nat. University of Singapore, Singapore 117543. [email protected]
(2) Department of Computer Science, University of Western Ontario, Canada N6A 5B8. [email protected]
(3) Department of Computer Science, City University of Hong Kong, Hong Kong. [email protected]
Abstract. In this paper, we study the problem of inferring the duplication history of a tandem gene family using the duplication model proposed by Tang et al. We provide an efficient algorithm for inferring a duplication model for a given set of (gene) sequences by combining a linear-time algorithm, which determines whether a rooted tree is associated with a duplication model, with the nearest neighbor interchange operation. Finally, using our proposed method, we derive duplication hypotheses for an exon of the mucin gene MUC5B, a ZNF gene family, and an OR gene family.

Keywords: Algorithm, tandem duplication model, phylogeny, mucin genes, ZNF genes, olfactory receptors.
1 Introduction
Tandem duplication is a less well understood mutational process for DNA molecules in which a short stretch of DNA is transformed into several adjacent copies. For example, in the following tandem duplication,

    TTCGGAGA → TTCGGACGGACGGAGA,

the segment CGGA is transformed into three adjacent copies. Since individual copies in a tandem repeat may undergo additional mutations later on, approximate repeats are usually present in a genome.
This process is a primary mechanism for generating gene family clusters on a chromosome. Thirty-one percent of known and predicted genes (340 genes) in human chromosome 19 (HSA19) are arranged in tandem arrays [11]. It was reported that 262 distinct C2H2 zinc-finger (ZNF) loci are distributed over 11 different clusters in three cytogenetic bands 19p12, 19p13 and 19q13, and 49 olfactory receptor (OR) loci over 4 different clusters [3]. Gene families organized in tandem arrays have also been described in C. elegans [18], Drosophila [1], and Arabidopsis [17].
Genome analysis suggests that tandem duplication is an important mode of evolutionary novelty, permitting one copy of each gene to drift and potentially to acquire a new function. Hence, evolutionary study of a tandem gene family, together with other analyses, may yield valuable insights for predicting gene functions and for resolving important species-specific biological questions. With these motivations in mind, researchers have begun to study tandem gene families and their duplication history (see, for example, [2,6,16]).
In this paper, we study the problem of systematically inferring the duplication history of a tandem gene family. Previous and related works [2,6,10,16] are summarized in the next section. Here, we attack this problem using a different approach. A duplication model (DM) M for a given family of (gene) sequences is a directed graph that contains nodes, edges and blocks, as shown in Figure 1. If only the parent-child relations are considered in a duplication model M, then the resulting structure is just a rooted binary tree T_M, which is unique and is called the associated phylogeny for M. We present an optimal linear-time algorithm which, given a phylogeny T for a set of (gene) sequences, reconstructs the unique duplication model M such that T = T_M if it exists. Most importantly, we propose an efficient algorithm for inferring a duplication model for a set of (gene) sequences by combining this linear-time algorithm with the nearest neighbor interchange (NNI) operation (see Section 4 for its definition). The algorithm was used to infer the history of tandem repeats within a large exon of the mucin gene MUC5B [6] and performed better than Benson and Dong's algorithm; it was also used to derive the duplication history of the tandem gene families ZNF45 and OLF3 reported in [3]. Our algorithm outperformed the Window method on the ZNF45 family according to the parsimony principle.
2 Tandem Duplication Models
Let F = {s_1, s_2, …, s_n} be a set of DNA (or protein) sequences of a fixed length l over an alphabet A. Let A^l denote the set of all the sequences of length l over A, and let d(·,·) be a pairwise distance function on sequences in A^l. Let T be a phylogenetic tree (or phylogeny) on F. Then, it is a rooted binary tree in which each leaf is labeled by a sequence in F and each internal node v by a sequence s(v) ∈ A^l. We measure T using the following cost C(T):

    C(T) = Σ_{(u,v) ∈ E(T)} d(s(u), s(v)),

where E(T) denotes the set of directed edges in T.
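As a concrete illustration (ours, not the authors' code; the dict-based tree encoding and the use of Hamming distance for d are our assumptions), C(T) is simply a sum over the parent-child edges:

    def hamming(s: str, t: str) -> int:
        """Pairwise distance d(s, t) for two equal-length sequences."""
        return sum(a != b for a, b in zip(s, t))

    def tree_cost(children: dict, label: dict, root) -> int:
        """C(T): sum of d(s(u), s(v)) over all directed edges (u, v) of T."""
        cost, stack = 0, [root]
        while stack:
            u = stack.pop()
            for v in children.get(u, []):
                cost += hamming(label[u], label[v])
                stack.append(v)
        return cost

    # Tiny example: a root with two leaves, sequences of length 4.
    children = {"r": ["x", "y"]}
    label = {"r": "ACGT", "x": "ACGA", "y": "TCGT"}
    assert tree_cost(children, label, "r") == 2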
An ordered phylogenetic tree on F is a rooted phylogenetic tree in which the left-to-right order of the leaves is s_1, s_2, …, s_n. Obviously, ordered phylogenetic trees form a proper subclass of phylogenetic trees on F. In the final version of this paper, we prove that there are C(2(n−1), n−1) …

… We use l_c(u) and r_c(u) to denote the sequence labels of the leftmost and rightmost leaves in the subtree T_M(u) rooted at u, respectively.
Proposition 2. Let M be a DM consistent with F = {1, 2, …, n} and let T_M be the phylogeny associated with M. Then, T_M satisfies the following two properties:
(i) For each internal node u in T_M, r_c(u) > r_c(lc(u)) and l_c(u) < l_c(rc(u)); equivalently,
(ii) For each internal node u in T_M, l_c(u) … r_c(v_2) = 10 > r_c(v_3) = 9.
3 Reconstructing a Duplication Model from a Phylogeny

We have seen that each DM M has a unique associated phylogeny T_M. In this section, we present an optimal algorithm for the following reverse problem: given a phylogeny T, reconstruct a DM M such that T = T_M if it exists. Let M be a DM consistent with the sequence family F = {1, 2, …, n}. A duplication block is single if it contains only one node; it is double if it contains two nodes. A DM is said to be double if every duplication block of it is either single or double. In Section 3.1, we present a linear-time algorithm for reconstructing double DMs from phylogenies. Then, we generalize the algorithm to arbitrary DMs in Section 3.2. The best previously known algorithm for this problem takes quadratic time [16].

Note that, to specify a DM, we only need to list all non-single blocks on the associated phylogeny. Hence, the output of our algorithm is just a list of non-single duplication blocks.
3.1 Double Duplication Models

Let T be a phylogeny. We associate a pair (L_v, R_v) of leaf indices with each node v in T as follows [16]: the pair of a leaf labeled by gene i is (i, i), and (L_v, R_v) = (l_c(v), r_c(v)) for each internal node v. …

Fig. 2. … derived after the duplication blocks are eliminated; the bad nodes are marked with circles.
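The (L_v, R_v) pairs can be computed by a single post-order traversal; a minimal Python sketch (ours; the node encoding is an assumption):

    def annotate_pairs(children: dict, gene: dict, root):
        """Compute (L_v, R_v) for every node v: the gene indices of the
        leftmost and rightmost leaves in the subtree rooted at v."""
        pair = {}
        def dfs(v):
            kids = children.get(v, [])
            if not kids:                 # leaf labeled by gene i -> (i, i)
                pair[v] = (gene[v], gene[v])
            else:
                for c in kids:
                    dfs(c)
                pair[v] = (pair[kids[0]][0], pair[kids[-1]][1])
        dfs(root)
        return pair

    children = {"root": ["a", "leaf3"], "a": ["leaf1", "leaf2"]}
    gene = {"leaf1": 1, "leaf2": 2, "leaf3": 3}
    assert annotate_pairs(children, gene, "root")["root"] == (1, 3)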
(2) Notice that R_{u_1} > R_{u_2} > ⋯ > R_{u_{q−1}} > R_{u_q} and R_{v_{i+1}} > R_{v_{i+2}} > ⋯ > R_{v_{p−1}}. By merging these two sequences in p − i + q − 2 comparisons, we are able to determine all the u_{j_k}'s, 1 ≤ k ≤ p − i − 1, by just scanning the merged sequence. This is because R_{v_{i+k}} appears between R_{u_{j_k}} and R_{u_{j_k + 1}} in the merged sequence. To implement this procedure efficiently, we need not only a pointer from a parent node to its child, but also a pointer from a child node to its parent.

Assume that all the u_{j_k}'s have been determined for k = 1, 2, …, p − i − 1. After all the duplication blocks [v_{i+k}, u_{j_k}] are placed on T, the leaf 2 should be right next to the leaf 1 (see Figure 2 (b)). We derive a rooted binary tree by inserting a node v′_k in the edge (u_{j_k}, u_{j_k + 1}) for each 1 ≤ k ≤ p − i − 1, assigning the subtree T(rc(v_{i+k})) rooted at rc(v_{i+k}) as the right subtree of v′_k. Notice that the left child of v′_k is u_{j_k + 1}. Then, we form a new phylogeny T′ as illustrated in Figure 2 (c). To distinguish the inserted nodes v′_k, 1 ≤ k ≤ p − i − 1, from the original nodes of T, the inserted nodes are called bad nodes.
Algorithm for Reconstructing a DDM
Input: A phylogeny T with root r on a sequence family {1, 2, …, n}.
Output: The DDM derived from T if it exists.
DS := ∅;  /* DS keeps the double duplication blocks */
Determine a series of duplication blocks as in Proposition 5 that places 2 right next to 1;
If such a series of blocks does not exist, then no DM can be derived from T; exit the procedure.
Add all the duplication blocks obtained into DS;
Construct T′ …
The number of leaves of T′ is one less than that of T. Most importantly, we have the following straightforward fact.

Proposition 6. T is associated with a double DM M only if T′ is associated with a double DM M′ in which the bad nodes cannot participate in double duplication blocks.
Using Proposition 6, we obtain the algorithm in Figure 3 for reconstructing a double DM (DDM) from a phylogeny. By Proposition 5, we can charge the number of comparisons taken in different recursive steps to disjoint left paths in the input tree T. Since the total number of nodes in all the disjoint left paths is at most 2n, the whole algorithm takes at most 2 · 2n comparisons to determine all the duplication blocks. Therefore, we have

Theorem 1. A double DM can be reconstructed from a given phylogenetic tree T of n leaves in O(n) time if it exists.
3.2 Generalization to Arbitrary Duplication Models
In this subsection, we generalize the algorithm presented above to arbitrary DMs.
Such a generalization is based on the following proposition. Again, we assume
the leftmost paths leading to sequence 1 and sequence 2 in T are given as (1)
and (2) in the previous subsection.
Proposition 7. Assume a phylogeny T is associated with a DM M. Then, there exists a unique series of p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] such that, after these duplications are placed in T, sequence 2 is right next to sequence 1. Notice that these double duplication blocks may not be in M.

Proof. Let k be an integer such that 1 ≤ k ≤ p − i − 1. If v_{i+k} and u_j are not involved in a duplication block for any j ≤ q − 1 in the model M, sequence 2 cannot be moved across the leftmost leaf l_c(rc(v_{i+k})) in the subtree rooted at the right child of v_{i+k}, contradicting the hypothesis that 2 is placed next to 1. Therefore, for any v_{i+k}, 1 ≤ k ≤ p − i − 1, there exists a duplication block in M that contains v_{i+k} and u_j for some j. Restricting all the duplications in M to the leftmost paths specified in (1) and (2) gives a desired series of p − i − 1 double duplication blocks.
We construct a new phylogeny T′ as before. If a node x has been involved in a duplication [x_1, x_2, …, x_t, x] of size t + 1 so far, then there are t inserted nodes x′_i below x that correspond to x_i for i = 1, 2, …, t. To decide whether x is in a double duplication block in T′, we have to consider x and all the x′_i's (1 ≤ i ≤ t) simultaneously. We call such a double block a hyper-double block. If x is involved in a hyper-double block [x, y] in T′, the block [x_1, x_2, …, x_t, x] is extended into the block [x_1, x_2, …, x_t, x, y] of size t + 2 in the original tree T. With these notations in mind, we have
Proposition 8. A phylogeny T is associated with a DM if and only if
(1) there exists a unique series of p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] in T such that, after these duplication blocks are placed in T, sequence 2 is right next to sequence 1, and
(2) T′ …

In each recursive step, we associate a pair (R̄_x, R_x) of indices to each node x on the leftmost path of T′: if x is in a duplication block [x_1, x_2, …, x_t, x] at the current stage, we let R̄_x = R_{x_1} and keep R_x as defined at the beginning of Subsection 3.1. Since R_x < R_{x_i} < R̄_x for any 2 ≤ i ≤ t, only R_x and R̄_x need to be examined to determine whether x is in a hyper-double block in the next step. In this way, we can reconstruct a DM from a phylogeny in linear time. Take the tree in Figure 4 as an
example. To arrange 2 next to 1 in step 1, we obtain 3 hyper-double duplications [v_1, v_2], [v_3, v_5] and [v_8, v_6]. After this step, sequences 1, 2, 3, 4 have been arranged in increasing order. To arrange 5 next to 4 in step 2, we obtain a hyper-double block [v_5, v_4] and hence we extend the double duplication [v_3, v_5] into [v_3, v_5, v_4].
Fig. 4. Step-by-step demonstration of reconstructing a DM from a rooted tree. Here, we use a box drawn in broken lines to denote a hyper-double block; we mark the inserted nodes with circles.
Finally, to arrange 6 next to 5 in step 3, we derive a hyper-double block [v_4, v_7] and further extend the duplication [v_3, v_5, v_4] into [v_3, v_5, v_4, v_7]. After step 3, all sequences have been arranged in increasing order, and the algorithm terminates successfully by outputting the 3 duplications [v_1, v_2], [v_3, v_5, v_4, v_7] and [v_8, v_6].
4 Inferring Duplication Models from Sequence Data
Let F = {s_1, s_2, …, s_n} be a set of tandem sequences of a fixed length l over an alphabet A and let M be a DM consistent with F, in which each node v is assigned a sequence s(v) ∈ A^l, where s(v) is in F if v is a leaf. We measure M using the following cost C(M):

    C(M) = Σ_{(u,v) ∈ E(M)} dist(s(u), s(v)),

where E(M) denotes the set of directed edges in M.
By the parsimony principle, inferring a DM on F means finding a DM consistent with F and of minimum cost. This inference problem is NP-hard, as proved in [10]; therefore, a polynomial-time algorithm for it is unlikely to exist. Here, we introduce an efficient heuristic for inferring a DM from sequence data using the algorithm presented in Section 3 and the Nearest Neighbor Interchange (NNI) operation for tree transformation (see Figure 5).

Fig. 5. By an NNI operation on e, we can swap two subtrees at the different ends of e.

Every edge e in an unrooted tree partitions the set of leaves. We may consider alternative hypotheses associated with e by applying an NNI on e. This rearrangement generates alternative bipartitions for e while leaving all the other parts of the tree unchanged (see [15] for more information).
Our method is analogous to the Feng-Doolittle algorithm for multiple alignments. First, a guided tree is constructed for the given set of sequences using the parsimony method or other efficient distance-based methods such as Neighbor-Joining (see [15]). Then, NNIs are applied to the guided tree T to search for a DM that is nearest to T in terms of NNI operations. During the search process, we use the algorithm in Section 3 to determine whether the resulting tree is associated with a DM or not. Since a phylogeny with n leaves can always be transformed into another by applying at most n log_2 n + O(n) NNIs [12], our algorithm will terminate quickly. For more efficiency, we can even use the following greedy approach to obtain a DM.
Greedy NNI Search: We root the given guided tree at a fixed edge on the path from the leaf labeled with s_1 to the leaf labeled with s_n. For each k = 2, 3, …, n − 1, we repeatedly arrange s_k next to s_{k−1} using duplications as described in Section 3. If we cannot arrange s_k next to s_{k−1}, we use NNIs to move the leaf labeled with s_k to the branch having s_{k−1} as an end, so that these two leaves become siblings.

Greedy NNI Search is based on the hypothesis that a tandem gene family has evolved mainly through single gene duplication events [2,16]. Since we can determine whether a rooted tree is associated with a duplication model in linear time, Greedy NNI Search takes only quadratic time.
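For concreteness, here is a minimal Python sketch (ours; the rooted-binary-tree encoding is assumed, and this shows only the basic subtree swap, not the full search) of a single NNI across an internal edge:

    class Node:
        def __init__(self, name=None, left=None, right=None):
            self.name, self.left, self.right = name, left, right
        def __repr__(self):
            return self.name if self.left is None else f"({self.left},{self.right})"

    def nni(u, v_attr):
        """One NNI across the internal edge (u, v), where v = u.<v_attr>:
        swap u's other subtree with the left subtree of v."""
        v = getattr(u, v_attr)
        other = "left" if v_attr == "right" else "right"
        u_sub, v_sub = getattr(u, other), v.left
        setattr(u, other, v_sub)
        v.left = u_sub

    # (A,(B,(C,D))) -> (A,(C,(B,D))): the unrooted quartet changes
    # from AB|CD to AC|BD.
    v = Node(left=Node("C"), right=Node("D"))
    u = Node(left=Node("B"), right=v)
    root = Node(left=Node("A"), right=u)
    nni(u, "right")
    assert repr(root) == "(A,(C,(B,D)))"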
5 Analysis of Tandem Repeats
5.1 Mucin Gene MUC5B
Mucus consists mainly of mucin proteins, which are heterogeneous, highly glycosylated, and produced by epithelial cells [8]. Four mucin genes, MUC6, MUC2, MUC5AC and MUC5B, have been identified within a 400-kb genomic segment in human 11p15.5 [14]. All mucin genes contain a central part of tandem repeats. In the central part of MUC5B, a single large exon of about 11 kbp [4], three Cys-subdomains are followed by four repeats. Each repeat contains an R-subdomain (11 or 17 tandem repeats of a motif of 29 amino acids), an R-End subdomain, and a Cys-subdomain. Another R-subdomain of 23 tandem repeats of 29 amino acids follows the fourth repeat. This suggests that the central part of MUC5B arose through tandem duplications.
The amino acid sequences for the five R-subdomains were taken from [6] and aligned using an alignment program at Michigan Tech [9]. We chose the following parameters: substitution matrix Blosum62, mismatch score -15, gap open penalty 10, gap extension penalty 5. The alignment was used to produce a guided tree using the parsimony method in [7,13], rooted on the edge that partitions the set of leaves into {RI, RII, RIV} and {RIII, RV}. This guided tree was then given as input to the Greedy NNI Search algorithm described in the last section. The resulting tandem DM for the central part of MUC5B agrees with the evolutionary history described verbally by Desseyn et al. in their original papers. The central repeat (Cys/R/R-End subdomains) of an ancestral gene duplicated into two repeats. This event was then followed by a further duplication of the region containing the two repeats. After that, the first repeat duplicated.

We also derive a tandem duplication hypothesis for the fourth repeat segment RIV. It is composed of 17 tandem repeats of an irregular motif of 29 amino acids rich in Ser and Thr. We obtained a guided parsimony tree from the multiple alignment of the 17 tandem repeats (RIV-1 to RIV-17) (see Figure 6 (a)). Then, a duplication model was derived after 3 local NNI rearrangements on the guided tree (see Figure 6 (b)). This model has substitution score 126, which is better than the score of 133 that we obtained using Benson's greedy algorithm [2]. This hypothesis is also consistent with the conclusion drawn in [6].
5.2 ZNF and OLF Gene Families
The human genome contains roughly 700 Krüppel-type (C2H2) zinc-finger (ZNF) genes, which encode putative transcription factors, and 900 olfactory receptor (OR) genes, which encode proteins that recognize distinct types of olfactants and that function in different regions of the olfactory epithelium. Because of their importance in species-specific aspects of biology, Dehal et al. analyzed the evolutionary relationship between C2H2 ZNF genes and OR genes in human chromosome 19 (HSA19) [3]. Two hundred sixty-two C2H2 ZNF genes were identified in HSA19 and were grouped into 11 different locations (Z1 to Z11); forty-nine OR genes were grouped into 4 location clusters (OLF1 to OLF4) (Tables C and G in the supplementary material of [3]). In situ tandem duplication events are likely to have given rise to these features.

Because of their locations, we chose the gene clusters ZNF45 and OLF3 and obtained their gene-duplication histories using our heuristic algorithm described in the last section. The gene family ZNF45 contains 16 genes and OLF3 contains 14 genes in HSA19q13.2. DNA or protein sequences for each gene family were aligned using the program at Michigan Tech [9]. The alignment of each gene family was used to produce a matrix of pairwise distances between the family members and then to produce a guided tree by the parsimony method [7,13]. This guided tree was given as input to the Greedy NNI Search algorithm described in the last section.
Fig. 6. The duplication history of the third repeat RIII in the MUC5B central part. (Panels (a) and (b) show trees on the 17 repeats RIII-1 to RIII-17.)
The resulting DMs for these two gene families are shown in Figure 7. Our algorithm performed better than the Window method on the gene family ZNF45 according to the parsimony principle: our model for ZNF45 has substitution score 4690, while the one in [16] has score 4800, both constructed from the same alignment data.

These hypotheses, in conjunction with other sequence analyses, could provide valuable clues to their functions at various stages of development.
6 Conclusion
In this paper, we present an optimal linear-time algorithm for reconstructing a DM from a phylogeny. Using this algorithm, we propose an efficient heuristic algorithm for inferring a DM from sequence data. The algorithm performed well on three test data sets: MUC5B [6] and the gene families ZNF45 and OLF3 [3]. Currently, we are conducting simulation tests to evaluate the algorithm further. Its performance on simulated data will be reported in the journal version of this paper.

Genome analysis suggests that sequence inversions and gene losses occur during the course of tandem duplications. As a future research topic, the authors shall study how to incorporate gene sequence inversion and loss events into the tandem duplication model and the inference algorithms.
Fig. 7. The duplication models (a) of ZNF45 and (b) of OLF3.
Acknowledgments
Bin Ma would like to thank Chengzhi Liang for providing the references on mucin genes and for valuable discussions; he would also like to thank Mengxiang Tang for sharing their data. L. Zhang is financially supported in part by NSTB grant LS/99/001. B. Ma is supported in part by NSERC grant RGP0238748. L. Wang is supported in part by RGC grant CityU/1130/99E.
References
1. M.D. Adams et al. The genome sequence of Drosophila melanogaster. Science 287 (2000), 2185-2195.
2. G. Benson and L. Dong. Reconstructing the duplication history of a tandem repeat. In Proceedings of the 7th ISMB, 44-53, 1999.
3. P. Dehal et al. Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science 293 (2001), 104-111.
4. J.L. Desseyn et al. Human mucin gene MUC5B, the 10.7-kb large central exon encodes various alternate subdomains resulting in a super-repeat. Structural evidence for a 11p15.5 gene family. J. Biol. Chem. 272 (1997), 3168-3178.
5. J.L. Desseyn et al. Evolutionary history of the 11p15 human mucin gene family. J. Mol. Evol. 46 (1998), 102-106.
6. J.L. Desseyn et al. Evolution of the large secreted gel-forming mucins. Mol. Biol. Evol. 17 (2000), 1175-1184.
7. J. Felsenstein. PHYLIP, version 3.57c, Dept. of Genetics, Univ. of Washington, Seattle.
8. S. Ho and Y. Kim. Carbohydrate antigens on cancer-associated mucin-like molecules. Semin. Cancer Biol. 2 (1991), 389-400.
9. X. Huang. Multiple Sequence Alignment, http://genome.cs.mtu.edu/map/map.html.
10. D. Jaitly et al. Methods for reconstructing the history of tandem repeats and their application to the human genome. To appear in J. Computer and System Sciences.
11. J. Kim et al. Homology-driven assembly of a sequence-ready mouse BAC contig map spanning regions related to the 46-Mb gene-rich euchromatic segments of human chromosome 19. Genomics 74 (2001), 129-141.
12. M. Li, J. Tromp, and L. Zhang. On the nearest neighbour interchange distance between evolutionary trees. Journal of Theoret. Biol. 182 (1996), 463-467.
13. A. Lim and L. Zhang. WebPHYLIP: a web interface to PHYLIP. Bioinformatics 12 (1999), 1068-1069.
14. P. Pigny et al. Genomics 8 (1996), 340-352.
15. D.L. Swofford et al. Phylogeny inference. In Molecular Systematics (edited by D.M. Hillis et al.), pages 407-514. Sinauer Associates Inc., MA, 1996.
16. M. Tang, M. Waterman, and S. Yooseph. Zinc finger gene clusters and tandem gene duplication. RECOMB'01, 297-304, 2001.
17. The Arabidopsis Genome Initiative. Nature 408 (2000), 796-815.
18. The C. elegans Sequencing Consortium. Science 282 (1998), 2012-2018.
19. L. Wang and D. Gusfield. Improved approximation algorithms for tree alignment. J. of Algorithms 25 (1997), 255-273.
20. L. Wang, T. Jiang, and E. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica 16(3), 302-315, 1996.
Genome Rearrangement Phylogeny
Using Weighbor
Li-San Wang
Department of Computer Sciences, University of Texas
Austin, TX 78712 USA
[email protected]
Abstract. Evolution operates on whole genomes by operations that
change the order and strandedness of genes within the genomes. This
type of data presents new opportunities for discoveries about deep evo-
lutionary rearrangement events. Several distance-based phylogenetic re-
construction methods have been proposed [12,21,19] that use neighbor
joining (NJ) [16] with the expected breakpoint or inversion distances
after k rearrangement events. In this paper we study the variance of
the breakpoint and inversion distances. The result is combined with
Weighbor [5], an improved version of NJ using the variance of true evolu-
tionary distance estimators, to yield two new methods, Weighbor-IEBP
and Weighbor-EDE. Experiments show the new methods have better ac-
curacy than all previous distance-based methods, and are robust against
model parameter misspecications.
1 Background
Distance-Based Phylogenetic Reconstruction. A (phylogenetic) tree T on a set of taxa S is a tree representation of the evolutionary history of S: T is a tree leaf-labeled by S, such that the internal nodes reflect past speciation events. Given a tree T on a set S of genomes and given any two leaves i, j in T, we denote by P(i, j) the path in T between i and j. We let λ_e denote the number of evolutionary events on the edge e during the evolution of the genomes in S within the tree T; we call it the true evolutionary distance between the two endpoints. We can then define the matrix of true evolutionary distances, d_ij = Σ_{e ∈ P(i,j)} λ_e, which is additive (a distance D is additive if it satisfies the four-point condition: for every four distinct points {i, j, l, m}, D_ij + D_lm ≤ max{D_il + D_jm, D_im + D_jl}). Given an additive matrix, many distance-based methods are guaranteed to reconstruct the tree T and the edge weights.
Neighbor Joining and Its Variants BioNJ and Weighbor. Neighbor joining (NJ) is the most popular distance-based tree inference method. The input to the method is a matrix of estimated leaf-to-leaf distances {D_ij | 1 ≤ i, j ≤ n} on n leaves. Starting with these leaves as one-node subtrees, the algorithm iteratively creates new subtrees by picking two subtrees using a least-squares criterion on the distances to the roots of the other subtrees (the pairing step), and then updates the distances from the root of the new subtree to the roots of the other subtrees (the distance update step) according to some least-squares criterion [7].
In reality, we do not get exact distance estimates between leaves, due to the random nature of evolution. Atteson showed that NJ is guaranteed to reconstruct the true tree topology if the input distance matrix is sufficiently close to a distance matrix defining the same tree topology [1]. Consequently, techniques that yield a good estimate of the matrix {d_ij} are of significant interest.

In this paper we will use two modified versions of neighbor joining called BioNJ [7] and Weighbor [5]. Both methods use the variance of the true-distance estimators in the pairing (Weighbor only) and distance update steps to improve the accuracy of the tree reconstruction.
Genome Rearrangement Evolution. Modern laboratory techniques yield the ordering and strandedness of genes on a chromosome, allowing us to represent each chromosome by an ordering of signed genes (where the sign indicates the strand). Evolutionary events can alter these orderings through rearrangements such as inversions and transpositions, collectively called genome rearrangements. Because these events are rare, they give us information about ancient events in the evolutionary history of a group of organisms. In consequence, many biologists have embraced this new source of data in their phylogenetic work [6,14,15]. Appropriate tools for analyzing such data remain primitive when compared to those developed for DNA sequence data; thus developing such tools is becoming an important area of research.

The genomes of some organisms have a single chromosome or contain single-chromosome organelles (such as mitochondria [4] or chloroplasts [14,15]) whose evolution is largely independent of the evolution of the nuclear genome for these organisms. When each genome has the same set of genes and each gene appears exactly once, a genome can be described by an ordering (circular or linear) of these genes, each gene given with an orientation that is either positive (g_i) or negative (−g_i). If the genome is circular, we always represent the genome linearly so that g_1 is positive and at the first position. A close inspection shows that a circular genome with n genes, when represented linearly, is identical to a linear genome with n − 1 genes.
Let G be the genome with signed ordering g_1, g_2, …, g_k. An inversion between indices a and b, for a ≤ b, produces the genome with linear ordering (g_1, g_2, …, g_{a−1}, −g_b, −g_{b−1}, …, −g_a, g_{b+1}, …, g_k). A transposition on the (linear or circular) ordering G acts on three indices a, b, c, with a ≤ b and c ∉ [a, b], picking up the interval g_a, g_{a+1}, …, g_b and inserting it immediately after g_c. Thus the genome G above (under the assumption c > b) is replaced by (g_1, …, g_{a−1}, g_{b+1}, …, g_c, g_a, g_{a+1}, …, g_b, g_{c+1}, …, g_k). An inverted transposition is a transposition followed by an inversion of the transposed subsequence.
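Both operations are easy to state on signed permutations; a small Python sketch (ours; 0-based, inclusive indices):

    def inversion(g, a, b):
        """Invert the segment g[a..b]: reverse its order and flip every sign."""
        return g[:a] + [-x for x in reversed(g[a:b + 1])] + g[b + 1:]

    def transposition(g, a, b, c):
        """Move the segment g[a..b] so that it follows position c (c > b)."""
        return g[:a] + g[b + 1:c + 1] + g[a:b + 1] + g[c + 1:]

    g = [1, 2, 3, 4, 5]
    assert inversion(g, 1, 3) == [1, -4, -3, -2, 5]
    assert transposition(g, 1, 2, 4) == [1, 4, 5, 2, 3]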
The Generalized Nadeau-Taylor Model. The Generalized Nadeau-Taylor (GNT) model [21] of genome evolution uses only genome rearrangement events that do not change the gene content. The model assumes that the number of each of the three types of events obeys a Poisson distribution on each edge, that the relative probabilities of each type of event are fixed across the tree, and that events of a given type are equiprobable. Thus we can represent a GNT model tree as a triplet (T, {λ_e}, (α, β)), where (α, β) defines the relative probabilities of transpositions and inverted transpositions (hence an event is an inversion with probability 1 − α − β). For instance, (1/3, 1/3) indicates that the three event classes are equiprobable, while the pair (0, 0) indicates that only inversions happen.
Estimating True Evolutionary Distances Using Genome Rearrangements. Let us be given a set of genome rearrangement events with a particular weighting scheme; the weight of a sequence of rearrangement events from this set is the sum of the weights of these events. The edit distance between two gene orders is the minimum of the weights over all sequences of events from the given set that transform one gene order into the other. For example, the inversion distance is the edit distance when only inversions are permitted and all inversions have weight 1. The inversion distance can be computed in linear time [2], while the computational complexity of the transposition distance is unknown [3].

Another type of genomic distance that has received much attention in the genomics community is the breakpoint distance. Given two genomes G and G′ on the same set of genes, a breakpoint in G is an ordered pair of genes (g_a, g_b) such that g_a and g_b appear consecutively in that order in G, but neither (g_a, g_b) nor (−g_b, −g_a) appears consecutively in that order in G′. The breakpoint distance is the number of breakpoints in G, and it is easily calculated by inspection in linear time.
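A sketch of this computation (ours, for signed linear gene orders; an adjacency (g_a, g_b) of G is counted as intact in G′ if G′ contains (g_a, g_b) or its sign-reversed mirror (−g_b, −g_a)):

    def breakpoint_distance(G, H):
        """Number of adjacencies of G that do not survive in H."""
        adj = set()
        for a, b in zip(H, H[1:]):
            adj.add((a, b))
            adj.add((-b, -a))
        return sum((a, b) not in adj for a, b in zip(G, G[1:]))

    G = [1, 2, 3, 4, 5]
    H = [1, -4, -3, -2, 5]                  # G after inverting the segment 2..4
    assert breakpoint_distance(G, H) == 2   # (1,2) and (4,5) are broken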
Estimating the true evolutionary distance requires an assumption about the
model; in the case of gene-order evolution, the assumption is that the genomes
have evolved from a common ancestor under the Nadeau-Taylor model of evo-
lution. The technique in [4], applicable only to inversions, calculates this value
exactly, while Approx-IEBP [21] and EDE [12], applicable to very general models
of evolution, obtain approximations of these values, and Exact-IEBP [19] cal-
culates the value exactly for any combination of inversions, transpositions, and
inverted transpositions. These estimates can all be computed in low polynomial
time.
Variance of Genomic Distances. The following problem has been studied in [17,19]: given any genome G with n genes, what is the expected breakpoint distance between G and G′ when G′ …

    Σ_{i=0}^{j} C(j, i)·u_i = S(1, 1, …, 1, 0, …, 0) = S_j ,

where the argument of S contains exactly j ones. Let

    Z_a = Σ_{i=0}^{n} i(i−1)⋯(i−a+1)·C(n, i)·u_i = Σ_{i=a}^{n} n(n−1)⋯(n−a+1)·C(n−a, i−a)·u_i
for all a, 1 ≤ a ≤ n. We want to express Z_a as a linear combination of the S_i, 0 ≤ i ≤ n. The following lemma, which is a special case of equation (5.24) in [9], finds the coefficients of the linear combination.

Lemma 1. Let a be some given integer such that 1 ≤ a ≤ n. Let us be given {u_i : 0 ≤ i ≤ n} that satisfy

    Σ_{j=0}^{i} C(i, j)·u_j = Σ_{j=0}^{n} C(i, j)·u_j = S_i ,  where 0 ≤ i ≤ n.

We have

    Σ_{i=n−a}^{n} (−1)^{n−i} C(a, n−i)·S_i = Σ_{j=0}^{n} C(n−a, j−a)·u_j .
The following results follow from Lemma 1; we state them without proof due to space limitations.

Theorem 1. For all a, 1 ≤ a ≤ n,

    Z_a = n(n−1)⋯(n−a+1) · Σ_{i=n−a}^{n} (−1)^{n−i} C(a, n−i)·S_i .

Corollary 1. (a) E b_k = Z_1 = n(1 − S_{n−1}).
(b) Var b_k = nS_{n−1} − n²S²_{n−1} + n(n−1)S_{n−2}.
These results work for all integers r, 1 ≤ r ≤ n. When there is more than one type of rearrangement event, with different r's, we can change S accordingly. For example, let γ = α + β; for the GNT model we can set

    S = [ (1−γ)/C(n,2) · Σ_{1≤i_1<i_2≤n} x_{i_1} x_{i_2} + γ/C(n,3) · Σ_{1≤i_1<i_2<i_3≤n} x_{i_1} x_{i_2} x_{i_3} ]^k .   (1)
Mean and Variance of the Breakpoint Distance under the GNT Model. We begin this section by finding the mean and variance of b_k with respect to the GNT model. By substituting into equation (1):

    S_{n−1} = (1 − (2+γ)/n)^k ,   S_{n−2} = [ (n−3)(n−2−2γ) / (n(n−1)) ]^k .

For the GNT model, we have the following results:

    (d/dk) E b_k = −nS_{n−1} · (1/k) ln S_{n−1} = −nS_{n−1} ln(1 − (2+γ)/n)   (2)

    Var b_k = nS_{n−1} − n²S²_{n−1} + n(n−1)S_{n−2} .   (3)
Using BioNJ and Weighbor. Both BioNJ and Weighbor are designed for DNA sequence phylogeny and use the variance of the true-evolutionary-distance estimator. In BioNJ, the distance update step of NJ is modified so that the variances of the new distances are minimized. In Weighbor, the pairing step is also modified to utilize the variance information. We use the variance for the GNT model derived in this section, together with the expected breakpoint distance from [19], in the two methods. The new methods are called BioNJ-IEBP and Weighbor-IEBP.
Fig. 1. Accuracy of the estimator for the variance. Each panel consists of two sets of curves, corresponding to the values from simulation and from the theoretical estimation (left panel: breakpoint distance; right panel: standard deviation of the Exact-IEBP distance). The number of genes is 120. The number of rearrangement events, k, ranges from 1 to 220. The evolutionary model is inversion-only GNT; see Section 1 for details. For each k we generate 500 runs. We then compute the standard deviation of b_k for each k, and that of k̂(b_k) for each k, and compare them with the values of the theoretical estimation.
To estimate the true evolutionary distance, we use Exact-IEBP, though we can also use equation (2), which is less accurate. Let k̂(b) denote the Exact-IEBP distance given that the breakpoint distance is b; k̂(b) behaves as the inverse of E b_k, the expected breakpoint distance after k rearrangement events. The variance of k̂(b) can be approximated using a common statistical technique, the delta method [13], as follows:

    Var k̂(b) ≈ ((d/dk) E b_k)^{−2} · Var b_k = [ 1 − nS_{n−1} + (n−1)·S_{n−2}/S_{n−1} ] / [ nS_{n−1} (ln(1 − (2+γ)/n))² ] .

When the number of rearrangements is below the number of genes (120 in the simulation), these results are accurate approximations to the mean and variance of the breakpoint distance under the GNT model, as the simulation in Figure 1 shows. When the number of rearrangements k is so high that the breakpoint distance is close to the maximum (the resulting genome is random with respect to the genome before evolution), the simulation shows that the variance is much lower than the theoretical formula predicts. This is due to the application of the delta method: while the method assumes the Exact-IEBP distance is continuous, in reality it is a discrete function. The effect becomes more obvious as k gets large: different values of k all give breakpoint distances close to the maximum, yet Exact-IEBP can only return one estimate for k, hence the very low variance. This problem is less serious as n increases.
3 Variance of the Inversion and EDE Distances

The EDE Distance. Given two genomes having the same set of n genes, with inversion distance d between them, we define the EDE distance as n·f^{−1}(d/n); here n is the number of genes, and f, an approximation to the expected inversion distance normalized by the number of genes, is defined as (see [12]):

    f(x) = min{ x, (ax² + bx)/(x² + cx + b) } .

We simulate the inversion-only GNT model to evaluate the relationship between the inversion distance and the actual number of inversions applied. Regression on the simulation results suggests a = 1, b = 0.5956, and c = 0.4577. When the rational function is inverted, we take the larger (and only positive) root:

    x = [ −(b − cy) + sqrt( (b − cy)² + 4(a − y)·by ) ] / ( 2(a − y) ) .

Let y = d/n. Thus

    f^{−1}(y) = max{ y, [ −(b − cy) + sqrt( (b − cy)² + 4(a − y)·by ) ] / ( 2(a − y) ) } .

Here the coefficients do not depend on n, since for different values of n the curves of the normalized expected inversion distance are similar.
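A direct transcription of these formulas into Python (ours; the guard near y = a is our addition, to keep the estimate finite at saturation):

    import math

    A, B, C = 1.0, 0.5956, 0.4577        # the regression coefficients a, b, c

    def f_inverse(y: float) -> float:
        """Larger root of (a - y)x^2 + (b - cy)x - by = 0, floored at y."""
        if y >= A:
            y = A - 1e-9
        disc = (B - C * y) ** 2 + 4 * (A - y) * B * y
        root = (-(B - C * y) + math.sqrt(disc)) / (2 * (A - y))
        return max(y, root)

    def ede_distance(inv_dist: float, n: int) -> float:
        """EDE estimate of the true number of inversions: n * f^{-1}(d/n)."""
        return n * f_inverse(inv_dist / n)

    # For small distances f(x) = x, so the estimate equals the observed distance.
    assert abs(ede_distance(10, 120) - 10.0) < 1e-6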
Regression for the Variance. Due to the success of nonlinear regression in the derivation of EDE, we use the same technique again for the variance of the inversion distance (and for that of EDE). However, for different numbers of genes, the curves of the variance are very different (see Figure 2). From the simulation it is obvious that the magnitudes of the curves are inversely proportional to the number of genes (or to some function of it).

We use the following regression formula for the standard deviation of the inversion distance, normalized by the number of genes, after nx inversions are applied:

    g_n(x) = n^q · (ux² + vx)/(x² + wx + t) .
The constant term in the numerator is zero because we know g_n(0) = 0. Let r be the value such that rn is the largest number of inversions applied; we use r = 2.5. Note that

    ln( (1/(rn)) Σ_{j=0}^{rn} g_n(j/n) ) ≈ ln( (1/r) ∫_0^r g_n(x) dx ) = q·ln n + ln( (1/r) ∫_0^r (ux² + vx)/(x² + wx + t) dx )

is a linear function of ln n. Thus we can obtain q as the slope in the linear regression using ln n as the independent variable and ln( (1/(rn)) Σ_{j=0}^{rn} g_n(j/n) ) as the dependent variable (see Figure 2(b); the simulation results suggest that the average of the curve is indeed a power of n). Once q is obtained, we apply nonlinear regression to obtain u, v, w, and t using the simulation data for 40, 80, 120, and 160 genes. The resulting functions are shown as the solid curves in Figure 2, with coefficients q = −0.6998, u = 0.1684, v = 0.1573, w = 1.3893, and t = 0.8224.
Fig. 2. Left: simulation (points) and regression (solid lines) of the standard deviation of the normalized inversion distance, for 20, 40, 80, 120, and 160 genes, as a function of the normalized actual number of inversions. Right: regression of the coefficient q (see Section 3); for every point corresponding to n genes, the y-coordinate is the average of all data points in the simulation.
Variance of the EDE Distance. Let X_k and Y_k be the inversion and EDE distances after k inversions are applied to a genome of n genes, respectively. We again use the delta method. Let x = k/n. Since X_k = n·f(Y_k/n), we have

    (dY_k/dX_k)^{−1} = dX_k/dY_k = (1/n)·dX_k/d(Y_k/n) = f′(x) = (d/dx) min{ x, (x² + bx)/(x² + cx + b) } .

The point where x = (x² + bx)/(x² + cx + b) is x = 0.5423. Therefore

    f′(x) = 1 ,                                                     if 0 ≤ x < 0.5423 ;
    f′(x) = (d/dx)[(x² + bx)/(x² + cx + b)] = [ x²(c − b) + 2bx + b² ] / (x² + cx + b)² ,  if x ≥ 0.5423 ,

and

    Var(Y_k) ≈ (dY_k/dX_k)² · Var(X_k) = (f′(x))^{−2} · (n·g_n(x))² = ( n·g_n(k/n) / f′(k/n) )² .
4 Simulation Study
We present the results of two simulation studies in this section. The first experiment compares the accuracy of our new methods with that of other distance-based methods for genome rearrangement phylogeny. In the second experiment we show the robustness of Weighbor-IEBP against errors in the parameters (α, β) of the GNT model.
4.1 Experiment 1: Accuracy of the New Methods
In this experiment we compare the trees reconstructed using BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE to neighbor joining trees using different distance estimators. The following four distance estimators are used
Table 1. Settings for Experiments 1 and 2.

    Parameter                          Value
    1. Number of genes                 120 (plant chloroplast genome)
    2. Model tree generation           uniformly random topology
                                       (see the Model Tree paragraph in Section 4.1 for details)
    4. GNT model parameters (α, β)     (0, 0), (1/3, 1/3)
    5. Datasets for each setting       30

The probabilities that a rearrangement is an inversion, a transposition, or an inverted transposition are 1 − α − β, α, and β, respectively.
with neighbor joining: (1) BP, the breakpoint distance; (2) INV [2], the inversion distance; and (3) Exact-IEBP [19] and (4) EDE [12], true-evolutionary-distance estimators based on BP and INV, respectively. The procedure of neighbor joining combined with distance X will be denoted by NJ(X). According to past simulation studies [21,12,19], NJ(EDE) has the best accuracy, followed closely by NJ(Exact-IEBP). See Table 1 for the settings of the experiment.
Quantifying Error. Given an inferred tree, we measure its topological accuracy by computing false negatives with respect to the true tree [11,8]. During the evolutionary process, some edges of the model tree may have no changes (i.e., evolutionary events) on them. Since reconstructing such edges is at best guesswork, we are not interested in these edges. Hence, we define the true tree to be the result of contracting those edges in the model tree on which there are no changes.

We now define how we score an inferred tree by comparison to the true tree. For every tree there is a natural association between every edge and the bipartition on the leaf set induced by deleting the edge from the tree. Let T be the true tree and let T′ be the inferred tree. An edge of T is missing in T′ if T′ does not contain an edge defining the same bipartition; such an edge is called a false negative. Note that the external edges (i.e., edges incident to a leaf) are trivial in the sense that they are present in every tree with the same set of leaves. The false negative rate is the number of false negative edges in T′ with respect to T divided by the number of internal edges in T.
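This rate is straightforward to compute from leaf bipartitions; a Python sketch (ours; trees are encoded as child dicts, and bipartitions are canonicalized so that the rooting does not matter):

    def bipartitions(children, root, leaves):
        """Non-trivial leaf bipartitions induced by the internal edges
        of a rooted tree."""
        all_leaves, parts = set(leaves), set()
        def below(v):
            kids = children.get(v, [])
            if not kids:
                return {v}
            s = set()
            for c in kids:
                s |= below(c)
            if 1 < len(s) < len(all_leaves) - 1:
                parts.add(frozenset({frozenset(s), frozenset(all_leaves - s)}))
            return s
        below(root)
        return parts

    def false_negative_rate(true_tree, inferred_tree):
        """Fraction of the true tree's internal edges missing in the inferred tree."""
        bp_true = bipartitions(*true_tree)
        bp_inf = bipartitions(*inferred_tree)
        return len(bp_true - bp_inf) / max(len(bp_true), 1)

    true_t = ({"r": ["u", "E"], "u": ["v", "w"], "v": ["A", "B"], "w": ["C", "D"]}, "r", "ABCDE")
    bad_t = ({"r": ["u", "E"], "u": ["v", "w"], "v": ["A", "C"], "w": ["B", "D"]}, "r", "ABCDE")
    assert false_negative_rate(true_t, true_t) == 0.0
    assert false_negative_rate(true_t, bad_t) == 1.0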
Software. We use PAUP* 4.0 [18] to compute the neighbor joining method and the false negative rates between two trees. We have implemented a simulator [12,21] for the GNT model. The input is a rooted leaf-labeled model tree (T, {λ_e}) and the parameters (α, β). On each edge, the simulator applies random rearrangement events to the circular genome at the ancestral node according to the model with the given parameters α and β. We use the original Weighbor and BioNJ implementations [5,7] (downloadable from the internet) and make modifications so that they use the new variance formulas.
Fig. 3. The topological accuracy of various distance-based tree reconstruction methods (normalized false negative rate versus normalized maximum pairwise inversion distance). Top row: (α, β) = (0, 0); bottom row: (α, β) = (1/3, 1/3). Left panels compare NJ(BP), NJ(Exact-IEBP), BioNJ-IEBP, and Weighbor-IEBP; right panels compare NJ(INV), NJ(EDE), BioNJ-EDE, and Weighbor-EDE. The number of genomes is 160. See Table 1 for the settings in the experiment.
Model Trees. The model trees have topologies drawn from the uniform distribution(1), and edge lengths drawn from the discrete uniform distribution on intervals [1, b], where b is one of the following: 3, 6, 12, 18 (higher values of b make the variance of the edge lengths higher). Then the length of each edge is scaled by the same factor so that the diameter of the tree (the maximum pairwise leaf-to-leaf distance on the tree) is 36, 72, 120, 180, or 360 (so that low to high evolutionary rates are covered).

(1) This is easily done, and well known in the community: add one leaf at a time to produce the whole tree. At each iteration, choose an edge from the current tree (each edge has the same probability of being chosen) and attach the new leaf to it.

Discussion. The results of the simulation are in Figure 3. Due to space limitations, and because the relative order of accuracy remains the same, we describe our results only for a subset of our experiments. For each simulation setting, we group the methods based on the genomic distance they are based on: breakpoint or inversion distance. In either group, the relative order of accuracy is roughly the same. Let Y be the true-distance estimator based on a genomic distance X;
e.g., Y = IEBP if X = BP, and Y = EDE if X = INV. The order of the methods, starting from the worst, is (1) NJ(X), (2) BioNJ-Y, (3) NJ(Y), and (4) Weighbor-Y (except for very low evolutionary rates, when Weighbor-IEBP is worst, but only by a few percent). The differences between NJ(Y) and BioNJ-Y are extremely small. Weighbor-EDE has the best accuracy of all methods.

When we compare methods based on the breakpoint distance with methods based on the inversion distance, the latter are always better (or no worse) than the former if we compare methods of the same complexity: NJ(INV) is better than NJ(BP), NJ(EDE) is better than NJ(Exact-IEBP), BioNJ-EDE is better than BioNJ-IEBP, and Weighbor-EDE is better than Weighbor-IEBP. This suggests that INV is a better statistic than BP for the true evolutionary distance under the GNT model, even when transpositions and inverted transpositions are present. This is not surprising, as INV, just like BP, increases by a small constant when a rearrangement event from the GNT model is applied. Also, though their maximum allowed values are the same (the number of genes, for circular signed genomes), the fact that the average increase in INV is smaller(2) than the average increase in BP gives INV a wider effective range.
Note that Weighbor-IEBP outperforms NJ(Exact-IEBP) when the normalized maximum pairwise inversion distance, i.e., the diameter of the dataset, exceeds 0.6; Weighbor-IEBP (based on BP) is even better than NJ(EDE) (based on the better INV) when the diameter of the dataset exceeds 0.9. This suggests that the Weighbor approach really shines under high amounts of evolution.

Running Time. NJ, BioNJ-IEBP, and BioNJ-EDE all finish within 1 second for all settings on our Pentium workstations running Linux. However, Weighbor-IEBP and Weighbor-EDE take considerably more time; both take about 10 minutes to finish for 160 genomes. As a side note, the fast maximum parsimony methods for genome rearrangement phylogeny, MPME/MPBE, almost always exceeded the four-hour running-time limit in the experiment in [20]; Weighbor-EDE is almost as good as these two methods except for datasets with a very high amount of evolution.
(2) An inversion creates two breakpoints; a transposition and an inverted transposition can be realized by three and two inversions, respectively, and they all create three breakpoints each. Thus, under the GNT model with model parameters (α, β), and assuming that genome G has only a small breakpoint (BP(G, G′)) and inversion (INV(G, G′)) distance from the reference (ancestral) genome G′, the average increase in BP(G, G′) after a random rearrangement is applied to G is 2(1 − α − β) + 3α + 3β = 2 + α + β, and the average increase in INV(G, G′) is (1 − α − β) + 3α + 2β = 1 + 2α + β. The latter is always smaller, and the two quantities are equal only when α = 1, i.e., when only transpositions occur.

4.2 Experiment 2: Robustness of Weighbor-IEBP

In this section we demonstrate the robustness of the Weighbor-IEBP method when the model parameters are unknown. The settings are the same as in Table 1. The experiment is similar to the previous experiment, except that here we use
both the correct and the incorrect values of (α, β) for the Exact-IEBP distance and the variance estimation. Due to space limitations the results are not shown here. The false negative curves are similar even when different values of α and β are used. These results suggest that Weighbor-IEBP is robust against errors in (α, β).
5 Conclusion and Future Work
In this paper we studied the variance of the breakpoint and inversion distances under the Generalized Nadeau-Taylor model. We then used these results to obtain four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE. Of these, Weighbor-IEBP and Weighbor-EDE yield very accurate phylogenetic trees and are robust against errors in the model parameters. Future research includes analytical estimates of the expectation and variance of the inversion distance, and of the difference between the approximating model and the breakpoint distance under the true GNT model.
Acknowledgements
The author thanks Tandy Warnow for suggesting the direction of research that led to this paper, and the two anonymous referees for their helpful comments.
References
1. K. Atteson. The performance of the neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2/3):251-278, 1999.
2. D.A. Bader, B.M.E. Moret, and M. Yan. A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. J. Comput. Biol., 8(5):483-491, 2001.
3. V. Bafna and P. Pevzner. Sorting permutations by transpositions. In Proc. 6th Annual ACM-SIAM Symp. on Disc. Alg. (SODA 95), pages 614-623. ACM Press, 1995.
4. M. Blanchette, M. Kunisawa, and D. Sankoff. Gene order breakpoint evidence in animal mitochondrial phylogeny. J. Mol. Evol., 49:193-203, 1999.
5. W.J. Bruno, N.D. Socci, and A.L. Halpern. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189-197, 2000. http://www.t10.lanl.gov/billb/weighbor/.
6. S.R. Downie and J.D. Palmer. Use of chloroplast DNA rearrangements in reconstructing plant phylogeny. In P. Soltis, D. Soltis, and J.J. Doyle, editors, Molecular Systematics of Plants, volume 49, pages 14-35. Chapman & Hall, 1992.
7. O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol., 14:685-695, 1997. http://www.crt.umontreal.ca/olivierg/bionj.html.
8. O. Gascuel. Personal communication, April 2001.
9. R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 2nd edition, 1994.
10. S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial algorithm for genomic distance problems). In Proc. 27th Annual ACM Symp. on Theory of Comp. (STOC 95), pages 178-189. ACM Press, NY, 1995.
11. S. Kumar. Minimum evolution trees. Mol. Biol. Evol., 15:584-593, 1996.
12. B.M.E. Moret, L.-S. Wang, T. Warnow, and S. Wyman. New approaches for reconstructing phylogenies based on gene order. In Proc. 9th Intl. Conf. on Intel. Sys. for Mol. Bio. (ISMB 2001), pages 165-173. AAAI Press, 2001.
13. G.W. Oehlert. A note on the delta method. Amer. Statist., 46:27-29, 1992.
14. R.G. Olmstead and J.D. Palmer. Chloroplast DNA systematics: a review of methods and data analysis. Amer. J. Bot., 81:1205-1224, 1994.
15. L.A. Raubeson and R.K. Jansen. Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science, 255:1697-1699, 1992.
16. N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406-425, 1987.
17. D. Sankoff and M. Blanchette. Probability models for genome rearrangements and linear invariants for phylogenetic inference. Proc. 3rd Intl. Conf. on Comput. Mol. Bio. (RECOMB 99), pages 302-309, 1999.
18. D. Swofford. PAUP* 4.0. Sinauer Associates Inc., 2001.
19. L.-S. Wang. Improving the accuracy of evolutionary distances between genomes. In Lec. Notes in Comp. Sci. No. 2149: Proc. 1st Workshop on Algorithms in Bioinformatics (WABI 2001), pages 175-188. Springer Verlag, 2001.
20. L.-S. Wang, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, and T. Warnow. Fast phylogenetic methods for the analysis of genome rearrangement data: an empirical study. In Proc. 7th Pacific Symp. Biocomputing (PSB 2002), pages 524-535, 2002.
21. L.-S. Wang and T. Warnow. Estimating true evolutionary distances between genomes. In Proc. 33rd Annual ACM Symp. on Theory of Comp. (STOC 2001), pages 637-646. ACM Press, 2001.
Segment Match Refinement and Applications

Aaron L. Halpern and Daniel H. Huson

(New address: WSI-AB, Tübingen University, Sand 14, 72076 Tübingen, Germany. Phone: +49-7071-29-70450, email: [email protected].)
In the case of a direct match, we require that the projections $\pi_S$ and $\bar\pi_S$ are non-crossing, i.e. that the two maps are monotonically increasing and that we have $\bar\pi_S(\pi_S(h)) \le h$ for all $h \in [i, j]$ and $\pi_S(\bar\pi_S(h')) \le h'$ for all $h' \in [k, l]$.

In the case of a transposition, we require that the projections are (completely) crossing, i.e. that they are monotonically decreasing and that we have $\bar\pi_S(\pi_S(h)) \ge h$ for all $h \in [i, j]$ and $\pi_S(\bar\pi_S(h')) \ge h'$ for all $h' \in [k, l]$.
When the alignments in matches are gapless, one can use linear interpolation in place of the $\pi$ and $\bar\pi$ projections, i.e. by setting

$$\pi_S(h) := \begin{cases} k + \frac{l-k}{j-i}(h-i), & \text{if } S \text{ is a direct match, and}\\[2pt] l - \frac{l-k}{j-i}(h-i), & \text{else,}\end{cases}$$

for every $S = (A_{ij}, B_{kl})$ and $i \le h \le j$.
A match $S' = (A_{i'j'}, B_{k'l'})$ is called a submatch of $S$, if $i \le i' \le j' \le j$, and we have $\pi_S(i') = k'$ or $\bar\pi_S(k') = i'$, and $\pi_S(j') = l'$ or $\bar\pi_S(l') = j'$. Here, we assume that $\pi_{S'}$, and $\bar\pi_{S'}$, equals the restriction of $\pi_S$ to $[i', j']$, and of $\bar\pi_S$ to $[k', l']$, respectively.
3 Resolved Refinement
Consider a match $S = (A_{ij}, B_{kl}) \in \mathcal{M}$. A set of matches $\mathcal{R} = \{S_1, S_2, \ldots, S_r\}$ is called a refinement of $S$, if each $S_t \in \mathcal{R}$ is a submatch of $S$ and $\mathcal{R}$ tiles $S$, i.e. we have the following two disjoint unions:

$$(i, j] = \biguplus_{(A_{i'j'}, B_{k'l'}) \in \mathcal{R}} (i', j'] \quad\text{and}\quad (k, l] = \biguplus_{(A_{i'j'}, B_{k'l'}) \in \mathcal{R}} (k', l'] .$$
Definition 1. A set $\mathcal{M}' = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \cdots \cup \mathcal{R}_n$ such that $\mathcal{R}_i$ is a refinement of $S_i$ for every $i = 1, 2, \ldots, n$ is called a refinement of $\mathcal{M} = \{S_1, S_2, \ldots, S_n\}$.
We are particularly interested in refinements of $\mathcal{M}$ that have the property that the segments of any two matches are either disjoint or identical; a set of matches with this property is called resolved (Definition 2).
[Fig. 2. Refining and resolving matches. (a) A valid match $S = (A_{ij}, B_{kl})$, together with arrows depicting $\pi_S$ (pointing down) and $\bar\pi_S$ (pointing up). (b) The set $\{S, S'\}$ is unresolved. (c) The set $\{S', S_1, S_2, S_3\}$ is resolved; the difference between $\pi_S$ and $\bar\pi_S$ implies that the match has incorporated an insertion $I_1$. (d) An unresolved set of two matches $\{S, S'\}$.]

Assume that there exists an A-position $h \in \operatorname{supp}_A(\mathcal{M}_1) \setminus \operatorname{supp}_A(\mathcal{M}_2)$. We say that an arbitrary A-position $p$ is attracted to $h$ if there exists a sequence of matches $S_1, S_2, \ldots, S_r$ such that alternating application of their $\pi$ and $\bar\pi$ projections to $p$ yields $h$. Let $P_A$ and $P_B$ denote the basins of attraction of A-positions and B-positions for $h$, respectively. Note that $P_A \neq \emptyset$ and $P_B \neq \emptyset$. If $P_A \cap \operatorname{supp}_A(\mathcal{M}) \neq \emptyset$ or $P_B \cap \operatorname{supp}_B(\mathcal{M}) \neq \emptyset$, then by Definition 2, $h$ must occur in every resolved refinement of $\mathcal{M}$, hence in particular in $\mathcal{M}_2$. On the other hand, if both of these intersections are empty, then one can simplify $\mathcal{M}_1$ by merging each pair of consecutive matches that is separated only by a pair of points in $P_A$ and $P_B$. Note that this simplification is possible due to the non-crossing/crossing requirement for direct/reverse segment matches. We thus obtain a resolved refinement of $\mathcal{M}$ of smaller cardinality, a contradiction. For example, in Figure 2(d), $P_A = \{h\}$ and $P_B = \{h', h''\}$.
Now, assume $\operatorname{supp}_A(\mathcal{M}_1) = \operatorname{supp}_A(\mathcal{M}_2)$, $\operatorname{supp}_B(\mathcal{M}_1) = \operatorname{supp}_B(\mathcal{M}_2)$ and that there exists a match $S = (A_{ij}, B_{kl}) \in \mathcal{M}_1 \setminus \mathcal{M}_2$. By Definition 1, there exists a match $S' \in \mathcal{M}_2$ whose segments overlap, but do not coincide with, those of $S$, a contradiction.
4 Computing a Minimum Resolved Refinement

The following algorithm computes the minimum resolved refinement:
Algorithm 1. Input: a set $\mathcal{M}$ of segment matches and projections $\pi_S$, $\bar\pi_S$ for all $S \in \mathcal{M}$. Output: a resolved refinement $\mathcal{M}'$ of $\mathcal{M}$ of minimum cardinality. We maintain a bipartite graph $G = (V_A \cup V_B,\ E \subseteq V_A \times V_B)$ with node sets $V_A \subseteq \{a_0, \ldots, a_p\}$ and $V_B \subseteq \{b_0, \ldots, b_q\}$ representing A- and B-positions, respectively, and a set $E$ of (labeled) edges that keep track of pairs of positions projected onto each other by a given segment match.
First, for a set $W_A$ of A-positions, we define the A-refinement step as follows:

    If this is not the first execution of this step: set $W_B = \emptyset$.
    For each $h \in W_A \setminus V_A$:
        For each match $S = (A_{ij}, B_{kl})$ such that $i \le h \le j$:
            With $h' := \pi_S(h)$, add the edge $(S, h, h')$ to $G$ (creating nodes as necessary) and insert $h'$ into $W_B$.
Second, for a set $W_B$ of B-positions, we define the B-refinement step as follows:

    Set $W_A = \emptyset$.
    For each $h' \in W_B \setminus V_B$:
        For each match $S = (A_{ij}, B_{kl})$ such that $k \le h' \le l$:
            With $h := \bar\pi_S(h')$, add the edge $(S, h, h')$ to $G$ (creating nodes as necessary) and insert $h$ into $W_A$.
Third:

    Initialize $V_A = \emptyset$, $W_A = \operatorname{supp}_A(\mathcal{M})$, $V_B = \emptyset$ and $W_B = \operatorname{supp}_B(\mathcal{M})$.
    Repeat until no further refinement is necessary (i.e. until either $W_A = \emptyset$ or $W_B = \emptyset$):
        Apply the A-refinement step to $W_A$ and then the B-refinement step to $W_B$.
Finally:

    Lexicographically order the set of edges $E$, using the key $(S, i, k)$ or $(S, i, -k)$, depending on whether $\pi_S$ and $\bar\pi_S$ are non-crossing or crossing, respectively. Then, set
    $$\mathcal{M}' := \{(A_{ij}, B_{kl}) \mid (S, i, k) \text{ and } (S', j, l) \text{ are consecutive and } S = S'\} .$$
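The propagation loop and the final extraction can be sketched as follows for gapless matches. This is an illustrative reading of Algorithm 1, not the authors' implementation: matches are (i, j, k, l, direct) tuples as above, and a plain linear scan stands in for the range-tree lookup used in the complexity analysis below.

```python
from collections import defaultdict

def refine(matches):
    """Sketch of Algorithm 1 for gapless matches (i, j, k, l, direct).
    Returns refined matches as (i', j', k', l') tuples (orientation
    bookkeeping is omitted for brevity)."""
    def pi(S, h):                       # A-position -> B-position
        i, j, k, l, direct = S
        return k + (h - i) if direct else l - (h - i)

    def pi_bar(S, hp):                  # B-position -> A-position
        i, j, k, l, direct = S
        return i + (hp - k) if direct else i + (l - hp)

    V_A, V_B, E = set(), set(), set()
    # supports: every match contributes its segment endpoints
    W_A = {h for (i, j, k, l, d) in matches for h in (i, j)}
    W_B = {hp for (i, j, k, l, d) in matches for hp in (k, l)}

    while W_A or W_B:
        new_B = set()
        for h in W_A - V_A:             # A-refinement step
            V_A.add(h)
            for S in matches:           # linear scan; a range tree is faster
                if S[0] <= h <= S[1]:
                    hp = pi(S, h)
                    E.add((S, h, hp))
                    new_B.add(hp)
        new_A = set()
        for hp in (W_B | new_B) - V_B:  # B-refinement step
            V_B.add(hp)
            for S in matches:
                if S[2] <= hp <= S[3]:
                    h = pi_bar(S, hp)
                    E.add((S, h, hp))
                    new_A.add(h)
        W_A, W_B = new_A - V_A, set()

    # extraction: consecutive positions within one match bound a piece
    by_match = defaultdict(list)
    for S, h, hp in E:
        by_match[S].append((h, hp))
    refined = set()
    for S, pairs in by_match.items():
        pairs.sort()
        for (i2, k2), (j2, l2) in zip(pairs, pairs[1:]):
            refined.add((i2, j2, min(k2, l2), max(k2, l2)))
    return refined
```

For example, refine({(0, 4, 0, 4, True), (2, 6, 10, 14, True)}) splits each match at the positions 2 and 4 that the two matches jointly induce on A.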
Lemma 2. Algorithm 1 computes the minimum resolved refinement $\mathcal{M}'$ of $\mathcal{M}$.

Proof. In the first iteration of the A-refinement step, all original A-positions are projected onto B using all segment matches. Then, all original and new B-positions are projected onto A. We then repeatedly project the newly generated positions back and forth until no new positions in A or B are reached. At this stage, $V_A$ and $V_B$ are the A- and B-supports, respectively, of a refinement of $\mathcal{M}$. Because every position was produced by a chain of projections starting from a position coming from an original match in $\mathcal{M}$, there are no superfluous matches and the resulting set has minimum cardinality.

In the final extraction step, for each match $S \in \mathcal{M}$, the refinement of $S$ based on the set $M$ of all position pairs in $S$ is computed.
Lemma 3. Given a set $\mathcal{M}$ of $n$ segment matches between two sequences $A = a_1 a_2 \ldots a_p$ and $B = b_1 b_2 \ldots b_q$, with $p \ge q$. The graph $G$ in Algorithm 1 has at most $O(kp)$ edges and can be computed in at most $O(p \log^2 n + kp \log p)$ steps, where $k$ is the maximal number of segment matches containing any given position.
Proof. The maximum number of different A-positions is $p$. To insert an A-position $h$ into $V_A$, we first determine which segment matches contain it in $O(\log^2 n + k)$ steps, using a range tree [10,17]. For each of the $O(k)$ matches $S$ that contains $h$, we determine whether $\pi_S(h)$ is contained in $V_B$ in $O(\log p)$ steps. Putting this together, we obtain a bound of $O(p(\log^2 n + k + k\log p)) = O(p \log^2 n + kp \log p)$. Note that each of the $O(p)$ positions gives rise to at most $k$ edges in $G$.
Lemma 4. Given a set $\mathcal{M}$ of $n$ segment matches between two sequences $A = a_1 a_2 \ldots a_p$ and $B = b_1 b_2 \ldots b_q$, with $p \ge q$. The minimum resolved refinement of $\mathcal{M}$ has size $O(kp)$ and can be computed in $O(p \log^2(n) + kp \log(kp))$ steps.
[Fig. 3. (a) Matches $S$ and $S'$. (b) Refinement of $S'$ and then, subsequently, of match $S$. (c) This leads to a chain of refinements, producing $O(d/h)$ matches.]
Proof. The minimum resolved refinement of $\mathcal{M}$ has size $O(kp)$ and can be computed in $O(p \log^2(n) + kp \log(kp))$ steps. On the set of refined matches, the MMSP problem can be solved in at most $O((kp)^2 \log(kp))$ steps. Hence, the total running time is at most $O((kp)^2 \log(kp) + p \log^2(n) + kp \log(kp)) = O((kp)^2 \log(kp))$.
5.2 The Transformation Distance Problem

The transformation distance introduced in [15] provides a measure for comparing two large sequences A and B based on computing an optimal script that uses copy, reverse-copy and insertion operations to build B from segments of A. The input considered by Varré et al. consists of a set $\mathcal{M}$ of segment matches that are so-called factors, i.e. matches between segments of A and B allowing at most one mismatch. They define an order $<_0$ on $\mathcal{M}$ by setting $(A_{ij}, B_{kl}) <_0 (A_{i'j'}, B_{k'l'})$ if the first segment match strictly precedes the second, i.e. if $j < i'$ and $l < k'$. A weighted script graph $G = (V, E)$ is constructed, the nodes of which represent the segment matches. An edge is placed between two nodes if the corresponding segment matches $S$ and $S'$ satisfy $S <_0 S'$ and no segment match $S''$ with $S <_0 S'' <_0 S'$ exists.
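Read concretely, the edge set consists of the immediate successors under $<_0$. A naive cubic-time sketch, using our illustrative (i, j, k, l) tuple convention and omitting the edge weights:

```python
def script_graph_edges(matches):
    """Edges S -> S' of the script graph: S strictly precedes S'
    (j < i' and l < k') and no third match lies strictly between them."""
    def precedes(S, T):
        return S[1] < T[0] and S[3] < T[2]

    return [(S, T)
            for S in matches
            for T in matches
            if precedes(S, T)
            and not any(precedes(S, U) and precedes(U, T) for U in matches)]
```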
The original formulation of [7] assumes single-character matches, which Reinert et al. [8] generalized to allow for block matches. However, they used a refinement into single-character matches, which immediately affects the time complexity of the separation method in their branch-and-cut algorithm [8]. Like the transformation distance, this formulation can also immediately benefit from a minimum fully refined set of matches, which reduces the size of the input alignment graph.
If we apply a suitable scoring scheme to the refined matches (e.g. [1,11,16]), we can construct the alignment graph analogously to the match graph in Section 5.1 (note that it can easily be extended to the comparison of multiple sequences). Then the following holds:
Theorem 4. Let $\mathcal{M}$ be a set of $n$ matches between sequences $A = a_1 a_2 \ldots a_p$ and $B = b_1 b_2 \ldots b_q$, with $p \ge q$. The Trace Problem can be solved in at most $O(p \log^2(n) + kp \log(kp))$ steps.
Proof. As mentioned above, we can compute the minimum resolved refinement of $\mathcal{M}$ in $O(p \log^2(n) + kp \log(kp))$ steps, resulting in a minimum resolved refinement of size $O(kp)$. As noted in the proof of Lemma 5, the number of distinct nodes in the alignment graph is $O(kp)$. The trace problem for two sequences can be reduced to computing the heaviest common subsequence, which is possible in time $O((kp + kp) \log kp)$ [6]. This time is dominated by the construction of the refinement.
6 Illustration

To illustrate the strengths and weaknesses of the match refinement method described above, we use it to evaluate an alternative, greedy approach to finding a one-to-one matching of two large sequence sets.
[Fig. 4. Illustration of the method on two Drosophila assemblies. Panels: input matches; matching; greedy algorithm; transformation distance.]
The greedy technique was developed earlier to evaluate progress in assembling the human genome (Halpern, Huson, Delcher, Sutton, and Myers, unpublished), for which we needed a pipeline that is extremely fast, reasonable in memory usage, capable of finding matches to most repeated elements, and that chooses the intuitively correct match between repeats. Such a pipeline might be useful for assessing a modification to an assembly algorithm, for determining the relative degrees of completion of two separate genome sequencing projects [12], or for identifying corresponding pieces between incremental assemblies for the purpose of annotation transfer.

In a nutshell, the method conducts a suffix-tree-based all-against-all search of two assemblies [5] and then greedily determines a set of one-to-one matches, as sketched below. The greedy phase first constructs a priority queue of all resulting segment matches, ordered by decreasing length, and initializes the set of selected (one-to-one) matches to the empty set. Then the match at the head of the priority queue is popped and compared against the set of selected matches. If it does not overlap any selected match, it is added to the selected set; otherwise, the match is trimmed to eliminate all such overlaps and the shortened match is reinserted into the queue. This is repeated until the match at the head of the priority queue is shorter than some cutoff.
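A sketch of the greedy phase, assuming gapless, direct matches encoded as (i, j, k, l) tuples (our convention, not the pipeline's code); as a simplification, each trim keeps a single remaining piece of a partially overlapped match:

```python
import heapq

def greedy_matching(matches, cutoff):
    """Longest-first greedy one-to-one matching with trimming."""
    heap = [(-(j - i), (i, j, k, l)) for (i, j, k, l) in matches]
    heapq.heapify(heap)
    selected = []

    def trim(m):
        i, j, k, l = m
        for (si, sj, sk, sl) in selected:
            # cut away A-overlap, shifting B coordinates in step (gapless)
            if i < sj and si < j:
                if i < si:                       # keep the left piece
                    j, l = si, k + (si - i)
                else:                            # keep the right piece
                    k, i = k + (sj - i), sj
            # likewise for B-overlap
            if k < sl and sk < l:
                if k < sk:
                    l, j = sk, i + (sk - k)
                else:
                    i, k = i + (sl - k), sl
        return (i, j, k, l) if j > i else None   # None: trimmed away

    while heap:
        neg_len, m = heapq.heappop(heap)
        if -neg_len < cutoff:                    # head shorter than cutoff
            break
        t = trim(m)
        if t is None:
            continue
        if t == m:                               # no overlap: select it
            selected.append(m)
        else:                                    # shortened: reinsert
            heapq.heappush(heap, (-(t[1] - t[0]), t))
    return selected
```

Since trimming only shrinks a match within both sequences, a single pass over the selected set suffices per reinsertion.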
Although the results of the greedy method appeared quite good on informal inspection, the question arose just how close to maximal the obtained results were. The method described in this paper permitted us to answer this question definitively for a given dataset.
For illustration, we computed an initial match set of gapless alignments between two (partial) versions of the Drosophila genome. The first was obtained from the BDGP (http://www.fruitfly.org/sequence/dlMfasta.shtml), corresponding to 50,107,720 bp of two finished chromosome arms (2L and 3R). The second was an unpublished assembly of whole-genome shotgun sequencing data conducted at Celera in late 2001. The suffix-tree method yielded 292,769 alignments seeded with a perfect match of at least 50 bp. These matches covered 49,953,627 bp of the finished sequence (i.e., they covered all but 154,093 bp). Precisely 43,740,070 bp were covered by a single match, but the remainder was covered two or more times; since the total length of all matches was 125,863,932 bp, a relatively small fraction of the sequence is clearly involved in a large number of matches.
The refinement method described in this paper transformed the match set into 17,167,891 distinct matches. This took a total of 880 seconds on a Unix workstation (667 MHz Alpha processor) and shows that even for a large, realistic set of input matches the refinement is computed quickly. Thus the initial match set, with mean length 429 bp, was cut into much smaller pieces (mean length 7 bp) by the refinement. However, since most of the original sequence was covered by a single match, there should still be matches of considerable size; indeed, the maximum-length refined match was 466,877 bp, and of the finished sequence covered by matches, half is covered by a refined match of length at least 97,008 bp (this is known as the N50 value). Again, the complexity in the match set is restricted to the repetitive regions of the sequence.
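The N50 value used here and below can be computed from a list of piece lengths; a generic sketch, not code from the pipeline:

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover
    at least half of the total covered bases."""
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0
```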
An optimal matching was determined on the refined match set, as described above. The resulting set of matches involved 262,885 refined matches and covered 49,875,599 bp, leaving 232,121 bp of the finished sequence uncovered. Due to the refinement process, groups of selected matches may correspond to (parts of) original matches, so it may be of greater interest to consider runs of adjacent, consistent refined matches as a single matching block; 84,211 such blocks were present in the optimal matching, with a mean length of 592 bp and an N50 value of 97,164 bp.
Given the availability of the optimal matching, why would one ever choose the greedy method? The optimal matching makes no effort to favor long runs of consistent matches (i.e. large blocks) and may choose more or less at random among a set of refined matches covering a given copy of a repeated element. In contrast, the greedy method attempts to maximize the length of the individual matches, rather than the sum of their lengths, as can be seen by comparing the number of selected matches