Background: Supertree methods combine trees on subsets of the full taxon set together to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), a method that operates by first encoding the input set of source trees by a large matrix (the "MRP matrix") over {0,1, ?}, and then running maximum parsimony heuristics on the MRP matrix. Experimental studies evaluating MRP in comparison to other supertree methods have established that for large datasets, MRP generally produces trees of equal or greater accuracy than other methods, and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divide-and-conquer approach, and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix, but then analyzes the MRP matrix using heuristics (such as RAxML) for 2-state Maximum Likelihood.
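As a rough illustration of the matrix encoding described in this abstract, the following Python sketch builds an MRP-style matrix over {0, 1, ?}. It is not the authors' implementation. It assumes rooted source trees written as nested tuples of taxon names, and it uses the common convention of one column per internal clade: 1 for taxa inside the clade, 0 for other taxa in the same source tree, ? for taxa missing from that tree. The function names `leaves`, `clades`, and `mrp_matrix` are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the leaf set of every internal node strictly below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield leaves(child)
            yield from clades(child)


def mrp_matrix(source_trees):
    """Encode source trees as rows of '0', '1', '?' keyed by taxon name.

    Assumption: each source tree is a rooted nested tuple of strings.
    Each internal clade of each source tree contributes one column.
    """
    taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon].append("?")  # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")  # taxon inside the clade
                else:
                    rows[taxon].append("0")  # taxon in the tree, outside the clade
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two small source trees on overlapping taxon sets (made-up example data).
example_trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
example_matrix = mrp_matrix(example_trees)
```

The resulting rows could then be written out in whatever alignment format a parsimony or 2-state likelihood program expects; that step is omitted here because it depends on the tool.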
The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed 'bipartitions'. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent. We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL's running time is [Formula: see text], and ASTRAL-II's running time is [Formula: see text], where n is t...
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique - the Ensemble of Hidden Markov Models - that we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .
Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have...
The first heuristic for reconstructing phylogenetic trees from gene order data was introduced by Blanchette et al. It sought to reconstruct the breakpoint phylogeny and was applied to a variety of datasets. We present a new heuristic for estimating the breakpoint phylogeny which, although not polynomial-time, is much faster in practice than BPAnalysis. We use this heuristic to conduct a phylogenetic analysis of chloroplast genomes in the flowering plant family Campanulaceae. We also present and discuss the results of experimentation on this real dataset with three methods: our new method, BPAnalysis, and the neighbor-joining method, using breakpoint distances, inversion distances, and inversion plus transposition distances.
The benefits of experimental algorithmics and algorithm engineering need to be extended to applications in the computational sciences. In this paper, we focus on one such application: the reconstruction of evolutionary histories (phylogenies) from molecular data such as DNA sequences. Our presentation is not a survey of past and current work in the area, but rather a discussion of what we see as some of the important challenges in experimental algorithmics that arise from computational phylogenetics.
Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, 2004
Phylogenetic trees are commonly reconstructed based on hard optimization problems such as maximum parsimony (MP) and maximum likelihood (ML). Conventional MP heuristics for producing phylogenetic trees produce good solutions within reasonable time on small datasets (up to a few thousand sequences), while ML heuristics are limited to smaller datasets (up to a few hundred sequences). However, since MP (and presumably ML) is NP-hard, such approaches do not scale when applied to large datasets. In this paper, we present a new technique called Recursive-Iterative-DCM3 (Rec-I-DCM3), which belongs to our family of Disk-Covering Methods (DCMs). We tested this new technique on ten large biological datasets ranging from 1,322 to 13,921 sequences and obtained dramatic speedups as well as significant improvements in accuracy (better than 99.99%) in comparison to existing approaches. Thus, high-quality reconstructions can be obtained for datasets at least ten times larger than was previously pos...
Genomes can be viewed in terms of their gene content and the order in which the genes appear along each chromosome. Evolutionary events that affect the gene order or content are "rare genomic events" (rarer than events that affect the composition of the nucleotide sequences) and have been advocated by systematists for inferring deep evolutionary histories. This chapter surveys recent developments in the reconstruction of phylogenies from gene order and content, focusing on their performance under various stochastic models of evolution. Because such methods are quite restricted in the type of data they can analyze, we also present research aimed at handling the full range of whole-genome data.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, 2001
Phylogenies derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Yet very few techniques are available for such phylogenetic reconstructions. One method is breakpoint analysis, developed by Blanchette and Sankoff for solving the "breakpoint phylogeny." Our earlier studies confirmed the usefulness of this approach, but also found that BPAnalysis, the implementation developed by Sankoff and Blanchette, was too slow to use on all but very small datasets. We report here on a reimplementation of BPAnalysis using the principles of algorithmic engineering. Our faster (by 2 to 3 orders of magnitude) and flexible implementation allowed us to conduct studies on the characteristics of breakpoint analysis, in terms of running time, quality, and robustness, as well as to analyze datasets that had so far been considered out of reach. We report on these findings and also discuss future directions for our new implementation.
Systematists study how a group of genes or organisms evolved. These biologists now have set their sights on the Tree of Life challenge: to reconstruct the evolutionary history of all known living organisms. A typical phylogenetic reconstruction starts with biomolecular data, such as DNA sequences for modern organisms, and builds a tree, or phylogeny, for these sequences that represents a
The breakpoint phylogeny is an optimization problem proposed by Blanchette et al. for reconstructing evolutionary trees from gene order data. These same authors also developed and implemented BPAnalysis (3), a heuristic method (based upon solving many instances of the travelling salesman problem) for estimating the breakpoint phylogeny. We present a new heuristic for this purpose; although not polynomial-time, our heuristic
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, 2002
Evolution operates on whole genomes through mutations that change the order and strandedness of genes within the genomes. Thus analyses of gene-order data present new opportunities for discoveries about deep evolutionary events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees. In this paper we present two new methods of character coding for parsimony-based analysis of genomic rearrangements: one called MPBE-2, and a new parsimony-based method which we call MPME (based on an encoding of Bryant), both variants of the MPBE method. We then conduct computer simulations to compare this class of methods to distance-based methods (NJ under various distance measures). Our empirical results show that two of our new methods return highly accurate estimates of the true tree, outperforming the other methods significantly, especially when close to saturation.
Multiple sequence alignment (MSA) has long been a mainstay of bioinformatics, particularly in the alignment of well-conserved protein and DNA sequences and in phylogenetic reconstruction for such data. Sequence datasets with low percentage identity, on the other hand, typically yield poor alignments. Now that researchers want to produce alignments among widely divergent genomes, including both coding and noncoding sequences, it is necessary to revisit sequence alignment and phylogenetic reconstruction under more ambitious models of sequence evolution that take into account the plethora of genomic events that have been observed.
We present the results of a large-scale experimental study of quartet-based methods (quartet cleaning and puzzling) for phylogeny reconstruction. Our experiments include a broad range of problem sizes and evolutionary rates, and were carefully designed to yield statistically robust results despite the size of the sample space. We measure outcomes in terms of numbers of edges of the true tree correctly inferred by each method (true positives). Our
Evolution operates on whole genomes through direct rearrangements of genes, such as inversions, transpositions, and inverted transpositions, as well as through operations, such as duplications, losses, and transfers, that also affect the gene content of the genomes. Because these events are rare relative to nucleotide substitutions, gene order data offer the possibility of resolving ancient branches in the
Abundance profiling (also called 'phylogenetic profiling') is a crucial step in understanding the diversity of a metagenomic sample, and one of the basic techniques used for this is taxonomic identification of the metagenomic reads. We present taxon identification and phylogenetic profiling (TIPP), a new marker-based taxon identification and abundance profiling method. TIPP combines SATé-enabled phylogenetic placement, a phylogenetic placement method, with statistical techniques to control the classification precision and recall, and results in improved abundance profiles. TIPP is highly accurate even in the presence of high indel errors and novel genomes, and matches or improves on previous approaches, including NBC, mOTU, PhymmBL, MetaPhyler and MetaPhlAn.
Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus as postprocessing methods can be unsatisfactory due to their inherent limitations. Results: In this paper we present an alternative approach by using clustering algorithms on the set of candidate trees. We propose bicriterion problems, in particular
Whole-genome phylogenetic studies require various sources of phylogenetic signals to produce an accurate picture of the evolutionary history of a group of genomes. In particular, sequence-based reconstruction will play an important role, especially in resolving more recent events. But using sequences at the level of whole genomes means working with very large amounts of data (large numbers of sequences) as well
Absolute fast converging phylogenetic reconstruction methods are provably guaranteed to recover the true tree with high probability from sequences that grow only polynomially in the number of leaves, once the edge lengths are bounded arbitrarily from above and below. Only a few methods have been determined to be absolute fast converging; these have all been developed in just the last
A major computational problem in biology is the reconstruction of evolutionary (a.k.a. “phylogenetic”) trees from biomolecular sequences. Most polynomial time phylogenetic reconstruction methods are distance-based, and take as input an estimation of the evolutionary distance between every pair of biomolecular sequences in the dataset. The estimation of evolutionary distances is standardized except when the set of biomolecular sequences is “saturated”,
Papers by Tandy Warnow