2009
Abstract: In the past, research efforts on computational phylogenetic analysis were dedicated to designing heuristics that can quickly find near-optimal trees under a specific optimization criterion. However, all criteria are oversimplified and cannot realistically model the real evolutionary process. Thus all existing algorithms for phylogenetic analysis have their limitations. This has become a serious issue for many important real-life applications, which often demand accurate results from phylogenetic analysis.
American Journal of Bioinformatics Research, 2012
Phylogenetics enables us to use various techniques to extract evolutionary relationships from sequence analysis. Most phylogenetic analysis techniques produce phylogenetic trees that represent the relationships among a set of species or their evolutionary history. This article presents a comprehensive survey of the applications and algorithms for inferring huge phylogenetic trees, gives the reader an overview of the methods currently employed for phylogenetic tree inference, and compares these methods and algorithms.
Phylogenetic analysis may be considered to be a highly reliable and important bioinformatics tool. The importance of phylogenetic analysis lies in its simple manifestation and easy handling of data. The simple tree representation of the evolution makes the phylogenetic analysis easier to comprehend and represent as well. The varied applications of phylogenetics in different fields of biology make this analysis an absolute necessity. The different aspects of phylogenetic analysis have been described in a comprehensive manner. This review may be useful to those who would like to have a firsthand knowledge of phylogenetics.
2009
We review phylogenetic inference methods with a special emphasis on inference from molecular data. We begin with a general comment on phylogenetic inference using DNA sequences, followed by a clear statement of the relevance of a good alignment of sequences. Then we provide a general description of models of sequence evolution, including evolutionary models that account for rate heterogeneity along the DNA sequences or complex secondary structure (i.e., ribosomal genes). We then present an overall description of the most relevant inference methods, focusing on key concepts of general interest. We point out the most relevant traits of methods such as maximum parsimony (MP), distance methods, maximum likelihood (ML) and Bayesian inference (BI). Finally, we discuss different measures of support for the estimated phylogeny and discuss how this relates to confidence in particular nodes of a phylogeny reconstruction.
All organisms have evolved from a common ancestor. Phylogenetic analysis measures the evolutionary distance between species and enables us to extract evolutionary relationships from sequence analysis. These relationships are depicted as phylogenetic trees. This article provides a detailed survey of sequential approaches to sequence alignment and clustering, and describes in detail how MapReduce technology improves the performance of phylogenetic analysis. A comprehensive comparison of these methods is presented in this paper.
New Achievements in Evolutionary Computation, 2010
Molecular biology and evolution, 1994
Using simulated data, we compared five methods of phylogenetic tree estimation: parsimony, compatibility, maximum likelihood, Fitch-Margoliash, and neighbor joining. For each combination of substitution rates and sequence length, 100 data sets were generated for each of 50 trees, for a total of 5,000 replications per condition. Accuracy was measured by two measures of the distance between the true tree and the estimate of the tree, one measure sensitive to accuracy of branch lengths and the other not. The distance-matrix methods (Fitch-Margoliash and neighbor joining) performed best when they were constrained from estimating negative branch lengths; all comparisons with other methods used this constraint. Parsimony and compatibility had similar results, with compatibility generally inferior; Fitch-Margoliash and neighbor joining had similar results, with neighbor joining generally slightly inferior. Maximum likelihood was the most successful method overall, although for short sequen...
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
hood method is computationally extremely expensive. We present simple new heuristics which yield accurate trees for synthetic (simulated) as well as real data and significantly reduce execution time. The new heuristics have been implemented in a program called RAxML, which is freely available as open source code. Furthermore, we present a distributed version of our algorithm, implemented in an MPI-based prototype. This prototype is currently being used to implement an HTTP-based, seti@home-like version of RAxML. We compare our program with PHYML and MrBayes, which to the best of our knowledge are currently the fastest and most accurate programs for phylogenetic tree inference based on statistical methods. Experiments are conducted using 50 synthetic 100-taxon alignments as well as real-world alignments comprising 101 to 1000 sequences. RAxML outperforms MrBayes for real-world data both in terms of speed and final likelihood values. Furthermore, for real data RAxML requires less time (by a factor of 2-8) than PHYML to reach PHYML's final likelihood values and yields better final trees due to its more exhaustive search strategy. For synthetic data MrBayes is slightly more accurate than RAxML and PHYML but significantly slower. This work is sponsored under the project ID ParBaum, within the framework of the "Competence Network for Technical, Scientific High Performance Computing in Bavaria": Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen in Bayern (KONWIHR). KONWIHR is funded by means of "High-Tech-Offensive Bayern".
Proceedings of the …, 2005
Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where gene-specific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of log-likelihoods over all genes. The other criterion is obtained by dividing the log-likelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a two-stage hierarchical framework. The first level of the hierarchy represents gene-specific evolutionary features, and the second represents site-specific features for given genes. For testing significance of the optimal topology, we suggest a two-stage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for gene-specific evolutionary features. We discuss the applicability of the two-stage bootstrap idea to the Kishino-Hasegawa test and the Shimodaira-Hasegawa test.
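As a minimal sketch of the two criteria and the two-stage bootstrap described in the abstract above, the Python below assumes that the per-gene log-likelihoods for a candidate topology have already been computed elsewhere. The function names, the example values, and the representation of a gene as a list of alignment columns are illustrative assumptions, not taken from the paper.

```python
import random

def sum_criterion(gene_logliks):
    """Conventional criterion: sum of log-likelihoods over all genes."""
    return sum(gene_logliks)

def mean_per_site_criterion(gene_logliks, gene_lengths):
    """Divide each gene's log-likelihood by its sequence length,
    then take the arithmetic mean of these ratios over genes."""
    ratios = [ll / n for ll, n in zip(gene_logliks, gene_lengths)]
    return sum(ratios) / len(ratios)

def two_stage_bootstrap(genes, rng):
    """One bootstrap replicate: resample genes with replacement,
    then resample alignment columns within each resampled gene.
    `genes` is a list of genes, each a list of alignment columns."""
    picked = [rng.choice(genes) for _ in genes]
    return [[rng.choice(gene) for _ in gene] for gene in picked]

# Hypothetical log-likelihoods and lengths for two genes on one topology.
logliks = [-100.0, -50.0]
lengths = [100, 25]
print(sum_criterion(logliks))                     # -150.0
print(mean_per_site_criterion(logliks, lengths))  # -1.5
```

Under either criterion, the tree with the largest value among the candidates would be declared optimal; the second criterion keeps a long gene from dominating the total simply because it has more sites.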
Systematic Biology, 2007
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. 
A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved. [Large data sets; maximum likelihood; phylogeny; search strategies; tree islands.]
Journal of Molecular Evolution, 1991
The efficiency of obtaining the correct tree by the maximum likelihood method (Felsenstein 1981) for inferring trees from DNA sequence data was compared with that of distance methods. The maximum likelihood method was shown to be more efficient than distance methods, particularly when the evolutionary rate differs among lineages.
2005
In this paper we introduce a new quartet-based method. This method makes use of the Bayes (or quartet) weights of quartets, like those used in quartet puzzling. However, all the weights from the related quartets are accumulated to form a global quartet weight matrix. This matrix provides integrated information and lets us recursively merge small sub-trees into larger ones until the final single tree is obtained. The experimental results show that the probability that the correct tree is among a very small number of trees constructed using our method is very high. These significant results open a new research direction toward more efficient algorithms for phylogenetic inference.
Nature Reviews Genetics, 2012
The inference of phylogenetic relationships among species and the use of such information to classify species.
Applied Soft Computing, 2014
In the spirit of the "grand challenge", this paper covers the development of novel concepts for inference of large phylogenies based on the maximum likelihood method, which has proved to be the most accurate approach for inference of huge and complex phylogenetic trees. A novel method called Leaf Pruning and Re-grafting (LPR) is presented, which is a variant of the standard Sub-tree Pruning and Re-grafting (SPR) technique. LPR is a systematic approach in which only unique topologies are generated at each step. Various stochastic search strategies for estimating the maximum likelihood (ML) tree are also proposed. Here, simulated annealing is combined with the steepest ascent method to improve the quality of the final tree. Current simulated annealing approaches all use simple hill climbing to avoid the large number of repeated topologies that SPR normally generates, which easily leads to local maxima. In the present study, steepest ascent with simulated annealing by way of LPR (SAWSA-LPR) is used instead, and the chance of returning a local maximum is significantly reduced. A straightforward and efficient parallel version of simulated annealing with steepest ascent, intended to accelerate DNA phylogenetic tree inference, is also presented. The implementation of the algorithm was observed to give better results on random DNA sequences than other tree construction methods.
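The combination of steepest ascent and simulated annealing can be sketched generically as follows. This is an illustration under stated assumptions, not the SAWSA-LPR implementation: `neighbors` is a hypothetical stand-in for whatever move set (such as LPR) generates candidate topologies, `score` stands in for the log-likelihood function, and the acceptance rule is the standard Metropolis criterion rather than anything confirmed by the abstract.

```python
import math
import random

def accept_move(delta, temperature, rng):
    """Metropolis rule for maximization: always accept an improvement,
    accept a worse candidate with probability exp(delta / temperature)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def anneal_step(current, neighbors, score, temperature, rng):
    """Steepest ascent picks the best-scoring neighbor; the annealing
    rule then decides whether to move there even if it is worse."""
    candidates = neighbors(current)
    if not candidates:
        return current
    best = max(candidates, key=score)
    if accept_move(score(best) - score(current), temperature, rng):
        return best
    return current

# Toy usage on integers instead of trees (hypothetical score and moves).
toy_score = lambda x: -(x - 3) ** 2
toy_neighbors = lambda x: [x - 1, x + 1]
state = 0
rng = random.Random(0)
for temperature in (1.0, 0.5, 0.1):
    state = anneal_step(state, toy_neighbors, toy_score, temperature, rng)
```

Evaluating every neighbor before moving costs more per step than simple hill climbing, which is presumably why the move set needs to avoid repeated topologies.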
Journal of applied biology and biotechnology, 2024
A phylogenetic tree commonly represents evolutionary relationships within a set of protein sequences. Various methods and strategies have been used to improve the accuracy of phylogenetic trees, but their capacity to derive a biologically credible relationship appears to be overestimated. Although the quality of the protein sequence alignment and the choice of substitution matrix are preliminary constraints that define the biological accuracy of the overlapped residues, the alignment is not iteratively optimized through statistical testing of residue-substitution models. A server by default applies the same alignment protocol and substitution model to every sequence set, often constructing an irrelevant phylogenetic tree, and no server implements sequence-based tailoring of the phylogenetic strategy. By rigorously constructing 270 evolutionary trees with IQ-TREE, based on 13 different alignments (Clustal-Omega, Kalign, MAFFT, MUSCLE, TCoffee, and Promals3D, as well as their HHPred-based hidden Markov model [HMM] alignments) and nine substitution models (Dayhoff, JTT, block substitution matrix 62 [BLOSUM62], WAG, probability matrix from blocks [PMB], direct computation with mutability [DCMUT], JTTDCmut, LG, and variable time [VT]), the present study highlights the failure of the current methods and emphasizes the need for more accurate scrutiny of the entire phylogenetic methodology. MUSCLE alignment and the LG and Dayhoff matrices yield more accurate phylogenetic results for sequences shorter than 500 residues by the log-likelihood measure. Moreover, Kalign 1 HMM alignment yields the top-ranked tree with the lowest tree length score only with the PMB matrix, making this substitution model more accurate in terms of total tree length score. The suggested strategy would be beneficial for understanding the potential pitfalls of phylogenetic inference and would aid us in deriving a more accurate evolutionary relationship for a sequence dataset.
Journal of Molecular Evolution, 1996
A new method is presented for inferring evolutionary trees using nucleotide sequence data. The birth-death process is used as a model of speciation and extinction to specify the prior distribution of phylogenies and branching times. Nucleotide substitution is modeled by a continuous-time Markov process. Parameters of the branching model and the substitution model are estimated by maximum likelihood. The posterior probabilities of different phylogenies are calculated and the phylogeny with the highest posterior probability is chosen as the best estimate of the evolutionary relationship among species. We refer to this as the maximum posterior probability (MAP) tree. The posterior probability provides a natural measure of the reliability of the estimated phylogeny. Two example data sets are analyzed to infer the phylogenetic relationship of human, chimpanzee, gorilla, and orangutan. The best trees estimated by the new method are the same as those from the maximum likelihood analysis of separate topologies, but the posterior probabilities are quite different from the bootstrap proportions. The results of the method are found to be insensitive to changes in the rate parameter of the branching process.
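To illustrate how posterior probabilities over candidate trees follow from likelihoods and priors, here is a small sketch of Bayes' rule over a finite set of topologies. It assumes that each tree's log-likelihood and log-prior are already available as plain numbers; the paper's birth-death prior and its treatment of branching times are not reproduced, and all names and values are hypothetical.

```python
import math

def tree_posteriors(log_liks, log_priors):
    """Posterior probability of each candidate tree via Bayes' rule,
    normalized over the supplied candidates only."""
    log_joint = [ll + lp for ll, lp in zip(log_liks, log_priors)]
    shift = max(log_joint)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def map_tree(log_liks, log_priors):
    """Index of the tree with the highest posterior probability."""
    post = tree_posteriors(log_liks, log_priors)
    return max(range(len(post)), key=post.__getitem__)

# Two hypothetical trees with equal priors; the second is three times
# as likely as the first.
log_liks = [-1000.0, -1000.0 + math.log(3)]
log_priors = [0.0, 0.0]
print(tree_posteriors(log_liks, log_priors))
print(map_tree(log_liks, log_priors))
```

Unlike a bootstrap proportion, each value here is a probability conditional on the data, the model, and the candidate set, which is one reason the two measures can differ.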
Molecular biology and evolution, 1993
The minimum-evolution (ME) method of phylogenetic inference is based on the assumption that the tree with the smallest sum of branch length estimates is most likely to be the true one. In the past this assumption has been used without mathematical proof. Here we present the theoretical basis of this method by showing that the expectation of the sum of branch length estimates for the true tree is smallest among all possible trees, provided that the evolutionary distances used are statistically unbiased and that the branch lengths are estimated by the ordinary least-squares method. We also present simple mathematical formulas for computing branch length estimates and their standard errors for any unrooted bifurcating tree, with the least-squares approach. As a numerical example, we have analyzed mtDNA sequence data obtained by Vigilant et al. and have found the ME tree for 95 human and 1 chimpanzee (outgroup) sequences. The tree was somewhat different from the neighbor-joining tree co...
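For four taxa, the ordinary least-squares branch lengths have a simple closed form, which makes the minimum-evolution criterion easy to illustrate. The sketch below computes the OLS sum of branch lengths for each of the three unrooted four-taxon topologies and picks the smallest. The function names and the distance matrix are made up for illustration, and this quartet case is not the general set of formulas from the paper.

```python
def ols_tree_length(d, pair_a, pair_b):
    """OLS sum of branch lengths for the four-taxon tree (pair_a | pair_b).
    The four external branches sum to the two within-pair distances, and
    the internal branch is across / 4 - within / 2."""
    (i, j), (k, l) = pair_a, pair_b
    within = d[i][j] + d[k][l]
    across = d[i][k] + d[i][l] + d[j][k] + d[j][l]
    return within / 2 + across / 4

def minimum_evolution_quartet(d):
    """Return the topology with the smallest OLS sum of branch lengths."""
    topologies = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    return min(topologies, key=lambda t: ols_tree_length(d, *t))

# Additive distances generated from the tree ((0,1),(2,3)) with external
# branch lengths 1, 2, 3, 4 and internal branch length 5.
d = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
print(ols_tree_length(d, (0, 1), (2, 3)))  # 15.0
print(minimum_evolution_quartet(d))        # ((0, 1), (2, 3))
```

With exactly additive distances the true topology recovers the true total length (1 + 2 + 3 + 4 + 5 = 15), while the two alternative topologies both give 17.5.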
19th IEEE International Parallel and Distributed Processing Symposium, 2005
Annealing) is presented that combines simulated annealing and hill-climbing techniques to improve the quality of final trees. In addition to the ability to perform backward steps and potentially escape local maxima provided by simulated annealing, a large number of "good" alternative topologies are generated, which can be used to build a consensus tree on the fly. Though slower than some of the fastest hill-climbing programs such as RAxML-III and PHYML, RAxML-SA finds better trees for large real data alignments containing more than 250 sequences. Furthermore, the performance on 40 simulated 500-taxon alignments is reasonable in comparison to PHYML. Finally, a straightforward and efficient OpenMP parallelization of RAxML is presented.
2005
Phylogenetic analysis is an integral part of biological research. As the number of sequenced genomes increases, available data sets are growing in number and size. Several algorithms have been proposed to handle these larger data sets. A family of algorithms known as disc covering methods (DCMs) has been selected by the NSF-funded CIPRes project to boost the performance of existing phylogenetic algorithms. Recursive Iterative Disc Covering Method 3 (Rec-I-DCM3) recursively decomposes the guide tree into subtrees, runs a phylogenetic search on each subtree, and merges the subtrees, repeating for a set number of iterations. This paper presents a detailed analysis of this algorithm.