The molecular clock hypothesis remains an important conceptual and analytical tool in evolutionary biology despite the repeated observation that the clock hypothesis does not perfectly explain observed DNA sequence variation. We introduce a parametric model that relaxes the molecular clock by allowing rates to vary across lineages according to a compound Poisson process. Events of substitution rate change are placed onto a phylogenetic tree according to a Poisson process. When an event of substitution rate change occurs, the current rate of substitution is modified by a gamma-distributed random variable. Parameters of the model can be estimated using Bayesian inference. We use Markov chain Monte Carlo integration to evaluate the posterior probability distribution because the posterior probability involves high dimensional integrals and summations. Specifically, we use the Metropolis-Hastings-Green algorithm with 11 different move types to evaluate the posterior distribution. We demo...
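As a rough illustration of the rate-change process described in this abstract, the sketch below simulates rates along a single branch. It is not the authors' implementation: the function names, the parameter names (`event_rate`, `shape`, `scale`), and in particular the choice to multiply the current rate by the gamma draw are assumptions made for the example, since the abstract only says the rate is "modified" by a gamma-distributed variable.

```python
import random


def simulate_branch_rates(branch_length, start_rate, event_rate, shape, scale, rng):
    """Return (segment_length, rate) pairs along one branch.

    Rate-change events arrive as a Poisson process with intensity
    `event_rate`; at each event the current rate is multiplied by a
    Gamma(shape, scale) draw (the multiplication is an assumption).
    """
    # Exponential waiting times between events give a Poisson process.
    event_times = []
    t = rng.expovariate(event_rate)
    while t < branch_length:
        event_times.append(t)
        t += rng.expovariate(event_rate)

    segments = []
    rate = start_rate
    previous = 0.0
    for t in event_times:
        segments.append((t - previous, rate))
        rate *= rng.gammavariate(shape, scale)
        previous = t
    segments.append((branch_length - previous, rate))
    return segments


def expected_substitutions(segments):
    """Expected substitutions per site: sum of length times rate."""
    return sum(length * rate for length, rate in segments)


if __name__ == "__main__":
    rng = random.Random(1)
    segs = simulate_branch_rates(1.0, 0.5, 3.0, 2.0, 0.5, rng)
    print(len(segs) - 1, "rate changes;", expected_substitutions(segs))
```

With `shape * scale = 1` the multiplier has mean one, so under this assumed parameterisation rate changes have no built-in upward or downward trend.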
We develop a model-based methodology for integrating gene-set information with an experimentally-derived gene list. The methodology uses a previously reported sampling model, but takes advantage of natural constraints in the high-dimensional discrete parameter space in order to work from a more structured prior distribution than is currently available. We show how the natural constraints are expressed in terms of linear inequality constraints within a set of binary latent variables. Further, the currently available prior gives low probability to these constraints in complex systems, such as Gene Ontology (GO), thus reducing the efficiency of statistical inference. We develop two computational advances to enable posterior inference within the constrained parameter space: one using integer linear programming for optimization, and one using a penalized Markov chain sampler. Numerical experiments demonstrate the utility of the new methodology for a multivariate integration of genomic data with GO or related information systems. Compared to available methods, the proposed multi-functional analyzer covers more reported genes without mis-covering non-reported genes, as demonstrated on genomewide data from association studies of type 2 diabetes and from RNA interference studies of influenza.
A fundamental problem in evolutionary biology is determining evolutionary relationships among different taxa. Genome arrangement data is potentially more informative than DNA sequence data in cases where alignment of DNA sequences is highly uncertain. We describe a Bayesian framework for phylogenetic inference from mitochondrial genome arrangement data that uses Markov chain Monte Carlo (MCMC) as the computational engine for inference. Our approach is to model mitochondrial data as a circular signed permutation which is subject to reversals. We calculate the likelihood of one arrangement mutating into another along a single branch by counting the number of possible sequences of reversals which transform the first to the second. We calculate the likelihood of the entire tree by augmenting the state space with the arrangements at the branching points of the tree. We use MCMC to update both the tree and the arrangement data at the branching points.
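To make the reversal operation concrete, here is a small sketch of reversals acting on a signed permutation, together with a brute-force count of the reversal sequences of a fixed length that turn one arrangement into another. It is only an illustration under simplifying assumptions: the permutation is treated as linear rather than circular, the function names are hypothetical, and the exhaustive recursion is practical only for tiny examples and is not the counting method used in the paper.

```python
def reversal(perm, i, j):
    """Reverse perm[i..j] (inclusive) and flip the sign of each gene in it."""
    if not 0 <= i <= j < len(perm):
        raise ValueError("need 0 <= i <= j < len(perm)")
    flipped = [-gene for gene in reversed(perm[i:j + 1])]
    return perm[:i] + flipped + perm[j + 1:]


def one_step_neighbors(perm):
    """All arrangements reachable from perm by exactly one reversal."""
    n = len(perm)
    return [reversal(perm, i, j) for i in range(n) for j in range(i, n)]


def count_reversal_sequences(start, target, steps):
    """Count sequences of exactly `steps` reversals taking start to target."""
    if steps == 0:
        return 1 if start == target else 0
    return sum(
        count_reversal_sequences(neighbor, target, steps - 1)
        for neighbor in one_step_neighbors(start)
    )


if __name__ == "__main__":
    print(reversal([1, 2, 3, 4, 5], 1, 3))  # [1, -4, -3, -2, 5]
    print(count_reversal_sequences([1, 2, 3], [1, -2, 3], 1))
```

Each reversal undoes itself when applied twice, which is why a permutation can always return to itself in two steps.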
Phylogeography is the study of evolutionary history among populations in a species associated with geographic genetic variation. This paper examines the phylogeography of three African gorilla subspecies based on two types of DNA sequence data. One type is HV1, the first hyper-variable region in the control region of the mitochondrial genome. The other type is nuclear mitochondrial DNA (Numt DNA), which results from the introgression of a copy of HV1 from the mitochondrial genome into the nuclear genome. Numt and HV1 sequences evolve independently when in different organelles, but they share a common evolutionary history at the same locus in the mitochondrial genome prior to introgression. This study estimates the evolutionary history of gorilla populations in terms of population divergence times and effective population sizes. Also, this study estimates the number of introgression events. The estimates are obtained in a Bayesian framework using novel Markov chain Monte Carlo methods. The method is based on a hybrid coalescent process that combines separate coalescent processes for HV1 and Numt sequences along with a transfer model for introgression events within a single population tree. This Bayesian method for the analysis of Numt and HV1 sequences is the first approach specifically designed to model the evolutionary history of homologous multi-locus sequences within a population tree framework. The data analysis reveals highly discordant estimates of the divergence time between eastern and western gorilla populations for HV1 and Numt sequences. The discordant east-west split times are evidence of male-mediated gene flow between east and west long after female gorillas stopped this migration. In addition, the analysis estimates multiple independent introgression events.
Eight ruminally cannulated lactating dairy cows from a study on the effect of dietary rumen-degraded protein on production and digestion of nutrients were used to assess using sample duplication to control day-to-day variation within animals and errors associated with sampling and laboratory analyses. Two consecutive pooled omasal samples, each representing a feeding cycle, were obtained from each cow in each period. The effectiveness of sample duplication in error control was tested by comparing the variance of the difference in treatment means when taking 2 samples from each cow in each period to the variance when taking only one sample. Compared with no duplication, sample duplication improved precision by reducing variance by 50, 40, 31, 23, 23, and 9% for, respectively, rumen-undegraded protein flows, ruminal neutral detergent fiber digestibility, microbial nonammonia N flow, microbial efficiency, organic matter flow, and organic matter truly digested in the rumen. For these sa...
Evolution; international journal of organic evolution, 2014
Maintenance of genetic variation at loci under selection has profound implications for adaptation under environmental change. In temporally and spatially varying habitats, non-neutral polymorphism could be maintained by heterozygote advantage across environments (marginal overdominance), which could be greatly increased by beneficial reversal of dominance across conditions. We tested for reversal of dominance and marginal overdominance in salinity tolerance in the saltwater-to-freshwater invading copepod Eurytemora affinis. We compared survival of F1 offspring generated by crossing saline and freshwater inbred lines (between-salinity F1 crosses) relative to within-salinity F1 crosses, across three salinities. We found evidence for both beneficial reversal of dominance and marginal overdominance in salinity tolerance. In support of reversal of dominance, survival of between-salinity F1 crosses was not different from that of freshwater F1 crosses under freshwater conditions and saltwa...
... A Likelihood Framework for Estimating Phylogeographic History on a Continuous Landscape, Alan R. Lemmon and Emily Moriarty Lemmon, 544. ... for Host-Symbiont Codivergence Indicates Ancient Origin of Fungal Endophytes in Grasses, CL Schardl, KD Craven, S. Speakman ...
Institute of Mathematical Statistics Lecture Notes - Monograph Series, 1999
We show how to quantify the uncertainty in a phylogenetic tree inferred from molecular sequence information. Given a stochastic model of evolution, the Bayesian solution is simply to form a posterior probability distribution over the space of phylogenies. All inferences are derived from this posterior, including tree reconstructions, credible sets of good trees, and conclusions about monophyletic groups, for example. The challenging part is to approximate the posterior, and we do this by constructing a Markov chain having the posterior as its invariant distribution, following the approach of Mau, Newton, and Larget (1998). Our Markov chain Monte Carlo algorithm is based on small but global changes in the phylogeny, and exhibits good mixing properties empirically. We illustrate the methodology on DNA encoding mitochondrial cytochrome oxidase 1 gathered by Hafner et al. (1994) for a set of parasites and their hosts.
Calculating the likelihood of observed DNA sequence data at the leaves of a tree is the computational bottleneck for phylogenetic analysis by Bayesian methods or by the method of maximum likelihood. Because analysis of even moderately sized data sets can require hours of computational time on fast desktop computers, algorithmic changes that substantially increase the speed of the basic likelihood calculation are significant. It has long been recognized that the contribution to the likelihood at sites with identical patterns is the same and need only be computed once for each unique pattern. We note that sites whose patterns are not identical on the entire tree may be identical on subtrees, and hence partial likelihood calculations made for one site may be stored and used for calculations at another. The bookkeeping and memory requirements are large, but not too excessive for current desktop computers. Timed calculations on many genuine data sets indicate that the computational algori...
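The long-recognized shortcut mentioned in this abstract, computing each unique site pattern once and weighting it by its count, can be sketched as follows. This covers only whole-tree patterns, not the subtree-level reuse that the paper proposes. The function names are hypothetical, and `site_log_likelihood` stands in for whatever per-site calculation (for example, a pruning algorithm on a fixed tree) a real program would supply.

```python
from collections import Counter


def site_patterns(alignment):
    """Count unique columns of an alignment given as one string per taxon."""
    lengths = {len(sequence) for sequence in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # zip(*alignment) yields one tuple of characters per site (column).
    return Counter(zip(*alignment))


def total_log_likelihood(alignment, site_log_likelihood):
    """Sum per-site log-likelihoods, evaluating each unique pattern once."""
    patterns = site_patterns(alignment)
    return sum(
        count * site_log_likelihood(pattern)
        for pattern, count in patterns.items()
    )


if __name__ == "__main__":
    alignment = ["ACGTA", "ACGTA", "ACTTA"]

    def toy_score(pattern):
        # Placeholder score, not a real likelihood: constant sites score
        # higher than variable ones.
        return -1.0 if len(set(pattern)) == 1 else -2.0

    print(len(site_patterns(alignment)), "unique patterns over 5 sites")
    print(total_log_likelihood(alignment, toy_score))
```

Here five sites collapse to four patterns, so the per-site function is called four times instead of five; on real alignments with many constant sites the saving is much larger.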
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Several stochastic models of character change, when implemented in a maximum likelihood framework, are known to give a correspondence between the maximum parsimony method and the method of maximum likelihood. One such model has an independently estimated branch-length parameter for each site and each branch of the phylogenetic tree. This model, the no-common-mechanism model, has many parameters, and, in fact, the number of parameters increases as fast as the alignment is extended. We take a Bayesian approach to the no-common-mechanism model and place independent gamma prior probability distributions on the branch-length parameters. We are able to analytically integrate over the branch lengths, and this allowed us to implement an efficient Markov chain Monte Carlo method for exploring the space of phylogenetic trees. We were able to reliably estimate the posterior probabilities of clades for phylogenetic trees of up to 500 sequences. However, the Bayesian approach to the problem, at least as implemented here with an independent prior on the length of each branch, does not tame the behavior of the branch-length parameters. The integrated likelihood appears to be a simple rescaling of the parsimony score for a tree, and the marginal posterior probability distribution of the length of a branch is dependent upon how the maximum parsimony method reconstructs the characters at the interior nodes of the tree. The method we describe, however, is of potential importance in the analysis of morphological character data and also for improving the behavior of Markov chain Monte Carlo methods implemented for models in which sites share a common branch-length parameter. [Bayesian phylogenetic inference; Markov chain Monte Carlo; maximum likelihood; parsimony model.]
Papers by Bret Larget