2013
We propose a genealogy sampling algorithm, SMARTree, that provides an approach to estimating, from SNP haplotype data, the patterns of coancestry across a genome segment among a set of homologous chromosomes. To enable analysis across longer segments of genome, the sequence of coalescent trees is modeled via the modified sequential Markov coalescent (Marjoram and Wall, 2006). To assess performance in estimating these local trees, our SMARTree implementation is tested on simulated data. Our base data set consists of the SNPs in ten DNA sequences over 50 kb. We examine the effects of longer sequences, of more sequences, and of a recombination and/or mutational hotspot. The model underlying SMARTree is an approximation to the full recombinant-coalescent distribution. However, in a small trial on simulated data, recovery of local trees was similar to that of LAMARC (Kuhner et al., 2000a), a sampler which uses the full model.
Journal of Molecular Evolution, 2014
Some Recent Advances in Mathematics and Statistics, 2013
The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters such as migration or recombination rates. It also has potential application in gene-mapping, as individuals with similar trait values will tend to be more closely related genetically at the location of a trait-influencing mutation. One way to incorporate genealogical trees in genetic applications is to sample them conditional on observed genetic data. We have implemented a Markov chain Monte Carlo based genealogy sampler that conditions on observed haplotype data. Our implementation is based on an algorithm sketched by Zöllner and Pritchard but with several differences described herein. We also provide insights from our interpretation of their description that were necessary for efficient implementation. Our sampler can be used to summarize the distribution of tree-based association statistics, such as case-clustering measures.
Genetic Epidemiology, 2000
Analysis of the coalescent structure of a population may provide information useful in mapping disease loci. Current coalescent-based genealogy samplers require haplotyped data, but haplotypes are not always available, and it is not practical to sum over all haplotype assignments for large data sets. We describe a method of adding haplotype re-evaluation to the sampler, so that it samples not only among genealogies explaining a given haplotype configuration, but also among different haplotype configurations. Several different haplotype-rearrangement strategies are considered, but the simplest, inverting the phase of a single site in a single individual, appears to be the most successful. The straightforward haplotype sampler does not mix well; heating approaches can greatly improve its performance. Genet. Epidemiol. 19(Suppl 1):S15-S21, 2000. © 2000 Wiley-Liss, Inc.
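The single-site move described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: representing each individual as a pair of allele lists, and the function name `flip_phase`, are choices made here for the example.

```python
import random


def flip_phase(individuals, rng=random):
    """Propose a new haplotype configuration by inverting the phase of one
    heterozygous site in one individual.

    `individuals` is a list of (hap_a, hap_b) pairs, each haplotype a list of
    alleles (hypothetical representation). Returns a new list and leaves the
    input unchanged; returns None if no individual has a heterozygous site.
    """
    # Only heterozygous positions are candidates: swapping alleles at a
    # homozygous site would leave the configuration unchanged.
    candidates = [
        (i, s)
        for i, (a, b) in enumerate(individuals)
        for s in range(len(a))
        if a[s] != b[s]
    ]
    if not candidates:
        return None
    i, s = rng.choice(candidates)
    a, b = list(individuals[i][0]), list(individuals[i][1])
    a[s], b[s] = b[s], a[s]  # invert the phase at the chosen site
    proposal = list(individuals)
    proposal[i] = (a, b)
    return proposal
```

A flip never changes which sites are heterozygous, so the candidate set is the same before and after the move and the proposal is symmetric. Each individual's unphased genotype is also preserved.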
2008
Traditionally, nonrecombinant genomes, i.e., mtDNA or the Y chromosome, have been used for phylogeography, notably for ease of analysis. The topology of the phylogeny in this case is an acyclic graph, often a tree, which is easy to comprehend and somewhat easy to infer. However, recombination is an undeniable genetic fact for most of the genome. Driven by the need for a more complete analysis, we address the problem of estimating the ancestral recombination graph (ARG) from a collection of extant sequences. We exploit the coherence that is observed in the human haplotypes as patterns and present a network model of patterns to reconstruct the ARG. We test our model on simulations that closely mimic the observed haplotypes and observe promising results.
Genetics, 2009
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. The amount of ILS depends on population parameters such as the ancestral effective population sizes and the recombination rate, but also on the number of generations between speciation events. We use a hidden Markov model parameterized according to coalescent theory to infer the genealogy along a four-species genome alignment of closely related species and estimate population parameters. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using too restricted a set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment can be recovered efficiently together with the ancestral recombination rate. Finally, we introduce an extension of the basic model that allows for mutation rate heterogeneity and reanalyze human-chimpanzee-gorilla-orangutan alignments, using the new models. We expect that this framework will prove useful for population genomics and provide exciting insights into genome evolution.
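To make the hidden Markov model idea concrete, the sketch below computes the likelihood of an observation sequence with the standard scaled forward algorithm. It is a generic illustration, not the paper's coalescent-parameterized model. The interpretation assumed here is that hidden states stand for candidate genealogies and observations for coded alignment columns; all parameter names are hypothetical.

```python
import math


def forward_log_likelihood(init, trans, emit, observations):
    """Log-likelihood of a non-empty observation sequence under a discrete HMM.

    init[i]     : probability of hidden state i at the first position
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of observing symbol o in state i
    """
    n = len(init)
    # Unnormalized forward probabilities at the first position.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    log_lik = 0.0
    for obs in observations[1:]:
        # Rescale to avoid underflow on long sequences, accumulating the
        # log of each scaling factor.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        # Propagate one position along the sequence, then emit.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs]
            for j in range(n)
        ]
    return log_lik + math.log(sum(alpha))
```

Maximizing this quantity over the transition and emission parameters is one common way to estimate HMM parameters; how those parameters map to population sizes and recombination rates is specific to the paper and is not reproduced here.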
Acta Mathematicae Applicatae Sinica, English Series, 2009
An efficient rule-based algorithm is presented for haplotype inference from general pedigree genotype data, with the assumption of no recombination. This algorithm generalizes previous algorithms to handle the cases where some pedigree founders are not genotyped, provided that for each nuclear family at least one parent is genotyped and each non-genotyped founder appears in exactly one nuclear family. The importance of this generalization lies in that such cases frequently happen in real data, because some founders may have passed away and their genotype data can no longer be collected. The algorithm runs in $O(m^3 n^3)$ time, where m is the number of single nucleotide polymorphism (SNP) loci under consideration and n is the number of genotyped members in the pedigree. This zero-recombination haplotyping algorithm is extended to a maximum parsimoniously haplotyping algorithm in one whole genome scan to minimize the total number of breakpoint sites, or equivalently, the number of maximal zero-recombination chromosomal regions. We show that such a whole genome scan haplotyping algorithm can be implemented in $O(m^3 n^3)$ time in a novel incremental fashion, where m denotes the total number of SNP loci along the chromosome.
Lecture Notes in Computer Science, 2005
Haplotyping under the Mendelian law of inheritance on pedigree genotype data is studied. Because genetic recombinations are rare, research has focused on Minimum Recombination Haplotype Inference (MRHI), i.e. finding the haplotype configuration consistent with the genotype data having the minimum number of recombinations. We focus here on the more realistic k-MRHI, which has the additional constraint that the number of recombinations on each parent-offspring pair is at most k. Although k-MRHI is NP-hard even for k = 1, we give an algorithm to solve k-MRHI efficiently by dynamic programming in $O(n m_0^{3k+1} 2^{m_0})$ time on pedigrees with n nodes and at most $m_0$ heterozygous loci in each node. Experiments on real and simulated data show that, in most cases, our algorithm gives the same haplotyping results but runs much faster than other popular algorithms.
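The quantity that k-MRHI bounds, the number of recombinations on one parent-offspring pair, can be illustrated for the simplified case where the haplotypes are already known. The sketch below is not the paper's dynamic program over pedigrees; it only counts the minimum number of switches between a parent's two haplotypes needed to produce the haplotype passed to a child. The function name and list representation are assumptions.

```python
def min_recombinations(parent_hap_a, parent_hap_b, child_hap):
    """Minimum number of switches between the two parental haplotypes needed
    to copy `child_hap` locus by locus.

    Returns None if some child allele matches neither parental haplotype at
    that locus (inheritance from this parent is then impossible).
    """
    INF = float("inf")
    # cost[0] / cost[1]: fewest switches so far if the current locus is
    # copied from haplotype a / haplotype b.
    cost = [0, 0]
    for a, b, c in zip(parent_hap_a, parent_hap_b, child_hap):
        from_a = min(cost[0], cost[1] + 1)  # stay on a, or switch b -> a
        from_b = min(cost[1], cost[0] + 1)  # stay on b, or switch a -> b
        cost = [from_a if a == c else INF, from_b if b == c else INF]
    best = min(cost)
    return None if best == INF else best
```

Under these assumptions, a parent-offspring pair satisfies the k-MRHI constraint when the returned count is at most k.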
Bioinformatics, 2010
High-density SNP data of model animal resources provides opportunities for fine-resolution genetic variation studies. These genetic resources are generated through a variety of breeding schemes that involve multiple generations of matings derived from a set of founder animals. In this article, we investigate the problem of inferring the most probable ancestry of resulting genotypes, given a set of founder genotypes. Due to computational difficulty, existing methods either handle only small pedigree data or disregard the pedigree structure. However, large pedigrees of model animal resources often contain repetitive substructures that can be utilized in accelerating computation. Results: We present an accurate and efficient method that can accept complex pedigrees with inbreeding in inferring genome ancestry. Inbreeding is a commonly used process in generating genetically diverse and reproducible animals. It is often carried out for many generations and can account for most of the computational complexity in real-world model animal pedigrees. Our method builds a hidden Markov model that derives the ancestry probabilities through inbreeding process without explicit modeling in every generation. The ancestry inference is accurate and fast, independent of the number of generations, for model animal resources such as the Collaborative Cross (CC). Experiments on both simulated and real CC data demonstrate that our method offers comparable accuracy to those methods that build an explicit model of the entire pedigree, but much better scalability with respect to the pedigree size.
public.iastate.edu
Probability functions such as likelihoods and genotype probabilities play an important role in the analysis of genetic data. When genotype data are incomplete, Markov chain Monte Carlo (MCMC) methods, such as the Gibbs sampler, can be used to sample genotypes at the marker and trait loci. The Markov chain that corresponds to the scalar Gibbs sampler may not work due to slow mixing. Further, the Gibbs chain may not be irreducible when sampling genotypes at marker loci with more than two alleles. These problems do not arise if the genotypes are sampled jointly from the entire pedigree. When the pedigree does not have loops, a joint sample of the genotypes can be obtained efficiently via modification of the Elston-Stewart algorithm. When the pedigree has many loops, obtaining a joint sample can be time consuming. We propose a method for sampling genotypes from a pedigree so modified as to make joint sampling efficient. These samples, obtained from the modified pedigree, are used as candidate draws in the Metropolis-Hastings algorithm.
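Using draws from an easier distribution as Metropolis-Hastings candidates corresponds to the standard independence sampler, sketched below on an abstract state space. This is a generic illustration rather than the authors' pedigree method: `target` plays the role of the (unnormalized) probability under the original pedigree, and `proposal_pmf` and `proposal_draw` play the role of the modified pedigree. All names are hypothetical.

```python
import random


def independence_mh(target, proposal_pmf, proposal_draw, start, n_steps, rng):
    """Metropolis-Hastings with candidates drawn independently of the
    current state.

    target(x)          : unnormalized target probability of state x
    proposal_pmf(x)    : probability of x under the candidate distribution
    proposal_draw(rng) : draws one candidate state
    `start` must have positive target probability.
    """
    current = start
    samples = []
    for _ in range(n_steps):
        candidate = proposal_draw(rng)
        # Hastings ratio for an independence proposal:
        # p(x') q(x) / (p(x) q(x')).
        ratio = (target(candidate) * proposal_pmf(current)) / (
            target(current) * proposal_pmf(candidate)
        )
        if rng.random() < min(1.0, ratio):
            current = candidate
        samples.append(current)
    return samples
```

The closer the candidate distribution is to the target, the closer the acceptance probability is to one, which is the motivation for choosing a modified pedigree that stays near the original.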
Journal of Computational Biology, 2010
An O(nmα(m)) time algorithm is given for inferring haplotypes from genotypes of non-recombinant pedigree data, where n is the number of members, m is the number of sites, and α(m) is the inverse of the Ackermann function. The algorithm works on both tree and general pedigree structures with cycles. Constraints between pairs of heterozygous sites are used to resolve unresolved sites for the pedigree, enabling the algorithm to avoid problems previously experienced for non-tree pedigrees.

Haplotypes indicate which gene variations are on which chromosome copy, while genotypes indicate only which gene variations are present at each site of the genome. In diploid organisms such as humans, usually only genotypes are collected since the cells of these organisms carry two copies of each chromosome, and a biochemical method to extract single chromosomes directly is expensive (Gusfield 2002). Therefore, a computational methodology to infer haplotypes from genotypes is required. Data from a multigenerational family pedigree or from a population group can be used to deduce the haplotypes for all group members. Haplotype inference is complicated by recombinant data, where complementary parts of both of a parent's haplotypes can be inherited as a single combined haplotype of a child. Also complicating the problem are pedigree structures that are not trees, where there are multiple inheritance paths between some family members. The haplotyping problem has been studied extensively in the last few years, both for pedigree and population data. If recombinations are allowed, the problem of inferring haplotypes for pedigrees with the minimum number of recombinations is NP-hard (Li and Jiang 2003b). For reconstructing haplotype configurations for pedigree data, Qian and Beckmann (2002) proposed a rule-based algorithm with a time complexity $O(2^d n^2 m^3)$, where d is the largest number of children in a family, n is the number of members and m is the number of sites.
The main principle of their algorithm is that the best haplotype configuration for pedigree data is the one that minimizes the number of recombination events (the Minimum-Recombinant Haplotype Configuration (MRHC) problem). Li and Jiang (2003a, 2003b) proposed an O(dmn) block-extension algorithm for the MRHC problem using a greedy heuristic to resolve adjacent sites. However, as discussed in (Li and Jiang 2004), this algorithm did not always find the haplotypes that minimized the number of recombinations, and worked under some restrictions. In order to improve the performance and handle missing data, an integer linear programming (ILP) formulation (Li and Jiang 2004) was proposed, in which a branch-and-bound algorithm was used to narrow the search space. In fact, with missing data, the haplotyping problem is NP-hard even if there is no recombinant event in the genomic data (Liu et al. 2005). While inheritance in practice normally allows recombination, analysis of populations is discovering haplotype blocks (Gabriel et al. 2002), series of consecutive sites that are not known to be recombined. These discoveries enable us to limit data to blocks without recombination, especially for the close relationships found in pedigrees. Various results exist for non-recombinant pedigree data, though the fastest algorithm deals specifically with tree pedigrees and cannot handle multiple inheritance paths. Haplotypes can be inferred for tree pedigrees in linear time (Chan et al. 2006), by capturing parity constraints between members and using them to resolve genotype sites. An O(nmα(n)) algorithm (Li and Li 2008) is also proposed for tree pedigrees, with extensions to more complex pedigrees and missing data. However, the complexity of this algorithm does not hold for cycle pedigrees or pedigrees with missing data.
For general pedigrees, an $O(m^3 n^3)$ algorithm (Li and Jiang 2003b) represents and solves pedigree constraints using linear equations over the cyclic group $\mathbb{Z}_2$; this algorithm has also been improved to take $O(mn^2 + n^3 \log^2 n \log\log n)$ time (Xiao et al. 2007) by eliminating redundant equations in the system. Several deduction techniques have also been combined into the program HAPLORE (Zhang et al. 2005), including a haplotype deduction algorithm (Qian and Beckmann 2002), an inconsistent haplotype elimination algorithm (O'Connell and Weeks 1999), and a haplotype frequency estimation algorithm (Qin et al. 2002). Though quick in practice, HAPLORE is largely described as a set of logic rules, and its computational complexity is not concretely specified.
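Solving linear equations over the two-element group can be sketched with Gauss-Jordan elimination, where addition is XOR. The example below is a generic solver, not the cited algorithms or their redundancy elimination. The bitmask encoding of equations and the name `solve_gf2` are assumptions; in a haplotyping setting each variable might encode a binary phase choice, which is also an assumption here.

```python
def solve_gf2(rows, rhs, n_vars):
    """Solve A x = b over Z_2 by Gauss-Jordan elimination.

    rows[i] is an int bitmask: bit j is set when variable j appears in
    equation i. rhs[i] is 0 or 1. Returns one solution as a list of 0/1
    values (free variables set to 0), or None if the system is inconsistent.
    """
    eqs = [(r, b & 1) for r, b in zip(rows, rhs)]
    pivots = []  # (column, index of the equation holding that pivot)
    used = 0     # number of pivot equations placed so far
    for col in range(n_vars):
        bit = 1 << col
        pivot = next(
            (k for k in range(used, len(eqs)) if eqs[k][0] & bit), None
        )
        if pivot is None:
            continue  # no remaining equation constrains this variable
        eqs[used], eqs[pivot] = eqs[pivot], eqs[used]
        p_row, p_rhs = eqs[used]
        # Adding equations over Z_2 is XOR; clear this column everywhere else.
        for k in range(len(eqs)):
            if k != used and eqs[k][0] & bit:
                eqs[k] = (eqs[k][0] ^ p_row, eqs[k][1] ^ p_rhs)
        pivots.append((col, used))
        used += 1
    # An equation reduced to 0 = 1 means no assignment satisfies the system.
    if any(r == 0 and b == 1 for r, b in eqs):
        return None
    solution = [0] * n_vars
    for col, k in pivots:
        solution[col] = eqs[k][1]
    return solution
```

For instance, the system x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 is encoded as `solve_gf2([0b011, 0b110, 0b101], [1, 0, 1], 3)`.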