Papers by Elizabeth Thompson

Statistical Science, 2003
Multipoint linkage analyses of data collected on related individuals are often performed as a first step in the discovery of disease genes. Through the dependence in inheritance of genes segregating at several linked loci, multipoint linkage analysis detects and localizes chromosomal regions (called trait loci) which contain disease genes. Our ability to correctly detect and position these trait loci is increased with the analysis of data observed on large pedigrees and multiple genetic markers. However, large pedigrees generally contain substantial missing data and exact calculation of the required multipoint likelihoods quickly becomes intractable. In this paper, we present a new Markov chain Monte Carlo approach to multipoint linkage analysis which greatly extends the range of models and data sets for which analysis is practical. Several advances in Markov chain Monte Carlo theory, namely joint updates of latent variables across loci or meioses, integrated proposals, Metropolis-Hastings restarts via sequential imputation and Rao-Blackwellized estimators, are incorporated into a sampling strategy which mixes well and produces accurate results in real time. The methodology is demonstrated through its application to several data sets originating from a study of early-onset Alzheimer's disease in families of Volga-German ethnic origin.

We propose a genealogy sampling algorithm, SMARTree, that provides an approach to estimation from SNP haplotype data of the patterns of coancestry across a genome segment among a set of homologous chromosomes. To enable analysis across longer segments of genome, the sequence of coalescent trees is modeled via the modified sequential Markov coalescent (Marjoram and Wall, 2006). To assess performance in estimating these local trees, our SMARTree implementation is tested on simulated data. Our base data set is of the SNPs in ten DNA sequences over 50 kb. We examine the effects of longer sequences and of more sequences, and of a recombination and/or mutational hotspot. The model underlying SMARTree is an approximation to the full recombinant-coalescent distribution. However, in a small trial on simulated data, recovery of local trees was similar to that of LAMARC (Kuhner et al., 2000a), a sampler which uses the full model.

Genetics, 2000
In disequilibrium mapping from data on a rare allele, interest may focus on the ancestry of a random sample of current descendants of a mutation. The mutation is assumed to have been introduced into the population as a single copy a known time ago and to have reached a given copy number within the population. Theory has been developed to describe the ancestral distribution under arbitrary patterns of population expansion. Further results permit convenient realization of the ancestry for a random sample of copies of a rare allele within populations of constant size or within populations growing or shrinking at constant exponential rate. In this article, we present an efficient approximate method for realizing coalescence times under more general patterns of population growth. We also apply diagnostics, checking the age of the mutation. In the course of the derivation, some additional insight is gained into the dynamics of the descendants of the mutation.

Markov Chain Monte Carlo, 2005
This chapter provides a tutorial introduction to the use of MCMC in the analysis of data observed for multiple genetic loci on members of extended pedigrees in which there are many missing data. We introduce the specification of pedigrees and inheritance, and the structure of genetic models defining the dependence structure of data. We review exact computational algorithms which can provide a partial solution, and can be used to improve MCMC sampling of inheritance patterns. Realization of inheritance patterns can be used in several ways. Here, we focus on the estimation of multilocus linkage lod scores for the location of a locus affecting a disease trait relative to a known map of genetic marker loci.
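For readers unfamiliar with the term, the lod score mentioned in this abstract is conventionally defined as a base-10 log likelihood ratio. The symbols below ($x$ for a hypothesized trait-locus position on the marker map, $L$ for the likelihood of the observed pedigree data) are notation assumed here for illustration and are not taken from the chapter itself.

```latex
\mathrm{lod}(x) \;=\; \log_{10}
  \frac{L(\text{trait locus at position } x)}
       {L(\text{trait locus unlinked to the markers})}
```

Under this standard definition, a positive value at $x$ indicates that the data are more probable with the trait locus placed at $x$ than with it unlinked to the marker map.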

Genetics, 2002
An isolated population is a group of individuals who are descended from a founding population who lived some time ago. If the founding individuals are assumed to be noninbred and unrelated, a chromosome sampled from the population can be represented as a mosaic of segments of the original ancestral types. A population in which chromosomes are made up of a few long segments will exhibit linkage disequilibrium due to founder effect over longer distances than a population in which the chromosomes are made up of many short segments. We study the length of intact ancestral segments by obtaining the expected number of junctions (points where DNA of two distinct ancestral types meet) in a chromosome. Assuming random mating, we study analytically the effects of population age, growth patterns, and internal structure on the expected number of junctions in a chromosome. We demonstrate that the type of growth a population has experienced can influence the expected number of junctions, as can p...

Theoretical Population Biology, 2003
The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, IBD by LocAlity-Sensitive Hashing, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to the current leading method and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for hundreds of thousands to millions of individuals. We applied iLASH to the Population Architecture using Genomics and Epidemiology (PAGE) dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, which identified IBD segments on a single machine in an hour (~3 minutes per chromosome compared to over 6 days per chromosome for a state-of-the-art algorithm). iLASH is able to efficiently estimate IBD tracts in very large-scale datasets, as demonstrated via IBD estimation across the entire UK Biobank (~500,000 individuals), detecting nearly 13 billion pairwise IBD tracts shared between ~11% of participants. In summary, iLASH enables fast and accurate detection of IBD, an upstream step in applications of IBD for population genetics and trait mapping.
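The abstract does not describe iLASH's implementation, so the following is only a generic illustration of locality-sensitive hashing applied to haplotype windows, not the authors' method. It uses MinHash signatures with banding, so that haplotypes sharing most of their short allele "words" are likely to fall in a common bucket and only those pairs need a full comparison. All names (`shingles`, `minhash_signature`, `candidate_pairs`), the parameter values, and the toy haplotype strings are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(haplotype, k=4):
    """Return the set of overlapping k-allele words in a haplotype string."""
    return {haplotype[i:i + k] for i in range(len(haplotype) - k + 1)}


def minhash_signature(items, num_hashes=20):
    """MinHash signature: for each seeded hash, keep the minimum over items."""
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
            for item in items
        )
        signature.append(smallest)
    return tuple(signature)


def candidate_pairs(haplotypes, bands=5, rows=4, k=4):
    """Pairs of sample names whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for name, hap in haplotypes.items():
        sig = minhash_signature(shingles(hap, k), bands * rows)
        for b in range(bands):
            # Samples land in the same bucket only if a whole band matches.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy data (assumed for illustration): "a" and "b" are identical over this window.
haplotypes = {
    "a": "0110100111010010",
    "b": "0110100111010010",
    "c": "1001011000101101",
}
print(candidate_pairs(haplotypes))
```

Identical windows always produce identical signatures, so the pair `("a", "b")` is guaranteed to be reported. Whether a dissimilar window such as `"c"` also collides depends on the hash values and on the choice of `bands` and `rows`, which trade sensitivity against the number of spurious candidates.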

BMC proceedings, 2016
In the past few years, imputation approaches have been mainly used in population-based designs of genome-wide association studies, although both family- and population-based imputation methods have been proposed. With the recent surge of family-based designs, family-based imputation has become more important. Imputation methods for both designs are based on identity-by-descent (IBD) information. Apart from imputation, the use of IBD information is also common for several types of genetic analysis, including pedigree-based linkage analysis. We compared the performance of several family- and population-based imputation methods in large pedigrees provided by Genetic Analysis Workshop 19 (GAW19). We also evaluated the performance of a new IBD mapping approach that we propose, which combines IBD information from known pedigrees with information from unrelated individuals. Different combinations of the imputation methods have varied imputation accuracies. Moreover, we showed gains from th...

Institute of Mathematical Statistics Lecture Notes - Monograph Series, 1999
Genetic Analysis Workshop 10 identified five key factors contributing to the resolution of the genetic factors affecting complex traits. These include analysis with multipoint methods, use of extended pedigrees, and selective sampling of pedigrees. By sampling the affected individuals in an extended pedigree, we obtain individuals who have an increased probability of sharing genes identical by descent (IBD) at marker loci that are linked to the trait locus or loci. Given marker data on specified members of a pedigree, the conditional IBD status among relatives can be assessed, but exact computation is often impractical for multiple linked markers on complex pedigrees. The use of Markov chain Monte Carlo (MCMC) methods greatly extends the range of models and data sets for which analysis is computationally feasible. Many forms of MCMC have now been implemented in the context of genetic analysis. Here we propose a new sampler, which takes as latent variables the segregation indicators at marker loci, and jointly updates all indicators corresponding to a given meiosis. The sampler has good mixing properties. Questions of irreducibility are also addressed.
1. Introduction. Relatives share common ancestors. A single gene in such an ancestor may therefore descend via repeated segregations to each of the relatives. Such genes, which are copies of a single ancestral gene within a defined pedigree, are said to be identical by descent (IBD). Disregarding mutation, IBD genes must be of like type. It is the sharing of IBD genes that underlies phenotypic similarities among relatives. The probabilities of patterns of gene identity by descent are determined by the pedigree structure, and in turn determine the probability distribution of observed data on individuals of the pedigree.
Genetic linkage is the dependent cosegregation of genes at different loci on the same chromosome. Linkage detection and linkage analysis on the basis of data observed on related individuals require the computation of multilocus probabilities of observed phenotypic data on pedigree structures. Genetic Analysis Workshop 10 identified five key factors contributing to the resolution of the genetic factors affecting complex traits (Wijsman and Amos 1997). These include analysis with multipoint methods, use of extended pedigrees, and selective sampling of pedigrees. Here we consider an approach to linkage detection which uses only data on affected individuals. However, calculation of multilocus probabili-
Work supported in part by NIH grant GM-46255 and NSF grant BIR-9305835. AMS 1991 subject classifications: Primary 62F03; secondary 92D10.

We performed multipoint linkage analyses with multiple programs and models for several gene expression traits in the Centre d'Etude du Polymorphisme Humain families. All analyses provided consistent results for both peak location and shape. Variance-components (VC) analysis gave wider peaks and Bayes factors gave fewer peaks. Among programs from the MORGAN package, lm_multiple performed better than lm_markers, resulting in less Markov-chain Monte Carlo (MCMC) variability between runs, and the program lm_twoqtl provided higher LOD scores by also including either a polygenic component or an additional quantitative trait locus.

We explored the utility of population- and pedigree-based analyses using the Framingham Heart Study genome-wide 50k single-nucleotide polymorphism marker data provided for Genetic Analysis Workshop 16. Our aims were: 1) to compare identity-by-descent sharing estimates from variable amounts of data; 2) to apply each of these estimates to a case-control association study designed to control for relatedness among samples; and 3) to contrast these results to those obtained using model-based and model-free linkage analysis methods.

Multipoint linkage analyses of genetic data on extended pedigrees can involve exact computations which are infeasible. Markov chain Monte Carlo methods represent an attractive alternative, greatly extending the range of models and data sets for which analysis is practical. In this paper, several advances in Markov chain Monte Carlo theory, namely joint updates of latent variables across loci and meioses, integrated proposals, Metropolis-Hastings restarts via sequential imputation and...

TAG Theoretical and Applied Genetics, 1998
New types of markers, such as RAPDs, microsatellite markers, AFLPs, and SNPs, provide the opportunity to obtain information on individuals at multiple genetic loci across the genome. This increase in the number of marker loci has provided enhanced opportunities for statistical analysis of the genetic consequences of genealogical relationship among individuals. In place of the classical models, we can now investigate empirical multilocus segregation patterns. Linkage among loci decreases the precision of relationship estimation but permits additional dimensions of genome sharing to be explored. In this paper we consider the effect of linkage on the pattern of genome sharing among relatives who share (on average) 25% of their diploid genomes using the empirical meioses giving rise to 58 gametophytes from a single maternal plant of the species Pinus taeda (loblolly pine). The genome sharing among relatives is quantified in terms of the linkage map of the markers.

The American Naturalist, 1998
However, many biologically interesting situations in nature fail to meet these optimal criteria. Maximum likeli

Mathematical Medicine and Biology, 1988
Although there have been several mathematical formulations of multilocus segregation, multilocus gene identity by descent in pedigrees has been little considered. Here we present a computationally feasible algorithm for the computation of two-locus kinship for individuals between whom there may be multiple complex relationships, and use it to investigate patterns of two-locus gene identity by descent for some standard relationships. We also present an explicit formula, which is used to discuss the determinants of two-locus identity and the relationship to 3-locus identity by descent. With the current increasing density of information on individuals' genomes available from DNA polymorphisms, gene identity at linked loci has practical importance. Procedures for the estimation of relationships between individuals on the basis of genetic data will have increased flexibility to discriminate wider classes of genealogical relationship where information on multiple linked loci can be employed. Gene identity by descent at linked loci is also a key aspect of mapping rare recessive diseases from data on inbred individuals,

Journal of Computational Biology, 2014
There has been much interest in detecting genomic identity by descent (IBD) segments from modern dense genetic marker data, and in using them to identify human disease susceptibility loci. Here we present a novel Bayesian framework using Markov chain Monte Carlo (MCMC) realizations to jointly infer IBD states among multiple individuals not known to be related, together with the allelic typing error rate and the IBD process parameters. The data are phased single nucleotide polymorphism (SNP) haplotypes. We model changes in latent IBD state along homologous chromosomes by a continuous time Markov model having the Ewens sampling formula as its stationary distribution. We show by simulation that this model for the IBD process fits quite well with the coalescent predictions. Using simulation data sets of 40 haplotypes over regions of 1 and 10 million base pairs (Mbp), we show that the jointly estimated IBD states are very close to the true values, although the presence of linkage disequilibrium decreases the accuracy. We also present comparisons with the ibd_haplo program, which estimates IBD among sets of four haplotypes. Our new IBD detection method focuses on the scale between genome-wide methods using simple IBD models and complex coalescent-based methods which are limited to short genome segments. At the scale of a few Mbp, our approach offers potentially more power for fine scale IBD association mapping.
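For reference, the Ewens sampling formula named in this abstract can be written in its standard form as follows. The notation is assumed here for illustration and the paper's own parameterization may differ: $n$ is the number of haplotypes, $a_j$ is the number of IBD groups containing exactly $j$ haplotypes (so that $\sum_{j} j\,a_j = n$), and $\theta > 0$ is the single parameter of the distribution.

```latex
P(a_1, \dots, a_n \mid \theta)
  \;=\; \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)}
        \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j}\, a_j!},
  \qquad \sum_{j=1}^{n} j\, a_j = n .
```

In this form, small values of $\theta$ favor partitions with a few large groups (extensive IBD sharing), while large values favor many singleton groups (little IBD among the haplotypes).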

Human Heredity, 2009
Background/Aims: With pedigree data, genetic linkage can be detected using inheritance vector tests, which explore the discrepancy between the posterior distribution of the inheritance vectors given observed trait values and the prior distribution of the inheritance vectors. In this paper, we propose conditional inheritance vector tests for linkage localization. These conditional tests can also be used to detect additional linkage signals in the presence of previously detected causal genes. Methods: For linkage localization, we propose to perform inheritance vector tests conditioning on the inheritance vectors at two positions bounding a test region. We can detect additional linkage signals by conducting a further conditional test in a region with no previously detected genes. We use randomized p values to extend the marginal and conditional tests when the inheritance vectors cannot be completely determined from genetic marker data. Results: We conduct simulation studies to compare and contrast the marginal and the conditional tests and to demonstrate that randomized p values can capture both the significance and the uncertainty in the test results. Conclusions: The simulation results demonstrate that the proposed conditional tests provide useful localization information, and with informative marker data, the uncertainty in randomized marginal and conditional test results is small.

Genetics, 2008
We have developed a pruning algorithm for likelihood estimation of a tree of populations. This algorithm enables us to compute the likelihood for large trees. Thus, it gives an efficient way of obtaining the maximum-likelihood estimate (MLE) for a given tree topology. Our method utilizes the differences accumulated by random genetic drift in allele count data from single-nucleotide polymorphisms (SNPs), ignoring the effect of mutation after divergence from the common ancestral population. The computation of the maximum-likelihood tree involves both maximizing likelihood over branch lengths of a given topology and comparing the maximum-likelihood across topologies. Here our focus is the maximization of likelihood over branch lengths of a given topology. The pruning algorithm computes arrays of probabilities at the root of the tree from the data at the tips of the tree; at the root, the arrays determine the likelihood. The arrays consist of probabilities related to the number of coale...

BMC Genetics, 2005
We performed multipoint linkage analysis of the electrophysiological trait ECB21 on chromosome 4 in the full pedigrees provided by the Collaborative Study on the Genetics of Alcoholism (COGA). Three Markov chain Monte Carlo (MCMC)-based approaches were applied to the provided and re-estimated genetic maps and to five different marker panels consisting of microsatellite (STRP) and/or SNP markers at various densities. We found evidence of linkage near the GABRB1 STRP using all methods, maps, and marker panels. Difficulties encountered with SNP panels included convergence problems and demanding computations.