Methods for haplotyping and DNA copy-number typing of single cells are paramount for studying genomic heterogeneity and enabling genetic diagnosis. Before analyzing the DNA of a single cell by microarray or next-generation sequencing, a whole-genome amplification (WGA) process is required, but it substantially distorts the frequency and composition of the cell's alleles. As a consequence, haplotyping methods suffer from error-prone discrete SNP genotypes (AA, AB, BB) and DNA copy-number profiling remains difficult because true DNA copy-number aberrations have to be discriminated from WGA artifacts. Here, we developed a single-cell genome analysis method that reconstructs genome-wide haplotype architectures as well as the copy-number and segregational origin of those haplotypes by employing phased parental genotypes and deciphering WGA-distorted SNP B-allele fractions via a process we coin haplarithmisis. We demonstrate that the method can be applied as a generic method for preimplantation genetic diagnosis on single cells biopsied from human embryos, enabling diagnosis of disease alleles genome wide as well as numerical and structural chromosomal anomalies. Moreover, meiotic segregation errors can be distinguished from mitotic ones.
The nature and pace of genome mutation are largely unknown. Because standard methods sequence DNA from populations of cells, the genetic composition of individual cells is lost, de novo mutations in cells are concealed within the bulk signal, and per-cell-cycle mutation rates and mechanisms remain elusive. Although single-cell genome analyses could resolve these problems, such analyses are error-prone because of whole-genome amplification (WGA) artefacts and are limited in the types of DNA mutation that can be discerned. We developed methods for paired-end sequence analysis of single-cell WGA products that enable (i) detecting multiple classes of DNA mutation, (ii) distinguishing DNA copy number changes from allelic WGA-amplification artefacts by the discovery of matching aberrantly mapping read pairs among the surfeit of paired-end WGA and mapping artefacts and (iii) delineating the break points and architecture of structural variants. By applying the methods, we capture DNA copy number changes acquired over one cell cycle in breast cancer cells and in blastomeres derived from a human zygote after in vitro fertilization. Furthermore, we were able to discover and fine-map a heritable inter-chromosomal rearrangement t(1;16)(p36;p12) by sequencing a single blastomere. The methods will expedite applications in basic genome research and provide a stepping stone to novel approaches for clinical genetic diagnosis.
Single-cell genomics is revolutionizing basic genome research and clinical genetic diagnosis. However, none of the current research or clinical methods for single-cell analysis distinguishes between the analysis of a cell in G1-, S- or G2/M-phase of the cell cycle. Here, we demonstrate by means of array comparative genomic hybridization that charting the DNA copy number landscape of a cell in S-phase requires conceptually different approaches to that of a cell in G1- or G2/M-phase. Remarkably, despite single-cell whole-genome amplification artifacts, the log2 intensity ratios of single S-phase cells oscillate according to early and late replication domains, which in turn leads to the detection of significantly more DNA imbalances when compared with a cell in G1- or G2/M-phase. Although these DNA imbalances may, on the one hand, be falsely interpreted as genuine structural aberrations in the S-phase cell's copy number profile and hence lead to misdiagnosis, on the other hand, the ability to detect replication domains genome wide in one cell has important applications in DNA-replication research. Genome-wide cell-type-specific early and late replicating domains have been identified by analyses of DNA from populations of cells, but cell-to-cell differences in DNA replication may be important in genome stability, disease aetiology and various other cellular processes.
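As a rough illustration of why replication domains show up as apparent imbalances, the sketch below computes median-centered log2 ratios for an idealized, noise-free S-phase cell. The copy numbers (4 for already-replicated loci, 2 for not-yet-replicated loci), the function name `centered_log2_ratios`, and the median centering are illustrative assumptions, not the normalization used in the paper, and real single-cell data would also carry WGA noise.

```python
import numpy as np


def centered_log2_ratios(test_intensity, reference_intensity):
    """Return log2(test / reference) per probe, centered on the median.

    Hypothetical helper: median centering stands in for whatever array
    normalization an actual aCGH pipeline would apply.
    """
    test = np.asarray(test_intensity, dtype=float)
    ref = np.asarray(reference_intensity, dtype=float)
    ratios = np.log2(test / ref)
    return ratios - np.median(ratios)


# Assumed toy data: eight probes along a chromosome. Early-replicating
# domains are taken to be fully replicated (4 copies) and late-replicating
# domains not yet replicated (2 copies) in a diploid cell in mid S-phase.
copies_s_phase = np.array([2, 2, 4, 4, 2, 2, 4, 4])
copies_reference = np.full(8, 2)  # non-replicating diploid reference

profile = centered_log2_ratios(copies_s_phase, copies_reference)
print(profile)
```

In this toy profile the early and late domains sit one log2 unit apart, so a segmentation step tuned for G1 cells could report them as alternating gains and losses.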
Methods for extracting and amplifying sequences using ancient DNA (aDNA) can be prone to errors caused by postmortem modifications of the DNA strand. A new statistical method is developed for predicting errors in aDNA sequences caused by such processes. In addition to the canonical DNA substitution model parameters, a discrete Markov chain is used to describe nucleotide substitutions occurring via postmortem degradation of the aDNA sequences. A computer program, BYPASSR-degr, was developed implementing the method and was used in subsequent analyses of simulated data sets under the new model. Simulation studies show that the new method can be powerful and accurate in identifying damaged sites. The method is applied to analyze aDNA sequences of Etruscans, Adélie penguins, and horses. No significant signals of degradation were observed at any sites of the aDNA sequences we analyzed.
Nuclear insertions of mitochondrial origin (NUMTs) can be useful tools in evolution and population studies. However, due to their similarity to mitochondrial DNA (mtDNA), NUMTs may also be a source of contamination in mtDNA studies. The main goal of this work is to present a database of NUMTs, based on the latest version of the human genome (GRCh37 draft). A total of 755 insertions were identified. There are 33 paralogous sequences with over 80% sequence similarity and a length greater than 500 bp. The non-identical positions between paralogous sequences are listed for the first time. As an application example, the described database is used to evaluate the impact of NUMT contamination in cancer studies. The evaluation reveals that 220 of the 256 positions with zero hits in the current mtDNA phylogeny could in fact be traced to one or more nuclear insertions of mtDNA. This is because they are located at positions that are non-identical between mtDNA and nuclear DNA (nDNA). After in silico primer validation of each revised cancer study, risk of co-amplification between mtDNA and nDNA was detected in some cases, whereas in others no risk of amplification was identified. This approach to cancer studies demonstrates the potential of our NUMT database as a valuable new tool to validate mtDNA mutations described in different contexts. Moreover, due to the amount of information provided for each nuclear insertion, this database should play an important role in designing evolutionary, phylogenetic and epidemiological studies.
A new method is developed for calculating sequence substitution probabilities using Markov chain Monte Carlo (MCMC) methods. The basic strategy is to use uniformization to transform the original continuous-time Markov process into a Poisson substitution process and a discrete Markov chain of state transitions. An efficient MCMC algorithm for evaluating substitution probabilities by this approach using a continuous gamma distribution to model site-specific rates is outlined. The method is applied to the problem of inferring branch lengths and site-specific rates from nucleotide sequences under a general time-reversible (GTR) model and a computer program BYPASSR is developed. Simulations are used to examine the performance of the new program relative to an existing program BASEML that uses a discrete approximation for the gamma-distributed prior on site-specific rates. It is found that BASEML and BYPASSR are in close agreement when inferring branch lengths, regardless of the number of rate categories used, but that BASEML tends to underestimate high site-specific substitution rates, and to overestimate intermediate rates, when fewer than 50 rate categories are used. Rate estimates obtained using BASEML agree more closely with those of BYPASSR as the number of rate categories increases. Analyses of the posterior distributions of site-specific rates from BYPASSR suggest that a large number of taxa are needed to obtain precise estimates of site-specific rates, especially when rates are very high or very low. The method is applied to analyze 45 sequences of the alpha 2B adrenergic receptor gene (A2AB) from a sample of eutherian taxa. In general, the pattern expected for regions under negative selection is observed with third codon positions having the highest inferred rates, followed by first codon positions and with second codon positions having the lowest inferred rates. Several sites show exceptionally high substitution rates at second codon positions that may represent the effects of positive selection. [Bayesian phylogenetic inference; Markov process; Metropolis-Hastings algorithm; molecular evolution; site-specific rates.]
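The uniformization step mentioned in this abstract can be sketched numerically. The example below is a minimal deterministic illustration, not BYPASSR's MCMC algorithm: it rewrites the transition matrix of a continuous-time chain with rate matrix Q as a Poisson-weighted sum of powers of the discrete chain R = I + Q/μ, where μ is at least the largest exit rate. The function names, the truncation tolerance, and the use of a Jukes-Cantor rate matrix as a check (rather than GTR) are assumptions made for illustration.

```python
import numpy as np


def jc69_rate_matrix(alpha):
    """Jukes-Cantor rate matrix: every off-diagonal rate equals alpha."""
    Q = np.full((4, 4), alpha, dtype=float)
    np.fill_diagonal(Q, -3.0 * alpha)
    return Q


def uniformized_transition_matrix(Q, t, tol=1e-12, max_terms=10_000):
    """Approximate P(t) = exp(Q t) by uniformization.

    P(t) = sum_n Poisson(n; mu * t) * R**n, with R = I + Q / mu.
    The sum is truncated once the remaining Poisson mass drops below tol.
    Note: exp(-mu * t) underflows for very large mu * t; this sketch
    does not guard against that case.
    """
    Q = np.asarray(Q, dtype=float)
    size = Q.shape[0]
    mu = np.max(-np.diag(Q))  # uniformization rate (largest exit rate)
    if mu == 0.0 or t == 0.0:
        return np.eye(size)

    R = np.eye(size) + Q / mu    # discrete chain of state transitions
    weight = np.exp(-mu * t)     # Poisson probability of zero events
    power = np.eye(size)         # R**0
    P = weight * power
    cumulative = weight
    n = 0
    while 1.0 - cumulative > tol and n < max_terms:
        n += 1
        weight *= mu * t / n     # Poisson(n) from Poisson(n - 1)
        power = power @ R        # R**n from R**(n - 1)
        P += weight * power
        cumulative += weight
    return P


# Check against the closed-form Jukes-Cantor probabilities.
alpha, t = 1.0 / 3.0, 0.5
P = uniformized_transition_matrix(jc69_rate_matrix(alpha), t)
same = 0.25 + 0.75 * np.exp(-4.0 * alpha * t)
diff = 0.25 - 0.25 * np.exp(-4.0 * alpha * t)
print(np.isclose(P[0, 0], same), np.isclose(P[0, 1], diff))
```

Because the number of Poisson events and the path through R can be treated as latent variables, the same decomposition lends itself to sampling-based evaluation, which is presumably what makes it convenient inside an MCMC scheme.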
Papers by Ligia Mateiu