2016, MSc Thesis
https://doi.org/10.6084/m9.figshare.4558468…
95 pages
The 16S rRNA gene is a widely used target for phylogenetic analysis of prokaryote communities. This analysis starts with the sequencing of the 16S rRNA gene of a microbial sample, and includes several steps such as paired-end merging (when the sequencing technique produces paired-end reads), chimera removal, clustering, and sequence database search. The end-product is the phylogeny of the prokaryote taxa in the sample and an estimation of their abundance. The problem is that there are multiple tools available to carry out this analysis, and it is unclear which is the most effective. Namely, there are three analysis pipelines in wide use by the community: mothur, QIIME and USEARCH. These use different paired-end merging algorithms, different clustering algorithms, and different sequence reference databases (Silva, Greengenes, and RDP, respectively). Additionally, there are a number of other paired-end mergers available and, again, it is unclear which performs better in the context of this analysis. In this study, we start by evaluating each of the seven publicly available paired-end merging algorithms: BBmerge, FastqJoin (QIIME's merger), FLASH, mothur's merger, PANDAseq, PEAR and USEARCH's merger. Then, we assess the effectiveness of each of the three analysis pipelines in conjunction with each of the three reference databases, and each of the most promising paired-end mergers. To do this evaluation, we use two sequencing datasets from mock communities, one publicly available and the other produced in-house. We evaluated the paired-end mergers by using BLAST against the known references to compare the number of mismatches before and after merging, and thereby calculate their precision and recall. We evaluated the analysis pipelines by implementing the UniFrac metric (a community standard) in order to measure the similarity between the predicted phylogeny and the real one. We implemented both a qualitative and a quantitative variant of UniFrac.
We found that the best mergers were PEAR, FastqJoin and FLASH in terms of balance between precision and recall, whereas mothur was the best in terms of recall, and USEARCH the most correct in terms of the quality scores of the merged sequences. Regarding the analysis pipelines, in terms of qualitative UniFrac, QIIME with Silva as the reference and mothur's merger was the best on the first dataset, and mothur with either Greengenes or RDP and its own merger was the best on the second dataset. In terms of quantitative UniFrac, mothur with Greengenes and its own merger was the best on the first dataset, and USEARCH with Silva and mothur's merger was the best on the second dataset. We concluded that having a high recall in the merging step is more important than having a high precision for the downstream phylogenetic analysis, as mothur's merger was either the best or tied for the best in all settings.
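The abstract does not show how the qualitative and quantitative UniFrac variants were implemented. The following is only a minimal sketch of the standard definitions (unweighted UniFrac, and the non-normalized weighted form), assuming a rooted tree stored as a child-to-parent map. The toy tree, the taxon names and the function names are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical toy tree: node -> (parent, length of the branch above the node).
TREE = {
    "A": ("X", 1.0),
    "B": ("X", 1.0),
    "C": ("Y", 2.0),
    "X": ("Y", 0.5),
    "Y": (None, 0.0),  # root
}

def branch_abundance(tree, counts):
    """Relative abundance of one community found below each branch."""
    total = sum(counts.values())
    below = defaultdict(float)
    for leaf, n in counts.items():
        node = leaf
        while node is not None:  # walk from the leaf up to the root
            below[node] += n / total
            node = tree[node][0]
    return below

def unifrac(tree, expected, observed, weighted=False):
    """UniFrac distance between two communities placed on the same tree."""
    p = branch_abundance(tree, expected)
    q = branch_abundance(tree, observed)
    if weighted:
        # Quantitative variant: branch length times abundance difference.
        return sum(length * abs(p[n] - q[n]) for n, (_, length) in tree.items())
    # Qualitative variant: fraction of branch length unique to one community.
    shared = unique = 0.0
    for n, (_, length) in tree.items():
        in_p, in_q = p[n] > 0, q[n] > 0
        if in_p and in_q:
            shared += length
        elif in_p or in_q:
            unique += length
    return unique / (shared + unique)

expected = {"A": 50, "B": 50}   # known mock community (made-up counts)
observed = {"A": 50, "C": 50}   # pipeline prediction (made-up counts)
print(unifrac(TREE, expected, observed))
print(unifrac(TREE, expected, observed, weighted=True))
```

In this sketch a distance of 0 means the predicted community occupies exactly the same branches (qualitative) or has the same abundance below every branch (quantitative) as the reference community.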
BMC Bioinformatics, 2010
Background: Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, there exists significant variability in analysis methods employed in these studies. Results: Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters. Con...
Journal of Open Source Software
In the past decade, the number of publicly available bacterial genomes has increased dramatically. These genomes have been generated for impactful initiatives, especially in the field of genomic epidemiology (Brown, Dessai, McGarry, & Gerner-Smidt, 2019; Timme et al., 2017). Genomes are sequenced, shared publicly, and subsequently analyzed for phylogenetic relatedness. If two genomes of epidemiological interest are found to be related, further investigation might be prompted. However, comparing the multitudes of genomes for phylogenetic relatedness is computationally expensive and, with large numbers, laborious. Consequently, there are many strategies to reduce the complexity of the data for downstream analysis, especially using nucleotide stretches of length k (kmers).
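To illustrate how k-mers reduce the cost of comparing sequences, here is a minimal sketch that breaks two sequences into k-mer sets and computes a Jaccard distance between them. This is a generic illustration and not necessarily the method used by the tool described above. The function names, the example sequences and the choice of k = 4 are assumptions.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b, k=4):
    """1 minus the fraction of k-mers shared between sequences a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:  # both sequences are shorter than k
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

# Made-up sequences that differ at one position.
print(jaccard_distance("ACGTACGT", "ACGTTCGT"))
```

Real tools typically compare a small subsample (sketch) of each genome's k-mers instead of the full sets, which is what makes the comparison tractable for large collections.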
Scientific Reports
Metagenomics research has recently thrived due to improvements in DNA sequencing technologies, driving the emergence of new analysis tools and the growth of taxonomic databases. However, there is no all-purpose strategy that can guarantee the best result for a given project, and there are several combinations of software, parameters and databases that can be tested. Therefore, we performed an impartial comparison, using statistical measures of classification for eight bioinformatic tools and four taxonomic databases, defining a benchmark framework to evaluate each tool in a standardized context. Using in silico simulated data for 16S rRNA amplicons and whole metagenome shotgun data, we compared the results from different software and database combinations to detect biases related to algorithms or database annotation. Using our benchmark framework, researchers can define cut-off values to evaluate the expected error rate and coverage for their results, regardless of the score used by each software. A quick guide to select the best tool, all datasets and scripts to reproduce our results and benchmark any new method are available at . Finally, we stress the importance of gold standards, database curation and manual inspection of taxonomic profiling results, for a better and more accurate microbial diversity description. For decades, important advances in microbial ecology and many other fields have been achieved thanks to the possibility of studying microbial communities by characterizing their genetic information. While the 16S rRNA gene has been widely accepted as a biological fingerprint for bacterial species, it presents some limitations. Many bacterial species have multiple 16S rRNA gene copies, leading to an artificial diversity overrepresentation 1.
Between some bacterial species, there are no significant differences in their 16S rRNA genes, but other genomic elements confer on them important features that differentiate them as pathogens or harmless free-living organisms 2,3. Other technical considerations regarding the characterization of the 16S rRNA gene are primer and amplification biases 4, chimera formation 4,5 and other artifacts that make it difficult to assess the real community structure, like the microheterogeneity of sequences between closely related strains, or the similarity of sequences between non-closely related species. The use of high-throughput sequencing technologies has allowed the analysis of very complex environmental samples either by 16S rRNA gene amplification or by Whole Metagenome Shotgun (WMS) sequencing, which can retrieve the genomic information from all the organisms present in the sample. Also, bioinformatics tools have been redesigned to cope with the massive amount of data generated by high-throughput sequencing technologies. Advantages and limitations of sequencing strategies and metagenomic analysis software have been extensively described before. However, the selection of sequencing or bioinformatic approaches for any project, remains
Phylogenetic relationships among microbial taxa in natural environments provide key insights into the mechanisms that shape community structure and functions. In this chapter, we address the current methodologies to carry out community structure profiling, using single-copy markers and the small subunit of the rRNA gene to measure phylogenetic diversity from next-generation sequencing data. Furthermore, the huge amount of data from metagenomics studies across the world has allowed us to assemble thousands of draft genomes, making it necessary to compare whole genomes through phylogenomic approaches. Several computational tools are available to carry out these analyses with considerable success; we present a compendium of those open source tools, easy to use and with modest hardware requirements, with the aim that they can be applied by non-specialist biologists to study microbial diversity in a phylogenetic context.
The US DOE Joint Genome Institute (JGI) sequences microbial and metagenomic projects through three main programs: DOE Microbial Genome Program (MGP), JGI Community Sequencing Program (CSP) and DOE Genomes to Life Program (GTL). The principal goal of the MGP is to fund sequencing projects related to DOE interests, the principal goal of the CSP is to fund sequencing projects from a broad range of disciplines that may not be covered in the MGP, and the principal goal of the GTL sequencing projects is to fund sequencing projects in direct support of the GTL program. The JGI is responsible for sequencing, assembling, and annotating microbial genomes, and publishing sequence and annotation in GenBank and the DOE JGI Integrated Microbial Genomics web-based system. The JGI has sequenced nearly 250 microbes and metagenomic samples to draft quality and completely finished over 120 microbes. Most microbial projects are targeted for finishing. The overall capacity is now approximately 100-125 microbial projects per year through draft sequencing and finishing. Virtually all microbial projects are sequenced by the whole genome shotgun method. To begin the sequencing process, the Library group randomly shears the purified DNA under different conditions and selects for three size populations. Fragments are end repaired and selected for inserts in the range of 3kb, 8kb, and 40kb. These are cloned into different vector systems and checked for quality by PCR or sequencing. The libraries are sequenced by the Production group to approximately 8.5X coverage. The resulting reads are trimmed for vector sequences and assembled. The assembly is quality checked, automatically annotated by the Annotation group, and released to the collaborating PI as the initial Quality Draft assembly. For finishing, the draft assembly is assigned to a Finishing group. The Finishing group closes all sequence gaps, resolves all repeat discrepancies, and improves all low quality regions.
The final assembly is then passed to the Quality Assurance group to assess the integrity and overall quality of the genome sequence. The finished sequence then receives a final annotation and this package is used as the basis for analysis and publication in GenBank and the DOE JGI Integrated Microbial Genomics web based system.
Molecular Biology and Evolution, 2007
It has recently been proposed that a well-resolved Tree of Life can be achieved through concatenation of shared genes. There are, however, several difficulties with such an approach, especially in the prokaryotic part of this tree. We tackled some of them using a new combination of maximum likelihood-based methods, developed in order to practice as safe and careful concatenations as possible. First, we used the application concaterpillar on carefully aligned core genes. This application uses a hierarchical likelihood-ratio test framework to assess both the topological congruence between gene phylogenies (i.e., whether different genes share the same evolutionary history) and branch-length congruence (i.e., whether genes that share the same history share the same pattern of relative evolutionary rates). We thus tested if these core genes can be concatenated or should be instead categorized into different incongruent sets. Second, we developed a heat map approach studying the evolution of the phylogenetic support for different bipartitions, when the number of sites of different phylogenetic quality in the concatenation increases. These heatmaps allow us to follow which phylogenetic signals increase or decrease as the concatenation progresses and to detect emerging artifactual groupings, that is, groups that are more and more supported when more and more homoplasic sites are thrown in the analysis. We showed that, as far as 7 major prokaryotic lineages are concerned, only 22 core genes can be said to be congruent and can be safely concatenated. This number is even smaller than the number of genes retained to reconstruct a ''Tree of One Per Cent.'' Furthermore, the concatenation of these 22 markers leads to an unresolved tree as the only groupings in the concatenation tree seem to reflect emerging artifacts. Using concatenated core genes as a valid framework to classify uncharacterized environmental sequences can thus be misleading. 
1 Present address: UPMC UMR 7138, 7 quai Saint-Bernard, Bâtiment A, 4ème étage, 75005 Paris, France. 2 E.B. and E.S. contributed equally to this article.
PLOS ONE, 2015
As new sequencing technologies become cheaper and older ones disappear, laboratories switch vendors and platforms. Validating the new setups is a crucial part of conducting rigorous scientific research. Here we report on the reliability and biases of performing bacterial 16S rRNA gene amplicon paired-end sequencing on the MiSeq Illumina platform. We designed a protocol using 50 barcode pairs to run samples in parallel and coded a pipeline to process the data. Sequencing the same sediment sample in 248 replicates as well as 70 samples from alkaline soda lakes, we evaluated the performance of the method with regard to estimates of alpha and beta diversity. Using different purification and DNA quantification procedures, we always found up to 5-fold differences in the yield of sequences between individually barcoded samples. Using either a one-step or a two-step PCR preparation resulted in significantly different estimates in both alpha and beta diversity. Comparing with a previous method based on 454 pyrosequencing, we found that our Illumina protocol performed in a similar manner, with the exception of evenness estimates, where correspondence between the methods was low. We further quantified the data loss at every processing step, eventually accumulating to 50% of the raw reads. When evaluating different OTU clustering methods, we observed a stark contrast between the results of QIIME with default settings and the more recent UPARSE algorithm when it comes to the number of OTUs generated. Still, overall trends in alpha and beta diversity corresponded highly using both clustering methods. Our procedure performed well considering the precision of alpha and beta diversity estimates, with insignificant effects of individual barcodes. Comparative analyses suggest that 454 and Illumina sequence data can be combined if the same PCR protocol and bioinformatic workflows are used for describing patterns in richness, beta-diversity and taxonomic composition.
Nucleic Acids Research, 2006
Microbiologists conducting surveys of bacterial and archaeal diversity often require comparative alignments of thousands of 16S rRNA genes collected from a sample. The computational resources and bioinformatics expertise required to construct such an alignment have inhibited high-throughput analysis. It was hypothesized that an online tool could be developed to efficiently align thousands of 16S rRNA genes via the NAST (Nearest Alignment Space Termination) algorithm for creating multiple sequence alignments (MSA). The tool was implemented with a web-interface at http://greengenes.lbl.gov/NAST. Each user-submitted sequence is compared with Greengenes' 'Core Set', comprising ~10,000 aligned non-chimeric sequences representative of the currently recognized diversity among bacteria and archaea. User sequences are oriented and paired with their closest match in the Core Set to serve as a template for inserting gap characters. Non-16S data (sequence from vector or surrounding genomic regions) are conveniently removed in the returned alignment. From the resulting MSA, distance matrices can be calculated for diversity estimates and organisms can be classified by taxonomy. The ability to align and categorize large sequence sets using a simple interface has enabled researchers with various experience levels to obtain bacterial and archaeal community profiles.
Science, 2009
Rapid Tree Building: Phylogenetic reconstruction is used to determine the relationships between organisms and requires an accurate alignment and analysis of multiple sequences. Iterative rounds of alignment and tree building are often necessary to prevent errors in the phylogeny estimate. One way to address this problem is to assess alignment and trees in a single step. However, efficient algorithms to analyze data sets of reasonable size have been lacking. Liu et al. (p. 1561; see the Perspective by Löytynoja and Goldman) describe an iterative approach that simultaneously incorporates both alignment and phylogeny and applies a fast maximum likelihood algorithm to the tree-building part. By assembling the components of the methods in this way, accurate results were obtained for up to 1000 sequences. Thus, it is possible to produce coestimation of sequence alignment and phylogeny that is both rapid and accurate.