Optimizing 16S Sequencing Analysis Pipelines

2016, MSc Thesis

https://doi.org/10.6084/m9.figshare.4558468

Abstract

The 16S rRNA gene is a widely used target for phylogenetic analysis of prokaryote communities. This analysis starts with the sequencing of the 16S rRNA gene of a microbial sample, and includes several steps such as paired-end merging (when the sequencing technique produces paired-end reads), chimera removal, clustering, and sequence database search. The end product is the phylogeny of the prokaryote taxa in the sample and an estimate of their abundance. The problem is that multiple tools are available to carry out this analysis, and it is unclear which is the most effective. Namely, there are three analysis pipelines in wide use by the community: mothur, QIIME and USEARCH. These use different paired-end merging algorithms, different clustering algorithms, and different sequence reference databases (Silva, Greengenes, and RDP respectively). Additionally, there are a number of other paired-end mergers available and, again, it is unclear which performs better in the context of this analysis. In this study, we start by evaluating each of the seven publicly available paired-end merging algorithms: BBmerge, FastqJoin (QIIME's merger), FLASH, mothur's merger, PANDAseq, PEAR and USEARCH's merger. Then, we assess the effectiveness of each of the three analysis pipelines in conjunction with each of the three reference databases and each of the most promising paired-end mergers. For this evaluation, we use two sequencing datasets from mock communities, one publicly available and the other produced in-house. We evaluated the paired-end mergers by using BLAST against the known references to compare the number of mismatches before and after merging, and thereby calculate their precision and recall. We evaluated the analysis pipelines by implementing the UniFrac metric (a community standard) in order to measure the similarity between the predicted phylogeny and the real one. We implemented both a qualitative and a quantitative variant of UniFrac.
We found that the best mergers were PEAR, FastqJoin and FLASH in terms of balance between precision and recall, whereas mothur was the best in terms of recall, and USEARCH the most correct in terms of the quality scores of the merged sequences. Regarding the analysis pipelines, in terms of qualitative UniFrac, QIIME with Silva as the reference and mothur's merger was the best on the first dataset, and mothur with either Greengenes or RDP and its own merger was the best on the second dataset. In terms of quantitative UniFrac, mothur with Greengenes and its own merger was the best on the first dataset, and USEARCH with Silva and mothur's merger was the best on the second dataset. We concluded that having a high recall in the merging step is more important than having a high precision for the downstream phylogenetic analysis, as mothur's merger was either the best or tied for the best in all settings.
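
To make the two UniFrac variants mentioned in the abstract concrete, here is a minimal Python sketch. It is not the thesis implementation. It assumes the standard definitions: the qualitative (unweighted) variant is the fraction of branch length found in only one of the two communities, and the quantitative (weighted) variant is the non-normalized sum of branch lengths weighted by the difference in relative abundance. The tree encoding (a child-to-parent map plus a branch-length map), the function names and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def branch_abundances(parent, counts):
    """Return, for every non-root node, the total count of the leaves below it.

    `parent` maps each non-root node to its parent; `counts` maps leaves to
    read counts. Each non-root node stands for the branch leading up to its parent.
    """
    totals = defaultdict(float)
    for leaf, n in counts.items():
        node = leaf
        while node in parent:  # stop at the root, which has no parent entry
            totals[node] += n
            node = parent[node]
    return totals


def unweighted_unifrac(parent, length, counts_a, counts_b):
    """Qualitative variant: branch length unique to one community / total observed."""
    a = branch_abundances(parent, counts_a)
    b = branch_abundances(parent, counts_b)
    unique = shared = 0.0
    for node, branch_len in length.items():
        in_a = a.get(node, 0) > 0
        in_b = b.get(node, 0) > 0
        if in_a and in_b:
            shared += branch_len
        elif in_a or in_b:
            unique += branch_len
    total = unique + shared
    return unique / total if total else 0.0


def weighted_unifrac(parent, length, counts_a, counts_b):
    """Quantitative variant (non-normalized): sum of l_i * |A_i/A_T - B_i/B_T|."""
    a = branch_abundances(parent, counts_a)
    b = branch_abundances(parent, counts_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    return sum(
        branch_len * abs(a.get(node, 0) / total_a - b.get(node, 0) / total_b)
        for node, branch_len in length.items()
    )


# Toy tree (hypothetical): root -> X (1.0), root -> C (2.0), X -> A (0.5), X -> B (0.5)
parent = {"A": "X", "B": "X", "X": "root", "C": "root"}
length = {"A": 0.5, "B": 0.5, "X": 1.0, "C": 2.0}

expected = {"A": 50, "B": 50}   # known mock-community composition (made up)
predicted = {"A": 50, "C": 50}  # composition reported by a pipeline (made up)

print(unweighted_unifrac(parent, length, expected, predicted))  # 0.625
print(weighted_unifrac(parent, length, expected, predicted))    # 1.75
```

In both variants a value of 0 means the predicted community is indistinguishable from the expected one on this tree, so lower is better when ranking pipelines against a mock community. The thesis may normalize the quantitative variant differently. For real analyses an established implementation, such as the one in scikit-bio, is a safer choice than this sketch.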