Year: 2020 | Volume: 151 | Issue: 1 | Page: 93-103
Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data
Shikha Agnihotry1, Aditya N Sarangi1, Rakesh Aggarwal2
1 Biomedical Informatics Centre, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India
2 Biomedical Informatics Centre; Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India
Date of Submission: 31-Jan-2018
Date of Web Publication: 24-Feb-2020
Dr Rakesh Aggarwal
Director, Jawaharlal Institute of Postgraduate Medical Education & Research, Dhanvantri Nagar, Puducherry 605 006
Source of Support: None, Conflict of Interest: None
Abstract
Background & objectives: For bacterial community analysis, 16S rRNA sequences are subjected to taxonomic classification through comparison with one of the three commonly used databases [Greengenes, SILVA and Ribosomal Database Project (RDP)]. It was hypothesized that a unified database containing fully annotated, non-redundant sequences from all the three databases, might provide better taxonomic classification during analysis of 16S rRNA sequence data. Hence, a unified 16S rRNA database was constructed and its performance was assessed by using it with four different taxonomic assignment methods, and for data from various hypervariable regions (HVRs) of 16S rRNA gene.
Methods: We constructed a unified 16S rRNA database (16S-UDb) by merging non-ambiguous, fully annotated, full-length 16S rRNA sequences from the three databases and compared its performance in taxonomy assignment with that of three original databases. This was done using four different taxonomy assignment methods [mothur Naïve Bayesian Classifier (mothur-nbc), RDP Naïve Bayesian Classifier (rdp-nbc), UCLUST, SortMeRNA] and data from 13 regions of 16S rRNA [seven hypervariable regions (HVR) (V2-V8) and six pairs of adjacent HVRs].
Results: Our unified 16S rRNA database contained 13,078 full-length, fully annotated 16S rRNA sequences. It could assign genus and species to larger proportions (90.05 and 46.82%, respectively, when used with mothur-nbc classifier and the V2+V3 region) of sequences in the test database than the three original 16S rRNA databases (70.88-87.20% and 10.23-24.28%, respectively, with the same classifier and region).
Interpretation & conclusions: Our results indicate that for analysis of bacterial mixtures, sequencing of V2-V3 region of 16S rRNA followed by analysis of the data using the mothur-nbc classifier and our 16S-UDb database may be preferred.
Keywords: 16S rRNA - bioinformatics - hypervariable regions - metagenomics - microbiota
How to cite this article:
Agnihotry S, Sarangi AN, Aggarwal R. Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data. Indian J Med Res 2020;151:93-103
How to cite this URL:
Agnihotry S, Sarangi AN, Aggarwal R. Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data. Indian J Med Res [serial online] 2020 [cited 2020 Dec 5];151:93-103. Available from: https://www.ijmr.org.in/text.asp?2020/151/1/93/279138
Shikha Agnihotry & Aditya N. Sarangi contributed equally.
Several human body sites contain a variety of organisms, including prokaryotes and eukaryotes, collectively referred to as microbiota. High-throughput genomic sequencing of 16S rRNA is used for profiling of microbiota. This bacterial gene has nine hypervariable regions (HVRs) interspersed with conserved nucleotide sequences. Sequences of these HVRs differ between bacterial groups and hence can be used to identify different bacteria. Further, high-throughput platforms allow sequencing of several nucleic acid molecules in parallel. One of the commonly used platforms, namely Illumina, can read DNA sequences beginning at each end of a DNA fragment for up to 300 nucleotides; these paired-end reads can then be merged to obtain sequences up to 575 nucleotides long, enough to cover one or two adjacent HVRs. Sequencing of one or more HVRs using this platform, followed by taxonomic identification of each sequence by matching with a database of known bacterial 16S rRNA gene sequences, is one of the most frequently used methods for determining the type and abundance of various bacteria present in specimens that contain a mixture of several bacteria.
Three databases are commonly used to identify bacterial 16S rRNA sequences, namely SILVA, Ribosomal Database Project (RDP) and Greengenes. These databases overlap partially, with each containing some entries which are absent in the others. The databases also vary in their coverage; for instance, the most abundant genus in the Greengenes database is Prevotella and that in the SILVA database is Lachnospira. Further, several taxonomy assignment methods such as UCLUST, mothur-nbc, rdp-nbc and SortMeRNA are used for comparing the query sequences to these databases.
Effectiveness of bioinformatic analysis to assign individual reads in a high-throughput 16S rRNA sequence dataset to various bacterial species can be expected to vary with the HVR sequenced, the 16S rRNA reference database used and the taxonomic assignment method used. A few studies have assessed the effect on the taxonomic assignment process of varying one factor at a time, e.g., the reference database or the assignment method; however, no study has comprehensively looked at the effect of varying all the three factors together. Therefore, we analysed the composite effect of varying the three factors. Further, it was hypothesized that the use of a hybrid database, which included entries from all the three major 16S rRNA sequence databases, might improve the phylogenetic assignment. Hence, a unified 16S rRNA database was constructed by merging non-ambiguous, full-length prokaryotic 16S rRNA sequences with complete annotation up to the species level from the three commonly used databases. The relative performance in taxonomic assignment of this new unified database [referred to as 16S unified database (16S-UDb)] and each of its constituents taken individually was compared, using four different taxonomic assignment methods and different HVRs of the 16S rRNA gene.
Material & Methods
Acquisition and preparation of individual 16S rRNA databases: For the Greengenes and SILVA databases, files containing 16S rRNA reference sequences (pre-clustered at 97% threshold) and corresponding taxonomy mapping information in QIIME format were downloaded (ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_5_otus.tar.gz and https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_123_release.zip, respectively). For RDP (release 11.4), unaligned bacterial 16S rRNA sequences were downloaded (https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gz) and made QIIME-compatible by removing sequences that were shorter than 1200 nucleotides or that contained any ambiguous base, followed by clustering at 97 per cent threshold using VSEARCH. In addition, a taxonomy mapping file in QIIME-compatible format was created by linking RDP sequence identifiers of the representative sequences with 6-level (phylum, class, order, family, genus and species) taxonomic lineage hierarchy.
Construction of a unified 16S rRNA database (16S-UDb): Bacterial 16S rRNA gene sequences were obtained from three databases - Greengenes (v.13.5), RDP (v.11.4) and the SILVA 'All-Species Living Tree' Project. From each, any sequences that were shorter than 1200 nucleotides in length, contained any ambiguous base, were not classified up to the species level, or were labelled as derived from environmental or cloned material were purged. Sequences classified up to subspecies or isolate level were grouped to the species level. These curated data from the three databases were merged and clustered at 97 per cent threshold using VSEARCH. Further, a taxonomy mapping file was created by linking sequence identifiers of the representative sequences with 6-level taxonomic lineage hierarchy.
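The curation rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual scripts: the function names, the record layout (a sequence string plus a list of rank names) and the keyword list used to flag environmental or cloned material are all hypothetical.

```python
import re

MIN_LENGTH = 1200                          # length cut-off stated in the text
UNAMBIGUOUS = re.compile(r"^[ACGTU]+$")    # any other symbol counts as ambiguous

# Hypothetical keywords for environmental or cloned material (an assumption).
EXCLUDED_WORDS = ("uncultured", "environmental", "clone", "unidentified")


def is_fully_classified(lineage):
    """True if all six ranks (phylum ... species) carry a usable name."""
    if len(lineage) < 6 or any(not name.strip() for name in lineage[:6]):
        return False
    species = lineage[5].lower()
    return not any(word in species for word in EXCLUDED_WORDS)


def keep_sequence(sequence, lineage):
    """Apply the length, ambiguity and annotation filters to one record."""
    seq = sequence.upper()
    return (len(seq) >= MIN_LENGTH
            and UNAMBIGUOUS.match(seq) is not None
            and is_fully_classified(lineage))


def curate(records):
    """records: iterable of (seq_id, sequence, lineage).

    Keeps eligible records and truncates each lineage to six ranks, which
    groups subspecies- or isolate-level entries to the species level.
    """
    return [(seq_id, seq, list(lineage[:6]))
            for seq_id, seq, lineage in records
            if keep_sequence(seq, lineage)]
```

The curated records from the three sources would then be pooled and passed to a clustering tool; the clustering step itself is not reproduced here.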
Construction of test dataset: A test dataset was required for comparing the performance of the unified and the individual 16S rRNA databases. For this, a pre-formatted 16S rRNA reference dataset was downloaded from the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz), and sequences of 1200 nucleotides or longer in it were extracted using the 'blastdbcmd' utility of NCBI stand-alone BLAST version 2.2.28. The identifier of each sequence was used to query the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy) to obtain its taxonomic information. These data were used to create a taxonomy mapping file containing the reference identifiers and their 6-level taxonomic lineage hierarchies up to the species level; any sequences with incomplete species information were removed, and those with taxonomic lineage defined up to subspecies or isolates were grouped to species level. Any duplicate sequences with 100 per cent identity were removed using CD-HIT. This provided a test dataset of non-redundant full-length 16S rRNA sequences classified up to species level.
Construction of test datasets of shorter lengths: Primer-pairs flanking seven individual HVRs (V2 to V8) of bacterial 16S rRNA and six pairs of adjacent HVRs [V23 (i.e., V2+V3), V34, V45, V56, V67, V78] were identified [Supplementary Table 1 [Additional file 1]]. Their sequences were aligned with the test dataset, using the ClustalW program built into the BioEdit Suite v.7.2.5 (http://www.mbio.ncsu.edu/BioEdit/BioEdit.zip), and the nucleotide sequences lying between the two primers of each pair covering an individual HVR (V2 to V8; 7 sets) or two adjacent HVRs (i.e., V2+V3, V3+V4, V4+V5, V5+V6, V6+V7 and V7+V8; 6 sets) were extracted. This yielded a total of 13 test datasets. The V1 and V9 regions were ignored since these are rarely used for metagenomic analysis because of their incomplete nature.
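As a rough illustration of the last two steps, the sketch below removes exact duplicate sequences and writes one line of a tab-separated taxonomy mapping. It is a simplification: only strictly identical strings are treated as duplicates, which may differ from how CD-HIT handles sequences of unequal length, and the "identifier, tab, semicolon-separated lineage" layout is an assumed format. Both function names are hypothetical.

```python
def remove_exact_duplicates(records):
    """records: iterable of (seq_id, sequence).

    Keeps the first record for every distinct sequence (case-insensitive,
    with U treated as T) and drops later identical copies.
    """
    seen = set()
    unique = []
    for seq_id, sequence in records:
        key = sequence.upper().replace("U", "T")
        if key not in seen:
            seen.add(key)
            unique.append((seq_id, sequence))
    return unique


def taxonomy_line(seq_id, lineage):
    """Format one mapping line from a 6-level lineage (phylum ... species)."""
    if len(lineage) != 6:
        raise ValueError("expected exactly six ranks, phylum to species")
    return seq_id + "\t" + "; ".join(lineage)
```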
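The extraction of the segment lying between a primer pair can be illustrated with a short sketch. Note that it substitutes a different technique from the one described above: instead of a ClustalW alignment, it locates primers by pattern matching with IUPAC degenerate codes. The primers in the usage line are toy sequences, not those listed in Supplementary Table 1, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement that also handles degenerate bases."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_between_primers(sequence, forward_primer, reverse_primer):
    """Return the segment between the two primer sites, or None if either
    primer is not found. Both primers are given 5'->3'."""
    seq = sequence.upper().replace("U", "T")
    fwd = re.search(primer_regex(forward_primer), seq)
    if fwd is None:
        return None
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Toy example with made-up primers.
print(extract_between_primers("ATGGTACTTTTGACACGTAAGG", "GGNAC", "TTWCG"))
```

Exact pattern matching tolerates no mismatches outside the degenerate positions, so an alignment-based approach such as the one used in the study would be expected to recover more sequences.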
Comparison of different approaches to identify an optimal pipeline: The performances of various combinations of four taxonomy assignment methods (UCLUST, SortMeRNA, mothur-nbc and rdp-nbc) and four 16S rRNA databases (Greengenes, SILVA, RDP and the new 16S-UDb) in correctly classifying reads included in the 13 HVR/HVR-pair test datasets, up to the genus and species levels were determined. This was done using 208 runs (=4×4×13) of the assign_taxonomy.py script of QIIME software package.
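The size of this comparison grid follows directly from the three factors. A minimal sketch that enumerates the combinations is given below; the labels are simply the names used in this article, not arguments to any script.

```python
from itertools import product

CLASSIFIERS = ["UCLUST", "SortMeRNA", "mothur-nbc", "rdp-nbc"]
DATABASES = ["Greengenes", "SILVA", "RDP", "16S-UDb"]
SINGLE_HVRS = ["V2", "V3", "V4", "V5", "V6", "V7", "V8"]
HVR_PAIRS = ["V23", "V34", "V45", "V56", "V67", "V78"]
REGIONS = SINGLE_HVRS + HVR_PAIRS

# One run per classifier x database x region combination.
runs = list(product(CLASSIFIERS, DATABASES, REGIONS))
print(len(runs))  # 208
```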
The UCLUST-based classifier uses a consensus taxonomy assignment approach; it was invoked with the parameters min_consensus_fraction: 0.51, similarity: 0.9 and uclust_max_accepts: 3. The SortMeRNA algorithm is based on approximate seeds and allows fast and sensitive analysis of rRNA sequences. SortMeRNA was run with the parameters sortmerna_coverage: 0.9, sortmerna_best_N_alignments: 5 and sortmerna_e_value: 1.0. The two naive Bayes-based classifiers (mothur-nbc and rdp-nbc) were retrained using the reference operational taxonomic units (OTUs) clustered at 97 per cent threshold and the corresponding taxonomy mapping files from the above four databases, individually, with a minimum confidence score (-c) of 0.80. Performance of each classifier-database combination was calculated as the proportion of sequences in the test dataset that were correctly classified.
Assessment of performance of the 'optimal pipeline' using a real-life dataset: For this, Illumina 16S rRNA gene V3 sequencing data were used from one of our previous studies. It contained 20,508,594 high-quality reads obtained from stool specimens of 14 healthy persons [median number of reads per specimen: 272,327 (range 136,563 to 761,961); total reads 4,728,631] and 33 patients with enthesitis-related arthritis [median 397,351 (range 102,093 to 1,502,380); total reads 15,871,719]. These data were subjected to taxonomy assignment using the optimum approach identified in the previous step. The results obtained were compared with those obtained using the approach used in the original analysis, i.e., using the UCLUST Consensus Taxonomy Assigner and the sub-sampled open-reference OTU picking protocol of QIIME 1.8 against the Greengenes v.13.8 reference OTUs pre-clustered at 97 per cent threshold with the software's default parameters. During these analyses, any singleton OTUs (with only one sequence in all specimens taken together), unassigned OTUs and eukaryotic OTUs were removed from the 'biom' files generated by each approach. Further, to reduce noise, OTUs that were observed in fewer than 10 per cent of stool specimens or that accounted for fewer than 0.002 per cent of reads in all the specimens taken together were also purged. The relative performance of the two methods was assessed by comparing the BIOM files generated by each.
Comparison of computational performances of various approaches: The time taken for different combinations of databases and classifiers was assessed using an Intel Corporation Xeon E7 Workstation with 6 processors (Intel Corp., CA, USA) using identical parameters, i.e., pre-filtration of sequences at 60 per cent identity and subsample-based open-reference OTU picking method with the use of 10 per cent subsample.
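The performance measure can be written down as a short function. The sketch below assumes that a call is counted as correct when the name assigned at the rank of interest equals the reference name at that rank, and that unassigned sequences count as incorrect; the text does not spell out these details, so both are assumptions, as are the function and variable names.

```python
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def classification_accuracy(expected, predicted, rank):
    """Percentage of test sequences correctly classified at `rank`.

    expected:  dict seq_id -> list of six reference names (phylum ... species)
    predicted: dict seq_id -> list of assigned names, possibly shorter than
               six when the classifier stopped at a higher rank
    """
    if not expected:
        raise ValueError("the test dataset is empty")
    idx = RANKS.index(rank)
    correct = 0
    for seq_id, truth in expected.items():
        call = predicted.get(seq_id, ())
        if len(call) > idx and call[idx] == truth[idx]:
            correct += 1
    return 100.0 * correct / len(expected)
```

The denominator is the full test dataset rather than the subset of assigned sequences, so a classifier that leaves many sequences unassigned is penalised.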
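The count-based part of this noise filter can be sketched as follows. The table layout (a dictionary of per-specimen read counts for each OTU) and the function name are assumptions made for illustration, and the removal of unassigned and eukaryotic OTUs, which needs taxonomy labels, is left out. The sketch also assumes that the 0.002 per cent threshold is applied to the total read count before any OTU is removed.

```python
def filter_otus(table, n_specimens, min_prevalence=0.10, min_fraction=0.00002):
    """table: dict otu_id -> dict specimen_id -> read count.

    Drops singleton OTUs, OTUs seen in fewer than `min_prevalence` of the
    specimens, and OTUs holding less than `min_fraction` (0.002 per cent)
    of all reads.
    """
    total_reads = sum(sum(counts.values()) for counts in table.values())
    if total_reads == 0:
        return {}
    kept = {}
    for otu_id, counts in table.items():
        otu_reads = sum(counts.values())
        present_in = sum(1 for c in counts.values() if c > 0)
        if otu_reads <= 1:                                # singleton
            continue
        if present_in / n_specimens < min_prevalence:     # too rare across specimens
            continue
        if otu_reads / total_reads < min_fraction:        # too few reads overall
            continue
        kept[otu_id] = counts
    return kept
```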
The study was carried out at the Biomedical Informatics Centre, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, India.
Results & Discussion
Acquisition of 16S rRNA databases: The Greengenes database (v.13.5), when pre-clustered at 97 per cent threshold as inbuilt in QIIME v1 software, contained 99,322 sequences belonging to 1812 unique genera, and the SILVA database (v.123) with similar pre-clustering contained 216,401 sequences belonging to 3541 unique genera. For RDP, the QIIME-compatible data in a similar format contained 145,925 sequences belonging to 2737 unique genera. [Figure 1]A shows the number of taxonomic units (by name) at various ranks, i.e., phylum to genus, shared between the SILVA, RDP and Greengenes taxonomies. This comparison showed that the three databases varied markedly in the taxa represented in each, with only 19.8 per cent of the phyla, 9.7 per cent of the classes, 9 per cent of the orders, 14.8 per cent of the families and 27.6 per cent of the genera present in any of the three databases being shared across all the databases.
Figure 1: Comparison of taxonomies based on taxon names found at each rank from phylum to genus. The three taxonomies, SILVA, Ribosomal Database Project (RDP) and Greengenes (GG), commonly used for 16S rRNA-based analyses, were compared in detail (Panel A), and then a union of these three databases (labelled as ALL) was compared against the unified 16S-UDb database (Panel B).
Construction of integrated 16S database: Of the 4,345,168 bacterial 16S rRNA gene sequences obtained from the three databases (Greengenes v.13.5: 1,262,986 sequences, RDP v.11.4: 3,070,243 sequences and SILVA 'All-Species Living Tree' Project v.123: 11,939 sequences), 2,629,394 (60.5%) were 1200 nucleotides or longer in length and free of ambiguous nucleotides. Of these, 405,538 were fully classified up to the species level. These 405,538 sequences were clustered at 97 per cent threshold using VSEARCH. For the current version of the database (16S-UDb v1.0), this yielded 13,078 unified sequences from the bacterial kingdom, belonging to 36 phyla, 94 classes, 187 orders, 414 families, 2453 genera and 4881 unique species-like groups. Proteobacteria was the predominant phylum accounting for 38.24 per cent of all the sequences, followed by Firmicutes (28.91%), Actinobacteria (12.72%), Bacteroidetes (10.44%), Cyanobacteria (2.6%), Tenericutes (1.39%), Spirochaetes (0.83%), Verrucomicrobia (0.47%) and 29 other minor phyla.
The numbers of taxonomic units (by name) shared between a union of the SILVA, RDP, Greengenes taxonomies (labelled as ALL) versus 16S-UDb at phylum to genus ranks are shown in [Figure 1]B. The top 10 phyla, class, order and genera in the Greengenes, SILVA, RDP and 16S-UDb databases are shown in [Supplementary Table 2 [Additional file 2]]. Of the taxonomic units at various levels contained in 'ALL' but not in 16S-UDb, a large majority were present in only one of the three starting databases, i.e., Greengenes, SILVA or RDP, and very few were represented in two or three of these [Supplementary Table 3 [Additional file 3]].
Table 1: Accuracy (%) of various combinations of four taxonomy assignment algorithms and four 16S rRNA sequence databases using different hypervariable regions
Contribution of each input database to the 16S-UDb database: The Greengenes database contributed 3387 sequences belonging to 23 phyla, 45 classes, 79 orders, 150 families, 306 genera and 533 species to the 16S-UDb database. Among these sequences, Firmicutes was the predominant phylum with 43.52 per cent of all the sequences contributed, followed by Proteobacteria (26.28%), Bacteroidetes (13.14%), Actinobacteria (9.86%), Crenarchaeota (1.27%), Spirochaetes (1.18%), Euryarchaeota (1.09%) and Verrucomicrobia (0.83%).
RDP contributed 6254 sequences from 30 phyla, 56 classes, 118 orders, 280 families, 1316 genera and 2880 species. Among these sequences, Proteobacteria (45.68%) was the most dominant phylum, followed by Firmicutes (28.06%), Actinobacteria (11.24%), Bacteroidetes (4.88%), Cyanobacteria (4.05%), Tenericutes (1.92%), Spirochaetes (0.80%) and Planctomycetes (0.56%).
The SILVA database contributed 3437 sequences from 29 phyla, 58 classes, 137 orders, 293 families, 1576 genera and 2264 species. Of these, Proteobacteria (36.49%) was the most predominant phylum, followed by Firmicutes (19.84%), Actinobacteria (18.21%), Bacteroidetes (17.89%), Tenericutes (1.45%), Deinococcus-Thermus (0.93%), Verrucomicrobia (0.61%) and Chloroflexi (0.52%).
Construction of gold standard test dataset: The NCBI Bacterial 16S Ribosomal RNA dataset (ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz) contained a total of 18,775 eligible (1200 nucleotides or longer in length) 16S rRNA sequences, of which 2132 were redundant and 5318 lacked taxonomic classification to species level. From the remaining full-length 16S rRNA sequences, 13 HVR/HVR-pair test datasets (each with 11,325 members) were constructed. The 16S rRNA test dataset comprised sequences from only the bacterial kingdom, related to 36 phyla, 86 classes, 162 orders, 383 families, 2453 genera and 4881 unique species-like groups. Proteobacteria was the dominant phylum with 38.89 per cent of all sequences (n=11,325), followed by Actinobacteria (23.69%), Firmicutes (19.51%), Bacteroidetes (10.76%), Tenericutes (1.78%), Spirochaetes (0.91%), Cyanobacteria (0.54%), Verrucomicrobia (0.44%) and 28 other minor phyla. The top 10 phylum, class, order, family and genus groups in this test dataset are shown in [Supplementary Table 4 [Additional file 4]].
Identification of 'optimal pipeline' using the test dataset: The default reference database for QIIME v1 is a subset of Greengenes rRNA sequences pre-clustered at 97 per cent identity. Several meta-analyses and case-control analyses of the human microbiome have used this 97 per cent identity subset of Greengenes, or of SILVA, as the reference database. Hence, we used the 16S-UDb clustered at 97 per cent threshold for our comparisons.
[Table 1] shows the performance of various combinations of four taxonomy assignment methods and four 16S rRNA databases (including the 16S-UDb clustered at 97 per cent threshold) in correctly classifying the 13 HVR/HVR-pair test datasets at the family, genus and species levels. For UCLUST and SortMeRNA, the SILVA database performed the best, whereas for the two naive Bayesian classifiers (mothur-nbc and rdp-nbc), 16S-UDb performed the best; this could indicate that different methods work better with databases of different sizes. However, overall, for each of the 13 datasets, the unified database (16S-UDb) and mothur-nbc classifier combination provided the highest correct classification rate at each taxonomic level.
Using this database-classifier combination (16S-UDb and mothur-nbc), we compared the performance of the 13 test datasets. In this analysis, the V2-3 HVR pair performed the best for classification at the genus and species levels, and the V3-4 HVR pair did the best for classification at the family level. The HVR providing the best discrimination depends on the composition of the sample (bacterial mixture). The strength of our analysis was that we used various tools on both an idealized and a real-life dataset and obtained good performance with this database-classifier pair.
Assessment of performance of the 'optimal pipeline' using a real-life dataset: In data from our previous study, the original analysis using the Greengenes database and UCLUST classifier (the GG-UCLUST approach) was able to assign genus and species to 17,439,681 (85.0%) and 8,381,405 (40.9%), respectively, of all reads (numbering 20,508,594). By contrast, the 16S-UDb-mothur-nbc approach was able to assign genus and species to larger proportions [17,886,699 (87.2%) and 12,450,839 (60.7%), respectively] of reads in this dataset. The 16S-UDb-mothur-nbc pipeline identified a larger number of families, genera and species in this dataset (50, 94 and 175, respectively; [Supplementary Table 5 [Additional file 5]] and [Supplementary Table 6 [Additional file 6]]) than the original GG-UCLUST pipeline (47, 69 and 34, respectively). In the results from these two pipelines, 35 families, 59 genera and 32 species were common. In the species-level results from the two approaches, 145 species showed discordance, including two (Staphylococcus aureus and Lactobacillus zeae) that were detected only by the GG-UCLUST approach and 143 that were detected only by the 16S-UDb-mothur-nbc approach. These data showed that the 16S-UDb-mothur-nbc approach was more sensitive in identifying the presence of different bacterial species in 16S rRNA sequence datasets, such as those derived from bacterial mixtures.
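The percentages and the discordance counts quoted above follow from the reported read and taxon counts by simple arithmetic, as the short check below illustrates. The helper names are hypothetical; all numbers are taken from the preceding paragraph.

```python
TOTAL_READS = 20_508_594


def percent(part, total=TOTAL_READS):
    """Share of reads, in per cent, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


def exclusive_counts(n_first, n_second, n_shared):
    """Taxa found only by the first and only by the second pipeline."""
    return n_first - n_shared, n_second - n_shared


# Reads assigned at genus and species level by each approach.
print(percent(17_439_681), percent(8_381_405))     # GG-UCLUST
print(percent(17_886_699), percent(12_450_839))    # 16S-UDb-mothur-nbc

# Species detected: 34 (GG-UCLUST), 175 (16S-UDb-mothur-nbc), 32 shared.
only_gg, only_udb = exclusive_counts(34, 175, 32)
print(only_gg, only_udb, only_gg + only_udb)
```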
The relative abundances of the four most frequent phyla (Bacteroidetes, Firmicutes, Proteobacteria and Actinobacteria) and of the 32 species identified by both the approaches [Table 2] were generally similar. However, for one species, namely Faecalibacterium prausnitzii (Family: Ruminococcaceae), the per cent abundances using the 16S-UDb-mothur-nbc approach [healthy: 3.33 (0.51-9.19), ERA: 4.39 (0.07-79.55); P <0.01] were much higher than those using the GG-UCLUST approach [healthy: 0.09 (0.02-0.63), ERA: 0.13 (0.001-1.84); P <0.01]. This indicated that the use of the optimum database and classifier combination identified by us might not only allow detection of a larger number of bacterial species in a mixture but also help better assess their abundances.
Table 2: Relative abundance (in %) of different taxonomic groups detected in the real-life test dataset using the GG-UCLUST and 16S-UDb-mothur-nbc approaches. Data are shown separately for healthy subjects and those with disease
Computational performances of different approaches: The 16S-UDb-mothur-nbc approach needed a shorter computational time than the GG-UCLUST approach (59 vs. 84 min, respectively) for completion.
For analysis of high-throughput 16S rRNA sequence data for bacterial mixtures, several taxonomy assignment methods, at least three 16S rRNA reference databases and various HVRs of the 16S rRNA gene are used. In this study, a unified database was constructed by merging non-ambiguous, fully annotated, full-length 16S sequences from the three commonly used databases. The performance of various combinations of 16S sequence databases, HVRs and taxonomy assignment methods was assessed. Our analysis showed that 16S-UDb (clustered at 97% identity), the unified 16S rRNA database that we created, performed better than the currently available 16S rRNA databases in that it was able to assign taxonomic lineage up to the family, genus and species levels to a larger proportion of sequences in a test database and in a real-life dataset. Further, a combination of this database with the mothur-nbc classifier had the best performance among all the database-classifier combinations, as did a region covering the V2 and V3 HVRs compared to the other HVRs.
Our study had a limitation in that we clustered the reference sequences using a 97 per cent threshold. Edgar has reported that the use of 100 per cent identity threshold may be better than the usual 97 per cent threshold for accurate prediction. This is because the use of traditional 97 per cent threshold may place sequences from more than one species and/or genera in the same cluster, leading to some erroneous classification at genus and species levels. Thus, there is a need to repeat this study at a future date using 100 per cent identity threshold instead of the 97 per cent identity threshold.
In conclusion, our analysis shows that sequencing of the V2-V3 region of the 16S rRNA gene, followed by analysis of the sequence data obtained using the mothur-nbc classifier of QIIME v1 and our unified 16S-UDb database (https://github.com/sarangian/16S-UDb.git), may be used for analysis of bacterial mixtures, such as those present at various body sites and in environmental specimens.
Financial support & sponsorship: None.
Conflicts of Interest: None.
References
Neufeld JD, Mohn WW. Assessment of microbial phylogenetic diversity based on environmental nucleic acids. In: Molecular identification, systematics, and population structure of prokaryotes. Berlin, Heidelberg: Springer; 2006. p. 219-59.
Curtis TP, Head IM, Lunn M, Woodcock S, Schloss PD, Sloan WT. What is the extent of prokaryotic diversity? Philos Trans R Soc Lond B Biol Sci
Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: Pluses, perils, and pitfalls. J Clin Microbiol
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res
Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al. Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Res
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol
Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol
Kopylova E, Noé L, Touzet H. SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics
Ritari J, Salojärvi J, Lahti L, de Vos WM. Improved taxonomic assignment of human intestinal 16S rRNA sequences by a dedicated reference database. BMC Genomics
Chaudhary N, Sharma AK, Agarwal P, Gupta A, Sharma VK. 16S classifier: A tool for fast and accurate taxonomic classification of 16S rRNA hypervariable regions in metagenomic datasets. PLoS One
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics
Yang B, Wang Y, Qian PY. Sensitivity and correlation of hypervariable regions in 16S rRNA genes in phylogenetic analysis. BMC Bioinformatics
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods
Aggarwal A, Sarangi AN, Gaur P, Shukla A, Aggarwal R. Gut microbiome in children with enthesitis-related arthritis in a developing country and the effect of probiotic administration. Clin Exp Immunol
Lozupone CA, Stombaugh J, Gonzalez A, Ackermann G, Wendel D, Vázquez-Baeza Y, et al. Meta-analyses of studies of the human microbiota. Genome Res
Wu Y, Chi X, Zhang Q, Chen F, Deng X. Characterization of the salivary microbiome in people with obesity. PeerJ
Singh A, Sarangi AN, Goel A, Srivastava R, Bhargava R, Gaur P, et al. Effect of administration of a probiotic preparation on gut microbiota and immune response in healthy women in India: An open-label, single-arm pilot study. BMC Gastroenterol
Jangi S, Gandhi R, Cox LM, Li N, von Glehn F, Yan R, et al. Alterations of the human gut microbiome in multiple sclerosis. Nat Commun
Chakravorty S, Helb D, Burday M, Connell N, Alland D. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods
Kim M, Morrison M, Yu Z. Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes. J Microbiol Methods
Edgar RC. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics
Edgar RC. Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ
[Table 1], [Table 2]