Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

Alexa B.R. McIntyre, Rachid Ounit, Ebrahim Afshinnekoo, Robert J. Prill, Elizabeth Henaff, Noah Alexander, Samuel S. Minot, David Danko, Jonathan Foox, Sofia Ahsanuddin, Scott Tighe, Nur A. Hasan, Poorani Subramanian, Kelly Moffat, Shawn Levy, Stefano Lonardi, Nick Greenfield, Rita R. Colwell, Gail L. Rosen, Christopher E. Mason

Research output: Contribution to journal › Article

Abstract

Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.

Results: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.

Conclusions: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
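
The strategies named in the abstract (abundance filtering, tool intersection, and ensemble voting, scored by precision and recall) can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the species profiles, the 2% abundance cutoff, and the two-vote threshold are all hypothetical values chosen only to show the mechanics.

```python
from collections import Counter

def abundance_filter(profile, min_abundance):
    """Keep species whose relative abundance is at least min_abundance."""
    return {sp for sp, ab in profile.items() if ab >= min_abundance}

def intersect_calls(call_sets):
    """Tool intersection: species reported by every tool."""
    return set.intersection(*call_sets) if call_sets else set()

def vote_calls(call_sets, min_votes):
    """Simple ensemble: species reported by at least min_votes tools."""
    counts = Counter(sp for calls in call_sets for sp in calls)
    return {sp for sp, n in counts.items() if n >= min_votes}

def precision_recall(predicted, truth):
    """Precision and recall of a predicted species set against a known truth set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical control sample with a known composition.
truth = {"Escherichia coli", "Bacillus subtilis", "Staphylococcus aureus"}

# Hypothetical relative-abundance profiles from three classifiers,
# one per strategy family mentioned in the abstract.
profiles = {
    "kmer_tool": {
        "Escherichia coli": 0.50, "Bacillus subtilis": 0.30,
        "Staphylococcus aureus": 0.15, "Shigella flexneri": 0.04,
        "Bacillus anthracis": 0.01,
    },
    "alignment_tool": {
        "Escherichia coli": 0.55, "Bacillus subtilis": 0.35,
        "Shigella flexneri": 0.10,
    },
    "marker_tool": {
        "Escherichia coli": 0.60, "Staphylococcus aureus": 0.40,
    },
}

# Assumed cutoff: drop calls below 2% relative abundance.
calls = [abundance_filter(p, 0.02) for p in profiles.values()]

strict = intersect_calls(calls)         # every tool must agree
voted = vote_calls(calls, min_votes=2)  # at least two tools must agree

print("intersection:", sorted(strict), precision_recall(strict, truth))
print("two-vote ensemble:", sorted(voted), precision_recall(voted, truth))
```

In this toy case the intersection keeps only the one species all three tools agree on (high precision, low recall), while the two-vote ensemble recovers every true species but retains a false positive that two tools share. That mirrors the abstract's caveat that these strategies do not always eliminate false positives.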

Original language: English (US)
Article number: 182
Journal: Genome Biology
Volume: 18
Issue number: 1
DOI: https://doi.org/10.1186/s13059-017-1299-7
State: Published - Sep 21 2017

Keywords

  • Classification
  • Comparison
  • Ensemble methods
  • Meta-classification
  • Metagenomics
  • Pathogen detection
  • Shotgun sequencing
  • Taxonomy

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics
  • Cell Biology

Cite this

McIntyre, A. B. R., Ounit, R., Afshinnekoo, E., Prill, R. J., Henaff, E., Alexander, N., ... Mason, C. E. (2017). Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biology, 18(1), [182]. https://doi.org/10.1186/s13059-017-1299-7

@article{ba5c7058162d448586263a9506324095,
title = "Comprehensive benchmarking and ensemble approaches for metagenomic classifiers",
keywords = "Classification, Comparison, Ensemble methods, Meta-classification, Metagenomics, Pathogen detection, Shotgun sequencing, Taxonomy",
author = "McIntyre, {Alexa B.R.} and Rachid Ounit and Ebrahim Afshinnekoo and Prill, {Robert J.} and Elizabeth Henaff and Noah Alexander and Minot, {Samuel S.} and David Danko and Jonathan Foox and Sofia Ahsanuddin and Scott Tighe and Hasan, {Nur A.} and Poorani Subramanian and Kelly Moffat and Shawn Levy and Stefano Lonardi and Nick Greenfield and Colwell, {Rita R.} and Rosen, {Gail L.} and Mason, {Christopher E.}",
year = "2017",
month = "9",
day = "21",
doi = "10.1186/s13059-017-1299-7",
language = "English (US)",
volume = "18",
journal = "Genome Biology",
issn = "1474-7596",
publisher = "BioMed Central",
number = "1",

}
