The impact of outgroup choice and missing data on major seed plant phylogenetics using genome-wide EST data

Jose Eduardo de la Torre-Bárcena, Sergios Orestis Kolokotronis, Ernest K. Lee, Dennis Wm Stevenson, Eric D. Brenner, Manpreet S. Katari, Gloria Coruzzi, Rob DeSalle

Research output: Contribution to journal (Article)

Abstract

Background: Genome-level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole-genome DNA sequences of hundreds of organisms and large-scale EST databases, a large number of candidate genes for inclusion in phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important plant phylogenetic questions concerning the hierarchical relationships of the several major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales), which continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group.

Methodology: We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships. This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved an unbiased (with respect to gene choice), well-resolved and highly supported phylogenetic hypothesis that was robust to various outgroup combinations.

Conclusions: We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) on tree topology, stability and support metrics. Our results indicate that while missing characters and order of addition of genes to an analysis do not influence branch support, inadequate taxon sampling and limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic-scale data sets. As expected, support and resolution increase significantly as more informative characters are added, until reaching a threshold, beyond which support metrics stabilize, and the effect of adding conflicting characters is minimized.

Original language: English (US)
Article number: e5764
Journal: PLoS One
Volume: 4
Issue number: 6
DOI: https://doi.org/10.1371/journal.pone.0005764
State: Published - Jun 2 2009

Fingerprint

Expressed Sequence Tags
Spermatophytina
Seed
Seeds
Genes
Genome
genome
phylogeny
Plant Genome
Angiosperms
genes
Cycadophyta
Gymnosperms
Databases
Angiospermae
Cycadales
Gene Order
DNA sequences
nucleotide sequences
Phylogeny

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

de la Torre-Bárcena, J. E., Kolokotronis, S. O., Lee, E. K., Stevenson, D. W., Brenner, E. D., Katari, M. S., ... DeSalle, R. (2009). The impact of outgroup choice and missing data on major seed plant phylogenetics using genome-wide EST data. PLoS One, 4(6), [e5764]. https://doi.org/10.1371/journal.pone.0005764

@article{8b6a513a7b314eacad59187b1df1dfce,
title = "The impact of outgroup choice and missing data on major seed plant phylogenetics using genome-wide EST data",
author = "{de la Torre-B{\'a}rcena}, {Jose Eduardo} and Kolokotronis, {Sergios Orestis} and Lee, {Ernest K.} and Stevenson, {Dennis Wm} and Brenner, {Eric D.} and Katari, {Manpreet S.} and Coruzzi, Gloria and DeSalle, Rob",
year = "2009",
month = "6",
day = "2",
doi = "10.1371/journal.pone.0005764",
language = "English (US)",
volume = "4",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "6",

}

TY - JOUR

T1 - The impact of outgroup choice and missing data on major seed plant phylogenetics using genome-wide EST data

AU - de la Torre-Bárcena, Jose Eduardo

AU - Kolokotronis, Sergios Orestis

AU - Lee, Ernest K.

AU - Stevenson, Dennis Wm

AU - Brenner, Eric D.

AU - Katari, Manpreet S.

AU - Coruzzi, Gloria

AU - DeSalle, Rob

PY - 2009/6/2

Y1 - 2009/6/2

UR - http://www.scopus.com/inward/record.url?scp=66749144092&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=66749144092&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0005764

DO - 10.1371/journal.pone.0005764

M3 - Article

VL - 4

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 6

M1 - e5764

ER -