Supervised Bayesian latent class models for high-dimensional data

Stacia M. Desantis, E. Andrés Houseman, Brent A. Coull, Catherine L. Nutt, Rebecca Betensky

Research output: Contribution to journal › Article

Abstract

High-grade gliomas are the most common primary brain tumors in adults and are typically diagnosed using histopathology. However, these diagnostic categories are highly heterogeneous and do not always correlate well with survival. In an attempt to refine these diagnoses, we make several immunohistochemical measurements of YKL-40, a gene previously shown to be differentially expressed between diagnostic groups. We propose two latent class models for classification and variable selection in the presence of high-dimensional binary data, fitted using Bayesian Markov chain Monte Carlo techniques. Penalization and model selection are incorporated in this setting via prior distributions on the unknown parameters. The methods provide valid parameter estimates under conditions in which standard supervised latent class models do not, and outperform two-stage approaches to variable selection and parameter estimation in a variety of settings. We study the properties of these methods in simulations and apply them to the glioma study, for which identifiable three-class parameter estimates cannot be obtained without penalization. With penalization, the resulting latent classes correlate well with clinical tumor grade and offer additional information on survival prognosis that is not captured by clinical diagnosis alone. The inclusion of YKL-40 features also increases the precision of survival estimates. Fitting models with and without YKL-40 highlights a subgroup of patients who have a glioblastoma (GBM) diagnosis but appear to have a better prognosis than the typical GBM patient.

Original language: English (US)
Pages (from-to): 1342-1360
Number of pages: 19
Journal: Statistics in Medicine
Volume: 31
Issue number: 13
DOIs: https://doi.org/10.1002/sim.4448
State: Published - Jun 15 2012

Fingerprint

Latent Class Model
Penalization
Bayesian Model
High-dimensional Data
Prognosis
Glioblastoma
Variable Selection
Glioma
Correlate
Survival
Diagnostics
Estimate
Brain Tumor
Latent Class
Markov Chains
Monte Carlo Techniques
Binary Data
Model Fitting
Prior distribution
Markov Chain Monte Carlo

Keywords

  • Cancer
  • Glioma
  • Latent class
  • Penalization
  • Ridge
  • Supervised
  • Variable selection

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability

Cite this

Desantis, S. M., Andrés Houseman, E., Coull, B. A., Nutt, C. L., & Betensky, R. (2012). Supervised Bayesian latent class models for high-dimensional data. Statistics in Medicine, 31(13), 1342-1360. https://doi.org/10.1002/sim.4448

@article{f881fc790dcb45819edebc3cfcef14d9,
title = "Supervised Bayesian latent class models for high-dimensional data",
keywords = "Cancer, Glioma, Latent class, Penalization, Ridge, Supervised, Variable selection",
author = "Desantis, {Stacia M.} and {Andr{\'e}s Houseman}, E. and Coull, {Brent A.} and Nutt, {Catherine L.} and Betensky, Rebecca",
year = "2012",
month = "6",
day = "15",
doi = "10.1002/sim.4448",
language = "English (US)",
volume = "31",
pages = "1342--1360",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
number = "13",

}

TY - JOUR

T1 - Supervised Bayesian latent class models for high-dimensional data

AU - Desantis, Stacia M.

AU - Andrés Houseman, E.

AU - Coull, Brent A.

AU - Nutt, Catherine L.

AU - Betensky, Rebecca

PY - 2012/6/15

Y1 - 2012/6/15

KW - Cancer

KW - Glioma

KW - Latent class

KW - Penalization

KW - Ridge

KW - Supervised

KW - Variable selection

UR - http://www.scopus.com/inward/record.url?scp=84861190696&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84861190696&partnerID=8YFLogxK

U2 - 10.1002/sim.4448

DO - 10.1002/sim.4448

M3 - Article

C2 - 22495652

AN - SCOPUS:84861190696

VL - 31

SP - 1342

EP - 1360

JO - Statistics in Medicine

JF - Statistics in Medicine

SN - 0277-6715

IS - 13

ER -