Shrinkage-based similarity metric for cluster analysis of microarray data

Vera Cherepinsky, Jiawu Feng, Marc Rejali, Bhubaneswar Mishra

Research output: Contribution to journal › Article

Abstract

The current standard correlation coefficient used in the analysis of microarray data was introduced by M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein [(1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868]. Its formulation is rather arbitrary. We give a mathematically rigorous correlation coefficient of two data vectors based on James-Stein shrinkage estimators. We use the assumptions described by Eisen et al., also using the fact that the data can be treated as transformed into normal distributions. While Eisen et al. use zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable [with a priori distribution 𝒩(0, τ²)], and use Bayesian analysis to obtain a posteriori distribution of μ in terms of the data. The shrunk estimator for μ differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example by using the data set from Eisen et al. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients and contrasting them against clusters based on the activators known in the literature. The estimated false positives and false negatives from this study indicate that using the shrinkage metric improves the accuracy of the analysis.
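
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a correlation-like similarity in which each vector is centered on an offset equal to a hypothetical factor `gamma` times its sample mean, so that `gamma = 0` gives the zero-mean estimator attributed to Eisen et al., `gamma = 1` gives the ordinary Pearson coefficient, and intermediate values give a shrunk mean. The helper `shrinkage_factor` uses the textbook normal-normal posterior mean with the noise variance `sigma2` and prior variance `tau2` treated as known, which is a simplifying assumption; the paper estimates the shrinkage from the data. All function and parameter names here are illustrative.

```python
import math


def shrinkage_factor(sigma2, tau2, n):
    """Weight on the sample mean in the posterior mean of mu.

    Assumes mu ~ N(0, tau2) and n observations with known noise
    variance sigma2 (a simplification made for this sketch).
    """
    return tau2 / (tau2 + sigma2 / n)


def offset_similarity(x, y, x_offset, y_offset):
    """Correlation-like similarity with a caller-chosen offset per vector."""
    n = len(x)
    dx = [v - x_offset for v in x]
    dy = [v - y_offset for v in y]
    # Root-mean-square deviation of each vector around its offset.
    phi_x = math.sqrt(sum(d * d for d in dx) / n)
    phi_y = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * phi_x * phi_y)


def similarity(x, y, gamma):
    """Center each vector on gamma times its sample mean.

    gamma = 0.0 uses a zero offset, gamma = 1.0 is Pearson correlation,
    and values in between use a shrunk estimate of the mean.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return offset_similarity(x, y, gamma * mean_x, gamma * mean_y)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.0, 1.0, 4.0, 3.0]
    gamma = shrinkage_factor(sigma2=1.0, tau2=1.0, n=len(x))
    for g in (0.0, gamma, 1.0):
        print(g, similarity(x, y, g))
```

In this toy example the shrunk similarity falls between the zero-offset and Pearson values, which is the behavior the abstract describes qualitatively. The sketch does not guard against a vector that is constant at its offset, where the denominator would be zero.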

Original language: English (US)
Pages (from-to): 9668-9673
Number of pages: 6
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 100
Issue number: 17
DOI: 10.1073/pnas.1633770100
State: Published - Aug 19 2003

Fingerprint

Cluster Analysis
Bayes Theorem
Normal Distribution
Microarray Analysis
Computer Simulation
Genes
Cell Cycle
Yeasts
Datasets

ASJC Scopus subject areas

  • Genetics
  • General

Cite this

Shrinkage-based similarity metric for cluster analysis of microarray data. / Cherepinsky, Vera; Feng, Jiawu; Rejali, Marc; Mishra, Bhubaneswar.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 100, No. 17, 19.08.2003, p. 9668-9673.

@article{0de81e479f5b441c9f2c91ecb5cad928,
title = "Shrinkage-based similarity metric for cluster analysis of microarray data",
abstract = "The current standard correlation coefficient used in the analysis of microarray data was introduced by M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein [(1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868]. Its formulation is rather arbitrary. We give a mathematically rigorous correlation coefficient of two data vectors based on James-Stein shrinkage estimators. We use the assumptions described by Eisen et al., also using the fact that the data can be treated as transformed into normal distributions. While Eisen et al. use zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable [with a priori distribution $\mathcal{N}(0, \tau^{2})$], and use Bayesian analysis to obtain a posteriori distribution of μ in terms of the data. The shrunk estimator for μ differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example by using the data set from Eisen et al. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients and contrasting them against clusters based on the activators known in the literature. The estimated false positives and false negatives from this study indicate that using the shrinkage metric improves the accuracy of the analysis.",
author = "Vera Cherepinsky and Jiawu Feng and Marc Rejali and Bhubaneswar Mishra",
year = "2003",
month = "8",
day = "19",
doi = "10.1073/pnas.1633770100",
language = "English (US)",
volume = "100",
pages = "9668--9673",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "17",
}

TY - JOUR

T1 - Shrinkage-based similarity metric for cluster analysis of microarray data

AU - Cherepinsky, Vera

AU - Feng, Jiawu

AU - Rejali, Marc

AU - Mishra, Bhubaneswar

PY - 2003/8/19

Y1 - 2003/8/19

AB - The current standard correlation coefficient used in the analysis of microarray data was introduced by M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein [(1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868]. Its formulation is rather arbitrary. We give a mathematically rigorous correlation coefficient of two data vectors based on James-Stein shrinkage estimators. We use the assumptions described by Eisen et al., also using the fact that the data can be treated as transformed into normal distributions. While Eisen et al. use zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable [with a priori distribution 𝒩(0, τ²)], and use Bayesian analysis to obtain a posteriori distribution of μ in terms of the data. The shrunk estimator for μ differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example by using the data set from Eisen et al. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients and contrasting them against clusters based on the activators known in the literature. The estimated false positives and false negatives from this study indicate that using the shrinkage metric improves the accuracy of the analysis.

UR - http://www.scopus.com/inward/record.url?scp=0043194165&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0043194165&partnerID=8YFLogxK

U2 - 10.1073/pnas.1633770100

DO - 10.1073/pnas.1633770100

M3 - Article

VL - 100

SP - 9668

EP - 9673

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 17

ER -