A new approach to the estimation of inter-variable correlation

Marc Sobel, Bhubaneswar Mishra

Research output: Contribution to journal (Article)

Abstract

The use of different measures of similarity between observed vectors for the purposes of classifying or clustering them has been expanding dramatically in recent years. One result of this expansion has been the use of many new similarity measures, designed for the purpose of satisfying various criteria. A noteworthy application involves estimating the relationships between genes using microarray experimental data. We consider the class of 'correlation-type' similarity measures. The use of these new measures of similarity suggests that the whole problem needs to be formulated in statistical terms to clarify their relative benefits. Pursuant to this need, we define, for each given observed vector, a baseline representing the 'true' value common to each of the component observations. These 'true' values are taken to be parameters. We define the 'true correlation' between each two observed vectors as the average (over the distribution of the observations for given baseline parameters) of Pearson's correlation with sample means replaced by the corresponding baseline parameters. Estimators of this true correlation are assessed using their mean squared error (MSE). Proper Bayes estimators of this true correlation, being based on the predictive posterior distribution of the data, are both difficult to calculate/analyze and highly non-robust. By contrast, empirical Bayes estimators are: (i) close to their Bayesian counterparts; (ii) easy to analyze; and (iii) strongly robust. For these reasons, we employ empirical Bayes estimators of correlation in place of their Bayesian counterparts. We show how to construct two different kinds of simultaneous Bayes correlation estimators: the first assumes no a priori correlation between baseline parameters; the second assumes a common unknown correlation between them. Estimators of the latter type frequently have significantly smaller MSE than those of the former type which, in turn, frequently have significantly smaller MSE than their Pearson estimator counterparts. For purposes of illustrating our results, we examine the problem of inferring the relationships between gene expression level vectors, in the context of observing microarray experimental data.
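
As a rough illustration of the quantity the abstract describes, the sketch below computes Pearson's correlation with the sample means replaced by supplied baseline values. It is only a sketch under stated assumptions, not the authors' estimator: the function names `baseline_correlation` and `pearson` are hypothetical, the baselines are passed in as known numbers (in the paper they are unknown parameters), and the 'true correlation' of the abstract is the average of this statistic over the distribution of the observations, which is not computed here.

```python
import math


def baseline_correlation(x, y, mu_x, mu_y):
    """Pearson-type correlation with sample means replaced by baselines.

    mu_x and mu_y are assumed baseline values for the vectors x and y;
    how they are estimated (e.g. by empirical Bayes) is outside this sketch.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Deviations from the baselines rather than from the sample means.
    dx = [xi - mu_x for xi in x]
    dy = [yi - mu_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("a vector has no variation around its baseline")
    return numerator / denominator


def pearson(x, y):
    """Ordinary Pearson correlation: the special case where each baseline
    is the vector's own sample mean."""
    return baseline_correlation(x, y, sum(x) / len(x), sum(y) / len(y))
```

Calling `baseline_correlation(x, y, 0.0, 0.0)` and `pearson(x, y)` on the same pair of vectors shows how strongly the choice of baseline can change the resulting similarity value.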

Original language: English (US)
Pages (from-to): 2315-2330
Number of pages: 16
Journal: Communications in Statistics - Theory and Methods
Volume: 37
Issue number: 15
DOI: 10.1080/03610920801923884
State: Published - Sep 2008

Keywords

  • Admissibility
  • Bayes estimation
  • Bioinformatics
  • Correlation
  • Empirical Bayes

ASJC Scopus subject areas

  • Statistics and Probability
  • Safety, Risk, Reliability and Quality

Cite this

A new approach to the estimation of inter-variable correlation. / Sobel, Marc; Mishra, Bhubaneswar.

In: Communications in Statistics - Theory and Methods, Vol. 37, No. 15, 09.2008, p. 2315-2330.

@article{f8130ac8af0f49a19330c6ee46e52b86,
title = "A new approach to the estimation of inter-variable correlation",
abstract = "The use of different measures of similarity between observed vectors for the purposes of classifying or clustering them has been expanding dramatically in recent years. One result of this expansion has been the use of many new similarity measures, designed for the purpose of satisfying various criteria. A noteworthy application involves estimating the relationships between genes using microarray experimental data. We consider the class of 'correlation-type' similarity measures. The use of these new measures of similarity suggests that the whole problem needs to be formulated in statistical terms to clarify their relative benefits. Pursuant to this need, we define, for each given observed vector, a baseline representing the 'true' value common to each of the component observations. These 'true' values are taken to be parameters. We define the 'true correlation' between each two observed vectors as the average (over the distribution of the observations for given baseline parameters) of Pearson's correlation with sample means replaced by the corresponding baseline parameters. Estimators of this true correlation are assessed using their mean squared error (MSE). Proper Bayes estimators of this true correlation, being based on the predictive posterior distribution of the data, are both difficult to calculate/analyze and highly non-robust. By contrast, empirical Bayes estimators are: (i) close to their Bayesian counterparts; (ii) easy to analyze; and (iii) strongly robust. For these reasons, we employ empirical Bayes estimators of correlation in place of their Bayesian counterparts. We show how to construct two different kinds of simultaneous Bayes correlation estimators: the first assumes no a priori correlation between baseline parameters; the second assumes a common unknown correlation between them. Estimators of the latter type frequently have significantly smaller MSE than those of the former type which, in turn, frequently have significantly smaller MSE than their Pearson estimator counterparts. For purposes of illustrating our results, we examine the problem of inferring the relationships between gene expression level vectors, in the context of observing microarray experimental data.",
keywords = "Admissibility, Bayes estimation, Bioinformatics, Correlation, Empirical Bayes",
author = "Marc Sobel and Bhubaneswar Mishra",
year = "2008",
month = "9",
doi = "10.1080/03610920801923884",
language = "English (US)",
volume = "37",
pages = "2315--2330",
journal = "Communications in Statistics - Theory and Methods",
issn = "0361-0926",
publisher = "Taylor and Francis Ltd.",
number = "15",

}

TY - JOUR

T1 - A new approach to the estimation of inter-variable correlation

AU - Sobel, Marc

AU - Mishra, Bhubaneswar

PY - 2008/9

Y1 - 2008/9

N2 - The use of different measures of similarity between observed vectors for the purposes of classifying or clustering them has been expanding dramatically in recent years. One result of this expansion has been the use of many new similarity measures, designed for the purpose of satisfying various criteria. A noteworthy application involves estimating the relationships between genes using microarray experimental data. We consider the class of 'correlation-type' similarity measures. The use of these new measures of similarity suggests that the whole problem needs to be formulated in statistical terms to clarify their relative benefits. Pursuant to this need, we define, for each given observed vector, a baseline representing the 'true' value common to each of the component observations. These 'true' values are taken to be parameters. We define the 'true correlation' between each two observed vectors as the average (over the distribution of the observations for given baseline parameters) of Pearson's correlation with sample means replaced by the corresponding baseline parameters. Estimators of this true correlation are assessed using their mean squared error (MSE). Proper Bayes estimators of this true correlation, being based on the predictive posterior distribution of the data, are both difficult to calculate/analyze and highly non-robust. By contrast, empirical Bayes estimators are: (i) close to their Bayesian counterparts; (ii) easy to analyze; and (iii) strongly robust. For these reasons, we employ empirical Bayes estimators of correlation in place of their Bayesian counterparts. We show how to construct two different kinds of simultaneous Bayes correlation estimators: the first assumes no a priori correlation between baseline parameters; the second assumes a common unknown correlation between them. Estimators of the latter type frequently have significantly smaller MSE than those of the former type which, in turn, frequently have significantly smaller MSE than their Pearson estimator counterparts. For purposes of illustrating our results, we examine the problem of inferring the relationships between gene expression level vectors, in the context of observing microarray experimental data.

KW - Admissibility

KW - Bayes estimation

KW - Bioinformatics

KW - Correlation

KW - Empirical Bayes

UR - http://www.scopus.com/inward/record.url?scp=46149104548&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=46149104548&partnerID=8YFLogxK

U2 - 10.1080/03610920801923884

DO - 10.1080/03610920801923884

M3 - Article

VL - 37

SP - 2315

EP - 2330

JO - Communications in Statistics - Theory and Methods

JF - Communications in Statistics - Theory and Methods

SN - 0361-0926

IS - 15

ER -