Principal component analysis combined with truncated-Newton minimization for dimensionality reduction of chemical databases

Dexuan Xie, Suresh B. Singh, Eugene M. Fluder, Tamar Schlick

Research output: Contribution to journal › Article

Abstract

The similarity and diversity sampling problems are two challenging optimization tasks that arise in the analysis of chemical databases. As a first step to their solution, we propose an efficient projection/refinement protocol based on principal component analysis (PCA) and the truncated-Newton minimization method implemented by our package TNPACK (PCA/TNPACK). We show that PCA can provide the same initial guess as the singular value decomposition (SVD) for the distance-geometry optimization problem if each column of the database matrix has a mean of zero. Hence, our PCA/TNPACK approach is analogous to the SVD/TNPACK projection/refinement protocol that we developed recently for visualizing large chemical databases. Using PCA/TNPACK and the Merck MDDR database (MDL Drug Data Report), we further investigate the projection/refinement procedure with regard to the preservation of the original clusters of chemical compounds, the accuracy of similarity and diversity sampling of chemical compounds, and the potential application in the study of structure-activity relationships. We also explore, through simple experiments, the accuracy and efficiency of the PCA/TNPACK procedure compared with those of a global optimization algorithm (simulated annealing, as implemented by the program package SIMANN) in producing the projection mapping of a database. Numerical results show that the 2D PCA/TNPACK mapping can preserve the distance relationships of the original database and is thus valuable as a first step in similarity and diversity applications. Of course, the generation of a global rather than local minimizer and its interpretation in terms of pharmaceutical applications remain a challenge. Since all numerical tests are performed on the Merck MDDR database, results are representative of realistic cases encountered in the field of drug design and may help analyze properties of medicinal compounds.
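
The PCA/SVD equivalence stated in the abstract can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a small synthetic matrix in place of the MDDR database; the function names `pca_projection` and `svd_projection` are illustrative and are not part of TNPACK or the authors' code. It covers only the projection step (the initial guess), not the truncated-Newton refinement.

```python
import numpy as np


def pca_projection(X, k=2):
    """Project the rows of X onto the top-k principal components.

    Uses the eigendecomposition of the sample covariance matrix.
    """
    Xc = X - X.mean(axis=0)                  # center each column
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    return Xc @ eigvecs[:, order]


def svd_projection(X, k=2):
    """Project the rows of X using the top-k singular triplets (U_k * s_k)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


# Synthetic stand-in for a database matrix: 50 "compounds", 8 "descriptors".
# Column scaling keeps the leading eigenvalues well separated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * np.arange(1, 9)

# The equivalence requires every column to have zero mean.
Xc = X - X.mean(axis=0)

P = pca_projection(Xc)
S = svd_projection(Xc)

# Each projected axis is defined only up to sign, so align signs first.
signs = np.sign(np.sum(P * S, axis=0))
max_diff = np.max(np.abs(P - S * signs))
print("largest coordinate difference:", max_diff)
```

The two projections can differ by a sign flip per axis, which leaves all pairwise distances unchanged; this is why the comparison aligns signs before measuring the difference. Applying `svd_projection` to the uncentered matrix `X` would generally give a different mapping, which is the role of the zero-mean condition.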

Original language: English (US)
Pages (from-to): 161-185
Number of pages: 25
Journal: Mathematical Programming
Volume: 95
Issue number: 1
DOI: 10.1007/s10107-002-0345-7
State: Published - Jan 2003

Fingerprint

Dimensionality Reduction
Principal component analysis
Projection
Refinement
Chemical compounds
Singular value decomposition
Distance Geometry
Structure-activity Relationship
Sampling
Drug Design
Local Minimizer
Optimization
Guess
Global optimization
Data base
Simulated annealing
Preservation

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Software
  • Mathematics(all)
  • Applied Mathematics
  • Safety, Risk, Reliability and Quality
  • Management Science and Operations Research

Cite this

Principal component analysis combined with truncated-Newton minimization for dimensionality reduction of chemical databases. / Xie, Dexuan; Singh, Suresh B.; Fluder, Eugene M.; Schlick, Tamar.

In: Mathematical Programming, Vol. 95, No. 1, 01.2003, p. 161-185.

@article{2dbefb872e874994b3173046591334af,
title = "Principal component analysis combined with truncated-Newton minimization for dimensionality reduction of chemical databases",
abstract = "The similarity and diversity sampling problems are two challenging optimization tasks that arise in the analysis of chemical databases. As a first step to their solution, we propose an efficient projection/refinement protocol based on principal component analysis (PCA) and the truncated-Newton minimization method implemented by our package TNPACK (PCA/TNPACK). We show that PCA can provide the same initial guess as the singular value decomposition (SVD) for the distance-geometry optimization problem if each column of the database matrix has a mean of zero. Hence, our PCA/TNPACK approach is analogous to the SVD/TNPACK projection/refinement protocol that we developed recently for visualizing large chemical databases. Using PCA/TNPACK and the Merck MDDR database (MDL Drug Data Report), we further investigate the projection/refinement procedure with regard to the preservation of the original clusters of chemical compounds, the accuracy of similarity and diversity sampling of chemical compounds, and the potential application in the study of structure-activity relationships. We also explore, through simple experiments, the accuracy and efficiency of the PCA/TNPACK procedure compared with those of a global optimization algorithm (simulated annealing, as implemented by the program package SIMANN) in producing the projection mapping of a database. Numerical results show that the 2D PCA/TNPACK mapping can preserve the distance relationships of the original database and is thus valuable as a first step in similarity and diversity applications. Of course, the generation of a global rather than local minimizer and its interpretation in terms of pharmaceutical applications remain a challenge. Since all numerical tests are performed on the Merck MDDR database, results are representative of realistic cases encountered in the field of drug design and may help analyze properties of medicinal compounds.",
author = "Xie, Dexuan and Singh, {Suresh B.} and Fluder, {Eugene M.} and Schlick, Tamar",
year = "2003",
month = "1",
doi = "10.1007/s10107-002-0345-7",
language = "English (US)",
volume = "95",
pages = "161--185",
journal = "Mathematical Programming",
issn = "0025-5610",
publisher = "Springer-Verlag GmbH and Co. KG",
number = "1",
}

TY - JOUR

T1 - Principal component analysis combined with truncated-Newton minimization for dimensionality reduction of chemical databases

AU - Xie, Dexuan

AU - Singh, Suresh B.

AU - Fluder, Eugene M.

AU - Schlick, Tamar

PY - 2003/1

Y1 - 2003/1

N2 - The similarity and diversity sampling problems are two challenging optimization tasks that arise in the analysis of chemical databases. As a first step to their solution, we propose an efficient projection/refinement protocol based on principal component analysis (PCA) and the truncated-Newton minimization method implemented by our package TNPACK (PCA/TNPACK). We show that PCA can provide the same initial guess as the singular value decomposition (SVD) for the distance-geometry optimization problem if each column of the database matrix has a mean of zero. Hence, our PCA/TNPACK approach is analogous to the SVD/TNPACK projection/refinement protocol that we developed recently for visualizing large chemical databases. Using PCA/TNPACK and the Merck MDDR database (MDL Drug Data Report), we further investigate the projection/refinement procedure with regard to the preservation of the original clusters of chemical compounds, the accuracy of similarity and diversity sampling of chemical compounds, and the potential application in the study of structure-activity relationships. We also explore, through simple experiments, the accuracy and efficiency of the PCA/TNPACK procedure compared with those of a global optimization algorithm (simulated annealing, as implemented by the program package SIMANN) in producing the projection mapping of a database. Numerical results show that the 2D PCA/TNPACK mapping can preserve the distance relationships of the original database and is thus valuable as a first step in similarity and diversity applications. Of course, the generation of a global rather than local minimizer and its interpretation in terms of pharmaceutical applications remain a challenge. Since all numerical tests are performed on the Merck MDDR database, results are representative of realistic cases encountered in the field of drug design and may help analyze properties of medicinal compounds.

UR - http://www.scopus.com/inward/record.url?scp=4143092391&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4143092391&partnerID=8YFLogxK

U2 - 10.1007/s10107-002-0345-7

DO - 10.1007/s10107-002-0345-7

M3 - Article

VL - 95

SP - 161

EP - 185

JO - Mathematical Programming

JF - Mathematical Programming

SN - 0025-5610

IS - 1

ER -