Discovering active motifs in sets of related protein sequences and using them for classification

Jason T L Wang, Thomas G. Marr, Dennis Shasha, Bruce A. Shapiro, Gung Wei Chirn

Research output: Contribution to journal › Article

Abstract

We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two-step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with the visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high-confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).
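The two-step process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes fixed-length, single-segment motifs, takes the first few sequences as the "small sample", and treats "approximately present" as "some substring lies within a given edit distance". The function names and the parameters `sample_size`, `length`, `max_mutations`, and `min_occurrence` are hypothetical, and the paper's two optimization heuristics are not modeled.

```python
def min_substring_edit_distance(motif: str, seq: str) -> int:
    """Smallest edit distance between `motif` and any substring of `seq`."""
    # prev[j]: best cost of aligning the motif prefix seen so far to a
    # substring of seq ending at position j. The empty prefix costs 0
    # everywhere, which lets a match start at any position.
    prev = [0] * (len(seq) + 1)
    for i, m in enumerate(motif, start=1):
        curr = [i] + [0] * len(seq)
        for j, s in enumerate(seq, start=1):
            curr[j] = min(
                prev[j - 1] + (m != s),  # match or substitution
                prev[j] + 1,             # motif character left unmatched
                curr[j - 1] + 1,         # extra character in the sequence
            )
        prev = curr
    return min(prev)  # a match may end at any position


def candidate_motifs(sample: list[str], length: int) -> set[str]:
    """Step 1: collect every substring of the given length from the sample."""
    return {
        seq[i:i + length]
        for seq in sample
        for i in range(len(seq) - length + 1)
    }


def discover_motifs(
    sequences: list[str],
    sample_size: int,
    length: int,
    max_mutations: int,
    min_occurrence: int,
) -> list[str]:
    """Step 2: keep candidates that approximately occur in enough sequences."""
    sample = sequences[:sample_size]  # assumption: first sequences form the sample
    motifs = []
    for motif in sorted(candidate_motifs(sample, length)):
        occurrence = sum(
            min_substring_edit_distance(motif, seq) <= max_mutations
            for seq in sequences
        )
        if occurrence >= min_occurrence:
            motifs.append(motif)
    return motifs
```

For example, calling `discover_motifs(["MKVLAAGIT", "TTKVLAAGQ", "KVLSAGPP", "GGKVLAAGW"], sample_size=2, length=6, max_mutations=1, min_occurrence=4)` keeps only candidates that appear in all four toy sequences with at most one edit. Checking every candidate against every sequence is the expensive part, which is the cost the abstract's heuristics are meant to reduce.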

Original language: English (US)
Pages (from-to): 2769-2775
Number of pages: 7
Journal: Nucleic Acids Research
Volume: 22
Issue number: 14
State: Published - Jul 25 1994

Fingerprint

Protein Sequence
Classifiers
Classifier
Proteins
Protein
Statistical Estimation
Heuristic Optimization
Pattern matching
Dermatoglyphics
Fingerprint
Small Sample
Confidence
Experimental Results
Demonstrate

ASJC Scopus subject areas

  • Genetics
  • Statistics, Probability and Uncertainty
  • Applied Mathematics
  • Health, Toxicology and Mutagenesis
  • Toxicology
  • Genetics (clinical)

Cite this

Wang, J. T. L., Marr, T. G., Shasha, D., Shapiro, B. A., & Chirn, G. W. (1994). Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Research, 22(14), 2769-2775.

@article{0a2747e037fd410b91dfa046d978b1f9,
title = "Discovering active motifs in sets of related protein sequences and using them for classification",
abstract = "We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two-step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with the visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high-confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).",
author = "Wang, {Jason T L} and Marr, {Thomas G.} and Shasha, Dennis and Shapiro, {Bruce A.} and Chirn, {Gung Wei}",
year = "1994",
month = "7",
day = "25",
language = "English (US)",
volume = "22",
pages = "2769--2775",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "14",

}

TY - JOUR

T1 - Discovering active motifs in sets of related protein sequences and using them for classification

AU - Wang, Jason T L

AU - Marr, Thomas G.

AU - Shasha, Dennis

AU - Shapiro, Bruce A.

AU - Chirn, Gung Wei

PY - 1994/7/25

Y1 - 1994/7/25

N2 - We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two-step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with the visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high-confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).

UR - http://www.scopus.com/inward/record.url?scp=0027941109&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0027941109&partnerID=8YFLogxK

M3 - Article

C2 - 8052532

AN - SCOPUS:0027941109

VL - 22

SP - 2769

EP - 2775

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 14

ER -