DNA sequence classification via an expectation maximization algorithm and neural networks: A case study

Qicheng Ma, Jason T L Wang, Dennis Shasha, Cathy H. Wu

Research output: Contribution to journal, Article

Abstract

This paper presents new techniques for biosequence classification, with a focus on recognizing E. Coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether or not S is an E. Coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. Coli promoter sequence. The EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 binding site and the -10 binding site as well as the spacer between the -10 binding site and the transcriptional start site, our algorithm deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. Coli promoter sequence according to their information contents and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.
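
As a rough illustration of the feature-selection and encoding steps the abstract describes, the sketch below scores each column of a set of aligned binding sites by its Shannon information content and then applies an orthogonal (one-hot) encoding to the highest-scoring positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the uniform background distribution, the choice of three positions, and the example hexamers are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def orthogonal_encode(seq):
    """Orthogonal (one-hot) encoding: each base becomes a 4-element unit vector."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec


def information_content(aligned, position):
    """Information content in bits of one alignment column.

    Assumes a uniform background over A, C, G, T (an illustrative
    simplification, not a detail taken from the paper).
    """
    counts = Counter(s[position].upper() for s in aligned)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(len(ALPHABET)) - entropy


# Hypothetical aligned -10 hexamers, for illustration only (not data from the paper).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]

# Score every column, then keep the three most informative positions.
ic = [information_content(sites, i) for i in range(len(sites[0]))]
top = sorted(range(len(ic)), key=lambda i: ic[i], reverse=True)[:3]

# Encode the selected positions of the first site as a neural network input vector.
features = orthogonal_encode("".join(sites[0][i] for i in sorted(top)))
```

Fully conserved columns score the maximum of 2 bits, so they are selected first; the resulting vector has four entries per selected position and could serve as input to a classifier.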

Original language: English (US)
Pages (from-to): 468-475
Number of pages: 8
Journal: IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews
Volume: 31
Issue number: 4
DOI: 10.1109/5326.983930
State: Published - Nov 2001

Fingerprint

DNA sequences
Binding sites
Escherichia coli
Neural networks
Probability distributions
DNA

Keywords

  • Bayesian inference
  • Bioinformatics
  • Data mining
  • Expectation maximization (EM)
  • Neural networks (NNs)
  • Promoter recognition

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Artificial Intelligence
  • Human-Computer Interaction
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

DNA sequence classification via an expectation maximization algorithm and neural networks: A case study. / Ma, Qicheng; Wang, Jason T L; Shasha, Dennis; Wu, Cathy H.

In: IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, Vol. 31, No. 4, 11.2001, p. 468-475.

@article{1363d49e59774342a3ed030751c019aa,
title = "DNA sequence classification via an expectation maximization algorithm and neural networks: A case study",
abstract = "This paper presents new techniques for biosequence classification, with a focus on recognizing E. Coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether or not S is an E. Coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. Coli promoter sequence. The EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 binding site and the -10 binding site as well as the spacer between the -10 binding site and the transcriptional start site, our algorithm deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. Coli promoter sequence according to their information contents and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.",
keywords = "Bayesian inference, Bioinformatics, Data mining, Expectation maximization (EM), Neural networks (NNs), Promoter recognition",
author = "Qicheng Ma and Wang, {Jason T L} and Dennis Shasha and Wu, {Cathy H.}",
year = "2001",
month = "11",
doi = "10.1109/5326.983930",
language = "English (US)",
volume = "31",
pages = "468--475",
journal = "IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews",
issn = "1094-6977",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "4",

}

TY - JOUR

T1 - DNA sequence classification via an expectation maximization algorithm and neural networks

T2 - A case study

AU - Ma, Qicheng

AU - Wang, Jason T L

AU - Shasha, Dennis

AU - Wu, Cathy H.

PY - 2001/11

Y1 - 2001/11

N2 - This paper presents new techniques for biosequence classification, with a focus on recognizing E. Coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether or not S is an E. Coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. Coli promoter sequence. The EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 binding site and the -10 binding site as well as the spacer between the -10 binding site and the transcriptional start site, our algorithm deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. Coli promoter sequence according to their information contents and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.

KW - Bayesian inference

KW - Bioinformatics

KW - Data mining

KW - Expectation maximization (EM)

KW - Neural networks (NNs)

KW - Promoter recognition

UR - http://www.scopus.com/inward/record.url?scp=0035521109&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035521109&partnerID=8YFLogxK

U2 - 10.1109/5326.983930

DO - 10.1109/5326.983930

M3 - Article

AN - SCOPUS:0035521109

VL - 31

SP - 468

EP - 475

JO - IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews

JF - IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews

SN - 1094-6977

IS - 4

ER -