An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees

Jason T L Wang, Bruce A. Shapiro, Dennis Shasha, Kaizhong Zhang, Kathleen M. Currey

Research output: Contribution to journal › Article

Abstract

Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and there do not exist any other substructures V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).
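
The problem statement above can be made concrete with a small brute-force sketch. This is not the dynamic programming algorithm of the paper: it simply enumerates every connected substructure of two tiny trees and keeps the largest pair within distance d, so it takes exponential time and is meant only to illustrate the definition. The tree encoding (nested tuples), the unit edit costs, and all function names are assumptions made for this illustration.

```python
from functools import lru_cache
from itertools import product

# Assumed encoding: a tree is (label, children), children is a tuple of trees.

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:
        return sum(size(t) for t in f)   # delete everything in f
    if not f:
        return sum(size(t) for t in g)   # insert everything in g
    (lv, cv), (lw, cw) = f[-1], g[-1]    # rightmost roots and their children
    return min(
        forest_dist(f[:-1] + cv, g) + 1,  # delete the rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,  # insert the rightmost root of g
        # map the two rightmost roots onto each other (relabel if they differ)
        forest_dist(f[:-1], g[:-1]) + forest_dist(cv, cw) + (lv != lw),
    )

def tree_dist(t1, t2):
    return forest_dist((t1,), (t2,))

def rooted_substructures(tree):
    """All connected substructures that contain the root of `tree`."""
    label, children = tree
    # Each child is either cut away entirely (None) or kept as a substructure.
    options = [[None] + rooted_substructures(c) for c in children]
    return [(label, tuple(c for c in choice if c is not None))
            for choice in product(*options)]

def all_substructures(tree):
    """All connected substructures of `tree`, rooted at any node."""
    _, children = tree
    subs = rooted_substructures(tree)
    for c in children:
        subs.extend(all_substructures(c))
    return subs

def largest_common(t1, t2, d):
    """Brute force: best (total size, U1, U2) with tree_dist(U1, U2) <= d."""
    best = None
    for u1 in all_substructures(t1):
        for u2 in all_substructures(t2):
            if tree_dist(u1, u2) <= d:
                total = size(u1) + size(u2)
                if best is None or total > best[0]:
                    best = (total, u1, u2)
    return best

# Two small example trees: a(b, c(d)) and a(b, e(d)).
t1 = ("a", (("b", ()), ("c", (("d", ()),))))
t2 = ("a", (("b", ()), ("e", (("d", ()),))))
print(largest_common(t1, t2, 0)[0])  # total size with no edits allowed
print(largest_common(t1, t2, 1)[0])  # total size with one edit allowed
```

With d = 0 only identical substructures qualify, so the best pair here is a(b) in both trees. Allowing one edit lets the whole trees match, since relabeling c to e is a single operation.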

Original language: English (US)
Pages (from-to): 889-895
Number of pages: 7
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 20
Issue number: 8
DOIs: 10.1109/34.709622
State: Published - 1998

Fingerprint

Substructure
Labeled Trees
Ordered Trees
Molecular biology
Edit Distance
RNA
Dynamic programming
Pattern recognition
Labels
RNA Secondary Structure
Processing
Natural Language
Fast Algorithm
Subgraph
Integer
Computing
Vertex of a graph

Keywords

  • Computational biology
  • Dynamic programming
  • Pattern matching
  • Pattern recognition
  • Trees

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition

Cite this

An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees. / Wang, Jason T L; Shapiro, Bruce A.; Shasha, Dennis; Zhang, Kaizhong; Currey, Kathleen M.

In: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, 1998, p. 889-895.

Wang, Jason T L ; Shapiro, Bruce A. ; Shasha, Dennis ; Zhang, Kaizhong ; Currey, Kathleen M. / An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998 ; Vol. 20, No. 8. pp. 889-895.
@article{64735ac2329142a0bd2dffcc0b6e074e,
title = "An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees",
abstract = "Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and there do not exist any other substructures V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).",
keywords = "Computational biology, Dynamic programming, Pattern matching, Pattern recognition, Trees",
author = "Wang, {Jason T L} and Shapiro, {Bruce A.} and Dennis Shasha and Kaizhong Zhang and Currey, {Kathleen M.}",
year = "1998",
doi = "10.1109/34.709622",
language = "English (US)",
volume = "20",
pages = "889--895",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
issn = "0162-8828",
publisher = "IEEE Computer Society",
number = "8",

}

TY - JOUR

T1 - An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees

AU - Wang, Jason T L

AU - Shapiro, Bruce A.

AU - Shasha, Dennis

AU - Zhang, Kaizhong

AU - Currey, Kathleen M.

PY - 1998

Y1 - 1998

AB - Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and there do not exist any other substructures V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).

KW - Computational biology

KW - Dynamic programming

KW - Pattern matching

KW - Pattern recognition

KW - Trees

UR - http://www.scopus.com/inward/record.url?scp=0032136849&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032136849&partnerID=8YFLogxK

U2 - 10.1109/34.709622

DO - 10.1109/34.709622

M3 - Article

VL - 20

SP - 889

EP - 895

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

SN - 0162-8828

IS - 8

ER -