Simple and practical sequence nearest neighbors with block operations

Shanmugavelayutham Muthukrishnan, S. Cenk Ṣahinalp

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that, given any query sequence Q, one can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the “distance” between sequences S and Q, which can be defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks). One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangements (duplications, translocations and deletions) underlying genome evolution. Recent improvements towards the resolution of the human genome composition suggest that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard. The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log ℓ (log* ℓ)). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above. In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement and test the accuracy of the transformations provided in [] in terms of estimating the block edit distance under controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
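
    The problem statement in the abstract can be illustrated with a small exact baseline. The sketch below is not the method of the paper: it scans the whole database and uses the classical character-level edit distance (insertions, deletions, replacements only), with no block operations and no embedding. The function names `edit_distance`, `hamming_distance` and `nearest_neighbor` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, replace).

    Standard dynamic program, O(len(a) * len(b)) time. Block edits
    (copy, uncopy, relocate) are NOT modeled here.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb, or match
            ))
        prev = curr
    return prev[-1]


def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(cx != cy for cx, cy in zip(x, y))


def nearest_neighbor(
    database: Sequence[str],
    query: str,
    dist: Callable[[str, str], int] = edit_distance,
) -> str:
    """Return S in the database with dist(S, query) <= dist(T, query) for all T.

    Exact linear scan: one distance computation per database sequence.
    """
    return min(database, key=lambda s: dist(s, query))


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTTTTTT", "ACGTTCGT"]
    print(nearest_neighbor(db, "ACGTTCGA"))
    # A relocated block costs several character edits, although a single
    # block move would transform one string into the other.
    print(edit_distance("ABCDEFGH", "EFGHABCD"))
```

    The last call hints at why single-character measures fit segmental rearrangements poorly: swapping two halves of a string is one block relocation but many character edits. Passing `dist=hamming_distance` turns the same scan into an exact nearest neighbor search in Hamming space, the setting that the embedding approach reduces to.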

    Original language: English (US)
    Title of host publication: Combinatorial Pattern Matching - 13th Annual Symposium, CPM 2002, Proceedings
    Editors: Alberto Apostolico, Masayuki Takeda
    Publisher: Springer-Verlag
    Pages: 262-278
    Number of pages: 17
    ISBN (Electronic): 9783540438625
    State: Published - Jan 1 2002
    Event: 13th Annual Symposium on Combinatorial Pattern Matching, CPM 2002 - Fukuoka, Japan
    Duration: Jul 3 2002 - Jul 5 2002

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 2373
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 13th Annual Symposium on Combinatorial Pattern Matching, CPM 2002
    Country: Japan
    City: Fukuoka
    Period: 7/3/02 - 7/5/02

    Fingerprint

    Data structures
    Hamming distance
    Nearest Neighbor
    Genes
    Edit Distance
    Copying
    Polynomials
    Nearest Neighbor Search
    Sequence Comparison
    Chemical analysis
    Rearrangement
    Deletion
    Genomics
    Genome
    Query
    Translocation
    Duplication

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science (all)

    Cite this

    Muthukrishnan, S., & Cenk Ṣahinalp, S. (2002). Simple and practical sequence nearest neighbors with block operations. In A. Apostolico, & M. Takeda (Eds.), Combinatorial Pattern Matching - 13th Annual Symposium, CPM 2002, Proceedings (pp. 262-278). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2373). Springer-Verlag.

    @inproceedings{89515eced1454420b946d3ae1fdb6019,
    title = "Simple and practical sequence nearest neighbors with block operations",
    author = "Shanmugavelayutham Muthukrishnan and {Cenk Ṣahinalp}, S.",
    year = "2002",
    month = "1",
    day = "1",
    language = "English (US)",
    series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    publisher = "Springer-Verlag",
    pages = "262--278",
    editor = "Alberto Apostolico and Masayuki Takeda",
    booktitle = "Combinatorial Pattern Matching - 13th Annual Symposium, CPM 2002, Proceedings",

    }

    TY - GEN

    T1 - Simple and practical sequence nearest neighbors with block operations

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Cenk Ṣahinalp, S.

    PY - 2002/1/1

    Y1 - 2002/1/1

    UR - http://www.scopus.com/inward/record.url?scp=84937440363&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84937440363&partnerID=8YFLogxK

    M3 - Conference contribution

    T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    SP - 262

    EP - 278

    BT - Combinatorial Pattern Matching - 13th Annual Symposium, CPM 2002, Proceedings

    A2 - Apostolico, Alberto

    A2 - Takeda, Masayuki

    PB - Springer-Verlag

    ER -