Comparing sequences with segment rearrangements

Funda Ergun, Shanmugavelayutham Muthukrishnan, S. Cenk Sahinalp

    Research output: Contribution to journal (Article)

    Abstract

    Computational genomics involves comparing sequences based on "similarity" to detect evolutionary and functional relationships. Until very recently, available portions of the human genome sequence (and those of other species) were fairly short and sparse. Most sequencing effort focused on genes and other short units; similarity between such sequences was measured by character-level differences. However, with the advent of whole-genome sequencing technology, there is an emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome. In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result is a simple O(1)-factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor Ω(log n) off from the optimal. Our algorithm works in linear time and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to each other, tight up to constant factors.
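The contrast the abstract draws between character-level differences and segment rearrangements can be illustrated with a small sketch. The code below is not the paper's algorithm and makes no claim to its linear-time or O(1)-approximation guarantees. It assumes two illustrative helpers, `levenshtein` (the standard unit-cost character edit distance) and `greedy_cover` (a hypothetical rearrangement-aware count: how many pieces, each copied from anywhere in `s`, are needed to spell `t`), and shows how differently the two score a single block swap.

```python
def levenshtein(s: str, t: str) -> int:
    """Character-level edit distance (insert, delete, substitute; unit cost)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


def greedy_cover(s: str, t: str) -> int:
    """Number of pieces in a greedy longest-match parse of t by substrings of s.

    Illustrative assumption only: a simple, slow (not linear-time) stand-in
    for a similarity measure that tolerates moved or copied segments.
    """
    pieces, i = 0, 0
    while i < len(t):
        length = 0
        # Extend the match while t[i : i + length + 1] still occurs in s.
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        i += max(length, 1)  # a character absent from s counts as its own piece
        pieces += 1
    return pieces


if __name__ == "__main__":
    s, t = "ABCDEFGH", "EFGHABCD"   # t is s with its two halves swapped
    print(levenshtein(s, t))        # character-level cost of the swap
    print(greedy_cover(s, t))       # pieces needed when segments may move
```

For the swapped halves above, the character-level distance is 8 (as large as rewriting the whole string), while the cover uses only 2 pieces, which is the kind of gap that motivates rearrangement-aware measures.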

    Original language: English (US)
    Pages (from-to): 183-194
    Number of pages: 12
    Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 2914
    State: Published - Dec 1 2003

    Fingerprint

    Rearrangement
    Genes
    Genome
    Sequencing
    Strings
    Approximation algorithms
    Edit Distance
    Functional Relationship
    Computing
    Computational complexity
    Similarity Measure
    Best Approximation
    Grammar
    Genomics
    Linear Time
    NP-complete problem
    Gene
    Unit
    Similarity

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science (all)

    Cite this

    Comparing sequences with segment rearrangements. / Ergun, Funda; Muthukrishnan, Shanmugavelayutham; Sahinalp, S. Cenk.

    In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 2914, 01.12.2003, p. 183-194.

    @article{2569bebf45d445799fc656daa5333e4b,
    title = "Comparing sequences with segment rearrangements",
    author = "Funda Ergun and Shanmugavelayutham Muthukrishnan and Sahinalp, {S. Cenk}",
    year = "2003",
    month = "12",
    day = "1",
    language = "English (US)",
    volume = "2914",
    pages = "183--194",
    journal = "Lecture Notes in Computer Science",
    issn = "0302-9743",
    publisher = "Springer Verlag",

    }

    TY - JOUR

    T1 - Comparing sequences with segment rearrangements

    AU - Ergun, Funda

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Sahinalp, S. Cenk

    PY - 2003/12/1

    Y1 - 2003/12/1

    UR - http://www.scopus.com/inward/record.url?scp=21144458757&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=21144458757&partnerID=8YFLogxK

    M3 - Article

    VL - 2914

    SP - 183

    EP - 194

    JO - Lecture Notes in Computer Science

    JF - Lecture Notes in Computer Science

    SN - 0302-9743

    ER -