Substring compression problems

Graham Cormode, Shanmugavelayutham Muthukrishnan

    Research output: Contribution to conference (Paper)

    Abstract

    We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length i substring of S whose compression is the least (Least Compressible Substring or LCS problem). Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis. We present the first known, nearly optimal algorithms for substring compression problems - SCQ, LCS and their generalizations - that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
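
To make the two problem statements in the abstract concrete, the following is a minimal brute-force sketch, not the suffix-tree or approximate algorithms the paper presents. It assumes a simple greedy Lempel-Ziv-style parse in which each phrase must already occur in the parsed prefix, and it uses the phrase count as a stand-in for compressed size. It also reads "least compressible" as "the window with the most phrases". The function names `lz_phrases`, `substring_compression_query` and `least_compressible_substring` are hypothetical and chosen only for illustration.

```python
def lz_phrases(text: str) -> list[str]:
    """Greedy LZ-style parse (an assumed, simplified variant).

    Each phrase is the longest prefix of the unparsed remainder that
    already occurs inside the parsed prefix, or a single new character
    when no such occurrence exists.
    """
    phrases = []
    pos = 0
    n = len(text)
    while pos < n:
        length = 0
        # Extend the phrase while it still occurs inside text[:pos].
        while pos + length < n and text[pos:pos + length + 1] in text[:pos]:
            length += 1
        length = max(length, 1)  # unseen character: emit it as a literal
        phrases.append(text[pos:pos + length])
        pos += length
    return phrases


def substring_compression_query(s: str, i: int, j: int) -> int:
    """SCQ by brute force: number of phrases in the parse of s[i:j]."""
    return len(lz_phrases(s[i:j]))


def least_compressible_substring(s: str, length: int) -> tuple[int, int]:
    """LCS by brute force: scan every window of the given length.

    Returns (start, phrase_count) for the first window with the most
    phrases, i.e. the window this simplified parse compresses worst.
    """
    if not 0 < length <= len(s):
        raise ValueError("length must be between 1 and len(s)")
    best_start, best_size = 0, -1
    for start in range(len(s) - length + 1):
        size = substring_compression_query(s, start, start + length)
        if size > best_size:
            best_start, best_size = start, size
    return best_start, best_size


if __name__ == "__main__":
    s = "abababcdef"
    print(lz_phrases(s[0:6]))                  # parse of the repetitive prefix
    print(substring_compression_query(s, 0, 6))
    print(least_compressible_substring(s, 6))
```

This baseline re-parses every query from scratch and does no preprocessing of S, so each query takes time polynomial in the substring length. It is meant only to pin down what an SCQ or LCS answer looks like, which is the cost the preprocessing-based algorithms described in the abstract aim to avoid.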

    Original language: English (US)
    Pages: 321-330
    Number of pages: 10
    State: Published - Jul 1 2005
    Event: Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms - Vancouver, BC, United States
    Duration: Jan 23 2005 to Jan 25 2005

    Other

    Other: Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms
    Country: United States
    City: Vancouver, BC
    Period: 1/23/05 to 1/25/05

    Fingerprint

    Compressibility
    Compression
    Strings
    Entire
    Trees (mathematics)
    Query
    Suffix Tree
    String Matching
    Approximate Algorithm
    Information Content
    Sequence Analysis
    Parsing
    Matching Problem
    Exact Algorithms
    Optimal Algorithm

    ASJC Scopus subject areas

    • Software
    • Mathematics(all)

    Cite this

    Cormode, G., & Muthukrishnan, S. (2005). Substring compression problems. 321-330. Paper presented at Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, BC, United States.

    @conference{ebb8f46aa0ad4452864f9fd4770ada76,
    title = "Substring compression problems",
    abstract = "We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length i substring of S whose compression is the least (Least Compressible Substring or LCS problem). Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis. We present the first known, nearly optimal algorithms for substring compression problems - SCQ, LCS and their generalizations - that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.",
    author = "Graham Cormode and Shanmugavelayutham Muthukrishnan",
    year = "2005",
    month = "7",
    day = "1",
    language = "English (US)",
    pages = "321--330",
    note = "Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms ; Conference date: 23-01-2005 Through 25-01-2005",

    }

    TY - CONF

    T1 - Substring compression problems

    AU - Cormode, Graham

    AU - Muthukrishnan, Shanmugavelayutham

    PY - 2005/7/1

    Y1 - 2005/7/1

    N2 - We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length i substring of S whose compression is the least (Least Compressible Substring or LCS problem). Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis. We present the first known, nearly optimal algorithms for substring compression problems - SCQ, LCS and their generalizations - that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.

    UR - http://www.scopus.com/inward/record.url?scp=20744439529&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=20744439529&partnerID=8YFLogxK

    M3 - Paper

    AN - SCOPUS:20744439529

    SP - 321

    EP - 330

    ER -