Efficient algorithms for document retrieval problems

Shanmugavelayutham Muthukrishnan

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    We are given a collection D of text documents d_1, ..., d_k, with Σ_i |d_i| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string p of length m, and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing, following which the occurrence listing problem can be solved in time O(m + output), where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involves performing "local" encodings whereby they are reduced to range query problems on geometric objects (points and lines) that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
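
    The document listing problem and its reduction to a colored range query can be illustrated with a small sketch. The code below is not taken from the paper: it assumes a naively built suffix array over the concatenated documents (so preprocessing here is far from O(n)), a "previous occurrence of the same document" array, and a sparse-table range-minimum structure. All names (build_index, sa_range, list_documents, the "\x00" separator) are hypothetical, and the separator is assumed not to occur in any document or pattern. After the pattern's suffix range is located, each matching document is reported exactly once, so the reporting step does work proportional to the number of documents rather than the number of occurrences.

```python
def build_rmq(a):
    # Sparse table: table[k][i] is the index of the minimum of a[i : i + 2**k].
    table = [list(range(len(a)))]
    k = 1
    while (1 << k) <= len(a):
        below, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(len(a) - (1 << k) + 1):
            x, y = below[i], below[i + half]
            row.append(x if a[x] <= a[y] else y)
        table.append(row)
        k += 1
    return table


def argmin(a, table, lo, hi):
    # Index of the minimum of a[lo..hi] (inclusive) in O(1).
    k = (hi - lo + 1).bit_length() - 1
    x, y = table[k][lo], table[k][hi - (1 << k) + 1]
    return x if a[x] <= a[y] else y


def build_index(docs, sep="\x00"):
    # Assumption: sep occurs in no document and no pattern.
    text = sep.join(docs) + sep
    doc_of = []
    for i, d in enumerate(docs):
        doc_of.extend([i] * (len(d) + 1))
    # Naive suffix array, for illustration only (not linear time).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc = [doc_of[i] for i in sa]          # document ("color") of each suffix
    prev, last = [], {}
    for rank, d in enumerate(doc):
        prev.append(last.get(d, -1))       # previous rank with the same document
        last[d] = rank
    return text, sa, doc, prev, build_rmq(prev)


def sa_range(text, sa, p):
    # Half-open range of suffix-array ranks whose suffixes start with p.
    m, lo, hi = len(p), 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def list_documents(index, p):
    # Report each document containing the non-empty pattern p exactly once.
    text, sa, doc, prev, table = index
    left, right = sa_range(text, sa, p)
    found, stack = [], [(left, right - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        i = argmin(prev, table, lo, hi)
        if prev[i] >= left:
            continue  # every document in this subrange is reported elsewhere
        found.append(doc[i])
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return sorted(found)


docs = ["banana", "bandana", "cabana"]
index = build_index(docs)
print(list_documents(index, "ana"))   # documents 0, 1 and 2 all contain "ana"
print(list_documents(index, "band"))  # only document 1 contains "band"
```

    A rank i in the pattern's range [left, right) is the first occurrence of its document within that range exactly when prev[i] < left, which is why repeatedly taking the range minimum of prev and recursing on both sides finds every distinct document and stops as soon as a subrange holds only repeats.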

    Original language: English (US)
    Title of host publication: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002
    Publisher: Association for Computing Machinery
    Pages: 657-666
    Number of pages: 10
    ISBN (Electronic): 089871513X
    State: Published - Jan 1 2002
    Event: 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 - San Francisco, United States
    Duration: Jan 6 2002 to Jan 8 2002

    Publication series

    Name: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
    Volume: 06-08-January-2002

    Fingerprint

    Document Retrieval
    Efficient Algorithms
    Pattern matching
    Optimal Algorithm
    Range Query
    Matching Problem
    Information retrieval systems
    Output
    Query processing
    Strings
    Structural properties
    Geometric object
    Computational Biology
    Systems Biology
    Color
    Information Retrieval
    Preprocessing

    ASJC Scopus subject areas

    • Software
    • Mathematics(all)

    Cite this

    Muthukrishnan, S. (2002). Efficient algorithms for document retrieval problems. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 (pp. 657-666). (Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms; Vol. 06-08-January-2002). Association for Computing Machinery.

    @inproceedings{8dd3cd253d45493cbc7da929bae93916,
    title = "Efficient algorithms for document retrieval problems",
    author = "Shanmugavelayutham Muthukrishnan",
    year = "2002",
    month = "1",
    day = "1",
    language = "English (US)",
    series = "Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms",
    publisher = "Association for Computing Machinery",
    pages = "657--666",
    booktitle = "Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002",

    }

    TY - GEN

    T1 - Efficient algorithms for document retrieval problems

    AU - Muthukrishnan, Shanmugavelayutham

    PY - 2002/1/1

    Y1 - 2002/1/1

    UR - http://www.scopus.com/inward/record.url?scp=33744962566&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=33744962566&partnerID=8YFLogxK

    M3 - Conference contribution

    AN - SCOPUS:33744962566

    T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

    SP - 657

    EP - 666

    BT - Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002

    PB - Association for Computing Machinery

    ER -