### Abstract

We are given a collection $D$ of text documents $d_1, \ldots, d_k$, with $\sum_{i=1}^{k} |d_i| = n$, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string $p$ of length $m$, and our goal is to return the set of all documents that contain one or more copies of $p$. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern $p$ occurs. In 1973, Weiner [24] presented an algorithm with $O(n)$ time and space preprocessing, following which the occurrence listing problem can be solved in time $O(m + \mathit{output})$, where $\mathit{output}$ is the number of positions where $p$ occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involves performing "local" encodings whereby they are reduced to range query problems on colored geometric objects (points and lines). We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
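To make the difference between occurrence listing and document listing concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a naively built sorted suffix list in place of Weiner's suffix tree, and it reports each matching document once by treating document ids as colors and repeatedly taking a range minimum over a "previous suffix from the same document" array. The function names (`build_index`, `match_range`, `list_occurrences`, `list_documents`) are hypothetical, and the range minimum is a plain linear scan, so this sketch does not achieve the optimal bounds; a constant-time range-minimum structure would be needed for that.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Return (suffixes, prev) for a list of strings.

    suffixes: all (doc_id, offset) pairs sorted by the suffix they start.
    prev[i]:  index of the previous suffix in sorted order that comes from
              the same document as suffixes[i], or -1 if there is none.
    Naive construction for illustration only (not linear time).
    """
    suffixes = sorted(
        ((d, o) for d, text in enumerate(docs) for o in range(len(text))),
        key=lambda s: docs[s[0]][s[1]:],
    )
    last_seen, prev = {}, []
    for i, (d, _) in enumerate(suffixes):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return suffixes, prev


def match_range(docs, suffixes, p):
    """Half-open range [lo, hi) of sorted suffixes that start with p."""
    keys = [docs[d][o:o + len(p)] for d, o in suffixes]  # truncated prefixes stay sorted
    return bisect_left(keys, p), bisect_right(keys, p)


def list_occurrences(docs, index, p):
    """Occurrence listing: every (doc_id, offset) where p occurs."""
    suffixes, _ = index
    lo, hi = match_range(docs, suffixes, p)
    return sorted(suffixes[lo:hi])


def list_documents(docs, index, p):
    """Document listing: each document containing p, reported exactly once."""
    suffixes, prev = index
    lo, hi = match_range(docs, suffixes, p)
    found, stack = [], [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a >= b:
            continue
        # Linear-scan range minimum; stands in for an O(1) RMQ structure.
        i = min(range(a, b), key=prev.__getitem__)
        if prev[i] >= lo:
            continue  # every suffix here has an earlier same-document match in range
        found.append(suffixes[i][0])  # first match of this document in [lo, hi)
        stack.append((a, i))
        stack.append((i + 1, b))
    return sorted(found)


docs = ["banana", "bandana", "cabana", "apple"]
index = build_index(docs)
print(list_occurrences(docs, index, "ana"))  # [(0, 1), (0, 3), (1, 4), (2, 3)]
print(list_documents(docs, index, "ana"))    # [0, 1, 2]
```

In this example the pattern occurs four times but in only three documents; the second list is what a document listing query should return, without paying for the repeated occurrence inside `"banana"`.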

Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 |
Publisher | Association for Computing Machinery |
Pages | 657-666 |
Number of pages | 10 |
ISBN (Electronic) | 089871513X |
State | Published - Jan 1 2002 |
Event | 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 - San Francisco, United States; Duration: Jan 6 2002 → Jan 8 2002 |

### Publication series

Name | Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms |
---|---|
Volume | 06-08-January-2002 |

### Other

Other | 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 |
---|---|
Country | United States |
City | San Francisco |
Period | 1/6/02 → 1/8/02 |

### ASJC Scopus subject areas

- Software
- Mathematics (all)

### Cite this

*Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002* (pp. 657-666). (Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms; Vol. 06-08-January-2002). Association for Computing Machinery.

**Efficient algorithms for document retrieval problems.** / Muthukrishnan, Shanmugavelayutham.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002.* Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, vol. 06-08-January-2002, Association for Computing Machinery, pp. 657-666, 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002, San Francisco, United States, 1/6/02.

TY - GEN

T1 - Efficient algorithms for document retrieval problems

AU - Muthukrishnan, Shanmugavelayutham

PY - 2002/1/1

Y1 - 2002/1/1

UR - http://www.scopus.com/inward/record.url?scp=33744962566&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33744962566&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33744962566

T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

SP - 657

EP - 666

BT - Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002

PB - Association for Computing Machinery

ER -