Dictionary matching and indexing with errors and don't cares

Richard Cole, Lee Ad Gottlieb, Moshe Lewenstein

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with $n = |t|$ and $m = |p|$, the query time for $k$ substitutions is $O(m + (c_1 \log n)^k / k! + \#\text{matches})$, with a data structure of size $O(n (c_2 \log n)^k / k!)$ and a preprocessing time of $O(n (c_2 \log n)^k / k!)$, where $c_1, c_2 > 1$ are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.
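
To make the three problem statements concrete, here is a minimal brute-force sketch in Python. It is not the data structure from the paper: it does no preprocessing and simply scans, so an indexing query costs on the order of n·m character comparisons rather than the bounds quoted in the abstract. The function names and the choice of `?` as the don't-care symbol are assumptions made for illustration only.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed don't-care symbol; the abstract does not fix one


def mismatches(a: str, b: str) -> int:
    """Count positions where the equal-length strings a and b differ.

    A position holding WILDCARD in either string never counts as a mismatch.
    """
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y and WILDCARD not in (x, y))


def index_query(t: str, p: str, k: int) -> List[int]:
    """Indexing: start positions in t where p matches a substring with <= k mismatches."""
    m = len(p)
    return [i for i in range(len(t) - m + 1) if mismatches(t[i:i + m], p) <= k]


def dictionary_query(words: List[str], p: str, k: int) -> List[str]:
    """Dictionary query: dictionary strings that match all of p with <= k mismatches."""
    return [w for w in words if len(w) == len(p) and mismatches(w, p) <= k]


def dictionary_matching(words: List[str], p: str, k: int) -> List[Tuple[int, str]]:
    """Dictionary matching: (position, word) pairs where a substring of p matches word."""
    hits = []
    for i in range(len(p)):
        for w in words:
            if i + len(w) <= len(p) and mismatches(p[i:i + len(w)], w) <= k:
                hits.append((i, w))
    return hits
```

For example, `index_query("abracadabra", "abra", 0)` scans every length-4 window of the text, and `index_query("abracadabra", "a?a", 0)` treats the middle pattern character as a don't care. The point of the paper is to answer such queries without the full scan, at the cost of a larger preprocessed structure.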

Original language: English (US)
Title of host publication: Conference Proceedings of the Annual ACM Symposium on Theory of Computing
Pages: 91-100
Number of pages: 10
State: Published - 2004
Event: Proceedings of the 36th Annual ACM Symposium on Theory of Computing - Chicago, IL, United States
Duration: Jun 13 2004 to Jun 15 2004

Keywords

  • Approximate pattern matching
  • Dictionary matching
  • Dictionary query
  • Suffix trees
  • Text indexing
  • Wild-cards

ASJC Scopus subject areas

  • Software

Cite this

Cole, R., Gottlieb, L. A., & Lewenstein, M. (2004). Dictionary matching and indexing with errors and don't cares. In Conference Proceedings of the Annual ACM Symposium on Theory of Computing (pp. 91-100).

@inproceedings{474583fe652244e989c83b062b906b24,
title = "Dictionary matching and indexing with errors and don't cares",
keywords = "Approximate pattern matching, Dictionary matching, Dictionary query, Suffix trees, Text indexing, Wild-cards",
author = "Richard Cole and Gottlieb, {Lee Ad} and Moshe Lewenstein",
year = "2004",
language = "English (US)",
pages = "91--100",
booktitle = "Conference Proceedings of the Annual ACM Symposium on Theory of Computing",

}

TY - GEN

T1 - Dictionary matching and indexing with errors and don't cares

AU - Cole, Richard

AU - Gottlieb, Lee Ad

AU - Lewenstein, Moshe

PY - 2004

Y1 - 2004

KW - Approximate pattern matching

KW - Dictionary matching

KW - Dictionary query

KW - Suffix trees

KW - Text indexing

KW - Wild-cards

UR - http://www.scopus.com/inward/record.url?scp=4544388794&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4544388794&partnerID=8YFLogxK

M3 - Conference contribution

SP - 91

EP - 100

BT - Conference Proceedings of the Annual ACM Symposium on Theory of Computing

ER -