Suffix Trays and Suffix Trists

Structures for Faster Text Indexing

Richard Cole, Tsvi Kopelowitz, Moshe Lewenstein

Research output: Contribution to journal (Article)

Abstract

Suffix trees and suffix arrays are two of the most widely used data structures for text indexing. Each uses linear space and can be constructed in linear time for polynomially sized alphabets. However, when it comes to answering queries with worst-case deterministic time bounds, the former does so in O(m log |Σ|) time, where m is the query size and |Σ| is the alphabet size, and the latter does so in O(m + log n) time, where n is the text size. If one wants to output all appearances of the query, an additive cost of O(occ) time is sufficient, where occ is the size of the output. Note that it is possible to obtain a worst-case deterministic query time of O(m), but at the cost of super-linear construction time or space usage.
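To make the suffix array query concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the construction or search the paper relies on: the array is built by naively sorting suffixes (far from linear time), and the search is a plain binary search costing O(m log n) rather than the O(m + log n) bound quoted above, which needs additional longest-common-prefix information. The function names build_suffix_array and find_occurrences are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting the suffixes themselves. This is for
    illustration only and is NOT a linear-time construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order.

    Two plain binary searches locate the block of suffixes that start with
    pattern. Each comparison looks at up to m characters, so the search
    costs O(m log n), plus the cost of reporting the occ matches.
    """
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] are exactly those beginning with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "ana"))  # prints [1, 3]
```

All suffixes that begin with the query occupy one contiguous block of the array, which is why reporting every appearance adds only O(occ) once the block boundaries are known.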

We propose a novel way of combining the two into what we call a suffix tray. The space and construction time remain linear, and the query time improves to O(m + log |Σ|) for integer alphabets from a linear range, i.e. Σ ⊂ {1,…,cn} for an arbitrary constant c. The construction and query are deterministic. Here, too, an additive O(occ) time is sufficient if one desires to output all appearances of the query.
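As a rough feel for how the three query bounds compare, the following Python sketch evaluates the dominant terms for one choice of parameters. It is only an arithmetic illustration: the constants hidden by the O-notation are ignored, base-2 logarithms are an assumption, the parameter values are made up for the example, and the function name query_cost_terms is hypothetical.

```python
import math


def query_cost_terms(m, n, sigma):
    """Evaluate the dominant term of each query bound, ignoring constants.

    m is the query length, n the text length, sigma the alphabet size.
    Logarithms are taken base 2 (an assumption; the bounds do not fix a base).
    """
    return {
        "suffix tree": m * math.log2(sigma),   # O(m log |Sigma|)
        "suffix array": m + math.log2(n),      # O(m + log n)
        "suffix tray": m + math.log2(sigma),   # O(m + log |Sigma|)
    }


if __name__ == "__main__":
    # Illustrative values: a 100-character query, a text of 2**30
    # characters, and a byte-sized alphabet.
    terms = query_cost_terms(m=100, n=2**30, sigma=256)
    for name, value in terms.items():
        print(f"{name}: {value:.0f}")
```

With these values the terms come to 800, 130 and 108 respectively. The suffix tray term never exceeds the suffix tree term once |Σ| ≥ 2, and it is no larger than the suffix array term whenever |Σ| ≤ n.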

We also consider the online version of indexing, where the text arrives online, one character at a time, and indexing queries are answered in tandem. In this variant we create a cross between a suffix tree and a suffix list (a dynamic variant of the suffix array), called a suffix trist; it supports queries in O(m + log |Σ|) time. The suffix trist also uses linear space. Furthermore, if there exists an online construction for a linear-space suffix tree such that the cost of adding a character is worst-case deterministic f(n,|Σ|), where n is the size of the current text, then one can update the suffix trist in O(f(n,|Σ|) + log |Σ|) time. The best currently known worst-case deterministic bound for f(n,|Σ|) is O(log n) time.

Original language: English (US)
Pages (from-to): 450-466
Number of pages: 17
Journal: Algorithmica (New York)
Volume: 72
Issue number: 2
DOI: 10.1007/s00453-013-9860-6
State: Published - Jun 1 2015

Fingerprint

Text Indexing
Suffix
Query
Suffix Tree
Linear Space
Suffix Array
Data structures
Costs
Indexing
Linear Time
Output
Sufficient

Keywords

  • Data structures
  • Indexing
  • Pattern matching

ASJC Scopus subject areas

  • Computer Science(all)
  • Computer Science Applications
  • Applied Mathematics

Cite this

Suffix Trays and Suffix Trists: Structures for Faster Text Indexing. / Cole, Richard; Kopelowitz, Tsvi; Lewenstein, Moshe.

In: Algorithmica (New York), Vol. 72, No. 2, 01.06.2015, p. 450-466.

Cole, Richard; Kopelowitz, Tsvi; Lewenstein, Moshe. / Suffix Trays and Suffix Trists: Structures for Faster Text Indexing. In: Algorithmica (New York). 2015; Vol. 72, No. 2. pp. 450-466.
@article{4d95d099460345899667bcfeea70fa5b,
title = "Suffix Trays and Suffix Trists: Structures for Faster Text Indexing",
abstract = "Suffix trees and suffix arrays are two of the most widely used data structures for text indexing. Each uses linear space and can be constructed in linear time for polynomially sized alphabets. However, when it comes to answering queries with worst-case deterministic time bounds, the prior does so in O(mlog|Σ|) time, where m is the query size, |Σ| is the alphabet size, and the latter does so in O(m+logn) time, where n is the text size. If one wants to output all appearances of the query, an additive cost of O(occ) time is sufficient, where occ is the size of the output. Notice that it is possible to obtain a worst case, deterministic query time of O(m) but at the cost of super-linear construction time or space usage. We propose a novel way of combining the two into, what we call, a suffix tray. The space and construction time remain linear and the query time improves to O(m+log|Σ|) for integer alphabets from a linear range, i.e. Σ⊂{1,…,cn}, for an arbitrary constant c. The construction and query are deterministic. Here also an additive O(occ) time is sufficient if one desires to output all appearances of the query. We also consider the online version of indexing, where the text arrives online, one character at a time, and indexing queries are answered in tandem. In this variant we create a cross between a suffix tree and a suffix list (a dynamic variant of suffix array) to be called a suffix trist; it supports queries in O(m+log|Σ|) time. The suffix trist also uses linear space. Furthermore, if there exists an online construction for a linear-space suffix tree such that the cost of adding a character is worst-case deterministic f(n,|Σ|) (n is the size of the current text), then one can further update the suffix trist in O(f(n,|Σ|)+log|Σ|) time. The best currently known worst-case deterministic bound for f(n,|Σ|) is O(logn) time.",
keywords = "Data structures, Indexing, Pattern matching",
author = "Richard Cole and Tsvi Kopelowitz and Moshe Lewenstein",
year = "2015",
month = "6",
day = "1",
doi = "10.1007/s00453-013-9860-6",
language = "English (US)",
volume = "72",
pages = "450--466",
journal = "Algorithmica",
issn = "0178-4617",
publisher = "Springer New York",
number = "2",

}

TY - JOUR

T1 - Suffix Trays and Suffix Trists

T2 - Structures for Faster Text Indexing

AU - Cole, Richard

AU - Kopelowitz, Tsvi

AU - Lewenstein, Moshe

PY - 2015/6/1

Y1 - 2015/6/1

N2 - Suffix trees and suffix arrays are two of the most widely used data structures for text indexing. Each uses linear space and can be constructed in linear time for polynomially sized alphabets. However, when it comes to answering queries with worst-case deterministic time bounds, the prior does so in O(mlog|Σ|) time, where m is the query size, |Σ| is the alphabet size, and the latter does so in O(m+logn) time, where n is the text size. If one wants to output all appearances of the query, an additive cost of O(occ) time is sufficient, where occ is the size of the output. Notice that it is possible to obtain a worst case, deterministic query time of O(m) but at the cost of super-linear construction time or space usage. We propose a novel way of combining the two into, what we call, a suffix tray. The space and construction time remain linear and the query time improves to O(m+log|Σ|) for integer alphabets from a linear range, i.e. Σ⊂{1,…,cn}, for an arbitrary constant c. The construction and query are deterministic. Here also an additive O(occ) time is sufficient if one desires to output all appearances of the query. We also consider the online version of indexing, where the text arrives online, one character at a time, and indexing queries are answered in tandem. In this variant we create a cross between a suffix tree and a suffix list (a dynamic variant of suffix array) to be called a suffix trist; it supports queries in O(m+log|Σ|) time. The suffix trist also uses linear space. Furthermore, if there exists an online construction for a linear-space suffix tree such that the cost of adding a character is worst-case deterministic f(n,|Σ|) (n is the size of the current text), then one can further update the suffix trist in O(f(n,|Σ|)+log|Σ|) time. The best currently known worst-case deterministic bound for f(n,|Σ|) is O(logn) time.

KW - Data structures

KW - Indexing

KW - Pattern matching

UR - http://www.scopus.com/inward/record.url?scp=84929061223&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84929061223&partnerID=8YFLogxK

U2 - 10.1007/s00453-013-9860-6

DO - 10.1007/s00453-013-9860-6

M3 - Article

VL - 72

SP - 450

EP - 466

JO - Algorithmica

JF - Algorithmica

SN - 0178-4617

IS - 2

ER -