General suffix automaton construction algorithm and space bounds

Mehryar Mohri, Pedro Moreno, Eugene Weinstein

Research output: Contribution to journal › Article

Abstract

Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖ - 1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
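To make the quantities in the abstract concrete, the following is a minimal sketch in Python. It is not the paper's generalized algorithm: it only implements the standard on-line construction for a single input string (the special case the abstract says is generalized) and counts prefix-tree nodes to obtain Q. The function names prefix_tree_size and suffix_automaton_states are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List


def prefix_tree_size(strings: Iterable[str]) -> int:
    """Return Q, the number of nodes (root included) of a prefix tree."""
    root: Dict[str, dict] = {}
    count = 1  # the root node
    for s in strings:
        node = root
        for ch in s:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


def suffix_automaton_states(text: str) -> int:
    """Build the suffix automaton of ONE string on-line; return its state count."""
    trans: List[Dict[str, int]] = [{}]  # transitions of each state
    link: List[int] = [-1]              # suffix link of each state
    length: List[int] = [0]             # longest string reaching each state
    last = 0                            # state reached by the text read so far
    for ch in text:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the missing transitions along the suffix-link chain.
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return len(trans)


if __name__ == "__main__":
    text = "abcbc"
    q_nodes = prefix_tree_size([text])          # a single string gives |text| + 1 nodes
    states = suffix_automaton_states(text)
    print(states, "<=", 2 * q_nodes - 2)        # 8 <= 10

    u = ["abc", "abd"]                          # shared prefix "ab"
    print(2 * prefix_tree_size(u) - 2)          # 2Q - 2 = 8
    print(2 * sum(len(s) for s in u) - 1)       # 2‖U‖ - 1 = 11
```

For the two-string set, the shared prefix "ab" is stored once in the prefix tree, so Q = 5 while ‖U‖ = 6, which illustrates why a bound stated in terms of Q can be tighter than one stated in terms of ‖U‖.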

Original language: English (US)
Pages (from-to): 3553-3562
Number of pages: 10
Journal: Theoretical Computer Science
Volume: 410
Issue number: 37
DOI: 10.1016/j.tcs.2009.03.034
State: Published, Sep 1 2009

Keywords

  • Factor automata
  • Finite automata
  • Indexing
  • Inverted text
  • Music identification
  • Pattern-matching
  • String-matching
  • Suffix automata
  • Suffix trees

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

General suffix automaton construction algorithm and space bounds. / Mohri, Mehryar; Moreno, Pedro; Weinstein, Eugene.

In: Theoretical Computer Science, Vol. 410, No. 37, 01.09.2009, p. 3553-3562.

@article{db67eb0a2d1242a88babd4f346d6ec9f,
title = "General suffix automaton construction algorithm and space bounds",
abstract = "Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings $U$ has at most $2Q - 2$ states, where $Q$ is the number of nodes of a prefix-tree representing the strings in $U$. This bound significantly improves over $2\|U\| - 1$, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where $\|U\|$ is the sum of the lengths of all strings in $U$. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton $S$ or factor automaton $F$ of $U$ in time $O(|S|)$. Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.",
keywords = "Factor automata, Finite automata, Indexing, Inverted text, Music identification, Pattern-matching, String-matching, Suffix automata, Suffix trees",
author = "Mehryar Mohri and Pedro Moreno and Eugene Weinstein",
year = "2009",
month = "9",
day = "1",
doi = "10.1016/j.tcs.2009.03.034",
language = "English (US)",
volume = "410",
pages = "3553--3562",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",
number = "37",

}

TY - JOUR

T1 - General suffix automaton construction algorithm and space bounds

AU - Mohri, Mehryar

AU - Moreno, Pedro

AU - Weinstein, Eugene

PY - 2009/9/1

Y1 - 2009/9/1

N2 - Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖ - 1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.

KW - Factor automata

KW - Finite automata

KW - Indexing

KW - Inverted text

KW - Music identification

KW - Pattern-matching

KW - String-matching

KW - Suffix automata

KW - Suffix trees

UR - http://www.scopus.com/inward/record.url?scp=67651095813&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67651095813&partnerID=8YFLogxK

U2 - 10.1016/j.tcs.2009.03.034

DO - 10.1016/j.tcs.2009.03.034

M3 - Article

AN - SCOPUS:67651095813

VL - 410

SP - 3553

EP - 3562

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

IS - 37

ER -