Factor automata of automata and applications

Mehryar Mohri, Pedro Moreno, Eugene Weinstein

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
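
The two bounds quoted in the abstract can be compared on a tiny input with a brute-force check. The Python sketch below is only an illustration and is not the construction used in the paper: the function names are hypothetical, the state count is obtained by enumerating right contexts (Myhill-Nerode classes) rather than by building the automaton, and it assumes that states are counted without a sink state.

```python
def prefix_tree_size(strings):
    """Number of nodes |Q| of the prefix tree (trie) of the strings, root included."""
    return len({s[:i] for s in strings for i in range(len(s) + 1)})


def factors(strings):
    """Set of all factors (substrings) of the strings, including the empty string."""
    return {
        s[i:j]
        for s in strings
        for i in range(len(s) + 1)
        for j in range(i, len(s) + 1)
    }


def factor_automaton_states(strings):
    """State count of the minimal deterministic automaton accepting the factors.

    Brute force: two factors reach the same state exactly when they have the
    same set of right extensions that keep them inside the factor set.
    Only suitable for very small inputs.
    """
    facts = factors(strings)
    right_contexts = {
        frozenset(y for y in facts if x + y in facts) for x in facts
    }
    return len(right_contexts)


def bounds(strings):
    """Return (2|Q| - 2, 2||U|| - 1) for a set of strings U."""
    unique = set(strings)
    q = prefix_tree_size(unique)
    total_length = sum(len(s) for s in unique)
    return 2 * q - 2, 2 * total_length - 1


if __name__ == "__main__":
    U = {"abc", "abd"}
    new_bound, older_bound = bounds(U)
    print("states:", factor_automaton_states(U))
    print("2|Q| - 2:", new_bound)
    print("2||U|| - 1:", older_bound)
```

For U = {abc, abd}, the prefix tree has 5 nodes and ||U|| = 6, so the two bounds are 8 and 11, while the brute-force count gives 4 states. The gap between the bounds grows when the strings share long prefixes, since shared prefixes are counted once in |Q| but once per string in ||U||.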

Original language: English (US)
Title of host publication: Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers
Pages: 168-179
Number of pages: 12
Volume: 4783 LNCS
State: Published - 2007
Event: 12th International Conference on Implementation and Application of Automata, CIAA 2007 - Prague, Czech Republic
Duration: Jul 16 2007 → Jul 18 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4783 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 12th International Conference on Implementation and Application of Automata, CIAA 2007
Country: Czech Republic
City: Prague
Period: 7/16/07 → 7/18/07

Keywords

  • Factor automata
  • Finite automata
  • Information retrieval
  • Inverted files
  • Music identification
  • Suffix automata
  • Suffix trees
  • Text indexing

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Mohri, M., Moreno, P., & Weinstein, E. (2007). Factor automata of automata and applications. In Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers (Vol. 4783 LNCS, pp. 168-179). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4783 LNCS).

@inproceedings{d5ab4219716d4f7cada4e65cc62d5e2a,
title = "Factor automata of automata and applications",
abstract = "An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.",
keywords = "Factor automata, Finite automata, Information retrieval, Inverted files, Music identification, Suffix automata, Suffix trees, Text indexing",
author = "Mehryar Mohri and Pedro Moreno and Eugene Weinstein",
year = "2007",
language = "English (US)",
isbn = "9783540763352",
volume = "4783",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "168--179",
booktitle = "Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers",

}

TY - GEN
T1 - Factor automata of automata and applications
AU - Mohri, Mehryar
AU - Moreno, Pedro
AU - Weinstein, Eugene
PY - 2007
Y1 - 2007
AB - An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
KW - Factor automata
KW - Finite automata
KW - Information retrieval
KW - Inverted files
KW - Music identification
KW - Suffix automata
KW - Suffix trees
KW - Text indexing
UR - http://www.scopus.com/inward/record.url?scp=38149108437&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38149108437&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:38149108437
SN - 9783540763352
VL - 4783 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 168
EP - 179
BT - Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers
ER -