### Abstract

An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors (substrings) of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings that is itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix tree representing the strings in U. This bound significantly improves on 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. The paper also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that factor automata of automata can be practical for large-scale applications, a conclusion further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
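The two bounds quoted in the abstract can be compared on a small example. The sketch below is a brute-force check, not the construction used in the paper: it enumerates every factor, counts the distinct non-empty residual languages (which, by the Myhill-Nerode theorem, equals the number of states of the minimal deterministic automaton once the dead state is dropped), and compares that count with 2|Q| - 2 and 2||U|| - 1. The function names are hypothetical, and counting states without the dead state is an assumption about the convention behind the bounds.

```python
def factors(strings):
    """Return the set of all factors (substrings) of the given strings,
    including the empty string."""
    result = {""}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                result.add(s[i:j])
    return result


def factor_automaton_state_count(strings):
    """Count the states of the minimal deterministic factor automaton.

    Each state corresponds to one distinct residual language
    {v : u + v is a factor}, taken over all factors u. Strings that are
    not factors have an empty residual (the dead state), which is
    excluded here. Brute force, so only suitable for tiny inputs.
    """
    fs = factors(strings)
    residuals = {
        frozenset(w[len(u):] for w in fs if w.startswith(u))
        for u in fs
    }
    return len(residuals)


def trie_node_count(strings):
    """Number of nodes |Q| of the prefix tree: one per distinct prefix,
    with the empty prefix acting as the root."""
    return len({s[:i] for s in strings for i in range(len(s) + 1)})


def total_length(strings):
    """||U||: the sum of the lengths of all strings."""
    return sum(len(s) for s in strings)


if __name__ == "__main__":
    U = ["abc", "abd"]
    print("states:         ", factor_automaton_state_count(U))
    print("2|Q| - 2 bound: ", 2 * trie_node_count(U) - 2)
    print("2||U|| - 1 bound:", 2 * total_length(U) - 1)
```

For U = {abc, abd}, the prefix tree has 5 nodes and ||U|| = 6, so the bounds are 8 and 11 respectively, while the automaton itself has 4 states. The gap between the two bounds widens as the strings share longer prefixes, since shared prefixes are counted once in |Q| but once per string in ||U||.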

Original language | English (US) |
---|---|
Title of host publication | Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers |
Pages | 168-179 |
Number of pages | 12 |
Volume | 4783 LNCS |
State | Published - 2007 |
Event | 12th International Conference on Implementation and Application of Automata, CIAA 2007 - Prague, Czech Republic. Duration: Jul 16 2007 → Jul 18 2007 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 4783 LNCS |
ISSN (Print) | 03029743 |
ISSN (Electronic) | 16113349 |

### Other

Other | 12th International Conference on Implementation and Application of Automata, CIAA 2007 |
---|---|
Country | Czech Republic |
City | Prague |
Period | 7/16/07 → 7/18/07 |

### Keywords

- Factor automata
- Finite automata
- Information retrieval
- Inverted files
- Music identification
- Suffix automata
- Suffix trees
- Text indexing

### ASJC Scopus subject areas

- Computer Science(all)
- Biochemistry, Genetics and Molecular Biology(all)
- Theoretical Computer Science

### Cite this

**Factor automata of automata and applications.** / Mohri, Mehryar; Moreno, Pedro; Weinstein, Eugene.

*Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers.* Vol. 4783 LNCS, pp. 168-179. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4783 LNCS). 12th International Conference on Implementation and Application of Automata, CIAA 2007, Prague, Czech Republic, 7/16/07.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Factor automata of automata and applications

AU - Mohri, Mehryar

AU - Moreno, Pedro

AU - Weinstein, Eugene

PY - 2007

Y1 - 2007

KW - Factor automata

KW - Finite automata

KW - Information retrieval

KW - Inverted files

KW - Music identification

KW - Suffix automata

KW - Suffix trees

KW - Text indexing

UR - http://www.scopus.com/inward/record.url?scp=38149108437&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=38149108437&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:38149108437

SN - 9783540763352

VL - 4783 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 168

EP - 179

BT - Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers

ER -