Combining classifiers to identify online databases

Luciano Barbosa, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
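The idea of partitioning form features and composing simpler classifiers can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not name the partitions or the learning techniques: the `Form` fields, the hand-written structural rule, and the toy keyword scorer are all hypothetical stand-ins, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Form:
    # Hypothetical structural features (assumed, not taken from the paper).
    num_text_inputs: int
    num_password_inputs: int
    # Hypothetical textual feature: the visible text of the form.
    text: str


def is_searchable(form: Form) -> bool:
    """Partition 1: a hand-written rule that looks only at structural features."""
    return form.num_text_inputs >= 1 and form.num_password_inputs == 0


class KeywordDomainClassifier:
    """Partition 2: a toy bag-of-words scorer that looks only at form text."""

    def __init__(self) -> None:
        self.counts = {True: Counter(), False: Counter()}

    def fit(self, texts, labels):
        # Count how often each word appears in in-domain and out-of-domain text.
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        return self

    def predict(self, text: str) -> bool:
        words = text.lower().split()
        pos = sum(self.counts[True][w] for w in words)
        neg = sum(self.counts[False][w] for w in words)
        return pos > neg


def select_forms(forms, domain_clf):
    """Compose the two classifiers in sequence: keep forms that pass both."""
    return [f for f in forms if is_searchable(f) and domain_clf.predict(f.text)]


# Toy training data for a hypothetical "airfare" domain D.
clf = KeywordDomainClassifier().fit(
    ["search flights departure arrival", "find used cars make model"],
    [True, False],
)
forms = [
    Form(2, 0, "departure city arrival city search flights"),
    Form(1, 1, "username password login"),
    Form(1, 0, "make model year find cars"),
]
selected = select_forms(forms, clf)
```

Each classifier sees only its own partition of the features, so either one could be swapped for a learned model without touching the other.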

Original language: English (US)
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 431-440
Number of pages: 10
DOIs: https://doi.org/10.1145/1242572.1242631
State: Published - 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: May 8 2007 to May 12 2007

Other

Other: 16th International World Wide Web Conference, WWW2007
Country: Canada
City: Banff, AB
Period: 5/8/07 to 5/12/07

Keywords

  • Hidden web
  • Hierarchical classifiers
  • Learning classifiers
  • Online database directories
  • Web crawlers

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Barbosa, L., & Freire, J. (2007). Combining classifiers to identify online databases. In 16th International World Wide Web Conference, WWW2007 (pp. 431-440). https://doi.org/10.1145/1242572.1242631

@inproceedings{9d9eda152ad741c29a15f2d0e392a808,
title = "Combining classifiers to identify online databases",
abstract = "We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.",
keywords = "Hidden web, Hierarchical classifiers, Learning classifiers, Online database directories, Web crawlers",
author = "Luciano Barbosa and Juliana Freire",
year = "2007",
doi = "10.1145/1242572.1242631",
language = "English (US)",
isbn = "1595936548",
pages = "431--440",
booktitle = "16th International World Wide Web Conference, WWW2007",

}

TY - GEN

T1 - Combining classifiers to identify online databases

AU - Barbosa, Luciano

AU - Freire, Juliana

PY - 2007

Y1 - 2007

AB - We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.

KW - Hidden web

KW - Hierarchical classifiers

KW - Learning classifiers

KW - Online database directories

KW - Web crawlers

UR - http://www.scopus.com/inward/record.url?scp=35348849557&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348849557&partnerID=8YFLogxK

U2 - 10.1145/1242572.1242631

DO - 10.1145/1242572.1242631

M3 - Conference contribution

SN - 1595936548

SN - 9781595936547

SP - 431

EP - 440

BT - 16th International World Wide Web Conference, WWW2007

ER -