PruSM

A prudent schema matching approach for web forms

Thanh Nguyen, Hoa Nguyen, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.

Original language: English (US)
Title of host publication: CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops
Pages: 1385-1388
Number of pages: 4
DOIs: https://doi.org/10.1145/1871437.1871627
State: Published - 2010
Event: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10 - Toronto, ON, Canada
Duration: Oct 26 2010 to Oct 30 2010

Other

Other: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10
Country: Canada
City: Toronto, ON
Period: 10/26/10 to 10/30/10

Keywords

  • Hidden web
  • Schema matching
  • Web forms

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Nguyen, T., Nguyen, H., & Freire, J. (2010). PruSM: A prudent schema matching approach for web forms. In CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops (pp. 1385-1388). https://doi.org/10.1145/1871437.1871627

@inproceedings{d21e745c3c02449dacd8d720b02b302c,
title = "PruSM: A prudent schema matching approach for web forms",
abstract = "There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.",
keywords = "Hidden web, Schema matching, Web forms",
author = "Thanh Nguyen and Hoa Nguyen and Juliana Freire",
year = "2010",
doi = "10.1145/1871437.1871627",
language = "English (US)",
isbn = "9781450300995",
pages = "1385--1388",
booktitle = "CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops",

}

TY - GEN
T1 - PruSM
T2 - A prudent schema matching approach for web forms
AU - Nguyen, Thanh
AU - Nguyen, Hoa
AU - Freire, Juliana
PY - 2010
Y1 - 2010
AB - There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.
KW - Hidden web
KW - Schema matching
KW - Web forms
UR - http://www.scopus.com/inward/record.url?scp=78651311226&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78651311226&partnerID=8YFLogxK
U2 - 10.1145/1871437.1871627
DO - 10.1145/1871437.1871627
M3 - Conference contribution
SN - 9781450300995
SP - 1385
EP - 1388
BT - CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops
ER -