PruSM: A prudent schema matching approach for web forms

Thanh Nguyen, Hoa Nguyen, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.

Original language: English (US)
Title of host publication: CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops
Pages: 1385-1388
Number of pages: 4
DOIs: https://doi.org/10.1145/1871437.1871627
Publication status: Published - 2010
Event: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10 - Toronto, ON, Canada
Duration: Oct 26 2010 → Oct 30 2010

Other

Other: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10
Country: Canada
City: Toronto, ON
Period: 10/26/10 → 10/30/10

Keywords

  • Hidden web
  • Schema matching
  • Web forms

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Nguyen, T., Nguyen, H., & Freire, J. (2010). PruSM: A prudent schema matching approach for web forms. In CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops (pp. 1385-1388). https://doi.org/10.1145/1871437.1871627