Learning to extract form labels

Hoa Nguyen, Thanh Nguyen, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

In this paper we describe a new approach to extract element labels from Web form interfaces. Having these labels is a requirement for several techniques that attempt to retrieve and integrate information that is hidden behind form interfaces, such as hidden Web crawlers and metasearchers. However, given the wide variation in form layout, even within a well-defined domain, automatically extracting these labels is a challenging problem. Whereas previous approaches to this problem have relied on heuristics and manually specified extraction rules, our technique makes use of a learning classifier ensemble to identify element-label mappings; and it applies a reconciliation step which leverages the classifier-derived mappings to boost extraction accuracy. We present a detailed experimental evaluation using over three thousand Web forms. Our results show that our approach is effective: it obtains significantly higher accuracy and is more robust to variability in form layout than previous label extraction techniques.
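The two stages named in the abstract, scoring candidate element-label pairs with an ensemble and then reconciling the resulting mappings, can be illustrated with a minimal sketch. This is not the authors' implementation: the hand-written scorers below are hypothetical stand-ins for trained classifiers, the averaging rule and the greedy one-to-one reconciliation are assumptions chosen for brevity, and all names (`Box`, `proximity_score`, `position_score`, `reconcile`) are invented for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Box:
    """A form element or a text label with a top-left position in pixels."""
    name: str
    x: float
    y: float


def proximity_score(element: Box, label: Box) -> float:
    """Hypothetical scorer: higher when the label is close (Manhattan distance)."""
    distance = abs(element.x - label.x) + abs(element.y - label.y)
    return 1.0 / (1.0 + distance / 100.0)


def position_score(element: Box, label: Box) -> float:
    """Hypothetical scorer: prefer same-row labels to the left, then labels above."""
    if label.y == element.y and label.x < element.x:
        return 1.0
    if label.x == element.x and label.y < element.y:
        return 0.7
    return 0.1


# Stand-ins for the members of a learned ensemble (assumption).
SCORERS = [proximity_score, position_score]


def ensemble_score(element: Box, label: Box) -> float:
    """Average the individual scores; a real ensemble could vote or stack instead."""
    return sum(scorer(element, label) for scorer in SCORERS) / len(SCORERS)


def reconcile(elements: list[Box], labels: list[Box]) -> dict[str, str]:
    """Greedy one-to-one assignment: accept the highest-scoring pairs first,
    skipping any pair whose element or label has already been used."""
    candidates = sorted(
        ((ensemble_score(e, l), e.name, l.name) for e, l in product(elements, labels)),
        reverse=True,
    )
    mapping: dict[str, str] = {}
    used_labels: set[str] = set()
    for _score, element_name, label_name in candidates:
        if element_name not in mapping and label_name not in used_labels:
            mapping[element_name] = label_name
            used_labels.add(label_name)
    return mapping


def extract_labels_demo() -> dict[str, str]:
    """Two select boxes, each with a text label on the same row to its left."""
    elements = [Box("make_select", 120, 10), Box("model_select", 120, 40)]
    labels = [Box("Make", 20, 10), Box("Model", 20, 40)]
    return reconcile(elements, labels)


if __name__ == "__main__":
    print(extract_labels_demo())
```

In this toy layout each select box is paired with the label on its own row. The reconciliation pass is what prevents two elements from claiming the same label when their scores are close.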

Original language: English (US)
Title of host publication: Proceedings of the VLDB Endowment
Pages: 684-694
Number of pages: 11
Volume: 1
Edition: 1
State: Published - 2008

Fingerprint

Labels
Classifiers

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. In Proceedings of the VLDB Endowment (1 ed., Vol. 1, pp. 684-694)

@inbook{b36884a2effd4b0680dced08f41b2c08,
title = "Learning to extract form labels",
author = "Hoa Nguyen and Thanh Nguyen and Juliana Freire",
year = "2008",
language = "English (US)",
volume = "1",
pages = "684--694",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Learning to extract form labels

AU - Nguyen, Hoa

AU - Nguyen, Thanh

AU - Freire, Juliana

PY - 2008

Y1 - 2008

UR - http://www.scopus.com/inward/record.url?scp=77951163340&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77951163340&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:77951163340

VL - 1

SP - 684

EP - 694

BT - Proceedings of the VLDB Endowment

ER -