Using latent-structure to detect objects on the web

Luciano Barbosa, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

An important requirement for emerging applications that aim to locate and integrate content distributed over the Web is to identify pages that are relevant for a given domain or task. In this paper, we address the problem of identifying pages that contain objects with a latent structure, i.e., objects whose structure is only implicitly represented in the page. We propose an algorithm which, given a set of instances of an object type, derives rules by automatically extracting statistically significant patterns present inside the objects. These rules can then be used to detect the presence of these objects in new, unseen pages. Our approach has several advantages over learning-based text classifiers. Because it relies only on positive examples, constructing accurate object detectors is simpler than constructing learning classifiers, which require both positive and negative examples. Also, besides providing a classification decision for the presence of an object, the derived detectors are able to pinpoint the location of the object inside a Web page. This enables our algorithm to extract additional object fragments and apply online learning to automatically update the rules as new documents become available. An experimental evaluation, using a representative set of domains, indicates that our approach is effective. It is able to learn structural patterns and derive detectors that outperform state-of-the-art text classifiers, and the online learning component leads to substantial improvements over the initial detectors.
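
The pipeline outlined in the abstract (derive rules from positive instances, then detect and locate objects in unseen pages) can be illustrated with a toy sketch. This is not the authors' algorithm: the token classes, the n-gram patterns, the `min_support` cut-off used as a crude stand-in for a statistical significance test, and every function name below are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into numbers, alphabetic words, and single punctuation marks."""
    return re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", text)

def token_class(tok):
    """Map a token to a coarse class so that structure, not wording, is compared."""
    if tok.isdigit():
        return "NUM"
    if tok.isalpha():
        return "CAP" if tok[0].isupper() else "WORD"
    return tok  # punctuation is kept literally

def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def derive_rules(instances, n=3, min_support=0.8):
    """Keep class n-grams that occur in at least min_support of the positive instances.
    (Hypothetical stand-in for the paper's statistically significant patterns.)"""
    counts = Counter()
    for inst in instances:
        classes = [token_class(t) for t in tokenize(inst)]
        counts.update(set(ngrams(classes, n)))  # count each pattern once per instance
    threshold = min_support * len(instances)
    return {g for g, c in counts.items() if c >= threshold}

def detect(page, rules, n=3, window=10, min_hits=3):
    """Return (start, end) token offsets of the densest cluster of rule matches,
    or None when fewer than min_hits matches fall within one window."""
    classes = [token_class(t) for t in tokenize(page)]
    hits = [i for i, g in enumerate(ngrams(classes, n)) if g in rules]
    best = []
    for i, start in enumerate(hits):
        inside = [h for h in hits[i:] if h < start + window]
        if len(inside) > len(best):
            best = inside
    if len(best) < min_hits:
        return None
    return best[0], best[-1] + n

# Positive examples only: three instances of a made-up "postal address" object type.
examples = [
    "123 Main St, Springfield, IL 62704",
    "45 Oak Ave, Portland, OR 97201",
    "9 Elm Rd, Austin, TX 73301",
]
rules = derive_rules(examples)

page = "Visit our store at 77 Pine St, Denver, CO 80202 for great deals."
span = detect(page, rules)
if span is not None:
    start, end = span
    print(tokenize(page)[start:end])  # tokens of the located object
```

Because `detect` returns token offsets rather than only a yes/no decision, the located fragment could be appended to `examples` and `derive_rules` rerun, which is a very rough analogue of the online update the abstract mentions.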

Original language: English (US)
Title of host publication: Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
DOIs: https://doi.org/10.1145/1859127.1859138
State: Published - 2010
Event: 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 - Indianapolis, IN, United States
Duration: Jun 6 2010

Fingerprint

Detectors
Classifiers
Websites

Keywords

  • Information extraction
  • Online learning
  • Rule inference
  • Web objects

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Barbosa, L., & Freire, J. (2010). Using latent-structure to detect objects on the web. In Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 [8] https://doi.org/10.1145/1859127.1859138

@inproceedings{506a145825254f86bc7553a5e51d37ca,
title = "Using latent-structure to detect objects on the web",
keywords = "Information extraction, Online learning, Rule inference, Web objects",
author = "Luciano Barbosa and Juliana Freire",
year = "2010",
doi = "10.1145/1859127.1859138",
language = "English (US)",
isbn = "9781450301862",
booktitle = "Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010",

}

TY - GEN

T1 - Using latent-structure to detect objects on the web

AU - Barbosa, Luciano

AU - Freire, Juliana

PY - 2010

Y1 - 2010

KW - Information extraction

KW - Online learning

KW - Rule inference

KW - Web objects

UR - http://www.scopus.com/inward/record.url?scp=78650451134&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650451134&partnerID=8YFLogxK

U2 - 10.1145/1859127.1859138

DO - 10.1145/1859127.1859138

M3 - Conference contribution

SN - 9781450301862

BT - Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010

ER -