On finding templates on web collections

Karane Vieira, André Luiz da Costa Carvalho, Klessius Berlt, Edleno S. de Moura, Altigran S. da Silva, Juliana Freire

Research output: Contribution to journal › Article

Abstract

Templates are pieces of HTML code common to a set of Web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in Web sites, such as search engines, clustering and automatic categorization programs. Further, the information available in templates is redundant, and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates considering a scenario where multiple templates can be found in a collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises important questions that may require extensions and adjustments in previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of HTML nodes, and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single-template detection based on a restricted form of bottom-up tree-mapping that requires only a small set of pages to correctly identify a template and has worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.
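The two-stage idea described in the abstract (first partition the collection into clusters of pages that share HTML paths, then run a single-template procedure on each cluster) can be illustrated with a rough sketch. This is not the authors' algorithm: the Jaccard similarity measure, the greedy clustering, the 0.6 threshold, and the "paths common to every page" stand-in for single-template detection are all assumptions made for illustration, and every function name below is hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def html_paths(html):
    """Return the frozenset of tag paths (e.g. 'html/body/div') in a page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two path sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy one-pass clustering: a page joins the first cluster whose
    first member is at least `threshold` similar, else starts a new cluster."""
    clusters = []  # each cluster is a list of (page_name, path_set)
    for name, html in pages.items():
        paths = html_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster[0][1]) >= threshold:
                cluster.append((name, paths))
                break
        else:
            clusters.append([(name, paths)])
    return clusters


def template_paths(cluster):
    """Simplified single-template step: paths shared by every page in the cluster."""
    return frozenset.intersection(*(paths for _, paths in cluster))


if __name__ == "__main__":
    pages = {
        "a": "<html><body><div><ul><li>Home</li></ul></div><p>News A</p></body></html>",
        "b": "<html><body><div><ul><li>Home</li></ul></div><p>News B</p><span>x</span></body></html>",
        "c": "<html><body><table><tr><td>Item</td></tr></table></body></html>",
    }
    for cluster in cluster_pages(pages):
        names = [name for name, _ in cluster]
        print(names, sorted(template_paths(cluster)))
```

In this toy collection, pages "a" and "b" share a navigation structure and end up in one cluster, while "c" forms its own. A real system would also compare node content and would use a tree-mapping procedure, as the abstract describes, rather than plain path intersection.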

Original language: English (US)
Pages (from-to): 171-211
Number of pages: 41
Journal: World Wide Web
Volume: 12
Issue number: 2
DOIs: https://doi.org/10.1007/s11280-009-0059-3
State: Published - Mar 2009

Fingerprint

HTML
World Wide Web
Websites
Search engines
Navigation
Processing

Keywords

  • Tree-mapping
  • Web IR
  • Web template detection

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software

Cite this

Vieira, K., da Costa Carvalho, A. L., Berlt, K., de Moura, E. S., da Silva, A. S., & Freire, J. (2009). On finding templates on web collections. World Wide Web, 12(2), 171-211. https://doi.org/10.1007/s11280-009-0059-3

@article{29b82dee16b546d8bb7307fdb325528c,
title = "On finding templates on web collections",
keywords = "Tree-mapping, Web IR, Web template detection",
author = "Karane Vieira and {da Costa Carvalho}, {Andr{\'e} Luiz} and Klessius Berlt and {de Moura}, {Edleno S.} and {da Silva}, {Altigran S.} and Juliana Freire",
year = "2009",
month = "3",
doi = "10.1007/s11280-009-0059-3",
language = "English (US)",
volume = "12",
pages = "171--211",
journal = "World Wide Web",
issn = "1386-145X",
publisher = "Springer New York",
number = "2",

}

TY - JOUR

T1 - On finding templates on web collections

AU - Vieira, Karane

AU - da Costa Carvalho, André Luiz

AU - Berlt, Klessius

AU - de Moura, Edleno S.

AU - da Silva, Altigran S.

AU - Freire, Juliana

PY - 2009/3

Y1 - 2009/3

KW - Tree-mapping

KW - Web IR

KW - Web template detection

UR - http://www.scopus.com/inward/record.url?scp=71349086902&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=71349086902&partnerID=8YFLogxK

U2 - 10.1007/s11280-009-0059-3

DO - 10.1007/s11280-009-0059-3

M3 - Article

AN - SCOPUS:71349086902

VL - 12

SP - 171

EP - 211

JO - World Wide Web

JF - World Wide Web

SN - 1386-145X

IS - 2

ER -