A fast and robust method for web page template detection and removal

Karane Vieira, Altigran S. Da Silva, Nick Pinto, Edleno S. De Moura, João M. B. Cavalcanti, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates - obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.
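
The two-step idea in the abstract (detect the template on a small sample, then strip it from every other page) can be illustrated with a minimal sketch. This is not the paper's detection algorithm: the sketch swaps in a simple term document-frequency heuristic, and the function names, the `min_share` threshold, and the sample pages are all hypothetical. It also shows how an F-measure over template terms, the kind of score the abstract reports, is computed from precision and recall.

```python
from collections import Counter


def detect_template_terms(sample_pages, min_share=0.8):
    """Step 1 (assumed heuristic): terms present in at least min_share of the sample pages."""
    doc_freq = Counter()
    for page in sample_pages:
        # Count each term once per page (document frequency, not raw frequency).
        doc_freq.update(set(page.split()))
    threshold = min_share * len(sample_pages)
    return {term for term, count in doc_freq.items() if count >= threshold}


def remove_template(page, template_terms):
    """Step 2: drop template terms from a page, keeping the other terms in order."""
    return " ".join(t for t in page.split() if t not in template_terms)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of terms."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: three pages sharing a navigation bar.
sample = [
    "Home About Contact cheap flights to Lisbon",
    "Home About Contact hotel deals in Manaus",
    "Home About Contact car rental tips",
]
template = detect_template_terms(sample)
# The costly detection ran only on the sample; removal on a new page is a cheap filter.
cleaned = remove_template("Home About Contact train passes explained", template)
```

The design point mirrored here is the cost split: detection touches only the sample, while removal on each remaining page is a single linear pass over its terms.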

Original language: English (US)
Title of host publication: Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006
Pages: 258-267
Number of pages: 10
DOIs: https://doi.org/10.1145/1183614.1183654
State: Published - 2006
Event: 15th ACM Conference on Information and Knowledge Management, CIKM 2006 - Arlington, VA, United States
Duration: Nov 6 2006 to Nov 11 2006

Other

Other: 15th ACM Conference on Information and Knowledge Management, CIKM 2006
Country: United States
City: Arlington, VA
Period: 11/6/06 to 11/11/06

Fingerprint

Template
World Wide Web
Clustering
Resources
Evaluation
Compromise
Web mining
Relevance judgments

Keywords

  • Web page noise removal
  • Web template extraction

ASJC Scopus subject areas

  • Business, Management and Accounting (all)

Cite this

Vieira, K., Da Silva, A. S., Pinto, N., De Moura, E. S., Cavalcanti, J. M. B., & Freire, J. (2006). A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006 (pp. 258-267) https://doi.org/10.1145/1183614.1183654

A fast and robust method for web page template detection and removal. / Vieira, Karane; Da Silva, Altigran S.; Pinto, Nick; De Moura, Edleno S.; Cavalcanti, João M. B.; Freire, Juliana.

Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006. 2006. p. 258-267.

Vieira, K, Da Silva, AS, Pinto, N, De Moura, ES, Cavalcanti, JMB & Freire, J 2006, A fast and robust method for web page template detection and removal. in Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006. pp. 258-267, 15th ACM Conference on Information and Knowledge Management, CIKM 2006, Arlington, VA, United States, 11/6/06. https://doi.org/10.1145/1183614.1183654
Vieira K, Da Silva AS, Pinto N, De Moura ES, Cavalcanti JMB, Freire J. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006. 2006. p. 258-267 https://doi.org/10.1145/1183614.1183654
Vieira, Karane ; Da Silva, Altigran S. ; Pinto, Nick ; De Moura, Edleno S. ; Cavalcanti, João M. B. ; Freire, Juliana. / A fast and robust method for web page template detection and removal. Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006. 2006. pp. 258-267
@inproceedings{6f521ecd89e9440698c3e3016638db3a,
title = "A fast and robust method for web page template detection and removal",
abstract = "The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates - obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.",
keywords = "Web page noise removal, Web template extraction",
author = "Karane Vieira and {Da Silva}, {Altigran S.} and Nick Pinto and {De Moura}, {Edleno S.} and Cavalcanti, {Jo{\~a}o M. B.} and Juliana Freire",
year = "2006",
doi = "10.1145/1183614.1183654",
language = "English (US)",
isbn = "1595934332",
pages = "258--267",
booktitle = "Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006",

}

TY - GEN

T1 - A fast and robust method for web page template detection and removal

AU - Vieira, Karane

AU - Da Silva, Altigran S.

AU - Pinto, Nick

AU - De Moura, Edleno S.

AU - Cavalcanti, João M. B.

AU - Freire, Juliana

PY - 2006

Y1 - 2006

N2 - The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates - obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.

KW - Web page noise removal

KW - Web template extraction

UR - http://www.scopus.com/inward/record.url?scp=34547631600&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34547631600&partnerID=8YFLogxK

U2 - 10.1145/1183614.1183654

DO - 10.1145/1183614.1183654

M3 - Conference contribution

SN - 1595934332

SN - 9781595934338

SP - 258

EP - 267

BT - Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006

ER -