Clustering Wikipedia infoboxes to discover their types

Thanh Hoang Nguyen, Huong Dieu Nguyen, Viviane Moreira, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.

Original language: English (US)
Title of host publication: CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Pages: 2134-2138
Number of pages: 5
DOIs: https://doi.org/10.1145/2396761.2398588
State: Published - 2012
Event: 21st ACM International Conference on Information and Knowledge Management, CIKM 2012 - Maui, HI, United States
Duration: Oct 29 2012 - Nov 2 2012

Other

Other: 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
Country: United States
City: Maui, HI
Period: 10/29/12 - 11/2/12

Keywords

  • clustering
  • wikipedia infobox

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Nguyen, T. H., Nguyen, H. D., Moreira, V., & Freire, J. (2012). Clustering Wikipedia infoboxes to discover their types. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 2134-2138). https://doi.org/10.1145/2396761.2398588

Clustering Wikipedia infoboxes to discover their types. / Nguyen, Thanh Hoang; Nguyen, Huong Dieu; Moreira, Viviane; Freire, Juliana.

CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 2134-2138.

Nguyen, TH, Nguyen, HD, Moreira, V & Freire, J 2012, Clustering Wikipedia infoboxes to discover their types. in CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 2134-2138, 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, Maui, HI, United States, 10/29/12. https://doi.org/10.1145/2396761.2398588
Nguyen TH, Nguyen HD, Moreira V, Freire J. Clustering Wikipedia infoboxes to discover their types. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 2134-2138 https://doi.org/10.1145/2396761.2398588
Nguyen, Thanh Hoang ; Nguyen, Huong Dieu ; Moreira, Viviane ; Freire, Juliana. / Clustering Wikipedia infoboxes to discover their types. CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. pp. 2134-2138
@inproceedings{ada24c8523c048e58e0654fa6e0d82b7,
title = "Clustering Wikipedia infoboxes to discover their types",
abstract = "Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.",
keywords = "clustering, wikipedia infobox",
author = "Nguyen, {Thanh Hoang} and Nguyen, {Huong Dieu} and Viviane Moreira and Juliana Freire",
year = "2012",
doi = "10.1145/2396761.2398588",
language = "English (US)",
isbn = "9781450311564",
pages = "2134--2138",
booktitle = "CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management",

}

TY  - GEN
T1  - Clustering Wikipedia infoboxes to discover their types
AU  - Nguyen, Thanh Hoang
AU  - Nguyen, Huong Dieu
AU  - Moreira, Viviane
AU  - Freire, Juliana
PY  - 2012
Y1  - 2012
AB  - Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.
KW  - clustering
KW  - wikipedia infobox
UR  - http://www.scopus.com/inward/record.url?scp=84871054933&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84871054933&partnerID=8YFLogxK
U2  - 10.1145/2396761.2398588
DO  - 10.1145/2396761.2398588
M3  - Conference contribution
SN  - 9781450311564
SP  - 2134
EP  - 2138
BT  - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
ER  -