Interactive wrapper generation with minimal user effort

Utku Irmak, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

    Original language: English (US)
    Title of host publication: Proceedings of the 15th International Conference on World Wide Web
    Pages: 553-563
    Number of pages: 11
    DOI: https://doi.org/10.1145/1135777.1135859
    State: Published - 2006
    Event: 15th International Conference on World Wide Web - Edinburgh, Scotland, United Kingdom
    Duration: May 23 2006 to May 26 2006

    Other

    Other: 15th International Conference on World Wide Web
    Country: United Kingdom
    City: Edinburgh, Scotland
    Period: 5/23/06 to 5/26/06

    Fingerprint

    HTML
    Experiments

    Keywords

    • Active learning
    • Data extraction
    • Wrapper generation

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software

    Cite this

    Irmak, U., & Suel, T. (2006). Interactive wrapper generation with minimal user effort. In Proceedings of the 15th International Conference on World Wide Web (pp. 553-563). https://doi.org/10.1145/1135777.1135859

    Interactive wrapper generation with minimal user effort. / Irmak, Utku; Suel, Torsten.

    Proceedings of the 15th International Conference on World Wide Web. 2006. p. 553-563.

    Irmak, U & Suel, T 2006, Interactive wrapper generation with minimal user effort. in Proceedings of the 15th International Conference on World Wide Web. pp. 553-563, 15th International Conference on World Wide Web, Edinburgh, Scotland, United Kingdom, 5/23/06. https://doi.org/10.1145/1135777.1135859
    Irmak U, Suel T. Interactive wrapper generation with minimal user effort. In Proceedings of the 15th International Conference on World Wide Web. 2006. p. 553-563 https://doi.org/10.1145/1135777.1135859
    Irmak, Utku ; Suel, Torsten. / Interactive wrapper generation with minimal user effort. Proceedings of the 15th International Conference on World Wide Web. 2006. pp. 553-563
    @inproceedings{c22bd7cec2ea4ff9b37b9ad60fd2aaca,
    title = "Interactive wrapper generation with minimal user effort",
    abstract = "While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.",
    keywords = "Active learning, Data extraction, Wrapper generation",
    author = "Utku Irmak and Torsten Suel",
    year = "2006",
    doi = "10.1145/1135777.1135859",
    language = "English (US)",
    isbn = "1595933239",
    pages = "553--563",
    booktitle = "Proceedings of the 15th International Conference on World Wide Web",

    }

    TY - GEN

    T1 - Interactive wrapper generation with minimal user effort

    AU - Irmak, Utku

    AU - Suel, Torsten

    PY - 2006

    Y1 - 2006

    AB - While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

    KW - Active learning

    KW - Data extraction

    KW - Wrapper generation

    UR - http://www.scopus.com/inward/record.url?scp=34250750133&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=34250750133&partnerID=8YFLogxK

    U2 - 10.1145/1135777.1135859

    DO - 10.1145/1135777.1135859

    M3 - Conference contribution

    AN - SCOPUS:34250750133

    SN - 1595933239

    SN - 9781595933232

    SP - 553

    EP - 563

    BT - Proceedings of the 15th International Conference on World Wide Web

    ER -