On compression-based text classification

Yuval Marton, Ning Wu, Lisa Hellerstein

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
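
A minimal sketch of the general idea, under stated assumptions: a document is assigned to the class whose training text lets it be compressed most cheaply. The compressor (zlib's DEFLATE), the scoring rule (extra compressed bytes when the document is appended to a class's training text), the names `compressed_size`, `classify`, and `TRAINING`, and the toy data are all illustrative choices made here. The abstract does not say which compressors or scoring rules the paper evaluates, so this should not be read as the authors' method.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text absorbs the document most cheaply.

    The score for a label is the growth in compressed size when the document
    is appended to that label's training text (an assumed scoring rule).
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy training data (hypothetical): one concatenated text per class.
TRAINING = {
    "sports": (
        "The striker scored a late goal and the team won the match. "
        "The goalkeeper saved a penalty in the second half. "
        "The coach praised the defenders after the final whistle."
    ),
    "cooking": (
        "Chop the onions and simmer the sauce over low heat. "
        "Whisk the eggs with sugar and fold in the flour. "
        "Bake the cake until golden and let it cool on a rack."
    ),
}

if __name__ == "__main__":
    # Raw characters go straight to the compressor: no tokenizing or stemming.
    print(classify("The team scored a goal in the final minute.", TRAINING))
```

Because the score is computed over raw characters, punctuation, word fragments, and multi-word sequences can all contribute to a match, which is the kind of non-word feature the abstract mentions. Recompressing each class's training text for every document also suggests why running time can be a drawback.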

    Original language: English (US)
    Title of host publication: Lecture Notes in Computer Science
    Editors: D.E. Losada, J.M. Fernandez-Luna
    Pages: 300-314
    Number of pages: 15
    Volume: 3408
    State: Published - 2005
    Event: 27th European Conference on IR Research, ECIR 2005 - Santiago de Compostela, Spain
    Duration: Mar 21 2005 to Mar 23 2005

    ASJC Scopus subject areas

    • Computer Science (miscellaneous)

    Cite this

    Marton, Y., Wu, N., & Hellerstein, L. (2005). On compression-based text classification. In D. E. Losada, & J. M. Fernandez-Luna (Eds.), Lecture Notes in Computer Science (Vol. 3408, pp. 300-314)

    @inproceedings{fa3753e734f64935b770d299153e0e8f,
    title = "On compression-based text classification",
    author = "Yuval Marton and Ning Wu and Lisa Hellerstein",
    year = "2005",
    language = "English (US)",
    volume = "3408",
    pages = "300--314",
    editor = "D.E. Losada and J.M. Fernandez-Luna",
    booktitle = "Lecture Notes in Computer Science",

    }

    TY - GEN
    T1 - On compression-based text classification
    AU - Marton, Yuval
    AU - Wu, Ning
    AU - Hellerstein, Lisa
    PY - 2005
    Y1 - 2005
    UR - http://www.scopus.com/inward/record.url?scp=24644522810&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=24644522810&partnerID=8YFLogxK
    M3 - Conference contribution
    VL - 3408
    SP - 300
    EP - 314
    BT - Lecture Notes in Computer Science
    A2 - Losada, D.E.
    A2 - Fernandez-Luna, J.M.
    ER -