Text Preprocessing for Unsupervised Learning

Why It Matters, When It Misleads, and What to Do about It

Matthew J. Denny, Arthur Spirling

    Research output: Contribution to journal (Article)

    Abstract

    Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
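
    The kind of sensitivity check the abstract describes can be illustrated with a minimal sketch. This is not the authors' procedure or software: the toy corpus, the stopword list, the three preprocessing steps, and the scoring rule (mean absolute change in pairwise cosine distances relative to no preprocessing) are all assumptions chosen for illustration.

```python
import itertools
import math
import string
from collections import Counter

# Hypothetical toy stopword list and corpus, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}
DOCS = [
    "The budget of the government is growing.",
    "A growing budget, and a growing deficit!",
    "Voters want the government to cut taxes.",
]
# Assumed set of binary preprocessing steps; a real study may use others.
STEPS = ("lowercase", "remove_punct", "remove_stopwords")


def preprocess(text, lowercase, remove_punct, remove_stopwords):
    """Tokenize on whitespace after applying the selected steps."""
    if lowercase:
        text = text.lower()
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def pairwise_distances(docs, **flags):
    """Cosine distances between every pair of documents under one regime."""
    counts = [Counter(preprocess(d, **flags)) for d in docs]
    pairs = itertools.combinations(range(len(docs)), 2)
    return [cosine_distance(counts[i], counts[j]) for i, j in pairs]


def sensitivity(docs):
    """Map each regime to its mean absolute change in pairwise distances
    relative to the no-preprocessing baseline."""
    baseline = pairwise_distances(docs, **dict.fromkeys(STEPS, False))
    report = {}
    for choice in itertools.product([False, True], repeat=len(STEPS)):
        dists = pairwise_distances(docs, **dict(zip(STEPS, choice)))
        diffs = [abs(d - b) for d, b in zip(dists, baseline)]
        report[choice] = sum(diffs) / len(baseline)
    return report


if __name__ == "__main__":
    ranked = sorted(sensitivity(DOCS).items(), key=lambda kv: kv[1])
    for choice, score in ranked:
        active = [s for s, on in zip(STEPS, choice) if on] or ["none"]
        print(f"{'+'.join(active):45s} {score:.3f}")
```

    Regimes with larger scores move the documents further from where they sit with no preprocessing, which is one rough signal that downstream unsupervised results may depend on that choice.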

    Original language: English (US)
    Pages (from-to): 168-189
    Number of pages: 22
    Journal: Political Analysis
    Volume: 26
    Issue number: 2
    DOI: 10.1017/pan.2017.44
    State: Published - Apr 1 2018

    Fingerprint

    learning
    statistical method
    political science
    popularity
    regime
    literature
    software

    Keywords

    • descriptive statistics
    • statistical analysis of texts
    • unsupervised learning

    ASJC Scopus subject areas

    • Sociology and Political Science
    • Political Science and International Relations

    Cite this

    Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It. / Denny, Matthew J.; Spirling, Arthur.

    In: Political Analysis, Vol. 26, No. 2, 01.04.2018, p. 168-189.

    Denny, Matthew J.; Spirling, Arthur. / Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It. In: Political Analysis. 2018; Vol. 26, No. 2. pp. 168-189.
    @article{fab183648b0444b4aaffdffa5df48d09,
    title = "Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It",
    abstract = "Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.",
    keywords = "descriptive statistics, statistical analysis of texts, unsupervised learning",
    author = "Denny, {Matthew J.} and Arthur Spirling",
    year = "2018",
    month = "4",
    day = "1",
    doi = "10.1017/pan.2017.44",
    language = "English (US)",
    volume = "26",
    pages = "168--189",
    journal = "Political Analysis",
    issn = "1047-1987",
    publisher = "Cambridge University Press",
    number = "2",

    }

    TY - JOUR

    T1 - Text Preprocessing for Unsupervised Learning

    T2 - Why It Matters, When It Misleads, and What to Do about It

    AU - Denny, Matthew J.

    AU - Spirling, Arthur

    PY - 2018/4/1

    Y1 - 2018/4/1

    N2 - Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.

    AB - Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.

    KW - descriptive statistics

    KW - statistical analysis of texts

    KW - unsupervised learning

    UR - http://www.scopus.com/inward/record.url?scp=85046535232&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85046535232&partnerID=8YFLogxK

    U2 - 10.1017/pan.2017.44

    DO - 10.1017/pan.2017.44

    M3 - Article

    VL - 26

    SP - 168

    EP - 189

    JO - Political Analysis

    JF - Political Analysis

    SN - 1047-1987

    IS - 2

    ER -