Treating words as data with error: Uncertainty in text statements of policy positions

Kenneth Benoit, Michael Laver, Slava Mikhaylov

    Research output: Contribution to journal, Article

    Abstract

    Political text offers extraordinary potential as a source of information about the policy positions of political actors. Despite recent advances in computational text analysis, human interpretative coding of text remains an important source of text-based data, ultimately required to validate more automatic techniques. The profession's main source of cross-national, time-series data on party policy positions comes from the human interpretative coding of party manifestos by the Comparative Manifesto Project (CMP). Despite widespread use of these data, the uncertainty associated with each point estimate has never been available, undermining the value of the dataset as a scientific resource. We propose a remedy. First, we characterize processes by which CMP data are generated. These include inherently stochastic processes of text authorship, as well as of the parsing and coding of observed text by humans. Second, we simulate these error-generating processes by bootstrapping analyses of coded quasi-sentences. This allows us to estimate precise levels of nonsystematic error for every category and scale reported by the CMP for its entire set of 3,000-plus manifestos. Using our estimates of these errors, we show how to correct biased inferences, in recent prominently published work, derived from statistical analyses of error-contaminated CMP data.
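
The bootstrapping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' actual procedure. It assumes a manifesto is represented as a list of category labels, one per coded quasi-sentence, and it resamples that list with replacement to approximate the nonsystematic error in each category's share. The function name `bootstrap_category_shares`, the category labels, and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import stdev


def bootstrap_category_shares(codes, n_boot=1000, seed=0):
    """Resample a manifesto's coded quasi-sentences with replacement.

    codes: list of category labels, one per coded quasi-sentence.
    Returns {category: (observed_share, bootstrap_standard_error)}.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(codes)
    categories = sorted(set(codes))
    observed = Counter(codes)
    draws = {c: [] for c in categories}
    for _ in range(n_boot):
        # One bootstrap replicate: n quasi-sentences drawn with replacement.
        resample = Counter(rng.choices(codes, k=n))
        for c in categories:
            draws[c].append(resample[c] / n)
    # The spread of the replicate shares estimates the sampling error.
    return {c: (observed[c] / n, stdev(draws[c])) for c in categories}


# Hypothetical toy manifesto: 200 quasi-sentences in three made-up categories.
toy_codes = ["welfare"] * 60 + ["economy"] * 100 + ["uncoded"] * 40
result = bootstrap_category_shares(toy_codes)
for category, (share, se) in result.items():
    print(f"{category}: share={share:.3f}, bootstrap SE={se:.3f}")
```

Under this simplification the standard error shrinks roughly with the square root of the number of quasi-sentences, so shorter manifestos yield noisier category shares than longer ones.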

    Original language: English (US)
    Pages (from-to): 495-513
    Number of pages: 19
    Journal: American Journal of Political Science
    Volume: 53
    Issue number: 2
    DOI: 10.1111/j.1540-5907.2009.00383.x
    State: Published - Apr 2009

    Fingerprint

    uncertainty
    coding
    text analysis
    political actor
    source of information
    remedies
    time series
    profession
    resources
    Values

    ASJC Scopus subject areas

    • Sociology and Political Science

    Cite this

    Treating words as data with error: Uncertainty in text statements of policy positions. / Benoit, Kenneth; Laver, Michael; Mikhaylov, Slava.

    In: American Journal of Political Science, Vol. 53, No. 2, 04.2009, p. 495-513.

    Benoit, Kenneth; Laver, Michael; Mikhaylov, Slava. / Treating words as data with error: Uncertainty in text statements of policy positions. In: American Journal of Political Science. 2009; Vol. 53, No. 2. pp. 495-513.
    @article{ec4ac727e8bf4e7ba9193e2d9cd97048,
    title = "Treating words as data with error: Uncertainty in text statements of policy positions",
    abstract = "Political text offers extraordinary potential as a source of information about the policy positions of political actors. Despite recent advances in computational text analysis, human interpretative coding of text remains an important source of text-based data, ultimately required to validate more automatic techniques. The profession's main source of cross-national, time-series data on party policy positions comes from the human interpretative coding of party manifestos by the Comparative Manifesto Project (CMP). Despite widespread use of these data, the uncertainty associated with each point estimate has never been available, undermining the value of the dataset as a scientific resource. We propose a remedy. First, we characterize processes by which CMP data are generated. These include inherently stochastic processes of text authorship, as well as of the parsing and coding of observed text by humans. Second, we simulate these error-generating processes by bootstrapping analyses of coded quasi-sentences. This allows us to estimate precise levels of nonsystematic error for every category and scale reported by the CMP for its entire set of 3,000-plus manifestos. Using our estimates of these errors, we show how to correct biased inferences, in recent prominently published work, derived from statistical analyses of error-contaminated CMP data.",
    author = "Kenneth Benoit and Michael Laver and Slava Mikhaylov",
    year = "2009",
    month = "4",
    doi = "10.1111/j.1540-5907.2009.00383.x",
    language = "English (US)",
    volume = "53",
    pages = "495--513",
    journal = "American Journal of Political Science",
    issn = "0092-5853",
    publisher = "Wiley-Blackwell",
    number = "2",

    }

    TY - JOUR

    T1 - Treating words as data with error

    T2 - Uncertainty in text statements of policy positions

    AU - Benoit, Kenneth

    AU - Laver, Michael

    AU - Mikhaylov, Slava

    PY - 2009/4

    Y1 - 2009/4

    N2 - Political text offers extraordinary potential as a source of information about the policy positions of political actors. Despite recent advances in computational text analysis, human interpretative coding of text remains an important source of text-based data, ultimately required to validate more automatic techniques. The profession's main source of cross-national, time-series data on party policy positions comes from the human interpretative coding of party manifestos by the Comparative Manifesto Project (CMP). Despite widespread use of these data, the uncertainty associated with each point estimate has never been available, undermining the value of the dataset as a scientific resource. We propose a remedy. First, we characterize processes by which CMP data are generated. These include inherently stochastic processes of text authorship, as well as of the parsing and coding of observed text by humans. Second, we simulate these error-generating processes by bootstrapping analyses of coded quasi-sentences. This allows us to estimate precise levels of nonsystematic error for every category and scale reported by the CMP for its entire set of 3,000-plus manifestos. Using our estimates of these errors, we show how to correct biased inferences, in recent prominently published work, derived from statistical analyses of error-contaminated CMP data.

    UR - http://www.scopus.com/inward/record.url?scp=63749095089&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=63749095089&partnerID=8YFLogxK

    U2 - 10.1111/j.1540-5907.2009.00383.x

    DO - 10.1111/j.1540-5907.2009.00383.x

    M3 - Article

    AN - SCOPUS:63749095089

    VL - 53

    SP - 495

    EP - 513

    JO - American Journal of Political Science

    JF - American Journal of Political Science

    SN - 0092-5853

    IS - 2

    ER -