Stylometric authorship attribution of collaborative documents

Edwin Dauber, Rebekah Overdorf, Rachel Greenstadt

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset, and show that while this performs better than multi-label learning it requires large amounts of data to be successful.

    Original language: English (US)
    Title of host publication: Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings
    Editors: Shlomi Dolev, Sachin Lodha
    Publisher: Springer-Verlag
    Pages: 115-135
    Number of pages: 21
    ISBN (Print): 9783319600796
    DOI: https://doi.org/10.1007/978-3-319-60080-2_9
    State: Published - Jan 1 2017
    Event: 1st International Conference on Cyber Security Cryptography and Machine Learning, CSCML 2017 - Beer-Sheva, Israel
    Duration: Jun 29 2017 to Jun 30 2017

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10332 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Keywords

    • Authorship attribution
    • Machine learning
    • Multi-label learning
    • Stylometry

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science (all)

    Cite this

    Dauber, E., Overdorf, R., & Greenstadt, R. (2017). Stylometric authorship attribution of collaborative documents. In S. Dolev, & S. Lodha (Eds.), Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings (pp. 115-135). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10332 LNCS). Springer-Verlag. https://doi.org/10.1007/978-3-319-60080-2_9

    @inproceedings{629e2209c3274a9dbf2b6ccb4443c0cc,
    title = "Stylometric authorship attribution of collaborative documents",
    abstract = "Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset, and show that while this performs better than multi-label learning it requires large amounts of data to be successful.",
    keywords = "Authorship attribution, Machine learning, Multi-label learning, Stylometry",
    author = "Edwin Dauber and Rebekah Overdorf and Rachel Greenstadt",
    year = "2017",
    month = "1",
    day = "1",
    doi = "10.1007/978-3-319-60080-2_9",
    language = "English (US)",
    isbn = "9783319600796",
    series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    publisher = "Springer-Verlag",
    pages = "115--135",
    editor = "Shlomi Dolev and Sachin Lodha",
    booktitle = "Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings",

    }

    TY - GEN
    T1 - Stylometric authorship attribution of collaborative documents
    AU - Dauber, Edwin
    AU - Overdorf, Rebekah
    AU - Greenstadt, Rachel
    PY - 2017/1/1
    Y1 - 2017/1/1
    N2 - Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset, and show that while this performs better than multi-label learning it requires large amounts of data to be successful.

    KW - Authorship attribution
    KW - Machine learning
    KW - Multi-label learning
    KW - Stylometry
    UR - http://www.scopus.com/inward/record.url?scp=85021710228&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=85021710228&partnerID=8YFLogxK
    U2 - 10.1007/978-3-319-60080-2_9
    DO - 10.1007/978-3-319-60080-2_9
    M3 - Conference contribution
    AN - SCOPUS:85021710228
    SN - 9783319600796
    T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    SP - 115
    EP - 135
    BT - Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings
    A2 - Dolev, Shlomi
    A2 - Lodha, Sachin
    PB - Springer-Verlag
    ER -