Abstract
Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset, while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset and show that, while this performs better than multi-label learning, it requires large amounts of data to be successful.
Original language | English (US) |
---|---|
Title of host publication | Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings |
Editors | Shlomi Dolev, Sachin Lodha |
Publisher | Springer-Verlag |
Pages | 115-135 |
Number of pages | 21 |
ISBN (Print) | 9783319600796 |
DOIs | 10.1007/978-3-319-60080-2_9 |
State | Published - Jan 1 2017 |
Event | 1st International Conference on Cyber Security Cryptography and Machine Learning, CSCML 2017 - Beer-Sheva, Israel; Duration: Jun 29 2017 → Jun 30 2017 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10332 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 1st International Conference on Cyber Security Cryptography and Machine Learning, CSCML 2017 |
---|---|
Country | Israel |
City | Beer-Sheva |
Period | 6/29/17 → 6/30/17 |
Fingerprint
Keywords
- Authorship attribution
- Machine learning
- Multi-label learning
- Stylometry
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)
Cite this
Stylometric authorship attribution of collaborative documents. / Dauber, Edwin; Overdorf, Rebekah; Greenstadt, Rachel.
Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings. ed. / Shlomi Dolev; Sachin Lodha. Springer-Verlag, 2017. p. 115-135 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10332 LNCS).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Stylometric authorship attribution of collaborative documents
AU - Dauber, Edwin
AU - Overdorf, Rebekah
AU - Greenstadt, Rachel
PY - 2017/1/1
Y1 - 2017/1/1
N2 - Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset, and show that while this performs better than multi-label learning it requires large amounts of data to be successful.
KW - Authorship attribution
KW - Machine learning
KW - Multi-label learning
KW - Stylometry
UR - http://www.scopus.com/inward/record.url?scp=85021710228&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85021710228&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-60080-2_9
DO - 10.1007/978-3-319-60080-2_9
M3 - Conference contribution
AN - SCOPUS:85021710228
SN - 9783319600796
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 115
EP - 135
BT - Cyber Security Cryptography and Machine Learning - 1st International Conference, CSCML 2017, Proceedings
A2 - Dolev, Shlomi
A2 - Lodha, Sachin
PB - Springer-Verlag
ER -