Abstract
A significant portion of the data generated on blogging and microblogging websites is non-credible, as many recent studies have shown. To filter out such non-credible information, machine learning can be used to build automatic credibility classifiers. However, as is the case with most supervised machine learning approaches, a sufficiently large and accurate training dataset must be available. In this paper, we focus on building a public Arabic corpus of blogs and microblogs that can be used for credibility classification. We focus on Arabic because of the recent popularity of blogs and microblogs in the Arab World and the lack of any such public corpora in Arabic. We discuss our data acquisition approach and annotation process, provide a rigorous analysis of the annotated data, and finally report some results on the effectiveness of our data for credibility classification.
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 |
Publisher | European Language Resources Association (ELRA) |
Pages | 4396-4401 |
Number of pages | 6 |
ISBN (Electronic) | 9782951740891 |
State | Published - Jan 1 2016 |
Event | 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia Duration: May 23 2016 → May 28 2016 |
Other
Other | 10th International Conference on Language Resources and Evaluation, LREC 2016 |
---|---|
Country | Slovenia |
City | Portoroz |
Period | 5/23/16 → 5/28/16 |
Keywords
- Blogs
- Credibility
- Crowdsourcing
ASJC Scopus subject areas
- Linguistics and Language
- Library and Information Sciences
- Language and Linguistics
- Education
Cite this
Arabic corpora for credibility analysis. / Al Zaatari, Ayman; El Ballouli, Rim; Elbassuoni, Shady; El-Hajj, Wassim; Hajj, Hazem; Shaban, Khaled; Habash, Nizar; Yehya, Emad.
Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), 2016. p. 4396-4401.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Arabic corpora for credibility analysis
AU - Al Zaatari, Ayman
AU - El Ballouli, Rim
AU - Elbassuoni, Shady
AU - El-Hajj, Wassim
AU - Hajj, Hazem
AU - Shaban, Khaled
AU - Habash, Nizar
AU - Yehya, Emad
PY - 2016/1/1
Y1 - 2016/1/1
KW - Blogs
KW - Credibility
KW - Crowdsourcing
KW - Twitter
UR - http://www.scopus.com/inward/record.url?scp=85037125324&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037125324&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85037125324
SP - 4396
EP - 4401
BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
PB - European Language Resources Association (ELRA)
ER -