SPLIT

Smart preprocessing (quasi) language independent tool

Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum, Ramy Eskander

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as far as feasible, are among the goals of our scientific enterprise; building systems that ensure consistency across our various pipelines would therefore contribute significantly to those goals. The problem has become quite pronounced as NLP tools become more and more widely available, yet with differing levels of specification. In this paper, we present SPLIT, a dynamic, unified preprocessing framework and tool that is highly configurable to user requirements and can serve as the preprocessing tool for several tools at once. SPLIT aims to standardize the implementation of the most important preprocessing steps through a unified API that researchers can exchange to ensure complete transparency in replication. The user can select the required preprocessing tasks from a long list of preprocessing steps and can also specify their order of execution, which in turn affects the final preprocessing output.
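The abstract's point that the order of execution affects the final output can be illustrated with a minimal sketch. This is not SPLIT's API: the step names, the `STEPS` registry, and `run_pipeline` are all hypothetical, chosen only to show how a user-selected, user-ordered list of steps might be applied.

```python
import re
from typing import Callable, Dict, List

# Hypothetical step type and registry; none of these names come from SPLIT.
Step = Callable[[str], str]


def replace_numbers(text: str) -> str:
    # Replace integers and simple decimals (e.g. "3.14") with a NUM placeholder.
    return re.sub(r"\d+(?:\.\d+)?", "NUM", text)


def strip_punctuation(text: str) -> str:
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)


def collapse_whitespace(text: str) -> str:
    # Squeeze runs of whitespace into single spaces and trim the ends.
    return " ".join(text.split())


STEPS: Dict[str, Step] = {
    "replace_numbers": replace_numbers,
    "strip_punctuation": strip_punctuation,
    "collapse_whitespace": collapse_whitespace,
}


def run_pipeline(text: str, step_names: List[str]) -> str:
    # Apply the selected steps strictly in the order the user listed them.
    for name in step_names:
        if name not in STEPS:
            raise ValueError(f"unknown preprocessing step: {name}")
        text = STEPS[name](text)
    return text


if __name__ == "__main__":
    sentence = "Pi is 3.14!"
    # Numbers are masked before the decimal point is stripped.
    print(run_pipeline(sentence, ["replace_numbers", "strip_punctuation", "collapse_whitespace"]))
    # The decimal point is stripped first, so "3.14" becomes two numbers.
    print(run_pipeline(sentence, ["strip_punctuation", "replace_numbers", "collapse_whitespace"]))
```

With the same three steps, the first ordering yields `Pi is NUM` and the second yields `Pi is NUM NUM`. Sharing the ordered list of step names alongside the text is one way such a configuration could be exchanged between researchers, under the assumptions stated above.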

Original language: English (US)
Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Publisher: European Language Resources Association (ELRA)
Pages: 4055-4060
Number of pages: 6
ISBN (Electronic): 9782951740891
State: Published - Jan 1 2016
Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: May 23 2016 to May 28 2016

Other

Other: 10th International Conference on Language Resources and Evaluation, LREC 2016
Country: Slovenia
City: Portoroz
Period: 5/23/16 to 5/28/16

Keywords

  • Corpus linguistics
  • NLP
  • Text preprocessing

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Language and Linguistics
  • Education

Cite this

Al-Badrashiny, M., Pasha, A., Diab, M., Habash, N., Rambow, O., Salloum, W., & Eskander, R. (2016). SPLIT: Smart preprocessing (quasi) language independent tool. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 4055-4060). European Language Resources Association (ELRA).

@inproceedings{aa99367fa09845eba84a79b2663ac726,
title = "SPLIT: Smart preprocessing (quasi) language independent tool",
keywords = "Corpus linguistics, NLP, Text preprocessing",
author = "Mohamed Al-Badrashiny and Arfath Pasha and Mona Diab and Nizar Habash and Owen Rambow and Wael Salloum and Ramy Eskander",
year = "2016",
month = "1",
day = "1",
language = "English (US)",
pages = "4055--4060",
booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - SPLIT

T2 - Smart preprocessing (quasi) language independent tool

AU - Al-Badrashiny, Mohamed

AU - Pasha, Arfath

AU - Diab, Mona

AU - Habash, Nizar

AU - Rambow, Owen

AU - Salloum, Wael

AU - Eskander, Ramy

PY - 2016/1/1

Y1 - 2016/1/1

KW - Corpus linguistics

KW - NLP

KW - Text preprocessing

UR - http://www.scopus.com/inward/record.url?scp=85037079965&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85037079965&partnerID=8YFLogxK

M3 - Conference contribution

SP - 4055

EP - 4060

BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

PB - European Language Resources Association (ELRA)

ER -