SPLIT

Smart preprocessing (quasi) language independent tool

Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum, Ramy Eskander

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as much as feasible, are among the goals of our scientific enterprise; building systems that ensure consistency across our various pipelines would therefore contribute significantly to those goals. The problem has become quite pronounced as NLP tools have grown more abundant and available, yet with differing levels of specification. In this paper, we present SPLIT, a dynamic, unified preprocessing framework and tool that is highly configurable to user requirements and serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by providing a unified API that can be exchanged among researchers to ensure complete transparency in replication. The user can select the required preprocessing tasks from a long list of preprocessing steps. The user can also specify the order of execution, which in turn affects the final preprocessing output.
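
    The claim that the order of execution affects the final output can be illustrated with a minimal sketch. This is not SPLIT's API, which this record does not describe; the step names, the `STEPS` registry, and `run_pipeline` are hypothetical and exist only to show how the same user-selected steps can give different results when reordered.

```python
import re
from typing import Callable, Dict, List

# A preprocessing step maps a string to a string.
Step = Callable[[str], str]


def split_hyphens(text: str) -> str:
    """Replace hyphens with spaces so hyphenated compounds become separate tokens."""
    return text.replace("-", " ")


def strip_punctuation(text: str) -> str:
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def lowercase(text: str) -> str:
    """Lowercase the whole string."""
    return text.lower()


# Hypothetical registry of available steps (illustrative names only).
STEPS: Dict[str, Step] = {
    "split_hyphens": split_hyphens,
    "strip_punctuation": strip_punctuation,
    "lowercase": lowercase,
}


def run_pipeline(text: str, step_names: List[str]) -> str:
    """Apply the selected steps to the text in the order given by the user."""
    for name in step_names:
        text = STEPS[name](text)
    return text


sample = "state-of-the-art NLP!"

# Hyphens are split before punctuation is removed, so the words stay separate.
print(run_pipeline(sample, ["split_hyphens", "strip_punctuation", "lowercase"]))

# Punctuation (including hyphens) is removed first, so the compound collapses.
print(run_pipeline(sample, ["strip_punctuation", "split_hyphens", "lowercase"]))
```

    Both runs use the same three steps; only the order differs. Recording the ordered list of step names alongside an experiment is one simple way to make a preprocessing configuration exchangeable between researchers.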

    Original language: English (US)
    Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
    Publisher: European Language Resources Association (ELRA)
    Pages: 4055-4060
    Number of pages: 6
    ISBN (Electronic): 9782951740891
    State: Published - Jan 1 2016
    Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
    Duration: May 23 2016 to May 28 2016

    Other

    Other: 10th International Conference on Language Resources and Evaluation, LREC 2016
    Country: Slovenia
    City: Portoroz
    Period: 5/23/16 to 5/28/16

    Keywords

    • Corpus linguistics
    • NLP
    • Text preprocessing

    ASJC Scopus subject areas

    • Linguistics and Language
    • Library and Information Sciences
    • Language and Linguistics
    • Education

    Cite this

    Al-Badrashiny, M., Pasha, A., Diab, M., Habash, N., Rambow, O., Salloum, W., & Eskander, R. (2016). SPLIT: Smart preprocessing (quasi) language independent tool. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 4055-4060). European Language Resources Association (ELRA).

    @inproceedings{aa99367fa09845eba84a79b2663ac726,
    title = "SPLIT: Smart preprocessing (quasi) language independent tool",
    abstract = "Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover replicability and comparability, as much as feasible, is one of the goals of our scientific enterprise, thus building systems that can ensure the consistency in our various pipelines would contribute significantly to our goals. The problem has become quite pronounced with the abundance of NLP tools becoming more and more available yet with different levels of specifications. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements which serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by allowing for a unified API that could be exchanged across different researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks among a long list of preprocessing steps. The user is also able to specify the order of execution which in turn affects the final preprocessing output.",
    keywords = "Corpus linguistics, NLP, Text preprocessing",
    author = "Mohamed Al-Badrashiny and Arfath Pasha and Mona Diab and Nizar Habash and Owen Rambow and Wael Salloum and Ramy Eskander",
    year = "2016",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "4055--4060",
    booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",
    publisher = "European Language Resources Association (ELRA)",

    }

    TY - GEN

    T1 - SPLIT

    T2 - Smart preprocessing (quasi) language independent tool

    AU - Al-Badrashiny, Mohamed

    AU - Pasha, Arfath

    AU - Diab, Mona

    AU - Habash, Nizar

    AU - Rambow, Owen

    AU - Salloum, Wael

    AU - Eskander, Ramy

    PY - 2016/1/1

    Y1 - 2016/1/1

    AB - Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover replicability and comparability, as much as feasible, is one of the goals of our scientific enterprise, thus building systems that can ensure the consistency in our various pipelines would contribute significantly to our goals. The problem has become quite pronounced with the abundance of NLP tools becoming more and more available yet with different levels of specifications. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements which serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by allowing for a unified API that could be exchanged across different researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks among a long list of preprocessing steps. The user is also able to specify the order of execution which in turn affects the final preprocessing output.

    KW - Corpus linguistics

    KW - NLP

    KW - Text preprocessing

    UR - http://www.scopus.com/inward/record.url?scp=85037079965&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85037079965&partnerID=8YFLogxK

    M3 - Conference contribution

    SP - 4055

    EP - 4060

    BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

    PB - European Language Resources Association (ELRA)

    ER -