GLUE: A multi-task benchmark and analysis platform for natural language understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel Bowman

    Research output: Contribution to conference › Paper

    Abstract

    For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.
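
    The benchmark itself ships as a set of task datasets, a leaderboard, and a diagnostic suite rather than a single model. As a minimal sketch of the kind of per-task evaluation GLUE standardizes, the snippet below loads one of the low-resource GLUE tasks (CoLA) through the Hugging Face `datasets` package and scores a trivial majority-class baseline; the library choice and the baseline are illustrative assumptions, not the toolkit released with the paper.

        # Illustrative sketch (assumption: the Hugging Face `datasets` package is
        # installed, e.g. `pip install datasets`); this is not the paper's own toolkit.
        from collections import Counter
        from datasets import load_dataset

        # CoLA (Corpus of Linguistic Acceptability) is one of the GLUE tasks
        # with comparatively little training data.
        cola = load_dataset("glue", "cola")

        # Majority-class baseline: always predict the most frequent training label.
        majority_label = Counter(cola["train"]["label"]).most_common(1)[0][0]

        # Score on the validation split (GLUE test labels are withheld for the
        # leaderboard, so local evaluation is typically done on validation).
        val_labels = cola["validation"]["label"]
        accuracy = sum(y == majority_label for y in val_labels) / len(val_labels)
        print(f"CoLA majority-class validation accuracy: {accuracy:.3f}")

    Note that the official GLUE scorer reports Matthews correlation for CoLA; plain accuracy is used above only to keep the sketch short.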

    Original language: English (US)
    State: Published - Jan 1 2019
    Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
    Duration: May 6, 2019 - May 9, 2019

    Conference

    Conference: 7th International Conference on Learning Representations, ICLR 2019
    Country: United States
    City: New Orleans
    Period: 5/6/19 - 5/9/19

    ASJC Scopus subject areas

    • Education
    • Computer Science Applications
    • Linguistics and Language
    • Language and Linguistics

    Cite this

    Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.
