Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition

Mai Oudah, Khaled Shaalan

Research output: Contribution to journal, Article

Abstract

This paper reports extensive experiments on the impact of different feature categories, applied both in isolation and incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach into a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has proved comparatively more efficient as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used to train and test the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
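
As a rough illustration of the hybrid design described in the abstract, the sketch below shows one way a rule-based tag could be combined with word-level, POS, gazetteer, and contextual features into a single feature row per token. Everything in it is an assumption made for illustration: the gazetteer, the trigger words, the toy rule, the feature names, and the transliterated example sentence are hypothetical and are not taken from the paper. Morphological features are omitted because they would require a morphological analyzer.

```python
from typing import Dict, List, Set

# Hypothetical toy gazetteer of Person names (transliterated); not from the paper.
PERSON_GAZETTEER: Set[str] = {"khaled", "mai", "ahmed"}

# Hypothetical trigger words that often precede a Person name.
TITLE_TRIGGERS: Set[str] = {"dr", "mr", "president"}


def normalize(token: str) -> str:
    """Lowercase a token and drop a trailing period (e.g. 'Dr.' -> 'dr')."""
    return token.lower().rstrip(".")


def rule_based_tag(tokens: List[str], i: int) -> str:
    """Toy rule: tag token i as PER if it is in the gazetteer and follows a trigger word."""
    prev = normalize(tokens[i - 1]) if i > 0 else ""
    if normalize(tokens[i]) in PERSON_GAZETTEER and prev in TITLE_TRIGGERS:
        return "PER"
    return "O"


def extract_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build one feature row for token i, mixing rule output with other feature groups."""
    token = tokens[i]
    last = len(tokens) - 1
    return {
        # Tag predicted by the rule-based component
        "rule_tag": rule_based_tag(tokens, i),
        # Word-level features
        "length": len(token),
        "has_digit": any(ch.isdigit() for ch in token),
        # POS feature (tags are supplied by the caller)
        "pos": pos_tags[i],
        # Gazetteer feature
        "in_gazetteer": normalize(token) in PERSON_GAZETTEER,
        # Contextual features from the neighboring tokens
        "prev_is_trigger": i > 0 and normalize(tokens[i - 1]) in TITLE_TRIGGERS,
        "prev_rule_tag": rule_based_tag(tokens, i - 1) if i > 0 else "BOS",
        "next_rule_tag": rule_based_tag(tokens, i + 1) if i < last else "EOS",
    }


# Hypothetical example sentence with hand-assigned POS tags.
tokens = ["Dr.", "Khaled", "visited", "Dubai"]
pos_tags = ["NOUN", "PROPN", "VERB", "PROPN"]
rows = [extract_features(tokens, pos_tags, i) for i in range(len(tokens))]
```

Rows of this kind could then be vectorized and passed to a decision tree classifier, so that the rule-based prediction acts as one input among many and is not the final decision. Switching individual feature groups on and off in `extract_features` is one simple way to study categories in isolation and incrementally.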

Original language: English (US)
Pages (from-to): 351-378
Number of pages: 28
Journal: Language Resources and Evaluation
Volume: 51
Issue number: 2
DOI: 10.1007/s10579-016-9376-1
State: Published - Jun 1 2017

Keywords

  • Hybrid approach
  • Information extraction
  • Machine learning
  • Named entity recognition
  • Natural language processing
  • Rule-based approach

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences

Cite this

Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition. / Oudah, Mai; Shaalan, Khaled.

In: Language Resources and Evaluation, Vol. 51, No. 2, 01.06.2017, p. 351-378.

@article{789a8ca7eeea480783ea07e152db4dd1,
title = "Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition",
abstract = "This paper reports extensive experiments on the impact of different feature categories, applied both in isolation and incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach into a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has proved comparatively more efficient as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used to train and test the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.",
keywords = "Hybrid approach, Information extraction, Machine learning, Named entity recognition, Natural language processing, Rule-based approach",
author = "Mai Oudah and Khaled Shaalan",
year = "2017",
month = "6",
day = "1",
doi = "10.1007/s10579-016-9376-1",
language = "English (US)",
volume = "51",
pages = "351--378",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "2",

}

TY - JOUR

T1 - Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition

AU - Oudah, Mai

AU - Shaalan, Khaled

PY - 2017/6/1

Y1 - 2017/6/1

AB - This paper reports extensive experiments on the impact of different feature categories, applied both in isolation and incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach into a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has proved comparatively more efficient as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used to train and test the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.

KW - Hybrid approach

KW - Information extraction

KW - Machine learning

KW - Named entity recognition

KW - Natural language processing

KW - Rule-based approach

UR - http://www.scopus.com/inward/record.url?scp=84997236688&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84997236688&partnerID=8YFLogxK

U2 - 10.1007/s10579-016-9376-1

DO - 10.1007/s10579-016-9376-1

M3 - Article

VL - 51

SP - 351

EP - 378

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 2

ER -