A hybrid approach to Arabic named entity recognition

Khaled Shaalan, Mai Oudah

Research output: Contribution to journal (Article)

Abstract

In this paper, we propose a hybrid named entity recognition (NER) approach that combines the advantages of rule-based and machine learning-based (ML-based) approaches in order to improve overall system performance and to overcome both the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision tree, Support Vector Machine and logistic regression classifiers to evaluate system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when each is applied independently. More importantly, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.
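
The F-measures quoted in the abstract combine precision and recall per entity type. The abstract does not state the matching criterion or the F-measure weighting, so the following is only a minimal sketch that assumes exact span matching and the balanced F1 score. The function name `f_measure_by_type` and the toy spans are hypothetical and are not taken from the paper or from ANERcorp.

```python
from collections import Counter


def f_measure_by_type(gold, predicted):
    """Return {entity_type: (precision, recall, f1)}.

    Each entity is a (start, end, entity_type) tuple. A prediction counts
    as correct only if span and type both match a gold entity exactly
    (an assumption; the abstract does not specify the criterion).
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = Counter(etype for (_, _, etype) in gold & predicted)
    gold_count = Counter(etype for (_, _, etype) in gold)
    pred_count = Counter(etype for (_, _, etype) in predicted)

    scores = {}
    for etype in set(gold_count) | set(pred_count):
        tp = true_pos[etype]
        precision = tp / pred_count[etype] if pred_count[etype] else 0.0
        recall = tp / gold_count[etype] if gold_count[etype] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[etype] = (precision, recall, f1)
    return scores


# Toy annotations (hypothetical token offsets, not real corpus data).
gold = [(0, 2, "Person"), (5, 6, "Location"),
        (9, 11, "Organization"), (14, 15, "Person")]
predicted = [(0, 2, "Person"), (5, 6, "Location"),
             (9, 10, "Organization"), (14, 15, "Location")]

scores = f_measure_by_type(gold, predicted)
for etype, (p, r, f1) in sorted(scores.items()):
    print(f"{etype}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy run the Organization prediction has the wrong span boundary and one Person is mislabelled as Location, so both errors lower the per-type scores. A more lenient partial-match scheme would score the same output differently.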

Original language: English (US)
Pages (from-to): 67-87
Number of pages: 21
Journal: Journal of Information Science
Volume: 40
Issue number: 1
DOI: 10.1177/0165551513502417
State: Published - Feb 1 2014

Keywords

  • hybrid approach
  • information extraction
  • information retrieval
  • machine learning approach
  • named entity recognition
  • natural language processing
  • rule-based approach

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

A hybrid approach to Arabic named entity recognition. / Shaalan, Khaled; Oudah, Mai.

In: Journal of Information Science, Vol. 40, No. 1, 01.02.2014, p. 67-87.

@article{e965e694f8b7499481d353577cc63e78,
title = "A hybrid approach to Arabic named entity recognition",
keywords = "hybrid approach, information extraction, information retrieval, machine learning approach, named entity recognition, natural language processing, rule-based approach",
author = "Khaled Shaalan and Mai Oudah",
year = "2014",
month = "2",
day = "1",
doi = "10.1177/0165551513502417",
language = "English (US)",
volume = "40",
pages = "67--87",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",
number = "1",

}

TY - JOUR

T1 - A hybrid approach to Arabic named entity recognition

AU - Shaalan, Khaled

AU - Oudah, Mai

PY - 2014/2/1

Y1 - 2014/2/1

KW - hybrid approach

KW - information extraction

KW - information retrieval

KW - machine learning approach

KW - named entity recognition

KW - natural language processing

KW - rule-based approach

UR - http://www.scopus.com/inward/record.url?scp=84892754368&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84892754368&partnerID=8YFLogxK

U2 - 10.1177/0165551513502417

DO - 10.1177/0165551513502417

M3 - Article

AN - SCOPUS:84892754368

VL - 40

SP - 67

EP - 87

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

IS - 1

ER -