Document classification for focused topics

Russell Power, Jay Chen, Trishank Karthik, Lakshminarayanan Subramanian

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there is a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses a combination of popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset by itself is insufficient. We have tested our algorithm across a wide range of development-centric topics.
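The abstract does not say how popularity and rarity are measured, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that "popular" keywords are the words found in the most topic documents (and relatively more often than in a background corpus), that "rare" keywords are topic words absent from the background corpus, and that the joint classifier requires a hit from both lists. The names `tokenize`, `doc_freq`, `extract_features`, `is_on_topic` and the toy corpora are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w]


def doc_freq(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def extract_features(topic_docs, background_docs, k=3):
    """Return (popular, rare) keyword sets for a topic.

    Assumed definitions (not taken from the paper):
      popular = top-k topic words by document frequency, restricted to words
                relatively more frequent in the topic than in the background
      rare    = remaining such words that never occur in the background
    """
    topic_df, bg_df = doc_freq(topic_docs), doc_freq(background_docs)
    n_topic, n_bg = len(topic_docs), len(background_docs)
    # Keep words whose topic document fraction exceeds the background fraction.
    candidates = [w for w in topic_df
                  if topic_df[w] * n_bg > bg_df[w] * n_topic]
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(candidates, key=lambda w: (-topic_df[w], w))
    popular = set(ranked[:k])
    rare = {w for w in candidates if bg_df[w] == 0 and w not in popular}
    return popular, rare


def is_on_topic(doc, popular, rare):
    """Joint rule: require at least one popular AND one rare keyword."""
    words = set(tokenize(doc))
    return bool(words & popular) and bool(words & rare)


# Toy corpora, invented for illustration only.
topic_docs = [
    "malaria treatment in rural clinics uses artemisinin",
    "malaria prevention with bednets in rural kenya",
    "malaria treatment and chloroquine resistance",
]
background_docs = [
    "stock markets in rural regions",
    "treatment of data and prevention of errors in software",
    "the weather in europe",
]

popular, rare = extract_features(topic_docs, background_docs, k=3)
print(sorted(popular))
print(is_on_topic("artemisinin is a malaria drug", popular, rare))
print(is_on_topic("malaria is mentioned in this news page", popular, rare))
```

In this toy setup a page that matches only a popular keyword, or only a rare one, is rejected, which mirrors the abstract's claim that each feature subset alone is insufficient. With such a naive tokenizer the rare list also picks up noise words, so a real implementation would need stop-word filtering and a much larger background corpus.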

Original language: English (US)
Title of host publication: Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report
Pages: 67-72
Number of pages: 6
Volume: SS-10-01
State: Published - 2010
Event: 2010 AAAI Spring Symposium - Stanford, CA, United States
Duration: Mar 22 2010 to Mar 24 2010

Fingerprint

Feature extraction
Classifiers

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Power, R., Chen, J., Karthik, T., & Subramanian, L. (2010). Document classification for focused topics. In Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report (Vol. SS-10-01, pp. 67-72)

@inproceedings{515c2c0801684ecc91fc19d61d9820db,
title = "Document classification for focused topics",
abstract = "Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there is a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses a combination of popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset by itself is insufficient. We have tested our algorithm across a wide range of development-centric topics.",
author = "Russell Power and Jay Chen and Trishank Karthik and Lakshminarayanan Subramanian",
year = "2010",
language = "English (US)",
isbn = "9781577354550",
volume = "SS-10-01",
pages = "67--72",
booktitle = "Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report",

}

TY - GEN

T1 - Document classification for focused topics

AU - Power, Russell

AU - Chen, Jay

AU - Karthik, Trishank

AU - Subramanian, Lakshminarayanan

PY - 2010

Y1 - 2010

AB - Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there is a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses a combination of popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset by itself is insufficient. We have tested our algorithm across a wide range of development-centric topics.

UR - http://www.scopus.com/inward/record.url?scp=77957940540&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77957940540&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781577354550

VL - SS-10-01

SP - 67

EP - 72

BT - Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report

ER -