CLUSTER ANALYSIS OF ENGLISH TEXT.

Godfried Toussaint, Rajjan Shinghal

Research output: Contribution to conference (Paper)

Abstract

Large English texts on ten different subject matters were compiled. Estimates were obtained of the n-gram and word-length probability distributions for each of the texts as well as for English as a whole. Experiments were done to test for pairwise differences among the ten texts. Principal component analysis and hierarchical clustering analysis were applied to the data to discover similarities and dissimilarities among the texts. Estimates were obtained of the first-, second-, and third-order entropies for each text, and the texts were tested for pairwise differences according to their first-order entropy estimates. The results are of interest to researchers in psychology, biology, anthropology, and computational linguistics, as well as pattern recognition.
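
As an illustration of the kind of estimate the abstract describes, the following Python sketch computes a relative-frequency estimate of a character n-gram distribution and the corresponding entropy. It is not the authors' procedure: the function names are hypothetical, the n-grams are assumed to be overlapping character sequences, and the "n-th order entropy" is assumed here to be the Shannon entropy of the n-gram distribution divided by n. The paper may use a different definition, such as a conditional entropy.

```python
from collections import Counter
from math import log2


def ngram_distribution(text, n):
    """Relative-frequency estimate of the overlapping character n-gram distribution."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def block_entropy(text, n):
    """Shannon entropy, in bits, of the estimated n-gram distribution."""
    return -sum(p * log2(p) for p in ngram_distribution(text, n).values())


def entropy_per_character(text, n):
    """Block entropy divided by n (an assumed definition of n-th order entropy)."""
    return block_entropy(text, n) / n


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    for order in (1, 2, 3):
        print(order, round(entropy_per_character(sample, order), 3))
```

On short samples these estimates are biased, since most possible n-grams never appear; the large texts mentioned in the abstract are what make such estimates meaningful.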

Original language: English (US)
Pages: 164-117
Number of pages: 48
State: Published - Jan 1 2017
Event: Proc IEEE Comput Soc Conf Pattern Recognition Image Process - Chicago, IL, USA
Duration: May 31 1978 to Jun 2 1978

Other

Other: Proc IEEE Comput Soc Conf Pattern Recognition Image Process
City: Chicago, IL, USA
Period: 5/31/78 - 6/2/78

Fingerprint

Cluster analysis
Entropy
Computational linguistics
Principal component analysis
Probability distributions
Pattern recognition
Experiments

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Toussaint, G., & Shinghal, R. (2017). CLUSTER ANALYSIS OF ENGLISH TEXT. 164-117. Paper presented at Proc IEEE Comput Soc Conf Pattern Recognition Image Process, Chicago, IL, USA.

CLUSTER ANALYSIS OF ENGLISH TEXT. / Toussaint, Godfried; Shinghal, Rajjan.

2017. 164-117. Paper presented at Proc IEEE Comput Soc Conf Pattern Recognition Image Process, Chicago, IL, USA.

Toussaint, G & Shinghal, R 2017, 'CLUSTER ANALYSIS OF ENGLISH TEXT.', Paper presented at Proc IEEE Comput Soc Conf Pattern Recognition Image Process, Chicago, IL, USA, 5/31/78 - 6/2/78 pp. 164-117.
Toussaint G, Shinghal R. CLUSTER ANALYSIS OF ENGLISH TEXT. 2017. Paper presented at Proc IEEE Comput Soc Conf Pattern Recognition Image Process, Chicago, IL, USA.
Toussaint, Godfried ; Shinghal, Rajjan. / CLUSTER ANALYSIS OF ENGLISH TEXT. Paper presented at Proc IEEE Comput Soc Conf Pattern Recognition Image Process, Chicago, IL, USA. 48 p.
@conference{bd92211e2e1242958bd544e49612a5f8,
title = "CLUSTER ANALYSIS OF ENGLISH TEXT.",
abstract = "Large English texts on ten different subject matters were compiled. Estimates were obtained of the n-gram and word-length probability distributions for each of the texts as well as for English as a whole. Experiments were done to test for pairwise differences among the ten texts. Principal component analysis and hierarchical clustering analysis were applied to the data to discover similarities and dissimilarities among the texts. Estimates were obtained of the first-, second-, and third-order entropies for each text, and the texts were tested for pairwise differences according to their first-order entropy estimates. The results are of interest to researchers in psychology, biology, anthropology, and computational linguistics, as well as pattern recognition.",
author = "Godfried Toussaint and Rajjan Shinghal",
year = "2017",
month = "1",
day = "1",
language = "English (US)",
pages = "164--117",
note = "Proc IEEE Comput Soc Conf Pattern Recognition Image Process ; Conference date: 31-05-1978 Through 02-06-1978",

}

TY - CONF

T1 - CLUSTER ANALYSIS OF ENGLISH TEXT.

AU - Toussaint, Godfried

AU - Shinghal, Rajjan

PY - 2017/1/1

Y1 - 2017/1/1

N2 - Large English texts on ten different subject matters were compiled. Estimates were obtained of the n-gram and word-length probability distributions for each of the texts as well as for English as a whole. Experiments were done to test for pairwise differences among the ten texts. Principal component analysis and hierarchical clustering analysis were applied to the data to discover similarities and dissimilarities among the texts. Estimates were obtained of the first-, second-, and third-order entropies for each text, and the texts were tested for pairwise differences according to their first-order entropy estimates. The results are of interest to researchers in psychology, biology, anthropology, and computational linguistics, as well as pattern recognition.

UR - http://www.scopus.com/inward/record.url?scp=0018280884&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0018280884&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:0018280884

SP - 164

EP - 117

ER -