Interrater agreement statistics with skewed data: Evaluation of alternatives to Cohen's kappa

Violet Shu Xu, Michael F. Lorber

Research output: Contribution to journal › Article

Abstract

Objective: In this study, we aimed to evaluate interrater agreement statistics (IRAS) for use in research on low base rate clinical diagnoses or observed behaviors. Establishing and reporting sufficient interrater agreement is essential in such studies. Yet the most commonly applied agreement statistic, Cohen's κ, has a well-known sensitivity to base rates that results in a substantial penalization of interrater agreement when behaviors or diagnoses are very uncommon, a prevalent and frustrating concern in such studies. Method: We performed Monte Carlo simulations to evaluate the performance of five of κ's alternatives (Van Eerdewegh's V, Yule's Y, Holley and Guilford's G, Scott's π, and Gwet's AC1), alongside κ itself. The simulations investigated the robustness of these IRAS to conditions that are common in clinical research, with varying levels of behavior or diagnosis base rate, rater bias, observed interrater agreement, and sample size. Results: When the base rate was 0.5, each IRAS provided similar estimates, particularly with unbiased raters. G was the least sensitive of the IRAS to base rates. Conclusions: The results encourage the use of the G statistic for its consistent performance across the simulation conditions. We recommend separately reporting the rates of agreement on the presence and absence of a behavior or diagnosis alongside G as an index of chance-corrected overall agreement.
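For readers who want to see how these indices differ mechanically, the sketch below computes observed agreement and the standard two-rater, binary-code (2 x 2 table) forms of κ, Scott's π, Gwet's AC1, Holley and Guilford's G, and Yule's Y; Van Eerdewegh's V is omitted. This is a minimal illustration based on textbook formulas, not code from the article, and the function name, cell labels, and example counts are assumptions chosen for demonstration.

import math

def agreement_stats(a, b, c, d):
    """Chance-corrected interrater agreement indices for two raters and a
    binary code, from a 2 x 2 table of counts:
      a = both raters code 'present'   b = rater 1 only codes 'present'
      c = rater 2 only codes 'present' d = both raters code 'absent'
    """
    n = a + b + c + d
    p_o = (a + d) / n  # observed proportion of agreement

    # Cohen's kappa: expected agreement from each rater's own marginals
    pe_kappa = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_o - pe_kappa) / (1 - pe_kappa)

    # Scott's pi: expected agreement from marginals averaged across raters
    q = (2 * a + b + c) / (2 * n)  # mean proportion coded 'present'
    pe_pi = q ** 2 + (1 - q) ** 2
    pi = (p_o - pe_pi) / (1 - pe_pi)

    # Gwet's AC1: chance agreement modeled as 2 * q * (1 - q)
    pe_ac1 = 2 * q * (1 - q)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)

    # Holley and Guilford's G: agreements minus disagreements, i.e. 2 * p_o - 1
    g = (a + d - b - c) / n

    # Yule's Y (coefficient of colligation)
    y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

    return {"p_obs": p_o, "kappa": kappa, "pi": pi, "AC1": ac1, "G": g, "Y": y}

# Hypothetical low base rate example: 5 agreed-present, 90 agreed-absent, and
# 5 disagreements out of 100 cases. Observed agreement is 0.95 and G is 0.90,
# but kappa drops to roughly 0.64 because 'present' is so uncommon.
print(agreement_stats(a=5, b=3, c=2, d=90))

With a balanced base rate and the same 90% observed agreement (for example a = d = 45, b = c = 5), all of the indices above equal 0.80, consistent with the abstract's observation that the statistics converge when the base rate is 0.5.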

Original language: English (US)
Pages (from-to): 1219-1227
Number of pages: 9
Journal: Journal of Consulting and Clinical Psychology
Volume: 82
Issue number: 6
DOIs: 10.1037/a0037489
State: Published - Jan 1 2014

Keywords

  • Behavior observation
  • Diagnosis
  • Interrater agreement
  • Low base rate
  • Skew

ASJC Scopus subject areas

  • Clinical Psychology
  • Psychiatry and Mental health

Cite this

Interrater agreement statistics with skewed data: Evaluation of alternatives to Cohen's kappa. / Xu, Violet Shu; Lorber, Michael F.

In: Journal of Consulting and Clinical Psychology, Vol. 82, No. 6, 01.01.2014, p. 1219-1227.

