Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches

Jonathan Kropko, Ben Goodrich, Andrew Gelman, Jennifer Hill

Research output: Contribution to journal › Article

Abstract

We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be "missing at random" (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.
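The two ideas the abstract relies on, generating missingness that is "missing at random" (MAR) and imputing a variable conditionally on the others, can be illustrated with a small sketch. This is not the authors' simulation design. The function names `make_mar_missing` and `conditional_impute_once`, the logistic missingness rule, the two-variable setup, and all parameter values are assumptions made purely for illustration. For brevity the imputation step holds the fitted regression parameters fixed rather than drawing them from a posterior, so it is a simplified stand-in for a full conditional MI procedure.

```python
import math
import random


def make_mar_missing(x_obs, y, rng, slope=1.5):
    """Blank out values of y under a MAR mechanism (illustrative assumption).

    The probability that y[i] is missing depends only on x_obs[i], which is
    always observed, and never on the value of y[i] itself.
    """
    y_missing = []
    for xi, yi in zip(x_obs, y):
        p_missing = 1.0 / (1.0 + math.exp(-slope * xi))  # logistic in x only
        y_missing.append(None if rng.random() < p_missing else yi)
    return y_missing


def conditional_impute_once(x_obs, y_missing, rng):
    """Produce one completed copy of y from a regression of y on x_obs.

    Simplification: the least-squares estimates are treated as fixed, and
    only residual noise is drawn for each imputed value.
    """
    pairs = [(xi, yi) for xi, yi in zip(x_obs, y_missing) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in pairs)
    sigma = math.sqrt(rss / (n - 2))  # residual standard deviation
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0.0, sigma)
        for xi, yi in zip(x_obs, y_missing)
    ]


rng = random.Random(0)
n = 2000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]          # fully observed variable
y = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]     # variable to be made incomplete
y_missing = make_mar_missing(x, y, rng)
completed_sets = [conditional_impute_once(x, y_missing, rng) for _ in range(5)]
```

Each element of `completed_sets` is one completed data set. In a multiple-imputation analysis the model of interest would be fit to each completed set and the results combined across sets.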

Original language: English (US)
Pages (from-to): 497-519
Number of pages: 23
Journal: Political Analysis
Volume: 22
Issue number: 4
DOI: 10.1093/pan/mpu007
State: Published - 2014

ASJC Scopus subject areas

  • Sociology and Political Science

Cite this

Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches. / Kropko, Jonathan; Goodrich, Ben; Gelman, Andrew; Hill, Jennifer.

In: Political Analysis, Vol. 22, No. 4, 2014, p. 497-519.

@article{19d736b005e841f39cd48340d9c050ff,
title = "Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches",
abstract = "We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be {"}missing at random{"} (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.",
author = "Jonathan Kropko and Ben Goodrich and Andrew Gelman and Jennifer Hill",
year = "2014",
doi = "10.1093/pan/mpu007",
language = "English (US)",
volume = "22",
pages = "497--519",
journal = "Political Analysis",
issn = "1047-1987",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Multiple imputation for continuous and categorical data

T2 - Comparing joint multivariate normal and conditional approaches

AU - Kropko, Jonathan

AU - Goodrich, Ben

AU - Gelman, Andrew

AU - Hill, Jennifer

PY - 2014

Y1 - 2014

AB - We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be "missing at random" (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.

UR - http://www.scopus.com/inward/record.url?scp=84942135544&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942135544&partnerID=8YFLogxK

U2 - 10.1093/pan/mpu007

DO - 10.1093/pan/mpu007

M3 - Article

VL - 22

SP - 497

EP - 519

JO - Political Analysis

JF - Political Analysis

SN - 1047-1987

IS - 4

ER -