Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations

Holger L. Kern, Elizabeth A. Stuart, Jennifer Hill, Donald P. Green

Research output: Contribution to journal › Article

Abstract

Randomized experiments are considered the gold standard for causal inference because they can provide unbiased estimates of treatment effects for the experimental participants. However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments because the experimental participants may be unrepresentative of the target population of interest. This article examines whether generalization may be assisted by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that utilize models of the outcome. Two simulation studies and one empirical analysis investigate and compare the methods’ performance. One simulation uses purely simulated data while the other utilizes data from an evaluation of a school-based dropout prevention program. Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multisite experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind. Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.
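The reweighting strategy the abstract describes is, in its simplest form, inverse-odds weighting on the probability of sample membership. The sketch below is a minimal illustrative reconstruction, not the authors' code: the names (generalization_weights, weighted_pate) and the logistic-regression choice are assumptions. It fits a model for P(in experiment | X) on the stacked experimental and target-population covariates, weights each experimental unit by the odds of population membership, and takes a weighted difference in means.

```python
# Minimal sketch of reweighting an experiment toward a target population.
# Illustrative only; function and variable names are assumptions, not the
# authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def generalization_weights(X_exp, X_pop):
    """Weights that make experimental units resemble the target
    population, via a model for P(in experiment | X)."""
    X = np.vstack([X_exp, X_pop])
    in_exp = np.concatenate([np.ones(len(X_exp)), np.zeros(len(X_pop))])
    model = LogisticRegression(max_iter=1000).fit(X, in_exp)
    p_exp = model.predict_proba(X_exp)[:, 1]   # P(sample membership | X)
    return (1.0 - p_exp) / p_exp               # odds of population membership

def weighted_pate(y, treated, w):
    """Weighted difference in means: the reweighted estimate of the
    population average treatment effect."""
    t = treated.astype(bool)
    return (np.average(y[t], weights=w[t])
            - np.average(y[~t], weights=w[~t]))
```

The same weights could feed a weighted outcome regression instead of a raw difference in means; either way, the scheme is defensible only under the ignorability assumption the abstract emphasizes, namely that selection into the experiment depends solely on the observed covariates. When that assumption fails, the abstract notes, every adjustment method examined performs poorly.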

Original language: English (US)
Pages (from-to): 103-127
Number of pages: 25
Journal: Journal of Research on Educational Effectiveness
Volume: 9
Issue number: 1
DOI: 10.1080/19345747.2015.1060282
State: Published - Jan 2 2016
Scopus record: http://www.scopus.com/inward/record.url?scp=84957845406&partnerID=8YFLogxK

Keywords

  • Bayesian Additive Regression Trees
  • external validity
  • generalizability
  • propensity score weighting
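The first keyword, Bayesian Additive Regression Trees, names the outcome-modeling route the paper evaluates: fit flexible response surfaces inside the experiment and extrapolate them to the target population. A minimal sketch of that idea follows, with scikit-learn's gradient boosting standing in for BART (which is typically fit with dedicated R packages); the function name and details are illustrative assumptions, not the paper's implementation.

```python
# Outcome-model extrapolation (g-computation) sketch; gradient boosting
# stands in for BART purely for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def model_based_pate(X_exp, treated, y_exp, X_pop):
    """Fit treated and control response surfaces on the experimental
    data, predict both potential outcomes for each target-population
    member, and average the differences."""
    t = treated.astype(bool)
    m1 = GradientBoostingRegressor().fit(X_exp[t], y_exp[t])    # treated surface
    m0 = GradientBoostingRegressor().fit(X_exp[~t], y_exp[~t])  # control surface
    return float(np.mean(m1.predict(X_pop) - m0.predict(X_pop)))
```

Because the response surfaces are fit only on experimental data, predictions for population members whose covariates fall outside the experiment's support are pure extrapolation, which is exactly where the abstract warns that even flexible, state-of-the-art learners still lean on strong assumptions.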

ASJC Scopus subject areas

  • Education

Cite this

Kern, H. L., Stuart, E. A., Hill, J., & Green, D. P. (2016). Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations. Journal of Research on Educational Effectiveness, 9(1), 103-127. https://doi.org/10.1080/19345747.2015.1060282

@article{45f73a599cfa45ddb7ec6e1bfb883e8e,
title = "Assessing Methods for Generalizing Experimental Impact Estimates to Target Populations",
keywords = "Bayesian Additive Regression Trees, external validity, generalizability, propensity score weighting",
author = "Kern, {Holger L.} and Stuart, {Elizabeth A.} and Jennifer Hill and Green, {Donald P.}",
year = "2016",
month = "1",
day = "2",
doi = "10.1080/19345747.2015.1060282",
language = "English (US)",
volume = "9",
pages = "103--127",
journal = "Journal of Research on Educational Effectiveness",
issn = "1934-5747",
publisher = "Routledge",
number = "1",
}
