What is it about?

This paper describes a way to use large administrative datasets to strengthen causal inference and program evaluation--without sacrificing any of the rigor of a carefully designed observational study. Doing so can both reduce bias and increase the precision of causal estimates. One approach to estimating the effect of an intervention without a randomized trial is matching: identify a set of control subjects who are similar to the treated subjects on a set of important covariates, and compare the treated subjects to their matched controls. This paper proposes a method to supplement that approach using data from control subjects who are not as similar to the treated subjects, along with a much larger set of covariates. The method uses machine learning techniques to predict the matched subjects' outcomes as a function of all available covariates, using the dissimilar control subjects as a training set. The treatment effect is then estimated on the matched subjects' prediction errors, rather than on the outcomes themselves. The paper shows that this method can reduce confounding in a number of ways, and illustrates it by applying it to a "whole-school reform" program in Arizona.
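
To make the logic concrete, here is a minimal sketch in Python of that residualization step. The function, data, and variable names are hypothetical, and the random forest stands in for whatever flexible predictor a researcher might choose--this is an illustration of the idea, not the paper's own code:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def rebar_effect(X_train_ctrl, y_train_ctrl, X_matched, y_matched, treated):
        # 1. Fit a flexible predictor of the outcome using the control
        #    subjects who were left OUT of the match, with all covariates.
        model = RandomForestRegressor(n_estimators=500, random_state=0)
        model.fit(X_train_ctrl, y_train_ctrl)

        # 2. Compute prediction errors (residuals) for the matched sample.
        residuals = y_matched - model.predict(X_matched)

        # 3. Estimate the treatment effect on the residuals; here, a simple
        #    difference in mean residuals between matched treated and control
        #    subjects ('treated' is a boolean numpy array).
        return residuals[treated].mean() - residuals[~treated].mean()

Because the predictor is trained only on subjects outside the matched sample, overfitting the training data cannot by itself manufacture a spurious treatment effect in step 3.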


Why is it important?

Reams of administrative educational data are available and remain a mostly untapped resource. This method suggests a way to make better use of these data, in all their vastness, to arrive at preliminary assessments of what works and what doesn't in education.

Perspectives

As a graduate student, I was involved in projects estimating the effectiveness of school-wide programs. The design I had planned--a propensity-score matching study--was quickly overwhelmed by the sheer volume of publicly available data. I downloaded thousands of variables, pertaining to thousands of schools, from state departments of education and the NCES Common Core of Data. On the one hand, there is no substitute for a careful study design based on well-understood covariates--the sort of design that, for both technical and practical reasons, cannot make use of more than a small fraction of the available data. On the other hand, modern machine learning methods could potentially extract a great deal of useful information from the data I had downloaded--but these methods can easily mislead when applied to poorly understood data. Rebar, the method in the paper, let me have the best of both worlds: my collaborators and I constructed traditional propensity-score matches, but used machine learning to reinforce those designs with all of the available data. This paper shows how researchers can use big data and machine learning to both reduce confounding bias and increase the precision of a matching study, without sacrificing the benefits of a careful design.

Adam Sales
University of Texas College of Education

Read the Original

This page is a summary of: Rebar: Reinforcing a Matching Estimator With Predictions From High-Dimensional Covariates, Journal of Educational and Behavioral Statistics, October 2017, American Educational Research Association (AERA).
DOI: 10.3102/1076998617731518.