We consider studies of cohorts of individuals after a critical event, such as an injury, with the following characteristics. First, the studies are designed to measure "input" variables, which describe the period before the critical event, and to characterize the distribution of the input variables in the cohort. Second, the studies are designed to measure "output" variables, primarily mortality after the critical event, and to characterize the predictive (conditional) distribution of mortality given the input variables in the cohort. Such studies often possess the complication that the input data are missing for those who die shortly after the critical event because the data collection takes place after the event. Standard methods of dealing with the missing inputs, such as imputation or weighting methods based on an assumption of ignorable missingness, are known to be generally invalid when the missingness of inputs is nonignorable, that is, when the distribution of the inputs differs between those who die and those who live. To address this issue, we propose a novel design that obtains and uses information on an additional key variable: a treatment or externally controlled variable that, if set at its "effective" level, could have prevented the death of those who died. We show that the new design can be used to draw valid inferences for the marginal distribution of inputs in the entire cohort, and for the conditional distribution of mortality given the inputs, also in the entire cohort, even under nonignorable missingness. The crucial framework that we use is principal stratification based on the potential outcomes, here mortality under both levels of treatment. We also show, using illustrative preliminary injury data, that our approach can reveal results that are more reasonable than those of standard methods, in relatively dramatic ways. Thus, our approach suggests that the routine collection of data on variables that could be used as possible treatments in such studies of inputs and mortality should become common.
Consider a statistical analysis that draws causal inferences from an observational dataset, inferences that are presented as being valid in the standard frequentist senses; i.e., the analysis produces: (1) consistent point estimates, (2) valid p-values, valid in the sense of rejecting true null hypotheses at the nominal level or less often, and/or (3) confidence intervals that are presented as having at least their nominal coverage for their estimands. For these statements to be valid, even hypothetically, the analysis must embed the observational study in a hypothetical randomized experiment that created the observed data, or a subset of that hypothetical randomized data set. This multistage effort with thought-provoking tasks involves: (1) a purely conceptual stage that precisely formulates the causal question in terms of a hypothetical randomized experiment in which the exposure is assigned to units; (2) a design stage that approximates a randomized experiment before any outcome data are observed; (3) a statistical analysis stage comparing the outcomes of interest in the exposed and non-exposed units of the hypothetical randomized experiment; and (4) a summary stage providing conclusions about statistical evidence for the sizes of possible causal effects. Stages 2 and 3 may rely on modern computing to implement the effort, whereas Stage 1 demands careful scientific argumentation to make the embedding plausible to scientific readers of the proffered statistical analysis. Otherwise, the resulting analysis is vulnerable to criticism for being simply a presentation of scientifically meaningless arithmetic calculations. The conceptually most demanding tasks are often the most scientifically interesting to the dedicated researcher and readers of the resulting statistical analyses. This perspective is rarely implemented with any rigor; for example, analyses often completely eschew the first stage. We illustrate our approach using an example examining the effect of parental smoking on children's lung function, with data collected from families living in East Boston in the 1970s.
Models for analyzing multivariate data sets with missing values require strong, often unassessable, assumptions. The most common of these is that the mechanism that created the missing data is ignorable, a twofold assumption dependent on the mode of inference. The first part, which is the focus here, under the Bayesian and direct-likelihood paradigms, requires that the missing data are missing at random; in contrast, the frequentist-likelihood paradigm demands that the missing data mechanism always produces missing at random data, a condition known as missing always at random. Under certain regularity conditions, assuming missing always at random leads to an assumption that can be tested using the observed data alone, namely, that the missing data indicators depend only on fully observed variables. Here, we propose three different diagnostic tests that not only indicate when this assumption is incorrect but also suggest which variables are the most likely culprits. Although missing always at random is not a necessary condition to ensure validity under the Bayesian and direct-likelihood paradigms, it is sufficient, and evidence for its violation should encourage the careful statistician to conduct targeted sensitivity analyses.
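As a rough illustration of how such a diagnostic might look (this is a generic sketch, not one of the paper's three proposed tests; the function, data frame, and variable names are hypothetical), one can ask whether the observed part of another partially observed variable still predicts a variable's missingness after conditioning on the fully observed variables:

import statsmodels.api as sm
from scipy.stats import chi2

def missingness_diagnostic(df, fully_observed, target, candidate):
    """Illustrative check: does the observed part of `candidate` predict
    missingness of `target` beyond the fully observed variables?
    Returns a likelihood-ratio p-value (small values suggest a violation)."""
    sub = df.dropna(subset=[candidate])          # units where candidate is observed
    r = sub[target].isna().astype(int)           # missingness indicator for target
    X0 = sm.add_constant(sub[fully_observed])
    X1 = sm.add_constant(sub[fully_observed + [candidate]])
    fit0 = sm.Logit(r, X0).fit(disp=0)           # fully observed predictors only
    fit1 = sm.Logit(r, X1).fit(disp=0)           # plus the candidate variable
    lr = 2 * (fit1.llf - fit0.llf)
    return chi2.sf(lr, df=1)

# Hypothetical usage: df has fully observed age and sex, partially observed x1 and x2
# p = missingness_diagnostic(df, ["age", "sex"], target="x1", candidate="x2")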
Factorial designs are widely used in agriculture, engineering, and the social sciences to study the causal effects of several factors simultaneously on a response. The objective of such a design is to estimate all factorial effects of interest, which typically include main effects and interactions among factors. To estimate factorial effects with high precision when a large number of pre-treatment covariates are present, balance among covariates across treatment groups should be ensured. We propose utilizing rerandomization to ensure covariate balance in factorial designs. Although both factorial designs and rerandomization have been discussed before, the combination has not. Here, theoretical properties of rerandomization for factorial designs are established, and empirical results are explored using an application from the New York Department of Education.
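To convey the basic idea in a 2x2 factorial setting (an illustrative sketch only; the Mahalanobis criterion follows Morgan and Rubin, but the acceptance threshold, the equal-split assignment, and the assumed even number of units are simplifications, not the paper's procedure):

import numpy as np

def mahalanobis_balance(X, z):
    """Mahalanobis distance between covariate means of the z==1 and z==0 groups."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    n1, n0 = (z == 1).sum(), (z == 0).sum()
    return (n1 * n0 / len(z)) * d @ S_inv @ d

def rerandomize_factorial(X, threshold=2.0, seed=0):
    """Draw 2x2 factorial assignments (each factor given to half the units,
    n assumed even) and accept only when both main-effect splits are balanced."""
    rng = np.random.default_rng(seed)
    n = len(X)
    while True:
        z1 = rng.permutation(np.repeat([0, 1], n // 2))   # factor 1 assignment
        z2 = rng.permutation(np.repeat([0, 1], n // 2))   # factor 2 assignment
        if (mahalanobis_balance(X, z1) < threshold and
                mahalanobis_balance(X, z2) < threshold):
            return z1, z2

# Hypothetical usage with 100 units and 5 covariates:
# X = np.random.default_rng(1).normal(size=(100, 5)); z1, z2 = rerandomize_factorial(X)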
A common complication that can arise with analyses of high-dimensional data is the repeated use of hypothesis tests. A second complication, especially with small samples, is the reliance on asymptotic p-values. Our proposed approach for addressing both complications uses a scientifically motivated scalar summary statistic, and although not entirely novel, it seems rarely used. The method is illustrated using a crossover study of seventeen participants examining the effect of exposure to ozone versus clean air on the DNA methylome, where the multivariate outcome involved 484,531 genomic locations. Our proposed test yields a single null randomization distribution, and thus a single Fisher-exact p-value that is statistically valid whatever the structure of the data. However, the relevance and power of the resultant test require the careful a priori selection of a single test statistic. The common practice of using asymptotic p-values or meaningless thresholds for "significance" is inapposite in general.
Keywords: Big data, Causal inference, Randomization-based tests, Sharp null hypotheses, Fisherian inference, Fisher-exact p-value, Test statistic, Randomized crossover experiment, Large P, small N data, Ozone, Air pollution, Epigenetics
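A minimal sketch of such a test, assuming the scalar summary is the mean over all loci of each participant's ozone-minus-clean-air difference and that there is no period effect in the crossover (both are illustrative assumptions, not the paper's choices):

import numpy as np

def scalar_summary_randomization_test(diffs, n_draws=10000, seed=0):
    """diffs: (participants x loci) matrix of within-person ozone-minus-air
    differences. Under the sharp null (and no period effect), each person's
    difference vector could have had its sign flipped by the randomized order,
    so the null distribution is generated by random sign flips."""
    rng = np.random.default_rng(seed)
    observed = np.abs(diffs.mean())                 # one scalar over all loci and people
    null_stats = np.empty(n_draws)
    for b in range(n_draws):
        signs = rng.choice([-1, 1], size=diffs.shape[0])
        null_stats[b] = np.abs((signs[:, None] * diffs).mean())
    # Monte Carlo approximation to the single Fisher-exact p-value
    return (1 + np.sum(null_stats >= observed)) / (1 + n_draws)

# Hypothetical usage: 17 participants, 1,000 loci of simulated differences
# diffs = np.random.default_rng(1).normal(size=(17, 1000))
# p = scalar_summary_randomization_test(diffs)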
By 'partially post hoc' subgroup analyses, we mean analyses that compare existing data from a randomized experiment (from which a subgroup specification is derived) to new, subgroup-only experimental data. We describe a motivating example in which partially post hoc subgroup analyses instigated statistical debate about a medical device's efficacy. We clarify the source of such analyses' invalidity and then propose a randomization-based approach for generating valid posterior predictive p-values for such partially post hoc subgroups. Lastly, we investigate the approach's operating characteristics in a simple illustrative setting through a series of simulations, showing that it can have desirable properties under both null and alternative hypotheses.
Knowledge of the effect of unearned income on the economic behavior of individuals in general, and on labor supply in particular, is of great importance to policy makers. Estimation of income effects, however, is a difficult problem because income is not randomly assigned and exogenous changes in income are difficult to identify. Here we exploit the randomized assignment of large amounts of money over long periods of time through lotteries. We carried out a survey of people who played the lottery in the mid-eighties and estimate the effect of lottery winnings on their subsequent earnings, labor supply, consumption, and savings. We find that winning a modest prize ($15,000 per year for twenty years) does not affect labor supply or earnings substantially. Winning such a prize does not considerably reduce savings. Winning a much larger prize ($80,000 rather than $15,000 per year) reduces labor supply as measured by hours, as well as participation and social security earnings; elasticities for hours and earnings are around -0.20 and for participation around -0.14. Winning a large versus modest amount also leads to increased expenditures on cars and larger home values, although mortgage values appear to increase by approximately the same amount. Winning $80,000 increases overall savings, although savings in retirement accounts are not significantly affected. The results do not vary much by gender, age, or prior employment status. There is some evidence that for those with zero earnings prior to winning the lottery there is a positive effect of winning a small prize on subsequent labor market participation.
Propensity score methods were proposed by Rosenbaum and Rubin (1983, Biometrika) as central tools to help assess the causal effects of interventions. Since their introduction two decades ago, they have found wide application in a variety of areas, including medical research, economics, epidemiology, and education, especially in those situations where randomized experiments are either difficult to perform, or raise ethical questions, or would require extensive delays before answers could be obtained. Rubin (1997, Annals of Internal Medicine) provides an introduction to some of the essential ideas. In the past few years, the number of published applications using propensity score methods to evaluate medical and epidemiological interventions has increased dramatically. Rubin (2003, Erlbaum) provides a summary, which is already out of date. Nevertheless, thus far, there have been few applications of propensity score methods to evaluate marketing interventions (e.g., advertising, promotions), where the tradition is to use generally inappropriate techniques that focus on the prediction of an outcome from an indicator for the intervention and background characteristics (such as least-squares regression, data mining, etc.). With these techniques, an estimated parameter in the model is used to estimate some global "causal" effect. This practice can generate grossly incorrect answers that can be self-perpetuating: polishing the Ferraris rather than the Jeeps "causes" them to continue to win more races than the Jeeps, just as visiting the high-prescribing doctors rather than the low-prescribing doctors "causes" them to continue to write more prescriptions. This presentation will take "causality" seriously, not just as a casual concept implying some predictive association in a data set, and will show why propensity score methods are superior in practice to the standard predictive approaches for estimating causal effects. The results of our approach are estimates of individual-level causal effects, which can be used as building blocks for more complex components, such as response curves. We will also show how the standard predictive approaches can have important supplemental roles to play, both for refining estimates of individual-level causal effects and for assessing how these causal effects might vary as a function of background information, both important uses for situations when targeting an audience and/or allocating resources are critical objectives. The first step in a propensity score analysis is to estimate the individual scores, and there are various ways to do this in practice, the most common
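The abstract is truncated at this point. As a hedged, generic sketch of one common way to estimate propensity scores in practice (not necessarily the presentation's own approach), a logistic regression of the intervention indicator on background covariates can be used; the function and arrays below are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensity_scores(X, z):
    """Fit a logistic regression of the intervention indicator z (0/1)
    on background covariates X and return estimated propensity scores."""
    model = LogisticRegression(max_iter=1000).fit(X, z)
    return model.predict_proba(X)[:, 1]

# Hypothetical usage: X is an (n x p) covariate array, z an n-vector of 0/1
# ps = estimate_propensity_scores(X, z)
# Quick diagnostic: compare the score distributions between the two groups
# print(np.quantile(ps[z == 1], [0.1, 0.5, 0.9]), np.quantile(ps[z == 0], [0.1, 0.5, 0.9]))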
Mahalanobis distance of covariate means between treatment and control groups is often adopted as a balance criterion when implementing a rerandomization strategy. However, this criterion may not work well for high-dimensional cases because it balances all orthogonalized covariates equally. We propose using principal component analysis (PCA) to identify proper subspaces in which Mahalanobis distance should be calculated. Not only can PCA effectively reduce the dimensionality for high-dimensional covariates, but it also provides computational simplicity by focusing on the top orthogonal components. The PCA rerandomization scheme has desirable theoretical properties for balancing covariates and thereby improving the estimation of average treatment effects. This conclusion is supported by numerical studies using both simulated and real examples.
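A rough sketch of the scheme as described (the number of components, the acceptance threshold, the equal split, and the assumed even sample size are illustrative assumptions, not the paper's recommendations):

import numpy as np
from sklearn.decomposition import PCA

def pca_rerandomize(X, n_components=5, threshold=2.0, seed=0):
    """Accept a random treatment/control split only when the Mahalanobis
    distance of group means, computed on the top principal components,
    falls below the threshold (n assumed even)."""
    rng = np.random.default_rng(seed)
    scores = PCA(n_components=n_components).fit_transform(X)  # top components
    n = len(X)
    S_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    while True:
        z = rng.permutation(np.repeat([0, 1], n // 2))
        d = scores[z == 1].mean(axis=0) - scores[z == 0].mean(axis=0)
        if (n / 4) * d @ S_inv @ d < threshold:   # n1*n0/n with equal halves
            return z

# Hypothetical usage: 100 units, 50 covariates
# X = np.random.default_rng(1).normal(size=(100, 50)); z = pca_rerandomize(X)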
We used a randomized crossover experiment to estimate the effects of ozone (vs. clean air) exposure on genome-wide DNA methylation of target bronchial epithelial cells, using 17 volunteers, each randomly exposed on two separate occasions to clean air or 0.3 ppm ozone for two hours. Twenty-four hours after exposure, participants underwent bronchoscopy to collect epithelial cells whose DNA methylation was measured using the Illumina 450K platform. We performed global and regional tests examining the ozone versus clean air effect on the DNA methylome and calculated Fisher-exact p-values for a series of univariate tests. We found little evidence of an overall effect of ozone on the DNA methylome but some suggestive changes in PLSCR1, HCAR1, and LINC00336 DNA methylation after ozone exposure relative to clean air. We observed some participant-to-participant heterogeneity in ozone responses.
Proceedings of the National Academy of Sciences of the United States of America, Jul 23, 2020
In randomized experiments, Fisher-exact P values are available and should be used to help evaluate results rather than the more commonly reported asymptotic P values. One reason is that using the latter can effectively alter the question being addressed by including irrelevant distributional assumptions. The Fisherian statistical framework, proposed in 1925, calculates a P value in a randomized experiment by using the actual randomization procedure that led to the observed data. Here, we illustrate this Fisherian framework in a crossover randomized experiment. First, we consider the first period of the experiment and analyze its data as a completely randomized experiment, ignoring the second period; then, we consider both periods. For each analysis, we focus on 10 outcomes that illustrate important differences between the asymptotic and Fisher tests for the null hypothesis of no ozone effect. For some outcomes, the traditional P value based on the approximating asymptotic Student's t distribution fell substantially below the minimum attainable Fisher-exact P value. For the other outcomes, the Fisher-exact null randomization distribution substantially differed from the bell-shaped one assumed by the asymptotic t test. Our conclusions: when researchers choose to report P values in randomized experiments, 1) Fisher-exact P values should be used, especially in studies with small sample sizes, and 2) the shape of the actual null randomization distribution should be examined for the recondite scientific insights it may reveal.
Keywords: asymptotic P values, crossover randomized experiments, Fisher-exact P values, sensitivity analyses, randomization-based inference
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." This famous sentence from John W. Tukey (ref. 1, p. 13) clearly affirms our position that calculating Fisher-exact P values is superior to the current more common practice of calculating approximating asymptotic (i.e., large sample) P values. We believe using the exact null randomization distribution generally addresses the right question, whereas using its approximating asymptotic distribution generally does not. Although randomized experimental studies support the calculation of Fisher-exact P values for sharp null hypotheses, many published analyses report potentially deceptive P values based on assumed asymptotic distributions of statistics. Our attitude with randomized experiments is to eschew asymptotic P values, used decades ago because of the lack of modern computing equipment, and instead examine the actual null randomization distributions, which are generated by the randomized procedure that was used to collect the data, as proposed since R. A. Fisher (2). Here, we illustrate the general statistical framework to assess Fisherian sharp null hypotheses using data from an epigenetic randomized experiment. The sharp null hypothesis investigated in this experiment is that exposure to ozone has the identical effect on the participant's outcome as exposure to clean air.
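To make the first-period analysis concrete, here is a hedged sketch of an exact randomization test for a completely randomized two-group comparison; the group sizes, test statistic, and data are hypothetical, and full enumeration is feasible only when the number of possible assignments is manageable:

import numpy as np
from itertools import combinations

def fisher_exact_p(y, z):
    """Exact randomization P value for the sharp null of no effect in a
    completely randomized two-group experiment: enumerate every way the
    observed number of treated units could have been chosen, and compare
    the absolute difference in means under each to the observed one."""
    y, z = np.asarray(y, float), np.asarray(z, int)
    n, n1 = len(y), int(z.sum())
    observed = abs(y[z == 1].mean() - y[z == 0].mean())
    count, total = 0, 0
    for treated in combinations(range(n), n1):       # all possible assignments
        mask = np.zeros(n, dtype=bool)
        mask[list(treated)] = True
        stat = abs(y[mask].mean() - y[~mask].mean())
        count += stat >= observed - 1e-12            # tolerance for float ties
        total += 1
    return count / total

# Hypothetical first-period data: 8 ozone-first and 9 clean-air-first participants
# y = np.array([...]); z = np.array([1]*8 + [0]*9); p = fisher_exact_p(y, z)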
The results of nonrandomized clinical experiments are often disputed because patients at greater risk may be overrepresented in some treatment groups. This paper proposes a simple technique providing insight into the range of plausible conclusions from a nonrandomized experiment with a binary outcome and an observed categorical covariate. The technique assesses the sensitivity of conclusions to assumptions about an unobserved binary covariate relevant to treatment assignment, and is illustrated in a medical study of coronary artery disease.
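To convey the flavor of such a sensitivity analysis (this is a deliberately simplified additive sketch, not the paper's parameterization), suppose that within a covariate stratum an unobserved binary covariate U shifts the outcome probability by an assumed amount delta and has prevalences p1 and p0 in the two treatment groups; under that additive assumption the U-adjusted risk difference is the observed difference minus delta*(p1 - p0), which can be tabulated over a grid of assumed values:

import numpy as np

def sensitivity_grid(observed_diff, deltas, prevalence_gaps):
    """Simplified additive sensitivity analysis: if an unobserved binary
    covariate U raises the outcome probability by `delta` within each
    treatment group, and its prevalence differs by `gap` = p1 - p0 between
    groups, the U-adjusted risk difference is observed_diff - delta * gap."""
    rows = []
    for delta in deltas:
        for gap in prevalence_gaps:
            rows.append((delta, gap, observed_diff - delta * gap))
    return np.array(rows)   # columns: delta, p1 - p0, adjusted risk difference

# Hypothetical stratum with an observed risk difference of 0.10:
# grid = sensitivity_grid(0.10, deltas=[0.0, 0.1, 0.2], prevalence_gaps=[-0.3, 0.0, 0.3])
# print(grid)  # shows how strong U would need to be to explain away the effect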
The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score, where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
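As an illustrative sketch of application (ii) (quintile subclasses and size-weighting are common conventions, not prescriptions from the paper; the arrays are hypothetical):

import numpy as np

def subclassification_estimate(y, z, ps, n_subclasses=5):
    """Average the treated-minus-control mean difference within propensity
    score subclasses (quintiles by default), weighting by subclass size."""
    edges = np.quantile(ps, np.linspace(0, 1, n_subclasses + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # include the endpoints
    estimate, total_weight = 0.0, 0
    for k in range(n_subclasses):
        in_k = (ps > edges[k]) & (ps <= edges[k + 1])
        if z[in_k].sum() == 0 or (1 - z[in_k]).sum() == 0:
            continue                               # skip subclasses lacking one group
        diff = y[in_k & (z == 1)].mean() - y[in_k & (z == 0)].mean()
        estimate += in_k.sum() * diff
        total_weight += in_k.sum()
    return estimate / total_weight

# Hypothetical usage, with ps from any estimated propensity model:
# effect = subclassification_estimate(y, z, ps)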
Lord's Paradox is analyzed in terms of a simple mathematical model for causal inference. The resolution of Lord's Paradox from this perspective has two aspects. First, the descriptive, non-causal conclusions of the two hypothetical statisticians are both correct. They appear contradictory only because they describe quite different aspects of the data. Second, the causal inferences of the statisticians are neither correct nor incorrect, since they are based on different assumptions that our mathematical model makes explicit, but neither assumption can be tested using the data set that is described in the example. We identify these differing assumptions and show how each may be used to justify the differing causal conclusions of the two statisticians. In addition to analyzing the classic "diet" example which Lord used to introduce his paradox, we also examine three other examples that appear in the three papers where Lord discusses the paradox and related matters.
Journal of the Royal Statistical Society, Series B (Statistical Methodology), Mar 23, 2021
Blocking is commonly used in randomized experiments to increase efficiency of estimation. A generalization of blocking removes allocations with imbalance in covariate distributions between treated and control units, and then randomizes within the remaining set of allocations with balance. This idea of rerandomization was formalized by Morgan and Rubin (Annals of Statistics, 2012, 40, 1263-1282), who suggested using the Mahalanobis distance between treated and control covariate means as the criterion for removing unbalanced allocations. Kallus (Journal of the Royal Statistical Society, Series B: Statistical Methodology, 2018, 80, 85-112) proposed reducing the set of balanced allocations to the minimum. Here we discuss the implications of such an 'optimal' rerandomization design for inferences to the units in the sample and to the population from which the units in the sample were randomly drawn. We argue that, in general, it is a bad idea to seek the optimal design for an inference because that inference typically only reflects uncertainty from the random sampling of units, which is usually hypothetical, and not the randomization of units to treatment versus control.
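For readers unfamiliar with the Morgan and Rubin criterion mentioned above, the sketch below (with arbitrary illustrative thresholds and an assumed even sample size) estimates how the set of acceptable allocations shrinks as the balance threshold tightens, which is the direction the Kallus proposal pushes to its extreme:

import numpy as np

def acceptable_fraction(X, threshold, n_draws=5000, seed=0):
    """Estimate the fraction of equal-split allocations satisfying the
    Mahalanobis balance criterion M < threshold (Morgan and Rubin, 2012)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    accepted = 0
    for _ in range(n_draws):
        z = rng.permutation(np.repeat([0, 1], n // 2))
        d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
        if (n / 4) * d @ S_inv @ d < threshold:    # n1*n0/n with equal halves
            accepted += 1
    return accepted / n_draws

# Hypothetical illustration: tightening the threshold shrinks the acceptable set
# X = np.random.default_rng(1).normal(size=(50, 4))
# for a in [4.0, 2.0, 1.0, 0.5]:
#     print(a, acceptable_fraction(X, a))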
While some stakeholders presume that studying abroad distracts students from efficient pursuit of their programs of study, others regard education abroad as a high-impact practice that fosters student engagement and hence college completion. The Consortium for Analysis of Student Success through International Education (CASSIE) compiled semester-by-semester records from 221,981 students across 35 institutions. Of those students, 30,549 had studied abroad. Using nearest-neighbor matching techniques that accounted for a myriad of potentially confounding variables, along with matching on institution, the analysis found positive impacts of education abroad on graduation within 4 and 6 years and on cumulative GPA at graduation. A very small increase in credit hours earned emerged, counterbalanced by a small decrease in time-to-degree associated with studying abroad. Overall, the results warrant the conclusion that studying abroad does not impede timely graduation. To the contrary, encouraging students to study abroad promotes college completion. These results held similarly for students who had multiple study abroad experiences and who studied abroad for varying program lengths.
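A minimal sketch of nearest-neighbor matching with exact matching on institution (the standardization, one nearest control per treated student, and matching with replacement are illustrative assumptions, not CASSIE's exact procedure; the arrays are hypothetical):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_within_institution(X, treated, institution):
    """For each treated (study-abroad) unit, find the nearest untreated unit
    from the same institution in covariate space; returns index pairs."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # put covariates on one scale
    pairs = []
    for inst in np.unique(institution):
        t_idx = np.where(treated & (institution == inst))[0]
        c_idx = np.where(~treated & (institution == inst))[0]
        if len(t_idx) == 0 or len(c_idx) == 0:
            continue                               # no possible matches here
        nn = NearestNeighbors(n_neighbors=1).fit(X[c_idx])
        _, j = nn.kneighbors(X[t_idx])
        pairs.extend(zip(t_idx, c_idx[j.ravel()])) # matching with replacement
    return pairs

# Hypothetical usage: X (n x p) covariates, treated boolean array, institution codes
# pairs = match_within_institution(X, treated, institution)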
Rejoinder on Causal Inference Through Potential Outcomes and Principal Stratification: Application to Studies with ``Censoring'' Due to Death by D. B. Rubin [math.ST/0612783]