The DirectEffects package provides tools for researchers to estimate the controlled direct effect (CDE) of some treatment, net of the effect of some mediator (see Acharya, Blackwell, and Sen (2016)). This goal is often tricky in practice because there are covariates that causally affect the mediator and are causally affected by the treatment. Including such variables in a regression model, for instance, could lead to post-treatment bias when estimating the direct effect of treatment. A large class of models and estimation approaches has been developed to avoid this problem. Currently, the package implements only one approach to estimating the CDE, called sequential g-estimation. The idea behind this approach is that if we can estimate the effect of the mediator using a standard linear model, then we can estimate the direct effect of treatment without post-treatment bias by suitably “removing” that effect from the dependent variable.
The current version of the package is available on GitHub:
# install_github() is provided by the devtools package
devtools::install_github("mattblackwell/DirectEffects", ref = "master")
library(DirectEffects)
Below is a minimal working example of sequential g-estimation with the DirectEffects package. In the subsequent sections, we explain the process in detail.
data(ploughs)
ploughs$centered_ln_inc <- ploughs$ln_income - mean(ploughs$ln_income, na.rm = TRUE)
ploughs$centered_ln_incsq <- ploughs$centered_ln_inc^2
first <- lm(women_politics ~ plow + centered_ln_inc + centered_ln_incsq +
              agricultural_suitability + tropical_climate + large_animals +
              political_hierarchies + economic_complexity + rugged +
              years_civil_conflict + years_interstate_conflict + oil_pc +
              european_descent + communist_dummy + polity2_2000 +
              serv_va_gdp2000, data = ploughs)
direct <- sequential_g(formula = women_politics ~ plow + agricultural_suitability +
                         tropical_climate + large_animals + political_hierarchies +
                         economic_complexity + rugged | centered_ln_inc + centered_ln_incsq,
                       first_mod = first,
                       data = ploughs,
                       subset = rownames(ploughs) %in% rownames(model.matrix(first)))
The main quantity of interest in the DirectEffects package is the average controlled direct effect (ACDE). It can be illustrated with the following causal structure (Figure 3 in Acharya, Blackwell, and Sen).
The controlled direct effect defined for a given treatment (\(A_i = a\) vs. \(A_i = a^\prime\)) and a given value of the mediator (\(M_i = m\)) is \[CDE_{i}(a, a^\prime, m) = Y_i(a, m) - Y_i(a^\prime, m)\] and is the total of the dashed lines in the figure above. Thus, we hold the mediator constant across the two treatments.
Contrast this with the natural direct effect, the target of inference in mediation analysis, which is “natural” in the sense that we let the mediator take its potential outcome when treatment is equal to \(a\): \[NDE_{i}(a, a^\prime) = Y_i(a, M_i(a)) - Y_i(a^\prime, M_i(a)).\]
The NDE's counterpart, the natural indirect effect, is the effect that “flows through” the mediator, holding treatment constant: \[NIE_{i}(a, a^\prime) = Y_i(a, M_i(a)) - Y_i(a, M_i(a^\prime)).\] This quantity contains one potential outcome, \(Y_i(a, M_i(a^\prime))\), that is impossible to observe, since for a given individual \(i\) we can never observe treatment \(A_i = a\) together with the mediator value that would arise under \(A_i = a^\prime\).
The CDE is useful for discriminating between causal mechanisms, because the average total effect of treatment, \(\tau(a, a^\prime) \equiv E[Y_i(a) - Y_i(a^\prime)]\), can be decomposed as the sum of three quantities: (1) the average CDE across observations (ACDE), (2) the average natural indirect effect (ANIE), and (3) the reference interaction, a measure of how much the direct effect of \(A\) depends on a particular \(M_i = m\). For a binary mediator: \[\tau(a, a^\prime) = ACDE (a, a^\prime, m = 0) + ANIE (a, a^\prime) + E[M_i(a^\prime)[ CDE_i(a, a^\prime, m = 1) - CDE_i(a, a^\prime, m = 0)]].\]
Thus, if the ACDE is non-zero, it is evidence that the effect of \(A_i\) is not entirely explained by \(M_i\). We illustrate this with an empirical example.
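As a quick numerical check of this decomposition, the following sketch simulates unit-level potential outcomes with a binary treatment and a binary mediator and confirms that the three pieces add up to the average total effect. Everything here is simulated and hypothetical (the names M0, M1, and Y_pot are ours, not part of the package), and the reference interaction is weighted by the mediator's potential value under \(a^\prime\).

```r
set.seed(1)
n <- 1000

# Potential mediator values under a' = 0 and a = 1 (binary mediator)
M0 <- rbinom(n, 1, 0.3)
M1 <- rbinom(n, 1, 0.6)

# Unit-specific coefficients, so that all effects are heterogeneous
b0   <- rnorm(n)
b_a  <- rnorm(n, mean = 2)
b_m  <- rnorm(n, mean = 1.5)
b_am <- rnorm(n, mean = 0.5)

# Hypothetical potential outcome Y_i(a, m) with a treatment-mediator interaction
Y_pot <- function(a, m) b0 + b_a * a + b_m * m + b_am * a * m

tau   <- mean(Y_pot(1, M1) - Y_pot(0, M0))   # average total effect
acde0 <- mean(Y_pot(1, 0) - Y_pot(0, 0))     # ACDE at m = 0
anie  <- mean(Y_pot(1, M1) - Y_pot(1, M0))   # average natural indirect effect

# Reference interaction: E[M_i(a') * (CDE_i(m = 1) - CDE_i(m = 0))]
cde1 <- Y_pot(1, 1) - Y_pot(0, 1)
cde0 <- Y_pot(1, 0) - Y_pot(0, 0)
ref_int <- mean(M0 * (cde1 - cde0))

# The two sides agree up to floating-point error
all.equal(tau, acde0 + anie + ref_int)
```

Because the identity holds unit by unit, the check does not depend on the sample size or on the particular coefficient values chosen above.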
DirectEffects estimates the ACDE by sequential g-estimation, a type of structural nested mean model.
The key logic of sequential g-estimation is that, under the sequential unconfoundedness assumption, the ACDE can be identified as follows: \[E[Y_i(a, 0) - Y_i(0,0)|X_i = x] = E[Y_i - \gamma(a, M_i, x) | A_i = a, X_i = x] - E[Y_i - \gamma(0, M_i, x) | A_i = 0, X_i = x]\]
The function \(\gamma\) above is called the “demediation function” (or “blip-down function”) and is computed as follows. \[\gamma(a, m, x) = E[Y_i(a, m) - Y_i(a, 0) | X_i = x]\]
This function computes the effect of switching from some level of the mediator to 0, and thus is an estimate of the causal effect of the mediator for a fixed value of the treatment \(a\) and within levels of the covariates. Subtracting its estimates \(\gamma(A_i, M_i, X_i)\) from the outcome \(Y_i\) will effectively remove the variation due to the mediator from the outcome.
The identification of the ACDE holds under the sequential unconfoundedness condition, using the definitions by Acharya, Blackwell, and Sen.
Assumption 1 (Sequential Unconfoundedness)
First, no omitted variables for the effect of treatment (\(A_i\)) on the outcome (\(Y_i\)), conditional on the pretreatment confounders (\(X_i\)). Second, no omitted variables for the effect of the mediator on the outcome, conditional on the treatment, pretreatment confounders, and intermediate confounders (\(Z_i\)).
While the ACDE is nonparametrically identified under just this assumption, we need to make a further assumption to be able to use sequential g-estimation.
Assumption 2 (No intermediate interactions)
The effect of the mediator (\(M_i\)) on the outcome (\(Y_i\)) is independent of the intermediate confounders (\(Z_i\)).
Without this assumption, we would have to model the multivariate distribution of the intermediate confounders in order to estimate the ACDE.
Under the assumption of sequential unconfoundedness, the de-mediation function can be estimated from a regression of the outcome on the variables in the de-mediation function plus the intermediate confounders, treatment, and baseline confounders. If this regression model is correctly specified, the coefficients on the variables in the de-mediation function are unbiased for the parameters of the de-mediation function. For example, when there are no interactions between the mediator and the treatment, nor between the mediator and the pretreatment confounders, the de-mediation function is straightforward to estimate. In this case, the de-mediation function is \(\gamma(a, m, x) = \alpha m\) and the OLS coefficient on the mediator is an estimate of \(\alpha\).
The second stage of sequential g-estimation uses the de-mediated outcome \[\tilde{Y}_i = Y_i - \widehat{\gamma}(A_i, M_i, X_i; \widehat{\alpha}).\] If there are no interactions or nonlinearities in the de-mediation function, then this becomes \(\tilde{Y}_i = Y_i - \widehat{\alpha}M_i\). With this de-mediated outcome in hand, we can estimate the ACDE by regressing this outcome on the treatment and pre-treatment covariates: \[E[\tilde{Y}_i | A_i, X_i] = \beta_0 + \beta_1A_i + X_i^T\beta_2.\] Under the above assumptions and assuming this regression is correctly specified, \(\widehat{\beta_1}\) is an unbiased estimate of the ACDE. \[\widehat{\mathit{ACDE}} = \widehat{\beta_1}\]
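The two stages described above can be sketched by hand in base R. The example below uses simulated data in which both assumptions hold by construction and the de-mediation function is linear, \(\gamma(a, m, x) = \alpha m\). All variable names and coefficient values are hypothetical and chosen only for illustration; this is not the package's implementation, and it produces point estimates only (standard errors from the second-stage lm() would ignore the first-stage estimation).

```r
set.seed(42)
n <- 5000

X <- rnorm(n)                                  # pretreatment confounder
A <- rbinom(n, 1, plogis(0.5 * X))             # treatment, depends on X
Z <- 0.8 * A + 0.5 * X + rnorm(n)              # intermediate confounder, affected by A
M <- 0.7 * A + 0.6 * Z + 0.3 * X + rnorm(n)    # mediator
Y <- 1.0 * A + 0.5 * M + 0.4 * Z + 0.3 * X + rnorm(n)
dat <- data.frame(Y, A, M, Z, X)

# True ACDE in this simulation: the direct path (1.0) plus the path through Z (0.4 * 0.8)
true_acde <- 1.0 + 0.4 * 0.8

# Stage 1: regress the outcome on treatment, mediator, and both sets of confounders
stage1 <- lm(Y ~ A + M + Z + X, data = dat)
alpha_hat <- coef(stage1)[["M"]]

# De-mediate the outcome: subtract the estimated effect of the mediator
dat$Y_tilde <- dat$Y - alpha_hat * dat$M

# Stage 2: regress the de-mediated outcome on treatment and pretreatment covariates
stage2 <- lm(Y_tilde ~ A + X, data = dat)
acde_hat <- coef(stage2)[["A"]]

# For comparison: the coefficient on A when we simply condition on Z and M
naive_hat <- coef(stage1)[["A"]]

c(true = true_acde, sequential_g = acde_hat, naive = naive_hat)
```

In this setup the naive coefficient targets only the path that bypasses both \(M\) and \(Z\), which is the post-treatment bias discussed earlier, while the two-stage estimate targets the ACDE.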
We now work through one example in detail. The data comes from Alesina, Giuliano, and Nunn, 2013. The dataset comes with the package:
data(ploughs)
The paper's main argument is that the advent of the capital-intensive agricultural practice of the plow favored men over women participating in agriculture, which has affected gender inequality in modern societies. The authors find strong effects of plow-based agriculture on female labor-force participation, but not on the share of political positions held by women. They speculate that this might be due to the (positive) effects of the plow on modern-day income levels, which could offset the direct effects of the plow. The authors control for income and show that a significant effect of the plow appears. We evaluate this approach and try to estimate the ACDE more formally. Thus, in this case, we have the following variables:
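\(Y_i\) is the outcome, the share of political positions held by women (women_politics)
\(A_i\) is the treatment, plow-based agriculture (plow)
\(M_i\) is the mediator, log income (ln_income)
\(Z_i\) are the intermediate confounders (years_civil_conflict, years_interstate_conflict, oil_pc, european_descent, communist_dummy, polity2_2000, serv_va_gdp2000)
\(X_i\) are the pretreatment confounders (tropical_climate, agricultural_suitability, large_animals, political_hierarchies, economic_complexity, and rugged)
where \(i\) indexes countries.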
The bivariate relationship between the treatment and the dependent variable shows no clear pattern. We seek to estimate the direct effect of a nation adopting the plow, controlling for pre-treatment confounders and accounting for the mediator of current national income.
As a baseline, we first regress \(Y\) on \(A\) controlling for the pre-treatment variables \(X\).
ate.mod <- lm(women_politics ~ plow + agricultural_suitability + tropical_climate + large_animals + political_hierarchies + economic_complexity + rugged, data = ploughs)
Notice that the effect of the plow is negative and insignificant.
summary(ate.mod)
##
## Call:
## lm(formula = women_politics ~ plow + agricultural_suitability +
## tropical_climate + large_animals + political_hierarchies +
## economic_complexity + rugged, data = ploughs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.060 -5.292 -2.000 3.752 25.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.5604 4.9038 3.581 0.000466 ***
## plow -2.1032 2.1270 -0.989 0.324422
## agricultural_suitability 0.9712 2.5761 0.377 0.706721
## tropical_climate -7.5137 1.9621 -3.830 0.000191 ***
## large_animals -8.7104 4.0510 -2.150 0.033201 *
## political_hierarchies 0.9359 0.9722 0.963 0.337344
## economic_complexity 1.0284 0.5430 1.894 0.060202 .
## rugged -0.6370 0.4961 -1.284 0.201137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 145 degrees of freedom
## (81 observations deleted due to missingness)
## Multiple R-squared: 0.1831, Adjusted R-squared: 0.1436
## F-statistic: 4.642 on 7 and 145 DF, p-value: 0.0001028
In this example, we would like to estimate the controlled direct effect of plows fixing the value of current income. To do so, we must choose a value at which to fix the value of log income. The standard sequential g-estimation approach will fix the value to 0, which isn't substantively interesting in this case. To avoid this problem, we can recenter our mediator, log income, so that when we demediate with \(m = 0\), it will be equivalent to demediation with \(m\) set to its mean value. Furthermore, we create a squared term of this centered mediator so that we can include it in the demediation function.
ploughs$centered_ln_inc <- ploughs$ln_income - mean(ploughs$ln_income, na.rm = TRUE)
ploughs$centered_ln_incsq <- ploughs$centered_ln_inc^2
The first stage is a linear model of the outcome \(Y\) on the treatment \(A\) (plow), the mediator terms \(M\) (centered_ln_inc + centered_ln_incsq), and both the baseline covariates, \(X\), and the intermediate covariates, \(Z\). This model will help us estimate the effect of the mediator and, thus, estimate the de-mediation function.
fit_first <- lm(women_politics ~ plow + centered_ln_inc + centered_ln_incsq +
agricultural_suitability + tropical_climate + large_animals +
political_hierarchies + economic_complexity + rugged +
years_civil_conflict + years_interstate_conflict + oil_pc +
european_descent + communist_dummy + polity2_2000 +
serv_va_gdp2000, data = ploughs)
Next, we specify the main formula that, in contrast to the first-stage model, distinguishes between the mediator and the other predictors. The first part of the formula should specify a regression of \(Y\) on the treatment and \(X\), and the second part should specify the variables of the de-mediation function based on \(M\). These two parts should be separated by | and can be expressed as yvar ~ xvars | mvars. The mediators must be named the same way as in the first-stage model. For example, here we specify that the de-mediation variables are \(M_i\) and \(M_i^2\):
form_main <- women_politics ~ plow + agricultural_suitability + tropical_climate + large_animals + political_hierarchies + economic_complexity + rugged | centered_ln_inc + centered_ln_incsq
Finally, we pass this formula and the first-stage regression model to the sequential g-estimation function. It takes the main model specification as the first argument (formula), followed by the first-stage model used for estimating the de-mediation function (first_mod).
direct <- sequential_g(formula = form_main,
first_mod = fit_first,
data = ploughs,
subset = rownames(ploughs) %in% rownames(model.matrix(fit_first)))
This function, sequential_g(formula, first_mod, data, ...), implements sequential g-estimation in the way outlined in the previous section. Specifically, it takes the first-stage model, constructs the de-mediation function, and uses it to de-mediate the outcome before fitting the second-stage regression. Here, we also subset the data to those observations that were used in the first-stage model.
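To make the role of the centered mediator and its square concrete, here is a minimal hand-rolled sketch of the same steps on simulated data: fit a first-stage model, build \(\widehat{\gamma}\) from the coefficients on the mediator terms, subtract it from the outcome, and fit the second stage. The data, the names (M_c, M_csq, first_sim, second_sim), and the coefficient values are all hypothetical, and the code only mirrors the logic described above; it is not how sequential_g is implemented and it yields no corrected standard errors.

```r
set.seed(7)
n <- 5000

X <- rnorm(n)                                   # pretreatment confounder
A <- rbinom(n, 1, 0.5)                          # treatment
Z <- 0.5 * A + 0.5 * X + rnorm(n)               # intermediate confounder
M_raw <- 10 + 0.5 * A + 0.5 * Z + rnorm(n)      # mediator on its original scale

dat <- data.frame(X, A, Z)
dat$M_c   <- M_raw - mean(M_raw)                # centered mediator: 0 now means "at the mean"
dat$M_csq <- dat$M_c^2                          # squared term for the de-mediation function
dat$Y <- 1.0 * dat$A + 0.6 * dat$M_c - 0.2 * dat$M_csq +
  0.4 * dat$Z + 0.3 * dat$X + rnorm(n)

# First stage: outcome on treatment, mediator terms, and all confounders
first_sim <- lm(Y ~ A + M_c + M_csq + X + Z, data = dat)

# Estimated de-mediation function: mediator terms times their first-stage coefficients
med_vars  <- c("M_c", "M_csq")
gamma_hat <- drop(as.matrix(dat[, med_vars]) %*% coef(first_sim)[med_vars])

# Second stage on the de-mediated outcome
dat$Y_tilde <- dat$Y - gamma_hat
second_sim  <- lm(Y_tilde ~ A + X, data = dat)
acde_sim    <- coef(second_sim)[["A"]]

# In this simulation the ACDE at the mean of the mediator is 1.0 + 0.4 * 0.5 = 1.2
acde_sim
```

Listing the mediator terms in med_vars plays the same role as the variables after the | in the main formula: they determine which first-stage coefficients enter the de-mediation function.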
As usual, we can apply the summary function to the output of the sequential_g function:
summary(direct)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.18450 3.64442 3.3433 0.001121 **
## plow -4.83879 2.34467 -2.0637 0.041312 *
## agricultural_suitability 4.57388 3.10477 1.4732 0.143458
## tropical_climate -2.18919 2.10505 -1.0400 0.300554
## large_animals -1.33001 3.40008 -0.3912 0.696401
## political_hierarchies 0.49575 1.09060 0.4546 0.650283
## economic_complexity -0.10521 0.42973 -0.2448 0.807029
## rugged -0.30869 0.47821 -0.6455 0.519888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient on the treatment variable (here, plow) is the estimate of the ACDE. The results show a negative ACDE of plows, where we have defined “direct” in the sense that we are fixing the value of the mediator (income in this case). This controlled direct effect is larger in magnitude than the estimated total effect (-4.84 versus -2.10).
There are various quantities and output objects available from the sequential_g function. As usual, the coefficient estimates from the second stage can be accessed using the coef() function:
coef(direct)
## (Intercept) plow agricultural_suitability
## 12.1845047 -4.8387928 4.5738843
## tropical_climate large_animals political_hierarchies
## -2.1891894 -1.3300131 0.4957518
## economic_complexity rugged
## -0.1052101 -0.3086924
One can access confidence intervals for the coefficients using the usual confint function:
confint(direct, "plow")
## 2.5 % 97.5 %
## plow -9.434262 -0.2433234
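The printed interval can be reproduced by hand from the numbers shown above. The sketch below plugs in the plow estimate from coef(direct) and its standard error from summary(direct), and uses a normal critical value. That the interval is based on a normal approximation is our reading of the printed output, not a documented property of the package.

```r
# Values copied from the printed output above
est <- -4.8387928   # coef(direct)["plow"]
se  <- 2.34467      # Std. Error for plow in summary(direct)

# 95% interval under a normal approximation
ci <- est + c(-1, 1) * qnorm(0.975) * se
names(ci) <- c("2.5 %", "97.5 %")
ci

# The t value in the summary table is the ratio of the two
t_value <- est / se
t_value
```

Up to rounding of the printed standard error, this matches the bounds returned by confint(direct, "plow").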
The vcov() function returns the variance-covariance matrix, which accounts for the first-stage estimation:
vcov(direct)
## (Intercept) plow agricultural_suitability
## (Intercept) 13.2818030 -0.5383896 0.1744604
## plow -0.5383896 5.4974789 1.2047839
## agricultural_suitability 0.1744604 1.2047839 9.6395922
## tropical_climate -4.2134498 1.9941166 1.5556290
## large_animals -4.6726791 1.0902713 2.2247720
## political_hierarchies -0.7875510 -1.7236787 -1.9068463
## economic_complexity -0.4467889 0.1344913 -0.6397664
## rugged -0.1575923 -0.6212213 0.1149741
## tropical_climate large_animals
## (Intercept) -4.2134498 -4.6726791
## plow 1.9941166 1.0902713
## agricultural_suitability 1.5556290 2.2247720
## tropical_climate 4.4312353 0.6588940
## large_animals 0.6588940 11.5605183
## political_hierarchies -0.2916368 -1.4989856
## economic_complexity -0.1041499 -0.5417805
## rugged -0.1108144 -0.2269955
## political_hierarchies economic_complexity
## (Intercept) -0.7875510 -0.44678890
## plow -1.7236787 0.13449130
## agricultural_suitability -1.9068463 -0.63976636
## tropical_climate -0.2916368 -0.10414986
## large_animals -1.4989856 -0.54178054
## political_hierarchies 1.1893989 0.08020280
## economic_complexity 0.0802028 0.18466739
## rugged 0.1619962 -0.02484048
## rugged
## (Intercept) -0.15759233
## plow -0.62122128
## agricultural_suitability 0.11497405
## tropical_climate -0.11081440
## large_animals -0.22699550
## political_hierarchies 0.16199618
## economic_complexity -0.02484048
## rugged 0.22868322
The returned object also contains the data used, the de-mediated dependent variable, and the model matrix of the treatment and pre-treatment covariates, if requested in the function call (model = TRUE, x = TRUE, y = TRUE):
head(direct$model)
## women_politics plow agricultural_suitability tropical_climate
## 3 16 0.000000000 0.92106117 1.00000000
## 5 5 1.000000000 0.55592783 0.35931775
## 8 0 0.999999762 0.04219041 1.00000000
## 9 28 0.007799393 0.54676911 0.99706609
## 10 3 0.977451026 0.59414589 0.02472441
## 13 22 0.000000000 0.41929801 1.00000000
## large_animals political_hierarchies economic_complexity rugged
## 3 0.9988955 2.720727 6.697217 0.8582159
## 5 1.0000000 3.029723 5.035051 3.4270583
## 8 1.0000000 4.000000 6.999999 0.7688347
## 9 0.7979339 2.256278 4.757950 0.7746106
## 10 1.0000000 3.975265 6.856073 2.6884225
## 13 0.0000000 1.008231 1.000000 0.1433353
## centered_ln_inc centered_ln_incsq
## 3 -1.0789779 1.1641932
## 5 -0.4771216 0.2276451
## 8 2.4163324 5.8386624
## 9 1.3787510 1.9009543
## 10 -1.1405023 1.3007454
## 13 2.3753187 5.6421390
head(direct$y)
## [,1]
## 3 17.818222
## 5 6.056754
## 8 -11.505354
## 9 22.694182
## 10 4.860141
## 13 10.775676