The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extention of Targeted Maximum Likelihood Estimation (TMLE). It applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a ‘collaborative’ way, by directly optimizing the empirical metric on the causal estimator.
In this package, we implemented the general template of C-TMLE, for the estimation of the average treatment effect (ATE).
The package also offers convenient functions for discrete C-TMLE for variable selection, and LASSO-C-TMLE for model selection of LASSO, in estimation of the propensity score (PS).
To install the CRAN release version of ctmle
:
install.packages('ctmle')
To install the development version (requires the devtools package):
::install_github('jucheng1992/ctmle') devtools
In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searching, and scalable discrete C-TMLE with pre-ordering option.
library(ctmle)
#> Loading required package: SuperLearner
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
#> Loading required package: tmle
#> Welcome to the tmle package, version 1.2.0-5
#>
#> Use tmleNews() to see details on changes and bug fixes
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-10
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
<- 1000
N = 5
p <- matrix(rnorm(N * p), ncol = p)
Wmat <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta1 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta0 <- 2
tau <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
gcoef <- as.matrix(Wmat)
W
<- 1/(1+exp(W%*%gcoef /3))
g <- rbinom(N, 1, prob = g)
A
<-rnorm(N, 0, 1)
epsilon <- beta0 + tau * A + epsilon
Y
# With initial estimate of Q
<- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))
Q
<- system.time(
time_greedy <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
ctmle_discrete_fit1 preOrder = FALSE, detailed = TRUE)
)<- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat),
ctmle_discrete_fit2 preOrder = FALSE, detailed = TRUE)
<- system.time(
time_preorder <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
ctmle_discrete_fit3 preOrder = TRUE,
order = rev(1:p), detailed = TRUE)
)
Scalable (discrete) C-TMLE takes much less computation time:
time_greedy#> user system elapsed
#> 1.589 0.045 1.646
time_preorder#> user system elapsed
#> 0.994 0.012 1.008
Show the brief results from greedy CTMLE:
ctmle_discrete_fit1#> C-TMLE result:
#> parameter estimate: 1.99472
#> estimated variance: 0.00838
#> p-value: <2e-16
#> 95% conf interval: (1.81533, 2.1741)
Summary function offers detial information of which variable is selected.
summary(ctmle_discrete_fit1)
#>
#> Number of candidate TMLE estimators created: 6
#> A candidate TMLE estimator was created at each move, as each new term
#> was incorporated into the model for g.
#> ----------------------------------------------------------------------
#> term added cleverCovar estimate cv-RSS cv-varIC cv-penRSS
#> cand 1 (intercept) 1 4.22 19.9 0.0788 14045
#> cand 2 X2 1 3.22 19.6 0.0851 13818
#> cand 3 X5 1 2.61 19.1 0.0870 13485
#> cand 4 X1 1 2.00 18.3 0.0955 12945
#> cand 5 X4 2 1.99 18.3 0.0950 12937
#> cand 6 X3 3 2.01 18.3 0.1008 12941
#> ----------------------------------------------------------------------
#> Selected TMLE estimator is candidate 5
#>
#> Each TMLE candidate was created by fluctuating the initial fit, Q0(A,W)=E[Y|A,W], obtained in stage 1.
#>
#> cand 1: Q1(A,W) = Q0(A,W) + epsilon1a * h1a
#> h1a is based on an intercept-only model for treatment mechanism g(A,W)
#>
#> cand 2: Q2(A,W) = Q0(A,W) + epsilon1b * h1b
#> h1b is based on a treatment mechanism model containing covariates X2
#>
#> cand 3: Q3(A,W) = Q0(A,W) + epsilon1c * h1c
#> h1c is based on a treatment mechanism model containing covariates X2, X5
#>
#> cand 4: Q4(A,W) = Q0(A,W) + epsilon1d * h1d
#> h1d is based on a treatment mechanism model containing covariates X2, X5, X1
#>
#> cand 5: Q5(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 = Q4(A,W) + epsilon2 * h2,
#> h2 is based on a treatment mechanism model containing covariates X2, X5, X1, X4
#>
#> cand 6: Q6(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 + epsilon3 * h3 = Q5(A,W) + epsilon3 * h3,
#> h3 is based on a treatment mechanism model containing covariates X2, X5, X1, X4, X3
#>
#> ----------
#> C-TMLE result:
#> parameter estimate: 1.99472
#> estimated variance: 0.00838
#> p-value: <2e-16
#> 95% conf interval: (1.81533, 2.1741)
In this section, we introduce the LASSO-C-TMLE algorithm for model selection of LASSO in the estimation of the propensity score. We implemented three variations of the LASSO-C-TMLE algorithm. For simplicity, we call them C-TMLE1-3. See technical details in the corresponding references.
# Generate high-dimensional data
set.seed(123)
<- 1000
N = 100
p <- matrix(rnorm(N * p), ncol = p)
Wmat <- 4 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta1 <- 2 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta0 <- 2
tau <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
gcoef <- as.matrix(Wmat)
W
<- 1/(1+exp(W%*%gcoef /3))
g <- rbinom(N, 1, prob = g)
A
<-rnorm(N, 0, 1)
epsilon <- beta0 + tau * A + epsilon
Y
# With initial estimate of Q
<- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))
Q
<- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20) glmnet_fit
We start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter.
<- glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)] lambdas
We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order:
<- system.time(
time_ctmlelasso1 <- ctmleGlmnet(Y = Y, A = A,
ctmle_fit1 W = data.frame(W = W),
Q = Q, lambdas = lambdas, ctmletype=1,
family="gaussian",gbound=0.025, V=5)
)
We fit C-TMLE2 algorithm:
<- system.time(
time_ctmlelasso2 <- ctmleGlmnet(Y = Y, A = A,
ctmle_fit2 W = data.frame(W = W),
Q = Q, lambdas = lambdas, ctmletype=2,
family="gaussian",gbound=0.025, V=5)
)
For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda:
<- predict.cv.glmnet(glmnet_fit, newx=W, s="lambda.min",type="response")
gcv <- bound(gcv,c(0.025,0.975))
gcv
<- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2)
s_prev <- predict.cv.glmnet(glmnet_fit,newx = W,s = s_prev,type="response")
gcvPrev <- bound(gcvPrev,c(0.025,0.975))
gcvPrev
<- system.time(
time_ctmlelasso3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q,
ctmle_fit3 ctmletype=3, g1W = gcv, g1WPrev = gcvPrev,
family="gaussian",
gbound=0.025, V = 5)
)
Les’t compare the running time for each LASSO-C-TMLE
time_ctmlelasso1#> user system elapsed
#> 15.005 0.104 15.266
time_ctmlelasso2#> user system elapsed
#> 18.351 0.083 18.528
time_ctmlelasso3#> user system elapsed
#> 0.005 0.000 0.006
Finally, we compare three C-TMLE estimates:
ctmle_fit1#> C-TMLE result:
#> parameter estimate: 2.20368
#> estimated variance: 0.09796
#> p-value: 1.9124e-12
#> 95% conf interval: (1.59022, 2.81714)
ctmle_fit2#> C-TMLE result:
#> parameter estimate: 2.16669
#> estimated variance: 0.05327
#> p-value: <2e-16
#> 95% conf interval: (1.71429, 2.61908)
ctmle_fit3#> C-TMLE result:
#> parameter estimate: 2.02388
#> estimated variance: 0.04972
#> p-value: <2e-16
#> 95% conf interval: (1.58684, 2.46093)
Show which regularization parameter (lambda) is selected by C-TMLE1:
$best_k]
lambdas[ctmle_fit1#> [1] 0.004409285
In comparison, we show which regularization parameter (lambda) is selected by cv.glmnet:
$lambda.min
glmnet_fit#> [1] 0.03065303
In this section, we briefly introduce the general template of C-TMLE. In this function, the gn candidates could be a user-specified matrix, each column stand for the estimated PS for each unit. The estimators should be ordered by their empirical fit.
As C-TMLE requires cross-validation, it needs two gn estimate: one
from cross-validated prediction, one from a vanilla prediction. For
example, consider 5-folds cross-validation, where argument
folds
is the list of indices for each folds, then the
(i,j)-th element in input gn_candidates_cv
should be the
predicted value of i-th unit, predicted by j-th unit, trained by other 4
folds where all of them do not contain i-th unit.
gn_candidates
should be just the predicted PS for each
estimator trained on the whole data.
We could easily use SuperLearner
package and
build_gn_seq
function to easily achieve this:
<- cv.glmnet(x = as.matrix(W), y = A, alpha = 1, nlambda = 100, nfolds = 10)
lasso_fit <- lasso_fit$lambda[lasso_fit$lambda <= lasso_fit$lambda.min][1:5]
lasso_lambdas
# Build SL template for glmnet
<- function(Y, X, newX, family, obsWeights, id, alpha = 1,
SL.glmnet_new nlambda = 100, lambda = 0,...){
# browser()
if (!is.matrix(X)) {
<- model.matrix(~-1 + ., X)
X <- model.matrix(~-1 + ., newX)
newX
}<- glmnet::glmnet(x = X, y = Y,
fit lambda = lambda,
family = family$family, alpha = alpha)
<- predict(fit, newx = newX, type = "response")
pred <- list(object = fit)
fit class(fit) <- "SL.glmnet"
<- list(pred = pred, fit = fit)
out return(out)
}
# Use a sequence of estimator to build gn sequence:
<- function (... , alpha = 1, lambda = lasso_lambdas[1]){
SL.cv1lasso SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
<- function (... , alpha = 1, lambda = lasso_lambdas[2]){
SL.cv2lasso SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
<- function (... , alpha = 1, lambda = lasso_lambdas[3]){
SL.cv3lasso SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
<- function (... , alpha = 1, lambda = lasso_lambdas[4]){
SL.cv4lasso SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
= c('SL.cv1lasso', 'SL.cv2lasso', 'SL.cv3lasso', 'SL.cv4lasso', 'SL.glm') SL.library
Construct the object folds
, which is a list of indices
for each fold
= 5
V <-by(sample(1:N,N), rep(1:V, length=N), list) folds
Use folds
and SuperLearner template to compute
gn_candidates
and gn_candidates_cv
<- build_gn_seq(A = A, W = W, SL.library = SL.library, folds = folds)
gn_seq #> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Non-Negative least squares convergence: TRUE
#> full SL.cv1lasso_All
#> full SL.cv2lasso_All
#> full SL.cv3lasso_All
#> full SL.cv4lasso_All
#> full SL.glm_All
Lets look at the output of build_gn_seq
$gn_candidates %>% dim
gn_seq#> [1] 1000 5
$gn_candidates_cv %>% dim
gn_seq#> [1] 1000 5
$folds %>% length
gn_seq#> [1] 5
Then we could use ctmleGeneral
algorithm. As input
estimator is already trained, it is much faster than previous C-TMLE
algorithms.
Note: we recommand use the same folds
as
build_gn_seq
for ctmleGeneral
, to make
cross-validation objective.
<- ctmleGeneral(Y = Y, A = A, W = W, Q = Q,
ctmle_general_fit1 ctmletype = 1,
gn_candidates = gn_seq$gn_candidates,
gn_candidates_cv = gn_seq$gn_candidates_cv,
folds = folds, V = 5)
ctmle_general_fit1#> C-TMLE result:
#> parameter estimate: 2.19494
#> estimated variance: 0.08348
#> p-value: 3.0302e-14
#> 95% conf interval: (1.62865, 2.76122)
If you used ctmle
package in your research, please
cite:
Ju, Cheng; Susan, Gruber; van der Laan, Mark J.; ctmle: Collaborative Targeted Maximum Likelihood Estimation. R package version 0.1.1, https://CRAN.R-project.org/package=ctmle.
@Manual{,
title = {ctmle: Collaborative Targeted Maximum Likelihood Estimation},
author = {Cheng Ju and Susan Gruber and Mark van der Laan},
year = {2017},
note = {R package version 0.1.1},
url = {https://CRAN.R-project.org/package=ctmle},
}
Ju, Cheng, Schwab, Joshua, & van der Laan, Mark J. (2017). On Adaptive Propensity Score Truncation in Causal Inference. arXiv preprint arXiv:1707.05861 (2017).
Ju, Cheng; Wyss, Richard; Franklin, Jessica M.; Schneeweiss, Sebastian; Häggström, Jenny; van der Laan, Mark J.. “Collaborative-controlled LASSO for Constructing Propensity Score-based Estimators in High-Dimensional Data”, arXiv preprint arXiv: 1706.10029 (2017). (To appear in Statistical Methods in Medical Research)
Ju, Cheng; Gruber, Susan; Lendle, Samuel D.; Chambaz, Antoine; Franklin, Jessica M.; Wyss, Richard; Schneeweiss, Sebastian; and van der Laan, Mark J.. “Scalable Collaborative Targeted Learning for High-dimensional Data”, Statistical Methods in Medical Research (2017), https://doi.org/10.1177/0962280217729845.
Susan, Gruber, and van rder Laan, Mark J.. “An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics.” The International Journal of Biostatistics 6.1 (2010): 1-31.
van der Laan, Mark J., and Susan Gruber. “Collaborative double robust targeted maximum likelihood estimation.” The international journal of biostatistics 6.1 (2010): 1-71.
In preperation
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.