README

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Collaborative Targeted Maximum Likelihood Estimation

Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extention of Targeted Maximum Likelihood Estimation (TMLE). It applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a ‘collaborative’ way, by directly optimizing the empirical metric on the causal estimator.

In this package, we implemented the general template of C-TMLE, for the estimation of the average treatment effect (ATE).

The package also offers convenient functions for discrete C-TMLE for variable selection, and LASSO-C-TMLE for model selection of LASSO, in estimation of the propensity score (PS).

Installation

install.packages('ctmle')

devtools::install_github('jucheng1992/ctmle')

C-TMLE for variable selection

In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searching, and scalable discrete C-TMLE with pre-ordering option.

library(ctmle)
#> Loading required package: SuperLearner
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
#> Loading required package: tmle
#> Welcome to the tmle package, version 1.2.0-5
#> 
#> Use tmleNews() to see details on changes and bug fixes
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-10
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(123)

N <- 1000
p = 5
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

time_greedy <- system.time(
      ctmle_discrete_fit1 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = FALSE, detailed = TRUE)
)
ctmle_discrete_fit2 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat),
                                     preOrder = FALSE, detailed = TRUE)


time_preorder <- system.time(
      ctmle_discrete_fit3 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = TRUE,
                                           order = rev(1:p), detailed = TRUE)
)

time_greedy
#>    user  system elapsed 
#>   1.589   0.045   1.646
time_preorder
#>    user  system elapsed 
#>   0.994   0.012   1.008

ctmle_discrete_fit1
#> C-TMLE result:
#>  parameter estimate:  1.99472 
#>  estimated variance:  0.00838 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.81533, 2.1741)

summary(ctmle_discrete_fit1)
#> 
#> Number of candidate TMLE estimators created:  6 
#> A candidate TMLE estimator was created at each move, as each new term
#> was incorporated into the model for g.
#> ---------------------------------------------------------------------- 
#>         term added cleverCovar estimate cv-RSS cv-varIC cv-penRSS
#> cand 1 (intercept)           1     4.22   19.9   0.0788     14045
#> cand 2          X2           1     3.22   19.6   0.0851     13818
#> cand 3          X5           1     2.61   19.1   0.0870     13485
#> cand 4          X1           1     2.00   18.3   0.0955     12945
#> cand 5          X4           2     1.99   18.3   0.0950     12937
#> cand 6          X3           3     2.01   18.3   0.1008     12941
#> ---------------------------------------------------------------------- 
#> Selected TMLE estimator is candidate 5 
#> 
#> Each TMLE candidate was created by fluctuating the initial fit, Q0(A,W)=E[Y|A,W], obtained in stage 1.
#> 
#>  cand 1: Q1(A,W) = Q0(A,W) + epsilon1a * h1a 
#>              h1a is based on an intercept-only model for treatment mechanism g(A,W)
#> 
#>      cand 2: Q2(A,W) = Q0(A,W) + epsilon1b * h1b 
#>              h1b is based on a treatment mechanism model containing covariates X2
#> 
#>      cand 3: Q3(A,W) = Q0(A,W) + epsilon1c * h1c 
#>              h1c is based on a treatment mechanism model containing covariates X2, X5
#> 
#>      cand 4: Q4(A,W) = Q0(A,W) + epsilon1d * h1d 
#>              h1d is based on a treatment mechanism model containing covariates X2, X5, X1
#> 
#>      cand 5: Q5(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2                     = Q4(A,W) + epsilon2 * h2,
#>              h2 is based on a treatment mechanism model containing covariates X2, X5, X1, X4
#> 
#>      cand 6: Q6(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 + epsilon3 * h3                     = Q5(A,W) + epsilon3 * h3,
#>              h3 is based on a treatment mechanism model containing covariates X2, X5, X1, X4, X3
#> 
#> ---------- 
#> C-TMLE result:
#>  parameter estimate:  1.99472 
#>  estimated variance:  0.00838 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.81533, 2.1741)

LASSO-C-TMLE for model selection of LASSO

In this section, we introduce the LASSO-C-TMLE algorithm for model selection of LASSO in the estimation of the propensity score. We implemented three variations of the LASSO-C-TMLE algorithm. For simplicity, we call them C-TMLE1-3. See technical details in the corresponding references.

# Generate high-dimensional data
set.seed(123)

N <- 1000
p = 100
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta0 <- 2 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

glmnet_fit <- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20)

We start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter.

lambdas <- glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)]

We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order:

time_ctmlelasso1 <- system.time(
      ctmle_fit1 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=1, 
                                family="gaussian",gbound=0.025, V=5)
)

time_ctmlelasso2 <- system.time(
      ctmle_fit2 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=2, 
                                family="gaussian",gbound=0.025, V=5)
)

For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda:

gcv <- predict.cv.glmnet(glmnet_fit, newx=W, s="lambda.min",type="response")
gcv <- bound(gcv,c(0.025,0.975))

s_prev <- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2)
gcvPrev <- predict.cv.glmnet(glmnet_fit,newx = W,s = s_prev,type="response")
gcvPrev <- bound(gcvPrev,c(0.025,0.975))

time_ctmlelasso3 <- system.time(
      ctmle_fit3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q,
                                ctmletype=3, g1W = gcv, g1WPrev = gcvPrev,
                                family="gaussian",
                                gbound=0.025, V = 5)
)

time_ctmlelasso1
#>    user  system elapsed 
#>  15.005   0.104  15.266
time_ctmlelasso2
#>    user  system elapsed 
#>  18.351   0.083  18.528
time_ctmlelasso3
#>    user  system elapsed 
#>   0.005   0.000   0.006

ctmle_fit1
#> C-TMLE result:
#>  parameter estimate:  2.20368 
#>  estimated variance:  0.09796 
#>             p-value:  1.9124e-12 
#>   95% conf interval: (1.59022, 2.81714)
ctmle_fit2
#> C-TMLE result:
#>  parameter estimate:  2.16669 
#>  estimated variance:  0.05327 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.71429, 2.61908)
ctmle_fit3
#> C-TMLE result:
#>  parameter estimate:  2.02388 
#>  estimated variance:  0.04972 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.58684, 2.46093)

lambdas[ctmle_fit1$best_k]
#> [1] 0.004409285

In comparison, we show which regularization parameter (lambda) is selected by cv.glmnet:

glmnet_fit$lambda.min
#> [1] 0.03065303

Advanced topic: the general template of C-TMLE

In this section, we briefly introduce the general template of C-TMLE. In this function, the gn candidates could be a user-specified matrix, each column stand for the estimated PS for each unit. The estimators should be ordered by their empirical fit.

As C-TMLE requires cross-validation, it needs two gn estimate: one from cross-validated prediction, one from a vanilla prediction. For example, consider 5-folds cross-validation, where argument folds is the list of indices for each folds, then the (i,j)-th element in input gn_candidates_cv should be the predicted value of i-th unit, predicted by j-th unit, trained by other 4 folds where all of them do not contain i-th unit. gn_candidates should be just the predicted PS for each estimator trained on the whole data.

We could easily use SuperLearner package and build_gn_seq function to easily achieve this:

lasso_fit <- cv.glmnet(x = as.matrix(W), y = A, alpha = 1, nlambda = 100, nfolds = 10)
lasso_lambdas <- lasso_fit$lambda[lasso_fit$lambda <= lasso_fit$lambda.min][1:5]

# Build SL template for glmnet
SL.glmnet_new <- function(Y, X, newX, family, obsWeights, id, alpha = 1,
                           nlambda = 100, lambda = 0,...){
      # browser()
      if (!is.matrix(X)) {
            X <- model.matrix(~-1 + ., X)
            newX <- model.matrix(~-1 + ., newX)
      }
      fit <- glmnet::glmnet(x = X, y = Y,
                            lambda = lambda,
                            family = family$family, alpha = alpha)
      pred <- predict(fit, newx = newX, type = "response")
      fit <- list(object = fit)
      class(fit) <- "SL.glmnet"
      out <- list(pred = pred, fit = fit)
      return(out)
}

# Use a sequence of estimator to build gn sequence:
SL.cv1lasso <- function (... , alpha = 1, lambda = lasso_lambdas[1]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv2lasso <- function (... , alpha = 1, lambda = lasso_lambdas[2]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv3lasso <- function (... , alpha = 1, lambda = lasso_lambdas[3]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv4lasso <- function (... , alpha = 1, lambda = lasso_lambdas[4]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.library = c('SL.cv1lasso', 'SL.cv2lasso', 'SL.cv3lasso', 'SL.cv4lasso', 'SL.glm')

V = 5
folds <-by(sample(1:N,N), rep(1:V, length=N), list)

Use folds and SuperLearner template to compute gn_candidates and gn_candidates_cv

gn_seq <- build_gn_seq(A = A, W = W, SL.library = SL.library, folds = folds)
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Non-Negative least squares convergence: TRUE
#> full SL.cv1lasso_All
#> full SL.cv2lasso_All
#> full SL.cv3lasso_All
#> full SL.cv4lasso_All
#> full SL.glm_All

gn_seq$gn_candidates %>% dim
#> [1] 1000    5
gn_seq$gn_candidates_cv %>% dim
#> [1] 1000    5
gn_seq$folds %>% length
#> [1] 5

Then we could use ctmleGeneral algorithm. As input estimator is already trained, it is much faster than previous C-TMLE algorithms.

Note: we recommand use the same folds as build_gn_seq for ctmleGeneral, to make cross-validation objective.

ctmle_general_fit1 <- ctmleGeneral(Y = Y, A = A, W = W, Q = Q,
                                   ctmletype = 1, 
                                   gn_candidates = gn_seq$gn_candidates,
                                   gn_candidates_cv = gn_seq$gn_candidates_cv,
                                   folds = folds, V = 5)

ctmle_general_fit1
#> C-TMLE result:
#>  parameter estimate:  2.19494 
#>  estimated variance:  0.08348 
#>             p-value:  3.0302e-14 
#>   95% conf interval: (1.62865, 2.76122)

Citation

@Manual{,
    title = {ctmle: Collaborative Targeted Maximum Likelihood Estimation},
    author = {Cheng Ju and Susan Gruber and Mark van der Laan},
    year = {2017},
    note = {R package version 0.1.1},
    url = {https://CRAN.R-project.org/package=ctmle},
}

References (by inverse chronological order)

C-TMLE for Adaptive Propensity Score Truncation

LASSO-C-TMLE

Scalable Discrete C-TMLE with Pre-ordering

Discrete C-TMLE with Greedy Search

General Template of C-TMLE

C-TMLE for Model Selection

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.