The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
In some settings, we don’t have access to the full data unit on each observation in our sample. These “coarsened-data” settings (see, e.g., Van der Vaart (2000)) create a layer of complication in estimating variable importance. In particular, the efficient influence function (EIF) in the coarsened-data setting is more complex, and involves estimating an additional quantity: the projection of the full-data EIF (estimated on the fully-observed sample) onto the variables that are always observed (Chapter 25.5.3 of Van der Vaart (2000); see also Example 6 in Williamson, Gilbert, Simon, et al. (2021)).
vimp
vimp
can handle coarsened data, with the specification
of several arguments:
C
: and binary indicator vector, denoting which
observations have been coarsened; 1 denotes fully observed, while 0
denotes coarsened.ipc_weights
: inverse probability of coarsening weights,
assumed to already be inverted (i.e., ipc_weights
= 1 /
[estimated probability of coarsening]).ipc_est_type
: the type of procedure used for
coarsened-at-random settings; options are "ipw"
(for
inverse probability weighting) or "aipw"
(for augmented
inverse probability weighting). Only used if C
is not all
equal to 1.Z
: a character vector specifying the variable(s) among
Y
and X
that are thought to play a role in the
coarsening mechanism. To specify the outcome, use "Y"
; to
specify covariates, use a character number corresponding to the desired
position in X
(e.g., "1"
or "X1"
[the latter is case-insensitive]).Z
plays a role in the additional estimation mentioned
above. Unless otherwise specified, an internal call to
SuperLearner
regresses the full-data EIF (estimated on the
fully-observed data) onto a matrix that is the parsed version of
Z
. If you wish to use any covariates from X
as
part of your coarsening mechanism (and thus include them in
Z
), and they have different names from X1
,
…, then you must use character numbers (i.e., "1"
refers to the first variable, etc.) to refer to the variables to include
in Z
. Otherwise, vimp
will throw an error.
In this example, the outcome Y
is subject to
missingness. We generate data as follows:
set.seed(1234)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
# apply the function to the x's
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
# indicator of observing Y
logit_g_x <- .01 * x[, 1] + .05 * x[, 2] - 2.5
g_x <- exp(logit_g_x) / (1 + exp(logit_g_x))
C <- rbinom(n, size = 1, prob = g_x)
obs_y <- y
obs_y[C == 0] <- NA
x_df <- as.data.frame(x)
full_df <- data.frame(Y = obs_y, x_df, C = C)
Next, we estimate the relevant components for vimp
:
library("vimp")
library("SuperLearner")
# estimate the probability of missing outcome
ipc_weights <- 1 / predict(glm(C ~ V1 + V2, family = "binomial", data = full_df),
type = "response")
# set up the SL
learners <- c("SL.glm", "SL.mean")
V <- 2
# estimate vim for X2
set.seed(1234)
est <- vim(Y = obs_y, X = x_df, indx = 2, type = "r_squared", run_regression = TRUE,
SL.library = learners, alpha = 0.05, delta = 0, C = C, Z = c("Y", "1"),
ipc_weights = ipc_weights, cvControl = list(V = V))
## Warning: All algorithms have zero weight
## Warning: All metalearner coefficients are zero, predictions will all be equal
## to 0
## Warning: All metalearner coefficients are zero, predictions will all be equal
## to 0
In this example, we observe outcome Y
and covariate
X1
on all participants in a study. Based on the value of
Y
and X1
, we include some participants in a
second-phase sample, and further measure covariate X2
on
these participants. This is an example of a two-phase study. We generate
data as follows:
set.seed(4747)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
# apply the function to the x's
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
# make this a two-phase study, assume that X2 is only measured on
# subjects in the second phase; note C = 1 is inclusion
C <- rbinom(n, size = 1, prob = exp(y + 0.1 * x[, 1]) / (1 + exp(y + 0.1 * x[, 1])))
tmp_x <- x
tmp_x[C == 0, 2] <- NA
x <- tmp_x
x_df <- as.data.frame(x)
full_df <- data.frame(Y = y, x_df, C = C)
If we want to estimate variable importance of X2
, we
need to use the coarsened-data arguments in vimp
. This can
be accomplished in the following manner:
library("vimp")
library("SuperLearner")
# estimate the probability of being included only in the first phase sample
ipc_weights <- 1 / predict(glm(C ~ y + V1, family = "binomial", data = full_df),
type = "response")
# set up the SL
learners <- c("SL.glm")
V <- 2
# estimate vim for X2
set.seed(1234)
est <- vim(Y = y, X = x_df, indx = 2, type = "r_squared", run_regression = TRUE,
SL.library = learners, alpha = 0.05, delta = 0, C = C, Z = c("Y", "1"),
ipc_weights = ipc_weights, cvControl = list(V = V), method = "method.CC_LS")
## Loading required package: quadprog
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.