The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This example demonstrates how multibias
is validated
using simulation data. Here we are interested in quantifying the effect
of exposure X on outcome Y. The causal system can be
represented in the following directed acyclic graph (DAG):
The variables are defined:
The DAG indicates that there are three sources of bias: 1. There is uncontrolled confounding from (unobserved) variable U. 2. The true exposure, X, is unobserved, and the misclassified exposure X* is dependent on both the exposure and outcome. 3. Lastly, there is collider stratification at variable S since exposure and outcome both affect selection. The study naturally only assesses those who were selected into the study (i.e. those with S=1), which represents a fraction of all people in the source population from which we are trying to draw inference.
A simulated dataframe corresponding to this DAG,
df_uc_emc_sel
can be loaded from the multibias
package.
head(df_uc_em_sel)
#> Xstar Y C1 C2 C3
#> 1 1 0 0 0 1
#> 2 0 0 0 0 1
#> 3 0 0 0 0 0
#> 4 0 0 1 0 1
#> 5 1 0 0 0 1
#> 6 1 0 0 1 1
In this data, the true, unbiased exposure-outcome odds ratio (ORYX) equals ~2. However, when we run a logistic regression of the outcome on the exposure and confounders, we do not observe an odds ratio of 2 due to the multiple bias sources.
biased_model <- glm(Y ~ Xstar + C1 + C2 + C3, ,
family = binomial(link = "logit"),
data = df_uc_em_sel)
biased_or <- round(exp(coef(biased_model)[2]), 2)
print(paste0("Biased Odds Ratio: ", biased_or))
#> [1] "Biased Odds Ratio: 1.5"
The function adjust_uc_emc_sel()
can be used here to
“reconstruct” the unbiased data and return the exposure-outcome odds
ratio that would be observed in the unbiased setting.
Models for the missing variables (U, X, S) are used to facilitate this data reconstruction. For the above DAG, the corresponding bias models are:
where j indicates the number of measured confounders.
To perform the bias adjustment, it is necessary to obtain values of
these bias parameters. Potential sources of these bias parameters
include internal validation data, estimates in the literature, and
expert opinion. For purposes of demonstrating the methodology, we will
obtain the exact values of these bias parameters. This is possible
because for purposes of validation we have access to the data of missing
values that would otherwise be absent in real-world practice. This
source data is available in multibias
as
df_uc_emc_sel_source
.
u_model <- glm(U ~ X + Y,
family = binomial(link = "logit"),
data = df_uc_em_sel_source)
x_model <- glm(X ~ Xstar + Y + C1 + C2 + C3,
family = binomial(link = "logit"),
data = df_uc_em_sel_source)
s_model <- glm(S ~ Xstar + Y + C1 + C2 + C3,
family = binomial(link = "logit"),
data = df_uc_em_sel_source)
In this example we’ll perform probabilistic bias analysis,
representing each bias parameter as a single draw from a probability
distribution. For this reason, we will run the analysis over 1,000
iterations with bootstrap samples to obtain a valid confidence interval.
To speed up the computation we will run the for loop in parallel using
the foreach()
function in the doParallel
package. We create a cluster, make a seed for consistent results, and
specify the desired number of bootstrap repitions.
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(cores = no_cores)
cl <- makeCluster(no_cores)
set.seed(1234)
nreps <- 1000
est <- vector(length = nreps)
Next we run the parallel for loop in which we apply the
adjust_uc_em_sel()
function to bootstrap samples of the
df_uc_em_sel
data. Within the function, we include the
following arguments: 1. The data_observed
object, which
includes the dataframe and key column names. 2. The bias parameters:
U model coefficients, X model coefficients, and
S model coefficients.
Using the results from the fitted bias models above, we’ll use Normal distribution draws for each bias parameter where the mean correponds to the estimated coefficient from the bias model and the standard deviation comes from the estimated standard deviation (i.e., standard error) of the coefficient in the bias model. Each loop iteration will thus have slightly different values for the bias parameters.
or <- foreach(i = 1:nreps, .combine = c,
.packages = c("dplyr", "multibias")) %dopar% {
df_sample <- df_uc_em_sel[sample(seq_len(nrow(df_uc_em_sel)),
nrow(df_uc_em_sel),
replace = TRUE), ]
est[i] <- adjust_uc_em_sel(
data_observed = data_observed(
data = df_sample,
exposure = "Xstar",
outcome = "Y",
confounders = c("C1", "C2", "C3")
),
u_model_coefs = c(
rnorm(1, mean = u_model$coef[1], sd = summary(u_model)$coef[1, 2]),
rnorm(1, mean = u_model$coef[2], sd = summary(u_model)$coef[2, 2]),
rnorm(1, mean = u_model$coef[3], sd = summary(u_model)$coef[3, 2])
),
x_model_coefs = c(
rnorm(1, mean = x_model$coef[1], sd = summary(x_model)$coef[1, 2]),
rnorm(1, mean = x_model$coef[2], sd = summary(x_model)$coef[2, 2]),
rnorm(1, mean = x_model$coef[3], sd = summary(x_model)$coef[3, 2]),
rnorm(1, mean = x_model$coef[4], sd = summary(x_model)$coef[4, 2]),
rnorm(1, mean = x_model$coef[5], sd = summary(x_model)$coef[5, 2]),
rnorm(1, mean = x_model$coef[6], sd = summary(x_model)$coef[6, 2])
),
s_model_coefs = c(
rnorm(1, mean = s_model$coef[1], sd = summary(s_model)$coef[1, 2]),
rnorm(1, mean = s_model$coef[2], sd = summary(s_model)$coef[2, 2]),
rnorm(1, mean = s_model$coef[3], sd = summary(s_model)$coef[3, 2]),
rnorm(1, mean = s_model$coef[4], sd = summary(s_model)$coef[4, 2]),
rnorm(1, mean = s_model$coef[5], sd = summary(s_model)$coef[5, 2]),
rnorm(1, mean = s_model$coef[6], sd = summary(s_model)$coef[6, 2])
)
)$estimate
}
Finally, we obtain the ORYX estimate and 95% confidence interval from the distribution of 1,000 odds ratio estimates. As expected, ORYX ~ 2, indicating that we were successfully able to obtain a bias-adjusted estimate in our biased data that approximates the known, unbiased estimate.
# odds ratio estimate
round(median(or), 2)
#> 2.02
# confidence interval
round(quantile(or, c(.025, .975)), 2)
#> 1.93 2.11
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.