The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
R package mmabig is used for general multiple mediation/confounding analysis with high dimensional data sets Yu and Li (2022). Mediation/confounding effects refer to the effects from third variables that intervene the relationship between an exposure and an outcome. In the following we generally call the mediation/confounding effect as the third-variable effect (TVE). TVEs include the total effect between the exposure and the outcome, direct effect from the exposure to outcome after adjusting for other variables, and indirect effect of a third variable (the effect from the exposure variable to the third variable and to the outcome), which are defined and the statistical inferences described in Li et al. (2021). In mmabig, a generalized linear model with LASSO or elastic net regularization is used to fit the final model. If readers are interested in using a generalized linear model or multiple Additive Regression Trees (MART) to fit the model, please refer to the R package mma (Yu and Li 2017).
To use the R package mmabig, we first install the package in R
(install.packages("mmabig")
) and load it.
The function data.org.big is used to do a preliminary data analysis to identify potential Mediators/Confounders (MCs) and covariates. It returns an object of “med_iden” class that organizes the data into a format to be used directly for the mediation analysis functions.
The following code generates a simulated data set with 20 potential MCs, of which five (m16-m20) are multicategorical variables. The real mediator/confounders are “m11”, “m12”, and “m16”, which are highly related with both the predictor “pred” and the outcome “y”.
# a binary predictor
set.seed(1)
n=100
pred=rbinom(n,1,0.5)
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m3=m3.1
m2=m2[,1:5]
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
lu<--0.5363+0.701*pred+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
# a continuous y
y<-rnorm(n,lu,1)
Next, we use the function “data.org.big” to identify mediators. The exposure variable is specified by “pred=”. All potential MCs and covariates are in the dataframe “x”. The outcome variable is “y”. Both the exposure and the outcome can be multivariate. The “mediator=1:ncol(m)” indicates that all variables in x should be tested for potential MCs. Two tests are needed to identify a potential MC: first, the variable is significantly related with the predictor adjusting for other covariates. The significance level is set by “alpha2”, whose default value is 0.01. Second, the variable has to be significantly related with the outcome not adjusting (testtype=2, by default) or adjusting (testtype=2) for the predictor and other variables. The significance level is set by “alpha1”. In the following example, p-value 1 shows the results for the second test, and p-value 2 are the results for the first test. A variable that passes the second test but not the first test is considered as a covariate. Variables do not pass the second test are not adopted in further analysis. Variables in “x” but not in “mediator” are forced in further analysis as covariates. The argument “alpha” is the elasticnet mixing parameter such that \(0\leq alpha\leq 1\), with alpha=1 be the lasso penalty, and alpha=0 be the ridge penalty. By default, alpha=1.
data.e1<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),
pred=data.frame(pred),testtype=1)
summary(data.e1,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m161" "m162"
#> Selected as covariates:
#> [1] "m01" "m02"
#> Tests:
#> Coefficients.y P-Value 2.pred
#> m01 0.702 0.548
#> m02 0.534 0.920
#> m11 * 1.180 0.000
#> m12 * 0.479 0.000
#> m161 * 0.000 0.000
#> m162 * 1.548 0.000
#> ----
#> *:mediator,-:joint mediator
#> Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor
The results from “data.org.big” function are summarized by the “summary” function. When “only=TRUE”, it only shows the test results for selected covariates and mediators. The multicategorical mediator “m16” has 3 categories, and it has been binarized into two binary variables “m161” and “m162”, and they are selected as potential mediators together. We specify testtype=1 in the above example to identify covariates/mediators using full model. By default, testtype=2 is used that covariates/mediators are tested one by one in models with the predictor only. The following codes show the results when testtype is 2.
data.e1.2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred))
summary(data.e1.2,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m161" "m162"
#> Selected as covariates:
#> [1] "m01" "m02"
#> Tests:
#> P-Value 1.y P-Value 2.pred
#> m01 0.001 0.548
#> m02 0.000 0.920
#> m11 * 0.000 0.000
#> m12 * 0.000 0.000
#> m16 * 0.000 0.000
#> ----
#> *:mediator,-:joint mediator
#> P-Value 1:univariate relationship test with the outcome;P-Value 2:Tests of relationship with the Predictor
The functions in mmabig can deal with binary, categorical, continuous, or time-to-event outcomes. If the outcome is time-to-event, it should be defined by the “Surv” function in the survival package. The following is an example.
lambda=1/500
survt=-log(runif(n))/lambda/exp(lu)
st=round(runif(n,1,500),0)
time=ifelse(st+survt>600,600,st+survt)-st
cen=ifelse(st+survt>600,0,1)
y=Surv(time,cen)
data.e3<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),testtype=1)
summary(data.e3,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m161" "m162"
#> Selected as covariates:
#> [1] "m01"
#> Tests:
#> Coefficients.y P-Value 2.pred
#> m01 0.341 0.548
#> m11 * 0.818 0.000
#> m12 * 0.286 0.000
#> m161 * 0.000 0.000
#> m162 * 1.290 0.000
#> ----
#> *:mediator,-:joint mediator
#> Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor
In addition, the package can handle multivariate and multicategorical predictors. If the predictor is multicategorical of k levels, “data.org.big” first transforms the predictor to k-1 binary predictors. If a variable significantly relates with any of the k-1 predictors, the variable passes the first test described above. P-value 2 is shown for each predictor. In the following example, the predictor has three levels.
# multicategorical predictor
set.seed(1)
n=100
pred=rmultinom(100,1,c(0.5, 0.3, 0.2))
pred=pred[1,]*0+pred[2,]*1+pred[3,]*2
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m2=m2[,1:5]
m3=m3.1
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
pred<-as.factor(pred)
# continuous y
lu<--0.5363+ifelse(pred=="1",0.3,0)+ifelse(pred=="2",0.7,0)+0.801*m[,1]+0.518*m[,2]+
1.402*m[,11]+0.773*m[,12]+ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rnorm(n,lu,1)
data.m.e1<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),testtype=1)
summary(data.m.e1,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m15"
#> Selected as covariates:
#> [1] "m01" "m02" "m16"
#> Tests:
#> Coefficients.y P-Value 2.pred1...i.1 P-Value 2.pred1...i.2
#> m01 0.546 0.721 0.732
#> m02 0.461 0.157 0.201
#> m11 * 1.375 0.000 0.000
#> m12 * 0.609 0.000 0.000
#> m15 * 0.041 0.000 0.000
#> ----
#> *:mediator,-:joint mediator
#> Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor
In the following example, the predictor is bivariate.
# multivariate predictor
set.seed(1)
n=100
pred=cbind(runif(n,-1,1),rnorm(n))
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*5,mean=0.3*pred[,1]+0.4*pred[,2],sd=1),n,5)
m3.1=matrix(rnorm(n*5,mean=0.7*pred[,1]+0.8*pred[,2],sd=1),n,5)
m3=m3.1
m3[m3.1<=0]=0
m3[0<m3.1 & m3.1<=1.28]=1
m3[m3.1>1.28]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
colnames(pred)=c("x1","x2")
# binary y
lu<--0.6852+0.3*pred[,1]+0.7*pred[,2]+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rbinom(n,1,exp(lu)/(1+exp(lu)))
data.m.c.e2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
testtype=1,alpha1=0.05,alpha2=0.05)
summary(data.m.c.e2,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m15" "m161" "m162" "m181" "m182" "m191" "m192"
#> Selected as covariates:
#> [1] "m01" "m06"
#> Tests:
#> Coefficients.y P-Value 2.x1 P-Value 2.x2
#> m01 0.355 0.607 0.533
#> m06 -0.211 0.216 0.977
#> m11 * 0.637 0.653 0.000
#> m12 * 0.381 0.003 0.000
#> m15 * 0.096 0.095 0.000
#> m161 * 0.000 0.003 0.003
#> m162 * 0.923 0.000 0.000
#> m181 * 0.092 0.018 0.018
#> m182 * 0.000 0.000 0.000
#> m191 * 0.000 0.028 0.028
#> m192 * 0.046 0.000 0.000
#> ----
#> *:mediator,-:joint mediator
#> Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor
Similarly, the package can deal with multivariate outcomes. The following code deals with multivariate predictors and multivariate responses. If the variable is significantly related with any one of the outcomes, it passes the second test described above. The results from “data.org.big” are summarized for each combination of the exposure-outcome relationship.
# multivariate predictor
set.seed(1)
n=100
pred=cbind(runif(n,-1,1),rnorm(n))
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*5,mean=0.3*pred[,1]+0.4*pred[,2],sd=1),n,5)
m3.1=matrix(rnorm(n*5,mean=0.7*pred[,1]+0.8*pred[,2],sd=1),n,5)
m3=m3.1
m3[m3.1<=0]=0
m3[0<m3.1 & m3.1<=1.28]=1
m3[m3.1>1.28]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
colnames(pred)=c("x1","x2")
#multivariate responses
y<-cbind(rnorm(n,lu,1),rbinom(n,1,exp(lu)/(1+exp(lu))))
colnames(y)=c("y1","y2")
data.m.m.c.e2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
testtype=1,alpha1=0.05,alpha2=0.05)
summary(data.m.m.c.e2,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m15" "m161" "m162" "m171" "m172" "m191" "m192" "m201"
#> [11] "m202"
#> Selected as covariates:
#> [1] "m01" "m02" "m09"
#> Tests:
#> Coefficients.y1 Coefficients.y2 P-Value 2.x1 P-Value 2.x2
#> m01 0.652 0.069 0.607 0.533
#> m02 0.454 0.000 0.053 0.936
#> m09 0.101 0.000 0.678 0.292
#> m11 * 1.375 0.813 0.653 0.000
#> m12 * 0.661 0.114 0.003 0.000
#> m15 * 0.059 0.000 0.095 0.000
#> m161 * 0.547 0.000 0.003 0.003
#> m162 * 1.519 0.021 0.000 0.000
#> m171 * 0.000 0.000 0.219 0.219
#> m172 * 0.163 0.000 0.000 0.000
#> m191 * 0.000 0.000 0.028 0.028
#> m192 * 0.072 0.000 0.000 0.000
#> m201 * 0.000 0.008 0.053 0.053
#> m202 * 0.149 0.331 0.000 0.000
#> ----
#> *:mediator,-:joint mediator
#> Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor
data.m.m.c.e2.2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
alpha1=0.05,alpha2=0.05)
summary(data.m.m.c.e2.2,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12"
#> Selected as covariates:
#> [1] "m01" "m09"
#> Tests:
#> P-Value 1.y1 P-Value 1.y2 P-Value 2.x1 P-Value 2.x2
#> m01 0.000 0.053 0.607 0.533
#> m09 0.043 0.993 0.678 0.292
#> m11 * 0.000 0.000 0.653 0.000
#> m12 * 0.003 0.148 0.003 0.000
#> ----
#> *:mediator,-:joint mediator
#> P-Value 1:univariate relationship test with the outcome;P-Value 2:Tests of relationship with the Predictor
##Third-Variable Effect Analysis Next, we use med.big function to estimate the TVE using the results from data.org.big function.
To show the results:
For survival outcome, the default option is to fit the final full model using Cox proportional hazard model.
med.m.m.c.e2.2<-med.big(data.m.m.c.e2.2)
med.m.m.c.e2.2
#> $y1
#> x1 x2
#> de 0.2494818 0.9580249
#> m11 0.1204943 0.7289243
#> m12 0.4290098 0.3444318
#> te 0.7989859 2.0313811
#>
#> $y2
#> x1 x2
#> de 0.000000000 0.000000000
#> m11 0.063517400 0.384245356
#> m12 0.003373375 0.002708325
#> te 0.066890775 0.386953680
Finally, in the mmabig package, the relationship between third variables and the exposure variable(s) are fitted through generalized smoothing splines to allow the fit of potential nonlinear relationships. In the package, the ns() function is used to generate the B-spline basis matrix for a natural cubic spline. The argument “df” in the med.big function is used to assign the degrees of freedom for the spline basis matrix. By default, the degree of freedom is 1, which is to fit a linear relationship.
We also allow generalized linear models to fit the relationship between outcomes and all predictors in the full model. The argument used in the med.big function to define the generalized linear model is “family1”. It is a list with the ith item defines the conditional distribution of the ith outcome, y[,i] given the predictors, and the linkage function that links the mean of the outcome with the system component if generalized linear model is used as the final full model. The default value of “family1”” is gaussian(link=“identity”) for continuous outcomes, and binomial(link = “logit”) for binary ones.
The “mma.big” is a function that automatically identify potential MCs, based on which to make statistical inference on the mediation effects for high-dimensional data sets. Bootstrap method is used to make inferences on the TVE. The summary function is used to summarize inference results. In the summary function, three different sets of confidence intervals are calculated: based on the normal assumption of bootstrap method (lwbd, upbd), on the quantiles (lwbd_q,upbd_q), and on the confidence ball (lwbd_b, upbd_b) (Yu and Li 2021).
set.seed(1)
n=100
pred=rbinom(n,1,0.5)
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m3=m3.1
m2=m2[,1:5]
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
lu<--0.5363+0.701*pred+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rnorm(n,lu,1)
mma.e1<-mma.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
alpha=1,alpha1=0.05,alpha2=0.05,n2=3) #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e1)
#>
#> The mediaiton effects:
#>
#> For the response variable,y,
#> pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est 0.019 0.000 0.989 1.212 0.521 2.740
#> mean 0.150 -0.012 1.046 1.201 0.528 2.913
#> sd 0.205 0.025 0.143 0.300 0.197 0.353
#> upbd_q 0.368 0.010 1.197 1.469 0.738 3.247
#> lwbd_q 0.003 -0.037 0.936 0.902 0.393 2.576
#> upbd 0.421 0.048 1.270 1.800 0.906 3.432
#> lwbd -0.384 -0.048 0.708 0.623 0.135 2.047
#> upbd_b 0.065 -0.007 1.207 1.237 0.754 2.916
#> lwbd_b 0.000 -0.038 0.932 0.884 0.440 2.558
mma.big also accepts the organized dataset from data.org.big as the first argument.
mma.e1.2<-mma.big(data=data.e1.2,alpha1=0.05,alpha2=0.05,n2=3)
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e1.2,RE=TRUE)
#> The relative effects:
#>
#> For the response variable,y,
#> pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est 0.007 0 0.361 0.442 0.190 1
#> mean 0.089 0 0.351 0.374 0.185 1
#> sd 0.069 0 0.050 0.020 0.085 0
#> upbd_q 0.132 0 0.404 0.396 0.267 1
#> lwbd_q 0.016 0 0.321 0.360 0.105 1
#> upbd 0.141 0 0.459 0.482 0.358 1
#> lwbd -0.128 0 0.263 0.402 0.022 1
plot.mma.big() plots the marginal effect of the selected variable on the outcome, and the marginal effect of the predictor on the selected variable.
In the above figure, the upper panel shows the coefficient of “m11” when it is used to estimate y in an elasticnet regression with y being the outcome in bootstrap samples. We see that the coefficient is significantly positive. The lower panel shows when the exposure variable increases by 1 unit, the average change in “m11”. Again we see that the change in “m11” is positive. Therefore, there is a significant indirect effect (positive) through “m11”. As is shown by summary results, all confidence intervals for the indirect effect of “m11” are to the right of 0.
lambda=1/500
survt=-log(runif(n))/lambda/exp(lu)
st=round(runif(n,1,500),0)
time=ifelse(st+survt>600,600,st+survt)-st
cen=ifelse(st+survt>600,0,1)
y=Surv(time,cen)
mma.e3<-mma.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),alpha=1,alpha1=0.05,
alpha2=0.05,n2=3) #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
mma.e3.2<-mma.big(data=data.e3.2,alpha1=0.05,alpha2=0.05,n2=3) #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e3)
#>
#> The mediaiton effects:
#>
#> For the response variable,y,
#> pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est 0.201 -0.037 0.683 0.809 0.530 2.185
#> mean 0.182 -0.025 0.581 0.906 0.412 2.055
#> sd 0.160 0.047 0.183 0.451 0.078 0.701
#> upbd_q 0.297 0.005 0.719 1.357 0.460 2.697
#> lwbd_q 0.012 -0.075 0.388 0.506 0.328 1.367
#> upbd 0.514 0.056 1.041 1.693 0.683 3.560
#> lwbd -0.112 -0.130 0.325 -0.075 0.377 0.811
#> upbd_b 0.245 0.005 0.645 0.844 0.460 2.108
#> lwbd_b 0.000 -0.079 0.375 0.489 0.453 1.328
summary(mma.e3.2,RE=TRUE,quant=FALSE)
#> The relative effects:
#>
#> For the response variable,y,
#> pred.de pred.m161 pred.m162 pred.m171 pred.m172 pred.m11 pred.m12
#> est 0.124 -0.002 0.347 0.000 0.000 0.375 0.156
#> mean 0.075 -0.010 0.338 0.013 0.011 0.394 0.178
#> sd 0.111 0.008 0.013 0.014 0.014 0.030 0.066
#> upbd_q 0.194 -0.001 0.348 0.027 0.026 0.422 0.234
#> lwbd_q 0.001 -0.015 0.324 0.001 0.000 0.366 0.111
#> upbd 0.342 0.015 0.373 0.028 0.028 0.433 0.284
#> lwbd -0.095 -0.019 0.321 -0.028 -0.028 0.317 0.027
#> pred.te
#> est 1
#> mean 1
#> sd 0
#> upbd_q 1
#> lwbd_q 1
#> upbd 1
#> lwbd 1
We can also plot the selected variables on the survival outcome.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.