First, we generate a small simulated dataset.
set.seed(1)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
Then we use ILSE to fit a linear regression model to the data without missing values.
We can also pass a data.frame object as input to ILSE.
dat <- data.frame(Y=Y, X=X)
ilse1 <- ilse(Y~., data=dat)
print(ilse1)
Coef(ilse1) # access the coefficients
Fitted.values(ilse1)[1:5]
Residuals(ilse1)[1:5]
Check which variables are significant using the bootstrap.
First, we randomly remove some entries in X.
mis_rate <- 0.3
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
ncomp <- sum(complete.cases(Xmis))
message("Number of complete cases is ", ncomp, '\n')
Second, we use lm to fit a linear regression model based on the complete cases only, i.e., CC analysis. We cannot detect any significant covariates.
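The complete-case fit can be sketched as follows. This is a minimal sketch rather than the vignette's original chunk: it assumes the objects Y and Xmis created above and, if they are not in the workspace, simulates stand-ins with independent normal covariates so that it runs on its own. By default lm drops every row containing an NA (na.action = na.omit), so passing Xmis directly gives the CC analysis. The name s_cc matches the object whose p-values are plotted later in this document.

```r
# Stand-in data, used only if Y and Xmis are not already in the workspace
# (assumption: independent normal covariates, not the correlated design above).
if (!exists("Xmis") || !exists("Y")) {
  set.seed(1)
  n <- 100; p <- 6
  X <- matrix(rnorm(n * p), n, p)
  Y <- drop(-2 + X %*% rep(c(1, -1), times = 3) + rnorm(n))
  Xmis <- X
  Xmis[sample(n * p, round(n * p * 0.3))] <- NA
}

# lm() drops every row that contains an NA (default na.action = na.omit),
# so this is the complete-case (CC) fit.
lm1 <- lm(Y ~ Xmis)
s_cc <- summary(lm1)

nobs(lm1)                # number of rows actually used in the fit
s_cc$coefficients[, 4]   # p-values, one per estimable coefficient
```

Comparing nobs(lm1) with n shows how much of the sample the CC analysis discards.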
Third, we use ILSE to fit the linear regression model based on all of the data. We can fit a model without an intercept by removing the intercept term in the formula.
Then we fit a linear regression model with an intercept.
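How the formula controls the intercept can be checked in base R before calling ILSE. The sketch below uses model.matrix on a small stand-in matrix (Xdemo and Ydemo are illustrative names, not objects from this vignette) to show that adding + 0 drops the intercept column. The ilse calls appear only as comments; they assume that ilse accepts the same formula syntax as lm, in the call form ilse(Y~Xmis, data=NULL) used later in this document.

```r
# Illustrative stand-in data (hypothetical names)
set.seed(1)
Xdemo <- matrix(rnorm(20), nrow = 5, ncol = 4)
Ydemo <- rnorm(5)

# Default formula: the design matrix contains an "(Intercept)" column
mm_with <- model.matrix(Ydemo ~ Xdemo)
colnames(mm_with)

# Adding "+ 0" (or "- 1") removes the intercept column
mm_without <- model.matrix(Ydemo ~ Xdemo + 0)
colnames(mm_without)

# Assumed ILSE equivalents, following the call form used elsewhere in this vignette:
# ilse(Y ~ Xmis + 0, data = NULL)   # without intercept
# ilse(Y ~ Xmis,     data = NULL)   # with intercept
```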
Fourth, we apply the bootstrap to estimate the standard error and p-value of each coefficient estimated by ILSE. We observe four significant coefficients.
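The idea behind bootstrap standard errors can be sketched in base R. The function below is a generic row-resampling bootstrap around lm on complete data. It is not ILSE's internal implementation, and boot_coef_table, B, Xfull and Yfull are hypothetical names introduced only for illustration. It resamples rows with replacement, refits the model, takes the standard deviation of the refitted coefficients as the standard error, and converts the resulting Z value into a two-sided normal p-value.

```r
# Hypothetical helper: bootstrap standard errors, Z values and p-values for lm coefficients
boot_coef_table <- function(X, Y, B = 200) {
  n <- nrow(X)
  est <- coef(lm(Y ~ X))                      # estimate on the original sample
  boot <- replicate(B, {
    id <- sample(n, n, replace = TRUE)        # resample rows with replacement
    coef(lm(Y[id] ~ X[id, , drop = FALSE]))   # refit on the resampled rows
  })
  se <- apply(boot, 1, sd)                    # bootstrap standard errors
  z <- est / se
  cbind(Estimate = est, Std.Error = se, Z.value = z,
        p.value = 2 * pnorm(-abs(z)))         # two-sided normal p-value
}

set.seed(1)
Xfull <- matrix(rnorm(100 * 3), 100, 3)
Yfull <- drop(-2 + Xfull %*% c(1, -1, 0) + rnorm(100))
tab <- boot_coef_table(Xfull, Yfull, B = 100)
round(tab, 3)
```

A larger B gives more stable standard errors at the cost of more refits, which is the trade-off behind choosing the number of bootstrap replicates.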
The ILSE package also provides full information maximum likelihood (FIML) for linear regression through fimlreg. We show how to use it on the missing data above.
We also use the bootstrap to estimate the standard error and p-value of each coefficient estimated by FIML. We observe only one significant coefficient.
We visualize the p-values of each method, where the red line marks 0.05 and the blue line marks 0.1 on the y-axis.
pMat <- cbind(CC=s_cc$coefficients[,4], ILSE=s2[,4], FIML=s_fiml[,4])
library(ggplot2)
df1 <- data.frame(Pval = as.vector(pMat[-1,]),
                  Method = factor(rep(c('CC', "ILSE", "FIML"), each=p)),
                  covariate = factor(rep(paste0("X", 1:p), times=3)))
ggplot(data=df1, aes(x=covariate, y=Pval, fill=Method)) +
  geom_bar(position = "dodge", stat="identity", width = 0.5) +
  geom_hline(yintercept = 0.05, color='red') +
  geom_hline(yintercept = 0.1, color='blue')
Based on the above data, we add a new column, a categorical variable (Sex), to the data.frame. This variable is not associated with the outcome variable.
dat <- data.frame(Y=Y, X=Xmis)
dat$Sex <- factor(rep(c('male', 'female'), times=n/2))
dat$Sex[sample(1:n, n*mis_rate)] <- NA
ilse1 <- ilse(Y~., data=dat, verbose = TRUE)
We can change the number of bootstrap replicates used to calculate the standard errors, Z values and p-values of the coefficients.
First, we generate data from a linear regression model with three important variables (1, 3, 5) and three unimportant variables (2, 4, 6).
set.seed(10)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,0), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
message("The true regression coefficients are: ", paste0(beta0, ' '))
We randomly assign missing values in the design matrix.
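One way to do this, following the same pattern used earlier in this document, is sketched below. The helper add_missing, the stand-in matrix Xtoy and the 30% rate in the usage lines are illustrative assumptions; the rate used for this example is not stated here. Entries are removed completely at random by sampling positions of the matrix.

```r
# Hypothetical helper: set a given fraction of matrix entries to NA at random
add_missing <- function(X, rate, seed = 1) {
  set.seed(seed)
  # round() guards against floating-point truncation of the sample size
  n_na <- round(length(X) * rate)
  X[sample(length(X), n_na)] <- NA
  X
}

# Illustrative usage on a stand-in design matrix
Xtoy <- matrix(rnorm(100 * 6), 100, 6)
Xtoy_mis <- add_missing(Xtoy, rate = 0.3)
mean(is.na(Xtoy_mis))            # fraction of missing entries
sum(complete.cases(Xtoy_mis))    # rows with no missing entries
```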
Next, we use ILSE to fit the model.
We then fit the model using lm and FIML, and finally compare ILSE with these two methods.
We visualize the p-values of each method, where the red line marks 0.05 on the y-axis. At significance level 0.05, both ILSE and FIML identify all important variables (X1, X3 and X5), while the CC method identifies only X1 and X5.
library(ggthemes)
pMat <- cbind(CC=s_cc$coefficients[,4], ILSE=s3[,4], FIML=s_fiml[,4])
df1 <- data.frame(Pval = as.vector(pMat[-1,]),
                  Method = factor(rep(c('CC', "ILSE", "FIML"), each=p)),
                  covariate = factor(rep(paste0("X", 1:p), times=3)))
ggplot(data=df1, aes(x=covariate, y=Pval, fill=Method)) +
  geom_bar(position = "dodge", stat="identity", width = 0.5) +
  geom_hline(yintercept = 0.05, color='red') +
  scale_fill_economist()
Here, we generate a dataset with 80% missing values and then use ILSE to fit the model.
# generate data from linear model
set.seed(10)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
# generate missing values
mis_rate <- 0.8
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
# retain 4 complete cases.
Xmis[1:4,] <- X[1:4, ]
sum(complete.cases(Xmis))
The CC method fails here.
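Why complete-case analysis breaks down can be seen from a count of parameters. With only four complete cases, lm has four observations to estimate seven coefficients (the intercept plus six slopes), so it does not stop with an error but returns NA for the coefficients it cannot estimate. The sketch below assumes Y and Xmis from the code above and falls back to simulated stand-ins (independent normal covariates, an assumption) if they are not in the workspace; lm_cc and n_cc are illustrative names.

```r
# Stand-in data, used only if Y and Xmis are not already in the workspace
if (!exists("Xmis") || !exists("Y")) {
  set.seed(10)
  n <- 100; p <- 6
  X <- matrix(rnorm(n * p), n, p)
  Y <- drop(-2 + X %*% rep(c(1, -1), times = 3) + rnorm(n))
  Xmis <- X
  Xmis[sample(n * p, round(n * p * 0.8))] <- NA
  Xmis[1:4, ] <- X[1:4, ]                 # retain 4 complete cases
}

n_cc <- sum(complete.cases(Xmis))
lm_cc <- lm(Y ~ Xmis)                     # uses the complete cases only

n_cc                         # observations available to the CC fit
length(coef(lm_cc))          # coefficients requested: intercept + slopes
sum(is.na(coef(lm_cc)))      # coefficients lm could not estimate
```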
However, ILSE still works.
We generate a large-scale dataset with n = 1000 and p = 50.
n <- 1000
p <- 50
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), length.out=p)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
mis_rate <- 0.3
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
Xmis[1:10,] <- X[1:10,]
lm1 <- lm(Y~Xmis)
lm1
system.time(ilse2 <- ilse(Y~Xmis, data=NULL, verbose=TRUE))
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=Chinese (Simplified)_China.936
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.936
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ILSE_1.1.7
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.7 digest_0.6.28 R6_2.5.1 jsonlite_1.7.2
#> [5] magrittr_2.0.1 evaluate_0.14 pbapply_1.5-0 rlang_0.4.11
#> [9] stringi_1.7.5 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.11
#> [13] tools_4.0.3 stringr_1.4.0 parallel_4.0.3 xfun_0.29
#> [17] yaml_2.2.2 fastmap_1.1.0 compiler_4.0.3 htmltools_0.5.2
#> [21] knitr_1.37 sass_0.4.0