library(mvnimpute)
This tutorial describes how to use the mvnimpute
package. We will go through an example generated by the
data.generation
function in the package.
#### data generation
<- 500; p <- 6
n # set.seed(133)
<- c(1, 1, 2, 4, 3, 5)
m <- clusterGeneration::genPositiveDefMat(p, "eigen")$Sigma
v
<- c(1, 2)
miss.var <- c(3, 5)
censor.var
<- data.generation(num_ind = n,
example.data mean_vec = m,
cov_mat = v,
miss_var = miss.var,
miss_mech = "MAR",
miss_prob = NULL,
censor_var = censor.var,
censor_type = "interval",
censor_param = 0.5)
names(example.data)
#> [1] "full.data" "observe.data" "bounds" "indicator"
We generate a dataset with sample size 500, with specified mean
vector consists of six specified values, and a random positive definite
covariance matrix. The output includes four objects which include,
full.data
, the data matrix that would have been observed
without missing and censoring information; observe.data
,
the truly observed data matrix with missing and censoring information;
bounds
, the bounds information of the variables that have
missing or censored values. Specifically, we take missing as a special
type of interval censoring with no censoring limits (censoring limits
are set as \(\pm \infty\));
indicator
, the matrix that indicating the data type: 0 as
the missing data, 1 as the observed data, and 2 as the censored data.
Now, let us take a look at the dataset
# data
<- example.data$full.data
full <- example.data$observe.data
observed <- example.data$bounds
censoring.bounds <- example.data$indicator
indicator # summary
tail(observed) # observed data
#> y1 y2 y3 y4 y5 y6
#> [495,] -1.547567 NA NaN 2.315895 NaN 3.907107
#> [496,] NA -0.2994182 NaN 4.492385 6.786569 2.777854
#> [497,] 1.952071 0.9228392 3.27131928 4.488331 NaN 6.952747
#> [498,] -2.256520 -1.4893098 -0.06679569 4.408467 NaN 2.853682
#> [499,] 2.890608 -2.3451872 -5.92138055 6.106857 4.939494 4.159683
#> [500,] 2.510733 2.0193359 NaN 4.387197 3.525996 6.315236
tail(censoring.bounds[[1]]);tail(censoring.bounds[[2]]) # censoring bounds information
#> y1 y2 y3 y4 y5 y6
#> [495,] -1.547567 NA 0.15034900 2.315895 1.4321201 3.907107
#> [496,] NA -0.2994182 0.85767392 4.492385 6.7865694 2.777854
#> [497,] 1.952071 0.9228392 3.27131928 4.488331 3.1594592 6.952747
#> [498,] -2.256520 -1.4893098 -0.06679569 4.408467 0.0873068 2.853682
#> [499,] 2.890608 -2.3451872 -5.92138055 6.106857 4.9394944 4.159683
#> [500,] 2.510733 2.0193359 0.26097282 4.387197 3.5259965 6.315236
#> y1 y2 y3 y4 y5 y6
#> [495,] -1.547567 NA 0.69562111 2.315895 6.794582 3.907107
#> [496,] NA -0.2994182 3.20116375 4.492385 6.786569 2.777854
#> [497,] 1.952071 0.9228392 3.27131928 4.488331 5.118294 6.952747
#> [498,] -2.256520 -1.4893098 -0.06679569 4.408467 4.010630 2.853682
#> [499,] 2.890608 -2.3451872 -5.92138055 6.106857 4.939494 4.159683
#> [500,] 2.510733 2.0193359 5.08191846 4.387197 3.525996 6.315236
tail(indicator) # indicator matrix
#> y1 y2 y3 y4 y5 y6
#> [495,] 1 0 4 1 4 1
#> [496,] 0 1 4 1 1 1
#> [497,] 1 1 1 1 4 1
#> [498,] 1 1 1 1 4 1
#> [499,] 1 1 1 1 1 1
#> [500,] 1 1 4 1 1 1
From the observed data matrix, the missing values are denoted as
NA
, the censored values are denoted as NaN
.
The missing data has censoring limits set as \(\pm 10,000\) (we set \(\pm 10,000\) as proxies for \(\pm \infty\)). The indicator matrix labels
the missing value as 0 and the interval censored value as 4. Then the
plot that visually shows the percentages of observed, missing and
censored values can be generated by the visual.plot
function.
visual.plot(indicator, title = NULL)
For doing multiple imputation, we take a Normal-Inverse-Wishart distribution for joint prior distribution of the mean vector and covariance matrix. We run MCMC simulation for 1,000 iterations.
### prior specifications
<- list(
prior.spec mu.0 = rep(0, p),
Lambda.0 = diag(10, p),
kappa.0 = 100,
nu.0 = p * (p + 1) / 2
)
<- list(
start.vals mu = rep(1, p),
sigma = diag(p)
)### MCMC simulation
<- 1000
iter <- multiple.imputation(
sim.res
censoring.bounds,
prior.spec,
start.vals,
iter,TRUE
)
conv.plot(sim.res$simulated.mu, 0, iter, title = "convergence plot of the mean values")
conv.plot(sim.res$simulated.sig, 0, iter, title = "convergence plot of the variance values")
= paste("acf: variable", 1:p)
title acf.calc(sim.res$simulated.mu, title = title)
acf.calc(sim.res$simulated.sig, title = title)
From the convergence plots and the autocorrelation plots, we see that the posterior distributions of both the mean and variance seem to converge to equilibrium in first few iterations.
We draw and compare the marginal density plots using the full data, the CC data and the imputed data from the last iteration of MCMC using the mvnimpute package.
for (i in c(1, 2, 3, 5)) {
plot(density(full[, i]), main = colnames(full)[i])
lines(density(observed[!is.na(observed[, i]), i]), col = 6, lty = 2)
lines(density(sim.res$imputed.data[[iter]][, i]), col = 4, lty = 3)
}
The black line is the density plot of the full data, the blue dotted line is the imputed data using the mvnimpute pacakge, the pink dashed line is the CC data. We see that the imputed data density align with the full data density quite well.
<- coef(lm(y4 ~ y1 + y2 + y3 + y5 + y6, data = data.frame(full)))
full.reg <- coef(lm(y4 ~ y1 + y2 + y3 + y5 + y6, data = data.frame(observed)))
cc.reg
### imputed data
<- seq(800, 1000, 50)
gibbs
#### mvnimpute
<- list()
mvnimpute.dat <- list()
model.param for (i in 1:length(gibbs)) {
<- sim.res$imputed.data[[gibbs[i]]]
mvnimpute.dat[[i]] colnames(mvnimpute.dat[[i]]) <- paste0("y", 1:p)
<- lm(y4 ~ y1 + y2 + y3 + y5 + y6, data.frame(mvnimpute.dat[[i]]))
model.param[[i]]
}
<- summary(mice::pool(model.param))
mvnimpute.mod <- mvnimpute.mod[, 2]
sim.reg
<- data.frame(
reg.compare true = true.beta,
full = full.reg,
mvnimpute = sim.reg,
cc = cc.reg
)
reg.compare#> true full mvnimpute cc
#> (Intercept) 3.73975211 4.07379080 4.06506771 4.215508888
#> y1 0.03742948 0.02917612 0.02735768 -0.004483437
#> y2 0.02449541 0.09440116 0.10648160 0.069068147
#> y3 0.01128354 0.01674152 0.01285773 0.017741180
#> y5 0.01602758 0.05832823 0.05773384 0.055510475
#> y6 0.02553464 -0.05083920 -0.04895012 -0.035768977
options(old)