Introduction


This vignette is intended to provide a first introduction to the R package mitml for generating and analyzing multiple imputations for multilevel missing data. A typical application of the package consists of the following steps.

  1. Imputation
  2. Assessment of convergence
  3. Completion of the data
  4. Analysis
  5. Pooling

The mitml package offers a set of tools to facilitate each of these steps. This vignette is intended as a step-by-step illustration of the basic features of mitml.

Example Data (studentratings)

For this vignette, we employ a simple example that makes use of the studentratings data set, which is provided with mitml. To use it, the mitml package and the data set must be loaded as follows.

library(mitml)
data(studentratings)

More information about the variables in the data set can be obtained from its summary.

summary(studentratings)
#        ID       FedState     Sex              MathAchiev       MathDis      
#  Min.   :1001   B :375   Length:750         Min.   :225.0   Min.   :0.2987  
#  1st Qu.:1013   SH:375   Class :character   1st Qu.:440.7   1st Qu.:1.9594  
#  Median :1513            Mode  :character   Median :492.7   Median :2.4350  
#  Mean   :1513                               Mean   :495.4   Mean   :2.4717  
#  3rd Qu.:2013                               3rd Qu.:553.2   3rd Qu.:3.0113  
#  Max.   :2025                               Max.   :808.1   Max.   :4.7888  
#                                             NA's   :132     NA's   :466     
#       SES          ReadAchiev       ReadDis        CognAbility      SchClimate     
#  Min.   :-9.00   Min.   :191.1   Min.   :0.7637   Min.   :28.89   Min.   :0.02449  
#  1st Qu.:35.00   1st Qu.:427.4   1st Qu.:2.1249   1st Qu.:43.80   1st Qu.:1.15338  
#  Median :46.00   Median :490.2   Median :2.5300   Median :48.69   Median :1.65636  
#  Mean   :46.55   Mean   :489.9   Mean   :2.5899   Mean   :48.82   Mean   :1.73196  
#  3rd Qu.:59.00   3rd Qu.:558.4   3rd Qu.:3.0663   3rd Qu.:53.94   3rd Qu.:2.24018  
#  Max.   :93.00   Max.   :818.5   Max.   :4.8554   Max.   :71.29   Max.   :4.19316  
#  NA's   :281                     NA's   :153                      NA's   :140
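
The NA counts in the summary can also be expressed as the percentage of missing values per variable. The following is a minimal sketch in base R. The helper name missing_pct and the toy data frame are assumptions made for illustration and are not part of mitml.

```r
# Hypothetical helper: percentage of missing values per column.
missing_pct <- function(data) {
  round(100 * colMeans(is.na(data)), 1)
}

# Toy data frame standing in for studentratings (made-up values).
toy <- data.frame(
  ID      = c(1, 1, 2, 2),
  ReadDis = c(2.5, NA, 3.1, NA),
  SES     = c(40, 55, NA, 62)
)

missing_pct(toy)
# With the vignette data loaded: missing_pct(studentratings)
```

Applied to studentratings, the same call should express the NA counts shown above as percentages of the 750 rows.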

In addition, the correlations between variables (based on pairwise observations) may be useful for identifying possible sources of information that may be used during the treatment of missing data.

cor(studentratings[,-(1:3)], use="pairwise")
#             MathAchiev MathDis    SES ReadAchiev ReadDis CognAbility SchClimate
# MathAchiev       1.000  -0.106  0.260      0.497  -0.080       0.569     -0.206
# MathDis         -0.106   1.000 -0.206     -0.189   0.613      -0.203      0.412
# SES              0.260  -0.206  1.000      0.305  -0.153       0.138     -0.176
# ReadAchiev       0.497  -0.189  0.305      1.000  -0.297       0.413     -0.320
# ReadDis         -0.080   0.613 -0.153     -0.297   1.000      -0.162      0.417
# CognAbility      0.569  -0.203  0.138      0.413  -0.162       1.000     -0.266
# SchClimate      -0.206   0.412 -0.176     -0.320   0.417      -0.266      1.000

This illustrates that most variables are affected by missing data but also that substantial relations exist between variables. In the following, we focus on a subset of these variables.

Model of Interest

For the present example, we focus on the two variables ReadDis (disciplinary problems in reading class) and ReadAchiev (reading achievement).

Assume we are interested in the relation between these variables. Specifically, using the formula syntax of the R package lme4, we may be interested in the following multilevel model.

ReadAchiev ~ 1 + ReadDis + (1|ID)

In this model, the relation between ReadDis and ReadAchiev is represented by the fixed effect of ReadDis, and a random intercept is included to account for the clustered structure of the data.
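
Written out as an equation, and assuming the usual specification with normally distributed random intercepts and residuals, the model for observation \(i\) in cluster \(j\) (as identified by ID) can be sketched as follows. The symbols are introduced here for illustration only and do not appear in the package output.

\[
\text{ReadAchiev}_{ij} = \beta_0 + \beta_1 \, \text{ReadDis}_{ij} + u_j + e_{ij}, \qquad u_j \sim N(0, \tau^2), \quad e_{ij} \sim N(0, \sigma^2)
\]

Here \(\beta_1\) corresponds to the fixed effect of ReadDis, \(u_j\) to the random intercept, and \(e_{ij}\) to the residual.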

Generating Imputations

The mitml package includes wrapper functions for the R packages pan (panImpute) and jomo (jomoImpute). Here, we will use the latter option. To generate imputations with jomoImpute, the user has to specify:

  1. an imputation model
  2. the number of iterations and imputations

The easiest way of specifying the imputation model is to use the formula argument of jomoImpute. Generally speaking, the imputation model should include all variables that are either (a) part of the model of interest, (b) related to the variables in the model, or (c) related to whether the variables in the model are missing.

In this simple example, we include only ReadDis and ReadAchiev as well as SchClimate.

fml <- ReadAchiev + ReadDis + SchClimate ~ 1 + (1|ID)

Note that all variables are included on the left-hand side of the model, whereas the right-hand side is left “empty” (for a further explanation of this model, see Grund, Lüdtke, & Robitzsch, 2016).
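
To make the two sides of the imputation formula more concrete, the following sketch lists the variables on each side using base R. The variant fml2 and the helper names targets and predictors are assumptions made for illustration. The variant adds CognAbility, which has no missing values according to the summary above, as a predictor, and it is not the model used in the remainder of this vignette.

```r
# Imputation model used in this vignette: three targets, intercept only.
fml <- ReadAchiev + ReadDis + SchClimate ~ 1 + (1|ID)

# Hypothetical variant (not used below): CognAbility as a predictor.
fml2 <- ReadAchiev + ReadDis + SchClimate ~ 1 + CognAbility + (1|ID)

# Hypothetical base R helpers listing the variables on each side.
targets    <- function(f) all.vars(f[[2]])  # left-hand side
predictors <- function(f) all.vars(f[[3]])  # right-hand side

targets(fml)      # variables to be imputed
predictors(fml2)  # predictors, including the cluster variable ID
```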

The imputation procedure is then run for 5,000 iterations (burn-in), after which 100 imputations are drawn, one every 100 iterations.

imp <- jomoImpute(studentratings, formula=fml, n.burn=5000, n.iter=100, m=100)

This step may take a few seconds. Once the process is completed, the imputations are saved in the imp object.
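
The settings above imply the total length of the sampling run. The following sketch works through the arithmetic. It assumes that the first imputation is drawn n.iter iterations after the burn-in ends, which matches the summary of the procedure but is not taken from the package source.

```r
n.burn <- 5000  # burn-in iterations
n.iter <- 100   # iterations between imputations
m      <- 100   # number of imputations

# Iteration at which each imputation is drawn (assumption: the first
# imputation is taken n.iter iterations after the burn-in ends).
draw_at <- n.burn + n.iter * seq_len(m)

range(draw_at)  # first and last draw
max(draw_at)    # total number of iterations
```

Under this assumption, the run comprises 15,000 iterations in total.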

Assessing Convergence

In mitml, there are two options for assessing whether or not the imputation procedure converged. First, the summary function calculates the “potential scale reduction factor” (\(\hat{R}\)) for each parameter in the imputation model. If that value is noticeably larger than 1 for some parameters (say \(>1.05\)), a longer burn-in period may be required.

summary(imp)
# 
# Call:
# 
# jomoImpute(data = studentratings, formula = fml, n.burn = 5000, 
#     n.iter = 100, m = 100)
# 
# Cluster variable:         ID 
# Target variables:         ReadAchiev ReadDis SchClimate 
# Fixed effect predictors:  (Intercept) 
# Random effect predictors: (Intercept) 
# 
# Performed 5000 burn-in iterations, and generated 100 imputed data sets,
# each 100 iterations apart. 
# 
# Potential scale reduction (Rhat, imputation phase):
#  
#          Min   25%  Mean Median   75%   Max
# Beta:  1.000 1.001 1.001  1.001 1.001 1.001
# Psi:   1.000 1.000 1.001  1.000 1.000 1.007
# Sigma: 1.000 1.000 1.000  1.000 1.001 1.001
# 
# Largest potential scale reduction:
# Beta: [1,1], Psi: [1,1], Sigma: [3,2]
# 
# Missing data per variable:
#     ID ReadAchiev ReadDis SchClimate FedState Sex MathAchiev MathDis SES  CognAbility
# MD% 0  0          20.4    18.7       0        0   17.6       62.1    37.5 0
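
To illustrate what the potential scale reduction factor measures, the following sketch computes it for a single chain that is split into segments, comparing the variance within segments to the variance between segment means. The helper split_rhat and the simulated chains are assumptions made for illustration. This is a textbook-style computation and not necessarily the one used internally by mitml.

```r
# Hypothetical helper: split-chain potential scale reduction factor.
split_rhat <- function(chain, n.seg = 3) {
  n <- floor(length(chain) / n.seg)        # length of each segment
  # Each column holds one contiguous segment of the chain.
  segs <- matrix(chain[seq_len(n * n.seg)], nrow = n, ncol = n.seg)
  W <- mean(apply(segs, 2, var))           # within-segment variance
  B <- n * var(colMeans(segs))             # between-segment variance
  var_hat <- (n - 1) / n * W + B / n       # pooled variance estimate
  sqrt(var_hat / W)
}

set.seed(1)
stationary <- rnorm(3000)                                 # no drift
drifting   <- rnorm(3000) + seq(0, 5, length.out = 3000)  # upward drift

split_rhat(stationary)  # close to 1
split_rhat(drifting)    # clearly larger than 1
```

A chain that drifts has segment means that differ from each other, which inflates the between-segment variance and pushes the value above 1.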

Second, diagnostic plots can be requested with the plot function. These plots consist of a trace plot, an autocorrelation plot, and some additional information about the posterior density. Convergence can be assumed if the trace plot is stationary (i.e., does not “drift”), and the autocorrelation is within reasonable bounds for the number of iterations chosen between imputations.

For this example, we examine only the plot for the parameter Beta[1,2] (i.e., the intercept of ReadDis).

plot(imp, trace="all", print="beta", pos=c(1,2))
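
The role of the autocorrelation can be illustrated with a simulated chain. The sketch below uses the base R function acf on an artificial AR(1) series that stands in for a sampled parameter. The series and its autoregressive coefficient are made up, and the values are unrelated to the imp object.

```r
set.seed(1)
# Simulated AR(1) chain standing in for a sampled parameter (made-up values).
chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 5000))

n.iter <- 100  # spacing between imputations
ac <- acf(chain, lag.max = n.iter, plot = FALSE)

ac$acf[2]           # autocorrelation at lag 1 (high)
ac$acf[n.iter + 1]  # autocorrelation at lag n.iter (near zero)
```

If the autocorrelation at the chosen spacing were still substantial, a larger value of n.iter would be advisable so that successive imputations are approximately independent.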

Taken together, both \(\hat{R}\) and the diagnostic plots indicate that the imputation model converged, setting the basis for the analysis of the imputed data sets.

Completing the Data

In order to work with and analyze the imputed data sets, the data sets must be completed with the imputations generated in the previous steps. To do so, mitml provides the function mitmlComplete.

implist <- mitmlComplete(imp, "all")

The resulting object is a list that contains the 100 completed data sets.

Analysis and Pooling

In order to obtain estimates for the model of interest, the model must be fit separately to each of the completed data sets, and the results must be pooled into a final set of estimates and inferences. The mitml package offers the with function to fit various statistical models to a list of completed data sets.

In this example, we use the lmer function from the R package lme4 to fit the model of interest.

library(lme4)
fit <- with(implist, lmer(ReadAchiev ~ 1 + ReadDis + (1|ID)))

The resulting object is a list containing the 100 fitted models. To pool the results of these models into a set of final estimates and inferences, mitml offers the testEstimates function.

testEstimates(fit, var.comp=TRUE)
# 
# Call:
# 
# testEstimates(model = fit, var.comp = TRUE)
# 
# Final parameter estimates and inferences obtained from 100 imputed data sets.
# 
#              Estimate Std.Error   t.value        df   P(>|t|)       RIV       FMI 
# (Intercept)   582.428    14.580    39.946  3778.659     0.000     0.193     0.162 
# ReadDis       -35.788     5.259    -6.806  2905.217     0.000     0.226     0.185 
# 
#                         Estimate 
# Intercept~~Intercept|ID  900.861 
# Residual~~Residual      6993.816 
# ICC|ID                     0.114 
# 
# Unadjusted hypothesis test as appropriate in larger samples.
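
The pooled values in this output can be understood in terms of Rubin's rules for combining estimates across imputed data sets. The following is a minimal sketch for a single parameter. The helper pool_rubin and the input values are assumptions made for illustration, and the sketch is not the implementation used by testEstimates.

```r
# Hypothetical helper: pool one parameter across m analyses (Rubin's rules).
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)                # pooled estimate
  ubar <- mean(se^2)               # within-imputation variance
  b    <- var(est)                 # between-imputation variance
  tvar <- ubar + (1 + 1/m) * b     # total variance
  riv  <- (1 + 1/m) * b / ubar     # relative increase in variance
  df   <- (m - 1) * (1 + 1/riv)^2  # degrees of freedom
  c(Estimate = qbar, Std.Error = sqrt(tvar), df = df, RIV = riv)
}

# Made-up estimates and standard errors from m = 3 analyses.
pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
```

As a rough check, inserting the rounded RIV of 0.193 reported for the intercept into the formula for the degrees of freedom with m = 100 gives a value of roughly 3,780, which agrees with the df shown in the output up to rounding.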

The resulting estimates can be interpreted in the same way as the estimates from the corresponding complete-data procedure.
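
For example, the intraclass correlation (ICC) in the output above can be read as the share of the total variance that is due to differences between clusters. The short sketch below recomputes it from the two pooled variance components. The variable names are chosen for illustration, and whether mitml pools the ICC in exactly this way is an assumption.

```r
# Pooled variance components copied from the output above.
var_intercept <- 900.861   # Intercept~~Intercept|ID
var_residual  <- 6993.816  # Residual~~Residual

# Share of the total variance that lies between clusters.
icc <- var_intercept / (var_intercept + var_residual)
round(icc, 3)  # 0.114
```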

This completes the introduction to the basic features of mitml.


# Author: Simon Grund (grund@ipn.uni-kiel.de)
# Date:   2017-03-15