Getting started with simputation

Mark van der Loo

2016-09-09


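This package offers a number of commonly used imputation methods, each with a similar and hopefully simple interface. The examples in this document use linear models (impute_lm), robust linear models (impute_rlm), group-wise medians (impute_median) and decision trees (impute_cart).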

A call to an imputation function has the following structure.

impute_<model>(data, formula, [model-specific options])

The output is similar to the data argument, except that empty values are imputed (where possible) using the specified model.

The formula argument specifies the variables to be imputed, the model specification for <model> and, optionally, the grouping of the dataset. The structure of a formula object is as follows:

IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]

where the part between [] is optional.
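To make the three parts concrete, the sketch below takes a formula apart using base R only. It illustrates the shape of the formula and is not meant to show how simputation parses it internally; the variable names (imputed, model_spec, grouping) are chosen here for illustration.

```r
# A formula with all three parts: IMPUTED ~ MODEL_SPECIFICATION | GROUPING
f <- Sepal.Length ~ Petal.Width | Species

imputed <- f[[2]]   # left-hand side: the variable to impute
rhs     <- f[[3]]   # right-hand side, here the call `Petal.Width | Species`

# `|` binds more tightly than `~`, so a grouped formula has `|`
# as the top-level operator of its right-hand side.
has_groups <- is.call(rhs) && identical(rhs[[1]], as.name("|"))
model_spec <- if (has_groups) rhs[[2]] else rhs
grouping   <- if (has_groups) all.vars(rhs[[3]]) else character(0)

print(deparse(imputed))
print(deparse(model_spec))
print(grouping)
```

Dropping the `| Species` part leaves has_groups FALSE and grouping empty, which corresponds to the optional part between [] being omitted.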

In the following, we assume that the reader already has some familiarity with the use of formulas in R (e.g. when specifying linear models) and statistical models commonly used in imputation.

A first example

First create a copy of the iris dataset with some empty values in columns 1 (Sepal.Length), 2 (Sepal.Width) and 5 (Species).

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
head(dat,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            NA         3.5          1.4         0.2  setosa
## 2            NA         3.0          1.4         0.2  setosa
## 3            NA          NA          1.3         0.2  setosa
## 4           4.6          NA          1.5         0.2  setosa
## 5           5.0          NA          1.4         0.2  setosa
## 6           5.4          NA          1.7         0.4  setosa
## 7           4.6          NA          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2    <NA>
## 9           4.4         2.9          1.4         0.2    <NA>
## 10          4.9         3.1          1.5         0.1    <NA>

To impute Sepal.Length using a linear model, use the impute_lm function.

library(simputation)
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.076579         3.5          1.4         0.2  setosa
## 2     4.675654         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa

Observe that the 3rd value is not imputed. This is because one of the predictor variables (Sepal.Width) is missing for that record, so the linear model does not produce a prediction. simputation does not report such cases but simply returns the partly imputed result. The remaining value can be imputed using a new linear model or, as shown below, using the group median.
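The effect of a missing predictor can be reproduced with base R's lm and predict. This is a sketch of the general behaviour of linear model prediction, under the assumption that it mirrors what happens here; it is not simputation's own code.

```r
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# lm() drops incomplete records when fitting (the default na.action is na.omit)
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = dat)

# predict() passes missing predictors through, so record 3
# (Sepal.Width is NA) gets no prediction
pred <- predict(fit, newdata = dat[1:3, ])
print(is.na(pred))
```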

da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.076579         3.5          1.4         0.2  setosa
## 2     4.675654         3.0          1.4         0.2  setosa
## 3     5.000000          NA          1.3         0.2  setosa

Here, Species is used to group the data before computing the medians.
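The idea of group-wise median imputation can be sketched in base R with ave. The example uses a fresh copy of iris with Species left intact and is only an illustration of the technique, not simputation's implementation.

```r
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# Median of the observed values within each Species,
# repeated for every record of that group
grp_median <- ave(dat$Sepal.Length, dat$Species,
                  FUN = function(v) median(v, na.rm = TRUE))

# Fill in only the missing entries
missing <- is.na(dat$Sepal.Length)
dat$Sepal.Length[missing] <- grp_median[missing]

print(head(dat, 3))
```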

Finally, we impute the Species variable using a decision tree model. All variables except Species are used as predictors.

da3 <- impute_cart(da2, Species ~ .)
head(da3,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      5.076579         3.5          1.4         0.2  setosa
## 2      4.675654         3.0          1.4         0.2  setosa
## 3      5.000000          NA          1.3         0.2  setosa
## 4      4.600000          NA          1.5         0.2  setosa
## 5      5.000000          NA          1.4         0.2  setosa
## 6      5.400000          NA          1.7         0.4  setosa
## 7      4.600000          NA          1.4         0.3  setosa
## 8      5.000000         3.4          1.5         0.2  setosa
## 9      4.400000         2.9          1.4         0.2  setosa
## 10     4.900000         3.1          1.5         0.1  setosa
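For readers who want to see the steps behind tree-based imputation of a categorical variable, here is a sketch using the rpart package as a stand-in decision tree implementation. Whether impute_cart uses the same defaults is an assumption not confirmed by this document, so treat it as an illustration of the idea only.

```r
library(rpart)

dat <- iris
dat[8:10, "Species"] <- NA

# Fit the tree on records where Species is observed
miss <- is.na(dat$Species)
fit  <- rpart(Species ~ ., data = dat[!miss, ])

# Predict a class label for the records where Species is missing
dat$Species[miss] <- predict(fit, newdata = dat[miss, ], type = "class")

print(dat[8:10, ])
```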

Chaining imputation methods

The %>% operator from the popular magrittr package allows for a very compact specification of the above examples.

library(magrittr)
da4 <- dat %>% 
  impute_lm(Sepal.Length ~ Sepal.Width + Species) %>%
  impute_median(Sepal.Length ~ Species) %>%
  impute_cart(Species ~ .)
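The pipe only changes how the calls are written: each step receives the output of the previous one as its data argument. The sketch below writes the same chain as nested calls and compares the two results. It assumes simputation and magrittr are installed; the object names piped and nested are chosen here for illustration.

```r
library(simputation)
library(magrittr)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Piped form, as above
piped <- dat %>%
  impute_lm(Sepal.Length ~ Sepal.Width + Species) %>%
  impute_median(Sepal.Length ~ Species) %>%
  impute_cart(Species ~ .)

# Equivalent nested form: read from the innermost call outwards
nested <- impute_cart(
  impute_median(
    impute_lm(dat, Sepal.Length ~ Sepal.Width + Species),
    Sepal.Length ~ Species),
  Species ~ .)

print(all.equal(piped, nested))
```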

Similar model for multiple variables

The simputation package allows users to specify an imputation model for multiple variables at once. For example, to impute both Sepal.Length and Sepal.Width with the same robust linear model specification, do the following.

da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.945416    3.500000          1.4         0.2  setosa
## 2     4.945416    3.000000          1.4         0.2  setosa
## 3     4.854056    3.378979          1.3         0.2  setosa
## 4     4.600000    3.440107          1.5         0.2  setosa
## 5     5.000000    3.409543          1.4         0.2  setosa
## 6     5.400000    3.501236          1.7         0.4  setosa

The function will model Sepal.Length and Sepal.Width against the predictor variables independently and impute them. The order of variables in the specification is therefore not important for the result.
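The independence of the models can be sketched with MASS::rlm and a small loop. The helper impute_each is hypothetical and written for this illustration; it is not part of simputation, and the assumption that each model is fitted on the original (not yet imputed) data follows from the description above. The example uses a copy of iris in which the predictors are complete.

```r
library(MASS)  # provides rlm()

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# Hypothetical helper: impute each target with its own model,
# always fitted on the original data
impute_each <- function(data, targets, predictors) {
  out <- data
  for (v in targets) {
    fit  <- rlm(reformulate(predictors, response = v), data = data)
    miss <- is.na(data[[v]])
    out[[v]][miss] <- predict(fit, newdata = data[miss, , drop = FALSE])
  }
  out
}

predictors <- c("Petal.Length", "Species")
a <- impute_each(dat, c("Sepal.Length", "Sepal.Width"), predictors)
b <- impute_each(dat, c("Sepal.Width", "Sepal.Length"), predictors)

# Swapping the order of the targets does not change the result
print(identical(a, b))
```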

In general, simputation analyzes the left-hand side of the model formula, combines it appropriately with the right-hand side, and passes the result to the underlying modeling routine. simputation also understands the "." syntax, which stands for "every variable not otherwise present", and the "-" sign, which removes variables from a formula. For example, the next expression imputes every variable except Species with the group mean plus a normally distributed random residual.

da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal")
head(da6)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.548924    3.500000          1.4         0.2  setosa
## 2     4.763129    3.000000          1.4         0.2  setosa
## 3     4.795675    3.491191          1.3         0.2  setosa
## 4     4.600000    3.367455          1.5         0.2  setosa
## 5     5.000000    3.447707          1.4         0.2  setosa
## 6     5.400000    3.715657          1.7         0.4  setosa

Here, Species on the right-hand side acts as the grouping variable: regressing on a factor without an intercept (0 + Species) estimates one mean per group.
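That a linear model of the form y ~ 0 + factor reproduces the group means can be checked directly in base R. The sketch uses the complete iris data and plain lm, independent of simputation.

```r
# Intercept-free regression on a factor: one coefficient per level
fit <- lm(Sepal.Width ~ 0 + Species, data = iris)

# Group means computed directly
group_means <- tapply(iris$Sepal.Width, iris$Species, mean)

print(coef(fit))
print(group_means)

# The coefficients and the group means agree
print(all.equal(unname(coef(fit)), as.vector(group_means)))
```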

Grouping data for imputation

Use | in the formula argument to specify groups.

# New data set, leaving Species intact
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

# split dat into groups according to 'Species', impute, combine and return.
da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.968092         3.5          1.4         0.2  setosa
## 2     4.968092         3.0          1.4         0.2  setosa
## 3     4.968092          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa

If one or more grouping variables are specified (separate multiple grouping variables with +), imputation takes place as follows.

  1. Split the data into subsets according to the values of the grouping variables.
  2. Estimate the model for each data subset and impute.
  3. Combine the imputed subsets.
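The three steps above can be sketched in base R with split, lapply and unsplit. The helper impute_one is hypothetical and only mimics a per-group linear model imputation; it is not simputation's implementation.

```r
dat <- iris
dat[1:3, 1] <- NA

# Hypothetical helper: fit a linear model on one subset and impute within it
impute_one <- function(d) {
  miss <- is.na(d$Sepal.Length)
  if (any(miss)) {
    fit <- lm(Sepal.Length ~ Petal.Width, data = d)
    d$Sepal.Length[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
  }
  d
}

parts   <- split(dat, dat$Species)      # 1. split by the grouping variable
imputed <- lapply(parts, impute_one)    # 2. estimate and impute per subset
out     <- unsplit(imputed, dat$Species)  # 3. combine, restoring record order

print(head(out, 3))
```

Using unsplit rather than rbind keeps the records in their original order, so the result lines up with the input data.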

Simputation also integrates with the dplyr package and recognizes grouping specified with group_by.

library(magrittr)
library(dplyr)

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

dat %>% group_by(Species) %>% 
  impute_lm(Sepal.Length ~ Petal.Width)