smcfcs for coarsened factor covariates

Jonathan Bartlett and Lars van der Burg

smcfcs was originally developed to create multiple imputations of missing covariate values in regression models. As of 2025, it can also impute unobserved values of factor variables which are ‘coarsened’, based on the developments in van der Burg et al (2025). By coarsened, we mean that for some of the missing values, partial information about the value is available: we know that the value belongs to some subset of the possible values. In this vignette we demonstrate the functionality of smcfcs for imputing such variables.

We illustrate using the dataset ex_coarsening, which is included in the smcfcs package:

library(smcfcs)
summary(ex_coarsening)
#>     x          xobs                 z                  y         
#>  a   :15   Length:100         Min.   :-2.08089   Min.   :-3.013  
#>  b   :10   Class :character   1st Qu.:-0.73366   1st Qu.:-0.179  
#>  c   :11   Mode  :character   Median :-0.11602   Median : 1.169  
#>  NA's:64                      Mean   :-0.09867   Mean   : 1.081  
#>                               3rd Qu.: 0.61186   3rd Qu.: 2.564  
#>                               Max.   : 2.06083   Max.   : 4.293
head(ex_coarsening)
#>      x xobs          z          y
#> 1 <NA>  a/c -0.5898450  2.3921826
#> 2    a    a -1.5314078 -3.0128176
#> 3    c    c  1.3189317  3.0480379
#> 4 <NA>  a/c -0.3832246  0.5695512
#> 5 <NA>  b/c  0.6129756  3.1124292
#> 6 <NA>  a/c -0.3664974 -2.4805336

The variable x is a factor variable which has 64 missing values. The variable xobs gives the known information about (some of) the missing values:

table(ex_coarsening$x,ex_coarsening$xobs,useNA = "ifany")
#>       
#>        NA  a a/c  b b/c  c
#>   a     0 15   0  0   0  0
#>   b     0  0   0 10   0  0
#>   c     0  0   0  0   0 11
#>   <NA> 25  0  22  0  17  0

From this we can see that, among the 64 missing values in x, for 22 individuals we know that their value of x was either a or c, as indicated by the string ‘a/c’; for 17 individuals we know that their value was either b or c, as indicated by the string ‘b/c’; and for the remaining 25 we have no further information, as indicated by the character string “NA”.

Note: the variable xobs is a character variable, and for rows where x is (plain) missing, xobs takes the character value “NA”, rather than R’s missing value indicator NA. This is important: if we used the missing value indicator NA, smcfcs would refuse to run, as we have not told it how to impute the missing values in xobs.
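
If the partial-information variable in your own data contains genuine missing values, the recoding described in the note could be done along the following lines. This is a minimal base R sketch on a small hypothetical data frame (`dat` is made up for illustration and is not part of the package).

```r
# Hypothetical toy data: xobs contains a true missing value (NA)
dat <- data.frame(
  x = factor(c("a", NA, NA, "c")),
  xobs = c("a", "a/c", NA, "c"),
  stringsAsFactors = FALSE
)

# Make sure xobs is a character variable, then replace NA by the string "NA"
dat$xobs <- as.character(dat$xobs)
dat$xobs[is.na(dat$xobs)] <- "NA"

# xobs should now have no missing values, while x still has some
anyNA(dat$xobs)
sum(is.na(dat$x))
```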

To impute the missing values in x using smcfcs, we have to define a value for the restrictions argument. For this we must pass a list of length equal to the number of variables in the data frame. For the element of this list corresponding to x, we must give a vector of formula-type expressions specifying the possible values of x when xobs equals a/c or b/c. To achieve this we use:

restrictionsX = c("xobs = a/c ~ a + c",
                  "xobs = b/c ~ b + c")
restrictions = append(list(restrictionsX), as.list(c("", "", "")))
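
The construction above relies on x being the first column of the data frame. As a sketch of an alternative that does not depend on column order, the list can be filled in by column name. It assumes, as described above, that the list elements correspond to the columns of the data frame in order; the toy data frame `dat` is hypothetical and only mimics the column names of ex_coarsening.

```r
# Hypothetical data frame with the same column names as ex_coarsening
dat <- data.frame(x = factor(c("a", NA)), xobs = c("a", "a/c"),
                  z = c(0.1, -0.2), y = c(1.0, 2.0))

restrictionsX <- c("xobs = a/c ~ a + c",
                   "xobs = b/c ~ b + c")

# Start with an empty string for every column, then fill in the entry for x
restrictions <- as.list(rep("", ncol(dat)))
restrictions[[match("x", names(dat))]] <- restrictionsX

str(restrictions)
```

With x in the first of four columns, this produces the same list as the append() call above.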

We can then impute the missing values accounting for the partial information with:

set.seed(68204812)
imps <- smcfcs(originaldata=ex_coarsening,
               smtype="lm",
               smformula = "y~z+x",
               method = c("mlogit","", "", ""),
               restrictions = restrictions
)

To check that smcfcs has correctly used the partial information about the missing values in x, first we check the first few rows in the first imputed dataset:

head(imps$impDatasets[[1]])
#>   x xobs          z          y
#> 1 c  a/c -0.5898450  2.3921826
#> 2 a    a -1.5314078 -3.0128176
#> 3 c    c  1.3189317  3.0480379
#> 4 a  a/c -0.3832246  0.5695512
#> 5 c  b/c  0.6129756  3.1124292
#> 6 a  a/c -0.3664974 -2.4805336

This looks fine - when xobs=a/c we have imputed values either of a or c, whereas when xobs=b/c we have imputed values of b or c. To check properly, we can repeat the earlier cross-tabulation:

table(imps$impDatasets[[1]]$x,imps$impDatasets[[1]]$xobs,useNA = "ifany")
#>    
#>     NA  a a/c  b b/c  c
#>   a  8 15   9  0   0  0
#>   b  2  0   0 10   5  0
#>   c 15  0  13  0  12 11

This shows that (at least in the first imputed dataset) the imputed values respect the partial information contained in xobs, as desired.
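
To extend the check beyond the first imputed dataset, one option is a small helper applied to each element of imps$impDatasets. The function name `respects_restrictions` and the `allowed` list are hypothetical names introduced here for illustration; the sketch is demonstrated on toy data frames so that it runs on its own.

```r
# Allowed values of x for each coarsened code in xobs (as in this vignette)
allowed <- list("a/c" = c("a", "c"), "b/c" = c("b", "c"))

# Returns TRUE if every row of d with a coarsened xobs has an allowed value of x
respects_restrictions <- function(d, allowed) {
  for (code in names(allowed)) {
    rows <- d$xobs == code
    if (!all(as.character(d$x[rows]) %in% allowed[[code]])) return(FALSE)
  }
  TRUE
}

# Hypothetical toy "imputed" datasets to illustrate the check
good <- data.frame(x = factor(c("a", "c", "b")), xobs = c("a/c", "b/c", "NA"))
bad  <- data.frame(x = factor(c("b", "c", "b")), xobs = c("a/c", "b/c", "NA"))
sapply(list(good = good, bad = bad), respects_restrictions, allowed = allowed)

# With the object created in this vignette, one would call:
# sapply(imps$impDatasets, respects_restrictions, allowed = allowed)
```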

The restrictions argument can also be used for ordered factor variables in the same way.
