
Introduction to misl

Overview

misl implements Multiple Imputation by Super Learning, a flexible approach to handling missing data described in:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238

Rather than relying on a single parametric imputation model, misl builds a stacked ensemble of machine learning algorithms for each incomplete column, producing well-calibrated imputations for continuous, binary, categorical, and ordered categorical variables.
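The stacking idea behind super learning can be sketched in a few lines of base R. This toy example is purely illustrative and does not use misl internals: two candidate learners (a linear model and an intercept-only mean) produce cross-validated predictions, and the ensemble weight is the convex combination that minimises cross-validated squared error.

```r
set.seed(1)
x <- rnorm(200)
y <- 2 * x + rnorm(200)
folds <- sample(rep(1:5, length.out = 200))

# Cross-validated predictions from two candidate learners
p_lm <- p_mean <- numeric(200)
for (k in 1:5) {
  train <- folds != k
  fit <- lm(y ~ x, subset = train)
  p_lm[!train]   <- predict(fit, newdata = data.frame(x = x[!train]))
  p_mean[!train] <- mean(y[train])
}

# Choose the convex weight minimising cross-validated squared error
w <- optimize(function(w) mean((y - (w * p_lm + (1 - w) * p_mean))^2),
              interval = c(0, 1))$minimum
round(w, 2)  # near 1: the linear learner dominates on this data
```

misl performs the analogous weighting over its full learner library for each incomplete column.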

Installation

# Install from GitHub
remotes::install_github("JustinManjourides/misl")

# Optional backend packages for additional learners
install.packages(c("ranger", "xgboost", "earth", "MASS"))

Simulated data

We simulate a small dataset with four types of incomplete variables to demonstrate misl across all supported outcome types.

library(misl)

set.seed(42)
n <- 300

# Continuous predictors (always observed)
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)

sim_data <- data.frame(
  age    = age,
  bmi    = bmi,
  # Continuous outcome with missingness, generated from the predictors
  sbp    = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Unordered categorical outcome with missingness
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                         prob = c(0.4, 0.35, 0.25))),
  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels  = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"]    <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"]  <- NA
sim_data[sample(n, 30), "health"] <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
#>    age    bmi    sbp smoker  group health 
#>      0      0     40     30     30     30

Built-in learners

Use list_learners() to see the available named learners, optionally filtered by outcome type:

knitr::kable(list_learners())

|learner      |description                              |continuous |binomial |categorical |ordinal |package |installed |
|:------------|:----------------------------------------|:----------|:--------|:-----------|:-------|:-------|:---------|
|glm          |Linear / logistic regression             |TRUE       |TRUE     |FALSE       |FALSE   |stats   |TRUE      |
|mars         |Multivariate adaptive regression splines |TRUE       |TRUE     |FALSE       |FALSE   |earth   |TRUE      |
|multinom_reg |Multinomial regression                   |FALSE      |FALSE    |TRUE        |FALSE   |nnet    |TRUE      |
|polr         |Proportional odds logistic regression    |FALSE      |FALSE    |FALSE       |TRUE    |MASS    |TRUE      |
|rand_forest  |Random forest                            |TRUE       |TRUE     |TRUE        |TRUE    |ranger  |TRUE      |
|boost_tree   |Gradient boosted trees                   |TRUE       |TRUE     |TRUE        |TRUE    |xgboost |TRUE      |

knitr::kable(list_learners("ordinal"))

|learner     |description                           |package |installed |
|:-----------|:-------------------------------------|:-------|:---------|
|polr        |Proportional odds logistic regression |MASS    |TRUE      |
|rand_forest |Random forest                         |ranger  |TRUE      |
|boost_tree  |Gradient boosted trees                |xgboost |TRUE      |
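Because list_learners() returns an ordinary data frame (as the tables above show), standard subsetting applies. For example, to list only the learners whose backend package is installed; this assumes a "continuous" filter that mirrors the "ordinal" filter shown above:

```r
cont <- list_learners("continuous")
cont$learner[cont$installed]
```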

Basic usage

The primary function is misl(). Supply a dataset along with the number of imputed datasets to generate (m), the number of iterations per dataset (maxit), and a learner, or list of learners, for each outcome type (con_method, bin_method, cat_method, ord_method).

Running misl() with a single named learner per outcome type is the fastest option and is well suited to exploratory work. Note that ordered factors are automatically detected and routed to ord_method:

misl_imp <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)

misl() returns a list of length m; each element contains the completed (fully imputed) data in its $datasets component:

# Number of imputed datasets
length(misl_imp)

# Confirm no missing values remain
anyNA(misl_imp[[1]]$datasets)

# Confirm ordered factor is preserved
is.ordered(misl_imp[[1]]$datasets$health)
levels(misl_imp[[1]]$datasets$health)
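Since each element exposes its completed data in $datasets, the m completed data frames are easy to collect into a plain list for downstream analysis (continuing from the misl_imp object created above):

```r
# Extract the m completed data frames from the misl result
completed <- lapply(misl_imp, function(imp) imp$datasets)

length(completed)                      # one data frame per imputation
vapply(completed, anyNA, logical(1))   # FALSE for every dataset
```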

Custom learners via parsnip

In addition to the built-in named learners, misl v2.0 accepts any parsnip-compatible model spec directly. This allows you to use any learner available in the tidymodels ecosystem without waiting for it to be added to the built-in registry.

Passing a custom spec

Simply pass a parsnip model spec in place of (or alongside) a named string. misl will automatically enforce the correct mode (regression vs classification) regardless of what is set on the spec:

library(parsnip)

# A random forest with custom hyperparameters
custom_rf <- rand_forest(trees = 500, mtry = 3) |>
  set_engine("ranger")

misl_custom <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list(custom_rf),
  bin_method = list(custom_rf),
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)

Mixing named learners and custom specs

Named strings and parsnip specs can be freely mixed in the same list. When multiple learners are supplied, misl uses cross-validation to build a stacked ensemble:

library(parsnip)

# Mix a named learner with a custom tuned spec
custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |>
  set_engine("xgboost")

misl_mixed <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", custom_xgb),
  bin_method = list("glm", custom_xgb),
  cat_method = list("multinom_reg", "rand_forest"),
  ord_method = list("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Using a learner not in the built-in registry

Any parsnip-supported engine can be used. For example, a support vector machine via the kernlab package:

library(parsnip)

# SVM - not in the built-in registry but works via parsnip
svm_spec <- svm_rbf() |>
  set_engine("kernlab")

misl_svm <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", svm_spec),
  bin_method = list("glm", svm_spec),
  cat_method = "multinom_reg",
  ord_method = "polr",
  cv_folds   = 3,
  seed       = 42
)

Ordinal outcomes and the polr learner

For ordered categorical variables, misl automatically detects ordered factors and routes them to ord_method. The default learner is "polr" (proportional odds logistic regression from the MASS package), which respects the ordering of the levels:

# Ensure your ordered variable is an ordered factor
sim_data$health <- factor(sim_data$health,
  levels  = c("Poor", "Fair", "Good", "Excellent"),
  ordered = TRUE
)

misl_ordinal <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",   # proportional odds model for ordered categories
  seed       = 42,
  quiet      = TRUE
)

# Imputed values respect the ordering
is.ordered(misl_ordinal[[1]]$datasets$health)
levels(misl_ordinal[[1]]$datasets$health)

Multiple learners and stacking

When multiple learners are supplied (whether named strings, parsnip specs, or a mix), misl uses cross-validation to learn optimal ensemble weights via the stacks package. Use cv_folds to reduce the number of folds and speed up computation:

misl_stack <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Analysing the imputed datasets

After imputation, fit your analysis model to each of the m datasets and pool the results with Rubin’s rules. Here we implement the pooling manually in base R:

# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m       <- length(models)
ests    <- sapply(models, function(fit) coef(fit))
vars    <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar   <- rowMeans(ests)                          # pooled estimate
U_bar   <- rowMeans(vars)                          # within-imputation variance
B       <- apply(ests, 1, var)                     # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B                # total variance

pooled <- data.frame(
  term      = names(Q_bar),
  estimate  = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low  = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)

For a full implementation of Rubin’s rules, including degrees of freedom and p-values, the mice package provides pool(), which accepts the list of models fitted to the misl output directly:

library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)

Convergence: trace plots

The plot_misl_trace() function plots the mean or standard deviation of imputed values across iterations for a given variable, with one line per imputed dataset. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns — they are not available for categorical or ordinal columns.

# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp", ylab = "Mean imputed sbp (mm Hg)")
# Plot the standard deviation instead
plot_misl_trace(misl_imp, variable = "sbp", statistic = "sd")


Parallelism

The m imputed datasets are generated independently and can be run in parallel using the future framework. Set a parallel plan before calling misl():

library(future)

# Use all available cores
plan(multisession)

misl_parallel <- misl(
  sim_data,
  m          = 10,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  seed       = 42
)

# Always reset the plan when done
plan(sequential)

To limit the number of cores:

plan(multisession, workers = 4)

The largest speedup comes from running the m datasets simultaneously. Check how many cores are available with:

parallel::detectCores()
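To verify the plan before launching a long run, the future package reports how many workers the current plan resolves to (nbrOfWorkers() is part of future):

```r
library(future)

plan(multisession, workers = 4)
nbrOfWorkers()   # reports the number of workers the plan will use

plan(sequential) # reset when done
```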

Session info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.4
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] misl_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.2         cli_3.6.5           knitr_1.50         
#>  [4] rlang_1.1.7         xfun_0.54           Formula_1.2-5      
#>  [7] data.table_1.18.2.1 jsonlite_2.0.0      glue_1.8.0         
#> [10] ranger_0.18.0       nnet_7.3-20         htmltools_0.5.8.1  
#> [13] sass_0.4.10         rmarkdown_2.30      grid_4.5.2         
#> [16] evaluate_1.0.5      jquerylib_0.1.4     tibble_3.3.1       
#> [19] MASS_7.3-65         earth_5.3.5         fastmap_1.2.0      
#> [22] yaml_2.3.10         lifecycle_1.0.5     compiler_4.5.2     
#> [25] Rcpp_1.1.1          pkgconfig_2.0.3     rstudioapi_0.17.1  
#> [28] lattice_0.22-7      digest_0.6.39       xgboost_3.2.1.1    
#> [31] R6_2.6.1            pillar_1.11.1       magrittr_2.0.4     
#> [34] Matrix_1.7-4        bslib_0.9.0         tools_4.5.2        
#> [37] plotmo_3.7.0        plotrix_3.8-14      cachem_1.1.0
