misl implements Multiple Imputation by Super
Learning, a flexible approach to handling missing data
described in:
Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238
Rather than relying on a single parametric imputation model,
misl builds a stacked ensemble of machine learning
algorithms for each incomplete column, producing well-calibrated
imputations for continuous, binary, categorical, and ordered categorical
variables.
We simulate a small dataset with four types of incomplete variables
to demonstrate misl across all supported outcome types.
library(misl)
set.seed(42)
n <- 300
# Define the predictors first so the outcome can depend on them
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)
sim_data <- data.frame(
  # Continuous predictors (always observed)
  age = age,
  bmi = bmi,
  # Continuous outcome that depends on age and bmi, with missingness
  sbp = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Unordered categorical outcome with missingness
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                        prob = c(0.4, 0.35, 0.25))),
  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)
# Introduce missing values
sim_data[sample(n, 40), "sbp"] <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"] <- NA
sim_data[sample(n, 30), "health"] <- NA
# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
#>    age    bmi    sbp smoker  group health
#>      0      0     40     30     30     30

Use list_learners() to see the available named learners,
optionally filtered by outcome type:
| learner | description | continuous | binomial | categorical | ordinal | package | installed |
|---|---|---|---|---|---|---|---|
| glm | Linear / logistic regression | TRUE | TRUE | FALSE | FALSE | stats | TRUE |
| mars | Multivariate adaptive regression splines | TRUE | TRUE | FALSE | FALSE | earth | TRUE |
| multinom_reg | Multinomial regression | FALSE | FALSE | TRUE | FALSE | nnet | TRUE |
| polr | Proportional odds logistic regression | FALSE | FALSE | FALSE | TRUE | MASS | TRUE |
| rand_forest | Random forest | TRUE | TRUE | TRUE | TRUE | ranger | TRUE |
| boost_tree | Gradient boosted trees | TRUE | TRUE | TRUE | TRUE | xgboost | TRUE |
For example, restricting the listing to the learners that support ordinal outcomes:

| learner | description | package | installed |
|---|---|---|---|
| polr | Proportional odds logistic regression | MASS | TRUE |
| rand_forest | Random forest | ranger | TRUE |
| boost_tree | Gradient boosted trees | xgboost | TRUE |
The primary function is misl(). Supply a dataset and
specify:
- m – the number of multiply imputed datasets to create
- maxit – the number of Gibbs sampling iterations per dataset
- con_method, bin_method, cat_method, ord_method – learners for each outcome type

Running misl() with a single named learner per outcome
type is the fastest option and is well suited for exploratory work. Note
that ordered factors are automatically detected and routed to
ord_method:
misl_imp <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed = 42,
  quiet = TRUE
)

misl() returns a list of length m. Each element contains:

- $datasets – the fully imputed tibble
- $trace – a long-format tibble of convergence statistics

In addition to the built-in named learners, misl v2.0
accepts any parsnip-compatible model spec directly. This allows you to
use any learner available in the tidymodels ecosystem without waiting
for it to be added to the built-in registry.
Simply pass a parsnip model spec in place of (or alongside) a named
string. misl will automatically enforce the correct mode
(regression vs classification) regardless of what is set on the
spec:
library(parsnip)
# A random forest with custom hyperparameters
custom_rf <- rand_forest(trees = 500, mtry = 3) |>
  set_engine("ranger")
misl_custom <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = list(custom_rf),
  bin_method = list(custom_rf),
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed = 42,
  quiet = TRUE
)

Named strings and parsnip specs can be freely mixed in the same list.
When multiple learners are supplied, misl uses
cross-validation to build a stacked ensemble:
library(parsnip)
# Mix a named learner with a custom tuned spec
custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |>
  set_engine("xgboost")
misl_mixed <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = list("glm", custom_xgb),
  bin_method = list("glm", custom_xgb),
  cat_method = list("multinom_reg", "rand_forest"),
  ord_method = list("polr", "rand_forest"),
  cv_folds = 3,
  seed = 42
)

Any parsnip-supported engine can be used. For example, a support
vector machine via the kernlab package:
library(parsnip)
# SVM - not in the built-in registry but works via parsnip
svm_spec <- svm_rbf() |>
  set_engine("kernlab")
misl_svm <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = list("glm", svm_spec),
  bin_method = list("glm", svm_spec),
  cat_method = "multinom_reg",
  ord_method = "polr",
  cv_folds = 3,
  seed = 42
)

For ordered categorical variables, misl automatically
detects ordered factors and routes them to ord_method. The
default learner is "polr" (proportional odds logistic
regression from the MASS package), which respects the
ordering of the levels:
# Ensure your ordered variable is an ordered factor
sim_data$health <- factor(sim_data$health,
  levels = c("Poor", "Fair", "Good", "Excellent"),
  ordered = TRUE
)
misl_ordinal <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr", # proportional odds model for ordered categories
  seed = 42,
  quiet = TRUE
)
# Imputed values respect the ordering
is.ordered(misl_ordinal[[1]]$datasets$health)
levels(misl_ordinal[[1]]$datasets$health)

When multiple learners are supplied (whether named strings, parsnip
specs, or a mix), misl uses cross-validation to learn optimal ensemble
weights via the stacks package. Use cv_folds to reduce the number of
folds and speed up computation, as in the misl_mixed example above.
After imputation, fit your analysis model to each of the
m datasets and pool the results using Rubin’s rules. Here
we implement pooling manually using base R:
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})
# Pool point estimates and standard errors using Rubin's rules
m <- length(models)
ests <- sapply(models, function(fit) coef(fit))
vars <- sapply(models, function(fit) diag(vcov(fit)))
Q_bar <- rowMeans(ests) # pooled estimate
U_bar <- rowMeans(vars) # within-imputation variance
B <- apply(ests, 1, var) # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B # total variance
pooled <- data.frame(
  term = names(Q_bar),
  estimate = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
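# The same quantities give two standard Rubin's-rules diagnostics (an
# illustrative addition, not part of the original example): the relative
# increase in variance due to nonresponse, and lambda, the share of the
# total variance attributable to the missing data
riv <- (1 + 1 / m) * B / U_bar
lambda <- (1 + 1 / m) * B / T_total
round(cbind(riv, lambda), 3)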
print(pooled)

For a full implementation of Rubin’s rules including degrees of
freedom and p-values, the mice package provides
pool() and can be used directly with misl
output:
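As a sketch (assuming mice is installed; pool() accepts a plain list of fitted models and coerces it internally), the models fitted above can be pooled directly:

```r
library(mice)

# Refit the analysis model on each imputed dataset, then pool with mice;
# pool() applies Rubin's rules and reports degrees of freedom and p-values
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})
summary(pool(models))
```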
The plot_misl_trace() function plots the mean or
standard deviation of imputed values across iterations for a given
variable, with one line per imputed dataset. Stable traces that mix well
across datasets indicate convergence. Note that trace statistics are
only computed for continuous and numeric binary columns — they are not
available for categorical or ordinal columns.
# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp", ylab = "Mean imputed sbp (mm Hg)")
The m imputed datasets are generated independently and
can be run in parallel using the future framework. Set a
parallel plan before calling misl():
library(future)
# Use all available cores
plan(multisession)
misl_parallel <- misl(
  sim_data,
  m = 10,
  maxit = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  seed = 42
)
# Always reset the plan when done
plan(sequential)

To limit the number of cores:
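A minimal sketch: the workers argument of plan() caps the number of background R processes (two here is an arbitrary choice):

```r
library(future)

# Cap the multisession plan at two background R workers
plan(multisession, workers = 2)

# ... run misl() as above ...

# Restore sequential processing when finished
plan(sequential)
```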
The largest speedup comes from running the m datasets
simultaneously. Check how many cores are available with:
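One way to do so (availableCores() is re-exported by the future package; base R's parallel::detectCores() is an alternative):

```r
library(future)

# Number of CPU cores available to this R session
availableCores()
```

The sessionInfo() output that follows records the R and package versions used to build this vignette.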
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.4
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] misl_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.7.2 cli_3.6.5 knitr_1.50
#> [4] rlang_1.1.7 xfun_0.54 Formula_1.2-5
#> [7] data.table_1.18.2.1 jsonlite_2.0.0 glue_1.8.0
#> [10] ranger_0.18.0 nnet_7.3-20 htmltools_0.5.8.1
#> [13] sass_0.4.10 rmarkdown_2.30 grid_4.5.2
#> [16] evaluate_1.0.5 jquerylib_0.1.4 tibble_3.3.1
#> [19] MASS_7.3-65 earth_5.3.5 fastmap_1.2.0
#> [22] yaml_2.3.10 lifecycle_1.0.5 compiler_4.5.2
#> [25] Rcpp_1.1.1 pkgconfig_2.0.3 rstudioapi_0.17.1
#> [28] lattice_0.22-7 digest_0.6.39 xgboost_3.2.1.1
#> [31] R6_2.6.1 pillar_1.11.1 magrittr_2.0.4
#> [34] Matrix_1.7-4 bslib_0.9.0 tools_4.5.2
#> [37] plotmo_3.7.0        plotrix_3.8-14      cachem_1.1.0