README


misl

Overview

misl implements Multiple Imputation by Super Learning (MISL), a flexible approach to handling missing data that uses a stacked ensemble of machine learning algorithms to impute missing values across continuous, binary, and categorical variables.

Rather than relying on a single parametric imputation model, MISL builds a super learner for each incomplete variable using the tidymodels framework, combining learners such as linear/logistic regression, random forests, gradient boosted trees, and MARS to produce well-calibrated imputations.
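
As a rough illustration of the stacking idea, the sketch below builds a tiny two-learner ensemble in base R: it collects out-of-fold predictions from each learner and then picks the convex weight that minimises cross-validated squared error. It is not misl's internal implementation (misl builds its ensembles through tidymodels); the simulated data, the two learners, and every variable name here are assumptions chosen for the example.

```r
# Illustrative sketch only, not misl's implementation; all names are hypothetical.
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Assign each row to one of K cross-validation folds
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))

# Collect out-of-fold predictions from two candidate learners
cv_pred <- matrix(NA_real_, nrow = n, ncol = 2,
                  dimnames = list(NULL, c("linear", "cubic")))
for (k in seq_len(K)) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit_linear <- lm(y ~ x, data = train)
  fit_cubic  <- lm(y ~ poly(x, 3), data = train)
  cv_pred[fold == k, "linear"] <- predict(fit_linear, newdata = test)
  cv_pred[fold == k, "cubic"]  <- predict(fit_cubic, newdata = test)
}

# Grid-search the convex weight that minimises cross-validated squared error
alphas <- seq(0, 1, by = 0.01)
cv_mse <- sapply(alphas, function(a) {
  blended <- a * cv_pred[, "linear"] + (1 - a) * cv_pred[, "cubic"]
  mean((dat$y - blended)^2)
})
alpha_hat <- alphas[which.min(cv_mse)]

# Refit both learners on all rows and combine them with the learned weight
full_linear <- lm(y ~ x, data = dat)
full_cubic  <- lm(y ~ poly(x, 3), data = dat)
ensemble_predict <- function(newdata) {
  alpha_hat * predict(full_linear, newdata = newdata) +
    (1 - alpha_hat) * predict(full_cubic, newdata = newdata)
}

ensemble_predict(data.frame(x = c(-1, 0, 1)))
```

Because the grid includes the weights 0 and 1, the blended predictor's cross-validated error is never worse than that of either learner on its own.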

Installation

# install.packages("remotes")
remotes::install_github("JustinManjourides/misl")

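# Engine packages used by some learners
# (ranger: random forests, xgboost: gradient boosted trees, earth: MARS)
install.packages(c("ranger", "xgboost", "earth"))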

Quick Start

library(misl)

# Introduce missingness into a dataset
set.seed(42)
n <- 200
demo_data <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  weight = rnorm(n, mean = 70, sd = 15),
  smoker = rbinom(n, 1, 0.3),
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
demo_data[sample(n, 20), "age"]    <- NA
demo_data[sample(n, 15), "weight"] <- NA
demo_data[sample(n, 10), "smoker"] <- NA
demo_data[sample(n, 10), "group"]  <- NA

# Run MISL with an explicit set of learners for each variable type
misl_imp <- misl(
  demo_data,
  m      = 5,
  maxit  = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("rand_forest", "multinom_reg")
)

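# misl_imp holds one element per imputation (m in total).
# The i-th completed dataset is misl_imp[[i]]$datasets, for example:
completed_data <- misl_imp[[1]]$datasets

# The matching trace information, which can be plotted to inspect convergence:
trace <- misl_imp[[1]]$trace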
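
After imputation, an analysis is typically run on each completed dataset and the results are combined with Rubin's rules. The helper below is a minimal base-R sketch of that pooling step. `pool_rubin()` is a hypothetical function written for this example and is not part of misl, and the commented usage assumes only the `misl_imp[[i]]$datasets` structure shown above.

```r
# Hypothetical helper (not part of misl): pool m estimates with Rubin's rules.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m > 1, length(variances) == m)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Sketch of use with the Quick Start object (assumes misl_imp exists):
# fits <- lapply(misl_imp, function(imp) {
#   lm(weight ~ age + smoker + group, data = imp$datasets)
# })
# est <- sapply(fits, function(f) coef(f)[["age"]])
# v   <- sapply(fits, function(f) vcov(f)["age", "age"])
# pool_rubin(est, v)

# Stand-alone call with made-up numbers, purely to show the interface
pool_rubin(c(0.10, 0.12, 0.08, 0.11, 0.09),
           c(0.004, 0.005, 0.004, 0.005, 0.004))
```

The pooled standard error includes the between-imputation variance, so it reflects the uncertainty due to the missing values and not only the sampling variability within each completed dataset.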

Parallelism

Imputation across the m datasets is parallelised via the future framework. To enable parallel execution, set a plan before calling misl():

library(future)
plan(multisession, workers = 4)

misl_imp <- misl(demo_data, m = 5, maxit = 5)

plan(sequential)  # reset when done
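
If a plan may already be active in the session, one option is to restore it afterwards instead of forcing sequential execution. The sketch below relies only on functions from the future package (`availableCores()`, `plan()`, `nbrOfWorkers()`); the choice to leave one core free is an assumption for the example, not a misl recommendation.

```r
library(future)

# Leave one core free, but always use at least one worker
workers <- max(1L, availableCores() - 1L)

# plan() returns the previously active plan, so it can be restored later
old_plan <- plan(multisession, workers = workers)
nbrOfWorkers()  # number of workers the current plan will use

# ... call misl() here ...

plan(old_plan)  # restore whatever plan was active before
```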

Available learners

# View all available learners
list_learners()

# Filter by outcome type
list_learners("continuous")
list_learners("categorical")

# Show only installed learners
list_learners(installed_only = TRUE)

Citation

@article{carpenito2022misl,
  author  = {Carpenito, T and Manjourides, J},
  title   = {{MISL}: Multiple imputation by super learning},
  journal = {Statistical Methods in Medical Research},
  year    = {2022},
  volume  = {31},
  number  = {10},
  pages   = {1904--1915},
  doi     = {10.1177/09622802221104238}
}

License
