A Sample Size Calculator for Machine Learning Applications in Healthcare [R package planningML version 1.0.1]

planningML: A Sample Size Calculator for Machine Learning Applications in Healthcare

Advances in automated document classification have led to the identification of massive numbers of clinical concepts from handwritten clinical notes. These high-dimensional clinical concepts can serve as highly informative predictors in building classification algorithms for identifying patients with different clinical conditions, commonly referred to as patient phenotyping. However, from a planning perspective, it is critical to ensure that enough data is available for the phenotyping algorithm to obtain a desired classification performance. This challenge in sample size planning is further exacerbated by the high dimension of the feature space and the inherent imbalance of the response class. Currently available sample size planning methods can be categorized into: (i) model-based approaches that predict the sample size required for achieving a desired accuracy using a linear machine learning classifier and (ii) learning curve-based approaches (Figueroa et al. (2012) <doi:10.1186/1472-6947-12-8>) that fit an inverse power law curve to pilot data to extrapolate performance. We develop model-based approaches for imbalanced data with correlated features, deriving sample size formulas for performance metrics that are sensitive to class imbalance, such as Area Under the receiver operating characteristic Curve (AUC) and Matthews Correlation Coefficient (MCC). This is done using a two-step approach where we first perform feature selection using the innovated Higher Criticism thresholding method (Hall and Jin (2010) <doi:10.1214/09-AOS764>), then determine the sample size by optimizing the two performance metrics. Further, we develop software in the form of an R package named 'planningML' and an 'R' 'Shiny' app to facilitate the convenient implementation of the developed model-based approaches and learning curve approaches for imbalanced data. We apply our methods to the problem of phenotyping rare outcomes using the MIMIC-III electronic health record database. We show that our developed methods, which relate training data size to performance on AUC and MCC, can predict the true or observed performance of linear ML classifiers such as LASSO and SVM at different training data sizes. Therefore, in high-dimensional classification analysis with imbalanced data and correlated features, our approach can efficiently and accurately determine the sample size needed for machine learning-based classification.
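
The learning curve-based idea mentioned above (fit an inverse power law to pilot data, then extrapolate) can be illustrated with a minimal, language-agnostic sketch. It is written in plain Python and does not use or reproduce the planningML API. The curve form score = a - b * n^(-c), the grid search over c (a simple stand-in for the nonlinear least squares fit typically used for learning curves), the function names, and the synthetic pilot data are all assumptions made for illustration.

```python
import math


def fit_inverse_power_law(sizes, scores, exponents=None):
    """Fit score = a - b * n**(-c) to pilot data.

    For each candidate exponent c, the model is linear in z = n**(-c),
    so a and b follow from ordinary least squares. The c with the
    smallest sum of squared errors is kept. (Assumed simplification.)
    """
    if exponents is None:
        exponents = [k / 100 for k in range(5, 201)]  # c from 0.05 to 2.00
    best = None
    y_mean = sum(scores) / len(scores)
    for c in exponents:
        z = [n ** (-c) for n in sizes]
        z_mean = sum(z) / len(z)
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz  # slope of score on z, which equals -b
        a = y_mean - slope * z_mean
        b = -slope
        sse = sum((yi - (a - b * zi)) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, target):
    """Smallest n with a - b * n**(-c) >= target.

    Returns None if the target is at or above the fitted asymptote a,
    or if the fitted curve is not increasing in n (b <= 0).
    """
    if target >= a or b <= 0:
        return None
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic, noise-free pilot data generated from a known curve
# (hypothetical values, not results from the package or from MIMIC-III).
true_a, true_b, true_c = 0.90, 1.5, 0.5
pilot_sizes = [50, 100, 200, 400, 800]
pilot_auc = [true_a - true_b * n ** (-true_c) for n in pilot_sizes]

a_hat, b_hat, c_hat = fit_inverse_power_law(pilot_sizes, pilot_auc)
n_needed = required_sample_size(a_hat, b_hat, c_hat, target=0.85)
print(f"fitted a={a_hat:.3f}, b={b_hat:.3f}, c={c_hat:.2f}; n for AUC 0.85: {n_needed}")
```

With real pilot data the scores are noisy, so the extrapolated sample size should be read as a rough planning estimate. A target at or above the fitted asymptote is reported as unreachable (`None`).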

Version:	1.0.1
Depends:	R (≥ 3.5.0)
Imports:	glmnet, caret, lubridate, Matrix, MESS, dplyr, pROC, stats
Suggests:	knitr, rmarkdown
Published:	2023-06-23
DOI:	10.32614/CRAN.package.planningML
Author:	Xinying Fang [aut, cre], Satabdi Saha [aut], Jaejoon Song [aut], Sai Dharmarajan [aut]
Maintainer:	Xinying Fang <fxy950225 at gmail.com>
License:	GPL-2
NeedsCompilation:	no
CRAN checks:	planningML results [issues need fixing before 2025-10-21]

Documentation:

Reference manual:	planningML.html, planningML.pdf
Vignettes:	planningML User Guide (source, R code)

Downloads:

Package source:	planningML_1.0.1.tar.gz
Windows binaries:	r-devel: planningML_1.0.1.zip, r-release: planningML_1.0.1.zip, r-oldrel: planningML_1.0.1.zip
macOS binaries:	r-release (arm64): planningML_1.0.1.tgz, r-oldrel (arm64): planningML_1.0.1.tgz, r-release (x86_64): planningML_1.0.1.tgz, r-oldrel (x86_64): planningML_1.0.1.tgz
Old sources:	planningML archive

Linking:

Please use the canonical form https://CRAN.R-project.org/package=planningML to link to this page.
