The two examples below implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model. Models are fitted using the hsstan R package, which performs full Bayesian inference through a Stan implementation. In Bayesian inference, model meta-parameters such as the amount of shrinkage are also given prior distributions and are thus learned directly from the data through sampling. This bypasses the need to cross-validate results over a grid of values for the meta-parameters, as would be required to find the optimal lambda in a lasso or elastic net model.
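As an illustrative sketch of this difference (the simulated data and settings below are made up for illustration, not part of the vignette), a lasso requires an inner cross-validation grid search for lambda, whereas hsstan() samples the amount of shrinkage along with the coefficients; the short chains are only to keep the sketch fast:
library(glmnet)
library(hsstan)
set.seed(1)
# simulated data: 100 samples, 5 standardised predictors, one true signal
x <- matrix(rnorm(100 * 5), 100, 5,
            dimnames = list(NULL, paste0("marker", 1:5)))
df <- data.frame(outcome = 2 * x[, 1] + rnorm(100), x)
# lasso: lambda is a meta-parameter tuned over a grid by inner CV
fit.lasso <- cv.glmnet(x = x, y = df$outcome)
fit.lasso$lambda.min  # value selected by cross-validation
# horseshoe: the global shrinkage scale has its own prior and is sampled,
# so no grid search or inner CV loop is required
fit.hs <- hsstan(x = df, covs.model = outcome ~ 1,
                 penalized = paste0("marker", 1:5),
                 chains = 2, warmup = 100, iter = 200)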
However, Bayesian inference is computationally intensive. In high-dimensional settings, e.g. with more than 10,000 biomarkers, pre-filtering of inputs based on univariate measures of association with the outcome may be beneficial. If pre-filtering of inputs is used, then a cross-validation procedure is needed to ensure that the data points used for pre-filtering and model fitting differ from the data points used to quantify model performance. The outercv() function is used to perform univariate pre-filtering and cross-validate model performance in this setting.
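For instance, the univariate filter can be run on its own to inspect what it retains; this is a minimal sketch on simulated data (names made up), using lm_filter() from nestedcv, which by default returns the column indices of the retained predictors:
library(nestedcv)
set.seed(2)
# simulated data: 100 samples, 10 predictors, only marker1 is informative
x <- matrix(rnorm(100 * 10), 100, 10,
            dimnames = list(NULL, paste0("marker", 1:10)))
y <- 2 * x[, 1] + rnorm(100)
# keep the 3 predictors with strongest univariate association with y
lm_filter(y, x, nfilter = 3, p_cutoff = NULL)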
CAUTION should be used when setting the number of cores available for parallelisation. The default setting in hsstan is to use 4 cores to parallelise the Markov chains in the Bayesian inference procedure. This can be switched off by setting options(mc.cores = 1). The argument cv.cores in outercv() controls parallelisation over the outer CV folds. On Unix/macOS, setting cv.cores to >1 will induce nested parallelisation, which will generate an error unless parallelisation of the chains is disabled by setting options(mc.cores = 1).
Nested parallelisation is feasible if cv.cores is >1 and multicore_fork = FALSE is set, as this uses cluster-based parallelisation instead of mclapply(). Beware that large numbers of processes may be spawned: the code will use the product of cv.cores and mc.cores. If we are performing 10-fold cross-validation with 4 chains and set cv.cores = 10, then 40 processes will be invoked simultaneously.
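The total process count is simply the product of the two settings; a quick sanity check of this arithmetic before launching a long run:
# total simultaneous processes = outer-fold workers x chains per worker
cv.cores <- 10       # one worker per outer CV fold
mc.cores <- 4        # parallel Markov chains within each worker
cv.cores * mc.cores  # 40 processes spawned at once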
We use cross-validation and apply univariate filtering of predictors and model fitting in one part of the data (training fold), followed by evaluation of model performance on the left-out data (testing fold), repeated in each fold. Only one cross-validation split is needed (function outercv()) as the Bayesian model does not require cross-validation for meta-parameters.
Note that in the examples below the length of sampling and the number of chains are curtailed for speed. In practice we recommend 4 chains, warmup = 1000 and iter = 2000.
library(hsstan)
# number of cores for parallelising hsstan sampling
# store original options in oldopt
# at the end reset options to old configuration
oldopt <- options(mc.cores = 2)
# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
                         model = "model.hsstan",
                         filterFUN = lm_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         chains = 2,  # chains parallelised via options(mc.cores)
                         n_outer_folds = 3, cv.cores = 1,  # CV folds not parallelised
                         unpenalized = uvars, warmup = 100, iter = 200)
We can then view prediction performance based on the testing folds and examine the Bayesian model using the hsstan package.
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop: 3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#>
#> Model: model.hsstan
#> Filter: lm_filter
#> n.filter
#> Fold 1 3
#> Fold 2 3
#> Fold 3 3
#>
#> Final fit: mean sd 2.5% 97.5% n_eff Rhat
#> (Intercept) -3.17 0.14 -3.40 -2.89 221 0.99
#> marker1 0.37 0.32 -0.36 0.90 209 1.00
#> marker2 2.07 0.22 1.70 2.47 196 1.00
#> marker3 0.11 0.31 -0.43 0.92 104 1.00
#>
#> Result:
#> RMSE Rsquared MAE
#> 2.1226 0.4772 1.6971
sampler.stats(res.cv.hsstan$final_fit)
#> accept.stat stepsize divergences treedepth gradients warmup sample
#> chain:1 0.9843 0.0241 0 8 14844 0.25 0.12
#> chain:2 0.9722 0.0316 0 8 11356 0.12 0.09
#> all 0.9783 0.0279 0 8 26200 0.37 0.21
print(projsel(res.cv.hsstan$final_fit), digits = 4) # adding marker2
#> Model KL ELPD
#> Intercept only 0.32508 -374.99055
#> Initial submodel 0.32031 -374.72416
#> marker2 0.00151 -325.82289
#> marker3 0.00000 -325.88824
#> var kl rel.kl.null rel.kl elpd delta.elpd
#> 1 Intercept only 0.325077 0.00000 NA -375.0 -49.10230
#> 2 Initial submodel 0.320306 0.01468 0.0000 -374.7 -48.83591
#> 3 marker2 0.001506 0.99537 0.9953 -325.8 0.06535
#> 4 marker3 0.000000 1.00000 1.0000 -325.9 0.00000
Here adding marker2 improves the model fit: there is a substantial decrease in KL-divergence from the full model to the submodel. Adding marker3 does not improve the model fit: there is no decrease in KL-divergence from the full model to the submodel.
For the binary outcome we proceed in the same way: univariate filtering of predictors and model fitting in the training folds, followed by evaluation of model performance on the left-out testing fold. Again, only one cross-validation split is needed (function outercv()) as the Bayesian model does not require cross-validation for meta-parameters.
# sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}
# load iris dataset and create a binary outcome
set.seed(267)
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
rownames(dt) <- paste0("sample", c(1:nrow(dt)))
dt$outcome.bin <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
dt$outcome.bin <- factor(dt$outcome.bin)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.bin,
                         x = as.matrix(dt[, c(uvars, pvars)]),
                         model = "model.hsstan",
                         filterFUN = ttest_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         chains = 2,  # chains parallelised via options(mc.cores)
                         n_outer_folds = 3, cv.cores = 1,
                         unpenalized = uvars, warmup = 100, iter = 200)
We view prediction performance based on testing folds and examine the model.
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop: 3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> FALSE TRUE
#> 78 72
#>
#> Model: model.hsstan
#> Filter: ttest_filter
#> n.filter
#> Fold 1 3
#> Fold 2 3
#> Fold 3 3
#>
#> Final fit: mean sd 2.5% 97.5% n_eff Rhat
#> (Intercept) -0.12 0.23 -0.56 0.27 200 1
#> marker1 0.50 0.30 -0.10 1.16 208 1
#> marker2 1.91 0.36 1.26 2.66 263 1
#> marker3 0.01 0.24 -0.55 0.69 171 1
#>
#> Result:
#> Reference
#> Predicted FALSE TRUE
#> FALSE 56 23
#> TRUE 22 49
#>
#> AUC Accuracy Balanced accuracy
#> 0.8284 0.7000 0.6993
# examine the Bayesian model
print(projsel(res.cv.hsstan$final_fit), digits = 4) # adding marker2
#> Model KL ELPD
#> Intercept only 0.20643 -104.26957
#> Initial submodel 0.20118 -104.17411
#> marker2 0.00060 -73.84232
#> marker3 0.00000 -73.93210
#> var kl rel.kl.null rel.kl elpd delta.elpd
#> 1 Intercept only 2.064e-01 0.00000 NA -104.27 -30.33748
#> 2 Initial submodel 2.012e-01 0.02543 0.000 -104.17 -30.24201
#> 3 marker2 5.964e-04 0.99711 0.997 -73.84 0.08977
#> 4 marker3 9.133e-18 1.00000 1.000 -73.93 0.00000
options(oldopt) # reset options
Here adding marker2 improves the model fit: there is a substantial decrease in KL-divergence from the full model to the submodel. Adding marker3 does not improve the model fit: there is no decrease in KL-divergence from the full model to the submodel.
At the time of writing, there appears to be a bug in rstan (used by hsstan) which leads to it ignoring the pass-through argument cores and instead spawning multiple processes as specified by the chains argument. This behaviour can be limited by setting options(mc.cores = ..).
A key problem with parallelisation in R is that errors, warnings and user input have to be suppressed during multicore processing. If a nestedcv call is not working, we recommend that you try it with cv.cores = 1 first to check that it starts up without error messages.
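As an illustrative sketch (assuming the objects dt, uvars and pvars from the binary example above are still in the workspace), the call can be wrapped in try() so that any start-up errors surface while running serially, before parallelisation is switched on:
options(mc.cores = 1)  # run chains serially while debugging
test <- try(outercv(y = dt$outcome.bin,
                    x = as.matrix(dt[, c(uvars, pvars)]),
                    model = "model.hsstan", filterFUN = ttest_filter,
                    filter_options = list(force_vars = uvars, nfilter = 2,
                                          p_cutoff = NULL, rsq_cutoff = 0.9),
                    chains = 2, n_outer_folds = 3, cv.cores = 1,
                    unpenalized = uvars, warmup = 100, iter = 200))
if (inherits(test, "try-error")) message("fix errors before raising cv.cores")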
If you use this package, please cite as:
Lewis MJ, Spiliopoulou A, Goldmann K, Pitzalis C, McKeigue P, Barnes MR (2023). nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high dimensional data. Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad048
Carpenter, B., et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software 2017;76(1):1-32.
Piironen, J. and Vehtari, A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics 2017;11(2):5018-5051.