VSOLassoBag
Variable-Selection Oriented Lasso Bagging Algorithm
1. Introduction
VSOLassoBag is a variable-selection oriented Lasso bagging algorithm for biomarker development in omics-based translational research, providing a bagging Lasso framework for selecting significant variables from multiple models. A main application of this package is to screen a limited number of variables that are less dependent to the training data sets. Basically, this package was initially developed to adjust Lasso selected results from bootstrapped sample sets. The variables with the highest frequency among several selected results were considered as the stable variables to discriminate different sample sets. However, it is usually difficult to determine the cut-off point in terms of frequency when applied in real data sets. In this package, we introduced several methods, namely (1) curve elbow point detection, (2) parametric statistical test to help determine the cut-off point for variable selection. The source code and documents are freely available through Github (https://github.com/likelet/VSOLassoBag).
1.1 Citation
If you find this tool useful, please cite:
Liang J, Wang C, Zhang D, et al. VSOLassoBag: A variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research [published online ahead of print, 2023 Jan 3]. J Genet Genomics. 2023;S1673-8527(22)00285-5. doi:10.1016/j.jgg.2022.12.005
2. Installation
VSOLassoBag can run in both Windows system and most POSIX systems. To the latest version of this package, start R and enter:
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}::install_github("likelet/VSOLassoBag") devtools
3. Prepare input data
To perform analysis with VSOLassoBag
function, you need
to provide:
ExpressionData is an object constructed by SummarizedExperiment, including independent variables (e.g., expression matrix) and dependent/outcome variable(s). Please view the SummarizedExperiment doucumention for the construction of SummarizedExperiment objects.
character indicating the type of dependent variables as input to the parameter
a.family
.
Note:
a.family
is determined entirely by the type ofdependent/outcome variable(s)
.You can also tune other parameters to better balance the performance and resource required for the analysis. Key parameters include
bootN
for bagging times, andbagFreq.sigMethod
for the cut-off point decision method.
3.1 Independent variables
Independent variables
is a matrix-like object with
variables as rows and observations as columns, all the entries should be
numeric. It can be obtained by SummarizedExperiment::assay
function.
Example Independent variables
matrix
[,1] [,2] [,3] [,4] [,5] [,6]
X_1 -1.0059043 -0.5831154 0.5188620 0.08714558 0.45080348 0.1149953
X_2 1.2907372 -0.1005391 0.3385391 0.28154759 -0.14453034 1.0213830
X_3 -0.5321077 1.9809263 0.5795232 0.98104200 -0.02066287 -1.0550853
X_4 -0.9984327 1.4455361 1.3130022 0.47038414 -1.23164279 2.3385465
X_5 -0.2032319 -0.6088388 -1.3633668 -0.22835534 2.33349367 0.2281900
X_6 2.4383744 0.7443007 1.7134854 0.62076548 0.07363745 -0.0915762
3.2 dependent/outcome variable(s)
dependent/outcome variable(s)
store sample information
with the same rows as the samples in the independent variables matrix
and can be invoked by SummarizedExperiment::colData
function.
Example out.mat
vector
[1] 0.1580122 0.4223877 4.3253401 3.9309707 1.2470942 -0.4556825
3.3 a.family
a.family
is a character determined by the data type of
out.mat
, including binomial,
gaussian, cox,
multinomial, mgaussian,
poisson.
4. Results
A list with:
results
: a dataframe containing variables with selection frequency >= 1. If settingbagFreq.sigMethod == "PST"
, the P.value and the P.adjust of each variable will be provided. If settingbagFreq.sigMethod = "CEP"
or using the elbow point indicators, elbow point(s) will be marked with * and thresholds for each variable will be assigned.permutations
: the permutation test results.model
: the regression model built on LassoBag results.
5. Usage Example
For tutorial purpose, here we used two examples utilizing different
cut-off point decision methods to exhibit how to interpret results. The
data is simulated by Gaussian model and you can obtain it by
data(simulated_example)
5.1 Using “CEP” cut-off point decision method
“CEP” (i.e. “Curve Elbow Point Detection”) is the default and recommended method for cut-off point decision. This method assumes that a sharp decrease of the observed frequency should separate important features from those unimportant ones, and based on this detects the elbow point(s) on the observed frequency curve. Finally the features with observed frequency higher than the elbow point are inferred as important features.
Note: There may be more than one elbow
point detected on the curve when using loose threshold,
therefore it is recommended to use a stricter threshold first (use a
larger kneedle.S
) and the algorithm will automatically
loosen the S parameter in case no elbow point can be found.
library(VSOLassoBag)
data(ExpressionData)
set.seed(19084)
VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "PST", do.plot = FALSE)
Results of important variables:
Sat Mar 25 00:09:40 2023 -- INIT process completed, start analyzing
Sat Mar 25 00:09:40 2023 -- start calculating observed frequency
2023-03-25 00:10:41--Bagging finished ...
Sat Mar 25 00:10:41 2023 -- Using Non-permutation method to determine a cutoff point
Using Parametric Statistical Test...
Sat Mar 25 00:10:41 2023 -- Done
variable Frequency P.value P.adjust
3 X_3 100 0.000000e+00 0.000000e+00
7 X_7 100 0.000000e+00 0.000000e+00
2 X_2 99 3.872023e-116 3.071805e-114
10 X_10 98 5.138189e-113 3.057222e-111
6 X_6 97 3.375159e-110 1.606576e-108
9 X_9 88 1.233395e-89 4.892466e-88
5 X_5 87 1.214149e-87 4.128108e-86
8 X_8 82 3.167383e-78 9.422964e-77
4 X_4 62 4.736353e-47 1.252502e-45
1 X_1 45 1.290263e-26 3.070827e-25
59 X_108 28 2.930143e-11 6.339763e-10
226 X_468 25 3.983529e-09 7.900667e-08
86 X_169 21 1.372018e-06 2.511849e-05
193 X_404 19 1.836787e-05 3.122538e-04
CEP Observed Frequency Curve (generated by
do.plot = TRUE
):
5.2 Using “PST” cut-off point decision method
“PST” (i.e. “Parametric Statistical Test”) is one of the alternative methods for cut-off point decision, which is computed as fast and memory-efficient as “CEP”. It assumes the expected selection frequency of all variables follows a binomial distribution, so we can first model such a theoretical background distribution, and then obtain the statistical significance (p-value) of all variables.
library(VSOLassoBag)
data(ExpressionData)
set.seed(19084)
VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "CEP", do.plot = TRUE)
Results of PST method
Sat Mar 25 00:10:42 2023 -- INIT process completed, start analyzing
Sat Mar 25 00:10:42 2023 -- start calculating observed frequency
2023-03-25 00:11:42--Bagging finished ...
Sat Mar 25 00:11:43 2023 -- Using Non-permutation method to determine a cutoff point
Detecting Elbow Points on the Observed Frequency Curve...
Using S = 10 for elbow point dection.
Sat Mar 25 00:11:43 2023 -- Done
variable Frequency elbow.point Diff Thres
3 X_3 100 0 0.000000
7 X_7 100 0 0.000000
2 X_2 99 1 0.000000
10 X_10 98 1 0.000000
6 X_6 97 1 0.000000
9 X_9 88 * 9 4.822785
5 X_5 87 1 4.822785
8 X_8 82 5 4.822785
4 X_4 62 * 20 15.822785
1 X_1 45 17 15.822785
59 X_108 28 17 15.822785
226 X_468 25 3 15.822785
86 X_169 21 4 15.822785
193 X_404 19 2 15.822785
PST Observed Frequency Distribution (generated by
do.plot = TRUE
):
6. Session Info
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=C
[2] LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] VSOLassoBag_0.99.1
loaded via a namespace (and not attached):
[1] POT_1.1-10 Rcpp_1.0.10
[3] lattice_0.20-45 glmnet_4.1-6
[5] digest_0.6.31 foreach_1.5.2
[7] utf8_1.2.3 R6_2.5.1
[9] GenomeInfoDb_1.34.9 stats4_4.2.2
[11] evaluate_0.20 ggplot2_3.4.1
[13] pillar_1.8.1 zlibbioc_1.44.0
[15] rlang_1.0.6 rstudioapi_0.14
[17] jquerylib_0.1.4 S4Vectors_0.36.2
[19] Matrix_1.5-1 rmarkdown_2.20
[21] splines_4.2.2 RCurl_1.98-1.10
[23] munsell_0.5.0 DelayedArray_0.23.2
[25] compiler_4.2.2 xfun_0.37
[27] pkgconfig_2.0.3 BiocGenerics_0.44.0
[29] shape_1.4.6 htmltools_0.5.4
[31] tidyselect_1.2.0 SummarizedExperiment_1.28.0
[33] tibble_3.2.0 GenomeInfoDbData_1.2.9
[35] bookdown_0.33 IRanges_2.32.0
[37] codetools_0.2-18 matrixStats_0.63.0
[39] fansi_1.0.4 dplyr_1.1.0
[41] bitops_1.0-7 grid_4.2.2
[43] jsonlite_1.8.4 gtable_0.3.1
[45] lifecycle_1.0.3 magrittr_2.0.3
[47] scales_1.2.1 rmdformats_1.0.4
[49] cli_3.6.0 cachem_1.0.7
[51] pbapply_1.7-0 XVector_0.38.0
[53] bslib_0.4.2 generics_0.1.3
[55] vctrs_0.5.2 iterators_1.0.14
[57] tools_4.2.2 Biobase_2.58.0
[59] glue_1.6.2 MatrixGenerics_1.10.0
[61] parallel_4.2.2 fastmap_1.1.1
[63] survival_3.4-0 yaml_2.3.7
[65] colorspace_2.1-0 GenomicRanges_1.50.2
[67] knitr_1.42 sass_0.4.5
@ Copyright 2018-2022, Center of Bioinformatics, Sun Yat-sen University Cancer Center Revision 322baf5b.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.