Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs; the method is described in detail in the accompanying paper. SGS is implemented in the sgs R package, which provides implementations for Gaussian and Binomial responses, both of which are demonstrated here.
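To run the examples that follow, the package first needs to be installed and loaded; a minimal sketch is given below, assuming the package is available from CRAN.
# install.packages("sgs")  # assuming the package is available from CRAN
library(sgs)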
For this example, a \(400\times 500\) input matrix with a simple grouping structure is used, sampled from a multivariate Gaussian distribution with no correlation between the variables.
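The data-generation call itself is not shown here; the sketch below simulates data of the same shape using base R only. The group sizes, the number of signal groups, and the signal strengths are illustrative assumptions and will not reproduce the exact outputs shown later in this vignette.
set.seed(100)
n = 400
p = 500
groups = sort(rep(1:100, length.out = p))      # simple contiguous grouping (assumed sizes)
X = matrix(rnorm(n * p), nrow = n, ncol = p)   # uncorrelated Gaussian design
true_grp_id = sort(sample(1:100, 10))          # groups carrying signal (assumed number)
active = unlist(lapply(true_grp_id, function(g) {
  ids = which(groups == g)
  sample(ids, ceiling(length(ids) / 2))        # within-group (bi-level) sparsity
}))
true_beta = numeric(p)
true_beta[active] = rnorm(length(active), sd = 2)
y = as.numeric(X %*% true_beta + rnorm(n))
data = list(X = X, y = y, true_beta = true_beta,
  true_var_id = sort(active), true_grp_id = true_grp_id)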
We now fit an SGS model to the data using linear regression. The SGS model has several hyperparameters that can be tuned or selected. Of particular importance is the \(\lambda\) parameter, which controls the level of sparsity in the model. First, we set it manually; afterwards, we use cross-validation to tune it. The other parameters are left at their default values, although they can easily be changed.
model = fit_sgs(X = data$X, y = data$y, groups = groups, type="linear", lambda = 1, alpha=0.95, vFDR=0.1, gFDR=0.1, standardise = "l2", intercept = TRUE, verbose=FALSE)
Note: we have fit an intercept and applied \(\ell_2\) standardisation. This is the recommended usage when applying SGS.
The package provides several useful outputs after fitting a model. The fitted coefficient values are shown below (note the intercept). We can also recover the indices of the non-zero variables and groups, which are indexed from the first variable, not the intercept.
## v100 v136 v234 v334
## 1.2211478 3.0393493 2.4211141 -0.4108451
## G30 G39 G59 G76
## 1.2211478 3.0393493 2.4211141 0.4108451
## [1] 100 136 234 334
## [1] 30 39 59 76
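The exact component names returned by fit_sgs are not printed in the text; the sketch below mirrors the names of the cross-validated fit used later (selected_var, selected_group), so treat them as assumptions and consult the package documentation.
model$selected_var    # indices of the non-zero variables (e.g. 100 136 234 334 above)
model$selected_group  # indices of the non-zero groups (e.g. 30 39 59 76 above)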
Defining a function that lets us calculate various metrics (including the FDR and sensitivity):
fdr_sensitivity = function(fitted_ids, true_ids, num_coef){
  # calculates FDR, FPR, and sensitivity
  num_true = length(intersect(fitted_ids, true_ids))
  num_false = length(fitted_ids) - num_true
  num_missed = length(true_ids) - num_true
  num_true_negatives = num_coef - length(true_ids)
  out = c()
  out$fdr = num_false / (num_true + num_false)
  if (is.nan(out$fdr)){out$fdr = 0}
  out$sensitivity = num_true / length(true_ids)
  if (length(true_ids) == 0){
    out$sensitivity = 1
  }
  out$fpr = num_false / num_true_negatives
  out$f1 = (2*num_true)/(2*num_true + num_false + num_missed)
  if (is.nan(out$f1)){out$f1 = 1}
  return(out)
}
Calculating the relevant metrics for the variable and group selections gives:
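The calls producing the output below are not shown in the text; presumably they mirror the cross-validation calls used later, although the slot names on the single fit (selected_var, selected_group) are assumptions.
fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500)
fdr_sensitivity(fitted_ids = model$selected_group, true_ids = data$true_grp_id, num_coef = 100)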
## $fdr
## [1] 0
##
## $sensitivity
## [1] 0.1428571
##
## $fpr
## [1] 0
##
## $f1
## [1] 0.25
## $fdr
## [1] 0
##
## $sensitivity
## [1] 0.4
##
## $fpr
## [1] 0
##
## $f1
## [1] 0.5714286
The model is currently too sparse, as our choice of \(\lambda\) is too high. We can instead use cross-validation.
Cross-validation is used to fit SGS models along a \(\lambda\) path of length \(20\). The first value, \(\lambda_\text{max}\), is chosen to give the null model, and the path is terminated at \(\lambda_\text{min} = \text{min\_frac} \cdot \lambda_\text{max}\). The 1se rule is used to choose the optimal model.
cv_model = fit_sgs_cv(X = data$X, y = data$y, groups=groups, type = "linear", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=TRUE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 1.8388. Number of non-zero: 0. Error: 9522.47623165794. Avg iter: 7"
## [1] "Lambda 2/20 done. Lambda: 1.5706. Number of non-zero: 1. Error: 9121.87427456379. Avg iter: 26"
## [1] "Lambda 3/20 done. Lambda: 1.3415. Number of non-zero: 2. Error: 8348.10001181521. Avg iter: 31"
## [1] "Lambda 4/20 done. Lambda: 1.1458. Number of non-zero: 3. Error: 7525.33322733818. Avg iter: 35"
## [1] "Lambda 5/20 done. Lambda: 0.9787. Number of non-zero: 8. Error: 6706.42167334536. Avg iter: 35"
## [1] "Lambda 6/20 done. Lambda: 0.8359. Number of non-zero: 14. Error: 5509.29608346792. Avg iter: 44"
## [1] "Lambda 7/20 done. Lambda: 0.714. Number of non-zero: 15. Error: 4256.09986924448. Avg iter: 37"
## [1] "Lambda 8/20 done. Lambda: 0.6098. Number of non-zero: 15. Error: 3281.14251906525. Avg iter: 37"
## [1] "Lambda 9/20 done. Lambda: 0.5209. Number of non-zero: 15. Error: 2559.14833615298. Avg iter: 37"
## [1] "Lambda 10/20 done. Lambda: 0.4449. Number of non-zero: 19. Error: 2011.70593862728. Avg iter: 36"
## [1] "Lambda 11/20 done. Lambda: 0.38. Number of non-zero: 22. Error: 1552.91875758507. Avg iter: 39"
## [1] "Lambda 12/20 done. Lambda: 0.3246. Number of non-zero: 22. Error: 1179.20560203183. Avg iter: 33"
## [1] "Lambda 13/20 done. Lambda: 0.2772. Number of non-zero: 22. Error: 892.913519340801. Avg iter: 32"
## [1] "Lambda 14/20 done. Lambda: 0.2368. Number of non-zero: 22. Error: 682.816068341828. Avg iter: 34"
## [1] "Lambda 15/20 done. Lambda: 0.2022. Number of non-zero: 24. Error: 525.918317926705. Avg iter: 36"
## [1] "Lambda 16/20 done. Lambda: 0.1727. Number of non-zero: 24. Error: 405.623243053464. Avg iter: 37"
## [1] "Lambda 17/20 done. Lambda: 0.1475. Number of non-zero: 26. Error: 314.722794039507. Avg iter: 39"
## [1] "Lambda 18/20 done. Lambda: 0.126. Number of non-zero: 26. Error: 244.495442693249. Avg iter: 39"
## [1] "Lambda 19/20 done. Lambda: 0.1076. Number of non-zero: 27. Error: 192.736115731821. Avg iter: 40"
## [1] "Lambda 20/20 done. Lambda: 0.0919. Number of non-zero: 27. Error: 154.39418403602. Avg iter: 41"
The fitting verbose output contains useful information, showing the error for each \(\lambda\) value, as well as the number of non-zero parameters. Aside from the verbose output, a more succinct summary can be obtained by printing the fitted CV object.
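The summary shown below presumably comes from the object's print method; the exact call is an assumption.
print(cv_model)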
##
## regression type: linear
##
## lambda error estimated_non_zero
## [1,] 1.8387781 9522.4762 0
## [2,] 1.5705583 9121.8743 1
## [3,] 1.3414633 8348.1000 2
## [4,] 1.1457860 7525.3332 3
## [5,] 0.9786519 6706.4217 8
## [6,] 0.8358975 5509.2961 14
## [7,] 0.7139663 4256.0999 15
## [8,] 0.6098211 3281.1425 15
## [9,] 0.5208674 2559.1483 15
## [10,] 0.4448893 2011.7059 19
## [11,] 0.3799940 1552.9188 22
## [12,] 0.3245648 1179.2056 22
## [13,] 0.2772210 892.9135 22
## [14,] 0.2367832 682.8161 22
## [15,] 0.2022440 525.9183 24
## [16,] 0.1727430 405.6232 24
## [17,] 0.1475452 314.7228 26
## [18,] 0.1260230 244.4954 26
## [19,] 0.1076402 192.7361 27
## [20,] 0.0919389 154.3942 27
The best model is found to be the one at the end of the path. Checking the metrics again, we see that CV has produced a model with the right level of sparsity, giving FDR values below the specified targets.
fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0.03703704
##
## $sensitivity
## [1] 0.9285714
##
## $fpr
## [1] 0.002118644
##
## $f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0.09090909
##
## $sensitivity
## [1] 1
##
## $fpr
## [1] 0.01111111
##
## $f1
## [1] 0.952381
As mentioned, the package can also be used to fit SGS to a Binomial response. First, we generate some Binomial data. We can use the same input matrix, \(X\), and true \(\beta\) as before.
sigmoid = function(x) {
  1 / (1 + exp(-x))
}
# generate binary responses from the logistic model and split into train/test sets
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400)) > 0.5, 1, 0)
train_y = y[1:350]
test_y = y[351:400]
train_X = data$X[1:350, ]
test_X = data$X[351:400, ]
We can again apply CV.
cv_model = fit_sgs_cv(X = train_X, y = train_y, groups=groups, type = "logistic", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=FALSE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 0.0486. Number of non-zero: 0. Misclassification error: 0.437142857142857. Avg iter: 2"
## [1] "Lambda 2/20 done. Lambda: 0.0415. Number of non-zero: 1. Misclassification error: 0.351428571428571. Avg iter: 4"
## [1] "Lambda 3/20 done. Lambda: 0.0355. Number of non-zero: 2. Misclassification error: 0.331428571428571. Avg iter: 5"
## [1] "Lambda 4/20 done. Lambda: 0.0303. Number of non-zero: 2. Misclassification error: 0.314285714285714. Avg iter: 7"
## [1] "Lambda 5/20 done. Lambda: 0.0259. Number of non-zero: 9. Misclassification error: 0.28. Avg iter: 11"
## [1] "Lambda 6/20 done. Lambda: 0.0221. Number of non-zero: 11. Misclassification error: 0.222857142857143. Avg iter: 9"
## [1] "Lambda 7/20 done. Lambda: 0.0189. Number of non-zero: 13. Misclassification error: 0.208571428571429. Avg iter: 11"
## [1] "Lambda 8/20 done. Lambda: 0.0161. Number of non-zero: 23. Misclassification error: 0.177142857142857. Avg iter: 15"
## [1] "Lambda 9/20 done. Lambda: 0.0138. Number of non-zero: 34. Misclassification error: 0.14. Avg iter: 19"
## [1] "Lambda 10/20 done. Lambda: 0.0118. Number of non-zero: 43. Misclassification error: 0.134285714285714. Avg iter: 20"
## [1] "Lambda 11/20 done. Lambda: 0.0101. Number of non-zero: 57. Misclassification error: 0.131428571428571. Avg iter: 21"
## [1] "Lambda 12/20 done. Lambda: 0.0086. Number of non-zero: 80. Misclassification error: 0.142857142857143. Avg iter: 21"
## [1] "Lambda 13/20 done. Lambda: 0.0073. Number of non-zero: 102. Misclassification error: 0.14. Avg iter: 22"
## [1] "Lambda 14/20 done. Lambda: 0.0063. Number of non-zero: 113. Misclassification error: 0.148571428571429. Avg iter: 26"
## [1] "Lambda 15/20 done. Lambda: 0.0054. Number of non-zero: 120. Misclassification error: 0.145714285714286. Avg iter: 30"
## [1] "Lambda 16/20 done. Lambda: 0.0046. Number of non-zero: 124. Misclassification error: 0.145714285714286. Avg iter: 36"
## [1] "Lambda 17/20 done. Lambda: 0.0039. Number of non-zero: 132. Misclassification error: 0.145714285714286. Avg iter: 45"
## [1] "Lambda 18/20 done. Lambda: 0.0033. Number of non-zero: 136. Misclassification error: 0.137142857142857. Avg iter: 55"
## [1] "Lambda 19/20 done. Lambda: 0.0028. Number of non-zero: 154. Misclassification error: 0.148571428571429. Avg iter: 67"
## [1] "Lambda 20/20 done. Lambda: 0.0024. Number of non-zero: 164. Misclassification error: 0.148571428571429. Avg iter: 81"
We again use the predict function. In the Binomial case, the function returns both the predicted class probabilities (response) and the predicted classes (class). We can use this to check the prediction accuracy, given as \(82\%\).
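The prediction call itself is not shown; a minimal sketch follows, in which the predict interface (the argument name x and the response/class components) matches the description above but should be treated as an assumption.
predictions = predict(cv_model, x = test_X)  # interface is an assumption
head(predictions$response, 5)                # predicted class probabilities
head(predictions$class, 5)                   # predicted classes
mean(predictions$class == test_y)            # prediction accuracy on the test set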
## [1] 0.4085523 0.6480921 0.1706632 0.1489437 0.3894489
## [1] 0 1 0 0 0
## [1] 0.82