
SGS reproducible example

Fabio Feser

2023-08-21

Introduction

Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs. SGS is described in detail in Feser and Evangelou (2023).

The method is implemented in the sgs R package. The package has implementations for Gaussian and Binomial responses, both of which are demonstrated here.

Gaussian response

Data

For this example, a \(400\times 500\) input matrix, sampled from a multivariate Gaussian distribution with no correlation, is used with a simple grouping structure.

library(sgs)
groups = c(rep(1:20, each=3),
           rep(21:40, each=4),
           rep(41:60, each=5),
           rep(61:80, each=6),
           rep(81:100, each=7))

data = generate_toy_data(p=500, n=400, groups = groups, seed_id=3)
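
The grouping assigns 20 groups of each size from 3 to 7, giving \(20\times(3+4+5+6+7)=500\) variables across 100 groups. A quick sanity check:

# verify the dimensions implied by the grouping structure
length(groups)          # number of variables
## [1] 500
length(unique(groups))  # number of groups
## [1] 100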

Fitting an SGS model

We now fit an SGS model to the data using linear regression. The SGS model has many different hyperparameters which can be tuned/selected. Of particular importance is the \(\lambda\) parameter, which defines the level of sparsity in the model. First, we select this manually and then use cross-validation to tune it. The other parameters we leave as their default values, although they can easily be changed.

model = fit_sgs(X = data$X, y = data$y, groups = groups, type="linear", lambda = 1, alpha=0.95, vFDR=0.1, gFDR=0.1, standardise = "l2", intercept = TRUE, verbose=FALSE)

Note: we have fit an intercept and applied \(\ell_2\) standardisation. This is the recommended usage when applying SGS.

Output of model

The package provides several useful outputs after fitting a model. The \(\beta\) vector shows the fitted coefficients (note the intercept). We can also recover the indices of the non-zero variables and groups, which are indexed from the first variable, not the intercept.

model$beta[model$selected_var+1,]
##       v100       v136       v234       v334 
##  1.2211478  3.0393493  2.4211141 -0.4108451
model$group.effects[model$selected_group,]
##       G30       G39       G59       G76 
## 1.2211478 3.0393493 2.4211141 0.4108451
model$selected_var
## [1] 100 136 234 334
model$selected_group
## [1] 30 39 59 76
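
The group effects appear to be the \(\ell_2\) norms of each group's fitted coefficients (note that G76 matches the magnitude of v334). As a minimal sketch, we can recompute them under that assumption:

# recompute group effects as the l2 norm of each selected group's
# coefficients (an assumed definition, shown for illustration only)
beta_no_intercept = model$beta[-1, ]
sapply(model$selected_group, function(g) sqrt(sum(beta_no_intercept[groups == g]^2)))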

Defining a function that lets us calculate various metrics (including the FDR and sensitivity):

fdr_sensitivity = function(fitted_ids, true_ids, num_coef){
  # calculates the FDR, sensitivity, FPR, and F1 score
  num_true = length(intersect(fitted_ids, true_ids))  # true positives
  num_false = length(fitted_ids) - num_true           # false positives
  num_missed = length(true_ids) - num_true            # false negatives
  num_true_negatives = num_coef - length(true_ids)
  out = list()
  out$fdr = num_false / (num_true + num_false)
  if (is.nan(out$fdr)){out$fdr = 0}                   # no selections: FDR is 0
  out$sensitivity = num_true / length(true_ids)
  if (length(true_ids) == 0){
    out$sensitivity = 1
  }
  out$fpr = num_false / num_true_negatives
  out$f1 = (2*num_true)/(2*num_true + num_false + num_missed)
  if (is.nan(out$f1)){out$f1 = 1}
  return(out)
}

Calculating the relevant metrics gives

fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.1428571
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.25
fdr_sensitivity(fitted_ids = model$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.4
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.5714286

The model is currently too sparse, as our choice of \(\lambda\) is too high. We can instead use cross-validation.

Cross validation

Cross-validation is used to fit SGS models along a \(\lambda\) path of length \(20\). The first value, \(\lambda_\text{max}\), is chosen to give the null model and the path is terminated at \(\lambda_\text{min} = \text{min\_frac} \cdot \lambda_\text{max}\). The 1se rule (as in the glmnet package) is used to choose the optimal model.
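
Intermediate \(\lambda\) values are spaced log-linearly between the two endpoints. A minimal sketch of this construction (an assumption for illustration, not necessarily the package's internal code):

# assumed log-linear spacing between lambda_max and min_frac * lambda_max
lambda_path = function(lambda_max, min_frac = 0.05, nlambda = 20) {
  exp(seq(log(lambda_max), log(min_frac * lambda_max), length.out = nlambda))
}
round(lambda_path(1.8388), 4)  # reproduces the path printed below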

cv_model = fit_sgs_cv(X = data$X, y = data$y, groups=groups, type = "linear", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=TRUE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 1.8388. Number of non-zero: 0. Error: 9522.47623165794. Avg iter: 7"
## [1] "Lambda 2/20 done. Lambda: 1.5706. Number of non-zero: 1. Error: 9121.87427456379. Avg iter: 26"
## [1] "Lambda 3/20 done. Lambda: 1.3415. Number of non-zero: 2. Error: 8348.10001181521. Avg iter: 31"
## [1] "Lambda 4/20 done. Lambda: 1.1458. Number of non-zero: 3. Error: 7525.33322733818. Avg iter: 35"
## [1] "Lambda 5/20 done. Lambda: 0.9787. Number of non-zero: 8. Error: 6706.42167334536. Avg iter: 35"
## [1] "Lambda 6/20 done. Lambda: 0.8359. Number of non-zero: 14. Error: 5509.29608346792. Avg iter: 44"
## [1] "Lambda 7/20 done. Lambda: 0.714. Number of non-zero: 15. Error: 4256.09986924448. Avg iter: 37"
## [1] "Lambda 8/20 done. Lambda: 0.6098. Number of non-zero: 15. Error: 3281.14251906525. Avg iter: 37"
## [1] "Lambda 9/20 done. Lambda: 0.5209. Number of non-zero: 15. Error: 2559.14833615298. Avg iter: 37"
## [1] "Lambda 10/20 done. Lambda: 0.4449. Number of non-zero: 19. Error: 2011.70593862728. Avg iter: 36"
## [1] "Lambda 11/20 done. Lambda: 0.38. Number of non-zero: 22. Error: 1552.91875758507. Avg iter: 39"
## [1] "Lambda 12/20 done. Lambda: 0.3246. Number of non-zero: 22. Error: 1179.20560203183. Avg iter: 33"
## [1] "Lambda 13/20 done. Lambda: 0.2772. Number of non-zero: 22. Error: 892.913519340801. Avg iter: 32"
## [1] "Lambda 14/20 done. Lambda: 0.2368. Number of non-zero: 22. Error: 682.816068341828. Avg iter: 34"
## [1] "Lambda 15/20 done. Lambda: 0.2022. Number of non-zero: 24. Error: 525.918317926705. Avg iter: 36"
## [1] "Lambda 16/20 done. Lambda: 0.1727. Number of non-zero: 24. Error: 405.623243053464. Avg iter: 37"
## [1] "Lambda 17/20 done. Lambda: 0.1475. Number of non-zero: 26. Error: 314.722794039507. Avg iter: 39"
## [1] "Lambda 18/20 done. Lambda: 0.126. Number of non-zero: 26. Error: 244.495442693249. Avg iter: 39"
## [1] "Lambda 19/20 done. Lambda: 0.1076. Number of non-zero: 27. Error: 192.736115731821. Avg iter: 40"
## [1] "Lambda 20/20 done. Lambda: 0.0919. Number of non-zero: 27. Error: 154.39418403602. Avg iter: 41"

The fitting verbose output contains useful information, showing the error for each \(\lambda\) value, as well as the number of non-zero parameters. Aside from the verbose output, we can see a more succinct summary by using the print function

print(cv_model)
## 
##  regression type:  linear 
## 
##          lambda     error estimated_non_zero
##  [1,] 1.8387781 9522.4762                  0
##  [2,] 1.5705583 9121.8743                  1
##  [3,] 1.3414633 8348.1000                  2
##  [4,] 1.1457860 7525.3332                  3
##  [5,] 0.9786519 6706.4217                  8
##  [6,] 0.8358975 5509.2961                 14
##  [7,] 0.7139663 4256.0999                 15
##  [8,] 0.6098211 3281.1425                 15
##  [9,] 0.5208674 2559.1483                 15
## [10,] 0.4448893 2011.7059                 19
## [11,] 0.3799940 1552.9188                 22
## [12,] 0.3245648 1179.2056                 22
## [13,] 0.2772210  892.9135                 22
## [14,] 0.2367832  682.8161                 22
## [15,] 0.2022440  525.9183                 24
## [16,] 0.1727430  405.6232                 24
## [17,] 0.1475452  314.7228                 26
## [18,] 0.1260230  244.4954                 26
## [19,] 0.1076402  192.7361                 27
## [20,] 0.0919389  154.3942                 27

The best model is found to be the one at the end of the path. Checking the metrics again, we see that CV has generated a model with the correct amount of sparsity, giving FDR levels below the specified values.

fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0.03703704
## 
## $sensitivity
## [1] 0.9285714
## 
## $fpr
## [1] 0.002118644
## 
## $f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0.09090909
## 
## $sensitivity
## [1] 1
## 
## $fpr
## [1] 0.01111111
## 
## $f1
## [1] 0.952381

Plot

We can visualise the solution using the plot function

plot(cv_model,how_many = 10)

Prediction

The package implements a predict function to allow for easy prediction

predict(model,data$X,type="linear")[1:5]
## [1]  2.79322803  0.03135326  5.00188815  4.00179438 -2.74987113
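
As a quick check of fit quality, we can compare these predictions against the observed response, for example via the in-sample mean squared error:

# in-sample mean squared error of the manually-tuned model
mean((predict(model, data$X, type="linear") - data$y)^2)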

Logistic regression

As mentioned, the package can also be used to fit SGS to a Binomial response. First, we generate some Binomial data. We can use the same input matrix, \(X\), and true \(\beta\) as before.

sigmoid = function(x) {
   1 / (1 + exp(-x))
}
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400)) > 0.5, 1, 0)
# split into train and test sets
train_y = y[1:350]
test_y = y[351:400]
train_X = data$X[1:350, ]
test_X = data$X[351:400, ]

Fitting and prediction

We can again apply CV.

cv_model = fit_sgs_cv(X = train_X, y = train_y, groups=groups, type = "logistic", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=FALSE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 0.0486. Number of non-zero: 0. Misclassification error: 0.437142857142857. Avg iter: 2"
## [1] "Lambda 2/20 done. Lambda: 0.0415. Number of non-zero: 1. Misclassification error: 0.351428571428571. Avg iter: 4"
## [1] "Lambda 3/20 done. Lambda: 0.0355. Number of non-zero: 2. Misclassification error: 0.331428571428571. Avg iter: 5"
## [1] "Lambda 4/20 done. Lambda: 0.0303. Number of non-zero: 2. Misclassification error: 0.314285714285714. Avg iter: 7"
## [1] "Lambda 5/20 done. Lambda: 0.0259. Number of non-zero: 9. Misclassification error: 0.28. Avg iter: 11"
## [1] "Lambda 6/20 done. Lambda: 0.0221. Number of non-zero: 11. Misclassification error: 0.222857142857143. Avg iter: 9"
## [1] "Lambda 7/20 done. Lambda: 0.0189. Number of non-zero: 13. Misclassification error: 0.208571428571429. Avg iter: 11"
## [1] "Lambda 8/20 done. Lambda: 0.0161. Number of non-zero: 23. Misclassification error: 0.177142857142857. Avg iter: 15"
## [1] "Lambda 9/20 done. Lambda: 0.0138. Number of non-zero: 34. Misclassification error: 0.14. Avg iter: 19"
## [1] "Lambda 10/20 done. Lambda: 0.0118. Number of non-zero: 43. Misclassification error: 0.134285714285714. Avg iter: 20"
## [1] "Lambda 11/20 done. Lambda: 0.0101. Number of non-zero: 57. Misclassification error: 0.131428571428571. Avg iter: 21"
## [1] "Lambda 12/20 done. Lambda: 0.0086. Number of non-zero: 80. Misclassification error: 0.142857142857143. Avg iter: 21"
## [1] "Lambda 13/20 done. Lambda: 0.0073. Number of non-zero: 102. Misclassification error: 0.14. Avg iter: 22"
## [1] "Lambda 14/20 done. Lambda: 0.0063. Number of non-zero: 113. Misclassification error: 0.148571428571429. Avg iter: 26"
## [1] "Lambda 15/20 done. Lambda: 0.0054. Number of non-zero: 120. Misclassification error: 0.145714285714286. Avg iter: 30"
## [1] "Lambda 16/20 done. Lambda: 0.0046. Number of non-zero: 124. Misclassification error: 0.145714285714286. Avg iter: 36"
## [1] "Lambda 17/20 done. Lambda: 0.0039. Number of non-zero: 132. Misclassification error: 0.145714285714286. Avg iter: 45"
## [1] "Lambda 18/20 done. Lambda: 0.0033. Number of non-zero: 136. Misclassification error: 0.137142857142857. Avg iter: 55"
## [1] "Lambda 19/20 done. Lambda: 0.0028. Number of non-zero: 154. Misclassification error: 0.148571428571429. Avg iter: 67"
## [1] "Lambda 20/20 done. Lambda: 0.0024. Number of non-zero: 164. Misclassification error: 0.148571428571429. Avg iter: 81"

and again, use the predict function

predictions = predict(cv_model$fit,test_X,type="logistic")

In the Binomial case, the function returns both the predicted class probabilities (response) and the predicted class (class). We can use this to check the prediction accuracy, given as \(82\%\).

predictions$response[1:5]
## [1] 0.4085523 0.6480921 0.1706632 0.1489437 0.3894489
predictions$class[1:5]
## [1] 0 1 0 0 0
sum(predictions$class == test_y)/length(test_y)
## [1] 0.82
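
Beyond raw accuracy, a confusion table gives a fuller picture of the classification performance:

# cross-tabulate predicted against observed classes
table(predicted = predictions$class, actual = test_y)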

Reference

Feser, F. and Evangelou, M. (2023). Sparse-group SLOPE: adaptive bi-level selection with FDR-control. arXiv preprint arXiv:2305.09467.