BranchGLM Vignette

1 Description

BranchGLM is a package for fitting generalized linear models (GLMs) and performing variable selection. Most functions in the package are implemented with RcppArmadillo, and some can also use OpenMP to perform computations in parallel. This vignette introduces the package, provides examples of how to use its main functions, and briefly describes the methods those functions employ.

2 Installation

BranchGLM can be installed using the install_github() function from the devtools package.


devtools::install_github("JacobSeedorff21/BranchGLM")
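
A released version may also be available on CRAN; if so, it can instead be installed with install.packages().

### Installing from CRAN (assumes a CRAN release is available)

install.packages("BranchGLM")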

3 Fitting GLMs

3.1 Optimization methods
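
BranchGLM() supports more than one optimization method through its method argument: Fisher scoring is used by default, and L-BFGS can be requested with method = "LBFGS", where the grads argument appears to control how many past gradients are retained for the limited-memory approximation. A full BFGS option via method = "BFGS" is believed to be available as well, but only Fisher scoring and L-BFGS are demonstrated below.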

3.2 Examples

### Using mtcars

library(BranchGLM)

cars <- mtcars

### Fitting a linear regression model with Fisher scoring

carsFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

carsFit
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ .
#> 
#>              Estimate        SE       z  p.values    
#> (Intercept) 12.303374 15.163219  1.7420  0.081510 .  
#> cyl         -0.111440  0.846566 -0.2826  0.777472    
#> disp         0.013335  0.014466  1.9791  0.047810 *  
#> hp          -0.021482  0.017635 -2.6153  0.008914 ** 
#> drat         0.787111  1.324804  1.2755  0.202115    
#> wt          -3.715304  1.534651 -5.1975 2.019e-07 ***
#> qsec         0.821041  0.592052  2.9773  0.002908 ** 
#> vs           0.317763  1.704847  0.4002  0.689041    
#> am           2.520227  1.666077  3.2476  0.001164 ** 
#> gear         0.655413  1.209679  1.1632  0.244745    
#> carb        -0.199419  0.671366 -0.6377  0.523665    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 4.6092
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 164
#> Algorithm converged in 1 iteration using Fisher's scoring

### Fitting a linear regression model with L-BFGS

LBFGSFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity",
                      method = "LBFGS", grads = 5)

LBFGSFit
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ .
#> 
#>              Estimate        SE       z  p.values    
#> (Intercept) 12.303374 15.163219  1.7420  0.081510 .  
#> cyl         -0.111440  0.846566 -0.2826  0.777472    
#> disp         0.013335  0.014466  1.9791  0.047810 *  
#> hp          -0.021482  0.017635 -2.6153  0.008914 ** 
#> drat         0.787111  1.324804  1.2755  0.202115    
#> wt          -3.715304  1.534651 -5.1975 2.019e-07 ***
#> qsec         0.821041  0.592052  2.9773  0.002908 ** 
#> vs           0.317763  1.704847  0.4002  0.689041    
#> am           2.520227  1.666077  3.2476  0.001164 ** 
#> gear         0.655413  1.209679  1.1632  0.244745    
#> carb        -0.199419  0.671366 -0.6377  0.523665    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 4.6092
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 164
#> Algorithm converged in 1 iteration using L-BFGS

3.3 Useful functions

### Predict method

predict(carsFit)
#>  [1] 22.59951 22.11189 26.25064 21.23740 17.69343 20.38304 14.38626 22.49601
#>  [9] 24.41909 18.69903 19.19165 14.17216 15.59957 15.74222 12.03401 10.93644
#> [17] 10.49363 27.77291 29.89674 29.51237 23.64310 16.94305 17.73218 13.30602
#> [25] 16.69168 28.29347 26.15295 27.63627 18.87004 19.69383 13.94112 24.36827
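
The predict method should also accept new observations through the usual newdata argument; the following is a minimal sketch assuming predict() for BranchGLM fits follows the standard R convention.

### Hedged sketch: predictions for new data (assumes a newdata argument is supported)

predict(carsFit, newdata = head(cars))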

### Accessing coefficients matrix

carsFit$coefficients
#>                Estimate          SE          z     p.values
#> (Intercept) 12.30337416 15.16321943  1.7419899 8.151021e-02
#> cyl         -0.11144048  0.84656568 -0.2826149 7.774720e-01
#> disp         0.01333524  0.01446623  1.9790571 4.780958e-02
#> hp          -0.02148212  0.01763456 -2.6153222 8.914333e-03
#> drat         0.78711097  1.32480360  1.2755494 2.021148e-01
#> wt          -3.71530393  1.53465098 -5.1975365 2.019469e-07
#> qsec         0.82104075  0.59205195  2.9772665 2.908311e-03
#> vs           0.31776281  1.70484682  0.4001571 6.890408e-01
#> am           2.52022689  1.66607737  3.2475608 1.163988e-03
#> gear         0.65541302  1.20967883  1.1632091 2.447447e-01
#> carb        -0.19941925  0.67136626 -0.6377059 5.236652e-01

4 Performing variable selection

4.1 Stepwise methods

4.1.1 Forward selection example
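
Forward selection starts with the intercept-only model and, at each step, adds the variable whose inclusion most improves the chosen metric, stopping when no addition improves it further.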

### Forward selection with mtcars

VariableSelection(carsFit, type = "forward")
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using forward selection with AIC
#> The best value of AIC obtained was 155
#> Number of models fit: 34
#> 
#> Order the variables were added to the model:
#> 
#> 1). wt
#> 2). cyl
#> 3). hp
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ cyl + hp + wt
#> 
#>              Estimate        SE        z  p.values    
#> (Intercept) 38.751787  1.671458  54.4680 < 2.2e-16 ***
#> cyl         -0.941617  0.515335  -4.2927 1.765e-05 ***
#> hp          -0.018038  0.011109  -3.8146 0.0001364 ***
#> wt          -3.166973  0.692745 -10.7403 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 5.5194
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 177 on 28 degrees of freedom
#> AIC: 155
#> Algorithm converged in 1 iteration using Fisher's scoring

4.1.2 Backward elimination example
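
Backward elimination starts with the full model and, at each step, removes the variable whose removal most improves the chosen metric, stopping when no removal improves it further.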

### Backward elimination with mtcars

VariableSelection(carsFit, type = "backward")
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using backward elimination with AIC
#> The best value of AIC obtained was 154
#> Number of models fit: 52
#> 
#> Order the variables were removed from the model:
#> 
#> 1). cyl
#> 2). vs
#> 3). carb
#> 4). gear
#> 5). drat
#> 6). disp
#> 7). hp
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ wt + qsec + am
#> 
#>             Estimate       SE        z  p.values    
#> (Intercept)  9.61778  6.51010   3.3980 0.0006788 ***
#> wt          -3.91650  0.66527 -13.5406 < 2.2e-16 ***
#> qsec         1.22589  0.27003  10.4419 < 2.2e-16 ***
#> am           2.93584  1.31978   5.1164 3.114e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring

4.2 Branch and bound

4.2.1 Branch and bound example
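
Branch and bound performs best subsets selection: it finds the model that attains the best value of the chosen metric, but prunes parts of the model space so that typically only a fraction of all possible models needs to be fit.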

  • If showprogress is TRUE, then the progress of the branch and bound algorithm will be reported occasionally.
  • Parallel computation can be used with this method and can lead to very large speedups; a hedged sketch of parallel use appears after the example below.

### Branch and bound with mtcars

VariableSelection(carsFit, type = "branch and bound", showprogress = FALSE)
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 154
#> Number of models fit: 272
#> 
#> 
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ wt + qsec + am
#> 
#>             Estimate       SE        z  p.values    
#> (Intercept)  9.61778  6.51010   3.3980 0.0006788 ***
#> wt          -3.91650  0.66527 -13.5406 < 2.2e-16 ***
#> qsec         1.22589  0.27003  10.4419 < 2.2e-16 ***
#> am           2.93584  1.31978   5.1164 3.114e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring
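
As mentioned above, this method can use parallel computation. The following is a minimal sketch of what that might look like, assuming VariableSelection() accepts parallel and nthreads arguments (both argument names are assumptions; check the package documentation).

### Hedged sketch: branch and bound with parallel computation
### (parallel and nthreads are assumed argument names)

ParVS <- VariableSelection(mpg ~ ., data = cars, family = "gaussian",
                           link = "identity", type = "branch and bound",
                           parallel = TRUE, nthreads = 2, showprogress = FALSE)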

### Can also use a formula and data

FormulaVS <- VariableSelection(mpg ~ ., data = cars, family = "gaussian", 
                               link = "identity", type = "branch and bound",
                               showprogress = FALSE)

### Number of models fit divided by the number of possible models

FormulaVS$numchecked / 2^(length(FormulaVS$variables))
#> [1] 0.265625

### Extracting final model

FormulaVS$finalmodel
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ wt + qsec + am
#> 
#>             Estimate       SE        z  p.values    
#> (Intercept)  9.61778  6.51010   3.3980 0.0006788 ***
#> wt          -3.91650  0.66527 -13.5406 < 2.2e-16 ***
#> qsec         1.22589  0.27003  10.4419 < 2.2e-16 ***
#> am           2.93584  1.31978   5.1164 3.114e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring

4.3 Using keep
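
The keep argument forces the listed variables to be included in every model that is considered, as indicated by the "Variables that were kept in each model" line in the output below.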

### Example of using keep

VariableSelection(mpg ~ ., data = cars, family = "gaussian", 
                  link = "identity", type = "branch and bound",
                  keep = c("hp", "cyl"), metric = "BIC",
                  showprogress = FALSE)
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using branch and bound selection with BIC
#> The best value of BIC obtained was 163
#> Number of models fit: 66
#> Variables that were kept in each model:  hp, cyl
#> 
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ cyl + hp + wt
#> 
#>              Estimate        SE        z  p.values    
#> (Intercept) 38.751787  1.671458  54.4680 < 2.2e-16 ***
#> cyl         -0.941617  0.515335  -4.2927 1.765e-05 ***
#> hp          -0.018038  0.011109  -3.8146 0.0001364 ***
#> wt          -3.166973  0.692745 -10.7403 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 5.5194
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 177 on 28 degrees of freedom
#> AIC: 155
#> Algorithm converged in 1 iteration using Fisher's scoring

4.4 Convergence issues

5 Utility functions for binomial GLMs

5.1 Table
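
Table() produces a confusion matrix for a fitted binomial model, along with the accuracy, sensitivity, specificity, and positive predictive value (PPV).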

### Predicting the supplement type (supp) in the ToothGrowth data

catData <- ToothGrowth

catFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")

Table(catFit)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

5.2 ROC
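
ROC() computes the receiver operating characteristic curve for a fitted binomial model, and the result has a plot method.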


catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

5.3 Cindex/AUC
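
For a binomial model, the C-index and the area under the ROC curve (AUC) are the same quantity, so Cindex() and AUC() return the same value.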


Cindex(catFit)
#> [1] 0.7127778

AUC(catFit)
#> [1] 0.7127778

5.4 MultipleROCCurves
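
MultipleROCCurves() overlays several ROC curves on a single plot, which makes it easy to compare different link functions fit to the same data.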

### Showing ROC plots for logit, probit, and cloglog

probitFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", 
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", 
                        link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit), 
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

5.5 Using predictions
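
The same utility functions can also be applied directly to a vector of predicted probabilities together with the observed response, which is useful when the predictions come from another source.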


preds <- predict(catFit)

Table(preds, catData$supp)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

AUC(preds, catData$supp)
#> [1] 0.7127778

ROC(preds, catData$supp) |> plot(main = "ROC Curve", col = "deepskyblue")