BranchGLM is an R package for fitting GLMs and performing variable selection. Most functions in this package make use of RcppArmadillo, and some can also make use of OpenMP to perform computations in parallel. This vignette introduces the package, provides examples of how to use its main functions, and briefly describes the methods they employ.
BranchGLM can be installed using the install_github() function from the devtools package.

devtools::install_github("JacobSeedorff21/BranchGLM")
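If a CRAN release is available (an assumption here, not something this vignette states), the standard installer also works:

### Assumes a CRAN release of BranchGLM exists
install.packages("BranchGLM")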
BranchGLM() allows fitting of Gaussian, binomial, and Poisson GLMs with a variety of links available. The grads argument is for L-BFGS only; it is the number of gradients that are stored at a time and used to approximate the inverse Hessian. The default value is 10, but another common choice is 5. The tol argument controls how strict the convergence criteria are; lower values of tol lead to more accurate results, but may also be slower. An offset can be specified via the offset argument, and it should be a numeric vector (a short sketch using tol and offset follows the L-BFGS example below).

### Using mtcars
library(BranchGLM)
cars <- mtcars

### Fitting linear regression model with Fisher scoring
carsFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

carsFit
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ .
#>
#> Estimate SE z p.values
#> (Intercept) 12.303374 15.163219 1.7420 0.081510 .
#> cyl -0.111440 0.846566 -0.2826 0.777472
#> disp 0.013335 0.014466 1.9791 0.047810 *
#> hp -0.021482 0.017635 -2.6153 0.008914 **
#> drat 0.787111 1.324804 1.2755 0.202115
#> wt -3.715304 1.534651 -5.1975 2.019e-07 ***
#> qsec 0.821041 0.592052 2.9773 0.002908 **
#> vs 0.317763 1.704847 0.4002 0.689041
#> am 2.520227 1.666077 3.2476 0.001164 **
#> gear 0.655413 1.209679 1.1632 0.244745
#> carb -0.199419 0.671366 -0.6377 0.523665
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 4.6092
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 164
#> Algorithm converged in 1 iteration using Fisher's scoring
### Fitting linear regression with LBFGS
LBFGSFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity",
                      method = "LBFGS", grads = 5)

LBFGSFit
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ .
#>
#> Estimate SE z p.values
#> (Intercept) 12.303374 15.163219 1.7420 0.081510 .
#> cyl -0.111440 0.846566 -0.2826 0.777472
#> disp 0.013335 0.014466 1.9791 0.047810 *
#> hp -0.021482 0.017635 -2.6153 0.008914 **
#> drat 0.787111 1.324804 1.2755 0.202115
#> wt -3.715304 1.534651 -5.1975 2.019e-07 ***
#> qsec 0.821041 0.592052 2.9773 0.002908 **
#> vs 0.317763 1.704847 0.4002 0.689041
#> am 2.520227 1.666077 3.2476 0.001164 **
#> gear 0.655413 1.209679 1.1632 0.244745
#> carb -0.199419 0.671366 -0.6377 0.523665
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 4.6092
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 164
#> Algorithm converged in 1 iteration using L-BFGS
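As mentioned above, tol and offset can be supplied in the same way. A minimal sketch; the name offsetFit and the offset values are purely illustrative (output omitted):

### Tightening the convergence tolerance and supplying an illustrative offset
offsetFit <- BranchGLM(mpg ~ . - disp, data = cars, family = "gaussian",
                       link = "identity", tol = 1e-8,
                       offset = 0.01 * cars$disp)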
Several standard methods work on the fitted model:

- coef() to extract the coefficients
- logLik() to extract the log-likelihood
- AIC() to extract the AIC
- BIC() to extract the BIC
- predict() to obtain predictions from the fitted model

The coefficients matrix can also be accessed directly via the coefficients slot of the fitted BranchGLM object.

### Predict method
predict(carsFit)
#> [1] 22.59951 22.11189 26.25064 21.23740 17.69343 20.38304 14.38626 22.49601
#> [9] 24.41909 18.69903 19.19165 14.17216 15.59957 15.74222 12.03401 10.93644
#> [17] 10.49363 27.77291 29.89674 29.51237 23.64310 16.94305 17.73218 13.30602
#> [25] 16.69168 28.29347 26.15295 27.63627 18.87004 19.69383 13.94112 24.36827
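The other accessors listed above are called in the same way; a quick sketch (output omitted):

### Extracting coefficients, log-likelihood, AIC, and BIC
coef(carsFit)
logLik(carsFit)
AIC(carsFit)
BIC(carsFit)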
### Accessing coefficients matrix
carsFit$coefficients
#> Estimate SE z p.values
#> (Intercept) 12.30337416 15.16321943 1.7419899 8.151021e-02
#> cyl -0.11144048 0.84656568 -0.2826149 7.774720e-01
#> disp 0.01333524 0.01446623 1.9790571 4.780958e-02
#> hp -0.02148212 0.01763456 -2.6153222 8.914333e-03
#> drat 0.78711097 1.32480360 1.2755494 2.021148e-01
#> wt -3.71530393 1.53465098 -5.1975365 2.019469e-07
#> qsec 0.82104075 0.59205195 2.9772665 2.908311e-03
#> vs 0.31776281 1.70484682 0.4001571 6.890408e-01
#> am 2.52022689 1.66607737 3.2475608 1.163988e-03
#> gear 0.65541302 1.20967883 1.1632091 2.447447e-01
#> carb -0.19941925 0.67136626 -0.6377059 5.236652e-01
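Standard matrix-style indexing can be used to pull pieces out of this slot; a small sketch (assuming matrix-like indexing; output omitted):

### Extracting just the estimates column from the coefficients matrix
carsFit$coefficients[, "Estimate"]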
Forward selection, backward elimination, and branch and bound selection can be performed with VariableSelection(). VariableSelection() can accept either a BranchGLM object or a formula along with the data and the desired family and link to perform the variable selection. It returns the final model and some other information about the search. Note that VariableSelection() will not properly handle interaction terms; i.e., it may keep an interaction term while removing the lower-order terms. A keep argument can also be specified if a set of variables is desired to be kept in every model.

### Forward selection with mtcars
VariableSelection(carsFit, type = "forward")
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using forward selection with AIC
#> The best value of AIC obtained was 155
#> Number of models fit: 34
#>
#> Order the variables were added to the model:
#>
#> 1). wt
#> 2). cyl
#> 3). hp
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ cyl + hp + wt
#>
#> Estimate SE z p.values
#> (Intercept) 38.751787 1.671458 54.4680 < 2.2e-16 ***
#> cyl -0.941617 0.515335 -4.2927 1.765e-05 ***
#> hp -0.018038 0.011109 -3.8146 0.0001364 ***
#> wt -3.166973 0.692745 -10.7403 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 5.5194
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 177 on 28 degrees of freedom
#> AIC: 155
#> Algorithm converged in 1 iteration using Fisher's scoring
### Backward elimination with mtcars
VariableSelection(carsFit, type = "backward")
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using backward elimination with AIC
#> The best value of AIC obtained was 154
#> Number of models fit: 52
#>
#> Order the variables were removed from the model:
#>
#> 1). cyl
#> 2). vs
#> 3). carb
#> 4). gear
#> 5). drat
#> 6). disp
#> 7). hp
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ wt + qsec + am
#>
#> Estimate SE z p.values
#> (Intercept) 9.61778 6.51010 3.3980 0.0006788 ***
#> wt -3.91650 0.66527 -13.5406 < 2.2e-16 ***
#> qsec 1.22589 0.27003 10.4419 < 2.2e-16 ***
#> am 2.93584 1.31978 5.1164 3.114e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring
If showprogress is TRUE, then progress of the branch and bound algorithm will be reported occasionally.

### Branch and bound with mtcars
VariableSelection(carsFit, type = "branch and bound", showprogress = FALSE)
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 154
#> Number of models fit: 272
#>
#>
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ wt + qsec + am
#>
#> Estimate SE z p.values
#> (Intercept) 9.61778 6.51010 3.3980 0.0006788 ***
#> wt -3.91650 0.66527 -13.5406 < 2.2e-16 ***
#> qsec 1.22589 0.27003 10.4419 < 2.2e-16 ***
#> am 2.93584 1.31978 5.1164 3.114e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring
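The introduction noted that some functions can use OpenMP; for larger searches this can speed up branch and bound considerably. The sketch below assumes the parallel and nthreads argument names (verify the exact names with ?VariableSelection):

### Parallel branch and bound (assumes OpenMP support and these argument names)
VariableSelection(carsFit, type = "branch and bound",
                  showprogress = FALSE, parallel = TRUE, nthreads = 2)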
### Can also use a formula and data
FormulaVS <- VariableSelection(mpg ~ ., data = cars, family = "gaussian",
                               link = "identity", type = "branch and bound",
                               showprogress = FALSE)

### Number of models fit divided by the number of possible models
FormulaVS$numchecked / 2^(length(FormulaVS$variables))
#> [1] 0.265625
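With the 10 candidate predictors in mtcars there are 2^10 = 1024 possible models, so branch and bound evaluated only 272 of them (272/1024 = 0.265625) while still finding the model with the best metric value.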
### Extracting final model
FormulaVS$finalmodel
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ wt + qsec + am
#>
#> Estimate SE z p.values
#> (Intercept) 9.61778 6.51010 3.3980 0.0006788 ***
#> wt -3.91650 0.66527 -13.5406 < 2.2e-16 ***
#> qsec 1.22589 0.27003 10.4419 < 2.2e-16 ***
#> am 2.93584 1.31978 5.1164 3.114e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 5.2902
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 169 on 28 degrees of freedom
#> AIC: 154
#> Algorithm converged in 1 iteration using Fisher's scoring
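Beyond numchecked, variables, and finalmodel, the search returns other information as well; the components of the result can be listed with base R (output omitted):

### Listing everything returned by VariableSelection()
names(FormulaVS)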
Specifying variables via keep will ensure that those variables are kept through the selection process.

### Example of using keep
VariableSelection(mpg ~ ., data = cars, family = "gaussian",
link = "identity", type = "branch and bound",
keep = c("hp", "cyl"), metric = "BIC",
showprogress = FALSE)
#> Variable Selection Info:
#> ---------------------------------------------
#> Variables were selected using branch and bound selection with BIC
#> The best value of BIC obtained was 163
#> Number of models fit: 66
#> Variables that were kept in each model: hp, cyl
#>
#> ---------------------------------------------
#> Final Model:
#> ---------------------------------------------
#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ cyl + hp + wt
#>
#> Estimate SE z p.values
#> (Intercept) 38.751787 1.671458 54.4680 < 2.2e-16 ***
#> cyl -0.941617 0.515335 -4.2927 1.765e-05 ***
#> hp -0.018038 0.011109 -3.8146 0.0001364 ***
#> wt -3.166973 0.692745 -10.7403 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 5.5194
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 177 on 28 degrees of freedom
#> AIC: 155
#> Algorithm converged in 1 iteration using Fisher's scoring
BranchGLM also includes some utility functions for binomial GLMs:

- Table() creates a confusion matrix based on the predicted classes and observed classes
- ROC() creates an ROC curve which can be plotted with plot()
- AUC() and Cindex() calculate the area under the ROC curve
- MultipleROCCurves() allows for the plotting of multiple ROC curves on the same plot

### Fitting logistic regression model for the ToothGrowth data
catData <- ToothGrowth

catFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")
Table(catFit)
#> Confusion matrix:
#> ----------------------
#> Predicted
#> OJ VC
#>
#> OJ 17 13
#> Observed
#> VC 7 23
#>
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy: 0.6667
#> Sensitivity: 0.7667
#> Specificity: 0.5667
#> PPV: 0.6389
catROC <- ROC(catFit)
plot(catROC, main = "ROC Curve", col = "indianred")
Cindex(catFit)
#> [1] 0.7127778
AUC(catFit)
#> [1] 0.7127778
### Showing ROC plots for logit, probit, and cloglog
probitFit <- BranchGLM(supp ~ ., data = catData, family = "binomial",
                       link = "probit")
cloglogFit <- BranchGLM(supp ~ ., data = catData, family = "binomial",
                        link = "cloglog")
MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit),
names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))
Each of these functions can also be used with predicted probabilities and observed classes rather than a fitted BranchGLM object.

preds <- predict(catFit)
Table(preds, catData$supp)
#> Confusion matrix:
#> ----------------------
#> Predicted
#> OJ VC
#>
#> OJ 17 13
#> Observed
#> VC 7 23
#>
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy: 0.6667
#> Sensitivity: 0.7667
#> Specificity: 0.5667
#> PPV: 0.6389
AUC(preds, catData$supp)
#> [1] 0.7127778
ROC(preds, catData$supp) |> plot(main = "ROC Curve", col = "deepskyblue")