Necessary packages

library("brglm2")
library("detectseparation")

Introduction

Daniel J. Eck sent the maize data set to Ioannis Kosmidis on 2021-09-03 03:54. In this note we attempt to reproduce the analysis that led to the results in Table 2 of the manuscript “Robust model-based estimation for binary outcomes in genomics studies” by Suyoung Park, Alexander E. Lipka, and Daniel J. Eck, focusing on the behaviour of brglm2.

Maize data

The maize data set was delivered in the file "Combined_Final_Product.csv". This is a different file from "Combined_Final_Product.xlsx", which is the one used in the technical report “Technical Report for Robust model-based estimation for binary outcomes in genomics studies” that reproduces the results in the manuscript.

Here we assume that "Combined_Final_Product.csv" has exactly the same data as "Combined_Final_Product.xlsx"

corn_dat <- read.csv("Combined_Final_Product.csv")

and we apply the same transformations as in the technical report

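# Rename the response (column 10) and the population-structure variable (column 8)
names(corn_dat)[c(10, 8)] <- c("Kernel.color", "Pop.structure")
# Candidate covariates start at column 11; five of them are dropped by position
Xind <- 11:ncol(corn_dat)
foo <- corn_dat[, c(10, 8, Xind[-c(15, 19, 22, 23, 24)])]
foo$Pop.structure <- factor(foo$Pop.structure)
dat <- data.frame(foo)
# Centre and scale every column except the response and the factor
dat[, -c(1, 2)] <- scale(dat[, -c(1, 2)])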
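As a quick sanity check of what the scaling step does, the following sketch applies the same scale() idiom to a small made-up data frame. The object toy and its columns y, g, s1 and s2 are hypothetical and are not part of the maize data. After the assignment, every column other than the first two should have mean zero and unit standard deviation.

```r
# Made-up data frame: a response, a factor, and two numeric covariates
toy <- data.frame(y  = c(0, 1, 1, 0),
                  g  = factor(c("a", "b", "a", "b")),
                  s1 = c(0, 1, 2, 1),
                  s2 = c(2, 2, 0, 1))

# Same idiom as above: centre and scale all columns except the first two
toy[, -c(1, 2)] <- scale(toy[, -c(1, 2)])

# Column means and standard deviations of the scaled covariates
toy_means <- colMeans(toy[, -c(1, 2)])
toy_sds   <- apply(toy[, -c(1, 2)], 2, sd)
toy_means
toy_sds
```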

Fitting

Maximum likelihood

Fitting the logistic regression model in the technical report on the maize data using maximum likelihood reports warnings about fitted probabilities on the boundary

corn_fm <- Kernel.color ~ .
mod_ml <- glm(corn_fm , data = dat, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(mod_ml)
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk 
##                 -8.434055                  1.464072                  1.218572 
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified 
##                  1.127600                 -3.246059                 -0.669141 
##               S6_82170011               S6_82170814               S6_82170859 
##               -114.427961                  3.810524               -120.885808 
##               S6_82170897               S6_82170900               S6_82170957 
##                 24.870237                 92.946991                 -3.740834 
##               S6_82171038               S6_82174349               S6_82174376 
##                 29.695344                 -5.258315                  4.830395 
##               S6_82174378               S6_82176123               S6_82185767 
##                  5.402744                  4.438092                106.503001 
##               S6_82185973               S6_82186654               S6_82217770 
##                -34.040694                  8.095297                -29.556014 
##               S6_82217918               S6_82218018               S6_82218219 
##                -74.037172                 13.625830                 -8.301568 
##               S6_82243856 
##                157.107143

A simple check for separation shows that data separation occurs and that several maximum likelihood estimates are infinite.

update(mod_ml, method = "detect_separation")
## Implementation: ROI | Solver: lpsolve 
## Separation: TRUE 
## Existence of maximum likelihood estimates
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk 
##                      -Inf                         0                         0 
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified 
##                         0                         0                         0 
##               S6_82170011               S6_82170814               S6_82170859 
##                      -Inf                       Inf                      -Inf 
##               S6_82170897               S6_82170900               S6_82170957 
##                       Inf                       Inf                      -Inf 
##               S6_82171038               S6_82174349               S6_82174376 
##                       Inf                       Inf                      -Inf 
##               S6_82174378               S6_82176123               S6_82185767 
##                       Inf                       Inf                       Inf 
##               S6_82185973               S6_82186654               S6_82217770 
##                      -Inf                       Inf                      -Inf 
##               S6_82217918               S6_82218018               S6_82218219 
##                      -Inf                       Inf                      -Inf 
##               S6_82243856 
##                       Inf 
## 0: finite value, Inf: infinity, -Inf: -infinity
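To see why separation leads to infinite estimates, here is a minimal base-R sketch on made-up data, where a single covariate separates the two outcomes completely. The data frame toy_sep and the helper fit_slope are hypothetical names introduced only for this illustration. The slope returned by glm() is not a genuine estimate: it simply keeps growing as more iterations are allowed.

```r
# Made-up data (not the maize data): y is 0 for every negative x and
# 1 for every positive x, so x separates the two outcomes completely.
toy_sep <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                      y = c(0, 0, 0, 1, 1, 1))

# Hypothetical helper: fit by maximum likelihood with a given iteration cap
# and return the slope together with any warnings raised by glm().
fit_slope <- function(maxit) {
  msgs <- character(0)
  fit <- withCallingHandlers(
    glm(y ~ x, family = binomial(), data = toy_sep,
        control = glm.control(maxit = maxit)),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(slope = unname(coef(fit)["x"]), warnings = msgs)
}

short_run <- fit_slope(maxit = 5)
long_run  <- fit_slope(maxit = 25)

# The slope grows with the number of iterations allowed
c(short = short_run$slope, long = long_run$slope)
# Warnings collected from the longer run
long_run$warnings
```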

Firth’s bias-reducing adjusted score equations

A solution to the issues associated with infinite maximum likelihood estimates is to estimate the model using Firth’s bias-reducing adjusted score equations. The brglm2 R package provides an implementation of this approach.

Solving the adjusted score equations with the default optimization tuning parameters (see ?brglmFit) produces warnings about non-convergence and about fitted probabilities on the boundary, and returns estimates that seem to diverge.

system.time(mod_br <- update(mod_ml, method = "brglm_fit"))
## Warning: brglmFit: algorithm did not converge
## Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
##    user  system elapsed 
##   0.985   0.250   1.585
coef(mod_br)
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk 
##             -1.154774e+14              1.949092e+15              1.228451e+13 
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified 
##              1.871982e+14              1.049747e+15              3.635279e+15 
##               S6_82170011               S6_82170814               S6_82170859 
##             -8.100700e+14             -1.115064e+14             -9.087434e+14 
##               S6_82170897               S6_82170900               S6_82170957 
##              2.611892e+14              8.989955e+14              2.045295e+13 
##               S6_82171038               S6_82174349               S6_82174376 
##             -2.486356e+14             -1.994465e+14             -6.212293e+12 
##               S6_82174378               S6_82176123               S6_82185767 
##              4.190308e+13              5.473174e+13              7.966795e+13 
##               S6_82185973               S6_82186654               S6_82217770 
##              1.112076e+14              2.786188e+13              2.639930e+14 
##               S6_82217918               S6_82218018               S6_82218219 
##             -1.265792e+14              2.427798e+14              1.276159e+14 
##               S6_82243856 
##              1.027951e+15

This should not be possible according to the theoretical results in Kosmidis & Firth (2021, http://doi.org/10.1093/biomet/asaa052), which show that the reduced-bias estimates (which are equivalent to maximum penalized likelihood estimates with a Jeffreys-prior penalty) are always finite. So, there have most probably been issues with the default tuning of the brglm2 optimization procedure and/or the default starting values.

By default, brglm2 uses as starting values the maximum likelihood estimates after adding length(coef(mod_ml)) / nrow(dat) (here 0.0161603) to the binomial responses and totals.
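The default adjustment quoted above can be checked by hand, and its effect can be illustrated on made-up separated data. The sketch below is not brglm2's internal code. It assumes, purely for illustration, that an adjustment a adds a / 2 to each binary response and a to each binomial total; the exact split used by brglm2 may differ. The counts 25 and 1547 are read off the output in this note, and toy_sep and start_from_adjusted are hypothetical names.

```r
# Default adjustment: number of coefficients / number of observations
n_coef <- 25      # coefficients listed by coef(mod_ml) in this note
n_obs  <- 1547    # null deviance has 1546 degrees of freedom
default_adjustment <- n_coef / n_obs
round(default_adjustment, 7)

# Made-up, completely separated data (not the maize data)
toy_sep <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                      y = c(0, 0, 0, 1, 1, 1))

# ASSUMPTION (illustration only): add a / 2 to each response and a to each
# total, then fit by maximum likelihood to obtain finite starting values.
start_from_adjusted <- function(a) {
  y_adj <- (toy_sep$y + a / 2) / (1 + a)
  fit <- suppressWarnings(   # glm() warns about non-integer successes
    glm(y_adj ~ x, family = binomial(), data = toy_sep,
        weights = rep(1 + a, nrow(toy_sep)))
  )
  coef(fit)
}

start_default <- start_from_adjusted(default_adjustment)
start_larger  <- start_from_adjusted(0.05)

# A larger adjustment gives starting values that are shrunk more
rbind(default = start_default, larger = start_larger)
```

Under this assumed convention, both sets of starting values are finite even though the unadjusted maximum likelihood estimates are not, and the larger adjustment pulls the slope closer to zero.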

Alternative starting values can be obtained using response_adjustment, which sets the value that is added to the responses. For example, with a slightly larger adjustment, and hence a bit more shrinkage in the starting values, we seem to overcome the issues that the defaults had with this particular data set.

system.time(mod_br <- update(mod_ml, method = "brglm_fit", response_adjustment = 0.05))
##    user  system elapsed 
##   0.224   0.058   0.377
summary(mod_br)
## 
## Call:
## glm(formula = corn_fm, family = "binomial", data = dat, method = "brglm_fit", 
##     response_adjustment = 0.05)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7437   0.2458   0.5625   0.5625   2.0372  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                2.18150    0.33775   6.459 1.05e-10 ***
## Pop.structurepopcorn       1.36934    0.94659   1.447 0.148009    
## Pop.structurestiff stalk   1.11355    0.62814   1.773 0.076265 .  
## Pop.structuresweet corn    0.73111    0.56448   1.295 0.195251    
## Pop.structuretropical     -3.17578    0.38158  -8.323  < 2e-16 ***
## Pop.structureunclassified -0.60742    0.34966  -1.737 0.082355 .  
## S6_82170011               -1.28662    0.46826  -2.748 0.006002 ** 
## S6_82170814               -0.07744    0.10025  -0.773 0.439807    
## S6_82170859               -1.49768    0.57334  -2.612 0.008996 ** 
## S6_82170897                0.95283    0.26045   3.658 0.000254 ***
## S6_82170900                1.63691    0.53723   3.047 0.002312 ** 
## S6_82170957               -0.15587    0.11101  -1.404 0.160289    
## S6_82171038                0.02878    0.31033   0.093 0.926121    
## S6_82174349                0.01540    0.22208   0.069 0.944710    
## S6_82174376               -0.04520    0.15534  -0.291 0.771082    
## S6_82174378               -0.18211    0.16324  -1.116 0.264595    
## S6_82176123               -0.02366    0.16301  -0.145 0.884610    
## S6_82185767               -0.10369    0.23291  -0.445 0.656164    
## S6_82185973                0.22220    0.19085   1.164 0.244311    
## S6_82186654                0.45739    0.29001   1.577 0.114769    
## S6_82217770               -0.38118    0.19966  -1.909 0.056241 .  
## S6_82217918               -0.97163    0.33824  -2.873 0.004071 ** 
## S6_82218018               -0.06930    0.21735  -0.319 0.749863    
## S6_82218219               -0.23440    0.15234  -1.539 0.123893    
## S6_82243856                2.23227    0.89792   2.486 0.012917 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1547.1  on 1546  degrees of freedom
## Residual deviance: 1118.3  on 1522  degrees of freedom
## AIC: 1168.3
## 
## Number of Fisher Scoring iterations: 32

The maximum absolute value of the adjusted score functions at the estimates shown above is 1.1954247e-05. This is close to zero, as it should be, and we can get closer by setting stricter stopping criteria.

Overall, the user can control many aspects of the optimization algorithm (quasi-Fisher scoring; see Kosmidis et al., 2020, https://doi.org/10.1007/s11222-019-09860-6 or the brglm2 vignettes), or even supply starting values directly using the start argument.
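To make the roles of the adjusted score, the stopping criterion, and the starting values concrete without the maize data, the following base-R sketch runs a plain quasi-Fisher scoring iteration for logistic regression with Firth's adjustment on made-up, completely separated data. It is not the brglm2 implementation (for instance, it has no step halving). The functions adjusted_score and fit_firth_sketch, and the arguments start, epsilon and maxit, are hypothetical names chosen only to mirror the concepts discussed above.

```r
# Made-up, completely separated data (not the maize data)
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)
X <- cbind("(Intercept)" = 1, x = x)

# Adjusted score for logistic regression under Firth's adjustment:
# X' (y - p + h * (1/2 - p)), where h holds the hat values.
adjusted_score <- function(beta, X, y) {
  p <- plogis(drop(X %*% beta))
  w <- p * (1 - p)
  info_inv <- solve(crossprod(X * sqrt(w)))   # (X' W X)^{-1}
  h <- w * rowSums((X %*% info_inv) * X)      # hat values
  list(score = drop(crossprod(X, y - p + h * (0.5 - p))),
       info_inv = info_inv)
}

# Hypothetical helper: iterate until the largest absolute adjusted score
# falls below epsilon, or until maxit iterations have been used.
fit_firth_sketch <- function(X, y, start = rep(0, ncol(X)),
                             epsilon = 1e-06, maxit = 100) {
  beta <- start
  for (iter in seq_len(maxit)) {
    s <- adjusted_score(beta, X, y)
    if (max(abs(s$score)) < epsilon) break
    beta <- beta + drop(s$info_inv %*% s$score)
  }
  list(coefficients = setNames(beta, colnames(X)),
       grad = adjusted_score(beta, X, y)$score,
       iter = iter)
}

# Default tolerance, then a stricter one started from the first solution
loose  <- fit_firth_sketch(X, y)
strict <- fit_firth_sketch(X, y, start = loose$coefficients, epsilon = 1e-10)

loose$coefficients       # finite despite complete separation
max(abs(loose$grad))     # largest absolute adjusted score, default tolerance
max(abs(strict$grad))    # largest absolute adjusted score, stricter tolerance
```

In this sketch, tightening epsilon plays the same role as setting stricter stopping criteria in the fit above, and passing start corresponds to supplying starting values directly.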