library("brglm2")
library("detectseparation")
Daniel J. Eck (dje13@illinois.edu) sent the maize data set to Ioannis Kosmidis (ioannis.kosmidis@warwick.ac.uk) on 2021-09-03 03:54. In this note we attempt to reproduce the analysis that led to the results in Table 2 of the manuscript “Robust model-based estimation for binary outcomes in genomics studies” by Suyoung Park, Alexander E. Lipka, and Daniel J. Eck, focusing on the behaviour of brglm2.
The maize data set was delivered in the file "Combined_Final_Product.csv", which is a different file from the "Combined_Final_Product.xlsx" used in the technical report “Technical Report for Robust model-based estimation for binary outcomes in genomics studies”, which reproduces the results in the manuscript.
Here we assume that "Combined_Final_Product.csv" has exactly the same data as "Combined_Final_Product.xlsx".
corn_dat <- read.csv("Combined_Final_Product.csv")
We apply the same transformations as in the technical report.
names(corn_dat)[c(10, 8)] <- c("Kernel.color","Pop.structure")
Xind <- 11:ncol(corn_dat)
foo <- corn_dat[, c(10,8,Xind[-c(15,19,22,23,24)])]
foo$Pop.structure <- factor(foo$Pop.structure)
dat <- data.frame(foo)
dat[,-c(1,2)] <- scale(dat[,-c(1,2)]) #scale the data
Fitting the logistic regression model of the technical report to the maize data by maximum likelihood produces a warning about fitted probabilities on the boundary.
corn_fm <- Kernel.color ~ .
mod_ml <- glm(corn_fm , data = dat, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(mod_ml)
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk
##                 -8.434055                  1.464072                  1.218572
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified
##                  1.127600                 -3.246059                 -0.669141
##               S6_82170011               S6_82170814               S6_82170859
##               -114.427961                  3.810524               -120.885808
##               S6_82170897               S6_82170900               S6_82170957
##                 24.870237                 92.946991                 -3.740834
##               S6_82171038               S6_82174349               S6_82174376
##                 29.695344                 -5.258315                  4.830395
##               S6_82174378               S6_82176123               S6_82185767
##                  5.402744                  4.438092                106.503001
##               S6_82185973               S6_82186654               S6_82217770
##                -34.040694                  8.095297                -29.556014
##               S6_82217918               S6_82218018               S6_82218219
##                -74.037172                 13.625830                 -8.301568
##               S6_82243856
##                157.107143
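The mechanism behind such large estimates can be seen on a much smaller problem. The following is a minimal sketch on hypothetical toy data (the names toy, toy_ml and toy_warnings are made up for illustration and have nothing to do with the maize data): the response is 1 exactly when the covariate is positive, so the data are completely separated and glm() stops at a large, arbitrary slope.

```r
# Hypothetical toy data with complete separation: y = 1 exactly when x > 0
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c(0, 0, 0, 1, 1, 1))

# Collect the warnings raised during fitting instead of printing them
toy_warnings <- character(0)
toy_ml <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial()),
  warning = function(w) {
    toy_warnings <<- c(toy_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

toy_warnings          # messages raised by glm.fit
coef(toy_ml)          # the slope is large; its value depends on the stopping rule
range(fitted(toy_ml)) # fitted probabilities are pushed to the boundary
```

The reported slope is not an estimate of anything: it is simply where the iteration happened to stop on its way to infinity.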
A simple check for separation shows that data separation occurs and that several maximum likelihood estimates are infinite.
update(mod_ml, method = "detect_separation")
## Implementation: ROI | Solver: lpsolve
## Separation: TRUE
## Existence of maximum likelihood estimates
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk
##                      -Inf                         0                         0
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified
##                         0                         0                         0
##               S6_82170011               S6_82170814               S6_82170859
##                      -Inf                       Inf                      -Inf
##               S6_82170897               S6_82170900               S6_82170957
##                       Inf                       Inf                      -Inf
##               S6_82171038               S6_82174349               S6_82174376
##                       Inf                       Inf                      -Inf
##               S6_82174378               S6_82176123               S6_82185767
##                       Inf                       Inf                       Inf
##               S6_82185973               S6_82186654               S6_82217770
##                      -Inf                       Inf                      -Inf
##               S6_82217918               S6_82218018               S6_82218219
##                      -Inf                       Inf                      -Inf
##               S6_82243856
##                       Inf
## 0: finite value, Inf: infinity, -Inf: -infinity
A solution to the issues associated with infinite maximum likelihood estimates is to estimate the model using Firth’s bias-reducing adjusted score equations. An implementation is provided by the brglm2 R package.
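For reference, the adjusted score equations for logistic regression can be written out explicitly. The notation below is introduced here for illustration and does not appear elsewhere in this note: y_i are the binary responses, x_i the covariate vectors (rows of the model matrix X), and \pi_i the fitted probabilities.

```latex
\sum_{i=1}^{n} \left\{ y_i - \pi_i + h_i \left( \tfrac{1}{2} - \pi_i \right) \right\} x_i = 0,
\qquad
\pi_i = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)},
```

where h_i is the i-th diagonal element of the hat matrix

```latex
H = W^{1/2} X \left( X^\top W X \right)^{-1} X^\top W^{1/2},
\qquad
W = \mathrm{diag}\left\{ \pi_1 (1 - \pi_1), \ldots, \pi_n (1 - \pi_n) \right\}.
```

Setting every h_i to zero recovers the ordinary maximum likelihood score equations.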
Solving the adjusted score equations with the default optimization tuning parameters (see ?brglmFit) returns warnings about non-convergence and boundary fitted probabilities, and estimates that seem to diverge.
system.time(mod_br <- update(mod_ml, method = "brglm_fit"))
## Warning: brglmFit: algorithm did not converge
## Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
## user system elapsed
## 0.985 0.250 1.585
coef(mod_br)
##               (Intercept)      Pop.structurepopcorn  Pop.structurestiff stalk
##             -1.154774e+14              1.949092e+15              1.228451e+13
##   Pop.structuresweet corn     Pop.structuretropical Pop.structureunclassified
##              1.871982e+14              1.049747e+15              3.635279e+15
##               S6_82170011               S6_82170814               S6_82170859
##             -8.100700e+14             -1.115064e+14             -9.087434e+14
##               S6_82170897               S6_82170900               S6_82170957
##              2.611892e+14              8.989955e+14              2.045295e+13
##               S6_82171038               S6_82174349               S6_82174376
##             -2.486356e+14             -1.994465e+14             -6.212293e+12
##               S6_82174378               S6_82176123               S6_82185767
##              4.190308e+13              5.473174e+13              7.966795e+13
##               S6_82185973               S6_82186654               S6_82217770
##              1.112076e+14              2.786188e+13              2.639930e+14
##               S6_82217918               S6_82218018               S6_82218219
##             -1.265792e+14              2.427798e+14              1.276159e+14
##               S6_82243856
##              1.027951e+15
This should not be possible according to the theoretical results in Kosmidis & Firth (2021, http://doi.org/10.1093/biomet/asaa052), which show that the reduced-bias estimates (equivalently, the maximum penalized likelihood estimates with a Jeffreys-prior penalty) are always finite. So, most probably there have been issues with the default tuning of the brglm2 optimization procedure and/or the default starting values.
By default, brglm2 uses as starting values the maximum likelihood estimates after adding length(coef(mod_ml)) / nrow(dat) (here 0.0161603) to the binomial responses and totals.
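The idea of adjusted-response starting values can be sketched in base R. The first lines reproduce the default adjustment for the maize fit from the 25 coefficients and 1547 observations shown in the output of this note. The rest uses hypothetical toy data and ASSUMES that half of the adjustment is added to the successes and the full adjustment to the totals; that split is one reading of the sentence above, not something this note confirms, and the brglm2 internals may differ.

```r
# Default adjustment for the maize fit: number of coefficients / number of rows
p_maize <- 25
n_maize <- 1547
adj_maize <- p_maize / n_maize
round(adj_maize, 7)

# Hypothetical, completely separated toy data
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c(0, 0, 0, 1, 1, 1))
adj <- 2 / nrow(toy)  # two coefficients (intercept and slope), six rows

# ASSUMPTION: totals become 1 + adj and successes become y + adj / 2
toy$w_adj <- 1 + adj
toy$y_adj <- (toy$y + adj / 2) / toy$w_adj

# ML fit on the adjusted proportions; the non-integer counts trigger a
# harmless warning, which is suppressed here
toy_start <- suppressWarnings(
  glm(y_adj ~ x, data = toy, family = binomial(), weights = w_adj)
)
coef(toy_start)  # finite, because no adjusted proportion is exactly 0 or 1
```

A larger adjustment pulls the adjusted proportions further from 0 and 1, so the resulting starting values are shrunk more strongly towards zero.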
Alternative starting values can be obtained using response_adjustment, which sets an alternative value for what is added to the responses. For example, with a bit more shrinkage, we seem to overcome the issues that the defaults had with this particular data set.
system.time(mod_br <- update(mod_ml, method = "brglm_fit", response_adjustment = 0.05))
## user system elapsed
## 0.224 0.058 0.377
summary(mod_br)
##
## Call:
## glm(formula = corn_fm, family = "binomial", data = dat, method = "brglm_fit",
## response_adjustment = 0.05)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7437   0.2458   0.5625   0.5625   2.0372
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                2.18150    0.33775   6.459 1.05e-10 ***
## Pop.structurepopcorn       1.36934    0.94659   1.447 0.148009
## Pop.structurestiff stalk   1.11355    0.62814   1.773 0.076265 .
## Pop.structuresweet corn    0.73111    0.56448   1.295 0.195251
## Pop.structuretropical     -3.17578    0.38158  -8.323  < 2e-16 ***
## Pop.structureunclassified -0.60742    0.34966  -1.737 0.082355 .
## S6_82170011               -1.28662    0.46826  -2.748 0.006002 **
## S6_82170814               -0.07744    0.10025  -0.773 0.439807
## S6_82170859               -1.49768    0.57334  -2.612 0.008996 **
## S6_82170897                0.95283    0.26045   3.658 0.000254 ***
## S6_82170900                1.63691    0.53723   3.047 0.002312 **
## S6_82170957               -0.15587    0.11101  -1.404 0.160289
## S6_82171038                0.02878    0.31033   0.093 0.926121
## S6_82174349                0.01540    0.22208   0.069 0.944710
## S6_82174376               -0.04520    0.15534  -0.291 0.771082
## S6_82174378               -0.18211    0.16324  -1.116 0.264595
## S6_82176123               -0.02366    0.16301  -0.145 0.884610
## S6_82185767               -0.10369    0.23291  -0.445 0.656164
## S6_82185973                0.22220    0.19085   1.164 0.244311
## S6_82186654                0.45739    0.29001   1.577 0.114769
## S6_82217770               -0.38118    0.19966  -1.909 0.056241 .
## S6_82217918               -0.97163    0.33824  -2.873 0.004071 **
## S6_82218018               -0.06930    0.21735  -0.319 0.749863
## S6_82218219               -0.23440    0.15234  -1.539 0.123893
## S6_82243856                2.23227    0.89792   2.486 0.012917 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1547.1 on 1546 degrees of freedom
## Residual deviance: 1118.3 on 1522 degrees of freedom
## AIC: 1168.3
##
## Number of Fisher Scoring iterations: 32
The maximum absolute value of the adjusted score functions at the estimates shown above is 1.1954247 × 10^{-5}. This is close to zero, as it should be, and we can get closer by setting stricter stopping criteria.
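What such a check involves can be illustrated without brglm2. The sketch below is a bare-bones base R implementation of Firth's adjusted scores for logistic regression, solved with plain Fisher scoring on hypothetical toy data. The function firth_logistic and all object names are made up for this illustration; this is not the brglm2 algorithm, which uses quasi-Fisher scoring with additional safeguards.

```r
# Minimal Fisher scoring for Firth's adjusted score equations (logistic link).
# Illustration only: no step halving, no aliasing checks.
firth_logistic <- function(X, y, start = rep(0, ncol(X)),
                           epsilon = 1e-8, maxit = 500) {
  beta <- start
  for (iter in seq_len(maxit)) {
    prob <- plogis(drop(X %*% beta))
    w <- prob * (1 - prob)
    info <- crossprod(X * w, X)              # expected information X' W X
    XW <- X * sqrt(w)
    h <- rowSums((XW %*% solve(info)) * XW)  # hat values
    # adjusted scores: X' {y - prob + h * (1/2 - prob)}
    adj_score <- drop(crossprod(X, y - prob + h * (0.5 - prob)))
    if (max(abs(adj_score)) < epsilon) break
    beta <- beta + drop(solve(info, adj_score))
  }
  list(coefficients = beta, adjusted_score = adj_score, iterations = iter)
}

# Hypothetical, completely separated toy data
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c(0, 0, 0, 1, 1, 1))
toy_fit <- firth_logistic(cbind(1, toy$x), toy$y)

toy_fit$coefficients              # finite, despite complete separation
max(abs(toy_fit$adjusted_score))  # the quantity checked above, here for the toy fit
```

Lowering epsilon tightens the stopping criterion and drives the maximum absolute adjusted score closer to zero, at the cost of a few more iterations.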
Overall, the user can control all sorts of aspects of the optimization algorithm (quasi-Fisher scoring; see Kosmidis et al, 2020, https://doi.org/10.1007/s11222-019-09860-6 or the brglm2 vignettes), or even directly supply starting values using the start argument.
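As a rough sketch of what this can look like, the call below fits hypothetical toy data with user-supplied starting values and stricter stopping criteria. The toy data and the particular values of start, epsilon and maxit are illustrative choices, not recommendations, and the grad component is assumed here to hold the adjusted score functions at the estimates; consult ?brglmFit and ?brglmControl for the authoritative descriptions.

```r
library("brglm2")

# Hypothetical, completely separated toy data
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c(0, 0, 0, 1, 1, 1))

toy_br <- glm(y ~ x, data = toy, family = binomial(),
              method = "brglm_fit",
              start = c(0, 0.5),  # illustrative starting values (intercept, slope)
              epsilon = 1e-10,    # stricter tolerance than the default
              maxit = 500)        # allow more iterations than the default

coef(toy_br)
# ASSUMPTION: `grad` stores the adjusted score functions at the estimates
max(abs(toy_br$grad))
```

The same arguments can be passed through update(), in the same way as response_adjustment is passed in the call shown earlier in this note.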