
Additional Predictor with Maximum Effect Size

Introduction

This vignette of package maxEff (Github, RPubs) documents three methods of selecting, from many numeric candidates, one additional predictor for a regression model, such that the additional predictor has the maximum effect size. The three methods are implemented in the functions add_num(), add_dummy() and add_dummy_partition().

Note to Users

Examples in this vignette require that the search path has

library(maxEff)
library(groupedHyperframe)
library(survival)
library(rpart)

Users should remove the parameter mc.cores = 1L from all examples and use the default option, which engages all CPU cores on the current host on macOS. The authors must set mc.cores = 1L in this vignette in order to pass CRAN's submission checks.

Terms and Abbreviations

Term / Abbreviation Description Reference
Forward pipe operator ?base::pipeOp introduced in R 4.1.0
abs Absolute value base::abs
coxph Cox model survival::coxph
CRAN, R The Comprehensive R Archive Network https://cran.r-project.org
factor Factor, or categorical variable base::factor
function R function base::`function`
groupedHyperframe Grouped hyper data frame groupedHyperframe::as.groupedHyperframe
head First parts of an object utils::head; utils:::head.default
hypercolumns, hyperframe (Hyper columns of) hyper data frame spatstat.geom::hyperframe
labels Labels from object base::labels; maxEff::labels.node1
levels Levels of a factor base::levels
listof List of objects stats::listof
logistic Logistic regression model stats::glm(., family = binomial('logit'))
matrix Matrix base::matrix
partition Stratified partition maxEff::statusPartition, caret::createDataPartition
PFS Progression/recurrence free survival https://en.wikipedia.org/wiki/Progression-free_survival
predict Model prediction stats::predict; maxEff::predict.add_num; maxEff::predict.add_dummy
quantile Quantile stats::quantile
rpart Recursive partitioning and regression trees rpart::rpart
S3, generic, methods S3 object oriented system base::UseMethod; utils::methods; utils::getS3method; https://adv-r.hadley.nz/s3.html
sort_by Sort an object by some criterion base::sort_by; maxEff::sort_by.add_
subset Subsets of object by conditions base::subset; maxEff::subset.add_dummy
Surv Survival object survival::Surv
update Update and re-fit a model call stats::update

Acknowledgement

This work is supported by NCI R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui) and R01CA253977 (H. Rui and I. Chervoneva).

Handy Tools

node1(): Dichotomization via First Node of Recursive Partitioning

Consider the following recursive partitioning and regression example,

data(cu.summary, package = 'rpart')
(r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L))
#> n=60 (57 observations deleted due to missingness)
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#> 1) root 60 983551500 12615.67  
#>   2) Mileage>=24.5 25 206417200  9730.16 *
#>   3) Mileage< 24.5 35 420299300 14676.74 *

Function node1() creates a dichotomizing rule, i.e., a function, based on the first node of partitioning,

(foo = r |> node1())
#> 
#> Dichotomizing Rule based on Recursive Partitioning:
#> 
#> function (newx) 
#> {
#>     ret <- (newx >= 24.5)
#>     if (all(ret, na.rm = TRUE) || !any(ret, na.rm = TRUE)) 
#>         warning("Dichotomized value is all-0 or all-1")
#>     return(ret)
#> }
#> <environment: 0x125350040>

The dichotomizing rule foo converts a numeric object to a logical object.

set.seed(125); rnorm(6, mean = 24.5) |> foo()
#> [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE

Developers may obtain the numeric cutoff value of foo, or a brief text describing foo, for further analysis.

foo |> get_cutoff()
#> [1] 24.5
foo |> labels()
#> [1] ">=24.5"
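The rule returned by node1() is essentially a closure that captures the cutoff from the first split. As a minimal base-R sketch (the function make_rule below is a hypothetical illustration, not part of maxEff):

```r
# Hypothetical sketch of a dichotomizing closure; `make_rule` is
# an illustration, not part of package maxEff.
make_rule = function(cutoff) {
  function(newx) {
    ret = (newx >= cutoff)
    if (all(ret, na.rm = TRUE) || !any(ret, na.rm = TRUE))
      warning('Dichotomized value is all-0 or all-1')
    ret
  }
}
bar = make_rule(24.5)
bar(c(20, 24.5, 30)) # FALSE TRUE TRUE
```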

statusPartition(): Stratified Partition by Survival Status

Function statusPartition() performs a stratified partition based on the survival status of a Surv response, to avoid the situation where a Cox model becomes degenerate because all subjects in a partition are censored. This deviates from the very popular function caret::createDataPartition(), which stratifies a Surv response by the quantiles of the survival time.

y = (survival::capacitor) |> 
  with(expr = Surv(time, status))
set.seed(15); id = y |>
  statusPartition(times = 1L, p = .5)
table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status
#> 
#>   0   1 
#> 0.5 0.5
set.seed(15); id0 = y |>
  caret::createDataPartition(times = 1L, p = .5)
table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status
#> 
#>       0       1 
#> 0.46875 0.56250
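The idea behind status-stratified sampling can be sketched in base R: sample the requested proportion within each status stratum separately. The function status_split below is a hypothetical stand-in, not the implementation of maxEff::statusPartition:

```r
# Sketch of status-stratified sampling; `status_split` is a
# hypothetical stand-in for maxEff::statusPartition.
status_split = function(status, p = .5) {
  idx = split(seq_along(status), status)  # indices per status stratum
  unlist(lapply(idx, function(i) sample(i, size = floor(length(i) * p))))
}
set.seed(1)
status = rep(c(0L, 1L), times = c(60L, 40L))
id = status_split(status, p = .5)
table(status[id]) / table(status)  # each stratum sampled at proportion p
```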

Data Preparation

Data set Ki67 in package groupedHyperframe is a groupedHyperframe.

data(Ki67, package = 'groupedHyperframe')
Ki67
#> Grouped Hyperframe: ~patientID/tissueID
#> 
#> 645 tissueID nested in
#> 622 patientID
#> 
#>      logKi67 tissueID Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1  (numeric) TJUe_I17      2 100+             100          0   FALSE     FALSE
#> 2  (numeric) TJUe_G17      1   22              22          1   FALSE     FALSE
#> 3  (numeric) TJUe_F17      1  99+              99          0   FALSE        NA
#> 4  (numeric) TJUe_D17      1  99+              99          0   FALSE      TRUE
#> 5  (numeric) TJUe_J18      1  112             112          1    TRUE      TRUE
#> 6  (numeric) TJUe_N17      4   12              12          1    TRUE     FALSE
#> 7  (numeric) TJUe_J17      2  64+              64          0   FALSE     FALSE
#> 8  (numeric) TJUe_F19      2  56+              56          0   FALSE     FALSE
#> 9  (numeric) TJUe_P19      2  79+              79          0   FALSE     FALSE
#> 10 (numeric) TJUe_O19      2   26              26          1   FALSE      TRUE
#>    histology  Her2   HR  node  race age patientID
#> 1          3  TRUE TRUE  TRUE White  66   PT00037
#> 2          3 FALSE TRUE FALSE Black  42   PT00039
#> 3          3 FALSE TRUE FALSE White  60   PT00040
#> 4          3 FALSE TRUE  TRUE White  53   PT00042
#> 5          3 FALSE TRUE  TRUE White  52   PT00054
#> 6          2  TRUE TRUE  TRUE Black  51   PT00059
#> 7          3 FALSE TRUE  TRUE Asian  50   PT00062
#> 8          2  TRUE TRUE  TRUE White  37   PT00068
#> 9          3  TRUE TRUE FALSE White  68   PT00082
#> 10         2  TRUE TRUE FALSE Black  55   PT00084
s = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
#> Column(s) tissueID removed; as they are not identical per aggregation-group

Candidates for the additional predictor are stored in the numeric-hypercolumn logKi67.quantile. Users are encouraged to learn more about the groupedHyperframe class and the function aggregate_quantile() from the package groupedHyperframe vignette.
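Conceptually, this aggregation computes, for each aggregation group, the quantiles of the numeric values pooled within that group. A base-R analogue under that assumption (a conceptual sketch, not the groupedHyperframe implementation):

```r
# Base-R analogue of per-group quantile aggregation; a conceptual
# sketch, not the implementation in package groupedHyperframe.
set.seed(42)
d = data.frame(patient = rep(c('A', 'B'), each = 50L),
               logKi67 = rnorm(100L))
q = t(sapply(split(d$logKi67, d$patient),
             quantile, probs = seq.int(from = .1, to = .9, by = .1)))
dim(q) # one row per patient, one column per quantile probability
```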

Partition into a training (80%) and test (20%) set.

set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int()
s0 = s[id, , drop = FALSE] # training set
s1 = s[-id, , drop = FALSE] # test set

Let’s consider a starting model of endpoint PFS with predictor Tstage on the training set s0.

summary(m <- coxph(PFS ~ Tstage, data = s0))
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data): 1 variable discarded in conversion
#> to data frame
#> Call:
#> coxph(formula = PFS ~ Tstage, data = s0)
#> 
#>   n= 497, number of events= 90 
#> 
#>          coef exp(coef) se(coef)     z Pr(>|z|)    
#> Tstage 0.7162    2.0466   0.1020 7.019 2.23e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>        exp(coef) exp(-coef) lower .95 upper .95
#> Tstage     2.047     0.4886     1.676       2.5
#> 
#> Concordance= 0.679  (se = 0.028 )
#> Likelihood ratio test= 39.61  on 1 df,   p=3e-10
#> Wald test            = 49.27  on 1 df,   p=2e-12
#> Score (logrank) test = 54.96  on 1 df,   p=1e-13

Additional numeric Predictor with Maximum Effect Size

Function add_num() treats each additional predictor as a numeric variable, and updates the starting model with each additional predictor. Function add_num() returns an 'add_num' object, which is a listof objects with an internal class 'add_num_'.

The S3 generic sort_by() sorts the 'add_num' object by the absolute value of the regression coefficient (i.e., effect size) of the additional numeric predictor.

The S3 generic head() chooses the top n elements from the object returned by the previous step.

set.seed(14837); m1 = m |>
  add_num(x = ~ logKi67.quantile, mc.cores = 1L) |>
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m1
#> .slice(logKi67.quantile, "0.99")
#> .slice(logKi67.quantile, "0.98")

The S3 generic predict() applies the candidate models in m1 to the test set s1.

m1[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.99") :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#> 
#>          coef exp(coef) se(coef)     z      p
#> Tstage 0.4068    1.5020   0.2359 1.724 0.0847
#> x.     0.1114    1.1179   0.1926 0.579 0.5629
#> 
#> Likelihood ratio test=3.21  on 2 df, p=0.2008
#> n= 125, number of events= 28

Additional logical Predictor with Maximum Effect Size

add_dummy(): Naive Practice

Function add_dummy() dichotomizes each additional numeric predictor into a logical variable using function node1(), and updates the starting model by adding each of the dichotomized logical predictors. Function add_dummy() returns an 'add_dummy' object, which is a listof node1 objects.

The S3 generic subset() subsets the 'add_dummy' object by the balance of the partition of the additional predictor.

The S3 generic sort_by() sorts the 'add_dummy' object by the absolute value of regression coefficient (i.e., effect size) of the additional logical predictor.

The S3 generic head() chooses the top n elements from the object returned by the previous step.

set.seed(14837); m2 = m |>
  add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m2
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946
#> .slice(logKi67.quantile, "0.01")>=6.02789726883965

The S3 generic predict() applies the candidate models in m2 to the test set s1.

m2[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#> 
#>          coef exp(coef) se(coef)     z      p
#> Tstage 0.3644    1.4397   0.2402 1.517 0.1292
#> x.TRUE 0.6600    1.9348   0.3841 1.718 0.0857
#> 
#> Likelihood ratio test=5.83  on 2 df, p=0.05415
#> n= 125, number of events= 28

add_dummy_partition(): via Repeated Partitions

Function add_dummy_partition() partitions each additional numeric predictor into a logical variable in the following steps.

  1. Generate multiple, i.e., repeated, partitions.
  2. For each partition, create a dichotomizing rule (via function node1()) on the training set. Apply this dichotomizing rule to the test set and obtain the estimated regression coefficient (i.e., effect size) of the additional logical predictor.
  3. Among all partitions, select the one with median effect size of the additional logical predictor.
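Under simplifying stand-ins (a median cutoff in place of node1(), and a difference in group means, i.e., the regression coefficient of the dummy in a simple linear model, in place of the Cox model coefficient), the three steps above can be sketched as:

```r
# Illustrative sketch of the three steps; the median cutoff and mean
# difference are stand-ins for node1() and the Cox coefficient.
set.seed(7)
x = rnorm(100L); y = x + rnorm(100L)
effsize = replicate(20L, {               # step 1: repeated partitions
  i = sample.int(n = 100L, size = 80L)   # training indices
  cutoff = median(x[i])                  # step 2: rule from the training set
  d = (x[-i] >= cutoff)                  # apply rule to the test set
  mean(y[-i][d]) - mean(y[-i][!d])       # dummy's regression coefficient
})
effsize[order(abs(effsize))[10L]]        # step 3: (near-)median effect size
```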

Function add_dummy_partition() also returns an 'add_dummy' object.

set.seed(14837); m3 = m |> 
  add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
m3
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504
#> .slice(logKi67.quantile, "0.26")>=6.38090570405786
m3[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#> 
#>          coef exp(coef) se(coef)     z      p
#> Tstage 0.3912    1.4788   0.2348 1.666 0.0956
#> x.TRUE 0.3783    1.4598   0.3918 0.966 0.3343
#> 
#> Likelihood ratio test=3.79  on 2 df, p=0.1505
#> n= 125, number of events= 28
