survivALL Rationale

Dominic Pearce

2017-08-11

Libraries

library(survivALL)
library(Biobase)
library(ggplot2)
library(magrittr)
library(viridis)
library(survival)
library(survcomp)
library(pander) # provides pandoc.table(), used to print the tables below

 

Rationale

Survival analysis typically separates a patient cohort by some measure and compares the relative survival between groups. In practice, for a continuous measure, the decision of where to draw this division relies on previous knowledge or, more problematically, it may simply be a researcher's best guess, for example separating around the median. Below we outline three alternative survival analysis approaches: median, hypothesis-driven and our own data-driven approach, implemented in the survivALL package.

 

Median dichotomisation

We begin with an ExpressionSet, nki_subset, which contains both gene expression and survival information (i.e. event and time-to-event data). nki_subset includes 319 invasive breast cancer samples of no specific subtype, each with complete survival and gene expression information. The event we will be measuring is distant metastasis-free survival (dmfs).

 

It is important to ensure that there are no NA values in our survival information. In our example data this has already been accounted for.
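For a data set where this has not been done, a check along the following lines could be used. This is a minimal sketch on an invented toy table: `srv_toy` and its values are hypothetical, and only the column names `e.dmfs` and `t.dmfs` are borrowed from the example data.

```r
# Toy survival table; column names mirror the vignette, values are invented
srv_toy <- data.frame(
  samplename = c("S1", "S2", "S3", "S4"),
  e.dmfs     = c(0, 1, NA, 0),
  t.dmfs     = c(4747, 1200, 3703, NA),
  stringsAsFactors = FALSE
)

# Flag rows where either the event or the time-to-event is missing
has_na <- !complete.cases(srv_toy[, c("e.dmfs", "t.dmfs")])
srv_toy$samplename[has_na]

# Keep only complete rows before any survival modelling
srv_clean <- srv_toy[!has_na, ]
nrow(srv_clean)
```

The same idea would apply to pData(nki_subset), where it should flag no rows.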

 

data(nki_subset)
pData(nki_subset)[1:3, ]
Survival Information
        samplename  age  grade  e.dmfs  t.dmfs
NKI_4   NKI_4       41   3      0       4747
NKI_6   NKI_6       49   2      0       4075
NKI_7   NKI_7       46   1      0       3703

 

We have our complete survival data with, amongst other variables, the three critical components of our analysis, namely sample names, events (e.dmfs) and times-to-event (t.dmfs).

 

To determine a variable's prognostic capacity we can apply the Kaplan-Meier estimator and plot the result. Here we will split our cohort using the expression of ERBB2, a gene whose increased expression is known to be associated with poor prognosis in a mixed population of invasive breast cancers.

 

We will stratify the cohort into high and low ERBB2 expression using median ERBB2 expression…

 

erbb2_xpr <- exprs(nki_subset)["NM_004448",] #ERBB2 expression vector
erbb2_med <- ifelse(erbb2_xpr >= median(erbb2_xpr), "high", "low") #convert to binary classifier

 

…and produce our Kaplan-Meier plot and statistics

 

srv_obj <- survival::Surv(nki_subset$t.dmfs, nki_subset$e.dmfs)

broom::tidy(survival::coxph(srv_obj ~ erbb2_med)) %>% pandoc.table()
term estimate std.error statistic p.value conf.low conf.high
erbb2_medlow -1.434e-05 0.1917 -7.482e-05 0.9999 -0.3758 0.3757

median_fit <- survival::survfit(srv_obj ~ erbb2_med)
GGally::ggsurv(median_fit) + ggtitle("ERBB2 median")

 

Surprisingly, even though ERBB2 is known to be highly prognostic in invasive breast cancer, the median split shows no association between ERBB2 expression and prognosis (p = 0.9999).

 

Hypothesis-driven dichotomisation

However, knowing that ERBB2 overexpression occurs in roughly 20% of invasive breast cancer cases, we can modify our cohort stratification to reflect this by labelling only the top 20% of samples as high.
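To make the effect of such a quantile cut-off concrete, the sketch below applies it to simulated values rather than to the NKI data. The vector `x` is a hypothetical stand-in for an expression vector of 319 samples.

```r
set.seed(1)                # reproducible toy data
x <- rnorm(319)            # hypothetical stand-in for an expression vector

cut_80 <- quantile(x, probs = 0.8)             # 80th percentile of the measure
grp    <- ifelse(x >= cut_80, "high", "low")   # top fifth labelled "high"

table(grp)                 # group sizes
mean(grp == "high")        # fraction labelled "high", close to 0.2
```

By comparison, a median cut-off places about half of the samples in each group regardless of how common overexpression actually is.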

 

erbb2_hypothesis <- ifelse(erbb2_xpr >= quantile(erbb2_xpr, probs = 0.8), "high", "low") 

hypothesis_fit <- survival::survfit(srv_obj ~ erbb2_hypothesis)
GGally::ggsurv(hypothesis_fit) + ggtitle("ERBB2 hypothesis-driven")

broom::tidy(survival::coxph(srv_obj ~ erbb2_hypothesis)) %>% pandoc.table()
term                 estimate  std.error  statistic  p.value  conf.low  conf.high
erbb2_hypothesislow  -0.4484   0.222      -2.02      0.04343  -0.8836   -0.01323

 

We now observe a significant survival difference based on ERBB2 expression (p = 0.043). The two approaches can be compared in a forest plot of the calculated hazard ratios.

 

forest_dfr <- rbind(
                    data.frame(survcomp::hazard.ratio(erbb2_med, nki_subset$t.dmfs, nki_subset$e.dmfs)[1:6]),
                    data.frame(survcomp::hazard.ratio(erbb2_hypothesis, nki_subset$t.dmfs, nki_subset$e.dmfs)[1:6])
                    )
forest_dfr$stratification <- c("median", "hypothesis\ndriven")

ggplot(forest_dfr, aes(x = stratification)) + 
    geom_hline(yintercept = 1, linetype = 3) + 
    geom_linerange(aes(ymin = lower, ymax = upper)) + 
    geom_point(aes(y = hazard.ratio, colour = p.value)) + 
    scale_color_viridis(direction = -1, breaks = c(0.05, 0.5, 1), limits = c(0, 1)) +
    coord_flip()
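As a numeric counterpart to the forest plot, the Cox coefficients printed in the two coxph() tables above can be moved from the log-hazard scale to the hazard-ratio scale by exponentiating them. The sketch below copies those values by hand; the data frame `cox_tab` and its column names are introduced here for illustration only.

```r
# Coefficients copied from the two coxph() tables above
# (log-hazard scale, "low" relative to "high")
cox_tab <- data.frame(
  stratification = c("median", "hypothesis-driven"),
  estimate  = c(-1.434e-05, -0.4484),
  conf.low  = c(-0.3758, -0.8836),
  conf.high = c(0.3757, -0.01323)
)

# Hazard ratio of low vs high, with its confidence interval
cox_tab$hr_low_vs_high <- exp(cox_tab$estimate)
cox_tab$hr_lower       <- exp(cox_tab$conf.low)
cox_tab$hr_upper       <- exp(cox_tab$conf.high)

# Flip the sign to read the same effect as high vs low
cox_tab$hr_high_vs_low <- exp(-cox_tab$estimate)

print(cox_tab, digits = 3)
```

Read this way, the median split gives a hazard ratio of essentially 1, whereas the hypothesis-driven split gives about 0.64 for low versus high, or equivalently about 1.57 for high versus low.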

 

Data-driven dichotomisation

It may still be, however, that our cohort’s composition in terms of ERBB2 expression does not exactly mirror population levels - i.e. that the best point of separation is not found with an 80-20 split. This will be even more problematic for a novel biomarker for which population levels may be unknown.

 

survivALL offers a solution to this difficulty. Instead of selecting a single point at which to stratify our cohort, we calculate the significance and hazard ratio for all possible points of separation and select the best after the fact.
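The general idea can be sketched with the survival package alone. What follows is an illustration on simulated data under stated assumptions, not the survivALL implementation: the variables `measure`, `risk` and `scan_all` are hypothetical, the simulated truth places the extra risk in the top 15% of samples, and no correction is made for testing many splits.

```r
library(survival)

set.seed(42)
n       <- 100
measure <- rnorm(n)                                  # hypothetical continuous biomarker
# Simulated truth: only the top ~15% of samples carry extra risk
risk    <- ifelse(measure > quantile(measure, 0.85), 3, 1)
time    <- rexp(n, rate = 0.01 * risk)               # simulated follow-up times
event   <- rbinom(n, 1, 0.8)                         # about 20% treated as censored

# Sort all samples by the measure
ord     <- order(measure)
time    <- time[ord]
event   <- event[ord]
measure <- measure[ord]

# Evaluate every split: samples 1..i are "low", samples i+1..n are "high".
# Extreme splits can trigger convergence warnings, which are suppressed here.
scan_all <- do.call(rbind, lapply(seq_len(n - 1), function(i) {
  high <- as.integer(seq_len(n) > i)
  fit  <- suppressWarnings(summary(coxph(Surv(time, event) ~ high)))
  data.frame(index = i,
             HR    = fit$coefficients[1, "exp(coef)"],
             p     = fit$coefficients[1, "Pr(>|z|)"])
}))

# Split with the smallest p-value
scan_all[which.min(scan_all$p), ]
```

With n samples there are n - 1 candidate splits, so the scan returns one row per split and the best one is picked only after all have been evaluated.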

 

We can visually inspect the best point of separation by using the plotALL() function. Here we are effectively plotting a forest plot as above but for all 318 possible separations of the 319 samples, rather than just two.

 

plotALL(measure = erbb2_xpr,
        srv = pData(nki_subset), 
        time = "t.dmfs", 
        event = "e.dmfs", 
        title = "ERBB2 Expression") 

 

We can now see that the best point at which to separate our cohort based on ERBB2 expression is closer to a 90-10 split than to 80-20, suggesting a sampling bias towards HER2-negative patients in the NKI cohort.

 

We can further investigate the most significant separation using survivALL(), which follows the same procedure as plotALL() but returns a data frame of the calculations rather than a plot.

 

Using this we can select the point of most significant separation…

 

dfr <- survivALL(measure = erbb2_xpr,
               srv = pData(nki_subset), 
               time = "t.dmfs", 
               event = "e.dmfs", 
               measure_name = "ERBB2") 

dfr[which.min(dfr$p),] %>% pandoc.table()
Table continues below
      index  samples  event_time  event  measure  HR     p
284   284    NKI_181  3951        FALSE  0.474    1.201  0.002695
Table continues below
      log10_p  name   mean  sdplus  sdmin  threshold_residuals
284   2.569    ERBB2  0     0       0      1.201
      dsr     most_dsr  clsf
284   0.3903  TRUE      0

 

…and re-draw our Kaplan-Meier

 

erbb2_data <- ifelse(erbb2_xpr >= dfr[which.min(dfr$p),]$measure, "high", "low") #convert to binary classifier
data_fit <- survival::survfit(srv_obj ~ erbb2_data)
GGally::ggsurv(data_fit) + ggtitle("ERBB2 data-driven")
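As a quick check on the "closer to 90-10" reading, the selected index from the table above can be converted into a proportion of the cohort. This sketch assumes that index counts samples in ascending order of ERBB2 expression; the variable names are introduced here for illustration.

```r
n_samples  <- 319   # cohort size stated for nki_subset
best_index <- 284   # index of the most significant split in the table above

low_frac  <- best_index / n_samples   # share of samples up to the split
high_frac <- 1 - low_frac             # share of samples above the split

round(100 * c(low = low_frac, high = high_frac), 1)
```

Under that assumption the split is about 89 to 11, in line with the plotALL() output.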