Multiplicity Adjustment for Group Sequential Designs

Kaifeng Lu

2023-02-06

Background

Consider a phase 3 study with three arms: Arm A (drug A + drug B), Arm B (drug A placebo + drug B), and Arm C (drug A placebo + drug B placebo). Consider two endpoints: PFS and OS. There are six comparisons of interest corresponding to the combinations of two endpoints by three pairwise treatment comparisons.

Multiplicity adjustment strategy

The overall type I error rate is to be controlled at 1-sided \(\alpha=2.5\%\). The total alpha is split into 20% for PFS and 80% for OS. The following graph depicts the initial alpha allocation and propagation.

Figure 1: Graphical procedure for six hypotheses
Figure 1: Graphical procedure for six hypotheses

This strategy ensures that the A/B comparison will be performed for a given endpoint only if both the A/C and B/C comparisons are significant for the endpoint. This assumes that B/C comparison is likely to be effective and we are interested in assessing the contribution of components (CoC) of drug A in the drug combination A+B.

The weights along the edges are chosen to ensure that (1) about 20% of alpha will be allocated to PFS and 80% of alpha will be allocated to OS once the two hypotheses in the preceding family are rejected, and (2) if the PFS hypothesis is rejected, the majority of alpha will be passed to the OS hypothesis in the same family. This can be verified using the code below:

library(lrstat)

alpha = 0.025
x = list(
  w = c(0.2, 0.8, 0, 0, 0, 0),
  G = matrix(c(0, 0.8, 0.2, 0, 0, 0, 
               0.5, 0, 0, 0.5, 0, 0,
               0, 0.8, 0, 0, 0.2, 0,
               0.72, 0, 0, 0, 0, 0.28,
               0, 1, 0, 0, 0, 0,
               1, 0, 0, 0, 0, 0),
             nrow=6, ncol=6, byrow = TRUE),
  I = seq(1, 6))

print(alpha*x$w, digits=4)
## [1] 0.005 0.020 0.000 0.000 0.000 0.000
for (j in 1:6) {
  x = updateGraph(x$w, x$G, x$I, j)
  print(alpha*x$w, digits=4)
}
## [1] 0.000 0.024 0.001 0.000 0.000 0.000
## [1] 0.000 0.000 0.005 0.020 0.000 0.000
## [1] 0.000000 0.000000 0.000000 0.023846 0.001154 0.000000
## [1] 0.000000 0.000000 0.000000 0.000000 0.005092 0.019908
## [1] 0.000 0.000 0.000 0.000 0.000 0.025
## [1] 0 0 0 0 0 0

Assumptions

We assume that the enrollment rate is 10 patients per month for the first 8 months and 28 patients per month thereafter. We target to enroll 700 patients in about 30 months. The randomization ratio is 2:2:1 for Arms A:B:C.

The table below displays the assumptions for the event process by endpoint and by treatment arm along with the hazard ratios for Arm A or Arm B compared with Arm C.

Endpoint Parameter Arm A Arm B Arm C
PFS Median, mos 22 17 12
HR vs. C 0.55 0.70 NA
OS Median, mos 50 40 30
HR vs. C 0.60 0.75 NA

Table 1: Assumptions for the event process

Group sequential design

We assume that the study has two interim looks and one final look. The comparison in PFS for Arm A vs. C will be performed at Looks 1 and 2 only, while the other five comparisons will be performed at all three looks. An HSD(-4) alpha spending function will be used for testing each of the six hypotheses.

The timing of Look 1 and Look 2 is based on the combined PFS events for Arm A and Arm C, and Look 1 has 75% PFS events relative to Look 2. In addition, we target 97% power for PFS A/C at 1-sided alpha of 0.5%.

The following code calculates the sample size for the PFS A/C comparison. Of note, the accrual intensity is 3/5 of the overall population due to the 2:2:1 randomization ratio and the fact that only Arms A and C are included for the A/C comparison. The accrual duration of 30.143 months is used to enroll 700 patients for Arms A, B and C combined.

(lr1 <- lrsamplesize(
  beta = 0.03, kMax = 2, informationRates = c(0.75,1), 
  alpha = 0.005, typeAlphaSpending = "sfHSD", 
  parameterAlphaSpending = -4, 
  allocationRatioPlanned = 2, 
  accrualTime = c(0,8), accrualIntensity = c(10,28)*3/5, 
  lambda1 = log(2)/22, lambda2 = log(2)/12, 
  accrualDuration = 30.143, followupTime = NA, 
  typeOfComputation = "Schoenfeld")$resultsUnderH1)
##                                                                          
## Group-sequential trial with 2 stages                                     
## Overall power: 0.97, overall significance level (1-sided): 0.005         
## Maximum # events: 247, expected # events: 195.4                          
## Maximum # dropouts: 0, expected # dropouts: 0                            
## Maximum # subjects: 420, expected # subjects: 420                        
## Total study duration: 41.5, expected study duration: 34.7                
## Accrual duration: 30.1, follow-up duration: 11.3, fixed follow-up: FALSE 
##                                                                          
##                              Stage 1 Stage 2
## Information rate             0.750   1.000  
## Efficacy boundary (Z-scale)  2.915   2.629  
## Cumulative rejection         0.835   0.970  
## Cumulative alpha spent       0.0018  0.0050 
## Number of events             185.3   247.0  
## Number of dropouts           0.0     0.0    
## Number of subjects           420.0   420.0  
## Analysis time                33.4    41.5   
## Efficacy boundary (HR-scale) 0.635   0.701  
## Efficacy boundary (p-scale)  0.0018  0.0043 
## Information                  41.17   54.89  
## HR                           0.545   0.545

It can be seen that Look 1 and Look 2 are expected to take place at months 33.4 and 41.5, respectively, with the target number of events for Arm A and Arm C combined of 185 and 247 at the two looks, respectively.

We can now calculate the expected number of OS events for Arm A and Arm C combined at Look 1 and Look 2, respectively.

(lr2a <- lrstat(
  time = lr1$byStageResults$analysisTime, 
  allocationRatioPlanned = 2,
  accrualTime = c(0,8), accrualIntensity = c(10,28)*3/5, 
  lambda1 = log(2)/50, lambda2 = log(2)/30,
  accrualDuration = 30.143, followupTime = 100,
  predictEventOnly = 1))
##       time subjects   nevents nevents1 nevents2 ndropouts ndropouts1 ndropouts2
## 1 33.40175 420.0024  96.43897 54.58462 41.85435         0          0          0
## 2 41.47688 420.0024 137.01703 78.45778 58.55925         0          0          0

Look 3 is planned to have 90% power for the OS A/C comparison at 1-sided alpha of 2%, which can be done as follows:

(d_os <- uniroot(function(d) {
  lr <- lrsamplesize(
    beta = 0.1, kMax = 3, 
    informationRates = c(lr2a$nevents, d)/d,
    alpha = 0.020, typeAlphaSpending = "sfHSD", 
    parameterAlphaSpending = -4, 
    allocationRatioPlanned = 2, 
    accrualTime = c(0,8), 
    accrualIntensity = c(10,28)*3/5, 
    lambda1 = log(2)/50, lambda2 = log(2)/30, 
    accrualDuration = 30.143, followupTime = NA, 
    typeOfComputation = "Schoenfeld")$resultsUnderH1
  lr$overallResults$numberOfEvents - d}, 
  c(138,300))$root) 
## [1] 195.1146

In summary, the OS A/C power calculation is as follows:

(lr2 <- lrsamplesize(
  beta = 0.1, kMax = 3, 
  informationRates = c(lr2a$nevents, d_os)/d_os, 
  alpha = 0.020, typeAlphaSpending = "sfHSD", 
  parameterAlphaSpending = -4, 
  allocationRatioPlanned = 2, 
  accrualTime = c(0,8), accrualIntensity = c(10,28)*3/5, 
  lambda1 = log(2)/50, lambda2 = log(2)/30,
  accrualDuration = 30.143, followupTime = NA, 
  typeOfComputation = "Schoenfeld")$resultsUnderH1)
##                                                                          
## Group-sequential trial with 3 stages                                     
## Overall power: 0.9, overall significance level (1-sided): 0.02           
## Maximum # events: 195.1, expected # events: 147.3                        
## Maximum # dropouts: 0, expected # dropouts: 0                            
## Maximum # subjects: 420, expected # subjects: 420                        
## Total study duration: 55.5, expected study duration: 44.5                
## Accrual duration: 30.1, follow-up duration: 25.3, fixed follow-up: FALSE 
##                                                                          
##                              Stage 1 Stage 2 Stage 3
## Information rate             0.494   0.702   1.000  
## Efficacy boundary (Z-scale)  2.831   2.598   2.100  
## Cumulative rejection         0.321   0.600   0.900  
## Cumulative alpha spent       0.0023  0.0058  0.0200 
## Number of events             96.4    137.0   195.1  
## Number of dropouts           0.0     0.0     0.0    
## Number of subjects           420.0   420.0   420.0  
## Analysis time                33.4    41.5    55.5   
## Efficacy boundary (HR-scale) 0.543   0.624   0.727  
## Efficacy boundary (p-scale)  0.0023  0.0047  0.0179 
## Information                  21.43   30.45   43.36  
## HR                           0.600   0.600   0.600

Therefore, Look 3 is expected to take place at month 55.5. Based on the timing of the three looks, we obtain the expected number of events for the other four comparisons:

(lr3 <- lrstat(
  time = lr2$byStageResults$analysisTime, 
  allocationRatioPlanned = 2,
  accrualTime = c(0,8), accrualIntensity = c(10,28)*3/5, 
  lambda1 = log(2)/17, lambda2 = log(2)/12,
  accrualDuration = 30.143, followupTime = 100,
  predictEventOnly = 1))
##       time subjects  nevents nevents1  nevents2 ndropouts ndropouts1 ndropouts2
## 1 33.40176 420.0024 206.3885 127.3042  79.08424         0          0          0
## 2 41.47688 420.0024 271.9334 170.1415 101.79194         0          0          0
## 3 55.48666 420.0024 340.9393 217.9490 122.99030         0          0          0
(lr4 <- lrstat(
  time = lr2$byStageResults$analysisTime, 
  allocationRatioPlanned = 2,
  accrualTime = c(0,8), accrualIntensity = c(10,28)*3/5, 
  lambda1 = log(2)/40, lambda2 = log(2)/30,
  accrualDuration = 30.143, followupTime = 100,
  predictEventOnly = 1))
##       time subjects  nevents  nevents1 nevents2 ndropouts ndropouts1 ndropouts2
## 1 33.40176 420.0024 107.9582  66.10382 41.85436         0          0          0
## 2 41.47688 420.0024 152.5943  94.03503 58.55925         0          0          0
## 3 55.48666 420.0024 215.2001 134.11990 81.08019         0          0          0
(lr5 <- lrstat(
  time = lr2$byStageResults$analysisTime, 
  allocationRatioPlanned = 1,
  accrualTime = c(0,8), accrualIntensity = c(10,28)*4/5, 
  lambda1 = log(2)/22, lambda2 = log(2)/17,
  accrualDuration = 30.143, followupTime = 100,
  predictEventOnly = 1))
##       time subjects  nevents nevents1 nevents2 ndropouts ndropouts1 ndropouts2
## 1 33.40176 560.0032 233.4828 106.1786 127.3042         0          0          0
## 2 41.47688 560.0032 315.3667 145.2252 170.1415         0          0          0
## 3 55.48666 560.0032 411.2713 193.3223 217.9490         0          0          0
(lr6 <- lrstat(
  time = lr2$byStageResults$analysisTime, 
  allocationRatioPlanned = 1,
  accrualTime = c(0,8), accrualIntensity = c(10,28)*4/5, 
  lambda1 = log(2)/50, lambda2 = log(2)/40,
  accrualDuration = 30.143, followupTime = 100,
  predictEventOnly = 1))
##       time subjects  nevents  nevents1  nevents2 ndropouts ndropouts1
## 1 33.40176 560.0032 120.6884  54.58463  66.10382         0          0
## 2 41.47688 560.0032 172.4928  78.45778  94.03503         0          0
## 3 55.48666 560.0032 248.1543 114.03442 134.11990         0          0
##   ndropouts2
## 1          0
## 2          0
## 3          0

The power calculation for PFS A/C and OS A/C assumes no alpha recycling between the two hypotheses. In the presence of alpha propagation among the six hypotheses depicted in Figure 1, we resort to simulations to obtain the empirical power with multiplicity adjustment for the six hypotheses.

Simulation and empirical power

First we simulate 1000 trials, with the first 2 looks planned at 185 and 247 events for endpoint 1 and the final look planned at 195 events for endpoint 2, i.e., kMaxe1 = 2 and plannedEvents = c(185, 247, 195).

sim1 = lrsim2e3a(
  kMax = 3, 
  kMaxe1 = 2,
  allocation1 = 2, 
  allocation2 = 2, 
  allocation3 = 1,
  accrualTime = c(0, 8), 
  accrualIntensity = c(10, 28),
  lambda1e1 = log(2)/22,
  lambda2e1 = log(2)/17,
  lambda3e1 = log(2)/12,
  lambda1e2 = log(2)/50, 
  lambda2e2 = log(2)/40,
  lambda3e2 = log(2)/30,
  accrualDuration = 30.143,
  plannedEvents = c(185, 247, 195),
  maxNumberOfIterations = 1000,
  maxNumberOfRawDatasetsPerStage = 1,
  seed = 314159)

From the summary data for each simulated data set, we obtain the pairwise total number of events and the raw p-values.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

df <- sim1$sumdata %>% 
  mutate(events13 = events1 + events3,
         events23 = events2 + events3,
         events12 = events1 + events2,
         p13 = pnorm(logRankStatistic13), 
         p23 = pnorm(logRankStatistic23), 
         p12 = pnorm(logRankStatistic12)) %>% 
  select(iterationNumber, stageNumber, endpoint, 
         events13, events23, events12, 
         p13, p23, p12)

dfcomp <- tibble(comparison = c("13", "23", "12"),
                 order = c(1, 2, 3))



dfinfo <- df %>% 
  arrange(iterationNumber, endpoint) %>% 
  pivot_longer(
    c("events13", "events23", "events12"), 
    names_to = "c_events", 
    values_to = "events") %>% 
  mutate(comparison = substring(c_events, 7)) %>%
  left_join(dfcomp, by = c("comparison")) %>% 
  arrange(iterationNumber, order, endpoint, 
          stageNumber)


dfp <- df %>% 
  arrange(iterationNumber, endpoint) %>% 
  pivot_longer(
    c("p13", "p23", "p12"), 
    names_to = "c_p", 
    values_to = "p") %>% 
  mutate(comparison = substring(c_p, 2)) %>%
  left_join(dfcomp, by = c("comparison")) %>% 
  arrange(iterationNumber, order, endpoint, 
          stageNumber)

Now we use specified alpha-spending function along with the target total number of events to adjust the critical values based on the current alpha for the hypothesis. We compare the p-values to the critical values to determine whether to reject the hypothesis at the current look. The empirical power for a hypothesis is estimated by the proportion of simulations which reject the hypothesis.

w = c(0.2, 0.8, 0, 0, 0, 0) 
G = matrix(c(0, 0.8, 0.2, 0, 0, 0, 
             0.5, 0, 0, 0.5, 0, 0,
             0, 0.8, 0, 0, 0.2, 0, 
             0.72, 0, 0, 0, 0, 0.28,
             0, 1, 0, 0, 0, 0, 
             1, 0, 0, 0, 0, 0), 
           nrow=6, ncol=6, byrow=TRUE)


rejStage = fseqbon(
  w = w, G = G, alpha = 0.025, kMax = 3,
  typeAlphaSpending = rep("sfHSD", 6),
  parameterAlphaSpending = rep(-4, 6),
  incidenceMatrix = matrix(
    c(1,1,0, 1,1,1, 1,1,1, 1,1,1, 1,1,1, 1,1,1), 
    nrow=6, ncol=3, byrow=TRUE),
  maxInformation = c(247, 195, 341, 215, 411, 248),
  p = matrix(dfp$p, 6000, 3, byrow=TRUE),
  information = matrix(dfinfo$events, 6000, 3, byrow=TRUE))

reject2a = matrix(rejStage>0, nrow=1000, ncol=6, byrow=TRUE)
(power2 = apply(reject2a, 2, mean))
## [1] 0.983 0.929 0.720 0.517 0.275 0.129