cvcrand package

Hengshi Yu, John A. Gallis, Fan Li, and Elizabeth L. Turner

2017-11-28

Overview

cvcrand is an R package for the design and analysis of cluster randomized trials (CRTs).

Given the baseline values of some cluster-level covariates, users can perform constrained randomization of the clusters into two arms, optionally supplying user-defined weights on the covariates.

At the end of the study, individual-level outcomes are collected. The cvcrand package also performs a clustered permutation test on either a continuous or a binary outcome, adjusted for individual-level covariates, and produces a p-value for the intervention effect.

Design: constrained randomization

In the design of two-arm CRTs, users can call the cvcrand() function to perform constrained randomization. For the analysis, users can call the cptest() function to perform the clustered permutation test.

The cluster is the unit of randomization in a cluster randomized trial. When the number of clusters is small, randomization can therefore leave some baseline imbalance between the arms. Constrained randomization restricts the randomization space to schemes with smaller differences in the covariates between the two arms.

The balance score for constrained randomization in the program is developed from Raab and Butcher (2001). Let \(n\), \(n_T\), and \(n_C\) be the total number of clusters, the number of clusters in the treatment arm, and the number of clusters in the control arm, respectively. Suppose also that there are \(K\) cluster-level variables, comprising the continuous covariates and the dummy variables created from the categorical covariates. Let \(\bar{x}_{Tk}\) and \(\bar{x}_{Ck}\) be the means of the \(k\)th variable in the treatment arm and the control arm, respectively, and let \(\omega_k\) be the inverse of the standard deviation of the \(k\)th variable among all \(n\) clusters. There are two choices of metric for the balance score. If the "L1" metric is specified and there are no user-defined weights \(C_k;\ k=1,2,...,K\), the balance score is defined as follows.

\(B_{(l1)}=\left(\frac{n_Tn_C}{n}\right)\sum_{k=1}^{K}\omega_k\left|\bar{x}_{Tk}-\bar{x}_{Ck}\right|\)

If user-defined weights \(C_k;\ k=1,2,...,K\) are supplied, the "L1" balance score is: \(B_{(l1)}=\left(\frac{n_Tn_C}{n}\right)\sum_{k=1}^{K}C_k\omega_k\left|\bar{x}_{Tk}-\bar{x}_{Ck}\right|\).

The other metric is "L2". The balance scores without and with the user-defined weights are, respectively:

\(B_{(l2)}=\left(\frac{n_Tn_C}{n}\right)^{2}\sum_{k=1}^{K}\omega_k^2(\bar{x}_{Tk}-\bar{x}_{Ck})^2\) and \(B_{(l2)}=\left(\frac{n_Tn_C}{n}\right)^{2}\sum_{k=1}^{K}C_k^2\omega_k^2(\bar{x}_{Tk}-\bar{x}_{Ck})^2\)
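As a rough illustration of the formulas above, the following R sketch computes the balance score for a single allocation. The function balance_score() and the toy data are hypothetical and are not part of cvcrand. Using sd() (with its n - 1 denominator) for \(\omega_k\) is an assumption about the convention, not something stated here.

```r
# Hypothetical helper (not part of cvcrand): balance score of one allocation.
# x:   numeric matrix of cluster-level variables, one row per cluster
# trt: 0/1 vector, 1 = treatment arm
# C:   optional user-defined weights, one per column of x (default: all 1)
balance_score <- function(x, trt, metric = c("L2", "L1"), C = rep(1, ncol(x))) {
  metric <- match.arg(metric)
  x <- as.matrix(x)
  n_T <- sum(trt == 1)
  n_C <- sum(trt == 0)
  n <- n_T + n_C
  # omega_k: inverse standard deviation over all n clusters (assumes sd())
  omega <- 1 / apply(x, 2, sd)
  # difference in arm means for each variable
  d <- colMeans(x[trt == 1, , drop = FALSE]) -
    colMeans(x[trt == 0, , drop = FALSE])
  if (metric == "L1") {
    (n_T * n_C / n) * sum(C * omega * abs(d))
  } else {
    (n_T * n_C / n)^2 * sum(C^2 * omega^2 * d^2)
  }
}

# Toy data, made up for illustration: 4 clusters, 2 variables
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 10, 20))

balanced   <- balance_score(x, trt = c(1, 0, 0, 1), metric = "L1")
unbalanced <- balance_score(x, trt = c(1, 1, 0, 0), metric = "L2")
```

In this toy example the first allocation has identical arm means on both variables, so its score is zero, while the second allocation differs on variable a and receives a positive score.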

Given the baseline values of the specified cluster-level covariates in a cluster randomized trial, the cvcrand() function in the cvcrand package performs the constrained randomization.

Each categorical variable is transformed into dummy variables to calculate the balance score. The level dropped in this transformation is the first level in alphanumerical order. Users who want to drop a different level can create the dummy variables themselves before running cvcrand(), and then specify those dummy variables as "categorical" when running cvcrand().
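The dropped-level rule can be illustrated with base R. This is only a sketch: whether cvcrand() builds its dummy variables with model.matrix() internally is an assumption, and the column names income_high and income_med are hypothetical.

```r
# Toy factor using the income categories from the example in this document
incomecat <- factor(c("Low", "High", "Med", "High", "Low"))

# Levels are sorted, so "High" comes first
lv <- levels(incomecat)

# Default treatment contrasts drop the first level ("High")
default_dummies <- model.matrix(~ incomecat)

# To drop "Low" instead, build the indicator columns by hand
manual_dummies <- data.frame(
  income_high = as.integer(incomecat == "High"),
  income_med  = as.integer(incomecat == "Med")
)
```

The hand-built columns would then be passed in place of the original categorical variable.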

cvcrand() example

Study 1 in Dickinson et al. (2015) compares two approaches (interventions) for increasing the “up-to-date” immunization rate in 19- to 35-month-old children. The investigators planned to randomize 16 counties in Colorado 1:1 to either a population-based approach or a practice-based approach. There are several county-level variables, and the program randomizes on a subset of them. The continuous variable of average income is categorized to illustrate the use of cvcrand() on multi-category variables, and the percentage in CIIS variable is truncated at 100%.

county location inciis uptodateonimmunizations hispanic incomecat
1 Rural 94 37 44 Low
2 Rural 85 39 23 High
3 Rural 85 42 12 Low
4 Rural 93 39 18 High
5 Rural 82 31 6 High
6 Rural 80 27 15 Med
7 Rural 94 49 38 Low
8 Rural 100 37 39 Low
9 Urban 93 51 35 Med
10 Urban 89 51 17 Med
11 Urban 83 54 7 High
12 Urban 70 29 13 Med
13 Urban 93 50 13 High
14 Urban 85 36 10 Med
15 Urban 82 38 39 Low
16 Urban 84 43 28 Med

For the constrained randomization, we used the cvcrand() function to randomize 8 of the 16 counties into the practice-based arm. To define the whole randomization space, we enumerate all schemes if the total number of possible schemes is smaller than 50,000. Otherwise, we simulate 50,000 schemes and take the unique schemes among them as the whole randomization space. We calculate the "L2" balance scores on three continuous covariates and two categorical covariates, location and income category. Location has the levels "Rural" and "Urban", so "Rural" was dropped in cvcrand(). Income category has the three levels "Low", "Med", and "High", so "High" was dropped when creating the dummy variables, again following alphanumerical order. We then constrained the randomization space to the schemes with "L2" balance scores less than the 0.1 quantile of the scores in the whole randomization space. Finally, a randomization scheme is sampled from the constrained space.
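The enumerate, cut off, and sample steps just described can be sketched in base R as follows. This is an illustration rather than the package's implementation: the score below is a simplified stand-in (the squared difference in arm means of inciis alone) for the full "L2" balance score, and the object names are hypothetical.

```r
n <- 16
ntrt <- 8
n_schemes <- choose(n, ntrt)   # 12870, below the 50,000 threshold

# Enumerate every scheme: each column lists the clusters sent to treatment
schemes <- combn(n, ntrt)

# Percentage in CIIS for the 16 counties, copied from the table above
inciis <- c(94, 85, 85, 93, 82, 80, 94, 100, 93, 89, 83, 70, 93, 85, 82, 84)

# Simplified stand-in score: squared difference in arm means of one covariate
score <- apply(schemes, 2, function(idx) {
  (mean(inciis[idx]) - mean(inciis[-idx]))^2
})

# Keep the schemes scoring below the 0.1 quantile, then sample one of them
cutoff <- quantile(score, 0.1)
constrained <- which(score < cutoff)
set.seed(12345)
chosen <- schemes[, sample(constrained, 1)]
```

With 16 clusters the full enumeration is cheap. For larger trials the number of schemes grows quickly, which is why the simulated space described above is used once the count reaches 50,000.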

We saved the constrained randomization space in the CSV file "dickinson_constrained.csv", whose first column indicates whether a scheme is the finally selected one (1) or not (0). We also saved the balance scores of the whole randomization space in the CSV file "dickinson_bscores.csv" and output a histogram of all balance scores, with a red line marking the selected cutoff (the 0.1 quantile).

 Design_result<-cvcrand(id = Dickinson_design$county,
                       metric = "L2",
                       x = data.frame(Dickinson_design[, -1]),
                       n = 16,
                       ntrt = 8,
                       categorical = c("location","incomecat"),
                       savedata = "dickinson_constrained.csv",
                       savebscores = "dickinson_bscores.csv",
                       cutoff = 0.1,
                       seed = 12345)

We then obtained the following output:

 # the metric the user specified
 Design_result$metric
## [1] "L2"
 # the selected scheme 
 Design_result$Allocation
##       id allocation
##  [1,]  1          0
##  [2,]  2          0
##  [3,]  3          0
##  [4,]  4          1
##  [5,]  5          1
##  [6,]  6          0
##  [7,]  7          1
##  [8,]  8          0
##  [9,]  9          1
## [10,] 10          1
## [11,] 11          0
## [12,] 12          1
## [13,] 13          1
## [14,] 14          0
## [15,] 15          1
## [16,] 16          0
 # the cutoff balance score, the balance score of the selected scheme, as well as the histogram of the balance scores of the whole randomization space
 Design_result$Bscores
##   CR-Benchmark Selected Point           Mean             SD            Min             5% 
##       7.638463       6.763833      24.000000      15.774661       1.161107       5.826458 
##            10%            20%            25%            30%            50%            75% 
##       7.638463      10.848749      12.221264      13.839891      20.577839      31.620790 
##            95%            Max 
##      55.486080     116.656181
 # the statement about how many clusters to randomize to the intervention and the control arms respectively
 Design_result$assignment_message
## [1] "You have indicated that you want to assign 8 clusters to treatment and 8 to control"
 # the statement about how to get the whole space of schemes
 Design_result$scheme_message
## [1] "Enumerating all the 12870 schemes for 8 clusters in the treatment arm out of 16 clusters in total"
 # the statement about the benchmark of the constrained space
 Design_result$BLcut_message
## [1] "By cutoff quantile of 0.1 of L2 , the BL constrained randomization benchmark is 7.638"
 # the statement about the selected scheme from constrained randomization
 Design_result$BLchoice_message
## [1] "Balance score of selected scheme by L2 has B score of 6.764"
 # the matrix containing allocation scheme, the id as well as the original covariates' matrix
 Design_result$data_BL
##    inter id location inciis uptodateonimmunizations hispanic incomecat
## 1      0  1    Rural     94                      37       44       Low
## 2      0  2    Rural     85                      39       23      High
## 3      0  3    Rural     85                      42       12       Low
## 4      1  4    Rural     93                      39       18      High
## 5      1  5    Rural     82                      31        6      High
## 6      0  6    Rural     80                      27       15       Med
## 7      1  7    Rural     94                      49       38       Low
## 8      0  8    Rural    100                      37       39       Low
## 9      1  9    Urban     93                      51       35       Med
## 10     1 10    Urban     89                      51       17       Med
## 11     0 11    Urban     83                      54        7      High
## 12     1 12    Urban     70                      29       13       Med
## 13     1 13    Urban     93                      50       13      High
## 14     0 14    Urban     85                      36       10       Med
## 15     1 15    Urban     82                      38       39       Low
## 16     0 16    Urban     84                      43       28       Med
 # the descriptive statistics for all the variables in the original covariates' matrix in the two arms from constrained randomization
 Design_result$BL_result
##                                      Stratified by inter
##                                       0             1             p      test
##   n                                       8             8                    
##   location = Urban (%)                    3 (37.5)      5 (62.5)   0.617     
##   inciis (mean (sd))                  87.00 (6.59)  87.00 (8.45)   1.000     
##   uptodateonimmunizations (mean (sd)) 39.38 (7.65)  42.25 (9.18)   0.507     
##   hispanic (mean (sd))                22.25 (13.77) 22.38 (12.94)  0.985     
##   incomecat (%)                                                    0.819     
##      High                                 2 (25.0)      3 (37.5)             
##      Low                                  3 (37.5)      2 (25.0)             
##      Med                                  3 (37.5)      3 (37.5)

The output of Design_result$BL_result shows that the selected scheme balances the baseline values of the covariates well. The selected scheme itself is shown in Design_result$Allocation.
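As a quick check, the arm means of the continuous covariates can be recomputed by hand from the county table and the selected allocation shown above. The vectors below are copied from that output; the object names are only for illustration.

```r
# Selected allocation for counties 1 to 16 (from Design_result$Allocation)
alloc <- c(0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)

# County-level covariates, copied from the table of the 16 counties
inciis   <- c(94, 85, 85, 93, 82, 80, 94, 100, 93, 89, 83, 70, 93, 85, 82, 84)
uptodate <- c(37, 39, 42, 39, 31, 27, 49, 37, 51, 51, 54, 29, 50, 36, 38, 43)
hispanic <- c(44, 23, 12, 18, 6, 15, 38, 39, 35, 17, 7, 13, 13, 10, 39, 28)

# Mean of each covariate within each arm (first row = arm 0, second = arm 1)
arm_means <- sapply(
  list(inciis = inciis, uptodateonimmunizations = uptodate, hispanic = hispanic),
  function(v) tapply(v, alloc, mean)
)
round(arm_means, 2)
```

The resulting means agree with those reported in Design_result$BL_result: 87 in both arms for inciis, 39.375 versus 42.25 for uptodateonimmunizations, and 22.25 versus 22.375 for hispanic.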

Analysis: Clustered Permutation Test

At the end of a cluster randomized trial, individual outcomes are collected. A permutation test based on Gail et al. (1996) and Li et al. (2016) is then applied to the continuous or binary outcome, adjusting for individual-level covariates.

The cptest() function in the cvcrand package is used to perform the permutation test for the intervention effect of cluster randomized trials.

Each categorical variable is transformed into dummy variables to fit the linear model or logistic regression used in the permutation test. The level dropped in this transformation is the first level in alphanumerical order. Users who want to drop a different level can create the dummy variables themselves before running cptest(), and then specify those dummy variables as "categorical" when running cptest().
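To give a feel for how an adjusted clustered permutation test works, here is a minimal sketch of one common formulation: fit the outcome model without the intervention term, average the residuals within clusters, and compare the observed difference in arm means of those cluster residuals with its distribution over the randomization space. The data, the covariate age, and the allocation are all made up, every balanced scheme is used as the space for simplicity, and the details may differ from what cptest() does internally.

```r
set.seed(42)
n_cl <- 8    # clusters (toy size)
m <- 50      # individuals per cluster
cluster <- rep(seq_len(n_cl), each = m)
age <- rnorm(n_cl * m)                   # made-up individual-level covariate
observed <- c(1, 0, 1, 0, 0, 1, 1, 0)    # made-up final allocation
y <- rbinom(n_cl * m, 1,
            plogis(-0.3 + 0.4 * age + 0.5 * observed[cluster]))

# Step 1: fit the outcome model WITHOUT the intervention term
fit <- glm(y ~ age, family = binomial)

# Step 2: average the response residuals within each cluster
r_bar <- tapply(residuals(fit, type = "response"), cluster, mean)

# Step 3: test statistic for a 0/1 allocation vector
stat <- function(alloc) {
  abs(mean(r_bar[alloc == 1]) - mean(r_bar[alloc == 0]))
}

# Step 4: evaluate the statistic over the randomization space. Here the space
# is all 70 balanced schemes; under constrained randomization it would be the
# schemes in the saved constrained space instead.
space <- apply(combn(n_cl, 4), 2,
               function(idx) as.integer(seq_len(n_cl) %in% idx))
perm_stats <- apply(space, 2, stat)

# Step 5: p-value = share of schemes at least as extreme as the observed one
p_value <- mean(perm_stats >= stat(observed))
```

Because only cluster-level residual means are permuted, the test respects the clustering, and restricting the space to the constrained schemes keeps the analysis consistent with the design.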

cptest() example

Suppose that the researchers were able to assess 300 children in each cluster of Study 1 in Dickinson et al. (2015), and that the cluster randomized trial was carried out with the randomization scheme selected in the cvcrand() example above. We expanded the cluster-level covariate values to the individuals according to the cluster each individual belongs to. The correlated individual outcome of up-to-date on immunizations (1) or not (0) was then simulated using a generalized linear mixed model (GLMM), with a random effect included to induce correlation. The intracluster correlation (ICC) was set to 0.01, using the latent response definition provided in Eldridge, Ukoumunne, and Carlin (2009). This is a reasonable value of the ICC for population health studies (Hannan et al. 1994). We simulated one data set, with the outcome dependent on the county-level covariates used in the constrained randomization design and with a positive treatment effect, so that the practice-based intervention increases up-to-date immunization rates more than the population-based intervention. For each individual child, the outcome equals 1 if he or she is up-to-date on immunizations and 0 otherwise.
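A simulation of this kind might look like the sketch below. It assumes a logistic link, for which the latent response definition gives ICC = \(\sigma_b^2 / (\sigma_b^2 + \pi^2/3)\). The intercept, the treatment effect, and the allocation are hypothetical values, the county-level covariates are left out for brevity, and this is not the code used to generate the data set in this document.

```r
set.seed(2017)

# Random-intercept variance implied by the target ICC (latent response scale)
icc <- 0.01
sigma2_b <- icc * (pi^2 / 3) / (1 - icc)   # about 0.0332

n_clusters <- 16
m <- 300                                   # children per county
cluster <- rep(seq_len(n_clusters), each = m)

# Illustrative allocation and effect sizes (hypothetical values)
trt <- rep(c(0, 1), each = 8)
beta0 <- -0.5
beta_trt <- 0.3

# One random intercept per cluster induces within-cluster correlation
b <- rnorm(n_clusters, mean = 0, sd = sqrt(sigma2_b))

# Linear predictor and binary outcome for each child
eta <- beta0 + beta_trt * trt[cluster] + b[cluster]
outcome <- rbinom(length(eta), size = 1, prob = plogis(eta))
```

Solving the ICC formula for \(\sigma_b^2\) gives the variance used for the random intercepts, so a small ICC such as 0.01 corresponds to a small between-cluster variance on the logit scale.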

county location inciis uptodateonimmunizations hispanic incomecat outcome
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 0
1 Rural 94 37 44 0 0
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1
1 Rural 94 37 44 0 1

We used the cptest() function to perform the clustered permutation test on the binary outcome of up-to-date immunization status. We supplied the file containing the constrained space, whose first column indicates the final scheme. The permutation test adjusts for the continuous covariates "inciis", "uptodateonimmunizations", and "hispanic", as well as the categorical variables "location" and "incomecat". Location has the levels "Rural" and "Urban", so "Rural" was dropped in cptest(). Income category has the three levels "low", "med", and "high", so "high" was dropped when creating the dummy variables, again following alphanumerical order.

 Analysis_result <- cptest(outcome = Dickinson_outcome$outcome,
                           id = Dickinson_outcome$county,
                           x = data.frame(Dickinson_outcome[, c(-1, -7)]),
                           cspacedatname = system.file("dickinson_constrained.csv", package="cvcrand"),
                           outcometype = "binary",
                           categorical = c("location","incomecat"))

The result of cptest() includes the final scheme for the cluster randomized trial, the p-value from the permutation test, and a statement about that p-value.

 Analysis_result 
## $FinalScheme
##    Cluster_ID Intervention
## 1           1            0
## 2           2            0
## 3           3            0
## 4           4            1
## 5           5            1
## 6           6            0
## 7           7            1
## 8           8            0
## 9           9            1
## 10         10            1
## 11         11            0
## 12         12            1
## 13         13            1
## 14         14            0
## 15         15            1
## 16         16            0
## 
## $pvalue
## [1] 0.0497
## 
## $pvalue_statement
## [1] "Adjusted permutation test p-value = 0.0497"

With a p-value of 0.0497 in Analysis_result, the probability of being up-to-date on immunizations under the practice-based approach (1) is significantly different from that under the population-based approach (0) at the 0.05 level.

Session Information

## R Under development (unstable) (2017-07-03 r72882)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                           LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] cvcrand_0.0.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11    lattice_0.20-35 zoo_1.8-0       class_7.3-14    digest_0.6.12   rprojroot_1.2  
##  [7] MASS_7.3-47     grid_3.5.0      backports_1.1.0 magrittr_1.5    e1071_1.6-8     evaluate_0.10.1
## [13] survey_3.32-1   highr_0.6       stringi_1.1.5   Matrix_1.2-10   rmarkdown_1.8   splines_3.5.0  
## [19] tools_3.5.0     stringr_1.2.0   survival_2.41-3 yaml_2.1.14     compiler_3.5.0  htmltools_0.3.6
## [25] knitr_1.17      tableone_0.8.1

References

Dickinson, L Miriam, Brenda Beaty, Chet Fox, Wilson Pace, W Perry Dickinson, Caroline Emsermann, and Allison Kempe. 2015. “Pragmatic Cluster Randomized Trials Using Covariate Constrained Randomization: A Method for Practice-Based Research Networks (PBRNs).” The Journal of the American Board of Family Medicine 28 (5). Am Board Family Med: 663–72.

Eldridge, Sandra M, Obioha C Ukoumunne, and John B Carlin. 2009. “The Intra-Cluster Correlation Coefficient in Cluster Randomized Trials: A Review of Definitions.” International Statistical Review 77 (3). Wiley Online Library: 378–94.

Gail, Mitchell H, Steven D Mark, Raymond J Carroll, Sylvan B Green, and David Pee. 1996. “On Design Considerations and Randomization-Based Inference for Community Intervention Trials.” Statistics in Medicine 15 (11). Wiley Online Library: 1069–92.

Hannan, Peter J, David M Murray, David R Jacobs Jr, and Paul G McGovern. 1994. “Parameters to Aid in the Design and Analysis of Community Trials: Intraclass Correlations from the Minnesota Heart Health Program.” Epidemiology. JSTOR, 88–95.

Li, Fan, Yuliya Lokhnygina, David M Murray, Patrick J Heagerty, and Elizabeth R DeLong. 2016. “An Evaluation of Constrained Randomization for the Design and Analysis of Group-Randomized Trials.” Statistics in Medicine 35 (10). Wiley Online Library: 1565–79.

Raab, Gillian M, and Izzy Butcher. 2001. “Balance in Cluster Randomized Trials.” Statistics in Medicine 20 (3). Wiley Online Library: 351–65.