Correlated Data

Sometimes it is desirable to simulate correlated data directly from a correlation matrix. For example, a simulation might require two random effects (e.g. a random intercept and a random slope). Correlated data like this could be generated using the defData functionality, but it may be more natural to use genCorData or addCorData. These two functions generate multivariate normal data only; other distributions are handled by genCorGen and addCorGen, described in the section on additional distributions.

genCorData requires the user to specify a mean vector mu, a single standard deviation or a vector of standard deviations sigma, and either a correlation matrix corMatrix or a correlation coefficient rho together with a correlation structure corstr. A few examples show how this works.

# load simstudy (data.table provides the dt[...] syntax used below)
library(simstudy)
library(data.table)

# specify a correlation matrix C
C <- matrix(c(1, 0.7, 0.2, 0.7, 1, 0.8, 0.2, 0.8, 1), nrow = 3)
C
##      [,1] [,2] [,3]
## [1,]  1.0  0.7  0.2
## [2,]  0.7  1.0  0.8
## [3,]  0.2  0.8  1.0
# generate 3 correlated variables with different location and scale for each
# field
dt <- genCorData(1000, mu = c(4, 12, 3), sigma = c(1, 2, 3), corMatrix = C)
dt
##         id       V1        V2         V3
##    1:    1 4.058035 11.775641  1.4136784
##    2:    2 2.123084  9.863020  3.4490109
##    3:    3 3.141109  9.018995 -0.8078367
##    4:    4 3.512148 10.602635  0.6695056
##    5:    5 4.537790 12.370902  2.1505126
##   ---                                   
##  996:  996 3.615397 12.742743  5.2624659
##  997:  997 3.794947 13.341832  6.3298183
##  998:  998 4.932117 13.114247  3.1206027
##  999:  999 4.164953 12.311527  3.7568423
## 1000: 1000 3.242038 10.258521  1.3226698
# estimate correlation matrix
dt[, round(cor(cbind(V1, V2, V3)), 1)]
##     V1  V2  V3
## V1 1.0 0.7 0.2
## V2 0.7 1.0 0.8
## V3 0.2 0.8 1.0
# estimate standard deviation
dt[, round(sqrt(diag(var(cbind(V1, V2, V3)))), 1)]
##  V1  V2  V3 
## 1.0 2.0 3.1
# generate 3 correlated variables with different location but same standard
# deviation and compound symmetry (cs) correlation matrix with correlation
# coefficient = 0.4.  Other correlation matrix structures are 'independent'
# ('ind') and 'auto-regressive' ('ar1').

dt <- genCorData(1000, mu = c(4, 12, 3), sigma = 3, rho = 0.4, corstr = "cs", cnames = c("x0", 
    "x1", "x2"))
dt
##         id         x0        x1          x2
##    1:    1  0.6637534 15.411935  8.21102940
##    2:    2 -2.1386381  6.647061 -4.87059866
##    3:    3  0.7738308 14.010696  2.36971424
##    4:    4 -1.3669604  7.805225  5.84619105
##    5:    5  1.7462404 12.802335  6.18068532
##   ---                                      
##  996:  996 -0.4226211 13.550424  3.04177758
##  997:  997  1.0399654  8.842263 -0.02695016
##  998:  998  1.9620598  8.392662 -5.13841535
##  999:  999  6.8600500 14.109390  1.04625707
## 1000: 1000 -0.2261344  8.646182  4.87754944
# estimate correlation matrix
dt[, round(cor(cbind(x0, x1, x2)), 1)]
##     x0  x1  x2
## x0 1.0 0.4 0.4
## x1 0.4 1.0 0.4
## x2 0.4 0.4 1.0
# estimate standard deviation
dt[, round(sqrt(diag(var(cbind(x0, x1, x2)))), 1)]
##  x0  x1  x2 
## 3.0 3.0 3.1
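
The named correlation structures can also be written out by hand. The sketch below uses base R only and is not simstudy's implementation; buildCorMatrix is a hypothetical helper introduced for illustration. It builds the 'cs' and 'ar1' matrices for rho = 0.4, scales the 'cs' matrix to a covariance matrix using sigma, and draws multivariate normal data with a Cholesky factor.

```r
# Hypothetical helper (not part of simstudy): correlation matrix for the
# "ind", "cs", or "ar1" structure.
buildCorMatrix <- function(nvars, rho, corstr = c("ind", "cs", "ar1")) {
  corstr <- match.arg(corstr)
  lag <- abs(outer(seq_len(nvars), seq_len(nvars), "-"))  # |i - j|
  switch(corstr,
    ind = diag(nvars),               # identity: no correlation
    cs  = ifelse(lag == 0, 1, rho),  # same rho for every pair
    ar1 = rho^lag                    # rho decays with distance
  )
}

R_cs  <- buildCorMatrix(3, rho = 0.4, corstr = "cs")
R_ar1 <- buildCorMatrix(3, rho = 0.4, corstr = "ar1")

# Covariance matrix: element (i, j) is sigma_i * sigma_j * R_ij
mu    <- c(4, 12, 3)
sigma <- rep(3, 3)
Sigma <- R_cs * outer(sigma, sigma)

# Draw n multivariate normal rows: Z %*% chol(Sigma) has covariance Sigma
set.seed(1)
n <- 1000
Z <- matrix(rnorm(n * 3), nrow = n)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")

round(cor(X), 1)
round(apply(X, 2, sd), 1)
```

With 'ar1', the correlation between the first and third variable is rho squared, whereas 'cs' keeps it at rho.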

The new data generated by genCorData can be merged with an existing data set. Alternatively, addCorData will do this directly:

# define and generate the original data set
def <- defData(varname = "x", dist = "normal", formula = 0, variance = 1, id = "cid")
dt <- genData(1000, def)

# add new correlated fields a0 and a1 to 'dt'
dt <- addCorData(dt, idname = "cid", mu = c(0, 0), sigma = c(2, 0.2), rho = -0.2, 
    corstr = "cs", cnames = c("a0", "a1"))

dt
##        cid           x          a0          a1
##    1:    1 -0.42253976  1.70913663 -0.21916549
##    2:    2  0.21082079  2.24083912 -0.18539546
##    3:    3 -0.43131449 -0.02808837  0.26726127
##    4:    4  0.73399704 -2.36834345 -0.29376261
##    5:    5  0.94016230 -1.47098599 -0.02953758
##   ---                                         
##  996:  996 -0.46009949 -2.00774558  0.28938213
##  997:  997 -0.91059568  2.96603486  0.02535612
##  998:  998 -0.16363279 -1.90368728  0.09981716
##  999:  999 -0.15381623  0.16926667  0.11932288
## 1000: 1000  0.06060543  1.96515980 -0.08131505
# estimate correlation matrix
dt[, round(cor(cbind(a0, a1)), 1)]
##      a0   a1
## a0  1.0 -0.2
## a1 -0.2  1.0
# estimate standard deviation
dt[, round(sqrt(diag(var(cbind(a0, a1)))), 1)]
##  a0  a1 
## 2.0 0.2
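
For comparison, the merge route mentioned above might look like the following sketch. It assumes genCorData names its id column 'id' by default (as in the output shown earlier) and renames that column with data.table's setnames before merging; dtCor and dtMerged are illustrative names.

```r
library(simstudy)
library(data.table)

# original data set, keyed by 'cid'
def <- defData(varname = "x", dist = "normal", formula = 0, variance = 1, id = "cid")
dt <- genData(1000, def)

# correlated data generated separately; the id column is called 'id' here
dtCor <- genCorData(1000, mu = c(0, 0), sigma = c(2, 0.2), rho = -0.2,
                    corstr = "cs", cnames = c("a0", "a1"))

# align the id name, then merge on it
setnames(dtCor, "id", "cid")
dtMerged <- merge(dt, dtCor, by = "cid")

dtMerged
```

addCorData saves the renaming and merging steps, which is why it is usually the more convenient choice.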

Correlated data: additional distributions

Two additional functions facilitate the generation of correlated data from binary, Poisson, gamma, and uniform distributions: genCorGen and addCorGen.

genCorGen is an extension of genCorData. In the first example, we are generating data from a multivariate Poisson distribution. We start by specifying the mean of the Poisson distribution for each new variable, and then we specify the correlation structure, just as we did with the normal distribution.

l <- c(8, 10, 12) # lambda for each new variable

dx <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson", rho = .3, corstr = "cs", wide = TRUE)
dx
##         id V1 V2 V3
##    1:    1  4  7 10
##    2:    2  8 10 11
##    3:    3  4  9  9
##    4:    4  8 11  8
##    5:    5  8 10 14
##   ---              
##  996:  996  2  6  8
##  997:  997  9 13 12
##  998:  998  8 10 17
##  999:  999 11  8 18
## 1000: 1000  5  5 14
round(cor(as.matrix(dx[, .(V1, V2, V3)])), 2)
##      V1   V2   V3
## V1 1.00 0.28 0.19
## V2 0.28 1.00 0.24
## V3 0.19 0.24 1.00
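
One common way to obtain correlated non-normal data is a Gaussian copula: draw correlated standard normals, convert them to uniforms, and push the uniforms through the target quantile function. The base R sketch below assumes that approach for illustration and is not taken from simstudy's source.

```r
set.seed(2)
n      <- 1000
lambda <- c(8, 10, 12)
rho    <- 0.3

# compound symmetry correlation matrix for the latent normals
R <- matrix(rho, 3, 3)
diag(R) <- 1

Z <- matrix(rnorm(n * 3), nrow = n) %*% chol(R)  # correlated standard normals
U <- pnorm(Z)                                    # uniform margins, still correlated
X <- sapply(seq_along(lambda), function(j) qpois(U[, j], lambda[j]))

colMeans(X)
round(cor(X), 2)
```

Under this kind of construction, rho applies to the latent normal variables, so the sample correlations of the resulting counts need not equal rho exactly.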

We can also generate correlated binary data by specifying the probabilities:

genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary", rho = .8, corstr = "cs", wide = TRUE)
##         id V1 V2 V3
##    1:    1  1  1  1
##    2:    2  0  0  1
##    3:    3  0  0  1
##    4:    4  0  1  0
##    5:    5  0  0  1
##   ---              
##  996:  996  0  0  0
##  997:  997  1  1  1
##  998:  998  0  0  1
##  999:  999  0  0  1
## 1000: 1000  1  1  1
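
As with the Poisson example, the result can be checked against the inputs. This sketch stores the binary data in dxBin (an illustrative name) and estimates the marginal proportions and the correlation matrix.

```r
library(simstudy)
library(data.table)

dxBin <- genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary",
                   rho = .8, corstr = "cs", wide = TRUE)

# estimated marginal probabilities (targets: 0.3, 0.5, 0.7)
dxBin[, lapply(.SD, mean), .SDcols = c("V1", "V2", "V3")]

# estimated correlation matrix
round(cor(as.matrix(dxBin[, .(V1, V2, V3)])), 2)
```

Because the three marginal probabilities differ, the attainable Pearson correlation between any pair of binary variables is bounded below 1, so sample correlations smaller than rho = 0.8 would not be surprising here.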

The gamma distribution requires two parameters: the mean and the dispersion. (These are converted into the more commonly used shape and rate parameters.)

dx <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = TRUE, cnames="a, b, c")
dx
##         id          a         b         c
##    1:    1  3.2210286  6.603092  9.433625
##    2:    2  4.8227247 11.670059  2.779509
##    3:    3  0.3718405  1.080022  3.350224
##    4:    4 11.2559658 18.605721  9.422487
##    5:    5  9.3505345  6.911211 14.987907
##   ---                                    
##  996:  996 11.6806256 16.844255 29.036638
##  997:  997  5.6878312  8.917821  1.545817
##  998:  998  1.0845072  1.505709  5.368722
##  999:  999  5.4031168  9.251644  9.703376
## 1000: 1000  5.6008289  4.733656 10.395727
round(cor(as.matrix(dx[, .(a, b, c)])), 2)
##      a    b    c
## a 1.00 0.65 0.66
## b 0.65 1.00 0.67
## c 0.66 0.67 1.00
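
The conversion from mean and dispersion to shape and rate can be sketched in base R. The helper meanDispToShapeRate is hypothetical, and the sketch assumes the parameterization variance = dispersion * mean^2, which is my understanding of the simstudy convention.

```r
# Hypothetical helper: convert (mean, dispersion) to (shape, rate),
# assuming variance = dispersion * mean^2.
meanDispToShapeRate <- function(mean, dispersion) {
  variance <- dispersion * mean^2
  list(shape = mean^2 / variance, rate = mean / variance)
}

sr <- meanDispToShapeRate(mean = c(8, 10, 12), dispersion = 1)
sr

# check the first variable: the mean should be near 8, the variance near 64
set.seed(3)
x <- rgamma(1e5, shape = sr$shape[1], rate = sr$rate[1])
c(mean = mean(x), var = var(x))
```

Under this assumption the shape is simply 1 / dispersion, so a dispersion of 1 corresponds to an exponential distribution with the given mean.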

These data sets can be generated in either wide or long form. So far, we have generated wide-form data, with one row per unique id. In long form, the correlated values are placed on separate rows, so each id appears on multiple rows as repeated measurements:

dx <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = FALSE, cnames="NewCol")
dx
##         id period    NewCol
##    1:    1      0 2.7330362
##    2:    1      1 0.4607286
##    3:    1      2 0.7506392
##    4:    2      0 1.2700871
##    5:    2      1 2.7766983
##   ---                      
## 2996:  999      1 1.0962913
## 2997:  999      2 4.7320269
## 2998: 1000      0 0.2802979
## 2999: 1000      1 4.2086395
## 3000: 1000      2 2.2394438
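
The relationship between the two layouts can be illustrated with data.table's melt. The sketch below reshapes a small hand-built wide table (the first three rows of the Poisson output shown earlier) into the id/period/value layout; it illustrates the layout only and is not how genCorGen produces long data.

```r
library(data.table)

# three ids in wide form: one row per id, one column per measurement
wide <- data.table(id = 1:3,
                   V1 = c(4, 8, 4),
                   V2 = c(7, 10, 9),
                   V3 = c(10, 11, 9))

# long form: one row per id and period
long <- melt(wide, id.vars = "id", variable.name = "period", value.name = "NewCol")
long[, period := as.integer(period) - 1L]  # factor levels V1..V3 become 0..2
setorder(long, id, period)

long
```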

addCorGen allows us to add correlated data to an existing data set, as addCorData does for normal data. With addCorGen, the parameter (or parameters) that define the distribution must exist as a field (or fields) in the data set, so they can differ from one id to the next. In the example below, we generate three sets of correlated data (Poisson, binary, and gamma) with means that are a function of the variable xbase, which varies by id.

First we define the data and generate a data set:

def <- defData(varname = "xbase", formula = 5, variance = .2, dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = ".5 + .1*xbase", dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2 + .3*xbase", dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = ".5 + .2*xbase", dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1, dist="nonrandom")

dt <- genData(10000, def)
dt
##          cid    xbase   lambda         p  gammaMu gammaDis
##     1:     1 4.942294 2.702641 0.3734811 4.430263        1
##     2:     2 5.399692 2.829130 0.4061046 4.854657        1
##     3:     3 3.313523 2.296422 0.2677745 3.198572        1
##     4:     4 8.754599 3.956896 0.6516681 9.496467        1
##     5:     5 4.673274 2.630904 0.3547973 4.198196        1
##    ---                                                    
##  9996:  9996 5.182748 2.768415 0.3905084 4.648524        1
##  9997:  9997 6.473087 3.149705 0.4854856 6.017172        1
##  9998:  9998 5.698362 2.914902 0.4278836 5.153481        1
##  9999:  9999 5.789669 2.941639 0.4346020 5.248455        1
## 10000: 10000 5.932624 2.983993 0.4451682 5.400680        1
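
The nonrandom fields are deterministic functions of xbase, with the link applied to the formula. As a check, this base R sketch recomputes lambda, p, and gammaMu for the first row (xbase = 4.942294) using the inverse log and inverse logit links.

```r
xbase <- 4.942294  # first row of the output above

lambda  <- exp(0.5 + 0.1 * xbase)     # log link
p       <- plogis(-2 + 0.3 * xbase)   # logit link
gammaMu <- exp(0.5 + 0.2 * xbase)     # log link

round(c(lambda = lambda, p = p, gammaMu = gammaMu), 4)
```

The results agree with the first row of the table (2.702641, 0.3734811, 4.430263) up to rounding.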

The Poisson distribution has a single parameter, lambda:

dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .1, corstr = "cs",
                    dist = "poisson", param1 = "lambda", cnames = "a, b, c")
dtX1
##          cid    xbase   lambda         p  gammaMu gammaDis a b c
##     1:     1 4.942294 2.702641 0.3734811 4.430263        1 4 2 2
##     2:     2 5.399692 2.829130 0.4061046 4.854657        1 1 0 2
##     3:     3 3.313523 2.296422 0.2677745 3.198572        1 2 1 3
##     4:     4 8.754599 3.956896 0.6516681 9.496467        1 3 5 1
##     5:     5 4.673274 2.630904 0.3547973 4.198196        1 1 1 4
##    ---                                                          
##  9996:  9996 5.182748 2.768415 0.3905084 4.648524        1 2 0 2
##  9997:  9997 6.473087 3.149705 0.4854856 6.017172        1 2 5 2
##  9998:  9998 5.698362 2.914902 0.4278836 5.153481        1 1 3 4
##  9999:  9999 5.789669 2.941639 0.4346020 5.248455        1 4 1 5
## 10000: 10000 5.932624 2.983993 0.4451682 5.400680        1 8 5 4

The Bernoulli (binary) distribution has a single parameter, p:

dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "ar1",
                    dist = "binary", param1 = "p")
dtX2
##          cid    xbase   lambda         p  gammaMu gammaDis V1 V2 V3 V4
##     1:     1 4.942294 2.702641 0.3734811 4.430263        1  0  0  0  0
##     2:     2 5.399692 2.829130 0.4061046 4.854657        1  0  0  0  0
##     3:     3 3.313523 2.296422 0.2677745 3.198572        1  0  0  0  0
##     4:     4 8.754599 3.956896 0.6516681 9.496467        1  1  0  1  1
##     5:     5 4.673274 2.630904 0.3547973 4.198196        1  0  0  0  1
##    ---                                                                
##  9996:  9996 5.182748 2.768415 0.3905084 4.648524        1  1  0  0  0
##  9997:  9997 6.473087 3.149705 0.4854856 6.017172        1  1  0  1  0
##  9998:  9998 5.698362 2.914902 0.4278836 5.153481        1  0  0  0  1
##  9999:  9999 5.789669 2.941639 0.4346020 5.248455        1  1  1  1  1
## 10000: 10000 5.932624 2.983993 0.4451682 5.400680        1  0  1  0  0

The gamma distribution has two parameters; in simstudy, these are specified as the mean and the dispersion:

dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "cs",
                  dist = "gamma", param1 = "gammaMu", param2 = "gammaDis")
dtX3
##          cid    xbase   lambda         p  gammaMu gammaDis         V1
##     1:     1 4.942294 2.702641 0.3734811 4.430263        1  4.5827763
##     2:     2 5.399692 2.829130 0.4061046 4.854657        1  6.3696259
##     3:     3 3.313523 2.296422 0.2677745 3.198572        1  2.8339987
##     4:     4 8.754599 3.956896 0.6516681 9.496467        1  1.6840195
##     5:     5 4.673274 2.630904 0.3547973 4.198196        1  3.8304399
##    ---                                                               
##  9996:  9996 5.182748 2.768415 0.3905084 4.648524        1  1.9510948
##  9997:  9997 6.473087 3.149705 0.4854856 6.017172        1  0.5384673
##  9998:  9998 5.698362 2.914902 0.4278836 5.153481        1  1.4544040
##  9999:  9999 5.789669 2.941639 0.4346020 5.248455        1  1.7798068
## 10000: 10000 5.932624 2.983993 0.4451682 5.400680        1 10.8308558
##                V2         V3          V4
##     1:  6.9349628  1.2167647  0.81201729
##     2:  4.9414240  8.0505113  2.56651408
##     3:  0.1976552  1.7561377  0.65642144
##     4: 19.1556327  6.0133948  8.46809194
##     5:  2.8191067  0.6836075 11.66770192
##    ---                                  
##  9996:  0.2791214  0.6791530  0.02628097
##  9997:  4.1095608  2.1455912  1.47861097
##  9998:  5.8319255  0.4525492  1.79852344
##  9999:  0.2417584  4.8559813  8.55681650
## 10000:  4.7053416 15.9349774  7.58807488

If the existing data are in long form (e.g. longitudinal data), addCorGen will recognize the structure and add a single new column:

def <- defData(varname = "xbase", formula = 5, variance = .4, dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3, dist = "noZeroPoisson")

def2 <- defDataAdd(varname = "lambda", formula = ".5+.5*period + .1*xbase", dist="nonrandom", link = "log")

dt <- genData(1000, def)

dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)

dtLong
##        cid period    xbase nperiods timeID    lambda
##    1:    1      0 8.854073        1      1  3.996453
##    2:    1      1 8.854073        1      2  6.589037
##    3:    1      2 8.854073        1      3 10.863486
##    4:    2      0 3.416826        2      4  2.320268
##    5:    2      1 3.416826        2      5  3.825475
##   ---                                               
## 2996:  999      1 1.194655        2   2996  3.063216
## 2997:  999      2 1.194655        2   2997  5.050390
## 2998: 1000      0 2.815540        3   2998  2.184865
## 2999: 1000      1 2.815540        3   2999  3.602233
## 3000: 1000      2 2.815540        3   3000  5.939078
### Generate the data 

dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3, rho = .6, corstr = "cs",
                  dist = "poisson", param1 = "lambda", cnames = "NewPois")
dtX3
##        cid period    xbase nperiods timeID    lambda NewPois
##    1:    1      0 8.854073        1      1  3.996453       5
##    2:    1      1 8.854073        1      2  6.589037       8
##    3:    1      2 8.854073        1      3 10.863486       7
##    4:    2      0 3.416826        2      4  2.320268       4
##    5:    2      1 3.416826        2      5  3.825475       5
##   ---                                                       
## 2996:  999      1 1.194655        2   2996  3.063216       3
## 2997:  999      2 1.194655        2   2997  5.050390       5
## 2998: 1000      0 2.815540        3   2998  2.184865       3
## 2999: 1000      1 2.815540        3   2999  3.602233       4
## 3000: 1000      2 2.815540        3   3000  5.939078       5

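We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. The initial regression estimates printed by gee and the estimated working correlation closely match the data-generating parameters (0.5 for the intercept, 0.5 for period, 0.1 for xbase, and rho = 0.6):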

library(gee)

geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid, family = poisson, corstr = "exchangeable")
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
## (Intercept)      period       xbase 
##  0.51949114  0.49858066  0.09860462
round(summary(geefit)$working.correlation, 2)
##      [,1] [,2] [,3]
## [1,] 1.00 0.58 0.58
## [2,] 0.58 1.00 0.58
## [3,] 0.58 0.58 1.00
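
To make the comparison concrete, the sketch below sets the initial estimates printed above next to the coefficients used in the lambda formula. The values in est are copied from that output; truth and est are illustrative names.

```r
# coefficients used to generate lambda (log scale)
truth <- c("(Intercept)" = 0.5, period = 0.5, xbase = 0.1)

# initial regression estimates copied from the gee output above
est <- c("(Intercept)" = 0.51949114, period = 0.49858066, xbase = 0.09860462)

round(cbind(truth, est, diff = est - truth), 3)

# multiplicative change in the expected count per period
exp(est["period"])
```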