Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set. Second, the user generates the data, using the definitions in the first step. Additional functionality exists to simulate observed or randomized treatment assignment/exposures, to generate survival data, to create longitudinal/panel data, to create multi-level/hierarchical data, to create data sets with correlated variables based on a specified covariance structure, to merge data sets, and to create data sets with missing data.
The key to simulating data in simstudy is the creation of a series of data definition tables that look like this:
varname | formula | variance | dist | link |
---|---|---|---|---|
nr | 7 | 0 | nonrandom | identity |
x1 | 10;20 | 0 | uniform | identity |
y1 | nr + x1 * 2 | 8 | normal | identity |
y2 | nr - 0.2 * x1 | 0 | poisson | log |
xCat | 0.3;0.2;0.5 | 0 | categorical | identity |
g1 | 5+xCat | 1 | gamma | log |
a1 | -3 + xCat | 0 | binary | logit |
These definition tables can be generated in two ways. One option is to use any external editor that allows the creation of csv files, which can be read in with a call to defRead. An alternative is to make repeated calls to the function defData. Here, we illustrate the R code that builds this definition table internally:
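For the csv route, the following is a minimal sketch rather than a definitive recipe. It assumes the csv uses the same five column names as the table above and that defRead accepts a file path plus an id argument; the file location (a temporary file) and the three-row subset of definitions are illustrative choices, and the defRead call is skipped if simstudy is not installed.

```r
# Write a small definition table to a csv file (temporary path, for illustration)
csv_path <- tempfile(fileext = ".csv")
defs <- data.frame(
  varname  = c("nr", "x1", "y1"),
  formula  = c("7", "10;20", "nr + x1 * 2"),
  variance = c(0, 0, 8),
  dist     = c("nonrandom", "uniform", "normal"),
  link     = c("identity", "identity", "identity")
)
write.csv(defs, csv_path, row.names = FALSE)

# Read the definitions back in (assumed signature: file path, then id name)
if (requireNamespace("simstudy", quietly = TRUE)) {
  def_csv <- simstudy::defRead(csv_path, id = "idnum")
  print(def_csv)
}
```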
def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1",
link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1,
link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat",
link = "logit")
The first call to defData without specifying a definition name (in this example the definition name is def) creates a new data.table with a single row. An additional row is added to the table def each time the function defData is called. Each of these calls is the definition of a new field in the data set that will be generated. In this example, the first data field is named ‘nr’ and is defined as a constant with a value of 7. In each call to defData the user defines a variable name, a distribution (the default is ‘normal’), a mean formula (if applicable), a variance parameter (if applicable), and a link function for the mean (defaults to ‘identity’).
The possible distributions include normal, gamma, poisson, zero-truncated poisson, binary, uniform, categorical, and deterministic/non-random. For all of these distributions, key parameters defining the distribution are entered in the formula, variance, and link fields.
In the case of the normal and gamma distributions, the formula specifies the mean. The formula can be a scalar value (number) or a string that represents a function of previously defined variables in the data set definition (or, as we will see later, in a previously generated data set). In the example, the mean of y1, a normally distributed value, is declared as a linear function of nr and x1, and the mean of g1 is a function of the category defined by xCat. The variance field is defined only for normal and gamma random variables, and can only be defined as a scalar value. In the case of gamma random variables, the value entered in the variance field is really a dispersion value \(d\), where the actual variance will be \(d \times mean^2\).
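The relationship between dispersion and variance can be checked with a short base R sketch. The shape/scale mapping below is an assumption chosen only because it reproduces a mean of mu and a variance of d × mu²; it is not claimed to be how simstudy generates gamma values internally, and the values of mu and d are illustrative.

```r
# Gamma draws with target mean mu and dispersion d, so that variance = d * mu^2
set.seed(1)
mu <- 20
d  <- 0.5

shape <- 1 / d        # assumed mapping: shape = 1 / dispersion
scale <- mu * d       # assumed mapping: scale = mean * dispersion

theory_mean <- shape * scale      # equals mu
theory_var  <- shape * scale^2    # equals d * mu^2

x <- rgamma(1e5, shape = shape, scale = scale)
c(sample_mean = mean(x), sample_var = var(x))
```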
In the case of the poisson, zero-truncated poisson, and binary distributions, the formula also specifies the mean. The variance is not a valid parameter in these cases, but the link field is. The default link is ‘identity’, but a ‘log’ link is available for the poisson distributions and a ‘logit’ link is available for binary outcomes. In this example, y2 is defined as a poisson random variable with a mean that is a function of nr and x1 on the log scale. For binary variables, which take a value of 0 or 1, the formula represents the probability (with the ‘identity’ link) or the log odds (with the ‘logit’ link) of the variable having a value of 1. In the example, a1 has been defined as a binary random variable with log odds that are a function of xCat.
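To make the link functions concrete, the sketch below back-transforms the two formulas from the example by hand in base R. It only illustrates the arithmetic of the log and logit links; the chosen values of x1 (the two ends of its uniform range) are for illustration.

```r
# Log link: the formula is the log of the Poisson mean
nr <- 7
x1 <- c(10, 20)                      # endpoints of the uniform range for x1
pois_mean <- exp(nr - 0.2 * x1)      # mean of y2 at x1 = 10 and x1 = 20

# Logit link: the formula is the log odds that a1 equals 1
xCat <- 1:3
p_a1 <- plogis(-3 + xCat)            # inverse logit: exp(z) / (1 + exp(z))

round(pois_mean, 1)
round(p_a1, 3)
```

The mean of y2 falls from roughly 148 to roughly 20 as x1 moves from 10 to 20, and the probability that a1 equals 1 reaches 0.5 only in category 3, where the log odds are 0.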
Variables defined with a uniform, categorical, or deterministic/non-random distribution are specified using the formula only. The variance and link fields are not used in these cases.
For a uniformly distributed variable, the formula is a string with the format “a;b”, where a and b are scalars or functions of previously defined variables. The uniform distribution has two parameters: the minimum and the maximum. In this case, a represents the minimum and b represents the maximum.
For a categorical variable with \(k\) categories, the formula is a string of probabilities that sum to 1: “\(p_1 ; p_2 ; ... ; p_k\)”. \(p_1\) is the probability of the random variable falling in category 1, \(p_2\) is the probability of category 2, etc. The probabilities can be specified as functions of other variables previously defined. In the example, xCat has three possibilities with probabilities 0.3, 0.2, and 0.5, respectively.
Non-random variables are defined by the formula. Since these variables are deterministic, variance is not relevant. They can be functions of previously defined variables or a scalar, as we see in the example for the variable nr.
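As a rough base R analogue of these three formula-only distributions, the sketch below draws values that follow the same specifications as x1, xCat, and nr. It is meant only to show what the formulas mean, under the assumption that categories are labelled 1 to k; it does not reproduce simstudy's own generators.

```r
set.seed(2)
n <- 1000

x1   <- runif(n, min = 10, max = 20)            # formula "10;20": minimum and maximum
xCat <- sample(1:3, size = n, replace = TRUE,
               prob = c(0.3, 0.2, 0.5))          # formula "0.3;0.2;0.5": category probabilities
nr   <- rep(7, n)                                # formula 7: a constant

range(x1)
prop.table(table(xCat))                          # proportions should be near 0.3, 0.2, 0.5
```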
After the data set definitions have been created, a new data set with \(n\) observations can be created with a call to the function genData. In this example, 1,000 observations are generated using the data set definitions in def, and then stored in the object dt:
dt <- genData(1000, def)
dt
## idnum nr x1 y1 y2 xCat g1 a1
## 1: 1 7 18.71470 48.13110 25 1 1104.82145 0
## 2: 2 7 12.63977 34.82680 87 2 222.26269 1
## 3: 3 7 13.21247 34.96022 80 1 289.08795 0
## 4: 4 7 19.21613 38.93975 17 3 1218.53035 0
## 5: 5 7 10.70988 24.16021 148 1 1011.98398 0
## ---
## 996: 996 7 12.69114 34.43474 88 3 1007.88648 0
## 997: 997 7 11.48129 31.34903 108 3 4146.61939 0
## 998: 998 7 16.88184 41.60436 45 1 27.90073 0
## 999: 999 7 10.24263 25.36589 151 3 4626.50014 0
## 1000: 1000 7 12.72076 33.53079 78 1 37.97592 0
New data can be added to an existing data set with a call to the function addColumns. The new data definitions are created with a call to defDataAdd and then included as an argument in the call to addColumns:
addef <- defDataAdd(varname = "zExtra", dist = "normal", formula = "3 + y1",
variance = 2)
dt <- addColumns(addef, dt)
dt
## idnum nr x1 y1 y2 xCat g1 a1 zExtra
## 1: 1 7 18.71470 48.13110 25 1 1104.82145 0 51.82939
## 2: 2 7 12.63977 34.82680 87 2 222.26269 1 38.09525
## 3: 3 7 13.21247 34.96022 80 1 289.08795 0 37.20631
## 4: 4 7 19.21613 38.93975 17 3 1218.53035 0 41.14921
## 5: 5 7 10.70988 24.16021 148 1 1011.98398 0 26.37784
## ---
## 996: 996 7 12.69114 34.43474 88 3 1007.88648 0 36.83071
## 997: 997 7 11.48129 31.34903 108 3 4146.61939 0 37.31045
## 998: 998 7 16.88184 41.60436 45 1 27.90073 0 45.54451
## 999: 999 7 10.24263 25.36589 151 3 4626.50014 0 30.51333
## 1000: 1000 7 12.72076 33.53079 78 1 37.97592 0 35.63998
Treatment assignment can be accomplished through the original data generation process, using defData and genData. However, the functions trtAssign and trtObserve provide more options to generate treatment assignment.
Treatment assignment can simulate how treatment is assigned in a randomized study. Assignment to treatment groups can be (close to) balanced (as would occur in a block randomized trial); this balancing can be done with or without strata. Alternatively, the assignment can be left to chance without blocking; in this case, balance across treatment groups is not guaranteed, particularly with small sample sizes.
First, create the data definition:
def <- defData(varname = "male", dist = "binary", formula = 0.5, id = "cid")
def <- defData(def, varname = "over65", dist = "binary", formula = "-1.7 + .8*male",
link = "logit")
def <- defData(def, varname = "baseDBP", dist = "normal", formula = 70, variance = 40)
dtstudy <- genData(330, def)
Balanced treatment assignment, stratified by gender and age category (not blood pressure)
study1 <- trtAssign(dtstudy, n = 3, balanced = TRUE, strata = c("male", "over65"),
grpName = "rxGrp")
study1
## cid rxGrp male over65 baseDBP
## 1: 1 1 1 0 70.51994
## 2: 2 3 1 0 68.37788
## 3: 3 2 1 0 68.91449
## 4: 4 3 1 1 58.65074
## 5: 5 3 1 1 59.28269
## ---
## 326: 326 1 1 0 61.63004
## 327: 327 2 0 0 78.49473
## 328: 328 2 0 0 71.21033
## 329: 329 3 1 0 71.90673
## 330: 330 2 0 0 67.42199
Balanced treatment assignment (without stratification)
study2 <- trtAssign(dtstudy, n = 3, balanced = TRUE, grpName = "rxGrp")
Random (unbalanced) treatment assignment
study3 <- trtAssign(dtstudy, n = 3, balanced = FALSE, grpName = "rxGrp")
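The difference between balanced and purely random assignment can be seen in a small base R sketch. This is an illustration of the idea only, not the code trtAssign runs: balanced assignment is mimicked by shuffling an equal allocation, and random assignment by independent draws.

```r
set.seed(3)
n <- 330

# Balanced: shuffle a vector that already holds equal numbers of each group
balanced <- sample(rep(1:3, length.out = n))

# Unbalanced: each subject is assigned independently with equal probability
unbalanced <- sample(1:3, size = n, replace = TRUE)

table(balanced)      # exactly 110 per group
table(unbalanced)    # group sizes vary by chance
```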
Comparison of three treatment assignment mechanisms
If exposure or treatment is observed (rather than randomly assigned), use trtObserve to generate groups. There may be any number of possible exposure or treatment groups, and the probability of exposure to a specific level can depend on covariates already in the data set. In this case, there are three exposure groups that vary by gender and age:
formula1 <- c("-2 + 2*male - .5*over65", "-1 + 2*male + .5*over65")
dtExp <- trtObserve(dtstudy, formulas = formula1, logit.link = TRUE, grpName = "exposure")
Here are the exposure distributions by gender and age:
Here is a second case of three exposures where the exposure is independent of any covariates. Note that specifying the formula as c(.35, .45) is the same as specifying it as c(.35, .45, .20). Also, when referring to probabilities, the identity link is used:
formula2 <- c(0.35, 0.45)
dtExp2 <- trtObserve(dtstudy, formulas = formula2, logit.link = FALSE, grpName = "exposure")
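The implied third probability is simply the remainder after the listed probabilities are summed. The base R sketch below shows that arithmetic and draws group labels with the completed vector; the labels 1 to 3 and the use of sample are assumptions made for illustration, not a description of trtObserve internals.

```r
probs <- c(0.35, 0.45)

# The last group receives whatever probability is left over
full_probs <- c(probs, 1 - sum(probs))
full_probs

set.seed(4)
exposure <- sample(seq_along(full_probs), size = 330,
                   replace = TRUE, prob = full_probs)
prop.table(table(exposure))
```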
Time-to-event data, including both survival and censoring times, are created using the functions defSurv and genSurv. The survival data definitions require a variable name as well as a specification of a scale value, which determines the mean survival time at a baseline level of covariates (i.e., all covariates set to 0). The Weibull distribution is used to generate these survival times. In addition, covariates (which have been defined previously) that influence survival time can be included in the formula field. Positive coefficients are associated with longer survival times (and lower hazard rates). Finally, the shape of the distribution can be specified. A shape value of 1 reflects the exponential distribution.
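The statement about a shape of 1 can be verified with base R's own Weibull functions. Note that the scale used below is base R's Weibull scale parameter; whether it maps one-to-one onto the scale argument of defSurv is not assumed here, and the time points are arbitrary.

```r
times <- c(1, 10, 50, 100)
wscale <- 80

# With shape = 1, the Weibull CDF coincides with the exponential CDF (rate = 1 / scale)
cdf_weibull <- pweibull(times, shape = 1, scale = wscale)
cdf_exp     <- pexp(times, rate = 1 / wscale)

all.equal(cdf_weibull, cdf_exp)
```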
# Baseline data definitions
def <- defData(varname = "x1", formula = 0.5, dist = "binary")
def <- defData(def, varname = "x2", formula = 0.5, dist = "binary")
def <- defData(def, varname = "grp", formula = 0.5, dist = "binary")
# Survival data definitions
sdef <- defSurv(varname = "survTime", formula = "1.5*x1", scale = "grp*50 + (1-grp)*25",
shape = "grp*1 + (1-grp)*1.5")
sdef <- defSurv(sdef, varname = "censorTime", scale = 80, shape = 1)
sdef
## varname formula scale shape
## 1: survTime 1.5*x1 grp*50 + (1-grp)*25 grp*1 + (1-grp)*1.5
## 2: censorTime 0 80 1
The data are generated with calls to genData and genSurv:
# Generate the baseline data, then add survival and censoring times
dtSurv <- genData(300, def)
dtSurv <- genSurv(dtSurv, sdef)
head(dtSurv)
## id x1 x2 grp survTime censorTime
## 1: 1 1 1 1 380 21
## 2: 2 1 0 1 162 32
## 3: 3 0 1 0 0 539
## 4: 4 0 1 1 26 11
## 5: 5 0 0 1 56 17
## 6: 6 0 0 1 169 84
# A comparison of survival by group and x1
dtSurv[, round(mean(survTime), 1), keyby = .(grp, x1)]
## grp x1 V1
## 1: 0 0 9.8
## 2: 0 1 22.1
## 3: 1 0 53.8
## 4: 1 1 221.1
Observed survival times and censoring indicators can be generated by defining new fields:
cdef <- defDataAdd(varname = "obsTime", formula = "pmin(survTime, censorTime)",
dist = "nonrandom")
cdef <- defDataAdd(cdef, varname = "status", formula = "I(survTime <= censorTime)",
dist = "nonrandom")
dtSurv <- addColumns(cdef, dtSurv)
head(dtSurv)
## id x1 x2 grp survTime censorTime obsTime status
## 1: 1 1 1 1 380 21 21 0
## 2: 2 1 0 1 162 32 32 0
## 3: 3 0 1 0 0 539 0 1
## 4: 4 0 1 1 26 11 11 0
## 5: 5 0 0 1 56 17 17 0
## 6: 6 0 0 1 169 84 84 0
# estimate proportion of censoring by x1 and group
dtSurv[, round(1 - mean(status), 2), keyby = .(grp, x1)]
## grp x1 V1
## 1: 0 0 0.09
## 2: 0 1 0.21
## 3: 1 0 0.39
## 4: 1 1 0.81
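The two derived fields are simple element-wise calculations. As a check, the base R sketch below applies the same two formulas to the six survTime and censorTime values printed by head(dtSurv) above; the vector names are illustrative.

```r
# Values copied from the first six rows of dtSurv shown above
surv_time   <- c(380, 162, 0, 26, 56, 169)
censor_time <- c(21, 32, 539, 11, 17, 84)

obs_time <- pmin(surv_time, censor_time)             # whichever comes first
status   <- as.integer(surv_time <= censor_time)     # 1 = event observed, 0 = censored

data.frame(surv_time, censor_time, obs_time, status)
```

Only the third subject has an observed event, which matches the status column in the printed output.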
Here is a Kaplan-Meier plot of the data by the four groups:
To simulate longitudinal data, we start with a ‘cross-sectional’ data set and convert it to a time-dependent data set. The original cross-sectional data set may or may not include time-dependent data in the columns. In the next example, we measure outcome Y once before and twice after intervention T in a randomized trial:
tdef <- defData(varname = "T", dist = "binary", formula = 0.5)
tdef <- defData(tdef, varname = "Y0", dist = "normal", formula = 10, variance = 1)
tdef <- defData(tdef, varname = "Y1", dist = "normal", formula = "Y0 + 5 + 5 * T",
variance = 1)
tdef <- defData(tdef, varname = "Y2", dist = "normal", formula = "Y0 + 10 + 5 * T",
variance = 1)
dtTrial <- genData(500, tdef)
dtTrial
## id T Y0 Y1 Y2
## 1: 1 0 9.183977 13.94165 17.27805
## 2: 2 1 8.643123 18.82474 22.37659
## 3: 3 1 10.324793 19.35620 25.71761
## 4: 4 0 10.282520 16.10805 20.92872
## 5: 5 1 9.657632 20.49713 22.13235
## ---
## 496: 496 0 10.430134 15.00519 20.56333
## 497: 497 0 9.801622 17.45522 19.28969
## 498: 498 1 12.034422 21.22105 28.24692
## 499: 499 1 9.359974 18.83581 25.33157
## 500: 500 0 8.817763 14.18292 20.04614
The data in longitudinal form are created with a call to addPeriods. If the cross-sectional data includes time-dependent data, then the number of periods nPeriods must be the same as the number of time-dependent columns. If a variable is not declared as one of the timevars, it will be repeated each time period. In this example, the treatment indicator T is not specified as a time-dependent variable. (Note: if there are two time-dependent variables, it is best to create two data sets and merge them. This will be shown later in the vignette.)
dtTime <- addPeriods(dtTrial, nPeriods = 3, idvars = "id", timevars = c("Y0",
"Y1", "Y2"), timevarName = "Y")
dtTime
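Conceptually, this conversion stacks the time-dependent columns and repeats everything else. The base R sketch below does this by hand for a two-subject toy data set (values invented for illustration), with periods numbered from 0; it is meant to show the resulting layout, not to reproduce addPeriods, which also adds its own index columns.

```r
# Toy wide data: one row per subject, three time-dependent columns
wide <- data.frame(id = 1:2, T = c(0, 1),
                   Y0 = c(9.2, 8.6), Y1 = c(13.9, 18.8), Y2 = c(17.3, 22.4))
n_periods <- 3

long <- data.frame(
  id     = rep(wide$id, each = n_periods),
  period = rep(0:(n_periods - 1), times = nrow(wide)),
  T      = rep(wide$T, each = n_periods),               # not time-dependent: repeated
  # transpose so that values are read subject by subject, period by period
  Y      = as.vector(t(as.matrix(wide[, c("Y0", "Y1", "Y2")])))
)
long
```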
This is what the longitudinal data look like:
It is also possible to generate longitudinal data with varying numbers of measurement periods as well as varying time intervals between each measurement period. This is done by defining specific variables in the data set that define the number of observations per subject and the average interval time between each observation. nCount defines the number of measurements for an individual; mInterval specifies the average time between measurements for a subject; and vInterval specifies the variance of those interval times. If vInterval is set to 0 or is not defined, the interval for a subject is determined entirely by the mean interval. If vInterval is greater than 0, time intervals are generated using a gamma distribution with the mean and dispersion specified.
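One way to picture this mechanism is the base R sketch below, which builds measurement times for a single subject. The details are assumptions made for illustration: the gamma parameterization (shape = 1/dispersion, scale = mean × dispersion), the rounding of gaps to whole time units, and the first measurement at time 0 are not taken from simstudy's source.

```r
set.seed(5)
n_count    <- 5      # number of measurements for this subject
m_interval <- 30     # mean time between measurements
v_interval <- 0.07   # dispersion of the interval times

gaps <- if (v_interval > 0) {
  # random gaps with mean m_interval and variance v_interval * m_interval^2
  rgamma(n_count - 1, shape = 1 / v_interval, scale = m_interval * v_interval)
} else {
  # no dispersion: every gap equals the mean interval
  rep(m_interval, n_count - 1)
}

times <- c(0, cumsum(round(gaps)))   # measurement times, starting at 0
times
```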
In this simple example, the cross-sectional data generates individuals with a different number of measurement observations and different times between each observation. Data for two of these individuals is printed:
def <- defData(varname = "xbase", dist = "normal", formula = 20, variance = 3)
def <- defData(def, varname = "nCount", dist = "noZeroPoisson", formula = 6)
def <- defData(def, varname = "mInterval", dist = "gamma", formula = 30, variance = 0.01)
def <- defData(def, varname = "vInterval", dist = "nonrandom", formula = 0.07)
dt <- genData(200, def)
dt[id %in% c(8, 121)] # View individuals 8 and 121
## id xbase nCount mInterval vInterval
## 1: 8 18.24515 5 33.84171 0.07
## 2: 121 17.33932 5 27.60829 0.07
The resulting longitudinal data for these two subjects can be inspected after a call to addPeriods. Notice that no parameters need to be set since all information resides in the data set itself:
dtPeriod <- addPeriods(dt)
dtPeriod[id %in% c(8, 121)] # View individuals 8 and 121 only
## id period xbase time timeID
## 1: 8 0 18.24515 0 49
## 2: 8 1 18.24515 56 50
## 3: 8 2 18.24515 107 51
## 4: 8 3 18.24515 151 52
## 5: 8 4 18.24515 180 53
## 6: 121 0 17.33932 0 753
## 7: 121 1 17.33932 14 754
## 8: 121 2 17.33932 37 755
## 9: 121 3 17.33932 66 756
## 10: 121 4 17.33932 94 757
If a time sensitive measurement is added to the data set …
def2 <- defDataAdd(varname = "Y", dist = "normal", formula = "15 + .1 * time",
variance = 5)
dtPeriod <- addColumns(def2, dtPeriod)
… a plot of five randomly selected individuals looks like this:
The function genCluster generates multilevel or clustered data based on a previously generated data set that is one “level” up from the clustered data. For example, if there is a data set that contains school level data (considered here to be level 2), classrooms (level 1) can be generated. And then, students (now level 1) can be generated within classrooms (now level 2).
In the example here, we do in fact generate school, class, and student level data. There are eight schools, four of which are randomized to receive an intervention. The number of classes per school varies, as does the number of students per class. (It is straightforward to generate fully balanced data by using constant values.) The outcome of interest is a test score, which is influenced by gender and the intervention. In addition, test scores vary by schools, and by classrooms, so the simulation provides random effects at each of these levels.
We start by defining the school level data:
gen.school <- defData(varname = "s0", dist = "normal", formula = 0, variance = 3,
id = "idSchool")
gen.school <- defData(gen.school, varname = "nClasses", dist = "noZeroPoisson",
formula = 3)
dtSchool <- genData(8, gen.school)
dtSchool <- trtAssign(dtSchool, n = 2)
dtSchool
The classroom level records are generated with a call to genCluster, and then the classroom level variables are added by a call to addColumns:
gen.class <- defDataAdd(varname = "c0", dist = "normal", formula = 0, variance = 2)
gen.class <- defDataAdd(gen.class, varname = "nStudents", dist = "noZeroPoisson",
formula = 20)
dtClass <- genCluster(dtSchool, "idSchool", numIndsVar = "nClasses", level1ID = "idClass")
dtClass <- addColumns(gen.class, dtClass)
head(dtClass, 10)
## idSchool trtGrp s0 nClasses idClass c0 nStudents
## 1: 1 0 4.507355 3 1 -1.7030717 19
## 2: 1 0 4.507355 3 2 0.9972415 19
## 3: 1 0 4.507355 3 3 0.6907191 19
## 4: 2 1 -1.774387 2 4 1.2638098 17
## 5: 2 1 -1.774387 2 5 -0.2549515 27
## 6: 3 1 -2.245730 5 6 -0.1392407 16
## 7: 3 1 -2.245730 5 7 0.9852097 17
## 8: 3 1 -2.245730 5 8 0.0693371 20
## 9: 3 1 -2.245730 5 9 -0.6584024 17
## 10: 3 1 -2.245730 5 10 -1.1545564 8
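The expansion from schools to classrooms amounts to repeating each school row as many times as it has classes and then numbering the new rows. The base R sketch below mimics that step for the first three schools in the output above (3, 2, and 5 classes); it is an illustration of the idea, not genCluster's implementation.

```r
# Level-2 data: one row per school, with the number of classes in each
schools <- data.frame(idSchool = 1:3, nClasses = c(3, 2, 5))

# Repeat each school row nClasses times to get one row per classroom
classes <- schools[rep(seq_len(nrow(schools)), times = schools$nClasses), ]
classes$idClass <- seq_len(nrow(classes))   # unique level-1 identifier
rownames(classes) <- NULL

classes
```

School level columns are carried down to every classroom row, which is why s0 and trtGrp repeat within a school in the printed data.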
Finally, the student level data are added using the same process:
gen.student <- defDataAdd(varname = "Male", dist = "binary",
formula = 0.5)
gen.student <- defDataAdd(gen.student, varname = "age", dist = "uniform",
formula = "9.5; 10.5")
gen.student <- defDataAdd(gen.student, varname = "test", dist = "normal",
formula = "50 - 5*Male + s0 + c0 + 8 * trtGrp", variance = 2)
dtStudent <- genCluster(dtClass, cLevelVar = "idClass", numIndsVar = "nStudents",
level1ID = "idChild")
dtStudent <- addColumns(gen.student, dtStudent)
This is what the clustered data look like. Each classroom is represented by a box, and each school is represented by a color. The intervention group is highlighted by dark outlines:
After generating a complete data set, it is possible to generate missing data. defMiss defines the parameters of missingness. genMiss generates a missing data matrix of indicators for each field. Indicators are set to 1 if the data are missing for a subject, 0 otherwise. genObs creates a data set that reflects what would have been observed had data been missing; this is a replicate of the original data set with “NAs” replacing values where missing data has been generated.
By controlling the parameters of missingness, it is possible to represent different missing data mechanisms: (1) missing completely at random (MCAR), where the probability of missing data is independent of any covariates, measured or unmeasured, that are associated with the measure, (2) missing at random (MAR), where the probability of a subject missing data is a function only of observed covariates that are associated with the measure, and (3) not missing at random (NMAR), where the probability of missing data is related to unmeasured covariates that are associated with the measure.
These possibilities are illustrated with an example. A data set of 1000 observations with three “outcome” measures, x1, x2, and x3, is defined. This data set also includes two independent predictors, m and u, that largely determine the value of each outcome (subject to random noise).
def1 <- defData(varname = "m", dist = "binary", formula = 0.5)
def1 <- defData(def1, "u", dist = "binary", formula = 0.5)
def1 <- defData(def1, "x1", dist = "normal", formula = "20*m + 20*u", variance = 2)
def1 <- defData(def1, "x2", dist = "normal", formula = "20*m + 20*u", variance = 2)
def1 <- defData(def1, "x3", dist = "normal", formula = "20*m + 20*u", variance = 2)
dtAct <- genData(1000, def1)
In this example, the missing data mechanism is different for each outcome. As defined below, missingness for x1 is MCAR, since the probability of missing is fixed. Missingness for x2 is MAR, since missingness is a function of m, a measured predictor of x2. And missingness for x3 is NMAR, since the probability of missing is dependent on u, an unmeasured predictor of x3:
defM <- defMiss(varname = "x1", formula = 0.15, logit.link = FALSE)
defM <- defMiss(defM, varname = "x2", formula = ".05 + m * 0.25", logit.link = FALSE)
defM <- defMiss(defM, varname = "x3", formula = ".05 + u * 0.25", logit.link = FALSE)
defM <- defMiss(defM, varname = "u", formula = 1, logit.link = FALSE) # not observed
missMat <- genMiss(dtName = dtAct, missDefs = defM, idvars = "id")
dtObs <- genObs(dtAct, missMat, idvars = "id")
missMat
## id x1 x2 x3 u m
## 1: 1 0 1 0 1 0
## 2: 2 0 0 0 1 0
## 3: 3 0 0 0 1 0
## 4: 4 0 0 0 1 0
## 5: 5 0 0 0 1 0
## ---
## 996: 996 0 0 0 1 0
## 997: 997 0 0 0 1 0
## 998: 998 0 0 0 1 0
## 999: 999 0 0 0 1 0
## 1000: 1000 0 0 0 1 0
dtObs
## id m u x1 x2 x3
## 1: 1 0 NA 0.4897648 NA -0.41784137
## 2: 2 0 NA 0.7115354 0.44336010 0.09161338
## 3: 3 0 NA -2.0015986 1.14532572 2.27791736
## 4: 4 1 NA 17.6922028 18.87469486 18.56019618
## 5: 5 1 NA 20.3853462 19.39040578 19.32114600
## ---
## 996: 996 0 NA -2.1475359 -0.16479811 0.31246825
## 997: 997 0 NA 1.1448311 0.08416758 -1.09136051
## 998: 998 1 NA 41.3487531 39.62171697 38.85677524
## 999: 999 0 NA -1.1410528 -0.37949988 0.79642002
## 1000: 1000 0 NA -0.4259718 -0.74783361 -0.63368629
The impact of the various missing data mechanisms on estimation can be seen with a simple calculation of means, using the “true” data set without missing data as a comparison for the “observed” data set. Since x1 is MCAR, the averages for both data sets are roughly equivalent. However, we can see below that estimates for x2 and x3 are biased, as the difference between observed and actual is not close to 0:
# Two functions to calculate means and compare them
rmean <- function(var, digits = 1) {
round(mean(var, na.rm = TRUE), digits)
}
showDif <- function(dt1, dt2, rowName = c("Actual", "Observed", "Difference")) {
dt <- data.frame(rbind(dt1, dt2, dt1 - dt2))
rownames(dt) <- rowName
return(dt)
}
# data.table functionality to estimate means for each data set
meanAct <- dtAct[, .(x1 = rmean(x1), x2 = rmean(x2), x3 = rmean(x3))]
meanObs <- dtObs[, .(x1 = rmean(x1), x2 = rmean(x2), x3 = rmean(x3))]
showDif(meanAct, meanObs)
## x1 x2 x3
## Actual 20.1 20.1 20.1
## Observed 20.2 18.3 18.5
## Difference -0.1 1.8 1.6
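The size of this bias can be anticipated from the definitions alone. The sketch below works through the expected observed mean of x2, using only the formulas given above: x2 has mean 20*m + 20*u, m and u are each 1 with probability 0.5, and x2 is missing with probability 0.05 + 0.25*m. The variable names are illustrative.

```r
p_m       <- c(0.5, 0.5)                   # P(m = 0), P(m = 1)
mean_x2_m <- 20 * c(0, 1) + 20 * 0.5       # E[x2 | m], averaging over u: 10 and 30
p_obs_m   <- 1 - (0.05 + 0.25 * c(0, 1))   # P(x2 observed | m): 0.95 and 0.70

expected_act_mean <- sum(mean_x2_m * p_m)
expected_obs_mean <- sum(mean_x2_m * p_obs_m * p_m) / sum(p_obs_m * p_m)

round(c(actual = expected_act_mean, observed = expected_obs_mean), 1)
```

The expected observed mean is about 18.5 against a true mean of 20, in line with the simulated values of 18.3 for x2 and 18.5 for x3 (the same calculation applies to x3 with u in place of m).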
After adjusting for the measured covariate m, the bias for the estimate of the mean of x2 is mitigated, but not for x3, since u is not observed:
meanActm <- dtAct[, .(x1 = rmean(x1), x2 = rmean(x2), x3 = rmean(x3)), keyby = m]
meanObsm <- dtObs[, .(x1 = rmean(x1), x2 = rmean(x2), x3 = rmean(x3)), keyby = m]
# compare observed and actual when m = 0
showDif(meanActm[m == 0, .(x1, x2, x3)], meanObsm[m == 0, .(x1, x2, x3)])
## x1 x2 x3
## Actual 10.3 10.2 10.3
## Observed 10.3 10.2 8.8
## Difference 0.0 0.0 1.5
# compare observed and actual when m = 1
showDif(meanActm[m == 1, .(x1, x2, x3)], meanObsm[m == 1, .(x1, x2, x3)])
## x1 x2 x3
## Actual 29.9 29.9 29.8
## Observed 30.1 29.3 28.2
## Difference -0.2 0.6 1.6
Missingness can occur, of course, in the context of longitudinal data. defMiss provides two additional arguments that are relevant for these types of data: baseline and monotonic. In the case of variables that are measured at baseline only, a missing value would be reflected throughout the course of the study. In the case where a variable is time-dependent (i.e., it is measured at each time point), it is possible to declare missingness to be monotonic. This means that if a value for this field is missing at time t, then values will also be missing at all times T > t. The call to genMiss must set repeated to TRUE.
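The monotonic rule can be expressed compactly: once an indicator switches to 1, it stays at 1 for all later periods. The base R sketch below applies that rule to a small, invented matrix of missingness indicators (rows are subjects, columns are periods) using a running maximum; it illustrates the definition and is not simstudy's implementation.

```r
# Non-monotonic indicators: 1 = missing, 0 = observed
miss <- matrix(c(0, 1, 0, 0,
                 0, 0, 0, 1,
                 0, 0, 0, 0), nrow = 3, byrow = TRUE)

# Running maximum across periods carries the first missing value forward
mono <- t(apply(miss, 1, cummax))
mono
```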
The following two examples describe an outcome variable y that is measured over time, whose value is a function of time and an observed exposure:
# use baseline definitions from previous example
dtAct <- genData(120, def1)
dtAct <- trtObserve(dtAct, formulas = 0.5, logit.link = FALSE, grpName = "rx")
# add longitudinal data
defLong <- defDataAdd(varname = "y", dist = "normal", formula = "10 + period*2 + 2 * rx",
variance = 2)
dtTime <- addPeriods(dtAct, nPeriods = 4)
dtTime <- addColumns(defLong, dtTime)
In the first case, missingness is not monotonic; a subject might miss a measurement but return for subsequent measurements:
# missingness for y is not monotonic
defMlong <- defMiss(varname = "x1", formula = 0.2, baseline = TRUE)
defMlong <- defMiss(defMlong, varname = "y", formula = "-1.5 - 1.5 * rx + .25*period",
logit.link = TRUE, baseline = FALSE, monotonic = FALSE)
missMatLong <- genMiss(dtName = dtTime, missDefs = defMlong, idvars = c("id",
"rx"), repeated = TRUE, periodvar = "period")
Here is a conceptual plot that shows the pattern of missingness. Each row represents an individual, and each box represents a time period. A box that is colored reflects missing data; a box colored grey reflects observed data. The missingness pattern is shown for two variables, x1 and y:
In the second case, missingness is monotonic; once a subject misses a measurement for y, there are no subsequent measurements:
# missingness for y is monotonic
defMlong <- defMiss(varname = "x1", formula = 0.2, baseline = TRUE)
defMlong <- defMiss(defMlong, varname = "y", formula = "-1.8 - 1.5 * rx + .25*period",
logit.link = TRUE, baseline = FALSE, monotonic = TRUE)
missMatLong <- genMiss(dtName = dtTime, missDefs = defMlong, idvars = c("id",
"rx"), repeated = TRUE, periodvar = "period")