Getting started with clusteredMSM

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Overview

clusteredMSM provides nonparametric analysis of clustered multistate process data, based on Bakoyannis (2021) and Bakoyannis & Bandyopadhyay (2022). It exposes one user-facing function, patp(), that:

Estimates the population-averaged transition probability P(X(t) = j | X(s) = h) using the working-independence Aalen-Johansen estimator.
Returns cluster-bootstrap pointwise confidence intervals and simultaneous bands.
Conducts a two-sample Kolmogorov-Smirnov-type test when the formula has a grouping variable on the right-hand side.
Supports a landmark variant for non-Markov processes.
Natively handles non-monotone (recovery) multistate models.

The package depends only on survival.

Input data format

clusteredMSM expects data in interval format: one row per mutually-exclusive time interval per subject, with columns for interval start time (Tstart), end time (Tstop), starting state (Sstart), and ending state (Sstop). The column names are flexible — they are passed by name through the formula.

Censoring is encoded as Sstart == Sstop on the final row of a subject’s record. Within each subject, intervals must be temporally contiguous (Tstop[k] == Tstart[k+1]) and state contiguous (Sstop[k] == Sstart[k+1]).

A progressive illness-death example

library(clusteredMSM)

# 1=Healthy, 2=Ill, 3=Dead. Allowed: 1->2, 1->3, 2->3.
tmat <- trans_mat(
  list(c(2, 3), 3, integer(0)),
  names = c("Healthy", "Ill", "Dead")
)
tmat
#>          to
#> from      Healthy Ill Dead
#>   Healthy      NA   1    2
#>   Ill          NA  NA    3
#>   Dead         NA  NA   NA

A small example dataset in interval format:

mydata <- data.frame(
  pid     = c(1, 1, 2, 3, 4, 4),
  site    = c(1, 1, 1, 2, 2, 2),
  treatment = c("A", "A", "A", "B", "B", "B"),
  t0      = c(0.0, 1.5, 0.0, 0.0, 0.0, 1.0),
  t1      = c(1.5, 3.0, 2.0, 1.0, 1.0, 2.5),
  s0      = c(1,   2,   1,   1,   1,   2),
  s1      = c(2,   3,   1,   3,   2,   3)
)
mydata
#>   pid site treatment  t0  t1 s0 s1
#> 1   1    1         A 0.0 1.5  1  2
#> 2   1    1         A 1.5 3.0  2  3
#> 3   2    1         A 0.0 2.0  1  1
#> 4   3    2         B 0.0 1.0  1  3
#> 5   4    2         B 0.0 1.0  1  2
#> 6   4    2         B 1.0 2.5  2  3

Subject 1: H -> I -> D. Subject 2: censored healthy at t = 2.0 (s0 == s1). Subject 3: died at t = 1.0. Subject 4: H -> I -> D.

One-sample estimation

fit <- patp(
  msm(t0, t1, s0, s1) ~ 1,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000, cband = TRUE,
  seed = 1
)
fit

fit$curves contains the time-indexed point estimate of \[\begin{eqnarray*} P(X(t) = j \,|\, X(s) = h) &=& P(X(t) = 2 \,|\, X(0) = 1) \\ &=& P(X(t) = 2), \end{eqnarray*}\] (second equality due to the fact that all start at state = 1 in the classical illness-death model), standard error, pointwise confidence interval, and (with cband = TRUE) simultaneous band limits.

Two-sample comparison (estimation + test in one call)

If the formula’s right-hand side names a binary grouping variable, patp() automatically estimates both curves and conducts a Kolmogorov-Smirnov-type test:

tt <- patp(
  msm(t0, t1, s0, s1) ~ treatment,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000,
  seed = 1
)
tt

The $curves slot now has one row block per group. The $test slot contains the observed K-S statistic and the bootstrap p-value.

Recovery models (non-monotone processes)

For processes with cyclic transitions (e.g., illness-death with recovery), use the long event format directly. The same patp() call works:

tmat_rec <- trans_mat(
  list(c(2, 3), c(1, 3), integer(0)),
  names = c("Healthy", "Ill", "Dead")
)

# Subject who went Healthy -> Ill -> Healthy -> censored:
recovery_data <- data.frame(
  pid = c(1, 1, 1),
  t0  = c(0.0, 1.0, 2.0),
  t1  = c(1.0, 2.0, 3.5),
  s0  = c(1,   2,   1),
  s1  = c(2,   1,   1)    # last row: censored healthy
)

patp(msm(t0, t1, s0, s1) ~ 1,
     data = recovery_data, tmat = tmat_rec,
     id = "pid",
     h = 1, j = 2, s = 0,
     B = 500, seed = 1)

Landmark variant

When s > 0 and the Markov assumption is questionable, use LMAJ = TRUE:

patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 1.0,
     LMAJ = TRUE, B = 1000, seed = 1)

Inverse-cluster-size weighting

When cluster size is potentially informative (e.g., sicker centers contribute more patients), use the weighted estimator:

patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 0,
     weighted = TRUE, B = 1000, seed = 1)

weighted = TRUE requires the cluster argument.

What `B = 0` does

Setting B = 0 returns the point estimate without any bootstrap inference. This is permitted only for one-sample formulas; two-sample formulas always require B > 0 (since the entire point of a two-sample analysis is the test). Use B = 0 for fast exploratory work, then switch to B = 1000 for inference.

References

Bakoyannis G (2021). Nonparametric analysis of nonhomogeneous multistate processes with clustered observations. Biometrics 77(2):533-546. doi:10.1111/biom.13327

Bakoyannis G, Bandyopadhyay D (2022). Nonparametric tests for multistate processes with clustered data. Annals of the Institute of Statistical Mathematics 74(5):837-867. doi:10.1007/s10463-021-00819-x

Putter H, Spitoni C (2018). Non-parametric estimation of transition probabilities in non-Markov multi-state models: the landmark Aalen-Johansen estimator. Statistical Methods in Medical Research 27(7):2081-2092. doi:10.1177/0962280216674497

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Getting started with clusteredMSM

Overview

Input data format

A progressive illness-death example

One-sample estimation

Two-sample comparison (estimation + test in one call)

Recovery models (non-monotone processes)

Landmark variant

Inverse-cluster-size weighting

What B = 0 does

References

What `B = 0` does