In screening programs, individuals are usually followed up and tested (screened) for the development of a disease, such as cancer. The target disease often develops progressively in stages; for example healthy (state 1), pre-state disease (state 2), and the disease state (state 3). When the pre-state disease is found during screening it is intervened upon, for example by surgical removal of a lesion, so that the progression of the pre-state disease to disease is interrupted. This is one of the major objectives of many screening programs, such as colorectal cancer screening, breast cancer screening or cervical cancer screening.
Researchers often want to estimate the time \(x\) from baseline to the pre-state disease, the time \(s\) from the pre-state disease to the disease, and the total time \(x+s\) from baseline to the disease. In addition, researchers often want to regress these times on baseline covariates, such as age, gender, or bio-markers (\(z\)). In doing so, electronic health records from screening programs can serve as data. For \(n\) individuals numbered \(i=1,...,n\), these data typically consist of a series of time stamps (screening times) \(v_{1i} < v_{2i} <...< v_{c_ii} < \infty\) at which screening tests have been carried out and a test result was obtained (e.g., \(v_i = \{1, 3, 5, 8\}\) years after baseline). The number of screening times and their spacing may differ over individuals \(i\) (i.e., irregular screening). Since pre-state diseases are intervened upon, the screening series end with a positive test outcome (either state 2 or 3) or loss to follow up (right censoring). I refer to this observation process as panel data with censoring after intervention.
BayesTSM is a progressive three-state semi-Markov model developed by Klausch et al. (2023) for screening panel data with censoring after intervention. An Accelerated Failure Time (AFT) specification is used to model the failure times \(x\) and \(s\) and regress them on covariates; specifically:
\[\begin{equation} \begin{split} \log x_i &= z_x^T \beta_x + \sigma_x \epsilon_i \\ \log s_i &= z_s^T \beta_s + \sigma_s \xi_i \end{split} \tag{1} \end{equation}\]
The independent error terms \(\epsilon_i\) and \(\xi_i\) are chosen such that common survival time distributions for \(x\) and \(s\) are induced. Currently, BayesTSM allows for Weibull, lognormal, and loglogistically distributed times. To estimate the model, BayesTSM utilizes a Bayesian estimation method (a Gibbs sampler). Klausch et al. (2023) showed how weakly informative priors on the parameters, applied by default in BayesTSM, help to regularize the likelihood surface of the model. Specifically, the likelihood can be flat when state 2 and 3 events are infrequent, which is common in practice. Akwiwu et al. (2026) showed that BayesTSM has better performance than other commonly available semi-Markov multi-state models, which can lead to wrong conclusions in practice.
Here, I describe the usage of the BayesTSM package. An analysis with BayesTSM consists of multiple steps using dedicated functions:
bayestsm)get_IC)summary.bayestsm, plot.bayestsm, ppCIF and plot.ppCIF)These functions are discussed in Section 2, Section 3, and Section 4. Additional options are discussed in Section 5.
bayestsmThe function bayestsm estimates the parameters of model (1). Model estimation is done by means of a Gibbs sampler. As described in Klausch et al. (2023), the main steps of the Gibbs sampler are (after initialization)
d, L, R, X, S), the augmented \(s_i\), and the model parameters.d, L, R, X, S), the augmented \(x_i\), and the model parameters.These updating steps are run repeatedly. After a warmup period, the sampler yields auto-correlated draws from the target posterior distribution of the parameters.
The data augmentation procedure used in steps 1 and 2 is described in detail in Klausch et al. (2023). It can be understood as an imputation procedure in which all individuals receive a plausible draw of the unobserved true times \(x_i\) and \(s_i\) given the observed data. The model parameters can then be drawn from the complete data posterior
\[\begin{align} q(\beta_x, \sigma_x, \beta_s, \sigma_s \mid x_i, s_i) \propto L(\beta_x, \sigma_x, \beta_s, \sigma_s \mid x_i, s_i)\pi(\beta_x, \sigma_x, \beta_s, \sigma_s) \end{align}\]
where \(L\) denotes the complete data likelihood function and \(\pi\) the prior density function; see Section 2.3. As can be seen, the complete data posterior distribution of BayesTSM does not depend on the observed data anymore since the missing times \(x_i\) and \(s_i\) have been augmented. Klausch et al. (2023) suggested a Metropolis-Hastings step to draw from the complete data posterior using a random walk normal proposal distribution. In addition, the current version of BayesTSM allows using univariate slice sampling to update the parameters one after another while holding the other parameters fixed. Our tests showed that slice sampling leads to faster model convergence than the initially proposed Metropolis step by Klausch et al. In addition, slice sampling usually does not require tuning of the sampler and hence is a convenient off-the-shelf approach to sampling the model parameters. Slice sampling is therefore used as the default parameter sampler in BayesTSM. Metropolis sampling can still be obtained through the option MH = TRUE; see Section 5.1.
bayestsm input data structurebayestsm accepts five different types of variables as input arguments d, L, R, Z.X, and Z.S. These variables have to be available for all individuals in the data. Missing information has to be handled outside of BayesTSM.
Argument d takes the type of censoring event with 1 denoting right censoring (loss to follow up), 2 denoting a pre-state event (e.g., pre-state disease / early disease), and 3 denoting a terminal state event (e.g., disease / late disease).
Arguments L and R take, respectively, the left and right bound of the last screening interval. BayesTSM currently assumes that the screening test has perfect sensitivity and specificity (as noted by Klausch et al. (2023), slight deviations from this assumption usually lead to negligible bias). As a consequence, the possibly longer screening time series \(v_1 < v_2 <...<v_c\) can be shortened to the last interval \((v_{c-1}, v_c]\), where the input variable L corresponds to \(v_{c-1}\) and R corresponds to \(v_c\) when d = 2 or d = 3. When d = 1 (right censoring), BayesTSM expects L to be the last follow-up moment (\(v_c\)) and R to be Inf.
Arguments Z.X and Z.S take the covariates used in the AFT models for \(x\) and \(s\), respectively, both specified as \(n \times p\) matrix objects. In principle, the number and type of covariates passed to Z.X and Z.S may differ, although, arguably, we often pass the same covariates to both transition models in practice. If no covariates are passed, Z.X = NULL and/or Z.S = NULL are specified and the resulting AFT model is an intercept-only model, respectively. Categorical variables have to be dummy coded outside of BayesTSM by the user. To do so, stats::model.matrix may be useful. In addition, the Gibbs sampler can be sensitive to large scale sd of continuous covariates. I advise to standardize any covariates with large scale, e.g. using base::scale.
BayesTSM has a built-in data simulation function which generates data under the data generation mechanism (i.e., the model) of Klausch et al. (2023); see ?gendat for all options. Here I use gendat to generate example data and illustrate the function input arguments and model estimation.
library(BayesTSM)
# Generate data
set.seed(1)
dat = gendat(
n = 1000, # Sample size
p = 2, # Number of normally distributed covariates
sigma.X = 0.3, # True scale parameter of X
mu.X = 2, # True intercept parameter of X
beta.X = c(0.5,0.5), # True slope parameters of all covariates of X
sigma.S = 0.5, # True scale parameter of S
mu.S = 1, # True intercept parameter of S
beta.S = c(0.5,0.5), # True slope parameters of all covariates of S
dist.X = 'weibull', # Distribution of X
dist.S = 'weibull', # Distribution of S
v.min = 1, # Minimum time between screening moments
v.max = 5, # Maximum time between screening moments
Tmax = 2e2, # Maximum number of screening times
mean.rc = 10 # Mean time to right censoring, exponential distribution
)gendat generates data for \(x\) and \(s\) for n data points. p covariates are simulated from a multivariate normal distribution. gendat also allows the option for a binary covariate, which I do not use in the example above. Parameters sigma.X, mu.X, and beta.X give respectively the scale parameter \(\sigma_x\), the intercept parameter \(\beta_0\) of the AFT model, and the slope parameters \(\beta_1,...,\beta_p\) for the \(x\)-model. Furthermore, sigma.S, mu.S, and beta.S are the corresponding input arguments for the \(s\)-model. The arguments dist.X and dist.S specify the distributions of \(x\) and \(s\). Besides weibull, the lognormal and loglog (log-logistic) distributions are available.
Subsequently, gendat generates visit times \(v_i\) for all \(i\) using the following process:
Here, v.min and v.max can be specified. The right censoring mechanism can either be controlled via the maximum number of follow-up moments Tmax or via a right censoring time \(v_{rc}\). When the next generated screening time \(v_{ij}>v_{rc}\), right censoring after v_{ij-1} occurs. gendat assumes the right censoring time is exponentially distributed with parameter equal to one divided by mean.rc, so that mean.rc is the mean time to right censoring.
Looking at the top rows of the generated dat illustrates the input structure required by BayesTSM. This type of data has to be matched by any real-world data passed to BayesTSM in practice.
head(dat)
#> L R d X S Z.1 Z.2
#> 1 1.173862 5.031941 3 2.898811 0.3247500 -1.22594831 0.2967671
#> 2 4.390002 Inf 1 7.616756 2.0652360 -0.60971302 0.8821001
#> 3 0.000000 3.106385 3 1.355291 0.7824814 -0.03558337 -1.2038542
#> 4 19.733713 22.842735 2 21.540353 12.9003481 1.04172886 1.3244550
#> 5 1.828183 Inf 1 13.043796 2.1470839 0.19781749 0.2909215
#> 6 2.439027 6.655726 3 3.226180 0.9205919 0.50686312 -1.7238144Specifically, gendat returns a data.frame with columns L, R, d, X, S, and the covariates, in this case Z.1 and Z.2. L and R correspond to the bounds of the most recent screening interval if d=2 or d=3, or an open interval with L denoting the last screening time and R=Inf if d=1. If an event (d=2 or d=3) occurred in the first interval, i.e. between zero and the first screening time, which is sometimes called left censoring, the coding is kept (i.e., L=0 and R the screening time at which the event was detected).
Note that columns X and S give the true simulated screening times, which are not available in practice and are not passed to BayesTSM. Furthermore, BayesTSM assumes that all cases are in state d=1 at time zero (baseline). Hence, baseline positive tests (also called prevalence) should be excluded before running BayesTSM. Cases with L=0 and R=Inf, denoting a baseline negative test with right censoring before follow-up, may be part of the data. Including these cases does not change the likelihood of the parameters. However, when estimating marginalized statistics such as cumulative distribution functions (incidence functions), selection bias may occur if such cases are excluded. The practical effects of having many L=0 and R=Inf observations in the data, however, have not been studied and are not yet well tested.
Klausch et al. (2023) showed that weakly informative priors for the model parameters are helpful to obtain consistent parameter estimates even in settings where advanced state events are infrequent. Specifically, Klausch et al. (2023) proposed Student-\(t\) priors with four degrees of freedom for the intercept and slope coefficients \(\beta_x\) and \(\beta_s\). Furthermore, half-normal priors are used for the \(\sigma_x\) and \(\sigma_s\) parameters, where bayestsm by default uses a scale parameter of \(1\). Klausch et al. (2023) suggested \(\sqrt{10}\) as the default scale parameter. The posterior distributions may be sensitive to these prior parameter choices. Therefore, a prior sensitivity analysis can be useful to assess the robustness of findings to different prior parameters; see Klausch et al. (2023) for an example.
In bayestsm, the degrees of freedom of the beta Student-\(t\) prior are controlled by beta.prior.X (default 4) and beta.prior.S (default 4). The scale parameter of the half-normal prior is controlled through sig.prior.X (default 1) and sig.prior.S (default 1). In addition, the user can pass their own prior density function through argument log_prior_fun; see ?log_aft_prior for guidelines on how the function has to be structured.
Alternatively, the bayestsm internal prior function log_aft_prior allows switching from the Student-\(t\) prior to a normal prior for the model \(\beta\) parameters by specifying option beta.prior = 'norm' in bayestsm. In that case, beta.prior.X and beta.prior.S are used to change the standard deviation of the normal prior distribution.
bayestsm runBefore running the sampler, the analyst should always check the distribution of events.
In the case of our simulated data, about 60% of individuals have state 2 or state 3 events, which is ample for BayesTSM to run successfully. Care in model specification (e.g. number of covariates included and type of distribution chosen) should be taken when there are few state 2 or state 3 events. Although BayesTSM is designed to deal with a substantially smaller number of events and usually converges if the Gibbs sampler is run for a sufficient number of iterations, the precision of estimation may be low, especially when too many covariates are included with few advanced state events.
We now run bayestsm; note again that the input format should align with the requirements described above in Section 2.2.
# Run bayestsm Gibbs sampler with data augmentation and slice sampling of the parameters
d = dat$d
L = dat$L
R = dat$R
Z = dat[, c("Z.1", "Z.2")]
mod_slice = bayestsm(
d = d,
L = L,
R = R,
Z.X = Z,
Z.S = Z,
mc = 1e4,
warmup = 5e2,
thinning = 10,
chains = 4,
update_till_convergence = FALSE,
MH = FALSE,
dist.X = "weibull",
dist.S = "weibull",
seed_chains = 1:4
)bayestsm accepts data as arguments d, L, R, Z.X, and Z.S, as shown above. The desired distributions for \(x\) and \(s\) are passed via dist.X and dist.S with options weibull, lognormal, and loglog (log-logistic). In practice, the correct distribution is unknown. Therefore, multiple models can be compared with different distributions and the best model can be found using information criteria, such as the Widely Applicable Information Criterion (WAIC); for an example, see Klausch et al. (2023). In BayesTSM, the post-estimation function get_IC returns information criteria; see Section 3. In addition, it can be considered to use an exponential distribution instead of one of the two-parameter distributions (Weibull, lognormal, or log-logistic). The exponential model is obtained from the Weibull distribution with \(\sigma\) fixed at one. This step fixes \(\sigma\) in the model and decreases uncertainty, which stabilizes estimation in sparse data settings, but is based on the assumption that hazards do not change across time (equivalent to a Markov model). I suggest resorting to the exponential specification only if the model is too imprecise or shows convergence issues. When an exponential model is desired, the analyst specifies dist.X = "weibull" and fix.sigma.X = 1. The latter argument by default is set to FALSE, which means that \(\sigma_x\) is updated during model estimation. If instead a numeric value is given, \(\sigma_x\) is held fixed at that value during model estimation. Hence, for dist.X = "weibull" and fix.sigma.X = 1, the special case of the exponential \(x\)-model emerges. The same holds for \(s\) when using dist.S = "weibull" and fix.sigma.S = 1.
The basic behavior of the Gibbs sampler is controlled via the arguments mc, warmup, chains, MH, thinning and seed_chains. The initial call to bayestsm shown above lets bayestsm do mc = 1e4 draws and then stop. A slice sampler is used since MH = FALSE; if MH = TRUE, Metropolis sampling would be used. bayestsm by default runs multiple MCMC chains in parallel, which is needed to estimate the convergence diagnostics reliably and to efficiently generate draws from the posterior distribution. I recommend a minimum of four chains for reliable convergence diagnostics estimation, but more is possible. The user’s machine should have at least chains free CPUs available for the parallel R workers. If no parallel processing inside bayestsm is desired, the alternative bayestsm_seq function can be used; see Section 5.5. It is identical to bayestsm but uses a for loop over chains instead.
Each Gibbs sampler chain is randomly initialized. However, the user has control over the initialization seeds for chain, using argument seed_chains. One integer seed per chain has to be provided as a vector. For example seed_chains = 1:4 specifies seeds 1 to 4 for a model that is estimated with chains = 4. Repeatedly running bayestsm with these seeds produces the same posterior chains and hence the same estimates. seed_chains = NULL initializes the chains randomly.
The Gibbs sampler is always initially run for mc draws, after which its convergence is evaluated and returned. bayestsm evaluates convergence as the rank-normalized R-hat by Vehtari et al. (2021) and the effective sample size (ESS) of the mean, computed by the rhat and ess_mean functions of the posterior R package (Bürkner et al., 2026). These statistics are printed to screen and available from the created bayestsm object. By default, bayestsm checks these values against those specified in min_R and min_eff, with defaults 1.01 and chains * 100, respectively, and prints to screen whether the run converged. If convergence is not achieved, the user can perform another batch of posterior draws using prev.run; see Section 2.5. A user-friendly alternative is to let bayestsm run until convergence using update_till_convergence = TRUE; see Section 2.6.
Starting Gibbs sampler with 4 chains and 10000 iterations.
Not converged after 1000 stored iter/chain; update +10000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 2634.1
X Z.1 1.000 2878.8
X Z.2 1.002 2392.1
X sigma 1.002 2852.0
S Intercept 1.025 188.6
S Z.1 1.010 416.5
S Z.2 1.010 413.9
S sigma 1.030 179.8
bayestsm automatically thins the chains through argument thinning. thinning means that only every thinning draw is saved, which can be useful to save machine memory and reduce autocorrelation between saved MCMC draws. By default, bayestsm does not thin the MCMC chain (thinning = 1), but thinning can be necessary when very long MCMC runs are required to obtain convergence. In this basic run, I chose thinning = 10 so that after 10000 draws per chain, 1000 draws per chain are saved.
Before evaluation of convergence, warmup (unthinned / raw) iterations are discarded for warm-up. Since this integer is unknown a priori it is advisable to run bayestsm initially for mc iterations and inspect MCMC chains plots using plot.bayestsm to decide what a good warmup value is.
Looking at the MCMC trace plots for the parameters of the \(s\)-model, we can see that convergence is fast, and occurs within less than 100 (thinned) posterior draws (= less than 1000 raw draws).
Figure 1: Trace plots for the MCMC chains of the s-model (after thinning, 10000 raw draws).
bayestsm runsIf an initial call to bayestsm has not converged the user can update a previous run through passing the bayestsm object to the prev.run argument in a new call to bayestsm. The number of added updates is controlled by the mc_update argument.
# Updating previous run
mod_slice_update = bayestsm(
prev.run = mod_slice, # pass previous bayestsm object
mc_update = 2e4
)Updating previous MCMC run.
Starting Gibbs sampler with 4 chains and 20000 iterations.
Converged after 3000 stored iter/chain.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 8090.9
X Z.1 1.000 7788.1
X Z.2 1.000 7694.5
X sigma 1.001 9293.9
S Intercept 1.006 890.5
S Z.1 1.002 1439.7
S Z.2 1.003 1475.4
S sigma 1.007 767.8
After updating for another 2e4 posterior draws the sampler has converged (according to the default criteria specified by min_R and min_eff). The total number of iterations over which the Gibbs sampler has been run now is 3e4 with 3e3 iterations saved (thinning = 10). We have obtained sufficient draws for posterior inference.
Besides manual adding of posterior draws, bayestsm has a built-in automatic updating feature called by argument update_till_convergence = TRUE. In that case, if no convergence was obtained after mc initial Gibbs iterations (posterior draws), bayestsm adds another mc_update draws, after which convergence is re-evaluated. If still no convergence is obtained another batch of size mc_update is added and so on until convergence (specified by min_R and min_eff) or until the maximum number of iterations (maxit) is reached. update_till_convergence = TRUE can, of course, also be specified right away without an initial run, as demonstrated below showing successful convergence after 4e4 draws (stored 4e3).
# Automated updating till convergence
mod_slice_weibull = bayestsm(
d = d,
L = L,
R = R,
Z.X = Z,
Z.S = Z,
mc = 1e4,
warmup = 5e2,
thinning = 10,
chains = 4,
update_till_convergence = TRUE,
mc_update = 1e4,
MH = FALSE,
dist.X = "weibull",
dist.S = "weibull"
)Starting Gibbs sampler with 4 chains and 10000 iterations.
Not converged after 1000 stored iter/chain; update +10000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 2634.1
X Z.1 1.000 2878.8
X Z.2 1.002 2392.1
X sigma 1.002 2852.0
S Intercept 1.025 188.6
S Z.1 1.010 416.5
S Z.2 1.010 413.9
S sigma 1.030 179.8
Not converged after 2000 stored iter/chain; update +10000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 5667.9
X Z.1 1.001 5642.5
X Z.2 1.001 5187.8
X sigma 1.000 5887.3
S Intercept 1.009 471.1
S Z.1 1.004 795.6
S Z.2 1.005 801.3
S sigma 1.012 383.7
Not converged after 3000 stored iter/chain; update +10000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 8856.9
X Z.1 1.000 8339.7
X Z.2 1.000 7711.1
X sigma 1.000 8390.9
S Intercept 1.011 700.6
S Z.1 1.004 1219.3
S Z.2 1.006 1147.5
S sigma 1.013 575.3
Converged after 4000 stored iter/chain.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 12139.7
X Z.1 1.000 11497.3
X Z.2 1.000 10289.1
X sigma 1.000 10830.2
S Intercept 1.007 1033.9
S Z.1 1.002 1739.9
S Z.2 1.003 1656.2
S sigma 1.008 849.9
The model runtime (obtained from a fit on an 2025 Apple Mac Mini M4) is saved in
Time difference of 45.78194 secs
bayestsmbayestsm offers three distributions for the latent transition times, specified by the arguments dist.X and dist.S (weibull, loglog, lognormal). In practice, it is usually not known which distribution is best in a given application. Therefore, multiple models can be fitted and compared using information criteria. The function get_IC computes the deviance information criterion (DIC, Spiegelhalter et al. 2002), as well as two versions of the widely applicable information criterion (WAIC-1 and WAIC-2, Watanabe et al., 2010). The DIC is commonly used when the posterior is approximately multivariate normal, which is often not the case for BayesTSM because some posterior distributions may be skewed, depending on the data. The WAIC criteria are common alternatives in such settings.
In my experience, the two-parameter models weibull, loglog, and lognormal often have similar fit, and the choice among them usually has little impact on the regression slope coefficients or the cumulative distribution functions. The information criteria are then also similar. However, larger differences are often observed for the exponential distribution, which is a special case of the Weibull distribution with \(\sigma\) constrained to 1. In bayestsm, this can be achieved by specifying dist.X = "weibull" and fix.sigma.X = TRUE; the same can be done for the \(s\)-model using dist.S = "weibull" and fix.sigma.S = TRUE. Below, we estimate two alternative models, one lognormal and one exponential, and compare their information criteria.
We first fit the models as follows:
# Lognormal model
mod_slice_lognormal = bayestsm(
d = d,
L = L,
R = R,
Z.X = Z,
Z.S = Z,
mc = 1e4,
warmup = 5e2,
thinning = 10,
chains = 4,
update_till_convergence = TRUE,
mc_update = 1e4,
MH = FALSE,
dist.X = "lognormal",
dist.S = "lognormal",
seed_chains = 5:8
)Starting Gibbs sampler with 4 chains and 10000 iterations.
Not converged after 1000 stored iter/chain; update +10000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 3413.8
X Z.1 1.001 2959.8
X Z.2 1.000 3117.9
X sigma 0.999 2260.3
S Intercept 1.001 815.9
S Z.1 1.005 561.5
S Z.2 1.003 588.0
S sigma 1.011 340.9
Converged after 2000 stored iter/chain.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 7479.9
X Z.1 1.000 6186.9
X Z.2 1.001 6308.8
X sigma 1.000 4963.7
S Intercept 1.001 1662.4
S Z.1 1.000 1178.4
S Z.2 1.001 1164.7
S sigma 1.002 708.1
# Exponential model
mod_slice_exponential = bayestsm(
d = d,
L = L,
R = R,
Z.X = Z,
Z.S = Z,
mc = 1e4,
warmup = 5e2,
thinning = 10,
chains = 4,
update_till_convergence = TRUE,
mc_update = 1e4,
MH = FALSE,
dist.X = "weibull",
dist.S = "weibull",
fix.sigma.X = T, # Fix sigma.X at sig.prior.X (default 1)
fix.sigma.S = T, # Fix sigma.S at sig.prior.S (default 1)
seed_chains = 9:12
)Starting Gibbs sampler with 4 chains and 10000 iterations.
Converged after 1000 stored iter/chain.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0 for sampled parameters; fixed sigma parameters excluded.
block parameter R_hat ESS
X Intercept 1.000 3866.5
X Z.1 1.000 3422.4
X Z.2 1.000 3549.0
S Intercept 1.001 2389.9
S Z.1 1.000 2766.7
S Z.2 1.000 2530.0
As can be seen, the exponential model obtains a larger effective sample size and smaller R-hat values in fewer iterations than the Weibull or lognormal models. This speedup results from the stronger constraints imposed by fixing \(\sigma\). However, in the initial data generation using gendat, we fixed sigma.X at 0.3 and sigma.S at 0.2. The exponential model assumption that these values are equal to one is therefore strongly violated. This can be seen by inspecting the sigma estimates, as described in more detail in Section 4.1. In addition, the information criteria reflect this lack of fit, with higher values denoting worse fit.
WAIC1 WAIC2 DIC
[1,] 1830.464 1830.704 1830.982
WAIC1 WAIC2 DIC
[1,] 1883.824 1884.098 1883.716
WAIC1 WAIC2 DIC
[1,] 2493.822 2493.891 2495.685
Indeed, the Weibull model has the best fit, reflecting that the data were generated under a true Weibull model using gendat. The lognormal model has worse fit, although it remains close to the Weibull model. The exponential model has the worst fit among the three candidate models indicating that fixing the \(\sigma\) parameters are 1 is probably not a good idea.
Note that by default get_IC uses all posterior samples available in the bayestsm model object, dropping only a number of samples as warmup, to estimate the information criteria. The precision of estimation can depend on the number of posterior samples used. By default, get_IC uses two CPUs to estimate the criteria. Since estimation is computational expensive, and scales linearly in the number of posterior samples, using more cores speeds up estimation. Specifying cores = NULL lets get_IC determine the maximum number of available cores on the local machine and distributes computation over these.
bayestsmAfter estimating a model, the main interest usually lies in posterior inference on the parameters. In addition, plots of state transition probabilities as a function of time can be informative. For both goals, BayesTSM offers built-in functionality: parameter summaries are discussed in Section 4.1, and posterior predictive transition probability plots are discussed in Section 4.2.
Model estimation is performed using a Gibbs sampler, which produces mc samples from the posterior distribution. If the model is updated, as described in Section 2.5 and Section 2.6, the total number of samples may be higher than the initial mc draws. In addition, multiple chains are run with different randomly initialized values. This allows estimation of the R-hat convergence diagnostic and efficiently produces more samples simultaneously through parallel computation.
The posterior samples of the model parameters \((\beta_x, \sigma_x, \beta_s, \sigma_s)\) are available in the bayestsm object as the list elements par.X and par.S, respectively, each as an MCMCpack::mcmc.list() object (Martin et al., 2011). This allows straightforward plotting of the MCMC trace plots through MCMCpack::plot.mcmc.list(). By default, the plotting method for bayestsm objects calls this function to plot the trace plots and posterior densities. For example:
Sometimes it can be helpful to omit some iterations as warmup to improve the scaling of the trace plots.
Posterior summaries of the parameter estimates are efficiently obtained through bayestsm’s summary method.
Parameters of x-model
2.5% 50% 97.5% R_hat ESS
Intercept 1.976 2.004 2.034 1 9299.7
Z.1 0.482 0.514 0.548 1 8900.7
Z.2 0.445 0.477 0.512 1 7755.0
sigma 0.273 0.295 0.319 1 9175.4
Parameters of s-model
2.5% 50% 97.5% R_hat ESS
Intercept 0.837 1.000 1.212 1.006 904.6
Z.1 0.390 0.525 0.694 1.003 1455.8
Z.2 0.416 0.562 0.741 1.003 1446.5
sigma 0.286 0.462 0.688 1.008 745.7
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
Total posterior draws saved after thinning: 12000 (total draws: 120000)
The summary method focuses on posterior medians and 95% credible intervals. For posterior means, additional quantiles, and variance estimates, the user may use, for example:
which uses MCMCpack’s built-in summary function for mcmc.list objects (Martin et al., 2011). Here, care should be taken to omit some iterations as warmup. This can be achieved, for example, through the BayesTSM function trim.mcmc.
Iterations = 500:3000
Thinning interval = 1
Number of chains = 4
Sample size per chain = 2501
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
Intercept 2.0040 0.01463 0.0001463 0.0001537
Z.1 0.5144 0.01664 0.0001664 0.0001865
Z.2 0.4773 0.01692 0.0001692 0.0002046
sigma 0.2948 0.01149 0.0001149 0.0001263
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
Intercept 1.9754 1.9942 2.0039 2.0137 2.0328
Z.1 0.4822 0.5030 0.5142 0.5256 0.5480
Z.2 0.4450 0.4659 0.4771 0.4885 0.5115
sigma 0.2730 0.2869 0.2946 0.3025 0.3185
It can be seen that the estimates are close to the true values specified in the data generation using gen.dat and the posterior (credible) intervals cover the true values.
A relevant question in screening data research is: what is the probability of progression to pre-state disease and disease as a function of time? BayesTSM has built-in functionality to estimate posterior predictive transition probabilities up to a given point in time. These probabilities are given by the cumulative distribution functions (CDFs) of the latent times. Depending on the field of study, CDFs are sometimes also called cumulative incidence functions (CIFs). BayesTSM uses the CIF terminology. Specifically, the function ppCIF allows estimation of three CIFs:
\[\begin{align} F_x(x_0 \mid \beta_x, \sigma_x) &= P(x \le x_0 \mid \beta_x, \sigma_x),\\ F_s(s_0 \mid \beta_s, \sigma_s) &= P(s \le s_0 \mid \beta_s, \sigma_s),\\ F_y(y_0 \mid \beta_x, \sigma_x, \beta_s, \sigma_s) &= P(y \le y_0 \mid \beta_x, \sigma_x, \beta_s, \sigma_s), \end{align}\]
where \(x\) is the time from baseline to the second state, for example pre-state disease; \(s\) is the time from the second state to the third state, for example disease; and \(y = x+s\) is the time from baseline to the third state. The distribution of \(y\) involves the convolution of \(x\) and \(s\), which is generally not available in closed form for the distributions offered by BayesTSM.
The CIFs shown above are called marginal CIFs because they do not condition on any covariates \(z\). By contrast, CIFs that condition on one or more covariate values \(z\) are called conditional CIFs:
\[\begin{align} F_x(x_0 \mid z, \beta_x, \sigma_x) &= P(x \le x_0 \mid z, \beta_x, \sigma_x),\\ F_s(s_0 \mid z, \beta_s, \sigma_s) &= P(s \le s_0 \mid z, \beta_s, \sigma_s),\\ F_y(y_0 \mid z, \beta_x, \sigma_x, \beta_s, \sigma_s) &= P(y \le y_0 \mid z, \beta_x, \sigma_x, \beta_s, \sigma_s). \end{align}\]
The function ppCIF estimates the percentiles \(F_x\), \(F_s\), and \(F_y\) for supplied quantiles (type = 'quantiles'), or the quantiles \(F^{-1}_x\), \(F^{-1}_s\), and \(F^{-1}_y\) for supplied percentiles (type = 'percentiles'), for both marginal and conditional CIFs. Percentiles of F_x and F_s can be obtained analytically, while percentiles of F_y are obtained by simulation. Quantiles are always obtained by simulation. ppCIF offers the options method = c("simulation", "analytic"), which give almost equivalent results in practice.
We distinguish two main use cases:
The first objective usually uses the option type = 'quantiles' and supplies a small number of time points at which the posterior predictive probabilities should be evaluated. The code below estimates the posterior predictive cumulative transition probabilities up to 5 and 10 time units.
Obtaining the posterior predictive CDFs of X, S, and X+S=Y by Monte Carlo simulation.
$med.p.x
[1] 0.391 0.710
$med.p.s
[1] 0.809 0.945
$med.p.y
[1] 0.252 0.568
$p.x.ci
[,1] [,2]
2.5% 0.367 0.687
97.5% 0.416 0.732
$p.s.ci
[,1] [,2]
2.5% 0.713000 0.871
97.5% 0.897025 0.989
$p.y.ci
[,1] [,2]
2.5% 0.230 0.534
97.5% 0.272 0.597
$grid
[1] 5 10
$type
[1] "quantiles"
For example, the median predicted probability of transition to state 2 by 5 time units is 0.391, with a 95% credible interval of [0.367, 0.416]. The median predicted probability of transition to state 3 by 5 time units is 0.252, with a 95% credible interval of [0.230, 0.272].
These are marginal probabilities, understood as transition probabilities averaged over the observed covariate distribution. They can also be interpreted as the transition probabilities for a randomly selected individual from the population. Often, however, conditional probabilities are also of interest. These condition on one or more covariate values. In ppCIF, conditional probabilities are specified through the arguments fix_Z.X and fix_Z.S. Specifically, if fix_Z.X = NULL and fix_Z.S = NULL, marginal cumulative posterior transition probabilities are obtained; this is the default. If conditioning on a covariate vector is desired, a vector of length ncol(Z.X) or ncol(Z.S) is passed to fix_Z.X or fix_Z.S, respectively. For example,
conditions estimation on both covariates in Z.X and Z.S having the value 1. It is also possible to condition on some covariates while marginalizing over others. For this, the entries corresponding to the columns of Z.X and Z.S over which marginalization should be performed are set to NA. For example,
conditions on the first covariate being 1, while marginalizing over the second covariate.
ppCIF( mod_slice_weibull,
type = 'quantiles',
warmup = 500,
quant = c(5, 10),
fix_Z.X = c(1, NA),
fix_Z.S = c(1, NA)
)Obtaining the posterior predictive CDFs of X, S, and X+S=Y by Monte Carlo simulation.
$med.p.x
[1] 0.1190 0.4605
$med.p.s
[1] 0.658 0.911
$med.p.y
[1] 0.04 0.26
$p.x.ci
[,1] [,2]
2.5% 0.096 0.422975
97.5% 0.144 0.499025
$p.s.ci
[,1] [,2]
2.5% 0.512000 0.769975
97.5% 0.820075 0.985025
$p.y.ci
[,1] [,2]
2.5% 0.028 0.221
97.5% 0.055 0.299
$grid
[1] 5 10
$type
[1] "quantiles"
attr(,"class")
[1] "ppCIF"
The estimated probabilities now differ from the previous estimates because they condition on the first covariate being 1.
Instead of evaluating the posterior predictive CIF at a few pre-specified values, as in Section 4.2.1, we can also evaluate it across a fine grid of quantiles. The resulting percentiles can then be plotted against that grid. ppCIF has a plotting method, plot.ppCIF, that provides a default quick view of the resulting CIFs. Alternatively, the user can create custom plots using the returned vectors $med.p.x, $med.p.s, and $med.p.y, together with the associated credible intervals. Here, I demonstrate the default plotting method.
Obtaining the posterior predictive CDFs of X, S, and X+S=Y by Monte Carlo simulation.
Quantiles were not provided and therefore chosen automatically.
Since no quant grid was provided, ppCIF chooses a grid automatically, scaled between 0 and the maximum finite screening time.
Figure 2: Posterior predictive CIFs obtained via default quantile grid.
A similarly effective way to create CIF plots is to provide a fine grid of percentiles between 0 and 1 and let ppCIF obtain the associated quantiles.
pq_grid <- ppCIF( mod_slice_weibull, type = 'percentiles', warmup = 500 )
plot(pq_grid, xlim=c(0,50))Figure 3: Posterior predictive CIFs obtained via default percentile grid.
The resulting plots differ only with regard to the maximum obtained quantile. Some tuning of either the argument quant for type = 'quantiles' or the argument perc for type = 'percentiles' may be needed to obtain sufficiently high quantile cut-offs for the desired plotting result.
The resulting plots are marginalized across the covariates and therefore show marginal CIFs. Conditional variants can be obtained in the same way as described in Section 4.2.1, that is, by fixing values through the arguments fix_Z.X and fix_Z.S.
In this section, we discuss several further options to change the behavior of bayestsm.
By default, the current implementation of bayestsm uses a slice sampler to sample from the complete-data posterior of the parameters after data augmentation in the Gibbs sampler. By contrast, the first implementation proposed by Klausch et al. (2023) used a random-walk Metropolis sampler with a multivariate normal proposal distribution. In bayestsm, both samplers are available: MH = FALSE calls the slice sampler, whereas MH = TRUE calls the Metropolis sampler.
Unlike slice sampling, the Metropolis sampler moves through the posterior distribution by proposing random jumps from a multivariate normal distribution with dimension equal to the number of model parameters (\(\sigma\) is sampled on the log scale). By default, the proposal covariance matrix is diagonal with the same variance for all parameters, and its standard deviation has to be passed to bayestsm through the argument prop.sd. If prop.sd = NULL, a useful proposal standard deviation is selected in a pilot run using the heuristic search algorithm described in Klausch et al. (2023). The algorithm resembles adaptive Metropolis sampling and has so far proven reliable for finding proposal standard deviations that yield acceptance rates close to the 20% to 25% range. An acceptance rate of 23% is often quoted as desirable for efficient sampling from the posterior under multivariate normality conditions (Roberts et al., 1996). Other target ranges can be specified by calling search_prop_sd on an initial short bayestsm run; see ?search_prop_sd.
To fit a model with Metropolis sampling instead of slice sampling, the only option that needs to be changed is MH = TRUE. With the default prop.sd = NULL, a suitable proposal standard deviation is searched automatically. All other aspects of the bayestsm call remain unchanged and are discussed in Section 2.4, Section 2.5, and Section 2.6. However, I recommend specifying a substantially larger number of Gibbs iterations mc and a larger number of warmup iterations, as shown below.
mod_MH = bayestsm(
d = d,
L = L,
R = R,
Z.X = Z,
Z.S = Z,
mc = 5e5,
warmup = 1e5,
thinning = 100,
chains = 4,
update_till_convergence = TRUE,
MH = TRUE,
dist.X = "weibull",
dist.S = "weibull",
seed_chains = 1:4
)No proposal sd provided. Searching.
Iteration 1
Averaged acceptance rate: 0.357
sd set to 0.016
Iteration 2
Averaged acceptance rate: 0.233
Success with sd=0.016
Repeating for 10000 posterior draws.
Iteration 3
Averaged acceptance rate: 0.183
sd set to 0.013
Iteration 4
Averaged acceptance rate: 0.193
sd set to 0.011
Iteration 5
Averaged acceptance rate: 0.214
Success with sd=0.011
Repeating for 20000 posterior draws.
Iteration 6
Averaged acceptance rate: 0.236
Success with sd=0.011
Search completed.
Starting Gibbs sampler with 4 chains and 500000 iterations.
The progress output printed to the screen shows the tuning of the proposal standard deviation of the Metropolis sampler, followed by a note that the Gibbs sampler has started. Since a substantially larger number of Gibbs iterations is carried out, substantially longer computation times can be expected compared with slice sampling. However, because the Metropolis sampler implementation itself is fast, model estimation can still be completed in acceptable runtime.
Not converged after 5000 stored iter/chain; update +500000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 10743.2
X Z.1 1.000 11408.8
X Z.2 1.000 9338.5
X sigma 1.001 6577.4
S Intercept 1.018 354.0
S Z.1 1.011 569.3
S Z.2 1.011 624.9
S sigma 1.031 229.4
Not converged after 10000 stored iter/chain; update +500000 raw.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 23357.7
X Z.1 1.000 24879.2
X Z.2 1.000 21004.9
X sigma 1.000 12071.0
S Intercept 1.011 751.7
S Z.1 1.007 1144.7
S Z.2 1.006 1292.1
S sigma 1.015 538.7
Converged after 15000 stored iter/chain.
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
block parameter R_hat ESS
X Intercept 1.000 34892.7
X Z.1 1.000 38553.0
X Z.2 1.000 32043.2
X sigma 1.000 17583.1
S Intercept 1.005 1197.7
S Z.1 1.004 1853.6
S Z.2 1.003 1992.8
S sigma 1.007 871.7
In this example, I used update steps of 5e5 with thinning=100 and omitted the first 1e5 iterations as warmup. Convergence was obtained after 1.5 million raw iterations per chain.
Parameters of x-model
2.5% 50% 97.5% R_hat ESS
Intercept 1.975 2.004 2.033 1 34892.7
Z.1 0.482 0.514 0.547 1 38553.0
Z.2 0.445 0.477 0.511 1 32043.2
sigma 0.273 0.294 0.318 1 17583.1
Parameters of s-model
2.5% 50% 97.5% R_hat ESS
Intercept 0.835 1.003 1.215 1.005 1197.7
Z.1 0.395 0.529 0.706 1.004 1853.6
Z.2 0.417 0.562 0.751 1.003 1992.8
sigma 0.287 0.466 0.697 1.007 871.7
Convergence criteria: R-hat <= 1.010 and ESS >= 400.0
Total posterior draws saved after thinning: 60000 (total draws: 6000000)
The posterior summary statistics are very similar to those from the slice sampler fit shown in Section 4.1.
Time difference of 7.52409 mins
The model runtime was 7.52 minutes on a 2025 Apple Mac Mini M4 and was therefore substantially longer than the same model fit by slice sampling until convergence, which took about 45 seconds; see Section 2.6. However, a fairer comparison considers the effective sample size generated per second:
ess_per_second_slice <- mod_slice_weibull$convergence$eff_s /
as.numeric(mod_slice_weibull$runtime, units = "secs")
ess_per_second_MH <- mod_MH$convergence$eff_s /
as.numeric(mod_MH$runtime, units = "secs")
rbind(ess_per_second_slice, ess_per_second_MH) Intercept Z.1 Z.2 sigma
ess_per_second_slice 19.758764 31.798440 31.595294 16.287304
ess_per_second_MH 2.652929 4.105893 4.414367 1.930899
This comparison shows that, for the parameters of the s-model, the slice sampler generates about 7 to 9 times more effective samples per second than MH.
An interesting property of the slice sampler is that it usually requires less tuning than the Metropolis sampler, which needs a suitable proposal variance for the jumping distribution. However, the slice sampler also has a tuning parameter: the step size used in its so-called step-out algorithm. In bayestsm, this parameter is controlled by slicesampler_stepsize, with default value slicesampler_stepsize = 1.
In most applications, the default step size works well and rarely needs to be changed. The step size does not affect the target posterior distribution, but it can affect computational efficiency. If the step size is too small, the sampler may require many step-out steps before finding a suitable interval. If it is too large, the sampler may spend more time shrinking the interval before accepting a draw. Therefore, if the slice sampler mixes slowly or requires surprisingly long computation time, changing slicesampler_stepsize can be useful.
BayesTSM uses pre-specified weakly informative priors by default. Specifically, Student-\(t\) priors are used for the intercept and \(\beta\) parameters, and half-normal priors are used for the \(\sigma\) parameters. The bayestsm arguments beta.prior.X and beta.prior.S control the number of degrees of freedom of the \(t\) prior (default: 4). The arguments sig.prior.X and sig.prior.S control the scale parameter of the half-normal priors (default: 1). In addition, changing the default argument beta.prior = 't' to beta.prior = 'norm' switches to normal priors for the \(\beta\) parameters. In that case, beta.prior.X and beta.prior.S are used to specify the prior standard deviation. These priors have been tested in various bayestsm settings and have proven to provide sufficient regularization to enable model estimation while not causing noticeable bias.
In addition, bayestsm allows the user to specify custom prior distribution functions. For this, the log-prior function has to be written in R and passed to the argument log_prior_fun; see also ?log_aft_prior. Specifically, log_aft_prior is the default log-prior function used by bayestsm, whose input arguments are given by:
Any user-written log-prior function has to take the same input arguments as log_aft_prior. Specifically, eta is the vector of the \(\beta\) AFT model parameters, with the \(\sigma\) AFT model parameter appended in the last position on the log scale.
The argument tau is related to the bayestsm input arguments beta.prior.X and beta.prior.S. Internally, beta.prior.X is passed to tau for the \(x\)-model through a call to log_prior_fun, and beta.prior.S is passed to tau for the \(s\)-model through another call to log_prior_fun. Similarly, sig.prior.X and sig.prior.S are passed to sig.prior. Through this structure, the user can write any custom log-prior function and pass prior parameters to that function. In addition, the argument beta.prior may be used to switch between different priors, as in log_aft_prior, or to pass other information to the function.
Note: in the current implementation of BayesTSM, I have observed that the slice sampler can react sensitively to changes in the default prior during warmup. In such cases, it may produce extreme parameter estimates in the first few Gibbs iterations, which can cause the sampler to stop with an error. If that happens, changing to MH = TRUE (Metropolis sampler) is a robust alternative. This sampler appears to be more stable in this situation because it samples all parameters in one block, so extreme proposed parameter combinations are more likely to be rejected jointly. By contrast, the slice sampler updates parameters univariately, conditioning on all other parameters.
The scale of the times passed to bayestsm is often arbitrary. For example, time may be measured in days, weeks, months, or years from baseline. From a modeling perspective, changing the time scale mainly affects the location of the transition time distribution, which is governed by the AFT model intercept parameter \(\beta_0\). Since \(\beta_0\) is regularized using a Student-\(t\) prior, the amount of regularization towards zero can depend on the chosen time scale. This may be undesirable, because the prior should ideally express the same substantive information regardless of whether time is measured in days, months, or years.
There are two possible ways to address this issue. One option is to specify a less restrictive prior on the intercept. Another option is to internally rescale the screening times L and R before model fitting and then readjust the intercept parameter after estimation. By default, bayestsm uses the second approach and rescales times by dividing them by the median finite observed time:
This transformation changes the scale on which the AFT intercept is estimated. If times are divided by a constant \(c\), then the intercept on the rescaled time scale is shifted by \(-\log(c)\). Therefore, after model fitting, the intercept can be transformed back to the original time scale by adding \(\log(c)\). The remaining regression coefficients and the AFT scale parameter are not affected by this deterministic change of time scale.
The use of the median finite observed time is a pragmatic default. It is robust to very large finite time values and usually places the rescaled times on a scale that is suitable for the default prior on the intercept.
Alternatively, the user can decide not to rescale event times by setting rescale_times = FALSE. Care should generally be taken when changing this option and specifying priors on the intercept, because overly strong regularization of the intercept may be undesirable in terms of the resulting bias-variance trade-off.
By default, bayestsm runs multiple MCMC chains in parallel using the packages doParallel and foreach. Specifically, a foreach loop is used over chains. In some settings, however, parallel processing outside of bayestsm is preferable. For example, in Monte Carlo simulations, BayesTSM may be run on multiple simulated data sets, with the data sets themselves distributed over cores.
For such settings, BayesTSM provides the alternative function bayestsm_seq, which runs the MCMC chains sequentially using a for loop on a single core. This makes it easier to apply foreach parallelization outside of BayesTSM, for example over simulation replications. Nested parallelization with foreach loops is generally not recommended, because it can lead to inefficient use of computing resources, oversubscription of cores, and more complicated handling of random-number generation.
Sequential implementations are also available for computing information criteria (get_IC_seq) and for finding the Metropolis proposal standard deviation (search_prop_sd_seq).
The input handling of the sequential versions is identical to the parallelized functions.
Akwiwu, E. U., Coupé, V. M. H., Berkhof, J., & Klausch, T. (2026). A comparison of methods for modeling multistate cancer progression using screening data with censoring after intervention. Medical Decision Making. https://doi.org/10.1177/0272989X261422681
Bürkner, P.-C., Gabry, J., Kay, M., & Vehtari, A. (2026). posterior: Tools for working with posterior distributions. R package version 1.7.0.
Klausch, T., Akwiwu, E. U., van de Wiel, M. A., Coupé, V. M. H., & Berkhof, J. (2023). A Bayesian accelerated failure time model for interval censored three-state screening outcomes. The Annals of Applied Statistics, 17(2), 1285–1306. https://doi.org/10.1214/22-AOAS1669
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov Chain Monte Carlo in R. Journal of Statistical Software, 42(9), 1–21. https://doi.org/10.18637/jss.v042.i09
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11.
Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110–120. https://doi.org/10.1214/aoap/1034625254
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64(4), 583–639. https://doi.org/10.1111/1467-9868.00353
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. Bayesian Analysis, 16(2), 667–718. https://doi.org/10.1214/20-BA1221
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.