Overview

The Stochastic Process Model (SPM) was developed several decades ago [1,2] and has been applied to clinical, demographic, and epidemiologic longitudinal data, as well as to many other studies that relate the stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamics of stochastic variables to a hazard rate that is a quadratic function of the state variables [3]. The R package “stpm” is a set of utilities for estimating the parameters of stochastic processes and for modeling survival trajectories and time-to-event outcomes observed in longitudinal studies. It is a general framework for studying and modeling survival (censored) traits that depend on random trajectories (stochastic paths) of variables.

Installation

Stable version from CRAN

install.packages("stpm")

Most-recent version from GitHub

# install.packages("devtools") # run this first if devtools is not installed yet
devtools::install_github("izhbannikov/stpm")

Data description

The data are typical longitudinal data in the form of two datasets: a longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, in which one record holds all information about a subject. The longitudinal dataset can contain a subject ID (identification number), status (event(1)/censored(0)), time, and measurements of the variables. The stpm package does not impose a fixed limit on the number of variables, but in practice 5-7 variables are enough.

Below is an example of clinical data that can be used in stpm; the fields are described afterwards.

Longitudinal table:

##   ID IndicatorDeath Age      DBP      BMI
## 1  1              0  30 80.00000 25.00000
## 2  1              0  32 80.51659 26.61245
## 3  1              0  34 77.78412 29.16790
## 4  1              0  36 77.86665 32.40359
## 5  1              0  38 96.55673 31.92014
## 6  1              0  40 94.48616 32.89139

Vital statistics table:

##   ID IsDead   LSmort
## 1  1      1 85.34578
## 2  2      1 80.55053
## 3  3      1 98.07315
## 4  4      1 81.29779
## 5  5      1 89.89829
## 6  6      1 72.47687

Description of data fields

Longitudinal studies
  • ID - subject’s unique identification number.
  • IndicatorDeath - 0/1, indicates death of a subject.
  • Age - current age of subject at observation.
  • DBP, BMI - covariates, here “DBP” represents a diastolic blood pressure, “BMI” a body-mass index.
Vital statistics
  • ID - subject’s unique ID.
  • IsDead - death indicator, 0 - alive, 1 - dead.
  • LSmort - age at death or at the end of observation.
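
As a rough illustration of the two-table layout described above, the first rows of each table can be rebuilt as base R data frames and joined on ID. This is only a sketch: the object names `longitudinal`, `vital` and `merged` are made up here, and the join uses base R's `merge()` rather than any stpm function.

```r
# First three rows of the longitudinal table shown above (one row per observation)
longitudinal <- data.frame(
  ID             = c(1, 1, 1),
  IndicatorDeath = c(0, 0, 0),
  Age            = c(30, 32, 34),
  DBP            = c(80.00000, 80.51659, 77.78412),
  BMI            = c(25.00000, 26.61245, 29.16790)
)

# First two rows of the vital statistics table (one row per subject)
vital <- data.frame(
  ID     = c(1, 2),
  IsDead = c(1, 1),
  LSmort = c(85.34578, 80.55053)
)

# Attach each subject's survival record to every one of its observations
merged <- merge(longitudinal, vital, by = "ID")

# Sanity check: every observation precedes the age at death or end of observation
all(merged$Age <= merged$LSmort)
```

A check of this kind can catch mismatched IDs or ages before the data are passed on to an estimation routine.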

Discrete- and continuous-time models

There are two main SPM types in the package: a discrete-time model [4] and a continuous-time model [3]. The discrete model assumes equal intervals between follow-up observations. An example of a discrete dataset is given below.

library(stpm)
data <- simdata_discr(N=10) # simulate data for 10 individuals
head(data)
##      id xi t1 t2       y1  y1.next
## [1,]  1  0 30 31 80.00000 80.50409
## [2,]  1  0 31 32 80.50409 88.89760
## [3,]  1  0 32 33 88.89760 89.65612
## [4,]  1  0 33 34 89.65612 88.08988
## [5,]  1  0 34 35 88.08988 81.76396
## [6,]  1  0 35 36 81.76396 89.78588

In this case the intervals between \(t_1\) and \(t_2\) are equal.

In the continuous-time SPM, the intervals between observations are not equal (they can be arbitrary or random). An example of such a dataset is shown below:

library(stpm)
data <- simdata_cont(N=5) # simulate data for 5 individuals
head(data)
##      id xi       t1       t2        y1  y1.next
## [1,]  0  0 37.83559 39.58677 80.070594 5.057979
## [2,]  0  0 39.58677 41.07646  5.057979 8.198235
## [3,]  0  0 41.07646 42.44267  8.198235 5.722875
## [4,]  0  0 42.44267 43.72889  5.722875 6.056424
## [5,]  0  0 43.72889 45.24662  6.056424 6.716898
## [6,]  0  0 45.24662 46.92195  6.716898 8.098054
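
Whether a dataset suits the discrete-time model can be checked by comparing the gaps between `t1` and `t2`. The helper below is a hypothetical base R sketch (the name `has_equal_intervals` and the tolerance are assumptions, not part of stpm); the times are copied from the two examples above.

```r
# TRUE when every record has the same t2 - t1 gap (within a tolerance)
has_equal_intervals <- function(t1, t2, tol = 1e-8) {
  gaps <- t2 - t1
  all(abs(gaps - gaps[1]) < tol)
}

# Times copied from the discrete example above
discrete_ok <- has_equal_intervals(t1 = c(30, 31, 32, 33),
                                   t2 = c(31, 32, 33, 34))

# Times copied from the continuous example above
continuous_ok <- has_equal_intervals(t1 = c(37.83559, 39.58677, 41.07646),
                                     t2 = c(39.58677, 41.07646, 42.44267))

c(discrete = discrete_ok, continuous = continuous_ok)
```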

Discrete-time model

The discrete model assumes fixed time intervals between consecutive observations. In this model, \(\mathbf{Y}(t)\) (a \(k \times 1\) matrix of the values of covariates, where \(k\) is the number of considered covariates) and \(\mu(t, \mathbf{Y}(t))\) (the hazard rate) have the following form:

\(\mathbf{Y}(t+1) = \mathbf{u} + \mathbf{R} \mathbf{Y}(t) + \mathbf{\epsilon}\)

\(\mu (t, \mathbf{Y}(t)) = [\mu_0 + \mathbf{b} \mathbf{Y}(t) + \mathbf{Y}(t)^* \mathbf{Q} \mathbf{Y}(t)] e^{\theta t}\)

Coefficients \(\mathbf{u}\) (a \(k \times 1\) matrix, where \(k\) is the number of covariates), \(\mathbf{R}\) (a \(k \times k\) matrix), \(\mu_0\), \(\mathbf{b}\) (a \(1 \times k\) matrix) and \(\mathbf{Q}\) (a \(k \times k\) matrix) are assumed to be constant in the implementation of this model in the R package stpm. \(\mathbf{\epsilon}\) is a \(k \times 1\) matrix of normally distributed random residuals. The symbol ’*’ denotes the transpose operation. \(\theta\) is a parameter to be estimated along with the other parameters (\(\mathbf{u}\), \(\mathbf{R}\), \(\mu_0\), \(\mathbf{b}\), \(\mathbf{Q}\)).
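
The two equations above can be written out directly in base R. The sketch below is not the package's implementation; the function names `step_discrete` and `hazard_discrete`, the residual standard deviation `sigma`, and all numeric values are illustrative assumptions.

```r
# One step of Y(t+1) = u + R Y(t) + eps for k covariates
step_discrete <- function(Y, u, R, sigma) {
  eps <- rnorm(length(Y), mean = 0, sd = sigma)  # normally distributed residuals
  u + R %*% Y + eps
}

# Hazard mu(t, Y) = [mu0 + b Y + Y* Q Y] exp(theta t)
hazard_discrete <- function(t, Y, mu0, b, Q, theta) {
  lin  <- as.numeric(b %*% Y)            # b is 1 x k, Y is k x 1
  quad <- as.numeric(t(Y) %*% Q %*% Y)   # quadratic form in Y
  (mu0 + lin + quad) * exp(theta * t)
}

# One covariate (k = 1) with made-up coefficient values
Y <- matrix(80, nrow = 1)
set.seed(1)
Y_next <- step_discrete(Y, u = matrix(4), R = matrix(0.95), sigma = 5)
mu <- hazard_discrete(t = 30, Y = Y, mu0 = 2e-4, b = matrix(-4e-6),
                      Q = matrix(3e-8), theta = 0.08)
```

For \(k > 1\) the same functions apply when `u` and `Y` are \(k \times 1\) matrices, `R` and `Q` are \(k \times k\), and `b` is \(1 \times k\).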

Example

library(stpm)
#Data simulation (200 individuals)
data <- simdata_discr(N=200)
#Estimation of parameters
pars <- spm_discrete(data)
pars
## $Ak2005
## $Ak2005$theta
## [1] 0.081
## 
## $Ak2005$mu0
## [1] 0.0001638956937
## 
## $Ak2005$b
## [1] -4.121719205e-06
## 
## $Ak2005$Q
##                 [,1]
## [1,] 2.715165309e-08
## 
## $Ak2005$u
## [1] 3.559420616
## 
## $Ak2005$R
##             [,1]
## [1,] 0.955516794
## 
## $Ak2005$Sigma
## [1] 5.020360331
## 
## 
## $Ya2007
## $Ya2007$a
##                [,1]
## [1,] -0.04448320597
## 
## $Ya2007$f1
##             [,1]
## [1,] 80.01717814
## 
## $Ya2007$Q
##                 [,1]
## [1,] 2.715165309e-08
## 
## $Ya2007$f
##             [,1]
## [1,] 75.90180958
## 
## $Ya2007$b
##             [,1]
## [1,] 5.020360331
## 
## $Ya2007$mu0
##                 [,1]
## [1,] 7.472720599e-06
## 
## $Ya2007$theta
## [1] 0.081
## 
## 
## attr(,"class")
## [1] "spm.discrete"

Continuous-time model

In the specification of the SPM described in the 2007 paper by Yashin and colleagues [3], the stochastic differential equation describing the age dynamics of a covariate is:

\(d\mathbf{Y}(t)= \mathbf{a}(t)(\mathbf{Y}(t) -\mathbf{f}_1(t))dt + \mathbf{b}(t)d\mathbf{W}(t), \mathbf{Y}(t=t_0)\)

In this equation, \(\mathbf{Y}(t)\) (a \(k \times 1\) matrix) is the value of a particular covariate at a time (age) \(t\). \(\mathbf{f}_1(t)\) (a \(k \times 1\) matrix) corresponds to the long-term mean value of the stochastic process \(\mathbf{Y}(t)\), which describes a trajectory of individual covariate influenced by different factors represented by a random Wiener process \(\mathbf{W}(t)\). Coefficient \(\mathbf{a}(t)\) (a \(k \times k\) matrix) is a negative feedback coefficient, which characterizes the rate at which the process reverts to its mean. In the area of research on aging, \(\mathbf{f}_1(t)\) represents the mean allostatic trajectory and \(\mathbf{a}(t)\) represents the adaptive capacity of the organism. Coefficient \(\mathbf{b}(t)\) (a \(k \times 1\) matrix) characterizes a strength of the random disturbances from Wiener process \(\mathbf{W}(t)\).

The following function \(\mu(t, \mathbf{Y}(t))\) represents a hazard rate:

\(\mu(t, \mathbf{Y}(t)) = \mu_0(t) + (\mathbf{Y}(t) - \mathbf{f}(t))^* \mathbf{Q}(t) (\mathbf{Y}(t) - \mathbf{f}(t))\)

here \(\mu_0(t)\) is the baseline hazard, which represents a risk when \(\mathbf{Y}(t)\) follows its optimal trajectory; \(\mathbf{f}(t)\) (a \(k \times 1\) matrix) represents the optimal trajectory that minimizes the risk and \(\mathbf{Q}(t)\) (\(k \times k\) matrix) represents a sensitivity of risk function to deviation from the norm.
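
To make the roles of \(\mathbf{a}\), \(\mathbf{f}_1\), \(\mathbf{b}\), \(\mathbf{f}\) and \(\mathbf{Q}\) concrete, the sketch below discretizes the stochastic differential equation with an Euler-Maruyama step for a single covariate and evaluates the quadratic hazard. It assumes constant coefficients and made-up values; the function names are hypothetical, and this is not how the package itself simulates or estimates.

```r
# One Euler-Maruyama step of dY = a (Y - f1) dt + b dW for a single covariate
step_continuous <- function(Y, a, f1, b, dt) {
  dW <- rnorm(1, mean = 0, sd = sqrt(dt))  # Wiener increment over dt
  Y + a * (Y - f1) * dt + b * dW
}

# Hazard mu = mu0 + (Y - f)* Q (Y - f), scalar case
hazard_continuous <- function(Y, mu0, f, Q) {
  mu0 + Q * (Y - f)^2
}

# Start above the long-term mean f1 = 80 and take five steps
set.seed(42)
Y <- 90
for (i in 1:5) {
  Y <- step_continuous(Y, a = -0.05, f1 = 80, b = 5, dt = 1)
}

# On the optimal trajectory (Y = f) the hazard reduces to the baseline mu0
hazard_continuous(Y = 80, mu0 = 2e-5, f = 80, Q = 2e-8)
```

With a negative `a`, the drift term pulls `Y` back toward `f1`, while any deviation of `Y` from `f` adds to the baseline hazard.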

Example

library(stpm)
#Simulate some data for 100 individuals
data <- simdata_cont(N=100)
head(data)
##      id xi          t1          t2           y1     y1.next
## [1,]  0  0 32.05650960 33.96967369 79.818545653 8.196009022
## [2,]  0  0 33.96967369 35.29857970  8.196009022 5.884494880
## [3,]  0  0 35.29857970 36.91860047  5.884494880 5.148925869
## [4,]  0  0 36.91860047 38.91063524  5.148925869 6.537602809
## [5,]  0  0 38.91063524 40.15397972  6.537602809 6.508450034
## [6,]  0  0 40.15397972 41.85655318  6.508450034 4.969544513
#Estimate parameters
# a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08 are starting values for estimation procedure
pars <- spm_continuous(dat=data,a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08)
## Parameter a achieved lower/upper bound.
## 0 
## Parameter f1 achieved lower/upper bound.
## 72 
## Parameter Q achieved lower/upper bound.
## 2.2e-08 
## Parameter b achieved lower/upper bound.
## 5.5
pars
## $a
##      [,1]
## [1,]    0
## 
## $f1
##      [,1]
## [1,]   72
## 
## $Q
##         [,1]
## [1,] 2.2e-08
## 
## $f
##             [,1]
## [1,] 87.74248991
## 
## $b
##      [,1]
## [1,]  5.5
## 
## $mu0
## [1] 2.190430711e-05
## 
## $theta
## [1] 0.08790180285
## 
## $status
## [1] 3
## 
## $LogLik
## [1] -19165.83219
## 
## $objective
## [1] 19165.82186
## 
## $message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."
## 
## $limit
## [1] TRUE
## 
## attr(,"class")
## [1] "spm.continuous"

Coefficient conversion between continuous- and discrete-time models

The coefficients of the continuous- and discrete-time models are converted as follows (‘c’ and ‘d’ denote the continuous- and discrete-time models respectively; note: these equations can be used if the intervals between consecutive observations in the discrete- and continuous-time models are equal; it is also required that the matrices \(\mathbf{a}_c\) and \(\mathbf{Q}_{c,d}\) be full-rank):

\(\mathbf{Q}_c = \mathbf{Q}_d\)

\(\mathbf{a}_c = \mathbf{R}_d - I(k)\)

\(\mathbf{b}_c = \mathbf{\Sigma}\)

\({\mathbf{f}_1}_c = -\mathbf{a}_c^{-1} \times \mathbf{u}_d\)

\(\mathbf{f}_c = -0.5 \mathbf{b}_d \times \mathbf{Q}^{-1}_d\)

\({\mu_0}_c = {\mu _0}_d - \mathbf{f}_c \times \mathbf{Q_c} \times \mathbf{f}_c^*\)

\(\theta_c = \theta_d\)

where \(k\) is a number of covariates, which is equal to model’s dimension and ’*’ denotes transpose operation; \(\mathbf{\Sigma}\) is a \(k \times 1\) matrix which contains s.d.s of corresponding residuals (residuals of a linear regression \(\mathbf{Y}(t+1) = \mathbf{u} + \mathbf{R}\mathbf{Y}(t) + \mathbf{\epsilon}\); s.d. is a standard deviation), \(I(k)\) is an identity \(k \times k\) matrix.
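
The conversion can be checked numerically. The sketch below applies the equations above to the `$Ak2005` estimates printed in the discrete-time example; the helper name `convert_d2c` is hypothetical and written here for illustration, not taken from the package. The result should agree with the `$Ya2007` block of that output up to rounding.

```r
# Convert discrete-time coefficients (u, R, b, Q, mu0, theta, Sigma)
# to continuous-time ones (a, f1, Q, f, b, mu0, theta)
convert_d2c <- function(u, R, b, Q, mu0, theta, Sigma) {
  k     <- nrow(R)
  a_c   <- R - diag(k)                    # a_c  = R_d - I(k)
  f1_c  <- -solve(a_c) %*% u              # f1_c = -a_c^{-1} u_d
  f_c   <- -0.5 * (b %*% solve(Q))        # f_c  = -0.5 b_d Q_d^{-1}, a 1 x k row
  mu0_c <- mu0 - as.numeric(f_c %*% Q %*% t(f_c))
  list(a = a_c, f1 = f1_c, Q = Q, f = t(f_c), b = Sigma,
       mu0 = mu0_c, theta = theta)
}

# $Ak2005 estimates from the discrete-time example (one covariate, so 1 x 1 matrices)
d <- list(u = matrix(3.559420616), R = matrix(0.955516794),
          b = matrix(-4.121719205e-06), Q = matrix(2.715165309e-08),
          mu0 = 0.0001638956937, theta = 0.081,
          Sigma = matrix(5.020360331))

cont <- do.call(convert_d2c, d)
cont
```

`solve()` fails on singular matrices, which mirrors the full-rank requirement stated above.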

Model with time-dependent coefficients

In the previous models, the time dependence of the coefficients was limited to multiplying them by \(e^{\theta t}\). In general, this may not be the case [5]. We extend this to a general case; for example, in the one-dimensional case:

\(\mathbf{a(t)} = \mathbf{par}_1 t + \mathbf{par}_2\) - linear function.

The corresponding equations are the same as in the one-dimensional continuous-time case described above.
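
As a small illustration of such a coefficient, the linear form of \(a(t)\) can be evaluated at a few ages. The function name `a_linear` and the parameter values are hypothetical and chosen only to show the shape of the dependence.

```r
# a(t) = par1 * t + par2, a feedback coefficient that changes linearly with age
a_linear <- function(t, par1, par2) {
  par1 * t + par2
}

ages <- c(30, 50, 70)

# Feedback that strengthens (becomes more negative) with age
a_linear(ages, par1 = -0.001, par2 = -0.02)

# A constant coefficient is the special case par1 = 0
a_linear(ages, par1 = 0, par2 = -0.05)
```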

Example

library(stpm)
#Data preparation:
n <- 50
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, 
                        start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001), 
                        frm = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.05211167053
## 
## [[1]]$f1
## [1] 80.34947033
## 
## [[1]]$Q
## [1] 2.306818108e-08
## 
## [[1]]$f
## [1] 91.20842302
## 
## [[1]]$b
## [1] 5.052118798
## 
## [[1]]$mu0
## [1] 0.000751496302
## 
## [[1]]$status
## [1] 3
## 
## [[1]]$LogLik
##           t2 
## -8296.243271 
## 
## [[1]]$objective
## [1] 8296.242174
## 
## [[1]]$message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."

Setting lower and upper boundaries of the model parameters

Lower and upper boundaries can be set with the parameters \(lb\) and \(ub\), which are plain numeric vectors. Note: the lengths of \(lb\) and \(ub\) must equal the total number of estimated parameter values. Lower and upper boundaries can be set for the continuous-time and time-dependent models only.

Setting lb and ub for continuous-time model

One covariate

Below we show the example of setting up \(lb\) and \(ub\) when we have a single covariate:

library(stpm)
data <- simdata_cont(N=100, ystart = 80, a = -0.1, Q = 1e-06, mu0 = 1e-5, theta = 0.08, f1 = 80, f=80, b=1, dt=1, sd0=5)
ans <- spm_continuous(dat=data,
                      a = -0.1,
                      f1 = 82, 
                      Q = 1.4e-6,
                      f = 77,
                      b = 1,
                      mu0 = 1.6e-5,
                      theta = 0.1,
                      stopifbound = FALSE, maxeval=300,
                      lb=c(-0.2, 60, 0.1e-6, 60, 0.1, 0.1e-5, 0.01), 
                      ub=c(0, 140, 5e-06, 140, 3, 5e-5, 0.20),
                      algorithm="NLOPT_LN_NELDERMEAD")
## Parameter a achieved lower/upper bound.
## 0 
## Parameter b achieved lower/upper bound.
## 3
ans
## $a
##      [,1]
## [1,]    0
## 
## $f1
##             [,1]
## [1,] 79.98541537
## 
## $Q
##                 [,1]
## [1,] 1.035145715e-06
## 
## $f
##             [,1]
## [1,] 75.07801609
## 
## $b
##      [,1]
## [1,]    3
## 
## $mu0
## [1] 2.977091419e-05
## 
## $theta
## [1] 0.1478534275
## 
## $status
## [1] 3
## 
## $LogLik
## [1] -32474.06117
## 
## $objective
## [1] 32474.06117
## 
## $message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."
## 
## $limit
## [1] TRUE
## 
## attr(,"class")
## [1] "spm.continuous"
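
Before running an estimation like the one above, it can help to confirm that `lb` and `ub` have the right length and that each starting value lies inside its bounds. The helper below is a hypothetical base R sketch, not an stpm function; the parameter order (a, f1, Q, f, b, mu0, theta) is inferred from the call above.

```r
# Starting values and bounds copied from the one-covariate call above
start <- c(a = -0.1, f1 = 82, Q = 1.4e-6, f = 77, b = 1, mu0 = 1.6e-5, theta = 0.1)
lb <- c(-0.2, 60, 0.1e-6, 60, 0.1, 0.1e-5, 0.01)
ub <- c(0, 140, 5e-06, 140, 3, 5e-5, 0.20)

# Stops if the lengths differ; otherwise reports, per parameter,
# whether the starting value lies within [lb, ub]
check_bounds <- function(start, lb, ub) {
  stopifnot(length(lb) == length(start), length(ub) == length(start))
  ok <- lb <= start & start <= ub
  names(ok) <- names(start)
  ok
}

check_bounds(start, lb, ub)
```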

Two covariates

This is an example for two physiological variables (covariates).

library(stpm)

# Simulate 100 individuals with two covariates
data <- simdata_cont(N=100, 
                     a=matrix(c(-0.1,  0.001, 0.001, -0.1), nrow = 2, ncol = 2, byrow = TRUE),
                     f1=t(matrix(c(100, 200), nrow = 2, ncol = 1, byrow = FALSE)),
                     Q=matrix(c(1e-06, 1e-7, 1e-7,  1e-06), nrow = 2, ncol = 2, byrow = TRUE),
                     f=t(matrix(c(100, 200), nrow = 2, ncol = 1, byrow = FALSE)),
                     b=matrix(c(1, 2), nrow = 2, ncol = 1, byrow = FALSE),
                     mu0=1e-4,
                     theta=0.08,
                     ystart = c(100,200), sd0=c(5, 10), dt=1)

# Starting values for the estimation
a.d <- matrix(c(-0.15,  0.002, 0.002, -0.15), nrow = 2, ncol = 2, byrow = TRUE)
f1.d <- t(matrix(c(95, 195), nrow = 2, ncol = 1, byrow = FALSE))
Q.d <- matrix(c(1.2e-06, 1.2e-7, 1.2e-7,  1.2e-06), nrow = 2, ncol = 2, byrow = TRUE)
f.d <- t(matrix(c(105, 205), nrow = 2, ncol = 1, byrow = FALSE))
b.d <- matrix(c(1, 2), nrow = 2, ncol = 1, byrow = FALSE)
mu0.d <- 1.1e-4
theta.d <- 0.07

ans <- spm_continuous(dat=data,
                      a = a.d, 
                      f1 = f1.d,
                      Q = Q.d,
                      f = f.d,
                      b = b.d,
                      mu0 = mu0.d,
                      theta = theta.d,
                      maxeval=150,
                      lb=c(-0.5, ifelse(a.d[2,1] > 0, a.d[2,1]-0.5*a.d[2,1], a.d[2,1]+0.5*a.d[2,1]), ifelse(a.d[1,2] > 0, a.d[1,2]-0.5*a.d[1,2], a.d[1,2]+0.5*a.d[1,2]), -0.5,  
                           80, 100, 
                           Q.d[1,1]-0.5*Q.d[1,1], ifelse(Q.d[2,1] > 0, Q.d[2,1]-0.5*Q.d[2,1], Q.d[2,1]+0.5*Q.d[2,1]), ifelse(Q.d[1,2] > 0, Q.d[1,2]-0.5*Q.d[1,2], Q.d[1,2]+0.5*Q.d[1,2]), Q.d[2,2]-0.5*Q.d[2,2],
                           80, 100,
                           0.1, 0.5,
                           0.1e-4,
                           0.01),
                      ub=c(-0.08,  0.002,  0.002, -0.08,  
                           110, 220, 
                           Q.d[1,1]+0.1*Q.d[1,1], ifelse(Q.d[2,1] > 0, Q.d[2,1]+0.1*Q.d[2,1], Q.d[2,1]-0.1*Q.d[2,1]), ifelse(Q.d[1,2] > 0, Q.d[1,2]+0.1*Q.d[1,2], Q.d[1,2]-0.1*Q.d[1,2]), Q.d[2,2]+0.1*Q.d[2,2],
                           110, 220,
                           1.5, 2.5,
                           1.2e-4,
                           0.10), algorithm = "NLOPT_LN_NELDERMEAD")
ans
## $a
##                 [,1]            [,2]
## [1,] -0.091624122048  0.001941525244
## [2,]  0.001908522884 -0.089466200848
## 
## $f1
##              [,1]
## [1,]  80.59686663
## [2,] 102.37898580
## 
## $Q
##                 [,1]            [,2]
## [1,] 1.320000000e-06 1.320000000e-07
## [2,] 1.124429625e-07 9.456159758e-07
## 
## $f
##             [,1]
## [1,] 109.1155730
## [2,] 207.4766745
## 
## $b
##             [,1]
## [1,] 1.485347673
## [2,] 2.492962027
## 
## $mu0
## [1] 0.0001009249539
## 
## $theta
## [1] 0.0961449891
## 
## $status
## [1] 5
## 
## $LogLik
## [1] -203566.1789
## 
## $objective
## [1] 203180.3453
## 
## $message
## [1] "NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached."
## 
## $limit
## [1] FALSE
## 
## attr(,"class")
## [1] "spm.continuous"
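
The long `lb`/`ub` vectors in the two-covariate call above follow a pattern: each matrix is unrolled column by column, in the order a, f1, Q, f, b, mu0, theta, which gives \(2k^2 + 3k + 2\) values for \(k\) covariates. The sketch below builds such vectors programmatically. The helper names and the symmetric ±50% bounds are assumptions made for illustration and differ from the bounds used in the call above; the ordering is inferred from that call rather than taken from the package documentation.

```r
# Expected number of values: a and Q have k*k entries, f1, f and b have k, mu0 and theta 1 each
n_pars <- function(k) 2 * k^2 + 3 * k + 2

# c() drops matrix dimensions and unrolls each matrix column by column
flatten_pars <- function(a, f1, Q, f, b, mu0, theta) {
  c(a, f1, Q, f, b, mu0, theta)
}

a.d   <- matrix(c(-0.15, 0.002, 0.002, -0.15), nrow = 2, byrow = TRUE)
f1.d  <- c(95, 195)
Q.d   <- matrix(c(1.2e-06, 1.2e-7, 1.2e-7, 1.2e-06), nrow = 2, byrow = TRUE)
f.d   <- c(105, 205)
b.d   <- c(1, 2)

start <- flatten_pars(a.d, f1.d, Q.d, f.d, b.d, mu0 = 1.1e-4, theta = 0.07)
length(start) == n_pars(2)

# Bounds at 50% below and above each starting value;
# pmin/pmax keep lb <= ub for negative entries as well
lb <- pmin(0.5 * start, 1.5 * start)
ub <- pmax(0.5 * start, 1.5 * start)
```

Using `pmin`/`pmax` replaces the repeated `ifelse(x > 0, ...)` sign checks with a single vectorized expression.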

Setting lb and ub for model with time-dependent coefficients

This model uses only one covariate, therefore setting-up model parameters is easy:

n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, start=list(a=-0.05, f1=80, Q=2e-08, f=80, b=5, mu0=0.001), 
                        lb=c(-1, 30, 1e-8, 30, 1, 1e-6), ub=c(0, 120, 5e-8, 130, 10, 1e-2))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.03917269718
## 
## [[1]]$f1
## [1] 79.22784314
## 
## [[1]]$Q
## [1] 2.691356614e-08
## 
## [[1]]$f
## [1] 73.18826698
## 
## [[1]]$b
## [1] 4.630588924
## 
## [[1]]$mu0
## [1] 0.001454410246
## 
## [[1]]$status
## [1] 3
## 
## [[1]]$LogLik
##           t2 
## -1529.894425 
## 
## [[1]]$objective
## [1] 1529.894425
## 
## [[1]]$message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."

Special case when some model parameter functions are equal to zero

Suppose you want one of the parameter functions to be equal to zero, for example \(f=0\). Let’s emulate this case:

library(stpm)
n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, frm = list(at="a", f1t="f1", Qt="Q", ft="0", bt="b", mu0t="mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.06246404735
## 
## [[1]]$f1
## [1] 79.51632369
## 
## [[1]]$Q
## [1] 2.170975641e-08
## 
## [[1]]$b
## [1] 60
## 
## [[1]]$mu0
## [1] 3.75
## 
## [[1]]$status
## [1] 3
## 
## [[1]]$LogLik
##           t2 
## -4624.181368 
## 
## [[1]]$objective
## [1] 4624.178531
## 
## [[1]]$message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."

As you can see, there is no parameter \(f\) in \(opt.par\). This is because we set \(f=0\) in \(frm\).

Then, if you want to set constraints, you must not specify a starting value (parameter \(start\)) or \(lb\)/\(ub\) for the parameter \(f\) (otherwise, the function raises an error):

n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, frm = list(at="a", f1t="f1", Qt="Q", ft="0", bt="b", mu0t="mu0"), 
                        start=list(a=-0.05, f1=80, Q=2e-08, b=5, mu0=0.001), 
                        lb=c(-1, 30, 1e-8, 1, 1e-6), ub=c(0, 120, 5e-8, 10, 1e-2))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.05057866408
## 
## [[1]]$f1
## [1] 78.8321946
## 
## [[1]]$Q
## [1] 3.177487132e-08
## 
## [[1]]$b
## [1] 5.005407072
## 
## [[1]]$mu0
## [1] 0.001139035337
## 
## [[1]]$status
## [1] 3
## 
## [[1]]$LogLik
##           t2 
## -1554.643326 
## 
## [[1]]$objective
## [1] 1554.642593
## 
## [[1]]$message
## [1] "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."

You can proceed in the same manner if you want two or more parameters to be equal to zero.
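
When several parameter functions are set to zero, removing the matching entries from \(start\), \(lb\) and \(ub\) by hand is error-prone. The base R sketch below derives them from \(frm\); it assumes that each name in \(frm\) is the parameter name followed by "t" (as in the calls above), and the named `lb`/`ub` vectors are a convenience of this sketch, not a requirement of the package.

```r
frm   <- list(at = "a", f1t = "f1", Qt = "Q", ft = "0", bt = "b", mu0t = "mu0")
start <- list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001)
lb    <- c(a = -1, f1 = 30, Q = 1e-8, f = 30, b = 1, mu0 = 1e-6)
ub    <- c(a = 0, f1 = 120, Q = 5e-8, f = 130, b = 10, mu0 = 1e-2)

# Parameters whose formula is the constant "0"; strip the trailing "t" from the frm names
zeroed <- sub("t$", "", names(frm)[unlist(frm) == "0"])

# Keep only the parameters that are still estimated
keep  <- setdiff(names(start), zeroed)
start <- start[keep]
lb    <- unname(lb[keep])
ub    <- unname(ub[keep])

names(start)
```

For the `frm` shown here this leaves five parameters, and `lb` and `ub` reduce to the same vectors used in the constrained call above.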

Simulation (individual trajectory projection, also known as microsimulations)

We added one- and multi-dimensional simulation to be able to generate test data for hypothesis testing. The simulated data can be discrete (equal intervals between observations) or continuous (arbitrary intervals).

Discrete-time simulation

The corresponding function is shown below (k is the number of variables (covariates), equal to the model’s dimension):

simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1)

Here:

N - Number of individuals

a - A k by k matrix, which characterizes the rate of the adaptive response

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k

Q - A matrix of k by k, which is a non-negative-definite symmetric matrix

f - A vector-function (with length k) of the normal (or optimal) state

b - A diffusion coefficient, k by k matrix

mu0 - mortality at start period of time (baseline hazard)

theta - A displacement coefficient of the Gompertz function

ystart - A vector with length equal to number of dimensions used, defines starting values of covariates

tstart - A start time (30 by default). Can be a single number or a vector of two numbers, c(a, b); in the latter case, the starting time is simulated from a uniform(a, b) distribution.

tend - A number, defines a final time (105 by default)

dt - A time interval between observations.

This function returns a table with simulated data, as shown in example below:

library(stpm)
data <- simdata_discr(N=10)
head(data)
##      id xi t1 t2          y1     y1.next
## [1,]  1  0 30 31 80.00000000 75.30859335
## [2,]  1  0 31 32 75.30859335 74.93901653
## [3,]  1  0 32 33 74.93901653 76.04806159
## [4,]  1  0 33 34 76.04806159 73.54152697
## [5,]  1  0 34 35 73.54152697 75.20048292
## [6,]  1  0 35 36 75.20048292 76.85295005
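
To show how a table of this shape can arise, the sketch below simulates single-covariate trajectories in base R. It is not the algorithm used by `simdata_discr`: the Euler-type update, the survival probability \(e^{-\mu \, dt}\) per interval, the Gompertz factor \(e^{\theta t}\) on the whole hazard, and the `NA` in `y1.next` at death are all assumptions of this illustration, and `simulate_one` is a hypothetical name.

```r
# Simulate one individual; returns a matrix with columns id, xi, t1, t2, y1, y1.next
simulate_one <- function(id, a = -0.05, f1 = 80, Q = 2e-8, f = 80, b = 5,
                         mu0 = 1e-5, theta = 0.08, ystart = 80,
                         tstart = 30, tend = 105, dt = 1) {
  rows <- list()
  t <- tstart
  y <- ystart
  while (t < tend) {
    # Covariate update: mean reversion toward f1 plus a random disturbance
    y_next <- y + a * (y - f1) * dt + b * sqrt(dt) * rnorm(1)
    # Quadratic hazard with a Gompertz age factor (assumed form)
    mu <- (mu0 + Q * (y - f)^2) * exp(theta * t)
    # The individual dies in this interval with probability 1 - exp(-mu * dt)
    died <- runif(1) > exp(-mu * dt)
    rows[[length(rows) + 1]] <- c(id = id, xi = as.numeric(died),
                                  t1 = t, t2 = t + dt,
                                  y1 = y, y1.next = if (died) NA else y_next)
    if (died) break
    t <- t + dt
    y <- y_next
  }
  do.call(rbind, rows)
}

set.seed(123)
sim <- do.call(rbind, lapply(1:10, simulate_one))
head(sim)
```

Each row describes one interval \([t_1, t_2]\), so the `y1.next` of one row is the `y1` of the next row for the same individual.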

Continuous-time simulation

The corresponding function is shown below (k is the number of variables (covariates), equal to the model’s dimension):

simdata_cont(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=c(30,50), tend=105)

Here:

N - Number of individuals

a - A k by k matrix, which characterizes the rate of the adaptive response

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k

Q - A matrix of k by k, which is a non-negative-definite symmetric matrix

f - A vector-function (with length k) of the normal (or optimal) state

b - A diffusion coefficient, k by k matrix

mu0 - mortality at start period of time (baseline hazard)

theta - A displacement coefficient of the Gompertz function

ystart - A vector with length equal to number of dimensions used, defines starting values of covariates

tstart - A start time. Can be a single number or a vector of two numbers, c(a, b); in the latter case, the starting time is simulated from a uniform(a, b) distribution (c(30, 50) in the call above).

tend - A number, defines a final time (105 by default)

This function returns a table with simulated data, as shown in example below:

library(stpm)
data <- simdata_cont(N=10)
head(data)
##      id xi          t1          t2           y1     y1.next
## [1,]  0  0 34.85958278 35.93733725 80.952780924 4.326638672
## [2,]  0  0 35.93733725 37.86255724  4.326638672 5.707187016
## [3,]  0  0 37.86255724 39.23340260  5.707187016 5.602650376
## [4,]  0  0 39.23340260 41.08823239  5.602650376 7.642635023
## [5,]  0  0 41.08823239 42.11898094  7.642635023 7.384254061
## [6,]  0  0 42.11898094 43.92716142  7.384254061 7.534008344

SPM with partially observed covariates

The Stochastic Process Model has many applications in the analysis of longitudinal biodemographic data. Such data contain various physiological variables (known as covariates). Data can also contain genetic information available for all or some of the participants. Taking advantage of both genetic and non-genetic information can provide further insights into a broad range of processes describing aging-related changes in the organism.

Method

In this package, SPM with partially observed covariates is implemented in the form of GenSPM (Genetic SPM), presented in 2009 by Arbeev et al. [6] and further advanced in [7,8]. GenSPM elaborates the basic stochastic process model by introducing a categorical variable, \(Z\), which may be a specific value of a genetic marker or, in general, any categorical variable. Currently, \(Z\) has two gradations, 0 or 1, in a genetic group of interest, with \(P(Z=1) = p\), \(p \in [0, 1]\), where \(p\) is the proportion of carriers of an allele in a population (and \(1 - p\) the proportion of non-carriers). An example of longitudinal data with the genetic component \(Z\) is provided below.

library(stpm)
data <- sim_pobs(N=10)
head(data)
##   id xi          t1          t2 Z          y1     y1.next
## 1  0  0 46.02983745 47.06230885 0 80.73686576 81.77930889
## 2  0  0 47.06230885 47.97958746 0 81.77930889 73.66613392
## 3  0  0 47.97958746 49.04275730 0 73.66613392 74.78312444
## 4  0  0 49.04275730 50.13329034 0 74.78312444 79.64172999
## 5  0  0 50.13329034 51.12933192 0 79.64172999 77.24653092
## 6  0  0 51.12933192 52.06182763 0 77.24653092 75.79324332

In the specification of the SPM described in the 2007 paper by Yashin and colleagues [3], the stochastic differential equation describing the age dynamics of a physiological variable (a dynamic component of the model) is:

\(dY(t) = a(Z, t)(Y(t) - f_1(Z, t))dt + b(Z, t)dW(t), Y(t = t_0)\)

In this equation, \(Y(t)\) is a \(k \times 1\) matrix (where \(k\), the number of covariates, is the model dimension) describing the values of the physiological variables at time (e.g. age) \(t\). \(f_1(Z,t)\) is a \(k \times 1\) matrix that corresponds to the long-term average value of the stochastic process \(Y(t)\), which describes the trajectory of an individual variable influenced by different factors represented by a random Wiener process \(W(t)\). The negative feedback coefficient \(a(Z,t)\) (a \(k \times k\) matrix) characterizes the rate at which the stochastic process reverts to its mean. In research on aging and well-being, \(f_1(Z,t)\) represents the average allostatic trajectory and \(a(Z,t)\) represents the adaptive capacity of the organism. Coefficient \(b(Z,t)\) (a \(k \times 1\) matrix) characterizes the strength of the random disturbances from the Wiener process \(W(t)\). All of these parameters depend on \(Z\) (a genetic marker having values 1 or 0). The following function \(\mu(t,Y(t))\) represents a hazard rate:

\(\mu(t,Y(t)) = \mu_0(t) + (Y(t) - f(Z, t))^*Q(Z, t)(Y(t) - f(Z, t))\)

In this equation, \(\mu_0(t)\) is the baseline hazard, which represents the risk when \(Y(t)\) follows its optimal trajectory; \(f(Z, t)\) (a \(k \times 1\) matrix) represents the optimal trajectory that minimizes the risk, and \(Q(Z, t)\) (a \(k \times k\) matrix) represents the sensitivity of the risk function to deviations from the norm. In general, the model coefficients \(a(Z, t)\), \(f_1(Z, t)\), \(Q(Z, t)\), \(f(Z, t)\), \(b(Z, t)\) and \(\mu_0(t)\) are time (age)-dependent. Once we have data, we can run the analysis, i.e. estimate the coefficients (here they are assumed to be time-independent, and the data are simulated):

library(stpm)
#Generating data:
data <- sim_pobs(N=10)
head(data)
##   id xi          t1          t2 Z          y1     y1.next
## 1  0  0 77.13240643 78.21507971 0 80.73439688 76.45193893
## 2  0  0 78.21507971 79.25361609 0 76.45193893 72.62483747
## 3  0  0 79.25361609 80.26317235 0 72.62483747 74.23254070
## 4  0  0 80.26317235 81.17983604 0 74.23254070 76.24026360
## 5  0  0 81.17983604 82.22082449 0 76.24026360 75.18887034
## 6  0  0 82.22082449 83.30248791 0 75.18887034 78.45315536
#Parameters estimation:
pars <- spm_pobs(x=data)
## Parameter mu0H achieved lower/upper bound.
## 7.2e-06 
## Parameter thetaL achieved lower/upper bound.
## 0.09
pars
## $aH
##                [,1]
## [1,] -0.05465783438
## 
## $aL
##                [,1]
## [1,] -0.00993384491
## 
## $f1H
##             [,1]
## [1,] 63.11843682
## 
## $f1L
##             [,1]
## [1,] 72.25350915
## 
## $QH
##                 [,1]
## [1,] 1.693537221e-08
## 
## $QL
##                 [,1]
## [1,] 2.722256986e-08
## 
## $fH
##             [,1]
## [1,] 61.17639433
## 
## $fL
##             [,1]
## [1,] 72.04736146
## 
## $bH
##             [,1]
## [1,] 4.016091307
## 
## $bL
##            [,1]
## [1,] 4.85862775
## 
## $mu0H
## [1] 7.2e-06
## 
## $mu0L
## [1] 9.037355295e-06
## 
## $thetaH
## [1] 0.07308316
## 
## $thetaL
## [1] 0.09
## 
## $p
## [1] 0.2707545659
## 
## $limit
## [1] TRUE
## 
## attr(,"class")
## [1] "pobs.spm"

Here \(H\) and \(L\) denote the parameters for \(Z\) = 1 (H) and \(Z\) = 0 (L).
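
The H/L naming makes it easy to pick the parameter set that applies to a given value of \(Z\). The sketch below copies the estimates printed above into a plain list and selects by suffix; `pars_for` is a hypothetical helper written for this illustration, not an stpm function.

```r
# Estimates copied from the output above (one covariate)
pars <- list(aH = -0.05465783438, aL = -0.00993384491,
             f1H = 63.11843682,   f1L = 72.25350915,
             QH = 1.693537221e-08, QL = 2.722256986e-08,
             fH = 61.17639433,    fL = 72.04736146,
             bH = 4.016091307,    bL = 4.85862775,
             mu0H = 7.2e-06,      mu0L = 9.037355295e-06,
             thetaH = 0.07308316, thetaL = 0.09,
             p = 0.2707545659)

# Return the parameter set for Z = 1 (suffix "H") or Z = 0 (suffix "L")
pars_for <- function(pars, Z) {
  suffix <- if (Z == 1) "H" else "L"
  base <- c("a", "f1", "Q", "f", "b", "mu0", "theta")
  out <- pars[paste0(base, suffix)]
  names(out) <- base
  out
}

pars_for(pars, Z = 1)$f1
pars_for(pars, Z = 0)$f1

# Draw Z for ten hypothetical individuals with P(Z = 1) = p
set.seed(7)
Z <- rbinom(10, size = 1, prob = pars$p)
```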

Joint analysis of two datasets: first dataset with genetic and second dataset with non-genetic component

library(stpm)
data.genetic <- sim_pobs(N=10, mode='observed')
head(data.genetic)
##   id xi          t1          t2 Z          y1     y1.next
## 1  0  0 79.48383291 80.39515951 0 79.54317222 82.81030375
## 2  0  0 80.39515951 81.46361898 0 82.81030375 82.93364072
## 3  0  0 81.46361898 82.39851754 0 82.93364072 80.72857270
## 4  0  0 82.39851754 83.43216498 0 80.72857270 90.28996683
## 5  0  0 83.43216498 84.49518831 0 90.28996683 86.97105972
## 6  0  0 84.49518831 85.54288377 0 86.97105972 95.21701323
data.nongenetic <- sim_pobs(N=50, mode='unobserved')
head(data.nongenetic)
##   id xi          t1          t2          y1     y1.next
## 1  0  0 91.98375488 93.01776870 79.90353911 81.80383085
## 2  0  0 93.01776870 93.93222594 81.80383085 85.04172799
## 3  0  0 93.93222594 94.91920841 85.04172799 85.98897474
## 4  0  0 94.91920841 95.98856050 85.98897474 81.43385406
## 5  0  0 95.98856050 96.99994110 81.43385406 82.61591612
## 6  0  0 96.99994110 97.91354800 82.61591612 87.29304801
#Parameters estimation:
pars <- spm_pobs(x=data.genetic, y = data.nongenetic, mode='combined')
## Parameter thetaH achieved lower/upper bound.
## 0.072
pars
## $aH
##                [,1]
## [1,] -0.01325007888
## 
## $aL
##                [,1]
## [1,] -0.00500855276
## 
## $f1H
##             [,1]
## [1,] 65.62715583
## 
## $f1L
##             [,1]
## [1,] 85.84438318
## 
## $QH
##                [,1]
## [1,] 1.25559488e-08
## 
## $QL
##                 [,1]
## [1,] 2.739888553e-08
## 
## $fH
##             [,1]
## [1,] 63.25476943
## 
## $fL
##             [,1]
## [1,] 84.75933485
## 
## $bH
##             [,1]
## [1,] 4.358386969
## 
## $bL
##             [,1]
## [1,] 5.168357932
## 
## $mu0H
## [1] 8.524793362e-06
## 
## $mu0L
## [1] 9.063136978e-06
## 
## $thetaH
## [1] 0.072
## 
## $thetaL
## [1] 0.09005111272
## 
## $p
## [1] 0.2697500323
## 
## $limit
## [1] TRUE
## 
## attr(,"class")
## [1] "pobs.spm"

Here the mode ‘observed’ is used to simulate data with the genetic component \(Z\), and ‘unobserved’ to simulate data without it.

References

[1] Woodbury M.A., Manton K.G., Random-Walk of Human Mortality and Aging. Theoretical Population Biology, 1977 11:37-48.

[2] Yashin, A.I., Manton K.G., Vaupel J.W. Mortality and aging in a heterogeneous population: a stochastic process model with observed and unobserved variables. Theor Pop Biology, 1985 27.

[3] Yashin, A.I. et al. Stochastic model for analysis of longitudinal data on aging and mortality. Mathematical Biosciences, 2007 208(2) 538-551.

[4] Akushevich I., Kulminski A. and Manton K.: Life tables with covariates: Dynamic model for Nonlinear Analysis of Longitudinal Data. 2005. Mathematical Population Studies, 12(2), pp.: 51-80.

[5] Yashin, A. et al. Health decline, aging and mortality: how are they related? Biogerontology, 2007 8(3), 291-302.

[6] Arbeev, K.G., Akushevich, I., Kulminski, A.M., Arbeeva, L.S., Akushevich, L., Ukraintseva, S.V., Culminskaya, I.V., Yashin, A.I.: Genetic model for longitudinal studies of aging, health, and longevity and its potential application to incomplete data. Journal of Theoretical Biology 258(1), 103-111 (2009).

[7] Arbeev K.G, Akushevich I., Kulminski A.M., Ukraintseva S.V., Yashin A.I., Joint Analyses of Longitudinal and Time-to-Event Data in Research on Aging: Implications for Predicting Health and Survival, Front Public Health. 2014 Nov 6;2:228. doi: 10.3389/fpubh.2014.00228

[8] Arbeev K., Arbeeva L., Akushevich I., Kulminski A., Ukraintseva S., Yashin A., Latent Class and Genetic Stochastic Process Models: Implications for Analyses of Longitudinal Data on Aging, Health, and Longevity, JSM-2015, Seattle, WA.