The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Dynamic Factor Model

Greg Veramendi

Overview

An example of how to estimate the dynamics of a latent factor. Assume that indicators are observed at two time points and you want to estimate how the period-2 value depends on the period-1 value. You need two things:

  1. A measurement system that is invariant across time (same loadings, thresholds / residual SDs across the two periods) so the factor scale is comparable, and
  2. A structural equation linking the period-2 factor to the period-1 factor: f₂ = α + β·f₁ + ε.

factorana supports this with a two-stage workflow, and two helper functions make the setup a few lines:

This vignette walks through a simulated example where α, β, and σ²_ε are known, shows the recovery, and explains one subtlety of the workflow (why the anchor-period intercepts are the right ones to carry into Stage 2).

Simulate data

One latent construct, three linear indicators per period, two periods. The DGP:

\[ f_1 \sim N(0,\sigma^2_{f_1}), \qquad f_2 = \alpha + \beta\,f_1 + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2_\varepsilon) \]

\[ Y_{t,m} = \tau_m + \lambda_m\,f_t + u_{t,m}, \quad u_{t,m} \sim N(0,\sigma_m^2) \]

with measurement parameters τ_m, λ_m, σ_m shared across the two periods.

library(factorana)

set.seed(41)
n <- 1500

# Structural parameters (what we want to recover)
true_var_f1   <- 1.0
true_alpha    <- 0.4
true_beta     <- 0.6
true_sigma_e2 <- 0.5

# Shared measurement parameters
item_int   <- c(1.5, 1.0, 0.8)
item_load  <- c(1.0, 0.9, 1.1)   # first loading fixed to 1 in the model
item_sigma <- c(0.7, 0.75, 0.65)

f1  <- rnorm(n, 0, sqrt(true_var_f1))
eps <- rnorm(n, 0, sqrt(true_sigma_e2))
f2  <- true_alpha + true_beta * f1 + eps

gen_Y <- function(f, i) {
  item_int[i] + item_load[i] * f + rnorm(length(f), 0, item_sigma[i])
}

dat <- data.frame(
  intercept = 1,
  eval      = 1L,
  Y_t1_m1 = gen_Y(f1, 1), Y_t1_m2 = gen_Y(f1, 2), Y_t1_m3 = gen_Y(f1, 3),
  Y_t2_m1 = gen_Y(f2, 1), Y_t2_m2 = gen_Y(f2, 2), Y_t2_m3 = gen_Y(f2, 3)
)

The wide-format data has columns named <prefix><item>: Y_t1_m1, Y_t1_m2, Y_t1_m3 for wave 1, and Y_t2_m1, … for wave 2.

Stage 1: define_dynamic_measurement()

One function call builds the measurement model: a 2-factor independent system (one factor per period) with loadings and sigmas tied across periods, intercepts left free per period.

dyn <- define_dynamic_measurement(
  data                 = dat,
  items                = c("m1", "m2", "m3"),
  period_prefixes      = c("Y_t1_", "Y_t2_"),
  model_type           = "linear",
  evaluation_indicator = "eval"
)
# The wrapper generates 5 equality constraints: 2 for loadings (items
# m2, m3; item m1's loading is fixed to 1 on its factor slot) and 3
# for sigmas.
length(dyn$equality_constraints)
#> [1] 5

Estimate Stage 1 with estimate_model_rcpp() exactly as for any model system:

ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)
result_stage1 <- estimate_model_rcpp(
  model_system = dyn$model_system,
  data         = dat,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage1$convergence
#> [1] 0

Inspect the intercepts: the wave-1 intercepts should match the DGP τ_m (because E[f₁] = 0 by convention). The wave-2 intercepts are shifted by λ_m · E[f₂], which is the mean-drift artefact we are about to discard.

est <- result_stage1$estimates
tab <- data.frame(
  m        = 1:3,
  DGP_tau  = item_int,
  wave_1   = round(c(est["Y_t1_m1_intercept"],
                     est["Y_t1_m2_intercept"],
                     est["Y_t1_m3_intercept"]), 3),
  wave_2   = round(c(est["Y_t2_m1_intercept"],
                     est["Y_t2_m2_intercept"],
                     est["Y_t2_m3_intercept"]), 3)
)
knitr::kable(tab, row.names = FALSE)
m DGP_tau wave_1 wave_2
1 1.5 1.488 1.905
2 1.0 1.000 1.332
3 0.8 0.778 1.234

Why we do not pool the intercepts

A natural alternative is to stack period-1 and period-2 rows into one long-format dataset and fit a single-factor CFA. The resulting intercepts would be the pooled means of Y across both periods, biased by λ_m · E[f₂]/2 (because the pooled mean averages the period-1 and period-2 factor means). Plugging those into Stage 2 under-estimates α by roughly E[f₂]/2.

The wrapper avoids that: Stage 1 estimates period-specific intercepts, then build_dynamic_previous_stage() carries only the anchor period’s intercepts into Stage 2.

Stage 2: build_dynamic_previous_stage() + SE_linear

dummy <- build_dynamic_previous_stage(
  dyn           = dyn,
  stage1_result = result_stage1,
  data          = dat,
  anchor_period = 1L
)

fm_stage2 <- define_factor_model(
  n_factors        = 2,
  n_types          = 1,
  factor_structure = "SE_linear"
)
ms_stage2 <- define_model_system(
  components     = list(),       # measurement components prepended from previous_stage
  factor         = fm_stage2,
  previous_stage = dummy
)
#> Two-stage SE estimation: Stage 1 (independent) -> Stage 2 (SE_linear)
#>   Measurement parameters (loadings, thresholds) will be fixed from Stage 1
#>   Factor distribution parameters will use Stage 2 structure
#>   Fixing 16 measurement parameters, ignoring 2 factor distribution parameters

init_s2 <- initialize_parameters(ms_stage2, dat, verbose = FALSE)
init_s2$init_params["factor_var_1"]    <- unname(dummy$estimates["factor_var_1"])
init_s2$init_params["se_intercept"]    <- 0.0
init_s2$init_params["se_linear_1"]     <- 0.5
init_s2$init_params["se_residual_var"] <- 0.5

result_stage2 <- estimate_model_rcpp(
  model_system = ms_stage2,
  data         = dat,
  init_params  = init_s2$init_params,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage2$convergence
#> [1] 0

Recovery

est <- result_stage2$estimates
se  <- result_stage2$std_errors
ps  <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")

tab <- data.frame(
  param = ps,
  true  = c(true_var_f1, true_alpha, true_beta, true_sigma_e2),
  est   = round(unname(est[ps]), 4),
  se    = round(unname(se[ps]),  4)
)
tab$z <- round((tab$est - tab$true) / tab$se, 2)
knitr::kable(tab, row.names = FALSE)
param true est se z
factor_var_1 1.0 0.9104 0.0325 -2.76
se_intercept 0.4 0.4036 0.0217 0.17
se_linear_1 0.6 0.5880 0.0236 -0.51
se_residual_var 0.5 0.5051 0.0261 0.20

se_intercept (the structural α) recovers essentially exactly, which is the whole point of the workflow. The slope β, residual variance σ²_ε, and input factor variance σ²_f₁ also recover within their standard errors.

Cluster-robust standard errors

The standard errors above are model-based (inverse observed information). When individuals are grouped (for example pupils within schools, or siblings within households) and that grouping induces dependence the model does not capture, vcov_factorana() provides cluster-robust standard errors. Pass cluster as the name of a grouping column in the data (or a vector of length nrow(dat)).

Here we attach a synthetic school id (50 schools) and cluster on it:

set.seed(7)
dat$school <- sample(1:50, n, replace = TRUE)

se_cluster <- robust_se(result_stage2, dat, type = "cluster", cluster = "school")

ps <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")
data.frame(
  param      = ps,
  se_model   = round(unname(result_stage2$std_errors[ps]), 4),
  se_cluster = round(unname(se_cluster[ps]),                4),
  row.names  = NULL
)
#>             param se_model se_cluster
#> 1    factor_var_1   0.0325     0.0328
#> 2    se_intercept   0.0217     0.0212
#> 3     se_linear_1   0.0236     0.0256
#> 4 se_residual_var   0.0261     0.0302

Because the schools were assigned at random here, the cluster-robust and model-based standard errors are close; with genuine within-cluster dependence the cluster-robust values would typically be larger.

Two-stage caveat. These standard errors are computed on the Stage 2 fit and treat the Stage 1 measurement parameters (loadings, sigmas, intercepts) as known. They therefore understate the total uncertainty (the generated-regressor problem). For honest two-stage inference, resample clusters and re-run both stages in a bootstrap; a single-stage robust or cluster-robust covariance is exact only when the measurement system is not itself estimated.

Bootstrap for honest two-stage standard errors

A cluster bootstrap that re-runs both stages on each resample accounts for the first-stage uncertainty. bootstrap_factorana_multistage() drives this in one call: it resamples clusters, and for each replicate fits Stage 1 and then Stage 2 chained on that replicate’s own Stage 1 fit. Each (stage, sample) fit is saved to disk and skipped if it already exists, so the run is restartable and the per-sample work can be scattered across a compute cluster with bootstrap_fit_sample().

stage_builders <- list(
  # Stage 1: the measurement system (prev_fits is empty for the first stage)
  function(prev_fits, data) dyn$model_system,
  # Stage 2: SE_linear, chaining on THIS replicate's Stage 1 fit
  function(prev_fits, data) {
    prev <- build_dynamic_previous_stage(
      dyn = dyn, stage1_result = prev_fits[[1]], data = data, anchor_period = 1L)
    fm2 <- define_factor_model(n_factors = 2, factor_structure = "SE_linear")
    define_model_system(components = list(), factor = fm2, previous_stage = prev)
  }
)

boot <- bootstrap_factorana_multistage(
  stage_builders, dat, R = 500, cluster = "person",
  dir = "boot_run", control = ctrl,
  optimizer = "nlminb", parallel = FALSE, verbose = FALSE
)

boot$final$boot_se   # Stage 2 bootstrap standard errors (include Stage 1 uncertainty)
boot$final$ci        # percentile intervals

For a multi-node run, generate the samples once with generate_bootstrap_samples() and call bootstrap_fit_sample() from an array job (one task per sample, per stage), then summarise with collect_bootstrap().

Notes on generalisation

Where to go next

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.