
factorana

factorana (R interface)

An R front-end for specifying model components (linear, logit, probit, or ordered probit), defining latent factor structures with any number of factors, and producing initialization values before passing the model to C++ for estimation.


Installation

Install directly from GitHub using devtools or remotes:

# Install devtools if you don't have it
install.packages("devtools")

# Install factorana from GitHub
devtools::install_github("GregVeramendi/factorana", subdir = "factorana_R")

# Load the package
library(factorana)

Requirements:

- R ≥ 4.0.0
- C++ compiler:
  - Windows: Install Rtools (matching your R version)
  - macOS: Install Xcode Command Line Tools via xcode-select --install
  - Linux: Install build-essential (Ubuntu/Debian) or gcc-c++ (Fedora/RHEL)
- R packages: Rcpp, RcppEigen (installed automatically as dependencies)

The package will automatically compile the C++ code during installation.

Troubleshooting Installation

Windows: If you get compilation errors, make sure Rtools is installed and on your PATH:

# Check if Rtools is found
Sys.which("make")

macOS: If you get "clang: error: unsupported option '-fopenmp'", this is expected and can be ignored (OpenMP is optional).

All platforms: If installation fails, try:

# Install with verbose output to see errors
devtools::install_github("GregVeramendi/factorana",
                         subdir = "factorana_R",
                         build_vignettes = FALSE,
                         force = TRUE)

Features

Quick start: Roy model example

This example demonstrates a Roy selection model with unobserved ability (latent factor), sector choice, test scores, and wages.

Click to expand full example code
library(factorana)

# Generate Roy model data
set.seed(108)
n <- 10000

# Covariates
x1 <- rnorm(n)  # Affects wages
x2 <- rnorm(n)  # Affects wages and sector choice
f <- rnorm(n)   # Latent ability (unobserved)

# Test scores (measure ability with error)
T1 <- 2.0 + 1.0*f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2*f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8*f + rnorm(n, 0, 0.4)

# Potential wages in each sector
wage0 <- 2.0 + 0.5*x1 + 0.3*x2 + 0.5*f + rnorm(n, 0, 0.6)     # Ability affects wage0
wage1 <- 2.5 + 0.6*x1 + 1.0*f + rnorm(n, 0, 0.7)              # Ability affects wage1

# Sector choice (high ability → more likely sector 1)
z_sector <- 0.0 + 0.4*x2 + 0.8*f
sector <- as.numeric(runif(n) < pnorm(z_sector))

# Observed wage (only see wage in chosen sector)
wage <- ifelse(sector == 1, wage1, wage0)

# Create dataset with evaluation indicators
dat <- data.frame(
  intercept = 1,
  x1 = x1, x2 = x2,
  T1 = T1, T2 = T2, T3 = T3,
  wage = wage,
  sector = sector,
  eval_tests = 1,                    # Always observe test scores
  eval_wage0 = 1 - sector,           # Observe wage0 when sector=0
  eval_wage1 = sector,               # Observe wage1 when sector=1
  eval_sector = 1                    # Always observe sector choice
)

# Define factor model (1 latent ability factor)
fm <- define_factor_model(n_factors = 1, n_types = 1)

# Define model components
# Test 1: Normalize loading to 1.0 for identification
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0,
  evaluation_indicator = "eval_tests"
)

# Tests 2 and 3: Free loadings
mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)
mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)

# Wage in sector 0: Ability effect (free loading)
mc_wage0 <- define_model_component(
  name = "wage0", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1", "x2"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage0"
)

# Wage in sector 1: Ability matters (free loading)
mc_wage1 <- define_model_component(
  name = "wage1", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage1"
)

# Sector choice: Probit model
mc_sector <- define_model_component(
  name = "sector", data = dat, outcome = "sector", factor = fm,
  covariates = c("intercept", "x2"), model_type = "probit",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_sector"
)

# Define model system
ms <- define_model_system(
  components = list(mc_T1, mc_T2, mc_T3, mc_wage0, mc_wage1, mc_sector),
  factor = fm
)

# Single-core estimation
# Note: init_params = NULL triggers automatic initialization
ctrl_single <- define_estimation_control(n_quad_points = 16, num_cores = 1)
time_start_single <- Sys.time()
result_single <- estimate_model_rcpp(
  model_system = ms,
  data = dat,
  init_params = NULL,       # Automatic initialization (recommended)
  control = ctrl_single,
  parallel = FALSE,
  optimizer = "nlminb",     # Default: fast with analytical Hessian
  verbose = TRUE
)
time_end_single <- Sys.time()
time_single <- as.numeric(difftime(time_end_single, time_start_single, units = "secs"))

# Parallel estimation with 4 cores (~3x speedup on large datasets)
ctrl_parallel <- define_estimation_control(n_quad_points = 16, num_cores = 4)
time_start_parallel <- Sys.time()
result_parallel <- estimate_model_rcpp(
  model_system = ms,
  data = dat,
  init_params = NULL,       # Automatic initialization
  control = ctrl_parallel,
  parallel = TRUE,
  optimizer = "nlminb",
  verbose = TRUE
)
time_end_parallel <- Sys.time()
time_parallel <- as.numeric(difftime(time_end_parallel, time_start_parallel, units = "secs"))

# Create formatted results tables
# Define true parameter values
true_params <- c(
  1.0,      # factor_var (replaced below with the estimated value)
  2.0, 0.5, # T1: intercept, sigma
  1.5, 1.2, 0.6, # T2: intercept, loading, sigma
  1.0, 0.8, 0.4, # T3: intercept, loading, sigma
  2.0, 0.5, 0.3, 0.5, 0.6, # wage0: intercept, x1, x2, loading, sigma
  2.5, 0.6, 1.0, 0.7, # wage1: intercept, x1, loading, sigma
  0.0, 0.4, 0.8  # sector: intercept, x2, loading
)

# Update factor variance to match estimate
true_params[1] <- result_parallel$estimates[1]

# Parameter names
param_names <- c(
  "factor_var",
  "T1_intercept", "T1_sigma",
  "T2_intercept", "T2_loading", "T2_sigma",
  "T3_intercept", "T3_loading", "T3_sigma",
  "wage0_intercept", "wage0_x1", "wage0_x2", "wage0_loading", "wage0_sigma",
  "wage1_intercept", "wage1_x1", "wage1_loading", "wage1_sigma",
  "sector_intercept", "sector_x2", "sector_loading"
)

# Component labels for grouping
components <- c(
  "Factor",
  "T1", "T1",
  "T2", "T2", "T2",
  "T3", "T3", "T3",
  "wage0", "wage0", "wage0", "wage0", "wage0",
  "wage1", "wage1", "wage1", "wage1",
  "sector", "sector", "sector"
)

# Table 1: Parameter estimates
results_table <- data.frame(
  Component = components,
  Parameter = param_names,
  True = sprintf("%.3f", true_params),
  Estimate = sprintf("%.3f", result_parallel$estimates),
  Std_Error = sprintf("%.3f", result_parallel$std_errors)
)

cat("\n=== Parameter Estimates ===\n")
print(results_table, row.names = FALSE, right = FALSE)

# Table 2: Estimation diagnostics
speedup <- time_single / time_parallel
diagnostics <- data.frame(
  Method = c("Single-core", "Parallel (4 cores)"),
  Log_Likelihood = sprintf("%.2f", c(result_single$loglik, result_parallel$loglik)),
  Time_sec = sprintf("%.2f", c(time_single, time_parallel)),
  Speedup = c("1.0x", sprintf("%.2fx", speedup)),
  Convergence = c(result_single$convergence, result_parallel$convergence),
  N_Parameters = c(length(result_single$estimates), length(result_parallel$estimates))
)

cat("\n=== Estimation Diagnostics ===\n")
print(diagnostics, row.names = FALSE, right = FALSE)

Key features demonstrated:

- A single latent factor shared across test scores, wages, and sector choice
- Mixed model types in one system (linear and probit)
- Evaluation indicators for selectively observed outcomes (wages observed only in the chosen sector)
- Automatic parameter initialization (init_params = NULL)
- Single-core vs. parallel estimation with a timing comparison


Two-Stage (Sequential) Estimation

For complex models, you can estimate in multiple stages to improve convergence and interpretability. In the first stage, estimate a subset of components (e.g., measurement system). In subsequent stages, fix those components and add new ones.

Click to expand two-stage estimation example

Example: Estimate test scores first, then add wage and sector equations:

# Using the same Roy model data from above...

# ======================================================================
# STAGE 1: Estimate measurement system (test scores only)
# ======================================================================

fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 16, num_cores = 1)

# Define three test score components
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0,
  evaluation_indicator = "eval_tests"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)

# Create model system for stage 1
ms_stage1 <- define_model_system(
  components = list(mc_T1, mc_T2, mc_T3),
  factor = fm
)

# Estimate stage 1
result_stage1 <- estimate_model_rcpp(
  model_system = ms_stage1,
  data = dat,
  init_params = NULL,
  control = ctrl,
  optimizer = "nlminb",
  verbose = TRUE
)

# ======================================================================
# STAGE 2: Add wage/sector equations, fixing stage 1 parameters
# ======================================================================

# Define wage and sector components (same as before)
mc_wage0 <- define_model_component(
  name = "wage0", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1", "x2"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage0"
)

mc_wage1 <- define_model_component(
  name = "wage1", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage1"
)

mc_sector <- define_model_component(
  name = "sector", data = dat, outcome = "sector", factor = fm,
  covariates = c("intercept", "x2"), model_type = "probit",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_sector"
)

# Create model system for stage 2, passing stage 1 results
ms_stage2 <- define_model_system(
  components = list(mc_wage0, mc_wage1, mc_sector),
  factor = fm,
  previous_stage = result_stage1  # Fix stage 1 parameters
)

# Estimate stage 2
result_stage2 <- estimate_model_rcpp(
  model_system = ms_stage2,
  data = dat,
  init_params = NULL,
  control = ctrl,
  optimizer = "nlminb",
  verbose = TRUE
)

# View combined results
print(result_stage2$estimates)    # Includes both stage 1 and stage 2 parameters
print(result_stage2$std_errors)   # Standard errors preserved from stage 1

How it works:

- Stage 1 estimates only the measurement system (9 parameters)
- Stage 2 fixes those 9 parameters and estimates the wage/sector parameters (12 new parameters)
- result_stage2 contains all 21 parameters with correct standard errors
- The factor variance is fixed at its stage 1 estimate
- Computational efficiency: gradients/Hessians are not computed for fixed components

Benefits:

- Improved convergence for complex models
- Interpretable intermediate results
- Faster optimization (derivatives are skipped for fixed parameters)
- Stages can be chained: stage3 <- define_model_system(..., previous_stage = result_stage2)


Adaptive Integration

For large-scale two-stage estimation, adaptive integration dramatically reduces computation time by using factor scores from Stage 1 to determine how many integration points each observation needs in Stage 2:

Formula: n_quad_obs = 1 + 2 × floor(factor_se / factor_sd / threshold)
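The rule can be sketched in R (a toy illustration of the formula above, not the package internals; the cap at max_quad is an assumption mirroring the max_quad argument shown below):

```r
# Observations with precise factor scores (small SE relative to the
# factor SD) get a single integration point; noisier ones get more,
# up to the configured maximum.
n_quad_obs <- function(factor_se, factor_sd, threshold, max_quad = 16) {
  pts <- 1 + 2 * floor(factor_se / factor_sd / threshold)
  pmin(pts, max_quad)
}

n_quad_obs(factor_se = 0.10, factor_sd = 1, threshold = 0.3)  # 1 point
n_quad_obs(factor_se = 0.50, factor_sd = 1, threshold = 0.3)  # 3 points
n_quad_obs(factor_se = 5.00, factor_sd = 1, threshold = 0.3)  # capped at 16
```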

Click to expand adaptive integration example
library(factorana)

# === STAGE 1: Estimate measurement model ===
fm <- define_factor_model(n_factors = 2, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 16, num_cores = 4)

# Define measurement components...
ms_stage1 <- define_model_system(components = list(mc1, mc2, mc3, mc4), factor = fm)
result_s1 <- estimate_model_rcpp(ms_stage1, data, control = ctrl)

# Get factor scores and standard errors
fscores <- estimate_factorscores_rcpp(ms_stage1, data, result_s1$par, ctrl)
factor_scores <- fscores$factor_scores   # matrix [nobs x nfac]
factor_ses <- fscores$factor_ses         # matrix [nobs x nfac]

# === STAGE 2: Estimate with adaptive integration ===
# Add structural components...
ms_stage2 <- define_model_system(
  components = list(mc1, mc2, mc3, mc4, mc_outcome),
  factor = fm
)

# Initialize FactorModel for Stage 2
fm_ptr <- initialize_factor_model_cpp(ms_stage2, data, n_quad = 16)

# Extract factor variances from Stage 1
factor_vars <- c(result_s1$par["factor_var_1"], result_s1$par["factor_var_2"])

# Enable adaptive integration (prints diagnostic summary)
set_adaptive_quadrature_cpp(
  fm_ptr,
  factor_scores,   # From Stage 1
  factor_ses,      # From Stage 1
  factor_vars,     # Factor variances from Stage 1
  threshold = 0.3, # Smaller = more integration points
  max_quad = 16,   # Maximum points per factor
  verbose = TRUE   # Print summary table
)

# Output example:
# Adaptive Integration Summary
# ----------------------------
# Threshold: 0.3, Max quad points: 16
#
# Integration points per observation:
#   Points   Observations   Percent
#        1            400      80.0%
#        3            100      20.0%
#
# Average integration points: 1.4 (vs 256 standard)
# Computational reduction: 99.4%

# Run estimation with the adaptive FactorModel...

Key points:

- Importance sampling correction is applied automatically
- Works best when most observations have well-identified factor scores
- Can achieve 90%+ computational reduction for well-identified models
- Use disable_adaptive_quadrature_cpp() to revert to standard integration
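The importance-sampling correction follows the standard adaptive-quadrature logic: when integration nodes are placed around a stage-1 factor score instead of the prior mean, each node is reweighted by the ratio of the prior density to the proposal density. A textbook sketch (not the package's internal code; all values are illustrative):

```r
# Prior: factor ~ N(0, prior_sd^2). Proposal: nodes centered at the
# stage-1 factor score with spread equal to its standard error.
prior_sd <- 1.0
score    <- 0.8   # stage-1 factor score for one observation
score_se <- 0.3   # its standard error

# Three nodes placed around the score (Gauss-Hermite abscissae 0, +/-sqrt(3))
nodes <- score + score_se * c(-sqrt(3), 0, sqrt(3))

# Importance weight per node: prior density / proposal density
w_is <- dnorm(nodes, 0, prior_sd) / dnorm(nodes, score, score_se)
```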


Observation Weights

Observation weights allow different observations to have different influence on the likelihood. This is useful for:

- Survey weights: make estimates representative of a population
- Importance sampling: as used internally by adaptive integration
- Down-weighting outliers: reduce the influence of problematic observations

Click to expand observation weights example
library(factorana)

# Create data with survey weights
data$survey_weight <- compute_ipw_weights(data)  # Your weight function

# Define factor model and components...
fm <- define_factor_model(n_factors = 1, n_types = 1)
mc1 <- define_model_component(...)
mc2 <- define_model_component(...)

# Specify weights in define_model_system()
ms <- define_model_system(
  components = list(mc1, mc2),
  factor = fm,
  weights = "survey_weight"  # Column name in data
)

# Estimate - weights applied automatically
ctrl <- define_estimation_control(n_quad_points = 16, num_cores = 4)
result <- estimate_model_rcpp(ms, data, control = ctrl, verbose = TRUE)
# Output: "Using observation weights from 'survey_weight' (range: 0.5 to 2.0)"

Requirements:

- Weights must be a numeric column in the data
- Weights must be positive (a warning is issued if any are ≤ 0)
- Weights cannot contain NA values
- Each observation's log-likelihood contribution is multiplied by its weight
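A minimal sketch of what these requirements imply, and how a weight enters the likelihood (check_weights is a hypothetical helper written for illustration, not a package function):

```r
# Hypothetical pre-flight check mirroring the requirements above
check_weights <- function(data, col) {
  w <- data[[col]]
  if (!is.numeric(w)) stop("weights must be numeric")
  if (anyNA(w))       stop("weights cannot contain NA")
  if (any(w <= 0))    warning("some weights are <= 0")
  w
}

# Each observation's log-likelihood contribution is multiplied by its
# weight, so the weighted total is sum(w * loglik_i)
w      <- check_weights(data.frame(sw = c(0.5, 1.0, 2.0)), "sw")
loglik <- c(-1.2, -0.8, -2.0)   # illustrative per-observation values
sum(w * loglik)                  # weighted log-likelihood: -5.4
```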


Concepts

Latent factors & normalization

Identification: At least one component must have a fixed loading to normalize the factor scale.

Examples:

# Single factor model
fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 16, num_cores = 1)

# First component: fix loading to 1.0 for identification
mc1 <- define_model_component(
  name = "test1", data = dat, outcome = "y1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0
)

# Other components: free loadings
mc2 <- define_model_component(
  name = "test2", data = dat, outcome = "y2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_  # Free parameter
)

Ordered probit (OProbit)


API (R)

More detailed explanations are available in each function's documentation.

define_factor_model(n_factors, n_types = 1, factor_structure = "independent", n_mixtures = 1)

define_model_component(name, data, outcome, factor, evaluation_indicator = NULL, covariates, model_type, loading_normalization = NA_real_, factor_spec = "linear", intercept = TRUE, num_choices = 2, nrank = NULL)

fix_coefficient(component, covariate, value, choice = NULL)

fix_type_intercepts(component, types = NULL, choice = NULL)

define_model_system(components, factor, previous_stage = NULL, weights = NULL, equality_constraints = NULL)

define_estimation_control(n_quad_points = 16, num_cores = 1, adaptive_integration = FALSE, adapt_int_thresh = 0.3)

estimate_model_rcpp(model_system, data, init_params = NULL, control = NULL, optimizer = "nlminb", parallel = TRUE, verbose = TRUE)

Parallelization notes:

- Splits observations across workers (not parameters)
- Each worker evaluates the likelihood on its data subset
- Results are aggregated across workers
- Achieves ~3x speedup with 4 cores on large datasets
- Uses doParallel for Windows compatibility
- Each observation is evaluated by exactly one worker
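The observation-splitting scheme can be illustrated with a toy chunking of row indices (a sketch of the idea, not the internal code):

```r
# Split observation indices into roughly equal chunks, one per worker.
# Each worker would evaluate the likelihood on its own subset; the
# per-chunk log-likelihoods are then summed.
split_obs <- function(n_obs, n_workers) {
  split(seq_len(n_obs), cut(seq_len(n_obs), n_workers, labels = FALSE))
}

chunks <- split_obs(n_obs = 10, n_workers = 4)
lengths(chunks)        # roughly balanced chunk sizes
sum(lengths(chunks))   # 10: every observation handled exactly once
```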

initialize_parameters(model_system, data, verbose = TRUE)


Displaying Results

The package provides functions to display estimation results in formatted tables, either to the screen or as LaTeX output.

components_table(result, components = NULL)

components_to_latex(result, components = NULL, caption = NULL, label = NULL, digits = 3)

Example: Displaying All Components

After estimating a Roy model (see Quick Start example):

# Display all components in one table
components_table(result_parallel)

Output:

Factor Model Results by Component
=========================================================================================================

Parameter          sector         wage1         wage0            T1            T2            T3
---------------------------------------------------------------------------------------------
beta_intercept                   2.531***      2.013***      1.998***      1.500***      0.994***
                                  (0.043)       (0.035)       (0.012)       (0.016)       (0.011)
beta_x1                          0.594***      0.498***
                                  (0.048)       (0.035)
beta_x2             0.416***                   0.299***
                     (0.044)                    (0.030)
Loading 1           0.807***      0.983***      0.521***      1.000         1.191***      0.797***
                     (0.058)       (0.037)       (0.038)         (-)        (0.021)       (0.016)
Sigma                             0.697***      0.609***      0.501***      0.600***      0.399***
                                   (0.027)       (0.024)       (0.011)       (0.013)       (0.009)
---------------------------------------------------------------------------------------------------------
N                    10000          5012          4988         10000         10000         10000
Log-likelihood: -42156.78

Standard errors in parentheses
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Example: Splitting Components into Separate Tables

For complex models, you may want to display the measurement system separately from the structural equations:

# Table 1: Measurement system (test scores)
components_table(result_parallel, components = c("T1", "T2", "T3"))

# Table 2: Selection and outcome equations
components_table(result_parallel, components = c("sector", "wage0", "wage1"))

Measurement System Table:

Factor Model Results by Component
=============================================================================

Parameter               T1            T2            T3
-------------------------------------------------------------
beta_intercept      1.998***      1.500***      0.994***
                     (0.012)       (0.016)       (0.011)
Loading 1           1.000         1.191***      0.797***
                       (-)         (0.021)       (0.016)
Sigma               0.501***      0.600***      0.399***
                     (0.011)       (0.013)       (0.009)
-----------------------------------------------------------------------------
N                    10000         10000         10000
Log-likelihood: -42156.78

Standard errors in parentheses
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Selection and Outcomes Table:

Factor Model Results by Component
=============================================================================

Parameter          sector         wage1         wage0
-------------------------------------------------------------
beta_intercept                   2.531***      2.013***
                                  (0.043)       (0.035)
beta_x1                          0.594***      0.498***
                                  (0.048)       (0.035)
beta_x2             0.416***                   0.299***
                     (0.044)                    (0.030)
Loading 1           0.807***      0.983***      0.521***
                     (0.058)       (0.037)       (0.038)
Sigma                             0.697***      0.609***
                                   (0.027)       (0.024)
-----------------------------------------------------------------------------
N                    10000          5012          4988
Log-likelihood: -42156.78

Standard errors in parentheses
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Example: LaTeX Export

# Export measurement system to LaTeX
latex_code <- components_to_latex(
  result_parallel,
  components = c("T1", "T2", "T3"),
  caption = "Measurement System Estimates",
  label = "tab:measurement"
)
cat(latex_code)

# Export selection/outcome to LaTeX
latex_code2 <- components_to_latex(
  result_parallel,
  components = c("sector", "wage0", "wage1"),
  caption = "Selection and Outcome Equations",
  label = "tab:structural"
)
writeLines(latex_code2, "structural_table.tex")

Other Display Functions


Examples

Correlated Factors Model

Estimate a two-factor model where the latent factors are correlated, using factor_structure = "correlation":

Click to expand correlated factors example
set.seed(42)
n <- 500

# True parameters
true_rho <- 0.6  # Correlation between factors

# Generate correlated factors using Cholesky decomposition
z1 <- rnorm(n)
z2 <- rnorm(n)
f1 <- z1
f2 <- true_rho * z1 + sqrt(1 - true_rho^2) * z2

# Measurement system (3 indicators per factor)
Y1 <- 1.0*f1 + rnorm(n, 0, 0.5)  # f1 loading fixed to 1
Y2 <- 0.8*f1 + rnorm(n, 0, 0.5)
Y3 <- 1.2*f1 + rnorm(n, 0, 0.5)
Y4 <- 1.0*f2 + rnorm(n, 0, 0.5)  # f2 loading fixed to 1
Y5 <- 0.9*f2 + rnorm(n, 0, 0.5)
Y6 <- 1.1*f2 + rnorm(n, 0, 0.5)

dat <- data.frame(intercept = 1, Y1 = Y1, Y2 = Y2, Y3 = Y3,
                  Y4 = Y4, Y5 = Y5, Y6 = Y6)

# Define correlated 2-factor model
fm <- define_factor_model(n_factors = 2, factor_structure = "correlation")
ctrl <- define_estimation_control(n_quad_points = 16)

# Factor 1 measurement equations
mc1 <- define_model_component("m1", dat, "Y1", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(1, 0))
mc2 <- define_model_component("m2", dat, "Y2", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(NA, 0))
mc3 <- define_model_component("m3", dat, "Y3", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(NA, 0))

# Factor 2 measurement equations
mc4 <- define_model_component("m4", dat, "Y4", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(0, 1))
mc5 <- define_model_component("m5", dat, "Y5", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(0, NA))
mc6 <- define_model_component("m6", dat, "Y6", fm, covariates = "intercept",
                               model_type = "linear", loading_normalization = c(0, NA))

ms <- define_model_system(components = list(mc1, mc2, mc3, mc4, mc5, mc6), factor = fm)
result <- estimate_model_rcpp(ms, dat, control = ctrl, verbose = FALSE)

# Results
cat("Factor variances: Var(f1)=", result$estimates["factor_var_1"],
    ", Var(f2)=", result$estimates["factor_var_2"], "\n")
cat("Factor correlation:", result$estimates["factor_corr_1_2"], "(true:", true_rho, ")\n")

Key points:

- Use factor_structure = "correlation" for correlated two-factor models
- The correlation is estimated via a Cholesky decomposition (the factor_corr_1_2 parameter)
- Currently limited to 2 factors; for more complex factor relationships, use SE models


Structural Equation Factor Model (Longitudinal with Measurement Invariance)

Model structural relationships between latent factors at different time points with measurement invariance constraints. This is useful for longitudinal studies where the same measures are used at different ages.

Click to expand longitudinal SE model example
library(factorana)

set.seed(123)
n <- 2000

# True parameters for SE_linear model: f2 = alpha_1*f1 + epsilon
true_var_f1 <- 1.0          # Variance of factor at age 10
true_se_linear <- 0.7       # Effect of age-10 factor on age-15 factor
true_se_residual_var <- 0.5 # Residual variance

# Generate factors according to structural equation
f1 <- rnorm(n, 0, sqrt(true_var_f1))
eps <- rnorm(n, 0, sqrt(true_se_residual_var))
f2 <- true_se_linear * f1 + eps

# Measurement system - 3 tests measured at each age
# Same test properties (measurement invariance)
true_loading_t2 <- 1.1
true_loading_t3 <- 0.9
true_sigma <- 0.5

# Age 10 measurements (factor 1)
T1_age10 <- 1.0 * f1 + rnorm(n, 0, true_sigma)  # loading fixed to 1
T2_age10 <- true_loading_t2 * f1 + rnorm(n, 0, true_sigma)
T3_age10 <- true_loading_t3 * f1 + rnorm(n, 0, true_sigma)

# Age 15 measurements (factor 2) - SAME loadings and sigmas (invariance)
T1_age15 <- 1.0 * f2 + rnorm(n, 0, true_sigma)  # loading fixed to 1
T2_age15 <- true_loading_t2 * f2 + rnorm(n, 0, true_sigma)
T3_age15 <- true_loading_t3 * f2 + rnorm(n, 0, true_sigma)

dat <- data.frame(
  intercept = 1,
  T1_age10 = T1_age10, T2_age10 = T2_age10, T3_age10 = T3_age10,
  T1_age15 = T1_age15, T2_age15 = T2_age15, T3_age15 = T3_age15,
  eval = 1
)

# ============================================================
# Define SE_linear model: f2 = alpha_1*f1 + epsilon
# ============================================================

fm <- define_factor_model(n_factors = 2, factor_structure = "SE_linear")
ctrl <- define_estimation_control(n_quad_points = 16, num_cores = 1)

# Age 10 measurement equations (factor 1 = input factor)
mc_t1_age10 <- define_model_component(
  name = "t1_age10", data = dat, outcome = "T1_age10", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(1, 0), evaluation_indicator = "eval"
)
mc_t2_age10 <- define_model_component(
  name = "t2_age10", data = dat, outcome = "T2_age10", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(NA_real_, 0), evaluation_indicator = "eval"
)
mc_t3_age10 <- define_model_component(
  name = "t3_age10", data = dat, outcome = "T3_age10", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(NA_real_, 0), evaluation_indicator = "eval"
)

# Age 15 measurement equations (factor 2 = outcome factor)
mc_t1_age15 <- define_model_component(
  name = "t1_age15", data = dat, outcome = "T1_age15", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(0, 1), evaluation_indicator = "eval"
)
mc_t2_age15 <- define_model_component(
  name = "t2_age15", data = dat, outcome = "T2_age15", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(0, NA_real_), evaluation_indicator = "eval"
)
mc_t3_age15 <- define_model_component(
  name = "t3_age15", data = dat, outcome = "T3_age15", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(0, NA_real_), evaluation_indicator = "eval"
)

# Bundle with EQUALITY CONSTRAINTS for measurement invariance
ms <- define_model_system(
  components = list(mc_t1_age10, mc_t2_age10, mc_t3_age10,
                    mc_t1_age15, mc_t2_age15, mc_t3_age15),
  factor = fm,
  equality_constraints = list(
    # Same loadings across time
    c("t2_age10_loading_1", "t2_age15_loading_2"),
    c("t3_age10_loading_1", "t3_age15_loading_2"),
    # Same sigmas across time
    c("t1_age10_sigma", "t1_age15_sigma"),
    c("t2_age10_sigma", "t2_age15_sigma"),
    c("t3_age10_sigma", "t3_age15_sigma")
  )
)

# Estimate
result <- estimate_model_rcpp(ms, dat, control = ctrl,
                              optimizer = "nlminb", verbose = FALSE)

# View structural equation parameters
cat("Structural Equation: f2 = alpha_1*f1 + epsilon\n")
cat("----------------------------------------------\n")
cat(sprintf("  Var(f1):        %.3f (true = %.3f)\n",
            result$estimates["factor_var_1"], true_var_f1))
cat(sprintf("  SE linear:      %.3f (true = %.3f)\n",
            result$estimates["se_linear_1"], true_se_linear))
cat(sprintf("  Residual var:   %.3f (true = %.3f)\n",
            result$estimates["se_residual_var"], true_se_residual_var))

# Verify measurement invariance (constrained loadings should be equal)
cat("\nMeasurement Invariance:\n")
cat(sprintf("  T2 loading: age10=%.3f, age15=%.3f (constrained equal)\n",
            result$estimates["t2_age10_loading_1"],
            result$estimates["t2_age15_loading_2"]))
cat(sprintf("  T3 loading: age10=%.3f, age15=%.3f (constrained equal)\n",
            result$estimates["t3_age10_loading_1"],
            result$estimates["t3_age15_loading_2"]))

Key points:

- factor_structure = "SE_linear" specifies f₂ = α₁f₁ + ε (a linear structural relationship)
- factor_structure = "SE_quadratic" adds a quadratic term: f₂ = α + α₁f₁ + α₂f₁² + ε
- The input factor (f₁) variance is estimated from the measurement equations
- The residual variance (σ²_ε) is a free parameter
- equality_constraints enforces measurement invariance (same loadings/sigmas at both ages)
- Parameter names: se_linear_1, se_residual_var (plus se_intercept and se_quadratic_1 for SE_quadratic)
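For reference, the SE_quadratic data-generating process can be sketched the same way as the SE_linear example above (parameter values here are illustrative, not from the package):

```r
set.seed(1)
n  <- 5000
a0 <- 0.2; a1 <- 0.7; a2 <- -0.1   # illustrative structural parameters

f1  <- rnorm(n)
eps <- rnorm(n, 0, sqrt(0.5))

# SE_quadratic structural equation: f2 = a0 + a1*f1 + a2*f1^2 + eps
f2 <- a0 + a1 * f1 + a2 * f1^2 + eps

# The quadratic term appears as curvature in E[f2 | f1]; a regression
# on f1 and f1^2 recovers the structural parameters approximately
coef(lm(f2 ~ f1 + I(f1^2)))
```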


Measurement System with Three Tests

Click to expand example

A standard factor analysis setup with three test scores measuring a latent ability factor:

set.seed(104)
n <- 500
f <- rnorm(n)  # Latent ability

# Three test scores (T1 loading fixed to 1 for identification)
T1 <- 2.0 + 1.0*f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2*f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8*f + rnorm(n, 0, 0.4)

dat <- data.frame(intercept = 1, T1 = T1, T2 = T2, T3 = T3, eval = 1)

# Define factor model and estimation control
fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

# Define three test components
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0,  # Fix for identification
  evaluation_indicator = "eval"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,  # Free parameter
  evaluation_indicator = "eval"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,  # Free parameter
  evaluation_indicator = "eval"
)

# Estimate
ms <- define_model_system(components = list(mc_T1, mc_T2, mc_T3), factor = fm)
result <- estimate_model_rcpp(ms, dat, init_params = NULL, control = ctrl, verbose = TRUE)

print(result$estimates)
print(result$std_errors)

Ordered Probit with Measurement System

Click to expand example

Combining test scores with an ordered outcome (e.g., educational attainment):

set.seed(106)
n <- 500
x1 <- rnorm(n)
f <- rnorm(n)  # Latent ability

# Three test scores (same as above)
T1 <- 2.0 + 1.0*f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2*f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8*f + rnorm(n, 0, 0.4)

# Ordered outcome with 3 categories
# Note: Latent z = 0.5 (intercept) + 0.6*x1 + 0.8*f + error
# Intercept absorbed into thresholds, so model estimates only beta and loading
z <- 0.5 + 0.6*x1 + 0.8*f + rnorm(n)
y <- cut(z, breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE)

dat <- data.frame(intercept = 1, x1 = x1, T1 = T1, T2 = T2, T3 = T3, y = y, eval = 1)

# Define factor model and estimation control
fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0, evaluation_indicator = "eval"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

# Ordered probit component (NO intercept - absorbed into thresholds)
mc_y <- define_model_component(
  name = "y", data = dat, outcome = "y", factor = fm,
  covariates = "x1",  # No intercept for oprobit
  model_type = "oprobit",
  num_choices = 3,  # 3 ordered categories
  loading_normalization = NA_real_,
  evaluation_indicator = "eval"
)

# Estimate
ms <- define_model_system(components = list(mc_T1, mc_T2, mc_T3, mc_y), factor = fm)
result <- estimate_model_rcpp(ms, dat, init_params = NULL, control = ctrl, verbose = TRUE)

print(result$estimates)

Multinomial Logit with Measurement System

Click to expand example

Combining test scores with a discrete choice outcome (3 alternatives):

set.seed(107)
n <- 500
x1 <- rnorm(n)
f <- rnorm(n)  # Latent ability

# Three test scores (same as above)
T1 <- 2.0 + 1.0*f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2*f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8*f + rnorm(n, 0, 0.4)

# Multinomial choice with 3 alternatives (first alternative is the reference; coded 1-3 below)
z1 <- 0.5 + 0.6*x1 + 0.7*f
z2 <- 1.0 - 0.5*x1 + 0.9*f

exp_z0 <- 1
exp_z1 <- exp(z1)
exp_z2 <- exp(z2)
denom <- exp_z0 + exp_z1 + exp_z2

p0 <- exp_z0 / denom
p1 <- exp_z1 / denom
p2 <- exp_z2 / denom

y <- numeric(n)
for (i in seq_len(n)) {
  # C++ expects choices coded as 1, 2, 3 (not 0, 1, 2)
  y[i] <- sample(1:3, 1, prob = c(p0[i], p1[i], p2[i]))
}

dat <- data.frame(intercept = 1, x1 = x1, T1 = T1, T2 = T2, T3 = T3, y = y, eval = 1)

# Define factor model and estimation control
fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0, evaluation_indicator = "eval"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

# Multinomial logit component
mc_y <- define_model_component(
  name = "y", data = dat, outcome = "y", factor = fm,
  covariates = c("intercept", "x1"),
  model_type = "logit",
  num_choices = 3,  # 3 alternatives
  loading_normalization = NA_real_,
  evaluation_indicator = "eval"
)

# Estimate
ms <- define_model_system(components = list(mc_T1, mc_T2, mc_T3, mc_y), factor = fm)
result <- estimate_model_rcpp(ms, dat, init_params = NULL, control = ctrl, verbose = TRUE)

print(result$estimates)

Quadratic Factor Effects

Click to expand example

When factor effects are nonlinear, use factor_spec = "quadratic" to include f² terms:

set.seed(109)
n <- 500
f <- rnorm(n)  # Latent ability
x1 <- rnorm(n)

# Three test scores (standard measurement system)
T1 <- 2.0 + 1.0*f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2*f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8*f + rnorm(n, 0, 0.4)

# Outcome with quadratic factor effect
# Y = intercept + beta*x1 + lambda*f + lambda_quad*f^2 + error
Y <- 3.0 + 0.5*x1 + 0.8*f + 0.3*f^2 + rnorm(n, 0, 0.5)

dat <- data.frame(intercept = 1, x1 = x1, T1 = T1, T2 = T2, T3 = T3, Y = Y, eval = 1)

# Define factor model and estimation control
fm <- define_factor_model(n_factors = 1, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

# Standard measurement equations (linear factor terms)
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0, evaluation_indicator = "eval"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_, evaluation_indicator = "eval"
)

# Outcome with QUADRATIC factor effect
mc_Y <- define_model_component(
  name = "Y", data = dat, outcome = "Y", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = NA_real_,
  factor_spec = "quadratic",  # Enable f^2 term
  evaluation_indicator = "eval"
)

# Estimate
ms <- define_model_system(components = list(mc_T1, mc_T2, mc_T3, mc_Y), factor = fm)
result <- estimate_model_rcpp(ms, dat, init_params = NULL, control = ctrl, verbose = TRUE)

# View results - includes Y_loading_quad_1 parameter
print(result$estimates)

Key points:
- factor_spec = "quadratic" adds quadratic loading parameters (λ_quad)
- The linear predictor becomes: xβ + λf + λ_quad f²
- Works with all model types: linear, probit, logit, oprobit
- Quadratic loadings are always free (estimated), never fixed
- Parameter naming: {component}_loading_quad_{factor_index}

Factor Interaction Terms (Multi-Factor)

Click to expand example

For multi-factor models, use factor_spec = "interactions" to include cross-product terms (f_j × f_k):

set.seed(110)
n <- 500
f1 <- rnorm(n)  # Latent factor 1
f2 <- rnorm(n)  # Latent factor 2
x1 <- rnorm(n)

# Measurement system for 2-factor model (4 test scores, 2 per factor)
T1 <- 2.0 + 1.0*f1 + rnorm(n, 0, 0.5)  # Factor 1 indicator (loading fixed to 1)
T2 <- 1.5 + 1.2*f1 + rnorm(n, 0, 0.6)  # Factor 1 indicator
T3 <- 1.0 + 1.0*f2 + rnorm(n, 0, 0.4)  # Factor 2 indicator (loading fixed to 1)
T4 <- 0.8 + 0.9*f2 + rnorm(n, 0, 0.5)  # Factor 2 indicator

# Outcome with factor interaction effect
# Y = intercept + beta*x1 + lambda1*f1 + lambda2*f2 + lambda_inter*f1*f2 + error
Y <- 3.0 + 0.5*x1 + 0.8*f1 + 0.6*f2 + 0.4*f1*f2 + rnorm(n, 0, 0.5)

dat <- data.frame(intercept = 1, x1 = x1, T1 = T1, T2 = T2, T3 = T3, T4 = T4, Y = Y, eval = 1)

# Define 2-factor model
fm <- define_factor_model(n_factors = 2, n_types = 1)
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

# Measurement equations for factor identification
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(1.0, 0),  # Factor 1 loading=1, Factor 2 loading=0
  evaluation_indicator = "eval"
)

mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(NA_real_, 0),  # Factor 1 free, Factor 2 = 0
  evaluation_indicator = "eval"
)

mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(0, 1.0),  # Factor 1 = 0, Factor 2 loading=1
  evaluation_indicator = "eval"
)

mc_T4 <- define_model_component(
  name = "T4", data = dat, outcome = "T4", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = c(0, NA_real_),  # Factor 1 = 0, Factor 2 free
  evaluation_indicator = "eval"
)

# Outcome with INTERACTION factor effect
mc_Y <- define_model_component(
  name = "Y", data = dat, outcome = "Y", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = c(NA_real_, NA_real_),  # Both linear loadings free
  factor_spec = "interactions",  # Enable f1*f2 interaction term
  evaluation_indicator = "eval"
)

# Estimate
ms <- define_model_system(components = list(mc_T1, mc_T2, mc_T3, mc_T4, mc_Y), factor = fm)
result <- estimate_model_rcpp(ms, dat, init_params = NULL, control = ctrl, verbose = TRUE)

# View results - includes Y_loading_inter_1_2 parameter
print(result$estimates)

Key points:
- factor_spec = "interactions" adds interaction loading parameters for all factor pairs (f_j × f_k, j < k)
- For k factors, this adds k(k-1)/2 interaction terms (e.g., 2 factors → 1 interaction term)
- The linear predictor becomes: xβ + Σ λ_j f_j + Σ_{j<k} λ_inter_{jk} f_j f_k
- Works with all model types: linear, probit, logit, oprobit
- Requires k ≥ 2 factors (silently ignored for single-factor models)
- Parameter naming: {component}_loading_inter_{j}_{k} (e.g., Y_loading_inter_1_2)
- Use factor_spec = "full" to include both quadratic and interaction terms
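The last point can be sketched against the multi-factor example above: swapping factor_spec = "interactions" for "full" in the outcome component should request both the squared and the cross-product loadings (this reuses dat and fm from the example; the exact parameter names produced are as documented for the quadratic and interaction cases):

```r
# Sketch: outcome with BOTH quadratic and interaction factor effects.
# Same data (dat) and factor model (fm) as the 2-factor example above.
mc_Y_full <- define_model_component(
  name = "Y", data = dat, outcome = "Y", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = c(NA_real_, NA_real_),  # both linear loadings free
  factor_spec = "full",  # adds Y_loading_quad_* and Y_loading_inter_1_2
  evaluation_indicator = "eval"
)
```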


Performance and optimization

Choosing an optimizer

nlminb (default, recommended):
- Uses the analytical Hessian for fast convergence
- ~4.6x faster than nloptr on typical problems
- Supports box constraints (sigma >= 0.01, cutpoint increments >= 0.01)
- Best choice for most applications

nloptr:
- L-BFGS with gradient only (approximates the Hessian)
- Slower but more conservative with constraints
- Use if nlminb has convergence issues

optim (L-BFGS-B):
- Similar to nloptr but uses stats::optim
- Good alternative to nloptr

trust:
- Trust region method with Hessian
- Experimental; does not support fixed parameters well
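A minimal sketch of switching between these, assuming the choice is made through the estimation control object; the argument name optimizer below is an assumption, not confirmed by this README, so check ?define_estimation_control for the actual signature:

```r
# Hypothetical sketch: the "optimizer" argument name is assumed.
ctrl_default <- define_estimation_control(n_quad_points = 8, num_cores = 1)
ctrl_lbfgs <- define_estimation_control(n_quad_points = 8, num_cores = 1,
                                        optimizer = "nloptr")  # fallback if nlminb struggles
```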

Parallelization guidelines

When to use parallelization:
- Large datasets (n > 5,000)
- Complex models with many components
- Multiple local optimizations (e.g., testing initial values)

Typical speedups (n = 10,000 Roy model on a laptop):
- 2 cores: 1.7x speedup (85% efficiency)
- 4 cores: 3.0x speedup (75% efficiency)

Note: Speedup depends on several factors:
- CPU architecture and speed
- System load (other processes running)
- Number of observations and model complexity
- Memory bandwidth
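For reference, the efficiency figures quoted above are simply speedup divided by core count:

```r
# Parallel efficiency = speedup / cores, using the figures quoted above
speedup <- c(2.0, 4.0)            # core counts
observed <- c(1.7, 3.0)           # measured speedups
efficiency <- observed / speedup
round(100 * efficiency)           # 85 and 75 (percent)
```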

For complex models with large datasets, higher core counts can be beneficial. Running with 32 cores on a server has been effective for very complex models.

Best practices:

# Use 1 core for small datasets or quick tests
ctrl <- define_estimation_control(num_cores = 1)

# On a shared server, be considerate of other users
# (floor/max ensure num_cores is a positive integer)
n_cores <- max(1, min(8, floor(parallel::detectCores() / 2)))
ctrl <- define_estimation_control(num_cores = n_cores)

# On a dedicated machine with no other users, use all available cores
n_cores <- parallel::detectCores()
ctrl <- define_estimation_control(num_cores = n_cores)

# Enable parallel mode
result <- estimate_model_rcpp(ms, dat, control = ctrl, parallel = TRUE)

Note: Parallelization splits observations across workers. Each worker gets a subset of the data and evaluates the likelihood independently. Results are aggregated to compute the total log-likelihood and gradients.
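The aggregation idea can be illustrated in a few lines of base R. This is a toy normal log-likelihood, not the package's actual C++ code: because observations are independent, the total log-likelihood is a sum, so chunk-level sums computed by separate workers recombine exactly:

```r
# Illustration only (not factorana internals): chunked log-likelihood
# evaluation recovers the serial total exactly, because it is a sum.
loglik_chunk <- function(y_chunk, mu, sigma) sum(dnorm(y_chunk, mu, sigma, log = TRUE))

set.seed(1)
y <- rnorm(1000, mean = 2, sd = 1)
chunks <- split(y, cut(seq_along(y), 4, labels = FALSE))  # 4 "workers"

ll_parallel <- sum(vapply(chunks, loglik_chunk, numeric(1), mu = 2, sigma = 1))
ll_serial <- loglik_chunk(y, mu = 2, sigma = 1)
all.equal(ll_parallel, ll_serial)  # TRUE: aggregation recovers the total
```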


Testing

Quick Tests

Run all automated tests (should complete in ~15 seconds):

devtools::test()

Run a subset while developing:

devtools::test(filter = "modeltypes|multifactor|oprobit")

Comprehensive Tests

The systematic test suite performs extensive validation, including:
- Analytical gradient verification against finite differences
- Analytical Hessian verification against finite differences
- Parameter recovery from simulated data
- Convergence from both default and true initial values

Run the comprehensive suite with verbose output:

# Enable verbose output and log saving
Sys.setenv(FACTORANA_TEST_VERBOSE = "TRUE")
Sys.setenv(FACTORANA_TEST_SAVE_LOGS = "TRUE")

# Run systematic tests (takes ~2-3 minutes)
devtools::test(filter = "systematic")

This will test:
- Model A: Measurement system (3 linear tests)
- Model B: Measurement + probit outcome
- Model C: Measurement + ordered probit (3 categories)
- Model D: Measurement + multinomial logit (3 choices)
- Model E: Roy selection model (wages + sector choice)

Logs are saved to tests/testthat/test_logs/ with detailed diagnostics.

What’s Covered

