Introduction to smriti: Structural Variance Preservation

The Imputation Uncertainty Principle

Modern machine learning imputation algorithms (like missForest) excel at minimizing point-wise prediction error (RMSE). However, an RMSE-optimal prediction approximates a conditional mean, so the imputed values vary less than the true values they replace. The result is structural variance collapse. In longitudinal Growth Curve Models (GCM), this crushes the latent slope variance (\(\sigma^2_S\)), destroying the statistical power needed to detect differences in patient trajectories over time.
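The shrinkage described above can be reproduced in a few lines of base R. This is only a minimal sketch: ordinary least-squares prediction stands in for missForest, the data are simulated, and all variable names are illustrative rather than part of smriti.

```r
# Minimal base-R illustration: conditional-mean imputation shrinks variance.
# Least squares is used here as a stand-in for an ML imputer (assumption).
set.seed(42)
n <- 2000
x <- rnorm(n)
y_true <- x + rnorm(n)                 # true variance of y is about 2

# Delete 40% of y completely at random
miss <- sample(n, size = 0.4 * n)
y_obs <- y_true
y_obs[miss] <- NA

# Fit on the observed rows (lm() drops rows with NA by default)
fit <- lm(y_obs ~ x)

# Fill the gaps with point predictions
y_imp <- y_obs
y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

var_true_missing    <- var(y_true[miss])   # spread of the values that were lost
var_imputed_missing <- var(y_imp[miss])    # spread of their point predictions

cat("variance of true values:   ", round(var_true_missing, 3), "\n")
cat("variance of imputed values:", round(var_imputed_missing, 3), "\n")
```

The imputed values carry only the part of the variance that the predictor explains; the residual variance is lost, which is the effect the projection stage is meant to counteract.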

The smriti package resolves this by decoupling prediction from structural geometry. It uses a two-stage architecture:

1. Initialization: Non-parametric imputation fills in the missing values to produce a dense matrix.
2. Lagrangian Projection: A C++ gradient descent layer moves the imputed data toward a target covariance manifold while preserving fidelity to the initial imputed values.

The augmented loss function is

\[L(X) = \frac{1}{2}\|X - X_{\text{imp}}\|_F^2 + \frac{\lambda}{2}\|\operatorname{cov}(X) - \Sigma_{\text{target}}\|_F^2\]

where the first term anchors the solution near the initial imputation and the second (governed by \(\lambda\)) enforces the covariance structure.
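For reference, the gradient that a descent step on this loss would follow can be written out explicitly. This is a derivation sketch under two assumptions not stated above: that \(\operatorname{cov}(X)\) is the usual sample covariance with an \(n-1\) denominator, and that \(\Sigma_{\text{target}}\) is symmetric. The centering matrix \(H\) and the step size \(\eta\) are symbols introduced here for illustration; the package's C++ layer may differ in detail.

\[\operatorname{cov}(X) = \frac{1}{n-1} X^\top H X, \qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\]

\[\nabla_X L(X) = (X - X_{\text{imp}}) + \frac{2\lambda}{n-1}\, H X \bigl(\operatorname{cov}(X) - \Sigma_{\text{target}}\bigr)\]

\[X^{(t+1)} = X^{(t)} - \eta\, \nabla_X L\bigl(X^{(t)}\bigr)\]

The first gradient term pulls each entry back toward its initial imputed value, while the second rescales and rotates the centered data so that its covariance moves toward the target.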

The Robustness-Efficiency Tradeoff

Real-world clinical data often contain heavy-tailed or skewed distributions and corrupted sensor artifacts. The smriti_impute() function handles this via its robust argument:

robust = FALSE: Uses pairwise-complete Pearson covariance, projected to the nearest positive-semidefinite (PSD) matrix to correct any non-PSD artifacts from pairwise deletion. Best for well-behaved, approximately normal data.
robust = TRUE: Constructs the target from pairwise Spearman correlations (rank-based, outlier-resistant) and column-wise MAD scale estimates. The resulting matrix is projected to the nearest PSD matrix, producing a target that is structurally robust to severe outliers (e.g., broken EHR sensors).
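The two target constructions can be sketched in base R as follows. The helpers build_target() and nearest_psd() are hypothetical names written for this illustration and are not smriti functions. The PSD step is assumed to be eigenvalue clipping, which gives the Frobenius-nearest PSD matrix for a symmetric input; the package may use a different projection.

```r
# Hypothetical helpers (not part of smriti) illustrating the two targets.

# Nearest PSD matrix in Frobenius norm: clip negative eigenvalues to zero.
nearest_psd <- function(a) {
  a <- (a + t(a)) / 2                        # enforce exact symmetry
  e <- eigen(a, symmetric = TRUE)
  vals <- pmax(e$values, 0)
  out <- e$vectors %*% diag(vals, nrow = length(vals)) %*% t(e$vectors)
  (out + t(out)) / 2                         # remove floating-point asymmetry
}

build_target <- function(x, robust = FALSE) {
  x <- as.matrix(x)
  if (robust) {
    r <- cor(x, method = "spearman", use = "pairwise.complete.obs")
    s <- apply(x, 2, mad, na.rm = TRUE)      # column-wise MAD scale
    target <- r * outer(s, s)                # rescale correlations to covariances
  } else {
    target <- cov(x, use = "pairwise.complete.obs")
  }
  nearest_psd(target)
}

# Simulated data: four time points, scattered missing cells, one bad reading
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, paste0("T", 1:4)))
x[, 2] <- x[, 1] + 0.5 * x[, 2]
x[sample(length(x), 80)] <- NA
x[5, 3] <- 1e4                               # corrupted sensor value

sigma_classic <- build_target(x, robust = FALSE)
sigma_robust  <- build_target(x, robust = TRUE)

# Compare the estimated variance of the contaminated column T3
cat("classic variance of T3:", signif(sigma_classic[3, 3], 3), "\n")
cat("robust variance of T3: ", signif(sigma_robust[3, 3], 3), "\n")
```

A single corrupted reading inflates the Pearson-based variance of its column by orders of magnitude, while the rank-based and MAD-based target is barely affected.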

Fidelity-Constraint Balance

The penalty weight lambda controls the trade-off between preserving the original imputation values and matching the target covariance. At lambda = 1.0 (the default) both objectives are weighted equally. Increasing lambda enforces the covariance constraint more strictly but allows greater deviation from the initial imputation. The learning_rate (default 0.001) governs gradient step size; max_iter (default 2000) bounds the optimisation.
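The effect of lambda can be seen in a plain-R stand-in for the projection stage. This is a sketch under stated assumptions, not the package's C++ code: refine() and smriti_loss() are hypothetical names, the gradient assumes the sample covariance with an n - 1 denominator, and the input is simulated under-dispersed data with an identity matrix as an arbitrary target. Only the default values for learning_rate and max_iter are taken from the text above.

```r
# Hypothetical plain-R stand-in for the projection stage (not smriti code).
smriti_loss <- function(x, x_imp, sigma, lambda) {
  0.5 * sum((x - x_imp)^2) + 0.5 * lambda * sum((cov(x) - sigma)^2)
}

refine <- function(x_imp, sigma, lambda = 1.0,
                   learning_rate = 0.001, max_iter = 2000) {
  x <- x_imp
  n <- nrow(x)
  for (i in seq_len(max_iter)) {
    xc <- scale(x, center = TRUE, scale = FALSE)   # column-centered copy of x
    d  <- cov(x) - sigma                           # covariance mismatch
    # Gradient: fidelity term plus covariance penalty term
    grad <- (x - x_imp) + (2 * lambda / (n - 1)) * (xc %*% d)
    x <- x - learning_rate * grad
  }
  x
}

set.seed(7)
n <- 200
x_imp <- matrix(rnorm(n * 4, sd = 0.5), ncol = 4)  # under-dispersed "imputed" data
sigma <- diag(4)                                   # illustrative target covariance

summarise_fit <- function(lambda) {
  x_new <- refine(x_imp, sigma, lambda = lambda)
  c(lambda  = lambda,
    drift   = sqrt(sum((x_new - x_imp)^2)),        # distance from initial imputation
    cov_gap = sqrt(sum((cov(x_new) - sigma)^2)))   # distance from target covariance
}

results <- rbind(summarise_fit(1), summarise_fit(100))
print(round(results, 3))
```

In this toy setting the larger lambda ends closer to the target covariance and further from the initial imputation, which is the trade-off described above. With a small learning_rate and a bounded max_iter, the loop may stop before reaching a stationary point, so the three parameters interact.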

Example: Shielding Against Corrupted EHR Data

library(smriti)
library(missForest)

# Load clinical data with structural missingness and sensor artifacts
data <- read.csv("clinical_proxy.csv")

# Execute robust refinement to isolate the structural manifold
clean_data <- smriti_impute(
  data       = data,
  time_cols  = c("T1", "T2", "T3", "T4"),
  robust     = TRUE,
  lambda     = 1.0
)
