Outlier-Robust Longitudinal Imputation with smriti

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

The Outlier Problem in Longitudinal Data

Real-world longitudinal data — clinical measurements, sensor readings, behavioural logs — routinely contains outliers. A single faulty blood pressure reading, a corrupted accelerometer spike, or a miscoded survey response can poison the sample covariance matrix and cascade through the entire imputation pipeline.

Most imputation methods are vulnerable:

mice (CART/method = “cart”): regression trees split on individual points; a single extreme value can dominate a leaf node.
missForest / missRanger: random forests are more robust than linear models, but the bootstrapped trees still propagate outlier influence through the response variable.
Amelia: assumes multivariate normality; a single 5-sigma outlier inflates the EM-estimated covariance.
FIML (lavaan): maximum likelihood under normality; outliers inflate residual variances.

smriti’s robust = TRUE mode takes a different approach: it constructs the target covariance manifold from rank-based (Spearman) correlations and median absolute deviation (MAD) scale estimates, then projects the result to the nearest positive-semidefinite matrix (Higham 1988). This target is structurally immune to outliers before the Lagrangian routing even begins.

Demonstration: 5% Contamination

We simulate a 4-wave longitudinal study (N = 200) where 5% of subjects have their measurements shifted by +5 SD — a plausible sensor-artefact scenario.

library(smriti)

set.seed(20250601)
n <- 200; t_points <- 4

# ── Generate clean data with a linear growth process ──────────────────────
generate_data <- function(n, add_outliers = FALSE) {
  latent_intercept <- rnorm(n, 6, 1)
  latent_slope     <- rnorm(n, 2, 1)
  data_mat <- matrix(0, n, t_points)
  for (j in seq_len(t_points)) {
    data_mat[, j] <- latent_intercept + (j - 1) * latent_slope + rnorm(n, 0, 1)
  }
  if (add_outliers) {
    idx <- sample(seq_len(n), floor(0.05 * n))
    data_mat[idx, ] <- data_mat[idx, ] + 5.0  # +5 SD shift
  }
  colnames(data_mat) <- paste0("T", seq_len(t_points))
  as.data.frame(data_mat)
}

df_clean   <- generate_data(n, add_outliers = FALSE)
df_outlier <- generate_data(n, add_outliers = TRUE)

# ── Induce 15% MAR missingness (same pattern for both) ────────────────────
set.seed(42)
apply_mar <- function(df) {
  df_miss <- df
  for (t in 1:(t_points - 1)) {
    idx <- which(!is.na(df_miss[, t]))
    x_prev <- scale(df_miss[idx, t])
    p_miss <- 1 / (1 + exp(-(x_prev - qnorm(1 - 0.15))))
    drop_idx <- idx[rbinom(length(idx), 1, p_miss) == 1]
    df_miss[drop_idx, t + 1] <- NA
  }
  df_miss
}

df_clean_miss   <- apply_mar(df_clean)
df_outlier_miss <- apply_mar(df_outlier)

cat("Clean data missingness:  ", sum(is.na(df_clean_miss)),   "cells\n")
#> Clean data missingness:   149 cells
cat("Outlier data missingness:", sum(is.na(df_outlier_miss)), "cells\n")
#> Outlier data missingness: 144 cells

Clean Data: All Methods Perform Similarly

# On clean Normal data, default and robust modes produce similar results.
imp_clean_default <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = FALSE)
imp_clean_robust  <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = TRUE)

Contaminated Data: The Robust Advantage

# On outlier-contaminated data, the robust mode preserves the true structure.
imp_outlier_default <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                     robust = FALSE)
imp_outlier_robust  <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                     robust = TRUE)

# Compare recovered covariance against the true (clean) population matrix.
true_cov <- cov(df_clean[, 1:4])  # no missingness, no outliers

cat("Default mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_default[, 1:4]) - true_cov)^2)), "\n")
cat("Robust  mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_robust[, 1:4])  - true_cov)^2)), "\n")

In our production benchmarks (500 replications across 90 conditions), the robust mode consistently improves covariance recovery by 1–3 percentage points over missForest under outlier-contaminated MAR data. The Spearman/MAD target matrix isolates the structural signal from the contamination before the Lagrangian routing step, preventing outlier-induced variance inflation.

When to Use robust = TRUE

Scenario	Recommendation
Clean, approximately Normal data	`robust = FALSE` (Pearson)
Known sensor artefacts or data errors	`robust = TRUE`
Heavy-tailed distributions (not skewed)	`robust = TRUE`
Severely skewed (e.g. lognormal)	`smriti_fiml()` or `custom_target`

The robust mode is not a cure for skew. For lognormal or otherwise asymmetric distributions, use smriti_fiml() to derive the target covariance from a properly specified latent growth model, or supply your own custom_target matrix.

Integration with Existing Pipelines

The robust mode is accessible from all convenience wrappers:

# missForest + smriti robust refinement
smriti_forest(df, time_cols = 1:4, robust = TRUE)

# missRanger + smriti robust refinement
smriti_ranger(df, time_cols = 1:4, robust = TRUE)

# Multiple imputation with robust targets on every replicate
smriti_mi(df, time_cols = 1:4, m = 10, robust = TRUE)

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.