The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Outlier-Robust Longitudinal Imputation with smriti

The Outlier Problem in Longitudinal Data

Real-world longitudinal data — clinical measurements, sensor readings, behavioural logs — routinely contains outliers. A single faulty blood pressure reading, a corrupted accelerometer spike, or a miscoded survey response can poison the sample covariance matrix and cascade through the entire imputation pipeline.

Most imputation methods are vulnerable:

smriti’s robust = TRUE mode takes a different approach: it constructs the target covariance manifold from rank-based (Spearman) correlations and median absolute deviation (MAD) scale estimates, then projects the result to the nearest positive-semidefinite matrix (Higham 1988). This target is structurally immune to outliers before the Lagrangian routing even begins.

Demonstration: 5% Contamination

We simulate a 4-wave longitudinal study (N = 200) where 5% of subjects have their measurements shifted by +5 SD — a plausible sensor-artefact scenario.

library(smriti)

set.seed(20250601)
n <- 200; t_points <- 4

# ── Generate clean data with a linear growth process ──────────────────────
generate_data <- function(n, add_outliers = FALSE) {
  latent_intercept <- rnorm(n, 6, 1)
  latent_slope     <- rnorm(n, 2, 1)
  data_mat <- matrix(0, n, t_points)
  for (j in seq_len(t_points)) {
    data_mat[, j] <- latent_intercept + (j - 1) * latent_slope + rnorm(n, 0, 1)
  }
  if (add_outliers) {
    idx <- sample(seq_len(n), floor(0.05 * n))
    data_mat[idx, ] <- data_mat[idx, ] + 5.0  # +5 SD shift
  }
  colnames(data_mat) <- paste0("T", seq_len(t_points))
  as.data.frame(data_mat)
}

df_clean   <- generate_data(n, add_outliers = FALSE)
df_outlier <- generate_data(n, add_outliers = TRUE)

# ── Induce 15% MAR missingness (same pattern for both) ────────────────────
set.seed(42)
apply_mar <- function(df) {
  df_miss <- df
  for (t in 1:(t_points - 1)) {
    idx <- which(!is.na(df_miss[, t]))
    x_prev <- scale(df_miss[idx, t])
    p_miss <- 1 / (1 + exp(-(x_prev - qnorm(1 - 0.15))))
    drop_idx <- idx[rbinom(length(idx), 1, p_miss) == 1]
    df_miss[drop_idx, t + 1] <- NA
  }
  df_miss
}

df_clean_miss   <- apply_mar(df_clean)
df_outlier_miss <- apply_mar(df_outlier)

cat("Clean data missingness:  ", sum(is.na(df_clean_miss)),   "cells\n")
#> Clean data missingness:   149 cells
cat("Outlier data missingness:", sum(is.na(df_outlier_miss)), "cells\n")
#> Outlier data missingness: 144 cells

Clean Data: All Methods Perform Similarly

# On clean Normal data, default and robust modes produce similar results.
imp_clean_default <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = FALSE)
imp_clean_robust  <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = TRUE)

Contaminated Data: The Robust Advantage

# On outlier-contaminated data, the robust mode preserves the true structure.
imp_outlier_default <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                     robust = FALSE)
imp_outlier_robust  <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                     robust = TRUE)

# Compare recovered covariance against the true (clean) population matrix.
true_cov <- cov(df_clean[, 1:4])  # no missingness, no outliers

cat("Default mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_default[, 1:4]) - true_cov)^2)), "\n")
cat("Robust  mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_robust[, 1:4])  - true_cov)^2)), "\n")

In our production benchmarks (500 replications across 90 conditions), the robust mode consistently improves covariance recovery by 1–3 percentage points over missForest under outlier-contaminated MAR data. The Spearman/MAD target matrix isolates the structural signal from the contamination before the Lagrangian routing step, preventing outlier-induced variance inflation.

When to Use robust = TRUE

Scenario Recommendation
Clean, approximately Normal data robust = FALSE (Pearson)
Known sensor artefacts or data errors robust = TRUE
Heavy-tailed distributions (not skewed) robust = TRUE
Severely skewed (e.g. lognormal) smriti_fiml() or custom_target

The robust mode is not a cure for skew. For lognormal or otherwise asymmetric distributions, use smriti_fiml() to derive the target covariance from a properly specified latent growth model, or supply your own custom_target matrix.

Integration with Existing Pipelines

The robust mode is accessible from all convenience wrappers:

# missForest + smriti robust refinement
smriti_forest(df, time_cols = 1:4, robust = TRUE)

# missRanger + smriti robust refinement
smriti_ranger(df, time_cols = 1:4, robust = TRUE)

# Multiple imputation with robust targets on every replicate
smriti_mi(df, time_cols = 1:4, m = 10, robust = TRUE)

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.