Real-world longitudinal data — clinical measurements, sensor readings, behavioural logs — routinely contains outliers. A single faulty blood pressure reading, a corrupted accelerometer spike, or a miscoded survey response can poison the sample covariance matrix and cascade through the entire imputation pipeline.
Most imputation methods are vulnerable:
smriti’s robust = TRUE mode takes a different approach:
it constructs the target covariance manifold from rank-based
(Spearman) correlations and median absolute deviation (MAD) scale
estimates, then projects the result to the nearest positive-semidefinite
matrix (Higham 1988). This target is structurally immune to outliers
before the Lagrangian routing even begins.
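The construction described above can be sketched in base R. This is a minimal illustration under stated assumptions, not smriti's internal code: `robust_target()` is a hypothetical helper, and clipping negative eigenvalues is one standard way to obtain the nearest positive-semidefinite matrix in the Frobenius norm.

```r
# Hypothetical helper (not part of smriti): rank/MAD covariance target.
robust_target <- function(x) {
  x <- as.matrix(x)
  # Rank-based correlations; pairwise deletion tolerates missing cells.
  r <- cor(x, method = "spearman", use = "pairwise.complete.obs")
  # MAD scale per column (stats::mad rescales by 1.4826 by default).
  s <- apply(x, 2, mad, na.rm = TRUE)
  target <- r * outer(s, s)
  # Project to the nearest PSD matrix: symmetrise, clip negative eigenvalues.
  target <- (target + t(target)) / 2
  eig <- eigen(target, symmetric = TRUE)
  vals <- pmax(eig$values, 0)
  psd <- eig$vectors %*% diag(vals, nrow = length(vals)) %*% t(eig$vectors)
  psd <- (psd + t(psd)) / 2          # remove floating-point asymmetry
  dimnames(psd) <- dimnames(target)
  psd
}

# Toy check on contaminated, partially missing data.
set.seed(1)
z <- matrix(rnorm(300), nrow = 100, ncol = 3)
z[1:5, ] <- z[1:5, ] + 5             # shift 5% of rows
z[sample(length(z), 30)] <- NA       # knock out 10% of cells
target <- robust_target(z)
min_eig <- min(eigen(target, symmetric = TRUE)$values)
```

With pairwise deletion the raw rank-based matrix is not guaranteed to be positive semidefinite, which is why the projection step is included in the sketch.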
We simulate a 4-wave longitudinal study (N = 200) in which 5% of subjects have all of their measurements shifted by +5 units, i.e. five residual standard deviations, a plausible sensor-artefact scenario.
library(smriti)
set.seed(20250601)
n <- 200; t_points <- 4
# ── Generate clean data with a linear growth process ──────────────────────
generate_data <- function(n, add_outliers = FALSE) {
  latent_intercept <- rnorm(n, 6, 1)
  latent_slope <- rnorm(n, 2, 1)
  data_mat <- matrix(0, n, t_points)
  for (j in seq_len(t_points)) {
    data_mat[, j] <- latent_intercept + (j - 1) * latent_slope + rnorm(n, 0, 1)
  }
  if (add_outliers) {
    # Shift every wave of 5% of subjects by +5 (five residual SDs)
    idx <- sample(seq_len(n), floor(0.05 * n))
    data_mat[idx, ] <- data_mat[idx, ] + 5.0
  }
  colnames(data_mat) <- paste0("T", seq_len(t_points))
  as.data.frame(data_mat)
}
df_clean <- generate_data(n, add_outliers = FALSE)
df_outlier <- generate_data(n, add_outliers = TRUE)
# ── Induce 15% MAR missingness (same mechanism for both) ──────────────────
set.seed(42)
apply_mar <- function(df) {
  df_miss <- df
  for (t in 1:(t_points - 1)) {
    # Missingness at wave t + 1 depends on the observed value at wave t
    idx <- which(!is.na(df_miss[, t]))
    x_prev <- scale(df_miss[idx, t])
    p_miss <- 1 / (1 + exp(-(x_prev - qnorm(1 - 0.15))))
    drop_idx <- idx[rbinom(length(idx), 1, p_miss) == 1]
    df_miss[drop_idx, t + 1] <- NA
  }
  df_miss
}
df_clean_miss <- apply_mar(df_clean)
df_outlier_miss <- apply_mar(df_outlier)
cat("Clean data missingness: ", sum(is.na(df_clean_miss)), "cells\n")
#> Clean data missingness: 149 cells
cat("Outlier data missingness:", sum(is.na(df_outlier_miss)), "cells\n")
#> Outlier data missingness: 144 cells

# On outlier-contaminated data, the robust mode preserves the true structure.
imp_outlier_default <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                     robust = FALSE)
imp_outlier_robust <- smriti_impute(df_outlier_miss, time_cols = 1:4,
                                    robust = TRUE)
# Compare recovered covariance against the covariance of the clean,
# fully observed sample (a proxy for the true population matrix).
true_cov <- cov(df_clean[, 1:4]) # no missingness, no outliers
cat("Default mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_default[, 1:4]) - true_cov)^2)), "\n")
cat("Robust mode Frobenius distance from truth:",
    sqrt(sum((cov(imp_outlier_robust[, 1:4]) - true_cov)^2)), "\n")

In our production benchmarks (500 replications across 90 conditions), the robust mode consistently improves covariance recovery by 1–3 percentage points over missForest under outlier-contaminated MAR data. The Spearman/MAD target matrix isolates the structural signal from the contamination before the Lagrangian routing step, preventing outlier-induced variance inflation.
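To make the variance-inflation point concrete, here is a small base R sketch that does not use smriti. It contaminates 5% of a Normal sample with the same +5 shift used in the simulation and compares the classical variance with the squared MAD. The variable names are illustrative only.

```r
set.seed(123)
x_clean <- rnorm(200, mean = 6, sd = 1)
x_dirty <- x_clean
x_dirty[1:10] <- x_dirty[1:10] + 5   # contaminate 5% of observations

# Classical variance versus squared MAD, before and after contamination.
scale_summary <- c(
  var_clean  = var(x_clean),
  var_dirty  = var(x_dirty),
  mad2_clean = mad(x_clean)^2,
  mad2_dirty = mad(x_dirty)^2
)
round(scale_summary, 2)
```

The sample variance absorbs the shifted observations, while the MAD-based estimate, which depends only on medians, moves far less.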
| Scenario | Recommendation |
|---|---|
| Clean, approximately Normal data | robust = FALSE (Pearson) |
| Known sensor artefacts or data errors | robust = TRUE |
| Heavy-tailed distributions (not skewed) | robust = TRUE |
| Severely skewed (e.g. lognormal) | smriti_fiml() or custom_target |
The robust mode is not a cure for skew. For lognormal or otherwise
asymmetric distributions, use smriti_fiml() to derive the
target covariance from a properly specified latent growth model, or
supply your own custom_target matrix.
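The limitation under skew can be seen in a short base R sketch, independent of smriti. For a standard lognormal sample, the MAD (with its default Normal-consistency constant) is compared with the sample standard deviation.

```r
set.seed(2024)
x_skew <- rlnorm(10000, meanlog = 0, sdlog = 1)

# Under Normality these two would estimate the same quantity.
scale_skew <- c(sd = sd(x_skew), mad = mad(x_skew))
scale_skew
```

Because the MAD is calibrated for symmetric, Normal-like data, it understates the spread of a right-skewed variable, so a MAD-based target would misrepresent the variances in this setting.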
The robust mode is accessible from all convenience wrappers: