
Introduction to DICErClust

Sarah Ayton and Yiye Zhang

2026-05-21

What is DICErClust?

DICErClust provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives:

  1. Reconstruction fidelity — an LSTM autoencoder learns a compact latent representation of time-varying continuous features.
  2. Cluster cohesion — a soft k-means classifier assigns patients to clusters in the latent space.
  3. Outcome prediction — a logistic regression head predicts a binary clinical outcome from the cluster assignment and auxiliary demographic features.
  4. Statistical significance — a likelihood-ratio test (LRT) penalty ensures at least one cluster pair shows a significantly different outcome rate (p < 0.05) at the saved checkpoint.

The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated.

Reference: Huang Y, Du C, Zhu F, et al. (2021). Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction. J Am Med Inform Assoc, 28, 2394–2403. doi:10.1093/jamia/ocab203


Installation

## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)

DICErClust depends on the torch package for R. If you have not installed torch before, run torch::install_torch() once after installing the package.
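
Before the first run it can help to confirm that both the torch R package and its backend libraries are present. The sketch below only reports status and installs nothing; it assumes `torch::torch_is_installed()` is available in your installed version of torch. The variable names are illustrative.

```r
# Is the torch R package itself available?
torch_pkg <- requireNamespace("torch", quietly = TRUE)

# Are the backend libraries downloaded? (only checked if the package exists)
backend_ready <- torch_pkg && torch::torch_is_installed()

if (!torch_pkg) {
  message("torch is not installed: run install.packages('torch')")
} else if (!backend_ready) {
  message("torch backend missing: run torch::install_torch() once")
} else {
  message("torch is ready")
}
```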


Data format

DICEr() reads training and test data from RDS files. Each file must be a length-3 list:

Position  Name    Type                      Description
[[1]]     data_x  numeric matrix, n × p     Continuous features (LSTM encoder input)
[[2]]     data_v  numeric matrix, n × q     Binary demographic covariates (outcome-head auxiliary input)
[[3]]     data_y  integer vector, length n  Binary outcome (0/1)

Important: data_v must use R numeric (float64) storage, not integer. The torch backend infers tensor dtype from the R storage mode; integer columns produce int64 tensors that are incompatible with the float32 model weights.
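
The format rules above can be checked before training. The following is a minimal sketch in base R; `check_dice_list()` is a hypothetical helper, not a function exported by DICErClust, and the checks cover only the requirements stated in this section.

```r
# Hypothetical helper (not part of DICErClust): validate the length-3 list
check_dice_list <- function(d) {
  stopifnot(is.list(d), length(d) == 3L)
  x <- d[[1]]; v <- d[[2]]; y <- d[[3]]
  stopifnot(
    is.matrix(x), is.numeric(x),        # data_x: numeric matrix
    is.matrix(v), is.double(v),         # data_v: must be double, not integer
    all(v %in% c(0, 1)),                # data_v: binary covariates
    nrow(v) == nrow(x),                 # same patients in both matrices
    length(y) == nrow(x),               # one outcome per patient
    all(y %in% c(0, 1))                 # data_y: binary outcome
  )
  invisible(TRUE)
}

# rbinom() returns integer storage, so convert before building the list
v_int <- matrix(rbinom(12, 1, 0.5), 4, 3)
v_num <- v_int
storage.mode(v_num) <- "double"         # keeps dimensions, changes storage only

toy <- list(matrix(runif(8), 4, 2), v_num, rbinom(4, 1, 0.3))
ok  <- check_dice_list(toy)
```

Passing the integer matrix `v_int` in place of `v_num` makes the helper stop with an error, which is the situation the note above warns about.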

## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L

make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                   # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q), # data_v: binary float
      rbinom(n, 1, 0.3)                             # data_y: outcome
    ),
    path
  )
}

data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test,  file.path(data_dir, "test.rds"))

## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")

Quick start

library(DICErClust)

args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,       # columns in data_x
  n_hidden_fea      = 3L,      # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,      # number of clusters
  n_dummy_demov_fea = q,       # columns in data_v
  cuda              = FALSE,   # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,      # Stage 1 warm-up epochs
  iter              = 20L,     # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

old_wd <- setwd(tempdir())
## Restore the working directory even if DICEr() stops with an error
tryCatch(
  DICEr(args),         # writes output to hn_3_K_2/part2_AE_nhidden_3/
  finally = setwd(old_wd)
)

Loading results

part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

## Cluster assignments
## Training set: use res_train$C   (k-means labels, re-ordered by outcome rate)
## Test set:     use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)
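
Once the labels are loaded, a natural first check is the size and outcome rate of each cluster. The sketch below uses toy vectors so that it runs on its own; in practice `C` would come from `res_train$C` and `y` from the outcome element of the same list. The label values (0 and 1) are an assumption made for illustration, and the package may number clusters differently.

```r
# Toy stand-ins (hypothetical values, not package output)
C <- c(0, 0, 0, 1, 1, 1, 1, 0)   # cluster label per patient
y <- c(0, 0, 1, 1, 1, 0, 1, 0)   # binary outcome per patient

sizes <- table(C)                # patients per cluster
rates <- tapply(y, C, mean)      # observed outcome rate per cluster

print(sizes)
print(rates)
```

Because the training labels are re-ordered by outcome rate, the per-cluster rates should be monotone in the label when computed on the training set.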

Hyperparameters at a glance

Argument           Default  Effect
n_hidden_fea                LSTM latent dimension; controls representation capacity
K_clusters                  Number of clusters
init_AE_epoch      5        Stage 1 warm-up length
iter               20       Maximum Stage 2 iterations
epoch_in_iter      1        Gradient-update epochs per iteration
lr                 1e-4     Adam learning rate
lambda_AE          1.0      Weight on reconstruction loss
lambda_classifier  1.0      Weight on cluster-assignment loss
lambda_outcome     1.0      Weight on outcome BCE loss
lambda_p_value     1.0      Weight on LRT significance penalty

All four lambda weights default to 1.0, so each loss term enters training with the same weight. The LRT significance threshold is the 0.95 quantile of the χ² distribution with 1 degree of freedom (3.841, corresponding to α = 0.05); it is fixed and not user-tunable.
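
To make the 3.841 threshold concrete, here is a minimal base R sketch of a likelihood-ratio test for a difference in outcome rate between two clusters. It is an illustration on simulated data under the assumption of a logistic model with a cluster indicator; it is not the package's internal implementation, and all object names are hypothetical.

```r
# Simulated toy data (hypothetical): two clusters with different outcome rates
set.seed(1)
cluster <- rep(c(0, 1), each = 100)
outcome <- rbinom(200, 1, ifelse(cluster == 1, 0.6, 0.2))

# Null model: one common outcome rate; alternative: rate depends on cluster
fit_null <- glm(outcome ~ 1,       family = binomial())
fit_alt  <- glm(outcome ~ cluster, family = binomial())

# LRT statistic: twice the gain in log-likelihood, 1 degree of freedom
lrt_stat <- 2 * (as.numeric(logLik(fit_alt)) - as.numeric(logLik(fit_null)))

threshold   <- qchisq(0.95, df = 1)                        # about 3.841
p_value     <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
significant <- lrt_stat > threshold                        # same as p_value < 0.05
```

Comparing the statistic with `threshold` and comparing `p_value` with 0.05 are equivalent decisions, which is why a single fixed cutoff is enough.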


Output directory structure

After a successful run, DICEr creates:

<working_dir>/
└── hn_<n_hidden>_K_<K>/
    ├── part1_AE_nhidden_<n>/          # Stage 1 autoencoder outputs
    │   └── part1_loss_AE.png
    └── part2_AE_nhidden_<n>/          # Stage 2 best checkpoint
        ├── data_train_iter.rds        # training set with C assignments
        └── data_test_iter.rds         # test set with pred_C assignments
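
The directory names follow directly from n_hidden_fea and K_clusters, so the Stage 2 path can be rebuilt rather than typed by hand. `dice_part2_dir()` below is a hypothetical convenience function, not part of DICErClust; it assumes only the naming pattern shown in the tree above.

```r
# Hypothetical helper (not exported by DICErClust): rebuild the Stage 2 path
dice_part2_dir <- function(base_dir, n_hidden_fea, K_clusters) {
  n <- as.integer(n_hidden_fea)
  k <- as.integer(K_clusters)
  file.path(
    base_dir,
    sprintf("hn_%d_K_%d", n, k),
    sprintf("part2_AE_nhidden_%d", n)
  )
}

out_dir <- dice_part2_dir(tempdir(), n_hidden_fea = 3L, K_clusters = 2L)

# TRUE only after DICEr() has completed a run in that working directory
has_results <- file.exists(file.path(out_dir, "data_train_iter.rds"))
```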

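data_train_iter.rds and data_test_iter.rds are the input data lists enriched with cluster-assignment fields: C in the training file (k-means labels, re-ordered by outcome rate) and pred_C in the test file (nearest-centroid assignments).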


Full worked example

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset — including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, p < 0.001), and publication-quality figures — see:

vignette("heart-failure-example", package = "DICErClust")
