Introduction to DICErClust

Sarah Ayton and Yiye Zhang

2026-07-08

What is DICErClust?

DICErClust provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives:

Reconstruction fidelity — an LSTM autoencoder learns a compact latent representation of time-varying continuous features.
Cluster cohesion — a soft k-means classifier assigns patients to clusters in the latent space.
Outcome prediction — a logistic regression head predicts a binary clinical outcome from the cluster assignment and auxiliary demographic features.
Statistical significance — a likelihood-ratio test (LRT) penalty ensures at least one cluster pair shows a significantly different outcome rate (p < 0.05) at the saved checkpoint.

The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated.
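One way to picture how the four objectives are optimised together is as a weighted sum of per-objective losses. The sketch below assumes that general form and is not taken from the package internals; combine_losses is a hypothetical helper and the loss values are made-up numbers.

```r
# Hypothetical helper: weighted sum of named loss terms (illustrative only).
combine_losses <- function(losses, lambdas) {
  stopifnot(setequal(names(losses), names(lambdas)))
  sum(lambdas[names(losses)] * losses)
}

# Made-up loss values for a single training step.
losses <- c(AE = 0.80, classifier = 0.35, outcome = 0.60, p_value = 0.10)

# Equal weights, mirroring lambda_* = 1.0 in the DICEr() arguments.
lambdas <- c(AE = 1, classifier = 1, outcome = 1, p_value = 1)

total <- combine_losses(losses, lambdas)
print(total)
```

Raising one weight in this picture would shift the optimiser's attention toward that objective at the expense of the others.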

Reference: Huang Y, Du C, Zhu F, et al. (2021). Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction. J Am Med Inform Assoc, 28, 2394–2403. doi:10.1093/jamia/ocab203

Installation

## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)

DICErClust depends on the torch package for R. If you have not installed torch before, run torch::install_torch() once after installing the package.
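A small check like the following can tell you whether the torch backend still needs to be installed. This is a sketch, not part of DICErClust; torch_ready is a hypothetical helper, and it only reports status rather than triggering a download.

```r
# Hypothetical helper: TRUE only if the torch R package is available
# and its backend libraries have already been installed.
torch_ready <- function() {
  requireNamespace("torch", quietly = TRUE) && torch::torch_is_installed()
}

if (!torch_ready()) {
  message("torch backend not found; run torch::install_torch() once.")
}
```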

Data format

DICEr() reads training and test data from RDS files. Each file must be a length-3 list:

Position	Name	Type	Description
`[[1]]`	`data_x`	numeric matrix n × p	Continuous features — LSTM encoder input
`[[2]]`	`data_v`	numeric matrix n × q	Binary demographic covariates — outcome-head auxiliary input
`[[3]]`	`data_y`	integer vector length n	Binary outcome (0/1)

Important: data_v must be stored as R double (storage.mode(data_v) returns "double"), not integer. The torch backend infers tensor dtype from the R storage mode; integer columns produce int64 tensors that are incompatible with the float32 model weights.
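If your covariates are already integer-coded, the storage mode can be changed in base R without altering the values. The matrix below is a made-up example, and v_int and v_num are illustrative names.

```r
# An integer-coded covariate matrix (made-up values).
v_int <- matrix(c(1L, 0L, 0L, 1L, 1L, 0L), nrow = 3, ncol = 2)
storage.mode(v_int)        # "integer"

# Switch the storage mode; dimensions and values are preserved.
v_num <- v_int
storage.mode(v_num) <- "double"

stopifnot(is.double(v_num), identical(dim(v_num), dim(v_int)))
```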

## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L

make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                   # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q), # data_v: binary float
      rbinom(n, 1, 0.3)                             # data_y: outcome
    ),
    path
  )
}

data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test,  file.path(data_dir, "test.rds"))

## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")

Quick start

library(DICErClust)

args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,       # columns in data_x
  n_hidden_fea      = 3L,      # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,      # number of clusters
  n_dummy_demov_fea = q,       # columns in data_v
  cuda              = FALSE,   # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,      # Stage 1 warm-up epochs
  iter              = 20L,     # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

old_wd <- setwd(tempdir())
# Restore the working directory even if DICEr() stops with an error.
tryCatch(
  DICEr(args),           # writes output to hn_3_K_2/part2_AE_nhidden_3/
  finally = setwd(old_wd))

Loading results

part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

## Cluster assignments
## Training set: use res_train$C   (k-means labels, re-ordered by outcome rate)
## Test set:     use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)
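To make "nearest-centroid assignment" concrete, here is a minimal base-R sketch on toy latent vectors. It assumes squared Euclidean distance and 0-based labels; nearest_centroid is a hypothetical helper, and the package's internal procedure may differ.

```r
# Hypothetical helper: assign each row of z to its nearest centroid
# (squared Euclidean distance). Labels are 0-based.
nearest_centroid <- function(z, centroids) {
  d2 <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(z, 2, centroids[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(z))   # keep matrix shape when z has one row
  max.col(-d2, ties.method = "first") - 1L
}

# Toy latent vectors (3 dimensions, like n_hidden_fea = 3) and two centroids.
centroids <- rbind(c(0, 0, 0), c(1, 1, 1))
z_test <- rbind(c(0.1, 0.0, 0.2), c(0.9, 1.1, 0.8), c(0.4, 0.3, 0.2))

pred <- nearest_centroid(z_test, centroids)
table(pred)
```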

Hyperparameters at a glance

Argument	Default	Effect
`n_hidden_fea`	—	LSTM latent dimension; controls representation capacity
`K_clusters`	—	Number of clusters
`init_AE_epoch`	5	Stage 1 warm-up length
`iter`	20	Maximum Stage 2 iterations
`epoch_in_iter`	1	Gradient-update epochs per iteration
`lr`	1e-4	Adam learning rate
`lambda_AE`	1.0	Weight on reconstruction loss
`lambda_classifier`	1.0	Weight on cluster-assignment loss
`lambda_outcome`	1.0	Weight on outcome BCE loss
`lambda_p_value`	1.0	Weight on LRT significance penalty

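All four lambda weights default to 1.0, so each objective receives the same nominal weight; because the raw loss terms can differ in scale, equal weights do not guarantee equal influence on training. The LRT significance threshold (χ² with 1 degree of freedom, α = 0.05, critical value 3.841) is fixed and not user-tunable.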
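The threshold value can be reproduced in base R, and a standard likelihood-ratio test between two clusters can be run with glm(). The sketch below uses made-up data and is meant only to illustrate the test; how DICEr computes its penalty internally may differ.

```r
# Critical value of the chi-squared distribution with 1 degree of freedom
# at alpha = 0.05.
crit <- qchisq(0.95, df = 1)
round(crit, 3)   # 3.841

# Toy two-cluster data (made-up): cluster label and binary outcome.
set.seed(1)
cluster <- rep(c(0, 1), each = 100)
outcome <- rbinom(200, 1, ifelse(cluster == 0, 0.5, 0.2))

# Likelihood-ratio test: outcome ~ 1 versus outcome ~ cluster.
m0 <- glm(outcome ~ 1, family = binomial())
m1 <- glm(outcome ~ factor(cluster), family = binomial())
lrt_stat <- deviance(m0) - deviance(m1)
p_value <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(statistic = lrt_stat, p = p_value, significant = lrt_stat > crit)
```

A statistic above the critical value corresponds to p < 0.05 for this one-degree-of-freedom comparison.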

Output directory structure

After a successful run, DICEr creates:

<working_dir>/
└── hn_<n_hidden>_K_<K>/
    ├── part1_AE_nhidden_<n_hidden>/   # Stage 1 autoencoder outputs
    │   └── part1_loss_AE.png
    └── part2_AE_nhidden_<n_hidden>/   # Stage 2 best checkpoint
        ├── data_train_iter.rds        # training set with C assignments
        └── data_test_iter.rds         # test set with pred_C assignments
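Since the directory names follow a fixed pattern, the Stage 2 path can be derived from the run settings instead of being typed by hand. dice_part2_dir below is a hypothetical convenience function, not part of the package, and it assumes the naming pattern shown in the tree above.

```r
# Hypothetical helper: build the Stage 2 output path from the run settings.
dice_part2_dir <- function(n_hidden_fea, K_clusters, base_dir = getwd()) {
  file.path(
    base_dir,
    sprintf("hn_%d_K_%d", n_hidden_fea, K_clusters),
    sprintf("part2_AE_nhidden_%d", n_hidden_fea)
  )
}

out_dir <- dice_part2_dir(3L, 2L, base_dir = tempdir())
basename(out_dir)           # "part2_AE_nhidden_3"
basename(dirname(out_dir))  # "hn_3_K_2"
```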

data_train_iter.rds and data_test_iter.rds are the data lists enriched with cluster assignment fields:

$C: k-means cluster labels for the training set, re-ordered by outcome rate (cluster 0 = highest outcome rate, e.g. highest mortality when the outcome is death)
$pred_C: nearest-centroid labels for the test set (use this, not $C, for test-set evaluation)
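A quick way to inspect the risk stratification is to tabulate the outcome rate per cluster. The sketch below uses made-up stand-in vectors; in a real run pred_C would come from the saved test list, and where the outcome vector sits inside that list is an assumption you should verify with str().

```r
# Stand-in vectors (made-up): cluster labels and binary outcomes.
pred_C <- c(0, 0, 1, 1, 1, 0, 1, 0, 1, 1)
y      <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0)

# Outcome rate and size per cluster.
rate <- tapply(y, pred_C, mean)
size <- table(pred_C)

data.frame(
  cluster      = names(rate),
  n            = as.integer(size),
  outcome_rate = as.numeric(rate)
)
```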

Full worked example

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset — including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, p < 0.001), and publication-quality figures — see:

vignette("heart-failure-example", package = "DICErClust")
