DICErClust Illustrated Example: Heart Failure Risk Stratification

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Sarah Ayton and Yiye Zhang

2026-07-08

Overview

This vignette walks through a complete DICErClust workflow on the UCI Heart Failure Clinical Records dataset (Chicco & Jurman 2020). By the end you will have:

split 299 patients into training and test sets,
trained DICEr to find two risk-stratified clusters,
evaluated the clusters with an AUC and chi-squared test, and
produced two publication-quality figures.

Approximate run time on a single CPU core: 15–20 minutes (the training step is the bottleneck; all other steps are fast).

1. Load DICErClust

DICErClust is distributed as a source tarball. Install it once with install.packages(), then load it like any other package.

## Install from local tarball (run once):
# install.packages(
#   "/path/to/DICErClust_0.1.1.tar.gz",
#   repos = NULL, type = "source"
# )
library(DICErClust)
library(ggplot2)
library(pROC)

2. Download the UCI Heart Failure dataset

Chicco & Jurman (2020) collected 299 heart-failure patients from the Faisalabad Institute of Cardiology in Pakistan. The dataset contains 12 clinical features and a binary outcome DEATH_EVENT (1 = died during follow-up; 0 = survived). It is freely available from the UCI Machine Learning Repository (dataset #519).

hf_url  <- paste0(
  "https://archive.ics.uci.edu/ml/",
  "machine-learning-databases/00519/",
  "heart_failure_clinical_records_dataset.csv"
)
hf_dest <- tempfile(fileext = ".csv")
download.file(hf_url, hf_dest, quiet = TRUE)
hf <- read.csv(hf_dest)

cat(sprintf("Rows: %d   Columns: %d\n", nrow(hf), ncol(hf)))
print(table(DEATH_EVENT = hf$DEATH_EVENT))

The dataset is small but class-imbalanced: roughly 68% of patients survived (DEATH_EVENT = 0) and 32% died (DEATH_EVENT = 1).

3. Feature engineering

DICEr expects the input to be split into two feature matrices:

Matrix	Contents	Role
`data_x`	Continuous laboratory / physiological measurements	Encoder input: the LSTM compresses these into a low-dimensional latent representation used for clustering
`data_v`	Binary demographic / comorbidity indicators	Auxiliary outcome-head input: the likelihood-ratio test assesses whether the cluster explains outcome above and beyond these covariates

This design mirrors real-world EHR practice: data_x captures time-varying lab values while data_v captures static patient characteristics.

## Continuous lab features → LSTM encoder (data_x)
x_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
            "platelets", "serum_creatinine", "serum_sodium", "time")

## Binary demographic indicators → outcome head (data_v)
v_cols <- c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")

## Min-max scale continuous features to [0, 1].
## Scaling prevents any single lab value from dominating the MSE
## reconstruction loss relative to others.
scale_01 <- function(x) {
  r <- range(x, na.rm = TRUE)
  if (diff(r) == 0) return(x * 0)
  (x - r[1]) / diff(r)
}

X_x <- apply(as.matrix(hf[, x_cols]), 2, scale_01)  # 299 × 7, numeric
X_v <- apply(as.matrix(hf[, v_cols]), 2, as.numeric) # 299 × 5, binary as float

## Note: data_v *must* be stored as numeric (float), not integer.
## torch_tensor() infers dtype from R storage mode; integer columns produce
## int64 tensors that are incompatible with the float32 model weights.

cat(sprintf("data_x: %d × %d\ndata_v: %d × %d\n",
            nrow(X_x), ncol(X_x), nrow(X_v), ncol(X_v)))

n_x <- ncol(X_x)  # 7  continuous predictors
n_v <- ncol(X_v)  # 5  binary demographics
outcome <- hf$DEATH_EVENT

4. Stratified train / test split

A 70 / 30 stratified split preserves the ~32% event rate in both partitions, which is important given the small sample size.

set.seed(1111)
idx_death <- which(outcome == 1)
idx_alive <- which(outcome == 0)

train_idx <- sort(c(
  sample(idx_death, floor(0.70 * length(idx_death))),
  sample(idx_alive, floor(0.70 * length(idx_alive)))
))
test_idx <- setdiff(seq_len(nrow(hf)), train_idx)

cat(sprintf("Train: %d patients  (deaths: %d, %.0f%%)\n",
            length(train_idx), sum(outcome[train_idx]),
            100 * mean(outcome[train_idx])))
cat(sprintf("Test : %d patients  (deaths: %d, %.0f%%)\n",
            length(test_idx),  sum(outcome[test_idx]),
            100 * mean(outcome[test_idx])))

5. Serialise data in DICErClust format

DICEr() reads training and test sets from RDS files. Each file must contain a length-3 list:

data_x — numeric matrix, shape n × p
data_v — numeric matrix, shape n × q
data_y — integer vector of 0/1 outcomes, length n

data_dir <- file.path(tempdir(), "dice_hf")
dir.create(data_dir, showWarnings = FALSE)

saveRDS(
  list(X_x[train_idx, ], X_v[train_idx, ], as.integer(outcome[train_idx])),
  file.path(data_dir, "hf_train.rds")
)
saveRDS(
  list(X_x[test_idx, ], X_v[test_idx, ], as.integer(outcome[test_idx])),
  file.path(data_dir, "hf_test.rds")
)

6. Configure DICEr

The argument list controls both the architecture and the training schedule.

args <- list(
  seed              = 1111,          # reproducibility seed
  input_path        = data_dir,      # directory containing RDS files
  filename_train    = "hf_train.rds",
  filename_test     = "hf_test.rds",

  ## ── Architecture ──────────────────────────────────────────
  n_input_fea       = n_x,          # 7 continuous LSTM input features
  n_hidden_fea      = 4,            # LSTM latent dimension (7 → 4)
  lstm_layer        = 1,            # single LSTM layer
  lstm_dropout      = 0.0,          # no dropout (small dataset)
  K_clusters        = 2,            # binary risk partition: high vs. low

  ## ── Auxiliary features ────────────────────────────────────
  n_dummy_demov_fea = n_v,          # 5 binary demographic covariates

  ## ── Hardware ──────────────────────────────────────────────
  cuda              = FALSE,        # set TRUE for GPU acceleration

  ## ── Optimiser ─────────────────────────────────────────────
  lr                = 1e-4,         # Adam learning rate

  ## ── Training schedule ─────────────────────────────────────
  init_AE_epoch     = 5,            # Stage 1: autoencoder warm-up epochs
  iter              = 30,           # Stage 2: number of clustering iterations
  epoch_in_iter     = 2,            # gradient-update epochs per iteration

  ## ── Loss weights ──────────────────────────────────────────
  ## Combined loss: L = λ_AE·L_AE + λ_clf·L_classifier
  ##                  + λ_out·L_outcome + λ_p·L_p_value
  ## L_p_value = 3.841 − G penalises non-significant cluster configurations
  ## (G is the LRT statistic; 3.841 is the χ²(1) critical value at α = 0.05)
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

Training stages explained

DICEr runs three sequential stages:

LSTM autoencoder warm-up (init_AE_epoch epochs): trains the encoder and decoder to reconstruct data_x, establishing a compact latent representation before any clustering begins.
Joint optimisation (iter iterations × epoch_in_iter epochs each): alternates between
1. k-means clustering in the LSTM latent space, and
2. gradient updates minimising the combined four-component loss. After each k-means step the cluster with the highest data_y = 1 rate is relabelled cluster 0 (the high-risk cluster).
Model selection: saves the checkpoint with the lowest test negative log-likelihood that also satisfies the likelihood-ratio test p < 0.05 between at least one cluster pair.

7. Train the model

## DICEr writes output files relative to the working directory.
## We temporarily switch to tempdir() to keep them self-contained.
old_wd <- setwd(tempdir())
suppressWarnings(DICEr(args))
setwd(old_wd)

Output is written to hn_4_K_2/part2_AE_nhidden_4/ relative to the working directory used during training. The key files are:

data_train_iter.rds — training-set data frame with cluster assignments (C)
data_test_iter.rds — test-set data frame with cluster assignments (pred_C)

8. Load the best checkpoint

part2_dir <- file.path(tempdir(), "hn_4_K_2", "part2_AE_nhidden_4")

if (!file.exists(file.path(part2_dir, "data_train_iter.rds"))) {
  stop(
    "No checkpoint found — the p < 0.05 criterion was not met in ",
    args$iter, " iterations.  Increase args$iter and rerun."
  )
}

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

Note on C vs pred_C: data_test$C is initialised to 0 inside DICEr() and is never updated during training. Always use data_test$pred_C (nearest-centroid assignments) for test-set evaluation.

9. Evaluate cluster quality

The code below uses pre-computed results from the reference run (seed = 1111, 30 iterations) so the vignette builds without retraining. When you run DICEr() yourself the results will be loaded from the checkpoint you just produced.

Dynamic cluster labelling

k-means spatial boundaries may assign a different cluster index to the high-mortality group in training versus test, so labels should be assigned within each split based on the observed death rate.

label_by_rate <- function(df) {
  rates <- tapply(df$death, df$cluster, mean)
  hi    <- as.integer(names(which.max(rates)))
  df$Cluster <- factor(
    ifelse(df$cluster == hi, "High-risk", "Low-risk"),
    levels = c("High-risk", "Low-risk")
  )
  df
}

train_df <- label_by_rate(train_df)
test_df  <- label_by_rate(test_df)

Cluster outcome summary

summarise_clusters <- function(df, split_name) {
  do.call(rbind, lapply(split(df, df$Cluster), function(d) {
    data.frame(
      Split     = split_name,
      Cluster   = as.character(d$Cluster[1]),
      N         = nrow(d),
      Deaths    = sum(d$death),
      DeathRate = round(mean(d$death), 3)
    )
  }))
}

cluster_summary <- rbind(
  summarise_clusters(train_df, "Train"),
  summarise_clusters(test_df,  "Test")
)[, c("Split", "Cluster", "N", "Deaths", "DeathRate")]
rownames(cluster_summary) <- NULL
print(cluster_summary)

AUC

test_score <- as.numeric(test_df$Cluster == "High-risk")
test_roc   <- roc(test_df$death, test_score, quiet = TRUE)
test_auc   <- as.numeric(auc(test_roc))
cat(sprintf("Test AUC: %.4f\n", test_auc))

Chi-squared test

ct        <- table(Cluster = test_df$Cluster, Death = test_df$death)
chisq_res <- suppressWarnings(chisq.test(ct))
print(ct)
cat(sprintf("Chi-squared = %.3f, df = %d, p %s\n",
            chisq_res$statistic,
            chisq_res$parameter,
            ifelse(chisq_res$p.value < 0.001, "< 0.001",
                   sprintf("= %.4f", chisq_res$p.value))))

An AUC of 0.823 and a chi-squared p-value < 0.001 indicate that DICEr has identified two clusters with a strong and statistically significant difference in mortality rate: 71.9% in the High-risk cluster versus 10.3% in the Low-risk cluster.

10. Figures

Figure 1 — Cluster outcome bar chart

te_sum <- summarise_clusters(test_df, "Test")

ggplot(te_sum, aes(x = Cluster, y = DeathRate, fill = Cluster)) +
  geom_col(width = 0.5, colour = "black", linewidth = 0.4) +
  geom_text(aes(label = paste0(Deaths, "/", N)),
            vjust = -0.4, size = 4) +
  scale_fill_manual(
    values = c("High-risk" = "#d73027", "Low-risk" = "#4575b4")
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 1)
  ) +
  labs(
    title   = "DEATH_EVENT rate by DICEr cluster (test set)",
    x       = "Cluster",
    y       = "Proportion deceased",
    caption = "UCI Heart Failure Clinical Records  |  DICErClust 0.1.1"
  ) +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Figure 2 — ROC curve

roc_df <- data.frame(
  FPR = 1 - test_roc$specificities,
  TPR = test_roc$sensitivities
)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#d73027", linewidth = 1) +
  geom_abline(linetype = "dashed", colour = "grey50") +
  annotate("text", x = 0.55, y = 0.15,
           label = sprintf("AUC = %.3f", test_auc),
           size = 5, colour = "#d73027") +
  labs(
    title   = "ROC curve — DICEr cluster vs. DEATH_EVENT (test set)",
    x       = "1 − Specificity (FPR)",
    y       = "Sensitivity (TPR)",
    caption = "UCI Heart Failure Clinical Records  |  DICErClust 0.1.1"
  ) +
  theme_bw(base_size = 13)

Summary of results

Metric	Value
Best checkpoint iteration	19
LRT p-value at checkpoint	0.0100
Test negative log-likelihood	0.6493
Test AUC	0.823
High-risk mortality (test)	71.9% (23/32)
Low-risk mortality (test)	10.3% (6/58)
Chi-squared statistic	32.99
Chi-squared p-value	< 0.001

References

Chicco D, Jurman G (2020). “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone.” BMC Medical Informatics and Decision Making, 20, 16. doi:10.1186/s12911-020-1023-5

Dua D, Graff C (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Huang Y, Du C, Zhu F, et al. (2021). “Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction.” Journal of the American Medical Informatics Association, 28, 2394–2403. doi:10.1093/jamia/ocab203

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.