The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This vignette walks through a complete DICErClust workflow on the UCI Heart Failure Clinical Records dataset (Chicco & Jurman 2020). By the end you will have:
Approximate run time on a single CPU core: 15–20 minutes (the training step is the bottleneck; all other steps are fast).
DICErClust is distributed as a source tarball. Install it once with
install.packages(), then load it like any other
package.
## Install from local tarball (run once):
# install.packages(
# "/path/to/DICErClust_0.1.1.tar.gz",
# repos = NULL, type = "source"
# )
library(DICErClust)
library(ggplot2)
library(pROC)Chicco & Jurman (2020) collected 299 heart-failure patients from
the Faisalabad Institute of Cardiology in Pakistan. The dataset contains
12 clinical features and a binary outcome DEATH_EVENT (1 =
died during follow-up; 0 = survived). It is freely available from the
UCI Machine Learning Repository (dataset #519).
hf_url <- paste0(
"https://archive.ics.uci.edu/ml/",
"machine-learning-databases/00519/",
"heart_failure_clinical_records_dataset.csv"
)
hf_dest <- tempfile(fileext = ".csv")
download.file(hf_url, hf_dest, quiet = TRUE)
hf <- read.csv(hf_dest)
cat(sprintf("Rows: %d Columns: %d\n", nrow(hf), ncol(hf)))
print(table(DEATH_EVENT = hf$DEATH_EVENT))The dataset is small but class-imbalanced: roughly 68% of patients survived (DEATH_EVENT = 0) and 32% died (DEATH_EVENT = 1).
DICEr expects the input to be split into two feature matrices:
| Matrix | Contents | Role |
|---|---|---|
data_x |
Continuous laboratory / physiological measurements | Encoder input: the LSTM compresses these into a low-dimensional latent representation used for clustering |
data_v |
Binary demographic / comorbidity indicators | Auxiliary outcome-head input: the likelihood-ratio test assesses whether the cluster explains outcome above and beyond these covariates |
This design mirrors real-world EHR practice: data_x
captures time-varying lab values while data_v captures
static patient characteristics.
## Continuous lab features → LSTM encoder (data_x)
x_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
"platelets", "serum_creatinine", "serum_sodium", "time")
## Binary demographic indicators → outcome head (data_v)
v_cols <- c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")
## Min-max scale continuous features to [0, 1].
## Scaling prevents any single lab value from dominating the MSE
## reconstruction loss relative to others.
scale_01 <- function(x) {
r <- range(x, na.rm = TRUE)
if (diff(r) == 0) return(x * 0)
(x - r[1]) / diff(r)
}
X_x <- apply(as.matrix(hf[, x_cols]), 2, scale_01) # 299 × 7, numeric
X_v <- apply(as.matrix(hf[, v_cols]), 2, as.numeric) # 299 × 5, binary as float
## Note: data_v *must* be stored as numeric (float), not integer.
## torch_tensor() infers dtype from R storage mode; integer columns produce
## int64 tensors that are incompatible with the float32 model weights.
cat(sprintf("data_x: %d × %d\ndata_v: %d × %d\n",
nrow(X_x), ncol(X_x), nrow(X_v), ncol(X_v)))
n_x <- ncol(X_x) # 7 continuous predictors
n_v <- ncol(X_v) # 5 binary demographics
outcome <- hf$DEATH_EVENTA 70 / 30 stratified split preserves the ~32% event rate in both partitions, which is important given the small sample size.
set.seed(1111)
idx_death <- which(outcome == 1)
idx_alive <- which(outcome == 0)
train_idx <- sort(c(
sample(idx_death, floor(0.70 * length(idx_death))),
sample(idx_alive, floor(0.70 * length(idx_alive)))
))
test_idx <- setdiff(seq_len(nrow(hf)), train_idx)
cat(sprintf("Train: %d patients (deaths: %d, %.0f%%)\n",
length(train_idx), sum(outcome[train_idx]),
100 * mean(outcome[train_idx])))
cat(sprintf("Test : %d patients (deaths: %d, %.0f%%)\n",
length(test_idx), sum(outcome[test_idx]),
100 * mean(outcome[test_idx])))DICEr() reads training and test sets from RDS files.
Each file must contain a length-3 list:
data_x — numeric matrix, shape n ×
pdata_v — numeric matrix, shape n ×
qdata_y — integer vector of 0/1 outcomes, length
ndata_dir <- file.path(tempdir(), "dice_hf")
dir.create(data_dir, showWarnings = FALSE)
saveRDS(
list(X_x[train_idx, ], X_v[train_idx, ], as.integer(outcome[train_idx])),
file.path(data_dir, "hf_train.rds")
)
saveRDS(
list(X_x[test_idx, ], X_v[test_idx, ], as.integer(outcome[test_idx])),
file.path(data_dir, "hf_test.rds")
)The argument list controls both the architecture and the training schedule.
args <- list(
seed = 1111, # reproducibility seed
input_path = data_dir, # directory containing RDS files
filename_train = "hf_train.rds",
filename_test = "hf_test.rds",
## ── Architecture ──────────────────────────────────────────
n_input_fea = n_x, # 7 continuous LSTM input features
n_hidden_fea = 4, # LSTM latent dimension (7 → 4)
lstm_layer = 1, # single LSTM layer
lstm_dropout = 0.0, # no dropout (small dataset)
K_clusters = 2, # binary risk partition: high vs. low
## ── Auxiliary features ────────────────────────────────────
n_dummy_demov_fea = n_v, # 5 binary demographic covariates
## ── Hardware ──────────────────────────────────────────────
cuda = FALSE, # set TRUE for GPU acceleration
## ── Optimiser ─────────────────────────────────────────────
lr = 1e-4, # Adam learning rate
## ── Training schedule ─────────────────────────────────────
init_AE_epoch = 5, # Stage 1: autoencoder warm-up epochs
iter = 30, # Stage 2: number of clustering iterations
epoch_in_iter = 2, # gradient-update epochs per iteration
## ── Loss weights ──────────────────────────────────────────
## Combined loss: L = λ_AE·L_AE + λ_clf·L_classifier
## + λ_out·L_outcome + λ_p·L_p_value
## L_p_value = 3.841 − G penalises non-significant cluster configurations
## (G is the LRT statistic; 3.841 is the χ²(1) critical value at α = 0.05)
lambda_AE = 1.0,
lambda_classifier = 1.0,
lambda_outcome = 1.0,
lambda_p_value = 1.0
)DICEr runs three sequential stages:
LSTM autoencoder warm-up
(init_AE_epoch epochs): trains the encoder and decoder to
reconstruct data_x, establishing a compact latent
representation before any clustering begins.
Joint optimisation (iter iterations
× epoch_in_iter epochs each): alternates between
data_y = 1
rate is relabelled cluster 0 (the high-risk cluster).Model selection: saves the checkpoint with the lowest test negative log-likelihood that also satisfies the likelihood-ratio test p < 0.05 between at least one cluster pair.
## DICEr writes output files relative to the working directory.
## We temporarily switch to tempdir() to keep them self-contained.
old_wd <- setwd(tempdir())
suppressWarnings(DICEr(args))
setwd(old_wd)Output is written to hn_4_K_2/part2_AE_nhidden_4/
relative to the working directory used during training. The key files
are:
data_train_iter.rds — training-set data frame with
cluster assignments (C)data_test_iter.rds — test-set data frame with cluster
assignments (pred_C)part2_dir <- file.path(tempdir(), "hn_4_K_2", "part2_AE_nhidden_4")
if (!file.exists(file.path(part2_dir, "data_train_iter.rds"))) {
stop(
"No checkpoint found — the p < 0.05 criterion was not met in ",
args$iter, " iterations. Increase args$iter and rerun."
)
}
res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test <- readRDS(file.path(part2_dir, "data_test_iter.rds"))Note on
Cvspred_C:data_test$Cis initialised to 0 insideDICEr()and is never updated during training. Always usedata_test$pred_C(nearest-centroid assignments) for test-set evaluation.
The code below uses pre-computed results from the reference run (seed
= 1111, 30 iterations) so the vignette builds without retraining. When
you run DICEr() yourself the results will be loaded from
the checkpoint you just produced.
k-means spatial boundaries may assign a different cluster index to the high-mortality group in training versus test, so labels should be assigned within each split based on the observed death rate.
label_by_rate <- function(df) {
rates <- tapply(df$death, df$cluster, mean)
hi <- as.integer(names(which.max(rates)))
df$Cluster <- factor(
ifelse(df$cluster == hi, "High-risk", "Low-risk"),
levels = c("High-risk", "Low-risk")
)
df
}
train_df <- label_by_rate(train_df)
test_df <- label_by_rate(test_df)summarise_clusters <- function(df, split_name) {
do.call(rbind, lapply(split(df, df$Cluster), function(d) {
data.frame(
Split = split_name,
Cluster = as.character(d$Cluster[1]),
N = nrow(d),
Deaths = sum(d$death),
DeathRate = round(mean(d$death), 3)
)
}))
}
cluster_summary <- rbind(
summarise_clusters(train_df, "Train"),
summarise_clusters(test_df, "Test")
)[, c("Split", "Cluster", "N", "Deaths", "DeathRate")]
rownames(cluster_summary) <- NULL
print(cluster_summary)test_score <- as.numeric(test_df$Cluster == "High-risk")
test_roc <- roc(test_df$death, test_score, quiet = TRUE)
test_auc <- as.numeric(auc(test_roc))
cat(sprintf("Test AUC: %.4f\n", test_auc))ct <- table(Cluster = test_df$Cluster, Death = test_df$death)
chisq_res <- suppressWarnings(chisq.test(ct))
print(ct)
cat(sprintf("Chi-squared = %.3f, df = %d, p %s\n",
chisq_res$statistic,
chisq_res$parameter,
ifelse(chisq_res$p.value < 0.001, "< 0.001",
sprintf("= %.4f", chisq_res$p.value))))An AUC of 0.823 and a chi-squared p-value < 0.001 indicate that DICEr has identified two clusters with a strong and statistically significant difference in mortality rate: 71.9% in the High-risk cluster versus 10.3% in the Low-risk cluster.
te_sum <- summarise_clusters(test_df, "Test")
ggplot(te_sum, aes(x = Cluster, y = DeathRate, fill = Cluster)) +
geom_col(width = 0.5, colour = "black", linewidth = 0.4) +
geom_text(aes(label = paste0(Deaths, "/", N)),
vjust = -0.4, size = 4) +
scale_fill_manual(
values = c("High-risk" = "#d73027", "Low-risk" = "#4575b4")
) +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 1)
) +
labs(
title = "DEATH_EVENT rate by DICEr cluster (test set)",
x = "Cluster",
y = "Proportion deceased",
caption = "UCI Heart Failure Clinical Records | DICErClust 0.1.1"
) +
theme_bw(base_size = 13) +
theme(legend.position = "none")roc_df <- data.frame(
FPR = 1 - test_roc$specificities,
TPR = test_roc$sensitivities
)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(colour = "#d73027", linewidth = 1) +
geom_abline(linetype = "dashed", colour = "grey50") +
annotate("text", x = 0.55, y = 0.15,
label = sprintf("AUC = %.3f", test_auc),
size = 5, colour = "#d73027") +
labs(
title = "ROC curve — DICEr cluster vs. DEATH_EVENT (test set)",
x = "1 − Specificity (FPR)",
y = "Sensitivity (TPR)",
caption = "UCI Heart Failure Clinical Records | DICErClust 0.1.1"
) +
theme_bw(base_size = 13)| Metric | Value |
|---|---|
| Best checkpoint iteration | 19 |
| LRT p-value at checkpoint | 0.0100 |
| Test negative log-likelihood | 0.6493 |
| Test AUC | 0.823 |
| High-risk mortality (test) | 71.9% (23/32) |
| Low-risk mortality (test) | 10.3% (6/58) |
| Chi-squared statistic | 32.99 |
| Chi-squared p-value | < 0.001 |
Chicco D, Jurman G (2020). “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone.” BMC Medical Informatics and Decision Making, 20, 16. doi:10.1186/s12911-020-1023-5
Dua D, Graff C (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
Huang Y, Du C, Zhu F, et al. (2021). “Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction.” Journal of the American Medical Informatics Association, 28, 2394–2403. doi:10.1093/jamia/ocab203
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.