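DICErClust provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives:
- reconstruction of the input features by the autoencoder (weighted by lambda_AE)
- cluster assignment (weighted by lambda_classifier)
- prediction of the binary outcome (weighted by lambda_outcome)
- statistical significance of the clusters, enforced through a likelihood-ratio test (LRT) penalty (weighted by lambda_p_value)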
The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated.
Reference: Huang Y, Du C, Zhu F, et al. (2021). Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction. J Am Med Inform Assoc, 28, 2394–2403. doi:10.1093/jamia/ocab203
## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)

DICErClust depends on the torch package for R. If you have not installed torch before, run torch::install_torch() once after installing the package.
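Before the first run it can be useful to confirm that both the torch package and its backend libraries are present. The following is a minimal sketch; `torch_status()` is a hypothetical helper introduced here for illustration and is not part of DICErClust.

```r
# Report whether the torch R package and its backend libraries are available.
# torch_status() is a hypothetical helper, not a DICErClust function.
torch_status <- function() {
  if (!requireNamespace("torch", quietly = TRUE)) {
    return("package-missing")   # install.packages("torch") is still needed
  }
  if (!torch::torch_is_installed()) {
    return("backend-missing")   # torch::install_torch() is still needed
  }
  "ready"
}

status <- torch_status()
message("torch status: ", status)
```

If the status is "backend-missing", running torch::install_torch() once, as described above, should resolve it.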
DICEr() reads training and test data from RDS files.
Each file must be a length-3 list:
| Position | Name | Type | Description |
|---|---|---|---|
| [[1]] | data_x | numeric matrix n × p | Continuous features; LSTM encoder input |
| [[2]] | data_v | numeric matrix n × q | Binary demographic covariates; outcome-head auxiliary input |
| [[3]] | data_y | integer vector length n | Binary outcome (0/1) |
Important: data_v must use R numeric (float64) storage, not integer. The torch backend infers tensor dtype from the R storage mode; integer columns produce int64 tensors that are incompatible with the float32 model weights.
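The format rules above can be checked before training. The sketch below uses only base R; `check_dice_input()` is a hypothetical helper written for this illustration (DICErClust itself may or may not perform similar checks), and the small `good` and `bad` lists are made-up examples.

```r
# check_dice_input() is a hypothetical helper, not a DICErClust function.
# It returns a character vector of problems (empty if none were found).
check_dice_input <- function(d) {
  if (!is.list(d) || length(d) != 3L) {
    stop("input must be a length-3 list: data_x, data_v, data_y")
  }
  x <- d[[1]]; v <- d[[2]]; y <- d[[3]]
  problems <- character(0)
  if (!is.matrix(x) || !is.numeric(x)) {
    problems <- c(problems, "data_x must be a numeric matrix")
  }
  if (!is.matrix(v) || storage.mode(v) != "double") {
    problems <- c(problems, "data_v must be a matrix with double storage")
  }
  if (NROW(x) != NROW(v) || NROW(x) != length(y)) {
    problems <- c(problems, "data_x, data_v and data_y must agree on n")
  }
  if (!all(y %in% c(0, 1))) {
    problems <- c(problems, "data_y must contain only 0 and 1")
  }
  problems
}

good <- list(
  matrix(runif(12), 4, 3),          # data_x
  matrix(c(0, 1, 1, 0), 4, 1),      # data_v stored as double
  c(0L, 1L, 0L, 1L)                 # data_y
)
bad <- good
bad[[2]] <- matrix(c(0L, 1L, 1L, 0L), 4, 1)   # integer storage: flagged

res_good <- check_dice_input(good)
res_bad  <- check_dice_input(bad)

# Converting the storage mode in place repairs the integer matrix.
storage.mode(bad[[2]]) <- "double"
res_fixed <- check_dice_input(bad)
```

Setting `storage.mode(v) <- "double"` keeps the matrix dimensions, whereas `as.numeric(v)` alone would drop them.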
## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L
make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                        # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q),   # data_v: binary float
      rbinom(n, 1, 0.3)                                  # data_y: outcome
    ),
    path
  )
}
data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test, file.path(data_dir, "test.rds"))
## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")

library(DICErClust)
args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,       # columns in data_x
  n_hidden_fea      = 3L,      # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,      # number of clusters
  n_dummy_demov_fea = q,       # columns in data_v
  cuda              = FALSE,   # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,      # Stage 1 warm-up epochs
  iter              = 20L,     # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)
old_wd <- setwd(tempdir())
DICEr(args) # writes output to hn_3_K_2/part2_AE_nhidden_3/
setwd(old_wd)

part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")
res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test <- readRDS(file.path(part2_dir, "data_test_iter.rds"))
## Cluster assignments
## Training set: use res_train$C (k-means labels, re-ordered by outcome rate)
## Test set: use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)

| Argument | Default | Effect |
|---|---|---|
| n_hidden_fea | none | LSTM latent dimension; controls representation capacity |
| K_clusters | none | Number of clusters |
| init_AE_epoch | 5 | Stage 1 warm-up length |
| iter | 20 | Maximum Stage 2 iterations |
| epoch_in_iter | 1 | Gradient-update epochs per iteration |
| lr | 1e-4 | Adam learning rate |
| lambda_AE | 1.0 | Weight on reconstruction loss |
| lambda_classifier | 1.0 | Weight on cluster-assignment loss |
| lambda_outcome | 1.0 | Weight on outcome BCE loss |
| lambda_p_value | 1.0 | Weight on LRT significance penalty |
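Read together, the four lambda arguments suggest a weighted-sum training objective. Assuming the individual losses are combined linearly (an assumption; the loss symbols below are illustrative labels, not names taken from the package), the total loss would have the form:

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{AE}}\,\mathcal{L}_{\text{AE}}
  + \lambda_{\text{classifier}}\,\mathcal{L}_{\text{classifier}}
  + \lambda_{\text{outcome}}\,\mathcal{L}_{\text{outcome}}
  + \lambda_{p\text{-value}}\,\mathcal{L}_{p\text{-value}}
```

Under this reading, raising one weight relative to the others shifts the optimisation toward that objective.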
All four lambda weights default to 1.0, so each objective enters training with equal weight. The LRT significance threshold (the χ² critical value with 1 degree of freedom at α = 0.05, i.e. 3.841) is fixed and not user-tunable.
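The fixed threshold can be reproduced in base R, and a likelihood-ratio test against it is easy to sketch. The example below is a generic one-degree-of-freedom LRT on simulated data with two clusters; how DICErClust specifies its own LRT model (for example, whether it adjusts for data_v) is not stated here, so treat the model formulas as assumptions.

```r
set.seed(1)

# Critical value of the chi-squared distribution, 1 df, alpha = 0.05.
threshold <- qchisq(0.95, df = 1)
round(threshold, 3)   # 3.841

# Simulated stand-ins for cluster labels and a binary outcome
# (cluster 0 is given the higher outcome rate).
cluster <- rep(c(0L, 1L), each = 100)
outcome <- rbinom(200, 1, ifelse(cluster == 0L, 0.6, 0.2))

# Null model: outcome rate is the same in every cluster.
fit_null <- glm(outcome ~ 1, family = binomial())
# Full model: outcome rate depends on cluster membership.
fit_full <- glm(outcome ~ factor(cluster), family = binomial())

# LRT statistic: twice the gain in log-likelihood from adding the cluster term.
lrt_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
significant <- lrt_stat > threshold
```

With K = 2 the cluster term adds one parameter, which is why the statistic is compared with the 1-df critical value.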
After a successful run, DICEr creates:
<working_dir>/
└── hn_<n_hidden>_K_<K>/
├── part1_AE_nhidden_<n>/ # Stage 1 autoencoder outputs
│ └── part1_loss_AE.png
└── part2_AE_nhidden_<n>/ # Stage 2 best checkpoint
├── data_train_iter.rds # training set with C assignments
└── data_test_iter.rds # test set with pred_C assignments
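The naming pattern in the tree above can be turned into paths programmatically. This is a small sketch; `dice_output_paths()` is a hypothetical helper introduced for illustration, and it assumes both directory levels use the n_hidden_fea value, as in the hn_3_K_2/part2_AE_nhidden_3 example.

```r
# dice_output_paths() is a hypothetical helper, not a DICErClust function.
# It builds the expected Stage 2 result paths from n_hidden_fea and K_clusters.
dice_output_paths <- function(n_hidden, K, base_dir = getwd()) {
  run_dir <- file.path(base_dir, sprintf("hn_%d_K_%d", n_hidden, K))
  part2   <- file.path(run_dir, sprintf("part2_AE_nhidden_%d", n_hidden))
  c(
    train = file.path(part2, "data_train_iter.rds"),
    test  = file.path(part2, "data_test_iter.rds")
  )
}

paths <- dice_output_paths(3L, 2L, base_dir = tempdir())
found <- file.exists(paths)   # TRUE only for files a finished run has written
```

Checking `found` before calling readRDS() gives a clearer failure than a missing-file error when a run did not complete.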
data_train_iter.rds and data_test_iter.rds are the data lists enriched with cluster assignment fields:
- $C: k-means cluster labels (training set; cluster 0 = highest mortality)
- $pred_C: nearest-centroid labels for the test set (use this, not $C, for test-set evaluation)

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset, including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, p < 0.001), and publication-quality figures, see:
vignette("heart-failure-example", package = "DICErClust")