The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

dicepro - Simulated Data Workflow

dicepro Team

2026-06-24

Note: All code chunks have eval = FALSE and are shown for illustration only. To run them interactively:

library(dicepro)
# copy-paste the chunks below into your R session

1 Overview

This vignette demonstrates complete dicepro deconvolution workflows on fully synthetic data, enabling end-to-end reproducibility without requiring access to proprietary data-sets.

Two simulation strategies are available and covered here:

Function Reference matrix Cell-type names Best used for
simulation() Synthetic (generated internally) Generic (CellType_k) Flexible benchmarking with custom parameters
simulation_bluecode() BlueCode (bundled real reference) Real biological names Realistic benchmarking on human tissue cell types

Both functions return the same standardized triplet – $W, $p, $B – and are fully interchangeable as input to dicepro().

The pipeline proceeds in four stages:

  1. Simulate a reference signature matrix and ground-truth cell-type proportions.
  2. Generate a noisy bulk RNA-seq matrix from the simulated mixture.
  3. Deconvolve using the main dicepro() function with automated hyper-parameter optimization.
  4. Evaluate the recovered proportions against the ground truth.

2 Strategy 1 – Fully Synthetic Simulation

simulation() generates a self-consistent triplet from scratch: a synthetic reference matrix $W, ground-truth proportion matrix $p, and noisy bulk matrix $B, guaranteeing column-name alignment across all three outputs.

2.1 Data Generation

library(dicepro)
set.seed(2101L)

sim_data <- simulation(
  loi        = "gauss",
  scenario   = "hierarchical",
  nSample    = 30,
  nGenes     = 200,
  nCellsType = 10,
  sigma_bio  = 0.07,
  sigma_tech = 0.07,
  seed       = 2101L
)

cat("Reference  :", dim(sim_data$W), "\n")
cat("Proportions:", dim(sim_data$p), "\n")
cat("Bulk       :", dim(sim_data$B), "\n")
cat("Row sums   :", round(range(rowSums(sim_data$p)), 4), "\n")

2.2 Noise Model Sanity Check

bulk_clean <- as.matrix(sim_data$W) %*% t(as.matrix(sim_data$p))

plot(
  bulk_clean[seq_len(500)],
  as.matrix(sim_data$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#2c7bb6",
  main = "Noise model: clean vs noisy bulk"
)
abline(0, 1, col = "firebrick", lwd = 1.5)

2.3 Deconvolution

out <- dicepro(
  reference             = as.matrix(sim_data$W)[, -c(1,5,10)],
  bulk                  = as.matrix(sim_data$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "SimBulk",
  refName               = "SimRef",
  hp_max_evals          = 500,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)

2.4 Results

2.4.1 Optimal Hyperparameters

cat("Best lambda :", out$hyperparameters$lambda, "\n")
cat("Best gamma  :", out$hyperparameters$gamma,  "\n")
cat("Loss        :", out$metrics$loss,           "\n")
cat("Constraint  :", out$metrics$constraint,     "\n")

2.4.2 Hyperparameter Optimisation Report

out$plot_hyperopt

2.4.3 Pareto Frontier

out$plot

2.4.4 Recovered vs True Proportions

true_prop <- as.matrix(sim_data$p)
pred_prop <- as.matrix(out$H)

common_ct   <- intersect(colnames(pred_prop), colnames(true_prop))
true_common <- true_prop[, common_ct, drop = FALSE]
pred_common <- pred_prop[, common_ct, drop = FALSE]

r_overall <- cor(as.vector(true_common), as.vector(pred_common))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall))

plot(
  as.vector(true_common),
  as.vector(pred_common),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#2c7bb6aa",
  main = sprintf("True vs Predicted  (r = %.3f)", r_overall)
)
abline(0, 1, col = "firebrick", lwd = 1.5)

2.4.5 Per-Cell-Type Correlation

ct_cors <- vapply(common_ct, function(ct) {
  cor(true_common[, ct], pred_common[, ct])
}, numeric(1L))

par(mar = c(5, 10, 3, 1))
barplot(
  sort(ct_cors),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (synthetic)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")

2.4.6 Quantitative Performance Metrics

perf <- makeTable1Tool(pred_mat = pred_common, obs_mat = true_common)
knitr::kable(perf$Perf, digits = 3,
             caption = "Performance metrics -- fully synthetic data")

3 Strategy 2 – BlueCode-Based Simulation

simulation_bluecode() uses the BlueCode reference matrix bundled with the package – a curated signature matrix covering 34 human cell types across five tissue compartments – as the fixed reference $W. Proportions are simulated via a two-level Dirichlet hierarchy that reflects the biological organisation of human tissues.

This strategy is recommended when benchmarking should be grounded in a biologically realistic reference, while keeping full control over the ground-truth proportions.

3.1 Compartment Structure

The hierarchy encodes five major tissue compartments with their constituent cell types:

Compartment Cell types (n) Default α
Immune B naive/memory, T CD4/CD8, NK, Monocyte, Macrophage, Neutrophil 4.0
Stromal Fibroblasts (×5), MSC, Chondrocyte, Osteoblast 2.5
Endothelial Large vessel, microvascular (×2) 1.8
Epithelial Mammary, renal, respiratory, keratinocyte, melanocyte 1.8
Muscle Smooth muscle (×7), cardiac myocyte, myometrial 1.5

3.2 Data Generation

library(dicepro)

sim_bc <- simulation_bluecode(
  nSample    = 30,
  sigma_bio  = 0.15,
  sigma_tech = 0.02,
  seed       = 2101L
)

cat("Reference  :", dim(sim_bc$W), "\n")  # nGenes x 34
cat("Proportions:", dim(sim_bc$p), "\n")  # 30 x 34
cat("Bulk       :", dim(sim_bc$B), "\n")  # nGenes x 30
cat("Row sums   :", round(range(rowSums(sim_bc$p)), 4), "\n")

# Real cell-type names from BlueCode
head(colnames(sim_bc$p))

3.3 Proportion Distribution by Compartment

The hierarchical model induces realistic between-compartment variation. Visualising the compartment-level totals confirms the expected dominance of the Immune compartment.

# Cell-type to compartment mapping (mirrors .bluecode_cell_groups)
compartment_map <- c(
  rep("Immune",      9),
  rep("Stromal",     8),
  rep("Endothelial", 3),
  rep("Epithelial",  5),
  rep("Muscle",      9)
)
names(compartment_map) <- colnames(sim_bc$p)

# Aggregate proportions by compartment for each sample
comp_props <- t(apply(sim_bc$p, 1, function(row) {
  tapply(row, compartment_map[names(row)], sum)
}))

boxplot(
  comp_props,
  col  = c("#4393c3", "#74c476", "#fd8d3c", "#9ecae1", "#fb6a4a"),
  ylab = "Compartment proportion",
  main = "Simulated compartment proportions (BlueCode)",
  las  = 2
)

3.4 Noise Model Sanity Check

bulk_clean_bc <- as.matrix(sim_bc$W) %*% t(as.matrix(sim_bc$p))

plot(
  bulk_clean_bc[seq_len(500)],
  as.matrix(sim_bc$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#74c476",
  main = "Noise model: clean vs noisy bulk (BlueCode)"
)
abline(0, 1, col = "firebrick", lwd = 1.5)

3.5 Deconvolution

out_bc <- dicepro(
  reference             = as.matrix(sim_bc$W),
  bulk                  = as.matrix(sim_bc$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "BlueCodeBulk",
  refName               = "BlueCode",
  hp_max_evals          = 100,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)

3.6 Results

3.6.1 Optimal Hyperparameters

cat("Best lambda :", out_bc$hyperparameters$lambda, "\n")
cat("Best gamma  :", out_bc$hyperparameters$gamma,  "\n")
cat("Loss        :", out_bc$metrics$loss,            "\n")
cat("Constraint  :", out_bc$metrics$constraint,      "\n")

3.6.2 Hyperparameter Optimisation Report

out_bc$plot_hyperopt

3.6.3 Pareto Frontier

out_bc$plot

3.6.4 Recovered vs True Proportions

true_prop_bc <- as.matrix(sim_bc$p)
pred_prop_bc <- as.matrix(out_bc$H)

common_ct_bc   <- intersect(colnames(pred_prop_bc), colnames(true_prop_bc))
true_common_bc <- true_prop_bc[, common_ct_bc, drop = FALSE]
pred_common_bc <- pred_prop_bc[, common_ct_bc, drop = FALSE]

r_overall_bc <- cor(as.vector(true_common_bc), as.vector(pred_common_bc))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall_bc))

plot(
  as.vector(true_common_bc),
  as.vector(pred_common_bc),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#74c47699",
  main = sprintf("True vs Predicted -- BlueCode  (r = %.3f)", r_overall_bc)
)
abline(0, 1, col = "firebrick", lwd = 1.5)

3.6.5 Per-Cell-Type Correlation

ct_cors_bc <- vapply(common_ct_bc, function(ct) {
  cor(true_common_bc[, ct], pred_common_bc[, ct])
}, numeric(1L))

par(mar = c(5, 14, 3, 1))
barplot(
  sort(ct_cors_bc),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors_bc) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (BlueCode)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")

3.6.6 Quantitative Performance Metrics

perf_bc <- makeTable1Tool(pred_mat = pred_common_bc, obs_mat = true_common_bc)
knitr::kable(perf_bc$Perf, digits = 3,
             caption = "Performance metrics -- BlueCode simulation")

4 Comparing Both Strategies

When both strategies are run under the same conditions, their overall correlation scores can be compared directly.

cat(sprintf(
  "Overall Pearson r\n  Synthetic  : %.3f\n  BlueCode   : %.3f\n",
  r_overall, r_overall_bc
))

These two strategies are complementary: simulation() offers maximum flexibility for testing deconvolution under stress conditions, with varying numbers of genes, cell types, or noise levels; simulation_bluecode() anchors the benchmark to a biologically grounded reference, making the results easier to interpret in the context of real-world data from human tissues.


5 Session Info

sessionInfo()

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.