dicepro - Simulated Data Workflow

dicepro Team

2026-06-24

Note: All code chunks have eval = FALSE and are shown for illustration only. To run them interactively:

library(dicepro)
# copy-paste the chunks below into your R session

1 Overview

This vignette demonstrates complete dicepro deconvolution workflows on fully synthetic data, enabling end-to-end reproducibility without requiring access to proprietary data-sets.

Two simulation strategies are available and covered here:

Function Reference matrix Cell-type names Best used for
simulation() Synthetic (generated internally) Generic (CellType_k) Flexible benchmarking with custom parameters
simulation_bluecode() BlueCode (bundled real reference) Real biological names Realistic benchmarking on human tissue cell types

Both functions return the same standardized triplet – $W, $p, $B – and are fully interchangeable as input to dicepro().

The pipeline proceeds in four stages:

  1. Simulate a reference signature matrix and ground-truth cell-type proportions.
  2. Generate a noisy bulk RNA-seq matrix from the simulated mixture.
  3. Deconvolve using the main dicepro() function with automated hyper-parameter optimization.
  4. Evaluate the recovered proportions against the ground truth.

2 Strategy 1 – Fully Synthetic Simulation

simulation() generates a self-consistent triplet from scratch: a synthetic reference matrix $W, ground-truth proportion matrix $p, and noisy bulk matrix $B, guaranteeing column-name alignment across all three outputs.

2.1 Data Generation

library(dicepro)
set.seed(2101L)

sim_data <- simulation(
  loi        = "gauss",
  scenario   = "hierarchical",
  nSample    = 30,
  nGenes     = 200,
  nCellsType = 10,
  sigma_bio  = 0.07,
  sigma_tech = 0.07,
  seed       = 2101L
)

cat("Reference  :", dim(sim_data$W), "\n")
cat("Proportions:", dim(sim_data$p), "\n")
cat("Bulk       :", dim(sim_data$B), "\n")
cat("Row sums   :", round(range(rowSums(sim_data$p)), 4), "\n")

2.2 Noise Model Sanity Check

bulk_clean <- as.matrix(sim_data$W) %*% t(as.matrix(sim_data$p))

plot(
  bulk_clean[seq_len(500)],
  as.matrix(sim_data$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#2c7bb6",
  main = "Noise model: clean vs noisy bulk"
)
abline(0, 1, col = "firebrick", lwd = 1.5)

2.3 Deconvolution

out <- dicepro(
  reference             = as.matrix(sim_data$W)[, -c(1,5,10)],
  bulk                  = as.matrix(sim_data$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "SimBulk",
  refName               = "SimRef",
  hp_max_evals          = 500,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)

2.4 Results

2.4.1 Optimal Hyperparameters

cat("Best lambda :", out$hyperparameters$lambda, "\n")
cat("Best gamma  :", out$hyperparameters$gamma,  "\n")
cat("Loss        :", out$metrics$loss,           "\n")
cat("Constraint  :", out$metrics$constraint,     "\n")

2.4.2 Hyperparameter Optimisation Report

out$plot_hyperopt

2.4.3 Pareto Frontier

out$plot

2.4.4 Recovered vs True Proportions

true_prop <- as.matrix(sim_data$p)
pred_prop <- as.matrix(out$H)

common_ct   <- intersect(colnames(pred_prop), colnames(true_prop))
true_common <- true_prop[, common_ct, drop = FALSE]
pred_common <- pred_prop[, common_ct, drop = FALSE]

r_overall <- cor(as.vector(true_common), as.vector(pred_common))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall))

plot(
  as.vector(true_common),
  as.vector(pred_common),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#2c7bb6aa",
  main = sprintf("True vs Predicted  (r = %.3f)", r_overall)
)
abline(0, 1, col = "firebrick", lwd = 1.5)

2.4.5 Per-Cell-Type Correlation

ct_cors <- vapply(common_ct, function(ct) {
  cor(true_common[, ct], pred_common[, ct])
}, numeric(1L))

par(mar = c(5, 10, 3, 1))
barplot(
  sort(ct_cors),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (synthetic)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")

2.4.6 Quantitative Performance Metrics

perf <- makeTable1Tool(pred_mat = pred_common, obs_mat = true_common)
knitr::kable(perf$Perf, digits = 3,
             caption = "Performance metrics -- fully synthetic data")

3 Strategy 2 – BlueCode-Based Simulation

simulation_bluecode() uses the BlueCode reference matrix bundled with the package – a curated signature matrix covering 34 human cell types across five tissue compartments – as the fixed reference $W. Proportions are simulated via a two-level Dirichlet hierarchy that reflects the biological organisation of human tissues.

This strategy is recommended when benchmarking should be grounded in a biologically realistic reference, while keeping full control over the ground-truth proportions.

3.1 Compartment Structure

The hierarchy encodes five major tissue compartments with their constituent cell types:

Compartment Cell types (n) Default α
Immune B naive/memory, T CD4/CD8, NK, Monocyte, Macrophage, Neutrophil 4.0
Stromal Fibroblasts (×5), MSC, Chondrocyte, Osteoblast 2.5
Endothelial Large vessel, microvascular (×2) 1.8
Epithelial Mammary, renal, respiratory, keratinocyte, melanocyte 1.8
Muscle Smooth muscle (×7), cardiac myocyte, myometrial 1.5

3.2 Data Generation

library(dicepro)

sim_bc <- simulation_bluecode(
  nSample    = 30,
  sigma_bio  = 0.15,
  sigma_tech = 0.02,
  seed       = 2101L
)

cat("Reference  :", dim(sim_bc$W), "\n")  # nGenes x 34
cat("Proportions:", dim(sim_bc$p), "\n")  # 30 x 34
cat("Bulk       :", dim(sim_bc$B), "\n")  # nGenes x 30
cat("Row sums   :", round(range(rowSums(sim_bc$p)), 4), "\n")

# Real cell-type names from BlueCode
head(colnames(sim_bc$p))

3.3 Proportion Distribution by Compartment

The hierarchical model induces realistic between-compartment variation. Visualising the compartment-level totals confirms the expected dominance of the Immune compartment.

# Cell-type to compartment mapping (mirrors .bluecode_cell_groups)
compartment_map <- c(
  rep("Immune",      9),
  rep("Stromal",     8),
  rep("Endothelial", 3),
  rep("Epithelial",  5),
  rep("Muscle",      9)
)
names(compartment_map) <- colnames(sim_bc$p)

# Aggregate proportions by compartment for each sample
comp_props <- t(apply(sim_bc$p, 1, function(row) {
  tapply(row, compartment_map[names(row)], sum)
}))

boxplot(
  comp_props,
  col  = c("#4393c3", "#74c476", "#fd8d3c", "#9ecae1", "#fb6a4a"),
  ylab = "Compartment proportion",
  main = "Simulated compartment proportions (BlueCode)",
  las  = 2
)

3.4 Noise Model Sanity Check

bulk_clean_bc <- as.matrix(sim_bc$W) %*% t(as.matrix(sim_bc$p))

plot(
  bulk_clean_bc[seq_len(500)],
  as.matrix(sim_bc$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#74c476",
  main = "Noise model: clean vs noisy bulk (BlueCode)"
)
abline(0, 1, col = "firebrick", lwd = 1.5)

3.5 Deconvolution

out_bc <- dicepro(
  reference             = as.matrix(sim_bc$W),
  bulk                  = as.matrix(sim_bc$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "BlueCodeBulk",
  refName               = "BlueCode",
  hp_max_evals          = 100,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)

3.6 Results

3.6.1 Optimal Hyperparameters

cat("Best lambda :", out_bc$hyperparameters$lambda, "\n")
cat("Best gamma  :", out_bc$hyperparameters$gamma,  "\n")
cat("Loss        :", out_bc$metrics$loss,            "\n")
cat("Constraint  :", out_bc$metrics$constraint,      "\n")

3.6.2 Hyperparameter Optimisation Report

out_bc$plot_hyperopt

3.6.3 Pareto Frontier

out_bc$plot

3.6.4 Recovered vs True Proportions

true_prop_bc <- as.matrix(sim_bc$p)
pred_prop_bc <- as.matrix(out_bc$H)

common_ct_bc   <- intersect(colnames(pred_prop_bc), colnames(true_prop_bc))
true_common_bc <- true_prop_bc[, common_ct_bc, drop = FALSE]
pred_common_bc <- pred_prop_bc[, common_ct_bc, drop = FALSE]

r_overall_bc <- cor(as.vector(true_common_bc), as.vector(pred_common_bc))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall_bc))

plot(
  as.vector(true_common_bc),
  as.vector(pred_common_bc),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#74c47699",
  main = sprintf("True vs Predicted -- BlueCode  (r = %.3f)", r_overall_bc)
)
abline(0, 1, col = "firebrick", lwd = 1.5)

3.6.5 Per-Cell-Type Correlation

ct_cors_bc <- vapply(common_ct_bc, function(ct) {
  cor(true_common_bc[, ct], pred_common_bc[, ct])
}, numeric(1L))

par(mar = c(5, 14, 3, 1))
barplot(
  sort(ct_cors_bc),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors_bc) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (BlueCode)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")

3.6.6 Quantitative Performance Metrics

perf_bc <- makeTable1Tool(pred_mat = pred_common_bc, obs_mat = true_common_bc)
knitr::kable(perf_bc$Perf, digits = 3,
             caption = "Performance metrics -- BlueCode simulation")

4 Comparing Both Strategies

When both strategies are run under the same conditions, their overall correlation scores can be compared directly.

cat(sprintf(
  "Overall Pearson r\n  Synthetic  : %.3f\n  BlueCode   : %.3f\n",
  r_overall, r_overall_bc
))

These two strategies are complementary: simulation() offers maximum flexibility for testing deconvolution under stress conditions, with varying numbers of genes, cell types, or noise levels; simulation_bluecode() anchors the benchmark to a biologically grounded reference, making the results easier to interpret in the context of real-world data from human tissues.


5 Session Info

sessionInfo()