library(dtGAP)

1 Introduction

Decision trees are prized for their simplicity and interpretability but often fail to reveal underlying data structures. Generalized Association Plots (GAP) excel at illustrating complex associations yet are typically unsupervised. We introduce dtGAP, a novel framework that embeds supervised correlation and distance measures into GAP for enriched decision-tree visualization. dtGAP offers confusion matrix maps, decision-tree matrix maps, predicted class membership maps, and evaluation panels. The dtGAP package is available on GitHub (https://github.com/hanmingwu1103/dtGAP) and on CRAN (https://CRAN.R-project.org/package=dtGAP).

2 Quick Start

Let’s begin with the penguins dataset! Running the dtGAP() function can be as simple as:

penguins <- na.omit(penguins)
dtGAP(
  data_all = penguins, model = "party", show = "all",
  trans_type = "percentize", target_lab = "species",
  simple_metrics = TRUE,
  label_map_colors = c(
    "Adelie" = "#50046d", "Gentoo" = "#fcc47f",
    "Chinstrap" = "#e15b76"
  ),
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 220,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)

3 Selecting Data Subsets and Tree-Based Models

By default, dtGAP visualizes the entire dataset, but you can focus on just the training or testing split with the show argument, which takes one of 'all', 'train', or 'test'. Similarly, you can choose between two tree models via the model argument, which can be either 'rpart' or 'party'.

When you choose model = "rpart" (classic CART), each node shows its class-membership probabilities and displays the percentage of samples in each branch.

dtGAP(
  data_all = Psychosis_Disorder, model = "rpart", show = "all",
  trans_type = "none", target_lab = "UNIQID", print_eval = FALSE
)

In contrast, with model = "party" (conditional inference trees), dtGAP annotates each internal node with its split-variable p-value and displays the percentage of samples in each branch. You can also customize the label mapping and colors via label_map and label_map_colors.

dtGAP(
  data_all = Psychosis_Disorder, model = "party", show = "all",
  trans_type = "none", target_lab = "UNIQID", print_eval = FALSE,
  label_map = c("0" = "bipolar", "1" = "schizophrenia"),
  label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f")
)

4 Computing Row and Column Proximity and Seriation

First, choose a suitable data transformation via the trans_type argument, which can be one of 'none', 'percentize', 'normalize', or 'scale'.
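
As a rough illustration of what these options do to a single numeric column, the base-R sketch below applies each one by hand. The helper names are hypothetical and the formulas are assumptions (percentize as empirical-CDF ranks, normalize as min-max rescaling to [0, 1], scale as z-scores); dtGAP's internals may differ.

```r
# Hypothetical helpers illustrating the trans_type options; not dtGAP code.
percentize_col <- function(x) stats::ecdf(x)(x)                  # empirical percentile
normalize_col  <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max to [0, 1]
scale_col      <- function(x) (x - mean(x)) / stats::sd(x)       # z-score

x <- c(2, 4, 4, 10)
percentize_col(x)
#> [1] 0.25 0.75 0.75 1.00
normalize_col(x)
#> [1] 0.00 0.25 0.25 1.00
scale_col(x)   # mean 0, standard deviation 1
```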

Before sorting, we build two proximity measures:

  • Column proximity: a combined conditional correlation matrix, weighted by group memberships.
  • Row proximity: samples are first sorted by tree leaf. For each leaf, a supervised distance is measured that combines within-leaf dispersion and between-leaf separation, using linkage "CT" (centroid), "SG" (single), or "CP" (complete).
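
One way to picture the column proximity is as a class-weighted average of within-class correlation matrices. The sketch below is an assumption about the general idea, not dtGAP's actual implementation; it uses the built-in iris data and a hypothetical helper, conditional_cor().

```r
# Hypothetical helper: weighted average of within-group correlation matrices.
conditional_cor <- function(x, groups) {
  groups <- factor(groups)
  weights <- table(groups) / length(groups)   # group membership fractions
  out <- matrix(0, ncol(x), ncol(x),
                dimnames = list(colnames(x), colnames(x)))
  for (g in levels(groups)) {
    # correlation among the columns, computed within group g only
    out <- out + weights[[g]] * stats::cor(x[groups == g, , drop = FALSE])
  }
  out
}

col_prox <- conditional_cor(iris[, 1:4], iris$Species)
round(col_prox, 2)
```

Because the weights sum to 1, the diagonal of the result stays at 1, as in an ordinary correlation matrix.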

Use any method from the seriation package to reorder rows and columns.

seriation::list_seriation_methods("dist")
#>  [1] "ARSA"           "BBURCG"         "BBWRCG"         "Enumerate"     
#>  [5] "GSA"            "GW"             "GW_average"     "GW_complete"   
#>  [9] "GW_single"      "GW_ward"        "HC"             "HC_average"    
#> [13] "HC_complete"    "HC_single"      "HC_ward"        "Identity"      
#> [17] "isomap"         "isoMDS"         "MDS"            "MDS_angle"     
#> [21] "metaMDS"        "monoMDS"        "OLO"            "OLO_average"   
#> [25] "OLO_complete"   "OLO_single"     "OLO_ward"       "QAP_2SUM"      
#> [29] "QAP_BAR"        "QAP_Inertia"    "QAP_LS"         "R2E"           
#> [33] "Random"         "Reverse"        "Sammon_mapping" "SGD"           
#> [37] "Spectral"       "Spectral_norm"  "SPIN_NH"        "SPIN_STS"      
#> [41] "TSP"            "VAT"
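
To see what a seriation method returns, the sketch below orders a small distance matrix with seriation::seriate() and extracts the permutation with get_order(). The data and the choice of "OLO" are illustrative only; how dtGAP uses seriate_method internally is not shown here.

```r
library(seriation)

d   <- dist(scale(iris[1:20, 1:4]))   # distances between 20 samples
ord <- seriate(d, method = "OLO")     # optimal leaf ordering
idx <- get_order(ord)                 # permutation of 1:20

# Distance matrix rearranged into the seriated order
d_sorted <- as.matrix(d)[idx, idx]
dim(d_sorted)
```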

Also, when show = "all", use sort_by_data_type = TRUE to preserve the original train/test grouping; set it to FALSE if you’d rather intermix samples from both sets when ordering.

How do we measure the quality of the sorting?

dtGAP computes the cRGAR, an average of node-specific anti-Robinson scores weighted by each node’s sample fraction, to quantify the quality of the ordering.

  • A value near 0 indicates good sorting (the ordered layout closely follows a Robinson structure).
  • A value near 1 indicates bad sorting (many violations).

dtGAP(
  data_all = Psychosis_Disorder, model = "party", show = "all",
  trans_type = "none", target_lab = "UNIQID",
  label_map = c("0" = "bipolar", "1" = "schizophrenia"),
  label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
  seriate_method = "GW_average", sort_by_data_type = FALSE
)
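
The weighting behind cRGAR can be sketched in base R. The function below counts anti-Robinson violations over all triples in a distance matrix and then averages the per-node proportions by node size. It is a simplified illustration with hypothetical names (rgar_sketch, node_dists, node_sizes), not the package's implementation, which may differ in how violations are windowed or normalized.

```r
# Proportion of anti-Robinson violations in an ordered distance matrix.
# In a Robinson matrix, distances never decrease when moving away
# from the diagonal within a row.
rgar_sketch <- function(d) {
  d <- as.matrix(d)
  n <- nrow(d)
  violations <- 0
  comparisons <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      for (k in seq_len(n)) {
        if (j < k && k < i) {          # left of the diagonal: j is farther from i
          comparisons <- comparisons + 1
          violations <- violations + (d[i, j] < d[i, k])
        } else if (i < j && j < k) {   # right of the diagonal: k is farther from i
          comparisons <- comparisons + 1
          violations <- violations + (d[i, j] > d[i, k])
        }
      }
    }
  }
  if (comparisons == 0) 0 else violations / comparisons
}

# Two hypothetical leaf nodes: one perfectly ordered, one poorly ordered.
node_dists <- list(dist(c(1, 2, 3, 4)), dist(c(1, 4, 2, 3)))
node_sizes <- c(4, 4)

node_scores <- vapply(node_dists, rgar_sketch, numeric(1))
crgar <- sum(node_sizes / sum(node_sizes) * node_scores)
crgar
#> [1] 0.25
```

The first node contributes a score of 0 and the second a score of 0.5; with equal sample fractions the weighted average is 0.25.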

5 Data Information and Metrics

When you set print_eval = TRUE, dtGAP will append an evaluation panel containing two sections:

  • Data Information

    • Dataset name, model and train/test sample sizes.

    • Column proximity method, linkage, seriation algorithm and cRGAR score.

  • Train/Test Metrics

    • Full confusion-matrix report (default)
      Uses caret::confusionMatrix() to show accuracy, kappa, sensitivity, specificity, etc.

    • Simple metrics
      If you set simple_metrics = TRUE, you’ll instead get six key measures from the yardstick package:

      • Accuracy

      • Balanced accuracy

      • Kappa

      • Precision

      • Recall

      • Specificity
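
For a binary outcome, the six simple measures can all be derived from the four cells of the confusion matrix. The base-R sketch below uses made-up labels to show the arithmetic; it illustrates the standard definitions and is not the yardstick code that dtGAP calls.

```r
# Made-up truth and predictions; "pos" is treated as the event of interest.
truth <- factor(c("pos", "pos", "pos", "pos", "neg",
                  "neg", "neg", "neg", "neg", "neg"),
                levels = c("pos", "neg"))
pred  <- factor(c("pos", "pos", "pos", "neg", "neg",
                  "neg", "neg", "neg", "pos", "pos"),
                levels = c("pos", "neg"))

tab <- table(pred, truth)
tp <- tab["pos", "pos"]; fp <- tab["pos", "neg"]
fn <- tab["neg", "pos"]; tn <- tab["neg", "neg"]
n  <- sum(tab)

recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
accuracy    <- (tp + tn) / n
bal_acc     <- (recall + specificity) / 2
# Cohen's kappa: observed agreement corrected for chance agreement
p_chance    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kap         <- (accuracy - p_chance) / (1 - p_chance)

metrics <- c(accuracy = accuracy, bal_accuracy = bal_acc, kappa = kap,
             precision = precision, recall = recall,
             specificity = specificity)
round(metrics, 3)
```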

dtGAP(
  data_all = Psychosis_Disorder, model = "party", show = "all",
  label_map = c("0" = "bipolar", "1" = "schizophrenia"),
  label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
  trans_type = "none", target_lab = "UNIQID", simple_metrics = TRUE
)

6 Train/Test Workflow

If the default conditional tree is not what you want, you can fit your own tree (e.g. with rpart), wrap it in as.party(), and pass the result to dtGAP(). As an example, we examine data on COVID-19 cases in Wuhan from 2020-01-10 to 2020-02-18, taken from a recent study.

dtGAP(
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "train",
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)
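
train_covid and test_covid are used here as two ready-made data frames. If you start from a single data frame instead, a simple random split such as the one sketched below produces the two pieces that data_train and data_test expect. The 70/30 ratio, the seed, and the use of iris as a stand-in are arbitrary choices for illustration.

```r
set.seed(42)                                     # reproducible split
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))    # 70% of rows for training
train_df  <- iris[train_idx, ]
test_df   <- iris[-train_idx, ]

c(train = nrow(train_df), test = nrow(test_df))
#> train  test 
#>   105    45 
```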

6.1 Applying the learned tree to an external/holdout/test/validation dataset

You can print measures evaluating the conditional decision tree’s performance by setting print_eval = TRUE. By default, we show 5 measures for classification tasks:

  • Accuracy
  • Balanced accuracy (BAL_ACCURACY)
  • Kappa coefficient (KAP)
  • Area under the receiver operating characteristics curve (ROC_AUC)
  • Area under the precision recall curve (PR_AUC)

and 4 measures for regression tasks:

  • R-squared (RSQ)
  • Mean absolute error (MAE)
  • Root mean squared error (RMSE)
  • Concordance correlation coefficient (CCC).

dtGAP(
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)

7 Regression

Compared with classification, interpreting a regression tree can be challenging. A heatmap, however, can make the structure more transparent by showing how observations cluster within each terminal node. Here’s an example:

dtGAP(
  data_all = galaxy, task = "regression",
  target_lab = "target", show = "all",
  trans_type = "percentize", model = "party",
  simple_metrics = TRUE, y_eval_start = 220,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)
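
The four regression measures listed in Section 6.1 are simple functions of the observed and predicted values. The sketch below computes them by hand on made-up numbers. The definitions used are assumptions (R-squared as the squared Pearson correlation, CCC in Lin's form with sample variances); the package's evaluation code may follow slightly different conventions.

```r
obs  <- c(3.0, 5.0, 2.5, 7.0, 4.5)   # made-up observed values
pred <- c(2.5, 5.0, 3.0, 8.0, 4.0)   # made-up predictions

rsq  <- cor(obs, pred)^2                      # squared correlation
mae  <- mean(abs(obs - pred))                 # mean absolute error
rmse <- sqrt(mean((obs - pred)^2))            # root mean squared error
ccc  <- 2 * cov(obs, pred) /                  # concordance correlation
  (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)

round(c(RSQ = rsq, MAE = mae, RMSE = rmse, CCC = ccc), 3)
```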

8 Customization

  • Variable importance and split-variable labels panel

    • col_var_imp: sets the bar fill color (e.g. "orange", "#2c7bb6").

    • var_imp_bar_width: adjusts the bar thickness (default 0.8).

    • var_imp_fontsize / split_var_fontsize: control the font size (default 5).

    • split_var_bg: sets the background color behind each split-variable name (default "darkgreen").

  • Color

    Define the RColorBrewer palette and number of shades.

    • Col_Prox_palette (e.g. "RdBu", "Viridis") and Col_Prox_n_colors

    • Row_Prox_palette and Row_Prox_n_colors

    • sorted_dat_palette and sorted_dat_n_colors

Use display.brewer.all() to display all available RColorBrewer palettes.
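
To check what a given palette name and number of shades will produce before passing them to dtGAP(), you can build the colors directly. This sketch uses RColorBrewer::brewer.pal() and the base colorRampPalette() call that appears in the earlier examples; it only inspects colors and does not call dtGAP().

```r
library(RColorBrewer)

# Nine shades from the sequential "Oranges" palette
shades <- brewer.pal(n = 9, name = "Oranges")
length(shades)
#> [1] 9

# Maximum number of shades each palette supports
max_shades <- brewer.pal.info[c("Oranges", "RdYlGn", "Spectral"), "maxcolors"]
max_shades
#> [1]  9 11 11

# A custom 9-step ramp, as used for raw_value_col
ramp <- colorRampPalette(c("#33286b", "#26828e", "#75d054", "#fae51f"))(9)
length(ramp)
#> [1] 9
```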

You can customize the color schemes and font sizes in the visualization to match your preferences.

dtGAP(
  data_all = Psychosis_Disorder, show = "all", trans_type = "none",
  target_lab = "UNIQID", simple_metrics = TRUE, col_var_imp = "blue",
  split_var_bg = "darkblue", Col_Prox_palette = "RdYlGn",
  type_palette = "Set2",
  Row_Prox_palette = "Spectral",
  var_imp_fontsize = 7, split_var_fontsize = 7,
  sorted_dat_palette = "Oranges", sorted_dat_n_colors = 9,
  label_map = c("0" = "bipolar", "1" = "schizophrenia"),
  label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f")
)

You can also choose whether to display the row and column proximity panels, via show_row_prox and show_col_prox.

dtGAP(
  data_all = Psychosis_Disorder, model = "party", show = "all",
  trans_type = "none", target_lab = "UNIQID",
  seriate_method = "GW_average",
  label_map = c("0" = "bipolar", "1" = "schizophrenia"),
  label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
  show_row_prox = FALSE, show_col_prox = FALSE
)

9 Smart Node Layout

Very large trees are harder to read at a glance, but they show how the layout algorithm adapts as tree complexity grows. The horizontal positioning of tree components is governed by the tree_p parameter in dtGAP(), which sets the proportion of the overall canvas dedicated to the tree structure. Adjusting tree_p helps mitigate issues such as overlapping branches by providing adequate spacing between nodes.

dtGAP(
  data_all = wine_quality_red, target_lab = "target",
  show = "all", model = "party", simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 40,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9),
  show_row_names = FALSE
)

dtGAP(
  data_all = wine_quality_red, target_lab = "target",
  show = "all", model = "party", simple_metrics = TRUE,
  tree_p = 0.4,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 40,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9),
  show_row_names = FALSE
)

10 Variable Selection

Sometimes you may want to focus the heatmap on a subset of features while keeping the tree trained on all variables. The select_vars parameter lets you specify which variables to display—the tree still uses every feature for splitting, but only the selected ones appear in the heatmap panels.

dtGAP(
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  select_vars = c("LDH", "Lymphocyte"),
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)

Note that select_vars must be a character vector of column names that exist in the data (excluding the target). Variable importance values are rescaled to sum to 1 for the selected subset.
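
The rescaling step can be pictured as follows. The importance values are made up for illustration, and the check on select_vars is a hypothetical stand-in for whatever validation dtGAP performs.

```r
# Made-up importance values for three predictors
var_imp <- c(LDH = 0.5, hs_CRP = 0.3, Lymphocyte = 0.2)
select_vars <- c("LDH", "Lymphocyte")

# select_vars must be a character vector naming existing predictors
stopifnot(is.character(select_vars), all(select_vars %in% names(var_imp)))

# Rescale the selected importances so they sum to 1
rescaled <- var_imp[select_vars] / sum(var_imp[select_vars])
round(rescaled, 3)
#>        LDH Lymphocyte 
#>      0.714      0.286 
```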

11 Custom Tree Input

If you have already trained a decision tree outside of dtGAP, you can pass it directly using the fit parameter. This is useful when you want to use a specific tree configuration or compare a custom model with the built-in options.

dtGAP() accepts rpart, party, and train (caret) objects. The model type is automatically detected, and the tree is converted internally. You can optionally supply your own variable importance vector via user_var_imp.

library(rpart)

# Train a custom rpart tree with specific parameters
custom_tree <- rpart(
  Outcome ~ ., data = train_covid,
  control = rpart.control(maxdepth = 3, cp = 0.01)
)

dtGAP(
  fit = custom_tree,
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)
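
If you want to inspect the converted tree yourself, partykit::as.party() turns an rpart fit into a party object. Whether dtGAP() performs exactly this conversion internally is an assumption; the sketch uses iris as a stand-in dataset.

```r
library(rpart)
library(partykit)

rpart_fit <- rpart(
  Species ~ ., data = iris,
  control = rpart.control(maxdepth = 3, cp = 0.01)
)
party_fit <- as.party(rpart_fit)   # convert to the partykit representation

class(party_fit)                                  # includes "party"
leaf_ids <- nodeids(party_fit, terminal = TRUE)   # ids of the terminal nodes
leaf_ids
```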

12 Interactive Visualization

Set interactive = TRUE to launch a Shiny-based interactive heatmap viewer powered by InteractiveComplexHeatmap. This lets you hover, click, and zoom into the heatmap panels directly in your browser.

dtGAP(
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  interactive = TRUE, print_eval = FALSE
)

Note: InteractiveComplexHeatmap must be installed separately from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("InteractiveComplexHeatmap")

In interactive mode, only the heatmap panels are displayed (the tree panel is omitted, as InteractiveComplexHeatmap handles ComplexHeatmap objects).

13 Multi-Model Comparison

The compare_dtGAP() function lets you compare two or more tree models side-by-side on a single wide canvas. Each model gets its own tree + heatmap panel with a label header.

compare_dtGAP(
  models = c("rpart", "party"),
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)

Supported models include "rpart", "party", "C50", and "caret". The default page width is 594 mm (two A4 pages side-by-side); you can adjust it with total_w.

14 Random Forest Extension

dtGAP extends beyond single decision trees with three functions for conditional random forests via partykit::cforest:

  • train_rf() — train a conditional random forest
  • rf_summary() — ensemble-level summary (variable importance + representative tree)
  • rf_dtGAP() — visualize any individual tree from the forest using the full dtGAP pipeline

14.1 Training a Random Forest

train_rf() fits a cforest and returns the forest object, normalized variable importance, and the number of trees.

rf <- train_rf(
  data_train = train_covid,
  target_lab = "Outcome",
  ntree = 50
)
names(rf)
#> [1] "forest"  "var_imp" "ntree"
rf$var_imp
#>        LDH     hs_CRP Lymphocyte 
#>        0.5        0.3        0.2

14.2 Ensemble Summary

rf_summary() provides an overview of the fitted random forest. It displays a variable importance barplot and identifies the representative tree—the individual tree whose predictions agree most closely with the full ensemble.

result <- rf_summary(
  data_train = train_covid,
  data_test = test_covid,
  target_lab = "Outcome",
  ntree = 50,
  top_n_vars = 3
)

result$rep_tree_index
#> [1] 11

The returned rep_tree_index tells you which tree best represents the ensemble, which you can then visualize with rf_dtGAP().
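
The idea of a representative tree can be sketched with a small matrix of per-tree predictions. The predictions are made up, the ensemble is taken to be a majority vote, and agreement is the share of samples on which a tree matches the ensemble; rf_summary() may define agreement differently.

```r
# Made-up class predictions: one row per sample, one column per tree
tree_preds <- cbind(
  tree1 = c("A", "A", "B", "B", "A"),
  tree2 = c("A", "B", "B", "B", "A"),
  tree3 = c("B", "A", "B", "A", "A")
)

# Ensemble prediction: majority vote across trees for each sample
ensemble <- apply(tree_preds, 1, function(v) names(which.max(table(v))))

# Share of samples on which each tree agrees with the ensemble
agreement <- colMeans(tree_preds == ensemble)
rep_tree_index <- unname(which.max(agreement))
rep_tree_index
#> [1] 1
```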

14.3 Visualizing Individual Trees

rf_dtGAP() extracts a single tree from the forest and renders it through the full dtGAP pipeline (decision tree + heatmap + evaluation). The title automatically shows “Tree k/N”.

rf_dtGAP(
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  tree_index = 1, ntree = 50,
  label_map = c("0" = "Survival", "1" = "Death"),
  label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
  simple_metrics = TRUE,
  show_col_prox = FALSE, show_row_prox = FALSE,
  y_eval_start = 200,
  raw_value_col = colorRampPalette(
    c("#33286b", "#26828e", "#75d054", "#fae51f")
  )(9)
)

15 Exporting Plots

save_dtGAP() exports the dtGAP visualization to PNG, PDF, or SVG files. The format is automatically inferred from the file extension, or you can set it explicitly. Dimensions are specified in millimeters (default A4 landscape: 297 x 210 mm).

# Save as PNG (300 dpi)
save_dtGAP(
  file = "my_plot.png",
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  print_eval = FALSE
)

# Save as PDF
save_dtGAP(
  file = "my_plot.pdf",
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  print_eval = FALSE
)

# Custom dimensions (wide format)
save_dtGAP(
  file = "wide_plot.svg",
  width = 500, height = 250,
  data_train = train_covid, data_test = test_covid,
  target_lab = "Outcome", show = "test",
  print_eval = FALSE
)

All dtGAP() arguments can be passed through ..., so you can customize colors, metrics, and layout just as you would with dtGAP() directly.
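
Two details of this section, inferring the format from the file extension and sizing in millimeters, can be sketched in base R. The helper names are hypothetical and this is not the save_dtGAP() source; the conversion to inches is shown only because base graphics devices such as pdf() take their dimensions in inches.

```r
# Hypothetical helper: pick the output format from the file extension
infer_format <- function(file) {
  ext <- tolower(tools::file_ext(file))
  if (!ext %in% c("png", "pdf", "svg")) {
    stop("Unsupported file extension: '", ext, "'")
  }
  ext
}

# Hypothetical helper: millimeters to inches (1 in = 25.4 mm)
mm_to_in <- function(mm) mm / 25.4

infer_format("my_plot.png")
#> [1] "png"
round(mm_to_in(c(width = 297, height = 210)), 2)   # A4 landscape in inches
#>  width height 
#>  11.69   8.27 
```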

16 Citation

Han-Ming Wu, Chia-Yu Chang, and Chun-houh Chen (2025), dtGAP: Supervised matrix visualization for decision trees based on the GAP framework. R package version 0.0.2, (https://github.com/hanmingwu1103/dtGAP).

References:

  • Chen, C. H. (2002). Generalized association plots: Information visualization via iteratively generated correlation matrices. Statistica Sinica, 12, 7-29.
  • Le, T. T., & Moore, J. H. (2021). Treeheatr: An R package for interpretable decision tree visualizations. Bioinformatics, 37(2), 282-284.
  • Wu, H. M., Tien, Y. J., & Chen, C. H. (2010). GAP: A graphical environment for matrix visualization and cluster analysis. Computational Statistics & Data Analysis, 54(3), 767-778.