Real cancer drivers walkthrough

This vignette uses the bundled dataset_real_cancer_drivers_4 dataset to illustrate a real biological analysis: how do four canonical cancer driver catalogs overlap?

The four sources are:

Vogelstein: the 138-gene catalog from Vogelstein et al. (Science 2013), often cited as the “core” driver-gene set (it covers both oncogenes and tumor suppressors).
COSMIC_CGC: the COSMIC Cancer Gene Census (Sondka et al. 2018), a curated list of genes causally implicated in cancer.
OncoKB: genes from the MSK precision-oncology knowledge base annotated at level “Oncogenic” or higher (Chakravarty et al. 2017).
IntOGen: pan-cancer mutational driver genes from the IntOGen pipeline (Martínez-Jiménez et al. 2020).

library(vennDiagramLab)
ds <- load_sample("dataset_real_cancer_drivers_4")
ds@set_names

Set sizes

sapply(ds@items, length)

The lists differ considerably in size: Vogelstein is the smallest curated set, while OncoKB is the most permissive at this annotation tier.

Universe

The dataset was built from a 20,000-gene background (universe_size):

ds@universe_size

This is the population N used in the hypergeometric over-representation tests (see vignette("v05_statistics_deep_dive")).
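
To make the role of N concrete, here is a minimal base-R sketch of a pairwise over-representation test. The set sizes and the overlap are made-up illustrative numbers, not values from the bundled dataset, and the package's own test may differ in details such as the tail convention.

```r
# Hypothetical numbers for illustration only (not taken from the dataset).
N       <- 20000  # universe size (background genes)
size_a  <- 140    # number of genes in set A
size_b  <- 700    # number of genes in set B
overlap <- 100    # observed number of genes shared by A and B

# Expected overlap if set B were a random draw from the universe.
expected <- size_a * size_b / N

# Upper-tail probability P(X >= overlap) under the hypergeometric null.
# phyper() returns P(X <= q), so pass q = overlap - 1 with lower.tail = FALSE.
p_value <- phyper(overlap - 1,
                  m = size_a,       # "successes" in the universe
                  n = N - size_a,   # "failures" in the universe
                  k = size_b,       # number of draws
                  lower.tail = FALSE)

expected
p_value
```

A larger N lowers the expected chance overlap, so the same observed overlap looks more surprising; this is why the choice of background matters.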

Analyze

result <- analyze(ds)
result@model
length(result@regions)

The default model for 4 sets is venn-4-set (Edwards-style).
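
As a quick sanity check on the region count: four sets admit 2^4 - 1 = 15 distinct non-empty membership patterns, which is the number of regions a 4-set Venn layout has to draw. The base-R sketch below enumerates them. The single-letter labels (A to D) and the concatenated region names such as "AB" are illustrative assumptions, not necessarily the labels the package stores in result@regions.

```r
# Enumerate every in/out membership pattern for four sets labelled A-D.
set_letters <- c("A", "B", "C", "D")
patterns <- expand.grid(rep(list(c(FALSE, TRUE)), length(set_letters)))
names(patterns) <- set_letters

# Drop the all-FALSE row: items in none of the sets lie outside the diagram.
patterns <- patterns[rowSums(patterns) > 0, ]

# Label each region by the sets it belongs to, e.g. "AB" or "ACD".
region_labels <- apply(patterns, 1, function(row) {
  paste(set_letters[row], collapse = "")
})

length(region_labels)   # 15
```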

Set sizes (inclusive) and intersection layout

result@set_sizes

A summary at a glance

broom::glance() returns a one-row tibble with the headline numbers:

broom::glance(result)

Render the venn diagram

The default render uses the dataset’s set names as labels. To shorten them for the diagram, pass a per-letter override:

svg <- render_venn_svg(
    result,
    set_names = c(A = "Vogelstein", B = "COSMIC", C = "OncoKB", D = "IntOGen"),
    title = "Cancer driver overlap (4 sources)"
)
nchar(svg)

(See vignette("v08_custom_styling_and_export") for color overrides and post-render SVG manipulation.)

UpSet view

For four or more sets, an UpSet plot is often easier to read than a Venn diagram: each intersection becomes one bar whose height is the intersection size, and the bars are sorted by that size.

upset_plot <- render_upset(result, sort_by = "size")
upset_plot

(The chunk above is gated on R >= 4.6 because the CRAN release of ComplexUpset (1.3.3) is incompatible with ggplot2 >= 4.0 on older R — see ?vennDiagramLab::render_upset for context.)

Top significant intersections

broom::tidy() returns one row per set pair, with all five pairwise metrics plus the BH-FDR-adjusted hypergeometric p-value:

top_pairs <- broom::tidy(result)
top_pairs[order(top_pairs$p_adjusted), c("set_a", "set_b", "intersection",
                                          "jaccard", "p_adjusted",
                                          "significant")]

Every pair is significant at FDR < 0.05, as expected: all four catalogs aim to capture the same underlying biology, so substantial overlap is built in.
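
For readers who want to see how such a pairwise table could be assembled, here is a minimal base-R sketch on toy sets. The gene identifiers, the universe size, and the construction itself are assumptions for illustration. Only the column names mirror the tidy() output shown above, and the package may compute its metrics differently.

```r
# Toy sets with placeholder gene identifiers (hypothetical data).
sets <- list(
  A = paste0("g", 1:30),
  B = paste0("g", 11:60),
  C = paste0("g", 25:45)
)
N <- 1000  # hypothetical universe size

# All unordered pairs of set names, one pair per row.
pairs <- t(combn(names(sets), 2))

rows <- lapply(seq_len(nrow(pairs)), function(i) {
  a <- sets[[pairs[i, 1]]]
  b <- sets[[pairs[i, 2]]]
  inter <- length(intersect(a, b))
  data.frame(
    set_a        = pairs[i, 1],
    set_b        = pairs[i, 2],
    intersection = inter,
    jaccard      = inter / length(union(a, b)),
    # Upper-tail hypergeometric p-value, P(overlap >= observed).
    p_value      = phyper(inter - 1, m = length(a), n = N - length(a),
                          k = length(b), lower.tail = FALSE)
  )
})
pair_table <- do.call(rbind, rows)

# Benjamini-Hochberg adjustment across all pairs, then flag FDR < 0.05.
pair_table$p_adjusted  <- p.adjust(pair_table$p_value, method = "BH")
pair_table$significant <- pair_table$p_adjusted < 0.05

pair_table[order(pair_table$p_adjusted), ]
```

With only a handful of pairs the BH correction changes little, but it keeps the reported values comparable if more sets are added later.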

Item-level annotation

broom::augment() returns one row per gene with set-membership flags and the region label.

gene_table <- broom::augment(result)
head(gene_table)
nrow(gene_table)        # total unique genes across all four sets
table(gene_table$region_label)   # how many genes in each region
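
The shape of such an item-level table can be reproduced on toy data with base R alone. The sketch below is not the package's implementation: the gene identifiers are placeholders, and the column names item and region_label, along with the letter-concatenation labelling, are assumptions made for illustration.

```r
# Toy sets with placeholder gene identifiers (hypothetical data).
sets <- list(
  A = c("gene1", "gene2", "gene3"),
  B = c("gene1", "gene2", "gene4"),
  C = c("gene1", "gene3", "gene4", "gene5"),
  D = c("gene1", "gene6")
)
genes <- sort(unique(unlist(sets)))

# One logical column per set: is this gene a member?
flags <- as.data.frame(lapply(sets, function(s) genes %in% s))

# Region label = concatenated letters of the sets that contain the gene.
region_label <- apply(as.matrix(flags), 1, function(row) {
  paste(names(sets)[row], collapse = "")
})

toy_table <- data.frame(item = genes, flags, region_label = region_label)

toy_table
nrow(toy_table)                    # unique genes across all sets
table(toy_table$region_label)      # genes per exclusive region
colSums(toy_table[names(sets)])    # inclusive set sizes
```

Summing the flag columns recovers the inclusive set sizes, whereas tabulating the region label gives the exclusive region counts; the two views answer different questions about the same table.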

Save the region summary

to_region_summary_tsv(result, "cancer_drivers_regions.tsv")

What’s next

vignette("v05_statistics_deep_dive") — interpret the Jaccard / Dice / hypergeometric numbers in detail.
vignette("v07_pdf_reports") — turn this analysis into a multi-page PDF.
vignette("v08_custom_styling_and_export") — customize colors, embed in a ggplot, export to PDF/PNG.
