The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Statistics deep dive (Jaccard, Dice, hypergeometric, BH-FDR)

Statistics deep dive

vennDiagramLab reports five pairwise metrics for every set pair plus a multiple-testing correction. This vignette explains what each metric means, when to prefer it, and how to reproduce the values that appear in the web tool’s significance coloring.

library(vennDiagramLab)
result <- analyze(load_sample("dataset_real_cancer_drivers_4"))
stats <- statistics(result)

The five metrics

For two sets A and B of sizes |A|, |B| with intersection |A ∩ B| drawn from a universe of N items:

Jaccard = |A ∩ B| / |A ∪ B|. Range [0, 1]. Symmetric.
Dice = 2 |A ∩ B| / (|A| + |B|). Range [0, 1]. Symmetric. Always >= Jaccard; the two relate by Dice = 2J / (1 + J).
Overlap coefficient = |A ∩ B| / min(|A|, |B|). Range [0, 1]. Equal to 1 when one set is contained in the other.
Hypergeometric p-value — probability of observing |A ∩ B| or more shared items by chance, given |A|, |B|, and N. Tests over- representation.
Fold enrichment = (|A ∩ B| * N) / (|A| * |B|). The ratio of observed to expected intersection size under independence. > 1 is over- representation.

Compute one metric directly

The helpers are exported and stateless:

jaccard(size_a = 138, size_b = 581, intersection = 100)
dice(size_a = 138, size_b = 581, intersection = 100)
overlap_coefficient(size_a = 138, size_b = 581, intersection = 100)
hypergeometric_p_value(N = 20000, K = 138, n = 581, k = 100)
fold_enrichment(N = 20000, K = 138, n = 581, k = 100)

The hypergeometric p-value is essentially zero: 100 shared genes out of an expected (138 * 581) / 20000 ≈ 4 is a 25× enrichment.

All pairs at once

statistics(result) returns five tables (four square NxN matrices for the ratio metrics + a long-form data.frame for the hypergeometric test):

stats@jaccard

head(stats@hypergeometric)

BH-FDR adjustment

stats@hypergeometric already carries the BH-FDR-adjusted q-value (p_adjusted) and a boolean significant (q < 0.05) and highly_significant (q < 0.001). The adjustment uses stats::p.adjust(method = "BH"):

raw_p <- stats@hypergeometric$p_value
adjusted <- bh_fdr(raw_p)
all.equal(adjusted, stats@hypergeometric$p_adjusted)

For unrelated p-values, BH-FDR is more permissive than Bonferroni and more conservative than no correction:

toy_p <- c(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 0.9)
data.frame(
    raw            = toy_p,
    bonferroni     = pmin(toy_p * length(toy_p), 1),
    bh_fdr         = bh_fdr(toy_p)
)

broom-compatible long-form

broom::tidy() produces a tibble that’s pipeline-friendly (one row per pair, all metrics in one frame, sorted by adjusted p-value):

broom::tidy(result)

Reproducing the web tool’s coloring

The web tool colors significant pairs via the same p_adjusted thresholds:

sig_table <- broom::tidy(result)
sig_table$colour <- ifelse(sig_table$highly_significant, "red",
                    ifelse(sig_table$significant,        "orange",
                                                          "grey"))
sig_table[, c("set_a", "set_b", "p_adjusted", "colour")]

What’s next

vignette("v02_real_cancer_drivers") — see these stats in the context of a real biological analysis.
vignette("v06_pipeline_integration") — feed broom::tidy() into a downstream tidyverse pipeline.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.