The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The multiDEGGs package performs multi-omic differential network analysis by showing differential interactions between molecular entities (genes, proteins, miRNAs, or other biomolecules) across the omic datasets provided.
For each omic dataset, a differential network is constructed, where links represent statistically significant differential interactions between entities. These networks are then integrated into a comprehensive visualization using distinct colors to distinguish interactions from different omic layers.
This unified visualization allows interactive exploration of cross-omic patterns (e.g., differential interactions present at both transcript and protein level). For each link, users can access differential statistical significance metrics (p-values or adjusted p-values, calculated via robust or traditional linear regression with interaction term), and differential regression plots.
multiDEGGs can be installed from GitHub using devtools:
devtools::install_github("elisabettasciacca/multiDEGGs")
Let’s start by loading the package and sample data:
library(multiDEGGs)
data("synthetic_metadata")
data("synthetic_rnaseqData")
data("synthetic_proteomicData")
data("synthetic_OlinkData")
assayData_list <- list("RNAseq" = synthetic_rnaseqData,
"Proteomics" = synthetic_proteomicData,
"Olink" = synthetic_OlinkData)
deggs_object <- get_diffNetworks(assayData = assayData_list,
metadata = synthetic_metadata,
category_variable = "response",
regression_method = "lm",
padj_method = "bonferroni",
verbose = FALSE,
show_progressBar = FALSE,
cores = 2)
get_diffNetworks
It’s worth explaining some of the important parameters of
get_diffNetworks
:
assayData
: accepts either a single normalized
matrix/data frame (for single omic differential analysis), or a list of
matrices/data frames (for multi-omic scenarios). For multi-omic
analysis, it’s highly recommended to use a named list of data. If
unnamed, sequential names (assayData1, assayData2, etc.) will be
assigned to identify each matrix or data frame.
metadata
: can also be a named factor vector, with
names matching the patient IDs in column names of the assay data
matrices/data frames. In that case, the category_variable can remain
unset (NULL by default).
category_subset
: this parameter can restrict the
analysis to a certain subset of categories available in the
metadata/category vector.
regression_method
: set to "lm"
by
default because it is faster and highly recommended in machine learning
scenarios, where the function might be repeatedly called many times. For
basic differential analyses, "rlm"
can also be used and may
perform better in some cases.
percentile_vector
: by default, molecular targets
(genes, proteins, etc.) whose expression level is below the 35th
percentile of the entire data matrix are excluded from the analysis.
This threshold can be modified by specifying the percentile vector that
is internally used for the percolation analysis. For example, to remove
only targets below the 25th percentile, set
percentile_vector = seq(0.25, 0.98, by = 0.05)
.
padj_method
: the default method is Bonferroni.
Storey’s q values often give more generous results but the
qvalue
package needs to be installed first.
NOTE: Not all patient IDs need to be present across datasets. Different numbers of samples per omic are acceptable. Only IDs whose data is available in the colnames of the assayData will be included in the analysis. Missing IDs will be listed in a message similar to:
The following samples IDs are missing in Proteomics: PT001, PT005, PT0030
The deggs_object
now contains the differential networks
for each omic data in assayData_list
. These networks can be
integrated into a comprehensive visualization where different colors
distinguish links from different omic layers.
View_diffNetworks(deggs_object)
This visualization interface allows you to:
Â
Â
Thicker links correspond to higher significant p-values.
The direction of the arrows shows the relationship direction reported in
literature, not derived from the data.
The user can visualize differential regression plots by clicking on a link:
Â
Â
Single node differential expressions can also be visualized by clicking on the nodes:
Â
Â
NOTE: For multi-omic scenarios, the data from the
first matrix in the list passed to assayData
will be used
for this boxplot.
Outside of the interactive environment, the
get_multiOmics_diffNetworks()
function can be used to get a
table of all differential interactions, ordered by p-value or adjusted
p-value:
get_multiOmics_diffNetworks(deggs_object, sig_threshold = 0.05)
#> $`Non-responder`
#> from to p.value p.adj layer
#> TNF-TNFRSF1A TNF TNFRSF1A 3.325023e-04 1.330009e-03 RNAseq
#> IL1B-IL1R2 IL1B IL1R2 5.389195e-04 2.155678e-03 RNAseq
#> TGFB3-TGFBR1 TGFB3 TGFBR1 5.329071e-15 2.131628e-14 RNAseq
#> AKT2-MTOR AKT2 MTOR 4.347083e-04 1.738833e-03 RNAseq
#> FANCD2-FAN1 FANCD2 FAN1 4.440892e-16 3.552714e-15 Proteomics
#> GNG12-RASA2 GNG12 RASA2 0.000000e+00 0.000000e+00 Proteomics
#> RASGRP3-RRAS RASGRP3 RRAS 2.202344e-04 1.761875e-03 Proteomics
#> TNF-TNFRSF1A1 TNF TNFRSF1A 0.000000e+00 0.000000e+00 Proteomics
#> RASGRP1-RAP1A RASGRP1 RAP1A 2.840564e-03 2.272452e-02 Proteomics
#>
#> $Responder
#> from to p.value p.adj layer
#> FANCD2-FAN1 FANCD2 FAN1 0.000000e+00 0.000000e+00 RNAseq
#> FASLG-FAS FASLG FAS 6.947839e-07 2.084352e-06 RNAseq
#> MAP2K2-MAPK3 MAP2K2 MAPK3 1.153300e-12 3.459899e-12 RNAseq
For single omic scenarios, use the get_sig_deggs()
function:
deggs_object_oneOmic <- get_diffNetworks(assayData = synthetic_rnaseqData,
metadata = synthetic_metadata,
category_variable = "response",
regression_method = "lm",
padj_method = "bonferroni",
verbose = FALSE,
show_progressBar = FALSE,
cores = 2)
get_sig_deggs(deggs_object_oneOmic, sig_threshold = 0.05)
#> from to p.value p.adj
#> Non-responder.TNF-TNFRSF1A TNF TNFRSF1A 3.325023e-04 1.330009e-03
#> Non-responder.IL1B-IL1R2 IL1B IL1R2 5.389195e-04 2.155678e-03
#> Non-responder.TGFB3-TGFBR1 TGFB3 TGFBR1 5.329071e-15 2.131628e-14
#> Non-responder.AKT2-MTOR AKT2 MTOR 4.347083e-04 1.738833e-03
#> Responder.FANCD2-FAN1 FANCD2 FAN1 0.000000e+00 0.000000e+00
#> Responder.FASLG-FAS FASLG FAS 6.947839e-07 2.084352e-06
#> Responder.MAP2K2-MAPK3 MAP2K2 MAPK3 1.153300e-12 3.459899e-12
To plot the differential regression fits outside of the interactive
environment, use plot_regressions()
specifying the omic
data to be used and the two targets:
plot_regressions(deggs_object,
assayDataName = "RNAseq",
gene_A = "MTOR",
gene_B = "AKT2",
legend_position = "bottomright")
In single omic analyses, the assayDataName
parameter can
remain unset.
It’s possible to compare differential interactions among more than
two categorical groups. All steps described above stay the same;
the dropdown menu of the interactive environment will show all available
categories:
|
while regressions and boxplots will show all categories:
|
|
The statistical significance of the interaction term is calculated
via one-way ANOVA in this case.
We highly recommend to have at least 4 or 5 observations per group.
Detecting differentially expressed interactions is useful for gaining insights into molecular dysregulations affecting certain conditions. This information can also be leveraged to train predictive models for the category of interest.
While some models, such as glmnet, allow for sparsity and have
built-in variable selection, many models fail to fit when given massive
numbers of predictors, or perform poorly due to overfitting. In
medicine, one of the common goals of predictive modeling is the
development of diagnostic or biomarker tests, for which reducing the
number of predictors is typically a practical necessity.
The differential molecular interactions found with multiDEGGs can be
used to filter down the number of predictors.
However, filtering predictors on the whole dataset creates
information leakage about the test set, leading to overly optimistic
performance assessments (Vabalas et al., 2019).
Feature selection should be considered an integral part of a model, with
selection performed only on training data. Then the selected features
and accompanying model can be tested on hold-out test data without
bias.
While most machine learning pipelines involve splitting data into training and testing cohorts (typically 2/3 and 1/3 respectively), medical datasets may be too small for this approach, resulting in accuracy determination problems, due to small test sets. Nested cross-validation (CV) provides a way to address this by maximizing use of the whole dataset for testing overall accuracy, while maintaining the split between training and testing.
As mentioned earlier, it is important that any filtering of
predictors is performed within the CV loops to prevent test data
information leakage. For this reason, we demonstrate how to use
multiDEGGs in combination with the nestedCV package, where filtering can
be applied to each outer CV training fold (see also the
nestedCV
vignette).
The core idea behind multiDEGGs is that the interaction between two molecular entities can provide additional information beyond the expression of individual molecular entities. In machine learning terms, this suggests that combinations of certain predictors may carry more information than individual predictor data.
Therefore, we propose using multiDEGGs not only to select predictors involved in differential interactions but also to create new predictors based on combinations of the original ones. These relationships can be modeled by adding new predictors as combinations of pairs of original predictors (e.g., multiplication, ratio, etc.).
To evaluate the performance of the resulting trained model, these combined predictors must also be added to the test set. The nestedCV package allows integration of predictor modification into the CV loop, enabling modification of both training and test sets separately, avoiding data leakage.
The first step is to define a custom filtering function that extracts the differential nodes and links from the selected fold:
# Convert metadata into a named factor vector containing only the labels to
# predict (to ensure compatibility with the nestedCV functions)
metadata_vector <- as.factor(synthetic_metadata[, "response"])
# Make sure the assay data you want to use for prediction is a matrix and
# transpose it, so features are in columns
# (standard format for machine learning)
assayData <- as.matrix(t(synthetic_rnaseqData))
# NOTE: Make sure your vector is ALIGNED with assayData.
# The order of the annotations in the metadata_vector must match the
# samples in the rows of assayData
# Remove zero variance columns from data
assayData <- assayData[,apply(assayData, 2, var, na.rm=TRUE) != 0]
# define a filtering function that extracts differential nodes and links
# using multiDEGGs:
DEGGs_modxy <- function(metadata, assayData, ...) {
counts <- t(assayData)
names(metadata_vector) <- rownames(assayData)
deggs_object <- multiDEGGs::get_diffNetworks(
assayData = counts,
metadata = metadata_vector,
percentile_vector = seq(0.25, 0.98, by = 0.05),
use_qvalues = TRUE,
show_progressBar = FALSE,
verbose = FALSE,
cores = 1
)
pairs <- multiDEGGs::get_sig_deggs(deggs_object, 1, 0.05)
# Take genes 2 by 2 from top pairs to lower ones
keep_DEGGs <- unique(unlist(lapply(1:nrow(pairs), function(i) {
row_n = c(pairs[i,1], pairs[i,2])
})))
# The following could be added if you want to set a maximum number of
# predictors to be selected:
# if (length(keep_DEGGs) > 50) { # take only top 50 predictors
# keep_DEGGs <- keep_DEGGs[1:50]
# pairs <- pairs[which(pairs$var1 %in% keep_DEGGs &
# pairs$var2 %in% keep_DEGGs), ]
# }
out <- list(keep_DEGGs = keep_DEGGs, pairs = pairs)
class(out) <- "DEGGs_modxy"
return(out)
}
Next, define a predict function that specifies how to modify the train and test set predictors based on the filtering:
# This custom predict function will add new columns to x (can be train or test)
predict.DEGGs_modxy <- function(DEGGs.object, newdata, filter = TRUE,
interaction.type = "ratio",
sep = ":", ...) {
if (length(DEGGs.object$keep) != 0) {
pairs <- DEGGs.object$pairs
x2a <- newdata[, pairs[, 1], drop = FALSE]
x2b <- newdata[, pairs[, 2], drop = FALSE]
if (interaction.type == "ratio") {
x2 <- x2a/x2b
} else {
x2 <- x2a*x2b
}
colnames(x2) <- paste(colnames(x2a), colnames(x2b), sep = sep)
if (filter) {
keep <- DEGGs.object$keep[!is.na(DEGGs.object$keep)]
x1 <- newdata[, keep]
return(cbind(x1, x2))
} else {
return(cbind(newdata, x2))
}
}
return(newdata)
}
Now you’re ready to train your model with glmnet or any other model
available in the caret
package:
library(nestedCV)
# Train a glmnet model
fitted_model <- nestcv.glmnet(
y = metadata_vector,
x = assayData,
filterFUN = NULL,
filter_options = list(nfilter = 20),
family = "binomial",
alphaSet = seq(0.7, 1, 0.1),
min_1se = 0,
cv.cores = 1,
modifyX_useY = TRUE,
verbose = TRUE
)
# Train a random forest model
fitted_model <- nestcv.train(
y = metadata_vector,
x = assayData,
method = "rf",
n_outer_folds = 8,
modifyX = "DEGGs_ttest_modxy",
filter_options = list(nfilter = 20),
modifyX_useY = TRUE,
cv.cores = 4,
verbose = TRUE
)
The ROC curve can be easily plotted:
plot(fitted_model$roc, col = 'red')
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Rome
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] multiDEGGs_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
#> [5] magrittr_2.0.3 cachem_1.1.0 parallel_4.4.2 knitr_1.50
#> [9] htmltools_0.5.8.1 rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.5
#> [13] sfsmisc_1.1-20 sass_0.4.10 jquerylib_0.1.4 compiler_4.4.2
#> [17] rstudioapi_0.17.1 tools_4.4.2 evaluate_1.0.3 bslib_0.9.0
#> [21] yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0 MASS_7.3-65
citation("multiDEGGs")
#> To cite package 'multiDEGGs' in publications use:
#>
#> Sciacca E, et al. (2023). "DEGGs: An R package with shiny app for the
#> identification of differentially expressed gene-gene interactions in
#> high-Throughput sequencing data." _Bioinformatics_, *39*, btad192.
#> doi:10.1093/bioinformatics/btad192
#> <https://doi.org/10.1093/bioinformatics/btad192>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {DEGGs: An R package with shiny app for the identification of differentially expressed gene-gene interactions in high-Throughput sequencing data},
#> author = {Elisabetta Sciacca and {et al.}},
#> journal = {Bioinformatics},
#> year = {2023},
#> volume = {39},
#> pages = {btad192},
#> doi = {10.1093/bioinformatics/btad192},
#> }
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.