Tutorial on how to use the RNAseqNet package to infer networks from RNA-seq expression datasets with or without an auxiliary dataset.
RNAseqNet 0.1.1
The R package RNAseqNet can be used to infer networks from RNA-seq expression data. The count data are given as an \(n \times p\) matrix in which \(n\) is the number of individuals and \(p\) the number of genes. This matrix is denoted by \(\mathbf{X}\) in what follows.
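As a quick illustration of the expected layout (individuals in rows, genes in columns), the sketch below builds a small simulated count matrix in base R. The object name X_toy, the dimensions and the Poisson rate are arbitrary choices made for this example and are not part of the package.

```r
# Toy stand-in for the count matrix X (simulated, not real RNA-seq data).
set.seed(1)
n <- 6   # number of individuals (rows)
p <- 4   # number of genes (columns)
X_toy <- matrix(rpois(n * p, lambda = 20), nrow = n, ncol = p,
                dimnames = list(paste0("ind", 1:n), paste0("gene", 1:p)))
dim(X_toy)        # n rows, p columns
all(X_toy >= 0)   # counts are non-negative
```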
Optionally, the RNA-seq dataset is complemented with an \(n' \times d\) matrix, \(\mathbf{Y}\), which can be used to impute missing individuals in \(\mathbf{X}\) as described in [Imbert et al., 2017] (Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2017) Multiple hot-deck imputation for network inference from RNA sequencing data. Preprint).
Two datasets are available in the package: lung and thyroid. Both have \(n = 221\) rows, with 100 and 50 columns respectively. The raw data were downloaded from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was performed with the R package edgeR.
Data are loaded with:
library(RNAseqNet)
data(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)
Network inference from RNA-seq data is performed with the Poisson GLM model described in [Allen and Liu, 2012] (Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM)). The inference can be performed with the function GLMnetwork as follows:
lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)
The entry path of ref_lung contains length(lambdas) = 10 matrices with estimated coefficients. Each matrix is a square matrix with ncol(lung) = 100 rows and columns.
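To give some intuition for the node-wise Poisson regression behind these coefficient matrices, here is a minimal base R sketch for a single gene on simulated data. It is a simplification and not the package's implementation: no penalty is applied (so there is no \(\lambda\)), and the log(x + 1) transformation of the predictors, the object names and the simulated counts are assumptions made for this example only.

```r
# Simulated counts: 50 individuals, 4 genes (toy data, not from the package).
set.seed(42)
n <- 50
X_toy <- matrix(rpois(n * 4, lambda = 30), nrow = n,
                dimnames = list(NULL, paste0("gene", 1:4)))

# Regress gene1 on the log-transformed expression of the other genes
# with a Poisson log-linear model. No penalty is applied here, whereas
# the inference above relies on a regularised fit for each lambda.
predictors <- log(X_toy[, -1] + 1)
fit <- glm(X_toy[, 1] ~ predictors, family = poisson())
coef(fit)  # intercept plus one coefficient per other gene
```

Repeating such a regression for every gene would fill one row of a \(p \times p\) coefficient matrix per gene, which matches the shape of the matrices stored in path.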
The most appropriate value for \(\lambda\) can be chosen with the StARS criterion of [Liu et al., 2010] (Liu, H., Roeder, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada), which is implemented in the function stabilitySelection. The argument B is used to specify the number of resamplings used to compute the stability criterion:
set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)
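The quantity at the heart of StARS is the average edge instability across the graphs inferred from the B subsamples. The base R sketch below computes it for one value of \(\lambda\), following the definition in Liu et al. (per-edge instability \(2\theta(1-\theta)\), with \(\theta\) the selection frequency of the edge). The function name edge_instability and the toy adjacency matrices are hypothetical. How stabilitySelection computes and thresholds this quantity internally is not shown in this tutorial and may differ.

```r
# Average edge instability over a set of graphs inferred from subsamples.
# adj_list: list of symmetric 0/1 adjacency matrices (one per subsample).
edge_instability <- function(adj_list) {
  freq <- Reduce(`+`, adj_list) / length(adj_list)  # selection frequency
  xi <- 2 * freq * (1 - freq)                       # per-edge instability
  mean(xi[upper.tri(xi)])                           # average over node pairs
}

# Toy example with 3 nodes and B = 2 subsamples:
# edge 1-2 is always selected, edge 1-3 once, edge 2-3 never.
A1 <- matrix(0, 3, 3); A1[1, 2] <- A1[2, 1] <- 1; A1[1, 3] <- A1[3, 1] <- 1
A2 <- matrix(0, 3, 3); A2[1, 2] <- A2[2, 1] <- 1
edge_instability(list(A1, A2))  # (0 + 0.5 + 0) / 3
```

An edge that is always or never selected contributes zero instability, while an edge selected in half of the subsamples contributes the maximum value of 0.5.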
The entry best of stability_lung is the index of the chosen \(\lambda\) in lambdas. Here, the value \(\lambda=\) lambdas[stability_lung$best] = 0.3097055 is chosen.
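The position of this value in the grid can be checked with base R alone, without rerunning the stability selection. The snippet below only recomputes the grid defined earlier and looks up the reported value; the tolerance 1e-6 is an arbitrary choice for the comparison.

```r
# Same grid as above: 10 values, log-spaced from 4 down to 0.04.
lambdas <- 4 * 10^(seq(0, -2, length.out = 10))
round(lambdas, 4)

# Position of the reported value within the grid
which(abs(lambdas - 0.3097055) < 1e-6)
```

The reported value is the sixth element of the grid, i.e. \(4 \times 10^{-10/9}\).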
The corresponding set of estimated coefficients is in ref_lung$path[[stability_lung$best]] and can be transformed into a network with the function GLMnetToGraph:
lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
## IGRAPH UN-- 100 454 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] MT-CO1--MT-ND4 MT-CO1--SFTPA2 MT-CO1--hsa-mir-6723
## [4] MT-CO1--A2M MT-CO1--MT-CO2 MT-CO1--MT-CO3
## [7] MT-CO1--MT-RNR2 MT-CO1--MT-ATP6 MT-CO1--MT-ND1
## [10] MT-CO1--MTND2P28 MT-CO1--MT-ND6 MT-CO1--SAT1
## [13] MT-CO1--HSPB1 MT-CO1--MTND4P12 MT-CO1--HLA-DRA
## [16] MT-CO1--MTND1P23 SFTPB --SFTPA2 SFTPB --FN1
## [19] SFTPB --B2M SFTPB --NEAT1 SFTPB --SLC34A2
## [22] SFTPB --PGC SFTPB --CTSD SFTPB --TSC22D3
## + ... omitted several edges
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange",
     vertex.frame.color = "orange", vertex.label.cex = 0.5,
     vertex.label.color = "black")
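The step from a coefficient matrix to an undirected graph can be pictured with the following base R sketch. The symmetrisation rule used here (an edge is drawn when at least one of the two coefficients linking a pair of genes is non-zero) is an assumption made for illustration; the rule actually applied by GLMnetToGraph is not described in this tutorial. The matrix B and its values are made up.

```r
# Toy 3 x 3 coefficient matrix (made-up values); row j holds the
# coefficients of the regression of gene j on the other genes.
B <- matrix(c(0,   0.8, 0,
              0.5, 0,   0,
              0,   0.3, 0), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("g", 1:3), paste0("g", 1:3)))

# Assumed rule: connect two genes when at least one of the two
# coefficients linking them is non-zero ("OR" rule).
adj <- (B != 0) | (t(B) != 0)
diag(adj) <- FALSE
which(adj & upper.tri(adj), arr.ind = TRUE)  # one row per undirected edge
sum(adj[upper.tri(adj)])                     # number of edges
```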
In this section, we artificially remove some of the observations in lung to create missing individuals (as compared to those in thyroid):
set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
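The effect of this removal step on the number of rows can be reproduced on a simulated matrix with the same dimensions as lung. The matrix X_toy and its Poisson counts are stand-ins chosen for this sketch; only the row arithmetic carries over to the real data.

```r
set.seed(1717)
# Simulated stand-in with the same dimensions as lung (221 x 100).
X_toy <- matrix(rpois(221 * 100, lambda = 20), nrow = 221)
nobs <- nrow(X_toy)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
length(miss_ind)          # round(0.2 * 221) = round(44.2) = 44
X_toy[miss_ind, ] <- NA
X_toy <- na.omit(X_toy)   # drops every row containing NA
nrow(X_toy)               # 221 - 44 = 177
```

After this step, 44 individuals are missing from the reduced matrix, which keeps 177 of the original 221 rows.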
The method described in [Imbert et al., 2017] is then used to infer a network for the lung expression data, imputing the missing individuals from the gene expression information provided by the thyroid dataset.
The first step of the method is to choose a relevant value for the donor list parameter, \(\sigma\). This is done by computing \(V_{\textrm{intra}}\), the intra-variability in the donor pool, for various values of \(\sigma\). An elbow rule is then used to choose an appropriate value:
library(ggplot2)
sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
  geom_line() + theme_bw() +
  ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
  xlab(expression(sigma)) + ylab(expression(V[intra])) +
  theme(title = element_text(size = 10))
print(p)
Here, \(\sigma = 2\) is chosen. Finally, hd-MI (multiple hot-deck imputation) is run with the chosen \(\sigma\) and a list of regularization parameters \(\lambda\), which are selected with the StARS criterion (from B = 10 subsamples) in each of the m = 100 replicates of the inference, each performed on a different imputed dataset. The function imputedGLMnetwork implements the full method. The distribution of edge frequencies among the m = 100 inferred networks is obtained with the function plot applied to the result of this function.
set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
                               m = 100, B = 10)
plot(lung_hdmi)
The final graph is then extracted by applying the function GLMnetToGraph to the result of the function imputedGLMnetwork and providing a threshold for the edge frequency.
lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
## IGRAPH UN-- 100 132 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] MT-CO1 --MT-ND4 MT-CO1 --MT-CO2
## [3] MT-CO1 --MT-CO3 MT-CO1 --MT-ND1
## [5] SFTPB --SLC34A2 SFTPB --PGC
## [7] SFTPB --NAPSA SFTPB --ABCA3
## [9] SFTPC --SFTPA2 SFTPC --SFTPA1
## [11] MT-ND4 --MT-ND4L SFTPA2 --SFTPA1
## [13] hsa-mir-6723--MT-RNR2 hsa-mir-6723--MTATP6P1
## [15] hsa-mir-6723--RP5-857K21.11 hsa-mir-6723--MTND1P23
## + ... omitted several edges
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange",
     vertex.frame.color = "orange", vertex.label.cex = 0.5,
     vertex.label.color = "black")
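The thresholding on edge frequency can be sketched in base R as follows, with three toy adjacency matrices standing in for the m inferred networks. The matrices are made up, and keeping edges whose frequency is greater than or equal to the threshold (rather than strictly greater) is an assumption; the exact comparison used by GLMnetToGraph is not stated in this tutorial.

```r
# Three toy symmetric adjacency matrices standing in for m = 3 networks.
nets <- list(
  matrix(c(0, 1, 1,  1, 0, 0,  1, 0, 0), 3, 3),  # edges 1-2 and 1-3
  matrix(c(0, 1, 0,  1, 0, 0,  0, 0, 0), 3, 3),  # edge 1-2
  matrix(c(0, 1, 0,  1, 0, 1,  0, 1, 0), 3, 3)   # edges 1-2 and 2-3
)
freq <- Reduce(`+`, nets) / length(nets)   # edge frequency in [0, 1]
threshold <- 0.9
final_adj <- freq >= threshold             # assumed rule: keep if freq >= threshold
sum(final_adj[upper.tri(final_adj)])       # number of edges kept
```

Only the edge between nodes 1 and 2, present in all three toy networks, passes the 0.9 threshold; the two other edges have a frequency of 1/3 and are discarded.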
Here is the output of sessionInfo() on the system on which this document was compiled:
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RNAseqNet_0.1.1 ggplot2_2.2.1 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 compiler_3.4.0 plyr_1.8.3 iterators_1.0.8
## [5] tools_3.4.0 rpart_4.1-11 digest_0.6.9 evaluate_0.10
## [9] tibble_1.3.0 gtable_0.1.2 lattice_0.20-35 Matrix_1.2-10
## [13] foreach_1.4.3 igraph_1.0.1 yaml_2.1.14 stringr_1.0.0
## [17] knitr_1.15.1 rprojroot_1.2 glmnet_2.0-10 grid_3.4.0
## [21] nnet_7.3-12 mice_2.30 PoiClaClu_1.0.2 hot.deck_1.1
## [25] survival_2.41-3 rmarkdown_1.5 bookdown_0.3 magrittr_1.5
## [29] backports_1.0.5 scales_0.4.1 codetools_0.2-15 htmltools_0.3.5
## [33] MASS_7.3-47 splines_3.4.0 colorspace_1.2-4 labeling_0.3
## [37] stringi_1.0-1 lazyeval_0.2.0 munsell_0.4.2