1 Introduction

The R package RNAseqNet infers networks from RNA-seq expression data. The count data are given as an \(n \times p\) matrix, in which \(n\) is the number of individuals and \(p\) the number of genes. This matrix is denoted by \(\mathbf{X}\) in the following.

Optionally, the RNA-seq dataset can be complemented with an \(n' \times d\) matrix, \(\mathbf{Y}\), which is used to impute missing individuals in \(\mathbf{X}\) as described in [Imbert et al., 2017] (Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2017) Multiple hot-deck imputation for network inference from RNA sequencing data. Preprint).

2 Dataset description

Two datasets are available in the package, lung and thyroid, each with \(n = 221\) rows and with 100 and 50 columns, respectively. The raw data were downloaded from https://gtexportal.org/ and the RNA-seq expression was normalised with the TMM method using the R package edgeR.

Data are loaded with:

library(RNAseqNet)
data(lung)
# cex.axis (not cex.names) controls the size of axis labels in boxplot
boxplot(log2(lung + 1), las = 3, cex.axis = 0.5)

data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.axis = 0.5)

3 Network inference from RNA-seq data

Network inference from RNA-seq data is performed with the Poisson GLM described in [Allen and Liu, 2012] (Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM)). The inference is run with the function GLMnetwork as follows:

lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)

The entry path of ref_lung contains length(lambdas) = 10 matrices of estimated coefficients, one for each value in lambdas. Each of them is a square matrix with ncol(lung) = 100 rows and columns.
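
As a quick check of the regularization grid used above, the following base R lines (which do not depend on RNAseqNet) show that lambdas is a decreasing, log-spaced sequence from 4 to 0.04. The name log10_steps is only introduced for this illustration.

```r
# Regularization grid used in this vignette: 10 values, log-spaced from 4 to 0.04
lambdas <- 4 * 10^(seq(0, -2, length = 10))

# Largest and smallest penalties
range(lambdas)

# Consecutive values differ by a constant step of -2/9 on the log10 scale
log10_steps <- diff(log10(lambdas))
all.equal(log10_steps, rep(-2 / 9, 9))
```

With an L1-type penalty, larger values of \(\lambda\) generally give sparser coefficient matrices, so the first elements of the grid are expected to correspond to the sparsest networks.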

The most appropriate value for \(\lambda\) can be chosen with the StARS criterion of [Liu et al., 2010] (Liu, H., Roeber, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada), which is implemented in the function stabilitySelection. The argument B specifies the number of resamples used to compute the stability criterion:

set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)
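
To give an idea of what a StARS-type criterion computes, here is a minimal base R sketch. It is not the code of stabilitySelection: the helper stars_choice, its argument freq and the toy frequencies are hypothetical, and the bound beta = 0.05 is an assumed value. For each \(\lambda\), the instability of an edge selected in a proportion \(\theta\) of the resamples is \(2\theta(1-\theta)\), and the densest network whose average instability stays below the bound is retained.

```r
# Hypothetical helper: StARS-style choice from edge selection frequencies.
# freq: one row per lambda (sorted from largest to smallest penalty),
#       one column per candidate edge, entries in [0, 1].
stars_choice <- function(freq, beta = 0.05) {
  # Instability of each edge, 2 * theta * (1 - theta), averaged over edges
  instability <- rowMeans(2 * freq * (1 - freq))
  # Enforce monotonicity when moving from the sparsest to the densest graph
  instability <- cummax(instability)
  # Index of the densest graph whose instability does not exceed beta
  ok <- which(instability <= beta)
  if (length(ok) == 0) NA_integer_ else max(ok)
}

# Toy selection frequencies for 3 values of lambda and 4 candidate edges
freq <- rbind(c(0, 0,   0,   0),
              c(1, 1,   0,   0),
              c(1, 0.5, 0.5, 0))
best <- stars_choice(freq)
best
```

In this toy case, the third value of \(\lambda\) has an average instability of 0.25, above the bound, so the second value is retained.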

The entry best of stability_lung is the index of the chosen \(\lambda\) in lambdas. Here, the value \(\lambda=\) lambdas[stability_lung$best] = 0.3097055 is chosen.

The corresponding set of estimated coefficients is in ref_lung$path[[stability_lung$best]] and can be transformed into a network with the function GLMnetToGraph:

lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
## IGRAPH UN-- 100 454 -- 
## + attr: name (v/c)
## + edges (vertex names):
##  [1] MT-CO1--MT-ND4       MT-CO1--SFTPA2       MT-CO1--hsa-mir-6723
##  [4] MT-CO1--A2M          MT-CO1--MT-CO2       MT-CO1--MT-CO3      
##  [7] MT-CO1--MT-RNR2      MT-CO1--MT-ATP6      MT-CO1--MT-ND1      
## [10] MT-CO1--MTND2P28     MT-CO1--MT-ND6       MT-CO1--SAT1        
## [13] MT-CO1--HSPB1        MT-CO1--MTND4P12     MT-CO1--HLA-DRA     
## [16] MT-CO1--MTND1P23     SFTPB --SFTPA2       SFTPB --FN1         
## [19] SFTPB --B2M          SFTPB --NEAT1        SFTPB --SLC34A2     
## [22] SFTPB --PGC          SFTPB --CTSD         SFTPB --TSC22D3     
## + ... omitted several edges
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")

4 Network inference with an auxiliary dataset

In this section, we artificially remove some of the observations in lung to create missing individuals (as compared to those in thyroid):

set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.axis = 0.5)
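
As a sanity check on the step above, round(0.2 * 221) = 44 individuals are removed, which leaves 177 rows. The toy matrix below (hypothetical, filled with simulated counts, and drawn with a different seed so the removed individuals are not those of the vignette) reproduces the same operations without touching the package data.

```r
set.seed(1)
# Toy count matrix with as many individuals as lung (221) and 3 genes
toy <- matrix(rpois(221 * 3, lambda = 10), nrow = 221,
              dimnames = list(paste0("ind", 1:221), paste0("gene", 1:3)))

# Same removal procedure as in the vignette
nobs <- nrow(toy)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
toy[miss_ind, ] <- NA
toy <- na.omit(toy)

length(miss_ind)  # number of removed individuals
nrow(toy)         # number of remaining individuals
```

Note that na.omit keeps the row names of the remaining individuals.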

The method described in [Imbert et al., 2017] is then used to infer a network from the lung expression data, with missing individuals imputed from the gene expression information provided by the thyroid dataset.

The first step of the method is to choose a relevant value for the donor list parameter, \(\sigma\). This is done by computing \(V_{\textrm{intra}}\), the intra-variability within donor pools, for various values of \(\sigma\). An elbow rule is then used to choose an appropriate value:

library(ggplot2)
sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
  geom_line() + theme_bw() + 
  ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
  xlab(expression(sigma)) + ylab(expression(V[intra])) +
  theme(title = element_text(size = 10))
print(p)

Here, \(\sigma = 2\) is chosen. hd-MI is then run with the chosen \(\sigma\) and a list of regularization parameters \(\lambda\), among which one is selected with the StARS criterion (from B = 10 subsamples) in each of the m = 100 replicates of the inference, each performed on a different imputed dataset. The function imputedGLMnetwork implements the full method. The distribution of edge frequencies among the m = 100 inferred networks is obtained by applying the function plot to its result.

set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
                               m = 100, B = 10)
plot(lung_hdmi)

Finally, the graph is extracted by applying the function GLMnetToGraph to the result of the function imputedGLMnetwork and providing a threshold on the edge frequency.

lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
## IGRAPH UN-- 100 132 -- 
## + attr: name (v/c)
## + edges (vertex names):
##  [1] MT-CO1      --MT-ND4        MT-CO1      --MT-CO2       
##  [3] MT-CO1      --MT-CO3        MT-CO1      --MT-ND1       
##  [5] SFTPB       --SLC34A2       SFTPB       --PGC          
##  [7] SFTPB       --NAPSA         SFTPB       --ABCA3        
##  [9] SFTPC       --SFTPA2        SFTPC       --SFTPA1       
## [11] MT-ND4      --MT-ND4L       SFTPA2      --SFTPA1       
## [13] hsa-mir-6723--MT-RNR2       hsa-mir-6723--MTATP6P1     
## [15] hsa-mir-6723--RP5-857K21.11 hsa-mir-6723--MTND1P23     
## + ... omitted several edges
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")
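
The thresholding step can be illustrated independently of the package. The sketch below assumes that each replicate of the inference yields a symmetric 0/1 adjacency matrix, which is a simplification. The names make_adj, adj_list, edge_freq and final_adj are hypothetical and not part of RNAseqNet, and the rule "frequency greater than or equal to the threshold" is an assumption about how the threshold is applied.

```r
# Three toy adjacency matrices standing in for m = 3 inferred networks
genes <- c("A", "B", "C")
make_adj <- function(edges) {
  adj <- matrix(0, 3, 3, dimnames = list(genes, genes))
  for (e in edges) {
    adj[e[1], e[2]] <- 1
    adj[e[2], e[1]] <- 1
  }
  adj
}
adj_list <- list(make_adj(list(c("A", "B"), c("B", "C"))),
                 make_adj(list(c("A", "B"))),
                 make_adj(list(c("A", "B"), c("A", "C"))))

# Edge frequency: proportion of networks in which each edge is present
edge_freq <- Reduce(`+`, adj_list) / length(adj_list)

# Keep edges whose frequency reaches the threshold (assumed rule: >=)
threshold <- 0.9
final_adj <- (edge_freq >= threshold) * 1
final_adj
```

Only the edge between A and B, which is present in all three toy networks, passes the 0.9 threshold. The other two edges each have a frequency of 1/3 and are dropped.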

Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RNAseqNet_0.1.1 ggplot2_2.2.1   BiocStyle_2.4.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     compiler_3.4.0   plyr_1.8.3       iterators_1.0.8 
##  [5] tools_3.4.0      rpart_4.1-11     digest_0.6.9     evaluate_0.10   
##  [9] tibble_1.3.0     gtable_0.1.2     lattice_0.20-35  Matrix_1.2-10   
## [13] foreach_1.4.3    igraph_1.0.1     yaml_2.1.14      stringr_1.0.0   
## [17] knitr_1.15.1     rprojroot_1.2    glmnet_2.0-10    grid_3.4.0      
## [21] nnet_7.3-12      mice_2.30        PoiClaClu_1.0.2  hot.deck_1.1    
## [25] survival_2.41-3  rmarkdown_1.5    bookdown_0.3     magrittr_1.5    
## [29] backports_1.0.5  scales_0.4.1     codetools_0.2-15 htmltools_0.3.5 
## [33] MASS_7.3-47      splines_3.4.0    colorspace_1.2-4 labeling_0.3    
## [37] stringi_1.0-1    lazyeval_0.2.0   munsell_0.4.2