library(scapGNN)
With the widespread availability of single-cell sequencing technology, numerous single-cell RNA-seq and ATAC-seq data are collected into public databases. However, the single-cell data have characteristics such as drop-out events and low library sizes. How to leverage these data to parse cellular heterogeneity at the pathway level and identify activated gene modules with multiple cell phenotypes is still a challenge.
Here, we propose scapGNN
, an accurate, effective, and robust framework to comprehensively parse single-cell data at the pathway level from single-omics data or multi-omics integration data. The scapGNN
model takes single-cell gene expression profiles or the gene activity matrix of scATAC-seq as input data. As shown in Figure 1A, the computing architecture consists of three modules:
A depp neural network-based autoencoder, which is used to extract cellular features, gene features, and mine latent associations between genes and cells.
A graph convolutional network-based graph autoencoder, which is used to construct cell-cell association network and gene-gene association network.
Random walk with restart (RWR), which is used to calculate the activity score of a pathway or gene set in each cell.
For the SNARE-seq dataset, scapGNN
can integrate gene-cell association networks from scATAC-seq and scRNA-seq into one combined network by the Brown’s extension of the Fisher’s combined probability test. According to the combined gene-cell association network, pathway activity scores or activated gene modules are supported by single-cell multi-omics (Figure 2). scapGNN
also designs abundant visualization programs to clearly display the analysis results. Taken together, its capabilities enable scapGNN
to calculate pathway activity scores in a single cell, assess pathway heterogeneity between cells, identify key active genes in pathway, mine gene modules under multiple cell phenotypes, and provide gene and cell network data (Figure 1B).
Figure 1 Overview of the scapGNN framework.
Figure 2 The workflow of integrating scRNA-seq data and scATAC-seq data by scapGNN.
Single-cell gene expression profiles data needs to pass Seurat
’s pre-processing workflow. These represent the selection and filtration of cells based on QC metrics, data normalization, and the detection of highly variable features.
library(Seurat)
# Load the PBMC dataset (Case data for seurat)
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Our recommended data filtering is that only genes expressed as non-zero in more than
# 1% of cells, and cells expressed as non-zero in more than 1% of genes are kept.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "case",
min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalizing the data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize")
# Identification of highly variable features.
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
# Run Preprocessing() of scapGNN.
Prep_data <- Preprocessing(pbmc,verbose=FALSE)
Detailed procedures and descriptions can be found in vignettes of Seurat
(https://satijalab.org/seurat/).
# Users can also directly input data in a data frame or matrix format which contains hypervariable genes and is log-transformed.
data("Hv_exp")
<- Preprocessing(Hv_exp,verbose=FALSE)
Prep_data summary(Prep_data)
#> Length Class Mode
#> orig_dara 45000 -none- numeric
#> cell_features 45000 -none- numeric
#> gene_features 45000 -none- numeric
#> ltmg_matrix 1 -none- function
#> cell_adj 8100 -none- numeric
#> gene_adj 250000 -none- numeric
First, the Signac
package is used to quantify the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity matrix derived from the scATAC-seq data. For details, please refer to vignettes of Signac
(https://satijalab.org/signac/index.html). As with single-cell gene expression profiles, the ensuing gene activity matrix is normalized and detected highly variable features by the Seurat
package. Finally, the Preprocessing()
of scapGNN
further processes the data into the required format.
Using the Preprocessing()
function results of scRNA-seq or scATAC-seq as input, the ConNetGNN()
function evaluates whether gene-gene, cell-cell, and gene-cell are associated and levels of strength. Finally, an undirected and weighted gene-cell network is constructed and use for subsequent analysis.
The ConNetGNN()
function is implemented based on pytorch
, so an appropriate python environment is required:
We also provide environment files for conda: /inst/extdata/scapGNN_env.yaml
. Users can install it with the command: conda env create -f scapGNN_env.yaml
.
library(coop)
library(reticulate)
library(parallel)
# Data preprocessing
data("Hv_exp")
Prep_data <- Preprocessing(Hv_exp)
# Run ConNetGNN()
ConNetGNN_data <- ConNetGNN(Prep_data,python.path="../miniconda3/envs/scapGNN_env/python.exe")
# View the content of the ConNetGNN() results.
data(ConNetGNN_data)
summary(ConNetGNN_data)
#> Length Class Mode
#> cell_network 8100 -none- numeric
#> gene_network 250000 -none- numeric
#> gene_cell_network 45000 -none- numeric
For the SNARE-seq dataset, the ConNetGNN()
results of scRNA-seq and scATAC-seq can be integrated into a combined gene-cell association network via InteNet()
.
library(ActivePathways)
library(parallel)
data(RNA_net)
data(ATAC_net)
RNA_ATAC_IntNet<-InteNet(RNA_net,ATAC_net,parallel.cores=1)
For the gene-cell association network constructed by ConNetGNN()
, the scPathway()
function uses RWR algorithm to calculate the pathway activity score of each cell.
library(parallel)
library(utils)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
scPathway_data<-scPathway(ConNetGNN_data,gmt.path=system.file("extdata", "KEGG_human.gmt", package = "scapGNN"),
parallel.cores=1)
Use the plotGANetwork()
function to plot the gene network of pathways in the specified cell phenotype.
library(igraph)
#> Warning: 程辑包'igraph'是用R版本4.1.3 来建造的
#>
#> 载入程辑包:'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
# Construct cell set.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset
# Construct gene set.
<-load_path_data(system.file("extdata", "KEGG_human.gmt", package = "scapGNN"))
pathways<-pathways[[which(names(pathways)=="Tight junction [PATH:hsa04530]")]]
geneset
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Tight junction [PATH:hsa04530]")
The plotGANetwork()
function can construct and display a cell community network.
require(igraph)
require(graphics)
data(ConNetGNN_data)
# Construct the cell phenotype vector.
<-colnames(ConNetGNN_data[["cell_network"]])
cell_id<-unlist(strsplit(cell_id,"_"))
temp<-temp[seq(2,length(temp)-1,by=3)]
cell_phennames(cell_id)<-cell_phen
head(cell_id)
#> 0h 0h 0h 0h 0h 0h
#> "H9_0h_1" "H9_0h_2" "H9_0h_3" "H9_0h_4" "H9_0h_5" "H9_0h_6"
plotCCNetwork(ConNetGNN_data,cell_id,edge.width=10)
The cpGModule()
can identify cell phenotype activated gene module.
require(parallel)
#> 载入需要的程辑包:parallel
require(stats)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
data(Hv_exp)
# Construct the cell set corresponding to 0h.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset<-cpGModule(ConNetGNN_data,cellset,parallel.cores=1)
H9_0h_cpGM_data#> Start RWR
#> Start perturbation analysis
summary(H9_0h_cpGM_data)
#> Genes AS Pvalue FDR
#> Length:5 Min. :0.002330 Min. :0.000000 Min. :0.0000
#> Class :character 1st Qu.:0.002377 1st Qu.:0.000000 1st Qu.:0.0000
#> Mode :character Median :0.002407 Median :0.000060 Median :0.0100
#> Mean :0.002454 Mean :0.000104 Mean :0.0118
#> 3rd Qu.:0.002546 3rd Qu.:0.000120 3rd Qu.:0.0150
#> Max. :0.002608 Max. :0.000340 Max. :0.0340
Use the plotGANetwork()
function to plot the gene network of module.
library(igraph)
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
data("H9_0h_cpGM_data")
# Construct cell set.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset
# Construct gene set.
<-H9_0h_cpGM_data$Genes
geneset
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Gene network of 0h cells activated gene module")
For multiple cellular phenotypes (e.g. classification, time stage, etc.), cpGModule()
identifies the activated gene modules individually. The plotMulPhenGM()
function then quantifies the activation strength of genes in different cell phenotypes and displays gene networks under multiple cell phenotypes.
require(igraph)
require(grDevices)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
# Obtain cpGModule results for each cell phenotype.
data(H9_0h_cpGM_data)
data(H9_24h_cpGM_data)
data(H9_36h_cpGM_data)
<-list(H9_0h=H9_0h_cpGM_data,H9_24h=H9_24h_cpGM_data,H9_36h=H9_36h_cpGM_data)
data.listplotMulPhenGM(data.list,ConNetGNN_data,margin=-0.05)
The single-cell pathway activity matrix can be docked with the Seurat
package to perform non-linear dimensional reduction, identify differentially active pathways, draw heat map, violin plot, and more.
library(dplyr)
library(Seurat)
library(patchwork)
data(scPathway_data)
data <- CreateSeuratObject(counts = scPathway_data, project = "Pathway")
all.genes <- rownames(data)
data <- ScaleData(data, features = all.genes,verbose = FALSE)
data <- RunPCA(data, features =all.genes,verbose = FALSE)
data <- FindNeighbors(data, verbose = FALSE)
data <- FindClusters(data, resolution = 0.5,verbose = FALSE)
data <- RunUMAP(data,dims = 1:15, verbose = FALSE)
cell_phen<-unlist(strsplit(colnames(scPathway_data),"_"))
cell_phen<-cell_phen[seq(2,length(cell_phen)-1,by=3)]
data@active.ident<-as.factor(cell_phen)
# Visualize non-linear dimensional reduction
DimPlot(data, reduction = "umap")
# Identify differentially active pathways
diff.pathways <- FindAllMarkers(data,verbose = FALSE)
# violin plot
VlnPlot(data, features = "MAPK signaling pathway [PATH:hsa04010]")
# heat map
DoHeatmap(data, features = row.names(scPathway_data))