The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
SifiNet (Single-cell feature identification with Network topology) is an R package of a clustering-independent method for directly identifying feature gene sets. Feature gene set are sets of gene exhibiting unique phenotypes in different cell subpopulations. SifiNet is designed for single cell RNA-seq data, finding sets of feature genes that are differentially expressed in some cell subpopulation. However, it can also find epigenomic feature gene sets when applied to single cell ATAC-seq data. This vignette gives an introduction to SifiNet package.
The development version of SifiNet can be installed from Github:
SifiNet method consists of two major sections, A and B. After loading the data, in section A we construct a gene co-expression network with genes as nodes and the co-expressions between genes as edges. And in section B, we identify the feature gene sets based on the gene co-expression network.
Suppose we already have a matrix of count data that we wish to find
feature gene sets from. Here is a toy example generated with the
complexSCsimulation
package:
# Load package
library(SiFINeT)
# Load data
data <- readRDS("toy_matrix.rds")
sifi_obj <- create_SiFINeT_object(counts = data, gene.name = NULL, meta.data = NULL,
data.name = NULL, sparse = FALSE, rowfeature = TRUE)
The count
matrix should be a gene (row) by cell (column)
matrix by default. Otherwise, for a cell (row) by gene (column) matrix,
the rowfeature
argument need to be set to
FALSE
. Name of the genes could be specified by the
gene.name
argument. Meta data of the cells could be input
using the meta.data
argument. If the input count matrix is
sparse (majority of the matrix is 0), we can set the sparse
argument to be TRUE
, which would potentially reduce the
space and time cost of the method. We can also assign a name to the
input data set by data.name
argument. Multiple count matrix
could be saved in a SifiNet object. The name of data set could help to
distinguish those inputs.
In SifiNet, we use quantile association to measure the co-expression
between genes. So, to construct the gene co-expression network, we first
divide the read counts in the matrix into “low” and “high” groups using
quantile regression in quantile_thres
function. If
sifi_obj@meta.data
has at least one column, then the
columns of sifi_obj@meta.data
are used as independent
variables. Otherwise, the mean total number of reads in each cell is
used as independent variable for the quantile regression. The results
are saved in sifi_obj@data.thres
. Based on the results of
the “low” and “high” group classification, we calculate the
co-expression score using feature_coexp
function. The
results are saved in sifi_obj@coexp
.
# build the co-expression network
sifi_obj <- create_network(sifi_obj, alpha = 0.05, manual = FALSE, least_edge_prop = 0.01)
Then we select a threshold for co-expression score. Gene pairs with
absolute value of co-expression score greater than the threshold would
be considered to have an edge between them in the network. This
procedure is done by function create_network
. By default,
an FDR control procedure is applied to select the threshold. Argument
alpha
is the type I error of the FDR control procedure.
After finding the threshold using the FDR control procedure, SifiNet
also allows an additional modification of the threshold. In case that
the threshold is overly strict, we can set a lower bound for the
proportion of edges (total number of edges divided by the total number
of different gene pairs) with argument least_edge_prop
.
Note the least_edge_prop
is only used when argument
manual
is TRUE
. The results are saved in
sifi_obj@est_ms
and sifi_obj@thres
.
# filter gene to improve the quality of the co-expression network
sifi_obj <- filter_lowexp(sifi_obj, t1 = 10, t2 = 0.9, t3 = 0.9)
After building the co-expression network, we further perform
additional gene filtering steps to improve the quality of the co-
expression network. Greater values of arguments t1
,
t2
, and t3
would lead to a less strict
filtering, keeping more genes in the network. The results are saved in
sifi_obj@kset
.
In section B, based on the topology structure of gene co-expression
network, we identify the feature gene sets. Using function
cal_connectivity
, we calculate the 1st, 2nd, and 3rd order
connectivities for each gene. The connectivities are measures of the
topology structure of gene co-expression network. Greater values of
argument m
and niter
would improve the
accuracy of 3rd order connectivity, but would also increase the
computation time. The results are saved in
sifi_obj@conn
.
# detect core feature gene sets
sifi_obj <- find_unique_feature(sifi_obj, t1 = 5, t2 = 0.4, t3 = 0.3, t1p = 5, t2p = 0.7,
t3p = 0.5, resolution = 1, min_set_size = 5)
After calculating the connectivities, we can detect candidate feature
genes. And then we identify the core feature genes in each of the
feature gene sets among the candidate feature genes. Core feature genes
are genes that show unique phenotypes in only one cell subpopulation
(not shared with other cell subpopulation). These steps are done by
function find_unique_feature
. Arguments t1
,
t2
, t3
, t1p
, t2p
,
t3p
are thresholds for 1st, 2nd, and 3rd order
connectivities. Greater input values of those arguments would lead to
less number of feature gene sets and feature genes.
resolution
argument is used for
cluster_louvain
function in igraph
package. We
require a feature gene set to have at least min_set_size
core feature genes to be detected. The results are saved in
sifi_obj@conn2
, sifi_obj@fg_id
,
sifi_obj@uni_fg_id
, sifi_obj@uni_cluster
,
sifi_obj@selected_cluster
and
sifi_obj@featureset
.
# assign shared feature genes to core feature gene sets
sifi_obj <- assign_shared_feature(sifi_obj, min_edge_prop = 0.5)
Then we assign other candidate feature genes (that are not identified
as core feature gene) into core feature gene sets. Those genes are
allowed to be shared by more than 1 core feature gene sets. The
assignment of shared feature gene is based on the proportion of edge
between a core feature gene set and the feature gene (number of edges
between the feature gene and genes in the core feature gene set divided
by number of genes in the core feature gene set). An assignment with
greater value of argument min_edge_prop
would have less
shared feature gene added. The results are saved in
sifi_obj@featureset
.
# enrich feature gene sets with other genes
sifi_obj <- enrich_feature_set(sifi_obj, min_edge_prop = 0.9)
In the last step, we enrich the feature gene sets (with both core and
shared feature gene) with other genes that are not identified previously
as feature gene. Those genes are also allowed to be shared by more than
1 core feature gene sets. Similarly, the assignment of enriched feature
gene is based on the proportion of edge between a feature gene set and
the gene (number of edges between the gene and genes in the feature gene
set divided by number of genes in the feature gene set). An assignment
with greater value of argument min_edge_prop
would have
less shared feature gene added. The results are saved in
sifi_obj@featureset
.
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SiFINeT_1.13
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.5 sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [5] tidyr_1.3.1 lattice_0.22-6 digest_0.6.35 magrittr_2.0.3
## [9] evaluate_0.23 grid_4.3.1 fastmap_1.1.1 jsonlite_1.8.8
## [13] Matrix_1.6-5 ggrepel_0.9.5 survival_3.5-8 gridExtra_2.3
## [17] formatR_1.14 purrr_1.0.2 fansi_1.0.6 viridisLite_0.4.2
## [21] scales_1.3.0 tweenr_2.0.3 jquerylib_0.1.4 cli_3.6.2
## [25] rlang_1.1.3 graphlayouts_1.1.1 polyclip_1.10-6 splines_4.3.1
## [29] tidygraph_1.3.1 munsell_0.5.0 withr_3.0.0 cachem_1.0.8
## [33] yaml_2.3.8 tools_4.3.1 MatrixModels_0.5-3 SparseM_1.81
## [37] memoise_2.0.1 dplyr_1.1.4 colorspace_2.1-0 ggplot2_3.5.0
## [41] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 MASS_7.3-60.0.1
## [45] ggraph_2.2.1 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.6.2
## [49] gtable_0.3.4 glue_1.7.0 Rcpp_1.0.12 ggforce_0.4.2
## [53] xfun_0.42 tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.16.0
## [57] knitr_1.45 farver_2.1.1 htmltools_0.5.7 igraph_2.0.3
## [61] rmarkdown_2.26 compiler_4.3.1 quantreg_5.97
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.