Alternative title: blindly digging for cell types in scRNA-seq clusters with clustermole
A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data involves clustering of cells. Assignment of cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search. This is especially challenging if you are not familiar with all the captured subpopulations or have unexpected contaminants.
clustermole is an R package that provides a comprehensive meta-collection of cell identity markers for thousands of human and mouse cell types, sourced from a variety of databases, as well as methods to query them.
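The clustermole package includes three primary features:
- overrepresentation analysis of a gene list against known cell type markers (clustermole_overlaps())
- cell type enrichment based on a full gene expression matrix (clustermole_enrichment())
- a database of markers for thousands of cell types (clustermole_markers())
Install clustermole if it is not yet available on your system.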
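A minimal sketch of the installation step, assuming the package is installed from CRAN with the standard install.packages() call. The requireNamespace() guard is an illustrative addition so that installation only happens when the package is missing.

```r
# Install clustermole from CRAN only if it is not already installed
# (the guard is an illustrative addition, not something the package requires)
if (!requireNamespace("clustermole", quietly = TRUE)) {
  install.packages("clustermole")
}
```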
Load clustermole.
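Loading the package would typically be a single library() call:

```r
# Attach clustermole so its functions can be called without the clustermole:: prefix
library(clustermole)
```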
If you have a set of genes (for example, cluster markers), you can perform overrepresentation analysis to see if they overlap any of the known cell type markers.
my_genes = c("CD2", "CD3D", "CD3E", "IL7R", "IL32", "LTB", "LDHB", "CCR7")
my_overlaps = clustermole_overlaps(genes = my_genes, species = "hs")
my_overlaps
#> # A tibble: 2,563 x 9
#> db species organ celltype celltype_full n_genes overlap p_value fdr
#> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Pangl… Human Immu… T memor… T memory cel… 54 6 3.90e-15 5.51e-12
#> 2 SCSig Human Cent… Fan_Emb… Fan_Embryoni… 150 7 4.30e-15 5.51e-12
#> 3 Pangl… Mouse Immu… T memor… T memory cel… 57 6 7.56e-15 6.46e-12
#> 4 CellM… Human Peri… T cell T cell | Per… 19 5 1.58e-14 1.01e-11
#> 5 CellM… Human Kidn… T helpe… T helper cel… 5 4 4.06e-14 2.08e-11
#> 6 Pangl… Human Immu… T cells T cells | Im… 95 6 1.31e-13 5.12e-11
#> 7 Pangl… Mouse Immu… T cells T cells | Im… 93 6 1.40e-13 5.12e-11
#> 8 SaVanT "" "" CD3plus… CD3plus_T-ce… 50 5 2.87e-12 5.65e-10
#> 9 SaVanT Human "" HPCA_T_… HPCA_T_cells… 50 5 2.87e-12 5.65e-10
#> 10 SaVanT Mouse "" IMGN_T_… IMGN_T_4Nve_… 50 5 2.87e-12 5.65e-10
#> # … with 2,553 more rows
If you have a table of expression values (for example, average expression across clusters), you can perform cell type enrichment on that gene expression matrix. The matrix should contain log-transformed CPM/TPM/FPKM values.
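As a rough sketch of the input this step expects, the block below builds a tiny toy matrix with genes in rows and clusters in columns and log-transforms it. The values, the cluster names, and the genes MS4A1 and LYZ are made up for illustration; a real matrix would cover the whole transcriptome. The commented call at the end assumes the arguments are named expr_mat and species, so check ?clustermole_enrichment before relying on it.

```r
# Toy average-expression matrix (made-up CPM values): genes x clusters
avg_cpm <- matrix(
  c(120,   3,   0,
     85,   1,   2,
      0, 240,   5,
      4,   0, 310),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("CD3D", "CD3E", "MS4A1", "LYZ"),
    c("cluster_0", "cluster_1", "cluster_2")
  )
)

# Log-transform with a pseudocount of 1 so that zeros stay at zero
my_expr_mat <- log2(avg_cpm + 1)

# With a full-size matrix, the enrichment call would look roughly like this
# (argument names are an assumption, see the package help page):
# my_enrichment <- clustermole_enrichment(expr_mat = my_expr_mat, species = "hs")
```

The toy matrix is only meant to show the layout (gene symbols as row names, one column per cluster); it is far too small for a meaningful enrichment result, which is why the call itself is left commented out.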
You can retrieve a data frame of all cell type markers in the database.
markers = clustermole_markers(species = "hs")
markers
#> # A tibble: 163,509 x 8
#> db species organ celltype celltype_full n_genes gene_original gene
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACCSL ACCSL
#> 2 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACVR1B ACVR…
#> 3 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ARHGEF16 ARHG…
#> 4 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ASF1B ASF1B
#> 5 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BCL2L10 BCL2…
#> 6 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BLCAP BLCAP
#> 7 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BNIP1 BNIP1
#> 8 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf210 C1or…
#> 9 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf226 C1or…
#> 10 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 CASC3 CASC3
#> # … with 163,499 more rows
Each row contains a gene and a cell type associated with it. The gene column is the gene symbol (human or mouse versions can be retrieved) and the celltype_full column contains the full cell type string, including the species and the original database.
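To retrieve mouse gene symbols instead of human ones, a call along the following lines should work. This is a sketch that assumes the species argument accepts "mm" for mouse in the same way it accepts "hs" for human; check ?clustermole_markers to confirm.

```r
library(clustermole)

# Assumption: species = "mm" returns mouse gene symbols in the gene column
markers_mm <- clustermole_markers(species = "mm")

# Compare the symbol as given by the source database with the returned symbol
head(markers_mm[, c("celltype_full", "gene_original", "gene")])
```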
If you need to convert the markers from a data frame to a list format for other applications, you can use gene as the values and celltype_full as the grouping variable.
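One way to do this conversion is base R's split(). The sketch below uses a small toy data frame with made-up cell type labels (type_A, type_B) so that it runs on its own.

```r
# Toy stand-in for the markers data frame (made-up labels, two columns only)
toy_markers <- data.frame(
  gene = c("CD3D", "CD3E", "CCR7", "LTB"),
  celltype_full = c("type_A", "type_A", "type_B", "type_B"),
  stringsAsFactors = FALSE
)

# One character vector of genes per cell type
markers_list <- split(x = toy_markers$gene, f = toy_markers$celltype_full)
markers_list
```

The same call applied to the real table would be split(x = markers$gene, f = markers$celltype_full), giving a named list with one element per celltype_full value.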
We will use dplyr to help with summary statistics.
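Loading both packages might look like this. The pipe (%>%), distinct(), and count() used in the summaries come from dplyr.

```r
# clustermole provides the marker database; dplyr provides %>%, distinct(), and count()
library(clustermole)
library(dplyr)
```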
Check the number of available cell types.
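A minimal sketch of this check, assuming that each celltype_full string identifies one cell type entry, so that counting distinct values gives the number of cell types:

```r
library(clustermole)
library(dplyr)

markers <- clustermole_markers(species = "hs")

# Count the distinct celltype_full strings
# (assumption: one string per cell type entry)
n_celltypes <- markers %>%
  distinct(celltype_full) %>%
  nrow()
n_celltypes
```

For the database version shown in this document, the result would be expected to agree with the 2,563 rows of the clustermole_overlaps() output above, since that table appears to have one row per cell type.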
Check the number of available cell types per species (the species is not specified for every cell type).
markers %>% distinct(celltype_full, species) %>% count(species, sort = TRUE)
#> # A tibble: 3 x 2
#> species n
#> <chr> <int>
#> 1 Human 1618
#> 2 Mouse 730
#> 3 "" 215
Check the number of available cell types per organ (the organ is not specified for every cell type).
markers %>% distinct(celltype_full, organ) %>% count(organ, sort = TRUE)
#> # A tibble: 117 x 2
#> organ n
#> <chr> <int>
#> 1 "" 1282
#> 2 Brain 127
#> 3 Central Nervous System 88
#> 4 Digestive System 63
#> 5 Kidney 56
#> 6 Lung 52
#> 7 Bone marrow 51
#> 8 Immune system 50
#> 9 Peripheral blood 46
#> 10 Hematopoietic system 44
#> # … with 107 more rows
Check package version.
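The installed version can be checked with packageVersion() from base R's utils package:

```r
# Report the installed clustermole version
packageVersion("clustermole")
```

Since the marker database is bundled with the package, the counts shown above may differ between versions.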