DUBStepR (Determining the Underlying Basis using Step-wise Regression) is a feature selection algorithm for cell type identification in single-cell RNA-sequencing data.
Feature selection, i.e. determining the optimal subset of genes to cluster cells into cell types, is a critical step in the unsupervised clustering of scRNA-seq data.
DUBStepR is based on the intuition that cell-type-specific marker genes tend to be well correlated with each other, i.e. they typically have strong positive and negative correlations with other marker genes. After filtering genes based on a correlation range score, DUBStepR exploits structure in the gene-gene correlation matrix to prioritize genes as features for clustering.
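The intuition above can be illustrated with a minimal base R sketch on simulated data. It is not DUBStepR's actual implementation: the toy genes, the two-group design, and the simplified score (strongest positive correlation minus strongest negative correlation) are all assumptions made for illustration, and the package's own correlation range score may be defined differently.

```r
set.seed(1)
n_cells <- 60
# Hypothetical toy data: two cell groups of 30 cells each
group <- rep(c(0, 1), each = n_cells / 2)

expr <- rbind(
  markerA = 2 * group + rnorm(n_cells, sd = 0.3),        # high in group 1
  markerB = 2 * group + rnorm(n_cells, sd = 0.3),        # high in group 1
  markerC = 2 * (1 - group) + rnorm(n_cells, sd = 0.3),  # high in group 0
  markerD = 2 * (1 - group) + rnorm(n_cells, sd = 0.3),  # high in group 0
  noise1  = rnorm(n_cells),                              # unrelated to the groups
  noise2  = rnorm(n_cells)
)

# Gene-gene correlation matrix (genes are rows, so correlate the transpose)
ggc <- cor(t(expr))
diag(ggc) <- NA  # ignore each gene's correlation with itself

# Simplified range score (an assumption, not the package's formula):
# strongest positive correlation minus strongest negative correlation
range_score <- apply(ggc, 1, function(r) max(r, na.rm = TRUE) - min(r, na.rm = TRUE))
print(round(sort(range_score, decreasing = TRUE), 2))
```

In this toy setting the marker genes have both a strongly positive and a strongly negative partner, so they score much higher than the noise genes, which is the property the filtering step relies on.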
DUBStepR requires R version 3.5.0 or later. Once you have confirmed that, you can install DUBStepR from CRAN using the following commands:
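# Install DUBStepR only if it is not already installed
# (.packages() without all.available = TRUE lists attached packages only)
if (!requireNamespace("DUBStepR", quietly = TRUE))
  install.packages("DUBStepR", repos = "https://cloud.r-project.org")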
#> package 'DUBStepR' successfully unpacked and MD5 sums checked
#>
#> The downloaded binary packages are in
#> C:\Users\ranjanb\AppData\Local\Temp\RtmpCSa3MZ\downloaded_packages
The installation should take ~ 30 seconds, not including dependencies.
After installation, load DUBStepR using the following command:
library(DUBStepR)
For users new to single-cell RNA-sequencing data analysis, we recommend using DUBStepR with the Seurat package (Satija, R. et al. Nature Biotechnol. 33.5(2015):495.). Below is a tutorial using the Seurat package.
Seurat can be installed and loaded into your R environment using the following commands:
# install.packages(c("Seurat", "hdf5r"), repos = "https://cloud.r-project.org")
library(Seurat)
library(dplyr)
Here, we use a publicly available PBMC dataset generated by 10X Genomics, specifically the Feature / cell matrix HDF5 (filtered) file. The dataset is available for download from 10X Genomics.
Place the file in your working directory and load the data into a Seurat object as follows:
seuratObj <- CreateSeuratObject(counts = Read10X_h5("pbmc_1k_v2_filtered_feature_bc_matrix.h5"), assay = "RNA", project = "1k_PBMC")
seuratObj
#> An object of class Seurat
#> 33538 features across 996 samples within 1 assay
#> Active assay: RNA (33538 features, 0 variable features)
For DUBStepR, we recommend log-normalizing your data. That can be performed in Seurat using the following command:
seuratObj <- NormalizeData(object = seuratObj, normalization.method = "LogNormalize")
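As a rough illustration of what log-normalization does, the base R sketch below divides each cell's counts by that cell's total, multiplies by a scale factor, and applies `log1p`. The toy count matrix is hypothetical, and the scale factor of 10,000 is assumed to match Seurat's default `scale.factor`. It is a sketch of the transformation, not Seurat's own code.

```r
# Hypothetical toy count matrix: 3 genes (rows) x 2 cells (columns)
counts <- matrix(c(1, 0, 9,
                   5, 5, 0),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

scale_factor <- 1e4  # assumed to match Seurat's default scale.factor

# Divide each cell (column) by its total count, rescale, then take log(1 + x)
lognorm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
print(round(lognorm, 3))
```

Zero counts stay at zero after the transformation, and cells with different sequencing depths are put on a comparable scale before correlations are computed.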
DUBStepR can be inserted into the Seurat workflow at this stage, and we recommend that be done in the following manner:
dubstepR.out <- DUBStepR(input.data = seuratObj@assays$RNA@data, min.cells = 0.05*ncol(seuratObj), optimise.features = TRUE, k = 10, num.pcs = 20, error = 0)
#> Dimensions of input data: 33538Dimensions of input data: 996
#>
#> Running DUBStepR...
#>
#> Expression Filtering Done.
#> Mitochondrial, Ribosomal and Pseudo Genes Filtering Done.
#>
#> Computing GGC...
#> Done.
#>
#> Running Stepwise Regression...
#>
#> Done.
#>
#> Adding correlated features...
#>
#> Done.
#> Determining optimal feature set...
#> Done.
# Store the DUBStepR feature genes as the variable features of the Seurat object
VariableFeatures(seuratObj) <- dubstepR.out$optimal.feature.genes
seuratObj
This step could take up to 1 minute on a typical desktop computer.
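The `min.cells = 0.05*ncol(seuratObj)` argument in the call above sets a threshold of 5% of the cells. A minimal sketch of such an expression filter follows, assuming `min.cells` means the minimum number of cells in which a gene must be detected. That interpretation, the toy matrix, and the use of `>=` are assumptions, and the package's internal filtering may differ.

```r
set.seed(42)
# Hypothetical toy matrix: 4 genes (rows) x 100 cells (columns)
n_cells <- 100
expr <- rbind(
  common   = rpois(n_cells, lambda = 2) + 1,       # detected in every cell
  moderate = c(rep(1, 10), rep(0, n_cells - 10)),  # detected in 10 cells
  rare     = c(rep(1, 3), rep(0, n_cells - 3)),    # detected in 3 cells
  absent   = rep(0, n_cells)                       # never detected
)

min_cells <- 0.05 * ncol(expr)       # 5% of the cells, as in the call above
cells_detected <- rowSums(expr > 0)  # number of cells with non-zero expression
kept <- rownames(expr)[cells_detected >= min_cells]
print(kept)
```

Genes detected in very few cells carry little information about shared structure between cells, so removing them first also shrinks the gene-gene correlation matrix that has to be computed.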
Following Seurat’s recommendations, we scale the gene expression data and run Principal Component Analysis (PCA). We then visualize the standard deviation of PCs using an elbow plot and select the number of PCs we think is sufficient to explain the variance in the dataset.
seuratObj <- ScaleData(seuratObj, features = rownames(seuratObj))
#> Centering and scaling data matrix
seuratObj <- RunPCA(seuratObj, features = VariableFeatures(object = seuratObj), npcs = 30)
#> PC_ 1
#> Positive: MALAT1, CTSW, GZMA, HLA-C, CST7, CCL5, PRF1, NKG7, CD79A, EEF1A1
#> GNLY, CD79B, IGHM, KLRD1, MS4A1, IGHD, TRDC, TCL1A, LINC00926, IGKC
#> KLRF1, BANK1, GZMB, SPON2, FGFBP2, B2M, FCGR3A, TMSB10, HLA-DPA1, HLA-DPB1
#> Negative: LYZ, S100A9, CST3, FCN1, S100A8, CSTA, LGALS1, LST1, S100A12, MNDA
#> CTSS, AIF1, FTL, AC020656.1, TYROBP, VCAN, MS4A6A, TYMP, PSAP, FOS
#> S100A6, FTH1, FGL2, SERPINA1, AP1S2, GRN, FCER1G, NPC2, CD36, LGALS2
#> PC_ 2
#> Positive: CD79A, IGHM, CD79B, MS4A1, IGHD, IGKC, TCL1A, HLA-DRB1, LINC00926, HLA-DRA
#> HLA-DPA1, BANK1, HLA-DPB1, CD74, EEF1A1, FTH1, NPC2, CTSS, FTL, TMSB10
#> GRN, LGALS2, AP1S2, FOS, FGL2, MALAT1, MS4A6A, LST1, SERPINA1, TYMP
#> Negative: NKG7, PRF1, GNLY, GZMA, CST7, CTSW, KLRD1, KLRF1, TRDC, GZMB
#> SPON2, FGFBP2, B2M, CCL5, S100A4, SRGN, FCGR3A, HLA-C, ACTB, FCER1G
#> TYROBP, S100A6, LGALS1, S100A8, S100A12, PSAP, VCAN, LYZ, S100A9, AC020656.1
#> PC_ 3
#> Positive: EEF1A1, MALAT1, S100A12, S100A6, VCAN, FOS, CD36, S100A8, AIF1, AC020656.1
#> S100A9, S100A4, MNDA, MS4A6A, LYZ, B2M, CSTA, FCN1, AP1S2, LGALS2
#> LST1, SERPINA1, FGL2, CST3, GRN, TYMP, LGALS1, FTH1, NPC2, CTSS
#> Negative: CD74, GZMB, KLRF1, GNLY, KLRD1, SPON2, HLA-DPA1, HLA-DPB1, FGFBP2, HLA-DRB1
#> CD79B, HLA-DRA, PRF1, CD79A, IGHM, NKG7, TRDC, IGHD, MS4A1, CST7
#> TCL1A, FCGR3A, IGKC, LINC00926, GZMA, BANK1, FCER1G, CTSW, TYROBP, ACTB
#> PC_ 4
#> Positive: FCGR3A, HLA-C, HLA-DPA1, EEF1A1, TMSB10, HLA-DPB1, FTH1, B2M, ACTB, S100A4
#> NPC2, PSAP, AIF1, HLA-DRB1, CD74, SERPINA1, LST1, S100A6, TYMP, HLA-DRA
#> FCER1G, CST3, GRN, FTL, FGL2, LGALS1, AP1S2, LGALS2, CTSS, TYROBP
#> Negative: S100A12, VCAN, AC020656.1, S100A8, CD36, S100A9, MS4A6A, IGHD, MNDA, TCL1A
#> IGHM, LYZ, LINC00926, FOS, CD79A, TRDC, IGKC, CSTA, MS4A1, SPON2
#> KLRF1, GNLY, KLRD1, PRF1, FGFBP2, BANK1, FCN1, GZMA, CST7, SRGN
#> PC_ 5
#> Positive: CCL5, ACTB, B2M, SRGN, FTH1, HLA-C, FTL, BANK1, CST3, GZMA
#> CTSW, NKG7, IGKC, HLA-DPA1, CD79A, MS4A1, HLA-DPB1, CST7, LINC00926, S100A4
#> HLA-DRB1, NPC2, CD79B, IGHM, HLA-DRA, CD36, CD74, PSAP, TCL1A, MNDA
#> Negative: TMSB10, MALAT1, EEF1A1, FGFBP2, SPON2, GZMB, KLRF1, KLRD1, FOS, GNLY
#> FCGR3A, TRDC, TYROBP, GRN, AIF1, LGALS1, CTSS, VCAN, TYMP, LGALS2
#> FGL2, MS4A6A, FCER1G, SERPINA1, FCN1, CSTA, AC020656.1, S100A12, PRF1, S100A8
ElbowPlot(seuratObj, ndims = 30)
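To make the elbow-plot reasoning concrete, here is a minimal base R sketch using `prcomp` on simulated data. The toy matrix, with two latent factors plus noise, is hypothetical, and `prcomp` stands in for Seurat's PCA only for illustration. The values printed are per-PC standard deviations, the quantity an elbow plot displays.

```r
set.seed(7)
# Hypothetical toy data: 200 cells (rows) x 10 genes (columns)
n_cells <- 200
f1 <- rnorm(n_cells)  # first latent factor
f2 <- rnorm(n_cells)  # second latent factor
mat <- cbind(
  sapply(1:4, function(i) 3 * f1 + rnorm(n_cells)),  # genes driven by factor 1
  sapply(1:4, function(i) 2 * f2 + rnorm(n_cells)),  # genes driven by factor 2
  matrix(rnorm(n_cells * 2), ncol = 2)               # pure noise genes
)

# Centre and scale each gene, then run PCA
pca <- prcomp(mat, center = TRUE, scale. = TRUE)

# Standard deviation of each PC, in decreasing order
print(round(pca$sdev, 2))
```

In this toy example the standard deviation drops sharply after the second PC, which is where the "elbow" would be read off; on real data the drop is usually more gradual and the choice involves some judgment.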
We compute UMAP coordinates from the first 10 PCs, then plot the first nine feature genes selected by DUBStepR to show cell-type-specific expression.
seuratObj <- RunUMAP(seuratObj, dims = 1:10, n.components = 2, seed.use = 2019)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
#> 12:00:04 UMAP embedding parameters a = 0.9922 b = 1.112
#> 12:00:04 Read 996 rows and found 10 numeric columns
#> 12:00:04 Using Annoy for neighbor search, n_neighbors = 30
#> 12:00:04 Building Annoy index with metric = cosine, n_trees = 50
#> 0% 10 20 30 40 50 60 70 80 90 100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 12:00:04 Writing NN index file to temp file C:\Users\ranjanb\AppData\Local\Temp\RtmpCSa3MZ\filebe8c53e47d18
#> 12:00:04 Searching Annoy index using 1 thread, search_k = 3000
#> 12:00:04 Annoy recall = 100%
#> 12:00:04 Commencing smooth kNN distance calibration using 1 thread
#> 12:00:04 Initializing from normalized Laplacian + noise
#> 12:00:04 Commencing optimization for 500 epochs, with 39618 positive edges
#> 12:00:07 Optimization finished
FeaturePlot(seuratObj, features = VariableFeatures(object = seuratObj)[1:9], cols = c("lightgrey", "magenta"))
Using known marker genes, we highlight cell-type-specific regions of the UMAP.
FeaturePlot(seuratObj, features = c("MS4A1", "NKG7", "CD3E", "IL7R", "CD8A", "CD14", "CST3", "FCGR3A", "PPBP"))
We select 10 PCs for clustering, and visualize the cells in a 2D UMAP.
seuratObj <- FindNeighbors(seuratObj, reduction = "pca", dims = 1:10)
#> Computing nearest neighbor graph
#> Computing SNN
seuratObj <- FindClusters(seuratObj)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 996
#> Number of edges: 32747
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8320
#> Number of communities: 9
#> Elapsed time: 0 seconds
DimPlot(seuratObj, reduction = "umap", label = TRUE, pt.size = 0.5, repel = TRUE, label.size = 5)
Identifying top 10 marker genes of each cluster
top.10.markers <- FindAllMarkers(object = seuratObj, assay = "RNA", logfc.threshold = 0.5, min.pct = 0.5, only.pos = TRUE) %>%
  filter(p_val_adj < 0.1) %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)
#> Calculating cluster 0
#> Calculating cluster 1
#> Calculating cluster 2
#> Calculating cluster 3
#> Calculating cluster 4
#> Calculating cluster 5
#> Calculating cluster 6
#> Calculating cluster 7
#> Calculating cluster 8
DoHeatmap(object = seuratObj, features = unique(top.10.markers$gene), size = 5)
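The filter-then-rank pattern used to pick the top markers per cluster can be tried on a small table. The data frame below is entirely hypothetical and only mimics the column names used above. The sketch uses `slice_max()`, which dplyr (1.0.0 and later) offers as the successor to `top_n()`.

```r
library(dplyr)

# Hypothetical marker table mimicking the columns used in the pipeline above
markers <- data.frame(
  gene       = c("g1", "g2", "g3", "g4", "g5", "g6"),
  cluster    = factor(c(0, 0, 0, 1, 1, 1)),
  avg_log2FC = c(2.0, 1.5, 0.8, 3.1, 0.9, 2.2),
  p_val_adj  = c(0.01, 0.20, 0.03, 0.001, 0.04, 0.05),
  stringsAsFactors = FALSE
)

top2 <- markers %>%
  filter(p_val_adj < 0.1) %>%                  # keep significant markers only
  group_by(cluster) %>%                        # rank within each cluster
  slice_max(order_by = avg_log2FC, n = 2) %>%  # two largest fold changes per cluster
  ungroup()

print(top2)
```

Filtering on the adjusted p-value before ranking matters: here `g2` has the second-largest fold change in cluster 0 but is excluded because it is not significant.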
Annotating clusters using gene expression
cell.types <- c("0" = "CD14+ Monocytes", "5" = "Inflammatory CD14+ Monocytes", "1" = "Naive CD4+ T cells", "3" = "Memory CD4+ T cells", "4" = "Naive CD8+ T cells", "2" = "B cells", "6" = "NK cells", "7" = "CD16+ Monocytes", "8" = "Platelets")
seuratObj <- RenameIdents(seuratObj, cell.types)
DimPlot(seuratObj, reduction = "umap", label = TRUE, pt.size = 1, repel = TRUE, label.size = 5) + NoLegend()
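The renaming step amounts to a lookup from cluster ID to label. A minimal base R sketch of that lookup follows, using hypothetical cluster assignments for six cells and a subset of the labels above. It illustrates the mapping only and does not reproduce what `RenameIdents` does internally.

```r
# Cluster assignments for six hypothetical cells
clusters <- factor(c("0", "2", "2", "8", "1", "0"))

# Subset of the lookup used above: names are cluster IDs, values are labels
cell.types <- c("0" = "CD14+ Monocytes", "1" = "Naive CD4+ T cells",
                "2" = "B cells", "8" = "Platelets")

# Guard: every cluster present in the data should have a label
stopifnot(all(levels(clusters) %in% names(cell.types)))

# Index by name (as character) so the lookup uses cluster IDs,
# not the factor's internal integer codes
labels <- unname(cell.types[as.character(clusters)])
print(labels)
```

Converting the factor to character before indexing is the important detail: indexing with the factor directly would use its integer codes and could silently assign the wrong label.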