The diem
package provides a lightweight solution to clean up single-cell
data during the pre-processing step. The purpose
of diem
is to remove droplets that contain extranuclear or extracellular
RNA. diem
is designed for droplet-based single-cell RNA-seq and has been
tested on 10X Chromium data.
This tutorial will use the publicly available 10X single-nucleus RNA-seq data that was generated from fresh mouse brain tissue. We will use a subset of the data that is available with the package. The full data set is available here.
diem
takes as input the raw counts from a droplet-based single-cell
experiment such as from 10X. Since the raw data consists of roughly
20,000 genes and over 100,000 droplets, the counts data has to be
stored as a sparse matrix from the R package
Matrix.
Furthermore, diem
takes advantage of the matrix manipulation
functions in the Matrix
package to speed up computations. The R
object that wraps data access and the diem
functions is the
SCE
object. It is important to note here that diem
works best
when all droplets generated by the experiment are included, even
those with only 1 read mapping to a gene. They provide extensive
information on the RNA profile of the background distribution.
If you have 10X data that has been aligned with
CellRanger,
then the raw data will be stored in outs/raw_feature_bc_matrix
in
the output directory (for CellRanger v3). We provide a convenience function
read_10x
that can read the 10X output into a sparse Matrix.
library(diem)
# In the diem directory
counts <- read_10x("../tests/testdata/")
dim(counts)
#> [1] 1001 2451
class(counts)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
This function reads in the counts present in the tests/testdata
directory into a sparse matrix. This subset of the mouse brain raw data
from 10X contains 1,001 genes and 2,457 droplets.
The raw counts in a sparse matrix forms the starting point for
pre-processing with diem
. To create the SCE
object:
mb_small <- create_SCE(counts, name="MouseBrain")
dim(mb_small)
#> [1] 1001 2451
class(mb_small)
#> [1] "SCE"
#> attr(,"package")
#> [1] "diem"
The SCE object mb_small
stores the counts as well as the meta data
on the droplets:
drop_data <- droplet_data(mb_small)
head(drop_data)
#> total_counts n_genes
#> CTGTTTATCTCAAGTG 744 343
#> CATTATCGTATCGCAT 633 311
#> CGTGTAACACAGCGTC 1663 441
#> GTAACTGTCGAACTGT 305 175
#> ACGCAGCGTTACAGAA 718 334
#> CAAGAAAAGTTACGGG 646 341
summary(drop_data)
#> total_counts n_genes
#> Min. : 54.0 Min. : 9.00
#> 1st Qu.: 92.0 1st Qu.: 27.00
#> Median : 180.0 Median : 41.00
#> Mean : 318.5 Mean : 80.83
#> 3rd Qu.: 387.5 3rd Qu.: 77.00
#> Max. :5375.0 Max. :705.00
Useful diagnostics for assessing droplets include the percent of reads that align to the mitochondrial genome, in addition to percent that align the nuclear-localized MALAT1. We can generate this using the following:
mt_genes <- grep(pattern = "^mt-", x = rownames(mb_small@gene_data),
ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = mt_genes, name = "pct.mt")
genes <- grep(pattern = "^malat1$", x = rownames(mb_small@gene_data),
ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = genes, name = "MALAT1")
drop_data <- droplet_data(mb_small)
summary(drop_data)
#> total_counts n_genes pct.mt MALAT1
#> Min. : 54.0 Min. : 9.00 Min. : 1.671 Min. : 0.0000
#> 1st Qu.: 92.0 1st Qu.: 27.00 1st Qu.:53.862 1st Qu.: 0.2738
#> Median : 180.0 Median : 41.00 Median :76.923 Median : 1.2232
#> Mean : 318.5 Mean : 80.83 Mean :67.531 Mean : 3.5747
#> 3rd Qu.: 387.5 3rd Qu.: 77.00 3rd Qu.:89.529 3rd Qu.: 3.4833
#> Max. :5375.0 Max. :705.00 Max. :98.701 Max. :70.5285
Before filtering, it's good plot the distribution of reads and MT%.
datf <- drop_data
par(mfrow=c(2,2))
plot(datf$total_counts, datf$n_genes,
xlab="Total Counts", ylab="Number Genes")
plot(datf$n_genes, datf$pct.mt,
xlab="Number Genes", ylab="MT%")
plot(datf$n_genes, datf$MALAT1,
xlab="Number Genes", ylab="MALAT1%")
plot(datf$pct.mt, datf$MALAT1,
xlab="MT%", ylab="MALAT1%")
The data suggests we can separate out 2 clusters, consisting of one low count/high MT% cluster, and one high count/low MT% cluster.
The following outlines the steps involved in diem
:
To run diem
, we first have to specify which droplets we would like to:
1 and 2 are mutually exclusive, while 3 is a subset of 1. It is helpful to use the above plots, as well as a barcode-rank plot, to determine the droplet sets.
barcode_rank_plot(mb_small, title = "MouseBrain")
Since this is only a subset, the tail of the low-count droplets in the
barcode-rank plot is much smaller. Typical data sets will have
tens to hundreds of thousands of droplets more than what is shown here.
The choice of the cutoff to fix debris and classify droplets is somewhat
arbitrary and depends on subjective interpretation of what will likely
be included in the debris. We should select the cutoff to fix as much
of the debris droplets into the debris group, while allowing to classify
droplets that may contain cells or nuclei. To set the groups, we use
the function set_debris_test_set
:
mb_small <- set_debris_test_set(mb_small,
top_n = 2000,
min_counts = 100,
min_genes = 100)
length(mb_small@test_set)
#> [1] 451
length(mb_small@bg_set)
#> [1] 2000
This results in 457 test droplets and 2,000 debris droplets. The function
call states that the test set (droplets to classify) must have at least
100 total counts and 100 genes detected. Furthermore, only consider the
top 2,000 droplets ranked by counts. Anything below this, regardless
of total counts and genes detected, will be fixed to debris. This option
isn't necessary here but we put it here to illustrate it. This can
help with variable total counts when we have a good approximation of the
maximum number of nuclei or cells that can be captured. To turn this
off, set top_n = Inf
.
Now that we set the test and debris droplets, we remove lowly expressed genes. We use counts per million mapped reads (CPM) to set the filtering criteria.
mb_small <- filter_genes(mb_small, cpm_thresh = 10)
genes <- gene_data(mb_small)
summary(genes)
#> mean total_counts n_cells exprsd
#> Min. : 0.02203 Min. : 54.0 Min. : 50.0 Mode:logical
#> 1st Qu.: 0.05508 1st Qu.: 135.0 1st Qu.: 117.0 TRUE:1001
#> Median : 0.06936 Median : 170.0 Median : 143.0
#> Mean : 0.31817 Mean : 779.8 Mean : 197.9
#> 3rd Qu.: 0.10282 3rd Qu.: 252.0 3rd Qu.: 196.0
#> Max. :41.12893 Max. :100807.0 Max. :2444.0
All genes are expressed here (since this is a subset including the expressed genes).
Before running EM on the multinomial mixture model, we need to
estimate the number of mixtures and initialize the mean parameters
for each multinomial. We do this by non-parametric clustering of
of droplets that are likely to contain nuclei. Here we use the top
200 ranked by gene to estimate the cell types. Alternatively, you
can order_by = "count"
.
mb_small <- set_cluster_set(mb_small,
cluster_n = 200,
order_by = "gene")
length(mb_small@cluster_set)
#> [1] 201
The number of cell types are estimated by running Louvain clustering
on the cluster set defined above. Briefly, this involves extracting
the variable genes, normalizing the counts, generating a weighted
k-nearest graph, and clustering the graph. This is done with a
single function initialize_clusters
. You can use variable genes
or all genes by setting the use_var
parameter, along with
setting the number of variable genes with n_var
.
mb_small <- initialize_clusters(mb_small,
use_var = TRUE,
n_var = 200,
verbose = TRUE)
#> finding 30 nearest neighbors
#> initializing clusters
#> initialized k=3 cell types and 1 debris cluster
This identified 3 distinct cell types in addition to the debris set. Now that we have the number of mixtures and their initializations, we can run EM to estimate the parameters and classify the droplets.
Use the run_em
function to estimate the parameters of the multinomial
mixture and calculate droplet probabilities:
mb_small <- run_em(mb_small, verbose = TRUE)
#> running EM
#> iteration 1; llk -685549.699955759
#> iteration 2; llk -685311.551269961
#> iteration 3; llk -685194.624627864
#> iteration 4; llk -685148.812551972
#> iteration 5; llk -685146.83935549
#> iteration 6; llk -685146.223223009
#> iteration 7; llk -685146.202654638
#> iteration 8; llk -685145.914947313
#> iteration 9; llk -685136.244165002
#> iteration 10; llk -685127.300109299
#> iteration 11; llk -685124.534463941
#> iteration 12; llk -685120.580378233
#> iteration 13; llk -685116.900326752
#> iteration 14; llk -685116.683921089
#> iteration 15; llk -685116.578755806
#> iteration 16; llk -685116.532325537
#> iteration 17; llk -685116.479264071
#> iteration 18; llk -685116.404038732
#> iteration 19; llk -685116.32091811
#> iteration 20; llk -685116.268584721
#> iteration 21; llk -685116.252728651
#> iteration 22; llk -685116.25015957
#> converged!
#> finished EM
drop_data <- droplet_data(mb_small)
summary(drop_data)
#> total_counts n_genes pct.mt MALAT1
#> Min. : 54.0 Min. : 9.00 Min. : 1.671 Min. : 0.0000
#> 1st Qu.: 92.0 1st Qu.: 27.00 1st Qu.:53.862 1st Qu.: 0.2738
#> Median : 180.0 Median : 41.00 Median :76.923 Median : 1.2232
#> Mean : 318.5 Mean : 80.83 Mean :67.531 Mean : 3.5747
#> 3rd Qu.: 387.5 3rd Qu.: 77.00 3rd Qu.:89.529 3rd Qu.: 3.4833
#> Max. :5375.0 Max. :705.00 Max. :98.701 Max. :70.5285
#> CleanProb Cluster ClusterProb
#> Min. :0.0000 1:2086 Min. :0.9297
#> 1st Qu.:0.0000 2: 167 1st Qu.:1.0000
#> Median :0.0000 3: 57 Median :1.0000
#> Mean :0.1489 4: 141 Mean :0.9999
#> 3rd Qu.:0.0000 3rd Qu.:1.0000
#> Max. :1.0000 Max. :1.0000
This provides the probability that a droplet belongs to one of the k cell
types or debris. The column CleanProb
gives 1 minus the probability that
the droplet belongs to the debris group. The Cluster
column gives the
group with the highest probability, while ClusterProb
gives the probability
of that cluster.
To classify a droplet as debris or nucleus/cell, set a threshold on
CleanProb
.
mb_small <- call_targets(mb_small,
pp_thresh = 0.95,
min_genes = 100)
drop_data <- droplet_data(mb_small)
summary(drop_data)
#> total_counts n_genes pct.mt MALAT1
#> Min. : 54.0 Min. : 9.00 Min. : 1.671 Min. : 0.0000
#> 1st Qu.: 92.0 1st Qu.: 27.00 1st Qu.:53.862 1st Qu.: 0.2738
#> Median : 180.0 Median : 41.00 Median :76.923 Median : 1.2232
#> Mean : 318.5 Mean : 80.83 Mean :67.531 Mean : 3.5747
#> 3rd Qu.: 387.5 3rd Qu.: 77.00 3rd Qu.:89.529 3rd Qu.: 3.4833
#> Max. :5375.0 Max. :705.00 Max. :98.701 Max. :70.5285
#> CleanProb Cluster ClusterProb Call
#> Min. :0.0000 1:2086 Min. :0.9297 Clean : 365
#> 1st Qu.:0.0000 2: 167 1st Qu.:1.0000 Debris:2086
#> Median :0.0000 3: 57 Median :1.0000
#> Mean :0.1489 4: 141 Mean :0.9999
#> 3rd Qu.:0.0000 3rd Qu.:1.0000
#> Max. :1.0000 Max. :1.0000
This gives 371 clean droplets and 2,086 debris. The majority of the 2,086 droplets had very low counts and would have been excluded anyways. To see how many of the droplets with at least 100 genes detected were removed as debris, we can run the following:
clean <- get_clean_ids(mb_small)
removed <- setdiff(mb_small@test_set, clean)
length(clean)
#> [1] 365
length(removed)
#> [1] 86
length(removed) / (length(removed) + length(clean))
#> [1] 0.1906874
This removed nearly 19% of the 457 test droplets. To see which droplets which were removed, we can plot the droplets colored by their probability.
plot_umi_gene_call(mb_small, alpha = 1)
We can also plot with MT% on one of the axes.
library(ggplot2)
drop_data <- droplet_data(mb_small)
ggplot(drop_data, aes(x = n_genes, y = pct.mt, color = CleanProb)) +
geom_point() +
theme_minimal() +
xlab("Number Genes") + ylab("MT%")
The droplets removed tend to be those with high MT% and low UMI counts/ genes detected.
diem
filtering provided a set of droplets we can use for downstream
analysis. We can extract the droplets passing diem
filtering as well
as the counts.
counts <- raw_counts(mb_small)
counts_clean <- counts[,clean]
dim(counts_clean)
#> [1] 1001 365
We also provide a convenience function for converting to a Seurat object:
seur <- convert_to_seurat(mb_small,
targets = TRUE,
meta = TRUE,
min.features = 100)
dim(seur)
#> [1] 1001 365
head(seur@meta.data)
#> orig.ident nCount_RNA nFeature_RNA total_counts
#> CTGTTTATCTCAAGTG SeuratProject 744 343 744
#> CATTATCGTATCGCAT SeuratProject 633 311 633
#> CGTGTAACACAGCGTC SeuratProject 1663 441 1663
#> GTAACTGTCGAACTGT SeuratProject 305 175 305
#> ACGCAGCGTTACAGAA SeuratProject 718 334 718
#> CAAGAAAAGTTACGGG SeuratProject 646 341 646
#> n_genes pct.mt MALAT1 CleanProb Cluster ClusterProb
#> CTGTTTATCTCAAGTG 343 7.392473 14.919355 1 3 1
#> CATTATCGTATCGCAT 311 12.006319 13.270142 1 2 1
#> CGTGTAACACAGCGTC 441 53.698136 1.683704 1 4 1
#> GTAACTGTCGAACTGT 175 12.786885 9.836066 1 2 1
#> ACGCAGCGTTACAGAA 334 8.217270 14.066852 1 2 1
#> CAAGAAAAGTTACGGG 341 14.860681 10.526316 1 3 1
#> Call
#> CTGTTTATCTCAAGTG Clean
#> CATTATCGTATCGCAT Clean
#> CGTGTAACACAGCGTC Clean
#> GTAACTGTCGAACTGT Clean
#> ACGCAGCGTTACAGAA Clean
#> CAAGAAAAGTTACGGG Clean
Since we only used a subset of the entire raw data that covers the transcriptome for this example, we modified many of the parameters to fit this. We recommend using the entire raw data set with default parameters, as this should be appropriate for most data sets.