The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Single-cell differential expression analysis with FLASHMM

Changjiang Xu, Delaram Pouyabahar, Veronique Voisin, and Gary D. Bader

April 09, 2025

Abstract

FLASHMM is a package for single-cell differential expression analysis using linear mixed-effects models (LMMs). Mixed-effects models have become a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability. However, classic LMM estimation methods face limitations in computational speed and memory usage when applied to large-scale single-cell transcriptomics data. The FLASHMM package provides a fast and scalable approach to address scalability and memory efficiency in LMM fitting. This vignette describes the methods for LMM estimation and hypothesis testing, as well as the R functions for implementing these methods, and demonstrates the use of the package through an example.

1 Methods

1.1 LMM parameter estimation and hypothesis testing

Consider the linear mixed-effects model (LMM) as expressed below (Searle, Casella, and McCulloch 2006) \[\begin{equation}\label{eq:lmm} y = X\beta + Zb + \epsilon, \end{equation}\] where \(y\) is an \(n\times 1\) vector of observed responses, \(X\) is an \(n\times p\) design matrix for fixed effects \(\beta\), \(Z\) is an \(n\times q\) design matrix for random effects \(b\), and \(\epsilon\) is an \(n\times 1\) vector of residual errors. The term of random effects may be a combination of various components, \[ Zb = Z_1 b_1 + \cdots + Z_K b_K, \] where \(Z=[Z_1,\ldots,Z_K]\), \(b=[b^T_1,\ldots,b^T_K]^T\), \(K\) is the number of random-effect components, and \(Z_k\) is an \(n\times q_k\) design matrix for the \(k\)-th random-effect component. The superscript \(T\) denotes a transpose of vector or matrix. The basic assumptions are as follows:

The design matrix \(X\) is of full rank, satisfying conditions of estimability for the parameters;
The random vectors \(b_k\) and \(\epsilon\) are independent and follow a normal distribution, \[b_k \sim N(\mathbf{0}, \sigma^2_k I_{q_k})\quad\text{and}\quad\epsilon \sim N(\mathbf{0}, \sigma^2I_n).\]

Here \(\mathbf{0}\) is a vector or matrix of zero elements, \(I_n\) is an \(n\times n\) identity matrix, and \(\sigma^2_k\) and \(\sigma^2\) are unknown parameters, called variance components.

Hartley and Rao (1967) developed the maximum likelihood (ML) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure which partitions the data into two mutually uncorrelated parts, one being free of the fixed effects used for estimating variance components, called restricted maximum likelihood (REML) method. The REML estimator is unbiased. The ML estimator of variance components is biased in general. With variance components, \(\theta\), estimated, the fixed effects estimated by either ML or REML are given as follows \[ \hat\beta = (X^TV_{\theta}^{-1}X )^{-1}X^TV_{\theta}^{-1}y, \] with covariance matrix \[var(\hat\beta) = (X^TV_{\theta}^{-1}X)^{-1},\] where \(\theta = (\theta_0, \theta_1,\ldots, \theta_K)\), \(\theta_0 = \sigma^2\), \(\theta_k = \sigma^2_k\), and \[V_{\theta} = \theta_0 I_n + \theta_1 Z_1Z_1^T + \ldots + \theta_K Z_KZ_K^T.\]

Estimating the variance components by either ML or REML is a numerical optimization problem. Various iterative methods based on the log likelihood, called gradient methods, have been proposed (Searle, Casella, and McCulloch 2006). The gradient methods are represented by the iteration equation \[\begin{equation}\label{eq:gradient} \theta^{(i+1)} = \theta^{(i)} + \Gamma(\theta^{(i)})\frac{\partial l(\theta^{(i)})}{\partial\theta}, \end{equation}\] where \(\partial l(\theta)/\partial\theta\) is the gradient of the log likelihood function, and \(\Gamma(\theta)\) is a modifier matrix of the gradient direction, which can be specified by Newton–Raphson, Fisher scoring or average information. The Newton-Raphson method uses the inverse Hessian matrix as a modifier, while the Fisher scoring method uses the inverse of the expectation of the Hessian matrix. The average information method uses the inverse of the average Hessian matrix and its expectation. We use the Fisher scoring method. The Fisher scoring method is more stable than others because the Hessian matrix may not be positive definite but its expectation is positive definite.

The hypotheses for testing fixed effects and variance components can be respectively defined as \[ H_{0, i}: \beta_i = 0 ~~\text{versus}~~H_{1,i}: \beta_i\ne 0, \] \[ H_{0, k}: \theta_k \leq 0 ~~\text{versus}~~H_{1,k}: \theta_k > 0, \] where \(\theta_k\), \(k=1, \ldots, K\), represent the parameters of variance components \(\sigma^2_k\) but are allowed to be negative. The lower boundary of the parameters, \(\theta\), can be the negative value such that the variance-covariance matrix, \(V_{\theta}\), remains definable (positive definite). The negative lower boundary exists and can be less than \(- \sigma^2/\lambda_{max}\), where \(\lambda_{max} > 0\) is the largest singular value of \(ZZ^T\). If \(\theta_k > 0\), then \(\sigma_k^2 = \theta_k\) is definable and the mixed model is well-specified. Otherwise, if \(\theta_k \le 0\), the mixed model is miss-specified and the \(k\)-th random-effect component shouldn’t be included.

Allowing the parameters of variance components to take negative value avoids the zero boundary at the null hypothesis for the variance components. Consequently, the asymptotic normality of the maximum likelihood estimation at the null hypothesis holds under regularity conditions, which enables us to use z-statistics or t-statistics for hypothesis testing of the fixed effects and variance components. The t-statistics for fixed effects are given by \[\begin{equation} \label{eq:tcoef} T_i = \frac{\hat\beta_i}{\sqrt{var(\hat\beta_i)}} = \frac{\hat\beta_i}{\sqrt{var(\hat\beta)_{ii}}} ~\sim ~t(n - p). \end{equation}\] The t-statistic for a contrast, a linear combination of the estimated fixed effects, \(c^T\hat\beta\), is \[\begin{equation} \label{eq:tcontrast} T_c = \frac{c^T\hat\beta}{\sqrt{c^Tvar(\hat\beta) c}} \sim t(n-p). \end{equation}\] The test statistics for the parameters of variance components are given by \[\begin{equation} \label{eq:zvarcomp} T_{\theta_k} = \frac{\hat\theta_k}{\sqrt{[I(\hat\theta)^{-1}]_{kk}}} \sim N(0, 1), \end{equation}\] where \(I(\theta)\) is the Fisher information matrix.

1.2 Fast and scalable algorithm

We developed a summary statistics based algorithm for implementing the gradient method \(\eqref{eq:gradient}\). Let \(XX\), \(XY\), \(ZX\), \(ZY\), and \(ZZ\) denote a matrix, respectively, which define the summary statistics that are computed from cell-level data \(X\), \(Y\) and \(Z\) as follows: \[\begin{equation} \label{eq:sdata} \begin{array}{l} XX = X^TX, ~XY = X^TY^T,\\ ZX = Z^TX, ~ZY = Z^TY^T, ~ZZ = Z^TZ,\\ Ynorm = [y_1y_1^T, \ldots, y_my_m^T], \end{array} \end{equation}\] where \(Y = [y_1^T, \ldots, y_m^T]^T\) is a \(m\)-by-\(n\) matrix of gene expression profile with each row \(y_i\) corresponding to the expression of gene \(i\), \(i=1,\ldots,m\), \(m\) is the number of genes and \(n\) is the number of cells (sample size). The summary statistics can be precomputed and stored. The summary statistics based algorithm has a complexity of \(O(m(p^3 + q^3))\), which doesn’t depend on number of cells (sample size \(n\)). In single-cell differential expression (DE) analysis, the numbers of fixed and random effects, \(p\) and \(q\), are relatively small. Therefore, the algorithm is fast and scalable, and requires less computer memory.

1.3 FLASHMM

FLASHMM package provides two functions, lmm and lmmfit, for fitting LMM. The lmm function uses summary statistics as arguments. The lmmfit function is a wrapper function of lmm, which directly uses cell-level data and computes the summary statistics inside the function. The lmmfit function is simple to be operated but it has a limitation of memory use. For extremely large scale data, we can precompute the summary-level data by \(\eqref{eq:sdata}\), and then use lmm function to fit LMM. FLASHMM provides lmmtest function to perform statistical test on the fixed effects and the contrasts of the fixed effects.

In summary, FLASHMM package provides the following main functions.

lmm: fit LMM using summary-level data.
lmmfit: fit LMM using cell-level data.
lmmtest: perform statistical tests on the fixed effects and their contrasts.
contrast.matrix: construct contrast matrix combining the fixed effects for various comparisons.
simuRNAseq: simulate multi-sample multi-cell-type scRNA-seq dataset based on a negative binomial distribution.

##Install FLASHMM from CRAN.
# install.packages("FLASHMM")
##Install FLASHMM from Github.
# devtools::install_github("https://github.com/Baderlab/FLASHMM")

##Load the package.
library(FLASHMM)

2 Example

We use a simulated multi-sample multi-cell-type single-cell RNA-seq (scRNA-seq) dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. In this example, we are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type.

2.1 Simulating the scRNA-seq data

We simulate the multi-sample multi-cell-type scRNA-seq data by simuRNAseq function in FLASHMM package. The simuRNAseq function can generate scRNA-seq data, with or without a reference dataset. The reference dataset contains count data and metadata. The count data is a genes-by-cells matrix. The metadata must be a data frame containing three columns: samples (subjects), cell-types (clusters), and treatments (experimental conditions), and the three columns must be named ‘sam’, ‘cls’, and ‘trt’, respectively.

For simplicity and demonstration purposes, we use a simulated reference dataset to generate the scRNA-seq data. First we simulate the reference dataset.

##Generate a reference dataset by simuRNAseq function.
set.seed(2502)
refdata <- simuRNAseq(nGenes = 1000, nCells = 10000)

##counts
counts <- refdata$counts
##metadata
metadata <- refdata$metadata
head(metadata)
#>       sam cls trt libsize
#> Cell1  A5   9   A    1941
#> Cell2  A9   9   A    1951
#> Cell3  A3   8   A    2005
#> Cell4  A6   3   A    1906
#> Cell5  B8   5   B    1913
#> Cell6 A11   4   A    1985

rm(refdata)

For the simulated reference dataset, we don’t need to change the metadata column names because the columns of samples, clusters, and treatments, are already named ‘sam’, ‘cls’, and ‘trt’, respectively. But, for a real biological reference dataset, we need to change the metadata column names.

Next we use the reference counts and metadata to simulate scRNA-seq data that contains 100 genes and 100,000 cells from 25 samples (subjects) and 10 clusters (cell-types) with 2 treatments. There are 10 differentially expressed genes.

##Generate the scRNA-seq data by simuRNAseq function.
set.seed(2503)
dat <- simuRNAseq(counts, metadata = metadata, nGenes = 100, nCells = 100000, nsam = 25, 
       ncls = 10, ntrt = 2, nDEgenes = 10, minbeta = 0.5, maxbeta = 1, var.randomeffects = 0.1)
str(dat)
#> List of 5
#>  $ ref.mean.dispersion:'data.frame': 1000 obs. of  2 variables:
#>   ..$ mu        : num [1:1000] 3.19 3.56 1.02 4.24 2.11 ...
#>   ..$ dispersion: num [1:1000] 0.814 1.476 0.791 0.829 2.319 ...
#>  $ metadata           :'data.frame': 100000 obs. of  4 variables:
#>   ..$ sam    : chr [1:100000] "B10" "A2" "A10" "A3" ...
#>   ..$ cls    : chr [1:100000] "8" "6" "3" "1" ...
#>   ..$ trt    : chr [1:100000] "B" "A" "A" "A" ...
#>   ..$ libsize: num [1:100000] 2292 2108 2176 1938 2183 ...
#>  $ counts             : num [1:100, 1:100000] 0 2 3 0 1 0 1 0 7 4 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:100] "Gene260" "Gene724" "Gene774" "Gene616" ...
#>   .. ..$ : chr [1:100000] "Cell9224" "Cell3950" "Cell530" "Cell2039" ...
#>  $ DEgenes            :'data.frame': 10 obs. of  3 variables:
#>   ..$ gene   : chr [1:10] "Gene201" "Gene864" "Gene522" "Gene597" ...
#>   ..$ beta   : num [1:10] 0.977 0.819 -0.532 0.896 0.887 ...
#>   ..$ cluster: chr [1:10] "1" "10" "2" "2" ...
#>  $ treatment          : chr "B"
rm(counts, metadata)

The simulated data contains a genes-by-cells matrix of expression counts and a data frame of metadata consisting of 3 columns: samples (sam), cell-types (cls) and treatments (trt).

2.2 Analyzing the simulated scRNA-seq data

We perform differential expression analysis of the simulated dataset using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type. The analyses involve following steps: LMM design, LMM fitting and hypothesis testing, exploring LMM fit and the differetially expressed (DE) genes.

LMM design: construct design matrices for fixed and random effects described as \(\eqref{eq:lmm}\), and compute the gene expression profile. The gene expression is taken as log-transformed count matrix, \[Y = \log(1+\text{counts}),\] in which each row corresponds to the expression profile for a gene. The design matrix for the fixed effects can be created by the function model.matrix, as follows \[ X = model.matrix(\sim 0 + log(library.size) + cell.type + cell.type:treatment), \] where the interaction term \(cell.type:treatment\) represents the treatment effect in a specific cell-type. \(library.size\) is a normalization factor of the scRNA-seq counts, calculated by the total sum of counts across all genes for each cell. We consider the subjects (samples) as random effects that reflect the intra-subject correlation and inter-subject variability. The design matrix for the random effects is given by \[ Z = model.matrix(\sim 0 + subject). \]

LMM fitting and hypothesis testing: We use either lmm or lmmfit function to fit LMMs for all genes. With the cell-level data matrices \(Y\), \(X\) and \(Z\), the LMMs can be fit by \[lmmfit(Y, X, Z, d = d),\] where \(d\) is a vector of the numbers of random-effects in different components. For \(K\) components of random-effects, \(d = (q_1, \ldots, q_K)\), where \(q_k\) is the number of columns of the design matrix \(Z_k\) for the \(k\)-th random-effect component. For a large-scale data, the lmmfit function has a problem of computer memory limit. In this case, it is recommended to pre-compute and store the summary-level data: \(XX\), \(XY\), \(ZX\), \(ZY\), \(ZZ\), and \(Y_{norm}\), defined in \(\eqref{eq:sdata}\), and then use the lmm function to fit the LMMs as follows: \[lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d).\] The summary-level data doesn’t depend on the sample size \(n\). This makes lmm memory-efficient. The default method in the lmm and lmmfit functions is restricted maximum likelihood (REML), that is, method = ‘REML’. If you use the maximum likelihood (ML) method to fit the LMM, set method = ‘ML’ in the lmm and lmmfit functions.

In addition, the lmm and lmmfit functions perform hypothesis testing on the fixed effects (coefficients). We can also use the lmmtest function to conduct statistical tests on both the fixed effects and their contrasts for various comparisons between different levels.

LMM fitting returns a list of estimates of LMM parameters (coefficients and variance components), standard errors of the estimates, covariance matrix of the coefficients (fixed effects), and t-values and p-values for the hypothesis testing on the coefficients, for each gene. The LMM fitting also returns the number of iterations, the first partial derivatives of the log likelihood, and the log-likelihood at the termination of the iterations for each gene.

Exploring LMM fit and DE genes: We check if the LMM fitting is convergent, and perform hypothesis testing on the variance components of random effects. Then we identify DE genes based on hypothesis testing p-values for the coefficients (fixed effects). If the absolute first partial derivatives of log likelihood are all less than the convergence tolerance, the LMM fitting converges, otherwise it doesn’t converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable. DE genes can be identified by adjusting p-values using the false discovery rate (FDR) or family-wise error rate (Bonferroni correction). Additionally, we may exclude genes with small coefficients (effect size or log-fold change).

2.2.1 LMM design

##Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL #Remove the counts.

##Design matrix for fixed effects
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = dat$meta)

##Design matrix for random effects
Z <- model.matrix(~ 0 + as.factor(sam), data = dat$meta)
##Dimension of random effects
d <- ncol(Z)

rm(dat)

2.2.2 LMM fitting

##Option 1: fit LMM by cell-level data.
max.iter <- 100
epsilon <- 1e-5
fit <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)

##Option 2: fit LMM by summary-level data.
##(1) Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)

rm(X, Y, Z) #release the memory.

##(2) Fit LMM.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, 
             max.iter = max.iter, epsilon = epsilon)

identical(fit, fitss) #The two LMM fits are identical.
#> [1] TRUE
rm(fitss)

2.2.3 Hypothesis testing

##Testing coefficients (fixed effects)
test <- lmmtest(fit)
#head(test)

##The testing t-value and p-values are also provided in the LMM fit.
range(test - cbind(t(fit$coef), t(fit$t), t(fit$p))) #identical
#> [1] 0 0
#fit$t[, 1:4]
#fit$p[, 1:4]
fit$coef[, 1:4]
#>                   Gene260    Gene724     Gene774     Gene616
#> log(libsize)  0.823538186  0.9338158  0.84825642  0.47303182
#> cls1         -5.033788454 -5.4098141 -5.22008950 -3.01064722
#> cls10        -5.016216620 -5.4194900 -5.23064461 -2.97351422
#> cls2         -5.023813654 -5.4187129 -5.23977394 -2.96924442
#> cls3         -4.997721940 -5.3974915 -5.23458730 -2.97942173
#> cls4         -5.020303938 -5.4443891 -5.22523214 -2.98406521
#> cls5         -5.034040308 -5.4158073 -5.21912587 -2.96949879
#> cls6         -5.039858094 -5.4040391 -5.22563185 -3.01283214
#> cls7         -5.032770530 -5.4027480 -5.21182988 -2.96310244
#> cls8         -5.027633571 -5.4037420 -5.22567389 -2.99787326
#> cls9         -5.011484228 -5.3934569 -5.21931480 -2.97552877
#> cls1:trtB     0.042932556 -0.2901492 -0.05952280  0.11278682
#> cls10:trtB    0.027706615 -0.2490376 -0.03298227  0.05911209
#> cls2:trtB     0.043924185 -0.2725758 -0.06385726  0.06042628
#> cls3:trtB     0.034137125 -0.2807799 -0.03698257  0.04462700
#> cls4:trtB     0.048231220 -0.2272270 -0.01519083  0.08609152
#> cls5:trtB     0.040408456 -0.2817817 -0.05896284  0.05373212
#> cls6:trtB     0.075970031 -0.2511360 -0.03689410  0.11972056
#> cls7:trtB     0.029742613 -0.2712344 -0.07932668  0.06190330
#> cls8:trtB     0.021522916 -0.2660295 -0.04673630  0.07685644
#> cls9:trtB     0.005106265 -0.3013190 -0.08311758  0.06462308

The coefficients of the interaction term cls\(\ast\):trtB represent the effects of the treatment B versus A in a cell-type (cls\(\ast\)). We can make a comparison of the treatment B versus A within all cell-types by constructing a contrast by summing the treatment effects across cell-types as follows.

contrast <- cbind("BvsA" = numeric(nrow(fit$coef)))
index <- grep(":", rownames(fit$coef))
contrast[index, ] <- 1/length(index)

##Test the contrast.
test <- lmmtest(fit, contrast = contrast)
head(test)
#>           BvsA_coef     BvsA_t     BvsA_p
#> Gene260  0.03696820  0.4499201 0.65276901
#> Gene724 -0.26912699 -2.0665983 0.03877459
#> Gene774 -0.05135733 -0.4228609 0.67239767
#> Gene616  0.07398792  1.4665370 0.14250518
#> Gene581  0.05802063  0.7702806 0.44113530
#> Gene591 -0.06573230 -0.8597515 0.38992809

The contrast can also be constructed by contrast.matrix function as follows.

BvsA <- paste0(paste0("cls", 1:10, ":trtB"), collapse = "+")
BvsA <- paste0("(", BvsA, ")/10")
BvsA
#> [1] "(cls1:trtB+cls2:trtB+cls3:trtB+cls4:trtB+cls5:trtB+cls6:trtB+cls7:trtB+cls8:trtB+cls9:trtB+cls10:trtB)/10"
contrast <- contrast.matrix(contrast = c(BvsA = BvsA), model.matrix.names = rownames(fit$coef))
test <- lmmtest(fit, contrast = contrast)
head(test)
#>           BvsA_coef     BvsA_t     BvsA_p
#> Gene260  0.03696820  0.4499201 0.65276901
#> Gene724 -0.26912699 -2.0665983 0.03877459
#> Gene774 -0.05135733 -0.4228609 0.67239767
#> Gene616  0.07398792  1.4665370 0.14250518
#> Gene581  0.05802063  0.7702806 0.44113530
#> Gene591 -0.06573230 -0.8597515 0.38992809

2.2.4 Exploring LMM fit and the DE genes

##(1) Check the genes for which LMM fitting converges.
gc <- (apply(abs(fit$dlogL), 2, max) < epsilon) 
sum(gc) 
#> [1] 100

##(2) Hypothesis testing for variance components:
##    H0, theta </= 0 vs H1, theta > 0.
z <- fit$theta["var1", ]/fit$se.theta["var1", ]
p <- pnorm(z, lower.tail = FALSE)
sum(p < 0.05)
#> [1] 100

##(3) The DE genes specific to a cell-type
##Coefficients, t-values, and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]

out <- data.frame(
    gene = rep(colnames(ce), nrow(ce)), 
    cluster = rep(rownames(ce), each = ncol(ce)),
    coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))

##Adjust p-values by FDR.
out$FDR <- p.adjust(out$p, method = "fdr")

##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#>      gene    cluster       coef         t            p          FDR
#> 1 Gene456  cls4:trtB  0.6159160 10.144524 3.600651e-24 3.600651e-21
#> 2 Gene201  cls1:trtB  0.7149779  8.625628 6.465361e-18 3.232680e-15
#> 3 Gene864 cls10:trtB  0.8716965  7.992823 1.332776e-15 4.442588e-13
#> 4 Gene597  cls2:trtB  0.7638176  7.737909 1.020050e-14 2.550125e-12
#> 5 Gene982  cls8:trtB  0.9979503  7.288876 3.148478e-13 6.296956e-11
#> 6 Gene499  cls8:trtB -0.6079898 -5.757651 8.554122e-09 1.425687e-06
#> 7 Gene699  cls7:trtB -0.5373567 -4.919868 8.674004e-07 1.239143e-04
#> 8 Gene632  cls4:trtB -0.5355741 -4.810256 1.509563e-06 1.886953e-04
#> 9 Gene378  cls9:trtB -0.6419090 -4.427605 9.538617e-06 1.059846e-03

##The true DE genes
#dat$DEgenes

2.2.5 Using ML method

##Fitting LMM by ML method
#fit <- lmmfit(Y, X, Z, d = d, method = "ML", max.iter = max.iter, epsilon = epsilon)
fit <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, method = "ML",
             max.iter = max.iter, epsilon = epsilon)

##The DE genes specific to a cell-type
##Coefficients, t-values, and p-values
index <- NULL
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]

out <- NULL
out <- data.frame(
    gene = rep(colnames(ce), nrow(ce)), 
    cluster = rep(rownames(ce), each = ncol(ce)),
    coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))

##Adjusting p-values by FDR
out$FDR <- p.adjust(out$p, method = "fdr")

##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#>      gene    cluster       coef         t            p          FDR
#> 1 Gene456  cls4:trtB  0.6159049 10.542755 5.661896e-26 5.661896e-23
#> 2 Gene201  cls1:trtB  0.7149782  8.979640 2.762145e-19 1.381073e-16
#> 3 Gene864 cls10:trtB  0.8717021  8.322877 8.691959e-17 2.897320e-14
#> 4 Gene597  cls2:trtB  0.7638143  8.058659 7.797845e-16 1.949461e-13
#> 5 Gene982  cls8:trtB  0.9979550  7.589598 3.236647e-14 6.473295e-12
#> 6 Gene499  cls8:trtB -0.6079984 -5.993079 2.066072e-09 3.443453e-07
#> 7 Gene699  cls7:trtB -0.5373524 -5.121801 3.031907e-07 4.331296e-05
#> 8 Gene632  cls4:trtB -0.5355670 -5.009191 5.475210e-07 6.844013e-05
#> 9 Gene378  cls9:trtB -0.6419070 -4.612034 3.992413e-06 4.436014e-04

##
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] FLASHMM_1.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     Matrix_1.7-3     
#>  [5] xfun_0.51         lattice_0.22-6    cachem_1.1.0      knitr_1.50       
#>  [9] htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4   cli_3.6.4        
#> [13] grid_4.4.3        sass_0.4.9        jquerylib_0.1.4   compiler_4.4.3   
#> [17] rstudioapi_0.17.1 tools_4.4.3       evaluate_1.0.3    bslib_0.9.0      
#> [21] yaml_2.3.10       rlang_1.1.5       jsonlite_1.9.1    MASS_7.3-65

3 Remarks

Building design matrices for fixed and random effects is a key step in LMM-based differential expression analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If necessary, we can add principal components as fixed effects to further mitigate unknown batch effects.

In differential expression analysis, modeling samples (subjects) as random effects accounts for between-sample variability and within-sample correlation. If samples have different effects on gene expression, we can also model them as fixed effects. However, this can lead to overfitting when the number of samples (subjects) is large. For scRNA-seq data, this may also result in collinearity when the samples are nested within an experimental condition.

References

Hartley, H. O., and J. N. K. Rao. 1967. “Maximum-Likelihood Estimation for the Mixed Analysis of Variance Model.” Biometrika 54 (1/2): 93–108. http://www.jstor.org/stable/2333854.

Patterson, H. D., and R. Thompson. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58 (3): 545–54. https://doi.org/10.2307/2334389.

Searle, Shayle R., George Casella, and Charles E. McCulloch. 2006. Variance Components. New Jersey: John Wiley & Sons, Inc.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.